29 Commits

cyy
95dbbf713e [Distributed] [9/N] Fix clang-tidy warnings in torch/csrc/distributed/rpc (#130109)
Follows #125102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130109
Approved by: https://github.com/ezyang
2024-07-16 04:23:42 +00:00
03a5c6ea99 Remove LazyStreamContext (1 out of 2) (#59298)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59298

After recent changes, LazyStreamContext had in fact become always eager, and was thus equivalent to a plain vector of streams. So it makes more sense now to remove this abstraction and use a more self-descriptive type.

This PR migrates the RequestCallback internals. The next PR migrates the TensorPipe agent.
ghstack-source-id: 130583774

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28789175

fbshipit-source-id: fa581a50f9a6a1e42c2ad8c808a9b099bea7433e
2021-06-04 06:53:46 -07:00
7bcd8f94a5 Avoid re-doing CUDA stream sync in OwnerRRef (#57355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57355

We had started fixing OwnerRRef to make it CUDA-compatible, by properly synchronizing CUDA streams/events where appropriate. However, since we started using CUDAFuture (or, well, ivalue::Future nowadays, after they got merged) this is all done automatically for us, hence we can undo these "fixes" as they're now duplicated.
ghstack-source-id: 130583771

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28118182

fbshipit-source-id: 4b1dd9fe88c23802b1df573941d1b73af48bb67b
2021-06-04 06:52:33 -07:00
a3392cafe0 [reland] Set streams when invoking UDFs (#59210)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59210

Reland of https://github.com/pytorch/pytorch/pull/58427

Running the UDF (be it Python or JIT) is the first step of (most?) RPC calls, which is where the inputs are consumed. The lazy stream context contains the streams used by the inputs, thus it must be made current before any UDF call. I opt to do this as close as possible to the place where the UDF is invoked, to make the relationship as explicit as possible.
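
A minimal sketch of that pattern, with illustrative stand-ins rather than PyTorch's actual stream and guard types:

    #include <functional>
    #include <vector>

    struct Stream { int id = 0; };  // stand-in for a CUDA stream handle

    // Illustrative RAII guard: makes the given streams current for this thread
    // and restores the previous ones on destruction.
    class MultiStreamGuard {
     public:
      explicit MultiStreamGuard(const std::vector<Stream>& streams) { (void)streams; /* set current */ }
      ~MultiStreamGuard() { /* restore previous */ }
    };

    // The guard is scoped as tightly as possible around the UDF call, keeping
    // the link between "these streams are current" and "the inputs are
    // consumed here" explicit.
    void invokeUdf(const std::vector<Stream>& ctxStreams, const std::function<void()>& udf) {
      MultiStreamGuard guard(ctxStreams);
      udf();
    }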
ghstack-source-id: 130202847

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623889

fbshipit-source-id: ed38242f813dac075d162685d52ae89f408932f9
2021-06-02 05:46:02 -07:00
7a8336a5a7 Revert D28474983: Set streams when invoking UDFs
Test Plan: revert-hammer

Differential Revision: D28474983 (ab1e958d20)

Original commit changeset: 358292764d0a

fbshipit-source-id: b4d4c25fe551d83848a9d023c139a9f1acc4c23d
2021-05-21 19:24:14 -07:00
ab1e958d20 Set streams when invoking UDFs (#58427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58427

Running the UDF (be it Python or JIT) is the first step of (most?) RPC calls, which is where the inputs are consumed. The lazy stream context contains the streams used by the inputs, thus it must be made current before any UDF call. I opt to do this as close as possible to the place where the UDF is invoked, to make the relationship as explicit as possible.
ghstack-source-id: 129567052

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474983

fbshipit-source-id: 358292764d0a6832081c34bf6736f0961475ff3d
2021-05-21 13:15:32 -07:00
797dff55b5 Unify fetching RRefs (#57859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57859

Just like with assigning OwnerRRefs, we can also deduplicate the code paths for fetching their values. In fact this logic was duplicated three times, with different ways of post-processing the value (once for JIT, once for Python, once for autograd). Thanks to futures, we can have that logic once, and then connect it to different follow-up steps.
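
A toy sketch of the deduplication, with a hand-rolled future type and illustrative names (the real code builds on ivalue::Future): the value is fetched in one shared code path, and each consumer attaches only its own post-processing step.

    #include <functional>
    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    // Tiny stand-in for a future, just enough to show the chaining.
    struct ToyFuture {
      void markCompleted(std::string v) { value = std::move(v); done = true; for (auto& cb : cbs) cb(value); }
      void addCallback(std::function<void(const std::string&)> cb) { if (done) cb(value); else cbs.push_back(std::move(cb)); }
      std::string value; bool done = false;
      std::vector<std::function<void(const std::string&)>> cbs;
    };

    // The shared code path: fetch the owner rref's value once...
    std::shared_ptr<ToyFuture> fetchOwnerRRefValue() {
      auto fut = std::make_shared<ToyFuture>();
      fut->markCompleted("value");  // in reality, completed when the rref is set
      return fut;
    }

    // ...and each of the three former copies becomes just a follow-up step.
    void fetchForJit()      { fetchOwnerRRefValue()->addCallback([](const std::string&) { /* wrap for JIT */ }); }
    void fetchForPython()   { fetchOwnerRRefValue()->addCallback([](const std::string&) { /* pickle for Python */ }); }
    void fetchForAutograd() { fetchOwnerRRefValue()->addCallback([](const std::string&) { /* attach recv */ }); }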
ghstack-source-id: 129567050

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28286172

fbshipit-source-id: e0742a99cf555755e848057ab6fee5285ff0df2a
2021-05-21 13:15:15 -07:00
cd9dbbd93a Simplify process(Script|Python)(Remote)?Call (#57857)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57857

There used to be a whole lot of methods: `processPythonCall`, `processScriptCall`, `processScriptRemoteCall`, `processPythonRemoteCall`, `processScriptCallOp`, `processBaseScriptRemoteCall` and `processScriptRemoteCallOp`. Thanks to the previous simplification, we can now drop all but the first four, which map nicely 1:1 to the four message types we need to handle. Also their signatures become much simpler: they take an RPC command and return a future.
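
The resulting shape, paraphrased with stub types rather than the exact PyTorch declarations:

    #include <memory>

    struct RpcCommandBase {};  // stand-in for the RPC command hierarchy
    struct JitFuture {};       // stand-in for the future type
    using FuturePtr = std::shared_ptr<JitFuture>;

    // One handler per message type, all with the same shape: take the command,
    // return a future holding the (eventual) response.
    FuturePtr processScriptCall(RpcCommandBase&)       { return std::make_shared<JitFuture>(); }
    FuturePtr processPythonCall(RpcCommandBase&)       { return std::make_shared<JitFuture>(); }
    FuturePtr processScriptRemoteCall(RpcCommandBase&) { return std::make_shared<JitFuture>(); }
    FuturePtr processPythonRemoteCall(RpcCommandBase&) { return std::make_shared<JitFuture>(); }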
ghstack-source-id: 129567070

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253848

fbshipit-source-id: e0e45345c414a96900f9d70ee555359d28908833
2021-05-21 13:15:12 -07:00
e220a1bbcd Make processPythonExecution return a future (#57855)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57855

We already had a helper to run Python functions, which was nice (it de-duplicated some code). However, this helper took a callback which, as I said, isn't as nice as having it return a Future. Hence I change this here.
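
The before/after shape of the helper, sketched with stub types (illustrative, not the exact signatures):

    #include <functional>
    #include <memory>

    struct PythonUdf {};
    struct Result {};
    struct Future {};  // stand-in for the future type
    using FuturePtr = std::shared_ptr<Future>;

    // Before: the caller hands in a continuation.
    void processPythonExecutionOld(const PythonUdf&, std::function<void(Result)> /*onDone*/) {}

    // After: the helper hands back a future, which the caller can chain,
    // inspect, or pass along - none of which a bare callback allows.
    FuturePtr processPythonExecutionNew(const PythonUdf&) { return std::make_shared<Future>(); }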
ghstack-source-id: 129567054

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253846

fbshipit-source-id: d854d4aa163798fb015cd6d46932f9ff1d18262e
2021-05-21 13:15:09 -07:00
20d02cb7dd Remove getScriptRemoteCallType (#57854)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57854

Because OwnerRRefs used to be created before their value was computed, we had to figure out their type ahead of time. After the previous diff, we inverted the order of operations, and we can now first compute the result and then create the OwnerRRef. Which means we can just inspect the value to get its type. Much simpler, and much less likely to get it wrong.
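
In ivalue terms, the "inspect the value" step is tiny; a sketch, assuming the IValue::type() accessor:

    #include <ATen/core/ivalue.h>

    // With the result computed first, its type is read straight off the value,
    // instead of being inferred from the call before the value exists.
    c10::TypePtr typeOfResult(const c10::IValue& result) {
      return result.type();
    }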
ghstack-source-id: 129567060

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253843

fbshipit-source-id: f13c9b294f477ae66fcbdbc85c642fdc69b2740f
2021-05-21 13:15:07 -07:00
60fc37393e Simplify OwnerRRef completion (#57853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57853

A bunch of methods received an OwnerRRef to "fill in". I think it will be more flexible to do it the other way around, and have these methods return a value (wrapped in a Future), which can then be "connected" to an OwnerRRef, but which can also potentially be consumed in different ways.
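
The "connect" step amounts to a single callback, sketched here with toy types (illustrative names; the real code uses the future's callback mechanism):

    #include <functional>
    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    using Value = std::string;

    struct ToyFuture {
      void markCompleted(Value v) { value = std::move(v); done = true; for (auto& cb : cbs) cb(value); }
      void addCallback(std::function<void(const Value&)> cb) { if (done) cb(value); else cbs.push_back(std::move(cb)); }
      Value value; bool done = false;
      std::vector<std::function<void(const Value&)>> cbs;
    };

    struct OwnerRRef { void setValue(Value v) { value = std::move(v); } Value value; };

    // The handler returns a future; connecting it to an OwnerRRef is one
    // follow-up step, and the same future could just as well feed a response
    // message instead.
    void connect(const std::shared_ptr<ToyFuture>& fut, const std::shared_ptr<OwnerRRef>& rref) {
      fut->addCallback([rref](const Value& v) { rref->setValue(v); });
    }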
ghstack-source-id: 129567059

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253844

fbshipit-source-id: 7e3772312dbacfc75a6ac0f62189fc9828001fc7
2021-05-21 13:15:05 -07:00
ea2f5bbb4c Unify async execution for JIT functions (#57852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57852

Another great example of the benefits of Futures. Thanks to the "right abstraction" (i.e., the `thenAsync` method), adding support for async execution becomes trivial, and the code becomes much simpler than it used to be.
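
A toy model of what `thenAsync` buys: the continuation itself returns a future, and the combinator flattens the nesting so callers see a single future (hand-rolled types, illustrative only):

    #include <functional>
    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    struct ToyFuture {
      void markCompleted(std::string v) { value = std::move(v); done = true; for (auto& cb : cbs) cb(value); }
      void addCallback(std::function<void(const std::string&)> cb) { if (done) cb(value); else cbs.push_back(std::move(cb)); }
      std::string value; bool done = false;
      std::vector<std::function<void(const std::string&)>> cbs;
    };

    using FuturePtr = std::shared_ptr<ToyFuture>;

    // thenAsync: run a continuation that itself returns a future, and forward
    // its result into one flat output future. (In real code, the inner future
    // is kept alive by whoever will complete it.)
    FuturePtr thenAsync(const FuturePtr& in,
                        const std::function<FuturePtr(const std::string&)>& fn) {
      auto out = std::make_shared<ToyFuture>();
      in->addCallback([out, fn](const std::string& v) {
        fn(v)->addCallback([out](const std::string& inner) { out->markCompleted(inner); });
      });
      return out;
    }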
ghstack-source-id: 129567063

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253842

fbshipit-source-id: b660151ca300f3d6078db0f3e380c80a4d8f5190
2021-05-21 13:15:04 -07:00
bfdc279134 Unify invoking JIT functions (#57851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57851

The same as the previous PR, but for JIT functions.
ghstack-source-id: 129567069

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253841

fbshipit-source-id: 2b8affde16c106f5c76efa8be49af070213708bf
2021-05-21 13:15:02 -07:00
4ac18f6710 Centralize setting messageId in RequestCallback (#57848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57848

This PR looks large, but all it does is add a dozen lines and remove a lot of other ones.

One first advantage of using Futures is that we can easily chain some "post-processing" onto them. Until now we needed to pass the message ID around everywhere because it was set separately by each method. Instead, we can simply add a follow-up step to the final future which sets this ID, and remove all the former logic.
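
The dozen added lines boil down to one follow-up step on the final response future; a toy sketch with illustrative names:

    #include <cstdint>
    #include <functional>
    #include <memory>
    #include <utility>
    #include <vector>

    struct Message { int64_t id = -1; };

    struct ToyFuture {
      void markCompleted(Message m) { msg = std::move(m); done = true; for (auto& cb : cbs) cb(msg); }
      void addCallback(std::function<void(Message&)> cb) { if (done) cb(msg); else cbs.push_back(std::move(cb)); }
      Message msg; bool done = false;
      std::vector<std::function<void(Message&)>> cbs;
    };

    // Instead of threading messageId through every process* method, stamp it
    // once on the final response future.
    void stampMessageId(const std::shared_ptr<ToyFuture>& responseFuture, int64_t messageId) {
      responseFuture->addCallback([messageId](Message& m) { m.id = messageId; });
    }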
ghstack-source-id: 129567065

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28224477

fbshipit-source-id: 7b6e21646262abe5bbbf268897e2d792e5accc27
2021-05-21 13:14:57 -07:00
f6844eafce Make RequestCallback collect Futures from methods, rather than providing them (#57847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57847

This is the first PR of a stack that aims to simplify RequestCallback, and I want to start by explaining my intentions.

With the introduction of CUDA support in the TensorPipe agent, we found out that other layers higher up in the stack (RRefs, dist autograd, ...) were not "ready" to support CUDA. One cause of this was that in PyTorch most CUDA state is thread-local, and the RequestCallback class (and others) might execute different steps of an operation on multiple threads. The solution to this problem is to preserve or recreate the CUDA state when switching between threads (propagating streams, or recording events and then waiting on them). If we were to do this manually everywhere, it would be tedious, error-prone, and hard to maintain.

In fact, we already have a primitive that can do this for us: CUDAFuture (now known as just Future). If whenever we switch threads we were to pack the values in a CUDAFuture and then unpack them on the other threads, all CUDA stuff would be taken care of for us.

If our code leveraged CUDAFuture at its core, this would become the "idiomatic" thing to do, the natural behavior. Future changes would thus also be inclined to follow this pattern, hence automatically doing the right thing.

I also think that, even without these concerns about CUDA, there are benefits to use Futures more extensively. Currently RequestCallback uses a mix of Futures and callbacks. These are two tools for the same job, and thus mixing them creates confusion. Futures are more powerful than simple callbacks (they can be passed around, inspected, chained, waited on, ...) and thus should be preferred. They also lead to more readable code, as each step can be defined and chained in logical order, whereas callbacks must either be nested, or defined inline, or defined before and used later (thus making the code out-of-order).
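
The readability point in miniature, with a hand-rolled (eagerly completing) toy future and made-up step names - chained futures read top to bottom, where callbacks would nest:

    #include <functional>
    #include <memory>
    #include <string>
    #include <utility>

    using Msg = std::string;

    // Tiny chainable future, enough to contrast the two styles.
    struct ToyFuture {
      static std::shared_ptr<ToyFuture> completed(Msg v) {
        auto f = std::make_shared<ToyFuture>(); f->value = std::move(v); return f;
      }
      std::shared_ptr<ToyFuture> then(const std::function<Msg(const Msg&)>& fn) {
        return completed(fn(value));  // eager for brevity; real futures defer
      }
      Msg value;
    };

    Msg runUdf(const Msg& rpc)  { return rpc + "->result"; }
    Msg serialize(const Msg& r) { return r + "->message"; }
    Msg setId(const Msg& m)     { return m + "#42"; }

    int main() {
      // Each step chains in logical order, instead of nesting callbacks inside
      // callbacks (or defining them out of order, before they are used).
      auto response = ToyFuture::completed(runUdf("rpc"))->then(serialize)->then(setId);
      return response->value.empty() ? 1 : 0;
    }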

In short: I intend to rework RequestCallback to use Futures much more. I believe it will greatly simplify the code, help readability, and prove invaluable to support CUDA.

 ---

Until now, we had the final result future being created at the very beginning, and then passed around everywhere, so that the various methods could "fill in" its value. I think it's much lighter to instead allow each method to create or obtain its futures however it wants, and have it return them. I.e., have these futures "bubble up" from the lower layers, rather than being "pushed down" from the upper ones.

In this initial PR, I move the place where we create this "final result future", but I still keep it around. I will then, in later PRs, slowly migrate each method so that it returns a future, and in the end I will avoid creating the final result future.
ghstack-source-id: 129567062

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28224478

fbshipit-source-id: dbdc66b6458645a4a164c02f00d8618fa64da028
2021-05-21 13:14:55 -07:00
45012da298 Migrate from shared_ptr to intrusive_ptr for Future (#57636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57636

The "preferred" pointer holder for Future is `intrusive_ptr` (e.g., `then` returns an `intrusive_ptr`, `toFuture` returns `intrusive_ptr`, ...). However in RPC we often wrap it with `shared_ptr`. This probably dates back to when we had a separate Future type, before the merge.

At the boundary between RPC and JIT this difference becomes a bit annoying, as conversions between the pointer types are needed. I think it would be simpler and more consistent to always use `intrusive_ptr`, also in RPC.

This PR was produced mainly by find-and-replace, plus a couple of manual fixes.
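
Roughly what the migrated code holds afterwards; a sketch against the c10 API as I understand it, not lifted from the PR:

    #include <ATen/core/ivalue.h>

    int main() {
      // One pointer type end to end: no shared_ptr <-> intrusive_ptr
      // conversions at the RPC/JIT boundary.
      c10::intrusive_ptr<c10::ivalue::Future> fut =
          c10::make_intrusive<c10::ivalue::Future>(c10::AnyType::get());
      fut->markCompleted(c10::IValue(42));
      return fut->value().toInt() == 42 ? 0 : 1;
    }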
ghstack-source-id: 128296581

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D28187972

fbshipit-source-id: d4609273a1550b4921910e85d2198e02f31c905b
2021-05-07 03:59:20 -07:00
9f89b53d7d Synchronize RRef.to_here() CUDA Streams properly (#54932)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54932

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D27684022

Pulled By: pbelevich

fbshipit-source-id: 2bae51ab6649258d0219ca4e9dbbf45ac6a76c28
2021-04-13 23:24:38 -07:00
1f795e1a9b Remove FutureMessage from RPC request callback logic (#50026)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50026

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25753588

Pulled By: mrshenli

fbshipit-source-id: a6fcda7830901dd812fbf0489b001e6bd9673780
2021-01-07 19:50:47 -08:00
55d5b27343 Refactor request_callback_no_python.cpp processRpc function (#47816)
Summary:
Addresses step 1 of https://github.com/pytorch/pytorch/issues/46564

Took processing logic for each case in request_callback_no_python.cpp and put it in a dedicated function.

cc: izdeby

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47816

Reviewed By: izdeby

Differential Revision: D25090207

Pulled By: H-Huang

fbshipit-source-id: bfa38e38db02e077d859125739aaede90ba492e7
2020-11-20 07:29:51 -08:00
781e0ed835 Support RRef.backward() for Owner RRefs. (#46641)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46641

Second part of https://github.com/pytorch/pytorch/pull/46568; this allows
RRef.backward() to work for owner RRefs.
ghstack-source-id: 115440252

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D24441300

fbshipit-source-id: 64af28e6b6ae47ea27e611a148f217bc344a4c5b
2020-11-07 21:25:32 -08:00
96d48178c8 Make pipeWrite and pipeRead noexcept (#45783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45783

After the previous device maps commits, `pipeWrite` might throw. In
this case, if we increment active calls before `pipeWrite` on the
caller, that active call won't be decremented properly when `pipeWrite`
throws. As a result, `shutdown` can silently time out. I noticed this
because some tests took more than 60s to finish.

This commit extracts the tensor device checking logic out of pipeWrite
and makes sure the error is thrown before the active call count is
incremented.
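
The fix, reduced to its ordering; illustrative stand-ins, and checkTensorDevices is a hypothetical name for the extracted validation:

    #include <atomic>
    #include <stdexcept>
    #include <string>

    struct Message { std::string payload; };

    std::atomic<int> activeCalls{0};

    // Hypothetical stand-in for the device checking extracted out of pipeWrite.
    void checkTensorDevices(const Message& m) {
      if (m.payload.empty()) throw std::runtime_error("unsupported device");
    }

    void pipeWrite(const Message&) noexcept { /* serialize onto the pipe */ }

    void sendRequest(const Message& m) {
      checkTensorDevices(m);  // may throw - and runs *before* the counter bump,
      ++activeCalls;          // so a failure can no longer strand an active call
      pipeWrite(m);           // what remains of pipeWrite cannot throw
    }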

Differential Revision: D24094803

Test Plan: Imported from OSS

Reviewed By: mruberry

Pulled By: mrshenli

fbshipit-source-id: d30316bb23d2afd3ba4f5540c3bd94a2ac10969b
2020-10-08 18:53:51 -07:00
19dda7c68a Fallback to CPU when remote end does not have CUDA for profiling (#44967)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44967

When enabling the profiler on the server, the server may be a different
machine that does not have CUDA while the caller does. Previously we would
crash in this case; now we fall back to CPU profiling and log a warning.
ghstack-source-id: 112977906

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D23790729

fbshipit-source-id: dc6eba172b7e666842d54553f52a6b9d5f0a5362
2020-09-26 13:12:55 -07:00
ff6e560301 Add C++ end to end test for RPC and distributed autograd. (#36893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36893

Adding an end-to-end test for running a simple training loop in C++
for the distributed RPC framework.

The goal of this change is to enable LeakSanitizer and potentially catch memory
leaks in the future. Enabling LSAN with Python multiprocessing is tricky and we
haven't found a solution for this. As a result, adding a C++ test that triggers
most of the critical codepaths would be good for now.

As an example, this unit test would have caught the memory leak fixed by:
https://github.com/pytorch/pytorch/pull/31030
ghstack-source-id: 107781167

Test Plan:
1) Verify the test catches memory leaks.
2) waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D21112208

fbshipit-source-id: 4eb2a6b409253108f6b6e14352e593d250c7a64d
2020-07-15 12:59:19 -07:00
422e348619 Don't run user function until all UserRRefs in the args are confirmed (#34497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34497

Use a thread_local table to intercept UserRRefs created during user-function
argument deserialization, and then wait for confirmations of those UserRRefs
before launching the given user function.
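
The interception pattern in miniature (toy types and illustrative names; the real code records pending UserRRefs and waits on their owner confirmations):

    #include <memory>
    #include <utility>
    #include <vector>

    struct UserRRef { bool confirmedByOwner = false; };

    // Thread-local table: argument deserialization runs on one thread, so any
    // UserRRef it creates can be recorded here without locking.
    thread_local std::vector<std::shared_ptr<UserRRef>> g_rrefsInArgs;

    std::shared_ptr<UserRRef> deserializeUserRRef() {
      auto rref = std::make_shared<UserRRef>();
      g_rrefsInArgs.push_back(rref);  // intercept at creation time
      return rref;
    }

    // After deserialization, drain the table; the caller then waits for each
    // rref to be confirmed by its owner before launching the user function.
    std::vector<std::shared_ptr<UserRRef>> takeInterceptedRRefs() {
      return std::exchange(g_rrefsInArgs, {});
    }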

Differential Revision: D20347464

Test Plan: Imported from OSS

Pulled By: mrshenli

fbshipit-source-id: 087484a2d2f03fbfb156752ab25653f39b412a07
2020-03-16 18:30:06 -07:00
e7e6d56b77 Allow async work in rpc RequestCallback processing. (#30637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30637

The RequestCallback API currently forces work to always be synchronous, which,
as we scale, means we're going to need to throw a large number of (mostly
blocked) threads at the RPC problem. For some activities, like dependent
autograd RPCs, there's no real reason to block in these threads.

In this change, the RequestCallback API is updated to return a
shared_ptr<FutureMessage> rather than a Message:

   std::shared_ptr<FutureMessage> operator()(Message& request) const;

With a futures-style API, RPC ops that wish to be async can then be async,
while short-lived blocking functions (or Python UDFs) can simply block.

In this change, we keep all of the current ops synchronous (i.e., we block
and then return a completed FutureMessage). We also update the rpc_agents in
a manner compatible with this sort of parallelism.

Here, we only want to incur overhead when we actually use the async behavior.
Some modest extra cost seems unavoidable (e.g., the allocation for the
std::make_shared<>), but we can trivially detect the synchronous/completed
case in the rpc_agent and avoid the extra thread switches, etc., in that case.
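
A sketch of that completed-future fast path with stub types (illustrative names, not the actual agent code):

    #include <memory>
    #include <string>
    #include <utility>

    struct Message { std::string payload; };

    // Stand-in for FutureMessage, just enough to show the completed check.
    struct FutureMessage {
      bool completed() const { return done; }
      void markCompleted(Message m) { msg = std::move(m); done = true; }
      Message msg; bool done = false;
    };

    // A synchronous op still blocks, then returns an already-completed future...
    std::shared_ptr<FutureMessage> processSyncOp(const Message& request) {
      auto fut = std::make_shared<FutureMessage>();
      fut->markCompleted(Message{"echo:" + request.payload});
      return fut;
    }

    // ...so the agent can detect that case and reply inline, paying the extra
    // thread switch only for genuinely asynchronous work.
    void handleRequest(const Message& request) {
      auto fut = processSyncOp(request);
      if (fut->completed()) {
        /* send fut->msg directly on this thread */
      } else {
        /* attach a callback to send once the future completes */
      }
    }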
ghstack-source-id: 95287026

Test Plan:
- Basic: buck test mode/dev-nosan caffe2/test/...
  - Additional testcase in ThriftRpcAgentTest for deferred work.

Differential Revision: D18774322

fbshipit-source-id: cf49922a71707cfb1726de16f93af23b160385d8
2019-12-10 16:11:05 -08:00
56eb4f7daa Add autograd hook for python rpc call (#28312)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28312

1. Currently, if the autograd context is valid, an RPC is sent with autograd metadata even when the tensors do not require grads and no grad functions are attached. This is not ideal. This diff makes changes to ensure that an RPC carries autograd metadata only if the autograd context is valid and the tensors require grads.

2. Meanwhile, create a utility to attach autograd info and functions as needed.

3. Add autograd send/recv functions for Python RPC calls.

4. Make changes to support nested Python RPC calls.

5. Disallow nested dist autograd contexts (this was landed in #27022).
ghstack-source-id: 92240367

Test Plan: unit tests

Differential Revision: D18017554

fbshipit-source-id: dbe79a5171063901a78a9b3322b9b31c159d098d
2019-10-19 07:38:14 -07:00
af88537483 Back out "Add autograd hook for python rpc call"
Summary: Original commit changeset: 070324c57312

Test Plan: revert

Reviewed By: pritamdamania87

Differential Revision: D18011308

fbshipit-source-id: 4185e4c6f51c1d11b23b8ab44e6e958b09f27c53
2019-10-18 11:53:39 -07:00
56c4215fcc Add autograd hook for python rpc call (#27576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27576

1. Currently, if the autograd context is valid, an RPC is sent with autograd metadata even when the tensors do not require grads and no grad functions are attached. This is not ideal. This diff makes changes to ensure that an RPC carries autograd metadata only if the autograd context is valid and the tensors require grads.

2. Meanwhile, create a utility to attach autograd info and functions as needed.

3. Add autograd send/recv functions for Python RPC calls.

4. Make changes to support nested Python RPC calls.

5. Disallow nested dist autograd contexts (this was landed in #27022).
ghstack-source-id: 92154535

Test Plan: unit tests

Differential Revision: D17819153

fbshipit-source-id: 37d8a85855bf591f2f2da48d475a06e870a30ea1
2019-10-18 10:11:45 -07:00
fe4170bda8 Add send and recv backward functions for builtin operators RPC. (#25527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527

Master GH issue: https://github.com/pytorch/pytorch/issues/23110.

This change builds upon https://github.com/pytorch/pytorch/pull/24876 and
provides all the autograd hooks needed for a forward pass with distributed rpc
for builtin operators. This change does not address distributed RPC for Python
UDFs; that will be addressed in follow-up PRs.

Summary of changes:
1. Attach send autograd functions when a request is sent from the client and
response is sent from the server.
2. Attach receive autograd functions when a request is received on the server
and a response is received on the client.
3. Generate a globally unique autograd_message_id for each send/recv autograd
function pair to uniquely identify them (one possible id-minting scheme is
sketched below).
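
A hedged sketch of one such scheme, with illustrative constants: pack the worker id into the high bits over a per-worker atomic counter, so ids never collide across workers.

    #include <atomic>
    #include <cstdint>

    constexpr int kLocalIdBits = 48;  // illustrative split, not the real constant
    constexpr int64_t kLocalIdMask = (int64_t(1) << kLocalIdBits) - 1;

    int64_t nextAutogradMessageId(int16_t workerId) {
      // High bits identify the worker; low bits auto-increment locally.
      static std::atomic<int64_t> localId{0};
      return (static_cast<int64_t>(workerId) << kLocalIdBits) | (localId++ & kLocalIdMask);
    }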
ghstack-source-id: 91240466

Test Plan: unit tests.

Differential Revision: D17148077

fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233
2019-10-03 01:18:46 -07:00