Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59298
After recent changes, LazyStreamContext had in fact become fully eager, making it equivalent to a plain vector of streams. It therefore makes more sense now to remove this abstraction and use a more self-descriptive type.
This PR migrates the RequestCallback internals. The next PR migrates the TensorPipe agent.
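As an illustrative sketch (with toy stand-in types, not the real c10 ones), the now-eager context amounts to nothing more than a vector, which is why the abstraction can be dropped:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for a device stream handle (not the real c10::Stream).
struct Stream { int device; std::uint64_t id; };

// What LazyStreamContext had effectively become after the eager change:
// a thin wrapper over an eagerly-populated list of streams.
struct EagerStreamContext {
  std::vector<Stream> streams;
};

// Hence call sites can simply take the self-descriptive type directly.
inline std::size_t numStreams(const std::vector<Stream>& streams) {
  return streams.size();
}
```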
ghstack-source-id: 130583774
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28789175
fbshipit-source-id: fa581a50f9a6a1e42c2ad8c808a9b099bea7433e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57355
We had started fixing OwnerRRef to make it CUDA-compatible, by properly synchronizing CUDA streams/events where appropriate. However, since we started using CUDAFuture (or rather ivalue::Future nowadays, after the two got merged) this is all done automatically for us, hence we can undo these "fixes" as they are now redundant.
ghstack-source-id: 130583771
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28118182
fbshipit-source-id: 4b1dd9fe88c23802b1df573941d1b73af48bb67b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59210
Reland of https://github.com/pytorch/pytorch/pull/58427
Running the UDF (be it Python or JIT) is the first step of (most?) RPC calls, which is where the inputs are consumed. The lazy stream context contains the streams used by the inputs, thus it must be made current before any UDF call. I opt to do this as "close" as possible to the place the UDF is invoked, to make the relationship as explicit as possible.
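The "make the streams current right around the UDF call" pattern can be sketched with a toy RAII guard (names and types here are illustrative, not the real c10 stream guards):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy stand-ins for streams and a thread-local "current stream" registry.
struct Stream { int id; };
thread_local std::vector<Stream> currentStreams;

// RAII guard: makes the given streams current for the duration of a scope.
struct MultiStreamGuard {
  std::vector<Stream> saved;
  explicit MultiStreamGuard(std::vector<Stream> streams)
      : saved(currentStreams) {
    currentStreams = std::move(streams);
  }
  ~MultiStreamGuard() { currentStreams = saved; }
};

// The pattern from the PR: set the context's streams as current right
// where the user-defined function is invoked, not higher up the stack.
template <typename Udf>
auto runWithStreams(std::vector<Stream> streams, Udf&& udf) {
  MultiStreamGuard guard(std::move(streams));
  return udf();  // the inputs' streams are current exactly while the UDF runs
}
```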
ghstack-source-id: 130202847
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28623889
fbshipit-source-id: ed38242f813dac075d162685d52ae89f408932f9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58427
Running the UDF (be it Python or JIT) is the first step of (most?) RPC calls, which is where the inputs are consumed. The lazy stream context contains the streams used by the inputs, thus it must be made current before any UDF call. I opt to do this as "close" as possible to the place the UDF is invoked, to make the relationship as explicit as possible.
ghstack-source-id: 129567052
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28474983
fbshipit-source-id: 358292764d0a6832081c34bf6736f0961475ff3d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57859
Just like with assigning OwnerRRefs, we can also deduplicate the code paths for fetching their values. In fact this logic was duplicated three times, with different ways of post-processing the value (once for JIT, once for Python, once for autograd). Thanks to futures, we can have that logic once, and then connect it to different follow-up steps.
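The shape of the deduplication can be sketched with a toy, already-completed future type (the real code uses ivalue::Future): one shared fetch, with per-consumer post-processing chained on.

```cpp
#include <cassert>
#include <string>

// Minimal value-carrying "future" stand-in (trivially completed), just to
// show the shape of the refactor; not the real ivalue::Future API.
template <typename T>
struct Ready {
  T value;
  template <typename F>
  auto then(F&& f) -> Ready<decltype(f(value))> {
    return {f(value)};
  }
};

// One shared code path fetches the value...
inline Ready<int> fetchOwnerRRefValue() { return Ready<int>{5}; }

// ...and each caller chains its own post-processing instead of duplicating
// the fetch (JIT / Python / autograd in the real code; these are toys).
inline std::string fetchForPython() {
  return fetchOwnerRRefValue().then([](int v) { return std::to_string(v); }).value;
}
inline int fetchForJit() {
  return fetchOwnerRRefValue().then([](int v) { return v * 2; }).value;
}
```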
ghstack-source-id: 129567050
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28286172
fbshipit-source-id: e0742a99cf555755e848057ab6fee5285ff0df2a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57857
There used to be a whole lot of methods: `processPythonCall`, `processScriptCall`, `processScriptRemoteCall`, `processPythonRemoteCall`, `processScriptCallOp`, `processBaseScriptRemoteCall` and `processScriptRemoteCallOp`. Thanks to the previous simplification, we can now drop all but the first four, which map nicely 1:1 to the four message types we need to handle. Also their signatures become much simpler: they take an RPC command and return a future.
ghstack-source-id: 129567070
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28253848
fbshipit-source-id: e0e45345c414a96900f9d70ee555359d28908833
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57855
We already had a helper to run Python functions, which was nice (it de-duplicated some code). However, this helper took a callback which, as mentioned earlier in the stack, isn't as nice as returning a Future. Hence here I change this.
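A toy before/after sketch of the helper's shape (names are hypothetical, and the "future" here is a trivially-completed stand-in):

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical result type for the Python-call helper.
struct Result { std::string payload; };

// Before: the caller supplies a callback that the helper invokes on completion.
inline void runPythonCallWithCallback(std::function<void(Result)> cb) {
  cb(Result{"ok"});
}

// After: the helper returns a future-like value (here trivially completed)
// that the caller can chain, inspect, or pass around.
template <typename T>
struct Ready { T value; };

inline Ready<Result> runPythonCall() { return Ready<Result>{Result{"ok"}}; }
```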
ghstack-source-id: 129567054
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28253846
fbshipit-source-id: d854d4aa163798fb015cd6d46932f9ff1d18262e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57854
Because OwnerRRefs used to be created before their value was computed, we had to figure out their type ahead of time. After the previous diff, we inverted the order of operations, and we can now first compute the result and then create the OwnerRRef. Which means we can just inspect the value to get its type. Much simpler, and much less likely to get it wrong.
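A minimal sketch of the idea, with stand-in types: compute the value first, then derive its type from the value itself rather than predicting it.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

// Toy stand-in for a dynamically-typed RPC result value.
using Value = std::variant<int, double, std::string>;

// Once the result exists, its type can simply be read off the value.
inline std::string typeOf(const Value& v) {
  switch (v.index()) {
    case 0: return "int";
    case 1: return "double";
    default: return "str";
  }
}

// Creating the (hypothetical) owner record after computing the value means
// the type is derived, never guessed ahead of time.
struct OwnerRecord { Value value; std::string type; };
inline OwnerRecord makeOwnerRecord(Value v) {
  std::string t = typeOf(v);
  return OwnerRecord{std::move(v), std::move(t)};
}
```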
ghstack-source-id: 129567060
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28253843
fbshipit-source-id: f13c9b294f477ae66fcbdbc85c642fdc69b2740f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57853
A bunch of methods received an OwnerRRef to "fill in". I think it will be more flexible to do it the other way around, and have these methods return a value (wrapped in a Future), which can then be "connected" to an OwnerRRef, but which can also potentially be consumed in different ways.
ghstack-source-id: 129567059
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28253844
fbshipit-source-id: 7e3772312dbacfc75a6ac0f62189fc9828001fc7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57852
Another great example of the benefits of Futures. Thanks to the "right abstraction" (i.e., the `thenAsync` method), adding support for async execution becomes trivial, and the code becomes much simpler than it used to be.
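The role `thenAsync` plays can be sketched with a toy future type (not the real API): the continuation itself returns a future, so an async user function slots straight in.

```cpp
#include <cassert>

// Toy, trivially-completed future, just to show the shape of the two
// chaining primitives; not the real ivalue::Future interface.
template <typename T>
struct Fut {
  T value;
  // then: the continuation returns a plain value.
  template <typename F>
  auto then(F&& f) -> Fut<decltype(f(value))> { return {f(value)}; }
  // thenAsync: the continuation returns another future, which is flattened.
  template <typename F>
  auto thenAsync(F&& f) -> decltype(f(value)) { return f(value); }
};

// A user function that is itself asynchronous (returns a future).
inline Fut<int> asyncUserFunction(int x) { return Fut<int>{x + 1}; }

// Supporting async execution becomes a one-liner: chain the async UDF.
inline int runAsync(int x) {
  return Fut<int>{x}.thenAsync(asyncUserFunction).value;
}
```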
ghstack-source-id: 129567063
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28253842
fbshipit-source-id: b660151ca300f3d6078db0f3e380c80a4d8f5190
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57851
The same as the previous PR, but for JIT functions.
ghstack-source-id: 129567069
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28253841
fbshipit-source-id: 2b8affde16c106f5c76efa8be49af070213708bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57848
This PR looks large, but all it does is add a dozen lines and remove a lot of other ones.
One first advantage of using Futures is that we can easily chain some "post-processing" to them. Until now, we needed to pass the message ID around everywhere because it was set separately by each method. Instead, we can simply add a follow-up step to the final future which sets this ID, and remove all the former logic.
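A minimal sketch of the pattern, with toy types: handlers stop threading the id through, and one chained step stamps it onto the final response.

```cpp
#include <cassert>
#include <cstdint>

// Toy response message and trivially-completed future, for illustration only.
struct Message { std::uint64_t id = 0; int payload = 0; };

template <typename T>
struct Fut {
  T value;
  template <typename F>
  auto then(F&& f) -> Fut<decltype(f(value))> { return {f(value)}; }
};

// Handlers no longer need to know the message id...
inline Fut<Message> handleRequest(int payload) {
  return Fut<Message>{Message{0, payload * 2}};
}

// ...because one follow-up step on the final future sets it.
inline Message process(int payload, std::uint64_t requestId) {
  return handleRequest(payload)
      .then([requestId](Message m) { m.id = requestId; return m; })
      .value;
}
```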
ghstack-source-id: 129567065
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28224477
fbshipit-source-id: 7b6e21646262abe5bbbf268897e2d792e5accc27
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57847
This is the first PR of a stack that aims to simplify RequestCallback, and I want to start by explaining my intentions.
With the introduction of CUDA support in the TensorPipe agent, we found out that other layers higher up in the stack (RRefs, dist autograd, ...) were not "ready" to support CUDA. One cause of this was that in PyTorch most CUDA state is thread-local, and the RequestCallback class (and others) might execute different steps of an operation on multiple threads. The solution to this problem is to preserve or recreate the CUDA state when switching between threads (propagating streams, or recording events and then waiting on them). If we were to manually do this everywhere it would be tedious, error-prone, and hard to maintain.
In fact, we already have a primitive that can do this for us: CUDAFuture (now known as just Future). If whenever we switch threads we were to pack the values in a CUDAFuture and then unpack them on the other threads, all CUDA stuff would be taken care of for us.
If our code leveraged CUDAFuture at its core, this would become the "idiomatic" thing to do, the natural behavior. Future changes would thus also be inclined to follow this pattern, hence automatically doing the right thing.
I also think that, even without these concerns about CUDA, there are benefits to using Futures more extensively. Currently RequestCallback uses a mix of Futures and callbacks. These are two tools for the same job, and mixing them creates confusion. Futures are more powerful than simple callbacks (they can be passed around, inspected, chained, waited on, ...) and thus should be preferred. They also lead to more readable code, as each step can be defined and chained in logical order, whereas callbacks must either be nested, or defined inline, or defined before and used later (thus making the code out-of-order).
In short: I intend to rework RequestCallback to use Futures much more. I believe it will greatly simplify the code, help readability, and prove invaluable to support CUDA.
---
Until now, we had the final result future being created at the very beginning, and then passed around everywhere, so that the various methods could "fill in" its value. I think it's much lighter to instead allow each method to create or obtain its futures however it wants, and have it return them. I.e., have these futures "bubble up" from the lower layers, rather than them being "pushed down" from the upper ones.
In this initial PR, I move the place where we create this "final result future", but I still keep it around. I will then, in later PRs, slowly migrate each method so that it returns a future, and in the end I will avoid creating the final result future.
ghstack-source-id: 129567062
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28224478
fbshipit-source-id: dbdc66b6458645a4a164c02f00d8618fa64da028
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57636
The "preferred" pointer holder for Future is `intrusive_ptr` (e.g., `then` returns an `intrusive_ptr`, `toFuture` returns `intrusive_ptr`, ...). However in RPC we often wrap it with `shared_ptr`. This probably dates back to when we had a separate Future type, before the merge.
At the boundary between RPC and JIT this difference becomes a bit annoying, as conversions between the pointer types are needed. I think it would be simpler and more consistent to always use `intrusive_ptr`, also in RPC.
This PR was produced mainly by find-and-replace, plus a couple of manual fixes.
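The difference between the two pointer types can be sketched with a minimal intrusive refcount (this is an illustration, not the real c10::intrusive_ptr): the count lives inside the object itself, so no separate control block is allocated and a raw pointer can always be safely re-wrapped.

```cpp
#include <atomic>
#include <cassert>

// Base class holding the intrusive reference count inside the object.
struct RefCounted {
  std::atomic<int> refcount{0};
};

// Toy intrusive smart pointer: copying bumps the in-object count,
// destruction drops it and deletes at zero.
template <typename T>
struct IntrusivePtr {
  T* ptr = nullptr;
  explicit IntrusivePtr(T* p) : ptr(p) { if (ptr) ptr->refcount++; }
  IntrusivePtr(const IntrusivePtr& o) : ptr(o.ptr) { if (ptr) ptr->refcount++; }
  ~IntrusivePtr() {
    if (ptr && --ptr->refcount == 0) delete ptr;
  }
  T* operator->() const { return ptr; }
};

// A future-like object managed intrusively (purely illustrative).
struct MyFuture : RefCounted { int value = 7; };
```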
ghstack-source-id: 128296581
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D28187972
fbshipit-source-id: d4609273a1550b4921910e85d2198e02f31c905b
Summary:
Addresses step 1 of https://github.com/pytorch/pytorch/issues/46564
Took processing logic for each case in request_callback_no_python.cpp and put it in a dedicated function.
cc: izdeby
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47816
Reviewed By: izdeby
Differential Revision: D25090207
Pulled By: H-Huang
fbshipit-source-id: bfa38e38db02e077d859125739aaede90ba492e7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45783
After the previous device maps commits, `pipeWrite` might throw. In
this case, if we increment active calls before `pipeWrite` on the
caller, that active call won't be decremented properly when `pipeWrite`
throws. As a result, `shutdown` can silently time out. I noticed this
as some tests take more than 60s to finish.
This commit extracts the tensor device checking logic out of `pipeWrite`,
and makes sure the error is thrown before the active call count is
incremented.
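The ordering fix can be sketched abstractly (names are illustrative): anything that can throw is checked before the counter is bumped, so a failure never leaves the counter stuck.

```cpp
#include <cassert>
#include <stdexcept>

// Toy agent: validate before incrementing the active-call counter, so a
// failed send never leaves the counter permanently incremented (which
// would stall shutdown in the real agent).
struct Agent {
  int activeCalls = 0;

  static void checkDevices(bool devicesOk) {
    if (!devicesOk) throw std::runtime_error("unsupported tensor device");
  }

  void send(bool devicesOk) {
    checkDevices(devicesOk);  // may throw; counter untouched on failure
    ++activeCalls;
    // ... pipeWrite(...) would happen here ...
    --activeCalls;  // decremented on completion in the real code
  }
};
```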
Differential Revision: D24094803
Test Plan: Imported from OSS
Reviewed By: mruberry
Pulled By: mrshenli
fbshipit-source-id: d30316bb23d2afd3ba4f5540c3bd94a2ac10969b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44967
When enabling the profiler on the server, if it is a different machine it may
not have CUDA while the caller does. In this case, we used to crash, but now
we fall back to CPU and log a warning.
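A sketch of the fallback decision (the enum and function names are hypothetical):

```cpp
#include <cassert>
#include <string>

// Toy model of the profiler-mode negotiation: if the server lacks CUDA,
// downgrade the requested mode instead of crashing, and surface a warning.
enum class ProfilerMode { CPU, CUDA };

struct Choice { ProfilerMode mode; std::string warning; };

inline Choice chooseProfilerMode(ProfilerMode requested, bool cudaAvailable) {
  if (requested == ProfilerMode::CUDA && !cudaAvailable) {
    return {ProfilerMode::CPU,
            "CUDA profiling requested but CUDA is unavailable; falling back to CPU"};
  }
  return {requested, ""};
}
```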
ghstack-source-id: 112977906
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D23790729
fbshipit-source-id: dc6eba172b7e666842d54553f52a6b9d5f0a5362
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36893
Adding an end to end test for running a simple training loop in C++
for the distributed RPC framework.
The goal of this change is to enable LeakSanitizer and potentially catch memory
leaks in the Future. Enabling LSAN with python multiprocessing is tricky and we
haven't found a solution for this. As a result, adding a C++ test that triggers
most of the critical codepaths would be good for now.
As an example, this unit test would've caught the memory leak fixed by:
https://github.com/pytorch/pytorch/pull/31030
ghstack-source-id: 107781167
Test Plan:
1) Verify the test catches memory leaks.
2) waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D21112208
fbshipit-source-id: 4eb2a6b409253108f6b6e14352e593d250c7a64d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34497
Use a thread_local table to intercept UserRRefs created during user
function args deserialization, and then wait for confirmations of
those UserRRefs before launching the given user function.
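The interception pattern can be sketched as follows (names are illustrative, and the deserialization is simulated):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Thread-local table: deserialization code registers each UserRRef it
// creates here, without needing to thread a collection through every call.
thread_local std::vector<std::uint64_t> interceptedUserRRefs;

inline void onUserRRefDeserialized(std::uint64_t rrefId) {
  interceptedUserRRefs.push_back(rrefId);
}

// Deserialize args (simulated here), then hand back the intercepted ids so
// the caller can wait for their confirmations before running the UDF.
inline std::vector<std::uint64_t> deserializeAndCollect() {
  interceptedUserRRefs.clear();
  onUserRRefDeserialized(101);  // simulated: args contained two UserRRefs
  onUserRRefDeserialized(102);
  auto collected = interceptedUserRRefs;
  interceptedUserRRefs.clear();
  return collected;
}
```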
Differential Revision: D20347464
Test Plan: Imported from OSS
Pulled By: mrshenli
fbshipit-source-id: 087484a2d2f03fbfb156752ab25653f39b412a07
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30637
RequestCallback api currently forces work to be always synchronous, which,
as we scale, means we're going to need to throw a large number of (mostly
blocked) threads at the rpc problem. For some activities like dependent
autograd rpcs, there's no real need to block in these threads.
In this change, the RequestCallback api is updated to return a
shared_ptr<FutureMessage> rather than a Message:
std::shared_ptr<FutureMessage> operator()(Message& request) const;
With a futures-style api, RPC ops that wish to be async can then be async,
while short-lived blocking functions (or Python UDFs) can just block.
In this change, we keep all of the current ops as synchronous (i.e. we block
and then return a completed FutureMessage). We also update the rpc_agents in
a manner compatible with this sort of parallelism.
Here, we only want to incur overhead when we use the async behavior.
Some modest extra cost seems unavoidable here (e.g. the allocation for the
std::make_shared<>), but we can trivially detect the synchronous/completed
case in the rpc_agent and avoid the extra thread-switches/etc. in that case.
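The synchronous fast path can be sketched with toy types: when the future is already completed, run the continuation inline and skip the thread switch.

```cpp
#include <cassert>

// Toy future: either already completed (synchronous op) or pending.
template <typename T>
struct FutureLike {
  bool completed = false;
  T value{};
};

// If the future is already completed, invoke the continuation inline;
// otherwise (sketched here by bumping a counter) the real code would
// enqueue it onto another thread.
template <typename T, typename F>
void addContinuation(const FutureLike<T>& fut, F&& f, int& threadSwitches) {
  if (fut.completed) {
    f(fut.value);  // inline: no extra thread hop for synchronous ops
  } else {
    ++threadSwitches;  // stand-in for scheduling onto another thread
    f(fut.value);
  }
}
```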
ghstack-source-id: 95287026
Test Plan:
- Basic: buck test mode/dev-nosan caffe2/test/...
- Additional testcase in ThriftRpcAgentTest for deferred work.
Differential Revision: D18774322
fbshipit-source-id: cf49922a71707cfb1726de16f93af23b160385d8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28312
1. Currently, if the autograd context is valid, an RPC is sent with autograd metadata even when the tensors do not require grads and no grad functions are attached. This is not ideal.
This diff makes sure an RPC with autograd metadata is sent only if the autograd context is valid and the tensors require grads.
2. Meanwhile, create a utility to attach autograd info and functions as needed.
3. Add autograd send/recv functions for python rpc calls.
4. Make changes to support nested python rpc calls.
5. Disallow nested dist autograd contexts (was landed in #27022).
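Point 1 boils down to a simple conjunction, sketched here with a hypothetical helper:

```cpp
#include <cassert>

// Attach autograd metadata only when both conditions hold: the distributed
// autograd context is valid AND some tensor actually requires grad.
// (Illustrative helper, not the real function name.)
inline bool shouldAttachAutogradMeta(bool contextValid, bool anyTensorRequiresGrad) {
  return contextValid && anyTensorRequiresGrad;
}
```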
ghstack-source-id: 92240367
Test Plan: unit tests
Differential Revision: D18017554
fbshipit-source-id: dbe79a5171063901a78a9b3322b9b31c159d098d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27576
1. Currently, if the autograd context is valid, an RPC is sent with autograd metadata even when the tensors do not require grads and no grad functions are attached. This is not ideal.
This diff makes sure an RPC with autograd metadata is sent only if the autograd context is valid and the tensors require grads.
2. Meanwhile, create a utility to attach autograd info and functions as needed.
3. Add autograd send/recv functions for python rpc calls.
4. Make changes to support nested python rpc calls.
5. Disallow nested dist autograd contexts (was landed in #27022).
ghstack-source-id: 92154535
Test Plan: unit tests
Differential Revision: D17819153
fbshipit-source-id: 37d8a85855bf591f2f2da48d475a06e870a30ea1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527
Master GH issue: https://github.com/pytorch/pytorch/issues/23110.
This change builds upon https://github.com/pytorch/pytorch/pull/24876 and
provides all the autograd hooks needed for a forward pass with distributed rpc
for builtin operators. This change does not address distributed rpc for python
UDFs and that will be addressed in follow up PRs.
Summary of changes:
1. Attach send autograd functions when a request is sent from the client and
response is sent from the server.
2. Attach receive autograd functions when a request is received on the server
and a response is received on the client.
3. Generate a globally unique autograd_message_id for each send/recv autograd
function pair to uniquely identify them.
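One common scheme for such a globally unique id, sketched here purely as an illustration (not necessarily the exact encoding PyTorch uses): pack the worker id into the high bits and a per-worker monotonic counter into the low bits.

```cpp
#include <cassert>
#include <cstdint>

// Globally unique id sketch: no two workers can collide because the worker
// id occupies the high bits, and no single worker repeats an id because the
// local counter is monotonically increasing.
inline std::int64_t makeAutogradMessageId(std::int64_t workerId,
                                          std::int64_t& localCounter) {
  return (workerId << 48) | (localCounter++ & ((std::int64_t(1) << 48) - 1));
}
```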
ghstack-source-id: 91240466
Test Plan: unit tests.
Differential Revision: D17148077
fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233