Commit Graph

47 Commits

Author SHA1 Message Date
4d704e607d Always use intrusive_ptr for Message (1 out of 2) (#58422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58422

Similar to Future (which I tackled recently), Message is an ivalue type (a "custom class" one), and the natural way to represent it is inside an intrusive_ptr. However in the RPC code we had a mix of usages, often passing Message by value. This has undesirable consequences, as it could easily trigger a copy by accident, which I believe is why in many places we accepted _rvalue references_ to Message, in order to force the caller to move. In my experience this is non-idiomatic in C++ (normally a function signature specifies how the function consumes its arguments, and it's up to the caller to then decide whether to copy or move).

By moving to intrusive_ptr everywhere I think we eliminate many of the problems above and simplify the code.

In this PR I do half of the migration, by updating everything except the `toMessageImpl` methods, which will come in the next PR.
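
A minimal sketch of the shape of the change (the `Message` below is a simplified stand-in, not the real `rpc::Message`):

```cpp
#include <c10/util/intrusive_ptr.h>
#include <string>
#include <utility>

// Simplified stand-in for rpc::Message (the real class also carries a
// payload buffer, tensors, and a message type).
struct Message : c10::intrusive_ptr_target {
  explicit Message(std::string payload) : payload_(std::move(payload)) {}
  std::string payload_;
};

// Before: void send(Message&& m);  // forced callers to write std::move
// After: ownership transfer is explicit at the call site, and an accidental
// deep copy is impossible since only the handle is copied or moved.
void send(c10::intrusive_ptr<Message> m) {
  (void)m->payload_;
}

int main() {
  auto m = c10::make_intrusive<Message>("ping");
  send(std::move(m));
}
```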
ghstack-source-id: 129567053

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474878

fbshipit-source-id: 5b76d45e05f6fa58c831e369c5c964d126187a6c
2021-05-21 13:15:24 -07:00
45012da298 Migrate from shared_ptr to intrusive_ptr for Future (#57636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57636

The "preferred" pointer holder for Future is `intrusive_ptr` (e.g., `then` returns an `intrusive_ptr`, `toFuture` returns `intrusive_ptr`, ...). However in RPC we often wrap it with `shared_ptr`. This probably dates back to when we had a separate Future type, before the merge.

At the boundary between RPC and JIT this difference becomes a bit annoying, as conversions between the pointer types are needed. I think it would be simpler and more consistent to always use `intrusive_ptr`, also in RPC.

This PR was produced mainly by find-and-replace, plus a couple of manual fixes.
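
A small sketch of the unified usage, written against the current c10 API (signatures may differ slightly from the code at the time):

```cpp
#include <ATen/core/ivalue.h>

using JitFuture = c10::ivalue::Future;

// With a single holder type there is nothing to convert at the RPC/JIT
// boundary: a future created here can be handed to JIT code as-is.
c10::intrusive_ptr<JitFuture> makeStringFuture() {
  auto fut = c10::make_intrusive<JitFuture>(c10::StringType::get());
  fut->markCompleted(c10::IValue("done"));
  return fut;
}
```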
ghstack-source-id: 128296581

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D28187972

fbshipit-source-id: d4609273a1550b4921910e85d2198e02f31c905b
2021-05-07 03:59:20 -07:00
36e47af58b Pass reference to parent future in callbacks (#57635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635

Note: this PR looks massive, but it's just one simple change, codemodded many times.

In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy because the callback was invoked with the parent future as an argument, and could thus inspect it. In C++ the callbacks didn't take any arguments, so in many cases we worked around this by capturing the future in its own callback. This is risky (it leads to a reference cycle and thus a memory leak) and must be done carefully (spoiler: sometimes we weren't).
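
A standalone toy sketch of the two patterns (illustrative types, not the torch API):

```cpp
#include <cassert>
#include <functional>
#include <memory>

struct ToyFuture {
  int value = 0;
  std::function<void(ToyFuture&)> callback;
  void markCompleted(int v) {
    value = v;
    if (callback) callback(*this); // C++ now mirrors Python: pass the parent in
  }
};

int main() {
  auto fut = std::make_shared<ToyFuture>();
  // Risky old pattern: capturing `fut` in its own callback creates a
  // shared_ptr cycle (future -> callback -> future), i.e. a leak:
  //   fut->callback = [fut](ToyFuture&) { /* use fut->value */ };
  // New pattern: no capture needed, the parent arrives as an argument.
  fut->callback = [](ToyFuture& parent) { assert(parent.value == 42); };
  fut->markCompleted(42);
}
```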
ghstack-source-id: 128296580

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D28178783

fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
2021-05-07 03:59:18 -07:00
69de4940f3 Ensure devices are preserved when forwarding between futures (#57432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57432

In a bunch of places we were creating a future and then "forwarding" the value of another future to it once that other future completed. (This was in order to convert the type of the value, or to "merge" multiple futures into one). However when doing so we often created a child future with an empty set of devices, which meant it didn't support CUDA, and thus would cause a silent synchronization/correctness bug if the parent future did actually contain CUDA tensors.

One way this could have been caught earlier would have been to have Future always extract the DataPtrs, even in CPU-only mode, in order to ensure they always reside on the expected set of devices. Unfortunately this might have some adverse perf effects and thus should be done carefully.
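
A sketch of the fixed forwarding pattern, written against the current `ivalue::Future` API (signatures at the time of this commit may differ slightly):

```cpp
#include <ATen/core/ivalue.h>

using JitFuture = c10::ivalue::Future;

c10::intrusive_ptr<JitFuture> forwardTo(const c10::intrusive_ptr<JitFuture>& parent) {
  // The buggy pattern: make_intrusive<JitFuture>(parent->elementType())
  // defaults to an empty device set, i.e. a CPU-only child future.
  // The fix: carry the parent's devices over so CUDA sync still happens.
  auto child = c10::make_intrusive<JitFuture>(parent->elementType(), parent->devices());
  parent->addCallback([child](JitFuture& p) {
    if (p.hasError()) {
      child->setError(p.exception_ptr());
    } else {
      child->markCompleted(p.value());
    }
  });
  return child;
}
```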
ghstack-source-id: 128184667

Test Plan: eyes

Reviewed By: mrshenli

Differential Revision: D28143045

fbshipit-source-id: 9af1abf270366dc1df0d4857d6a8cc73668af9d1
2021-05-06 01:12:51 -07:00
1292602375 Avoid re-extracting DataPtrs when forwarding values between Futures (#57433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57433

In a bunch of cases we need to "forward" a value from one future to another, typically because we need to convert the type of the data (e.g., from Message to PyObject). In most of these cases the DataPtrs of the value don't change, and yet the new future must re-extract them from scratch. By allowing the user to obtain the vector of extracted DataPtrs from the old future, we can let them "shortcut" this step.

Also, this change is a requirement for the next PR to work: that PR would otherwise cause us to attempt extracting DataPtrs from Message instances, which doesn't work (because Message is a custom class), but thanks to this PR we can skip that step.

ghstack-source-id: 128184663

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28118298

fbshipit-source-id: 70e333ea6a4f8d4d9a86514c350028d412469ee1
2021-05-06 01:11:38 -07:00
9ec6883442 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D28216577

fbshipit-source-id: ce31fb98320a31eb947bdd31c68aaafed034df79
2021-05-05 04:41:21 -07:00
3db45bcb91 Compilation error fix for torch/csrc/distributed/rpc/init.cpp (#57500)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57500

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28162887

Pulled By: agolynski

fbshipit-source-id: b6fafd64778fc09a5e832b0a557ae70f06951454
2021-05-03 23:15:02 -07:00
e845158b1a Assert that GIL is not held in blocking destructors (#57030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57030

PR #57029 is not perfect; there are still obscure situations in which
we might allocate a shared_ptr to an RpcAgent that doesn't have a
no-GIL constructor, so this PR adds the other half of the equation:
assert that we don't hold the GIL when running a blocking destructor.
This makes it possible to detect potential deadlocks even if the
code doesn't deadlock in practice (because you got lucky and none
of the threads you blocked on tried to also take out the GIL).
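
A sketch of the check (simplified; the real assertion lives in the relevant destructors):

```cpp
#include <Python.h>
#include <cassert>

struct BlockingResource {
  ~BlockingResource() {
    // PyGILState_Check() returns 1 iff the calling thread holds the GIL.
    assert(!PyGILState_Check() && "blocking destructor must not hold the GIL");
    // ... now safe to join threads / wait on RPC work here ...
  }
};
```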

I considered whether or not to make this DEBUG_ONLY.  For now it's
not, so I can get better CI coverage, and because this test only
happens in destructors of objects that die rarely.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28030582

Pulled By: ezyang

fbshipit-source-id: a7d7f6545223c4823c7f6036dfe29bd2edaf60a5
2021-05-02 22:06:02 -07:00
0422e67336 Use Devices instead of DeviceIndexes in TensorPipe agent (#57294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57294

With the advent of CPUs in the device maps, to be more generic (e.g., to support AMD GPUs), and to avoid conversions when passing to Future, RRef, and such, it's easier to use Devices instead of DeviceIndices. This started as just migrating the TensorPipe agent, but the RPC layer is quite intertwined, so I had to migrate a lot of other code.
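
For illustration, the difference in what the two types can express:

```cpp
#include <c10/core/Device.h>

int main() {
  // A bare index cannot say what kind of device it names:
  c10::DeviceIndex idx = 0; // cuda:0? something else?
  // A Device is unambiguous, and can also represent CPUs:
  c10::Device cuda0(c10::DeviceType::CUDA, 0);
  c10::Device cpu(c10::DeviceType::CPU);
  (void)idx; (void)cuda0; (void)cpu;
}
```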
ghstack-source-id: 127916562

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28092733

fbshipit-source-id: 024dcb3648c5898ab13e770413c43958f04f1a8a
2021-05-01 16:12:55 -07:00
13dbb77b7a [RPC Framework] Enable RemoteModule to directly send GPU tensors over the wire on TensorPipe RPC backend if a device map is provided (#57288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57288

If the device map provided by RemoteModule is not empty, then the TensorPipe RPC backend can support directly sending GPU tensors over the wire.

Also add pybind of `_get_device_map`.

The changes in unit test setup are separated out into a follow-up PR, as currently they break some tests in `distributed/rpc/test_faulty_agent.py`.

Still need to fix test_load_di_parts in `torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test`. Currently an early return is used to bypass this test failure.

#Original PR issue: https://github.com/pytorch/pytorch/issues/51670

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device_script

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule -j 1

CAUTION: This one actually fails and is bypassed for now. See FIXME in `_remote_forward`.
buck test mode/dev-nosan caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- test_load_di_parts

Reviewed By: wanchaol

Differential Revision: D28021672

fbshipit-source-id: a89245dc35e1d9479811ec6f98d9f34116837d79
2021-04-30 18:04:45 -07:00
c2fbd96735 [RPC Framework] Expose a Python API for device map getter (#57179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57179

Expose a Python API to get the device map and unblock RemoteModule work.

See: https://github.com/pytorch/pytorch/pull/56854#issuecomment-827762398

Additionally, add a const qualifier to the C++ getter.

#Original PR issue: https://github.com/pytorch/pytorch/issues/51670
ghstack-source-id: 127684266

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D28070160

fbshipit-source-id: 624d14552d82b99487f72e16428fa75c7a47f61f
2021-04-29 14:29:10 -07:00
311ad5e3af Merge CUDAFuture into ivalue::Future (#57052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57052

This PR caps a stack whose goal was to merge CUDAFuture into ivalue::Future. CUDAFuture used to be a subclass of ivalue::Future, which was already pretty good, but it meant that in several places we needed `#ifdef`s or registries in order to create the right type of class, which was annoying. We've made CUDAFuture device-agnostic, by using generic helpers, so that it doesn't depend on CUDA. Now all its code can be inserted into ivalue::Future.

This PR does this very naively, by copy-pasting CUDAFuture's code into the (previously empty) virtual methods of ivalue::Future. This helps ensure the correctness of this PR, as it's straightforward to see it behaves exactly like before. However we probably want to polish it a bit later to iron out some wrinkles.
ghstack-source-id: 127713138

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28036829

fbshipit-source-id: 3e5b16402f5dc245c1fcb9d7bf06db64dcb0d2a3
2021-04-29 09:31:52 -07:00
eac02f85cf Fix more clang-tidy errors (#57235)
Summary:
In my last PR I missed the CUDA and distributed folders; fixing this now
This change is autogenerated by `python tool/clang_tidy.py -s`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235

Reviewed By: janeyx99

Differential Revision: D28084444

Pulled By: malfet

fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
2021-04-28 23:29:10 -07:00
1ee54cc7b4 Add devices argument to RRef constructor (#57085)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57085

PR #54932 fixed the CUDA RPC for RRef when RRef is created through
RPC. But besides that use case, RRef can also be created locally
by directly passing in a value, which would bypass the CUDA stream
synchronization in #54932.

This commit covers the above gap by adding a `devices` argument
to RRef constructor. The RRef will then use this argument to
choose between `CUDAFuture` and `ivalue::Future` to hold the value.
When `devices` is specified and non-empty, `CUDAFuture` will be
used, and the `devices` will be passed to that `CUDAFuture`.

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D28050001

Pulled By: mrshenli

fbshipit-source-id: 2316b419fa69aa4dcd444050f0b74e61c3d0af1e
2021-04-28 19:11:10 -07:00
40eea6d9d1 Support device map for distributed autograd while using TensorPipe. (#44859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44859

TensorPipe's `set_device_map` option was applied during the forward
pass. However, if we ran the backward pass for the graph we would not
automatically pick up the reverse device mapping.

As a result, users had to specify both the forward and backward device mappings,
which is very tedious to do.

In this PR, I've added this functionality such that TensorPipe automatically
picks up the reverse device mapping during the backward pass. This is done by
storing the appropriate device mapping in the "recv" autograd function for
distributed autograd.
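
A sketch of the core trick (types simplified relative to the real code):

```cpp
#include <c10/core/Device.h>
#include <unordered_map>

using DeviceMap = std::unordered_map<c10::Device, c10::Device>;

// The backward pass can derive its mapping by inverting the forward map
// recorded in the "recv" autograd function.
DeviceMap reverseDeviceMap(const DeviceMap& forward) {
  DeviceMap reverse;
  for (const auto& kv : forward) {
    reverse.emplace(kv.second, kv.first);
  }
  return reverse;
}
```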

#Closes: https://github.com/pytorch/pytorch/issues/44170
ghstack-source-id: 119950842

Test Plan:
1) waitforbuildbot
2) Unit test added.

Reviewed By: mrshenli

Differential Revision: D23751975

fbshipit-source-id: 2717d0ef5bde3db029a6172d98aad95734d52140
2021-01-27 13:01:44 -08:00
f9f758e349 Apply clang-format to rpc cpp files (#50236)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50236

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25847892

Pulled By: mrshenli

fbshipit-source-id: b4af1221acfcaba8903c629869943abbf877e04e
2021-01-08 11:47:43 -08:00
d730c7e261 Replace FutureMessage with ivalue::Future in RpcAgent retry logic (#49995)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49995

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25745301

Pulled By: mrshenli

fbshipit-source-id: b5e3a7e0b377496924847d8d70d61de32e2d87f4
2021-01-07 19:50:23 -08:00
84e3237a53 Let RpcAgent::send() return JitFuture (#49906)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49906

This commit modifies RPC Message to inherit from `torch::CustomClassHolder`,
and wraps a Message in an IValue in `RpcAgent::send()`.
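
A simplified stand-in showing why the inheritance matters (the real Message is also registered as a custom class, which is what makes the wrap succeed at runtime):

```cpp
#include <torch/custom_class.h>
#include <string>

// Deriving from CustomClassHolder is what lets an IValue (and therefore a
// JIT future) hold the object.
struct MiniMessage : torch::CustomClassHolder {
  std::string payload;
};

c10::IValue wrap(c10::intrusive_ptr<MiniMessage> m) {
  // Note: requires MiniMessage to be registered via torch::class_<> first,
  // otherwise this throws at runtime.
  return c10::IValue(std::move(m));
}
```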

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25719518

Pulled By: mrshenli

fbshipit-source-id: 694e40021e49e396da1620a2f81226522341550b
2021-01-07 19:47:14 -08:00
b803b4ce09 [torch.distributed.rpc] Add stringify WorkerInfo, better error message for py_rref (#39974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39974

# Problem

When this assertion happens, I don't know
- which worker_id it is on, even with the worker_name "trainer:0".
- which rref is throwing this exception.

```shell
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in _initialize_trainers
    trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items()
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in <dictcomp>
    trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items()
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/torch/distributed/rpc/internal.py", line 158, in _handle_exception
    raise result.exception_type(result.msg)
RuntimeError: RuntimeError('Cannot call localValue() on a non-local reference. Call it on trainer:0')
Traceback (most recent call last):
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/internal.py", line 148, in _run_function
    result = python_udf.func(*python_udf.args, **python_udf.kwargs)
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/rref_proxy.py", line 5, in _local_invoke
    return getattr(rref.local_value(), func_name)(*args, **kwargs)
RuntimeError: Cannot call localValue() on a non-local reference. Call it on trainer:0
```

Changes:
- Add a stringified representation of WorkerInfo.
- Make the localValue() assertion message clearer about this case.
ghstack-source-id: 105840918

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork -- test_local_value_not_on_owner

buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit/:rpc_fork

Reviewed By: mrshenli

Differential Revision: D5690653

fbshipit-source-id: ca6a8b1ff6e09f8644303a0f82f9b1a546a11170
2020-06-13 12:57:05 -07:00
7d85e77076 Use atomic operations to manipulate current RPC agent (#39663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39663

I was investigating a memory corruption issue and thought it may be due to a race condition in (un)setting the current RPC agent. It turns out it wasn't (still investigating...). I had already written this fix, and it is a real fix (there could really be a race condition), so I'm sending it out to see whether there's interest in merging it. I believe its practical usefulness is however very limited, since typically the current RPC agent is only changed twice (at start and at shutdown) and thus there's limited risk for races.

As there may be some confusion on the atomicity of shared_ptrs, let me clarify a few things from the get-go. Operations on the control blocks of shared_ptrs (i.e., increasing and decreasing the refcounts) are atomic, which means that it is safe to manipulate *two different* shared_ptrs that point to the *same* object from *different* threads. However, the shared_ptr object itself is not atomic, which means that it is *not* safe to manipulate the *same* shared_ptr from two *different* threads.

For that reason, the STL provides atomic functions explicitly specialized for shared_ptrs: https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic (in C++20, they are being replaced by a specialization of std::atomic<std::shared_ptr<T>>). Note that this has been called "the worst question of all of C++" by Louis Brandy at his CppCon talk: https://youtu.be/lkgszkPnV8g?t=1210
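
A standalone sketch of the pattern (names illustrative):

```cpp
#include <memory>

struct RpcAgent { /* ... */ };

// The single global holder, analogous to the current-RPC-agent pointer.
std::shared_ptr<RpcAgent> currentAgent;

void setCurrentAgent(std::shared_ptr<RpcAgent> agent) {
  // Plain assignment to the *same* shared_ptr from two threads is a data
  // race; the atomic free functions make the swap safe.
  std::atomic_store(&currentAgent, std::move(agent));
}

std::shared_ptr<RpcAgent> getCurrentAgent() {
  return std::atomic_load(&currentAgent);
}
```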
ghstack-source-id: 105475005

Test Plan: Unit tests

Differential Revision: D21932817

fbshipit-source-id: da33fedd98efb820f284583ce7ff1c1c531dea9c
2020-06-09 02:11:15 -07:00
e6993938de Avoid Releasing, Reacquiring lock per iteration in RPC Retry Thread (#38521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38521

In the RPC retry thread, we add retriable futures to a list under the lock, release the lock, add callbacks/set errors on those futures, then re-acquire the lock to clean up the retry map. We can simply clean up the retry map before releasing the lock and not acquire it again - this is cleaner and may result in better perf if it reduces context switching between threads looking to acquire the retryMapLock.
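
A standalone sketch of the resulting single critical section (the vector is a stand-in for the real retry map):

```cpp
#include <functional>
#include <mutex>
#include <vector>

std::mutex retryMapLock;
std::vector<std::function<void()>> retryMap;

void processRetriableFutures() {
  std::vector<std::function<void()>> ready;
  {
    std::lock_guard<std::mutex> guard(retryMapLock);
    ready.swap(retryMap); // take the entries *and* clean up under one lock
  }
  for (auto& fn : ready) {
    fn(); // add callbacks / set errors without holding the lock
  }
}
```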
ghstack-source-id: 104062147

Test Plan: CI, there are thorough tests in the RPC framework to test errors with retries.

Differential Revision: D21563085

fbshipit-source-id: 35e620892da630d082c032f5f9ce16e8a9ffdfaa
2020-05-18 10:59:13 -07:00
af597335d4 Remove unnecessary to_string in RPC logging code. (#38414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38414

`std::to_string` call is unnecessary when using glog.
ghstack-source-id: 104030161

Test Plan: Ran the retry tests and checked logs to ensure the correct message was printed upon message failure.

Differential Revision: D21266330

fbshipit-source-id: 53519287778d47d99b94ea34b7c551f910affda2
2020-05-14 10:57:00 -07:00
f5c230b892 Make futures vector a local function var (#36677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36677

Move the `futures` vector to be a local function var like `errorFutures`. Holding the lock to clear the vector is now unnecessary.
ghstack-source-id: 102265569

Differential Revision: D20884589

fbshipit-source-id: c9a13258bee737d86f9b0d11cdd28263bb923697
2020-04-16 10:09:39 -07:00
87be115fd0 Error Handling in RPC Agent (#35263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35263

Process Group Agent throws an exception if a send attempt is made after the agent is shutdown. With retries, we should catch this exception and mark the original future with an error.
ghstack-source-id: 102153897

Test Plan: Running all rpc/dist_autograd tests.

Differential Revision: D20611412

fbshipit-source-id: a6009f0b0aa8be662364158962a054c5c29090bf
2020-04-15 10:53:31 -07:00
37aab14d14 [future] Avoid some future callback self-captures. (#36502)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36502

We're sometimes deleting futures without completing them (discovered by logging),
and we've recently noticed a slow memory leak.

This change migrates the future lambda cases where there was self-capture.
 - In some cases, we use weak_ptr<>, plus .lock()/assert in the lambda callback.
   This avoids the reference cycle. We use this primarily in the case where the
   value ends up being moved in the callback (something we want to be careful
   about); see the sketch after this list.

 - We also add a convenience API to Future where the completed Future is passed
   to the callback as an argument. This allows us to avoid self-capture, though
   it assumes that the markCompleted() caller is persisting the future for the
   markCompleted() duration (this has been the case).
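
A standalone toy sketch of the weak_ptr pattern from the first bullet (not the torch API):

```cpp
#include <cassert>
#include <functional>
#include <memory>

struct ToyFuture {
  int value = 0;
  std::function<void()> callback;
};

int main() {
  auto fut = std::make_shared<ToyFuture>();
  // weak_ptr breaks the future -> callback -> future cycle; the lock() +
  // assert documents that whoever completes the future keeps it alive for
  // the duration of the callback.
  std::weak_ptr<ToyFuture> weak = fut;
  fut->callback = [weak]() {
    auto strong = weak.lock();
    assert(strong && "future must outlive its own callback");
    strong->value += 1; // e.g. move the value out here
  };
  fut->callback(); // stand-in for markCompleted() running the callback
}
```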

ghstack-source-id: 102130672

Test Plan: ctr_mobile_feed, buck test mode/dev-nosan caffe2/test/...

Differential Revision: D20998905

fbshipit-source-id: 7dd52fe4e567a5dea20e8d43862fc2335fd3ce16
2020-04-14 17:52:44 -07:00
264da24c9e Fixing RPC Shutdown and Thread Joining (#36239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36239

ProcessGroupAgent and ThriftAgent threads were joined at shutdown, but RpcAgent threads were joined by the destructor. This PR joins all threads at shutdown by using a pattern similar to `start` in RPC.

The derived classes implement a `shutdownImpl` method that cleans up backend-specific state. RpcAgent implements `shutdown`, which cleans up generic state and calls the underlying `shutdownImpl`. The atomic `running` flag is now set and unset by RpcAgent, so backends do not need to mutate it.
ghstack-source-id: 101820415

Test Plan: Ensured this works with `test_duplicate_name` (in which RpcAgent is constructed but PGA is not), and selected `rpc_spawn` and `dist_autograd_spawn` tests with TSAN. Checking Build Bot and CI as well, and continuing to test more with TSAN on devserver (currently running into memory issues).

Reviewed By: jjlilley

Differential Revision: D20902666

fbshipit-source-id: 5dbb5fc92ba66f75614c050bb10b10810770ab12
2020-04-09 12:32:00 -07:00
291c910e85 [future] Re-land some safe portions of the future change. (#36254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36254

These future-use changes were all landed yesterday as part of the future
refactoring and quickly reverted due to an observed OOM; they are now being
relanded, since they've since been tested to be benign.
ghstack-source-id: 101776613

Test Plan:
buck test mode/dev-nosan caffe2/test/...
   not ooming: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod

Differential Revision: D20924010

fbshipit-source-id: 28872e488df34c7a886bcd659fa7e9914639d306
2020-04-08 20:05:33 -07:00
a91535930f [future] Undo some recent torch::utils::Future api changes (#36220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36220

The torch::utils::Future change from yesterday may have introduced a reference cycle,
leading to OOM on PS. This change reverts the lambda capture changes with
torch::utils::Future until we can analyze further.
ghstack-source-id: 101756106

Test Plan: ctr mobile feed: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod-preset

Differential Revision: D20918904

fbshipit-source-id: d637f2370aa72c1765b98f3b9e10eb969a025624
2020-04-08 11:28:22 -07:00
72b55fea6b [jit] Make torch::utils::Future and ivalue::future apis closer (#35849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35849

This change harmonizes some aspects of the api.

- torch::utils::Future callback should have no args, like ivalue::Future.
   Many of the lines of this change are related to fixing that up downstream.

   No args makes the api simpler to use, particularly since many/most of the
   downstream use cases ignore the passed-in args. It's simple enough to
   appropriately capture the future in the lambda if necessary.

 - Add error/hasError methods to ivalue::Future.
 - Use c10::optional underneath for the error in ivalue::Future.
 - Change markCompleted(error) to setError(error) in ivalue::Future.
 - Add a setValue(FutureError) overload to torch::utils::Future.

ghstack-source-id: 101684435

Test Plan: buck test mode/dev-nosan caffe2/test/...

Differential Revision: D20803251

fbshipit-source-id: e3d925287bd9a80d649843eef5f270163f448269
2020-04-07 17:05:35 -07:00
7d1f06462c Fixing Potential TSAN issue with joining RPC helper threads (#36094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36094

The condition variable waited on in the RPC retry thread must be notified after setting the atomic `running` to False. This ensures the thread is joinable and allows `rpc.shutdown` to function correctly.
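
A standalone sketch of the required ordering (names illustrative):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

std::atomic<bool> rpcRunning{true};
std::mutex retryMutex;
std::condition_variable retryCv;

// The retry thread blocks like this; it only exits (and becomes joinable)
// once it observes rpcRunning == false.
void retryThreadWait() {
  std::unique_lock<std::mutex> lock(retryMutex);
  retryCv.wait(lock, [] { return !rpcRunning.load(); });
}

void shutdown() {
  {
    std::lock_guard<std::mutex> guard(retryMutex);
    rpcRunning.store(false); // flip the flag first, under the lock...
  }
  retryCv.notify_all(); // ...then notify, so the wake-up cannot be missed
}
```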
ghstack-source-id: 101538860

Test Plan: build bot

Differential Revision: D20854763

fbshipit-source-id: b92050712a1e6c31d4dd3b3d98f32ef8dee0f2f2
2020-04-06 15:56:06 -07:00
19bbfbe1cf [RPC][Better Engineering] Consolidated all rpcAgentRunning atomic booleans (#33915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33915

Closes: https://github.com/pytorch/pytorch/issues/32963

Test Plan: build bot

Reviewed By: jjlilley

Differential Revision: D20074714

fbshipit-source-id: ee89e76f547a1da71825a317c096176524504290
2020-04-03 11:50:05 -07:00
6792dac90d Only Schedule Retries before Agent Shutdown (#35554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35554

We attach a callback to our RPC send attempts that schedules a retry
upon failure. This PR only schedules the retry if the agent is running.
ghstack-source-id: 101332815

Differential Revision: D20612615

fbshipit-source-id: e1bbb3f162101bce7eb46bad512c9e5dc6d531cc
2020-04-01 19:03:09 -07:00
4d9b649261 jit pickling rref (#32959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32959

In the RPC TorchScript call path, we need to pickle/unpickle RRefs. This diff makes the JIT pickler/unpickler able to pickle/unpickle RRefs, similar to what is implemented for PyRRef::pickle() and PyRRef::unpickle().
The pickling/unpickling design assumes it is always coupled with RPC calls. It is not meant for checkpointing a model with RRefs; before checkpointing the model, the user should call rref.to_here() to get the value inside the RRef.

The pickling process is:
1. push the torch.distributed.rpc.rref global string
2. call rref.fork() and create rrefForkData, which is a few IDs and the type str of the value held inside the rref; the IDs include rref id, fork id, caller worker id, callee worker id, owner worker id
3. push the rrefForkData

The unpickling process is:
1. read the torch.distributed.rpc.rref global string, and retrieve the cached global lambda function
2. the global lambda function will get rrefForkData
3. if the callee is also the owner worker, then get the owner rref based on the IDs inside rrefForkData and return the ownerRRef
4. if the callee is not the owner worker, then create a user rref using the rrefForkData and return the userRRef
5. meanwhile the owner rref will be notified and will do reference counting correctly

During unpickling, a type_resolver is needed to parse the type str. This type_resolver has a Python dependency, so we get it from rpc_agent and pass it to the unpickler during construction. So we added a type_resolver argument to the JIT unpickler constructor in this diff.
ghstack-source-id: 98814793

Test Plan: unit test

Differential Revision: D19713293

fbshipit-source-id: 4fd776cdd4ce8f457c4034d79acdfb4cd095c52e
2020-02-24 11:16:35 -08:00
507f963aa6 [RPC Reliability] Enabled retries for RPCs with exponential backoff (#33365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33365

This adds functionality for retrying RPCs that are sent with the function sendWithRetries(). It adds RPCs that will potentially need to be retried to a sorted map that contains the timeout at which to retry the RPC and associated metadata. A separate thread iteratively removes the earliest retry-able RPC from the map, sleeps until the corresponding time point, retries the RPC, and adds it to the map again with a future timeout.
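
A standalone sketch of the retry-map idea (names and types are stand-ins for the real implementation):

```cpp
#include <chrono>
#include <functional>
#include <map>
#include <thread>

using Clock = std::chrono::steady_clock;

// Sorted by the time the retry is due, so the retry thread always sleeps
// until the earliest deadline.
std::multimap<Clock::time_point, std::function<void()>> retryMap;

void retryLoop() {
  while (!retryMap.empty()) {
    auto it = retryMap.begin(); // earliest retry-able RPC
    std::this_thread::sleep_until(it->first);
    auto resend = std::move(it->second);
    retryMap.erase(it);
    resend(); // on failure, the send callback re-inserts this entry with an
              // exponentially larger delay
  }
}
```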

GitHub Issue: https://github.com/pytorch/pytorch/issues/32124

Per the first 4 milestones, the following will be addressed in future PRs:
* enabling RPC Retries for RRef internal messages

Differential Revision: D19915694

fbshipit-source-id: 4a520e32d5084ebcf90e97fd9f26867115a35c0c
2020-02-19 15:59:29 -08:00
5cab54e0db Revert D19560159: [RPC Reliability] Implemented retries for RPCs with exponential backoff
Test Plan: revert-hammer

Differential Revision:
D19560159

Original commit changeset: 40cd86f9a25d

fbshipit-source-id: 70f5b19bc05fc34e3c912f42f9d32b9fb80aed06
2020-02-14 14:29:59 -08:00
92b67c03e4 [RPC Reliability] Implemented retries for RPCs with exponential backoff (#32602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32602

This adds functionality for retrying RPCs that are sent with the function `sendWithRetries()`. It adds RPCs that will potentially need to be retried to a sorted map that contains the timeout at which to retry the RPC and associated metadata. A separate thread iteratively removes the earliest retry-able RPC from the map, sleeps until the corresponding time point, retries the RPC, and adds it to the map again with a future timeout.

GitHub Issue: https://github.com/pytorch/pytorch/issues/32124

Per the first 3 milestones, the following will be addressed in future PRs:
* enabling RPC Retries for RRef internal messages

Differential Revision: D19560159

fbshipit-source-id: 40cd86f9a25dc24367624d279a3b9720b20824cf
2020-02-14 11:57:24 -08:00
5c8535d5b0 Make C++ RpcAgent::currentRPCAgent_ the source of truth of current RPC Agent (#32633)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32633

There were two sources of the current RPC agent:

- One is in the Python world: `torch.distributed.rpc.api._agent`.
- The other is in the C++ world: `RpcAgent::defaultRpcAgent_`.

Setting the Python `_agent` to `None` does not necessarily reset the C++ `defaultRpcAgent_` to `nullptr`.

i.e.
```
 torch.distributed.rpc.api._agent = None
```
does not translate to
```
RpcAgent::defaultRpcAgent_ = nullptr
```

This PR removes this ambiguity and uses the C++ pointer as the source of truth.

The solution is to leverage a pybind11 behavior: it implicitly casts the C++ `shared_ptr<RpcAgent>(nullptr)` to Python `None`.
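
A minimal pybind11 sketch of that behavior (module and names illustrative):

```cpp
#include <pybind11/pybind11.h>
#include <memory>

struct RpcAgent {};

// C++ is the single source of truth; a null shared_ptr returned to Python
// surfaces as None, which is the pybind11 behavior the fix leans on.
std::shared_ptr<RpcAgent> currentRpcAgent; // nullptr until RPC is initialized

PYBIND11_MODULE(example, m) {
  pybind11::class_<RpcAgent, std::shared_ptr<RpcAgent>>(m, "RpcAgent");
  m.def("_get_current_rpc_agent", []() { return currentRpcAgent; });
}
```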
ghstack-source-id: 97293315

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_duplicate_name

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_process_group_debug_info
```

```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_remote_module

buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_embedding

buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling

buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```

Differential Revision: D5733066

fbshipit-source-id: b3e6032ee975f19ca556497edbbf40b517b25be8
2020-01-27 19:34:12 -08:00
9ce25cce91 add an option to record time spent waiting for GIL (#30842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30842

We'd like to profile the time spent on GIL acquisition to debug
performance issues.

Test Plan: Unit tests pass.

Differential Revision: D18837590

fbshipit-source-id: 925968f71c5fb96b8cd93f1eab4647602d2617d1
2020-01-21 11:29:23 -08:00
e66626ae5c Lift rpc_timeout to RpcAgent, for other RpcAgents to reuse. (#29341)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29341

So that other RpcAgents can use this timeout setting as well.

ghstack-source-id: 93481902

Differential Revision: D5681951

fbshipit-source-id: 569c768dc342e8a2d9faf142ceccf696e12e41dc
2019-11-07 17:05:45 -08:00
3bccd3fc0d Distributed Autograd - FAST mode backward pass implementation. (#27022)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27022

This change implements the "FAST" mode distributed autograd backward
pass as described in https://github.com/pytorch/pytorch/issues/23110.

At a high level the backward pass works as follows:
1. We start by computing dependencies on the node that calls
`torch.distributed.backward`.
2. This node computes the dependencies starting from the root nodes provided in
the backward call and all the 'send' functions present in the current autograd
context. The "FAST" mode assumes all 'send' functions are part of the autograd
computation.
3. Once the dependency computation is done, the distributed autograd engine
calls the local autograd engine to execute the autograd graph. Note that the
autograd graph on a single node is not necessarily connected because of
inter-node communication. As a result, we have special handling to ensure the
local autograd engine executes the entire graph starting from the
provided roots and all 'send' functions on the node.
4. When the local autograd engine hits a 'recv' function, it performs an async
RPC to send the gradients over to the appropriate node and stores a future in
the autograd context to keep track of this RPC.
5. On the destination node, the appropriate 'send' function is looked up and
enqueued on the local autograd engine. If this is the first time the node is
hearing about this autograd context id on the backward pass, then the node
computes dependencies for the local autograd engine.
6. As part of computing dependencies, the distributed autograd engine discovers
all leaf nodes and ensures those are passed as 'outputs' to the local autograd
engine. This avoids running the 'AccumulateGrad' function.
7. The gradients computed for the leaf nodes are then actually accumulated in
`DistAutogradContext` for the appropriate autograd context id.
8. The distributed autograd engine waits for the local autograd engine
to complete and also waits for all the 'Futures' (stored in 4.) for respective
RPCs to finish.

We have made the following changes to the local autograd engine for this
purpose:

1. Expose GraphTask and NodeTask so that the distributed autograd engine can
use them.
2. Expose an `execute_with_graph_task` API which allows the distributed engine
to build a GraphTask and pass it to the local autograd engine.
3. Expose an `enqueue_on_cpu` API, which allows the distributed engine to build
a `NodeTask` for a 'send' function and enqueue it on the local autograd engine.

In addition to this a few general improvements:
1. Added a `PropagateGradients` RPC call for the 'recv' function to pass
gradients to the appropriate node during the backward pass.
2. Use IValues as much as possible in serialization for RpcWithAutograd.
3. If Future.wait() returns a message of type EXCEPTION, we throw an appropriate
exception instead of just returning the message. This is in line with what most
Future.wait() APIs do.
4. Added a `get_gradients(context_id)` API which allows users to retrieve a map
from Tensor to respective gradient for the provided context_id on the local
node.
ghstack-source-id: 91794926

Test Plan: unit tests.

Differential Revision: D17652615

fbshipit-source-id: 96f65c52adb2706ee29f4b49e1655afaa0a3bec3
2019-10-12 09:47:49 -07:00
2486b0ba82 Add Python RRef as args and return value (#25499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25499

See #23110 for model parallel design details, and #26759 for the RRef
protocol. This commit adds support for using RRefs as Python UDF arguments
and return values. RRefs can now be shared from owner to user, from user to
owner, or from user to user.

Limitations:
1. No implicit type conversion yet. (#27099)
2. No failure handling and retry. (#26116)
3. UDF is not yet blocked until all RRefs are confirmed. (#27098)
4. Internal RRef control messages are not idempotent yet. (#26116)
5. Cannot delete RRefs correctly when there are circular dependencies. (#27096)

Main changes:

1. Added `SCRIPT_REMOTE_CALL` and `PYTHON_REMOTE_CALL` to `Message.h` to represent `dist.remote` invocations.
2. Added `SCRIPT_RREF_FETCH_CALL`, `PYTHON_RREF_FETCH_CALL`, `RREF_USER_ACCEPT`, `RREF_USER_DELETE`, `RREF_CHILD_ACCEPT`, and `RREF_FORK_REQUEST` to `Message.h` as internal RRef control messages.
3. New message request handling code is added to `functions.cpp`, and message format is added in `script_remote_call.h`, `python_remote_call.h`, and `rref_proto.h`.
4. Added a `PyRRef` type in `py_rref.h` and `py_rref.cpp` which holds a shared pointer to C++ `RRef` type. `PyRRef` wraps the C++ API and also implements RRef pickling and unpickling. RRef fork related control messages will be sent during RRef pickling/unpickling procedure.
5. Update `RRef.h` and `RRef.cpp` accordingly to support `py::object` RRefs.
6. RRef context (reference count, etc.) are tracked in `rref_context.h` and `rref_context.cpp`.

Test Plan:
Imported from OSS

buck test mode/dev-nosan //caffe2/test:rpc_fork

Differential Revision: D17184146

Pulled By: mrshenli

fbshipit-source-id: a3a268efc087ac1ef489136ab957080382629265
2019-10-03 17:47:12 -07:00
fe4170bda8 Add send and recv backward functions for builtin operators RPC. (#25527)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527

Master GH issue: https://github.com/pytorch/pytorch/issues/23110.

This change builds upon https://github.com/pytorch/pytorch/pull/24876 and
provides all the autograd hooks needed for a forward pass with distributed rpc
for builtin operators. This change does not address distributed rpc for python
UDFs and that will be addressed in follow up PRs.

Summary of changes:
1. Attach send autograd functions when a request is sent from the client and
response is sent from the server.
2. Attach receive autograd functions when a request is received on the server
and a response is received on the client.
3. Generate a globally unique autograd_message_id for each send/recv autograd
function pair to uniquely identify them.
ghstack-source-id: 91240466

Test Plan: unit tests.

Differential Revision: D17148077

fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233
2019-10-03 01:18:46 -07:00
5407241b4f Run clang-format on torch/csrc/distributed (#25647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25647

TSIA

Test Plan: N/A

Differential Revision: D17182909

fbshipit-source-id: 22a6554693def0032a051cef5fe788f49de1d740
2019-09-04 10:08:09 -07:00
40cb5182e9 Attach 'send' autograd function to the autograd graph as part of RPC. (#24876)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24876

This contains very basic functionality for adding a 'send' autograd
function to our autograd graph. The purpose of this change is to validate that the
basic structure proposed here makes sense. Once this makes sense, we can build
upon this to address more complicated scenarios. At a high level we've added
the following functionality:

1) Define a very simple 'SendRpcBackwards' autograd function.
2) Attach this function to appropriate tensors when we call an RPC.
3) Store the send function in our distributed autograd context.
ghstack-source-id: 89359708

Test Plan: unit tests.

Differential Revision: D16903255

fbshipit-source-id: 6c04794a8e58b199795404225fd9da0c1440460e
2019-09-01 23:54:01 -07:00
c881136215 Move worker name collection code from Python to C++ (#24260)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24260

This also simplifies the ProcessGroupAgent constructor signature.

Test Plan: Imported from OSS

Differential Revision: D16789219

Pulled By: mrshenli

fbshipit-source-id: bbb69022435467fbb1c28da21dd03d3ab52fc521
2019-08-31 19:02:45 -07:00
1294e55c15 Assign each RpcAgent a unique ID, and use ID for sending RPC messages. (#24195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24195

It is not efficient to use a string destination name in every
send. Moreover, when we add RRef later, RpcAgent will frequently check
RRef ownership, which will also be slow if we have to go through string
comparison every time. This commit assigns each RpcAgent a unique
integer ID. In the Python send API, applications can provide either a
destination name or an id. If it is a string name, it will be converted to
an id by calling the get_id(workerName) API.
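
A sketch of the idea (field names illustrative):

```cpp
#include <cstdint>
#include <string>

// Carry a fixed-size integer id next to the display name, so sends and
// RRef ownership checks compare integers instead of strings.
struct WorkerInfo {
  std::string name; // e.g. "trainer:0" -- for lookup and logging
  int16_t id;       // cheap to compare and hash on every message
};

inline bool isOwner(const WorkerInfo& self, int16_t ownerId) {
  return self.id == ownerId; // integer compare, not string compare
}
```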

Test Plan: Imported from OSS

Differential Revision: D16770241

Pulled By: mrshenli

fbshipit-source-id: fa56128a77a02a402dc6682474bc301dc1b7f43d
2019-08-29 19:19:11 -07:00
b6803d62fd Use snake names for all files in distributed.rpc (#24502)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24502

Files in the distributed.rpc package mix snake_case and camelCase names. This
commit cleans that up; all files use snake_case names now.
ghstack-source-id: 88548990

Reviewed By: xush6528

Differential Revision: D16860155

fbshipit-source-id: 3a22a89bf6c4e11aac5849564fc53296a04d6a8b
2019-08-19 10:58:59 -07:00