This is a reland of PRs https://github.com/pytorch/pytorch/pull/108626 and #109564. We fixed the iOS build failure by changing
```
((CHECK) ? (EXPR) : ([] { assert(!#CHECK); }(), (EXPR)))
```
to
```
((CHECK) ? (EXPR) : ([] { assert(false); }(), (EXPR)))
```
in `TR2_OPTIONAL_ASSERTED_EXPRESSION`, since the former syntax was invalid on Apple Clang. In any case, we apply this simple fix in the hope that `c10::optional` will be replaced by `std::optional` soon.
We also enabled -Wdeprecated on c10.
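For reference, a minimal sketch of the macro shape after the fix (a simplified stand-in, not the exact c10 definition; the `IntOptional` wrapper is purely illustrative):
```cpp
// The lambda wraps assert() so the whole expression remains a single
// conditional expression usable inside accessor functions.
#include <cassert>

#define TR2_OPTIONAL_ASSERTED_EXPRESSION(CHECK, EXPR) \
  ((CHECK) ? (EXPR) : ([] { assert(false); }(), (EXPR)))

struct IntOptional {
  bool has_value_ = false;
  int value_ = 0;
  int value() const {
    // Fires the assert in debug builds when empty; otherwise yields value_.
    return TR2_OPTIONAL_ASSERTED_EXPRESSION(has_value_, value_);
  }
};
```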
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110019
Approved by: https://github.com/clee2000
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67882
I ran into a hard-to-interpret error message when trying to run the following script, which was missing an `init_rpc` call:
```
# $ torchrun --standalone --nnodes=1 --nproc_per_node=1 script.py
import os
rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])
import torch.distributed
# !!!!!! Uncomment the following and the script succeeds
# torch.distributed.rpc.init_rpc('worker', rank=rank, world_size=world_size)
import torch.distributed as dist
dist.init_process_group(backend='gloo')
import torchvision.models as models
import torch
rn50 = models.resnet50()
rn50.train()
rn50 = torch.nn.parallel.DistributedDataParallel(rn50)
from torch.distributed.rpc import RRef
from torch.distributed.optim import DistributedOptimizer
params = []
for param in rn50.parameters():
    params.append(RRef(param))
dist_optim = DistributedOptimizer(
    torch.optim.SGD,
    params,
    lr=0.05)
loss_func = torch.nn.CrossEntropyLoss()
with torch.distributed.autograd.context() as context_id:
    pred = rn50(torch.randn(50, 3, 224, 224))
    target = torch.randn(50, 1000).softmax(dim=1)
    loss = loss_func(pred, target)
    dist.autograd.backward(context_id, [loss])
    dist_optim.step(context_id)
```
Error:
```
Traceback (most recent call last):
File "/xxx/torchrun_exp/script.py", line 23, in <module>
params.append(RRef(param))
RuntimeError: agentINTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rpc_agent.cpp":237, please report a bug to PyTorch. Current RPC agent is not set!
```
Since this is a user-facing error, I've changed `TORCH_INTERNAL_ASSERT` to `TORCH_CHECK` and added a hint about how to resolve the issue. On the other hand, the fact that this was originally a `TORCH_INTERNAL_ASSERT` may suggest that the author considered this an internal-only error condition. If there is some other place that should be throwing an exception in this case but is failing to do so, let me know and I can adapt the fix to change that location instead.
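For illustration, a hedged sketch of the intent of the change (stand-in types, not the exact diff in `rpc_agent.cpp`): `TORCH_CHECK` raises a user-facing `c10::Error` instead of an "INTERNAL ASSERT FAILED" message and can carry a remediation hint.
```cpp
#include <c10/util/Exception.h>
#include <memory>

struct FakeRpcAgent {};                          // stand-in for the real RpcAgent
std::shared_ptr<FakeRpcAgent> currentRpcAgent_;  // stand-in for the global agent

std::shared_ptr<FakeRpcAgent> getCurrentFakeRpcAgent() {
  // User-facing check with a hint, instead of TORCH_INTERNAL_ASSERT.
  TORCH_CHECK(
      currentRpcAgent_ != nullptr,
      "Current RPC agent is not set! Did you initialize the RPC framework, "
      "e.g. by calling `torch.distributed.rpc.init_rpc`?");
  return currentRpcAgent_;
}
```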
Question for reviewers:
* Is there a good test file where I can add a test for this error condition?
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D32190947
Pulled By: jamesr66a
fbshipit-source-id: 3621d755329fd524db68675c55b1daf20e716d43
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` clang-tidy check, as is `DEFINE_DISPATCH`, so the check is disabled and the now-redundant `NOLINTNEXTLINE` suppressions are removed.
All changes except the ones to `.clang-tidy` were generated using the following script:
```
for i in $(find . -type f -iname "*.c*" -or -iname "*.h" \
             | xargs grep cppcoreguidelines-avoid-non-const-global-variables \
             | cut -f1 -d: | sort | uniq); do
  sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.
I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop
# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
-j \
-s \
-k \
-v \
--paths torch/csrc/ \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
-g"-torch/csrc/generic/*.cpp" \
-g"-torch/csrc/jit/codegen/cuda/runtime/*" \
-g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
-g"-torch/csrc/deploy/interpreter/interpreter.h" \
-g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
-g"-torch/csrc/deploy/interpreter/test_main.cpp"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649
Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.
Reviewed By: walterddr, janeyx99
Differential Revision: D29504258
Pulled By: 1ntEgr8
fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60470
A Future needs to know which DataPtrs are used by its value, but it isn't always able to extract them (and even when it is, doing so is expensive), so they are cached. DataPtrs behave like unique_ptrs (movable only, not copyable), hence the Future can only hold _references_ to them. The Future's value, however, is unfortunately mutable (we wish that weren't the case, but we don't think we can prevent it), which means the tensor/storage that owned a DataPtr might be deleted and the DataPtr freed. Our cached reference then becomes stale, which leads to all kinds of disaster, like reading garbage data or segfaulting.
Luckily, all the DataPtrs we were dealing with were held inside Storages, which have shared_ptr semantics, allowing us to hold a strong pointer to them and thus keep them alive.
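For illustration, a hedged sketch of the idea (not the actual ivalue::Future code): cache owning `c10::Storage` handles instead of raw DataPtr references, so the cached entries cannot dangle even if the value is later mutated.
```cpp
#include <ATen/ATen.h>
#include <vector>

std::vector<c10::Storage> extractStorages(const at::Tensor& tensor) {
  std::vector<c10::Storage> storages;
  // Storage is reference-counted, so copying the handle keeps the
  // underlying allocation alive for as long as the cache does.
  storages.push_back(tensor.storage());
  return storages;
}
```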
ghstack-source-id: 132177396
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D29303570
fbshipit-source-id: d814754806fa58b24e45269e97d768485ef972ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59205
Reland of https://github.com/pytorch/pytorch/pull/58422
Similar to Future (which I tackled recently), Message is an ivalue type (a "custom class" one), and the natural way to represent it is inside an intrusive_ptr. However in the RPC code we had a mix of usages, often passing Message by value. This has undesirable consequences, as it could easily trigger a copy by accident, which I believe is why in many places we accepted _rvalue references_ to Message, in order to force the caller to move. In my experience this is non-idiomatic in C++ (normally a function signature specifies how the function consumes its arguments, and it's up to the caller to then decide whether to copy or move).
By moving to intrusive_ptr everywhere, I think we eliminate or at least simplify many of the problems above.
In this PR I do half of the migration, by updating everything except the `toMessageImpl` methods, which will come in the next PR.
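For illustration, a hedged sketch of the calling-convention change (`FakeMessage` is a stand-in for the real Message type): passing an `intrusive_ptr` by value documents shared ownership and lets the caller choose between copying and moving, instead of being forced to move into a `Message&&` parameter.
```cpp
#include <c10/util/intrusive_ptr.h>
#include <utility>

struct FakeMessage : c10::intrusive_ptr_target {
  int id = 0;
};

// Before (roughly): void send(FakeMessage&& message);
void send(c10::intrusive_ptr<FakeMessage> message) {
  // The function owns a reference for the duration of the call.
  (void)message;
}

void example() {
  auto m = c10::make_intrusive<FakeMessage>();
  send(m);             // cheap refcount bump, m still usable
  send(std::move(m));  // or transfer ownership explicitly
}
```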
ghstack-source-id: 130202849
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28623891
fbshipit-source-id: c9aeea3440679a11741ca78c06b03c57cb815a5e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58422
Similar to Future (which I tackled recently), Message is an ivalue type (a "custom class" one), and the natural way to represent it is inside an intrusive_ptr. However in the RPC code we had a mix of usages, often passing Message by value. This has undesirable consequences, as it could easily trigger a copy by accident, which I believe is why in many places we accepted _rvalue references_ to Message, in order to force the caller to move. In my experience this is non-idiomatic in C++ (normally a function signature specifies how the function consumes its arguments, and it's up to the caller to then decide whether to copy or move).
By moving to intrusive_ptr everywhere, I think we eliminate or at least simplify many of the problems above.
In this PR I do half of the migration, by updating everything except the `toMessageImpl` methods, which will come in the next PR.
ghstack-source-id: 129567053
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28474878
fbshipit-source-id: 5b76d45e05f6fa58c831e369c5c964d126187a6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57636
The "preferred" pointer holder for Future is `intrusive_ptr` (e.g., `then` returns an `intrusive_ptr`, `toFuture` returns `intrusive_ptr`, ...). However in RPC we often wrap it with `shared_ptr`. This probably dates back to when we had a separate Future type, before the merge.
At the boundary between RPC and JIT this difference becomes a bit annoying, as conversions between the pointer types are needed. I think it would be simpler and more consistent to always use `intrusive_ptr`, also in RPC.
This PR was produced mainly by find-and-replace, plus a couple of manual fixes.
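For context, a minimal sketch of creating an ivalue::Future held by `intrusive_ptr` (API as I understand it; exact constructor arguments may differ across versions):
```cpp
#include <ATen/core/ivalue.h>
#include <ATen/core/jit_type.h>

c10::intrusive_ptr<c10::ivalue::Future> makeFuture() {
  // The future is created and passed around as an intrusive_ptr, matching
  // what then() and the JIT interfaces already hand back, so no
  // shared_ptr<->intrusive_ptr conversion is needed at the RPC/JIT boundary.
  return c10::make_intrusive<c10::ivalue::Future>(c10::AnyType::get());
}
```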
ghstack-source-id: 128296581
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D28187972
fbshipit-source-id: d4609273a1550b4921910e85d2198e02f31c905b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635
Note: this PR looks massive, but it's just one simple change, codemodded many times.
In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy because the callback was invoked with the parent future as argument, and could thus inspect it. In C++ the callbacks didn't take any arguments, thus in many cases we worked around this by capturing the future in its own callback. This is risky (leads to reference cycle and thus memory leak) and must be done carefully (spoiler: sometimes we weren't).
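A hedged sketch of the resulting callback shape (API as I understand it; exact signatures may differ across versions): the callback receives the completed future as an argument, so it no longer needs to capture the future itself and risk a reference cycle.
```cpp
#include <ATen/core/ivalue.h>
#include <iostream>

void attachCallback(const c10::intrusive_ptr<c10::ivalue::Future>& fut) {
  fut->addCallback([](c10::ivalue::Future& completedFut) {
    // Inspect the parent future via the argument, not via a self-capture.
    if (completedFut.hasError()) {
      std::cerr << "future failed\n";
    } else {
      c10::IValue value = completedFut.value();
      (void)value;
    }
  });
}
```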
ghstack-source-id: 128296580
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D28178783
fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57432
In a bunch of places we were creating a future and then "forwarding" the value of another future to it once that other future completed. (This was in order to convert the type of the value, or to "merge" multiple futures into one). However when doing so we often created a child future with an empty set of devices, which meant it didn't support CUDA, and thus would cause a silent synchronization/correctness bug if the parent future did actually contain CUDA tensors.
One way this could have been caught earlier would have been to have Future always extract the DataPtrs, even in CPU-only mode, in order to ensure they always reside on the expected set of devices. Unfortunately this might have some adverse perf effects and thus should be done carefully.
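A hedged sketch of the fix described above (accessor names as I understand them; they may differ by version): when creating a follow-up future, propagate the parent's device set so the child remains CUDA-aware.
```cpp
#include <ATen/core/ivalue.h>
#include <ATen/core/jit_type.h>

c10::intrusive_ptr<c10::ivalue::Future> makeChildFuture(
    const c10::intrusive_ptr<c10::ivalue::Future>& parent) {
  // Passing the parent's devices instead of an empty device list is what
  // keeps CUDA synchronization working for the child future as well.
  return c10::make_intrusive<c10::ivalue::Future>(
      c10::AnyType::get(), parent->devices());
}
```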
ghstack-source-id: 128184667
Test Plan: eyes
Reviewed By: mrshenli
Differential Revision: D28143045
fbshipit-source-id: 9af1abf270366dc1df0d4857d6a8cc73668af9d1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57433
In a bunch of cases we need to "forward" between one future and another, typically because we need to convert the type of the data (e.g., from Message to PyObject). In most of these cases the DataPtrs of the value don't change, and yet the new future must re-extract them from scratch. By allowing the user to obtain the vector of extracted DataPtrs from the old future, we can allow them to "shortcut" this step.
Also, this change is a requirement for the next PR to work, since the next PR would otherwise cause us to attempt extracting DataPtrs from Message instances, which doesn't work (because Message is a custom class), but thanks to this PR we actually skip that.
ghstack-source-id: 128184663
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28118298
fbshipit-source-id: 70e333ea6a4f8d4d9a86514c350028d412469ee1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57030
PR #57029 is not perfect; there are still obscure situations in which
we might allocate a shared_ptr to an RpcAgent that doesn't have a no-GIL
constructor, so this PR adds the other half of the equation:
assert that we don't hold the GIL when running a blocking destructor.
This makes it possible to detect potential deadlocks even if the
code doesn't deadlock in practice (because you got lucky and none
of the threads you blocked on tried to also take out the GIL).
I considered whether or not to make this DEBUG_ONLY. For now it's
not, so I can get better CI coverage, and because this test only
happens in destructors of objects that die rarely.
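For illustration, a hedged, generic sketch of the idea (not the exact PyTorch code): a blocking destructor asserts that the GIL is not held, so a potential deadlock is surfaced even when the threads being joined happen not to need the GIL.
```cpp
#include <Python.h>
#include <cassert>

struct BlockingHandle {
  ~BlockingHandle() {
    // If the GIL were held here, joining worker threads that also need the
    // GIL could deadlock; fail loudly instead of getting lucky silently.
    assert(!PyGILState_Check());
    // ... join threads / wait for pending work ...
  }
};
```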
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D28030582
Pulled By: ezyang
fbshipit-source-id: a7d7f6545223c4823c7f6036dfe29bd2edaf60a5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57294
With the advent of CPUs in the device maps, and in order to be more generic (e.g., to support AMD GPUs) and to avoid conversions when passing to Future, RRef, and the like, it's easier to use Devices instead of DeviceIndices. This started as just migrating the TensorPipe agent, but the RPC layer is quite intertwined, so I had to migrate a lot of stuff.
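For illustration of the distinction (standard c10 types, simplified): a DeviceIndex is just an ordinal, while a Device also carries the device type, so CPUs and other backends can be represented uniformly.
```cpp
#include <c10/core/Device.h>

void example() {
  c10::DeviceIndex idx = 0;                       // "device 0" of... which backend?
  c10::Device cuda0(c10::DeviceType::CUDA, idx);  // unambiguous: cuda:0
  c10::Device cpu(c10::DeviceType::CPU);          // a CPU is representable at all
}
```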
ghstack-source-id: 127916562
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28092733
fbshipit-source-id: 024dcb3648c5898ab13e770413c43958f04f1a8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57288
If the device map provided by RemoteModule is not empty, then the TensorPipe RPC backend can support directly sending GPU tensors over the wire.
Also add a pybind of `_get_device_map`.
The changes to the unit test setup are separated out into a follow-up PR, as they currently break some tests in `distributed/rpc/test_faulty_agent.py`.
We still need to fix test_load_di_parts in `torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test`. Currently an early return is used to bypass this test failure.
#Original PR issue: https://github.com/pytorch/pytorch/issues/51670
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device_script
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule -j 1
CAUTION: This one actually fails and now it is bypassed. See FIXME in `_remote_forward`.
buck test mode/dev-nosan caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- test_load_di_parts
Reviewed By: wanchaol
Differential Revision: D28021672
fbshipit-source-id: a89245dc35e1d9479811ec6f98d9f34116837d79
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57052
This PR caps a stack whose goal was to merge CUDAFuture into ivalue::Future. CUDAFuture used to be a subclass of ivalue::Future, which was already pretty good, but it meant that in several places we needed `#ifdef`s or registries in order to create the right type of class, which was annoying. We've made CUDAFuture device-agnostic, by using generic helpers, so that it doesn't depend on CUDA. Now all its code can be inserted into ivalue::Future.
This PR does this very naively, by copy-pasting CUDAFuture's code into the (previously empty) virtual methods of ivalue::Future. This helps ensure the correctness of this PR, as it's straightforward to see that it behaves exactly like before. However, we probably want to polish it a bit later to iron out some wrinkles.
ghstack-source-id: 127713138
(Note: this ignores all push blocking failures!)
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28036829
fbshipit-source-id: 3e5b16402f5dc245c1fcb9d7bf06db64dcb0d2a3
Summary:
In my last PR I missed the CUDA and distributed folders; this fixes that.
This change is autogenerated by `python tools/clang_tidy.py -s`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235
Reviewed By: janeyx99
Differential Revision: D28084444
Pulled By: malfet
fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57085
PR #54932 fixed the CUDA RPC for RRef when RRef is created through
RPC. But besides that use case, RRef can also be created locally
by directly passing in a value, which would bypass the CUDA stream
synchronization in #54932.
This commit covers the above gap by adding a `devices` argument
to the RRef constructor. The RRef will then use this argument to
choose between `CUDAFuture` and `ivalue::Future` to hold the value.
When `devices` is specified and non-empty, `CUDAFuture` will be
used, and the `devices` will be passed to that `CUDAFuture`.
Test Plan: Imported from OSS
Reviewed By: lw
Differential Revision: D28050001
Pulled By: mrshenli
fbshipit-source-id: 2316b419fa69aa4dcd444050f0b74e61c3d0af1e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44859
TensorPipe's `set_device_map` option was applied during the forward
pass. However, if we ran the backward pass for the graph we would not
automatically pick up the reverse device mapping.
As a result, users had to specify both forward and backward device mapping
which is very tedious to do.
In this PR, I've added this functionality such that TensorPipe automatically
picks up the reverse device mapping during the backward pass. This is done by
storing the appropriate device mapping in the "recv" autograd function for
distributed autograd.
#Closes: https://github.com/pytorch/pytorch/issues/44170
ghstack-source-id: 119950842
Test Plan:
1) waitforbuildbot
2) Unit test added.
Reviewed By: mrshenli
Differential Revision: D23751975
fbshipit-source-id: 2717d0ef5bde3db029a6172d98aad95734d52140
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49906
This commit modifies RPC Message to inherit from `torch::CustomClassHolder`,
and wraps a Message in an IValue in `RpcAgent::send()`.
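A hedged sketch of the wrapping this enables (`MyPayload` is a stand-in for Message, and the registration namespace/name are made up; the real Message gets the equivalent treatment in the RPC code):
```cpp
#include <ATen/core/ivalue.h>
#include <torch/custom_class.h>

struct MyPayload : torch::CustomClassHolder {
  int64_t data = 0;
};

// Registering the type is what allows an intrusive_ptr<MyPayload> to be
// stored inside an IValue.
static auto registration =
    torch::class_<MyPayload>("rpc_sketch", "MyPayload");

c10::IValue wrapPayload(c10::intrusive_ptr<MyPayload> payload) {
  return c10::IValue(std::move(payload));  // held as a custom-class IValue
}
```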
Test Plan: Imported from OSS
Reviewed By: lw
Differential Revision: D25719518
Pulled By: mrshenli
fbshipit-source-id: 694e40021e49e396da1620a2f81226522341550b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39974
# Problem
When this assertion happens, I don't know
- which worker_id it is on, even with the worker_name "trainer:0".
- which rref is throwing this exception.
```shell
File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in _initialize_trainers
trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items()
File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in <dictcomp>
trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items()
File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/torch/distributed/rpc/internal.py", line 158, in _handle_exception
raise result.exception_type(result.msg)
RuntimeError: RuntimeError('Cannot call localValue() on a non-local reference. Call it on trainer:0')
Traceback (most recent call last):
File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/internal.py", line 148, in _run_function
result = python_udf.func(*python_udf.args, **python_udf.kwargs)
File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/rref_proxy.py", line 5, in _local_invoke
return getattr(rref.local_value(), func_name)(*args, **kwargs)
RuntimeError: Cannot call localValue() on a non-local reference. Call it on trainer:0
```
Changes:
- Add the ability to stringify WorkerInfo.
- Make the localValue() assertion message clearer about this case.
ghstack-source-id: 105840918
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork -- test_local_value_not_on_owner
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit/:rpc_fork
Reviewed By: mrshenli
Differential Revision: D5690653
fbshipit-source-id: ca6a8b1ff6e09f8644303a0f82f9b1a546a11170
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39663
I was investigating a memory corruption issue and thought it may be due to a race condition in (un)setting the current RPC agent. It turns out it wasn't (still investigating...). I had already written this fix, and it is a real fix (there could really be a race condition), so I'm sending it out to see whether there's interest in merging it. I believe its practical usefulness is however very limited, since typically the current RPC agent is only changed twice (at start and at shutdown) and thus there's limited risk for races.
As there may be some confusion on atomicity of shared_ptrs, let me clarify a few things from the get go. Operations on the control blocks of shared_ptrs (i.e., increasing and decreasing the refcounts) are atomic, which means that it is safe to manipulate *two different* shared_ptrs that point to the *same* object from *different* threads. However, the shared_ptr object itself is not atomic, which means that it is *not* safe to manipulate the *same* shared_ptr from two *different* threads. For that reason, the STL provides atomic functions explicitly specialized for shared_ptrs: https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic (in C++ 20, they are being replaced by a specialization of std::atomic<std::shared_ptr<T>>). Note that this has been called "the worst question of all of C++" by Louis Brandy at his CppCon talk: https://youtu.be/lkgszkPnV8g?t=1210
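To illustrate the point (a generic sketch, not the PyTorch code; `int` stands in for the agent type): concurrent access to the *same* shared_ptr object must go through the atomic free functions (or `std::atomic<std::shared_ptr<T>>` in C++20).
```cpp
#include <memory>
#include <utility>

std::shared_ptr<int> currentAgent;  // one global shared_ptr, like the current RPC agent

void setAgent(std::shared_ptr<int> agent) {
  // Safe even if another thread is concurrently reading currentAgent.
  std::atomic_store(&currentAgent, std::move(agent));
}

std::shared_ptr<int> getAgent() {
  // Safe concurrent read; only the control-block refcount is shared.
  return std::atomic_load(&currentAgent);
}
```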
ghstack-source-id: 105475005
Test Plan: Unit tests
Differential Revision: D21932817
fbshipit-source-id: da33fedd98efb820f284583ce7ff1c1c531dea9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38521
In the RPC retry thread, we add retriable futures to a list under the lock, release the lock, add callbacks/set errors on those futures, then re-acquire the lock to clean up the retry map. We can simply clean up the retry map before releasing the lock and not acquire it again - this is cleaner and may result in better perf if it reduces context switching between threads looking to acquire the retryMapLock.
ghstack-source-id: 104062147
Test Plan: CI, there are thorough tests in the RPC framework to test errors with retries.
Differential Revision: D21563085
fbshipit-source-id: 35e620892da630d082c032f5f9ce16e8a9ffdfaa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38414
The `std::to_string` call is unnecessary when using glog.
ghstack-source-id: 104030161
Test Plan: Ran the retry tests and checked the logs to ensure the correct message was printed upon message failure.
Differential Revision: D21266330
fbshipit-source-id: 53519287778d47d99b94ea34b7c551f910affda2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36677
Move the `futures` vector to be a local function var like `errorFutures`. Holding the lock to clear the vector is now unnecessary.
ghstack-source-id: 102265569
Differential Revision: D20884589
fbshipit-source-id: c9a13258bee737d86f9b0d11cdd28263bb923697
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35263
Process Group Agent throws an exception if a send attempt is made after the agent is shut down. With retries, we should catch this exception and mark the original future with an error.
ghstack-source-id: 102153897
Test Plan: Running all rpc/dist_autograd tests.
Differential Revision: D20611412
fbshipit-source-id: a6009f0b0aa8be662364158962a054c5c29090bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36502
We're sometimes deleting futures without completing them (discovered by logging),
and we've recently noticed a slow memory leak.
This change migrates the future lambda cases where there was self-capture.
- In some cases, we use weak_ptr<>, plus .lock()/assert in the lambda callback
(see the sketch after this list). This avoids the reference cycle. We use this
primarily in the case where the value ends up being moved in the callback
(something we want to be careful about)
- We also add a convenience api to Future where the completed Future is returned as an arg.
This allows us to avoid self-capture, though it assumes that the markCompleted()
caller is persisting the future for the markCompleted() duration (this has been the case)
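A hedged sketch of the weak_ptr pattern from the first bullet (an illustrative toy Future, not the real RPC types): capturing the future weakly in its own callback breaks the shared_ptr cycle that would otherwise leak it.
```cpp
#include <cassert>
#include <functional>
#include <memory>

struct ToyFuture {
  void addCallback(std::function<void()> cb) { cb_ = std::move(cb); }
  void markCompleted() { if (cb_) cb_(); }
  std::function<void()> cb_;
};

void attach(const std::shared_ptr<ToyFuture>& fut) {
  std::weak_ptr<ToyFuture> weakFut = fut;  // no strong self-capture
  fut->addCallback([weakFut]() {
    auto strongFut = weakFut.lock();       // re-acquire if still alive
    assert(strongFut);                     // caller keeps it alive through markCompleted()
    // ... move the value out of *strongFut ...
  });
}
```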
ghstack-source-id: 102130672
Test Plan: ctr_mobile_feed, buck test mode/dev-nosan caffe2/test/...
Differential Revision: D20998905
fbshipit-source-id: 7dd52fe4e567a5dea20e8d43862fc2335fd3ce16
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36239
ProcessGroupAgent and ThriftAgent threads were joined at shutdown, but RpcAgent threads were joined by the destructor. This PR joins all threads at shutdown by using a pattern similar to `start` in RPC.
The derived classes implement a `shutdownImpl` method that cleans up backend-specific state. RpcAgent implements `shutdown`, which cleans up generic state and calls the underlying `shutdownImpl`. The atomic `running` flag is now set and unset by RpcAgent so backends do not need to mutate it.
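For illustration, a hedged sketch of the shutdown pattern described above (simplified, not the actual class hierarchy): the base class owns the generic teardown and the `running` flag, and delegates backend-specific cleanup to a virtual `shutdownImpl()`.
```cpp
#include <atomic>

class RpcAgentBase {
 public:
  virtual ~RpcAgentBase() = default;
  void shutdown() {
    running_.store(false);  // generic state handled by the base class
    joinGenericThreads();
    shutdownImpl();         // backend-specific cleanup
  }

 protected:
  virtual void shutdownImpl() = 0;

 private:
  void joinGenericThreads() { /* join retry/callback threads, etc. */ }
  std::atomic<bool> running_{true};
};
```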
ghstack-source-id: 101820415
Test Plan: Ensured this works with `test_duplicate_name` (in which RpcAgent is constructed but PGA is not), and selected `rpc_spawn` and `dist_autograd_spawn` tests with TSAN. Checking Build Bot and CI as well, and continuing to test more with TSAN on devserver (currently running into memory issues).
Reviewed By: jjlilley
Differential Revision: D20902666
fbshipit-source-id: 5dbb5fc92ba66f75614c050bb10b10810770ab12
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36254
These future-use changes were all landed yesterday as part of the future
refactoring, quickly reverted due to an observed OOM, and are now being relanded,
since they have since been tested to be benign.
ghstack-source-id: 101776613
Test Plan:
buck test mode/dev-nosan caffe2/test/...
not ooming: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod
Differential Revision: D20924010
fbshipit-source-id: 28872e488df34c7a886bcd659fa7e9914639d306
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36220
The torch::utils::Future change from yesterday may have introduced a reference cycle,
leading to OOM on PS. This change reverts the lambda capture changes with
torch::utils::Future until we can analyze further.
ghstack-source-id: 101756106
Test Plan: ctr mobile feed: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod-preset
Differential Revision: D20918904
fbshipit-source-id: d637f2370aa72c1765b98f3b9e10eb969a025624
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35849
This change harmonizes some aspects of the API.
- torch::utils::Future callbacks should take no args, like ivalue::Future's.
Many of the lines of this change are related to fixing that up downstream.
Having no args makes the API simpler to use, particularly since many/most of the
downstream use cases ignore the passed-in args. It's simple enough to
capture the future in the lambda where necessary.
- Add error/hasError methods to ivalue::Future.
- Use c10::optional underneath for the error in ivalue::Future.
- Change markCompleted(error) to setError(error) in ivalue::Future.
- Add a setValue(FutureError) version to torch::utils::Future.
ghstack-source-id: 101684435
Test Plan: buck test mode/dev-nosan caffe2/test/...
Differential Revision: D20803251
fbshipit-source-id: e3d925287bd9a80d649843eef5f270163f448269
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36094
The condition variable that the RPC retry thread waits on must be notified after setting the atomic `running` flag to false. This ensures the thread is joinable and allows `rpc.shutdown` to function correctly.
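A hedged, generic illustration of the shutdown ordering (not the exact retry-thread code): flip the flag before notifying, so the waiting thread re-checks its predicate, exits the loop, and becomes joinable.
```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

std::mutex mtx;
std::condition_variable cv;
std::atomic<bool> running{true};

void shutdown() {
  {
    std::lock_guard<std::mutex> lock(mtx);
    running = false;  // flip the flag first...
  }
  cv.notify_all();    // ...then wake the retry thread
}

void retryLoop() {
  std::unique_lock<std::mutex> lock(mtx);
  while (running) {
    cv.wait(lock);    // woken by shutdown() or by newly scheduled retries
    // ... process due retries ...
  }
}
```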
ghstack-source-id: 101538860
Test Plan: build bot
Differential Revision: D20854763
fbshipit-source-id: b92050712a1e6c31d4dd3b3d98f32ef8dee0f2f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35554
We attach a callback to our RPC send attempts that schedules a retry
upon failure. This PR only schedules the retry if the agent is running.
ghstack-source-id: 101332815
Differential Revision: D20612615
fbshipit-source-id: e1bbb3f162101bce7eb46bad512c9e5dc6d531cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32959
In the RPC TorchScript call path we need to pickle/unpickle RRefs, so this diff makes the JIT pickler/unpickler able to pickle/unpickle an RRef. It is similar to what is implemented for PyRRef::pickle() and PyRRef::unpickle().
The pickling/unpickling design assumes it is always coupled with RPC calls. It is not meant for checkpointing a model that holds an RRef; before checkpointing such a model, the user should call rref.to_here() to get the value inside the rref.
The pickling process is:
1. push the torch.distributed.rpc.rref global string
2. call rref.fork() to create rrefForkData, which is a few IDs and the type str of the value held inside the rref; the IDs include the rref id, fork id, caller worker id, callee worker id, and owner worker id
3. push the rrefForkData
The unpickling process is:
1. read the torch.distributed.rpc.rref global string, and retrieve the cached global lambda function
2. the global lambda function receives the rrefForkData
3. if the callee is also the owner worker, get the owner rref based on the IDs inside the rrefForkData and return the OwnerRRef
4. if the callee is not the owner worker, create a user rref using the rrefForkData and return the UserRRef
5. meanwhile the owner rref will be notified and do reference counting correctly
During unpickling, a type_resolver is needed to parse the type str. This type_resolver has a Python dependency, so we get it from the rpc_agent and pass it to the unpickler during construction. So this diff adds a type_resolver argument to the JIT unpickler constructor.
ghstack-source-id: 98814793
Test Plan: unit test
Differential Revision: D19713293
fbshipit-source-id: 4fd776cdd4ce8f457c4034d79acdfb4cd095c52e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33365
This adds functionality for retrying RPCs that are sent with the function sendWithRetries(). RPCs that will potentially need to be retried are added to a sorted map that contains the timeout at which to retry the RPC and associated metadata. A separate thread iteratively removes the earliest retriable RPC from the map, sleeps until the corresponding time point, retries the RPC, and adds it to the map again with a future timeout.
GitHub Issue: https://github.com/pytorch/pytorch/issues/32124
Per the first 4 milestones, the following will be addressed in future PRs:
* enabling RPC Retries for RRef internal messages
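A hedged sketch of the retry bookkeeping described above (illustrative types and names, not the actual agent code): a map ordered by retry time lets a single thread always pop the earliest due entry and sleep until its deadline.
```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

using Clock = std::chrono::steady_clock;

struct RetryInfo {
  std::string messageId;  // hypothetical metadata for the pending RPC
  int retriesLeft = 0;
};

// Ordered by the time at which the RPC should be retried; a multimap allows
// several RPCs to share the same deadline.
std::multimap<Clock::time_point, RetryInfo> retryMap;

void scheduleRetry(RetryInfo info, std::chrono::milliseconds backoff) {
  retryMap.emplace(Clock::now() + backoff, std::move(info));
}
```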
Differential Revision: D19915694
fbshipit-source-id: 4a520e32d5084ebcf90e97fd9f26867115a35c0c