63 Commits

Author SHA1 Message Date
e1e8491b31 [1/N] Change C-style casts to static_cast or reinterpret_cast (#165750)
This series of changes converts C-style casts into their C++ alternatives.
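A minimal sketch of the kind of rewrite this series applies (illustrative only, not lines from the actual diff):

```cpp
int main() {
  double value = 3.14;
  // Before: a C-style cast hides whether this is a value conversion or a reinterpretation.
  // int a = (int)value;
  // char* p = (char*)&value;
  // After: the intent is spelled out.
  int a = static_cast<int>(value);            // numeric conversion
  char* p = reinterpret_cast<char*>(&value);  // bit-level reinterpretation
  (void)a;
  (void)p;
  return 0;
}
```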

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165750
Approved by: https://github.com/Skylion007
2025-10-20 04:36:19 +00:00
cyy
f048569c24 [Distributed] [11/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136439)
Follows #131671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136439
Approved by: https://github.com/kwen2501
2024-09-24 13:05:15 +00:00
cyy
95dbbf713e [Distributed] [9/N] Fix clang-tidy warnings in torch/csrc/distributed/rpc (#130109)
Follows #125102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130109
Approved by: https://github.com/ezyang
2024-07-16 04:23:42 +00:00
cyy
a81d083b1c [Reland] Add -Wdeprecated and related fixes (#110019)
This is a reland of PRs https://github.com/pytorch/pytorch/pull/108626 and #109564. We fixed the iOS build failure by changing
```
((CHECK) ? (EXPR) : ([] { assert(!#CHECK); }(), (EXPR)))
```
to
```
((CHECK) ? (EXPR) : ([] { assert(false); }(), (EXPR)))
```
in TR2_OPTIONAL_ASSERTED_EXPRESSION, since the former syntax was invalid on Apple Clang. In any case, we can apply this simple fix in the hope that c10::optional will soon be replaced by std::optional.
We also enabled -Wdeprecated on c10.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110019
Approved by: https://github.com/clee2000
2023-09-28 03:34:29 +00:00
cdb51d2ad0 Revert "[2/N] Add -Wdeprecated and related fixes (#109564)"
This reverts commit 5b50641bac49e00ad05060f0b9fe3dcc5d73bc9b.

Reverted https://github.com/pytorch/pytorch/pull/109564 on behalf of https://github.com/atalman due to Need to revert as followup revert of first PR 108626 ([comment](https://github.com/pytorch/pytorch/pull/109564#issuecomment-1728137207))
2023-09-20 17:15:57 +00:00
cyy
5b50641bac [2/N] Add -Wdeprecated and related fixes (#109564)
This PR follows #108626.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109564
Approved by: https://github.com/ezyang
2023-09-20 07:03:25 +00:00
2973994259 fix typo in comments under torch/csrc/distributed (#96062)
This PR fixes typos in comments and messages of `.cpp` and `.hpp` files under `torch/csrc/distributed` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96062
Approved by: https://github.com/ngimel
2023-03-07 02:56:41 +00:00
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
cf3ce329b5 [PyTorch] Avoid initializing storage for empty Optionals
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78947

We never need to initialize the storage in the non-constexpr case, and after C++20 we don't need to in the constexpr case either.
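A minimal sketch of the storage trick referred to here (illustrative, not the real c10::optional code): the value lives in a union, so an empty optional can leave it uninitialized instead of constructing a dummy value.

```cpp
#include <utility>

template <typename T>
struct OptionalStorage {
  union {
    char dummy_;  // active member when empty; never needs runtime initialization
    T value_;
  };
  bool has_value_ = false;

  OptionalStorage() {}  // empty state: no union member is constructed
  explicit OptionalStorage(T v) : value_(std::move(v)), has_value_(true) {}
  ~OptionalStorage() {
    if (has_value_) {
      value_.~T();  // destroy only if a value was actually constructed
    }
  }
};
```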

Differential Revision: [D36519379](https://our.internmc.facebook.com/intern/diff/D36519379/)

Approved by: https://github.com/ezyang, https://github.com/malfet
2022-06-08 03:56:24 +00:00
ebba4219ae torch/distributed: move WorkerInfo registration into libtorch instead of libtorch_python (#78028)
Summary:
This moves torch::class_<WorkerInfo> into `rpc_agent.cpp` so it gets registered in libtorch instead of libtorch_python. This is intermediate work toward getting torch::deploy to load an unmodified copy of libtorch. Current RPC is incompatible due to duplicate registrations.

```
unknown file: Failure
C++ exception with description "Exception Caught inside torch::deploy embedded library:
Custom class with name __torch__.torch.classes.dist_rpc.WorkerInfo is already registered. Ensure that registration with torch::class_ is only called once.
Exception raised from registerCustomClass at ../aten/src/ATen/core/custom_class.cpp:61 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f3bd9adb92e in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5c (0x7f3bd9ab7068 in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: torch::registerCustomClass(std::shared_ptr<c10::ClassType>) + 0x110 (0x7f3bc2258980 in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: torch::detail::class_base::class_base(std::string const&, std::string const&, std::string, std::type_info const&, std::type_info const&) + 0x3b9 (0x7f3bc225a419 in /home/tristanr/venvs/multipy/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #4: [0x7f3ba45cfea1]
frame #5: <unknown function> + 0x1b5334 (0x5652bdab9334 in ./test_deploy)
frame #6: <unknown function> + 0x1b4f3e (0x5652bdab8f3e in ./test_deploy)
frame #7: <unknown function> + 0x1b519b (0x5652bdab919b in ./test_deploy)
frame #8: loadSearchFile(char const*) + 0x23e (0x7f3ba62f37f8 in /tmp/torch_deploy9ATEFg)
frame #9: deploy_set_self + 0x51 (0x7f3ba62f38f9 in /tmp/torch_deploy9ATEFg)
frame #10: torch::deploy::Interpreter::Interpreter(torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>) + 0x274 (0x5652bdaaa790 in ./test_deploy)
frame #11: void __gnu_cxx::new_allocator<torch::deploy::Interpreter>::construct<torch::deploy::Interpreter, torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(torch::deploy::Interpreter*, torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0x81 (0x5652bdaaf58b in ./test_deploy)
frame #12: void std::allocator_traits<std::allocator<torch::deploy::Interpreter> >::construct<torch::deploy::Interpreter, torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(std::allocator<torch::deploy::Interpreter>&, torch::deploy::Interpreter*, torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0x4a (0x5652bdaae320 in ./test_deploy)
frame #13: void std::vector<torch::deploy::Interpreter, std::allocator<torch::deploy::Interpreter> >::_M_realloc_insert<torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(__gnu_cxx::__normal_iterator<torch::deploy::Interpreter*, std::vector<torch::deploy::Interpreter, std::allocator<torch::deploy::Interpreter> > >, torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0xee (0x5652bdaae4a0 in ./test_deploy)
frame #14: void std::vector<torch::deploy::Interpreter, std::allocator<torch::deploy::Interpreter> >::emplace_back<torch::deploy::InterpreterManager*, std::shared_ptr<torch::deploy::Environment>&>(torch::deploy::InterpreterManager*&&, std::shared_ptr<torch::deploy::Environment>&) + 0xb6 (0x5652bdaad258 in ./test_deploy)
frame #15: torch::deploy::InterpreterManager::InterpreterManager(unsigned long, std::shared_ptr<torch::deploy::Environment>) + 0x123 (0x5652bdaa83b1 in ./test_deploy)
frame #16: TorchpyTest_InitTwice_Test::TestBody() + 0x65 (0x5652bda075a9 in ./test_deploy)
frame #17: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x65 (0x5652bda944b7 in ./test_deploy)
frame #18: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x5a (0x5652bda8cfe7 in ./test_deploy)
frame #19: testing::Test::Run() + 0x100 (0x5652bda68622 in ./test_deploy)
frame #20: testing::TestInfo::Run() + 0x10f (0x5652bda68fb3 in ./test_deploy)
frame #21: testing::TestSuite::Run() + 0x121 (0x5652bda6980d in ./test_deploy)
frame #22: testing::internal::UnitTestImpl::RunAllTests() + 0x38e (0x5652bda756e6 in ./test_deploy)
frame #23: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 0x65 (0x5652bda9586b in ./test_deploy)
frame #24: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 0x5a (0x5652bda8e0f7 in ./test_deploy)
frame #25: testing::UnitTest::Run() + 0xc9 (0x5652bda73fd1 in ./test_deploy)
frame #26: RUN_ALL_TESTS() + 0x11 (0x5652bda169fa in ./test_deploy)
frame #27: main + 0x27 (0x5652bda10ce2 in ./test_deploy)
frame #28: <unknown function> + 0x2d310 (0x7f3bc0431310 in /usr/lib/libc.so.6)
frame #29: __libc_start_main + 0x81 (0x7f3bc04313c1 in /usr/lib/libc.so.6)
frame #30: _start + 0x25 (0x5652bda063b5 in ./test_deploy)
```

Test Plan: CI

Differential Revision: D36564258

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78028
Approved by: https://github.com/rohan-varma
2022-05-25 17:46:39 +00:00
22afe82ce3 [rpc] Switch RPC agent check to TORCH_CHECK and add more descriptive error (#67882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67882

I ran into a hard-to-interpret error message when trying to run the following script, which was missing an `init_rpc` call:

```
# $ torchrun --standalone --nnodes=1 --nproc_per_node=1 script.py
import os
rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])

import torch.distributed
# !!!!!! Uncomment the following and the script succeeds
# torch.distributed.rpc.init_rpc('worker', rank=rank, world_size=world_size)

import torch.distributed as dist
dist.init_process_group(backend='gloo')

import torchvision.models as models
import torch

rn50 = models.resnet50()
rn50.train()
rn50 = torch.nn.parallel.DistributedDataParallel(rn50)

from torch.distributed.rpc import RRef
from torch.distributed.optim import DistributedOptimizer

params = []
for param in rn50.parameters():
    params.append(RRef(param))

dist_optim = DistributedOptimizer(
        torch.optim.SGD,
        params,
        lr=0.05)

loss_func = torch.nn.CrossEntropyLoss()

with torch.distributed.autograd.context() as context_id:
    pred = rn50(torch.randn(50, 3, 224, 224))
    target = torch.randn(50, 1000).softmax(dim=1)
    loss = loss_func(pred, target)
    dist.autograd.backward(context_id, [loss])
    dist_optim.step(context_id)
```

Error:

```
Traceback (most recent call last):
  File "/xxx/torchrun_exp/script.py", line 23, in <module>
    params.append(RRef(param))
RuntimeError: agentINTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rpc_agent.cpp":237, please report a bug to PyTorch. Current RPC agent is not set!
```

Since this is a user-facing error, I've changed `TORCH_INTERNAL_ASSERT` to `TORCH_CHECK` and added a hint about how to resolve the issue. On the other hand, the fact that this was originally `TORCH_INTERNAL_ASSERT` may suggest that the author thought that this should be an internal-only error condition. If there is some other place that should be throwing an exception in this case that is failing, let me know and I can adapt the fix to change that location.
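A hedged sketch of the swap described above; `currentRpcAgent` is an illustrative stand-in, and the exact condition and message in rpc_agent.cpp may differ:

```cpp
#include <c10/util/Exception.h>

void checkAgentSet(bool currentRpcAgent) {
  // Before: an internal assert, so failures ask the user to report a PyTorch bug.
  // TORCH_INTERNAL_ASSERT(currentRpcAgent, "Current RPC agent is not set!");
  // After: a user-facing check that hints at the missing init_rpc() call.
  TORCH_CHECK(
      currentRpcAgent,
      "Current RPC agent is not set! Did you forget to call "
      "torch.distributed.rpc.init_rpc()?");
}
```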

Question for reviewers:
* Is there a good test file where I can add a test for this error condition?

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D32190947

Pulled By: jamesr66a

fbshipit-source-id: 3621d755329fd524db68675c55b1daf20e716d43
2021-11-05 17:31:11 -07:00
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
As the GoogleTest `TEST` macro is non-compliant with it, as is `DEFINE_DISPATCH`

All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
6ecc1a4c4f Make pytorch clang-tidy clean (#60649)
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.

I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop

# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
  -j \
  -s \
  -k \
  -v \
  --paths torch/csrc/ \
  -g"-torch/csrc/jit/passes/onnx/helper.cpp" \
  -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
  -g"-torch/csrc/jit/serialization/onnx.cpp" \
  -g"-torch/csrc/jit/serialization/export.cpp" \
  -g"-torch/csrc/jit/serialization/import.cpp" \
  -g"-torch/csrc/jit/serialization/import_legacy.cpp" \
  -g"-torch/csrc/onnx/init.cpp" \
  -g"-torch/csrc/cuda/nccl.*" \
  -g"-torch/csrc/cuda/python_nccl.cpp" \
  -g"-torch/csrc/autograd/FunctionsManual.cpp" \
  -g"-torch/csrc/generic/*.cpp" \
  -g"-torch/csrc/jit/codegen/cuda/runtime/*" \
  -g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
  -g"-torch/csrc/deploy/interpreter/interpreter.h" \
  -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
  -g"-torch/csrc/deploy/interpreter/test_main.cpp"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649

Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.

Reviewed By: walterddr, janeyx99

Differential Revision: D29504258

Pulled By: 1ntEgr8

fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
2021-07-01 12:21:07 -07:00
2b72068a68 Make Future store Storages instead of references to DataPtrs (#60470)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60470

A Future needs to know which DataPtrs are used by its value, but it isn't always able to extract them (and even when it is, that's expensive), so they're cached. DataPtrs are kind of like unique_ptrs (movable only, cannot be copied), hence the Future can only hold _references_ to them. The Future's value, however, is unfortunately mutable (we wish that weren't the case, but we don't think we can prevent it), which means the tensor/storage that owned a DataPtr might be deleted and thus the DataPtr could be freed. Our cached reference then becomes stale, which leads to all kinds of disasters, like reading garbage data or segfaulting.

Luckily all the DataPtrs we were dealing with were held inside Storages, which have shared_ptr semantics, thus allowing us to hold a strong pointer to them which ensures they're kept alive.
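A rough sketch of the ownership change described above (field names are illustrative, not the actual ivalue::Future members):

```cpp
#include <vector>
#include <c10/core/Storage.h>

struct CachedValueBuffers {
  // Before (unsafe): references to DataPtrs could dangle once the value was mutated.
  // std::vector<std::reference_wrapper<const c10::DataPtr>> dataPtrs_;
  // After: Storages have shared-ownership semantics, so caching them keeps the
  // underlying allocations alive for as long as the Future needs them.
  std::vector<c10::Storage> storages_;
};
```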

ghstack-source-id: 132177396

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D29303570

fbshipit-source-id: d814754806fa58b24e45269e97d768485ef972ba
2021-06-24 03:56:04 -07:00
5ec169b4c3 [reland] Always use intrusive_ptr for Message (1 out of 2) (#59205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59205

Reland of https://github.com/pytorch/pytorch/pull/58422

Similar to Future (which I tackled recently), Message is an ivalue type (a "custom class" one), and the natural way to represent it is inside an intrusive_ptr. However in the RPC code we had a mix of usages, often passing Message by value. This has undesirable consequences, as it could easily trigger a copy by accident, which I believe is why in many places we accepted _rvalue references_ to Message, in order to force the caller to move. In my experience this is non-idiomatic in C++ (normally a function signature specifies how the function consumes its arguments, and it's up to the caller to then decide whether to copy or move).

By moving to intrusive_ptr everywhere I think we eliminate and simplify many of the problems above.
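A toy illustration of the ownership model described above (`Payload` is a stand-in, not the real Message class):

```cpp
#include <utility>
#include <c10/util/intrusive_ptr.h>

struct Payload : c10::intrusive_ptr_target {
  explicit Payload(int id) : id(id) {}
  int id;
};

// Taking an intrusive_ptr by value states plainly that the function shares
// ownership; callers copy (refcount bump) or move, but never deep-copy by accident.
void sendPayload(c10::intrusive_ptr<Payload> message) {
  (void)message;
}

int main() {
  auto msg = c10::make_intrusive<Payload>(42);
  sendPayload(msg);             // cheap refcount bump
  sendPayload(std::move(msg));  // or transfer ownership outright
  return 0;
}
```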

In this PR I do half of the migration, by updating everything except the `toMessageImpl` methods, which will come in the next PR.
ghstack-source-id: 130202849

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623891

fbshipit-source-id: c9aeea3440679a11741ca78c06b03c57cb815a5e
2021-06-02 05:44:49 -07:00
4c961beacb Revert D28474878: Always use intrusive_ptr for Message (1 out of 2)
Test Plan: revert-hammer

Differential Revision:
D28474878 (4d704e607d)

Original commit changeset: 5b76d45e05f6

fbshipit-source-id: 677c5bc7f02dca23213f778eb0e626a2f6600f3b
2021-05-21 19:24:22 -07:00
4d704e607d Always use intrusive_ptr for Message (1 out of 2) (#58422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58422

Similar to Future (which I tackled recently), Message is an ivalue type (a "custom class" one), and the natural way to represent it is inside an intrusive_ptr. However in the RPC code we had a mix of usages, often passing Message by value. This has undesirable consequences, as it could easily trigger a copy by accident, which I believe is why in many places we accepted _rvalue references_ to Message, in order to force the caller to move. In my experience this is non-idiomatic in C++ (normally a function signature specifies how the function consumes its arguments, and it's up to the caller to then decide whether to copy or move).

By moving to intrusive_ptr everywhere I think we eliminate and simplify many of the problems above.

In this PR I do half of the migration, by updating everything except the `toMessageImpl` methods, which will come in the next PR.
ghstack-source-id: 129567053

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474878

fbshipit-source-id: 5b76d45e05f6fa58c831e369c5c964d126187a6c
2021-05-21 13:15:24 -07:00
45012da298 Migrate from shared_ptr to intrusive_ptr for Future (#57636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57636

The "preferred" pointer holder for Future is `intrusive_ptr` (e.g., `then` returns an `intrusive_ptr`, `toFuture` returns `intrusive_ptr`, ...). However in RPC we often wrap it with `shared_ptr`. This probably dates back to when we had a separate Future type, before the merge.

At the boundary between RPC and JIT this difference becomes a bit annoying, as conversions between the pointer types are needed. I think it would be simpler and more consistent to always use `intrusive_ptr`, also in RPC.

This PR was produced mainly by find-and-replace, plus a couple of manual fixes.
ghstack-source-id: 128296581

Test Plan: CI

Reviewed By: pritamdamania87

Differential Revision: D28187972

fbshipit-source-id: d4609273a1550b4921910e85d2198e02f31c905b
2021-05-07 03:59:20 -07:00
36e47af58b Pass reference to parent future in callbacks (#57635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635

Note: this PR looks massive, but it's just one simple change, codemodded many times.

In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy because the callback was invoked with the parent future as argument, and could thus inspect it. In C++ the callbacks didn't take any arguments, thus in many cases we worked around this by capturing the future in its own callback. This is risky (leads to reference cycle and thus memory leak) and must be done carefully (spoiler: sometimes we weren't).
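A toy sketch of the callback-shape change described above (`ToyFuture` is a stand-in for ivalue::Future):

```cpp
#include <functional>
#include <memory>

struct ToyFuture {
  int value = 0;
  std::function<void()> oldCallback;            // pre-change: no arguments
  std::function<void(ToyFuture&)> newCallback;  // post-change: receives the parent future
};

int main() {
  auto fut = std::make_shared<ToyFuture>();
  // Before: to read the value, the callback had to capture the future it was
  // attached to, creating a reference cycle (the future owns its own callback).
  fut->oldCallback = [fut]() { (void)fut->value; };
  // After: the completed parent future is passed in, so no self-capture is needed.
  fut->newCallback = [](ToyFuture& parent) { (void)parent.value; };
  return 0;
}
```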
ghstack-source-id: 128296580

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D28178783

fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
2021-05-07 03:59:18 -07:00
69de4940f3 Ensure devices are preserved when forwarding between futures (#57432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57432

In a bunch of places we were creating a future and then "forwarding" the value of another future to it once that other future completed. (This was in order to convert the type of the value, or to "merge" multiple futures into one). However when doing so we often created a child future with an empty set of devices, which meant it didn't support CUDA, and thus would cause a silent synchronization/correctness bug if the parent future did actually contain CUDA tensors.
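A minimal sketch of the fix's intent (types simplified to a toy struct): a future derived from a parent should be created with the parent's device set, not an empty one.

```cpp
#include <vector>
#include <c10/core/Device.h>

struct ToyFuture {
  std::vector<c10::Device> devices;
};

ToyFuture makeChildFuture(const ToyFuture& parent) {
  // Before: ToyFuture child{};  // empty device set -> no CUDA synchronization
  // After: propagate the devices so CUDA values keep being synchronized correctly.
  return ToyFuture{parent.devices};
}
```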

One way this could have been caught earlier would have been to have Future always extract the DataPtrs, even in CPU-only mode, in order to ensure they always reside on the expected set of devices. Unfortunately this might have some adverse perf effects, and thus should be done carefully.
ghstack-source-id: 128184667

Test Plan: eyes

Reviewed By: mrshenli

Differential Revision: D28143045

fbshipit-source-id: 9af1abf270366dc1df0d4857d6a8cc73668af9d1
2021-05-06 01:12:51 -07:00
1292602375 Avoid re-extracting DataPtrs when forwarding values between Futures (#57433)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57433

In a bunch of cases we need to "forward" between one future and another, typically because we need to convert the type of the data (e.g., from Message to PyObject). In most of these cases the DataPtrs of the value don't change, and yet the new future must re-extract them from scratch. By allowing the user to obtain the vector of extracted DataPtrs from the old future, we can allow them to "shortcut" this step.

Also, this change is a requirement for the next PR to work, since the next PR would otherwise cause us to attempt extracting DataPtrs from Message instances, which doesn't work (because Message is a custom class), but thanks to this PR we actually skip that.

ghstack-source-id: 128184663

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28118298

fbshipit-source-id: 70e333ea6a4f8d4d9a86514c350028d412469ee1
2021-05-06 01:11:38 -07:00
9ec6883442 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D28216577

fbshipit-source-id: ce31fb98320a31eb947bdd31c68aaafed034df79
2021-05-05 04:41:21 -07:00
3db45bcb91 Compilation error fix for torch/csrc/distributed/rpc/init.cpp (#57500)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57500

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28162887

Pulled By: agolynski

fbshipit-source-id: b6fafd64778fc09a5e832b0a557ae70f06951454
2021-05-03 23:15:02 -07:00
e845158b1a Assert that GIL is not held in blocking destructors (#57030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57030

PR #57029 is not perfect; there are still obscure situations in which
we might allocate a shared_ptr to an RpcAgent that doesn't have a
no-GIL constructor, so this PR adds the other half of the equation:
assert that we don't hold the GIL when running a blocking destructor.
This makes it possible to detect potential deadlocks even if the
code doesn't deadlock in practice (because you got lucky and none
of the threads you blocked on tried to also take out the GIL).

I considered whether or not to make this DEBUG_ONLY.  For now it's
not, so I can get better CI coverage, and because this test only
happens in destructors of objects that die rarely.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28030582

Pulled By: ezyang

fbshipit-source-id: a7d7f6545223c4823c7f6036dfe29bd2edaf60a5
2021-05-02 22:06:02 -07:00
0422e67336 Use Devices instead of DeviceIndexes in TensorPipe agent (#57294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57294

With the advent of CPUs in the device maps, and to be more generic (e.g., to support AMD GPUs), and to avoid conversions when passing to Future and RRef and such, it's easier to use Devices instead of DeviceIndices. This started by just migrating the TensorPipe agent but the RPC layer is quite intertwined so I had to migrate a lot of stuff.
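A small illustration of the type switch described above: a DeviceIndex is just an ordinal, while a Device also carries the device type, so CPUs (and non-CUDA accelerators) can be represented in device maps too.

```cpp
#include <c10/core/Device.h>

int main() {
  c10::DeviceIndex idx = 0;                     // "which GPU" only
  c10::Device cuda0(c10::DeviceType::CUDA, 0);  // device type + index
  c10::Device cpu(c10::DeviceType::CPU);        // representable with Device, not with an index alone
  (void)idx;
  (void)cuda0;
  (void)cpu;
  return 0;
}
```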
ghstack-source-id: 127916562

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28092733

fbshipit-source-id: 024dcb3648c5898ab13e770413c43958f04f1a8a
2021-05-01 16:12:55 -07:00
13dbb77b7a [RPC Framework] Enable RemoteModule to directly send GPU tensors over the wire on TensorPipe RPC backend if a device map is provided (#57288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57288

If the device map provided by RemoteModule is not empty, then the TensorPipe RPC backend can support directly sending GPU tensors over the wire.

Also add pybind of `_get_device_map`.

The changes in unit test setup are separated out into a follow-up PR, as they currently break some tests in `distributed/rpc/test_faulty_agent.py`.

Still need to fix test_load_di_parts in `torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test`. Currently an early return is used to bypass this test failure.

#Original PR issue: https://github.com/pytorch/pytorch/issues/51670

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device_script

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule -j 1

CAUTION: This one actually fails and now it is bypassed. See FIXME in `_remote_forward`.
buck test mode/dev-nosan caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- test_load_di_parts

Reviewed By: wanchaol

Differential Revision: D28021672

fbshipit-source-id: a89245dc35e1d9479811ec6f98d9f34116837d79
2021-04-30 18:04:45 -07:00
c2fbd96735 [RPC Framework] Expose a Python API for device map getter (#57179)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57179

Expose a Python API to get the device map and unblock RemoteModule work.

See: https://github.com/pytorch/pytorch/pull/56854#issuecomment-827762398

Additionally, add a const decorator for the C++ getter.

#Original PR issue: https://github.com/pytorch/pytorch/issues/51670
ghstack-source-id: 127684266

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D28070160

fbshipit-source-id: 624d14552d82b99487f72e16428fa75c7a47f61f
2021-04-29 14:29:10 -07:00
311ad5e3af Merge CUDAFuture into ivalue::Future (#57052)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57052

This PR caps a stack whose goal was to merge CUDAFuture into ivalue::Future. CUDAFuture used to be a subclass of ivalue::Future, which was already pretty good, but it meant that in several places we needed `#ifdef`s or registries in order to create the right type of class, which was annoying. We've made CUDAFuture device-agnostic, by using generic helpers, so that it doesn't depend on CUDA. Now all its code can be inserted into ivalue::Future.

This PR does this very naively, by copy-pasting CUDAFuture's code into the (previously empty) virtual methods of ivalue::Future. This helps ensure the correctness of this PR, as it's straightforward to see it behaves exactly like before. However, we probably want to polish it a bit later to iron out some wrinkles.
ghstack-source-id: 127713138

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28036829

fbshipit-source-id: 3e5b16402f5dc245c1fcb9d7bf06db64dcb0d2a3
2021-04-29 09:31:52 -07:00
eac02f85cf Fix more clang-tidy errors (#57235)
Summary:
In my last PR I missed the CUDA and distributed folders; fixing this now.
This change is autogenerated by `python tool/clang_tidy.py -s`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235

Reviewed By: janeyx99

Differential Revision: D28084444

Pulled By: malfet

fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
2021-04-28 23:29:10 -07:00
1ee54cc7b4 Add devices argument to RRef constructor (#57085)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57085

PR #54932 fixed the CUDA RPC for RRef when RRef is created through
RPC. But besides that use case, RRef can also be created locally
by directly passing in a value, which would bypass the CUDA stream
synchronization in #54932.

This commit covers the above gap by adding a `devices` argument
to the RRef constructor. The RRef will then use this argument to
choose between `CUDAFuture` and `ivalue::Future` to hold the value.
When `devices` is specified and non-empty, `CUDAFuture` will be
used, and the `devices` will be passed to that `CUDAFuture`.

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D28050001

Pulled By: mrshenli

fbshipit-source-id: 2316b419fa69aa4dcd444050f0b74e61c3d0af1e
2021-04-28 19:11:10 -07:00
40eea6d9d1 Support device map for distributed autograd while using TensorPipe. (#44859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44859

TensorPipe's `set_device_map` option was applied during the forward
pass. However, if we ran the backward pass for the graph we would not
automatically pick up the reverse device mapping.

As a result, users had to specify both forward and backward device mappings,
which is very tedious to do.

In this PR, I've added this functionality such that TensorPipe automatically
picks up the reverse device mapping during the backward pass. This is done by
storing the appropriate device mapping in the "recv" autograd function for
distributed autograd.
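A toy sketch of "picking up the reverse device mapping" (plain ints stand in for device indices; per the description above, the real code stores the mapping in the "recv" autograd function):

```cpp
#include <map>

std::map<int, int> reverseDeviceMap(const std::map<int, int>& forward) {
  std::map<int, int> reverse;
  for (const auto& kv : forward) {
    reverse[kv.second] = kv.first;  // a forward src->dst entry becomes dst->src for backward
  }
  return reverse;
}
```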

#Closes: https://github.com/pytorch/pytorch/issues/44170
ghstack-source-id: 119950842

Test Plan:
1) waitforbuildbot
2) Unit test added.

Reviewed By: mrshenli

Differential Revision: D23751975

fbshipit-source-id: 2717d0ef5bde3db029a6172d98aad95734d52140
2021-01-27 13:01:44 -08:00
f9f758e349 Apply clang-format to rpc cpp files (#50236)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50236

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25847892

Pulled By: mrshenli

fbshipit-source-id: b4af1221acfcaba8903c629869943abbf877e04e
2021-01-08 11:47:43 -08:00
d730c7e261 Replace FutureMessage with ivalue::Future in RpcAgent retry logic (#49995)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49995

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25745301

Pulled By: mrshenli

fbshipit-source-id: b5e3a7e0b377496924847d8d70d61de32e2d87f4
2021-01-07 19:50:23 -08:00
84e3237a53 Let RpcAgent::send() return JitFuture (#49906)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49906

This commit modifies RPC Message to inherit from `torch::CustomClassHolder`,
and wraps a Message in an IValue in `RpcAgent::send()`.

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25719518

Pulled By: mrshenli

fbshipit-source-id: 694e40021e49e396da1620a2f81226522341550b
2021-01-07 19:47:14 -08:00
b803b4ce09 [torch.distributed.rpc] Add stringify WorkerInfo, better error message for py_rref (#39974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39974

# Problem

When this assertion happens, I don't know
- which worker_id it is on, even with the worker_name "trainer:0".
- which rref is throwing this exception.

```shell
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in _initialize_trainers
    trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items()
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in <dictcomp>
    trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items()
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/torch/distributed/rpc/internal.py", line 158, in _handle_exception
    raise result.exception_type(result.msg)
RuntimeError: RuntimeError('Cannot call localValue() on a non-local reference. Call it on trainer:0')
Traceback (most recent call last):
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/internal.py", line 148, in _run_function
    result = python_udf.func(*python_udf.args, **python_udf.kwargs)
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/rref_proxy.py", line 5, in _local_invoke
    return getattr(rref.local_value(), func_name)(*args, **kwargs)
RuntimeError: Cannot call localValue() on a non-local reference. Call it on trainer:0
```

Changes,
- Add stringify WorkerInfo
- Make localValue() assertion message clearer about the case.
ghstack-source-id: 105840918

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork -- test_local_value_not_on_owner

buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit/:rpc_fork

Reviewed By: mrshenli

Differential Revision: D5690653

fbshipit-source-id: ca6a8b1ff6e09f8644303a0f82f9b1a546a11170
2020-06-13 12:57:05 -07:00
7d85e77076 Use atomic operations to manipulate current RPC agent (#39663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39663

I was investigating a memory corruption issue and thought it may be due to a race condition in (un)setting the current RPC agent. It turns out it wasn't (still investigating...). I had already written this fix, and it is a real fix (there could really be a race condition), so I'm sending it out to see whether there's interest in merging it. I believe its practical usefulness is however very limited, since typically the current RPC agent is only changed twice (at start and at shutdown) and thus there's limited risk for races.

As there may be some confusion on atomicity of shared_ptrs, let me clarify a few things from the get go. Operations on the control blocks of shared_ptrs (i.e., increasing and decreasing the refcounts) are atomic, which means that it is safe to manipulate *two different* shared_ptrs that point to the *same* object from *different* threads. However, the shared_ptr object itself is not atomic, which means that it is *not* safe to manipulate the *same* shared_ptr from two *different* threads. For that reason, the STL provides atomic functions explicitly specialized for shared_ptrs: https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic (in C++ 20, they are being replaced by a specialization of std::atomic<std::shared_ptr<T>>). Note that this has been called "the worst question of all of C++" by Louis Brandy at his CppCon talk: https://youtu.be/lkgszkPnV8g?t=1210
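A toy illustration of the shared_ptr atomic free functions mentioned above (an `int` stands in for the RpcAgent; the real code guards the current agent pointer):

```cpp
#include <memory>
#include <utility>

std::shared_ptr<int> currentAgent;

void setCurrentAgent(std::shared_ptr<int> agent) {
  // Safe even if another thread is reading currentAgent at the same time.
  std::atomic_store(&currentAgent, std::move(agent));
}

std::shared_ptr<int> getCurrentAgent() {
  // Copying a shared_ptr bumps the refcount atomically anyway; atomic_load
  // additionally makes the read of the pointer object itself race-free.
  return std::atomic_load(&currentAgent);
}
```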
ghstack-source-id: 105475005

Test Plan: Unit tests

Differential Revision: D21932817

fbshipit-source-id: da33fedd98efb820f284583ce7ff1c1c531dea9c
2020-06-09 02:11:15 -07:00
e6993938de Avoid Releasing, Reacquiring lock per iteration in RPC Retry Thread (#38521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38521

In the RPC Retry Thread, we add retriable futures to a list under the lock, release the lock, add callbacks/set errors to those futures, then re-acquire the lock to clean up the retry map. We can simply clean up the retry map before releasing the lock and not acquire it again - this would be cleaner and may result in better perf if it reduces context switching between threads looking to acquire the retryMapLock.
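A toy sketch of the locking change described above (a vector of ints stands in for the map of retriable RPCs):

```cpp
#include <mutex>
#include <vector>

std::mutex retryMapLock;
std::vector<int> retryMap;

void processRetries() {
  std::vector<int> ready;
  {
    std::lock_guard<std::mutex> guard(retryMapLock);
    ready.swap(retryMap);  // take the retriable entries and clean up under one lock
  }                        // lock released once; no need to re-acquire it later
  for (int rpc : ready) {
    (void)rpc;  // attach callbacks / set errors without holding the lock
  }
}
```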
ghstack-source-id: 104062147

Test Plan: CI, there are thorough tests in the RPC framework to test errors with retries.

Differential Revision: D21563085

fbshipit-source-id: 35e620892da630d082c032f5f9ce16e8a9ffdfaa
2020-05-18 10:59:13 -07:00
af597335d4 Remove unnecessary to_string in RPC logging code. (#38414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38414

`std::to_string` call is unnecessary when using glog.
ghstack-source-id: 104030161

Test Plan: Ran the retry tests and checked logs to ensure the correct message was printed upon message failure.

Differential Revision: D21266330

fbshipit-source-id: 53519287778d47d99b94ea34b7c551f910affda2
2020-05-14 10:57:00 -07:00
f5c230b892 Make futures vector a local function var (#36677)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36677

Move the `futures` vector to be a local function var like `errorFutures`. Holding the lock to clear the vector is now unnecessary.
ghstack-source-id: 102265569

Differential Revision: D20884589

fbshipit-source-id: c9a13258bee737d86f9b0d11cdd28263bb923697
2020-04-16 10:09:39 -07:00
87be115fd0 Error Handling in RPC Agent (#35263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35263

Process Group Agent throws an exception if a send attempt is made after the agent is shutdown. With retries, we should catch this exception and mark the original future with an error.
ghstack-source-id: 102153897

Test Plan: Running all rpc/dist_autograd tests.

Differential Revision: D20611412

fbshipit-source-id: a6009f0b0aa8be662364158962a054c5c29090bf
2020-04-15 10:53:31 -07:00
37aab14d14 [future] Avoid some future callback self-captures. (#36502)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36502

We're sometimes deleting futures without completing them (discovered by logging),
and we've recently noticed a slow memory leak.

This change migrates the future lambda cases where there was self-capture.
 - In some cases, we use weak_ptr<>, plus .lock()/assert in the lambda callback.
   This avoids the reference cycle (see the sketch after this list). We use this
   primarily in the case where the value ends up being moved in the callback
   (something we want to be careful about).

 - We also add a convenience api to Future where the completed Future is returned as an arg.
   This allows us to avoid self-capture, though it assumes that the markCompleted()
   caller is persisting the future for the markCompleted() duration (this has been the case)
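A toy sketch of the weak_ptr pattern from the first bullet above (`ToyFuture` is a stand-in; the real code uses the RPC futures):

```cpp
#include <cassert>
#include <functional>
#include <memory>

struct ToyFuture {
  int value = 0;
  std::function<void()> callback;
};

void attachCallback(const std::shared_ptr<ToyFuture>& fut) {
  std::weak_ptr<ToyFuture> weakFut = fut;  // no ownership, so no reference cycle
  fut->callback = [weakFut]() {
    auto strongFut = weakFut.lock();
    assert(strongFut);  // whoever completes the future keeps it alive for the callback
    (void)strongFut->value;
  };
}
```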

ghstack-source-id: 102130672

Test Plan: ctr_mobile_feed, buck test mode/dev-nosan caffe2/test/...

Differential Revision: D20998905

fbshipit-source-id: 7dd52fe4e567a5dea20e8d43862fc2335fd3ce16
2020-04-14 17:52:44 -07:00
264da24c9e Fixing RPC Shutdown and Thread Joining (#36239)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36239

ProcessGroupAgent and ThriftAgent threads were joined at shutdown, but RpcAgent threads were joined by the destructor. This PR joins all threads at shutdown by using a pattern similar to `start` in RPC.

The derived classes implement a `shutdownImpl` method that cleans up backend-specific state. RpcAgent implements `shutdown`, which cleans up generic state and calls the underlying `shutdownImpl`. The atomic `running` is now set and unset by RpcAgent, so backends do not need to mutate it.
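A minimal sketch of the shutdown split described above (illustrative names, not the actual RpcAgent code):

```cpp
#include <atomic>

class Agent {
 public:
  virtual ~Agent() = default;
  void shutdown() {
    running_.store(false);  // generic state handled by the base class
    shutdownImpl();         // backend-specific cleanup (e.g. joining backend threads)
  }

 protected:
  virtual void shutdownImpl() = 0;
  std::atomic<bool> running_{true};
};
```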
ghstack-source-id: 101820415

Test Plan: Ensured this works with `test_duplicate_name` (in which RpcAgent is constructed but PGA is not), and selected `rpc_spawn` and `dist_autograd_spawn` tests with TSAN. Checking Build Bot and CI as well, and continuing to test more with TSAN on devserver (currently running into memory issues).

Reviewed By: jjlilley

Differential Revision: D20902666

fbshipit-source-id: 5dbb5fc92ba66f75614c050bb10b10810770ab12
2020-04-09 12:32:00 -07:00
291c910e85 [future] Re-land some safe portions of the future change. (#36254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36254

These future-use changes were all landed yesterday as part of the future refactoring,
then quickly reverted due to an observed OOM; they are now being relanded, since
they've since been tested to be benign.
ghstack-source-id: 101776613

Test Plan:
buck test mode/dev-nosan caffe2/test/...
   not ooming: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod

Differential Revision: D20924010

fbshipit-source-id: 28872e488df34c7a886bcd659fa7e9914639d306
2020-04-08 20:05:33 -07:00
a91535930f [future] Undo some recent torch::utils::Future api changes (#36220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36220

The torch::utils::Future change from yesterday may have introduced a reference cycle,
leading to OOM on PS. This change reverts the lambda capture changes with
torch::utils::Future until we can analyze further.
ghstack-source-id: 101756106

Test Plan: ctr mobile feed: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod-preset

Differential Revision: D20918904

fbshipit-source-id: d637f2370aa72c1765b98f3b9e10eb969a025624
2020-04-08 11:28:22 -07:00
72b55fea6b [jit] Make torch::utils::Future and ivalue::future apis closer (#35849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35849

This change harmonizes some aspects of the api.

- torch::utils::Future callback should have no args, like ivalue::future.
   Many of the lines of this change are related to fixing that up downstream.

   No args makes the api simpler to use, particularly since many/most of the
   downstream use cases ignore the passed-in args. It's simple enough to
   appropriately capture the future in the lambda if necessary.

 - Add error/hasError methods to ivalue::Future.
 - Use c10::optional underneath for error to ivalue::Future.
 - Change markCompleted(error) to setError(error) to ivalue::Future.
 - Add setValue(FutureError) version to torch::utils::Future

ghstack-source-id: 101684435

Test Plan: buck test mode/dev-nosan caffe2/test/...

Differential Revision: D20803251

fbshipit-source-id: e3d925287bd9a80d649843eef5f270163f448269
2020-04-07 17:05:35 -07:00
7d1f06462c Fixing Potential TSAN issue with joining RPC helper threads (#36094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36094

The condition variable waiting in the RPC retry thread must be notified after setting the atomic `running` to false. This ensures the thread is joinable and allows `rpc.shutdown` to function correctly.
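A toy sketch of the ordering fix described above: flip the flag, then notify, so the retry thread is guaranteed to wake up and observe that it should exit (names are illustrative).

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

std::mutex retryMutex;
std::condition_variable retryCv;
std::atomic<bool> rpcRunning{true};

void stopRetryThread() {
  {
    std::lock_guard<std::mutex> guard(retryMutex);
    rpcRunning.store(false);  // set running to false first...
  }
  retryCv.notify_all();       // ...then wake the retry thread
}

void retryThreadLoop() {
  std::unique_lock<std::mutex> lock(retryMutex);
  // Wakes (and returns) once stopRetryThread() has flipped the flag.
  retryCv.wait(lock, [] { return !rpcRunning.load(); });
}
```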
ghstack-source-id: 101538860

Test Plan: build bot

Differential Revision: D20854763

fbshipit-source-id: b92050712a1e6c31d4dd3b3d98f32ef8dee0f2f2
2020-04-06 15:56:06 -07:00
19bbfbe1cf [RPC][Better Engineering] Consolidated all rpcAgentRunning atomic booleans (#33915)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33915

Closes: https://github.com/pytorch/pytorch/issues/32963

Test Plan: build bot

Reviewed By: jjlilley

Differential Revision: D20074714

fbshipit-source-id: ee89e76f547a1da71825a317c096176524504290
2020-04-03 11:50:05 -07:00
6792dac90d Only Schedule Retries before Agent Shutdown (#35554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35554

We attach a callback to our RPC send attempts that schedule a retry
upon failure. This PR only schedules the retry if the agent is running.
ghstack-source-id: 101332815

Differential Revision: D20612615

fbshipit-source-id: e1bbb3f162101bce7eb46bad512c9e5dc6d531cc
2020-04-01 19:03:09 -07:00
4d9b649261 jit pickling rref (#32959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32959

In the RPC TorchScript call path we need to pickle/unpickle RRefs, so this diff makes the JIT pickler/unpickler able to pickle/unpickle an RRef. It is similar to what is implemented for PyRef::pickle() and PyRef::unpickle().
The pickling/unpickling design assumes it is always coupled with RPC calls. It is not meant for checkpointing a model with an RRef; before checkpointing the model, the user should call rref.to_here() to get the value inside the RRef.

The pickling process is:
1. push the torch.distributed.rpc.rref global string
2. call rref.fork() to create rrefForkData, which is a few IDs plus the type str of the value held inside the rref; the IDs include the rref id, fork id, caller worker id, callee worker id, and owner worker id
3. push the rrefForkData

The unpickling process is:
1. read the torch.distributed.rpc.rref global string, and retrieve the cached global lambda function
2. the global lambda function will get the rrefForkData
3. if the callee is also the owner worker, get the owner rref based on the IDs inside the rrefForkData and return the OwnerRRef
4. if the callee is not the owner worker, create a user rref using the rrefForkData and return the UserRRef
5. meanwhile the owner rref will be notified and do reference counting correctly

During unpickling, a type_resolver is needed to parse the type str. This type_resolver has a Python dependency, so we get it from the rpc_agent and pass it to the unpickler during construction. So we added a type_resolver argument to the JIT unpickler constructor in this diff.
ghstack-source-id: 98814793

Test Plan: unit test

Differential Revision: D19713293

fbshipit-source-id: 4fd776cdd4ce8f457c4034d79acdfb4cd095c52e
2020-02-24 11:16:35 -08:00
507f963aa6 [RPC Reliability] Enabled retries for RPCs with exponential backoff (#33365)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33365

This adds functionality for retrying RPCs that are sent with the function sendWithRetries(). It adds RPCs that will potentially need to be retried to a sorted map that contains the timeout at which to retry the RPC and associated metadata. A separate thread iteratively removes the earliest retry-able RPC from the map, sleeps until the corresponding time point, re-tries the RPC, and adds it to the map again with a future timeout.
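A toy sketch of the retry bookkeeping described above (types and names are illustrative, not the actual RPC agent code):

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

using Clock = std::chrono::steady_clock;

struct RetriableRpc {
  std::string payload;
  std::chrono::milliseconds backoff{100};
};

// Keyed by the time at which to retry: the retry thread pops the earliest entry,
// sleeps until that time point, resends, and on failure reinserts the RPC with a
// doubled (exponential) backoff.
std::multimap<Clock::time_point, RetriableRpc> retryMap;

void scheduleRetry(RetriableRpc rpc) {
  rpc.backoff *= 2;
  retryMap.emplace(Clock::now() + rpc.backoff, std::move(rpc));
}
```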

GitHub Issue: https://github.com/pytorch/pytorch/issues/32124

Per the first 4 milestones, the following will be addressed in future PRs:
* enabling RPC Retries for RRef internal messages

Differential Revision: D19915694

fbshipit-source-id: 4a520e32d5084ebcf90e97fd9f26867115a35c0c
2020-02-19 15:59:29 -08:00