This is a reland of PRs https://github.com/pytorch/pytorch/pull/108626 and #109564. We fixed the iOS build failure by changing
```
((CHECK) ? (EXPR) : ([] { assert(!#CHECK); }(), (EXPR)))
```
to
```
((CHECK) ? (EXPR) : ([] { assert(false); }(), (EXPR)))
```
in `TR2_OPTIONAL_ASSERTED_EXPRESSION`, since the former syntax was invalid on Apple Clang. In any case, we apply this simple fix in the hope that `c10::optional` will be replaced by `std::optional` soon.
We also enabled -Wdeprecated on c10.
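For reference, a minimal sketch of the macro shape after the fix (a simplified stand-in, not the exact c10 definition; the `IntOptional` wrapper is purely illustrative):
```cpp
// The lambda wraps assert() so the whole expression remains a single
// conditional expression usable inside accessor functions.
#include <cassert>

#define TR2_OPTIONAL_ASSERTED_EXPRESSION(CHECK, EXPR) \
  ((CHECK) ? (EXPR) : ([] { assert(false); }(), (EXPR)))

struct IntOptional {
  bool has_value_ = false;
  int value_ = 0;
  int value() const {
    // Fires the assert in debug builds when empty; otherwise yields value_.
    return TR2_OPTIONAL_ASSERTED_EXPRESSION(has_value_, value_);
  }
};
```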
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110019
Approved by: https://github.com/clee2000
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67882
I ran into a hard-to-interpret error message when trying to run the following script, which was missing an `init_rpc` call:
```
# $ torchrun --standalone --nnodes=1 --nproc_per_node=1 script.py
import os
rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])
import torch.distributed
# !!!!!! Uncomment the following and the script succeeds
# torch.distributed.rpc.init_rpc('worker', rank=rank, world_size=world_size)
import torch.distributed as dist
dist.init_process_group(backend='gloo')
import torchvision.models as models
import torch
rn50 = models.resnet50()
rn50.train()
rn50 = torch.nn.parallel.DistributedDataParallel(rn50)
from torch.distributed.rpc import RRef
from torch.distributed.optim import DistributedOptimizer
params = []
for param in rn50.parameters():
    params.append(RRef(param))
dist_optim = DistributedOptimizer(
    torch.optim.SGD,
    params,
    lr=0.05)
loss_func = torch.nn.CrossEntropyLoss()
with torch.distributed.autograd.context() as context_id:
    pred = rn50(torch.randn(50, 3, 224, 224))
    target = torch.randn(50, 1000).softmax(dim=1)
    loss = loss_func(pred, target)
    dist.autograd.backward(context_id, [loss])
    dist_optim.step(context_id)
```
Error:
```
Traceback (most recent call last):
File "/xxx/torchrun_exp/script.py", line 23, in <module>
params.append(RRef(param))
RuntimeError: agentINTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rpc_agent.cpp":237, please report a bug to PyTorch. Current RPC agent is not set!
```
Since this is a user-facing error, I've changed `TORCH_INTERNAL_ASSERT` to `TORCH_CHECK` and added a hint about how to resolve the issue. On the other hand, the fact that this was originally a `TORCH_INTERNAL_ASSERT` may suggest that the author considered this an internal-only error condition. If there is some other place that should be throwing an exception in this case but is failing to do so, let me know and I can adapt the fix to change that location instead.
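For illustration, a hedged sketch of the intent of the change (stand-in types, not the exact diff in `rpc_agent.cpp`): `TORCH_CHECK` raises a user-facing `c10::Error` instead of an "INTERNAL ASSERT FAILED" message and can carry a remediation hint.
```cpp
#include <c10/util/Exception.h>
#include <memory>

struct FakeRpcAgent {};                          // stand-in for the real RpcAgent
std::shared_ptr<FakeRpcAgent> currentRpcAgent_;  // stand-in for the global agent

std::shared_ptr<FakeRpcAgent> getCurrentFakeRpcAgent() {
  // User-facing check with a hint, instead of TORCH_INTERNAL_ASSERT.
  TORCH_CHECK(
      currentRpcAgent_ != nullptr,
      "Current RPC agent is not set! Did you initialize the RPC framework, "
      "e.g. by calling `torch.distributed.rpc.init_rpc`?");
  return currentRpcAgent_;
}
```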
Question for reviewers:
* Is there a good test file where I can add a test for this error condition?
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D32190947
Pulled By: jamesr66a
fbshipit-source-id: 3621d755329fd524db68675c55b1daf20e716d43
Summary:
The GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` clang-tidy check, as is `DEFINE_DISPATCH`, so the check is disabled and the now-redundant `NOLINTNEXTLINE` suppressions are removed.
All changes except the ones to `.clang-tidy` were generated using the following script:
```
for i in $(find . -type f -iname "*.c*" -or -iname "*.h" \
             | xargs grep cppcoreguidelines-avoid-non-const-global-variables \
             | cut -f1 -d: | sort | uniq); do
  sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.
I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop
# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
-j \
-s \
-k \
-v \
--paths torch/csrc/ \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
-g"-torch/csrc/generic/*.cpp" \
-g"-torch/csrc/jit/codegen/cuda/runtime/*" \
-g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
-g"-torch/csrc/deploy/interpreter/interpreter.h" \
-g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
-g"-torch/csrc/deploy/interpreter/test_main.cpp"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649
Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.
Reviewed By: walterddr, janeyx99
Differential Revision: D29504258
Pulled By: 1ntEgr8
fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60470
A Future needs to know which DataPtrs are used by its value, but it isn't always able to extract them (and even when it is, doing so is expensive), so they are cached. DataPtrs behave like unique_ptrs (movable only, not copyable), hence the Future can only hold _references_ to them. The Future's value, however, is unfortunately mutable (we wish that weren't the case, but we don't think we can prevent it), which means the tensor/storage that owned a DataPtr might be deleted and the DataPtr freed. Our cached reference then becomes stale, which leads to all kinds of disaster, like reading garbage data or segfaulting.
Luckily, all the DataPtrs we were dealing with were held inside Storages, which have shared_ptr semantics, allowing us to hold a strong pointer to them and thus keep them alive.
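For illustration, a hedged sketch of the idea (not the actual ivalue::Future code): cache owning `c10::Storage` handles instead of raw DataPtr references, so the cached entries cannot dangle even if the value is later mutated.
```cpp
#include <ATen/ATen.h>
#include <vector>

std::vector<c10::Storage> extractStorages(const at::Tensor& tensor) {
  std::vector<c10::Storage> storages;
  // Storage is reference-counted, so copying the handle keeps the
  // underlying allocation alive for as long as the cache does.
  storages.push_back(tensor.storage());
  return storages;
}
```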
ghstack-source-id: 132177396
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D29303570
fbshipit-source-id: d814754806fa58b24e45269e97d768485ef972ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59205
Reland of https://github.com/pytorch/pytorch/pull/58422
Similar to Future (which I tackled recently), Message is an ivalue type (a "custom class" one), and the natural way to represent it is inside an intrusive_ptr. However in the RPC code we had a mix of usages, often passing Message by value. This has undesirable consequences, as it could easily trigger a copy by accident, which I believe is why in many places we accepted _rvalue references_ to Message, in order to force the caller to move. In my experience this is non-idiomatic in C++ (normally a function signature specifies how the function consumes its arguments, and it's up to the caller to then decide whether to copy or move).
By moving to intrusive_ptr everywhere, I think we eliminate or at least simplify many of the problems above.
In this PR I do half of the migration, by updating everything except the `toMessageImpl` methods, which will come in the next PR.
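For illustration, a hedged sketch of the calling-convention change (`FakeMessage` is a stand-in for the real Message type): passing an `intrusive_ptr` by value documents shared ownership and lets the caller choose between copying and moving, instead of being forced to move into a `Message&&` parameter.
```cpp
#include <c10/util/intrusive_ptr.h>
#include <utility>

struct FakeMessage : c10::intrusive_ptr_target {
  int id = 0;
};

// Before (roughly): void send(FakeMessage&& message);
void send(c10::intrusive_ptr<FakeMessage> message) {
  // The function owns a reference for the duration of the call.
  (void)message;
}

void example() {
  auto m = c10::make_intrusive<FakeMessage>();
  send(m);             // cheap refcount bump, m still usable
  send(std::move(m));  // or transfer ownership explicitly
}
```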
ghstack-source-id: 130202849
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28623891
fbshipit-source-id: c9aeea3440679a11741ca78c06b03c57cb815a5e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58422
Similar to Future (which I tackled recently), Message is an ivalue type (a "custom class" one), and the natural way to represent it is inside an intrusive_ptr. However in the RPC code we had a mix of usages, often passing Message by value. This has undesirable consequences, as it could easily trigger a copy by accident, which I believe is why in many places we accepted _rvalue references_ to Message, in order to force the caller to move. In my experience this is non-idiomatic in C++ (normally a function signature specifies how the function consumes its arguments, and it's up to the caller to then decide whether to copy or move).
By moving to intrusive_ptr everywhere, I think we eliminate or at least simplify many of the problems above.
In this PR I do half of the migration, by updating everything except the `toMessageImpl` methods, which will come in the next PR.
ghstack-source-id: 129567053
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28474878
fbshipit-source-id: 5b76d45e05f6fa58c831e369c5c964d126187a6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57636
The "preferred" pointer holder for Future is `intrusive_ptr` (e.g., `then` returns an `intrusive_ptr`, `toFuture` returns `intrusive_ptr`, ...). However in RPC we often wrap it with `shared_ptr`. This probably dates back to when we had a separate Future type, before the merge.
At the boundary between RPC and JIT this difference becomes a bit annoying, as conversions between the pointer types are needed. I think it would be simpler and more consistent to always use `intrusive_ptr`, also in RPC.
This PR was produced mainly by find-and-replace, plus a couple of manual fixes.
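For context, a minimal sketch of creating an ivalue::Future held by `intrusive_ptr` (API as I understand it; exact constructor arguments may differ across versions):
```cpp
#include <ATen/core/ivalue.h>
#include <ATen/core/jit_type.h>

c10::intrusive_ptr<c10::ivalue::Future> makeFuture() {
  // The future is created and passed around as an intrusive_ptr, matching
  // what then() and the JIT interfaces already hand back, so no
  // shared_ptr<->intrusive_ptr conversion is needed at the RPC/JIT boundary.
  return c10::make_intrusive<c10::ivalue::Future>(c10::AnyType::get());
}
```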
ghstack-source-id: 128296581
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D28187972
fbshipit-source-id: d4609273a1550b4921910e85d2198e02f31c905b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57635
Note: this PR looks massive, but it's just one simple change, codemodded many times.
In many cases, a callback needs to access the value/error produced by the parent future. In Python this was easy because the callback was invoked with the parent future as argument, and could thus inspect it. In C++ the callbacks didn't take any arguments, thus in many cases we worked around this by capturing the future in its own callback. This is risky (leads to reference cycle and thus memory leak) and must be done carefully (spoiler: sometimes we weren't).
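A hedged sketch of the resulting callback shape (API as I understand it; exact signatures may differ across versions): the callback receives the completed future as an argument, so it no longer needs to capture the future itself and risk a reference cycle.
```cpp
#include <ATen/core/ivalue.h>
#include <iostream>

void attachCallback(const c10::intrusive_ptr<c10::ivalue::Future>& fut) {
  fut->addCallback([](c10::ivalue::Future& completedFut) {
    // Inspect the parent future via the argument, not via a self-capture.
    if (completedFut.hasError()) {
      std::cerr << "future failed\n";
    } else {
      c10::IValue value = completedFut.value();
      (void)value;
    }
  });
}
```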
ghstack-source-id: 128296580
Test Plan: CI
Reviewed By: wanchaol
Differential Revision: D28178783
fbshipit-source-id: 6de02c4568be42123372edc008f630d5ddae0081
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57432
In a bunch of places we were creating a future and then "forwarding" the value of another future to it once that other future completed. (This was in order to convert the type of the value, or to "merge" multiple futures into one). However when doing so we often created a child future with an empty set of devices, which meant it didn't support CUDA, and thus would cause a silent synchronization/correctness bug if the parent future did actually contain CUDA tensors.
One way this could have been caught earlier would have been to have Future always extract the DataPtrs, even in CPU-only mode, in order to ensure they always reside on the expected set of devices. Unfortunately this might have some adverse perf effects and thus should be done carefully.
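A hedged sketch of the fix described above (accessor names as I understand them; they may differ by version): when creating a follow-up future, propagate the parent's device set so the child remains CUDA-aware.
```cpp
#include <ATen/core/ivalue.h>
#include <ATen/core/jit_type.h>

c10::intrusive_ptr<c10::ivalue::Future> makeChildFuture(
    const c10::intrusive_ptr<c10::ivalue::Future>& parent) {
  // Passing the parent's devices instead of an empty device list is what
  // keeps CUDA synchronization working for the child future as well.
  return c10::make_intrusive<c10::ivalue::Future>(
      c10::AnyType::get(), parent->devices());
}
```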
ghstack-source-id: 128184667
Test Plan: eyes
Reviewed By: mrshenli
Differential Revision: D28143045
fbshipit-source-id: 9af1abf270366dc1df0d4857d6a8cc73668af9d1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57433
In a bunch of cases we need to "forward" between one future and another, typically because we need to convert the type of the data (e.g., from Message to PyObject). In most of these cases the DataPtrs of the value don't change, and yet the new future must re-extract them from scratch. By allowing the user to obtain the vector of extracted DataPtrs from the old future, we can allow them to "shortcut" this step.
Also, this change is a requirement for the next PR to work, since the next PR would otherwise cause us to attempt extracting DataPtrs from Message instances, which doesn't work (because Message is a custom class), but thanks to this PR we actually skip that.
ghstack-source-id: 128184663
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28118298
fbshipit-source-id: 70e333ea6a4f8d4d9a86514c350028d412469ee1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57030
PR #57029 is not perfect; there are still obscure situations in which
we might allocate a shared_ptr to an RpcAgent that doesn't have a no-GIL
constructor, so this PR adds the other half of the equation:
assert that we don't hold the GIL when running a blocking destructor.
This makes it possible to detect potential deadlocks even if the
code doesn't deadlock in practice (because you got lucky and none
of the threads you blocked on tried to also take out the GIL).
I considered whether or not to make this DEBUG_ONLY. For now it's
not, so I can get better CI coverage, and because this test only
happens in destructors of objects that die rarely.
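For illustration, a hedged, generic sketch of the idea (not the exact PyTorch code): a blocking destructor asserts that the GIL is not held, so a potential deadlock is surfaced even when the threads being joined happen not to need the GIL.
```cpp
#include <Python.h>
#include <cassert>

struct BlockingHandle {
  ~BlockingHandle() {
    // If the GIL were held here, joining worker threads that also need the
    // GIL could deadlock; fail loudly instead of getting lucky silently.
    assert(!PyGILState_Check());
    // ... join threads / wait for pending work ...
  }
};
```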
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: zou3519
Differential Revision: D28030582
Pulled By: ezyang
fbshipit-source-id: a7d7f6545223c4823c7f6036dfe29bd2edaf60a5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57294
With the advent of CPUs in the device maps, and in order to be more generic (e.g., to support AMD GPUs) and to avoid conversions when passing to Future, RRef, and the like, it's easier to use Devices instead of DeviceIndices. This started as just migrating the TensorPipe agent, but the RPC layer is quite intertwined, so I had to migrate a lot of stuff.
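For illustration of the distinction (standard c10 types, simplified): a DeviceIndex is just an ordinal, while a Device also carries the device type, so CPUs and other backends can be represented uniformly.
```cpp
#include <c10/core/Device.h>

void example() {
  c10::DeviceIndex idx = 0;                       // "device 0" of... which backend?
  c10::Device cuda0(c10::DeviceType::CUDA, idx);  // unambiguous: cuda:0
  c10::Device cpu(c10::DeviceType::CPU);          // a CPU is representable at all
}
```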
ghstack-source-id: 127916562
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28092733
fbshipit-source-id: 024dcb3648c5898ab13e770413c43958f04f1a8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57288
If the device map provided by RemoteModule is not empty, then the TensorPipe RPC backend can support directly sending GPU tensors over the wire.
Also add a pybind of `_get_device_map`.
The changes to the unit test setup are separated out into a follow-up PR, as they currently break some tests in `distributed/rpc/test_faulty_agent.py`.
We still need to fix test_load_di_parts in `torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test`. Currently an early return is used to bypass this test failure.
#Original PR issue: https://github.com/pytorch/pytorch/issues/51670
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device_script
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule -j 1
CAUTION: This one actually fails and now it is bypassed. See FIXME in `_remote_forward`.
buck test mode/dev-nosan caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- test_load_di_parts
Reviewed By: wanchaol
Differential Revision: D28021672
fbshipit-source-id: a89245dc35e1d9479811ec6f98d9f34116837d79
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57052
This PR caps a stack whose goal was to merge CUDAFuture into ivalue::Future. CUDAFuture used to be a subclass of ivalue::Future, which was already pretty good, but it meant that in several places we needed `#ifdef`s or registries in order to create the right type of class, which was annoying. We've made CUDAFuture device-agnostic, by using generic helpers, so that it doesn't depend on CUDA. Now all its code can be inserted into ivalue::Future.
This PR does this very naively, by copy-pasting CUDAFuture's code into the (previously empty) virtual methods of ivalue::Future. This helps ensure the correctness of this PR, as it's straightforward to see that it behaves exactly like before. However, we probably want to polish it a bit later to iron out some wrinkles.
ghstack-source-id: 127713138
(Note: this ignores all push blocking failures!)
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28036829
fbshipit-source-id: 3e5b16402f5dc245c1fcb9d7bf06db64dcb0d2a3
Summary:
In my last PR I missed the CUDA and distributed folders; this fixes that.
This change is autogenerated by `python tools/clang_tidy.py -s`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57235
Reviewed By: janeyx99
Differential Revision: D28084444
Pulled By: malfet
fbshipit-source-id: bf222f69ee90c7872c3cb0931e8cdb84f0cb3cda
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57085
PR #54932 fixed the CUDA RPC for RRef when RRef is created through
RPC. But besides that use case, RRef can also be created locally
by directly passing in a value, which would bypass the CUDA stream
synchronization in #54932.
This commit covers the above gap by adding a `devices` argument
to the RRef constructor. The RRef will then use this argument to
choose between `CUDAFuture` and `ivalue::Future` to hold the value.
When `devices` is specified and non-empty, `CUDAFuture` will be
used, and the `devices` will be passed to that `CUDAFuture`.
Test Plan: Imported from OSS
Reviewed By: lw
Differential Revision: D28050001
Pulled By: mrshenli
fbshipit-source-id: 2316b419fa69aa4dcd444050f0b74e61c3d0af1e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44859
TensorPipe's `set_device_map` option was applied during the forward
pass. However, if we ran the backward pass for the graph we would not
automatically pick up the reverse device mapping.
As a result, users had to specify both forward and backward device mapping
which is very tedious to do.
In this PR, I've added this functionality such that TensorPipe automatically
picks up the reverse device mapping during the backward pass. This is done by
storing the appropriate device mapping in the "recv" autograd function for
distributed autograd.
#Closes: https://github.com/pytorch/pytorch/issues/44170
ghstack-source-id: 119950842
Test Plan:
1) waitforbuildbot
2) Unit test added.
Reviewed By: mrshenli
Differential Revision: D23751975
fbshipit-source-id: 2717d0ef5bde3db029a6172d98aad95734d52140
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49906
This commit modifies RPC Message to inherit from `torch::CustomClassHolder`,
and wraps a Message in an IValue in `RpcAgent::send()`.
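A hedged sketch of the wrapping this enables (`MyPayload` is a stand-in for Message, and the registration namespace/name are made up; the real Message gets the equivalent treatment in the RPC code):
```cpp
#include <ATen/core/ivalue.h>
#include <torch/custom_class.h>

struct MyPayload : torch::CustomClassHolder {
  int64_t data = 0;
};

// Registering the type is what allows an intrusive_ptr<MyPayload> to be
// stored inside an IValue.
static auto registration =
    torch::class_<MyPayload>("rpc_sketch", "MyPayload");

c10::IValue wrapPayload(c10::intrusive_ptr<MyPayload> payload) {
  return c10::IValue(std::move(payload));  // held as a custom-class IValue
}
```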
Test Plan: Imported from OSS
Reviewed By: lw
Differential Revision: D25719518
Pulled By: mrshenli
fbshipit-source-id: 694e40021e49e396da1620a2f81226522341550b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39974
# Problem
When this assertion happens, I don't know
- which worker_id it is on, even with the worker_name "trainer:0".
- which rref is throwing this exception.
```shell
File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in _initialize_trainers
trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items()
File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in <dictcomp>
trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items()
File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/torch/distributed/rpc/internal.py", line 158, in _handle_exception
raise result.exception_type(result.msg)
RuntimeError: RuntimeError('Cannot call localValue() on a non-local reference. Call it on trainer:0')
Traceback (most recent call last):
File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/internal.py", line 148, in _run_function
result = python_udf.func(*python_udf.args, **python_udf.kwargs)
File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/rref_proxy.py", line 5, in _local_invoke
return getattr(rref.local_value(), func_name)(*args, **kwargs)
RuntimeError: Cannot call localValue() on a non-local reference. Call it on trainer:0
```
Changes:
- Add the ability to stringify WorkerInfo.
- Make the localValue() assertion message clearer about this case.
ghstack-source-id: 105840918
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork -- test_local_value_not_on_owner
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit/:rpc_fork
Reviewed By: mrshenli
Differential Revision: D5690653
fbshipit-source-id: ca6a8b1ff6e09f8644303a0f82f9b1a546a11170
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39663
I was investigating a memory corruption issue and thought it may be due to a race condition in (un)setting the current RPC agent. It turns out it wasn't (still investigating...). I had already written this fix, and it is a real fix (there could really be a race condition), so I'm sending it out to see whether there's interest in merging it. I believe its practical usefulness is however very limited, since typically the current RPC agent is only changed twice (at start and at shutdown) and thus there's limited risk for races.
As there may be some confusion on atomicity of shared_ptrs, let me clarify a few things from the get go. Operations on the control blocks of shared_ptrs (i.e., increasing and decreasing the refcounts) are atomic, which means that it is safe to manipulate *two different* shared_ptrs that point to the *same* object from *different* threads. However, the shared_ptr object itself is not atomic, which means that it is *not* safe to manipulate the *same* shared_ptr from two *different* threads. For that reason, the STL provides atomic functions explicitly specialized for shared_ptrs: https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic (in C++ 20, they are being replaced by a specialization of std::atomic<std::shared_ptr<T>>). Note that this has been called "the worst question of all of C++" by Louis Brandy at his CppCon talk: https://youtu.be/lkgszkPnV8g?t=1210
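To illustrate the point (a generic sketch, not the PyTorch code; `int` stands in for the agent type): concurrent access to the *same* shared_ptr object must go through the atomic free functions (or `std::atomic<std::shared_ptr<T>>` in C++20).
```cpp
#include <memory>
#include <utility>

std::shared_ptr<int> currentAgent;  // one global shared_ptr, like the current RPC agent

void setAgent(std::shared_ptr<int> agent) {
  // Safe even if another thread is concurrently reading currentAgent.
  std::atomic_store(&currentAgent, std::move(agent));
}

std::shared_ptr<int> getAgent() {
  // Safe concurrent read; only the control-block refcount is shared.
  return std::atomic_load(&currentAgent);
}
```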
ghstack-source-id: 105475005
Test Plan: Unit tests
Differential Revision: D21932817
fbshipit-source-id: da33fedd98efb820f284583ce7ff1c1c531dea9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38521
In the RPC retry thread, we add retriable futures to a list under the lock, release the lock, add callbacks/set errors on those futures, then re-acquire the lock to clean up the retry map. We can simply clean up the retry map before releasing the lock and not acquire it again - this is cleaner and may result in better perf if it reduces context switching between threads looking to acquire the retryMapLock.
ghstack-source-id: 104062147
Test Plan: CI, there are thorough tests in the RPC framework to test errors with retries.
Differential Revision: D21563085
fbshipit-source-id: 35e620892da630d082c032f5f9ce16e8a9ffdfaa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38414
The `std::to_string` call is unnecessary when using glog.
ghstack-source-id: 104030161
Test Plan: Ran the retry tests and checked the logs to ensure the correct message was printed upon message failure.
Differential Revision: D21266330
fbshipit-source-id: 53519287778d47d99b94ea34b7c551f910affda2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36677
Move the `futures` vector to be a local function var like `errorFutures`. Holding the lock to clear the vector is now unnecessary.
ghstack-source-id: 102265569
Differential Revision: D20884589
fbshipit-source-id: c9a13258bee737d86f9b0d11cdd28263bb923697
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35263
Process Group Agent throws an exception if a send attempt is made after the agent is shut down. With retries, we should catch this exception and mark the original future with an error.
ghstack-source-id: 102153897
Test Plan: Running all rpc/dist_autograd tests.
Differential Revision: D20611412
fbshipit-source-id: a6009f0b0aa8be662364158962a054c5c29090bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36502
We're sometimes deleting futures without completing them (discovered by logging),
and we've recently noticed a slow memory leak.
This change migrates the future lambda cases where there was self-capture.
- In some cases, we use weak_ptr<>, plus .lock()/assert in the lambda callback
(see the sketch after this list). This avoids the reference cycle. We use this
primarily in the case where the value ends up being moved in the callback
(something we want to be careful about)
- We also add a convenience api to Future where the completed Future is returned as an arg.
This allows us to avoid self-capture, though it assumes that the markCompleted()
caller is persisting the future for the markCompleted() duration (this has been the case)
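A hedged sketch of the weak_ptr pattern from the first bullet (an illustrative toy Future, not the real RPC types): capturing the future weakly in its own callback breaks the shared_ptr cycle that would otherwise leak it.
```cpp
#include <cassert>
#include <functional>
#include <memory>

struct ToyFuture {
  void addCallback(std::function<void()> cb) { cb_ = std::move(cb); }
  void markCompleted() { if (cb_) cb_(); }
  std::function<void()> cb_;
};

void attach(const std::shared_ptr<ToyFuture>& fut) {
  std::weak_ptr<ToyFuture> weakFut = fut;  // no strong self-capture
  fut->addCallback([weakFut]() {
    auto strongFut = weakFut.lock();       // re-acquire if still alive
    assert(strongFut);                     // caller keeps it alive through markCompleted()
    // ... move the value out of *strongFut ...
  });
}
```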
ghstack-source-id: 102130672
Test Plan: ctr_mobile_feed, buck test mode/dev-nosan caffe2/test/...
Differential Revision: D20998905
fbshipit-source-id: 7dd52fe4e567a5dea20e8d43862fc2335fd3ce16
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36239
ProcessGroupAgent and ThriftAgent threads were joined at shutdown, but RpcAgent threads were joined by the destructor. This PR joins all threads at shutdown by using a pattern similar to `start` in RPC.
The derived classes implement a `shutdownImpl` method that cleans up backend-specific state. RpcAgent implements `shutdown`, which cleans up generic state and calls the underlying `shutdownImpl`. The atomic `running` flag is now set and unset by RpcAgent so backends do not need to mutate it.
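For illustration, a hedged sketch of the shutdown pattern described above (simplified, not the actual class hierarchy): the base class owns the generic teardown and the `running` flag, and delegates backend-specific cleanup to a virtual `shutdownImpl()`.
```cpp
#include <atomic>

class RpcAgentBase {
 public:
  virtual ~RpcAgentBase() = default;
  void shutdown() {
    running_.store(false);  // generic state handled by the base class
    joinGenericThreads();
    shutdownImpl();         // backend-specific cleanup
  }

 protected:
  virtual void shutdownImpl() = 0;

 private:
  void joinGenericThreads() { /* join retry/callback threads, etc. */ }
  std::atomic<bool> running_{true};
};
```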
ghstack-source-id: 101820415
Test Plan: Ensured this works with `test_duplicate_name` (in which RpcAgent is constructed but PGA is not), and selected `rpc_spawn` and `dist_autograd_spawn` tests with TSAN. Checking Build Bot and CI as well, and continuing to test more with TSAN on devserver (currently running into memory issues).
Reviewed By: jjlilley
Differential Revision: D20902666
fbshipit-source-id: 5dbb5fc92ba66f75614c050bb10b10810770ab12
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36254
These future-use changes were all landed yesterday as part of the future
refactoring, quickly reverted due to an observed OOM, and are now being relanded,
since they have since been tested to be benign.
ghstack-source-id: 101776613
Test Plan:
buck test mode/dev-nosan caffe2/test/...
not ooming: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod
Differential Revision: D20924010
fbshipit-source-id: 28872e488df34c7a886bcd659fa7e9914639d306
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36220
The torch::utils::Future change from yesterday may have introduced a reference cycle,
leading to OOM on PS. This change reverts the lambda capture changes with
torch::utils::Future until we can analyze further.
ghstack-source-id: 101756106
Test Plan: ctr mobile feed: buck run mode/opt -c=python.package_style=inplace //caffe2/torch/fb/training_toolkit/examples:ctr_mbl_feed_integration -- prod-preset
Differential Revision: D20918904
fbshipit-source-id: d637f2370aa72c1765b98f3b9e10eb969a025624
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35849
This change harmonizes some aspects of the API.
- torch::utils::Future callbacks should take no args, like ivalue::Future's.
Many of the lines of this change are related to fixing that up downstream.
Having no args makes the API simpler to use, particularly since many/most of the
downstream use cases ignore the passed-in args. It's simple enough to
capture the future in the lambda where necessary.
- Add error/hasError methods to ivalue::Future.
- Use c10::optional underneath for the error in ivalue::Future.
- Change markCompleted(error) to setError(error) in ivalue::Future.
- Add a setValue(FutureError) version to torch::utils::Future.
ghstack-source-id: 101684435
Test Plan: buck test mode/dev-nosan caffe2/test/...
Differential Revision: D20803251
fbshipit-source-id: e3d925287bd9a80d649843eef5f270163f448269
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36094
The condition variable that the RPC retry thread waits on must be notified after setting the atomic `running` flag to false. This ensures the thread is joinable and allows `rpc.shutdown` to function correctly.
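A hedged, generic illustration of the shutdown ordering (not the exact retry-thread code): flip the flag before notifying, so the waiting thread re-checks its predicate, exits the loop, and becomes joinable.
```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

std::mutex mtx;
std::condition_variable cv;
std::atomic<bool> running{true};

void shutdown() {
  {
    std::lock_guard<std::mutex> lock(mtx);
    running = false;  // flip the flag first...
  }
  cv.notify_all();    // ...then wake the retry thread
}

void retryLoop() {
  std::unique_lock<std::mutex> lock(mtx);
  while (running) {
    cv.wait(lock);    // woken by shutdown() or by newly scheduled retries
    // ... process due retries ...
  }
}
```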
ghstack-source-id: 101538860
Test Plan: build bot
Differential Revision: D20854763
fbshipit-source-id: b92050712a1e6c31d4dd3b3d98f32ef8dee0f2f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35554
We attach a callback to our RPC send attempts that schedules a retry
upon failure. This PR only schedules the retry if the agent is running.
ghstack-source-id: 101332815
Differential Revision: D20612615
fbshipit-source-id: e1bbb3f162101bce7eb46bad512c9e5dc6d531cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32959
In the RPC TorchScript call path we need to pickle/unpickle RRefs, so this diff makes the JIT pickler/unpickler able to pickle/unpickle an RRef. It is similar to what is implemented for PyRRef::pickle() and PyRRef::unpickle().
The pickling/unpickling design assumes it is always coupled with RPC calls. It is not meant for checkpointing a model that holds an RRef; before checkpointing such a model, the user should call rref.to_here() to get the value inside the rref.
The pickling process is:
1. push the torch.distributed.rpc.rref global string
2. call rref.fork() to create rrefForkData, which is a few IDs and the type str of the value held inside the rref; the IDs include the rref id, fork id, caller worker id, callee worker id, and owner worker id
3. push the rrefForkData
The unpickling process is:
1. read the torch.distributed.rpc.rref global string, and retrieve the cached global lambda function
2. the global lambda function receives the rrefForkData
3. if the callee is also the owner worker, get the owner rref based on the IDs inside the rrefForkData and return the OwnerRRef
4. if the callee is not the owner worker, create a user rref using the rrefForkData and return the UserRRef
5. meanwhile the owner rref will be notified and do reference counting correctly
During unpickling, a type_resolver is needed to parse the type str. This type_resolver has a Python dependency, so we get it from the rpc_agent and pass it to the unpickler during construction. So this diff adds a type_resolver argument to the JIT unpickler constructor.
ghstack-source-id: 98814793
Test Plan: unit test
Differential Revision: D19713293
fbshipit-source-id: 4fd776cdd4ce8f457c4034d79acdfb4cd095c52e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33365
This adds functionality for retrying RPCs that are sent with the function sendWithRetries(). RPCs that will potentially need to be retried are added to a sorted map that contains the timeout at which to retry the RPC and associated metadata. A separate thread iteratively removes the earliest retriable RPC from the map, sleeps until the corresponding time point, retries the RPC, and adds it to the map again with a future timeout.
GitHub Issue: https://github.com/pytorch/pytorch/issues/32124
Per the first 4 milestones, the following will be addressed in future PRs:
* enabling RPC Retries for RRef internal messages
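A hedged sketch of the retry bookkeeping described above (illustrative types and names, not the actual agent code): a map ordered by retry time lets a single thread always pop the earliest due entry and sleep until its deadline.
```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

using Clock = std::chrono::steady_clock;

struct RetryInfo {
  std::string messageId;  // hypothetical metadata for the pending RPC
  int retriesLeft = 0;
};

// Ordered by the time at which the RPC should be retried; a multimap allows
// several RPCs to share the same deadline.
std::multimap<Clock::time_point, RetryInfo> retryMap;

void scheduleRetry(RetryInfo info, std::chrono::milliseconds backoff) {
  retryMap.emplace(Clock::now() + backoff, std::move(info));
}
```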
Differential Revision: D19915694
fbshipit-source-id: 4a520e32d5084ebcf90e97fd9f26867115a35c0c