mirror of
https://github.com/pytorch/pytorch.git
synced 2025-10-21 05:34:18 +08:00
[torch.distributed.rpc] Add stringify WorkerInfo, better error message for py_rref (#39974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39974

# Problem

When this assertion happens, I don't know
- which worker_id it is on, even with the worker_name "trainer:0".
- which rref is throwing this exception.

```shell
File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in _initialize_trainers
    trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items()
File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/caffe2/torch/fb/training_toolkit/backend/training_strategies/parameter_server_strategy.py", line 246, in <dictcomp>
    trainer_name: fut.wait() for trainer_name, fut in model_rref_futs.items()
File "/mnt/xarfuse/uid-213229/96b122e4-seed-df64b884-e2b4-4520-b7a8-777e79c829ac-ns-4026532900/torch/distributed/rpc/internal.py", line 158, in _handle_exception
    raise result.exception_type(result.msg)
RuntimeError: RuntimeError('Cannot call localValue() on a non-local reference. Call it on trainer:0')
Traceback (most recent call last):
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/internal.py", line 148, in _run_function
    result = python_udf.func(*python_udf.args, **python_udf.kwargs)
  File "/mnt/xarfuse/uid-213229/96b122e4-seed-21bc7792-3714-4e62-a1c1-32a7c38ed984-ns-4026533058/torch/distributed/rpc/rref_proxy.py", line 5, in _local_invoke
    return getattr(rref.local_value(), func_name)(*args, **kwargs)
RuntimeError: Cannot call localValue() on a non-local reference. Call it on trainer:0
```

Changes,
- Add stringify WorkerInfo
- Make localValue() assertion message clearer about the case.
ghstack-source-id: 105840918

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork -- test_local_value_not_on_owner
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit/:rpc_fork

Reviewed By: mrshenli

Differential Revision: D5690653

fbshipit-source-id: ca6a8b1ff6e09f8644303a0f82f9b1a546a11170
committed by Facebook GitHub Bot
parent 905c6730b7
commit b803b4ce09
@@ -182,11 +182,16 @@ py::object PyRRef::toHere(const float timeoutSeconds) const {
 py::object PyRRef::localValue() const {
   TORCH_CHECK(
       rref_->isOwner(),
-      "Cannot call localValue() on a non-local reference. Call it on ",
-      owner().name_);
+      "For ",
+      *rref_,
+      ", can't call localValue() on user ",
+      RRefContext::getInstance().agent()->getWorkerInfo(),
+      ". Call it on owner ",
+      owner());
 
   py::object res;
-  auto value = c10::static_intrusive_pointer_cast<OwnerRRef>(rref_)->getValue();
+  auto value =
+      c10::static_intrusive_pointer_cast<const OwnerRRef>(rref_)->getValue();
   auto& rpcHandler = PythonRpcHandler::getInstance();
   {
     // acquiring GIL as torch::jit::toPyObject creates new py::object without
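The hunk above passes `owner()` and the current worker's info directly to `TORCH_CHECK`, which only works if `WorkerInfo` can be streamed. The following is a minimal standalone sketch of that idea, not the code from this commit: the `WorkerInfo` struct here is a hypothetical stand-in for the real one, the output format of `operator<<` is an assumption, and `buildMessage` is an illustrative helper that only mimics how a variadic check macro might concatenate its arguments.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the RPC WorkerInfo type; the real struct in
// PyTorch may have different fields and types.
struct WorkerInfo {
  std::string name_;
  int16_t id_;
};

// "Stringify" support: once this overload exists, a WorkerInfo can be
// streamed like any other value. The exact format is an assumption.
std::ostream& operator<<(std::ostream& os, const WorkerInfo& w) {
  return os << "WorkerInfo(id=" << w.id_ << ", name=" << w.name_ << ")";
}

// Illustrative helper (not a PyTorch API): streams every argument into one
// string, roughly the way a variadic check macro assembles its message.
inline void appendAll(std::ostringstream&) {}

template <typename T, typename... Rest>
void appendAll(std::ostringstream& oss, const T& first, const Rest&... rest) {
  oss << first;
  appendAll(oss, rest...);
}

template <typename... Args>
std::string buildMessage(const Args&... args) {
  std::ostringstream oss;
  appendAll(oss, args...);
  return oss.str();
}
```

With an overload like this, a call such as `buildMessage("can't call localValue() on user ", user, ". Call it on owner ", owner)` reports both the worker id and the worker name, which is the information the Problem section says was missing from the old message.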