[RPC] Support timeout for RRef proxy functions (#50499)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50499

Adds a timeout API to the following functions:
```
rref.rpc_sync()
rref.rpc_async()
rref.remote()
```
so that RPCs initiated by these proxy calls can time out in the same way as the regular RPC APIs. Timeouts are supported in the following cases:

1. rpc.remote() finishes successfully and in time, but the function run by rref.rpc_async() is slow and times out. A timeout error is raised.
2. The function run by rref.rpc_async() is fast, but rpc.remote() is slow or hanging. When rref.rpc_async() is called, it still times out with the passed-in timeout instead of blocking until rpc.remote() succeeds, which is the current behavior. Note that the timeout occurs during future creation itself (not during the wait), because the proxy calls `rref._get_type`, which blocks. We could make this nonblocking by changing rref._get_type to return a future, although that is likely a larger change.
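
The two cases can be illustrated with a small stdlib-only toy model. This is a sketch of the behavior described above, not the PyTorch implementation: `ToyRRef`, `ToyProxy`, `ToyFuture`, `Worker`, and `outcome` are hypothetical names, threads stand in for remote workers, and `creation_delay` stands in for a slow rpc.remote().

```python
import concurrent.futures as cf
import time

_POOL = cf.ThreadPoolExecutor(max_workers=4)


class Worker:
    """Hypothetical remote object with one slow and one fast method."""

    def slow(self, seconds):
        time.sleep(seconds)
        return "done"

    def fast(self):
        return "done"


class ToyFuture:
    """Wraps a future and enforces a deadline fixed when the call was issued."""

    def __init__(self, fut, deadline):
        self._fut = fut
        self._deadline = deadline

    def wait(self):
        remaining = max(0.0, self._deadline - time.monotonic())
        return self._fut.result(timeout=remaining)


class ToyRRef:
    """Stand-in for an RRef; creation_delay models a slow rpc.remote()."""

    def __init__(self, value, creation_delay=0.0):
        self._creation = _POOL.submit(self._create, value, creation_delay)

    @staticmethod
    def _create(value, delay):
        time.sleep(delay)
        return value

    def _get_type(self, timeout):
        # Blocks until creation finishes or the timeout expires.
        return type(self._creation.result(timeout=timeout))

    def rpc_async(self, timeout):
        return ToyProxy(self, timeout)


class ToyProxy:
    def __init__(self, rref, timeout):
        self._rref = rref
        self._timeout = timeout

    def __getattr__(self, name):
        def invoke(*args, **kwargs):
            deadline = time.monotonic() + self._timeout
            # Case 2: a slow creation times out here, before any future exists.
            rref_type = self._rref._get_type(self._timeout)
            method = getattr(rref_type, name)
            obj = self._rref._creation.result()
            fut = _POOL.submit(method, obj, *args, **kwargs)
            # Case 1: a slow method times out later, in ToyFuture.wait().
            return ToyFuture(fut, deadline)

        return invoke


def outcome(rref, method, *args, timeout):
    """Report where (if anywhere) the proxy call timed out."""
    try:
        fut = getattr(rref.rpc_async(timeout=timeout), method)(*args)
    except cf.TimeoutError:
        return "timed out while creating the future"
    try:
        return fut.wait()
    except cf.TimeoutError:
        return "timed out in wait()"


if __name__ == "__main__":
    print(outcome(ToyRRef(Worker()), "slow", 0.5, timeout=0.1))
    print(outcome(ToyRRef(Worker(), creation_delay=0.5), "fast", timeout=0.1))
```

In this model the error for case 2 surfaces at the call site rather than in `wait()`, which mirrors the limitation noted above about the blocking `rref._get_type`.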

Test Plan: Added UT

Reviewed By: wanchaol

Differential Revision: D25897495

fbshipit-source-id: f9ad5b8f75121f50537677056a5ab16cf262847e
Rohan Varma
2021-01-15 13:16:15 -08:00
committed by Facebook GitHub Bot
parent ab1ba8f433
commit d64184ef4c
6 changed files with 97 additions and 20 deletions


@@ -228,20 +228,22 @@ std::string PyRRef::str() const {
   }
 }
-py::object PyRRef::createRRefProxy(const RRefProxyType& type) const {
+py::object PyRRef::createRRefProxy(
+    const RRefProxyType& type,
+    float timeoutSeconds) const {
   auto& pythonRpcHandler = PythonRpcHandler::getInstance();
   pybind11::gil_scoped_acquire ag;
   auto& functions = pythonRpcHandler.getRRefProxyFunctions();
   auto& ctor = functions.rrefProxyCtor_;
   switch (type) {
     case RRefProxyType::RPC_SYNC: {
-      return ctor(*this, functions.rpcSync_);
+      return ctor(*this, functions.rpcSync_, timeoutSeconds);
     }
     case RRefProxyType::RPC_ASYNC: {
-      return ctor(*this, functions.rpcAsync_);
+      return ctor(*this, functions.rpcAsync_, timeoutSeconds);
     }
     case RRefProxyType::REMOTE: {
-      return ctor(*this, functions.remote_);
+      return ctor(*this, functions.remote_, timeoutSeconds);
     }
     default: {
       TORCH_INTERNAL_ASSERT(false, "Unrecognized RRefProxy type ", type);