Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25499
See #23110 for model parallel design details, and #26759 for the RRef
protocol. This commit adds support for using RRefs as Python UDF arguments
and return values. RRefs can now be shared from owner to user, from user to
owner, or from user to user.
Limitations:
1. No implicit type conversion yet. (#27099)
2. No failure handling and retry. (#26116)
3. UDF is not yet blocked until all RRefs are confirmed. (#27098)
4. Internal RRef control messages are not idempotent yet. (#26116)
5. Cannot delete RRefs correctly when there are circular dependencies. (#27096)
Main changes:
1. Added `SCRIPT_REMOTE_CALL` and `PYTHON_REMOTE_CALL` to `Message.h` to represent `dist.remote` invocations.
2. Added `SCRIPT_RREF_FETCH_CALL`, `PYTHON_RREF_FETCH_CALL`, `RREF_USER_ACCEPT`, `RREF_USER_DELETE`, `RREF_CHILD_ACCEPT`, and `RREF_FORK_REQUEST` to `Message.h` as internal RRef control messages.
3. Added new message request handling code to `functions.cpp`, and added the message formats in `script_remote_call.h`, `python_remote_call.h`, and `rref_proto.h`.
4. Added a `PyRRef` type in `py_rref.h` and `py_rref.cpp` which holds a shared pointer to the C++ `RRef` type. `PyRRef` wraps the C++ API and also implements RRef pickling and unpickling. RRef fork-related control messages are sent during RRef pickling and unpickling.
5. Updated `RRef.h` and `RRef.cpp` accordingly to support `py::object` RRefs.
6. RRef context (reference count, etc.) is tracked in `rref_context.h` and `rref_context.cpp`.
Test Plan:
Imported from OSS
buck test mode/dev-nosan //caffe2/test:rpc_fork
Differential Revision: D17184146
Pulled By: mrshenli
fbshipit-source-id: a3a268efc087ac1ef489136ab957080382629265
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527
Master GH issue: https://github.com/pytorch/pytorch/issues/23110.
This change builds upon https://github.com/pytorch/pytorch/pull/24876 and
provides all the autograd hooks needed for a forward pass with distributed rpc
for builtin operators. This change does not address distributed rpc for python
UDFs; that will be addressed in follow-up PRs.
Summary of changes:
1. Attach send autograd functions when a request is sent from the client and
response is sent from the server.
2. Attach receive autograd functions when a request is received on the server
and a response is received on the client.
3. Generate a globally unique autograd_message_id for each send/recv autograd
function pair to uniquely identify them.
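The message above does not say how the globally unique id is built. As a minimal sketch of one common scheme, assuming each worker has a small integer id, the worker id can be packed into the high bits and a local counter into the low bits. The class name, the 16/48 bit split, and `next_id` are all hypothetical and are not taken from the PyTorch source.

```python
import itertools
import threading


class AutogradMessageIdGenerator:
    """Hypothetical generator: worker id in the high 16 bits, local counter in the low 48."""

    WORKER_BITS = 16
    COUNTER_BITS = 48

    def __init__(self, worker_id):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id must fit in 16 bits")
        self._worker_id = worker_id
        self._counter = itertools.count()
        self._lock = threading.Lock()

    def next_id(self):
        # The lock keeps the counter consistent if several threads send at once.
        with self._lock:
            local = next(self._counter)
        if local >= (1 << self.COUNTER_BITS):
            raise OverflowError("local counter exhausted")
        # Ids from different workers never collide because the high bits differ.
        return (self._worker_id << self.COUNTER_BITS) | local


gen0 = AutogradMessageIdGenerator(0)
gen1 = AutogradMessageIdGenerator(1)
ids = [gen0.next_id() for _ in range(3)] + [gen1.next_id() for _ in range(3)]
```

Under this assumption no cross-worker coordination is needed, since each worker only increments its own counter.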
ghstack-source-id: 91240466
Test Plan: unit tests.
Differential Revision: D17148077
fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27136
Make the Python UDF serialization format binary plus tensor tables, so that tensors can be attached to the autograd graph and handled in the same way as for builtin operators.
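As a rough illustration of the "binary plus tensor table" idea, the sketch below uses the standard `pickle` persistent-id hooks to pull tensor-like objects out of the byte payload and into a side table. `FakeTensor`, `serialize`, and `deserialize` are hypothetical stand-ins so the example runs without PyTorch; the actual implementation in the PR may differ.

```python
import io
import pickle


class FakeTensor:
    """Stand-in for torch.Tensor (an assumption, so the sketch needs no PyTorch)."""

    def __init__(self, data):
        self.data = list(data)


class _TablePickler(pickle.Pickler):
    def __init__(self, file, table):
        super().__init__(file)
        self._table = table

    def persistent_id(self, obj):
        # Tensors are not written into the byte stream; only their table index is.
        if isinstance(obj, FakeTensor):
            self._table.append(obj)
            return ("tensor", len(self._table) - 1)
        return None  # everything else is pickled normally


class _TableUnpickler(pickle.Unpickler):
    def __init__(self, file, table):
        super().__init__(file)
        self._table = table

    def persistent_load(self, pid):
        kind, index = pid
        if kind != "tensor":
            raise pickle.UnpicklingError("unknown persistent id")
        return self._table[index]


def serialize(obj):
    """Return (binary payload, tensor table) for obj."""
    table = []
    buf = io.BytesIO()
    _TablePickler(buf, table).dump(obj)
    return buf.getvalue(), table


def deserialize(payload, table):
    """Rebuild the object from the payload and its tensor table."""
    return _TableUnpickler(io.BytesIO(payload), table).load()


tensor = FakeTensor([1.0, 2.0])
payload, table = serialize({"weights": tensor, "step": 3})
restored = deserialize(payload, table)
```

Keeping tensors in a separate table means the transport layer can treat them individually, which is what lets them be handled like the tensors of a builtin operator call.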
ghstack-source-id: 91156141
Test Plan: unit tests
Reviewed By: pritamdamania87
Differential Revision: D17405686
fbshipit-source-id: 4a8c9804f6ad239eb0655fa5daeb54580d4741fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25169
See #23110 for RRef design details. This commit only implements
RRef as a return value for builtin operators, and the RRef only communicates
between a user and the owner. More specifically, an RRef is first
created on the `dist.remote` caller, which is a user of the RRef.
The RRef user then sends a notification to the owner to report
the fork, and the owner uses a shared_ptr to keep
the RRef alive. When the user RRef is destructed on the caller,
another notification is sent to the owner, and the owner
can then drop its RRef as well.
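The lifetime protocol described above can be sketched in a few lines. This is a simplified single-process model, not the C++ implementation: `OwnerContext`, `UserRRef`, and the method names are hypothetical, and direct method calls stand in for the notification messages.

```python
class OwnerContext:
    """Owner-side bookkeeping: keeps a value alive while any user references it."""

    def __init__(self):
        self._values = {}  # rref_id -> value held by the owner
        self._users = {}   # rref_id -> set of user worker ids

    def store(self, rref_id, value):
        self._values[rref_id] = value

    def add_user(self, rref_id, user_id):
        # Models the notification sent when a user RRef is created.
        self._users.setdefault(rref_id, set()).add(user_id)

    def delete_user(self, rref_id, user_id):
        # Models the notification sent when a user RRef is destructed.
        users = self._users.get(rref_id, set())
        users.discard(user_id)
        if not users:
            self._users.pop(rref_id, None)
            self._values.pop(rref_id, None)  # the owner drops its reference

    def is_alive(self, rref_id):
        return rref_id in self._values


class UserRRef:
    """User-side handle; release() is explicit here instead of relying on a destructor."""

    def __init__(self, rref_id, user_id, owner):
        self._rref_id = rref_id
        self._user_id = user_id
        self._owner = owner
        self._released = False
        owner.add_user(rref_id, user_id)

    def release(self):
        if not self._released:
            self._owner.delete_user(self._rref_id, self._user_id)
            self._released = True


owner = OwnerContext()
owner.store("rref-1", 42)
user_rref = UserRRef("rref-1", "worker0", owner)
alive_while_held = owner.is_alive("rref-1")
user_rref.release()
alive_after_release = owner.is_alive("rref-1")
```

In this model the owner's value lives exactly as long as at least one user has reported a reference and not yet reported its deletion.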
Test Plan: Imported from OSS
Differential Revision: D17048343
Pulled By: mrshenli
fbshipit-source-id: 9dd3b3d0e4fd214c76fecdbed746a6d3029b3efd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24195
It is not efficient to use a string destination name in every
send. Moreover, when we add RRef later, RpcAgent will frequently check
RRef ownership. It will be slow as well if we have to go through string
comparison every time. This commit assigns each RpcAgent a unique
integer ID. In the Python send API, applications can provide either a
destination name or an id. If it is a string name, it will be converted to
an id by calling the get_id(workerName) API.
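A minimal sketch of the name-or-id resolution described above, assuming ids are assigned by position in a fixed list of worker names. `WorkerDirectory` and `resolve` are hypothetical; only the `get_id` name mirrors the API mentioned in this message.

```python
class WorkerDirectory:
    """Hypothetical lookup table mapping worker names to integer ids."""

    def __init__(self, names):
        self._name_to_id = {name: i for i, name in enumerate(names)}

    def get_id(self, worker_name):
        # One dictionary lookup at the API boundary; later checks compare ints only.
        return self._name_to_id[worker_name]

    def resolve(self, dst):
        """Accept either a worker name or an id and always return an id."""
        if isinstance(dst, int):
            if not 0 <= dst < len(self._name_to_id):
                raise ValueError(f"unknown worker id: {dst}")
            return dst
        return self.get_id(dst)


directory = WorkerDirectory(["worker0", "worker1", "worker2"])
by_name = directory.resolve("worker2")
by_id = directory.resolve(1)
```

After resolution, ownership checks and routing can use integer comparison instead of string comparison.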
Test Plan: Imported from OSS
Differential Revision: D16770241
Pulled By: mrshenli
fbshipit-source-id: fa56128a77a02a402dc6682474bc301dc1b7f43d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24138
Catch exceptions thrown on the server, send the exception message back to the client, and rethrow it there.
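The behavior can be sketched as follows, with a plain dictionary standing in for the response message. The function names and the response layout are hypothetical, and rethrowing as `RuntimeError` is an assumption; the message above only says that the exception message is sent back and rethrown.

```python
def handle_request(func, args):
    """Server side: run the call and capture any exception as a message string."""
    try:
        return {"ok": True, "value": func(*args)}
    except Exception as exc:
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}


def unwrap_response(response):
    """Client side: return the value, or rethrow the server's error message."""
    if response["ok"]:
        return response["value"]
    raise RuntimeError(response["error"])


good = unwrap_response(handle_request(divmod, (7, 2)))

try:
    unwrap_response(handle_request(divmod, (1, 0)))
    remote_error = None
except RuntimeError as exc:
    remote_error = str(exc)
```

Only the message text crosses the wire in this sketch, so the client sees the original exception type as part of the string rather than as a Python type.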
Reviewed By: mrshenli
Differential Revision: D16748748
fbshipit-source-id: ce18b3ea1b1d28645ec292f58aa0c818d93e559e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24502
Files in the distributed.rpc package mix snake-case and camel-case names. This
commit cleans that up so that all files use snake-case names.
ghstack-source-id: 88548990
Reviewed By: xush6528
Differential Revision: D16860155
fbshipit-source-id: 3a22a89bf6c4e11aac5849564fc53296a04d6a8b
Summary:
This diff adds support for Python user-defined functions (UDFs) over RPC for https://github.com/pytorch/pytorch/issues/23110. The workflow is:
1. Pickle the Python UDF.
2. Pass the pickle to C++.
3. C++ passes it over RPC from client to server.
4. The server calls the runPythonUDF() Python function to unpickle and run the Python UDF and pickle the UDF result, using the Python embedder.
5. The serialized result is passed back from server to client.
6. The client calls the loadPythonUDFResult() Python function to unpickle the result.
7. The result is returned to Python.
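The workflow above can be approximated in pure Python, with the C++ transport replaced by a direct function call. The helper names below are hypothetical snake-case analogues of runPythonUDF() and loadPythonUDFResult(), and their signatures are assumptions rather than the real ones.

```python
import operator
import pickle


def pickle_udf(func, args=(), kwargs=None):
    """Client side: bundle the callable and its arguments into bytes."""
    return pickle.dumps((func, args, kwargs or {}))


def run_python_udf(pickled_call):
    """Server side: unpickle the call, run it, and pickle the result."""
    func, args, kwargs = pickle.loads(pickled_call)
    return pickle.dumps(func(*args, **kwargs))


def load_python_udf_result(pickled_result):
    """Client side: unpickle the result returned by the server."""
    return pickle.loads(pickled_result)


def rpc_sync_local(func, args=(), kwargs=None):
    """Run the whole round trip in one process; the network hop is omitted."""
    request = pickle_udf(func, args, kwargs)
    response = run_python_udf(request)
    return load_python_udf_result(response)


result = rpc_sync_local(operator.add, (2, 3))
```

Note that `pickle` serializes functions by reference (module and qualified name), so in a real deployment the server must be able to import the same function; `operator.add` is used here because it is importable everywhere.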
For now, rpc_sync_builtin() and rpc_async_builtin() serve as temporary interfaces for builtin operator remote calls. They accept a qualified name string and execute builtin operators in C++ land.
rpc_sync() and rpc_async() accept only Python callables for now, which can be user-defined Python functions or builtin operator Python functions; these are executed in Python land.
Once we can resolve builtin operator Python callables to qualified name strings, we can merge rpc_sync_builtin() into rpc_sync().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23569
Test Plan: unit tests
Differential Revision: D16390764
Pulled By: zhaojuanmao
fbshipit-source-id: 2cf2c22a979646830b5581bd75eabf8b3cca564c
Summary:
Features:
* sync and async RPC for builtin operators
* RpcAgent API
* ProcessGroupAgent implementation
Goal:
* have a minimum working and testable RPC implementation
* make sure the RpcAgent API is sufficient for future ThriftAgent and TensorPipeAgent implementations
* The TensorPipe implementation might allocate multiple underlying communication channels of different types, and might also use streaming serialization/deserialization for large tensors. To support this requirement, the current implementation only converts a BuiltinOp into a Message, which contains a byte vector and a tensor table. It is up to the RpcAgent implementation to determine how to serialize a Message object.
* For ThriftAgent, as Thrift has its own request/response matching solution, the Message.id is no longer necessary. Hence the id can be dropped during serialization. All it needs to do is pass the response Message object to the Future returned by send(...).
* support blocking and non-blocking RequestCallback
* blocking means the callback won't return before sending out the response
* non-blocking can be achieved by enqueuing the `(from, request, RpcAgent&)` tuple and using a different thread to process it. That is why there is an `RpcAgent&` arg in the param list.
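The two callback styles can be contrasted with a small Python model. `FakeAgent` and both callbacks are hypothetical and only mimic the `(from, request, RpcAgent&)` parameter list; upper-casing the request is a placeholder for real request processing.

```python
import queue
import threading


class FakeAgent:
    """Stand-in for an RpcAgent that just records the responses it sends."""

    def __init__(self):
        self.sent = []
        self._lock = threading.Lock()

    def send(self, to, message):
        with self._lock:
            self.sent.append((to, message))


def blocking_callback(sender, request, agent):
    # Does not return until the response has been sent.
    agent.send(sender, request.upper())


class NonBlockingCallback:
    """Enqueues (sender, request, agent) and lets a worker thread respond later."""

    def __init__(self):
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def __call__(self, sender, request, agent):
        # Returns immediately; the agent travels with the request so the
        # worker thread knows where to send the response.
        self._queue.put((sender, request, agent))

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                if item is None:
                    return
                sender, request, agent = item
                agent.send(sender, request.upper())
            finally:
                self._queue.task_done()

    def wait(self):
        self._queue.join()

    def shutdown(self):
        self._queue.put(None)
        self._worker.join()


agent = FakeAgent()
blocking_callback("worker0", "ping", agent)
callback = NonBlockingCallback()
callback("worker1", "pong", agent)
callback.wait()
callback.shutdown()
```

The non-blocking variant needs the agent in its argument list precisely because the response is sent from a different thread, after the callback itself has returned.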
We are not exporting this diff until we finalize distributed autograd design and publish the API review publicly.
https://fb.quip.com/FabTAZKVgQpf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23228
ghstack-source-id: 87816717
Reviewed By: zhaojuanmao
Differential Revision: D15194693
fbshipit-source-id: 7adb600796613cde6073db6c227451b89940ecaf