pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-28 02:04:53 +08:00

Author	SHA1	Message	Date
Shen Li	2486b0ba82	Add Python RRef as args and return value (#25499) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25499 See #23110 for model parallel design details, and #26759 for the RRef protocol. This commit adds support for using RRef as Python UDF arguments and return value. RRefs can now be shared from owner to user, from user to owner, or from user to user. Limitations: 1. No implicit type conversion yet. (#27099) 2. No failure handling and retry. (#26116) 3. UDF is not yet blocked until all RRefs are confirmed. (#27098) 4. Internal RRef control messages are not idempotent yet. (#26116) 5. Cannot delete RRefs correctly when there are circular dependencies. (#27096) Main changes: 1. Added `SCRIPT_REMOTE_CALL` and `PYTHON_REMOTE_CALL` to `Message.h` to represent `dist.remote` invocations. 2. Added `SCRIPT_RREF_FETCH_CALL`, `PYTHON_RREF_FETCH_CALL`, `RREF_USER_ACCEPT`, `RREF_USER_DELETE`, `RREF_CHILD_ACCEPT`, and `RREF_FORK_REQUEST` to `Message.h` as internal RRef control messages. 3. Added new message request handling code to `functions.cpp`, and added the message formats in `script_remote_call.h`, `python_remote_call.h`, and `rref_proto.h`. 4. Added a `PyRRef` type in `py_rref.h` and `py_rref.cpp` which holds a shared pointer to the C++ `RRef` type. `PyRRef` wraps the C++ API and also implements RRef pickling and unpickling. RRef fork-related control messages are sent during the RRef pickling/unpickling procedure. 5. Updated `RRef.h` and `RRef.cpp` accordingly to support `py::object` RRefs. 6. RRef context (reference count, etc.) is tracked in `rref_context.h` and `rref_context.cpp`. Test Plan: Imported from OSS buck test mode/dev-nosan //caffe2/test:rpc_fork Differential Revision: D17184146 Pulled By: mrshenli fbshipit-source-id: a3a268efc087ac1ef489136ab957080382629265	2019-10-03 17:47:12 -07:00
Pritam Damania	fe4170bda8	Add send and recv backward functions for builtin operators RPC. (#25527) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25527 Master GH issue: https://github.com/pytorch/pytorch/issues/23110. This change builds upon https://github.com/pytorch/pytorch/pull/24876 and provides all the autograd hooks needed for a forward pass with distributed RPC for builtin operators. This change does not address distributed RPC for Python UDFs; that will be addressed in follow-up PRs. Summary of changes: 1. Attach send autograd functions when a request is sent from the client and when a response is sent from the server. 2. Attach receive autograd functions when a request is received on the server and when a response is received on the client. 3. Generate a globally unique autograd_message_id for each send/recv autograd function pair to uniquely identify them. ghstack-source-id: 91240466 Test Plan: unit tests. Differential Revision: D17148077 fbshipit-source-id: 192d8a3f552ed7cc939f55dcca332965c9bd3233	2019-10-03 01:18:46 -07:00
Yanli Zhao	631e2ee7a4	make python udf serialization format to be binary plus tensor tables (#27136) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27136 Make the Python UDF serialization format binary plus tensor tables, so that tensors can be attached to the autograd graph and handled in the same way as for builtin operators. ghstack-source-id: 91156141 Test Plan: unit tests Reviewed By: pritamdamania87 Differential Revision: D17405686 fbshipit-source-id: 4a8c9804f6ad239eb0655fa5daeb54580d4741fd	2019-10-02 00:10:32 -07:00
Shen Li	197fd4f707	Adding RRef as return value for builtin operators (#25169) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25169 See #23110 for RRef design details. This commit only implements RRef as return value for builtin operators, and RRef will communicate between a user and the owner. More specifically, an RRef is first created on the `dist.remote` caller, which is a user of the RRef. Then the RRef user sends a notification to the owner to report the fork, and the owner uses a shared_ptr to keep the RRef alive. When the user RRef is destructed on the caller, another notification will be sent to the owner, and the owner can then drop its RRef as well. Test Plan: Imported from OSS Differential Revision: D17048343 Pulled By: mrshenli fbshipit-source-id: 9dd3b3d0e4fd214c76fecdbed746a6d3029b3efd	2019-09-05 15:14:17 -07:00
Shen Li	1294e55c15	Assign each RpcAgent a unique ID, and use ID for sending RPC messages. (#24195) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24195 It is not efficient to use a string destination name in every send. Moreover, when we add RRef later, RpcAgent will frequently check RRef ownership. It will be slow as well if we have to go through string comparison every time. This commit assigns each RpcAgent a unique integer ID. In the Python send API, applications can provide either a destination name or an id. If it is a string name, it will be converted to an id by calling the get_id(workerName) API. Test Plan: Imported from OSS Differential Revision: D16770241 Pulled By: mrshenli fbshipit-source-id: fa56128a77a02a402dc6682474bc301dc1b7f43d	2019-08-29 19:19:11 -07:00
Yanli Zhao	1efdf57aa7	throw remote exception on client side (#24138) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24138 Catch an exception thrown on the server, send the exception message back to the client, and rethrow it there. Reviewed By: mrshenli Differential Revision: D16748748 fbshipit-source-id: ce18b3ea1b1d28645ec292f58aa0c818d93e559e	2019-08-20 09:40:35 -07:00
Shen Li	b6803d62fd	Use snake names for all files in distributed.rpc (#24502) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24502 Files in the distributed.rpc package mix snake and camel names. This commit cleans that up, and all files use snake names now. ghstack-source-id: 88548990 Reviewed By: xush6528 Differential Revision: D16860155 fbshipit-source-id: 3a22a89bf6c4e11aac5849564fc53296a04d6a8b	2019-08-19 10:58:59 -07:00
Yanli Zhao	ab39a55331	python udf over rpc (#23569) Summary: This diff supports Python user-defined functions over RPC for https://github.com/pytorch/pytorch/issues/23110. The workflow is: 1. pickle the Python UDF 2. pass the pickle to C++ 3. C++ passes it over RPC from client to server 4. the server calls the runPythonUDF() Python function to unpickle and run the Python UDF and pickle the UDF result using the Python embedder 5. pass the serialized result back from server to client 6. the client calls the loadPythonUDFResult() Python function to unpickle the result 7. return it to Python. Right now, rpc_sync_builtin() and rpc_async_builtin() serve as temporary interfaces for builtin operator remote calls; they accept a qualified name string and can execute builtin operators in C++ land. rpc_sync() and rpc_async() accept Python callables only right now; these can be user-defined Python functions or builtin operator Python functions, and they are executed in Python land. Once we can resolve builtin operator Python callables to qualified name strings, we can merge rpc_sync_builtin() into rpc_sync(). Pull Request resolved: https://github.com/pytorch/pytorch/pull/23569 Test Plan: unit tests Differential Revision: D16390764 Pulled By: zhaojuanmao fbshipit-source-id: 2cf2c22a979646830b5581bd75eabf8b3cca564c	2019-08-14 23:13:33 -07:00
Shen Li	8b349073ce	sync and async torch.distributed.rpc for builtin operators (#23228) Summary: Features: * sync and async RPC for builtin operators * RpcAgent API * ProcessGroupAgent implementation Goal: * have a minimum working and testable RPC implementation * make sure the RpcAgent API is sufficient for future ThriftAgent and TensorPipeAgent implementations * For the tensor pipe implementation, it might allocate multiple underlying communication channels with different types, and might also use streaming serialization/deserialization for large tensors. To support this requirement, the current implementation only converts a BuiltinOp into a Message which contains a byte vector and a tensor table. It is up to the RpcAgent implementation to determine how it would like to serialize a Message object. * For ThriftAgent, as Thrift has its own request/response matching solution, the Message.id is no longer necessary. Hence the id can be dropped during serialization. All it needs to do is to pass the response Message object to the Future returned by send(...). * support blocking and non-blocking RequestCallback * blocking means the callback won't return before sending out the response * non-blocking can be achieved by enqueuing the `(from, request, RpcAgent&)` tuple and using a different thread to process it. That is why there is an `RpcAgent&` arg in the param list. We are not exporting this diff until we finalize the distributed autograd design and publish the API review publicly. https://fb.quip.com/FabTAZKVgQpf Pull Request resolved: https://github.com/pytorch/pytorch/pull/23228 ghstack-source-id: 87816717 Reviewed By: zhaojuanmao Differential Revision: D15194693 fbshipit-source-id: 7adb600796613cde6073db6c227451b89940ecaf	2019-08-06 16:03:01 -07:00

9 Commits
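
The pickle round trip described in ab39a55331, together with the exception forwarding from 1efdf57aa7, can be sketched in plain Python. This is only an illustration under stated assumptions, not the PyTorch implementation: it uses the standard `pickle` module, runs client and server in one process with no RPC transport, and the helper names `serialize_call`, `run_python_udf`, and `load_python_udf_result` are hypothetical stand-ins loosely modeled on the `runPythonUDF()` and `loadPythonUDFResult()` functions named in the commit message. The `(ok, value)` response tuple is likewise an assumption made for the sketch.

```python
import operator
import pickle


def serialize_call(func, args=(), kwargs=None):
    """Client side (hypothetical helper): pickle the callable and its arguments."""
    return pickle.dumps((func, args, kwargs or {}))


def run_python_udf(payload):
    """Server side (hypothetical helper): unpickle, run, and pickle the outcome.

    An exception is not allowed to escape; its message is captured and
    shipped back so that the client can rethrow it.
    """
    func, args, kwargs = pickle.loads(payload)
    try:
        outcome = (True, func(*args, **kwargs))
    except Exception as exc:
        outcome = (False, f"{type(exc).__name__}: {exc}")
    return pickle.dumps(outcome)


def load_python_udf_result(payload):
    """Client side (hypothetical helper): unpickle the result or rethrow the error."""
    ok, value = pickle.loads(payload)
    if not ok:
        raise RuntimeError(value)
    return value


# In the real flow the two byte strings would cross the RPC boundary;
# here they are simply handed from one function to the next.
request = serialize_call(operator.add, (2, 3))
response = run_python_udf(request)
print(load_python_udf_result(response))  # prints 5
```

Note that `pickle` serializes functions by reference, so the callable has to be importable under the same qualified name on the receiving side; the sketch uses `operator.add` for that reason.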