Commit Graph

132 Commits

Author SHA1 Message Date
7c07c39845 [torch.distributed.rpc] Install method docstrings from PyRRef to RRef (#40461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40461

It turned out `:inheried-members:` (see [doc](https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#directive-autoclass)) is not really usable.

Because pybind11 generates a docstring that writes `self` as parent class, `rpc.PyRRef`, type.

As a workaround, I am pulling docstrings on parent-class, `PyRRef` class, into subclass, `RRef`. And do surgery on the docstring generated by pybind11.

{F241283111}

ghstack-source-id: 106472496

Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \
-r test_rref_str

buck build mode/dev-nosan //caffe2/test/distributed/rpc/:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \
-r test_return_local_rrefs

buck test mode/dev-nosan //caffe2/torch/fb/distributed/model_parallel/tests:test_elastic_averaging -- 'test_elastic_averaging_center \(caffe2\.torch\.fb\.distributed\.model_parallel\.tests\.test_elastic_averaging\.TestElasticAveragingCenter\)'

P134031188

Differential Revision: D7933834

fbshipit-source-id: c03a8a4c9d98888b64492a8caba1591595bfe247
2020-06-23 19:58:36 -07:00
14f7e95c1a Add prefix of remote events for RPC profiling (#40066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40066

Builds on top of the previous PR to ensure that all remotely profiled events are prefixed with the key for the RPC that generated them.

The key is generated by the result of `_build_rpc_profiling_key` in `rpc/internal.py` and prefixed onto the event name. In order to do this, we set the current-key when creating the RPC in Python, retrieve the currently-set key in C++ and save a GloballyUniqueId -> key mapping to an in-memory map. When we receive an RPC with profiling information, we expect to receive this ID back, and look up the corresponding profiling key in the map.

The key is then added to all the remote events.

Tested by adding tests to ensure the key is added to all the remote events. Also added a UT which tests in under the multi-threading scenario, to ensure that the mapping's correctness is maintained when several RPCs are in the process of being created at once.
ghstack-source-id: 106316106

Test Plan: Unit test

Differential Revision: D22040035

fbshipit-source-id: 9215feb06084b294edbfa6e03385e13c1d730c43
2020-06-22 11:01:07 -07:00
5d0044389a Minor RPC doc improvements (#40305)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40305

Test Plan: Imported from OSS

Differential Revision: D22144304

Pulled By: mrshenli

fbshipit-source-id: 1c8a9648043eabaf909c6e4ae116672396a9f0f5
2020-06-19 15:34:58 -07:00
caf0c286b8 Fix RPC API doc links (#40299)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40299

Test Plan: Imported from OSS

Differential Revision: D22143156

Pulled By: mrshenli

fbshipit-source-id: c11848ebfe8863d59509a0fbc042eed71a58e514
2020-06-19 15:34:53 -07:00
034eddca01 Fix typos in RPC Docs (#40219)
Summary:
Environment variable MASTER_ADDRESS and MASTER_port should be MASTER_ADDR and MASTER_PORT respectively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40219

Differential Revision: D22116585

Pulled By: mrshenli

fbshipit-source-id: d312ae66210b0a16ec3ab1f468b1654bb0a75a0f
2020-06-18 11:40:11 -07:00
f3f30d4354 [JIT x RPC] Consolidate RRef type class and RRef impl class (#35694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35694

close https://github.com/pytorch/pytorch/issues/35110

Differential Revision: D7881729

fbshipit-source-id: eedda8f1b7510491886d469efeed4e002bb8b991
2020-06-18 07:46:38 -07:00
3fb1e73a4e Add rpc.async_execution support for rpc.remote on script functions (#39758)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39758

Test Plan: Imported from OSS

Differential Revision: D21963789

Pulled By: mrshenli

fbshipit-source-id: f16f464ba01401b160cc4d3daf036e4bc806d7ea
2020-06-10 13:17:07 -07:00
9bfb91b50b Fix possible deadlock in _wait_all_workers (#39535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39535

This is my understanding of what could happen: on workerN (N != 0), `_wait_all_workers_sequence_id_to_states`, which is a `defaultdict`, is accessed twice: once in the body of `_wait_all_workers` (by the "main thread" of workerN) and once in `_set_proceed_shutdown_signal`, called by worker0 through a RPC call. I think the two could race and access the `_wait_all_workers_sequence_id_to_states` at the same time, and thus create two separate copies of `WaitAllWorkersStates`. One of those threads would wait  on the event of one copy, but the other thread would set the event of the other copy. This lead to a deadlock, as the main thread would end up waiting forever.
ghstack-source-id: 105283327

Test Plan: I added additional logging in those functions, ran a stress test of the RPC test suite, based on the logs I suspected that this could be the issue, fixed it and re-run the stress test and didn't see the bug anymore. This is admittedly not very convincing evidence, as I may just have been lucky that second time...

Differential Revision: D21889752

fbshipit-source-id: 05ec710bd2930313e1480ae896b4b2f5f503aa17
2020-06-05 02:42:32 -07:00
8a6914ddb2 Add @rpc.functions.async_execution for rpc.remote (#39486)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39486

Test Plan: Imported from OSS

Differential Revision: D21871422

Pulled By: mrshenli

fbshipit-source-id: 3c432b7718a47732b2aee064c554f6bdcc5c95c1
2020-06-04 22:38:35 -07:00
8b2bb02e09 Implement timeout support for RRefs (#38590)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38590

This PR implements timeout semantics for RRef for parity with rpc_sync and rpc_async. How it works:

- Timeout parameter is added to rpc.remote. If the rpc.remote call times out, note that the error won't be raised to the user in that call, as it is not blocking (similar to rpc_async). Instead, the timeout error will be raised the next time the RRef is used (either by pickling or to_here call).
- Error handling semantics are added to RRef to deal with the timeout errors. Previously, if there was an error creating the OwnerRRef, the callback on the local user would throw an error in a callback, resulting in an `std::terminate`. Instead of this, the error is now caught and surfaced to the user the next time the RRef is used. As part of this, we have added an `RPCErrorType` enum and defined RRef error handlers to handle the `RPCErrorrTypes` (currently just timeout and unknown)
- A timeout parameter is added to `to_here()` which gives the user control over the max amount of time it can block for.
- `ctx.prepareChildForFork()` which is called when the RRef is pickled (i.e. used as an arg over RPC) checks if the `rpc.remote()` call had timed out, and if so, raises that error to the user.
- Tests are added, primarily via delay injection.
ghstack-source-id: 105232837

Test Plan: CI

Differential Revision: D21588165

fbshipit-source-id: c9f9e8aa3521012ea1de3e0f152a41afdf8b23f3
2020-06-04 02:14:42 -07:00
67cea74dd3 Add rpc.async_function decorator for TorchScript functions (#39267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39267

When combined with `torch.jit.script`, the order of decorators matter.
`rpc.functions.async_execution` must be the outmost one. The
`async_execution` decorator will store the TorchScript function in
attribute `_wrapped_async_rpc_function` on the wrapper function, and
pass this wrapped TorchScript function (i.e., an instance of
`torch.jit.ScriptFunction`) to RPC. The caller will mark the ScriptCall
with `isAsyncExecution=true`, and the callee will extract the returned
`Future` in C++ and install subsequent processing as a callback to
that `Future`.

Test Plan: Imported from OSS

Differential Revision: D21792688

fbshipit-source-id: de095eb148d21e9114a478e9e6047c707d34fd07
2020-06-03 22:27:15 -07:00
a05ef17e46 Add rpc.functions.async_execution decorator for rpc_sync/rpc_async (#39216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39216

The `rpc.functions.async_execution` decorator specifies that the
wrapped function is guaranteed to return a `torch.futures.Future`.
The decorator adds a `_wrapped_async_rpc_function` attribute to
the wrapper function. The caller retrieves this information and
then sets `isAsyncFunction` argument accordingly which is later
added to PythonCall RPC message as a field. On the callee side,
if the PythonCall carries an asynchronous function, it will cast
the function's return value to a jit::PythonFutureWrapper object,
and then install response creation and communication as a callback
on the that jit::PythonFutureWrapper.

For applications, this feature is useful when a function needs to
wait for IO or additional singaling. In those cases, marking the
user function as `rpc.functions.async_execution` will prevent it
from blocking one thread on callee for too long.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D21779962

fbshipit-source-id: 6b6aa698bf6f91dad6ed2a7ee433df429b59e941
2020-06-02 23:21:25 -07:00
a6f0051db2 Fix test_get_and_set_timeout for TensorPipe Agent (#39353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39353

This test failed with TSAN since the shortened timeout prevented all
messages from being processed within the timeout during Phase 1 of
wait_all_workers during RPC shutdown. Phase 2 already had a longer timeout, so
we extend this to Phase 1 as well.
ghstack-source-id: 105045926

Test Plan: Ran the test_get_and_set_timeout with TSAN

Differential Revision: D21826783

fbshipit-source-id: 7edfdeb50169b31e997dd36a3fd8eea0e9ae7189
2020-06-02 12:01:11 -07:00
6736a76cec Back out "[RPC] [Minor] RPC entry point cleanup"
Summary:
Original commit changeset: b509c47fb612

(Note: this ignores all push blocking failures!)

Reviewed By: xush6528

Differential Revision: D21669711

fbshipit-source-id: e452a513a2d22eaa3bffa333fdb3277fabc24b41
2020-05-20 15:35:24 -07:00
befc76bb65 [RPC] [Minor] RPC entry point cleanup (#34292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34292

This is to finish a cleanup request from https://github.com/pytorch/pytorch/pull/34733#discussion_r392479110.

ghstack-source-id: 104361618

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_class_rref_in_py_and_use_in_script

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_module_rref_in_py_and_use_in_script
```

Differential Revision: D7436759

fbshipit-source-id: b509c47fb612ec3486ff1199c005eba69480ee05
2020-05-19 14:23:11 -07:00
4d4895a62a Use Future's then() API to fix RPC profiling (#38352)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38352

Fixes the RPC profiling by using the `then()` API added in https://github.com/pytorch/pytorch/pull/37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value)

To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.

Re-enabled profiling tests and stress tested them 1000 times to verify the fix
ghstack-source-id: 104086114

Test Plan: Re-enabled profiling tests

Differential Revision: D21506940

fbshipit-source-id: 35cde22f0551c825c9bc98ddc24cca412878a63a
2020-05-14 12:52:45 -07:00
3d0279862d Consolidate builtin/python_udf RPC to return ivalue::Future like torchscript RPC does (#35154)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35154

This is for issue https://github.com/pytorch/pytorch/issues/34999.

close https://github.com/pytorch/pytorch/issues/34999.

https://github.com/pytorch/pytorch/issues/34997 need more work.

This will make a few work items easier, like 1) Dist autograd profiler, 2) JIT annotation for Future.

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_rref_forward_chain --stress-runs 100

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par \
-r test_call_method_on_rref
```

buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- 'test_rref_proxy_class \(fb\.test_rpc_fork\.RpcTestWithFork\)' --stress-runs 100

test_rref_proxy_reuse
test_handle_send_exceptions

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_script_call_python_return_future
```

Differential Revision: D7722184

fbshipit-source-id: bd92b855bfea4913d6672700590c57622fa86e0e
2020-05-08 21:28:56 -07:00
7bd2014eec [resubmit][rpc] per-RPC timeouts for rpc_sync and rpc_async (#34650)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34650

Resubmit of https://github.com/pytorch/pytorch/pull/33840, which was overly eager in the sense that it deleted a lot of code that we didn't want to get rid of yet (default timeout handling).

This PR adds an optional argument into `rpc_sync` and `rpc_async` as well as `RpcAgent::send()` that allows the user to specify a timeout for an RPC to override the default set timeout. If the user does not specify this argument, then the currently set default RPC timeout given in the RPC constructor or by `rpc.set_rpc_timeout()` is used. Otherwise, we use the passed in timeout.

This diff does not address:
1) timeout support when called rpc.rpc_async is called as a JIT operator. For this to work, we would need to change the logic in `register_distributed_ops` to pass in this timeout to `rpcTorchscript`. One more issue is that torchscript doesn't support the timedelta object. This will be done in a follow up PR as it requires a fair amount of changes to the argument parsing logic.
2) Per-RPC timeouts for internal messages or `rpc.remote()`. A follow-up diff will address the latter with the approach of raising the timeout error at the earliest next possible time to the user, such as when the next time the RRef is forked or `to_here` is called

Added unit tests to confirm the current behavior
ghstack-source-id: 102622601

Test Plan: Added unit tests in rpc_test

Differential Revision: D20376953

fbshipit-source-id: 9fb3f147520588308ab50dd33286255658d76d47
2020-04-22 13:00:42 -07:00
e75fb4356b Remove (most) Python 2 support from Python code (#35615)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35615

Python 2 has reached end-of-life and is no longer supported by PyTorch.
Now we can clean up a lot of cruft that we put in place to support it.
These changes were all done manually, and I skipped anything that seemed
like it would take more than a few seconds, so I think it makes sense to
review it manually as well (though using side-by-side view and ignoring
whitespace change might be helpful).

Test Plan: CI

Differential Revision: D20842886

Pulled By: dreiss

fbshipit-source-id: 8cad4e87c45895e7ce3938a88e61157a79504aed
2020-04-22 09:23:14 -07:00
752d3c281a [profiler] Allow record_function ctx manager to profile futures (#35055)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35055

This is the first step to improving the way RPCs are profiled as suggested by Ilia. For now, since RPC can return two different types of futures, we have to implement two different code paths, one for the python eager mode future and one for the jit future.

This diff implements the python eager part. We have defined a method `_call_end_callbacks_on_future` that takes in a future and schedules a `RecordFunction` to be completed as a callback on the future.

Once https://github.com/pytorch/pytorch/pull/35039 lands, we can implement the JIT codepath by registering an operator that takes a `Future(t)` as well.

These code paths will be merged once the futures are merged.
ghstack-source-id: 102478180

Test Plan: Added unit tests

Differential Revision: D20452003

fbshipit-source-id: 1acdcb073bd1f63d6fb2e78277ac0be00fd6671d
2020-04-20 12:37:54 -07:00
f59e646faa [rpc] Allow profiling in RPC to work with torchscript function invocations (#36275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36275

Calling a TorchScript function from within RPC was added after initial
support for the profiler with RPC, hence, we were not recording torchscript
funtions invoked under RPC correctly. This diff passes the `RecordFunction` to
the `_invoke_torchscript..` calls similar to what is done for builtin and UDFs.

However, this is only a temporary solution. We will be removing the use of
`RecordFunction` as a standalone in the RPC code in
https://github.com/pytorch/pytorch/pull/35055. This diff is to unblock
recording of torchscript functions in the meantime.
ghstack-source-id: 101800134

Test Plan:
Added tests for calling a script function with builtin, sync, and
asyc. The output looks like below:

```
------  ---------------  ---------------  ---------------  ---------------  ---------------
> Name                                                                                                        Self CPU
total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls
> ----------------------------------------------------------------------------------------------------------  ---------
------  ---------------  ---------------  ---------------  ---------------  ---------------
> rpc_sync#__torch__.torch.testing._internal.distributed.rpc.rpc_test.my_script_func(worker1 -> worker2)      99.92%
        1.056s           99.92%           1.056s           1.056s           1
> select                                                                                                      0.04%
        383.661us        0.04%            383.661us        95.915us         4
> fill_                                                                                                       0.02%
        210.966us        0.02%            210.966us        52.741us         4
> to                                                                                                          0.00%
        26.276us         0.00%            26.276us         26.276us         1
> empty                                                                                                       0.02%
        159.802us        0.02%            159.802us        79.901us         2
> set_                                                                                                        0.01%
        93.818us         0.01%            93.818us         93.818us         1
> ----------------------------------------------------------------------------------------------------------  ---------
------  ---------------  ---------------  ---------------  ---------------  ---------------
> Self CPU time total: 1.057s
```

Note that we use `torch.jit._qualified_name` to get the name of the script fn.

Differential Revision: D20930453

fbshipit-source-id: c6d940aa44fcd9dd8a1a29c156aa19e0d8428d60
2020-04-08 23:58:36 -07:00
82dd01150c Fix race during RPC shutdown. (#36113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36113

As part of debugging https://github.com/pytorch/pytorch/issues/35863,
I discovered that the unit test would timeout during clean shutdown.

Looking into this further, it looks like there is a race in
`_on_leader_follower_report_shutdown_intent` when multiple followers call the
same method on the leader.

To fix this, I've ensured we have an appropriate lock in
`_on_leader_follower_report_shutdown_intent` to guard against this.

I ran the test 500 times to validate that this fix works.

Closes #35863
ghstack-source-id: 101641463

Test Plan:
1) waitforbuildbot
2) Ran the test 500 times.

Differential Revision: D20884373

fbshipit-source-id: 9d580e9892adffc0c9a4c2e832881fb291a1ff16
2020-04-08 14:12:33 -07:00
ac639d927a Reland "[RPC] Use qualified name str directly in RPC torch script code path" (#35489)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35489

Relanding https://github.com/pytorch/pytorch/pull/34733.

Fix is in https://github.com/pytorch/pytorch/pull/34988

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_class_rref_in_py_and_use_in_script

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork && \
buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_module_rref_in_py_and_use_in_script
```

Differential Revision: D20661748

fbshipit-source-id: d550daab8d689d0a9aa2450f3bdb7417ab79dae2
2020-03-26 23:41:51 -07:00
5d92a6cc30 Revert D7778113: Reland "[RPC] Use qualified name str directly in RPC torch script code path"
Test Plan: revert-hammer

Differential Revision:
D7778113

Original commit changeset: b830c03ac946

fbshipit-source-id: ef08b287a6db58320c738cde0c99b3333f5724eb
2020-03-19 06:05:23 -07:00
d616cad676 Reland "[RPC] Use qualified name str directly in RPC torch script code path" (#34962)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34962

Relanding #34733. Fix is in https://github.com/pytorch/pytorch/pull/34988.

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_class_rref_in_py_and_use_in_script

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_module_rref_in_py_and_use_in_script
```

```
buck test mode/dev //caffe2/test/distributed/rpc/jit:rpc_fork_thrift -- test_return_local_script_module_rref_in_py_and_use_in_script
```

Differential Revision: D7778113

fbshipit-source-id: b830c03ac9463075fca248eba75be364b0e8b080
2020-03-18 22:25:09 -07:00
b35e544772 Minor fixes for RPC API doc (#34955)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34955

Test Plan: Imported from OSS

Differential Revision: D20512262

Pulled By: mrshenli

fbshipit-source-id: 86ed099638fd32dc8fbde5a6f284239b146fd5e9
2020-03-18 11:20:32 -07:00
d29f450e63 Revert D20442573: [RPC] Use qualified name str directly in RPC torch script code path
Test Plan: revert-hammer

Differential Revision:
D20442573

Original commit changeset: 87f8b7d94adc

fbshipit-source-id: db0f10c28352d2b3ca21b5357e8e09c01a50018c
2020-03-18 11:00:09 -07:00
552f9d3a68 Minor fixes for RPC API docs (#34890)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34890

Test Plan: Imported from OSS

Differential Revision: D20491788

Pulled By: mrshenli

fbshipit-source-id: 95a9821d70e0afe51f586b891845b3106c7105ce
2020-03-17 17:43:55 -07:00
6446ccce76 Adding warnings for async Tensor serialization in remote and rpc_async (#34885)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34885

Test Plan: Imported from OSS

Differential Revision: D20491279

Pulled By: mrshenli

fbshipit-source-id: 8c861e7c7e9ea39f9427f80bc4e75c72c0087366
2020-03-17 17:43:35 -07:00
71f02a481b [RPC] Avoid polluting Python root logger on importing "torch" module (#34871)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34871

We used to configure root logger in RPC module. A stream handler is added to `root.handlers`. This is not desired behavior for pytorch users. We should instead keep the root logger handler list untouched.

We can configure the logger local to the rpc module, set it's log level, so it doesn't use it's ancestor, which is usually the root which has no stream handlers in most cases.
https://docs.python.org/3/library/logging.html#logging.Logger.setLevel

And add a stream handler to make it output to stdout, even if the root handlers is not configured and has an empty list.
https://docs.python.org/3/library/logging.html#logging.Logger.addHandler
https://docs.python.org/3/library/logging.handlers.html#logging.StreamHandler
ghstack-source-id: 100322141

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_wait_all_workers
```

Differential Revision: D7677493

fbshipit-source-id: 88a66079e7348c79a7933e3527701917cbebb7ba
2020-03-17 16:07:06 -07:00
ecd7c0f84c [RPC] Use qualified name str directly in RPC torch script code path (#34733)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34733

simplify
ghstack-source-id: 100292435

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_class_rref_in_py_and_use_in_script

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par \
-r test_return_local_script_module_rref_in_py_and_use_in_script
```

Differential Revision: D20442573

fbshipit-source-id: 87f8b7d94adc03544f8e2955d01cd4702bb31a34
2020-03-17 10:28:52 -07:00
f058c03b15 Disallow sending CUDA tensors over RPC for current RPC agents. (#33604)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33604

For our current RPC agents, this PR disallows sending CUDA tensors
over RPC and asks users to copy them explicitly to CPU. Currently, this seems
to be the easiest contract to guarantee for our current RPC agents, otherwise
if we do support this transparently it gets a little tricky in terms of whether
a CUDA tensor on the client should be sent to CPU/GPU of the remote end and
also which GPU device on the remote end.

In the future, the TensorPipe RPC agent can have its own specific handling of
CUDA tensors.

Closes https://github.com/pytorch/pytorch/issues/28881
ghstack-source-id: 100166120

Test Plan: waitforbuildbot

Differential Revision: D20020183

fbshipit-source-id: ca4d43d2a24e8fcd3a60b21e654aa0e953e756cb
2020-03-15 15:01:46 -07:00
ad4bc8c9b8 Best-effort Error Detection for Using Deleted UserRRefs (#34673)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/34673

Test Plan: Imported from OSS

Differential Revision: D20427839

Pulled By: mrshenli

fbshipit-source-id: b1b12ca42a9ed5294806c53fa7d6f54e7dc8b188
2020-03-12 21:39:15 -07:00
4e07c35679 Delete all user forks tracked in RRefContext before graceful shutting down (#31893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31893

In order to resolve the issue summarized in https://github.com/pytorch/pytorch/issues/31325.

The overal solution is to proactively send out delete fork messages from user nodes, before user nodes detecting rref leaks.

As the first step, we want to have a weak ref tracker to track all user rrefs.
ghstack-source-id: 100023142

Test Plan:
V22 is the version that make User to wait on delete UseerRRef message.

# Unit tests

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_nested_rref_stress --stress-runs 100

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_nested_rref_stress

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par - r test_rref_forward_chain

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_non_garbage_collected_user_rref_due_to_local_circular_dependency
```

Reviewed By: mrshenli

Differential Revision: D19292254

fbshipit-source-id: 92c3e8d0b00f183c5e22f163bdca482cc25a1ce9
2020-03-12 10:23:08 -07:00
f4da78f1b3 Remove RPC TorchScript private API (#33978)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33978

We can directly pass user_callbale to rpc_async API in TorchScript. There is no need to have private API for taking qualified name.
ghstack-source-id: 99600360

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork \
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_torchscript_functions_not_supported
```

Differential Revision: D7420993

fbshipit-source-id: 228c15b21848e67418fab780e3fd6a1c6da5142d
2020-03-05 18:35:05 -08:00
7d01888a75 [JIT] Register rpc.rpc_async(..) as a JIT operator (#33329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33329

# Use case

```
torch.jit.script
def send_rpc_async(dst_worker_name, user_callable_qual_name, tensor):
    # type: (str, str, Tensor) -> None
    rpc._rpc_async_torchscript(
        dst_worker_name, user_callable_qual_name, args=(tensor,)
    )
```

# Problem

```
torch.jit.frontend.NotSupportedError: keyword-arg expansion is not supported:
  File "/data/users/shihaoxu/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/rpc/rpc_spawn#binary,link-tree/torch/distributed/rpc/api.py", line 722
    args = args if args else ()
    kwargs = kwargs if kwargs else {}
    fut = _invoke_rpc_torchscript(to, qualified_name, *args, **kwargs)
                                                               ~~~~~~ <--- HERE
    return fut
```

# Solution

Register `rpc.rpc_async(..)` as a JIT operator to handle variable-length argument list.

# Plan

This PR is the required changes to make `rpc.rpc_async(..)` a JIT prim operator, which can dynamically handle different number of arguments.

- Register "prim::rpc_async" as a `Symbol` in "interned_string.h"
- Add a if branch in "python_sugared_value.cpp" `toSugarValue(py::object, ..)` entry utility function to set up how JIT frontend convert `torch.distributed.rpc.rpc_async(..)` Python function (Python object) into a `SpecialFormValue` (IR SugaredValue).
- Add a switch case for "prim::rpc_aynsc" Symbol in "ir_emitter.cpp" and `emitApplySpecialForm(..)` to set up how JIT compiler provides inputs to the "prim::rpc_aynsc" Operator.
- Register "prim::rpc_async" as a `jit::Operator` and provide implementation in "register_distributed_ops.cpp".

Notice, since the distributed module is an optional part when building PyTorch. The code to be added in this PR should be wrapped within preprocessing maco.
```
#ifdef USE_DISTRIBUTED
new code here
#endif
```

Test Plan:
Items that need to be confirmed in the test cases

https://fb.quip.com/DCvdA9ZLjeO0

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_fork  \
\
&& buck-out/gen/caffe2/test/distributed/rpc/jit/rpc_fork\#binary.par -r test_call_python_function_remotely_from_script_not_supported
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:rpc_spawn
```

```
buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:layer_norm_op_test-2.7 -- test_layer_norm_op_jit
```

Differential Revision: D5738300

fbshipit-source-id: a4604fe762e00be062dc8232ca9790df31fb2074
2020-03-03 19:57:42 -08:00
d613bd0522 [rpc][easy] move unnecessary python call directly to pybind (#33174)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33174

Closes https://github.com/pytorch/pytorch/issues/32780. It looks like
this is the only callsite where we do `_get_current_rpc_agent().foo()`, and we
can do this directly in the pybind layer to save some overhead.
ghstack-source-id: 98200664

Test Plan: All UTs should pass.

Differential Revision: D19828786

fbshipit-source-id: 5c34a96b5a970e57e6a1fdf7f6e54c1f6b88f3d8
2020-02-13 09:14:13 -08:00
b0923acb29 Reduce RPC branches for Python/BuiltinOp/TorchScript (#32689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32689

As described in https://github.com/pytorch/pytorch/issues/32565
ghstack-source-id: 97440343

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_script_functions_not_supported

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_script_functions_not_supported
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork

buck-out/gen/caffe2/test/distributed/rpc/dist_autograd_fork\#binary.par -r test_backward_simple_script_call
```

Differential Revision: D5721814

fbshipit-source-id: 9079e81764be1e7c7b85dd72a18c76f3ecfd2547
2020-01-30 01:19:35 -08:00
b9f764b1c7 Use the C++ current RpcAgent pointer to eliminate the unnecessary argument passing from Python world (#32635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32635

With the source of truth of current RPC agent moved to C++ world, there is no point of passing current RPC agent from Python world to C++ world.
ghstack-source-id: 97293316

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_process_group_debug_info
```

Differential Revision: D5703519

fbshipit-source-id: ef7c28bdb1efd293eb6cafe0b0fca7ca80fa08a6
2020-01-27 20:24:32 -08:00
5c8535d5b0 Make C++ RpcAgent::currentRPCAgent_ the source of truth of current RPC Agent (#32633)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32633

There were 2 sources of current RPC agent.

- One is in Python world, `torch.distributedrpc.api._agent`.
- The other is in C++ world, `RpcAgent::defaultRpcAgent_`

Setting Python `_agent` to `None`, does not necessarily reset the C++ `defaultRpcAgent_` to `nullptr`.

i.e.
```
 torch.distributedrpc.api._agent = None
```
does not translate to
```
RpcAgent::defaultRpcAgent_ = nullptr
```

This PR is to remove this ambiguity, and use the C++ pointer as source of truth.

The solution is to leverage a pybind11 behavior that it implicitly casts C++ `shared_ptr<RpcAgent>(nullptr)` to Python `None`.
ghstack-source-id: 97293315

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_duplicate_name

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_process_group_debug_info
```

```
buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_remote_module

buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_embedding

buck test mode/dev-nosan //caffe2/torch/fb/distributed/modules/tests:test_sharded_pairwise_attention_pooling

buck test mode/dev-nosan //caffe2/torch/fb/distributed/pytorch/tests:test_rpc
```

Differential Revision: D5733066

fbshipit-source-id: b3e6032ee975f19ca556497edbbf40b517b25be8
2020-01-27 19:34:12 -08:00
1695915371 Make _wait_all_workers() support being called for multiple times (#32624)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32624

We need this PR to resolve the issue mentioned in https://github.com/pytorch/pytorch/issues/31325#issuecomment-574918917.

The solution is for each `_wait_all_workers()` call, there is a sequence ID added, to identify different calls.
ghstack-source-id: 97277591

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_wait_all_workers

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_wait_all_workers
```

Differential Revision: D5739520

fbshipit-source-id: a64131e09c365179624700514422f5375afe803f
2020-01-27 17:04:02 -08:00
6ad9e5c70d Support TorchScript call over remote API (RRef) (#32466)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32466

It's a follow-up work of https://github.com/pytorch/pytorch/pull/32197.

In https://github.com/pytorch/pytorch/pull/32197, `rpc.sync_rpc(..) `and `rpc.rpc_async(..)` support taking a TorchScript annotated Python function as the user function for RPC.

This PR extend along this direction by making `rpc.remote(..)` support taking a TorchScript annotated Python function as well.

ghstack-source-id: 97211168

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork -- test_script_function_exception

buck build mode/dev-nosan //caffe2/test/distributed/rpc:rpc_fork

buck-out/gen/caffe2/test/distributed/rpc/rpc_fork\#binary.par -r test_script_function_exception
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork -- test_backward_simple_script_call

buck build mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork

buck-out/gen/caffe2/test/distributed/rpc/dist_autograd_fork\#binary.par -r test_backward_simple_script_call
```

Differential Revision: D19440633

fbshipit-source-id: d37f6dcdc0b80d35ac7bcba46ad6f9b831c3779b
2020-01-25 02:18:27 -08:00
9ce25cce91 add an option to record time spent waiting for GIL (#30842)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30842

We'd like to profile the time spent on GIL acqusiition to debug
performance issues.

Test Plan: Unit tests pass.

Differential Revision: D18837590

fbshipit-source-id: 925968f71c5fb96b8cd93f1eab4647602d2617d1
2020-01-21 11:29:23 -08:00
58234c0254 support torch script call over rpc (#32197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32197

This is to reland https://github.com/pytorch/pytorch/pull/30063, the main change is to match a general exception and grep "pickle" error word in "test_script_functions_not_supported" unit test, as Python 3.5 and Python 3.6 throw different types of errors with different error message for the rpc call in the unit test.
[test all]This diff makes following changes:
1. Providing a new set of python rpc privated APIs, they can accept an annotated TorchScript call and this call can be serialized, deserialized and executed in C++ without GIL. These privated APIs will be binded to JIT in the future, and they are different from public APIs as future JIT binded private APIs will be able to accept qualified_name, not callables. These private APIs are subject to be deprecated once JIT supports torch script function to be a JIT type.

Also, these APIs require torch script function to be defined and annotated by users in python land, it can not be script class/module constructor or class/module methods.

2. This diff also allows public rpc APIs to accept an annotated TorchScript call and execute code path that above private APIs ran on. Therefore if users invoke an annotated TorchScript call over RPC, this call can be serialized, deserialized and executed in C++ without GIL as well.

3. The above private APIs call a newly defined C++ function to make rpc torch script call to be serialized, deserialized and executed in C++ land. This C++ function returns an ivalue::Future. so that in follow up diff this C++ function can be called when these privated APIs are binded to JIT.

4. script_call.cpp/.h and request_callback_impl.cpp files are refactored accordingly so that torch script call and builtin call can share same message type and codes.

5. refactored deserializeResponse() and added a new utility to deserizalize response to IValue

ghstack-source-id: 96879167
ghstack-source-id: 96879167

Test Plan: unit test

Differential Revision: D19402374

fbshipit-source-id: 04efcc7c167d08a6503f29efe55e76f2be4b2c5e
2020-01-18 09:24:17 -08:00
02f09a1bbd Implement backend-agnostic rpc._wait_all_workers() utility (#32190)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32190

We need a backend-agnostic mechanism to do barrier-like operation before locally destroy RRef context and shutdown RPC Agent.

- Sort worker names.
- Elect the first name as the leader in the ordered worker names.
- Followers reports therir intent to synchronize to the leader.
- Leader also reports to itself, when `_wait_all_workers()` called.
- If all workers report their intent to proceed, leader send the command to every one to proceed.
ghstack-source-id: 96693296

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rref_leak
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn

buck-out/gen/caffe2/test/rpc_spawn\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_spawn\#binary.par -r test_rref_leak
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_wait_all_workers
buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_worker_id
```

# Stress runs
```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_light_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_light_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_stress_heavy_rpc --stress-runs 10
```

```
buck test mode/dev-nosan //caffe2/test:rpc_spawn_thrift -- test_stress_heavy_rpc --stress-runs 10
```

Differential Revision: D19399908

fbshipit-source-id: 1dee607cd49adafe88534621a1c85e2736e2f595
2020-01-14 19:19:14 -08:00
51a34545e9 Revert D18482934: support torch script call over rpc
Test Plan: revert-hammer

Differential Revision:
D18482934

Original commit changeset: bd82a0d820c4

fbshipit-source-id: ca5e50fb0a883ee311aeb310198d84ad28062158
2020-01-14 13:30:56 -08:00
dbd737158b support torch script call over rpc (#30063)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30063

This diff makes following changes:
1. Providing a new set of python rpc privated APIs, they can accept an annotated TorchScript call and this call can be serialized, deserialized and executed in C++ without GIL. These privated APIs will be binded to JIT in the future, and they are different from public APIs as future JIT binded private APIs will be able to accept qualified_name, not callables. These private APIs are subject to be deprecated once JIT supports torch script function to be a JIT type.

Also, these APIs require torch script function to be defined and annotated by users in python land, it can not be script class/module constructor or class/module methods.

2. This diff also allows public rpc APIs to accept an annotated TorchScript call and execute code path that above private APIs ran on. Therefore if users invoke an annotated TorchScript call over RPC, this call can be serialized, deserialized and executed in C++ without GIL as well.

3. The above private APIs call a newly defined C++ function to make rpc torch script call to be serialized, deserialized and executed in C++ land. This C++ function returns an ivalue::Future. so that in follow up diff this C++ function can be called when these privated APIs are binded to JIT.

4. script_call.cpp/.h and request_callback_impl.cpp files are refactored accordingly so that torch script call and builtin call can share same message type and codes.

5. refactored deserializeResponse() and added a new utility to deserizalize response to IValue

ghstack-source-id: 96638829

Test Plan: unit test

Differential Revision: D18482934

fbshipit-source-id: bd82a0d820c47a8e45b2e7c616eca06573f7d7ea
2020-01-14 09:27:04 -08:00
927c2a02b0 enable autograd profiler to work with RPC and RRef. (#31381)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31381

This PR adds support for being able to profile both sync and async RPCs, so that users can use the autograd profiler and be able to view metrics such as RPC latency and number of calls in the profiler output.

The way this is implemented is by using the existing `RecordFunction` class provided by the autograd profiler. We create a `RecordFunction` instance when sending an RPC, if autograd profiling is enabled. We also invoke the starting callbacks on this `RecordFunction` instance, this does things such as start the CPU timer.  This instance is then persisted across the lifetime of the RPC by attaching it to the `Future` created by the RPC. When the RPC is finished (i.e. when `future->markComplete()` is called), we run the `RecordFunction` instance's end callbacks, which among other things, stops the timer so that we get the correct RPC latency.

The `RecordFunction` and relevant callbacks in `profiler.cpp` are modified slightly to support running end callbacks from a different thread (which is needed since futures are marked as completed by a different thread than the main RPC thread). By default, the autograd profiler uses a `thread_local` list of `Events` and `thread_id`. However, since we'd like to run the `RecordFunction`'s callbacks from a different thread, we would like to access the list of `Events` created by the original thread. This is done by attaching the `thread_id` for the event to the `RecordFunction`, and then looking up the event with that thread in `all_event_lists` (see the changes in `profiler.cpp`). To ensure that the original behavior does not change in the profiler, this described behavior is only run when a user calls `setOverrideThreadId()` on the `RecordFunction` object.
ghstack-source-id: 96527291

Test Plan: Added a unit test.

Differential Revision: D19053322

fbshipit-source-id: 9a27a60c809fc4fdb16fa5d85085f3b6b21abfbb
2020-01-10 21:26:18 -08:00
8ea49e7a08 add missing braces for format in rpc _to_worker_info (#31969)
Summary:
This was missing and resulted in the incorrect `name` passed into `_to_worker_info` not being printed out in the error message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31969

Differential Revision: D19331927

Pulled By: rohan-varma

fbshipit-source-id: e74d47daec3224c2d9b9da3c0a6404cfa67baf65
2020-01-09 23:18:46 -08:00
7f723cbd8a Revert D19290954: Implement backend-agnostic rpc._wait_all_workers() utility
Test Plan: revert-hammer

Differential Revision:
D19290954

Original commit changeset: cdb22203c2f2

fbshipit-source-id: 2ae194a06a645e4f48879271eccf0588b0956cd3
2020-01-08 10:25:51 -08:00