29 Commits

Author SHA1 Message Date
56935684c3 Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419)
------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129419
Approved by: https://github.com/ezyang
ghstack dependencies: #129375, #129376
2024-06-29 09:23:39 +00:00
83caf4960f Revert "Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419)"
This reverts commit e40f50cb87bcd176a380b729af5dda13dbe9c399.

Reverted https://github.com/pytorch/pytorch/pull/129419 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))
2024-06-29 00:44:24 +00:00
e40f50cb87 Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419)
------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129419
Approved by: https://github.com/ezyang
ghstack dependencies: #129375, #129376
2024-06-28 15:37:57 +00:00
dcfa7702c3 Flip default value for mypy disallow_untyped_defs [1/11] (#127838)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127838
Approved by: https://github.com/oulgen
2024-06-08 18:16:33 +00:00
631fb33fd6 Enable import following in MYPYNOFOLLOW (now MYPYINDUCTOR) (#113830)
Skipping importing some packages for now to make this change more
tractable.

For some reason, lintrunner on CI raises errors in all imported `.pyi` files,
even though it doesn't on my local machine. The errors are all from missing
generic types, as the MYPYINDUCTOR config has `disallow_any_generics`
set. I have thus added `disable-error-code` comments to the relevant files,
though I fixed a few that were easy enough.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113830
Approved by: https://github.com/Skylion007
ghstack dependencies: #113722, #113721
2023-11-17 18:24:21 +00:00
43fb5147e2 [BE] Enable Ruff's Flake8 PYI001 (#112823)
Enable [unprefixed-type-param (PYI001)](https://docs.astral.sh/ruff/rules/unprefixed-type-param/#unprefixed-type-param-pyi001)

Link: #110950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112823
Approved by: https://github.com/Skylion007
2023-11-03 17:25:39 +00:00
79d35bfc01 [BE]: Add PYI files to ruff lintrunner (#107524)
Due to a bug with the lintrunner yaml, PYI files were not convered. This PR builds off #107519 covers all PYI files and adds one noqa to fix a B006 bugprone error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107524
Approved by: https://github.com/ezyang
2023-08-21 18:55:41 +00:00
32b67b5a6b Fix RRef Future type annotation (#105296)
Test Plan: Sandcastle

Differential Revision: D47376236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105296
Approved by: https://github.com/Skylion007
2023-07-23 22:34:09 +00:00
15ea0a00cb Fix RRef type annotations (#104876)
Test Plan: Sandcastle

Reviewed By: H-Huang

Differential Revision: D47334579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104876
Approved by: https://github.com/H-Huang
2023-07-14 17:31:51 +00:00
2f95a3d0fc [BE]: Apply ruff PERF fixes to torch (#104917)
Applies automated ruff fixes in the PERF modules and enables all automatic ones. I also updated ruff which applied some additional fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104917
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-07-11 20:45:21 +00:00
1fd119948e [3/3] Update .pyi Python stub files and enable 'UFMT' linter (#95268)
Changes:

- #95200

1. Recognize `.py.in` and `.pyi.in` files as Python in VS Code for a better development experience.
2. Fix deep setting merge in `tools/vscode_settings.py`.

- #95267

3. Use `Namedtuple` rather than `namedtuple + __annotations__` for `torch.nn.utils.rnn.PackedSequence_`:

    `namedtuple + __annotations__`:

    ```python
    PackedSequence_ = namedtuple('PackedSequence_',
                                 ['data', 'batch_sizes', 'sorted_indices', 'unsorted_indices'])

    # type annotation for PackedSequence_ to make it compatible with TorchScript
    PackedSequence_.__annotations__ = {'data': torch.Tensor, 'batch_sizes': torch.Tensor,
                                       'sorted_indices': Optional[torch.Tensor],
                                       'unsorted_indices': Optional[torch.Tensor]}
    ```

    `Namedtuple`: Python 3.6+

    ```python
    class PackedSequence_(NamedTuple):
        data: torch.Tensor
        batch_sizes: torch.Tensor
        sorted_indices: Optional[torch.Tensor]
        unsorted_indices: Optional[torch.Tensor]
    ```

- => this PR: #95268

4. Sort import statements and remove unnecessary imports in `.pyi`, `.pyi.in` files.
5. Format `.pyi`, `.pyi.in` files and remove unnecessary ellipsis `...` in type stubs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95268
Approved by: https://github.com/huydhn
2023-03-01 23:50:56 +00:00
591222f5d9 Fix use-dict-literal lint (#83718)
Fix use-dict-literal pylint suggestions by changing `dict()` to `{}`. This PR should do the change for every Python file except test/jit/test_list_dict.py, where I think the intent is to test the constructor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83718
Approved by: https://github.com/albanD
2022-08-24 00:26:46 +00:00
1fa9a377d0 [Profiler] Start moving python bindings out of autograd (#82584)
A lot of profiler code still lives in autograd for historic reasons. However as we formalize and clean up profiler internals it makes sense to pull more and more into the profiler folders/namespace. For now I'm just moving some of the core config data structures and those related to `torch::profiler::impl::Result` to keep the scope manageable.

Differential Revision: [D37961462](https://our.internmc.facebook.com/intern/diff/D37961462/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D37961462/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82584
Approved by: https://github.com/albanD, https://github.com/Gamrix
2022-08-19 17:15:18 +00:00
e68686bb05 Add optional timeout argument for RpcAgent join() (#76194)
Summary:
This PR was created to resolve issue brought up in https://fb.workplace.com/groups/319878845696681/permalink/741428653541696/

Changes:
- Adds timeout argument to RpcAgent.join()
- Add optional timeout argument to ThriftRpcAgent barrier()
- During shutdown (ThriftRpcAgent join) calls the barrier, the agent will use the timeout passed to shutdown and pass that timeout into the join().
- Update API.py to also include fix bug (missing timeout for signal)
- Change default shutdown timeout to 0 (no timeout). Existing functionality in _all_gather will remain the same and wait indefinitely for signal if no timeout is set for the function. New functionality has user specify timeout for both the signal and rpc calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76194

Test Plan:
Modified barrier test

buck test torch/fb/distributed/thriftRpcBackend/test:ThriftRpcAgentTest -- BarrierTest

Reviewed By: mrshenli

Differential Revision: D35825382

fbshipit-source-id: e91e9ab5d9fca08787cb6b6b8125a4b03d1c7cde
(cherry picked from commit fcf899a387001574bf4e39a213ea741611d76097)
2022-05-03 01:10:17 +00:00
811ccde41a [Dynamic RPC] Add graceful shutdown for dynamic RPC members
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74561

Approved by: https://github.com/mrshenli
2022-04-26 13:12:55 +00:00
8646e0dc28 [Dynamic RPC] Allow existing ranks to communicate with newly joined ranks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74035

Approved by: https://github.com/mrshenli
2022-04-20 18:07:40 +00:00
f76d1c022e [Dynamic RPC] Allow for optional world_size argument in init_rpc (#73372)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73372

This PR which allows for optional `world_size` argument in init_rpc. This makes changes in rendezvous to allow for `NoneType` for world_size and creates a new code path when initializing TensorPipe agent for init_rpc. The TensorPipe agent is protected by a critical section enforced using the store, so that only one node can create a TPAgent at a time.
This PR does not yet enable RPC commands between ranks.

Previously:
```python
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
init_rpc("worker0", world_size=1, rank=0)
```

Now (only rank is needed):
```python
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
init_rpc("worker0", rank=0)
```

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D34621651

Pulled By: H-Huang

fbshipit-source-id: 09dbb511d5a00c219a6ce0a35501ff2e388998b0
(cherry picked from commit 834aedc3256167399c323169ef2f0c9b3cf98dff)
2022-03-24 16:19:28 +00:00
7b376bf844 Remove ProcessGroup from TensorPipeAgent initialization (#68128)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68128

Reland of D31762735 (0cbfd466d2).

This diff was originally reverted due to failure in test_send_export_type_through_rpc_with_custom_pickler.

I updated rpc_pickler_test.py to prevent a race condition where processes were not registering their pickler before handling their rpc_sync calls.

Test Plan:
rpc_pickler_test file:

buck test mode/dev-nosan -c 'cxx.coverage_only=caffe2' //caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test //caffe2/torch/fb/training_toolkit/backend/metrics/collectors/fbdata_aggregator/tests:batch_collector_test -- --run-disabled --collect-coverage '--code-coverage-session=test_session' --force-tpx

rpc_pickler stress test:

buck test mode/dev-nosan -c 'cxx.coverage_only=caffe2' //caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test -- --exact 'caffe2/torch/fb/training_toolkit/backend/metrics/tests:rpc_pickler_test - test_send_export_type_through_rpc_with_custom_pickler (caffe2.torch.fb.training_toolkit.backend.metrics.tests.rpc_pickler_test.CythonTypeRpcSpawnTest)' --run-disabled --collect-coverage '--code-coverage-session=test_session' --force-tpx --jobs 18 --stress-runs 10 --record-results

Reviewed By: mrshenli

Differential Revision: D32316077

fbshipit-source-id: e58de2335fbaa3ab46d46fe222c659197633a5e4
2021-11-11 12:28:55 -08:00
9fb3ba9d7b Revert D31762735 (#67924)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67924

This diff reverts the changes made in D31762735 (0cbfd466d2)

Test Plan: Wait for CI

Reviewed By: derekmod-fb

Differential Revision: D32214744

fbshipit-source-id: e0a65b6a31a88216ae1243549fcbc901ef812374
2021-11-06 17:34:13 -07:00
0cbfd466d2 Remove ProcessGroup from TensorPipeAgent initialization (#66708)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66708

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D31762735

Pulled By: H-Huang

fbshipit-source-id: 9f3879fca6b8258f7e6171b14d2c1d6cce21627d
2021-11-01 14:15:27 -07:00
dc1bd6acee Remove PROCESS GROUP rpc backend (#62411)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62411

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D29990408

Pulled By: H-Huang

fbshipit-source-id: 183d3b316767b12993cebbe32b73c2850fd1cc42
2021-08-02 12:26:22 -07:00
8f4cfaa9db Fix race condition in TP agent (#58753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58753

TSAN was (rightfully!) detecting and complaining about a race due to the fact that upon init the TP agent exchanges the device maps between nodes using RPC requests (and by doing so it accesses the device maps) and then sets the reverse device maps (thus possibly modifying the set of devices). This resulted in a data race, i.e., simultaneously reading and writing the set of devices without synchronizing.

One solution is to add a mutex around the devices, which works, but is "annoying". An alternative solution is to make the set of devices immutable (i.e., `const`). For that to work, we need to exchange the device maps without using RPC calls. We can do so using the process group that we need to create anyways.

Since now there's a lot more logic in Python, I've moved (and restructured) all safety checks over there, and removed them from C++.
ghstack-source-id: 130583775

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D28603754

fbshipit-source-id: 88533e65d72d1eb806dc41bec8d55def5082e290
2021-06-04 06:53:42 -07:00
0422e67336 Use Devices instead of DeviceIndexes in TensorPipe agent (#57294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57294

With the advent of CPUs in the device maps, and to be more generic (e.g., to support AMD GPUs), and to avoid conversions when passing to Future and RRef and such, it's easier to use Devices instead of DeviceIndices. This started by just migrating the TensorPipe agent but the RPC layer is quite intertwined so I had to migrate a lot of stuff.
ghstack-source-id: 127916562

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28092733

fbshipit-source-id: 024dcb3648c5898ab13e770413c43958f04f1a8a
2021-05-01 16:12:55 -07:00
13dbb77b7a [RPC Framework] Enable RemoteModule to directly send GPU tensors over the wire on TensorPipe RPC backend if a device map is provided (#57288)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57288

If the device map provided by RemoteModue is not empty, then TensorPipe RPC backend can support directly sending GPU tensors over the wire.

Also add pybind of `_get_device_map`.

The changes in unit test setup is separated out as a follow-up PR, as currently it breaks some tests in `distributed/rpc/test_faulty_agent.py`.

Still need to fix test_load_di_parts in `torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test`. Currently an early return is used to bypass this test failure.

#Original PR issue: https://github.com/pytorch/pytorch/issues/51670

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device_script

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule -j 1

CAUTION: This one actually fails and now it is bypassed. See FIXME in `_remote_forward`.
buck test mode/dev-nosan caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- test_load_di_parts

Reviewed By: wanchaol

Differential Revision: D28021672

fbshipit-source-id: a89245dc35e1d9479811ec6f98d9f34116837d79
2021-04-30 18:04:45 -07:00
d83ae5d1b7 Add devices to TensorPipe options (#56405)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56405

If not provided, the `devices` field will be initialized to local
devices in local `device_maps` and corresponding devices in peers'
`device_maps`. When processing CUDA RPC requests, the agent will
use a dedicated stream for each device in the devices list to 1)
accept argument CUDA tensors 2) run user functions 3) send return
value tensors.

closes #54017

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D27863133

Pulled By: mrshenli

fbshipit-source-id: 5d078c3b6d1812f85d62b0eb0f89f2b6c82cb060
2021-04-21 16:16:48 -07:00
c7b1979b6b Use Store collect and verify names in all RPC agents (#53209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53209

closes #40048

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D26791524

Pulled By: mrshenli

fbshipit-source-id: fc75589f9707014334fcfae6f05af3c04217783b
2021-03-07 16:51:46 -08:00
a1c67b0763 Silence harmless error logs of TensorPipe agent during shutdown (#51785)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51785

The TensorPipe pipes do not really support a "graceful" shutdown: if one side is expecting data (i.e., it has scheduled a readDescriptor call) and the other side closes, the former will receive an error. Such an error will not even be predictable, as it depends on the backend: some may detect this and report it "well" (through an EOFError), others may not be able to tell this apart from a failure and report it as such.

This meant that during shutdown some of these errors would fire and thus the agent would log them as warning. We did add a note that these were expected under some conditions, so that users wouldn't be alarmed, but it was still a far-from-ideal experience.

In principle we could build a "protocol" on top of these pipes to "agree" on a graceful shutdown, and this was the plan to solve this. However, it was rather complicated to implement.

Here I am proposing a quicker, but perhaps hackier, solution, which re-uses the already existing graceful shutdown "protocol" of the agent (i.e., the `join` method) to put the agent in a special state in which it will silence all errors due to a remote shutting down.

Such a check cannot happen in the `shutdown` method, because that's also used in case of ungraceful shutdown (in which case I believe we'd still want to display errors). Since it needs to make sure that all participants have transitioned to this new state before any of them can continue (as otherwise one of them may close its pipes before another one has realized that this is now expected), we need to perform a barrier. Hence the ideal place for it is the `join` method, where we're already doing a lot of gang-wide synchronization. Since the `join` method isn't only called during shutdown, we need to make sure we only switch the agent to this state when it's the last call to join, and we do so by adding a new optional argument to it (which will be ignored by all agents except the TensorPipe one).

I realize this isn't the prettiest solution, and since it changes the agent's API it's worth discussing it carefully. Let me know what you think!
ghstack-source-id: 121131940

Test Plan: Run on CircleCI, where this occurred quite a bit, and check the logs.

Reviewed By: mrshenli

Differential Revision: D26276137

fbshipit-source-id: 69ef14fe10908e80e627d9b4505352e482089cc8
2021-02-10 10:58:22 -08:00
d64184ef4c [RPC] Support timeout for RRef proxy functions (#50499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50499

Adds a timeout API to the following functions:
```
rref.rpc_sync()
rref.rpc_async()
rref.remote()
```
so that RPCs initiated by these proxy calls can be appropriately timed out similar to the regular RPC APIs. Timeouts are supported in the following use cases:

1. rpc.remote finishes in time and successfully, but function run by rref.rpc_async() is slow and times out. Timeout error will be raised
2. rref.rpc_async() function is fast, but rpc.remote() is slow/hanging. Then when rref.rpc_async() is called, it will still timeout with the passed in timeout (and won't block for the rpc.remote() to succeed, which is what happens currently). Although, the timeout will occur during the future creation itself (and not the wait) since it calls `rref._get_type` which blocks. We can consider making this nonblocking by modifying rref._get_type to return a future, although that is likely a larger change.

Test Plan: Added UT

Reviewed By: wanchaol

Differential Revision: D25897495

fbshipit-source-id: f9ad5b8f75121f50537677056a5ab16cf262847e
2021-01-15 13:23:23 -08:00
eaa993a2e0 Add type annotations to torch._C._distributed_rpc module. (#46624)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/46624

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D24761656

Pulled By: xuzhao9

fbshipit-source-id: b55aee5dd2b97f573a50e5bbfddde7d984943fec
2020-11-06 01:28:51 -08:00