I hit this in tests when calling `init_process_group(init_method="tcp://localhost:0", ...)`. You can't use port 0 because of a bug in the conditional: port 0 is falsy in Python, so the check treats it as if no port were given, and you get `ValueError: Error initializing torch.distributed using tcp:// rendezvous: port number missing`.
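For reference, a minimal sketch of the falsy-port pitfall (illustrative, not the actual PyTorch source):
```python
# Port 0 is a valid request for an OS-assigned ephemeral port, but it is
# falsy in Python, so a truthiness check misreports it as absent.
def check_port_buggy(port):
    if not port:  # wrong: rejects port 0 along with None
        raise ValueError("port number missing")

def check_port_fixed(port):
    if port is None:  # right: rejects only a truly missing port
        raise ValueError("port number missing")

check_port_fixed(0)  # accepted: ephemeral port request
```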
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154156
Approved by: https://github.com/d4l3k, https://github.com/Skylion007
1. We take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712: every time we retry, we first create a new TCPStore server, so we don't need to append the attempt count as a key prefix and we avoid eventual TCPStore sync failures. (This applies only when TCPStore sharing is enabled.)
2. We start a new server bound to an ephemeral port (i.e., port 0) so the OS assigns a free port, then pass that port downstream (to the trainer or c10d). This way the TCPStore is managed by the elastic agent instead of racing to bind a specific port in the trainer; a minimal sketch of the ephemeral-port idea follows this list.
3. The port is then broadcast for dynamic_rendezvous.
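A minimal sketch of the ephemeral-port idea, with a helper of our own naming (in the actual change the TCPStore itself binds to port 0, which avoids a probe-then-bind race):
```python
import socket

def pick_free_port(host: str = "localhost") -> int:
    """Bind to port 0 so the kernel picks a free port, then report it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))          # port 0: let the OS choose
        return s.getsockname()[1]  # the port the OS actually assigned

print(f"TCPStore server could listen on port {pick_free_port()}")
```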
One remaining question: what do we do about the store created by `_create_tcp_store` in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py? Are we OK with creating a duplicate TCPStore server?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957
Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o
**Summary**
This PR switches the default TCPStore server backend to a new implementation that utilizes [`libuv`](https://github.com/libuv/libuv) for significantly lower initialization time and better scalability:
<img width="714" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/18503011-da5d-4104-8ba9-abc456438b02">
We hope this improvement gives users a much shorter startup time in large-scale jobs. Eventually, we hope to fully replace the old TCPStore backend implementation with the libuv one.
**What it changes**
This PR changes the underlying TCPStore server backend to `libuv` unless the user explicitly requests the old TCPStore server. This change should not be noticeable to users, apart from significantly faster TCPStore startup in large-scale jobs.
One thing to note: the libuv backend does not yet support the initialization approach where the user passes in a socket. We plan to support it as a next step, but we chose to disable it until it is fully tested. If you initialize TCPStore this way, see the next section for how to keep using the old TCPStore server.
**Fallback: staying on the old TCPStore server**
For users who want to stay with the old TCPStore backend, there are three ways:
1. If you instantiate the TCPStore object directly, pass `use_libuv=False` to use the old TCPStore server backend, e.g. `store = torch.distributed.TCPStore(..., use_libuv=False)`.
2. Or, specify the TCPStore backend option in `init_method` when calling the default ProcessGroup init, e.g. `torch.distributed.init_process_group(..., init_method="{YOUR_RENDEZVOUS_METHOD}://{YOUR_HOSTNAME}:{YOUR_PORT}?use_libuv=0")`.
3. Or, set the environment variable `USE_LIBUV` to `"0"` when launching.
These three approaches are listed in order of precedence. That is, if you specify `use_libuv=0` in `init_method` and also set the environment variable `USE_LIBUV="1"`, the former takes effect and the TCPStore backend instantiated will be the old one instead of the libuv one. A sketch of all three options follows.
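Below is a minimal sketch of the three options; hostnames and ports are placeholders, and the snippets are illustrative rather than a recommended combined script:
```python
import os
from datetime import timedelta
import torch.distributed as dist

# 1) Direct TCPStore construction: pass use_libuv=False.
store = dist.TCPStore(
    "localhost", 29500, world_size=1, is_master=True,
    timeout=timedelta(seconds=30), use_libuv=False,
)

# 2) Opt out via the init_method query string.
dist.init_process_group(
    backend="gloo", rank=0, world_size=1,
    init_method="tcp://localhost:29501?use_libuv=0",
)

# 3) Opt out via the environment variable (lowest precedence); set it
#    before the store / process group is created.
os.environ["USE_LIBUV"] = "0"
```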
**Operating Systems Compatibility**
From the CI signals, we believe the new implementation has the same behavior as the old TCPStore server on all supported platforms. If you notice any behavior discrepancy, please file an issue with `oncall: distributed` label.
**Test Plan**
`pytest test/distributed/test_store.py`
<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">
note: `TestMultiThreadedWait::test_wait` is a broken test that has been failing for some time.
`test/distributed/elastic/utils/distributed_test.py`
<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">
**TODO**
1. Update the doc at
- https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store
- https://pytorch.org/docs/stable/distributed.html#tcp-initialization
2. Make torch elastic rendezvous use the libuv TCPStore as well. See `torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @kurman
3. Test whether the libuv backend works with socket-based initialization. Change `LibUvTCPStoreTest::test_take_over_listen_socket`.
Differential Revision: [D58259591](https://our.internmc.facebook.com/intern/diff/D58259591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127957
Approved by: https://github.com/kurman
ghstack dependencies: #127956
The current call passes `['/actual/path']` to `os.walk`; stringified, that names no real path, and because `os.walk` suppresses errors by default, this silently leads to an empty traversal.
There is an unused function just above that handles this correctly, so I assume it is what was supposed to be called.
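A small illustration of the silent failure mode (the paths are made up):
```python
import os

# os.walk reports errors through its onerror callback, which defaults to
# None, so a nonexistent path yields an empty traversal, not an exception.
print(list(os.walk("/no/such/path")))        # -> []
print(list(os.walk(str(["/actual/path"]))))  # stringified list: also -> []
```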
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126103
Approved by: https://github.com/suo
Fixes #112639
```txt
torch/utils/_sympy/value_ranges.py
torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:68 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:86 in public method `tighten`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:90 in public method `__and__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:103 in public method `__or__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:118 in public method `unknown`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:122 in public method `wrap`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:129 in public method `increasing_map`:
D400: First line should end with a period (not ')')
torch/utils/_sympy/value_ranges.py:135 in public method `decreasing_map`:
D400: First line should end with a period (not ')')
torch/utils/_sympy/value_ranges.py:141 in public method `monotone_map`:
D400: First line should end with a period (not 'g')
torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`:
D400: First line should end with a period (not '0')
torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`:
D403: First word of the first line should be properly capitalized ('Fn', not 'fn')
torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`:
D400: First line should end with a period (not ':')
torch/utils/_sympy/value_ranges.py:171 in public method `coordinatewise_monotone_map`:
D400: First line should end with a period (not 'e')
torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`:
D400: First line should end with a period (not 's')
torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`:
D210: No whitespaces allowed surrounding docstring text
torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:488 in public class `ValueRangeAnalysis`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:489 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:501 in public method `bool_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:506 in public method `default_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:511 in public method `load`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:514 in public method `store`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:517 in public method `reduction`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:520 in public method `index_expr`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:525 in public method `to_dtype`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:558 in public method `square`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:562 in public method `neg`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:566 in public method `truncdiv`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:577 in public method `sub`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:580 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:585 in public function `bound_sympy`:
D103: Missing docstring in public function
36
torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:68 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:86 in public method `tighten`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:90 in public method `__and__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:103 in public method `__or__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:118 in public method `unknown`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:122 in public method `wrap`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`:
D400: First line should end with a period (not 's')
torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`:
D210: No whitespaces allowed surrounding docstring text
torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:490 in public class `ValueRangeAnalysis`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:491 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:503 in public method `bool_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:508 in public method `default_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:513 in public method `load`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:516 in public method `store`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:519 in public method `reduction`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:522 in public method `index_expr`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:527 in public method `to_dtype`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:560 in public method `square`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:564 in public method `neg`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:568 in public method `truncdiv`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:579 in public method `sub`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:582 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:587 in public function `bound_sympy`:
D103: Missing docstring in public function
28
torch/utils/viz/_cycles.py
torch/utils/viz/_cycles.py:14 in public function `observe_garbage`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:207 in public function `object_annotation`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/viz/_cycles.py:207 in public function `object_annotation`:
D400: First line should end with a period (not 'g')
torch/utils/viz/_cycles.py:256 in public class `Node`:
D101: Missing docstring in public class
torch/utils/viz/_cycles.py:262 in public function `create_graph`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:308 in public function `escape`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:335 in public function `to_dot`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:406 in public function `to_html`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
D400: First line should end with a period (not 'p')
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
D401: First line should be in imperative mood; try rephrasing (found 'Reference')
14
torch/utils/viz/_cycles.py:14 in public function `observe_garbage`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:256 in public class `Node`:
D101: Missing docstring in public class
torch/utils/viz/_cycles.py:262 in public function `create_graph`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:308 in public function `escape`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:335 in public function `to_dot`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:406 in public function `to_html`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`:
D103: Missing docstring in public function
9
torch/distributed/argparse_util.py
torch/distributed/argparse_util.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/argparse_util.py:13 in public class `env`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/argparse_util.py:13 in public class `env`:
D400: First line should end with a period (not 'g')
torch/distributed/argparse_util.py:13 in public class `env`:
D412: No blank lines allowed between a section header and its content ('Example')
torch/distributed/argparse_util.py:43 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:56 in public method `__call__`:
D102: Missing docstring in public method
torch/distributed/argparse_util.py:61 in public class `check_env`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/argparse_util.py:61 in public class `check_env`:
D400: First line should end with a period (not 's')
torch/distributed/argparse_util.py:61 in public class `check_env`:
D412: No blank lines allowed between a section header and its content ('Example')
torch/distributed/argparse_util.py:97 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:102 in public method `__call__`:
D102: Missing docstring in public method
11
torch/distributed/argparse_util.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/argparse_util.py:43 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:56 in public method `__call__`:
D102: Missing docstring in public method
torch/distributed/argparse_util.py:97 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:102 in public method `__call__`:
D102: Missing docstring in public method
5
torch/distributed/_composable_state.py
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
D202: No blank lines allowed after function docstring (found 1)
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
D400: First line should end with a period (not '`')
3
0
torch/distributed/launch.py
torch/distributed/launch.py:1 at module level:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/launch.py:1 at module level:
D400: First line should end with a period (not 'd')
torch/distributed/launch.py:156 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/launch.py:171 in public function `launch`:
D103: Missing docstring in public function
torch/distributed/launch.py:180 in public function `main`:
D103: Missing docstring in public function
5
torch/distributed/launch.py:157 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/launch.py:172 in public function `launch`:
D103: Missing docstring in public function
torch/distributed/launch.py:181 in public function `main`:
D103: Missing docstring in public function
3
torch/distributed/remote_device.py
torch/distributed/remote_device.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/remote_device.py:81 in private method `worker_name`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:81 in private method `worker_name`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/distributed/remote_device.py:88 in private method `rank`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:88 in private method `rank`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/distributed/remote_device.py:95 in private method `device`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/distributed/remote_device.py:95 in private method `device`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
7
torch/distributed/remote_device.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/remote_device.py:85 in private method `rank`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:85 in private method `rank`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
3
torch/distributed/rendezvous.py
torch/distributed/rendezvous.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/rendezvous.py:23 in public function `register_rendezvous_handler`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/distributed/rendezvous.py:88 in public function `rendezvous`:
D103: Missing docstring in public function
torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`:
D400: First line should end with a period (not 'r')
5
torch/distributed/rendezvous.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/rendezvous.py:89 in public function `rendezvous`:
D103: Missing docstring in public function
2
torch/distributed/run.py
torch/distributed/run.py:9 at module level:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:9 at module level:
D400: First line should end with a period (not '`')
torch/distributed/run.py:393 in public function `get_args_parser`:
D202: No blank lines allowed after function docstring (found 1)
torch/distributed/run.py:393 in public function `get_args_parser`:
D401: First line should be in imperative mood; try rephrasing (found 'Helper')
torch/distributed/run.py:610 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/run.py:615 in public function `parse_min_max_nnodes`:
D103: Missing docstring in public function
torch/distributed/run.py:629 in public function `determine_local_world_size`:
D103: Missing docstring in public function
torch/distributed/run.py:670 in public function `get_rdzv_endpoint`:
D103: Missing docstring in public function
torch/distributed/run.py:677 in public function `get_use_env`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:677 in public function `get_use_env`:
D401: First line should be in imperative mood (perhaps 'Retrieve', not 'Retrieves')
torch/distributed/run.py:689 in public function `config_from_args`:
D103: Missing docstring in public function
torch/distributed/run.py:770 in public function `run_script_path`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:770 in public function `run_script_path`:
D401: First line should be in imperative mood (perhaps 'Run', not 'Runs')
torch/distributed/run.py:781 in public function `run`:
D103: Missing docstring in public function
torch/distributed/run.py:804 in public function `main`:
D103: Missing docstring in public function
15
torch/distributed/run.py:611 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/run.py:616 in public function `parse_min_max_nnodes`:
D103: Missing docstring in public function
torch/distributed/run.py:630 in public function `determine_local_world_size`:
D103: Missing docstring in public function
torch/distributed/run.py:671 in public function `get_rdzv_endpoint`:
D103: Missing docstring in public function
torch/distributed/run.py:691 in public function `config_from_args`:
D103: Missing docstring in public function
torch/distributed/run.py:784 in public function `run`:
D103: Missing docstring in public function
torch/distributed/run.py:807 in public function `main`:
D103: Missing docstring in public function
7
torch/distributed/__init__.py
torch/distributed/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/distributed/__init__.py:8 in public function `is_available`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/__init__.py:8 in public function `is_available`:
D400: First line should end with a period (not ',')
torch/distributed/__init__.py:8 in public function `is_available`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
4
torch/distributed/__init__.py:1 at module level:
D104: Missing docstring in public package
1
torch/distributed/utils.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/utils.py:16 in private function `_pack_kwargs`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:16 in private function `_pack_kwargs`:
D400: First line should end with a period (not ')')
torch/distributed/utils.py:47 in private function `_cast_forward_inputs`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:88 in private function `_recursive_to`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/distributed/utils.py:141 in private function `_p_assert`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:141 in private function `_p_assert`:
D209: Multi-line docstring closing quotes should be on a separate line
torch/distributed/utils.py:141 in private function `_p_assert`:
D400: First line should end with a period (not 't')
torch/distributed/utils.py:141 in private function `_p_assert`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/distributed/utils.py:275 in private function `_sync_module_states`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:275 in private function `_sync_module_states`:
D400: First line should end with a period (not 'n')
torch/distributed/utils.py:275 in private function `_sync_module_states`:
D401: First line should be in imperative mood (perhaps 'Sync', not 'Syncs')
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
D400: First line should end with a period (not 'y')
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
D401: First line should be in imperative mood (perhaps 'Synchronize', not 'Synchronizes')
15
torch/distributed/utils.py:1 at module level:
D100: Missing docstring in public module
1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112953
Approved by: https://github.com/weifengpy
Applies the remaining flake8-comprehensions fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, more performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving them into just the set call.
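An illustrative before/after of the kind of rewrite applied here:
```python
pairs = [("a", 1), ("b", 2)]

# Before: useless generator expressions fed into the constructors.
s = set(x for x in range(10))
d = dict((k, v) for k, v in pairs)

# After: comprehensions, which are shorter and skip the generator overhead.
s = {x for x in range(10)}
d = {k: v for k, v in pairs}
```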
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
### **SUMMARY:**
It is unnecessary to perform `n + 1` calls to `cast` (one cast for each parameter name-value pair and one cast for the filter generator itself) in a dictionary comprehension in an effort to avoid mypy errors.
Previously, the `cast` to `Tuple[str, str]` was necessary to prevent mypy from complaining that we are trying to create a dictionary out of lists as opposed to tuples (see the mypy issue [here](https://github.com/python/mypy/issues/7509)). However, a `cast` both adds unnecessary overhead from the function call and should generally only be used when a linter is unable to properly infer the type of a variable, not to "lie" to it about the type. We can avoid it by instead using a generator within the dictionary comprehension and then indexing into its items twice to produce tuples of size 2; mypy recognizes this as a valid dictionary initialization.
The above change is much more performant than the previous version of the code. Timing the two versions of the dictionary construction yielded the following results:
```
>python -m timeit -s "from typing import cast, Dict, Tuple, Iterable" -n 100000 -p "dict(cast(Tuple[str, str], pair.split('=')) for pair in cast(Iterable[str], filter(None, 'rank=3&world_size=5'.split('&'))))"
100000 loops, best of 5: 2.66 usec per loop
>python -m timeit -n 100000 -p "dict((pair[0], pair[1]) for pair in (pair.split('=') for pair in filter(None, 'rank=3&world_size=5'.split('&'))))"
100000 loops, best of 5: 1.09 usec per loop
```
The `cast` to `Iterable[str]` is similarly a "lie" that is not necessary. It is best to be as transparent as possible with the linter rather than using `cast` to eliminate errors; removing this cast in isolation from the other changes does not even produce any mypy errors.
Further, it is good practice to annotate a function's return type in its signature rather than typing the return value inside the function. Thus, the unnecessary intermediate variable `query_dict` inside `_query_to_dict` was removed, and the return type annotation was moved to the function declaration.
The type of the argument `query` is specified as `str`.
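A sketch of the resulting helper, reconstructed from the description above (the exact PyTorch source may differ slightly):
```python
from typing import Dict

def _query_to_dict(query: str) -> Dict[str, str]:
    # No cast() calls: the inner generator splits each "key=value" pair,
    # and indexing into the split result twice yields 2-tuples that mypy
    # accepts as a valid dictionary initializer.
    return {
        pair[0]: pair[1]
        for pair in (p.split("=") for p in filter(None, query.split("&")))
    }

print(_query_to_dict("rank=3&world_size=5"))  # {'rank': '3', 'world_size': '5'}
```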
### **EDITS (additional commits):**
[The sole type hint for `query_dict` (in `_env_rendezvous_handler`) was removed to match all other functions in the file.](76d78bfc9c)
[Incorrect typing is fixed for _env_rendezvous_handler typing so that `rank`, `world_size`, `master_port`, and `master_addr` are specified to be `int`, `int`, `int`, and `str`, respectively.](3cc5844264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75959
Approved by: https://github.com/kumpera, https://github.com/mrshenli
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73372
This PR allows for an optional `world_size` argument in `init_rpc`. It changes rendezvous to allow `NoneType` for `world_size` and creates a new code path for initializing the TensorPipe agent in `init_rpc`. The TensorPipe agent is protected by a critical section enforced using the store, so that only one node can create a TPAgent at a time.
This PR does not yet enable RPC commands between ranks.
Previously:
```python
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
init_rpc("worker0", world_size=1, rank=0)
```
Now (only rank is needed):
```python
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
init_rpc("worker0", rank=0)
```
Test Plan: Imported from OSS
Reviewed By: mrshenli
Differential Revision: D34621651
Pulled By: H-Huang
fbshipit-source-id: 09dbb511d5a00c219a6ce0a35501ff2e388998b0
(cherry picked from commit 834aedc3256167399c323169ef2f0c9b3cf98dff)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71863
The port number is an `int` in Python, but it must fit in a `uint16_t` when passed to the TCPStore constructor.
Related to #67172
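A minimal sketch of the implied range check (the helper name is ours):
```python
def validate_port(port: int) -> int:
    # A TCP port must fit in an unsigned 16-bit integer.
    if not (0 <= port <= 65535):
        raise ValueError(f"port must be in [0, 65535], got {port}")
    return port

validate_port(29500)  # ok
```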
Test Plan: Imported from OSS
Reviewed By: cbalioglu
Differential Revision: D33793270
Pulled By: H-Huang
fbshipit-source-id: 89ab47ec8bd7518f9ecbf7d01871fe059b0e77b1
(cherry picked from commit 84bff1f5bb11029ff3fcf7a04faa3b9c7b25286a)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910
Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. This happens when running a training script that initializes the process group with `init_method=tcp://localhost:$port`, like so:
```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```
An `Address in use` error is raised since the training script tries to create a TCPStore on port 6000, which is already taken because the elastic agent is already running a TCPStore on that port.
For details see: https://github.com/pytorch/pytorch/issues/63874.
This change does a couple of things:
1. Adds an `is_torchelastic_launched()` check function that users can call in their training scripts to see whether the script was launched via torchelastic.
2. Updates the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
3. Makes `init_method=tcp://` torchelastic-compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous; it is the old rendezvous module, slated for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()`, and if so, only create TCPStore clients (no daemons, not even for rank 0).
4. Adds a number of unit tests to cover the different code paths.
NOTE: the issue mentions that we should fail fast with an assertion on `init_method != env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in pytorch: `env://`, `tcp://`, and `file://`. Since this diff makes `tcp://` compatible with torchelastic, and I've validated that `file://` is compatible as well, there is no need to add assertions. I did update the docs to point out that `env://` is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future, but that is out of scope for this issue.
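A sketch of how a training script can branch on the new check (backend, host, and port are placeholders):
```python
import torch.distributed as dist

if dist.is_torchelastic_launched():
    # Launched via torchelastic: the agent already runs a TCPStore and
    # exports the rendezvous variables, so env:// just picks them up.
    dist.init_process_group(backend="gloo", init_method="env://")
else:
    # Standalone launch: tcp:// now works too, though env:// is recommended.
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://localhost:29500",
        rank=0,
        world_size=1,
    )
```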
Test Plan: Unittests.
Reviewed By: cbalioglu
Differential Revision: D30529984
fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61590
This PR fixes the bug where state from the first run of a failed training job leaks into subsequent runs due to a constant worker key prefix.
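A sketch of the idea behind the fix, with illustrative names rather than the actual implementation: derive the store key prefix from a per-run token instead of a constant, so keys written by a failed run cannot leak into its restart.
```python
from datetime import timedelta
from torch.distributed import PrefixStore, TCPStore

base = TCPStore("localhost", 29500, world_size=1, is_master=True,
                timeout=timedelta(seconds=30))
run_token = "attempt_1"  # changes on every restart of the job
store = PrefixStore(f"worker/{run_token}/", base)
store.set("status", "ok")  # actually stored under "worker/attempt_1/status"
```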
ghstack-source-id: 133494239
Test Plan: Run the existing integ tests.
Reviewed By: SciPioneer
Differential Revision: D29682743
fbshipit-source-id: d96ecadcfe5b6563225ee19f5d0776c7f935393a
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern.
With every new release, `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed. Fortunately, we don't need to do this manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out when it encounters a `type: ignore` that is no longer needed.
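The corresponding entry, in `mypy.ini` or the `[mypy]` section of `setup.cfg`:
```
[mypy]
warn_unused_ignores = True
```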
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006
Reviewed By: jbschlosser, malfet
Differential Revision: D29133237
Pulled By: albanD
fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58329
This PR is part of a stack that addresses GitHub issue #41614; it introduces:
- A new `multiTenant` constructor option for the `TCPStore` class indicating whether multiple store instances can be initialized with the same host:port pair.
- Updates to the C10d distributed (elastic) rendezvous and the `init_process_group` method to leverage the new `multiTenant` feature.
Note that the multi-tenancy feature itself is implemented in the fourth PR of this stack. In this PR, passing `true` for `multiTenant` results only in a warning output.
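A sketch of what the option enables once fully implemented (the Python binding spells it `multi_tenant`; treat the exact spelling here as an assumption):
```python
from datetime import timedelta
from torch.distributed import TCPStore

# Two server-side stores on the same host:port pair; with multi-tenancy the
# second constructor attaches to the existing server instead of failing
# with "address already in use" (in this PR it only emits a warning).
store_a = TCPStore("localhost", 29500, world_size=1, is_master=True,
                   timeout=timedelta(seconds=30), multi_tenant=True)
store_b = TCPStore("localhost", 29500, world_size=1, is_master=True,
                   timeout=timedelta(seconds=30), multi_tenant=True)
```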
ghstack-source-id: 130676389
Test Plan: Run the existing tests since there are no behavioral changes.
Reviewed By: rohan-varma
Differential Revision: D28424978
fbshipit-source-id: fb1d1d81b8b5884cc5b54486700a8182a69c1f29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55687
The diff makes sure that users can pass the following parameters:
* master_addr
* master_port
* node_rank
* use_env
The diff implements StaticTCPRendezvous, which creates a store with a listener on agent rank 0.
The diff modifies caffe2/rendezvous: if the worker process is launched with the torchelastic agent, the worker processes create a PrefixStore("worker/") from a TCPStore without a listener.
The diff adds macro functionality to torch/distributed/elastic/utils that helps resolve the local_rank parameter.
Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/test:launch_test
Reviewed By: cbalioglu, wilson100hong
Differential Revision: D27643206
fbshipit-source-id: 540fb26feac322cc3ec0a989fe53324755ccc4ea
Summary:
Enable TCPStore for DDP on the Windows platform, in order to improve the performance of DDP running across machines.
Related RFC is https://github.com/pytorch/pytorch/issues/47659
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47749
Reviewed By: bdhirsh
Differential Revision: D25220401
Pulled By: mrshenli
fbshipit-source-id: da4b46b42296e666fa7d8ec8040093de7443a529
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33434
Reland of https://github.com/pytorch/pytorch/pull/33325, since the
unit test was flaky and failed on land.
To ensure that the test is not flaky, I bumped the timeout so the rendezvous
does not time out (timing out the rendezvous in 1s led to the flakiness). I also
generalized our mechanism for retrying on errors to include errors due to
rendezvous timeouts.
ghstack-source-id: 98558377
Test Plan: Added UT test_tcp_store_timeout_set
Differential Revision: D19935390
fbshipit-source-id: 56ccf8c333dd2f954a33614d35cd1642d4e9473a