Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.
Note that only warnings that their messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
Resolves#126888
- #126888
This PR is split from PR #126898.
- #126898
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.
Note that only warnings that their messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.
Resolves#126888
- #126888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
Summary:
Minor logging cleanup in distributed library
1. Don't use "f" formatted strings - address linter issues.
2. Nits: Make use of unused `e` (error) in a few logs.
3. Change info->debug as asked in issue #113545
4. Nit: rename log -> logger in a few files for consistency
5. Fix a linter error.
Test Plan:
1. Local build passes.
2. Linter is happy.
Reviewers: wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
Fixes#112639
```txt
torch/utils/_sympy/value_ranges.py
torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:68 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:86 in public method `tighten`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:90 in public method `__and__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:103 in public method `__or__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:118 in public method `unknown`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:122 in public method `wrap`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:129 in public method `increasing_map`:
D400: First line should end with a period (not ')')
torch/utils/_sympy/value_ranges.py:135 in public method `decreasing_map`:
D400: First line should end with a period (not ')')
torch/utils/_sympy/value_ranges.py:141 in public method `monotone_map`:
D400: First line should end with a period (not 'g')
torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`:
D400: First line should end with a period (not '0')
torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`:
D403: First word of the first line should be properly capitalized ('Fn', not 'fn')
torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`:
D400: First line should end with a period (not ':')
torch/utils/_sympy/value_ranges.py:171 in public method `coordinatewise_monotone_map`:
D400: First line should end with a period (not 'e')
torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`:
D400: First line should end with a period (not 's')
torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`:
D210: No whitespaces allowed surrounding docstring text
torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:488 in public class `ValueRangeAnalysis`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:489 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:501 in public method `bool_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:506 in public method `default_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:511 in public method `load`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:514 in public method `store`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:517 in public method `reduction`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:520 in public method `index_expr`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:525 in public method `to_dtype`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:558 in public method `square`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:562 in public method `neg`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:566 in public method `truncdiv`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:577 in public method `sub`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:580 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:585 in public function `bound_sympy`:
D103: Missing docstring in public function
36
torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:68 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:86 in public method `tighten`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:90 in public method `__and__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:103 in public method `__or__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:118 in public method `unknown`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:122 in public method `wrap`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`:
D400: First line should end with a period (not 's')
torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`:
D210: No whitespaces allowed surrounding docstring text
torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`:
D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:490 in public class `ValueRangeAnalysis`:
D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:491 in public method `__init__`:
D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:503 in public method `bool_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:508 in public method `default_handler`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:513 in public method `load`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:516 in public method `store`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:519 in public method `reduction`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:522 in public method `index_expr`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:527 in public method `to_dtype`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:560 in public method `square`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:564 in public method `neg`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:568 in public method `truncdiv`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:579 in public method `sub`:
D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:582 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:587 in public function `bound_sympy`:
D103: Missing docstring in public function
28
torch/utils/viz/_cycles.py
torch/utils/viz/_cycles.py:14 in public function `observe_garbage`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:207 in public function `object_annotation`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/viz/_cycles.py:207 in public function `object_annotation`:
D400: First line should end with a period (not 'g')
torch/utils/viz/_cycles.py:256 in public class `Node`:
D101: Missing docstring in public class
torch/utils/viz/_cycles.py:262 in public function `create_graph`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:308 in public function `escape`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:335 in public function `to_dot`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:406 in public function `to_html`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
D205: 1 blank line required between summary line and description (found 0)
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
D400: First line should end with a period (not 'p')
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
D401: First line should be in imperative mood; try rephrasing (found 'Reference')
14
torch/utils/viz/_cycles.py:14 in public function `observe_garbage`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:256 in public class `Node`:
D101: Missing docstring in public class
torch/utils/viz/_cycles.py:262 in public function `create_graph`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:308 in public function `escape`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:335 in public function `to_dot`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:406 in public function `to_html`:
D103: Missing docstring in public function
torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`:
D103: Missing docstring in public function
9
torch/distributed/argparse_util.py
torch/distributed/argparse_util.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/argparse_util.py:13 in public class `env`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/argparse_util.py:13 in public class `env`:
D400: First line should end with a period (not 'g')
torch/distributed/argparse_util.py:13 in public class `env`:
D412: No blank lines allowed between a section header and its content ('Example')
torch/distributed/argparse_util.py:43 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:56 in public method `__call__`:
D102: Missing docstring in public method
torch/distributed/argparse_util.py:61 in public class `check_env`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/argparse_util.py:61 in public class `check_env`:
D400: First line should end with a period (not 's')
torch/distributed/argparse_util.py:61 in public class `check_env`:
D412: No blank lines allowed between a section header and its content ('Example')
torch/distributed/argparse_util.py:97 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:102 in public method `__call__`:
D102: Missing docstring in public method
11
torch/distributed/argparse_util.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/argparse_util.py:43 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:56 in public method `__call__`:
D102: Missing docstring in public method
torch/distributed/argparse_util.py:97 in public method `__init__`:
D107: Missing docstring in __init__
torch/distributed/argparse_util.py:102 in public method `__call__`:
D102: Missing docstring in public method
5
torch/distributed/_composable_state.py
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
D202: No blank lines allowed after function docstring (found 1)
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
D400: First line should end with a period (not '`')
3
0
torch/distributed/launch.py
torch/distributed/launch.py:1 at module level:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/launch.py:1 at module level:
D400: First line should end with a period (not 'd')
torch/distributed/launch.py:156 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/launch.py:171 in public function `launch`:
D103: Missing docstring in public function
torch/distributed/launch.py:180 in public function `main`:
D103: Missing docstring in public function
5
torch/distributed/launch.py:157 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/launch.py:172 in public function `launch`:
D103: Missing docstring in public function
torch/distributed/launch.py:181 in public function `main`:
D103: Missing docstring in public function
3
torch/distributed/remote_device.py
torch/distributed/remote_device.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/remote_device.py:81 in private method `worker_name`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:81 in private method `worker_name`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/distributed/remote_device.py:88 in private method `rank`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:88 in private method `rank`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/distributed/remote_device.py:95 in private method `device`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/distributed/remote_device.py:95 in private method `device`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
7
torch/distributed/remote_device.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/remote_device.py:85 in private method `rank`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:85 in private method `rank`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
3
torch/distributed/rendezvous.py
torch/distributed/rendezvous.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/rendezvous.py:23 in public function `register_rendezvous_handler`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/distributed/rendezvous.py:88 in public function `rendezvous`:
D103: Missing docstring in public function
torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`:
D400: First line should end with a period (not 'r')
5
torch/distributed/rendezvous.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/rendezvous.py:89 in public function `rendezvous`:
D103: Missing docstring in public function
2
torch/distributed/run.py
torch/distributed/run.py:9 at module level:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:9 at module level:
D400: First line should end with a period (not '`')
torch/distributed/run.py:393 in public function `get_args_parser`:
D202: No blank lines allowed after function docstring (found 1)
torch/distributed/run.py:393 in public function `get_args_parser`:
D401: First line should be in imperative mood; try rephrasing (found 'Helper')
torch/distributed/run.py:610 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/run.py:615 in public function `parse_min_max_nnodes`:
D103: Missing docstring in public function
torch/distributed/run.py:629 in public function `determine_local_world_size`:
D103: Missing docstring in public function
torch/distributed/run.py:670 in public function `get_rdzv_endpoint`:
D103: Missing docstring in public function
torch/distributed/run.py:677 in public function `get_use_env`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:677 in public function `get_use_env`:
D401: First line should be in imperative mood (perhaps 'Retrieve', not 'Retrieves')
torch/distributed/run.py:689 in public function `config_from_args`:
D103: Missing docstring in public function
torch/distributed/run.py:770 in public function `run_script_path`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:770 in public function `run_script_path`:
D401: First line should be in imperative mood (perhaps 'Run', not 'Runs')
torch/distributed/run.py:781 in public function `run`:
D103: Missing docstring in public function
torch/distributed/run.py:804 in public function `main`:
D103: Missing docstring in public function
15
torch/distributed/run.py:611 in public function `parse_args`:
D103: Missing docstring in public function
torch/distributed/run.py:616 in public function `parse_min_max_nnodes`:
D103: Missing docstring in public function
torch/distributed/run.py:630 in public function `determine_local_world_size`:
D103: Missing docstring in public function
torch/distributed/run.py:671 in public function `get_rdzv_endpoint`:
D103: Missing docstring in public function
torch/distributed/run.py:691 in public function `config_from_args`:
D103: Missing docstring in public function
torch/distributed/run.py:784 in public function `run`:
D103: Missing docstring in public function
torch/distributed/run.py:807 in public function `main`:
D103: Missing docstring in public function
7
torch/distributed/__init__.py
torch/distributed/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/distributed/__init__.py:8 in public function `is_available`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/__init__.py:8 in public function `is_available`:
D400: First line should end with a period (not ',')
torch/distributed/__init__.py:8 in public function `is_available`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
4
torch/distributed/__init__.py:1 at module level:
D104: Missing docstring in public package
1
torch/distributed/utils.py:1 at module level:
D100: Missing docstring in public module
torch/distributed/utils.py:16 in private function `_pack_kwargs`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:16 in private function `_pack_kwargs`:
D400: First line should end with a period (not ')')
torch/distributed/utils.py:47 in private function `_cast_forward_inputs`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:88 in private function `_recursive_to`:
D200: One-line docstring should fit on one line with quotes (found 3)
torch/distributed/utils.py:141 in private function `_p_assert`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:141 in private function `_p_assert`:
D209: Multi-line docstring closing quotes should be on a separate line
torch/distributed/utils.py:141 in private function `_p_assert`:
D400: First line should end with a period (not 't')
torch/distributed/utils.py:141 in private function `_p_assert`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/distributed/utils.py:275 in private function `_sync_module_states`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:275 in private function `_sync_module_states`:
D400: First line should end with a period (not 'n')
torch/distributed/utils.py:275 in private function `_sync_module_states`:
D401: First line should be in imperative mood (perhaps 'Sync', not 'Syncs')
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
D400: First line should end with a period (not 'y')
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
D401: First line should be in imperative mood (perhaps 'Synchronize', not 'Synchronizes')
15
torch/distributed/utils.py:1 at module level:
D100: Missing docstring in public module
1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112953
Approved by: https://github.com/weifengpy
Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility.
Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). The dashes are more common in other command-line tools. And it looks to be the default choice in the Python standard library:
`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)
```python
class BooleanOptionalAction(Action):
def __init__(...):
if option_string.startswith('--'):
option_string = '--no-' + option_string[2:]
_option_strings.append(option_string)
```
It adds `--no-argname`, not `--no_argname`. Also typing `_` need to press the shift or the caps-lock key than `-`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
This is a new version of #15648 based on the latest master branch.
Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.
In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)
Fixes https://github.com/pytorch/pytorch/issues/71105
@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
Summary:
This PR introduces a new `torchrun` entrypoint that simply "points" to `python -m torch.distributed.run`. It is shorter and less error-prone to type and gives a nicer syntax than a rather cryptic `python -m ...` command line. Along with the new entrypoint the documentation is also updated and places where `torch.distributed.run` are mentioned are replaced with `torchrun`.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64049
Reviewed By: cbalioglu
Differential Revision: D30584041
Pulled By: kiukchung
fbshipit-source-id: d99db3b5d12e7bf9676bab70e680d4b88031ae2d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910
Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port` as such:
```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```
An `Address in use` error is raised since the training script tries to create a TCPStore on port 6000, which is already taken since the elastic agent is already running a TCPStore on that port.
For details see: https://github.com/pytorch/pytorch/issues/63874.
This change does a couple of things:
1. Adds `is_torchelastic_launched()` check function that users can use in the training scripts to see whether the script is launched via torchelastic.
1. Update the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
1. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and if so, only create TCPStore clients (no daemons, not even for rank 0).
1. Adds a bunch of unittests to cover the different code paths
NOTE: the issue mentions that we should fail-fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in pytorch: env://, tcp://, file://. Since this diff makes tcp:// compatible with torchelastic and I've validated that file is compatible with torchelastic. There is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future but this is out of scope for this issue.
Test Plan: Unittests.
Reviewed By: cbalioglu
Differential Revision: D30529984
fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61294
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925
* Make `torch.distributed.launch` restarts to 0
* Remove unnecessary `-use_env` warning, move `-use_env` warnings
* Move `-use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set default events handler to `null` that does not print events to console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when parent dies
Issues resolved:
https://github.com/pytorch/pytorch/issues/60716https://github.com/pytorch/pytorch/issues/60754
Test Plan:
sandcastle
python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts
python -m torch.distributed.launch --nproc_per_node=4 --use_env --no_python main.py -> produces error
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py -> no warning
python -m torch.distributed.launch --nproc_per_node=4 --no_python main.py ->warning
Output of running torch.distributed.launch without --use_env:
$path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ('LOCAL_RANK')` instead.
New section:
{F628923078}
{F628974089}
Reviewed By: cbalioglu
Differential Revision: D29559553
fbshipit-source-id: 03ed9ba638bf154354e1530ffc964688431edf6b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925
* Make `torch.distributed.launch` restarts to 0
* Remove unnecessary `-use_env` warning, move `-use_env` warnings
* Move `-use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set default events handler to `null` that does not print events to console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when parent dies
Issues resolved:
https://github.com/pytorch/pytorch/issues/60716https://github.com/pytorch/pytorch/issues/60754
Test Plan:
sandcastle
python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts
python -m torch.distributed.launch --nproc_per_node=4 --use_env --no_python main.py -> produces error
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py -> no warning
python -m torch.distributed.launch --nproc_per_node=4 --no_python main.py ->warning
Output of running torch.distributed.launch without --use_env:
$path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ('LOCAL_RANK')` instead.
New section:
{F628923078}
{F628974089}
Reviewed By: kiukchung, cbalioglu
Differential Revision: D29413019
fbshipit-source-id: 323bfbad9d0e4aba3b10ddd7a243ca6e48169630
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56037
The diff introduces new `torch.distributed.elastic_launch` and removes internals of `torch.distributed.launch` keeping backwards compatibility.
Since torchelastic and torch.launch are not fully compatible due to `--use_env` arg, the `torch.distributed.launch` deprecation is going to be iterative: as part of pytorch 1.9 we are going to deprecate it, and in the following releases we will remove `torch.distributed.launch`
The diff leaves `torchelastic.distributed.launch` module, and the follow up diffs will migrate the users form `torchelastic.distributed.launch` to `torch.distributed.elastic_launch`
Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/...
Reviewed By: H-Huang
Differential Revision: D27805799
fbshipit-source-id: 599a4c0592fbc7a1bc1953040626dd6b72bac907
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56037
The diff introduces new `torch.distributed.elastic_launch` and removes internals of `torch.distributed.launch` keeping backwards compatibility.
Since torchelastic and torch.launch are not fully compatible due to `--use_env` arg, the `torch.distributed.launch` deprecation is going to be iterative: as part of pytorch 1.9 we are going to deprecate it, and in the following releases we will remove `torch.distributed.launch`
The diff leaves `torchelastic.distributed.launch` module, and the follow up diffs will migrate the users form `torchelastic.distributed.launch` to `torch.distributed.elastic_launch`
Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/...
Reviewed By: cbalioglu
Differential Revision: D27753803
fbshipit-source-id: 5f24bcfdcb70356f0787b11f6cb9479f3515fb47
Summary:
These unused variables were identified by [pyflakes](https://pypi.org/project/pyflakes/). They can be safely removed to simplify the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50181
Reviewed By: gchanan
Differential Revision: D25844270
fbshipit-source-id: 0e648ffe8c6db6daf56788a13ba89806923cbb76
Summary:
I was exhausted with needing to hunt down zombies when working with ddp launcher, so this PR solves the various zombie issues.
This PR addresses 2 distinct zombie scenarios caused by ddp launch.py:
1. When the main process is killed, the child processes aren't killed and continue running
2. When any of the children processes dies (e.g. OOM), the rest of the children and the parent remain running, but really are stuck
To solve these problems this PR switches from `wait` to `poll` and uses signal handlers.
The main problem with `wait()` was that it's not async, and I was having a 2nd process OOM, and the code was stuck waiting for the first process to finish which will not happen since the first process is blocking now waiting for the 2nd process - a sort of deadlock. My 2nd card is smaller than the first one, so it occasionally OOMs.
Using `asyncio` would probably be the cleanest solution, but as it's relatively new in python, perhaps polling is good enough.
I wrote this little script to reproduce 2 problematic scenarios and a normal running setup, it does 3 different things according to the `--mode` arg
- `oom` - causes the 2nd process to exit prematurely emulating OOM
- `clean-finish` - just exit normally in both processes
- `False` (lack of arg) just keep on running - emulating multiple normally running processes
```
# oom.py
import argparse
from time import sleep
import sys
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", default=False, type=int)
parser.add_argument("--mode", default=False, type=str)
args, _ = parser.parse_known_args()
print(f"{args.local_rank} is starting")
sleep(3)
if args.mode == "oom":
# emulate OOM in 2nd card
if args.local_rank == 1:
raise RuntimeError("OOM")
if args.mode == "clean-finish":
sleep(1)
print(f"{args.local_rank} is cleanly finishing")
sys.exit(0)
while (True):
# emulate long running process
print(f"{args.local_rank} is running")
sleep(1)
if __name__ == "__main__":
main()
```
Let's begin:
### 1. Normal execution
```
python -m torch.distributed.launch --nproc_per_node=2 ./oom.py --mode=clean-finish
```
All the processes exit upon completion - I won't bother pasting the log here - just testing that my code didn't break the normal running
### 2. OOM
```
python -m torch.distributed.launch --nproc_per_node=2 ./oom.py --mode=oom
```
```
POLLING FOR 17547
POLLING FOR 17548
0
0 is starting
1
1 is starting
POLLING FOR 17547
POLLING FOR 17548
POLLING FOR 17548
POLLING FOR 17547
POLLING FOR 17547
POLLING FOR 17548
0 is running
Traceback (most recent call last):
File "./oom.py", line 33, in <module>
main()
File "./oom.py", line 20, in main
raise RuntimeError("OOM")
RuntimeError: OOM
POLLING FOR 17548
process 17548 is no more
Killing subprocess 17547
Killing subprocess 17548
Traceback (most recent call last):
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 341, in <module>
main()
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 327, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/stas/anaconda3/envs/main-38/bin/python', '-u', './oom.py', '--local_rank=1', '--mode=oom']' returned non-zero exit status 1.
```
All processes exited and the trace was printed
### 3. Exit on SIGINT/SIGTERM
If I started a process and then realized I made a mistake I want to be able to kill it cleanly and if any sub-processes have already been spawned I want them to be killed too. Here the sighandler takes care of trapping the SIGTERM/SIGINT.
```
python -m torch.distributed.launch --nproc_per_node=2 ./oom.py
```
Here the processes emulate a long normal run.
So let's Ctrl-C the process as soon as it started and see:
```
POLLING FOR 18749
POLLING FOR 18750
0
0 is starting
1
1 is starting
POLLING FOR 18749
POLLING FOR 18750
POLLING FOR 18750
POLLING FOR 18749
POLLING FOR 18749
POLLING FOR 18750
0 is running
1 is running
POLLING FOR 18750
POLLING FOR 18749
0 is running
1 is running
^CTraceback (most recent call last):
Killing subprocess 18749
Traceback (most recent call last):
File "./oom.py", line 33, in <module>
File "./oom.py", line 33, in <module>
Killing subprocess 18750
Parent got kill signal=SIGINT, exiting
```
all processes got killed
--------------------------------
So this covered the 2 problematic cases and 1 normal case
Notes:
- we could probably switch to `sleep(3)` - `1` is probably too fast
- all the debug prints will be removed once you are happy - I left them so that it's easier for you to test that my PR does the right thing.
Thank you!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49305
Reviewed By: izdeby
Differential Revision: D25565617
Pulled By: rohan-varma
fbshipit-source-id: 1ea864113f283d4daac5eef1131c8d745aae4c99
Summary:
Closes https://github.com/pytorch/pytorch/issues/7134. This request is to add an option to log the subprocess output (each subprocess is training a network with DDP) to a file instead of the default stdout.
The reason for this is that if we have N processes all writing to stdout, it'll be hard to decipher the output, and it would be cleaner to log these to separate files.
To support this, we add an optional argument `--logdir` set the subprocess stdout to be the a file of the format "node_rank_{}_local_rank_{}" in the logging directory. With this enabled, none of the training processes output to the parent process stdout, and instead write to the aformentioned file. If a user accidently passes in something that's not a directory, we fallback to ignoring this argument.
Tested by taking a training script at https://gist.github.com/rohan-varma/2ff1d6051440d2c18e96fe57904b55d9 and running `python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port="29500" --logdir test_logdir train.py`. This results in a directory `test_logdir` with files "node_0_local_rank_0" and "node_0_local_rank_1" being created with the training process stdout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/33193
Reviewed By: gchanan
Differential Revision: D24496013
Pulled By: rohan-varma
fbshipit-source-id: 1d3264cba242290d43db736073e841bbb5cb9e68
Summary:
Adds a '-m' flag to torch.distributed.launch that allows users to launch python modules using launch instead of specifying the full file path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24910
Differential Revision: D17221653
Pulled By: pietern
fbshipit-source-id: 5c6453ed266fd121103b11caab303e3f9404227d
Summary:
per https://github.com/pytorch/pytorch/issues/22260, default number of open mp threads are spawned to be the same of number of cores available, for multi processing data parallel cases, too many threads may be spawned and could overload the CPU, resulting in performance regression.
so set OMP_NUM_THREADS = number of CPU processors/number of processes in default to neither overload or waste CPU threads
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22501
Test Plan:
1. without and with this change, example codes result in same result
python ~/local/fbsource-fbcode/fbcode/caffe2/torch/distributed/launch.py --nproc_per_node=2 pytorch/examples/yanlizhao/distributed_launch_example.py
Setting OMP_NUM_THREADS environment variable for each process to be: 24, which
is max(1, num_cpus / num_processes), you can further tune the variable for optimal performance in your application if needed.
final loss = tensor(0.5211, device='cuda:0', grad_fn=<MseLossBackward>)
Differential Revision: D16092225
Pulled By: zhaojuanmao
fbshipit-source-id: b792a4c27a7ffae40e4a59e96669209c6a85e27f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**
This was requested by someone at Facebook; this lint is turned
on for Facebook by default. "Sure, why not."
I had to noqa a number of imports in __init__. Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it. Left for future work.
Be careful! flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments. flake8-3 will
report an import unused; flake8-2 will not. For now, I just
noqa'd all these sites.
All the changes were done by hand.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14687478
fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
Summary:
In `torch.distributed.launch.py`, it passes `local_rank` as argument and requires user's program to parse it. However, it would be more flexible for users and consistent with other variables, e.g. `RANK`, `MASTER_PORT`, `WORLD_SIZE`, if passing through environment variables.
265ed8ff45/torch/distributed/launch.py (L200-L212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16360
Differential Revision: D14070372
Pulled By: ezyang
fbshipit-source-id: c3f6a8e55ab513918cad09d1326eccdedb4d98c9
Summary:
`torch.distributed.launch.py` will not raise error when `subprocess.Popen` is not return 0.
For better debugging it should always raise an error if processes launched have unusual behavior
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16069
Differential Revision: D13709467
Pulled By: ezyang
fbshipit-source-id: 31d32a5ec8fed7bccd62d845bfba0e670ed3fe20
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
CC deepakn94
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12370
Differential Revision: D10220135
Pulled By: ezyang
fbshipit-source-id: 6d1a8a383951ae52753e4f75a14b8080bf02b815