43 Commits

995df34b19 [BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547
Approved by: https://github.com/kwen2501
2025-02-28 07:35:56 +00:00
7c12cc7ce4 Flip default value for mypy disallow_untyped_defs [6/11] (#127843)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843
Approved by: https://github.com/oulgen
ghstack dependencies: #127842
2024-06-08 18:49:29 +00:00
67ef2683d9 [BE] wrap deprecated function/class with typing_extensions.deprecated (#127689)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.

Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
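A minimal sketch of the two styles described above (function names are illustrative; requires `typing_extensions` >= 4.5):

```python
import warnings

from typing_extensions import deprecated


@deprecated("old_api() is deprecated, use new_api() instead", category=FutureWarning)
def old_api() -> int:
    # Calling this function emits a FutureWarning automatically.
    return 42


def legacy_helper() -> int:
    # Fallback style for call sites that cannot use the decorator:
    # an explicit warning with the category spelled out.
    warnings.warn("legacy_helper() is deprecated", category=FutureWarning)
    return 42
```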

Resolves #126888

- #126888

This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007
2024-06-02 12:30:43 +00:00
033e733021 Revert "[BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)"
This reverts commit 749a132fb0a8325cbad4734a563aa459ca611991.

Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))
2024-05-31 19:47:24 +00:00
749a132fb0 [BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.

Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.

UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.

Resolves #126888

- #126888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
2024-05-29 12:09:27 +00:00
0eab740db3 [Docs][Distributed] Add migration notes for --local-rank option style change for torchrun in PyTorch 2.0 (#109480)
Fixes https://github.com/pytorch/pytorch/pull/94505#issuecomment-1722777767
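A hedged sketch of the migration being documented (the argument wiring here is illustrative, not the exact torchrun behavior): new-style scripts read the local rank from the `LOCAL_RANK` environment variable, while keeping the command-line flag as a fallback for older launchers.

```python
import argparse
import os


def get_local_rank(argv=None) -> int:
    """Resolve the worker's local rank, preferring the LOCAL_RANK env var."""
    parser = argparse.ArgumentParser()
    # Accept both spellings for backward compatibility; torchrun >= 2.0
    # prefers the dashed form, older launchers passed --local_rank.
    parser.add_argument(
        "--local-rank", "--local_rank",
        type=int,
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    args, _ = parser.parse_known_args(argv)
    return args.local_rank
```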

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109480
Approved by: https://github.com/ezyang
2024-04-16 05:51:57 +00:00
b6201a60c5 [BE] minor logging cleanup in distributed (#122921)
Summary:
    Minor logging cleanup in distributed library
    1. Don't use f-string formatted log messages - addresses linter issues.
    2. Nits: Make use of unused `e` (error) in a few logs.
    3. Change info->debug as asked in issue #113545
    4. Nit: rename log -> logger in a few files for consistency
    5. Fix a linter error.
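For item 1, the lint-friendly pattern is deferred %-style formatting, where the logging framework interpolates arguments only if the record is actually emitted (a generic sketch, not the exact code touched by this PR):

```python
import io
import logging

logger = logging.getLogger("demo")
handler = logging.StreamHandler(io.StringIO())
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

rank, step = 0, 10
# Bad (flagged by the linter): the f-string is built even if DEBUG is disabled.
# logger.debug(f"rank {rank} finished step {step}")
# Good: arguments are interpolated lazily by the logging framework.
logger.debug("rank %d finished step %d", rank, step)
```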

    Test Plan:
    1. Local build passes.
    2. Linter is happy.

    Reviewers: wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
2024-03-29 03:34:01 +00:00
a8097ed479 Fix docstring errors in _composable_state.py, remote_device.py, value_ranges.py, utils.py, run.py, rendezvous.py, launch.py, argparse_util.py, __init__.py, _cycles.py (#112953)
Fixes #112639

```txt
 torch/utils/_sympy/value_ranges.py
 torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`:
        D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:68 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:86 in public method `tighten`:
        D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:90 in public method `__and__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:103 in public method `__or__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:118 in public method `unknown`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:122 in public method `wrap`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:129 in public method `increasing_map`:
        D400: First line should end with a period (not ')')
torch/utils/_sympy/value_ranges.py:135 in public method `decreasing_map`:
        D400: First line should end with a period (not ')')
torch/utils/_sympy/value_ranges.py:141 in public method `monotone_map`:
        D400: First line should end with a period (not 'g')
torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`:
        D400: First line should end with a period (not '0')
torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`:
        D403: First word of the first line should be properly capitalized ('Fn', not 'fn')
torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`:
        D400: First line should end with a period (not ':')
torch/utils/_sympy/value_ranges.py:171 in public method `coordinatewise_monotone_map`:
        D400: First line should end with a period (not 'e')
torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`:
        D400: First line should end with a period (not 's')
torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`:
        D210: No whitespaces allowed surrounding docstring text
torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`:
        D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:488 in public class `ValueRangeAnalysis`:
        D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:489 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:501 in public method `bool_handler`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:506 in public method `default_handler`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:511 in public method `load`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:514 in public method `store`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:517 in public method `reduction`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:520 in public method `index_expr`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:525 in public method `to_dtype`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:558 in public method `square`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:562 in public method `neg`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:566 in public method `truncdiv`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:577 in public method `sub`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:580 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:585 in public function `bound_sympy`:
        D103: Missing docstring in public function
36
torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`:
        D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:68 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:86 in public method `tighten`:
        D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:90 in public method `__and__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:103 in public method `__or__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:118 in public method `unknown`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:122 in public method `wrap`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`:
        D400: First line should end with a period (not 's')
torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`:
        D210: No whitespaces allowed surrounding docstring text
torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`:
        D400: First line should end with a period (not 'n')
torch/utils/_sympy/value_ranges.py:490 in public class `ValueRangeAnalysis`:
        D101: Missing docstring in public class
torch/utils/_sympy/value_ranges.py:491 in public method `__init__`:
        D107: Missing docstring in __init__
torch/utils/_sympy/value_ranges.py:503 in public method `bool_handler`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:508 in public method `default_handler`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:513 in public method `load`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:516 in public method `store`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:519 in public method `reduction`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:522 in public method `index_expr`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:527 in public method `to_dtype`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:560 in public method `square`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:564 in public method `neg`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:568 in public method `truncdiv`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:579 in public method `sub`:
        D102: Missing docstring in public method
torch/utils/_sympy/value_ranges.py:582 in public method `__getattr__`:
        D105: Missing docstring in magic method
torch/utils/_sympy/value_ranges.py:587 in public function `bound_sympy`:
        D103: Missing docstring in public function
28

torch/utils/viz/_cycles.py
torch/utils/viz/_cycles.py:14 in public function `observe_garbage`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:207 in public function `object_annotation`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/viz/_cycles.py:207 in public function `object_annotation`:
        D400: First line should end with a period (not 'g')
torch/utils/viz/_cycles.py:256 in public class `Node`:
        D101: Missing docstring in public class
torch/utils/viz/_cycles.py:262 in public function `create_graph`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:308 in public function `escape`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:335 in public function `to_dot`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:406 in public function `to_html`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
        D205: 1 blank line required between summary line and description (found 0)
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
        D400: First line should end with a period (not 'p')
torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`:
        D401: First line should be in imperative mood; try rephrasing (found 'Reference')
14
torch/utils/viz/_cycles.py:14 in public function `observe_garbage`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:256 in public class `Node`:
        D101: Missing docstring in public class
torch/utils/viz/_cycles.py:262 in public function `create_graph`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:308 in public function `escape`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:335 in public function `to_dot`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:406 in public function `to_html`:
        D103: Missing docstring in public function
torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`:
        D103: Missing docstring in public function
9

torch/distributed/argparse_util.py
torch/distributed/argparse_util.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/argparse_util.py:13 in public class `env`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/argparse_util.py:13 in public class `env`:
        D400: First line should end with a period (not 'g')
torch/distributed/argparse_util.py:13 in public class `env`:
        D412: No blank lines allowed between a section header and its content ('Example')
torch/distributed/argparse_util.py:43 in public method `__init__`:
        D107: Missing docstring in __init__
torch/distributed/argparse_util.py:56 in public method `__call__`:
        D102: Missing docstring in public method
torch/distributed/argparse_util.py:61 in public class `check_env`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/argparse_util.py:61 in public class `check_env`:
        D400: First line should end with a period (not 's')
torch/distributed/argparse_util.py:61 in public class `check_env`:
        D412: No blank lines allowed between a section header and its content ('Example')
torch/distributed/argparse_util.py:97 in public method `__init__`:
        D107: Missing docstring in __init__
torch/distributed/argparse_util.py:102 in public method `__call__`:
        D102: Missing docstring in public method
11
torch/distributed/argparse_util.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/argparse_util.py:43 in public method `__init__`:
        D107: Missing docstring in __init__
torch/distributed/argparse_util.py:56 in public method `__call__`:
        D102: Missing docstring in public method
torch/distributed/argparse_util.py:97 in public method `__init__`:
        D107: Missing docstring in __init__
torch/distributed/argparse_util.py:102 in public method `__call__`:
        D102: Missing docstring in public method
5

torch/distributed/_composable_state.py
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
        D202: No blank lines allowed after function docstring (found 1)
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/_composable_state.py:20 in private function `_get_module_state`:
        D400: First line should end with a period (not '`')
3
0

torch/distributed/launch.py
torch/distributed/launch.py:1 at module level:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/launch.py:1 at module level:
        D400: First line should end with a period (not 'd')
torch/distributed/launch.py:156 in public function `parse_args`:
        D103: Missing docstring in public function
torch/distributed/launch.py:171 in public function `launch`:
        D103: Missing docstring in public function
torch/distributed/launch.py:180 in public function `main`:
        D103: Missing docstring in public function
5
torch/distributed/launch.py:157 in public function `parse_args`:
        D103: Missing docstring in public function
torch/distributed/launch.py:172 in public function `launch`:
        D103: Missing docstring in public function
torch/distributed/launch.py:181 in public function `main`:
        D103: Missing docstring in public function
3

torch/distributed/remote_device.py
torch/distributed/remote_device.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/remote_device.py:81 in private method `worker_name`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:81 in private method `worker_name`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/distributed/remote_device.py:88 in private method `rank`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:88 in private method `rank`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/distributed/remote_device.py:95 in private method `device`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/distributed/remote_device.py:95 in private method `device`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
7
torch/distributed/remote_device.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/remote_device.py:85 in private method `rank`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/remote_device.py:85 in private method `rank`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
3

torch/distributed/rendezvous.py
torch/distributed/rendezvous.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/rendezvous.py:23 in public function `register_rendezvous_handler`:
        D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/distributed/rendezvous.py:88 in public function `rendezvous`:
        D103: Missing docstring in public function
torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`:
        D400: First line should end with a period (not 'r')
5
torch/distributed/rendezvous.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/rendezvous.py:89 in public function `rendezvous`:
        D103: Missing docstring in public function
2

torch/distributed/run.py
torch/distributed/run.py:9 at module level:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:9 at module level:
        D400: First line should end with a period (not '`')
torch/distributed/run.py:393 in public function `get_args_parser`:
        D202: No blank lines allowed after function docstring (found 1)
torch/distributed/run.py:393 in public function `get_args_parser`:
        D401: First line should be in imperative mood; try rephrasing (found 'Helper')
torch/distributed/run.py:610 in public function `parse_args`:
        D103: Missing docstring in public function
torch/distributed/run.py:615 in public function `parse_min_max_nnodes`:
        D103: Missing docstring in public function
torch/distributed/run.py:629 in public function `determine_local_world_size`:
        D103: Missing docstring in public function
torch/distributed/run.py:670 in public function `get_rdzv_endpoint`:
        D103: Missing docstring in public function
torch/distributed/run.py:677 in public function `get_use_env`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:677 in public function `get_use_env`:
        D401: First line should be in imperative mood (perhaps 'Retrieve', not 'Retrieves')
torch/distributed/run.py:689 in public function `config_from_args`:
        D103: Missing docstring in public function
torch/distributed/run.py:770 in public function `run_script_path`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/run.py:770 in public function `run_script_path`:
        D401: First line should be in imperative mood (perhaps 'Run', not 'Runs')
torch/distributed/run.py:781 in public function `run`:
        D103: Missing docstring in public function
torch/distributed/run.py:804 in public function `main`:
        D103: Missing docstring in public function
15
torch/distributed/run.py:611 in public function `parse_args`:
        D103: Missing docstring in public function
torch/distributed/run.py:616 in public function `parse_min_max_nnodes`:
        D103: Missing docstring in public function
torch/distributed/run.py:630 in public function `determine_local_world_size`:
        D103: Missing docstring in public function
torch/distributed/run.py:671 in public function `get_rdzv_endpoint`:
        D103: Missing docstring in public function
torch/distributed/run.py:691 in public function `config_from_args`:
        D103: Missing docstring in public function
torch/distributed/run.py:784 in public function `run`:
        D103: Missing docstring in public function
torch/distributed/run.py:807 in public function `main`:
        D103: Missing docstring in public function
7

torch/distributed/__init__.py
torch/distributed/__init__.py:1 at module level:
        D104: Missing docstring in public package
torch/distributed/__init__.py:8 in public function `is_available`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/__init__.py:8 in public function `is_available`:
        D400: First line should end with a period (not ',')
torch/distributed/__init__.py:8 in public function `is_available`:
        D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
4
torch/distributed/__init__.py:1 at module level:
        D104: Missing docstring in public package
1

torch/distributed/utils.py:1 at module level:
        D100: Missing docstring in public module
torch/distributed/utils.py:16 in private function `_pack_kwargs`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:16 in private function `_pack_kwargs`:
        D400: First line should end with a period (not ')')
torch/distributed/utils.py:47 in private function `_cast_forward_inputs`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:88 in private function `_recursive_to`:
        D200: One-line docstring should fit on one line with quotes (found 3)
torch/distributed/utils.py:141 in private function `_p_assert`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:141 in private function `_p_assert`:
        D209: Multi-line docstring closing quotes should be on a separate line
torch/distributed/utils.py:141 in private function `_p_assert`:
        D400: First line should end with a period (not 't')
torch/distributed/utils.py:141 in private function `_p_assert`:
        D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/distributed/utils.py:275 in private function `_sync_module_states`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:275 in private function `_sync_module_states`:
        D400: First line should end with a period (not 'n')
torch/distributed/utils.py:275 in private function `_sync_module_states`:
        D401: First line should be in imperative mood (perhaps 'Sync', not 'Syncs')
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
        D205: 1 blank line required between summary line and description (found 0)
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
        D400: First line should end with a period (not 'y')
torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`:
        D401: First line should be in imperative mood (perhaps 'Synchronize', not 'Synchronizes')
15
torch/distributed/utils.py:1 at module level:
        D100: Missing docstring in public module
1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112953
Approved by: https://github.com/weifengpy
2023-11-08 01:13:09 +00:00
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under the torch/distributed directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
a229b4526f [BE] Prefer dash over underscore in command-line options (#94505)
Prefer dashes over underscores in command-line options: add `--command-arg-name` to the argument parser while keeping the old underscore arguments (`--command_arg_name`) for backward compatibility.

Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in their arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:

`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)

```python
class BooleanOptionalAction(Action):
    def __init__(...):
            if option_string.startswith('--'):
                option_string = '--no-' + option_string[2:]
                _option_strings.append(option_string)
```

It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the Shift or Caps Lock key, while `-` does not.
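The backward-compatible pattern described above can be sketched with plain argparse (the option name is illustrative): register the dashed spelling as canonical and keep the underscore form as an alias.

```python
import argparse

parser = argparse.ArgumentParser()
# Canonical dashed form first; the legacy underscore form is kept as an alias.
# argparse converts dashes to underscores when deriving the attribute name,
# so both spellings land in args.master_port.
parser.add_argument("--master-port", "--master_port", type=int, default=29500)

new_style = parser.parse_args(["--master-port", "6000"])
old_style = parser.parse_args(["--master_port", "6000"])
```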

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-09 20:16:49 +00:00
4618371da5 Integrate xdoctest - Rebased (#82797)
This is a new version of #15648 based on the latest master branch.

Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.

In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)
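For illustration, the `# xdoctest: +SKIP` directive mentioned above goes inline in the docstring's doctest (the function and call names here are hypothetical):

```python
def flaky_example() -> bool:
    """Demonstrate disabling a failing doctest with an xdoctest directive.

    Example:
        >>> # xdoctest: +SKIP
        >>> crash_prone_call()  # hypothetical call that would segfault on CI
    """
    return True
```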

Fixes https://github.com/pytorch/pytorch/issues/71105

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
2022-08-12 02:08:01 +00:00
65e6194aeb Introduce the torchrun entrypoint (#64049)
Summary:
This PR introduces a new `torchrun` entrypoint that simply "points" to `python -m torch.distributed.run`. It is shorter and less error-prone to type and gives a nicer syntax than a rather cryptic `python -m ...` command line. Along with the new entrypoint the documentation is also updated and places where `torch.distributed.run` are mentioned are replaced with `torchrun`.
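A console-script entrypoint like this is typically declared in the package's setuptools metadata; a hedged sketch (the exact upstream setup.py wiring may differ):

```python
# setuptools `entry_points` fragment: installing the package generates a
# `torchrun` executable that calls torch.distributed.run's main() function,
# making `torchrun ...` equivalent to `python -m torch.distributed.run ...`.
entry_points = {
    "console_scripts": [
        "torchrun = torch.distributed.run:main",
    ],
}
```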

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64049

Reviewed By: cbalioglu

Differential Revision: D30584041

Pulled By: kiukchung

fbshipit-source-id: d99db3b5d12e7bf9676bab70e680d4b88031ae2d
2021-08-26 20:17:48 -07:00
9d95d48567 (torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910

Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running a training script that initializes the process group with `init_method=tcp://localhost:$port`, as follows:

```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```

An `Address in use` error is raised since the training script tries to create a TCPStore on port 6000, which is already taken by the TCPStore the elastic agent is running on that port.

For details see: https://github.com/pytorch/pytorch/issues/63874.

This change does a couple of things:

1. Adds `is_torchelastic_launched()` check function that users can use in the training scripts to see whether the script is launched via torchelastic.
1. Update the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
1. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and if so, only create TCPStore clients (no daemons, not even for rank 0).
1. Adds a bunch of unittests to cover the different code paths

NOTE: the issue mentions that we should fail fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in PyTorch: env://, tcp://, and file://. Since this diff makes tcp:// compatible with torchelastic, and I've validated that file:// is compatible as well, there is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future, but that is out of scope for this issue.
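A minimal sketch (not necessarily the exact upstream implementation) of the check added in item 1: the elastic agent exports environment variables such as `TORCHELASTIC_RUN_ID` to its workers, so their presence signals a torchelastic launch.

```python
import os


def is_torchelastic_launched() -> bool:
    # The elastic agent sets TORCHELASTIC_RUN_ID in each worker's environment;
    # a plain `python script.py` run will not have it.
    return os.environ.get("TORCHELASTIC_RUN_ID") is not None
```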

Test Plan: Unittests.

Reviewed By: cbalioglu

Differential Revision: D30529984

fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
2021-08-25 22:57:43 -07:00
13658b10bb [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#61294)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61294

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925

* Set `torch.distributed.launch` restarts to 0
* Remove the unnecessary `--use_env` warning and move the remaining `--use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set default events handler to `null` that does not print events to console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when parent dies

Issues resolved:

https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754

Test Plan:
sandcastle

    python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
    python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts

    python -m torch.distributed.launch --nproc_per_node=4  --use_env --no_python  main.py -> produces error
    python -m torch.distributed.launch --nproc_per_node=4  --use_env main.py -> no warning
    python -m torch.distributed.launch --nproc_per_node=4  --no_python  main.py -> warning

Output of running torch.distributed.launch without --use_env:

    $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torch.distributed.run.
    Note that --use_env is set by default in torch.distributed.run.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ('LOCAL_RANK')` instead.

New section:

{F628923078}

{F628974089}

Reviewed By: cbalioglu

Differential Revision: D29559553

fbshipit-source-id: 03ed9ba638bf154354e1530ffc964688431edf6b
2021-07-08 16:28:06 -07:00
ccfdb30644 Revert D29413019: [torch] Various improvements to torch.distributed.launch and torch.distributed.run
Test Plan: revert-hammer

Differential Revision:
D29413019 (4e181dfc35)

Original commit changeset: 323bfbad9d0e

fbshipit-source-id: 1f8ae4b3d0a23f3eaff28c37e9148efff25fafe2
2021-07-01 08:44:51 -07:00
4e181dfc35 [torch] Various improvements to torch.distributed.launch and torch.distributed.run (#60925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925

* Set `torch.distributed.launch` restarts to 0
* Remove the unnecessary `--use_env` warning and move the remaining `--use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set default events handler to `null` that does not print events to console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when parent dies

Issues resolved:

https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754

Test Plan:
sandcastle

    python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
    python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts

    python -m torch.distributed.launch --nproc_per_node=4  --use_env --no_python  main.py -> produces error
    python -m torch.distributed.launch --nproc_per_node=4  --use_env main.py -> no warning
    python -m torch.distributed.launch --nproc_per_node=4  --no_python  main.py -> warning

Output of running torch.distributed.launch without --use_env:

    $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torch.distributed.run.
    Note that --use_env is set by default in torch.distributed.run.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ('LOCAL_RANK')` instead.

New section:

{F628923078}

{F628974089}

Reviewed By: kiukchung, cbalioglu

Differential Revision: D29413019

fbshipit-source-id: 323bfbad9d0e4aba3b10ddd7a243ca6e48169630
2021-06-30 23:31:02 -07:00
c3745dc580 Small change for torch.distributed launcher (#59152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59152

Small change for https://fb.workplace.com/groups/319878845696681

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D28773682

Pulled By: H-Huang

fbshipit-source-id: acf82273e8622b7ffd3088d8d766bdf49273754c
2021-06-02 15:05:41 -07:00
8a949f9e51 [23/n][torch/elastic][upstream] Rename torch.distributed.elastic_launch to torch.distributed.run (#56831)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56831

Rename torch.distributed.elastic_launch to torch.distributed.run

Test Plan:
buck test mode/dev-nosan //pytorch/elastic/torchelastic/...
  buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/...
  flow-cli canary  pytorch.elastic.examples.classy_vision.main --entitlement gpu_prod --run-as-secure-group oncall_dai_pet --buck-target //fblearner/flow/projects/pytorch/elastic/examples:workflow

Reviewed By: kiukchung

Differential Revision: D27921159

fbshipit-source-id: cc7f2f035223b2d4abd7373af298998887e14c12
2021-04-29 11:06:20 -07:00
a6940aae37 [19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher (#56214)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56037

The diff introduces new  `torch.distributed.elastic_launch` and removes internals of `torch.distributed.launch` keeping backwards compatibility.

Since torchelastic and torch.launch are not fully compatible due to the `--use_env` arg, the `torch.distributed.launch` deprecation is going to be iterative: as part of pytorch 1.9 we are going to deprecate it, and in the following releases we will remove `torch.distributed.launch`.

The diff leaves the `torchelastic.distributed.launch` module, and follow-up diffs will migrate the users from `torchelastic.distributed.launch` to `torch.distributed.elastic_launch`.

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/...

Reviewed By: H-Huang

Differential Revision: D27805799

fbshipit-source-id: 599a4c0592fbc7a1bc1953040626dd6b72bac907
2021-04-16 13:38:23 -07:00
90e103ddfe Revert D27753803: [19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher
Test Plan: revert-hammer

Differential Revision:
D27753803 (7c708ef4ea)

Original commit changeset: 5f24bcfdcb70

fbshipit-source-id: 650e229b788d046450615364e5cba65065a95e3b
2021-04-15 15:03:14 -07:00
7c708ef4ea [19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher (#56037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56037

The diff introduces new  `torch.distributed.elastic_launch` and removes internals of `torch.distributed.launch` keeping backwards compatibility.

Since torchelastic and torch.launch are not fully compatible due to the `--use_env` arg, the `torch.distributed.launch` deprecation is going to be iterative: as part of pytorch 1.9 we are going to deprecate it, and in the following releases we will remove `torch.distributed.launch`.

The diff leaves the `torchelastic.distributed.launch` module, and follow-up diffs will migrate the users from `torchelastic.distributed.launch` to `torch.distributed.elastic_launch`.

Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/...

Reviewed By: cbalioglu

Differential Revision: D27753803

fbshipit-source-id: 5f24bcfdcb70356f0787b11f6cb9479f3515fb47
2021-04-15 11:09:12 -07:00
2c4b6ec457 Unused exception variables (#50181)
Summary:
These unused variables were identified by [pyflakes](https://pypi.org/project/pyflakes/). They can be safely removed to simplify the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50181

Reviewed By: gchanan

Differential Revision: D25844270

fbshipit-source-id: 0e648ffe8c6db6daf56788a13ba89806923cbb76
2021-01-08 13:33:18 -08:00
39d3578e91 [ddp launch] solve zombie problem (#49305)
Summary:
I was exhausted with needing to hunt down zombies when working with ddp launcher, so this PR solves the various zombie issues.

This PR addresses 2 distinct zombie scenarios caused by ddp launch.py:

1. When the main process is killed, the child processes aren't killed and continue running
2. When any of the children processes dies (e.g. OOM), the rest of the children and the parent remain running, but really are stuck

To solve these problems this PR switches from `wait` to `poll` and uses signal handlers.

The main problem with `wait()` was that it's not async, and I was having a 2nd process OOM, and the code was stuck waiting for the first process to finish which will not happen since the first process is blocking now waiting for the 2nd process - a sort of deadlock. My 2nd card is smaller than the first one, so it occasionally OOMs.

Using `asyncio` would probably be the cleanest solution, but as it's relatively new in python, perhaps polling is good enough.
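That switch can be sketched as follows (a hypothetical simplification of the `launch.py` loop, not the PR's exact code): `poll()` notices *any* child dying rather than blocking on one, and a signal handler takes the siblings down.

```python
import signal
import subprocess
import sys
import time

# Two stand-in workers; in launch.py these are the training processes.
processes = [subprocess.Popen([sys.executable, "-c", "print('worker')"])
             for _ in range(2)]
returncodes = []

def sigkill_handler(signum, frame):
    # Parent got SIGINT/SIGTERM (or a child died): kill remaining children.
    for proc in processes:
        if proc.poll() is None:
            proc.kill()
    sys.exit(1)

signal.signal(signal.SIGINT, sigkill_handler)
signal.signal(signal.SIGTERM, sigkill_handler)

# poll() instead of wait(): we are never blocked on one child while
# another has already OOMed, which was the deadlock described above.
while processes:
    for proc in list(processes):
        ret = proc.poll()
        if ret is None:
            continue  # this child is still running
        processes.remove(proc)
        returncodes.append(ret)
        if ret != 0:
            sigkill_handler(signal.SIGTERM, None)  # take siblings down too
    time.sleep(0.1)
```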

I wrote this little script to reproduce 2 problematic scenarios and a normal running setup, it does 3 different things according to the `--mode` arg

- `oom` - causes the 2nd process to exit prematurely emulating OOM
- `clean-finish` - just exit normally in both processes
- `False` (lack of arg) just keep on running - emulating multiple normally running processes

```
# oom.py
import argparse
from time import sleep
import sys

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", default=0, type=int)
    parser.add_argument("--mode", default="", type=str)
    args, _ = parser.parse_known_args()

    print(f"{args.local_rank} is starting")
    sleep(3)

    if args.mode == "oom":
        # emulate OOM in 2nd card
        if args.local_rank == 1:
            raise RuntimeError("OOM")

    if args.mode == "clean-finish":
        sleep(1)
        print(f"{args.local_rank} is cleanly finishing")
        sys.exit(0)

    while (True):
        # emulate long running process
        print(f"{args.local_rank} is running")
        sleep(1)

if __name__ == "__main__":
    main()
```

Let's begin:

###  1. Normal execution

```
python -m torch.distributed.launch --nproc_per_node=2 ./oom.py --mode=clean-finish
```

All the processes exit upon completion - I won't bother pasting the log here - just testing that my code didn't break the normal running

### 2. OOM

```
python -m torch.distributed.launch --nproc_per_node=2 ./oom.py --mode=oom
```

```
POLLING FOR 17547
POLLING FOR 17548
0
0 is starting
1
1 is starting
POLLING FOR 17547
POLLING FOR 17548
POLLING FOR 17548
POLLING FOR 17547
POLLING FOR 17547
POLLING FOR 17548
0 is running
Traceback (most recent call last):
  File "./oom.py", line 33, in <module>
    main()
  File "./oom.py", line 20, in main
    raise RuntimeError("OOM")
RuntimeError: OOM
POLLING FOR 17548
process 17548 is no more
Killing subprocess 17547
Killing subprocess 17548
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 341, in <module>
    main()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 327, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/stas/anaconda3/envs/main-38/bin/python', '-u', './oom.py', '--local_rank=1', '--mode=oom']' returned non-zero exit status 1.
```

All processes exited and the trace was printed

### 3. Exit on SIGINT/SIGTERM

If I started a process and then realized I made a mistake I want to be able to kill it cleanly and if any sub-processes have already been spawned I want them to be killed too. Here the sighandler takes care of trapping the SIGTERM/SIGINT.

```
python -m torch.distributed.launch --nproc_per_node=2 ./oom.py
```

Here the processes emulate a long normal run.

So let's Ctrl-C the process as soon as it started and see:

```
POLLING FOR 18749
POLLING FOR 18750
0
0 is starting
1
1 is starting
POLLING FOR 18749
POLLING FOR 18750
POLLING FOR 18750
POLLING FOR 18749
POLLING FOR 18749
POLLING FOR 18750
0 is running
1 is running
POLLING FOR 18750
POLLING FOR 18749
0 is running
1 is running
^CTraceback (most recent call last):
Killing subprocess 18749
Traceback (most recent call last):
  File "./oom.py", line 33, in <module>
  File "./oom.py", line 33, in <module>
Killing subprocess 18750
Parent got kill signal=SIGINT, exiting
```

all processes got killed

--------------------------------

So this covered the 2 problematic cases and 1 normal case

Notes:
- we could probably switch to `sleep(3)` - `1` is probably too fast
- all the debug prints will be removed once you are happy - I left them so that it's easier for you to test that my PR does the right thing.

Thank you!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49305

Reviewed By: izdeby

Differential Revision: D25565617

Pulled By: rohan-varma

fbshipit-source-id: 1ea864113f283d4daac5eef1131c8d745aae4c99
2020-12-17 20:07:59 -08:00
49f0e5dfeb Fix typing errors in torch.distributed.*, close issue #42967. (#47534)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47534

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D24952497

Pulled By: xuzhao9

fbshipit-source-id: 063bfd0707198436fcfd9431f72f9a392bc0017e
2020-11-16 23:27:59 -08:00
ccb79f3ac7 Add option to log subprocess output to files in DDP launcher. (#33193)
Summary:
Closes https://github.com/pytorch/pytorch/issues/7134. This request is to add an option to log the subprocess output (each subprocess is training a network with DDP) to a file instead of the default stdout.

The reason for this is that if we have N processes all writing to stdout, it'll be hard to decipher the output, and it would be cleaner to log these to separate files.

To support this, we add an optional argument `--logdir` to set each subprocess's stdout to a file of the format "node_rank_{}_local_rank_{}" in the logging directory. With this enabled, none of the training processes write to the parent process stdout; they write to the aforementioned file instead. If a user accidentally passes in something that's not a directory, we fall back to ignoring this argument.
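A sketch of that naming scheme (`worker_log_path` is a hypothetical helper; the file name follows the `node_0_local_rank_0` files produced in the test run below):

```python
import os

def worker_log_path(logdir: str, node_rank: int, local_rank: int) -> str:
    # Hypothetical helper: one log file per local worker inside --logdir,
    # named like the node_0_local_rank_0 files the test run produced.
    os.makedirs(logdir, exist_ok=True)
    return os.path.join(logdir, f"node_{node_rank}_local_rank_{local_rank}")

# The launcher would then pass stdout=open(worker_log_path(...), "w")
# to each subprocess instead of letting it inherit the parent's stdout.
```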

Tested by taking a training script at https://gist.github.com/rohan-varma/2ff1d6051440d2c18e96fe57904b55d9 and running `python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port="29500" --logdir test_logdir train.py`. This results in a directory `test_logdir` with files "node_0_local_rank_0" and "node_0_local_rank_1" being created with the training process stdout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/33193

Reviewed By: gchanan

Differential Revision: D24496013

Pulled By: rohan-varma

fbshipit-source-id: 1d3264cba242290d43db736073e841bbb5cb9e68
2020-10-23 11:22:57 -07:00
20ac736200 Remove py2 compatible future imports (#44735)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735

Reviewed By: mruberry

Differential Revision: D23731306

Pulled By: ezyang

fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f
2020-09-16 12:55:57 -07:00
0fc0a9308a fix autodoc for torch.distributed.launch (#40963)
Summary:
The doc for `torch.distributed.launch` is missing since v1.2.0 (see issue https://github.com/pytorch/pytorch/issues/36386) because PR https://github.com/pytorch/pytorch/issues/22501 added some imports at the first line.
542ac74987/torch/distributed/launch.py (L1-L5)
I move it below the docstring to make the autodoc in Sphinx work normally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/40963

Differential Revision: D22380816

Pulled By: mrshenli

fbshipit-source-id: ee8406785b9a198bbf3fc65e589854379179496f
2020-07-04 08:59:41 -07:00
f326045b37 Fix typos, via a Levenshtein-type corrector (#31523)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos, with https://github.com/bwignall/typochecker to help automate the checking.

Uses an updated version of the tool used in https://github.com/pytorch/pytorch/pull/30606 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31523

Differential Revision: D19216749

Pulled By: mrshenli

fbshipit-source-id: 7fd489cb9a77cd7e4950c1046f925d57524960ea
2020-01-17 16:03:19 -08:00
e7fe64f6a6 Fix typos (#30606)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30606

Differential Revision: D18763028

Pulled By: mrshenli

fbshipit-source-id: 896515a2156d062653408852e6c04b429fc5955c
2019-12-02 20:17:42 -08:00
183aa1534f Add --no_python flag (#29144)
Summary:
Allows you to use a bash script wrapper in-between launch and your
training script. e.g.
```
python -m torch.distributed.launch --nproc_per_node=8 --no_python --use_env \
    bash -c 'exec numactl --cpunodebind=$(( LOCAL_RANK / 4 )) "$@"' -- \
    python train.py ...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29144

Differential Revision: D18345647

Pulled By: pietern

fbshipit-source-id: f05849c38c82de782988d07d300e00cf9f37253a
2019-11-22 06:05:41 -08:00
d47ced49ad Adds a -m flag to pytorch.distributed.launch (#24910)
Summary:
Adds a '-m' flag to torch.distributed.launch that allows users to launch python modules using launch instead of specifying the full file path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/24910

Differential Revision: D17221653

Pulled By: pietern

fbshipit-source-id: 5c6453ed266fd121103b11caab303e3f9404227d
2019-09-06 01:13:44 -07:00
1c0309a9a9 make OMP_NUM_THREADS default in launch.py (#22501)
Summary:
Per https://github.com/pytorch/pytorch/issues/22260, by default the number of OpenMP threads spawned equals the number of available cores. For multi-process data-parallel cases, too many threads may be spawned, which can overload the CPU and cause a performance regression.

So set OMP_NUM_THREADS = number of CPU processors / number of processes by default, to neither overload nor waste CPU threads.
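The default can be sketched as follows (a simplification; `default_omp_num_threads` is a hypothetical name, and the real launcher only sets the variable when the user has not):

```python
import os

def default_omp_num_threads(num_cpus: int, nproc_per_node: int) -> int:
    # max(1, num_cpus / num_processes), matching the test-plan output below.
    return max(1, num_cpus // nproc_per_node)

# Respect an explicit user setting; otherwise apply the heuristic.
if "OMP_NUM_THREADS" not in os.environ:
    os.environ["OMP_NUM_THREADS"] = str(default_omp_num_threads(48, 2))
```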
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22501

Test Plan:
1. without and with this change, example codes result in same result
      python ~/local/fbsource-fbcode/fbcode/caffe2/torch/distributed/launch.py --nproc_per_node=2 pytorch/examples/yanlizhao/distributed_launch_example.py

  Setting OMP_NUM_THREADS environment variable for each process to be: 24, which
  is max(1, num_cpus / num_processes), you can further tune the variable for optimal performance in your application if needed.
  final loss =  tensor(0.5211, device='cuda:0', grad_fn=<MseLossBackward>)

Differential Revision: D16092225

Pulled By: zhaojuanmao

fbshipit-source-id: b792a4c27a7ffae40e4a59e96669209c6a85e27f
2019-07-23 16:14:24 -07:00
5395db22a4 Typo fixed in documentation
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22600

Differential Revision: D16156989

Pulled By: mrshenli

fbshipit-source-id: e491b083d872eaceb829028dadbab2e28ecfc785
2019-07-08 19:29:07 -07:00
f13fadd510 fix python2 corner-case in torch.distributed.launch (#20996)
Summary:
Small fix for the comment raised in 4cf76574b9 (r33134850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20996

Differential Revision: D15991510

Pulled By: pietern

fbshipit-source-id: 4e5a35864b5a4ec9402aa83a19c4a3ba0df2f01f
2019-06-27 05:19:37 -07:00
173f224570 Turn on F401: Unused import warning. (#18598)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a

Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**

This was requested by someone at Facebook; this lint is turned
on for Facebook by default.  "Sure, why not."

I had to noqa a number of imports in __init__.  Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it.  Left for future work.

Be careful!  flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments.  flake8-3 will
report an import unused; flake8-2 will not.  For now, I just
noqa'd all these sites.

All the changes were done by hand.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D14687478

fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
2019-03-30 09:01:17 -07:00
5eee0670ab Pass torch.distributed launch process local rank as environment variable instead of argument (#16360)
Summary:
In `torch.distributed.launch.py`, `local_rank` is passed as an argument, which requires the user's program to parse it. However, it would be more flexible for users, and consistent with other variables such as `RANK`, `MASTER_PORT`, and `WORLD_SIZE`, if it were passed through an environment variable.

265ed8ff45/torch/distributed/launch.py (L200-L212)
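The training-script side of the environment-variable style argued for here might look like this (a sketch; `get_local_rank` is a hypothetical helper):

```python
import os

def get_local_rank() -> int:
    # With env-var passing, the launcher exports LOCAL_RANK for each
    # worker, consistent with RANK, WORLD_SIZE and MASTER_PORT,
    # so no --local_rank argument parsing is needed.
    return int(os.environ["LOCAL_RANK"])
```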
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16360

Differential Revision: D14070372

Pulled By: ezyang

fbshipit-source-id: c3f6a8e55ab513918cad09d1326eccdedb4d98c9
2019-02-15 14:52:55 -08:00
4cf76574b9 Raise CalledProcessError when torch.distributed launch process not return 0 (#16069)
Summary:
`torch.distributed.launch.py` does not raise an error when a process started via `subprocess.Popen` returns a non-zero exit code.
For easier debugging, it should always raise an error when a launched process exits abnormally.
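A sketch of the fix (`check_child` is a hypothetical helper mirroring what the launcher does after this change):

```python
import subprocess
import sys

def check_child(cmd) -> int:
    # Wait for the child and surface a non-zero exit code as an
    # exception instead of silently returning, for easier debugging.
    proc = subprocess.Popen(cmd)
    returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode=returncode, cmd=cmd)
    return returncode
```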
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16069

Differential Revision: D13709467

Pulled By: ezyang

fbshipit-source-id: 31d32a5ec8fed7bccd62d845bfba0e670ed3fe20
2019-01-22 08:50:47 -08:00
058a31839d Warn about local_rank not being globally unique. (#12370)
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>

CC deepakn94
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12370

Differential Revision: D10220135

Pulled By: ezyang

fbshipit-source-id: 6d1a8a383951ae52753e4f75a14b8080bf02b815
2018-10-05 17:38:41 -07:00
8e33451e2e Make torch.cuda.* take device objects; Update distributed docs (#10833)
Summary:
Commits:

1. Make `torch.cuda.*` take device objects
2. Update `torch.distributed` docs to emphasize calling `torch.cuda.set_device` before `init_process_group`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10833

Differential Revision: D9514241

Pulled By: SsnL

fbshipit-source-id: 2497464305fb1e63d6c495291a5744aaa7e2696e
2018-08-27 15:24:42 -07:00
db6e4576da Use customized python interpreter (#7520) 2018-05-12 13:06:39 -04:00
4ab6ea5b1f Add unbuffered flag to distributed node launcher (#7226) 2018-05-03 11:49:06 +02:00
f5beff334b Added distributed docs on NCCL2 backend/functions and launch module (#6579) 2018-04-15 21:53:10 -04:00
37059ba0ec Added torch.distributed.launch module for easier multi-proc/node distributed job launching (#5348) 2018-03-13 12:04:38 +01:00