pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 21:49:24 +08:00

Author	SHA1	Message	Date
Xuehai Pan	995df34b19	[BE][PYFMT] migrate PYFMT for `torch.{distributed,distributions}` to `ruff format` (#144547 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547 Approved by: https://github.com/kwen2501	2025-02-28 07:35:56 +00:00
Aaron Orenstein	7c12cc7ce4	Flip default value for mypy disallow_untyped_defs [6/11] (#127843 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843 Approved by: https://github.com/oulgen ghstack dependencies: #127842	2024-06-08 18:49:29 +00:00
Xuehai Pan	67ef2683d9	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#127689 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. Resolves #126888 - #126888 This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689 Approved by: https://github.com/Skylion007	2024-06-02 12:30:43 +00:00
PyTorch MergeBot	033e733021	Revert "[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 )" This reverts commit 749a132fb0a8325cbad4734a563aa459ca611991. Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))	2024-05-31 19:47:24 +00:00
Xuehai Pan	749a132fb0	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. UPDATE: Use `FutureWarning` instead of `DeprecationWarning`. Resolves #126888 - #126888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898 Approved by: https://github.com/albanD	2024-05-29 12:09:27 +00:00
Xuehai Pan	0eab740db3	[Docs][Distributed] Add migration notes for `--local-rank` option style change for `torchrun` in PyTorch 2.0 (#109480 ) Fixes https://github.com/pytorch/pytorch/pull/94505#issuecomment-1722777767 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109480 Approved by: https://github.com/ezyang	2024-04-16 05:51:57 +00:00
Chirag Pandya	b6201a60c5	[BE] minor logging cleanup in distributed (#122921 ) Summary: Minor logging cleanup in distributed library 1. Don't use "f" formatted strings - address linter issues. 2. Nits: Make use of unused `e` (error) in a few logs. 3. Change info->debug as asked in issue #113545 4. Nit: rename log -> logger in a few files for consistency 5. Fix a linter error. Test Plan: 1. Local build passes. 2. Linter is happy. Reviewers: wanchaol Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921 Approved by: https://github.com/wanchaol	2024-03-29 03:34:01 +00:00
ooooo	a8097ed479	Fix docstring errors in _composable_state.py, remote_device.py, value_ranges.py, utils.py, run.py, rendezvous.py, launch.py, argparse_util.py, __init__.py, _cycles.py (#112953 ) Fixes #112639 ```txt torch/utils/_sympy/value_ranges.py torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`: D101: Missing docstring in public class torch/utils/_sympy/value_ranges.py:68 in public method `__init__`: D107: Missing docstring in __init__ torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`: D105: Missing docstring in magic method torch/utils/_sympy/value_ranges.py:86 in public method `tighten`: D400: First line should end with a period (not 'n') torch/utils/_sympy/value_ranges.py:90 in public method `__and__`: D105: Missing docstring in magic method torch/utils/_sympy/value_ranges.py:103 in public method `__or__`: D105: Missing docstring in magic method torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:118 in public method `unknown`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:122 in public method `wrap`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:129 in public method `increasing_map`: D400: First line should end with a period (not ')') torch/utils/_sympy/value_ranges.py:135 in public method `decreasing_map`: D400: First line should end with a period (not ')') torch/utils/_sympy/value_ranges.py:141 in public method `monotone_map`: D400: First line should end with a period (not 'g') torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`: D400: First line should end with a period (not '0') torch/utils/_sympy/value_ranges.py:149 in public method `convex_min_zero_map`: D403: First word of the first line should be properly capitalized ('Fn', not 'fn') torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`: D205: 1 blank line required between summary line and description (found 0) torch/utils/_sympy/value_ranges.py:158 in public method `coordinatewise_increasing_map`: D400: First line should end with a period (not ':') torch/utils/_sympy/value_ranges.py:171 in public method `coordinatewise_monotone_map`: D400: First line should end with a period (not 'e') torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`: D205: 1 blank line required between summary line and description (found 0) torch/utils/_sympy/value_ranges.py:180 in private class `SymPyValueRangeAnalysis`: D400: First line should end with a period (not 's') torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`: D210: No whitespaces allowed surrounding docstring text torch/utils/_sympy/value_ranges.py:386 in private method `reciprocal`: D400: First line should end with a period (not 'n') torch/utils/_sympy/value_ranges.py:488 in public class `ValueRangeAnalysis`: D101: Missing docstring in public class torch/utils/_sympy/value_ranges.py:489 in public method `__init__`: D107: Missing docstring in __init__ torch/utils/_sympy/value_ranges.py:501 in public method `bool_handler`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:506 in public method `default_handler`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:511 in public method `load`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:514 in public method `store`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:517 in public method `reduction`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:520 in public method `index_expr`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:525 in public method `to_dtype`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:558 in public method `square`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:562 in public method `neg`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:566 in public method `truncdiv`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:577 in public method `sub`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:580 in public method `__getattr__`: D105: Missing docstring in magic method torch/utils/_sympy/value_ranges.py:585 in public function `bound_sympy`: D103: Missing docstring in public function 36 torch/utils/_sympy/value_ranges.py:60 in public class `ValueRanges`: D101: Missing docstring in public class torch/utils/_sympy/value_ranges.py:68 in public method `__init__`: D107: Missing docstring in __init__ torch/utils/_sympy/value_ranges.py:81 in public method `__contains__`: D105: Missing docstring in magic method torch/utils/_sympy/value_ranges.py:86 in public method `tighten`: D400: First line should end with a period (not 'n') torch/utils/_sympy/value_ranges.py:90 in public method `__and__`: D105: Missing docstring in magic method torch/utils/_sympy/value_ranges.py:103 in public method `__or__`: D105: Missing docstring in magic method torch/utils/_sympy/value_ranges.py:113 in public method `is_singleton`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:118 in public method `unknown`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:122 in public method `wrap`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`: D205: 1 blank line required between summary line and description (found 0) torch/utils/_sympy/value_ranges.py:182 in private class `SymPyValueRangeAnalysis`: D400: First line should end with a period (not 's') torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`: D210: No whitespaces allowed surrounding docstring text torch/utils/_sympy/value_ranges.py:388 in private method `reciprocal`: D400: First line should end with a period (not 'n') torch/utils/_sympy/value_ranges.py:490 in public class `ValueRangeAnalysis`: D101: Missing docstring in public class torch/utils/_sympy/value_ranges.py:491 in public method `__init__`: D107: Missing docstring in __init__ torch/utils/_sympy/value_ranges.py:503 in public method `bool_handler`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:508 in public method `default_handler`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:513 in public method `load`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:516 in public method `store`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:519 in public method `reduction`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:522 in public method `index_expr`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:527 in public method `to_dtype`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:560 in public method `square`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:564 in public method `neg`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:568 in public method `truncdiv`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:579 in public method `sub`: D102: Missing docstring in public method torch/utils/_sympy/value_ranges.py:582 in public method `__getattr__`: D105: Missing docstring in magic method torch/utils/_sympy/value_ranges.py:587 in public function `bound_sympy`: D103: Missing docstring in public function 28 torch/utils/viz/_cycles.py torch/utils/viz/_cycles.py:14 in public function `observe_garbage`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:207 in public function `object_annotation`: D205: 1 blank line required between summary line and description (found 0) torch/utils/viz/_cycles.py:207 in public function `object_annotation`: D400: First line should end with a period (not 'g') torch/utils/viz/_cycles.py:256 in public class `Node`: D101: Missing docstring in public class torch/utils/viz/_cycles.py:262 in public function `create_graph`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:308 in public function `escape`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:335 in public function `to_dot`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:406 in public function `to_html`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`: D205: 1 blank line required between summary line and description (found 0) torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`: D400: First line should end with a period (not 'p') torch/utils/viz/_cycles.py:429 in public function `warn_tensor_cycles`: D401: First line should be in imperative mood; try rephrasing (found 'Reference') 14 torch/utils/viz/_cycles.py:14 in public function `observe_garbage`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:256 in public class `Node`: D101: Missing docstring in public class torch/utils/viz/_cycles.py:262 in public function `create_graph`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:308 in public function `escape`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:312 in public function `is_cuda_tensor`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:315 in public function `cuda_allocation_context`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:335 in public function `to_dot`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:406 in public function `to_html`: D103: Missing docstring in public function torch/utils/viz/_cycles.py:416 in public function `observe_tensor_cycles`: D103: Missing docstring in public function 9 torch/distributed/argparse_util.py torch/distributed/argparse_util.py:1 at module level: D100: Missing docstring in public module torch/distributed/argparse_util.py:13 in public class `env`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/argparse_util.py:13 in public class `env`: D400: First line should end with a period (not 'g') torch/distributed/argparse_util.py:13 in public class `env`: D412: No blank lines allowed between a section header and its content ('Example') torch/distributed/argparse_util.py:43 in public method `__init__`: D107: Missing docstring in __init__ torch/distributed/argparse_util.py:56 in public method `__call__`: D102: Missing docstring in public method torch/distributed/argparse_util.py:61 in public class `check_env`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/argparse_util.py:61 in public class `check_env`: D400: First line should end with a period (not 's') torch/distributed/argparse_util.py:61 in public class `check_env`: D412: No blank lines allowed between a section header and its content ('Example') torch/distributed/argparse_util.py:97 in public method `__init__`: D107: Missing docstring in __init__ torch/distributed/argparse_util.py:102 in public method `__call__`: D102: Missing docstring in public method 11 torch/distributed/argparse_util.py:1 at module level: D100: Missing docstring in public module torch/distributed/argparse_util.py:43 in public method `__init__`: D107: Missing docstring in __init__ torch/distributed/argparse_util.py:56 in public method `__call__`: D102: Missing docstring in public method torch/distributed/argparse_util.py:97 in public method `__init__`: D107: Missing docstring in __init__ torch/distributed/argparse_util.py:102 in public method `__call__`: D102: Missing docstring in public method 5 torch/distributed/_composable_state.py torch/distributed/_composable_state.py:20 in private function `_get_module_state`: D202: No blank lines allowed after function docstring (found 1) torch/distributed/_composable_state.py:20 in private function `_get_module_state`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/_composable_state.py:20 in private function `_get_module_state`: D400: First line should end with a period (not '`') 3 0 torch/distributed/launch.py torch/distributed/launch.py:1 at module level: D205: 1 blank line required between summary line and description (found 0) torch/distributed/launch.py:1 at module level: D400: First line should end with a period (not 'd') torch/distributed/launch.py:156 in public function `parse_args`: D103: Missing docstring in public function torch/distributed/launch.py:171 in public function `launch`: D103: Missing docstring in public function torch/distributed/launch.py:180 in public function `main`: D103: Missing docstring in public function 5 torch/distributed/launch.py:157 in public function `parse_args`: D103: Missing docstring in public function torch/distributed/launch.py:172 in public function `launch`: D103: Missing docstring in public function torch/distributed/launch.py:181 in public function `main`: D103: Missing docstring in public function 3 torch/distributed/remote_device.py torch/distributed/remote_device.py:1 at module level: D100: Missing docstring in public module torch/distributed/remote_device.py:81 in private method `worker_name`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/remote_device.py:81 in private method `worker_name`: D401: First line should be in imperative mood (perhaps 'Return', not 'Returns') torch/distributed/remote_device.py:88 in private method `rank`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/remote_device.py:88 in private method `rank`: D401: First line should be in imperative mood (perhaps 'Return', not 'Returns') torch/distributed/remote_device.py:95 in private method `device`: D200: One-line docstring should fit on one line with quotes (found 3) torch/distributed/remote_device.py:95 in private method `device`: D401: First line should be in imperative mood (perhaps 'Return', not 'Returns') 7 torch/distributed/remote_device.py:1 at module level: D100: Missing docstring in public module torch/distributed/remote_device.py:85 in private method `rank`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/remote_device.py:85 in private method `rank`: D401: First line should be in imperative mood (perhaps 'Return', not 'Returns') 3 torch/distributed/rendezvous.py torch/distributed/rendezvous.py:1 at module level: D100: Missing docstring in public module torch/distributed/rendezvous.py:23 in public function `register_rendezvous_handler`: D401: First line should be in imperative mood (perhaps 'Register', not 'Registers') torch/distributed/rendezvous.py:88 in public function `rendezvous`: D103: Missing docstring in public function torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/rendezvous.py:147 in private function `_create_c10d_store`: D400: First line should end with a period (not 'r') 5 torch/distributed/rendezvous.py:1 at module level: D100: Missing docstring in public module torch/distributed/rendezvous.py:89 in public function `rendezvous`: D103: Missing docstring in public function 2 torch/distributed/run.py torch/distributed/run.py:9 at module level: D205: 1 blank line required between summary line and description (found 0) torch/distributed/run.py:9 at module level: D400: First line should end with a period (not '`') torch/distributed/run.py:393 in public function `get_args_parser`: D202: No blank lines allowed after function docstring (found 1) torch/distributed/run.py:393 in public function `get_args_parser`: D401: First line should be in imperative mood; try rephrasing (found 'Helper') torch/distributed/run.py:610 in public function `parse_args`: D103: Missing docstring in public function torch/distributed/run.py:615 in public function `parse_min_max_nnodes`: D103: Missing docstring in public function torch/distributed/run.py:629 in public function `determine_local_world_size`: D103: Missing docstring in public function torch/distributed/run.py:670 in public function `get_rdzv_endpoint`: D103: Missing docstring in public function torch/distributed/run.py:677 in public function `get_use_env`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/run.py:677 in public function `get_use_env`: D401: First line should be in imperative mood (perhaps 'Retrieve', not 'Retrieves') torch/distributed/run.py:689 in public function `config_from_args`: D103: Missing docstring in public function torch/distributed/run.py:770 in public function `run_script_path`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/run.py:770 in public function `run_script_path`: D401: First line should be in imperative mood (perhaps 'Run', not 'Runs') torch/distributed/run.py:781 in public function `run`: D103: Missing docstring in public function torch/distributed/run.py:804 in public function `main`: D103: Missing docstring in public function 15 torch/distributed/run.py:611 in public function `parse_args`: D103: Missing docstring in public function torch/distributed/run.py:616 in public function `parse_min_max_nnodes`: D103: Missing docstring in public function torch/distributed/run.py:630 in public function `determine_local_world_size`: D103: Missing docstring in public function torch/distributed/run.py:671 in public function `get_rdzv_endpoint`: D103: Missing docstring in public function torch/distributed/run.py:691 in public function `config_from_args`: D103: Missing docstring in public function torch/distributed/run.py:784 in public function `run`: D103: Missing docstring in public function torch/distributed/run.py:807 in public function `main`: D103: Missing docstring in public function 7 torch/distributed/__init__.py torch/distributed/__init__.py:1 at module level: D104: Missing docstring in public package torch/distributed/__init__.py:8 in public function `is_available`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/__init__.py:8 in public function `is_available`: D400: First line should end with a period (not ',') torch/distributed/__init__.py:8 in public function `is_available`: D401: First line should be in imperative mood (perhaps 'Return', not 'Returns') 4 torch/distributed/__init__.py:1 at module level: D104: Missing docstring in public package 1 torch/distributed/utils.py:1 at module level: D100: Missing docstring in public module torch/distributed/utils.py:16 in private function `_pack_kwargs`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/utils.py:16 in private function `_pack_kwargs`: D400: First line should end with a period (not ')') torch/distributed/utils.py:47 in private function `_cast_forward_inputs`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/utils.py:88 in private function `_recursive_to`: D200: One-line docstring should fit on one line with quotes (found 3) torch/distributed/utils.py:141 in private function `_p_assert`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/utils.py:141 in private function `_p_assert`: D209: Multi-line docstring closing quotes should be on a separate line torch/distributed/utils.py:141 in private function `_p_assert`: D400: First line should end with a period (not 't') torch/distributed/utils.py:141 in private function `_p_assert`: D401: First line should be in imperative mood; try rephrasing (found 'This') torch/distributed/utils.py:275 in private function `_sync_module_states`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/utils.py:275 in private function `_sync_module_states`: D400: First line should end with a period (not 'n') torch/distributed/utils.py:275 in private function `_sync_module_states`: D401: First line should be in imperative mood (perhaps 'Sync', not 'Syncs') torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`: D205: 1 blank line required between summary line and description (found 0) torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`: D400: First line should end with a period (not 'y') torch/distributed/utils.py:300 in private function `_sync_params_and_buffers`: D401: First line should be in imperative mood (perhaps 'Synchronize', not 'Synchronizes') 15 torch/distributed/utils.py:1 at module level: D100: Missing docstring in public module 1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/112953 Approved by: https://github.com/weifengpy	2023-11-08 01:13:09 +00:00
Kazuaki Ishizaki	35fd5c548e	Fix typos under torch/distributed directory (#95638 ) This PR fixes typos in comments and messages of `.py` files under torch/distributed directory Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638 Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980	2023-03-27 21:13:44 +00:00
Xuehai Pan	a229b4526f	[BE] Prefer dash over underscore in command-line options (#94505 ) Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility. Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). The dashes are more common in other command-line tools. And it looks to be the default choice in the Python standard library: `argparse.BooleanOptionalAction`: `4a9dff0e5a/Lib/argparse.py (L893-L895)` ```python class BooleanOptionalAction(Action): def __init__(...): if option_string.startswith('--'): option_string = '--no-' + option_string[2:] _option_strings.append(option_string) ``` It adds `--no-argname`, not `--no_argname`. Also typing `_` need to press the shift or the caps-lock key than `-`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505 Approved by: https://github.com/ezyang, https://github.com/seemethere	2023-02-09 20:16:49 +00:00
joncrall	4618371da5	Integrate xdoctest - Rebased (#82797 ) This is a new version of #15648 based on the latest master branch. Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR. In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.) Fixes https://github.com/pytorch/pytorch/issues/71105 @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797 Approved by: https://github.com/ezyang	2022-08-12 02:08:01 +00:00
Can Balioglu	65e6194aeb	Introduce the torchrun entrypoint (#64049 ) Summary: This PR introduces a new `torchrun` entrypoint that simply "points" to `python -m torch.distributed.run`. It is shorter and less error-prone to type and gives a nicer syntax than a rather cryptic `python -m ...` command line. Along with the new entrypoint the documentation is also updated and places where `torch.distributed.run` are mentioned are replaced with `torchrun`. cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse agolynski SciPioneer H-Huang mrzzd cbalioglu gcramer23 Pull Request resolved: https://github.com/pytorch/pytorch/pull/64049 Reviewed By: cbalioglu Differential Revision: D30584041 Pulled By: kiukchung fbshipit-source-id: d99db3b5d12e7bf9676bab70e680d4b88031ae2d	2021-08-26 20:17:48 -07:00
Kiuk Chung	9d95d48567	(torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (#63910 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63910 Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port` as such: ``` $ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py ``` An `Address in use` error is raised since the training script tries to create a TCPStore on port 6000, which is already taken since the elastic agent is already running a TCPStore on that port. For details see: https://github.com/pytorch/pytorch/issues/63874. This change does a couple of things: 1. Adds `is_torchelastic_launched()` check function that users can use in the training scripts to see whether the script is launched via torchelastic. 1. Update the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function. 1. Makes `init_method=tcp://` torchelastic compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous, it is the old rendezvous module which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()` and if so, only create TCPStore clients (no daemons, not even for rank 0). 1. Adds a bunch of unittests to cover the different code paths NOTE: the issue mentions that we should fail-fast with an assertion on `init_method!=env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in pytorch: env://, tcp://, file://. Since this diff makes tcp:// compatible with torchelastic and I've validated that file is compatible with torchelastic. There is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future but this is out of scope for this issue. Test Plan: Unittests. Reviewed By: cbalioglu Differential Revision: D30529984 fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5	2021-08-25 22:57:43 -07:00
Aliaksandr Ivanou	13658b10bb	[torch] Various improvements to `torch.distributed.launch` and `torch.distributed.run` (#61294 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61294 Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925 * Make `torch.distributed.launch` restarts to 0 * Remove unnecessary `-use_env` warning, move `-use_env` warnings * Move `-use_env` warnings to `torch.distributed.launch` * Make default log level WARNING * Add new doc section around transitioning to `torch.distributed.run` * Make `torch.distributed.launch` not use error-propagation * Set default events handler to `null` that does not print events to console * Add reference from `torch.distributed.launch` to `torch.distributed.run` * Set correct preexec function that sends SIGTERM to child processes when parent dies Issues resolved: https://github.com/pytorch/pytorch/issues/60716 https://github.com/pytorch/pytorch/issues/60754 Test Plan: sandcastle python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts python -m torch.distributed.launch --nproc_per_node=4 --use_env --no_python main.py -> produces error python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py -> no warning python -m torch.distributed.launch --nproc_per_node=4 --no_python main.py ->warning Output of running torch.distributed.launch without --use_env: $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torch.distributed.run. Note that --use_env is set by default in torch.distributed.run. If your script expects `--local_rank` argument to be set, please change it to read from `os.environ('LOCAL_RANK')` instead. New section: {F628923078} {F628974089} Reviewed By: cbalioglu Differential Revision: D29559553 fbshipit-source-id: 03ed9ba638bf154354e1530ffc964688431edf6b	2021-07-08 16:28:06 -07:00
Vitaly Fedyunin	ccfdb30644	Revert D29413019: [torch] Various improvements to `torch.distributed.launch` and `torch.distributed.run` Test Plan: revert-hammer Differential Revision: D29413019 (`4e181dfc35`) Original commit changeset: 323bfbad9d0e fbshipit-source-id: 1f8ae4b3d0a23f3eaff28c37e9148efff25fafe2	2021-07-01 08:44:51 -07:00
Aliaksandr Ivanou	4e181dfc35	[torch] Various improvements to `torch.distributed.launch` and `torch.distributed.run` (#60925 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925 * Make `torch.distributed.launch` restarts to 0 * Remove unnecessary `-use_env` warning, move `-use_env` warnings * Move `-use_env` warnings to `torch.distributed.launch` * Make default log level WARNING * Add new doc section around transitioning to `torch.distributed.run` * Make `torch.distributed.launch` not use error-propagation * Set default events handler to `null` that does not print events to console * Add reference from `torch.distributed.launch` to `torch.distributed.run` * Set correct preexec function that sends SIGTERM to child processes when parent dies Issues resolved: https://github.com/pytorch/pytorch/issues/60716 https://github.com/pytorch/pytorch/issues/60754 Test Plan: sandcastle python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts python -m torch.distributed.launch --nproc_per_node=4 --use_env --no_python main.py -> produces error python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py -> no warning python -m torch.distributed.launch --nproc_per_node=4 --no_python main.py ->warning Output of running torch.distributed.launch without --use_env: $path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torch.distributed.run. Note that --use_env is set by default in torch.distributed.run. If your script expects `--local_rank` argument to be set, please change it to read from `os.environ('LOCAL_RANK')` instead. New section: {F628923078} {F628974089} Reviewed By: kiukchung, cbalioglu Differential Revision: D29413019 fbshipit-source-id: 323bfbad9d0e4aba3b10ddd7a243ca6e48169630	2021-06-30 23:31:02 -07:00
Howard Huang	c3745dc580	Small change for torch.distributed launcher (#59152 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59152 Small change for https://fb.workplace.com/groups/319878845696681 Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D28773682 Pulled By: H-Huang fbshipit-source-id: acf82273e8622b7ffd3088d8d766bdf49273754c	2021-06-02 15:05:41 -07:00
Aliaksandr Ivanou	8a949f9e51	[23/n][torch/elastic][upstream] Rename torch.distributed.elastic_launch to torch.distributed.run (#56831 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56831 Rename torch.distributed.elastic_launch to torch.distributed.run Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/... buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test/... flow-cli canary pytorch.elastic.examples.classy_vision.main --entitlement gpu_prod --run-as-secure-group oncall_dai_pet --buck-target //fblearner/flow/projects/pytorch/elastic/examples:workflow Reviewed By: kiukchung Differential Revision: D27921159 fbshipit-source-id: cc7f2f035223b2d4abd7373af298998887e14c12	2021-04-29 11:06:20 -07:00
Aliaksandr Ivanou	a6940aae37	[19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher (#56214 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56214 Pull Request resolved: https://github.com/pytorch/pytorch/pull/56037 The diff introduces new `torch.distributed.elastic_launch` and removes internals of `torch.distributed.launch` keeping backwards compatibility. Since torchelastic and torch.launch are not fully compatible due to `--use_env` arg, the `torch.distributed.launch` deprecation is going to be iterative: as part of pytorch 1.9 we are going to deprecate it, and in the following releases we will remove `torch.distributed.launch` The diff leaves `torchelastic.distributed.launch` module, and the follow up diffs will migrate the users form `torchelastic.distributed.launch` to `torch.distributed.elastic_launch` Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/... Reviewed By: H-Huang Differential Revision: D27805799 fbshipit-source-id: 599a4c0592fbc7a1bc1953040626dd6b72bac907	2021-04-16 13:38:23 -07:00
Vitaly Fedyunin	90e103ddfe	Revert D27753803: [19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher Test Plan: revert-hammer Differential Revision: D27753803 (`7c708ef4ea`) Original commit changeset: 5f24bcfdcb70 fbshipit-source-id: 650e229b788d046450615364e5cba65065a95e3b	2021-04-15 15:03:14 -07:00
Aliaksandr Ivanou	7c708ef4ea	[19/n][torch/elastic][upstream] Replace pytorch.distributed.launch with torchelastic launcher (#56037 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56037 The diff introduces new `torch.distributed.elastic_launch` and removes internals of `torch.distributed.launch` keeping backwards compatibility. Since torchelastic and torch.launch are not fully compatible due to `--use_env` arg, the `torch.distributed.launch` deprecation is going to be iterative: as part of pytorch 1.9 we are going to deprecate it, and in the following releases we will remove `torch.distributed.launch` The diff leaves `torchelastic.distributed.launch` module, and the follow up diffs will migrate the users form `torchelastic.distributed.launch` to `torch.distributed.elastic_launch` Test Plan: buck test mode/dev-nosan //pytorch/elastic/torchelastic/distributed/... Reviewed By: cbalioglu Differential Revision: D27753803 fbshipit-source-id: 5f24bcfdcb70356f0787b11f6cb9479f3515fb47	2021-04-15 11:09:12 -07:00
Alex Henrie	2c4b6ec457	Unused exception variables (#50181 ) Summary: These unused variables were identified by [pyflakes](https://pypi.org/project/pyflakes/). They can be safely removed to simplify the code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/50181 Reviewed By: gchanan Differential Revision: D25844270 fbshipit-source-id: 0e648ffe8c6db6daf56788a13ba89806923cbb76	2021-01-08 13:33:18 -08:00
Stas Bekman	39d3578e91	[ddp launch] solve zombie problem (#49305 ) Summary: I was exhausted with needing to hunt down zombies when working with ddp launcher, so this PR solves the various zombie issues. This PR addresses 2 distinct zombie scenarios caused by ddp launch.py: 1. When the main process is killed, the child processes aren't killed and continue running 2. When any of the children processes dies (e.g. OOM), the rest of the children and the parent remain running, but really are stuck To solve these problems this PR switches from `wait` to `poll` and uses signal handlers. The main problem with `wait()` was that it's not async, and I was having a 2nd process OOM, and the code was stuck waiting for the first process to finish which will not happen since the first process is blocking now waiting for the 2nd process - a sort of deadlock. My 2nd card is smaller than the first one, so it occasionally OOMs. Using `asyncio` would probably be the cleanest solution, but as it's relatively new in python, perhaps polling is good enough. I wrote this little script to reproduce 2 problematic scenarios and a normal running setup, it does 3 different things according to the `--mode` arg - `oom` - causes the 2nd process to exit prematurely emulating OOM - `clean-finish` - just exit normally in both processes - `False` (lack of arg) just keep on running - emulating multiple normally running processes ``` # oom.py import argparse from time import sleep import sys def main(): parser = argparse.ArgumentParser() parser.add_argument("--local_rank", default=False, type=int) parser.add_argument("--mode", default=False, type=str) args, _ = parser.parse_known_args() print(f"{args.local_rank} is starting") sleep(3) if args.mode == "oom": # emulate OOM in 2nd card if args.local_rank == 1: raise RuntimeError("OOM") if args.mode == "clean-finish": sleep(1) print(f"{args.local_rank} is cleanly finishing") sys.exit(0) while (True): # emulate long running process print(f"{args.local_rank} is running") sleep(1) if __name__ == "__main__": main() ``` Let's begin: ### 1. Normal execution ``` python -m torch.distributed.launch --nproc_per_node=2 ./oom.py --mode=clean-finish ``` All the processes exit upon completion - I won't bother pasting the log here - just testing that my code didn't break the normal running ### 2. OOM ``` python -m torch.distributed.launch --nproc_per_node=2 ./oom.py --mode=oom ``` ``` POLLING FOR 17547 POLLING FOR 17548 0 0 is starting 1 1 is starting POLLING FOR 17547 POLLING FOR 17548 POLLING FOR 17548 POLLING FOR 17547 POLLING FOR 17547 POLLING FOR 17548 0 is running Traceback (most recent call last): File "./oom.py", line 33, in <module> main() File "./oom.py", line 20, in main raise RuntimeError("OOM") RuntimeError: OOM POLLING FOR 17548 process 17548 is no more Killing subprocess 17547 Killing subprocess 17548 Traceback (most recent call last): File "/home/stas/anaconda3/envs/main-38/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/home/stas/anaconda3/envs/main-38/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 341, in <module> main() File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 327, in main sigkill_handler(signal.SIGTERM, None) # not coming back File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd) subprocess.CalledProcessError: Command '['/home/stas/anaconda3/envs/main-38/bin/python', '-u', './oom.py', '--local_rank=1', '--mode=oom']' returned non-zero exit status 1. ``` All processes exited and the trace was printed ### 3. Exit on SIGINT/SIGTERM If I started a process and then realized I made a mistake I want to be able to kill it cleanly and if any sub-processes have already been spawned I want them to be killed too. Here the sighandler takes care of trapping the SIGTERM/SIGINT. ``` python -m torch.distributed.launch --nproc_per_node=2 ./oom.py ``` Here the processes emulate a long normal run. So let's Ctrl-C the process as soon as it started and see: ``` POLLING FOR 18749 POLLING FOR 18750 0 0 is starting 1 1 is starting POLLING FOR 18749 POLLING FOR 18750 POLLING FOR 18750 POLLING FOR 18749 POLLING FOR 18749 POLLING FOR 18750 0 is running 1 is running POLLING FOR 18750 POLLING FOR 18749 0 is running 1 is running ^CTraceback (most recent call last): Killing subprocess 18749 Traceback (most recent call last): File "./oom.py", line 33, in <module> File "./oom.py", line 33, in <module> Killing subprocess 18750 Parent got kill signal=SIGINT, exiting ``` all processes got killed -------------------------------- So this covered the 2 problematic cases and 1 normal case Notes: - we could probably switch to `sleep(3)` - `1` is probably too fast - all the debug prints will be removed once you are happy - I left them so that it's easier for you to test that my PR does the right thing. Thank you! Pull Request resolved: https://github.com/pytorch/pytorch/pull/49305 Reviewed By: izdeby Differential Revision: D25565617 Pulled By: rohan-varma fbshipit-source-id: 1ea864113f283d4daac5eef1131c8d745aae4c99	2020-12-17 20:07:59 -08:00
Xu Zhao	49f0e5dfeb	Fix typing errors in torch.distributed.*, close issue #42967 . (#47534 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47534 Test Plan: Imported from OSS Reviewed By: walterddr Differential Revision: D24952497 Pulled By: xuzhao9 fbshipit-source-id: 063bfd0707198436fcfd9431f72f9a392bc0017e	2020-11-16 23:27:59 -08:00
Rohan Varma	ccb79f3ac7	Add option to log subprocess output to files in DDP launcher. (#33193 ) Summary: Closes https://github.com/pytorch/pytorch/issues/7134. This request is to add an option to log the subprocess output (each subprocess is training a network with DDP) to a file instead of the default stdout. The reason for this is that if we have N processes all writing to stdout, it'll be hard to decipher the output, and it would be cleaner to log these to separate files. To support this, we add an optional argument `--logdir` set the subprocess stdout to be the a file of the format "node_rank_{}_local_rank_{}" in the logging directory. With this enabled, none of the training processes output to the parent process stdout, and instead write to the aformentioned file. If a user accidently passes in something that's not a directory, we fallback to ignoring this argument. Tested by taking a training script at https://gist.github.com/rohan-varma/2ff1d6051440d2c18e96fe57904b55d9 and running `python -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port="29500" --logdir test_logdir train.py`. This results in a directory `test_logdir` with files "node_0_local_rank_0" and "node_0_local_rank_1" being created with the training process stdout. Pull Request resolved: https://github.com/pytorch/pytorch/pull/33193 Reviewed By: gchanan Differential Revision: D24496013 Pulled By: rohan-varma fbshipit-source-id: 1d3264cba242290d43db736073e841bbb5cb9e68	2020-10-23 11:22:57 -07:00
Xiang Gao	20ac736200	Remove py2 compatible future imports (#44735 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44735 Reviewed By: mruberry Differential Revision: D23731306 Pulled By: ezyang fbshipit-source-id: 0ba009a99e475ddbe22981be8ac636f8a1c8b02f	2020-09-16 12:55:57 -07:00
Xin Yao	0fc0a9308a	fix autodoc for torch.distributed.launch (#40963 ) Summary: The doc for `torch.distributed.launch` is missing since v1.2.0 (see issue https://github.com/pytorch/pytorch/issues/36386) because PR https://github.com/pytorch/pytorch/issues/22501 added some imports at the first line. `542ac74987/torch/distributed/launch.py (L1-L5)` I move it below the docstring to make the autodoc in Sphinx work normally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/40963 Differential Revision: D22380816 Pulled By: mrshenli fbshipit-source-id: ee8406785b9a198bbf3fc65e589854379179496f	2020-07-04 08:59:41 -07:00
Brian Wignall	f326045b37	Fix typos, via a Levenshtein-type corrector (#31523 ) Summary: Should be non-semantic. Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos, with https://github.com/bwignall/typochecker to help automate the checking. Uses an updated version of the tool used in https://github.com/pytorch/pytorch/pull/30606 . Pull Request resolved: https://github.com/pytorch/pytorch/pull/31523 Differential Revision: D19216749 Pulled By: mrshenli fbshipit-source-id: 7fd489cb9a77cd7e4950c1046f925d57524960ea	2020-01-17 16:03:19 -08:00
Brian Wignall	e7fe64f6a6	Fix typos (#30606 ) Summary: Should be non-semantic. Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos. Pull Request resolved: https://github.com/pytorch/pytorch/pull/30606 Differential Revision: D18763028 Pulled By: mrshenli fbshipit-source-id: 896515a2156d062653408852e6c04b429fc5955c	2019-12-02 20:17:42 -08:00
Luke Yeager	183aa1534f	Add --no_python flag (#29144 ) Summary: Allows you to use a bash script wrapper in-between launch and your training script. e.g. ``` python -m torch.distributed.launch --nproc_per_node=8 --no_python --use_env \ bash -c 'exec numactl --cpunodebind=$(( LOCAL_RANK / 4 )) "$@"' -- \ python train.py ... ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/29144 Differential Revision: D18345647 Pulled By: pietern fbshipit-source-id: f05849c38c82de782988d07d300e00cf9f37253a	2019-11-22 06:05:41 -08:00
Cameron Long	d47ced49ad	Adds a -m flag to pytorch.distributed.launch (#24910 ) Summary: Adds a '-m' flag to torch.distributed.launch that allows users to launch python modules using launch instead of specifying the full file path. Pull Request resolved: https://github.com/pytorch/pytorch/pull/24910 Differential Revision: D17221653 Pulled By: pietern fbshipit-source-id: 5c6453ed266fd121103b11caab303e3f9404227d	2019-09-06 01:13:44 -07:00
Yanli Zhao	1c0309a9a9	make OMP_NUM_THREADS default in launch.py (#22501 ) Summary: per https://github.com/pytorch/pytorch/issues/22260, default number of open mp threads are spawned to be the same of number of cores available, for multi processing data parallel cases, too many threads may be spawned and could overload the CPU, resulting in performance regression. so set OMP_NUM_THREADS = number of CPU processors/number of processes in default to neither overload or waste CPU threads Pull Request resolved: https://github.com/pytorch/pytorch/pull/22501 Test Plan: 1. without and with this change, example codes result in same result python ~/local/fbsource-fbcode/fbcode/caffe2/torch/distributed/launch.py --nproc_per_node=2 pytorch/examples/yanlizhao/distributed_launch_example.py Setting OMP_NUM_THREADS environment variable for each process to be: 24, which is max(1, num_cpus / num_processes), you can further tune the variable for optimal performance in your application if needed. final loss = tensor(0.5211, device='cuda:0', grad_fn=<MseLossBackward>) Differential Revision: D16092225 Pulled By: zhaojuanmao fbshipit-source-id: b792a4c27a7ffae40e4a59e96669209c6a85e27f	2019-07-23 16:14:24 -07:00
Shoaib Ahmed Siddiqui	5395db22a4	Typo fixed in documentation Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22600 Differential Revision: D16156989 Pulled By: mrshenli fbshipit-source-id: e491b083d872eaceb829028dadbab2e28ecfc785	2019-07-08 19:29:07 -07:00
Soumith Chintala	f13fadd510	fix python2 corner-case in torch.distributed.launch (#20996 ) Summary: Small fix for the comment raised in `4cf76574b9 (r33134850)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/20996 Differential Revision: D15991510 Pulled By: pietern fbshipit-source-id: 4e5a35864b5a4ec9402aa83a19c4a3ba0df2f01f	2019-06-27 05:19:37 -07:00
Edward Yang	173f224570	Turn on F401: Unused import warning. (#18598 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598 ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a Stack from [ghstack](https://github.com/ezyang/ghstack): * #18598 Turn on F401: Unused import warning. This was requested by someone at Facebook; this lint is turned on for Facebook by default. "Sure, why not." I had to noqa a number of imports in __init__. Hypothetically we're supposed to use __all__ in this case, but I was too lazy to fix it. Left for future work. Be careful! flake8-2 and flake8-3 behave differently with respect to import resolution for # type: comments. flake8-3 will report an import unused; flake8-2 will not. For now, I just noqa'd all these sites. All the changes were done by hand. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Differential Revision: D14687478 fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3	2019-03-30 09:01:17 -07:00
Andy Wei	5eee0670ab	Pass torch.distributed launch process local rank as environment variable instead of argument (#16360 ) Summary: In `torch.distributed.launch.py`, it passes `local_rank` as argument and requires user's program to parse it. However, it would be more flexible for users and consistent with other variables, e.g. `RANK`, `MASTER_PORT`, `WORLD_SIZE`, if passing through environment variables. `265ed8ff45/torch/distributed/launch.py (L200-L212)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/16360 Differential Revision: D14070372 Pulled By: ezyang fbshipit-source-id: c3f6a8e55ab513918cad09d1326eccdedb4d98c9	2019-02-15 14:52:55 -08:00
Andy Wei	4cf76574b9	Raise CalledProcessError when torch.distributed launch process not return 0 (#16069 ) Summary: `torch.distributed.launch.py` will not raise error when `subprocess.Popen` is not return 0. For better debugging it should always raise an error if processes launched have unusual behavior Pull Request resolved: https://github.com/pytorch/pytorch/pull/16069 Differential Revision: D13709467 Pulled By: ezyang fbshipit-source-id: 31d32a5ec8fed7bccd62d845bfba0e670ed3fe20	2019-01-22 08:50:47 -08:00
Edward Yang	058a31839d	Warn about local_rank not being globally unique. (#12370 ) Summary: Signed-off-by: Edward Z. Yang <ezyang@fb.com> CC deepakn94 Pull Request resolved: https://github.com/pytorch/pytorch/pull/12370 Differential Revision: D10220135 Pulled By: ezyang fbshipit-source-id: 6d1a8a383951ae52753e4f75a14b8080bf02b815	2018-10-05 17:38:41 -07:00
Tongzhou Wang	8e33451e2e	Make torch.cuda.* take device objects; Update distributed docs (#10833 ) Summary: Commits: 1. Make `torch.cuda.*` take device objects 2. Update `torch.distributed` docs to emphasize calling `torch.cuda.set_device` before `init_process_group` Pull Request resolved: https://github.com/pytorch/pytorch/pull/10833 Differential Revision: D9514241 Pulled By: SsnL fbshipit-source-id: 2497464305fb1e63d6c495291a5744aaa7e2696e	2018-08-27 15:24:42 -07:00
Kaiyu Shi	db6e4576da	Use customized python interpreter (#7520 )	2018-05-12 13:06:39 -04:00
Edgar Andrés Margffoy Tuay	4ab6ea5b1f	Add unbuffered flag to distributed node launcher (#7226 )	2018-05-03 11:49:06 +02:00
Teng Li	f5beff334b	Added distributed docs on NCCL2 backend/functions and launch module (#6579 )	2018-04-15 21:53:10 -04:00
Teng Li	37059ba0ec	Added torch.distributed.launch module for easier multi-proc/node distributed job launching (#5348 )	2018-03-13 12:04:38 +01:00

43 Commits