Commit Graph

372 Commits

09aefe1502 Fix ouput typos (#120870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120870
Approved by: https://github.com/clee2000
2024-02-29 08:29:14 +00:00
440a9b212d [profiler] log process group config information in distributedInfo field (#119443)
Summary: Process group configuration is essential for analyzing collective patterns. We have already added this to the Execution Trace; now expose the same information in Kineto as well.

Differential Revision: D53557965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119443
Approved by: https://github.com/kwen2501
2024-02-27 18:21:54 +00:00
1562dae62c [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
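An illustrative before/after (hypothetical names), showing what the rule rewrites:

```
keys = ["a", "b", "c"]

# before: a comprehension mapping every key to the same static value
d = {k: 0 for k in keys}

# after (RUF025): the intent is explicit and construction is faster
d = dict.fromkeys(keys, 0)
```

Note that `fromkeys` binds every key to the *same* value object, which is why the rule only fires when the value is static.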

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
2024-01-30 20:46:54 +00:00
9bce208dfb Replace follow_imports = silent with normal (#118414)
This is a lot of files changed! Don't panic! Here's how it works:

* Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file.
* When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded.
* The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors.
* Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list.
* Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves.
* torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state.
* There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many.

In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file.

The codemod was done with this script authored by GPT-4:

```
import glob

# the glob patterns from the old exclude list (elided here)
exclude_patterns = [
    ...
]

for pattern in exclude_patterns:
    for filepath in glob.glob(pattern, recursive=True):
        if filepath.endswith('.py'):
            # prepend the directive, keeping the rest of the file intact
            with open(filepath, 'r+') as f:
                content = f.read()
                f.seek(0, 0)
                f.write('# mypy: ignore-errors\n\n' + content)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414
Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD
2024-01-27 02:44:11 +00:00
1e185c7803 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-24 18:42:14 +00:00
b5799d9977 Revert "[c10d] Barrier uses stream sync instead of device sync (#117804)"
This reverts commit 0f6bbb1c070c3a9713893659377e20e147c125f6.

Reverted https://github.com/pytorch/pytorch/pull/117804 on behalf of https://github.com/clee2000 due to sorry the docs test failure is real, I think it wants the lines after the .. note to be indented https://github.com/pytorch/pytorch/actions/runs/7616827874/job/20745016788.  Marking as nosignal due to bad Dr. CI categorization ([comment](https://github.com/pytorch/pytorch/pull/117804#issuecomment-1904882487))
2024-01-22 21:54:03 +00:00
0f6bbb1c07 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-22 20:14:51 +00:00
58c4bc62bb [c10d] Deprecate Work.result() (#117565)
Work.result() returns a vector of tensors. This signature is problematic as some collectives may return just one tensor (e.g. all-reduce), while others may return multiple tensors (e.g. all-gather).

It would be clearer/easier for users to directly access the result via the tensor/tensorlist passed to the collective APIs.

Deprecating work.result() would also allow us to remove the `outputs_` field in the Work class, avoiding an "artificial" reference to the tensor, which could potentially hold up the tensor's memory.
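For example, the preferred pattern reads the result from the tensor passed to the collective (a minimal sketch; assumes an already-initialized process group, e.g. gloo):

```
import torch
import torch.distributed as dist

t = torch.ones(4)
work = dist.all_reduce(t, async_op=True)
work.wait()
# read the reduced values directly from `t` instead of work.result()
print(t)
```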

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117565
Approved by: https://github.com/wconstab
2024-01-18 01:22:37 +00:00
6de28e92d2 [BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027)
This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check.
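An illustrative rewrite (hypothetical data):

```
import operator

pairs = [(2, "b"), (1, "a")]

# before: an ad-hoc lambda
pairs.sort(key=lambda p: p[0])

# after (FURB118): the equivalent, faster operator callable
pairs.sort(key=operator.itemgetter(0))
```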

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027
Approved by: https://github.com/malfet
2023-12-20 19:35:08 +00:00
ef01e78fd9 disable test_ddp_profiling_autograd_profiler in distributed_test.py (#115704)
The test was previously disabled upstream (https://github.com/pytorch/pytorch/issues/77342) and is currently failing in NVIDIA internal CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115704
Approved by: https://github.com/soulitzer
2023-12-14 01:41:37 +00:00
23fa9621e4 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099) (#115193)
Summary:

Rename _device_mesh.py to device_mesh.py, update all callsites, and add documentation.
We created stubs for the public class and methods in torch.distributed.device_mesh so that it can be imported whether or not distributed is available (`is_available()`).

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/115099
Prior to landing, all CI signals passed. Shipit added the "ci/trunk" label to the PR but DID NOT wait for it and went ahead with committing. More context can be found in the reverted PR above.

Test Plan: CI.

Differential Revision: D51861018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193
Approved by: https://github.com/fegin
2023-12-08 08:44:32 +00:00
233ce0d24b Support GPU annotations for auto-trace jobs, similar to on-demand support (#114638)
Summary: When using auto_trace, gpu_user_annotation is not shown in the results. Fix this by including `GPU_USER_ANNOTATION` in `kCudaTypes`.

Differential Revision: D51597995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114638
Approved by: https://github.com/aaronenyeshi
2023-12-06 09:38:13 +00:00
a827ac71f2 Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)"
This reverts commit eaa64339d640ed1d36520ada379213f8361be5ff.
2023-12-05 08:59:36 -08:00
eaa64339d6 [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099)
Summary:
Rename _device_mesh.py to device_mesh.py, update all callsites, and add documentation.

Original diff reverted: D51629761
Original PR reverted: https://github.com/pytorch/pytorch/pull/114991
It was failing a public module binding test on macOS due to the change in import order in torch/distributed/fsdp/_common_utils.py. Since the original import still works, we removed the changes to this file.

Test Plan: CI.

Differential Revision: D51825114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-12-05 05:44:52 +00:00
3a2e2044cd Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710) (#114991)"
This reverts commit 729ac7317a50a6a195b324cf6cefd748bf4f5498.

Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))
2023-12-02 17:55:51 +00:00
729ac7317a [DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710) (#114991)
Summary:

Same content of changes as https://github.com/pytorch/pytorch/pull/114710

Rename _device_mesh.py to device_mesh.py, update all callsites, and add documentation.
ghstack-source-id: 208980207
exported-using-ghexport

Test Plan: CI.

Reviewed By: wanchaol

Differential Revision: D51629761

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin
2023-12-02 04:39:41 +00:00
3c78ea4c9d [DDP][Compile] Test to Ensure torch.compile works w/static_graph=True (#114621)
Resolves https://github.com/pytorch/pytorch/issues/93672. This was
actually fixed by https://github.com/pytorch/pytorch/pull/103487, but at the
time I didn't realize that PR also fixed torch.compile.

Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114621
Approved by: https://github.com/wconstab
2023-12-01 22:18:45 +00:00
9cc040fef6 Switch env variable use in test harnesses to the non-deprecated names to fix warnings (#114880)
Previously:

```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```

With this PR, those warnings disappear. They were introduced in #114077.

This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand inspected.

```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```
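In a Python test harness, each rename amounts to the following (illustrative; every variable in the sed script above follows the same pattern):

```
import os

# before: deprecated name, warns at process group construction
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"

# after: the TORCH_-prefixed replacement
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"
```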

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880
Approved by: https://github.com/kwen2501
2023-12-01 20:08:23 +00:00
7b3e45be59 [DeviceMesh] Rename get_dim_groups to get_group (#114708)
Rename get_dim_groups to get_group and update all callsites.
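A hedged sketch of a callsite after the rename (mesh shape and dim names are illustrative, and this assumes the public torch.distributed.device_mesh module from the later rename):

```
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))
pg = mesh.get_group(mesh_dim="tp")  # previously: mesh.get_dim_groups(...)
```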

Differential Revision: [D51629801](https://our.internmc.facebook.com/intern/diff/D51629801/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114708
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fegin
2023-11-30 23:40:14 +00:00
f505d76462 Bug fixes to DDP _update_process_group API. (#114194)
https://github.com/pytorch/pytorch/pull/113580 introduced the `DDP._update_process_group` API. However, the implementation did not correctly reset all of the necessary state in the reducer. In particular, if an error occurred during backward, DDP would end up in an incorrect state.

As a result, in this PR I've enhanced the unit test to cover this case and fixed the resetting of Reducer state accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114194
Approved by: https://github.com/rohan-varma
2023-11-27 23:52:40 +00:00
dc65f6c601 [c10d] Remove deprecated multi-gpu-per-thread APIs (#114156)
As of today, PyTorch Distributed's preferred programming model is one device per thread, as exemplified by the APIs in its documentation. The multi-GPU functions (which stood for multiple GPUs per CPU thread) have been deprecated for three releases. Removing them now, before the 2.2 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114156
Approved by: https://github.com/albanD, https://github.com/fduwjj, https://github.com/H-Huang
2023-11-21 03:50:23 +00:00
87aeb248c9 More random stepcurrent (#113620)
Distributed tests for different backends have the same name, so they clash under the current stepcurrent key, and some tests were not being run.

Disabled the following tests because they are failing:
- `test_ddp_has_finalized`

- `test_broadcast_object_list`
<details>

```

2023-11-14T06:44:01.0428686Z
2023-11-14T06:44:01.0430447Z distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_broadcast_object_list <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py INFO:numba.cuda.cudadrv.driver:init
2023-11-14T06:44:01.0431048Z [1699943450.893723] [99f90b6e6ff3:10028:0]     ucc_context.c:402  UCC  ERROR failed to create tl context for cuda
2023-11-14T06:44:01.0431625Z [1699943450.914385] [99f90b6e6ff3:10029:0]     ucc_context.c:402  UCC  ERROR failed to create tl context for cuda
2023-11-14T06:44:01.0432314Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.0433178Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.0434677Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.0435435Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.0436895Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.0437500Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.0438917Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.0439637Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.0441122Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 143, in wrapper
2023-11-14T06:44:01.0441873Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0443340Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 274, in wrapper
2023-11-14T06:44:01.0444077Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     ret = func(*args, **kwargs)
2023-11-14T06:44:01.0445769Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7717, in test_broadcast_object_list
2023-11-14T06:44:01.0446732Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return self._test_broadcast_object_list()
2023-11-14T06:44:01.0448433Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7683, in _test_broadcast_object_list
2023-11-14T06:44:01.0449187Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     dist.broadcast_object_list(
2023-11-14T06:44:01.0450553Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0451621Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0453161Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2650, in broadcast_object_list
2023-11-14T06:44:01.0454065Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     broadcast(object_sizes_tensor, src=src, group=group)
2023-11-14T06:44:01.0455441Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0456183Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0457775Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1947, in broadcast
2023-11-14T06:44:01.0458649Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]     work = default_pg.broadcast([tensor], opts)
2023-11-14T06:44:01.0460923Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:488] [Rank 1][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 2
2023-11-14T06:44:01.0461471Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0462430Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.0463552Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_broadcast_object_list
2023-11-14T06:44:01.0464082Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0465136Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.0465945Z [rank1]:[2023-11-14 06:30:51,405] torch.testing._internal.common_distributed: [ERROR]  exiting process 1 with exit code: 10
2023-11-14T06:44:01.0466605Z [1699943451.005633] [99f90b6e6ff3:10029:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.0467303Z [1699943451.005633] [99f90b6e6ff3:10029:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.0467972Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.0468743Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.0470233Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.0471106Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.0472581Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.0473162Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.0474581Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.0475314Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.0476776Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 143, in wrapper
2023-11-14T06:44:01.0477535Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0478993Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 274, in wrapper
2023-11-14T06:44:01.0479886Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     ret = func(*args, **kwargs)
2023-11-14T06:44:01.0481593Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7717, in test_broadcast_object_list
2023-11-14T06:44:01.0482429Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return self._test_broadcast_object_list()
2023-11-14T06:44:01.0484145Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 7683, in _test_broadcast_object_list
2023-11-14T06:44:01.0484886Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     dist.broadcast_object_list(
2023-11-14T06:44:01.0486271Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0487018Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0488559Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2650, in broadcast_object_list
2023-11-14T06:44:01.0489470Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     broadcast(object_sizes_tensor, src=src, group=group)
2023-11-14T06:44:01.0491078Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
2023-11-14T06:44:01.0491912Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.0493369Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1947, in broadcast
2023-11-14T06:44:01.0494419Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]     work = default_pg.broadcast([tensor], opts)
2023-11-14T06:44:01.0496679Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] RuntimeError: [/var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp:488] [Rank 0][ProcessGroupUCC-0][READY]failed to init cuda collective, error code -1: Operation is not supported, system error code 2
2023-11-14T06:44:01.0497211Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0498198Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.0499291Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_broadcast_object_list
2023-11-14T06:44:01.0499838Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.0500881Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.0501667Z [rank0]:[2023-11-14 06:30:51,462] torch.testing._internal.common_distributed: [ERROR]  exiting process 0 with exit code: 10
2023-11-14T06:44:01.0502343Z [1699943451.002362] [99f90b6e6ff3:10028:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.0503024Z [1699943451.002362] [99f90b6e6ff3:10028:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.0503411Z ('RERUN', {'yellow': True}) [6.1102s] [100%]
```
</details>

- `test_ddp_sync_bn_training_vs_eval`

<details>

```

2023-11-14T06:44:01.1494815Z
2023-11-14T06:44:01.1496630Z distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_sync_bn_training_vs_eval <- ../../../../opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py INFO:numba.cuda.cudadrv.driver:init
2023-11-14T06:44:01.1497290Z [1699943779.976037] [99f90b6e6ff3:10758:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.1498119Z [1699943779.976037] [99f90b6e6ff3:10758:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.1498808Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
2023-11-14T06:44:01.1499465Z [1699943779.970792] [99f90b6e6ff3:10757:0]          parser.c:2034 UCX  WARN  unused environment variables: UCX_COMMIT; UCX_HOME
2023-11-14T06:44:01.1500160Z [1699943779.970792] [99f90b6e6ff3:10757:0]          parser.c:2034 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2023-11-14T06:44:01.1500820Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
2023-11-14T06:44:01.1501556Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:320] Completed Stage: Collection
2023-11-14T06:44:01.1502239Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:320] Completed Stage: Collection
2023-11-14T06:44:01.1502952Z STAGE:2023-11-14 06:36:20 10757:10757 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
2023-11-14T06:44:01.1503678Z STAGE:2023-11-14 06:36:20 10758:10758 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
2023-11-14T06:44:01.1504350Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.1505119Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.1506729Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.1507492Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.1508992Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.1509578Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.1510994Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.1511725Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.1513193Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T06:44:01.1513962Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.1515697Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 9230, in test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1516529Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     self.assertNotEqual([], all_gather_calls)
2023-11-14T06:44:01.1518019Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3448, in assertNotEqual
2023-11-14T06:44:01.1518910Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     with self.assertRaises(AssertionError, msg=msg):
2023-11-14T06:44:01.1520177Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 226, in __exit__
2023-11-14T06:44:01.1521062Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     self._raiseFailure("{} not raised".format(exc_name))
2023-11-14T06:44:01.1522238Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
2023-11-14T06:44:01.1523099Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]     raise self.test_case.failureException(msg)
2023-11-14T06:44:01.1523923Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] AssertionError: AssertionError not raised
2023-11-14T06:44:01.1524470Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1525481Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.1526632Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1527180Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1528223Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.1529029Z [rank0]:[2023-11-14 06:36:20,668] torch.testing._internal.common_distributed: [ERROR]  exiting process 0 with exit code: 10
2023-11-14T06:44:01.1529786Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] Caught exception:
2023-11-14T06:44:01.1530576Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] Traceback (most recent call last):
2023-11-14T06:44:01.1532383Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T06:44:01.1533127Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     getattr(self, test_name)()
2023-11-14T06:44:01.1534608Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T06:44:01.1535194Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     fn()
2023-11-14T06:44:01.1536817Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T06:44:01.1537575Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     method(*args, **kwargs)
2023-11-14T06:44:01.1539036Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T06:44:01.1539800Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     return func(*args, **kwargs)
2023-11-14T06:44:01.1541531Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 9230, in test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1542388Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     self.assertNotEqual([], all_gather_calls)
2023-11-14T06:44:01.1544015Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3448, in assertNotEqual
2023-11-14T06:44:01.1544907Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     with self.assertRaises(AssertionError, msg=msg):
2023-11-14T06:44:01.1546061Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 226, in __exit__
2023-11-14T06:44:01.1546944Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     self._raiseFailure("{} not raised".format(exc_name))
2023-11-14T06:44:01.1548142Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]   File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
2023-11-14T06:44:01.1548991Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]     raise self.test_case.failureException(msg)
2023-11-14T06:44:01.1549806Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] AssertionError: AssertionError not raised
2023-11-14T06:44:01.1550350Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1551304Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] To execute this test, run the following from the base repo dir:
2023-11-14T06:44:01.1552462Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]      python test/distributed/test_distributed_spawn.py -k test_ddp_sync_bn_training_vs_eval
2023-11-14T06:44:01.1553095Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]
2023-11-14T06:44:01.1554166Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T06:44:01.1554976Z [rank1]:[2023-11-14 06:36:20,890] torch.testing._internal.common_distributed: [ERROR]  exiting process 1 with exit code: 10
2023-11-14T06:44:01.1555235Z ('RERUN', {'yellow': True}) [6.6107s] [100%]
```
</details>

- `test_backend_full_group`
<details>

```
2023-11-14T22:51:56.4502470Z FAILED [5.2125s] distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_backend_full_group - RuntimeError: Process 0 exited with error code 10 and exception:
2023-11-14T22:51:56.4502665Z Traceback (most recent call last):
2023-11-14T22:51:56.4503603Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T22:51:56.4503796Z     getattr(self, test_name)()
2023-11-14T22:51:56.4504710Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T22:51:56.4504845Z     fn()
2023-11-14T22:51:56.4505737Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T22:51:56.4505896Z     method(*args, **kwargs)
2023-11-14T22:51:56.4506823Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T22:51:56.4506992Z     return func(*args, **kwargs)
2023-11-14T22:51:56.4508285Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 882, in test_backend_full_group
2023-11-14T22:51:56.4508640Z     self._test_group_override_backend(self._init_full_group_test)
2023-11-14T22:51:56.4509798Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 852, in _test_group_override_backend
2023-11-14T22:51:56.4510104Z     group, group_id, rank = initializer(backend=new_backend)
2023-11-14T22:51:56.4510629Z UnboundLocalError: local variable 'new_backend' referenced before assignment
2023-11-14T22:51:56.4510650Z
2023-11-14T22:51:56.4510987Z To execute this test, run the following from the base repo dir:
2023-11-14T22:51:56.4511525Z      python test/distributed/test_distributed_spawn.py -k test_backend_full_group
2023-11-14T22:51:56.4511545Z
2023-11-14T22:51:56.4511970Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T22:51:56.4511989Z
2023-11-14T22:51:56.4512242Z Process 1 exited with error code 10 and exception:
2023-11-14T22:51:56.4512454Z Traceback (most recent call last):
2023-11-14T22:51:56.4513380Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 658, in run_test
2023-11-14T22:51:56.4513687Z     getattr(self, test_name)()
2023-11-14T22:51:56.4514612Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 544, in wrapper
2023-11-14T22:51:56.4514746Z     fn()
2023-11-14T22:51:56.4515633Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2536, in wrapper
2023-11-14T22:51:56.4515791Z     method(*args, **kwargs)
2023-11-14T22:51:56.4516708Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 174, in wrapper
2023-11-14T22:51:56.4516895Z     return func(*args, **kwargs)
2023-11-14T22:51:56.4518008Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 882, in test_backend_full_group
2023-11-14T22:51:56.4518352Z     self._test_group_override_backend(self._init_full_group_test)
2023-11-14T22:51:56.4519509Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/distributed_test.py", line 852, in _test_group_override_backend
2023-11-14T22:51:56.4519813Z     group, group_id, rank = initializer(backend=new_backend)
2023-11-14T22:51:56.4520334Z UnboundLocalError: local variable 'new_backend' referenced before assignment
2023-11-14T22:51:56.4520355Z
2023-11-14T22:51:56.4528843Z To execute this test, run the following from the base repo dir:
2023-11-14T22:51:56.4529492Z      python test/distributed/test_distributed_spawn.py -k test_backend_full_group
2023-11-14T22:51:56.4529681Z
2023-11-14T22:51:56.4530122Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2023-11-14T22:51:56.4530423Z !!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
```
</details>

Pretty sure the solution for this one is to add ucc in `_test_group_override_backend`:
https://ossci-raw-job-status.s3.amazonaws.com/log/18651430019
https://ossci-raw-job-status.s3.amazonaws.com/log/18651430132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113620
Approved by: https://github.com/huydhn
2023-11-15 21:56:10 +00:00
17e2313dd3 Add an API to DDP for dynamically updating the underlying process group. (#113580)
# Motivation

If we would like to reinitialize DDP with a different PG while using `torch.compile`, we need to do the following:

```
del old_ddp
del old_pg
pg = init_pg(...)
ddp = DDP(model, process_group=pg)  # rewrap the module with the new PG
model = torch.compile(ddp)          # recompiles the entire model
```

This results in recompilation of the entire model, which is very expensive. Since the only thing we need to update is the PG, we should be able to do this without having to compile the model again.

# Proposal

As a result, in this PR I've introduced an `_update_process_group` API which can dynamically update the underlying ProcessGroup used by DDP without needing to reinitialize DDP again.
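A hedged usage sketch (the module is a stand-in; `_update_process_group` is the private API added here):

```
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.nn.Linear(8, 8).cuda()  # stand-in module
ddp = DDP(model)                      # assumes an initialized process group
compiled = torch.compile(ddp)         # compile once

# later: swap the PG without reconstructing (and recompiling) DDP
new_pg = dist.new_group()
ddp._update_process_group(new_pg)
```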

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113580
Approved by: https://github.com/fduwjj
2023-11-15 09:05:02 +00:00
f2963642c2 [DDP] Add device_mesh to DDP ctor (#112761)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112761
Approved by: https://github.com/fegin
2023-11-08 03:08:08 +00:00
8219bf051b [BE]: Apply RUF015 to torch folder (#113025)
Removes unnecessary allocations of iterators. There is a small chance this may have side effects, since the entire iterator is no longer consumed, but this is a far more efficient way to retrieve the first element.
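An illustrative rewrite (hypothetical predicate):

```
xs = range(10)

# before: materializes a whole list just to take one element
first = [x for x in xs if x % 3 == 2][0]

# after (RUF015): stops at the first match
first = next(x for x in xs if x % 3 == 2)
```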

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113025
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-11-07 00:48:15 +00:00
3080fd8383 [profiler] add send/recv src/dst info (#111811)
Summary: There is an ask to add src/dst to the nccl trace. This feels like the easiest way to do it; adding it to the metadata seems to require plumbing through a few stacks, so it would be more work.

Test Plan: {F1128545195}

Reviewed By: davidberard98

Differential Revision: D50560692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111811
Approved by: https://github.com/davidberard98, https://github.com/aaronenyeshi, https://github.com/fduwjj
2023-10-28 02:48:23 +00:00
0ad91c2bfb Add an explicit _shutdown method to ProcessGroupNCCL (#111392)
Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor.

However, given how Python GC works and that code holds references to the PG in multiple places, in practice calling `destroy_process_group` doesn't actually end up invoking the destructor.

As a result, in this PR I'm adding an explicit shutdown method that users can call to clean up all resources.
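A hedged sketch of the intended teardown (the exact Python-side call path here is an assumption; the PR itself adds `_shutdown` on ProcessGroupNCCL):

```
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
pg = dist.distributed_c10d._get_default_group()
# ... run collectives ...
# explicitly stop background threads and abort communicators,
# instead of relying on the destructor running under Python GC
pg._get_backend(torch.device("cuda"))._shutdown()
```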
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111392
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fduwjj
2023-10-24 05:47:12 +00:00
b5f9696d81 Fix typo under torch directory (#110824)
This PR fixes typo `the the` of comments and exception messages in files under `torch` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110824
Approved by: https://github.com/H-Huang
2023-10-09 19:16:43 +00:00
704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've added these error types (a sketch of the resulting handler style follows the list):

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
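
With the hierarchy in place, handlers can catch structured types instead of string-matching (a sketch; assumes these types are exported from torch.distributed):

```
import torch.distributed as dist

try:
    dist.init_process_group("gloo", init_method="env://")
except dist.DistStoreError:
    ...  # store-specific handling (e.g. barrier timeouts)
except dist.DistNetworkError:
    ...  # socket-level failures (broken pipe, connection reset)
except dist.DistError:
    ...  # any other distributed error
```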

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
d4ff06ec84 Revert "Standardize on error types for distributed errors. (#107651)"
This reverts commit 0e2317479b3cb987e1f3230876654f156bd11a09.

Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))
2023-08-28 23:58:33 +00:00
0e2317479b Standardize on error types for distributed errors. (#107651)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
2023-08-28 21:58:15 +00:00
3577ae3e53 Fix TestDistBackendWithSpawn.test_backend_group and test_backend_full_group (#107231)
Fixes https://github.com/pytorch/pytorch/issues/107078 and allows tests to be run with 2 GPUs only.

testing command:
`BACKEND=gloo WORLD_SIZE=2 pytest test/distributed/test_distributed_spawn.py -vs -k test_backend_group`
`BACKEND=nccl WORLD_SIZE=2 pytest test/distributed/test_distributed_spawn.py -vs -k test_backend_full_group`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107231
Approved by: https://github.com/rohan-varma
2023-08-16 12:01:09 +00:00
c11412b4a8 [DDP] Support optim in backward after DDP init (#105995)
This allows in-backward optimizers to be configured after DDP init, in
addition to before it, as was previously supported.
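A hedged sketch of the post-init configuration (the module is a stand-in; assumes the in-backward optimizer helper in torch.distributed.optim):

```
import torch
from torch.distributed.optim import _apply_optimizer_in_backward
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.nn.Linear(8, 8)  # stand-in module; assumes an initialized PG
ddp = DDP(model)

# previously this had to happen before wrapping with DDP;
# after this change it can also run afterwards
_apply_optimizer_in_backward(
    torch.optim.SGD, ddp.parameters(), optimizer_kwargs={"lr": 0.01}
)
```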

Differential Revision: [D47783347](https://our.internmc.facebook.com/intern/diff/D47783347/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105995
Approved by: https://github.com/fegin
2023-07-29 01:36:25 +00:00
4cc1745b13 [BE] f-stringify torch/ and scripts (#105538)
This PR is a follow-up in the pyupgrade series, converting more strings to f-strings using `flynt`.

- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/

Command used:

```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```

and excluded `collect_env.py`
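
An illustrative conversion (hypothetical values):

```
rank, world_size = 0, 8

# before: flynt targets str.format and percent-formatting calls like
msg = "rank {} of {}".format(rank, world_size)
msg = "rank %s of %s" % (rank, world_size)

# after: the equivalent f-string
msg = f"rank {rank} of {world_size}"
```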

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-21 19:35:24 +00:00
be03a56955 [BE] Enable ruff's UP rules and autoformat testing/ (#105425)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105425
Approved by: https://github.com/malfet
2023-07-18 21:04:39 +00:00
572ff2779b [RESUBMIT] Ensure ncclCommAbort can abort stuck ncclCommInitRank (#103925)
https://github.com/pytorch/pytorch/pull/95715 added the functionality to abort `ncclCommInitRankConfig` by specifying `blocking=0` to enable non-blocking behavior.

However, calling the `pg._abort()` didn't recover from a stuck `ncclCommInitRankConfig` since the `_abort` method only looked through `devNCCLCommMap_` map and aborted those communicators. Since `ncclCommInitRankConfig` was stuck, the communicator itself wasn't added to the map and the host thread was stuck on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, `_abort` was a no-op.

To resolve this issue, I added the communicators to `inProgressCommMap_` as soon as they were created and then removed them once added to `devNCCLCommMap_`.

I also added a unit test that was failing without the changes to ProcessGroupNCCL.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103925
Approved by: https://github.com/osalpekar
2023-06-27 04:22:03 +00:00
f044613f78 Back out "Revert "[DDP] multiple forward support for static graph (#103487)" (#103873)" (#103938)
Differential Revision: [D46883396](https://our.internmc.facebook.com/intern/diff/D46883396/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103938
Approved by: https://github.com/awgu, https://github.com/fegin
2023-06-22 21:55:58 +00:00
f7737bb96b Revert "Ensure ncclCommAbort can abort stuck ncclCommInitRank (#103264)"
This reverts commit 03881b0c925f191ec41d6899d589ed420ac285b5.

Reverted https://github.com/pytorch/pytorch/pull/103264 on behalf of https://github.com/osalpekar due to This commits seems to have been causing failures in test_nccl_init_abort. Those failures may have been masked by pre-existing failures in the distributed jobs on trunk when running CI on this PR. Since those breaking changes are now reverted, we should be able to rebase this and get clean signal + uncover the breakages caused by this PR. ([comment](https://github.com/pytorch/pytorch/pull/103264#issuecomment-1599451197))
2023-06-20 20:29:43 +00:00
b1ddd5a293 Revert "[DDP] multiple forward support for static graph (#103487)" (#103873)
Per the discussion in https://github.com/pytorch/pytorch/pull/103629#issuecomment-1598001313, I preemptively create this revert PR to revert all commits in the stack.  This seems like a safer option than using the bot as the commit has already been in trunk since last week.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103873
Approved by: https://github.com/rohan-varma
2023-06-20 16:25:00 +00:00
03881b0c92 Ensure ncclCommAbort can abort stuck ncclCommInitRank (#103264)
https://github.com/pytorch/pytorch/pull/95715 added the functionality to abort `ncclCommInitRankConfig` by specifying `blocking=0` to enable non-blocking behavior.

However, calling the `pg._abort()` didn't recover from a stuck `ncclCommInitRankConfig` since the `_abort` method only looked through `devNCCLCommMap_` map and aborted those communicators. Since `ncclCommInitRankConfig` was stuck, the communicator itself wasn't added to the map and the host thread was stuck on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, `_abort` was a no-op.

To resolve this issue, I added the communicators to `inProgressCommMap_` as soon as they were created and then removed them once added to `devNCCLCommMap_`.

I also added a unit test that was failing without the changes to ProcessGroupNCCL.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103264
Approved by: https://github.com/kwen2501
2023-06-15 23:40:22 +00:00
80139fc2db [DDP] multiple forward support for static graph (#103487)
Adds support for multiple forward passes before the backward call for
static_graph=True.

There are 2 changes:
1) Change the accounting of when to populate static-graph-related maps
from relying on forward iterations to backward calls
2) In DDP Python code, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce; use a flag instead. A hedged usage sketch follows.
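
A hedged usage sketch of the new capability (the module is a stand-in; assumes an initialized process group):

```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.nn.Linear(8, 8)        # stand-in module
ddp = DDP(model, static_graph=True)

# multiple forwards are now allowed before the single backward
out1 = ddp(torch.randn(4, 8))
out2 = ddp(torch.randn(4, 8))
(out1.sum() + out2.sum()).backward()
```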

Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103487
Approved by: https://github.com/awgu
2023-06-14 16:14:52 +00:00
a8549357d2 Add distributed category to TORCH_LOGS (#103351)
Fix use of torch distributed testing assertLogs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103351
Approved by: https://github.com/wanchaol
2023-06-10 02:21:36 +00:00
9a2df0a5af [RFC] Add method to DDP to check for backward finalization. (#100773)
Summary: In cases where DDP backward is not finalized, the error is raised only in the next forward iteration of DDP. However, if there are other collective calls between those two points, training scripts could potentially get stuck.

As a result, there should be a way to check whether DDP finalized after calling `.backward()`. To address this, I've added a `_check_reducer_finalized` method to validate that DDP did indeed successfully finish reduction.
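
A minimal sketch of the check (the DDP instance and batch are hypothetical):

```
out = ddp(torch.randn(4, 8))  # hypothetical DDP instance and batch
out.sum().backward()
# raises if the reducer did not finish reduction this iteration,
# instead of erroring only at the next forward
ddp._check_reducer_finalized()
```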

Test Plan: Added unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100773
Approved by: https://github.com/rohan-varma
2023-05-31 20:43:06 +00:00
11d1cd899a Replace require_backend with require_backend_is_available (#101891)
[BE] `require_backend_is_available` offers a more thorough check than `require_backend`, but both are often used together. This removes `require_backend` and centralizes on the `require_backend_is_available` decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101891
Approved by: https://github.com/awgu
2023-05-25 00:00:06 +00:00
d7f6bfe651 Fix require_backends_available to reenable distributed tests (#101704)
## TLDR
Fix decorator to re-enable 26+ distributed tests that were previously being skipped in CI

## Explanation

As part of the UCC upstream, we updated the backend tests cases to also include "ucc".

3ed1569e86/torch/testing/_internal/common_distributed.py (L90-L92)

In distributed tests we use a decorator which reads from this config and makes sure all backends are available on the system.

3ed1569e86/torch/testing/_internal/distributed/distributed_test.py (L7131)

**However**, UCC is not enabled by default for a certain subset of CI tests, which causes the entire test to be skipped (even if the test is meant for nccl and the backend being tested is nccl).

As the fix, we should check only that the `BACKEND` being tested is available.

## Changes
- Change logic to only check if the current backend being used is available
- Rename `require_backends_available` -> `require_backend_is_available`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101704
Approved by: https://github.com/rohan-varma
2023-05-18 21:33:15 +00:00
0848ed21b8 [c10d] Figure out device to use for object collectives (#100954)
Fixes https://github.com/pytorch/pytorch/issues/97938

This PR is cloned from https://github.com/pytorch/pytorch/pull/100238, which is important to me, but
@kwen2501 has not resolved its merge conflict. So this PR is submitted to resolve it;
the only conflict is in `distributed_c10d.py:2653`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100954
Approved by: https://github.com/kwen2501
2023-05-11 01:49:09 +00:00
01abbfbaae [BE] Fix all B022 useless-contextlib-suppress (#100335)
When no arguments are passed to contextlib.suppress, no exceptions are suppressed, so the context manager is redundant.
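An illustrative contrast (the path is hypothetical):

```
import contextlib
import os

# redundant: with no arguments, nothing is suppressed (B022)
with contextlib.suppress():
    os.remove("/tmp/scratch")

# intended usage names the exceptions to swallow
with contextlib.suppress(FileNotFoundError):
    os.remove("/tmp/scratch")
```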

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100335
Approved by: https://github.com/Skylion007
2023-04-30 18:47:40 +00:00
628a8df1c9 [c10d] Comment out ddp_hook_with_optimizer_parity tests (#100215)
This is a mirror PR of D45339293

Summary:
These tests cause the following errors internally for unknown reasons:
```
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adam'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adamw'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_sgd'
```
Commenting these tests out to unblock other PRs.

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100215
Approved by: https://github.com/wz337, https://github.com/fduwjj
2023-04-28 17:38:12 +00:00
9609aeefbb Revert "[c10d] Comment out ddp_hook_with_optimizer_parity tests (#100215)"
This reverts commit ae40a6c7356190ef86b14b10a94a58ca41ca496b.

Reverted https://github.com/pytorch/pytorch/pull/100215 on behalf of https://github.com/huydhn due to Sorry for revert your change, but it breaks lint, please run lintrunner -a torch/testing/_internal/distributed/distributed_test.py to fix the issue then reland it
2023-04-28 01:21:06 +00:00
ae40a6c735 [c10d] Comment out ddp_hook_with_optimizer_parity tests (#100215)
This is a mirror PR of D45339293

Summary:
These tests cause the following errors internally for unknown reasons:
```
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adam'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adamw'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_sgd'
```
Commenting these tests out to unblock other PRs.

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100215
Approved by: https://github.com/wz337, https://github.com/fduwjj
2023-04-28 00:05:46 +00:00