Commit Graph

17 Commits

0d17029fea [BE][6/6] fix typos in test/ (test/distributed/) (#157640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157640
Approved by: https://github.com/yewentao256, https://github.com/malfet
2025-07-11 14:09:37 +00:00
6d5c789ad5 [BE][PYFMT] migrate PYFMT for test/[a-h]*/ to ruff format (#144555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144555
Approved by: https://github.com/ezyang
ghstack dependencies: #144551, #144554
2025-06-24 04:53:54 +00:00
87f44d70b1 [torch/distributed] Check gloo availability when doing isinstance(pg,… (#124233)
Fixes a bug where a reference to `_ProcessGroupWrapper` is used without first checking whether gloo is available. This fails on PyTorch builds that do not include gloo because `_ProcessGroupWrapper` is only pybinded when building with `USE_GLOO=1`. As a result, creating a new process group fails with a `NameError` when NCCL is the only available backend.
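
For context, a minimal sketch of the guarded check, assuming an illustrative helper name `_pg_is_wrapped` (not the PR's actual function):

```
import torch.distributed as dist

def _pg_is_wrapped(pg) -> bool:
    # _ProcessGroupWrapper is only pybinded when PyTorch is built with
    # USE_GLOO=1; referencing it unconditionally raises NameError on
    # NCCL-only builds, so check gloo availability first.
    if not dist.is_gloo_available():
        return False
    from torch._C._distributed_c10d import _ProcessGroupWrapper
    return isinstance(pg, _ProcessGroupWrapper)
```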

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124233
Approved by: https://github.com/rohan-varma, https://github.com/d4l3k
2024-04-19 04:07:00 +00:00
9cc040fef6 Switch env variable use in test harnesses to the non-deprecated names to fix warnings (#114880)
Previously:

```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```

With this PR, those warnings disappear. They were introduced in #114077.

This change was generated with this sed script, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}` and hand-inspected.

```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```
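
In a test harness, the switch just means exporting the new names; an illustrative snippet (values are examples, not the harness's defaults):

```
import os

# New, non-deprecated names; the old NCCL_* spellings still work but warn.
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "1"
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"
```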

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880
Approved by: https://github.com/kwen2501
2023-12-01 20:08:23 +00:00
ebcc42ea10 [Dist] Fix coalescing manager + DETAIL debug mode (#111878)
Fix https://github.com/pytorch/pytorch/issues/109520 by adding coalescing-manager support to ProcessGroupWrapper.
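
For context, a hedged sketch of the pattern this fixes: coalesced collectives issued while `TORCH_DISTRIBUTED_DEBUG=DETAIL` wraps the process group (`_coalescing_manager` is a private API whose signature may differ across versions):

```
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already run with
# TORCH_DISTRIBUTED_DEBUG=DETAIL, which wraps the default group
# in ProcessGroupWrapper.
tensors = [torch.ones(10) for _ in range(4)]
with dist._coalescing_manager():
    for t in tensors:
        dist.all_reduce(t)
```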

Differential Revision: [D50583403](https://our.internmc.facebook.com/intern/diff/D50583403/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111878
Approved by: https://github.com/fegin, https://github.com/wanchaol, https://github.com/fduwjj
2023-10-24 07:47:39 +00:00
0848ed21b8 [c10d] Figure out device to use for object collectives (#100954)
Fixes https://github.com/pytorch/pytorch/issues/97938

This PR is a clone of https://github.com/pytorch/pytorch/pull/100238, which is important to me, but
@kwen2501 has not resolved its merge conflict. So this PR is submitted to resolve that conflict;
the only conflict is at `distributed_c10d.py:2653`.
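
For reference, a hedged example of an object collective whose device selection this touches; the explicit `device` argument remains available if inference picks the wrong one:

```
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already run.
objs = [{"step": 1}] if dist.get_rank() == 0 else [None]
dist.broadcast_object_list(objs, src=0, device=torch.device("cpu"))
```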

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100954
Approved by: https://github.com/kwen2501
2023-05-11 01:49:09 +00:00
9ba2bfea9c [PG Wrapper] Add diff capability (#100214)
Currently we print out the mismatched collectives, but it is hard to
tell exactly what the mismatch is. This diff adds functionality to detect
and report the exact mismatch.

New error is as follows:

```
Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=1151423632, OpType=ALLREDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=1151423632, OpType=REDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))). Collectives differ in the following aspects: Op type: ALLREDUCE vs REDUCE
```

i.e. the "Collectives differ in the following..." messaging is added.
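
A hedged repro sketch of the kind of mismatch the new diff reports (rendezvous setup omitted; assumes the usual env:// variables are set):

```
import os
import torch
import torch.distributed as dist

os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # enable ProcessGroupWrapper
dist.init_process_group("gloo")

t = torch.ones(20, 10)
if dist.get_rank() == 0:
    dist.all_reduce(t)     # rank 0 runs ALLREDUCE ...
else:
    dist.reduce(t, dst=0)  # ... other ranks run REDUCE -> mismatch reported
```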

Differential Revision: [D45375737](https://our.internmc.facebook.com/intern/diff/D45375737/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100214
Approved by: https://github.com/H-Huang
2023-05-10 15:32:30 +00:00
dab1a7e6a1 [PG Wrapper] Add sequence number (#97462)
Adds a sequence number to the PG wrapper.

Differential Revision: [D44347419](https://our.internmc.facebook.com/intern/diff/D44347419/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97462
Approved by: https://github.com/zhaojuanmao
2023-04-06 06:47:19 +00:00
046e88a291 [BE] [3/3] Rewrite super() calls in test (#94592)
Rewrite Python built-in class `super()` calls to the zero-argument form (see the sketch after the diff below). Only non-semantic changes are applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```
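
The rewrite itself, in its simplest form (illustrative class names):

```
import torch.nn as nn

class Before(nn.Module):
    def __init__(self):
        super(Before, self).__init__()  # legacy two-argument form

class After(nn.Module):
    def __init__(self):
        super().__init__()  # Python 3 zero-argument form, same behavior here
```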

Cases where the rewrite would change semantics are kept unchanged, e.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94592
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-12 22:20:53 +00:00
08795f9afc Add _reduce_scatter_base to ProcessGroupWrapper. (#79633)
Fixes #66329
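
A hedged usage sketch of the op the wrapper now forwards (`_reduce_scatter_base` is private and has since been superseded by `dist.reduce_scatter_tensor`):

```
import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already run.
world = dist.get_world_size()
inp = torch.ones(world * 4)  # flat input: world_size chunks of 4 elements
out = torch.empty(4)         # this rank's reduced chunk
dist._reduce_scatter_base(out, inp)
```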

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79633
Approved by: https://github.com/fduwjj, https://github.com/rohan-varma
2022-06-29 15:32:42 +00:00
34051d74da Add test owner to distributed files starting with test_ (#66797)
Summary:
Action based on https://github.com/pytorch/pytorch/issues/66232

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66797

Reviewed By: gchanan

Differential Revision: D31761389

Pulled By: janeyx99

fbshipit-source-id: c27c9ab4acec1eb71d5edd4538cd113b770dfc6c
2021-10-19 10:55:20 -07:00
2d671ca41b [8/N] Remove c10d/ddp fork tests. (#63454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63454

Continuation of https://github.com/pytorch/pytorch/pull/63443, this
PR removes all fork tests from torch.distributed.
ghstack-source-id: 136285511

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D30387872

fbshipit-source-id: f6d6313db126ae7b95b86f78a1e0726887c5c513
2021-08-20 12:23:18 -07:00
f7611b31aa [4/N] Enable opt-asan for distributed unit tests. (#62051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62051

The goal here is to enable opt-asan for "spawn"-based unit tests, since
opt-asan works with "spawn" while "dev-asan" does not. As a result, we can run
ASAN for "spawn" unit tests as well.

This means we can completely remove fork unit tests from the code base, since
their only purpose was to run ASAN.
ghstack-source-id: 135523770

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29854514

fbshipit-source-id: 02a5bfcfae2afc21badecff77082c7a6ad83636b
2021-08-10 22:38:31 -07:00
82d81455ae [2/N] Remove unittest.skip across all of torch.distributed. (#61887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887

1) Introduced a `sandcastle_skip_if` decorator that ensures these
tests simply pass on Sandcastle (a hedged sketch follows below).
2) Fixed all test files under `test/distributed` to not use `unittest.skip`.

The overall goal is to avoid skips, since Sandcastle tags these tests as
continuously skipping.
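
A hedged sketch of the idea; the real helper lives in `torch.testing._internal.common_utils` and differs in detail:

```
import functools
import os
import unittest

def sandcastle_skip_if(condition, reason):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if condition:
                if os.environ.get("SANDCASTLE"):  # assumed detection mechanism
                    return  # report a pass on Sandcastle instead of a skip
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```
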
ghstack-source-id: 134382237

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29784152

fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
2021-07-27 10:53:23 -07:00
f5341bd5e6 Enhance ProcessGroupWrapper with additional checks + refactor (#60237)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60237

Closes https://github.com/pytorch/pytorch/issues/58711

This diff refactors the collective consistency checking in `ProcessGroupWrapper` as described in the above issue. In particular, we no longer run separate verification checks (`all_gather`s) for shapes, op type, etc. Instead, we implement a function `serialize_fingerprint` to serialize all this data into a single tensor and only verify that.

This has the benefit of being a lot more extensible: the developer does not need to add separate `all_gather` calls to verify additional data in the future. We could also provide a mechanism where data that needs to be verified is "registered" in the `CollectiveFingerPrint` struct, making it even easier to add additional data; we can consider this if there are significant additions to `ProcessGroupWrapper`.

We now also check tensor `dtypes` and device types for consistency. Tests are refactored/added accordingly.
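
A hedged Python sketch of the single-tensor idea (the real implementation is C++ inside `ProcessGroupWrapper`; names here are illustrative):

```
import torch
import torch.distributed as dist

def verify_collective(op_type: int, t: torch.Tensor, group=None, max_dims=8):
    # Pack op type, dtype marker, and padded shape into one int64 tensor so
    # a single all_gather verifies everything at once.
    shape = list(t.shape) + [0] * (max_dims - t.dim())
    fp = torch.tensor([op_type, int(t.dtype.is_floating_point),
                       t.dim(), *shape], dtype=torch.int64)
    world = dist.get_world_size(group=group)
    gathered = [torch.empty_like(fp) for _ in range(world)]
    dist.all_gather(gathered, fp, group=group)
    if any(not torch.equal(g, fp) for g in gathered):
        raise RuntimeError("Detected mismatch between collectives on ranks.")
```
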
ghstack-source-id: 132520261

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D28597287

fbshipit-source-id: b09f14f628df9e2457623ba81fc13fd4e214f3c9
2021-06-28 10:24:11 -07:00
0be65cd52a [c10d] Fix test_collective_hang flakiness (#60662)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60662

Fixes this flaky test. Sometimes a rank can exit the test early, before
rank 0 calls into allreduce, in which case Gloo throws a connection reset
error on all other ranks.
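
A hedged sketch of the fix's shape: keep the other ranks alive instead of letting them exit while rank 0 is still in the collective:

```
import time
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("gloo", ...) has already run.
t = torch.ones(1)
if dist.get_rank() == 0:
    dist.all_reduce(t)  # expected to hang: peers never join this collective
else:
    time.sleep(5)  # stay alive so Gloo does not report connection reset
```
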
ghstack-source-id: 132363151

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D29364806

fbshipit-source-id: ce0c292a2166edad57ea0dbb76df12cfd560a10d
2021-06-25 10:15:18 -07:00
c2098487e8 [c10d] Move pg wrapper tests to their own file. (#59840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59840

Moves these tests to their own standalone file. No meaningful code changes.
ghstack-source-id: 131359162

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D29012664

fbshipit-source-id: 348870016509a6ed7e69240fa82bccef4a12d674
2021-06-14 15:05:55 -07:00