Fixes a bug where `_ProcessGroupWrapper` is referenced without first checking whether gloo is available. This fails on PyTorch builds that do not include gloo, because `_ProcessGroupWrapper` is only exposed via pybind when building with `USE_GLOO=1`. Therefore, creation of a new process group fails with a `NameError` when only NCCL is available as the backend.
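A minimal sketch of the guard, assuming `torch.distributed.is_gloo_available()` as the availability check (the exact upstream patch may differ):
```
import torch.distributed as dist

# Only reference _ProcessGroupWrapper when gloo was compiled in, since
# its pybind registration is gated on USE_GLOO=1.
if dist.is_gloo_available():
    from torch._C._distributed_c10d import _ProcessGroupWrapper
    # ... safe to use _ProcessGroupWrapper for debug-level wrapping ...
```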
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124233
Approved by: https://github.com/rohan-varma, https://github.com/d4l3k
Previously:
```
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
[W Utils.hpp:133] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function getCvarInt)
```
With this PR, those warnings disappear. They were introduced in #114077.
This change was generated with the sed script below, applied with `sed -i -f /tmp/x **/*.{py,hpp,cpp,cc}`, and then hand-inspected.
```
s/\bNCCL_BLOCKING_WAIT\b/TORCH_NCCL_BLOCKING_WAIT/g
s/\bNCCL_ENABLE_TIMING\b/TORCH_NCCL_ENABLE_TIMING/g
s/\bNCCL_DESYNC_DEBUG\b/TORCH_NCCL_DESYNC_DEBUG/g
s/\bNCCL_ASYNC_ERROR_HANDLING\b/TORCH_NCCL_ASYNC_ERROR_HANDLING/g
s/\bENABLE_NCCL_HEALTH_CHECK\b/TORCH_ENABLE_NCCL_HEALTH_CHECK/g
s/\bNCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK\b/TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK/g
```
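For downstream code the migration is a one-for-one rename; a hedged example (the values are illustrative):
```
import os

# The new TORCH_-prefixed names; the bare NCCL_* forms now emit the
# deprecation warning shown above.
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"  # was NCCL_ASYNC_ERROR_HANDLING
os.environ["TORCH_NCCL_BLOCKING_WAIT"] = "0"         # was NCCL_BLOCKING_WAIT
```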
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114880
Approved by: https://github.com/kwen2501
Currently we print out the mismatched collectives, but it is hard to tell exactly what the mismatch is. This diff adds functionality to detect the exact difference and report it.
New error is as follows:
```
Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=1151423632, OpType=ALLREDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=1151423632, OpType=REDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))). Collectives differ in the following aspects: Op type: ALLREDUCEvs REDUCE
```
i.e., the "Collectives differ in the following..." messaging is the new addition.
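As a hedged illustration of how such a mismatch can be produced (rendezvous setup is omitted; the wrapper is enabled via `TORCH_DISTRIBUTED_DEBUG=DETAIL`, and the shapes mirror the sample output above):
```
import os
import torch
import torch.distributed as dist

# DETAIL debug mode wraps the process group so each collective is
# fingerprinted and cross-checked across ranks before it executes.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
dist.init_process_group("gloo")  # assumes the usual env:// rendezvous vars

t = torch.randn(20, 10)
if dist.get_rank() == 0:
    dist.all_reduce(t)     # rank 0 runs ALLREDUCE ...
else:
    dist.reduce(t, dst=0)  # ... rank 1 runs REDUCE, triggering the report
```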
Differential Revision: [D45375737](https://our.internmc.facebook.com/intern/diff/D45375737/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100214
Approved by: https://github.com/H-Huang
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62051
The goal here is to enable opt-asan for "spawn"-based unit tests, since opt-asan works with "spawn", unlike "dev-asan". As a result, we can run ASAN for "spawn" unit tests as well.
This means we can completely remove the fork-based unit tests from the code base, since the only purpose of those tests was to run ASAN.
ghstack-source-id: 135523770
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29854514
fbshipit-source-id: 02a5bfcfae2afc21badecff77082c7a6ad83636b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61887
1) Introduced a `sandcastle_skip_if` decorator that ensures these tests simply pass on sandcastle rather than being skipped.
2) Fixed all test files under `test/distributed` to not use `unittest.skip`.
The overall goal is to avoid skips, since sandcastle tags these tests as continuously skipping.
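A rough sketch of what such a decorator can look like (the `IS_SANDCASTLE` detection and all names here are illustrative, not the exact upstream helper):
```
import functools
import os
import unittest

IS_SANDCASTLE = os.environ.get("SANDCASTLE") == "1"  # assumed detection

def sandcastle_skip_if(condition, reason):
    # On sandcastle, turn a would-be skip into a silent pass so the test
    # is not tagged as continuously skipping; elsewhere, fall back to a
    # regular unittest skip.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if condition:
                if IS_SANDCASTLE:
                    return  # reported as passed, not skipped
                raise unittest.SkipTest(reason)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```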
ghstack-source-id: 134382237
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D29784152
fbshipit-source-id: 17b4df6c5a55ff1d1e8e1de128fa679c3dfbcb7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60237
Closes https://github.com/pytorch/pytorch/issues/58711
This diff refactors the collective consistency checking in `ProcessGroupWrapper` as described in the above issue. In particular, we no longer run separate verification checks (`all_gather`s) for shapes, op type, etc. Instead, we implement a `serialize_fingerprint` function that serializes all of this data into a single tensor and verify only that tensor.
This has the benefit of being much more extensible: the developer does not need to add separate `all_gather` calls in order to verify additional data in the future. We could also provide a mechanism for "registering" data that needs to be verified in the `CollectiveFingerPrint` struct, making it even easier to add fields; we can consider doing this if there are significant additions to `ProcessGroupWrapper`.
We now also begin to check tensor `dtypes` and device types for consistency. Tests are refactored/added accordingly.
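A hedged Python sketch of the single-tensor idea (the real implementation lives in C++ inside `ProcessGroupWrapper`; the field encodings below are illustrative, not the actual wire format):
```
import torch

def serialize_fingerprint(op_type, tensors):
    # Flatten op type plus per-tensor dtype/device/shape codes into one
    # int64 tensor; a single all_gather of this tensor then replaces the
    # previous per-field verification collectives.
    data = [op_type]
    for t in tensors:
        data.append(int(t.dtype == torch.float))   # illustrative dtype code
        data.append(int(t.device.type == "cpu"))   # illustrative device code
        data.append(t.dim())                       # rank, so shapes can be parsed
        data.extend(t.shape)
    return torch.tensor(data, dtype=torch.int64)
```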
ghstack-source-id: 132520261
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D28597287
fbshipit-source-id: b09f14f628df9e2457623ba81fc13fd4e214f3c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60662
Fixes this flaky test. Sometimes a rank can exit the test early, before rank 0 calls into allreduce; in that case Gloo throws a connection reset error on all the other ranks.
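A hedged sketch of the kind of fix (the exact test change is in this PR; `run_collectives_under_test` is a hypothetical stand-in for the test body):
```
import torch.distributed as dist

def test_body():
    run_collectives_under_test()  # hypothetical: the collectives being tested
    # Keep every rank alive until all ranks are done; a rank that exits
    # early closes its Gloo connections, and its peers then see a
    # connection reset.
    dist.barrier()
```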
ghstack-source-id: 132363151
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D29364806
fbshipit-source-id: ce0c292a2166edad57ea0dbb76df12cfd560a10d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59840
Moves these tests to their own standalone file. No meaningful code changes.
ghstack-source-id: 131359162
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D29012664
fbshipit-source-id: 348870016509a6ed7e69240fa82bccef4a12d674