We also want a Python-side API that lets users reset FR (Flight Recorder) recording entries. We don't need to reset PGNCCL's member counter, since we are creating a new PGNCCL anyway; FR, however, is a global ring buffer, so it does need to be reset.
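As a hypothetical usage sketch (the reset function name below is illustrative, not necessarily the exact API added by this PR):

```python
import torch.distributed as dist

def reinit_with_clean_fr():
    """Recreate the NCCL process group with a clean Flight Recorder.

    PGNCCL's own counters need no reset because a brand-new PGNCCL is
    created; FR, however, is a global ring buffer, so its entries must
    be cleared explicitly.
    """
    dist.destroy_process_group()
    # Hypothetical name for the Python-side reset entry point:
    # torch.distributed._reset_flight_recorder()
    dist.init_process_group("nccl")
```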
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164988
Approved by: https://github.com/tushar00jain
ghstack dependencies: #164752
Increase the tolerance for the following UTs, as a slight mismatch was seen on MI200:
- test_data_parallel.py:test_strided_grad_layout
- test_c10d_nccl.py:test_grad_layout_1devicemodule_1replicaperprocess
Skip for MI200:
- test_fully_shard_training.py:test_2d_mlp_with_nd_mesh
- test_2d_composability.py:test_train_parity_2d_mlp
- test_fully_shard_overlap.py:test_fully_shard_training_overlap
Fixes #159489, fixes #159488, fixes #152700, fixes #125555, fixes #134139
Working as is on both MI200 and MI300:
Fixes #125991, fixes #125918
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164390
Approved by: https://github.com/jeffdaily
This test sets `NCCL_ALGO=NVLS` in NcclUserBufferRegistrationTest, which affects tests run in the same process, such as `test_on_completion_hook_*`, which then fail with
> invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.26.2
> ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
> Last error:
> Error : no algorithm/protocol available for function Broadcast with datatype ncclInt8. NCCL_ALGO was set to NVLS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163063
Approved by: https://github.com/ezyang
Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options.
However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends.
This leads to an assert in some user code, where ProcessGroupGloo::split expects Gloo options but receives NCCL options instead.
Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs.
As a quick fix, since user-provided options really only exist for NCCL, we warn and fall back to default options for Gloo when non-Gloo options are detected.
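A minimal sketch of the failing pattern, assuming a two-rank job (rendezvous details and option fields omitted):

```python
import torch.distributed as dist

# Only one set of options can be passed, and it is NCCL-specific.
opts = dist.ProcessGroupNCCL.Options()
dist.init_process_group("cpu:gloo,cuda:nccl", pg_options=opts)

# split_group propagates the parent's (NCCL) options to every backend,
# so ProcessGroupGloo::split received NCCL options and asserted.
# With this fix, Gloo warns and falls back to default options instead.
subgroup = dist.split_group(split_ranks=[[0, 1]])
```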
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang
- Enable communication of tensors with Complex datatype in ProcessGroupGloo, similar to how ProcessGroupNCCL handles it.
- Move a function, which checks if Complex datatype is supported by a reduce operation, from ProcessGroupNCCL.cpp into a new file to be shared with ProcessGroupGloo.
Fixes #156632
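For example, after this change something like the following should work on a Gloo process group, mirroring the existing NCCL behavior (sketch; rendezvous details omitted):

```python
import torch
import torch.distributed as dist

# Assumes a Gloo process group has been initialized.
dist.init_process_group("gloo")

t = torch.ones(4, dtype=torch.complex64)
# Complex tensors are viewed as pairs of real values for the reduction;
# SUM is well-defined for complex, while e.g. MAX is not and remains
# unsupported (this is what the shared check enforces).
dist.all_reduce(t, op=dist.ReduceOp.SUM)
```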
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156633
Approved by: https://github.com/d4l3k
Instead of implicitly creating a NCCL comm inside mem pool registration for symmetric memory, we now error out, so that only the eager-init case, where the NCCL comm has already been initialized, is supported.
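A sketch of the now-required eager-init pattern (passing `device_id` triggers eager NCCL communicator creation):

```python
import torch
import torch.distributed as dist

# device_id makes PGNCCL create the communicator up front, so the comm
# already exists by the time a mem pool is registered for symmetric
# memory; relying on lazy init now raises an error instead.
dist.init_process_group("nccl", device_id=torch.device("cuda", 0))
```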
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160145
Approved by: https://github.com/kwen2501
This implements a new `wait_stream` API on Work that matches how `wait` works for ProcessGroupNCCL, for CPU-based backends such as Gloo.
The idea is to support Gloo communication overlap in FSDPv2/HSDP with minimal changes to FSDP.
There was a previous attempt to make FSDPv2 use `Work.wait`, but given the extensive stream semantics involved, it doesn't play nicely: https://github.com/pytorch/pytorch/pull/148780
This uses a "Baton" CUDA kernel that spinlocks on a pinned CPU tensor, waiting for it to be set.
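A minimal sketch of the intended usage, assuming `wait_stream` takes the stream that must wait on the collective (signature assumed from the description above):

```python
import torch
import torch.distributed as dist

# Rendezvous details omitted; assumes a Gloo PG and a CUDA device.
dist.init_process_group("gloo")

t = torch.ones(1024, device="cuda")
work = dist.all_reduce(t, async_op=True)

# Rather than blocking the host with work.wait(), enqueue the Baton
# kernel on the current stream: it spins on a pinned CPU tensor that
# is set when the Gloo collective completes, so later GPU work is
# ordered after the all_reduce without a host sync.
work.wait_stream(torch.cuda.current_stream())  # assumed signature

t.mul_(2)  # runs only after the all_reduce has completed
```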
Test plan:
```
pytest test/distributed/test_c10d_gloo.py -v -k wait_stream
pytest test/distributed/test_c10d_nccl.py -v -k wait_stream
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156883
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
When a future returned by `std::async` goes out of scope, its destructor waits on it, which can make the code blocking; this is not expected for the monitoring thread. Instead, let's keep the futures in a vector so that no blocking happens at the launch site. At the end of the loop the wait is still performed, but that is fine since all the checks or dumps have already finished.
Differential Revision: [D77190380](https://our.internmc.facebook.com/intern/diff/D77190380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156653
Approved by: https://github.com/kwen2501
When timing is enabled, the ROCR runtime used to sleep for a short interval, which ensured that the application saw the correct state. For performance reasons that sleep was removed, so the state is no longer guaranteed to be "started". That's why I updated the test's state check to accept either "started" or "scheduled".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153545
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
While looking into a case where an FR dump (the actual dump, not the monitoring thread) takes 30 minutes, I realized that our global write lock is grabbed too early, so the second attempt to dump FR without stack traces fails with a deadlock because the global write lock is still held. We should only grab the lock when we are ready to write, so that we are less likely to hold it forever. I also audited the locking within FR and found one more place where the lock scope can be shrunk.
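The lock-scoping idea, as a minimal sketch (written in Python for brevity; the actual code is the C++ Flight Recorder):

```python
import json
import threading

_fr_write_lock = threading.Lock()  # stand-in for FR's global write lock

def dump_entries(entries, path):
    # The slow part (building the dump) runs without the lock; grabbing
    # the lock up here is what let a stalled dump hold it indefinitely
    # and deadlock the second, stack-trace-free dump attempt.
    payload = json.dumps(entries)

    # Grab the lock only once we are ready to write.
    with _fr_write_lock:
        with open(path, "w") as f:
            f.write(payload)
```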
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155949
Approved by: https://github.com/Skylion007
We need to re-enable this test because there are recent changes relevant to test_nan_assert.
I've already verified that there would be a hang if we don't remove the `pg._allgather_base(output, nan_tensor)` call between the `backend._set_enable_nan_check` calls.
Why did it appear to work previously? Because previously only cu118 distributed was running, and the `backend._set_enable_nan_check` change was never exercised in the merge process (the skip logic is: skip if not CUDA 12 or above).
Workaround for #153479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154448
Approved by: https://github.com/kwen2501
`SIGABRT` is a common exit signal for *negative* distributed tests, which check the effectiveness of the NaN assert, watchdog throw, etc.
These failures are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.
Instead, we need to check the process's return code; e.g. `SIGABRT(6)` corresponds to a return code of -6.
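For example, a negative test can assert on the signal-derived exit code (a sketch, not the exact helper used in the test suite):

```python
import os
import signal
import multiprocessing as mp

def _worker():
    # Stand-in for a failed NaN assert or a watchdog throw, both of
    # which terminate the process with SIGABRT.
    os.abort()

if __name__ == "__main__":
    proc = mp.Process(target=_worker)
    proc.start()
    proc.join()
    # A process killed by signal N exits with code -N, so SIGABRT -> -6.
    assert proc.exitcode == -signal.SIGABRT
```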
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
Lint the following files with the same rules as other files:
- test/test_fake_tensor.py
- test/test_flop_counter.py
- torch/_export/verifier.py

It was a nightmare to update tests in one of the previously skipped files without being able to lint them locally with `lintrunner -a` like other files.
Note that these files see active development; they are not old, untouched files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154261
Approved by: https://github.com/angelayi, https://github.com/Skylion007