pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Howard Huang	f4406689b8	fix MPCT destroy_pg call (#157952 ) I was seeing hangs / exceptions not raising in some cases. Only call `c10d.destroy_process_group()` for `MultiProcessContinuousTest` in the clean exit case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157952 Approved by: https://github.com/fduwjj ghstack dependencies: #157589	2025-07-12 00:46:19 +00:00
Howard Huang	0f31445139	Add stack trace of exception to MultiProcContinousTest (#157589 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/157589 Approved by: https://github.com/Skylion007	2025-07-08 17:54:35 +00:00
Xuehai Pan	cec2977ed2	[BE][6/16] fix typos in torch/ (#156316 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316 Approved by: https://github.com/albanD ghstack dependencies: #156313, #156314, #156315	2025-06-23 02:57:34 +00:00
PyTorch MergeBot	3f44fdc03d	Revert "[BE][6/16] fix typos in torch/ (#156316 )" This reverts commit b210cf1ea56bcd9f937a2805d9e70d8684d25ee4. Reverted https://github.com/pytorch/pytorch/pull/156316 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](`c95f7fa874`) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))	2025-06-22 12:31:57 +00:00
Xuehai Pan	b210cf1ea5	[BE][6/16] fix typos in torch/ (#156316 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316 Approved by: https://github.com/albanD ghstack dependencies: #156313, #156314, #156315	2025-06-22 08:43:33 +00:00
Hari Krishna Sai Kodali	e1f28fe17b	add device generalisation support for distributed tests (#152471 ) ### MOTIVATION To generalize Distributed test cases for non-CUDA devices ### CHANGES - test/distributed/optim/test_zero_redundancy_optimizer.py - test/distributed/test_c10d_logger.py - test/distributed/test_compute_comm_reordering.py Replaced hard coded device names with get_devtype from torch.testing._internal.common_fsdp. DistributedTestBase is used instead of MultiProcessTestCase, to make use of helper functions. - torch/testing/_internal/common_distributed.py extended common utility functions Pull Request resolved: https://github.com/pytorch/pytorch/pull/152471 Approved by: https://github.com/d4l3k	2025-06-20 07:35:42 +00:00
Howard Huang	8e1471bdc9	Allow MultiProcContinuousTest to set world_size (#155920 ) `MultiProcContinuousTest` will automatically set world_size to number of devices. This change allows this attribute to be modified by the derived test class Pull Request resolved: https://github.com/pytorch/pytorch/pull/155920 Approved by: https://github.com/fduwjj	2025-06-15 00:24:17 +00:00
Ke Wen	8c16d0e404	[c10d] Add support for testing SIGABRT return (#153167 ) `SIGABRT` is a common return by negative distributed tests, which checks for effectiveness of NaN assert, watchdog throw, etc. These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`. Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167 Approved by: https://github.com/fduwjj	2025-05-26 00:56:05 +00:00
PyTorch MergeBot	54932d865e	Revert "[c10d] Add support for testing SIGABRT return (#153167 )" This reverts commit 03e102dbe8cbffc2e42a3122b262d02f03571de7. Reverted https://github.com/pytorch/pytorch/pull/153167 on behalf of https://github.com/malfet due to It broke lint ([comment](https://github.com/pytorch/pytorch/pull/153167#issuecomment-2907820789))	2025-05-25 13:17:27 +00:00
Ke Wen	9d922b55ef	[Distributed][CI] Rework continuous TestCase (#153653 ) 1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple TestClass'es in one file). 2. The child processes are now an infinite loop, monitoring test IDs passed from main process via a task queue. Reciprocally, the child processes inform the main process completion of a test via a completion queue. 3. Added a test template. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653 Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj	2025-05-25 03:49:29 +00:00
Ke Wen	03e102dbe8	[c10d] Add support for testing SIGABRT return (#153167 ) `SIGABRT` is a common return by negative distributed tests, which checks for effectiveness of NaN assert, watchdog throw, etc. These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`. Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167 Approved by: https://github.com/fduwjj	2025-05-25 03:48:34 +00:00
PyTorch MergeBot	28af44285b	Revert "[c10d] Add support for testing SIGABRT return (#153167 )" This reverts commit 499a76b844bbcbc5465cb76c617b3076c1b0fd65. Reverted https://github.com/pytorch/pytorch/pull/153167 on behalf of https://github.com/malfet due to Broke lint, see `fe784c5a2c/1` ([comment](https://github.com/pytorch/pytorch/pull/153167#issuecomment-2905623868))	2025-05-23 19:44:08 +00:00
Ke Wen	499a76b844	[c10d] Add support for testing SIGABRT return (#153167 ) `SIGABRT` is a common return by negative distributed tests, which checks for effectiveness of NaN assert, watchdog throw, etc. These errors are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`. Instead, we'd need to check for the process's return code, e.g. `SIGABRT(6)` would have a return code of -6. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167 Approved by: https://github.com/fduwjj	2025-05-23 19:04:28 +00:00
PyTorch MergeBot	674a85cf26	Revert "[Distributed][CI] Rework continuous TestCase (#153653 )" This reverts commit 0d5c628a6e96e0a960af39d1d0de4bf04df69c39. Reverted https://github.com/pytorch/pytorch/pull/153653 on behalf of https://github.com/kwen2501 due to More fixes needed ([comment](https://github.com/pytorch/pytorch/pull/153653#issuecomment-2891931028))	2025-05-19 18:29:27 +00:00
Ke Wen	0d5c628a6e	[Distributed][CI] Rework continuous TestCase (#153653 ) 1. Reworked `MultiProcContinousTest` to spawn processes during `setUpClass` instead of `main` (so that we can support multiple TestClass'es in one file). 2. The child processes are now an infinite loop, monitoring test IDs passed from main process via a task queue. Reciprocally, the child processes inform the main process completion of a test via a completion queue. 3. Added a test template. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153653 Approved by: https://github.com/d4l3k, https://github.com/fegin, https://github.com/fduwjj	2025-05-19 18:20:42 +00:00
PyTorch MergeBot	3443627e07	Revert "[BE]: Enable RUFF TRY400 rule - log.exception (#153473 )" This reverts commit 4f4ecc583e0f48ad2d062a53bf91c61ab40b4948. Reverted https://github.com/pytorch/pytorch/pull/153473 on behalf of https://github.com/jeanschmidt due to seems to have broken internal signals, @albanD may I count on you to help the author merge his PR? D74837988 ([comment](https://github.com/pytorch/pytorch/pull/153473#issuecomment-2886017075))	2025-05-16 08:29:26 +00:00
Aaron Gokaslan	4f4ecc583e	[BE]: Enable RUFF TRY400 rule - log.exception (#153473 ) Change logging.error to logging.exception to log additional information when relevant. A few places have slipped in logging.errors in try except since I last did a clean up here and the rule is stabilized so I am enabling it codebase wide. I have NOQA'd much of our custom exception stack trace handling for RPC calls and distributed and tried to fix a few errors based on whether we immediately reraised it or if we didn't print any exception handling where it could be useful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153473 Approved by: https://github.com/albanD, https://github.com/cyyever	2025-05-15 13:36:59 +00:00
Ke Wen	5dd746b4b5	[c10d] Reduce test verbosity (#153116 ) Has been seeing a lot of `Starting event listener thread for rank` recently in test print-out. Moving them to `logger.debug`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/153116 Approved by: https://github.com/fduwjj	2025-05-08 22:22:22 +00:00
Xilun Wu	0f9821d0e3	[BE][lint] fix PYFMT for PT-D code under torch.testing._internal, add them to the lint list (#153114 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/153114 Approved by: https://github.com/cyyever, https://github.com/fegin, https://github.com/H-Huang, https://github.com/Skylion007	2025-05-08 14:01:49 +00:00
wizzniu	2246cb6e14	Fix common_distributed.py to NOT set root logger (#152319 ) Using `logging.basicConfig` to set root logger's level is not a good behavior. Fix common_distributed.py to set level for current logger only, because it affects downstream's 3rd-party testing plugins. Pull Request resolved: https://github.com/pytorch/pytorch/pull/152319 Approved by: https://github.com/Skylion007	2025-04-28 17:51:32 +00:00
Tristan Rice	df4e5294a6	Reapply "ProcessGroupGloo: support lazy_init (#150801 )" (#151031 ) This reverts commit 73f3d6d9aaa128d9917e8b3790933ba2855066cc. Reapplies #150801 Test plan: See #150801 submodule Pull Request resolved: https://github.com/pytorch/pytorch/pull/151031 Approved by: https://github.com/fduwjj	2025-04-11 01:58:35 +00:00
PyTorch MergeBot	73f3d6d9aa	Revert "ProcessGroupGloo: support lazy_init (#150801 )" This reverts commit f237ee54bfb35d16cd10e358d4b78578c88a5781. Reverted https://github.com/pytorch/pytorch/pull/150801 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/150801#issuecomment-2793161239))	2025-04-10 13:44:31 +00:00
Tristan Rice	f237ee54bf	ProcessGroupGloo: support lazy_init (#150801 ) This adds lazy initialization support to ProcessGroupGloo via `TORCH_GLOO_LAZY_INIT` or via `create_device(..., lazy_init=True)` This is still a draft PR as there's one race condition when doing coalesced operations that needs to be fixed upstream in Gloo first. Depends on https://github.com/facebookincubator/gloo/pull/427 landing first This also updates the gloo submodule to include the required changes. Test plan: added lazy init test variants ``` pytest -v test/distributed/test_c10d_gloo.py -k Lazy ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/150801 Approved by: https://github.com/fduwjj	2025-04-09 19:29:50 +00:00
lzhang2	84b58bd63e	Enable FSDP tests on XPU device (#147518 ) Motivation: Enable FSDP tests on XPU device Pull Request resolved: https://github.com/pytorch/pytorch/pull/147518 Approved by: https://github.com/weifengpy	2025-03-04 23:49:37 +00:00
PyTorch MergeBot	e06ee4aa9f	Revert "Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 )" This reverts commit 06f4a5c0e578d7da10ebdf14edcd24e5dcef78d6. Reverted https://github.com/pytorch/pytorch/pull/146073 on behalf of https://github.com/atalman due to breaks macos builds: ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package ([comment](https://github.com/pytorch/pytorch/pull/146073#issuecomment-2659802389))	2025-02-14 16:44:46 +00:00
atalman	06f4a5c0e5	Nccl update to 2.25.1 for cuda 12.4-12.8 (#146073 ) Should resolve: https://github.com/pytorch/pytorch/issues/144768 We use one common nccl version for cuda builds 12.4-12.8: ``NCCL_VERSION=v2.25.1-1`` For CUDA 11.8 we use legacy ``NCCL_VERSION=v2.21.1-1`` We use a pinned version of NCCL rather than a submodule. Move nccl location from ``third_party/nccl/nccl`` to ``third_party/nccl`` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146073 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kwen2501, https://github.com/fduwjj	2025-02-14 15:29:59 +00:00
Aaron Orenstein	dea7ad3371	PEP585 update - torch/testing (#145200 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145200 Approved by: https://github.com/bobrenjc93	2025-01-20 22:42:42 +00:00
Will Constable	2c4281d7da	Make MultiProcContinuousTest timeout configurable (#145099 ) Allows test classes using MPCT to set their own timeout as a class property, which is good enough since the processgroup is shared across test instances and the timeout is set at processgroup init. Also sets a default timeout of 2 minutes, which is probably (?) long enough for reasonable tests, but can be changed if it causes flakiness. It's preferable to have as short a default timeout as possible, since when debugging tests getting a timeout quickly helps. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145099 Approved by: https://github.com/d4l3k, https://github.com/fduwjj ghstack dependencies: #145010, #145011	2025-01-18 04:37:12 +00:00
bobrenjc93	3b6b306b71	Migrate from Tuple -> tuple in torch/testing (#144256 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144256 Approved by: https://github.com/aorenste	2025-01-10 06:37:55 +00:00
PyTorch MergeBot	080f992d68	Revert "[CI] Reduce distributed test timeout to 60s (#141168 )" This reverts commit e8de8f3969bf935442378efd125442de90e78431. Reverted https://github.com/pytorch/pytorch/pull/141168 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think we missed inductor tests ([comment](https://github.com/pytorch/pytorch/pull/141168#issuecomment-2494060624))	2024-11-22 15:46:37 +00:00
Ke Wen	e8de8f3969	[CI] Reduce distributed test timeout to 60s (#141168 ) Pulling a PR to test viability. Today's timeout is 300s, which could waste quite some machine time if a hang happens in CI. Differential Revision: [D66275756](https://our.internmc.facebook.com/intern/diff/D66275756) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141168 Approved by: https://github.com/clee2000	2024-11-22 00:59:55 +00:00
Ke Wen	66476617bf	[Dist][CI] Easier override of destroy-upon-exit setting (#141192 ) Adding `destroy_pg_upon_exit` property to allow derived Test classes to control whether auto destroy is desired. (Otherwise, derived test classes will need to rewrite the `_run()` method, leading to duplicated code of `_run()` and if one needs to add things to `_run` in the future, more code change is needed.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141192 Approved by: https://github.com/wconstab	2024-11-21 07:32:56 +00:00
Syed Tousif Ahmed	e0482fdf95	Implements user buffer registration using MemPool (#133603 ) This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocating special memory using MemPool and registering it with the nccl buffer registration APIs. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603 Approved by: https://github.com/kwen2501, https://github.com/eqy	2024-11-21 01:40:11 +00:00
PyTorch MergeBot	496c1e78c5	Revert "Implements user buffer registration using MemPool (#133603 )" This reverts commit 25d9be37bef949c675e42b4929ddcb6997af2a7b. Reverted https://github.com/pytorch/pytorch/pull/133603 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133603#issuecomment-2486897708))	2024-11-19 22:42:26 +00:00
Anant Gulati	b379a28a95	Generalization of distributed test cases for non-CUDA devices (#138216 ) # Motivation This pr is an extension of #131758. As described in #131758, these changes are looking to make distributed UTs more accessible to users of all device types. It is a demonstration of a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758 (https://github.com/pytorch/pytorch/pull/131758#discussion_r1762422784) This PR contains two types of changes, the first is to the common distributed folder where we have added a new class derived from MultiProcessTestCase which helps abstract out the process group creation/deletion and other functionality for a given device. The new generalized content can be added by deriving from this base class. Also includes other misc changes for gaudi support The second changed file is test_functional_api, a test file in common distributed. This file is a POC for how we can use this new class to write more device agnostic distributed test cases. The following changes have been made to test_functional_api.py: - Functionality has been added to test for non cuda devices using intel HPU as an example - Multiple set up steps previously required by MultiProcessTestCase have been abstracted out - Misc adaptations to allow for general call to accelerators while adding test skips instead explicitly skipping for multiple GPUs - Skipifhpu flags have been added to enable skipping a few Multithreaded test cases which are as yet not supported on HPUs NOTE: Within test functional api, there are tests which require the use of some multithreading functions which are as yet not supported on HPUs. These have been skipped for hpu using skipHPU decorator. I will be raising a separate PR to improve usability of said decorators in a device agnostic setting in the manner suggested by @kwen2501 in a comment on this PR. This pr is a cleaned up version of a previous PR (#136988) which I closed due to human error. I have addressed some of the comments made by @kwen2501 in this as well Pull Request resolved: https://github.com/pytorch/pytorch/pull/138216 Approved by: https://github.com/kwen2501, https://github.com/guangyey	2024-11-18 09:38:00 +00:00
Will Constable	1d5a8ee8fb	[C10D] call destroy_process_group after MultiProcess tests (#140820 ) Faced with an annoying string of warnings like this when running tests (screenshot omitted). My choices seem to be (1) call destroy_process_group() at the end of each test fn, (2) do this in some wrapper, (3) do it in the base test class. Since tests in MultiProcessTestCase are responsible for calling init_process_group themselves, they should also be responsible for calling destroy (or at least method (3) would be asymmetric and may result in double-destroy). But it doesn't feel worth it to go add a destroy call manually to each test, and try/except for a possible second destroy call seems like a happy middle ground. Note: tests that want to ensure that destroy runs cleanly can and should still call destroy _inside_ the test, and this change does not affect that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820 Approved by: https://github.com/fegin	2024-11-18 04:26:21 +00:00
PyTorch MergeBot	bf8709b08a	Revert "[C10D] call destroy_process_group after MultiProcess tests (#140820 )" This reverts commit 77d1f076dadec7a77c4bcf807c4efbef6ca5a8f1. Reverted https://github.com/pytorch/pytorch/pull/140820 on behalf of https://github.com/wconstab due to failures on trunk not on PR CI ([comment](https://github.com/pytorch/pytorch/pull/140820#issuecomment-2480644227))	2024-11-16 16:32:14 +00:00
Will Constable	77d1f076da	[C10D] call destroy_process_group after MultiProcess tests (#140820 ) Faced with an annoying string of warnings like this when running tests (screenshot omitted). My choices seem to be (1) call destroy_process_group() at the end of each test fn, (2) do this in some wrapper, (3) do it in the base test class. Since tests in MultiProcessTestCase are responsible for calling init_process_group themselves, they should also be responsible for calling destroy (or at least method (3) would be asymmetric and may result in double-destroy). But it doesn't feel worth it to go add a destroy call manually to each test, and try/except for a possible second destroy call seems like a happy middle ground. Note: tests that want to ensure that destroy runs cleanly can and should still call destroy _inside_ the test, and this change does not affect that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/140820 Approved by: https://github.com/fegin ghstack dependencies: #140460, #140815	2024-11-16 14:24:52 +00:00
Syed Tousif Ahmed	25d9be37be	Implements user buffer registration using MemPool (#133603 ) This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocating special memory using MemPool and registering it with the nccl buffer registration APIs. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603 Approved by: https://github.com/kwen2501, https://github.com/eqy	2024-11-15 12:47:49 +00:00
Edward Z. Yang	4e647871d6	Ensure TORCH_TRACE is run for Dynamo/Distributed tests (#139786 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/139786 Approved by: https://github.com/bobrenjc93, https://github.com/c00w, https://github.com/anijain2305 ghstack dependencies: #139716	2024-11-07 01:58:05 +00:00
Will Feng	e4ad02892f	Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161 Approved by: https://github.com/seemethere, https://github.com/eqy, https://github.com/yf225 Co-authored-by: Will Feng <yf225@cornell.edu>	2024-10-20 23:48:54 +00:00
Tom Ritchford	c0582fd0f8	Remove unused Python variables in torch/[b-z]* (#136963 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963 Approved by: https://github.com/ezyang	2024-10-19 16:45:22 +00:00
Ke Wen	c88b77af9c	[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245 Approved by: https://github.com/yf225	2024-10-18 21:39:39 +00:00
PyTorch MergeBot	0ff6f7a040	Revert "[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245 )" This reverts commit 1581a93e8705dc23f649573d4404cd6816d614af. Reverted https://github.com/pytorch/pytorch/pull/138245 on behalf of https://github.com/albanD due to Breaks distributed inductor tests ([comment](https://github.com/pytorch/pytorch/pull/138245#issuecomment-2422462579))	2024-10-18 13:21:17 +00:00
Ke Wen	1581a93e87	[Distributed][CI] Add SM guard for compiled tests involving BF16 (#138245 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138245 Approved by: https://github.com/yf225	2024-10-18 09:10:01 +00:00
PyTorch MergeBot	24ee4af86b	Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 )" This reverts commit 2b7c7a20b9c0e8e7f2773ffc5c9f79c3cae2070b. Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/kwen2501 due to breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2417833666))	2024-10-16 20:05:38 +00:00
Ke Wen	2b7c7a20b9	Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161 Approved by: https://github.com/seemethere, https://github.com/eqy	2024-10-16 16:42:57 +00:00
PyTorch MergeBot	78632b97b1	Revert "Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 )" This reverts commit f43c4d28b8f955fe1f2b80f193815edadc95507b. Reverted https://github.com/pytorch/pytorch/pull/137161 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems another failure showing up after the upgrade ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2415941159))	2024-10-16 07:26:34 +00:00
Ke Wen	f43c4d28b8	Upgrade distributed test to g4dn instances (T4 GPUs) (#137161 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137161 Approved by: https://github.com/seemethere, https://github.com/eqy	2024-10-16 05:03:08 +00:00
PyTorch MergeBot	b55ff476bd	Revert "[Distributed] Fix extra context on device 0 (#135273 )" This reverts commit cdd8fa98c77b052085cca65dd54769ae18b72104. Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/PaliC due to broken tests on trunk ([comment](https://github.com/pytorch/pytorch/pull/137161#issuecomment-2406236337))	2024-10-10 23:47:25 +00:00
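
The entries for #153167 above point out that a `SIGABRT` raised by a negative distributed test cannot be caught with `with self.assertRaises(RuntimeError)` and has to be detected from the process's return code instead. A minimal sketch of that idea, using only the Python standard library and assuming a POSIX system; `run_child` is a hypothetical helper written for illustration, not the helper added by that PR:

```python
import signal
import subprocess
import sys


def run_child(code: str) -> int:
    """Run a Python snippet in a child process and return its exit code.

    Hypothetical helper for illustration only.
    """
    proc = subprocess.run(
        [sys.executable, "-c", code],
        stderr=subprocess.DEVNULL,  # hide the abort message from the child
    )
    return proc.returncode


# os.abort() makes the child raise SIGABRT against itself.
abort_returncode = run_child("import os; os.abort()")

# On POSIX, a child killed by signal N reports a return code of -N,
# so SIGABRT (6) shows up as -6 in the parent.
assert abort_returncode == -signal.SIGABRT
```

The parent never sees a Python exception here, which is why the check has to be made on the return code rather than with `assertRaises`.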


196 Commits
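
The entry for #152319 in this history describes replacing a `logging.basicConfig` call, which configures the root logger, with a level set on the module's own logger so that downstream third-party test plugins are not affected. A minimal sketch of the difference, assuming only the standard `logging` module; the logger names below are hypothetical and not taken from `common_distributed.py`:

```python
import logging

# Remember the root logger's level so we can show it is left untouched.
root_level_before = logging.getLogger().level

# Configure only this module's logger (hypothetical name) instead of
# calling logging.basicConfig(level=...), which would change the root logger.
logger = logging.getLogger("example.common_distributed")
logger.setLevel(logging.INFO)


def root_level_unchanged() -> bool:
    """Return True if the root logger still has its original level."""
    return logging.getLogger().level == root_level_before


# An unrelated logger (hypothetical name) still inherits the root's level.
other = logging.getLogger("example.third_party_plugin")
```

With this approach, `logger` emits INFO records while loggers owned by other packages keep whatever level the application or test runner chose for the root.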