Commit Graph

103 Commits

d7e275d4b4 [CI][CUDA] Add periodic b200 distributed job (#159323)
1. Run the distributed job with the B200 runner, periodically.
2. Discovered a generic distributed-test issue: certain unit tests hard-code ranks, which calls for a require_exact_world_size(world_size) API instead of require_world_size(world_size) (a hedged sketch follows below).
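
A hedged sketch of what such a decorator pair might look like (the names come from the commit message; the implementation below is an assumption, not PyTorch's actual code, and it assumes the test class exposes `self.world_size`):

```python
import functools
import unittest

def require_exact_world_size(n):
    """Hypothetical sketch: skip unless the test runs with exactly n ranks,
    for tests that hard-code rank ids and break on larger world sizes."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            if self.world_size != n:
                raise unittest.SkipTest(f"requires world_size == {n}, got {self.world_size}")
            return fn(self, *args, **kwargs)
        return wrapper
    return decorator

def require_world_size(n):
    """Hypothetical sketch: skip only when fewer than n ranks are available."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            if self.world_size < n:
                raise unittest.SkipTest(f"requires world_size >= {n}, got {self.world_size}")
            return fn(self, *args, **kwargs)
        return wrapper
    return decorator
```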

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159323
Approved by: https://github.com/eqy

Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com>
2025-10-16 21:54:04 +00:00
585b9dbb5e [async_tp] Support ag+mm with gather_dim lastdim of mat_A (#163068)
Adds ag+mm support for the case where gather_dim is the last dim of the matmul (the reduction dim).

When we decompose the matmul along the reduction dimension, we end up with partials that need an additional reduction, so we allocate memory for an accumulator.

The decomposition should not produce small (thin) mms that cannot load the GPU efficiently, so the minimal shard size is limited to 1024 (found empirically by testing in torchtitan).

scaled_mm is not supported yet for this case.
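
A single-process sketch of the decomposition described above (my own illustration under assumed shapes, not the PR's kernel): the all-gather along the reduction dim is replaced by per-shard partial matmuls accumulated into a preallocated buffer.

```python
import torch

def ag_mm_reference(a_shards, B):
    # Unfused reference: all_gather A along its last (reduction) dim, then matmul.
    A = torch.cat(a_shards, dim=-1)   # [M, K]
    return A @ B                      # [M, N]

def ag_mm_decomposed(a_shards, B):
    # Decomposed form: each shard contributes a partial product over its K-slice;
    # partials are accumulated into a preallocated buffer (overlapped with comm in the real op).
    M, k_shard = a_shards[0].shape
    acc = torch.zeros(M, B.shape[1], dtype=B.dtype)
    for i, a_i in enumerate(a_shards):
        acc += a_i @ B[i * k_shard:(i + 1) * k_shard]
    return acc

world_size, M, K, N = 4, 64, 4096, 128
a_shards = [torch.randn(M, K // world_size) for _ in range(world_size)]
B = torch.randn(K, N)
torch.testing.assert_close(ag_mm_decomposed(a_shards, B), ag_mm_reference(a_shards, B),
                           rtol=1e-3, atol=1e-3)
```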

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163068
Approved by: https://github.com/ngimel
2025-10-16 20:14:39 +00:00
19bf67be32 multimem reduce (#164517)
Modified the `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.

The original `multimem_one_shot_all_reduce` op becomes a caller of `multimem_reduce`, with each rank providing its own rank id as the root.
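
Conceptually (a hedged Python-level sketch; the real ops are C++ and their signatures may differ):

```python
import torch.distributed as dist

def multimem_reduce(inp, out, root, group):
    """Hypothetical stand-in for the C++ op: reduce `inp` across ranks over the
    multicast (multimem) buffer, depositing the result on `root`."""
    ...

def multimem_one_shot_all_reduce(inp, out, group):
    # One-shot all-reduce is just the reduce with every rank naming itself as root,
    # so every rank ends up holding the full sum.
    return multimem_reduce(inp, out, root=dist.get_rank(group), group=group)
```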

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517
Approved by: https://github.com/ngimel
2025-10-08 05:25:16 +00:00
f505caa71b Revert "multimem reduce (#164517)"
This reverts commit d1cbb74fb16406488a174832e1b58b7c242f418d.

Reverted https://github.com/pytorch/pytorch/pull/164517 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164517#issuecomment-3378529654))
2025-10-07 20:12:38 +00:00
d1cbb74fb1 multimem reduce (#164517)
Modified the `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.

The original `multimem_one_shot_all_reduce` op becomes a caller of `multimem_reduce`, with each rank providing its own rank id as the root.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517
Approved by: https://github.com/ngimel
2025-10-03 02:41:10 +00:00
ed90040d33 Releases multicast object before releasing mapped buffers in CUDASymmetricMemory (#163750)
Fixes: https://github.com/pytorch/pytorch/issues/162429. On B200, cuMulticastUnbind can error if the mapped buffers are freed before the multicast object is freed. The only documentation I could find is here: e11d7f77c1/src/transport/nvls.cc (L113).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163750
Approved by: https://github.com/ngimel, https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/nWEIdia, https://github.com/cyyever
ghstack dependencies: #163575
2025-10-01 09:07:48 +00:00
8bb71c07c4 Skip symmetric memory tests calling _scaled_mm on CCC < 8.9 (#164251)
This avoids them failing on e.g. A100 GPUs with
> RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+
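
A minimal sketch of such a guard (the helper and test names are mine; the capability check mirrors the error message above, and the ROCm MI300+ case is left aside for brevity):

```python
import unittest
import torch

def _supports_scaled_mm() -> bool:
    # torch._scaled_mm on CUDA needs compute capability 8.9 (Ada) or >= 9.0 (Hopper).
    if not torch.cuda.is_available() or torch.version.hip is not None:
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

class ScaledMMSymmMemTest(unittest.TestCase):
    @unittest.skipUnless(_supports_scaled_mm(), "_scaled_mm needs compute capability >= 8.9")
    def test_scaled_mm_over_symm_mem(self):
        ...  # would exercise torch._scaled_mm on symmetric-memory buffers here
```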

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164251
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-10-01 03:26:21 +00:00
e64dd8c694 [Fix] Adding missing f prefixes to formatted strings [4/N] (#164068)
As stated in the title.

* __->__ #164068
* #164067
* #164066
* #164065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164068
Approved by: https://github.com/Skylion007
2025-09-29 04:07:07 +00:00
96182faf96 [CI][Distributed][CUDA][Symm-Mem] Enable B200 Symm Mem Test (#162988)
Inspired by https://github.com/pytorch/pytorch/pull/162981 and motivated by https://github.com/pytorch/pytorch/pull/159323 taking a total of 20 hours to finish (and being unlikely to land any time soon due to https://github.com/pytorch/pytorch/issues/162178).

Creating this subtest to get *something distributed* running on B200.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162988
Approved by: https://github.com/malfet
2025-09-27 05:12:05 +00:00
22fcc8b76b [async_tp] Support mm+rs with scatter_dim matmul K by sharding B (#162794)
Current state: shape mismatch failure when fusing mm+rs with the scatter dim on the last mm dim.

Adds a separate path to handle the last dim for aten.mm; scaled_mm should be handled similarly but needs an additional PR, so the scaled_mm case is disabled via the matmul filter function.

Adds an inductor.config option for this change, True by default, for fast debuggability of the new path.
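
For reference, the unfused pattern the fused path must match numerically (a sketch using functional collectives; this is my own illustration, not the fused implementation):

```python
import torch
import torch.distributed._functional_collectives as funcol

def mm_reduce_scatter_reference(A, B, scatter_dim, group):
    # Unfused reference: every rank computes the full matmul, then the result is
    # reduce-scattered along `scatter_dim` (the previously failing case is the last dim).
    out = torch.mm(A, B)
    return funcol.reduce_scatter_tensor(out, "sum", scatter_dim, group)
```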

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162794
Approved by: https://github.com/fegin
2025-09-25 12:18:39 +00:00
f638854e1d [ROCm][SymmMem] re-enable UTs (#162811)
After the UT suite moved to `MultiProcContinuousTest`, the `skipIfRocm` decorator started failing rather than skipping UTs: we now spawn multiple processes before the skip decorator is taken into account, and the decorator raised an exception to exit the process, which the parent process treated as a crash rather than a skip. Additionally, in `MultiProcContinuousTest`, if one UT fails all subsequent ones are also skipped, which makes sense since there is one setup for the entire suite, but it showed up as many failing/skipped UTs in the parity.

I added multiprocess versions of the skip decorators for ROCm, including `skip_if_rocm_arch_multiprocess` and `skip_if_rocm_ver_lessthan_multiprocess`. These are needed because the symmetric memory feature is only supported on MI300 onwards, so we need to skip other archs, and some UTs only work after ROCm 7.0.
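
A hedged sketch of what such a multiprocess-friendly skip might look like (the real decorators live in PyTorch's internal test utilities and may differ; the arch probing below is an assumption):

```python
import functools
import unittest
import torch

def skip_if_rocm_arch_multiprocess(blocked_archs):
    """Hypothetical sketch: raise SkipTest inside the already-spawned worker so the
    parent records a skip instead of treating the child's exit as a crash."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            if torch.version.hip is not None:
                arch = torch.cuda.get_device_properties(0).gcnArchName.split(":")[0]
                if arch in blocked_archs:
                    raise unittest.SkipTest(f"not supported on ROCm arch {arch}")
            return fn(self, *args, **kwargs)
        return wrapper
    return decorator
```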

Fixes #161249
Fixes #161187
Fixes #161078
Fixes #160989
Fixes #160881
Fixes #160768
Fixes #160716
Fixes #160665
Fixes #160621
Fixes #160549
Fixes #160506
Fixes #160445
Fixes #160347
Fixes #160203
Fixes #160177
Fixes #160049
Fixes #159921
Fixes #159764
Fixes #159643
Fixes #159499
Fixes #159397
Fixes #159396
Fixes #159347
Fixes #159067
Fixes #159066
Fixes #158916
Fixes #158760
Fixes #158759
Fixes #158422
Fixes #158138
Fixes #158136
Fixes #158135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162811
Approved by: https://github.com/jeffdaily
2025-09-16 15:35:39 +00:00
c0142f5c06 [ROCm] Enabling several UTs (#161715)
All these UTs are working as is, just removing the skips:
- test_p2p_ipc
- test_repros.py: working, added fp8 support
- test_activation_checkpointing.py
- test_content_store.py
- test_cuda_multigpu.py
- test_compute_comm_reordering.py
- test_segment_reductions.py
- test_dataloader.py
- test_math_ops.py
- test_loop_ordering.py
- test_control_flow.py
- distributed_test.py
- test_mem_tracker.py
- test_fsdp_optim_state.py
- test_fully_shard_mixed_precision.py: skipped for < ROCm7.0
- test_aot_inductor_custom_ops.py
- test_c10d_ops_nccl.py
- test_eager_transforms.py
- test_sparse_csr.py
- test_inductor_collectives.py
- test_fake_tensor.py
- test_cupy_as_tensor.py
- test_cuda.py: enable UTs that are working
- test_matmul_cuda.py: enable UTs that are working

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715
Approved by: https://github.com/msaroufim

Co-authored-by: Mark Saroufim <marksaroufim@fb.com>
2025-09-09 15:49:21 +00:00
f044fa2902 [AsyncTP] Use assertEqual instead of allClose for bf16 tests (#162041)
The async TP result and the regular MM result are very close. If we adjust the allclose threshold, the test succeeds. This seems to indicate that the error comes from the numerical error of low-precision (bf16) computation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162041
Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel
ghstack dependencies: #162040
2025-09-08 16:12:52 +00:00
5b90e85112 [AsyncTP] Fixes AsyncMM (#162040)
The original implementation set beta to 1, which caused the out (C) buffer to be added to the output. Thus, if the output is not initialized to zero beforehand, the result can be incorrect.

Removing the alpha and beta fixes the issue.

Thanks @ngimel for figuring out the root cause.
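
To illustrate the GEMM alpha/beta convention involved (using torch.addmm as a stand-in; this is not the AsyncMM kernel itself):

```python
import torch

M, K, N = 4, 8, 3
a, b = torch.randn(M, K), torch.randn(K, N)
c = torch.full((M, N), 7.0)   # stale contents of an uninitialized output buffer

# GEMM convention: result = beta * C + alpha * (A @ B)
with_beta_1 = torch.addmm(c, a, b, beta=1.0, alpha=1.0)  # stale C leaks into the result
with_beta_0 = torch.addmm(c, a, b, beta=0.0, alpha=1.0)  # equivalent to plain A @ B

torch.testing.assert_close(with_beta_0, a @ b)
assert not torch.allclose(with_beta_1, a @ b)
```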

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162040
Approved by: https://github.com/danielvegamyhre
2025-09-08 10:53:59 +00:00
8235c4f65d Revert "[ROCm] Enabling several UTs (#161715)"
This reverts commit b9ba612f7a968f7b27e121ca8f4d0a4d954f5354.

Reverted https://github.com/pytorch/pytorch/pull/161715 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/159473, feel free to merge it back once conflicts are cleared ([comment](https://github.com/pytorch/pytorch/pull/161715#issuecomment-3264040604))
2025-09-07 21:03:17 +00:00
b9ba612f7a [ROCm] Enabling several UTs (#161715)
All these UTs are working as is, just removing the skips:
- test_p2p_ipc
- test_repros.py: working, added fp8 support
- test_activation_checkpointing.py
- test_content_store.py
- test_cuda_multigpu.py
- test_compute_comm_reordering.py
- test_segment_reductions.py
- test_dataloader.py
- test_math_ops.py
- test_loop_ordering.py
- test_control_flow.py
- distributed_test.py
- test_mem_tracker.py
- test_fsdp_optim_state.py
- test_fully_shard_mixed_precision.py: skipped for < ROCm7.0
- test_aot_inductor_custom_ops.py
- test_c10d_ops_nccl.py
- test_eager_transforms.py
- test_sparse_csr.py
- test_inductor_collectives.py
- test_fake_tensor.py
- test_cupy_as_tensor.py
- test_cuda.py: enable UTs that are working
- test_matmul_cuda.py: enable UTs that are working

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-09-04 20:43:03 +00:00
994f2a5dbc [SymmMem][CI] Make sure group names are consistent (#162035)
Unblocking #161741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162035
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-09-03 20:40:24 +00:00
c1e504ec2f [SymmMEM] Move AsyncTP tests to a seperate test class (#161820)
We move the AsyncTP tests to a separate test suite because 1) AsyncTP ops are not core symmetric memory APIs, they are more like applications, and 2) MultiProcContinuousTest skips all the following tests if a test fails (we should fix this too). We still want to get test signals for the core symmetric memory APIs when AsyncTP ops fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161820
Approved by: https://github.com/kwen2501
2025-08-30 00:40:40 +00:00
cd6d63f453 [SymmMEM] Fix test_empty_strided_p2p_persistent (#161677)
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id across different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.

https://github.com/pytorch/pytorch/pull/161668 should also fix the issue but we can land this PR for a safer test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161677
Approved by: https://github.com/kwen2501
ghstack dependencies: #161676
2025-08-29 16:11:58 +00:00
eec876deb6 [SymmMem] Isolate set_device tests to avoid hang (#161668)
`test_symmetric_memory.py` hangs like this:
```
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_False PASSED [5.6364s]
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_True ...
```

This set of tests parameterizes whether the user sets the device before calling `symm_mem.empty`.
However, such parametrization does not work well with `MultiProcContinuousTest` because the set device "contaminates" the next test function.

The solution is to move the "set device" tests to a separate test suite using the traditional `MultiProcessTestCase`, which respawns processes every time.

The hang is gone now.
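
To illustrate the contamination (hypothetical test bodies, not the actual suite):

```python
import torch

class SymmMemSetDeviceSketch:
    """Hypothetical test methods illustrating the issue: under MultiProcContinuousTest
    the same worker processes run every test, so device state set by one parametrized
    variant leaks into the next."""

    rank: int = 0

    def test_persistent_set_device_True(self):
        torch.cuda.set_device(self.rank)   # sticks to this worker process...

    def test_persistent_set_device_False(self):
        pass                               # ...so this variant no longer starts on the default device
```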

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161668
Approved by: https://github.com/fegin
2025-08-28 05:43:49 +00:00
779fc29c04 [C10D] Fix spelling of MultiProcContinuousTest (#160892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160892
Approved by: https://github.com/fduwjj
2025-08-19 20:17:19 +00:00
a9f902add0 [CUDA] Use runtime driver API for cuStreamWriteValue32 (#158295)
Reopen https://github.com/pytorch/pytorch/pull/156097

Fixes https://github.com/pytorch/pytorch/issues/154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR https://github.com/pytorch/pytorch/pull/156097 and https://github.com/pytorch/pytorch/pull/154097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-07-16 23:14:36 +00:00
702a304b07 Revert "[CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit 9a5278225fc5e7b46d54a65ae1a3f049ee49824f.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/ngimel due to breaks 525 driver installs ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-3063742807))
2025-07-11 20:36:36 +00:00
0d17029fea [BE][6/6] fix typos in test/ (test/distributed/) (#157640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157640
Approved by: https://github.com/yewentao256, https://github.com/malfet
2025-07-11 14:09:37 +00:00
9a5278225f [CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/syed-ahmed, https://github.com/wujingyue, https://github.com/atalman

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-07-10 14:38:18 +00:00
fc10d4b1d6 [SymmMem] Allow selection of allocation backend (#156661)
Today the only way to choose the allocation backend is via the env var `TORCH_SYMMMEM=...`.
This is a bit hard to set in CI on a per-test-file basis (the env var has to be set before the program is loaded).

This PR adds a programmatic way -- a `set_backend` API.

Implementation:
Since this API is slightly more dynamic than static registration, at static time each backend registers its availability rather than filling itself in as **the** allocator directly. Later, when `set_backend` is called, the allocator actually fills in the device-to-allocation `map_`.

Though added, `set_backend` is **not** a necessary API for the user to call -- one backend is still registered as the default at static time.
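
A hedged usage sketch contrasting the two routes (the Python module path and the backend string below are assumptions based on the description; check your build):

```python
import os

# Env-var route: must be exported before the process loads torch, e.g. in CI job config:
#   TORCH_SYMMMEM=NVSHMEM python test_symmetric_memory.py
os.environ.setdefault("TORCH_SYMMMEM", "NVSHMEM")

# Programmatic route added by this PR: selectable per test file from Python.
import torch.distributed._symmetric_memory as symm_mem

symm_mem.set_backend("NVSHMEM")   # backend name is an assumption; other backends may exist
```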

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156661
Approved by: https://github.com/ngimel, https://github.com/fduwjj
2025-06-26 21:37:44 +00:00
4585c33e74 [symm_mem] Fix nccl test for symm mem (#156752)
Try not to call set_device. Fixes #156569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156752
Approved by: https://github.com/kwen2501
2025-06-26 02:59:38 +00:00
e583b88819 Revert "[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit ac86ec0e60370c037e018137f2048cafd47c5c28.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/atalman due to internal breakage ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-2997314638))
2025-06-23 17:36:44 +00:00
276c790010 [ROCm][SymmetricMemory] Avoid bf16 to float conversion during reduce (#155587)
This PR helps improve the performance of one-shot and two-shot allreduce as reported here: https://github.com/pytorch/FBGEMM/issues/4072

One-Shot:
![image](https://github.com/user-attachments/assets/69fe0d53-6636-42e1-90e0-e5efb989f59f)
As shown in the numbers presented above, symmetric memory performance prior to this PR (baseline) was on average about 26% lower than fbgemm's numbers reported in the issue above. After this PR, we are seeing a 16% improvement on average compared to fbgemm and 59% compared to our baseline numbers.

Two-Shot:
![image](https://github.com/user-attachments/assets/e5c8a288-303e-4d50-814b-4348e589e1fc)
Similarly, in two-shot, we were originally underperforming by 12%. After this PR we have improved by 22% compared to the symmetric memory performance prior to this PR. However, two-shot performance is still about 23% lower than fbgemm; this work is still in progress and those changes will be pushed through a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155587
Approved by: https://github.com/jeffdaily
2025-06-23 16:14:01 +00:00
ac86ec0e60 [Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/ngimel

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-06-21 01:34:41 +00:00
bfccfa0b31 Revert "[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit cf90c9f8d1632777ec5f4b6ccaa14bc5bf259e9c.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/atalman due to break internal tests ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-2985785811))
2025-06-18 21:48:50 +00:00
cf90c9f8d1 [Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/ngimel, https://github.com/cyyever

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-06-17 14:15:49 +00:00
a2a75be0f8 Rename inductor cache (#156128)
Requested by Simon on a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
48807d568e [CI][CUDA] Migrate remaining cu118 jobs to cu128 (#154169)
Contributing to the fix of #147383   and #154119

Additional steps required: the cu118 reference in 3218b1b684/.github/workflows/lint.yml needs to be updated, and install_cuda.sh is made to accept both 12.8 and 12.8.* as the CUDA_VERSION argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154169
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman, https://github.com/tinglvv
2025-06-02 20:22:14 +00:00
062387fb53 [SymmMem] Speed up tests (#153677)
Use `MultiProcContinousTest` to avoid re-creating the ProcessGroup in each test instance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153677
Approved by: https://github.com/fegin, https://github.com/Skylion007, https://github.com/ngimel
ghstack dependencies: #153653
2025-05-26 03:39:11 +00:00
7128b50a65 [CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6 (#151594)
This PR moves distributed cuda CI job from cuda 11.8 to cuda 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so they are temporarily skipped after creating the issues below:

https://github.com/pytorch/pytorch/issues/153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
https://github.com/pytorch/pytorch/issues/153122 CUDA context related
https://github.com/pytorch/pytorch/issues/153517  NCCL regression, future NCCL may fix it
https://github.com/pytorch/pytorch/issues/154073 skip test_symmetric_memory for cuda 12.6 before it is fixed

See: https://github.com/pytorch/pytorch/issues/147383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever, https://github.com/huydhn, https://github.com/kwen2501
2025-05-22 06:33:29 +00:00
498f364518 Fix test_fused_scaled_matmul_reduce_scatter when scatter_dim is 0 (#153286)
The function signature of fused_scaled_matmul_reduce_scatter was changed. This PR fixes the function signature. However, when scatter_dim is 1, the two outputs are not close; we need a follow-up on this.

Another follow-up is to change fused_scaled_matmul_reduce_scatter to make the newly added arguments optional. Users shouldn't need to pass these arguments if they don't flatten the inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153286
Approved by: https://github.com/kwen2501
2025-05-12 17:38:49 +00:00
fe8ebacee4 [ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151368
Approved by: https://github.com/jeffdaily, https://github.com/malfet

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-08 16:12:16 +00:00
1ea2731e26 [ROCm] Add support for SymmetricMemory (#150580)
This is an attempt to re-land the initial PR https://github.com/pytorch/pytorch/pull/134817 with recent design changes from upstream.

**NOTE:**
ROCm does NOT have multicast/multimem hardware support at the moment, so those features are disabled in symmetric memory for ROCm. This also means that we currently do not have a way of lowering add + all_reduce + wait_tensor into a one_shot_all_reduce op in inductor, as that lowering depends on multicast buffer support.
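
The eager pattern referred to above, written with functional collectives (a sketch of the pattern only; the lowering itself is inductor-internal):

```python
import torch
import torch.distributed._functional_collectives as funcol

def add_allreduce_wait(x: torch.Tensor, y: torch.Tensor, group) -> torch.Tensor:
    # add -> all_reduce -> wait_tensor: on hardware with multicast support,
    # inductor can lower this chain into a one_shot_all_reduce op; on ROCm it
    # currently cannot, since that lowering depends on a multicast buffer.
    t = x + y
    t = funcol.all_reduce(t, "sum", group)
    return funcol.wait_tensor(t)
```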

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150580
Approved by: https://github.com/jeffdaily, https://github.com/kwen2501, https://github.com/yoyoyocmu

Co-authored-by: Xiaodong Wang <xdwang@fb.com>
2025-05-02 18:35:14 +00:00
7e5f6dcf7f Add @requires_multicast_support to test_multimem_all_gather (#151227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151227
Approved by: https://github.com/jeffdaily
2025-04-15 18:41:12 +00:00
d04a6ec021 add reduce_scatter to symm mem ops (#150813)
+ a few small fixes (don't error out on 0-element tensors, a few more checks for contiguous outputs, more threads for better perf).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150813
Approved by: https://github.com/xw285cornell
2025-04-09 17:59:17 +00:00
1700599266 Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)
Per title, we want to be able to use it even if the inputs are not registered. A separate copy would add latency, and one-shot is all about the lowest possible latency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150129
Approved by: https://github.com/xw285cornell
2025-04-01 05:36:43 +00:00
414b9ae016 enable out variant of 2-shot reduction (#150153)
Per title, this version uses the symm mem input both as the input source and as a work buffer, so the input is modified after the op completes (similar to what the fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that aren't exposed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153
Approved by: https://github.com/xw285cornell
2025-04-01 05:36:04 +00:00
57fa99c5c3 Revert "enable out variant of 2-shot reduction (#150153)"
This reverts commit cdeb32d2d1c31b60c65133e83510977c5c180005.

Reverted https://github.com/pytorch/pytorch/pull/150153 on behalf of https://github.com/clee2000 due to failing internal builds D72083877 ([comment](https://github.com/pytorch/pytorch/pull/150153#issuecomment-2766633712))
2025-03-31 15:43:24 +00:00
e57fa18b40 Revert "Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)"
This reverts commit 8a872261dcb3797557d1965af6832677a77efec1.

Reverted https://github.com/pytorch/pytorch/pull/150129 on behalf of https://github.com/clee2000 due to breaking internal builds D72080428 ([comment](https://github.com/pytorch/pytorch/pull/150129#issuecomment-2766619006))
2025-03-31 15:37:54 +00:00
cdeb32d2d1 enable out variant of 2-shot reduction (#150153)
Per title, this version uses the symm mem input both as the input source and as a work buffer, so the input is modified after the op completes (similar to what the fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that aren't exposed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153
Approved by: https://github.com/xw285cornell
2025-03-28 19:06:03 +00:00
8a872261dc Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)
Per title, we want to be able to use it even if the inputs are not registered. A separate copy would add latency, and one-shot is all about the lowest possible latency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150129
Approved by: https://github.com/xw285cornell
2025-03-28 02:14:27 +00:00
db33d23aa8 [SymmetricMemory] fix an issue where rendezvous is performed with wrong device context when torch.cuda.set_device() is not callled (#144886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144886
Approved by: https://github.com/awgu
2025-01-28 01:43:37 +00:00
8f3eb84373 ROCm: Enable 4 gpu tests for distributed config (#140319)
Change the label to make sure the jobs land on a node which has >= 4 GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140319
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/kwen2501
2025-01-02 17:22:11 +00:00
af190479c8 [fused_all_gather_matmul] use _multimem_all_gather_matmul for small global Ms (#143160)
## Benchmark
M=2048, N=3584, K=8192

baseline (nccl + cublas): 301us
decomp-based async-tp: 354us
comm-aware async-tp: 295us
**multimem_all_gather matmul: 277us**

As M decreases further, the multimem_all_gather approach consistently outperforms the baseline and the other approaches (the other approaches are omitted from the chart as they start to be slower than the baseline):
![image](https://github.com/user-attachments/assets/5811455a-68c9-43fe-9d82-ca488dd77bc1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143160
Approved by: https://github.com/weifengpy
ghstack dependencies: #142283, #142810, #143159
2024-12-17 01:07:27 +00:00