Commit Graph

103 Commits

d7e275d4b4 [CI][CUDA] Add periodic b200 distributed job (#159323)
1. Run the distributed job with the B200 runner, periodically.
2. Discovered a generic distributed-test issue: certain unit tests hard-code ranks, which calls for a require_exact_world_size(world_size) API instead of require_world_size(world_size) (a hedged sketch follows below).
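
A hedged sketch of what such a decorator pair might look like (the names come from the commit message; the implementation below is an assumption, not PyTorch's actual code, and it assumes the test class exposes `self.world_size`):

```python
import functools
import unittest

def require_exact_world_size(n):
    """Hypothetical sketch: skip unless the test runs with exactly n ranks,
    for tests that hard-code rank ids and break on larger world sizes."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            if self.world_size != n:
                raise unittest.SkipTest(f"requires world_size == {n}, got {self.world_size}")
            return fn(self, *args, **kwargs)
        return wrapper
    return decorator

def require_world_size(n):
    """Hypothetical sketch: skip only when fewer than n ranks are available."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            if self.world_size < n:
                raise unittest.SkipTest(f"requires world_size >= {n}, got {self.world_size}")
            return fn(self, *args, **kwargs)
        return wrapper
    return decorator
```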

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159323
Approved by: https://github.com/eqy

Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com>
2025-10-16 21:54:04 +00:00
585b9dbb5e [async_tp] Support ag+mm with gather_dim lastdim of mat_A (#163068)
Adds ag+mm support for the case where gather_dim is the last dim of the matmul (the reduction dim).

When we decompose the matmul along the reduction dimension, we end up with partials that need an additional reduction, so we allocate memory for an accumulator.

The decomposition should not produce small (thin) mms that cannot load the GPU efficiently, so the minimal shard size is limited to 1024 (found empirically by testing in torchtitan).

scaled_mm is not supported yet for this case.
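
A single-process sketch of the decomposition described above (my own illustration under assumed shapes, not the PR's kernel): the all-gather along the reduction dim is replaced by per-shard partial matmuls accumulated into a preallocated buffer.

```python
import torch

def ag_mm_reference(a_shards, B):
    # Unfused reference: all_gather A along its last (reduction) dim, then matmul.
    A = torch.cat(a_shards, dim=-1)   # [M, K]
    return A @ B                      # [M, N]

def ag_mm_decomposed(a_shards, B):
    # Decomposed form: each shard contributes a partial product over its K-slice;
    # partials are accumulated into a preallocated buffer (overlapped with comm in the real op).
    M, k_shard = a_shards[0].shape
    acc = torch.zeros(M, B.shape[1], dtype=B.dtype)
    for i, a_i in enumerate(a_shards):
        acc += a_i @ B[i * k_shard:(i + 1) * k_shard]
    return acc

world_size, M, K, N = 4, 64, 4096, 128
a_shards = [torch.randn(M, K // world_size) for _ in range(world_size)]
B = torch.randn(K, N)
torch.testing.assert_close(ag_mm_decomposed(a_shards, B), ag_mm_reference(a_shards, B),
                           rtol=1e-3, atol=1e-3)
```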

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163068
Approved by: https://github.com/ngimel
2025-10-16 20:14:39 +00:00
19bf67be32 multimem reduce (#164517)
Modified the `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.

The original `multimem_one_shot_all_reduce` op becomes a caller of `multimem_reduce`, with each rank providing its own rank id as the root.
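
Conceptually (a hedged Python-level sketch; the real ops are C++ and their signatures may differ):

```python
import torch.distributed as dist

def multimem_reduce(inp, out, root, group):
    """Hypothetical stand-in for the C++ op: reduce `inp` across ranks over the
    multicast (multimem) buffer, depositing the result on `root`."""
    ...

def multimem_one_shot_all_reduce(inp, out, group):
    # One-shot all-reduce is just the reduce with every rank naming itself as root,
    # so every rank ends up holding the full sum.
    return multimem_reduce(inp, out, root=dist.get_rank(group), group=group)
```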

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517
Approved by: https://github.com/ngimel
2025-10-08 05:25:16 +00:00
f505caa71b Revert "multimem reduce (#164517)"
This reverts commit d1cbb74fb16406488a174832e1b58b7c242f418d.

Reverted https://github.com/pytorch/pytorch/pull/164517 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164517#issuecomment-3378529654))
2025-10-07 20:12:38 +00:00
d1cbb74fb1 multimem reduce (#164517)
Modified the `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.

The original `multimem_one_shot_all_reduce` op becomes a caller of `multimem_reduce`, with each rank providing its own rank id as the root.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517
Approved by: https://github.com/ngimel
2025-10-03 02:41:10 +00:00
ed90040d33 Releases multicast object before releasing mapped buffers in CUDASymmetricMemory (#163750)
Fixes: https://github.com/pytorch/pytorch/issues/162429. On B200, cuMulticastUnbind can error if the mapped buffers are freed before the multicast object is freed. The only documentation I could find is here: e11d7f77c1/src/transport/nvls.cc (L113).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163750
Approved by: https://github.com/ngimel, https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/nWEIdia, https://github.com/cyyever
ghstack dependencies: #163575
2025-10-01 09:07:48 +00:00
8bb71c07c4 Skip symmetric memory tests calling _scaled_mm on CCC < 8.9 (#164251)
This avoids them failing on e.g. A100 GPUs with
> RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+
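
A minimal sketch of such a guard (the helper and test names are mine; the capability check mirrors the error message above, and the ROCm MI300+ case is left aside for brevity):

```python
import unittest
import torch

def _supports_scaled_mm() -> bool:
    # torch._scaled_mm on CUDA needs compute capability 8.9 (Ada) or >= 9.0 (Hopper).
    if not torch.cuda.is_available() or torch.version.hip is not None:
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

class ScaledMMSymmMemTest(unittest.TestCase):
    @unittest.skipUnless(_supports_scaled_mm(), "_scaled_mm needs compute capability >= 8.9")
    def test_scaled_mm_over_symm_mem(self):
        ...  # would exercise torch._scaled_mm on symmetric-memory buffers here
```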

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164251
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-10-01 03:26:21 +00:00
e64dd8c694 [Fix] Adding missing f prefixes to formatted strings [4/N] (#164068)
As stated in the title.

* __->__ #164068
* #164067
* #164066
* #164065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164068
Approved by: https://github.com/Skylion007
2025-09-29 04:07:07 +00:00
96182faf96 [CI][Distributed][CUDA][Symm-Mem] Enable B200 Symm Mem Test (#162988)
Inspired by https://github.com/pytorch/pytorch/pull/162981 and motivated by https://github.com/pytorch/pytorch/pull/159323 taking a total of 20 hours to finish (and being unlikely to land any time soon due to https://github.com/pytorch/pytorch/issues/162178).

Creating this subtest to get *something distributed* running on B200.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162988
Approved by: https://github.com/malfet
2025-09-27 05:12:05 +00:00
22fcc8b76b [async_tp] Support mm+rs with scatter_dim matmul K by sharding B (#162794)
Current state: shape mismatch failure when fusing mm+rs with the scatter dim on the last mm dim.

Adds a separate path to handle the last dim for aten.mm; scaled_mm should be handled similarly but needs an additional PR, so the scaled_mm case is disabled via the matmul filter function.

Adds an inductor.config option for this change, True by default, for fast debuggability of the new path.
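
For reference, the unfused pattern the fused path must match numerically (a sketch using functional collectives; this is my own illustration, not the fused implementation):

```python
import torch
import torch.distributed._functional_collectives as funcol

def mm_reduce_scatter_reference(A, B, scatter_dim, group):
    # Unfused reference: every rank computes the full matmul, then the result is
    # reduce-scattered along `scatter_dim` (the previously failing case is the last dim).
    out = torch.mm(A, B)
    return funcol.reduce_scatter_tensor(out, "sum", scatter_dim, group)
```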

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162794
Approved by: https://github.com/fegin
2025-09-25 12:18:39 +00:00
f638854e1d [ROCm][SymmMem] re-enable UTs (#162811)
After the UT suite moved to `MultiProcContinuousTest`, the `skipIfRocm` decorator started failing rather than skipping UTs: we now spawn multiple processes before the skip decorator is taken into account, and the decorator raised an exception to exit the process, which the parent process treated as a crash rather than a skip. Additionally, in `MultiProcContinuousTest`, if one UT fails all subsequent ones are also skipped, which makes sense since there is one setup for the entire suite, but it showed up as many failing/skipped UTs in the parity.

I added multiprocess versions of the skip decorators for ROCm, including `skip_if_rocm_arch_multiprocess` and `skip_if_rocm_ver_lessthan_multiprocess`. These are needed because the symmetric memory feature is only supported on MI300 onwards, so we need to skip other archs, and some UTs only work after ROCm 7.0.
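
A hedged sketch of what such a multiprocess-friendly skip might look like (the real decorators live in PyTorch's internal test utilities and may differ; the arch probing below is an assumption):

```python
import functools
import unittest
import torch

def skip_if_rocm_arch_multiprocess(blocked_archs):
    """Hypothetical sketch: raise SkipTest inside the already-spawned worker so the
    parent records a skip instead of treating the child's exit as a crash."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            if torch.version.hip is not None:
                arch = torch.cuda.get_device_properties(0).gcnArchName.split(":")[0]
                if arch in blocked_archs:
                    raise unittest.SkipTest(f"not supported on ROCm arch {arch}")
            return fn(self, *args, **kwargs)
        return wrapper
    return decorator
```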

Fixes #161249
Fixes #161187
Fixes #161078
Fixes #160989
Fixes #160881
Fixes #160768
Fixes #160716
Fixes #160665
Fixes #160621
Fixes #160549
Fixes #160506
Fixes #160445
Fixes #160347
Fixes #160203
Fixes #160177
Fixes #160049
Fixes #159921
Fixes #159764
Fixes #159643
Fixes #159499
Fixes #159397
Fixes #159396
Fixes #159347
Fixes #159067
Fixes #159066
Fixes #158916
Fixes #158760
Fixes #158759
Fixes #158422
Fixes #158138
Fixes #158136
Fixes #158135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162811
Approved by: https://github.com/jeffdaily
2025-09-16 15:35:39 +00:00
c0142f5c06 [ROCm] Enabling several UTs (#161715)
All these UTs are working as is, just removing the skips:
- test_p2p_ipc
- test_repros.py: working, added fp8 support
- test_activation_checkpointing.py
- test_content_store.py
- test_cuda_multigpu.py
- test_compute_comm_reordering.py
- test_segment_reductions.py
- test_dataloader.py
- test_math_ops.py
- test_loop_ordering.py
- test_control_flow.py
- distributed_test.py
- test_mem_tracker.py
- test_fsdp_optim_state.py
- test_fully_shard_mixed_precision.py: skipped for < ROCm7.0
- test_aot_inductor_custom_ops.py
- test_c10d_ops_nccl.py
- test_eager_transforms.py
- test_sparse_csr.py
- test_inductor_collectives.py
- test_fake_tensor.py
- test_cupy_as_tensor.py
- test_cuda.py: enable UTs that are working
- test_matmul_cuda.py: enable UTs that are working

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715
Approved by: https://github.com/msaroufim

Co-authored-by: Mark Saroufim <marksaroufim@fb.com>
2025-09-09 15:49:21 +00:00
f044fa2902 [AsyncTP] Use assertEqual instead of allClose for bf16 tests (#162041)
The async TP result and the regular MM result are very close. If we adjust the allclose threshold, the test succeeds. This seems to indicate that the error comes from the numerical error of low-precision (bf16) computation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162041
Approved by: https://github.com/danielvegamyhre, https://github.com/ngimel
ghstack dependencies: #162040
2025-09-08 16:12:52 +00:00
5b90e85112 [AsyncTP] Fixes AsyncMM (#162040)
The original implementation set beta to 1, which caused the out (C) buffer to be added to the output. Thus, if the output is not initialized to zero beforehand, the result can be incorrect.

Removing the alpha and beta fixes the issue.

Thanks @ngimel for figuring out the root cause.
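
To illustrate the GEMM alpha/beta convention involved (using torch.addmm as a stand-in; this is not the AsyncMM kernel itself):

```python
import torch

M, K, N = 4, 8, 3
a, b = torch.randn(M, K), torch.randn(K, N)
c = torch.full((M, N), 7.0)   # stale contents of an uninitialized output buffer

# GEMM convention: result = beta * C + alpha * (A @ B)
with_beta_1 = torch.addmm(c, a, b, beta=1.0, alpha=1.0)  # stale C leaks into the result
with_beta_0 = torch.addmm(c, a, b, beta=0.0, alpha=1.0)  # equivalent to plain A @ B

torch.testing.assert_close(with_beta_0, a @ b)
assert not torch.allclose(with_beta_1, a @ b)
```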

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162040
Approved by: https://github.com/danielvegamyhre
2025-09-08 10:53:59 +00:00
8235c4f65d Revert "[ROCm] Enabling several UTs (#161715)"
This reverts commit b9ba612f7a968f7b27e121ca8f4d0a4d954f5354.

Reverted https://github.com/pytorch/pytorch/pull/161715 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/159473, feel free to merge it back once conflicts are cleared ([comment](https://github.com/pytorch/pytorch/pull/161715#issuecomment-3264040604))
2025-09-07 21:03:17 +00:00
b9ba612f7a [ROCm] Enabling several UTs (#161715)
All these UTs are working as is, just removing the skips:
- test_p2p_ipc
- test_repros.py: working, added fp8 support
- test_activation_checkpointing.py
- test_content_store.py
- test_cuda_multigpu.py
- test_compute_comm_reordering.py
- test_segment_reductions.py
- test_dataloader.py
- test_math_ops.py
- test_loop_ordering.py
- test_control_flow.py
- distributed_test.py
- test_mem_tracker.py
- test_fsdp_optim_state.py
- test_fully_shard_mixed_precision.py: skipped for < ROCm7.0
- test_aot_inductor_custom_ops.py
- test_c10d_ops_nccl.py
- test_eager_transforms.py
- test_sparse_csr.py
- test_inductor_collectives.py
- test_fake_tensor.py
- test_cupy_as_tensor.py
- test_cuda.py: enable UTs that are working
- test_matmul_cuda.py: enable UTs that are working

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161715
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-09-04 20:43:03 +00:00
994f2a5dbc [SymmMem][CI] Make sure group names are consistent (#162035)
Unblocking #161741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162035
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2025-09-03 20:40:24 +00:00
c1e504ec2f [SymmMEM] Move AsyncTP tests to a seperate test class (#161820)
We move the AsyncTP tests to a separate test suite because 1) AsyncTP ops are not core symmetric memory APIs, they are more like applications, and 2) MultiProcContinuousTest skips all the following tests if a test fails (we should fix this too). We still want to get test signals for the core symmetric memory APIs when AsyncTP ops fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161820
Approved by: https://github.com/kwen2501
2025-08-30 00:40:40 +00:00
cd6d63f453 [SymmMEM] Fix test_empty_strided_p2p_persistent (#161677)
test_empty_strided_p2p_persistent allocates persistent symm memory tensors. However, it uses the same alloc_id across different tests, which could cause trouble if these tests are run in the same process. This PR fixes the issue by using a different alloc_id for each test.

https://github.com/pytorch/pytorch/pull/161668 should also fix the issue but we can land this PR for a safer test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161677
Approved by: https://github.com/kwen2501
ghstack dependencies: #161676
2025-08-29 16:11:58 +00:00
eec876deb6 [SymmMem] Isolate set_device tests to avoid hang (#161668)
`test_symmetric_memory.py` hangs like this:
```
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_False PASSED [5.6364s]
SymmetricMemoryTest::test_empty_strided_p2p_persistent_set_device_True ...
```

This set of tests parameterizes whether the user sets the device before calling `symm_mem.empty`.
However, such parametrization does not work well with `MultiProcContinuousTest` because the set device "contaminates" the next test function.

The solution is to move the "set device" tests to a separate test suite using the traditional `MultiProcessTestCase`, which respawns processes every time.

The hang is gone now.
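
To illustrate the contamination (hypothetical test bodies, not the actual suite):

```python
import torch

class SymmMemSetDeviceSketch:
    """Hypothetical test methods illustrating the issue: under MultiProcContinuousTest
    the same worker processes run every test, so device state set by one parametrized
    variant leaks into the next."""

    rank: int = 0

    def test_persistent_set_device_True(self):
        torch.cuda.set_device(self.rank)   # sticks to this worker process...

    def test_persistent_set_device_False(self):
        pass                               # ...so this variant no longer starts on the default device
```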

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161668
Approved by: https://github.com/fegin
2025-08-28 05:43:49 +00:00
779fc29c04 [C10D] Fix spelling of MultiProcContinuousTest (#160892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160892
Approved by: https://github.com/fduwjj
2025-08-19 20:17:19 +00:00
a9f902add0 [CUDA] Use runtime driver API for cuStreamWriteValue32 (#158295)
Reopen https://github.com/pytorch/pytorch/pull/156097

Fixes https://github.com/pytorch/pytorch/issues/154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR https://github.com/pytorch/pytorch/pull/156097 and https://github.com/pytorch/pytorch/pull/154097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158295
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/eqy, https://github.com/huydhn

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-07-16 23:14:36 +00:00
702a304b07 Revert "[CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit 9a5278225fc5e7b46d54a65ae1a3f049ee49824f.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/ngimel due to breaks 525 driver installs ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-3063742807))
2025-07-11 20:36:36 +00:00
0d17029fea [BE][6/6] fix typos in test/ (test/distributed/) (#157640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157640
Approved by: https://github.com/yewentao256, https://github.com/malfet
2025-07-11 14:09:37 +00:00
9a5278225f [CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/syed-ahmed, https://github.com/wujingyue, https://github.com/atalman

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-07-10 14:38:18 +00:00
fc10d4b1d6 [SymmMem] Allow selection of allocation backend (#156661)
Today the only way to choose the allocation backend is via the env var `TORCH_SYMMMEM=...`.
This is a bit hard to set in CI on a per-test-file basis (the env var has to be set before the program is loaded).

This PR adds a programmatic way -- a `set_backend` API.

Implementation:
Since this API is slightly more dynamic than static registration, at static time each backend registers its availability rather than filling itself in as **the** allocator directly. Later, when `set_backend` is called, the allocator actually fills in the device-to-allocation `map_`.

Though added, `set_backend` is **not** a necessary API for the user to call -- one backend is still registered as the default at static time.
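
A hedged usage sketch contrasting the two routes (the Python module path and the backend string below are assumptions based on the description; check your build):

```python
import os

# Env-var route: must be exported before the process loads torch, e.g. in CI job config:
#   TORCH_SYMMMEM=NVSHMEM python test_symmetric_memory.py
os.environ.setdefault("TORCH_SYMMMEM", "NVSHMEM")

# Programmatic route added by this PR: selectable per test file from Python.
import torch.distributed._symmetric_memory as symm_mem

symm_mem.set_backend("NVSHMEM")   # backend name is an assumption; other backends may exist
```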

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156661
Approved by: https://github.com/ngimel, https://github.com/fduwjj
2025-06-26 21:37:44 +00:00
4585c33e74 [symm_mem] Fix nccl test for symm mem (#156752)
Try not to call set_device. Fixes #156569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156752
Approved by: https://github.com/kwen2501
2025-06-26 02:59:38 +00:00
e583b88819 Revert "[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit ac86ec0e60370c037e018137f2048cafd47c5c28.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/atalman due to internal breakage ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-2997314638))
2025-06-23 17:36:44 +00:00
276c790010 [ROCm][SymmetricMemory] Avoid bf16 to float conversion during reduce (#155587)
This PR helps improve the performance of one-shot and two-shot allreduce as reported here: https://github.com/pytorch/FBGEMM/issues/4072

One-Shot:
![image](https://github.com/user-attachments/assets/69fe0d53-6636-42e1-90e0-e5efb989f59f)
As shown in the numbers presented above, symmetric memory performance prior to this PR (baseline) was on average about 26% lower than fbgemm's numbers reported in the issue above. After this PR, we are seeing a 16% improvement on average compared to fbgemm and 59% compared to our baseline numbers.

Two-Shot:
![image](https://github.com/user-attachments/assets/e5c8a288-303e-4d50-814b-4348e589e1fc)
Similarly, in two-shot, we were originally underperforming by 12%. After this PR we have improved by 22% compared to the symmetric memory performance prior to this PR. However, two-shot performance is still about 23% lower than fbgemm; this work is still in progress and those changes will be pushed through a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155587
Approved by: https://github.com/jeffdaily
2025-06-23 16:14:01 +00:00
ac86ec0e60 [Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/ngimel

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-06-21 01:34:41 +00:00
bfccfa0b31 Revert "[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit cf90c9f8d1632777ec5f4b6ccaa14bc5bf259e9c.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/atalman due to break internal tests ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-2985785811))
2025-06-18 21:48:50 +00:00
cf90c9f8d1 [Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/ngimel, https://github.com/cyyever

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-06-17 14:15:49 +00:00
a2a75be0f8 Rename inductor cache (#156128)
Requested by Simon on a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
48807d568e [CI][CUDA] Migrate remaining cu118 jobs to cu128 (#154169)
Contributing to the fix of #147383   and #154119

Additional steps required: the cu118 reference in 3218b1b684/.github/workflows/lint.yml needs to be updated, and install_cuda.sh is made to accept both 12.8 and 12.8.* as the CUDA_VERSION argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154169
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman, https://github.com/tinglvv
2025-06-02 20:22:14 +00:00
062387fb53 [SymmMem] Speed up tests (#153677)
Use `MultiProcContinousTest` to avoid re-creating the ProcessGroup in each test instance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153677
Approved by: https://github.com/fegin, https://github.com/Skylion007, https://github.com/ngimel
ghstack dependencies: #153653
2025-05-26 03:39:11 +00:00
7128b50a65 [CI][CUDA][Distributed] Move cuda 11.8 distributed pull jobs to cuda 12.6 (#151594)
This PR moves distributed cuda CI job from cuda 11.8 to cuda 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so they are temporarily skipped after creating the issues below:

https://github.com/pytorch/pytorch/issues/153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
https://github.com/pytorch/pytorch/issues/153122 CUDA context related
https://github.com/pytorch/pytorch/issues/153517  NCCL regression, future NCCL may fix it
https://github.com/pytorch/pytorch/issues/154073 skip test_symmetric_memory for cuda 12.6 before it is fixed

See: https://github.com/pytorch/pytorch/issues/147383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever, https://github.com/huydhn, https://github.com/kwen2501
2025-05-22 06:33:29 +00:00
498f364518 Fix test_fused_scaled_matmul_reduce_scatter when scatter_dim is 0 (#153286)
The function signature of fused_scaled_matmul_reduce_scatter was changed. This PR fixes the function signature. However, when scatter_dim is 1, the two outputs are not close; we need a follow-up on this.

Another follow-up is to change fused_scaled_matmul_reduce_scatter to make the newly added arguments optional. Users shouldn't need to pass these arguments if they don't flatten the inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153286
Approved by: https://github.com/kwen2501
2025-05-12 17:38:49 +00:00
fe8ebacee4 [ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151368
Approved by: https://github.com/jeffdaily, https://github.com/malfet

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-08 16:12:16 +00:00
1ea2731e26 [ROCm] Add support for SymmetricMemory (#150580)
This is an attempt to re-land the initial PR https://github.com/pytorch/pytorch/pull/134817 with recent design changes from upstream.

**NOTE:**
ROCm does NOT have multicast/multimem hardware support at the moment, so those features are disabled in symmetric memory for ROCm. This also means that we currently do not have a way of lowering add + all_reduce + wait_tensor into a one_shot_all_reduce op in inductor, as that lowering depends on multicast buffer support.
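
The eager pattern referred to above, written with functional collectives (a sketch of the pattern only; the lowering itself is inductor-internal):

```python
import torch
import torch.distributed._functional_collectives as funcol

def add_allreduce_wait(x: torch.Tensor, y: torch.Tensor, group) -> torch.Tensor:
    # add -> all_reduce -> wait_tensor: on hardware with multicast support,
    # inductor can lower this chain into a one_shot_all_reduce op; on ROCm it
    # currently cannot, since that lowering depends on a multicast buffer.
    t = x + y
    t = funcol.all_reduce(t, "sum", group)
    return funcol.wait_tensor(t)
```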

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150580
Approved by: https://github.com/jeffdaily, https://github.com/kwen2501, https://github.com/yoyoyocmu

Co-authored-by: Xiaodong Wang <xdwang@fb.com>
2025-05-02 18:35:14 +00:00
7e5f6dcf7f Add @requires_multicast_support to test_multimem_all_gather (#151227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151227
Approved by: https://github.com/jeffdaily
2025-04-15 18:41:12 +00:00
d04a6ec021 add reduce_scatter to symm mem ops (#150813)
+ a few small fixes (don't error out on 0-element tensors, a few more checks for contiguous outputs, more threads for better perf).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150813
Approved by: https://github.com/xw285cornell
2025-04-09 17:59:17 +00:00
1700599266 Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)
Per title, we want to be able to use it even if the inputs are not registered. A separate copy would add latency, and one-shot is all about the lowest possible latency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150129
Approved by: https://github.com/xw285cornell
2025-04-01 05:36:43 +00:00
414b9ae016 enable out variant of 2-shot reduction (#150153)
Per title, this version uses the symm mem input both as the input source and as a work buffer, so the input is modified after the op completes (similar to what the fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that aren't exposed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153
Approved by: https://github.com/xw285cornell
2025-04-01 05:36:04 +00:00
57fa99c5c3 Revert "enable out variant of 2-shot reduction (#150153)"
This reverts commit cdeb32d2d1c31b60c65133e83510977c5c180005.

Reverted https://github.com/pytorch/pytorch/pull/150153 on behalf of https://github.com/clee2000 due to failing internal builds D72083877 ([comment](https://github.com/pytorch/pytorch/pull/150153#issuecomment-2766633712))
2025-03-31 15:43:24 +00:00
e57fa18b40 Revert "Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)"
This reverts commit 8a872261dcb3797557d1965af6832677a77efec1.

Reverted https://github.com/pytorch/pytorch/pull/150129 on behalf of https://github.com/clee2000 due to breaking internal builds D72080428 ([comment](https://github.com/pytorch/pytorch/pull/150129#issuecomment-2766619006))
2025-03-31 15:37:54 +00:00
cdeb32d2d1 enable out variant of 2-shot reduction (#150153)
Per title, this version uses the symm mem input both as the input source and as a work buffer, so the input is modified after the op completes (similar to what the fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that aren't exposed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153
Approved by: https://github.com/xw285cornell
2025-03-28 19:06:03 +00:00
8a872261dc Add one_shot_all_reduce_copy to allow non-symm-mem allocated tensors to be reduced (#150129)
Per title, we want to be able to use it even if the inputs are not registered. A separate copy would add latency, and one-shot is all about the lowest possible latency.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150129
Approved by: https://github.com/xw285cornell
2025-03-28 02:14:27 +00:00
db33d23aa8 [SymmetricMemory] fix an issue where rendezvous is performed with wrong device context when torch.cuda.set_device() is not callled (#144886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144886
Approved by: https://github.com/awgu
2025-01-28 01:43:37 +00:00
8f3eb84373 ROCm: Enable 4 gpu tests for distributed config (#140319)
Change the label to make sure the jobs land on a node which has >= 4 GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140319
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/kwen2501
2025-01-02 17:22:11 +00:00
af190479c8 [fused_all_gather_matmul] use _multimem_all_gather_matmul for small global Ms (#143160)
## Benchmark
M=2048, N=3584, K=8192

baseline (nccl + cublas): 301us
decomp-based async-tp: 354us
comm-aware async-tp: 295us
**multimem_all_gather matmul: 277us**

As M decreases further, the multimem_all_gather approach consistently outperforms the baseline and the other approaches (the other approaches are omitted from the chart as they start to be slower than the baseline):
![image](https://github.com/user-attachments/assets/5811455a-68c9-43fe-9d82-ca488dd77bc1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143160
Approved by: https://github.com/weifengpy
ghstack dependencies: #142283, #142810, #143159
2024-12-17 01:07:27 +00:00