pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Wei Feng 6918f17114 [FSDP2] provide public API to share cuda streams across roots (#165024)

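For pipeline parallelism, a model can have multiple FSDP roots (one per chunk):
```
from collections import OrderedDict
# chunk0 and chunk1 are the pipeline-stage modules; naming them makes model.chunk0 valid
model = nn.Sequential(OrderedDict(chunk0=chunk0, chunk1=chunk1))
fully_shard(model.chunk0)
fully_shard(model.chunk1)
```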

Call `share_comm_ctx` to share the all-gather, reduce-scatter, and all-reduce CUDA streams across these roots. This avoids inter-stream memory fragmentation:
```
from torch.distributed.fsdp import share_comm_ctx
share_comm_ctx([model.chunk0, model.chunk1])
```
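
The sharing pattern described above can be illustrated with a small, torch-free sketch. This is not PyTorch's implementation: `ToyStream`, `ToyCommContext`, `ToyRoot`, `toy_share_comm_ctx`, and `count_distinct_streams` are hypothetical names invented for illustration, and the only assumption taken from the commit message is that each root otherwise holds its own all-gather, reduce-scatter, and all-reduce streams.

```python
class ToyStream:
    """Hypothetical stand-in for a CUDA stream; it only carries a label."""

    def __init__(self, name):
        self.name = name


class ToyCommContext:
    """Hypothetical bundle of the three communication streams."""

    def __init__(self):
        self.all_gather_stream = ToyStream("all_gather")
        self.reduce_scatter_stream = ToyStream("reduce_scatter")
        self.all_reduce_stream = ToyStream("all_reduce")

    def streams(self):
        return [
            self.all_gather_stream,
            self.reduce_scatter_stream,
            self.all_reduce_stream,
        ]


class ToyRoot:
    """Hypothetical FSDP root: by default it owns a private comm context."""

    def __init__(self, name):
        self.name = name
        self.comm_ctx = ToyCommContext()


def toy_share_comm_ctx(roots):
    """Point every root at the first root's context so all roots share streams."""
    if not roots:
        return
    shared = roots[0].comm_ctx
    for root in roots[1:]:
        root.comm_ctx = shared


def count_distinct_streams(roots):
    """Count unique stream objects reachable from the given roots."""
    return len({id(s) for root in roots for s in root.comm_ctx.streams()})


if __name__ == "__main__":
    roots = [ToyRoot("chunk0"), ToyRoot("chunk1")]
    print(count_distinct_streams(roots))  # 6: three private streams per root
    toy_share_comm_ctx(roots)
    print(count_distinct_streams(roots))  # 3: both roots use one set of streams
```

With two roots, the number of distinct communication streams drops from six to three after sharing. The sketch shows only the sharing pattern; it does not model the memory allocator behavior that the commit message cites as the motivation.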

unit test: `pytest -s test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_share_comm_context`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165024
Approved by: https://github.com/mori360

2025-10-14 17:50:46 +00:00

| Name | Last commit | Date |
|---|---|---|
| `_internal` | [FSDP2] provide public API to share cuda streams across roots (#165024) | 2025-10-14 17:50:46 +00:00 |
| `__init__.py` | Add pyrefly suppressions 2/n (#164513) | 2025-10-03 02:46:13 +00:00 |
| `_comparison.py` | Pyrefly suppressions 6/n (#164877) | 2025-10-08 02:30:57 +00:00 |
| `_creation.py` | [BE][PYFMT] migrate PYFMT for torch/[p-z]*/ to ruff format (#144552) | 2025-08-07 00:09:56 +00:00 |
| `_utils.py` | Fix code descriptions in the test package. (#148145) | 2025-03-04 19:14:41 +00:00 |