Fixes#162129. Added validation in _rank_not_in_group() to check if ```FakeProcessGroup``` is properly initialized before use, raising a clear error message if ```torch.distributed.init_process_group(backend='fake')``` hasn't been called first.
This prevents silent failures and ensures proper dispatch system integration for all distributed operations.
Added test case test_fake_process_group_direct_usage_error() that validates the error is raised for ```all_reduce``` and ```all_to_all_single``` operations.
Please let me know if additional distributed operators should be tested or if any other updates are needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163665
Approved by: https://github.com/ezyang
The current behavior is to do "nothing", which means you will corrupt
data. If you're doing something similar to LocalTensor, where you're
overriding the behavior of collectives to do something numerically,
this can be unwelcome behavior. If you can error when this happens
it can help prevent silent numerical incorrectness.
Authored with claude code.
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162841
Approved by: https://github.com/dcci
Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options.
However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends.
This leads to an assert on some user code, where ProcessGroupGloo::split is expecting gloo options but receives nccl options instead.
Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs.
As a quick fix, since user-provided options really only exist for NCCL, just warn and fall-back to defaulted options for Gloo if non-gloo options are detected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang
Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options.
However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends.
This leads to an assert on some user code, where ProcessGroupGloo::split is expecting gloo options but receives nccl options instead.
Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs.
As a quick fix, since user-provided options really only exist for NCCL, just warn and fall-back to defaulted options for Gloo if non-gloo options are detected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
`get_remote_tensor `: return a symmetric tensor given a peer rank.
The difference between `get_buffer` API and `get_remote_tensor` API:
- the former accepts an offset, whereas the latter doesn't
- the latter returns a symmetric tensor at `hdl.offset` on `peer`.
As a refactorization, this PR also moves the implementation of `get_buffer` and `get_signal_pad` to the `SymmetricMemory` level as their code is common to all backends.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161533
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471, #161532
(Porting most of #161008)
Hooking SymmetricMemory Allocator to MemPool so that user can create symmetric tensors with regular `torch.zeros`, `torch.arange` etc factories. Also so that our ops can have functional variants that create `out` tensors on symmetric memory.
To end users, this PR supports a python UI as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
tensor = torch.arange(numel, dtype=dtype, device=device)
```
Added tests for both use cases above.
Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
`get_remote_tensor `: return a symmetric tensor given a peer rank.
The difference between `get_buffer` API and `get_remote_tensor` API:
- the former accepts an offset, whereas the latter doesn't
- the latter returns a symmetric tensor at `hdl.offset` on `peer`.
As a refactorization, this PR also moves the implementation of `get_buffer` and `get_signal_pad` to the `SymmetricMemory` level as their code is common to all backends.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161533
Approved by: https://github.com/ngimel
ghstack dependencies: #161470, #161471, #161532
(Porting most of #161008)
Hooking SymmetricMemory Allocator to MemPool so that user can create symmetric tensors with regular `torch.zeros`, `torch.arange` etc factories. Also so that our ops can have functional variants that create `out` tensors on symmetric memory.
To end users, this PR supports a python UI as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
tensor = torch.arange(numel, dtype=dtype, device=device)
```
Added tests for both use cases above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
(Porting most of #161008)
Hooking SymmetricMemory Allocator to MemPool so that user can create symmetric tensors with regular `torch.zeros`, `torch.arange` etc factories. Also so that our ops can have functional variants that create `out` tensors on symmetric memory.
To end users, this PR supports a python UI as follows:
```
allocator = symm_mem.get_mempool_allocator(device)
mempool = torch.cuda.MemPool(allocator)
with torch.cuda.use_mem_pool(mempool):
tensor = torch.arange(numel, dtype=dtype, device=device)
```
Added tests for both use cases above.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161471
Approved by: https://github.com/ngimel
ghstack dependencies: #161470
expose the pointer so that we can create the `ncclConfig_t` object from pytorch and use it elsewhere. this is useful to control the nccl communicator parameters for multiple nccl communicators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161136
Approved by: https://github.com/kwen2501
Summary:
We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_` which is set after `setBackend`. Same thing as group_desc. I added more unit tests for it.
We need to setGroupName correctly, otherwise, this will break DeviceMesh use case when split_group is used in DeviceMesh
Also ncclx needs to be aware of that its Option is a subclass of BackendOption
Test Plan:
CI
Rollback Plan:
Differential Revision: D79201132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159429
Approved by: https://github.com/xunnanxu
This updates ProcessGroupGloo to support per operation timeouts. Previously the timeouts were ignored even if they were set.
* This checks if the timeout is `kUnsetTimeout` and conditionally uses the provided timeout or the default timeout from the context.
* This exposes `set_timeout` as a standard method on ProcessGroup/Backend so we can test the global timeout.
Test plan:
```
pytest test/distributed/test_c10d_gloo.py -v -k allreduce_timeout
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158128
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
This adds new context manager based PG management to dist2. This allows for managing the active process group much in the same way as a stream
```py
with dist2.process_group(pg):
dist2.current_process_group().allreduce(...).wait()
```
matches
```py
with torch.cuda.stream(stream):
torch.cuda.current_stream().synchronize()
```
Test plan:
```
pytest test/distributed/test_dist2.py -k context
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157988
Approved by: https://github.com/fduwjj
This implements a new `wait_stream` API in Work that matches how `wait` works for ProcessGroupNCCL for CPU based backends such as Gloo.
The idea is to support Gloo communication overlap in FSDPv2/HSDP with minimal changes to FSDP.
There was a previous attempt to make FSDPv2 use Work.wait but given the extensive stream semantics used it doesn't play nicely. https://github.com/pytorch/pytorch/pull/148780
This uses a "Baton" CUDA kernel which spinlocks on a pinned CPU tensor waiting for it to be set.
Test plan:
```
pytest test/distributed/test_c10d_gloo.py -v -k wait_stream
pytest test/distributed/test_c10d_nccl.py -v -k wait_stream
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156883
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
Today the only way to choose allocation backend is via env `TORCH_SYMMMEM=...`.
This is a bit hard to set in CI on test file basis. (The env has to be set before program is loaded).
This PR added a programmatic way -- a `set_backend` API.
Implementation:
Since this API is slightly more dynamic than static registration, at static time each backend registers its availability rather than filling itself as **the** allocator directly. Later when `set_backend` is called, the allocator would actually fill in the device-to-allocation `map_`.
Though added, `set_backend` is **not** a necessary API for user to call -- one backend is still registered as the default at static time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156661
Approved by: https://github.com/ngimel, https://github.com/fduwjj
In recent versions NCCL introduced support for "user buffer registration", i.e., allowing user-owned memory (such as regular PyTorch tensors) to be "registered" (pinned, page-locked, etc.) with all the various hardware (NVLink, InfiniBand, ...) in order to support zero-copy transfers and thus accelerate communication and reduce resource footprint of NCCL's kernels (which reduces contention).
This was already exposed in PyTorch through a custom allocator provided by the NCCL process group. DDP already uses this, via a memory pool to allow caching and reusing.
FSDP2 is also particularly suited to leverage user buffer registration because the buffers it passes to NCCL are allocated by FSDP2 itself, since it anyways needs to (de)interleave the parameters to/from these private buffers.
This PR adds an extra flag to FSDP2 that tells it to use the ProcessGroup allocator for these private buffers, thus allowing it to leverage NCCL zero-copy (when supported).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150564
Approved by: https://github.com/kwen2501, https://github.com/weifengpy, https://github.com/syed-ahmed