Summary:
This PR is basically a replacement of
https://github.com/pytorch/pytorch/pull/140087, which caused some perf
drop due to frequent TCPStore check in watchdog thread. The fix is to move the
tcpstore check in monitoring thread
If unhealthy, the user should be able to get the type of errors, e.g.,
timeout,nccl error or remote error.
This API is applied to PG level, compared to the
work.get_future_result() API which is applied to Work Level.
Error detection at PG level is much more convenient for users to handle
the PG failure as a whole, e.g, restarting the PG.
Error handling at the work level is still useful for users to attach
work specific context and debug the RC of the specific failing
work/collective
Note it is critical for all ranks in the PG to be notified about an
error as soon as it occurs, so we introduce an errorType of
REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a
local error) to all other ranks in the PG, the broadcast is done through
TCPStore currently
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498
Approved by: https://github.com/kwen2501
`py::call_guard<py::gil_scoped_release>` is not safe when using multiple threads. This instead moves it into the init function which is safe.
For more details see #143593https://github.com/pybind/pybind11/issues/5473
Test plan:
```
python setup.py develop
```
CI
```py
import time
from concurrent.futures import ThreadPoolExecutor
from torch import distributed as dist
def run():
store = dist.TCPStore(
host_name="localhost",
port=0,
is_master=True,
wait_for_workers=False,
)
# this sleep is required to trigger the crash
time.sleep(0.1)
del store
futures = []
with ThreadPoolExecutor(
max_workers=100,
) as executor:
for i in range(100000):
print(i)
futures.append(executor.submit(run))
if len(futures) > 100:
futures.pop(0).result()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143598
Approved by: https://github.com/c-p-i-o
### Motivation:
As design illustrated in Intel distributed support RFC https://github.com/pytorch/pytorch/issues/141741, two sections are needed to enable intel distributed backend (`XCCL`) support in PyTorch.
1. Intel GPU distributed Backend integration in PyTorch `torch-xpu-ops`.
2. **Intel distributed Backend register in PyTorch distributed package**. This PR is to contribute section 2 change.
### Example:
Here is a simple example of using spawn to launch XCCL backend and perform allreduce on XPU tensors.
```
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
def setup(rank, world_size):
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '29500'
dist.init_process_group(rank=rank, world_size=world_size)
def cleanup():
dist.destroy_process_group()
def run_allreduce(rank, world_size):
setup(rank, world_size)
device = torch.device('xpu:{}'.format(rank))
x = torch.randn([2, 2], device=device)
dist.all_reduce(x)
cleanup()
if __name__ == '__main__':
world_size = 2
mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141856
Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD
Doc updates:
* This adds documentation for the object oriented ProcessGroup APIs that are being used in torchft as well as https://github.com/pytorch/rfcs/pull/71 .
* It also does some general cleanups to simplify the distributed.rst by using `:methods`.
* It adds `__init__` definitions for the Stores
* I've reordered things so the collective APIs are before the Store/PG apis
Test plan:
```
lintrunner -a
cd docs && sphinx-autobuild source build/ -j auto -WT --keep-going
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140853
Approved by: https://github.com/kwen2501
Before this PR, users need to call `empty_strided_p2p()` with a `group_name`:
```python
tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device, group_name="0")
symm_mem = _SymmetricMemory.rendezvous(tensor)
```
Users can now omit `group_name` at allocation time and specify it later at rendezvous time:
```python
tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device)
symm_mem = _SymmetricMemory.rendezvous(tensor, group_name="0")
```
Rationales for this change:
- This allows the same allocation to establish symmetric memory under different groups
- Specifying `group_name` at rendezvous time instead of allocation time is a more natural UX
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139529
Approved by: https://github.com/lw
Summary:
If unhealthy, the user should be able to get the type of errors, e.g.,
timeout,nccl error or remote error.
This API is applied to PG level, compared to the work.get_future_result() API which is applied to Work Level.
Error detection at PG level is much more convenient for users to handle the PG failure as a whole, e.g, restarting the PG.
Error handling at the work level is still useful for users to attach work specific context and debug the RC of the specific failing work/collective
Note it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a local error) to all other ranks in the PG, the broadcast is done through TCPStore currently
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140087
Approved by: https://github.com/kwen2501
This PR updates the binding for `stream_write_value32` to be consistent with `memset32` which IMO makes more sense for this type of utilities:
- Changed the API to take a uint32 tensor as argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of a object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
This PR updates the binding for `stream_write_value32` to be consistent with `memset32` which IMO makes more sense for this type of utilities:
- Changed the API to take a uint32 tensor as argument, instead of a device pointer
- Changed the Python binding to be a static method of `_SymmetricMemory`, instead of a object method
- Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934
Approved by: https://github.com/weifengpy
ghstack dependencies: #139227
1. My company is using privateuseone to connect new hardware device and requires the use of `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only open to CUDA, so I add `supports_coalescing` property in `c10d::Backend` to determine whether backend supports coalescing.
2. If `pg._has_hooks` return True, We don't need to determine if the current device is CUDA. So privateuseone can also support `pg._wait_for_pending_works`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338
Approved by: https://github.com/kwen2501
## This Stack
This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`
These in combination aims to provide users with more flexibility to express custom signaling/synchronization patterns.
## This PR
Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility.
To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`:
- `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`.
- `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755
Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw
## This Stack
This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`
These in combination aims to provide users with more flexibility to express custom signaling/synchronization patterns.
## This PR
Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility.
To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`:
- `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`.
- `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755
Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw
## This Stack
This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`
These in combination aims to provide users with more flexibility to express custom signaling/synchronization patterns.
## This PR
```python
# Obtain the signal pad of the specified peer rank as a tensor.
# If both shape and dtype are unspecified, the returned tensor will be a
# 1d uint32 tensor, which is most natural for signaling purposes.
symm_mem.get_signal_pad(peer_rank)
# If only shape is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank)[:shape.numel()].view(shape)
symm_mem.get_signal_pad(peer_rank, shape)
# If only dtype is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank).view(dtype)
symm_mem.get_signal_pad(peer_rank, dtype=dtype)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138754
Approved by: https://github.com/weifengpy, https://github.com/lw
This PR aims to support the following use case:
```python
def all_reduce_eager(x):
y = x * x
req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True)
assert isinstance(req, torch.distributed.Work)
return y
@torch.compile(fullgraph=True)
def all_reduce_wait_compiled(y):
torch.ops.c10d_functional.wait_tensor(y)
return y * y
x = torch.ones(1280, 1280, device="cuda") + self.rank
with allow_inflight_collective_as_graph_input_ctx():
y = all_reduce_eager(x)
z = all_reduce_wait_compiled(y)
```
where the collective is issued in eager (with `async_op=True`) but waited in compiled region.
This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel.
----
**Update**: Did two items to prevent regression to existing use cases:
1. Added memory-stressed test case to test_c10d_nccl.py `test_unwaited` to cover existing user's "not calling work.wait() for non-functional collective" use case
2. Gated all new `register_work()` / `unregister_work()` calls with `c10d::allow_inflight_collective_as_graph_input()` check, which is a new context manager that requires explicit user enablement (i.e. not on by default, so should not affect existing users).
The risk of this new version of PR causing regression should be very low.
------
Test commands:
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives`
- `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_wait_tensor`
- `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited`
- `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal`
- `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph`
- `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing`
- `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli`
- `pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True`
- `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees`
- `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco`
------
Differential Revision: [D65023311](https://our.internmc.facebook.com/intern/diff/D65023311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763
Approved by: https://github.com/yifuwang
Summary:
Our watchdog does not differentiate timeout from NCCL errors clearly in terms of both log and code paths.
It's important for c10d to differentiate different reasons of watchdog
failures. E.g, timeout vs nccl errors, and possibly let users to handle the
errors differently depends on the type of errors
Test Plan:
UT
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138240
Approved by: https://github.com/Skylion007
Summary:
This PR is trying to let users to know what exact collective call from the python thread is failing, and
customize their own error handling function, instead of watchdog thread crashing everything.
This is potentially very useful in fault tolerant training, in which we can have in-process restart.
E.g., when an nccl error is detected, users can potentially abort comms, re-init comms and go back to the previous check pointed step and try again, instead of crashing the whole job.
This is to allow users to check the status of each collective call,
using the ivalue::future libs in PT core. This also allows users to
attach its customized failure handling functions by:
work.get_future_result().then(erro_handling_func)
Note that the above call is also non-blocking for CPU thread
Test Plan:
Added a new test: test_get_future_result to verify the workResutl is
correctly propagated to the users
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137799
Approved by: https://github.com/fduwjj, https://github.com/wconstab
Thanks @eqy for reminding me of this RFC: https://github.com/pytorch/pytorch/issues/119797
This PR is meant to:
- provide a way to abort multiple PGs without deadlocking each other.
- provide a possibility to manually handle comm errors or timeouts (and potentially recovery of such).
One can find an example from: https://github.com/NVIDIA/nccl/issues/1013
## How is it different from `destroy_process_group`?
`destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`.
## What's new in `_abort_process_group`?
It added support for "group abort" semantic. The "group abort" semantic is capable of aborting multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](https://github.com/pytorch/pytorch/issues/119797) targeting [the hang issue in multi-comm case](https://github.com/NVIDIA/nccl/issues/1013). `Group abort` semantic is added in NCCL 2.22.
## What's next?
Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to a lack of "global view" by each PG's individual watchdog. A big semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs).
In any case, it may not be a bad idea to experiment the "group abort" feature with a manual API first and then extend to the automatic mode (watchdog).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132291
Approved by: https://github.com/eqy
## This Stack
Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.
## This PR
- Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out`
- Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_`
- Removes HCM all-reduce (at least for now). Due to the nature of its accumulation order, we can't guarantee the numerical consistency across all ranks.
- Removes the `IntraNodeComm` python binding (its original purpose is superceded by `SymmetricMemory`).
- Removes methods that were made for the python binding.
- Replaces nvlink detection logic with `DMAConnectivityDetector`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137475
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474
code
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.
Also, remove the need of a private synchronization function.
In a longer term, we would like to give user the flexibility of bypassing the watchdog thread and handle the collective error by themselves.
Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137295
Approved by: https://github.com/kwen2501