pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
cyyever	97b3b73f3e	[Environment Variable][7/N] Use thread-safe getenv functions (#140211 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211 Approved by: https://github.com/ezyang, https://github.com/eqy	2025-01-28 15:21:12 +00:00
Shuqiang Zhang	c0861d092c	[PGNCCL] Add an API to get the status/error code at the PG level (#144498) Summary: This PR replaces https://github.com/pytorch/pytorch/pull/140087, which caused a performance drop due to frequent TCPStore checks in the watchdog thread. The fix is to move the TCPStore check into the monitoring thread. If the PG is unhealthy, the user should be able to get the type of error, e.g., timeout, NCCL error, or remote error. This API applies at the PG level, in contrast to the work.get_future_result() API, which applies at the Work level. Error detection at the PG level makes it much more convenient for users to handle a PG failure as a whole, e.g., by restarting the PG. Error handling at the Work level is still useful for attaching work-specific context and debugging the RC of the specific failing work/collective. Note that it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcast' from a src rank (which detects a local error) to all other ranks in the PG; the broadcast currently goes through TCPStore. Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498 Approved by: https://github.com/kwen2501	2025-01-24 16:47:32 +00:00
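The PG-level error propagation described in #144498 can be pictured with a small stand-alone sketch. Everything in it is hypothetical: `FakeStore` stands in for TCPStore, and `ErrorType`, `ProcessGroupMonitor`, and the key name are illustrative names rather than the PyTorch API. It only shows the flow the message describes: the rank that detects a local error publishes it, and the other ranks' monitoring loops pick it up as a remote error.

```python
import enum
import threading


class ErrorType(enum.Enum):
    # Illustrative categories, loosely following the commit message wording.
    SUCCESS = 0
    TIMEOUT = 1
    COMM_ERROR = 2
    REMOTE_ERROR = 3


class FakeStore:
    """In-process stand-in for a shared key-value store (assumption, not TCPStore)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)


class ProcessGroupMonitor:
    """Hypothetical per-rank view of the group's health."""

    ERROR_KEY = "pg0/remote_error"  # made-up key name

    def __init__(self, rank, store):
        self.rank = rank
        self.store = store
        self.error = ErrorType.SUCCESS

    def report_local_error(self, error):
        # The detecting rank records its own error and publishes its rank
        # so that peers can learn about the failure.
        self.error = error
        self.store.set(self.ERROR_KEY, self.rank)

    def poll(self):
        # Called from a monitoring loop: a healthy rank that sees a peer's
        # signal marks itself as REMOTE_ERROR.
        src = self.store.get(self.ERROR_KEY)
        if src is not None and src != self.rank and self.error is ErrorType.SUCCESS:
            self.error = ErrorType.REMOTE_ERROR
        return self.error


store = FakeStore()
ranks = [ProcessGroupMonitor(r, store) for r in range(4)]
ranks[2].report_local_error(ErrorType.TIMEOUT)
statuses = [m.poll() for m in ranks]
```

In this sketch rank 2 keeps its local `TIMEOUT`, while the remaining ranks report `REMOTE_ERROR`, which is what lets a caller treat the group failure as a whole.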
cyy	6a35d9aaa4	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806 Approved by: https://github.com/kwen2501	2025-01-24 12:22:13 +00:00
PyTorch MergeBot	6a2b4db0a1	Revert "Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 )" This reverts commit 42f4fda2ebb27693411f7acca1665778d539bf79. Reverted https://github.com/pytorch/pytorch/pull/143806 on behalf of https://github.com/huydhn due to Lots of builds fail after this land, so maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/143806#issuecomment-2611275836))	2025-01-24 00:17:34 +00:00
cyy	42f4fda2eb	Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806 Approved by: https://github.com/kwen2501	2025-01-23 22:47:18 +00:00
Tristan Rice	6e58c37542	c10d: no call_guard in init (#143598) `py::call_guard<py::gil_scoped_release>` is not safe when using multiple threads. This instead moves it into the init function which is safe. For more details see #143593 https://github.com/pybind/pybind11/issues/5473
Test plan:
```
python setup.py develop
```
CI
```py
import time
from concurrent.futures import ThreadPoolExecutor
from torch import distributed as dist

def run():
    store = dist.TCPStore(
        host_name="localhost",
        port=0,
        is_master=True,
        wait_for_workers=False,
    )
    # this sleep is required to trigger the crash
    time.sleep(0.1)
    del store

futures = []
with ThreadPoolExecutor(
    max_workers=100,
) as executor:
    for i in range(100000):
        print(i)
        futures.append(executor.submit(run))
        if len(futures) > 100:
            futures.pop(0).result()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143598 Approved by: https://github.com/c-p-i-o	2024-12-20 22:23:36 +00:00
lzhang2	5d6acd5a31	Register Intel distributed Backend (`XCCL`) in PyTorch distributed package (#141856)
### Motivation:
As illustrated in the Intel distributed support RFC https://github.com/pytorch/pytorch/issues/141741, two parts are needed to enable Intel distributed backend (`XCCL`) support in PyTorch:
1. Intel GPU distributed backend integration in PyTorch `torch-xpu-ops`.
2. Intel distributed backend registration in the PyTorch distributed package.
This PR contributes part 2.
### Example:
Here is a simple example that uses spawn to launch the XCCL backend and perform an allreduce on XPU tensors.
```
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

def run_allreduce(rank, world_size):
    setup(rank, world_size)
    device = torch.device('xpu:{}'.format(rank))
    x = torch.randn([2, 2], device=device)
    dist.all_reduce(x)
    cleanup()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(run_allreduce, args=(world_size,), nprocs=world_size, join=True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141856 Approved by: https://github.com/kwen2501, https://github.com/gujinghui, https://github.com/albanD	2024-12-10 01:58:06 +00:00
PyTorch MergeBot	614e727191	Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211 )" This reverts commit cd942d00dde73dbf9d7c5f89fdd7152f3440c4ca. Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/izaitsevfb due to causes crash internally during test listing ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2492328790))	2024-11-21 21:05:22 +00:00
Syed Tousif Ahmed	e0482fdf95	Implements user buffer registration using MemPool (#133603 ) This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocation special memory using MemPool and registering it with the nccl buffer registration APIs. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603 Approved by: https://github.com/kwen2501, https://github.com/eqy	2024-11-21 01:40:11 +00:00
cyyever	cd942d00dd	[Environment Variable][7/N] Use thread-safe getenv functions (#140211 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211 Approved by: https://github.com/ezyang, https://github.com/eqy	2024-11-21 00:25:20 +00:00
PyTorch MergeBot	9fac5a16fd	Revert "[PGNCCL] Add an API to get the status/error code of each PG (#140087 )" This reverts commit 80aa19a622bc6b159f7cf07b3501269f3356d752. Reverted https://github.com/pytorch/pytorch/pull/140087 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/140087#issuecomment-2486912231))	2024-11-19 22:53:46 +00:00
PyTorch MergeBot	496c1e78c5	Revert "Implements user buffer registration using MemPool (#133603 )" This reverts commit 25d9be37bef949c675e42b4929ddcb6997af2a7b. Reverted https://github.com/pytorch/pytorch/pull/133603 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133603#issuecomment-2486897708))	2024-11-19 22:42:26 +00:00
Tristan Rice	2673a440d0	[distributed] add PG APIs and general doc cleanups (#140853 ) Doc updates: * This adds documentation for the object oriented ProcessGroup APIs that are being used in torchft as well as https://github.com/pytorch/rfcs/pull/71 . * It also does some general cleanups to simplify the distributed.rst by using `:methods`. * It adds `__init__` definitions for the Stores * I've reordered things so the collective APIs are before the Store/PG apis Test plan: ``` lintrunner -a cd docs && sphinx-autobuild source build/ -j auto -WT --keep-going ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140853 Approved by: https://github.com/kwen2501	2024-11-19 02:06:32 +00:00
Yifu Wang	ab5c8857ef	[SymmetricMemory] support specifying group_name at rendezvous time (#139529) Before this PR, users need to call `empty_strided_p2p()` with a `group_name`:
```python
tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device, group_name="0")
symm_mem = _SymmetricMemory.rendezvous(tensor)
```
Users can now omit `group_name` at allocation time and specify it later at rendezvous time:
```python
tensor = _SymmetricMemory.empty_strided_p2p((1024,), (1,), device=device)
symm_mem = _SymmetricMemory.rendezvous(tensor, group_name="0")
```
Rationales for this change:
- This allows the same allocation to establish symmetric memory under different groups
- Specifying `group_name` at rendezvous time instead of allocation time is a more natural UX
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139529 Approved by: https://github.com/lw	2024-11-17 09:31:17 +00:00
Syed Tousif Ahmed	25d9be37be	Implements user buffer registration using MemPool (#133603 ) This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocation special memory using MemPool and registering it with the nccl buffer registration APIs. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133603 Approved by: https://github.com/kwen2501, https://github.com/eqy	2024-11-15 12:47:49 +00:00
Shuqiang Zhang	80aa19a622	[PGNCCL] Add an API to get the status/error code of each PG (#140087) Summary: If the PG is unhealthy, the user should be able to get the type of error, e.g., timeout, NCCL error, or remote error. This API applies at the PG level, in contrast to the work.get_future_result() API, which applies at the Work level. Error detection at the PG level makes it much more convenient for users to handle a PG failure as a whole, e.g., by restarting the PG. Error handling at the Work level is still useful for attaching work-specific context and debugging the RC of the specific failing work/collective. Note that it is critical for all ranks in the PG to be notified about an error as soon as it occurs, so we introduce an errorType of REMOTE_ERROR, which is 'broadcast' from a src rank (which detects a local error) to all other ranks in the PG; the broadcast currently goes through TCPStore. Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/140087 Approved by: https://github.com/kwen2501	2024-11-15 04:11:00 +00:00
Yifu Wang	684db9beb2	[SymmetricMemory] fix a bug where get_signal_pad() returns a tensor backed by a buffer ptr instead of a signal_pad ptr (#140128 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/140128 Approved by: https://github.com/lw ghstack dependencies: #140127	2024-11-14 23:29:16 +00:00
PyTorch MergeBot	4a18e26ff5	Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211 )" This reverts commit a3cff4bbd4130d36b188dbe101a790e6d7da644f. Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/ezyang due to One of these diffs had incorrect downstream optional handling, we must reaudit all of these diffs ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2473709246))	2024-11-13 14:05:01 +00:00
cyy	40fb738197	Use Wextra-semi (#140236 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/140236 Approved by: https://github.com/ezyang	2024-11-13 02:15:16 +00:00
cyy	a3cff4bbd4	[Environment Variable][7/N] Use thread-safe getenv functions (#140211 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211 Approved by: https://github.com/ezyang, https://github.com/eqy	2024-11-12 18:49:51 +00:00
Yifu Wang	0a0915fb5e	[SymmetricMemory] improve the API for stream_write_value32 (#139934 ) This PR updates the binding for `stream_write_value32` to be consistent with `memset32` which IMO makes more sense for this type of utilities: - Changed the API to take a uint32 tensor as argument, instead of a device pointer - Changed the Python binding to be a static method of `_SymmetricMemory`, instead of a object method - Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934 Approved by: https://github.com/weifengpy ghstack dependencies: #139227	2024-11-11 18:49:22 +00:00
PyTorch MergeBot	5f4a21dc58	Revert "[SymmetricMemory] improve the API for stream_write_value32 (#139934 )" This reverts commit 2f3a5a15ef701ffab9a880cf822ff8e5224a4b33. Reverted https://github.com/pytorch/pytorch/pull/139934 on behalf of https://github.com/malfet due to Broke distributed tests, see https://github.com/pytorch/pytorch/actions/runs/11770673088/job/32784210441 ([comment](https://github.com/pytorch/pytorch/pull/139934#issuecomment-2468641512))	2024-11-11 17:02:07 +00:00
Yifu Wang	2f3a5a15ef	[SymmetricMemory] improve the API for stream_write_value32 (#139934 ) This PR updates the binding for `stream_write_value32` to be consistent with `memset32` which IMO makes more sense for this type of utilities: - Changed the API to take a uint32 tensor as argument, instead of a device pointer - Changed the Python binding to be a static method of `_SymmetricMemory`, instead of a object method - Use the dispatcher for device dispatching, as opposed to `SymmetricMemory` backends Pull Request resolved: https://github.com/pytorch/pytorch/pull/139934 Approved by: https://github.com/weifengpy ghstack dependencies: #139227	2024-11-11 01:54:35 +00:00
PyTorch MergeBot	1400fedf76	Revert "add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338 )" This reverts commit e5574445b01f264e57653a8a42af1118e89acc9a. Reverted https://github.com/pytorch/pytorch/pull/135338 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. Please see D65663382 for more details ([comment](https://github.com/pytorch/pytorch/pull/135338#issuecomment-2465911854))	2024-11-08 23:52:49 +00:00
Luca Wehrstedt	5f287df422	Add type information for FakeProcessGroup (#133211 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133211 Approved by: https://github.com/Skylion007	2024-11-08 11:18:52 +00:00
taozhiwei	e5574445b0	add supports_coalescing property in c10d::Backend to determine whether backend supports coalescing (#135338 ) 1. My company is using privateuseone to connect new hardware device and requires the use of `batch_isend_irecv` function. However, `batch_isend_irecv` is currently only open to CUDA, so I add `supports_coalescing` property in `c10d::Backend` to determine whether backend supports coalescing. 2. If `pg._has_hooks` return True, We don't need to determine if the current device is CUDA. So privateuseone can also support `pg._wait_for_pending_works` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135338 Approved by: https://github.com/kwen2501	2024-11-08 11:08:45 +00:00
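The idea behind `supports_coalescing` in #135338 is to decide by backend capability instead of by device type. The sketch below is a plain-Python illustration under that assumption; the class names and `batch_send_recv` are hypothetical and do not mirror the `c10d::Backend` implementation.

```python
class Backend:
    """Hypothetical base class; the real backends live in c10d (C++)."""

    @property
    def supports_coalescing(self):
        # Conservative default: a backend must opt in explicitly.
        return False


class CustomDeviceBackend(Backend):
    """Stands in for a privateuseone-style backend that implements coalescing."""

    @property
    def supports_coalescing(self):
        return True


class PlainBackend(Backend):
    """A backend that keeps the default and does not coalesce."""


def batch_send_recv(backend, ops):
    # Branch on what the backend can do, not on which device it targets.
    if backend.supports_coalescing:
        return [("coalesced", tuple(ops))]
    return [("single", op) for op in ops]


coalesced = batch_send_recv(CustomDeviceBackend(), ["send", "recv"])
separate = batch_send_recv(PlainBackend(), ["send", "recv"])
```

With a capability flag, a new device backend only has to override one property to take the batched path; no caller needs a device-specific check.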
cyy	83fa1014f1	[3/N] Replace c10::sv with std::sv (#139861 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139861 Approved by: https://github.com/ezyang	2024-11-07 20:03:57 +00:00
Yifu Wang	ee42a99745	[SymmetricMemory] introduce a binding for cuMemset32Async (#138755 ) ## This Stack This stack does the following things to support `xformers`-style, comm-aware Triton kernels: - Exposes `signal_pad`s as tensors in Python - Adds a binding for `cuMemsetAsync` These in combination aims to provide users with more flexibility to express custom signaling/synchronization patterns. ## This PR Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility. To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`: - `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`. - `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755 Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw	2024-11-05 18:47:24 +00:00
PyTorch MergeBot	3ca794783f	Revert "[SymmetricMemory] introduce a binding for cuMemset32Async (#138755 )" This reverts commit 924e726c3a2566125f55cdbff4dff054d3db3232. Reverted https://github.com/pytorch/pytorch/pull/138755 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally. Can you please fix this PR so it works internally and re-merge it? See D65401876 for more details ([comment](https://github.com/pytorch/pytorch/pull/138755#issuecomment-2455173596))	2024-11-04 16:34:34 +00:00
Yifu Wang	924e726c3a	[SymmetricMemory] introduce a binding for cuMemset32Async (#138755 ) ## This Stack This stack does the following things to support `xformers`-style, comm-aware Triton kernels: - Exposes `signal_pad`s as tensors in Python - Adds a binding for `cuMemsetAsync` These in combination aims to provide users with more flexibility to express custom signaling/synchronization patterns. ## This PR Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility. To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`: - `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`. - `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755 Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw	2024-11-03 21:37:31 +00:00
Yifu Wang	0dbc284a72	[SymmetricMemory] expose signal_pads as tensors in Python (#138754)
## This Stack
This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`
These in combination aims to provide users with more flexibility to express custom signaling/synchronization patterns.
## This PR
```python
# Obtain the signal pad of the specified peer rank as a tensor.
# If both shape and dtype are unspecified, the returned tensor will be a
# 1d uint32 tensor, which is most natural for signaling purposes.
symm_mem.get_signal_pad(peer_rank)

# If only shape is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank)[:shape.numel()].view(shape)
symm_mem.get_signal_pad(peer_rank, shape)

# If only dtype is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank).view(dtype)
symm_mem.get_signal_pad(peer_rank, dtype=dtype)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138754 Approved by: https://github.com/weifengpy, https://github.com/lw	2024-11-01 20:17:15 +00:00
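The view equivalences stated for `get_signal_pad` in #138754 (slice then reshape, or reinterpret the same bytes as another element type) can be mimicked without a GPU. This is only an analogy built on the standard-library `memoryview`; `get_pad`, the 64-byte buffer size, and the format codes are assumptions for illustration and are not part of the `_SymmetricMemory` API.

```python
import math
import struct

assert struct.calcsize("I") == 4  # this sketch assumes a 4-byte unsigned int

# A made-up 64-byte "signal pad", viewed as 16 uint32 slots by default.
raw = bytearray(64)
pad = memoryview(raw).cast("I")


def get_pad(pad, shape=None, fmt=None):
    """Mimic the documented equivalences on a plain memoryview (illustrative only)."""
    if shape is not None:
        # Analogue of pad[:shape.numel()].view(shape)
        n = math.prod(shape)
        return pad[:n].cast("B").cast(pad.format, shape)
    if fmt is not None:
        # Analogue of pad.view(dtype): reinterpret the same bytes.
        return pad.cast("B").cast(fmt)
    return pad


pad[0] = 7
as_2x3 = get_pad(pad, shape=(2, 3))   # first six uint32 slots as a 2x3 view
as_u64 = get_pad(pad, fmt="Q")        # same 64 bytes as eight 8-byte slots
```

All three views share the underlying buffer, so a write through one is visible through the others, which is the property that makes a signal pad usable for signaling.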
cyy	4a2da52137	[1/N] Replace c10::sv with std::sv (#139453 ) Picks some safe replacements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139453 Approved by: https://github.com/Skylion007	2024-11-01 05:39:37 +00:00
Will Feng	4ee514144b	[c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager `async_op=True` collective if under `allow_inflight_collective_as_graph_input_ctx()` context manager (#137763) This PR aims to support the following use case:
```python
def all_reduce_eager(x):
    y = x * x
    req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True)
    assert isinstance(req, torch.distributed.Work)
    return y

@torch.compile(fullgraph=True)
def all_reduce_wait_compiled(y):
    torch.ops.c10d_functional.wait_tensor(y)
    return y * y

x = torch.ones(1280, 1280, device="cuda") + self.rank
with allow_inflight_collective_as_graph_input_ctx():
    y = all_reduce_eager(x)
    z = all_reduce_wait_compiled(y)
```
where the collective is issued in eager (with `async_op=True`) but waited in compiled region. This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel.
----
Update: Did two items to prevent regression to existing use cases:
1. Added memory-stressed test case to test_c10d_nccl.py `test_unwaited` to cover existing user's "not calling work.wait() for non-functional collective" use case
2. Gated all new `register_work()` / `unregister_work()` calls with `c10d::allow_inflight_collective_as_graph_input()` check, which is a new context manager that requires explicit user enablement (i.e. not on by default, so should not affect existing users).
The risk of this new version of PR causing regression should be very low.
------ Test commands: - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_wait_tensor` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited` - `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal` - `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch` - `pytest -rA 
test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing` - `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli` - 
`pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True` - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees` - `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco` ------ Differential Revision: [D65023311](https://our.internmc.facebook.com/intern/diff/D65023311) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763 Approved by: https://github.com/yifuwang	2024-10-29 03:31:19 +00:00
PyTorch MergeBot	e5595f10c8	Revert "[c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager `async_op=True` collective if under `allow_inflight_collective_as_graph_input_ctx()` context manager (#137763 )" This reverts commit a688c57033b4536ef59356cdad241d65ca52a869. Reverted https://github.com/pytorch/pytorch/pull/137763 on behalf of https://github.com/yf225 due to Seems to have bad interaction with latest commits on trunk, reverting to be safe ([comment](https://github.com/pytorch/pytorch/pull/137763#issuecomment-2442527696))	2024-10-28 20:13:46 +00:00
Will Feng	a688c57033	[c10d][Partial-Graph Overlap] Support calling .wait_tensor() on output tensor of eager `async_op=True` collective if under `allow_inflight_collective_as_graph_input_ctx()` context manager (#137763) This PR aims to support the following use case:
```python
def all_reduce_eager(x):
    y = x * x
    req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True)
    assert isinstance(req, torch.distributed.Work)
    return y

@torch.compile(fullgraph=True)
def all_reduce_wait_compiled(y):
    torch.ops.c10d_functional.wait_tensor(y)
    return y * y

x = torch.ones(1280, 1280, device="cuda") + self.rank
with allow_inflight_collective_as_graph_input_ctx():
    y = all_reduce_eager(x)
    z = all_reduce_wait_compiled(y)
```
where the collective is issued in eager (with `async_op=True`) but waited in compiled region. This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel.
------ Test commands: - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_wait_tensor` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_wait_tensor` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited` - `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal` - `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch` - `pytest -rA 
test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing` - `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli` - 
`pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True` - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees` - `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco` ------ Differential Revision: [D65023311](https://our.internmc.facebook.com/intern/diff/D65023311) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763 Approved by: https://github.com/yifuwang	2024-10-28 18:11:23 +00:00
cyy	f9ae3fac8c	[Distributed] [19/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138903 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138903 Approved by: https://github.com/ezyang	2024-10-28 05:29:25 +00:00
cyyever	ce631939f0	[Distributed] [18/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138692 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/138692 Approved by: https://github.com/ezyang	2024-10-25 05:32:38 +00:00
Shuqiang Zhang	4c91481656	[c10d] allow sub group to be eagerly inited even if default one is not (#138665 ) Summary: Currently, eager mode is applied either to all PGs or NONE of them. There are cases where we don't want to initialize the comms for default PG, but we still want to initialize the comms for sub PG. Now with a device_id passed to new group, we can achieve this case Test Plan: newly added UT Tags: Resolves https://github.com/pytorch/pytorch/issues/137018 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138665 Approved by: https://github.com/kwen2501 ghstack dependencies: #138781	2024-10-24 23:51:28 +00:00
PyTorch MergeBot	e7f1e306df	Revert "[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager `async_op=True` collective (#137763 )" This reverts commit 362ca54f03f9bb72ba7633ed580fb788b1a8dea9. Reverted https://github.com/pytorch/pytorch/pull/137763 on behalf of https://github.com/wdvr due to this change is breaking our prod training pipeline (verified with bisect) by increasing memory consumption 4x and causing OOM ([comment](https://github.com/pytorch/pytorch/pull/137763#issuecomment-2435962833))	2024-10-24 17:46:09 +00:00
cyy	2bcfbf2505	[Distributed] [17/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#138465 ) Follows #137404 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138465 Approved by: https://github.com/ezyang	2024-10-24 04:58:49 +00:00
Will Feng	362ca54f03	[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager `async_op=True` collective (#137763 ) This PR aims to support the following use case: ```python def all_reduce_eager(x): y = x * x req = dist.all_reduce(y, op=dist.ReduceOp.SUM, async_op=True) assert isinstance(req, torch.distributed.Work) return y @torch.compile(fullgraph=True) def all_reduce_wait_compiled(y): torch.ops.c10d_functional.wait_tensor(y) return y * y ``` where the collective is issued in eager (with `async_op=True`) but waited in compiled region. This is important for internal use cases such as TorchRec, where we issue collectives in eager for SparseArch all_to_all but want to wait for them in compiled region at beginning of OverArch, so that the all_to_all can be overlapped with the DenseArch compute that runs in parallel. ------ Test commands: - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_eager_async_allreduce_inductor_wait` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives` - `pytest -rA test/test_fx.py::TestDCE::test_keep_collectives_no_overload` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_unwaited` - `pytest -rA test/distributed/test_c10d_functional_native.py::TestWithNCCL::test_work_registry` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_unwaited` - `pytest -rA test/distributed/test_c10d_nccl.py::CommTest::test_work_registry` - `pytest -rA test/distributed/_tensor/test_tensor_ops.py::DistTensorOpsTest::test_equal` - `pytest -rA test/distributed/_tensor/test_random_ops.py::DistTensorRandomOpTest::test_manual_seed` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA 
test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_asymmetric_compilation` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_scalar` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_speculation_divergence` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_automatic_dynamic_tensor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_dim_mismatch` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_graph_break_empty_graph_still_collective` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_scalar_missing_source` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_compiler_collectives_type_mismatch` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_activation_checkpointing` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_inductor` - `pytest -rA 
test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_setattr` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_fsdp_unspecialized_forced_getattr_no_inline` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_aot_eager_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_ddp_inductor_static_graph` - `pytest -rA test/distributed/test_dynamo_distributed.py::TestMultiProc::test_hf_bert_fsdp_activation_checkpointing` - `pytest -rA test/distributed/_tensor/test_experimental_ops.py::DistOtherOpsTest::test_bernoulli` - `pytest -rA test/distributed/_tensor/test_dtensor_compile.py::TestDTensorCompileE2E::test_tp_compile_fullgraph_is_seq_parallel_True` - `pytest -rA test/distributed/test_inductor_collectives.py::TestCollectivesMultiProc::test_allreduce_inductor_cudagraph_trees` - `python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --total-partitions 2 --partition-id 1 --output inference_torchbench.csv --only moco` ------ Differential Revision: [D64511994](https://our.internmc.facebook.com/intern/diff/D64511994) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137763 Approved by: https://github.com/yifuwang	2024-10-21 06:02:57 +00:00
Richard Barnes	542f7c8383	Eliminate C10_NODISCARD (#138336 ) Test Plan: Sandcastle Reviewed By: swolchok Pull Request resolved: https://github.com/pytorch/pytorch/pull/138336 Approved by: https://github.com/Skylion007	2024-10-19 02:54:06 +00:00
Shuqiang Zhang	21a9c06ca9	[c10d] differentiate timeout errors from nccl errors (#138240 ) Summary: Our watchdog does not clearly differentiate timeouts from NCCL errors in either its logs or its code paths. It's important for c10d to differentiate the different causes of watchdog failures, e.g., timeouts vs. NCCL errors, and possibly to let users handle the errors differently depending on their type. Test Plan: UT Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/138240 Approved by: https://github.com/Skylion007	2024-10-18 01:36:32 +00:00
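The idea of handling failures differently by type can be sketched in plain Python. This is only an illustration: `ErrorType`, `choose_action`, and the action strings are hypothetical names made up for this sketch, not the c10d API.

```python
from enum import Enum, auto


class ErrorType(Enum):
    """Hypothetical failure categories for this sketch (not the c10d enum)."""
    SUCCESS = auto()
    TIMEOUT = auto()
    COMM_ERROR = auto()


def choose_action(error: ErrorType) -> str:
    """Map a failure category to a hypothetical recovery action."""
    if error is ErrorType.SUCCESS:
        return "continue"
    if error is ErrorType.TIMEOUT:
        # A timeout may mean a peer is slow or stuck, so collect diagnostics first.
        return "dump-debug-info-and-restart"
    if error is ErrorType.COMM_ERROR:
        # A communicator error suggests the comm itself is unusable.
        return "abort-and-reinit-comms"
    raise ValueError(f"unknown error type: {error!r}")


if __name__ == "__main__":
    for err in ErrorType:
        print(err.name, "->", choose_action(err))
```

Keeping the categories in one enum means callers branch on a value rather than parsing log text.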
Shuqiang Zhang	a1b22e369b	[c10d] add an API to get the future result(success or failure) of a collective and customize error handling (#137799 ) Summary: This PR lets users know exactly which collective call from the Python thread is failing and supply their own error handling function, instead of having the watchdog thread crash everything. This is potentially very useful in fault-tolerant training, in which we can have in-process restart. E.g., when an NCCL error is detected, users can potentially abort comms, re-init comms, go back to the previous checkpointed step, and try again, instead of crashing the whole job. Users can check the status of each collective call using the ivalue::future libs in PT core, and can attach their own failure handling functions via: work.get_future_result().then(error_handling_func) Note that the above call is also non-blocking for the CPU thread. Test Plan: Added a new test, test_get_future_result, to verify that the workResult is correctly propagated to users. Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/137799 Approved by: https://github.com/fduwjj, https://github.com/wconstab	2024-10-16 00:20:09 +00:00
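The callback pattern described above can be illustrated with the standard library alone. This is an analogy, not the c10d API: it uses `concurrent.futures.Future` and `add_done_callback` in place of `work.get_future_result().then(...)`, and `WorkResult`, `error_handling_func`, and `handled` are hypothetical names chosen for the sketch.

```python
from concurrent.futures import Future
from enum import Enum


class WorkResult(Enum):
    """Hypothetical outcome codes for this sketch."""
    SUCCESS = 0
    TIMEOUT = 1
    COMM_ERROR = 2


# Failures seen by the user-supplied handler.
handled = []


def error_handling_func(fut: Future) -> None:
    """User callback: runs once the future completes, records any failure."""
    result = fut.result()
    if result is not WorkResult.SUCCESS:
        handled.append(result)


# Stand-in for the future that a collective's work handle would return.
fut = Future()

# Attaching the callback returns immediately; the caller is not blocked.
fut.add_done_callback(error_handling_func)
pending_before_completion = list(handled)  # still empty at this point

# Stand-in for the background thread marking the collective as failed.
fut.set_result(WorkResult.COMM_ERROR)

if __name__ == "__main__":
    print(pending_before_completion, handled)
```

The handler runs only when the result is set, so the thread that issued the collective can keep working in the meantime.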
Yifu Wang	91bc9dc2c9	[SymmetricMemory] implement timeout for barrier(), put_signal() and wait_signal() (#137643 ) Suggested by @lw for better safety/reliability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137643 Approved by: https://github.com/weifengpy, https://github.com/lw	2024-10-15 21:35:14 +00:00
Andrew Gu	e269a5cb09	[TCPStore] Throw value error if passing `world_size=0` to TCPStore (#137792 ) This fixes https://github.com/pytorch/pytorch/issues/137577. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137792 Approved by: https://github.com/fegin, https://github.com/H-Huang ghstack dependencies: #137713, #137721	2024-10-11 23:42:57 +00:00
Ke Wen	fe148024fe	[c10d][experimental] Add _abort_process_group (#132291 ) Thanks @eqy for reminding me of this RFC: https://github.com/pytorch/pytorch/issues/119797 This PR is meant to: - provide a way to abort multiple PGs without deadlocking each other. - provide a possibility to manually handle comm errors or timeouts (and potentially recover from them). One can find an example from: https://github.com/NVIDIA/nccl/issues/1013 ## How is it different from `destroy_process_group`? `destroy_process_group` is meant for normal exit, while `_abort_process_group` is meant for bailout upon hangs or failures. Similar to `ncclCommDestroy` vs `ncclCommAbort`. ## What's new in `_abort_process_group`? It adds support for "group abort" semantics. The "group abort" semantics allow aborting multiple NCCL comms concurrently, avoiding deadlock in otherwise serialized `ncclCommAbort` executions. Details are in the [RFC](https://github.com/pytorch/pytorch/issues/119797) targeting [the hang issue in multi-comm case](https://github.com/NVIDIA/nccl/issues/1013). `Group abort` semantics were added in NCCL 2.22. ## What's next? Ideally, the watchdog's behavior should support "group abort" too. But this is hard to implement today due to a lack of "global view" by each PG's individual watchdog. A big or semi-big refactor may be needed to "uplift" the watchdogs to a global level or consolidate them into one (i.e. one dog watching multiple PGs). In any case, it may not be a bad idea to experiment with the "group abort" feature with a manual API first and then extend to the automatic mode (watchdog). Pull Request resolved: https://github.com/pytorch/pytorch/pull/132291 Approved by: https://github.com/eqy	2024-10-11 05:04:17 +00:00
Yifu Wang	39c5122a4f	[IntraNodeComm] replace all-reduce kernels with corresponding symm_mem ops (#137475 ) ## This Stack Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them. ## This PR - Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out` - Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_` - Removes HCM all-reduce (at least for now). Due to the nature of its accumulation order, we can't guarantee the numerical consistency across all ranks. - Removes the `IntraNodeComm` python binding (its original purpose is superseded by `SymmetricMemory`). - Removes methods that were made for the python binding. - Replaces nvlink detection logic with `DMAConnectivityDetector`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137475 Approved by: https://github.com/Chillee ghstack dependencies: #137471, #137472, #137473, #137474	2024-10-09 23:30:16 +00:00
Shuqiang Zhang	76dca1fef3	[c10d] separate the codes for GPU stream synchronization and CPU thread synchronization (#137295 ) Summary: This PR should not change the existing behavior of work.wait(); it just separates the stream synchronization code from the CPU busy-wait code. It also removes the need for a private synchronization function. In the longer term, we would like to give users the flexibility to bypass the watchdog thread and handle collective errors themselves. Test Plan: python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest Pull Request resolved: https://github.com/pytorch/pytorch/pull/137295 Approved by: https://github.com/kwen2501	2024-10-08 08:53:47 +00:00
Howard Huang	0ccd39a64b	Fix prefix store seg fault (#136872 ) fixes https://github.com/pytorch/pytorch/issues/136723 Do not allow `None` to be passed into `PrefixStore` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136872 Approved by: https://github.com/kwen2501	2024-09-30 20:43:08 +00:00

588 Commits