pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-22 14:15:01 +08:00

Author	SHA1	Message	Date
Chip Turner	64a5372e6c	Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889 ) Currently `ncclCommInitRankConfig` is always used when creating new communicator groups. This is wasteful as it creates non-shared pairs of endpoint queues as well as costs time to re-establish communication. This change is transparent and opportunistic; when `dist.new_group` is called, it will use the existing, healthy world process group to select the right ranks to include in the process group. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112889 Approved by: https://github.com/kwen2501	2023-11-21 21:03:52 +00:00
Pavan Balaji	8f8722e3f1	[nccl-pg] Avoid using NCCL_ prefix for non-NCCL env variables (#114077 ) NCCL_ prefix should only be used for NCCL library's environment variables. We currently use a few environment variables in PyTorch with the NCCL_ prefix that are the NCCL library does not understand. This patch renames such environment variables to use the TORCH_NCCL_ prefix instead. We still maintain the old NCCL_ variables, but throw a warning when they are used. The following env changes have been made: `NCCL_BLOCKING_WAIT` -> `TORCH_NCCL_BLOCKING_WAIT` `NCCL_ENABLE_TIMING` -> `TORCH_NCCL_ENABLE_TIMING` `NCCL_DESYNC_DEBUG` -> `TORCH_NCCL_DESYNC_DEBUG` `NCCL_ASYNC_ERROR_HANDLING` -> `TORCH_NCCL_ASYNC_ERROR_HANDLING` `ENABLE_NCCL_HEALTH_CHECK` -> `TORCH_ENABLE_NCCL_HEALTH_CHECK` `NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK` -> `TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK` Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/114077 Approved by: https://github.com/fduwjj	2023-11-21 07:23:42 +00:00
Ke Wen	dc65f6c601	[c10d] Remove deprecated multi-gpu-per-thread APIs (#114156 ) As of today, PyTorch Distributed's preferred programming model is one device per thread, as exemplified by the APIs in its document. The multi-GPU functions (which stand for multiple GPUs per CPU thread) have been deprecated for three versions. Removing them now before 2.2 release. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114156 Approved by: https://github.com/albanD, https://github.com/fduwjj, https://github.com/H-Huang	2023-11-21 03:50:23 +00:00
Pritam Damania	17e2313dd3	Add an API to DDP for dynamically updating the underlying process group. (#113580 ) # Motivation If we would like to reinitialize DDP with a different PG with `torch.compile`, we need to do the following: ``` del old_ddp del old_pg pg = init_pg(...) ddp = DDP(pg) model = torch.compile(DDP) ``` This results in recompilation of the entire model and is very expensive. Since the only thing we need to update is the PG, we should be able to do this without having to compile the model again. # Proposal As a result, in this PR I've introduced an `_update_process_group` API which can dynamically update the underlying ProcessGroup used by DDP without needing to reinitialize DDP again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113580 Approved by: https://github.com/fduwjj	2023-11-15 09:05:02 +00:00
Ke Wen	bb7ac12cbf	[ProcessGroupNCCL] Avoid recording stream for broadcast and scatter (#112896 ) Summary: Follows PR #111431, save memory for DTensor init Test Plan: Sandcastle Reviewed By: wanchaol Differential Revision: D50985365 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112896 Approved by: https://github.com/wanchaol	2023-11-07 15:44:04 +00:00
Will Constable	ff51f94e32	[Reland] Fix default timeouts for python entrypoints (e.g. init_process_group) (#113094 ) Previous PRs changed the c++ default timeout for PGNccl, but this path was only hit in some cases, and the python defaults took over in other cases. This PR ensures that NCCL pg always default to the changed NCCL-specific timeout value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113094 Approved by: https://github.com/fduwjj	2023-11-07 05:34:26 +00:00
PyTorch MergeBot	75adb9f371	Revert "Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893 )" This reverts commit f9d47e13813bbefc9f19a6c0430b7122f9d09b91. Reverted https://github.com/pytorch/pytorch/pull/112893 on behalf of https://github.com/clee2000 due to sorry this seems to have broken inductor `f9d47e1381` https://github.com/pytorch/pytorch/actions/runs/6776367936/job/18418174752 ([comment](https://github.com/pytorch/pytorch/pull/112893#issuecomment-1796979811))	2023-11-06 22:49:53 +00:00
Will Constable	f9d47e1381	Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893 ) Previous PRs changed the c++ default timeout for PGNccl, but this path was only hit in some cases, and the python defaults took over in other cases. This PR ensures that NCCL pg always default to the changed NCCL-specific timeout value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112893 Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu ghstack dependencies: #112611, #112803	2023-11-06 20:48:39 +00:00
Yifu Wang	ac9476ba99	Add .boxed() to c10d::ProcessGroup and c10d::Work's pybind (#111997 ) Summary: When passed from C++ to Python, `c10d::ProcessGroup` and `c10d::Work` are automatically converted to their pybind class which can't be used for dispatcher ops. `.boxed()` exposes `c10d::ProcessGroup` and `c10d::Work` as boxed custom class object to Python. ```python import tempfile import torch import torch.distributed as dist if __name__ == "__main__": with tempfile.NamedTemporaryFile(delete=False) as tmpf: dist.init_process_group( backend="nccl", init_method=f"file://{tmpf.name}", rank=0, world_size=1 ) group = dist.group.WORLD print(group) print(group.boxed()) ``` ``` <torch.distributed.distributed_c10d.ProcessGroup object at 0x7fe42fb78d30> ScriptObject <__torch__.torch.classes.c10d.ProcessGroup> ``` Test Plan: Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/111997 Approved by: https://github.com/lw	2023-11-02 20:35:20 +00:00
Yifu Wang	ec18ef62f4	Native c10d_functional ops (#110570 ) This PR introduces a native version of c10d_functional ops. The main goal is to add collective support in AOTInductor and allow collective ops to work in multi-threaded native runtimes. The native version also incorporated API improvements we wished to implement in Python c10d_functional: - Removed `ranks` and `group_size` from collective op signatures which were proven to be redundant. - Use tensor storage as opposed to `void*` to resolve in-flight work. The native process group registration/resolution mechansim is only used for native c10d_functional in the PR. It will become the single source of truth in upcoming PRs. The upcoming PRs will implement Inductor/AOTInductor support for c10d_functional, after which native c10d_functional will replace Python c10d_functional. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110570 Approved by: https://github.com/wanchaol	2023-10-25 22:56:06 +00:00
zdevito	babb6c6ac4	nccl flight recorder (#110960 ) Keep a buffer of the last 16384 nccl work actions, including the stack trace that launched the event. When torch._C._distributed_c10d._dump_nccl_trace(), it an dump these to a pickled archive. For each action we get: process_group_id, seq_id, collective_name, size_of_first_tensor, stack trace state - issued, started, completed (based on cuda events and queried if necessary when the dump is requested) I tested that it is possible to query event state when the streams are otherwise stuck. Differential Revision: [D50138956](https://our.internmc.facebook.com/intern/diff/D50138956) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110960 Approved by: https://github.com/wconstab	2023-10-24 07:12:21 +00:00
Pritam Damania	0ad91c2bfb	Add an explicit _shutdown method to ProcessGroupNCCL (#111392 ) Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor. However, given how python GC works and code holding references to the PG in multiple places, in practice calling `destroy_process_group` doesn't actually end up invoking the destructor. As a result, in this PR I'm adding a explicit shutdown method to that users can call to cleanup all resources. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111392 Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fduwjj	2023-10-24 05:47:12 +00:00
Howard Huang	0b602b13c8	[small] fix tcpstore doc arg (#111807 ) incorrect arg name `wait_for_worker` -> `wait_for_workers` Pull Request resolved: https://github.com/pytorch/pytorch/pull/111807 Approved by: https://github.com/awgu, https://github.com/fduwjj	2023-10-23 19:51:09 +00:00
Ke Wen	18cc8a92ac	[ProcessGroupNCCL] Avoid recording stream for synchronous ops (#111431 ) For synchronous ops (i.e. `asyncOp = False`), we don't want to record streams because we know that the NCCL stream will join back to the "current" stream right after this op. So we might just as well keep the stream ownership of the input/output tensors unchanged. The benefit would be that the allocation/free of the tensors would look deterministic to the "current" stream so that the caching allocator can reuse memory pool for this stream in a clever way. To prevent the input/output tensors from being recycled by python, we rely on the stashing mechanism in ProcessGroupNCCL (which can be also turned on by setting `TORCH_NCCL_AVOID_RECORD_STREAMS=1`). This mechanism change is for libraries like FSDP which uses `all_gather_into_tensor` and `reduce_scatter_tensor` in a synchronous way and which cannot set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` for their users. And therefore, this change is limited to these two collectives for now. Cc: @awgu @janeyx99 @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/111431 Approved by: https://github.com/H-Huang	2023-10-19 00:41:09 +00:00
PyTorch MergeBot	1e70f4d02c	Revert "Reland #2 "[C10] PG observability hooks. (#108815 , #110907 )" (#111072 )" This reverts commit bb1424d46e656dfcdd4c12efe58ada9f1720c4d8. Reverted https://github.com/pytorch/pytorch/pull/111072 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111072#issuecomment-1765399829))	2023-10-16 23:03:26 +00:00
PyTorch MergeBot	493618d745	Revert "[C10D] Introduce C++ side Collective Callbacks. (#110307 )" This reverts commit 359336e3e9a0f67974e53805b5207fbbbc149490. Reverted https://github.com/pytorch/pytorch/pull/110307 on behalf of https://github.com/wconstab due to this sits on top of another PR https://github.com/pytorch/pytorch/pull/111072 that needs to be reverted due to internal release testing failure / multisect blame ([comment](https://github.com/pytorch/pytorch/pull/110307#issuecomment-1764910301))	2023-10-16 17:07:58 +00:00
Will Constable	359336e3e9	[C10D] Introduce C++ side Collective Callbacks. (#110307 ) C++ side callbacks allow for advance users to get access to the collective firehose. It's worth mentioning and discussing the dire environment that those callbacks are invoked. From either main thread of watchdog thread and with a PTD lock held. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110307 Approved by: https://github.com/fduwjj ghstack dependencies: #111061, #111072	2023-10-13 13:53:16 +00:00
Will Constable	bb1424d46e	Reland #2 "[C10] PG observability hooks. (#108815 , #110907 )" (#111072 ) This reverts commit 314a502eb04c6382e2cc9af0573533efba54109d. Changes since original PR: Reland 1 * rename torch.distributed.hooks to torch.distributed._hooks Reland 2 * make _hooks importable even if !distributed.is_available() * handle cuda driver exit intermittent failure caused by new cuda api usage in callback caller (see prev PR in stack) (original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below) Expose a set of observability hooks into C10D such that our users can detect collectives failure both faster and more easily. The design is similar to NCCL desync debug that it minimized the overhead by doing most of the work out of the main thread. This PR introduces a new module torch.distributed.hooks that exposes the following set of methods: register_collective_start_hook register_collective_end_hook register_process_group_hook The process group hook exposes PG creation on the member ranks and call them inline from the the PG creation code. This is fine since this happens during initialization and a limited number of times. The collective start/end hooks are fired from a single background thread. It reads events from a C++ queue and dispatches over. Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown and have it as background thread. This is not possible with more reasonable choices like a condvar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111072 Approved by: https://github.com/malfet ghstack dependencies: #111061	2023-10-12 16:59:23 +00:00
PyTorch MergeBot	314a502eb0	Revert "Reland "[C10] PG observability hooks. (#108815 )" (#110907 )" This reverts commit 7678cd22af46c9df4fb47a409d3e8ad71a6127ea. Reverted https://github.com/pytorch/pytorch/pull/110907 on behalf of https://github.com/huydhn due to Sorry for reverting this, but macos job in trunk starts failing after this `7678cd22af` ([comment](https://github.com/pytorch/pytorch/pull/110907#issuecomment-1756497387))	2023-10-11 00:23:42 +00:00
Will Constable	ca03f36233	Change ProcessGroupNCCL default timeout to 10 min (#110947 ) Avoid changing default for other backends as CPU backend (GLOO) may need longer timeouts. Motivated by trying to save cluster time when encountering collective hangs. Generally collectives should time out within seconds and 30 minutes (or 10 minutes) should provide ample headroom for edge cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110947 Approved by: https://github.com/xw285cornell, https://github.com/fduwjj	2023-10-10 22:28:39 +00:00
Will Constable	7678cd22af	Reland "[C10] PG observability hooks. (#108815 )" (#110907 ) This reverts commit ff0358b0384d6a3a5b8ceeae625c93221612ba8e. (original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below) Expose a set of observability hooks into C10D such that our users can detect collectives failure both faster and more easily. The design is similar to NCCL desync debug that it minimized the overhead by doing most of the work out of the main thread. This PR introduces a new module torch.distributed.hooks that exposes the following set of methods: register_collective_start_hook register_collective_end_hook register_process_group_hook The process group hook exposes PG creation on the member ranks and call them inline from the the PG creation code. This is fine since this happens during initialization and a limited number of times. The collective start/end hooks are fired from a single background thread. It reads events from a C++ queue and dispatches over. Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown and have it as background thread. This is not possible with more reasonable choices like a condvar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907 Approved by: https://github.com/fduwjj	2023-10-10 20:09:40 +00:00
Edward Z. Yang	de3ae93e9b	Include rank of default PG in C++ log messages (#110623 ) I tested by adding some warning logs in C++, run a distributed program and show that they now had `[rank0]:` in the messages. There is no existing test infra for C++ logging so I couldn't easily add a unit test. The implementation strategy is to setup a global variable in C++, and then poke it when we initialize a process group. This was the simplest thing I could think of that would work. This PR only works for non-glog logging. Probably need to come up with some other strategy for glog, e.g., a custom prefix, but need to make sure this doesn't conflict with fbcode. I can't easily test this from OSS, will leave as follow up work. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/110623 Approved by: https://github.com/voznesenskym, https://github.com/wanchaol, https://github.com/fduwjj	2023-10-10 00:26:52 +00:00
PyTorch MergeBot	ff0358b038	Revert "[C10] PG observability hooks. (#108815 )" This reverts commit 0c7a877745f98b8fce8868291408945c0dd817d6. Reverted https://github.com/pytorch/pytorch/pull/108815 on behalf of https://github.com/albanD due to Add a new torch.distributed.hooks namespace but does not document it, test was added this morning ([comment](https://github.com/pytorch/pytorch/pull/108815#issuecomment-1751327751))	2023-10-06 19:49:49 +00:00
Rodrigo Kumpera	0c7a877745	[C10] PG observability hooks. (#108815 ) Expose a set of observability hooks into C10D such that our users can detect collectives failure both faster and more easily. The design is similar to NCCL desync debug that it minimized the overhead by doing most of the work out of the main thread. This PR introduces a new module torch.distributed.hooks that exposes the following set of methods: register_collective_start_hook register_collective_end_hook register_process_group_hook The process group hook exposes PG creation on the member ranks and call them inline from the the PG creation code. This is fine since this happens during initialization and a limited number of times. The collective start/end hooks are fired from a single background thread. It reads events from a C++ queue and dispatches over. Queue notification is oddly done using a pipe, this is needed so python can abort the thread on shutdown and have it as background thread. This is not possible with more reasonable choices like a condvar. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815 Approved by: https://github.com/wconstab, https://github.com/fduwjj	2023-10-06 18:52:46 +00:00
Rodrigo Kumpera	9a1b6d44bb	[C10d] Add PG::enableCollectivesTiming to make it dynamically enabled. (#108814 ) Collectives timing gates the tracking when a collective starts on device. Currently it's enabled by set the NCCL_ENABLE_TIMING env var. The goal of this PR is to make it possible to dynamically enable that flag so users of the PG hooks don't have to set that flag in order to have their hooks work. The design is that once set, all new collectives will have such behavior so we track it on each Work object. We make enableTiming_ atomic in PGNCCL to avoid races on non-TSO hardware. To ensure consistency, we copy its value during Work construction and replace all previous usage of enableTiming_ from the PG with usages from the Work, which now has an immutable value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108814 Approved by: https://github.com/wconstab, https://github.com/fduwjj ghstack dependencies: #108813	2023-09-20 19:47:41 +00:00
Rodrigo Kumpera	2bca5f2af7	[C10D] Track pg name in c++. (#108813 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/108813 Approved by: https://github.com/wconstab	2023-09-15 01:10:29 +00:00
Rodrigo Kumpera	b26af5d5ac	[c10d] Add TCPSTore libuv backend support to c10d rendezvous. (#108284 ) This enables libuv under env and tcp urls. Under env either use the environment variable USE_LIBUV=1 or the url parameter use_lib=1. Under tcp use the url parameter use_lib=1. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108284 Approved by: https://github.com/H-Huang, https://github.com/XilunWu	2023-09-07 21:39:58 +00:00
Rodrigo Kumpera	23e8a11fef	[c10d] Introduce TCPStore client metrics collection. (#108348 ) We collect timing and counts of every operation. They are acessible from python using TCPStore::collect_client_counters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108348 Approved by: https://github.com/XilunWu	2023-09-05 18:36:27 +00:00
Shen Li	45128ab67c	[Reland] Add OnCompletion Hook to ProcessGroup (#106988 ) (#107233 ) This allows infra/trainers to get detailed stats about communication efficiencies without know anything about what model or distributed training paradigms have been used. This is helpful as infra/trainer package usually prefers to be as model/algorithm agnostic as possible. Therefore, we cannot assume that infra/trainer can have access to all collectives used by the model authors. This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which will be fired on every work completion event. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107233 Approved by: https://github.com/kumpera	2023-08-15 17:35:14 +00:00
PyTorch MergeBot	fd214aa8be	Revert "Add OnCompletion Hook to ProcessGroup (#106988 )" This reverts commit ba1da47e8fa95ca0dd8b2d63430f7eb54fdbbccb. Reverted https://github.com/pytorch/pytorch/pull/106988 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it is failing Windows build with some linker error. The Windows failures on PR looks legit ([comment](https://github.com/pytorch/pytorch/pull/106988#issuecomment-1678580899))	2023-08-15 08:24:33 +00:00
Shen Li	ba1da47e8f	Add OnCompletion Hook to ProcessGroup (#106988 ) This allows infra/trainers to get detailed stats about communication efficiencies without know anything about what model or distributed training paradigms have been used. This is helpful as infra/trainer package usually prefers to be as model/algorithm agnostic as possible. Therefore, we cannot assume that infra/trainer can have access to all collectives used by the model authors. This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which will be fired on every work completion event. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106988 Approved by: https://github.com/kumpera, https://github.com/H-Huang ghstack dependencies: #107140, #107141, #107160	2023-08-15 04:32:23 +00:00
Rodrigo Kumpera	2636751fb9	[C10d] Add skeleton of LibUV backend. (#105672 ) This commit hooks up tcpstore creation and build flags. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105672 Approved by: https://github.com/fduwjj	2023-07-28 13:19:06 +00:00
Syed Tousif Ahmed	38861ba39f	Fixes netName assignment for NCCL Config (#105776 ) Fixes #104340 The core issue is described here https://github.com/pybind/pybind11/issues/1168#issuecomment-341969643. Note that NCCL calls free on the netName pointer when destroying the communicator. So memory is safely managed here. CC: @kwen2501 Pull Request resolved: https://github.com/pytorch/pytorch/pull/105776 Approved by: https://github.com/kwen2501	2023-07-26 21:13:56 +00:00
Louis Feng	3a01c056f5	[PyTorch][ET] Collect Process Groups Mapping Info (#104373 ) Summary: Add the logics and interface to log ProcessGroup comms configuration (unique ID, type, and ranks info). Test Plan: Testing in HPC: ``` TORCH_LOGS=all ../buck-out/v2/gen/fbcode/c8344b52091f4f7f/hpc/models/ads/__ads_10x_launcher__/ads_10x_launcher.par +launcher=local launcher.num_trainers=4 +data_loader=random data_loader.num_batches=2000 ``` Example output in ET: ``` { "name": "## process_group:init ##", "id": 3, "rf_id": 1, "parent": 2, "fw_parent": 0, "seq_id": -1, "scope": 7, "tid": 1, "fw_tid": 0, "op_schema": "", "inputs": ["[{'pg_id': 140538064364672, 'backend_id': 140538060772480, 'backend_config': 'cuda:nccl', 'ranks': {0: 0, 1: 1, 2: 2, 3: 3}}, {'pg_id': 140538064363904, 'backend_id': 140538042628864, 'backend_config': 'cuda:nccl', 'ranks': {0: 0, 1: 1, 2: 2, 3: 3}}]"], "input_shapes": [[]], "input_types": ["String"], "outputs": [], "output_shapes": [], "output_types": [] }, ``` Differential Revision: D46321690 Pull Request resolved: https://github.com/pytorch/pytorch/pull/104373 Approved by: https://github.com/kwen2501	2023-07-25 03:34:53 +00:00
Howard Huang	d455d48744	Add back in reduce_scatter_tensor_coalesced (#104345 ) #104256 erroneously removed the pybind definition for `reduce_scatter_tensor_coalesced` introduced in #103561 This adds it back in and introduces a test for the API. Test command: ``` pytest test/distributed/test_c10d_nccl.py -vsk test_reduce_scatter_tensor_coalesced ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/104345 Approved by: https://github.com/kwen2501	2023-06-29 22:53:26 +00:00
Howard Huang	9165d46b89	DDP + C10D sparse all_reduce changes (#103916 ) (#104256 ) Summary: reland of https://github.com/pytorch/pytorch/pull/103916 ## Changes prototyping sparse allreduce using the sparse dispatch key. When passing in sparse tensors into `dist.allreduce()` we can execute our dispatched function. prior to this change, passing a sparse tensor into `allreduce()` will error out with `Tensor must be dense...` ## Example script ```python # python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py import torch import torch.distributed as dist def main(): dist.init_process_group(backend="nccl") rank = dist.get_rank() a = torch.tensor([[0, 2.], [3, 0]]).to(rank) a = a.to_sparse() print(f"rank {rank} - a: {a}") dist.all_reduce(a) if __name__ == "__main__": main() ``` output: ``` rank 1 - a: tensor(indices=tensor([[0, 1], [1, 0]]), values=tensor([2., 3.]), device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo) allreduce_sparse_cuda_ tensor.is_sparse() = 1 in ProcessGroupNCCL::allreduceSparse rank 0 - a: tensor(indices=tensor([[0, 1], [1, 0]]), values=tensor([2., 3.]), device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo) allreduce_sparse_cuda_ tensor.is_sparse() = 1 in ProcessGroupNCCL::allreduceSparse ``` Test Plan: Testing commands (OSS): ``` # python pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops # c++ build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce ``` Testing commands (internal, ondemand GPU): ddp tests: ``` buck build mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output # Get the .par file from the previous command and use it below TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata ``` c10d tests: ``` # build tests and run with log output (python) buck build mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops # python NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)' # c++ NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce ``` Differential Revision: D47056695 Pulled By: H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/104256 Approved by: https://github.com/rohan-varma	2023-06-28 00:37:52 +00:00
PyTorch MergeBot	436d035dc7	Revert "DDP + C10D sparse all_reduce changes (#103916 )" This reverts commit fed5fba6e4ee3f221bac481798c5a31f785ba75e. Reverted https://github.com/pytorch/pytorch/pull/103916 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/103916#issuecomment-1608412325))	2023-06-26 22:37:58 +00:00
Howard Huang	fed5fba6e4	DDP + C10D sparse all_reduce changes (#103916 ) Summary: ## Changes prototyping sparse allreduce using the sparse dispatch key. When passing in sparse tensors into `dist.allreduce()` we can execute our dispatched function. prior to this change, passing a sparse tensor into `allreduce()` will error out with `Tensor must be dense...` ## Example script ```python # python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py import torch import torch.distributed as dist def main(): dist.init_process_group(backend="nccl") rank = dist.get_rank() a = torch.tensor([[0, 2.], [3, 0]]).to(rank) a = a.to_sparse() print(f"rank {rank} - a: {a}") dist.all_reduce(a) if __name__ == "__main__": main() ``` output: ``` rank 1 - a: tensor(indices=tensor([[0, 1], [1, 0]]), values=tensor([2., 3.]), device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo) allreduce_sparse_cuda_ tensor.is_sparse() = 1 in ProcessGroupNCCL::allreduceSparse rank 0 - a: tensor(indices=tensor([[0, 1], [1, 0]]), values=tensor([2., 3.]), device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo) allreduce_sparse_cuda_ tensor.is_sparse() = 1 in ProcessGroupNCCL::allreduceSparse ``` Test Plan: Testing commands (OSS): ``` # python pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops # c++ build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce ``` Testing commands (internal, ondemand GPU): ddp tests: ``` buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output # Get the .par file from the previous command and use it below TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata ``` c10d tests: ``` # build tests and run with log output (python) buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops # python NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)' # c++ NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce ``` Differential Revision: D46724856 Pulled By: H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/103916 Approved by: https://github.com/rohan-varma	2023-06-26 20:42:17 +00:00
Jon Maltiel Swenson	0da38409a0	[gloo] Make it possible for gloo TCPStore to take over an existing socket fd (#103478 ) Summary: This diff allows the `TCPStore` server associated with a gloo process group to listen on an existing socket already bound to a port. Without the functionality in this diff, canonical initialization of a gloo `ProcessGroup` is fundamentally racy: 1) ask the OS for a free port by creating a socket bound to port 0, 2) close the socket, 3) attempt to initialize a `TCPStore` server that listens on the previously free port. Of course, the problem is that in between steps 2 and 3, another process on the host may have claimed the port, causing `TCPStore` and overall process group initialization to fail. With this diff, it is now possible for users to completely avoid this race (see unit test for how this can be achieved). Test Plan: Added new unit test: buck2 test caffe2/test/distributed:store Differential Revision: D46622317 Pull Request resolved: https://github.com/pytorch/pytorch/pull/103478 Approved by: https://github.com/H-Huang	2023-06-16 17:15:56 +00:00
Ke Wen	22e8a61d9b	Implement coalesced reduce_scatter_tensor (#103561 ) Map of #101157. This PR adds support for coalesced `reduce_scatter_tensor` calls in the following syntax: Sync communication style: ``` with dist._coalescing_manager(): for i in range(num_coll): dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i]) ``` Async communication style: ``` with dist._coalescing_manager(async_ops=True) as cm: for i in range(num_coll): dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i]) # do a bunch of other things cm.wait() # do things that depend on the reduce-scatters' results ``` Each `reduce_scatter_tensor` call can be independent in terms of their data and buffer locations. But could be executed in parallel by supported backends (like NCCL). Pull Request resolved: https://github.com/pytorch/pytorch/pull/103561 Approved by: https://github.com/fegin	2023-06-15 20:11:12 +00:00
Rodrigo Kumpera	ca1c1fdc91	[C10D] Implement Store fallbacks for append, multi_get and multi_set. (#100768 ) These fallbacks exposed some issue in quite a few spots in our bindings. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100768 Approved by: https://github.com/fduwjj	2023-06-01 19:58:47 +00:00
Pritam Damania	9a2df0a5af	[RFC] Add method to DDP to check for backward finalization. (#100773 ) Summary: In cases where DDP backward is not finalized, the error is raised only in the next forward iteration of DDP. However, if there are other collective calls between those two points, training scripts could potentially get stuck. As a result, there should be a way to check if DDP finalized after calling `.backward()`. To address this, I've added a `_check_reducer_finalized` method to validate that DDP indeed did successfully finish reduction. Test Plan: Added unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100773 Approved by: https://github.com/rohan-varma	2023-05-31 20:43:06 +00:00
Syed Tousif Ahmed	870880236b	Enables configuration of NCCL communicators (#97394 ) NCCL 2.17+ introduces some user configurable parameters for NCCL communicators using [ncclConfig_t](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.ncclConfig_t) datatype and [ncclCommInitRankConfig](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcomminitrankconfig). This PR enables that feature. A user can tune the parameters as follows: ``` import torch.distributed as dist nccl_options = dist.ProcessGroupNCCL.Options() nccl_options.config.max_ctas = 32 nccl_options.config.min_ctas = 8 nccl_options.config.cga_cluster_size = 2 dist.init_process_group(backend='nccl', init_method='env://', pg_options=nccl_options) my_group = dist.new_group(pg_options=nccl_options) ``` The default values of these parameters are what is initialized by `NCCL_CONFIG_INITIALIZER`. Only for DistributedDataParallel, this PR sets the default value of cga_cluster_size to 2 (a heuristic that works well especially for DDP workloads). Tuning these parameters can lead to improvement in end-to-end performance, since it affects the communication-computation overlap for NCCL kernels. CC: @ptrblck @kwen2501 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97394 Approved by: https://github.com/kwen2501	2023-05-25 20:46:19 +00:00
shaoyf42	97180aca5e	Enables barrier to support the specified device (#99589 ) Enables barrier to support the specified device, e.g cuda/custom device. There is some discussion here: https://github.com/pytorch/pytorch/issues/97938#issue-1646833919 Today, there are two limitations of barrier: One is that barrier does not support custom #device: `fbdb86c174/torch/csrc/distributed/c10d/ProcessGroup.hpp (L512-L522)` The second is that there is a special valid for nccl when device_id is not None, which is an assumption for cuda and nccl bindings, and also hinders custom device. `789070986c/torch/distributed/distributed_c10d.py (L3504-L3508)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/99589 Approved by: https://github.com/kwen2501	2023-05-17 05:26:04 +00:00
Ke Wen	daed3bf8f9	Implement coalesced all_gather_into_tensor (#101157 ) This PR adds support for the following use cases: - Sync style: ``` with dist._coalescing_manager(): for i in range(num_coll): dist.all_gather_into_tensor(output_tensors[i], input_tensors[i]) ``` - Async style: ``` with dist._coalescing_manager(async_ops=True) as cm: for i in range(num_coll): dist.all_gather_into_tensor(output_tensors[i], input_tensors[i]) # do a bunch of other things cm.wait() # do things that depend on the all-gather's ``` Each `all_gather_into_tensor` would be independent in terms of data and their buffer location. But could be executed in parallel by supported backends (like NCCL). Pull Request resolved: https://github.com/pytorch/pytorch/pull/101157 Approved by: https://github.com/kumpera, https://github.com/wanchaol	2023-05-11 20:58:47 +00:00
Ke Wen	0848ed21b8	[c10d] Figure out device to use for object collectives (#100954 ) Fixes https://github.com/pytorch/pytorch/issues/97938 this pr is clone from https://github.com/pytorch/pytorch/pull/100238, which is important to me. But @kwen2501 has not resolved the confliction. So, this pr is submitted to resolve the confliction. the only confliction is `distributed_c10d.py:2653` Pull Request resolved: https://github.com/pytorch/pytorch/pull/100954 Approved by: https://github.com/kwen2501	2023-05-11 01:49:09 +00:00
Rodrigo Kumpera	8360b6c2a8	[c10d] Expose new Store methods. (#100381 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/100381 Approved by: https://github.com/fduwjj	2023-05-06 02:50:51 +00:00
Rodrigo Kumpera	fa2bfab93e	[C10D] Drop the GIL when creating a TCPSTore to avoid deadlocks. (#100555 ) TCPSTore creation is a blocking operation so it can lead to a deadlock if multiple threads are trying to instantiate it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100555 Approved by: https://github.com/H-Huang	2023-05-04 00:15:55 +00:00
Ke Wen	3a09aa5977	[c10d] Faster coalescing (#98793 ) ### Description The PR aims at reducing CPU overhead of context manager style coalescing. By "context manager style coalescing", we mean: Sync style: ``` with _coalescing_manager(): for i in range(num_coll): dist.all_reduce(tensors[i]) ``` Async style: ``` with _coalescing_manager(async_ops=True) as cm: for i in range(num_coll): dist.all_reduce(tensors[i]) cm.wait() ``` In previous implementation, each collective in the `num_coll` loop actually calls into the C++ backend, accumulating pybind overhead. In the new implementation, we capture the collectives at Python level, and only fire towards C++ at the exit of the coalescing manager. ### Tests In current PR, the "fast path" only applies to all-reduce. - Flattened 512M: 16.38 ms, including CPU time 131.21 us - Old _coalescing_manager 64 x 8M: 22.19 ms, including CPU time 2865 us - New _coalescing_manager 64 x 8M: 16.93 ms, including CPU time 635 us Hence a 4x reduction in CPU overhead (dependent on `num_coll`). Cc @mrshenli @kumpera @wanchaol @fegin Pull Request resolved: https://github.com/pytorch/pytorch/pull/98793 Approved by: https://github.com/kumpera	2023-04-24 21:27:26 +00:00
PyTorch MergeBot	9861ec9785	Revert "[c10d] Faster coalescing (#98793 )" This reverts commit db456ab83da6a505dcebc128903d5ee4fc2d5712. Reverted https://github.com/pytorch/pytorch/pull/98793 on behalf of https://github.com/DanilBaibak due to Break internal build	2023-04-21 09:15:04 +00:00

... 2 3 4 5 6 ...

508 Commits