pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 13:44:15 +08:00

Author	SHA1	Message	Date
Nikita Shulga	f901b02066	[Distributed] Do not expose `nlohmann/json.hpp` in public headers (#131925 ) Move the `<nlohmann/json.hpp>` dependency, as well as the `NCCLTraceBuffer::getCollectiveTraceJson` and `NCCLTraceBuffer::dump_json` implementations introduced by https://github.com/pytorch/pytorch/pull/129505, from the header into the .cpp file. This removes the requirement that all downstream clients depend on the library. Fixes https://github.com/pytorch/pytorch/issues/130678 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131925 Approved by: https://github.com/albanD, https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/c-p-i-o ghstack dependencies: #131922	2024-07-28 18:45:24 +00:00
Nikita Shulga	07389163f0	[C10][BE] Use range loop (#131922 ) Non-functional change that iterates over entries in `getCollectiveTraceJson` and uses `C10_UNUSED` rather than the `(void)i;` trick. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131922 Approved by: https://github.com/XilunWu	2024-07-27 11:26:27 +00:00
fduwjj	e20fb5e975	[PTD][c10d] Include PG status into flight recorder (#131268 ) We are considering consolidating the data sources for logging and the flight recorder so that we don't build multiple paths for debugging information. Before we do any merging, we want to first ensure that the PG status is also included in the flight recorder. We can also leverage this information to validate our FR dump. Because the dump is not synced, we might see some variation in the dump. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131268 Approved by: https://github.com/shuqiangzhang	2024-07-25 01:01:00 +00:00
Chirag Pandya	83c95c48f7	Flight recoder data as JSON (#129505 ) Summary: Provide a new API to retrieve flight recorder data as JSON. The one minor difference between flight recorder as Pickle v/s JSON is that the JSON API does not retrieve stack traces at the moment. This ends up being far too much data. Test Plan: unit test Differential Revision: [D59536460](https://our.internmc.facebook.com/intern/diff/D59536460) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129505 Approved by: https://github.com/wconstab, https://github.com/d4l3k	2024-07-10 21:50:27 +00:00
cyy	29861779ce	[2/N] Change #include <c10/util/Optional.h> to #include <optional> (#130236 ) Follows #128301. The changes were made by grep and sed Pull Request resolved: https://github.com/pytorch/pytorch/pull/130236 Approved by: https://github.com/ezyang	2024-07-09 03:17:24 +00:00
cyy	f4dcf2ae93	[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301 Approved by: https://github.com/ezyang, https://github.com/r-barnes	2024-07-08 07:03:53 +00:00
HOOLoLo	5ad2ad5921	Update start_, end_ and retired only for the right entry when retire a work (#128948 ) Fixes #128805 If the buffer size of NCCLTraceBuffer is 10 and the PG has recorded 11 works, the entry for work 0 will have been overwritten by work 10, so when the watchdog retires work 0, the start_ and end_ of entry 0 shouldn't be set to nullptr. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128948 Approved by: https://github.com/wconstab, https://github.com/c-p-i-o	2024-06-26 21:58:00 +00:00
FFFrog	e49525275d	Make TraceUtils.h to be device-agnostic (#126969 ) Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files. In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969 Approved by: https://github.com/c-p-i-o	2024-06-19 09:06:49 +00:00
PyTorch MergeBot	846bb30e13	Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 )" This reverts commit bd72e28314d8d63bb347becb8309f5ac7761c6b5. Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build `bd72e28314`. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822))	2024-06-15 01:58:20 +00:00
cyy	bd72e28314	[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301 Approved by: https://github.com/ezyang	2024-06-14 23:21:01 +00:00
PyTorch MergeBot	5001f41b90	Revert "Make TraceUtils.h to be device-agnostic (#126969 )" This reverts commit 648625b230e8e6e7478fb219ff4f0aa6a45070f5. Reverted https://github.com/pytorch/pytorch/pull/126969 on behalf of https://github.com/clee2000 due to failing internal builds D58443769 ([comment](https://github.com/pytorch/pytorch/pull/126969#issuecomment-2163462600))	2024-06-12 16:32:57 +00:00
FFFrog	648625b230	Make TraceUtils.h to be device-agnostic (#126969 ) Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files. In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969 Approved by: https://github.com/c-p-i-o	2024-06-11 08:38:07 +00:00
Shengbao Zheng	46948300a2	[c10d] integrate PMI NCCL initialization to NCCL-PG (#128243 ) Summary: Move broadcastUniqueID check to NCCLUtils Differential Revision: D58273755 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128243 Approved by: https://github.com/wconstab	2024-06-10 17:20:03 +00:00
Cory Modlin	8830b81208	[c10d] Add commCreateFromRanks to c10d (#127421 ) (#127982 ) This is a duplicate of: https://github.com/pytorch/pytorch/pull/127421 which we can't merge. its landed internally already Summary: `ncclCommCreateFromRanks` - described in this [document](https://docs.google.com/document/d/1QIRkAO4SAQ6eFBpxE51JmRKRAH2bwAHn8OIj69XuFqQ/edit#heading=h.5g71oqe3soez), replaces `ncclCommSplit` in NCCLX versions 2.21.5+. The difference is that `ncclCommCreateFromRanks` is given a list of active ranks and is collective only over those ranks as opposed to `ncclCommSplit` for which you give it a color for every rank including NO_COLOR for inactive ranks and the collective is over the entire world. This diff connects `ncclCommCreateFromRanks` to `c10d` `ncclCommSplit` will still be available at the NCCL API but, in this diff, is not used starting at version 2.21.5 Split the python test and implementation of `split()` for internal FB and external OSS builds. The diff defines `"USE_C10D_NCCL_FBCODE"` as a compiler option. When defined, we use the version of split in the newly created `NCCLUtils.cpp` in the `fb` directory. The `fb` directory is not shipit-ed to github. The same API is used for `split()` in both the `ncclx` and `nccl` versions adding `ranks` to the API. This argument is not used in the `nccl` version nor in the 2.18 `ncclx` version where `ncclCommSplit()` is used instead of `ncclCommCreateFromRanks()` in `ncclx` This diff was squashed with D57343946 - see D57343946 for additional review comments. Test Plan: for 2.18.3-1 and 2.21.5-1 versions: ``` buck2 run fbcode//mode/opt -c param.use_nccl=True -c fbcode.nvcc_arch=a100 -c hpc_comms.use_ncclx="$VERSION" -c fbcode.enable_gpu_sections=true fbcode//caffe2/test/distributed/fb:test_comm_split_subgroup_x ``` ``` BUILD SUCCEEDED ... 
ok ---------------------------------------------------------------------- Ran 1 test in 10.210s OK ~/scripts ``` OSS build: `[cmodlin@devgpu003.vll5 ~/fbsource/third-party/ncclx/v2.21.5-1 (e56338cfa)]$ ./maint/oss_build.sh` OSS build output: ``` ... ncclCommHash 197dce9b413e2775 nccl commDesc example_pg Dump from comm 0x4708aa0 rings: [[0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0]] Dump from comm 0x4708aa0 commDesc: example_pg Dump from comm 0x4708aa0 nRanks: 1 Dump from comm 0x4708aa0 nNodes: 1 Dump from comm 0x4708aa0 node: 0 Dump from comm 0x4708aa0 localRanks: 1 Dump from comm 0x4708aa0 localRank: 0 Dump from comm 0x4708aa0 rank: 0 Dump from comm 0x4708aa0 commHash: "197dce9b413e2775" 2024-05-24T09:02:54.385543 devgpu003:3040664:3040744 [0][AsyncJob]ctran/backends/ib/CtranIb.cc:143 NCCL WARN CTRAN-IB : No active device found. 2024-05-24T09:02:54.385607 devgpu003:3040664:3040744 [0][AsyncJob]ctran/mapper/CtranMapper.cc:187 NCCL WARN CTRAN: IB backend not enabled Created NCCL_SPLIT_TYPE_NODE type splitComm 0x11c76d0, rank 0 ~/fbsource/third-party/ncclx/v2.21.5-1 ``` Reviewed By: wconstab, wesbland Differential Revision: D56907877 Fixes #ISSUE_NUMBER Co-authored-by: Cory Modlin <cmodlin@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127982 Approved by: https://github.com/izaitsevfb	2024-06-05 00:19:52 +00:00
Benson Ma	fc73d07e5e	[c10d] Decorate methods in `NCCLUtils.hpp` with `TORCH_API` (#127550 ) Summary: User-defined PyTorch modules that uses `C10D_NCCL_CHECK` run into undefined symbol errors when loaded by `torch.library.load()`, because they have not been exported. This change exports the symbols needed to resolve those runtime errors. Test Plan: PyTorch CI Differential Revision: D57977944 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127550 Approved by: https://github.com/Skylion007	2024-05-31 00:17:25 +00:00
Richard Barnes	ed327876f5	[codemod] `c10:optional` -> `std::optional` (#126135 ) Generated by running the following from PyTorch root: ``` find . -regex ".*\.$cpp\\|h\\|cu\\|hpp\\|cc\\|cxx$$" \| grep -v "build/" \| xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/' ``` `c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi	2024-05-14 19:35:51 +00:00
Wes Bland	6f5f405b05	[ncclx] Rename NCCL-EXP to NCCLX (#125238 ) Reviewed By: kryanchun Differential Revision: D56534548 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125238 Approved by: https://github.com/kwen2501	2024-05-01 23:29:55 +00:00
Shuqiang Zhang	ea1cd31b50	[c10d] Log the target of FR dump (#122345 ) Summary: It would be useful to log the destination of the trace dump in either manifold or local file for the users to quickly locate the dump Test Plan: Modified unit tests Differential Revision: D54972069 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122345 Approved by: https://github.com/wconstab	2024-03-21 08:03:05 +00:00
Shuqiang Zhang	8e20385447	[c10d] fix the macro definition of NCCL_COMM_DUMP (#120502 ) Summary: Only if both macros are defined, should we dump the comm dump, otherwise, use the original definition. The previous implementation missed the function definition when IS_NCCL_EXP is defined but NCCL_COMM_DUMP is not defined Test Plan: Build and unit test Pull Request resolved: https://github.com/pytorch/pytorch/pull/120502 Approved by: https://github.com/dsjohns2, https://github.com/Skylion007	2024-02-23 20:59:39 +00:00
Shuqiang Zhang	a24cba35b0	[c10d][flight recorder] dump additinal NCCL debug info (#120063 ) Summary: This PR is mainly about flight recorder side of changes that takes a map of maps as input, and dump it as picklable. Also add functions that should be compiled only when NCCL_COMM_DUMP is defined Test Plan: Integration tests with NCCL would be done later, here we only do the c10d side of dump test, aka,NCCLTraceTest Testing the dump function is a bit tricky as we don't have existing C++ unit tests for them. So we still use the Python NCCLTraceTest with the python binding of _dump_nccl_trace(), we manually fed the dump_nccl_trace with a map of test info, and assert the pickle result and print the converted python dict: ``` (sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (main)]$ python test/distributed/test_c10d_nccl.py NCCLTraceTest NCCL version 2.19.3+cuda12.0 [rank0]:[E ProcessGroupNCCL.cpp:1200] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info. .NCCL version 2.19.3+cuda12.0 .NCCL version 2.19.3+cuda12.0 {'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2': 'Value2', 'Key1': 'Value1'}} {'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2': 'Value2', 'Key1': 'Value1'}} .NCCL version 2.19.3+cuda12.0 {'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2': 'Value2', 'Key1': 'Value1'}} {'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2': 'Value2', 'Key1': 'Value1'}} .NCCL version 2.19.3+cuda12.0 .NCCL version 2.19.3+cuda12.0 .NCCL version 2.19.3+cuda12.0 .NCCL version 2.19.3+cuda12.0 . ---------------------------------------------------------------------- Ran 8 tests in 95.761s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/120063 Approved by: https://github.com/wconstab	2024-02-21 16:35:23 +00:00
Ke Wen	b2043c0543	[c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421 ) Part 2 and last part of #118674: Introduce actual "single-device" code change to ProcessGroupNCCL. assert size == 1 and test refactor have been done in #119099. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421 Approved by: https://github.com/shuqiangzhang	2024-02-12 18:45:49 +00:00
PyTorch MergeBot	0342b227e5	Revert "[c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421 )" This reverts commit f3e7d809936d9f1bf63102e8afe241e13ed8766a. Reverted https://github.com/pytorch/pytorch/pull/119421 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119421#issuecomment-1938169747))	2024-02-12 07:34:20 +00:00
Ke Wen	f3e7d80993	[c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421 ) Part 2 and last part of #118674: Introduce actual "single-device" code change to ProcessGroupNCCL. assert size == 1 and test refactor have been done in #119099. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421 Approved by: https://github.com/shuqiangzhang	2024-02-09 20:23:20 +00:00
Shuqiang Zhang	c7af626a26	[c10d] allow nonblocking wrap of ncclCommInitRankConfig (#118256 ) resolve #117749 Summary: Updated the PR with the following intentions: 1. identify eagerMode init (as opposed to lazy init), in which case we will create NCCL comms without guarantees that they are fully initialized if NONBLOCKING mode is also enabled. 2. Python users can do their other works (e.g., model init) between invoking init_process_group and their first collective call. 3. c10D would guarantee/wait for communicators to be initialized before issuing the first collective call. 4. For NCCL collective calls, the contract between python users and c10d is not changed much from blocking calls (C10d would wait the NCCL call to be ncclSuccess, or timeout, whichever happens first). Pull Request resolved: https://github.com/pytorch/pytorch/pull/118256 Approved by: https://github.com/kwen2501	2024-01-30 06:23:20 +00:00
eqy	8d790abab9	[NCCL][c10d] Log failing pointer if deregistration fails (#118455 ) For debugging convenience CC @minsii @Aidyn-A @syed-ahmed @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/118455 Approved by: https://github.com/wconstab	2024-01-27 11:03:02 +00:00
Min Si	838d3620cd	[NCCL PG] log NCCL comm at creation and abort (#118335 ) Summary: It helps correlate NCCL PG with corresponding NCCL comm in separate logs. Differential Revision: D53107647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118335 Approved by: https://github.com/wconstab	2024-01-27 01:43:53 +00:00
fduwjj	05ef2030ea	[c10d] Add logs for NCCL Comm Abort call (#117868 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117868 Approved by: https://github.com/kwen2501	2024-01-20 21:34:13 +00:00
fduwjj	ca4df16fdd	[c10d] Make DebugInfoWriter Singleton across all PG objects (#116489 ) Previously, we have the writer register to each NCCL PG(backend), so for every pg, we have a NCCL PG instance, so if we use some customized writer when multiple sub-PGs are used, we need to ensure user to register the writer for every backend which indicates a bad UX. Furthermore, the debug info is global, so it does not make sense to have the writer for each instance. We even have a static mutex in the `dumpDebuggingInfo` to ensure we serialize the write, that makes it more obvious that we can make the writer a singleton so that we only have one writer instance for all PG instances. Although the rationale is clear, the implementation may vary a lot. So this PR is RFC for now to see if this implementation makes sense or not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116489 Approved by: https://github.com/kwen2501	2024-01-03 03:42:54 +00:00
fduwjj	f6dfbffb3b	[c10d] Add hashing as a debug feature for before and after NCCL collective call (#113238 ) For now, we use `TORCH_DISTRIBUTED_DEBUG = DETAIL` to turn a debug feature which calculate the hashing for input tensors and output results of c10d collective in NCCL. This is a debugging feature so that we can rule out the bug from c10d level. <img width="840" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/cdc70b0b-ae3c-4efd-86ff-adc5c5ba505f"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113238 Approved by: https://github.com/wconstab, https://github.com/fegin	2023-12-25 22:25:38 +00:00
PyTorch MergeBot	f101426790	Revert "Move class definition of DebugInfoWriter to TraceUtil as well (#114901 )" This reverts commit fb325bbd46f69bea8b2debd3ab5830c9eedadc0d. Reverted https://github.com/pytorch/pytorch/pull/114901 on behalf of https://github.com/atalman due to Diff broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/114901#issuecomment-1838815178))	2023-12-04 14:55:39 +00:00
fduwjj	fb325bbd46	Move class definition of DebugInfoWriter to TraceUtil as well (#114901 ) Since we moved the implementation of the class to TraceUtils in https://github.com/pytorch/pytorch/pull/114367, maybe we also want to move the implementation here as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114901 Approved by: https://github.com/XilunWu	2023-12-01 03:28:16 +00:00
Chip Turner	066e072524	Retry #112889 (Opportunistically use ncclCommSplit when creating new NCCL groups) (#114385 ) - [c10d] (retry) Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889) - Guard use of `split_from` with a `hasattr` check for cases when NCCL (or RCCL) lacks `ncclCommSplit` Fixes cause of revert of original PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/114385 Approved by: https://github.com/huydhn	2023-11-23 07:00:00 +00:00
PyTorch MergeBot	b927a4e2ca	Revert "Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889 )" This reverts commit 64a5372e6ce9b6ca0ee5c7482b27e24561725b28. Reverted https://github.com/pytorch/pytorch/pull/112889 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it is failing ROCm distributed jobs in trunk `4d07428ede` ([comment](https://github.com/pytorch/pytorch/pull/112889#issuecomment-1823214376))	2023-11-22 17:43:51 +00:00
Chip Turner	64a5372e6c	Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889 ) Currently `ncclCommInitRankConfig` is always used when creating new communicator groups. This is wasteful as it creates non-shared pairs of endpoint queues as well as costs time to re-establish communication. This change is transparent and opportunistic; when `dist.new_group` is called, it will use the existing, healthy world process group to select the right ranks to include in the process group. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112889 Approved by: https://github.com/kwen2501	2023-11-21 21:03:52 +00:00
fduwjj	5fb1d8f18a	[NCCL PG] Enable storing nccl traces into storage and make it configurable (#113503 ) This PR is to enable the store of NCCL flight recorder to storage and make it configurable by letting users register their own way of storing the debug info. We will then provide users a script to offline parse and process the dumped blobs. One thing, this PR is not trying to resolve is to decide where to dump the debug info. I will send a follow-up PR to address that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113503 Approved by: https://github.com/zdevito	2023-11-16 07:44:15 +00:00
Min Si	ab1f6d58bc	[c10d] use allocator trace callbacks for NCCL PG register (#112850 ) Summary: We need to register all cache segments allocated by allocator, so that NCCL can apply zero copy algorithms at collective and point-to-point operations. How to track and register all cache segments: - It registers a register and a deregister hook to cache allocator as action tracker callbacks, tracking SEGMENT_ALLOC and SEGMENT_FREE trace entries, respectively. When SEGMENT_ALLOC is tracked, the register hook will register to the PG's communicators on the same device. Similarly, when SEGMENT_FREE is tracked, the deregister hook handles deregistration before cudaFree. - When a new NCCL communicator is created, it dumps the snapspot from cache allocator to register all existing cache segments at once. - When a NCCL communicator is aborted, it deregisters all segments that have been registered by this communicator Test Plan: See test in D50726971 Reviewed By: wconstab Differential Revision: D50726970 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112850 Approved by: https://github.com/wconstab	2023-11-06 19:29:32 +00:00
Shen Li	dd6319198d	Apply clang-format to distributed/c10d folder (#107140 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107140 Approved by: https://github.com/H-Huang	2023-08-14 23:16:38 +00:00
Syed Tousif Ahmed	870880236b	Enables configuration of NCCL communicators (#97394 ) NCCL 2.17+ introduces some user configurable parameters for NCCL communicators using [ncclConfig_t](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.ncclConfig_t) datatype and [ncclCommInitRankConfig](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcomminitrankconfig). This PR enables that feature. A user can tune the parameters as follows: ``` import torch.distributed as dist nccl_options = dist.ProcessGroupNCCL.Options() nccl_options.config.max_ctas = 32 nccl_options.config.min_ctas = 8 nccl_options.config.cga_cluster_size = 2 dist.init_process_group(backend='nccl', init_method='env://', pg_options=nccl_options) my_group = dist.new_group(pg_options=nccl_options) ``` The default values of these parameters are what is initialized by `NCCL_CONFIG_INITIALIZER`. Only for DistributedDataParallel, this PR sets the default value of cga_cluster_size to 2 (a heuristic that works well especially for DDP workloads). Tuning these parameters can lead to improvement in end-to-end performance, since it affects the communication-computation overlap for NCCL kernels. CC: @ptrblck @kwen2501 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97394 Approved by: https://github.com/kwen2501	2023-05-25 20:46:19 +00:00
Eddie Yan	c4f81cb6f4	[NCCL] Add experimental Nonblocking NCCL Fault Tolerance/Checking (#95715 ) Support for nonblocking NCCL communicators/fault tolerance/checking which was added in 2.14 as an experimental feature. Enabled via the environment variable: ``` TORCH_NCCL_USE_COMM_NONBLOCKING=1 ``` CC @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/95715 Approved by: https://github.com/kwen2501	2023-04-14 02:03:33 +00:00
PyTorch MergeBot	1149ba5553	Revert "[NCCL] Add experimental Nonblocking NCCL Fault Tolerance/Checking (#95715 )" This reverts commit a33eac398881cfa9aad679ceffd28ace3fa44f01. Reverted https://github.com/pytorch/pytorch/pull/95715 on behalf of https://github.com/PaliC due to This pr has caused a regression on distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess causing it to timeout (https://hud.pytorch.org/failure/distributed%2Ftest_dynamo_distributed.py%3A%3ATestMultiProc%3A%3Atest_ddp_baseline_aot_eager_multiprocess)	2023-04-12 21:15:49 +00:00
Eddie Yan	a33eac3988	[NCCL] Add experimental Nonblocking NCCL Fault Tolerance/Checking (#95715 ) Support for nonblocking NCCL communicators/fault tolerance/checking which was added in 2.14 as an experimental feature. Enabled via the environment variable: ``` TORCH_NCCL_USE_COMM_NONBLOCKING=1 ``` CC @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/95715 Approved by: https://github.com/kwen2501	2023-04-12 18:33:10 +00:00
cyy	f172feae0d	More tidy fixes (#93069 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/93069 Approved by: https://github.com/Skylion007	2023-01-27 06:40:50 +00:00
Howard Huang	bc66ddb5cb	Add torch.distributed.DistBackendError exception type, thrown from C10D_NCCL_CHECK (#88134 ) Currently all of the distributed errors are thrown from the `TORCH_CHECK` macro which throws a generic `RuntimeError`. This change introduced a new error type `DistBackendError` which derives from `RuntimeError` to signify there was an error with the backend communication library. This allows for better error handling and analysis at higher levels in the stack. Motivation: https://docs.google.com/document/d/1j6VPOkC6znscliFuiDWMuMV1_fH4Abgdq7TCHMcXai4/edit#heading=h.a9rc38misyx8 Changes: - introduce new error type - Update `C10D_NCCL_CHECK` Sample script to demonstrate new error type ```python # python -m torch.distributed.run --nproc_per_node=2 <script>.py import torch import torch.distributed as dist if __name__ == "__main__": dist.init_process_group("nccl") dist.broadcast(torch.tensor([1, 2, 3]).cuda(), 0) ``` Differential Revision: [D40998803](https://our.internmc.facebook.com/intern/diff/D40998803) Pull Request resolved: https://github.com/pytorch/pytorch/pull/88134 Approved by: https://github.com/rohan-varma	2022-11-08 13:26:42 +00:00
Ke Wen	1f38abb5d2	Adopt ncclRemoteError (#85887 ) `ncclRemoteError` was added in NCCL 2.13 to indicate a network error or a remote process exiting prematurely. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85887 Approved by: https://github.com/wanchaol	2022-09-30 09:17:49 +00:00
Andrey	dde43d083b	[c10d] Reorder macros so they are defined before getting used (#85850 ) Summary: Move preprocessor macros all the way up, so they are defined before being used. Test Plan: existing tests Reviewed By: wanchaol Pull Request resolved: https://github.com/pytorch/pytorch/pull/85850 Approved by: https://github.com/wanchaol	2022-09-29 23:44:57 +00:00
Wanchao Liang	72b32f1644	[c10d] move ncclgetlasterror directive definition upfront (#85825 ) Move the directive definition of ncclGetLastError() upfront so that the C++ preprocessor does not treat it as an empty string. Pull Request resolved: https://github.com/pytorch/pytorch/pull/85825 Approved by: https://github.com/H-Huang, https://github.com/kwen2501	2022-09-29 06:17:43 +00:00
Wanchao Liang	976f8bee94	[c10d] add ncclGetLastError to NCCL pg (#83724 ) This PR adds the ncclGetLastError API to the NCCL PG to report errors from NCCL failures directly, instead of guessing at the cause. Differential Revision: [D39161199](https://our.internmc.facebook.com/intern/diff/D39161199) Pull Request resolved: https://github.com/pytorch/pytorch/pull/83724 Approved by: https://github.com/kwen2501, https://github.com/H-Huang	2022-09-14 23:21:33 +00:00
Masaki Kozuki	ab6c57217a	Add NCCL PreMul Sum to c10d `redce` ops (#84243 ) This is based on #81272 but this conforms to TorchScript Compiler ## TODO - [ ] Update `abaf8112e6/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp (L64-L73)` to use `ReduceOp::RedOpType`. In my first try with `USE_SYSTEM_UCC=1`, this change wasn't necessary (I think) because of `ReduceOp::RedOpType` operator. That being said, I want to make it more explicit. cc @ptrblck @kwen2501 @aazzolini cc @zasdfgbnm for visibility to the TODO above Pull Request resolved: https://github.com/pytorch/pytorch/pull/84243 Approved by: https://github.com/kwen2501	2022-09-02 21:57:45 +00:00
PyTorch MergeBot	1f61c39ac4	Revert "Support NCCL Premul Sum (#81272 )" This reverts commit 432c508e71111f9d5382322e0e6b1bc1c66bf0ec. Reverted https://github.com/pytorch/pytorch/pull/81272 on behalf of https://github.com/weiwangmeta due to breaking internal builds	2022-08-25 05:01:37 +00:00
Masaki Kozuki	432c508e71	Support NCCL Premul Sum (#81272 ) This PR adds the support for https://docs.nvidia.com/deeplearning/nccl/archives/nccl_21212/user-guide/docs/api/ops.html?highlight=premul#c.ncclRedOpCreatePreMulSum. The major changes include - convert enum ReduceOp to struct - add premul sum specific paths to init.cpp and Ops.cpp. note: - For pip wheels / conda binaries to support this, ~~I think https://github.com/pytorch/pytorch/pull/79132 would be needed~~ https://github.com/pytorch/pytorch/pull/82775 landed The commit titled "add nccl premul" whose current hash is `cb99ad6744` was authored by @mcarilli and @ptrblck. cc @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/81272 Approved by: https://github.com/kwen2501	2022-08-24 04:53:25 +00:00
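
The ring-buffer situation described in 5ad2ad5921 (#128948) can be illustrated with a small Python sketch. This is not the C++ code from `NCCLTraceBuffer`; the `Entry` and `RingTrace` names, their fields, and the placeholder event strings are hypothetical and exist only to show the idea: before clearing an entry on retire, check that the slot still holds the work being retired.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    # Hypothetical stand-in for one trace-buffer entry.
    id: int
    start: Optional[str] = "start-event"  # placeholder for a start event handle
    end: Optional[str] = "end-event"      # placeholder for an end event handle
    retired: bool = False


class RingTrace:
    """Fixed-size ring buffer keyed by a monotonically increasing work id."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.entries: List[Entry] = []
        self.next_id = 0

    def record(self) -> int:
        entry = Entry(id=self.next_id)
        if len(self.entries) < self.max_entries:
            self.entries.append(entry)
        else:
            # Buffer is full: overwrite the oldest slot.
            self.entries[self.next_id % self.max_entries] = entry
        self.next_id += 1
        return entry.id

    def retire(self, work_id: int) -> bool:
        slot = work_id % self.max_entries
        if slot >= len(self.entries):
            return False
        entry = self.entries[slot]
        # Only touch the slot if it still belongs to the work being retired.
        if entry.id != work_id:
            return False
        entry.start = None
        entry.end = None
        entry.retired = True
        return True


buf = RingTrace(max_entries=10)
for _ in range(11):          # works 0..10; work 10 overwrites the slot of work 0
    buf.record()
stale = buf.retire(0)        # slot 0 now holds work 10, so nothing is cleared
fresh = buf.retire(10)       # the matching entry is retired as usual
```

With a buffer size of 10 and 11 recorded works, retiring work 0 leaves the entry for work 10 untouched, which matches the scenario in the commit message.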


56 Commits