Fixes https://github.com/pytorch/pytorch/issues/137856.
### Issue 1
Today under `ProcessGroupNCCL::Options`, color is declared as:
```
int64_t split_color{0};
```
When this variable is passed to `ncclCommSplit`, which accepts an `int`, the value may overflow and become negative, as in #137856. However, the NCCL API only accepts non-negative colors (or `NCCL_SPLIT_NOCOLOR`).
But that's not all.
### Issue 2
`split_color` is exposed to the Python frontend via pybind. If we simply change it from `int64_t` to `int` in C++, pybind complains:
```
[rank0]: TypeError: (): incompatible function arguments. The following argument types are supported:
[rank0]: 1. (self: torch._C._distributed_c10d.ProcessGroupNCCL.Options, arg0: int) -> None
```
This is because a Python `int` represents a wider range than a C++ `int`, so we cannot pass hash values, which are potentially big ints, from Python to C++. This PR therefore takes the hash value modulo `c_int`'s max value.
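As a rough illustration of the arithmetic only (the PR does the modulo on the Python side, and `foldColor` is a hypothetical helper, not a PyTorch or NCCL function), folding a wide hash into the non-negative `int` range could look like this in C++:

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of PyTorch or NCCL): fold a 64-bit hash into
// the range [0, INT_MAX) so it can be passed where a non-negative `int`
// color is expected.
inline int foldColor(std::uint64_t hash) {
  constexpr std::uint64_t kIntMax =
      static_cast<std::uint64_t>(std::numeric_limits<int>::max());
  // The remainder is strictly less than INT_MAX, so the cast cannot overflow.
  return static_cast<int>(hash % kIntMax);
}
```

Whatever the input, the result fits in an `int` and is never negative, which a plain narrowing cast of a large hash does not guarantee.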
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137855
Approved by: https://github.com/wconstab
### Fix 1: Throw async error during init wait
Previously we just busy-waited for `ncclSuccess`; if the non-blocking init encountered an error, we never reported it. This PR adds detection of async errors via `ncclCommGetAsyncError`.
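The shape of such a wait loop can be sketched without NCCL at all. In the sketch below, `CommState` is a stand-in for the NCCL result codes, `poll` is a stand-in for the NCCL async-error query, and `waitUntilReady` is a hypothetical name; none of these are the actual PyTorch code.

```cpp
#include <chrono>
#include <functional>
#include <stdexcept>
#include <thread>

// Stand-in for the NCCL result codes relevant here (assumption for the sketch).
enum class CommState { Success, InProgress, Error };

// Hypothetical wait loop: keep polling while the communicator is still in
// progress, but surface an error (or a timeout) instead of spinning forever.
inline void waitUntilReady(
    const std::function<CommState()>& poll,
    std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    const CommState state = poll();
    if (state == CommState::Success) {
      return;
    }
    if (state != CommState::InProgress) {
      throw std::runtime_error("async error while waiting for communicator");
    }
    if (std::chrono::steady_clock::now() >= deadline) {
      throw std::runtime_error("timed out while waiting for communicator");
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}
```

The key difference from a plain busy wait is the middle branch: any state other than "success" or "in progress" is reported to the caller.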
### Fix 2: Add wait after comm split
```
// After calling ncclCommSplit in non-blocking mode, we should wait for the
// source communicator to be out of ncclInProgress state.
// Reason 1:
// it's unsafe to call new operations on the parent comm while it's in
// ncclInProgress state.
// Reason 2:
// as of NCCL 2.23, the ptr value of child comm will not be filled until the
// state of parent comm is ncclSuccess. This may change in the future. See:
// https://github.com/NVIDIA/nccl/issues/1472
```
This wait does not mean the child comm is ready for use, nor does it block until that point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137741
Approved by: https://github.com/shuqiangzhang
When non-blocking mode is enabled, we need to make sure `ncclComm_` is ready before calling NCCL APIs on it.
`NCCLComm::getNcclComm` does that for us (thanks to a wait function inside), so it is preferred over using `ncclComm_` directly.
To prevent `ncclComm_` from being used directly from outside, e.g. in `ProcessGroupNCCL`, we also make it a private member of the `NCCLComm` class, the external-facing wrapper.
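A minimal sketch of the pattern, assuming a plain `int` handle and a readiness flag in place of the real communicator (`CommWrapper`, `markReady` and `getHandle` are hypothetical names, not the `NCCLComm` API):

```cpp
#include <atomic>
#include <thread>

// Hypothetical wrapper: the raw handle is private, and the only way to read
// it is a getter that waits until the handle has been marked ready.
class CommWrapper {
 public:
  explicit CommWrapper(int handle) : handle_(handle) {}

  // Called once initialization has completed.
  void markReady() {
    ready_.store(true, std::memory_order_release);
  }

  // Waits for readiness, then returns the handle.
  int getHandle() const {
    while (!ready_.load(std::memory_order_acquire)) {
      std::this_thread::yield();
    }
    return handle_;
  }

 private:
  int handle_;
  std::atomic<bool> ready_{false};
};
```

Because `handle_` is private, callers cannot skip the wait by accident; they have to go through `getHandle()`.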
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137573
Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang, https://github.com/c-p-i-o
ghstack dependencies: #137572
Currently we pass only one input tensor and one output tensor to the function `collective`, which makes NaNCheck, the work numel stats and the FR input/output sizes inaccurate for all-to-all, scatter and reduce. So we let `collective` take a list of tensors to ensure it works for all collectives inside PGNCCL.
This partially reverts what we did in https://github.com/pytorch/pytorch/pull/119421. Down the road we will do another round of cleanup on `collective`; for now, for the sake of correctness, we changed it back.
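To see why a single tensor is not enough, consider the numel statistic. The sketch below uses a hypothetical `FakeTensor` stand-in rather than `at::Tensor`, and `totalNumel` is an illustrative name only:

```cpp
#include <cstdint>
#include <vector>

// Stand-in for a tensor; only the element count matters for this sketch.
struct FakeTensor {
  std::int64_t numel;
};

// Hypothetical helper: count elements across every tensor in the list,
// instead of looking at the first tensor only.
inline std::int64_t totalNumel(const std::vector<FakeTensor>& tensors) {
  std::int64_t total = 0;
  for (const FakeTensor& t : tensors) {
    total += t.numel;
  }
  return total;
}
```

For a scatter with three input tensors of 2, 3 and 5 elements, looking at the first tensor alone would report 2 elements, while the list-based count reports 10.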
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135049
Approved by: https://github.com/kwen2501
This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.
This is motivated by some deadlocks we're seeing, where it's unclear whether the problem is in NCCL or on the PyTorch side of things.
This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.
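A minimal sketch of the idea, assuming a retry-and-log loop (`lockWithLogging` is a hypothetical helper and is not the actual `C10D_LOCK_GUARD` implementation). `std::timed_mutex` is needed because `std::mutex` has no timed acquisition, and `std::condition_variable_any` is needed because `std::condition_variable` only works with `std::unique_lock<std::mutex>`.

```cpp
#include <chrono>
#include <iostream>
#include <mutex>

// Hypothetical helper: try to take the lock, and log every time the attempt
// times out, so a stuck lock becomes visible in the logs.
inline std::unique_lock<std::timed_mutex> lockWithLogging(
    std::timed_mutex& mutex,
    std::chrono::milliseconds logInterval,
    const char* where) {
  std::unique_lock<std::timed_mutex> lock(mutex, std::defer_lock);
  while (!lock.try_lock_for(logInterval)) {
    std::cerr << "could not acquire lock at " << where << " within "
              << logInterval.count() << " ms; retrying\n";
  }
  return lock;
}
```

A caller would write something like `auto guard = lockWithLogging(mutex_, std::chrono::seconds(30), "NCCLUtils");` and hold `guard` for the duration of the critical section.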
Test plan:
existing CI for regressions
will add unit tests on `C10D_LOCK_GUARD`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
This is so that the tensor's lifetime management is the same as the management built for the NCCL pre and post kernels.
It is also so that, on visualizers, these kernels show up in the NCCL stream line. If they showed up in the compute line, users might get confused ("my code does not have these kernels").
The check is thus moved after the point where the NCCL stream is made to depend on the last compute kernel.
Also moved declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
This patch makes two changes:
1. Whenever ncclCommSplit accepts groupRanks in its config, we should
populate it. This is independent of using PMI or not. For example,
non-PMI NCCL can also use this information, if it chooses to.
2. Provide a user flag to decide when to do a uniqueId broadcast and
when to skip it. This is a performance optimization, and not a
correctness requirement. If the user forgets to set this, we will
do the uniqueId broadcast, which is wasteful (because it will be
ignored by NCCL), but not incorrect.
Differential Revision: [D60966774](https://our.internmc.facebook.com/intern/diff/D60966774/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133960
Approved by: https://github.com/shuqiangzhang
Summary:
We saw `ncclCommAbort` get called and hang during `NCCLComm::create`.
If the NCCL comm is not properly initialized, the behavior of `ncclCommAbort` is
'undefined'. Avoiding the call allows the process to properly throw an
exception.
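The guard can be sketched as follows. `GuardedComm` and its members are hypothetical names, and the injected `abortFn` stands in for the real abort call; this is an illustration of the idea, not the PyTorch implementation.

```cpp
#include <functional>
#include <utility>

// Hypothetical wrapper: only forward the abort to the underlying library if
// the communicator finished initialization, and never abort twice.
class GuardedComm {
 public:
  explicit GuardedComm(std::function<void()> abortFn)
      : abortFn_(std::move(abortFn)) {}

  void markInitialized() {
    initialized_ = true;
  }

  // Returns true if the underlying abort was actually invoked.
  bool abort() {
    if (!initialized_ || aborted_) {
      return false;
    }
    aborted_ = true;
    abortFn_();
    return true;
  }

 private:
  std::function<void()> abortFn_;
  bool initialized_ = false;
  bool aborted_ = false;
};
```

With this shape, a failure during creation skips the abort call entirely, so the original error can propagate as an exception.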
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133630
Approved by: https://github.com/wconstab
We are considering consolidating the data sources for logging and flight recorder so that we don't build multiple paths for debugging information. Before we do any merging, we first want to ensure that the PG status is also included in flight recorder. We can also leverage this information to validate our FR dump: because the dump is not synced, we might see some variation in the dump.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131268
Approved by: https://github.com/shuqiangzhang
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was moved out of it and split into the NCCLUtils files.
In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
This is a duplicate of https://github.com/pytorch/pytorch/pull/127421, which we can't merge. It has already landed internally.
Summary:
`ncclCommCreateFromRanks`, described in this [document](https://docs.google.com/document/d/1QIRkAO4SAQ6eFBpxE51JmRKRAH2bwAHn8OIj69XuFqQ/edit#heading=h.5g71oqe3soez), replaces `ncclCommSplit` in NCCLX versions 2.21.5+. The difference is that `ncclCommCreateFromRanks` is given a list of active ranks and is collective only over those ranks, whereas `ncclCommSplit` is given a color for every rank, including NO_COLOR for inactive ranks, and is collective over the entire world.
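The two calling conventions carry the same information in different shapes. The sketch below converts a list of active ranks into the per-rank colors a split-style call expects; `colorsForSplit` is a hypothetical helper and `kNoColor` is a stand-in for the no-color sentinel, with its value assumed for illustration.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the "no color" sentinel (value assumed for this sketch).
constexpr int kNoColor = -1;

// Hypothetical helper: every rank in the world gets an entry; active ranks
// share `color`, inactive ranks get the sentinel.
inline std::vector<int> colorsForSplit(
    int worldSize,
    const std::vector<int>& activeRanks,
    int color) {
  std::vector<int> colors(static_cast<std::size_t>(worldSize), kNoColor);
  for (int rank : activeRanks) {
    if (rank >= 0 && rank < worldSize) {
      colors[static_cast<std::size_t>(rank)] = color;
    }
  }
  return colors;
}
```

In the split-style convention all `worldSize` ranks must participate, each passing its own entry; in the ranks-list convention only the ranks in `activeRanks` make the call.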
This diff connects `ncclCommCreateFromRanks` to `c10d`.
`ncclCommSplit` will still be available in the NCCL API but, in this diff, it is not used starting at version 2.21.5.
Split the python test and implementation of `split()` for internal FB and external OSS builds.
The diff defines `"USE_C10D_NCCL_FBCODE"` as a compiler option. When defined, we use the version of split in the newly created `NCCLUtils.cpp` in the `fb` directory. The `fb` directory is not *shipit*-ed to *github*.
The same API is used for `split()` in both the `ncclx` and `nccl` versions, adding `ranks` to the API. This argument is not used in the `nccl` version, nor in the 2.18 `ncclx` version, where `ncclCommSplit()` is used instead of `ncclCommCreateFromRanks()`.
This diff was squashed with D57343946 - see D57343946 for additional review comments.
Test Plan:
for 2.18.3-1 and 2.21.5-1 versions:
```
buck2 run fbcode//mode/opt -c param.use_nccl=True -c fbcode.nvcc_arch=a100 -c hpc_comms.use_ncclx="$VERSION" -c fbcode.enable_gpu_sections=true fbcode//caffe2/test/distributed/fb:test_comm_split_subgroup_x
```
```
BUILD SUCCEEDED
...
ok
----------------------------------------------------------------------
Ran 1 test in 10.210s
OK
~/scripts
```
OSS build:
`[cmodlin@devgpu003.vll5 ~/fbsource/third-party/ncclx/v2.21.5-1 (e56338cfa)]$ ./maint/oss_build.sh`
OSS build output:
```
...
ncclCommHash 197dce9b413e2775
nccl commDesc example_pg
Dump from comm 0x4708aa0 rings: [[0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0]]
Dump from comm 0x4708aa0 commDesc: example_pg
Dump from comm 0x4708aa0 nRanks: 1
Dump from comm 0x4708aa0 nNodes: 1
Dump from comm 0x4708aa0 node: 0
Dump from comm 0x4708aa0 localRanks: 1
Dump from comm 0x4708aa0 localRank: 0
Dump from comm 0x4708aa0 rank: 0
Dump from comm 0x4708aa0 commHash: "197dce9b413e2775"
2024-05-24T09:02:54.385543 devgpu003:3040664:3040744 [0][AsyncJob]ctran/backends/ib/CtranIb.cc:143 NCCL WARN CTRAN-IB : No active device found.
2024-05-24T09:02:54.385607 devgpu003:3040664:3040744 [0][AsyncJob]ctran/mapper/CtranMapper.cc:187 NCCL WARN CTRAN: IB backend not enabled
Created NCCL_SPLIT_TYPE_NODE type splitComm 0x11c76d0, rank 0
~/fbsource/third-party/ncclx/v2.21.5-1
```
Reviewed By: wconstab, wesbland
Differential Revision: D56907877
Co-authored-by: Cory Modlin <cmodlin@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127982
Approved by: https://github.com/izaitsevfb
Summary:
User-defined PyTorch modules that use `C10D_NCCL_CHECK` run into undefined symbol errors
when loaded by `torch.library.load()`, because the required symbols have not been exported. This change
exports the symbols needed to resolve those runtime errors.
Test Plan: PyTorch CI
Differential Revision: D57977944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127550
Approved by: https://github.com/Skylion007
Summary: It would be useful to log the destination of the trace dump, whether in manifold or in a local file, so that users can quickly locate the dump.
Test Plan: Modified unit tests
Differential Revision: D54972069
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122345
Approved by: https://github.com/wconstab
Summary:
This PR is mainly about the flight recorder side of the changes: it takes a
map of maps as input and dumps it in picklable form. It also adds functions that
should be compiled only when NCCL_COMM_DUMP is defined.
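To make the "map of maps" shape concrete, here is a small sketch that renders such a structure as dict-like text. It is only an illustration: the PR pickles the data rather than formatting a string, `renderDump` is a hypothetical name, and no escaping of quotes is attempted.

```cpp
#include <map>
#include <sstream>
#include <string>

using InnerMap = std::map<std::string, std::string>;
using DumpMap = std::map<std::string, InnerMap>;

// Hypothetical helper: render {commId: {key: value}} as dict-like text.
// std::map keeps keys sorted, so the output order is deterministic.
inline std::string renderDump(const DumpMap& dump) {
  std::ostringstream out;
  out << "{";
  bool firstOuter = true;
  for (const auto& outer : dump) {
    if (!firstOuter) {
      out << ", ";
    }
    firstOuter = false;
    out << "'" << outer.first << "': {";
    bool firstInner = true;
    for (const auto& inner : outer.second) {
      if (!firstInner) {
        out << ", ";
      }
      firstInner = false;
      out << "'" << inner.first << "': '" << inner.second << "'";
    }
    out << "}";
  }
  out << "}";
  return out.str();
}
```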
Test Plan:
Integration tests with NCCL will be done later; here we only test the
c10d side of the dump, i.e. NCCLTraceTest.
Testing the dump function is a bit tricky as we don't have
existing C++ unit tests for it. So we still use the Python NCCLTraceTest with
the Python binding of `_dump_nccl_trace()`: we manually feed
`dump_nccl_trace` a map of test info, assert on the pickle result, and
print the converted Python dict:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (main)]$ python
test/distributed/test_c10d_nccl.py NCCLTraceTest
NCCL version 2.19.3+cuda12.0
[rank0]:[E ProcessGroupNCCL.cpp:1200] [PG 0 Rank 0] ProcessGroupNCCL
preparing to dump debug info.
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.
----------------------------------------------------------------------
Ran 8 tests in 95.761s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120063
Approved by: https://github.com/wconstab
Resolves #117749
Summary:
Updated the PR with the following intentions:
1. Identify eager-mode init (as opposed to lazy init), in which case we create NCCL comms without guarantees that they are fully initialized if NONBLOCKING mode is also enabled.
2. Python users can do other work (e.g., model init) between invoking `init_process_group` and their first collective call.
3. c10d guarantees that communicators are initialized (waiting if necessary) before issuing the first collective call.
4. For NCCL collective calls, the contract between Python users and c10d is not changed much from blocking calls (c10d waits for the NCCL call to return ncclSuccess or to time out, whichever happens first).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118256
Approved by: https://github.com/kwen2501
Previously, the writer was registered with each NCCL PG (backend). Since there is one NCCL PG instance per PG, a user with a customized writer and multiple sub-PGs had to register the writer for every backend, which is bad UX. Furthermore, the debug info is global, so it does not make sense to have a writer per instance. We even have a static mutex in `dumpDebuggingInfo` to serialize the writes, which makes it even more obvious that the writer can be a singleton, so that there is only one writer instance for all PG instances.
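One possible shape for such a singleton is sketched below. All names here (`Writer`, `MemoryWriter`, `WriterRegistry`) are hypothetical and chosen for illustration; they are not the classes this PR introduces.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical writer interface.
class Writer {
 public:
  virtual ~Writer() = default;
  virtual void write(const std::string& blob) = 0;
};

// Example writer that appends blobs to a caller-owned vector.
class MemoryWriter : public Writer {
 public:
  explicit MemoryWriter(std::vector<std::string>* sink) : sink_(sink) {}
  void write(const std::string& blob) override {
    sink_->push_back(blob);
  }

 private:
  std::vector<std::string>* sink_;
};

// Hypothetical process-wide registry: one writer shared by all PG instances,
// with a mutex serializing registration and writes.
class WriterRegistry {
 public:
  static WriterRegistry& instance() {
    static WriterRegistry registry;
    return registry;
  }

  void registerWriter(std::unique_ptr<Writer> writer) {
    std::lock_guard<std::mutex> guard(mutex_);
    writer_ = std::move(writer);
  }

  // Returns false if no writer has been registered yet.
  bool write(const std::string& blob) {
    std::lock_guard<std::mutex> guard(mutex_);
    if (!writer_) {
      return false;
    }
    writer_->write(blob);
    return true;
  }

 private:
  WriterRegistry() = default;
  std::mutex mutex_;
  std::unique_ptr<Writer> writer_;
};
```

A user registers a custom writer once, and every PG instance then reaches the same writer through `WriterRegistry::instance()`.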
Although the rationale is clear, the implementation may vary a lot. So this PR is an RFC for now, to see whether this implementation makes sense.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116489
Approved by: https://github.com/kwen2501
- [c10d] (retry) Opportunistically use `ncclCommSplit` when creating new NCCL groups (#112889)
- Guard use of `split_from` with a `hasattr` check for cases when NCCL (or RCCL) lacks `ncclCommSplit`
Fixes the cause of the revert of the original PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114385
Approved by: https://github.com/huydhn
Currently `ncclCommInitRankConfig` is always used when creating new
communicator groups. This is wasteful, as it creates non-shared pairs
of endpoint queues and costs time to re-establish
communication.
This change is transparent and opportunistic; when `dist.new_group` is
called, it will use the existing, healthy world process group to
select the right ranks to include in the process group.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112889
Approved by: https://github.com/kwen2501
This PR enables writing the NCCL flight recorder's debug info to storage and makes this configurable by letting users register their own way of storing the debug info. We will then provide users with a script to parse and process the dumped blobs offline.
One thing this PR does not try to resolve is where to dump the debug info. I will send a follow-up PR to address that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113503
Approved by: https://github.com/zdevito