We also want a Python-side API that lets users reset FR (Flight Recorder) recording entries. We don't need to reset PGNCCL's member counter, since we are creating a new PGNCCL anyway; FR, however, is a global ring buffer, so it does need to be reset.
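As a hypothetical usage sketch (the reset function name below is illustrative, not necessarily the exact API added by this PR):

```python
import torch.distributed as dist

def reinit_with_clean_fr():
    """Recreate the NCCL process group with a clean Flight Recorder.

    PGNCCL's own counters need no reset because a brand-new PGNCCL is
    created; FR, however, is a global ring buffer, so its entries must
    be cleared explicitly.
    """
    dist.destroy_process_group()
    # Hypothetical name for the Python-side reset entry point:
    # torch.distributed._reset_flight_recorder()
    dist.init_process_group("nccl")
```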
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164988
Approved by: https://github.com/tushar00jain
ghstack dependencies: #164752
Increase the tolerance for the following UTs, as a slight mismatch was seen on MI200:
- test_data_parallel.py:test_strided_grad_layout
- test_c10d_nccl.py:test_grad_layout_1devicemodule_1replicaperprocess
Skip for MI200:
- test_fully_shard_training.py:test_2d_mlp_with_nd_mesh
- test_2d_composability.py:test_train_parity_2d_mlp
- test_fully_shard_overlap.py:test_fully_shard_training_overlap
Fixes #159489, fixes #159488, fixes #152700, fixes #125555, fixes #134139
Working as is on both MI200 and MI300:
Fixes #125991, fixes #125918
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164390
Approved by: https://github.com/jeffdaily
This test sets `NCCL_ALGO=NVLS` in NcclUserBufferRegistrationTest, which affects tests run in the same process, such as `test_on_completion_hook_*`, which then fail with
> invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.26.2
> ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
> Last error:
> Error : no algorithm/protocol available for function Broadcast with datatype ncclInt8. NCCL_ALGO was set to NVLS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163063
Approved by: https://github.com/ezyang
Today we can initialize a mixed-backend process group (e.g. "cpu:gloo,cuda:nccl") but we can only pass one set of process group options.
However, when we call `split_group`, we retrieve that set of options from the parent PG and pass it to the ProcessGroup::groupSplit C++ API, which then attempts to propagate that set of options to all backends.
This leads to an assert in some user code, where ProcessGroupGloo::split expects Gloo options but receives NCCL options instead.
Arguably the APIs as currently designed are just broken; we should not ever expect a single set of backend options to apply across multiple backends. However, fixing this would require changing quite a few public APIs.
As a quick fix, since user-provided options really only exist for NCCL, we warn and fall back to default options for Gloo when non-Gloo options are detected.
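A minimal sketch of the failing pattern, assuming a two-rank job (rendezvous details and option fields omitted):

```python
import torch.distributed as dist

# Only one set of options can be passed, and it is NCCL-specific.
opts = dist.ProcessGroupNCCL.Options()
dist.init_process_group("cpu:gloo,cuda:nccl", pg_options=opts)

# split_group propagates the parent's (NCCL) options to every backend,
# so ProcessGroupGloo::split received NCCL options and asserted.
# With this fix, Gloo warns and falls back to default options instead.
subgroup = dist.split_group(split_ranks=[[0, 1]])
```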
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162424
Approved by: https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/H-Huang
- Enable communication of tensors with Complex datatype in ProcessGroupGloo, similar to how ProcessGroupNCCL handles it.
- Move a function, which checks if Complex datatype is supported by a reduce operation, from ProcessGroupNCCL.cpp into a new file to be shared with ProcessGroupGloo.
Fixes #156632
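For example, after this change something like the following should work on a Gloo process group, mirroring the existing NCCL behavior (sketch; rendezvous details omitted):

```python
import torch
import torch.distributed as dist

# Assumes a Gloo process group has been initialized.
dist.init_process_group("gloo")

t = torch.ones(4, dtype=torch.complex64)
# Complex tensors are viewed as pairs of real values for the reduction;
# SUM is well-defined for complex, while e.g. MAX is not and remains
# unsupported (this is what the shared check enforces).
dist.all_reduce(t, op=dist.ReduceOp.SUM)
```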
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156633
Approved by: https://github.com/d4l3k
Instead of implicitly creating a NCCL comm inside mem pool registration for symmetric memory, we now error out, so that only the eager-init case, where the NCCL comm has already been initialized, is supported.
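A sketch of the now-required eager-init pattern (passing `device_id` triggers eager NCCL communicator creation):

```python
import torch
import torch.distributed as dist

# device_id makes PGNCCL create the communicator up front, so the comm
# already exists by the time a mem pool is registered for symmetric
# memory; relying on lazy init now raises an error instead.
dist.init_process_group("nccl", device_id=torch.device("cuda", 0))
```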
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160145
Approved by: https://github.com/kwen2501
This implements a new `wait_stream` API on Work that matches how `wait` works for ProcessGroupNCCL, for CPU-based backends such as Gloo.
The idea is to support Gloo communication overlap in FSDPv2/HSDP with minimal changes to FSDP.
There was a previous attempt to make FSDPv2 use `Work.wait`, but given the extensive stream semantics involved, it doesn't play nicely: https://github.com/pytorch/pytorch/pull/148780
This uses a "Baton" CUDA kernel that spinlocks on a pinned CPU tensor, waiting for it to be set.
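A minimal sketch of the intended usage, assuming `wait_stream` takes the stream that must wait on the collective (signature assumed from the description above):

```python
import torch
import torch.distributed as dist

# Rendezvous details omitted; assumes a Gloo PG and a CUDA device.
dist.init_process_group("gloo")

t = torch.ones(1024, device="cuda")
work = dist.all_reduce(t, async_op=True)

# Rather than blocking the host with work.wait(), enqueue the Baton
# kernel on the current stream: it spins on a pinned CPU tensor that
# is set when the Gloo collective completes, so later GPU work is
# ordered after the all_reduce without a host sync.
work.wait_stream(torch.cuda.current_stream())  # assumed signature

t.mul_(2)  # runs only after the all_reduce has completed
```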
Test plan:
```
pytest test/distributed/test_c10d_gloo.py -v -k wait_stream
pytest test/distributed/test_c10d_nccl.py -v -k wait_stream
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156883
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
When a future returned by `std::async` goes out of scope, its destructor waits on it, which can make the code blocking; this is not expected for the monitoring thread. Instead, let's keep the futures in a vector so that no blocking happens at the launch site. At the end of the loop the wait is still performed, but that is fine since all the checks or dumps have already finished.
Differential Revision: [D77190380](https://our.internmc.facebook.com/intern/diff/D77190380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156653
Approved by: https://github.com/kwen2501
When timing is enabled, the ROCR runtime used to sleep for a short interval, which ensured that the application saw the correct state. For performance reasons that sleep was removed, so the state is no longer guaranteed to be "started". That's why I updated the test's state check to accept either "started" or "scheduled".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153545
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
While looking into a case where an FR dump (the actual dump, not the monitoring thread) takes 30 minutes, I realized that our global write lock is grabbed too early, so the second attempt to dump FR without stack traces fails with a deadlock because the global write lock is still held. We should only grab the lock when we are ready to write, so that we are less likely to hold it forever. I also audited the locking within FR and found one more place where the lock scope can be shrunk.
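The lock-scoping idea, as a minimal sketch (written in Python for brevity; the actual code is the C++ Flight Recorder):

```python
import json
import threading

_fr_write_lock = threading.Lock()  # stand-in for FR's global write lock

def dump_entries(entries, path):
    # The slow part (building the dump) runs without the lock; grabbing
    # the lock up here is what let a stalled dump hold it indefinitely
    # and deadlock the second, stack-trace-free dump attempt.
    payload = json.dumps(entries)

    # Grab the lock only once we are ready to write.
    with _fr_write_lock:
        with open(path, "w") as f:
            f.write(payload)
```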
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155949
Approved by: https://github.com/Skylion007
We need to re-enable this test because there are recent changes relevant to test_nan_assert.
I've already verified that there would be a hang if we don't remove the `pg._allgather_base(output, nan_tensor)` call between the `backend._set_enable_nan_check` calls.
Why did it appear to work previously? Because previously only cu118 distributed was running, and the `backend._set_enable_nan_check` change was never exercised in the merge process (the skip logic is: skip if not CUDA 12 or above).
Workaround for #153479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154448
Approved by: https://github.com/kwen2501
`SIGABRT` is a common exit signal for *negative* distributed tests, which check the effectiveness of the NaN assert, watchdog throw, etc.
These failures are not detectable by traditional statements like `with self.assertRaises(RuntimeError)`.
Instead, we need to check the process's return code; e.g. `SIGABRT(6)` corresponds to a return code of -6.
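For example, a negative test can assert on the signal-derived exit code (a sketch, not the exact helper used in the test suite):

```python
import os
import signal
import multiprocessing as mp

def _worker():
    # Stand-in for a failed NaN assert or a watchdog throw, both of
    # which terminate the process with SIGABRT.
    os.abort()

if __name__ == "__main__":
    proc = mp.Process(target=_worker)
    proc.start()
    proc.join()
    # A process killed by signal N exits with code -N, so SIGABRT -> -6.
    assert proc.exitcode == -signal.SIGABRT
```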
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153167
Approved by: https://github.com/fduwjj
Lint the following files with the same rules as other files:
- test/test_fake_tensor.py
- test/test_flop_counter.py
- torch/_export/verifier.py

It was a nightmare to update tests in one of the previously skipped files without being able to lint them locally with `lintrunner -a` like other files.
Note that these files see active development; they are not old, untouched files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154261
Approved by: https://github.com/angelayi, https://github.com/Skylion007