[fr] Enable resetting the FR recording for fault tolerance (#164988)

We also want a Python-side API that lets users reset the Flight Recorder (FR) recording, clearing the recorded FR entries. We don't need to reset PGNCCL's member counter, since a new PGNCCL is created anyway. FR, however, is a global ring buffer that outlives any single PGNCCL instance, so it has to be reset explicitly.
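To illustrate the reasoning above, here is a minimal sketch of why a per-instance counter resets for free while a global ring buffer does not. `ToyRecorder` and `ToyProcessGroup` are hypothetical stand-ins invented for this example. They are not PyTorch's `FlightRecorderCUDA` or `ProcessGroupNCCL`, and the sketch is single-threaded for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a flight recorder (not PyTorch's implementation).
class ToyRecorder {
 public:
  // Process-wide singleton: its contents outlive any one process group.
  static ToyRecorder* get() {
    static ToyRecorder instance(/*capacity=*/4);
    return &instance;
  }

  // Append an entry, overwriting the oldest one once the buffer is full.
  void record(const std::string& op) {
    if (entries_.size() < capacity_) {
      entries_.push_back(op);
    } else {
      entries_[next_slot_] = op;
    }
    next_slot_ = (next_slot_ + 1) % capacity_;
  }

  // Drop every entry and rewind the write position.
  // A thread-safe version would guard this state with a mutex.
  void reset_all() {
    entries_.clear();
    next_slot_ = 0;
  }

  std::size_t size() const { return entries_.size(); }

 private:
  explicit ToyRecorder(std::size_t capacity) : capacity_(capacity) {
    assert(capacity_ > 0);
    entries_.reserve(capacity_);
  }

  std::size_t capacity_;
  std::size_t next_slot_ = 0;
  std::vector<std::string> entries_;
};

// Hypothetical stand-in for a process group with a per-instance counter.
struct ToyProcessGroup {
  std::size_t seq = 0;  // starts at zero for every new instance

  void collective(const std::string& op) {
    ++seq;
    ToyRecorder::get()->record(op);
  }
};
```

In this sketch, constructing a new `ToyProcessGroup` starts `seq` at zero again, but entries written by the old instance stay in `ToyRecorder` until `reset_all()` is called, which mirrors the argument for adding an explicit reset.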

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164988
Approved by: https://github.com/tushar00jain
ghstack dependencies: #164752
fduwjj
2025-10-08 13:36:35 -07:00
committed by PyTorch MergeBot
parent 81dbeb06f4
commit 8ca986ee60
6 changed files with 49 additions and 0 deletions


@@ -393,6 +393,10 @@ static std::
#endif // (defined(IS_NCCLX) || defined(USE_ROCM)) && defined(NCCL_COMM_DUMP)
}
void reset_nccl_trace() {
  FlightRecorderCUDA::get()->reset_all();
}

std::string dump_nccl_trace(
    bool includeCollectives,
    bool includeStackTraces,