mirror of
https://github.com/pytorch/pytorch.git
synced 2025-10-20 21:14:14 +08:00
[fr] Enable reset the FR recording for fault tolerance (#164988)
We also want a Python-side API that lets users reset the FR recording, i.e. clear the recorded FR entries. We don't need to reset PGNCCL's member counter, since we are creating a new PGNCCL anyway. FR is a global ring buffer, so we need to reset it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164988
Approved by: https://github.com/tushar00jain
ghstack dependencies: #164752
@@ -393,6 +393,10 @@ static std::
 #endif // (defined(IS_NCCLX) || defined(USE_ROCM)) && defined(NCCL_COMM_DUMP)
 }
 
+void reset_nccl_trace() {
+  FlightRecorderCUDA::get()->reset_all();
+}
+
 std::string dump_nccl_trace(
     bool includeCollectives,
     bool includeStackTraces,
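The commit message notes that FR is a global ring buffer, which is why it needs an explicit reset when a new process group is created. The following is a minimal sketch of that idea, not PyTorch's actual FlightRecorder: `TraceRingSketch`, `record`, `size`, and `reset_trace_sketch` are hypothetical names, the capacity of 4 is arbitrary, and restarting the entry ids on reset is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a process-wide trace recorder.
// This is NOT PyTorch's FlightRecorder; it only illustrates why a
// global ring buffer needs an explicit reset.
class TraceRingSketch {
 public:
  // Process-wide singleton, so entries outlive any single process group.
  static TraceRingSketch* get() {
    static TraceRingSketch instance(/*capacity=*/4);
    return &instance;
  }

  // Append an entry, overwriting the oldest one once the buffer is full.
  // Returns the id assigned to the new entry.
  std::size_t record(const std::string& name) {
    std::lock_guard<std::mutex> guard(mutex_);
    const std::size_t id = next_id_++;
    if (entries_.size() < capacity_) {
      entries_.push_back({id, name});
    } else {
      entries_[next_slot_] = {id, name};
    }
    next_slot_ = (next_slot_ + 1) % capacity_;
    return id;
  }

  // Drop all recorded entries and restart the write position.
  // Restarting ids at 0 is an assumption of this sketch.
  void reset_all() {
    std::lock_guard<std::mutex> guard(mutex_);
    entries_.clear();
    next_slot_ = 0;
    next_id_ = 0;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return entries_.size();
  }

 private:
  struct Entry {
    std::size_t id;
    std::string name;
  };

  explicit TraceRingSketch(std::size_t capacity) : capacity_(capacity) {
    assert(capacity_ > 0);
    entries_.reserve(capacity_);
  }

  const std::size_t capacity_;
  std::vector<Entry> entries_;
  std::size_t next_slot_ = 0;
  std::size_t next_id_ = 0;
  mutable std::mutex mutex_;
};

// Thin free function, shaped like the wrapper added in this diff.
void reset_trace_sketch() {
  TraceRingSketch::get()->reset_all();
}
```

Because the buffer is a singleton, stale entries from a torn-down process group would otherwise remain visible to its replacement; calling the reset wrapper before re-creating the group avoids that in this sketch.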