Mirror of https://github.com/pytorch/pytorch.git, synced 2025-10-20 21:14:14 +08:00
Fix PgNccl coalesced profiling (#160680)
Admittedly I'm a noob when it comes to reading traces, but this looked pretty off to me:

![Trace before the fix](https://github.com/user-attachments/assets/871e7b4c-0e47-4c84-97cc-8198b7b76d4b)

1. Why are there so many "nccl:coalesced" events on the CPU thread?
2. Why is there a "nccl:coalesced" event on the compute stream (stream 7)?

Here is what is happening:

**CPU side**: In `endCoalescing`, we create a [work object](3be70dc30e/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L3473) with the profiling title "nccl:coalesced".

**GPU side**: The CUDA kernels inherit this profiling title.

**What is missing**: We forgot to call the record-function end [callback](3be70dc30e/torch/csrc/distributed/c10d/Work.cpp#L35-L38).

With this change the coalesced region finishes immediately on the CPU side, while the `ncclDevKernel_SendRecv` kernels still carry the coalesced title. The new trace looks like this:

![Trace after the fix](https://github.com/user-attachments/assets/f015fd64-85cd-452a-be24-3e7724f84e44)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160680
Approved by: https://github.com/fegin, https://github.com/kwen2501
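For anyone who wants to see the `nccl:coalesced` annotation for themselves, here is a minimal repro sketch (not part of this PR). It assumes at least two GPUs, a launch via `torchrun --nproc-per-node=2`, and an arbitrary output file name. `batch_isend_irecv` coalesces the P2P ops, so the resulting work item is the one built in `endCoalescing` and shows up with the `nccl:coalesced` title in the trace.

```python
# Repro sketch (assumption, not from the PR): coalesced NCCL send/recv under
# the PyTorch profiler. Run with: torchrun --nproc-per-node=2 repro.py
import torch
import torch.distributed as dist
from torch.profiler import profile, ProfilerActivity


def main() -> None:
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(rank)

    send_buf = torch.ones(1 << 20, device="cuda")
    recv_buf = torch.empty_like(send_buf)

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        # batch_isend_irecv coalesces these P2P ops into a single NCCL group,
        # which is what produces the ncclDevKernel_SendRecv kernel and the
        # "nccl:coalesced" profiling title discussed above.
        ops = [
            dist.P2POp(dist.isend, send_buf, (rank + 1) % world),
            dist.P2POp(dist.irecv, recv_buf, (rank - 1) % world),
        ]
        for req in dist.batch_isend_irecv(ops):
            req.wait()
        torch.cuda.synchronize()

    prof.export_chrome_trace(f"coalesced_trace_rank{rank}.json")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```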
Commit 47ed41109f (parent fa54b08cd5), committed by PyTorch MergeBot.
torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

@@ -3512,6 +3512,30 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::endCoalescing(OpType optype) {
     workEnqueue(work);
   }
 
+  {
+    c10::cuda::CUDAMultiStreamGuard streamGuard(ncclStream);
+    std::vector<at::Device> devices{device};
+    work->future_ = c10::make_intrusive<at::ivalue::Future>(
+        c10::ListType::create(c10::TensorType::get()), devices);
+
+    // Add a callback that runs profiling end callbacks. wrapCallback() in CUDA
+    // future blocks the stream this callback runs on the corresponding
+    // ncclEndEvents_ ensuring appropriate synchronization.
+    if (work->recordFunctionEndCallback_) {
+      work->future_->addCallback(
+          [work](at::ivalue::Future& /* unused */) {
+            work->recordFunctionEndCallback_();
+          },
+          // uses_future = false allows us to skip synchronization in
+          // ivalue::Future, but is only valid as long as the lambda doesn't use
+          // the "Future" argument.
+          /*uses_future=*/false);
+    }
+    // Mark the future as completed since coalesced operations complete
+    // immediately
+    work->future_->markCompleted(at::IValue(std::vector<at::Tensor>{}));
+  }
+
   // Reset coalescing state
   coalescing_state_ = 0;
   coalescedComm_ = nullptr;
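As a quick sanity check on the claim above (purely a sketch, not part of the change), one can scan the exported Chrome trace for the `nccl:coalesced` events; after this fix the CPU-side event should report an essentially zero duration instead of spanning the whole coalesced group. The file name matches the repro sketch earlier.

```python
# Sketch (assumption): list "nccl:coalesced" events from a Chrome trace
# exported by the PyTorch profiler.
import json

with open("coalesced_trace_rank0.json") as f:
    trace = json.load(f)

for ev in trace.get("traceEvents", []):
    if ev.get("name") == "nccl:coalesced":
        # "dur" is in microseconds for complete ("X") events.
        print(ev.get("ph"), ev.get("cat"), ev.get("tid"), ev.get("dur"))
```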