Fix PgNccl coalesced profiling (#160680)

Admittedly I'm a noob when looking at traces, but this looked pretty off to me:
<img width="1528" height="824" alt="Screenshot 2025-08-14 at 5 27 49 PM" src="https://github.com/user-attachments/assets/871e7b4c-0e47-4c84-97cc-8198b7b76d4b" />
1. Why are there so many "nccl:coalesced" events on the CPU thread?
2. Why is there a "nccl:coalesced" event on the compute stream (stream 7)?

Here is what is happening:

**CPU side**: In `endCoalescing`, we create a [work object](3be70dc30e/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L3473)) with the profiling title "nccl:coalesced".
**GPU side**: The CUDA kernels launched for the coalesced operations inherit this profiling title.
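
For reference, here is roughly how that title turns into a profiler event: the `Work` constructor opens an async `at::RecordFunction` named after the profiling title and stashes the matching "end" call in `recordFunctionEndCallback_`. The sketch below is a simplified paraphrase of that mechanism, not the literal source; the free-standing `openProfilingEvent` helper is illustrative only and is not PyTorch code.

```cpp
#include <ATen/ThreadLocalState.h>
#include <ATen/record_function.h>
#include <functional>
#include <memory>

// Illustrative helper mirroring what the c10d::Work constructor does with a
// profiling title such as "nccl:coalesced": open an async RecordFunction now
// and hand back the callback that will close it later.
std::function<void()> openProfilingEvent(const char* profilingTitle) {
  auto recordingFunction =
      std::make_shared<at::RecordFunction>(at::RecordScope::USER_SCOPE);
  if (profilingTitle == nullptr || !recordingFunction->isActive()) {
    return nullptr;
  }
  // The event may be ended from a different thread, so mark it async.
  recordingFunction->_setAsync();
  // Opens the CPU-side event that shows up in the trace.
  recordingFunction->before(profilingTitle);
  // The event only closes when this callback runs; if nothing ever invokes
  // it, the "nccl:coalesced" span lingers in the trace.
  return at::wrapPropagateTLSState(
      [recordingFunction]() { recordingFunction->end(); });
}
```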

What is missing:

We forgot to call the record function [callback](3be70dc30e/torch/csrc/distributed/c10d/Work.cpp (L35-L38)). With this change we finish immediately on the CPU side, but the ncclDevKernel_SendRecv kernels still carry the coalesced title. The new trace looks like this:

<img width="1123" height="637" alt="image" src="https://github.com/user-attachments/assets/f015fd64-85cd-452a-be24-3e7724f84e44" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160680
Approved by: https://github.com/fegin, https://github.com/kwen2501
Author: Howard Huang
Date: 2025-08-14 14:26:28 -07:00
Committed by: PyTorch MergeBot
Parent: fa54b08cd5
Commit: 47ed41109f

@@ -3512,6 +3512,30 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::endCoalescing(OpType optype) {
     workEnqueue(work);
   }
+  {
+    c10::cuda::CUDAMultiStreamGuard streamGuard(ncclStream);
+    std::vector<at::Device> devices{device};
+    work->future_ = c10::make_intrusive<at::ivalue::Future>(
+        c10::ListType::create(c10::TensorType::get()), devices);
+    // Add a callback that runs profiling end callbacks. wrapCallback() in CUDA
+    // future blocks the stream this callback runs on the corresponding
+    // ncclEndEvents_ ensuring appropriate synchronization.
+    if (work->recordFunctionEndCallback_) {
+      work->future_->addCallback(
+          [work](at::ivalue::Future& /* unused */) {
+            work->recordFunctionEndCallback_();
+          },
+          // uses_future = false allows us to skip synchronization in
+          // ivalue::Future, but is only valid as long as the lambda doesn't use
+          // the "Future" argument.
+          /*uses_future=*/false);
+    }
+    // Mark the future as completed since coalesced operations complete
+    // immediately
+    work->future_->markCompleted(at::IValue(std::vector<at::Tensor>{}));
+  }
   // Reset coalescing state
   coalescing_state_ = 0;
   coalescedComm_ = nullptr;