Fix PgNccl coalesced profiling (#160680)

Admittedly I'm a noob when looking at traces, but this looked pretty off to me:
<img width="1528" height="824" alt="Screenshot 2025-08-14 at 5 27 49 PM" src="https://github.com/user-attachments/assets/871e7b4c-0e47-4c84-97cc-8198b7b76d4b" />
1. Why are there so many "nccl:coalesced" events on the CPU thread?
2. Why is there a "nccl:coalesced" event on the compute stream (stream 7)?

Here is what is happening:

**CPU side**: In `endCoalescing`, we create a [work object](3be70dc30e/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L3473)) with the profiling title "nccl:coalesced".
**GPU side**: The CUDA kernels launched for the coalesced operations inherit this profiling title.
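
For reference, here is roughly how that title turns into a profiler event: the `Work` constructor opens an async `at::RecordFunction` named after the profiling title and stashes the matching "end" call in `recordFunctionEndCallback_`. The sketch below is a simplified paraphrase of that mechanism, not the literal source; the free-standing `openProfilingEvent` helper is illustrative only and is not PyTorch code.

```cpp
#include <ATen/ThreadLocalState.h>
#include <ATen/record_function.h>
#include <functional>
#include <memory>

// Illustrative helper mirroring what the c10d::Work constructor does with a
// profiling title such as "nccl:coalesced": open an async RecordFunction now
// and hand back the callback that will close it later.
std::function<void()> openProfilingEvent(const char* profilingTitle) {
  auto recordingFunction =
      std::make_shared<at::RecordFunction>(at::RecordScope::USER_SCOPE);
  if (profilingTitle == nullptr || !recordingFunction->isActive()) {
    return nullptr;
  }
  // The event may be ended from a different thread, so mark it async.
  recordingFunction->_setAsync();
  // Opens the CPU-side event that shows up in the trace.
  recordingFunction->before(profilingTitle);
  // The event only closes when this callback runs; if nothing ever invokes
  // it, the "nccl:coalesced" span lingers in the trace.
  return at::wrapPropagateTLSState(
      [recordingFunction]() { recordingFunction->end(); });
}
```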

What is missing:

We forgot to call the record function [callback](3be70dc30e/torch/csrc/distributed/c10d/Work.cpp (L35-L38)). With this change we finish immediately on the CPU side, but the ncclDevKernel_SendRecv kernels still carry the coalesced title. The new trace looks like this:

<img width="1123" height="637" alt="image" src="https://github.com/user-attachments/assets/f015fd64-85cd-452a-be24-3e7724f84e44" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160680
Approved by: https://github.com/fegin, https://github.com/kwen2501
Author: Howard Huang
Date: 2025-08-14 14:26:28 -07:00
Committed by: PyTorch MergeBot
Parent: fa54b08cd5
Commit: 47ed41109f

@@ -3512,6 +3512,30 @@ c10::intrusive_ptr<Work> ProcessGroupNCCL::endCoalescing(OpType optype) {
     workEnqueue(work);
   }
+  {
+    c10::cuda::CUDAMultiStreamGuard streamGuard(ncclStream);
+    std::vector<at::Device> devices{device};
+    work->future_ = c10::make_intrusive<at::ivalue::Future>(
+        c10::ListType::create(c10::TensorType::get()), devices);
+    // Add a callback that runs profiling end callbacks. wrapCallback() in CUDA
+    // future blocks the stream this callback runs on the corresponding
+    // ncclEndEvents_ ensuring appropriate synchronization.
+    if (work->recordFunctionEndCallback_) {
+      work->future_->addCallback(
+          [work](at::ivalue::Future& /* unused */) {
+            work->recordFunctionEndCallback_();
+          },
+          // uses_future = false allows us to skip synchronization in
+          // ivalue::Future, but is only valid as long as the lambda doesn't use
+          // the "Future" argument.
+          /*uses_future=*/false);
+    }
+    // Mark the future as completed since coalesced operations complete
+    // immediately
+    work->future_->markCompleted(at::IValue(std::vector<at::Tensor>{}));
+  }
   // Reset coalescing state
   coalescing_state_ = 0;
   coalescedComm_ = nullptr;