Determine autograd engine ready queue based on InputMetadata instead of InputBuffer (#135633)

Thanks @awgu for raising this issue and the small repro

From offline discussion with @albanD: in the case where a forward returns multiple outputs on different devices, we want to select the ready queue based on the device of the first output. Even though this choice is somewhat arbitrary, we prefer it over pushing to whichever ready queue corresponds to the input buffer we happen to compute last, which can vary depending on more factors and is thus harder to reason about. This is in theory BC-breaking, but it seems unlikely that anyone depends on the old behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135633
Approved by: https://github.com/albanD
This commit is contained in:
soulitzer
2024-10-04 13:00:27 -07:00
committed by PyTorch MergeBot
parent 79562f3af8
commit d6f340f66c
6 changed files with 64 additions and 23 deletions

@@ -252,6 +252,23 @@ struct TORCH_API Node : std::enable_shared_from_this<Node> {
     return std::nullopt;
   }

   // Used by the engine to determine what device thread to run on
   at::Device device() {
     // Since we pick the first non-CPU tensor, this won't work with
     // mixed device-type operations (e.g., an op that is both CUDA
     // and XLA). This is *incredibly* unlikely, so we don't worry
     // about it.
     for (const auto& metadata : input_metadata_) {
       auto device = metadata.device();
       if (device.type() != at::kCPU) {
         return device;
       }
     }
     // Only report to the CPU thread if there really were no tensors
     // from other devices.
     return at::kCPU;
   }

   void clear_input_metadata() {
     input_metadata_.clear();
   }