Files
pytorch/torch/csrc/distributed/c10d/NanCheck.hpp
Ke Wen 3645634f3c [1/N] Move NaN check onto NCCL stream (#134300)
So that the tensor's lifetime management is the same as the management already built for the NCCL pre- and post-kernels.
It also means that, in visualizers, these kernels show up on the NCCL stream line. If they showed up on the compute line instead, users might be confused ("my code does not launch these kernels").

The check is thus moved after the point where we make the NCCL stream depend on (i.e. wait for) the last compute kernel.

Also moved the declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.

Differential Revision: [D61957573](https://our.internmc.facebook.com/intern/diff/D61957573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-29 08:28:49 +00:00

17 lines
328 B
C++

#pragma once
#ifdef USE_C10D_NCCL
#include <ATen/ATen.h>
#include <c10/cuda/CUDAStream.h>
namespace c10d {
// Check for NaNs in a tensor on a given stream. If any are found, throw a
// device-side error.
void checkForNan(const at::Tensor& tensor, at::cuda::CUDAStream& stream);
} // namespace c10d
#endif // USE_C10D_NCCL