Fix a getenv segfault due to a race (#133744)
Summary:

* TLDR: `getenv` is not thread safe w.r.t. `setenv`. Environment variables are kept as a per-process "dictionary" by libc, and `setenv` can essentially realloc the whole thing, moving the list to a completely different location. If a concurrent `getenv` happens at the same time, it can end up reading stale memory and segfault. `getenv` is thread safe w.r.t. other `getenv` calls.

* Details:

Inside PTD init:

```
ProcessGroupNCCL ctor
...
ncclCommWatchdogThread_ = std::thread(&ProcessGroupNCCL::ncclCommWatchdog, this); (https://fburl.com/code/terf9ai7)
```

Inside the ncclCommWatchdog thread:

```
...
ncclHeartbeatMonitorThread_ = std::thread(&ProcessGroupNCCL::heartbeatMonitor, this); (https://fburl.com/code/fv9camg2)
...
```

Inside the heartbeatMonitor thread:

```
...
std::optional<DumpPipe> dumpPipe = std::nullopt; (https://fburl.com/code/qdvahzbu)
dumpPipe.emplace(rank_);
...
```

Inside the DumpPipe ctor (https://fburl.com/code/wvixlqcz):

```
getCvarString
getenv <=== SIGSEGV
```

On the main thread, we go on to initialize NCCL. Inside `getNCCLComm`, we call `getNcclVersion` -> `initEnv` (https://fburl.com/code/j312pccu). `initEnv` inside NCCL then calls `setEnvFile`, which reads /etc/nccl.conf and sets the values of environment variables with `setenv` (https://fburl.com/code/cq4r0y0h).

This `setenv` can race with the `getenv` in the heartbeatMonitor thread. Ideally, all `setenv` calls should be done by a single thread before any other threads are launched. This diff therefore moves `getNcclVersion` before the watchdog thread is launched, so that all `setenv` calls are done beforehand (see the standalone sketch below for an illustration of the race).

I think we have just been getting lucky that we do not hit this in production. IIRC we did see a `getenv` segfault once in one of the large-scale runs, but I no longer remember the details.

Test Plan: A lot of testing done as part of D61411062 & CI

Differential Revision: D61421292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133744
Approved by: https://github.com/wconstab, https://github.com/fduwjj
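To make the failure mode concrete, here is a minimal, standalone C++ sketch of the race described above. It is not part of the PyTorch change, and the environment variable names are invented for illustration. Whether it actually crashes depends on the libc implementation and on timing; POSIX simply leaves a `getenv` that overlaps a `setenv` undefined.

```cpp
// Standalone illustration only (not PyTorch code). Compile on a POSIX system:
//   g++ -std=c++17 -pthread getenv_race.cpp && ./a.out
#include <cstdlib>   // std::getenv; setenv is POSIX
#include <string>
#include <thread>

int main() {
  // Reader thread: stands in for heartbeatMonitor -> DumpPipe -> getCvarString.
  std::thread reader([] {
    for (int i = 0; i < 100000; ++i) {
      // getenv returns a pointer into libc's environment block; if a concurrent
      // setenv reallocates that block, this read may touch freed memory.
      if (const char* v = std::getenv("FAKE_TORCH_NCCL_FLAG")) {  // invented name
        volatile char first = v[0];
        (void)first;
      }
    }
  });

  // Main thread: stands in for getNcclVersion -> initEnv -> setEnvFile, which
  // keeps growing the environment with values read from a config file.
  for (int i = 0; i < 100000; ++i) {
    const std::string name = "FAKE_NCCL_VAR_" + std::to_string(i);  // invented names
    setenv(name.c_str(), "1", /*overwrite=*/1);
  }

  reader.join();
  return 0;
}
```

The fix in the diff below is the converse of this pattern: do the `setenv`-triggering work once, on the main thread, before any thread that might call `getenv` exists.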
Committed by: PyTorch MergeBot
Parent: af664882dd
Commit: 843fdf81c2
```diff
@@ -819,6 +819,12 @@ ProcessGroupNCCL::ProcessGroupNCCL(
       ValueError,
       at::cuda::getNumGPUs() != 0,
       "ProcessGroupNCCL is only supported with GPUs, no GPUs found!");
+
+  // getNcclVersion needs to get called before launching threads which can
+  // potentially call getenv. getNcclVersion internally calls setenv to set some
+  // environment variables from config file, which can race with getenv from
+  // other threads and cause segfaults.
+  const auto ncclVersion = getNcclVersion();
   this->setGroupUid(options_->group_name);
   this->localDeviceCount_ = at::cuda::getNumGPUs();
   logPrefix_ = createLogPrefix();
@@ -911,7 +917,7 @@ ProcessGroupNCCL::ProcessGroupNCCL(
             << ", PG Name: " << options_->group_name;
 
   LOG(INFO) << logPrefix() << "ProcessGroupNCCL environments: "
-            << "NCCL version: " << getNcclVersion()
+            << "NCCL version: " << ncclVersion
             << ", TORCH_NCCL_ASYNC_ERROR_HANDLING: " << asyncErrorHandling_
             << ", TORCH_NCCL_DUMP_ON_TIMEOUT: " << dumpOnTimeoutOrEx_
             << ", TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: "
```
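As a note on the design choice the hunks above encode: the result of the one early `getNcclVersion()` call is cached in `ncclVersion` and reused in the log line, so nothing that runs after the threads are launched needs to re-trigger NCCL's env-file parsing. A minimal sketch of that ordering, with invented placeholder names (`readConfigAndSetenv` and `watchdogLoop` are not PyTorch APIs):

```cpp
// Sketch of the "all setenv first, then spawn threads" ordering (hypothetical names).
#include <cstdlib>
#include <iostream>
#include <string>
#include <thread>

// Placeholder for the getNcclVersion -> initEnv -> setEnvFile path:
// mutates the environment while the process is still single-threaded.
std::string readConfigAndSetenv() {
  setenv("FAKE_NCCL_DEBUG", "WARN", /*overwrite=*/1);  // invented variable
  return "2.x.y";
}

// Placeholder for the watchdog/heartbeat threads: consumes the cached value;
// any getenv it performs happens after every setenv has completed.
void watchdogLoop(const std::string& cachedVersion) {
  std::cout << "watchdog sees cached NCCL version " << cachedVersion << "\n";
}

int main() {
  // 1. Every setenv happens here, before any other thread exists.
  const std::string ncclVersion = readConfigAndSetenv();

  // 2. Threads launched afterwards can no longer race with setenv.
  std::thread watchdog(watchdogLoop, ncclVersion);
  watchdog.join();
  return 0;
}
```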