Stop runner service when its GPU crashes (#97585)

Per title, I'm looking for a way to take the runner out of service when its GPU crashes and cannot recover.  Taking the faulty runner out of service prevents future jobs from being assigned to it, as they would surely fail.

This is based on the observation that GPU crashes usually happen in the middle of a test or in the next `setup-nvidia` step.  This only happens on G5 runners with the A10G GPU, so the suspicion is that this is a hardware failure.  Updating to the newer NVIDIA driver (525.85.06) might or might not help with the issue (https://github.com/pytorch/pytorch/pull/96904), so I'm preparing this PR as a preemptive measure.  Here are the symptoms when the GPU crashes (a rough standalone probe for them is sketched after the list):

* Test fails with a "No CUDA GPUs are available" error when initializing CUDA.  For example:
  * https://github.com/pytorch/pytorch/actions/runs/4506110581/jobs/7932832519
  * https://github.com/pytorch/pytorch/actions/runs/4507220502/jobs/7935084759
* Calling `nvidia-smi` times out after 60 seconds.  For example:
  * https://github.com/pytorch/pytorch/actions/runs/4496201282/jobs/7910938448
* Running `nvidia-smi` fails with an "unable to determine the device handle for GPU" unknown error.  For example:
  * https://github.com/pytorch/pytorch/actions/runs/4546343549/jobs/8015359600
* Running `docker --gpus all` fails with an error response from the daemon, while the command `nvidia-container-cli` fails with `detection error: nvml error: unknown error`.  For example:
  * https://github.com/pytorch/pytorch/actions/runs/4545579871/jobs/8013667872
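A rough standalone probe for these symptoms, not part of the PR; the 60-second timeout and the CUDA image tag below are illustrative choices, not values taken from the workflow:

```bash
#!/bin/bash
# Hypothetical GPU health probe that exercises the failure modes listed above.
set +e

# A hung or crashed GPU shows up here as a timeout or a non-zero exit code
timeout 60 nvidia-smi
NVIDIA_SMI_STATUS=$?
if [ "${NVIDIA_SMI_STATUS}" -ne 0 ]; then
  echo "nvidia-smi failed or timed out (exit code ${NVIDIA_SMI_STATUS})"
fi

# The container toolkit path fails separately with 'nvml error: unknown error'
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi \
  || echo "docker --gpus all cannot see the GPU"
```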

I assume that an offline runner with a stopped runner service will be torn down and recycled properly by the infra scaling process.

### Testing
https://github.com/pytorch/pytorch/actions/runs/4517112069/jobs/7956204805.  When it runs, the code fetches the service name from the `${{ RUNNER_WORKSPACE }}/../../.service` file and issues `sudo systemctl stop ${RUNNER_SERVICE_NAME}` to stop the self-hosted runner service.
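A minimal sketch of what that test run did, assuming the runner layout described above (the merged script below drives `./svc.sh` instead of calling `systemctl` directly):

```bash
# The Actions runner writes its service name into the .service file at install
# time; read it back and stop that service.
RUNNER_SERVICE_NAME=$(cat "${RUNNER_WORKSPACE}/../../.service")
sudo systemctl stop "${RUNNER_SERVICE_NAME}"
```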

The job will show its status as `The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97585
Approved by: https://github.com/jeanschmidt
Author: Huy Do
Date: 2023-03-29 21:17:13 +00:00
Committed by: PyTorch MergeBot
Commit: 099b2801db (parent: 2806fa4470)
2 changed files with 82 additions and 0 deletions

.github/scripts/stop_runner_service.sh (new vendored executable file, 24 lines)

@@ -0,0 +1,24 @@
#!/bin/bash
set +e
set -x
# Get the service name
RUNNER_SERVICE=$(cat "${RUNNER_WORKSPACE}/../../.service")
echo "GitHub self-hosted runner service: ${RUNNER_SERVICE}"
if [[ -n "${RUNNER_SERVICE}" ]]; then
  echo "The self-hosted runner has encountered an unrecoverable error and will be shut down"
  pushd "${RUNNER_WORKSPACE}/../../"
  # Stop it to prevent the runner from receiving new jobs
  sudo ./svc.sh stop
  # then uninstall the service
  sudo ./svc.sh uninstall
  # Finally, shut down the runner completely
  sudo shutdown -P now
  # NB: In my test, cleaning up and shutting down the runner this way already
  # removes the runner from the list of registered runners. Calling config.sh remove
  # seems redundant, as it would require an org token to use, which I don't want to
  # add as yet another secret to the CI if there is no need
fi
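For reference, the explicit deregistration that the comment above mentions would look roughly like the following.  It is not used by this PR, and `REMOVAL_TOKEN` is a placeholder for a removal token obtained from the GitHub API or settings page:

```bash
# Hypothetical explicit deregistration, run as the runner user from the runner
# install directory. This PR instead relies on the scaling infra to recycle the
# offline runner.
cd "${RUNNER_WORKSPACE}/../../"
./config.sh remove --token "${REMOVAL_TOKEN}"
```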

(second changed file: a GitHub Actions workflow)

@@ -90,6 +90,7 @@ jobs:
          docker-image: ${{ inputs.docker-image }}
      - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
        id: install-nvidia-driver
        uses: pytorch/test-infra/.github/actions/setup-nvidia@main
        if: contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu')
@@ -270,3 +271,60 @@ jobs:
      - name: Teardown Linux
        uses: pytorch/test-infra/.github/actions/teardown-linux@main
        if: always()

      - name: Check NVIDIA driver installation step
        if: failure() && steps.install-nvidia-driver.conclusion && steps.install-nvidia-driver.conclusion == 'failure'
        shell: bash
        env:
          RUNNER_WORKSPACE: ${{ runner.workspace }}
        run: |
          set +e
          set -x

          nvidia-smi
          NVIDIA_SMI_STATUS=$?

          # These are acceptable return codes from nvidia-smi as copied from the setup-nvidia GitHub action
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "NVIDIA driver installation has failed, shutting down the runner..."
            .github/scripts/stop_runner_service.sh
          fi

      - name: Check GPU health (run this last)
        if: failure() && contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu')
        shell: bash
        env:
          RUNNER_WORKSPACE: ${{ runner.workspace }}
        run: |
          set +e
          set -x

          # NB: We are currently having an intermittent GPU-related issue on G5 runners with
          # the A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia
          # does not seem to help. Here are some symptoms:
          # * Calling nvidia-smi times out after 60 seconds
          # * nvidia-smi fails with an unable to determine the device handle for GPU
          #   unknown error
          # * Test fails with "No CUDA GPUs are available" error when initializing CUDA
          #   in PyTorch
          # * Running docker --gpus all fails with an error response from the daemon while
          #   the command nvidia-container-cli fails with detection error: nvml error: unknown error
          #
          # As both the root cause and the recovery path are unclear, let's take the runner out of
          # service so that it doesn't get any more jobs
          UNRECOVERABLE_ERRORS=(
            "No CUDA GPUs are available"
            "docker: Error response from daemon"
          )

          for ERROR in "${UNRECOVERABLE_ERRORS[@]}"
          do
            grep -Rli "${ERROR}" "${RUNNER_WORKSPACE}/../../_diag/pages"
            RC=$?

            # If the GPU has crashed, stop the runner to prevent it from receiving new jobs
            if [[ "${RC}" == "0" ]]; then
              echo "The runner has encountered an unrecoverable error (${ERROR}), shutting it down..."
              .github/scripts/stop_runner_service.sh
            fi
          done
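
A quick way to sanity-check the grep logic above locally; the path and log contents below are made up for illustration:

```bash
# Simulate a runner worker log containing one of the unrecoverable errors and
# confirm that the case-insensitive recursive grep above would match it.
mkdir -p /tmp/_diag/pages
echo "RuntimeError: No CUDA GPUs are available" > /tmp/_diag/pages/Worker_001.log
grep -Rli "No CUDA GPUs are available" /tmp/_diag/pages && echo "match found, runner would be stopped"
```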