Stop runner service when its GPU crashes (#97585)

Per title, I'm looking for a way to take the runner out of service when its GPU crashes and cannot recover.  Taking the faulty runner out of service prevents future jobs from being assigned to it, as they would surely fail.

This is based on the observation that GPU crashes usually happen in the middle of a test or in the next `setup-nvidia` step.  This only happens on G5 runners with the A10G GPU, so the suspicion is that this is a hardware failure.  Updating to the newer NVIDIA driver (525.85.06) might or might not help with the issue (https://github.com/pytorch/pytorch/pull/96904), so I'm preparing this PR as a preemptive measure.  Here are the symptoms when the GPU crashes (a rough standalone probe for them is sketched after the list):

* Test fails with a "No CUDA GPUs are available" error when initializing CUDA.  For example:
  * https://github.com/pytorch/pytorch/actions/runs/4506110581/jobs/7932832519
  * https://github.com/pytorch/pytorch/actions/runs/4507220502/jobs/7935084759
* Calling `nvidia-smi` times out after 60 seconds.  For example:
  * https://github.com/pytorch/pytorch/actions/runs/4496201282/jobs/7910938448
* Running `nvidia-smi` fails with an "unable to determine the device handle for GPU" unknown error.  For example:
  * https://github.com/pytorch/pytorch/actions/runs/4546343549/jobs/8015359600
* Running `docker --gpus all` fails with an error response from the daemon, while the command `nvidia-container-cli` fails with `detection error: nvml error: unknown error`.  For example:
  * https://github.com/pytorch/pytorch/actions/runs/4545579871/jobs/8013667872
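A rough standalone probe for these symptoms, not part of the PR; the 60-second timeout and the CUDA image tag below are illustrative choices, not values taken from the workflow:

```bash
#!/bin/bash
# Hypothetical GPU health probe that exercises the failure modes listed above.
set +e

# A hung or crashed GPU shows up here as a timeout or a non-zero exit code
timeout 60 nvidia-smi
NVIDIA_SMI_STATUS=$?
if [ "${NVIDIA_SMI_STATUS}" -ne 0 ]; then
  echo "nvidia-smi failed or timed out (exit code ${NVIDIA_SMI_STATUS})"
fi

# The container toolkit path fails separately with 'nvml error: unknown error'
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi \
  || echo "docker --gpus all cannot see the GPU"
```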

I assume that an offline runner with a stopped runner service will be torn down and recycled properly by the infra scaling process.

### Testing
https://github.com/pytorch/pytorch/actions/runs/4517112069/jobs/7956204805.  When it runs, the code fetches the service name from the `${{ RUNNER_WORKSPACE }}/../../.service` file and issues `sudo systemctl stop ${RUNNER_SERVICE_NAME}` to stop the self-hosted runner service.
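A minimal sketch of what that test run did, assuming the runner layout described above (the merged script below drives `./svc.sh` instead of calling `systemctl` directly):

```bash
# The Actions runner writes its service name into the .service file at install
# time; read it back and stop that service.
RUNNER_SERVICE_NAME=$(cat "${RUNNER_WORKSPACE}/../../.service")
sudo systemctl stop "${RUNNER_SERVICE_NAME}"
```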

The job will show its status as `The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97585
Approved by: https://github.com/jeanschmidt
Author: Huy Do
Date: 2023-03-29 21:17:13 +00:00
Committed by: PyTorch MergeBot
Commit: 099b2801db (parent: 2806fa4470)
2 changed files with 82 additions and 0 deletions

.github/scripts/stop_runner_service.sh (new vendored executable file, 24 lines)

@@ -0,0 +1,24 @@
#!/bin/bash
set +e
set -x
# Get the service name
RUNNER_SERVICE=$(cat "${RUNNER_WORKSPACE}/../../.service")
echo "GitHub self-hosted runner service: ${RUNNER_SERVICE}"
if [[ -n "${RUNNER_SERVICE}" ]]; then
  echo "The self-hosted runner has encountered an unrecoverable error and will be shut down"
  pushd "${RUNNER_WORKSPACE}/../../"
  # Stop it to prevent the runner from receiving new jobs
  sudo ./svc.sh stop
  # then uninstall the service
  sudo ./svc.sh uninstall
  # Finally, shut down the runner completely
  sudo shutdown -P now
  # NB: In my test, cleaning up and shutting down the runner this way already
  # removes the runner from the list of registered runners. Calling config.sh remove
  # seems redundant, as it would require an org token to use, which I don't want to
  # add as yet another secret to the CI if there is no need
fi
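For reference, the explicit deregistration that the comment above mentions would look roughly like the following.  It is not used by this PR, and `REMOVAL_TOKEN` is a placeholder for a removal token obtained from the GitHub API or settings page:

```bash
# Hypothetical explicit deregistration, run as the runner user from the runner
# install directory. This PR instead relies on the scaling infra to recycle the
# offline runner.
cd "${RUNNER_WORKSPACE}/../../"
./config.sh remove --token "${REMOVAL_TOKEN}"
```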

(second changed file: a GitHub Actions workflow)

@@ -90,6 +90,7 @@ jobs:
          docker-image: ${{ inputs.docker-image }}
      - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
        id: install-nvidia-driver
        uses: pytorch/test-infra/.github/actions/setup-nvidia@main
        if: contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu')
@@ -270,3 +271,60 @@ jobs:
      - name: Teardown Linux
        uses: pytorch/test-infra/.github/actions/teardown-linux@main
        if: always()

      - name: Check NVIDIA driver installation step
        if: failure() && steps.install-nvidia-driver.conclusion && steps.install-nvidia-driver.conclusion == 'failure'
        shell: bash
        env:
          RUNNER_WORKSPACE: ${{ runner.workspace }}
        run: |
          set +e
          set -x

          nvidia-smi
          NVIDIA_SMI_STATUS=$?

          # These are acceptable return codes from nvidia-smi as copied from the setup-nvidia GitHub action
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "NVIDIA driver installation has failed, shutting down the runner..."
            .github/scripts/stop_runner_service.sh
          fi

      - name: Check GPU health (run this last)
        if: failure() && contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu')
        shell: bash
        env:
          RUNNER_WORKSPACE: ${{ runner.workspace }}
        run: |
          set +e
          set -x

          # NB: We are currently having an intermittent GPU-related issue on G5 runners with
          # the A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia
          # does not seem to help. Here are some symptoms:
          # * Calling nvidia-smi times out after 60 seconds
          # * nvidia-smi fails with an unable to determine the device handle for GPU
          #   unknown error
          # * Test fails with "No CUDA GPUs are available" error when initializing CUDA
          #   in PyTorch
          # * Running docker --gpus all fails with an error response from the daemon while
          #   the command nvidia-container-cli fails with detection error: nvml error: unknown error
          #
          # As both the root cause and the recovery path are unclear, let's take the runner out of
          # service so that it doesn't get any more jobs
          UNRECOVERABLE_ERRORS=(
            "No CUDA GPUs are available"
            "docker: Error response from daemon"
          )

          for ERROR in "${UNRECOVERABLE_ERRORS[@]}"
          do
            grep -Rli "${ERROR}" "${RUNNER_WORKSPACE}/../../_diag/pages"
            RC=$?

            # If the GPU has crashed, stop the runner to prevent it from receiving new jobs
            if [[ "${RC}" == "0" ]]; then
              echo "The runner has encountered an unrecoverable error (${ERROR}), shutting it down..."
              .github/scripts/stop_runner_service.sh
            fi
          done
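
A quick way to sanity-check the grep logic above locally; the path and log contents below are made up for illustration:

```bash
# Simulate a runner worker log containing one of the unrecoverable errors and
# confirm that the case-insensitive recursive grep above would match it.
mkdir -p /tmp/_diag/pages
echo "RuntimeError: No CUDA GPUs are available" > /tmp/_diag/pages/Worker_001.log
grep -Rli "No CUDA GPUs are available" /tmp/_diag/pages && echo "match found, runner would be stopped"
```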