S390x CI periodic tests (#125401)

Periodically run the test suite for s390x.

**Dependencies update**
Package z3-solver is updated from version 4.12.2.0 to version 4.12.6.0. This is a minor version update, so no functional change is expected.
The update is needed for the s390x build: PyPI provides no binary z3-solver wheels for s390x, neither for 4.12.2.0 nor for 4.12.6.0, so the package is built from source there. Version 4.12.2.0 fails to build with the newer gcc used on the s390x builders, and those errors are fixed in version 4.12.6.0, so this minor version bump fixes the build on s390x.

```
# pip3 install z3-solver==4.12.2.0
...
      In file included from /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:53:
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp: In member function ‘void* region::allocate(size_t)’:
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/tptr.h:29:62: error: ‘uintptr_t’ does not name a type
         29 | #define ALIGN(T, PTR) reinterpret_cast<T>(((reinterpret_cast<uintptr_t>(PTR) >> PTR_ALIGNMENT) + \
            |                                                              ^~~~~~~~~
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:82:22: note: in expansion of macro ‘ALIGN’
         82 |         m_curr_ptr = ALIGN(char *, new_curr_ptr);
            |                      ^~~~~
      /tmp/pip-install-756iytc6/z3-solver_ce6f750b780b4146a9a7c01e52672071/core/src/util/region.cpp:57:1: note: ‘uintptr_t’ is defined in header ‘<cstdint>’; did you forget to ‘#include <cstdint>’?
         56 | #include "util/page.h"
        +++ |+#include <cstdint>
         57 |
```
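
After the bump, a quick import check confirms the new solver version (a minimal sanity check; the expected output assumes the 4.12.6.0 source build succeeded):

```python
# Verify that the freshly built z3-solver package imports and reports
# the bumped version.
import z3

print(z3.get_version_string())  # expected: 4.12.6
```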

**Python paths update**
On AlmaLinux 8 for s390x, the old path lookup:
```
python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())'
/usr/lib/python3.12/site-packages
```

The combined result is `/usr/lib/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages`.

New paths:
```
python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))'
/usr/local/lib64/python3.12/site-packages;/usr/local/lib/python3.12/site-packages;/usr/lib64/python3.12/site-packages;/usr/lib/python3.12/site-packages;/usr/local/lib64/python3.12/site-packages/torch;/usr/local/lib/python3.12/site-packages/torch;/usr/lib64/python3.12/site-packages/torch;/usr/lib/python3.12/site-packages/torch
```

```
# python -c 'import torch ; print(torch)'
<module 'torch' from '/usr/local/lib64/python3.12/site-packages/torch/__init__.py'>
```

`pip3 install dist/*.whl` installs torch into `/usr/local/lib64/python3.12/site-packages`, which the old paths don't cover, so CMake later fails to find it:

```
CMake Error at CMakeLists.txt:9 (find_package):
  By not providing "FindTorch.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "Torch", but
  CMake did not find one.
```

https://github.com/pytorch/pytorch/actions/runs/10994060107/job/30521868178?pr=125401
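
As an aside, a location-independent way to point CMake at Torch is to ask the installed package for its CMake prefix (a sketch, not part of this PR; `torch.utils.cmake_prefix_path` is the documented attribute for this):

```python
# Print the prefix containing TorchConfig.cmake, wherever pip put the wheel;
# pass the result to CMake via -DCMAKE_PREFIX_PATH.
import torch.utils

print(torch.utils.cmake_prefix_path)
# e.g. /usr/local/lib64/python3.12/site-packages/torch/share/cmake
```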

**Builders availability**
Build took 60 minutes.
Tests took 150, 110, 65, 55, 115, 85, 50, 70, 105, and 110 minutes (split into 10 shards).

60 + 150 + 110 + 65 + 55 + 115 + 85 + 50 + 70 + 105 + 110 = 975 minutes used. Let's double that for safety: 1950 minutes.

We have 20 machines * 24 hours = 20 * 1440 = 28800 machine-minutes per day.

We currently run 5 nightly binaries builds, each averaging 90 minutes of build, 15 minutes of test, and 5 minutes of upload, so 110 minutes each and 550 minutes total. Doubling that gives 1100 minutes.

That leaves 28800 - 1100 = 27700 minutes. The periodic tests would use 1950 of those, leaving 25750 minutes.

Nightly binaries builds + periodic tests = 1100 + 1950 = 3050 minutes.

25750 / 3050 = 8.44, so we could run both 8 more times for additional CI runs for any reason, and that still leaves a pretty good safety margin.
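
The same budget arithmetic as a quick script (numbers copied from above):

```python
# Back-of-the-envelope capacity check for the s390x runner pool.
build_minutes = 60
shard_minutes = [150, 110, 65, 55, 115, 85, 50, 70, 105, 110]

periodic = 2 * (build_minutes + sum(shard_minutes))  # 1950, with a 2x margin
nightly = 2 * 5 * (90 + 15 + 5)                      # 1100, with a 2x margin
pool = 20 * 24 * 60                                  # 28800 machine-minutes/day

spare = pool - nightly - periodic                    # 25750
print(spare // (nightly + periodic))                 # 8 extra full runs fit
```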

**Skip test_tensorexpr**
On s390x, PyTorch is built without LLVM.
Even if it were built with LLVM, LLVM currently doesn't support the features these tests need on s390x, and the test fails with errors like:
```
JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer
unknown file: Failure
C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }
```
**Disable cpp/static_runtime_test on s390x**

Quantization is not yet fully supported in PyTorch on s390x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125401
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>

**Changed files:** 9 changed files with 143 additions and 21 deletions.

z3-solver pin bump in the CI requirements file (`.ci/docker/requirements-ci.txt`):

```diff
@@ -304,7 +304,7 @@ pytest-cpp==2.3.0
 #Pinned versions: 2.3.0
 #test that import:

-z3-solver==4.12.2.0
+z3-solver==4.12.6.0
 #Description: The Z3 Theorem Prover Project
 #Pinned versions:
 #test that import:
```

Workspace-permissions comment update in the build script:

```diff
@@ -228,7 +228,7 @@ if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then
   export CMAKE_BUILD_TYPE=RelWithAssert
 fi

-# Do not change workspace permissions for ROCm CI jobs
+# Do not change workspace permissions for ROCm and s390x CI jobs
 # as it can leave workspace with bad permissions for cancelled jobs
 if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /var/lib/jenkins/workspace ]]; then
   # Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
```

Test script changes, skipping Valgrind, `cpp/test_tensorexpr`, and `cpp/static_runtime_test` on s390x:

```diff
@@ -12,9 +12,9 @@ export TERM=vt100
 # shellcheck source=./common.sh
 source "$(dirname "${BASH_SOURCE[0]}")/common.sh"

-# Do not change workspace permissions for ROCm CI jobs
+# Do not change workspace permissions for ROCm and s390x CI jobs
 # as it can leave workspace with bad permissions for cancelled jobs
-if [[ "$BUILD_ENVIRONMENT" != *rocm* && -d /var/lib/jenkins/workspace ]]; then
+if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* && -d /var/lib/jenkins/workspace ]]; then
   # Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
   WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
   cleanup_workspace() {
@@ -86,6 +86,13 @@ if [[ "$BUILD_ENVIRONMENT" == *clang9* || "$BUILD_ENVIRONMENT" == *xpu* ]]; then
   export VALGRIND=OFF
 fi

+if [[ "$BUILD_ENVIRONMENT" == *s390x* ]]; then
+  # There are additional warnings on s390x, maybe due to newer gcc.
+  # Skip this check for now
+  export VALGRIND=OFF
+fi
+
 if [[ "${PYTORCH_TEST_RERUN_DISABLED_TESTS}" == "1" ]] || [[ "${CONTINUE_THROUGH_ERROR}" == "1" ]]; then
   # When rerunning disable tests, do not generate core dumps as it could consume
   # the runner disk space when crashed tests are run multiple times. Running out
@@ -910,10 +917,20 @@ test_libtorch_api() {
   else
     # Exclude IMethodTest that relies on torch::deploy, which will instead be ran in test_deploy
     OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_api -k "not IMethodTest"
-    python test/run_test.py --cpp --verbose -i cpp/test_tensorexpr
+
+    # On s390x, pytorch is built without llvm.
+    # Even if it would be built with llvm, llvm currently doesn't support used features on s390x and
+    # test fails with errors like:
+    # JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer
+    # unknown file: Failure
+    # C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }
+    if [[ "${BUILD_ENVIRONMENT}" != *s390x* ]]; then
+      python test/run_test.py --cpp --verbose -i cpp/test_tensorexpr
+    fi
   fi

-  if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* && "${BUILD_ENVIRONMENT}" != *asan* ]]; then
+  # quantization is not fully supported on s390x yet
+  if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* && "${BUILD_ENVIRONMENT}" != *asan* && "${BUILD_ENVIRONMENT}" != *s390x* ]]; then
     # NB: This test is not under TORCH_BIN_DIR but under BUILD_BIN_DIR
     export CPP_TESTS_DIR="${BUILD_BIN_DIR}"
     python test/run_test.py --cpp --verbose -i cpp/static_runtime_test
```

`.github/workflows/_linux-test.yml`, skipping steps that don't apply to the self-hosted s390x runners and parametrizing the container invocation:

```diff
@@ -81,7 +81,7 @@ jobs:
     steps:
       - name: Setup SSH (Click me for login details)
         uses: pytorch/test-infra/.github/actions/setup-ssh@main
-        if: ${{ !contains(matrix.runner, 'gcp.a100') }}
+        if: ${{ !contains(matrix.runner, 'gcp.a100') && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
         with:
           github-secret: ${{ secrets.GITHUB_TOKEN }}
           instructions: |
@@ -95,9 +95,10 @@ jobs:
       - name: Setup Linux
         uses: ./.github/actions/setup-linux
+        if: inputs.build-environment != 'linux-s390x-binary-manywheel'

       - name: configure aws credentials
-        if : ${{ inputs.aws-role-to-assume != '' }}
+        if : ${{ inputs.aws-role-to-assume != '' && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
         uses: aws-actions/configure-aws-credentials@v3
         with:
           role-to-assume: ${{ inputs.aws-role-to-assume }}
@@ -107,11 +108,13 @@ jobs:
       - name: Calculate docker image
         id: calculate-docker-image
         uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
+        if: inputs.build-environment != 'linux-s390x-binary-manywheel'
         with:
           docker-image-name: ${{ inputs.docker-image }}

       - name: Use following to pull public copy of the image
         id: print-ghcr-mirror
+        if: inputs.build-environment != 'linux-s390x-binary-manywheel'
         env:
           ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
         shell: bash
@@ -121,6 +124,7 @@ jobs:
       - name: Pull docker image
         uses: pytorch/test-infra/.github/actions/pull-docker-image@main
+        if: inputs.build-environment != 'linux-s390x-binary-manywheel'
         with:
           docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@@ -166,6 +170,7 @@ jobs:
         with:
           name: ${{ inputs.build-environment }}
           s3-bucket: ${{ inputs.s3-bucket }}
+          use-gha: ${{ inputs.use-gha }}

       - name: Download TD artifacts
         continue-on-error: true
@@ -262,9 +267,21 @@ jobs:
          # comes from https://github.com/pytorch/test-infra/pull/6058
          TOTAL_MEMORY_WITH_SWAP=$(("${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}" + 3))

+          if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then
+            SHM_OPTS=
+            JENKINS_USER=
+
+            # since some steps are skipped on s390x, if they are necessary, run them here
+            env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"
+            env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
+          else
+            SHM_OPTS="--shm-size=${SHM_SIZE}"
+            JENKINS_USER="--user jenkins"
+          fi
+
          # detached container should get cleaned up by teardown_ec2_linux
          # TODO: Stop building test binaries as part of the build phase
-          # Used for GPU_FLAG since that doesn't play nice
+          # Used for GPU_FLAG, SHM_OPTS and JENKINS_USER since that doesn't play nice
          # shellcheck disable=SC2086,SC2090
          container_name=$(docker run \
            ${GPU_FLAG:-} \
@@ -316,11 +333,11 @@ jobs:
            --security-opt seccomp=unconfined \
            --cap-add=SYS_PTRACE \
            --ipc=host \
-            --shm-size="${SHM_SIZE}" \
+            ${SHM_OPTS} \
            --tty \
            --detach \
            --name="${container_name}" \
-            --user jenkins \
+            ${JENKINS_USER} \
            -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
            -w /var/lib/jenkins/workspace \
            "${DOCKER_IMAGE}"
@@ -328,6 +345,11 @@ jobs:
          # Propagate download.pytorch.org IP to container
          grep download.pytorch.org /etc/hosts | docker exec -i "${container_name}" sudo bash -c "/bin/cat >> /etc/hosts"
          echo "DOCKER_CONTAINER_ID=${container_name}" >> "${GITHUB_ENV}"
+
+          if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then
+            docker exec -t "${container_name}" sh -c "python3 -m pip install -r .ci/docker/requirements-ci.txt"
+          fi
+
          docker exec -t "${container_name}" sh -c "python3 -m pip install $(echo dist/*.whl)[opt-einsum] && ${TEST_COMMAND}"

       - name: Upload pytest cache if tests failed
@@ -467,3 +489,12 @@ jobs:
            echo "NVIDIA driver detects $GPU_COUNT GPUs. The runner has a broken GPU, shutting it down..."
            .github/scripts/stop_runner_service.sh
          fi
+
+      - name: Cleanup docker
+        if: always() && inputs.build-environment == 'linux-s390x-binary-manywheel'
+        shell: bash
+        run: |
+          # on s390x stop the container for clean worker stop
+          # ignore expansion of "docker ps -q" since it could be empty
+          # shellcheck disable=SC2046
+          docker stop $(docker ps -q) || true
```

New file `.github/workflows/s390x-periodic.yml` (77 lines):

```yaml
name: s390x-periodic

on:
  schedule:
    # We have several schedules so jobs can check github.event.schedule to activate only for a fraction of the runs.
    # Also run less frequently on weekends.
    - cron: 29 8 * * *  # about 1:29am PDT, for mem leak check and rerun disabled tests
  push:
    tags:
      - ciflow/periodic/*
      - ciflow/s390/*
    branches:
      - release/*
  workflow_dispatch:

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}-${{ github.event.schedule }}
  cancel-in-progress: true

permissions: read-all

jobs:
  llm-td:
    if: github.repository_owner == 'pytorch'
    name: before-test
    uses: ./.github/workflows/llm_td_retrieval.yml
    permissions:
      id-token: write
      contents: read

  target-determination:
    name: before-test
    uses: ./.github/workflows/target_determination.yml
    needs: llm-td
    permissions:
      id-token: write
      contents: read

  linux-manylinux-2_28-py3-cpu-s390x-build:
    if: github.repository_owner == 'pytorch'
    name: linux-manylinux-2_28-py3-cpu-s390x
    uses: ./.github/workflows/_linux-build.yml
    with:
      build-environment: linux-s390x-binary-manywheel
      docker-image-name: pytorch/manylinuxs390x-builder:cpu-s390x-main
      runner: linux.s390x
      test-matrix: |
        { include: [
          { config: "default", shard: 1, num_shards: 10, runner: "linux.s390x" },
          { config: "default", shard: 2, num_shards: 10, runner: "linux.s390x" },
          { config: "default", shard: 3, num_shards: 10, runner: "linux.s390x" },
          { config: "default", shard: 4, num_shards: 10, runner: "linux.s390x" },
          { config: "default", shard: 5, num_shards: 10, runner: "linux.s390x" },
          { config: "default", shard: 6, num_shards: 10, runner: "linux.s390x" },
          { config: "default", shard: 7, num_shards: 10, runner: "linux.s390x" },
          { config: "default", shard: 8, num_shards: 10, runner: "linux.s390x" },
          { config: "default", shard: 9, num_shards: 10, runner: "linux.s390x" },
          { config: "default", shard: 10, num_shards: 10, runner: "linux.s390x" },
        ]}
    secrets: inherit

  linux-manylinux-2_28-py3-cpu-s390x-test:
    permissions:
      id-token: write
      contents: read
    name: linux-manylinux-2_28-py3-cpu-s390x
    uses: ./.github/workflows/_linux-test.yml
    needs:
      - linux-manylinux-2_28-py3-cpu-s390x-build
      - target-determination
    with:
      build-environment: linux-s390x-binary-manywheel
      docker-image: pytorch/manylinuxs390x-builder:cpu-s390x-main
      test-matrix: ${{ needs.linux-manylinux-2_28-py3-cpu-s390x-build.outputs.test-matrix }}
      timeout-minutes: 480
      use-gha: "yes"
    secrets: inherit
```

Compiled autograd tests, replacing `xfailIfS390X` markers with an `IS_S390X` known-failing entry:

```diff
@@ -27,9 +27,9 @@ from torch._inductor import config as inductor_config
 from torch._inductor.test_case import run_tests, TestCase
 from torch.nn.attention.flex_attention import flex_attention
 from torch.testing._internal.common_utils import (
+    IS_S390X,
     scoped_load_inline,
     skipIfWindows,
-    xfailIfS390X,
 )
 from torch.testing._internal.inductor_utils import GPU_TYPE, HAS_CPU, HAS_CUDA, HAS_GPU
 from torch.testing._internal.logging_utils import logs_to_string
@@ -2831,7 +2831,6 @@ TORCH_LIBRARY(test_cudagraphs_cpu_scalar_used_in_cpp_custom_op, m) {
         out.backward()

     # should not RuntimeError: Node redefined name aot0_expand!
-    @xfailIfS390X
     def test_verbose_logs_graph(self):
         def fn():
             model = torch.nn.Sequential(
@@ -3044,7 +3043,6 @@ TORCH_LIBRARY(test_cudagraphs_cpu_scalar_used_in_cpp_custom_op, m) {
     )
     @skipIfWindows(msg="AssertionError: Scalars are not equal!")
-    @xfailIfS390X
     def test_verbose_logs_cpp(self):
         torch._logging.set_logs(compiled_autograd_verbose=True)
@@ -3623,6 +3621,9 @@ if not HAS_CUDA:
     # Found Tesla M60 which is too old to be supported by the triton GPU compiler
     known_failing_tests.add("test_type_conversions")

+if IS_S390X:
+    known_failing_tests.add("test_deep_reentrant")
+
 test_autograd = load_test_module("test_autograd")
 test_custom_ops = load_test_module("test_custom_ops")
```

Updates to the s390x test list (`S390X_TESTLIST`):

```diff
@@ -310,7 +310,6 @@ S390X_TESTLIST = [
     "inductor/test_coordinate_descent_tuner",
     "inductor/test_cpp_wrapper_hipify",
     "inductor/test_cpu_cpp_wrapper",
-    "inductor/test_cuda_cpp_wrapper",
     "inductor/test_cudagraph_trees",
     "inductor/test_cudagraph_trees_expandable_segments",
     "inductor/test_cuda_repro",
@@ -330,6 +329,7 @@ S390X_TESTLIST = [
     "inductor/test_fx_fusion",
     "inductor/test_graph_transform_observer",
     "inductor/test_group_batch_fusion",
+    "inductor/test_gpu_cpp_wrapper",
     "inductor/test_halide",
     "inductor/test_indexing",
     "inductor/test_inductor_freezing",
@@ -470,7 +470,6 @@ S390X_TESTLIST = [
     "test_type_promotion",
     "test_typing",
     "test_utils",
-    "test_utils_internal",
     "test_view_ops",
     "test_vulkan",
     "test_weak",
```

`test_autograd.py`, removing `xfailIfS390X` markers:

```diff
@@ -78,7 +78,6 @@ from torch.testing._internal.common_utils import (
     skipIfWindows,
     slowTest,
     TestCase,
-    xfailIfS390X,
     xfailIfTorchDynamo,
 )
 from torch.utils._mode_utils import no_dispatch
@@ -3212,7 +3211,6 @@ class TestAutograd(TestCase):
         with self.assertRaises(RuntimeError):
             b.add_(5)

-    @xfailIfS390X
     def test_attribute_deletion(self):
         x = torch.randn((5, 5), requires_grad=True)
         del x.grad
@@ -6810,7 +6808,6 @@ for shape in [(1,), ()]:
         IS_MACOS,
         "Fails with SIGBUS on macOS; https://github.com/pytorch/pytorch/issues/25941",
     )
-    @xfailIfS390X
     def test_deep_reentrant(self):
         class DeepReentrant(Function):
             @staticmethod
@@ -7213,7 +7210,6 @@ for shape in [(1,), ()]:
         out = checkpoint(fn, a, use_reentrant=False, debug=True)
         out.backward()

-    @xfailIfS390X
     def test_access_saved_tensor_twice_without_recomputation_works(self):
         count = [0]
@@ -8336,7 +8332,6 @@ for shape in [(1,), ()]:
         c = Func.apply(a)
         self.assertEqual(repr(c), "tensor([2.], grad_fn=<FuncBackward>)")

-    @xfailIfS390X
     def test_autograd_inplace_view_of_view(self):
         x = torch.zeros(2)
         with torch.no_grad():
```

Tensor creation tests, skipping `test_float_to_int_conversion_nonfinite` on s390x:

```diff
@@ -32,6 +32,7 @@ from torch.testing._internal.common_utils import (
     IS_WINDOWS,
     IS_FBCODE,
     IS_SANDCASTLE,
+    IS_S390X,
     parametrize,
     skipIfTorchDynamo,
     xfailIfTorchDynamo,
@@ -1083,6 +1084,7 @@ class TestTensorCreation(TestCase):
     # nondeterministically fails, warning "invalid value encountered in cast"
     @onlyCPU
     @unittest.skipIf(IS_MACOS, "Nonfinite conversion results on MacOS are different from others.")
+    @unittest.skipIf(IS_S390X, "Test fails for int16 on s390x. Needs investigation.")
     @dtypes(torch.bool, torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64)
     def test_float_to_int_conversion_nonfinite(self, device, dtype):
         vals = (float('-inf'), float('inf'), float('nan'))
```