Update Continuous-Integration and Running-and-writing-tests

2025-10-20 21:14:14 +08:00 · 2023-07-10 11:33:54 -07:00
parent 585fa7dc1c
commit 3a5a3138a3
2 changed files with 86 additions and 97 deletions
--- a/Continuous-Integration.md
+++ b/Continuous-Integration.md
@ -2,32 +2,25 @@

 This document discusses the Continuous Integration (CI) system for PyTorch.

-Currently PyTorch utilizes Github Actions and CircleCI for various different CI build/test configurations. We will discuss these related contents in the following sections:
+Currently PyTorch utilizes Github Actions for various different CI build/test configurations. We will discuss these related contents in the following sections:

 - [CI System Overview](#ci-system-overview)
- [CI Matrix](#ci-matrix)
-  * [Basic Configurations](#basic-configurations)
-  * [Other Variances](#other-variances)
- [Entrypoint for CI](#entrypoint-for-ci)
-  * [Build System](#build-system)
-  * [Test System](#test-system)
- [What is CI testing and When](#what-is-ci-testing-and-when)
-  * [CI workflow on PR](#ci-workflow-on-pr)
-    + [Using Github Labels](#using-github-labels)
-  * [CI workflow on master commits](#ci-workflow-on-master-commits)
-  * [Change CI workflow behavior on PR](#change-ci-workflow-behavior-on-pr)
-  * [Disabled tests in CI](#disabled-tests-in-ci)
-    + [How to disable a test](#how-to-disable-a-test)
-    + [Disabling a test for all dtypes/devices](#disabling-a-test-for-all-dtypesdevices)
-    + [How to test the disabled test on CI](#how-to-test-the-disabled-test-on-ci)
- [Other Topics](#other-topics)
-  * [CI Metrics](#ci-metrics)
-  * [CI Internals](#ci-internals)
-  * [Binary Builds](#binary-builds)
-  * [Docker Builds](#docker-builds)
-  * [Other Related Repos](#other-related-repos)
-  * [Which commit is used in CI for your PR?](#which-commit-is-used-in-ci-for-your-pr-)
-  * [CI failure tips on PR](#ci-failure-tips-on-pr)
+  - [Basic Configurations](#basic-configurations)
+  - [Other Variances](#other-variances)
+  - [CI on Commits](#ci-on-commits)
+  - [Using labels to change CI on PR](#using-labels-to-change-ci-on-pr)
+- [Other](#other)
+  - [Sharding](#sharding)
+  - [Disabled tests in CI](#disabled-tests-in-ci)
+    - [How to disable a test](#how-to-disable-a-test)
+    - [Disabling a test for all dtypes/devices](#disabling-a-test-for-all-dtypesdevices)
+    - [How to test the disabled test on CI](#how-to-test-the-disabled-test-on-ci)
+  - [Reruns/retries](#rerunsretries)
+  - [Which commit is used in CI for your PR?](#which-commit-is-used-in-ci-for-your-pr)
+  - [CI failure tips on PR](#ci-failure-tips-on-pr)
+- [Additional Resources](#additional-resources)
+  - [Related wikis](#related-wikis)
+  - [Related Repos](#related-repos)


 ## CI System Overview
@ -37,22 +30,9 @@ PyTorch CI system ensures that proper build/test process should pass on both PRs
 - *CI workflow*: a CI workflow consist of a collection of CI jobs that are depending on each other. For example: [lint](https://github.com/pytorch/pytorch/blob/5abe454d6c02ac16e2e1fc87500aa46697385501/.github/workflows/lint.yml)
 - *CI job*: a CI job consist of a sequence of predefined steps, each of which will execute some specific script to build or test PyTorch. For example: [lintrunner](https://github.com/pytorch/pytorch/blob/5abe454d6c02ac16e2e1fc87500aa46697385501/.github/workflows/lint.yml#L14)
 - *CI run*: Refers to all the CI workflows triggered on a single commit (either on PR or on master).
- *CI platform*: we currently have 3 CI platforms (Github Actions, CirlceCI).
+- *CI platform*: we currently have 1 CI platform (CircleCI was used also used previously and some artifacts can be found in the repo.  CircleCI is also used in some other domain libraries).

-PyTorch supports many different hardware architectures, operation systems, and accelerator GPUs. Therefore, many different CI workflows run parallel on each commit to ensure that PyTorch can be built and run correctly in different environments and configurations. See [CI Matrix](#ci-matrix) section for more info.
-
-For historic reasons we currently have 2 different CI platforms and there are dozens of CI jobs run on these 2 platforms. For each platform specific information please refer to the [CI Internals](#ci-internals) section for more info.
-
-## CI Matrix
-
-Currently there are 3 different categories of CI runs for a single commit: CI run (1) on PR, (2) on master commits, (3) on nightly commits. These CI workflow configurations changes from time to time (Please refer to the [CI Internals](#ci-internals) section for more detail), but in general the rule of thumb for our CI runs are.
-1. We consider the set of CI workflows run on master commits to be the base line.
-2. We run on PRs a subset of CI workflows comparing to the base line we run on master commits.
-3. We mostly run binary CI workflows and smoke tests on nightly commits.
-
-For more details on which CI workflow is being run on which category, please refer to the section: [What is CI testing and When](#what-is-ci-testing-and-when).
-
-Each one of these triggers a specific subset of CI workflows defined in the CI matrix. Currently our CI matrix consists of some combination of the following basic configurations.
+PyTorch supports many different hardware architectures, operation systems, and accelerator GPUs. Therefore, many different CI workflows run parallel on each commit to ensure that PyTorch can be built and run correctly in different environments and configurations.

 ### Basic Configurations

@ -72,41 +52,59 @@ Obviously not every Cartesian product of these combination is being tested. Plea
 Other than the 4 basic configuration coordinates, we also have some special variances in configurations that we run. These variances are created to test some specific features or to cover some specific test domains. For example:

 1. *ASAN build*: to run address sanitizers.
-2. *libtorch builds*: to target CPP APIs.
-3. *coverage builds*: to report test coverages.
-4. *pure torch build*: to test PyTorch built without Caffe2.
-5. *ONNX build*: to test ONNX integration.
-6. *XLA/Vulkan build*: to test XLA integration.
-7. *Android/IOS build*: to test PyTorch run mobile platforms.
+2. *libtorch busilds*: to target CPP APIs.
+3. *pure torch build*: to test PyTorch built without Caffe2.
+4. *ONNX build*: to test ONNX integration.
+5. *XLA/Vulkan build*: to test XLA integration.
+6. *Android/IOS build*: to test PyTorch run mobile platforms.

-## Entrypoint for CI
+### CI on Commits

-### Build System
+Currently there are 2 main categories of CI runs for a single commit:
+* pull_request trigger (pull, lint): These are jobs that are always run on every PR and on every commit on main.
+* ciflow label trigger: These are jobs that are triggered by adding `ciflow/<workflow name>` tags to the PR.  These can be further divided into:
+  * run on every commit in main (trunk, inductor)
+  * periodic workflows (periodic, slow)
+  * nightlies/binaries: These are usually run on the nightly branch instead of the main branch and generally build and smoke test our binaries.

-### Test System
+Each category contains a subset of CI workflows defined in the CI matrix.

-See ["Running and writing tests" page](https://github.com/pytorch/pytorch/wiki/Running-and-writing-tests) for information.
+We generally consider CI workflows run on commits in main to be the baseline, and we run subsets of this baseline on PRs.

-## What is CI testing and When
+### Using labels to change CI on PR

-### CI workflow on PR
+As mentioned in [CI Matrix](#ci-matrix), PyTorch runs different sets of jobs on PR vs. on main commits.

-#### Using Github Labels
+In order to control the behavior of CI jobs on PR. The most commonly used labels are:
+- `ciflow/trunk`: automatically added when `@pytorchbot merge` is invoked.  These tests are run on every commit in master.
+- `ciflow/periodic`: runs every 4 hours on master.  Includes jobs that are either expensive or slow to run, such as mac x86-64 tests, slow gradcheck, and multigpu.
+- `ciflow/inductor`: runs inductor builds and tests.  This label may be automatically added by our autolabeler if your PR touches certain files.  These jobs are run on every commit in master.
+- `ciflow/slow`: runs every 4 hours on master.  Runs tests that are marked as slow.

-See ["how to use GitHub labels" section on the "Running and writing tests" page](https://github.com/pytorch/pytorch/wiki/Running-and-writing-tests#using-github-label-to-control-ci-behavior-on-pr) for information.
+For a complete definition of every job that is triggered by these labels, as well as other labels that are not listed here, [search for `ciflow/` in the `.github/workflows` folder](https://github.com/search?q=repo%3Apytorch%2Fpytorch+ciflow%2F+path%3A.github%2Fworkflows%2F*.yml&type=code) or run `grep -r 'ciflow/' .github/workflows`.

-### CI workflow on master commits
+Additional labels include:
+- `keep-going`: test jobs stop at first test failure.  Use this label to keep going after first failure.
+- `test-config/<default, distributed, etc>`: only run a specific test config.

-### Change CI workflow behavior on PR
+
+## Other
+
+### Sharding
+The total time of our entire test suite takes over 24 hours to run if run serially, so we shard and parallelize our tests to decrease this time.  Test job names generally look like `<configuration/architecture information, ex os, python version, compiler version> / test (<test_config>, <shard number>, <total number of shards>, <machine type>, <optional additional information>)`.  For example `linux-bionic-py3.11-clang9 / test (default, 2, 2, linux.2xlarge)` is running the second shard of two shards for the default test config.
+
+Tests are distributed across the number of shards based on how long they take.  Long tests are broken into smaller chunks based on their test times and may show up on different shards as well.  Test time information updates everyday at around 5PM PT, which can cause tests to move to different shards.
+
+Information about what test files are run on each shard can be found by searching the logs for `Selected tests`.

 ### Disabled tests in CI

 Some PyTorch tests are currently disabled due to their flakiness, incompatibility with certain platforms, or other temporary brokenness. We have a system where GitHub issues titled “DISABLED test_a_name” disable desired tests in PyTorch CI until the issues are closed, e.g., #62970. If you are wondering what tests are currently disabled in CI, please check out [disabled-tests.json](https://github.com/pytorch/test-infra/blob/generated-stats/stats/disabled-tests-condensed.json), where these test cases are all gathered.

 #### How to disable a test
-First, you should never disable a test if you're not sure what you're doing. Tests are important in validating PyTorch functionality, and ignoring test failures is not recommended as it can degrade user experience. When you are certain that disabling a test is the best option, for example, when it is flaky, make plans to fix the test so that it is not disabled indefinitely.
+First, you should never disable a test if you're not sure what you're doing, and only contributors with write permissions can disable tests. Tests are important in validating PyTorch functionality, and ignoring test failures is not recommended as it can degrade user experience. When you are certain that disabling a test is the best option, for example, when it is flaky, make plans to fix the test so that it is not disabled indefinitely.

-To disable a test, say, create an issue with the title `DISABLED test_case_name (test.ClassName)`. A real title example would look like: `DISABLED test_jit_cuda_extension (__main__.TestCppExtensionJIT)`. In the body of the issue, feel free to include any details and logs as you normally would with any issue. If you would like to skip the test for particular platforms, such as ROCm, please include a line (case insensitive) in the issue body like so: "<start of line>Platforms: Mac, Windows<end of line>." The available platforms to choose from are: mac/macos, windows, rocm, linux, and asan (whether the test is with ASAN).
+To disable a test, create an issue with the title `DISABLED test_case_name (test.ClassName)`. A real title example would look like: `DISABLED test_jit_cuda_extension (__main__.TestCppExtensionJIT)`. In the body of the issue, feel free to include any details and logs as you normally would with any issue. It is possible to only skip the test for specific platforms (like rocm, windows, linux, or mac), or for specific test configs (available ones are inductor, dynamo, and slow).  To do this, include a line (case insensitive) in the issue body like so: "\<start of line>Platforms: Mac, Windows\<end of line>."

 #### Disabling a test for all dtypes/devices
 Currently, you are able to disable all sorts of test instantiations by removing the suffixed device type (e.g., `cpu`, `cuda`, or `meta`) AND dtype (e.g., `int8`, `bool`, `complex`). So given a test `test_case_name_cuda_int64 (test.ClassNameCUDA)`, you can:
@ -114,36 +112,42 @@ Currently, you are able to disable all sorts of test instantiations by removing
 2. disable ALL instances of the test with `DISABLED test_case_name (test.ClassName)`. NOTE: You must get rid of BOTH the device and dtype suffixes AND the suffix in the test suite as well.

 #### How to test the disabled test on CI
-It is not easy to test these disabled tests with CI because, well, they’re intentionally disabled. Previous alternatives were to either mimic the test environment locally (often not convenient) or to close the issue and re-enable the test in all of CI (risking breaking trunk CI).
+It is not easy to test these disabled tests with CI since they are automatically skipped. Previous alternatives were to either mimic the test environment locally (often not convenient) or to close the issue and re-enable the test in all of CI (risking breaking trunk CI).

-After #62851 and #74981, PRs with key phrases like “fixes #55555” or “Close https://github.com/pytorch/pytorch/issues/62359” in their PR bodies or commit messages will re-enable the tests disabled by the linked issues (in this example, #55555 and #62359).  Including a key phrase in the PR body only works for tests triggered by pull request and does not work for push-triggered CI.  Including a key phrase in the commit message will work for both pull and  push triggered CI.  The test will also be run if ANY of the commit messages in the PR contains a key phrase.
-
-<img width="830" alt="Screen Shot 2022-04-04 at 11 20 55 AM" src="https://user-images.githubusercontent.com/44682903/161607690-429e9df3-5f6d-4fed-9e60-b31a65373bec.png">
+PRs with key phrases like “fixes #55555” or “Close https://github.com/pytorch/pytorch/issues/62359” in their PR bodies or commit messages will re-enable the tests disabled by the linked issues (in this example, #55555 and #62359).

 More accepted key phrases are defined by the [GitHub docs](https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword).

-Limitations: this feature only works for GitHub Actions CI.
+### Reruns/retries
+We have various sources of reruns in our CI to help deal with flakiness:
+* Job/workflow reruns: These are done on both main and PR commits.  We have various rules for what jobs should be rerun and when which can be found in test-infra.  Jobs can also be rerun through the Github UI.  Unfortunately, due to Github limitations, jobs can only be rerun after the entire workflow is finished.
+* Step reruns: These rerun individual steps in a job and can be identified by steps that use the nick-fields/retry action.
+* Test file reruns: These rerun test files in a new subprocess to help with flakiness that results in segfaults, other signal related errors, or issues related to tests changing global state.  These are done 3 times and if the test fails on all 3, the job fails.
+* Test reruns: In addition to test file reruns, tests are rerun within a subprocess 3 times.

-## Other Topics
-
-### CI Metrics
-
-A compilation of dashboards and metrics relating to CI could be found in [the PyTorch CI Metrics Dashboards wiki](https://github.com/pytorch/pytorch/wiki/PyTorch-CI-Metrics-Dashboards)
-
-### CI Internals
-
-### Binary Builds
-
-### Docker Builds
-
-See: https://github.com/pytorch/pytorch/wiki/Docker-image-build-on-CircleCI for more information.
-
-### Other Related Repos
-
-#### [pytorch/builder](https://github.com/pytorch/builder)
-#### [pytorch/test-infra](https://github.com/pytorch/test-infra)
-#### [pytorch/pytorch-ci-hud](https://github.com/pytorch/pytorch-ci-hud)
+Additionally, there may be additional sources of retries, for example retrying network calls.

 ### Which commit is used in CI for your PR?

+The code used in your PR is the code in the most recent commit of your PR.  However, the workflow file definitions are a merge between the current `main` brant workflow file definitions and those found in your PR.  This can cause failures due to the merged workflow file referencing a file that might not exist yet and is generally resolved by rebasing.
+
 ### CI failure tips on PR
+
+1. Rebasing can resolve failures due to an already red merge base and changes to workflow file definitions.
+2. Rerunning jobs can help determine if the failure is flaky.
+3. Checking the [HUD home page](https://github.com/pytorch/pytorch/wiki/Using-hud.pytorch.org#hud-home-page) can tell you if CI is in a bad state and also what failures are currently common.
+4. Use the [HUD](https://github.com/pytorch/pytorch/wiki/Using-hud.pytorch.org#pull-requests-and-commits) to view the log viewer and log classifier to see what we think is the most likely cuase of your failure.
+
+
+## Additional Resources
+
+A compilation of dashboards and metrics relating to CI could be found in [HUD metrics page](https://hud.pytorch.org/metrics)
+
+### Related wikis
+* [Running and writing tests](https://github.com/pytorch/pytorch/wiki/Running-and-writing-tests)
+* [Docker image build on CircleCI](https://github.com/pytorch/pytorch/wiki/Docker-image-build-on-CircleCI)
+### Related Repos
+
+* [pytorch/builder](https://github.com/pytorch/builder)
+* [pytorch/test-infra](https://github.com/pytorch/test-infra)
+* [pytorch/pytorch-ci-hud](https://github.com/pytorch/pytorch-ci-hud)
--- a/Running-and-writing-tests.md
+++ b/Running-and-writing-tests.md
@ -69,7 +69,7 @@ Alternatively you can pass in additional arguments to run specific test(s), use
 ```
 python test/run_test.py -h
 ```
-Using `test/run_test.py` will usually require some extra dependencies, like pytest-rerunfailures and pytest-shard.
+Using `test/run_test.py` will usually require some extra dependencies, like pytest-rerunfailures and pytest-shard.  Running `pip install -r .ci/docker/requirements-ci.txt` will install all the dependencies needed.

 ## Using environment variables:

@ -88,21 +88,6 @@ PYTORCH_TEST_WITH_SLOW=1 python test_torch.py
 ```
 will run the tests in test_torch.py, including those decorated with `@slowTest`.

-
-## Using Github label to control CI behavior on PR
-
-_(last updated 2023-03-30)_
-
-PyTorch runs different sets of jobs on PR vs. on master commits.
-
-In order to control the behavior of CI jobs on PR. The most commonly used labels are:
- `ciflow/trunk`: automatically added when `@pytorchbot merge` is invoked.  These tests are run on every commit in master.
- `ciflow/periodic`: runs every 4 hours on master.  Includes jobs that are either expensive or slow to run, such as mac x86-64 tests, slow gradcheck, and multigpu.
- `ciflow/inductor`: runs inductor builds and tests.  This label may be automatically added by our autolabeler if your PR touches certain files.  These jobs are run on every commit in master.
- `ciflow/mps`: a subset of `ciflow/trunk` that runs mps related builds and tests
-
-For a complete definition of every job that is triggered by these labels, as well as other labels that are not listed here, [search for `ciflow/` in the `.github/workflows` folder](https://github.com/search?q=repo%3Apytorch%2Fpytorch+ciflow%2F+path%3A.github%2Fworkflows%2F*.yml&type=code) or run `grep -r 'ciflow/' .github/workflows`.
-
 # Common test utilities

 Use the test case's [assertEqual](https://github.com/pytorch/pytorch/blob/0e7b5ea6c003b763603d8ca0fe12f476fdf9bf32/torch/testing/_internal/common_utils.py#L1277) to compare objects for equality.