62 Commits

Author SHA1 Message Date
6a6d838832 Add H100 runner to be recognized in actionlint (#163795)
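Concretely, this kind of change registers the new runner label in `.github/actionlint.yaml` so actionlint stops flagging workflows that use it. A minimal sketch (the exact label name added by this PR is illustrative):

```yaml
self-hosted-runner:
  labels:
    # existing labels elided
    - linux.aws.h100   # new H100 runner label; name assumed for illustration
```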
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163795
Approved by: https://github.com/huydhn, https://github.com/seemethere
2025-09-25 22:09:11 +00:00
66133b1ab7 Build vLLM aarch64 nightly wheels (#162664)
PyTorch has published its aarch64 nightly wheels for all CUDA versions after https://github.com/pytorch/pytorch/pull/162364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162664
Approved by: https://github.com/atalman
2025-09-13 03:43:55 +00:00
93fb23d6fa Build vLLM nightly wheels (#162000)
This uses the same approach as building the Triton wheel, where we publish a nightly wheel for vLLM whenever its pinned commit is updated.  The key change is to use `pytorch/manylinux2_28-builder` as the base image to build vLLM, so there are a couple of changes to the vLLM Dockerfile used by lumen_cli:

1. `pytorch/manylinux2_28-builder` is RedHat-based instead of Debian-based, so there is no apt-get
2. Fix a bug in `.github/actions/build-external-packages/action.yml` where `CUDA_VERSION` is not set correctly, preventing CUDA 12.9 build
3. Fix a bug in `.github/actions/build-external-packages/action.yml` where `TORCH_WHEELS_PATH` is not set correctly and always defaulted to `dist`
4. In the vLLM Dockerfile, use the correct index for the selected CUDA version, i.e. https://download.pytorch.org/whl/nightly/cu12[89] for CUDA 12.[89] (see the sketch below)
5. Install torch, vision, audio in one command. Unlike the CI image `pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm`, `pytorch/manylinux2_28-builder` doesn't have any torch dependencies preinstalled
6. Bump the xformers version to 0.0.32.post2 now that PyTorch 2.8.0 has landed in vLLM

We need to prepare three wheels: vLLM, xformers, and flashinfer-python. I rename them using the same convention as PyTorch nightlies, `MAJOR.MINOR.PATCH.devYYYYMMDD`, so that vLLM nightlies will work with torch nightlies from the same date.
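As a concrete illustration of point 4 above, mapping a CUDA version to the matching nightly index URL can be a one-line workflow step. A minimal sketch, assuming `CUDA_VERSION` is provided by the build matrix (step and variable names are illustrative):

```yaml
- name: Pick the nightly index for the selected CUDA version
  shell: bash
  env:
    CUDA_VERSION: "12.9"   # assumed to be supplied by the build matrix
  run: |
    # CUDA 12.9 -> cu129, CUDA 12.8 -> cu128
    CU_TAG="cu$(echo "${CUDA_VERSION}" | tr -d '.')"
    echo "TORCH_INDEX_URL=https://download.pytorch.org/whl/nightly/${CU_TAG}" >> "${GITHUB_ENV}"
```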

### Usage

* Install latest nightlies
```
pip install --pre torch torchvision torchaudio vllm xformers flashinfer_python \
  --index-url https://download.pytorch.org/whl/nightly/cu129
```

* Install a specific version
```
pip install --pre torch==2.9.0.dev20250903 torchvision torchaudio \
  vllm==1.0.0.dev20250903 \
  xformers==0.0.33.dev20250903 \
  flashinfer_python==0.2.14.dev20250903 \
  --index-url https://download.pytorch.org/whl/nightly/cu129
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162000
Approved by: https://github.com/atalman
2025-09-07 06:09:17 +00:00
5737372862 [CI] Switch ROCm MI300 GitHub Actions workflows from 2-GPU to 1-GPU runners (#158882)
Updated .github/actionlint.yaml to replace linux.rocm.gpu.mi300.2 with linux.rocm.gpu.mi300.1 in the supported runner list

Modified all affected workflows (inductor-perf-test-nightly-rocm.yml, inductor-periodic.yml, inductor-rocm-mi300.yml, and rocm-mi300.yml) to run jobs on 1-GPU MI300 runners instead of 2-GPU runners

This should help increase the number of available runners even with the same number of CI nodes.
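The change itself is a one-line `runs-on` swap per affected job; a sketch (job name illustrative):

```yaml
jobs:
  rocm-mi300-test:
    runs-on: linux.rocm.gpu.mi300.1   # previously: linux.rocm.gpu.mi300.2
```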

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158882
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-08-12 22:42:40 +00:00
53d68b95de [ROCm CI] Migrate to MI325 Capacity. (#159059)
This PR moves PyTorch CI capacity from mi300 to a new, larger mi325 cluster. Both of these GPUs are the same architecture (gfx942), and our testing plans don't change within an architecture, so this PR also pools them under the same label `linux.rocm.gpu.gfx942.<#gpus>` to reduce overhead and confusion.
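Under the pooled convention, the runner list in `.github/actionlint.yaml` might look like the following sketch (the specific GPU counts listed are illustrative):

```yaml
self-hosted-runner:
  labels:
    # pooled by architecture: gfx942 covers both MI300 and MI325
    - linux.rocm.gpu.gfx942.1
    - linux.rocm.gpu.gfx942.2
    - linux.rocm.gpu.gfx942.4
```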

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159059
Approved by: https://github.com/jithunnair-amd, https://github.com/atalman

Co-authored-by: deedongala <deekshitha.dongala@amd.com>
2025-07-30 19:47:59 +00:00
f02b783aae [1/N] Remove MacOS-13 MPS testing (#159278)
Starts addressing https://github.com/pytorch/pytorch/issues/159275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159278
Approved by: https://github.com/dcci
ghstack dependencies: #159277
2025-07-28 23:52:47 +00:00
716d52779f [BE] Delete non-existing labels (#159277)
As no such runners have been online for the last 2+ months
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159277
Approved by: https://github.com/clee2000
2025-07-28 20:28:57 +00:00
bf06190e21 Integrated AMD AWS runners into Pytorch CI (#153704)
Integrated AMD AWS runners into PyTorch CI, including the linux.24xl.amd for performance tests, the linux.8xl.amd with AVX512 support for unit and periodic tests, and the linux.12xl.amd with AVX2 support for unit and periodic tests.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153704
Approved by: https://github.com/malfet, https://github.com/jithunnair-amd

Co-authored-by: kiriti-pendyala <kiriti.pendyala@amd.com>
2025-06-18 15:58:22 +00:00
794ef6c9b8 Enable manywheel build and smoke test on main branch for ROCm (#153287)
Fixes the issue of not discovering breakage of ROCm wheel builds until the nightly job runs, e.g. https://github.com/pytorch/pytorch/pull/153253

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153287
Approved by: https://github.com/jeffdaily
2025-06-14 19:14:31 +00:00
afd7a13bca Migrate to new Windows Arm64 runners (#152099)
This PR moves the Windows Arm64 nightly jobs to the new runner image, see [arm-windows-11-image](https://github.com/actions/partner-runner-images/blob/main/images/arm-windows-11-image.md)

Fixes #151671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152099
Approved by: https://github.com/seemethere
2025-05-21 09:13:15 +00:00
001695c397 [ROCm][CI] Enable distributed CI on MI300 (#150667)
* Enable distributed CI on MI300 runners, with the same schedule-based and release-branch triggers as `periodic.yml`; also uses the label `ciflow/periodic-rocm-mi300` for triggering on PRs.
* Disabled failing distributed tests on MI300 via Github issues: [151077](https://github.com/pytorch/pytorch/issues/151077), [151078](https://github.com/pytorch/pytorch/issues/151078), [151081](https://github.com/pytorch/pytorch/issues/151081), [151082](https://github.com/pytorch/pytorch/issues/151082), [151083](https://github.com/pytorch/pytorch/issues/151083), [151084](https://github.com/pytorch/pytorch/issues/151084), [151085](https://github.com/pytorch/pytorch/issues/151085), [151086](https://github.com/pytorch/pytorch/issues/151086), [151087](https://github.com/pytorch/pytorch/issues/151087), [151088](https://github.com/pytorch/pytorch/issues/151088), [151089](https://github.com/pytorch/pytorch/issues/151089), [151090](https://github.com/pytorch/pytorch/issues/151090), [151153](https://github.com/pytorch/pytorch/issues/151153)
* Disabled failing distributed tests via `skipIfRocm`: ea9315ff95
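The trigger wiring for such a workflow typically combines a cron schedule with PyTorch's ciflow tag mechanism; a minimal sketch (the cron cadence is an assumption):

```yaml
on:
  schedule:
    - cron: "45 4,12,20 * * *"   # illustrative; mirrors a periodic.yml-style cadence
  push:
    tags:
      - ciflow/periodic-rocm-mi300/*   # pushed when the ciflow label is applied to a PR
```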

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150667
Approved by: https://github.com/jeffdaily
2025-04-14 16:19:04 +00:00
48af2cdd27 [BE] Move all lint runner to 24.04 (#150427)
As Ubuntu 20.04 reached EOL on Apr 1st, see https://github.com/actions/runner-images/issues/11101
This forces the oldest Python version to be 3.8
Deleted all linux-20.04 runners from lintrunner.yml
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150427
Approved by: https://github.com/seemethere
2025-04-01 17:33:15 +00:00
1c7196f04b Add new GHA workflow to cache ROCm CI docker images on MI300 CI runners periodically (#148394)
Re-filing https://github.com/pytorch/pytorch/pull/148387 from a pytorch repo branch to get AWS login via OIDC working
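A skeleton of what such a caching workflow could look like, assuming OIDC-based AWS auth (workflow name, cadence, and role ARN are placeholders):

```yaml
name: docker-cache-mi300
on:
  schedule:
    - cron: "0 */6 * * *"   # assumed refresh cadence
permissions:
  id-token: write   # required for AWS login via OIDC
  contents: read
jobs:
  docker-cache:
    runs-on: linux.rocm.gpu.mi300.1
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gha-docker-cache   # placeholder
          aws-region: us-east-1
      - name: Pull the ROCm CI image so later jobs reuse the cached layers
        run: docker pull "${DOCKER_IMAGE}"   # image URI assumed to be supplied elsewhere
```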

Successful docker caching run: https://github.com/pytorch/pytorch/actions/runs/13843689908/job/38737095535
Run without cached docker image: https://github.com/pytorch/pytorch/actions/runs/13843692637/job/38746033460
![image](https://github.com/user-attachments/assets/c410ff35-a150-4885-b904-3a5e1888c032)
Run with cached docker image:
![image](https://github.com/user-attachments/assets/41e417b5-a795-4ed2-a9cd-00151db8f813)
~6 min vs 3 s :)

Thanks @saienduri for the help on the MI300 infra side

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148394
Approved by: https://github.com/jeffdaily
2025-03-15 00:34:04 +00:00
56b2e4b8f0 ci: Update linux.20_04 --> linux.24_04 (#149142)
Ubuntu 20.04 is getting deprecated soon, so we might as well proactively
move to the latest LTS, which is 24.04

> [!NOTE]
> The oldest supported version of Python on 24.04 is Python 3.8. Since we test for Python 3.6 compat in our collect_env test, we need to have this particular job stick with 20.04 for now, until we decide to upgrade it to a newer Python version.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149142
Approved by: https://github.com/atalman, https://github.com/wdvr
2025-03-14 02:20:10 +00:00
61c4074df7 Add Windows Arm64 Nightly Builds (#139760)
This PR creates 3 new workflows for the Windows Arm64 target. The workflows and outputs can be reviewed at the following links:
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-libtorch-release-nightly.yml
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-libtorch-debug-nightly.yml
https://github.com/pytorch/pytorch/actions/workflows/generated-windows-arm64-binary-wheel-nightly.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139760
Approved by: https://github.com/malfet

Co-authored-by: Ozan Aydin <148207261+ozanMSFT@users.noreply.github.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2025-03-07 18:53:56 +00:00
d789c22712 Upgrade github ubuntu-20.04 runners to ubuntu-24.04 (#148469)
The GitHub-provided ubuntu-20.04 GHA runners are being deprecated (https://togithub.com/actions/runner-images/issues/11101), so upgrade the workflows using them to the latest runner, 24.04

They are currently doing a brownout, resulting in failures like: https://github.com/pytorch/pytorch/actions/runs/13660782115
```
[do_update_viablestrict](https://github.com/pytorch/pytorch/actions/runs/13660782115/job/38192777885)
This is a scheduled Ubuntu 20.04 brownout. Ubuntu 20.04 LTS runner will be removed on 2025-04-01. For more details, see https://github.com/actions/runner-images/issues/11101
```
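The fix itself is a one-line runner swap in each affected workflow; a sketch using the job quoted above:

```yaml
jobs:
  do_update_viablestrict:
    runs-on: ubuntu-24.04   # was: ubuntu-20.04
```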

Should we be using ubuntu-latest instead?

I attempted to upgrade actionlint to 1.7.7, but locally in test-infra it seems to add a lot of new checks, and on test-infra's CI I seem to have uploaded the wrong executable or something, so it failed.  I'll try again later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148469
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-03-04 22:29:04 +00:00
654f2666d9 Increase memory for linux binary builds (#147542)
Recently I detected that some Linux manywheel builds are flaky ([ex](https://github.com/pytorch/pytorch/actions/runs/13438309056/job/37555475510)).

After investigating, I could not find issues in the runner logs, available disk space, network usage, or CPU load. Unfortunately, memory information is not available.

But given the symptoms, the likelihood of this being an OOM problem is high.

So, moving those build jobs from `linux.12xlarge.ephemeral` to `linux.12xlarge.memory.ephemeral`.

This change depends on https://github.com/pytorch/test-infra/pull/6316
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147542
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2025-02-21 14:15:40 +00:00
e5da9df421 Revert "Increase memory for linux binary builds (#147542)"
This reverts commit 87e6e2924eb706b928cdfc4a11623b39259fa830.

Reverted https://github.com/pytorch/pytorch/pull/147542 on behalf of https://github.com/jeanschmidt due to seems that it is best to use another machine type ([comment](https://github.com/pytorch/pytorch/pull/147542#issuecomment-2673765724))
2025-02-21 07:14:57 +00:00
87e6e2924e Increase memory for linux binary builds (#147542)
Recently I detected that some Linux manywheel builds are flaky ([ex](https://github.com/pytorch/pytorch/actions/runs/13438309056/job/37555475510)).

After investigating, I could not find issues in the runner logs, available disk space, network usage, or CPU load. Unfortunately, memory information is not available.

But given the symptoms, the likelihood of this being an OOM problem is high.

So, moving those build jobs from `linux.12xlarge.ephemeral` to `linux.24xlarge.ephemeral`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147542
Approved by: https://github.com/ZainRizvi, https://github.com/atalman
2025-02-20 23:02:45 +00:00
362ecad9bb [ROCm] Use linux.rocm.gpu.2 for 2-GPU and linux.rocm.gpu.4 for 4-GPU runners (#143769)
* Will enable us to target `periodic`/distributed CI jobs to 4-GPU runners using a different label `linux.rocm.gpu.4`
* Use 2-GPU runners for `trunk`, `pull` and `slow` (in addition to `inductor-rocm`) as well (although this currently will not change anything, since all our MI2xx runners have both `linux.rocm.gpu` and `linux.rocm.gpu.2` labels... but this will change in the future; see next point)
* Continue to use the `linux.rocm.gpu` label for any job that doesn't need more than 1 GPU, e.g. binary test jobs in `workflows/generated-linux-binary-manywheel-nightly.yml`
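Mapping jobs to these label tiers looks roughly like this sketch (job names illustrative):

```yaml
jobs:
  periodic-distributed-test:
    runs-on: linux.rocm.gpu.4   # 4-GPU runners for periodic/distributed jobs
  pull-test:
    runs-on: linux.rocm.gpu.2   # 2-GPU runners for trunk/pull/slow/inductor-rocm
  binary-smoke-test:
    runs-on: linux.rocm.gpu     # 1 GPU is enough for binary test jobs
```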

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143769
Approved by: https://github.com/jeffdaily
2024-12-24 08:04:00 +00:00
37f340c1e5 [EZ] Remove remaining amz2023 runner variant references (#136540)
Validated that no jobs use the amz2023 runner variant anymore ([proof](https://github.com/search?type=code&q=org%3Apytorch+%2F%5Cbamz2023%5Cb%2F+&p=1)), so removing all references to it

Explicit references to the amz2023 runner type variants were removed in the following PRs:
- https://github.com/pytorch/ignite/pull/3285
- https://github.com/pytorch/ao/pull/887
- https://github.com/pytorch/fbscribelogger/pull/1
- https://github.com/pytorch/pytorch/pull/134355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136540
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-09-25 19:01:00 +00:00
dcf05fcb14 Fix stale job using non-existent ARC runner (#134863)
The ARC CI system has been shut down, so this job is currently using a runner that doesn't exist.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134863
Approved by: https://github.com/ZainRizvi
2024-09-04 12:57:10 +00:00
78128cbdd8 [CD] Use ephemeral arm64 runners for nightly and docker builds (#134473)
Follow up after adding linux arm64 ephemeral instances: https://github.com/pytorch/pytorch/pull/134469
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134473
Approved by: https://github.com/malfet
2024-08-26 17:47:20 +00:00
a6fac0e969 Use ephemeral runners for windows nightly builds (#134463)
This is the definition of windows.4xlarge:

```
  windows.4xlarge:
    disk_size: 256
    instance_type: c5d.4xlarge
    is_ephemeral: true
    max_available: 420
    os: windows
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134463
Approved by: https://github.com/jeanschmidt
2024-08-26 16:33:19 +00:00
ff77c67d16 Use ephemeral runners for linux nightly builds (#134367)
Should be landed with https://github.com/pytorch/test-infra/pull/5590
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134367
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/seemethere
2024-08-24 12:49:07 +00:00
750d68ff70 Use amazon linux2 for Docker builds, fix build-docker-conda condition (#134116)
1. Switches failing jobs to Amazon Linux 2 (the CUDA, CPU, and ROCm jobs are failing)
2. Fixes the trigger condition for build-docker-conda to be the same as manywheel and libtorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134116
Approved by: https://github.com/ZainRizvi, https://github.com/nWEIdia
2024-08-21 18:01:16 +00:00
918367ebb0 Add new runner: G4DN Extra Large with T4 for windows binary builds (#133229)
Prep for #103104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133229
Approved by: https://github.com/ZainRizvi
2024-08-14 03:08:49 +00:00
e73fa28ec8 [CI] Fix arm64 docker build arch (#131869)
Attempt to fix arm64 docker build arch on https://github.com/pytorch/pytorch/pull/131855
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131869
Approved by: https://github.com/desertfire
2024-07-26 13:19:36 +00:00
d0e2ab617d Migrate conda, manywheel and libtorch docker builds to pytorch/pytorch (#129022)
Migration of Docker conda builds to pytorch/pytorch from pytorch/builder: https://github.com/pytorch/builder/blob/main/.github/workflows/build-conda-images.yml

Related to: https://github.com/pytorch/builder/issues/1849

Migrates the scripts and workflows, and adds logic to execute on PR and upload to ECR with a GitHub hash tag, in order to test the Docker build and nightly on PR.

Test when executing on PR, upload to ecr:
https://github.com/pytorch/pytorch/actions/runs/9799439218/job/27059691327
```
308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/conda-builder-cpu:789cf8fcd738088860056160f6e9ea7cd005972b
```
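Tagging the PR-built image with the commit hash for ECR is a small docker step; a sketch (the local tag and step name are illustrative):

```yaml
- name: Tag and push the PR-built image to ECR
  run: |
    IMAGE=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/conda-builder-cpu
    docker tag conda-builder-cpu:latest "${IMAGE}:${GITHUB_SHA}"
    docker push "${IMAGE}:${GITHUB_SHA}"
```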

Test With-Push, upload to dockerhub:
https://github.com/pytorch/pytorch/actions/runs/9799783407/job/27060633427
```
docker.io/pytorch/conda-builder:cpu done
```
Will upload here: https://hub.docker.com/r/pytorch/conda-builder/

Test using ecr image in the nightly workflow:
https://github.com/pytorch/pytorch/actions/runs/9798428933/job/27057835235#step:16:87

Note: This is the first part, which builds the Docker image and uploads it to either Docker Hub or ECR. After merging, a follow-up PR will change the conda nightly workflows to use either the ECR image or the Docker Hub image, depending on whether we are running on PR or from the main/release branch.

Cleanup of workflows and scripts from builder repo: https://github.com/pytorch/builder/pull/1923
Co-authored-by: atalman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129022
Approved by: https://github.com/atalman, https://github.com/seemethere, https://github.com/malfet, https://github.com/chuanqi129
2024-07-25 14:36:15 +00:00
eb5883f8aa Add new runner labels to actionlint (#131525)
Adding the labels corresponding to the Amazon 2023 AMI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131525
Approved by: https://github.com/atalman
2024-07-24 15:28:59 +00:00
ca023f77bc [CD] Add pytorch xpu wheel build in nightly (#129560)
Add the PyTorch XPU wheel build in nightly, now that the XPU build image enabling PR https://github.com/pytorch/builder/pull/1879 has been merged

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129560
Approved by: https://github.com/atalman
2024-07-11 15:49:04 +00:00
754e6d4ad0 Make jobs with LF runners still pass lint (#128175)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128175
Approved by: https://github.com/huydhn
2024-06-07 17:13:04 +00:00
ddef7c350f Add comments about runner labels (#127827)
To distinguish between org-wide and repo-specific runners, as well as highlight where they are hosted (by DevInfra, LF, or various partners)

Delete unused `bm-runner`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127827
Approved by: https://github.com/huydhn
2024-06-04 02:06:43 +00:00
57baae9c9b Migrating CI/CD jobs to macOS 14 (#127582)
We have half the fleet on macOS 14 already and it has been running fine so far: https://github.com/pytorch/pytorch/issues/127490.  So I'm preparing the final push to replace the rest of them.  This also switches the release build from 13 to 14 (GitHub runners)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127582
Approved by: https://github.com/atalman
2024-05-31 22:30:59 +00:00
da7ced6e8c S390x binaries (#120398)
Allow building nightly, RC, and release binaries for s390x.

This PR implements building the binaries; the publishing part is currently missing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120398
Approved by: https://github.com/huydhn
2024-05-11 02:32:25 +00:00
4e29e80bf0 Run MPS tests on MacOS Sonoma (#125801)
Those are running 14.4.1, so I wonder if they actually pass CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125801
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-05-09 13:43:12 +00:00
55ae8fb1f6 Switched m1 runners to the label macos-m1-stable (#120997)
Switched m1 runners to use the `macos-m1-stable` label, which points to exactly the same M1 running macOS 13.2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120997
Approved by: https://github.com/malfet
2024-03-01 19:52:34 +00:00
a545ebc870 Switched macOS runners type to macos-m1-stable (#117651)
Switched macOS runners type to `macos-m1-stable`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117651
Approved by: https://github.com/huydhn
2024-01-24 11:55:13 +00:00
c5c4d81b1b Switched stale workflow to linux.large.arc (#115635)
Switched stale workflow to linux.large.arc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115635
Approved by: https://github.com/jeanschmidt
2023-12-12 15:33:59 +00:00
21cf6e76c2 Revert "Use linux.large.arc for stale workflow (#115440)"
This reverts commit dadb3694ffaa2a0bfe78516c294a46566430c1ad.

Reverted https://github.com/pytorch/pytorch/pull/115440 on behalf of https://github.com/DanilBaibak due to Did not merge properly ([comment](https://github.com/pytorch/pytorch/pull/115440#issuecomment-1852126050))
2023-12-12 14:20:29 +00:00
dadb3694ff Use linux.large.arc for stale workflow (#115440)
* Try linux.large.arc for stale workflow

* Run stale workflow on PR changes

* Added arc runner label to the list of self-hosted runners

* Added concurrency to linux-job

* Cleanup

* Added workflow_dispatch for testing purposes
2023-12-12 15:11:09 +01:00
f7909cb947 Build and test iOS on GitHub M1 runners (#110406)
They are here https://github.blog/2023-10-02-introducing-the-new-apple-silicon-powered-m1-macos-larger-runner-for-github-actions

I have been able to run iOS simulator tests on my M1 laptop without issues.  Some numbers:

* iOS build takes ~1h with x86 runners
* The new M1 runners take ~20m https://github.com/pytorch/pytorch/actions/runs/6386171957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110406
Approved by: https://github.com/malfet, https://github.com/seemethere
2023-10-03 03:17:10 +00:00
a5de10d7a5 Remove linux.t4g.2xlarge Usage (#110064)
Switched from linux.t4g.2xlarge to linux.arm64.2xlarge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110064
Approved by: https://github.com/atalman, https://github.com/malfet
2023-09-26 14:30:35 +00:00
ec85ab6157 Adding aarch64 wheel CI workflows (#104109)
Adding workflows for building aarch64 Linux PyTorch pip wheels

Updates:
* Created aarch64 template for generated workflows
* Updated generate_ci_workflows.py to include aarch64
* Generated the aarch64 wheel workflow
* added _binary-build-aarch64.yml for building aarch64 wheel
* added _binary-test-aarch64.yml for sanity check of aarch64 wheel
* Updated binary_linux_test.sh to use --extra-index-url for aarch64 until the needed aarch64 dependencies are available at https://download.pytorch.org/whl/nightly/cpu
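A sketch of what that sanity-check install step might look like (step name and wheel path are illustrative):

```yaml
- name: Install the aarch64 wheel for sanity checks
  run: |
    # --extra-index-url lets pip also resolve aarch64 dependencies from the nightly CPU index
    pip install dist/torch-*.whl --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```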

NOTES:
* The build and test workflows use the arm64v8/alpine and quay.io/pypa/manylinux2014_aarch64:latest docker images at this time.
* A conda generated workflow is not included at this time and is being worked on.

Workflows were successfully tested at https://github.com/xncqr/pytorch/actions/runs/5351891068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104109
Approved by: https://github.com/malfet, https://github.com/atalman
2023-06-29 18:58:43 +00:00
2ac6ee7f12 Migrate jobs: windows.4xlarge->windows.4xlarge.nonephemeral (#100548)
This is a reopening of the PR https://github.com/pytorch/pytorch/pull/100377

# About this PR

Due to increased pressure on our Windows runners, and the elevated cost of instantiating and bringing down those instances, we want to migrate instances from ephemeral to non-ephemeral.

Possible impacts are breakages or misbehavior in CI jobs that put the runners in a bad state. Other possible impacts relate to exhaustion of resources, especially disk space, though memory might be a contender as CI trash piles up on those instances.

As a somewhat middle-of-the-road approach, non-ephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.

Instances definition can be found here: https://github.com/pytorch/test-infra/pull/4072

This is the first step in a multi-step approach, where we will migrate away from all ephemeral Windows instances and follow the lead of `windows.g5.4xlarge.nvidia.gpu`, in order to help reduce queue times for those instances. The phased approach follows:

* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* terminate the existence of `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)

The reasoning for this phased approach is to reduce the scope of possible contenders to investigate in case particular CI jobs misbehave.

# Copilot Summary

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.

# Copilot Poem

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

> _We're breaking free from the ephemeral chains_
> _We're running on the nonephemeral lanes_
> _We're building faster, testing stronger, supporting newer_
> _We're the non-ephemeral runners of fire_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman

(cherry picked from commit 7caac545b1d8e5de797c9593981c9578685dba81)

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100548
Approved by: https://github.com/jeanschmidt, https://github.com/janeyx99
2023-05-03 15:47:18 +00:00
543b7ebb50 Revert "Migrate jobs from windows.4xlarge to windows.4xlarge.nonephemeral instances (#100377)"
This reverts commit 7caac545b1d8e5de797c9593981c9578685dba81.

Reverted https://github.com/pytorch/pytorch/pull/100377 on behalf of https://github.com/malfet due to This is not the PR I've reviewed ([comment](https://github.com/pytorch/pytorch/pull/100377#issuecomment-1532148086))
2023-05-02 21:05:53 +00:00
7caac545b1 Migrate jobs from windows.4xlarge to windows.4xlarge.nonephemeral instances (#100377)
This is a reopening of the PR [100091](https://github.com/pytorch/pytorch/pull/100091)

# About this PR

Due to increased pressure on our Windows runners, and the elevated cost of instantiating and bringing down those instances, we want to migrate instances from ephemeral to non-ephemeral.

Possible impacts are breakages or misbehavior in CI jobs that put the runners in a bad state. Other possible impacts relate to exhaustion of resources, especially disk space, though memory might be a contender as CI trash piles up on those instances.

As a somewhat middle-of-the-road approach, non-ephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.

Instances definition can be found here: https://github.com/pytorch/test-infra/pull/4072

This is the first step in a multi-step approach, where we will migrate away from all ephemeral Windows instances and follow the lead of `windows.g5.4xlarge.nvidia.gpu`, in order to help reduce queue times for those instances. The phased approach follows:

* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* terminate the existence of `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)

The reasoning for this phased approach is to reduce the scope of possible contenders to investigate in case particular CI jobs misbehave.

# Copilot Summary

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.

# Copilot Poem

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

> _We're breaking free from the ephemeral chains_
> _We're running on the nonephemeral lanes_
> _We're building faster, testing stronger, supporting newer_
> _We're the non-ephemeral runners of fire_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman
2023-05-02 20:41:12 +00:00
e5291e633f Revert "Migrate jobs from windows.4xlarge to windows.4xlarge.nonephemeral instances (#100091)"
This reverts commit 1183eecbf19f77e2b1d9f3cee56dd8039653a5f5.

Reverted https://github.com/pytorch/pytorch/pull/100091 on behalf of https://github.com/huydhn due to CPU jobs starting to fail in trunk due to some error in MSVC setup
2023-04-26 19:17:58 +00:00
1183eecbf1 Migrate jobs from windows.4xlarge to windows.4xlarge.nonephemeral instances (#100091) 2023-04-26 18:32:50 +02:00
61cdae0ce5 Switch Windows CI jobs to G5 runners (#91727)
### Changelist

* Change Windows TORCH_CUDA_ARCH_LIST from `7.0` to `8.6` to be compatible with the NVIDIA A10G GPU (see the sketch after this list)
* Correctly disable some tests that require flash attention, which is not available on Windows at the moment. This has been fixed by https://github.com/pytorch/pytorch/pull/91979
* The G5 runner has an `AMD EPYC 7R32` CPU, not an Intel one
  * This seems to change the behavior of `GetDefaultMobileCPUAllocator` in `cpu_profiling_allocator_test`. This might need to be investigated further (TODO: TRACKING ISSUE). In the meantime, the test has been updated to correctly use `GetDefaultCPUAllocator` instead of `GetDefaultMobileCPUAllocator` for mobile builds
  * Also, one periodic test, `test_cpu_gpu_parity_nn_Conv3d_cuda_float32`, fails with a Tensor-not-close error when comparing grad tensors between CPU and GPU. This is fixed by turning off TF32 for the test.
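The arch-list change in the first bullet is a one-line environment tweak; a sketch (its exact placement in the Windows build workflow is assumed):

```yaml
env:
  TORCH_CUDA_ARCH_LIST: "8.6"   # was "7.0"; the A10G is an Ampere GPU, compute capability 8.6
```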

###  Performance gain

* (CURRENT) p3.2xlarge - https://hud.pytorch.org/tts shows each Windows CUDA shard (1-5 + functorch) takes about 2 hours to finish (duration)
* (NEW RUNNER) g5.4xlarge - A very rough estimate of the duration is 1h30m for each shard, meaning a half-hour gain (**25%**)

### Pricing

On demand hourly rate:

* (CURRENT) p3.2xlarge: $3.428. Total = Total hours spent on Windows CUDA tests * 3.428
* (NEW RUNNER) g5.4xlarge: $2.36. Total = Total hours spent on Windows CUDA tests * Duration gain (0.75) * 2.36

So the current runner is not only more expensive but also slower.  Switching to G5 runners for Windows should cut down the cost by (3.428 - 0.75 * 2.36) / 3.428 = **~48%**

### Rolling out

https://github.com/pytorch/test-infra/pull/1376 needs to be reviewed and approved to ensure the capacity of the runner before PR can be merged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91727
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/seemethere
2023-01-13 01:11:59 +00:00