This uses the same approach as the Triton wheel build, where we publish a nightly wheel for vLLM whenever its pinned commit is updated. The key change is to use `pytorch/manylinux2_28-builder` as the base image to build vLLM, which requires a handful of changes to the vLLM Dockerfile and build action used by lumen_cli:
1. `pytorch/manylinux2_28-builder` is RedHat-based instead of Debian-based, so `apt-get` is not available (system packages come from `dnf`/`yum` instead)
2. Fix a bug in `.github/actions/build-external-packages/action.yml` where `CUDA_VERSION` is not set correctly, preventing CUDA 12.9 builds
3. Fix a bug in `.github/actions/build-external-packages/action.yml` where `TORCH_WHEELS_PATH` is not set correctly and always defaults to `dist`
4. In the vLLM Dockerfile, use the correct index for the selected CUDA version, i.e. https://download.pytorch.org/whl/nightly/cu12[89] for CUDA 12.[89]
5. Install torch, torchvision, and torchaudio in one command. Unlike the CI image `pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm`, `pytorch/manylinux2_28-builder` doesn't have any torch dependencies preinstalled
6. Bump the xformers version to 0.0.32.post2 now that PyTorch 2.8.0 has landed in vLLM
We need to prepare three wheels: vLLM, xformers, and flashinfer-python. I rename them following the same convention as PyTorch nightlies, `MAJOR.MINOR.PATCH.devYYYYMMDD`, so that vLLM nightlies work with torch nightlies from the same date.
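For illustration, points 4 and 5 above roughly boil down to the sketch below; the variable names are illustrative and not the actual Dockerfile contents.
```
# Illustrative sketch only, not the actual lumen_cli Dockerfile.
# Derive the nightly index from the selected CUDA version (point 4),
# e.g. 12.9 -> https://download.pytorch.org/whl/nightly/cu129
CUDA_VERSION="12.9"
CU_TAG="cu$(echo "${CUDA_VERSION}" | tr -d '.')"
TORCH_INDEX="https://download.pytorch.org/whl/nightly/${CU_TAG}"

# The manylinux builder ships with no torch packages preinstalled, so
# install torch, torchvision, and torchaudio together (point 5).
pip install --pre torch torchvision torchaudio --index-url "${TORCH_INDEX}"

# The built vLLM, xformers, and flashinfer-python wheels are then
# re-versioned to MAJOR.MINOR.PATCH.devYYYYMMDD, e.g. 1.0.0.dev20250903.
```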
### Usage
* Install latest nightlies
```
pip install --pre torch torchvision torchaudio vllm xformers flashinfer_python \
--index-url https://download.pytorch.org/whl/nightly/cu129
```
* Install a specific version
```
pip install --pre torch==2.9.0.dev20250903 torchvision torchaudio \
vllm==1.0.0.dev20250903 \
xformers==0.0.33.dev20250903 \
flashinfer_python==0.2.14.dev20250903 \
--index-url https://download.pytorch.org/whl/nightly/cu129
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162000
Approved by: https://github.com/atalman
* Updated `.github/actionlint.yaml` to replace `linux.rocm.gpu.mi300.2` with `linux.rocm.gpu.mi300.1` in the supported runner list
* Modified all affected workflows (`inductor-perf-test-nightly-rocm.yml`, `inductor-periodic.yml`, `inductor-rocm-mi300.yml`, and `rocm-mi300.yml`) to run jobs on 1-GPU MI300 runners instead of 2-GPU runners
This should help increase the number of available runners even with the same number of CI nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158882
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Ubuntu 20.04 is being deprecated soon, so we might as well proactively move to the latest LTS, which is 24.04.
> [!NOTE]
> The oldest supported version of Python on 24.04 is Python 3.8. Since we test for Python 3.6 compat in our collect_env test, we need to have that particular job stick with 20.04 for now, until we decide to upgrade it to a newer Python version.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149142
Approved by: https://github.com/atalman, https://github.com/wdvr
* Will enable us to target `periodic`/distributed CI jobs at 4-GPU runners using a separate label, `linux.rocm.gpu.4`
* Use 2-GPU runners for `trunk`, `pull`, and `slow` (in addition to `inductor-rocm`) as well. This currently changes nothing, since all our MI2xx runners carry both the `linux.rocm.gpu` and `linux.rocm.gpu.2` labels, but that will change in the future (see the next point)
* Continue to use the `linux.rocm.gpu` label for any job that doesn't need more than 1 GPU, e.g. the binary test jobs in `workflows/generated-linux-binary-manywheel-nightly.yml`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143769
Approved by: https://github.com/jeffdaily
* Try linux.large.arc for stale workflow
* Run stale workflow on PR changes
* Added the arc runner label to the list of self-hosted runners
* Added concurrency for the linux-job
* Cleanup
* Added workflow_dispatch for testing purposes
Adding Workflows for building aarch64 Linux PyTorch PIP wheels
Updates:
* Created aarch64 template for generated workflows
* Updated generate_ci_workflows.py to include aarch64
* Generated the aarch64 wheel workflow
* Added `_binary-build-aarch64.yml` for building the aarch64 wheel
* Added `_binary-test-aarch64.yml` for a sanity check of the aarch64 wheel
* Updated `binary_linux_test.sh` to use `--extra-index-url` for aarch64 until the needed aarch64 dependencies are available at https://download.pytorch.org/whl/nightly/cpu (see the sketch below)
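A rough sketch of what that sanity-check install can look like; the exact package list and fallback index are defined in `binary_linux_test.sh`, so the URLs below are an assumption for illustration.
```
# Rough sketch of the aarch64 sanity-check install, not the exact
# binary_linux_test.sh invocation. --extra-index-url gives pip a
# fallback index for aarch64 dependencies that are not yet published
# at the PyTorch nightly CPU index.
pip install --pre torch \
  --index-url https://download.pytorch.org/whl/nightly/cpu \
  --extra-index-url https://pypi.org/simple
```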
NOTES:
* The build and test workflows are using arm64v8/alpine and quay.io/pypa/manylinux2014_aarch64:latest docker images at this time.
* The generated Conda workflow is not included at this time and is being worked on.
Workflows were successfully tested at https://github.com/xncqr/pytorch/actions/runs/5351891068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104109
Approved by: https://github.com/malfet, https://github.com/atalman
This is a reopening of the PR https://github.com/pytorch/pytorch/pull/100377
# About this PR
Due to increased pressure on our Windows runners, and the elevated cost of spinning up and tearing down those instances, we want to migrate them from ephemeral to non-ephemeral.
Possible impacts include breakages or misbehavior in CI jobs that put the runners in a bad state. Other possible impacts relate to resource exhaustion, especially disk space, though memory might also be a contender as CI leftovers pile up on those instances.
As a middle-of-the-road approach, nonephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.
Instances definition can be found here: https://github.com/pytorch/test-infra/pull/4072
This is the first step in a multi-step approach where we will migrate away from all ephemeral Windows instances and follow the lead of `windows.g5.4xlarge.nvidia.gpu` in order to help reduce queue times for those instances. The phases are:
* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* terminate the existence of `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)
The reasoning for this phased approach is to reduce the scope of possible culprits to investigate if particular CI jobs misbehave.
# Copilot Summary
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>
This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.
# Copilot Poem
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>
> _We're breaking free from the ephemeral chains_
> _We're running on the nonephemeral lanes_
> _We're building faster, testing stronger, supporting newer_
> _We're the non-ephemeral runners of fire_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman
(cherry picked from commit 7caac545b1d8e5de797c9593981c9578685dba81)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100548
Approved by: https://github.com/jeanschmidt, https://github.com/janeyx99
This is a reopening of the PR [100091](https://github.com/pytorch/pytorch/pull/100091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman
### Changelist
* Change Windows `TORCH_CUDA_ARCH_LIST` from `7.0` to `8.6` to be compatible with the NVIDIA A10G GPU (see the sketch after this list)
* Correctly disable some tests that require flash attention, which is not available on Windows at the moment. This has been fixed by https://github.com/pytorch/pytorch/pull/91979
* The G5 runner has an `AMD EPYC 7R32` CPU, not an Intel one
* This seems to change the behavior of `GetDefaultMobileCPUAllocator` in `cpu_profiling_allocator_test`. This might need to be investigated further (TODO: TRACKING ISSUE). In the meantime, the test has been updated to use `GetDefaultCPUAllocator` correctly instead of `GetDefaultMobileCPUAllocator` for mobile builds
* Also, one periodic test, `test_cpu_gpu_parity_nn_Conv3d_cuda_float32`, fails with a tensor-not-close error when comparing grad tensors between CPU and GPU. This is fixed by turning off TF32 for the test.
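As a minimal sketch of the arch-list change (the actual setting lives in the Windows CI build scripts, so the exact mechanism here is an assumption):
```
# Minimal sketch, assuming the arch list is exported as an environment
# variable before the build; the real change lives in the Windows CI
# build scripts. 8.6 is the compute capability of the A10G on the new
# G5 runners, while the old value, 7.0, targeted the V100 on p3 runners.
export TORCH_CUDA_ARCH_LIST="8.6"
```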
### Performance gain
* (CURRENT) p3.2xlarge - https://hud.pytorch.org/tts shows that each Windows CUDA shard (1-5 + functorch) takes about 2 hours to finish
* (NEW RUNNER) g5.4xlarge - A very rough estimate of the duration is 1h30m per shard, i.e. a half-hour gain (**25%**)
### Pricing
On demand hourly rate:
* (CURRENT) p3.2xlarge: $3.428. Total = Total hours spent on Windows CUDA tests * 3.428
* (NEW RUNNER) g5.4xlarge: $2.36. Total = Total hours spent on Windows CUDA tests * Duration gain (0.75) * 2.36
So the current runner is not only more expensive but also slower. Switching to G5 runners for Windows should cut the cost by (3.428 - 0.75 * 2.36) / 3.428 = **~48%**
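A quick command-line check of that arithmetic (illustrative only, not part of the PR):
```
# Check of the savings estimate above: 1 hour on p3.2xlarge vs.
# 0.75 hours (25% faster) on g5.4xlarge, at on-demand hourly rates.
awk 'BEGIN { old = 3.428; new = 0.75 * 2.36;
             printf "savings: %.1f%%\n", 100 * (old - new) / old }'
# prints: savings: 48.4%
```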
### Rolling out
https://github.com/pytorch/test-infra/pull/1376 needs to be reviewed and approved to ensure sufficient runner capacity before this PR can be merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91727
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/seemethere