This uses the same approach as the Triton wheel build, where we publish a nightly wheel for vLLM whenever its pinned commit is updated. The key change is to use `pytorch/manylinux2_28-builder` as the base image to build vLLM, which requires a handful of changes to the vLLM Dockerfile and build action used by lumen_cli:
1. `pytorch/manylinux2_28-builder` is RedHat-based instead of Debian-based, so `apt-get` is not available (system packages come from `dnf`/`yum` instead)
2. Fix a bug in `.github/actions/build-external-packages/action.yml` where `CUDA_VERSION` is not set correctly, preventing CUDA 12.9 builds
3. Fix a bug in `.github/actions/build-external-packages/action.yml` where `TORCH_WHEELS_PATH` is not set correctly and always defaults to `dist`
4. In the vLLM Dockerfile, use the correct index for the selected CUDA version, i.e. https://download.pytorch.org/whl/nightly/cu12[89] for CUDA 12.[89]
5. Install torch, torchvision, and torchaudio in one command. Unlike the CI image `pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm`, `pytorch/manylinux2_28-builder` doesn't have any torch dependencies preinstalled
6. Bump the xformers version to 0.0.32.post2 now that PyTorch 2.8.0 has landed in vLLM
We need to prepare three wheels: vLLM, xformers, and flashinfer-python. I rename them following the same convention as PyTorch nightlies, `MAJOR.MINOR.PATCH.devYYYYMMDD`, so that vLLM nightlies work with torch nightlies from the same date.
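For illustration, points 4 and 5 above roughly boil down to the sketch below; the variable names are illustrative and not the actual Dockerfile contents.
```
# Illustrative sketch only, not the actual lumen_cli Dockerfile.
# Derive the nightly index from the selected CUDA version (point 4),
# e.g. 12.9 -> https://download.pytorch.org/whl/nightly/cu129
CUDA_VERSION="12.9"
CU_TAG="cu$(echo "${CUDA_VERSION}" | tr -d '.')"
TORCH_INDEX="https://download.pytorch.org/whl/nightly/${CU_TAG}"

# The manylinux builder ships with no torch packages preinstalled, so
# install torch, torchvision, and torchaudio together (point 5).
pip install --pre torch torchvision torchaudio --index-url "${TORCH_INDEX}"

# The built vLLM, xformers, and flashinfer-python wheels are then
# re-versioned to MAJOR.MINOR.PATCH.devYYYYMMDD, e.g. 1.0.0.dev20250903.
```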
### Usage
* Install latest nightlies
```
pip install --pre torch torchvision torchaudio vllm xformers flashinfer_python \
--index-url https://download.pytorch.org/whl/nightly/cu129
```
* Install a specific version
```
pip install --pre torch==2.9.0.dev20250903 torchvision torchaudio \
vllm==1.0.0.dev20250903 \
xformers==0.0.33.dev20250903 \
flashinfer_python==0.2.14.dev20250903 \
--index-url https://download.pytorch.org/whl/nightly/cu129
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162000
Approved by: https://github.com/atalman
* Updated `.github/actionlint.yaml` to replace `linux.rocm.gpu.mi300.2` with `linux.rocm.gpu.mi300.1` in the supported runner list
* Modified all affected workflows (`inductor-perf-test-nightly-rocm.yml`, `inductor-periodic.yml`, `inductor-rocm-mi300.yml`, and `rocm-mi300.yml`) to run jobs on 1-GPU MI300 runners instead of 2-GPU runners
This should help increase the number of available runners even with the same number of CI nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158882
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Ubuntu 20.04 is being deprecated soon, so we might as well proactively move to the latest LTS, which is 24.04.
> [!NOTE]
> The oldest supported version of Python on 24.04 is Python 3.8. Since we test for Python 3.6 compat in our collect_env test, we need to have that particular job stick with 20.04 for now, until we decide to upgrade it to a newer Python version.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149142
Approved by: https://github.com/atalman, https://github.com/wdvr
* Will enable us to target `periodic`/distributed CI jobs at 4-GPU runners using a separate label, `linux.rocm.gpu.4`
* Use 2-GPU runners for `trunk`, `pull`, and `slow` (in addition to `inductor-rocm`) as well. This currently changes nothing, since all our MI2xx runners carry both the `linux.rocm.gpu` and `linux.rocm.gpu.2` labels, but that will change in the future (see the next point)
* Continue to use the `linux.rocm.gpu` label for any job that doesn't need more than 1 GPU, e.g. the binary test jobs in `workflows/generated-linux-binary-manywheel-nightly.yml`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143769
Approved by: https://github.com/jeffdaily
* Try linux.large.arc for stale workflow
* Run stale workflow on PR changes
* Added the arc runner label to the list of self-hosted runners
* Added concurrency for the linux-job
* Cleanup
* Added workflow_dispatch for testing purposes
Adding Workflows for building aarch64 Linux PyTorch PIP wheels
Updates:
* Created aarch64 template for generated workflows
* Updated generate_ci_workflows.py to include aarch64
* Generated the aarch64 wheel workflow
* Added `_binary-build-aarch64.yml` for building the aarch64 wheel
* Added `_binary-test-aarch64.yml` for a sanity check of the aarch64 wheel
* Updated `binary_linux_test.sh` to use `--extra-index-url` for aarch64 until the needed aarch64 dependencies are available at https://download.pytorch.org/whl/nightly/cpu (see the sketch below)
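A rough sketch of what that sanity-check install can look like; the exact package list and fallback index are defined in `binary_linux_test.sh`, so the URLs below are an assumption for illustration.
```
# Rough sketch of the aarch64 sanity-check install, not the exact
# binary_linux_test.sh invocation. --extra-index-url gives pip a
# fallback index for aarch64 dependencies that are not yet published
# at the PyTorch nightly CPU index.
pip install --pre torch \
  --index-url https://download.pytorch.org/whl/nightly/cpu \
  --extra-index-url https://pypi.org/simple
```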
NOTES:
* The build and test workflows are using arm64v8/alpine and quay.io/pypa/manylinux2014_aarch64:latest docker images at this time.
* The generated Conda workflow is not included at this time and is being worked on.
Workflows were successfully tested at https://github.com/xncqr/pytorch/actions/runs/5351891068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104109
Approved by: https://github.com/malfet, https://github.com/atalman
This is a reopening of the PR https://github.com/pytorch/pytorch/pull/100377
# About this PR
Due to increased pressure on our Windows runners, and the elevated cost of spinning up and tearing down those instances, we want to migrate them from ephemeral to non-ephemeral.
Possible impacts include breakages or misbehavior in CI jobs that put the runners in a bad state. Other possible impacts relate to resource exhaustion, especially disk space, though memory might also be a contender as CI leftovers pile up on those instances.
As a middle-of-the-road approach, nonephemeral instances are currently rotated stochastically: older instances get higher priority for termination when demand is lower.
Instances definition can be found here: https://github.com/pytorch/test-infra/pull/4072
This is the first step in a multi-step approach where we will migrate away from all ephemeral Windows instances and follow the lead of `windows.g5.4xlarge.nvidia.gpu` in order to help reduce queue times for those instances. The phases are:
* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* terminate the existence of `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)
The reasoning for this phased approach is to reduce the scope of possible culprits to investigate if particular CI jobs misbehave.
# Copilot Summary
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>
This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.
# Copilot Poem
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>
> _We're breaking free from the ephemeral chains_
> _We're running on the nonephemeral lanes_
> _We're building faster, testing stronger, supporting newer_
> _We're the non-ephemeral runners of fire_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman
(cherry picked from commit 7caac545b1d8e5de797c9593981c9578685dba81)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100548
Approved by: https://github.com/jeanschmidt, https://github.com/janeyx99
This is a reopening of the PR [100091](https://github.com/pytorch/pytorch/pull/100091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman
### Changelist
* Change Windows `TORCH_CUDA_ARCH_LIST` from `7.0` to `8.6` to be compatible with the NVIDIA A10G GPU (see the sketch after this list)
* Correctly disable some tests that require flash attention, which is not available on Windows at the moment. This has been fixed by https://github.com/pytorch/pytorch/pull/91979
* The G5 runner has an `AMD EPYC 7R32` CPU, not an Intel one
* This seems to change the behavior of `GetDefaultMobileCPUAllocator` in `cpu_profiling_allocator_test`. This might need to be investigated further (TODO: TRACKING ISSUE). In the meantime, the test has been updated to use `GetDefaultCPUAllocator` correctly instead of `GetDefaultMobileCPUAllocator` for mobile builds
* Also, one periodic test, `test_cpu_gpu_parity_nn_Conv3d_cuda_float32`, fails with a tensor-not-close error when comparing grad tensors between CPU and GPU. This is fixed by turning off TF32 for the test.
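As a minimal sketch of the arch-list change (the actual setting lives in the Windows CI build scripts, so the exact mechanism here is an assumption):
```
# Minimal sketch, assuming the arch list is exported as an environment
# variable before the build; the real change lives in the Windows CI
# build scripts. 8.6 is the compute capability of the A10G on the new
# G5 runners, while the old value, 7.0, targeted the V100 on p3 runners.
export TORCH_CUDA_ARCH_LIST="8.6"
```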
### Performance gain
* (CURRENT) p3.2xlarge - https://hud.pytorch.org/tts shows that each Windows CUDA shard (1-5 + functorch) takes about 2 hours to finish
* (NEW RUNNER) g5.4xlarge - A very rough estimate of the duration is 1h30m per shard, i.e. a half-hour gain (**25%**)
### Pricing
On demand hourly rate:
* (CURRENT) p3.2xlarge: $3.428. Total = Total hours spent on Windows CUDA tests * 3.428
* (NEW RUNNER) g5.4xlarge: $2.36. Total = Total hours spent on Windows CUDA tests * Duration gain (0.75) * 2.36
So the current runner is not only more expensive but also slower. Switching to G5 runners for Windows should cut the cost by (3.428 - 0.75 * 2.36) / 3.428 = **~48%**
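A quick command-line check of that arithmetic (illustrative only, not part of the PR):
```
# Check of the savings estimate above: 1 hour on p3.2xlarge vs.
# 0.75 hours (25% faster) on g5.4xlarge, at on-demand hourly rates.
awk 'BEGIN { old = 3.428; new = 0.75 * 2.36;
             printf "savings: %.1f%%\n", 100 * (old - new) / old }'
# prints: savings: 48.4%
```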
### Rolling out
https://github.com/pytorch/test-infra/pull/1376 needs to be reviewed and approved to ensure sufficient runner capacity before this PR can be merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91727
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/seemethere