lint

test commit
2025-10-26 00:24:53 +08:00 · 2025-07-16 10:17:19 -07:00 · 2025-07-16 10:16:01 -07:00 · 2025-07-16 09:54:53 -07:00 · 2025-07-16 11:46:25 -05:00 · 2025-07-16 11:45:12 -05:00
1196 changed files with 63058 additions and 36388 deletions
--- a/.bazelrc
+++ b/.bazelrc
@@ -2,7 +2,7 @@ build --cxxopt=--std=c++17
 build --copt=-I.
 # Bazel does not support including its cc_library targets as system
 # headers. We work around this for generated code
-# (e.g. torch/headeronly/macros/cmake_macros.h) by making the generated directory a
+# (e.g. c10/macros/cmake_macros.h) by making the generated directory a
 # system include path.
 build --copt=-isystem --copt bazel-out/k8-fastbuild/bin
 build --copt=-isystem --copt bazel-out/darwin-fastbuild/bin
--- a/.ci/docker/README.md
+++ b/.ci/docker/README.md
@@ -36,105 +36,3 @@ See `build.sh` for valid build environments (it's the giant switch).
 # Set flags (see build.sh) and build image
 sudo bash -c 'TRITON=1 ./build.sh pytorch-linux-bionic-py3.8-gcc9 -t myimage:latest
 ```
-
-## [Guidance] Adding a New Base Docker Image
-
-### Background
-
-The base Docker images in directory `.ci/docker/` are built by the `docker-builds.yml` workflow. Those images are used throughout the PyTorch CI/CD pipeline. You should only create or modify a base Docker image if you need specific environment changes or dependencies before building PyTorch on CI.
-
-1. **Automatic Rebuilding**:
-   - The Docker image building process is triggered automatically when changes are made to files in the `.ci/docker/*` directory
-   - This ensures all images stay up-to-date with the latest dependencies and configurations
-
-2. **Image Reuse in PyTorch Build Workflows** (example: linux-build):
-   - The images generated by `docker-builds.yml` are reused in `_linux-build.yml` through the `calculate-docker-image` step
-   - The `_linux-build.yml` workflow:
-     - Pulls the Docker image determined by the `calculate-docker-image` step
-     - Runs a Docker container with that image
-     - Executes `.ci/pytorch/build.sh` inside the container to build PyTorch
-
-3. **Usage in Test Workflows** (example: linux-test):
-   - The same Docker images are also used in `_linux-test.yml` for running tests
-   - The `_linux-test.yml` workflow follows a similar pattern:
-     - It uses the `calculate-docker-image` step to determine which Docker image to use
-     - It pulls the Docker image and runs a container with that image
-     - It installs the wheels from the artifacts generated by PyTorch build jobs
-     - It executes test scripts (like `.ci/pytorch/test.sh` or `.ci/pytorch/multigpu-test.sh`) inside the container
-
-### Understanding File Purposes
-
-#### `.ci/docker/build.sh` vs `.ci/pytorch/build.sh`
- **`.ci/docker/build.sh`**:
-  - Used for building base Docker images
-  - Executed by the `docker-builds.yml` workflow to pre-build Docker images for CI
-  - Contains configurations for different Docker build environments
-
- **`.ci/pytorch/build.sh`**:
-  - Used for building PyTorch inside a Docker container
-  - Called by workflows like `_linux-build.yml` after the Docker container is started
-  - Builds PyTorch wheels and other artifacts
-
-#### `.ci/docker/ci_commit_pins/` vs `.github/ci_commit_pins`
- **`.ci/docker/ci_commit_pins/`**:
-  - Used for pinning dependency versions during base Docker image building
-  - Ensures consistent environments for building PyTorch
-  - Changes here trigger base Docker image rebuilds
-
- **`.github/ci_commit_pins`**:
-  - Used for pinning dependency versions during PyTorch building and tests
-  - Ensures consistent dependencies for PyTorch across different builds
-  - Used by build scripts running inside Docker containers
-
-### Step-by-Step Guide for Adding a New Base Docker Image
-
-#### 1. Add Pinned Commits (If Applicable)
-
-We use pinned commits for build stability. The `nightly.yml` workflow checks and updates pinned commits for certain repository dependencies daily.
-
-If your new Docker image needs a library installed from a specific pinned commit or built from source:
-
-1. Add the repository you want to track in `nightly.yml` and `merge-rules.yml`
-2. Add the initial pinned commit in `.ci/docker/ci_commit_pins/`. The text filename should match the one defined in step 1
-
-#### 2. Configure the Base Docker Image
-1. **Add new Base Docker image configuration** (if applicable):
-
-   Add the configuration in `.ci/docker/build.sh`. For example:
-   ```bash
-   pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-new1)
-     CUDA_VERSION=12.8.1
-     CUDNN_VERSION=9
-     ANACONDA_PYTHON_VERSION=3.12
-     GCC_VERSION=11
-     VISION=yes
-     KATEX=yes
-     UCX_COMMIT=${_UCX_COMMIT}
-     UCC_COMMIT=${_UCC_COMMIT}
-     TRITON=yes
-     NEW_ARG_1=yes
-     ;;
-   ```
-
-2. **Add build arguments to Docker build command**:
-
-   If you're introducing a new argument to the Docker build, make sure to add it in the Docker build step in `.ci/docker/build.sh`:
-   ```bash
-   docker build \
-      ....
-      --build-arg "NEW_ARG_1=${NEW_ARG_1}"
-   ```
-
-3. **Update Dockerfile logic**:
-
-   Update the Dockerfile to use the new argument. For example, in `ubuntu/Dockerfile`:
-   ```dockerfile
-   ARG NEW_ARG_1
-   # Set up environment for NEW_ARG_1
-   RUN if [ -n "${NEW_ARG_1}" ]; then bash ./do_something.sh; fi
-   ```
-
-4. **Add the Docker configuration** in `.github/workflows/docker-builds.yml`:
-
-   The `docker-builds.yml` workflow pre-builds the Docker images whenever changes occur in the `.ci/docker/` directory. This includes the
-   pinned commit updates.
--- a/.ci/docker/build.sh
+++ b/.ci/docker/build.sh
@@ -160,17 +160,6 @@ case "$tag" in
    UCC_COMMIT=${_UCC_COMMIT}
    TRITON=yes
    ;;
-  pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm)
-    CUDA_VERSION=12.8.1
-    CUDNN_VERSION=9
-    ANACONDA_PYTHON_VERSION=3.12
-    GCC_VERSION=11
-    VISION=yes
-    KATEX=yes
-    UCX_COMMIT=${_UCX_COMMIT}
-    UCC_COMMIT=${_UCC_COMMIT}
-    TRITON=yes
-    ;;
  pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks)
    CUDA_VERSION=12.6
    CUDNN_VERSION=9
@@ -242,6 +231,18 @@ case "$tag" in
    VISION=yes
    TRITON=yes
    ;;
+  pytorch-linux-jammy-rocm-n-1-py3)
+    ANACONDA_PYTHON_VERSION=3.10
+    GCC_VERSION=11
+    VISION=yes
+    ROCM_VERSION=6.3
+    NINJA_VERSION=1.9.0
+    TRITON=yes
+    KATEX=yes
+    UCX_COMMIT=${_UCX_COMMIT}
+    UCC_COMMIT=${_UCC_COMMIT}
+    INDUCTOR_BENCHMARKS=yes
+    ;;
  pytorch-linux-jammy-rocm-n-py3 | pytorch-linux-noble-rocm-n-py3)
    if [[ $tag =~ "jammy" ]]; then
      ANACONDA_PYTHON_VERSION=3.10
@@ -258,19 +259,6 @@ case "$tag" in
    UCC_COMMIT=${_UCC_COMMIT}
    INDUCTOR_BENCHMARKS=yes
    ;;
-  pytorch-linux-noble-rocm-alpha-py3)
-    ANACONDA_PYTHON_VERSION=3.12
-    GCC_VERSION=11
-    VISION=yes
-    ROCM_VERSION=7.0
-    NINJA_VERSION=1.9.0
-    TRITON=yes
-    KATEX=yes
-    UCX_COMMIT=${_UCX_COMMIT}
-    UCC_COMMIT=${_UCC_COMMIT}
-    INDUCTOR_BENCHMARKS=yes
-    PYTORCH_ROCM_ARCH="gfx90a;gfx942;gfx950"
-    ;;
  pytorch-linux-jammy-xpu-2025.0-py3)
    ANACONDA_PYTHON_VERSION=3.9
    GCC_VERSION=11
@@ -287,7 +275,7 @@ case "$tag" in
    NINJA_VERSION=1.9.0
    TRITON=yes
    ;;
-  pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
+    pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
    ANACONDA_PYTHON_VERSION=3.9
    GCC_VERSION=11
    VISION=yes
--- a/.ci/docker/ci_commit_pins/triton.txt
+++ b/.ci/docker/ci_commit_pins/triton.txt
@@ -1 +1 @@
-11ec6354315768a85da41032535e3b7b99c5f706
+ae848267bebc65c6181e8cc5e64a6357d2679260
--- a/.ci/docker/common/install_conda.sh
+++ b/.ci/docker/common/install_conda.sh
@@ -4,8 +4,12 @@ set -ex

 # Optionally install conda
 if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
-  BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"  # @lint-ignore
-  CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
+  BASE_URL="https://repo.anaconda.com/miniconda"
+  CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
+  if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]] || [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
+    BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"  # @lint-ignore
+    CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
+  fi

  MAJOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 1)
  MINOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 2)
@@ -17,6 +21,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
      exit 1
      ;;
  esac
+
  mkdir -p /opt/conda
  chown jenkins:jenkins /opt/conda

--- a/.ci/docker/common/install_rocm.sh
+++ b/.ci/docker/common/install_rocm.sh
@@ -30,25 +30,16 @@ EOF

    # we want the patch version of 6.4 instead
    if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
-        ROCM_VERSION="${ROCM_VERSION}.2"
-    fi
-
-    # Default url values
-    rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}"
-    amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu"
-
-    # Special case for ROCM_VERSION == 7.0
-    if [[ $(ver "$ROCM_VERSION") -eq $(ver 7.0) ]]; then
-        rocm_baseurl="https://repo.radeon.com/rocm/apt/7.0_alpha2"
-        amdgpu_baseurl="https://repo.radeon.com/amdgpu/30.10_alpha2/ubuntu"
+        ROCM_VERSION="${ROCM_VERSION}.1"
    fi

    # Add amdgpu repository
    UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
-    echo "deb [arch=amd64] ${amdgpu_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
+    echo "deb [arch=amd64] https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list

    # Add rocm repository
    wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add -
+    local rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}"
    echo "deb [arch=amd64] ${rocm_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/rocm.list
    apt-get update --allow-insecure-repositories

@@ -82,33 +73,30 @@ EOF
    done

    # ROCm 6.3 had a regression where initializing static code objects had significant overhead
-    # CI no longer builds for ROCm 6.3, but
    # ROCm 6.4 did not yet fix the regression, also HIP branch names are different
-    if [[ $(ver $ROCM_VERSION) -ge $(ver 6.4) ]] && [[ $(ver $ROCM_VERSION) -lt $(ver 7.0) ]]; then
-        if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.2) ]]; then
-            HIP_TAG=rocm-6.4.2
-            CLR_HASH=74d78ba3ac4bac235d02bcb48511c30b5cfdd457  # branch release/rocm-rel-6.4.2-statco-hotfix
-        elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.1) ]]; then
-            HIP_TAG=rocm-6.4.1
-            CLR_HASH=efe6c35790b9206923bfeed1209902feff37f386  # branch release/rocm-rel-6.4.1-statco-hotfix
+    if [[ $(ver $ROCM_VERSION) -ge $(ver 6.3) ]] && [[ $(ver $ROCM_VERSION) -lt $(ver 7.0) ]]; then
+        if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.1) ]]; then
+            HIP_BRANCH=release/rocm-rel-6.4
+            VER_STR=6.4
+            VER_PATCH=.1
        elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
-            HIP_TAG=rocm-6.4.0
-            CLR_HASH=600f5b0d2baed94d5121e2174a9de0851b040b0c  # branch release/rocm-rel-6.4-statco-hotfix
+            HIP_BRANCH=release/rocm-rel-6.4
+            VER_STR=6.4
+        elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
+            HIP_BRANCH=rocm-6.3.x
+            VER_STR=6.3
        fi
        # clr build needs CppHeaderParser but can only find it using conda's python
        python -m pip install CppHeaderParser
-        git clone https://github.com/ROCm/HIP -b $HIP_TAG
+        git clone https://github.com/ROCm/HIP -b $HIP_BRANCH
        HIP_COMMON_DIR=$(readlink -f HIP)
-        git clone https://github.com/jeffdaily/clr
-        pushd clr
-        git checkout $CLR_HASH
-        popd
+        git clone https://github.com/jeffdaily/clr -b release/rocm-rel-${VER_STR}${VER_PATCH}-statco-hotfix
        mkdir -p clr/build
        pushd clr/build
        # Need to point CMake to the correct python installation to find CppHeaderParser
        cmake .. -DPython3_EXECUTABLE=/opt/conda/envs/py_${ANACONDA_PYTHON_VERSION}/bin/python3 -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=$HIP_COMMON_DIR
        make -j
-        cp hipamd/lib/libamdhip64.so.6.4.* /opt/rocm/lib/libamdhip64.so.6.4.*
+        cp hipamd/lib/libamdhip64.so.${VER_STR}.* /opt/rocm/lib/libamdhip64.so.${VER_STR}.*
        popd
        rm -rf HIP clr
    fi
--- a/.ci/docker/libtorch/build.sh
+++ b/.ci/docker/libtorch/build.sh
@@ -41,7 +41,7 @@ case ${DOCKER_TAG_PREFIX} in
    rocm*)
        # we want the patch version of 6.4 instead
        if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then
-            GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.2"
+            GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.1"
        fi
        BASE_TARGET=rocm
        GPU_IMAGE=rocm/dev-ubuntu-22.04:${GPU_ARCH_VERSION}-complete
--- a/.ci/docker/manywheel/build.sh
+++ b/.ci/docker/manywheel/build.sh
@@ -77,7 +77,7 @@ case ${image} in
    manylinux2_28-builder:rocm*)
        # we want the patch version of 6.4 instead
        if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then
-            GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.2"
+            GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.1"
        fi
        TARGET=rocm_final
        MANY_LINUX_VERSION="2_28"
--- a/.ci/docker/requirements-ci.txt
+++ b/.ci/docker/requirements-ci.txt
@@ -50,7 +50,7 @@ flatbuffers==24.12.23
 hypothesis==5.35.1
 # Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
 #Description: advanced library for generating parametrized tests
-#Pinned versions: 5.35.1
+#Pinned versions: 3.44.6, 4.53.2
 #test that import: test_xnnpack_integration.py, test_pruning_op.py, test_nn.py

 junitparser==2.1.1
@@ -221,9 +221,9 @@ pygments==2.15.0
 #Pinned versions: 2.12.0
 #test that import: the doctests

-#pyyaml
+#PyYAML
 #Description: data serialization format
-#Pinned versions: 6.0.2
+#Pinned versions:
 #test that import:

 #requests
@@ -233,7 +233,7 @@ pygments==2.15.0

 #rich
 #Description: rich text and beautiful formatting in the terminal
-#Pinned versions: 14.1.0
+#Pinned versions: 10.9.0
 #test that import:

 scikit-image==0.19.3 ; python_version < "3.10"
@@ -307,7 +307,7 @@ pytest-cpp==2.3.0
 #Pinned versions: 2.3.0
 #test that import:

-z3-solver==4.15.1.0
+z3-solver==4.12.6.0
 #Description: The Z3 Theorem Prover Project
 #Pinned versions:
 #test that import:
@@ -389,9 +389,3 @@ tlparse==0.3.30
 cuda-bindings>=12.0,<13.0 ; platform_machine != "s390x"
 #Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
 #test that import: test_cuda.py
-
-setuptools-git-versioning==2.1.0
-scikit-build==0.18.1
-pyre-extensions==0.0.32
-tabulate==0.9.0
-#Description: These package are needed to build FBGEMM and torchrec on PyTorch CI
--- a/.ci/docker/requirements-docs.txt
+++ b/.ci/docker/requirements-docs.txt
@@ -4,7 +4,7 @@ sphinx==5.3.0
 -e git+https://github.com/pytorch/pytorch_sphinx_theme.git@pytorch_sphinx_theme2#egg=pytorch_sphinx_theme2

 # TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
-# but it doesn't seem to work and hangs around idly. The initial thought that it is probably
+# but it doesn't seem to work and hangs around idly. The initial thought is probably
 # something related to Docker setup. We can investigate this later.

 sphinxcontrib.katex==0.8.6
@@ -59,4 +59,3 @@ sphinx-copybutton==0.5.0
 sphinx-design==0.4.0
 sphinxcontrib-mermaid==1.0.0
 myst-parser==0.18.1
-myst-nb
--- a/.ci/manywheel/build_common.sh
+++ b/.ci/manywheel/build_common.sh
@@ -97,7 +97,8 @@ if [[ -z "$PYTORCH_ROOT" ]]; then
    exit 1
 fi
 pushd "$PYTORCH_ROOT"
-retry pip install -qUr requirements-build.txt
+retry pip install -q "setuptools>=70.1.0" packaging
+retry pip install -qU cmake ninja
 python setup.py clean
 retry pip install -qr requirements.txt
 case ${DESIRED_PYTHON} in
--- a/.ci/manywheel/build_libtorch.sh
+++ b/.ci/manywheel/build_libtorch.sh
@@ -92,7 +92,8 @@ if [[ -z "$PYTORCH_ROOT" ]]; then
    exit 1
 fi
 pushd "$PYTORCH_ROOT"
-retry pip install -qUr requirements-build.txt
+retry pip install -q "setuptools>=70.1.0" packaging
+retry pip install -qU cmake ninja
 python setup.py clean
 retry pip install -qr requirements.txt
 retry pip install -q numpy==2.0.1
--- a/.ci/pytorch/build-mobile.sh
+++ b/.ci/pytorch/build-mobile.sh
@@ -0,0 +1,34 @@
+#!/usr/bin/env bash
+# DO NOT ADD 'set -x' not to reveal CircleCI secret context environment variables
+set -eu -o pipefail
+
+# This script uses linux host toolchain + mobile build options in order to
+# build & test mobile libtorch without having to setup Android/iOS
+# toolchain/simulator.
+
+# shellcheck source=./common.sh
+source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
+# shellcheck source=./common-build.sh
+source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
+
+# Install torch & torchvision - used to download & trace test model.
+# Ideally we should use the libtorch built on the PR so that backward
+# incompatible changes won't break this script - but it will significantly slow
+# down mobile CI jobs.
+# Here we install nightly instead of stable so that we have an option to
+# temporarily skip mobile CI jobs on BC-breaking PRs until they are in nightly.
+retry pip install --pre torch torchvision \
+  -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html \
+  --progress-bar off
+
+# Run end-to-end process of building mobile library, linking into the predictor
+# binary, and running forward pass with a real model.
+if [[ "$BUILD_ENVIRONMENT" == *-mobile-custom-build-static* ]]; then
+  TEST_CUSTOM_BUILD_STATIC=1 test/mobile/custom_build/build.sh
+elif [[ "$BUILD_ENVIRONMENT" == *-mobile-lightweight-dispatch* ]]; then
+  test/mobile/lightweight_dispatch/build.sh
+else
+  TEST_DEFAULT_BUILD=1 test/mobile/custom_build/build.sh
+fi
+
+print_sccache_stats
--- a/.ci/pytorch/build.sh
+++ b/.ci/pytorch/build.sh
@@ -11,6 +11,10 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
 # shellcheck source=./common-build.sh
 source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"

+if [[ "$BUILD_ENVIRONMENT" == *-mobile-*build* ]]; then
+  exec "$(dirname "${BASH_SOURCE[0]}")/build-mobile.sh" "$@"
+fi
+
 echo "Python version:"
 python --version

@@ -50,6 +54,9 @@ if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
  export ATEN_THREADING=NATIVE
 fi

+# Enable LLVM dependency for TensorExpr testing
+export USE_LLVM=/opt/llvm
+export LLVM_DIR=/opt/llvm/lib/cmake/llvm

 if ! which conda; then
  # In ROCm CIs, we are doing cross compilation on build machines with
@@ -117,8 +124,26 @@ if [[ "$BUILD_ENVIRONMENT" == *libtorch* ]]; then
 fi

 # Use special scripts for Android builds
+if [[ "${BUILD_ENVIRONMENT}" == *-android* ]]; then
+  export ANDROID_NDK=/opt/ndk
+  build_args=()
+  if [[ "${BUILD_ENVIRONMENT}" == *-arm-v7a* ]]; then
+    build_args+=("-DANDROID_ABI=armeabi-v7a")
+  elif [[ "${BUILD_ENVIRONMENT}" == *-arm-v8a* ]]; then
+    build_args+=("-DANDROID_ABI=arm64-v8a")
+  elif [[ "${BUILD_ENVIRONMENT}" == *-x86_32* ]]; then
+    build_args+=("-DANDROID_ABI=x86")
+  elif [[ "${BUILD_ENVIRONMENT}" == *-x86_64* ]]; then
+    build_args+=("-DANDROID_ABI=x86_64")
+  fi
+  if [[ "${BUILD_ENVIRONMENT}" == *vulkan* ]]; then
+    build_args+=("-DUSE_VULKAN=ON")
+  fi
+  build_args+=("-DUSE_LITE_INTERPRETER_PROFILER=OFF")
+  exec ./scripts/build_android.sh "${build_args[@]}" "$@"
+fi

-if [[ "$BUILD_ENVIRONMENT" == *vulkan* ]]; then
+if [[ "$BUILD_ENVIRONMENT" != *android* && "$BUILD_ENVIRONMENT" == *vulkan* ]]; then
  export USE_VULKAN=1
  # shellcheck disable=SC1091
  source /var/lib/jenkins/vulkansdk/setup-env.sh
@@ -189,6 +214,7 @@ if [[ "$BUILD_ENVIRONMENT" == *-clang*-asan* ]]; then
  export USE_ASAN=1
  export REL_WITH_DEB_INFO=1
  export UBSAN_FLAGS="-fno-sanitize-recover=all"
+  unset USE_LLVM
 fi

 if [[ "${BUILD_ENVIRONMENT}" == *no-ops* ]]; then
@@ -199,7 +225,7 @@ if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then
    export USE_PRECOMPILED_HEADERS=1
 fi

-if [[ "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
+if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
  export BUILD_STATIC_RUNTIME_BENCHMARK=ON
 fi

@@ -280,22 +306,6 @@ else
    fi
    pip_install_whl "$(echo dist/*.whl)"

-    if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *vision* ]]; then
-      install_torchvision
-    fi
-
-    if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *audio* ]]; then
-      install_torchaudio
-    fi
-
-    if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *torchrec* || "${BUILD_ADDITIONAL_PACKAGES:-}" == *fbgemm* ]]; then
-      install_torchrec_and_fbgemm
-    fi
-
-    if [[ "${BUILD_ADDITIONAL_PACKAGES:-}" == *torchao* ]]; then
-      install_torchao
-    fi
-
    if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
      echo "Checking that xpu is compiled"
      pushd dist/
--- a/.ci/pytorch/common_utils.sh
+++ b/.ci/pytorch/common_utils.sh
@@ -78,34 +78,6 @@ function pip_install_whl() {
  fi
 }

-function pip_build_and_install() {
-  local build_target=$1
-  local wheel_dir=$2
-
-  local found_whl=0
-  for file in "${wheel_dir}"/*.whl
-  do
-    if [[ -f "${file}" ]]; then
-      found_whl=1
-      break
-    fi
-  done
-
-  # Build the wheel if it doesn't exist
-  if [ "${found_whl}" == "0" ]; then
-    python3 -m pip wheel \
-      --no-build-isolation \
-      --no-deps \
-      --no-use-pep517 \
-      -w "${wheel_dir}" \
-      "${build_target}"
-  fi
-
-  for file in "${wheel_dir}"/*.whl
-  do
-    pip_install_whl "${file}"
-  done
-}

 function pip_install() {
  # retry 3 times
@@ -152,7 +124,14 @@ function get_pinned_commit() {
 function install_torchaudio() {
  local commit
  commit=$(get_pinned_commit audio)
-  pip_build_and_install "git+https://github.com/pytorch/audio.git@${commit}" dist/audio
+  if [[ "$1" == "cuda" ]]; then
+    # TODO: This is better to be passed as a parameter from _linux-test workflow
+    # so that it can be consistent with what is set in build
+    TORCH_CUDA_ARCH_LIST="8.0;8.6" pip_install --no-use-pep517 "git+https://github.com/pytorch/audio.git@${commit}"
+  else
+    pip_install --no-use-pep517 "git+https://github.com/pytorch/audio.git@${commit}"
+  fi
+
 }

 function install_torchtext() {
@@ -160,8 +139,8 @@ function install_torchtext() {
  local text_commit
  data_commit=$(get_pinned_commit data)
  text_commit=$(get_pinned_commit text)
-  pip_build_and_install "git+https://github.com/pytorch/data.git@${data_commit}" dist/data
-  pip_build_and_install "git+https://github.com/pytorch/text.git@${text_commit}" dist/text
+  pip_install --no-use-pep517 "git+https://github.com/pytorch/data.git@${data_commit}"
+  pip_install --no-use-pep517 "git+https://github.com/pytorch/text.git@${text_commit}"
 }

 function install_torchvision() {
@@ -174,14 +153,7 @@ function install_torchvision() {
    echo 'char* dlerror(void) { return "";}'|gcc -fpic -shared -o "${HOME}/dlerror.so" -x c -
    LD_PRELOAD=${orig_preload}:${HOME}/dlerror.so
  fi
-
-  if [[ "${BUILD_ENVIRONMENT}" == *cuda* ]]; then
-    # Not sure if both are needed, but why not
-    export FORCE_CUDA=1
-    export WITH_CUDA=1
-  fi
-  pip_build_and_install "git+https://github.com/pytorch/vision.git@${commit}" dist/vision
-
+  pip_install --no-use-pep517 "git+https://github.com/pytorch/vision.git@${commit}"
  if [ -n "${LD_PRELOAD}" ]; then
    LD_PRELOAD=${orig_preload}
  fi
@@ -201,73 +173,25 @@ function install_torchrec_and_fbgemm() {

  if [[ "$BUILD_ENVIRONMENT" == *rocm* ]] ; then
    # install torchrec first because it installs fbgemm nightly on top of rocm fbgemm
-    pip_build_and_install "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}" dist/torchrec
+    pip_install --no-use-pep517 "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
    pip_uninstall fbgemm-gpu-nightly

-    # Set ROCM_HOME isn't available, use ROCM_PATH if set or /opt/rocm
-    ROCM_HOME="${ROCM_HOME:-${ROCM_PATH:-/opt/rocm}}"
-
-    # Find rocm_version.h header file for ROCm version extract
-    rocm_version_h="${ROCM_HOME}/include/rocm-core/rocm_version.h"
-    if [ ! -f "$rocm_version_h" ]; then
-        rocm_version_h="${ROCM_HOME}/include/rocm_version.h"
-    fi
-
-    # Error out if rocm_version.h not found
-    if [ ! -f "$rocm_version_h" ]; then
-        echo "Error: rocm_version.h not found in expected locations." >&2
-        exit 1
-    fi
-
-    # Extract major, minor and patch ROCm version numbers
-    MAJOR_VERSION=$(grep 'ROCM_VERSION_MAJOR' "$rocm_version_h" | awk '{print $3}')
-    MINOR_VERSION=$(grep 'ROCM_VERSION_MINOR' "$rocm_version_h" | awk '{print $3}')
-    PATCH_VERSION=$(grep 'ROCM_VERSION_PATCH' "$rocm_version_h" | awk '{print $3}')
-    ROCM_INT=$((MAJOR_VERSION * 10000 + MINOR_VERSION * 100 + PATCH_VERSION))
-    echo "ROCm version: $ROCM_INT"
-    export BUILD_ROCM_VERSION="$MAJOR_VERSION.$MINOR_VERSION"
-
    pip_install tabulate  # needed for newer fbgemm
    pip_install patchelf  # needed for rocm fbgemm
-    pushd /tmp
-
-    local wheel_dir=dist/fbgemm_gpu
-    local found_whl=0
-    for file in "${wheel_dir}"/*.whl
-    do
-      if [[ -f "${file}" ]]; then
-        found_whl=1
-        break
-      fi
-    done
-
-    # Build the wheel if it doesn't exist
-    if [ "${found_whl}" == "0" ]; then
-      git clone --recursive https://github.com/pytorch/fbgemm
-      pushd fbgemm/fbgemm_gpu
-      git checkout "${fbgemm_commit}"
-      python setup.py bdist_wheel \
-        --build-variant=rocm \
-        -DHIP_ROOT_DIR="${ROCM_PATH}" \
-        -DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
-        -DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
-      popd
-
-      # Save the wheel before cleaning up
-      mkdir -p dist/fbgemm_gpu
-      cp fbgemm/fbgemm_gpu/dist/*.whl dist/fbgemm_gpu
-    fi
-
-    for file in "${wheel_dir}"/*.whl
-    do
-      pip_install_whl "${file}"
-    done
-
-    rm -rf fbgemm
+    git clone --recursive https://github.com/pytorch/fbgemm
+    pushd fbgemm/fbgemm_gpu
+    git checkout "${fbgemm_commit}"
+    python setup.py install \
+      --package_variant=rocm \
+      -DHIP_ROOT_DIR="${ROCM_PATH}" \
+      -DCMAKE_C_FLAGS="-DTORCH_USE_HIP_DSA" \
+      -DCMAKE_CXX_FLAGS="-DTORCH_USE_HIP_DSA"
    popd
+    rm -rf fbgemm
  else
-    pip_build_and_install "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}" dist/torchrec
-    pip_build_and_install "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#subdirectory=fbgemm_gpu" dist/fbgemm_gpu
+    # See https://github.com/pytorch/pytorch/issues/106971
+    CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
+    pip_install --no-use-pep517 "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
  fi
 }

@@ -310,7 +234,7 @@ function checkout_install_torchbench() {
 function install_torchao() {
  local commit
  commit=$(get_pinned_commit torchao)
-  pip_build_and_install "git+https://github.com/pytorch/ao.git@${commit}" dist/ao
+  pip_install --no-use-pep517 "git+https://github.com/pytorch/ao.git@${commit}"
 }

 function print_sccache_stats() {
--- a/.ci/pytorch/create_test_cert.py
+++ b/.ci/pytorch/create_test_cert.py
@@ -0,0 +1,123 @@
+from datetime import datetime, timedelta, timezone
+from tempfile import mkdtemp
+
+from cryptography import x509
+from cryptography.hazmat.primitives import hashes, serialization
+from cryptography.hazmat.primitives.asymmetric import rsa
+from cryptography.x509.oid import NameOID
+
+
+temp_dir = mkdtemp()
+print(temp_dir)
+
+
+def genrsa(path):
+    key = rsa.generate_private_key(
+        public_exponent=65537,
+        key_size=2048,
+    )
+    with open(path, "wb") as f:
+        f.write(
+            key.private_bytes(
+                encoding=serialization.Encoding.PEM,
+                format=serialization.PrivateFormat.TraditionalOpenSSL,
+                encryption_algorithm=serialization.NoEncryption(),
+            )
+        )
+    return key
+
+
+def create_cert(path, C, ST, L, O, key):
+    subject = issuer = x509.Name(
+        [
+            x509.NameAttribute(NameOID.COUNTRY_NAME, C),
+            x509.NameAttribute(NameOID.STATE_OR_PROVINCE_NAME, ST),
+            x509.NameAttribute(NameOID.LOCALITY_NAME, L),
+            x509.NameAttribute(NameOID.ORGANIZATION_NAME, O),
+        ]
+    )
+    cert = (
+        x509.CertificateBuilder()
+        .subject_name(subject)
+        .issuer_name(issuer)
+        .public_key(key.public_key())
+        .serial_number(x509.random_serial_number())
+        .not_valid_before(datetime.now(timezone.utc))
+        .not_valid_after(
+            # Our certificate will be valid for 10 days
+            datetime.now(timezone.utc) + timedelta(days=10)
+        )
+        .add_extension(
+            x509.BasicConstraints(ca=True, path_length=None),
+            critical=True,
+        )
+        .sign(key, hashes.SHA256())
+    )
+    # Write our certificate out to disk.
+    with open(path, "wb") as f:
+        f.write(cert.public_bytes(serialization.Encoding.PEM))
+    return cert
+
+
+def create_req(path, C, ST, L, O, key):
+    csr = (
+        x509.CertificateSigningRequestBuilder()
+        .subject_name(
+            x509.Name(
+                [
+                    # Provide various details about who we are.
+                    x509.NameAttribute(NameOID.COUNTRY_NAME, C),
+                    x509.NameAttribute(NameOID.STATE_OR_PROVINCE_NAME, ST),
+                    x509.NameAttribute(NameOID.LOCALITY_NAME, L),
+                    x509.NameAttribute(NameOID.ORGANIZATION_NAME, O),
+                ]
+            )
+        )
+        .sign(key, hashes.SHA256())
+    )
+    with open(path, "wb") as f:
+        f.write(csr.public_bytes(serialization.Encoding.PEM))
+    return csr
+
+
+def sign_certificate_request(path, csr_cert, ca_cert, private_ca_key):
+    cert = (
+        x509.CertificateBuilder()
+        .subject_name(csr_cert.subject)
+        .issuer_name(ca_cert.subject)
+        .public_key(csr_cert.public_key())
+        .serial_number(x509.random_serial_number())
+        .not_valid_before(datetime.now(timezone.utc))
+        .not_valid_after(
+            # Our certificate will be valid for 10 days
+            datetime.now(timezone.utc) + timedelta(days=10)
+            # Sign our certificate with our private key
+        )
+        .sign(private_ca_key, hashes.SHA256())
+    )
+    with open(path, "wb") as f:
+        f.write(cert.public_bytes(serialization.Encoding.PEM))
+    return cert
+
+
+ca_key = genrsa(temp_dir + "/ca.key")
+ca_cert = create_cert(
+    temp_dir + "/ca.pem",
+    "US",
+    "New York",
+    "New York",
+    "Gloo Certificate Authority",
+    ca_key,
+)
+
+pkey = genrsa(temp_dir + "/pkey.key")
+csr = create_req(
+    temp_dir + "/csr.csr",
+    "US",
+    "California",
+    "San Francisco",
+    "Gloo Testing Company",
+    pkey,
+)
+
+cert = sign_certificate_request(temp_dir + "/cert.pem", csr, ca_cert, ca_key)
--- a/.ci/pytorch/run_glootls_test.sh
+++ b/.ci/pytorch/run_glootls_test.sh
@@ -0,0 +1,18 @@
+#!/bin/bash
+
+CREATE_TEST_CERT="$(dirname "${BASH_SOURCE[0]}")/create_test_cert.py"
+TMP_CERT_DIR=$(python "$CREATE_TEST_CERT")
+
+openssl verify -CAfile "${TMP_CERT_DIR}/ca.pem" "${TMP_CERT_DIR}/cert.pem"
+
+export GLOO_DEVICE_TRANSPORT=TCP_TLS
+export GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY=${TMP_CERT_DIR}/pkey.key
+export GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT=${TMP_CERT_DIR}/cert.pem
+export GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE=${TMP_CERT_DIR}/ca.pem
+
+time python test/run_test.py --include distributed/test_c10d_gloo --verbose -- ProcessGroupGlooTest
+
+unset GLOO_DEVICE_TRANSPORT
+unset GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY
+unset GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT
+unset GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE
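The script above exports four `GLOO_DEVICE_TRANSPORT*` variables, runs the Gloo tests, then unsets them by hand. Below is a minimal Python sketch of the same set-then-restore pattern, in case the setup is ever driven from Python rather than a sourced shell script. The helper names `scoped_env` and `gloo_tls_env` are hypothetical and not part of this change; only the variable names and file names are taken from `run_glootls_test.sh`.

```python
import os
from contextlib import contextmanager


@contextmanager
def scoped_env(overrides):
    """Temporarily set environment variables, restoring the previous state on exit."""
    # Remember what each variable held before (None means it was unset).
    saved = {name: os.environ.get(name) for name in overrides}
    os.environ.update(overrides)
    try:
        yield
    finally:
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old


def gloo_tls_env(cert_dir):
    """Build the variables that run_glootls_test.sh exports for a certificate directory."""
    # Hypothetical helper: mirrors the export lines of the shell script.
    return {
        "GLOO_DEVICE_TRANSPORT": "TCP_TLS",
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY": os.path.join(cert_dir, "pkey.key"),
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT": os.path.join(cert_dir, "cert.pem"),
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE": os.path.join(cert_dir, "ca.pem"),
    }
```

Unlike the explicit `unset` lines, the `finally` block restores the variables even if the wrapped test run raises, and it puts back any value that was set before rather than always clearing it.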
--- a/.ci/pytorch/run_tests.sh
+++ b/.ci/pytorch/run_tests.sh
@ -74,13 +74,12 @@ else
 fi

 # Environment initialization
-retry pip install -qUr requirements-build.txt
 if [[ "$(uname)" == Darwin ]]; then
    # Install the testing dependencies
-    retry pip install -q future hypothesis ${NUMPY_PACKAGE} ${PROTOBUF_PACKAGE} pytest
+    retry pip install -q future hypothesis ${NUMPY_PACKAGE} ${PROTOBUF_PACKAGE} pytest setuptools six typing_extensions pyyaml
 else
    retry pip install -qr requirements.txt || true
-    retry pip install -q hypothesis protobuf pytest || true
+    retry pip install -q hypothesis protobuf pytest setuptools || true
    numpy_ver=1.15
    case "$(python --version 2>&1)" in
      *2* | *3.5* | *3.6*)
--- a/.ci/pytorch/smoke_test/smoke_test.py
+++ b/.ci/pytorch/smoke_test/smoke_test.py
@ -385,29 +385,6 @@ def smoke_test_compile(device: str = "cpu") -> None:
    x_pt2 = torch.compile(model, mode="max-autotune")(x)


-def smoke_test_nvshmem() -> None:
-    if not torch.cuda.is_available():
-        print("CUDA is not available, skipping NVSHMEM test")
-        return
-
-    # Check if NVSHMEM is compiled in current build
-    try:
-        from torch._C._distributed_c10d import _is_nvshmem_available
-    except ImportError:
-        # Not built with NVSHMEM support.
-        # torch is not compiled with NVSHMEM prior to 2.9
-        if torch.__version__ < "2.9":
-            return
-        else:
-            # After 2.9: NVSHMEM is expected to be compiled in current build
-            raise RuntimeError("torch not compiled with NVSHMEM") from None
-
-    print("torch compiled with NVSHMEM")
-
-    # Check if NVSHMEM is available on current system.
-    print(f"NVSHMEM available at run time: {_is_nvshmem_available()}")
-
-
 def smoke_test_modules():
    cwd = os.getcwd()
    for module in MODULES:
@ -502,8 +479,6 @@ def main() -> None:
        options.pypi_pkg_check,
    )

-    smoke_test_nvshmem()
-

 if __name__ == "__main__":
    main()
--- a/.ci/pytorch/test.sh
+++ b/.ci/pytorch/test.sh
@ -289,12 +289,6 @@ elif [[ $TEST_CONFIG == 'nogpu_AVX512' ]]; then
  export ATEN_CPU_CAPABILITY=avx2
 fi

-if [[ "${TEST_CONFIG}" == "legacy_nvidia_driver" ]]; then
-  # Make sure that CUDA can be initialized
-  (cd test && python -c "import torch; torch.rand(2, 2, device='cuda')")
-  export USE_LEGACY_DRIVER=1
-fi
-
 test_python_legacy_jit() {
  time python test/run_test.py --include test_jit_legacy test_jit_fuser_legacy --verbose
  assert_git_not_dirty
@ -345,12 +339,6 @@ test_h100_symm_mem() {
  assert_git_not_dirty
 }

-test_h100_cutlass_backend() {
-  # cutlass backend tests for H100
-  TORCHINDUCTOR_CUTLASS_DIR=$(realpath "./third_party/cutlass") python test/run_test.py --include inductor/test_cutlass_backend -k "not addmm" $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
-  TORCHINDUCTOR_CUTLASS_DIR=$(realpath "./third_party/cutlass") python test/run_test.py --include inductor/test_cutlass_evt $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
-}
-
 test_lazy_tensor_meta_reference_disabled() {
  export TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1
  echo "Testing lazy tensor operations without meta reference"
@ -365,6 +353,7 @@ test_dynamo_wrapped_shard() {
    exit 1
  fi
  python tools/dynamo/verify_dynamo.py
+  python tools/dynamo/gb_id_mapping.py verify
  # PLEASE DO NOT ADD ADDITIONAL EXCLUDES HERE.
  # Instead, use @skipIfTorchDynamo on your tests.
  time python test/run_test.py --dynamo \
@ -462,7 +451,7 @@ test_inductor_aoti() {
  # rebuild with the build cache with `BUILD_AOT_INDUCTOR_TEST` enabled
  /usr/bin/env CMAKE_FRESH=1 BUILD_AOT_INDUCTOR_TEST=1 "${BUILD_COMMAND[@]}"

-  /usr/bin/env "${TEST_ENVS[@]}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference cpp/test_vec_half_AVX2 -dist=loadfile
+  /usr/bin/env "${TEST_ENVS[@]}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference -dist=loadfile
 }

 test_inductor_cpp_wrapper_shard() {
@ -928,6 +917,12 @@ test_torchbench_gcp_smoketest(){
  popd
 }

+test_python_gloo_with_tls() {
+  source "$(dirname "${BASH_SOURCE[0]}")/run_glootls_test.sh"
+  assert_git_not_dirty
+}
+
+
 test_aten() {
  # Test ATen
  # The following test(s) of ATen have already been skipped by caffe2 in rocm environment:
@ -974,8 +969,6 @@ test_without_numpy() {
  if [[ "${TEST_CONFIG}" == *dynamo_wrapped* ]]; then
    python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch;torch.compile(lambda x:print(x))('Hello World')"
  fi
-  # Regression test for https://github.com/pytorch/pytorch/pull/157734 (torch.onnx should be importable without numpy)
-  python -c "import sys;sys.path.insert(0, 'fake_numpy');import torch; import torch.onnx"
  popd
 }

@ -1039,10 +1032,20 @@ test_libtorch_api() {
    mkdir -p $TEST_REPORTS_DIR

    OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" "$TORCH_BIN_DIR"/test_api --gtest_filter='-IMethodTest.*' --gtest_output=xml:$TEST_REPORTS_DIR/test_api.xml
+    "$TORCH_BIN_DIR"/test_tensorexpr --gtest_output=xml:$TEST_REPORTS_DIR/test_tensorexpr.xml
  else
    # Exclude IMethodTest that relies on torch::deploy, which will instead be ran in test_deploy
    OMP_NUM_THREADS=2 TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_api -k "not IMethodTest"

+    # On s390x, pytorch is built without llvm.
+    # Even if it were built with llvm, llvm currently doesn't support the features used on s390x, and
+    # the test fails with errors like:
+    # JIT session error: Unsupported target machine architecture in ELF object pytorch-jitted-objectbuffer
+    # unknown file: Failure
+    # C++ exception with description "valOrErr INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/jit/tensorexpr/llvm_jit.h":34, please report a bug to PyTorch. Unexpected failure in LLVM JIT: Failed to materialize symbols: { (main, { func }) }
+    if [[ "${BUILD_ENVIRONMENT}" != *s390x* ]]; then
+      python test/run_test.py --cpp --verbose -i cpp/test_tensorexpr
+    fi
  fi

  # quantization is not fully supported on s390x yet
@ -1310,13 +1313,10 @@ EOF

  # Step 2. Make sure that the public API test "test_correct_module_names" fails when an existing
  # file is modified to introduce an invalid public API function.
-  # The filepath here must not have __all__ defined in it, otherwise the test will pass.
-  # If your PR introduces __all__ to torch/cuda/streams.py please point this to another file
-  # that does not have __all__ defined.
-  EXISTING_FILEPATH="${TORCH_INSTALL_DIR}/cuda/streams.py"
+  EXISTING_FILEPATH="${TORCH_INSTALL_DIR}/nn/parameter.py"
  cp -v "${EXISTING_FILEPATH}" "${EXISTING_FILEPATH}.orig"
  echo "${BAD_PUBLIC_FUNC}" >> "${EXISTING_FILEPATH}"
-  invalid_api="torch.cuda.streams.new_public_func"
+  invalid_api="torch.nn.parameter.new_public_func"
  echo "Appended an invalid public API function to existing file ${EXISTING_FILEPATH}..."

  check_public_api_test_fails \
@ -1550,7 +1550,7 @@ test_executorch() {
 test_linux_aarch64() {
  python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \
        test_transformers test_multiprocessing test_numpy_interop test_autograd test_binary_ufuncs test_complex test_spectral_ops \
-        test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops \
+        test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops test_cpp_extensions_open_device_registration \
        --shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose

  # Dynamo tests
@ -1600,13 +1600,7 @@ if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-baze
 fi
 if [[ "${TEST_CONFIG}" == *numpy_2* ]]; then
  # Install numpy-2.0.2 and compatible scipy & numba versions
-  # Force re-install of pandas to avoid error where pandas checks numpy version from initial install and fails upon import
-  TMP_PANDAS_VERSION=$(python -c "import pandas; print(pandas.__version__)" 2>/dev/null)
-  if [ -n "$TMP_PANDAS_VERSION" ]; then
-    python -m pip install --pre numpy==2.0.2 scipy==1.13.1 numba==0.60.0 pandas=="$TMP_PANDAS_VERSION" --force-reinstall
-  else
-    python -m pip install --pre numpy==2.0.2 scipy==1.13.1 numba==0.60.0
-  fi
+  python -m pip install --pre numpy==2.0.2 scipy==1.13.1 numba==0.60.0
  python test/run_test.py --include dynamo/test_functions.py dynamo/test_unspec.py test_binary_ufuncs.py test_fake_tensor.py test_linalg.py test_numpy_interop.py test_tensor_creation_ops.py test_torch.py torch_np/test_basic.py
 elif [[ "${BUILD_ENVIRONMENT}" == *aarch64* && "${TEST_CONFIG}" != *perf_cpu_aarch64* ]]; then
  test_linux_aarch64
@ -1660,19 +1654,23 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then
  id=$((SHARD_NUMBER-1))
  test_dynamo_benchmark timm_models "$id"
 elif [[ "${TEST_CONFIG}" == cachebench ]]; then
-  install_torchaudio
+  install_torchaudio cuda
  install_torchvision
  checkout_install_torchbench nanogpt BERT_pytorch resnet50 hf_T5 llama moco
  PYTHONPATH=$(pwd)/torchbench test_cachebench
 elif [[ "${TEST_CONFIG}" == verify_cachebench ]]; then
-  install_torchaudio
+  install_torchaudio cpu
  install_torchvision
  checkout_install_torchbench nanogpt
  PYTHONPATH=$(pwd)/torchbench test_verify_cachebench
 elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
-  install_torchaudio
+  if [[ "${TEST_CONFIG}" == *cpu* ]]; then
+    install_torchaudio cpu
+  else
+    install_torchaudio cuda
+  fi
  install_torchvision
-  install_torchao
+  TORCH_CUDA_ARCH_LIST="8.0;8.6" install_torchao
  id=$((SHARD_NUMBER-1))
  # https://github.com/opencv/opencv-python/issues/885
  pip_install opencv-python==4.8.0.74
@ -1763,8 +1761,6 @@ elif [[ "${TEST_CONFIG}" == h100_distributed ]]; then
  test_h100_distributed
 elif [[ "${TEST_CONFIG}" == "h100-symm-mem" ]]; then
  test_h100_symm_mem
-elif [[ "${TEST_CONFIG}" == h100_cutlass_backend ]]; then
-  test_h100_cutlass_backend
 else
  install_torchvision
  install_monkeytype
--- a/.ci/pytorch/win-arm64-build.ps1
+++ b/.ci/pytorch/win-arm64-build.ps1
@ -1,34 +0,0 @@
-# If you want to rebuild, run this with $env:REBUILD=1
-# If you want to build with CUDA, run this with $env:USE_CUDA=1
-# If you want to build without CUDA, run this with $env:USE_CUDA=0
-
-# Check for setup.py in the current directory
-if (-not (Test-Path "setup.py")) {
-    Write-Host "ERROR: Please run this build script from PyTorch root directory."
-    exit 1
-}
-
-# Get the script's parent directory
-$ScriptParentDir = Split-Path -Parent $MyInvocation.MyCommand.Definition
-
-# Set TMP_DIR and convert to Windows path
-$env:TMP_DIR = Join-Path (Get-Location) "build\win_tmp"
-$env:TMP_DIR_WIN = $env:TMP_DIR  # Already in Windows format, no cygpath needed
-
-# Set final package directory with default fallback
-if (-not $env:PYTORCH_FINAL_PACKAGE_DIR) {
-    $env:PYTORCH_FINAL_PACKAGE_DIR = "C:\w\build-results"
-}
-
-# Create the final package directory if it doesn't exist
-if (-not (Test-Path $env:PYTORCH_FINAL_PACKAGE_DIR)) {
-    New-Item -Path $env:PYTORCH_FINAL_PACKAGE_DIR -ItemType Directory -Force | Out-Null
-}
-
-# Set script helpers directory
-$env:SCRIPT_HELPERS_DIR = Join-Path $ScriptParentDir "win-test-helpers\arm64"
-
-# Run the main build script
-& "$env:SCRIPT_HELPERS_DIR\build_pytorch.ps1"
-
-Write-Host "BUILD PASSED"
--- a/.ci/pytorch/win-arm64-test.sh
+++ b/.ci/pytorch/win-arm64-test.sh
@ -1,24 +0,0 @@
-#!/bin/bash
-set -ex -o pipefail
-
-SCRIPT_PARENT_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
-# shellcheck source=./common.sh
-source "$SCRIPT_PARENT_DIR/common.sh"
-
-run_tests() {
-    echo Running smoke_test.py...
-    python ./.ci/pytorch/smoke_test/smoke_test.py --package torchonly
-
-    echo Running test_autograd.oy, test_nn.py, test_torch.py...
-    cd test
-
-    CORE_TEST_LIST=("test_autograd.py" "test_nn.py" "test_modules.py")
-
-    for t in "${CORE_TEST_LIST[@]}"; do
-        echo "Running test: $t"
-        python "$t" --verbose --save-xml --use-pytest -vvvv -rfEsxXP -p no:xdist
-    done
-}
-
-run_tests
-echo "TEST PASSED"
--- a/.ci/pytorch/win-test-helpers/arm64/build_pytorch.ps1
+++ b/.ci/pytorch/win-test-helpers/arm64/build_pytorch.ps1
@ -1,98 +0,0 @@
-# TODO: we may can use existing build_pytorch.bat for arm64
-
-if ($env:DEBUG -eq "1") {
-    $env:BUILD_TYPE = "debug"
-} else {
-    $env:BUILD_TYPE = "release"
-}
-
-# This inflates our log size slightly, but it is REALLY useful to be
-# able to see what our cl.exe commands are. (since you can actually
-# just copy-paste them into a local Windows setup to just rebuild a
-# single file.)
-# log sizes are too long, but leaving this here in case someone wants to use it locally
-# $env:CMAKE_VERBOSE_MAKEFILE = "1"
-
-$env:INSTALLER_DIR = Join-Path $env:SCRIPT_HELPERS_DIR "installation-helpers"
-
-cd ..
-
-# Environment variables
-$env:SCCACHE_IDLE_TIMEOUT = "0"
-$env:SCCACHE_IGNORE_SERVER_IO_ERROR = "1"
-$env:CMAKE_BUILD_TYPE = $env:BUILD_TYPE
-$env:CMAKE_C_COMPILER_LAUNCHER = "sccache"
-$env:CMAKE_CXX_COMPILER_LAUNCHER = "sccache"
-$env:libuv_ROOT = Join-Path $env:DEPENDENCIES_DIR "libuv\install"
-$env:MSSdk = "1"
-
-if ($env:PYTORCH_BUILD_VERSION) {
-    $env:PYTORCH_BUILD_VERSION = $env:PYTORCH_BUILD_VERSION
-    $env:PYTORCH_BUILD_NUMBER = "1"
-}
-
-$env:CMAKE_POLICY_VERSION_MINIMUM = "3.5"
-
-# Set BLAS type
-if ($env:ENABLE_APL -eq "1") {
-    $env:BLAS = "APL"
-    $env:USE_LAPACK = "1"
-} elseif ($env:ENABLE_OPENBLAS -eq "1") {
-    $env:BLAS = "OpenBLAS"
-    $env:OpenBLAS_HOME = Join-Path $env:DEPENDENCIES_DIR "OpenBLAS\install"
-}
-
-# Change to source directory
-Set-Location $env:PYTORCH_ROOT
-
-# Copy libuv.dll
-Copy-Item -Path (Join-Path $env:libuv_ROOT "lib\Release\uv.dll") -Destination "torch\lib\uv.dll" -Force
-
-# Create virtual environment
-python -m venv .venv
-.\.venv\Scripts\Activate.ps1
-where.exe python
-
-# Python install dependencies
-python -m pip install --upgrade pip
-pip install setuptools pyyaml
-pip install -r requirements.txt
-
-# Set after installing psutil
-$env:DISTUTILS_USE_SDK = "1"
-
-# Print all environment variables
-Get-ChildItem Env:
-
-# Start and inspect sccache
-sccache --start-server
-sccache --zero-stats
-sccache --show-stats
-
-# Build the wheel
-python setup.py bdist_wheel
-if ($LASTEXITCODE -ne 0) { exit 1 }
-
-# Install the wheel locally
-$whl = Get-ChildItem -Path "dist\*.whl" | Select-Object -First 1
-if ($whl) {
-    python -mpip install --no-index --no-deps $whl.FullName
-}
-
-# Copy final wheel
-robocopy "dist" "$env:PYTORCH_FINAL_PACKAGE_DIR" *.whl
-
-# Export test times
-python tools/stats/export_test_times.py
-
-# Copy additional CI files
-robocopy ".additional_ci_files" "$env:PYTORCH_FINAL_PACKAGE_DIR\.additional_ci_files" /E
-
-# Save ninja log
-Copy-Item -Path "build\.ninja_log" -Destination $env:PYTORCH_FINAL_PACKAGE_DIR -Force
-
-# Final sccache stats and stop
-sccache --show-stats
-sccache --stop-server
-
-exit 0
--- a/.ci/pytorch/win-test.sh
+++ b/.ci/pytorch/win-test.sh
@ -41,7 +41,7 @@ fi
 python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==2.13.0 protobuf==5.29.4 pytest-subtests==0.13.1

 # Install Z3 optional dependency for Windows builds.
-python -m pip install z3-solver==4.15.1.0
+python -m pip install z3-solver==4.12.2.0

 # Install tlparse for test\dynamo\test_structured_trace.py UTs.
 python -m pip install tlparse==0.3.30
--- a/.ci/pytorch/windows/internal/smoke_test.bat
+++ b/.ci/pytorch/windows/internal/smoke_test.bat
@ -148,7 +148,14 @@ if "%NVIDIA_GPU_EXISTS%" == "0" (
    goto end
 )

-cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\check-torch-cuda.cpp torch_cpu.lib c10.lib torch_cuda.lib /EHsc /std:c++17 /link /INCLUDE:?warp_size@cuda@at@@YAHXZ
+set BUILD_SPLIT_CUDA=
+if exist "%install_root%\lib\torch_cuda_cu.lib" if exist "%install_root%\lib\torch_cuda_cpp.lib" set BUILD_SPLIT_CUDA=ON
+
+if "%BUILD_SPLIT_CUDA%" == "ON" (
+    cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\check-torch-cuda.cpp torch_cpu.lib c10.lib torch_cuda_cu.lib torch_cuda_cpp.lib /EHsc /std:c++17 /link /INCLUDE:?warp_size@cuda@at@@YAHXZ /INCLUDE:?_torch_cuda_cu_linker_symbol_op_cuda@native@at@@YA?AVTensor@2@AEBV32@@Z
+) else (
+    cl %PYTORCH_ROOT%\.ci\pytorch\test_example_code\check-torch-cuda.cpp torch_cpu.lib c10.lib torch_cuda.lib /EHsc /std:c++17 /link /INCLUDE:?warp_size@cuda@at@@YAHXZ
+)
 .\check-torch-cuda.exe
 if ERRORLEVEL 1 exit /b 1

--- a/.ci/wheel/build_wheel.sh
+++ b/.ci/wheel/build_wheel.sh
@ -184,8 +184,7 @@ tmp_env_name="wheel_py$python_nodot"
 conda create ${EXTRA_CONDA_INSTALL_FLAGS} -yn "$tmp_env_name" python="$desired_python" ${CONDA_ENV_CREATE_FLAGS}
 source activate "$tmp_env_name"

-retry pip install -r "${pytorch_rootdir}/requirements-build.txt"
-pip install "numpy=${NUMPY_PINNED_VERSION}"  "pyyaml${PYYAML_PINNED_VERSION}" requests ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing-extensions
+pip install "numpy=${NUMPY_PINNED_VERSION}"  "pyyaml${PYYAML_PINNED_VERSION}" requests ninja "setuptools${SETUPTOOLS_PINNED_VERSION}" typing_extensions
 retry pip install -r "${pytorch_rootdir}/requirements.txt" || true
 retry brew install libomp

--- a/.flake8
+++ b/.flake8
@ -7,12 +7,12 @@ max-line-length = 120
 # C408 ignored because we like the dict keyword argument syntax
 # E501 is not flexible enough, we're using B950 instead
 ignore =
-    E203,E305,E402,E501,E704,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,F824,
+    E203,E305,E402,E501,E704,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,
    # shebang has extra meaning in fbcode lints, so I think it's not worth trying
    # to line this up with executable bit
    EXE001,
    # these ignores are from flake8-bugbear; please fix!
-    B007,B008,B017,B019,B023,B028,B903,B904,B905,B906,B907,B908,B910
+    B007,B008,B017,B019,B023,B028,B903,B904,B905,B906,B907
    # these ignores are from flake8-comprehensions; please fix!
    C407,
    # these ignores are from flake8-logging-format; please fix!
--- a/.github/actions/build-android/action.yml
+++ b/.github/actions/build-android/action.yml
@ -0,0 +1,78 @@
+name: build android
+
+description: build android for a specific arch
+
+inputs:
+  arch:
+    description: arch to build
+    required: true
+  arch-for-build-env:
+    description: |
+      arch to pass to build environment.
+      This is currently different than the arch name we use elsewhere, which
+      should be fixed.
+    required: true
+  github-secret:
+    description: github token
+    required: true
+  build-environment:
+    required: true
+    description: Top-level label for what's being built/tested.
+  docker-image:
+    required: true
+    description: Name of the base docker image to build with.
+  branch:
+    required: true
+    description: What branch we are building on.
+outputs:
+  container_id:
+    description: Docker container identifier used to build the artifacts
+    value: ${{ steps.build.outputs.container_id }}
+
+runs:
+  using: composite
+  steps:
+    - name: Build-${{ inputs.arch }}
+      id: build
+      shell: bash
+      env:
+        BRANCH: ${{ inputs.branch }}
+        BUILD_ENVIRONMENT: pytorch-linux-xenial-py3-clang5-android-ndk-r19c-${{ inputs.arch-for-build-env }}-build
+        AWS_DEFAULT_REGION: us-east-1
+        PR_NUMBER: ${{ github.event.pull_request.number }}
+        SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
+        SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
+        SCCACHE_REGION: us-east-1
+        DOCKER_IMAGE: ${{ inputs.docker-image  }}
+        MATRIX_ARCH: ${{ inputs.arch }}
+      run: |
+        # detached container should get cleaned up by teardown_ec2_linux
+        set -exo pipefail
+        export container_name
+        container_name=$(docker run \
+          -e BUILD_ENVIRONMENT \
+          -e MAX_JOBS="$(nproc --ignore=2)" \
+          -e AWS_DEFAULT_REGION \
+          -e PR_NUMBER \
+          -e SHA1 \
+          -e BRANCH \
+          -e SCCACHE_BUCKET \
+          -e SCCACHE_REGION \
+          -e SKIP_SCCACHE_INITIALIZATION=1 \
+          --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
+          --security-opt seccomp=unconfined \
+          --cap-add=SYS_PTRACE \
+          --tty \
+          --detach \
+          --user jenkins \
+          -w /var/lib/jenkins/workspace \
+          "${DOCKER_IMAGE}"
+        )
+        git submodule sync && git submodule update -q --init --recursive --depth 1
+        docker cp "${GITHUB_WORKSPACE}/." "${container_name}:/var/lib/jenkins/workspace"
+        (echo "sudo chown -R jenkins . && .ci/pytorch/build.sh && find ${BUILD_ROOT} -type f \( -name '*.a' -or -name '*.o' \) -delete" | docker exec -u jenkins -i "${container_name}" bash) 2>&1
+
+        # Copy install binaries back
+        mkdir -p "${GITHUB_WORKSPACE}/build_android_install_${MATRIX_ARCH}"
+        docker cp "${container_name}:/var/lib/jenkins/workspace/build_android/install" "${GITHUB_WORKSPACE}/build_android_install_${MATRIX_ARCH}"
+        echo "container_id=${container_name}" >> "${GITHUB_OUTPUT}"
--- a/.github/actions/filter-test-configs/action.yml
+++ b/.github/actions/filter-test-configs/action.yml
@ -70,7 +70,7 @@ runs:
          set -eux
          # PyYAML 6.0 doesn't work with MacOS x86 anymore
          # This must run on Python-3.7 (AmazonLinux2) so can't use request=3.32.2
-          python3 -m pip install requests==2.27.1 pyyaml==6.0.2
+          python3 -m pip install requests==2.27.1 pyyaml==6.0.1

    - name: Parse ref
      id: parse-ref
--- a/.github/actions/linux-test/action.yml
+++ b/.github/actions/linux-test/action.yml
@ -126,7 +126,7 @@ runs:
      shell: bash
      continue-on-error: true
      run: |
-        python3 -m pip install psutil==5.9.8 nvidia-ml-py==11.525.84
+        python3 -m pip install psutil==5.9.1 nvidia-ml-py==11.525.84
        python3 -m tools.stats.monitor > usage_log.txt 2>&1 &
        echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

--- a/.github/ci_commit_pins/audio.txt
+++ b/.github/ci_commit_pins/audio.txt
@ -1 +1 @@
-f6dfe1231dcdd221a68416e49ab85c2575cbb824
+6c57850358f34c47802db216b0746e4e9d08a95a
--- a/.github/ci_commit_pins/fbgemm_rocm.txt
+++ b/.github/ci_commit_pins/fbgemm_rocm.txt
@ -1 +1 @@
-7f1de94a4c2d14f59ad4ca84538c36084ea6b2c8
+5fb5024118e9bb9decf96c2b0b1a8f0010bf56be
--- a/.github/ci_commit_pins/vllm.txt
+++ b/.github/ci_commit_pins/vllm.txt
@ -1 +0,0 @@
-8f605ee30912541126c0fe46d0c8c413101b600a
--- a/.github/ci_commit_pins/xla.txt
+++ b/.github/ci_commit_pins/xla.txt
@ -1 +1 @@
-29ae4c76c026185f417a25e841d2cd5e65f087a3
+1c00dea2c9adb2137903c86b4191e8c247f8fda9
--- a/.github/merge_rules.yaml
+++ b/.github/merge_rules.yaml
@ -76,7 +76,6 @@
  - .github/ci_commit_pins/audio.txt
  - .github/ci_commit_pins/vision.txt
  - .github/ci_commit_pins/torchdynamo.txt
-  - .github/ci_commit_pins/vllm.txt
  - .ci/docker/ci_commit_pins/triton.txt
  approved_by:
  - pytorchbot
@ -131,6 +130,21 @@
  - Lint
  - pull

+- name: Mobile
+  patterns:
+  - ios/**
+  - android/**
+  - test/mobile/**
+  approved_by:
+  - linbinyu
+  - IvanKobzarev
+  - dreiss
+  - raziel
+  mandatory_checks_name:
+  - EasyCLA
+  - Lint
+  - pull
+
 - name: PrimTorch
  patterns:
  - torch/_meta_registrations.py
@ -477,19 +491,6 @@
  - srossross
  - chillee
  - zou3519
-  - guilhermeleobas
-  mandatory_checks_name:
-  - EasyCLA
-  - Lint
-  - pull
-
- name: Dynamo
-  patterns:
-  - torch/_dynamo/**
-  - torch/csrc/dynamo/**
-  - test/dynamo/**
-  approved_by:
-  - guilhermeleobas
  mandatory_checks_name:
  - EasyCLA
  - Lint
--- a/.github/pytorch-probot.yml
+++ b/.github/pytorch-probot.yml
@ -31,9 +31,7 @@ ciflow_push_tags:
 - ciflow/pull
 - ciflow/h100
 - ciflow/h100-distributed
- ciflow/win-arm64
 - ciflow/h100-symm-mem
- ciflow/h100-cutlass-backend
 retryable_workflows:
 - pull
 - trunk
--- a/.github/requirements-gha-cache.txt
+++ b/.github/requirements-gha-cache.txt
@ -1,15 +1,14 @@
 # This file is to cache other dependencies not specified elsewhere in:
-#   requirements.txt
-#   requirements-build.txt
+#   requirements.txt
 #   docs/requirements.txt
 #   docs/cpp/requirements.txt
 #   functorch/docs/requirements.txt
 #   .ci/docker/requirements-ci.txt
 boto3==1.35.42
 jinja2==3.1.6
-lintrunner==0.12.7
+lintrunner==0.10.7
 ninja==1.10.0.post1
 nvidia-ml-py==11.525.84
-pyyaml==6.0.2
+pyyaml==6.0
 requests==2.32.4
-rich==14.1.0
+rich==10.9.0
--- a/.github/requirements/pip-requirements-macOS.txt
+++ b/.github/requirements/pip-requirements-macOS.txt
@ -2,7 +2,7 @@ boto3==1.35.42
 cmake==3.27.*
 expecttest==0.3.0
 fbscribelogger==0.1.7
-filelock==3.13.1
+filelock==3.6.0
 hypothesis==6.56.4
 librosa>=0.6.2
 mpmath==1.3.0
@ -16,7 +16,7 @@ packaging==23.1
 parameterized==0.8.1
 pillow==10.3.0
 protobuf==5.29.4
-psutil==5.9.8
+psutil==5.9.1
 pygments==2.15.0
 pytest-cpp==2.3.0
 pytest-flakefinder==1.1.0
@ -33,4 +33,4 @@ tensorboard==2.13.0
 typing-extensions==4.12.2
 unittest-xml-reporting<=3.2.0,>=2.0.0
 xdoctest==1.1.0
-z3-solver==4.15.1.0
+z3-solver==4.12.2.0
--- a/.github/scripts/lintrunner.sh
+++ b/.github/scripts/lintrunner.sh
@ -2,7 +2,7 @@
 set -ex

 # Use uv to speed up lintrunner init
-python3 -m pip install -U uv==0.8.* setuptools
+python3 -m pip install uv==0.1.45 setuptools

 CACHE_DIRECTORY="/tmp/.lintbin"
 # Try to recover the cached binaries
--- a/.github/workflows/_get-changed-files.yml
+++ b/.github/workflows/_get-changed-files.yml
@ -1,43 +0,0 @@
-name: Get Changed Files
-
-on:
-  workflow_call:
-    outputs:
-      changed-files:
-        description: "List of changed files (space-separated) or '*' if not in a PR"
-        value: ${{ jobs.get-changed-files.outputs.changed-files }}
-
-jobs:
-  get-changed-files:
-    runs-on: ubuntu-latest
-    outputs:
-      changed-files: ${{ steps.get-files.outputs.changed-files }}
-
-    steps:
-      - name: Get changed files
-        id: get-files
-        env:
-          GH_TOKEN: ${{ github.token }}
-        run: |
-          # Check if we're in a pull request context
-          if [ "${{ github.event_name }}" = "pull_request" ] || [ "${{ github.event_name }}" = "pull_request_target" ]; then
-            echo "Running in PR context"
-
-            # Get the PR number from the github context
-            PR_NUMBER="${{ github.event.number }}"
-
-            # Use gh CLI to get changed files in the PR with explicit repo
-            CHANGED_FILES=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER/files --paginate --jq '.[] | select(.status != "removed") | .filename' | tr '\n' ' ' | sed 's/ $//')
-
-            if [ -z "$CHANGED_FILES" ]; then
-              echo "No changed files found, setting to '*'"
-              CHANGED_FILES="*"
-            fi
-
-            echo "Changed files: $CHANGED_FILES"
-            echo "changed-files=$CHANGED_FILES" >> "$GITHUB_OUTPUT"
-
-          else
-            echo "Not in PR context, setting changed files to '*'"
-            echo "changed-files=*" >> "$GITHUB_OUTPUT"
-          fi
--- a/.github/workflows/_linux-build.yml
+++ b/.github/workflows/_linux-build.yml
@ -16,6 +16,11 @@ on:
        type: boolean
        default: true
        description: If set, upload generated build artifacts.
+      build-with-debug:
+        required: false
+        type: boolean
+        default: false
+        description: If set, build in debug mode.
      sync-tag:
        required: false
        type: string
@ -82,6 +87,7 @@ on:
        required: false
        type: number
        default: 1
+
      allow-reuse-old-whl:
        description: |
          If set, the build try to pull an old wheel from s3 that was built on a
@ -89,13 +95,6 @@ on:
        required: false
        type: boolean
        default: true
-      build-additional-packages:
-        description: |
-          If set, the build job will also builds these packages and saves their
-          wheels as artifacts
-        required: false
-        type: string
-        default: ""

    secrets:
      HUGGING_FACE_HUB_TOKEN:
@ -107,6 +106,7 @@ on:
        description: |
          FB app token to write to scribe endpoint

+
    outputs:
      docker-image:
        value: ${{ jobs.build.outputs.docker-image }}
@ -225,7 +225,7 @@ jobs:
          MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
        run: |
          mkdir -p ../../usage_logs
-          python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7
+          python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7
          python3 -m tools.stats.monitor \
          --log-interval "$MONITOR_LOG_INTERVAL" \
          --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" \
@ -247,6 +247,8 @@ jobs:
        env:
          BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
          BRANCH: ${{ steps.parse-ref.outputs.branch }}
+          # TODO duplicated
+          AWS_DEFAULT_REGION: us-east-1
          PR_NUMBER: ${{ github.event.pull_request.number }}
          SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
          # Do not set SCCACHE_S3_KEY_PREFIX to share the cache between all build jobs
@ -258,10 +260,10 @@ jobs:
          DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
          DOCKER_IMAGE_S390X: ${{ inputs.docker-image-name }}
          XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
+          DEBUG: ${{ inputs.build-with-debug && '1' || '0' }}
          OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
          HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
          SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
-          BUILD_ADDITIONAL_PACKAGES: ${{ inputs.build-additional-packages }}
        run: |
          START_TIME=$(date +%s)
          if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then
@ -293,6 +295,7 @@ jobs:
          container_name=$(docker run \
            -e BUILD_ENVIRONMENT \
            -e MAX_JOBS="$(nproc --ignore=2)" \
+            -e AWS_DEFAULT_REGION \
            -e PR_NUMBER \
            -e SHA1 \
            -e BRANCH \
@ -307,7 +310,6 @@ jobs:
            -e HUGGING_FACE_HUB_TOKEN \
            -e SCRIBE_GRAPHQL_ACCESS_TOKEN \
            -e USE_SPLIT_BUILD \
-            -e BUILD_ADDITIONAL_PACKAGES \
            --memory="${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}g" \
            --memory-swap="${TOTAL_MEMORY_WITH_SWAP}g" \
            --env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
@ -321,11 +323,6 @@ jobs:
            "${USED_IMAGE}" \
            ${DOCKER_SHELL_CMD}
          )
-
-          if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then
-            docker exec -t "${container_name}" sh -c "python3 -m pip install -r requirements.txt"
-          fi
-
          docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'

          END_TIME=$(date +%s)
--- a/.github/workflows/_linux-test.yml
+++ b/.github/workflows/_linux-test.yml
@ -164,8 +164,6 @@ jobs:
      - name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
        id: install-nvidia-driver
        uses: pytorch/test-infra/.github/actions/setup-nvidia@main
-        with:
-          driver-version: ${{ matrix.config == 'legacy_nvidia_driver' && '525.105.17' || '570.133.07' }}
        if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' && matrix.runner != 'B200' }}

      - name: Setup GPU_FLAG for docker run
@ -205,7 +203,7 @@ jobs:
          MONITOR_LOG_INTERVAL: ${{ inputs.monitor-log-interval }}
          MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
        run: |
-          python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
+          python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
          python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
          echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

--- a/.github/workflows/_mac-test.yml
+++ b/.github/workflows/_mac-test.yml
@ -136,7 +136,7 @@ jobs:
          MONITOR_LOG_INTERVAL: ${{ inputs.monitor-log-interval }}
          MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
        run: |
-          "$VENV_PATH/bin/python3" -m pip install psutil==5.9.8 dataclasses_json==0.6.7
+          "$VENV_PATH/bin/python3" -m pip install psutil==5.9.1 dataclasses_json==0.6.7
          "$VENV_PATH/bin/python3" -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
          echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

@ -281,7 +281,7 @@ jobs:
        continue-on-error: true
        run: |
          if [[ -n "$REINSTALL_BREW_MINICONDA" ]]; then
-              brew install --cask miniconda
+              brew install miniconda
          fi

      - name: Clean up disk space
--- a/.github/workflows/_rocm-test.yml
+++ b/.github/workflows/_rocm-test.yml
@ -132,7 +132,7 @@ jobs:
        shell: bash
        continue-on-error: true
        run: |
-          python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7
+          python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7
          python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
          echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

@ -269,8 +269,8 @@ jobs:
          # copy test results back to the mounted workspace, needed sudo, resulting permissions were correct
          docker exec -t "${{ env.CONTAINER_NAME }}" sh -c "cd ../pytorch && sudo cp -R test/test-reports ../workspace/test"

-      - name: Change permissions (only needed for MI300 and MI355 kubernetes runners for now)
-        if: ${{ always() && steps.test.conclusion && (contains(matrix.runner, 'mi300') || contains(matrix.runner, 'mi355')) }}
+      - name: Change permissions (only needed for MI300 runners for now)
+        if: ${{ always() && steps.test.conclusion && contains(matrix.runner, 'mi300') }}
        run: |
          docker exec -t "${{ env.CONTAINER_NAME }}" sh -c "sudo chown -R 1001:1001 test"

--- a/.github/workflows/_win-test.yml
+++ b/.github/workflows/_win-test.yml
@ -138,7 +138,7 @@ jobs:
        continue-on-error: true
        run: |
          # Windows conda doesn't have python3 binary, only python, but it's python3
-          ${CONDA_RUN} python -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
+          ${CONDA_RUN} python -m pip install psutil==5.9.1 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
          ${CONDA_RUN} python -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
          echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

--- a/.github/workflows/_xpu-test.yml
+++ b/.github/workflows/_xpu-test.yml
@ -133,7 +133,7 @@ jobs:
          MONITOR_LOG_INTERVAL: ${{ inputs.monitor-log-interval }}
          MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
        run: |
-          python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
+          python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84
          python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
          echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"

--- a/.github/workflows/check_mergeability_ghstack.yml
+++ b/.github/workflows/check_mergeability_ghstack.yml
@ -56,7 +56,7 @@ jobs:
          cache: pip
          architecture: x64

-      - run: pip install pyyaml==6.0.2
+      - run: pip install pyyaml==6.0
        shell: bash

      - name: Verify mergeability
--- a/.github/workflows/cherry-pick.yml
+++ b/.github/workflows/cherry-pick.yml
@ -26,7 +26,7 @@ jobs:
          cache: pip

      # Not the direct dependencies but the script uses trymerge
-      - run: pip install pyyaml==6.0.2
+      - run: pip install pyyaml==6.0

      - name: Setup committer id
        run: |
--- a/.github/workflows/docker-builds.yml
+++ b/.github/workflows/docker-builds.yml
@ -50,7 +50,6 @@ jobs:
        runner: [linux.12xlarge]
        docker-image-name: [
          pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11,
-          pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm,
          pytorch-linux-jammy-cuda12.6-cudnn9-py3-gcc9-inductor-benchmarks,
          pytorch-linux-jammy-cuda12.6-cudnn9-py3.12-gcc9-inductor-benchmarks,
          pytorch-linux-jammy-cuda12.6-cudnn9-py3.13-gcc9-inductor-benchmarks,
@ -63,9 +62,9 @@ jobs:
          pytorch-linux-jammy-py3.11-clang12,
          pytorch-linux-jammy-py3.12-clang12,
          pytorch-linux-jammy-py3.13-clang12,
+          pytorch-linux-jammy-rocm-n-1-py3,
          pytorch-linux-jammy-rocm-n-py3,
          pytorch-linux-noble-rocm-n-py3,
-          pytorch-linux-noble-rocm-alpha-py3,
          pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-clang12,
          pytorch-linux-jammy-py3.9-gcc11,
          pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks,
--- a/.github/workflows/docker-release.yml
+++ b/.github/workflows/docker-release.yml
@ -144,7 +144,7 @@ jobs:
        run: |
          make -f docker.Makefile "${BUILD_IMAGE_TYPE}-image"
      - name: Push nightly tags
-        if: ${{ github.event.ref == 'refs/heads/nightly' && matrix.image_type == 'runtime' && matrix.platform == 'linux/amd64' }}
+        if: ${{ github.event.ref == 'refs/heads/nightly' && matrix.image_type == 'runtime' && matrix.build_platforms == 'linux/amd64' }}
        run: |
          PYTORCH_DOCKER_TAG="${PYTORCH_VERSION}-cuda${CUDA_VERSION_SHORT}-cudnn${CUDNN_VERSION}-runtime"
          CUDA_SUFFIX="-cu${CUDA_VERSION}"
--- a/.github/workflows/h100-cutlass-backend.yml
+++ b/.github/workflows/h100-cutlass-backend.yml
@ -1,58 +0,0 @@
-name: Limited CI for CUTLASS backend on H100
-
-on:
-  pull_request:
-    paths:
-      - .github/workflows/h100-cutlass-backend.yml
-  workflow_dispatch:
-  schedule:
-    - cron: 22 9 * * *  # every 24 hours about 2:22am PDT
-  push:
-    tags:
-      - ciflow/h100-cutlass-backend/*
-
-concurrency:
-  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
-  cancel-in-progress: true
-
-permissions:
-  id-token: write
-  contents: read
-
-jobs:
-
-  get-label-type:
-    if: github.repository_owner == 'pytorch'
-    name: get-label-type
-    uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
-    with:
-      triggering_actor: ${{ github.triggering_actor }}
-      issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
-      curr_branch: ${{ github.head_ref || github.ref_name }}
-      curr_ref_type: ${{ github.ref_type }}
-
-  linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-cutlass-backend:
-    name: linux-jammy-cuda12.8-py3.10-gcc11-sm90-cutlass-backend
-    uses: ./.github/workflows/_linux-build.yml
-    needs: get-label-type
-    with:
-      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
-      build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90-cutlass-backend
-      docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
-      cuda-arch-list: '9.0'
-      test-matrix: |
-        { include: [
-          { config: "h100_cutlass_backend", shard: 1, num_shards: 1, runner: "linux.aws.h100", owners: ["oncall:pt2"] },
-        ]}
-    secrets: inherit
-
-  linux-jammy-cuda12_8-py3_10-gcc11-sm90-test:
-    name: linux-jammy-cuda12.8-py3.10-gcc11-sm90-cutlass-backend
-    uses: ./.github/workflows/_linux-test.yml
-    needs:
-      - linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-cutlass-backend
-    with:
-      build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90-cutlass-backend
-      docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-cutlass-backend.outputs.docker-image }}
-      test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-cutlass-backend.outputs.test-matrix }}
-    secrets: inherit
--- a/.github/workflows/inductor-nightly.yml
+++ b/.github/workflows/inductor-nightly.yml
@ -48,7 +48,6 @@ jobs:
          { config: "dynamic_cpu_max_autotune_inductor_amp_freezing_torchbench", shard: 1, num_shards: 2, runner: "linux.8xlarge.amx" },
          { config: "dynamic_cpu_max_autotune_inductor_amp_freezing_torchbench", shard: 2, num_shards: 2, runner: "linux.8xlarge.amx" },
        ]}
-      build-additional-packages: "vision audio torchao"
    secrets: inherit

  linux-jammy-cpu-py3_9-gcc11-nightly-dynamo-benchmarks-test:
--- a/.github/workflows/inductor-perf-compare.yml
+++ b/.github/workflows/inductor-perf-compare.yml
@ -43,7 +43,6 @@ jobs:
          { config: "inductor_timm_perf_compare", shard: 2, num_shards: 2, runner: "linux.aws.a100" },
          { config: "inductor_torchbench_perf_compare", shard: 1, num_shards: 1, runner: "linux.aws.a100" },
        ]}
-      build-additional-packages: "vision audio fbgemm torchao"
    secrets: inherit

  test:
--- a/.github/workflows/inductor-perf-test-nightly-aarch64.yml
+++ b/.github/workflows/inductor-perf-test-nightly-aarch64.yml
@ -116,7 +116,6 @@ jobs:
          { config: "inductor_torchbench_perf_cpu_aarch64", shard: 15, num_shards: 15, runner: "linux.arm64.m7g.metal" },
        ]}
      selected-test-configs: ${{ inputs.benchmark_configs }}
-      build-additional-packages: "vision audio torchao"
    secrets: inherit


--- a/.github/workflows/inductor-perf-test-nightly-h100.yml
+++ b/.github/workflows/inductor-perf-test-nightly-h100.yml
@ -2,7 +2,7 @@ name: inductor-perf-nightly-h100

 on:
  schedule:
-    - cron: 15 0,12 * * 1-6
+    - cron: 15 0,4,8,12,16,20 * * 1-6
    - cron: 0 7 * * 0
  # NB: GitHub has an upper limit of 10 inputs here, so before we can sort it
  # out, let try to run torchao cudagraphs_low_precision as part of cudagraphs
@ -86,11 +86,6 @@ jobs:
    needs: get-label-type
    with:
      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
-      # Use a bigger runner here because CUDA_ARCH 9.0 is only built for H100
-      # or newer GPUs, so it doesn't benefit much from existing compiler cache
-      # from trunk. Also use a memory-intensive runner here because memory is
-      # usually the bottleneck
-      runner: linux.12xlarge.memory
      build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm90
      docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
      cuda-arch-list: '9.0'
@ -119,14 +114,13 @@ jobs:
          { config: "inductor_torchbench_perf_cuda_h100", shard: 9, num_shards: 9, runner: "linux.aws.h100" },
        ]}
      selected-test-configs: ${{ inputs.benchmark_configs }}
-      build-additional-packages: "vision audio fbgemm torchao"
    secrets: inherit

  test-periodically:
    name: cuda12.8-py3.10-gcc9-sm90
    uses: ./.github/workflows/_linux-test.yml
    needs: build
-    if: github.event.schedule == '15 0,12 * * 1-6'
+    if: github.event.schedule == '15 0,4,8,12,16,20 * * 1-6'
    with:
      build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm90
      dashboard-tag: training-true-inference-true-default-true-dynamic-true-cudagraphs-true-cppwrapper-true-aotinductor-true-freezing_cudagraphs-true-cudagraphs_low_precision-true
--- a/.github/workflows/inductor-perf-test-nightly-x86.yml
+++ b/.github/workflows/inductor-perf-test-nightly-x86.yml
@ -98,7 +98,6 @@ jobs:
          { config: "inductor_torchbench_perf_cpu_x86", shard: 4, num_shards: 4, runner: "linux.24xl.spr-metal" },
        ]}
      selected-test-configs: ${{ inputs.benchmark_configs }}
-      build-additional-packages: "vision audio torchao"
    secrets: inherit

  linux-jammy-cpu-py3_9-gcc11-inductor-test-nightly-freezing:
--- a/.github/workflows/inductor-perf-test-nightly.yml
+++ b/.github/workflows/inductor-perf-test-nightly.yml
@ -86,8 +86,6 @@ jobs:
    needs: get-label-type
    with:
      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
-      # Every bit to make perf run faster helps
-      runner: linux.12xlarge.memory
      build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
      docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9-inductor-benchmarks
      cuda-arch-list: '8.0'
@ -114,7 +112,6 @@ jobs:
          { config: "cachebench", shard: 2, num_shards: 2, runner: "linux.aws.a100" },
        ]}
      selected-test-configs: ${{ inputs.benchmark_configs }}
-      build-additional-packages: "vision audio fbgemm torchao"
    secrets: inherit

  test-nightly:
--- a/.github/workflows/inductor-periodic.yml
+++ b/.github/workflows/inductor-periodic.yml
@ -58,7 +58,6 @@ jobs:
          { config: "dynamic_aot_eager_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
          { config: "dynamic_aot_eager_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
        ]}
-      build-additional-packages: "vision audio fbgemm torchao"
    secrets: inherit

  linux-jammy-cuda12_8-py3_10-gcc9-periodic-dynamo-benchmarks-test:
@ -126,7 +125,6 @@ jobs:
        { include: [
          { config: "inductor_torchbench_smoketest_perf", shard: 1, num_shards: 1, runner: "linux.aws.a100" },
        ]}
-      build-additional-packages: "vision audio fbgemm torchao"
    secrets: inherit

  linux-jammy-cuda12_8-py3_10-gcc9-inductor-smoke-test:
@ -161,7 +159,6 @@ jobs:
          { config: "cpu_inductor_freezing_avx2_timm", shard: 1, num_shards: 2, runner: "linux.10xlarge.avx2" },
          { config: "cpu_inductor_freezing_avx2_timm", shard: 2, num_shards: 2, runner: "linux.10xlarge.avx2" },
        ]}
-      build-additional-packages: "vision audio torchao"
    secrets: inherit

  linux-jammy-cpu-py3_9-gcc11-periodic-dynamo-benchmarks-test:
@ -198,7 +195,6 @@ jobs:
          { config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
          { config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
        ]}
-      build-additional-packages: "vision audio fbgemm torchao"
    secrets: inherit

  linux-jammy-cuda12_8-py3_10-gcc9-inductor-test:
@ -244,7 +240,6 @@ jobs:
          { config: "dynamic_cpu_aot_inductor_amp_freezing_torchbench", shard: 1, num_shards: 2, runner: "linux.8xlarge.amx" },
          { config: "dynamic_cpu_aot_inductor_amp_freezing_torchbench", shard: 2, num_shards: 2, runner: "linux.8xlarge.amx" },
        ]}
-      build-additional-packages: "vision audio torchao"
    secrets: inherit

  linux-jammy-cpu-py3_9-gcc11-inductor-test:
--- a/.github/workflows/inductor.yml
+++ b/.github/workflows/inductor.yml
@ -62,7 +62,6 @@ jobs:
          { config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
          { config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
        ]}
-      build-additional-packages: "vision audio fbgemm torchao"
    secrets: inherit

  linux-jammy-cuda12_8-py3_10-gcc9-inductor-test:
@ -95,7 +94,6 @@ jobs:
          { config: "dynamic_cpu_inductor_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.8xlarge.amx" },
          { config: "inductor_torchbench_cpu_smoketest_perf", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.24xl.spr-metal" },
        ]}
-      build-additional-packages: "vision audio torchao"
    secrets: inherit

  linux-jammy-cpu-py3_9-gcc11-inductor-test:
--- a/.github/workflows/lint.yml
+++ b/.github/workflows/lint.yml
@ -26,30 +26,9 @@ jobs:
      triggering_actor: ${{ github.triggering_actor }}
      issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
      curr_branch: ${{ github.head_ref || github.ref_name }}
-
-  get-changed-files:
-    if: github.repository_owner == 'pytorch'
-    name: Get changed files
-    uses: ./.github/workflows/_get-changed-files.yml
-
  lintrunner-clang:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
-    needs: [get-label-type, get-changed-files]
-    # Only run if there are changed files relevant to clangtidy / clangformat
-    if: |
-      github.repository_owner == 'pytorch' && (
-        needs.get-changed-files.outputs.changed-files == '*' ||
-        contains(needs.get-changed-files.outputs.changed-files, '.h') ||
-        contains(needs.get-changed-files.outputs.changed-files, '.cpp') ||
-        contains(needs.get-changed-files.outputs.changed-files, '.cc') ||
-        contains(needs.get-changed-files.outputs.changed-files, '.cxx') ||
-        contains(needs.get-changed-files.outputs.changed-files, '.hpp') ||
-        contains(needs.get-changed-files.outputs.changed-files, '.hxx') ||
-        contains(needs.get-changed-files.outputs.changed-files, '.cu') ||
-        contains(needs.get-changed-files.outputs.changed-files, '.cuh') ||
-        contains(needs.get-changed-files.outputs.changed-files, '.mm') ||
-        contains(needs.get-changed-files.outputs.changed-files, '.metal')
-      )
+    needs: get-label-type
    with:
      timeout: 120
      runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
@ -60,44 +39,13 @@ jobs:
      submodules: true
      ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
      script: |
-        CHANGED_FILES="${{ needs.get-changed-files.outputs.changed-files }}"
-        if [ "$CHANGED_FILES" = "*" ]; then
-          export ADDITIONAL_LINTRUNNER_ARGS="--take CLANGTIDY,CLANGFORMAT --all-files"
-        else
-          export ADDITIONAL_LINTRUNNER_ARGS="--take CLANGTIDY,CLANGFORMAT $CHANGED_FILES"
-        fi
+        export ADDITIONAL_LINTRUNNER_ARGS="--take CLANGTIDY,CLANGFORMAT --all-files"
        export CLANG=1
        .github/scripts/lintrunner.sh

-  # NOTE: mypy needs its own job because it depends on --all-files, without assessing all files it sometimes
-  #       fails to find types when it should
-  lintrunner-mypy:
-    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
-    needs: [get-label-type, get-changed-files]
-    # Only run if there are changed files relevant to mypy
-    if: |
-      github.repository_owner == 'pytorch' && (
-        needs.get-changed-files.outputs.changed-files == '*' ||
-        contains(needs.get-changed-files.outputs.changed-files, '.py') ||
-        contains(needs.get-changed-files.outputs.changed-files, '.pyi')
-      )
-    with:
-      timeout: 120
-      runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
-      docker-image: ci-image:pytorch-linux-jammy-linter
-      # NB: A shallow checkout won't work here because calculate-docker-image requires a full checkout
-      # to run git rev-parse HEAD~:.ci/docker when a new image is needed
-      fetch-depth: 0
-      submodules: true
-      ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
-      script: |
-        CHANGED_FILES="${{ needs.get-changed-files.outputs.changed-files }}"
-        echo "Running mypy"
-        ADDITIONAL_LINTRUNNER_ARGS="--take MYPY --all-files" .github/scripts/lintrunner.sh
-
  lintrunner-noclang:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
-    needs: [get-label-type, get-changed-files]
+    needs: get-label-type
    with:
      timeout: 120
      runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
@ -108,13 +56,8 @@ jobs:
      submodules: true
      ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
      script: |
-        CHANGED_FILES="${{ needs.get-changed-files.outputs.changed-files }}"
-        echo "Running all other linters"
-        if [ "$CHANGED_FILES" = '*' ]; then
-          ADDITIONAL_LINTRUNNER_ARGS="--skip CLANGTIDY,CLANGFORMAT,MYPY --all-files" .github/scripts/lintrunner.sh
-        else
-          ADDITIONAL_LINTRUNNER_ARGS="--skip CLANGTIDY,CLANGFORMAT,MYPY ${CHANGED_FILES}" .github/scripts/lintrunner.sh
-        fi
+        export ADDITIONAL_LINTRUNNER_ARGS="--skip CLANGTIDY,CLANGFORMAT --all-files"
+        .github/scripts/lintrunner.sh

  quick-checks:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
@ -317,7 +260,6 @@ jobs:
          check-latest: false
          cache: pip
          cache-dependency-path: |
-            **/requirements-build.txt
            **/requirements.txt
      - name: Setup Min Python version
        if: matrix.test_type != 'older_python_version'
@ -328,7 +270,6 @@ jobs:
          check-latest: false
          cache: pip
          cache-dependency-path: |
-            **/requirements-build.txt
            **/requirements.txt
      - name: Install torch
        if: matrix.test_type == 'with_torch'
--- a/.github/workflows/nightly.yml
+++ b/.github/workflows/nightly.yml
@ -83,10 +83,6 @@ jobs:
            repo-owner: triton-lang
            branch: main
            pin-folder: .ci/docker/ci_commit_pins
-          - repo-name: vllm
-            repo-owner: vllm-project
-            branch: main
-            pin-folder: .github/ci_commit_pins
    # Allow this to be triggered on either a schedule or on workflow_dispatch to allow for easier testing
    if: github.repository_owner == 'pytorch' && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
    steps:
--- a/.github/workflows/periodic.yml
+++ b/.github/workflows/periodic.yml
@ -82,36 +82,6 @@ jobs:
      test-matrix: ${{ needs.linux-jammy-cuda12_4-py3_10-gcc11-sm89-build.outputs.test-matrix }}
    secrets: inherit

-  linux-jammy-cuda12_4-py3_10-gcc11-build:
-    name: linux-jammy-cuda12.4-py3.10-gcc11
-    uses: ./.github/workflows/_linux-build.yml
-    needs: get-label-type
-    with:
-      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
-      build-environment: linux-jammy-cuda12.4-py3.10-gcc11
-      docker-image-name: ci-image:pytorch-linux-jammy-cuda12.4-cudnn9-py3-gcc11
-      test-matrix: |
-        { include: [
-          { config: "legacy_nvidia_driver", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
-          { config: "legacy_nvidia_driver", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
-          { config: "legacy_nvidia_driver", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
-          { config: "legacy_nvidia_driver", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
-          { config: "legacy_nvidia_driver", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
-        ]}
-    secrets: inherit
-
-  linux-jammy-cuda12_4-py3_10-gcc11-test:
-    name: linux-jammy-cuda12.4-py3.10-gcc11
-    uses: ./.github/workflows/_linux-test.yml
-    needs:
-      - linux-jammy-cuda12_4-py3_10-gcc11-build
-      - target-determination
-    with:
-      build-environment: linux-jammy-cuda12.4-py3.10-gcc11
-      docker-image: ${{ needs.linux-jammy-cuda12_4-py3_10-gcc11-build.outputs.docker-image }}
-      test-matrix: ${{ needs.linux-jammy-cuda12_4-py3_10-gcc11-build.outputs.test-matrix }}
-    secrets: inherit
-
  linux-jammy-cuda12_8-py3_10-gcc11-build:
    name: linux-jammy-cuda12.8-py3.10-gcc11
    uses: ./.github/workflows/_linux-build.yml
@ -157,6 +127,7 @@ jobs:
          { config: "multigpu", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.12xlarge.nvidia.gpu", owners: ["oncall:distributed"] },
          { config: "multigpu", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.12xlarge.nvidia.gpu", owners: ["oncall:distributed"] },
        ]}
+      build-with-debug: false
    secrets: inherit

  linux-jammy-cuda12_8-py3_9-gcc9-test:
@ -177,6 +148,7 @@ jobs:
      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
      build-environment: linux-jammy-cuda12.8-py3.10-gcc9-debug
      docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9
+      build-with-debug: true
      test-matrix: |
        { include: [
          { config: "default", shard: 1, num_shards: 7, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu", owners: ["oncall:debug-build"] },
--- a/.github/workflows/pull.yml
+++ b/.github/workflows/pull.yml
@ -315,6 +315,21 @@ jobs:
      test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-build.outputs.test-matrix }}
    secrets: inherit

+  linux-jammy-py3-clang18-mobile-build:
+    name: linux-jammy-py3-clang18-mobile-build
+    uses: ./.github/workflows/_linux-build.yml
+    needs: get-label-type
+    with:
+      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
+      build-environment: linux-jammy-py3-clang12-mobile-build
+      docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan
+      build-generates-artifacts: false
+      test-matrix: |
+        { include: [
+          { config: "default", shard: 1, num_shards: 1 },
+        ]}
+    secrets: inherit
+
  linux-jammy-cuda12_8-cudnn9-py3_9-clang12-build:
    name: linux-jammy-cuda12.8-cudnn9-py3.9-clang12
    uses: ./.github/workflows/_linux-build.yml
--- a/.github/workflows/revert.yml
+++ b/.github/workflows/revert.yml
@ -26,7 +26,7 @@ jobs:
          architecture: x64
          check-latest: false
          cache: pip
-      - run: pip install pyyaml==6.0.2
+      - run: pip install pyyaml==6.0

      - name: Setup committer id
        run: |
--- a/.github/workflows/rocm-mi355.yml
+++ b/.github/workflows/rocm-mi355.yml
@ -1,68 +0,0 @@
-name: rocm-mi355
-
-on:
-  workflow_dispatch:
-  schedule:
-    - cron: 30 9 * * *  # about 2:30am PDT
-
-concurrency:
-  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
-  cancel-in-progress: true
-
-permissions: read-all
-
-jobs:
-  target-determination:
-    if: github.repository_owner == 'pytorch'
-    name: before-test
-    uses: ./.github/workflows/target_determination.yml
-    permissions:
-      id-token: write
-      contents: read
-
-  get-label-type:
-    name: get-label-type
-    uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
-    if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}
-    with:
-      triggering_actor: ${{ github.triggering_actor }}
-      issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
-      curr_branch: ${{ github.head_ref || github.ref_name }}
-      curr_ref_type: ${{ github.ref_type }}
-
-  linux-noble-rocm-py3_12-build:
-    if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}
-    name: linux-noble-rocm-py3.12-mi355
-    uses: ./.github/workflows/_linux-build.yml
-    needs: get-label-type
-    with:
-      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
-      build-environment: linux-noble-rocm-py3.12-mi355
-      docker-image-name: ci-image:pytorch-linux-noble-rocm-alpha-py3
-      sync-tag: rocm-build
-      test-matrix: |
-        { include: [
-          { config: "default", shard: 1, num_shards: 6, runner: "linux.rocm.gpu.mi355.2" },
-          { config: "default", shard: 2, num_shards: 6, runner: "linux.rocm.gpu.mi355.2" },
-          { config: "default", shard: 3, num_shards: 6, runner: "linux.rocm.gpu.mi355.2" },
-          { config: "default", shard: 4, num_shards: 6, runner: "linux.rocm.gpu.mi355.2" },
-          { config: "default", shard: 5, num_shards: 6, runner: "linux.rocm.gpu.mi355.2" },
-          { config: "default", shard: 6, num_shards: 6, runner: "linux.rocm.gpu.mi355.2" },
-        ]}
-    secrets: inherit
-
-  linux-noble-rocm-py3_12-test:
-    permissions:
-      id-token: write
-      contents: read
-    name: linux-noble-rocm-py3.12-mi355
-    uses: ./.github/workflows/_rocm-test.yml
-    needs:
-      - linux-noble-rocm-py3_12-build
-      - target-determination
-    with:
-      build-environment: linux-noble-rocm-py3.12-mi355
-      docker-image: ${{ needs.linux-noble-rocm-py3_12-build.outputs.docker-image }}
-      test-matrix: ${{ needs.linux-noble-rocm-py3_12-build.outputs.test-matrix }}
-      tests-to-include: "test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor"
-    secrets: inherit
--- a/.github/workflows/test-h100.yml
+++ b/.github/workflows/test-h100.yml
@ -37,7 +37,7 @@ jobs:
    needs: get-label-type
    with:
      runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
-      runner: linux.12xlarge.memory
+      runner: "linux.12xlarge"
      build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90
      docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
      cuda-arch-list: '9.0'
--- a/.github/workflows/trymerge.yml
+++ b/.github/workflows/trymerge.yml
@ -28,7 +28,7 @@ jobs:
          check-latest: false
          cache: pip
          architecture: x64
-      - run: pip install pyyaml==6.0.2
+      - run: pip install pyyaml==6.0

      - name: Setup committer id
        run: |
--- a/.github/workflows/tryrebase.yml
+++ b/.github/workflows/tryrebase.yml
@ -25,7 +25,7 @@ jobs:
          architecture: x64
          check-latest: false
          cache: pip
-      - run: pip install pyyaml==6.0.2
+      - run: pip install pyyaml==6.0

      - name: Setup committer id
        run: |
--- a/.github/workflows/upload-test-stats.yml
+++ b/.github/workflows/upload-test-stats.yml
@ -14,7 +14,6 @@ on:
      - inductor-periodic
      - rocm
      - rocm-mi300
-      - rocm-mi355
      - inductor-micro-benchmark
      - inductor-micro-benchmark-x86
      - inductor-cu124
--- a/.github/workflows/win-arm64-build-test.yml
+++ b/.github/workflows/win-arm64-build-test.yml
@ -1,187 +0,0 @@
-name: windows-arm64-build-test
-
-on:
-  push:
-    tags:
-      - ciflow/win-arm64/*
-
-env:
-  GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
-  PYTHON_VERSION: "3.12"
-  PYTORCH_ROOT: ${{ github.workspace }}/pytorch
-  DOWNLOADS_DIR: c:\temp\downloads
-  DEPENDENCIES_DIR: c:\temp\dependencies
-  ENABLE_APL: 1
-  ENABLE_OPENBLAS: 0
-  BUILD_TYPE: release
-
-permissions:
-  id-token: write
-  contents: read
-
-jobs:
-  build:
-    # Don't run on forked repos.
-    if: github.repository_owner == 'pytorch'
-    runs-on: "windows-11-arm64-preview"
-    timeout-minutes: 240
-    steps:
-      - name: configure aws credentials
-        id: aws_creds
-        uses: aws-actions/configure-aws-credentials@v4
-        with:
-          role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_sscache
-          aws-region: us-east-1
-          role-duration-seconds: 18000
-
-      - name: Enable long paths
-        shell: cmd
-        run: |
-          git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
-          git config --system core.longpaths true
-
-      - name: Git checkout PyTorch
-        uses: actions/checkout@v4
-        with:
-          path: pytorch
-          submodules: recursive
-
-      - name: Bootstrap Python
-        shell: cmd
-        run: |
-          "pytorch/.ci/pytorch/windows/arm64/bootstrap_python.bat"
-
-      - name: Parse ref
-        id: parse-ref
-        shell: bash
-        run: python pytorch/.github/scripts/parse_ref.py
-
-      - name: Get workflow job id
-        shell: bash
-        id: get-job-id
-        run: |
-          set -eux
-          python pytorch/.github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}"
-        env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
-      - name: Bootstrap APL
-        shell: cmd
-        run: |
-          "pytorch/.ci/pytorch/windows/arm64/bootstrap_apl.bat"
-
-      - name: Bootstrap Rust
-        shell: cmd
-        run: |
-          "pytorch/.ci/pytorch/windows/arm64/bootstrap_rust.bat"
-
-      - name: Bootstrap sccache
-        shell: cmd
-        run: |
-          "pytorch/.ci/pytorch/windows/arm64/bootstrap_sccache.bat"
-
-      - name: Bootstrap Libuv
-        shell: cmd
-        run: |
-          "pytorch/.ci/pytorch/windows/arm64/bootstrap_libuv.bat"
-
-      - name: Build
-        id: build
-        shell: cmd
-        env:
-          PYTORCH_FINAL_PACKAGE_DIR: C:/${{ github.run_id }}/build-results/
-          BRANCH: ${{ steps.parse-ref.outputs.branch }}
-          BUILD_WHEEL: 1
-          MAX_JOBS: 8
-          PYTHON_VERSION: "3.12"
-          SCCACHE_BUCKET: "ossci-compiler-cache"
-          SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
-          SCCACHE_REGION: us-east-1
-          VC_PRODUCT: "BuildTools"
-          VC_VERSION: ""
-          ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
-          AWS_DEFAULT_REGION: us-east-1
-          USE_CUDA: '0'
-          USE_XPU: '0'
-          OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
-        run: |
-          cd pytorch
-          call "C:\Program Files\Microsoft Visual Studio\2022\Enterprise\VC\Auxiliary\Build\vcvarsall.bat" arm64
-          powershell -ExecutionPolicy Bypass -File ".ci/pytorch/win-arm64-build.ps1"
-
-      - name: Upload artifacts
-        uses: actions/upload-artifact@v4.4.0
-        if: always()
-        with:
-          name: torch-wheel-win-arm64-py3-12
-          retention-days: 14
-          if-no-files-found: error
-          path: C:\${{ github.run_id }}\build-results
-
-  test:
-    if: github.repository_owner == 'pytorch'
-    strategy:
-      fail-fast: false
-    runs-on: "windows-11-arm64-preview"
-    needs: build
-    steps:
-      - name: Enable long paths
-        shell: cmd
-        run: |
-          git config --system --get core.longpaths || echo "core.longpaths is not set, setting it now"
-          git config --system core.longpaths true
-
-      - name: Git checkout PyTorch
-        uses: actions/checkout@v4
-        with:
-          path: pytorch
-          submodules: recursive
-
-      - name: Bootstrap Python
-        shell: cmd
-        run: |
-          "pytorch/.ci/pytorch/windows/arm64/bootstrap_python.bat"
-
-      - name: Bootstrap Rust
-        shell: cmd
-        run: |
-          "pytorch/.ci/pytorch/windows/arm64/bootstrap_rust.bat"
-
-      - name: Get workflow job id
-        shell: bash
-        id: get-job-id
-        run: |
-          set -eux
-          python pytorch/.github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}"
-        env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-
-      - name: Download Build Artifacts
-        uses: actions/download-artifact@v4.1.7
-        with:
-          name: torch-wheel-win-arm64-py3-12
-          path: C:\${{ github.run_id }}\build-results
-
-      - name: Test
-        id: test
-        shell: cmd
-        env:
-          USE_CUDA: '0'
-          INSTALL_WINDOWS_SDK: 1
-          PYTHON_VERSION: "3.12"
-          VC_PRODUCT: "BuildTools"
-          AWS_DEFAULT_REGION: us-east-1
-          GITHUB_REPOSITORY: ${{ github.repository }}
-          GITHUB_WORKFLOW: ${{ github.workflow }}
-          GITHUB_JOB: ${{ github.job }}
-          GITHUB_RUN_ID: ${{ github.run_id }}
-          GITHUB_RUN_NUMBER: ${{ github.run_number }}
-          GITHUB_RUN_ATTEMPT: ${{ github.run_attempt }}
-          JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
-          JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}
-          PYTORCH_FINAL_PACKAGE_DIR: C:/${{ github.run_id }}/build-results/
-        run: |
-          mkdir "%PYTORCH_FINAL_PACKAGE_DIR%"
-          call pytorch/.ci/pytorch/windows/arm64/bootstrap_tests.bat
-          set GIT_BASH=C:\Program Files\Git\usr\bin\bash.exe
-          "%GIT_BASH%" -c "bash --noprofile --norc .ci/pytorch/win-arm64-test.sh"
--- a/.lintrunner.toml
+++ b/.lintrunner.toml
@ -39,16 +39,16 @@ init_command = [
    'python3',
    'tools/linter/adapters/pip_init.py',
    '--dry-run={{DRYRUN}}',
-    'flake8==7.3.0',
-    'flake8-bugbear==24.12.12',
-    'flake8-comprehensions==3.16.0',
+    'flake8==6.1.0',
+    'flake8-bugbear==23.3.23',
+    'flake8-comprehensions==3.15.0',
    'flake8-executable==2.1.3',
-    'flake8-logging-format==2024.24.12',
-    'flake8-pyi==25.5.0',
-    'flake8-simplify==0.22.0',
+    'flake8-logging-format==0.9.0',
+    'flake8-pyi==23.3.1',
+    'flake8-simplify==0.19.3',
    'mccabe==0.7.0',
-    'pycodestyle==2.14.0',
-    'pyflakes==3.4.0',
+    'pycodestyle==2.11.1',
+    'pyflakes==3.1.0',
    'torchfix==0.4.0 ; python_version >= "3.9" and python_version < "3.13"',
 ]

@ -158,7 +158,7 @@ init_command = [
    'mypy==1.16.0',
    'sympy==1.13.3',
    'types-requests==2.27.25',
-    'types-pyyaml==6.0.2',
+    'types-pyyaml==6.0.1',
    'types-tabulate==0.8.8',
    'types-protobuf==5.29.1.20250403',
    'types-setuptools==79.0.0.20250422',
@ -166,8 +166,8 @@ init_command = [
    'types-colorama==0.4.6',
    'filelock==3.13.1',
    'junitparser==2.1.1',
-    'rich==14.1.0',
-    'pyyaml==6.0.2',
+    'rich==10.9.0',
+    'pyyaml==6.0.1',
    'optree==0.13.0',
    'dataclasses-json==0.6.7',
    'pandas==2.2.3',
@ -500,7 +500,7 @@ include_patterns = [
    '**/*.h',
 ]
 exclude_patterns = [
-    'torch/headeronly/macros/Macros.h',
+    'c10/macros/Macros.h',
 ]
 command = [
    'python3',
@ -523,7 +523,7 @@ include_patterns = [
    '**/*.h',
 ]
 exclude_patterns = [
-    'torch/headeronly/macros/Macros.h',
+    'c10/macros/Macros.h',
 ]
 command = [
    'python3',
@ -1111,7 +1111,7 @@ init_command = [
    'python3',
    'tools/linter/adapters/pip_init.py',
    '--dry-run={{DRYRUN}}',
-    'pyyaml==6.0.2',
+    'PyYAML==6.0.1',
 ]

 [[linter]]
@ -1133,7 +1133,7 @@ init_command = [
    'python3',
    'tools/linter/adapters/pip_init.py',
    '--dry-run={{DRYRUN}}',
-    'pyyaml==6.0.2',
+    'PyYAML==6.0.1',
 ]

 [[linter]]
@ -1162,9 +1162,14 @@ exclude_patterns = [
    # These files are all grandfathered in, feel free to remove from this list
    # as necessary
    # NOTE: remove the patterns in the order they are listed
+    'aten/**',
+    'aten/src/ATen/native/**',
+    'aten/src/ATen/native/q*/**',
    'aten/src/ATen/native/[a-pA-P]*/**',
    'aten/src/ATen/[a-mA-M]*/**',
    'test/**',
+    'test/[a-hA-h]*/**',
+    'torch/distributed/tensor/**',
 ]
 init_command = [
    'python3',
@ -1600,10 +1605,7 @@ is_formatter = true
 # the same line, merge conflicts should not arise in git or hg
 [[linter]]
 code = 'MERGE_CONFLICTLESS_CSV'
-include_patterns = [
-    'benchmarks/dynamo/ci_expected_accuracy/*.csv',
-    'benchmarks/dynamo/pr_time_benchmarks/expected_results.csv',
-]
+include_patterns = ['benchmarks/dynamo/ci_expected_accuracy/*.csv']
 command = [
    'python3',
    'tools/linter/adapters/no_merge_conflict_csv_linter.py',
@ -1794,12 +1796,3 @@ include_patterns = [
    'torch/header_only_apis.txt',
 ]
 is_formatter = false
-
-
-[[linter]]
-code = "GB_REGISTRY"
-include_patterns = ["torch/_dynamo/**/*.py"]
-command = [
-  "python3",
-  "tools/linter/adapters/gb_registry_linter.py",
-]
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@ -1190,6 +1190,10 @@ if(APPLE)
  append_cxx_flag_if_supported("-Wno-missing-braces" CMAKE_CXX_FLAGS)
 endif()

+if(USE_XPU)
+  string(APPEND CMAKE_CXX_FLAGS " -DUSE_XPU")
+endif()
+
 if(EMSCRIPTEN)
  string(
    APPEND
@ -1241,7 +1245,6 @@ if(USE_MIMALLOC AND USE_MIMALLOC_ON_MKL)
 endif()

 # ---[ Main build
-add_subdirectory(torch/headeronly)  # headeronly headers
 add_subdirectory(c10)
 add_subdirectory(caffe2)

--- a/CODEOWNERS
+++ b/CODEOWNERS
@ -136,7 +136,7 @@ torch/profiler/ @sraikund16
 test/functorch/test_aotdispatch.py @ezyang @Chillee

 # Dataloader
-torch/utils/data/ @divyanshk @ramanishsingh @scotts
+torch/utils/data/ @divyanshk @ramanishsingh

 # hipify
 torch/utils/hipify/ @jeffdaily @jithunnair-amd
--- a/Dockerfile
+++ b/Dockerfile
@ -33,7 +33,7 @@ RUN case ${TARGETPLATFORM} in \
         *)              MINICONDA_ARCH=x86_64   ;; \
    esac && \
    curl -fsSL -v -o ~/miniconda.sh -O  "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-${MINICONDA_ARCH}.sh"
-COPY requirements.txt requirements-build.txt .
+COPY requirements.txt .
 # Manually invoke bash on miniconda script per https://github.com/conda/conda/issues/10431
 RUN chmod +x ~/miniconda.sh && \
    bash ~/miniconda.sh -b -p /opt/conda && \
@ -47,6 +47,18 @@ WORKDIR /opt/pytorch
 COPY . .
 RUN git submodule update --init --recursive

+FROM conda as build
+ARG CMAKE_VARS
+WORKDIR /opt/pytorch
+COPY --from=conda /opt/conda /opt/conda
+COPY --from=submodule-update /opt/pytorch /opt/pytorch
+RUN make triton
+RUN --mount=type=cache,target=/opt/ccache \
+    export eval ${CMAKE_VARS} && \
+    TORCH_CUDA_ARCH_LIST="7.0 7.2 7.5 8.0 8.6 8.7 8.9 9.0 9.0a" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
+    CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
+    python -m pip install --no-build-isolation -v .
+
 FROM conda as conda-installs
 ARG PYTHON_VERSION=3.11
 ARG CUDA_PATH=cu121
@ -97,5 +109,4 @@ WORKDIR /workspace

 FROM official as dev
 # Should override the already installed version from the official-image stage
-COPY --from=conda /opt/conda /opt/conda
-COPY --from=submodule-update /opt/pytorch /opt/pytorch
+COPY --from=build /opt/conda /opt/conda
--- a/README.md
+++ b/README.md
@ -294,12 +294,14 @@ Install PyTorch

 ```bash
 export CMAKE_PREFIX_PATH="${CONDA_PREFIX:-'$(dirname $(which conda))/../'}:${CMAKE_PREFIX_PATH}"
+python -m pip install -r requirements.txt
 python -m pip install --no-build-isolation -v -e .
 ```

 **On macOS**

 ```bash
+python -m pip install -r requirements.txt
 python -m pip install --no-build-isolation -v -e .
 ```

@ -518,7 +520,7 @@ on [our website](https://pytorch.org/get-started/previous-versions).

 ## Getting Started

-Three pointers to get you started:
+Three-pointers to get you started:
 - [Tutorials: get you started with understanding and using PyTorch](https://pytorch.org/tutorials/)
 - [Examples: easy to understand PyTorch code across all domains](https://github.com/pytorch/examples)
 - [The API Reference](https://pytorch.org/docs/)
--- a/aten/src/ATen/CMakeLists.txt
+++ b/aten/src/ATen/CMakeLists.txt
@ -458,7 +458,7 @@ if(LAPACK_FOUND)
    # would not need this at all), some of our libraries (magma in particular)
    # backend to CPU BLAS/LAPACK implementations, and so it is very important
    # we get the *right* implementation, because even if the symbols are the
-    # same, LAPACK implementations may have different calling conventions.
+    # same, LAPACK implementions may have different calling conventions.
    # This caused https://github.com/pytorch/pytorch/issues/7353
    #
    # We do NOT do this on Linux, since we just rely on torch_cpu to
@ -586,10 +586,17 @@ if(USE_CUDA AND NOT USE_ROCM)
      CUDA::cufft_static_nocallback
    )
   if(NOT BUILD_LAZY_CUDA_LINALG)
-     list(APPEND ATen_CUDA_DEPENDENCY_LIBS
-       CUDA::cusolver_static
-       ${CUDAToolkit_LIBRARY_DIR}/libcusolver_lapack_static.a     # needed for libcusolver_static
-     )
+     if(CUDA_VERSION_MAJOR LESS_EQUAL 11)
+       list(APPEND ATen_CUDA_DEPENDENCY_LIBS
+         CUDA::cusolver_static
+         ${CUDAToolkit_LIBRARY_DIR}/liblapack_static.a     # needed for libcusolver_static
+       )
+     elseif(CUDA_VERSION_MAJOR GREATER_EQUAL 12)
+       list(APPEND ATen_CUDA_DEPENDENCY_LIBS
+         CUDA::cusolver_static
+         ${CUDAToolkit_LIBRARY_DIR}/libcusolver_lapack_static.a     # needed for libcusolver_static
+       )
+     endif()
   endif()
  else()
    list(APPEND ATen_CUDA_DEPENDENCY_LIBS
--- a/aten/src/ATen/Context.cpp
+++ b/aten/src/ATen/Context.cpp
@ -14,9 +14,7 @@
 #include <ATen/cpu/FlushDenormal.h>

 #ifdef USE_FBGEMM
-C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wextra-semi")
 #include <fbgemm/Fbgemm.h>
-C10_DIAGNOSTIC_POP()
 #endif // USE_FBGEMM
 #if defined(__aarch64__) && !defined(C10_MOBILE)
 #include <cpuinfo.h>
@ -29,7 +27,7 @@ namespace {
  These const variables defined the fp32 precisions for different backend
  We have "generic", "cuda", "mkldnn" backend now and we can choose fp32
  prevision from "ieee", "tf32", "bf16" and "none". The "ieee" precision means
-  IEEE standard floating point format, "tf32" and "bf16" means we are allowed to
+  IEEE standard floating point format "tf32" and "bf16" means we are allowed to
  use "tf32" or "bf16" as internal computation data types for fp32 computations.
  And "none" means it is override-able by parent's node

@ -42,7 +40,7 @@ namespace {
 */
 const std::map<std::string, std::vector<std::string>> _fp32_precisions = {
    {"generic", {{"ieee", "tf32", "bf16", "none"}}},
-    {"mkldnn", {{"ieee", "tf32", "bf16", "none"}}},
+    {"mkldnn", {{"ieee", "bf16", "none"}}},
    {"cuda", {{"ieee", "tf32", "none"}}}};

 // Check whether the backend and op are legal
@ -78,9 +76,7 @@ void check_fp32_prec_backend_and_op(

  C10_ALWAYS_INLINE void warn_deprecated_fp32_precision_api(){
    TORCH_WARN_ONCE(
-      "Please use the new API settings to control TF32 behavior, such as torch.backends.cudnn.conv.fp32_precision = 'tf32' "
-      "or torch.backends.cuda.matmul.fp32_precision = 'ieee'. Old settings, e.g, torch.backends.cuda.matmul.allow_tf32 = True, "
-      "torch.backends.cudnn.allow_tf32 = True, allowTF32CuDNN() and allowTF32CuBLAS() will be deprecated after Pytorch 2.9. Please see "
+      "This API is going to be deprecated, please see "
      "https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices"
    );
  }
@ -334,14 +330,6 @@ void Context::setBenchmarkLimitCuDNN(int b) {
  benchmark_limit_cudnn = b;
 }

-bool Context::immediateMiopen() const {
-  return immediate_miopen;
-}
-
-void Context::setImmediateMiopen(bool b) {
-  immediate_miopen = b;
-}
-
 bool Context::allowTF32CuBLAS() const {
 #ifdef USE_ROCM
    const auto allow_tf32 = c10::utils::check_env(hipblaslt_allow_tf32);
@ -380,9 +368,6 @@ Float32MatmulPrecision Context::float32MatmulPrecision() const {
  invalid = invalid ||
      (float32Precision("mkldnn", "matmul") == "bf16" &&
       float32_matmul_precision != at::Float32MatmulPrecision::MEDIUM);
-  invalid = invalid ||
-      (float32Precision("mkldnn", "matmul") == "tf32" &&
-       float32_matmul_precision != at::Float32MatmulPrecision::HIGH);
  TORCH_CHECK(
      !invalid,
      "PyTorch is checking the matmul precision without a specific backend name,",
@ -416,7 +401,7 @@ void Context::setFloat32MatmulPrecision(const std::string &s) {
    } else if (s_ == "high") {
      float32_matmul_precision = at::Float32MatmulPrecision::HIGH;
      setFloat32Precision("cuda", "matmul", "tf32");
-      setFloat32Precision("mkldnn", "matmul", "tf32");
+      setFloat32Precision("mkldnn", "matmul", "ieee");
      return true;
    } else if (s_ == "medium") {
      float32_matmul_precision = at::Float32MatmulPrecision::MEDIUM;
@ -512,7 +497,7 @@ at::BlasBackend Context::blasPreferredBackend() {
      static const std::vector<std::string> archs = {
          "gfx90a", "gfx942",
 #if ROCM_VERSION >= 60300
-          "gfx1100", "gfx1101", "gfx1200", "gfx1201", "gfx908",
+          "gfx1100", "gfx1101", "gfx1200", "gfx1201",
 #endif
 #if ROCM_VERSION >= 60500
          "gfx950"
--- a/aten/src/ATen/Context.h
+++ b/aten/src/ATen/Context.h
@ -205,8 +205,6 @@ class TORCH_API Context {
  void setBenchmarkCuDNN(bool);
  int benchmarkLimitCuDNN() const;
  void setBenchmarkLimitCuDNN(int);
-  bool immediateMiopen() const;
-  void setImmediateMiopen(bool);
  bool deterministicCuDNN() const;
  void setDeterministicCuDNN(bool);
  bool deterministicMkldnn() const;
@ -442,7 +440,6 @@ class TORCH_API Context {
  bool enabled_overrideable = true;
  bool allow_fp16_bf16_reduction_mathSDP = false;
  bool benchmark_cudnn = false;
-  bool immediate_miopen = false;
  Float32MatmulPrecision float32_matmul_precision =
      c10::utils::check_env("TORCH_ALLOW_TF32_CUBLAS_OVERRIDE") == true
      ? at::Float32MatmulPrecision::HIGH
--- a/aten/src/ATen/DLConvertor.cpp
+++ b/aten/src/ATen/DLConvertor.cpp
@ -69,41 +69,37 @@ DLDataType getDLDataType(const Tensor& t) {
    case ScalarType::Float8_e4m3fn:
    case ScalarType::Float8_e4m3fnuz:
    case ScalarType::Float8_e8m0fnu:
-      TORCH_CHECK_BUFFER(false, "float8 types are not supported by dlpack");
+      TORCH_CHECK(false, "float8 types are not supported by dlpack");
      break;
    case ScalarType::Float4_e2m1fn_x2:
-      TORCH_CHECK_BUFFER(false, "float4 types are not supported by dlpack");
+      TORCH_CHECK(false, "float4 types are not supported by dlpack");
      break;
    case ScalarType::QInt8:
    case ScalarType::QUInt8:
    case ScalarType::QInt32:
    case ScalarType::QUInt4x2:
    case ScalarType::QUInt2x4:
-      TORCH_CHECK_BUFFER(false, "QUInt/QInt types are not supported by dlpack");
+      TORCH_CHECK(false, "QUInt/QInt types are not supported by dlpack");
      break;
    case ScalarType::Bits1x8:
    case ScalarType::Bits2x4:
    case ScalarType::Bits4x2:
    case ScalarType::Bits8:
    case ScalarType::Bits16:
-      TORCH_CHECK_BUFFER(false, "Bit types are not supported by dlpack");
+      TORCH_CHECK(false, "Bit types are not supported by dlpack");
      break;
    case ScalarType::Undefined:
-      TORCH_CHECK_BUFFER(false, "Undefined is not a valid ScalarType");
+      TORCH_CHECK(false, "Undefined is not a valid ScalarType");
    case ScalarType::NumOptions:
-      TORCH_CHECK_BUFFER(false, "NumOptions is not a valid ScalarType");
+      TORCH_CHECK(false, "NumOptions is not a valid ScalarType");
  }
  return dtype;
 }

-DLDevice torchDeviceToDLDevice(at::Device device) {
+static DLDevice getDLDevice(const Tensor& tensor, c10::DeviceIndex device_id) {
  DLDevice ctx;
-
-  ctx.device_id = (device.is_cuda() || device.is_privateuseone())
-      ? static_cast<int32_t>(static_cast<unsigned char>(device.index()))
-      : 0;
-
-  switch (device.type()) {
+  ctx.device_id = static_cast<int32_t>(static_cast<unsigned char>(device_id));
+  switch (tensor.device().type()) {
    case DeviceType::CPU:
      ctx.device_type = DLDeviceType::kDLCPU;
      break;
@ -124,7 +120,8 @@ DLDevice torchDeviceToDLDevice(at::Device device) {
      break;
    case DeviceType::XPU:
      ctx.device_type = DLDeviceType::kDLOneAPI;
-      ctx.device_id = at::detail::getXPUHooks().getGlobalIdxFromDevice(device);
+      ctx.device_id =
+          at::detail::getXPUHooks().getGlobalIdxFromDevice(tensor.device());
      break;
    case DeviceType::MAIA:
      ctx.device_type = DLDeviceType::kDLMAIA;
@ -132,52 +129,45 @@ DLDevice torchDeviceToDLDevice(at::Device device) {
    case DeviceType::PrivateUse1:
      ctx.device_type = DLDeviceType::kDLExtDev;
      break;
-    case DeviceType::MPS:
-      ctx.device_type = DLDeviceType::kDLMetal;
-      break;
    default:
-      TORCH_CHECK_BUFFER(false, "Cannot pack tensors on " + device.str());
+      TORCH_CHECK(false, "Cannot pack tensors on " + tensor.device().str());
  }
-
  return ctx;
 }

-static Device getATenDevice(DLDeviceType type, c10::DeviceIndex index, void* data = nullptr) {
-  switch (type) {
+static Device getATenDevice(const DLDevice& ctx, void* data) {
+  switch (ctx.device_type) {
    case DLDeviceType::kDLCPU:
      return at::Device(DeviceType::CPU);
 #ifndef USE_ROCM
    // if we are compiled under HIP, we cannot do cuda
    case DLDeviceType::kDLCUDA:
-      return at::Device(DeviceType::CUDA, index);
+      return at::Device(DeviceType::CUDA, static_cast<c10::DeviceIndex>(ctx.device_id));
 #endif
    case DLDeviceType::kDLOpenCL:
-      return at::Device(DeviceType::OPENCL, index);
+      return at::Device(DeviceType::OPENCL, static_cast<c10::DeviceIndex>(ctx.device_id));
    case DLDeviceType::kDLROCM:
 #ifdef USE_ROCM
      // this looks funny, we need to return CUDA here to masquerade
-      return at::Device(DeviceType::CUDA, index);
+      return at::Device(DeviceType::CUDA, static_cast<c10::DeviceIndex>(ctx.device_id));
 #else
-      return at::Device(DeviceType::HIP, index);
+      return at::Device(DeviceType::HIP, static_cast<c10::DeviceIndex>(ctx.device_id));
 #endif
    case DLDeviceType::kDLOneAPI:
-      TORCH_CHECK(data != nullptr, "Can't get ATen device for XPU without XPU data.");
      return at::detail::getXPUHooks().getDeviceFromPtr(data);
    case DLDeviceType::kDLMAIA:
-      return at::Device(DeviceType::MAIA, index);
+      return at::Device(DeviceType::MAIA, static_cast<c10::DeviceIndex>(ctx.device_id));
    case DLDeviceType::kDLExtDev:
-      return at::Device(DeviceType::PrivateUse1, index);
-    case DLDeviceType::kDLMetal:
-      return at::Device(DeviceType::MPS, index);
+      return at::Device(DeviceType::PrivateUse1, static_cast<c10::DeviceIndex>(ctx.device_id));
    default:
-      TORCH_CHECK_BUFFER(
-          false, "Unsupported device_type: ", std::to_string(type));
+      TORCH_CHECK(
+          false, "Unsupported device_type: ", std::to_string(ctx.device_type));
  }
 }

 ScalarType toScalarType(const DLDataType& dtype) {
  ScalarType stype = ScalarType::Undefined;
-  TORCH_CHECK_BUFFER(dtype.lanes == 1, "ATen does not support lanes != 1");
+  TORCH_CHECK(dtype.lanes == 1, "ATen does not support lanes != 1");
  switch (dtype.code) {
    case DLDataTypeCode::kDLUInt:
      switch (dtype.bits) {
@ -194,7 +184,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
          stype = ScalarType::UInt64;
          break;
        default:
-          TORCH_CHECK_BUFFER(
+          TORCH_CHECK(
              false, "Unsupported kUInt bits ", std::to_string(dtype.bits));
      }
      break;
@ -213,7 +203,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
          stype = ScalarType::Long;
          break;
        default:
-          TORCH_CHECK_BUFFER(
+          TORCH_CHECK(
              false, "Unsupported kInt bits ", std::to_string(dtype.bits));
      }
      break;
@ -229,7 +219,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
          stype = ScalarType::Double;
          break;
        default:
-          TORCH_CHECK_BUFFER(
+          TORCH_CHECK(
              false, "Unsupported kFloat bits ", std::to_string(dtype.bits));
      }
      break;
@ -239,7 +229,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
          stype = ScalarType::BFloat16;
          break;
        default:
-          TORCH_CHECK_BUFFER(
+          TORCH_CHECK(
              false, "Unsupported kFloat bits ", std::to_string(dtype.bits));
      }
      break;
@ -255,7 +245,7 @@ ScalarType toScalarType(const DLDataType& dtype) {
          stype = ScalarType::ComplexDouble;
          break;
        default:
-          TORCH_CHECK_BUFFER(
+          TORCH_CHECK(
              false, "Unsupported kFloat bits ", std::to_string(dtype.bits));
      }
      break;
@ -265,12 +255,12 @@ ScalarType toScalarType(const DLDataType& dtype) {
          stype = ScalarType::Bool;
          break;
        default:
-          TORCH_CHECK_BUFFER(
+          TORCH_CHECK(
              false, "Unsupported kDLBool bits ", std::to_string(dtype.bits));
      }
      break;
    default:
-      TORCH_CHECK_BUFFER(false, "Unsupported code ", std::to_string(dtype.code));
+      TORCH_CHECK(false, "Unsupported code ", std::to_string(dtype.code));
  }
  return stype;
 }
@ -324,7 +314,11 @@ T* toDLPackImpl(const Tensor& src) {
  atDLMTensor->tensor.manager_ctx = atDLMTensor;
  atDLMTensor->tensor.deleter = &deleter<T>;
  atDLMTensor->tensor.dl_tensor.data = view.data_ptr();
-  atDLMTensor->tensor.dl_tensor.device = torchDeviceToDLDevice(src.device());
+  c10::DeviceIndex device_id = 0;
+  if (src.is_cuda() || src.is_privateuseone()) {
+    device_id = src.get_device();
+  }
+  atDLMTensor->tensor.dl_tensor.device = getDLDevice(src, device_id);
  atDLMTensor->tensor.dl_tensor.ndim = static_cast<int32_t>(src.dim());
  atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src);
  atDLMTensor->tensor.dl_tensor.shape = view.sizes().data();
@ -352,7 +346,7 @@ at::Tensor fromDLPackImpl(T* src, std::function<void(void*)> deleter) {
  }

  DLTensor& dl_tensor = src->dl_tensor;
-  Device device = getATenDevice(dl_tensor.device.device_type, dl_tensor.device.device_id, dl_tensor.data);
+  Device device = getATenDevice(dl_tensor.device, dl_tensor.data);
  ScalarType stype = toScalarType(dl_tensor.dtype);

  if (!dl_tensor.strides) {
@ -394,35 +388,4 @@ Tensor fromDLPackVersioned(DLManagedTensorVersioned* src, std::function<void(voi
  return fromDLPackImpl<DLManagedTensorVersioned>(src, std::move(deleter));
 }

-Tensor maybeCopyTensor(
-    const Tensor& data,
-    std::optional<DLDevice> optional_dl_device,
-    std::optional<bool> copy) {
-  bool force_copy = copy.has_value() && *copy;
-  bool force_move = copy.has_value() && !*copy;
-
-  if (optional_dl_device.has_value()) {
-    auto device = at::getATenDevice(
-        optional_dl_device->device_type,
-        static_cast<c10::DeviceIndex>(optional_dl_device->device_id));
-
-    if (device != data.device()) {
-      TORCH_CHECK_VALUE(
-          !force_move,
-          "cannot move (i.e. copy=False) tensor from ",
-          data.device(),
-          " to ",
-          device,
-          " without copying.");
-      return data.to(device);
-    }
-  }
-
-  if (force_copy) {
-    return data.clone();
-  }
-
-  return data;
-}
-
 } // namespace at
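
Reviewer note on the removed `maybeCopyTensor`: its tri-state `copy` argument (`true`, `false`, or unset) is easy to misread, so here is a minimal sketch of just the decision logic, with tensors and devices replaced by a single `same_device` flag. This is an assumption-laden simplification: in the removed code the device comparison only happens when a `DLDevice` is supplied, and the `Action` enum and `decide_copy_action` name are hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>

// Hypothetical outcome labels for the three return paths of the removed code.
enum class Action { Alias, Clone, MoveToDevice };

// copy == true    -> a copy is required
// copy == false   -> a copy is forbidden
// copy == nullopt -> copy only when a device change makes it necessary
Action decide_copy_action(bool same_device, std::optional<bool> copy) {
  const bool force_copy = copy.has_value() && *copy;
  const bool force_move = copy.has_value() && !*copy;

  if (!same_device) {
    // Changing device always copies, so copy=false cannot be honored.
    if (force_move) {
      throw std::invalid_argument(
          "cannot move tensor to another device without copying");
    }
    return Action::MoveToDevice; // corresponds to data.to(device)
  }
  if (force_copy) {
    return Action::Clone; // corresponds to data.clone()
  }
  return Action::Alias; // corresponds to returning data unchanged
}
```

Note that with `copy == true` and a device mismatch the sketch returns `MoveToDevice` rather than `Clone`, mirroring the removed function, where `data.to(device)` already produces a fresh tensor.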
--- a/aten/src/ATen/DLConvertor.h
+++ b/aten/src/ATen/DLConvertor.h
@ -4,7 +4,7 @@
 #include <ATen/Tensor.h>
 #include <ATen/dlpack.h>

-// this converter will:
+// this convertor will:
 // 1) take a Tensor object and wrap it in the DLPack tensor
 // 2) take a dlpack tensor and convert it to the ATen Tensor

@ -21,16 +21,6 @@ TORCH_API Tensor fromDLPackVersioned(
 TORCH_API DLDataType getDLDataType(const Tensor& t);
 TORCH_API DLDevice getDLContext(const Tensor& tensor, const int64_t& device_id);

-// Copies the Tensor if there's a device mismatch or copy is forced.
-// This should be used before actually creating the DLPack capsule.
-TORCH_API Tensor maybeCopyTensor(
-    const Tensor& data,
-    std::optional<DLDevice> optional_dl_device,
-    std::optional<bool> copy);
-
-// Converts the given at::Device into a DLDevice.
-TORCH_API DLDevice torchDeviceToDLDevice(at::Device device);
-
 // This trait class is used for retrieving different attributes, such as the
 // PyCapsule names and conversion functions for both DLPack tensor classes:
 // `DLManagedTensor` and `DLManagedTensorVersioned`.
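
Reviewer note on the `device_id` conversion in `getDLDevice` (`DLConvertor.cpp`): the expression `static_cast<int32_t>(static_cast<unsigned char>(device_id))` goes through an unsigned 8-bit value before widening. The sketch below shows what that two-step cast does, assuming the device index type is a signed 8-bit integer. The alias `DeviceIndex` and the function `to_dl_device_id` are local stand-ins for illustration, not the c10 definitions.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the device index is a signed 8-bit integer.
using DeviceIndex = std::int8_t;

// Same two-step cast as in the diff: reinterpret the index as an unsigned
// byte first, then widen to the 32-bit field used by DLDevice.
std::int32_t to_dl_device_id(DeviceIndex index) {
  return static_cast<std::int32_t>(static_cast<unsigned char>(index));
}
```

Under that assumption, non-negative indices pass through unchanged, while a negative index such as -1 becomes 255 instead of being sign-extended. That is presumably why the caller in `toDLPackImpl` only forwards a real index for CUDA and PrivateUse1 tensors and uses 0 otherwise.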
--- a/aten/src/ATen/FunctionalInverses.cpp
+++ b/aten/src/ATen/FunctionalInverses.cpp
@ -233,8 +233,8 @@ Tensor FunctionalInverses::slice_Tensor_inverse(const Tensor& base, const Tensor

 // NOLINTNEXTLINE(performance-unnecessary-value-param)
 Tensor FunctionalInverses::split_Tensor_inverse(const Tensor& base, const Tensor& mutated_view, InverseReturnMode inverse_return_mode, int64_t mutated_view_idx, c10::SymInt split_size, int64_t dim) {
-    // It would be nice if this logic could be reused from autograd's split_backward(), but I don't think it can.
-    // For functionalization, we have only have one of the tensors from the TensorList outputted by split(), and we want to layer i
+    // It would be nice if this logic could be re-used from autograd's split_backward(), but I don't think it can.
+    // For functionalization, we have only have one of the tensors from the TensorList outputed by split(), and we want to layer i
    // on top of the base tensor.
    // For autograd, we have all of the tensors outputted by split() and we just want to stack them.
    dim = at::maybe_wrap_dim(dim, base.dim());
--- a/aten/src/ATen/FunctionalTensorWrapper.cpp
+++ b/aten/src/ATen/FunctionalTensorWrapper.cpp
@ -286,11 +286,11 @@ void FunctionalTensorWrapper::storage_resize_(const c10::SymInt& new_size) {
  // storage resizing is severely limited: we only support resizing either to zero, or from zero bytes.
  TORCH_CHECK(new_size == 0 || curr_storage_size == 0, "new_size: ", new_size, ". curr_storage_size: ", curr_storage_size);
  // The "functionalization rule" for storage resizing is a giant no-op, mainly because we don't want
-  // resize_() calls to actually emit any ops in the functional graph.
+  // resize_() calls to actualy emit any ops in the functional graph.
  // How does it work?
  // Resizing up (old size == 0):
  //   We do nothing in this case.
-  //   The expectation is that for the user code to be valid, the next op that should run against the current tensor "x"
+  //   The expection is that for the user code to be valid, the next op that should run against the current tensor "x"
  //   will be a x.copy_(y) (or similar), that will fully overwrite the data of x.
  //   If there are any outstanding aliases of x, we expect them not to be used until after the copy_() call
  //   (otherwise the eager code would be invalid),
@ -327,7 +327,7 @@ void FunctionalTensorWrapper::maybe_replace_storage(const Tensor& other) {
  // We're also no longer re-generate "b" fully from "a" anymore, since "a" refers to a slice of "b"'s data.
  //
  // This is probably fixable in theory, but:
-  // - the fix would likely complicated the functionalization logic quite a bit.
+  // - the fix would likey complicated the functionalization logic quite a bit.
  // - the primary use case for resize_() today is resizing zero-sized tensors in out= variants of operators
  // - resize_() also can give you weird results today if you try to resize_() a weirdly strided tensor.
  //
@ -344,7 +344,7 @@ void FunctionalTensorWrapper::maybe_replace_storage(const Tensor& other) {
  set_sizes_and_strides(value_.sizes(), value_.strides());
  refresh_numel();
  // (Technically we should be guaranteed that the tensor was already contiguous,
-  // since it's guaranteed not to have been a view. Doesn't hurt to run though)
+  // since it's guaranteed not to have been a view. Doesnt hurt to run though)
  refresh_contiguous();
  // Swapping out the storage of a tensor (aka from a resize_() call) will update the sizes and strides of the tensor,
  // so we need to record the fact that metadata was mutated.
@ -819,7 +819,7 @@ void setFunctionalizationReapplyViewsTLS(bool reapply_views) {
 // This function will "functionalize" it.
 // That is, it will call the operator, but removing any intermediate views/mutations
 // that are performed inside of it.
-// This is useful for LTC/XLA, which would like to reuse some of our composite kernels
+// This is useful for LTC/XLA, which would like to re-use some of our composite kernels
 // from pytorch core but not have to worry about the view ops that they might call.
 // e.g. at::block_diag
 void functionalize_op_helper(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
--- a/aten/src/ATen/LegacyBatchedFallback.cpp
+++ b/aten/src/ATen/LegacyBatchedFallback.cpp
@ -218,7 +218,7 @@ static Tensor safeStack(TensorList tensors) {
  // is possible for the backward function to return an undefined grad for some
  // grad_input for each example. In that case, we return an undefined grad.
  //
-  // It is theoretically possible for *some* of the examples to produce an
+  // It is theoretically posssible for *some* of the examples to produce an
  // undefined grad (a kernel could peek at the gradient values and return an
  // undefined tensor if it determines the gradient is full of zeros). We
  // could handle this by treating the undefined grad as a zero-filled tensor
--- a/aten/src/ATen/LegacyVmapTransforms.h
+++ b/aten/src/ATen/LegacyVmapTransforms.h
@ -140,7 +140,7 @@ struct TORCH_API VmapPhysicalView {
  // mapping a physical tensor to a new logical tensor (BatchedTensor)
  VmapPhysicalToLogicalMap getPhysicalToLogicalMap() const;

-  // Maps a logical shape to a physical shape by prepending the batch
+  // Maps a logical shape to a physical shape by pre-pending the batch
  // sizes to the logical shape.
  VmapDimVector getPhysicalShape(IntArrayRef logical_shape) const;

--- a/aten/src/ATen/MapAllocator.cpp
+++ b/aten/src/ATen/MapAllocator.cpp
@ -299,7 +299,7 @@ MapAllocator::MapAllocator(WithFd, std::string_view filename, int fd, int flags,
            ::close(fd);
            TORCH_CHECK(false, "unable to stretch file <", filename_, "> to the right size: ", c10::utils::str_error(last_err), " (", last_err, ")");
          }
-/* on macOS write returns with errno 45 (Operation not supported) when used
+/* on macOS write returns with errno 45 (Opperation not supported) when used
 * with a file descriptor obtained via shm_open
 */
 #ifndef __APPLE__
--- a/aten/src/ATen/NestedTensorImpl.cpp
+++ b/aten/src/ATen/NestedTensorImpl.cpp
@ -211,7 +211,7 @@ NestedTensorImpl::NestedTensorImpl(
 }

 // assume contiguous, `nested_strides` and `offsets`
-// can be inferred from `nested_sizes`
+// can be infered from `nested_sizes`
 NestedTensorImpl::NestedTensorImpl(
    const at::Tensor& buffer,
    const at::Tensor& nested_sizes)
--- a/aten/src/ATen/NestedTensorImpl.h
+++ b/aten/src/ATen/NestedTensorImpl.h
@ -32,7 +32,7 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl {
      at::Tensor nested_strides,
      at::Tensor storage_offsets);
  // assume contiguous, `nested_strides` and `offsets`
-  // can be inferred from `nested_sizes`
+  // can be infered from `nested_sizes`
  explicit NestedTensorImpl(
      const at::Tensor& buffer,
      const at::Tensor& nested_sizes);
--- a/aten/src/ATen/Parallel.h
+++ b/aten/src/ATen/Parallel.h
@ -93,12 +93,12 @@ ident: identity for binary combination function sf. sf(ident, x) needs to return
 x.

 f: function for reduction over a chunk. f needs to be of signature scalar_t
-f(int64_t partial_begin, int64_t partial_end, scalar_t identify)
+f(int64_t partial_begin, int64_t partial_end, scalar_t identifiy)

 sf: function to combine two partial results. sf needs to be of signature
 scalar_t sf(scalar_t x, scalar_t y)

-For example, you might have a tensor of 10000 entries and want to sum together
+For example, you might have a tensor of 10000 entires and want to sum together
 all the elements. Parallel_reduce with a grain_size of 2500 will then allocate
 an intermediate result tensor with 4 elements. Then it will execute the function
 "f" you provide and pass the beginning and end index of these chunks, so
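Review note: the `ident` / `f` / `sf` contract described in this comment can be illustrated with a small serial sketch. This is not the ATen implementation; `chunked_reduce_sketch` and `sum_first_n_sketch` are hypothetical names introduced only for this note, and the real `parallel_reduce` runs the chunks on multiple threads rather than in a loop. The sketch only shows how a range is split by `grain_size`, how `f` produces one partial result per chunk, and how `sf` combines the partials.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, serial stand-in for the chunked reduction described in the
// comment above. Assumption: chunks are processed one after another here,
// whereas the real function distributes them across threads.
template <typename scalar_t, typename F, typename SF>
scalar_t chunked_reduce_sketch(
    int64_t begin,
    int64_t end,
    int64_t grain_size,
    scalar_t ident,
    const F& f,
    const SF& sf) {
  assert(grain_size > 0 && "sketch requires a positive grain size");
  if (begin >= end) {
    return ident;
  }
  // One partial result per chunk: f(partial_begin, partial_end, ident).
  std::vector<scalar_t> partials;
  for (int64_t b = begin; b < end; b += grain_size) {
    const int64_t e = std::min(end, b + grain_size);
    partials.push_back(f(b, e, ident));
  }
  // Combine the partial results with sf, starting from the identity.
  scalar_t result = ident;
  for (const scalar_t& p : partials) {
    result = sf(result, p);
  }
  return result;
}

// Sums the values 1..n stored in a vector, reduced in chunks of grain_size.
inline int64_t sum_first_n_sketch(int64_t n, int64_t grain_size) {
  std::vector<int64_t> data(n);
  for (int64_t i = 0; i < n; ++i) {
    data[i] = i + 1;
  }
  return chunked_reduce_sketch<int64_t>(
      0, n, grain_size, int64_t{0},
      [&](int64_t b, int64_t e, int64_t ident) {
        int64_t acc = ident;
        for (int64_t i = b; i < e; ++i) {
          acc += data[i];
        }
        return acc;
      },
      [](int64_t x, int64_t y) { return x + y; });
}
```

With 10000 entries and a grain size of 2500, the sketch produces four partial sums and then combines them, matching the example in the comment.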
--- a/aten/src/ATen/ScalarOps.cpp
+++ b/aten/src/ATen/ScalarOps.cpp
@ -8,28 +8,7 @@ namespace at {
 namespace {
 template <typename scalar_t>
 inline void fill_inplace(Tensor& self, const Scalar& value_scalar) {
-  scalar_t value{};
-
-  if constexpr (std::is_same_v<scalar_t, at::Half> ||
-                std::is_same_v<scalar_t, at::BFloat16> ||
-                std::is_same_v<scalar_t, at::Float8_e5m2> ||
-                std::is_same_v<scalar_t, at::Float8_e5m2fnuz> ||
-                std::is_same_v<scalar_t, at::Float8_e4m3fn> ||
-                std::is_same_v<scalar_t, at::Float8_e4m3fnuz> ||
-                std::is_same_v<scalar_t, at::Float8_e8m0fnu>) {
-    // relaxed float cast: allow inf similar to the torch.tensor constructor
-    //
-    // without this, we had the following divergence:
-    //   torch.tensor(1123581321.0, dtype=torch.float16)
-    //     => tensor(inf, dtype=torch.float16)
-    //   torch.ops.aten.scalar_tensor.default(1123581321, dtype=torch.float16)
-    //     => RuntimeError: value cannot be converted to type at::Half without overflow
-
-    value = static_cast<scalar_t>(value_scalar.to<double>());
-  } else {
-    value = value_scalar.to<scalar_t>();
-  }
-
+  auto value = value_scalar.to<scalar_t>();
  scalar_t* dptr = static_cast<scalar_t*>(self.data_ptr());
  *dptr = value;
 }
--- a/aten/src/ATen/TensorIndexing.h
+++ b/aten/src/ATen/TensorIndexing.h
@ -252,7 +252,7 @@ inline Tensor applySelect(
    // Note: `size >= -index` is not equivalent to `size > -1 - index` if index
    // is INT64_MIN For std::numeric_limits<int64_t>::min() result of unary
    // minus is undefined by the standard but in practice is equal to self. On
-    // the other hand, indexing wrapping is valid for all negative int64_t
+    // the other hand, indexing wraping is valid for all negative int64_t
    // values, as x[INT64_MIN] is the same as x[INT64_MAX]
    TORCH_CHECK_INDEX(
        size.sym_gt(-1 - index)
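Review note: the comment's point about `INT64_MIN` can be checked with plain integers. A minimal sketch, assuming ordinary `int64_t` values instead of the symbolic integers used in the real check; `negative_index_in_bounds_sketch` is a hypothetical helper, and it covers only the `size > -1 - index` half of the condition visible in this hunk.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: true when a negative `index` can be wrapped into a
// dimension holding `size` elements.
inline bool negative_index_in_bounds_sketch(int64_t size, int64_t index) {
  // `-index` overflows when index == INT64_MIN, so `size >= -index` is unsafe.
  // `-1 - index` never overflows for any int64_t: for INT64_MIN it evaluates
  // to INT64_MAX, and for INT64_MAX it evaluates to INT64_MIN.
  return size > -1 - index;
}
```

For a dimension of size 3, index `-3` passes and index `-4` fails, while `INT64_MIN` is rejected for every possible size without relying on undefined behavior.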
@ -315,17 +315,10 @@ inline void recordTensorIndex(
    const Tensor& tensor,
    std::vector<Tensor>& outIndices,
    int64_t* dim_ptr) {
-  if (outIndices.empty()) {
-    outIndices.resize(*dim_ptr + 1);
-    outIndices[*dim_ptr] = tensor;
-  } else {
-    outIndices.push_back(tensor);
-  }
-  if (tensor.scalar_type() == kByte || tensor.scalar_type() == kBool) {
-    *dim_ptr += tensor.dim();
-  } else {
-    *dim_ptr += 1;
-  }
+  // TODO: check scalarType
+  outIndices.resize(*dim_ptr + 1);
+  outIndices[*dim_ptr] = tensor;
+  (*dim_ptr)++;
 }

 inline c10::List<::std::optional<Tensor>> typeConvertIndices(
@ -465,23 +458,13 @@ inline Tensor handleDimInMultiDimIndexing(
        original_tensor_device,
        prev_dim_result_sizes);
    (*dim_ptr)++;
-    if (!outIndices.empty()) {
-      outIndices.resize(outIndices.size() + 1);
-    }
    return result;
  } else if (index.is_ellipsis()) {
-    auto ellipsis_ndims = original_tensor.dim() - *specified_dims_ptr;
-    (*dim_ptr) += ellipsis_ndims;
-    if (!outIndices.empty()) {
-      outIndices.resize(outIndices.size() + ellipsis_ndims);
-    }
+    (*dim_ptr) += original_tensor.dim() - (*specified_dims_ptr);
    return prev_dim_result;
  } else if (index.is_none()) {
    Tensor result = prev_dim_result.unsqueeze(*dim_ptr);
    (*dim_ptr)++;
-    if (!outIndices.empty()) {
-      outIndices.resize(outIndices.size() + 1);
-    }
    return result;
  } else if (index.is_boolean()) {
    Tensor result = prev_dim_result.unsqueeze(*dim_ptr);
@ -577,10 +560,6 @@ inline Tensor applySlicing(
 inline Tensor dispatch_index(
    const Tensor& self,
    std::vector<Tensor>&& indices) {
-  // Remove trailing null elements from indices
-  while (!indices.empty() && !indices.back().defined()) {
-    indices.pop_back();
-  }
  return self.index(impl::typeConvertIndices(self, std::move(indices)));
 }

@ -588,10 +567,6 @@ inline Tensor dispatch_index_put_(
    Tensor& self,
    std::vector<Tensor>&& indices,
    const Tensor& value) {
-  // Remove trailing null elements from indices
-  while (!indices.empty() && !indices.back().defined()) {
-    indices.pop_back();
-  }
  return self.index_put_(
      impl::typeConvertIndices(self, std::move(indices)), value);
 }
--- a/aten/src/ATen/TensorIterator.cpp
+++ b/aten/src/ATen/TensorIterator.cpp
@ -208,7 +208,7 @@ bool TensorIteratorConfig::is_tensor_const(size_t idx) {
 // same strides are increasing. If dimensions are non-increasing, we move on to the next input to break the tie.
 //
 // Instead of applying rule 4 for tie breaking, we could move on to the next tensor directly. This would result in possibly
-// losing the correct permutation of the first tensor if there are permuted trivial dimensions, but could potentially
+// losing the correct permuation of the first tensor if there are permuted trivial dimensions, but could potentially
 // improve traversal order of the second tensor. We chose the former option to better propagate channels last layout
 // for example for a tensor with the sizes N1H1
 // These rules result in the intuitive behavior that in most cases recovers permutation of either the first argument (if all
@ -244,7 +244,7 @@ void TensorIteratorBase::reorder_dimensions() {
  // initialize perm with n-1, n-2, ..., 1, 0
  std::iota(perm_.rbegin(), perm_.rend(), 0);

-  // Reordering dimensions changes iteration order
+  // Reordering dimensions changes iteraton order
  if (enforce_linear_iteration_) {
    permute_dimensions(perm_);
    return;
--- a/aten/src/ATen/TensorIterator.h
+++ b/aten/src/ATen/TensorIterator.h
@ -388,7 +388,7 @@ struct TORCH_API TensorIteratorBase : public impl::MetaBase {

  /// Return scalar value from original_tensor_base if it is defined. When
  /// common_dtype is Half, casting scalar input to common_dtype might overflow.
-  /// If the scalar is already given in the type of Half, then return scalar
+  /// If the scalar is aleady given in the type of Half, then return scalar
  /// value from tensor_base.
  template <typename T>
  T original_scalar_value(int64_t arg) {
@ -502,7 +502,7 @@ struct TORCH_API TensorIteratorBase : public impl::MetaBase {
  /// kernels
  bool can_use_32bit_indexing() const;

-  /// An "iterable" object that recursively splits this iterator into
+  /// An "iteratable" object that recursively splits this iterator into
  /// sub-iterators that can use 32-bit indexing.
  SplitUntil32Bit with_32bit_indexing() const;

@ -878,7 +878,7 @@ class TORCH_API TensorIteratorConfig final {

  // Sets the enforce_linear_iteration_ flag, which is false by default.
  // If true, iteration goes in the same order as a C-contiguous tensor
-  // is laid out in memory. i.e. last dimension iterates fastest.
+  // is layed out in memory. i.e. last dimension iterates fastest.
  //
  // This iteration order can be less efficient and may even prevent
  // vectorization. So only use if the correctness of your kernel depends on it.
--- a/aten/src/ATen/TensorSubclassLikeUtils.h
+++ b/aten/src/ATen/TensorSubclassLikeUtils.h
@ -78,7 +78,7 @@ inline bool areAnyOptionalTensorSubclassLike(
 // NOTE: This function expects a scalar tensor of boolean dtype.
 // Eg.
 // Non-Composite Compliant Pattern : (t == 0).all().item<bool>()
-// Composite Compliant Pattern : is_salar_tensor_true((t == 0).all())
+// Composite Compliant Patter : is_salar_tensor_true((t == 0).all())
 inline bool is_scalar_tensor_true(const Tensor& t) {
  TORCH_INTERNAL_ASSERT(t.dim() == 0)
  TORCH_INTERNAL_ASSERT(t.scalar_type() == kBool)
--- a/aten/src/ATen/TensorUtils.cpp
+++ b/aten/src/ATen/TensorUtils.cpp
@ -378,9 +378,9 @@ inline static std::optional<ResultVec> computeStride_impl(
        (TORCH_GUARD_OR_TRUE(sym_ne(oldshape[tensor_d - 1], 1)) &&
        TORCH_GUARD_OR_TRUE(sym_ne(oldstride[tensor_d - 1], tensor_numel * chunk_base_stride)))) {
     // We want to accumulate stuff in view_numel until view_numel == tensor_numel, if we do not
-     // know if that is satisfied we keep accumulating. For example if view_numel = 1 and tensor_numel = u1,
+     // know if that is satisfied we keep accumalating. For example if view_numel = 1 and tensor_numel = u1,
     // we want to take that path, view_numel will become u0. Next iteration if u0==u1 we want to stop.
-     // That's why we use TORCH_GUARD_OR_TRUE below.
+     // Thats why we use TORCH_GUARD_OR_TRUE below.

     // we use TORCH_GUARD_OR_FALSE and not TORCH_GUARD_OR_TRUE when comparing newshape[view_d] ==1 because
     // if we know view_numel < tensor_numel is false, we want to stop. Unless we know for sure newshape[view_d]==1
--- a/aten/src/ATen/TracerMode.h
+++ b/aten/src/ATen/TracerMode.h
@ -27,7 +27,7 @@
 //    ops (ops being called by other ops). After the intermediate op call
 //    finishes it's set back to the original `TracingState` object.
 //
-//    The `TracingState` object in TLS can also be read/written via its Python
+//    The `TracingState` obect in TLS can also be read/written via its Python
 //    binding in `python_tracer.cpp`, and `get/setTracingState()` C++ APIs,
 //    which are also exposed as `TORCH_API`.
 //
--- a/aten/src/ATen/ZeroTensorFallback.cpp
+++ b/aten/src/ATen/ZeroTensorFallback.cpp
@ -9,36 +9,7 @@

 namespace at {

- /*
-  * Design:
-  * 1. ZeroTensors are regular tensors with TensorOptions, a storage
-  *    pointing to nullptr and a ZeroTensor dispatch key set.
-  *
-  * 2. ZeroTensors are immutable. This is done to prevent data race in the case of multithreading
-  *    (when two threads try to read the same zero tensor and materialize it in-place).
-  *
-  * 3. ZeroTensor has a boxed fallback that will be dispatched to any ops that don't
-  *    have special ZeroTensor handling. This fallback materializes each ZeroTensor to
-  *    `at::zeros({}, tensor.options()).expand(tensor.sizes())`.
-
-  * 4. ZeroTensors are handled above autograd. This is necessary because fallback
-  *    operations are not differentiable.
-  *     - Example: Consider add in the case it was using the fallback: zerotensor_a + b.
-  *       zerotensor_a would be materialized to c=torch.zeros_like(zerotensor_a) after
-  *       passing through the fallback. If this happens above the autograd, then the
-  *       gradients would be populated on c instead of zerotensor_a.
-  *
-  * 5. The grad field is always populated with an honest to goodness tensor. This
-  *    materialization of ZeroTensors will happen in:
-  *     - AccumulateGrad for Backward Mode AD.
-  *     - will never be required for ForwardMode AD.
-  *       - This is because if all the tangents were undefined (efficient ZeroTensors),
-  *         no computation will be performed (this is ensured via an existing pre-check).
-  *
-  * Today ZeroTensors are primarily used to represent undefined gradients in forward AD,
-  * it does not perfectly handle NaNs and Infs as we don't check the actual values
-  * and assume that they are non-zero, non-inf, non-NaN etc.
-  */
+  // TODO: add a note explaining the design decisions
  // ZeroTensors are designed to be immutable. Thus, we error out when an in-place operation is performed on ZeroTensors
  static void zeroTensorFallback(const c10::OperatorHandle& op, DispatchKeySet dispatch_keys, torch::jit::Stack* stack) {
    const auto& arguments = op.schema().arguments();
@ -124,7 +95,7 @@ namespace at {
    m.impl("clone", torch::CppFunction::makeFallthrough());
    m.impl("dot", torch::CppFunction::makeFallthrough());
    m.impl("vdot", torch::CppFunction::makeFallthrough());
-    // The functions in the list below have a specific registration in native_functions.yaml and
+    // The functions in the list below have a specific registeration in native_functions.yaml and
    // do not use the fallback.
    // m.impl("mul.Tensor", torch::CppFunction::makeFallthrough());
    // m.impl("add.Tensor", torch::CppFunction::makeFallthrough());
--- a/aten/src/ATen/autocast_mode.h
+++ b/aten/src/ATen/autocast_mode.h
@ -377,7 +377,7 @@ Keep it simple for now by assuming only one such flag is
 present in the argument list.  If I ever need a function
 with more than flag I'll figure out something else.
 The policy is:
-If the user has explicitly specified a dtype, respect it.
+If the user has explicity specified a dtype, respect it.
 Otherwise, set it to the autocast type.
 ********************************************************/

--- a/aten/src/ATen/cpu/vec/intrinsics.h
+++ b/aten/src/ATen/cpu/vec/intrinsics.h
@ -1 +1,55 @@
-#include <torch/headeronly/cpu/vec/intrinsics.h>
+#pragma once
+#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
+/* GCC or clang-compatible compiler, targeting x86/x86-64 */
+#include <x86intrin.h>
+#elif defined(__clang__) && (defined(__ARM_NEON__) || defined(__aarch64__))
+/* Clang-compatible compiler, targeting arm neon */
+#include <arm_neon.h>
+#if defined(__ARM_FEATURE_SVE)
+/* CLANG-compatible compiler, targeting ARM with SVE */
+#include <arm_sve.h>
+#endif
+#elif defined(_MSC_VER)
+/* Microsoft C/C++-compatible compiler */
+#include <intrin.h>
+#if _MSC_VER <= 1900
+#define _mm256_extract_epi64(X, Y) \
+  (_mm_extract_epi64(_mm256_extractf128_si256(X, Y >> 1), Y % 2))
+#define _mm256_extract_epi32(X, Y) \
+  (_mm_extract_epi32(_mm256_extractf128_si256(X, Y >> 2), Y % 4))
+#define _mm256_extract_epi16(X, Y) \
+  (_mm_extract_epi16(_mm256_extractf128_si256(X, Y >> 3), Y % 8))
+#define _mm256_extract_epi8(X, Y) \
+  (_mm_extract_epi8(_mm256_extractf128_si256(X, Y >> 4), Y % 16))
+#endif
+#elif defined(__GNUC__) && (defined(__ARM_NEON__) || defined(__aarch64__))
+/* GCC-compatible compiler, targeting ARM with NEON */
+#include <arm_neon.h>
+#if defined(__ARM_FEATURE_SVE)
+/* GCC-compatible compiler, targeting ARM with SVE */
+#include <arm_sve.h>
+#endif
+#if defined(MISSING_ARM_VLD1)
+#include <ATen/cpu/vec/vec256/missing_vld1_neon.h>
+#elif defined(MISSING_ARM_VST1)
+#include <ATen/cpu/vec/vec256/missing_vst1_neon.h>
+#endif
+#elif defined(__GNUC__) && defined(__IWMMXT__)
+/* GCC-compatible compiler, targeting ARM with WMMX */
+#include <mmintrin.h>
+#elif defined(__s390x__)
+// targets Z/architecture
+// we will include vecintrin later
+#elif (defined(__GNUC__) || defined(__xlC__)) && \
+    (defined(__VEC__) || defined(__ALTIVEC__))
+/* XLC or GCC-compatible compiler, targeting PowerPC with VMX/VSX */
+#include <altivec.h>
+/* We need to undef those tokens defined by <altivec.h> to avoid conflicts
+   with the C++ types. => Can still use __bool/__vector */
+#undef bool
+#undef vector
+#undef pixel
+#elif defined(__GNUC__) && defined(__SPE__)
+/* GCC-compatible compiler, targeting PowerPC with SPE */
+#include <spe.h>
+#endif
--- a/aten/src/ATen/cpu/vec/sve/vec_bfloat16.h
+++ b/aten/src/ATen/cpu/vec/sve/vec_bfloat16.h
@ -5,7 +5,6 @@
 #include <ATen/cpu/vec/sve/vec_common_sve.h>
 #include <ATen/cpu/vec/sve/vec_float.h>
 #include <ATen/cpu/vec/vec_base.h>
-#include <c10/util/bit_cast.h>
 #include <cmath>
 namespace at {
 namespace vec {
@ -37,7 +36,7 @@ class Vectorized<BFloat16> {
    return VECTOR_WIDTH / sizeof(BFloat16);
  }

-  Vectorized();
+  Vectorized() {}
  Vectorized(svbfloat16_t v) : values(v) {}
  Vectorized(int val);
  Vectorized(BFloat16 val);
@ -307,11 +306,6 @@ Vectorized<c10::BFloat16> inline operator/(
  return binary_operator_via_float(std::divides<Vectorized<float>>(), a, b);
 }

-inline Vectorized<BFloat16>::Vectorized() {
-  const short zero = 0;
-  values = svdup_n_bf16(c10::bit_cast<bfloat16_t>(zero));
-}
-
 inline Vectorized<BFloat16>::Vectorized(int val) {
  auto vals_f = svdup_n_f32(val);
  values = convert_float_bfloat16(vals_f, vals_f);
--- a/aten/src/ATen/cpu/vec/sve/vec_double.h
+++ b/aten/src/ATen/cpu/vec/sve/vec_double.h
@ -38,9 +38,7 @@ class Vectorized<double> {
  static constexpr size_type size() {
    return VECTOR_WIDTH / sizeof(double);
  }
-  Vectorized() {
-    values = svdup_n_f64(0);
-  }
+  Vectorized() {}
  Vectorized(svfloat64_t v) : values(v) {}
  Vectorized(double val) {
    values = svdup_n_f64(val);
@ -587,30 +585,6 @@ Vectorized<double> inline fmadd(
  return svmad_f64_x(ptrue, a, b, c);
 }

-template <>
-Vectorized<double> inline fnmadd(
-    const Vectorized<double>& a,
-    const Vectorized<double>& b,
-    const Vectorized<double>& c) {
-  return svmsb_f64_x(ptrue, a, b, c);
-}
-
-template <>
-Vectorized<double> inline fmsub(
-    const Vectorized<double>& a,
-    const Vectorized<double>& b,
-    const Vectorized<double>& c) {
-  return svnmsb_f64_x(ptrue, a, b, c);
-}
-
-template <>
-Vectorized<double> inline fnmsub(
-    const Vectorized<double>& a,
-    const Vectorized<double>& b,
-    const Vectorized<double>& c) {
-  return svnmad_f64_x(ptrue, a, b, c);
-}
-
 #endif // defined(CPU_CAPABILITY_SVE)

 } // namespace CPU_CAPABILITY
Author	SHA1	Message	Date
Camyll Harajli	6f733d481d	lint	2025-07-16 10:17:19 -07:00
Camyll Harajli	d61ae9a2ec	lint	2025-07-16 10:16:01 -07:00
Camyll Harajli	a62fa46502	test commit	2025-07-16 09:54:53 -07:00
Zain Rizvi	b45b26f68e	Fixes	2025-07-16 11:46:25 -05:00
Zain Rizvi	3c65f00a6f	Make hook resilient to being launched on older branches	2025-07-16 11:45:12 -05:00
Zain Rizvi	9d4fda5637	Minor fixes	2025-07-15 19:02:24 -05:00
Zain Rizvi	73da8c1c12	typo fix	2025-07-15 18:57:46 -05:00
Zain Rizvi	3c479b95c9	ensure pipx path	2025-07-15 18:56:59 -05:00
Zain Rizvi	241250ff90	Remove CI check	2025-07-15 18:08:23 -05:00
Zain Rizvi	6bcb74e2cc	lint fix	2025-07-15 17:55:05 -05:00
Zain Rizvi	bdb15094c6	moar good	2025-07-15 17:53:29 -05:00
Zain Rizvi	cbd7ad6a27	no-op	2025-07-15 17:53:29 -05:00
Zain Rizvi	c3ec715b74	fix	2025-07-15 17:53:29 -05:00
Zain Rizvi	ae1fc1de26	Initial working version	2025-07-15 17:53:29 -05:00
Zain Rizvi	892e11c770	update lintrunner wrapper	2025-07-15 17:53:29 -05:00
Zain Rizvi	887f933fd9	test	2025-07-15 17:53:29 -05:00