Fix typo in torch._refs (#167310)

This looks like a typo, but it doesn't raise an error because the inner function splits it into `a` and `,`, and the check for the `,` case is skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167310
Approved by: https://github.com/eellison
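
The changed line is not shown here, so the following is only a sketch of one plausible reading of the commit message above: a parenthesized string such as `("a,")` written where a one-element tuple `("a",)` was intended. The names `pick_args` and `bound_arguments` are hypothetical and are not taken from `torch._refs`.

```python
# Hypothetical illustration; `pick_args` and `bound_arguments` are made-up
# names, not the actual torch._refs code.

def pick_args(arg_names, bound_arguments):
    """Return the values of the named arguments that are actually bound."""
    # Names that are not bound are silently skipped, so a stray "," never raises.
    return tuple(
        bound_arguments[name] for name in arg_names if name in bound_arguments
    )

bound_arguments = {"a": 1.5, "b": 2}

typo = ("a,")      # parentheses only: this is the str "a,", not a tuple
intended = ("a",)  # comma after the string literal: a one-element tuple

print(list(typo))                            # iterating the str yields 'a' and ','
print(pick_args(typo, bound_arguments))      # same result as the intended form
print(pick_args(intended, bound_arguments))
```

Under these assumptions both spellings select the same argument, which would explain why such a typo can go unnoticed until someone reads the line closely.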
Revert "[ARM] Improve LLM performance & mem usage using int4-bf16 KleidiAI kernels (#158250)"
2025-11-12 06:44:55 +08:00 · 2025-11-11 22:31:09 +00:00 · 2025-11-11 22:27:51 +00:00 · 2025-11-11 21:32:22 +00:00 · 2025-11-11 21:31:12 +00:00 · 2025-11-11 20:37:53 +00:00
14 changed files with 251 additions and 864 deletions
--- a/.github/actionlint.yaml
+++ b/.github/actionlint.yaml
@@ -63,7 +63,7 @@ self-hosted-runner:
    - linux.rocm.gpu.gfx942.1
    - linux.rocm.gpu.gfx942.2
    - linux.rocm.gpu.gfx942.4
-    - rocm-docker
+    - linux.rocm.gfx942.docker-cache
    # Org wise AWS `mac2.metal` runners (2020 Mac mini hardware powered by Apple silicon M1 processors)
    - macos-m1-stable
    - macos-m1-14
--- a/.github/workflows/docker-cache-mi300.yml
+++ b/.github/workflows/docker-cache-mi300.yml
@@ -1,55 +0,0 @@
-name: docker-cache-mi300
-
-on:
-  # run every 6 hours
-  schedule:
-    - cron: 0 0,6,12,18 * * *
-  workflow_dispatch:
-
-concurrency:
-  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name }}
-  cancel-in-progress: true
-
-permissions:
-  id-token: write
-  contents: read
-
-jobs:
-  docker-cache:
-    if: github.repository_owner == 'pytorch'
-    runs-on: rocm-docker
-    steps:
-      - name: Checkout PyTorch
-        uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
-        with:
-          no-sudo: true
-
-      - name: configure aws credentials
-        id: aws_creds
-        uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0
-        with:
-          role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
-          aws-region: us-east-1
-          role-duration-seconds: 18000
-
-      - name: Login to Amazon ECR
-        id: login-ecr
-        continue-on-error: false
-        uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 # v2.0.1
-
-      - name: Calculate docker image
-        id: calculate-docker-image
-        uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
-        with:
-          docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
-          push: false
-
-      - name: Pull docker image
-        uses: pytorch/test-infra/.github/actions/pull-docker-image@main
-        with:
-          docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
-
-      - name: Tar and upload to S3 bucket
-        run: |
-          sudo docker save -o ~/docker-data/pytorch/pytorch_docker_image.tar ${{ steps.calculate-docker-image.outputs.docker-image }}
-          sudo rclone copy -P --s3-upload-concurrency 64 --s3-chunk-size 200M --s3-upload-cutoff 300M ~/docker-data/pytorch/pytorch_docker_image.tar oci:pytorchbucket0002/pytorch_docker_image --progress
--- a/.github/workflows/docker-cache-rocm.yml
+++ b/.github/workflows/docker-cache-rocm.yml
@@ -0,0 +1,108 @@
+name: docker-cache-rocm
+
+on:
+  workflow_run:
+    workflows: [docker-builds]
+    # TODO: Uncomment before merging
+    #branches: [main, release]
+    types:
+      - completed
+  workflow_dispatch:
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name }}
+  cancel-in-progress: true
+
+permissions:
+  id-token: write
+  contents: read
+  actions: read
+
+jobs:
+  download-docker-builds-artifacts:
+    if: github.repository_owner == 'pytorch'
+    name: download-docker-builds-artifacts
+    runs-on: ubuntu-latest
+    outputs:
+      pytorch-linux-jammy-rocm-n-py3: ${{ steps.process-artifacts.outputs.pytorch-linux-jammy-rocm-n-py3 }}
+      pytorch-linux-noble-rocm-n-py3: ${{ steps.process-artifacts.outputs.pytorch-linux-noble-rocm-n-py3 }}
+      pytorch-linux-jammy-rocm-n-py3-benchmarks: ${{ steps.process-artifacts.outputs.pytorch-linux-jammy-rocm-n-py3-benchmarks }}
+    steps:
+      - name: Download artifacts
+        uses: actions/download-artifact@v4.1.7
+        with:
+          run-id: ${{ github.event.workflow_run.id }}
+          path: ./docker-builds-artifacts
+          merge-multiple: true
+          github-token: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Process artifacts
+        id: process-artifacts
+        run: |
+          ls -R ./docker-builds-artifacts
+          cat ./docker-builds-artifacts/*txt >> "${GITHUB_OUTPUT}"
+          cat "${GITHUB_OUTPUT}"
+
+  docker-cache:
+    if: github.repository_owner == 'pytorch'
+    needs: download-docker-builds-artifacts
+    strategy:
+      fail-fast: false
+      matrix:
+        runner: [linux.rocm.gfx942.docker-cache]
+        docker-image: [
+          "${{ needs.download-docker-builds-artifacts.outputs.pytorch-linux-jammy-rocm-n-py3 }}",
+          "${{ needs.download-docker-builds-artifacts.outputs.pytorch-linux-noble-rocm-n-py3 }}",
+          "${{ needs.download-docker-builds-artifacts.outputs.pytorch-linux-jammy-rocm-n-py3-benchmarks }}"
+        ]
+    runs-on: "${{ matrix.runner }}"
+    steps:
+      - name: debug
+        run: |
+          JSON_STRINGIFIED="${{ toJSON(needs.download-docker-builds-artifacts.outputs) }}"
+          echo "Outputs of download-docker-builds-artifacts job: ${JSON_STRINGIFIED}"
+
+      - name: configure aws credentials
+        id: aws_creds
+        uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0
+        with:
+          role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
+          aws-region: us-east-1
+          role-duration-seconds: 18000
+
+      - name: Login to Amazon ECR
+        id: login-ecr
+        continue-on-error: false
+        uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 # v2.0.1
+
+      - name: Generate ghrc.io tag
+        id: ghcr-io-tag
+        run: |
+            ecr_image="${{ matrix.docker-image }}"
+            ghcr_image="ghcr.io/pytorch/ci-image:${ecr_image##*:}"
+            echo "ghcr_image=${ghcr_image}" >> "$GITHUB_OUTPUT"
+
+      - name: Pull docker image
+        uses: pytorch/test-infra/.github/actions/pull-docker-image@main
+        with:
+          docker-image: ${{ steps.ghcr-io-tag.outputs.ghcr_image }}
+
+      - name: Save as tarball
+        run: |
+          docker_image_tag=${{ matrix.docker-image }}
+          docker_image_tag="${docker_image_tag#*:}" # Remove everything before and including first ":"
+          docker_image_tag="${docker_image_tag%-*}" # Remove everything after and including last "-"
+          ref_name=${{ github.event.workflow_run.head_branch }}
+          if [[ $ref_name =~ "release/" ]]; then
+            ref_suffix="release"
+          elif [[ $ref_name == "main" ]]; then
+            ref_suffix="main"
+          else
+            # TODO: Remove below
+            ref_suffix="main"
+            # echo "Unexpected branch in ref_name: ${ref_name}" && exit 1
+          fi
+          docker tag ${{ steps.ghcr-io-tag.outputs.ghcr_image }} ${{ matrix.docker-image }}
+          # mv is atomic operation, so we use intermediate tar.tmp file to prevent read-write contention
+          docker save -o ~/pytorch-data/docker/${docker_image_tag}.tar.tmp ${{ matrix.docker-image }}
+          mv ~/pytorch-data/docker/${docker_image_tag}.tar.tmp ~/pytorch-data/docker/${docker_image_tag}_${ref_suffix}.tar
--- a/aten/src/ATen/native/LinearAlgebra.cpp
+++ b/aten/src/ATen/native/LinearAlgebra.cpp
@@ -3541,9 +3541,9 @@ Tensor _dyn_quant_matmul_4bit_cpu(
    const int64_t out_features) {
  auto M = inp.size(0);
  TORCH_CHECK(
-      inp.dtype() == kFloat || (inp.dtype() == kBFloat16 && block_size == in_features),
+      inp.dtype() == kFloat,
      __func__,
-      " : expect input to be float32 or bfloat16 tensor.");
+      " : expect input to be 32-bit float tensor.");
  TORCH_CHECK(
      block_size == in_features ||
          (!(block_size % 32) && !(in_features % block_size)),
--- a/aten/src/ATen/native/cpu/int4mm_kernel.cpp
+++ b/aten/src/ATen/native/cpu/int4mm_kernel.cpp
@@ -8,7 +8,6 @@
 #include <ATen/cpu/vec/vec.h>
 #include <ATen/native/cpu/int_mm_kernel.h>
 #include <ATen/native/cpu/utils.h>
-#include <cmath>
 #include <c10/util/Unroll.h>
 #include <c10/util/irange.h>

@@ -794,139 +793,6 @@ bool can_use_kleidiai(
 }
 #endif

-static void ref_dyn_quant_matmul_4bit_channelwise_kernel_bf16(
-    size_t m,
-    size_t n,
-    size_t k,
-    const uint16_t* lhs_bf16,
-    const uint8_t* rhs_qs4cx,
-    const float* rhs_scales,
-    uint16_t* dst_bf16,
-    float scalar_min,
-    float scalar_max,
-    const float* bias) {
-  // Roundup lambda for internal stride calculations
-  auto roundup = [](size_t a, size_t b) { return ((a + b - 1) / b) * b; };
-
-  // Cast bfloat16 to float32 inline
-  auto cast_bf16_to_f32 = [](uint16_t bf16_val) {
-    uint32_t tmp = static_cast<uint32_t>(bf16_val) << 16;
-    float f;
-    std::memcpy(&f, &tmp, sizeof(f));
-    return f;
-  };
-
-  // Cast float32 to bfloat16 inline
-  auto cast_f32_to_bf16 = [](float f) {
-    uint32_t bits;
-    std::memcpy(&bits, &f, sizeof(bits));
-    return static_cast<uint16_t>(bits >> 16);
-  };
-
-  // Quantization pack lambda (channelwise QA8DX)
-  auto quant_pack_8bit_channelwise =
-      [&](size_t M, size_t K, const uint16_t* src_bf16, int8_t* dst_qa8dx) {
-        constexpr int8_t kI8Min = std::numeric_limits<std::int8_t>::lowest();
-        constexpr int8_t kI8Max = std::numeric_limits<std::int8_t>::max();
-
-        const size_t dst_stride =
-            K * sizeof(int8_t) + sizeof(float) + sizeof(int32_t);
-        for (size_t i = 0; i < M; ++i) {
-          const uint16_t* row_ptr = src_bf16 + i * K;
-          // find min/max
-          float mn = FLT_MAX, mx = -FLT_MAX;
-          for (size_t j = 0; j < K; ++j) {
-            float v = cast_bf16_to_f32(row_ptr[j]);
-            mn = std::min(mn, v);
-            mx = std::max(mx, v);
-          }
-          float rmin = std::min(0.0f, mn);
-          float rmax = std::max(0.0f, mx);
-          constexpr float qmin = static_cast<float>(kI8Min);
-          constexpr float qmax = static_cast<float>(kI8Max);
-          float scale = (rmin == rmax) ? 1.f : (qmax - qmin) / (rmax - rmin);
-          float recip = scale ? 1.0f / scale : 0.0f;
-          int32_t zp;
-          float des_min = rmin * scale;
-          float des_max = rmax * scale;
-          float err_min = qmin + des_min;
-          float err_max = qmax + des_max;
-          float zp_f =
-              (err_min + err_max) > 0 ? qmin - des_min : qmax - des_max;
-          zp_f = std::clamp(zp_f, qmin, qmax);
-          zp = std::lrintf(zp_f);
-          int8_t* out_ptr = dst_qa8dx + i * dst_stride;
-          // store header
-          *reinterpret_cast<float*>(out_ptr) = recip;
-          *reinterpret_cast<int32_t*>(out_ptr + sizeof(float)) = -zp;
-          out_ptr += sizeof(float) + sizeof(int32_t);
-          // quantize
-          for (size_t j = 0; j < K; ++j) {
-            float v = cast_bf16_to_f32(row_ptr[j]);
-            int32_t q = static_cast<int32_t>(std::round(v * scale)) + zp;
-            q = std::clamp(
-                q, static_cast<int32_t>(kI8Min), static_cast<int32_t>(kI8Max));
-            *out_ptr++ = static_cast<int8_t>(q);
-          }
-        }
-      };
-
-  // MatMul lambda (MXN x MXK -> MNXK BF16)
-  auto matmul_kernel = [&](size_t M,
-                           size_t N,
-                           size_t K,
-                           const int8_t* lhs,
-                           const uint8_t* rhs,
-                           const float* scales,
-                           uint16_t* dst,
-                           float lo,
-                           float hi) {
-    const size_t lhs_stride =
-        K * sizeof(int8_t) + sizeof(float) + sizeof(int32_t);
-    const size_t rhs_stride = roundup(K, 2) / 2;
-    for (size_t i = 0; i < M; ++i) {
-      const int8_t* lhs_row = lhs + i * lhs_stride;
-      for (size_t j = 0; j < N; ++j) {
-        int32_t acc = 0;
-        const int8_t* lptr = lhs_row;
-        const uint8_t* rptr = rhs + j * rhs_stride;
-        float lhs_scale = *reinterpret_cast<const float*>(lptr);
-        int32_t lhs_off =
-            *reinterpret_cast<const int32_t*>(lptr + sizeof(float));
-        lptr += sizeof(float) + sizeof(int32_t);
-        for (size_t t = 0; t < K; ++t) {
-          int32_t lv = static_cast<int32_t>(lptr[t]);
-          uint8_t bv = rptr[t / 2];
-          int32_t rv = ((t & 1) == 0) ? (static_cast<int32_t>(bv & 0xF) - 8)
-                                      : (static_cast<int32_t>(bv >> 4) - 8);
-          acc += lv * rv + lhs_off * rv;
-        }
-        float res = static_cast<float>(acc) * scales[j] * lhs_scale;
-        if (bias) {
-          res += bias[j];
-        }
-        res = std::clamp(res, lo, hi);
-        *dst++ = cast_f32_to_bf16(res);
-      }
-    }
-  };
-
-  // allocate and run
-  std::unique_ptr<int8_t[]> packed(
-      new int8_t[m * (k * sizeof(int8_t) + sizeof(float) + sizeof(int32_t))]);
-  quant_pack_8bit_channelwise(m, k, lhs_bf16, packed.get());
-  matmul_kernel(
-      m,
-      n,
-      k,
-      packed.get(),
-      rhs_qs4cx,
-      rhs_scales,
-      dst_bf16,
-      scalar_min,
-      scalar_max);
-}
-
 /**
 * The Int4 quantized weights must be represented as a uint8 tensor
 * For matrix multiplication with a weight shape of (N x K)
@@ -953,21 +819,21 @@ void dyn_quant_pack_4bit_weight_kernel(
 #if AT_KLEIDIAI_ENABLED()
  if (can_use_kleidiai(scales_zeros, K, block_size)) {
    const int64_t weight_packed_size =
-        kleidiai::kai_pack_rhs_int4_size(N, K, block_size, weights.scalar_type());
+        kleidiai::kai_pack_rhs_int4_size(N, K, block_size);
    packed_weights.resize_({weight_packed_size});
    kleidiai::kai_pack_int4_rhs(
        packed_weights, weights, scales_zeros, bias, N, K, block_size);
  } else
 #endif
  {
+    TORCH_CHECK(
+        bias.has_value() == 0,
+        __func__,
+        " : Bias is unsupported in reference implementation");
    packed_weights = packed_weights.to(kFloat);
-    auto weight_reshaped = weights.reshape({-1}).to(kFloat);
-    auto scales_zeros_reshaped = scales_zeros.reshape({-1}).to(kFloat);
-    std::vector<at::Tensor> tensors_to_cat = {weight_reshaped, scales_zeros_reshaped};
-    if (bias.has_value()) {
-      tensors_to_cat.push_back(bias.value().view({-1}).to(kFloat));
-    }
-    auto res = at::cat(tensors_to_cat, 0);
+    auto weight_reshaped = weights.view({-1}).to(kFloat);
+    auto scales_zeros_reshaped = scales_zeros.view({-1}).to(kFloat);
+    auto res = at::cat({weight_reshaped, scales_zeros_reshaped}, 0);
    packed_weights.resize_(res.sizes()).copy_(res);
  }
 }
@@ -981,8 +847,7 @@ void ref_dyn_quant_matmul_4bit_channelwise_kernel(
    const float* rhs_scales_f32,
    float* dst_f32,
    float scalar_min,
-    float scalar_max,
-    const float* bias) {
+    float scalar_max) {
  const size_t input_size_8bit = m * (k + sizeof(int32_t) + sizeof(float));

  auto lhs_qa8dx_buffer = std::make_unique<uint8_t[]>(input_size_8bit);
@@ -992,9 +857,6 @@ void ref_dyn_quant_matmul_4bit_channelwise_kernel(
  // required format for matmul
  auto input_quant_pack_8bit_channelwise =
      [&](size_t m, size_t k, const float* lhs_f32, int8_t* lhs_qa8dx) {
-        constexpr int8_t kI8Min = std::numeric_limits<std::int8_t>::lowest();
-        constexpr int8_t kI8Max = std::numeric_limits<std::int8_t>::max();
-
        const size_t dst_stride =
            (k * sizeof(int8_t) + sizeof(float) + sizeof(int32_t));

@@ -1015,8 +877,8 @@ void ref_dyn_quant_matmul_4bit_channelwise_kernel(
          }

          // Maximum/minimum int8 values
-          constexpr float qmin = static_cast<float>(kI8Min);
-          constexpr float qmax = static_cast<float>(kI8Max);
+          const float qmin = (float)INT8_MIN;
+          const float qmax = (float)INT8_MAX;

          const float rmin0 = std::min(0.0f, min0);
          const float rmax0 = std::max(0.0f, max0);
@@ -1042,7 +904,7 @@ void ref_dyn_quant_matmul_4bit_channelwise_kernel(
          zero_point0 = std::min(zero_point0, qmax);

          // Round to nearest integer
-          const int32_t nudged_zero_point0 = std::lrintf(zero_point0);
+          const int32_t nudged_zero_point0 = lrintf(zero_point0);

          int8_t* dst_ptr = lhs_qa8dx + m_idx * dst_stride;

@@ -1060,8 +922,8 @@ void ref_dyn_quant_matmul_4bit_channelwise_kernel(
            int32_t v0_s32 = (int32_t)(std::round(src0_0 * scale0));

            v0_s32 = v0_s32 + nudged_zero_point0;
-            v0_s32 = std::max(v0_s32, static_cast<int32_t>(kI8Min));
-            v0_s32 = std::min(v0_s32, static_cast<int32_t>(kI8Max));
+            v0_s32 = std::max(v0_s32, static_cast<int32_t>(INT8_MIN));
+            v0_s32 = std::min(v0_s32, static_cast<int32_t>(INT8_MAX));
            dst_ptr[0] = (int8_t)v0_s32;
            dst_ptr += sizeof(int8_t);
          }
@@ -1125,10 +987,6 @@ void ref_dyn_quant_matmul_4bit_channelwise_kernel(

      main_acc = main_acc * lhs_scale;

-      if (bias) {
-        main_acc += bias[n_idx];
-      }
-
      // Clamp (min-max) operation
      main_acc = std::max(main_acc, scalar_min);
      main_acc = std::min(main_acc, scalar_max);
@@ -1149,16 +1007,12 @@ void ref_dyn_quant_matmul_4bit_groupwise_kernel(
    const float* rhs_scales_fp32,
    float* dst_f32,
    float scalar_min,
-    float scalar_max,
-    const float* bias) {
+    float scalar_max) {
  // Lambda for LHS quantization
  auto lhs_quant_pack = [&](size_t m,
                            size_t k,
                            const float* lhs_f32,
                            int8_t* lhs_qa8dx) {
-    constexpr int8_t kI8Min = std::numeric_limits<std::int8_t>::lowest();
-    constexpr int8_t kI8Max = std::numeric_limits<std::int8_t>::max();
-
    const size_t dst_stride =
        (k * sizeof(int8_t) + sizeof(float) + sizeof(int32_t));

@@ -1174,8 +1028,8 @@ void ref_dyn_quant_matmul_4bit_groupwise_kernel(
        min0 = std::min(src0_0, min0);
      }

-      constexpr float qmin = static_cast<float>(kI8Min);
-      constexpr float qmax = static_cast<float>(kI8Max);
+      const float qmin = (float)INT8_MIN;
+      const float qmax = (float)INT8_MAX;

      const float rmin0 = std::min(0.0f, min0);
      const float rmax0 = std::max(0.0f, max0);
@@ -1192,7 +1046,7 @@ void ref_dyn_quant_matmul_4bit_groupwise_kernel(

      zero_point0 = std::max(zero_point0, qmin);
      zero_point0 = std::min(zero_point0, qmax);
-      const int32_t nudged_zero_point0 = std::lrintf(zero_point0);
+      const int32_t nudged_zero_point0 = lrintf(zero_point0);

      int8_t* dst_ptr = lhs_qa8dx + row_idx * dst_stride;

@@ -1205,8 +1059,9 @@ void ref_dyn_quant_matmul_4bit_groupwise_kernel(
        const float src0_0 = src_ptr[k_idx];
        int32_t v0_s32 = (int32_t)(std::round(src0_0 * scale0));
        v0_s32 = std::max(
-            std::min(v0_s32 + nudged_zero_point0, static_cast<int32_t>(kI8Max)),
-            static_cast<int32_t>(kI8Min));
+            std::min(
+                v0_s32 + nudged_zero_point0, static_cast<int32_t>(INT8_MAX)),
+            static_cast<int32_t>(INT8_MIN));
        dst_ptr[0] = (int8_t)v0_s32;
        dst_ptr += sizeof(int8_t);
      }
@@ -1263,11 +1118,6 @@ void ref_dyn_quant_matmul_4bit_groupwise_kernel(
      }

      main_acc = main_acc * lhs_scale;
-
-      if (bias) {
-        main_acc += bias[col_idx];
-      }
-
      main_acc = std::max(main_acc, scalar_min);
      main_acc = std::min(main_acc, scalar_max);

@@ -1278,27 +1128,28 @@ void ref_dyn_quant_matmul_4bit_groupwise_kernel(
 }

 /**
- * Dynamic INT4 weight-only MatMul with per-row input quantization.
- *
- * Execution Flow:
- *
- *   (INT4 Weights + FP Scales [+ optional Bias])
- *
- *    Input (FP32 or BF16)         Packed Weight Buffer
- *           |                             |
- *    Row-wise Quantization (INT8)         |
- *           |                             |
- *     INT8 Input Activation      INT4 Quantized Weights + Scales
- *                  \             /
- *                   \           /
- *              Quantized Matrix Multiply
- *                     |
- *              Output Tensor (BF16 or FP32)
- *
- * Notes:
- *   - Groupwise kernels expect BF16 scales
- *   - Channelwise kernels expect FP32 scales
- *   - Bias is currently unsupported in fallback path
+ * Dynamic Input Quant 4 bit weights matmul execution flow
+              (INT4 Weights + FP scales + FP32 Bias)
+  FP32 Input              Packed Buffer
+       |                       |
+    Quantize                Cast
+   to INT8                 to INT8
+       |                       |
+       v                       v
+ INT8 Input              INT8 Weights
+          \               /
+            \            /
+             \         /
+           INT8 Matrix Multiplication
+                   |
+                   v
+ FP32 Dequantized and Accumulate in FP32
+                   |
+                   v
+             FP32 Final Output
+
+ * The Groupwise kernel requires BFloat16 Scales and Channelwise kernel requires
+ * Float32 Scales. If not provided, we will use fallback implementation.
 */
 void dyn_quant_matmul_4bit_kernel(
    const Tensor& output,
@@ -1310,75 +1161,65 @@ void dyn_quant_matmul_4bit_kernel(
    const int64_t block_size) {
 #if AT_KLEIDIAI_ENABLED()
  const int64_t weight_packed_size =
-      kleidiai::kai_pack_rhs_int4_size(N, K, block_size, inp.scalar_type());
+      kleidiai::kai_pack_rhs_int4_size(N, K, block_size);
  if (weight_packed_size == packed_weights.numel()) {
    // KleidiAI interface internally handles the Channelwise and groupwise
    // distinction
-    kleidiai::kai_quant_pack_lhs_int4_mm(output, inp, packed_weights, M, N, K, block_size);
+    kleidiai::kai_quant_pack_lhs_int4_mm(
+        output, inp, packed_weights, M, N, K, block_size);
  } else
 #endif
  {
-    {
-    void* input = inp.data_ptr();
-    void* dst = output.data_ptr();
-
-    // Extract weights, sclaes and biases form from packed tensor
-    const int weights_elements = N * K / 2;
-    const int scale_elements = N * (K / block_size);
-    TORCH_CHECK(packed_weights.numel() >=  (weights_elements + scale_elements), "Invalid packed weight tensor size");
-
-    auto extracted_weights = packed_weights.narrow(0, 0, weights_elements).to(kByte);
-    auto extracted_scales_and_bias = packed_weights.narrow(0, weights_elements, packed_weights.size(0) - weights_elements).to(kFloat);
-    auto float32_scales = extracted_scales_and_bias.narrow(0, 0, scale_elements);
-
-    int bias_elements = packed_weights.numel() - (weights_elements + scale_elements);
-    float* weight_scales = float32_scales.data_ptr<float>();
-
-    void* bias_data = nullptr;
-    if (bias_elements) {
-        auto float32_bias = extracted_scales_and_bias.narrow(0, scale_elements, bias_elements);
-        TORCH_CHECK(float32_bias.size(0) == N, "Expected bias length to match output dimension");
-        bias_data = float32_bias.data_ptr();
-
-    }
-    // 2 elements of 4 bit weights are packed into 1 uint8 packet
-    uint8_t* weights_4bit = reinterpret_cast<uint8_t*>(extracted_weights.data_ptr());
-
-    // Dispatch to reference kernels
-    if (inp.scalar_type() == at::kBFloat16) {
-        // BF16 input, BF16 output
-        constexpr float BF16_MAX = 3.38953139e+38f;
-        constexpr float BF16_MIN = -BF16_MAX;
-        if (block_size == K) {
-            ref_dyn_quant_matmul_4bit_channelwise_kernel_bf16(
-                M, N, K,
-                (uint16_t*)input, weights_4bit, weight_scales,
-                (uint16_t*)dst, BF16_MIN, BF16_MAX, (float*)bias_data);
-        } else {
-            TORCH_CHECK(false, "Unsupported block size for BF16 fallback");
-        }
-    } else if (inp.scalar_type() == at::kFloat) {
-        // FP32 input, FP32 output
-        if (block_size == K) {
-            ref_dyn_quant_matmul_4bit_channelwise_kernel(
-                M, N, K,
-                (float*)input, weights_4bit, weight_scales,
-                (float*)dst, -FLT_MAX, FLT_MAX, (float*)bias_data);
-        } else if (!(block_size % 32) && !(K % block_size)) {
-            ref_dyn_quant_matmul_4bit_groupwise_kernel(
-                M, N, K, block_size,
-                (float*)input, weights_4bit, weight_scales,
-                (float*)dst, -FLT_MAX, FLT_MAX, (float*)bias_data);
-        } else {
-            TORCH_CHECK(false, "Unsupported block size for FP32 fallback");
-        }
+    float* lhs_f32 = reinterpret_cast<float*>(inp.data_ptr());
+    const auto weights_size = N * K / 2;
+    // The weights needs to be in uint8_t data type after quantization
+    auto extracted_weights =
+        (packed_weights.narrow(0, 0, weights_size)).to(kByte);
+    auto float32_scales =
+        (packed_weights.narrow(
+             0, weights_size, packed_weights.size(0) - weights_size))
+            .to(kFloat);
+    uint8_t* rhs_4bit =
+        reinterpret_cast<uint8_t*>(extracted_weights.data_ptr());
+    float* rhs_scales_f32 = reinterpret_cast<float*>(float32_scales.data_ptr());
+    float* dst_f32 = reinterpret_cast<float*>(output.data_ptr());
+    if (block_size == K) {
+      ref_dyn_quant_matmul_4bit_channelwise_kernel(
+          M,
+          N,
+          K,
+          lhs_f32,
+          rhs_4bit,
+          rhs_scales_f32,
+          dst_f32,
+          -FLT_MAX,
+          FLT_MAX);
+    } else if (!(block_size % 32) && !(K % block_size)) {
+      ref_dyn_quant_matmul_4bit_groupwise_kernel(
+          M,
+          N,
+          K,
+          block_size,
+          lhs_f32,
+          rhs_4bit,
+          rhs_scales_f32,
+          dst_f32,
+          -FLT_MAX,
+          FLT_MAX);
    } else {
-        TORCH_CHECK(false, "Unsupported input/output dtype combination for int4mm kernel");
+      TORCH_CHECK(
+          block_size == K || (!(block_size % 32) && !(K % block_size)),
+          __func__,
+          ": Group size should be multiple 32 or in_features [",
+          K,
+          "]. Provided ",
+          block_size);
    }
+  }
 }
-}
+
 } // anonymous namespace
-}
+
 ALSO_REGISTER_AVX512_DISPATCH(weight_to_int4pack_stub, &weight_to_int4pack_kernel)
 ALSO_REGISTER_AVX512_DISPATCH(int4pack_mm_stub, &int4pack_mm_kernel)
 REGISTER_DISPATCH(dyn_quant_pack_4bit_weight_stub, &dyn_quant_pack_4bit_weight_kernel)
--- a/aten/src/ATen/native/kleidiai/kai_kernels.cpp
+++ b/aten/src/ATen/native/kleidiai/kai_kernels.cpp
@@ -21,27 +21,18 @@ void kai_pack_int4_rhs(
    const int64_t n,
    const int64_t k,
    const int64_t bl) {
+  // Prefer Channelwise kernel over Groupwise kernel for conflicting cases
  if (bl == k) {
    // Channelwise
-    if (weight.scalar_type() == at::kBFloat16) {
-      auto kernel_packet = kai_select_bf16_channelwise_matmul_ukernel(
-          kai_kernel_id::
-              matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod);
-      auto& params = kernel_packet.rhs_pack_params;
-      params.lhs_zero_point = 1;
-      params.rhs_zero_point = 8;
-      kai_pack_rhs_channelwise_int4<kai_matmul_ukernel_bf16_qa8dxp_qs4cxp>(
-          kernel_packet, weight_packed, weight, scales, bias, n, k);
-    } else {
-      auto kernel_packet = kai_select_channelwise_matmul_ukernel(
-          kai_kernel_id::
-              matmul_clamp_f32_qai8dxp1x8_qsi4cxp8x8_1x8x32_neon_dotprod);
-      auto& params = kernel_packet.rhs_pack_params;
-      params.lhs_zero_point = 1;
-      params.rhs_zero_point = 8;
-      kai_pack_rhs_channelwise_int4<kai_matmul_ukernel_f32_qa8dxp_qs4cxp>(
-          kernel_packet, weight_packed, weight, scales, bias, n, k);
-    }
+    auto kernel_packet = kai_select_channelwise_matmul_ukernel(
+        kai_kernel_id::
+            matmul_clamp_f32_qai8dxp1x8_qsi4cxp8x8_1x8x32_neon_dotprod);
+    auto& params = kernel_packet.rhs_pack_params;
+    params.lhs_zero_point = 1;
+    params.rhs_zero_point = 8;
+
+    kai_pack_rhs_channelwise_int4<kai_matmul_ukernel_f32_qa8dxp_qs4cxp>(
+        kernel_packet, weight_packed, weight, scales, bias, n, k);
  } else if (!(bl % 32) && !(k % bl)) {
    // Groupwise
    auto kernel_packet = kai_select_groupwise_matmul_ukernel(
@@ -72,29 +63,19 @@ void kai_pack_int4_rhs(
 size_t kai_pack_rhs_int4_size(
    const int64_t n,
    const int64_t k,
-    const int64_t bl,
-    at::ScalarType tensor_dtype) {
+    const int64_t bl) {
  size_t packed_size = n * k;
+  // Prefer Channelwise kernel over Groupwise kernel for conflicting cases
  if (bl == k) {
-    if (tensor_dtype == at::kBFloat16) {
-      auto kernel_packet = kai_select_bf16_channelwise_matmul_ukernel(
-          kai_kernel_id::
-              matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod);
-      const auto& ukernel = kernel_packet.ukernel;
-      const size_t nr = ukernel.get_nr();
-      const size_t kr = ukernel.get_kr();
-      const size_t sr = ukernel.get_sr();
-      packed_size = kernel_packet.kai_get_rhs_packed_size(n, k, nr, kr, sr);
-    } else {
-      auto kernel_packet = kai_select_channelwise_matmul_ukernel(
-          kai_kernel_id::
-              matmul_clamp_f32_qai8dxp1x8_qsi4cxp8x8_1x8x32_neon_dotprod);
-      const auto& ukernel = kernel_packet.ukernel;
-      const size_t nr = ukernel.get_nr();
-      const size_t kr = ukernel.get_kr();
-      const size_t sr = ukernel.get_sr();
-      packed_size = kernel_packet.kai_get_rhs_packed_size(n, k, nr, kr, sr);
-    }
+    // Channelwise
+    auto kernel_packet = kai_select_channelwise_matmul_ukernel(
+        kai_kernel_id::
+            matmul_clamp_f32_qai8dxp1x8_qsi4cxp8x8_1x8x32_neon_dotprod);
+    const auto& ukernel = kernel_packet.ukernel;
+    const size_t nr = ukernel.get_nr();
+    const size_t kr = ukernel.get_kr();
+    const size_t sr = ukernel.get_sr();
+    packed_size = kernel_packet.kai_get_rhs_packed_size(n, k, nr, kr, sr);
  } else if (!(bl % 32) && !(k % bl)) {
    // Groupwise
    auto kernel_packet = kai_select_groupwise_matmul_ukernel(
@@ -167,7 +148,8 @@ static void kai_quant_pack_lhs_int4_mm_groupwise(
    const auto lhs_src_ptr = lhs_native_mtx_f32 + thread_id * src_stride;
    const int64_t m_idx = thread_id * vec_per_thread;
    auto lhs_packed_ptr = lhs_packed_base +
-        kernel_packet.kai_get_lhs_quant_pack_offset(m_idx, k, mr, kr, sr);
+        kai_get_lhs_packed_offset_lhs_quant_pack_qai8dxp_f32(
+                              m_idx, k, mr, kr, sr);
    const int64_t vec_num = (thread_id == num_threads - 1)
        ? (m - vec_per_thread * thread_id)
        : vec_per_thread;
@@ -277,7 +259,8 @@ static void kai_quant_pack_lhs_int4_mm_channelwise(
    const auto lhs_src_ptr = lhs_native_mtx_f32 + thread_id * src_stride;
    const int64_t m_idx = thread_id * vec_per_thread;
    auto lhs_packed_ptr = lhs_packed_base +
-        kernel_packet.kai_get_lhs_quant_pack_offset(m_idx, k, mr, kr, sr);
+        kai_get_lhs_packed_offset_lhs_quant_pack_qai8dxp_f32(
+                              m_idx, k, mr, kr, sr);
    const int64_t vec_num = (thread_id == num_threads - 1)
        ? (m - vec_per_thread * thread_id)
        : vec_per_thread;
@@ -337,144 +320,19 @@ static void kai_quant_pack_lhs_int4_mm_channelwise(
      });
 }

-static void kai_quant_pack_lhs_int4_mm_bf16_channelwise(
+void kai_quant_pack_lhs_int4_mm(
    const Tensor& output,
    const Tensor& input,
    const Tensor& weight,
    const int64_t m,
    const int64_t n,
-    const int64_t k) {
-  // Kernel IDs for GEMM and GEMV
-  constexpr kai_kernel_id gemm_id =
-      kai_kernel_id::matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm;
-  constexpr kai_kernel_id gemv_id =
-      kai_kernel_id::matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod;
-
-  // Get total threads and select kernel
-  const int64_t total_threads = at::get_num_threads();
-  auto kernel_packet = kai_select_bf16_channelwise_matmul_ukernel(gemv_id);
-  if (cpuinfo_has_arm_i8mm() && m > 1) {
-    kernel_packet = kai_select_bf16_channelwise_matmul_ukernel(gemm_id);
-  }
-
-  // Thread blocking parameters
-  const int64_t n_step = kernel_packet.ukernel.get_n_step();
-  const size_t mr = kernel_packet.ukernel.get_mr();
-  const size_t kr = kernel_packet.ukernel.get_kr();
-  const size_t sr = kernel_packet.ukernel.get_sr();
-
-  const size_t lhs_packed_size =
-      kernel_packet.kai_get_lhs_packed_size(m, k, mr, kr, sr);
-  auto lhs_packed = std::make_unique<uint8_t[]>(lhs_packed_size);
-  uint8_t* dst_act_mtx_bf16 = reinterpret_cast<uint8_t*>(output.data_ptr());
-  const uint8_t* lhs_native_mtx_bf16 =
-      reinterpret_cast<const uint8_t*>(input.data_ptr());
-  const uint8_t* rhs_packed_mtx_qs4cx =
-      reinterpret_cast<const uint8_t*>(weight.data_ptr());
-  uint8_t* lhs_packed_base = lhs_packed.get();
-
-  constexpr int32_t element_size = sizeof(uint16_t);
-  const size_t lhs_stride = k * element_size;
-  const size_t dst_stride = n * element_size;
-
-  // LHS quantization packing
-  int64_t vec_per_thread = get_vec_per_thread(m, total_threads, mr);
-  int64_t num_threads = (m + vec_per_thread - 1) / vec_per_thread;
-  const size_t src_stride = vec_per_thread * lhs_stride;
-
-  auto lhs_quant_pack = [=, &kernel_packet](int64_t thread_id) {
-    const auto lhs_src_ptr = lhs_native_mtx_bf16 + thread_id * src_stride;
-    const int64_t m_idx = thread_id * vec_per_thread;
-    auto lhs_packed_ptr = lhs_packed_base +
-        kernel_packet.kai_get_lhs_quant_pack_offset(m_idx, k, mr, kr, sr);
-    const int64_t vec_num = (thread_id == num_threads - 1)
-        ? (m - vec_per_thread * thread_id)
-        : vec_per_thread;
-
-    kernel_packet.kai_run_lhs_quant_pack(
-        vec_num,
-        k,
-        mr,
-        kr,
-        sr,
-        0,
-        (const uint16_t*)lhs_src_ptr,
-        lhs_stride,
-        lhs_packed_ptr);
-  };
-
-  at::parallel_for(
-      0, num_threads, /*grain_size=*/1, [&](int64_t begin, int64_t end) {
-        for (int64_t thread_id = begin; thread_id < end; ++thread_id) {
-          lhs_quant_pack(thread_id);
-        }
-      });
-
-  // Matrix multiplication
-  vec_per_thread = get_vec_per_thread(n, total_threads, n_step);
-  num_threads = (n + vec_per_thread - 1) / vec_per_thread;
-
-  auto mm = [=, &kernel_packet](int64_t thread_id) {
-    const auto rhs_packed_ptr = rhs_packed_mtx_qs4cx +
-        kernel_packet.ukernel.get_rhs_packed_offset(
-            thread_id * vec_per_thread, k);
-    auto dst_ptr = dst_act_mtx_bf16 +
-        kernel_packet.ukernel.get_dst_offset(
-            0, thread_id * vec_per_thread, dst_stride);
-    const int64_t vec_num = (thread_id == num_threads - 1)
-        ? (n - vec_per_thread * thread_id)
-        : vec_per_thread;
-
-    kernel_packet.ukernel.run_matmul(
-        m,
-        vec_num,
-        k,
-        lhs_packed_base,
-        rhs_packed_ptr,
-        (uint16_t*)dst_ptr,
-        dst_stride,
-        element_size, // dst_stride_col
-        -FLT_MAX,
-        FLT_MAX);
-  };
-
-  at::parallel_for(
-      0, num_threads, /*grain_size=*/1, [&](int64_t begin, int64_t end) {
-        for (int64_t thread_id = begin; thread_id < end; ++thread_id) {
-          mm(thread_id);
-        }
-      });
-}
-void kai_quant_pack_lhs_int4_mm(
-    const at::Tensor& output,
-    const at::Tensor& input,
-    const at::Tensor& weight,
-    const int64_t m,
-    const int64_t n,
    const int64_t k,
    const int64_t bl) {
  // Prefer Channelwise kernel over Groupwise kernel for conflicting cases
  if (bl == k) {
-    const auto input_dtype = input.dtype();
-
-    if (input_dtype == at::kBFloat16) {
-      if (cpuinfo_has_arm_bf16()) {
-        kleidiai::kai_quant_pack_lhs_int4_mm_bf16_channelwise(
-            output, input, weight, m, n, k);
-      } else {
-        TORCH_CHECK(
-            false,
-            "BF16 Unsupported: CPU does not support BF16. Please use a CPU with BF16 support.");
-      }
-    } else if (input_dtype == at::kFloat) {
-      kleidiai::kai_quant_pack_lhs_int4_mm_channelwise(
-          output, input, weight, m, n, k);
-    } else {
-      TORCH_CHECK(
-          false,
-          "Unsupported input data type: Only Bfloat16 and Float inputs are supported.");
-    }
-  } else if ((bl % 32 == 0) && (k % bl == 0)) {
+    kleidiai::kai_quant_pack_lhs_int4_mm_channelwise(
+        output, input, weight, m, n, k);
+  } else if (!(bl % 32) && !(k % bl)) {
    kleidiai::kai_quant_pack_lhs_int4_mm_groupwise(
        output, input, weight, m, n, k, bl);
  }
--- a/aten/src/ATen/native/kleidiai/kai_kernels.h
+++ b/aten/src/ATen/native/kleidiai/kai_kernels.h
@ -25,8 +25,7 @@ void kai_pack_int4_rhs(
 size_t kai_pack_rhs_int4_size(
    const int64_t n,
    const int64_t k,
-    const int64_t bl,
-    at::ScalarType tensor_dtype = at::kFloat);
+    const int64_t bl);

 /**
 * @brief Run 2 operations ( Input quantize and pack -> 4 bit Matmul )
--- a/aten/src/ATen/native/kleidiai/kai_pack.h
+++ b/aten/src/ATen/native/kleidiai/kai_pack.h
@ -36,8 +36,7 @@ void kai_pack_rhs_groupwise_int4(
    AT_ERROR("kai_pack_rhs_channelwise_int4: Scales data pointer is null");
  }

-  float* bias_ptr =
-      bias.has_value() ? bias.value().to(kFloat).data_ptr<float>() : NULL;
+  float* bias_ptr = bias.has_value() ? bias.value().data_ptr<float>() : NULL;
  auto& params = kernel.rhs_pack_params;

  kernel.kai_run_rhs_pack(
@ -74,8 +73,7 @@ void kai_pack_rhs_channelwise_int4(
  auto weight_packed_data =
      reinterpret_cast<uint8_t*>(weight_packed.data_ptr());
  const auto weight_data = weight.data_ptr<uint8_t>();
-
-  const auto scales_data = scales.to(kFloat).data_ptr<float>();
+  const auto scales_data = scales.data_ptr<float>();

  if (weight_data == nullptr) {
    AT_ERROR("kai_pack_rhs_channelwise_int4: Weight data pointer is null");
@ -85,8 +83,7 @@ void kai_pack_rhs_channelwise_int4(
    AT_ERROR("kai_pack_rhs_channelwise_int4: Scales data pointer is null");
  }

-  float* bias_ptr =
-      bias.has_value() ? bias.value().to(kFloat).data_ptr<float>() : NULL;
+  float* bias_ptr = bias.has_value() ? bias.value().data_ptr<float>() : NULL;
  auto& params = kernel.rhs_pack_params;

  kernel.kai_run_rhs_pack(
--- a/aten/src/ATen/native/kleidiai/kai_ukernel_interface.cpp
+++ b/aten/src/ATen/native/kleidiai/kai_ukernel_interface.cpp
@ -68,39 +68,5 @@ kai_matmul_ukernel_f32_qa8dxp_qs4cxp kai_select_channelwise_matmul_ukernel(
    const kai_kernel_id id) {
  return channelwise_8bit_4bit_kernels.at(id);
 }
-
-// Kernel Mapping - BF16 Channelwise
-std::unordered_map<kai_kernel_id, kai_matmul_ukernel_bf16_qa8dxp_qs4cxp>
-    bf16_channelwise_8bit_4bit_kernels = {
-        {kai_kernel_id::
-             matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod,
-         {{kai_get_m_step_matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod,
-           kai_get_n_step_matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod,
-           kai_get_mr_matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod,
-           kai_get_nr_matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod,
-           kai_get_kr_matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod,
-           kai_get_sr_matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod,
-           kai_get_lhs_packed_offset_matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod,
-           kai_get_rhs_packed_offset_matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod,
-           kai_get_dst_offset_matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod,
-           kai_get_dst_size_matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod,
-           kai_run_matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod}}},
-        {kai_kernel_id::matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm,
-         {{kai_get_m_step_matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm,
-           kai_get_n_step_matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm,
-           kai_get_mr_matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm,
-           kai_get_nr_matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm,
-           kai_get_kr_matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm,
-           kai_get_sr_matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm,
-           kai_get_lhs_packed_offset_matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm,
-           kai_get_rhs_packed_offset_matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm,
-           kai_get_dst_offset_matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm,
-           kai_get_dst_size_matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm,
-           kai_run_matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm}}}};
-
-kai_matmul_ukernel_bf16_qa8dxp_qs4cxp kai_select_bf16_channelwise_matmul_ukernel(
-    const kai_kernel_id id) {
-  return bf16_channelwise_8bit_4bit_kernels.at(id);
-}
 } // namespace at::native::kleidiai
 #endif
--- a/aten/src/ATen/native/kleidiai/kai_ukernel_interface.h
+++ b/aten/src/ATen/native/kleidiai/kai_ukernel_interface.h
@ -10,32 +10,21 @@
 #include <kai/ukernels/matmul/matmul_clamp_f32_qai8dxp_qsi4cxp/kai_matmul_clamp_f32_qai8dxp1x8_qsi4cxp8x8_1x8x32_neon_dotprod.h>
 #include <kai/ukernels/matmul/matmul_clamp_f32_qai8dxp_qsi4cxp/kai_matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_8x8x32_neon_i8mm.h>
 #include <kai/ukernels/matmul/matmul_clamp_f32_qai8dxp_qsi4cxp/kai_matmul_clamp_f32_qai8dxp_qsi4cxp_interface.h>
-#include <kai/ukernels/matmul/matmul_clamp_bf16_qai8dxp_qsi4cxp/kai_matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod.h>
-#include <kai/ukernels/matmul/matmul_clamp_bf16_qai8dxp_qsi4cxp/kai_matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm.h>
-#include <kai/ukernels/matmul/matmul_clamp_bf16_qai8dxp_qsi4cxp/kai_matmul_clamp_bf16_qai8dxp_qsi4cxp_interface.h>
 #include <kai/ukernels/matmul/pack/kai_lhs_quant_pack_qai8dxp_f32.h>
-#include <kai/ukernels/matmul/pack/kai_lhs_quant_pack_qai8dxp_bf16_neon.h>
 #include <kai/ukernels/matmul/pack/kai_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0.h>
 #include <kai/ukernels/matmul/pack/kai_rhs_pack_nxk_qsi4cxp_qs4cxs1s0.h>

 namespace at::native::kleidiai {

 enum class kai_kernel_id {
-  // FP32 inputs, 4-bit weights, FP32 output
  matmul_clamp_f32_qai8dxp1x8_qsi4c32p8x8_1x8x32_neon_dotprod =
-      0, // Groupwise 4-bit GEMV (per-group scales, NEON DOTPROD)
+      0, // Groupwise 4 bit GEMV
  matmul_clamp_f32_qai8dxp4x8_qsi4c32p4x8_4x8x32_neon_i8mm =
-      1, // Groupwise 4-bit GEMM (per-group scales, NEON I8MM)
+      1, // Groupwise 4 bit GEMM
  matmul_clamp_f32_qai8dxp1x8_qsi4cxp8x8_1x8x32_neon_dotprod =
-      2, // Channelwise 4-bit GEMV (per-channel scales, NEON DOTPROD)
+      2, // Channelwise 4 bit GEMV
  matmul_clamp_f32_qai8dxp4x8_qsi4cxp8x8_8x8x32_neon_i8mm =
-      3, // Channelwise 4-bit GEMM (per-channel scales, NEON I8MM)
-
-  // BF16 inputs, 4-bit weights, BF16 output
-  matmul_clamp_bf16_qai8dxp1x8_qsi4cxp8x8_1x8_neon_dotprod =
-      4, // Channelwise 4-bit GEMV with BF16 input/output
-  matmul_clamp_bf16_qai8dxp4x8_qsi4cxp8x8_8x8_neon_i8mm =
-      5  // Channelwise 4-bit GEMM with BF16 input/output
+      3 // Channelwise 4 bit GEMM
 };

 // Channelwise Kernel mapping
@ -77,9 +66,6 @@ struct kai_matmul_ukernel_f32_qa8dxp_qs4cxp {
      void* rhs_packed,
      size_t extra_bytes,
      const struct kai_rhs_pack_nxk_qsi4cxp_qs4cxs1s0_params* params);
-   size_t(*kai_get_lhs_quant_pack_offset)(
-        size_t m_idx, size_t k, size_t mr, size_t kr, size_t sr
-    );

  kai_matmul_ukernel_f32_qa8dxp_qs4cxp(
      const kai_matmul_clamp_f32_qai8dxp_qsi4cxp_ukernel& kernel)
@ -89,71 +75,12 @@ struct kai_matmul_ukernel_f32_qa8dxp_qs4cxp {
        kai_get_rhs_packed_size(
            &kai_get_rhs_packed_size_rhs_pack_nxk_qsi4cxp_qs4cxs1s0),
        kai_run_lhs_quant_pack(&kai_run_lhs_quant_pack_qai8dxp_f32),
-        kai_run_rhs_pack(&kai_run_rhs_pack_nxk_qsi4cxp_qs4cxs1s0),
-        kai_get_lhs_quant_pack_offset(&kai_get_lhs_packed_offset_lhs_quant_pack_qai8dxp_f32){}
+        kai_run_rhs_pack(&kai_run_rhs_pack_nxk_qsi4cxp_qs4cxs1s0) {}
 };

 struct kai_matmul_ukernel_f32_qa8dxp_qs4cxp
 kai_select_channelwise_matmul_ukernel(const kai_kernel_id id);

-// bf16 Channelwise Kernel mapping
-struct kai_matmul_ukernel_bf16_qa8dxp_qs4cxp {
-    struct kai_matmul_clamp_bf16_qai8dxp_qsi4cxp_ukernel ukernel;
-    struct kai_rhs_pack_nxk_qsi4cxp_qs4cxs1s0_params rhs_pack_params;
-    size_t (*kai_get_lhs_packed_size)(
-        size_t m,
-        size_t k,
-        size_t mr,
-        size_t kr,
-        size_t sr);
-    size_t (*kai_get_rhs_packed_size)(
-        size_t n,
-        size_t k,
-        size_t nr,
-        size_t kr,
-        size_t sr);
-    void (*kai_run_lhs_quant_pack)(
-        size_t m,
-        size_t k,
-        size_t mr,
-        size_t kr,
-        size_t sr,
-        size_t m_idx_start,
-        const void* lhs,
-        size_t lhs_stride,
-        void* lhs_packed);
-    void (*kai_run_rhs_pack)(
-        size_t num_groups,
-        size_t n,
-        size_t k,
-        size_t nr,
-        size_t kr,
-        size_t sr,
-        const uint8_t* rhs,
-        const float* bias,
-        const float* scale,
-        void* rhs_packed,
-        size_t extra_bytes,
-        const struct kai_rhs_pack_nxk_qsi4cxp_qs4cxs1s0_params* params);
-        size_t(*kai_get_lhs_quant_pack_offset)(
-            size_t m_idx, size_t k, size_t mr, size_t kr, size_t sr
-        );
-
-    kai_matmul_ukernel_bf16_qa8dxp_qs4cxp(
-        const kai_matmul_clamp_bf16_qai8dxp_qsi4cxp_ukernel& kernel)
-        : ukernel(kernel),
-          kai_get_lhs_packed_size(
-              &kai_get_lhs_packed_size_lhs_quant_pack_qai8dxp_bf16_neon),
-          kai_get_rhs_packed_size(
-              &kai_get_rhs_packed_size_rhs_pack_nxk_qsi4cxp_qs4cxs1s0),
-          kai_run_lhs_quant_pack(&kai_run_lhs_quant_pack_qai8dxp_bf16_neon),
-          kai_run_rhs_pack(&kai_run_rhs_pack_nxk_qsi4cxp_qs4cxs1s0),
-          kai_get_lhs_quant_pack_offset(&kai_get_lhs_packed_offset_lhs_quant_pack_qai8dxp_bf16_neon){}
-  };
-
-struct kai_matmul_ukernel_bf16_qa8dxp_qs4cxp
-kai_select_bf16_channelwise_matmul_ukernel(const kai_kernel_id id);
-
 // Groupwise Kernel mapping
 struct kai_matmul_ukernel_f32_qa8dxp_qs4c32p {
  struct kai_matmul_clamp_f32_qai8dxp_qsi4c32p_ukernel ukernel;
@ -198,9 +125,6 @@ struct kai_matmul_ukernel_f32_qa8dxp_qs4c32p {
      void* rhs_packed,
      size_t extra_bytes,
      const struct kai_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0_params* params);
-      size_t(*kai_get_lhs_quant_pack_offset)(
-        size_t m_idx, size_t k, size_t mr, size_t kr, size_t sr
-    );

  kai_matmul_ukernel_f32_qa8dxp_qs4c32p(
      const kai_matmul_clamp_f32_qai8dxp_qsi4c32p_ukernel& kernel)
@ -210,8 +134,7 @@ struct kai_matmul_ukernel_f32_qa8dxp_qs4c32p {
        kai_get_rhs_packed_size(
            &kai_get_rhs_packed_size_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0),
        kai_run_lhs_quant_pack(&kai_run_lhs_quant_pack_qai8dxp_f32),
-        kai_run_rhs_pack(&kai_run_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0),
-        kai_get_lhs_quant_pack_offset(&kai_get_lhs_packed_offset_lhs_quant_pack_qai8dxp_f32) {}
+        kai_run_rhs_pack(&kai_run_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0) {}
 };

 struct kai_matmul_ukernel_f32_qa8dxp_qs4c32p kai_select_groupwise_matmul_ukernel(
--- a/test/inductor/test_torchinductor.py
+++ b/test/inductor/test_torchinductor.py
@ -2476,11 +2476,12 @@ class CommonTemplate:
        b_int8pack, b_scales = convert_weight_to_int8pack(b)
        self.common(fn, (a, b_int8pack, b_scales, c))

+    @xfail_if_mps_unimplemented
    @xfail_if_triton_cpu
    @skipCUDAIf(True, "No _dyn_quant_pack_4bit_weight implementation on CUDA")
    @skipIfRocm
    @skipIfXpu(msg="No _dyn_quant_pack_4bit_weight implementation on XPU")
-    def test__dyn_quant_pack_4bit_weight_fp32(self):
+    def test__dyn_quant_pack_4bit_weight(self):
        q_group = 32
        k = 128
        n = 128
@ -2511,46 +2512,12 @@ class CommonTemplate:

        self.common(fn, (b, in_features, out_features))

-    @xfail_if_triton_cpu
-    @skipCUDAIf(True, "No _dyn_quant_pack_4bit_weight implementation on CUDA")
-    @skipIfRocm
-    @skipIfXpu(msg="No _dyn_quant_pack_4bit_weight implementation on XPU")
-    def test__dyn_quant_pack_4bit_weight_bf16(self):
-        q_group = 32
-        k = 128
-        n = 128
-
-        torch.manual_seed(1)
-        b = torch.rand((k, n), dtype=torch.bfloat16)
-        in_features = b.size(0)
-        out_features = b.size(1)
-
-        def dyn_quant_pack_4bit_weight(b, in_features, out_features):
-            b_uint8, b_scales_and_zeros = _group_quantize_tensor_symmetric(
-                b, n_bit=4, groupsize=q_group
-            )
-
-            if q_group == in_features:
-                b_scales_and_zeros = b_scales_and_zeros.to(torch.float)
-            else:
-                b_scales_and_zeros = b_scales_and_zeros.to(torch.bfloat16)
-            b_int4pack = torch._dyn_quant_pack_4bit_weight(
-                b_uint8, b_scales_and_zeros, None, q_group, in_features, out_features
-            )
-
-            return b_int4pack, b_scales_and_zeros
-
-        def fn(b, in_features, out_features):
-            b_int4pack, _ = dyn_quant_pack_4bit_weight(b, in_features, out_features)
-            return b_int4pack
-
-        self.common(fn, (b, in_features, out_features))
-
+    @xfail_if_mps_unimplemented
    @xfail_if_triton_cpu
    @skipCUDAIf(True, "No _dyn_quant_matmul_4bit implementation on CUDA")
    @skipIfRocm
    @skipIfXpu(msg="No _dyn_quant_matmul_4bit implementation on XPU")
-    def test__dyn_quant_matmul_4bit_fp32_input(self):
+    def test__dyn_quant_matmul_4bit(self):
        q_group = 32
        m = 32
        k = 128
@ -2590,60 +2557,6 @@ class CommonTemplate:

        self.common(fn, (a, q_group, in_features, out_features))

-    @xfail_if_triton_cpu
-    @skipCUDAIf(True, "No _dyn_quant_matmul_4bit implementation on CUDA")
-    @skipIfRocm
-    @skipIfXpu(msg="No _dyn_quant_matmul_4bit implementation on XPU")
-    def test__dyn_quant_matmul_4bit_bf16_input(self):
-        m = 32
-        k = 128
-        n = 128
-        q_group = k
-
-        torch.manual_seed(1)
-        a = torch.rand((m, k), dtype=torch.bfloat16)
-        b = torch.rand((k, n), dtype=torch.bfloat16)
-
-        # codegen_dynamic_shape test fails without explicitly marking these dynamic
-        torch._dynamo.mark_dynamic(a, 0)
-        torch._dynamo.mark_dynamic(b, 1)
-
-        in_features = b.size(0)
-        out_features = b.size(1)
-
-        if not self.is_dtype_supported(torch.bfloat16):
-            raise unittest.SkipTest(
-                f"torch.bfloat16 not supported for device {self.device}"
-            )
-
-        def dyn_quant_pack_4bit_weight(b, in_features, out_features):
-            b_uint8, b_scales_and_zeros = _group_quantize_tensor_symmetric(
-                b, n_bit=4, groupsize=q_group
-            )
-
-            if q_group == in_features:
-                b_scales_and_zeros = b_scales_and_zeros.to(torch.float)
-            else:
-                b_scales_and_zeros = b_scales_and_zeros.to(torch.bfloat16)
-            b_int4pack = torch._dyn_quant_pack_4bit_weight(
-                b_uint8, b_scales_and_zeros, None, q_group, in_features, out_features
-            )
-
-            return b_int4pack, b_scales_and_zeros
-
-        def fn(a, q_group, in_features, out_features):
-            b_int4pack, _ = dyn_quant_pack_4bit_weight(b, in_features, out_features)
-            res = torch.ops.aten._dyn_quant_matmul_4bit(
-                a,
-                b_int4pack,
-                q_group,
-                in_features,
-                out_features,
-            )
-            return res
-
-        self.common(fn, (a, q_group, in_features, out_features), atol=1, rtol=0.5)
-
    def test_expanded_reduction(self):
        def fn(x, y):
            z = x * y
--- a/test/test_linalg.py
+++ b/test/test_linalg.py
@ -7798,7 +7798,7 @@ scipy_lobpcg  | {eq_err_scipy:10.2e}  | {eq_err_general_scipy:10.2e}  | {iters2:
    @parametrize("m", [1, 32])
    @parametrize("k", [64, 128])
    @parametrize("n", [4096, 11008])
-    def test__dyn_quant_matmul_4bit_fp32(self, device, m, k, n):
+    def test__dyn_quant_matmul_4bit(self, device, m, k, n):
        if self.device_type == "cuda":
            self.skipTest("CUDA is unsupported")

@ -7870,86 +7870,7 @@ scipy_lobpcg  | {eq_err_scipy:10.2e}  | {eq_err_general_scipy:10.2e}  | {iters2:
    @parametrize("m", [1, 32])
    @parametrize("k", [64, 128])
    @parametrize("n", [4096, 11008])
-    def test__dyn_quant_matmul_4bit_bf16(self, device, m, k, n):
-        if self.device_type == "cuda":
-            self.skipTest("CUDA is unsupported")
-
-
-        torch.manual_seed(1)
-        a_bfloat16 = torch.rand((m, k), dtype=torch.bfloat16, device=device)
-        b_bfloat16 = torch.rand((k, n), dtype=torch.bfloat16, device=device)
-        in_features = b_bfloat16.size(0)
-        out_features = b_bfloat16.size(1)
-        q_group = in_features
-
-        def dyn_quant_pack_4bit_weight(b, in_features, out_features):
-            b_uint8, b_scales_and_zeros = _group_quantize_tensor_symmetric(
-                b, n_bit=4, groupsize=q_group
-            )
-
-            if q_group == in_features:
-                b_scales_and_zeros = b_scales_and_zeros.to(torch.float)
-            else:
-                b_scales_and_zeros = b_scales_and_zeros.to(torch.bfloat16)
-            b_int4pack = torch._dyn_quant_pack_4bit_weight(
-                b_uint8, b_scales_and_zeros, None, q_group, in_features, out_features
-            )
-
-            return b_int4pack, b_scales_and_zeros
-
-        def dyn_quant_matmul_4bit(
-            a, b_int4pack, q_group, in_features, out_features
-        ):
-            return torch.ops.aten._dyn_quant_matmul_4bit(
-                a,
-                b_int4pack,
-                q_group,
-                in_features,
-                out_features,
-            )
-
-        b_int4pack, b_scales_and_zeros = dyn_quant_pack_4bit_weight(
-            b_bfloat16, in_features, out_features
-        )
-
-        dtypes = [torch.bfloat16]
-
-        for dtype in dtypes:
-            a = a_bfloat16.to(dtype=dtype)
-            b = b_bfloat16.to(dtype=dtype)
-            ref = torch.mm(a, b)
-            res = dyn_quant_matmul_4bit(
-                a,
-                b_int4pack,
-                q_group,
-                in_features,
-                out_features,
-            )
-
-        # Mean relative error check
-        expected_mean_err = 0.00952
-        mean_err_tol = 0.005  # allow small deviation (±0.005)
-        mean_err = ((res - ref).abs() / ref.abs().clamp(min=1e-5)).mean()
-        self.assertTrue(
-            abs(mean_err - expected_mean_err) < mean_err_tol,
-            f"Mean relative error {mean_err:.6f} deviates from expected {expected_mean_err}"
-        )
-
-        # Elementwise relative error check
-        elementwise_diff = (res - ref).abs()
-        elementwise_relative_error = elementwise_diff / ref.abs().clamp(min=torch.finfo(ref.dtype).eps)
-        self.assertTrue(
-            torch.all(elementwise_relative_error < 0.070),
-            "Some elements have relative error >= 7%"
-        )
-
-    @unittest.skipIf(IS_FBCODE and IS_REMOTE_GPU, "cublas runtime error")
-    @unittest.skipIf(TEST_WITH_ROCM and IS_REMOTE_GPU, "ROCM is unsupported")
-    @onlyNativeDeviceTypes
-    @parametrize("m", [1, 32])
-    @parametrize("k", [64, 128])
-    @parametrize("n", [4096, 11008])
-    def test_compile_dyn_quant_matmul_4bit_fp32(self, device, m, k, n):
+    def test_compile_dyn_quant_matmul_4bit(self, device, m, k, n):
        if self.device_type == "cuda":
            self.skipTest("CUDA is unsupported")

@ -8007,83 +7928,6 @@ scipy_lobpcg  | {eq_err_scipy:10.2e}  | {eq_err_general_scipy:10.2e}  | {iters2:
        )

    @onlyNativeDeviceTypes
-    @unittest.skipIf(IS_FBCODE and IS_REMOTE_GPU, "cublas runtime error")
-    @unittest.skipIf(TEST_WITH_ROCM and IS_REMOTE_GPU, "ROCM is unsupported")
-    @onlyNativeDeviceTypes
-    @parametrize("m", [1, 32])
-    @parametrize("k", [64, 128])
-    @parametrize("n", [4096, 11008])
-    def test_compile_dyn_quant_matmul_4bit_bf16(self, device, m, k, n):
-        if self.device_type == "cuda":
-            self.skipTest("CUDA is unsupported")
-
-
-        torch.manual_seed(1)
-        a_bfloat16 = torch.rand((m, k), dtype=torch.bfloat16, device=device)
-        b_bfloat16 = torch.rand((k, n), dtype=torch.bfloat16, device=device)
-        in_features = b_bfloat16.size(0)
-        out_features = b_bfloat16.size(1)
-        q_group = in_features
-
-        b_uint8, b_scales_and_zeros = _group_quantize_tensor_symmetric(
-            b_bfloat16, n_bit=4, groupsize=q_group
-        )
-
-        if q_group == in_features:
-            b_scales_and_zeros = b_scales_and_zeros.to(dtype=torch.float)
-        else:
-            b_scales_and_zeros = b_scales_and_zeros.to(dtype=torch.bfloat16)
-
-        @torch.compile
-        def dyn_quant_matmul_4bit(
-            a, b_uint8, b_scales_and_zeros, q_group, in_features, out_features
-        ):
-            b_int4pack = torch._dyn_quant_pack_4bit_weight(
-                b_uint8, b_scales_and_zeros, None, q_group, in_features, out_features
-            )
-            return torch._dyn_quant_matmul_4bit(
-                a,
-                b_int4pack,
-                q_group,
-                in_features,
-                out_features,
-            )
-
-        res = dyn_quant_matmul_4bit(
-            a_bfloat16,
-            b_uint8,
-            b_scales_and_zeros,
-            q_group,
-            in_features,
-            out_features,
-        )
-        ref = torch.mm(a_bfloat16, b_bfloat16)
-
-        # === Accuracy checks ===
-
-        # Mean relative error check
-        expected_mean_err = 0.00952
-        mean_err_tol = 0.005  # allow small deviation (±0.005)
-        mean_err = ((res - ref).abs() / ref.abs().clamp(min=1e-5)).mean()
-        self.assertTrue(
-            abs(mean_err - expected_mean_err) < mean_err_tol,
-            f"Mean relative error {mean_err:.6f} deviates from expected {expected_mean_err}"
-        )
-
-        # Avoid divide-by-zero with clamp
-        denominator = ref.abs().clamp(min=torch.finfo(ref.dtype).eps)
-
-        # Compute elementwise relative error — always non-negative
-        elementwise_relative_error = (res - ref).abs() / denominator
-
-        # Check if all elements are within 6% error
-        assert torch.all(elementwise_relative_error >= 0), "Relative error should never be negative"
-        self.assertTrue(
-            torch.all(elementwise_relative_error < 0.070),
-            "Some elements have relative error >= 7%"
-        )
-
-    @onlyCPU
    @parametrize("m", [32, 64])
    @parametrize("k", [32, 64])
    @parametrize("n", [48, 64])
--- a/torch/_meta_registrations.py
+++ b/torch/_meta_registrations.py
@ -3722,7 +3722,6 @@ def kai_roundup(a: int, b: int) -> int:

 def get_kai_packed_weight_size(n_bits, N, K, groupsize):
    if n_bits == 4:
-        # Works for both fp32 and bf16 Kernels
        if groupsize == K:  # channelwise
            # dotprod params only [1x8x32_neon_dotprod]
            kai_nr = 8
@ -3852,8 +3851,6 @@ def meta__dyn_quant_pack_4bit_weight(
        )
        return weights.new_empty(int(packed_weight_size), dtype=torch.uint8)
    packed_weight_size = weights.numel() + scales_zeros.numel()
-    if bias is not None:
-        packed_weight_size += bias.numel()
    return weights.new_empty(packed_weight_size, dtype=torch.float)


@ -3867,12 +3864,8 @@ def meta__dyn_quant_matmul_4bit(
 ):
    torch._check(inp.dim() == 2, lambda: "input must be a 2D tensor")
    torch._check(
-        (inp.dtype == torch.float32)
-        or (inp.dtype == torch.bfloat16 and block_size == in_features),
-        lambda: (
-            f"expected input to be f32 or bf16 (bf16 requires block_size == in_features), "
-            f"got {inp.dtype} with block_size={block_size} and in_features={in_features}"
-        ),
+        inp.dtype == torch.float32,
+        lambda: f"expected input to be f32, got {inp.dtype}",
    )
    M = inp.size(0)
    return inp.new_empty(M, out_features, dtype=inp.dtype)
--- a/torch/_refs/__init__.py
+++ b/torch/_refs/__init__.py
@ -702,7 +702,7 @@ def exp2(a):
 # CompositeImplicitAutograd - don't register decomp
@out_wrapper()
@elementwise_type_promotion_wrapper(
-    type_promoting_args=("a,"),
+    type_promoting_args=("a",),
    type_promotion_kind=ELEMENTWISE_TYPE_PROMOTION_KIND.NO_OPMATH,
 )
 def fill(a: TensorLikeType, value: NumberType) -> TensorLikeType:
Author	SHA1	Message	Date
can-gaa-hou	ae71b0e163	Fix typo in torch._refs (#167310) Should be a typo here, but it doesn't raise an error because the inner function splits it into `a` and `,`, and the `,` case check is skipped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/167310 Approved by: https://github.com/eellison	2025-11-11 22:31:09 +00:00
PyTorch MergeBot	5b6ff8148d	Revert "[ARM] Improve LLM performance & mem usage using int4-bf16 KleidiAI kernels (#158250)" This reverts commit 402c46503002f98ccfc023a733081fb0719223a1. Reverted https://github.com/pytorch/pytorch/pull/158250 on behalf of https://github.com/izaitsevfb due to Broke some torch.compile jobs ([comment](https://github.com/pytorch/pytorch/pull/158250#issuecomment-3518944863))	2025-11-11 22:27:51 +00:00
Jithun Nair	1f7e4343e7	[ROCm][CI] Add docker-cache-rocm.yml to test MI3xx CI docker caching (#167554) * Trigger this workflow on every completed run of `docker-builds.yml` * Uses `ubuntu-latest` for downloading artifacts from `docker-build` workflow run * Uses `linux.rocm.gfx942.docker-cache` to cache docker images as tarballs for MI3xx CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/167554 Approved by: https://github.com/jeffdaily	2025-11-11 21:32:22 +00:00
PyTorch MergeBot	b21856f5fc	Revert "[DebugMode] record triton kernels, run-to-run determinism checks (#167028)" This reverts commit 259ba0ecabd809edd35d12b4f992777cb5923b68. Reverted https://github.com/pytorch/pytorch/pull/167028 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/167028#issuecomment-3518811298))	2025-11-11 21:31:12 +00:00
Pian Pawakapan	259ba0ecab	[DebugMode] record triton kernels, run-to-run determinism checks (#167028 ) Following up on https://github.com/pytorch/pytorch/pull/166348, extends DebugMode to capture inductor triton kernels at runtime, and adds an API for checking run-to-run determinism based on tensor hashes. The workflow looks something like... ```python # do 1st run with hashes, get logs with DebugMode() as debug_mode, DebugMode.log_tensor_hashes(): compiled_model(inputs) logs1 = debug_mode.logs # do 2nd run with DebugMode() as debug_mode, DebugMode.log_tensor_hashes(): compiled_model(inputs) logs2 = debug_mode.logs # returns list of calls w/ mismatched outputs mismatches = DebugMode.check_hash_mismatches(logs1, logs2) ``` Example dump off a smaller version of @drisspg's FlexAttention fwd+bwd determinism tests [script](https://gist.github.com/pianpwk/f65cc63811d12853709dcc77d7eb69f1) (without forced reduction order): ``` cfg: TestConfig(name='Standard', B=2, Hq=32, Hkv=32, Q=2048, KV=2048, Dqk=128, Dv=128) DETERMINISM: fwd: True, bwd_q: False, bwd_k: False, bwd_v: True $$$ DEBUG MODE DUMP $$$ (this is what the logs look like) [triton] triton_tem_fused_0(arg_Q=t: bf16[2, 32, 2048, 128], arg_K=t: bf16[2, 32, 2048, 128], arg_V=t: bf16[2, 32, 2048, 128], arg_LSE=t: f32[2, 32, 2048], arg_MAX=t: f32[2, 32, 2048], arg_KV_NUM_BLKS=t: i32[2, 32, 16], arg_KV_IDX=t: i32[2, 32, 16, 16], arg_FULL_KV_NUM_BLKS=t: i32[2, 32, 16], arg_FULL_KV_IDX=t: i32[2, 32, 16, 16], out_ptr0=t: bf16[2, 32, 2048, 128]) # post-kernel hashes: {arg_Q: 13385916.068706088, arg_K: 13389356.409105342, arg_V: 13384993.48412523, arg_LSE: 1347168.9026973695, arg_MAX: 81775.3811062593, arg_KV_NUM_BLKS: 1024.0, arg_KV_IDX: 122880.0, arg_FULL_KV_NUM_BLKS: 7680.0, arg_FULL_KV_IDX: 122880.0, out_ptr0: 924917.7918248245} [triton] triton_per_fused_zeros_0(in_ptr0=t: bf16[2, 32, 2048, 128], in_ptr1=t: bf16[2, 32, 2048, 128], out_ptr1=t: f32[2, 32, 2048], xnumel=131072, r0_numel=128) # post-kernel hashes: {in_ptr0: 
924917.7918248245, in_ptr1: 13389213.797377996, out_ptr1: 81775.38106592931} [triton] triton_tem_fused_zeros_1(arg_Q=t: bf16[2, 32, 2048, 128], arg_K=t: bf16[2, 32, 2048, 128], arg_V=t: bf16[2, 32, 2048, 128], arg_LSE=t: f32[2, 32, 2048], arg_DELTA=t: f32[2, 32, 2048], arg_DO=t: bf16[2, 32, 2048, 128], arg_DQ=t: bf16[2, 32, 2048, 128], arg_DV=t: bf16[2, 32, 2048, 128], arg_KV_NUM_BLKS=t: i32[2, 32, 16], arg_KV_IDX=t: i32[2, 32, 16, 16], arg_Q_NUM_BLKS=t: i32[2, 32, 16], arg_Q_IDX=t: i32[2, 32, 16, 16], arg_FULL_KV_NUM_BLKS=t: i32[2, 32, 16], arg_FULL_KV_IDX=t: i32[2, 32, 16, 16], arg_FULL_Q_NUM_BLKS=t: i32[2, 32, 16], arg_FULL_Q_IDX=t: i32[2, 32, 16, 16], out_ptr0=t: bf16[2, 32, 2048, 128]) # post-kernel hashes: {arg_Q: 13385916.068706088, arg_K: 13389356.409105342, arg_V: 13384993.48412523, arg_LSE: 1347168.9026973695, arg_DELTA: 81775.38106592931, arg_DO: 13389213.797377996, arg_DQ: 874474.8084187683, arg_DV: 727742.3138379117, arg_KV_NUM_BLKS: 1024.0, arg_KV_IDX: 122880.0, arg_Q_NUM_BLKS: 1024.0, arg_Q_IDX: 122880.0, arg_FULL_KV_NUM_BLKS: 7680.0, arg_FULL_KV_IDX: 122880.0, arg_FULL_Q_NUM_BLKS: 7680.0, arg_FULL_Q_IDX: 122880.0, out_ptr0: 700542.3431890717} $$$ MISMATCHES $$$ mismatch: {'call_type': 'triton kernel', 'call': 'triton_tem_fused_0', 'arg_name': 'arg_MAX', 'pytree_path': None, 'hash1': 0.0, 'hash2': 81775.3811062593, 'rel_diff': 1.0, 'is_input_hash': False} # I guess this one is misleading? 
not sure if I'm doing something wrong with waiting for kernel results mismatch: {'call_type': 'triton kernel', 'call': 'triton_per_fused_zeros_0', 'arg_name': 'out_ptr1', 'pytree_path': None, 'hash1': 81775.3811062593, 'hash2': 81775.38106592931, 'rel_diff': 4.931801261646669e-10, 'is_input_hash': False} mismatch: {'call_type': 'triton kernel', 'call': 'triton_tem_fused_zeros_1', 'arg_name': 'arg_DELTA', 'pytree_path': None, 'hash1': 81775.3811062593, 'hash2': 81775.38106592931, 'rel_diff': 4.931801261646669e-10, 'is_input_hash': False} mismatch: {'call_type': 'triton kernel', 'call': 'triton_tem_fused_zeros_1', 'arg_name': 'arg_DQ', 'pytree_path': None, 'hash1': 874474.8097136207, 'hash2': 874474.8084187683, 'rel_diff': 1.480720012120795e-09, 'is_input_hash': False} mismatch: {'call_type': 'triton kernel', 'call': 'triton_tem_fused_zeros_1', 'arg_name': 'out_ptr0', 'pytree_path': None, 'hash1': 700542.3488049245, 'hash2': 700542.3431890717, 'rel_diff': 8.016435812581196e-09, 'is_input_hash': False} ``` note: current hash implementation is basically tensor norm, so tensor closeness -> hash closeness. This is likely to change soon, e.g. maybe to `torch.hash_tensor` (https://github.com/pytorch/pytorch/pull/154149) by default Sample paste diff between log dumps from 2 runs: <img width="1665" height="445" alt="Screenshot 2025-11-05 at 11 27 24 PM" src="https://github.com/user-attachments/assets/41402e37-f50b-4a9e-a17c-bb98b5917076" /> Another case where running this for FSDP2 on Llama3-8B, helped narrow down divergence b/w aot_eager <-> inductor, to inductor's FWD RMSNorm kernels: P2027003180 Pull Request resolved: https://github.com/pytorch/pytorch/pull/167028 Approved by: https://github.com/v0i0	2025-11-11 20:37:53 +00:00
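The run-to-run comparison described in the #167028 message above can be illustrated without PyTorch. This is only a rough sketch, not the DebugMode implementation: `norm_hash`, `rel_diff`, and `find_mismatches` are hypothetical helper names, the L1 norm is assumed as a stand-in for the "basically tensor norm" hash, and dividing by the larger magnitude is an assumption that happens to be consistent with the `rel_diff` values in the dump. The hash values are copied from that dump.

```python
import math


def norm_hash(values):
    """Toy stand-in for a tensor hash (assumption): L1 norm of a flat list of floats."""
    return math.fsum(abs(v) for v in values)


def rel_diff(h1, h2):
    """Relative difference between two hashes, normalized by the larger magnitude (assumption)."""
    denom = max(abs(h1), abs(h2))
    return 0.0 if denom == 0 else abs(h1 - h2) / denom


def find_mismatches(hashes1, hashes2):
    """Compare two {(call, arg_name): hash} dicts and list every entry whose hashes differ."""
    mismatches = []
    for key in sorted(hashes1.keys() & hashes2.keys()):
        h1, h2 = hashes1[key], hashes2[key]
        if h1 != h2:
            call, arg_name = key
            mismatches.append(
                {
                    "call": call,
                    "arg_name": arg_name,
                    "hash1": h1,
                    "hash2": h2,
                    "rel_diff": rel_diff(h1, h2),
                }
            )
    return mismatches


# Post-kernel hashes from two runs; the values are taken from the dump above.
run1 = {
    ("triton_tem_fused_0", "arg_Q"): 13385916.068706088,
    ("triton_per_fused_zeros_0", "out_ptr1"): 81775.3811062593,
}
run2 = {
    ("triton_tem_fused_0", "arg_Q"): 13385916.068706088,
    ("triton_per_fused_zeros_0", "out_ptr1"): 81775.38106592931,
}

mismatches = find_mismatches(run1, run2)
for m in mismatches:
    print(m)
```

Because a norm-style hash maps nearby tensors to nearby hashes, a tiny `rel_diff` points to small numerical drift such as a changed reduction order, while a value near 1.0 suggests that one of the two hashes was taken from an unwritten or zeroed buffer.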
PyTorch MergeBot	051f1fe8e3	Revert "[ROCm][CI] Update docker-cache-mi300.yml to test MI300 CI docker caching (#167554 )" This reverts commit ee387c43feada1cc2049b42a970ec4e2f12f210e. Reverted https://github.com/pytorch/pytorch/pull/167554 on behalf of https://github.com/jithunnair-amd due to workflow had failure 'Unexpected input(s) 'run_id'' ([comment](https://github.com/pytorch/pytorch/pull/167554#issuecomment-3518642191))	2025-11-11 20:34:44 +00:00
Jithun Nair	ee387c43fe	[ROCm][CI] Update docker-cache-mi300.yml to test MI300 CI docker caching (#167554 ) Trigger this workflow on every completed run of `docker-builds.yml` and run it on `ubuntu-latest` so that it doesn't queue indefinitely for the `rocm-docker` label. Pull Request resolved: https://github.com/pytorch/pytorch/pull/167554 Approved by: https://github.com/jeffdaily	2025-11-11 19:49:00 +00:00