8 Commits

Author SHA1 Message Date
02c26dcfc7 [Feat] Supports Aclgraph for bge-m3 (#3171)
### What this PR does / why we need it?
[Feat] Supports Aclgraph for bge-m3

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
pytest -s tests/e2e/singlecard/test_embedding.py
pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py
```
To start an online server with batch size 10 and a per-batch sequence length of 8192, we
set --max-num-batched-tokens=8192*10 so the encoder input is not chunked:
```
vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}'
```
For batch size 10 with a per-batch sequence length of 8192, QPS improves from 85 to 104
(a 22% improvement), and much of the host-bound overhead is removed.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
Co-authored-by: wangyongjun <1104133197@qq.com>
2025-10-14 23:07:45 +08:00
434059e417 [BugFix] Fix multimodal model support fullgraph error (#3425)
### What this PR does / why we need it?
The update_attn_params function requires the num_tokens
parameter, which was obtained via positions.shape[0]. However,
the multimodal model uses mrope (Multidimensional Rotary Position
Embedding), which makes positions two-dimensional, so positions.shape[0]
no longer returns the token count. We resolve this issue by replacing
positions.shape[0] with maybe_padded_num_tokens.
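A minimal sketch of the shape issue with illustrative tensors; only `positions` and `maybe_padded_num_tokens` are names taken from the PR text, the rest is assumed:
```python
import torch

maybe_padded_num_tokens = 8

# Plain RoPE: positions is 1-D, shape (num_tokens,), so shape[0] == 8.
rope_positions = torch.arange(8)
# mrope: positions is 2-D, e.g. (3, num_tokens), so shape[0] == 3, not the token count.
mrope_positions = torch.arange(8).expand(3, 8)

assert rope_positions.shape[0] == 8   # old behaviour only works here
assert mrope_positions.shape[0] == 3  # wrong value for multimodal models

# The fix: use the precomputed (possibly padded) token count instead of shape[0].
num_tokens = maybe_padded_num_tokens
assert num_tokens == 8
```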

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: fan2956 <zhoufan53@huawei.com>
2025-10-14 21:51:09 +08:00
223cc34085 [KVCache] Refactor KVCache as page_size_bytes is ineffective (#3438)
### What this PR does / why we need it?
Refactor KVCache as page_size_bytes is ineffective.

1. Currently `AttentionSpec` is patched, but at runtime `page_size_bytes`
still uses the vLLM implementation, so the patch has no effect. This PR
therefore removes the patch on `AttentionSpec`; the final fix will be made
in vLLM.
2. Use `MLAAttentionSpec` instead of `FullAttentionSpec` to reduce the spec's
`page_size_bytes`, so that the spec's `num_blocks` can double (see the sketch below).
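For intuition, here is a rough sketch of why an MLA-style spec shrinks `page_size_bytes`; the coefficient mirrors the removed `AttentionSpec` patch further down this page, while the concrete numbers are placeholders:
```python
def page_size_bytes(block_size: int, num_kv_heads: int, head_size: int,
                    dtype_size: int, use_mla: bool) -> int:
    # Full attention stores K and V per token (factor 2); MLA stores a single
    # latent vector, so the factor drops to 1 and, for the same memory budget,
    # the number of KV-cache blocks can roughly double.
    coef = 1 if use_mla else 2
    return coef * block_size * num_kv_heads * head_size * dtype_size

full = page_size_bytes(block_size=128, num_kv_heads=1, head_size=576,
                       dtype_size=2, use_mla=False)
mla = page_size_bytes(block_size=128, num_kv_heads=1, head_size=576,
                      dtype_size=2, use_mla=True)
assert full == 2 * mla
```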

### How was this patch tested?
Tests pass with Qwen3-Next and DeepSeek-V3.2-Exp.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-10-14 21:28:41 +08:00
c55d99d13e [bugfix][torchair] fix missing weight nz cast for w13_weight in torchair_w8a8_dynamic.py (#3446)
### What this PR does / why we need it?
Fix the missing NZ conversion for quantized GMM weights after the
moe_dispatch operator in the torchair scenario. The aclgraph and
single-operator scenarios are not affected.
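A hedged sketch of the added cast; the constant and the `npu_format_cast` call follow patterns visible elsewhere in this changeset, and the exact torchair call site is not reproduced here:
```python
import torch_npu

ACL_FORMAT_FRACTAL_NZ = 29  # NZ format id, matching the value used in the unit tests

def cast_w13_weight_to_nz(layer) -> None:
    # The gate_up (w13) weight feeds the grouped matmul after moe_dispatch and
    # was missing this conversion in the torchair path.
    layer.w13_weight.data = torch_npu.npu_format_cast(
        layer.w13_weight.data, ACL_FORMAT_FRACTAL_NZ)
```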

### How was this patch tested?
vllm serving passed with lower latency (~5ms TPOT with bs_per_rank=28 &
ep_size=32)

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-10-14 21:11:05 +08:00
5fe883fa43 fix the title of modelrunner's prepare inputs docs (#3457)
### What this PR does / why we need it?
Fix the wrong title of the modelrunner_prepare_inputs docs

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
pass CI

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com>
2025-10-14 20:35:58 +08:00
78777237a9 [2/N][Feat] Attention and MoE weight prefetch in Qwen3MoE models (#3203)
### What this PR does / why we need it?

- Refactor and integrate a unified `WeightPrefetchMethod`
- Integrate `gate_up_proj.weight` prefetching in quantized Attention modules
- Prefetching these weights ahead of matmul-like operators improves
performance by reducing L2 cache transfer latency

### Does this PR introduce _any_ user-facing change?

Add a new config in `--additional-config` for configuration:
```json
{
    "weight_prefetch_config": {
        "enabled": true,
        "prefetch_ratio": {
            "moe": {
                "gate_up": 0.8
            }
        }
    }
}
```
This feature is enabled by default, and can be disabled through this
configuration
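As a rough illustration, the new `moe` entry extends the existing per-module defaults. A minimal sketch of how a user-supplied block would merge with them (the defaults mirror `WeightPrefetchConfig` in this changeset; the merge helper itself is illustrative, not the actual config code):
```python
DEFAULT_PREFETCH_RATIO = {
    "attn": {"qkv": 1.0, "o": 1.0},
    "moe": {"gate_up": 0.8},
}

def merged_prefetch_ratio(user_config: dict) -> dict:
    """Illustrative merge of a user weight_prefetch_config over the defaults."""
    ratio = {module: dict(weights) for module, weights in DEFAULT_PREFETCH_RATIO.items()}
    for module, weights in user_config.get("prefetch_ratio", {}).items():
        ratio.setdefault(module, {}).update(weights)
    return ratio

print(merged_prefetch_ratio({"enabled": True, "prefetch_ratio": {"moe": {"gate_up": 0.5}}}))
# -> {'attn': {'qkv': 1.0, 'o': 1.0}, 'moe': {'gate_up': 0.5}}
```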

### How was this patch tested?


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: yuzhup <15705211260@163.com>
2025-10-14 20:16:33 +08:00
07e39620ea [Feat] Unquantized Linear to nz and control all nz-cast (#3356)
### What this PR does / why we need it?
Currently, when execution reaches a model's Linear layer in vLLM-Ascend,
the weight format is ND in both the unquantized case and the skipped-ascend
(quantization-skipped) case. This PR supplements the Linear layer's execution
logic with a new environment variable: VLLM_ASCEND_ENABLE_NZ. When
VLLM_ASCEND_ENABLE_NZ=1 and the CANN version is 8.3, the Linear layer's
weights are converted to FRACTAL_NZ in both the unquantized and skipped-ascend
cases. VLLM_ASCEND_ENABLE_NZ also controls the existing NZ conversions, such
as the w8a8-quantized case.

### Does this PR introduce _any_ user-facing change?
Adds a new environment variable, VLLM_ASCEND_ENABLE_NZ. To use the NZ
format, set VLLM_ASCEND_ENABLE_NZ=1.
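A minimal sketch of the gate described above; the env default mirrors the `envs_ascend` entry further down this page, and the CANN version check is paraphrased from the description:
```python
import os
import torch

def is_enable_nz() -> bool:
    # Mirrors VLLM_ASCEND_ENABLE_NZ, which defaults to 1 (enabled).
    return bool(int(os.getenv("VLLM_ASCEND_ENABLE_NZ", "1")))

def should_cast_linear_weight_to_nz() -> bool:
    # Unquantized / skipped-ascend Linear weights are cast to FRACTAL_NZ only
    # when the flag is on and the CANN toolkit is 8.3.
    cann_version = getattr(torch.version, "cann", None) or ""
    return is_enable_nz() and cann_version.startswith("8.3")
```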

### How was this patch tested?

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
2025-10-14 17:39:26 +08:00
5c45c227dc [BugFix] fix qwen2.5vl quant bug (#3426)
### What this PR does / why we need it?
This PR fixes the following issue:
1. Resolve the Qwen2.5-VL quantization service startup failure:
`AttributeError: 'Parameter' object has no attribute 'weight_loader'`.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
- ci & e2e
- vLLM version: v0.11.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: elilzhu <2435754260@qq.com>
2025-10-14 17:31:26 +08:00
45 changed files with 944 additions and 320 deletions

View File

@ -90,9 +90,11 @@ jobs:
pytest -sv tests/e2e/singlecard/test_aclgraph.py
pytest -sv tests/e2e/singlecard/test_ascend_scheduler.py
pytest -sv tests/e2e/singlecard/test_bge_model.py
pytest -sv tests/e2e/singlecard/test_camem.py
pytest -sv tests/e2e/singlecard/test_chunked.py
pytest -sv tests/e2e/singlecard/test_embedding.py
pytest -sv tests/e2e/singlecard/test_embedding_aclgraph.py
pytest -sv tests/e2e/singlecard/test_guided_decoding.py
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/singlecard/test_profile_execute_duration.py

View File

@ -1,4 +1,6 @@
# Purpose
# Prepare inputs for model forwarding
## Purpose
What information do we need in order to perform a model forward pass?
- the inputs
- the corresponding attention metadata of the inputs
@ -17,8 +19,8 @@ Therefore, as long as we have these two pieces of information mentioned above, w
This article explains **how we obtain the inputs and their corresponding attention metadata**, which are shown on the left part of the diagram above.
# Overview
## 1. Obtain inputs
## Overview
### 1. Obtain inputs
The workflow of obtaining inputs:
1. Get `token positions`: The relative position of each token within its request sequence.
@ -29,7 +31,7 @@ The workflow of obtain inputs:
Finally, these `Token IDs` need to be fed into the model, and `positions` must also be sent into the model to create `RoPE` (rotary positional embedding). Both of them are inputs of the model.
**Note**: because the `Token IDs` are the inputs of the model, we will call them `Input IDs`
## 2. Build inputs attention metadata
### 2. Build inputs attention metadata
The model requires these attention metadata during the forward pass:
- `query start location`: represents the start and end location of each request corresponding to the scheduled tokens.
- `sequence length`: the length of each request including both computed tokens and newly scheduled tokens.
@ -41,7 +43,7 @@ The model requires these attention metadata during the forward pass:
- `slot mapping`: the KV-cache slot indices that each input token will be stored into.
- `attention mask`: The mask matrix applied to attention scores before softmax to control which tokens can attend to each other. (usually a causal attention)
# Before start
## Before start
There are mainly three types of variables.
- token level: represents one attribute corresponding to each scheduled token, so the length of this variable is the number of scheduled tokens
- request level: represents one attribute of each scheduled request, whose length is usually the number of scheduled requests. (`query start location` is a special case, which has one more element)
@ -52,7 +54,7 @@ There are mainly three types of variables.
**Note**: How were these two tables formed?
- Both of them come from the `_update_states` method, which runs before **prepare inputs**. You can take a look if you need more details.
## Tips
### Tips
What is a `Token ID`?
Simply put, a `token ID` is an **integer** (usually `int32`) that represents a token.
An example of `Token ID`s:
@ -74,7 +76,7 @@ example of `Token ID`:
| vocab_size-1 | <|im_end|> |
```
# Go through details
## Go through details
Let's work through a simple example with the following assumptions:
- max tokens can be scheduled at once: 10.
- `block size`: 2
@ -82,19 +84,19 @@ Make a simple example, assumption:
- `max model length`: 12 (the maximum number of tokens that a single request sequence can hold in this model).
These assumptions are configured when starting vLLM. They are not fixed, so you can set them manually.
## Step 1: All requests in the prefill phase
### Step 1: All requests in the prefill phase
### Obtain inputs
#### Obtain inputs
Because the max scheduled token count is limited to 10, the scheduled tokens per request are: `{'0': 3, '1': 2, '2': 5}`. Note that `request_2` is in chunked prefill and still has 3 prompt tokens that are not scheduled.
#### 1. Get token positions:
##### 1. Get token positions:
First, find out which request each token belongs to: tokens 0~2 belong to request_0, tokens 3~4 belong to request_1, and tokens 5~9 belong to request_2. So we can use `request indices` to indicate which request each token belongs to. `request indices`: `[0, 0, 0, 1, 1, 2, 2, 2, 2, 2]`
For each request, use **the number of tokens already computed** + **the relative position within the currently scheduled tokens**: `request_0: [0 + 0, 0 + 1, 0 + 2]`, `request_1: [0 + 0, 0 + 1]`, `request_2: [0 + 0, 0 + 1,..., 0 + 4]`, and then concatenate them: `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. Note: there is a more efficient way (using `request indices`) to create positions in the actual code.
Finally, `token positions` is `[0, 1, 2, 0, 1, 0, 1, 2, 3, 4]`. This variable is **token level**.
#### 2. Get token indices:
##### 2. Get token indices:
Current **Token IDs table**, whose shape is `(max num request, max model len)`.
Why are `T_3_5`, `T_3_6`, `T_3_7` in this table even though they are not scheduled this time?
@ -116,7 +118,7 @@ Let's say `M = max model len`, Then we can use `token positions` together with t
So `token indices` = `[0 + 0 * M, 1 + 0 * M, 2 + 0 * M, 0 + 1 * M, 1 + 1 * M, 0 + 2 * M, 1 + 2 * M, 2 + 2 * M, 3 + 2 * M, 4 + 2 * M]` = `[0, 1, 2, 12, 13, 24, 25, 26, 27, 28]`
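A quick sketch of the two computations above, reproducing the worked example's numbers (pure illustration, independent of the actual model-runner code):
```python
M = 12                                # max model len from the example above
request_indices = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
token_positions = [0, 1, 2, 0, 1, 0, 1, 2, 3, 4]

# Flatten (request, position) into offsets of the (max num request, M) token table.
token_indices = [pos + req * M for pos, req in zip(token_positions, request_indices)]
assert token_indices == [0, 1, 2, 12, 13, 24, 25, 26, 27, 28]
```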
#### 3. Retrieve the Token IDs
##### 3. Retrieve the Token IDs
As mentioned before, we will refer to these `Token IDs` as `Input IDs`.
We use the `token indices` to select the corresponding `Input IDs` from the token table. The pseudocode looks like:
@ -128,7 +130,7 @@ input_ids = token_table[token_indices]
As mentioned before, we refer to these Token IDs as Input IDs:
- `Input IDs` = `[T_0_0, T_0_1, T_0_2, T_1_0, T_1_1, T_2_0, T_2_1, T_3_2, T_3_3, T_3_4]`
### Build inputs attention metadata
#### Build inputs attention metadata
Current **Block Table**: we use the first block (i.e. block_0) to mark unused entries. The shape of the block table is `(max num request, max model len / block size)`, where `max model len / block size = 12 / 2 = 6`.
```
@ -172,10 +174,10 @@ First, we know the scheduled token count is `[3, 2, 5]` **request level**
- `slot mapping`: `[2, 3, 4, 6, 7, 8, 9, 10, 11, 12]`. **token level** (see the sketch after this list)
- `attention mask`: Since all requests are doing prefill, we create only one mask matrix and reuse it across requests. The shape of this mask matrix is `5 * 5`.
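The `slot mapping` values above follow from the block table and the token positions. A small sketch reproducing the example's numbers (the per-request block lists are inferred from the allocation in this example; block size is 2):
```python
BLOCK_SIZE = 2
# request -> blocks allocated in the example (block_0 is reserved as "unused")
block_table = {0: [1, 2], 1: [3], 2: [4, 5, 6]}
positions = {0: [0, 1, 2], 1: [0, 1], 2: [0, 1, 2, 3, 4]}

slot_mapping = [
    block_table[req][pos // BLOCK_SIZE] * BLOCK_SIZE + pos % BLOCK_SIZE
    for req in (0, 1, 2) for pos in positions[req]
]
assert slot_mapping == [2, 3, 4, 6, 7, 8, 9, 10, 11, 12]
```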
## Step 2: Chunked prefill
### Step 2: Chunked prefill
In Step 2, we will no longer provide explanations or perform calculations; instead, we will directly present the final result.
### Obtain inputs
#### Obtain inputs
The scheduled token of each request: `{'0': 1, '1': 1, '2': 3}`.
1. `request indices`: `[0, 1, 2, 2, 2]`
@ -198,7 +200,7 @@ Current **Token IDs table**:
3. `token indices`: `[3, 14, 29, 30, 31]`
4. `Input IDs`: `[T_0_3, T_1_2, T_3_5, T_3_6, T_3_7]`
### Build inputs attention metadata
#### Build inputs attention metadata
Current **Block Table**. **Note**: We allocate blocks `7` and `8` to `request_1` and `request_2` respectively, because they need more device space to store the KV cache produced by newly generated tokens or newly chunk-prefilled tokens.
```
@ -231,7 +233,7 @@ scheduled token count is `[1, 1, 3]`
- `slot mapping`: `[5, 14, 13, 16, 17]`
- `attention mask`: `5 * 8` Each token will have a `1 * 8` vector, and there are 5 scheduled tokens.
# At last
## At last
If you understand step_1 and step_2, you will understand all the following steps.
We hope this article helps you better understand how vLLM prepares inputs for model forwarding. If you have any good ideas, you are welcome to contribute.

View File

@ -73,10 +73,10 @@ ascend_scheduler_config also support the options from [vllm scheduler config](ht
**weight_prefetch_config**
| Name | Type | Default | Description |
|------------------|------|------------------------------------|------------------------------------|
| `enabled` | bool | `False` | Whether to enable weight prefetch. |
| `prefetch_ratio` | dict | `{"attn": {"qkv": 1.0, "o": 1.0}}` | Prefetch ratio of each weights. |
| Name | Type | Default | Description |
|------------------|------|-------------------------------------------------------------|------------------------------------|
| `enabled` | bool | `False` | Whether to enable weight prefetch. |
| `prefetch_ratio` | dict | `{"attn": {"qkv": 1.0, "o": 1.0}, "moe": {"gate_up": 0.8}}` | Prefetch ratio of each weights. |
### Example
@ -104,6 +104,9 @@ An example of additional configuration is as follows:
"qkv": 1.0,
"o": 1.0,
},
"moe": {
"gate_up": 0.8
}
},
},
"multistream_overlap_shared_expert": True,

View File

@ -291,7 +291,9 @@ def test_select_experts(
custom_routing_function.return_value = (mock_weights, mock_ids)
with patch("vllm_ascend.ops.moe.experts_selector._native_grouped_topk"
) as mock_native_grouped_topk:
) as mock_native_grouped_topk, \
patch('vllm_ascend.ops.moe.experts_selector.get_forward_context',
return_value=MagicMock(weight_prefetch_method=MagicMock())):
mock_native_grouped_topk.side_effect = lambda x, num_groups, k: torch.randn_like(
x)
@ -325,7 +327,9 @@ def test_select_experts(
@pytest.mark.parametrize("device", DEVICE)
def test_select_experts_invalid_scoring_func(device: str):
with pytest.raises(ValueError,
with patch('vllm_ascend.ops.moe.experts_selector.get_forward_context',
return_value=MagicMock(weight_prefetch_method=MagicMock())), \
pytest.raises(ValueError,
match="Unsupported scoring function: invalid"):
select_experts(hidden_states=torch.randn(1, 128, device=device),
router_logits=torch.randn(1, 8, device=device),

View File

@ -0,0 +1,49 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
# Adapted from vllm/tests/basic_correctness/test_basic_correctness.py
#
from modelscope import snapshot_download # type: ignore[import-untyped]
from tests.e2e.conftest import HfRunner, VllmRunner
from tests.e2e.utils import check_embeddings_close
def test_bge_model_correctness():
queries = ['What is the capital of China?', 'Explain gravity']
model_name = snapshot_download("BAAI/bge-m3")
with VllmRunner(
model_name,
task="embed",
enforce_eager=True,
) as vllm_runner:
vllm_outputs = vllm_runner.encode(queries)
with HfRunner(
model_name,
dtype="float32",
is_sentence_transformer=True,
) as hf_runner:
hf_outputs = hf_runner.encode(queries)
check_embeddings_close(
embeddings_0_lst=hf_outputs,
embeddings_1_lst=vllm_outputs,
name_0="hf",
name_1="vllm",
tol=1e-2,
)

View File

@ -0,0 +1,55 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
# Adapted from vllm/tests/basic_correctness/test_basic_correctness.py
#
import os
import pytest
from tests.e2e.conftest import VllmRunner
from tests.e2e.utils import check_embeddings_close
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
MODELS = ["BAAI/bge-m3"]
@pytest.mark.parametrize("model_name", MODELS)
def test_aclgraph_embed_models_correctness(model_name):
queries = ['What is the capital of China?', 'Explain gravity']
with VllmRunner(
model_name,
task="embed",
enforce_eager=False,
) as vllm_aclgraph_runner:
vllm_aclgraph_outputs = vllm_aclgraph_runner.encode(queries)
with VllmRunner(
model_name,
task="embed",
enforce_eager=True,
) as vllm_runner:
vllm_outputs = vllm_runner.encode(queries)
check_embeddings_close(
embeddings_0_lst=vllm_outputs,
embeddings_1_lst=vllm_aclgraph_outputs,
name_0="hf",
name_1="vllm",
tol=1e-2,
)

View File

@ -376,7 +376,8 @@ class TestAscendMLAImpl(TestBase):
self.assertEqual(q_pe.shape[1], self.impl.num_heads)
self.assertEqual(q_pe.shape[2], self.impl.qk_rope_head_dim)
def test_process_weights_after_loading(self):
@patch('torch_npu.npu_format_cast')
def test_process_weights_after_loading(self, mock_format_cast):
layer = MagicMock(spec=LinearBase)
layer.input_size_per_partition = 10
quant_method = MagicMock()
@ -389,6 +390,7 @@ class TestAscendMLAImpl(TestBase):
layer.weight = torch.randn(shape_0, shape_1)
self.impl.kv_b_proj = layer
apply.return_value = layer.weight.T
mock_format_cast.return_value = layer.weight
self.impl.process_weights_after_loading(torch.bfloat16)
self.assertEqual(self.impl.W_UK_T.shape[0], self.impl.num_heads)

View File

@ -12,7 +12,7 @@
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
from unittest.mock import Mock, patch
from unittest.mock import MagicMock, Mock, patch
import pytest
import torch
@ -20,6 +20,7 @@ from vllm.config import CacheConfig
from vllm.model_executor.layers.logits_processor import LogitsProcessor
from vllm.model_executor.layers.vocab_parallel_embedding import ParallelLMHead
from vllm_ascend import ascend_config
from vllm_ascend.models.deepseek_v2 import (CustomDeepseekV2MLAAttention,
CustomDeepseekV2RowParallelLinear)
@ -46,6 +47,13 @@ def test_row_parallel_linear(cls, mock_distributed):
def test_custom_deepseek_v2_mla_attention(mock_rms_norm, mock_mla_forward,
mock_distributed, base_config):
mock_rms_norm.return_value = (torch.randn(2, 128), torch.randn(2, 128))
# Make a fake ascend config because of the AscendLinearBase
vllm_config = MagicMock()
vllm_config.additional_config = None
vllm_config.parallel_config.enable_expert_parallel = False
vllm_config.parallel_config.tensor_parallel_size = 1
vllm_config.kv_transfer_config = None
ascend_config.init_ascend_config(vllm_config)
attn = CustomDeepseekV2MLAAttention(config=base_config,
hidden_size=128,
@ -78,6 +86,7 @@ def test_custom_deepseek_v2_mla_attention(mock_rms_norm, mock_mla_forward,
kv_lora_rank=16,
prefix="layers.1.self_attn")
assert hasattr(attn, "q_proj")
ascend_config._ASCEND_CONFIG = None
def test_deepseek_v2_lmhead(mock_distributed, vllm_config):
@ -90,6 +99,14 @@ def test_deepseek_v2_lmhead(mock_distributed, vllm_config):
config = SimpleConfig()
# Make a fake ascend config because of the AscendLinearBase
vllm_config = MagicMock()
vllm_config.additional_config = None
vllm_config.parallel_config.enable_expert_parallel = False
vllm_config.parallel_config.tensor_parallel_size = 1
vllm_config.kv_transfer_config = None
ascend_config.init_ascend_config(vllm_config)
# Create lmhead and logits_processor directly
lmhead = ParallelLMHead(config.vocab_size, config.hidden_size)
logits_processor = LogitsProcessor(config.vocab_size)
@ -105,3 +122,4 @@ def test_deepseek_v2_lmhead(mock_distributed, vllm_config):
return_value=mock_logits):
logits = logits_processor(lmhead, mock_output)
assert logits.shape == (2, 4, config.vocab_size)
ascend_config._ASCEND_CONFIG = None

View File

@ -92,14 +92,16 @@ def mock_dist_env(mocker: MockerFixture):
mock_moe_comm_method.finalize.side_effect = mock_finalize
dp_metadata = MagicMock(num_tokens_across_dp_cpu=[5, 5])
mock_forward_context_obj = MagicMock(moe_comm_method=mock_moe_comm_method,
moe_comm_type=MoECommType.MC2,
max_tokens_across_dp=10,
dp_metadata=dp_metadata,
mc2_mask=torch.zeros(
16, dtype=torch.bool),
padded_num_tokens=16,
with_quant=False)
mock_weight_prefetch_method = MagicMock()
mock_forward_context_obj = MagicMock(
moe_comm_method=mock_moe_comm_method,
moe_comm_type=MoECommType.MC2,
max_tokens_across_dp=10,
dp_metadata=dp_metadata,
mc2_mask=torch.zeros(16, dtype=torch.bool),
padded_num_tokens=16,
with_quant=False,
weight_prefetch_method=mock_weight_prefetch_method)
with patch('torch.distributed.get_rank', return_value=0), \
patch('torch.distributed.get_world_size', return_value=4), \
@ -132,7 +134,9 @@ def mock_dist_env(mocker: MockerFixture):
patch('vllm_ascend.ops.moe.moe_comm_method.AlltoAllCommImpl._get_token_dispatcher',
return_value=None), \
patch('vllm_ascend.ops.moe.moe_comm_method.AllGatherCommImpl._get_token_dispatcher',
return_value=None):
return_value=None), \
patch('vllm_ascend.ops.moe.experts_selector.get_forward_context',
return_value=mock_forward_context_obj):
yield {
'mock_forward_context_obj': mock_forward_context_obj,

View File

@ -5,10 +5,13 @@ from unittest.mock import MagicMock, patch
import torch
from tests.ut.base import TestBase
from vllm_ascend import ascend_config
from vllm_ascend.distributed import parallel_state
from vllm_ascend.ops.linear import (AscendMergedColumnParallelLinear,
AscendRowParallelLinear)
AscendReplicatedLinear,
AscendRowParallelLinear,
AscendUnquantizedLinearMethod)
class BaseLinearTest(unittest.TestCase):
@ -49,6 +52,47 @@ class BaseLinearTest(unittest.TestCase):
p.stop()
class TestAscendUnquantizedLinearMethod(TestBase):
def setUp(self):
self.method = AscendUnquantizedLinearMethod()
@mock.patch("vllm_ascend.ops.linear.is_enable_nz")
@mock.patch("torch_npu.npu_format_cast")
@mock.patch("torch.version")
def test_process_weights_after_loading_is_8_3_enable_nz(
self, mock_version, mock_format_cast, mock_is_nz):
layer = mock.MagicMock()
mock_version.cann = "8.3.RC1"
mock_is_nz.return_value = 1
self.method.process_weights_after_loading(layer)
mock_format_cast.assert_called_once()
@mock.patch("vllm_ascend.ops.linear.is_enable_nz")
@mock.patch("torch_npu.npu_format_cast")
@mock.patch("torch.version")
def test_process_weights_after_loading_is_8_3_disable_nz(
self, mock_version, mock_format_cast, mock_is_nz):
layer = mock.MagicMock()
mock_version.cann = "8.3.RC1"
mock_is_nz.return_value = 0
self.method.process_weights_after_loading(layer)
mock_format_cast.assert_not_called()
@mock.patch("vllm_ascend.ops.linear.is_enable_nz")
@mock.patch("torch.version")
def test_process_weights_after_loading_not_8_3(self, mock_version,
mock_is_nz):
layer = mock.MagicMock()
mock_version.cann = "8.2.RC1"
mock_is_nz.return_value = 1
# Should not raise exception
self.method.process_weights_after_loading(layer)
class TestAscendRowParallelLinear(BaseLinearTest):
def test_mlp_optimize(self):
@ -92,5 +136,24 @@ class TestAscendMergedColumnParallelLinear(BaseLinearTest):
self.assertEqual(linear.custom_op.comm_group, parallel_state._MLP_TP)
class TestAscendReplicatedLinear(BaseLinearTest):
def test_init_disable_tp(self):
linear = AscendReplicatedLinear(
input_size=16,
output_size=8,
)
self.assertTrue(
isinstance(linear.quant_method, AscendUnquantizedLinearMethod))
def test_init_without_disable_tp(self):
linear = AscendReplicatedLinear(
input_size=16,
output_size=8,
)
self.assertTrue(
isinstance(linear.quant_method, AscendUnquantizedLinearMethod))
if __name__ == '__main__':
unittest.main()

View File

@ -4,10 +4,10 @@ import torch
from vllm.attention.layer import Attention
from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.model_executor.layers.fused_moe.config import FusedMoEConfig
from vllm.model_executor.layers.linear import (LinearBase,
UnquantizedLinearMethod)
from vllm.model_executor.layers.linear import LinearBase
from tests.ut.base import TestBase
from vllm_ascend.ops.linear import AscendUnquantizedLinearMethod
from vllm_ascend.quantization.quant_config import (AscendKVCacheMethod,
AscendQuantConfig)
from vllm_ascend.utils import ASCEND_QUANTIZATION_METHOD
@ -82,7 +82,7 @@ class TestAscendQuantConfig(TestBase):
'is_layer_skipped_ascend',
return_value=True):
method = self.ascend_config.get_quant_method(linear_layer, ".attn")
self.assertIsInstance(method, UnquantizedLinearMethod)
self.assertIsInstance(method, AscendUnquantizedLinearMethod)
# Test quantized layer
with patch.object(self.ascend_config, 'is_layer_skipped_ascend', return_value=False), \

View File

@ -137,8 +137,10 @@ class TestAscendW8A8LinearMethod(TestBase):
expected_y_output += bias
self.assertTrue(torch.equal(output, expected_y_output))
@patch("vllm_ascend.quantization.w8a8.is_enable_nz")
@patch('torch_npu.npu_format_cast')
def test_process_weights_after_loading(self, mock_npu_format_cast):
def test_process_weights_after_loading_not_nz(self, mock_npu_format_cast,
mock_is_nz):
layer = MagicMock()
layer.weight.data = torch.randn(128, 256)
@ -148,6 +150,7 @@ class TestAscendW8A8LinearMethod(TestBase):
layer.weight_scale.data = torch.randn(128, 1)
layer.weight_offset.data = torch.randn(128, 1)
mock_is_nz.return_value = 0
mock_npu_format_cast.return_value = MagicMock
self.method.process_weights_after_loading(layer)
@ -160,6 +163,35 @@ class TestAscendW8A8LinearMethod(TestBase):
self.assertEqual(layer.weight_scale.data.shape, (128, ))
self.assertEqual(layer.weight_offset.data.shape, (128, ))
mock_npu_format_cast.assert_not_called()
@patch("vllm_ascend.quantization.w8a8.is_enable_nz")
@patch('torch_npu.npu_format_cast')
def test_process_weights_after_loading_nz(self, mock_npu_format_cast,
mock_is_nz):
layer = MagicMock()
layer.weight.data = torch.randn(128, 256)
layer.input_scale.data = torch.tensor([0.1])
layer.input_offset.data = torch.tensor([0])
layer.deq_scale = torch.tensor([0.5])
layer.weight_scale.data = torch.randn(128, 1)
layer.weight_offset.data = torch.randn(128, 1)
mock_is_nz.return_value = 1
mock_npu_format_cast.return_value = MagicMock
self.method.process_weights_after_loading(layer)
expected_offset = torch.tensor([0]).repeat(256).to(torch.int8)
self.assertTrue(
torch.equal(layer.aclnn_input_offset.data, expected_offset))
self.assertFalse(layer.aclnn_input_offset.requires_grad)
self.assertFalse(layer.deq_scale.requires_grad)
self.assertEqual(layer.weight_scale.data.shape, (128, ))
self.assertEqual(layer.weight_offset.data.shape, (128, ))
mock_npu_format_cast.assert_called_once()
class TestAscendW8A8FusedMoEMethod(TestBase):
@ -723,6 +755,14 @@ class TestSelectExperts(TestBase):
self.hidden_states = torch.randn(self.num_tokens, self.hidden_size)
self.router_logits = torch.randn(self.num_tokens, self.num_experts)
self.mock_ctx = MagicMock()
self.mock_ctx.weight_prefetch_method = MagicMock()
patcher = patch(
'vllm_ascend.ops.moe.experts_selector.get_forward_context',
return_value=self.mock_ctx)
self.addCleanup(patcher.stop)
patcher.start()
@patch('torch_npu.npu_moe_gating_top_k_softmax')
def test_softmax_scoring(self, mock_topk):
"""Test softmax scoring function"""

View File

@ -39,6 +39,14 @@ class TestUtils(TestBase):
"Ascend910P1"):
self.assertFalse(utils.is_310p())
def test_is_enable_nz(self):
with mock.patch("vllm_ascend.utils.envs_ascend.VLLM_ASCEND_ENABLE_NZ",
1):
self.assertTrue(utils.is_enable_nz())
with mock.patch("vllm_ascend.utils.envs_ascend.VLLM_ASCEND_ENABLE_NZ",
0):
self.assertFalse(utils.is_enable_nz())
def test_sleep_mode_enabled(self):
utils._SLEEP_MODE_ENABLED = None
with mock.patch("vllm_ascend._build_info.__sleep_mode_enabled__",

View File

@ -96,15 +96,17 @@ class TestTorchairUtils(TestBase):
self.assertEqual(args[0], expected_name)
self.assertEqual(args[1], expected_path)
@mock.patch('vllm_ascend.torchair.utils.is_enable_nz')
@mock.patch('torch_npu.get_npu_format')
@mock.patch('torch_npu.npu_format_cast')
@mock.patch('vllm.model_executor.layers.fused_moe.layer.FusedMoE',
new=mock.MagicMock)
def test_converting_weight_acl_format(self, mock_npu_cast,
mock_get_format):
def test_converting_weight_acl_format_to_nz(self, mock_npu_cast,
mock_get_format, mock_is_nz):
ACL_FORMAT_FRACTAL_NZ = 29
mock_get_format.return_value = 1
mock_npu_cast.return_value = 1
mock_is_nz.return_value = 1
fused_moe = mock.MagicMock()
fused_moe.w13_weight = mock.MagicMock()
@ -137,3 +139,26 @@ class TestTorchairUtils(TestBase):
utils.converting_weight_acl_format(model, ACL_FORMAT_FRACTAL_NZ)
mock_npu_cast.assert_not_called()
@mock.patch('vllm_ascend.torchair.utils.is_enable_nz')
@mock.patch('torch_npu.get_npu_format')
@mock.patch('torch_npu.npu_format_cast')
@mock.patch('vllm.model_executor.layers.fused_moe.layer.FusedMoE',
new=mock.MagicMock)
def test_converting_weight_acl_format_no_nz(self, mock_npu_cast,
mock_get_format, mock_is_nz):
ACL_FORMAT_FRACTAL_NZ = 29
mock_get_format.return_value = 1
mock_npu_cast.return_value = 1
mock_is_nz.return_value = 0
fused_moe = mock.MagicMock()
fused_moe.w13_weight = mock.MagicMock()
fused_moe.w2_weight = mock.MagicMock()
fused_moe.w13_weight.data = torch.randn(128, 256)
fused_moe.w2_weight.data = torch.randn(256, 128)
model = mock.MagicMock()
model.modules.return_value = [fused_moe]
utils.converting_weight_acl_format(model, ACL_FORMAT_FRACTAL_NZ)
mock_npu_cast.assert_not_called()

View File

@ -216,6 +216,9 @@ class WeightPrefetchConfig:
"qkv": 1.0,
"o": 1.0,
},
"moe": {
"gate_up": 0.8
}
}
def __init__(self, weight_prefetch_config: dict):

View File

@ -145,7 +145,7 @@ def set_ascend_forward_context(
forward_context.prefetch_mlp_gate_up_proj = False
forward_context.prefetch_mlp_down_proj = False
forward_context.prefetch_mlp_enabled = prefetch_mlp_enabled
# TODO(yuzhup): integrate moe weight prefetch method
forward_context.model_instance = model_instance
forward_context.weight_prefetch_method = weight_prefetch_method
# TODO(rjg-lyh): The current implementation is somewhat brute force and not elegant.

View File

@ -50,6 +50,7 @@ class AttentionMaskBuilder:
self._seq_len_cached = attn_mask.shape[0]
self.attn_mask_cache = attn_mask
self.device = device
self.pooling_mask = None
if torch.version.cann.startswith("8.3"):
assigned_mask_dim = 2048
self.chunked_prefill_attn_mask = torch.triu(
@ -75,6 +76,14 @@ class AttentionMaskBuilder:
return self.attn_mask_cache[:max_seq_len, :max_seq_len].contiguous(
).to(device, non_blocking=True)
def get_pooling_mask(self, device):
if self.pooling_mask is None:
# the compressed attention mask for npu_fusion_attention sparse mode 4
self.pooling_mask = torch.triu(torch.ones(
2048, 2048), diagonal=1).to(torch.bool).to(device,
non_blocking=True)
return self.pooling_mask
def get_splitfuse_attn_mask(
self,
seq_lens: torch.Tensor = None,

View File

@ -606,9 +606,8 @@ class AscendAttentionBackendImpl(AttentionImpl):
num_actual_tokens = attn_metadata.num_actual_tokens
assert layer._k_scale_float == 1.0 and layer._v_scale_float == 1.0
attn_type = self.attn_type
if attn_type != AttentionType.DECODER:
raise NotImplementedError("Encoder self-attention and "
"encoder/decoder cross-attention "
if attn_type != AttentionType.DECODER and attn_type != AttentionType.ENCODER_ONLY:
raise NotImplementedError("Encoder/decoder cross-attention "
"are not implemented for "
"PallasAttentionBackendImpl")
# View q k v to BSH.
@ -628,9 +627,25 @@ class AscendAttentionBackendImpl(AttentionImpl):
key_cache=self.key_cache,
value_cache=self.value_cache,
slot_indices=slots)
if attn_type == AttentionType.ENCODER_ONLY:
cum_seq_len = attn_metadata.query_start_loc[1:].tolist()
attn_out = torch_npu.npu_fusion_attention(
query,
key,
value,
head_num=self.num_heads,
input_layout="TND",
scale=self.scale,
sparse_mode=4,
atten_mask=attn_metadata.attn_mask,
pre_tockens=attn_metadata.max_query_len,
next_tockens=attn_metadata.max_query_len,
actual_seq_qlen=cum_seq_len,
actual_seq_kvlen=cum_seq_len,
)
output = attn_out[0]
# V0-Style scheduler situation.
if attn_metadata.attn_state == AscendAttentionState.PrefillNoCache:
elif attn_metadata.attn_state == AscendAttentionState.PrefillNoCache:
output = self._forward_prefill_no_cache(
query, key, value, attn_metadata, output, num_tokens)
elif attn_metadata.attn_state == \

View File

@ -27,6 +27,8 @@ from vllm_ascend.multistream.base import MSAttentionMetadataSplitConfig
from vllm_ascend.multistream.context import get_multistream_comm_context
from vllm_ascend.multistream.ms_split import model_input_split_v1_mla_attn
from vllm_ascend.ops.weight_prefetch import maybe_npu_prefetch
from vllm_ascend.utils import (ACL_FORMAT_FRACTAL_ND, ACL_FORMAT_FRACTAL_NZ,
is_enable_nz)
from vllm_ascend.worker.npu_input_batch import InputBatch
if TYPE_CHECKING:
@ -595,6 +597,10 @@ class AscendMLAImpl(MLAAttentionImpl):
del eye
# standardize to (output, input)
return dequant_weights.T
# Weight will be reshaped next. To be on the safe side, the format
# of the weight should be reverted to FRACTAL_ND.
layer.weight.data = torch_npu.npu_format_cast(
layer.weight.data, ACL_FORMAT_FRACTAL_ND)
return layer.weight
# we currently do not have quantized bmm's which are needed for
@ -623,6 +629,12 @@ class AscendMLAImpl(MLAAttentionImpl):
# Convert from (L, N, P) to (N, P, L)
self.W_UK_T = W_UK.permute(1, 2, 0).contiguous()
# Function `get_and_maybe_dequant_weights` will cast the weights to
# FRACTAL_ND. So we need to cast them to FRACTAL_NZ again.
if is_enable_nz():
self.kv_b_proj.weight.data = torch_npu.npu_format_cast(
self.kv_b_proj.weight.data, ACL_FORMAT_FRACTAL_NZ)
# Waiting for BMM NZ support
# self.W_UV.data = torch_npu.npu_format_cast(self.W_UV.data, 29)
# self.W_UK_T.data = torch_npu.npu_format_cast(self.W_UK_T.data, 29)

View File

@ -18,8 +18,10 @@ from vllm.distributed.parallel_state import get_pp_group, get_tp_group
from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.utils import logger
from vllm.v1.core.sched.output import SchedulerOutput
from vllm.v1.kv_cache_interface import FullAttentionSpec, KVCacheSpec
from vllm.v1.kv_cache_interface import (FullAttentionSpec, KVCacheSpec,
MLAAttentionSpec)
from vllm_ascend.ascend_config import get_ascend_config
from vllm_ascend.distributed.cpu_offload_manager.metadata import (
MetadataServer, MetadataServerProc, MLAConfig)
@ -434,18 +436,30 @@ def get_kv_cache_spec(vllm_config: VllmConfig) -> dict[str, KVCacheSpec]:
forward_ctx = vllm_config.compilation_config.static_forward_context
block_size = vllm_config.cache_config.block_size
use_mla = vllm_config.model_config.use_mla
ascend_config = get_ascend_config()
use_sfa = ascend_config.use_sfa
kv_cache_spec: dict[str, KVCacheSpec] = {}
for layer_name, attn_module in forward_ctx.items():
if isinstance(attn_module, FusedMoE):
continue
assert isinstance(attn_module, Attention)
if attn_module.attn_type == AttentionType.DECODER:
kv_cache_spec[layer_name] = FullAttentionSpec(
block_size=block_size,
num_kv_heads=attn_module.num_kv_heads,
head_size=attn_module.head_size,
dtype=attn_module.dtype,
use_mla=use_mla)
if use_mla and not use_sfa:
kv_cache_spec[layer_name] = MLAAttentionSpec(
block_size=block_size,
num_kv_heads=attn_module.num_kv_heads,
head_size=attn_module.head_size,
dtype=attn_module.dtype,
cache_dtype_str=vllm_config.cache_config.cache_dtype)
else:
# TODO(cmq): This is a hacky way to fix the deepseek kvcache when
# using DSA. Fixing the spec in vLLM is the final solution.
kv_cache_spec[layer_name] = FullAttentionSpec(
block_size=block_size,
num_kv_heads=attn_module.num_kv_heads,
head_size=attn_module.head_size,
dtype=attn_module.dtype)
elif attn_module.attn_type in (AttentionType.ENCODER,
AttentionType.ENCODER_ONLY):
continue

View File

@ -169,6 +169,9 @@ env_variables: Dict[str, Callable[[], Any]] = {
lambda: int(os.getenv("VLLM_ASCEND_KVCACHE_DELAY_FREE_TIMEOUT", 250)),
"VLLM_ASCEND_ENABLE_MLAPO":
lambda: bool(int(os.getenv("VLLM_ASCEND_ENABLE_MLAPO", '0'))),
# Whether to enable weight transposition and casting the format to FRACTAL_NZ.
"VLLM_ASCEND_ENABLE_NZ":
lambda: int(os.getenv("VLLM_ASCEND_ENABLE_NZ", 1)),
}
# end-env-vars-definition

View File

@ -32,13 +32,15 @@ from torch import nn
from transformers import PretrainedConfig
from vllm.attention import AttentionMetadata
from vllm.config import CacheConfig, VllmConfig
from vllm.distributed import (get_pp_group, get_tensor_model_parallel_rank,
from vllm.distributed import (divide, get_pp_group,
get_tensor_model_parallel_rank,
get_tensor_model_parallel_world_size,
get_tp_group, split_tensor_along_last_dim,
tensor_model_parallel_all_reduce)
from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.model_executor.layers.layernorm import RMSNorm
from vllm.model_executor.layers.linear import (ColumnParallelLinear,
from vllm.model_executor.layers.linear import (WEIGHT_LOADER_V2_SUPPORTED,
ColumnParallelLinear,
ReplicatedLinear,
RowParallelLinear)
from vllm.model_executor.layers.logits_processor import LogitsProcessor
@ -57,16 +59,81 @@ from vllm.model_executor.models.deepseek_v2 import (
from vllm.model_executor.models.utils import (PPMissingLayer,
is_pp_missing_parameter,
maybe_prefix)
from vllm.model_executor.utils import set_weight_attrs
from vllm_ascend.ascend_config import get_ascend_config
from vllm_ascend.models.layers.mla import AscendMLAModules
from vllm_ascend.models.layers.sfa import (AscendSFAModules,
AscendSparseFlashAttention, Indexer)
from vllm_ascend.ops.common_fused_moe import AscendFusedMoE
from vllm_ascend.ops.linear import AscendLinearBase
class CustomDeepseekV2RowParallelLinear(RowParallelLinear):
def __init__(
self,
input_size: int,
output_size: int,
bias: bool = True,
input_is_parallel: bool = True,
skip_bias_add: bool = False,
params_dtype: Optional[torch.dtype] = None,
reduce_results: bool = True,
quant_config: Optional[QuantizationConfig] = None,
prefix: str = "",
*,
return_bias: bool = True,
disable_tp: bool = False,
):
# Divide the weight matrix along the first dimension.
self.tp_rank = (get_tensor_model_parallel_rank()
if not disable_tp else 0)
self.tp_size = (get_tensor_model_parallel_world_size()
if not disable_tp else 1)
self.input_size_per_partition = divide(input_size, self.tp_size)
self.output_size_per_partition = output_size
self.output_partition_sizes = [output_size]
AscendLinearBase.__init__(self,
input_size,
output_size,
skip_bias_add,
params_dtype,
quant_config,
prefix,
return_bias=return_bias,
disable_tp=disable_tp)
self.input_is_parallel = input_is_parallel
self.reduce_results = reduce_results
assert self.quant_method is not None
self.quant_method.create_weights(
layer=self,
input_size_per_partition=self.input_size_per_partition,
output_partition_sizes=self.output_partition_sizes,
input_size=self.input_size,
output_size=self.output_size,
params_dtype=self.params_dtype,
weight_loader=(
self.weight_loader_v2 if self.quant_method.__class__.__name__
in WEIGHT_LOADER_V2_SUPPORTED else self.weight_loader))
if not reduce_results and (bias and not skip_bias_add):
raise ValueError("When not reduce the results, adding bias to the "
"results can lead to incorrect results")
if bias:
self.bias = nn.Parameter(
torch.empty(self.output_size, dtype=params_dtype))
set_weight_attrs(self.bias, {
"output_dim": 0,
"weight_loader": self.weight_loader,
})
else:
self.register_parameter("bias", None)
self.update_param_tp_status()
def forward(
self,
input_,

View File

@ -37,7 +37,8 @@ from vllm_ascend.eplb.core.eplb_utils import (determine_default_expert_map,
from vllm_ascend.ops.expert_load_balancer import ExpertLoadBalancer
from vllm_ascend.ops.moe.experts_selector import select_experts
from vllm_ascend.ops.moe.moe_comm_method import setup_moe_comm_method
from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ, is_310p, npu_stream_switch
from vllm_ascend.utils import (ACL_FORMAT_FRACTAL_NZ, is_310p, is_enable_nz,
npu_stream_switch)
class AscendUnquantizedFusedMoEMethod(UnquantizedFusedMoEMethod):
@ -83,7 +84,7 @@ class AscendUnquantizedFusedMoEMethod(UnquantizedFusedMoEMethod):
w2_data = self._maybe_pad_weight(layer.w2_weight.data)
layer.w2_weight = torch.nn.Parameter(w2_data, requires_grad=False)
if not is_310p():
if not is_310p() and is_enable_nz():
layer.w13_weight.data = torch_npu.npu_format_cast(
layer.w13_weight.data, ACL_FORMAT_FRACTAL_NZ)
layer.w2_weight.data = torch_npu.npu_format_cast(

View File

@ -24,17 +24,29 @@ from typing import Optional, Union
import torch
import torch.nn as nn
import torch_npu
from torch.nn.parameter import Parameter
from vllm.distributed import divide
from vllm.model_executor.layers.linear import ( # noqa
WEIGHT_LOADER_V2_SUPPORTED, ColumnParallelLinear, LinearBase,
MergedColumnParallelLinear, QKVParallelLinear, QuantizeMethodBase,
RowParallelLinear, UnquantizedLinearMethod)
ReplicatedLinear, RowParallelLinear, UnquantizedLinearMethod)
from vllm.model_executor.layers.quantization.base_config import \
QuantizationConfig
from vllm.model_executor.utils import set_weight_attrs
from vllm_ascend.ops.linear_op import get_parallel_op
from vllm_ascend.ops.linear_op import get_parallel_op, get_replicated_op
from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ, is_enable_nz
class AscendUnquantizedLinearMethod(UnquantizedLinearMethod):
"""Linear method without quantization"""
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
super().process_weights_after_loading(layer)
if is_enable_nz() and torch.version.cann.startswith("8.3"):
layer.weight.data = torch_npu.npu_format_cast(
layer.weight.data, ACL_FORMAT_FRACTAL_NZ)
# TODO(realliujiaxu): Remove this class after linear of vllm supports custom comm group
@ -65,7 +77,7 @@ class AscendLinearBase(LinearBase):
self.prefix = prefix
if quant_config is None:
self.quant_method: Optional[
QuantizeMethodBase] = UnquantizedLinearMethod()
QuantizeMethodBase] = AscendUnquantizedLinearMethod()
else:
self.quant_method = quant_config.get_quant_method(self,
prefix=prefix)
@ -364,3 +376,81 @@ class AscendColumnParallelLinear(ColumnParallelLinear):
return self.custom_op.apply(input_)
return super().forward(input_)
class AscendReplicatedLinear(ReplicatedLinear):
"""Ascend Replicated linear layer.
Args:
input_size: input dimension of the linear layer.
output_size: output dimension of the linear layer.
bias: If true, add bias.
skip_bias_add: If true, skip adding bias but instead return it.
params_dtype: Data type for the parameters.
quant_config: Quantization configure.
prefix: The name of the layer in the state dict, including all parents
(e.g. model.layers.0.qkv_proj)
return_bias: If true, return bias together with outputs in forward pass.
disable_tp: Has no effect for replicated linear layers.
"""
def __init__(
self,
input_size: int,
output_size: int,
bias: bool = True,
skip_bias_add: bool = False,
params_dtype: Optional[torch.dtype] = None,
quant_config: Optional[QuantizationConfig] = None,
prefix: str = "",
*,
return_bias: bool = True,
disable_tp: bool = False,
):
self.custom_op = get_replicated_op(disable_tp, prefix, self)
# If MergedReplicatedLinear, use output size of each partition.
if hasattr(self, "output_sizes"):
self.output_partition_sizes = self.output_sizes
else:
self.output_partition_sizes = [output_size]
AscendLinearBase.__init__(self,
input_size,
output_size,
skip_bias_add,
params_dtype,
quant_config,
prefix=prefix,
return_bias=return_bias,
disable_tp=disable_tp)
# All the linear layer supports quant method.
assert self.quant_method is not None
self.quant_method.create_weights(self,
self.input_size, [self.output_size],
self.input_size,
self.output_size,
self.params_dtype,
weight_loader=self.weight_loader)
if bias:
self.bias = Parameter(
torch.empty(self.output_size, dtype=self.params_dtype))
set_weight_attrs(self.bias, {
"output_dim": 0,
"weight_loader": self.weight_loader,
})
else:
self.register_parameter("bias", None)
if self.custom_op is not None:
self.custom_op.update_attrs()
def forward(
self,
input_,
) -> Union[torch.Tensor, tuple[torch.Tensor, Optional[Parameter]]]:
if self.custom_op is not None:
return self.custom_op.apply(input_)
return super().forward(input_)

View File

@ -17,16 +17,16 @@ This file extends the functionality of linear operations by encapsulating custom
communication groups and forward functions into classes (linear ops).
Current class inheritance structure:
CustomTensorParallelOp
CustomLinearOp
├── CustomColumnParallelOp
│ ├── MLPColumnParallelOp
│ ├── SequenceColumnParallelOp
└── CustomRowParallelOp
├── MLPRowParallelOp
├── OProjRowParallelOp
├── MatmulAllreduceRowParallelOp
└── SequenceRowParallelOp
├── MLPRowParallelOp
├── OProjRowParallelOp
├── MatmulAllreduceRowParallelOp
└── SequenceRowParallelOp
└── CustomReplicatedOp
How to extend a new linear op? Taking column parallel op as an example:
1. Inherit from CustomColumnParallelOp and create a new class MyColumnParallelOp
2. [Optional] The default communication group is the TP group. If a custom communication group is needed, override the comm_group method
@ -52,7 +52,7 @@ from vllm_ascend.utils import (dense_optim_enable, enable_sp,
oproj_tp_enable)
class CustomTensorParallelOp:
class CustomLinearOp:
def __init__(self, layer):
self.layer = layer
@ -95,7 +95,7 @@ class CustomTensorParallelOp:
return output, output_bias
class CustomColumnParallelOp(CustomTensorParallelOp):
class CustomColumnParallelOp(CustomLinearOp):
def __init__(self, layer):
super().__init__(layer)
@ -106,7 +106,7 @@ class CustomColumnParallelOp(CustomTensorParallelOp):
self.gather_output = self.layer.gather_output
class CustomRowParallelOp(CustomTensorParallelOp):
class CustomRowParallelOp(CustomLinearOp):
def __init__(self, layer):
super().__init__(layer)
@ -129,6 +129,18 @@ class CustomRowParallelOp(CustomTensorParallelOp):
return output, output_bias
class CustomReplicatedOp(CustomLinearOp):
def apply_impl(self, input_):
bias = self.bias if not self.skip_bias_add else None
assert self.quant_method is not None
output = self.quant_method.apply(self.layer, input_, bias)
output_bias = self.bias if self.skip_bias_add else None
return output, output_bias
class MLPColumnParallelOp(CustomColumnParallelOp):
def __init__(self, layer):
@ -422,3 +434,11 @@ def get_parallel_op(disable_tp, prefix, layer, direct):
return custom_op, custom_op.tp_rank, custom_op.tp_size
return None, get_tp_group().rank_in_group, get_tp_group().world_size
def get_replicated_op(disable_tp, prefix,
layer) -> Optional[Union[CustomReplicatedOp]]:
if disable_tp:
return None
return CustomReplicatedOp(layer)

View File

@ -18,6 +18,7 @@ from typing import Callable, Optional
import torch
import torch_npu
from vllm.forward_context import get_forward_context
def return_row_idx(hidden_states, top_k):
@ -65,7 +66,11 @@ def select_experts(hidden_states: torch.Tensor,
topk_weights: router weights of shape (num_tokens, top_k).
topk_ids: selected expert IDs of shape (num_tokens, top_k).
"""
# prefetch w1_w3_proj.weight preprocess
weight_prefetch_method = get_forward_context().weight_prefetch_method
if weight_prefetch_method:
weight_prefetch_method.maybe_prefetch_moe_weight_preprocess(
hidden_states, "gate_up")
topk_weights, topk_ids, row_idx = _select_experts_with_fusion_ops(
hidden_states=hidden_states,
router_logits=router_logits,

View File

@ -78,6 +78,10 @@ def quant_apply_mlp(hidden_states: torch.Tensor,
bias1, bias2 = None, None
_output_dtype = w2_scale.dtype
weight_prefetch_method = get_forward_context().weight_prefetch_method
if weight_prefetch_method:
weight_prefetch_method.maybe_prefetch_moe_weight_postprocess(
hidden_states)
is_mc2 = get_forward_context().moe_comm_type == MoECommType.MC2
if w1_scale_bias is None and is_mc2:
if fusion and not dynamic_eplb:

View File

@ -1,83 +1,112 @@
from dataclasses import dataclass, field
import torch
import torch_npu
from vllm_ascend.ascend_config import WeightPrefetchConfig
from vllm_ascend.ops.linear import (AscendQKVParallelLinear,
AscendRowParallelLinear)
SUPPORTED_MODULES = ["attn", "mlp", "moe"]
@dataclass
class ModuleWeightPrefetchConfig:
module_name: str
enable: bool = False
prefetch_ratio: dict = field(default_factory=dict)
linear_prefix_map: dict = field(default_factory=dict)
def __post_init__(self) -> None:
self.prefetch_ratio = {
prefix: ratio
for prefix, ratio in self.prefetch_ratio.items() if 0 <= ratio <= 1
}
assert self.module_name in SUPPORTED_MODULES, (
f"Invalid module name {self.module_name}, should be one of {SUPPORTED_MODULES}"
)
if self.module_name in SUPPORTED_MODULES:
self.enable = self.enable and any(self.prefetch_ratio.values()) > 0
class WeightPrefetchMethod:
"""
Unified weight prefetch method.
"""
def __init__(self, weight_prefetch_config: WeightPrefetchConfig) -> None:
self.attn = ModuleWeightPrefetchConfig(
module_name="attn",
enable=weight_prefetch_config.enabled,
prefetch_ratio=weight_prefetch_config.prefetch_ratio.get(
"attn", {}),
linear_prefix_map={
AscendQKVParallelLinear.__name__: "qkv",
AscendRowParallelLinear.__name__: "o",
})
def maybe_prefetch_attn_weight_preprocess(
self, layer_cls_name: str, weight: torch.Tensor,
start_flag: torch.Tensor) -> None:
if not self.attn.enable or layer_cls_name not in self.attn.linear_prefix_map:
return
prefix = self.attn.linear_prefix_map.get(layer_cls_name, "")
weight_size = weight.data.element_size() * weight.data.numel(
) * self.attn.prefetch_ratio.get(prefix, 0)
torch.ops.vllm.prefetch_preprocess(weight=weight,
start_flag=start_flag,
max_weight_size=int(weight_size))
def maybe_prefetch_attn_weight_postprocess(
self, layer_cls_name: str, stop_flag: torch.Tensor) -> None:
if not self.attn.enable or layer_cls_name not in self.attn.linear_prefix_map:
return
torch.ops.vllm.prefetch_postprocess(stop_flag)
def maybe_npu_prefetch(inputs: torch.Tensor,
dependency: torch.Tensor,
max_size: int = 0,
offset: int = 0,
*,
enabled: bool = True) -> None:
if not enabled:
return
input_size = inputs.element_size() * inputs.numel()
if max_size <= 0 or max_size > input_size:
max_size = input_size
torch_npu.npu_prefetch(inputs, dependency, max_size, offset)
from dataclasses import dataclass, field
import torch
import torch_npu
from vllm.forward_context import get_forward_context
from vllm_ascend.ascend_config import WeightPrefetchConfig
from vllm_ascend.ops.linear import (AscendQKVParallelLinear,
AscendRowParallelLinear)
SUPPORTED_MODULES = ["attn", "mlp", "moe"]
MOE_PREFETCH_TOKEN_THRESHOLD = 96
@dataclass
class ModuleWeightPrefetchConfig:
module_name: str
enable: bool = False
is_active_this_forward: bool = False
prefetch_ratio: dict = field(default_factory=dict)
linear_prefix_map: dict = field(default_factory=dict)
def __post_init__(self) -> None:
self.prefetch_ratio = {
prefix: ratio
for prefix, ratio in self.prefetch_ratio.items() if 0 <= ratio <= 1
}
assert self.module_name in SUPPORTED_MODULES, (
f"Invalid module name {self.module_name}, should be one of {SUPPORTED_MODULES}"
)
if self.module_name in SUPPORTED_MODULES:
self.enable = self.enable and any(self.prefetch_ratio.values()) > 0
class WeightPrefetchMethod:
"""
Unified weight prefetch method.
"""
def __init__(self, weight_prefetch_config: WeightPrefetchConfig) -> None:
self.attn = ModuleWeightPrefetchConfig(
module_name="attn",
enable=weight_prefetch_config.enabled,
prefetch_ratio=weight_prefetch_config.prefetch_ratio.get(
"attn", {}),
linear_prefix_map={
AscendQKVParallelLinear.__name__: "qkv",
AscendRowParallelLinear.__name__: "o",
})
self.moe = ModuleWeightPrefetchConfig(
module_name="moe",
enable=weight_prefetch_config.enabled,
prefetch_ratio=weight_prefetch_config.prefetch_ratio.get(
"moe", {}))
def maybe_prefetch_attn_weight_preprocess(
self, layer_cls_name: str, weight: torch.Tensor,
start_flag: torch.Tensor) -> None:
if not self.attn.enable or layer_cls_name not in self.attn.linear_prefix_map:
return
prefix = self.attn.linear_prefix_map.get(layer_cls_name, "")
weight_size = weight.data.element_size() * weight.data.numel(
) * self.attn.prefetch_ratio.get(prefix, 0)
torch.ops.vllm.prefetch_preprocess(weight=weight,
start_flag=start_flag,
max_weight_size=int(weight_size))
def maybe_prefetch_attn_weight_postprocess(
self, layer_cls_name: str, stop_flag: torch.Tensor) -> None:
if not self.attn.enable or layer_cls_name not in self.attn.linear_prefix_map:
return
torch.ops.vllm.prefetch_postprocess(stop_flag)
def maybe_prefetch_moe_weight_preprocess(self, hidden_states, prefix):
self.moe.is_active_this_forward = hidden_states.shape[
0] >= MOE_PREFETCH_TOKEN_THRESHOLD if self.moe.enable else False
if not self.moe.is_active_this_forward:
return
forward_context = get_forward_context()
weight = forward_context.model_instance.model.layers[
forward_context.layer_idx].mlp.experts.w13_weight
weight_size = weight.data.element_size() * weight.data.numel(
) * self.moe.prefetch_ratio.get(prefix, 0)
torch.ops.vllm.prefetch_preprocess(weight=weight,
start_flag=None,
max_weight_size=int(weight_size))
forward_context.layer_idx += 1
def maybe_prefetch_moe_weight_postprocess(self, stop_flag: torch.Tensor):
if not self.moe.is_active_this_forward:
return
torch.ops.vllm.prefetch_postprocess(stop_flag)
def maybe_npu_prefetch(inputs: torch.Tensor,
dependency: torch.Tensor,
max_size: int = 0,
offset: int = 0,
*,
enabled: bool = True) -> None:
if not enabled:
return
input_size = inputs.element_size() * inputs.numel()
if max_size <= 0 or max_size > input_size:
max_size = input_size
torch_npu.npu_prefetch(inputs, dependency, max_size, offset)

View File

@ -132,3 +132,22 @@
# - this is an Ascend-only bug. It can't be fixed in vLLM.
# Future Plan:
# Fix this bug in torch-npu, bump torch-npu version and remove this patch.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. `vllm.model_executor.models.roberta.RobertaEmbedding.forward`
# Why:
# shift operation in `_encode_token_type_ids` and `_decode_token_type_ids` cannot run in ascend aclgraph mode
# How
# Replace shift operation with multiplication and division.
# Related PR (if no, explain why):
# No, this need CANN add an aclnn shift operation
# Future Plan:
# Revert this when CANN support shift aclnn operation
# 2. `vllm.model_executor.models.roberta.RobertaForSequenceClassification.forward `
# Why:
# shift operation in `_encode_token_type_ids` and `_decode_token_type_ids` cannot run in ascend aclgraph mode
# How
# Replace shift operation with multiplication and division.
# Related PR (if no, explain why):
# No, this need CANN add an aclnn shift operation
# Future Plan:
# Revert this when CANN support shift aclnn operation

View File

@ -19,4 +19,3 @@ import vllm_ascend.patch.platform.patch_common.patch_config # noqa
import vllm_ascend.patch.platform.patch_common.patch_distributed # noqa
import vllm_ascend.patch.platform.patch_common.patch_mamba_config # noqa
import vllm_ascend.patch.worker.patch_common.patch_attention_selector # noqa
import vllm_ascend.patch.worker.patch_common.patch_attentionspec # noqa

View File

@ -6,8 +6,6 @@ from vllm.model_executor.models.config import MambaModelConfig
from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE, cdiv
from vllm.v1.kv_cache_interface import FullAttentionSpec, MambaSpec
from vllm_ascend.ascend_config import get_ascend_config
@classmethod
def verify_and_update_config(cls, vllm_config) -> None:
@ -24,7 +22,6 @@ def verify_and_update_config(cls, vllm_config) -> None:
logger = init_logger(__name__)
# Enable FULL_AND_PIECEWISE by default
MambaModelConfig.verify_and_update_config(vllm_config)
ascend_config = get_ascend_config()
cache_config = vllm_config.cache_config
model_config = vllm_config.model_config
@ -40,8 +37,7 @@ def verify_and_update_config(cls, vllm_config) -> None:
block_size=1,
num_kv_heads=model_config.get_num_kv_heads(parallel_config),
head_size=model_config.get_head_size(),
dtype=kv_cache_dtype,
use_mla=model_config.use_mla or ascend_config.use_sfa).page_size_bytes
dtype=kv_cache_dtype).page_size_bytes
model_cls, _ = ModelRegistry.resolve_model_cls(
model_config.architecture,

View File

@ -22,10 +22,10 @@ if HAS_TRITON:
# isort: off
import vllm_ascend.patch.worker.patch_common.patch_attention_selector # noqa
import vllm_ascend.patch.worker.patch_common.patch_attentionspec # noqa
import vllm_ascend.patch.worker.patch_common.patch_attention_layer # noqa
import vllm_ascend.patch.worker.patch_common.patch_distributed # noqa
import vllm_ascend.patch.worker.patch_common.patch_logits # noqa
import vllm_ascend.patch.worker.patch_common.patch_roberta # noqa
import vllm_ascend.patch.worker.patch_common.patch_weight_loader # noqa
import vllm_ascend.patch.worker.patch_common.patch_multimodal_merge # noqa

View File

@ -64,6 +64,7 @@ def _cached_get_attn_backend(
use_mla: bool = False,
use_sfa: bool = False,
has_sink: bool = False,
use_sparse: bool = False,
) -> type[AttentionBackend]:
# Check whether a particular choice of backend was
# previously forced.

View File

@ -1,110 +0,0 @@
from dataclasses import dataclass, fields
from typing import Optional
import torch
import vllm
from typing_extensions import Self
from vllm.config import VllmConfig
from vllm.utils import cdiv, get_dtype_size
from vllm.v1.core.single_type_kv_cache_manager import (FullAttentionManager,
spec_manager_map)
from vllm.v1.kv_cache_interface import FullAttentionSpec, KVCacheSpec
@dataclass(frozen=True)
class AttentionSpec(KVCacheSpec):
num_kv_heads: int
head_size: int
dtype: torch.dtype
use_mla: bool
use_sfa: bool
@property
def page_size_bytes(self) -> int:
# For MLA we only store a single latent vector
coef = 1 if self.use_mla else 2
sfa_bytes = 128 * self.block_size * get_dtype_size(
self.dtype) if self.use_sfa else 0
return coef * self.block_size * self.num_kv_heads * self.head_size \
* get_dtype_size(self.dtype) + sfa_bytes
vllm.v1.kv_cache_interface.AttentionSpec = AttentionSpec
@dataclass(frozen=True)
class AscendFullAttentionSpec(FullAttentionSpec, AttentionSpec):
sliding_window: Optional[int] = None
attention_chunk_size: Optional[int] = None
"""
When hybrid allocator is disabled and the model contains both full
attention layers and sliding window attention layers, sliding
window attention are regarded as full attention in KV cache manager
(blocks are allocated for all tokens), while computed as sliding window
attention in model runner.
In this case, we use FullAttentionSpec and record the sliding window size.
Default to None for not using sliding window attention.
"""
def max_memory_usage_bytes(self, vllm_config: VllmConfig) -> int:
max_model_len = vllm_config.model_config.max_model_len
dcp_world_size = \
vllm_config.parallel_config.decode_context_parallel_size
# Note(hc): each dcp rank only need save
# (max_model_len//dcp_world_size) tokens locally.
if dcp_world_size > 1:
max_model_len = cdiv(max_model_len, dcp_world_size)
return cdiv(max_model_len, self.block_size) * self.page_size_bytes
@classmethod
def merge_window_sizes(cls, window_sizes: set[int]) -> Optional[int]:
if len(window_sizes) == 0:
return None
elif len(window_sizes) == 1:
return window_sizes.pop()
else:
raise ValueError(
"All attention layers in the same KV cache group must have the "
"same window size.")
@classmethod
def merge(cls, specs: list[Self]) -> Self:
"""
Merge a list of FullAttentionSpec objects into a single
FullAttentionSpec object.
"""
assert all(isinstance(spec, FullAttentionSpec) for spec in specs), (
"All attention layers in the same KV cache group must be "
"FullAttentionSpec.")
sliding_window = set(spec.sliding_window for spec in specs
if spec.sliding_window is not None)
attention_chunk_size = set(spec.attention_chunk_size for spec in specs
if spec.attention_chunk_size is not None)
merged_spec = cls(
block_size=specs[0].block_size,
num_kv_heads=specs[0].num_kv_heads,
head_size=specs[0].head_size,
dtype=specs[0].dtype,
use_mla=specs[0].use_mla,
use_sfa=specs[0].use_sfa,
sliding_window=cls.merge_window_sizes(sliding_window),
attention_chunk_size=cls.merge_window_sizes(attention_chunk_size),
)
for spec in specs:
for f in fields(AttentionSpec):
assert getattr(spec, f.name) == getattr(merged_spec, f.name), (
"All attention layers in the same KV cache group must have "
"the same attention spec.")
assert (
(merged_spec.sliding_window is not None) +
(merged_spec.attention_chunk_size is not None) <= 1
), ("Model with both sliding window layers and chunked local attention "
"layers is not supported.")
return merged_spec
spec_manager_map.update({AscendFullAttentionSpec: FullAttentionManager})
vllm.v1.kv_cache_interface.FullAttentionSpec = AscendFullAttentionSpec
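For context on what the removed `page_size_bytes` override computed, the same formula is shown below as a standalone function with illustrative numbers (the 128-token block, 576-wide latent head, and bf16 cache are assumptions, not values from a specific model). The MLA coefficient of 1 halves the per-block page size relative to the default coefficient of 2.

```
# Illustrative arithmetic only; the shapes and dtype size are assumed values.
def page_size_bytes(block_size: int, num_kv_heads: int, head_size: int,
                    dtype_size: int, use_mla: bool, use_sfa: bool) -> int:
    coef = 1 if use_mla else 2  # MLA stores a single latent vector per token
    sfa_bytes = 128 * block_size * dtype_size if use_sfa else 0
    return coef * block_size * num_kv_heads * head_size * dtype_size + sfa_bytes


print(page_size_bytes(128, 1, 576, 2, use_mla=True, use_sfa=False))   # 147456
print(page_size_bytes(128, 1, 576, 2, use_mla=False, use_sfa=False))  # 294912
```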

View File

@ -0,0 +1,88 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from typing import Optional
import torch
from vllm.model_executor.models.roberta import (
RobertaEmbedding, RobertaForSequenceClassification,
replace_roberta_positions)
from vllm.sequence import IntermediateTensors
# aclgraph does not support shift operator for now
# TODO: revert me when aclgraph supports shift operator
TOKEN_TYPE_SHIFT = 30
TOKEN_TYPE_MULTIPLIER = 1 << 30
TOKEN_MASK = TOKEN_TYPE_MULTIPLIER - 1
def _encode_token_type_ids(input_ids: torch.Tensor,
token_type_ids: torch.Tensor) -> None:
# input_ids can be padded to the right
input_ids[:token_type_ids.shape[0]].bitwise_or_(token_type_ids *
TOKEN_TYPE_MULTIPLIER)
def _decode_token_type_ids(input_ids: torch.Tensor) -> torch.Tensor:
token_type_ids = input_ids // TOKEN_TYPE_MULTIPLIER
input_ids.bitwise_and_(TOKEN_MASK)
return token_type_ids
def roberta_for_sequence_classification_forward(
self,
input_ids: Optional[torch.Tensor],
positions: torch.Tensor,
intermediate_tensors: Optional[IntermediateTensors] = None,
inputs_embeds: Optional[torch.Tensor] = None,
token_type_ids: Optional[torch.Tensor] = None,
) -> torch.Tensor:
replace_roberta_positions(input_ids=input_ids,
position_ids=positions,
padding_idx=self.padding_idx)
if token_type_ids is not None:
assert self.roberta.config.vocab_size < (1 << TOKEN_TYPE_SHIFT)
assert input_ids is not None
_encode_token_type_ids(input_ids, token_type_ids)
return self.roberta(input_ids=input_ids,
positions=positions,
inputs_embeds=inputs_embeds,
intermediate_tensors=intermediate_tensors)
def roberta_embedding_forward(
self,
input_ids: torch.Tensor,
position_ids: torch.Tensor,
) -> torch.Tensor:
token_type_ids = _decode_token_type_ids(input_ids)
inputs_embeds = self.word_embeddings(input_ids)
position_embeddings = self.position_embeddings(position_ids)
token_type_embeddings = self.token_type_embeddings(token_type_ids)
embeddings = inputs_embeds + token_type_embeddings + position_embeddings
embeddings = self.LayerNorm(embeddings)
return embeddings
RobertaEmbedding.forward = roberta_embedding_forward
RobertaForSequenceClassification.forward = roberta_for_sequence_classification_forward
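A minimal round-trip check for the encode/decode pair above (illustrative; it assumes the two helpers are importable from this patch module and that all ids stay below `1 << 30`):

```
import torch

ids = torch.tensor([11, 42, 999_983], dtype=torch.int64)
types = torch.tensor([0, 1, 1], dtype=torch.int64)

packed = ids.clone()
_encode_token_type_ids(packed, types)           # in-place: token types move into the high bits
decoded_types = _decode_token_type_ids(packed)  # in-place: the high bits are stripped again

assert torch.equal(decoded_types, types)
assert torch.equal(packed, ids)
```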

View File

@ -134,7 +134,8 @@ class NPUPlatform(Platform):
structured_outputs_config = vllm_config.structured_outputs_config
if (model_config is not None and not model_config.use_mla
and not scheduler_config.async_scheduling):
and not scheduler_config.async_scheduling
and model_config.runner_type != "pooling"):
logger.info(
"Non-MLA LLMs forcibly disable the chunked prefill feature,"
"as the performance of operators supporting this feature "

View File

@ -24,8 +24,7 @@ from vllm.distributed import get_tensor_model_parallel_rank
from vllm.model_executor.layers.fused_moe import (FusedMoE, FusedMoEMethodBase,
FusedMoeWeightScaleSupported)
from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
RowParallelLinear,
UnquantizedLinearMethod)
RowParallelLinear)
from vllm.model_executor.layers.quantization import \
register_quantization_config
from vllm.model_executor.layers.quantization.base_config import (
@ -33,11 +32,13 @@ from vllm.model_executor.layers.quantization.base_config import (
from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
from vllm.model_executor.layers.vocab_parallel_embedding import (
UnquantizedEmbeddingMethod, VocabParallelEmbedding)
from vllm.model_executor.parameter import PerTensorScaleParameter
from vllm.model_executor.utils import set_weight_attrs
from vllm_ascend.distributed.parallel_state import (get_mlp_tp_group,
get_otp_group)
from vllm_ascend.ops.common_fused_moe import AscendUnquantizedFusedMoEMethod
from vllm_ascend.ops.linear import AscendUnquantizedLinearMethod
from vllm_ascend.utils import (ASCEND_QUANTIZATION_METHOD, mlp_tp_enable,
oproj_tp_enable)
@ -100,7 +101,7 @@ class AscendQuantConfig(QuantizationConfig):
if isinstance(layer, LinearBase):
if self.is_layer_skipped_ascend(prefix,
self.packed_modules_mapping):
return UnquantizedLinearMethod()
return AscendUnquantizedLinearMethod()
return AscendLinearMethod(self, prefix,
self.packed_modules_mapping)
elif isinstance(layer, Attention) and \
@ -250,6 +251,7 @@ class AscendLinearMethod(LinearMethodBase):
**extra_weight_attrs,
) -> None:
output_size_per_partition = sum(output_partition_sizes)
weight_loader = extra_weight_attrs.get("weight_loader")
weight_dict = self.quant_method.get_weight(input_size_per_partition,
output_size_per_partition,
@ -262,7 +264,8 @@ class AscendLinearMethod(LinearMethodBase):
pertensor_dict = self.quant_method.get_pertensor_param(params_dtype)
for pertensor_name, pertensor_param in pertensor_dict.items():
param = torch.nn.Parameter(pertensor_param, requires_grad=False)
param = PerTensorScaleParameter(data=pertensor_param,
weight_loader=weight_loader)
# disable warning
param.ignore_warning = True
layer.register_parameter(pertensor_name, param)

View File

@ -27,7 +27,7 @@ from vllm.forward_context import get_forward_context
from vllm_ascend.ascend_config import get_ascend_config
from vllm_ascend.distributed.parallel_state import get_mc2_group
from vllm_ascend.ops.moe.experts_selector import select_experts
from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ
from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ, is_enable_nz
class AscendW4A8DynamicLinearMethod:
@ -393,9 +393,10 @@ class AscendW4A8DynamicFusedMoEMethod:
self.update_bias(layer, w13_bias, w2_bias)
layer.w13_weight.data = torch_npu.npu_format_cast(
layer.w13_weight.data, ACL_FORMAT_FRACTAL_NZ)
layer.w2_weight.data = torch_npu.npu_format_cast(
layer.w2_weight.data, ACL_FORMAT_FRACTAL_NZ)
if is_enable_nz():
layer.w13_weight.data = torch_npu.npu_format_cast(
layer.w13_weight.data, ACL_FORMAT_FRACTAL_NZ)
layer.w2_weight.data = torch_npu.npu_format_cast(
layer.w2_weight.data, ACL_FORMAT_FRACTAL_NZ)
layer.w13_weight.data = self.pack_to_int32(layer.w13_weight.data)
layer.w2_weight.data = self.pack_to_int32(layer.w2_weight.data)

View File

@ -25,7 +25,7 @@ from vllm.forward_context import get_forward_context
from vllm_ascend.attention.attention_v1 import AscendAttentionState
from vllm_ascend.ops.moe.experts_selector import select_experts
from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ, is_310p
from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ, is_310p, is_enable_nz
def quant_per_tensor(in_tensor: torch.Tensor,
@ -156,8 +156,9 @@ class AscendW8A8LinearMethod:
requires_grad=False).to(layer.aclnn_input_scale.dtype)
if self.transpose_weight:
layer.weight.data = layer.weight.data.transpose(0, 1).contiguous()
layer.weight.data = torch_npu.npu_format_cast(layer.weight.data,
ACL_FORMAT_FRACTAL_NZ)
if is_enable_nz():
layer.weight.data = torch_npu.npu_format_cast(
layer.weight.data, ACL_FORMAT_FRACTAL_NZ)
layer.weight_scale.data = torch.flatten(layer.weight_scale.data)
layer.weight_offset.data = torch.flatten(layer.weight_offset.data)
@ -340,7 +341,7 @@ class AscendW8A8FusedMoEMethod:
# converting ACL_FORMAT_FRACTAL_NZ.
# npu_quant_grouped_matmul_dequant in eager mode does not accept
# ACL_FORMAT_FRACTAL_NZ.
if not is_310p():
if not is_310p() and is_enable_nz():
layer.w13_weight.data = torch_npu.npu_format_cast(
layer.w13_weight.data, ACL_FORMAT_FRACTAL_NZ).contiguous()
layer.w2_weight.data = torch_npu.npu_format_cast(

View File

@ -26,7 +26,7 @@ from vllm.forward_context import get_forward_context
from vllm_ascend.ascend_config import get_ascend_config
from vllm_ascend.distributed.parallel_state import get_mc2_group
from vllm_ascend.ops.moe.experts_selector import select_experts
from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ
from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ, is_enable_nz
class AscendW8A8DynamicLinearMethod:
@ -101,8 +101,9 @@ class AscendW8A8DynamicLinearMethod:
if self.transpose_weight:
layer.weight.data = layer.weight.data.transpose(0, 1).contiguous()
# cast quantized weight tensors in NZ format for higher inference speed
layer.weight.data = torch_npu.npu_format_cast(layer.weight.data,
ACL_FORMAT_FRACTAL_NZ)
if is_enable_nz():
layer.weight.data = torch_npu.npu_format_cast(
layer.weight.data, ACL_FORMAT_FRACTAL_NZ)
layer.weight_scale.data = layer.weight_scale.data.flatten()
layer.weight_scale_fp32 = layer.weight_scale.data.to(torch.float32)
layer.weight_offset.data = layer.weight_offset.data.flatten()
@ -267,8 +268,9 @@ class AscendW8A8DynamicFusedMoEMethod:
1, 2).contiguous()
layer.w2_weight.data = layer.w2_weight.data.transpose(
1, 2).contiguous()
torch_npu.npu_format_cast_(layer.w13_weight, ACL_FORMAT_FRACTAL_NZ)
torch_npu.npu_format_cast_(layer.w2_weight, ACL_FORMAT_FRACTAL_NZ)
if is_enable_nz():
torch_npu.npu_format_cast_(layer.w13_weight, ACL_FORMAT_FRACTAL_NZ)
torch_npu.npu_format_cast_(layer.w2_weight, ACL_FORMAT_FRACTAL_NZ)
layer.w13_weight_scale.data = layer.w13_weight_scale.data.view(
layer.w13_weight_scale.data.shape[0], -1)
layer.w13_weight_scale_fp32 = layer.w13_weight_scale.data.to(

View File

@ -29,6 +29,7 @@ from vllm_ascend.torchair.ops.torchair_fused_moe import torchair_select_experts
from vllm_ascend.torchair.utils import npu_stream_switch, npu_wait_tensor
from vllm_ascend.utils import (ACL_FORMAT_FRACTAL_NZ, AscendSocVersion,
dispose_tensor, get_ascend_soc_version,
is_enable_nz,
is_hierarchical_communication_enabled)
@ -829,7 +830,9 @@ class TorchairAscendW8A8DynamicLinearMethod:
if self.transpose_weight:
layer.weight.data = layer.weight.data.transpose(0, 1).contiguous()
# cast quantized weight tensors in NZ format (29) for higher inference speed
layer.weight.data = torch_npu.npu_format_cast(layer.weight.data, 29)
if is_enable_nz():
layer.weight.data = torch_npu.npu_format_cast(
layer.weight.data, 29)
layer.weight_scale.data = layer.weight_scale.data.flatten()
layer.weight_scale_fp32 = layer.weight_scale.data.to(torch.float32)
layer.weight_offset.data = layer.weight_offset.data.flatten()
@ -1048,7 +1051,9 @@ class TorchairAscendW8A8DynamicFusedMoEMethod:
1, 2).contiguous()
layer.w2_weight.data = layer.w2_weight.data.transpose(
1, 2).contiguous()
torch_npu.npu_format_cast_(layer.w2_weight, ACL_FORMAT_FRACTAL_NZ)
if is_enable_nz():
torch_npu.npu_format_cast_(layer.w13_weight, ACL_FORMAT_FRACTAL_NZ)
torch_npu.npu_format_cast_(layer.w2_weight, ACL_FORMAT_FRACTAL_NZ)
layer.w13_weight_scale.data = layer.w13_weight_scale.data.view(
layer.w13_weight_scale.data.shape[0], -1)
layer.w13_weight_scale_fp32 = layer.w13_weight_scale.data.to(

View File

@ -24,6 +24,7 @@ from vllm_ascend.attention.utils import (AscendCommonAttentionMetadata,
from vllm_ascend.multistream.base import MSAttentionMetadataSplitConfig
from vllm_ascend.multistream.ms_split import model_input_split_v1_mla_attn
from vllm_ascend.torchair.utils import TorchairCommonAttentionMetadata
from vllm_ascend.utils import is_enable_nz
from vllm_ascend.worker.npu_input_batch import InputBatch
if TYPE_CHECKING:
@ -841,7 +842,8 @@ class AscendSFATorchairImpl(MLAAttentionImpl):
wd_qkv = wd_qkv.t().contiguous()
wd_qkv = transdata(wd_qkv,
block_size=(16, 32)).unsqueeze(0).contiguous()
self.wd_qkv = torch_npu.npu_format_cast(wd_qkv, 29)
if is_enable_nz():
self.wd_qkv = torch_npu.npu_format_cast(wd_qkv, 29)
kv_a_proj_deq_scl = self.kv_a_proj_with_mqa.deq_scale.clone()
kv_a_proj_deq_scl = kv_a_proj_deq_scl.reshape(
@ -874,7 +876,8 @@ class AscendSFATorchairImpl(MLAAttentionImpl):
self.num_heads * (self.qk_nope_head_dim + self.qk_rope_head_dim),
-1)
wu_q = transdata(wu_q, block_size=(16, 32)).unsqueeze(0).contiguous()
self.wu_q = torch_npu.npu_format_cast(wu_q, 29)
if is_enable_nz():
self.wu_q = torch_npu.npu_format_cast(wu_q, 29)
qb_deq_scl = self.q_proj.deq_scale.data.clone()
qb_deq_scl = qb_deq_scl.reshape(

View File

@ -14,6 +14,7 @@ try:
except ImportError:
from torchair.ops import NpuStreamSwitch as _npu_stream_switch
from torchair.ops import npu_wait_tensor as _npu_wait_tensor
from vllm_ascend.utils import ACL_FORMAT_FRACTAL_NZ, is_enable_nz
KV_CACHE_BYTES_CACHE_PATH_NAME = ".kv_cache_bytes"
KV_CACHE_BYTES_CACHE_FILE_NAME = "kv_cache_bytes"
@ -141,6 +142,9 @@ def converting_weight_acl_format(model, format):
if isinstance(module, FusedMoE):
if torch_npu.get_npu_format(module.w13_weight.data) == format:
return
if format == ACL_FORMAT_FRACTAL_NZ \
and not is_enable_nz():
return
module.w13_weight.data = torch_npu.npu_format_cast(
module.w13_weight.data, format)
module.w2_weight.data = torch_npu.npu_format_cast(

View File

@ -65,6 +65,10 @@ def is_310p():
return _IS_310P
def is_enable_nz():
return envs_ascend.VLLM_ASCEND_ENABLE_NZ
def sleep_mode_enabled():
global _SLEEP_MODE_ENABLED
if _SLEEP_MODE_ENABLED is None:
@ -508,6 +512,7 @@ def register_ascend_customop(vllm_config: Optional[VllmConfig] = None):
from vllm_ascend.ops.linear import (AscendColumnParallelLinear,
AscendMergedColumnParallelLinear,
AscendQKVParallelLinear,
AscendReplicatedLinear,
AscendRowParallelLinear)
from vllm_ascend.ops.rotary_embedding import (
AscendDeepseekScalingRotaryEmbedding, AscendRotaryEmbedding,
@ -526,6 +531,7 @@ def register_ascend_customop(vllm_config: Optional[VllmConfig] = None):
"YaRNScalingRotaryEmbedding": AscendYaRNRotaryEmbedding,
"MergedColumnParallelLinear": AscendMergedColumnParallelLinear,
"QKVParallelLinear": AscendQKVParallelLinear,
"ReplicatedLinear": AscendReplicatedLinear,
"DeepseekScalingRotaryEmbedding": AscendDeepseekScalingRotaryEmbedding,
"VocabParallelEmbedding": AscendVocabParallelEmbedding,
"ParallelLMHead": AscendParallelLMHead,

View File

@ -77,9 +77,11 @@ from vllm.v1.attention.backends.utils import (
from vllm.v1.cudagraph_dispatcher import CudagraphDispatcher
# yapf conflicts with isort for this block
# yapf: disable
from vllm.v1.kv_cache_interface import (AttentionSpec, FullAttentionSpec,
KVCacheConfig, KVCacheGroupSpec,
KVCacheSpec, MambaSpec,
from vllm.v1.kv_cache_interface import (AttentionSpec,
EncoderOnlyAttentionSpec,
FullAttentionSpec, KVCacheConfig,
KVCacheGroupSpec, KVCacheSpec,
MambaSpec, MLAAttentionSpec,
UniformTypeKVCacheSpecs)
# yapf: enable
from vllm.v1.outputs import (EMPTY_MODEL_RUNNER_OUTPUT, AsyncModelRunnerOutput,
@ -97,6 +99,7 @@ from vllm.v1.worker.utils import (AttentionGroup, bind_kv_cache,
sanity_check_mm_encoder_outputs,
scatter_mm_placeholders)
import vllm_ascend.envs as envs_ascend
from vllm_ascend.ascend_config import get_ascend_config
from vllm_ascend.ascend_forward_context import (MoECommType,
set_ascend_forward_context)
@ -125,7 +128,7 @@ from vllm_ascend.spec_decode.interface import SpecDcodeType
from vllm_ascend.spec_decode.mtp_proposer import MtpProposer
from vllm_ascend.utils import (ACL_FORMAT_FRACTAL_ND, ACL_FORMAT_FRACTAL_NZ,
AscendSocVersion, ProfileExecuteDuration,
get_ascend_soc_version, is_310p,
get_ascend_soc_version, is_310p, is_enable_nz,
lmhead_tp_enable)
from vllm_ascend.worker.npu_input_batch import CachedRequestState, InputBatch
@ -137,8 +140,6 @@ else:
import torch_npu
import vllm_ascend.envs as envs_ascend
# if true, allow tensor initialization and casting with internal format (e.g., NZ)
torch.npu.config.allow_internal_format = True
@ -867,8 +868,11 @@ class NPUModelRunner(LoRAModelRunnerMixin):
def _make_attention_mask(self, seq_lens, position,
attn_state) -> torch.Tensor:
# Pooling situation.
if self.model_config.runner_type == "pooling" and self.model_config.pooler_config.pooling_type == "CLS":
return self.attn_mask_builder.get_pooling_mask(self.device)
# Chunk Prefill situation.
if attn_state == AscendAttentionState.ChunkedPrefill and not self.vllm_config.model_config.use_mla and not self.ascend_config.use_sfa:
elif attn_state == AscendAttentionState.ChunkedPrefill and not self.vllm_config.model_config.use_mla and not self.ascend_config.use_sfa:
if torch.version.cann.startswith("8.3"):
return self.attn_mask_builder.get_splitfuse_attn_mask()
else:
@ -1426,14 +1430,29 @@ class NPUModelRunner(LoRAModelRunnerMixin):
# in the same group share the same metadata.
for kv_cache_group_id, kv_cache_group_spec in enumerate(
self.kv_cache_config.kv_cache_groups):
blk_table = self.input_batch.block_table[kv_cache_group_id]
blk_table_tensor = blk_table.get_device_tensor()
slot_mapping = blk_table.slot_mapping_cpu[:
total_num_scheduled_tokens]
self.slot_mapping[:total_num_scheduled_tokens].copy_(
slot_mapping[:total_num_scheduled_tokens],
non_blocking=True,
)
if isinstance(kv_cache_group_spec.kv_cache_spec,
EncoderOnlyAttentionSpec):
# Encoder-only layers do not have KV cache, so we need to
# create a dummy block table and slot mapping for them.
blk_table_tensor = torch.zeros(
(num_reqs, 1),
dtype=torch.int32,
device=self.device,
)
slot_mapping = torch.zeros(
(total_num_scheduled_tokens, ),
dtype=torch.int64,
device=self.device,
)
else:
blk_table = self.input_batch.block_table[kv_cache_group_id]
blk_table_tensor = blk_table.get_device_tensor()
slot_mapping = blk_table.slot_mapping_cpu[:
total_num_scheduled_tokens]
self.slot_mapping[:total_num_scheduled_tokens].copy_(
slot_mapping[:total_num_scheduled_tokens],
non_blocking=True,
)
# Make AscendCommonAttentionMetadata
common_attn_metadata = AscendCommonAttentionMetadata(
@ -1469,7 +1488,8 @@ class NPUModelRunner(LoRAModelRunnerMixin):
common_prefix_len = 0
extra_attn_metadata_args = {}
builder = attn_group.get_metadata_builder()
if isinstance(builder, GDNAttentionMetadataBuilder):
if isinstance(builder, GDNAttentionMetadataBuilder
) or self.model_config.runner_type == "pooling":
if use_spec_decode:
extra_attn_metadata_args = dict(
num_accepted_tokens=self.num_accepted_tokens.
@ -1520,13 +1540,14 @@ class NPUModelRunner(LoRAModelRunnerMixin):
forward_context = get_forward_context()
if forward_context.cudagraph_runtime_mode == CUDAGraphMode.FULL:
# TODO: maybe_padded_num_tokens will be removed, use num_input_tokens instead
if self.vllm_config.model_config.use_mla:
# FIXME: Try using `auto_dispatch_capture=True`
update_mla_attn_params(self.update_stream, forward_context,
positions.shape[0])
maybe_padded_num_tokens)
else:
update_attn_params(self.update_stream, forward_context,
positions.shape[0])
maybe_padded_num_tokens)
if get_forward_context().sp_enabled:
hidden_states = tensor_model_parallel_all_gather(hidden_states, 0)
@ -2609,6 +2630,9 @@ class NPUModelRunner(LoRAModelRunnerMixin):
runtime_mode=CUDAGraphMode.FULL)
def _convert_torch_format(self, tensor):
if ACL_FORMAT == ACL_FORMAT_FRACTAL_NZ \
and not is_enable_nz():
return tensor
tensor = torch_npu.npu_format_cast(tensor, ACL_FORMAT)
return tensor
@ -2621,7 +2645,6 @@ class NPUModelRunner(LoRAModelRunnerMixin):
"""
kv_cache_config = deepcopy(kv_cache_config)
self.kv_cache_config = kv_cache_config
self.initialize_attn_backend(kv_cache_config)
self.use_hybrid_blocks = (len(self.attn_groups) > 1)
# NOTE: Currently, we determine whether we need `num_accepted_tokens` through `MambaSpec`.
self.need_accepted_tokens = any([
@ -2630,6 +2653,8 @@ class NPUModelRunner(LoRAModelRunnerMixin):
])
self.may_reinitialize_input_batch(kv_cache_config)
self.may_add_encoder_only_layers_to_kv_cache_config()
self.initialize_attn_backend(kv_cache_config)
if self.ascend_config.is_deepseek_sfa:
kv_caches = self.initialize_kv_cache_tensors_deepseek_sfa(
@ -3085,6 +3110,31 @@ class NPUModelRunner(LoRAModelRunnerMixin):
kernel_block_sizes=kernel_block_sizes,
)
def may_add_encoder_only_layers_to_kv_cache_config(self) -> None:
"""
Add encoder-only layers to the KV cache config.
"""
block_size = self.vllm_config.cache_config.block_size
encoder_only_attn_specs: dict[AttentionSpec,
list[str]] = defaultdict(list)
attn_layers = get_layers_from_vllm_config(self.vllm_config, Attention)
for layer_name, attn_module in attn_layers.items():
if attn_module.attn_type == AttentionType.ENCODER_ONLY:
attn_spec: AttentionSpec = EncoderOnlyAttentionSpec(
block_size=block_size,
num_kv_heads=attn_module.num_kv_heads,
head_size=attn_module.head_size,
dtype=self.kv_cache_dtype)
encoder_only_attn_specs[attn_spec].append(layer_name)
self.runner_only_attn_layers.add(layer_name)
if len(encoder_only_attn_specs) > 0:
assert len(
encoder_only_attn_specs
) == 1, "Only support one encoder-only attention spec now"
spec, layer_names = encoder_only_attn_specs.popitem()
self.kv_cache_config.kv_cache_groups.append(
KVCacheGroupSpec(layer_names=layer_names, kv_cache_spec=spec))
def initialize_attn_backend(self, kv_cache_config: KVCacheConfig) -> None:
"""
Initialize the attention backends and attention metadata builders.
@ -3218,13 +3268,21 @@ class NPUModelRunner(LoRAModelRunnerMixin):
# TODO(lucas): move the attention specs into the model layers like
# the attention backends
if attn_module.attn_type == AttentionType.DECODER:
kv_cache_spec[layer_name] = FullAttentionSpec(
block_size=block_size,
num_kv_heads=attn_module.num_kv_heads,
head_size=attn_module.head_size,
dtype=self.kv_cache_dtype,
use_mla=use_mla,
use_sfa=use_sfa)
if use_mla and not use_sfa:
kv_cache_spec[layer_name] = MLAAttentionSpec(
block_size=block_size,
num_kv_heads=attn_module.num_kv_heads,
head_size=attn_module.head_size,
dtype=self.kv_cache_dtype,
cache_dtype_str=self.cache_config.cache_dtype)
else:
# TODO(cmq): This is a hacky way to fix the deepseek kvcache when
# using DSA. Fixing the spec in vLLM is the final solution.
kv_cache_spec[layer_name] = FullAttentionSpec(
block_size=block_size,
num_kv_heads=attn_module.num_kv_heads,
head_size=attn_module.head_size,
dtype=self.kv_cache_dtype)
elif attn_module.attn_type in (AttentionType.ENCODER,
AttentionType.ENCODER_ONLY):
# encoder-only attention does not need KV cache.