Mirror of https://github.com/vllm-project/vllm-ascend.git (synced 2025-10-20 21:53:54 +08:00)
[Fix] Clear unused slot mappings and fix accuracy issue with MLA models when enabling FULL_DECODE_ONLY (#3482)
### What this PR does / why we need it?
MLA and GQA use different computation logic: MLA slices batches and computes only on the actually valid tokens. That means outer padding must be handled carefully. The accuracy issue this PR fixes was caused by stale data in `slot_mapping` being reused by subsequent inference steps, so we now zero out the portion of the slot mapping tensor that is not used by the currently scheduled tokens.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Working on it.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
```diff
@@ -1462,6 +1462,7 @@ class NPUModelRunner(LoRAModelRunnerMixin):
             slot_mapping[:total_num_scheduled_tokens],
             non_blocking=True,
         )
+        self.slot_mapping[total_num_scheduled_tokens:].fill_(0)

         # Make AscendCommonAttentionMetadata
         common_attn_metadata = AscendCommonAttentionMetadata(
```
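The hunk above adds the single clearing line. Below is a minimal, standalone sketch of why it matters: when a persistent `slot_mapping` buffer is only partially overwritten each step, the unused tail can still hold slot indices from an earlier, larger batch. The buffer size, slot values, and step counts here are illustrative only and are not taken from the actual runner.

```python
# Hypothetical illustration of the stale slot_mapping problem this PR fixes.
# Tensor names and sizes are made up for the example.
import torch

max_num_tokens = 8                                   # persistent buffer capacity
slot_mapping = torch.empty(max_num_tokens, dtype=torch.int64)

# Step 1: six tokens scheduled -> entries 0..5 are written.
step1_slots = torch.tensor([10, 11, 12, 13, 14, 15])
slot_mapping[: step1_slots.numel()].copy_(step1_slots, non_blocking=True)

# Step 2: only three tokens scheduled. Without clearing, entries 3..5 still
# hold slots 13..15 from step 1; any kernel that reads the full padded buffer
# would then write KV data into those stale slots.
step2_slots = torch.tensor([20, 21, 22])
slot_mapping[: step2_slots.numel()].copy_(step2_slots, non_blocking=True)

# The fix applied in the diff above: zero the unused tail so padded positions
# no longer carry slot indices left over from earlier steps.
slot_mapping[step2_slots.numel():].fill_(0)
print(slot_mapping)  # tensor([20, 21, 22, 0, 0, 0, 0, 0])
```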