[Fix] Clear unused slot mappings and fix accuracy issue with MLA models when enabling FULL_DECODE_ONLY (#3482)

### What this PR does / why we need it?
MLA and GQA use different computation logic: MLA slices the batch and computes
only on the actually valid tokens. That means padding outside the valid range
must be handled carefully; the accuracy issue this PR fixes was caused by stale
entries in `slot_mapping` being reused by subsequent inference steps.

So we zero out the portion of the slot mapping tensor that is not covered by
the currently scheduled tokens.
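A minimal, self-contained sketch of the failure mode and the fix. The buffer size and slot values below are illustrative assumptions, not taken from the actual runner; only the slicing-and-clearing pattern mirrors the change in this PR.

```python
import torch

max_num_tokens = 8
# Persistent buffer reused across scheduler steps, as in the model runner.
slot_mapping = torch.zeros(max_num_tokens, dtype=torch.int64)

# Step 1: 6 tokens scheduled; their slots fill the first 6 entries.
step1_slots = torch.tensor([10, 11, 12, 13, 14, 15])
slot_mapping[: step1_slots.numel()].copy_(step1_slots)

# Step 2: only 3 tokens scheduled; the first 3 entries are overwritten...
step2_slots = torch.tensor([20, 21, 22])
total_num_scheduled_tokens = step2_slots.numel()
slot_mapping[:total_num_scheduled_tokens].copy_(step2_slots)

# ...but entries 3..5 still hold step-1 slots (13, 14, 15). If a padded
# batch reaches the attention kernel, those stale slots can scatter data
# into the wrong KV-cache blocks and corrupt later decode steps.
print(slot_mapping.tolist())  # [20, 21, 22, 13, 14, 15, 0, 0]

# The fix: clear everything past the tokens scheduled in this step.
slot_mapping[total_num_scheduled_tokens:].fill_(0)
print(slot_mapping.tolist())  # [20, 21, 22, 0, 0, 0, 0, 0]
```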

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Working on it.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Author: Yizhou
Date: 2025-10-16 19:43:09 +08:00 (committed by GitHub)
Commit: ccb6fb9ec1 (parent f9535cc9e2)

@@ -1462,6 +1462,7 @@ class NPUModelRunner(LoRAModelRunnerMixin):
             slot_mapping[:total_num_scheduled_tokens],
             non_blocking=True,
         )
+        self.slot_mapping[total_num_scheduled_tokens:].fill_(0)
         # Make AscendCommonAttentionMetadata
         common_attn_metadata = AscendCommonAttentionMetadata(