Mirror of https://github.com/vllm-project/vllm-ascend.git (synced 2025-10-20 21:53:54 +08:00)
[Fix] Clear unused slot mappings and fix accuracy issue with MLA models when enabling FULL_DECODE_ONLY (#3482)
### What this PR does / why we need it?
MLA and GQA use different computation logic: MLA slices batches and computes only on the actually valid tokens. That means outer padding must be handled carefully. The accuracy issue this PR fixes was caused by stale data in `slot_mapping` being reused by subsequent inference steps, so we now zero out the portion of the slot mapping tensor that is not used by the currently scheduled tokens.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Working on it.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
```diff
@@ -1462,6 +1462,7 @@ class NPUModelRunner(LoRAModelRunnerMixin):
             slot_mapping[:total_num_scheduled_tokens],
             non_blocking=True,
         )
+        self.slot_mapping[total_num_scheduled_tokens:].fill_(0)

         # Make AscendCommonAttentionMetadata
         common_attn_metadata = AscendCommonAttentionMetadata(
```
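The hunk above adds the single clearing line. Below is a minimal, standalone sketch of why it matters: when a persistent `slot_mapping` buffer is only partially overwritten each step, the unused tail can still hold slot indices from an earlier, larger batch. The buffer size, slot values, and step counts here are illustrative only and are not taken from the actual runner.

```python
# Hypothetical illustration of the stale slot_mapping problem this PR fixes.
# Tensor names and sizes are made up for the example.
import torch

max_num_tokens = 8                                   # persistent buffer capacity
slot_mapping = torch.empty(max_num_tokens, dtype=torch.int64)

# Step 1: six tokens scheduled -> entries 0..5 are written.
step1_slots = torch.tensor([10, 11, 12, 13, 14, 15])
slot_mapping[: step1_slots.numel()].copy_(step1_slots, non_blocking=True)

# Step 2: only three tokens scheduled. Without clearing, entries 3..5 still
# hold slots 13..15 from step 1; any kernel that reads the full padded buffer
# would then write KV data into those stale slots.
step2_slots = torch.tensor([20, 21, 22])
slot_mapping[: step2_slots.numel()].copy_(step2_slots, non_blocking=True)

# The fix applied in the diff above: zero the unused tail so padded positions
# no longer carry slot indices left over from earlier steps.
slot_mapping[step2_slots.numel():].fill_(0)
print(slot_mapping)  # tensor([20, 21, 22, 0, 0, 0, 0, 0])
```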