[Bugfix] Fix async copy bug under single expert scenario (#3005)

Add the missing barrier for the case where no implicit synchronization via
`repeat_interleave` is available. Without it, the `non_blocking=True` copies
of `output_splits` and `input_splits` from the NPU may not have completed
before the subsequent `async_all_to_all` uses them.
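
For illustration, a minimal sketch of the ordering hazard. It assumes an Ascend environment where `torch_npu` is installed (so the `"npu"` device and `torch.npu.synchronize()` exist); the tensor names are hypothetical, not the actual dispatcher attributes:

```python
import torch
import torch_npu  # noqa: F401  (registers the NPU backend; assumed installed)

num_tokens_per_expert = torch.tensor([7], device="npu")  # illustrative name

# The D2H copy is only enqueued on the current stream; the CPU tensor is not
# guaranteed to hold valid data yet.
splits_cpu = num_tokens_per_expert.to("cpu", non_blocking=True)

num_local_experts = 1
if num_local_experts > 1:
    # repeat_interleave with a device-side `repeats` tensor must learn the
    # output size on the host, which implicitly drains the pending copy too.
    pass
else:
    # Single-expert path: nothing forces completion, so an explicit barrier is
    # needed before splits_cpu is read (e.g., passed to async_all_to_all).
    torch.npu.synchronize()

print(splits_cpu.tolist())  # now safe to consume on the host
```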

### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main: ef7eefe17a

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>

@@ -639,6 +639,10 @@ class TokenDispatcherWithAll2AllV(MoETokenDispatcher):
             self.global_input_tokens_local_experts_indices = torch.repeat_interleave(
                 self.expert_ids_per_ep_rank,
                 self.num_global_tokens_per_local_expert.ravel())
+        else:
+            # TODO: This full synchronization can be a performance bottleneck.
+            # A more granular sync (e.g., blocking D2H copies) should be investigated.
+            torch.npu.synchronize()
         return num_tokens_per_local_expert
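
As one possible direction for the TODO above, a hedged sketch of the more granular alternative (blocking only the two D2H copies instead of synchronizing the whole device). This is not part of the PR; it assumes `torch_npu` is installed, and the tensor names are illustrative:

```python
import torch
import torch_npu  # noqa: F401  (assumed installed for NPU tensors)

def fetch_splits_blocking(output_splits_npu: torch.Tensor,
                          input_splits_npu: torch.Tensor):
    # A blocking D2H copy waits for the producing stream to finish writing the
    # source tensor, so only these two transfers are stalled on rather than
    # every outstanding kernel on the device.
    output_splits = output_splits_npu.to("cpu", non_blocking=False)
    input_splits = input_splits_npu.to("cpu", non_blocking=False)
    return output_splits, input_splits
```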