1108 Commits

Author SHA1 Message Date
302494c1fe [EPLB] ut for EPLB (#3035)
## UT for EPLB

Co-authored-by: Skywalker-EP 173723846@qq.com
Co-authored-by: offline 0806@qq.com
Co-authored-by: dsxsteven@sina.com

## UT Description

### 1. Module Description
- Module: EPLB

### 2. Covered Source Files
- vllm_ascend/eplb/adaptor/abstract_adaptor.py
- vllm_ascend/eplb/core/eplb_device_transfer_loader.py
- vllm_ascend/eplb/core/eplb_utils.py
- vllm_ascend/eplb/core/policy/policy_abstract.py
- vllm_ascend/eplb/core/policy/policy_dynamic_ep.py
- vllm_ascend/eplb/core/policy/policy_dynamic_ep_v2.py
- vllm_ascend/eplb/core/policy/policy_factory.py

### 3. Testing Method
- Framework: pytest
- Test Data: mock data
- Test Type: unit test

### 4. Coverage
- Statement Coverage: 90%


- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: tanqingshan (A)  <50050625@china.huawei.com>
Signed-off-by: tanqingshan <50050625@china.huawei.com>
Signed-off-by: daishixun <dsxsteven@sina.com>
Co-authored-by: tanqingshan (A) <t50050625@china.huawei.com>
Co-authored-by: tanqingshan <50050625@china.huawei.com>
Co-authored-by: daishixun <dsxsteven@sina.com>
Co-authored-by: dsxsteven <36877507+dsxsteven@users.noreply.github.com>
2025-09-24 17:14:38 +08:00
80524f5711 [CORE] concurrent partial prefills (#2372)
# What this PR does / why we need it?

When processing a mix of large and small requests, the TTFT of responses
is significantly reduced. Please refer to
https://github.com/vllm-project/vllm/pull/10235, which achieves the same
effect by simply limiting the number of concurrent prefills for long
requests. This solution can be applied to both AscendScheduler (V0) and
the vLLM Scheduler (V1). Tests show that TTFT improves significantly when
handling such mixed requests; however, this capability is currently
missing when AscendScheduler is enabled.

This benchmark used the Qwen3-8B model, with a context length of 128K,
running on a single card.

Regarding dataset selection, the sharegpt_clean dataset is used, with
its content concatenated and cropped. Small requests (token=50) and
medium requests (token=10240) were constructed; there were also large
requests (token=102400), but these were ignored because, when using the
Prefill First scheduling strategy, max_num_batched_tokens will not be
set to such a large value. When loading vLLM, max_num_batched_tokens is
set to 22000. This length can accommodate two medium-sized requests plus
some short requests, reflecting an extreme scenario where the budget is
almost entirely occupied by longer requests.

Next, we mixed 990 small requests and 100 medium requests into one load
scenario (hereinafter referred to as 10%), and similarly generated load
scenarios with 5% and 1% medium requests.

Performance tests were conducted separately with the vLLM Scheduler,
AscendScheduler, and AscendScheduler with long-prompt concurrency set to 1
(a hypothetical launch sketch follows below).
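
A rough reproduction sketch of the benchmark setup, assuming the upstream partial-prefill knobs from vllm#10235 (`max_num_partial_prefills`, `max_long_partial_prefills`, `long_prefill_token_threshold`) are exposed on this branch; names and availability may differ:

```python
# Hypothetical benchmark setup sketch; the three partial-prefill knobs are the
# upstream vLLM ones from vllm#10235 and are assumed to be available here.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-8B",
    max_model_len=131072,                # 128K context
    max_num_batched_tokens=22000,        # fits two medium requests plus short ones
    max_num_partial_prefills=2,          # allow concurrent partial prefills
    max_long_partial_prefills=1,         # "long prompt concurrency set to 1"
    long_prefill_token_threshold=10240,  # boundary between short and long prompts
)
```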

- vLLM version: v0.10.2
- vLLM main:
1dfea5f4a9

---------

Signed-off-by: Csrayz <jover@cmbchina.com>
2025-09-24 17:12:55 +08:00
2d885869c5 [KVCache][Bugfix] Fix kv cache initialization error of attention layer (#3113)
### What this PR does / why we need it?
Fixes #3096 
1. Fix the kv cache initialization error of the attention layer. Some
models name their attention layers like `attn.attn` instead of
`self_attn`, and the kv cache tensor initialization did not account for
this, leading to the error `AssertionError: Some layers are
not correctly initialized`.
2. Set a default value for the `sampling_metadata` argument of
`compute_logits` in the vllm-ascend modeling files, fixing the error
`Qwen3NextForCausalLM.compute_logits() missing 1 required
positional argument: 'sampling_metadata'` (a minimal sketch follows below).
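
A minimal, self-contained sketch of the signature change in item 2; the class below is a stand-in, not the actual vllm-ascend modeling code:

```python
# Stand-in sketch of the `compute_logits` signature change; only the default
# value for `sampling_metadata` matters here.
from typing import Any, Optional

import torch
import torch.nn as nn


class TinyCausalLM(nn.Module):
    def __init__(self, hidden_size: int = 8, vocab_size: int = 16) -> None:
        super().__init__()
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def compute_logits(
        self,
        hidden_states: torch.Tensor,
        sampling_metadata: Optional[Any] = None,  # default avoids the TypeError
    ) -> torch.Tensor:
        # `sampling_metadata` is accepted for interface compatibility but unused.
        return self.lm_head(hidden_states)


# Callers that no longer pass sampling_metadata now work:
logits = TinyCausalLM().compute_logits(torch.randn(2, 8))
```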

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
test locally with internlm


- vLLM version: v0.10.2
- vLLM main:
5aeb925452

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-24 11:32:34 +08:00
6aa4253798 [Refactor][SP] The sequence parallelism characteristics in the MoE and Dense models are integrated into a single solution. (#3085)
### What this PR does / why we need it?

There are two sets of sequence-parallelism (SP) implementations for the MoE
and dense models: one is called sequence_parallelism, and the other is
flashcomm_v1. We did the following:

- Merged the two sets of code with the same implementation into one.
- Removed the sequence_parallelism implementation, as that solution cannot
support aclgraph.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

e2e & ut

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
2025-09-24 11:29:59 +08:00
e7618d9414 [2/N][Refactor][Qwen3-Next] remove redundant methods and patch methods in Qwen3NextGatedDeltaNet (#3082)
### What this PR does / why we need it?
remove redundant methods and patch methods in Qwen3NextGatedDeltaNet,
involving causal_conv1d_fn, causal_conv1d_update_npu, fused_gdn_gating,
fused_recurrent_gated_delta_rule, torch_chunk_gated_delta_rule, and
RMSNormGated

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
```
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        model="Qwen/Qwen3-Next-80B-A3B-Instruct",
        tensor_parallel_size=4,
        enforce_eager=True,
        trust_remote_code=True,
        max_model_len=256,
        gpu_memory_utilization=0.7,
        block_size=64,
    )
    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
```

CI passed with new added/existing test.


- vLLM version: v0.10.2
- vLLM main:
5aeb925452

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-24 11:25:42 +08:00
eb205d9f35 [P/D][BugFix]Mooncake timeout release bug fix (#2899)
### What this PR does / why we need it?
In the P-node timeout release mechanism for PD disaggregation, the req_id
that requires timeout release is transmitted from the scheduler to the
worker. If the KV cache between P and D is transferred quickly enough, the
P node's req_id may be released twice: the first release happens when the
D node notifies the P node that the KV cache has been pulled, and the
second when the scheduler transmits the timeout release to the worker.

To address this bug, an intermediate component is introduced to manage
the release of req_ids (a hypothetical sketch follows below).

pull_kv and forward2 may occur in either order; the previous timeout logic
assumed forward2 always happened before pull_kv.
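
A hypothetical sketch of such an intermediate component (all names are assumptions, not the actual Mooncake connector code): whichever release signal arrives first wins, and the second becomes a no-op.

```python
# Hypothetical release manager: the first signal (pull_kv notification or
# scheduler timeout) frees the KV cache, the second one is ignored.
import threading
from typing import Callable


class KVReleaseManager:
    def __init__(self, free_kv_cache: Callable[[str], None]) -> None:
        self._free_kv_cache = free_kv_cache   # callback that actually frees blocks
        self._released: set[str] = set()
        self._lock = threading.Lock()

    def release(self, req_id: str, reason: str) -> bool:
        with self._lock:
            if req_id in self._released:
                return False                  # already freed by the other path
            self._released.add(req_id)
        self._free_kv_cache(req_id)
        return True


manager = KVReleaseManager(free_kv_cache=lambda rid: print(f"freed {rid}"))
manager.release("req-0", reason="pull_kv notification")  # frees the cache
manager.release("req-0", reason="scheduler timeout")     # ignored, no double free
```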


### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: baxingpiaochong <771405853@qq.com>
2025-09-24 11:22:46 +08:00
6995a7bc5b [Disagg][Perf] Use NPU event sync instead of blocking tolist to avoid unintentional copy ops blocking across different NPU streams, improving disagg TTIT/TTFT (#2788)
### What this PR does / why we need it?
When copying the sampled valid token ids from device to host, avoid using
`tolist`, which triggers a device-wide stream sync if the source tensor is
on the device. We change it to a non-blocking copy followed by an explicit
NPU event sync (a minimal sketch follows below).
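
A minimal sketch of the pattern (assumed names, not the actual model-runner code): stage the tensor into pinned host memory with a non-blocking copy and wait only on an event recorded for that copy.

```python
# Sketch only: non-blocking D2H copy plus an explicit event sync, instead of
# .tolist(), which forces a device-wide stream synchronization.
import torch


def sampled_ids_to_host(sampled_token_ids: torch.Tensor) -> torch.Tensor:
    # Stage into pinned host memory so the async copy can overlap with other work.
    host_buf = torch.empty(
        sampled_token_ids.shape,
        dtype=sampled_token_ids.dtype,
        device="cpu",
        pin_memory=True,
    )
    host_buf.copy_(sampled_token_ids, non_blocking=True)
    # Wait only for this copy via an event recorded on the current stream.
    event = torch.npu.Event() if hasattr(torch, "npu") else torch.cuda.Event()
    event.record()
    event.synchronize()
    return host_buf
```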

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Bring up vLLM server
```bash
VLLM_USE_V1=1 vllm serve Qwen/Qwen2.5-14B-Instruct --disable-log-requests \
  -tp 8 --max-num-seqs 64 --no-enable-prefix-caching --max_num_batched_tokens=8000
```
## Before:

![76218085a0cde9b2a73214e35fb7fc08](https://github.com/user-attachments/assets/38cbd02d-d380-47f8-a111-4bd859102eb1)
## After

![6c2111136673332244d3ce11060f4048](https://github.com/user-attachments/assets/957f9bf1-ec50-4f49-9318-f4876b3e3691)

As shown in the figures, TTFT decreased.


- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: jesse <szxfml@gmail.com>
2025-09-24 11:21:58 +08:00
c4b976af1a [Model][VLM][Patch]Modify ascend affinity _merge_multimodal_embeddings (#3071)
### What this PR does / why we need it?

This PR aims to address the incompatibility of the `.masked_scatter_`
operation in the current `_merge_multimodal_embeddings` function on
Ascend. For now, it reverts to the previous version of the CPU
operation, which can be executed asynchronously on the device side to
enhance performance.

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

---------

Signed-off-by: booker123456 <945658361@qq.com>
2025-09-24 10:25:28 +08:00
b1380f3b87 [Doc] modify the version compatibility between vllm and vllm-ascend (#3130)
### What this PR does / why we need it?
modify the version compatibility between vllm and vllm-ascend, the main
branch of vllm-ascend corresponds to the v0.10.2 tag of vllm.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
f225ea7dd9

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-09-23 20:31:49 +08:00
d01fd1d1c3 [misc][torchair] fix bugs around deepseek mtp, enable_shared_expert_dp and use_cached_kv_cache_bytes (#3074)
### What this PR does / why we need it?
This PR contains several small fixes:
1) fix initialization and forward bugs of DeepseekMTPLayer with
`shared_expert_dp` enabled.
2) fix a tensor shape mismatch after o_proj caused by a workaround
change in NPUModelRunner.
3) avoid an unnecessary decline of kv_cache memory (default: 64MB) with
`use_cached_kv_cache_bytes` disabled.
4) fall back `fused_moe_state` from `MC2` to `All2All`, since the padding
logic of `mc2_mask` is incompatible with the input hidden_states when
`shared_expert_dp` is enabled.

Once this PR is merged, users can launch disaggregated_prefill
deployments (large_ep) with `deepseek_mtp` and `shared_expert_dp`, as on
the `v0.9.1-dev` branch. The remaining problem of kv_cache tokens
declining compared to `v0.9.1-dev` will be resolved by
https://github.com/vllm-project/vllm-ascend/pull/3073.
 
### Does this PR introduce _any_ user-facing change?

No.
### How was this patch tested?
E2E vLLM serving of deepseek_mtp with torchair graph mode, and of
`enable_shared_expert_dp` in eager mode. Large-EP deployments were also
tested with this PR.


- vLLM version: v0.10.2
- vLLM main:
5aeb925452

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-23 14:52:42 +08:00
0f3939e5a9 [Feature]cpu offload connector (#1659)
This PR implements a CPU offload connector to enable offloading the NPU
KV cache to host DRAM.

- vLLM version: v0.10.2
- vLLM main:
5aeb925452

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
Signed-off-by: AlvisGong <gwly0401@163.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
Co-authored-by: AlvisGong <gwly0401@163.com>
2025-09-23 14:25:05 +08:00
96eb1ed408 [CI] Bump vLLM commit hash to 0923(f225ea7) (#3110)
### What this PR does / why we need it?
Bump vLLM commit hash to
f225ea7dd9
### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
5aeb925452

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-23 14:13:25 +08:00
d586255678 fix wrong --num-gpus parameter requirements, and avoid ambiguity (#3116)
Fixes https://github.com/vllm-project/vllm-ascend/issues/3114.
- vLLM version: v0.10.2
- vLLM main:
5aeb925452

Signed-off-by: Jianwei Mao <maojianwei2012@126.com>
2025-09-23 11:58:44 +08:00
39a85c49fa [Refactor] Rename cudagraph_support to aclgraph_support (#3104)
### What this PR does / why we need it?
Updates the `cudagraph_support` attribute to `aclgraph_support` to use
terminology appropriate for the Ascend platform (ACL graphs instead of
CUDA graphs).

This change also explicitly disables graph support for the MLA attention
backend.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None needed.

- vLLM version: v0.10.2
- vLLM main:
5aeb925452

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-23 11:30:31 +08:00
d2399ab97b Fix VLLM_ASCEND_LLMDD_RPC_PORT renaming (#3108)
### What this PR does / why we need it?
This PR implements the renaming of the environment variable
VLLM_LLMDD_RPC_PORT to VLLM_ASCEND_LLMDD_RPC_PORT, as proposed and
tracked in
[#2450](https://github.com/vllm-project/vllm-ascend/pull/2450). The
renaming is intended to align the variable naming convention with other
Ascend-specific environment variables in the vllm-ascend codebase,
enhancing consistency and clarity for developers and users working with
Ascend-based deployments.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
CI passed with existing test.

- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

Signed-off-by: wyu0-0 <woshilynn@163.com>
2025-09-23 10:33:04 +08:00
29c173ab48 FlashLB algorithm (#3042)
## Purpose
This Pull Request enhances the EPLB (Expert Parallelism Load Balancing)
system by introducing a novel balancing algorithm: FlashLB.

## Motivation
1. The default algorithm adopts a two-stage greedy strategy: 
a. Replica allotment: Determine the number of expert replicas by
minimizing the maximum load per replica (Min Max Replica, MMR).
b. Replica placement: Distribute replicas across devices by repeatedly
assigning the heaviest replica to the least loaded device (Longest
Processing Time First, LPT).

However, this sequential process lacks inter-stage collaborative
optimization, often leading to suboptimal load balancing. For example,
in the simple case shown in the figure below: given 8 logical experts
with hotness values of 600, 560, 120, 120, 20, 10, 10, 10, and 2
replicas allocated per device across 8 devices, the EPLB algorithm
yields a maximum per-device hotness of 232, while our proposed FlashLB
algorithm can reduce this value to 205 (a minimal sketch of this MMR + LPT
baseline follows after this list).

2. The default algorithm relies on the averaged expert hotness over a
fixed time window for optimization. While this provides a coarse
approximation of the hotness distribution, it fails to capture
oscillatory deviations and temporal correlations of expert hotness
observed across iterations in real-world scenarios, limiting
optimization quality.

3. The default algorithm periodically regenerates the expert placement
table. However, it generates the table for each individual layer, and
the new table does not account for correlations with the previous one;
these two factors collectively lead to nearly full-scale expert
reassignment.
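
For illustration, a minimal sketch of the two-stage greedy baseline (MMR allotment followed by LPT placement) on the 8-expert example above; names and the exact allotment rule are assumptions, and the per-device replica capacity enforced by the real algorithm is omitted:

```python
import heapq


def mmr_allot(hotness: list[float], total_replicas: int) -> list[int]:
    # Min-Max Replica: repeatedly give an extra replica to the expert whose
    # per-replica load is currently the largest.
    counts = [1] * len(hotness)
    for _ in range(total_replicas - len(hotness)):
        worst = max(range(len(hotness)), key=lambda i: hotness[i] / counts[i])
        counts[worst] += 1
    return counts


def lpt_place(hotness: list[float], counts: list[int], num_devices: int):
    # Longest Processing Time First: assign the heaviest remaining replica to
    # the currently least-loaded device (capacity constraints omitted here).
    replicas = sorted(
        ((hotness[e] / counts[e], e) for e in range(len(hotness)) for _ in range(counts[e])),
        reverse=True,
    )
    devices = [(0.0, d, []) for d in range(num_devices)]
    heapq.heapify(devices)
    for load, expert in replicas:
        dev_load, dev, placed = heapq.heappop(devices)
        heapq.heappush(devices, (dev_load + load, dev, placed + [expert]))
    return sorted(devices, key=lambda x: x[1])


hotness = [600, 560, 120, 120, 20, 10, 10, 10]
counts = mmr_allot(hotness, total_replicas=16)  # 8 devices x 2 replica slots each
for dev_load, dev, experts in lpt_place(hotness, counts, num_devices=8):
    print(f"device {dev}: experts {experts}, hotness {dev_load:.0f}")
```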

## FlashLB Algorithm Principle
1. Joint Optimization
FlashLB achieves joint optimization of replica allotment and placement
through group-based decision-making. Each group gradually determines the
replica count and placement for a subset of experts, ensuring that the
expected inter-device load balance (considering both deployed and
pending expert replicas) is holistically optimized. To attain superior
load balancing, FlashLB employs tree search to expand the solution space
while integrating pruning and precompilation techniques for
acceleration, thereby delivering load balancing that is both
high-quality and practically efficient.

2. Multi-Shot Enhancement
FlashLB partitions each profiling interval (e.g., 1024 iterations) into
consecutive smaller sub-intervals (e.g., 16 iterations), each capturing
independent hotness measurements. It then performs multi-shot
optimization to co-optimize these sub-intervals simultaneously—enabling
adaptation to time-variant expert hotness while enhancing robustness.

3. Incremental Adjustment
To reduce the overhead of frequent expert re-deployment, FlashLB
introduces an incremental adjustment scheme operating at both
inter-layer and intra-layer levels:
a. Inter-Layer: Hotness variations are tracked at the layer level. Only
layers with fluctuations exceeding a predefined threshold trigger
re-computation of expert placement, avoiding unnecessary redeployment
for stable layers;
b. Intra-Layer (Optional): A lightweight incremental LPT algorithm
(LPT-Incremental) is applied. Instead of recomputing full placement for
all experts in a layer, it selectively adjusts only the hottest experts
or those with replica count changes, further reducing migration
overhead.

This incremental strategy significantly reduces adjustment costs while
maintaining balanced performance across layers and devices.

## Co-author:

Co-authored-by: Skywalker-EP 173723846@qq.com

- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: tangtianyi <tangtianyi4@huawei.com>
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Signed-off-by: fems14 <1804143737@qq.com>
Co-authored-by: sdmyzlp <117554856+sdmyzlp@users.noreply.github.com>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: zhanghw0354 <zhanghaiwencmss@139.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhangxinyuehfad <59153331+zhangxinyuehfad@users.noreply.github.com>
Co-authored-by: Lucas Kabela <lucasakabela@gmail.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Icey <1790571317@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: Angazenn <supperccell@163.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: rjg-lyh <83491835+rjg-lyh@users.noreply.github.com>
Co-authored-by: weichen <132029610+Pr0Wh1teGivee@users.noreply.github.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
Co-authored-by: fems14 <74094523+fems14@users.noreply.github.com>
2025-09-23 10:27:14 +08:00
8dd53c8860 [Bugfix][PD] Auto-clear producer KV cache if no pull notification (#2174)
### What this PR does / why we need it?

This PR addresses a critical issue where Node D (decode) failures cause
Node P (prefill) to hang due to its inability to release the KV cache.

**Trigger Scenarios:**  
1. Node D fails mid-inference (e.g., network disconnection)  
2. Node D rejects requests at a certain stage (e.g., via API server)  
3. Load-test script termination causes Node P or D to abort queued
requests

**Root Cause Analysis:**  
1. Currently, Node D sends a "KV cache pull complete, release approved"
message to Node P
2. This message is transmitted via the worker connector. If PD
connection breaks or requests are rejected upstream, Node D cannot send
the message
3. Node P will never release KV cache without receiving this message  

**Solution:**  
Following the vLLM community's approach (the NIXL connector timeout
mechanism), we're implementing:
- A timeout mechanism with comprehensive warnings (a hypothetical sketch follows this list)
- Updated README documentation
- Reference: vLLM's optimization PR
[#20139](https://github.com/vllm-project/vllm/pull/20139)
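
A hypothetical sketch of the producer-side timeout idea (all names are assumptions, not the actual connector code):

```python
# Hypothetical producer-side timeout sketch: if no "KV pulled" notification
# arrives for a handed-off request, release its KV cache after a deadline.
import time
from typing import Callable


class ProducerKVTracker:
    def __init__(self, free_kv_cache: Callable[[str], None], timeout_s: float = 120.0) -> None:
        self._free = free_kv_cache
        self._timeout_s = timeout_s
        self._pending: dict[str, float] = {}   # req_id -> hand-off timestamp

    def on_prefill_done(self, req_id: str) -> None:
        self._pending[req_id] = time.monotonic()

    def on_pull_notification(self, req_id: str) -> None:
        if self._pending.pop(req_id, None) is not None:
            self._free(req_id)

    def reap_expired(self) -> None:
        now = time.monotonic()
        for req_id, started in list(self._pending.items()):
            if now - started > self._timeout_s:
                print(f"WARNING: no pull notification for {req_id}, releasing KV cache")
                del self._pending[req_id]
                self._free(req_id)
```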
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
None


- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: underfituu <hzhucong@163.com>
2025-09-23 09:53:34 +08:00
704467cd9a [Bugfix][LoRA] Fix bug introduced by upstream vllm#25249 (#3095)
### What this PR does / why we need it?
Fix the impact to LoRA that
https://github.com/vllm-project/vllm/pull/25249 brought.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
pytest -sv tests/e2e/singlecard/test_ilama_lora.py
pytest -sv tests/e2e/multicard/test_ilama_lora_tp2.py

- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: paulyu12 <507435917@qq.com>
2025-09-22 22:26:01 +08:00
3fa7cf6345 [Refactor][Graph] Move graph parameter logic to acl_graph module (#3101)
### What this PR does / why we need it?
This is the follow-up PR of #2128 .

Moves graph parameter management components, including `GraphParams`,
`get_graph_params`, and `set_graph_params`, from the generic `utils.py`
to the more specific `compilation/acl_graph.py`.

Additionally, extracts the `update_attn_params` logic from the
`NPUModelRunner` class into a standalone function within the `acl_graph`
module.

This refactoring improves code organization by centralizing ACL
graph-related logic into its own dedicated module, enhancing modularity
and clarity.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
None needed.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-22 22:23:14 +08:00
02f89d166f [CI] Update vllm version to 20250922(5aeb925) (#3091)
### What this PR does / why we need it?
This pr bump vllm commit hash to
5aeb925452
fix issues:
1. https://github.com/vllm-project/vllm/pull/25345 has removed v0
metadata
2. https://github.com/vllm-project/vllm/pull/25332
3. https://github.com/vllm-project/vllm/pull/25334
4. https://github.com/vllm-project/vllm/pull/23558: note that this vllm
commit updates the model registration logic, which checks that every
registered model has the `vllm.model_executor.models` path; this breaks
our custom registration of the deepseek_v3 model (it doesn't exist in the
vllm model path), so I moved the deepseek_v3 model registry into
deepseek_v2 as a temporary solution.

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-22 22:18:13 +08:00
1c9f0fe26f Fix of DeepSeek Error in KV Pool Mixed Deployment Scenario (#3087)
### What this PR does / why we need it?
A new kv_role, "kv_both", is added to run mixed deployment scenarios. A
mixed deployment also involves a decode phase, where with_prefill should
be false (a hedged launch example is sketched below).
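
A hedged example of what a mixed-deployment launch might look like; the connector name below is a placeholder assumption and must match the connector actually used:

```python
# Hedged example: kv_role="kv_both" on a single instance that serves both
# prefill and decode. The connector name is a placeholder assumption.
from vllm import LLM
from vllm.config import KVTransferConfig

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite",
    kv_transfer_config=KVTransferConfig(
        kv_connector="LLMDataDistCMgrConnector",  # placeholder connector name
        kv_role="kv_both",                        # new role for mixed deployment
    ),
)
```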

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

Signed-off-by: fems14 <1804143737@qq.com>
2025-09-22 20:36:41 +08:00
37a0715eda [Refactor] Adjustments to moe_comm_method selection process (#3001)
### What this PR does / why we need it?
Fix issues mentioned in
https://github.com/vllm-project/vllm-ascend/pull/2791 and some minor
refactoring.
1. Use Enum instead of string.
2. Avoid setting a new property to forward_context in
AscendFusedMoE.forward().
3. Enable TokenDispatcherWithMoge.
4. Remove redundant code.

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing:
1. Enable/Disable EP
2. Aclgraph & eager


- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
2025-09-22 19:12:58 +08:00
bb1f0d5a62 [main] remove the redundant log prints in register_custom_ops.py (#3094)
### What this PR does / why we need it?
This PR removed the redundant log prints in register_custom_ops.py, in
order to make output clear.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

Signed-off-by: rjg-lyh <1318825571@qq.com>
2025-09-22 17:17:31 +08:00
338231acaf [Feat][Graph] Support FULL_DECODE_ONLY mode for GQA/MHA models (#2128)
Note: This depends on [vLLM
#25161](https://github.com/vllm-project/vllm/pull/25161) and the
torch\_npu release from September 30.

### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA
models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution
graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Capturing the whole model as
one static graph also mitigates the dispatch fluctuations across
devices.
* **Stream/resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in
`torch_npu`, which can deadlock when synchronizing during graph replay —
we’re working on a fix.

There may be other corner cases. This PR is the first in a planned
series; we’ll continue to iterate and address remaining issues in
follow-ups.

This is essentially a port of #1503 and #1677, but includes two major
changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding
it in the backend, which decouples Full Graph and Piecewise Graph and
could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in
`update_graph_params`; multi-attention models may or may not be fully
supported yet.

### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```

### How was this patch tested?
Tests included.


- vLLM version: v0.10.2
- vLLM main:
9607d5eb44

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-09-22 17:14:28 +08:00
f39bd309b6 [Hybrid KV] Follow up UniformTypeKVCacheSpecs (#3070)
### What this PR does / why we need it?
Follow up the `UniformTypeKVCacheSpecs` changes introduced by
https://github.com/vllm-project/vllm/pull/25101, which support different
hidden sizes in uniform-type kvcache specs.

This also fixes the CI issue about `TypeError: AttentionGroup.__init__()
missing 1 required positional argument: 'kv_cache_spec'`

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Tests passed with existing e2e tests.

- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-22 15:02:41 +08:00
f1f2c8f5e5 [Perf] Add new npu_fused_infer_attention_score op to improve performance in splitfuse cases and resolve long-seq mask problems (#2962)
### What this PR does / why we need it?
Add a new npu_fused_infer_attention_score op to improve performance in
splitfuse cases and resolve long-seq mask problems.

1. The original op's performance is suboptimal in certain scenarios,
necessitating optimization through the _new op_
(npu_fused_infer_attention_score).
2. For ultra-long sequences (128k), the original operator allocates
a large attn_mask, which consumes excessive CPU memory. In contrast, the
_new op_ supports a fixed-size compressed mask, effectively resolving
this issue.

NOTE1: The current PR retains the original logic and uses a version
check of the CANN package to determine whether the _new op_ can be
enabled. This ensures no impact on existing users. In future versions,
this version check and the original logic will be deprecated, and the
_new op_ scheduling will be uniformly adopted.
NOTE2: This PR relies on a future CANN version, which is not available
yet.
NOTE3: To enable the new op in chunked prefill, additional_config should
at least be set like `--additional-config '{"ascend_scheduler_config":
{"enabled":true,"enable_chunked_prefill":true}}'` (an equivalent Python
example is sketched below).

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed




- vLLM version: v0.10.2
- vLLM main:
6c5f82e5aa

---------

Signed-off-by: tangtianyi <tangtianyi4@huawei.com>
Signed-off-by: Angazenn <supperccell@163.com>
Co-authored-by: Angazenn <supperccell@163.com>
2025-09-22 14:56:14 +08:00
c90a6d3658 [Test] Update the format of the accuracy report (#3081)
### What this PR does / why we need it?
Update the format of the accuracy report

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-22 14:10:03 +08:00
37a0b3f25e Bump actions/labeler from 5 to 6 (#3086)
Bumps [actions/labeler](https://github.com/actions/labeler) from 5 to 6.

- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-09-22 14:07:37 +08:00
ffdd1a36e2 [bugfix][torchair] fix wasted NPU memory buffer allocation for quantized deepseek with unquantized MTP layer (#3068)
### What this PR does / why we need it?
While running quantized deepseek models with an unquantized MTP layer,
free NPU memory abnormally decreases by `2*HCCL_BUFFSIZE` bytes. This
results from a wasted VRAM buffer allocation caused by calling
`dist.all_to_all_single` without the correct device process group argument
(a minimal sketch of the fix pattern follows below).
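
A minimal sketch of the fix pattern (function and argument names are assumptions): pass the intended process group explicitly so the collective does not create a redundant communicator on the default group.

```python
# Sketch of the fix pattern (names assumed): pass the intended process group
# explicitly so the collective does not lazily create a redundant communicator
# (and its 2 * HCCL_BUFFSIZE buffer) on the default group.
import torch
import torch.distributed as dist


def dispatch_tokens(tokens: torch.Tensor, ep_group: dist.ProcessGroup) -> torch.Tensor:
    output = torch.empty_like(tokens)
    # group=ep_group is the crucial argument; omitting it falls back to the
    # default (wrong) group and wastes NPU memory on an extra HCCL buffer.
    dist.all_to_all_single(output, tokens, group=ep_group)
    return output
```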

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
We ran vLLM online serving with quantized deepseek-r1 and an unquantized
MTP layer, and observed that free memory increased, without a redundant
VRAM buffer for the HCCL communication op (all_to_all_single).

- vLLM version: v0.10.2
- vLLM main:
6d8246aaff

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2025-09-22 14:06:43 +08:00
14b39d3c70 [1/N][Refactor][Qwen3-Next] remove redundant Qwen3NextSparseMoeBlock and Qwen3NextAttention (#3019)
### What this PR does / why we need it?
remove redundant Qwen3NextSparseMoeBlock and Qwen3NextAttention

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
```
from vllm import LLM, SamplingParams


def main():
    prompts = [
        "The future of AI is",
    ]

    sampling_params = SamplingParams(max_tokens=100, temperature=0.6, top_k=40, top_p=0.95)
    # Create an LLM.
    llm = LLM(
        # model="/root/.cache/modelscope/hub/models/Qwen/Qwen3-30B-A3B",
        model="Qwen/Qwen3-Next-80B-A3B-Instruct",
        tensor_parallel_size=4,
        enforce_eager=True,
        trust_remote_code=True,
        max_model_len=256,
        gpu_memory_utilization=0.7,
        block_size=64,
    )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()
```

- vLLM version: v0.10.2
- vLLM main:
9d1c50a5ac

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-22 11:24:08 +08:00
88d24cce8b [CI] Enable main based lint check and light ci matrix (#3079)
### What this PR does / why we need it?
Followup on https://github.com/vllm-project/vllm-ascend/pull/3064
1. Limit the vllm version for mypy to the same commit hash.
2. Fix the vllm version bug for the e2e light test.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
CI passed


- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-09-22 10:37:53 +08:00
693f547ccf Refactor ci to reuse base workflow and re-enable ut coverage (#3064)
### What this PR does / why we need it?
1. Refactor CI to reuse the base workflow and enable a 2-hourly trigger
job on main:
- Extract the e2e test into _e2e_test.yaml
- Reuse _e2e_test in the light / full jobs
- Enable the 2-hourly trigger job on main

2. Rename the e2e test to ascend test to make sure the action displays the label correctly
3. Re-enable ut coverage, which had been failing since
5bcb4c1528
and was disabled in
6d8bc38c7b

### Does this PR introduce _any_ user-facing change?
Only developer behavior changes:
- Every job triggers the full test with the vLLM release and hash
- The full job runs every 2 hours with vLLM main
- e2e light test (30 mins): `lint` (6 mins) ---> ut (10 mins) --->
`v0.10.2 + main / 4 jobs` (15 mins)
- e2e full test (1.5h): `ready label` ---> `v0.10.2 + main / 4 jobs`,
about 1.5h
- schedule test: every 2 hours ---> `v0.10.2 + main / 4 jobs`, about 1.5h

### How was this patch tested?
CI passed


- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-21 13:27:08 +08:00
b8b68b3dfe [CI] Upgrade vLLM to 20250920 (c60e613) and address config break (#3067)
### What this PR does / why we need it?
Bump main to
c60e6137f0

- Updated imports in `vllm.config` to
`vllm.config.model`(aed16879a9)
https://github.com/vllm-project/vllm/pull/25252

- Refactored `vllm_ascend/sample/sampler.py` to use string values for
`logprobs_mode` instead of the `LogprobsMode` enum, simplifying logprobs
mode handling and improving compatibility with recent vLLM changes
(aed16879a9)
https://github.com/vllm-project/vllm/pull/25252

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed


- vLLM version: v0.10.2
- vLLM main:
6d8246aaff

---------

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-21 09:49:17 +08:00
12bcbd02bb [CI] Upgrade vLLM to 20250919 (6d8246aa) and fix some broken issue (#2907)
### What this PR does / why we need it?
1. This pr bump vllm commit to
6d8246aaff
2. adapt to the upstream change https://github.com/vllm-project/vllm/pull/24548
around multi-modal kwargs, keeping both vllm main and `v0.10.2` working
3. fix metadata_builder changes introduced by
https://github.com/vllm-project/vllm/pull/23693
4. fix `structured_outputs_config` changes introduced by
https://github.com/vllm-project/vllm/pull/22772
5. fix `moe_config` changes introduced by
https://github.com/vllm-project/vllm/pull/22537

Co-authored-by:  MengqingCao <cmq0113@163.com>
Co-authored-by:  Yikun Jiang <yikunkero@gmail.com>


- vLLM version: v0.10.2
- vLLM main:
c60e6137f0

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
2025-09-20 17:37:57 +08:00
53ecd89e8f [Bugfix] Remove VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE (#2969)
### What this PR does / why we need it?
This PR prepares for deleting the environment variable
`VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE`, as vllm requires `fullgraph=True`
to run.

- Fixes https://github.com/vllm-project/vllm/issues/21834

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
See CI 

- vLLM version: v0.10.2
- vLLM main:
99cc41ad50

---------

Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
2025-09-20 08:22:30 +08:00
e26fe1caf1 [TEST] Speed up DS V2 accuracy test and turn up accuracy baseline (#3047)
### What this PR does / why we need it?
1. update expected accuracy for DeepSeek-V2-Lite
2. add batch size 

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Accuracy CI passed

- vLLM version: v0.10.2
- vLLM main:
838d7116ba

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-20 00:40:33 +08:00
a22b532d38 [Fixbug] Fix shape mismatch with sliding_window and dynamic batch_size (#2830)
### What this PR does / why we need it?
Fix shape mismatch when testing LLM-Research/Phi-4-mini-instruct accuracy

### Does this PR introduce _any_ user-facing change?

Before this fix, users couldn't set a dynamic batch_size or run lm_eval
accuracy tests with models that use sliding_window.

### How was this patch tested?
Accuracy of LLM-Research/Phi-4-mini-instruct is OK:
```
vllm (pretrained=LLM-Research/Phi-4-mini-instruct,max_model_len=4096,dtype=auto,tensor_parallel_size=1), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8105|±  |0.0108|
|     |       |strict-match    |     5|exact_match|↑  |0.8097|±  |0.0108|
```


- vLLM version: v0.10.2
- vLLM main:
3c96e7b8a1

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-19 22:35:14 +08:00
cf549b976d [Test]Add unit test for compilation/acl_graph.py (#3039)
### What this PR does / why we need it?
According to issue [#1298
](https://github.com/vllm-project/vllm-ascend/issues/1298) ,this pull
request adds unit test code for compilation/acl_graph.py.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: v0.10.2
- vLLM main:
f2718d2948

---------

Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
2025-09-19 21:31:17 +08:00
0942d9aaab [3/N][Refactor][Quantization]remove packed_modules_mapping from models (#3021)
### What this PR does / why we need it?

Some custom models in vllm-ascend define packed_modules_mapping, which
prevents keeping the same model classes as the vLLM community. So we move
these custom packed_modules_mapping entries to quant utils.py. After this
PR, some custom models can be removed.

### Does this PR introduce _any_ user-facing change?

tested by CI

### How was this patch tested?

tested by CI

- vLLM version: v0.10.2
- vLLM main:
5089fd749c

Signed-off-by: 22dimensions <waitingwind@foxmail.com>
2025-09-19 20:50:14 +08:00
4ba56716f9 Increase doctest timeout to 300s and time print (#3041)
### What this PR does / why we need it?
Increase the doctest timeout to 300s and print timings. According to the
timings in https://github.com/vllm-project/vllm-ascend/pull/3045, most of
the time is consumed in `Graph capturing`, so it's fine to increase the
doctest timeout.

This PR also adds a time log for each task.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Run `/vllm-workspace/vllm-ascend/tests/e2e/run_doctests.sh`
- CI passed

- vLLM version: v0.10.2
- vLLM main:
a684c0124c

Closes: https://github.com/vllm-project/vllm-ascend/issues/3045

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
2025-09-19 20:26:00 +08:00
8326f15ecf [CustomOp] Register AscendSharedFusedMoE custom op (#2980)
### What this PR does / why we need it?
Register `AscendSharedFusedMoE` custom op.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`DeepSeek-V2-Lite` is a MoE model with shared experts.

Test:

```bash
vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite \
--trust-remote-code \
--enforce-eager \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95

curl -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V2-Lite",
        "messages": [
            {"role": "user", "content": "介绍一下联通公司?"}
        ],
        "stream": false,
        "max_tokens": 100
    }'
```

Output:

```bash
中国联合网络通信集团有限公司(简称“中国联通”)于2009年1月6日在原中国网通和原中国联通的基础上合并组建而成,在国内31个省(自治区、直辖市)和境外多个国家和地区设有分支机构,是中国唯一一家在纽约、香港、上海三地同时上市的电信运营企业,连续多年入选“世界500强企业”。\n\n中国联通主要经营固定通信业务,移动通信业务,国内
```


- vLLM version: v0.10.2
- vLLM main:
486c5599e3

---------

Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
2025-09-19 19:05:01 +08:00
05a700d370 [Bugfix] Fix async copy bug under single expert scenario (#3005)
Add a missing barrier when no implicit synchronization by `repeat_interleave`
is available. Otherwise, the `non_blocking=True` copy of `output_splits`
and `input_splits` from the NPU may fail to complete before the later
`async_all_to_all` uses them.

### What this PR does / why we need it?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
ef7eefe17a

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
2025-09-19 14:05:36 +08:00
2a87b4cecb [Bugfix] Fix specdecoding in chunkedprefill scenario (#3025)
### What this PR does / why we need it?

The speculative decoding phase under chunked prefill took an incorrect
path; it should always use the TND layout for speculative decoding.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
6d8246aaff

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
2025-09-19 14:05:08 +08:00
833cd1b698 [BugFix] Async scheduling and PP compatibility with DP (#2796)
### What this PR does / why we need it?
Based on https://github.com/vllm-project/vllm/pull/23770, this fixes
async scheduling and PP compatibility with DP. It also fixes an issue with
finished requests not being processed in async scheduling and PP cases,
as well as possible worker race conditions.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
544fe76b95

---------

Signed-off-by: jesse <szxfml@gmail.com>
2025-09-19 11:29:50 +08:00
0a526768f5 [Feature] Support moe multi-stream for aclgraph. (#2946)
This PR puts the computation of the shared experts into a separate stream,
overlapping it with the routed experts.

- vLLM version: v0.10.2
- vLLM main:
fbd6523ac0

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-09-19 11:06:45 +08:00
0c04bf1e36 [Fixbug] Fix accuracy for DeepSeek-V2-Lite (#3016)
### What this PR does / why we need it?

Fix accuracy for DeepSeek-V2-Lite

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
CI passed

- vLLM version: v0.10.2
- vLLM main:
66072b36db

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2025-09-18 23:58:23 +08:00
367edff5af [HybridKV] Fix prefill disaggregation kvcache addr alignment & use hybrid kv cache only when running qwen3_next (#3007)
### What this PR does / why we need it?
This PR fixes a few issues in prefill disaggregation:
1. Fix the prefill disaggregation kvcache address alignment issue:
llmdatadist needs tensor addresses to be aligned to 2 MB.
2. Fix the prefill disaggregation kvcache shape error: llmdatadist requires
k/v tensors with shape [num_blocks, ...], but the implementation before
this PR used [2, num_blocks, ...], which breaks prefill disaggregation.
3. Use the hybrid kv cache only when running qwen3_next, to fix an accuracy
issue in prefill disaggregation.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
Tested locally by @liziyu179 

- vLLM version: v0.10.2
- vLLM main:
4f02b77de4

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-09-18 21:43:22 +08:00
acb46f303f Fix VocabParallelEmbedding UT (#2722)
### What this PR does / why we need it?
Fix VocabParallelEmbedding UT

### How was this patch tested?
CI passed with new added/existing test.

- vLLM version: main
- vLLM main:
f592b3174b

---------

Signed-off-by: Icey <1790571317@qq.com>
2025-09-18 19:54:01 +08:00
01592515b8 [Bugfix] Fix sleep mode level 2 (#1376)
### What this PR does / why we need it?
For sleep mode level 2, we discard both the model weights and the kv_cache.
The problem is: when we discard the weights, we also discard the tensors
representing model state exposed via `model.named_buffers()`, such as
`running_mean` / `running_var` in BatchNorm and the rope cos-sin cache. If
we later update the weights but forget to update these buffers as well,
this leads to unknown issues (an illustrative sketch follows below).
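
An illustrative sketch (not the actual implementation) of snapshotting and restoring `model.named_buffers()` alongside the weights:

```python
# Illustrative only: buffers (BatchNorm running stats, rope cos/sin caches, ...)
# are not parameters, so they must be saved and restored explicitly.
import torch


def snapshot_buffers(model: torch.nn.Module) -> dict[str, torch.Tensor]:
    return {name: buf.detach().cpu().clone() for name, buf in model.named_buffers()}


def restore_buffers(model: torch.nn.Module, saved: dict[str, torch.Tensor]) -> None:
    with torch.no_grad():
        for name, buf in model.named_buffers():
            if name in saved:
                buf.copy_(saved[name].to(buf.device))
```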
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.10.2
- vLLM main:
5963b98b46

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-09-18 19:51:52 +08:00
f4e3d22432 Remove chunked_prefill_for_mla and fix ring_mla bug (#2781)
### What this PR does / why we need it?
Remove the chunked-prefill-for-MLA branch in MLA, and change the dtype of
prefill_mask to avoid an accuracy problem.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

- vLLM version: v0.10.2
- vLLM main:
ef7eefe17a

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2025-09-18 19:43:26 +08:00