### What this PR does / why we need it?
This PR adds an online single-request DP2 test case for aclgraph.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
ut
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: lilinsiman <lilinsiman@gmail.com>
### What this PR does / why we need it?
Replaces the hardcoded `mc2_tokens_capacity` with the max graph capture
size for a more accurate allocation.
This change ensures the capacity is correctly sized relative to the
graph capture configuration, removing a magic number and making the
setup more robust.
This PR fixes two issues:
1. <del>MC2 op restrictions differ between SoCs.</del> @Angazenn This
requires an overhaul, so it has been removed from this PR; please submit
it as a separate PR.
2. The hardcoded value `512` allocates too much buffer for large models.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Tested in daily checks.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
This PR will accomplish the following tasks:
**optimize SP**
In the old implementation, the first layer performed an all_reduce, and
the RMSNorm then split the result into chunks. We changed it to perform a
reduce_scatter on the embedding side, replacing one all_reduce operation
and one chunk split with a single reduce_scatter operation (see the sketch
below).
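A minimal sketch of the equivalence this relies on, written with plain
`torch.distributed` collectives (illustrative only, not the vllm-ascend code):

```python
import torch
import torch.distributed as dist


def old_path(embedding_out: torch.Tensor, tp_rank: int, tp_size: int) -> torch.Tensor:
    # all_reduce the full embedding output, then keep only this rank's chunk
    dist.all_reduce(embedding_out)
    return embedding_out.chunk(tp_size, dim=0)[tp_rank]


def new_path(embedding_out: torch.Tensor, tp_size: int) -> torch.Tensor:
    # reduce_scatter yields the per-rank chunk directly, saving communication volume
    out = torch.empty_like(embedding_out.chunk(tp_size, dim=0)[0])
    dist.reduce_scatter_tensor(out, embedding_out)
    return out
```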
**Support qwen3 next**
Since Qwen3 Next includes a linear attention module, the prefix name of
this module cannot take effect directly and needs special handling.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
The previous implementation of the Mooncake connector only supported
scenarios where the Tensor Parallel sizes for the Prefill and Decode
phases were the same for MLA and GQA/MHA.
For heterogeneous TP scenarios, a single rank on a decode node needs to
pull the KV cache from multiple ranks on the prefill nodes and then
merge them (only prefill TP >= decode TP is supported for now). During
this merge, a transpose operation is required because the layouts of the
KV caches differ. To minimize transpose overhead, we use the
npu_paged_cache_load operation to extract the blocks corresponding to
the request from the KV cache. After performing the transpose, we use
_npu_reshape_and_cache to write the blocks back to their original
positions.
This process is illustrated in the diagram below, where b means
block_size; the diagram shows the KV cache layout transpose for one
block. In the implementation, we transpose the KV cache layer by layer
for each request (a rough sketch follows the diagram).
<img width="1464" height="916" alt="image"
src="https://github.com/user-attachments/assets/09d96a98-e41c-4733-9535-05544163081a"
/>
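A rough, illustrative sketch of the per-block layout transpose (the shapes are
assumptions standing in for the layouts in the diagram; the real path moves blocks
with `npu_paged_cache_load` and `_npu_reshape_and_cache`, which are not reproduced
here):

```python
import torch

# Assumed example shapes: b = block_size, H = num_kv_heads, D = head_dim.
b, H, D = 128, 8, 128
prefill_block = torch.randn(b, H, D)  # one block's layout as pulled from a prefill rank

# Re-arrange into the layout the decode rank's paged KV cache expects for this block;
# in the connector this is done layer by layer for each request.
decode_block = prefill_block.permute(1, 0, 2).contiguous()  # (H, b, D)
```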
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.11.0
---------
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Signed-off-by: zzy-ContiLearn <1831242919@qq.com>
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: Kurumi5210 <jaychou1620@gmail.com>
Co-authored-by: zzy-ContiLearn <1831242919@qq.com>
Co-authored-by: chenxiao <cx02308786@antgroup.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: zzhx1 <zzh_201018@outlook.com>
### What this PR does / why we need it?
When using dynamic EPLB, the MoE load is not collected. We fix this
problem by modifying the return value of the hidden states in torchair.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
DeepseekV3 in A3.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: daishixun <dsxsteven@sina.com>
### What this PR does / why we need it?
Make test_pipeline_parallel take effect in the full CI test run.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
- Adds the `mla_preprocess` custom kernel to provide an optimized
pre-processing operator for Multi-head Latent Attention (MLA) on Ascend
NPUs.
- Wires the new kernel into the C++ extension pipeline so vLLM can
invoke it directly, cutting Python-side tensor shuffling and memory
copies that previously bottlenecked MLA compilation paths.
### Does this PR introduce any user-facing change?
- No. The change only introduces a low-level kernel; public APIs and
inference behavior remain unchanged.
### How was this patch tested?
- Dedicated Ascend kernels are not covered by our CI yet, so no extra
automated tests were added. Future MLA-focused regression runs will
cover this path.
- vLLM version: v0.11.0
Signed-off-by: Chen Chen <0109chenchen@gmail.com>
### What this PR does / why we need it?
Add quantization param for `deepseek-w8a8` multi-node test
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
1. Enable tests/e2e/multicard/test_external_launcher.py
2. Add e2e test for sleep mode in level2
### Does this PR introduce _any_ user-facing change?
not involved
### How was this patch tested?
CI passed with existing test.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: huangxialu <huangxialu1@huawei.com>
Co-authored-by: Shangwei-Li <lishangwei2@huawei.com>
This PR primarily focuses on two key changes:
1. Adjusts internal interface calls to optimize the interaction logic
between related modules.
2. Exposes an interface that allows users to select the EPLB algorithm,
enabling more flexible configuration based on specific usage scenarios.
These changes aim to enhance the usability of the system while ensuring
the stability of internal operations. Relevant unit tests have been
updated to cover the modified logic.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
### What this PR does / why we need it?
Currently, users have to set `HCCL_BUFFSIZE` to 512~1024 to perform mc2
operators (dispatch and combine) while running MoE models with large
`ep_size` and `batch_size`. This environment variable not only affects
the VRAM allocated for the mc2 group, but also increases the VRAM
allocation for the dp, tp & ep groups, leading to significant kvcache
and free_memory drops.
This PR automatically calculates and sets `hccl_buffer_size` for each
process group **(except the mc2 group)** separately when users set
`HCCL_BUFFSIZE` for the mc2 group. This can significantly reduce the
wasted buffer size allocated for the dp, tp & ep groups.
Note that current mc2 operators can only partition communication space
based on the `HCCL_BUFFSIZE` configuration. Once they support
`hccl_buffer_size` configuration via `pg_options` when initializing a
process group, we'll calculate the required buffer size and users will
no longer need to set `HCCL_BUFFSIZE` themselves.
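For illustration, a user-side sketch of the current workflow (the value is an
example, not a recommendation):

```python
import os

# Users still size HCCL_BUFFSIZE (in MB) for the mc2 dispatch/combine operators;
# with this PR, vllm-ascend derives smaller hccl_buffer_size values for the
# dp/tp/ep process groups instead of inheriting this large setting.
os.environ["HCCL_BUFFSIZE"] = "512"
```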
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
We performed E2E serving with deepseek_r1, initializing the DP/TP/EP/MC2
process groups, and observed a significant increase in kv_cache and
free_memory!
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
This PR aims to add multi-node tests; as the first step, it adds a
`deepseek-v3` dp+tp+ep test.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
When using dynamic EPLB, it will be blocked by NZ-format tensors. We fix
this problem by cloning the src tensor and the recv tensor.
### Does this PR introduce any user-facing change?
### How was this patch tested?
Qwen3_moe in A3.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
Resolve the issue where, in the unequal TP (Tensor Parallelism) case,
the TP size is larger than the number of the model's attention KV cache
heads, causing the KV cache to contain duplicates, which led to
transmission errors in the original code.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
In full graph mode, since the paged attention operator's parameters need
to be updated, they must be retained. However, tensors such as the
query, key cache, and value cache do not need to be persistently saved,
so we can manually release this space via `weak_ref_tensor` to save
memory.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: WithHades <244036962@qq.com>
### What this PR does / why we need it?
Modify the scope in which the `_merge_multimodal_embeddings` patch is
enabled. The current patch only takes effect for offline inference on
this platform; for online serving, because a worker sub-process is
added, the patch is not enabled within that sub-process.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: booker123456 <945658361@qq.com>
### What this PR does / why we need it?
YaRN scaling is used to improve long-sequence accuracy for models like
Qwen3. In vLLM, YaRN scaling refers to the `YaRNScalingRotaryEmbedding`
class, which inherits from the original `RotaryEmbedding`. Although
`YaRNScalingRotaryEmbedding` does not override the `forward` function of
`RotaryEmbedding`, using YaRN on NPU still runs into the native
implementation of `forward` in `RotaryEmbedding`, rather than
`forward_oot` in vLLM-Ascend. Thus I register another custom op here to
enable the oot implementation for YaRN in vLLM-Ascend, similar to #3151.
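Roughly, the override follows the same pattern as the existing RotaryEmbedding
override (the class name and the registration mechanism below are assumptions, not
the exact code; the point is that YaRN gets its own custom-op registration so that
`forward_oot` is dispatched):

```python
from vllm.model_executor.layers.rotary_embedding import YaRNScalingRotaryEmbedding


class AscendYaRNScalingRotaryEmbedding(YaRNScalingRotaryEmbedding):
    """Registered as a separate custom op so the NPU path is used for YaRN."""

    def forward_oot(self, positions, query, key, offsets=None):
        # Dispatch to the Ascend rotary-embedding kernel instead of falling
        # back to the native RotaryEmbedding.forward implementation.
        ...
```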
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
When the op `torchair_fused_experts_with_mc2` is called, we need to pass
a TP group, but currently it is only passed in the quantized scenario;
we also need to pass it in the unquantized case.
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
Adds support for capturing the Multi-head Latent Attention (MLA) decode
operation into an ACL graph. This improves performance by compiling the
attention kernel for single-token decoding.
Key changes include:
- Implementing the graph capture logic for the MLA kernel, including
workspace management and parameter updates.
- Modifying the rotary embedding (RoPE) handling to use pre-allocated
tensors, which is a requirement for graph capture.
- Adding a `build_for_graph_capture` method to the MLA metadata builder
to create dummy metadata during the graph compilation phase.
Known issues:
- Currently, MTP is not supported in FULL_DECODE_ONLY mode -- we're
working on a fix.
- We are preparing to replace `update_mla_attn_params` with
`auto_dispatch_capture`.
### Does this PR introduce _any_ user-facing change?
Yes. Full-graph decode capture can be enabled via
`compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"}`.
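For example, a minimal offline usage sketch (the model path is a placeholder, not
part of this PR):

```python
from vllm import LLM, SamplingParams

# Capture the MLA decode path into an ACL graph via the full-decode-only mode.
llm = LLM(
    model="/path/to/a-deepseek-mla-model",  # placeholder path
    compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```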
### How was this patch tested?
- vLLM version: v0.11.0
---------
Signed-off-by: panchao-hub <315134829@qq.com>
Signed-off-by: p00465316 <panchao13@huawei.com>
Co-authored-by: p00465316 <panchao13@huawei.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
We notice that torch_npu 0919 doesn't work. This PR reverts the related
changes which rely on the 0919 version.
Reverted PRs: #3295 #3205 #3102
Related: #3353
- vLLM version: v0.11.0
### What this PR does / why we need it?
Fix empty lines between lm_eval command lines for the accuracy template.
- vLLM version: v0.10.2
- vLLM main:
9607d5eb44
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Due to the special input data during the dummy run, the majority of
tokens are distributed to DP0TP0, which results in insufficient
available KV cache on DP0TP0.
This PR changes the `topk_ids` of the dummy_run input from all zeros to
random values.
This is a naive implementation of expert load balancing to avoid
accumulating too many tokens on a single rank (see the sketch below).
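Conceptually, the change looks like this (a hedged sketch with made-up sizes and
variable names, not the exact dummy_run code):

```python
import torch

num_tokens, top_k, global_num_experts = 8192, 8, 256  # example sizes

# Before: every dummy token was routed to expert 0, piling work onto DP0TP0.
topk_ids_before = torch.zeros(num_tokens, top_k, dtype=torch.int32)

# After: dummy tokens are spread across experts at random.
topk_ids_after = torch.randint(
    0, global_num_experts, (num_tokens, top_k), dtype=torch.int32
)
```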
### How was this patch tested?
model: DeepSeek-v3-w8a8
```bash
vllm serve DeepSeek-v3-w8a8 \
--host 0.0.0.0 \
--port 8004 \
--data-parallel-size 2 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--enforce-eager \
--served-model-name deepseek_v3 \
--enable-expert-parallel \
--disable-log-stats \
--max-num-seqs 18 \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.9 \
--speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
--additional-config \
'{"ascend_scheduler_config":{"enabled":false},"torchair_graph_config":{"enabled":false}}'
```
After enabling load balance:
- Available memory: **2728672256** -> **6771544064**
- KV cache size: **38144** -> **95232** tokens
- vLLM version: v0.11.0
---------
Signed-off-by: chenmenglong <chenmenglong1@huawei.com>
### What this PR does / why we need it?
Upgrade deepseek-v3.2 doc for A2
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
Calculate in advance the workspace memory size needed for the
PagedAttention operator to avoid deadlocks during resource cleanup. This
PR requires torch_npu version 0920 or newer.
### How was this patch tested?
- vLLM version: v0.11.0
---------
Signed-off-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
### What this PR does / why we need it?
Resolved the issue of EPLB failure caused by changes in the log2phy map
due to device type modifications when using MTP rotary position
encoding.
### Does this PR introduce any user-facing change?
### How was this patch tested?
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
- vLLM version: v0.11.0
---------
Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
### What this PR does / why we need it?
Register the connector in the plugin
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
- Refactor and integrate a unified `WeightPrefetchMethod`
- Integrate `qkv_proj.weight` and `o_proj.weight` in quantized Attention
modules
- Prefetching these weights ahead of matmul-like operators improves
performance by reducing L2 cache transfer latency
### Does this PR introduce _any_ user-facing change?
Add a new config in `--additional-config` for configuration:
```json
{
  "weight_prefetch_config": {
    "enabled": false,
    "prefetch_ratio": {
      "attn": {
        "qkv": 1.0,
        "o": 1.0
      }
    }
  }
}
```
This feature is enabled by default and can be disabled through this
configuration.
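For example, this maps to a server launch flag along the lines of
`--additional-config '{"weight_prefetch_config": {"enabled": false}}'`
(illustrative invocation; adjust the values to your deployment).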
### How was this patch tested?
- vLLM version: v0.11.0
---------
Signed-off-by: yuzhup <15705211260@163.com>
Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>
Co-authored-by: yuzhup <15705211260@163.com>
### What this PR does / why we need it?
1. Qwen3 MoE uses the add_rms_norm_quant op instead of separate
add_rms_norm and quant ops in the quantization scenario.
2. The torch_npu add_rms_norm_quant op received an accuracy fix for
model weights quantized by anti_method m4; m4 quantization is an
asymmetric outlier-suppression method that generates a non-zero norm
bias, and the add_rms_norm_quant op is updated to take this parameter
into account (see the reference sketch below).
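As a reference for what the fused kernel computes, here is an unfused pure-torch
sketch (the argument names and the static int8 quantization scheme are assumptions,
not the Ascend op's actual signature):

```python
import torch


def add_rms_norm_quant_ref(x, residual, weight, scale, offset, eps=1e-6, norm_bias=None):
    # 1. Residual add + RMSNorm; anti_method m4 yields a non-zero norm bias.
    h = x + residual
    norm = h * torch.rsqrt(h.pow(2).mean(dim=-1, keepdim=True) + eps) * weight
    if norm_bias is not None:
        norm = norm + norm_bias
    # 2. Static int8 quantization of the normalized activation.
    q = torch.clamp(torch.round(norm / scale + offset), -128, 127).to(torch.int8)
    # Return the quantized activation and the updated residual.
    return q, h
```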
### Does this PR introduce _any_ user-facing change?
please use a torch_npu version >= torch_npu-2.7.1.dev20250919
### How was this patch tested?
1. No special parameters to set, no new envs to set.
2. Tested with Qwen3 MoE quantized models, such as
Qwen3-235B-A22B-W8A8, Qwen3-30B-A3B-W8A8, and
Qwen3-235B-A22B-Instruct-2507-m4 (anti_method m4).
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
---------
Signed-off-by: huangdong2022 <huangdong51@huawei.com>
Signed-off-by: h30027576 <huangdong51@huawei.com>
### What this PR does / why we need it?
When mtp > 1, we need to refresh cos and sin in each step.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
- vLLM version: v0.11.0
Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
### What this PR does / why we need it?
The multistream MoE in torchair is only validated for decode and can't
be applied to chunked prefill, so we add some checks to isolate that
scenario.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: hust17yixuan <303660421@qq.com>
### What this PR does / why we need it?
1. Move additional functionalities from fused_moe.py to
common_fused_moe.py and remove fused_moe.py
2. Remove unnecessary custom classes from qwen3_moe.py, and it will be
completely removed after we release vllm-ascend v0.11.0
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Qwen3-30B-A3B/Qwen3-30B-A3B-W8A8/DeepSeek-V3-W4A8-Pruing/deepseek-mtp/pangu-pro-moe-pruing:
1. Enable/Disable EP
2. Aclgraph & eager
3. SP
- vLLM version: v0.11.0
---------
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
### What this PR does / why we need it?
Since https://github.com/vllm-project/vllm-ascend/pull/3284 was merged,
we should discard some extra code that was previously added for version
compatibility.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
Signed-off-by: wangli <wangli858794774@gmail.com>
### What this PR does / why we need it?
1. Clean up v0.10.2 support in UT and e2e tests.
2. Remove the v0.11.0 periodic job; we're at v0.11.0 now.
3. Remove useless patches for deepseek v3.2. They have been done in vLLM
already.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
There are 3 steps to upgrade vllm-ascend to the newest vLLM. We'll
create 3 PRs:
- [x] Upgrade vLLM to v0.11.0 to make CI happy first.
- [ ] Move deepseek v3.2 to the vLLM way.
- [ ] Then we'll add a new PR to add vLLM main support.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.11.0
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix doc for A2 series and cleanup note
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI passed
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
### What this PR does / why we need it?
Upgrade torch-npu to the newest POC version
### Does this PR introduce _any_ user-facing change?
Yes, users need to upgrade the PTA (torch_npu) version as well.
### How was this patch tested?
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
---------
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
Fix the Qwen3-30B-A3B DP-parallel hang when running with the DP parallel
example.
For large-parameter models of Qwen3-30B and above, weight loading alone
takes 4 to 5 minutes. Therefore, the 5-minute timeout in the current
example code is too short, causing some DP instances to be killed
prematurely and the rest to eventually get stuck in the DP
synchronization all-reduce operation (see the sketch below).
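A hedged sketch of the kind of change involved (the helper name and the timeout
value are assumptions about the example script, not the exact diff):

```python
import multiprocessing


def wait_for_dp_workers(procs: list[multiprocessing.Process]) -> None:
    # A 5-minute join was too short for >30B models whose weight loading alone
    # takes 4 to 5 minutes; give each DP worker a longer grace period before
    # force-terminating it.
    for proc in procs:
        proc.join(timeout=1800)
        if proc.exitcode is None:
            proc.terminate()
```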
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
NA
- vLLM version: v0.11.0rc3
- vLLM main:
https://github.com/vllm-project/vllm/commit/releases/v0.11.0
---------
Signed-off-by: leo-pony <nengjunma@outlook.com>