### What this PR does / why we need it?
[Feat] Supports Aclgraph for bge-m3.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
```
pytest -s tests/e2e/singlecard/test_embedding.py
pytest -s tests/e2e/singlecard/test_embedding_aclgraph.py
```

To start an online server with batch size 10 and a sequence length of 8192 per request, we set `--max-num-batched-tokens=8192*10` so the encoder input is not chunked:

```
vllm serve /home/data/bge-m3 --max_model_len 1024 --served-model-name "bge-m3" --task embed --host 0.0.0.0 --port 9095 --max-num-batched-tokens 81920 --compilation-config '{"cudagraph_capture_sizes":[8192, 10240, 20480, 40960, 81920]}'
```

With batch size 10 and a sequence length of 8192 per request, QPS improves from 85 to 104 (about a 22% gain), and much of the host-bound overhead is removed.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: xuyexiong <xuyexiong@huawei.com>
Co-authored-by: wangyongjun <1104133197@qq.com>
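For reference, a minimal client sketch (not part of this PR) that exercises the server started by the `vllm serve` command above via vLLM's OpenAI-compatible `/v1/embeddings` endpoint. The host, port, and served model name are taken from that command; the input texts and batch size are placeholders standing in for the bs=10, 8192-token benchmark requests.

```python
# Illustrative client for the embedding server launched above (assumptions:
# server reachable at 0.0.0.0:9095, served-model-name "bge-m3").
import time

import requests

URL = "http://0.0.0.0:9095/v1/embeddings"  # host/port from the serve command
BATCH_SIZE = 10                            # matches the bs=10 scenario

# Placeholder inputs; the actual benchmark used ~8192-token sequences per request.
texts = ["example document text ..."] * BATCH_SIZE

start = time.time()
resp = requests.post(URL, json={"model": "bge-m3", "input": texts})
resp.raise_for_status()
embeddings = [item["embedding"] for item in resp.json()["data"]]
print(f"got {len(embeddings)} embeddings in {time.time() - start:.3f}s")
```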