### What this PR does / why we need it?
1. Enable pymarkdown check
2. Enable Python `__init__.py` check for vllm and vllm-ascend
3. Clean up code
### How was this patch tested?
- vLLM version: v0.9.2
- vLLM main: 29c6fbe58c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
### Online serving tests
- Input length: 200 prompts randomly sampled from the ShareGPT and lmarena-ai/vision-arena-bench-v0.1 (multi-modal) datasets (with a fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
- Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for the other QPS values, the arrival time of each query is drawn from a random Poisson process with a fixed random seed (see the sketch after the table placeholder below).
- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: throughput, TTFT (median time to first token), ITL (median inter-token latency), and TPOT (median time per output token).
{serving_tests_markdown_table}
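For reference, a minimal sketch of how request arrival times at a given QPS can be drawn from a Poisson process with a fixed seed. This is illustrative only; the function name and NumPy-based implementation are assumptions, not the actual benchmark harness:

```python
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Sample absolute request arrival times (in seconds) for a Poisson process.

    Inter-arrival gaps of a Poisson process with rate `qps` are exponentially
    distributed with mean 1 / qps; QPS = inf means all requests arrive at t = 0.
    """
    if qps == float("inf"):
        return [0.0] * num_requests
    rng = np.random.default_rng(seed)                        # fixed seed for reproducibility
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps).tolist()                          # cumulative gaps -> arrival times

# Example: 200 requests at an average of 4 QPS
arrivals = poisson_arrival_times(num_requests=200, qps=4.0, seed=0)
```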
### Offline tests
#### Latency tests
- Input length: 32 tokens.
- Output length: 128 tokens.
- Batch size: fixed (8).
- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: end-to-end latency (a minimal measurement sketch follows the table placeholder below).
{latency_tests_markdown_table}
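A rough sketch of this latency setup using vLLM's offline `LLM` API; the prompt construction and timing here are simplified assumptions, not the actual benchmark script:

```python
import time
from vllm import LLM, SamplingParams

# Fixed-shape latency test: ~32 input tokens, 128 output tokens, batch of 8.
llm = LLM(model="Qwen/Qwen3-8B")
prompts = ["hi " * 32] * 8                                   # crude stand-in for 32-token prompts
params = SamplingParams(max_tokens=128, ignore_eos=True)     # force full 128-token outputs

start = time.perf_counter()
llm.generate(prompts, params)
print(f"end-to-end latency: {time.perf_counter() - start:.3f} s")
```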
#### Throughput tests
- Input length: 200 prompts randomly sampled from the ShareGPT and lmarena-ai/vision-arena-bench-v0.1 (multi-modal) datasets (with a fixed random seed).
- Output length: the corresponding output length of these 200 prompts.
- Batch size: dynamically determined by vLLM to achieve maximum throughput.
- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: throughput (see the sketch after the table placeholder below).
{throughput_tests_markdown_table}
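And a similarly simplified sketch of how throughput could be computed over a batch of sampled requests; the placeholder `requests` list stands in for the 200 prompts drawn from ShareGPT / vision-arena and is an assumption, not the real data loader:

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B")

# Placeholder for 200 (prompt, output_len) pairs; the real harness samples these
# from ShareGPT / vision-arena with a fixed seed.
requests = [("Summarize the history of the Silk Road.", 256)] * 200

prompts = [prompt for prompt, _ in requests]
params = [SamplingParams(max_tokens=out_len, ignore_eos=True) for _, out_len in requests]

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_output_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"throughput: {len(requests) / elapsed:.2f} req/s, "
      f"{total_output_tokens / elapsed:.1f} output tokens/s")
```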