### What this PR does / why we need it?
The `vllm bench` CLI is now mature enough for production use (it supports more datasets), so we no longer need to copy vLLM's benchmark code into this repo. With vLLM installed, the benchmark CLI can be used directly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
## Introduction
This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.
## Overview
**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas 800I A2 (see the quick_start guide for the full list of supported devices), across different models (more coming soon).
- **Latency tests** (an example invocation is sketched after this list)
  - Input length: 32 tokens.
  - Output length: 128 tokens.
  - Batch size: fixed (8).
  - Models: Qwen2.5-7B-Instruct, Qwen3-8B.
  - Evaluation metrics: end-to-end latency (mean, median, p99).
- **Throughput tests**
  - Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  - Output length: the corresponding output lengths of these 200 prompts.
  - Batch size: dynamically determined by vLLM to achieve maximum throughput.
  - Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
  - Evaluation metrics: throughput.
- **Serving tests**
  - Input length: 200 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  - Output length: the corresponding output lengths of these 200 prompts.
  - Batch size: dynamically determined by vLLM and the arrival pattern of the requests.
  - Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for other QPS values, the arrival time of each query is drawn from a Poisson process (with a fixed random seed).
  - Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
  - Evaluation metrics: throughput, TTFT (time to first token; mean, median, p99), ITL (inter-token latency; mean, median, p99).
**Benchmarking Duration**: about 800 seconds per model.
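As a concrete reference, the latency test above roughly corresponds to an invocation like the following. This is only a sketch; exact flag names and defaults may differ across vLLM versions, so check `vllm bench latency --help` on your installation:

```bash
# Sketch: fixed-batch end-to-end latency test
# (32 input tokens, 128 output tokens, batch size 8)
vllm bench latency \
  --model Qwen/Qwen2.5-7B-Instruct \
  --input-len 32 \
  --output-len 128 \
  --batch-size 8
```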
## Quick Use
### Prerequisites
Before running the benchmarks, ensure the following:
- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
- Install the necessary dependencies for the benchmarks:

  ```bash
  pip install -r benchmarks/requirements-bench.txt
  ```
- For the performance benchmark, it is recommended to set the load format to `dummy`. This constructs random weights based on the given model instead of downloading them from the internet, which greatly reduces benchmark time.
- If you want to customize the benchmark, feel free to add your own models and parameters in the JSON file. Let's take `Qwen2.5-VL-7B-Instruct` as an example:

  ```json
  [
    {
      "test_name": "serving_qwen2_5vl_7B_tp1",
      "qps_list": [1, 4, 16, "inf"],
      "server_parameters": {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "tensor_parallel_size": 1,
        "swap_space": 16,
        "disable_log_stats": "",
        "disable_log_requests": "",
        "trust_remote_code": "",
        "max_model_len": 16384
      },
      "client_parameters": {
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",
        "backend": "openai-chat",
        "dataset_name": "hf",
        "hf_split": "train",
        "endpoint": "/v1/chat/completions",
        "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
        "num_prompts": 200
      }
    }
  ]
  ```
  This JSON is parsed by the benchmark script into server parameters and client parameters. The configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more details on the parameters, see the vLLM benchmark CLI documentation. A rough sketch of the equivalent commands follows the parameter breakdown below.
  - **Test Overview**
    - Test Name: `serving_qwen2_5vl_7B_tp1`
    - Queries Per Second (QPS): the test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
  - **Server Parameters**
    - Model: `Qwen/Qwen2.5-VL-7B-Instruct`
    - Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
    - Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
    - `disable_log_stats`: disables logging of performance statistics
    - `disable_log_requests`: disables logging of individual requests
    - Trust Remote Code: enabled (allows execution of model-specific custom code)
    - Max Model Length: 16,384 tokens (maximum context length supported by the model)
  - **Client Parameters**
    - Model: `Qwen/Qwen2.5-VL-7B-Instruct` (same as the server)
    - Backend: `openai-chat` (the client uses the OpenAI-compatible chat API format)
    - Dataset Source: Hugging Face (`hf`)
    - Dataset Split: `train`
    - Endpoint: `/v1/chat/completions` (the REST API endpoint to which chat requests are sent)
    - Dataset Path: `lmarena-ai/vision-arena-bench-v0.1` (the benchmark dataset used for evaluation, hosted on Hugging Face)
    - Number of Prompts: 200 (the total number of prompts used during the test)
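  For intuition, the benchmark script roughly turns `server_parameters` into a `vllm serve` command and `client_parameters` into a `vllm bench serve` command, running the client once per QPS value. The sketch below is not the script's exact logic; the flag spellings are assumptions based on the CLI examples later in this document:

  ```bash
  # Server side, derived from server_parameters (sketch only)
  vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
    --tensor-parallel-size 1 \
    --swap-space 16 \
    --max-model-len 16384 \
    --trust-remote-code

  # Client side, derived from client_parameters, shown here for QPS = 1 (sketch only)
  vllm bench serve --model Qwen/Qwen2.5-VL-7B-Instruct \
    --endpoint-type "openai-chat" --dataset-name hf \
    --hf-split train --endpoint "/v1/chat/completions" \
    --dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
    --num-prompts 200 \
    --request-rate 1
  ```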
### Run benchmarks

#### Use benchmark script
The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory:

```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
```
Once the script completes, you can find the results in the `benchmarks/results` folder. The output files may resemble the following:

```text
.
|-- serving_qwen2_5_7B_tp1_qps_1.json
|-- serving_qwen2_5_7B_tp1_qps_16.json
|-- serving_qwen2_5_7B_tp1_qps_4.json
|-- serving_qwen2_5_7B_tp1_qps_inf.json
|-- latency_qwen2_5_7B_tp1.json
|-- throughput_qwen2_5_7B_tp1.json
```
These files contain detailed benchmarking results for further analysis.
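Each result file is plain JSON, so it can be inspected directly. For example, to pretty-print one of the serving results from the listing above:

```bash
# Pretty-print a single result file for a quick look
python3 -m json.tool benchmarks/results/serving_qwen2_5_7B_tp1_qps_1.json
```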
#### Use benchmark cli
For more flexible and customized use, the benchmark CLI is also provided to run online/offline benchmarks.
Similarly, let's take `Qwen2.5-VL-7B-Instruct` as an example:
**Online serving**

- Launch the server:

  ```bash
  vllm serve Qwen/Qwen2.5-VL-7B-Instruct --max-model-len 16789
  ```

- Run the performance test using the CLI:

  ```bash
  vllm bench serve --model Qwen/Qwen2.5-VL-7B-Instruct \
    --endpoint-type "openai-chat" --dataset-name hf \
    --hf-split train --endpoint "/v1/chat/completions" \
    --dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
    --num-prompts 200 \
    --request-rate 16
  ```
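By default the serving benchmark prints its metrics to stdout. If you also want a JSON file like the ones produced by the script, `vllm bench serve` can save its results; the flags below are taken from upstream vLLM's serving benchmark and may differ between versions, so treat them as an assumption and check `vllm bench serve --help`:

```bash
# Append result saving to the command above (flag names assumed from upstream vLLM)
vllm bench serve --model Qwen/Qwen2.5-VL-7B-Instruct \
  --endpoint-type "openai-chat" --dataset-name hf \
  --hf-split train --endpoint "/v1/chat/completions" \
  --dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
  --num-prompts 200 \
  --request-rate 16 \
  --save-result --result-dir results \
  --result-filename serving_qwen2_5vl_7B_tp1_qps_16.json
```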
**Offline**

- Throughput (adjust `--dataset-path` to wherever your ShareGPT file lives; see the note after this list on obtaining it):

  ```bash
  vllm bench throughput --output-json results/throughput_qwen2_5_7B_tp1.json \
    --model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 --load-format dummy \
    --dataset-path /github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 --backend vllm
  ```

- Latency:

  ```bash
  vllm bench latency --output-json results/latency_qwen2_5_7B_tp1.json \
    --model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 \
    --load-format dummy --num-iters-warmup 5 --num-iters 15
  ```
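The throughput test reads ShareGPT prompts from a local JSON file. If you do not already have it, it is commonly downloaded from Hugging Face; the URL below is the mirror referenced by upstream vLLM's benchmarks and may change, so verify it before use:

```bash
# Download the ShareGPT dataset used by the throughput test (URL may change; verify first)
mkdir -p ~/.cache/datasets
wget -O ~/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
  https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```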