**Background:**
There are two principles for operator registration in PyTorch:
- The same namespace can only be registered once by `TORCH_LIBRARY`.
- The same operator signature can only be registered once by `def`.
All custom operators defined in the current repo are used only by Ascend; they are not a common operator schema defined by vLLM that all accelerators would then follow and implement for their respective hardware (an approach that would be conducive to functional abstraction).
Therefore, we can rename the operator registration namespace to an Ascend-specific namespace (**_C_ascend**).
Related issue: https://github.com/vllm-project/vllm-ascend/issues/2742
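For illustration, here is a minimal Python sketch of the two rules above, using a hypothetical operator `my_op` (the real operators in this repo are registered from C++ via `TORCH_LIBRARY`):

```python
import torch

# Define a hypothetical operator schema in the Ascend-specific namespace.
# This is the Python analogue of a `def` inside TORCH_LIBRARY(_C_ascend, m).
torch.library.define("_C_ascend::my_op", "(Tensor x) -> Tensor")

# Rule: registering the same schema a second time raises a RuntimeError.
# torch.library.define("_C_ascend::my_op", "(Tensor x) -> Tensor")

# In C++, TORCH_LIBRARY(_C_ascend, m) may likewise appear only once per
# process; additional registrations go through TORCH_LIBRARY_FRAGMENT.
```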
- vLLM version: main
- vLLM main:
f592b3174b
Signed-off-by: FFFrog <ljw1101.vip@gmail.com>
## Introduction
This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.
## Overview
**Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas 800I A2 (see quick_start for the list of supported devices), with different models (more coming soon):
- Latency tests
  - Input length: 32 tokens.
  - Output length: 128 tokens.
  - Batch size: fixed (8).
  - Models: Qwen2.5-7B-Instruct, Qwen3-8B.
  - Evaluation metrics: end-to-end latency (mean, median, p99).
- Throughput tests
  - Input length: 200 prompts randomly sampled from the ShareGPT dataset (with fixed random seed).
  - Output length: the corresponding output length of these 200 prompts.
  - Batch size: dynamically determined by vllm to achieve maximum throughput.
  - Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
  - Evaluation metrics: throughput.
- Serving tests
  - Input length: 200 prompts randomly sampled from the ShareGPT dataset (with fixed random seed).
  - Output length: the corresponding output length of these 200 prompts.
  - Batch size: dynamically determined by vllm and the arrival pattern of the requests.
  - Average QPS (queries per second): 1, 4, 16, and inf. QPS = inf means all requests arrive at once; for other QPS values, the arrival time of each query is determined by a random Poisson process (with fixed random seed), as illustrated in the sketch after this list.
  - Models: Qwen2.5-VL-7B-Instruct, Qwen2.5-7B-Instruct, Qwen3-8B.
  - Evaluation metrics: throughput, TTFT (time to first token; mean, median, and p99), ITL (inter-token latency; mean, median, and p99).
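To make the serving arrival pattern concrete, here is a minimal sketch (not the benchmark's own code) of generating Poisson arrivals at a given QPS; inter-arrival gaps are exponentially distributed with mean 1/QPS:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, mirroring the benchmarks

def arrival_times(num_requests: int, qps: float) -> np.ndarray:
    """Cumulative arrival times (in seconds) for a Poisson request stream."""
    if qps == float("inf"):
        return np.zeros(num_requests)  # QPS = inf: all requests arrive at once
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

print(arrival_times(5, qps=4.0))  # e.g. 5 requests at an average of 4 QPS
```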
**Benchmarking Duration**: about 800 seconds for a single model.
## Quick Use
### Prerequisites
Before running the benchmarks, ensure the following:
- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
- Install the necessary dependencies for benchmarks:

  ```bash
  pip install -r benchmarks/requirements-bench.txt
  ```

- For performance benchmarks, it is recommended to set the load format to `dummy`. This constructs random weights based on the passed model instead of downloading the weights from the internet, which can greatly reduce the benchmark time.
- If you want to run a customized benchmark, feel free to add your own models and parameters to the JSON file. Let's take `Qwen2.5-VL-7B-Instruct` as an example:

  ```json
  [
      {
          "test_name": "serving_qwen2_5vl_7B_tp1",
          "qps_list": [1, 4, 16, "inf"],
          "server_parameters": {
              "model": "Qwen/Qwen2.5-VL-7B-Instruct",
              "tensor_parallel_size": 1,
              "swap_space": 16,
              "disable_log_stats": "",
              "disable_log_requests": "",
              "trust_remote_code": "",
              "max_model_len": 16384
          },
          "client_parameters": {
              "model": "Qwen/Qwen2.5-VL-7B-Instruct",
              "backend": "openai-chat",
              "dataset_name": "hf",
              "hf_split": "train",
              "endpoint": "/v1/chat/completions",
              "dataset_path": "lmarena-ai/vision-arena-bench-v0.1",
              "num_prompts": 200
          }
      }
  ]
  ```
This JSON will be parsed by the benchmark script into server parameters and client parameters (a minimal parsing sketch follows the parameter breakdown below). The configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more parameter details, see the vllm benchmark CLI.
- Test Overview
  - Test Name: serving_qwen2_5vl_7B_tp1
  - Queries Per Second (QPS): the test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
- Server Parameters
  - Model: Qwen/Qwen2.5-VL-7B-Instruct
  - Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
  - Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
  - disable_log_stats: disables logging of performance statistics.
  - disable_log_requests: disables logging of individual requests.
  - Trust Remote Code: enabled (allows execution of model-specific custom code)
  - Max Model Length: 16,384 tokens (maximum context length supported by the model)
- Client Parameters
  - Model: Qwen/Qwen2.5-VL-7B-Instruct (same as the server)
  - Backend: openai-chat (the client uses the OpenAI-compatible chat API format)
  - Dataset Source: Hugging Face (hf)
  - Dataset Split: train
  - Endpoint: /v1/chat/completions (the REST API endpoint to which chat requests are sent)
  - Dataset Path: lmarena-ai/vision-arena-bench-v0.1 (the benchmark dataset used for evaluation, hosted on Hugging Face)
  - Number of Prompts: 200 (the total number of prompts used during the test)
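For intuition about that parsing step, here is a minimal sketch of how such a test case can be expanded into command-line flags. The helper and file name below are hypothetical illustrations, not the actual logic of `run-performance-benchmarks.sh`:

```python
import json

def to_cli_flags(params: dict) -> list[str]:
    """Hypothetical helper: expand a parameter dict into CLI-style flags."""
    flags = []
    for key, value in params.items():
        flag = "--" + key.replace("_", "-")
        if value == "":
            flags.append(flag)  # bare switch, e.g. --trust-remote-code
        else:
            flags.extend([flag, str(value)])
    return flags

with open("serving-tests.json") as f:  # file name assumed for illustration
    for case in json.load(f):
        print(case["test_name"])
        print("  server:", to_cli_flags(case["server_parameters"]))
        for qps in case["qps_list"]:
            client = to_cli_flags(case["client_parameters"])
            print(f"  client (qps={qps}):", client + ["--request-rate", str(qps)])
```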
### Run benchmarks
#### Use benchmark script
The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory:

```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
```
Once the script completes, you can find the results in the `benchmarks/results` folder. The output files may resemble the following:

```text
.
|-- serving_qwen2_5_7B_tp1_qps_1.json
|-- serving_qwen2_5_7B_tp1_qps_16.json
|-- serving_qwen2_5_7B_tp1_qps_4.json
|-- serving_qwen2_5_7B_tp1_qps_inf.json
|-- latency_qwen2_5_7B_tp1.json
|-- throughput_qwen2_5_7B_tp1.json
```
These files contain detailed benchmarking results for further analysis.
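As a starting point for further analysis, here is a small sketch that loads the serving results and prints two headline metrics. The field names (e.g. `request_throughput`, `mean_ttft_ms`) are assumed from vllm's serving-benchmark JSON output, so verify them against your own files:

```python
import json
from pathlib import Path

for path in sorted(Path("benchmarks/results").glob("serving_*.json")):
    data = json.loads(path.read_text())
    # Metric keys assumed from vllm's serving-benchmark output format.
    throughput = data.get("request_throughput", float("nan"))
    mean_ttft = data.get("mean_ttft_ms", float("nan"))
    print(f"{path.name}: {throughput:.2f} req/s, mean TTFT {mean_ttft:.1f} ms")
```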
#### Use benchmark CLI
For more flexible and customized use, a benchmark CLI is also provided to run online/offline benchmarks.
Similarly, let's take the `Qwen2.5-VL-7B-Instruct` benchmark as an example:
##### Online serving
- Launch the server:

  ```bash
  vllm serve Qwen2.5-VL-7B-Instruct --max-model-len 16789
  ```

- Run performance tests using the CLI:

  ```bash
  vllm bench serve --model Qwen2.5-VL-7B-Instruct \
    --endpoint-type "openai-chat" --dataset-name hf \
    --hf-split train --endpoint "/v1/chat/completions" \
    --dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
    --num-prompts 200 \
    --request-rate 16
  ```
##### Offline
- Throughput:

  ```bash
  vllm bench throughput --output-json results/throughput_qwen2_5_7B_tp1.json \
    --model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 --load-format dummy \
    --dataset-path /github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 --backend vllm
  ```

- Latency:

  ```bash
  vllm bench latency --output-json results/latency_qwen2_5_7B_tp1.json \
    --model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 \
    --load-format dummy --num-iters-warmup 5 --num-iters 15
  ```