[1/2/N] Enable pymarkdown and python __init__ for lint system (#2011)

### What this PR does / why we need it?
1. Enable the pymarkdown check (see the local run sketch below)
2. Enable the python `__init__.py` check for vllm and vllm-ascend
3. Clean up code
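
For reference, both new hooks can be exercised locally through pre-commit. This is a minimal sketch, assuming pre-commit is installed and using the hook ids added in this PR (`pymarkdown` and `python-init`):

```shell
# One-time setup: install the git hook scripts (assumes pre-commit is available)
pre-commit install

# Run only the newly enabled hooks against the whole repository
pre-commit run pymarkdown --all-files
pre-commit run python-init --all-files
```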

### How was this patch tested?


- vLLM version: v0.9.2
- vLLM main: 29c6fbe58c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Author: Li Wang
Committed by: GitHub on 2025-07-25 22:16:10 +08:00
Parent: d629f0b2b5
Commit: bdfb065b5d
31 changed files with 215 additions and 64 deletions

View File

@ -25,4 +25,3 @@ CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->

View File

@ -46,11 +46,11 @@ repos:
# files: ^csrc/.*\.(cpp|hpp|cc|hh|cxx|hxx)$
# types_or: [c++]
# args: [--style=google, --verbose]
- repo: https://github.com/jackdewinter/pymarkdown
rev: v0.9.29
hooks:
- id: pymarkdown
args: [fix]
- repo: https://github.com/rhysd/actionlint
rev: v1.7.7
hooks:
@ -131,6 +131,12 @@ repos:
types: [python]
pass_filenames: false
additional_dependencies: [regex]
- id: python-init
name: Enforce __init__.py in Python packages
entry: python tools/check_python_src_init.py
language: python
types: [python]
pass_filenames: false
# Keep `suggestion` last
- id: suggestion
name: Suggestion

View File

@ -125,4 +125,3 @@ Community Impact Guidelines were inspired by
For answers to common questions about this code of conduct, see the
[Contributor Covenant FAQ](https://www.contributor-covenant.org/faq). Translations are available at
[Contributor Covenant translations](https://www.contributor-covenant.org/translations).

View File

@ -26,7 +26,6 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
**Benchmarking Duration**: about 800 seconds for a single model.
# Quick Use
## Prerequisites
Before running the benchmarks, ensure the following:
@ -34,11 +33,12 @@ Before running the benchmarks, ensure the following:
- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
- Install necessary dependencies for benchmarks:

```shell
pip install -r benchmarks/requirements-bench.txt
```

- For performance benchmark, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) as `dummy`; it will construct random weights based on the passed model without downloading the weights from the internet, which can greatly reduce the benchmark time.
- If you want to run a customized benchmark, feel free to add your own models and parameters in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests); let's take `Qwen2.5-VL-7B-Instruct` as an example:
```shell
@ -72,27 +72,28 @@ Before running the benchmarks, ensure the following:
}
]
```

This JSON will be structured and parsed into server parameters and client parameters by the benchmark script. This configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more parameter details, see the vllm benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks).
- **Test Overview**
- Test Name: serving_qwen2_5vl_7B_tp1
- Queries Per Second (QPS): The test is run at four different QPS levels: 1, 4, 16, and inf (infinite load, typically used for stress testing).
- Server Parameters
- Model: Qwen/Qwen2.5-VL-7B-Instruct
- Tensor Parallelism: 1 (no model parallelism is used; the model runs on a single device or node)
- Swap Space: 16 GB (used to handle memory overflow by swapping to disk)
- disable_log_stats: disables logging of performance statistics.
- disable_log_requests: disables logging of individual requests.
- Trust Remote Code: enabled (allows execution of model-specific custom code)
- Max Model Length: 16,384 tokens (maximum context length supported by the model)
- Client Parameters
@ -110,17 +111,18 @@ Before running the benchmarks, ensure the following:
- Number of Prompts: 200 (the total number of prompts used during the test)
## Run benchmarks
### Use benchmark script
The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory:

```shell
bash benchmarks/scripts/run-performance-benchmarks.sh
```

Once the script completes, you can find the results in the benchmarks/results folder. The output files may resemble the following:

```shell
.
|-- serving_qwen2_5_7B_tp1_qps_1.json
|-- serving_qwen2_5_7B_tp1_qps_16.json
@ -129,6 +131,7 @@ Once the script completes, you can find the results in the benchmarks/results fo
|-- latency_qwen2_5_7B_tp1.json
|-- throughput_qwen2_5_7B_tp1.json
```
These files contain detailed benchmarking results for further analysis.
### Use benchmark cli
@ -137,30 +140,36 @@ For more flexible and customized use, benchmark cli is also provided to run onli
Similarly, let's take the `Qwen2.5-VL-7B-Instruct` benchmark as an example:
#### Online serving
1. Launch the server:

```shell
vllm serve Qwen2.5-VL-7B-Instruct --max-model-len 16789
```

2. Run performance tests using the cli:

```shell
vllm bench serve --model Qwen2.5-VL-7B-Instruct \
--endpoint-type "openai-chat" --dataset-name hf \
--hf-split train --endpoint "/v1/chat/completions" \
--dataset-path "lmarena-ai/vision-arena-bench-v0.1" \
--num-prompts 200 \
--request-rate 16
```

#### Offline
- **Throughput**

```shell
vllm bench throughput --output-json results/throughput_qwen2_5_7B_tp1.json \
--model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 --load-format dummy \
--dataset-path /github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 200 --backend vllm
```

- **Latency**

```shell
vllm bench latency --output-json results/latency_qwen2_5_7B_tp1.json \
--model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 \
--load-format dummy --num-iters-warmup 5 --num-iters 15
```

View File

@ -28,4 +28,4 @@
- Models: Qwen/Qwen3-8B, Qwen/Qwen2.5-VL-7B-Instruct
- Evaluation metrics: throughput.
{throughput_tests_markdown_table}

View File

@ -1,7 +1,7 @@
# Governance
## Mission
As a vital component of vLLM, the vLLM Ascend project is dedicated to providing an easy, fast, and cheap LLM Serving for Everyone on Ascend NPU, and to actively contribute to the enrichment of vLLM.
## Principles
vLLM Ascend follows the vLLM community's code of conduct: [vLLM - CODE OF CONDUCT](https://github.com/vllm-project/vllm/blob/main/CODE_OF_CONDUCT.md)
@ -13,7 +13,7 @@ vLLM Ascend is an open-source project under the vLLM community, where the author
**Responsibility:** Help new contributors with onboarding, handle and respond to community questions, review RFCs and code
**Requirements:** Complete at least 1 contribution. A Contributor is someone who consistently and actively participates in the project, including but not limited to issues/reviews/commits/community involvement.
Contributors will be granted [vllm-project/vllm-ascend](https://github.com/vllm-project/vllm-ascend) GitHub repo `Triage` permissions (`Can read and clone this repository. Can also manage issues and pull requests`) to help community developers collaborate more efficiently.

View File

@ -4,7 +4,7 @@
[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) is an easy-to-use and efficient platform for training and fine-tuning large language models. With LLaMA-Factory, you can fine-tune hundreds of pre-trained models locally without writing any code.
LLaMA-Factory users need to evaluate and run inference on the model after fine-tuning it.
**The Business Challenge**

View File

@ -13,6 +13,7 @@ But you can still set up dev env on Linux/Windows/macOS for linting and basic
tests with the following commands:
#### Run lint locally
```bash
# Choose a base dir (~/vllm-project/) and set up venv
cd ~/vllm-project/
@ -103,7 +104,6 @@ If the PR spans more than one category, please include all relevant prefixes.
You may find more information about contributing to the vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
If you find any problem when contributing, feel free to submit a PR to improve the doc to help other developers.
:::{toctree}
:caption: Index
:maxdepth: 1

View File

@ -172,6 +172,7 @@ pytest -sv tests/ut
# Run single test
pytest -sv tests/ut/test_ascend_config.py
```
::::
::::{tab-item} Multi cards test
@ -185,6 +186,7 @@ pytest -sv tests/ut
# Run single test
pytest -sv tests/ut/test_ascend_config.py
```
::::
:::::
@ -218,10 +220,12 @@ VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.
# Run a certain case in test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py::test_models
```
::::
::::{tab-item} Multi cards test
:sync: multi
```bash
cd /vllm-workspace/vllm-ascend/
# Run all the single card tests
@ -233,6 +237,7 @@ VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_dynamic_npugraph_ba
# Run a certain case in test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_offline_inference.py::test_models
```
::::
:::::

View File

@ -3,4 +3,4 @@
:::{toctree}
:caption: Accuracy Report
:maxdepth: 1
:::

View File

@ -65,6 +65,7 @@ pip install gradio plotly evalscope
## 3. Run gsm8k accuracy test using EvalScope
You can use `evalscope eval` to run the gsm8k accuracy test:
```
evalscope eval \
--model Qwen/Qwen2.5-7B-Instruct \
@ -98,6 +99,7 @@ pip install evalscope[perf] -U
### Basic usage
You can use `evalscope perf` to run the perf test:
```
evalscope perf \
--url "http://localhost:8000/v1/chat/completions" \
@ -111,7 +113,7 @@ evalscope perf \
### Output results
After 1-2 mins, the output is as shown below:
```shell
Benchmarking summary:

View File

@ -1,7 +1,7 @@
# Using lm-eval
This document will guide you through accuracy testing using [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness).
## 1. Run docker container
You can run the docker container on a single NPU:
@ -36,6 +36,7 @@ Install lm-eval in the container.
```bash
pip install lm-eval
```
Run the following command:
```

View File

@ -1,4 +1,4 @@
# Using OpenCompass
This document will guide you through accuracy testing using [OpenCompass](https://github.com/open-compass/opencompass).
## 1. Online Serving
@ -29,7 +29,9 @@ docker run --rm \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```

If your service starts successfully, you can see the info shown below:
```
INFO: Started server process [6873]
INFO: Waiting for application startup.
@ -37,6 +39,7 @@ INFO: Application startup complete.
```
Once your server is started, you can query the model with input prompts in a new terminal:
```
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \

View File

@ -50,6 +50,7 @@ Before writing a patch, following the principle above, we should patch the least
2. Decide which process we should patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
3. Create the patch file in the right folder. The file should be named `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_common/patch_distributed.py`.
4. Write your patch code in the new file. Here is an example:
```python
import vllm
@ -59,8 +60,10 @@ Before writing a patch, following the principle above, we should patch the least
vllm.distributed.parallel_state.destroy_model_parallel = patch_destroy_model_parallel
```
5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_common.patch_distributed` into `vllm_ascend/patch/platform/patch_common/__init__.py`.
6. Add the description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:
```
# ** File: <The patch file name> **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -74,8 +77,8 @@ Before writing a patch, following the principle above, we should patch the least
# Future Plan:
# <Describe the future plan to remove the patch>
```
7. Add the Unit Test and E2E Test. Any newly added code in vLLM Ascend should contain the Unit Test and E2E Test as well. You can find more details in the [test guide](../contribution/testing.md).

## Limitation
1. In V1 Engine, vLLM starts three kinds of processes: Main process, EngineCore process and Worker process. Currently, vLLM Ascend only supports patching the code in the Main process and Worker process by default. If you want to patch code that runs in the EngineCore process, you should patch the EngineCore process entirely during setup; the entry code is in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.

View File

@ -216,6 +216,7 @@ The first argument of `vllm.ModelRegistry.register_model()` indicates the unique
],
}
```
:::
## Step 3: Verification

View File

@ -4,6 +4,7 @@ This document details the benchmark methodology for vllm-ascend, aimed at evalua
**Benchmark Coverage**: We measure offline e2e latency and throughput, and fixed-QPS online serving benchmarks; for more details, see the [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).
## 1. Run docker container
```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
@ -29,6 +30,7 @@ docker run --rm \
```
## 2. Install dependencies
```bash
cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
@ -37,11 +39,13 @@ pip install -r benchmarks/requirements-bench.txt
## 3. (Optional) Prepare model weights
For faster running speed, we recommend downloading the model in advance:
```bash
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
```
You can also replace all model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:
```bash
[
{
@ -59,11 +63,13 @@ You can also replace all model paths in the [json](https://github.com/vllm-proje
## 4. Run benchmark script
Run the benchmark script:
```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
```
After about 10 mins, the output is as shown below:
```bash
online serving:
qps 1:
@ -173,6 +179,7 @@ Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens: 42659
Total num output tokens: 43545
```
The result JSON files are generated in the path `benchmark/results`.
These files contain detailed benchmarking results for further analysis.

View File

@ -10,6 +10,7 @@ The execution duration of each stage (including pre/post-processing, model forwa
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execute duration profiling. Execute the script as follows:**
```
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```
@ -36,4 +37,4 @@ VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_in
5747:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:14.21ms [prepare input and forward]:10.10ms [forward]:4.52ms
5751:(IntegratedWorker pid=1502524) Profile execute duration [Decode]: [post process]:15.03ms [prepare input and forward]:10.00ms [forward]:4.42ms
```

View File

@ -190,6 +190,7 @@ git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
docker build -t vllm-ascend-dev-image:latest -f ./Dockerfile .
```
:::
```{code-block} bash

View File

@ -35,6 +35,7 @@ docker run --rm \
# Install curl
apt-get update -y && apt-get install -y curl
```
::::
::::{tab-item} openEuler
@ -63,6 +64,7 @@ docker run --rm \
# Install curl
yum update -y && yum install -y curl
```
::::
:::::
@ -73,6 +75,7 @@ The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/v
You can use Modelscope mirror to speed up download:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
```bash
export VLLM_USE_MODELSCOPE=true
```
@ -87,6 +90,7 @@ With vLLM installed, you can start generating texts for list of input prompts (i
Try to run the Python script below directly or use the `python3` shell to generate texts:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
```python
from vllm import LLM, SamplingParams
@ -115,6 +119,7 @@ the following command to start the vLLM server with the
[Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) model:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
```bash
# Deploy vLLM server (The first run will take about 3-5 mins (10 MB/s) to download models)
vllm serve Qwen/Qwen2.5-0.5B-Instruct &
@ -128,11 +133,13 @@ INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

Congratulations, you have successfully started the vLLM server!
You can query the list of models:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
```bash
curl http://localhost:8000/v1/models | python3 -m json.tool
```
@ -140,6 +147,7 @@ curl http://localhost:8000/v1/models | python3 -m json.tool
You can also query the model with input prompts:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
@ -155,12 +163,14 @@ vLLM is serving as background process, you can use `kill -2 $VLLM_PID` to stop t
it's equivalent to `Ctrl-C` to stop a foreground vLLM process:
<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->
```bash
VLLM_PID=$(pgrep -f "vllm serve")
kill -2 "$VLLM_PID"
```

You will see output as below:
```
INFO: Shutting down FastAPI HTTP server.
INFO: Shutting down
@ -170,4 +180,4 @@ INFO: Application shutdown complete.
Finally, you can exit the container by using `ctrl-D`.
::::
:::::

View File

@ -43,11 +43,13 @@ Execute the following commands on each node in sequence. The results must all be
### NPU Interconnect Verification:
#### 1. Get NPU IP Addresses
```bash
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```

#### 2. Cross-Node PING Test
```bash
# Execute on the target node (replace with actual IP)
hccn_tool -i 0 -ping -g address 10.20.0.20
@ -95,6 +97,7 @@ Before launch the inference server, ensure some environment variables are set fo
Run the following scripts on two nodes respectively

**node0**
```shell
#!/bin/sh
@ -135,6 +138,7 @@ vllm serve /root/.cache/ds_v3 \
```

**node1**
```shell
#!/bin/sh
@ -173,7 +177,7 @@ vllm serve /root/.cache/ds_v3 \
--additional-config '{"ascend_scheduler_config":{"enabled":true},"torchair_graph_config":{"enabled":true}}'
```

The Deployment view looks like:
![alt text](../assets/multi_node_dp.png)

Once your server is started, you can query the model with input prompts:
@ -191,6 +195,7 @@ curl http://{ node0 ip:8004 }/v1/completions \
## Run benchmarks
For details please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks)
```shell
vllm bench serve --model /root/.cache/ds_v3 --served-model-name deepseek_v3 \
--dataset-name random --random-input-len 128 --random-output-len 128 \

View File

@ -71,6 +71,7 @@ curl http://localhost:8000/v1/completions \
"temperature": 0.6
}'
```
::::
::::{tab-item} v1/chat/completions
@ -91,6 +92,7 @@ curl http://localhost:8000/v1/chat/completions \
"add_special_tokens" : true
}'
```
::::
:::::
@ -170,9 +172,11 @@ if __name__ == "__main__":
del llm
clean_up()
```
::::
::::{tab-item} Eager Mode

```{code-block} python
:substitutions:
import gc
@ -226,6 +230,7 @@ if __name__ == "__main__":
del llm
clean_up()
```
::::
:::::

View File

@ -30,7 +30,7 @@ docker run --rm \
## Install modelslim and convert model
:::{note}
You can choose to convert the model yourself or use the quantized model we uploaded,
see https://www.modelscope.cn/models/vllm-ascend/QwQ-32B-W8A8
:::
@ -55,6 +55,7 @@ python3 quant_qwen.py --model_path $MODEL_PATH --save_directory $SAVE_PATH --cal
## Verify the quantized model
The converted model files look like:
```bash
.
|-- config.json
@ -72,11 +73,13 @@ Run the following script to start the vLLM server with quantized model:
:::{note}
The value "ascend" for the "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released; you can cherry-pick this commit for now.
:::

```bash
vllm serve /home/models/QwQ-32B-w8a8 --tensor-parallel-size 4 --served-model-name "qwq-32b-w8a8" --max-model-len 4096 --quantization ascend
```

Once your server is started, you can query the model with input prompts:
```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
@ -93,7 +96,7 @@ curl http://localhost:8000/v1/completions \
Run the following script to execute offline inference on multi-NPU with the quantized model:
:::{note}
To enable quantization for ascend, the quantization method must be "ascend".
:::

```python
@ -131,4 +134,4 @@ for output in outputs:
del llm
clean_up()
```

View File

@ -80,6 +80,7 @@ curl http://localhost:8000/v1/completions \
"temperature": 0.6
}'
```
::::
::::{tab-item} Qwen/Qwen2.5-7B-Instruct
@ -318,6 +319,7 @@ if __name__ == "__main__":
:::::
Run script:
```bash
python example.py
```

View File

@ -66,6 +66,7 @@ for output in outputs:
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
::::
::::{tab-item} Eager Mode
@ -92,6 +93,7 @@ for output in outputs:
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
::::
:::::
@ -131,6 +133,7 @@ docker run --rm \
-it $IMAGE \
vllm serve Qwen/Qwen3-8B --max_model_len 26240
```
::::
::::{tab-item} Eager Mode
@ -156,6 +159,7 @@ docker run --rm \
-it $IMAGE \
vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
```
::::
:::::

View File

@ -191,4 +191,4 @@ Logs of the vllm server:
INFO 03-12 11:16:50 logger.py:39] Received request chatcmpl-92148a41eca64b6d82d3d7cfa5723aeb: prompt: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>\nWhat is the text in the illustrate?<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16353, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO 03-12 11:16:50 engine.py:280] Added request chatcmpl-92148a41eca64b6d82d3d7cfa5723aeb.
INFO: 127.0.0.1:54004 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```

View File

@ -11,6 +11,7 @@ To quantize a model, users should install [ModelSlim](https://gitee.com/ascend/m
Currently, only the specific tag [modelslim-VLLM-8.1.RC1.b020_001](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/README.md) of modelslim works with vLLM Ascend. Please do not install other versions until the modelslim master version is available for vLLM Ascend in the future.
Install modelslim:
```bash
git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001
cd msit/msmodelslim
@ -22,7 +23,6 @@ pip install accelerate
Take [DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) as an example: you just need to download the model and then execute the convert command. The command is shown below. More info can be found in the modelslim doc [deepseek w8a8 dynamic quantization docs](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/example/DeepSeek/README.md#deepseek-v2-w8a8-dynamic%E9%87%8F%E5%8C%96).
```bash
cd example/DeepSeek
python3 quant_deepseek.py --model_path {original_model_path} --save_directory {quantized_model_save_path} --device_type cpu --act_method 2 --w_bit 8 --a_bit 8 --is_dynamic True
@ -39,6 +39,7 @@ Once convert action is done, there are two important files generated.
2. [quant_model_description.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/quant_model_description.json?status=1). All the converted weights info is recorded in this file.
Here are the full converted model files:
```bash
.
├── config.json
@ -103,4 +104,4 @@ submit a issue, maybe some new models need to be adapted.
### 2. How to solve the error "Could not locate the configuration_deepseek.py"?
Please convert DeepSeek series models using the `modelslim-VLLM-8.1.RC1.b020_001` modelslim; this version has fixed the missing configuration_deepseek.py error.

View File

@ -6,7 +6,6 @@ Sleep Mode is an API designed to offload model weights and discard KV cache from
Since the generation and training phases may employ different model parallelism strategies, it becomes crucial to free the KV cache and even offload model parameters stored within vLLM during training. This ensures efficient memory utilization and avoids resource contention on the NPU.
## Getting started
With `enable_sleep_mode=True`, the way we manage memory (malloc, free) in vLLM is placed under a specific memory pool; during model loading and kv_cache initialization, we tag the memory as a map: `{"weight": data, "kv_cache": data}`.

View File

@ -205,7 +205,7 @@ This is the first release candidate of v0.8.4 for vllm-ascend. Please follow the
### Highlights
- vLLM V1 engine experimental support is included in this version. You can visit the [official guide](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html) to get more details. By default, vLLM will fall back to V0 if V1 doesn't work; please set the `VLLM_USE_V1=1` environment variable if you want to force V1.
- LoRA, Multi-LoRA and Dynamic Serving are supported now. The performance will be improved in the next release. Please follow the [official doc](https://docs.vllm.ai/en/latest/features/lora.html) for more usage information. Thanks for the contribution from China Merchants Bank. [#521](https://github.com/vllm-project/vllm-ascend/pull/521).
- Sleep Mode feature is supported. Currently it only works on the V0 engine; V1 engine support will come soon. [#513](https://github.com/vllm-project/vllm-ascend/pull/513)

View File

@ -34,7 +34,6 @@ Get the newest info here: https://github.com/vllm-project/vllm-ascend/issues/160
| XLM-RoBERTa-based | ✅ | |
| Molmo | ✅ | |
## Multimodal Language Models
### Generative Models

View File

@ -21,3 +21,14 @@ requires = [
"numba",
]
build-backend = "setuptools.build_meta"
[tool.pymarkdown]
plugins.md004.style = "sublist" # ul-style
plugins.md007.indent = 4 # ul-indent
plugins.md007.start_indented = true # ul-indent
plugins.md013.enabled = false # line-length
plugins.md041.enabled = false # first-line-h1
plugins.md033.enabled = false # inline-html
plugins.md046.enabled = false # code-block-style
plugins.md024.allow_different_nesting = true # no-duplicate-headers
plugins.md029.enabled = false # ol-prefix
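
For reference, the comments above map each plugin id to its markdownlint-style rule name. A minimal sketch of checking a document locally, assuming the `pymarkdownlnt` package (which provides the `pymarkdown` CLI) is installed and that it picks up the `[tool.pymarkdown]` section from `pyproject.toml` when run from the repository root:

```shell
# Assumes: pip install pymarkdownlnt
# Scan a single markdown file with the project configuration
pymarkdown scan README.md

# Apply automatic fixes, matching the `args: [fix]` used by the pre-commit hook
pymarkdown fix README.md
```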

View File

@ -0,0 +1,75 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
# Adapted from https://github.com/vllm-project/vllm/tree/main/tools
#
import os
import sys

VLLM_ASCEND_SRC = "vllm_ascend"
VLLM_SRC = "vllm-empty/vllm"


def check_init_file_in_package(directory):
    """
    Check if a Python package directory contains __init__.py file.
    A directory is considered a Python package if it contains `.py` files and an `__init__.py` file.
    """
    try:
        files = os.listdir(directory)
    except FileNotFoundError:
        print(f"Warning: Directory does not exist: {directory}")
        return False

    # If any .py file exists, we expect an __init__.py
    if any(f.endswith('.py') for f in files):
        init_file = os.path.join(directory, '__init__.py')
        if not os.path.isfile(init_file):
            return False
    return True


def find_missing_init_dirs(src_dir):
    """
    Walk through the src_dir and return subdirectories missing __init__.py.
    """
    missing_init = set()
    for dirpath, _, _ in os.walk(src_dir):
        if not check_init_file_in_package(dirpath):
            missing_init.add(dirpath)
    return missing_init


def main():
    all_missing = set()
    for src in [VLLM_ASCEND_SRC, VLLM_SRC]:
        missing = find_missing_init_dirs(src)
        all_missing.update(missing)

    if all_missing:
        print(
            "❌ Missing '__init__.py' files in the following Python package directories:"
        )
        for pkg in sorted(all_missing):
            print(f" - {pkg}")
        sys.exit(1)
    else:
        print("✅ All Python packages have __init__.py files.")


if __name__ == "__main__":
    main()
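
As a usage note, the checker can also be run directly, outside of pre-commit; a small sketch, assuming it is invoked from the vllm-ascend repository root where the `vllm_ascend` and `vllm-empty/vllm` source trees it expects are present:

```shell
# Run the check manually from the repository root
python tools/check_python_src_init.py

# A non-zero exit status means at least one package directory is missing __init__.py
echo $?
```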