Mirror of https://github.com/vllm-project/vllm-ascend.git (synced 2025-10-20 21:53:54 +08:00)
[1/2/N] Enable pymarkdown and python __init__ for lint system (#2011)
### What this PR does / why we need it?
1. Enable pymarkdown check
2. Enable python `__init__.py` check for vllm and vllm-ascend
3. Clean up the code
### How was this patch tested?
- vLLM version: v0.9.2
- vLLM main: 29c6fbe58c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
.github/PULL_REQUEST_TEMPLATE.md (vendored)

@@ -25,4 +25,3 @@ CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->
@@ -46,11 +46,11 @@ repos:
# files: ^csrc/.*\.(cpp|hpp|cc|hh|cxx|hxx)$
# types_or: [c++]
# args: [--style=google, --verbose]
# - repo: https://github.com/jackdewinter/pymarkdown
#   rev: v0.9.29
#   hooks:
#   - id: pymarkdown
#     args: [fix]
- repo: https://github.com/jackdewinter/pymarkdown
  rev: v0.9.29
  hooks:
  - id: pymarkdown
    args: [fix]
- repo: https://github.com/rhysd/actionlint
  rev: v1.7.7
  hooks:
@@ -131,6 +131,12 @@ repos:
    types: [python]
    pass_filenames: false
    additional_dependencies: [regex]
  - id: python-init
    name: Enforce __init__.py in Python packages
    entry: python tools/check_python_src_init.py
    language: python
    types: [python]
    pass_filenames: false
  # Keep `suggestion` last
  - id: suggestion
    name: Suggestion
@@ -125,4 +125,3 @@ Community Impact Guidelines were inspired by
For answers to common questions about this code of conduct, see the
[Contributor Covenant FAQ](https://www.contributor-covenant.org/faq). Translations are available at
[Contributor Covenant translations](https://www.contributor-covenant.org/translations).
@@ -26,7 +26,6 @@ This document outlines the benchmarking methodology for vllm-ascend, aimed at ev
**Benchmarking Duration**: about 800 seconds for a single model.

# Quick Use

## Prerequisites

Before running the benchmarks, ensure the following:
@@ -34,7 +33,8 @@ Before running the benchmarks, ensure the following:
- vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.

- Install necessary dependencies for benchmarks:

```shell
pip install -r benchmarks/requirements-bench.txt
```
@@ -72,6 +72,7 @@ Before running the benchmarks, ensure the following:
}
]
```

This JSON will be structured and parsed into server parameters and client parameters by the benchmark script. This configuration defines a test case named `serving_qwen2_5vl_7B_tp1`, designed to evaluate the performance of the `Qwen/Qwen2.5-VL-7B-Instruct` model under different request rates. The test includes both server and client parameters; for more parameter details, see the vLLM benchmark [cli](https://github.com/vllm-project/vllm/tree/main/vllm/benchmarks).
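
To make that split concrete, here is a minimal illustrative sketch, not the actual benchmark script: the key names (`test_name`, `server_parameters`, `client_parameters`) mirror the JSON above, while the file path and helper function are assumptions.

```python
# Hypothetical sketch of how a test case could be split into CLI argument lists.
# The real run-performance-benchmarks.sh logic may differ.
import json


def to_cli_args(params: dict) -> list:
    """Turn {"max_model_len": 16384, ...} into ["--max-model-len", "16384", ...]."""
    args = []
    for key, value in params.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:
                args.append(flag)
        else:
            args.extend([flag, str(value)])
    return args


with open("benchmarks/tests/serving-tests.json") as f:  # path is an assumption
    for case in json.load(f):
        server_args = to_cli_args(case.get("server_parameters", {}))
        client_args = to_cli_args(case.get("client_parameters", {}))
        print(case["test_name"], server_args, client_args)
```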

- **Test Overview**

@@ -110,17 +111,18 @@ Before running the benchmarks, ensure the following:
- Number of Prompts: 200 (the total number of prompts used during the test)

## Run benchmarks

### Use benchmark script

The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command in the vllm-ascend root directory:

```shell
bash benchmarks/scripts/run-performance-benchmarks.sh
```

Once the script completes, you can find the results in the benchmarks/results folder. The output files may resemble the following:

```shell
.
|-- serving_qwen2_5_7B_tp1_qps_1.json
|-- serving_qwen2_5_7B_tp1_qps_16.json
@@ -129,6 +131,7 @@ Once the script completes, you can find the results in the benchmarks/results fo
|-- latency_qwen2_5_7B_tp1.json
|-- throughput_qwen2_5_7B_tp1.json
```

These files contain detailed benchmarking results for further analysis.
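
As a starting point for that analysis, a small sketch like the following can load one of the result files and print its top-level metrics (the file name comes from the listing above; metric keys vary by benchmark type):

```python
# Minimal sketch: inspect one benchmark result file.
import json
from pathlib import Path

result_file = Path("benchmarks/results/serving_qwen2_5_7B_tp1_qps_1.json")
with result_file.open() as f:
    metrics = json.load(f)

# Print whatever top-level metrics the benchmark recorded.
for key, value in metrics.items():
    print(f"{key}: {value}")
```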

### Use benchmark cli

@@ -137,10 +140,13 @@ For more flexible and customized use, benchmark cli is also provided to run onli
Similarly, let’s take `Qwen2.5-VL-7B-Instruct` benchmark as an example:

#### Online serving

1. Launch the server:

```shell
vllm serve Qwen2.5-VL-7B-Instruct --max-model-len 16789
```

2. Run performance tests using the cli

```shell
vllm bench serve --model Qwen2.5-VL-7B-Instruct \
--endpoint-type "openai-chat" --dataset-name hf \
@@ -152,13 +158,16 @@ Similarly, let’s take `Qwen2.5-VL-7B-Instruct` benchmark as an example:

#### Offline

- **Throughput**

```shell
vllm bench throughput --output-json results/throughput_qwen2_5_7B_tp1.json \
--model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 --load-format dummy \
--dataset-path /github/home/.cache/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 200 --backend vllm
```

- **Latency**

```shell
vllm bench latency --output-json results/latency_qwen2_5_7B_tp1.json \
--model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 \
@@ -13,6 +13,7 @@ But you can still set up dev env on Linux/Windows/macOS for linting and basic
test with the following commands:

#### Run lint locally

```bash
# Choose a base dir (~/vllm-project/) and set up venv
cd ~/vllm-project/
@@ -103,7 +104,6 @@ If the PR spans more than one category, please include all relevant prefixes.
You may find more information about contributing to vLLM Ascend backend plugin on [<u>docs.vllm.ai</u>](https://docs.vllm.ai/en/latest/contributing/overview.html).
If you find any problem when contributing, feel free to submit a PR to improve the doc to help other developers.

:::{toctree}
:caption: Index
:maxdepth: 1
@@ -172,6 +172,7 @@ pytest -sv tests/ut
# Run single test
pytest -sv tests/ut/test_ascend_config.py
```

::::

::::{tab-item} Multi cards test
@@ -185,6 +186,7 @@ pytest -sv tests/ut
# Run single test
pytest -sv tests/ut/test_ascend_config.py
```

::::

:::::
@@ -218,10 +220,12 @@ VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.
# Run a certain case in test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/singlecard/test_offline_inference.py::test_models
```

::::

::::{tab-item} Multi cards test
:sync: multi

```bash
cd /vllm-workspace/vllm-ascend/
# Run all the multi card tests
@@ -233,6 +237,7 @@ VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_dynamic_npugraph_ba
# Run a certain case in test script
VLLM_USE_MODELSCOPE=true pytest -sv tests/e2e/multicard/test_offline_inference.py::test_models
```

::::

:::::
@@ -65,6 +65,7 @@ pip install gradio plotly evalscope
## 3. Run gsm8k accuracy test using EvalScope

You can run the gsm8k accuracy test with `evalscope eval`:

```
evalscope eval \
--model Qwen/Qwen2.5-7B-Instruct \
@@ -98,6 +99,7 @@ pip install evalscope[perf] -U
### Basic usage

You can use `evalscope perf` to run a perf test:

```
evalscope perf \
--url "http://localhost:8000/v1/chat/completions" \
@@ -36,6 +36,7 @@ Install lm-eval in the container.
```bash
pip install lm-eval
```

Run the following command:

```
@@ -29,7 +29,9 @@ docker run --rm \
-it $IMAGE \
vllm serve Qwen/Qwen2.5-7B-Instruct --max_model_len 26240
```

If your service starts successfully, you can see the info shown below:

```
INFO: Started server process [6873]
INFO: Waiting for application startup.
@@ -37,6 +39,7 @@ INFO: Application startup complete.
```

Once your server is started, you can query the model with input prompts in a new terminal:

```
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
@@ -50,6 +50,7 @@ Before writing a patch, following the principle above, we should patch the least
2. Decide which process we should patch. For example, here `distributed` belongs to the vLLM main process, so we should patch `platform`.
3. Create the patch file in the right folder. The file should be named as `patch_{module_name}.py`. The example here is `vllm_ascend/patch/platform/patch_common/patch_distributed.py`.
4. Write your patch code in the new file. Here is an example:

```python
import vllm

@@ -59,8 +60,10 @@ Before writing a patch, following the principle above, we should patch the least

vllm.distributed.parallel_state.destroy_model_parallel = patch_destroy_model_parallel
```

5. Import the patch file in `__init__.py`. In this example, add `import vllm_ascend.patch.platform.patch_common.patch_distributed` into `vllm_ascend/patch/platform/patch_common/__init__.py`.
6. Add the description of the patch in `vllm_ascend/patch/__init__.py`. The description format is as follows:

```
# ** File: <The patch file name> **
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -74,8 +77,8 @@ Before writing a patch, following the principle above, we should patch the least
# Future Plan:
# <Describe the future plan to remove the patch>
```

7. Add the Unit Test and E2E Test. Any newly added code in vLLM Ascend should contain the Unit Test and E2E Test as well. You can find more details in [test guide](../contribution/testing.md).

## Limitation

1. In V1 Engine, vLLM starts three kinds of processes: the Main process, the EngineCore process and the Worker process. Currently vLLM Ascend only supports patching code in the Main process and the Worker process by default. If you want to patch code that runs in the EngineCore process, you should patch the EngineCore process entirely during setup; the entry code is in `vllm.v1.engine.core`. Please override `EngineCoreProc` and `DPEngineCoreProc` entirely.
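
For illustration only, overriding those entry classes wholesale might look like the sketch below, which follows the module-attribute patching style shown in step 4; only the module path and the two class names come from the note above, everything else is an assumption.

```python
# Hypothetical sketch: replace the EngineCore process entry classes entirely.
# Only `vllm.v1.engine.core`, `EngineCoreProc` and `DPEngineCoreProc` are named in this doc;
# the subclass bodies are placeholders for your customized setup/run loop.
import vllm.v1.engine.core as engine_core


class AscendEngineCoreProc(engine_core.EngineCoreProc):
    """Customized EngineCore process entry point."""


class AscendDPEngineCoreProc(engine_core.DPEngineCoreProc):
    """Customized data-parallel EngineCore process entry point."""


engine_core.EngineCoreProc = AscendEngineCoreProc
engine_core.DPEngineCoreProc = AscendDPEngineCoreProc
```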

@@ -216,6 +216,7 @@ The first argument of `vllm.ModelRegistry.register_model()` indicates the unique
],
}
```

:::

## Step 3: Verification
@@ -4,6 +4,7 @@ This document details the benchmark methodology for vllm-ascend, aimed at evalua
**Benchmark Coverage**: We measure offline e2e latency and throughput, and fixed-QPS online serving benchmarks; for more details see [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).

## 1. Run docker container

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
@@ -29,6 +30,7 @@ docker run --rm \
```

## 2. Install dependencies

```bash
cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
@@ -37,11 +39,13 @@ pip install -r benchmarks/requirements-bench.txt

## 3. (Optional) Prepare model weights

For faster running speed, we recommend downloading the model in advance:

```bash
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
```

You can also replace all model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks/tests) files with your local paths:

```bash
[
{
@@ -59,11 +63,13 @@ You can also replace all model paths in the [json](https://github.com/vllm-proje
## 4. Run benchmark script

Run the benchmark script:

```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
```

After about 10 mins, the output is as shown below:

```bash
online serving:
qps 1:
@@ -173,6 +179,7 @@ Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens: 42659
Total num output tokens: 43545
```

The result JSON files are generated into the path `benchmark/results`.
These files contain detailed benchmarking results for further analysis.
@@ -10,6 +10,7 @@ The execution duration of each stage (including pre/post-processing, model forwa
* Use the blocking API `ProfileExecuteDuration().pop_captured_sync` at an appropriate time to get and print the execution durations of all observed stages.
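
For orientation, the pop call might be wired up as in the sketch below; only `pop_captured_sync` is taken from the bullet above, while the import path and the shape of the returned data are assumptions to verify against the vllm-ascend source.

```python
# Hypothetical sketch: drain and print captured stage durations after a profiled run.
# Import path and return structure are assumptions; only pop_captured_sync() is documented above.
from vllm_ascend.utils import ProfileExecuteDuration

captured = ProfileExecuteDuration().pop_captured_sync()
for tag, elapsed in captured.items():  # assume a mapping of stage tag -> elapsed time
    print(f"{tag}: {elapsed}")
```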

**We have instrumented the key inference stages (including pre-processing, model forward pass, etc.) for execute duration profiling. Execute the script as follows:**

```
VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE=1 python3 vllm-ascend/examples/offline_inference_npu.py
```
@@ -190,6 +190,7 @@ git clone https://github.com/vllm-project/vllm-ascend.git
cd vllm-ascend
docker build -t vllm-ascend-dev-image:latest -f ./Dockerfile .
```

:::

```{code-block} bash
@@ -35,6 +35,7 @@ docker run --rm \
# Install curl
apt-get update -y && apt-get install -y curl
```

::::

::::{tab-item} openEuler
@@ -63,6 +64,7 @@ docker run --rm \
# Install curl
yum update -y && yum install -y curl
```

::::
:::::

@@ -73,6 +75,7 @@ The default workdir is `/workspace`, vLLM and vLLM Ascend code are placed in `/v
You can use Modelscope mirror to speed up download:

<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->

```bash
export VLLM_USE_MODELSCOPE=true
```
@@ -87,6 +90,7 @@ With vLLM installed, you can start generating texts for list of input prompts (i
Try to run the below Python script directly or use the `python3` shell to generate texts:

<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->

```python
from vllm import LLM, SamplingParams

@@ -115,6 +119,7 @@ the following command to start the vLLM server with the
[Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) model:

<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->

```bash
# Deploy vLLM server (The first run will take about 3-5 mins (10 MB/s) to download models)
vllm serve Qwen/Qwen2.5-0.5B-Instruct &
@@ -128,11 +133,13 @@ INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

Congratulations, you have successfully started the vLLM server!

You can query the list of models:

<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->

```bash
curl http://localhost:8000/v1/models | python3 -m json.tool
```
@@ -140,6 +147,7 @@ curl http://localhost:8000/v1/models | python3 -m json.tool
You can also query the model with input prompts:

<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->

```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
@@ -155,12 +163,14 @@ vLLM is serving as background process, you can use `kill -2 $VLLM_PID` to stop t
it's equivalent to `Ctrl-C` to stop the foreground vLLM process:

<!-- tests/e2e/doctest/001-quickstart-test.sh should be considered updating as well -->

```bash
VLLM_PID=$(pgrep -f "vllm serve")
kill -2 "$VLLM_PID"
```

You will see output as below:

```
INFO: Shutting down FastAPI HTTP server.
INFO: Shutting down
@@ -43,11 +43,13 @@ Execute the following commands on each node in sequence. The results must all be
### NPU Interconnect Verification

#### 1. Get NPU IP Addresses

```bash
for i in {0..7}; do hccn_tool -i $i -ip -g | grep ipaddr; done
```

#### 2. Cross-Node PING Test

```bash
# Execute on the target node (replace with actual IP)
hccn_tool -i 0 -ping -g address 10.20.0.20
@@ -95,6 +97,7 @@ Before launch the inference server, ensure some environment variables are set fo
Run the following scripts on the two nodes respectively.

**node0**

```shell
#!/bin/sh

@@ -135,6 +138,7 @@ vllm serve /root/.cache/ds_v3 \
```

**node1**

```shell
#!/bin/sh

@@ -191,6 +195,7 @@ curl http://{ node0 ip:8004 }/v1/completions \

## Run benchmarks

For details please refer to [benchmark](https://github.com/vllm-project/vllm-ascend/tree/main/benchmarks).

```shell
vllm bench serve --model /root/.cache/ds_v3 --served-model-name deepseek_v3 \
--dataset-name random --random-input-len 128 --random-output-len 128 \
@@ -71,6 +71,7 @@ curl http://localhost:8000/v1/completions \
    "temperature": 0.6
}'
```

::::

::::{tab-item} v1/chat/completions
@@ -91,6 +92,7 @@ curl http://localhost:8000/v1/chat/completions \
    "add_special_tokens" : true
}'
```

::::
:::::

@@ -170,9 +172,11 @@ if __name__ == "__main__":
    del llm
    clean_up()
```

::::

::::{tab-item} Eager Mode

```{code-block} python
:substitutions:
import gc
@@ -226,6 +230,7 @@ if __name__ == "__main__":
    del llm
    clean_up()
```

::::
:::::
@@ -55,6 +55,7 @@ python3 quant_qwen.py --model_path $MODEL_PATH --save_directory $SAVE_PATH --cal
## Verify the quantized model

The converted model files look like:

```bash
.
|-- config.json
@@ -72,11 +73,13 @@ Run the following script to start the vLLM server with quantized model:
:::{note}
The value "ascend" for the "--quantization" argument will be supported after [a specific PR](https://github.com/vllm-project/vllm-ascend/pull/877) is merged and released; you can cherry-pick this commit for now.
:::

```bash
vllm serve /home/models/QwQ-32B-w8a8 --tensor-parallel-size 4 --served-model-name "qwq-32b-w8a8" --max-model-len 4096 --quantization ascend
```

Once your server is started, you can query the model with input prompts:

```bash
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
@@ -80,6 +80,7 @@ curl http://localhost:8000/v1/completions \
    "temperature": 0.6
}'
```

::::

::::{tab-item} Qwen/Qwen2.5-7B-Instruct
@@ -318,6 +319,7 @@ if __name__ == "__main__":
:::::

Run the script:

```bash
python example.py
```
@@ -66,6 +66,7 @@ for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

::::

::::{tab-item} Eager Mode
@@ -92,6 +93,7 @@ for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

::::
:::::

@@ -131,6 +133,7 @@ docker run --rm \
-it $IMAGE \
vllm serve Qwen/Qwen3-8B --max_model_len 26240
```

::::

::::{tab-item} Eager Mode
@@ -156,6 +159,7 @@ docker run --rm \
-it $IMAGE \
vllm serve Qwen/Qwen3-8B --max_model_len 26240 --enforce-eager
```

::::
:::::
@@ -11,6 +11,7 @@ To quantize a model, users should install [ModelSlim](https://gitee.com/ascend/m
Currently, only the specific tag [modelslim-VLLM-8.1.RC1.b020_001](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/README.md) of modelslim works with vLLM Ascend. Please do not install other versions until the modelslim master version is available for vLLM Ascend in the future.

Install modelslim:

```bash
git clone https://gitee.com/ascend/msit -b modelslim-VLLM-8.1.RC1.b020_001
cd msit/msmodelslim
@@ -22,7 +23,6 @@ pip install accelerate

Take [DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) as an example: you just need to download the model and then execute the convert command shown below. More info can be found in the modelslim [deepseek w8a8 dynamic quantization docs](https://gitee.com/ascend/msit/blob/modelslim-VLLM-8.1.RC1.b020_001/msmodelslim/example/DeepSeek/README.md#deepseek-v2-w8a8-dynamic%E9%87%8F%E5%8C%96).

```bash
cd example/DeepSeek
python3 quant_deepseek.py --model_path {original_model_path} --save_directory {quantized_model_save_path} --device_type cpu --act_method 2 --w_bit 8 --a_bit 8 --is_dynamic True
@@ -39,6 +39,7 @@ Once convert action is done, there are two important files generated.
2. [quant_model_description.json](https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-W8A8/file/view/master/quant_model_description.json?status=1). All the converted weight info is recorded in this file.

Here are the full converted model files:

```bash
.
├── config.json
@@ -6,7 +6,6 @@ Sleep Mode is an API designed to offload model weights and discard KV cache from
Since the generation and training phases may employ different model parallelism strategies, it becomes crucial to free KV cache and even offload model parameters stored within vLLM during training. This ensures efficient memory utilization and avoids resource contention on the NPU.

## Getting started

With `enable_sleep_mode=True`, all memory management (malloc, free) in vLLM goes through a dedicated memory pool; while loading the model and initializing the kv_caches, we tag the memory as a map: `{"weight": data, "kv_cache": data}`.
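
As a rough usage sketch with the offline `LLM` API (the `sleep()`/`wake_up()` calls follow vLLM's sleep mode interface, so verify the exact signatures against your vLLM version):

```python
# Minimal Sleep Mode sketch: free NPU memory between generation phases.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))

llm.sleep(level=1)   # offload weights and discard KV cache to free NPU memory
# ... run training or another memory-heavy workload here ...
llm.wake_up()        # reload weights and re-allocate KV cache
print(llm.generate(["Hello again"], SamplingParams(max_tokens=8)))
```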

@@ -34,7 +34,6 @@ Get the newest info here: https://github.com/vllm-project/vllm-ascend/issues/160
| XLM-RoBERTa-based | ✅ | |
| Molmo | ✅ | |

## Multimodal Language Models

### Generative Models
@@ -21,3 +21,14 @@ requires = [
    "numba",
]
build-backend = "setuptools.build_meta"

[tool.pymarkdown]
plugins.md004.style = "sublist" # ul-style
plugins.md007.indent = 4 # ul-indent
plugins.md007.start_indented = true # ul-indent
plugins.md013.enabled = false # line-length
plugins.md041.enabled = false # first-line-h1
plugins.md033.enabled = false # inline-html
plugins.md046.enabled = false # code-block-style
plugins.md024.allow_different_nesting = true # no-duplicate-headers
plugins.md029.enabled = false # ol-prefix

tools/check_python_src_init.py (new file, 75 lines)
@@ -0,0 +1,75 @@
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# Copyright 2023 The vLLM team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
# Adapted from https://github.com/vllm-project/vllm/tree/main/tools
#
import os
import sys

VLLM_ASCEND_SRC = "vllm_ascend"
VLLM_SRC = "vllm-empty/vllm"


def check_init_file_in_package(directory):
    """
    Check if a Python package directory contains __init__.py file.
    A directory is considered a Python package if it contains `.py` files and an `__init__.py` file.
    """
    try:
        files = os.listdir(directory)
    except FileNotFoundError:
        print(f"Warning: Directory does not exist: {directory}")
        return False

    # If any .py file exists, we expect an __init__.py
    if any(f.endswith('.py') for f in files):
        init_file = os.path.join(directory, '__init__.py')
        if not os.path.isfile(init_file):
            return False
    return True


def find_missing_init_dirs(src_dir):
    """
    Walk through the src_dir and return subdirectories missing __init__.py.
    """
    missing_init = set()
    for dirpath, _, _ in os.walk(src_dir):
        if not check_init_file_in_package(dirpath):
            missing_init.add(dirpath)
    return missing_init


def main():
    all_missing = set()

    for src in [VLLM_ASCEND_SRC, VLLM_SRC]:
        missing = find_missing_init_dirs(src)
        all_missing.update(missing)

    if all_missing:
        print(
            "❌ Missing '__init__.py' files in the following Python package directories:"
        )
        for pkg in sorted(all_missing):
            print(f" - {pkg}")
        sys.exit(1)
    else:
        print("✅ All Python packages have __init__.py files.")


if __name__ == "__main__":
    main()