diff --git a/.buildkite/nightly-benchmarks/README.md b/.buildkite/nightly-benchmarks/README.md index ae42f70077..fcde284efe 100644 --- a/.buildkite/nightly-benchmarks/README.md +++ b/.buildkite/nightly-benchmarks/README.md @@ -28,6 +28,7 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performanc ## Trigger the benchmark Performance benchmark will be triggered when: + - A PR being merged into vllm. - Every commit for those PRs with `perf-benchmarks` label AND `ready` label. @@ -38,6 +39,7 @@ bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh ``` Runtime environment variables: + - `ON_CPU`: set the value to '1' on Intelยฎ Xeonยฎ Processors. Default value is 0. - `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file). - `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file). @@ -46,12 +48,14 @@ Runtime environment variables: - `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string. Nightly benchmark will be triggered when: + - Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label. ## Performance benchmark details See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases. > NOTE: For Intelยฎ Xeonยฎ Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead. +> ### Latency test Here is an example of one test inside `latency-tests.json`: @@ -149,6 +153,7 @@ Here is an example using the script to compare result_a and result_b without det Here is an example using the script to compare result_a and result_b with detail test name. `python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json` + | | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio | |---|---------------------------------------------|----------------------------------------|---------------------------------------------|----------------------------------------|----------| | 0 | serving_llama8B_tp1_sharegpt_qps_1 | 142.633982 | serving_llama8B_tp1_sharegpt_qps_1 | 156.526018 | 1.097396 | diff --git a/.buildkite/nightly-benchmarks/nightly-annotation.md b/.buildkite/nightly-benchmarks/nightly-annotation.md index ef11c04005..466def07b6 100644 --- a/.buildkite/nightly-benchmarks/nightly-annotation.md +++ b/.buildkite/nightly-benchmarks/nightly-annotation.md @@ -1,3 +1,4 @@ +# Nightly benchmark annotation ## Description @@ -13,15 +14,15 @@ Please download the visualization scripts in the post - Find the docker we use in `benchmarking pipeline` - Deploy the docker, and inside the docker: - - Download `nightly-benchmarks.zip`. - - In the same folder, run the following code: + - Download `nightly-benchmarks.zip`. 
+ - In the same folder, run the following code: - ```bash - export HF_TOKEN= - apt update - apt install -y git - unzip nightly-benchmarks.zip - VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh - ``` + ```bash + export HF_TOKEN= + apt update + apt install -y git + unzip nightly-benchmarks.zip + VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh + ``` And the results will be inside `./benchmarks/results`. diff --git a/.buildkite/nightly-benchmarks/nightly-descriptions.md b/.buildkite/nightly-benchmarks/nightly-descriptions.md index 5f003f42f0..8afde017d3 100644 --- a/.buildkite/nightly-benchmarks/nightly-descriptions.md +++ b/.buildkite/nightly-benchmarks/nightly-descriptions.md @@ -13,25 +13,25 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/ ## Setup - Docker images: - - vLLM: `vllm/vllm-openai:v0.6.2` - - SGLang: `lmsysorg/sglang:v0.3.2-cu121` - - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12` - - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3` - - *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.* - - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark. + - vLLM: `vllm/vllm-openai:v0.6.2` + - SGLang: `lmsysorg/sglang:v0.3.2-cu121` + - LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12` + - TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3` + - *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.* + - Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark. - Hardware - - 8x Nvidia A100 GPUs + - 8x Nvidia A100 GPUs - Workload: - - Dataset - - ShareGPT dataset - - Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output) - - Decode-heavy dataset (in average 462 input tokens, 256 output tokens) - - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use. - - Models: llama-3 8B, llama-3 70B. - - We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)). - - Average QPS (query per second): 2, 4, 8, 16, 32 and inf. - - Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed. - - Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better). + - Dataset + - ShareGPT dataset + - Prefill-heavy dataset (in average 462 input tokens, 16 tokens as output) + - Decode-heavy dataset (in average 462 input tokens, 256 output tokens) + - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use. + - Models: llama-3 8B, llama-3 70B. + - We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)). + - Average QPS (query per second): 2, 4, 8, 16, 32 and inf. + - Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed. + - Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better). 
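For context on the arrival pattern above: sampling arrivals from a Poisson process at an average rate of QPS queries per second is equivalent to drawing independent inter-arrival gaps $` \Delta t \sim \mathrm{Exponential}(\lambda) `$ with $` \lambda = \mathrm{QPS} `$, so the mean gap is $` 1/\lambda `$. Because the seed is fixed, the same sampled arrival trace can be replayed across runs and across the serving engines being compared.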
## Known issues diff --git a/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md b/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md index a1f8441ccd..8bb16bd3cf 100644 --- a/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md +++ b/.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md @@ -1,3 +1,4 @@ +# Performance benchmarks descriptions ## Latency tests diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 017ec7ca82..d4aceab447 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -1,4 +1,5 @@ -## Essential Elements of an Effective PR Description Checklist +# Essential Elements of an Effective PR Description Checklist + - [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)". - [ ] The test plan, such as providing test command. - [ ] The test results, such as pasting the results comparison before and after, or e2e results @@ -14,5 +15,4 @@ PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE B ## (Optional) Documentation Update - **BEFORE SUBMITTING, PLEASE READ ** (anything written below this line will be removed by GitHub Actions) diff --git a/.markdownlint.yaml b/.markdownlint.yaml new file mode 100644 index 0000000000..c86fed9555 --- /dev/null +++ b/.markdownlint.yaml @@ -0,0 +1,13 @@ +MD007: + indent: 4 +MD013: false +MD024: + siblings_only: true +MD033: false +MD042: false +MD045: false +MD046: false +MD051: false +MD052: false +MD053: false +MD059: false diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 5197820fb4..045096cb86 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -35,12 +35,11 @@ repos: exclude: 'csrc/(moe/topk_softmax_kernels.cu|quantization/gguf/(ggml-common.h|dequantize.cuh|vecdotq.cuh|mmq.cuh|mmvq.cuh))|vllm/third_party/.*' types_or: [c++, cuda] args: [--style=file, --verbose] -- repo: https://github.com/jackdewinter/pymarkdown - rev: v0.9.29 +- repo: https://github.com/igorshubovych/markdownlint-cli + rev: v0.45.0 hooks: - - id: pymarkdown + - id: markdownlint-fix exclude: '.*\.inc\.md' - args: [fix] - repo: https://github.com/rhysd/actionlint rev: v1.7.7 hooks: diff --git a/README.md b/README.md index dc2f0afbe3..5348405b72 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,4 @@ +

@@ -16,6 +17,7 @@ Easy, fast, and cheap LLM serving for everyone --- *Latest News* ๐Ÿ”ฅ + - [2025/05] We hosted [NYC vLLM Meetup](https://lu.ma/c1rqyf1f)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing). - [2025/05] vLLM is now a hosted project under PyTorch Foundation! Please find the announcement [here](https://pytorch.org/blog/pytorch-foundation-welcomes-vllm/). - [2025/04] We hosted [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing). @@ -46,6 +48,7 @@ Easy, fast, and cheap LLM serving for everyone --- + ## About vLLM is a fast and easy-to-use library for LLM inference and serving. @@ -75,6 +78,7 @@ vLLM is flexible and easy to use with: - Multi-LoRA support vLLM seamlessly supports most popular open-source models on HuggingFace, including: + - Transformer-like LLMs (e.g., Llama) - Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3) - Embedding Models (e.g., E5-Mistral) @@ -91,6 +95,7 @@ pip install vllm ``` Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more. + - [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html) - [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html) - [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html) @@ -107,6 +112,7 @@ vLLM is a community project. Our compute resources for development and testing a Cash Donations: + - a16z - Dropbox - Sequoia Capital @@ -114,6 +120,7 @@ Cash Donations: - ZhenFund Compute Resources: + - AMD - Anyscale - AWS diff --git a/RELEASE.md b/RELEASE.md index 9352e7ef70..db0d51afc7 100644 --- a/RELEASE.md +++ b/RELEASE.md @@ -60,9 +60,10 @@ Please note: **No feature work allowed for cherry picks**. All PRs that are cons Before each release, we perform end-to-end performance validation to ensure no regressions are introduced. This validation uses the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) on PyTorch CI. 
**Current Coverage:** + * Models: Llama3, Llama4, and Mixtral * Hardware: NVIDIA H100 and AMD MI300x -* *Note: Coverage may change based on new model releases and hardware availability* +* _Note: Coverage may change based on new model releases and hardware availability_ **Performance Validation Process:** @@ -71,11 +72,13 @@ Request write access to the [pytorch/pytorch-integration-testing](https://github **Step 2: Review Benchmark Setup** Familiarize yourself with the benchmark configurations: + * [CUDA setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/cuda) * [ROCm setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/rocm) **Step 3: Run the Benchmark** Navigate to the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) and configure: + * **vLLM branch**: Set to the release branch (e.g., `releases/v0.9.2`) * **vLLM commit**: Set to the RC commit hash diff --git a/benchmarks/README.md b/benchmarks/README.md index 3b10963c3e..644517235b 100644 --- a/benchmarks/README.md +++ b/benchmarks/README.md @@ -4,7 +4,7 @@ This README guides you through running benchmark tests with the extensive datasets supported on vLLM. Itโ€™s a living document, updated as new features and datasets become available. -**Dataset Overview** +## Dataset Overview @@ -81,9 +81,10 @@ become available. **Note**: HuggingFace dataset's `dataset-name` should be set to `hf` ---- +## ๐Ÿš€ Example - Online Benchmark +
-๐Ÿš€ Example - Online Benchmark +Show more
@@ -109,7 +110,7 @@ vllm bench serve \ If successful, you will see the following output -``` +```text ============ Serving Benchmark Result ============ Successful requests: 10 Benchmark duration (s): 5.78 @@ -133,11 +134,11 @@ P99 ITL (ms): 8.39 ================================================== ``` -**Custom Dataset** +### Custom Dataset If the dataset you want to benchmark is not supported yet in vLLM, even then you can benchmark on it using `CustomDataset`. Your data needs to be in `.jsonl` format and needs to have "prompt" field per entry, e.g., data.jsonl -``` +```json {"prompt": "What is the capital of India?"} {"prompt": "What is the capital of Iran?"} {"prompt": "What is the capital of China?"} @@ -166,7 +167,7 @@ vllm bench serve --port 9001 --save-result --save-detailed \ You can skip applying chat template if your data already has it by using `--custom-skip-chat-template`. -**VisionArena Benchmark for Vision Language Models** +### VisionArena Benchmark for Vision Language Models ```bash # need a model with vision capability here @@ -184,7 +185,7 @@ vllm bench serve \ --num-prompts 1000 ``` -**InstructCoder Benchmark with Speculative Decoding** +### InstructCoder Benchmark with Speculative Decoding ``` bash VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \ @@ -201,13 +202,13 @@ vllm bench serve \ --num-prompts 2048 ``` -**Other HuggingFaceDataset Examples** +### Other HuggingFaceDataset Examples ```bash vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests ``` -**`lmms-lab/LLaVA-OneVision-Data`** +`lmms-lab/LLaVA-OneVision-Data`: ```bash vllm bench serve \ @@ -221,7 +222,7 @@ vllm bench serve \ --num-prompts 10 ``` -**`Aeala/ShareGPT_Vicuna_unfiltered`** +`Aeala/ShareGPT_Vicuna_unfiltered`: ```bash vllm bench serve \ @@ -234,7 +235,7 @@ vllm bench serve \ --num-prompts 10 ``` -**`AI-MO/aimo-validation-aime`** +`AI-MO/aimo-validation-aime`: ``` bash vllm bench serve \ @@ -245,7 +246,7 @@ vllm bench serve \ --seed 42 ``` -**`philschmid/mt-bench`** +`philschmid/mt-bench`: ``` bash vllm bench serve \ @@ -255,7 +256,7 @@ vllm bench serve \ --num-prompts 80 ``` -**Running With Sampling Parameters** +### Running With Sampling Parameters When using OpenAI-compatible backends such as `vllm`, optional sampling parameters can be specified. Example client command: @@ -273,25 +274,29 @@ vllm bench serve \ --num-prompts 10 ``` -**Running With Ramp-Up Request Rate** +### Running With Ramp-Up Request Rate The benchmark tool also supports ramping up the request rate over the duration of the benchmark run. This can be useful for stress testing the server or finding the maximum throughput that it can handle, given some latency budget. Two ramp-up strategies are supported: + - `linear`: Increases the request rate linearly from a start value to an end value. - `exponential`: Increases the request rate exponentially. The following arguments can be used to control the ramp-up: + - `--ramp-up-strategy`: The ramp-up strategy to use (`linear` or `exponential`). - `--ramp-up-start-rps`: The request rate at the beginning of the benchmark. - `--ramp-up-end-rps`: The request rate at the end of the benchmark.
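For concreteness, here is one hypothetical way to combine these ramp-up flags with the serving benchmark (the model, dataset path, prompt count, and rate values are placeholders; any of the serving commands above work the same way):

```bash
vllm bench serve \
  --model NousResearch/Hermes-3-Llama-3.1-8B \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000 \
  --ramp-up-strategy linear \
  --ramp-up-start-rps 1 \
  --ramp-up-end-rps 20
```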
+## ๐Ÿ“ˆ Example - Offline Throughput Benchmark +
-๐Ÿ“ˆ Example - Offline Throughput Benchmark +Show more
@@ -305,15 +310,15 @@ vllm bench throughput \ If successful, you will see the following output -``` +```text Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s Total num prompt tokens: 5014 Total num output tokens: 1500 ``` -**VisionArena Benchmark for Vision Language Models** +### VisionArena Benchmark for Vision Language Models -``` bash +```bash vllm bench throughput \ --model Qwen/Qwen2-VL-7B-Instruct \ --backend vllm-chat \ @@ -325,13 +330,13 @@ vllm bench throughput \ The `num prompt tokens` now includes image token counts -``` +```text Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s Total num prompt tokens: 14527 Total num output tokens: 1280 ``` -**InstructCoder Benchmark with Speculative Decoding** +### InstructCoder Benchmark with Speculative Decoding ``` bash VLLM_WORKER_MULTIPROC_METHOD=spawn \ @@ -349,15 +354,15 @@ vllm bench throughput \ "prompt_lookup_min": 2}' ``` -``` +```text Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s Total num prompt tokens: 261136 Total num output tokens: 204800 ``` -**Other HuggingFaceDataset Examples** +### Other HuggingFaceDataset Examples -**`lmms-lab/LLaVA-OneVision-Data`** +`lmms-lab/LLaVA-OneVision-Data`: ```bash vllm bench throughput \ @@ -370,7 +375,7 @@ vllm bench throughput \ --num-prompts 10 ``` -**`Aeala/ShareGPT_Vicuna_unfiltered`** +`Aeala/ShareGPT_Vicuna_unfiltered`: ```bash vllm bench throughput \ @@ -382,7 +387,7 @@ vllm bench throughput \ --num-prompts 10 ``` -**`AI-MO/aimo-validation-aime`** +`AI-MO/aimo-validation-aime`: ```bash vllm bench throughput \ @@ -394,7 +399,7 @@ vllm bench throughput \ --num-prompts 10 ``` -**Benchmark with LoRA Adapters** +Benchmark with LoRA adapters: ``` bash # download dataset @@ -413,20 +418,22 @@ vllm bench throughput \
+## ๐Ÿ› ๏ธ Example - Structured Output Benchmark +
-๐Ÿ› ๏ธ Example - Structured Output Benchmark +Show more
Benchmark the performance of structured output generation (JSON, grammar, regex). -**Server Setup** +### Server Setup ```bash vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests ``` -**JSON Schema Benchmark** +### JSON Schema Benchmark ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -438,7 +445,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ --num-prompts 1000 ``` -**Grammar-based Generation Benchmark** +### Grammar-based Generation Benchmark ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -450,7 +457,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ --num-prompts 1000 ``` -**Regex-based Generation Benchmark** +### Regex-based Generation Benchmark ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -461,7 +468,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ --num-prompts 1000 ``` -**Choice-based Generation Benchmark** +### Choice-based Generation Benchmark ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -472,7 +479,7 @@ python3 benchmarks/benchmark_serving_structured_output.py \ --num-prompts 1000 ``` -**XGrammar Benchmark Dataset** +### XGrammar Benchmark Dataset ```bash python3 benchmarks/benchmark_serving_structured_output.py \ @@ -485,14 +492,16 @@ python3 benchmarks/benchmark_serving_structured_output.py \
+## ๐Ÿ“š Example - Long Document QA Benchmark +
-๐Ÿ“š Example - Long Document QA Benchmark +Show more
Benchmark the performance of long document question-answering with prefix caching. -**Basic Long Document QA Test** +### Basic Long Document QA Test ```bash python3 benchmarks/benchmark_long_document_qa_throughput.py \ @@ -504,7 +513,7 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \ --repeat-count 5 ``` -**Different Repeat Modes** +### Different Repeat Modes ```bash # Random mode (default) - shuffle prompts randomly @@ -537,14 +546,16 @@ python3 benchmarks/benchmark_long_document_qa_throughput.py \
+## ๐Ÿ—‚๏ธ Example - Prefix Caching Benchmark +
-๐Ÿ—‚๏ธ Example - Prefix Caching Benchmark +Show more
Benchmark the efficiency of automatic prefix caching. -**Fixed Prompt with Prefix Caching** +### Fixed Prompt with Prefix Caching ```bash python3 benchmarks/benchmark_prefix_caching.py \ @@ -555,7 +566,7 @@ python3 benchmarks/benchmark_prefix_caching.py \ --input-length-range 128:256 ``` -**ShareGPT Dataset with Prefix Caching** +### ShareGPT Dataset with Prefix Caching ```bash # download dataset @@ -572,14 +583,16 @@ python3 benchmarks/benchmark_prefix_caching.py \
+## โšก Example - Request Prioritization Benchmark +
-โšก Example - Request Prioritization Benchmark +Show more
Benchmark the performance of request prioritization in vLLM. -**Basic Prioritization Test** +### Basic Prioritization Test ```bash python3 benchmarks/benchmark_prioritization.py \ @@ -590,7 +603,7 @@ python3 benchmarks/benchmark_prioritization.py \ --scheduling-policy priority ``` -**Multiple Sequences per Prompt** +### Multiple Sequences per Prompt ```bash python3 benchmarks/benchmark_prioritization.py \ diff --git a/benchmarks/auto_tune/README.md b/benchmarks/auto_tune/README.md index c479ff1aa2..9aad51df6e 100644 --- a/benchmarks/auto_tune/README.md +++ b/benchmarks/auto_tune/README.md @@ -3,6 +3,7 @@ This script automates the process of finding the optimal server parameter combination (`max-num-seqs` and `max-num-batched-tokens`) to maximize throughput for a vLLM server. It also supports additional constraints such as E2E latency and prefix cache hit rate. ## Table of Contents + - [Prerequisites](#prerequisites) - [Configuration](#configuration) - [How to Run](#how-to-run) @@ -52,7 +53,7 @@ You must set the following variables at the top of the script before execution. 1. **Configure**: Edit the script and set the variables in the [Configuration](#configuration) section. 2. **Execute**: Run the script. Since the process can take a long time, it is highly recommended to use a terminal multiplexer like `tmux` or `screen` to prevent the script from stopping if your connection is lost. -``` +```bash cd bash auto_tune.sh ``` @@ -64,6 +65,7 @@ bash auto_tune.sh Here are a few examples of how to configure the script for different goals: ### 1. Maximize Throughput (No Latency Constraint) + - **Goal**: Find the best `max-num-seqs` and `max-num-batched-tokens` to get the highest possible throughput for 1800 input tokens and 20 output tokens. - **Configuration**: @@ -76,6 +78,7 @@ MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number ``` #### 2. Maximize Throughput with a Latency Requirement + - **Goal**: Find the best server parameters when P99 end-to-end latency must be below 500ms. - **Configuration**: @@ -88,6 +91,7 @@ MAX_LATENCY_ALLOWED_MS=500 ``` #### 3. Maximize Throughput with Prefix Caching and Latency Requirements + - **Goal**: Find the best server parameters assuming a 60% prefix cache hit rate and a latency requirement of 500ms. - **Configuration**: @@ -109,7 +113,7 @@ After the script finishes, you will find the results in a new, timestamped direc - **Final Result Summary**: A file named `result.txt` is created in the log directory. It contains a summary of each tested combination and concludes with the overall best parameters found. -``` +```text # Example result.txt content hash:a1b2c3d4... max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 10.0, e2el: 450.5, throughput: 9.8, goodput: 9.8 diff --git a/benchmarks/kernels/deepgemm/README.md b/benchmarks/kernels/deepgemm/README.md index 917e814010..41e68e047b 100644 --- a/benchmarks/kernels/deepgemm/README.md +++ b/benchmarks/kernels/deepgemm/README.md @@ -8,7 +8,7 @@ Currently this just includes dense GEMMs and only works on Hopper GPUs. You need to install vLLM in your usual fashion, then install DeepGEMM from source in its own directory: -``` +```bash git clone --recursive https://github.com/deepseek-ai/DeepGEMM cd DeepGEMM python setup.py install @@ -17,7 +17,7 @@ uv pip install -e . ## Usage -``` +```console python benchmark_fp8_block_dense_gemm.py INFO 02-26 21:55:13 [__init__.py:207] Automatically detected platform cuda. 
===== STARTING FP8 GEMM BENCHMARK ===== diff --git a/csrc/quantization/cutlass_w8a8/Epilogues.md b/csrc/quantization/cutlass_w8a8/Epilogues.md index a30e1fdf3a..15a66913e9 100644 --- a/csrc/quantization/cutlass_w8a8/Epilogues.md +++ b/csrc/quantization/cutlass_w8a8/Epilogues.md @@ -86,6 +86,7 @@ D = s_a s_b \widehat A \widehat B ``` Epilogue parameters: + - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector). - `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector). @@ -135,7 +136,7 @@ That is precomputed and stored in `azp_with_adj` as a row-vector. Epilogue parameters: - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector). - - Generally this will be per-tensor as the zero-points are per-tensor. + - Generally this will be per-tensor as the zero-points are per-tensor. - `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector). - `azp_with_adj` is the precomputed zero-point term ($` z_a J_a \widehat B `$), is per-channel (row-vector). - `bias` is the bias, is always per-channel (row-vector). @@ -152,7 +153,7 @@ That means the zero-point term $` z_a J_a \widehat B `$ becomes an outer product Epilogue parameters: - `scale_a` is the scale for activations, can be per-tensor (scalar) or per-token (column-vector). - - Generally this will be per-token as the zero-points are per-token. + - Generally this will be per-token as the zero-points are per-token. - `scale_b` is the scale for weights, can be per-tensor (scalar) or per-channel (row-vector). - `azp_adj` is the precomputed zero-point adjustment term ($` \mathbf 1 \widehat B `$), is per-channel (row-vector). - `azp` is the zero-point (`z_a`), is per-token (column-vector). diff --git a/docs/cli/README.md b/docs/cli/README.md index dfb6051a8c..b1371c82a4 100644 --- a/docs/cli/README.md +++ b/docs/cli/README.md @@ -6,13 +6,13 @@ toc_depth: 4 The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with: -``` +```bash vllm --help ``` Available Commands: -``` +```bash vllm {chat,complete,serve,bench,collect-env,run-batch} ``` diff --git a/docs/configuration/tpu.md b/docs/configuration/tpu.md index 005b7f78f4..0ff0cdda38 100644 --- a/docs/configuration/tpu.md +++ b/docs/configuration/tpu.md @@ -40,6 +40,7 @@ Although the first compilation can take some time, for all subsequent server lau Use `VLLM_XLA_CACHE_PATH` environment variable to write to shareable storage for future deployed nodes (like when using autoscaling). #### Reducing compilation time + This initial compilation time ranges significantly and is impacted by many of the arguments discussed in this optimization doc. Factors that influence the length of time to compile are things like model size and `--max-num-batch-tokens`. Other arguments you can tune are things like `VLLM_TPU_MOST_MODEL_LEN`. ### Optimize based on your data @@ -71,12 +72,15 @@ The fewer tokens we pad, the less unnecessary computation TPU does, the better p However, you need to be careful to choose the padding gap. If the gap is too small, it means the number of buckets is large, leading to increased warmup (precompile) time and higher memory to store the compiled graph. Too many compilaed graphs may lead to HBM OOM. Conversely, an overly large gap yields no performance improvement compared to the default exponential padding. 
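As a rough, self-contained illustration of that trade-off (this is not vLLM code; the 16-token floor, the 128-token gap, and the sequence lengths are arbitrary), compare the padded length chosen by a doubling scheme with a fixed padding gap:

```bash
# Padded length under exponential (doubling) buckets vs. a fixed 128-token gap.
# A smaller gap wastes less padding but produces more buckets to precompile.
for len in 300 700 1500; do
  exp=16
  while [ "$exp" -lt "$len" ]; do exp=$((exp * 2)); done
  gap=$(( ((len + 127) / 128) * 128 ))
  echo "seq_len=$len  exponential_pad=$exp  fixed_gap_pad=$gap"
done
```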
-**If possible, use the precision that matches the chipโ€™s hardware acceleration** +#### Quantization + +If possible, use the precision that matches the chipโ€™s hardware acceleration: - v5e has int4/int8 hardware acceleration in the MXU - v6e has int4/int8 hardware acceleration in the MXU -Supported quantized formats and features in vLLM on TPU [Jul '25] +Supported quantized formats and features in vLLM on TPU [Jul '25]: + - INT8 W8A8 - INT8 W8A16 - FP8 KV cache @@ -84,11 +88,13 @@ Supported quantized formats and features in vLLM on TPU [Jul '25] - [WIP] AWQ - [WIP] FP4 W4A8 -**Don't set TP to be less than the number of chips on a single-host deployment** +#### Parallelization + +Don't set TP to be less than the number of chips on a single-host deployment. Although itโ€™s common to do this with GPUs, don't try to fragment 2 or 8 different workloads across 8 chips on a single host. If you need 1 or 4 chips, just create an instance with 1 or 4 chips (these are partial-host machine types). -### Tune your workloads! +### Tune your workloads Although we try to have great default configs, we strongly recommend you check out the [vLLM auto-tuner](../../benchmarks/auto_tune/README.md) to optimize your workloads for your use case. @@ -99,6 +105,7 @@ Although we try to have great default configs, we strongly recommend you check o The auto-tuner provides a profile of optimized configurations as its final step. However, interpreting this profile can be challenging for new users. We plan to expand this section in the future with more detailed guidance. In the meantime, you can learn how to collect a TPU profile using vLLM's native profiling tools [here](../examples/offline_inference/profiling_tpu.md). This profile can provide valuable insights into your workload's performance. #### SPMD + More details to come. **Want us to cover something that isn't listed here? Open up an issue please and cite this doc. We'd love to hear your questions or tips.** diff --git a/docs/contributing/ci/failures.md b/docs/contributing/ci/failures.md index 573efb3b05..d7e2dfbca8 100644 --- a/docs/contributing/ci/failures.md +++ b/docs/contributing/ci/failures.md @@ -20,19 +20,19 @@ the failure? - **Use this title format:** - ``` + ```text [CI Failure]: failing-test-job - regex/matching/failing:test ``` - **For the environment field:** - ``` - Still failing on main as of commit abcdef123 + ```text + Still failing on main as of commit abcdef123 ``` - **In the description, include failing tests:** - ``` + ```text FAILED failing/test.py:failing_test1 - Failure description FAILED failing/test.py:failing_test2 - Failure description https://github.com/orgs/vllm-project/projects/20 diff --git a/docs/contributing/ci/update_pytorch_version.md b/docs/contributing/ci/update_pytorch_version.md index 699d0531ac..3a6026d450 100644 --- a/docs/contributing/ci/update_pytorch_version.md +++ b/docs/contributing/ci/update_pytorch_version.md @@ -106,6 +106,7 @@ releases (which would take too much time), they can be built from source to unblock the update process. ### FlashInfer + Here is how to build and install it from source with `torch2.7.0+cu128` in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271): ```bash @@ -121,6 +122,7 @@ public location for immediate installation, such as [this FlashInfer wheel link] team if you want to get the package published there. 
### xFormers + Similar to FlashInfer, here is how to build and install xFormers from source: ```bash @@ -138,7 +140,7 @@ uv pip install --system \ ### causal-conv1d -``` +```bash uv pip install 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8' ``` diff --git a/docs/contributing/deprecation_policy.md b/docs/contributing/deprecation_policy.md index ff69cbae08..904ef4ca05 100644 --- a/docs/contributing/deprecation_policy.md +++ b/docs/contributing/deprecation_policy.md @@ -31,7 +31,7 @@ Features that fall under this policy include (at a minimum) the following: The deprecation process consists of several clearly defined stages that span multiple Y releases: -**1. Deprecated (Still On By Default)** +### 1. Deprecated (Still On By Default) - **Action**: Feature is marked as deprecated. - **Timeline**: A removal version is explicitly stated in the deprecation @@ -46,7 +46,7 @@ warning (e.g., "This will be removed in v0.10.0"). - GitHub Issue (RFC) for feedback - Documentation and use of the `@typing_extensions.deprecated` decorator for Python APIs -**2.Deprecated (Off By Default)** +### 2.Deprecated (Off By Default) - **Action**: Feature is disabled by default, but can still be re-enabled via a CLI flag or environment variable. Feature throws an error when used without @@ -55,7 +55,7 @@ re-enabling. while signaling imminent removal. Ensures any remaining usage is clearly surfaced and blocks silent breakage before full removal. -**3. Removed** +### 3. Removed - **Action**: Feature is completely removed from the codebase. - **Note**: Only features that have passed through the previous deprecation diff --git a/docs/contributing/profiling.md b/docs/contributing/profiling.md index 13c3bc2c7e..7c18b464b5 100644 --- a/docs/contributing/profiling.md +++ b/docs/contributing/profiling.md @@ -112,13 +112,13 @@ vllm bench serve \ In practice, you should set the `--duration` argument to a large value. Whenever you want the server to stop profiling, run: -``` +```bash nsys sessions list ``` to get the session id in the form of `profile-XXXXX`, then run: -``` +```bash nsys stop --session=profile-XXXXX ``` diff --git a/docs/contributing/vulnerability_management.md b/docs/contributing/vulnerability_management.md index e20b10f8f7..847883f742 100644 --- a/docs/contributing/vulnerability_management.md +++ b/docs/contributing/vulnerability_management.md @@ -32,9 +32,9 @@ We prefer to keep all vulnerability-related communication on the security report on GitHub. However, if you need to contact the VMT directly for an urgent issue, you may contact the following individuals: -- Simon Mo - simon.mo@hey.com -- Russell Bryant - rbryant@redhat.com -- Huzaifa Sidhpurwala - huzaifas@redhat.com +- Simon Mo - +- Russell Bryant - +- Huzaifa Sidhpurwala - ## Slack Discussion diff --git a/docs/deployment/frameworks/anything-llm.md b/docs/deployment/frameworks/anything-llm.md index d6b28a358c..e62a33b208 100644 --- a/docs/deployment/frameworks/anything-llm.md +++ b/docs/deployment/frameworks/anything-llm.md @@ -19,9 +19,9 @@ vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096 - Download and install [Anything LLM desktop](https://anythingllm.com/desktop). 
- On the bottom left of open settings, AI Prooviders --> LLM: - - LLM Provider: Generic OpenAI - - Base URL: http://{vllm server host}:{vllm server port}/v1 - - Chat Model Name: `Qwen/Qwen1.5-32B-Chat-AWQ` + - LLM Provider: Generic OpenAI + - Base URL: http://{vllm server host}:{vllm server port}/v1 + - Chat Model Name: `Qwen/Qwen1.5-32B-Chat-AWQ` ![](../../assets/deployment/anything-llm-provider.png) @@ -30,9 +30,9 @@ vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096 ![](../../assets/deployment/anything-llm-chat-without-doc.png) - Click the upload button: - - upload the doc - - select the doc and move to the workspace - - save and embed + - upload the doc + - select the doc and move to the workspace + - save and embed ![](../../assets/deployment/anything-llm-upload-doc.png) diff --git a/docs/deployment/frameworks/chatbox.md b/docs/deployment/frameworks/chatbox.md index 15f92ed1e3..cbca6e6282 100644 --- a/docs/deployment/frameworks/chatbox.md +++ b/docs/deployment/frameworks/chatbox.md @@ -19,11 +19,11 @@ vllm serve qwen/Qwen1.5-0.5B-Chat - Download and install [Chatbox desktop](https://chatboxai.app/en#download). - On the bottom left of settings, Add Custom Provider - - API Mode: `OpenAI API Compatible` - - Name: vllm - - API Host: `http://{vllm server host}:{vllm server port}/v1` - - API Path: `/chat/completions` - - Model: `qwen/Qwen1.5-0.5B-Chat` + - API Mode: `OpenAI API Compatible` + - Name: vllm + - API Host: `http://{vllm server host}:{vllm server port}/v1` + - API Path: `/chat/completions` + - Model: `qwen/Qwen1.5-0.5B-Chat` ![](../../assets/deployment/chatbox-settings.png) diff --git a/docs/deployment/frameworks/dify.md b/docs/deployment/frameworks/dify.md index a3063194fb..35f02c33cb 100644 --- a/docs/deployment/frameworks/dify.md +++ b/docs/deployment/frameworks/dify.md @@ -34,11 +34,11 @@ docker compose up -d - In the top-right user menu (under the profile icon), go to Settings, then click `Model Provider`, and locate the `vLLM` provider to install it. - Fill in the model provider details as follows: - - **Model Type**: `LLM` - - **Model Name**: `Qwen/Qwen1.5-7B-Chat` - - **API Endpoint URL**: `http://{vllm_server_host}:{vllm_server_port}/v1` - - **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat` - - **Completion Mode**: `Completion` + - **Model Type**: `LLM` + - **Model Name**: `Qwen/Qwen1.5-7B-Chat` + - **API Endpoint URL**: `http://{vllm_server_host}:{vllm_server_port}/v1` + - **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat` + - **Completion Mode**: `Completion` ![](../../assets/deployment/dify-settings.png) diff --git a/docs/deployment/frameworks/haystack.md b/docs/deployment/frameworks/haystack.md index a18d68142c..70b4b48d45 100644 --- a/docs/deployment/frameworks/haystack.md +++ b/docs/deployment/frameworks/haystack.md @@ -1,7 +1,5 @@ # Haystack -# Haystack - [Haystack](https://github.com/deepset-ai/haystack) is an end-to-end LLM framework that allows you to build applications powered by LLMs, Transformer models, vector search and more. Whether you want to perform retrieval-augmented generation (RAG), document search, question answering or answer generation, Haystack can orchestrate state-of-the-art embedding models and LLMs into pipelines to build end-to-end NLP applications and solve your use case. It allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. 
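Since the integration is just the OpenAI-compatible API, a quick smoke test of the endpoint Haystack will talk to can be done with `curl` (this sketch assumes a vLLM server is already running locally on port 8000 and serving `Qwen/Qwen1.5-0.5B-Chat`; substitute your own host, port, and model):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen1.5-0.5B-Chat",
        "messages": [{"role": "user", "content": "Say hello from vLLM."}]
      }'
```

Haystack's OpenAI-style generator components can then be pointed at the same `/v1` base URL.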
diff --git a/docs/deployment/frameworks/retrieval_augmented_generation.md b/docs/deployment/frameworks/retrieval_augmented_generation.md index 96dd99e711..d5f2ec302b 100644 --- a/docs/deployment/frameworks/retrieval_augmented_generation.md +++ b/docs/deployment/frameworks/retrieval_augmented_generation.md @@ -3,6 +3,7 @@ [Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources. Here are the integrations: + - vLLM + [langchain](https://github.com/langchain-ai/langchain) + [milvus](https://github.com/milvus-io/milvus) - vLLM + [llamaindex](https://github.com/run-llama/llama_index) + [milvus](https://github.com/milvus-io/milvus) diff --git a/docs/deployment/integrations/production-stack.md b/docs/deployment/integrations/production-stack.md index 497f9f1a92..fae392589c 100644 --- a/docs/deployment/integrations/production-stack.md +++ b/docs/deployment/integrations/production-stack.md @@ -140,11 +140,12 @@ The core vLLM production stack configuration is managed with YAML. Here is the e ``` In this YAML configuration: + * **`modelSpec`** includes: - * `name`: A nickname that you prefer to call the model. - * `repository`: Docker repository of vLLM. - * `tag`: Docker image tag. - * `modelURL`: The LLM model that you want to use. + * `name`: A nickname that you prefer to call the model. + * `repository`: Docker repository of vLLM. + * `tag`: Docker image tag. + * `modelURL`: The LLM model that you want to use. * **`replicaCount`**: Number of replicas. * **`requestCPU` and `requestMemory`**: Specifies the CPU and memory resource requests for the pod. * **`requestGPU`**: Specifies the number of GPUs required. diff --git a/docs/deployment/k8s.md b/docs/deployment/k8s.md index f244b0858e..cad801a431 100644 --- a/docs/deployment/k8s.md +++ b/docs/deployment/k8s.md @@ -5,7 +5,7 @@ Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine le - [Deployment with CPUs](#deployment-with-cpus) - [Deployment with GPUs](#deployment-with-gpus) - [Troubleshooting](#troubleshooting) - - [Startup Probe or Readiness Probe Failure, container log contains "KeyboardInterrupt: terminated"](#startup-probe-or-readiness-probe-failure-container-log-contains-keyboardinterrupt-terminated) + - [Startup Probe or Readiness Probe Failure, container log contains "KeyboardInterrupt: terminated"](#startup-probe-or-readiness-probe-failure-container-log-contains-keyboardinterrupt-terminated) - [Conclusion](#conclusion) Alternatively, you can deploy vLLM to Kubernetes using any of the following: diff --git a/docs/design/metrics.md b/docs/design/metrics.md index 52cd320dd4..ba34c7dca0 100644 --- a/docs/design/metrics.md +++ b/docs/design/metrics.md @@ -361,7 +361,7 @@ instances in Prometheus. 
We use this concept for the `vllm:cache_config_info` metric: -``` +```text # HELP vllm:cache_config_info Information of the LLMEngine CacheConfig # TYPE vllm:cache_config_info gauge vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0 @@ -686,7 +686,7 @@ documentation for this option states: The metrics were added by and who up in an OpenTelemetry trace as: -``` +```text -> gen_ai.latency.time_in_scheduler: Double(0.017550230026245117) -> gen_ai.latency.time_in_model_forward: Double(3.151565277099609) -> gen_ai.latency.time_in_model_execute: Double(3.6468167304992676) diff --git a/docs/design/p2p_nccl_connector.md b/docs/design/p2p_nccl_connector.md index 082dff15ef..94af8bedd2 100644 --- a/docs/design/p2p_nccl_connector.md +++ b/docs/design/p2p_nccl_connector.md @@ -5,6 +5,7 @@ An implementation of xPyD with dynamic scaling based on point-to-point communica ## Detailed Design ### Overall Process + As shown in Figure 1, the overall process of this **PD disaggregation** solution is described through a request flow: 1. The client sends an HTTP request to the Proxy/Router's `/v1/completions` interface. @@ -23,7 +24,7 @@ A simple HTTP service acts as the entry point for client requests and starts a b The Proxy/Router is responsible for selecting 1P1D based on the characteristics of the client request, such as the prompt, and generating a corresponding `request_id`, for example: -``` +```text cmpl-___prefill_addr_10.0.1.2:21001___decode_addr_10.0.1.3:22001_93923d63113b4b338973f24d19d4bf11-0 ``` @@ -70,6 +71,7 @@ pip install "vllm>=0.9.2" ## Run xPyD ### Instructions + - The following examples are run on an A800 (80GB) device, using the Meta-Llama-3.1-8B-Instruct model. - Pay attention to the setting of the `kv_buffer_size` (in bytes). The empirical value is 10% of the GPU memory size. This is related to the kvcache size. If it is too small, the GPU memory buffer for temporarily storing the received kvcache will overflow, causing the kvcache to be stored in the tensor memory pool, which increases latency. If it is too large, the kvcache available for inference will be reduced, leading to a smaller batch size and decreased throughput. - For Prefill instances, when using non-GET mode, the `kv_buffer_size` can be set to 1, as Prefill currently does not need to receive kvcache. However, when using GET mode, a larger `kv_buffer_size` is required because it needs to store the kvcache sent to the D instance. diff --git a/docs/design/prefix_caching.md b/docs/design/prefix_caching.md index 2d3c841289..fcc014cf85 100644 --- a/docs/design/prefix_caching.md +++ b/docs/design/prefix_caching.md @@ -18,10 +18,12 @@ In the example above, the KV cache in the first block can be uniquely identified * Block tokens: A tuple of tokens in this block. The reason to include the exact tokens is to reduce potential hash value collision. * Extra hashes: Other values required to make this block unique, such as LoRA IDs, multi-modality input hashes (see the example below), and cache salts to isolate caches in multi-tenant environments. -> **Note 1:** We only cache full blocks. +!!! note "Note 1" + We only cache full blocks. -> **Note 2:** The above hash key structure is not 100% collision free. Theoretically itโ€™s still possible for the different prefix tokens to have the same hash value. 
To avoid any hash collisions **in a multi-tenant setup, we advise to use SHA256** as hash function instead of the default builtin hash. -SHA256 is supported since vLLM v0.8.3 and must be enabled with a command line argument. It comes with a performance impact of about 100-200ns per token (~6ms for 50k tokens of context). +!!! note "Note 2" + The above hash key structure is not 100% collision free. Theoretically itโ€™s still possible for the different prefix tokens to have the same hash value. To avoid any hash collisions **in a multi-tenant setup, we advise to use SHA256** as hash function instead of the default builtin hash. + SHA256 is supported since vLLM v0.8.3 and must be enabled with a command line argument. It comes with a performance impact of about 100-200ns per token (~6ms for 50k tokens of context). **A hashing example with multi-modality inputs** In this example, we illustrate how prefix caching works with multi-modality inputs (e.g., images). Assuming we have a request with the following messages: @@ -92,7 +94,8 @@ To improve privacy in shared environments, vLLM supports isolating prefix cache With this setup, cache sharing is limited to users or requests that explicitly agree on a common salt, enabling cache reuse within a trust group while isolating others. -> **Note:** Cache isolation is not supported in engine V0. +!!! note + Cache isolation is not supported in engine V0. ## Data Structure diff --git a/docs/design/torch_compile.md b/docs/design/torch_compile.md index ea5d8ac212..2d76e7f3ad 100644 --- a/docs/design/torch_compile.md +++ b/docs/design/torch_compile.md @@ -8,7 +8,7 @@ Throughout the example, we will run a common Llama model using v1, and turn on d In the very verbose logs, we can see: -``` +```console INFO 03-07 03:06:55 [backends.py:409] Using cache directory: ~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0 for vLLM's torch.compile ``` @@ -75,7 +75,7 @@ Every submodule can be identified by its index, and will be processed individual In the very verbose logs, we can also see: -``` +```console DEBUG 03-07 03:52:37 [backends.py:134] store the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py') DEBUG 03-07 03:52:39 [backends.py:134] store the 1-th graph for shape None from inductor via handle ('f7fmlodmf3h3by5iiu2c4zarwoxbg4eytwr3ujdd2jphl4pospfd', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/ly/clyfzxldfsj7ehaluis2mca2omqka4r7mgcedlf6xfjh645nw6k2.py') ... @@ -93,7 +93,7 @@ One more detail: you can see that the 1-th graph and the 15-th graph have the sa If we already have the cache directory (e.g. 
run the same code for the second time), we will see the following logs: -``` +```console DEBUG 03-07 04:00:45 [backends.py:86] Directly load the 0-th graph for shape None from inductor via handle ('fpegyiq3v3wzjzphd45wkflpabggdbjpylgr7tta4hj6uplstsiw', '~/.cache/vllm/torch_compile_cache/1517964802/rank_0_0/inductor_cache/iw/ciwzrk3ittdqatuzwonnajywvno3llvjcs2vfdldzwzozn3zi3iy.py') ``` diff --git a/docs/features/compatibility_matrix.md b/docs/features/compatibility_matrix.md index 259a447984..930265b8f9 100644 --- a/docs/features/compatibility_matrix.md +++ b/docs/features/compatibility_matrix.md @@ -36,9 +36,9 @@ th:not(:first-child) { | Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | enc-dec | logP | prmpt logP | async output | multi-step | mm | best-of | beam-search | |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---| -| [CP][chunked-prefill] | โœ… | | | | | | | | | | | | | | | -| [APC](automatic_prefix_caching.md) | โœ… | โœ… | | | | | | | | | | | | | | -| [LoRA](lora.md) | โœ… | โœ… | โœ… | | | | | | | | | | | | | +| [CP][chunked-prefill] | โœ… | | | | | | | | | | | | | | +| [APC](automatic_prefix_caching.md) | โœ… | โœ… | | | | | | | | | | | | | +| [LoRA](lora.md) | โœ… | โœ… | โœ… | | | | | | | | | | | | | [SD](spec_decode.md) | โœ… | โœ… | โŒ | โœ… | | | | | | | | | | | | CUDA graph | โœ… | โœ… | โœ… | โœ… | โœ… | | | | | | | | | | | [pooling](../models/pooling_models.md) | โœ…\* | โœ…\* | โœ… | โŒ | โœ… | โœ… | | | | | | | | | diff --git a/docs/features/lora.md b/docs/features/lora.md index ea1b495138..a4e05dae11 100644 --- a/docs/features/lora.md +++ b/docs/features/lora.md @@ -119,6 +119,7 @@ export VLLM_ALLOW_RUNTIME_LORA_UPDATING=True ``` ### Using API Endpoints + Loading a LoRA Adapter: To dynamically load a LoRA adapter, send a POST request to the `/v1/load_lora_adapter` endpoint with the necessary @@ -156,6 +157,7 @@ curl -X POST http://localhost:8000/v1/unload_lora_adapter \ ``` ### Using Plugins + Alternatively, you can use the LoRAResolver plugin to dynamically load LoRA adapters. LoRAResolver plugins enable you to load LoRA adapters from both local and remote sources such as local file system and S3. On every request, when there's a new model name that hasn't been loaded yet, the LoRAResolver will try to resolve and load the corresponding LoRA adapter. You can set up multiple LoRAResolver plugins if you want to load LoRA adapters from different sources. For example, you might have one resolver for local files and another for S3 storage. vLLM will load the first LoRA adapter that it finds. diff --git a/docs/features/multimodal_inputs.md b/docs/features/multimodal_inputs.md index d4c8852206..b8677f11a1 100644 --- a/docs/features/multimodal_inputs.md +++ b/docs/features/multimodal_inputs.md @@ -588,7 +588,9 @@ Full example: /bin/bash`. + If Ray is running inside containers, run the commands in the remainder of this guide *inside the containers*, not on the host. To open a shell inside a container, connect to a node and use `docker exec -it /bin/bash`. Once a Ray cluster is running, use vLLM as you would in a single-node setting. All resources across the Ray cluster are visible to vLLM, so a single `vllm` command on a single node is sufficient. 
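For example (the model and parallel sizes are placeholders), a two-node cluster with 8 GPUs per node could be driven from any one node with a single command, using tensor parallelism within each node and pipeline parallelism across nodes:

```bash
# Run on one node only; the Ray cluster provides the GPUs on the other node.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2
```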
diff --git a/docs/serving/expert_parallel_deployment.md b/docs/serving/expert_parallel_deployment.md index d79b6fc590..280b3322b1 100644 --- a/docs/serving/expert_parallel_deployment.md +++ b/docs/serving/expert_parallel_deployment.md @@ -31,11 +31,12 @@ vLLM provides three communication backends for EP: Enable EP by setting the `--enable-expert-parallel` flag. The EP size is automatically calculated as: -``` +```text EP_SIZE = TP_SIZE ร— DP_SIZE ``` Where: + - `TP_SIZE`: Tensor parallel size (always 1 for now) - `DP_SIZE`: Data parallel size - `EP_SIZE`: Expert parallel size (computed automatically) diff --git a/docs/serving/openai_compatible_server.md b/docs/serving/openai_compatible_server.md index 4eb2ea2731..dfed15d4ac 100644 --- a/docs/serving/openai_compatible_server.md +++ b/docs/serving/openai_compatible_server.md @@ -206,6 +206,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai We support both [Vision](https://platform.openai.com/docs/guides/vision)- and [Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters; see our [Multimodal Inputs](../features/multimodal_inputs.md) guide for more information. + - *Note: `image_url.detail` parameter is not supported.* Code example: diff --git a/docs/usage/security.md b/docs/usage/security.md index 76140434dc..d54e2bb37e 100644 --- a/docs/usage/security.md +++ b/docs/usage/security.md @@ -13,15 +13,18 @@ All communications between nodes in a multi-node vLLM deployment are **insecure The following options control inter-node communications in vLLM: #### 1. **Environment Variables:** - - `VLLM_HOST_IP`: Sets the IP address for vLLM processes to communicate on + +- `VLLM_HOST_IP`: Sets the IP address for vLLM processes to communicate on #### 2. **KV Cache Transfer Configuration:** - - `--kv-ip`: The IP address for KV cache transfer communications (default: 127.0.0.1) - - `--kv-port`: The port for KV cache transfer communications (default: 14579) + +- `--kv-ip`: The IP address for KV cache transfer communications (default: 127.0.0.1) +- `--kv-port`: The port for KV cache transfer communications (default: 14579) #### 3. **Data Parallel Configuration:** - - `data_parallel_master_ip`: IP of the data parallel master (default: 127.0.0.1) - - `data_parallel_master_port`: Port of the data parallel master (default: 29500) + +- `data_parallel_master_ip`: IP of the data parallel master (default: 127.0.0.1) +- `data_parallel_master_port`: Port of the data parallel master (default: 29500) ### Notes on PyTorch Distributed @@ -41,18 +44,21 @@ Key points from the PyTorch security guide: ### Security Recommendations #### 1. **Network Isolation:** - - Deploy vLLM nodes on a dedicated, isolated network - - Use network segmentation to prevent unauthorized access - - Implement appropriate firewall rules + +- Deploy vLLM nodes on a dedicated, isolated network +- Use network segmentation to prevent unauthorized access +- Implement appropriate firewall rules #### 2. **Configuration Best Practices:** - - Always set `VLLM_HOST_IP` to a specific IP address rather than using defaults - - Configure firewalls to only allow necessary ports between nodes + +- Always set `VLLM_HOST_IP` to a specific IP address rather than using defaults +- Configure firewalls to only allow necessary ports between nodes #### 3. 
**Access Control:** - - Restrict physical and network access to the deployment environment - - Implement proper authentication and authorization for management interfaces - - Follow the principle of least privilege for all system components + +- Restrict physical and network access to the deployment environment +- Implement proper authentication and authorization for management interfaces +- Follow the principle of least privilege for all system components ## Security and Firewalls: Protecting Exposed vLLM Systems diff --git a/docs/usage/v1_guide.md b/docs/usage/v1_guide.md index 498ff3da0c..38399c6633 100644 --- a/docs/usage/v1_guide.md +++ b/docs/usage/v1_guide.md @@ -148,7 +148,7 @@ are not yet supported. vLLM V1 supports logprobs and prompt logprobs. However, there are some important semantic differences compared to V0: -**Logprobs Calculation** +##### Logprobs Calculation Logprobs in V1 are now returned immediately once computed from the modelโ€™s raw output (i.e. before applying any logits post-processing such as temperature scaling or penalty @@ -157,7 +157,7 @@ probabilities used during sampling. Support for logprobs with post-sampling adjustments is in progress and will be added in future updates. -**Prompt Logprobs with Prefix Caching** +##### Prompt Logprobs with Prefix Caching Currently prompt logprobs are only supported when prefix caching is turned off via `--no-enable-prefix-caching`. In a future release, prompt logprobs will be compatible with prefix caching, but a recomputation will be triggered to recover the full prompt logprobs even upon a prefix cache hit. See details in [RFC #13414](gh-issue:13414). @@ -165,7 +165,7 @@ Currently prompt logprobs are only supported when prefix caching is turned off v As part of the major architectural rework in vLLM V1, several legacy features have been deprecated. -**Sampling features** +##### Sampling features - **best_of**: This feature has been deprecated due to limited usage. See details at [RFC #13361](gh-issue:13361). - **Per-Request Logits Processors**: In V0, users could pass custom @@ -173,11 +173,11 @@ As part of the major architectural rework in vLLM V1, several legacy features ha feature has been deprecated. Instead, the design is moving toward supporting **global logits processors**, a feature the team is actively working on for future releases. See details at [RFC #13360](gh-pr:13360). -**KV Cache features** +##### KV Cache features - **GPU <> CPU KV Cache Swapping**: with the new simplified core architecture, vLLM V1 no longer requires KV cache swapping to handle request preemptions. -**Structured Output features** +##### Structured Output features - **Request-level Structured Output Backend**: Deprecated, alternative backends (outlines, guidance) with fallbacks is supported now. diff --git a/examples/offline_inference/disaggregated-prefill-v1/README.md b/examples/offline_inference/disaggregated-prefill-v1/README.md index 9cbdb19820..abf6883f8d 100644 --- a/examples/offline_inference/disaggregated-prefill-v1/README.md +++ b/examples/offline_inference/disaggregated-prefill-v1/README.md @@ -5,6 +5,6 @@ This example contains scripts that demonstrate disaggregated prefill in the offl ## Files - `run.sh` - A helper script that will run `prefill_example.py` and `decode_example.py` sequentially. - - Make sure you are in the `examples/offline_inference/disaggregated-prefill-v1` directory before running `run.sh`. + - Make sure you are in the `examples/offline_inference/disaggregated-prefill-v1` directory before running `run.sh`. 
- `prefill_example.py` - A script which performs prefill only, saving the KV state to the `local_storage` directory and the prompts to `output.txt`.
- `decode_example.py` - A script which performs decode only, loading the KV state from the `local_storage` directory and the prompts from `output.txt`.
diff --git a/examples/offline_inference/openai_batch/README.md b/examples/offline_inference/openai_batch/README.md
index 631fde91fc..3c6f6c7a6c 100644
--- a/examples/offline_inference/openai_batch/README.md
+++ b/examples/offline_inference/openai_batch/README.md
@@ -19,9 +19,9 @@ We currently support `/v1/chat/completions`, `/v1/embeddings`, and `/v1/score` e

## Pre-requisites

* The examples in this document use `meta-llama/Meta-Llama-3-8B-Instruct`.
- - Create a [user access token](https://huggingface.co/docs/hub/en/security-tokens)
- - Install the token on your machine (Run `huggingface-cli login`).
- - Get access to the gated model by [visiting the model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and agreeing to the terms and conditions.
+ * Create a [user access token](https://huggingface.co/docs/hub/en/security-tokens)
+ * Install the token on your machine (Run `huggingface-cli login`).
+ * Get access to the gated model by [visiting the model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and agreeing to the terms and conditions.

## Example 1: Running with a local file

@@ -105,7 +105,7 @@ To integrate with cloud blob storage, we recommend using presigned urls.

* [Create an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html).
* The `awscli` package (Run `pip install awscli`) to configure your credentials and interactively use s3.
- - [Configure your credentials](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html).
+ * [Configure your credentials](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html).
* The `boto3` python package (Run `pip install boto3`) to generate presigned urls.

### Step 1: Upload your input script

diff --git a/examples/others/lmcache/README.md b/examples/others/lmcache/README.md
index 95a6bf995b..759be55d6f 100644
--- a/examples/others/lmcache/README.md
+++ b/examples/others/lmcache/README.md
@@ -28,16 +28,20 @@ to run disaggregated prefill and benchmark the performance.

### Components

#### Server Scripts
+
- `disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh` - Launches individual vLLM servers for prefill/decode, and also launches the proxy server.
- `disagg_prefill_lmcache_v1/disagg_proxy_server.py` - FastAPI proxy server that coordinates between prefiller and decoder
- `disagg_prefill_lmcache_v1/disagg_example_nixl.sh` - Main script to run the example

#### Configuration
+
- `disagg_prefill_lmcache_v1/configs/lmcache-prefiller-config.yaml` - Configuration for prefiller server
- `disagg_prefill_lmcache_v1/configs/lmcache-decoder-config.yaml` - Configuration for decoder server

#### Log Files
+
The main script generates several log files:
+
- `prefiller.log` - Logs from the prefill server
- `decoder.log` - Logs from the decode server
- `proxy.log` - Logs from the proxy server
diff --git a/examples/others/logging_configuration.md b/examples/others/logging_configuration.md
index 916ab5fd1c..7c8bdd199a 100644
--- a/examples/others/logging_configuration.md
+++ b/examples/others/logging_configuration.md
@@ -8,11 +8,11 @@ of logging configurations that range from
simple-and-inflexible to
more-complex-and-more-flexible.
- No vLLM logging (simple and inflexible)
- - Set `VLLM_CONFIGURE_LOGGING=0` (leaving `VLLM_LOGGING_CONFIG_PATH` unset)
+ - Set `VLLM_CONFIGURE_LOGGING=0` (leaving `VLLM_LOGGING_CONFIG_PATH` unset)
- vLLM's default logging configuration (simple and inflexible)
- - Leave `VLLM_CONFIGURE_LOGGING` unset or set `VLLM_CONFIGURE_LOGGING=1`
+ - Leave `VLLM_CONFIGURE_LOGGING` unset or set `VLLM_CONFIGURE_LOGGING=1`
- Fine-grained custom logging configuration (more complex, more flexible)
- - Leave `VLLM_CONFIGURE_LOGGING` unset or set `VLLM_CONFIGURE_LOGGING=1` and
+ - Leave `VLLM_CONFIGURE_LOGGING` unset or set `VLLM_CONFIGURE_LOGGING=1` and
  set `VLLM_LOGGING_CONFIG_PATH=`

## Logging Configuration Environment Variables

diff --git a/pyproject.toml b/pyproject.toml
index a65267942d..dfad5d2cdf 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -156,16 +156,6 @@ markers = [
  "optional: optional tests that are automatically skipped, include --optional to run them",
]

-[tool.pymarkdown]
-plugins.md004.style = "sublist" # ul-style
-plugins.md007.indent = 4 # ul-indent
-plugins.md007.start_indented = true # ul-indent
-plugins.md013.enabled = false # line-length
-plugins.md041.enabled = false # first-line-h1
-plugins.md033.enabled = false # inline-html
-plugins.md046.enabled = false # code-block-style
-plugins.md024.allow_different_nesting = true # no-duplicate-headers
-
[tool.ty.src]
root = "./vllm"
respect-ignore-files = true
diff --git a/tools/ep_kernels/README.md b/tools/ep_kernels/README.md
index f1479146f0..273e0f378e 100644
--- a/tools/ep_kernels/README.md
+++ b/tools/ep_kernels/README.md
@@ -1,6 +1,9 @@
+# Expert parallel kernels
+
Large-scale cluster-level expert parallel, as described in the [DeepSeek-V3 Technical Report](http://arxiv.org/abs/2412.19437), is an efficient way to deploy sparse MoE models with many experts. However, such deployment requires many components beyond a normal Python package, including system package support and system driver support. It is impossible to bundle all these components into a Python package.

Here we break down the requirements in 2 steps:
+
1. Build and install the Python libraries (both [pplx-kernels](https://github.com/ppl-ai/pplx-kernels) and [DeepEP](https://github.com/deepseek-ai/DeepEP)), including necessary dependencies like NVSHMEM. This step does not require any privileged access. Any user can do this.
2. Configure NVIDIA driver to enable IBGDA. This step requires root access, and must be done on the host machine.
@@ -8,15 +11,15 @@ Here we break down the requirements in 2 steps:

All scripts accept a positional argument as workspace path for staging the build, defaulting to `$(pwd)/ep_kernels_workspace`.

-# Usage
+## Usage

-## Single-node
+### Single-node

```bash
bash install_python_libraries.sh
```

-## Multi-node
+### Multi-node

```bash
bash install_python_libraries.sh
diff --git a/vllm/plugins/lora_resolvers/README.md b/vllm/plugins/lora_resolvers/README.md
index 7e7c55f5c6..48f27dddea 100644
--- a/vllm/plugins/lora_resolvers/README.md
+++ b/vllm/plugins/lora_resolvers/README.md
@@ -6,7 +6,8 @@ via the LoRAResolver plugin framework.
Note that `VLLM_ALLOW_RUNTIME_LORA_UPDATING` must be set to true to allow LoRA resolver plugins
to work, and `VLLM_PLUGINS` must be set to include the desired resolver plugins.

-# lora_filesystem_resolver
+## lora_filesystem_resolver
+
This LoRA Resolver is installed with vLLM by default. To use, set `VLLM_PLUGIN_LORA_CACHE_DIR` to a local directory.
When vLLM receives a request for a LoRA adapter `foobar` it doesn't currently recognize, it will look in that local directory
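
As an illustration of the resolver workflow described above, here is a minimal sketch of how it could be exercised end to end. It assumes the plugin is registered under the name `lora_filesystem_resolver` (matching the heading above), that the server is a standard OpenAI-compatible `vllm serve` endpoint on port 8000, and that the cache directory already contains a `foobar/` subdirectory holding a valid LoRA adapter; all names and paths are illustrative only.

```bash
# Sketch: enable the filesystem LoRA resolver (illustrative names/paths).
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=true
export VLLM_PLUGINS=lora_filesystem_resolver       # assumed plugin name
export VLLM_PLUGIN_LORA_CACHE_DIR=$HOME/lora-cache # contains ./foobar/ with adapter files

vllm serve meta-llama/Meta-Llama-3-8B-Instruct --enable-lora

# From another shell, request the adapter by name; on a cache miss, the resolver
# looks for a matching subdirectory in VLLM_PLUGIN_LORA_CACHE_DIR and loads it.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "foobar", "prompt": "Hello", "max_tokens": 16}'
```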