-- Install the dependencies:
+vLLM is a fast and easy-to-use library for LLM inference and serving.
-```bash
-pip install -r ../requirements/docs.txt
-```
+Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
-- Clean the previous build (optional but recommended):
+vLLM is fast with:
-```bash
-make clean
-```
+- State-of-the-art serving throughput
+- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
+- Continuous batching of incoming requests
+- Fast model execution with CUDA/HIP graph
+- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
+- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
+- Speculative decoding
+- Chunked prefill
-- Generate the HTML documentation:
+vLLM is flexible and easy to use with:
-```bash
-make html
-```
+- Seamless integration with popular HuggingFace models
+- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
+- Tensor parallelism and pipeline parallelism support for distributed inference
+- Streaming outputs
+- OpenAI-compatible API server
+- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPUs, and AWS Trainium and Inferentia accelerators
+- Prefix caching support
+- Multi-LoRA support
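+
+As a quick taste of the Python API, here is a minimal offline-inference sketch (the model name is only an example; any supported HuggingFace model works, and you can add `tensor_parallel_size=N` for multi-GPU runs):
+
+```python
+from vllm import LLM, SamplingParams
+
+# Load any supported HuggingFace model.
+llm = LLM(model="facebook/opt-125m")
+params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
+
+outputs = llm.generate(["Hello, my name is"], params)
+print(outputs[0].outputs[0].text)
+```
+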
-## Open the docs with your browser
+For more information, check out the following:
-- Serve the documentation locally:
-
-```bash
-python -m http.server -d build/html/
-```
-
-This will start a local server at http://localhost:8000. You can now open your browser and view the documentation.
-
-If port 8000 is already in use, you can specify a different port, for example:
-
-```bash
-python -m http.server 3000 -d build/html/
-```
+- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
+- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
+- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
+- [vLLM Meetups][meetups]
diff --git a/docs/api/README.md b/docs/api/README.md
new file mode 100644
index 0000000000..5c7b2ca79e
--- /dev/null
+++ b/docs/api/README.md
@@ -0,0 +1,107 @@
+# Summary
+
+[](){ #configuration }
+
+## Configuration
+
+API documentation for vLLM's configuration classes.
+
+- [vllm.config.ModelConfig][]
+- [vllm.config.CacheConfig][]
+- [vllm.config.TokenizerPoolConfig][]
+- [vllm.config.LoadConfig][]
+- [vllm.config.ParallelConfig][]
+- [vllm.config.SchedulerConfig][]
+- [vllm.config.DeviceConfig][]
+- [vllm.config.SpeculativeConfig][]
+- [vllm.config.LoRAConfig][]
+- [vllm.config.PromptAdapterConfig][]
+- [vllm.config.MultiModalConfig][]
+- [vllm.config.PoolerConfig][]
+- [vllm.config.DecodingConfig][]
+- [vllm.config.ObservabilityConfig][]
+- [vllm.config.KVTransferConfig][]
+- [vllm.config.CompilationConfig][]
+- [vllm.config.VllmConfig][]
+
+[](){ #offline-inference-api }
+
+## Offline Inference
+
+LLM Class.
+
+- [vllm.LLM][]
+
+LLM Inputs.
+
+- [vllm.inputs.PromptType][]
+- [vllm.inputs.TextPrompt][]
+- [vllm.inputs.TokensPrompt][]
+
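+For example, these prompt types can be constructed and passed directly to `LLM.generate` (a brief sketch; the model name and token IDs are illustrative):
+
+```python
+from vllm import LLM
+from vllm.inputs import TextPrompt, TokensPrompt
+
+llm = LLM(model="facebook/opt-125m")
+
+# Plain text prompt
+llm.generate(TextPrompt(prompt="Hello, my name is"))
+
+# Pre-tokenized prompt (use IDs from your own tokenizer)
+llm.generate(TokensPrompt(prompt_token_ids=[1, 2, 3]))
+```
+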
+## vLLM Engines
+
+Engine classes for offline and online inference.
+
+- [vllm.LLMEngine][]
+- [vllm.AsyncLLMEngine][]
+
+## Inference Parameters
+
+Inference parameters for vLLM APIs.
+
+[](){ #sampling-params }
+[](){ #pooling-params }
+
+- [vllm.SamplingParams][]
+- [vllm.PoolingParams][]
+
+[](){ #multi-modality }
+
+## Multi-Modality
+
+vLLM provides experimental support for multi-modal models through the [vllm.multimodal][] package.
+
+Multi-modal inputs can be passed alongside text and token prompts to [supported models][supported-mm-models]
+via the `multi_modal_data` field in [vllm.inputs.PromptType][].
+
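+For example (a brief sketch; the model name and image path are placeholders), an image can be passed together with a text prompt:
+
+```python
+from PIL import Image
+
+from vllm import LLM
+
+llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # any supported multi-modal model
+image = Image.open("example.jpg")            # placeholder image
+
+outputs = llm.generate({
+    "prompt": "USER: <image>\nDescribe this image. ASSISTANT:",
+    "multi_modal_data": {"image": image},
+})
+```
+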
+Looking to add your own multi-modal model? Please follow the instructions listed [here][supports-multimodal].
+
+- [vllm.multimodal.MULTIMODAL_REGISTRY][]
+
+### Inputs
+
+User-facing inputs.
+
+- [vllm.multimodal.inputs.MultiModalDataDict][]
+
+Internal data structures.
+
+- [vllm.multimodal.inputs.PlaceholderRange][]
+- [vllm.multimodal.inputs.NestedTensors][]
+- [vllm.multimodal.inputs.MultiModalFieldElem][]
+- [vllm.multimodal.inputs.MultiModalFieldConfig][]
+- [vllm.multimodal.inputs.MultiModalKwargsItem][]
+- [vllm.multimodal.inputs.MultiModalKwargs][]
+- [vllm.multimodal.inputs.MultiModalInputs][]
+
+### Data Parsing
+
+- [vllm.multimodal.parse][]
+
+### Data Processing
+
+- [vllm.multimodal.processing][]
+
+### Memory Profiling
+
+- [vllm.multimodal.profiling][]
+
+### Registry
+
+- [vllm.multimodal.registry][]
+
+## Model Development
+
+- [vllm.model_executor.models.interfaces_base][]
+- [vllm.model_executor.models.interfaces][]
+- [vllm.model_executor.models.adapters][]
diff --git a/docs/api/vllm/.meta.yml b/docs/api/vllm/.meta.yml
new file mode 100644
index 0000000000..c15adfec64
--- /dev/null
+++ b/docs/api/vllm/.meta.yml
@@ -0,0 +1,2 @@
+search:
+ boost: 0.5
diff --git a/docs/assets/contributing/dockerfile-stages-dependency.png b/docs/assets/contributing/dockerfile-stages-dependency.png
new file mode 100644
index 0000000000..0838bfa37f
Binary files /dev/null and b/docs/assets/contributing/dockerfile-stages-dependency.png differ
diff --git a/docs/source/assets/deployment/anything-llm-chat-with-doc.png b/docs/assets/deployment/anything-llm-chat-with-doc.png
similarity index 100%
rename from docs/source/assets/deployment/anything-llm-chat-with-doc.png
rename to docs/assets/deployment/anything-llm-chat-with-doc.png
diff --git a/docs/source/assets/deployment/anything-llm-chat-without-doc.png b/docs/assets/deployment/anything-llm-chat-without-doc.png
similarity index 100%
rename from docs/source/assets/deployment/anything-llm-chat-without-doc.png
rename to docs/assets/deployment/anything-llm-chat-without-doc.png
diff --git a/docs/source/assets/deployment/anything-llm-provider.png b/docs/assets/deployment/anything-llm-provider.png
similarity index 100%
rename from docs/source/assets/deployment/anything-llm-provider.png
rename to docs/assets/deployment/anything-llm-provider.png
diff --git a/docs/source/assets/deployment/anything-llm-upload-doc.png b/docs/assets/deployment/anything-llm-upload-doc.png
similarity index 100%
rename from docs/source/assets/deployment/anything-llm-upload-doc.png
rename to docs/assets/deployment/anything-llm-upload-doc.png
diff --git a/docs/source/assets/deployment/architecture_helm_deployment.png b/docs/assets/deployment/architecture_helm_deployment.png
similarity index 100%
rename from docs/source/assets/deployment/architecture_helm_deployment.png
rename to docs/assets/deployment/architecture_helm_deployment.png
diff --git a/docs/source/assets/deployment/chatbox-chat.png b/docs/assets/deployment/chatbox-chat.png
similarity index 100%
rename from docs/source/assets/deployment/chatbox-chat.png
rename to docs/assets/deployment/chatbox-chat.png
diff --git a/docs/source/assets/deployment/chatbox-settings.png b/docs/assets/deployment/chatbox-settings.png
similarity index 100%
rename from docs/source/assets/deployment/chatbox-settings.png
rename to docs/assets/deployment/chatbox-settings.png
diff --git a/docs/source/assets/deployment/dify-chat.png b/docs/assets/deployment/dify-chat.png
similarity index 100%
rename from docs/source/assets/deployment/dify-chat.png
rename to docs/assets/deployment/dify-chat.png
diff --git a/docs/source/assets/deployment/dify-create-chatbot.png b/docs/assets/deployment/dify-create-chatbot.png
similarity index 100%
rename from docs/source/assets/deployment/dify-create-chatbot.png
rename to docs/assets/deployment/dify-create-chatbot.png
diff --git a/docs/source/assets/deployment/dify-settings.png b/docs/assets/deployment/dify-settings.png
similarity index 100%
rename from docs/source/assets/deployment/dify-settings.png
rename to docs/assets/deployment/dify-settings.png
diff --git a/docs/source/assets/deployment/open_webui.png b/docs/assets/deployment/open_webui.png
similarity index 100%
rename from docs/source/assets/deployment/open_webui.png
rename to docs/assets/deployment/open_webui.png
diff --git a/docs/source/assets/deployment/streamlit-chat.png b/docs/assets/deployment/streamlit-chat.png
similarity index 100%
rename from docs/source/assets/deployment/streamlit-chat.png
rename to docs/assets/deployment/streamlit-chat.png
diff --git a/docs/source/assets/design/arch_overview/entrypoints.excalidraw.png b/docs/assets/design/arch_overview/entrypoints.excalidraw.png
similarity index 100%
rename from docs/source/assets/design/arch_overview/entrypoints.excalidraw.png
rename to docs/assets/design/arch_overview/entrypoints.excalidraw.png
diff --git a/docs/source/assets/design/arch_overview/llm_engine.excalidraw.png b/docs/assets/design/arch_overview/llm_engine.excalidraw.png
similarity index 100%
rename from docs/source/assets/design/arch_overview/llm_engine.excalidraw.png
rename to docs/assets/design/arch_overview/llm_engine.excalidraw.png
diff --git a/docs/source/assets/design/hierarchy.png b/docs/assets/design/hierarchy.png
similarity index 100%
rename from docs/source/assets/design/hierarchy.png
rename to docs/assets/design/hierarchy.png
diff --git a/docs/source/assets/design/v1/metrics/intervals-1.png b/docs/assets/design/v1/metrics/intervals-1.png
similarity index 100%
rename from docs/source/assets/design/v1/metrics/intervals-1.png
rename to docs/assets/design/v1/metrics/intervals-1.png
diff --git a/docs/source/assets/design/v1/metrics/intervals-2.png b/docs/assets/design/v1/metrics/intervals-2.png
similarity index 100%
rename from docs/source/assets/design/v1/metrics/intervals-2.png
rename to docs/assets/design/v1/metrics/intervals-2.png
diff --git a/docs/source/assets/design/v1/metrics/intervals-3.png b/docs/assets/design/v1/metrics/intervals-3.png
similarity index 100%
rename from docs/source/assets/design/v1/metrics/intervals-3.png
rename to docs/assets/design/v1/metrics/intervals-3.png
diff --git a/docs/source/assets/design/v1/prefix_caching/example-time-1.png b/docs/assets/design/v1/prefix_caching/example-time-1.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/example-time-1.png
rename to docs/assets/design/v1/prefix_caching/example-time-1.png
diff --git a/docs/source/assets/design/v1/prefix_caching/example-time-3.png b/docs/assets/design/v1/prefix_caching/example-time-3.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/example-time-3.png
rename to docs/assets/design/v1/prefix_caching/example-time-3.png
diff --git a/docs/source/assets/design/v1/prefix_caching/example-time-4.png b/docs/assets/design/v1/prefix_caching/example-time-4.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/example-time-4.png
rename to docs/assets/design/v1/prefix_caching/example-time-4.png
diff --git a/docs/source/assets/design/v1/prefix_caching/example-time-5.png b/docs/assets/design/v1/prefix_caching/example-time-5.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/example-time-5.png
rename to docs/assets/design/v1/prefix_caching/example-time-5.png
diff --git a/docs/source/assets/design/v1/prefix_caching/example-time-6.png b/docs/assets/design/v1/prefix_caching/example-time-6.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/example-time-6.png
rename to docs/assets/design/v1/prefix_caching/example-time-6.png
diff --git a/docs/source/assets/design/v1/prefix_caching/example-time-7.png b/docs/assets/design/v1/prefix_caching/example-time-7.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/example-time-7.png
rename to docs/assets/design/v1/prefix_caching/example-time-7.png
diff --git a/docs/source/assets/design/v1/prefix_caching/free.png b/docs/assets/design/v1/prefix_caching/free.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/free.png
rename to docs/assets/design/v1/prefix_caching/free.png
diff --git a/docs/source/assets/design/v1/prefix_caching/overview.png b/docs/assets/design/v1/prefix_caching/overview.png
similarity index 100%
rename from docs/source/assets/design/v1/prefix_caching/overview.png
rename to docs/assets/design/v1/prefix_caching/overview.png
diff --git a/docs/source/assets/features/disagg_prefill/abstraction.jpg b/docs/assets/features/disagg_prefill/abstraction.jpg
similarity index 100%
rename from docs/source/assets/features/disagg_prefill/abstraction.jpg
rename to docs/assets/features/disagg_prefill/abstraction.jpg
diff --git a/docs/source/assets/features/disagg_prefill/overview.jpg b/docs/assets/features/disagg_prefill/overview.jpg
similarity index 100%
rename from docs/source/assets/features/disagg_prefill/overview.jpg
rename to docs/assets/features/disagg_prefill/overview.jpg
diff --git a/docs/source/assets/kernel/k_vecs.png b/docs/assets/kernel/k_vecs.png
similarity index 100%
rename from docs/source/assets/kernel/k_vecs.png
rename to docs/assets/kernel/k_vecs.png
diff --git a/docs/source/assets/kernel/key.png b/docs/assets/kernel/key.png
similarity index 100%
rename from docs/source/assets/kernel/key.png
rename to docs/assets/kernel/key.png
diff --git a/docs/source/assets/kernel/logits_vec.png b/docs/assets/kernel/logits_vec.png
similarity index 100%
rename from docs/source/assets/kernel/logits_vec.png
rename to docs/assets/kernel/logits_vec.png
diff --git a/docs/source/assets/kernel/q_vecs.png b/docs/assets/kernel/q_vecs.png
similarity index 100%
rename from docs/source/assets/kernel/q_vecs.png
rename to docs/assets/kernel/q_vecs.png
diff --git a/docs/source/assets/kernel/query.png b/docs/assets/kernel/query.png
similarity index 100%
rename from docs/source/assets/kernel/query.png
rename to docs/assets/kernel/query.png
diff --git a/docs/source/assets/kernel/v_vec.png b/docs/assets/kernel/v_vec.png
similarity index 100%
rename from docs/source/assets/kernel/v_vec.png
rename to docs/assets/kernel/v_vec.png
diff --git a/docs/source/assets/kernel/value.png b/docs/assets/kernel/value.png
similarity index 100%
rename from docs/source/assets/kernel/value.png
rename to docs/assets/kernel/value.png
diff --git a/docs/source/assets/logos/vllm-logo-only-light.ico b/docs/assets/logos/vllm-logo-only-light.ico
similarity index 100%
rename from docs/source/assets/logos/vllm-logo-only-light.ico
rename to docs/assets/logos/vllm-logo-only-light.ico
diff --git a/docs/source/assets/logos/vllm-logo-only-light.png b/docs/assets/logos/vllm-logo-only-light.png
similarity index 100%
rename from docs/source/assets/logos/vllm-logo-only-light.png
rename to docs/assets/logos/vllm-logo-only-light.png
diff --git a/docs/source/assets/logos/vllm-logo-text-dark.png b/docs/assets/logos/vllm-logo-text-dark.png
similarity index 100%
rename from docs/source/assets/logos/vllm-logo-text-dark.png
rename to docs/assets/logos/vllm-logo-text-dark.png
diff --git a/docs/source/assets/logos/vllm-logo-text-light.png b/docs/assets/logos/vllm-logo-text-light.png
similarity index 100%
rename from docs/source/assets/logos/vllm-logo-text-light.png
rename to docs/assets/logos/vllm-logo-text-light.png
diff --git a/docs/source/community/meetups.md b/docs/community/meetups.md
similarity index 98%
rename from docs/source/community/meetups.md
rename to docs/community/meetups.md
index aa1a71c86c..2c47be443a 100644
--- a/docs/source/community/meetups.md
+++ b/docs/community/meetups.md
@@ -1,6 +1,7 @@
-(meetups)=
-
-# vLLM Meetups
+---
+title: vLLM Meetups
+---
+[](){ #meetups }
We host regular meetups in the San Francisco Bay Area every 2 months. We share project updates from the vLLM team and invite guest speakers from industry to share their experience and insights. Please find the materials of our previous meetups below:
diff --git a/docs/source/community/sponsors.md b/docs/community/sponsors.md
similarity index 100%
rename from docs/source/community/sponsors.md
rename to docs/community/sponsors.md
diff --git a/docs/source/contributing/deprecation_policy.md b/docs/contributing/deprecation_policy.md
similarity index 100%
rename from docs/source/contributing/deprecation_policy.md
rename to docs/contributing/deprecation_policy.md
diff --git a/docs/source/contributing/dockerfile/dockerfile.md b/docs/contributing/dockerfile/dockerfile.md
similarity index 89%
rename from docs/source/contributing/dockerfile/dockerfile.md
rename to docs/contributing/dockerfile/dockerfile.md
index 90b9a33cfb..3765996cb0 100644
--- a/docs/source/contributing/dockerfile/dockerfile.md
+++ b/docs/contributing/dockerfile/dockerfile.md
@@ -1,7 +1,7 @@
# Dockerfile
We provide a Dockerfile to construct the image for running an OpenAI-compatible server with vLLM.
-More information about deploying with Docker can be found [here](#deployment-docker).
+More information about deploying with Docker can be found [here][deployment-docker].
Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:
@@ -17,11 +17,9 @@ The edges of the build graph represent:
- `RUN --mount=(.\*)from=...` dependencies (with a dotted line and an empty diamond arrow head)
- > :::{figure} /assets/contributing/dockerfile-stages-dependency.png
- > :align: center
- > :alt: query
- > :width: 100%
- > :::
+ >
+ > ![](/assets/contributing/dockerfile-stages-dependency.png){ align="center" alt="query" width="100%" }
+ >
>
> Made using:
>
diff --git a/docs/contributing/model/README.md b/docs/contributing/model/README.md
new file mode 100644
index 0000000000..b7727f02c1
--- /dev/null
+++ b/docs/contributing/model/README.md
@@ -0,0 +1,23 @@
+---
+title: Adding a New Model
+---
+[](){ #new-model }
+
+This section provides more information on how to integrate a [PyTorch](https://pytorch.org/) model into vLLM.
+
+Contents:
+
+- [Basic](basic.md)
+- [Registration](registration.md)
+- [Tests](tests.md)
+- [Multimodal](multimodal.md)
+
+!!! note
+ The complexity of adding a new model depends heavily on the model's architecture.
+    The process is considerably more straightforward if the model shares a similar architecture with an existing model in vLLM.
+ However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
+
+!!! tip
+ If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)
+ or ask on our [developer slack](https://slack.vllm.ai).
+ We will be happy to help you out!
diff --git a/docs/source/contributing/model/basic.md b/docs/contributing/model/basic.md
similarity index 87%
rename from docs/source/contributing/model/basic.md
rename to docs/contributing/model/basic.md
index 1fa56dc472..0c0ba33792 100644
--- a/docs/source/contributing/model/basic.md
+++ b/docs/contributing/model/basic.md
@@ -1,6 +1,7 @@
-(new-model-basic)=
-
-# Implementing a Basic Model
+---
+title: Implementing a Basic Model
+---
+[](){ #new-model-basic }
This guide walks you through the steps to implement a basic vLLM model.
@@ -10,9 +11,8 @@ First, clone the PyTorch model code from the source repository.
For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from
HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.
-:::{warning}
-Make sure to review and adhere to the original code's copyright and licensing terms!
-:::
+!!! warning
+ Make sure to review and adhere to the original code's copyright and licensing terms!
## 2. Make your code compatible with vLLM
@@ -67,7 +67,7 @@ class MyModel(nn.Module):
...
```
-- Rewrite the {meth}`~torch.nn.Module.forward` method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat `input_ids` and `positions` as flattened tensors with a single batch size dimension, without a max-sequence length dimension.
+- Rewrite the [forward][torch.nn.Module.forward] method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat `input_ids` and `positions` as flattened tensors with a single batch size dimension, without a max-sequence length dimension.
```python
def forward(
@@ -78,10 +78,9 @@ def forward(
...
```
-:::{note}
-Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
-If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
-:::
+!!! note
+ Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
+ If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out the other implementations in the same directory for more examples.
@@ -89,7 +88,7 @@ For reference, check out our [Llama implementation](gh-file:vllm/model_executor/
If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it.
To do this, substitute your model's linear and embedding layers with their tensor-parallel versions.
-For the embedding layer, you can simply replace {class}`torch.nn.Embedding` with `VocabParallelEmbedding`. For the output LM head, you can use `ParallelLMHead`.
+For the embedding layer, you can simply replace [torch.nn.Embedding][] with `VocabParallelEmbedding`. For the output LM head, you can use `ParallelLMHead`.
When it comes to the linear layers, we provide the following options to parallelize them:
- `ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
@@ -107,7 +106,7 @@ This method should load the weights from the HuggingFace's checkpoint file and a
## 5. Register your model
-See [this page](#new-model-registration) for instructions on how to register your new model to be used by vLLM.
+See [this page][new-model-registration] for instructions on how to register your new model to be used by vLLM.
## Frequently Asked Questions
diff --git a/docs/contributing/model/multimodal.md b/docs/contributing/model/multimodal.md
new file mode 100644
index 0000000000..892ab90984
--- /dev/null
+++ b/docs/contributing/model/multimodal.md
@@ -0,0 +1,803 @@
+---
+title: Multi-Modal Support
+---
+[](){ #supports-multimodal }
+
+This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs][multimodal-inputs].
+
+## 1. Update the base vLLM model
+
+It is assumed that you have already implemented the model in vLLM according to [these steps][new-model-basic].
+Further update the model as follows:
+
+- Reserve a keyword parameter in [forward][torch.nn.Module.forward] for each input tensor that corresponds to a multi-modal input, as shown in the following example:
+
+ ```diff
+ def forward(
+ self,
+ input_ids: torch.Tensor,
+ positions: torch.Tensor,
+ + pixel_values: torch.Tensor,
+ ) -> SamplerOutput:
+ ```
+
+ More conveniently, you can simply pass `**kwargs` to the [forward][torch.nn.Module.forward] method and retrieve the keyword parameters for multimodal inputs from it.
+
+- Implement [get_multimodal_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings] that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.
+
+ ```python
+ class YourModelForImage2Seq(nn.Module):
+ ...
+
+ def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor:
+
+ assert self.vision_encoder is not None
+ image_features = self.vision_encoder(image_input)
+ return self.multi_modal_projector(image_features)
+
+ def get_multimodal_embeddings(
+ self, **kwargs: object) -> Optional[MultiModalEmbeddings]:
+
+ # Validate the multimodal input keyword arguments
+ image_input = self._parse_and_validate_image_input(**kwargs)
+ if image_input is None:
+ return None
+
+ # Run multimodal inputs through encoder and projector
+ vision_embeddings = self._process_image_input(image_input)
+ return vision_embeddings
+ ```
+
+!!! warning
+    The returned `multimodal_embeddings` must be either a **3D [torch.Tensor][]** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D [torch.Tensor][]'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g., image) of the request.
+
+- Implement [get_input_embeddings][vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings] to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
+
+ ```python
+ from .utils import merge_multimodal_embeddings
+
+ class YourModelForImage2Seq(nn.Module):
+ ...
+
+ def get_input_embeddings(
+ self,
+ input_ids: torch.Tensor,
+ multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
+ ) -> torch.Tensor:
+
+ # `get_input_embeddings` should already be implemented for the language
+ # model as one of the requirements of basic vLLM model implementation.
+ inputs_embeds = self.language_model.get_input_embeddings(input_ids)
+
+ if multimodal_embeddings is not None:
+ inputs_embeds = merge_multimodal_embeddings(
+ input_ids=input_ids,
+ inputs_embeds=inputs_embeds,
+ multimodal_embeddings=multimodal_embeddings,
+ placeholder_token_id=self.config.image_token_index)
+
+ return inputs_embeds
+ ```
+
+- Implement [get_language_model][vllm.model_executor.models.interfaces.SupportsMultiModal.get_language_model] getter to provide stable access to the underlying language model.
+
+ ```python
+ class YourModelForImage2Seq(nn.Module):
+ ...
+
+ def get_language_model(self) -> torch.nn.Module:
+ # Change `language_model` according to your implementation.
+ return self.language_model
+ ```
+
+- Once the above steps are done, update the model class with the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
+
+ ```diff
+ + from vllm.model_executor.models.interfaces import SupportsMultiModal
+
+ - class YourModelForImage2Seq(nn.Module):
+ + class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
+ ```
+
+!!! note
+ The model class does not have to be named `*ForCausalLM`.
+ Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
+
+## 2. Specify processing information
+
+Next, create a subclass of [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo]
+to provide basic information related to HF processing.
+
+### Maximum number of input items
+
+You need to override the abstract method [get_supported_mm_limits][vllm.multimodal.processing.BaseProcessingInfo.get_supported_mm_limits]
+to return the maximum number of input items for each modality supported by the model.
+
+For example, if the model supports any number of images but only one video per prompt:
+
+```python
+def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]:
+ return {"image": None, "video": 1}
+```
+
+## 3. Specify dummy inputs
+
+Then, inherit [BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] to construct dummy inputs for
+HF processing as well as memory profiling.
+
+### For memory profiling
+
+Override the abstract methods [get_dummy_text][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text] and [get_dummy_mm_data][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_mm_data] to construct dummy inputs for memory profiling. These dummy inputs should result in the worst-case memory usage of the model so that vLLM can reserve the correct amount of memory for it.
+
+Assuming that the memory usage increases with the number of tokens, the dummy inputs can be constructed to maximize the number of output embeddings, which is the same as the number of placeholder feature tokens.
+
+=== "Basic example: LLaVA"
+
+ Looking at the code of HF's `LlavaForConditionalGeneration`:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
+ n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
+ n_image_features = image_features.shape[0] * image_features.shape[1]
+
+ if n_image_tokens != n_image_features:
+ raise ValueError(
+ f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
+ )
+ special_image_mask = (
+ (input_ids == self.config.image_token_index)
+ .unsqueeze(-1)
+ .expand_as(inputs_embeds)
+ .to(inputs_embeds.device)
+ )
+ image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
+ inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
+ ```
+
+ The number of placeholder feature tokens per image is `image_features.shape[1]`.
+ `image_features` is calculated inside the `get_image_features` method:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
+ image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
+
+ selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
+ if vision_feature_select_strategy == "default":
+ selected_image_feature = selected_image_feature[:, 1:]
+ elif vision_feature_select_strategy == "full":
+ selected_image_feature = selected_image_feature
+ else:
+ raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}")
+ image_features = self.multi_modal_projector(selected_image_feature)
+ return image_features
+ ```
+
+ We can infer that `image_features.shape[1]` is based on `image_outputs.hidden_states.shape[1]` from the vision tower
+ (`CLIPVisionModel` for the [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model).
+ Moreover, we only need the sequence length (the second dimension of the tensor) to get `image_features.shape[1]`.
+ The sequence length is determined by the initial hidden states in `CLIPVisionTransformer` since the attention
+ mechanism doesn't change the sequence length of the output hidden states.
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L1094-L1102
+ hidden_states = self.embeddings(pixel_values, interpolate_pos_encoding=interpolate_pos_encoding)
+ hidden_states = self.pre_layrnorm(hidden_states)
+
+ encoder_outputs = self.encoder(
+ inputs_embeds=hidden_states,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+ ```
+
+ To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
+ target_dtype = self.patch_embedding.weight.dtype
+ patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid]
+ patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
+
+ class_embeds = self.class_embedding.expand(batch_size, 1, -1)
+ embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
+ if interpolate_pos_encoding:
+ embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
+ else:
+ embeddings = embeddings + self.position_embedding(self.position_ids)
+ return embeddings
+ ```
+
+ We can infer that `embeddings.shape[1] == self.num_positions`, where
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L195-L196
+ self.num_patches = (self.image_size // self.patch_size) ** 2
+ self.num_positions = self.num_patches + 1
+ ```
+
+ Overall, the number of placeholder feature tokens for an image can be calculated as:
+
+ ```python
+ def get_num_image_tokens(
+ self,
+ *,
+ image_width: int,
+ image_height: int,
+ ) -> int:
+ hf_config = self.get_hf_config()
+ hf_processor = self.get_hf_processor()
+
+ image_size = hf_config.vision_config.image_size
+ patch_size = hf_config.vision_config.patch_size
+
+ num_image_tokens = (image_size // patch_size) ** 2 + 1
+ if hf_processor.vision_feature_select_strategy == "default":
+ num_image_tokens -= 1
+
+ return num_image_tokens
+ ```
+
+ Notice that the number of image tokens doesn't depend on the image width and height.
+ We can simply use a dummy `image_size` to calculate the multimodal profiling data:
+
+ ```python
+ # NOTE: In actuality, this is usually implemented as part of the
+ # model's subclass of `BaseProcessingInfo`, but we show it as is
+ # here for simplicity.
+ def get_image_size_with_most_features(self) -> ImageSize:
+ hf_config = self.get_hf_config()
+ width = height = hf_config.image_size
+ return ImageSize(width=width, height=height)
+
+ def get_dummy_mm_data(
+ self,
+ seq_len: int,
+ mm_counts: Mapping[str, int],
+ ) -> MultiModalDataDict:
+ num_images = mm_counts.get("image", 0)
+
+ target_width, target_height = \
+ self.info.get_image_size_with_most_features()
+
+ return {
+ "image":
+ self._get_dummy_images(width=target_width,
+ height=target_height,
+ num_images=num_images)
+ }
+ ```
+
+ For the text, we simply expand the multimodal image token from the model config to match the desired number of images.
+
+ ```python
+ def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
+ num_images = mm_counts.get("image", 0)
+
+ processor = self.info.get_hf_processor()
+ image_token = processor.image_token
+
+ return image_token * num_images
+ ```
+
+=== "No input placeholders: Fuyu"
+
+ Looking at the code of HF's `FuyuForCausalLM`:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
+ if image_patches is not None and past_key_values is None:
+ patch_embeddings = [
+ self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
+ .squeeze(0)
+ .to(inputs_embeds.device)
+ for patch in image_patches
+ ]
+ inputs_embeds = self.gather_continuous_embeddings(
+ word_embeddings=inputs_embeds,
+ continuous_embeddings=patch_embeddings,
+ image_patch_input_indices=image_patches_indices,
+ )
+ ```
+
+ The number of placeholder feature tokens for the `i`th item in the batch is `patch_embeddings[i].shape[0]`,
+ which is the same as `image_patches[i].shape[0]`, i.e. `num_total_patches`.
+
+ Unlike LLaVA, Fuyu does not define the number of patches inside the modeling file. Where can we get more information?
+ Considering that the model input comes from the output of `FuyuProcessor`, let's **look at the preprocessing files**.
+
+ The image outputs are obtained by calling `FuyuImageProcessor.preprocess` and then
+ `FuyuImageProcessor.preprocess_with_tokenizer_info` inside `FuyuProcessor`.
+
+ In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
+ returning the dimensions after resizing (but before padding) as metadata.
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
+ image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
+ batch_images = image_encoding["images"]
+ image_unpadded_heights = image_encoding["image_unpadded_heights"]
+ image_unpadded_widths = image_encoding["image_unpadded_widths"]
+
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
+ if do_resize:
+ batch_images = [
+ [self.resize(image, size=size, input_data_format=input_data_format) for image in images]
+ for images in batch_images
+ ]
+
+ image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
+ image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
+ image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]
+
+ if do_pad:
+ batch_images = [
+ [
+ self.pad_image(
+ image,
+ size=size,
+ mode=padding_mode,
+ constant_values=padding_value,
+ input_data_format=input_data_format,
+ )
+ for image in images
+ ]
+ for images in batch_images
+ ]
+ ```
+
+ In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
+ model_image_input = self.image_processor.preprocess_with_tokenizer_info(
+ image_input=tensor_batch_images,
+ image_present=image_present,
+ image_unpadded_h=image_unpadded_heights,
+ image_unpadded_w=image_unpadded_widths,
+ image_placeholder_id=image_placeholder_id,
+ image_newline_id=image_newline_id,
+ variable_sized=True,
+ )
+
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
+ image_height, image_width = image.shape[1], image.shape[2]
+ if variable_sized: # variable_sized=True
+ new_h = min(
+ image_height,
+ math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
+ )
+ new_w = min(
+ image_width,
+ math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
+ )
+ image = image[:, :new_h, :new_w]
+ image_height, image_width = new_h, new_w
+
+ num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
+ tensor_of_image_ids = torch.full(
+ [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
+ )
+ patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
+ assert num_patches == patches.shape[0]
+ ```
+
+ The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
+ patch_size = patch_size if patch_size is not None else self.patch_size
+ patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]
+
+ if image_height % patch_height != 0:
+ raise ValueError(f"{image_height=} must be divisible by {patch_height}")
+ if image_width % patch_width != 0:
+ raise ValueError(f"{image_width=} must be divisible by {patch_width}")
+
+ num_patches_per_dim_h = image_height // patch_height
+ num_patches_per_dim_w = image_width // patch_width
+ num_patches = num_patches_per_dim_h * num_patches_per_dim_w
+ ```
+
+ These image patches correspond to placeholder tokens (`|SPEAKER|`). So, we just need to maximize the number of image patches. Since input images are first resized
+ to fit within `image_processor.size`, we can maximize the number of image patches by inputting an image with size equal to `image_processor.size`.
+
+ ```python
+ def get_image_size_with_most_features(self) -> ImageSize:
+ image_processor = self.get_image_processor()
+ return ImageSize(width=image_processor.size["width"],
+ height=image_processor.size["height"])
+ ```
+
+    Fuyu does not expect image placeholders in the inputs to the HF processor, so
+ the dummy prompt text is empty regardless of the number of images.
+
+ ```python
+ def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
+ return ""
+ ```
+
+ For the multimodal image profiling data, the logic is very similar to LLaVA:
+
+ ```python
+ def get_dummy_mm_data(
+ self,
+ seq_len: int,
+ mm_counts: Mapping[str, int],
+ ) -> MultiModalDataDict:
+ target_width, target_height = \
+ self.info.get_image_size_with_most_features()
+ num_images = mm_counts.get("image", 0)
+
+ return {
+ "image":
+ self._get_dummy_images(width=target_width,
+ height=target_height,
+ num_images=num_images)
+ }
+ ```
+
+## 4. Specify processing details
+
+Afterwards, create a subclass of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]
+to fill in the missing details about HF processing.
+
+!!! info
+ [Multi-Modal Data Processing][mm-processing]
+
+### Multi-modal fields
+
+Override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] to
+return a schema of the tensors outputted by the HF processor that are related to the input multi-modal items.
+
+=== "Basic example: LLaVA"
+
+ The output of `CLIPImageProcessor` is a simple tensor with shape
+ `(num_images, num_channels, image_height, image_width)`:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345
+ images = [
+ to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
+ for image in all_images
+ ]
+
+ data = {"pixel_values": images}
+ return BatchFeature(data=data, tensor_type=return_tensors)
+ ```
+
+ So, we override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] as follows:
+
+ ```python
+ def _get_mm_fields_config(
+ self,
+ hf_inputs: BatchFeature,
+ hf_processor_mm_kwargs: Mapping[str, object],
+ ) -> Mapping[str, MultiModalFieldConfig]:
+ return dict(
+ pixel_values=MultiModalFieldConfig.batched("image"),
+ )
+ ```
+
+ !!! note
+ Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports
+        pre-computed image embeddings, which can be passed to the model via the `image_embeds` argument.
+
+=== "With postprocessing: Fuyu"
+
+ The `image_patches` output of `FuyuImageProcessor.preprocess_with_tokenizer_info` concatenates
+ the patches from each image belonging to an item in the batch:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L673-L679
+ image_input_ids.append(tensor_of_image_ids)
+ image_patches.append(patches)
+ else:
+ image_input_ids.append(torch.tensor([], dtype=torch.int32, device=image_input.device))
+
+ batch_image_input_ids.append(image_input_ids)
+ batch_image_patches.append(image_patches)
+ ```
+
+ The shape of `image_patches` outputted by `FuyuImageProcessor` is therefore
+ `(1, num_images, num_patches, patch_width * patch_height * num_channels)`.
+
+ In order to support the use of [MultiModalFieldConfig.batched][] like in LLaVA,
+ we remove the extra batch dimension by overriding [BaseMultiModalProcessor._call_hf_processor][]:
+
+ ```python
+ def _call_hf_processor(
+ self,
+ prompt: str,
+ mm_data: Mapping[str, object],
+ mm_kwargs: Mapping[str, object],
+ ) -> BatchFeature:
+ processed_outputs = super()._call_hf_processor(
+ prompt=prompt,
+ mm_data=mm_data,
+ mm_kwargs=mm_kwargs,
+ )
+
+ image_patches = processed_outputs.get("image_patches")
+ if image_patches is not None:
+ images = mm_data["images"]
+ assert isinstance(images, list)
+
+ # Original output: (1, num_images, Pn, Px * Py * C)
+ # New output: (num_images, Pn, Px * Py * C)
+ assert (isinstance(image_patches, list)
+ and len(image_patches) == 1)
+ assert (isinstance(image_patches[0], torch.Tensor)
+ and len(image_patches[0]) == len(images))
+
+ processed_outputs["image_patches"] = image_patches[0]
+
+ return processed_outputs
+ ```
+
+ !!! note
+ Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) has special handling
+        for text-only inputs to prevent unnecessary warnings from the HF processor.
+
+ This lets us override [_get_mm_fields_config][vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config] as follows:
+
+ ```python
+ def _get_mm_fields_config(
+ self,
+ hf_inputs: BatchFeature,
+ hf_processor_mm_kwargs: Mapping[str, object],
+ ) -> Mapping[str, MultiModalFieldConfig]:
+ return dict(image_patches=MultiModalFieldConfig.batched("image"))
+ ```
+
+### Prompt updates
+
+Override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] to
+return a list of [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instances.
+
+Each [PromptUpdate][vllm.multimodal.processing.PromptUpdate] instance specifies an update operation
+(e.g.: insertion, replacement) performed by the HF processor.
+
+=== "Basic example: LLaVA"
+
+ Looking at HF's `LlavaProcessor`:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/processing_llava.py#L167-L170
+ prompt_strings = []
+ for sample in text:
+ sample = sample.replace(self.image_token, self.image_token * num_image_tokens)
+ prompt_strings.append(sample)
+ ```
+
+ It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
+ Based on this, we override [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] as follows:
+
+ ```python
+ def _get_prompt_updates(
+ self,
+ mm_items: MultiModalDataItems,
+ hf_processor_mm_kwargs: Mapping[str, object],
+ out_mm_kwargs: MultiModalKwargs,
+ ) -> Sequence[PromptUpdate]:
+ hf_config = self.info.get_hf_config()
+ image_token_id = hf_config.image_token_index
+
+ def get_replacement(item_idx: int):
+ images = mm_items.get_items("image", ImageProcessorItems)
+
+ image_size = images.get_image_size(item_idx)
+ num_image_tokens = self.info.get_num_image_tokens(
+ image_width=image_size.width,
+ image_height=image_size.height,
+ )
+
+ return [image_token_id] * num_image_tokens
+
+ return [
+ PromptReplacement(
+ modality="image",
+ target=[image_token_id],
+ replacement=get_replacement,
+ ),
+ ]
+ ```
+
+=== "Handling additional tokens: Fuyu"
+
+ Recall the layout of feature tokens from Step 2:
+
+ ```
+ |SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+ |SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+ ...
+ |SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
+ ```
+
+ We define a helper function to return `ncols` and `nrows` directly:
+
+ ```python
+ def get_image_feature_grid_size(
+ self,
+ *,
+ image_width: int,
+ image_height: int,
+ ) -> tuple[int, int]:
+ image_processor = self.get_image_processor()
+ target_width = image_processor.size["width"]
+ target_height = image_processor.size["height"]
+ patch_width = image_processor.patch_size["width"]
+ patch_height = image_processor.patch_size["height"]
+
+ if not (image_width <= target_width and image_height <= target_height):
+ height_scale_factor = target_height / image_height
+ width_scale_factor = target_width / image_width
+ optimal_scale_factor = min(height_scale_factor, width_scale_factor)
+
+ image_height = int(image_height * optimal_scale_factor)
+ image_width = int(image_width * optimal_scale_factor)
+
+ ncols = math.ceil(image_width / patch_width)
+ nrows = math.ceil(image_height / patch_height)
+ return ncols, nrows
+ ```
+
+ Based on this, we can initially define our replacement tokens as:
+
+ ```python
+ def get_replacement(item_idx: int):
+ images = mm_items.get_items("image", ImageProcessorItems)
+ image_size = images.get_image_size(item_idx)
+
+ ncols, nrows = self.info.get_image_feature_grid_size(
+ image_width=image_size.width,
+ image_height=image_size.height,
+ )
+
+ # `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
+ # `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
+ return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
+ ```
+
+ However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
+    a BOS token (`<s>`) is also added to the prompt:
+
+ ```python
+ # https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
+ model_image_input = self.image_processor.preprocess_with_tokenizer_info(
+ image_input=tensor_batch_images,
+ image_present=image_present,
+ image_unpadded_h=image_unpadded_heights,
+ image_unpadded_w=image_unpadded_widths,
+ image_placeholder_id=image_placeholder_id,
+ image_newline_id=image_newline_id,
+ variable_sized=True,
+ )
+ prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
+ tokenizer=self.tokenizer,
+ prompts=prompts,
+ scale_factors=scale_factors,
+ max_tokens_to_generate=self.max_tokens_to_generate,
+ max_position_embeddings=self.max_position_embeddings,
+ add_BOS=True,
+ add_beginning_of_answer_token=True,
+ )
+ ```
+
+ To assign the vision embeddings to only the image tokens, instead of a string
+ you can return an instance of [PromptUpdateDetails][vllm.multimodal.processing.PromptUpdateDetails]:
+
+ ```python
+ hf_config = self.info.get_hf_config()
+    bos_token_id = hf_config.bos_token_id  # <s>
+ assert isinstance(bos_token_id, int)
+
+ def get_replacement_fuyu(item_idx: int):
+ images = mm_items.get_items("image", ImageProcessorItems)
+ image_size = images.get_image_size(item_idx)
+
+ ncols, nrows = self.info.get_image_feature_grid_size(
+ image_width=image_size.width,
+ image_height=image_size.height,
+ )
+ image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
+ [_NEWLINE_TOKEN_ID]) * nrows
+
+ return PromptUpdateDetails.select_token_id(
+ image_tokens + [bos_token_id],
+ embed_token_id=_IMAGE_TOKEN_ID,
+ )
+ ```
+
+ Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
+ we can search for it to conduct the replacement at the start of the string:
+
+ ```python
+ def _get_prompt_updates(
+ self,
+ mm_items: MultiModalDataItems,
+ hf_processor_mm_kwargs: Mapping[str, object],
+ out_mm_kwargs: MultiModalKwargs,
+ ) -> Sequence[PromptUpdate]:
+ hf_config = self.info.get_hf_config()
+ bos_token_id = hf_config.bos_token_id
+ assert isinstance(bos_token_id, int)
+
+ tokenizer = self.info.get_tokenizer()
+ eot_token_id = tokenizer.bos_token_id
+ assert isinstance(eot_token_id, int)
+
+ def get_replacement_fuyu(item_idx: int):
+ images = mm_items.get_items("image", ImageProcessorItems)
+ image_size = images.get_image_size(item_idx)
+
+ ncols, nrows = self.info.get_image_feature_grid_size(
+ image_width=image_size.width,
+ image_height=image_size.height,
+ )
+ image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
+ [_NEWLINE_TOKEN_ID]) * nrows
+
+ return PromptUpdateDetails.select_token_id(
+ image_tokens + [bos_token_id],
+ embed_token_id=_IMAGE_TOKEN_ID,
+ )
+
+ return [
+ PromptReplacement(
+ modality="image",
+ target=[eot_token_id],
+ replacement=get_replacement_fuyu,
+ )
+ ]
+ ```
+
+## 5. Register processor-related classes
+
+After you have defined [BaseProcessingInfo][vllm.multimodal.processing.BaseProcessingInfo] (Step 2),
+[BaseDummyInputsBuilder][vllm.multimodal.profiling.BaseDummyInputsBuilder] (Step 3),
+and [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] (Step 4),
+decorate the model class with [MULTIMODAL_REGISTRY.register_processor][vllm.multimodal.registry.MultiModalRegistry.register_processor]
+to register them to the multi-modal registry:
+
+```diff
+ from vllm.model_executor.models.interfaces import SupportsMultiModal
++ from vllm.multimodal import MULTIMODAL_REGISTRY
+
++ @MULTIMODAL_REGISTRY.register_processor(YourMultiModalProcessor,
++ info=YourProcessingInfo,
++ dummy_inputs=YourDummyInputsBuilder)
+ class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
+```
+
+## Notes
+
+### Inserting feature tokens without replacement
+
+Some HF processors directly insert feature tokens without replacing anything in the original prompt. In that case, you can use [PromptInsertion][vllm.multimodal.processing.PromptInsertion] instead of [PromptReplacement][vllm.multimodal.processing.PromptReplacement] inside [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates].
+
+Examples:
+
+- BLIP-2 (insert at start of prompt):
+- Florence2 (insert at start of prompt):
+- Molmo (insert after `<|endoftext|>` token):
+
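+A rough sketch of such an update, loosely modeled on the BLIP-2 case (the `PromptIndexTargets.start()` target and the helper names are assumptions to adapt to your processor):
+
+```python
+def _get_prompt_updates(
+    self,
+    mm_items: MultiModalDataItems,
+    hf_processor_mm_kwargs: Mapping[str, object],
+    out_mm_kwargs: MultiModalKwargs,
+) -> Sequence[PromptUpdate]:
+    hf_config = self.info.get_hf_config()
+    image_token_id = hf_config.image_token_index  # assumed config attribute
+
+    def get_insertion(item_idx: int):
+        # Insert a fixed number of image feature tokens at the start of the prompt.
+        num_image_tokens = self.info.get_num_image_tokens()  # assumed helper
+        return [image_token_id] * num_image_tokens
+
+    return [
+        PromptInsertion(
+            modality="image",
+            target=PromptIndexTargets.start(),  # assumed "start of prompt" target
+            insertion=get_insertion,
+        )
+    ]
+```
+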
+### Handling prompt updates unrelated to multi-modal data
+
+[_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates] assumes that each application of prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override [_apply_hf_processor_tokens_only][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only] so that the processed token inputs are consistent with the result of applying the HF processor on text inputs. This is because token inputs bypass the HF processor according to [our design][mm-processing].
+
+Examples:
+
+- Chameleon (appends `sep_token`):
+- Fuyu (appends `boa_token`):
+- Molmo (applies chat template which is not defined elsewhere):
+
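+A minimal sketch of such an override, modeled on the Chameleon-style `sep_token` case (the exact signature and token lookup are assumptions; check the current base class before copying):
+
+```python
+def _apply_hf_processor_tokens_only(
+    self,
+    prompt_tokens: list[int],
+) -> list[int]:
+    # Mirror what the HF processor does to text inputs by appending the
+    # separator token, so that token inputs stay consistent with text inputs.
+    tokenizer = self.info.get_tokenizer()
+    sep_token_id = tokenizer.vocab[tokenizer.sep_token]  # assumed lookup
+
+    return prompt_tokens + [sep_token_id]
+```
+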
+### Custom HF processor
+
+Some models don't define a HF processor class on HF Hub. In that case, you can define a custom HF processor that has the same call signature as HF processors and pass it to [_call_hf_processor][vllm.multimodal.processing.BaseMultiModalProcessor._call_hf_processor].
+
+Examples:
+
+- DeepSeek-VL2:
+- InternVL:
+- Qwen-VL:
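+
+As a rough illustration of the expected shape of such a processor (class and attribute names are placeholders, not a real vLLM or HF API):
+
+```python
+from transformers import BatchFeature
+
+
+class YourCustomProcessor:
+    """Mimics the call signature of a HF processor for a model that lacks one."""
+
+    def __init__(self, tokenizer, image_processor):
+        self.tokenizer = tokenizer
+        self.image_processor = image_processor
+
+    def __call__(self, text=None, images=None, return_tensors="pt", **kwargs) -> BatchFeature:
+        data = {}
+        if text is not None:
+            data.update(self.tokenizer(text, return_tensors=None, **kwargs))
+        if images is not None:
+            data.update(self.image_processor(images, return_tensors=None))
+        return BatchFeature(data=data, tensor_type=return_tensors)
+```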
diff --git a/docs/source/contributing/model/registration.md b/docs/contributing/model/registration.md
similarity index 52%
rename from docs/source/contributing/model/registration.md
rename to docs/contributing/model/registration.md
index 64cd25b538..e796e49a75 100644
--- a/docs/source/contributing/model/registration.md
+++ b/docs/contributing/model/registration.md
@@ -1,33 +1,32 @@
-(new-model-registration)=
-
-# Registering a Model to vLLM
+---
+title: Registering a Model to vLLM
+---
+[](){ #new-model-registration }
vLLM relies on a model registry to determine how to run each model.
-A list of pre-registered architectures can be found [here](#supported-models).
+A list of pre-registered architectures can be found [here][supported-models].
If your model is not on this list, you must register it to vLLM.
This page provides detailed instructions on how to do so.
## Built-in models
-To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source](#build-from-source).
+To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source].
This gives you the ability to modify the codebase and test your model.
-After you have implemented your model (see [tutorial](#new-model-basic)), put it into the directory.
+After you have implemented your model (see [tutorial][new-model-basic]), put it into the directory.
Then, add your model class to `_VLLM_MODELS` in so that it is automatically registered upon importing vLLM.
-Finally, update our [list of supported models](#supported-models) to promote your model!
+Finally, update our [list of supported models][supported-models] to promote your model!
-:::{important}
-The list of models in each section should be maintained in alphabetical order.
-:::
+!!! warning
+ The list of models in each section should be maintained in alphabetical order.
## Out-of-tree models
You can load an external model using a plugin without modifying the vLLM codebase.
-:::{seealso}
-[vLLM's Plugin System](#plugin-system)
-:::
+!!! info
+ [vLLM's Plugin System][plugin-system]
To register the model, use the following code:
@@ -45,11 +44,9 @@ from vllm import ModelRegistry
ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
```
-:::{important}
-If your model is a multimodal model, ensure the model class implements the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
-Read more about that [here](#supports-multimodal).
-:::
+!!! warning
+ If your model is a multimodal model, ensure the model class implements the [SupportsMultiModal][vllm.model_executor.models.interfaces.SupportsMultiModal] interface.
+ Read more about that [here][supports-multimodal].
-:::{note}
-Although you can directly put these code snippets in your script using `vllm.LLM`, the recommended way is to place these snippets in a vLLM plugin. This ensures compatibility with various vLLM features like distributed inference and the API server.
-:::
+!!! note
+ Although you can directly put these code snippets in your script using `vllm.LLM`, the recommended way is to place these snippets in a vLLM plugin. This ensures compatibility with various vLLM features like distributed inference and the API server.
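+
+For instance, such a plugin can expose its registration code through a `vllm.general_plugins` entry point, roughly like this (package and module names are illustrative; see the plugin system docs for the exact mechanism):
+
+```python
+# setup.py of your plugin package (names are illustrative)
+from setuptools import setup
+
+setup(
+    name="vllm_add_dummy_model",
+    version="0.1",
+    packages=["vllm_add_dummy_model"],
+    entry_points={
+        "vllm.general_plugins": [
+            "register_dummy_model = vllm_add_dummy_model:register",
+        ],
+    },
+)
+```
+
+Here, `register` would be a function in your package that performs the `ModelRegistry.register_model(...)` call shown above.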
diff --git a/docs/source/contributing/model/tests.md b/docs/contributing/model/tests.md
similarity index 75%
rename from docs/source/contributing/model/tests.md
rename to docs/contributing/model/tests.md
index 68d51d89f7..2688098618 100644
--- a/docs/source/contributing/model/tests.md
+++ b/docs/contributing/model/tests.md
@@ -1,6 +1,7 @@
-(new-model-tests)=
-
-# Writing Unit Tests
+---
+title: Writing Unit Tests
+---
+[](){ #new-model-tests }
This page explains how to write unit tests to verify the implementation of your model.
@@ -14,14 +15,12 @@ Without them, the CI for your PR will fail.
Include an example HuggingFace repository for your model in .
This enables a unit test that loads dummy weights to ensure that the model can be initialized in vLLM.
-:::{important}
-The list of models in each section should be maintained in alphabetical order.
-:::
+!!! warning
+ The list of models in each section should be maintained in alphabetical order.
-:::{tip}
-If your model requires a development version of HF Transformers, you can set
-`min_transformers_version` to skip the test in CI until the model is released.
-:::
+!!! tip
+ If your model requires a development version of HF Transformers, you can set
+ `min_transformers_version` to skip the test in CI until the model is released.
## Optional Tests
@@ -34,16 +33,16 @@ These tests compare the model outputs of vLLM against [HF Transformers](https://
#### Generative models
-For [generative models](#generative-models), there are two levels of correctness tests, as defined in :
+For [generative models][generative-models], there are two levels of correctness tests, as defined in :
- Exact correctness (`check_outputs_equal`): The text outputted by vLLM should exactly match the text outputted by HF.
- Logprobs similarity (`check_logprobs_close`): The logprobs outputted by vLLM should be in the top-k logprobs outputted by HF, and vice versa.
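To make these two levels concrete, here is a standalone sketch of the exact-correctness idea: greedy decoding in vLLM should reproduce greedy decoding in HF Transformers. It is not vLLM's actual test harness (the real tests use shared fixtures), and the model name is only an example; when exact matches are too strict, the logprobs-similarity check is used instead.

```python
# Standalone sketch: compare greedy outputs of vLLM and HF Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "facebook/opt-125m"  # example; use the model you are adding
PROMPTS = ["Hello, my name is", "The capital of France is"]
MAX_TOKENS = 32

# HF reference outputs with greedy decoding.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
hf_model = AutoModelForCausalLM.from_pretrained(MODEL)
hf_texts = []
for prompt in PROMPTS:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = hf_model.generate(input_ids, max_new_tokens=MAX_TOKENS, do_sample=False)
    hf_texts.append(
        tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
    )

# vLLM outputs with greedy decoding (temperature=0).
llm = LLM(model=MODEL)
vllm_outputs = llm.generate(PROMPTS, SamplingParams(temperature=0.0, max_tokens=MAX_TOKENS))
vllm_texts = [out.outputs[0].text for out in vllm_outputs]

for hf_text, vllm_text in zip(hf_texts, vllm_texts):
    assert hf_text == vllm_text, f"HF: {hf_text!r} != vLLM: {vllm_text!r}"
```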
#### Pooling models
-For [pooling models](#pooling-models), we simply check the cosine similarity, as defined in .
+For [pooling models][pooling-models], we simply check the cosine similarity, as defined in .
-(mm-processing-tests)=
+[](){ #mm-processing-tests }
### Multi-modal processing
diff --git a/docs/source/contributing/overview.md b/docs/contributing/overview.md
similarity index 87%
rename from docs/source/contributing/overview.md
rename to docs/contributing/overview.md
index 89b31f0311..7dbf8bfdcf 100644
--- a/docs/source/contributing/overview.md
+++ b/docs/contributing/overview.md
@@ -27,7 +27,21 @@ See .
## Developing
Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation.
-Check out the [building from source](#build-from-source) documentation for details.
+Check out the [building from source][build-from-source] documentation for details.
+
+### Building the docs
+
+Install the dependencies:
+
+```bash
+pip install -r requirements/docs.txt
+```
+
+Start the autoreloading MkDocs server:
+
+```bash
+mkdocs serve
+```
## Testing
@@ -48,29 +62,25 @@ pre-commit run mypy-3.9 --hook-stage manual --all-files
pytest tests/
```
-:::{tip}
-Since the ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
+!!! tip
+ Since the ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
-Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
-:::
+ Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
-:::{note}
-Currently, the repository is not fully checked by `mypy`.
-:::
+!!! note
+ Currently, the repository is not fully checked by `mypy`.
-:::{note}
-Currently, not all unit tests pass when run on CPU platforms. If you don't have access to a GPU
-platform to run unit tests locally, rely on the continuous integration system to run the tests for
-now.
-:::
+!!! note
+ Currently, not all unit tests pass when run on CPU platforms. If you don't have access to a GPU
+ platform to run unit tests locally, rely on the continuous integration system to run the tests for
+ now.
## Issues
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
-:::{important}
-If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
-:::
+!!! warning
+ If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
## Pull Requests & Code Reviews
@@ -106,9 +116,8 @@ appropriately to indicate the type of change. Please use one of the following:
- `[Misc]` for PRs that do not fit the above categories. Please use this
sparingly.
-:::{note}
-If the PR spans more than one category, please include all relevant prefixes.
-:::
+!!! note
+ If the PR spans more than one category, please include all relevant prefixes.
### Code Quality
diff --git a/docs/source/contributing/profiling/profiling_index.md b/docs/contributing/profiling.md
similarity index 90%
rename from docs/source/contributing/profiling/profiling_index.md
rename to docs/contributing/profiling.md
index ce25daa39c..be01b9b65f 100644
--- a/docs/source/contributing/profiling/profiling_index.md
+++ b/docs/contributing/profiling.md
@@ -1,8 +1,7 @@
# Profiling vLLM
-:::{warning}
-Profiling is only intended for vLLM developers and maintainers to understand the proportion of time spent in different parts of the codebase. **vLLM end-users should never turn on profiling** as it will significantly slow down the inference.
-:::
+!!! warning
+ Profiling is only intended for vLLM developers and maintainers to understand the proportion of time spent in different parts of the codebase. **vLLM end-users should never turn on profiling** as it will significantly slow down the inference.
## Profile with PyTorch Profiler
@@ -14,15 +13,13 @@ When using `benchmarks/benchmark_serving.py`, you can enable profiling by passin
Traces can be visualized using .
-:::{tip}
-Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, no need to untar the traces, they can be viewed directly.
-:::
+!!! tip
+ Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, there is no need to untar the traces; they can be viewed directly.
-:::{tip}
-To stop the profiler - it flushes out all the profile trace files to the directory. This takes time, for example for about 100 requests worth of data for a llama 70b, it takes about 10 minutes to flush out on a H100.
-Set the env variable VLLM_RPC_TIMEOUT to a big number before you start the server. Say something like 30 minutes.
-`export VLLM_RPC_TIMEOUT=1800000`
-:::
+!!! tip
+ Stopping the profiler flushes all of the profile trace files to the directory, which takes time: for roughly 100 requests' worth of data for a Llama 70B model, it takes about 10 minutes to flush out on an H100.
+ Therefore, before you start the server, set the environment variable `VLLM_RPC_TIMEOUT` to a large value, e.g. 30 minutes:
+ `export VLLM_RPC_TIMEOUT=1800000`
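Before the full example commands below, here is a minimal offline-inference sketch of collecting a trace. It assumes the `VLLM_TORCH_PROFILER_DIR` environment variable and the `LLM.start_profile()` / `LLM.stop_profile()` hooks available in recent vLLM versions, and uses a small example model:

```python
# Minimal offline profiling sketch (recent vLLM versions).
import os

# The trace directory must be set before vLLM is initialized.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "/tmp/vllm_profile"

from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small example model

llm.start_profile()
llm.generate(["Profile this prompt"], SamplingParams(max_tokens=16))
llm.stop_profile()
```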
### Example commands and usage
diff --git a/docs/source/contributing/vulnerability_management.md b/docs/contributing/vulnerability_management.md
similarity index 100%
rename from docs/source/contributing/vulnerability_management.md
rename to docs/contributing/vulnerability_management.md
diff --git a/docs/deployment/docker.md b/docs/deployment/docker.md
new file mode 100644
index 0000000000..293536e52c
--- /dev/null
+++ b/docs/deployment/docker.md
@@ -0,0 +1,126 @@
+---
+title: Using Docker
+---
+[](){ #deployment-docker }
+
+[](){ #deployment-docker-pre-built-image }
+
+## Use vLLM's Official Docker Image
+
+vLLM offers an official Docker image for deployment.
+The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
+
+```console
+$ docker run --runtime nvidia --gpus all \
+ -v ~/.cache/huggingface:/root/.cache/huggingface \
+ --env "HUGGING_FACE_HUB_TOKEN=" \
+ -p 8000:8000 \
+ --ipc=host \
+ vllm/vllm-openai:latest \
+ --model mistralai/Mistral-7B-v0.1
+```
+
+This image can also be used with other container engines such as [Podman](https://podman.io/).
+
+```console
+$ podman run --gpus all \
+ -v ~/.cache/huggingface:/root/.cache/huggingface \
+ --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+ -p 8000:8000 \
+ --ipc=host \
+ vllm/vllm-openai:latest \
+ --model mistralai/Mistral-7B-v0.1
+```
+
+You can add any other [engine-args][engine-args] you need after the image tag (`vllm/vllm-openai:latest`).
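Once the container is up, the server exposes the OpenAI-compatible API on port 8000. As a quick sanity check, you can query it with the `openai` Python client; this is just a sketch, and the model name must match the `--model` argument used above:

```python
# Query the containerized vLLM server through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",  # must match the served model
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```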
+
+!!! note
+ You can either use the `ipc=host` flag or `--shm-size` flag to allow the
+ container to access the host's shared memory. vLLM uses PyTorch, which uses shared
+ memory to share data between processes under the hood, particularly for tensor parallel inference.
+
+!!! note
+ Optional dependencies are not included in order to avoid licensing issues (e.g. ).
+
+ If you need to use those dependencies (having accepted the license terms),
+ create a custom Dockerfile on top of the base image with an extra layer that installs them:
+
+ ```Dockerfile
+ FROM vllm/vllm-openai:v0.8.3
+
+ # e.g. install the `audio` optional dependencies
+ # NOTE: Make sure the version of vLLM matches the base image!
+ RUN uv pip install --system vllm[audio]==0.8.3
+ ```
+
+!!! tip
+ Some new models may only be available on the main branch of [HF Transformers](https://github.com/huggingface/transformers).
+
+ To use the development version of `transformers`, create a custom Dockerfile on top of the base image
+ with an extra layer that installs their code from source:
+
+ ```Dockerfile
+ FROM vllm/vllm-openai:latest
+
+ RUN uv pip install --system git+https://github.com/huggingface/transformers.git
+ ```
+
+[](){ #deployment-docker-build-image-from-source }
+
+## Building vLLM's Docker Image from Source
+
+You can build and run vLLM from source via the provided . To build vLLM:
+
+```console
+# optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
+DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai --file docker/Dockerfile
+```
+
+!!! note
+ By default vLLM will build for all GPU types for widest distribution. If you are just building for the
+ current GPU type the machine is running on, you can add the argument `--build-arg torch_cuda_arch_list=""`
+ for vLLM to find the current GPU type and build for that.
+
+ If you are using Podman instead of Docker, you might need to disable SELinux labeling by
+ adding `--security-opt label=disable` when running `podman build` command to avoid certain [existing issues](https://github.com/containers/buildah/discussions/4184).
+
+## Building for Arm64/aarch64
+
+A Docker container can be built for aarch64 systems such as the NVIDIA Grace Hopper. At the time of writing, this requires the use
+of PyTorch Nightly and should be considered **experimental**. Using the flag `--platform "linux/arm64"` will attempt to build for arm64.
+
+!!! note
+ Multiple modules must be compiled, so this process can take a while. We recommend using the `--build-arg max_jobs=` and `--build-arg nvcc_threads=`
+ flags to speed up the build process. However, ensure your `max_jobs` is substantially larger than `nvcc_threads` to get the most benefit.
+ Keep an eye on memory usage with parallel jobs, as it can be substantial (see the example below).
+
+```console
+# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
+$ python3 use_existing_torch.py
+$ DOCKER_BUILDKIT=1 docker build . \
+ --file docker/Dockerfile \
+ --target vllm-openai \
+ --platform "linux/arm64" \
+ -t vllm/vllm-gh200-openai:latest \
+ --build-arg max_jobs=66 \
+ --build-arg nvcc_threads=2 \
+ --build-arg torch_cuda_arch_list="9.0+PTX" \
+ --build-arg vllm_fa_cmake_gpu_arches="90-real"
+```
+
+## Use the custom-built vLLM Docker image
+
+To run vLLM with the custom-built Docker image:
+
+```console
+$ docker run --runtime nvidia --gpus all \
+ -v ~/.cache/huggingface:/root/.cache/huggingface \
+ -p 8000:8000 \
+ --env "HUGGING_FACE_HUB_TOKEN=" \
+ vllm/vllm-openai
+```
+
+The argument `vllm/vllm-openai` specifies the image to run, and should be replaced with the name of the custom-built image (the `-t` tag from the build command).
+
+!!! note
+ **For versions 0.4.1 and 0.4.2 only** - the vLLM Docker images for these versions are meant to be run as the root user, since a library under the root user's home directory, i.e. `/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`, is required to be loaded at runtime. If you are running the container as a different user, you may need to first change the permissions of the library (and all the parent directories) to allow the user to access it, then run vLLM with the environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`.
diff --git a/docs/source/deployment/frameworks/anything-llm.md b/docs/deployment/frameworks/anything-llm.md
similarity index 78%
rename from docs/source/deployment/frameworks/anything-llm.md
rename to docs/deployment/frameworks/anything-llm.md
index d430c170ef..a89e633c08 100644
--- a/docs/source/deployment/frameworks/anything-llm.md
+++ b/docs/deployment/frameworks/anything-llm.md
@@ -1,6 +1,7 @@
-(deployment-anything-llm)=
-
-# Anything LLM
+---
+title: Anything LLM
+---
+[](){ #deployment-anything-llm }
[Anything LLM](https://github.com/Mintplex-Labs/anything-llm) is a full-stack application that enables you to turn any document, resource, or piece of content into context that any LLM can use as references during chatting.
@@ -25,23 +26,19 @@ vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096
- Base URL: http://{vllm server host}:{vllm server port}/v1
- Chat Model Name: `Qwen/Qwen1.5-32B-Chat-AWQ`
-:::{image} /assets/deployment/anything-llm-provider.png
-:::
+
- Back to home page, New Workspace --> create `vllm` workspace, and start to chat:
-:::{image} /assets/deployment/anything-llm-chat-without-doc.png
-:::
+
- Click the upload button:
- upload the doc
- select the doc and move to the workspace
- save and embed
-:::{image} /assets/deployment/anything-llm-upload-doc.png
-:::
+
- Chat again:
-:::{image} /assets/deployment/anything-llm-chat-with-doc.png
-:::
+
diff --git a/docs/source/deployment/frameworks/bentoml.md b/docs/deployment/frameworks/bentoml.md
similarity index 89%
rename from docs/source/deployment/frameworks/bentoml.md
rename to docs/deployment/frameworks/bentoml.md
index 2bf435bda8..7e64b6eb6f 100644
--- a/docs/source/deployment/frameworks/bentoml.md
+++ b/docs/deployment/frameworks/bentoml.md
@@ -1,6 +1,7 @@
-(deployment-bentoml)=
-
-# BentoML
+---
+title: BentoML
+---
+[](){ #deployment-bentoml }
[BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.
diff --git a/docs/source/deployment/frameworks/cerebrium.md b/docs/deployment/frameworks/cerebrium.md
similarity index 98%
rename from docs/source/deployment/frameworks/cerebrium.md
rename to docs/deployment/frameworks/cerebrium.md
index b20c95137b..84cb2304fa 100644
--- a/docs/source/deployment/frameworks/cerebrium.md
+++ b/docs/deployment/frameworks/cerebrium.md
@@ -1,12 +1,11 @@
-(deployment-cerebrium)=
+---
+title: Cerebrium
+---
+[](){ #deployment-cerebrium }
-# Cerebrium
-
-:::{raw} html
-:::
vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebrium.ai/), a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI based applications.
diff --git a/docs/source/deployment/frameworks/chatbox.md b/docs/deployment/frameworks/chatbox.md
similarity index 84%
rename from docs/source/deployment/frameworks/chatbox.md
rename to docs/deployment/frameworks/chatbox.md
index e62f464715..10da2fc710 100644
--- a/docs/source/deployment/frameworks/chatbox.md
+++ b/docs/deployment/frameworks/chatbox.md
@@ -1,6 +1,7 @@
-(deployment-chatbox)=
-
-# Chatbox
+---
+title: Chatbox
+---
+[](){ #deployment-chatbox }
[Chatbox](https://github.com/chatboxai/chatbox) is a desktop client for LLMs, available on Windows, Mac, Linux.
@@ -27,10 +28,8 @@ vllm serve qwen/Qwen1.5-0.5B-Chat
- API Path: `/chat/completions`
- Model: `qwen/Qwen1.5-0.5B-Chat`
-:::{image} /assets/deployment/chatbox-settings.png
-:::
+
- Go to `Just chat`, and start to chat:
-:::{image} /assets/deployment/chatbox-chat.png
-:::
+
diff --git a/docs/source/deployment/frameworks/dify.md b/docs/deployment/frameworks/dify.md
similarity index 90%
rename from docs/source/deployment/frameworks/dify.md
rename to docs/deployment/frameworks/dify.md
index 5cdf6a3876..886484b543 100644
--- a/docs/source/deployment/frameworks/dify.md
+++ b/docs/deployment/frameworks/dify.md
@@ -1,6 +1,7 @@
-(deployment-dify)=
-
-# Dify
+---
+title: Dify
+---
+[](){ #deployment-dify }
[Dify](https://github.com/langgenius/dify) is an open-source LLM app development platform. Its intuitive interface combines agentic AI workflow, RAG pipeline, agent capabilities, model management, observability features, and more, allowing you to quickly move from prototype to production.
@@ -42,15 +43,12 @@ docker compose up -d
- **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat`
- **Completion Mode**: `Completion`
-:::{image} /assets/deployment/dify-settings.png
-:::
+
- To create a test chatbot, go to `Studio → Chatbot → Create from Blank`, then select Chatbot as the type:
-:::{image} /assets/deployment/dify-create-chatbot.png
-:::
+
- Click the chatbot you just created to open the chat interface and start interacting with the model:
-:::{image} /assets/deployment/dify-chat.png
-:::
+
diff --git a/docs/source/deployment/frameworks/dstack.md b/docs/deployment/frameworks/dstack.md
similarity index 83%
rename from docs/source/deployment/frameworks/dstack.md
rename to docs/deployment/frameworks/dstack.md
index a16e28f2d8..7de9285574 100644
--- a/docs/source/deployment/frameworks/dstack.md
+++ b/docs/deployment/frameworks/dstack.md
@@ -1,12 +1,11 @@
-(deployment-dstack)=
+---
+title: dstack
+---
+[](){ #deployment-dstack }
-# dstack
-
-:::{raw} html
-:::
vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/), an open-source framework for running LLMs on any cloud. This tutorial assumes that you have already configured credentials, gateway, and GPU quotas on your cloud environment.
@@ -97,6 +96,5 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content)
```
-:::{note}
-dstack automatically handles authentication on the gateway using dstack's tokens. Meanwhile, if you don't want to configure a gateway, you can provision dstack `Task` instead of `Service`. The `Task` is for development purpose only. If you want to know more about hands-on materials how to serve vLLM using dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm)
-:::
+!!! note
+ dstack automatically handles authentication on the gateway using dstack's tokens. If you don't want to configure a gateway, you can provision a dstack `Task` instead of a `Service`; the `Task` is intended for development purposes only. For more hands-on material on how to serve vLLM using dstack, check out [this repository](https://github.com/dstackai/dstack-examples/tree/main/deployment/vllm).
diff --git a/docs/deployment/frameworks/helm.md b/docs/deployment/frameworks/helm.md
new file mode 100644
index 0000000000..192b90438a
--- /dev/null
+++ b/docs/deployment/frameworks/helm.md
@@ -0,0 +1,95 @@
+---
+title: Helm
+---
+[](){ #deployment-helm }
+
+A Helm chart to deploy vLLM for Kubernetes
+
+Helm is a package manager for Kubernetes. It helps you deploy vLLM on Kubernetes and automate the deployment of vLLM Kubernetes applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values.
+
+This guide will walk you through the process of deploying vLLM with Helm, including the necessary prerequisites, the steps for Helm installation, and documentation on the architecture and the values file.
+
+## Prerequisites
+
+Before you begin, ensure that you have the following:
+
+- A running Kubernetes cluster
+- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
+- Available GPU resources in your cluster
+- An S3 bucket with the model that will be deployed
+
+## Installing the chart
+
+To install the chart with the release name `test-vllm`:
+
+```console
+helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml \
+  --set secrets.s3endpoint=$ACCESS_POINT \
+  --set secrets.s3bucketname=$BUCKET \
+  --set secrets.s3accesskeyid=$ACCESS_KEY \
+  --set secrets.s3accesskey=$SECRET_KEY
+```
+
+## Uninstalling the Chart
+
+To uninstall the `test-vllm` deployment:
+
+```console
+helm uninstall test-vllm --namespace=ns-vllm
+```
+
+The command removes all the Kubernetes components associated with the
+chart **including persistent volumes** and deletes the release.
+
+## Architecture
+
+
+
+## Values
+
+| Key | Type | Default | Description |
+|--------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
+| autoscaling | object | {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80} | Autoscaling configuration |
+| autoscaling.enabled | bool | false | Enable autoscaling |
+| autoscaling.maxReplicas | int | 100 | Maximum replicas |
+| autoscaling.minReplicas | int | 1 | Minimum replicas |
+| autoscaling.targetCPUUtilizationPercentage | int | 80 | Target CPU utilization for autoscaling |
+| configs | object | {} | Configmap |
+| containerPort | int | 8000 | Container port |
+| customObjects | list | [] | Custom Objects configuration |
+| deploymentStrategy | object | {} | Deployment strategy configuration |
+| externalConfigs | list | [] | External configuration |
+| extraContainers | list | [] | Additional containers configuration |
+| extraInit | object | {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true} | Additional configuration for the init container |
+| extraInit.pvcStorage | string | "50Gi" | Storage size of the s3 |
+| extraInit.s3modelpath | string | "relative_s3_model_path/opt-125m" | Path of the model on the s3 which hosts model weights and config files |
+| extraInit.awsEc2MetadataDisabled | boolean | true | Disables the use of the Amazon EC2 instance metadata service |
+| extraPorts | list | [] | Additional ports configuration |
+| gpuModels | list | ["TYPE_GPU_USED"] | Type of gpu used |
+| image | object | {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"} | Image configuration |
+| image.command | list | ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"] | Container launch command |
+| image.repository | string | "vllm/vllm-openai" | Image repository |
+| image.tag | string | "latest" | Image tag |
+| livenessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10} | Liveness probe configuration |
+| livenessProbe.failureThreshold | int | 3 | Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not alive |
+| livenessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the Kubelet http request on the server |
+| livenessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server |
+| livenessProbe.httpGet.port | int | 8000 | Name or number of the port to access on the container, on which the server is listening |
+| livenessProbe.initialDelaySeconds | int | 15 | Number of seconds after the container has started before liveness probe is initiated |
+| livenessProbe.periodSeconds | int | 10 | How often (in seconds) to perform the liveness probe |
+| maxUnavailablePodDisruptionBudget | string | "" | Disruption Budget Configuration |
+| readinessProbe | object | {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5} | Readiness probe configuration |
+| readinessProbe.failureThreshold | int | 3 | Number of times after which if a probe fails in a row, Kubernetes considers that the overall check has failed: the container is not ready |
+| readinessProbe.httpGet | object | {"path":"/health","port":8000} | Configuration of the Kubelet http request on the server |
+| readinessProbe.httpGet.path | string | "/health" | Path to access on the HTTP server |
+| readinessProbe.httpGet.port | int | 8000 | Name or number of the port to access on the container, on which the server is listening |
+| readinessProbe.initialDelaySeconds | int | 5 | Number of seconds after the container has started before readiness probe is initiated |
+| readinessProbe.periodSeconds | int | 5 | How often (in seconds) to perform the readiness probe |
+| replicaCount | int | 1 | Number of replicas |
+| resources | object | {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}} | Resource configuration |
+| resources.limits."nvidia.com/gpu" | int | 1 | Number of gpus used |
+| resources.limits.cpu | int | 4 | Number of CPUs |
+| resources.limits.memory | string | "16Gi" | CPU memory configuration |
+| resources.requests."nvidia.com/gpu" | int | 1 | Number of gpus used |
+| resources.requests.cpu | int | 4 | Number of CPUs |
+| resources.requests.memory | string | "16Gi" | CPU memory configuration |
+| secrets | object | {} | Secrets configuration |
+| serviceName | string | | Service name |
+| servicePort | int | 80 | Service port |
+| labels.environment | string | test | Environment name |
diff --git a/docs/source/deployment/frameworks/litellm.md b/docs/deployment/frameworks/litellm.md
similarity index 97%
rename from docs/source/deployment/frameworks/litellm.md
rename to docs/deployment/frameworks/litellm.md
index 6dd3607ca5..3011cde830 100644
--- a/docs/source/deployment/frameworks/litellm.md
+++ b/docs/deployment/frameworks/litellm.md
@@ -1,6 +1,7 @@
-(deployment-litellm)=
-
-# LiteLLM
+---
+title: LiteLLM
+---
+[](){ #deployment-litellm }
[LiteLLM](https://github.com/BerriAI/litellm) lets you call all LLM APIs using the OpenAI format [Bedrock, Huggingface, VertexAI, TogetherAI, Azure, OpenAI, Groq etc.]
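For example, with a vLLM server already running its OpenAI-compatible API, a call through LiteLLM looks roughly like this. This is a sketch: the `hosted_vllm/` provider prefix and the `api_base` argument follow LiteLLM's conventions for hosted vLLM servers, so double-check them against LiteLLM's documentation.

```python
# Route a chat completion to a running vLLM server through LiteLLM (sketch).
import litellm

response = litellm.completion(
    model="hosted_vllm/Qwen/Qwen1.5-0.5B-Chat",  # provider prefix + the model served by vLLM
    messages=[{"role": "user", "content": "Hello, what can you do?"}],
    api_base="http://localhost:8000/v1",
)
print(response.choices[0].message.content)
```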
diff --git a/docs/source/deployment/frameworks/lobe-chat.md b/docs/deployment/frameworks/lobe-chat.md
similarity index 89%
rename from docs/source/deployment/frameworks/lobe-chat.md
rename to docs/deployment/frameworks/lobe-chat.md
index 6d86b7fa9c..cd95c02815 100644
--- a/docs/source/deployment/frameworks/lobe-chat.md
+++ b/docs/deployment/frameworks/lobe-chat.md
@@ -1,6 +1,7 @@
-(deployment-lobe-chat)=
-
-# Lobe Chat
+---
+title: Lobe Chat
+---
+[](){ #deployment-lobe-chat }
[Lobe Chat](https://github.com/lobehub/lobe-chat) is an open-source, modern-design ChatGPT/LLMs UI/Framework.
diff --git a/docs/source/deployment/frameworks/lws.md b/docs/deployment/frameworks/lws.md
similarity index 99%
rename from docs/source/deployment/frameworks/lws.md
rename to docs/deployment/frameworks/lws.md
index 4e9a03b5c4..18282a89dd 100644
--- a/docs/source/deployment/frameworks/lws.md
+++ b/docs/deployment/frameworks/lws.md
@@ -1,6 +1,7 @@
-(deployment-lws)=
-
-# LWS
+---
+title: LWS
+---
+[](){ #deployment-lws }
LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
A major use case is for multi-host/multi-node distributed inference.
diff --git a/docs/source/deployment/frameworks/modal.md b/docs/deployment/frameworks/modal.md
similarity index 85%
rename from docs/source/deployment/frameworks/modal.md
rename to docs/deployment/frameworks/modal.md
index e7c42088e3..dbdb739a10 100644
--- a/docs/source/deployment/frameworks/modal.md
+++ b/docs/deployment/frameworks/modal.md
@@ -1,6 +1,7 @@
-(deployment-modal)=
-
-# Modal
+---
+title: Modal
+---
+[](){ #deployment-modal }
vLLM can be run on cloud GPUs with [Modal](https://modal.com), a serverless computing platform designed for fast auto-scaling.
diff --git a/docs/source/deployment/frameworks/open-webui.md b/docs/deployment/frameworks/open-webui.md
similarity index 87%
rename from docs/source/deployment/frameworks/open-webui.md
rename to docs/deployment/frameworks/open-webui.md
index 83e5303a00..1ab1931068 100644
--- a/docs/source/deployment/frameworks/open-webui.md
+++ b/docs/deployment/frameworks/open-webui.md
@@ -1,6 +1,7 @@
-(deployment-open-webui)=
-
-# Open WebUI
+---
+title: Open WebUI
+---
+[](){ #deployment-open-webui }
1. Install [Docker](https://docs.docker.com/engine/install/)
@@ -25,5 +26,4 @@ ghcr.io/open-webui/open-webui:main
At the top of the web page, you can see the model `qwen/Qwen1.5-0.5B-Chat`.
-:::{image} /assets/deployment/open_webui.png
-:::
+
diff --git a/docs/source/deployment/frameworks/retrieval_augmented_generation.md b/docs/deployment/frameworks/retrieval_augmented_generation.md
similarity index 96%
rename from docs/source/deployment/frameworks/retrieval_augmented_generation.md
rename to docs/deployment/frameworks/retrieval_augmented_generation.md
index f84451fafe..cb26c8378d 100644
--- a/docs/source/deployment/frameworks/retrieval_augmented_generation.md
+++ b/docs/deployment/frameworks/retrieval_augmented_generation.md
@@ -1,6 +1,7 @@
-(deployment-retrieval-augmented-generation)=
-
-# Retrieval-Augmented Generation
+---
+title: Retrieval-Augmented Generation
+---
+[](){ #deployment-retrieval-augmented-generation }
[Retrieval-augmented generation (RAG)](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is a technique that enables generative artificial intelligence (Gen AI) models to retrieve and incorporate new information. It modifies interactions with a large language model (LLM) so that the model responds to user queries with reference to a specified set of documents, using this information to supplement information from its pre-existing training data. This allows LLMs to use domain-specific and/or updated information. Use cases include providing chatbot access to internal company data or generating responses based on authoritative sources.
diff --git a/docs/source/deployment/frameworks/skypilot.md b/docs/deployment/frameworks/skypilot.md
similarity index 97%
rename from docs/source/deployment/frameworks/skypilot.md
rename to docs/deployment/frameworks/skypilot.md
index 5e101b9001..1844a50c56 100644
--- a/docs/source/deployment/frameworks/skypilot.md
+++ b/docs/deployment/frameworks/skypilot.md
@@ -1,12 +1,11 @@
-(deployment-skypilot)=
+---
+title: SkyPilot
+---
+[](){ #deployment-skypilot }
-# SkyPilot
-
-:::{raw} html
-:::
vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with [SkyPilot](https://github.com/skypilot-org/skypilot), an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc, can be found in [SkyPilot AI gallery](https://skypilot.readthedocs.io/en/latest/gallery/index.html).
@@ -104,10 +103,8 @@ service:
max_completion_tokens: 1
```
-:::{raw} html
Click to see the full recipe YAML
-:::
```yaml
service:
@@ -153,9 +150,7 @@ run: |
2>&1 | tee api_server.log
```
-:::{raw} html
-:::
Start serving the Llama-3 8B model on multiple replicas:
@@ -169,10 +164,8 @@ Wait until the service is ready:
watch -n10 sky serve status vllm
```
-:::{raw} html
Example outputs:
-:::
```console
Services
@@ -185,9 +178,7 @@ vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'L4': 1}) R
vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'L4': 1}) READY us-east4
```
-:::{raw} html
-:::
After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:
@@ -223,10 +214,8 @@ service:
This will scale the service up to when the QPS exceeds 2 for each replica.
-:::{raw} html
Click to see the full recipe YAML
-:::
```yaml
service:
@@ -275,9 +264,7 @@ run: |
2>&1 | tee api_server.log
```
-:::{raw} html
-:::
To update the service with the new config:
@@ -295,10 +282,8 @@ sky serve down vllm
It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI will be load-balanced across replicas.
-:::{raw} html
Click to see the full GUI YAML
-:::
```yaml
envs:
@@ -328,9 +313,7 @@ run: |
--stop-token-ids 128009,128001 | tee ~/gradio.log
```
-:::{raw} html
-:::
1. Start the chat web UI:
diff --git a/docs/source/deployment/frameworks/streamlit.md b/docs/deployment/frameworks/streamlit.md
similarity index 91%
rename from docs/source/deployment/frameworks/streamlit.md
rename to docs/deployment/frameworks/streamlit.md
index 084550ec99..8956d1ddc7 100644
--- a/docs/source/deployment/frameworks/streamlit.md
+++ b/docs/deployment/frameworks/streamlit.md
@@ -1,6 +1,7 @@
-(deployment-streamlit)=
-
-# Streamlit
+---
+title: Streamlit
+---
+[](){ #deployment-streamlit }
[Streamlit](https://github.com/streamlit/streamlit) lets you transform Python scripts into interactive web apps in minutes, instead of weeks. Build dashboards, generate reports, or create chat apps.
@@ -38,5 +39,4 @@ VLLM_API_BASE="http://vllm-server-host:vllm-server-port/v1" streamlit run stream
streamlit run streamlit_openai_chatbot_webserver.py --logger.level=debug
```
-:::{image} /assets/deployment/streamlit-chat.png
-:::
+
diff --git a/docs/source/deployment/frameworks/triton.md b/docs/deployment/frameworks/triton.md
similarity index 87%
rename from docs/source/deployment/frameworks/triton.md
rename to docs/deployment/frameworks/triton.md
index 94d8712015..082bc24d85 100644
--- a/docs/source/deployment/frameworks/triton.md
+++ b/docs/deployment/frameworks/triton.md
@@ -1,5 +1,6 @@
-(deployment-triton)=
-
-# NVIDIA Triton
+---
+title: NVIDIA Triton
+---
+[](){ #deployment-triton }
The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.
diff --git a/docs/source/deployment/integrations/kserve.md b/docs/deployment/integrations/kserve.md
similarity index 85%
rename from docs/source/deployment/integrations/kserve.md
rename to docs/deployment/integrations/kserve.md
index c780fd74e8..754b983dee 100644
--- a/docs/source/deployment/integrations/kserve.md
+++ b/docs/deployment/integrations/kserve.md
@@ -1,6 +1,7 @@
-(deployment-kserve)=
-
-# KServe
+---
+title: KServe
+---
+[](){ #deployment-kserve }
vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.
diff --git a/docs/source/deployment/integrations/kubeai.md b/docs/deployment/integrations/kubeai.md
similarity index 93%
rename from docs/source/deployment/integrations/kubeai.md
rename to docs/deployment/integrations/kubeai.md
index 2f5772e075..ba0a3c52cc 100644
--- a/docs/source/deployment/integrations/kubeai.md
+++ b/docs/deployment/integrations/kubeai.md
@@ -1,6 +1,7 @@
-(deployment-kubeai)=
-
-# KubeAI
+---
+title: KubeAI
+---
+[](){ #deployment-kubeai }
[KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.
diff --git a/docs/source/deployment/integrations/llamastack.md b/docs/deployment/integrations/llamastack.md
similarity index 94%
rename from docs/source/deployment/integrations/llamastack.md
rename to docs/deployment/integrations/llamastack.md
index a6c3569637..2ae600a423 100644
--- a/docs/source/deployment/integrations/llamastack.md
+++ b/docs/deployment/integrations/llamastack.md
@@ -1,6 +1,7 @@
-(deployment-llamastack)=
-
-# Llama Stack
+---
+title: Llama Stack
+---
+[](){ #deployment-llamastack }
vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack) .
diff --git a/docs/source/deployment/integrations/llmaz.md b/docs/deployment/integrations/llmaz.md
similarity index 87%
rename from docs/source/deployment/integrations/llmaz.md
rename to docs/deployment/integrations/llmaz.md
index cd4a76353d..03d284c347 100644
--- a/docs/source/deployment/integrations/llmaz.md
+++ b/docs/deployment/integrations/llmaz.md
@@ -1,6 +1,7 @@
-(deployment-llmaz)=
-
-# llmaz
+---
+title: llmaz
+---
+[](){ #deployment-llmaz }
[llmaz](https://github.com/InftyAI/llmaz) is an easy-to-use and advanced inference platform for large language models on Kubernetes, aimed at production use. It uses vLLM as the default model serving backend.
diff --git a/docs/source/deployment/integrations/production-stack.md b/docs/deployment/integrations/production-stack.md
similarity index 98%
rename from docs/source/deployment/integrations/production-stack.md
rename to docs/deployment/integrations/production-stack.md
index 05f1568306..8288a4b6e6 100644
--- a/docs/source/deployment/integrations/production-stack.md
+++ b/docs/deployment/integrations/production-stack.md
@@ -1,6 +1,7 @@
-(deployment-production-stack)=
-
-# Production stack
+---
+title: Production stack
+---
+[](){ #deployment-production-stack }
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using the [vLLM production stack](https://github.com/vllm-project/production-stack). Born out of a Berkeley-UChicago collaboration, [vLLM production stack](https://github.com/vllm-project/production-stack) is an officially released, production-optimized codebase under the [vLLM project](https://github.com/vllm-project), designed for LLM deployment with:
@@ -114,7 +115,7 @@ To remove the deployment, run:
sudo helm uninstall vllm
```
-------
+---
### (Advanced) Configuring vLLM production stack
diff --git a/docs/source/deployment/k8s.md b/docs/deployment/k8s.md
similarity index 98%
rename from docs/source/deployment/k8s.md
rename to docs/deployment/k8s.md
index 9079cfa8e1..bd2bd44cd5 100644
--- a/docs/source/deployment/k8s.md
+++ b/docs/deployment/k8s.md
@@ -1,6 +1,7 @@
-(deployment-k8s)=
-
-# Using Kubernetes
+---
+title: Using Kubernetes
+---
+[](){ #deployment-k8s }
Deploying vLLM on Kubernetes is a scalable and efficient way to serve machine learning models. This guide walks you through deploying vLLM using native Kubernetes.
@@ -19,9 +20,8 @@ Alternatively, you can deploy vLLM to Kubernetes using any of the following:
## Deployment with CPUs
-:::{note}
-The use of CPUs here is for demonstration and testing purposes only and its performance will not be on par with GPUs.
-:::
+!!! note
+ The use of CPUs here is for demonstration and testing purposes only and its performance will not be on par with GPUs.
First, create a Kubernetes PVC and Secret for downloading and storing Hugging Face model:
diff --git a/docs/source/deployment/nginx.md b/docs/deployment/nginx.md
similarity index 77%
rename from docs/source/deployment/nginx.md
rename to docs/deployment/nginx.md
index bf404f1098..9d1f744757 100644
--- a/docs/source/deployment/nginx.md
+++ b/docs/deployment/nginx.md
@@ -1,20 +1,21 @@
-(nginxloadbalancer)=
-
-# Using Nginx
+---
+title: Using Nginx
+---
+[](){ #nginxloadbalancer }
This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
Table of contents:
-1. [Build Nginx Container](#nginxloadbalancer-nginx-build)
-2. [Create Simple Nginx Config file](#nginxloadbalancer-nginx-conf)
-3. [Build vLLM Container](#nginxloadbalancer-nginx-vllm-container)
-4. [Create Docker Network](#nginxloadbalancer-nginx-docker-network)
-5. [Launch vLLM Containers](#nginxloadbalancer-nginx-launch-container)
-6. [Launch Nginx](#nginxloadbalancer-nginx-launch-nginx)
-7. [Verify That vLLM Servers Are Ready](#nginxloadbalancer-nginx-verify-nginx)
+1. [Build Nginx Container][nginxloadbalancer-nginx-build]
+2. [Create Simple Nginx Config file][nginxloadbalancer-nginx-conf]
+3. [Build vLLM Container][nginxloadbalancer-nginx-vllm-container]
+4. [Create Docker Network][nginxloadbalancer-nginx-docker-network]
+5. [Launch vLLM Containers][nginxloadbalancer-nginx-launch-container]
+6. [Launch Nginx][nginxloadbalancer-nginx-launch-nginx]
+7. [Verify That vLLM Servers Are Ready][nginxloadbalancer-nginx-verify-nginx]
-(nginxloadbalancer-nginx-build)=
+[](){ #nginxloadbalancer-nginx-build }
## Build Nginx Container
@@ -39,7 +40,7 @@ Build the container:
docker build . -f Dockerfile.nginx --tag nginx-lb
```
-(nginxloadbalancer-nginx-conf)=
+[](){ #nginxloadbalancer-nginx-conf }
## Create Simple Nginx Config file
@@ -63,7 +64,7 @@ server {
}
```
-(nginxloadbalancer-nginx-vllm-container)=
+[](){ #nginxloadbalancer-nginx-vllm-container }
## Build vLLM Container
@@ -79,7 +80,7 @@ cd $vllm_root
docker build -f docker/Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy
```
-(nginxloadbalancer-nginx-docker-network)=
+[](){ #nginxloadbalancer-nginx-docker-network }
## Create Docker Network
@@ -87,7 +88,7 @@ docker build -f docker/Dockerfile . --tag vllm --build-arg http_proxy=$http_prox
docker network create vllm_nginx
```
-(nginxloadbalancer-nginx-launch-container)=
+[](){ #nginxloadbalancer-nginx-launch-container }
## Launch vLLM Containers
@@ -105,11 +106,10 @@ docker run -itd --ipc host --network vllm_nginx --gpus device=0 --shm-size=10.24
docker run -itd --ipc host --network vllm_nginx --gpus device=1 --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
```
-:::{note}
-If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
-:::
+!!! note
+ If you are behind proxy, you can pass the proxy settings to the docker run command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
-(nginxloadbalancer-nginx-launch-nginx)=
+[](){ #nginxloadbalancer-nginx-launch-nginx }
## Launch Nginx
@@ -117,7 +117,7 @@ If you are behind proxy, you can pass the proxy settings to the docker run comma
docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest
```
-(nginxloadbalancer-nginx-verify-nginx)=
+[](){ #nginxloadbalancer-nginx-verify-nginx }
## Verify That vLLM Servers Are Ready
diff --git a/docs/source/deployment/security.md b/docs/deployment/security.md
similarity index 100%
rename from docs/source/deployment/security.md
rename to docs/deployment/security.md
diff --git a/docs/source/design/arch_overview.md b/docs/design/arch_overview.md
similarity index 81%
rename from docs/source/design/arch_overview.md
rename to docs/design/arch_overview.md
index 94bda8b5c5..75d3e1b7cc 100644
--- a/docs/source/design/arch_overview.md
+++ b/docs/design/arch_overview.md
@@ -1,22 +1,18 @@
-(arch-overview)=
-
-# Architecture Overview
+---
+title: Architecture Overview
+---
+[](){ #arch-overview }
This document provides an overview of the vLLM architecture.
-:::{contents} Table of Contents
-:depth: 2
-:local: true
-:::
+[TOC]
## Entrypoints
vLLM provides a number of entrypoints for interacting with the system. The
following diagram shows the relationship between them.
-:::{image} /assets/design/arch_overview/entrypoints.excalidraw.png
-:alt: Entrypoints Diagram
-:::
+
### LLM Class
@@ -77,16 +73,14 @@ python -m vllm.entrypoints.openai.api_server --model
That code can be found in .
-More details on the API server can be found in the [OpenAI-Compatible Server](#openai-compatible-server) document.
+More details on the API server can be found in the [OpenAI-Compatible Server][openai-compatible-server] document.
## LLM Engine
The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of
the vLLM system, handling model inference and asynchronous request processing.
-:::{image} /assets/design/arch_overview/llm_engine.excalidraw.png
-:alt: LLMEngine Diagram
-:::
+
### LLMEngine
@@ -137,18 +131,16 @@ input tensors and capturing cudagraphs.
## Model
Every model runner object has one model object, which is the actual
-`torch.nn.Module` instance. See [huggingface_integration](#huggingface-integration) for how various
+`torch.nn.Module` instance. See [huggingface_integration][huggingface-integration] for how various
configurations affect the class we ultimately get.
## Class Hierarchy
The following figure shows the class hierarchy of vLLM:
-> :::{figure} /assets/design/hierarchy.png
-> :align: center
-> :alt: query
-> :width: 100%
-> :::
+>
+> { align="center" alt="query" width="100%" }
+>
There are several important design choices behind this class hierarchy:
@@ -178,44 +170,43 @@ of a vision model and a language model. By making the constructor uniform, we
can easily create a vision model and a language model and compose them into a
vision-language model.
-:::{note}
-To support this change, all vLLM models' signatures have been updated to:
+!!! note
+ To support this change, all vLLM models' signatures have been updated to:
-```python
-def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
-```
-
-To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:
-
-```python
-class MyOldModel(nn.Module):
- def __init__(
- self,
- config,
- cache_config: Optional[CacheConfig] = None,
- quant_config: Optional[QuantizationConfig] = None,
- lora_config: Optional[LoRAConfig] = None,
- prefix: str = "",
- ) -> None:
- ...
-
-from vllm.config import VllmConfig
-class MyNewModel(MyOldModel):
+ ```python
def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
- config = vllm_config.model_config.hf_config
- cache_config = vllm_config.cache_config
- quant_config = vllm_config.quant_config
- lora_config = vllm_config.lora_config
- super().__init__(config, cache_config, quant_config, lora_config, prefix)
+ ```
-if __version__ >= "0.6.4":
- MyModel = MyNewModel
-else:
- MyModel = MyOldModel
-```
+ To avoid accidentally passing incorrect arguments, the constructor is now keyword-only. This ensures that the constructor will raise an error if old configurations are passed. vLLM developers have already made this change for all models within vLLM. For out-of-tree registered models, developers need to update their models, for example by adding shim code to adapt the old constructor signature to the new one:
-This way, the model can work with both old and new versions of vLLM.
-:::
+ ```python
+ class MyOldModel(nn.Module):
+ def __init__(
+ self,
+ config,
+ cache_config: Optional[CacheConfig] = None,
+ quant_config: Optional[QuantizationConfig] = None,
+ lora_config: Optional[LoRAConfig] = None,
+ prefix: str = "",
+ ) -> None:
+ ...
+
+ from vllm.config import VllmConfig
+ class MyNewModel(MyOldModel):
+ def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
+ config = vllm_config.model_config.hf_config
+ cache_config = vllm_config.cache_config
+ quant_config = vllm_config.quant_config
+ lora_config = vllm_config.lora_config
+ super().__init__(config, cache_config, quant_config, lora_config, prefix)
+
+ if __version__ >= "0.6.4":
+ MyModel = MyNewModel
+ else:
+ MyModel = MyOldModel
+ ```
+
+ This way, the model can work with both old and new versions of vLLM.
3\. **Sharding and Quantization at Initialization**: Certain features require
changing the model weights. For example, tensor parallelism needs to shard the
diff --git a/docs/source/design/automatic_prefix_caching.md b/docs/design/automatic_prefix_caching.md
similarity index 98%
rename from docs/source/design/automatic_prefix_caching.md
rename to docs/design/automatic_prefix_caching.md
index 3928e0c165..80883bb1d9 100644
--- a/docs/source/design/automatic_prefix_caching.md
+++ b/docs/design/automatic_prefix_caching.md
@@ -1,6 +1,7 @@
-(design-automatic-prefix-caching)=
-
-# Automatic Prefix Caching
+---
+title: Automatic Prefix Caching
+---
+[](){ #design-automatic-prefix-caching }
The core idea of [PagedAttention](https://blog.vllm.ai/2023/06/20/vllm.html) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
diff --git a/docs/source/design/huggingface_integration.md b/docs/design/huggingface_integration.md
similarity index 98%
rename from docs/source/design/huggingface_integration.md
rename to docs/design/huggingface_integration.md
index 7d271b1cfb..68cc27ea76 100644
--- a/docs/source/design/huggingface_integration.md
+++ b/docs/design/huggingface_integration.md
@@ -1,6 +1,7 @@
-(huggingface-integration)=
-
-# Integration with HuggingFace
+---
+title: Integration with HuggingFace
+---
+[](){ #huggingface-integration }
This document describes how vLLM integrates with HuggingFace libraries. We will explain step by step what happens under the hood when we run `vllm serve`.
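One early step in that process is resolving the model's HF configuration: vLLM reads the repo's `config.json` and uses the reported architecture to pick the corresponding model class. A rough standalone illustration of what that lookup sees, with an example model name:

```python
# Peek at the HF config information vLLM relies on when resolving a model.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/opt-125m")  # example model
print(type(config).__name__)  # e.g. OPTConfig
print(config.architectures)   # e.g. ['OPTForCausalLM'], used to select the model class
```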
diff --git a/docs/source/design/kernel/paged_attention.md b/docs/design/kernel/paged_attention.md
similarity index 94%
rename from docs/source/design/kernel/paged_attention.md
rename to docs/design/kernel/paged_attention.md
index e1770c8226..ad8b5c9264 100644
--- a/docs/source/design/kernel/paged_attention.md
+++ b/docs/design/kernel/paged_attention.md
@@ -1,6 +1,7 @@
-(design-paged-attention)=
-
-# vLLM Paged Attention
+---
+title: vLLM Paged Attention
+---
+[](){ #design-paged-attention }
- Currently, vLLM utilizes its own implementation of a multi-head query
attention kernel (`csrc/attention/attention_kernels.cu`).
@@ -139,26 +140,22 @@
const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
```
- :::{figure} ../../assets/kernel/query.png
- :align: center
- :alt: query
- :width: 70%
-
- Query data of one token at one head
- :::
+
+ { align="center" alt="query" width="70%" }
+
+
+
- Each thread defines its own `q_ptr` which points to the assigned
query token data on global memory. For example, if `VEC_SIZE` is 4
and `HEAD_SIZE` is 128, the `q_ptr` points to data that contains
total of 128 elements divided into 128 / 4 = 32 vecs.
- :::{figure} ../../assets/kernel/q_vecs.png
- :align: center
- :alt: q_vecs
- :width: 70%
-
- `q_vecs` for one thread group
- :::
+
+ { align="center" alt="q_vecs" width="70%" }
+
+
+
```cpp
__shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
@@ -195,13 +192,11 @@
points to key token data based on `k_cache` at assigned block,
assigned head and assigned token.
- :::{figure} ../../assets/kernel/key.png
- :align: center
- :alt: key
- :width: 70%
-
- Key data of all context tokens at one head
- :::
+
+ { align="center" alt="key" width="70%" }
+
+
+
- The diagram above illustrates the memory layout for key data. It
assumes that the `BLOCK_SIZE` is 16, `HEAD_SIZE` is 128, `x` is
@@ -214,13 +209,11 @@
elements for one token) that will be processed by 2 threads (one
thread group) separately.
- :::{figure} ../../assets/kernel/k_vecs.png
- :align: center
- :alt: k_vecs
- :width: 70%
-
- `k_vecs` for one thread
- :::
+
+ { align="center" alt="k_vecs" width="70%" }
+
+
+
```cpp
K_vec k_vecs[NUM_VECS_PER_THREAD]
@@ -289,14 +282,12 @@
should be performed across the entire thread block, encompassing
results between the query token and all context key tokens.
- :::{math}
- :nowrap: true
-
+ $$
\begin{gather*}
m(x):=\max _i \quad x_i \\ \quad f(x):=\left[\begin{array}{lll}e^{x_1-m(x)} & \ldots & e^{x_B-m(x)}\end{array}\right]\\ \quad \ell(x):=\sum_i f(x)_i \\
\quad \operatorname{softmax}(x):=\frac{f(x)}{\ell(x)}
\end{gather*}
- :::
+ $$
### `qk_max` and `logits`
@@ -379,29 +370,23 @@
## Value
-:::{figure} ../../assets/kernel/value.png
-:align: center
-:alt: value
-:width: 70%
+
+ { align="center" alt="value" width="70%" }
+
+
+
-Value data of all context tokens at one head
-:::
+
+ { align="center" alt="logits_vec" width="50%" }
+
+
+
-:::{figure} ../../assets/kernel/logits_vec.png
-:align: center
-:alt: logits_vec
-:width: 50%
-
-`logits_vec` for one thread
-:::
-
-:::{figure} ../../assets/kernel/v_vec.png
-:align: center
-:alt: v_vec
-:width: 70%
-
-List of `v_vec` for one thread
-:::
+
+ { align="center" alt="v_vec" width="70%" }
+
+
+
- Now we need to retrieve the value data and perform dot multiplication
with `logits`. Unlike query and key, there is no thread group
diff --git a/docs/source/design/mm_processing.md b/docs/design/mm_processing.md
similarity index 61%
rename from docs/source/design/mm_processing.md
rename to docs/design/mm_processing.md
index dc92a3c2c5..f3685ce76a 100644
--- a/docs/source/design/mm_processing.md
+++ b/docs/design/mm_processing.md
@@ -1,10 +1,11 @@
-(mm-processing)=
+---
+title: Multi-Modal Data Processing
+---
+[](){ #mm-processing }
-# Multi-Modal Data Processing
+To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching][automatic-prefix-caching], we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. ``) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
-To enable various optimizations in vLLM such as [chunked prefill](#chunked-prefill) and [prefix caching](#automatic-prefix-caching), we use {class}`~vllm.multimodal.processing.BaseMultiModalProcessor` to provide the correspondence between placeholder feature tokens (e.g. ``) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
-
-Here are the main features of {class}`~vllm.multimodal.processing.BaseMultiModalProcessor`:
+Here are the main features of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]:
## Prompt Update Detection
@@ -15,7 +16,7 @@ One of the main responsibilities of HF processor is to update the prompt with pl
The information about which tokens have been updated is key to finding the correspondence between placeholder feature tokens and multi-modal inputs.
-In vLLM, this information is specified using {class}`~vllm.multimodal.processing.PromptUpdate` in {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates`. We can automatically detect whether HF has updated the prompt by checking the existence of the updated tokens.
+In vLLM, this information is specified using [PromptUpdate][vllm.multimodal.processing.PromptUpdate] in [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates]. We can automatically detect whether HF has updated the prompt by checking the existence of the updated tokens.
## Tokenized Prompt Inputs
@@ -43,22 +44,22 @@ While HF processors support text + multi-modal inputs natively, this is not so f
Moreover, since the tokenized text has not passed through the HF processor, we have to apply Step 3 by ourselves to keep the output tokens and multi-modal data consistent with each other.
-(mm-dummy-text)=
+[](){ #mm-dummy-text }
### Dummy text
-We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text`. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data.
+We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via [get_dummy_text][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text]. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data.
-(mm-automatic-prompt-updating)=
+[](){ #mm-automatic-prompt-updating }
### Automatic prompt updating
We address the second issue by implementing model-agnostic code in
-{meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_prompt_updates` to automatically update the prompt with feature placeholder tokens based on the specification outputted by {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates`.
+[_apply_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._apply_prompt_updates] to automatically update the prompt with feature placeholder tokens based on the specification outputted by [_get_prompt_updates][vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates].
### Summary
-With the help of dummy text and automatic prompt updating, our multi-modal processor can finally accept both text and token prompts with multi-modal data. The detailed logic is shown in {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_main`.
+With the help of dummy text and automatic prompt updating, our multi-modal processor can finally accept both text and token prompts with multi-modal data. The detailed logic is shown in [_apply_hf_processor_main][vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_main].
## Processor Output Caching
@@ -66,4 +67,4 @@ Some HF processors, such as the one for Qwen2-VL, are [very slow](gh-issue:9238)
When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache.
-Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of the multi-modal inputs, so they can't be passed alongside the text prompt to HF processor. Therefore, we process the text and multi-modal inputs separately, using [dummy text](#mm-dummy-text) to avoid HF errors. Since this skips HF's prompt updating code, we apply [automatic prompt updating](#mm-automatic-prompt-updating) afterwards to keep the output tokens and multi-modal data consistent with each other.
+Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of the multi-modal inputs, so they can't be passed alongside the text prompt to HF processor. Therefore, we process the text and multi-modal inputs separately, using [dummy text][mm-dummy-text] to avoid HF errors. Since this skips HF's prompt updating code, we apply [automatic prompt updating][mm-automatic-prompt-updating] afterwards to keep the output tokens and multi-modal data consistent with each other.
diff --git a/docs/source/design/multiprocessing.md b/docs/design/multiprocessing.md
similarity index 97%
rename from docs/source/design/multiprocessing.md
rename to docs/design/multiprocessing.md
index 43fe5fe2e5..649edfcce6 100644
--- a/docs/source/design/multiprocessing.md
+++ b/docs/design/multiprocessing.md
@@ -2,14 +2,13 @@
## Debugging
-Please see the [Troubleshooting](#troubleshooting-python-multiprocessing)
+Please see the [Troubleshooting][troubleshooting-python-multiprocessing]
page for information on known issues and how to solve them.
## Introduction
-:::{important}
-The source code references are to the state of the code at the time of writing in December, 2024.
-:::
+!!! warning
+ The source code references are to the state of the code at the time of writing in December, 2024.
The use of Python multiprocessing in vLLM is complicated by:
diff --git a/docs/source/design/plugin_system.md b/docs/design/plugin_system.md
similarity index 86%
rename from docs/source/design/plugin_system.md
rename to docs/design/plugin_system.md
index 225030885f..5027a35c23 100644
--- a/docs/source/design/plugin_system.md
+++ b/docs/design/plugin_system.md
@@ -1,12 +1,13 @@
-(plugin-system)=
-
-# vLLM's Plugin System
+---
+title: vLLM's Plugin System
+---
+[](){ #plugin-system }
The community frequently requests the ability to extend vLLM with custom features. To facilitate this, vLLM includes a plugin system that allows users to add custom features without modifying the vLLM codebase. This document explains how plugins work in vLLM and how to create a plugin for vLLM.
## How Plugins Work in vLLM
-Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [](#arch-overview)), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work.
+Plugins are user-registered code that vLLM executes. Given vLLM's architecture (see [Arch Overview][arch-overview]), multiple processes may be involved, especially when using distributed inference with various parallelism techniques. To enable plugins successfully, every process created by vLLM needs to load the plugin. This is done by the [load_general_plugins](https://github.com/vllm-project/vllm/blob/c76ac49d266e27aa3fea84ef2df1f813d24c91c7/vllm/plugins/__init__.py#L16) function in the `vllm.plugins` module. This function is called for every process created by vLLM before it starts any work.
## How vLLM Discovers Plugins
diff --git a/docs/source/design/v1/metrics.md b/docs/design/v1/metrics.md
similarity index 98%
rename from docs/source/design/v1/metrics.md
rename to docs/design/v1/metrics.md
index de80226553..2631f28e46 100644
--- a/docs/source/design/v1/metrics.md
+++ b/docs/design/v1/metrics.md
@@ -57,7 +57,7 @@ In v0, the following metrics are exposed via a Prometheus-compatible `/metrics`
- `vllm:spec_decode_num_draft_tokens_total` (Counter)
- `vllm:spec_decode_num_emitted_tokens_total` (Counter)
-These are documented under [Inferencing and Serving -> Production Metrics](project:../../serving/metrics.md).
+These are documented under [Inferencing and Serving -> Production Metrics](../../serving/metrics.md).
### Grafana Dashboard
@@ -222,9 +222,7 @@ And the calculated intervals are:
Put another way:
-:::{image} /assets/design/v1/metrics/intervals-1.png
-:alt: Interval calculations - common case
-:::
+
We explored the possibility of having the frontend calculate these
intervals using the timing of events visible to the frontend. However,
@@ -239,17 +237,13 @@ When a preemption occurs during decode, since any already generated
tokens are reused, we consider the preemption as affecting the
inter-token, decode, and inference intervals.
-:::{image} /assets/design/v1/metrics/intervals-2.png
-:alt: Interval calculations - preempted decode
-:::
+
When a preemption occurs during prefill (assuming such an event
is possible), we consider the preemption as affecting the
time-to-first-token and prefill intervals.
-:::{image} /assets/design/v1/metrics/intervals-3.png
-:alt: Interval calculations - preempted prefill
-:::
+
### Frontend Stats Collection
@@ -467,7 +461,7 @@ In general:
hatch](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics)
for some time before deleting them.
-See the [deprecation policy](project:../../contributing/deprecation_policy.md) for
+See the [deprecation policy](../../contributing/deprecation_policy.md) for
the project-wide deprecation policy.
### Unimplemented - `vllm:tokens_total`
diff --git a/docs/source/design/v1/prefix_caching.md b/docs/design/v1/prefix_caching.md
similarity index 94%
rename from docs/source/design/v1/prefix_caching.md
rename to docs/design/v1/prefix_caching.md
index 0f74757777..ad041b0059 100644
--- a/docs/source/design/v1/prefix_caching.md
+++ b/docs/design/v1/prefix_caching.md
@@ -122,9 +122,7 @@ There are two design points to highlight:
As a result, we will have the following components when the KV cache manager is initialized:
-:::{image} /assets/design/v1/prefix_caching/overview.png
-:alt: Component Overview
-:::
+
* Block Pool: A list of KVCacheBlock.
* Free Block Queue: Only stores the pointers of the head and tail blocks for manipulation.
@@ -194,9 +192,7 @@ As can be seen, block 3 is a new full block and is cached. However, it is redund
When a request is finished, we free all its blocks if no other requests are using them (reference count = 0). In this example, we free request 1 and blocks 2, 3, 4, and 8 associated with it. We can see that the freed blocks are added to the tail of the free queue in the *reverse* order. This is because the last block of a request must hash more tokens and is less likely to be reused by other requests. As a result, it should be evicted first.
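
As a toy illustration of this ordering (plain Python, not vLLM code; block IDs taken from the example above):

```python
from collections import deque

# Toy model of the free queue: eviction pops from the head (LRU side), and
# freed blocks are appended to the tail in *reverse* order so that the last
# block of the finished request is evicted before its siblings.
free_queue = deque([5, 6])                    # blocks that were already free

finished_request_blocks = [2, 3, 4, 8]        # assume block 8 is the request's last block
free_queue.extend(reversed(finished_request_blocks))

print(list(free_queue))      # [5, 6, 8, 4, 3, 2] -> block 8 sits closest to the head
print(free_queue.popleft())  # eviction takes the head of the queue
```
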
-:::{image} /assets/design/v1/prefix_caching/free.png
-:alt: Free Queue after Free a Request
-:::
+
### Eviction (LRU)
@@ -212,36 +208,24 @@ In this example, we assume the block size is 4 (each block can cache 4 tokens),
**Time 1: The cache is empty and a new request comes in.** We allocate 4 blocks. 3 of them are already full and cached. The fourth block is partially full with 3 of 4 tokens.
-:::{image} /assets/design/v1/prefix_caching/example-time-1.png
-:alt: Example Time 1
-:::
+
**Time 3: Request 0 makes block 3 full and asks for a new block to keep decoding.** We cache block 3 and allocate block 4.
-:::{image} /assets/design/v1/prefix_caching/example-time-3.png
-:alt: Example Time 3
-:::
+
**Time 4: Request 1 comes in with 14 prompt tokens, where the first 10 tokens are the same as request 0.** We can see that only the first 2 blocks (8 tokens) hit the cache, because the 3rd block only matches 2 of 4 tokens.
-:::{image} /assets/design/v1/prefix_caching/example-time-4.png
-:alt: Example Time 4
-:::
+
**Time 5: Request 0 is finished and freed.** Blocks 2, 3 and 4 are added to the free queue in the reverse order (but blocks 2 and 3 are still cached). Blocks 0 and 1 are not added to the free queue because they are being used by Request 1.
-:::{image} /assets/design/v1/prefix_caching/example-time-5.png
-:alt: Example Time 5
-:::
+
**Time 6: Request 1 is finished and freed.**
-:::{image} /assets/design/v1/prefix_caching/example-time-6.png
-:alt: Example Time 6
-:::
+
**Time 7: Request 2 comes in with 29 prompt tokens, where the first 12 tokens are the same as request 0\.** Note that even though the block order in the free queue was `7 - 8 - 9 - 4 - 3 - 2 - 6 - 5 - 1 - 0`, the cache hit blocks (i.e., 0, 1, 2) are touched and removed from the queue before allocation, so the free queue becomes `7 - 8 - 9 - 4 - 3 - 6 - 5`. As a result, the allocated blocks are 0 (cached), 1 (cached), 2 (cached), 7, 8, 9, 4, 3 (evicted).
-:::{image} /assets/design/v1/prefix_caching/example-time-7.png
-:alt: Example Time 7
-:::
+
diff --git a/docs/source/design/v1/torch_compile.md b/docs/design/v1/torch_compile.md
similarity index 100%
rename from docs/source/design/v1/torch_compile.md
rename to docs/design/v1/torch_compile.md
diff --git a/docs/source/features/automatic_prefix_caching.md b/docs/features/automatic_prefix_caching.md
similarity index 91%
rename from docs/source/features/automatic_prefix_caching.md
rename to docs/features/automatic_prefix_caching.md
index 5c5b37c2a0..5e92796ddd 100644
--- a/docs/source/features/automatic_prefix_caching.md
+++ b/docs/features/automatic_prefix_caching.md
@@ -1,14 +1,14 @@
-(automatic-prefix-caching)=
-
-# Automatic Prefix Caching
+---
+title: Automatic Prefix Caching
+---
+[](){ #automatic-prefix-caching }
## Introduction
Automatic Prefix Caching (APC in short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing the new query to skip the computation of the shared part.
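
For example, a minimal offline sketch (the flag name below is assumed to match your vLLM version) where the second query reuses the KV cache of the shared document prefix:

```python
from vllm import LLM, SamplingParams

# Assumed flag name for enabling APC; verify against your vLLM version.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
params = SamplingParams(temperature=0.0, max_tokens=64)

shared_prefix = "<a long document pasted here>\n\n"

# The prefill of `shared_prefix` is computed once and reused by the second query.
print(llm.generate(shared_prefix + "Q: Summarize the document.", params)[0].outputs[0].text)
print(llm.generate(shared_prefix + "Q: List the key dates mentioned.", params)[0].outputs[0].text)
```
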
-:::{note}
-Technical details on how vLLM implements APC can be found [here](#design-automatic-prefix-caching).
-:::
+!!! note
+ Technical details on how vLLM implements APC can be found [here][design-automatic-prefix-caching].
## Enabling APC in vLLM
diff --git a/docs/features/compatibility_matrix.md b/docs/features/compatibility_matrix.md
new file mode 100644
index 0000000000..77ceea49f1
--- /dev/null
+++ b/docs/features/compatibility_matrix.md
@@ -0,0 +1,77 @@
+---
+title: Compatibility Matrix
+---
+[](){ #compatibility-matrix }
+
+The tables below show which features are compatible with each other and which features are supported on each hardware platform.
+
+The symbols used have the following meanings:
+
+- ✅ = Full compatibility
+- 🟠 = Partial compatibility
+- ❌ = No compatibility
+
+!!! note
+    Check the ❌ or 🟠 entries with links to see the tracking issue for that unsupported feature/hardware combination.
+
+## Feature x Feature
+
+
+
+| Feature | [CP][chunked-prefill] | [APC][automatic-prefix-caching] | [LoRA][lora-adapter] | prmpt adptr | [SD][spec-decode] | CUDA graph | pooling | enc-dec | logP | prmpt logP | async output | multi-step | mm | best-of | beam-search |
+|-----------------------------------------------------------|-------------------------|-----------------------------------|------------------------|---------------------------------------------------|---------------------|--------------|-----------------------------------------------|-------------------------------------------------------|--------------------------------------|---------------------------------------------------|-------------------------------------------------------------|--------------------|---------------------------------------------|-----------|---------------|
+| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | |
+| [APC][automatic-prefix-caching] | ✅ | ✅ | | | | | | | | | | | | | |
+| [LoRA][lora-adapter] | ✅ | ✅ | ✅ | | | | | | | | | | | | |
+| prmpt adptr | ✅ | ✅ | ✅ | ✅ | | | | | | | | | | | |
+| [SD][spec-decode] | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | | | | | |
+| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | | | | | | | | |
+| pooling | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | | | | | | | | |
+| enc-dec | ❌ | [❌](gh-issue:7366) | ❌ | ❌ | [❌](gh-issue:7366) | ✅ | ✅ | ✅ | | | | | | | |
+| logP | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | | | | | | |
+| prmpt logP | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | | | | | |
+| async output | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | | | | |
+| multi-step | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | | | |
+| mm | ✅ | [🟠](gh-pr:8348) | [🟠](gh-pr:4194) | ❔ | ❔ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ | ✅ | | |
+| best-of | ✅ | ✅ | ✅ | ✅ | [❌](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [❌](gh-issue:7968) | ✅ | ✅ | |
+| beam-search | ✅ | ✅ | ✅ | ✅ | [❌](gh-issue:6137) | ✅ | ❌ | ✅ | ✅ | ✅ | ❔ | [❌](gh-issue:7968) | ❔ | ✅ | ✅ |
+
+[](){ #feature-x-hardware }
+
+## Feature x Hardware
+
+| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD |
+|-----------------------------------------------------------|--------------------|----------|----------|-------|----------|--------------------|-------|
+| [CP][chunked-prefill] | [❌](gh-issue:2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [APC][automatic-prefix-caching] | [❌](gh-issue:3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| [LoRA][lora-adapter] | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| prmpt adptr | ✅ | ✅ | ✅ | ✅ | ✅ | [❌](gh-issue:8475) | ✅ |
+| [SD][spec-decode] | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| CUDA graph | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
+| pooling | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❔ |
+| enc-dec | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
+| mm | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| logP | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| prmpt logP | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| async output | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
+| multi-step | ✅ | ✅ | ✅ | ✅ | ✅ | [❌](gh-issue:8477) | ✅ |
+| best-of | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| beam-search | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
diff --git a/docs/source/features/disagg_prefill.md b/docs/features/disagg_prefill.md
similarity index 87%
rename from docs/source/features/disagg_prefill.md
rename to docs/features/disagg_prefill.md
index 2fa20140c0..54be05647d 100644
--- a/docs/source/features/disagg_prefill.md
+++ b/docs/features/disagg_prefill.md
@@ -1,12 +1,12 @@
-(disagg-prefill)=
-
-# Disaggregated Prefilling (experimental)
+---
+title: Disaggregated Prefilling (experimental)
+---
+[](){ #disagg-prefill }
This page introduces you to the disaggregated prefilling feature in vLLM.
-:::{note}
-This feature is experimental and subject to change.
-:::
+!!! note
+ This feature is experimental and subject to change.
## Why disaggregated prefilling?
@@ -15,9 +15,8 @@ Two main reasons:
- **Tuning time-to-first-token (TTFT) and inter-token-latency (ITL) separately**. Disaggregated prefilling puts the prefill and decode phases of LLM inference inside different vLLM instances. This gives you the flexibility to assign different parallel strategies (e.g. `tp` and `pp`) to tune TTFT without affecting ITL, or to tune ITL without affecting TTFT.
- **Controlling tail ITL**. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request. This results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL. Chunked prefill with a proper chunk size can also achieve the same goal, but in practice it's hard to figure out the correct chunk size value. So disaggregated prefilling is a much more reliable way to control tail ITL.
-:::{note}
-Disaggregated prefill DOES NOT improve throughput.
-:::
+!!! note
+ Disaggregated prefill DOES NOT improve throughput.
## Usage example
@@ -39,21 +38,16 @@ Key abstractions for disaggregated prefilling:
- **LookupBuffer**: LookupBuffer provides two API: `insert` KV cache and `drop_select` KV cache. The semantics of `insert` and `drop_select` are similar to SQL, where `insert` inserts a KV cache into the buffer, and `drop_select` returns the KV cache that matches the given condition and drop it from the buffer.
- **Pipe**: A single-direction FIFO pipe for tensor transmission. It supports `send_tensor` and `recv_tensor`.
-:::{note}
-`insert` is non-blocking operation but `drop_select` is blocking operation.
-:::
+!!! note
+    `insert` is a non-blocking operation, but `drop_select` is a blocking operation.
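
A schematic sketch of the LookupBuffer and Pipe abstractions as described above (the method names come from this page; the signatures and types are illustrative assumptions, not vLLM's actual interfaces):

```python
from typing import Optional
import torch

class Pipe:
    """Single-direction FIFO pipe for tensor transmission."""

    def send_tensor(self, tensor: torch.Tensor) -> None:
        """Queue a tensor for transmission to the peer instance."""

    def recv_tensor(self) -> Optional[torch.Tensor]:
        """Receive the next tensor from the peer instance."""

class LookupBuffer:
    """KV cache buffer with SQL-like `insert` / `drop_select` semantics."""

    def insert(self, condition: torch.Tensor, kv_cache: torch.Tensor) -> None:
        """Non-blocking: store a KV cache entry keyed by `condition`."""

    def drop_select(self, condition: torch.Tensor) -> torch.Tensor:
        """Blocking: return the KV cache matching `condition` and drop it from the buffer."""
```
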
Here is a figure illustrating how the above 3 abstractions are organized:
-:::{image} /assets/features/disagg_prefill/abstraction.jpg
-:alt: Disaggregated prefilling abstractions
-:::
+
The workflow of disaggregated prefilling is as follows:
-:::{image} /assets/features/disagg_prefill/overview.jpg
-:alt: Disaggregated prefilling workflow
-:::
+
The `buffer` corresponds to the `insert` API in LookupBuffer, and the `drop_select` corresponds to the `drop_select` API in LookupBuffer.
diff --git a/docs/source/features/lora.md b/docs/features/lora.md
similarity index 96%
rename from docs/source/features/lora.md
rename to docs/features/lora.md
index 5a3ce0c01f..642462f7c4 100644
--- a/docs/source/features/lora.md
+++ b/docs/features/lora.md
@@ -1,10 +1,11 @@
-(lora-adapter)=
-
-# LoRA Adapters
+---
+title: LoRA Adapters
+---
+[](){ #lora-adapter }
This document shows you how to use [LoRA adapters](https://arxiv.org/abs/2106.09685) with vLLM on top of a base model.
-LoRA adapters can be used with any vLLM model that implements {class}`~vllm.model_executor.models.interfaces.SupportsLoRA`.
+LoRA adapters can be used with any vLLM model that implements [SupportsLoRA][vllm.model_executor.models.interfaces.SupportsLoRA].
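
As a quick illustration of per-request adapters in offline mode (the adapter name and local path below are placeholders):

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Placeholder adapter name/path; enable_lora switches on LoRA support in the engine.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)
sql_lora = LoRARequest("sql-lora", 1, "/path/to/sql-lora-adapter")

outputs = llm.generate(
    "Write a SQL query that lists all users older than 30.",
    SamplingParams(max_tokens=64),
    lora_request=sql_lora,
)
print(outputs[0].outputs[0].text)
```
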
Adapters can be efficiently served on a per-request basis with minimal overhead. First, we download the adapter(s) and save
them locally with
@@ -60,9 +61,8 @@ vllm serve meta-llama/Llama-2-7b-hf \
--lora-modules sql-lora=$HOME/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/snapshots/0dfa347e8877a4d4ed19ee56c140fa518470028c/
```
-:::{note}
-The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
-:::
+!!! note
+ The commit ID `0dfa347e8877a4d4ed19ee56c140fa518470028c` may change over time. Please check the latest commit ID in your environment to ensure you are using the correct one.
The server entrypoint accepts all other LoRA configuration parameters (`max_loras`, `max_lora_rank`, `max_cpu_loras`,
etc.), which will apply to all forthcoming requests. Upon querying the `/models` endpoint, we should see our LoRA along
diff --git a/docs/source/features/multimodal_inputs.md b/docs/features/multimodal_inputs.md
similarity index 84%
rename from docs/source/features/multimodal_inputs.md
rename to docs/features/multimodal_inputs.md
index bb2997f008..19b6681729 100644
--- a/docs/source/features/multimodal_inputs.md
+++ b/docs/features/multimodal_inputs.md
@@ -1,20 +1,20 @@
-(multimodal-inputs)=
+---
+title: Multimodal Inputs
+---
+[](){ #multimodal-inputs }
-# Multimodal Inputs
+This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.
-This page teaches you how to pass multi-modal inputs to [multi-modal models](#supported-mm-models) in vLLM.
-
-:::{note}
-We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
-and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
-:::
+!!! note
+ We are actively iterating on multi-modal support. See [this RFC](gh-issue:4194) for upcoming changes,
+ and [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) if you have any feedback or feature requests.
## Offline Inference
-To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`:
+To input multi-modal data, follow this schema in [vllm.inputs.PromptType][]:
- `prompt`: The prompt should follow the format that is documented on HuggingFace.
-- `multi_modal_data`: This is a dictionary that follows the schema defined in {class}`vllm.multimodal.inputs.MultiModalDataDict`.
+- `multi_modal_data`: This is a dictionary that follows the schema defined in [vllm.multimodal.inputs.MultiModalDataDict][].
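
For instance, a minimal sketch of this schema with a single image (the model and prompt format are illustrative; follow the format documented for your model):

```python
from PIL import Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# `prompt` follows the model's documented format; `multi_modal_data` maps a
# modality name ("image") to the raw input.
outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is shown in this image?\nASSISTANT:",
    "multi_modal_data": {"image": Image.open("example.jpg")},
})
print(outputs[0].outputs[0].text)
```
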
### Image Inputs
@@ -211,16 +211,15 @@ for o in outputs:
Our OpenAI-compatible server accepts multi-modal data via the [Chat Completions API](https://platform.openai.com/docs/api-reference/chat).
-:::{important}
-A chat template is **required** to use Chat Completions API.
-For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.
+!!! warning
+ A chat template is **required** to use Chat Completions API.
+ For HF format models, the default chat template is defined inside `chat_template.json` or `tokenizer_config.json`.
-If no default chat template is available, we will first look for a built-in fallback in .
-If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
+ If no default chat template is available, we will first look for a built-in fallback in .
+ If no fallback is available, an error is raised and you have to provide the chat template manually via the `--chat-template` argument.
-For certain models, we provide alternative chat templates inside .
-For example, VLM2Vec uses which is different from the default one for Phi-3-Vision.
-:::
+ For certain models, we provide alternative chat templates inside .
+ For example, VLM2Vec uses which is different from the default one for Phi-3-Vision.
### Image Inputs
@@ -284,25 +283,21 @@ print("Chat completion output:", chat_response.choices[0].message.content)
Full example:
-:::{tip}
-Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
-and pass the file path as `url` in the API request.
-:::
+!!! tip
+ Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via `--allowed-local-media-path` when launching the API server/engine,
+ and pass the file path as `url` in the API request.
-:::{tip}
-There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
-In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
-:::
+!!! tip
+ There is no need to place image placeholders in the text content of the API request - they are already represented by the image content.
+ In fact, you can place image placeholders in the middle of the text by interleaving text and image content.
-:::{note}
-By default, the timeout for fetching images through HTTP URL is `5` seconds.
-You can override this by setting the environment variable:
+!!! note
+ By default, the timeout for fetching images through HTTP URL is `5` seconds.
+ You can override this by setting the environment variable:
-```console
-export VLLM_IMAGE_FETCH_TIMEOUT=
-```
-
-:::
+ ```console
+ export VLLM_IMAGE_FETCH_TIMEOUT=
+ ```
### Video Inputs
@@ -357,15 +352,13 @@ print("Chat completion output from image url:", result)
Full example:
-:::{note}
-By default, the timeout for fetching videos through HTTP URL is `30` seconds.
-You can override this by setting the environment variable:
+!!! note
+ By default, the timeout for fetching videos through HTTP URL is `30` seconds.
+ You can override this by setting the environment variable:
-```console
-export VLLM_VIDEO_FETCH_TIMEOUT=
-```
-
-:::
+ ```console
+ export VLLM_VIDEO_FETCH_TIMEOUT=
+ ```
### Audio Inputs
@@ -461,15 +454,13 @@ print("Chat completion output from audio url:", result)
Full example:
-:::{note}
-By default, the timeout for fetching audios through HTTP URL is `10` seconds.
-You can override this by setting the environment variable:
+!!! note
+ By default, the timeout for fetching audios through HTTP URL is `10` seconds.
+ You can override this by setting the environment variable:
-```console
-export VLLM_AUDIO_FETCH_TIMEOUT=
-```
-
-:::
+ ```console
+ export VLLM_AUDIO_FETCH_TIMEOUT=
+ ```
### Embedding Inputs
@@ -535,7 +526,6 @@ chat_completion = client.chat.completions.create(
)
```
-:::{note}
-Only one message can contain `{"type": "image_embeds"}`.
-If used with a model that requires additional parameters, you must also provide a tensor for each of them, e.g. `image_grid_thw`, `image_sizes`, etc.
-:::
+!!! note
+ Only one message can contain `{"type": "image_embeds"}`.
+ If used with a model that requires additional parameters, you must also provide a tensor for each of them, e.g. `image_grid_thw`, `image_sizes`, etc.
diff --git a/docs/source/features/prompt_embeds.md b/docs/features/prompt_embeds.md
similarity index 92%
rename from docs/source/features/prompt_embeds.md
rename to docs/features/prompt_embeds.md
index 9d7b242bbe..6f5616e05d 100644
--- a/docs/source/features/prompt_embeds.md
+++ b/docs/features/prompt_embeds.md
@@ -6,13 +6,12 @@ This page teaches you how to pass prompt embedding inputs to vLLM.
The traditional flow of text data for a Large Language Model goes from text to token ids (via a tokenizer) then from token ids to prompt embeddings. For a traditional decoder-only model (such as meta-llama/Llama-3.1-8B-Instruct), this step of converting token ids to prompt embeddings happens via a look-up from a learned embedding matrix, but the model is not limited to processing only the embeddings corresponding to its token vocabulary.
-:::{note}
-Prompt embeddings are currently only supported in the v0 engine.
-:::
+!!! note
+ Prompt embeddings are currently only supported in the v0 engine.
## Offline Inference
-To input multi-modal data, follow this schema in {class}`vllm.inputs.EmbedsPrompt`:
+To input prompt embeddings, follow this schema in [vllm.inputs.EmbedsPrompt][]:
- `prompt_embeds`: A torch tensor representing a sequence of prompt/token embeddings. This has the shape (sequence_length, hidden_size), where sequence_length is the number of token embeddings and hidden_size is the hidden size (embedding size) of the model.
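
A hedged sketch of this flow, building the embeddings from the HF model's input embedding layer (depending on your vLLM version, prompt-embeds support may also need to be enabled on the engine):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
hf_model = AutoModelForCausalLM.from_pretrained(model_id)

# Look up embeddings for the prompt tokens: shape (sequence_length, hidden_size).
token_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids
with torch.no_grad():
    prompt_embeds = hf_model.get_input_embeddings()(token_ids).squeeze(0)

llm = LLM(model=model_id)
outputs = llm.generate({"prompt_embeds": prompt_embeds})
print(outputs[0].outputs[0].text)
```
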
diff --git a/docs/features/quantization/README.md b/docs/features/quantization/README.md
new file mode 100644
index 0000000000..71f62065f6
--- /dev/null
+++ b/docs/features/quantization/README.md
@@ -0,0 +1,22 @@
+---
+title: Quantization
+---
+[](){ #quantization-index }
+
+Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.
+
+Contents:
+
+- [Supported Hardware](supported_hardware.md)
+- [AutoAWQ](auto_awq.md)
+- [BitsAndBytes](bnb.md)
+- [BitBLAS](bitblas.md)
+- [GGUF](gguf.md)
+- [GPTQModel](gptqmodel.md)
+- [INT4 W4A16](int4.md)
+- [INT8 W8A8](int8.md)
+- [FP8 W8A8](fp8.md)
+- [ModelOpt](modelopt.md)
+- [AMD QUARK](quark.md)
+- [Quantized KV Cache](quantized_kvcache.md)
+- [TorchAO](torchao.md)
diff --git a/docs/source/features/quantization/auto_awq.md b/docs/features/quantization/auto_awq.md
similarity index 98%
rename from docs/source/features/quantization/auto_awq.md
rename to docs/features/quantization/auto_awq.md
index b4ac597f5a..5879b3126f 100644
--- a/docs/source/features/quantization/auto_awq.md
+++ b/docs/features/quantization/auto_awq.md
@@ -1,6 +1,7 @@
-(auto-awq)=
-
-# AutoAWQ
+---
+title: AutoAWQ
+---
+[](){ #auto-awq }
To create a new 4-bit quantized model, you can leverage [AutoAWQ](https://github.com/casper-hansen/AutoAWQ).
Quantization reduces the model's precision from BF16/FP16 to INT4, which effectively reduces the total model memory footprint.
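
As a hedged sketch, an existing AWQ checkpoint can also be loaded directly in vLLM (the model name below is just an example):

```python
from vllm import LLM, SamplingParams

# Example AWQ checkpoint from the Hugging Face Hub; quantization="awq" selects the AWQ kernels.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
outputs = llm.generate("What is AWQ quantization?", SamplingParams(max_tokens=48))
print(outputs[0].outputs[0].text)
```
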
diff --git a/docs/source/features/quantization/bitblas.md b/docs/features/quantization/bitblas.md
similarity index 76%
rename from docs/source/features/quantization/bitblas.md
rename to docs/features/quantization/bitblas.md
index d0b2bf858c..8e9cf67a7a 100644
--- a/docs/source/features/quantization/bitblas.md
+++ b/docs/features/quantization/bitblas.md
@@ -1,14 +1,14 @@
-(bitblas)=
-
-# BitBLAS
+---
+title: BitBLAS
+---
+[](){ #bitblas }
vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more efficient and flexible model inference. Compared to other quantization frameworks, BitBLAS provides more precision combinations.
-:::{note}
-Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
-Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper.
-For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html).
-:::
+!!! note
+ Ensure your hardware supports the selected `dtype` (`torch.bfloat16` or `torch.float16`).
+ Most recent NVIDIA GPUs support `float16`, while `bfloat16` is more common on newer architectures like Ampere or Hopper.
+ For details see [supported hardware](https://docs.vllm.ai/en/latest/features/quantization/supported_hardware.html).
Below are the steps to utilize BitBLAS with vLLM.
diff --git a/docs/source/features/quantization/bnb.md b/docs/features/quantization/bnb.md
similarity index 97%
rename from docs/source/features/quantization/bnb.md
rename to docs/features/quantization/bnb.md
index 1843a33a3d..990ac34eb2 100644
--- a/docs/source/features/quantization/bnb.md
+++ b/docs/features/quantization/bnb.md
@@ -1,6 +1,7 @@
-(bits-and-bytes)=
-
-# BitsAndBytes
+---
+title: BitsAndBytes
+---
+[](){ #bits-and-bytes }
vLLM now supports [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) for more efficient model inference.
BitsAndBytes quantizes models to reduce memory usage and enhance performance without significantly sacrificing accuracy.
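
A hedged sketch of in-flight quantization of an unquantized checkpoint (argument values are assumed to match your vLLM version):

```python
from vllm import LLM

# Load an unquantized checkpoint and quantize it on the fly with bitsandbytes.
llm = LLM(
    model="huggyllama/llama-7b",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```
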
diff --git a/docs/source/features/quantization/fp8.md b/docs/features/quantization/fp8.md
similarity index 88%
rename from docs/source/features/quantization/fp8.md
rename to docs/features/quantization/fp8.md
index cb304d5472..01d5d9da04 100644
--- a/docs/source/features/quantization/fp8.md
+++ b/docs/features/quantization/fp8.md
@@ -1,6 +1,7 @@
-(fp8)=
-
-# FP8 W8A8
+---
+title: FP8 W8A8
+---
+[](){ #fp8 }
vLLM supports FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs such as Nvidia H100 and AMD MI300x.
Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8.
@@ -14,10 +15,9 @@ The FP8 types typically supported in hardware have two distinct representations,
- **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
- **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.
-:::{note}
-FP8 computation is supported on NVIDIA GPUs with compute capability > 8.9 (Ada Lovelace, Hopper).
-FP8 models will run on compute capability > 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
-:::
+!!! note
+    FP8 computation is supported on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper).
+    FP8 models will run on compute capability >= 8.0 (Ampere) as weight-only W8A16, utilizing FP8 Marlin.
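
The ranges described above can be checked directly with PyTorch's FP8 dtypes, for example:

```python
import torch

# E4M3 and E5M2 finite ranges as described above (requires a recent PyTorch with FP8 dtypes).
print(torch.finfo(torch.float8_e4m3fn).max)  # 448.0
print(torch.finfo(torch.float8_e5m2).max)    # 57344.0
```
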
## Installation
@@ -94,9 +94,8 @@ print(result[0].outputs[0].text)
Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
-:::{note}
-Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
-:::
+!!! note
+ Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.
```console
$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
@@ -133,6 +132,5 @@ result = model.generate("Hello, my name is")
print(result[0].outputs[0].text)
```
-:::{warning}
-Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
-:::
+!!! warning
+ Currently, we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model.
diff --git a/docs/source/features/quantization/gguf.md b/docs/features/quantization/gguf.md
similarity index 76%
rename from docs/source/features/quantization/gguf.md
rename to docs/features/quantization/gguf.md
index e93e4dcd3b..04ab5945e8 100644
--- a/docs/source/features/quantization/gguf.md
+++ b/docs/features/quantization/gguf.md
@@ -1,14 +1,13 @@
-(gguf)=
+---
+title: GGUF
+---
+[](){ #gguf }
-# GGUF
+!!! warning
+    Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, and it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
-:::{warning}
-Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment, it might be incompatible with other features. Currently, you can use GGUF as a way to reduce memory footprint. If you encounter any issues, please report them to the vLLM team.
-:::
-
-:::{warning}
-Currently, vllm only supports loading single-file GGUF models. If you have a multi-files GGUF model, you can use [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge them to a single-file model.
-:::
+!!! warning
+    Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the [gguf-split](https://github.com/ggerganov/llama.cpp/pull/6135) tool to merge it into a single-file model.
To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
@@ -25,9 +24,8 @@ You can also add `--tensor-parallel-size 2` to enable tensor parallelism inferen
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
```
-:::{warning}
-We recommend using the tokenizer from base model instead of GGUF model. Because the tokenizer conversion from GGUF is time-consuming and unstable, especially for some models with large vocab size.
-:::
+!!! warning
+    We recommend using the tokenizer from the base model instead of the GGUF model, because the tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary size.
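
A hedged offline-inference sketch following this recommendation (the local file path is an example):

```python
from vllm import LLM, SamplingParams

# Single-file GGUF checkpoint with the base model's tokenizer, as recommended above.
llm = LLM(
    model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)
outputs = llm.generate("Hello, my name is", SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```
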
GGUF assumes that HuggingFace can convert the metadata to a config file. In case HuggingFace doesn't support your model, you can manually create a config and pass it as hf-config-path
diff --git a/docs/source/features/quantization/gptqmodel.md b/docs/features/quantization/gptqmodel.md
similarity index 98%
rename from docs/source/features/quantization/gptqmodel.md
rename to docs/features/quantization/gptqmodel.md
index 9771d5a4fe..10660a408f 100644
--- a/docs/source/features/quantization/gptqmodel.md
+++ b/docs/features/quantization/gptqmodel.md
@@ -1,6 +1,7 @@
-(gptqmodel)=
-
-# GPTQModel
+---
+title: GPTQModel
+---
+[](){ #gptqmodel }
To create a new 4-bit or 8-bit GPTQ quantized model, you can leverage [GPTQModel](https://github.com/ModelCloud/GPTQModel) from ModelCloud.AI.
diff --git a/docs/source/features/quantization/int4.md b/docs/features/quantization/int4.md
similarity index 94%
rename from docs/source/features/quantization/int4.md
rename to docs/features/quantization/int4.md
index 7a0ab4ad22..b7d0920636 100644
--- a/docs/source/features/quantization/int4.md
+++ b/docs/features/quantization/int4.md
@@ -1,14 +1,14 @@
-(int4)=
-
-# INT4 W4A16
+---
+title: INT4 W4A16
+---
+[](){ #int4 }
vLLM supports quantizing weights to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for reducing model size and maintaining low latency in workloads with low queries per second (QPS).
Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int4-llms-for-vllm-668ec34bf3c9fa45f857df2c).
-:::{note}
-INT4 computation is supported on NVIDIA GPUs with compute capability > 8.0 (Ampere, Ada Lovelace, Hopper, Blackwell).
-:::
+!!! note
+    INT4 computation is supported on NVIDIA GPUs with compute capability >= 8.0 (Ampere, Ada Lovelace, Hopper, Blackwell).
## Prerequisites
@@ -121,9 +121,8 @@ $ lm_eval --model vllm \
--batch_size 'auto'
```
-:::{note}
-Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
-:::
+!!! note
+ Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
## Best Practices
diff --git a/docs/source/features/quantization/int8.md b/docs/features/quantization/int8.md
similarity index 92%
rename from docs/source/features/quantization/int8.md
rename to docs/features/quantization/int8.md
index 1e4b01d355..1d9fba9dc8 100644
--- a/docs/source/features/quantization/int8.md
+++ b/docs/features/quantization/int8.md
@@ -1,15 +1,15 @@
-(int8)=
-
-# INT8 W8A8
+---
+title: INT8 W8A8
+---
+[](){ #int8 }
vLLM supports quantizing weights and activations to INT8 for memory savings and inference acceleration.
This quantization method is particularly useful for reducing model size while maintaining good performance.
Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs ready to use with vLLM](https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415).
-:::{note}
-INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turing, Ampere, Ada Lovelace, Hopper, Blackwell).
-:::
+!!! note
+    INT8 computation is supported on NVIDIA GPUs with compute capability >= 7.5 (Turing, Ampere, Ada Lovelace, Hopper, Blackwell).
## Prerequisites
@@ -125,9 +125,8 @@ $ lm_eval --model vllm \
--batch_size 'auto'
```
-:::{note}
-Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
-:::
+!!! note
+ Quantized models can be sensitive to the presence of the `bos` token. Make sure to include the `add_bos_token=True` argument when running evaluations.
## Best Practices
diff --git a/docs/source/features/quantization/modelopt.md b/docs/features/quantization/modelopt.md
similarity index 100%
rename from docs/source/features/quantization/modelopt.md
rename to docs/features/quantization/modelopt.md
diff --git a/docs/source/features/quantization/quantized_kvcache.md b/docs/features/quantization/quantized_kvcache.md
similarity index 98%
rename from docs/source/features/quantization/quantized_kvcache.md
rename to docs/features/quantization/quantized_kvcache.md
index 86e6354ec8..e3ebd024ba 100644
--- a/docs/source/features/quantization/quantized_kvcache.md
+++ b/docs/features/quantization/quantized_kvcache.md
@@ -1,6 +1,7 @@
-(quantized-kvcache)=
-
-# Quantized KV Cache
+---
+title: Quantized KV Cache
+---
+[](){ #quantized-kvcache }
## FP8 KV Cache
diff --git a/docs/source/features/quantization/quark.md b/docs/features/quantization/quark.md
similarity index 94%
rename from docs/source/features/quantization/quark.md
rename to docs/features/quantization/quark.md
index 955890dbc7..51da98cc09 100644
--- a/docs/source/features/quantization/quark.md
+++ b/docs/features/quantization/quark.md
@@ -1,6 +1,7 @@
-(quark)=
-
-# AMD QUARK
+---
+title: AMD QUARK
+---
+[](){ #quark }
Quantization can effectively reduce memory and bandwidth usage, accelerate computation and improve
throughput with minimal accuracy loss. vLLM can leverage [Quark](https://quark.docs.amd.com/latest/),
@@ -86,13 +87,12 @@ We need to set the quantization configuration, you can check
for further details. Here we use FP8 per-tensor quantization on weight, activation,
kv-cache and the quantization algorithm is AutoSmoothQuant.
-:::{note}
-Note the quantization algorithm needs a JSON config file and the config file is located in
-[Quark Pytorch examples](https://quark.docs.amd.com/latest/pytorch/pytorch_examples.html),
-under the directory `examples/torch/language_modeling/llm_ptq/models`. For example,
-AutoSmoothQuant config file for Llama is
-`examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
-:::
+!!! note
+    The quantization algorithm needs a JSON config file, which is located in
+    [Quark Pytorch examples](https://quark.docs.amd.com/latest/pytorch/pytorch_examples.html)
+    under the directory `examples/torch/language_modeling/llm_ptq/models`. For example, the
+    AutoSmoothQuant config file for Llama is
+    `examples/torch/language_modeling/llm_ptq/models/llama/autosmoothquant_config.json`.
```python
from quark.torch.quantization import (Config, QuantizationConfig,
diff --git a/docs/features/quantization/supported_hardware.md b/docs/features/quantization/supported_hardware.md
new file mode 100644
index 0000000000..2967bf9c75
--- /dev/null
+++ b/docs/features/quantization/supported_hardware.md
@@ -0,0 +1,28 @@
+---
+title: Supported Hardware
+---
+[](){ #quantization-supported-hardware }
+
+The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
+
+| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU | AWS Inferentia | Google TPU |
+|-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-----------|------------------|--------------|
+| AWQ | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ | ❌ |
+| GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ✅︎ | ✅︎ | ❌ | ❌ |
+| Marlin (GPTQ/AWQ/FP8) | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| INT8 (W8A8) | ❌ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ✅︎ | ❌ | ✅︎ |
+| FP8 (W8A8) | ❌ | ❌ | ❌ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
+| BitBLAS (GPTQ) | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| AQLM | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| DeepSpeedFP | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| GGUF | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ❌ | ❌ | ❌ | ❌ |
+
+- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
+- ✅︎ indicates that the quantization method is supported on the specified hardware.
+- ❌ indicates that the quantization method is not supported on the specified hardware.
+
+!!! note
+ This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
+
+ For the most up-to-date information on hardware support and quantization methods, please refer to or consult with the vLLM development team.
diff --git a/docs/source/features/quantization/torchao.md b/docs/features/quantization/torchao.md
similarity index 100%
rename from docs/source/features/quantization/torchao.md
rename to docs/features/quantization/torchao.md
diff --git a/docs/source/features/reasoning_outputs.md b/docs/features/reasoning_outputs.md
similarity index 97%
rename from docs/source/features/reasoning_outputs.md
rename to docs/features/reasoning_outputs.md
index bf4f8901a1..85464269ef 100644
--- a/docs/source/features/reasoning_outputs.md
+++ b/docs/features/reasoning_outputs.md
@@ -1,6 +1,7 @@
-(reasoning-outputs)=
-
-# Reasoning Outputs
+---
+title: Reasoning Outputs
+---
+[](){ #reasoning-outputs }
vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.
@@ -17,10 +18,9 @@ vLLM currently supports the following reasoning models:
| [IBM Granite 3.2 language models](https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a) | `granite` | ❌ | ❌ |
| [Qwen3 series](https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f) | `qwen3` | `guided_json`, `guided_regex` | ✅ |
-:::{note}
-IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
-The reasoning feature for the Qwen3 series is enabled by default. To disable it, you must pass `enable_thinking=False` in your `chat_template_kwargs`.
-:::
+!!! note
+ IBM Granite 3.2 reasoning is disabled by default; to enable it, you must also pass `thinking=True` in your `chat_template_kwargs`.
+ The reasoning feature for the Qwen3 series is enabled by default. To disable it, you must pass `enable_thinking=False` in your `chat_template_kwargs`.
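
For example, a hedged sketch of toggling Qwen3 thinking for a single request through the OpenAI-compatible server, passing `chat_template_kwargs` via the client's `extra_body` (the model name assumes a server already launched with a Qwen3 checkpoint):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Disable thinking for this request only; the server-side default stays enabled.
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```
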
## Quickstart
@@ -167,12 +167,10 @@ client = OpenAI(
models = client.models.list()
model = models.data[0].id
-
class People(BaseModel):
name: str
age: int
-
json_schema = People.model_json_schema()
prompt = ("Generate a JSON with the name and age of one random person.")
diff --git a/docs/source/features/spec_decode.md b/docs/features/spec_decode.md
similarity index 93%
rename from docs/source/features/spec_decode.md
rename to docs/features/spec_decode.md
index f16e0d9652..dce87c2789 100644
--- a/docs/source/features/spec_decode.md
+++ b/docs/features/spec_decode.md
@@ -1,16 +1,15 @@
-(spec-decode)=
+---
+title: Speculative Decoding
+---
+[](){ #spec-decode }
-# Speculative Decoding
+!!! warning
+ Please note that speculative decoding in vLLM is not yet optimized and does
+ not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
+ The work to optimize it is ongoing and can be followed here:
-:::{warning}
-Please note that speculative decoding in vLLM is not yet optimized and does
-not usually yield inter-token latency reductions for all prompt datasets or sampling parameters.
-The work to optimize it is ongoing and can be followed here:
-:::
-
-:::{warning}
-Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
-:::
+!!! warning
+ Currently, speculative decoding in vLLM is not compatible with pipeline parallelism.
This document shows how to use [Speculative Decoding](https://x.com/karpathy/status/1697318534555336961) with vLLM.
Speculative decoding is a technique which improves inter-token latency in memory-bound LLM inference.
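
As a hedged offline counterpart to the server flag shown below (the `speculative_config` keyword is assumed to match your vLLM version):

```python
from vllm import LLM, SamplingParams

# The draft model proposes 5 tokens per step; the target model verifies them.
llm = LLM(
    model="facebook/opt-6.7b",
    speculative_config={
        "model": "facebook/opt-125m",
        "num_speculative_tokens": 5,
    },
)
outputs = llm.generate("The future of AI is", SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```
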
@@ -51,9 +50,8 @@ python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model
--speculative_config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'
```
-:::{warning}
-Note: Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately has been deprecated now.
-:::
+!!! warning
+    Please use `--speculative_config` to set all configurations related to speculative decoding. The previous method of specifying the model through `--speculative_model` and adding related parameters (e.g., `--num_speculative_tokens`) separately is now deprecated.
Then use a client:
@@ -255,7 +253,7 @@ speculative decoding, breaking down the guarantees into three key areas:
3. **vLLM Logprob Stability**
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section
- titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).
+ titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
can occur due to following factors:
@@ -264,7 +262,7 @@ can occur due to following factors:
- **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
due to non-deterministic behavior in batched operations or numerical instability.
-For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).
+For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs][faq].
## Resources for vLLM contributors
diff --git a/docs/source/features/structured_outputs.md b/docs/features/structured_outputs.md
similarity index 96%
rename from docs/source/features/structured_outputs.md
rename to docs/features/structured_outputs.md
index 03119ec744..f96b598cff 100644
--- a/docs/source/features/structured_outputs.md
+++ b/docs/features/structured_outputs.md
@@ -1,6 +1,7 @@
-(structured-outputs)=
-
-# Structured Outputs
+---
+title: Structured Outputs
+---
+[](){ #structured-outputs }
vLLM supports the generation of structured outputs using
[xgrammar](https://github.com/mlc-ai/xgrammar) or
@@ -20,7 +21,7 @@ The following parameters are supported, which must be added as extra parameters:
- `guided_grammar`: the output will follow the context free grammar.
- `structural_tag`: Follow a JSON schema within a set of specified tags within the generated text.
-You can see the complete list of supported parameters on the [OpenAI-Compatible Server](#openai-compatible-server) page.
+You can see the complete list of supported parameters on the [OpenAI-Compatible Server][openai-compatible-server] page.
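
For example, a minimal request using `guided_choice` (one of the supported parameters from the full list referenced above), passed through the OpenAI client's `extra_body`:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# The generated text is constrained to one of the listed choices.
completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}],
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```
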
Structured outputs are supported by default in the OpenAI-Compatible Server. You
may choose to specify the backend to use by setting the
@@ -83,13 +84,11 @@ class CarType(str, Enum):
truck = "Truck"
coupe = "Coupe"
-
class CarDescription(BaseModel):
brand: str
model: str
car_type: CarType
-
json_schema = CarDescription.model_json_schema()
completion = client.chat.completions.create(
@@ -105,11 +104,10 @@ completion = client.chat.completions.create(
print(completion.choices[0].message.content)
```
-:::{tip}
-While not strictly necessary, normally it´s better to indicate in the prompt the
-JSON schema and how the fields should be populated. This can improve the
-results notably in most cases.
-:::
+!!! tip
+    While not strictly necessary, normally it's better to indicate in the prompt the
+    JSON schema and how the fields should be populated. This can notably improve the
+    results in most cases.
Finally, we have the `guided_grammar` option, which is probably the most
difficult to use, but it's really powerful. It allows us to define complete
@@ -160,12 +158,10 @@ Here is a simple example demonstrating how to get structured output using Pydant
from pydantic import BaseModel
from openai import OpenAI
-
class Info(BaseModel):
name: str
age: int
-
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
model="meta-llama/Llama-3.1-8B-Instruct",
@@ -199,17 +195,14 @@ from typing import List
from pydantic import BaseModel
from openai import OpenAI
-
class Step(BaseModel):
explanation: str
output: str
-
class MathResponse(BaseModel):
steps: list[Step]
final_answer: str
-
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="dummy")
completion = client.beta.chat.completions.parse(
model="meta-llama/Llama-3.1-8B-Instruct",
diff --git a/docs/source/features/tool_calling.md b/docs/features/tool_calling.md
similarity index 99%
rename from docs/source/features/tool_calling.md
rename to docs/features/tool_calling.md
index f76128406b..75cd00e24d 100644
--- a/docs/source/features/tool_calling.md
+++ b/docs/features/tool_calling.md
@@ -322,7 +322,6 @@ class ExampleToolParser(ToolParser):
tool_calls=[],
content=text)
-
```
Then you can use this plugin in the command line like this.
diff --git a/docs/source/getting_started/faq.md b/docs/getting_started/faq.md
similarity index 91%
rename from docs/source/getting_started/faq.md
rename to docs/getting_started/faq.md
index c1bb28937c..51977d4434 100644
--- a/docs/source/getting_started/faq.md
+++ b/docs/getting_started/faq.md
@@ -1,23 +1,24 @@
-(faq)=
-
-# Frequently Asked Questions
+---
+title: Frequently Asked Questions
+---
+[](){ #faq }
> Q: How can I serve multiple models on a single port using the OpenAI API?
A: Assuming that you're referring to using the OpenAI-compatible server to serve multiple models at once, that is not currently supported. Instead, you can run multiple instances of the server (each serving a different model) at the same time and have another layer route incoming requests to the correct server.
-______________________________________________________________________
+---
> Q: Which model to use for offline inference embedding?
A: You can try [e5-mistral-7b-instruct](https://huggingface.co/intfloat/e5-mistral-7b-instruct) and [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5);
-more are listed [here](#supported-models).
+more are listed [here][supported-models].
By extracting hidden states, vLLM can automatically convert text generation models like [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B),
[Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) into embedding models,
but they are expected to be inferior to models that are specifically trained on embedding tasks.
-______________________________________________________________________
+---
> Q: Can the output of a prompt vary across runs in vLLM?
diff --git a/docs/getting_started/installation/.nav.yml b/docs/getting_started/installation/.nav.yml
new file mode 100644
index 0000000000..7acfc015ff
--- /dev/null
+++ b/docs/getting_started/installation/.nav.yml
@@ -0,0 +1,5 @@
+nav:
+ - README.md
+ - gpu.md
+ - cpu.md
+ - ai_accelerator.md
\ No newline at end of file
diff --git a/docs/getting_started/installation/README.md b/docs/getting_started/installation/README.md
new file mode 100644
index 0000000000..36bb16cc02
--- /dev/null
+++ b/docs/getting_started/installation/README.md
@@ -0,0 +1,20 @@
+---
+title: Installation
+---
+[](){ #installation-index }
+
+vLLM supports the following hardware platforms:
+
+- [GPU](gpu.md)
+ - [NVIDIA CUDA](gpu.md#nvidia-cuda)
+ - [AMD ROCm](gpu.md#amd-rocm)
+ - [Intel XPU](gpu.md#intel-xpu)
+- [CPU](cpu.md)
+ - [Intel/AMD x86](cpu.md#intelamd-x86)
+ - [ARM AArch64](cpu.md#arm-aarch64)
+ - [Apple silicon](cpu.md#apple-silicon)
+ - [IBM Z (S390X)](cpu.md#ibm-z-s390x)
+- [Other AI accelerators](ai_accelerator.md)
+ - [Google TPU](ai_accelerator.md#google-tpu)
+ - [Intel Gaudi](ai_accelerator.md#intel-gaudi)
+ - [AWS Neuron](ai_accelerator.md#aws-neuron)
diff --git a/docs/getting_started/installation/ai_accelerator.md b/docs/getting_started/installation/ai_accelerator.md
new file mode 100644
index 0000000000..a4f136a172
--- /dev/null
+++ b/docs/getting_started/installation/ai_accelerator.md
@@ -0,0 +1,117 @@
+# Other AI accelerators
+
+vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor-specific instructions:
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:installation"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:installation"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:installation"
+
+## Requirements
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:requirements"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:requirements"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:requirements"
+
+## Configure a new environment
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:configure-a-new-environment"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:configure-a-new-environment"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:configure-a-new-environment"
+
+## Set up using Python
+
+### Pre-built wheels
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-wheels"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-wheels"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-wheels"
+
+### Build wheel from source
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-wheel-from-source"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-wheel-from-source"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-wheel-from-source"
+
+## Set up using Docker
+
+### Pre-built images
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:pre-built-images"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:pre-built-images"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:pre-built-images"
+
+### Build image from source
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:build-image-from-source"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:build-image-from-source"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:build-image-from-source"
+
+## Extra information
+
+=== "Google TPU"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/tpu.inc.md:extra-information"
+
+=== "Intel Gaudi"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md:extra-information"
+
+=== "AWS Neuron"
+
+ --8<-- "docs/getting_started/installation/ai_accelerator/neuron.inc.md:extra-information"
diff --git a/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md b/docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
similarity index 84%
rename from docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
rename to docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
index 78938de317..1ca8a9216a 100644
--- a/docs/source/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
+++ b/docs/getting_started/installation/ai_accelerator/hpu-gaudi.inc.md
@@ -1,12 +1,12 @@
-# Installation
+# --8<-- [start:installation]
This tab provides instructions on running vLLM with Intel Gaudi devices.
-:::{attention}
-There are no pre-built wheels or images for this device, so you must build vLLM from source.
-:::
+!!! warning
+ There are no pre-built wheels or images for this device, so you must build vLLM from source.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- OS: Ubuntu 22.04 LTS
- Python: 3.10
@@ -48,13 +48,16 @@ docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-i
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
```
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built Intel Gaudi wheels.
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
To build and install vLLM from source, run:
@@ -75,29 +78,32 @@ pip install -r requirements/hpu.txt
python setup.py develop
```
-## Set up using Docker
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
Currently, there are no pre-built Intel Gaudi images.
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
```console
docker build -f docker/Dockerfile.hpu -t vllm-hpu-env .
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
```
-:::{tip}
-If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered.
-:::
+!!! tip
+ If you're observing the following error: `docker: Error response from daemon: Unknown runtime specified habana.`, please refer to "Install Using Containers" section of [Intel Gaudi Software Stack and Driver Installation](https://docs.habana.ai/en/v1.18.0/Installation_Guide/Bare_Metal_Fresh_OS.html). Make sure you have `habana-container-runtime` package installed and that `habana` container runtime is registered.
-## Extra information
+# --8<-- [end:build-image-from-source]
+# --8<-- [start:extra-information]
## Supported features
-- [Offline inference](#offline-inference)
-- Online serving via [OpenAI-Compatible Server](#openai-compatible-server)
+- [Offline inference][offline-inference]
+- Online serving via [OpenAI-Compatible Server][openai-compatible-server]
- HPU autodetection - no need to manually select device within vLLM
- Paged KV cache with algorithms enabled for Intel Gaudi accelerators
- Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
@@ -157,41 +163,25 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
Currently in vLLM for HPU we support four execution modes, depending on selected HPU PyTorch Bridge backend (via `PT_HPU_LAZY_MODE` environment variable), and `--enforce-eager` flag.
-:::{list-table} vLLM execution modes
-:widths: 25 25 50
-:header-rows: 1
+| `PT_HPU_LAZY_MODE` | `enforce_eager` | execution mode     |
+|--------------------|-----------------|--------------------|
+| 0                  | 0               | torch.compile      |
+| 0                  | 1               | PyTorch eager mode |
+| 1                  | 0               | HPU Graphs         |
+| 1                  | 1               | PyTorch lazy mode  |
+
+*vLLM execution modes*
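As a sketch of how these modes are selected in practice (the model name is an illustrative assumption), the environment variable and flag are simply combined at launch time:

```console
# HPU Graphs mode: PT_HPU_LAZY_MODE=1 without --enforce-eager
PT_HPU_LAZY_MODE=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct

# PyTorch lazy mode: PT_HPU_LAZY_MODE=1 combined with --enforce-eager
PT_HPU_LAZY_MODE=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct --enforce-eager
```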
-- * `PT_HPU_LAZY_MODE`
- * `enforce_eager`
- * execution mode
-- * 0
- * 0
- * torch.compile
-- * 0
- * 1
- * PyTorch eager mode
-- * 1
- * 0
- * HPU Graphs
-- * 1
- * 1
- * PyTorch lazy mode
-:::
+!!! warning
+ In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
-:::{warning}
-In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
-:::
-
-(gaudi-bucketing-mechanism)=
+[](){ #gaudi-bucketing-mechanism }
### Bucketing mechanism
Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently it is achieved by "bucketing" model's forward pass across two dimensions - `batch_size` and `sequence_length`.
-:::{note}
-Bucketing allows us to reduce the number of required graphs significantly, but it does not handle any graph compilation and device code generation - this is done in warmup and HPUGraph capture phase.
-:::
+!!! note
+ Bucketing allows us to reduce the number of required graphs significantly, but it does not handle any graph compilation and device code generation - this is done in warmup and HPUGraph capture phase.
Bucketing ranges are determined with 3 parameters - `min`, `step` and `max`. They can be set separately for prompt and decode phase, and for batch size and sequence length dimension. These parameters can be observed in logs during vLLM startup:
@@ -224,15 +214,13 @@ min = 128, step = 128, max = 512
In the logged scenario, 24 buckets were generated for prompt (prefill) runs, and 48 buckets for decode runs. Each bucket corresponds to a separate optimized device binary for a given model with specified tensor shapes. Whenever a batch of requests is processed, it is padded across batch and sequence length dimension to the smallest possible bucket.
-:::{warning}
-If a request exceeds maximum bucket size in any dimension, it will be processed without padding, and its processing may require a graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and upper bucket boundaries can be increased to avoid such scenario.
-:::
+!!! warning
+ If a request exceeds maximum bucket size in any dimension, it will be processed without padding, and its processing may require a graph compilation, potentially significantly increasing end-to-end latency. The boundaries of the buckets are user-configurable via environment variables, and upper bucket boundaries can be increased to avoid such scenario.
As an example, if a request of 3 sequences, with max sequence length of 412 comes in to an idle vLLM server, it will be padded and executed as `(4, 512)` prefill bucket, as `batch_size` (number of sequences) will be padded to 4 (closest batch_size dimension higher than 3), and max sequence length will be padded to 512 (closest sequence length dimension higher than 412). After prefill stage, it will be executed as `(4, 512)` decode bucket and will continue as that bucket until either batch dimension changes (due to request being finished) - in which case it will become a `(2, 512)` bucket, or context length increases above 512 tokens, in which case it will become a `(4, 640)` bucket.
-:::{note}
-Bucketing is transparent to a client -- padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
-:::
+!!! note
+ Bucketing is transparent to a client -- padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
### Warmup
@@ -252,11 +240,10 @@ INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size
INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
```
-This example uses the same buckets as in the [Bucketing Mechanism](#gaudi-bucketing-mechanism) section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
+This example uses the same buckets as in the [Bucketing Mechanism][gaudi-bucketing-mechanism] section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
-:::{tip}
-Compiling all the buckets might take some time and can be turned off with `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do that, you may face graph compilations once executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment.
-:::
+!!! tip
+ Compiling all the buckets might take some time and can be turned off with `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do that, you may face graph compilations once executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment.
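A sketch of that development-only shortcut (the model name is an illustrative assumption):

```console
# Skip warmup; expect one-off graph compilations the first time each bucket is executed
VLLM_SKIP_WARMUP=true vllm serve meta-llama/Meta-Llama-3-8B-Instruct
```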
### HPU Graph capture
@@ -271,9 +258,8 @@ With its default value (`VLLM_GRAPH_RESERVED_MEM=0.1`), 10% of usable memory wil
Environment variable `VLLM_GRAPH_PROMPT_RATIO` determines the ratio of usable graph memory reserved for prefill and decode graphs. By default (`VLLM_GRAPH_PROMPT_RATIO=0.3`), both stages have equal memory constraints.
Lower value corresponds to less usable graph memory reserved for prefill stage, e.g. `VLLM_GRAPH_PROMPT_RATIO=0.2` will reserve 20% of usable graph memory for prefill graphs, and 80% of usable graph memory for decode graphs.
-:::{note}
-`gpu_memory_utilization` does not correspond to the absolute memory usage across HPU. It specifies the memory margin after loading the model and performing a profile run. If device has 100 GiB of total memory, and 50 GiB of free memory after loading model weights and executing profiling run, `gpu_memory_utilization` at its default value will mark 90% of 50 GiB as usable, leaving 5 GiB of margin, regardless of total device memory.
-:::
+!!! note
+ `gpu_memory_utilization` does not correspond to the absolute memory usage across HPU. It specifies the memory margin after loading the model and performing a profile run. If device has 100 GiB of total memory, and 50 GiB of free memory after loading model weights and executing profiling run, `gpu_memory_utilization` at its default value will mark 90% of 50 GiB as usable, leaving 5 GiB of margin, regardless of total device memory.
User can also configure the strategy for capturing HPU Graphs for prompt and decode stages separately. Strategy affects the order of capturing graphs. There are two strategies implemented:
@@ -282,9 +268,8 @@ User can also configure the strategy for capturing HPU Graphs for prompt and dec
When there's large amount of requests pending, vLLM scheduler will attempt to fill the maximum batch size for decode as soon as possible. When a request is finished, decode batch size decreases. When that happens, vLLM will attempt to schedule a prefill iteration for requests in the waiting queue, to fill the decode batch size to its previous state. This means that in a full load scenario, decode batch size is often at its maximum, which makes large batch size HPU Graphs crucial to capture, as reflected by `max_bs` strategy. On the other hand, prefills will be executed most frequently with very low batch sizes (1-4), which is reflected in `min_tokens` strategy.
-:::{note}
-`VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use up entirety of usable prefill graph memory (usable graph memory * `VLLM_GRAPH_PROMPT_RATIO`) for capturing prefill HPU Graphs, next it will attempt do the same for decode graphs and usable decode graph memory pool. If one stage is fully captured, and there is unused memory left within usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding reserved memory pool. The behavior on that mechanism can be observed in the example below.
-:::
+!!! note
+ `VLLM_GRAPH_PROMPT_RATIO` does not set a hard limit on memory taken by graphs for each stage (prefill and decode). vLLM will first attempt to use up entirety of usable prefill graph memory (usable graph memory * `VLLM_GRAPH_PROMPT_RATIO`) for capturing prefill HPU Graphs, next it will attempt do the same for decode graphs and usable decode graph memory pool. If one stage is fully captured, and there is unused memory left within usable graph memory pool, vLLM will attempt further graph capture for the other stage, until no more HPU Graphs can be captured without exceeding reserved memory pool. The behavior on that mechanism can be observed in the example below.
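Putting these knobs together, a launch might look like the following sketch; the values and model name are illustrative assumptions rather than recommendations:

```console
# Reserve 20% of usable memory for HPU Graphs, with 20% of that for prefill graphs and 80% for decode graphs
VLLM_GRAPH_RESERVED_MEM=0.2 VLLM_GRAPH_PROMPT_RATIO=0.2 \
    vllm serve meta-llama/Meta-Llama-3-8B-Instruct --gpu-memory-utilization 0.9
```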
Each described step is logged by vLLM server, as follows (negative values correspond to memory being released):
@@ -401,3 +386,4 @@ the below:
higher batches. You can do that by adding `--enforce-eager` flag to
server (for online serving), or by passing `enforce_eager=True`
argument to LLM constructor (for offline inference).
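In practice that corresponds to something like the sketch below (the model name is an illustrative assumption); for offline inference, pass `enforce_eager=True` to the `LLM` constructor instead:

```console
# Online serving: fall back to eager execution instead of HPU Graphs
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --enforce-eager
```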
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/ai_accelerator/neuron.inc.md b/docs/getting_started/installation/ai_accelerator/neuron.inc.md
similarity index 79%
rename from docs/source/getting_started/installation/ai_accelerator/neuron.inc.md
rename to docs/getting_started/installation/ai_accelerator/neuron.inc.md
index b4bfb696fa..671afa8d89 100644
--- a/docs/source/getting_started/installation/ai_accelerator/neuron.inc.md
+++ b/docs/getting_started/installation/ai_accelerator/neuron.inc.md
@@ -1,14 +1,14 @@
-# Installation
+# --8<-- [start:installation]
vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia via the Neuron SDK, with continuous batching support.
Paged Attention and Chunked Prefill are currently in development and will be available soon.
Data types currently supported in Neuron SDK are FP16 and BF16.
-:::{attention}
-There are no pre-built wheels or images for this device, so you must build vLLM from source.
-:::
+!!! warning
+ There are no pre-built wheels or images for this device, so you must build vLLM from source.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- OS: Linux
- Python: 3.9 -- 3.11
@@ -63,17 +63,19 @@ sudo apt-get install aws-neuronx-tools=2.* -y
export PATH=/opt/aws/neuron/bin:$PATH
```
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built Neuron wheels.
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
-:::{note}
-The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
-:::
+!!! note
+ The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
Following instructions are applicable to Neuron SDK 2.16 and beyond.
@@ -122,18 +124,23 @@ VLLM_TARGET_DEVICE="neuron" pip install .
If neuron packages are detected correctly in the installation process, `vllm-0.3.0+neuron212` will be installed.
-## Set up using Docker
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
Currently, there are no pre-built Neuron images.
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
-See for instructions on building the Docker image.
+See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.
Make sure to use the Neuron-specific Dockerfile in place of the default Dockerfile.
-## Extra information
+# --8<-- [end:build-image-from-source]
+# --8<-- [start:extra-information]
There is no extra information for this device.
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/ai_accelerator/tpu.inc.md b/docs/getting_started/installation/ai_accelerator/tpu.inc.md
similarity index 55%
rename from docs/source/getting_started/installation/ai_accelerator/tpu.inc.md
rename to docs/getting_started/installation/ai_accelerator/tpu.inc.md
index 4459cc61e1..d0b1681201 100644
--- a/docs/source/getting_started/installation/ai_accelerator/tpu.inc.md
+++ b/docs/getting_started/installation/ai_accelerator/tpu.inc.md
@@ -1,4 +1,4 @@
-# Installation
+# --8<-- [start:installation]
Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
@@ -30,11 +30,11 @@ For TPU pricing information, see [Cloud TPU pricing](https://cloud.google.com/tp
You may need additional persistent storage for your TPU VMs. For more
information, see [Storage options for Cloud TPU data](https://cloud.devsite.corp.google.com/tpu/docs/storage-options).
-:::{attention}
-There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
-:::
+!!! warning
+ There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- Google Cloud TPU VM
- TPU versions: v6e, v5e, v5p, v4
@@ -51,10 +51,9 @@ When you request queued resources, the request is added to a queue maintained by
the Cloud TPU service. When the requested resource becomes available, it's
assigned to your Google Cloud project for your immediate exclusive use.
-:::{note}
-In all of the following commands, replace the ALL CAPS parameter names with
-appropriate values. See the parameter descriptions table for more information.
-:::
+!!! note
+ In all of the following commands, replace the ALL CAPS parameter names with
+ appropriate values. See the parameter descriptions table for more information.
### Provision Cloud TPUs with GKE
@@ -79,33 +78,15 @@ gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
--service-account SERVICE_ACCOUNT
```
-:::{list-table} Parameter descriptions
-:header-rows: 1
-
-- * Parameter name
- * Description
-- * QUEUED_RESOURCE_ID
- * The user-assigned ID of the queued resource request.
-- * TPU_NAME
- * The user-assigned name of the TPU which is created when the queued
- resource request is allocated.
-- * PROJECT_ID
- * Your Google Cloud project
-- * ZONE
- * The GCP zone where you want to create your Cloud TPU. The value you use
- depends on the version of TPUs you are using. For more information, see
- `TPU regions and zones `_
-- * ACCELERATOR_TYPE
- * The TPU version you want to use. Specify the TPU version, for example
- `v5litepod-4` specifies a v5e TPU with 4 cores, `v6e-1` specifies a v6e TPU with 1 core. For more information,
- see [TPU versions](https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions).
-- * RUNTIME_VERSION
- * The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPU(s). For more information see [TPU VM images](https://cloud.google.com/tpu/docs/runtimes).
-- * SERVICE_ACCOUNT
- * The email address for your service account. You can find it in the IAM
- Cloud Console under *Service Accounts*. For example:
- `tpu-service-account@.iam.gserviceaccount.com`
-:::
+| Parameter name     | Description |
+|--------------------|-------------|
+| QUEUED_RESOURCE_ID | The user-assigned ID of the queued resource request. |
+| TPU_NAME           | The user-assigned name of the TPU which is created when the queued resource request is allocated. |
+| PROJECT_ID         | Your Google Cloud project |
+| ZONE               | The GCP zone where you want to create your Cloud TPU. The value you use depends on the version of TPUs you are using. For more information, see TPU regions and zones. |
+| ACCELERATOR_TYPE   | The TPU version you want to use. Specify the TPU version, for example `v5litepod-4` specifies a v5e TPU with 4 cores, `v6e-1` specifies a v6e TPU with 1 core. For more information, see [TPU versions](https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions). |
+| RUNTIME_VERSION    | The TPU VM runtime version to use. For example, use `v2-alpha-tpuv6e` for a VM loaded with one or more v6e TPU(s). For more information see [TPU VM images](https://cloud.google.com/tpu/docs/runtimes). |
+| SERVICE_ACCOUNT    | The email address for your service account. You can find it in the IAM Cloud Console under *Service Accounts*. For example: `tpu-service-account@.iam.gserviceaccount.com` |
+
+*Parameter descriptions*
Connect to your TPU using SSH:
@@ -113,13 +94,16 @@ Connect to your TPU using SSH:
gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE
```
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built TPU wheels.
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
Install Miniconda:
@@ -161,13 +145,16 @@ Run the setup script:
VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
```
-## Set up using Docker
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
-See for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
+See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
You can use the TPU-specific Dockerfile to build a Docker image with TPU support.
@@ -182,31 +169,30 @@ Run the Docker image with the following command:
docker run --privileged --net host --shm-size=16G -it vllm-tpu
```
-:::{note}
-Since TPU relies on XLA which requires static shapes, vLLM bucketizes the
-possible input shapes and compiles an XLA graph for each shape. The
-compilation time may take 20~30 minutes in the first run. However, the
-compilation time reduces to ~5 minutes afterwards because the XLA graphs are
-cached in the disk (in {code}`VLLM_XLA_CACHE_PATH` or {code}`~/.cache/vllm/xla_cache` by default).
-:::
+!!! note
+ Since TPU relies on XLA which requires static shapes, vLLM bucketizes the
+ possible input shapes and compiles an XLA graph for each shape. The
+ compilation time may take 20~30 minutes in the first run. However, the
+ compilation time reduces to ~5 minutes afterwards because the XLA graphs are
+ cached in the disk (in `VLLM_XLA_CACHE_PATH` or `~/.cache/vllm/xla_cache` by default).
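If you want those compiled graphs to survive VM or container restarts, one possible sketch (the cache path is an arbitrary assumption) is to point the cache at persistent storage:

```console
# Keep compiled XLA graphs on a persistent disk so later runs skip the long first-time compilation
export VLLM_XLA_CACHE_PATH=/mnt/disks/persist/xla_cache
vllm serve meta-llama/Meta-Llama-3-8B-Instruct
```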
-:::{tip}
-If you encounter the following error:
+!!! tip
+ If you encounter the following error:
-```console
-from torch._C import * # noqa: F403
-ImportError: libopenblas.so.0: cannot open shared object file: No such
-file or directory
-```
+ ```console
+ from torch._C import * # noqa: F403
+ ImportError: libopenblas.so.0: cannot open shared object file: No such
+ file or directory
+ ```
-Install OpenBLAS with the following command:
+ Install OpenBLAS with the following command:
-```console
-sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
-```
+ ```console
+ sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
+ ```
-:::
-
-## Extra information
+# --8<-- [end:build-image-from-source]
+# --8<-- [start:extra-information]
There is no extra information for this device.
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/cpu.md b/docs/getting_started/installation/cpu.md
similarity index 74%
rename from docs/source/getting_started/installation/cpu.md
rename to docs/getting_started/installation/cpu.md
index 2c0ec60d71..18c96b264a 100644
--- a/docs/source/getting_started/installation/cpu.md
+++ b/docs/getting_started/installation/cpu.md
@@ -2,107 +2,47 @@
vLLM is a Python library that supports the following CPU variants. Select your CPU type to see vendor specific instructions:
-:::::{tab-set}
-:sync-group: device
+=== "Intel/AMD x86"
-::::{tab-item} Intel/AMD x86
-:selected:
-:sync: x86
+ --8<-- "docs/getting_started/installation/cpu/x86.inc.md:installation"
-:::{include} cpu/x86.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
+=== "ARM AArch64"
-::::
+ --8<-- "docs/getting_started/installation/cpu/arm.inc.md:installation"
-::::{tab-item} ARM AArch64
-:sync: arm
+=== "Apple silicon"
-:::{include} cpu/arm.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
+ --8<-- "docs/getting_started/installation/cpu/apple.inc.md:installation"
-::::
+=== "IBM Z (S390X)"
-::::{tab-item} Apple silicon
-:sync: apple
-
-:::{include} cpu/apple.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-::::{tab-item} IBM Z (S390X)
-:sync: s390x
-
-:::{include} cpu/s390x.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-:::::
+ --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:installation"
## Requirements
- Python: 3.9 -- 3.12
-:::::{tab-set}
-:sync-group: device
+=== "Intel/AMD x86"
-::::{tab-item} Intel/AMD x86
-:sync: x86
+ --8<-- "docs/getting_started/installation/cpu/x86.inc.md:requirements"
-:::{include} cpu/x86.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-:::
+=== "ARM AArch64"
-::::
+ --8<-- "docs/getting_started/installation/cpu/arm.inc.md:requirements"
-::::{tab-item} ARM AArch64
-:sync: arm
+=== "Apple silicon"
-:::{include} cpu/arm.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-:::
+ --8<-- "docs/getting_started/installation/cpu/apple.inc.md:requirements"
-::::
+=== "IBM Z (S390X)"
-::::{tab-item} Apple silicon
-:sync: apple
-
-:::{include} cpu/apple.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-::::{tab-item} IBM Z (S390X)
-:sync: s390x
-
-:::{include} cpu/s390x.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-:::::
+ --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:requirements"
## Set up using Python
### Create a new Python environment
-:::{include} python_env_setup.inc.md
-:::
+--8<-- "docs/getting_started/installation/python_env_setup.inc.md"
### Pre-built wheels
@@ -110,69 +50,29 @@ Currently, there are no pre-built CPU wheels.
### Build wheel from source
-:::::{tab-set}
-:sync-group: device
+=== "Intel/AMD x86"
-::::{tab-item} Intel/AMD x86
-:sync: x86
+ --8<-- "docs/getting_started/installation/cpu/x86.inc.md:build-wheel-from-source"
-:::{include} cpu/x86.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
+=== "ARM AArch64"
-::::
+ --8<-- "docs/getting_started/installation/cpu/arm.inc.md:build-wheel-from-source"
-::::{tab-item} ARM AArch64
-:sync: arm
+=== "Apple silicon"
-:::{include} cpu/arm.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
+ --8<-- "docs/getting_started/installation/cpu/apple.inc.md:build-wheel-from-source"
-::::
+=== "IBM Z (s390x)"
-::::{tab-item} Apple silicon
-:sync: apple
-
-:::{include} cpu/apple.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-::::{tab-item} IBM Z (s390x)
-:sync: s390x
-
-:::{include} cpu/s390x.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-:::::
+ --8<-- "docs/getting_started/installation/cpu/s390x.inc.md:build-wheel-from-source"
## Set up using Docker
### Pre-built images
-:::::{tab-set}
-:sync-group: device
+=== "Intel/AMD x86"
-::::{tab-item} Intel/AMD x86
-:sync: x86
-
-:::{include} cpu/x86.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-:::
-
-::::
-
-:::::
+ --8<-- "docs/getting_started/installation/cpu/x86.inc.md:pre-built-images"
### Build image from source
@@ -192,13 +92,11 @@ $ docker run --rm \
other vLLM OpenAI server arguments
```
-::::{tip}
-For ARM or Apple silicon, use `docker/Dockerfile.arm`
-::::
+!!! tip
+ For ARM or Apple silicon, use `docker/Dockerfile.arm`
-::::{tip}
-For IBM Z (s390x), use `docker/Dockerfile.s390x` and in `docker run` use flag `--dtype float`
-::::
+!!! tip
+ For IBM Z (s390x), use `docker/Dockerfile.s390x` and in `docker run` use flag `--dtype float`
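For example, on an ARM or Apple silicon host, the build and run steps above might look like the following sketch; the image tag, port, and model name are assumptions:

```console
# Build the CPU image with the ARM-specific Dockerfile, then start the OpenAI-compatible server
docker build -f docker/Dockerfile.arm -t vllm-cpu-arm .
docker run --rm --shm-size=4g -p 8000:8000 vllm-cpu-arm --model=meta-llama/Meta-Llama-3-8B-Instruct
```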
## Supported features
diff --git a/docs/source/getting_started/installation/cpu/apple.inc.md b/docs/getting_started/installation/cpu/apple.inc.md
similarity index 58%
rename from docs/source/getting_started/installation/cpu/apple.inc.md
rename to docs/getting_started/installation/cpu/apple.inc.md
index 7bc9e85ecd..7a91e3ce5e 100644
--- a/docs/source/getting_started/installation/cpu/apple.inc.md
+++ b/docs/getting_started/installation/cpu/apple.inc.md
@@ -1,24 +1,27 @@
-# Installation
+# --8<-- [start:installation]
vLLM has experimental support for macOS with Apple silicon. For now, users must build vLLM from source to run it natively on macOS.
Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.
-:::{attention}
-There are no pre-built wheels or images for this device, so you must build vLLM from source.
-:::
+!!! warning
+ There are no pre-built wheels or images for this device, so you must build vLLM from source.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- OS: `macOS Sonoma` or later
- SDK: `XCode 15.4` or later with Command Line Tools
- Compiler: `Apple Clang >= 15.0.0`
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
After installation of XCode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from the source.
@@ -29,9 +32,8 @@ pip install -r requirements/cpu.txt
pip install -e .
```
-:::{note}
-On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
-:::
+!!! note
+ On macOS the `VLLM_TARGET_DEVICE` is automatically set to `cpu`, which currently is the only supported device.
#### Troubleshooting
@@ -51,10 +53,15 @@ If the build has error like the following snippet where standard C++ headers can
1 error generated.
```
-## Set up using Docker
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
-## Extra information
+# --8<-- [end:build-image-from-source]
+# --8<-- [start:extra-information]
+# --8<-- [end:extra-information]
diff --git a/docs/getting_started/installation/cpu/arm.inc.md b/docs/getting_started/installation/cpu/arm.inc.md
new file mode 100644
index 0000000000..59b71dcaf9
--- /dev/null
+++ b/docs/getting_started/installation/cpu/arm.inc.md
@@ -0,0 +1,41 @@
+# --8<-- [start:installation]
+
+vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform.
+
+ARM CPU backend currently supports Float32, FP16 and BFloat16 datatypes.
+
+!!! warning
+ There are no pre-built wheels or images for this device, so you must build vLLM from source.
+
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
+
+- OS: Linux
+- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
+- Instruction Set Architecture (ISA): NEON support is required
+
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
+
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
+
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
+
+--8<-- "docs/getting_started/installation/cpu/build.inc.md"
+
+Testing has been conducted on AWS Graviton3 instances for compatibility.
+
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
+
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
+
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
+
+# --8<-- [end:build-image-from-source]
+# --8<-- [start:extra-information]
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/cpu/build.inc.md b/docs/getting_started/installation/cpu/build.inc.md
similarity index 96%
rename from docs/source/getting_started/installation/cpu/build.inc.md
rename to docs/getting_started/installation/cpu/build.inc.md
index f385f3d5b1..7d6472afa7 100644
--- a/docs/source/getting_started/installation/cpu/build.inc.md
+++ b/docs/getting_started/installation/cpu/build.inc.md
@@ -32,3 +32,5 @@ If you want to develop vllm, install it in editable mode instead.
```console
VLLM_TARGET_DEVICE=cpu python setup.py develop
```
+
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/cpu/s390x.inc.md b/docs/getting_started/installation/cpu/s390x.inc.md
similarity index 64%
rename from docs/source/getting_started/installation/cpu/s390x.inc.md
rename to docs/getting_started/installation/cpu/s390x.inc.md
index 9b41173b44..670485feef 100644
--- a/docs/source/getting_started/installation/cpu/s390x.inc.md
+++ b/docs/getting_started/installation/cpu/s390x.inc.md
@@ -1,25 +1,28 @@
-# Installation
+# --8<-- [start:installation]
vLLM has experimental support for s390x architecture on IBM Z platform. For now, users shall build from the vLLM source to natively run on IBM Z platform.
Currently the CPU implementation for s390x architecture supports FP32 datatype only.
-:::{attention}
-There are no pre-built wheels or images for this device, so you must build vLLM from source.
-:::
+!!! warning
+ There are no pre-built wheels or images for this device, so you must build vLLM from source.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- OS: `Linux`
- SDK: `gcc/g++ >= 12.3.0` or later with Command Line Tools
- Instruction Set Architecture (ISA): VXE support is required. Works with Z14 and above.
- Build install python packages: `pyarrow`, `torch` and `torchvision`
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
Install the following packages from the package manager before building the vLLM. For example on RHEL 9.4:
@@ -39,9 +42,8 @@ curl https://sh.rustup.rs -sSf | sh -s -- -y && \
Execute the following commands to build and install vLLM from the source.
-::::{tip}
-Please build the following dependencies, `torchvision`, `pyarrow` from the source before building vLLM.
-::::
+!!! tip
+ Please build the following dependencies from source before building vLLM: `torchvision` and `pyarrow`.
```console
sed -i '/^torch/d' requirements-build.txt # remove torch from requirements-build.txt since we use nightly builds
@@ -53,10 +55,15 @@ Please build the following dependencies, `torchvision`, `pyarrow` from the sourc
pip install dist/*.whl
```
-## Set up using Docker
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
-## Extra information
+# --8<-- [end:build-image-from-source]
+# --8<-- [start:extra-information]
+# --8<-- [end:extra-information]
diff --git a/docs/getting_started/installation/cpu/x86.inc.md b/docs/getting_started/installation/cpu/x86.inc.md
new file mode 100644
index 0000000000..9434eeea8b
--- /dev/null
+++ b/docs/getting_started/installation/cpu/x86.inc.md
@@ -0,0 +1,46 @@
+# --8<-- [start:installation]
+
+vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16.
+
+!!! warning
+ There are no pre-built wheels or images for this device, so you must build vLLM from source.
+
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
+
+- OS: Linux
+- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
+- Instruction Set Architecture (ISA): AVX512 (optional, recommended)
+
+!!! tip
+ [Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date feature optimizations for an extra performance boost on Intel hardware.
+
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
+
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
+
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
+
+--8<-- "docs/getting_started/installation/cpu/build.inc.md"
+
+!!! note
+ - AVX512_BF16 is an ISA extension that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
+ - If you want to force enable AVX512_BF16 for cross-compilation, set the environment variable `VLLM_CPU_AVX512BF16=1` before building.
+
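A sketch of forcing the flag during a from-source CPU build, reusing the editable-build command from the shared build instructions:

```console
# Force-enable AVX512_BF16 kernels when cross-compiling the CPU backend
VLLM_CPU_AVX512BF16=1 VLLM_TARGET_DEVICE=cpu python setup.py develop
```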
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
+
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
+
+See [https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo)
+
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
+
+# --8<-- [end:build-image-from-source]
+# --8<-- [start:extra-information]
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/device.template.md b/docs/getting_started/installation/device.template.md
similarity index 100%
rename from docs/source/getting_started/installation/device.template.md
rename to docs/getting_started/installation/device.template.md
diff --git a/docs/getting_started/installation/gpu.md b/docs/getting_started/installation/gpu.md
new file mode 100644
index 0000000000..3c983f6006
--- /dev/null
+++ b/docs/getting_started/installation/gpu.md
@@ -0,0 +1,124 @@
+# GPU
+
+vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instructions:
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:installation"
+
+=== "AMD ROCm"
+
+ --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:installation"
+
+=== "Intel XPU"
+
+ --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:installation"
+
+## Requirements
+
+- OS: Linux
+- Python: 3.9 -- 3.12
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:requirements"
+
+=== "AMD ROCm"
+
+ --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:requirements"
+
+=== "Intel XPU"
+
+ --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:requirements"
+
+## Set up using Python
+
+### Create a new Python environment
+
+--8<-- "docs/getting_started/installation/python_env_setup.inc.md"
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:create-a-new-python-environment"
+
+=== "AMD ROCm"
+
+ There is no extra information on creating a new Python environment for this device.
+
+=== "Intel XPU"
+
+ There is no extra information on creating a new Python environment for this device.
+
+### Pre-built wheels
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:pre-built-wheels"
+
+=== "AMD ROCm"
+
+ --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:pre-built-wheels"
+
+=== "Intel XPU"
+
+ --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:pre-built-wheels"
+
+[](){ #build-from-source }
+
+### Build wheel from source
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-wheel-from-source"
+
+=== "AMD ROCm"
+
+ --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-wheel-from-source"
+
+=== "Intel XPU"
+
+ --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:build-wheel-from-source"
+
+## Set up using Docker
+
+### Pre-built images
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:pre-built-images"
+
+=== "AMD ROCm"
+
+ --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:pre-built-images"
+
+=== "Intel XPU"
+
+ --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:pre-built-images"
+
+### Build image from source
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:build-image-from-source"
+
+=== "AMD ROCm"
+
+ --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:build-image-from-source"
+
+=== "Intel XPU"
+
+ --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:build-image-from-source"
+
+## Supported features
+
+=== "NVIDIA CUDA"
+
+ --8<-- "docs/getting_started/installation/gpu/cuda.inc.md:supported-features"
+
+=== "AMD ROCm"
+
+ --8<-- "docs/getting_started/installation/gpu/rocm.inc.md:supported-features"
+
+=== "Intel XPU"
+
+ --8<-- "docs/getting_started/installation/gpu/xpu.inc.md:supported-features"
diff --git a/docs/source/getting_started/installation/gpu/cuda.inc.md b/docs/getting_started/installation/gpu/cuda.inc.md
similarity index 74%
rename from docs/source/getting_started/installation/gpu/cuda.inc.md
rename to docs/getting_started/installation/gpu/cuda.inc.md
index d3d4b4ef6c..8653f98050 100644
--- a/docs/source/getting_started/installation/gpu/cuda.inc.md
+++ b/docs/getting_started/installation/gpu/cuda.inc.md
@@ -1,24 +1,26 @@
-# Installation
+# --8<-- [start:installation]
vLLM contains pre-compiled C++ and CUDA (12.8) binaries.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
### Create a new Python environment
-:::{note}
-PyTorch installed via `conda` will statically link `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See for more details.
-:::
+!!! note
+ PyTorch installed via `conda` will statically link the `NCCL` library, which can cause issues when vLLM tries to use `NCCL`. See the related GitHub issue for more details.
In order to be performant, vLLM has to compile many cuda kernels. The compilation unfortunately introduces binary incompatibility with other CUDA versions and PyTorch versions, even for the same PyTorch version with different building configurations.
-Therefore, it is recommended to install vLLM with a **fresh new** environment. If either you have a different CUDA version or you want to use an existing PyTorch installation, you need to build vLLM from source. See [below](#build-from-source) for more details.
+Therefore, it is recommended to install vLLM with a **fresh new** environment. If either you have a different CUDA version or you want to use an existing PyTorch installation, you need to build vLLM from source. See [below][build-from-source] for more details.
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
You can install vLLM using either `pip` or `uv pip`:
@@ -32,9 +34,8 @@ uv pip install vllm --torch-backend=auto
We recommend leveraging `uv` to [automatically select the appropriate PyTorch index at runtime](https://docs.astral.sh/uv/guides/integration/pytorch/#automatic-backend-selection) by inspecting the installed CUDA driver version via `--torch-backend=auto` (or `UV_TORCH_BACKEND=auto`). To select a specific backend (e.g., `cu126`), set `--torch-backend=cu126` (or `UV_TORCH_BACKEND=cu126`). If this doesn't work, try running `uv self update` to update `uv` first.
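For example, to pin a specific CUDA backend instead of relying on auto-detection:

```console
# Explicitly select the CUDA 12.6 PyTorch index
uv pip install vllm --torch-backend=cu126
```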
-:::{note}
-NVIDIA Blackwell GPUs (B200, GB200) require a minimum of CUDA 12.8, so make sure you are installing PyTorch wheels with at least that version. PyTorch itself offers a [dedicated interface](https://pytorch.org/get-started/locally/) to determine the appropriate pip command to run for a given target configuration.
-:::
+!!! note
+ NVIDIA Blackwell GPUs (B200, GB200) require a minimum of CUDA 12.8, so make sure you are installing PyTorch wheels with at least that version. PyTorch itself offers a [dedicated interface](https://pytorch.org/get-started/locally/) to determine the appropriate pip command to run for a given target configuration.
As of now, vLLM's binaries are compiled with CUDA 12.8 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.6, 11.8, and public PyTorch release versions:
@@ -45,7 +46,7 @@ export PYTHON_VERSION=312
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```
-(install-the-latest-code)=
+[](){ #install-the-latest-code }
#### Install the latest code
@@ -87,7 +88,8 @@ uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.a
The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-remember command. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
#### Set up using Python-only build (without compilation)
@@ -105,10 +107,9 @@ This command will do the following:
3. Download the pre-built wheel of the base commit.
4. Use its compiled libraries in the installation.
-:::{note}
-1. If you change C++ or kernel code, you cannot use Python-only build; otherwise you will see an import error about library not found or undefined symbol.
-2. If you rebase your dev branch, it is recommended to uninstall vllm and re-run the above command to make sure your libraries are up to date.
-:::
+!!! note
+ 1. If you change C++ or kernel code, you cannot use Python-only build; otherwise you will see an import error about library not found or undefined symbol.
+ 2. If you rebase your dev branch, it is recommended to uninstall vllm and re-run the above command to make sure your libraries are up to date.
In case you see an error about wheel not found when running the above command, it might be because the commit you based on in the main branch was just merged and the wheel is being built. In this case, you can wait for around an hour to try again, or manually assign the previous commit in the installation using the `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable.
@@ -118,12 +119,11 @@ export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vll
pip install --editable .
```
-You can find more information about vLLM's wheels in .
+You can find more information about vLLM's wheels in [install-the-latest-code][install-the-latest-code].
-:::{note}
-There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
-It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to for instructions on how to install a specified wheel.
-:::
+!!! note
+ There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
+ It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to [install-the-latest-code][install-the-latest-code] for instructions on how to install a specified wheel.
#### Full build (with compilation)
@@ -135,17 +135,16 @@ cd vllm
pip install -e .
```
-:::{tip}
-Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results.
+!!! tip
+ Building from source requires a lot of compilation. If you are building from source repeatedly, it's more efficient to cache the compilation results.
-For example, you can install [ccache](https://github.com/ccache/ccache) using `conda install ccache` or `apt install ccache` .
-As long as `which ccache` command can find the `ccache` binary, it will be used automatically by the build system. After the first build, subsequent builds will be much faster.
+ For example, you can install [ccache](https://github.com/ccache/ccache) using `conda install ccache` or `apt install ccache` .
+ As long as `which ccache` command can find the `ccache` binary, it will be used automatically by the build system. After the first build, subsequent builds will be much faster.
-When using `ccache` with `pip install -e .`, you should run `CCACHE_NOHASHDIR="true" pip install --no-build-isolation -e .`. This is because `pip` creates a new folder with a random name for each build, preventing `ccache` from recognizing that the same files are being built.
+ When using `ccache` with `pip install -e .`, you should run `CCACHE_NOHASHDIR="true" pip install --no-build-isolation -e .`. This is because `pip` creates a new folder with a random name for each build, preventing `ccache` from recognizing that the same files are being built.
-[sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments.
-The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
-:::
+ [sccache](https://github.com/mozilla/sccache) works similarly to `ccache`, but has the capability to utilize caching in remote storage environments.
+ The following environment variables can be set to configure the vLLM `sccache` remote: `SCCACHE_BUCKET=vllm-build-sccache SCCACHE_REGION=us-west-2 SCCACHE_S3_NO_CREDENTIALS=1`. We also recommend setting `SCCACHE_IDLE_TIMEOUT=0`.
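A sketch of the cached editable build described in this tip, assuming `ccache` is available from your package manager:

```console
# Install ccache, then build with a stable hash directory so ccache can reuse results across rebuilds
apt install ccache
CCACHE_NOHASHDIR="true" pip install --no-build-isolation -e .
```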
##### Use an existing PyTorch installation
@@ -220,11 +219,13 @@ export VLLM_TARGET_DEVICE=empty
pip install -e .
```
-## Set up using Docker
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
-See for instructions on using the official Docker image.
+See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image.
Another way to access the latest code is to use the docker images:
@@ -237,10 +238,12 @@ These docker images are used for CI and testing only, and they are not intended
The latest code can contain bugs and may not be stable. Please use it with caution.
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
-See for instructions on building the Docker image.
+See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.
## Supported features
-See compatibility matrix for feature support information.
+See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/gpu/rocm.inc.md b/docs/getting_started/installation/gpu/rocm.inc.md
similarity index 72%
rename from docs/source/getting_started/installation/gpu/rocm.inc.md
rename to docs/getting_started/installation/gpu/rocm.inc.md
index dc74368fe2..85d539b756 100644
--- a/docs/source/getting_started/installation/gpu/rocm.inc.md
+++ b/docs/getting_started/installation/gpu/rocm.inc.md
@@ -1,28 +1,31 @@
-# Installation
+# --8<-- [start:installation]
vLLM supports AMD GPUs with ROCm 6.3.
-:::{attention}
-There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
-:::
+!!! warning
+ There are no pre-built wheels for this device, so you must either use the pre-built Docker image or build vLLM from source.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- GPU: MI200s (gfx90a), MI300 (gfx942), Radeon RX 7900 series (gfx1100/1101), Radeon RX 9000 series (gfx1200/1201)
- ROCm 6.3
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built ROCm wheels.
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
-- [ROCm](https://rocm.docs.amd.com/en/latest/deploy/linux/index.html)
-- [PyTorch](https://pytorch.org/)
+ - [ROCm](https://rocm.docs.amd.com/en/latest/deploy/linux/index.html)
+ - [PyTorch](https://pytorch.org/)
For installing PyTorch, you can start from a fresh docker image, e.g, `rocm/pytorch:rocm6.3_ubuntu24.04_py3.12_pytorch_release_2.4.0`, `rocm/pytorch-nightly`. If you are using docker image, you can skip to Step 3.
@@ -49,9 +52,8 @@ Currently, there are no pre-built ROCm wheels.
cd ../..
```
- :::{note}
- If you see HTTP issue related to downloading packages during building triton, please try again as the HTTP error is intermittent.
- :::
+ !!! note
+        If you see an HTTP issue related to downloading packages while building Triton, please try again, as the HTTP error is intermittent.
2. Optionally, if you choose to use CK flash attention, you can install [flash attention for ROCm](https://github.com/ROCm/flash-attention)
@@ -69,9 +71,8 @@ Currently, there are no pre-built ROCm wheels.
cd ..
```
- :::{note}
- You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
- :::
+ !!! note
+ You might need to downgrade the "ninja" version to 1.10 as it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
3. If you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps:
@@ -84,9 +85,8 @@ Currently, there are no pre-built ROCm wheels.
python3 setup.py develop
```
- :::{note}
- You will need to config the `$AITER_BRANCH_OR_COMMIT` for your purpose.
- :::
+ !!! note
+        You will need to configure `$AITER_BRANCH_OR_COMMIT` for your purpose.
4. Build vLLM. For example, vLLM on ROCM 6.3 can be built with the following steps:
@@ -108,31 +108,30 @@ Currently, there are no pre-built ROCm wheels.
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
- :::{tip}
- - Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.
- - Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
- - To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
- - The ROCm version of PyTorch, ideally, should match the ROCm driver version.
- :::
+ !!! tip
+        - Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm-up step before collecting performance numbers.
+ - Triton flash attention does not currently support sliding window attention. If using half precision, please use CK flash-attention for sliding window support.
+ - To use CK flash-attention or PyTorch naive attention, please use this flag `export VLLM_USE_TRITON_FLASH_ATTN=0` to turn off triton flash attention.
+ - The ROCm version of PyTorch, ideally, should match the ROCm driver version.
-:::{tip}
-- For MI300x (gfx942) users, to achieve optimal performance, please refer to [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips on system and workflow level.
- For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
-:::
+!!! tip
+    - For MI300x (gfx942) users, to achieve optimal performance, please refer to the [MI300x tuning guide](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html) for performance optimization and tuning tips at the system and workflow level.
+ For vLLM, please refer to [vLLM performance optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html#vllm-performance-optimization).
## Set up using Docker (Recommended)
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
The [AMD Infinity hub for vLLM](https://hub.docker.com/r/rocm/vllm/tags) offers a prebuilt, optimized
docker image designed for validating inference performance on the AMD Instinct™ MI300X accelerator.
-:::{tip}
-Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
-for instructions on how to use this prebuilt docker image.
-:::
+!!! tip
+ Please check [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/performance-validation/mi300x/vllm-benchmark.html)
+ for instructions on how to use this prebuilt docker image.
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
Building the Docker image from source is the recommended way to use vLLM with ROCm.
@@ -213,4 +212,5 @@ Where the `` is the location where the model is stored, for examp
## Supported features
-See compatibility matrix for feature support information.
+See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/gpu/xpu.inc.md b/docs/getting_started/installation/gpu/xpu.inc.md
similarity index 67%
rename from docs/source/getting_started/installation/gpu/xpu.inc.md
rename to docs/getting_started/installation/gpu/xpu.inc.md
index 74937a1842..bee9a7ebb7 100644
--- a/docs/source/getting_started/installation/gpu/xpu.inc.md
+++ b/docs/getting_started/installation/gpu/xpu.inc.md
@@ -1,23 +1,26 @@
-# Installation
+# --8<-- [start:installation]
vLLM initially supports basic model inference and serving on Intel GPU platform.
-:::{attention}
-There are no pre-built wheels or images for this device, so you must build vLLM from source.
-:::
+!!! warning
+ There are no pre-built wheels or images for this device, so you must build vLLM from source.
-## Requirements
+# --8<-- [end:installation]
+# --8<-- [start:requirements]
- Supported Hardware: Intel Data Center GPU, Intel ARC GPU
- OneAPI requirements: oneAPI 2025.0
-## Set up using Python
+# --8<-- [end:requirements]
+# --8<-- [start:set-up-using-python]
-### Pre-built wheels
+# --8<-- [end:set-up-using-python]
+# --8<-- [start:pre-built-wheels]
Currently, there are no pre-built XPU wheels.
-### Build wheel from source
+# --8<-- [end:pre-built-wheels]
+# --8<-- [start:build-wheel-from-source]
- First, install required driver and Intel OneAPI 2025.0 or later.
- Second, install Python packages for vLLM XPU backend building:
@@ -35,18 +38,20 @@ pip install -v -r requirements/xpu.txt
VLLM_TARGET_DEVICE=xpu python setup.py install
```
-:::{note}
-- FP16 is the default data type in the current XPU backend. The BF16 data
- type is supported on Intel Data Center GPU, not supported on Intel Arc GPU yet.
-:::
+!!! note
+    - FP16 is the default data type in the current XPU backend. The BF16 data
+      type is supported on Intel Data Center GPU but is not yet supported on Intel Arc GPU.
-## Set up using Docker
+# --8<-- [end:build-wheel-from-source]
+# --8<-- [start:set-up-using-docker]
-### Pre-built images
+# --8<-- [end:set-up-using-docker]
+# --8<-- [start:pre-built-images]
Currently, there are no pre-built XPU images.
-### Build image from source
+# --8<-- [end:pre-built-images]
+# --8<-- [start:build-image-from-source]
```console
$ docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
@@ -73,3 +78,4 @@ python -m vllm.entrypoints.openai.api_server \
```
By default, a ray instance will be launched automatically if no existing one is detected in the system, with `num-gpus` equals to `parallel_config.world_size`. We recommend properly starting a ray cluster before execution, referring to the helper script.
+# --8<-- [end:extra-information]
diff --git a/docs/source/getting_started/installation/python_env_setup.inc.md b/docs/getting_started/installation/python_env_setup.inc.md
similarity index 100%
rename from docs/source/getting_started/installation/python_env_setup.inc.md
rename to docs/getting_started/installation/python_env_setup.inc.md
diff --git a/docs/source/getting_started/quickstart.md b/docs/getting_started/quickstart.md
similarity index 75%
rename from docs/source/getting_started/quickstart.md
rename to docs/getting_started/quickstart.md
index ecca296b0b..d24e75e814 100644
--- a/docs/source/getting_started/quickstart.md
+++ b/docs/getting_started/quickstart.md
@@ -1,11 +1,12 @@
-(quickstart)=
-
-# Quickstart
+---
+title: Quickstart
+---
+[](){ #quickstart }
This guide will help you quickly get started with vLLM to perform:
-- [Offline batched inference](#quickstart-offline)
-- [Online serving using OpenAI-compatible server](#quickstart-online)
+- [Offline batched inference][quickstart-offline]
+- [Online serving using OpenAI-compatible server][quickstart-online]
## Prerequisites
@@ -41,31 +42,29 @@ pip install --upgrade uv
uv pip install vllm --torch-backend=auto
```
-:::{note}
-For more detail and non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM.
-:::
+!!! note
+    For more detail, and for instructions specific to non-CUDA platforms, please refer to the [installation index][installation-index].
-(quickstart-offline)=
+[](){ #quickstart-offline }
## Offline Batched Inference
With vLLM installed, you can start generating texts for list of input prompts (i.e. offline batch inferencing). See the example script:
-The first line of this example imports the classes {class}`~vllm.LLM` and {class}`~vllm.SamplingParams`:
+The first line of this example imports the classes [LLM][vllm.LLM] and [SamplingParams][vllm.SamplingParams]:
-- {class}`~vllm.LLM` is the main class for running offline inference with vLLM engine.
-- {class}`~vllm.SamplingParams` specifies the parameters for the sampling process.
+- [LLM][vllm.LLM] is the main class for running offline inference with the vLLM engine.
+- [SamplingParams][vllm.SamplingParams] specifies the parameters for the sampling process.
```python
from vllm import LLM, SamplingParams
```
-The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here](#sampling-params).
-:::{important}
-By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if {class}`~vllm.SamplingParams` is not specified.
+The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here][sampling-params].
+!!! warning
+    By default, vLLM will use sampling parameters recommended by the model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
-However, if vLLM's default sampling parameters are preferred, please set `generation_config="vllm"` when creating the {class}`~vllm.LLM` instance.
-:::
+ However, if vLLM's default sampling parameters are preferred, please set `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
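+
+    For example, a minimal sketch of opting into vLLM's defaults (the model name is only illustrative):
+
+    ```python
+    from vllm import LLM
+
+    # Ignore generation_config.json from the model repo and use vLLM's own defaults
+    llm = LLM(model="facebook/opt-125m", generation_config="vllm")
+    ```
+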
```python
prompts = [
@@ -77,20 +76,18 @@ prompts = [
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
```
-The {class}`~vllm.LLM` class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here](#supported-models).
+The [LLM][vllm.LLM] class initializes vLLM's engine and the [OPT-125M model](https://arxiv.org/abs/2205.01068) for offline inference. The list of supported models can be found [here][supported-models].
```python
llm = LLM(model="facebook/opt-125m")
```
-:::{note}
-By default, vLLM downloads models from [Hugging Face](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
+!!! note
+ By default, vLLM downloads models from [Hugging Face](https://huggingface.co/). If you would like to use models from [ModelScope](https://www.modelscope.cn), set the environment variable `VLLM_USE_MODELSCOPE` before initializing the engine.
-```shell
-export VLLM_USE_MODELSCOPE=True
-```
-
-:::
+ ```shell
+ export VLLM_USE_MODELSCOPE=True
+ ```
Now, the fun part! The outputs are generated using `llm.generate`. It adds the input prompts to the vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of `RequestOutput` objects, which include all of the output tokens.
@@ -103,7 +100,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
-(quickstart-online)=
+[](){ #quickstart-online }
## OpenAI-Compatible Server
@@ -116,15 +113,13 @@ Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instru
vllm serve Qwen/Qwen2.5-1.5B-Instruct
```
-:::{note}
-By default, the server uses a predefined chat template stored in the tokenizer.
-You can learn about overriding it [here](#chat-template).
-:::
-:::{important}
-By default, the server applies `generation_config.json` from the huggingface model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
+!!! note
+ By default, the server uses a predefined chat template stored in the tokenizer.
+ You can learn about overriding it [here][chat-template].
+!!! warning
+    By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
-To disable this behavior, please pass `--generation-config vllm` when launching the server.
-:::
+ To disable this behavior, please pass `--generation-config vllm` when launching the server.
This server can be queried in the same format as OpenAI API. For example, to list the models:
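+
+A minimal sketch using the official `openai` Python client (assuming the server is reachable at the default `http://localhost:8000`):
+
+```python
+from openai import OpenAI
+
+# The API key can be any placeholder string unless `--api-key` was set at launch
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+for model in client.models.list():
+    print(model.id)
+```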
@@ -215,6 +210,5 @@ Currently, vLLM supports multiple backends for efficient Attention computation a
If desired, you can also manually set the backend of your choice by configuring the environment variable `VLLM_ATTENTION_BACKEND` to one of the following options: `FLASH_ATTN`, `FLASHINFER` or `XFORMERS`.
-```{attention}
-There are no pre-built vllm wheels containing Flash Infer, so you must install it in your environment first. Refer to the [Flash Infer official docs](https://docs.flashinfer.ai/) or see for instructions on how to install it.
-```
+!!! warning
+    There are no pre-built vLLM wheels containing FlashInfer, so you must install it in your environment first. Refer to the [FlashInfer official docs](https://docs.flashinfer.ai/) for instructions on how to install it.
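+
+For example, a quick sketch of selecting the attention backend from Python (which backends are actually usable depends on your installation):
+
+```python
+import os
+
+# Set before vLLM initializes its attention backend
+os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"
+
+from vllm import LLM
+
+llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")
+```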
diff --git a/docs/source/getting_started/troubleshooting.md b/docs/getting_started/troubleshooting.md
similarity index 86%
rename from docs/source/getting_started/troubleshooting.md
rename to docs/getting_started/troubleshooting.md
index a4744827f2..07e30f9684 100644
--- a/docs/source/getting_started/troubleshooting.md
+++ b/docs/getting_started/troubleshooting.md
@@ -1,12 +1,12 @@
-(troubleshooting)=
-
-# Troubleshooting
+---
+title: Troubleshooting
+---
+[](){ #troubleshooting }
This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
-:::{note}
-Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
-:::
+!!! note
+ Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
## Hangs downloading a model
@@ -18,13 +18,12 @@ It's recommended to download the model first using the [huggingface-cli](https:/
If the model is large, it can take a long time to load it from disk. Pay attention to where you store the model. Some clusters have shared filesystems across nodes, e.g. a distributed filesystem or a network filesystem, which can be slow.
It'd be better to store the model in a local disk. Additionally, have a look at the CPU memory usage, when the model is too large it might take a lot of CPU memory, slowing down the operating system because it needs to frequently swap between disk and memory.
-:::{note}
-To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
-:::
+!!! note
+ To isolate the model downloading and loading issue, you can use the `--load-format dummy` argument to skip loading the model weights. This way, you can check if the model downloading and loading is the bottleneck.
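+
+    A minimal offline sketch of the same idea (the model name is only illustrative):
+
+    ```python
+    from vllm import LLM
+
+    # Random-initialize the weights instead of reading them from disk
+    llm = LLM(model="facebook/opt-125m", load_format="dummy")
+    ```
+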
## Out of memory
-If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider adopting [these options](#reducing-memory-usage) to reduce the memory consumption.
+If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider adopting [these options][reducing-memory-usage] to reduce the memory consumption.
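+
+As a rough sketch, the knobs most often adjusted in this situation look like the following (the values are placeholders; see the linked page for guidance):
+
+```python
+from vllm import LLM
+
+llm = LLM(
+    model="facebook/opt-125m",  # illustrative; OOM usually involves much larger models
+    tensor_parallel_size=2,     # shard the weights across 2 GPUs
+    max_model_len=4096,         # cap the context length
+    max_num_seqs=64,            # cap the number of concurrent sequences
+)
+```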
## Generation quality changed
@@ -53,9 +52,9 @@ You might also need to set `export NCCL_SOCKET_IFNAME=`
## Error near `self.graph.replay()`
If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph.
-To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
+To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the [LLM][vllm.LLM] class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.
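+
+For example, a one-line sketch of the offline equivalent (the model name is illustrative):
+
+```python
+from vllm import LLM
+
+# Disable CUDA graph capture so the failing CUDA operation shows up in the traceback
+llm = LLM(model="facebook/opt-125m", enforce_eager=True)
+```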
-(troubleshooting-incorrect-hardware-driver)=
+[](){ #troubleshooting-incorrect-hardware-driver }
## Incorrect hardware/driver
@@ -140,16 +139,15 @@ If the script runs successfully, you should see the message `sanity check is suc
If the test script hangs or crashes, usually it means the hardware/drivers are broken in some sense. You should try to contact your system administrator or hardware vendor for further assistance. As a common workaround, you can try to tune some NCCL environment variables, such as `export NCCL_P2P_DISABLE=1` to see if it helps. Please check [their documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html) for more information. Please only use these environment variables as a temporary workaround, as they might affect the performance of the system. The best solution is still to fix the hardware/drivers so that the test script can run successfully.
-:::{note}
-A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
+!!! note
+ A multi-node environment is more complicated than a single-node one. If you see errors such as `torch.distributed.DistNetworkError`, it is likely that the network/DNS setup is incorrect. In that case, you can manually assign node rank and specify the IP via command line arguments:
-- In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
-- In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
+ - In the first node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 0 --master_addr $MASTER_ADDR test.py`.
+ - In the second node, run `NCCL_DEBUG=TRACE torchrun --nnodes 2 --nproc-per-node=2 --node-rank 1 --master_addr $MASTER_ADDR test.py`.
-Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
-:::
+ Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
-(troubleshooting-python-multiprocessing)=
+[](){ #troubleshooting-python-multiprocessing }
## Python multiprocessing
@@ -260,7 +258,7 @@ or:
ValueError: Model architectures [''] are not supported for now. Supported architectures: [...]
```
+But if you are sure that the model is in the [list of supported models][supported-models], there may be some issue with vLLM's model resolution. In that case, please follow [these steps][model-resolution] to explicitly specify the vLLM implementation for the model.
+But you are sure that the model is in the [list of supported models][supported-models], there may be some issue with vLLM's model resolution. In that case, please follow [these steps][model-resolution] to explicitly specify the vLLM implementation for the model.
## Failed to infer device type
diff --git a/docs/source/getting_started/v1_user_guide.md b/docs/getting_started/v1_user_guide.md
similarity index 100%
rename from docs/source/getting_started/v1_user_guide.md
rename to docs/getting_started/v1_user_guide.md
diff --git a/docs/make.bat b/docs/make.bat
deleted file mode 100644
index 747ffb7b30..0000000000
--- a/docs/make.bat
+++ /dev/null
@@ -1,35 +0,0 @@
-@ECHO OFF
-
-pushd %~dp0
-
-REM Command file for Sphinx documentation
-
-if "%SPHINXBUILD%" == "" (
- set SPHINXBUILD=sphinx-build
-)
-set SOURCEDIR=source
-set BUILDDIR=build
-
-%SPHINXBUILD% >NUL 2>NUL
-if errorlevel 9009 (
- echo.
- echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
- echo.installed, then set the SPHINXBUILD environment variable to point
- echo.to the full path of the 'sphinx-build' executable. Alternatively you
- echo.may add the Sphinx directory to PATH.
- echo.
- echo.If you don't have Sphinx installed, grab it from
- echo.https://www.sphinx-doc.org/
- exit /b 1
-)
-
-if "%1" == "" goto help
-
-%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
-goto end
-
-:help
-%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
-
-:end
-popd
diff --git a/docs/mkdocs/hooks/generate_examples.py b/docs/mkdocs/hooks/generate_examples.py
new file mode 100644
index 0000000000..9144f6824b
--- /dev/null
+++ b/docs/mkdocs/hooks/generate_examples.py
@@ -0,0 +1,159 @@
+# SPDX-License-Identifier: Apache-2.0
+
+import itertools
+import re
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Literal
+
+ROOT_DIR = Path(__file__).parent.parent.parent.parent
+ROOT_DIR_RELATIVE = '../../../../..'
+EXAMPLE_DIR = ROOT_DIR / "examples"
+EXAMPLE_DOC_DIR = ROOT_DIR / "docs/getting_started/examples"
+print(ROOT_DIR.resolve())
+print(EXAMPLE_DIR.resolve())
+print(EXAMPLE_DOC_DIR.resolve())
+
+
+def fix_case(text: str) -> str:
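+    # Illustrative example: fix_case("Openai Api Client") == "OpenAI API Client"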
+ subs = {
+ "api": "API",
+ "cli": "CLI",
+ "cpu": "CPU",
+ "llm": "LLM",
+ "mae": "MAE",
+ "tpu": "TPU",
+ "aqlm": "AQLM",
+ "gguf": "GGUF",
+ "lora": "LoRA",
+ "rlhf": "RLHF",
+ "vllm": "vLLM",
+ "openai": "OpenAI",
+ "lmcache": "LMCache",
+ "multilora": "MultiLoRA",
+ "mlpspeculator": "MLPSpeculator",
+ r"fp\d+": lambda x: x.group(0).upper(), # e.g. fp16, fp32
+ r"int\d+": lambda x: x.group(0).upper(), # e.g. int8, int16
+ }
+ for pattern, repl in subs.items():
+ text = re.sub(rf'\b{pattern}\b', repl, text, flags=re.IGNORECASE)
+ return text
+
+
+@dataclass
+class Example:
+ """
+ Example class for generating documentation content from a given path.
+
+ Attributes:
+ path (Path): The path to the main directory or file.
+ category (str): The category of the document.
+ main_file (Path): The main file in the directory.
+ other_files (list[Path]): list of other files in the directory.
+ title (str): The title of the document.
+
+ Methods:
+ __post_init__(): Initializes the main_file, other_files, and title attributes.
+ determine_main_file() -> Path: Determines the main file in the given path.
+ determine_other_files() -> list[Path]: Determines other files in the directory excluding the main file.
+ determine_title() -> str: Determines the title of the document.
+ generate() -> str: Generates the documentation content.
+ """ # noqa: E501
+ path: Path
+ category: str = None
+ main_file: Path = field(init=False)
+ other_files: list[Path] = field(init=False)
+ title: str = field(init=False)
+
+ def __post_init__(self):
+ self.main_file = self.determine_main_file()
+ self.other_files = self.determine_other_files()
+ self.title = self.determine_title()
+
+ def determine_main_file(self) -> Path:
+ """
+ Determines the main file in the given path.
+ If the path is a file, it returns the path itself. Otherwise, it searches
+ for Markdown files (*.md) in the directory and returns the first one found.
+ Returns:
+ Path: The main file path, either the original path if it's a file or the first
+ Markdown file found in the directory.
+ Raises:
+ IndexError: If no Markdown files are found in the directory.
+ """ # noqa: E501
+ return self.path if self.path.is_file() else list(
+ self.path.glob("*.md")).pop()
+
+ def determine_other_files(self) -> list[Path]:
+ """
+ Determine other files in the directory excluding the main file.
+
+ This method checks if the given path is a file. If it is, it returns an empty list.
+ Otherwise, it recursively searches through the directory and returns a list of all
+ files that are not the main file.
+
+ Returns:
+ list[Path]: A list of Path objects representing the other files in the directory.
+ """ # noqa: E501
+ if self.path.is_file():
+ return []
+ is_other_file = lambda file: file.is_file() and file != self.main_file
+ return [file for file in self.path.rglob("*") if is_other_file(file)]
+
+ def determine_title(self) -> str:
+ return fix_case(self.path.stem.replace("_", " ").title())
+
+ def generate(self) -> str:
+ content = f"---\ntitle: {self.title}\n---\n\n"
+ content += f"Source .\n\n"
+
+ is_code = self.main_file.suffix != ".md"
+ if is_code:
+ content += f"```{self.main_file.suffix[1:]}\n"
+ content += f'--8<-- "{self.main_file}"\n'
+ if is_code:
+ content += "```\n"
+ content += "\n"
+
+ if not self.other_files:
+ return content
+
+ content += "## Example materials\n\n"
+ for file in sorted(self.other_files):
+ content += f'??? abstract "{file.relative_to(self.path)}"\n'
+ if file.suffix != ".md":
+ content += f" ```{file.suffix[1:]}\n"
+ content += f' --8<-- "{file}"\n'
+ if file.suffix != ".md":
+ content += " ```\n"
+
+ return content
+
+
+def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
+ # Create the EXAMPLE_DOC_DIR if it doesn't exist
+ if not EXAMPLE_DOC_DIR.exists():
+ EXAMPLE_DOC_DIR.mkdir(parents=True)
+
+ categories = sorted(p for p in EXAMPLE_DIR.iterdir() if p.is_dir())
+
+ examples = []
+ glob_patterns = ["*.py", "*.md", "*.sh"]
+ # Find categorised examples
+ for category in categories:
+ globs = [category.glob(pattern) for pattern in glob_patterns]
+ for path in itertools.chain(*globs):
+ examples.append(Example(path, category.stem))
+ # Find examples in subdirectories
+ for path in category.glob("*/*.md"):
+ examples.append(Example(path.parent, category.stem))
+
+ # Generate the example documentation
+ for example in sorted(examples, key=lambda e: e.path.stem):
+ example_name = f"{example.path.stem}.md"
+ doc_path = EXAMPLE_DOC_DIR / example.category / example_name
+ print(doc_path)
+ if not doc_path.parent.exists():
+ doc_path.parent.mkdir(parents=True)
+ with open(doc_path, "w+") as f:
+ f.write(example.generate())
diff --git a/docs/mkdocs/hooks/remove_announcement.py b/docs/mkdocs/hooks/remove_announcement.py
new file mode 100644
index 0000000000..e5f8549d83
--- /dev/null
+++ b/docs/mkdocs/hooks/remove_announcement.py
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: Apache-2.0
+import os
+from typing import Literal
+
+
+def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
+ # see https://docs.readthedocs.io/en/stable/reference/environment-variables.html # noqa
+ if os.getenv('READTHEDOCS_VERSION_TYPE') == "tag":
+ # remove the warning banner if the version is a tagged release
+ docs_dir = os.path.dirname(__file__)
+ announcement_path = os.path.join(docs_dir,
+ "mkdocs/overrides/main.html")
+ # The file might be removed already if the build is triggered multiple
+ # times (readthedocs build both HTML and PDF versions separately)
+ if os.path.exists(announcement_path):
+ os.remove(announcement_path)
diff --git a/docs/mkdocs/hooks/url_schemes.py b/docs/mkdocs/hooks/url_schemes.py
new file mode 100644
index 0000000000..03e7ffbb27
--- /dev/null
+++ b/docs/mkdocs/hooks/url_schemes.py
@@ -0,0 +1,54 @@
+# SPDX-License-Identifier: Apache-2.0
+import re
+
+from mkdocs.config.defaults import MkDocsConfig
+from mkdocs.structure.files import Files
+from mkdocs.structure.pages import Page
+
+
+def on_page_markdown(markdown: str, *, page: Page, config: MkDocsConfig,
+ files: Files):
+ gh_icon = ":octicons-mark-github-16:"
+ gh_url = "https://github.com"
+ repo_url = f"{gh_url}/vllm-project/vllm"
+ org_url = f"{gh_url}/orgs/vllm-project"
+ urls = {
+ "issue": f"{repo_url}/issues",
+ "pr": f"{repo_url}/pull",
+ "project": f"{org_url}/projects",
+ "dir": f"{repo_url}/tree/main",
+ "file": f"{repo_url}/blob/main",
+ }
+ titles = {
+ "issue": "Issue #",
+ "pr": "Pull Request #",
+ "project": "Project #",
+ "dir": "",
+ "file": "",
+ }
+
+ scheme = r"gh-(?P.+?):(?P.+?)(#(?P.+?))?"
+ inline_link = re.compile(r"\[(?P[^\[]+?)\]\(" + scheme + r"\)")
+ auto_link = re.compile(f"<{scheme}>")
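+    # Illustrative examples (hypothetical issue/PR numbers):
+    #   "[Docs fix](gh-pr:12345)" -> "[:octicons-mark-github-16: Docs fix](https://github.com/vllm-project/vllm/pull/12345)"
+    #   "<gh-issue:123>"          -> "[:octicons-mark-github-16: Issue #123](https://github.com/vllm-project/vllm/issues/123)"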
+
+ def replace_inline_link(match: re.Match) -> str:
+ url = f'{urls[match.group("type")]}/{match.group("path")}'
+ if fragment := match.group("fragment"):
+ url += f"#{fragment}"
+
+ return f'[{gh_icon} {match.group("title")}]({url})'
+
+ def replace_auto_link(match: re.Match) -> str:
+ type = match.group("type")
+ path = match.group("path")
+ title = f"{titles[type]}{path}"
+ url = f"{urls[type]}/{path}"
+ if fragment := match.group("fragment"):
+ url += f"#{fragment}"
+
+ return f"[{gh_icon} {title}]({url})"
+
+ markdown = inline_link.sub(replace_inline_link, markdown)
+ markdown = auto_link.sub(replace_auto_link, markdown)
+
+ return markdown
diff --git a/docs/source/_static/custom.js b/docs/mkdocs/javascript/run_llm_widget.js
similarity index 54%
rename from docs/source/_static/custom.js
rename to docs/mkdocs/javascript/run_llm_widget.js
index 58bc2ebb96..d0e5560e92 100644
--- a/docs/source/_static/custom.js
+++ b/docs/mkdocs/javascript/run_llm_widget.js
@@ -17,22 +17,3 @@ document.addEventListener("DOMContentLoaded", function () {
script.async = true;
document.head.appendChild(script);
});
-
-// Update URL search params when tab is clicked
- document.addEventListener("DOMContentLoaded", function () {
- const tabs = document.querySelectorAll(".sd-tab-label");
-
- function updateURL(tab) {
- const syncGroup = tab.getAttribute("data-sync-group");
- const syncId = tab.getAttribute("data-sync-id");
- if (syncGroup && syncId) {
- const url = new URL(window.location);
- url.searchParams.set(syncGroup, syncId);
- window.history.replaceState(null, "", url);
- }
- }
-
- tabs.forEach(tab => {
- tab.addEventListener("click", () => updateURL(tab));
- });
-});
diff --git a/docs/mkdocs/overrides/main.html b/docs/mkdocs/overrides/main.html
new file mode 100644
index 0000000000..bdd62ebc15
--- /dev/null
+++ b/docs/mkdocs/overrides/main.html
@@ -0,0 +1,5 @@
+{% extends "base.html" %}
+
+{% block announce %}
+
+You are viewing the latest developer preview docs. Click here to view docs for the latest stable release.
+{% endblock %}
diff --git a/docs/source/models/extensions/fastsafetensor.md b/docs/models/extensions/fastsafetensor.md
similarity index 100%
rename from docs/source/models/extensions/fastsafetensor.md
rename to docs/models/extensions/fastsafetensor.md
diff --git a/docs/source/models/extensions/runai_model_streamer.md b/docs/models/extensions/runai_model_streamer.md
similarity index 86%
rename from docs/source/models/extensions/runai_model_streamer.md
rename to docs/models/extensions/runai_model_streamer.md
index e0daa6f86d..c80120fa98 100644
--- a/docs/source/models/extensions/runai_model_streamer.md
+++ b/docs/models/extensions/runai_model_streamer.md
@@ -1,6 +1,7 @@
-(runai-model-streamer)=
-
-# Loading models with Run:ai Model Streamer
+---
+title: Loading models with Run:ai Model Streamer
+---
+[](){ #runai-model-streamer }
Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory.
Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).
@@ -48,9 +49,8 @@ You can read further about CPU buffer memory limiting [here](https://github.com/
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"memory_limit":5368709120}'
```
-:::{note}
-For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md).
-:::
+!!! note
+ For further instructions about tunable parameters and additional parameters configurable through environment variables, read the [Environment Variables Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md).
## Sharded Model Loading
@@ -74,6 +74,5 @@ The sharded loader supports all the same tunable parameters as the regular Run:a
vllm serve /path/to/sharded/model --load-format runai_streamer_sharded --model-loader-extra-config '{"concurrency":16, "memory_limit":5368709120}'
```
-:::{note}
-The sharded loader is particularly efficient for tensor or pipeline parallel models where each worker only needs to read its own shard rather than the entire checkpoint.
-:::
+!!! note
+ The sharded loader is particularly efficient for tensor or pipeline parallel models where each worker only needs to read its own shard rather than the entire checkpoint.
diff --git a/docs/source/models/extensions/tensorizer.md b/docs/models/extensions/tensorizer.md
similarity index 79%
rename from docs/source/models/extensions/tensorizer.md
rename to docs/models/extensions/tensorizer.md
index cd94c81e62..36b49626d4 100644
--- a/docs/source/models/extensions/tensorizer.md
+++ b/docs/models/extensions/tensorizer.md
@@ -1,6 +1,7 @@
-(tensorizer)=
-
-# Loading models with CoreWeave's Tensorizer
+---
+title: Loading models with CoreWeave's Tensorizer
+---
+[](){ #tensorizer }
vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized
@@ -11,6 +12,5 @@ For more information on CoreWeave's Tensorizer, please refer to
[CoreWeave's Tensorizer documentation](https://github.com/coreweave/tensorizer). For more information on serializing a vLLM model, as well a general usage guide to using Tensorizer with vLLM, see
the [vLLM example script](https://docs.vllm.ai/en/latest/getting_started/examples/tensorize_vllm_model.html).
-:::{note}
-Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.
-:::
+!!! note
+ Note that to use this feature you will need to install `tensorizer` by running `pip install vllm[tensorizer]`.
diff --git a/docs/source/models/generative_models.md b/docs/models/generative_models.md
similarity index 63%
rename from docs/source/models/generative_models.md
rename to docs/models/generative_models.md
index dd765e4a97..566b1c29fc 100644
--- a/docs/source/models/generative_models.md
+++ b/docs/models/generative_models.md
@@ -1,24 +1,25 @@
-(generative-models)=
-
-# Generative Models
+---
+title: Generative Models
+---
+[](){ #generative-models }
vLLM provides first-class support for generative models, which covers most of LLMs.
-In vLLM, generative models implement the {class}`~vllm.model_executor.models.VllmModelForTextGeneration` interface.
+In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
-which are then passed through {class}`~vllm.model_executor.layers.Sampler` to obtain the final text.
+which are then passed through [Sampler][vllm.model_executor.layers.Sampler] to obtain the final text.
For generative models, the only supported `--task` option is `"generate"`.
Usually, this is automatically inferred so you don't have to specify it.
## Offline Inference
-The {class}`~vllm.LLM` class provides various methods for offline inference.
-See for a list of options when initializing the model.
+The [LLM][vllm.LLM] class provides various methods for offline inference.
+See [configuration][configuration] for a list of options when initializing the model.
### `LLM.generate`
-The {class}`~vllm.LLM.generate` method is available to all generative models in vLLM.
+The [generate][vllm.LLM.generate] method is available to all generative models in vLLM.
It is similar to [its counterpart in HF Transformers](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate),
except that tokenization and detokenization are also performed automatically.
@@ -34,7 +35,7 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
-You can optionally control the language generation by passing {class}`~vllm.SamplingParams`.
+You can optionally control the language generation by passing [SamplingParams][vllm.SamplingParams].
For example, you can use greedy sampling by setting `temperature=0`:
```python
@@ -50,16 +51,15 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
-:::{important}
-By default, vLLM will use sampling parameters recommended by model creator by applying the `generation_config.json` from the huggingface model repository if it exists. In most cases, this will provide you with the best results by default if {class}`~vllm.SamplingParams` is not specified.
+!!! warning
+    By default, vLLM will use sampling parameters recommended by the model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this will provide you with the best results by default if [SamplingParams][vllm.SamplingParams] is not specified.
-However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the {class}`~vllm.LLM` instance.
-:::
+ However, if vLLM's default sampling parameters are preferred, please pass `generation_config="vllm"` when creating the [LLM][vllm.LLM] instance.
A code example can be found here:
### `LLM.beam_search`
-The {class}`~vllm.LLM.beam_search` method implements [beam search](https://huggingface.co/docs/transformers/en/generation_strategies#beam-search) on top of {class}`~vllm.LLM.generate`.
+The [beam_search][vllm.LLM.beam_search] method implements [beam search](https://huggingface.co/docs/transformers/en/generation_strategies#beam-search) on top of [generate][vllm.LLM.generate].
For example, to search using 5 beams and output at most 50 tokens:
```python
@@ -77,14 +77,13 @@ for output in outputs:
### `LLM.chat`
-The {class}`~vllm.LLM.chat` method implements chat functionality on top of {class}`~vllm.LLM.generate`.
+The [chat][vllm.LLM.chat] method implements chat functionality on top of [generate][vllm.LLM.generate].
In particular, it accepts input similar to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat)
and automatically applies the model's [chat template](https://huggingface.co/docs/transformers/en/chat_templating) to format the prompt.
-:::{important}
-In general, only instruction-tuned models have a chat template.
-Base models may perform poorly as they are not trained to respond to the chat conversation.
-:::
+!!! warning
+ In general, only instruction-tuned models have a chat template.
+ Base models may perform poorly as they are not trained to respond to the chat conversation.
```python
from vllm import LLM
@@ -133,7 +132,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)
## Online Serving
-Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:
+Our [OpenAI-Compatible Server][openai-compatible-server] provides endpoints that correspond to the offline APIs:
-- [Completions API](#completions-api) is similar to `LLM.generate` but only accepts text.
-- [Chat API](#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template.
+- [Completions API][completions-api] is similar to `LLM.generate` but only accepts text.
+- [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs][multimodal-inputs] for models with a chat template.
diff --git a/docs/source/models/pooling_models.md b/docs/models/pooling_models.md
similarity index 62%
rename from docs/source/models/pooling_models.md
rename to docs/models/pooling_models.md
index 3fd35e2e8b..89a128915a 100644
--- a/docs/source/models/pooling_models.md
+++ b/docs/models/pooling_models.md
@@ -1,70 +1,48 @@
-(pooling-models)=
-
-# Pooling Models
+---
+title: Pooling Models
+---
+[](){ #pooling-models }
vLLM also supports pooling models, including embedding, reranking and reward models.
-In vLLM, pooling models implement the {class}`~vllm.model_executor.models.VllmModelForPooling` interface.
-These models use a {class}`~vllm.model_executor.layers.Pooler` to extract the final hidden states of the input
+In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
+These models use a [Pooler][vllm.model_executor.layers.Pooler] to extract the final hidden states of the input
before returning them.
-:::{note}
-We currently support pooling models primarily as a matter of convenience.
-As shown in the [Compatibility Matrix](#compatibility-matrix), most vLLM features are not applicable to
-pooling models as they only work on the generation or decode stage, so performance may not improve as much.
-:::
+!!! note
+ We currently support pooling models primarily as a matter of convenience.
+ As shown in the [Compatibility Matrix][compatibility-matrix], most vLLM features are not applicable to
+ pooling models as they only work on the generation or decode stage, so performance may not improve as much.
For pooling models, we support the following `--task` options.
The selected option sets the default pooler used to extract the final hidden states:
-:::{list-table}
-:widths: 50 25 25 25
-:header-rows: 1
-
-- * Task
- * Pooling Type
- * Normalization
- * Softmax
-- * Embedding (`embed`)
- * `LAST`
- * ✅︎
- * ❌
-- * Classification (`classify`)
- * `LAST`
- * ❌
- * ✅︎
-- * Sentence Pair Scoring (`score`)
- * \*
- * \*
- * \*
-- * Reward Modeling (`reward`)
- * `ALL`
- * ❌
- * ❌
-:::
+| Task                            | Pooling Type   | Normalization   | Softmax   |
+|---------------------------------|----------------|-----------------|-----------|
+| Embedding (`embed`)             | `LAST`         | ✅︎              | ❌        |
+| Classification (`classify`)     | `LAST`         | ❌              | ✅︎        |
+| Sentence Pair Scoring (`score`) | \*             | \*              | \*        |
+| Reward Modeling (`reward`)      | `ALL`          | ❌              | ❌        |
\*The default pooler is always defined by the model.
-:::{note}
-If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
-:::
+!!! note
+ If the model's implementation in vLLM defines its own pooler, the default pooler is set to that instead of the one specified in this table.
When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
we attempt to override the default pooler based on its Sentence Transformers configuration file (`modules.json`).
-:::{tip}
-You can customize the model's pooling method via the `--override-pooler-config` option,
-which takes priority over both the model's and Sentence Transformers's defaults.
-:::
+!!! tip
+ You can customize the model's pooling method via the `--override-pooler-config` option,
+ which takes priority over both the model's and Sentence Transformers's defaults.
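+
+    A minimal sketch of the offline equivalent, assuming the `LLM` entrypoint accepts `override_pooler_config` (the model name and pooling settings are illustrative):
+
+    ```python
+    from vllm import LLM
+    from vllm.config import PoolerConfig
+
+    llm = LLM(
+        model="BAAI/bge-base-en-v1.5",
+        task="embed",
+        override_pooler_config=PoolerConfig(pooling_type="MEAN", normalize=True),
+    )
+    ```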
## Offline Inference
-The {class}`~vllm.LLM` class provides various methods for offline inference.
-See for a list of options when initializing the model.
+The [LLM][vllm.LLM] class provides various methods for offline inference.
+See [configuration][configuration] for a list of options when initializing the model.
### `LLM.encode`
-The {class}`~vllm.LLM.encode` method is available to all pooling models in vLLM.
+The [encode][vllm.LLM.encode] method is available to all pooling models in vLLM.
It returns the extracted hidden states directly, which is useful for reward models.
```python
@@ -79,7 +57,7 @@ print(f"Data: {data!r}")
### `LLM.embed`
-The {class}`~vllm.LLM.embed` method outputs an embedding vector for each prompt.
+The [embed][vllm.LLM.embed] method outputs an embedding vector for each prompt.
It is primarily designed for embedding models.
```python
@@ -96,7 +74,7 @@ A code example can be found here: .
+
+These models are what we list in [supported-text-models][supported-text-models] and [supported-mm-models][supported-mm-models].
+
+[](){ #transformers-backend }
+
+### Transformers
+
+vLLM also supports model implementations that are available in Transformers. This does not currently work for all models, but most decoder language models are supported, and vision language model support is planned!
+
+To check if the modeling backend is Transformers, you can simply do this:
+
+```python
+from vllm import LLM
+llm = LLM(model=..., task="generate") # Name or path of your model
+llm.apply_model(lambda model: print(type(model)))
+```
+
+If it is `TransformersForCausalLM` then it means it's based on Transformers!
+
+!!! tip
+ You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for [offline-inference][offline-inference] or `--model-impl transformers` for the [openai-compatible-server][openai-compatible-server].
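+
+    For example, a minimal sketch for offline inference:
+
+    ```python
+    from vllm import LLM
+
+    # Force the Transformers modeling code even if vLLM has a native implementation
+    llm = LLM(model=..., model_impl="transformers")  # Name or path of your model
+    ```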
+
+!!! note
+ vLLM may not fully optimise the Transformers implementation so you may see degraded performance if comparing a native model to a Transformers model in vLLM.
+
+#### Custom models
+
+If a model is supported natively by neither vLLM nor Transformers, it can still be used in vLLM!
+
+For a model to be compatible with the Transformers backend for vLLM it must:
+
+- be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)):
+ * The model directory must have the correct structure (e.g. `config.json` is present).
+ * `config.json` must contain `auto_map.AutoModel`.
+- be a Transformers backend for vLLM compatible model (see [writing-custom-models][writing-custom-models]):
+ * Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).
+
+If the compatible model is:
+
+- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline-inference][offline-inference] or `--trust-remote-code` for the [openai-compatible-server][openai-compatible-server].
+- in a local directory, simply pass the directory path to `model=` for [offline-inference][offline-inference] or to `vllm serve` for the [openai-compatible-server][openai-compatible-server].
+
+This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
+
+[](){ #writing-custom-models }
+
+#### Writing custom models
+
+This section details the modifications needed to make a Transformers compatible custom model compatible with the Transformers backend for vLLM. (We assume that a Transformers compatible custom model has already been created, see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)).
+
+To make your model compatible with the Transformers backend, it needs:
+
+1. `kwargs` must be passed down through all modules from `MyModel` to `MyAttention`.
+2. `MyAttention` must use `ALL_ATTENTION_FUNCTIONS` to call attention.
+3. `MyModel` must contain `_supports_attention_backend = True`.
+
+```python title="modeling_my_model.py"
+
+from transformers import PreTrainedModel
+from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS  # registry of attention implementations (assumed available in recent Transformers versions)
+from torch import nn
+
+class MyAttention(nn.Module):
+
+ def forward(self, hidden_states, **kwargs):
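+        # NOTE: config._attn_implementation is set to "vllm" when vLLM loads this
+        # model, so the lookup below dispatches to vLLM's attention backend.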
+ ...
+ attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
+ attn_output, attn_weights = attention_interface(
+ self,
+ query_states,
+ key_states,
+ value_states,
+ **kwargs,
+ )
+ ...
+
+class MyModel(PreTrainedModel):
+ _supports_attention_backend = True
+```
+
+Here is what happens in the background when this model is loaded:
+
+1. The config is loaded.
+2. `MyModel` Python class is loaded from the `auto_map` in config, and we check that the model `is_backend_compatible()`.
+3. `MyModel` is loaded into `TransformersForCausalLM` (see ) which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
+
+That's it!
+
+For your model to be compatible with vLLM's tensor parallel and/or pipeline parallel features, you must add `base_model_tp_plan` and/or `base_model_pp_plan` to your model's config class:
+
+```python title="configuration_my_model.py"
+
+from transformers import PretrainedConfig
+
+class MyConfig(PretrainedConfig):
+ base_model_tp_plan = {
+ "layers.*.self_attn.k_proj": "colwise",
+ "layers.*.self_attn.v_proj": "colwise",
+ "layers.*.self_attn.o_proj": "rowwise",
+ "layers.*.mlp.gate_proj": "colwise",
+ "layers.*.mlp.up_proj": "colwise",
+ "layers.*.mlp.down_proj": "rowwise",
+ }
+ base_model_pp_plan = {
+ "embed_tokens": (["input_ids"], ["inputs_embeds"]),
+ "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
+ "norm": (["hidden_states"], ["hidden_states"]),
+ }
+```
+
+- `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
+- `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s:
+ * You only need to do this for layers which are not present on all pipeline stages
+ * vLLM assumes that there will be only one `nn.ModuleList`, which is distributed across the pipeline stages
+ * The `list` in the first element of the `tuple` contains the names of the input arguments
+ * The `list` in the last element of the `tuple` contains the names of the variables the layer outputs to in your modeling code
+
+## Loading a Model
+
+### Hugging Face Hub
+
+By default, vLLM loads models from [Hugging Face (HF) Hub](https://huggingface.co/models). To change the download path for models, you can set the `HF_HOME` environment variable; for more details, refer to [their official documentation](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hfhome).
+
+To determine whether a given model is natively supported, you can check the `config.json` file inside the HF repository.
+If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.
+
+Models do not _need_ to be natively supported to be used in vLLM.
+The [Transformers backend][transformers-backend] enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
+
+!!! tip
+ The easiest way to check if your model is really supported at runtime is to run the program below:
+
+ ```python
+ from vllm import LLM
+
+ # For generative models (task=generate) only
+ llm = LLM(model=..., task="generate") # Name or path of your model
+ output = llm.generate("Hello, my name is")
+ print(output)
+
+ # For pooling models (task={embed,classify,reward,score}) only
+ llm = LLM(model=..., task="embed") # Name or path of your model
+ output = llm.encode("Hello, my name is")
+ print(output)
+ ```
+
+ If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
+
+Otherwise, please refer to [Adding a New Model][new-model] for instructions on how to implement your model in vLLM.
+Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
+
+#### Download a model
+
+If you prefer, you can use the Hugging Face CLI to [download a model](https://huggingface.co/docs/huggingface_hub/guides/cli#huggingface-cli-download) or specific files from a model repository:
+
+```console
+# Download a model
+huggingface-cli download HuggingFaceH4/zephyr-7b-beta
+
+# Specify a custom cache directory
+huggingface-cli download HuggingFaceH4/zephyr-7b-beta --cache-dir ./path/to/cache
+
+# Download a specific file from a model repo
+huggingface-cli download HuggingFaceH4/zephyr-7b-beta eval_results.json
+```
+
+#### List the downloaded models
+
+Use the Hugging Face CLI to [manage models](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#scan-your-cache) stored in local cache:
+
+```console
+# List cached models
+huggingface-cli scan-cache
+
+# Show detailed (verbose) output
+huggingface-cli scan-cache -v
+
+# Specify a custom cache directory
+huggingface-cli scan-cache --dir ~/.cache/huggingface/hub
+```
+
+#### Delete a cached model
+
+Use the Hugging Face CLI to interactively [delete downloaded model](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#clean-your-cache) from the cache:
+
+```console
+# The `delete-cache` command requires extra dependencies to work with the TUI.
+# Please run `pip install huggingface_hub[cli]` to install them.
+
+# Launch the interactive TUI to select models to delete
+$ huggingface-cli delete-cache
+? Select revisions to delete: 1 revisions selected counting for 438.9M.
+ ○ None of the following (if selected, nothing will be deleted).
+Model BAAI/bge-base-en-v1.5 (438.9M, used 1 week ago)
+❯ ◉ a5beb1e3: main # modified 1 week ago
+
+Model BAAI/bge-large-en-v1.5 (1.3G, used 1 week ago)
+ ○ d4aa6901: main # modified 1 week ago
+
+Model BAAI/bge-reranker-base (1.1G, used 4 weeks ago)
+ ○ 2cfc18c9: main # modified 4 weeks ago
+
+Press <space> to select, <enter> to validate and <ctrl+c> to quit without modification.
+
+# You need to confirm the selection before deletion starts
+? Select revisions to delete: 1 revision(s) selected.
+? 1 revisions selected counting for 438.9M. Confirm deletion ? Yes
+Start deletion.
+Done. Deleted 1 repo(s) and 0 revision(s) for a total of 438.9M.
+```
+
+#### Using a proxy
+
+Here are some tips for loading/downloading models from Hugging Face using a proxy:
+
+- Set the proxy globally for your session (or set it in the profile file):
+
+```shell
+export http_proxy=http://your.proxy.server:port
+export https_proxy=http://your.proxy.server:port
+```
+
+- Set the proxy for just the current command:
+
+```shell
+https_proxy=http://your.proxy.server:port huggingface-cli download <model_name>
+
+# or use the vllm command directly
+https_proxy=http://your.proxy.server:port vllm serve <model_name> --disable-log-requests
+```
+
+- Set the proxy in the Python interpreter:
+
+```python
+import os
+
+os.environ['http_proxy'] = 'http://your.proxy.server:port'
+os.environ['https_proxy'] = 'http://your.proxy.server:port'
+```
+
+### ModelScope
+
+To use models from [ModelScope](https://www.modelscope.cn) instead of Hugging Face Hub, set an environment variable:
+
+```shell
+export VLLM_USE_MODELSCOPE=True
+```
+
+Then load the model with `trust_remote_code=True`:
+
+```python
+from vllm import LLM
+
+llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)
+
+# For generative models (task=generate) only
+output = llm.generate("Hello, my name is")
+print(output)
+
+# For pooling models (task={embed,classify,reward,score}) only
+output = llm.encode("Hello, my name is")
+print(output)
+```
+
+[](){ #feature-status-legend }
+
+## Feature Status Legend
+
+- ✅︎ indicates that the feature is supported for the model.
+
+- 🚧 indicates that the feature is planned but not yet supported for the model.
+
+- ⚠️ indicates that the feature is available but may have known issues or limitations.
+
+[](){ #supported-text-models }
+
+## List of Text-only Language Models
+
+### Generative Models
+
+See [this page][generative-models] for more information on how to use generative models.
+
+#### Text Generation
+
+Specified using `--task generate`.
+
+| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] |
+|---------------------------------------------------|-----------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------|-----------------------------|
+| `AquilaForCausalLM` | Aquila, Aquila2 | `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc. | ✅︎ | ✅︎ |
+| `ArcticForCausalLM` | Arctic | `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc. | ✅︎ | |
+| `BaiChuanForCausalLM` | Baichuan2, Baichuan | `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc. | ✅︎ | ✅︎ |
+| `BambaForCausalLM` | Bamba | `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B` | | |
+| `BloomForCausalLM` | BLOOM, BLOOMZ, BLOOMChat | `bigscience/bloom`, `bigscience/bloomz`, etc. | ✅︎ | |
+| `BartForConditionalGeneration` | BART | `facebook/bart-base`, `facebook/bart-large-cnn`, etc. | | |
+| `ChatGLMModel`, `ChatGLMForConditionalGeneration` | ChatGLM | `THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, `ShieldLM-6B-chatglm3`, etc. | ✅︎ | ✅︎ |
+| `CohereForCausalLM`, `Cohere2ForCausalLM` | Command-R | `CohereForAI/c4ai-command-r-v01`, `CohereForAI/c4ai-command-r7b-12-2024`, etc. | ✅︎ | ✅︎ |
+| `DbrxForCausalLM` | DBRX | `databricks/dbrx-base`, `databricks/dbrx-instruct`, etc. | ✅︎ | |
+| `DeciLMForCausalLM` | DeciLM | `nvidia/Llama-3_3-Nemotron-Super-49B-v1`, etc. | ✅︎ | |
+| `DeepseekForCausalLM` | DeepSeek | `deepseek-ai/deepseek-llm-67b-base`, `deepseek-ai/deepseek-llm-7b-chat` etc. | ✅︎ | |
+| `DeepseekV2ForCausalLM` | DeepSeek-V2 | `deepseek-ai/DeepSeek-V2`, `deepseek-ai/DeepSeek-V2-Chat` etc. | ✅︎ | |
+| `DeepseekV3ForCausalLM` | DeepSeek-V3 | `deepseek-ai/DeepSeek-V3-Base`, `deepseek-ai/DeepSeek-V3` etc. | ✅︎ | |
+| `ExaoneForCausalLM` | EXAONE-3 | `LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc. | ✅︎ | ✅︎ |
+| `FalconForCausalLM` | Falcon | `tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc. | ✅︎ | |
+| `FalconMambaForCausalLM` | FalconMamba | `tiiuae/falcon-mamba-7b`, `tiiuae/falcon-mamba-7b-instruct`, etc. | ✅︎ | ✅︎ |
+| `FalconH1ForCausalLM` | Falcon-H1 | `tiiuae/Falcon-H1-34B-Base`, `tiiuae/Falcon-H1-34B-Instruct`, etc. | ✅︎ | ✅︎ |
+| `GemmaForCausalLM` | Gemma | `google/gemma-2b`, `google/gemma-1.1-2b-it`, etc. | ✅︎ | ✅︎ |
+| `Gemma2ForCausalLM` | Gemma 2 | `google/gemma-2-9b`, `google/gemma-2-27b`, etc. | ✅︎ | ✅︎ |
+| `Gemma3ForCausalLM` | Gemma 3 | `google/gemma-3-1b-it`, etc. | ✅︎ | ✅︎ |
+| `GlmForCausalLM` | GLM-4 | `THUDM/glm-4-9b-chat-hf`, etc. | ✅︎ | ✅︎ |
+| `Glm4ForCausalLM` | GLM-4-0414 | `THUDM/GLM-4-32B-0414`, etc. | ✅︎ | ✅︎ |
+| `GPT2LMHeadModel` | GPT-2 | `gpt2`, `gpt2-xl`, etc. | ✅︎ | |
+| `GPTBigCodeForCausalLM` | StarCoder, SantaCoder, WizardCoder | `bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, `WizardLM/WizardCoder-15B-V1.0`, etc. | ✅︎ | ✅︎ |
+| `GPTJForCausalLM` | GPT-J | `EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc. | ✅︎ | |
+| `GPTNeoXForCausalLM` | GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM | `EleutherAI/gpt-neox-20b`, `EleutherAI/pythia-12b`, `OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc. | ✅︎ | |
+| `GraniteForCausalLM` | Granite 3.0, Granite 3.1, PowerLM | `ibm-granite/granite-3.0-2b-base`, `ibm-granite/granite-3.1-8b-instruct`, `ibm/PowerLM-3b`, etc. | ✅︎ | ✅︎ |
+| `GraniteMoeForCausalLM` | Granite 3.0 MoE, PowerMoE | `ibm-granite/granite-3.0-1b-a400m-base`, `ibm-granite/granite-3.0-3b-a800m-instruct`, `ibm/PowerMoE-3b`, etc. | ✅︎ | ✅︎ |
+| `GraniteMoeHybridForCausalLM` | Granite 4.0 MoE Hybrid | `ibm-granite/granite-4.0-tiny-preview`, etc. | ✅︎ | ✅︎ |
+| `GraniteMoeSharedForCausalLM` | Granite MoE Shared | `ibm-research/moe-7b-1b-active-shared-experts` (test model) | ✅︎ | ✅︎ |
+| `GritLM` | GritLM | `parasail-ai/GritLM-7B-vllm`. | ✅︎ | ✅︎ |
+| `Grok1ModelForCausalLM` | Grok1 | `hpcai-tech/grok-1`. | ✅︎ | ✅︎ |
+| `InternLMForCausalLM` | InternLM | `internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc. | ✅︎ | ✅︎ |
+| `InternLM2ForCausalLM` | InternLM2 | `internlm/internlm2-7b`, `internlm/internlm2-chat-7b`, etc. | ✅︎ | ✅︎ |
+| `InternLM3ForCausalLM` | InternLM3 | `internlm/internlm3-8b-instruct`, etc. | ✅︎ | ✅︎ |
+| `JAISLMHeadModel` | Jais | `inceptionai/jais-13b`, `inceptionai/jais-13b-chat`, `inceptionai/jais-30b-v3`, `inceptionai/jais-30b-chat-v3`, etc. | ✅︎ | |
+| `JambaForCausalLM` | Jamba | `ai21labs/AI21-Jamba-1.5-Large`, `ai21labs/AI21-Jamba-1.5-Mini`, `ai21labs/Jamba-v0.1`, etc. | ✅︎ | ✅︎ |
+| `LlamaForCausalLM` | Llama 3.1, Llama 3, Llama 2, LLaMA, Yi | `meta-llama/Meta-Llama-3.1-405B-Instruct`, `meta-llama/Meta-Llama-3.1-70B`, `meta-llama/Meta-Llama-3-70B-Instruct`, `meta-llama/Llama-2-70b-hf`, `01-ai/Yi-34B`, etc. | ✅︎ | ✅︎ |
+| `MambaForCausalLM` | Mamba | `state-spaces/mamba-130m-hf`, `state-spaces/mamba-790m-hf`, `state-spaces/mamba-2.8b-hf`, etc. | ✅︎ | |
+| `MiniCPMForCausalLM` | MiniCPM | `openbmb/MiniCPM-2B-sft-bf16`, `openbmb/MiniCPM-2B-dpo-bf16`, `openbmb/MiniCPM-S-1B-sft`, etc. | ✅︎ | ✅︎ |
+| `MiniCPM3ForCausalLM` | MiniCPM3 | `openbmb/MiniCPM3-4B`, etc. | ✅︎ | ✅︎ |
+| `MistralForCausalLM` | Mistral, Mistral-Instruct | `mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc. | ✅︎ | ✅︎ |
+| `MixtralForCausalLM` | Mixtral-8x7B, Mixtral-8x7B-Instruct | `mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, `mistral-community/Mixtral-8x22B-v0.1`, etc. | ✅︎ | ✅︎ |
+| `MPTForCausalLM` | MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter | `mosaicml/mpt-7b`, `mosaicml/mpt-7b-storywriter`, `mosaicml/mpt-30b`, etc. | ✅︎ | |
+| `NemotronForCausalLM` | Nemotron-3, Nemotron-4, Minitron | `nvidia/Minitron-8B-Base`, `mgoin/Nemotron-4-340B-Base-hf-FP8`, etc. | ✅︎ | ✅︎ |
+| `OLMoForCausalLM` | OLMo | `allenai/OLMo-1B-hf`, `allenai/OLMo-7B-hf`, etc. | ✅︎ | |
+| `OLMo2ForCausalLM` | OLMo2 | `allenai/OLMo-2-0425-1B`, etc. | ✅︎ | |
+| `OLMoEForCausalLM` | OLMoE | `allenai/OLMoE-1B-7B-0924`, `allenai/OLMoE-1B-7B-0924-Instruct`, etc. | ✅︎ | ✅︎ |
+| `OPTForCausalLM` | OPT, OPT-IML | `facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc. | ✅︎ | |
+| `OrionForCausalLM` | Orion | `OrionStarAI/Orion-14B-Base`, `OrionStarAI/Orion-14B-Chat`, etc. | ✅︎ | |
+| `PhiForCausalLM` | Phi | `microsoft/phi-1_5`, `microsoft/phi-2`, etc. | ✅︎ | ✅︎ |
+| `Phi3ForCausalLM` | Phi-4, Phi-3 | `microsoft/Phi-4-mini-instruct`, `microsoft/Phi-4`, `microsoft/Phi-3-mini-4k-instruct`, `microsoft/Phi-3-mini-128k-instruct`, `microsoft/Phi-3-medium-128k-instruct`, etc. | ✅︎ | ✅︎ |
+| `Phi3SmallForCausalLM` | Phi-3-Small | `microsoft/Phi-3-small-8k-instruct`, `microsoft/Phi-3-small-128k-instruct`, etc. | ✅︎ | |
+| `PhiMoEForCausalLM` | Phi-3.5-MoE | `microsoft/Phi-3.5-MoE-instruct`, etc. | ✅︎ | ✅︎ |
+| `PersimmonForCausalLM` | Persimmon | `adept/persimmon-8b-base`, `adept/persimmon-8b-chat`, etc. | ✅︎ | |
+| `Plamo2ForCausalLM` | PLaMo2 | `pfnet/plamo-2-1b`, `pfnet/plamo-2-8b`, etc. | | |
+| `QWenLMHeadModel` | Qwen | `Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc. | ✅︎ | ✅︎ |
+| `Qwen2ForCausalLM` | QwQ, Qwen2 | `Qwen/QwQ-32B-Preview`, `Qwen/Qwen2-7B-Instruct`, `Qwen/Qwen2-7B`, etc. | ✅︎ | ✅︎ |
+| `Qwen2MoeForCausalLM` | Qwen2MoE | `Qwen/Qwen1.5-MoE-A2.7B`, `Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc. | ✅︎ | |
+| `Qwen3ForCausalLM` | Qwen3 | `Qwen/Qwen3-8B`, etc. | ✅︎ | ✅︎ |
+| `Qwen3MoeForCausalLM` | Qwen3MoE | `Qwen/Qwen3-30B-A3B`, etc. | ✅︎ | |
+| `StableLmForCausalLM` | StableLM | `stabilityai/stablelm-3b-4e1t`, `stabilityai/stablelm-base-alpha-7b-v2`, etc. | ✅︎ | |
+| `Starcoder2ForCausalLM` | Starcoder2 | `bigcode/starcoder2-3b`, `bigcode/starcoder2-7b`, `bigcode/starcoder2-15b`, etc. | ✅︎ | |
+| `SolarForCausalLM` | Solar Pro | `upstage/solar-pro-preview-instruct`, etc. | ✅︎ | ✅︎ |
+| `TeleChat2ForCausalLM` | TeleChat2 | `Tele-AI/TeleChat2-3B`, `Tele-AI/TeleChat2-7B`, `Tele-AI/TeleChat2-35B`, etc. | ✅︎ | ✅︎ |
+| `TeleFLMForCausalLM` | TeleFLM | `CofeAI/FLM-2-52B-Instruct-2407`, `CofeAI/Tele-FLM`, etc. | ✅︎ | ✅︎ |
+| `XverseForCausalLM` | XVERSE | `xverse/XVERSE-7B-Chat`, `xverse/XVERSE-13B-Chat`, `xverse/XVERSE-65B-Chat`, etc. | ✅︎ | ✅︎ |
+| `MiniMaxText01ForCausalLM` | MiniMax-Text | `MiniMaxAI/MiniMax-Text-01`, etc. | ✅︎ | |
+| `Zamba2ForCausalLM` | Zamba2 | `Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc. | | |
+
+!!! note
+ Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
+
+### Pooling Models
+
+See [this page][pooling-models] for more information on how to use pooling models.
+
+!!! warning
+ Since some model architectures support both generative and pooling tasks,
+ you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
+
+#### Text Embedding
+
+Specified using `--task embed`.
+
+| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] |
+|--------------------------------------------------------|---------------------|---------------------------------------------------------------------------------------------------------------------|------------------------|-----------------------------|
+| `BertModel` | BERT-based | `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc. | | |
+| `Gemma2Model` | Gemma 2-based | `BAAI/bge-multilingual-gemma2`, etc. | ✅︎ | |
+| `GritLM` | GritLM | `parasail-ai/GritLM-7B-vllm`. | ✅︎ | ✅︎ |
+| `GteModel`                                              | Arctic-Embed-2.0-M   | `Snowflake/snowflake-arctic-embed-m-v2.0`.                                                                            |                        |                             |
+| `GteNewModel`                                           | mGTE-TRM (see note)  | `Alibaba-NLP/gte-multilingual-base`, etc.                                                                             |                        |                             |
+| `ModernBertModel`                                       | ModernBERT-based     | `Alibaba-NLP/gte-modernbert-base`, etc.                                                                               |                        |                             |
+| `NomicBertModel`                                        | Nomic BERT           | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc.   |                        |                             |
+| `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ |
+| `Qwen2Model`, `Qwen2ForCausalLM` | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ |
+| `RobertaModel`, `RobertaForMaskedLM` | RoBERTa-based | `sentence-transformers/all-roberta-large-v1`, etc. | | |
+
+!!! note
+ `ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
+ You should manually set mean pooling by passing `--override-pooler-config '{"pooling_type": "MEAN"}'`.
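+
+    For example, an illustrative serving command (adjust the model to your setup):
+
+    ```bash
+    vllm serve ssmits/Qwen2-7B-Instruct-embed-base --task embed \
+        --override-pooler-config '{"pooling_type": "MEAN"}'
+    ```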
+
+!!! note
+ The HF implementation of `Alibaba-NLP/gte-Qwen2-1.5B-instruct` is hardcoded to use causal attention despite what is shown in `config.json`. To compare vLLM vs HF results,
+ you should set `--hf-overrides '{"is_causal": true}'` in vLLM so that the two implementations are consistent with each other.
+
+ For both the 1.5B and 7B variants, you also need to enable `--trust-remote-code` for the correct tokenizer to be loaded.
+ See [relevant issue on HF Transformers](https://github.com/huggingface/transformers/issues/34882).
+
+!!! note
+    `jinaai/jina-embeddings-v3` supports multiple tasks through LoRA adapters, while vLLM currently supports only the text-matching task by merging the LoRA weights.
+
+!!! note
+    The second-generation GTE model (mGTE-TRM) is named `NewModel`. This name is too generic, so you should set `--hf-overrides '{"architectures": ["GteNewModel"]}'` to specify the use of the `GteNewModel` architecture.
+
+If your model is not in the above list, we will try to automatically convert the model using
+[as_embedding_model][vllm.model_executor.models.adapters.as_embedding_model]. By default, the embeddings
+of the whole prompt are extracted from the normalized hidden state corresponding to the last token.
+
+#### Reward Modeling
+
+Specified using `--task reward`.
+
+| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] |
+|---------------------------|-----------------|------------------------------------------------------------------------|------------------------|-----------------------------|
+| `InternLM2ForRewardModel` | InternLM2-based | `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc. | ✅︎ | ✅︎ |
+| `LlamaForCausalLM` | Llama-based | `peiyi9979/math-shepherd-mistral-7b-prm`, etc. | ✅︎ | ✅︎ |
+| `Qwen2ForRewardModel` | Qwen2-based | `Qwen/Qwen2.5-Math-RM-72B`, etc. | ✅︎ | ✅︎ |
+
+If your model is not in the above list, we will try to automatically convert the model using
+[as_reward_model][vllm.model_executor.models.adapters.as_reward_model]. By default, we return the hidden states of each token directly.
+
+!!! warning
+ For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
+ e.g.: `--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
+
+#### Classification
+
+Specified using `--task classify`.
+
+| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] |
+|----------------------------------|----------|----------------------------------------|------------------------|-----------------------------|
+| `JambaForSequenceClassification` | Jamba | `ai21labs/Jamba-tiny-reward-dev`, etc. | ✅︎ | ✅︎ |
+
+If your model is not in the above list, we will try to automatically convert the model using
+[as_classification_model][vllm.model_executor.models.adapters.as_classification_model]. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
+
+#### Sentence Pair Scoring
+
+Specified using `--task score`.
+
+| Architecture | Models | Example HF Models |
+|---------------------------------------|-------------------|----------------------------------------------|
+| `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. |
+| `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. |
+| `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. |
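+
+As a minimal offline sketch of using one of these cross-encoders (this assumes the `LLM.score` API for `--task score` models; the texts are arbitrary examples):
+
+```python
+from vllm import LLM
+
+# Score how relevant a document is to a query with a cross-encoder reranker.
+llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
+(output,) = llm.score("What is the capital of France?",
+                      "Paris is the capital of France.")
+print(output.outputs.score)
+```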
+
+[](){ #supported-mm-models }
+
+## List of Multimodal Language Models
+
+The following modalities are supported depending on the model:
+
+- **T**ext
+- **I**mage
+- **V**ideo
+- **A**udio
+
+Any combination of modalities joined by `+` is supported.
+
+- e.g.: `T + I` means that the model supports text-only, image-only, and text-with-image inputs.
+
+On the other hand, modalities separated by `/` are mutually exclusive.
+
+- e.g.: `T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
+
+See [this page][multimodal-inputs] for more information on how to pass multi-modal inputs to the model.
+
+!!! warning
+ **To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference)
+ or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
+
+ Offline inference:
+
+ ```python
+ from vllm import LLM
+
+ llm = LLM(
+ model="Qwen/Qwen2-VL-7B-Instruct",
+ limit_mm_per_prompt={"image": 4},
+ )
+ ```
+
+ Online serving:
+
+ ```bash
+ vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt '{"image":4}'
+ ```
+
+ **This is no longer required if you are using vLLM V1.**
+
+!!! note
+ vLLM currently only supports adding LoRA to the language backbone of multimodal models.
+
+### Generative Models
+
+See [this page][generative-models] for more information on how to use generative models.
+
+#### Text Generation
+
+Specified using `--task generate`.
+
+| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] | [V1](gh-issue:8779) |
+|----------------------------------------------|--------------------------------------------------------------------------|-----------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------|-----------------------------|-----------------------|
+| `AriaForConditionalGeneration` | Aria | T + I+ | `rhymes-ai/Aria` | ✅︎ | ✅︎ | |
+| `AyaVisionForConditionalGeneration` | Aya Vision | T + I+ | `CohereForAI/aya-vision-8b`, `CohereForAI/aya-vision-32b`, etc. | ✅︎ | ✅︎ | |
+| `Blip2ForConditionalGeneration` | BLIP-2 | T + IE | `Salesforce/blip2-opt-2.7b`, `Salesforce/blip2-opt-6.7b`, etc. | ✅︎ | ✅︎ | |
+| `ChameleonForConditionalGeneration` | Chameleon | T + I | `facebook/chameleon-7b` etc. | ✅︎ | ✅︎ | |
+| `DeepseekVLV2ForCausalLM`^ | DeepSeek-VL2 | T + I+ | `deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2` etc. | ✅︎ | ✅︎ | |
+| `Florence2ForConditionalGeneration` | Florence-2 | T + I | `microsoft/Florence-2-base`, `microsoft/Florence-2-large` etc. | | | |
+| `FuyuForCausalLM` | Fuyu | T + I | `adept/fuyu-8b` etc. | ✅︎ | ✅︎ | |
+| `Gemma3ForConditionalGeneration` | Gemma 3 | T + I+ | `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc. | ✅︎ | ✅︎ | ⚠️ |
+| `GLM4VForCausalLM`^ | GLM-4V | T + I | `THUDM/glm-4v-9b`, `THUDM/cogagent-9b-20241220` etc. | ✅︎ | ✅︎ | ✅︎ |
+| `GraniteSpeechForConditionalGeneration` | Granite Speech | T + A | `ibm-granite/granite-speech-3.3-8b` | ✅︎ | ✅︎ | ✅︎ |
+| `H2OVLChatModel` | H2OVL | T + IE+ | `h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc. | ✅︎ | ✅︎\* | |
+| `Idefics3ForConditionalGeneration` | Idefics3 | T + I | `HuggingFaceM4/Idefics3-8B-Llama3` etc. | ✅︎ | ✅︎ | |
+| `InternVLChatModel` | InternVL 3.0, InternVideo 2.5, InternVL 2.5, Mono-InternVL, InternVL 2.0 | T + IE+ | `OpenGVLab/InternVL3-9B`, `OpenGVLab/InternVideo2_5_Chat_8B`, `OpenGVLab/InternVL2_5-4B`, `OpenGVLab/Mono-InternVL-2B`, `OpenGVLab/InternVL2-4B`, etc. | ✅︎ | ✅︎ | |
+| `KimiVLForConditionalGeneration` | Kimi-VL-A3B-Instruct, Kimi-VL-A3B-Thinking | T + I+ | `moonshotai/Kimi-VL-A3B-Instruct`, `moonshotai/Kimi-VL-A3B-Thinking` | ✅︎ | | |
+| `Llama4ForConditionalGeneration` | Llama 4 | T + I+ | `meta-llama/Llama-4-Scout-17B-16E-Instruct`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, etc. | ✅︎ | ✅︎ | |
+| `LlavaForConditionalGeneration` | LLaVA-1.5 | T + IE+ | `llava-hf/llava-1.5-7b-hf`, `TIGER-Lab/Mantis-8B-siglip-llama3` (see note), etc. | ✅︎ | ✅︎ | |
+| `LlavaNextForConditionalGeneration` | LLaVA-NeXT | T + IE+ | `llava-hf/llava-v1.6-mistral-7b-hf`, `llava-hf/llava-v1.6-vicuna-7b-hf`, etc. | ✅︎ | ✅︎ | |
+| `LlavaNextVideoForConditionalGeneration` | LLaVA-NeXT-Video | T + V | `llava-hf/LLaVA-NeXT-Video-7B-hf`, etc. | ✅︎ | ✅︎ | |
+| `LlavaOnevisionForConditionalGeneration` | LLaVA-Onevision | T + I+ + V+ | `llava-hf/llava-onevision-qwen2-7b-ov-hf`, `llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc. | ✅︎ | ✅︎ | |
+| `MiniCPMO` | MiniCPM-O | T + IE+ + VE+ + AE+ | `openbmb/MiniCPM-o-2_6`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `MiniCPMV` | MiniCPM-V | T + IE+ + VE+ | `openbmb/MiniCPM-V-2` (see note), `openbmb/MiniCPM-Llama3-V-2_5`, `openbmb/MiniCPM-V-2_6`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `MiniMaxVL01ForConditionalGeneration` | MiniMax-VL | T + IE+ | `MiniMaxAI/MiniMax-VL-01`, etc. | ✅︎ | ✅︎ | |
+| `Mistral3ForConditionalGeneration` | Mistral3 | T + I+ | `mistralai/Mistral-Small-3.1-24B-Instruct-2503`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `MllamaForConditionalGeneration` | Llama 3.2 | T + I+ | `meta-llama/Llama-3.2-90B-Vision-Instruct`, `meta-llama/Llama-3.2-11B-Vision`, etc. | | | |
+| `MolmoForCausalLM` | Molmo | T + I+ | `allenai/Molmo-7B-D-0924`, `allenai/Molmo-7B-O-0924`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `NVLM_D_Model` | NVLM-D 1.0 | T + I+ | `nvidia/NVLM-D-72B`, etc. | ✅︎ | ✅︎ | |
+| `Ovis` | Ovis2, Ovis1.6 | T + I+ | `AIDC-AI/Ovis2-1B`, `AIDC-AI/Ovis1.6-Llama3.2-3B`, etc. | ✅︎ | | |
+| `PaliGemmaForConditionalGeneration` | PaliGemma, PaliGemma 2 | T + IE | `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc. | ✅︎ | ⚠️ | |
+| `Phi3VForCausalLM` | Phi-3-Vision, Phi-3.5-Vision | T + IE+ | `microsoft/Phi-3-vision-128k-instruct`, `microsoft/Phi-3.5-vision-instruct`, etc. | ✅︎ | ✅︎ | |
+| `Phi4MMForCausalLM` | Phi-4-multimodal | T + I+ / T + A+ / I+ + A+ | `microsoft/Phi-4-multimodal-instruct`, etc. | ✅︎ | ✅︎ | |
+| `PixtralForConditionalGeneration` | Pixtral | T + I+ | `mistralai/Mistral-Small-3.1-24B-Instruct-2503`, `mistral-community/pixtral-12b`, etc. | ✅︎ | ✅︎ | |
+| `QwenVLForConditionalGeneration`^ | Qwen-VL | T + IE+ | `Qwen/Qwen-VL`, `Qwen/Qwen-VL-Chat`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `Qwen2AudioForConditionalGeneration` | Qwen2-Audio | T + A+ | `Qwen/Qwen2-Audio-7B-Instruct` | ✅︎ | ✅︎ | |
+| `Qwen2VLForConditionalGeneration` | QVQ, Qwen2-VL | T + IE+ + VE+ | `Qwen/QVQ-72B-Preview`, `Qwen/Qwen2-VL-7B-Instruct`, `Qwen/Qwen2-VL-72B-Instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `Qwen2_5_VLForConditionalGeneration` | Qwen2.5-VL | T + IE+ + VE+ | `Qwen/Qwen2.5-VL-3B-Instruct`, `Qwen/Qwen2.5-VL-72B-Instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
+| `Qwen2_5OmniThinkerForConditionalGeneration` | Qwen2.5-Omni | T + IE+ + VE+ + A+ | `Qwen/Qwen2.5-Omni-7B` | ✅︎ | ✅︎\* | |
+| `SkyworkR1VChatModel` | Skywork-R1V-38B | T + I | `Skywork/Skywork-R1V-38B` | ✅︎ | ✅︎ | |
+| `SmolVLMForConditionalGeneration`            | SmolVLM2                                                                 | T + I                                                                   | `HuggingFaceTB/SmolVLM2-2.2B-Instruct`                                                                                                                    | ✅︎                     | ✅︎                          |                       |
+
+^ You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
+ • For example, to use DeepSeek-VL2 series models:
+ `--hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'`
+E Pre-computed embeddings can be passed as input for this modality.
++ Multiple items can be passed per text prompt for this modality.
+
+!!! warning
+ Both V0 and V1 support `Gemma3ForConditionalGeneration` for text-only inputs.
+ However, there are differences in how they handle text + image inputs:
+
+ V0 correctly implements the model's attention pattern:
+ - Uses bidirectional attention between the image tokens corresponding to the same image
+ - Uses causal attention for other tokens
+ - Implemented via (naive) PyTorch SDPA with masking tensors
+    - Note: May use significant memory for long prompts with images
+
+ V1 currently uses a simplified attention pattern:
+ - Uses causal attention for all tokens, including image tokens
+    - Generates reasonable outputs but does not match the original model's attention for text + image inputs, especially when `{"do_pan_and_scan": true}` is set
+ - Will be updated in the future to support the correct behavior
+
+ This limitation exists because the model's mixed attention pattern (bidirectional for images, causal otherwise) is not yet supported by vLLM's attention backends.
+
+!!! note
+ `h2oai/h2ovl-mississippi-2b` will be available in V1 once we support head size 80.
+
+!!! note
+ To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
+
+!!! warning
+    The output quality of `allenai/Molmo-7B-D-0924` (especially in object localization tasks) has deteriorated in recent updates.
+
+ For the best results, we recommend using the following dependency versions (tested on A10 and L40):
+
+ ```text
+ # Core vLLM-compatible dependencies with Molmo accuracy setup (tested on L40)
+ torch==2.5.1
+ torchvision==0.20.1
+ transformers==4.48.1
+ tokenizers==0.21.0
+ tiktoken==0.7.0
+ vllm==0.7.0
+
+ # Optional but recommended for improved performance and stability
+ triton==3.1.0
+ xformers==0.0.28.post3
+ uvloop==0.21.0
+ protobuf==5.29.3
+ openai==1.60.2
+ opencv-python-headless==4.11.0.86
+ pillow==10.4.0
+
+ # Installed FlashAttention (for float16 only)
+ flash-attn>=2.5.6 # Not used in float32, but should be documented
+ ```
+
+ **Note:** Make sure you understand the security implications of using outdated packages.
+
+!!! note
+ The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
+ For more details, please see:
+
+!!! warning
+ Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
+
+!!! note
+ To use Qwen2.5-Omni, you have to install Hugging Face Transformers library from source via
+ `pip install git+https://github.com/huggingface/transformers.git`.
+
+    Reading audio from a video during pre-processing is currently supported on V0 (but not V1), because overlapping modalities are not yet supported in V1.
+    To enable it, pass `--mm-processor-kwargs '{"use_audio_in_video": true}'`, as in the sketch below.
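+
+    For example (illustrative; audio-in-video input currently requires V0):
+
+    ```bash
+    vllm serve Qwen/Qwen2.5-Omni-7B \
+        --mm-processor-kwargs '{"use_audio_in_video": true}'
+    ```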
+
+### Pooling Models
+
+See [this page][pooling-models] for more information on how to use pooling models.
+
+!!! warning
+ Since some model architectures support both generative and pooling tasks,
+ you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
+
+#### Text Embedding
+
+Specified using `--task embed`.
+
+Any text generation model can be converted into an embedding model by passing `--task embed`.
+
+!!! note
+ To get the best results, you should use pooling models that are specifically trained as such.
+
+The following table lists those that are tested in vLLM.
+
+| Architecture | Models | Inputs | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] |
+|-------------------------------------|--------------------|----------|--------------------------|------------------------|-----------------------------|
+| `LlavaNextForConditionalGeneration` | LLaVA-NeXT-based | T / I | `royokong/e5-v` | ✅︎ | |
+| `Phi3VForCausalLM` | Phi-3-Vision-based | T + I | `TIGER-Lab/VLM2Vec-Full` | 🚧 | ✅︎ |
+
+#### Transcription
+
+Specified using `--task transcription`.
+
+Speech2Text models trained specifically for Automatic Speech Recognition.
+
+| Architecture | Models | Example HF Models | [LoRA][lora-adapter] | [PP][distributed-serving] |
+|----------------|----------|---------------------|------------------------|-----------------------------|
+
+---
+
+## Model Support Policy
+
+At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:
+
+1. **Community-Driven Support**: We encourage community contributions for adding new models. When a user requests support for a new model, we welcome pull requests (PRs) from the community. These contributions are evaluated primarily on the sensibility of the output they generate, rather than strict consistency with existing implementations such as those in transformers. **Call for contribution:** PRs coming directly from model vendors are greatly appreciated!
+
+2. **Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.
+
+ !!! tip
+ When comparing the output of `model.generate` from Hugging Face Transformers with the output of `llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs.
+
+3. **Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.
+
+4. **Monitoring and Updates**: Users interested in specific models should monitor the commit history for those models (e.g., by tracking changes in the `vllm/model_executor/models` directory on the main branch). This proactive approach helps users stay informed about updates and changes that may affect the models they use.
+
+5. **Selective Focus**: Our resources are primarily directed towards models with significant user interest and impact. Models that are less frequently used may receive less attention, and we rely on the community to play a more active role in their upkeep and improvement.
+
+Through this approach, vLLM fosters a collaborative environment where both the core development team and the broader community contribute to the robustness and diversity of the third-party models supported in our ecosystem.
+
+Note that, as an inference engine, vLLM does not introduce new models. Therefore, all models supported by vLLM are third-party models in this regard.
+
+We have the following levels of testing for models:
+
+1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test.
+2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
+3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:examples) for the models that have passed this test.
+4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.
diff --git a/docs/source/performance/benchmarks.md b/docs/performance/benchmarks.md
similarity index 86%
rename from docs/source/performance/benchmarks.md
rename to docs/performance/benchmarks.md
index 39dc470a1c..00505fc6f2 100644
--- a/docs/source/performance/benchmarks.md
+++ b/docs/performance/benchmarks.md
@@ -1,13 +1,14 @@
-(benchmarks)=
-
-# Benchmark Suites
+---
+title: Benchmark Suites
+---
+[](){ #benchmarks }
vLLM contains two sets of benchmarks:
-- [Performance benchmarks](#performance-benchmarks)
-- [Nightly benchmarks](#nightly-benchmarks)
+- [Performance benchmarks][performance-benchmarks]
+- [Nightly benchmarks][nightly-benchmarks]
-(performance-benchmarks)=
+[](){ #performance-benchmarks }
## Performance Benchmarks
@@ -17,7 +18,7 @@ The latest performance results are hosted on the public [vLLM Performance Dashbo
More information on the performance benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
-(nightly-benchmarks)=
+[](){ #nightly-benchmarks }
## Nightly Benchmarks
diff --git a/docs/source/performance/optimization.md b/docs/performance/optimization.md
similarity index 98%
rename from docs/source/performance/optimization.md
rename to docs/performance/optimization.md
index 4160f07849..57e01a384b 100644
--- a/docs/source/performance/optimization.md
+++ b/docs/performance/optimization.md
@@ -1,6 +1,7 @@
-(optimization-and-tuning)=
-
-# Optimization and Tuning
+---
+title: Optimization and Tuning
+---
+[](){ #optimization-and-tuning }
This guide covers optimization strategies and performance tuning for vLLM V1.
@@ -26,7 +27,7 @@ You can monitor the number of preemption requests through Prometheus metrics exp
In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.
-(chunked-prefill)=
+[](){ #chunked-prefill }
## Chunked Prefill
diff --git a/docs/source/serving/distributed_serving.md b/docs/serving/distributed_serving.md
similarity index 73%
rename from docs/source/serving/distributed_serving.md
rename to docs/serving/distributed_serving.md
index c285ef3e8e..259af5cabc 100644
--- a/docs/source/serving/distributed_serving.md
+++ b/docs/serving/distributed_serving.md
@@ -1,6 +1,7 @@
-(distributed-serving)=
-
-# Distributed Inference and Serving
+---
+title: Distributed Inference and Serving
+---
+[](){ #distributed-serving }
## How to decide the distributed inference strategy?
@@ -14,9 +15,8 @@ In short, you should increase the number of GPUs and the number of nodes until y
After adding enough GPUs and nodes to hold the model, you can run vLLM first, which will print some logs like `# GPU blocks: 790`. Multiply the number by `16` (the block size), and you can get roughly the maximum number of tokens that can be served on the current configuration. If this number is not satisfying, e.g. you want higher throughput, you can further increase the number of GPUs or nodes, until the number of blocks is enough.
-:::{note}
-There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
-:::
+!!! note
+ There is one edge case: if the model fits in a single node with multiple GPUs, but the number of GPUs cannot divide the model size evenly, you can use pipeline parallelism, which splits the model along layers and supports uneven splits. In this case, the tensor parallel size should be 1 and the pipeline parallel size should be the number of GPUs.
## Running vLLM on a single node
@@ -77,13 +77,11 @@ bash run_cluster.sh \
Then you get a ray cluster of **containers**. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which is accessible by all the worker nodes. The IP addresses of each worker node should be specified in the `VLLM_HOST_IP` environment variable, and should be different for each worker node. Please check the network configuration of your cluster to make sure the nodes can communicate with each other through the specified IP addresses.
-:::{warning}
-It is considered best practice to set `VLLM_HOST_IP` to an address on a private network segment for the vLLM cluster. The traffic sent here is not encrypted. The endpoints are also exchanging data in a format that could be exploited to execute arbitrary code should a malicious party gain access to the network. Please ensure that this network is not reachable by any untrusted parties.
-:::
+!!! warning
+ It is considered best practice to set `VLLM_HOST_IP` to an address on a private network segment for the vLLM cluster. The traffic sent here is not encrypted. The endpoints are also exchanging data in a format that could be exploited to execute arbitrary code should a malicious party gain access to the network. Please ensure that this network is not reachable by any untrusted parties.
-:::{warning}
-Since this is a ray cluster of **containers**, all the following commands should be executed in the **containers**, otherwise you are executing the commands on the host machine, which is not connected to the ray cluster. To enter the container, you can use `docker exec -it node /bin/bash`.
-:::
+!!! warning
+ Since this is a ray cluster of **containers**, all the following commands should be executed in the **containers**, otherwise you are executing the commands on the host machine, which is not connected to the ray cluster. To enter the container, you can use `docker exec -it node /bin/bash`.
Then, on any node, use `docker exec -it node /bin/bash` to enter the container, execute `ray status` and `ray list nodes` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
@@ -104,16 +102,13 @@ vllm serve /path/to/the/model/in/the/container \
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.
-:::{warning}
-After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](#troubleshooting-incorrect-hardware-driver) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See for more information.
-:::
+!!! warning
+    After you start the Ray cluster, you should also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script][troubleshooting-incorrect-hardware-driver] for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See for more information.
-:::{warning}
-Please make sure you downloaded the model to all the nodes (with the same path), or the model is downloaded to some distributed file system that is accessible by all nodes.
+!!! warning
+ Please make sure you downloaded the model to all the nodes (with the same path), or the model is downloaded to some distributed file system that is accessible by all nodes.
-When you use huggingface repo id to refer to the model, you should append your huggingface token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to the model.
-:::
+    When you use a Hugging Face repo ID to refer to the model, you should append your Hugging Face token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to the model.
-:::{warning}
-If you keep receiving the error message `Error: No available node types can fulfill resource request` but you have enough GPUs in the cluster, chances are your nodes have multiple IP addresses and vLLM cannot find the right one, especially when you are using multi-node inference. Please make sure vLLM and ray use the same IP address. You can set the `VLLM_HOST_IP` environment variable to the right IP address in the `run_cluster.sh` script (different for each node!), and check `ray status` and `ray list nodes` to see the IP address used by Ray. See for more information.
-:::
+!!! warning
+ If you keep receiving the error message `Error: No available node types can fulfill resource request` but you have enough GPUs in the cluster, chances are your nodes have multiple IP addresses and vLLM cannot find the right one, especially when you are using multi-node inference. Please make sure vLLM and ray use the same IP address. You can set the `VLLM_HOST_IP` environment variable to the right IP address in the `run_cluster.sh` script (different for each node!), and check `ray status` and `ray list nodes` to see the IP address used by Ray. See for more information.
diff --git a/docs/serving/engine_args.md b/docs/serving/engine_args.md
new file mode 100644
index 0000000000..fb2689a563
--- /dev/null
+++ b/docs/serving/engine_args.md
@@ -0,0 +1,18 @@
+---
+title: Engine Arguments
+---
+[](){ #engine-args }
+
+Engine arguments control the behavior of the vLLM engine.
+
+- For [offline inference][offline-inference], they are part of the arguments to [LLM][vllm.LLM] class.
+- For [online serving][openai-compatible-server], they are part of the arguments to `vllm serve`.
+
+You can look at [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs] to see the available engine arguments.
+
+However, these classes are a combination of the configuration classes defined in [vllm.config][]. Therefore, we recommend reading about them there, where they are best documented.
+
+For offline inference you have direct access to these configuration classes; for online serving, you can cross-reference the configs with `vllm serve --help`, which groups its arguments by config.
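+
+As a short sketch, the same engine arguments can be set either way (the values below are illustrative, not recommendations):
+
+```python
+from vllm import LLM
+
+# Offline: engine arguments are keyword arguments to the LLM constructor.
+llm = LLM(
+    model="facebook/opt-125m",
+    max_model_len=2048,          # maximum context length
+    gpu_memory_utilization=0.8,  # fraction of GPU memory vLLM may use
+)
+```
+
+For online serving, the equivalent command would be `vllm serve facebook/opt-125m --max-model-len 2048 --gpu-memory-utilization 0.8`.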
+
+!!! note
+    Additional arguments are available to the [AsyncLLMEngine][vllm.engine.async_llm_engine.AsyncLLMEngine], which is used for online serving. These can be found by running `vllm serve --help`.
diff --git a/docs/serving/env_vars.md b/docs/serving/env_vars.md
new file mode 100644
index 0000000000..f6d548a19d
--- /dev/null
+++ b/docs/serving/env_vars.md
@@ -0,0 +1,12 @@
+# Environment Variables
+
+vLLM uses the following environment variables to configure the system:
+
+!!! warning
+    Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and IP for vLLM's **internal usage**. They are not the port and IP of the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
+
+ All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
+
+```python
+--8<-- "vllm/envs.py:env-vars-definition"
+```
diff --git a/docs/source/serving/integrations/langchain.md b/docs/serving/integrations/langchain.md
similarity index 93%
rename from docs/source/serving/integrations/langchain.md
rename to docs/serving/integrations/langchain.md
index 03142d23b1..14ea6a0443 100644
--- a/docs/source/serving/integrations/langchain.md
+++ b/docs/serving/integrations/langchain.md
@@ -1,6 +1,7 @@
-(serving-langchain)=
-
-# LangChain
+---
+title: LangChain
+---
+[](){ #serving-langchain }
vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain) .
diff --git a/docs/source/serving/integrations/llamaindex.md b/docs/serving/integrations/llamaindex.md
similarity index 91%
rename from docs/source/serving/integrations/llamaindex.md
rename to docs/serving/integrations/llamaindex.md
index 8c72605202..251b7155c5 100644
--- a/docs/source/serving/integrations/llamaindex.md
+++ b/docs/serving/integrations/llamaindex.md
@@ -1,6 +1,7 @@
-(serving-llamaindex)=
-
-# LlamaIndex
+---
+title: LlamaIndex
+---
+[](){ #serving-llamaindex }
vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index) .
diff --git a/docs/source/serving/metrics.md b/docs/serving/metrics.md
similarity index 90%
rename from docs/source/serving/metrics.md
rename to docs/serving/metrics.md
index 647ece3f85..9ad7253184 100644
--- a/docs/source/serving/metrics.md
+++ b/docs/serving/metrics.md
@@ -4,7 +4,7 @@ vLLM exposes a number of metrics that can be used to monitor the health of the
system. These metrics are exposed via the `/metrics` endpoint on the vLLM
OpenAI compatible API server.
-You can start the server using Python, or using [Docker](#deployment-docker):
+You can start the server using Python, or using [Docker][deployment-docker]:
```console
vllm serve unsloth/Llama-3.2-1B-Instruct
@@ -31,11 +31,9 @@ vllm:iteration_tokens_total_bucket{le="512.0",model_name="unsloth/Llama-3.2-1B-I
The following metrics are exposed:
-:::{literalinclude} ../../../vllm/engine/metrics.py
-:end-before: end-metrics-definitions
-:language: python
-:start-after: begin-metrics-definitions
-:::
+```python
+--8<-- "vllm/engine/metrics.py:metrics-definitions"
+```
The following metrics are deprecated and due to be removed in a future version:
diff --git a/docs/source/serving/offline_inference.md b/docs/serving/offline_inference.md
similarity index 76%
rename from docs/source/serving/offline_inference.md
rename to docs/serving/offline_inference.md
index 433d2e894d..584d7cd143 100644
--- a/docs/source/serving/offline_inference.md
+++ b/docs/serving/offline_inference.md
@@ -1,10 +1,11 @@
-(offline-inference)=
-
-# Offline Inference
+---
+title: Offline Inference
+---
+[](){ #offline-inference }
You can run vLLM in your own code on a list of prompts.
-The offline API is based on the {class}`~vllm.LLM` class.
+The offline API is based on the [LLM][vllm.LLM] class.
To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
For example, the following code downloads the [`facebook/opt-125m`](https://huggingface.co/facebook/opt-125m) model from HuggingFace
@@ -19,23 +20,22 @@ llm = LLM(model="facebook/opt-125m")
After initializing the `LLM` instance, you can perform model inference using various APIs.
The available APIs depend on the type of model that is being run:
-- [Generative models](#generative-models) output logprobs which are sampled from to obtain the final output text.
-- [Pooling models](#pooling-models) output their hidden states directly.
+- [Generative models][generative-models] output logprobs which are sampled from to obtain the final output text.
+- [Pooling models][pooling-models] output their hidden states directly.
Please refer to the above pages for more details about each API.
-:::{seealso}
-[API Reference](#offline-inference-api)
-:::
+!!! info
+ [API Reference][offline-inference-api]
-(configuration-options)=
+[](){ #configuration-options }
## Configuration Options
This section lists the most common options for running the vLLM engine.
-For a full list, refer to the page.
+For a full list, refer to the [configuration][configuration] page.
-(model-resolution)=
+[](){ #model-resolution }
### Model resolution
@@ -59,9 +59,9 @@ model = LLM(
)
```
-Our [list of supported models](#supported-models) shows the model architectures that are recognized by vLLM.
+Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM.
-(reducing-memory-usage)=
+[](){ #reducing-memory-usage }
### Reducing memory usage
@@ -80,18 +80,16 @@ llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
tensor_parallel_size=2)
```
-:::{important}
-To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. {func}`torch.cuda.set_device`)
-before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
+!!! warning
+ To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
+ before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
-To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
-:::
+ To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
-:::{note}
-With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
+!!! note
+ With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
-You can convert the model checkpoint to a sharded checkpoint using . The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
-:::
+ You can convert the model checkpoint to a sharded checkpoint using . The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
#### Quantization
@@ -100,7 +98,7 @@ Quantized models take less memory at the cost of lower precision.
Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
and used directly without extra configuration.
-Dynamic quantization is also supported via the `quantization` option -- see [here](#quantization-index) for more details.
+Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details.
#### Context length and batch size
@@ -119,9 +117,8 @@ llm = LLM(model="adept/fuyu-8b",
By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU.
-:::{important}
-CUDA graph capture takes up more memory in V1 than in V0.
-:::
+!!! warning
+ CUDA graph capture takes up more memory in V1 than in V0.
You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
@@ -214,4 +211,4 @@ llm = LLM(model="OpenGVLab/InternVL2-2B",
### Performance optimization and tuning
You can potentially improve the performance of vLLM by finetuning various options.
-Please refer to [this guide](#optimization-and-tuning) for more details.
+Please refer to [this guide][optimization-and-tuning] for more details.
diff --git a/docs/source/serving/openai_compatible_server.md b/docs/serving/openai_compatible_server.md
similarity index 61%
rename from docs/source/serving/openai_compatible_server.md
rename to docs/serving/openai_compatible_server.md
index 61f7e98bf1..27cb9310c5 100644
--- a/docs/source/serving/openai_compatible_server.md
+++ b/docs/serving/openai_compatible_server.md
@@ -1,10 +1,11 @@
-(openai-compatible-server)=
-
-# OpenAI-Compatible Server
+---
+title: OpenAI-Compatible Server
+---
+[](){ #openai-compatible-server }
vLLM provides an HTTP server that implements OpenAI's [Completions API](https://platform.openai.com/docs/api-reference/completions), [Chat API](https://platform.openai.com/docs/api-reference/chat), and more! This functionality lets you serve models and interact with them using an HTTP client.
-In your terminal, you can [install](../getting_started/installation.md) vLLM, then start the server with the [`vllm serve`](#serve-args) command. (You can also use our [Docker](#deployment-docker) image.)
+In your terminal, you can [install](../getting_started/installation.md) vLLM, then start the server with the [`vllm serve`][serve-args] command. (You can also use our [Docker][deployment-docker] image.)
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
@@ -20,58 +21,56 @@ client = OpenAI(
)
completion = client.chat.completions.create(
- model="NousResearch/Meta-Llama-3-8B-Instruct",
- messages=[
- {"role": "user", "content": "Hello!"}
- ]
+ model="NousResearch/Meta-Llama-3-8B-Instruct",
+ messages=[
+ {"role": "user", "content": "Hello!"}
+ ]
)
print(completion.choices[0].message)
```
-:::{tip}
-vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
-You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, i.e. `extra_body={"top_k": 50}` for `top_k`.
-:::
+!!! tip
+ vLLM supports some parameters that are not supported by OpenAI, `top_k` for example.
+ You can pass these parameters to vLLM using the OpenAI client in the `extra_body` parameter of your requests, i.e. `extra_body={"top_k": 50}` for `top_k`.
-:::{important}
-By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
+!!! warning
+ By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
-To disable this behavior, please pass `--generation-config vllm` when launching the server.
-:::
+ To disable this behavior, please pass `--generation-config vllm` when launching the server.
## Supported APIs
We currently support the following OpenAI APIs:
-- [Completions API](#completions-api) (`/v1/completions`)
- - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
- - *Note: `suffix` parameter is not supported.*
-- [Chat Completions API](#chat-api) (`/v1/chat/completions`)
- - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template](#chat-template).
- - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
-- [Embeddings API](#embeddings-api) (`/v1/embeddings`)
- - Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
-- [Transcriptions API](#transcriptions-api) (`/v1/audio/transcriptions`)
- - Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
+- [Completions API][completions-api] (`/v1/completions`)
+ - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`).
+ - *Note: `suffix` parameter is not supported.*
+- [Chat Completions API][chat-api] (`/v1/chat/completions`)
+ - Only applicable to [text generation models](../models/generative_models.md) (`--task generate`) with a [chat template][chat-template].
+ - *Note: `parallel_tool_calls` and `user` parameters are ignored.*
+- [Embeddings API][embeddings-api] (`/v1/embeddings`)
+ - Only applicable to [embedding models](../models/pooling_models.md) (`--task embed`).
+- [Transcriptions API][transcriptions-api] (`/v1/audio/transcriptions`)
+ - Only applicable to Automatic Speech Recognition (ASR) models (OpenAI Whisper) (`--task generate`).
In addition, we have the following custom APIs:
-- [Tokenizer API](#tokenizer-api) (`/tokenize`, `/detokenize`)
- - Applicable to any model with a tokenizer.
-- [Pooling API](#pooling-api) (`/pooling`)
- - Applicable to all [pooling models](../models/pooling_models.md).
-- [Classification API](#classification-api) (`/classify`)
- - Only applicable to [classification models](../models/pooling_models.md) (`--task classify`).
-- [Score API](#score-api) (`/score`)
- - Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
-- [Re-rank API](#rerank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- - Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
- - Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
- - Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
+- [Tokenizer API][tokenizer-api] (`/tokenize`, `/detokenize`)
+ - Applicable to any model with a tokenizer.
+- [Pooling API][pooling-api] (`/pooling`)
+ - Applicable to all [pooling models](../models/pooling_models.md).
+- [Classification API][classification-api] (`/classify`)
+ - Only applicable to [classification models](../models/pooling_models.md) (`--task classify`).
+- [Score API][score-api] (`/score`)
+ - Applicable to embedding models and [cross-encoder models](../models/pooling_models.md) (`--task score`).
+- [Re-rank API][rerank-api] (`/rerank`, `/v1/rerank`, `/v2/rerank`)
+ - Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
+ - Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
+ - Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
+ - Only applicable to [cross-encoder models](../models/pooling_models.md) (`--task score`).
-(chat-template)=
+[](){ #chat-template }
## Chat Template
@@ -97,10 +96,10 @@ both a `type` and a `text` field. An example is provided below:
```python
completion = client.chat.completions.create(
- model="NousResearch/Meta-Llama-3-8B-Instruct",
- messages=[
- {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
- ]
+ model="NousResearch/Meta-Llama-3-8B-Instruct",
+ messages=[
+ {"role": "user", "content": [{"type": "text", "text": "Classify this sentiment: vLLM is wonderful!"}]}
+ ]
)
```
@@ -111,9 +110,9 @@ request. vLLM provides best-effort support to detect this automatically, which i
the detected format, which can be one of:
- `"string"`: A string.
- - Example: `"Hello world"`
+ - Example: `"Hello world"`
- `"openai"`: A list of dictionaries, similar to OpenAI schema.
- - Example: `[{"type": "text", "text": "Hello world!"}]`
+ - Example: `[{"type": "text", "text": "Hello world!"}]`
If the result is not what you expect, you can set the `--chat-template-content-format` CLI argument
to override which format to use.
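+
+For example, a minimal sketch of forcing the OpenAI-style list format when starting the server:
+
+```bash
+vllm serve NousResearch/Meta-Llama-3-8B-Instruct --chat-template-content-format openai
+```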
@@ -126,13 +125,13 @@ Or directly merge them into the JSON payload if you are using HTTP call directly
```python
completion = client.chat.completions.create(
- model="NousResearch/Meta-Llama-3-8B-Instruct",
- messages=[
- {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
- ],
- extra_body={
- "guided_choice": ["positive", "negative"]
- }
+ model="NousResearch/Meta-Llama-3-8B-Instruct",
+ messages=[
+ {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
+ ],
+ extra_body={
+ "guided_choice": ["positive", "negative"]
+ }
)
```
@@ -148,29 +147,29 @@ with `--enable-request-id-headers`.
```python
completion = client.chat.completions.create(
- model="NousResearch/Meta-Llama-3-8B-Instruct",
- messages=[
- {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
- ],
- extra_headers={
- "x-request-id": "sentiment-classification-00001",
- }
+ model="NousResearch/Meta-Llama-3-8B-Instruct",
+ messages=[
+ {"role": "user", "content": "Classify this sentiment: vLLM is wonderful!"}
+ ],
+ extra_headers={
+ "x-request-id": "sentiment-classification-00001",
+ }
)
print(completion._request_id)
completion = client.completions.create(
- model="NousResearch/Meta-Llama-3-8B-Instruct",
- prompt="A robot may not injure a human being",
- extra_headers={
- "x-request-id": "completion-test",
- }
+ model="NousResearch/Meta-Llama-3-8B-Instruct",
+ prompt="A robot may not injure a human being",
+ extra_headers={
+ "x-request-id": "completion-test",
+ }
)
print(completion._request_id)
```
## API Reference
-(completions-api)=
+[](){ #completions-api }
### Completions API
@@ -181,23 +180,19 @@ Code example:
#### Extra parameters
-The following [sampling parameters](#sampling-params) are supported.
+The following [sampling parameters][sampling-params] are supported.
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-completion-sampling-params
-:end-before: end-completion-sampling-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:completion-sampling-params"
+```
The following extra parameters are supported:
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-completion-extra-params
-:end-before: end-completion-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
+```
-(chat-api)=
+[](){ #chat-api }
### Chat API
@@ -206,37 +201,33 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
We support both [Vision](https://platform.openai.com/docs/guides/vision)- and
[Audio](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in)-related parameters;
-see our [Multimodal Inputs](#multimodal-inputs) guide for more information.
+see our [Multimodal Inputs][multimodal-inputs] guide for more information.
- *Note: `image_url.detail` parameter is not supported.*
Code example:
#### Extra parameters
-The following [sampling parameters](#sampling-params) are supported.
+The following [sampling parameters][sampling-params] are supported.
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-chat-completion-sampling-params
-:end-before: end-chat-completion-sampling-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-sampling-params"
+```
The following extra parameters are supported:
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-chat-completion-extra-params
-:end-before: end-chat-completion-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
+```
-(embeddings-api)=
+[](){ #embeddings-api }
### Embeddings API
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
-If the model has a [chat template](#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
+If the model has a [chat template][chat-template], you can replace `inputs` with a list of `messages` (same schema as [Chat API][chat-api])
which will be treated as a single prompt to the model.
Code example:
@@ -246,138 +237,117 @@ Code example:
You can pass multi-modal inputs to embedding models by defining a custom chat template for the server
and passing a list of `messages` in the request. Refer to the examples below for illustration.
-:::::{tab-set}
-::::{tab-item} VLM2Vec
+=== "VLM2Vec"
-To serve the model:
+ To serve the model:
-```bash
-vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
- --trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
-```
+ ```bash
+ vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
+ --trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
+ ```
-:::{important}
-Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
-to run this model in embedding mode instead of text generation mode.
+ !!! warning
+ Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
+ to run this model in embedding mode instead of text generation mode.
-The custom chat template is completely different from the original one for this model,
-and can be found here:
-:::
+ The custom chat template is completely different from the original one for this model,
+ and can be found here:
-Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
+ Since the request schema is not defined by OpenAI client, we post a request to the server using the lower-level `requests` library:
-```python
-import requests
+ ```python
+ import requests
-image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
+ image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
-response = requests.post(
- "http://localhost:8000/v1/embeddings",
- json={
- "model": "TIGER-Lab/VLM2Vec-Full",
- "messages": [{
- "role": "user",
- "content": [
- {"type": "image_url", "image_url": {"url": image_url}},
- {"type": "text", "text": "Represent the given image."},
- ],
- }],
- "encoding_format": "float",
- },
-)
-response.raise_for_status()
-response_json = response.json()
-print("Embedding output:", response_json["data"][0]["embedding"])
-```
+ response = requests.post(
+ "http://localhost:8000/v1/embeddings",
+ json={
+ "model": "TIGER-Lab/VLM2Vec-Full",
+ "messages": [{
+ "role": "user",
+ "content": [
+ {"type": "image_url", "image_url": {"url": image_url}},
+ {"type": "text", "text": "Represent the given image."},
+ ],
+ }],
+ "encoding_format": "float",
+ },
+ )
+ response.raise_for_status()
+ response_json = response.json()
+ print("Embedding output:", response_json["data"][0]["embedding"])
+ ```
-::::
+=== "DSE-Qwen2-MRL"
-::::{tab-item} DSE-Qwen2-MRL
+ To serve the model:
-To serve the model:
+ ```bash
+ vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
+ --trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
+ ```
-```bash
-vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
- --trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
-```
+ !!! warning
+ Like with VLM2Vec, we have to explicitly pass `--task embed`.
-:::{important}
-Like with VLM2Vec, we have to explicitly pass `--task embed`.
+ Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
+ by a custom chat template:
-Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
-by a custom chat template:
-:::
-
-:::{important}
-`MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
-example below for details.
-:::
-
-::::
-
-:::::
+ !!! warning
+ `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
+ example below for details.
Full example:
#### Extra parameters
-The following [pooling parameters](#pooling-params) are supported.
+The following [pooling parameters][pooling-params] are supported.
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-embedding-pooling-params
-:end-before: end-embedding-pooling-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:embedding-pooling-params"
+```
The following extra parameters are supported by default:
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-embedding-extra-params
-:end-before: end-embedding-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:embedding-extra-params"
+```
For chat-like input (i.e. if `messages` is passed), these extra parameters are supported instead:
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-chat-embedding-extra-params
-:end-before: end-chat-embedding-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
+```
-(transcriptions-api)=
+[](){ #transcriptions-api }
### Transcriptions API
Our Transcriptions API is compatible with [OpenAI's Transcriptions API](https://platform.openai.com/docs/api-reference/audio/createTranscription);
you can use the [official OpenAI Python client](https://github.com/openai/openai-python) to interact with it.
-:::{note}
-To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.
-:::
+!!! note
+ To use the Transcriptions API, please install with extra audio dependencies using `pip install vllm[audio]`.
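+
+For instance, a minimal sketch using the OpenAI client, assuming an ASR model such as `openai/whisper-large-v3` is being served and that `client` is configured as in the earlier examples:
+
+```python
+# The model name below is an assumption; use whichever ASR model you are serving.
+with open("sample.wav", "rb") as audio_file:
+    transcription = client.audio.transcriptions.create(
+        model="openai/whisper-large-v3",
+        file=audio_file,
+    )
+print(transcription.text)
+```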
Code example:
#### Extra Parameters
-The following [sampling parameters](#sampling-params) are supported.
+The following [sampling parameters][sampling-params] are supported.
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-transcription-sampling-params
-:end-before: end-transcription-sampling-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:transcription-sampling-params"
+```
The following extra parameters are supported:
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-transcription-extra-params
-:end-before: end-transcription-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
+```
-(tokenizer-api)=
+[](){ #tokenizer-api }
### Tokenizer API
@@ -387,17 +357,17 @@ It consists of two endpoints:
- `/tokenize` corresponds to calling `tokenizer.encode()`.
- `/detokenize` corresponds to calling `tokenizer.decode()`.
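+
+For example, a minimal sketch of a `/tokenize` request using the `requests` library, assuming the body takes `model` and `prompt` fields:
+
+```python
+import requests
+
+response = requests.post(
+    "http://localhost:8000/tokenize",
+    # Field names are an assumption based on the other endpoints on this page.
+    json={"model": "NousResearch/Meta-Llama-3-8B-Instruct", "prompt": "Hello, world!"},
+)
+response.raise_for_status()
+print(response.json())
+```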
-(pooling-api)=
+[](){ #pooling-api }
### Pooling API
Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.
-The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
+The input format is the same as [Embeddings API][embeddings-api], but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
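+
+For instance, a minimal sketch using the `requests` library, assuming the request body mirrors the Embeddings API and that a pooling model such as `BAAI/bge-base-en-v1.5` is being served:
+
+```python
+import requests
+
+# Both the model name and the field names here are assumptions for illustration.
+response = requests.post(
+    "http://localhost:8000/pooling",
+    json={"model": "BAAI/bge-base-en-v1.5", "input": "vLLM is wonderful!"},
+)
+response.raise_for_status()
+# The returned data may be an arbitrarily nested list, not just a flat embedding.
+print(response.json())
+```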
Code example:
-(classification-api)=
+[](){ #classification-api }
### Classification API
@@ -505,23 +475,19 @@ Response:
#### Extra parameters
-The following [pooling parameters](#pooling-params) are supported.
+The following [pooling parameters][pooling-params] are supported.
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-classification-pooling-params
-:end-before: end-classification-pooling-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:classification-pooling-params"
+```
The following extra parameters are supported:
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-classification-extra-params
-:end-before: end-classification-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:classification-extra-params"
+```
-(score-api)=
+[](){ #score-api }
### Score API
@@ -668,23 +634,19 @@ Response:
#### Extra parameters
-The following [pooling parameters](#pooling-params) are supported.
+The following [pooling parameters][pooling-params] are supported.
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-score-pooling-params
-:end-before: end-score-pooling-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:score-pooling-params"
+```
The following extra parameters are supported:
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-score-extra-params
-:end-before: end-score-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:score-extra-params"
+```
-(rerank-api)=
+[](){ #rerank-api }
### Re-rank API
@@ -755,18 +717,14 @@ Response:
#### Extra parameters
-The following [pooling parameters](#pooling-params) are supported.
+The following [pooling parameters][pooling-params] are supported.
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-rerank-pooling-params
-:end-before: end-rerank-pooling-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:rerank-pooling-params"
+```
The following extra parameters are supported:
-:::{literalinclude} ../../../vllm/entrypoints/openai/protocol.py
-:language: python
-:start-after: begin-rerank-extra-params
-:end-before: end-rerank-extra-params
-:::
+```python
+--8<-- "vllm/entrypoints/openai/protocol.py:rerank-extra-params"
+```
diff --git a/docs/serving/serve_args.md b/docs/serving/serve_args.md
new file mode 100644
index 0000000000..16b4b29f45
--- /dev/null
+++ b/docs/serving/serve_args.md
@@ -0,0 +1,38 @@
+---
+title: Server Arguments
+---
+[](){ #serve-args }
+
+The `vllm serve` command is used to launch the OpenAI-compatible server.
+
+## CLI Arguments
+
+To see the available CLI arguments, run `vllm serve --help`!
+
+## Configuration file
+
+You can load CLI arguments via a [YAML](https://yaml.org/) config file.
+The argument names must be the long form of those outlined [above][serve-args].
+
+For example:
+
+```yaml
+# config.yaml
+
+model: meta-llama/Llama-3.1-8B-Instruct
+host: "127.0.0.1"
+port: 6379
+uvicorn-log-level: "info"
+```
+
+To use the above config file:
+
+```bash
+vllm serve --config config.yaml
+```
+
+!!! note
+    If an argument is supplied both on the command line and in the config file, the value from the command line takes precedence.
+    The order of priority is `command line > config file values > defaults`.
+    For example, in `vllm serve SOME_MODEL --config config.yaml`, `SOME_MODEL` takes precedence over the `model` value in the config file.
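+
+For example, a minimal sketch using the `config.yaml` above, where the command-line flag wins:
+
+```bash
+# --port 8000 overrides `port: 6379` from config.yaml
+vllm serve --config config.yaml --port 8000
+```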
diff --git a/docs/source/serving/usage_stats.md b/docs/serving/usage_stats.md
similarity index 100%
rename from docs/source/serving/usage_stats.md
rename to docs/serving/usage_stats.md
diff --git a/docs/source/_static/custom.css b/docs/source/_static/custom.css
deleted file mode 100644
index 79bd2082b4..0000000000
--- a/docs/source/_static/custom.css
+++ /dev/null
@@ -1,8 +0,0 @@
-.vertical-table-header th.head:not(.stub) {
- writing-mode: sideways-lr;
- white-space: nowrap;
- max-width: 0;
- p {
- margin: 0;
- }
-}
diff --git a/docs/source/_templates/sections/header.html b/docs/source/_templates/sections/header.html
deleted file mode 100644
index 7174431b10..0000000000
--- a/docs/source/_templates/sections/header.html
+++ /dev/null
@@ -1,39 +0,0 @@
-You are viewing the latest developer preview docs. Click here to view docs for the latest stable release.
diff --git a/docs/source/api/summary.md b/docs/source/api/summary.md
deleted file mode 100644
index 46de545f9d..0000000000
--- a/docs/source/api/summary.md
+++ /dev/null
@@ -1,133 +0,0 @@
-# Summary
-
-(configuration)=
-
-## Configuration
-
-API documentation for vLLM's configuration classes.
-
-```{autodoc2-summary}
- vllm.config.ModelConfig
- vllm.config.CacheConfig
- vllm.config.TokenizerPoolConfig
- vllm.config.LoadConfig
- vllm.config.ParallelConfig
- vllm.config.SchedulerConfig
- vllm.config.DeviceConfig
- vllm.config.SpeculativeConfig
- vllm.config.LoRAConfig
- vllm.config.PromptAdapterConfig
- vllm.config.MultiModalConfig
- vllm.config.PoolerConfig
- vllm.config.DecodingConfig
- vllm.config.ObservabilityConfig
- vllm.config.KVTransferConfig
- vllm.config.CompilationConfig
- vllm.config.VllmConfig
-```
-
-(offline-inference-api)=
-
-## Offline Inference
-
-LLM Class.
-
-```{autodoc2-summary}
- vllm.LLM
-```
-
-LLM Inputs.
-
-```{autodoc2-summary}
- vllm.inputs.PromptType
- vllm.inputs.TextPrompt
- vllm.inputs.TokensPrompt
-```
-
-## vLLM Engines
-
-Engine classes for offline and online inference.
-
-```{autodoc2-summary}
- vllm.LLMEngine
- vllm.AsyncLLMEngine
-```
-
-## Inference Parameters
-
-Inference parameters for vLLM APIs.
-
-(sampling-params)=
-(pooling-params)=
-
-```{autodoc2-summary}
- vllm.SamplingParams
- vllm.PoolingParams
-```
-
-(multi-modality)=
-
-## Multi-Modality
-
-vLLM provides experimental support for multi-modal models through the {mod}`vllm.multimodal` package.
-
-Multi-modal inputs can be passed alongside text and token prompts to [supported models](#supported-mm-models)
-via the `multi_modal_data` field in {class}`vllm.inputs.PromptType`.
-
-Looking to add your own multi-modal model? Please follow the instructions listed [here](#supports-multimodal).
-
-```{autodoc2-summary}
- vllm.multimodal.MULTIMODAL_REGISTRY
-```
-
-### Inputs
-
-User-facing inputs.
-
-```{autodoc2-summary}
- vllm.multimodal.inputs.MultiModalDataDict
-```
-
-Internal data structures.
-
-```{autodoc2-summary}
- vllm.multimodal.inputs.PlaceholderRange
- vllm.multimodal.inputs.NestedTensors
- vllm.multimodal.inputs.MultiModalFieldElem
- vllm.multimodal.inputs.MultiModalFieldConfig
- vllm.multimodal.inputs.MultiModalKwargsItem
- vllm.multimodal.inputs.MultiModalKwargs
- vllm.multimodal.inputs.MultiModalInputs
-```
-
-### Data Parsing
-
-```{autodoc2-summary}
- vllm.multimodal.parse
-```
-
-### Data Processing
-
-```{autodoc2-summary}
- vllm.multimodal.processing
-```
-
-### Memory Profiling
-
-```{autodoc2-summary}
- vllm.multimodal.profiling
-```
-
-### Registry
-
-```{autodoc2-summary}
- vllm.multimodal.registry
-```
-
-## Model Development
-
-```{autodoc2-summary}
- vllm.model_executor.models.interfaces_base
- vllm.model_executor.models.interfaces
- vllm.model_executor.models.adapters
-```
diff --git a/docs/source/autodoc2_docstring_parser.py b/docs/source/autodoc2_docstring_parser.py
deleted file mode 100644
index 41c49ed1c5..0000000000
--- a/docs/source/autodoc2_docstring_parser.py
+++ /dev/null
@@ -1,21 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-from docutils import nodes
-from myst_parser.parsers.sphinx_ import MystParser
-from sphinx.ext.napoleon import docstring
-
-
-class NapoleonParser(MystParser):
-
- def parse(self, input_string: str, document: nodes.document) -> None:
- # Get the Sphinx configuration
- config = document.settings.env.config
-
- parsed_content = str(
- docstring.GoogleDocstring(
- str(docstring.NumpyDocstring(input_string, config)),
- config,
- ))
- return super().parse(parsed_content, document)
-
-
-Parser = NapoleonParser
diff --git a/docs/source/community/blog.md b/docs/source/community/blog.md
deleted file mode 100644
index e8030edfa0..0000000000
--- a/docs/source/community/blog.md
+++ /dev/null
@@ -1,3 +0,0 @@
-# vLLM Blog
-
-vLLM blog posts are published [here](https://blog.vllm.ai/).
diff --git a/docs/source/conf.py b/docs/source/conf.py
deleted file mode 100644
index 5620d6de2c..0000000000
--- a/docs/source/conf.py
+++ /dev/null
@@ -1,263 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-
-# Configuration file for the Sphinx documentation builder.
-#
-# This file only contains a selection of the most common options. For a full
-# list see the documentation:
-# https://www.sphinx-doc.org/en/master/usage/configuration.html
-
-# -- Path setup --------------------------------------------------------------
-
-# If extensions (or modules to document with autodoc) are in another directory,
-# add these directories to sys.path here. If the directory is relative to the
-# documentation root, use os.path.abspath to make it absolute, like shown here.
-
-import datetime
-import logging
-import os
-import re
-import sys
-from pathlib import Path
-
-import requests
-
-logger = logging.getLogger(__name__)
-REPO_ROOT = Path(__file__).resolve().parent.parent.parent
-sys.path.append(os.path.abspath(REPO_ROOT))
-
-# -- Project information -----------------------------------------------------
-
-project = 'vLLM'
-copyright = f'{datetime.datetime.now().year}, vLLM Team'
-author = 'the vLLM Team'
-
-# -- General configuration ---------------------------------------------------
-
-# Add any Sphinx extension module names here, as strings. They can be
-# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
-# ones.
-extensions = [
- "sphinx.ext.napoleon",
- "sphinx.ext.linkcode",
- "sphinx.ext.intersphinx",
- "sphinx_copybutton",
- "autodoc2",
- "myst_parser",
- "sphinxarg.ext",
- "sphinx_design",
- "sphinx_togglebutton",
-]
-myst_enable_extensions = [
- "colon_fence",
- "fieldlist",
-]
-autodoc2_packages = [
- {
- "path": "../../vllm",
- "exclude_dirs": ["__pycache__", "third_party"],
- },
-]
-autodoc2_output_dir = "api"
-autodoc2_render_plugin = "myst"
-autodoc2_hidden_objects = ["dunder", "private", "inherited"]
-autodoc2_sort_names = True
-autodoc2_index_template = None
-
-# Add any paths that contain templates here, relative to this directory.
-templates_path = ['_templates']
-
-# List of patterns, relative to source directory, that match files and
-# directories to ignore when looking for source files.
-# This pattern also affects html_static_path and html_extra_path.
-exclude_patterns: list[str] = ["**/*.template.md", "**/*.inc.md"]
-
-# Exclude the prompt "$" when copying code
-copybutton_prompt_text = r"\$ "
-copybutton_prompt_is_regexp = True
-
-# -- Options for HTML output -------------------------------------------------
-
-# The theme to use for HTML and HTML Help pages. See the documentation for
-# a list of builtin themes.
-#
-html_title = project
-html_theme = 'sphinx_book_theme'
-html_logo = 'assets/logos/vllm-logo-text-light.png'
-html_favicon = 'assets/logos/vllm-logo-only-light.ico'
-html_theme_options = {
- 'path_to_docs': 'docs/source',
- 'repository_url': 'https://github.com/vllm-project/vllm',
- 'use_repository_button': True,
- 'use_edit_page_button': True,
- # Prevents the full API being added to the left sidebar of every page.
- # Reduces build time by 2.5x and reduces build size from ~225MB to ~95MB.
- 'collapse_navbar': True,
- # Makes API visible in the right sidebar on API reference pages.
- 'show_toc_level': 3,
-}
-# Add any paths that contain custom static files (such as style sheets) here,
-# relative to this directory. They are copied after the builtin static files,
-# so a file named "default.css" will overwrite the builtin "default.css".
-html_static_path = ["_static"]
-html_js_files = ["custom.js"]
-html_css_files = ["custom.css"]
-
-myst_heading_anchors = 2
-myst_url_schemes = {
- 'http': None,
- 'https': None,
- 'mailto': None,
- 'ftp': None,
- "gh-issue": {
- "url":
- "https://github.com/vllm-project/vllm/issues/{{path}}#{{fragment}}",
- "title": "Issue #{{path}}",
- "classes": ["github"],
- },
- "gh-pr": {
- "url":
- "https://github.com/vllm-project/vllm/pull/{{path}}#{{fragment}}",
- "title": "Pull Request #{{path}}",
- "classes": ["github"],
- },
- "gh-project": {
- "url": "https://github.com/orgs/vllm-project/projects/{{path}}",
- "title": "Project #{{path}}",
- "classes": ["github"],
- },
- "gh-dir": {
- "url": "https://github.com/vllm-project/vllm/tree/main/{{path}}",
- "title": "{{path}}",
- "classes": ["github"],
- },
- "gh-file": {
- "url": "https://github.com/vllm-project/vllm/blob/main/{{path}}",
- "title": "{{path}}",
- "classes": ["github"],
- },
-}
-
-# see https://docs.readthedocs.io/en/stable/reference/environment-variables.html # noqa
-READTHEDOCS_VERSION_TYPE = os.environ.get('READTHEDOCS_VERSION_TYPE')
-if READTHEDOCS_VERSION_TYPE == "tag":
- # remove the warning banner if the version is a tagged release
- header_file = os.path.join(os.path.dirname(__file__),
- "_templates/sections/header.html")
- # The file might be removed already if the build is triggered multiple times
- # (readthedocs build both HTML and PDF versions separately)
- if os.path.exists(header_file):
- os.remove(header_file)
-
-
-# Generate additional rst documentation here.
-def setup(app):
- from docs.source.generate_examples import generate_examples
- generate_examples()
-
-
-_cached_base: str = ""
-_cached_branch: str = ""
-
-
-def get_repo_base_and_branch(pr_number):
- global _cached_base, _cached_branch
- if _cached_base and _cached_branch:
- return _cached_base, _cached_branch
-
- url = f"https://api.github.com/repos/vllm-project/vllm/pulls/{pr_number}"
- response = requests.get(url)
- if response.status_code == 200:
- data = response.json()
- _cached_base = data['head']['repo']['full_name']
- _cached_branch = data['head']['ref']
- return _cached_base, _cached_branch
- else:
- logger.error("Failed to fetch PR details: %s", response)
- return None, None
-
-
-def linkcode_resolve(domain, info):
- if domain != 'py':
- return None
- if not info['module']:
- return None
-
- # Get path from module name
- file = Path(f"{info['module'].replace('.', '/')}.py")
- path = REPO_ROOT / file
- if not path.exists():
- path = REPO_ROOT / file.with_suffix("") / "__init__.py"
- if not path.exists():
- return None
-
- # Get the line number of the object
- with open(path) as f:
- lines = f.readlines()
- name = info['fullname'].split(".")[-1]
- pattern = fr"^( {{4}})*((def|class) )?{name}\b.*"
- for lineno, line in enumerate(lines, 1):
- if not line or line.startswith("#"):
- continue
- if re.match(pattern, line):
- break
-
- # If the line number is not found, return None
- if lineno == len(lines):
- return None
-
- # If the line number is found, create the URL
- filename = path.relative_to(REPO_ROOT)
- if "checkouts" in path.parts:
- # a PR build on readthedocs
- pr_number = REPO_ROOT.name
- base, branch = get_repo_base_and_branch(pr_number)
- if base and branch:
- return f"https://github.com/{base}/blob/{branch}/{filename}#L{lineno}"
- # Otherwise, link to the source file on the main branch
- return f"https://github.com/vllm-project/vllm/blob/main/{filename}#L{lineno}"
-
-
-# Mock out external dependencies here, otherwise sphinx-argparse won't work.
-autodoc_mock_imports = [
- "huggingface_hub",
- "pydantic",
- "zmq",
- "cloudpickle",
- "aiohttp",
- "starlette",
- "blake3",
- "cpuinfo",
- "transformers",
- "psutil",
- "vllm._C",
- "PIL",
- "numpy",
- "tqdm",
- # The mocks below are required by
- # docs/source/serving/openai_compatible_server.md's
- # vllm.entrypoints.openai.cli_args
- "openai",
- "fastapi",
- "partial_json_parser",
-]
-
-for mock_target in autodoc_mock_imports:
- if mock_target in sys.modules:
- logger.info(
- "Potentially problematic mock target (%s) found; "
- "autodoc_mock_imports cannot mock modules that have already "
- "been loaded into sys.modules when the sphinx build starts.",
- mock_target)
-
-intersphinx_mapping = {
- "python": ("https://docs.python.org/3", None),
- "typing_extensions":
- ("https://typing-extensions.readthedocs.io/en/latest", None),
- "aiohttp": ("https://docs.aiohttp.org/en/stable", None),
- "pillow": ("https://pillow.readthedocs.io/en/stable", None),
- "numpy": ("https://numpy.org/doc/stable", None),
- "torch": ("https://pytorch.org/docs/stable", None),
- "psutil": ("https://psutil.readthedocs.io/en/stable", None),
-}
-
-navigation_with_keys = False
diff --git a/docs/source/contributing/model/index.md b/docs/source/contributing/model/index.md
deleted file mode 100644
index 721ee3cd20..0000000000
--- a/docs/source/contributing/model/index.md
+++ /dev/null
@@ -1,27 +0,0 @@
-(new-model)=
-
-# Adding a New Model
-
-This section provides more information on how to integrate a [PyTorch](https://pytorch.org/) model into vLLM.
-
-:::{toctree}
-:caption: Contents
-:maxdepth: 1
-
-basic
-registration
-tests
-multimodal
-:::
-
-:::{note}
-The complexity of adding a new model depends heavily on the model's architecture.
-The process is considerably straightforward if the model shares a similar architecture with an existing model in vLLM.
-However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
-:::
-
-:::{tip}
-If you are encountering issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)
-or ask on our [developer slack](https://slack.vllm.ai).
-We will be happy to help you out!
-:::
diff --git a/docs/source/contributing/model/multimodal.md b/docs/source/contributing/model/multimodal.md
deleted file mode 100644
index b42536f054..0000000000
--- a/docs/source/contributing/model/multimodal.md
+++ /dev/null
@@ -1,834 +0,0 @@
-(supports-multimodal)=
-
-# Multi-Modal Support
-
-This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](#multimodal-inputs).
-
-## 1. Update the base vLLM model
-
-It is assumed that you have already implemented the model in vLLM according to [these steps](#new-model-basic).
-Further update the model as follows:
-
-- Reserve a keyword parameter in {meth}`~torch.nn.Module.forward` for each input tensor that corresponds to a multi-modal input, as shown in the following example:
-
- ```diff
- def forward(
- self,
- input_ids: torch.Tensor,
- positions: torch.Tensor,
- + pixel_values: torch.Tensor,
- ) -> SamplerOutput:
- ```
-
- More conveniently, you can simply pass `**kwargs` to the {meth}`~torch.nn.Module.forward` method and retrieve the keyword parameters for multimodal inputs from it.
-
-- Implement {meth}`~vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings` that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.
-
- ```python
- class YourModelForImage2Seq(nn.Module):
- ...
-
- def _process_image_input(self, image_input: YourModelImageInputs) -> torch.Tensor:
-
- assert self.vision_encoder is not None
- image_features = self.vision_encoder(image_input)
- return self.multi_modal_projector(image_features)
-
- def get_multimodal_embeddings(
- self, **kwargs: object) -> Optional[MultiModalEmbeddings]:
-
- # Validate the multimodal input keyword arguments
- image_input = self._parse_and_validate_image_input(**kwargs)
- if image_input is None:
- return None
-
- # Run multimodal inputs through encoder and projector
- vision_embeddings = self._process_image_input(image_input)
- return vision_embeddings
- ```
-
- :::{important}
- The returned `multimodal_embeddings` must be either a **3D {class}`torch.Tensor`** of shape `(num_items, feature_size, hidden_size)`, or a **list / tuple of 2D {class}`torch.Tensor`'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g, image) of the request.
- :::
-
-- Implement {meth}`~vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings` to merge `multimodal_embeddings` with text embeddings from the `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
-
- ```python
- from .utils import merge_multimodal_embeddings
-
- class YourModelForImage2Seq(nn.Module):
- ...
-
- def get_input_embeddings(
- self,
- input_ids: torch.Tensor,
- multimodal_embeddings: Optional[MultiModalEmbeddings] = None,
- ) -> torch.Tensor:
-
- # `get_input_embeddings` should already be implemented for the language
- # model as one of the requirements of basic vLLM model implementation.
- inputs_embeds = self.language_model.get_input_embeddings(input_ids)
-
- if multimodal_embeddings is not None:
- inputs_embeds = merge_multimodal_embeddings(
- input_ids=input_ids,
- inputs_embeds=inputs_embeds,
- multimodal_embeddings=multimodal_embeddings,
- placeholder_token_id=self.config.image_token_index)
-
- return inputs_embeds
- ```
-
-- Implement {meth}`~vllm.model_executor.models.interfaces.SupportsMultiModal.get_language_model` getter to provide stable access to the underlying language model.
-
- ```python
- class YourModelForImage2Seq(nn.Module):
- ...
-
- def get_language_model(self) -> torch.nn.Module:
- # Change `language_model` according to your implementation.
- return self.language_model
- ```
-
-- Once the above steps are done, update the model class with the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
-
- ```diff
- + from vllm.model_executor.models.interfaces import SupportsMultiModal
-
- - class YourModelForImage2Seq(nn.Module):
- + class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
- ```
-
- :::{note}
- The model class does not have to be named {code}`*ForCausalLM`.
- Check out [the HuggingFace Transformers documentation](https://huggingface.co/docs/transformers/model_doc/auto#multimodal) for some examples.
- :::
-
-## 2. Specify processing information
-
-Next, create a subclass of {class}`~vllm.multimodal.processing.BaseProcessingInfo`
-to provide basic information related to HF processing.
-
-### Maximum number of input items
-
-You need to override the abstract method {meth}`~vllm.multimodal.processing.BaseProcessingInfo.get_supported_mm_limits`
-to return the maximum number of input items for each modality supported by the model.
-
-For example, if the model supports any number of images but only one video per prompt:
-
-```python
-def get_supported_mm_limits(self) -> Mapping[str, Optional[int]]:
- return {"image": None, "video": 1}
-```
-
-## 3. Specify dummy inputs
-
-Then, inherit {class}`~vllm.multimodal.profiling.BaseDummyInputsBuilder` to construct dummy inputs for
-HF processing as well as memory profiling.
-
-### For memory profiling
-
-Override the abstract methods {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text` and {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_mm_data` to construct dummy inputs for memory profiling. These dummy inputs should result in the worst-case memory usage of the model so that vLLM can reserve the correct amount of memory for it.
-
-Assuming that the memory usage increases with the number of tokens, the dummy inputs can be constructed to maximize the number of output embeddings, which is the same number as placeholder feature tokens.
-
-::::{tab-set}
-:::{tab-item} Basic example: LLaVA
-:sync: llava
-
-Looking at the code of HF's `LlavaForConditionalGeneration`:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L530-L544
-n_image_tokens = (input_ids == self.config.image_token_index).sum().item()
-n_image_features = image_features.shape[0] * image_features.shape[1]
-
-if n_image_tokens != n_image_features:
- raise ValueError(
- f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
- )
-special_image_mask = (
- (input_ids == self.config.image_token_index)
- .unsqueeze(-1)
- .expand_as(inputs_embeds)
- .to(inputs_embeds.device)
-)
-image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
-inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
-```
-
-The number of placeholder feature tokens per image is `image_features.shape[1]`.
-`image_features` is calculated inside the `get_image_features` method:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/modeling_llava.py#L290-L300
-image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
-
-selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
-if vision_feature_select_strategy == "default":
- selected_image_feature = selected_image_feature[:, 1:]
-elif vision_feature_select_strategy == "full":
- selected_image_feature = selected_image_feature
-else:
- raise ValueError(f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}")
-image_features = self.multi_modal_projector(selected_image_feature)
-return image_features
-```
-
-We can infer that `image_features.shape[1]` is based on `image_outputs.hidden_states.shape[1]` from the vision tower
-(`CLIPVisionModel` for the [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model).
-Moreover, we only need the sequence length (the second dimension of the tensor) to get `image_features.shape[1]`.
-The sequence length is determined by the initial hidden states in `CLIPVisionTransformer` since the attention
-mechanism doesn't change the sequence length of the output hidden states.
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L1094-L1102
-hidden_states = self.embeddings(pixel_values, interpolate_pos_encoding=interpolate_pos_encoding)
-hidden_states = self.pre_layrnorm(hidden_states)
-
-encoder_outputs = self.encoder(
- inputs_embeds=hidden_states,
- output_attentions=output_attentions,
- output_hidden_states=output_hidden_states,
- return_dict=return_dict,
-)
-```
-
-To find the sequence length, we turn to the code of `CLIPVisionEmbeddings`:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L247-L257
-target_dtype = self.patch_embedding.weight.dtype
-patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid]
-patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
-
-class_embeds = self.class_embedding.expand(batch_size, 1, -1)
-embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
-if interpolate_pos_encoding:
- embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
-else:
- embeddings = embeddings + self.position_embedding(self.position_ids)
-return embeddings
-```
-
-We can infer that `embeddings.shape[1] == self.num_positions`, where
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/modeling_clip.py#L195-L196
-self.num_patches = (self.image_size // self.patch_size) ** 2
-self.num_positions = self.num_patches + 1
-```
-
-Overall, the number of placeholder feature tokens for an image can be calculated as:
-
-```python
-def get_num_image_tokens(
- self,
- *,
- image_width: int,
- image_height: int,
-) -> int:
- hf_config = self.get_hf_config()
- hf_processor = self.get_hf_processor()
-
- image_size = hf_config.vision_config.image_size
- patch_size = hf_config.vision_config.patch_size
-
- num_image_tokens = (image_size // patch_size) ** 2 + 1
- if hf_processor.vision_feature_select_strategy == "default":
- num_image_tokens -= 1
-
- return num_image_tokens
-```
-
-Notice that the number of image tokens doesn't depend on the image width and height.
-We can simply use a dummy `image_size` to calculate the multimodal profiling data:
-
-```python
-# NOTE: In actuality, this is usually implemented as part of the
-# model's subclass of `BaseProcessingInfo`, but we show it as is
-# here for simplicity.
-def get_image_size_with_most_features(self) -> ImageSize:
- hf_config = self.get_hf_config()
- width = height = hf_config.image_size
- return ImageSize(width=width, height=height)
-
-def get_dummy_mm_data(
- self,
- seq_len: int,
- mm_counts: Mapping[str, int],
-) -> MultiModalDataDict:
- num_images = mm_counts.get("image", 0)
-
- target_width, target_height = \
- self.info.get_image_size_with_most_features()
-
- return {
- "image":
- self._get_dummy_images(width=target_width,
- height=target_height,
- num_images=num_images)
- }
-```
-
-For the text, we simply expand the multimodal image token from the model config to match the desired number of images.
-
-```python
-def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
- num_images = mm_counts.get("image", 0)
-
- processor = self.info.get_hf_processor()
- image_token = processor.image_token
-
- return image_token * num_images
-```
-
-:::
-
-:::{tab-item} No input placeholders: Fuyu
-:sync: fuyu
-
-Looking at the code of HF's `FuyuForCausalLM`:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/modeling_fuyu.py#L311-L322
-if image_patches is not None and past_key_values is None:
- patch_embeddings = [
- self.vision_embed_tokens(patch.to(self.vision_embed_tokens.weight.dtype))
- .squeeze(0)
- .to(inputs_embeds.device)
- for patch in image_patches
- ]
- inputs_embeds = self.gather_continuous_embeddings(
- word_embeddings=inputs_embeds,
- continuous_embeddings=patch_embeddings,
- image_patch_input_indices=image_patches_indices,
- )
-```
-
-The number of placeholder feature tokens for the `i`th item in the batch is `patch_embeddings[i].shape[0]`,
-which is the same as `image_patches[i].shape[0]`, i.e. `num_total_patches`.
-
-Unlike LLaVA, Fuyu does not define the number of patches inside the modeling file. Where can we get more information?
-Considering that the model input comes from the output of `FuyuProcessor`, let's **look at the preprocessing files**.
-
-The image outputs are obtained by calling `FuyuImageProcessor.preprocess` and then
-`FuyuImageProcessor.preprocess_with_tokenizer_info` inside `FuyuProcessor`.
-
-In `FuyuImageProcessor.preprocess`, the images are resized and padded to the target `FuyuImageProcessor.size`,
-returning the dimensions after resizing (but before padding) as metadata.
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L541-L544
-image_encoding = self.image_processor.preprocess(images, **output_kwargs["images_kwargs"])
-batch_images = image_encoding["images"]
-image_unpadded_heights = image_encoding["image_unpadded_heights"]
-image_unpadded_widths = image_encoding["image_unpadded_widths"]
-
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L480-L
-if do_resize:
- batch_images = [
- [self.resize(image, size=size, input_data_format=input_data_format) for image in images]
- for images in batch_images
- ]
-
-image_sizes = [get_image_size(images[0], channel_dim=input_data_format) for images in batch_images]
-image_unpadded_heights = [[image_size[0]] for image_size in image_sizes]
-image_unpadded_widths = [[image_size[1]] for image_size in image_sizes]
-
-if do_pad:
- batch_images = [
- [
- self.pad_image(
- image,
- size=size,
- mode=padding_mode,
- constant_values=padding_value,
- input_data_format=input_data_format,
- )
- for image in images
- ]
- for images in batch_images
- ]
-```
-
-In `FuyuImageProcessor.preprocess_with_tokenizer_info`, the images are split into patches based on this metadata:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L425
-model_image_input = self.image_processor.preprocess_with_tokenizer_info(
- image_input=tensor_batch_images,
- image_present=image_present,
- image_unpadded_h=image_unpadded_heights,
- image_unpadded_w=image_unpadded_widths,
- image_placeholder_id=image_placeholder_id,
- image_newline_id=image_newline_id,
- variable_sized=True,
-)
-
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L638-L658
-image_height, image_width = image.shape[1], image.shape[2]
-if variable_sized: # variable_sized=True
- new_h = min(
- image_height,
- math.ceil(image_unpadded_h[batch_index, subseq_index] / patch_height) * patch_height,
- )
- new_w = min(
- image_width,
- math.ceil(image_unpadded_w[batch_index, subseq_index] / patch_width) * patch_width,
- )
- image = image[:, :new_h, :new_w]
- image_height, image_width = new_h, new_w
-
-num_patches = self.get_num_patches(image_height=image_height, image_width=image_width)
-tensor_of_image_ids = torch.full(
- [num_patches], image_placeholder_id, dtype=torch.int32, device=image_input.device
-)
-patches = self.patchify_image(image=image.unsqueeze(0)).squeeze(0)
-assert num_patches == patches.shape[0]
-```
-
-The number of patches is in turn defined by `FuyuImageProcessor.get_num_patches`:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L552-L562
-patch_size = patch_size if patch_size is not None else self.patch_size
-patch_height, patch_width = self.patch_size["height"], self.patch_size["width"]
-
-if image_height % patch_height != 0:
- raise ValueError(f"{image_height=} must be divisible by {patch_height}")
-if image_width % patch_width != 0:
- raise ValueError(f"{image_width=} must be divisible by {patch_width}")
-
-num_patches_per_dim_h = image_height // patch_height
-num_patches_per_dim_w = image_width // patch_width
-num_patches = num_patches_per_dim_h * num_patches_per_dim_w
-```
-
-These image patches correspond to placeholder tokens (`|SPEAKER|`). So, we just need to maximize the number of image patches. Since input images are first resized
-to fit within `image_processor.size`, we can maximize the number of image patches by inputting an image with size equal to `image_processor.size`.
-
-```python
-def get_image_size_with_most_features(self) -> ImageSize:
- image_processor = self.get_image_processor()
- return ImageSize(width=image_processor.size["width"],
- height=image_processor.size["height"])
-```
-
-Fuyu does not expect image placeholders in the inputs to HF processor, so
-the dummy prompt text is empty regardless of the number of images.
-
-```python
-def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
- return ""
-```
-
-For the multimodal image profiling data, the logic is very similar to LLaVA:
-
-```python
-def get_dummy_mm_data(
- self,
- seq_len: int,
- mm_counts: Mapping[str, int],
-) -> MultiModalDataDict:
- target_width, target_height = \
- self.info.get_image_size_with_most_features()
- num_images = mm_counts.get("image", 0)
-
- return {
- "image":
- self._get_dummy_images(width=target_width,
- height=target_height,
- num_images=num_images)
- }
-```
-
-:::
-
-::::
-
-## 4. Specify processing details
-
-Afterwards, create a subclass of {class}`~vllm.multimodal.processing.BaseMultiModalProcessor`
-to fill in the missing details about HF processing.
-
-:::{seealso}
-[Multi-Modal Data Processing](#mm-processing)
-:::
-
-### Multi-modal fields
-
-Override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config` to
-return a schema of the tensors outputted by the HF processor that are related to the input multi-modal items.
-
-:::::{tab-set}
-::::{tab-item} Basic example: LLaVA
-:sync: llava
-
-The output of `CLIPImageProcessor` is a simple tensor with shape
-`(num_images, num_channels, image_height, image_width)`:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/clip/image_processing_clip.py#L339-L345
-images = [
- to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
- for image in all_images
-]
-
-data = {"pixel_values": images}
-return BatchFeature(data=data, tensor_type=return_tensors)
-```
-
-So, we override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config` as follows:
-
-```python
-def _get_mm_fields_config(
- self,
- hf_inputs: BatchFeature,
- hf_processor_mm_kwargs: Mapping[str, object],
-) -> Mapping[str, MultiModalFieldConfig]:
- return dict(
- pixel_values=MultiModalFieldConfig.batched("image"),
- )
-```
-
-:::{note}
-Our [actual code](gh-file:vllm/model_executor/models/llava.py) additionally supports
-pre-computed image embeddings, which can be passed to the model via the `image_embeds` argument.
-:::
-
-::::
-
-::::{tab-item} With postprocessing: Fuyu
-:sync: fuyu
-
-The `image_patches` output of `FuyuImageProcessor.preprocess_with_tokenizer_info` concatenates
-the patches from each image belonging to an item in the batch:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/image_processing_fuyu.py#L673-L679
- image_input_ids.append(tensor_of_image_ids)
- image_patches.append(patches)
- else:
- image_input_ids.append(torch.tensor([], dtype=torch.int32, device=image_input.device))
-
-batch_image_input_ids.append(image_input_ids)
-batch_image_patches.append(image_patches)
-```
-
-The shape of `image_patches` outputted by `FuyuImageProcessor` is therefore
-`(1, num_images, num_patches, patch_width * patch_height * num_channels)`.
-
-In order to support the use of {func}`MultiModalFieldConfig.batched` like in LLaVA,
-we remove the extra batch dimension by overriding {meth}`BaseMultiModalProcessor._call_hf_processor`:
-
-```python
-def _call_hf_processor(
- self,
- prompt: str,
- mm_data: Mapping[str, object],
- mm_kwargs: Mapping[str, object],
-) -> BatchFeature:
- processed_outputs = super()._call_hf_processor(
- prompt=prompt,
- mm_data=mm_data,
- mm_kwargs=mm_kwargs,
- )
-
- image_patches = processed_outputs.get("image_patches")
- if image_patches is not None:
- images = mm_data["images"]
- assert isinstance(images, list)
-
- # Original output: (1, num_images, Pn, Px * Py * C)
- # New output: (num_images, Pn, Px * Py * C)
- assert (isinstance(image_patches, list)
- and len(image_patches) == 1)
- assert (isinstance(image_patches[0], torch.Tensor)
- and len(image_patches[0]) == len(images))
-
- processed_outputs["image_patches"] = image_patches[0]
-
- return processed_outputs
-```
-
-:::{note}
-Our [actual code](gh-file:vllm/model_executor/models/fuyu.py) has special handling
-for text-only inputs to prevent unnecessary warnings from HF processor.
-:::
-
-This lets us override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_mm_fields_config` as follows:
-
-```python
-def _get_mm_fields_config(
- self,
- hf_inputs: BatchFeature,
- hf_processor_mm_kwargs: Mapping[str, object],
-) -> Mapping[str, MultiModalFieldConfig]:
- return dict(image_patches=MultiModalFieldConfig.batched("image"))
-```
-
-::::
-
-:::::
-
-### Prompt updates
-
-Override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates` to
-return a list of {class}`~vllm.multimodal.processing.PromptUpdate` instances.
-
-Each {class}`~vllm.multimodal.processing.PromptUpdate` instance specifies an update operation
-(e.g.: insertion, replacement) performed by the HF processor.
-
-::::{tab-set}
-:::{tab-item} Basic example: LLaVA
-:sync: llava
-
-Looking at HF's `LlavaProcessor`:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.47.1/src/transformers/models/llava/processing_llava.py#L167-L170
-prompt_strings = []
-for sample in text:
- sample = sample.replace(self.image_token, self.image_token * num_image_tokens)
- prompt_strings.append(sample)
-```
-
-It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
-Based on this, we override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates` as follows:
-
-```python
-def _get_prompt_updates(
- self,
- mm_items: MultiModalDataItems,
- hf_processor_mm_kwargs: Mapping[str, object],
- out_mm_kwargs: MultiModalKwargs,
-) -> Sequence[PromptUpdate]:
- hf_config = self.info.get_hf_config()
- image_token_id = hf_config.image_token_index
-
- def get_replacement(item_idx: int):
- images = mm_items.get_items("image", ImageProcessorItems)
-
- image_size = images.get_image_size(item_idx)
- num_image_tokens = self.info.get_num_image_tokens(
- image_width=image_size.width,
- image_height=image_size.height,
- )
-
- return [image_token_id] * num_image_tokens
-
- return [
- PromptReplacement(
- modality="image",
- target=[image_token_id],
- replacement=get_replacement,
- ),
- ]
-```
-
-:::
-
-:::{tab-item} Handling additional tokens: Fuyu
-:sync: fuyu
-
-Recall the layout of feature tokens from Step 2:
-
-```
-|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
-|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
-...
-|SPEAKER||SPEAKER|...|SPEAKER||NEWLINE|
-```
-
-We define a helper function to return `ncols` and `nrows` directly:
-
-```python
-def get_image_feature_grid_size(
- self,
- *,
- image_width: int,
- image_height: int,
-) -> tuple[int, int]:
- image_processor = self.get_image_processor()
- target_width = image_processor.size["width"]
- target_height = image_processor.size["height"]
- patch_width = image_processor.patch_size["width"]
- patch_height = image_processor.patch_size["height"]
-
- if not (image_width <= target_width and image_height <= target_height):
- height_scale_factor = target_height / image_height
- width_scale_factor = target_width / image_width
- optimal_scale_factor = min(height_scale_factor, width_scale_factor)
-
- image_height = int(image_height * optimal_scale_factor)
- image_width = int(image_width * optimal_scale_factor)
-
- ncols = math.ceil(image_width / patch_width)
- nrows = math.ceil(image_height / patch_height)
- return ncols, nrows
-```
-
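-For intuition, here is a small self-contained sketch of the same arithmetic. The 1920x1080 target size and 30x30 patch size below are assumptions based on Fuyu's default image processor configuration, not values read from a checkpoint:
-
-```python
-import math
-
-
-def grid_size(image_width: int, image_height: int,
-              target_width: int = 1920, target_height: int = 1080,
-              patch_width: int = 30, patch_height: int = 30) -> tuple[int, int]:
-    # Scale the image down (never up) so that it fits within the target resolution
-    if not (image_width <= target_width and image_height <= target_height):
-        scale = min(target_height / image_height, target_width / image_width)
-        image_height = int(image_height * scale)
-        image_width = int(image_width * scale)
-
-    # The grid has nrows rows of ncols |SPEAKER| tokens,
-    # each row followed by one |NEWLINE| token
-    return math.ceil(image_width / patch_width), math.ceil(image_height / patch_height)
-
-
-print(grid_size(1920, 1080))  # (64, 36): 64 * 36 image tokens plus 36 newline tokens
-print(grid_size(3840, 2160))  # also (64, 36): larger images are scaled down first
-```
-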
-Based on this, we can initially define our replacement tokens as:
-
-```python
-def get_replacement(item_idx: int):
- images = mm_items.get_items("image", ImageProcessorItems)
- image_size = images.get_image_size(item_idx)
-
- ncols, nrows = self.info.get_image_feature_grid_size(
- image_width=image_size.width,
- image_height=image_size.height,
- )
-
- # `_IMAGE_TOKEN_ID` corresponds to `|SPEAKER|`
- # `_NEWLINE_TOKEN_ID` corresponds to `|NEWLINE|`
- return ([_IMAGE_TOKEN_ID] * ncols + [_NEWLINE_TOKEN_ID]) * nrows
-```
-
-However, this is not entirely correct. After `FuyuImageProcessor.preprocess_with_tokenizer_info` is called,
-a BOS token (``) is also added to the prompt:
-
-```python
-# https://github.com/huggingface/transformers/blob/v4.48.3/src/transformers/models/fuyu/processing_fuyu.py#L417-L435
-model_image_input = self.image_processor.preprocess_with_tokenizer_info(
- image_input=tensor_batch_images,
- image_present=image_present,
- image_unpadded_h=image_unpadded_heights,
- image_unpadded_w=image_unpadded_widths,
- image_placeholder_id=image_placeholder_id,
- image_newline_id=image_newline_id,
- variable_sized=True,
-)
-prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
- tokenizer=self.tokenizer,
- prompts=prompts,
- scale_factors=scale_factors,
- max_tokens_to_generate=self.max_tokens_to_generate,
- max_position_embeddings=self.max_position_embeddings,
- add_BOS=True,
- add_beginning_of_answer_token=True,
-)
-```
-
-To assign the vision embeddings to only the image tokens, instead of a string
-you can return an instance of {class}`~vllm.multimodal.processing.PromptUpdateDetails`:
-
-```python
-hf_config = self.info.get_hf_config()
-bos_token_id = hf_config.bos_token_id # ``
-assert isinstance(bos_token_id, int)
-
-def get_replacement_fuyu(item_idx: int):
- images = mm_items.get_items("image", ImageProcessorItems)
- image_size = images.get_image_size(item_idx)
-
- ncols, nrows = self.info.get_image_feature_grid_size(
- image_width=image_size.width,
- image_height=image_size.height,
- )
- image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
- [_NEWLINE_TOKEN_ID]) * nrows
-
- return PromptUpdateDetails.select_token_id(
- image_tokens + [bos_token_id],
- embed_token_id=_IMAGE_TOKEN_ID,
- )
-```
-
-Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the tokenized prompt,
-we can search for it to conduct the replacement at the start of the string:
-
-```python
-def _get_prompt_updates(
- self,
- mm_items: MultiModalDataItems,
- hf_processor_mm_kwargs: Mapping[str, object],
- out_mm_kwargs: MultiModalKwargs,
-) -> Sequence[PromptUpdate]:
- hf_config = self.info.get_hf_config()
- bos_token_id = hf_config.bos_token_id
- assert isinstance(bos_token_id, int)
-
- tokenizer = self.info.get_tokenizer()
- eot_token_id = tokenizer.bos_token_id
- assert isinstance(eot_token_id, int)
-
- def get_replacement_fuyu(item_idx: int):
- images = mm_items.get_items("image", ImageProcessorItems)
- image_size = images.get_image_size(item_idx)
-
- ncols, nrows = self.info.get_image_feature_grid_size(
- image_width=image_size.width,
- image_height=image_size.height,
- )
- image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
- [_NEWLINE_TOKEN_ID]) * nrows
-
- return PromptUpdateDetails.select_token_id(
- image_tokens + [bos_token_id],
- embed_token_id=_IMAGE_TOKEN_ID,
- )
-
- return [
- PromptReplacement(
- modality="image",
- target=[eot_token_id],
- replacement=get_replacement_fuyu,
- )
- ]
-```
-
-:::
-
-::::
-
-## 5. Register processor-related classes
-
-After you have defined {class}`~vllm.multimodal.processing.BaseProcessingInfo` (Step 2),
-{class}`~vllm.multimodal.profiling.BaseDummyInputsBuilder` (Step 3),
-and {class}`~vllm.multimodal.processing.BaseMultiModalProcessor` (Step 4),
-decorate the model class with {meth}`MULTIMODAL_REGISTRY.register_processor`
-to register them to the multi-modal registry:
-
-```diff
- from vllm.model_executor.models.interfaces import SupportsMultiModal
-+ from vllm.multimodal import MULTIMODAL_REGISTRY
-
-+ @MULTIMODAL_REGISTRY.register_processor(YourMultiModalProcessor,
-+ info=YourProcessingInfo,
-+ dummy_inputs=YourDummyInputsBuilder)
- class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
-```
-
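-Putting Steps 2-4 together, a minimal skeleton might look like the following sketch. The class names are placeholders, the method bodies are elided, and the import paths mirror the cross-references above; the actual methods to implement are the ones described in the previous steps:
-
-```python
-from torch import nn
-
-from vllm.model_executor.models.interfaces import SupportsMultiModal
-from vllm.multimodal import MULTIMODAL_REGISTRY
-from vllm.multimodal.processing import BaseMultiModalProcessor, BaseProcessingInfo
-from vllm.multimodal.profiling import BaseDummyInputsBuilder
-
-
-class YourProcessingInfo(BaseProcessingInfo):
-    ...  # Step 2: HF processing info (e.g. placeholder feature token counts)
-
-
-class YourDummyInputsBuilder(BaseDummyInputsBuilder[YourProcessingInfo]):
-    ...  # Step 3: dummy inputs for memory profiling
-
-
-class YourMultiModalProcessor(BaseMultiModalProcessor[YourProcessingInfo]):
-    ...  # Step 4: _get_mm_fields_config, _get_prompt_updates, ...
-
-
-@MULTIMODAL_REGISTRY.register_processor(YourMultiModalProcessor,
-                                        info=YourProcessingInfo,
-                                        dummy_inputs=YourDummyInputsBuilder)
-class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
-    ...
-```
-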
-## Notes
-
-### Inserting feature tokens without replacement
-
-Some HF processors directly insert feature tokens without replacing anything in the original prompt. In that case, you can use {class}`~vllm.multimodal.processing.PromptInsertion` instead of {class}`~vllm.multimodal.processing.PromptReplacement` inside {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates`.
-
-Examples:
-
-- BLIP-2 (insert at start of prompt):
-- Florence2 (insert at start of prompt):
-- Molmo (insert after `<|endoftext|>` token):
-
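-For illustration, a prompt insertion might be defined like this. This is only a sketch: it reuses the LLaVA-style callback from above, and the `PromptIndexTargets.start()` target and the `insertion=` argument name are assumptions to be checked against the linked implementations:
-
-```python
-def _get_prompt_updates(
-    self,
-    mm_items: MultiModalDataItems,
-    hf_processor_mm_kwargs: Mapping[str, object],
-    out_mm_kwargs: MultiModalKwargs,
-) -> Sequence[PromptUpdate]:
-    hf_config = self.info.get_hf_config()
-    image_token_id = hf_config.image_token_index
-
-    def get_insertion(item_idx: int):
-        images = mm_items.get_items("image", ImageProcessorItems)
-
-        image_size = images.get_image_size(item_idx)
-        num_image_tokens = self.info.get_num_image_tokens(
-            image_width=image_size.width,
-            image_height=image_size.height,
-        )
-
-        # Feature tokens to insert for this image
-        return [image_token_id] * num_image_tokens
-
-    return [
-        # Insert the feature tokens at the start of the prompt
-        PromptInsertion(
-            modality="image",
-            target=PromptIndexTargets.start(),
-            insertion=get_insertion,
-        ),
-    ]
-```
-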
-### Handling prompt updates unrelated to multi-modal data
-
-{meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates` assumes that each application of prompt update corresponds to one multi-modal item. If the HF processor performs additional processing regardless of how many multi-modal items there are, you should override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_tokens_only` so that the processed token inputs are consistent with the result of applying the HF processor on text inputs. This is because token inputs bypass the HF processor according to [our design](#mm-processing).
-
-Examples:
-
-- Chameleon (appends `sep_token`):
-- Fuyu (appends `boa_token`):
-- Molmo (applies chat template which is not defined elsewhere):
-
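-For illustration, here is a minimal sketch of such an override for a processor that appends one extra token to text prompts. The `"<extra_token>"` lookup is a hypothetical placeholder; see the linked implementations for how each model obtains the actual token id:
-
-```python
-def _apply_hf_processor_tokens_only(
-    self,
-    prompt_tokens: list[int],
-) -> list[int]:
-    # The HF processor appends a trailing special token to text inputs,
-    # so we mirror that behavior for token inputs here.
-    tokenizer = self.info.get_tokenizer()
-    extra_token_id = tokenizer.convert_tokens_to_ids("<extra_token>")  # hypothetical
-
-    return prompt_tokens + [extra_token_id]
-```
-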
-### Custom HF processor
-
-Some models don't define a HF processor class on HF Hub. In that case, you can define a custom HF processor that has the same call signature as HF processors and pass it to {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._call_hf_processor`.
-
-Examples:
-
-- DeepSeek-VL2:
-- InternVL:
-- Qwen-VL:
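-
-As a rough sketch, such a processor is simply an object whose `__call__` mimics the HF processor interface: it accepts text and multi-modal data and returns a `BatchFeature`. Everything below (class name, constructor arguments, field handling) is a schematic placeholder rather than code taken from any of the models above:
-
-```python
-from transformers import BatchFeature
-
-
-class YourCustomProcessor:
-
-    def __init__(self, config, tokenizer, image_processor) -> None:
-        self.config = config
-        self.tokenizer = tokenizer
-        self.image_processor = image_processor
-
-    def __call__(
-        self,
-        text=None,
-        images=None,
-        return_tensors="pt",
-        **kwargs,
-    ) -> BatchFeature:
-        # Tokenize the text and preprocess the images, then pack everything
-        # into a single BatchFeature, just like a regular HF processor would.
-        text_inputs = self.tokenizer(text, return_tensors=return_tensors)
-        image_inputs = self.image_processor(images, return_tensors=return_tensors)
-        return BatchFeature({**text_inputs, **image_inputs})
-```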
diff --git a/docs/source/deployment/docker.md b/docs/source/deployment/docker.md
deleted file mode 100644
index ca56710bc2..0000000000
--- a/docs/source/deployment/docker.md
+++ /dev/null
@@ -1,133 +0,0 @@
-(deployment-docker)=
-
-# Using Docker
-
-(deployment-docker-pre-built-image)=
-
-## Use vLLM's Official Docker Image
-
-vLLM offers an official Docker image for deployment.
-The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
-
-```console
-$ docker run --runtime nvidia --gpus all \
- -v ~/.cache/huggingface:/root/.cache/huggingface \
- --env "HUGGING_FACE_HUB_TOKEN=" \
- -p 8000:8000 \
- --ipc=host \
- vllm/vllm-openai:latest \
- --model mistralai/Mistral-7B-v0.1
-```
-
-This image can also be used with other container engines such as [Podman](https://podman.io/).
-
-```console
-$ podman run --gpus all \
- -v ~/.cache/huggingface:/root/.cache/huggingface \
- --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
- -p 8000:8000 \
- --ipc=host \
- vllm/vllm-openai:latest \
- --model mistralai/Mistral-7B-v0.1
-```
-
-You can add any other engine arguments you need after the image tag (`vllm/vllm-openai:latest`).
-
-:::{note}
-You can use either the `--ipc=host` flag or the `--shm-size` flag to allow the
-container to access the host's shared memory. vLLM uses PyTorch, which uses shared
-memory to share data between processes under the hood, particularly for tensor parallel inference.
-:::
-
-:::{note}
-Optional dependencies are not included in order to avoid licensing issues (e.g. ).
-
-If you need to use those dependencies (having accepted the license terms),
-create a custom Dockerfile on top of the base image with an extra layer that installs them:
-
-```Dockerfile
-FROM vllm/vllm-openai:v0.8.3
-
-# e.g. install the `audio` optional dependencies
-# NOTE: Make sure the version of vLLM matches the base image!
-RUN uv pip install --system vllm[audio]==0.8.3
-```
-
-:::
-
-:::{tip}
-Some new models may only be available on the main branch of [HF Transformers](https://github.com/huggingface/transformers).
-
-To use the development version of `transformers`, create a custom Dockerfile on top of the base image
-with an extra layer that installs their code from source:
-
-```Dockerfile
-FROM vllm/vllm-openai:latest
-
-RUN uv pip install --system git+https://github.com/huggingface/transformers.git
-```
-
-:::
-
-(deployment-docker-build-image-from-source)=
-
-## Building vLLM's Docker Image from Source
-
-You can build and run vLLM from source via the provided `docker/Dockerfile`. To build vLLM:
-
-```console
-# optionally specify: --build-arg max_jobs=8 --build-arg nvcc_threads=2
-DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai --file docker/Dockerfile
-```
-
-:::{note}
-By default vLLM builds for all GPU types for the widest distribution. If you are only building for the
-GPU type of the machine you are running on, you can add the argument `--build-arg torch_cuda_arch_list=""`
-so that vLLM detects the current GPU type and builds for it.
-
-If you are using Podman instead of Docker, you might need to disable SELinux labeling by
-adding `--security-opt label=disable` when running the `podman build` command to avoid certain [existing issues](https://github.com/containers/buildah/discussions/4184).
-:::
-
-## Building for Arm64/aarch64
-
-A Docker container can be built for aarch64 systems such as the Nvidia Grace-Hopper. At the time of writing, this requires the use
-of PyTorch Nightly and should be considered **experimental**. Using the flag `--platform "linux/arm64"` will attempt to build for arm64.
-
-:::{note}
-Multiple modules must be compiled, so this process can take a while. We recommend using the `--build-arg max_jobs=` & `--build-arg nvcc_threads=`
-flags to speed up the build process. However, ensure that `max_jobs` is substantially larger than `nvcc_threads` to get the most benefit.
-Keep an eye on memory usage when using parallel jobs, as it can be substantial (see the example below).
-:::
-
-```console
-# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
-$ python3 use_existing_torch.py
-$ DOCKER_BUILDKIT=1 docker build . \
- --file docker/Dockerfile \
- --target vllm-openai \
- --platform "linux/arm64" \
- -t vllm/vllm-gh200-openai:latest \
- --build-arg max_jobs=66 \
- --build-arg nvcc_threads=2 \
- --build-arg torch_cuda_arch_list="9.0+PTX" \
- --build-arg vllm_fa_cmake_gpu_arches="90-real"
-```
-
-## Use the custom-built vLLM Docker image
-
-To run vLLM with the custom-built Docker image:
-
-```console
-$ docker run --runtime nvidia --gpus all \
- -v ~/.cache/huggingface:/root/.cache/huggingface \
- -p 8000:8000 \
- --env "HUGGING_FACE_HUB_TOKEN=" \
- vllm/vllm-openai
-```
-
-The argument `vllm/vllm-openai` specifies the image to run, and should be replaced with the name of the custom-built image (the `-t` tag from the build command).
-
-:::{note}
-**For version 0.4.1 and 0.4.2 only** - the vLLM Docker images under these versions must be run as the root user, because a library located under the root user's home directory (`/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`) needs to be loaded at runtime. If you are running the container as a different user, you may need to first change the permissions of the library (and all of its parent directories) to allow that user to access it, and then run vLLM with the environment variable `VLLM_NCCL_SO_PATH=/root/.config/vllm/nccl/cu12/libnccl.so.2.18.1`.
-:::
diff --git a/docs/source/deployment/frameworks/helm.md b/docs/source/deployment/frameworks/helm.md
deleted file mode 100644
index 7320d727fb..0000000000
--- a/docs/source/deployment/frameworks/helm.md
+++ /dev/null
@@ -1,250 +0,0 @@
-(deployment-helm)=
-
-# Helm
-
-A Helm chart to deploy vLLM for Kubernetes
-
-Helm is a package manager for Kubernetes. It helps you deploy vLLM on Kubernetes and automate the deployment of vLLM applications. With Helm, you can deploy the same framework architecture with different configurations to multiple namespaces by overriding variable values.
-
-This guide walks you through the process of deploying vLLM with Helm, including the necessary prerequisites, the steps for installing the chart, and documentation of the architecture and the values file.
-
-## Prerequisites
-
-Before you begin, ensure that you have the following:
-
-- A running Kubernetes cluster
-- NVIDIA Kubernetes Device Plugin (`k8s-device-plugin`): This can be found at [https://github.com/NVIDIA/k8s-device-plugin](https://github.com/NVIDIA/k8s-device-plugin)
-- Available GPU resources in your cluster
-- An S3 bucket containing the model that will be deployed
-
-## Installing the chart
-
-To install the chart with the release name `test-vllm`:
-
-```console
-helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . \
-  -f values.yaml \
-  --set secrets.s3endpoint=$ACCESS_POINT \
-  --set secrets.s3bucketname=$BUCKET \
-  --set secrets.s3accesskeyid=$ACCESS_KEY \
-  --set secrets.s3accesskey=$SECRET_KEY
-```
-
-## Uninstalling the Chart
-
-To uninstall the `test-vllm` deployment:
-
-```console
-helm uninstall test-vllm --namespace=ns-vllm
-```
-
-The command removes all the Kubernetes components associated with the
-chart, **including persistent volumes**, and deletes the release.
-
-## Architecture
-
-:::{image} /assets/deployment/architecture_helm_deployment.png
-:::
-
-## Values
-
-:::{list-table}
-:widths: 25 25 25 25
-:header-rows: 1
-
-- * Key
- * Type
- * Default
- * Description
-- * autoscaling
- * object
- * {"enabled":false,"maxReplicas":100,"minReplicas":1,"targetCPUUtilizationPercentage":80}
- * Autoscaling configuration
-- * autoscaling.enabled
- * bool
- * false
- * Enable autoscaling
-- * autoscaling.maxReplicas
- * int
- * 100
- * Maximum replicas
-- * autoscaling.minReplicas
- * int
- * 1
- * Minimum replicas
-- * autoscaling.targetCPUUtilizationPercentage
- * int
- * 80
- * Target CPU utilization for autoscaling
-- * configs
- * object
- * {}
- * Configmap
-- * containerPort
- * int
- * 8000
- * Container port
-- * customObjects
- * list
- * []
- * Custom Objects configuration
-- * deploymentStrategy
- * object
- * {}
- * Deployment strategy configuration
-- * externalConfigs
- * list
- * []
- * External configuration
-- * extraContainers
- * list
- * []
- * Additional containers configuration
-- * extraInit
- * object
- * {"pvcStorage":"1Gi","s3modelpath":"relative_s3_model_path/opt-125m", "awsEc2MetadataDisabled": true}
- * Additional configuration for the init container
-- * extraInit.pvcStorage
- * string
- * "50Gi"
- * Storage size of the PVC used to store the model downloaded from S3
-- * extraInit.s3modelpath
- * string
- * "relative_s3_model_path/opt-125m"
- * Path of the model on the s3 which hosts model weights and config files
-- * extraInit.awsEc2MetadataDisabled
- * boolean
- * true
- * Disables the use of the Amazon EC2 instance metadata service
-- * extraPorts
- * list
- * []
- * Additional ports configuration
-- * gpuModels
- * list
- * ["TYPE_GPU_USED"]
- * Type of gpu used
-- * image
- * object
- * {"command":["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"],"repository":"vllm/vllm-openai","tag":"latest"}
- * Image configuration
-- * image.command
- * list
- * ["vllm","serve","/data/","--served-model-name","opt-125m","--host","0.0.0.0","--port","8000"]
- * Container launch command
-- * image.repository
- * string
- * "vllm/vllm-openai"
- * Image repository
-- * image.tag
- * string
- * "latest"
- * Image tag
-- * livenessProbe
- * object
- * {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":15,"periodSeconds":10}
- * Liveness probe configuration
-- * livenessProbe.failureThreshold
- * int
- * 3
- * Number of consecutive probe failures after which Kubernetes considers the overall check to have failed: the container is not alive
-- * livenessProbe.httpGet
- * object
- * {"path":"/health","port":8000}
- * Configuration of the Kubelet http request on the server
-- * livenessProbe.httpGet.path
- * string
- * "/health"
- * Path to access on the HTTP server
-- * livenessProbe.httpGet.port
- * int
- * 8000
- * Name or number of the port to access on the container, on which the server is listening
-- * livenessProbe.initialDelaySeconds
- * int
- * 15
- * Number of seconds after the container has started before liveness probe is initiated
-- * livenessProbe.periodSeconds
- * int
- * 10
- * How often (in seconds) to perform the liveness probe
-- * maxUnavailablePodDisruptionBudget
- * string
- * ""
- * Disruption Budget Configuration
-- * readinessProbe
- * object
- * {"failureThreshold":3,"httpGet":{"path":"/health","port":8000},"initialDelaySeconds":5,"periodSeconds":5}
- * Readiness probe configuration
-- * readinessProbe.failureThreshold
- * int
- * 3
- * Number of consecutive probe failures after which Kubernetes considers the overall check to have failed: the container is not ready
-- * readinessProbe.httpGet
- * object
- * {"path":"/health","port":8000}
- * Configuration of the Kubelet http request on the server
-- * readinessProbe.httpGet.path
- * string
- * "/health"
- * Path to access on the HTTP server
-- * readinessProbe.httpGet.port
- * int
- * 8000
- * Name or number of the port to access on the container, on which the server is listening
-- * readinessProbe.initialDelaySeconds
- * int
- * 5
- * Number of seconds after the container has started before readiness probe is initiated
-- * readinessProbe.periodSeconds
- * int
- * 5
- * How often (in seconds) to perform the readiness probe
-- * replicaCount
- * int
- * 1
- * Number of replicas
-- * resources
- * object
- * {"limits":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1},"requests":{"cpu":4,"memory":"16Gi","nvidia.com/gpu":1}}
- * Resource configuration
-- * resources.limits."nvidia.com/gpu"
- * int
- * 1
- * Number of gpus used
-- * resources.limits.cpu
- * int
- * 4
- * Number of CPUs
-- * resources.limits.memory
- * string
- * "16Gi"
- * CPU memory configuration
-- * resources.requests."nvidia.com/gpu"
- * int
- * 1
- * Number of gpus used
-- * resources.requests.cpu
- * int
- * 4
- * Number of CPUs
-- * resources.requests.memory
- * string
- * "16Gi"
- * CPU memory configuration
-- * secrets
- * object
- * {}
- * Secrets configuration
-- * serviceName
- * string
- *
- * Service name
-- * servicePort
- * int
- * 80
- * Service port
-- * labels.environment
- * string
- * test
- * Environment name
-- * labels.release
- * string
- * test
- * Release name
-:::
diff --git a/docs/source/deployment/frameworks/index.md b/docs/source/deployment/frameworks/index.md
deleted file mode 100644
index 3408c6c10e..0000000000
--- a/docs/source/deployment/frameworks/index.md
+++ /dev/null
@@ -1,22 +0,0 @@
-# Using other frameworks
-
-:::{toctree}
-:maxdepth: 1
-
-anything-llm
-bentoml
-cerebrium
-chatbox
-dify
-dstack
-helm
-litellm
-lobe-chat
-lws
-modal
-open-webui
-retrieval_augmented_generation
-skypilot
-streamlit
-triton
-:::
diff --git a/docs/source/deployment/integrations/index.md b/docs/source/deployment/integrations/index.md
deleted file mode 100644
index 410742b88c..0000000000
--- a/docs/source/deployment/integrations/index.md
+++ /dev/null
@@ -1,11 +0,0 @@
-# External Integrations
-
-:::{toctree}
-:maxdepth: 1
-
-kserve
-kubeai
-llamastack
-llmaz
-production-stack
-:::
diff --git a/docs/source/features/compatibility_matrix.md b/docs/source/features/compatibility_matrix.md
deleted file mode 100644
index 8865d26dea..0000000000
--- a/docs/source/features/compatibility_matrix.md
+++ /dev/null
@@ -1,476 +0,0 @@
-(compatibility-matrix)=
-
-# Compatibility Matrix
-
-The tables below show which features are mutually exclusive with one another, and which features are supported on each hardware platform.
-
-The symbols used have the following meanings:
-
-- ✅ = Full compatibility
-- 🟠 = Partial compatibility
-- ❌ = No compatibility
-
-:::{note}
-Click on a ❌ or 🟠 that has a link to see the tracking issue for the unsupported feature/hardware combination.
-:::
-
-## Feature x Feature
-
-:::{raw} html
-
-:::
-
-:::{list-table}
-:header-rows: 1
-:stub-columns: 1
-:widths: auto
-:class: vertical-table-header
-
-- * Feature
- * [CP](#chunked-prefill)
- * [APC](#automatic-prefix-caching)
- * [LoRA](#lora-adapter)
- * prmpt adptr
- * [SD](#spec-decode)
- * CUDA graph
- * pooling
- * enc-dec
- * logP
- * prmpt logP
- * async output
- * multi-step
- * mm
- * best-of
- * beam-search
- * guided dec
-- * [CP](#chunked-prefill)
- * ✅
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
-- * [APC](#automatic-prefix-caching)
- * ✅
- * ✅
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
-- * [LoRA](#lora-adapter)
- * ✅
- * ✅
- * ✅
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
-- * prmpt adptr
- * ✅
- * ✅
- * ✅
- * ✅
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
-- * [SD](#spec-decode)
- * ✅
- * ✅
- * ❌
- * ✅
- * ✅
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
-- * CUDA graph
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- *
- *
- *
- *
- *
- *
- *
- *
- *
- *
-- * pooling
- * ❌
- * ❌
- * ❌
- * ❌
- * ❌
- * ❌
- * ✅
- *
- *
- *
- *
- *
- *
- *
- *
- *
-- * enc-dec
- * ❌
- * [❌](gh-issue:7366)
- * ❌
- * ❌
- * [❌](gh-issue:7366)
- * ✅
- * ✅
- * ✅
- *
- *
- *
- *
- *
- *
- *
- *
-- * logP
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ❌
- * ✅
- * ✅
- *
- *
- *
- *
- *
- *
- *
-- * prmpt logP
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ❌
- * ✅
- * ✅
- * ✅
- *
- *
- *
- *
- *
- *
-- * async output
- * ✅
- * ✅
- * ✅
- * ✅
- * ❌
- * ✅
- * ❌
- * ❌
- * ✅
- * ✅
- * ✅
- *
- *
- *
- *
- *
-- * multi-step
- * ❌
- * ✅
- * ❌
- * ✅
- * ❌
- * ✅
- * ❌
- * ❌
- * ✅
- * ✅
- * ✅
- * ✅
- *
- *
- *
- *
-- * mm
- * ✅
- * [🟠](gh-pr:8348)
- * [🟠](gh-pr:4194)
- * ❔
- * ❔
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ❔
- * ✅
- *
- *
- *
-- * best-of
- * ✅
- * ✅
- * ✅
- * ✅
- * [❌](gh-issue:6137)
- * ✅
- * ❌
- * ✅
- * ✅
- * ✅
- * ❔
- * [❌](gh-issue:7968)
- * ✅
- * ✅
- *
- *
-- * beam-search
- * ✅
- * ✅
- * ✅
- * ✅
- * [❌](gh-issue:6137)
- * ✅
- * ❌
- * ✅
- * ✅
- * ✅
- * ❔
- * [❌](gh-issue:7968)
- * ❔
- * ✅
- * ✅
- *
-- * guided dec
- * ✅
- * ✅
- * ❔
- * ❔
- * [❌](gh-issue:11484)
- * ✅
- * ❌
- * ❔
- * ✅
- * ✅
- * ✅
- * [❌](gh-issue:9893)
- * ❔
- * ✅
- * ✅
- * ✅
-:::
-
-(feature-x-hardware)=
-
-## Feature x Hardware
-
-:::{list-table}
-:header-rows: 1
-:stub-columns: 1
-:widths: auto
-
-- * Feature
- * Volta
- * Turing
- * Ampere
- * Ada
- * Hopper
- * CPU
- * AMD
-- * [CP](#chunked-prefill)
- * [❌](gh-issue:2729)
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
-- * [APC](#automatic-prefix-caching)
- * [❌](gh-issue:3687)
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
-- * [LoRA](#lora-adapter)
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
-- * prmpt adptr
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * [❌](gh-issue:8475)
- * ✅
-- * [SD](#spec-decode)
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
-- * CUDA graph
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ❌
- * ✅
-- * pooling
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ❔
-- * enc-dec
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ❌
-- * mm
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
-- * logP
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
-- * prmpt logP
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
-- * async output
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ❌
- * ❌
-- * multi-step
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * [❌](gh-issue:8477)
- * ✅
-- * best-of
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
-- * beam-search
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
-- * guided dec
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
- * ✅
-:::
diff --git a/docs/source/features/quantization/index.md b/docs/source/features/quantization/index.md
deleted file mode 100644
index 7ad46b7094..0000000000
--- a/docs/source/features/quantization/index.md
+++ /dev/null
@@ -1,24 +0,0 @@
-(quantization-index)=
-
-# Quantization
-
-Quantization trades off model precision for a smaller memory footprint, allowing large models to be run on a wider range of devices.
-
-:::{toctree}
-:caption: Contents
-:maxdepth: 1
-
-supported_hardware
-auto_awq
-bnb
-bitblas
-gguf
-gptqmodel
-int4
-int8
-fp8
-modelopt
-quark
-quantized_kvcache
-torchao
-:::
diff --git a/docs/source/features/quantization/supported_hardware.md b/docs/source/features/quantization/supported_hardware.md
deleted file mode 100644
index f8af1ba60b..0000000000
--- a/docs/source/features/quantization/supported_hardware.md
+++ /dev/null
@@ -1,153 +0,0 @@
-(quantization-supported-hardware)=
-
-# Supported Hardware
-
-The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
-
-:::{list-table}
-:header-rows: 1
-:widths: 20 8 8 8 8 8 8 8 8 8 8
-
-- * Implementation
- * Volta
- * Turing
- * Ampere
- * Ada
- * Hopper
- * AMD GPU
- * Intel GPU
- * x86 CPU
- * AWS Inferentia
- * Google TPU
-- * AWQ
- * ❌
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ❌
- * ✅︎
- * ✅︎
- * ❌
- * ❌
-- * GPTQ
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ❌
- * ✅︎
- * ✅︎
- * ❌
- * ❌
-- * Marlin (GPTQ/AWQ/FP8)
- * ❌
- * ❌
- * ✅︎
- * ✅︎
- * ✅︎
- * ❌
- * ❌
- * ❌
- * ❌
- * ❌
-- * INT8 (W8A8)
- * ❌
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ❌
- * ❌
- * ✅︎
- * ❌
- * ✅︎
-- * FP8 (W8A8)
- * ❌
- * ❌
- * ❌
- * ✅︎
- * ✅︎
- * ✅︎
- * ❌
- * ❌
- * ❌
- * ❌
-- * BitBLAS (GPTQ)
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ❌
- * ❌
- * ❌
- * ❌
- * ❌
-- * AQLM
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ❌
- * ❌
- * ❌
- * ❌
- * ❌
-- * bitsandbytes
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ❌
- * ❌
- * ❌
- * ❌
- * ❌
-- * DeepSpeedFP
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ❌
- * ❌
- * ❌
- * ❌
- * ❌
-- * GGUF
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ❌
- * ❌
- * ❌
- * ❌
-- * modelopt
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ✅︎
- * ❌
- * ❌
- * ❌
- * ❌
- * ❌
-:::
-
-- Volta refers to SM 7.0, Turing to SM 7.5, Ampere to SM 8.0/8.6, Ada to SM 8.9, and Hopper to SM 9.0.
-- ✅︎ indicates that the quantization method is supported on the specified hardware.
-- ❌ indicates that the quantization method is not supported on the specified hardware.
-
-:::{note}
-This compatibility chart is subject to change as vLLM continues to evolve and expand its support for different hardware platforms and quantization methods.
-
-For the most up-to-date information on hardware support and quantization methods, please refer to or consult with the vLLM development team.
-:::
diff --git a/docs/source/generate_examples.py b/docs/source/generate_examples.py
deleted file mode 100644
index f77dbefb0a..0000000000
--- a/docs/source/generate_examples.py
+++ /dev/null
@@ -1,244 +0,0 @@
-# SPDX-License-Identifier: Apache-2.0
-
-import itertools
-import re
-from dataclasses import dataclass, field
-from pathlib import Path
-
-ROOT_DIR = Path(__file__).parent.parent.parent.resolve()
-ROOT_DIR_RELATIVE = '../../../..'
-EXAMPLE_DIR = ROOT_DIR / "examples"
-EXAMPLE_DOC_DIR = ROOT_DIR / "docs/source/getting_started/examples"
-
-
-def fix_case(text: str) -> str:
- subs = {
- "api": "API",
- "cli": "CLI",
- "cpu": "CPU",
- "llm": "LLM",
- "mae": "MAE",
- "tpu": "TPU",
- "aqlm": "AQLM",
- "gguf": "GGUF",
- "lora": "LoRA",
- "rlhf": "RLHF",
- "vllm": "vLLM",
- "openai": "OpenAI",
- "lmcache": "LMCache",
- "multilora": "MultiLoRA",
- "mlpspeculator": "MLPSpeculator",
- r"fp\d+": lambda x: x.group(0).upper(), # e.g. fp16, fp32
- r"int\d+": lambda x: x.group(0).upper(), # e.g. int8, int16
- }
- for pattern, repl in subs.items():
- text = re.sub(rf'\b{pattern}\b', repl, text, flags=re.IGNORECASE)
- return text
-
-
-@dataclass
-class Index:
- """
- Index class to generate a structured document index.
-
- Attributes:
- path (Path): The path save the index file to.
- title (str): The title of the index.
- description (str): A brief description of the index.
- caption (str): An optional caption for the table of contents.
- maxdepth (int): The maximum depth of the table of contents. Defaults to 1.
- documents (list[str]): A list of document paths to include in the index. Defaults to an empty list.
-
- Methods:
- generate() -> str:
- Generates the index content as a string in the specified format.
- """ # noqa: E501
- path: Path
- title: str
- description: str
- caption: str
- maxdepth: int = 1
- documents: list[str] = field(default_factory=list)
-
- def generate(self) -> str:
- content = f"# {self.title}\n\n{self.description}\n\n"
- content += ":::{toctree}\n"
- content += f":caption: {self.caption}\n:maxdepth: {self.maxdepth}\n"
- content += "\n".join(self.documents) + "\n:::\n"
- return content
-
-
-@dataclass
-class Example:
- """
- Example class for generating documentation content from a given path.
-
- Attributes:
- path (Path): The path to the main directory or file.
- category (str): The category of the document.
- main_file (Path): The main file in the directory.
- other_files (list[Path]): list of other files in the directory.
- title (str): The title of the document.
-
- Methods:
- __post_init__(): Initializes the main_file, other_files, and title attributes.
- determine_main_file() -> Path: Determines the main file in the given path.
- determine_other_files() -> list[Path]: Determines other files in the directory excluding the main file.
- determine_title() -> str: Determines the title of the document.
- generate() -> str: Generates the documentation content.
- """ # noqa: E501
- path: Path
- category: str = None
- main_file: Path = field(init=False)
- other_files: list[Path] = field(init=False)
- title: str = field(init=False)
-
- def __post_init__(self):
- self.main_file = self.determine_main_file()
- self.other_files = self.determine_other_files()
- self.title = self.determine_title()
-
- def determine_main_file(self) -> Path:
- """
- Determines the main file in the given path.
- If the path is a file, it returns the path itself. Otherwise, it searches
- for Markdown files (*.md) in the directory and returns the first one found.
- Returns:
- Path: The main file path, either the original path if it's a file or the first
- Markdown file found in the directory.
- Raises:
- IndexError: If no Markdown files are found in the directory.
- """ # noqa: E501
- return self.path if self.path.is_file() else list(
- self.path.glob("*.md")).pop()
-
- def determine_other_files(self) -> list[Path]:
- """
- Determine other files in the directory excluding the main file.
-
- This method checks if the given path is a file. If it is, it returns an empty list.
- Otherwise, it recursively searches through the directory and returns a list of all
- files that are not the main file.
-
- Returns:
- list[Path]: A list of Path objects representing the other files in the directory.
- """ # noqa: E501
- if self.path.is_file():
- return []
- is_other_file = lambda file: file.is_file() and file != self.main_file
- return [file for file in self.path.rglob("*") if is_other_file(file)]
-
- def determine_title(self) -> str:
- return fix_case(self.path.stem.replace("_", " ").title())
-
- def generate(self) -> str:
- # Convert the path to a relative path from __file__
- make_relative = lambda path: ROOT_DIR_RELATIVE / path.relative_to(
- ROOT_DIR)
-
- content = f"Source .\n\n"
- include = "include" if self.main_file.suffix == ".md" else \
- "literalinclude"
- if include == "literalinclude":
- content += f"# {self.title}\n\n"
- content += f":::{{{include}}} {make_relative(self.main_file)}\n"
- if include == "literalinclude":
- content += f":language: {self.main_file.suffix[1:]}\n"
- content += ":::\n\n"
-
- if not self.other_files:
- return content
-
- content += "## Example materials\n\n"
- for file in sorted(self.other_files):
- include = "include" if file.suffix == ".md" else "literalinclude"
- content += f":::{{admonition}} {file.relative_to(self.path)}\n"
- content += ":class: dropdown\n\n"
- content += f":::{{{include}}} {make_relative(file)}\n:::\n"
- content += ":::\n\n"
-
- return content
-
-
-def generate_examples():
- # Create the EXAMPLE_DOC_DIR if it doesn't exist
- if not EXAMPLE_DOC_DIR.exists():
- EXAMPLE_DOC_DIR.mkdir(parents=True)
-
- # Create empty indices
- examples_index = Index(
- path=EXAMPLE_DOC_DIR / "examples_index.md",
- title="Examples",
- description=
- "A collection of examples demonstrating usage of vLLM.\nAll documented examples are autogenerated using from examples found in .", # noqa: E501
- caption="Examples",
- maxdepth=2)
- # Category indices stored in reverse order because they are inserted into
- # examples_index.documents at index 0 in order
- category_indices = {
- "other":
- Index(
- path=EXAMPLE_DOC_DIR / "examples_other_index.md",
- title="Other",
- description=
- "Other examples that don't strongly fit into the online or offline serving categories.", # noqa: E501
- caption="Examples",
- ),
- "online_serving":
- Index(
- path=EXAMPLE_DOC_DIR / "examples_online_serving_index.md",
- title="Online Serving",
- description=
- "Online serving examples demonstrate how to use vLLM in an online setting, where the model is queried for predictions in real-time.", # noqa: E501
- caption="Examples",
- ),
- "offline_inference":
- Index(
- path=EXAMPLE_DOC_DIR / "examples_offline_inference_index.md",
- title="Offline Inference",
- description=
- "Offline inference examples demonstrate how to use vLLM in an offline setting, where the model is queried for predictions in batches. We recommend starting with .", # noqa: E501
- caption="Examples",
- ),
- }
-
- examples = []
- glob_patterns = ["*.py", "*.md", "*.sh"]
- # Find categorised examples
- for category in category_indices:
- category_dir = EXAMPLE_DIR / category
- globs = [category_dir.glob(pattern) for pattern in glob_patterns]
- for path in itertools.chain(*globs):
- examples.append(Example(path, category))
- # Find examples in subdirectories
- for path in category_dir.glob("*/*.md"):
- examples.append(Example(path.parent, category))
- # Find uncategorised examples
- globs = [EXAMPLE_DIR.glob(pattern) for pattern in glob_patterns]
- for path in itertools.chain(*globs):
- examples.append(Example(path))
- # Find examples in subdirectories
- for path in EXAMPLE_DIR.glob("*/*.md"):
- # Skip categorised examples
- if path.parent.name in category_indices:
- continue
- examples.append(Example(path.parent))
-
- # Generate the example documentation
- for example in sorted(examples, key=lambda e: e.path.stem):
- doc_path = EXAMPLE_DOC_DIR / f"{example.path.stem}.md"
- with open(doc_path, "w+") as f:
- f.write(example.generate())
- # Add the example to the appropriate index
- index = category_indices.get(example.category, examples_index)
- index.documents.append(example.path.stem)
-
- # Generate the index files
- for category_index in category_indices.values():
- if category_index.documents:
- examples_index.documents.insert(0, category_index.path.name)
- with open(category_index.path, "w+") as f:
- f.write(category_index.generate())
-
- with open(examples_index.path, "w+") as f:
- f.write(examples_index.generate())
diff --git a/docs/source/getting_started/installation.md b/docs/source/getting_started/installation.md
deleted file mode 100644
index 44134bf01b..0000000000
--- a/docs/source/getting_started/installation.md
+++ /dev/null
@@ -1,28 +0,0 @@
-(installation-index)=
-
-# Installation
-
-vLLM supports the following hardware platforms:
-
-:::{toctree}
-:maxdepth: 1
-:hidden:
-
-installation/gpu
-installation/cpu
-installation/ai_accelerator
-:::
-
-- GPU
-  - NVIDIA CUDA
-  - AMD ROCm
-  - Intel XPU
-- CPU
-  - Intel/AMD x86
-  - ARM AArch64
-  - Apple silicon
-  - IBM Z (S390X)
-- Other AI accelerators
-  - Google TPU
-  - Intel Gaudi
-  - AWS Neuron
diff --git a/docs/source/getting_started/installation/ai_accelerator.md b/docs/source/getting_started/installation/ai_accelerator.md
deleted file mode 100644
index 0a207af1a4..0000000000
--- a/docs/source/getting_started/installation/ai_accelerator.md
+++ /dev/null
@@ -1,299 +0,0 @@
-# Other AI accelerators
-
-vLLM is a Python library that supports the following AI accelerators. Select your AI accelerator type to see vendor specific instructions:
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:selected:
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-:::::
-
-## Requirements
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "## Requirements"
-:end-before: "## Configure a new environment"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "## Requirements"
-:end-before: "## Configure a new environment"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "## Requirements"
-:end-before: "## Configure a new environment"
-:::
-
-::::
-
-:::::
-
-## Configure a new environment
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "## Configure a new environment"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "## Configure a new environment"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "## Configure a new environment"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-:::::
-
-## Set up using Python
-
-### Pre-built wheels
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-:::
-
-::::
-
-:::::
-
-### Build wheel from source
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-:::::
-
-## Set up using Docker
-
-### Pre-built images
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-:::
-
-::::
-
-:::::
-
-### Build image from source
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Extra information"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Extra information"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Extra information"
-:::
-
-::::
-
-:::::
-
-## Extra information
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} Google TPU
-:sync: tpu
-
-:::{include} ai_accelerator/tpu.inc.md
-:start-after: "## Extra information"
-:::
-
-::::
-
-::::{tab-item} Intel Gaudi
-:sync: hpu-gaudi
-
-:::{include} ai_accelerator/hpu-gaudi.inc.md
-:start-after: "## Extra information"
-:::
-
-::::
-
-::::{tab-item} AWS Neuron
-:sync: neuron
-
-:::{include} ai_accelerator/neuron.inc.md
-:start-after: "## Extra information"
-:::
-
-::::
-
-:::::
diff --git a/docs/source/getting_started/installation/cpu/arm.inc.md b/docs/source/getting_started/installation/cpu/arm.inc.md
deleted file mode 100644
index e7d8d60630..0000000000
--- a/docs/source/getting_started/installation/cpu/arm.inc.md
+++ /dev/null
@@ -1,34 +0,0 @@
-# Installation
-
-vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform.
-
-The ARM CPU backend currently supports the Float32, FP16 and BFloat16 data types.
-
-:::{attention}
-There are no pre-built wheels or images for this device, so you must build vLLM from source.
-:::
-
-## Requirements
-
-- OS: Linux
-- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
-- Instruction Set Architecture (ISA): NEON support is required
-
-## Set up using Python
-
-### Pre-built wheels
-
-### Build wheel from source
-
-:::{include} cpu/build.inc.md
-:::
-
-Testing has been conducted on AWS Graviton3 instances for compatibility.
-
-## Set up using Docker
-
-### Pre-built images
-
-### Build image from source
-
-## Extra information
diff --git a/docs/source/getting_started/installation/cpu/x86.inc.md b/docs/source/getting_started/installation/cpu/x86.inc.md
deleted file mode 100644
index 9ae2035db5..0000000000
--- a/docs/source/getting_started/installation/cpu/x86.inc.md
+++ /dev/null
@@ -1,41 +0,0 @@
-# Installation
-
-vLLM provides initial support for basic model inference and serving on the x86 CPU platform, with data types FP32, FP16 and BF16.
-
-:::{attention}
-There are no pre-built wheels or images for this device, so you must build vLLM from source.
-:::
-
-## Requirements
-
-- OS: Linux
-- Compiler: `gcc/g++ >= 12.3.0` (optional, recommended)
-- Instruction Set Architecture (ISA): AVX512 (optional, recommended)
-
-:::{tip}
-[Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch) extends PyTorch with up-to-date feature optimizations for an extra performance boost on Intel hardware.
-:::
-
-## Set up using Python
-
-### Pre-built wheels
-
-### Build wheel from source
-
-:::{include} cpu/build.inc.md
-:::
-
-:::{note}
-- AVX512_BF16 is an ISA extension that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
-- If you want to force-enable AVX512_BF16 for cross-compilation, set the environment variable `VLLM_CPU_AVX512BF16=1` before building.
-:::
-
-## Set up using Docker
-
-### Pre-built images
-
-See [https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo](https://gallery.ecr.aws/q9t5s3a7/vllm-cpu-release-repo)
-
-### Build image from source
-
-## Extra information
diff --git a/docs/source/getting_started/installation/gpu.md b/docs/source/getting_started/installation/gpu.md
deleted file mode 100644
index 22db992354..0000000000
--- a/docs/source/getting_started/installation/gpu.md
+++ /dev/null
@@ -1,301 +0,0 @@
-# GPU
-
-vLLM is a Python library that supports the following GPU variants. Select your GPU type to see vendor specific instructions:
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:selected:
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-:::{include} gpu/rocm.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-:::{include} gpu/xpu.inc.md
-:start-after: "# Installation"
-:end-before: "## Requirements"
-:::
-
-::::
-
-:::::
-
-## Requirements
-
-- OS: Linux
-- Python: 3.9 -- 3.12
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-:::{include} gpu/rocm.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-:::{include} gpu/xpu.inc.md
-:start-after: "## Requirements"
-:end-before: "## Set up using Python"
-:::
-
-::::
-
-:::::
-
-## Set up using Python
-
-### Create a new Python environment
-
-:::{include} python_env_setup.inc.md
-:::
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "## Create a new Python environment"
-:end-before: "### Pre-built wheels"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-There is no extra information on creating a new Python environment for this device.
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-There is no extra information on creating a new Python environment for this device.
-
-::::
-
-:::::
-
-### Pre-built wheels
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-:::{include} gpu/rocm.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-:::
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-:::{include} gpu/xpu.inc.md
-:start-after: "### Pre-built wheels"
-:end-before: "### Build wheel from source"
-:::
-
-::::
-
-:::::
-
-(build-from-source)=
-
-### Build wheel from source
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-:::{include} gpu/rocm.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-:::{include} gpu/xpu.inc.md
-:start-after: "### Build wheel from source"
-:end-before: "## Set up using Docker"
-:::
-
-::::
-
-:::::
-
-## Set up using Docker
-
-### Pre-built images
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-:::{include} gpu/rocm.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-:::
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-:::{include} gpu/xpu.inc.md
-:start-after: "### Pre-built images"
-:end-before: "### Build image from source"
-:::
-
-::::
-
-:::::
-
-### Build image from source
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Supported features"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-:::{include} gpu/rocm.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Supported features"
-:::
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-:::{include} gpu/xpu.inc.md
-:start-after: "### Build image from source"
-:end-before: "## Supported features"
-:::
-
-::::
-
-:::::
-
-## Supported features
-
-:::::{tab-set}
-:sync-group: device
-
-::::{tab-item} NVIDIA CUDA
-:sync: cuda
-
-:::{include} gpu/cuda.inc.md
-:start-after: "## Supported features"
-:::
-
-::::
-
-::::{tab-item} AMD ROCm
-:sync: rocm
-
-:::{include} gpu/rocm.inc.md
-:start-after: "## Supported features"
-:::
-
-::::
-
-::::{tab-item} Intel XPU
-:sync: xpu
-
-:::{include} gpu/xpu.inc.md
-:start-after: "## Supported features"
-:::
-
-::::
-
-:::::
diff --git a/docs/source/index.md b/docs/source/index.md
deleted file mode 100644
index db2192e87d..0000000000
--- a/docs/source/index.md
+++ /dev/null
@@ -1,217 +0,0 @@
-# Welcome to vLLM
-
-:::{figure} ./assets/logos/vllm-logo-text-light.png
-:align: center
-:alt: vLLM
-:class: no-scaled-link
-:width: 60%
-:::
-
-:::{raw} html
-
-Easy, fast, and cheap LLM serving for everyone
-
-
-:::
-
-vLLM is a fast and easy-to-use library for LLM inference and serving.
-
-Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
-
-vLLM is fast with:
-
-- State-of-the-art serving throughput
-- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
-- Continuous batching of incoming requests
-- Fast model execution with CUDA/HIP graph
-- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
-- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
-- Speculative decoding
-- Chunked prefill
-
-vLLM is flexible and easy to use with:
-
-- Seamless integration with popular HuggingFace models
-- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
-- Tensor parallelism and pipeline parallelism support for distributed inference
-- Streaming outputs
-- OpenAI-compatible API server
-- Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
-- Prefix caching support
-- Multi-lora support
-
-For more information, check out the following:
-
-- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
-- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
-- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
-- [vLLM Meetups](#meetups)
-
-## Documentation
-
-% How to start using vLLM?
-
-:::{toctree}
-:caption: Getting Started
-:maxdepth: 1
-
-getting_started/installation
-getting_started/quickstart
-getting_started/examples/examples_index
-getting_started/troubleshooting
-getting_started/faq
-getting_started/v1_user_guide
-
-:::
-
-% What does vLLM support?
-
-:::{toctree}
-:caption: Models
-:maxdepth: 1
-
-models/supported_models
-models/generative_models
-models/pooling_models
-models/extensions/index
-:::
-
-% Additional capabilities
-
-:::{toctree}
-:caption: Features
-:maxdepth: 1
-
-features/quantization/index
-features/multimodal_inputs
-features/prompt_embeds
-features/lora
-features/tool_calling
-features/reasoning_outputs
-features/structured_outputs
-features/automatic_prefix_caching
-features/disagg_prefill
-features/spec_decode
-features/compatibility_matrix
-:::
-
-% Details about running vLLM
-
-:::{toctree}
-:caption: Training
-:maxdepth: 1
-
-training/trl.md
-training/rlhf.md
-
-:::
-
-:::{toctree}
-:caption: Inference and Serving
-:maxdepth: 1
-
-serving/offline_inference
-serving/openai_compatible_server
-serving/serve_args
-serving/distributed_serving
-serving/metrics
-serving/engine_args
-serving/env_vars
-serving/usage_stats
-serving/integrations/index
-:::
-
-% Scaling up vLLM for production
-
-:::{toctree}
-:caption: Deployment
-:maxdepth: 1
-
-deployment/security
-deployment/docker
-deployment/k8s
-deployment/nginx
-deployment/frameworks/index
-deployment/integrations/index
-:::
-
-% Making the most out of vLLM
-
-:::{toctree}
-:caption: Performance
-:maxdepth: 1
-
-performance/optimization
-performance/benchmarks
-:::
-
-% Explanation of vLLM internals
-
-:::{toctree}
-:caption: Design Documents
-:maxdepth: 2
-
-design/arch_overview
-design/huggingface_integration
-design/plugin_system
-design/kernel/paged_attention
-design/mm_processing
-design/automatic_prefix_caching
-design/multiprocessing
-:::
-
-:::{toctree}
-:caption: V1 Design Documents
-:maxdepth: 2
-
-design/v1/torch_compile
-design/v1/prefix_caching
-design/v1/metrics
-:::
-
-% How to contribute to the vLLM project
-
-:::{toctree}
-:caption: Developer Guide
-:maxdepth: 2
-
-contributing/overview
-contributing/deprecation_policy
-contributing/profiling/profiling_index
-contributing/dockerfile/dockerfile
-contributing/model/index
-contributing/vulnerability_management
-:::
-
-% Technical API specifications
-
-:::{toctree}
-:caption: API Reference
-:maxdepth: 2
-
-api/summary
-api/vllm/vllm
-:::
-
-% Latest news and acknowledgements
-
-:::{toctree}
-:caption: Community
-:maxdepth: 1
-
-community/blog
-community/meetups
-community/sponsors
-:::
-
-## Indices and tables
-
-- {ref}`genindex`
-- {ref}`modindex`
diff --git a/docs/source/models/extensions/index.md b/docs/source/models/extensions/index.md
deleted file mode 100644
index cdcdaa5b35..0000000000
--- a/docs/source/models/extensions/index.md
+++ /dev/null
@@ -1,9 +0,0 @@
-# Built-in Extensions
-
-:::{toctree}
-:maxdepth: 1
-
-runai_model_streamer
-tensorizer
-fastsafetensor
-:::
diff --git a/docs/source/models/supported_models.md b/docs/source/models/supported_models.md
deleted file mode 100644
index 6022dfb9c2..0000000000
--- a/docs/source/models/supported_models.md
+++ /dev/null
@@ -1,1406 +0,0 @@
-(supported-models)=
-
-# Supported Models
-
-vLLM supports [generative](generative-models) and [pooling](pooling-models) models across various tasks.
-If a model supports more than one task, you can set the task via the `--task` argument.
-
-For each task, we list the model architectures that have been implemented in vLLM.
-Alongside each architecture, we include some popular models that use it.
-
-## Model Implementation
-
-### vLLM
-
-If vLLM natively supports a model, its implementation can be found in .
-
-These models are what we list in and .
-
-(transformers-backend)=
-
-### Transformers
-
-vLLM also supports model implementations that are available in Transformers. This does not currently work for all models, but most decoder language models are supported, and vision language model support is planned!
-
-To check if the modeling backend is Transformers, you can simply do this:
-
-```python
-from vllm import LLM
-llm = LLM(model=..., task="generate") # Name or path of your model
-llm.apply_model(lambda model: print(type(model)))
-```
-
-If it is `TransformersForCausalLM` then it means it's based on Transformers!
-
-:::{tip}
-You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"` for offline inference or `--model-impl transformers` for the OpenAI-compatible server.
-:::
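-
-For example, a minimal sketch (the model name is a placeholder):
-
-```python
-from vllm import LLM
-
-# Explicitly opt in to the Transformers modeling backend
-llm = LLM(model="your-org/your-model", model_impl="transformers")
-llm.apply_model(lambda model: print(type(model)))  # expect TransformersForCausalLM
-```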
-
-:::{note}
-vLLM may not fully optimise the Transformers implementation, so you may see degraded performance when comparing a native model to a Transformers model in vLLM.
-:::
-
-#### Custom models
-
-If a model is neither supported natively by vLLM nor Transformers, it can still be used in vLLM!
-
-For a model to be compatible with the Transformers backend for vLLM it must:
-
-- be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)):
- * The model directory must have the correct structure (e.g. `config.json` is present).
- * `config.json` must contain `auto_map.AutoModel`.
-- be compatible with the Transformers backend for vLLM (see [Writing custom models](#writing-custom-models)):
- * Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).
-
-If the compatible model is:
-
-- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for [offline inference](#offline-inference) or `--trust-remote-code` for [online serving](#openai-compatible-server).
-- in a local directory, simply pass the directory path to `model=` for [offline inference](#offline-inference) or to `vllm serve` for [online serving](#openai-compatible-server).
-
-This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
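-
-For example, a minimal sketch covering both cases (the repository name and local path below are placeholders):
-
-```python
-from vllm import LLM
-
-# Compatible custom model hosted on the Hugging Face Model Hub
-llm = LLM(model="your-org/your-custom-model", trust_remote_code=True)
-
-# Compatible custom model stored in a local directory
-llm = LLM(model="/path/to/your-custom-model")
-```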
-
-(writing-custom-models)=
-
-#### Writing custom models
-
-This section details the modifications needed to make a Transformers compatible custom model work with the Transformers backend for vLLM. (We assume that such a custom model has already been created; see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models).)
-
-To make your model compatible with the Transformers backend, it needs:
-
-1. `kwargs` passed down through all modules from `MyModel` to `MyAttention`.
-2. `MyAttention` must use `ALL_ATTENTION_FUNCTIONS` to call attention.
-3. `MyModel` must contain `_supports_attention_backend = True`.
-
-```{code-block} python
-:caption: modeling_my_model.py
-
-from transformers import PreTrainedModel
-# ALL_ATTENTION_FUNCTIONS is the registry of attention implementations;
-# in recent Transformers versions it can be imported from transformers.modeling_utils.
-from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS
-from torch import nn
-
-class MyAttention(nn.Module):
-
- def forward(self, hidden_states, **kwargs):
- ...
- attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
- attn_output, attn_weights = attention_interface(
- self,
- query_states,
- key_states,
- value_states,
- **kwargs,
- )
- ...
-
-class MyModel(PreTrainedModel):
- _supports_attention_backend = True
-```
-
-Here is what happens in the background when this model is loaded:
-
-1. The config is loaded.
-2. The `MyModel` Python class is loaded from the `auto_map` in the config, and we check that the model `is_backend_compatible()`.
-3. `MyModel` is loaded into `TransformersForCausalLM`, which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.
-
-That's it!
-
-For your model to be compatible with vLLM's tensor parallel and/or pipeline parallel features, you must add `base_model_tp_plan` and/or `base_model_pp_plan` to your model's config class:
-
-```{code-block} python
-:caption: configuration_my_model.py
-
-from transformers import PretrainedConfig
-
-class MyConfig(PretrainedConfig):
- base_model_tp_plan = {
- "layers.*.self_attn.k_proj": "colwise",
- "layers.*.self_attn.v_proj": "colwise",
- "layers.*.self_attn.o_proj": "rowwise",
- "layers.*.mlp.gate_proj": "colwise",
- "layers.*.mlp.up_proj": "colwise",
- "layers.*.mlp.down_proj": "rowwise",
- }
- base_model_pp_plan = {
- "embed_tokens": (["input_ids"], ["inputs_embeds"]),
- "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
- "norm": (["hidden_states"], ["hidden_states"]),
- }
-```
-
-- `base_model_tp_plan` is a `dict` that maps fully qualified layer name patterns to tensor parallel styles (currently only `"colwise"` and `"rowwise"` are supported).
-- `base_model_pp_plan` is a `dict` that maps direct child layer names to `tuple`s of `list`s of `str`s:
- * You only need to do this for layers which are not present on all pipeline stages
- * vLLM assumes that there will be only one `nn.ModuleList`, which is distributed across the pipeline stages
- * The `list` in the first element of the `tuple` contains the names of the input arguments
- * The `list` in the last element of the `tuple` contains the names of the variables the layer outputs to in your modeling code
-
-## Loading a Model
-
-### Hugging Face Hub
-
-By default, vLLM loads models from [Hugging Face (HF) Hub](https://huggingface.co/models). To change the download path for models, you can set the `HF_HOME` environment variable; for more details, refer to [their official documentation](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hfhome).
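-
-For example, a minimal sketch of setting it from Python before vLLM (or `huggingface_hub`) is imported; the path is just an example, and exporting `HF_HOME` in your shell works the same way:
-
-```python
-import os
-
-# Example only: redirect the Hugging Face download/cache directory.
-os.environ["HF_HOME"] = "/data/huggingface"
-```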
-
-To determine whether a given model is natively supported, you can check the `config.json` file inside the HF repository.
-If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.
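-
-For example, a quick way to inspect the `"architectures"` field without downloading the model weights (the model name is just an example taken from the tables below):
-
-```python
-from transformers import AutoConfig
-
-# Only fetches config.json, not the weights
-config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
-print(config.architectures)  # e.g. ['LlamaForCausalLM']
-```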
-
-Models do not _need_ to be natively supported to be used in vLLM.
-The [Transformers backend](#transformers-backend) enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
-
-:::{tip}
-The easiest way to check if your model is really supported at runtime is to run the program below:
-
-```python
-from vllm import LLM
-
-# For generative models (task=generate) only
-llm = LLM(model=..., task="generate") # Name or path of your model
-output = llm.generate("Hello, my name is")
-print(output)
-
-# For pooling models (task={embed,classify,reward,score}) only
-llm = LLM(model=..., task="embed") # Name or path of your model
-output = llm.encode("Hello, my name is")
-print(output)
-```
-
-If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
-:::
-
-Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
-Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
-
-#### Download a model
-
-If you prefer, you can use the Hugging Face CLI to [download a model](https://huggingface.co/docs/huggingface_hub/guides/cli#huggingface-cli-download) or specific files from a model repository:
-
-```console
-# Download a model
-huggingface-cli download HuggingFaceH4/zephyr-7b-beta
-
-# Specify a custom cache directory
-huggingface-cli download HuggingFaceH4/zephyr-7b-beta --cache-dir ./path/to/cache
-
-# Download a specific file from a model repo
-huggingface-cli download HuggingFaceH4/zephyr-7b-beta eval_results.json
-```
-
-#### List the downloaded models
-
-Use the Hugging Face CLI to [manage models](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#scan-your-cache) stored in local cache:
-
-```console
-# List cached models
-huggingface-cli scan-cache
-
-# Show detailed (verbose) output
-huggingface-cli scan-cache -v
-
-# Specify a custom cache directory
-huggingface-cli scan-cache --dir ~/.cache/huggingface/hub
-```
-
-#### Delete a cached model
-
-Use the Hugging Face CLI to interactively [delete downloaded models](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#clean-your-cache) from the cache:
-
-```console
-# The `delete-cache` command requires extra dependencies to work with the TUI.
-# Please run `pip install huggingface_hub[cli]` to install them.
-
-# Launch the interactive TUI to select models to delete
-$ huggingface-cli delete-cache
-? Select revisions to delete: 1 revisions selected counting for 438.9M.
- ○ None of the following (if selected, nothing will be deleted).
-Model BAAI/bge-base-en-v1.5 (438.9M, used 1 week ago)
-❯ ◉ a5beb1e3: main # modified 1 week ago
-
-Model BAAI/bge-large-en-v1.5 (1.3G, used 1 week ago)
- ○ d4aa6901: main # modified 1 week ago
-
-Model BAAI/bge-reranker-base (1.1G, used 4 weeks ago)
- ○ 2cfc18c9: main # modified 4 weeks ago
-
-Press <space> to select, <enter> to validate and <ctrl+c> to quit without modification.
-
-# You need to confirm after selecting
-? Select revisions to delete: 1 revision(s) selected.
-? 1 revisions selected counting for 438.9M. Confirm deletion ? Yes
-Start deletion.
-Done. Deleted 1 repo(s) and 0 revision(s) for a total of 438.9M.
-```
-
-#### Using a proxy
-
-Here are some tips for loading/downloading models from Hugging Face using a proxy:
-
-- Set the proxy globally for your session (or set it in the profile file):
-
-```shell
-export http_proxy=http://your.proxy.server:port
-export https_proxy=http://your.proxy.server:port
-```
-
-- Set the proxy for just the current command:
-
-```shell
-https_proxy=http://your.proxy.server:port huggingface-cli download <model_name>
-
-# or use the vllm command directly
-https_proxy=http://your.proxy.server:port vllm serve <model_name> --disable-log-requests
-```
-
-- Set the proxy in the Python interpreter:
-
-```python
-import os
-
-os.environ['http_proxy'] = 'http://your.proxy.server:port'
-os.environ['https_proxy'] = 'http://your.proxy.server:port'
-```
-
-### ModelScope
-
-To use models from [ModelScope](https://www.modelscope.cn) instead of Hugging Face Hub, set an environment variable:
-
-```shell
-export VLLM_USE_MODELSCOPE=True
-```
-
-Then load models with `trust_remote_code=True`:
-
-```python
-from vllm import LLM
-
-llm = LLM(model=..., revision=..., task=..., trust_remote_code=True)
-
-# For generative models (task=generate) only
-output = llm.generate("Hello, my name is")
-print(output)
-
-# For pooling models (task={embed,classify,reward,score}) only
-output = llm.encode("Hello, my name is")
-print(output)
-```
-
-(feature-status-legend)=
-
-## Feature Status Legend
-
-- ✅︎ indicates that the feature is supported for the model.
-
-- 🚧 indicates that the feature is planned but not yet supported for the model.
-
-- ⚠️ indicates that the feature is available but may have known issues or limitations.
-
-(supported-text-models)=
-
-## List of Text-only Language Models
-
-### Generative Models
-
-See [this page](#generative-models) for more information on how to use generative models.
-
-#### Text Generation
-
-Specified using `--task generate`.
-
-:::{list-table}
-:widths: 25 25 50 5 5
-:header-rows: 1
-
-- * Architecture
- * Models
- * Example HF Models
- * [LoRA](#lora-adapter)
- * [PP](#distributed-serving)
-- * `AquilaForCausalLM`
- * Aquila, Aquila2
- * `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.
- * ✅︎
- * ✅︎
-- * `ArcticForCausalLM`
- * Arctic
- * `Snowflake/snowflake-arctic-base`, `Snowflake/snowflake-arctic-instruct`, etc.
- *
- * ✅︎
-- * `BaiChuanForCausalLM`
- * Baichuan2, Baichuan
- * `baichuan-inc/Baichuan2-13B-Chat`, `baichuan-inc/Baichuan-7B`, etc.
- * ✅︎
- * ✅︎
-- * `BambaForCausalLM`
- * Bamba
- * `ibm-ai-platform/Bamba-9B-fp8`, `ibm-ai-platform/Bamba-9B`
- *
- *
-- * `BloomForCausalLM`
- * BLOOM, BLOOMZ, BLOOMChat
- * `bigscience/bloom`, `bigscience/bloomz`, etc.
- *
- * ✅︎
-- * `BartForConditionalGeneration`
- * BART
- * `facebook/bart-base`, `facebook/bart-large-cnn`, etc.
- *
- *
-- * `ChatGLMModel`, `ChatGLMForConditionalGeneration`
- * ChatGLM
- * `THUDM/chatglm2-6b`, `THUDM/chatglm3-6b`, `ShieldLM-6B-chatglm3`, etc.
- * ✅︎
- * ✅︎
-- * `CohereForCausalLM`, `Cohere2ForCausalLM`
- * Command-R
- * `CohereForAI/c4ai-command-r-v01`, `CohereForAI/c4ai-command-r7b-12-2024`, etc.
- * ✅︎
- * ✅︎
-- * `DbrxForCausalLM`
- * DBRX
- * `databricks/dbrx-base`, `databricks/dbrx-instruct`, etc.
- *
- * ✅︎
-- * `DeciLMForCausalLM`
- * DeciLM
- * `nvidia/Llama-3_3-Nemotron-Super-49B-v1`, etc.
- *
- * ✅︎
-- * `DeepseekForCausalLM`
- * DeepSeek
- * `deepseek-ai/deepseek-llm-67b-base`, `deepseek-ai/deepseek-llm-7b-chat` etc.
- *
- * ✅︎
-- * `DeepseekV2ForCausalLM`
- * DeepSeek-V2
- * `deepseek-ai/DeepSeek-V2`, `deepseek-ai/DeepSeek-V2-Chat` etc.
- *
- * ✅︎
-- * `DeepseekV3ForCausalLM`
- * DeepSeek-V3
- * `deepseek-ai/DeepSeek-V3-Base`, `deepseek-ai/DeepSeek-V3` etc.
- *
- * ✅︎
-- * `ExaoneForCausalLM`
- * EXAONE-3
- * `LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`, etc.
- * ✅︎
- * ✅︎
-- * `FalconForCausalLM`
- * Falcon
- * `tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.
- *
- * ✅︎
-- * `FalconMambaForCausalLM`
- * FalconMamba
- * `tiiuae/falcon-mamba-7b`, `tiiuae/falcon-mamba-7b-instruct`, etc.
- * ✅︎
- * ✅︎
-- * `FalconH1ForCausalLM`
- * Falcon-H1
- * `tiiuae/Falcon-H1-34B-Base`, `tiiuae/Falcon-H1-34B-Instruct`, etc.
- * ✅︎
- * ✅︎
-- * `GemmaForCausalLM`
- * Gemma
- * `google/gemma-2b`, `google/gemma-1.1-2b-it`, etc.
- * ✅︎
- * ✅︎
-- * `Gemma2ForCausalLM`
- * Gemma 2
- * `google/gemma-2-9b`, `google/gemma-2-27b`, etc.
- * ✅︎
- * ✅︎
-- * `Gemma3ForCausalLM`
- * Gemma 3
- * `google/gemma-3-1b-it`, etc.
- * ✅︎
- * ✅︎
-- * `GlmForCausalLM`
- * GLM-4
- * `THUDM/glm-4-9b-chat-hf`, etc.
- * ✅︎
- * ✅︎
-- * `Glm4ForCausalLM`
- * GLM-4-0414
- * `THUDM/GLM-4-32B-0414`, etc.
- * ✅︎
- * ✅︎
-- * `GPT2LMHeadModel`
- * GPT-2
- * `gpt2`, `gpt2-xl`, etc.
- *
- * ✅︎
-- * `GPTBigCodeForCausalLM`
- * StarCoder, SantaCoder, WizardCoder
- * `bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, `WizardLM/WizardCoder-15B-V1.0`, etc.
- * ✅︎
- * ✅︎
-- * `GPTJForCausalLM`
- * GPT-J
- * `EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.
- *
- * ✅︎
-- * `GPTNeoXForCausalLM`
- * GPT-NeoX, Pythia, OpenAssistant, Dolly V2, StableLM
- * `EleutherAI/gpt-neox-20b`, `EleutherAI/pythia-12b`, `OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.
- *
- * ✅︎
-- * `GraniteForCausalLM`
- * Granite 3.0, Granite 3.1, PowerLM
- * `ibm-granite/granite-3.0-2b-base`, `ibm-granite/granite-3.1-8b-instruct`, `ibm/PowerLM-3b`, etc.
- * ✅︎
- * ✅︎
-- * `GraniteMoeForCausalLM`
- * Granite 3.0 MoE, PowerMoE
- * `ibm-granite/granite-3.0-1b-a400m-base`, `ibm-granite/granite-3.0-3b-a800m-instruct`, `ibm/PowerMoE-3b`, etc.
- * ✅︎
- * ✅︎
-- * `GraniteMoeHybridForCausalLM`
- * Granite 4.0 MoE Hybrid
- * `ibm-granite/granite-4.0-tiny-preview`, etc.
- * ✅︎
- * ✅︎
-- * `GraniteMoeSharedForCausalLM`
- * Granite MoE Shared
- * `ibm-research/moe-7b-1b-active-shared-experts` (test model)
- * ✅︎
- * ✅︎
-- * `GritLM`
- * GritLM
- * `parasail-ai/GritLM-7B-vllm`.
- * ✅︎
- * ✅︎
-- * `Grok1ModelForCausalLM`
- * Grok1
- * `hpcai-tech/grok-1`.
- * ✅︎
- * ✅︎
-- * `InternLMForCausalLM`
- * InternLM
- * `internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.
- * ✅︎
- * ✅︎
-- * `InternLM2ForCausalLM`
- * InternLM2
- * `internlm/internlm2-7b`, `internlm/internlm2-chat-7b`, etc.
- * ✅︎
- * ✅︎
-- * `InternLM3ForCausalLM`
- * InternLM3
- * `internlm/internlm3-8b-instruct`, etc.
- * ✅︎
- * ✅︎
-- * `JAISLMHeadModel`
- * Jais
- * `inceptionai/jais-13b`, `inceptionai/jais-13b-chat`, `inceptionai/jais-30b-v3`, `inceptionai/jais-30b-chat-v3`, etc.
- *
- * ✅︎
-- * `JambaForCausalLM`
- * Jamba
- * `ai21labs/AI21-Jamba-1.5-Large`, `ai21labs/AI21-Jamba-1.5-Mini`, `ai21labs/Jamba-v0.1`, etc.
- * ✅︎
- * ✅︎
-- * `LlamaForCausalLM`
- * Llama 3.1, Llama 3, Llama 2, LLaMA, Yi
- * `meta-llama/Meta-Llama-3.1-405B-Instruct`, `meta-llama/Meta-Llama-3.1-70B`, `meta-llama/Meta-Llama-3-70B-Instruct`, `meta-llama/Llama-2-70b-hf`, `01-ai/Yi-34B`, etc.
- * ✅︎
- * ✅︎
-- * `MambaForCausalLM`
- * Mamba
- * `state-spaces/mamba-130m-hf`, `state-spaces/mamba-790m-hf`, `state-spaces/mamba-2.8b-hf`, etc.
- *
- * ✅︎
-- * `MiniCPMForCausalLM`
- * MiniCPM
- * `openbmb/MiniCPM-2B-sft-bf16`, `openbmb/MiniCPM-2B-dpo-bf16`, `openbmb/MiniCPM-S-1B-sft`, etc.
- * ✅︎
- * ✅︎
-- * `MiniCPM3ForCausalLM`
- * MiniCPM3
- * `openbmb/MiniCPM3-4B`, etc.
- * ✅︎
- * ✅︎
-- * `MistralForCausalLM`
- * Mistral, Mistral-Instruct
- * `mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.
- * ✅︎
- * ✅︎
-- * `MixtralForCausalLM`
- * Mixtral-8x7B, Mixtral-8x7B-Instruct
- * `mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, `mistral-community/Mixtral-8x22B-v0.1`, etc.
- * ✅︎
- * ✅︎
-- * `MPTForCausalLM`
- * MPT, MPT-Instruct, MPT-Chat, MPT-StoryWriter
- * `mosaicml/mpt-7b`, `mosaicml/mpt-7b-storywriter`, `mosaicml/mpt-30b`, etc.
- *
- * ✅︎
-- * `NemotronForCausalLM`
- * Nemotron-3, Nemotron-4, Minitron
- * `nvidia/Minitron-8B-Base`, `mgoin/Nemotron-4-340B-Base-hf-FP8`, etc.
- * ✅︎
- * ✅︎
-- * `OLMoForCausalLM`
- * OLMo
- * `allenai/OLMo-1B-hf`, `allenai/OLMo-7B-hf`, etc.
- *
- * ✅︎
-- * `OLMo2ForCausalLM`
- * OLMo2
- * `allenai/OLMo-2-0425-1B`, etc.
- *
- * ✅︎
-- * `OLMoEForCausalLM`
- * OLMoE
- * `allenai/OLMoE-1B-7B-0924`, `allenai/OLMoE-1B-7B-0924-Instruct`, etc.
- * ✅︎
- * ✅︎
-- * `OPTForCausalLM`
- * OPT, OPT-IML
- * `facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.
- *
- * ✅︎
-- * `OrionForCausalLM`
- * Orion
- * `OrionStarAI/Orion-14B-Base`, `OrionStarAI/Orion-14B-Chat`, etc.
- *
- * ✅︎
-- * `PhiForCausalLM`
- * Phi
- * `microsoft/phi-1_5`, `microsoft/phi-2`, etc.
- * ✅︎
- * ✅︎
-- * `Phi3ForCausalLM`
- * Phi-4, Phi-3
- * `microsoft/Phi-4-mini-instruct`, `microsoft/Phi-4`, `microsoft/Phi-3-mini-4k-instruct`, `microsoft/Phi-3-mini-128k-instruct`, `microsoft/Phi-3-medium-128k-instruct`, etc.
- * ✅︎
- * ✅︎
-- * `Phi3SmallForCausalLM`
- * Phi-3-Small
- * `microsoft/Phi-3-small-8k-instruct`, `microsoft/Phi-3-small-128k-instruct`, etc.
- *
- * ✅︎
-- * `PhiMoEForCausalLM`
- * Phi-3.5-MoE
- * `microsoft/Phi-3.5-MoE-instruct`, etc.
- * ✅︎
- * ✅︎
-- * `PersimmonForCausalLM`
- * Persimmon
- * `adept/persimmon-8b-base`, `adept/persimmon-8b-chat`, etc.
- *
- * ✅︎
-- * `Plamo2ForCausalLM`
- * PLaMo2
- * `pfnet/plamo-2-1b`, `pfnet/plamo-2-8b`, etc.
- *
- *
-- * `QWenLMHeadModel`
- * Qwen
- * `Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.
- * ✅︎
- * ✅︎
-- * `Qwen2ForCausalLM`
- * QwQ, Qwen2
- * `Qwen/QwQ-32B-Preview`, `Qwen/Qwen2-7B-Instruct`, `Qwen/Qwen2-7B`, etc.
- * ✅︎
- * ✅︎
-- * `Qwen2MoeForCausalLM`
- * Qwen2MoE
- * `Qwen/Qwen1.5-MoE-A2.7B`, `Qwen/Qwen1.5-MoE-A2.7B-Chat`, etc.
- *
- * ✅︎
-- * `Qwen3ForCausalLM`
- * Qwen3
- * `Qwen/Qwen3-8B`, etc.
- * ✅︎
- * ✅︎
-- * `Qwen3MoeForCausalLM`
- * Qwen3MoE
- * `Qwen/Qwen3-30B-A3B`, etc.
- *
- * ✅︎
-- * `StableLmForCausalLM`
- * StableLM
- * `stabilityai/stablelm-3b-4e1t`, `stabilityai/stablelm-base-alpha-7b-v2`, etc.
- *
- * ✅︎
-- * `Starcoder2ForCausalLM`
- * Starcoder2
- * `bigcode/starcoder2-3b`, `bigcode/starcoder2-7b`, `bigcode/starcoder2-15b`, etc.
- *
- * ✅︎
-- * `SolarForCausalLM`
- * Solar Pro
- * `upstage/solar-pro-preview-instruct`, etc.
- * ✅︎
- * ✅︎
-- * `TeleChat2ForCausalLM`
- * TeleChat2
- * `Tele-AI/TeleChat2-3B`, `Tele-AI/TeleChat2-7B`, `Tele-AI/TeleChat2-35B`, etc.
- * ✅︎
- * ✅︎
-- * `TeleFLMForCausalLM`
- * TeleFLM
- * `CofeAI/FLM-2-52B-Instruct-2407`, `CofeAI/Tele-FLM`, etc.
- * ✅︎
- * ✅︎
-- * `XverseForCausalLM`
- * XVERSE
- * `xverse/XVERSE-7B-Chat`, `xverse/XVERSE-13B-Chat`, `xverse/XVERSE-65B-Chat`, etc.
- * ✅︎
- * ✅︎
-- * `MiniMaxText01ForCausalLM`
- * MiniMax-Text
- * `MiniMaxAI/MiniMax-Text-01`, etc.
- *
- * ✅︎
-- * `Zamba2ForCausalLM`
- * Zamba2
- * `Zyphra/Zamba2-7B-instruct`, `Zyphra/Zamba2-2.7B-instruct`, `Zyphra/Zamba2-1.2B-instruct`, etc.
- *
- *
-- * `MiMoForCausalLM`
- * MiMo
- * `XiaomiMiMo/MiMo-7B-RL`, etc.
- *
- *
-:::
-
-:::{note}
-Currently, the ROCm version of vLLM supports Mistral and Mixtral only for context lengths up to 4096.
-:::
-
-### Pooling Models
-
-See [this page](pooling-models) for more information on how to use pooling models.
-
-:::{important}
-Since some model architectures support both generative and pooling tasks,
-you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
-:::
-
-#### Text Embedding
-
-Specified using `--task embed`.
-
-:::{list-table}
-:widths: 25 25 50 5 5
-:header-rows: 1
-
-- * Architecture
- * Models
- * Example HF Models
- * [LoRA](#lora-adapter)
- * [PP](#distributed-serving)
-- * `BertModel`
- * BERT-based
- * `BAAI/bge-base-en-v1.5`, `Snowflake/snowflake-arctic-embed-xs`, etc.
- *
- *
-- * `Gemma2Model`
- * Gemma 2-based
- * `BAAI/bge-multilingual-gemma2`, etc.
- *
- * ✅︎
-- * `GritLM`
- * GritLM
- * `parasail-ai/GritLM-7B-vllm`.
- * ✅︎
- * ✅︎
-- * `GteModel`
- * Arctic-Embed-2.0-M
- * `Snowflake/snowflake-arctic-embed-m-v2.0`.
- *
-  *
-- * `GteNewModel`
- * mGTE-TRM (see note)
- * `Alibaba-NLP/gte-multilingual-base`, etc.
-  *
-  *
-- * `ModernBertModel`
- * ModernBERT-based
- * `Alibaba-NLP/gte-modernbert-base`, etc.
-  *
-  *
-- * `NomicBertModel`
- * Nomic BERT
- * `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc.
-  *
-  *
-- * `LlamaModel`, `LlamaForCausalLM`, `MistralModel`, etc.
- * Llama-based
- * `intfloat/e5-mistral-7b-instruct`, etc.
- * ✅︎
- * ✅︎
-- * `Qwen2Model`, `Qwen2ForCausalLM`
- * Qwen2-based
- * `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc.
- * ✅︎
- * ✅︎
-- * `RobertaModel`, `RobertaForMaskedLM`
- * RoBERTa-based
- * `sentence-transformers/all-roberta-large-v1`, etc.
- *
- *
-- * `XLMRobertaModel`
- * XLM-RoBERTa-based
- * `intfloat/multilingual-e5-large`, `jinaai/jina-reranker-v2-base-multilingual`, `Snowflake/snowflake-arctic-embed-l-v2.0`, `jinaai/jina-embeddings-v3`(see note), etc.
- *
- *
-:::
-
-:::{note}
-`ssmits/Qwen2-7B-Instruct-embed-base` has an improperly defined Sentence Transformers config.
-You should manually set mean pooling by passing `--override-pooler-config '{"pooling_type": "MEAN"}'`.
-:::
-
-:::{note}
-The HF implementation of `Alibaba-NLP/gte-Qwen2-1.5B-instruct` is hardcoded to use causal attention despite what is shown in `config.json`. To compare vLLM vs HF results,
-you should set `--hf-overrides '{"is_causal": true}'` in vLLM so that the two implementations are consistent with each other.
-
-For both the 1.5B and 7B variants, you also need to enable `--trust-remote-code` for the correct tokenizer to be loaded.
-See [relevant issue on HF Transformers](https://github.com/huggingface/transformers/issues/34882).
-:::
-
-:::{note}
-`jinaai/jina-embeddings-v3` supports multiple tasks through LoRA, while vLLM currently only supports the text-matching task by merging the LoRA weights.
-:::
-
-:::{note}
-The second-generation GTE model (mGTE-TRM) is named `NewModel`. Since the name `NewModel` is too generic, you should set `--hf-overrides '{"architectures": ["GteNewModel"]}'` to specify the use of the `GteNewModel` architecture.
-:::
-
-If your model is not in the above list, we will try to automatically convert the model using
-{func}`~vllm.model_executor.models.adapters.as_embedding_model`. By default, the embeddings
-of the whole prompt are extracted from the normalized hidden state corresponding to the last token.
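-
-For example, a minimal sketch of loading a generative model in embedding mode; if its architecture is not listed above, it is converted automatically:
-
-```python
-from vllm import LLM
-
-llm = LLM(model=..., task="embed")  # Name or path of your generative model
-output = llm.encode("Hello, my name is")
-print(output)
-```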
-
-#### Reward Modeling
-
-Specified using `--task reward`.
-
-:::{list-table}
-:widths: 25 25 50 5 5
-:header-rows: 1
-
-- * Architecture
- * Models
- * Example HF Models
- * [LoRA](#lora-adapter)
- * [PP](#distributed-serving)
-- * `InternLM2ForRewardModel`
- * InternLM2-based
- * `internlm/internlm2-1_8b-reward`, `internlm/internlm2-7b-reward`, etc.
- * ✅︎
- * ✅︎
-- * `LlamaForCausalLM`
- * Llama-based
- * `peiyi9979/math-shepherd-mistral-7b-prm`, etc.
- * ✅︎
- * ✅︎
-- * `Qwen2ForRewardModel`
- * Qwen2-based
- * `Qwen/Qwen2.5-Math-RM-72B`, etc.
- * ✅︎
- * ✅︎
-- * `Qwen2ForProcessRewardModel`
- * Qwen2-based
- * `Qwen/Qwen2.5-Math-PRM-7B`, `Qwen/Qwen2.5-Math-PRM-72B`, etc.
- * ✅︎
- * ✅︎
-:::
-
-If your model is not in the above list, we will try to automatically convert the model using
-{func}`~vllm.model_executor.models.adapters.as_reward_model`. By default, we return the hidden states of each token directly.
-
-:::{important}
-For process-supervised reward models such as `peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
-e.g.: `--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 123, "returned_token_ids": [456, 789]}'`.
-:::
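-
-The same pooling configuration can be set for offline inference. A minimal sketch, assuming the `override_pooler_config` engine argument and the `PoolerConfig` fields match your vLLM version (the token IDs are the placeholder values from the example above):
-
-```python
-from vllm import LLM
-from vllm.config import PoolerConfig
-
-llm = LLM(
-    model="peiyi9979/math-shepherd-mistral-7b-prm",
-    task="reward",
-    # Placeholder token IDs copied from the CLI example above
-    override_pooler_config=PoolerConfig(
-        pooling_type="STEP",
-        step_tag_id=123,
-        returned_token_ids=[456, 789],
-    ),
-)
-```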
-
-#### Classification
-
-Specified using `--task classify`.
-
-:::{list-table}
-:widths: 25 25 50 5 5
-:header-rows: 1
-
-- * Architecture
- * Models
- * Example HF Models
- * [LoRA](#lora-adapter)
- * [PP](#distributed-serving)
-- * `JambaForSequenceClassification`
- * Jamba
- * `ai21labs/Jamba-tiny-reward-dev`, etc.
- * ✅︎
- * ✅︎
-- * `Qwen2ForSequenceClassification`
- * Qwen2-based
- * `jason9693/Qwen2.5-1.5B-apeach`, etc.
- * ✅︎
- * ✅︎
-:::
-
-If your model is not in the above list, we will try to automatically convert the model using
-{func}`~vllm.model_executor.models.adapters.as_classification_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
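-
-For example, a minimal sketch of offline classification with a model from the table above, assuming the `LLM.classify` convenience API is available in your vLLM version (`llm.encode` works as well, as shown earlier):
-
-```python
-from vllm import LLM
-
-llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
-(output,) = llm.classify(["vLLM is wonderful!"])
-print(output.outputs.probs)  # class probabilities
-```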
-
-#### Sentence Pair Scoring
-
-Specified using `--task score`.
-
-:::{list-table}
-:widths: 25 25 50 5 5
-:header-rows: 1
-
-- * Architecture
- * Models
- * Example HF Models
- * [LoRA](#lora-adapter)
- * [PP](#distributed-serving)
-- * `BertForSequenceClassification`
- * BERT-based
- * `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc.
- *
- *
-- * `RobertaForSequenceClassification`
- * RoBERTa-based
- * `cross-encoder/quora-roberta-base`, etc.
- *
- *
-- * `XLMRobertaForSequenceClassification`
- * XLM-RoBERTa-based
- * `BAAI/bge-reranker-v2-m3`, etc.
- *
- *
-- * `ModernBertForSequenceClassification`
- * ModernBert-based
- * `Alibaba-NLP/gte-reranker-modernbert-base`, etc.
- *
- *
-:::
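-
-For example, a minimal sketch of scoring a sentence pair with a cross-encoder from the table above, assuming the `LLM.score` convenience API is available in your vLLM version:
-
-```python
-from vllm import LLM
-
-llm = LLM(model="cross-encoder/ms-marco-MiniLM-L-6-v2", task="score")
-(output,) = llm.score("What is the capital of France?",
-                      "The capital of France is Paris.")
-print(output.outputs.score)
-```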
-
-(supported-mm-models)=
-
-## List of Multimodal Language Models
-
-The following modalities are supported depending on the model:
-
-- **T**ext
-- **I**mage
-- **V**ideo
-- **A**udio
-
-Any combination of modalities joined by `+` are supported.
-
-- e.g.: `T + I` means that the model supports text-only, image-only, and text-with-image inputs.
-
-On the other hand, modalities separated by `/` are mutually exclusive.
-
-- e.g.: `T / I` means that the model supports text-only and image-only inputs, but not text-with-image inputs.
-
-See [this page](#multimodal-inputs) on how to pass multi-modal inputs to the model.
-
-:::{important}
-**To enable multiple multi-modal items per text prompt in vLLM V0**, you have to set `limit_mm_per_prompt` (offline inference)
-or `--limit-mm-per-prompt` (online serving). For example, to enable passing up to 4 images per text prompt:
-
-Offline inference:
-
-```python
-from vllm import LLM
-
-llm = LLM(
- model="Qwen/Qwen2-VL-7B-Instruct",
- limit_mm_per_prompt={"image": 4},
-)
-```
-
-Online serving:
-
-```bash
-vllm serve Qwen/Qwen2-VL-7B-Instruct --limit-mm-per-prompt '{"image":4}'
-```
-
-**This is no longer required if you are using vLLM V1.**
-
-:::
-
-:::{note}
-vLLM currently only supports adding LoRA to the language backbone of multimodal models.
-:::
-
-### Generative Models
-
-See [this page](#generative-models) for more information on how to use generative models.
-
-#### Text Generation
-
-Specified using `--task generate`.
-
-:::{list-table}
-:widths: 25 25 15 20 5 5 5
-:header-rows: 1
-
-- * Architecture
- * Models
- * Inputs
- * Example HF Models
- * [LoRA](#lora-adapter)
- * [PP](#distributed-serving)
- * [V1](gh-issue:8779)
-- * `AriaForConditionalGeneration`
- * Aria
- * T + I+
- * `rhymes-ai/Aria`
- *
- * ✅︎
- * ✅︎
-- * `AyaVisionForConditionalGeneration`
- * Aya Vision
- * T + I+
- * `CohereForAI/aya-vision-8b`, `CohereForAI/aya-vision-32b`, etc.
- *
- * ✅︎
- * ✅︎
-- * `Blip2ForConditionalGeneration`
- * BLIP-2
- * T + IE
- * `Salesforce/blip2-opt-2.7b`, `Salesforce/blip2-opt-6.7b`, etc.
- *
- * ✅︎
- * ✅︎
-- * `ChameleonForConditionalGeneration`
- * Chameleon
- * T + I
- * `facebook/chameleon-7b` etc.
- *
- * ✅︎
- * ✅︎
-- * `DeepseekVLV2ForCausalLM`^
- * DeepSeek-VL2
- * T + I+
- * `deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2` etc.
- *
- * ✅︎
- * ✅︎
-- * `Florence2ForConditionalGeneration`
- * Florence-2
- * T + I
- * `microsoft/Florence-2-base`, `microsoft/Florence-2-large` etc.
- *
- *
- *
-- * `FuyuForCausalLM`
- * Fuyu
- * T + I
- * `adept/fuyu-8b` etc.
- *
- * ✅︎
- * ✅︎
-- * `Gemma3ForConditionalGeneration`
- * Gemma 3
- * T + I+
- * `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc.
- * ✅︎
- * ✅︎
- * ⚠️
-- * `GLM4VForCausalLM`^
- * GLM-4V
- * T + I
- * `THUDM/glm-4v-9b`, `THUDM/cogagent-9b-20241220` etc.
- * ✅︎
- * ✅︎
- * ✅︎
-- * `GraniteSpeechForConditionalGeneration`
- * Granite Speech
- * T + A
- * `ibm-granite/granite-speech-3.3-8b`
- * ✅︎
- * ✅︎
- * ✅︎
-- * `H2OVLChatModel`
- * H2OVL
- * T + IE+
- * `h2oai/h2ovl-mississippi-800m`, `h2oai/h2ovl-mississippi-2b`, etc.
- *
- * ✅︎
- * ✅︎\*
-- * `Idefics3ForConditionalGeneration`
- * Idefics3
- * T + I
- * `HuggingFaceM4/Idefics3-8B-Llama3` etc.
- * ✅︎
- *
- * ✅︎
-- * `InternVLChatModel`
- * InternVL 3.0, InternVideo 2.5, InternVL 2.5, Mono-InternVL, InternVL 2.0
- * T + IE+
- * `OpenGVLab/InternVL3-9B`, `OpenGVLab/InternVideo2_5_Chat_8B`, `OpenGVLab/InternVL2_5-4B`, `OpenGVLab/Mono-InternVL-2B`, `OpenGVLab/InternVL2-4B`, etc.
- *
- * ✅︎
- * ✅︎
-- * `KimiVLForConditionalGeneration`
- * Kimi-VL-A3B-Instruct, Kimi-VL-A3B-Thinking
- * T + I+
- * `moonshotai/Kimi-VL-A3B-Instruct`, `moonshotai/Kimi-VL-A3B-Thinking`
- *
- *
- * ✅︎
-- * `Llama4ForConditionalGeneration`
- * Llama 4
- * T + I+
- * `meta-llama/Llama-4-Scout-17B-16E-Instruct`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8`, `meta-llama/Llama-4-Maverick-17B-128E-Instruct`, etc.
- *
- * ✅︎
- * ✅︎
-- * `LlavaForConditionalGeneration`
- * LLaVA-1.5
- * T + IE+
- * `llava-hf/llava-1.5-7b-hf`, `TIGER-Lab/Mantis-8B-siglip-llama3` (see note), etc.
- *
- * ✅︎
- * ✅︎
-- * `LlavaNextForConditionalGeneration`
- * LLaVA-NeXT
- * T + IE+
- * `llava-hf/llava-v1.6-mistral-7b-hf`, `llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
- *
- * ✅︎
- * ✅︎
-- * `LlavaNextVideoForConditionalGeneration`
- * LLaVA-NeXT-Video
- * T + V
- * `llava-hf/LLaVA-NeXT-Video-7B-hf`, etc.
- *
- * ✅︎
- * ✅︎
-- * `LlavaOnevisionForConditionalGeneration`
- * LLaVA-Onevision
- * T + I+ + V+
- * `llava-hf/llava-onevision-qwen2-7b-ov-hf`, `llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc.
- *
- * ✅︎
- * ✅︎
-- * `MiniCPMO`
- * MiniCPM-O
- * T + IE+ + VE+ + AE+
- * `openbmb/MiniCPM-o-2_6`, etc.
- * ✅︎
- * ✅︎
- * ✅︎
-- * `MiniCPMV`
- * MiniCPM-V
- * T + IE+ + VE+
- * `openbmb/MiniCPM-V-2` (see note), `openbmb/MiniCPM-Llama3-V-2_5`, `openbmb/MiniCPM-V-2_6`, etc.
- * ✅︎
- * ✅︎
- * ✅︎
-- * `MiniMaxVL01ForConditionalGeneration`
- * MiniMax-VL
- * T + IE+
- * `MiniMaxAI/MiniMax-VL-01`, etc.
- *
- * ✅︎
- * ✅︎
-- * `Mistral3ForConditionalGeneration`
- * Mistral3
- * T + I+
- * `mistralai/Mistral-Small-3.1-24B-Instruct-2503`, etc.
- * ✅︎
- * ✅︎
- * ✅︎
-- * `MllamaForConditionalGeneration`
- * Llama 3.2
- * T + I+
- * `meta-llama/Llama-3.2-90B-Vision-Instruct`, `meta-llama/Llama-3.2-11B-Vision`, etc.
- *
- *
- *
-- * `MolmoForCausalLM`
- * Molmo
- * T + I+
- * `allenai/Molmo-7B-D-0924`, `allenai/Molmo-7B-O-0924`, etc.
- * ✅︎
- * ✅︎
- * ✅︎
-- * `NVLM_D_Model`
- * NVLM-D 1.0
- * T + I+
- * `nvidia/NVLM-D-72B`, etc.
- *
- * ✅︎
- * ✅︎
-- * `Ovis`
- * Ovis2, Ovis1.6
- * T + I+
- * `AIDC-AI/Ovis2-1B`, `AIDC-AI/Ovis1.6-Llama3.2-3B`, etc.
- *
- *
- * ✅︎
-- * `PaliGemmaForConditionalGeneration`
- * PaliGemma, PaliGemma 2
- * T + IE
- * `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc.
- *
- * ✅︎
- * ⚠️
-- * `Phi3VForCausalLM`
- * Phi-3-Vision, Phi-3.5-Vision
- * T + IE+
- * `microsoft/Phi-3-vision-128k-instruct`, `microsoft/Phi-3.5-vision-instruct`, etc.
- *
- * ✅︎
- * ✅︎
-- * `Phi4MMForCausalLM`
- * Phi-4-multimodal
- * T + I+ / T + A+ / I+ + A+
- * `microsoft/Phi-4-multimodal-instruct`, etc.
- * ✅︎
- *
- * ✅︎
-- * `PixtralForConditionalGeneration`
- * Pixtral
- * T + I+
- * `mistralai/Mistral-Small-3.1-24B-Instruct-2503`, `mistral-community/pixtral-12b`, etc.
- *
- * ✅︎
- * ✅︎
-- * `QwenVLForConditionalGeneration`^
- * Qwen-VL
- * T + IE+
- * `Qwen/Qwen-VL`, `Qwen/Qwen-VL-Chat`, etc.
- * ✅︎
- * ✅︎
- * ✅︎
-- * `Qwen2AudioForConditionalGeneration`
- * Qwen2-Audio
- * T + A+
- * `Qwen/Qwen2-Audio-7B-Instruct`
- *
- * ✅︎
- * ✅︎
-- * `Qwen2VLForConditionalGeneration`
- * QVQ, Qwen2-VL
- * T + IE+ + VE+
- * `Qwen/QVQ-72B-Preview`, `Qwen/Qwen2-VL-7B-Instruct`, `Qwen/Qwen2-VL-72B-Instruct`, etc.
- * ✅︎
- * ✅︎
- * ✅︎
-- * `Qwen2_5_VLForConditionalGeneration`
- * Qwen2.5-VL
- * T + IE+ + VE+
- * `Qwen/Qwen2.5-VL-3B-Instruct`, `Qwen/Qwen2.5-VL-72B-Instruct`, etc.
- * ✅︎
- * ✅︎
- * ✅︎
-- * `Qwen2_5OmniThinkerForConditionalGeneration`
- * Qwen2.5-Omni
- * T + IE+ + VE+ + A+
- * `Qwen/Qwen2.5-Omni-7B`
- *
- * ✅︎
- * ✅︎\*
-- * `SkyworkR1VChatModel`
- * Skywork-R1V-38B
- * T + I
- * `Skywork/Skywork-R1V-38B`
- *
- * ✅︎
- * ✅︎
-- * `SmolVLMForConditionalGeneration`
- * SmolVLM2
- * T + I
- * `SmolVLM2-2.2B-Instruct`
- *
- * ✅︎
- * ✅︎
-- * `UltravoxModel`
- * Ultravox
- * T + AE+
- * `fixie-ai/ultravox-v0_5-llama-3_2-1b`
- * ✅︎
- * ✅︎
- * ✅︎
-:::
-
-^ You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
- • For example, to use DeepSeek-VL2 series models:
- `--hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'`
-E Pre-computed embeddings can be inputted for this modality.
-+ Multiple items can be inputted per text prompt for this modality.
-
-:::{warning}
-Both V0 and V1 support `Gemma3ForConditionalGeneration` for text-only inputs.
-However, there are differences in how they handle text + image inputs:
-
-V0 correctly implements the model's attention pattern:
-- Uses bidirectional attention between the image tokens corresponding to the same image
-- Uses causal attention for other tokens
-- Implemented via (naive) PyTorch SDPA with masking tensors
-- Note: May use significant memory for long prompts with image
-
-V1 currently uses a simplified attention pattern:
-- Uses causal attention for all tokens, including image tokens
-- Generates reasonable outputs but does not match the original model's attention for text + image inputs, especially when `{"do_pan_and_scan": true}` is set
-- Will be updated in the future to support the correct behavior
-
-This limitation exists because the model's mixed attention pattern (bidirectional for images, causal otherwise) is not yet supported by vLLM's attention backends.
-:::
-
-:::{note}
-`h2oai/h2ovl-mississippi-2b` will be available in V1 once we support head size 80.
-:::
-
-:::{note}
-To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
-:::
-
-:::{warning}
-The output quality of `allenai/Molmo-7B-D-0924` (especially in object localization tasks) has deteriorated in recent updates.
-
-For the best results, we recommend using the following dependency versions (tested on A10 and L40):
-
-```text
-# Core vLLM-compatible dependencies with Molmo accuracy setup (tested on L40)
-torch==2.5.1
-torchvision==0.20.1
-transformers==4.48.1
-tokenizers==0.21.0
-tiktoken==0.7.0
-vllm==0.7.0
-
-# Optional but recommended for improved performance and stability
-triton==3.1.0
-xformers==0.0.28.post3
-uvloop==0.21.0
-protobuf==5.29.3
-openai==1.60.2
-opencv-python-headless==4.11.0.86
-pillow==10.4.0
-
-# Installed FlashAttention (for float16 only)
-flash-attn>=2.5.6 # Not used in float32, but should be documented
-```
-
-**Note:** Make sure you understand the security implications of using outdated packages.
-:::
-
-:::{note}
-The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
-:::
-
-:::{warning}
-Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
-:::
-
-:::{note}
-To use Qwen2.5-Omni, you have to install the Hugging Face Transformers library from source via
-`pip install git+https://github.com/huggingface/transformers.git`.
-
-Reading audio from video during pre-processing (enabled via `--mm-processor-kwargs '{"use_audio_in_video": true}'`) is currently supported on V0 (but not V1), because overlapping modalities are not yet supported in V1.
-:::
-
-### Pooling Models
-
-See [this page](pooling-models) for more information on how to use pooling models.
-
-:::{important}
-Since some model architectures support both generative and pooling tasks,
-you should explicitly specify the task type to ensure that the model is used in pooling mode instead of generative mode.
-:::
-
-#### Text Embedding
-
-Specified using `--task embed`.
-
-Any text generation model can be converted into an embedding model by passing `--task embed`.
-
-:::{note}
-To get the best results, you should use pooling models that are specifically trained as such.
-:::
-
-The following table lists those that are tested in vLLM.
-
-:::{list-table}
-:widths: 25 25 15 25 5 5
-:header-rows: 1
-
-- * Architecture
- * Models
- * Inputs
- * Example HF Models
- * [LoRA](#lora-adapter)
- * [PP](#distributed-serving)
-- * `LlavaNextForConditionalGeneration`
- * LLaVA-NeXT-based
- * T / I
- * `royokong/e5-v`
- *
- * ✅︎
-- * `Phi3VForCausalLM`
- * Phi-3-Vision-based
- * T + I
- * `TIGER-Lab/VLM2Vec-Full`
- * 🚧
- * ✅︎
-- * `Qwen2VLForConditionalGeneration`
- * Qwen2-VL-based
- * T + I
- * `MrLight/dse-qwen2-2b-mrl-v1`
- *
- * ✅︎
-:::
-
-#### Transcription
-
-Specified using `--task transcription`.
-
-These are Speech2Text models trained specifically for Automatic Speech Recognition (ASR).
-
-:::{list-table}
-:widths: 25 25 25 5 5
-:header-rows: 1
-
-- * Architecture
- * Models
- * Example HF Models
- * [LoRA](#lora-adapter)
- * [PP](#distributed-serving)
-- * `Whisper`
- * Whisper-based
- * `openai/whisper-large-v3-turbo`
- * 🚧
- * 🚧
-:::
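-
-For example, a minimal sketch of transcribing a local audio file through the OpenAI-compatible server, assuming you have started `vllm serve openai/whisper-large-v3-turbo --task transcription` and that your vLLM version exposes the `/v1/audio/transcriptions` endpoint (the file name is a placeholder):
-
-```python
-from openai import OpenAI
-
-client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
-
-with open("sample_audio.wav", "rb") as audio_file:
-    transcription = client.audio.transcriptions.create(
-        model="openai/whisper-large-v3-turbo",
-        file=audio_file,
-    )
-print(transcription.text)
-```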
-
-_________________
-
-## Model Support Policy
-
-At vLLM, we are committed to facilitating the integration and support of third-party models within our ecosystem. Our approach is designed to balance the need for robustness and the practical limitations of supporting a wide range of models. Here’s how we manage third-party model support:
-
-1. **Community-Driven Support**: We encourage community contributions for adding new models. When a user requests support for a new model, we welcome pull requests (PRs) from the community. These contributions are evaluated primarily on the sensibility of the output they generate, rather than strict consistency with existing implementations such as those in transformers. **Call for contribution:** PRs coming directly from model vendors are greatly appreciated!
-
-2. **Best-Effort Consistency**: While we aim to maintain a level of consistency between the models implemented in vLLM and other frameworks like transformers, complete alignment is not always feasible. Factors like acceleration techniques and the use of low-precision computations can introduce discrepancies. Our commitment is to ensure that the implemented models are functional and produce sensible results.
-
- :::{tip}
-   When comparing the output of `model.generate` from Hugging Face Transformers with the output of `llm.generate` from vLLM, note that the former reads the model's generation config file (i.e., [generation_config.json](https://github.com/huggingface/transformers/blob/19dabe96362803fb0a9ae7073d03533966598b17/src/transformers/generation/utils.py#L1945)) and applies the default parameters for generation, while the latter only uses the parameters passed to the function. Ensure all sampling parameters are identical when comparing outputs. A minimal sketch of such a comparison is shown after this list.
- :::
-
-3. **Issue Resolution and Model Updates**: Users are encouraged to report any bugs or issues they encounter with third-party models. Proposed fixes should be submitted via PRs, with a clear explanation of the problem and the rationale behind the proposed solution. If a fix for one model impacts another, we rely on the community to highlight and address these cross-model dependencies. Note: for bugfix PRs, it is good etiquette to inform the original author to seek their feedback.
-
-4. **Monitoring and Updates**: Users interested in specific models should monitor the commit history for those models (e.g., by tracking changes in the `vllm/model_executor/models` directory on the `main` branch). This proactive approach helps users stay informed about updates and changes that may affect the models they use.
-
-5. **Selective Focus**: Our resources are primarily directed towards models with significant user interest and impact. Models that are less frequently used may receive less attention, and we rely on the community to play a more active role in their upkeep and improvement.
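-
-As mentioned in the tip under point 2, here is a minimal sketch of comparing Hugging Face Transformers and vLLM outputs under identical (greedy) sampling settings; the model name is just an example from the tables above:
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-from vllm import LLM, SamplingParams
-
-model_id = "Qwen/Qwen2-7B-Instruct"
-prompt = "Hello, my name is"
-
-# Hugging Face Transformers: greedy decoding, overriding generation_config defaults
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-hf_model = AutoModelForCausalLM.from_pretrained(model_id)
-inputs = tokenizer(prompt, return_tensors="pt")
-hf_out = hf_model.generate(**inputs, do_sample=False, max_new_tokens=32)
-print(tokenizer.decode(hf_out[0], skip_special_tokens=True))
-
-# vLLM: the same greedy settings passed explicitly
-llm = LLM(model=model_id)
-vllm_out = llm.generate(prompt, SamplingParams(temperature=0, max_tokens=32))
-print(vllm_out[0].outputs[0].text)
-```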
-
-Through this approach, vLLM fosters a collaborative environment where both the core development team and the broader community contribute to the robustness and diversity of the third-party models supported in our ecosystem.
-
-Note that, as an inference engine, vLLM does not introduce new models. Therefore, all models supported by vLLM are third-party models in this regard.
-
-We have the following levels of testing for models:
-
-1. **Strict Consistency**: We compare the output of the model with the output of the model in the HuggingFace Transformers library under greedy decoding. This is the most stringent test. Please refer to [models tests](https://github.com/vllm-project/vllm/blob/main/tests/models) for the models that have passed this test.
-2. **Output Sensibility**: We check if the output of the model is sensible and coherent, by measuring the perplexity of the output and checking for any obvious errors. This is a less stringent test.
-3. **Runtime Functionality**: We check if the model can be loaded and run without errors. This is the least stringent test. Please refer to [functionality tests](gh-dir:tests) and [examples](gh-dir:examples) for the models that have passed this test.
-4. **Community Feedback**: We rely on the community to provide feedback on the models. If a model is broken or not working as expected, we encourage users to raise issues to report it or open pull requests to fix it. The rest of the models fall under this category.
diff --git a/docs/source/serving/engine_args.md b/docs/source/serving/engine_args.md
deleted file mode 100644
index 9325a2406e..0000000000
--- a/docs/source/serving/engine_args.md
+++ /dev/null
@@ -1,36 +0,0 @@
-(engine-args)=
-
-# Engine Arguments
-
-Engine arguments control the behavior of the vLLM engine.
-
-- For [offline inference](#offline-inference), they are part of the arguments to `LLM` class.
-- For [online serving](#openai-compatible-server), they are part of the arguments to `vllm serve`.
-
-For a reference of all arguments available from `vllm serve`, see the [serve args](#serve-args) documentation.
-
-Below, you can find an explanation of every engine argument:
-
-
-```{eval-rst}
-.. argparse::
- :module: vllm.engine.arg_utils
- :func: _engine_args_parser
- :prog: vllm serve
- :nodefaultconst:
- :markdownhelp:
-```
-
-## Async Engine Arguments
-
-Additional arguments are available to the asynchronous engine which is used for online serving:
-
-
-```{eval-rst}
-.. argparse::
- :module: vllm.engine.arg_utils
- :func: _async_engine_args_parser
- :prog: vllm serve
- :nodefaultconst:
- :markdownhelp:
-```
diff --git a/docs/source/serving/env_vars.md b/docs/source/serving/env_vars.md
deleted file mode 100644
index 9845241930..0000000000
--- a/docs/source/serving/env_vars.md
+++ /dev/null
@@ -1,15 +0,0 @@
-# Environment Variables
-
-vLLM uses the following environment variables to configure the system:
-
-:::{warning}
-Please note that `VLLM_PORT` and `VLLM_HOST_IP` set the port and IP address for vLLM's **internal usage**. They are not the port and IP address of the API server. If you use `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` to start the API server, it will not work.
-
-All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
-:::
-
-:::{literalinclude} ../../../vllm/envs.py
-:end-before: end-env-vars-definition
-:language: python
-:start-after: begin-env-vars-definition
-:::
diff --git a/docs/source/serving/integrations/index.md b/docs/source/serving/integrations/index.md
deleted file mode 100644
index e2b4c08146..0000000000
--- a/docs/source/serving/integrations/index.md
+++ /dev/null
@@ -1,8 +0,0 @@
-# External Integrations
-
-:::{toctree}
-:maxdepth: 1
-
-langchain
-llamaindex
-:::
diff --git a/docs/source/serving/serve_args.md b/docs/source/serving/serve_args.md
deleted file mode 100644
index edb49f4ba6..0000000000
--- a/docs/source/serving/serve_args.md
+++ /dev/null
@@ -1,47 +0,0 @@
-(serve-args)=
-
-# Server Arguments
-
-The `vllm serve` command is used to launch the OpenAI-compatible server.
-
-## CLI Arguments
-
-The following are all arguments available from the `vllm serve` command:
-
-
-```{eval-rst}
-.. argparse::
- :module: vllm.entrypoints.openai.cli_args
- :func: create_parser_for_docs
- :prog: vllm serve
- :nodefaultconst:
- :markdownhelp:
-```
-
-## Configuration file
-
-You can load CLI arguments via a [YAML](https://yaml.org/) config file.
-The argument names must be the long form of those outlined [above](#serve-args).
-
-For example:
-
-```yaml
-# config.yaml
-
-model: meta-llama/Llama-3.1-8B-Instruct
-host: "127.0.0.1"
-port: 6379
-uvicorn-log-level: "info"
-```
-
-To use the above config file:
-
-```bash
-vllm serve --config config.yaml
-```
-
-:::{note}
-If an argument is supplied both on the command line and in the config file, the value from the command line takes precedence.
-The order of priority is `command line > config file values > defaults`.
-For example, with `vllm serve SOME_MODEL --config config.yaml`, `SOME_MODEL` takes precedence over the `model` entry in the config file.
-:::
diff --git a/docs/source/training/rlhf.md b/docs/training/rlhf.md
similarity index 100%
rename from docs/source/training/rlhf.md
rename to docs/training/rlhf.md
diff --git a/docs/source/training/trl.md b/docs/training/trl.md
similarity index 66%
rename from docs/source/training/trl.md
rename to docs/training/trl.md
index ebdf593dbd..c7c1a5a3bb 100644
--- a/docs/source/training/trl.md
+++ b/docs/training/trl.md
@@ -6,8 +6,7 @@ Online methods such as GRPO or Online DPO require the model to generate completi
See the guide [vLLM for fast generation in online methods](https://huggingface.co/docs/trl/main/en/speeding_up_training#vllm-for-fast-generation-in-online-methods) in the TRL documentation for more information.
-:::{seealso}
-For more information on the `use_vllm` flag you can provide to the configs of these online methods, see:
-- [`trl.GRPOConfig.use_vllm`](https://huggingface.co/docs/trl/main/en/grpo_trainer#trl.GRPOConfig.use_vllm)
-- [`trl.OnlineDPOConfig.use_vllm`](https://huggingface.co/docs/trl/main/en/online_dpo_trainer#trl.OnlineDPOConfig.use_vllm)
-:::
+!!! info
+ For more information on the `use_vllm` flag you can provide to the configs of these online methods, see:
+ - [`trl.GRPOConfig.use_vllm`](https://huggingface.co/docs/trl/main/en/grpo_trainer#trl.GRPOConfig.use_vllm)
+ - [`trl.OnlineDPOConfig.use_vllm`](https://huggingface.co/docs/trl/main/en/online_dpo_trainer#trl.OnlineDPOConfig.use_vllm)
diff --git a/mkdocs.yaml b/mkdocs.yaml
new file mode 100644
index 0000000000..a1c6319bb0
--- /dev/null
+++ b/mkdocs.yaml
@@ -0,0 +1,117 @@
+site_name: vLLM
+site_url: https://docs.vllm.ai
+repo_url: https://github.com/vllm-project/vllm
+exclude_docs: |
+ *.inc.md
+ *.template.md
+theme:
+ name: material
+ logo: assets/logos/vllm-logo-only-light.ico
+ favicon: assets/logos/vllm-logo-only-light.ico
+ palette:
+ # Palette toggle for automatic mode
+ - media: "(prefers-color-scheme)"
+ toggle:
+ icon: material/brightness-auto
+ name: Switch to light mode
+ # Palette toggle for light mode
+ - media: "(prefers-color-scheme: light)"
+ scheme: default
+ primary: white
+ toggle:
+ icon: material/brightness-7
+ name: Switch to dark mode
+ # Palette toggle for dark mode
+ - media: "(prefers-color-scheme: dark)"
+ scheme: slate
+ primary: black
+ toggle:
+ icon: material/brightness-2
+ name: Switch to system preference
+ features:
+ - content.code.copy
+ - content.tabs.link
+ - navigation.tracking
+ - navigation.tabs
+ - navigation.sections
+ - navigation.prune
+ - navigation.top
+ - search.highlight
+ - search.share
+ - toc.follow
+ custom_dir: docs/mkdocs/overrides
+
+hooks:
+ - docs/mkdocs/hooks/remove_announcement.py
+ - docs/mkdocs/hooks/generate_examples.py
+ - docs/mkdocs/hooks/url_schemes.py
+
+# Required to stop api-autonav from raising an error
+# https://github.com/tlambert03/mkdocs-api-autonav/issues/16
+nav:
+ - api
+
+plugins:
+ - meta
+ - search
+ - autorefs
+ - awesome-nav
+ # For API reference generation
+ - api-autonav:
+ modules: ["vllm"]
+ api_root_uri: "api"
+ - mkdocstrings:
+ handlers:
+ python:
+ options:
+ show_symbol_type_heading: true
+ show_symbol_type_toc: true
+ summary:
+ modules: true
+ show_if_no_docstring: true
+ show_signature_annotations: true
+ separate_signature: true
+ show_overloads: true
+ signature_crossrefs: true
+ inventories:
+ - https://docs.python.org/3/objects.inv
+ - https://typing-extensions.readthedocs.io/en/latest/objects.inv
+ - https://docs.aiohttp.org/en/stable/objects.inv
+ - https://pillow.readthedocs.io/en/stable/objects.inv
+ - https://numpy.org/doc/stable/objects.inv
+ - https://pytorch.org/docs/stable/objects.inv
+ - https://psutil.readthedocs.io/en/stable/objects.inv
+
+markdown_extensions:
+ - attr_list
+ - md_in_html
+ - admonition
+ - pymdownx.details
+ # For content tabs
+ - pymdownx.superfences
+ - pymdownx.tabbed:
+ slugify: !!python/object/apply:pymdownx.slugs.slugify
+ kwds:
+ case: lower
+ alternate_style: true
+ # For code highlighting
+ - pymdownx.highlight:
+ anchor_linenums: true
+ line_spans: __span
+ pygments_lang_class: true
+ - pymdownx.inlinehilite
+ - pymdownx.snippets
+ # For emoji and icons
+ - pymdownx.emoji:
+ emoji_index: !!python/name:material.extensions.emoji.twemoji
+ emoji_generator: !!python/name:material.extensions.emoji.to_svg
+ # For in page [TOC] (not sidebar)
+ - toc:
+ permalink: true
+ # For math rendering
+ - mdx_math:
+ enable_dollar_delimiter: true
+
+extra_javascript:
+ - mkdocs/javascript/run_llm_widget.js
+ - https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML
diff --git a/pyproject.toml b/pyproject.toml
index 3011cffb8f..29186d5ff0 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -165,9 +165,11 @@ markers = [
[tool.pymarkdown]
plugins.md004.style = "sublist" # ul-style
+plugins.md007.indent = 4 # ul-indent
plugins.md013.enabled = false # line-length
plugins.md041.enabled = false # first-line-h1
plugins.md033.enabled = false # inline-html
+plugins.md046.enabled = false # code-block-style
plugins.md024.allow_different_nesting = true # no-duplicate-headers
[tool.ty]
diff --git a/requirements/docs.txt b/requirements/docs.txt
index 9c267edace..a1f51334ed 100644
--- a/requirements/docs.txt
+++ b/requirements/docs.txt
@@ -1,19 +1,8 @@
-sphinx==7.4.7
-sphinx-argparse==0.5.2
-sphinx-book-theme==1.1.4
-sphinx-copybutton==0.5.2
-sphinx-design==0.6.1
-sphinx-togglebutton==0.3.2
-myst-parser==3.0.1 # `myst-parser==4.0.1` breaks inline code in titles
-msgspec
-snowballstemmer<3 # https://github.com/snowballstem/snowball/issues/229
-commonmark # Required by sphinx-argparse when using :markdownhelp:
-
-# Custom autodoc2 is necessary for faster docstring processing
-# see: https://github.com/sphinx-extensions2/sphinx-autodoc2/issues/33#issuecomment-2856386035
-git+https://github.com/hmellor/sphinx-autodoc2.git # sphinx-autodoc2==0.5.0
-
-# packages to install to build the documentation
-cachetools
--f https://download.pytorch.org/whl/cpu
-torch
\ No newline at end of file
+mkdocs
+mkdocs-api-autonav
+mkdocs-material
+mkdocstrings-python
+mkdocs-gen-files
+mkdocs-awesome-nav
+python-markdown-math
+ruff
diff --git a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py
index 2a27afe975..c48d8a3869 100644
--- a/vllm/engine/llm_engine.py
+++ b/vllm/engine/llm_engine.py
@@ -1263,12 +1263,10 @@ class LLMEngine:
def step(self) -> List[Union[RequestOutput, PoolingRequestOutput]]:
"""Performs one decoding iteration and returns newly generated results.
- :::{figure} https://i.imgur.com/sv2HssD.png
- :alt: Overview of the step function
- :align: center
-
- Overview of the step function.
- :::
+
+ 
+ Overview of the step function
+
Details:
- Step 1: Schedules the sequences to be executed in the next
diff --git a/vllm/engine/metrics.py b/vllm/engine/metrics.py
index 033551d07c..34b48f83b6 100644
--- a/vllm/engine/metrics.py
+++ b/vllm/engine/metrics.py
@@ -29,7 +29,7 @@ prometheus_client.disable_created_metrics()
# to extract the metrics definitions.
-# begin-metrics-definitions
+# --8<-- [start:metrics-definitions]
class Metrics:
"""
vLLM uses a multiprocessing-based frontend for the OpenAI server.
@@ -293,7 +293,7 @@ class Metrics:
labelnames=labelnames))
-# end-metrics-definitions
+# --8<-- [end:metrics-definitions]
def _unregister_vllm_metrics(self) -> None:
for collector in list(prometheus_client.REGISTRY._collector_to_names):
diff --git a/vllm/entrypoints/llm.py b/vllm/entrypoints/llm.py
index 52b50229b8..0465302c5a 100644
--- a/vllm/entrypoints/llm.py
+++ b/vllm/entrypoints/llm.py
@@ -131,10 +131,9 @@ class LLM:
**kwargs: Arguments for {class}`~vllm.EngineArgs`. (See
{ref}`engine-args`)
- :::{note}
- This class is intended to be used for offline inference. For online
- serving, use the {class}`~vllm.AsyncLLMEngine` class instead.
- :::
+ Note:
+ This class is intended to be used for offline inference. For online
+ serving, use the {class}`~vllm.AsyncLLMEngine` class instead.
"""
DEPRECATE_LEGACY: ClassVar[bool] = True
@@ -422,11 +421,10 @@ class LLM:
A list of `RequestOutput` objects containing the
generated completions in the same order as the input prompts.
- :::{note}
- Using `prompts` and `prompt_token_ids` as keyword parameters is
- considered legacy and may be deprecated in the future. You should
- instead pass them via the `inputs` parameter.
- :::
+ Note:
+ Using `prompts` and `prompt_token_ids` as keyword parameters is
+ considered legacy and may be deprecated in the future. You should
+ instead pass them via the `inputs` parameter.
"""
runner_type = self.llm_engine.model_config.runner_type
if runner_type not in ["generate", "transcription"]:
@@ -502,10 +500,9 @@ class LLM:
Returns:
A list containing the results from each worker.
- :::{note}
- It is recommended to use this API to only pass control messages,
- and set up data-plane communication to pass data.
- :::
+ Note:
+ It is recommended to use this API to only pass control messages,
+ and set up data-plane communication to pass data.
"""
return self.llm_engine.collective_rpc(method, timeout, args, kwargs)
@@ -924,11 +921,10 @@ class LLM:
A list of `PoolingRequestOutput` objects containing the
pooled hidden states in the same order as the input prompts.
- :::{note}
- Using `prompts` and `prompt_token_ids` as keyword parameters is
- considered legacy and may be deprecated in the future. You should
- instead pass them via the `inputs` parameter.
- :::
+ Note:
+ Using `prompts` and `prompt_token_ids` as keyword parameters is
+ considered legacy and may be deprecated in the future. You should
+ instead pass them via the `inputs` parameter.
"""
runner_type = self.llm_engine.model_config.runner_type
if runner_type != "pooling":
diff --git a/vllm/entrypoints/openai/protocol.py b/vllm/entrypoints/openai/protocol.py
index 5ab2356a08..da01eb472c 100644
--- a/vllm/entrypoints/openai/protocol.py
+++ b/vllm/entrypoints/openai/protocol.py
@@ -251,7 +251,7 @@ class ChatCompletionRequest(OpenAIBaseModel):
parallel_tool_calls: Optional[bool] = False
user: Optional[str] = None
- # doc: begin-chat-completion-sampling-params
+ # --8<-- [start:chat-completion-sampling-params]
best_of: Optional[int] = None
use_beam_search: bool = False
top_k: Optional[int] = None
@@ -266,9 +266,9 @@ class ChatCompletionRequest(OpenAIBaseModel):
spaces_between_special_tokens: bool = True
truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
prompt_logprobs: Optional[int] = None
- # doc: end-chat-completion-sampling-params
+ # --8<-- [end:chat-completion-sampling-params]
- # doc: begin-chat-completion-extra-params
+ # --8<-- [start:chat-completion-extra-params]
echo: bool = Field(
default=False,
description=(
@@ -407,7 +407,7 @@ class ChatCompletionRequest(OpenAIBaseModel):
default=None,
description="KVTransfer parameters used for disaggregated serving.")
- # doc: end-chat-completion-extra-params
+ # --8<-- [end:chat-completion-extra-params]
# Default sampling parameters for chat completion requests
_DEFAULT_SAMPLING_PARAMS: dict = {
@@ -764,7 +764,7 @@ class CompletionRequest(OpenAIBaseModel):
top_p: Optional[float] = None
user: Optional[str] = None
- # doc: begin-completion-sampling-params
+ # --8<-- [start:completion-sampling-params]
use_beam_search: bool = False
top_k: Optional[int] = None
min_p: Optional[float] = None
@@ -779,9 +779,9 @@ class CompletionRequest(OpenAIBaseModel):
truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
allowed_token_ids: Optional[list[int]] = None
prompt_logprobs: Optional[int] = None
- # doc: end-completion-sampling-params
+ # --8<-- [end:completion-sampling-params]
- # doc: begin-completion-extra-params
+ # --8<-- [start:completion-extra-params]
add_special_tokens: bool = Field(
default=True,
description=(
@@ -858,7 +858,7 @@ class CompletionRequest(OpenAIBaseModel):
default=None,
description="KVTransfer parameters used for disaggregated serving.")
- # doc: end-completion-extra-params
+ # --8<-- [end:completion-extra-params]
# Default sampling parameters for completion requests
_DEFAULT_SAMPLING_PARAMS: dict = {
@@ -1045,11 +1045,11 @@ class EmbeddingCompletionRequest(OpenAIBaseModel):
user: Optional[str] = None
truncate_prompt_tokens: Optional[Annotated[int, Field(ge=-1)]] = None
- # doc: begin-embedding-pooling-params
+ # --8<-- [start:embedding-pooling-params]
additional_data: Optional[Any] = None
- # doc: end-embedding-pooling-params
+ # --8<-- [end:embedding-pooling-params]
- # doc: begin-embedding-extra-params
+ # --8<-- [start:embedding-extra-params]
add_special_tokens: bool = Field(
default=True,
description=(
@@ -1064,7 +1064,7 @@ class EmbeddingCompletionRequest(OpenAIBaseModel):
"if the served model does not use priority scheduling."),
)
- # doc: end-embedding-extra-params
+ # --8<-- [end:embedding-extra-params]
def to_pooling_params(self):
return PoolingParams(dimensions=self.dimensions,
@@ -1080,11 +1080,11 @@ class EmbeddingChatRequest(OpenAIBaseModel):
user: Optional[str] = None
truncate_prompt_tokens: Optional[Annotated[int, Field(ge=-1)]] = None
- # doc: begin-chat-embedding-pooling-params
+ # --8<-- [start:chat-embedding-pooling-params]
additional_data: Optional[Any] = None
- # doc: end-chat-embedding-pooling-params
+ # --8<-- [end:chat-embedding-pooling-params]
- # doc: begin-chat-embedding-extra-params
+ # --8<-- [start:chat-embedding-extra-params]
add_special_tokens: bool = Field(
default=False,
description=(
@@ -1118,7 +1118,7 @@ class EmbeddingChatRequest(OpenAIBaseModel):
"default: 0). Any priority other than 0 will raise an error "
"if the served model does not use priority scheduling."),
)
- # doc: end-chat-embedding-extra-params
+ # --8<-- [end:chat-embedding-extra-params]
@model_validator(mode="before")
@classmethod
@@ -1147,11 +1147,11 @@ class ScoreRequest(OpenAIBaseModel):
text_2: Union[list[str], str]
truncate_prompt_tokens: Optional[Annotated[int, Field(ge=-1)]] = None
- # doc: begin-score-pooling-params
+ # --8<-- [start:score-pooling-params]
additional_data: Optional[Any] = None
- # doc: end-score-pooling-params
+ # --8<-- [end:score-pooling-params]
- # doc: begin-score-extra-params
+ # --8<-- [start:score-extra-params]
priority: int = Field(
default=0,
description=(
@@ -1160,7 +1160,7 @@ class ScoreRequest(OpenAIBaseModel):
"if the served model does not use priority scheduling."),
)
- # doc: end-score-extra-params
+ # --8<-- [end:score-extra-params]
def to_pooling_params(self):
return PoolingParams(additional_data=self.additional_data)
@@ -1173,11 +1173,11 @@ class RerankRequest(OpenAIBaseModel):
top_n: int = Field(default_factory=lambda: 0)
truncate_prompt_tokens: Optional[Annotated[int, Field(ge=-1)]] = None
- # doc: begin-rerank-pooling-params
+ # --8<-- [start:rerank-pooling-params]
additional_data: Optional[Any] = None
- # doc: end-rerank-pooling-params
+ # --8<-- [end:rerank-pooling-params]
- # doc: begin-rerank-extra-params
+ # --8<-- [start:rerank-extra-params]
priority: int = Field(
default=0,
description=(
@@ -1186,7 +1186,7 @@ class RerankRequest(OpenAIBaseModel):
"if the served model does not use priority scheduling."),
)
- # doc: end-rerank-extra-params
+ # --8<-- [end:rerank-extra-params]
def to_pooling_params(self):
return PoolingParams(additional_data=self.additional_data)
@@ -1321,11 +1321,11 @@ class ClassificationRequest(OpenAIBaseModel):
truncate_prompt_tokens: Optional[int] = None
user: Optional[str] = None
- # doc: begin-classification-pooling-params
+ # --8<-- [start:classification-pooling-params]
additional_data: Optional[Any] = None
- # doc: end-classification-pooling-params
+ # --8<-- [end:classification-pooling-params]
- # doc: begin-classification-extra-params
+ # --8<-- [start:classification-extra-params]
priority: int = Field(
default=0,
description=(
@@ -1334,7 +1334,7 @@ class ClassificationRequest(OpenAIBaseModel):
"if the served model does not use priority scheduling."),
)
- # doc: end-classification-extra-params
+ # --8<-- [end:classification-extra-params]
def to_pooling_params(self):
return PoolingParams(additional_data=self.additional_data)
@@ -1698,7 +1698,7 @@ class TranscriptionRequest(OpenAIBaseModel):
timestamps incurs additional latency.
"""
- # doc: begin-transcription-extra-params
+ # --8<-- [start:transcription-extra-params]
stream: Optional[bool] = False
"""Custom field not present in the original OpenAI definition. When set,
it will enable output to be streamed in a similar fashion as the Chat
@@ -1707,9 +1707,9 @@ class TranscriptionRequest(OpenAIBaseModel):
# Flattened stream option to simplify form data.
stream_include_usage: Optional[bool] = False
stream_continuous_usage_stats: Optional[bool] = False
- # doc: end-transcription-extra-params
+ # --8<-- [end:transcription-extra-params]
- # doc: begin-transcription-sampling-params
+ # --8<-- [start:transcription-sampling-params]
temperature: float = Field(default=0.0)
"""The sampling temperature, between 0 and 1.
@@ -1743,7 +1743,7 @@ class TranscriptionRequest(OpenAIBaseModel):
presence_penalty: Optional[float] = 0.0
"""The presence penalty to use for sampling."""
- # doc: end-transcription-sampling-params
+ # --8<-- [end:transcription-sampling-params]
# Default sampling parameters for transcription requests.
_DEFAULT_SAMPLING_PARAMS: dict = {
diff --git a/vllm/envs.py b/vllm/envs.py
index dc23c8ea53..2d330b8fbe 100644
--- a/vllm/envs.py
+++ b/vllm/envs.py
@@ -175,7 +175,7 @@ def get_vllm_port() -> Optional[int]:
 # The begin-* and end-* here are used by the documentation generator
# to extract the used env vars.
-# begin-env-vars-definition
+# --8<-- [start:env-vars-definition]
environment_variables: dict[str, Callable[[], Any]] = {
@@ -813,7 +813,7 @@ environment_variables: dict[str, Callable[[], Any]] = {
lambda: os.getenv("VLLM_ALL2ALL_BACKEND", "naive"),
}
-# end-env-vars-definition
+# --8<-- [end:env-vars-definition]
def __getattr__(name: str):
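Both `protocol.py` and `envs.py` switch from the ad-hoc `# doc: begin-...`/`# doc: end-...` comments to the `--8<--` section markers understood by the pymdownx.snippets extension. The sketch below shows how such markers sit in a module; the module contents and the section name are made up for illustration, and the documentation side would then pull in just the marked region, presumably with an include along the lines of `--8<-- "vllm/envs.py:env-vars-definition"`.

```python
# Hypothetical module used only to illustrate the marker layout; the section
# name "my-settings" is not from the diff. Everything between the start and
# end markers can be embedded verbatim into a MkDocs page.

# --8<-- [start:my-settings]
example_settings = {
    "timeout_s": 30,     # request timeout in seconds
    "max_retries": 3,    # retries before giving up
}
# --8<-- [end:my-settings]

if __name__ == "__main__":
    print(example_settings)
```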
diff --git a/vllm/executor/ray_distributed_executor.py b/vllm/executor/ray_distributed_executor.py
index 9b0b98731e..8e67c7a41b 100644
--- a/vllm/executor/ray_distributed_executor.py
+++ b/vllm/executor/ray_distributed_executor.py
@@ -528,12 +528,12 @@ class RayDistributedExecutor(DistributedExecutorBase):
ray.get(parallel_worker_tasks)
def _check_ray_cgraph_installation(self):
- import pkg_resources
+ import importlib.metadata
+
from packaging import version
required_version = version.parse("2.43.0")
- current_version = version.parse(
- pkg_resources.get_distribution("ray").version)
+ current_version = version.parse(importlib.metadata.version("ray"))
if current_version < required_version:
raise ValueError(f"Ray version {required_version} is "
f"required, but found {current_version}")
diff --git a/vllm/model_executor/models/blip2.py b/vllm/model_executor/models/blip2.py
index 2ff7e394a4..db0dd2051d 100644
--- a/vllm/model_executor/models/blip2.py
+++ b/vllm/model_executor/models/blip2.py
@@ -681,9 +681,8 @@ class Blip2ForConditionalGeneration(nn.Module, SupportsMultiModal, SupportsPP,
batch.
pixel_values: The pixels in each input image.
- :::{seealso}
- {class}`Blip2ImageInputs`
- :::
+ Info:
+ [Blip2ImageInputs][]
"""
if intermediate_tensors is not None:
diff --git a/vllm/model_executor/models/llava.py b/vllm/model_executor/models/llava.py
index 95c1a0ca0b..ced71b6dcd 100644
--- a/vllm/model_executor/models/llava.py
+++ b/vllm/model_executor/models/llava.py
@@ -721,9 +721,8 @@ class LlavaForConditionalGeneration(nn.Module, SupportsMultiModal, SupportsPP):
batch.
pixel_values: The pixels in each input image.
- :::{seealso}
- {class}`LlavaImageInputs`
- :::
+ Info:
+ [LlavaImageInputs][]
"""
if intermediate_tensors is not None:
inputs_embeds = None
diff --git a/vllm/model_executor/models/llava_next.py b/vllm/model_executor/models/llava_next.py
index 581a32325d..10261aa423 100644
--- a/vllm/model_executor/models/llava_next.py
+++ b/vllm/model_executor/models/llava_next.py
@@ -551,9 +551,8 @@ class LlavaNextForConditionalGeneration(nn.Module, SupportsMultiModal,
pixel_values: The pixels in each grid patch for each input image.
image_sizes: The original `(height, width)` for each input image.
- :::{seealso}
- {class}`LlavaNextImageInputs`
- :::
+ Info:
+ [LlavaNextImageInputs][]
"""
if intermediate_tensors is not None:
inputs_embeds = None
diff --git a/vllm/model_executor/models/mistral3.py b/vllm/model_executor/models/mistral3.py
index 2b9cbf1044..051a731208 100644
--- a/vllm/model_executor/models/mistral3.py
+++ b/vllm/model_executor/models/mistral3.py
@@ -559,9 +559,8 @@ class Mistral3ForConditionalGeneration(nn.Module, SupportsLoRA,
batch.
pixel_values: The pixels in each input image.
- :::{seealso}
- {class}`Mistral3ImagePixelInputs`
- :::
+ Info:
+ [Mistral3ImagePixelInputs][]
"""
if intermediate_tensors is not None:
inputs_embeds = None
diff --git a/vllm/multimodal/__init__.py b/vllm/multimodal/__init__.py
index 756ea11311..70568a195f 100644
--- a/vllm/multimodal/__init__.py
+++ b/vllm/multimodal/__init__.py
@@ -11,9 +11,8 @@ MULTIMODAL_REGISTRY = MultiModalRegistry()
The global {class}`~MultiModalRegistry` is used by model runners to
dispatch data processing according to the target model.
-:::{seealso}
-{ref}`mm-processing`
-:::
+Info:
+ {ref}`mm-processing`
"""
__all__ = [
diff --git a/vllm/multimodal/inputs.py b/vllm/multimodal/inputs.py
index 2335af843e..71ef1a98e0 100644
--- a/vllm/multimodal/inputs.py
+++ b/vllm/multimodal/inputs.py
@@ -289,9 +289,8 @@ class BaseMultiModalField(ABC):
@dataclass(frozen=True)
class MultiModalBatchedField(BaseMultiModalField):
"""
- :::{seealso}
- {func}`MultiModalFieldConfig.batched`
- :::
+ Info:
+ [MultiModalFieldConfig.batched][]
"""
def build_elems(
@@ -320,10 +319,9 @@ class MultiModalBatchedField(BaseMultiModalField):
@dataclass(frozen=True)
class MultiModalFlatField(BaseMultiModalField):
"""
- :::{seealso}
- {func}`MultiModalFieldConfig.flat`
- {func}`MultiModalFieldConfig.flat_from_sizes`
- :::
+ Info:
+ [MultiModalFieldConfig.flat][]
+ [MultiModalFieldConfig.flat_from_sizes][]
"""
slices: Union[Sequence[slice], Sequence[Sequence[slice]]]
dim: int = 0
@@ -363,9 +361,8 @@ class MultiModalFlatField(BaseMultiModalField):
@dataclass(frozen=True)
class MultiModalSharedField(BaseMultiModalField):
"""
- :::{seealso}
- {func}`MultiModalFieldConfig.shared`
- :::
+ Info:
+ [MultiModalFieldConfig.shared][]
"""
batch_size: int
@@ -510,9 +507,8 @@ class MultiModalFieldConfig:
Element 3: [[C],[C]]
```
- :::{seealso}
- {func}`MultiModalFieldConfig.flat`
- :::
+ Info:
+ [MultiModalFieldConfig.flat][]
"""
if size_per_item.ndim != 1:
diff --git a/vllm/multimodal/registry.py b/vllm/multimodal/registry.py
index 67d0d7fc11..8a27d866e8 100644
--- a/vllm/multimodal/registry.py
+++ b/vllm/multimodal/registry.py
@@ -214,9 +214,8 @@ class MultiModalRegistry:
When the model receives multi-modal data, the provided function is
invoked to transform the data into a dictionary of model inputs.
- :::{seealso}
- {ref}`mm-processing`
- :::
+ Info:
+ {ref}`mm-processing`
"""
def wrapper(model_cls: N) -> N:
@@ -260,9 +259,8 @@ class MultiModalRegistry:
"""
Create a multi-modal processor for a specific model and tokenizer.
- :::{seealso}
- {ref}`mm-processing`
- :::
+ Info:
+ {ref}`mm-processing`
"""
if not model_config.is_multimodal_model:
raise ValueError(f"{model_config.model} is not a multimodal model")
diff --git a/vllm/utils.py b/vllm/utils.py
index bfc01972bb..fcc0ab3b23 100644
--- a/vllm/utils.py
+++ b/vllm/utils.py
@@ -1926,9 +1926,8 @@ class _PlaceholderBase:
We need to explicitly override each dunder method because
{meth}`__getattr__` is not called when they are accessed.
- :::{seealso}
- [Special method lookup](https://docs.python.org/3/reference/datamodel.html#special-lookup)
- :::
+ Info:
+ [Special method lookup](https://docs.python.org/3/reference/datamodel.html#special-lookup)
"""
def __getattr__(self, key: str) -> Never:
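The `_PlaceholderBase` docstring above points at CPython's special method lookup: implicit dunder calls are resolved on the type and bypass `__getattr__`, which is why each dunder must be overridden explicitly. A minimal sketch (not vLLM code) demonstrating the difference:

```python
# Minimal demonstration: explicit attribute access goes through __getattr__,
# but implicit dunder calls such as len(obj) never reach it.
class Placeholder:
    def __getattr__(self, key):
        raise RuntimeError(f"attribute {key!r} accessed on a placeholder")


p = Placeholder()

try:
    p.anything            # routed through __getattr__ -> RuntimeError
except RuntimeError as e:
    print("explicit attribute:", e)

try:
    len(p)                # special method lookup on the type -> TypeError;
                          # __getattr__ is never consulted
except TypeError as e:
    print("dunder via len():", e)
```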
diff --git a/vllm/v1/worker/gpu_worker.py b/vllm/v1/worker/gpu_worker.py
index 2b945cc411..a7c70fec04 100644
--- a/vllm/v1/worker/gpu_worker.py
+++ b/vllm/v1/worker/gpu_worker.py
@@ -172,10 +172,9 @@ class Worker(WorkerBase):
        Then, it calculates the free memory that can be used for KV cache in
bytes.
- :::{tip}
- You may limit the usage of GPU memory
- by adjusting the `gpu_memory_utilization` parameter.
- :::
+ Tip:
+ You may limit the usage of GPU memory
+ by adjusting the `gpu_memory_utilization` parameter.
"""
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
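The reworded `Tip:` blocks in this and the following worker files all refer to the same knob: `gpu_memory_utilization` caps how much of the device vLLM may claim, and whatever remains after weights and activations becomes the KV-cache budget. A rough, illustrative sketch of that accounting; the helper and numbers are made up and are not vLLM's actual profiling logic:

```python
# Illustrative only: derive a KV-cache budget from a utilization cap and the
# memory already consumed by model weights and peak activations.
def kv_cache_budget_bytes(total_gpu_bytes: int,
                          non_kv_usage_bytes: int,
                          gpu_memory_utilization: float = 0.9) -> int:
    usable = int(total_gpu_bytes * gpu_memory_utilization)
    return max(usable - non_kv_usage_bytes, 0)


if __name__ == "__main__":
    total = 80 * 1024**3           # e.g. an 80 GiB device
    used_by_model = 30 * 1024**3   # weights + peak activations (made up)
    print(kv_cache_budget_bytes(total, used_by_model, 0.9))
```

Lowering `gpu_memory_utilization` shrinks `usable`, which is how the tip's advice about limiting GPU memory usage takes effect.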
diff --git a/vllm/worker/hpu_worker.py b/vllm/worker/hpu_worker.py
index 42882992f2..d7fe0fe0fe 100644
--- a/vllm/worker/hpu_worker.py
+++ b/vllm/worker/hpu_worker.py
@@ -201,10 +201,9 @@ class HPUWorker(LocalOrDistributedWorkerBase):
        Then, it calculates the maximum possible number of GPU and CPU blocks
that can be allocated with the remaining free memory.
- :::{tip}
- You may limit the usage of GPU memory
- by adjusting the `gpu_memory_utilization` parameter.
- :::
+ Tip:
+ You may limit the usage of GPU memory
+ by adjusting the `gpu_memory_utilization` parameter.
"""
# Profile the memory usage of the model and get the maximum number of
# cache blocks that can be allocated with the remaining free memory.
diff --git a/vllm/worker/worker.py b/vllm/worker/worker.py
index 41546462e5..5e3b6e4b62 100644
--- a/vllm/worker/worker.py
+++ b/vllm/worker/worker.py
@@ -234,10 +234,9 @@ class Worker(LocalOrDistributedWorkerBase):
        Then, it calculates the maximum possible number of GPU and CPU blocks
that can be allocated with the remaining free memory.
- :::{tip}
- You may limit the usage of GPU memory
- by adjusting the `gpu_memory_utilization` parameter.
- :::
+ Tip:
+ You may limit the usage of GPU memory
+ by adjusting the `gpu_memory_utilization` parameter.
"""
# Profile the memory usage of the model and get the maximum number of
# cache blocks that can be allocated with the remaining free memory.
diff --git a/vllm/worker/xpu_worker.py b/vllm/worker/xpu_worker.py
index 65085f80f9..a78a41e03e 100644
--- a/vllm/worker/xpu_worker.py
+++ b/vllm/worker/xpu_worker.py
@@ -93,10 +93,9 @@ class XPUWorker(LoRANotSupportedWorkerBase, Worker):
        Then, it calculates the maximum possible number of GPU and CPU blocks
that can be allocated with the remaining free memory.
- :::{tip}
- You may limit the usage of GPU memory
- by adjusting the `gpu_memory_utilization` parameter.
- :::
+ Tip:
+ You may limit the usage of GPU memory
+ by adjusting the `gpu_memory_utilization` parameter.
"""
# Profile the memory usage of the model and get the maximum number of
# cache blocks that can be allocated with the remaining free memory.