[Doc] Reorganize user guide (#18661)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-24 22:25:33 +08:00
parent 2cd4d58df4
commit 1cb194a018
27 changed files with 211 additions and 216 deletions
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@ -3,4 +3,4 @@ FILL IN THE PR DESCRIPTION HERE
 FIX #xxxx (*link existing issues this PR will resolve*)

 <!--- pyml disable-next-line no-emphasis-as-heading -->
-**BEFORE SUBMITTING, PLEASE READ <https://docs.vllm.ai/en/latest/contributing/overview.html>** (anything written below this line will be removed by GitHub Actions)
+**BEFORE SUBMITTING, PLEASE READ <https://docs.vllm.ai/en/latest/contributing>** (anything written below this line will be removed by GitHub Actions)
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@ -1,3 +1,3 @@
 # Contributing to vLLM

-You may find information about contributing to vLLM on [docs.vllm.ai](https://docs.vllm.ai/en/latest/contributing/overview.html).
+You may find information about contributing to vLLM on [docs.vllm.ai](https://docs.vllm.ai/en/latest/contributing).
--- a/README.md
+++ b/README.md
@ -100,7 +100,7 @@ Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
 ## Contributing

 We welcome and value any contributions and collaborations.
-Please check out [Contributing to vLLM](https://docs.vllm.ai/en/stable/contributing/overview.html) for how to get involved.
+Please check out [Contributing to vLLM](https://docs.vllm.ai/en/stable/contributing) for how to get involved.

 ## Sponsors

--- a/docs/.nav.yml
+++ b/docs/.nav.yml
@ -5,29 +5,35 @@ nav:
      - getting_started/quickstart.md
      - getting_started/installation
    - Examples:
-      - LMCache: getting_started/examples/lmcache
-      - getting_started/examples/offline_inference
-      - getting_started/examples/online_serving
-      - getting_started/examples/other
+      - Offline Inference: getting_started/examples/offline_inference
+      - Online Serving: getting_started/examples/online_serving
+      - Others:
+        - LMCache: getting_started/examples/lmcache
+        - getting_started/examples/other/*
    - Quick Links:
-      - User Guide: serving/offline_inference.md
-      - Developer Guide: contributing/overview.md
+      - User Guide: usage/README.md
+      - Developer Guide: contributing/README.md
      - API Reference: api/README.md
    - Timeline:
      - Roadmap: https://roadmap.vllm.ai
      - Releases: https://github.com/vllm-project/vllm/releases
  - User Guide:
+    - usage/README.md
+    - General:
+      - usage/*
    - Inference and Serving:
      - serving/offline_inference.md
      - serving/openai_compatible_server.md
      - serving/*
      - serving/integrations
-    - Training: training
    - Deployment:
      - deployment/*
      - deployment/frameworks
      - deployment/integrations
-    - Performance: performance
+    - Training: training
+    - Configuration:
+      - Summary: configuration/README.md
+      - configuration/*
    - Models:
      - models/supported_models.md
      - models/generative_models.md
@ -37,12 +43,11 @@ nav:
      - features/compatibility_matrix.md
      - features/*
      - features/quantization
-    - Other:
-      - getting_started/*
  - Developer Guide:
-    - contributing/overview.md
-    - glob: contributing/*
-      flatten_single_child_sections: true
+    - contributing/README.md
+    - General:
+      - glob: contributing/*
+        flatten_single_child_sections: true
    - Model Implementation: contributing/model
    - Design Documents:
      - V0: design
--- a/docs/configuration/README.md
+++ b/docs/configuration/README.md
@ -0,0 +1,4 @@
+# Configuration Options
+
+This section lists the most common options for running the vLLM engine.
+For a full list, refer to the [configuration][configuration] page.
--- a/docs/configuration/conserving_memory.md
+++ b/docs/configuration/conserving_memory.md
@ -0,0 +1,144 @@
+# Conserving Memory
+
+Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.
+
+## Tensor Parallelism (TP)
+
+Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.
+
+The following code splits the model across 2 GPUs.
+
+```python
+from vllm import LLM
+
+llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
+          tensor_parallel_size=2)
+```
+
+!!! warning
+    To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
+    before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
+
+    To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
+
+!!! note
+    With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
+
+    You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
+
+## Quantization
+
+Quantized models take less memory at the cost of lower precision.
+
+Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
+and used directly without extra configuration.
+
+Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details.
+
+## Context length and batch size
+
+You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
+and the maximum batch size (`max_num_seqs` option).
+
+```python
+from vllm import LLM
+
+llm = LLM(model="adept/fuyu-8b",
+          max_model_len=2048,
+          max_num_seqs=2)
+```
+
+## Reduce CUDA Graphs
+
+By default, we optimize model inference using CUDA graphs, which take up extra GPU memory.
+
+!!! warning
+    CUDA graph capture takes up more memory in V1 than in V0.
+
+You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
+
+```python
+from vllm import LLM
+from vllm.config import CompilationConfig, CompilationLevel
+
+llm = LLM(
+    model="meta-llama/Llama-3.1-8B-Instruct",
+    compilation_config=CompilationConfig(
+        level=CompilationLevel.PIECEWISE,
+        # By default, it goes up to max_num_seqs
+        cudagraph_capture_sizes=[1, 2, 4, 8, 16],
+    ),
+)
+```
+
+You can disable graph capturing completely via the `enforce_eager` flag:
+
+```python
+from vllm import LLM
+
+llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
+          enforce_eager=True)
+```
+
+## Adjust cache size
+
+If you run out of CPU RAM, try the following options:
+
+- (Multi-modal models only) you can set the size of the multi-modal input cache using the `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
+- (CPU backend only) you can set the size of the KV cache using the `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
+
+## Multi-modal input limits
+
+You can allow a smaller number of multi-modal items per prompt to reduce the memory footprint of the model:
+
+```python
+from vllm import LLM
+
+# Accept up to 3 images and 1 video per prompt
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+          limit_mm_per_prompt={"image": 3, "video": 1})
+```
+
+You can go a step further and disable unused modalities completely by setting their limits to zero.
+For example, if your application only accepts image input, there is no need to allocate any memory for videos.
+
+```python
+from vllm import LLM
+
+# Accept any number of images but no videos
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+          limit_mm_per_prompt={"video": 0})
+```
+
+You can even run a multi-modal model for text-only inference:
+
+```python
+from vllm import LLM
+
+# Don't accept images. Just text.
+llm = LLM(model="google/gemma-3-27b-it",
+          limit_mm_per_prompt={"image": 0})
+```
+
+## Multi-modal processor arguments
+
+For certain models, you can adjust the multi-modal processor arguments to
+reduce the size of the processed multi-modal inputs, which in turn saves memory.
+
+Here are some examples:
+
+```python
+from vllm import LLM
+
+# Available for Qwen2-VL series models
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+          mm_processor_kwargs={
+              "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
+          })
+
+# Available for InternVL series models
+llm = LLM(model="OpenGVLab/InternVL2-2B",
+          mm_processor_kwargs={
+              "max_dynamic_patch": 4,  # Default is 12
+          })
+```
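The options introduced in `docs/configuration/conserving_memory.md` above can be combined. The following is a minimal sketch, assuming you want to gather them in one place before constructing the engine; `build_low_memory_kwargs` is a hypothetical helper (not part of vLLM), and the values are illustrative. Only the option names (`tensor_parallel_size`, `max_model_len`, `max_num_seqs`, `enforce_eager`, `limit_mm_per_prompt`) and the `CUDA_VISIBLE_DEVICES` variable come from the page itself.

```python
import os


def build_low_memory_kwargs(model, *, tensor_parallel_size=1,
                            max_model_len=None, max_num_seqs=None,
                            enforce_eager=False, limit_mm_per_prompt=None):
    """Hypothetical helper: collect memory-related options for `vllm.LLM`."""
    kwargs = {"model": model, "tensor_parallel_size": tensor_parallel_size}
    if max_model_len is not None:
        kwargs["max_model_len"] = max_model_len      # cap the context length
    if max_num_seqs is not None:
        kwargs["max_num_seqs"] = max_num_seqs        # cap the batch size
    if enforce_eager:
        kwargs["enforce_eager"] = True               # skip CUDA graph capture
    if limit_mm_per_prompt:
        kwargs["limit_mm_per_prompt"] = dict(limit_mm_per_prompt)
    return kwargs


# Select devices through the environment rather than torch.cuda.set_device,
# so that CUDA is not initialized before vLLM starts its workers.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

kwargs = build_low_memory_kwargs(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    tensor_parallel_size=2,
    max_model_len=2048,
    max_num_seqs=2,
    enforce_eager=True,
    limit_mm_per_prompt={"image": 3, "video": 0},
)

# Requires vLLM and two visible GPUs:
# from vllm import LLM
# llm = LLM(**kwargs)
```

Keeping the options in a plain dict makes it easy to log or diff the configuration when tuning for an out-of-memory error.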
--- a/docs/configuration/engine_args.md
+++ b/docs/configuration/engine_args.md
--- a/docs/configuration/model_resolution.md
+++ b/docs/configuration/model_resolution.md
@ -0,0 +1,23 @@
+# Model Resolution
+
+vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
+and finding the corresponding implementation that is registered to vLLM.
+Nevertheless, our model resolution may fail for the following reasons:
+
+- The `config.json` of the model repository lacks the `architectures` field.
+- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
+- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
+
+To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
+For example:
+
+```python
+from vllm import LLM
+
+model = LLM(
+    model="cerebras/Cerebras-GPT-1.3B",
+    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
+)
+```
+
+Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM.
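The first failure reason listed in `docs/configuration/model_resolution.md` above (a `config.json` without an `architectures` field) can be checked locally before loading a model. This is a minimal sketch using only the standard library; `read_architectures` and `make_hf_overrides` are hypothetical helper names, and the demo writes a throwaway `config.json` rather than reading a real repository.

```python
import json
import tempfile
from pathlib import Path


def read_architectures(config_path):
    """Return the `architectures` list from a config.json, or [] if absent."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return list(config.get("architectures") or [])


def make_hf_overrides(config_path, fallback_arch):
    """Build an override dict only when the field is missing or empty."""
    if read_architectures(config_path):
        return {}
    return {"architectures": [fallback_arch]}


# Demo: a temporary config.json that lacks the `architectures` field.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "config.json"
    cfg.write_text(json.dumps({"model_type": "gpt2"}), encoding="utf-8")
    overrides = make_hf_overrides(cfg, "GPT2LMHeadModel")

print(overrides)  # {'architectures': ['GPT2LMHeadModel']}
```

A non-empty result can then be passed as the `hf_overrides` option shown in the example above.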
--- a/docs/configuration/optimization.md
+++ b/docs/configuration/optimization.md
@ -1,7 +1,4 @@
----
-title: Optimization and Tuning
----
-[](){ #optimization-and-tuning }
+# Optimization and Tuning

 This guide covers optimization strategies and performance tuning for vLLM V1.

--- a/docs/configuration/serve_args.md
+++ b/docs/configuration/serve_args.md
--- a/docs/contributing/overview.md
+++ b/docs/contributing/overview.md
--- a/docs/contributing/benchmarks.md
+++ b/docs/contributing/benchmarks.md
--- a/docs/design/multiprocessing.md
+++ b/docs/design/multiprocessing.md
@ -123,7 +123,7 @@ what is happening. First, a log message from vLLM:
 WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
    initialized. We must use the `spawn` multiprocessing start method. Setting
    VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
-    https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing
+    https://docs.vllm.ai/en/latest/usage/debugging.html#python-multiprocessing
    for more information.
 ```

--- a/docs/design/v1/metrics.md
+++ b/docs/design/v1/metrics.md
@ -57,7 +57,7 @@ In v0, the following metrics are exposed via a Prometheus-compatible `/metrics`
 - `vllm:spec_decode_num_draft_tokens_total` (Counter)
 - `vllm:spec_decode_num_emitted_tokens_total` (Counter)

-These are documented under [Inferencing and Serving -> Production Metrics](../../serving/metrics.md).
+These are documented under [Inferencing and Serving -> Production Metrics](../../usage/metrics.md).

 ### Grafana Dashboard

--- a/docs/features/tool_calling.md
+++ b/docs/features/tool_calling.md
@ -93,7 +93,7 @@ specify the `name` of one of the tools in the `tool_choice` parameter of the cha

 ## Required Function Calling

-vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to the named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The required guided decoding features (JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends are on the [roadmap](https://docs.vllm.ai/en/latest/getting_started/v1_user_guide.html#feature-model) for the V1 engine.
+vLLM supports the `tool_choice='required'` option in the chat completion API. Similar to the named function calling, it also uses guided decoding, so this is enabled by default and will work with any supported model. The required guided decoding features (JSON schema with `anyOf`) are currently only supported in the V0 engine with the guided decoding backend `outlines`. However, support for alternative decoding backends is on the [roadmap](https://docs.vllm.ai/en/latest/usage/v1_guide.html#feature-model) for the V1 engine.

 When tool_choice='required' is set, the model is guaranteed to generate one or more tool calls based on the specified tool list in the `tools` parameter. The number of tool calls depends on the user's query. The output format strictly follows the schema defined in the `tools` parameter.

--- a/docs/serving/offline_inference.md
+++ b/docs/serving/offline_inference.md
@ -27,188 +27,3 @@ Please refer to the above pages for more details about each API.

 !!! info
    [API Reference][offline-inference-api]
-
-[](){ #configuration-options }
-
-## Configuration Options
-
-This section lists the most common options for running the vLLM engine.
-For a full list, refer to the [configuration][configuration] page.
-
-[](){ #model-resolution }
-
-### Model resolution
-
-vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
-and finding the corresponding implementation that is registered to vLLM.
-Nevertheless, our model resolution may fail for the following reasons:
-
-- The `config.json` of the model repository lacks the `architectures` field.
-- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
-- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
-
-To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
-For example:
-
-```python
-from vllm import LLM
-
-model = LLM(
-    model="cerebras/Cerebras-GPT-1.3B",
-    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
-)
-```
-
-Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM.
-
-[](){ #reducing-memory-usage }
-
-### Reducing memory usage
-
-Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.
-
-#### Tensor Parallelism (TP)
-
-Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.
-
-The following code splits the model across 2 GPUs.
-
-```python
-from vllm import LLM
-
-llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
-          tensor_parallel_size=2)
-```
-
-!!! warning
-    To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
-    before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.
-
-    To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.
-
-!!! note
-    With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).
-
-    You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
-
-#### Quantization
-
-Quantized models take less memory at the cost of lower precision.
-
-Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
-and used directly without extra configuration.
-
-Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details.
-
-#### Context length and batch size
-
-You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
-and the maximum batch size (`max_num_seqs` option).
-
-```python
-from vllm import LLM
-
-llm = LLM(model="adept/fuyu-8b",
-          max_model_len=2048,
-          max_num_seqs=2)
-```
-
-#### Reduce CUDA Graphs
-
-By default, we optimize model inference using CUDA graphs which take up extra memory in the GPU.
-
-!!! warning
-    CUDA graph capture takes up more memory in V1 than in V0.
-
-You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
-
-```python
-from vllm import LLM
-from vllm.config import CompilationConfig, CompilationLevel
-
-llm = LLM(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    compilation_config=CompilationConfig(
-        level=CompilationLevel.PIECEWISE,
-        # By default, it goes up to max_num_seqs
-        cudagraph_capture_sizes=[1, 2, 4, 8, 16],
-    ),
-)
-```
-
-You can disable graph capturing completely via the `enforce_eager` flag:
-
-```python
-from vllm import LLM
-
-llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
-          enforce_eager=True)
-```
-
-#### Adjust cache size
-
-If you run out of CPU RAM, try the following options:
-
-- (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
-- (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
-
-#### Multi-modal input limits
-
-You can allow a smaller number of multi-modal items per prompt to reduce the memory footprint of the model:
-
-```python
-from vllm import LLM
-
-# Accept up to 3 images and 1 video per prompt
-llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
-          limit_mm_per_prompt={"image": 3, "video": 1})
-```
-
-You can go a step further and disable unused modalities completely by setting its limit to zero.
-For example, if your application only accepts image input, there is no need to allocate any memory for videos.
-
-```python
-from vllm import LLM
-
-# Accept any number of images but no videos
-llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
-          limit_mm_per_prompt={"video": 0})
-```
-
-You can even run a multi-modal model for text-only inference:
-
-```python
-from vllm import LLM
-
-# Don't accept images. Just text.
-llm = LLM(model="google/gemma-3-27b-it",
-          limit_mm_per_prompt={"image": 0})
-```
-
-#### Multi-modal processor arguments
-
-For certain models, you can adjust the multi-modal processor arguments to
-reduce the size of the processed multi-modal inputs, which in turn saves memory.
-
-Here are some examples:
-
-```python
-from vllm import LLM
-
-# Available for Qwen2-VL series models
-llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
-          mm_processor_kwargs={
-              "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
-          })
-
-# Available for InternVL series models
-llm = LLM(model="OpenGVLab/InternVL2-2B",
-          mm_processor_kwargs={
-              "max_dynamic_patch": 4,  # Default is 12
-          })
-```
-
-### Performance optimization and tuning
-
-You can potentially improve the performance of vLLM by finetuning various options.
-Please refer to [this guide][optimization-and-tuning] for more details.
--- a/docs/usage/README.md
+++ b/docs/usage/README.md
@ -0,0 +1,7 @@
+# Using vLLM
+
+vLLM supports the following usage patterns:
+
+- [Inference and Serving](../serving/offline_inference.md): Run a single instance of a model.
+- [Deployment](../deployment/docker.md): Scale up model instances for production.
+- [Training](../training/rlhf.md): Train or fine-tune a model.
--- a/docs/serving/env_vars.md
+++ b/docs/serving/env_vars.md
--- a/docs/getting_started/faq.md
+++ b/docs/getting_started/faq.md
--- a/docs/serving/metrics.md
+++ b/docs/serving/metrics.md
--- a/docs/serving/seed_parameter_behavior.md
+++ b/docs/serving/seed_parameter_behavior.md
@ -1,4 +1,4 @@
-# Seed Parameter Behavior
+# Reproducibility

 ## Overview

--- a/docs/deployment/security.md
+++ b/docs/deployment/security.md
@ -1,4 +1,4 @@
-# Security Guide
+# Security

 ## Inter-Node Communication

--- a/docs/getting_started/troubleshooting.md
+++ b/docs/getting_started/troubleshooting.md
@ -23,7 +23,7 @@ It'd be better to store the model in a local disk. Additionally, have a look at

 ## Out of memory

-If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider adopting [these options][reducing-memory-usage] to reduce the memory consumption.
+If the model is too large to fit in a single GPU, you will get an out-of-memory (OOM) error. Consider adopting [these options](../configuration/conserving_memory.md) to reduce the memory consumption.

 ## Generation quality changed

@ -159,7 +159,7 @@ If you have seen a warning in your logs like this:
 WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
    initialized. We must use the `spawn` multiprocessing start method. Setting
    VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
-    https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing
+    https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing
    for more information.
 ```

@ -258,7 +258,7 @@ or:
 ValueError: Model architectures ['<arch>'] are not supported for now. Supported architectures: [...]
 ```

-But you are sure that the model is in the [list of supported models][supported-models], there may be some issue with vLLM's model resolution. In that case, please follow [these steps][model-resolution] to explicitly specify the vLLM implementation for the model.
+But you are sure that the model is in the [list of supported models][supported-models], there may be some issue with vLLM's model resolution. In that case, please follow [these steps](../configuration/model_resolution.md) to explicitly specify the vLLM implementation for the model.

 ## Failed to infer device type

--- a/docs/serving/usage_stats.md
+++ b/docs/serving/usage_stats.md
--- a/docs/getting_started/v1_user_guide.md
+++ b/docs/getting_started/v1_user_guide.md
@ -1,4 +1,4 @@
-# vLLM V1 User Guide
+# vLLM V1

 V1 is now enabled by default for all supported use cases, and we will gradually enable it for every use case we plan to support. Please share any feedback on [GitHub](https://github.com/vllm-project/vllm) or in the [vLLM Slack](https://inviter.co/vllm-slack).

--- a/vllm/envs.py
+++ b/vllm/envs.py
@ -164,7 +164,7 @@ def get_vllm_port() -> Optional[int]:
                raise ValueError(
                    f"VLLM_PORT '{port}' appears to be a URI. "
                    "This may be caused by a Kubernetes service discovery issue"
-                    "check the warning in: https://docs.vllm.ai/en/stable/serving/env_vars.html"
+                    ". Check the warning in: https://docs.vllm.ai/en/stable/usage/env_vars.html"
                )
        except Exception:
            pass
--- a/vllm/utils.py
+++ b/vllm/utils.py
@ -2531,7 +2531,7 @@ def _maybe_force_spawn():
        logger.warning(
            "We must use the `spawn` multiprocessing start method. "
            "Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. "
-            "See https://docs.vllm.ai/en/latest/getting_started/"
+            "See https://docs.vllm.ai/en/latest/usage/"
            "troubleshooting.html#python-multiprocessing "
            "for more information. Reason: %s", reason)
        os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
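Several of the link updates in this change follow a simple old-path to new-path mapping. The sketch below collects the moves visible in this diff and applies them to a URL, which may help when updating stale links elsewhere. `DOC_PATH_MOVES` and `rewrite_doc_url` are hypothetical names, not part of vLLM, and the table covers only paths that appear in this diff.

```python
# Old -> new documentation paths, taken from the link updates in this diff.
DOC_PATH_MOVES = {
    "contributing/overview.html": "contributing",
    "getting_started/troubleshooting.html": "usage/troubleshooting.html",
    "getting_started/v1_user_guide.html": "usage/v1_guide.html",
    "serving/env_vars.html": "usage/env_vars.html",
}


def rewrite_doc_url(url):
    """Return `url` with the first matching old path replaced by its new path."""
    for old, new in DOC_PATH_MOVES.items():
        if old in url:
            return url.replace(old, new, 1)
    return url  # unknown paths are left untouched


old_url = ("https://docs.vllm.ai/en/latest/getting_started/"
           "troubleshooting.html#python-multiprocessing")
new_url = rewrite_doc_url(old_url)
print(new_url)
```

Anchors such as `#python-multiprocessing` are preserved because only the path segment is replaced.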