[Docs] Replace all explicit anchors with real links (#27087)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Harry Mellor
2025-10-17 10:22:06 +01:00
committed by GitHub
parent e20eba753b
commit 483ea64611
28 changed files with 54 additions and 139 deletions

View File

@ -4,7 +4,6 @@ MD013: false
MD024:
siblings_only: true
MD033: false
MD042: false
MD045: false
MD046: false
MD051: false

View File

@ -20,8 +20,6 @@ API documentation for vLLM's configuration classes.
- [vllm.config.CompilationConfig][]
- [vllm.config.VllmConfig][]
[](){ #offline-inference-api }
## Offline Inference
LLM Class.
@ -45,18 +43,14 @@ Engine classes for offline and online inference.
Inference parameters for vLLM APIs.
[](){ #sampling-params }
- [vllm.SamplingParams][]
- [vllm.PoolingParams][]
[](){ #multi-modality }
## Multi-Modality
vLLM provides experimental support for multi-modal models through the [vllm.multimodal][] package.
Multi-modal inputs can be passed alongside text and token prompts to [supported models][supported-mm-models]
Multi-modal inputs can be passed alongside text and token prompts to [supported models](../models/supported_models.md#list-of-multimodal-language-models)
via the `multi_modal_data` field in [vllm.inputs.PromptType][].
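For illustration, a minimal sketch of passing an image this way; the model name and prompt format are placeholders and must match the chat template of the model you actually use:
```python
from PIL import Image
from vllm import LLM

# Illustrative multi-modal model; other supported models work similarly.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("example.jpg")

outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is in this image?\nASSISTANT:",
    "multi_modal_data": {"image": image},
})
print(outputs[0].outputs[0].text)
```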
Looking to add your own multi-modal model? Please follow the instructions listed [here](../contributing/model/multimodal.md).

View File

@ -4,6 +4,6 @@ This section lists the most common options for running vLLM.
There are three main levels of configuration, from highest priority to lowest priority:
- [Request parameters][completions-api] and [input arguments][sampling-params]
- [Request parameters](../serving/openai_compatible_server.md#completions-api) and [input arguments](../api/README.md#inference-parameters)
- [Engine arguments](./engine_args.md)
- [Environment variables](./env_vars.md)

View File

@ -27,8 +27,6 @@ You can monitor the number of preemption requests through Prometheus metrics exp
In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.
[](){ #chunked-prefill }
## Chunked Prefill
Chunked prefill allows vLLM to process large prefills in smaller chunks and batch them together with decode requests. This feature helps improve both throughput and latency by better balancing compute-bound (prefill) and memory-bound (decode) operations.
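As a rough sketch, chunked prefill can be tuned when constructing the engine; the argument names come from vLLM's engine arguments, and the defaults vary by version:
```python
from vllm import LLM

# Chunked prefill is on by default in V1; max_num_batched_tokens bounds how many
# tokens (prefill chunks plus decodes) are batched in each scheduler step.
llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
)
```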

View File

@ -7,8 +7,8 @@ toc_depth: 4
vLLM provides comprehensive benchmarking tools for performance testing and evaluation:
- **[Benchmark CLI]**: `vllm bench` CLI tools and specialized benchmark scripts for interactive performance testing
- **[Performance benchmarks][performance-benchmarks]**: Automated CI benchmarks for development
- **[Nightly benchmarks][nightly-benchmarks]**: Comparative benchmarks against alternatives
- **[Performance benchmarks](#performance-benchmarks)**: Automated CI benchmarks for development
- **[Nightly benchmarks](#nightly-benchmarks)**: Comparative benchmarks against alternatives
[Benchmark CLI]: #benchmark-cli
@ -924,8 +924,6 @@ throughput numbers correctly is also adjusted.
</details>
[](){ #performance-benchmarks }
## Performance Benchmarks
The performance benchmarks are used for development to confirm whether new changes improve performance under various workloads. They are triggered on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.
@ -988,8 +986,6 @@ The benchmarking currently runs on a predefined set of models configured in the
All continuous benchmarking results are automatically published to the public [vLLM Performance Dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm).
[](){ #nightly-benchmarks }
## Nightly Benchmarks
These compare vLLM's performance against alternatives (`tgi`, `trt-llm`, and `lmdeploy`) when there are major updates of vLLM (e.g., bumping up to a new version). They are primarily intended for consumers to evaluate when to choose vLLM over other options and are triggered on every commit with both the `perf-benchmarks` and `nightly-benchmarks` labels.

View File

@ -1,7 +1,7 @@
# Summary
!!! important
Many decoder language models can now be automatically loaded using the [Transformers backend][transformers-backend] without having to implement them in vLLM. See if `vllm serve <model>` works first!
Many decoder language models can now be automatically loaded using the [Transformers backend](../../models/supported_models.md#transformers) without having to implement them in vLLM. See if `vllm serve <model>` works first!
vLLM models are specialized [PyTorch](https://pytorch.org/) models that take advantage of various [features](../../features/README.md#compatibility-matrix) to optimize their performance.
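For example, a hedged sketch of forcing the Transformers backend from Python, assuming the `model_impl` engine argument (the model name is illustrative):
```python
from vllm import LLM

# Use the Transformers implementation instead of a native vLLM one.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", model_impl="transformers")
```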

View File

@ -8,7 +8,7 @@ This page provides detailed instructions on how to do so.
## Built-in models
To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source][build-from-source].
To add a model directly to the vLLM library, start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source](../../getting_started/installation/gpu.md#build-wheel-from-source).
This gives you the ability to modify the codebase and test your model.
After you have implemented your model (see [tutorial](basic.md)), put it into the [vllm/model_executor/models](../../../vllm/model_executor/models) directory.

View File

@ -39,8 +39,6 @@ For [generative models](../../models/generative_models.md), there are two levels
For [pooling models](../../models/pooling_models.md), we simply check the cosine similarity, as defined in [tests/models/utils.py](../../../tests/models/utils.py).
[](){ #mm-processing-tests }
### Multi-modal processing
#### Common tests

View File

@ -1,7 +1,5 @@
# Using Docker
[](){ #deployment-docker-pre-built-image }
## Use vLLM's Official Docker Image
vLLM offers an official Docker image for deployment.
@ -62,8 +60,6 @@ You can add any other [engine-args](../configuration/engine_args.md) you need af
RUN uv pip install --system git+https://github.com/huggingface/transformers.git
```
[](){ #deployment-docker-build-image-from-source }
## Building vLLM's Docker Image from Source
You can build and run vLLM from source via the provided [docker/Dockerfile](../../docker/Dockerfile). To build vLLM:

View File

@ -1,7 +1,5 @@
# Anyscale
[](){ #deployment-anyscale }
[Anyscale](https://www.anyscale.com) is a managed, multi-cloud platform developed by the creators of Ray.
Anyscale automates the entire lifecycle of Ray clusters in your AWS, GCP, or Azure account, delivering the flexibility of open-source Ray

View File

@ -2,8 +2,6 @@
This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.
[](){ #nginxloadbalancer-nginx-build }
## Build Nginx Container
This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.
@ -27,8 +25,6 @@ Build the container:
docker build . -f Dockerfile.nginx --tag nginx-lb
```
[](){ #nginxloadbalancer-nginx-conf }
## Create Simple Nginx Config file
Create a file named `nginx_conf/nginx.conf`. Note that you can add as many servers as you'd like. In the example below, we'll start with two. To add more, add another `server vllmN:8000 max_fails=3 fail_timeout=10000s;` entry to `upstream backend`.
@ -53,8 +49,6 @@ Create a file named `nginx_conf/nginx.conf`. Note that you can add as many serve
}
```
[](){ #nginxloadbalancer-nginx-vllm-container }
## Build vLLM Container
```bash
@ -73,16 +67,12 @@ docker build \
--build-arg https_proxy=$https_proxy
```
[](){ #nginxloadbalancer-nginx-docker-network }
## Create Docker Network
```bash
docker network create vllm_nginx
```
[](){ #nginxloadbalancer-nginx-launch-container }
## Launch vLLM Containers
Notes:
@ -122,8 +112,6 @@ Notes:
!!! note
If you are behind a proxy, you can pass the proxy settings to the `docker run` command via `-e http_proxy=$http_proxy -e https_proxy=$https_proxy`.
[](){ #nginxloadbalancer-nginx-launch-nginx }
## Launch Nginx
```bash
@ -135,8 +123,6 @@ docker run \
--name nginx-lb nginx-lb:latest
```
[](){ #nginxloadbalancer-nginx-verify-nginx }
## Verify That vLLM Servers Are Ready
```bash

View File

@ -47,7 +47,7 @@ Here is a sample of `LLM` class usage:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
More API details can be found in the [Offline Inference](#offline-inference-api) section of the API docs.
More API details can be found in the [Offline Inference](../api/README.md#offline-inference) section of the API docs.
The code for the `LLM` class can be found in [vllm/entrypoints/llm.py](../../vllm/entrypoints/llm.py).

View File

@ -1,6 +1,6 @@
# Multi-Modal Data Processing
To enable various optimizations in vLLM such as [chunked prefill][chunked-prefill] and [prefix caching](../features/automatic_prefix_caching.md), we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
To enable various optimizations in vLLM such as [chunked prefill](../configuration/optimization.md#chunked-prefill) and [prefix caching](../features/automatic_prefix_caching.md), we use [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor] to provide the correspondence between placeholder feature tokens (e.g. `<image>`) and multi-modal inputs (e.g. the raw input image) based on the outputs of HF processor.
Here are the main features of [BaseMultiModalProcessor][vllm.multimodal.processing.BaseMultiModalProcessor]:
@ -41,14 +41,10 @@ While HF processors support text + multi-modal inputs natively, this is not so f
Moreover, since the tokenized text has not passed through the HF processor, we have to apply Step 3 by ourselves to keep the output tokens and multi-modal data consistent with each other.
[](){ #mm-dummy-text }
### Dummy text
We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via [get_dummy_text][vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text]. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data.
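A minimal sketch of what such an override might look like; the class name and the `<image>` placeholder token are illustrative, not taken from a specific model, and the other abstract methods are omitted:
```python
from collections.abc import Mapping

from vllm.multimodal.profiling import BaseDummyInputsBuilder


class MyDummyInputsBuilder(BaseDummyInputsBuilder):
    def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
        # One placeholder token per expected image input.
        return "<image>" * mm_counts.get("image", 0)
```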
[](){ #mm-automatic-prompt-updating }
### Automatic prompt updating
We address the second issue by implementing model-agnostic code in
@ -64,4 +60,4 @@ Some HF processors, such as the one for Qwen2-VL, are [very slow](https://github
When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache.
Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of the multi-modal inputs, so they can't be passed alongside the text prompt to HF processor. Therefore, we process the text and multi-modal inputs separately, using [dummy text][mm-dummy-text] to avoid HF errors. Since this skips HF's prompt updating code, we apply [automatic prompt updating][mm-automatic-prompt-updating] afterwards to keep the output tokens and multi-modal data consistent with each other.
Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of the multi-modal inputs, so they can't be passed alongside the text prompt to HF processor. Therefore, we process the text and multi-modal inputs separately, using [dummy text](#dummy-text) to avoid HF errors. Since this skips HF's prompt updating code, we apply [automatic prompt updating](#automatic-prompt-updating) afterwards to keep the output tokens and multi-modal data consistent with each other.

View File

@ -2,7 +2,7 @@
## Debugging
Please see the [Troubleshooting][troubleshooting-python-multiprocessing]
Please see the [Troubleshooting](../usage/troubleshooting.md#python-multiprocessing)
page for information on known issues and how to solve them.
## Introduction

View File

@ -36,9 +36,9 @@ th:not(:first-child) {
}
</style>
| Feature | [CP][chunked-prefill] | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
| Feature | [CP](../configuration/optimization.md#chunked-prefill) | [APC](automatic_prefix_caching.md) | [LoRA](lora.md) | [SD](spec_decode.md) | CUDA graph | [pooling](../models/pooling_models.md) | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search | [prompt-embeds](prompt_embeds.md) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [CP][chunked-prefill] | ✅ | | | | | | | | | | | | | | |
| [CP](../configuration/optimization.md#chunked-prefill) | ✅ | | | | | | | | | | | | | | |
| [APC](automatic_prefix_caching.md) | ✅ | ✅ | | | | | | | | | | | | | |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | | | | | | | | | | | | |
| [SD](spec_decode.md) | ✅ | ✅ | ❌ | ✅ | | | | | | | | | | | |
@ -57,13 +57,11 @@ th:not(:first-child) {
\* Chunked prefill and prefix caching are only applicable to last-token pooling.
<sup>^</sup> LoRA is only applicable to the language backbone of multimodal models.
[](){ #feature-x-hardware }
### Feature x Hardware
| Feature | Volta | Turing | Ampere | Ada | Hopper | CPU | AMD | TPU | Intel GPU |
|-----------------------------------------------------------|---------------------|-----------|-----------|--------|------------|--------------------|--------|-----| ------------|
| [CP][chunked-prefill] | [](https://github.com/vllm-project/vllm/issues/2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [CP](../configuration/optimization.md#chunked-prefill) | [](https://github.com/vllm-project/vllm/issues/2729) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [APC](automatic_prefix_caching.md) | [](https://github.com/vllm-project/vllm/issues/3687) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [LoRA](lora.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [SD](spec_decode.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [🟠](https://github.com/vllm-project/vllm/issues/26963) |

View File

@ -1,6 +1,6 @@
# Multimodal Inputs
This page teaches you how to pass multi-modal inputs to [multi-modal models][supported-mm-models] in vLLM.
This page teaches you how to pass multi-modal inputs to [multi-modal models](../models/supported_models.md#list-of-multimodal-language-models) in vLLM.
!!! note
We are actively iterating on multi-modal support. See [this RFC](https://github.com/vllm-project/vllm/issues/4194) for upcoming changes,

View File

@ -153,7 +153,7 @@ VLLM_TARGET_DEVICE="tpu" python -m pip install -e .
### Pre-built images
See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
See [Using Docker](../../deployment/docker.md) for instructions on using the official Docker image, making sure to substitute the image name `vllm/vllm-openai` with `vllm/vllm-tpu`.
### Build image from source

View File

@ -15,7 +15,7 @@ vLLM contains pre-compiled C++ and CUDA (12.8) binaries.
To be performant, vLLM has to compile many CUDA kernels. Unfortunately, the compilation introduces binary incompatibility with other CUDA and PyTorch versions, even for the same PyTorch version with different build configurations.
Therefore, it is recommended to install vLLM in a **fresh** environment. If you have a different CUDA version or want to use an existing PyTorch installation, you need to build vLLM from source. See [below][build-from-source] for more details.
Therefore, it is recommended to install vLLM in a **fresh** environment. If you have a different CUDA version or want to use an existing PyTorch installation, you need to build vLLM from source. See [below](#build-wheel-from-source) for more details.
# --8<-- [end:set-up-using-python]
# --8<-- [start:pre-built-wheels]
@ -44,8 +44,6 @@ export CUDA_VERSION=118 # or 126
uv pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu${CUDA_VERSION}-cp38-abi3-manylinux1_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu${CUDA_VERSION}
```
[](){ #install-the-latest-code }
#### Install the latest code
LLM inference is a fast-evolving field, and the latest code may contain bug fixes, performance improvements, and new features that are not released yet. To allow users to try the latest code without waiting for the next release, vLLM provides wheels for Linux running on an x86 platform with CUDA 12 for every commit since `v0.5.3`.
@ -128,11 +126,11 @@ export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vll
uv pip install --editable .
```
You can find more information about vLLM's wheels in [install-the-latest-code][install-the-latest-code].
You can find more information about vLLM's wheels in [Install the latest code](#install-the-latest-code).
!!! note
There is a possibility that your source code may have a different commit ID compared to the latest vLLM wheel, which could potentially lead to unknown errors.
It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to [install-the-latest-code][install-the-latest-code] for instructions on how to install a specified wheel.
It is recommended to use the same commit ID for the source code as the vLLM wheel you have installed. Please refer to [Install the latest code](#install-the-latest-code) for instructions on how to install a specified wheel.
#### Full build (with compilation)
@ -250,7 +248,7 @@ uv pip install -e .
# --8<-- [end:build-wheel-from-source]
# --8<-- [start:pre-built-images]
See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for instructions on using the official Docker image.
See [Using Docker](../../deployment/docker.md) for instructions on using the official Docker image.
Another way to access the latest code is to use the docker images:
@ -266,11 +264,11 @@ The latest code can contain bugs and may not be stable. Please use it with cauti
# --8<-- [end:pre-built-images]
# --8<-- [start:build-image-from-source]
See [deployment-docker-build-image-from-source][deployment-docker-build-image-from-source] for instructions on building the Docker image.
See [Building vLLM's Docker Image from Source](../../deployment/docker.md#building-vllms-docker-image-from-source) for instructions on building the Docker image.
# --8<-- [end:build-image-from-source]
# --8<-- [start:supported-features]
See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
See [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information.
# --8<-- [end:supported-features]

View File

@ -66,8 +66,6 @@ vLLM is a Python library that supports the following GPU variants. Select your G
--8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-wheels"
[](){ #build-from-source }
### Build wheel from source
=== "NVIDIA CUDA"

View File

@ -217,6 +217,6 @@ Where the `<path/to/model>` is the location where the model is stored, for examp
# --8<-- [end:build-image-from-source]
# --8<-- [start:supported-features]
See [feature-x-hardware][feature-x-hardware] compatibility matrix for feature support information.
See [Feature x Hardware](../../features/README.md#feature-x-hardware) compatibility matrix for feature support information.
# --8<-- [end:supported-features]

View File

@ -2,8 +2,8 @@
This guide will help you quickly get started with vLLM to perform:
- [Offline batched inference][quickstart-offline]
- [Online serving using OpenAI-compatible server][quickstart-online]
- [Offline batched inference](#offline-batched-inference)
- [Online serving using OpenAI-compatible server](#openai-compatible-server)
## Prerequisites
@ -42,8 +42,6 @@ uv pip install vllm --torch-backend=auto
!!! note
For more details and non-CUDA platforms, please see [here](installation/README.md) for specific instructions on how to install vLLM.
[](){ #quickstart-offline }
## Offline Batched Inference
With vLLM installed, you can start generating text for a list of input prompts (i.e., offline batch inference). See the example script: [examples/offline_inference/basic/basic.py](../../examples/offline_inference/basic/basic.py)
@ -57,7 +55,7 @@ The first line of this example imports the classes [LLM][vllm.LLM] and [Sampling
from vllm import LLM, SamplingParams
```
The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here][sampling-params].
The next section defines a list of input prompts and sampling parameters for text generation. The [sampling temperature](https://arxiv.org/html/2402.05201v1) is set to `0.8` and the [nucleus sampling probability](https://en.wikipedia.org/wiki/Top-p_sampling) is set to `0.95`. You can find more information about the sampling parameters [here](../api/README.md#inference-parameters).
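For reference, the prompts and parameters described above look roughly like this:
```python
from vllm import SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
```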
!!! important
By default, vLLM will use the sampling parameters recommended by the model creator by applying the `generation_config.json` from the Hugging Face model repository if it exists. In most cases, this provides the best results when [SamplingParams][vllm.SamplingParams] is not specified.
@ -135,8 +133,6 @@ for output in outputs:
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
[](){ #quickstart-online }
## OpenAI-Compatible Server
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using the OpenAI API.
@ -150,7 +146,7 @@ vllm serve Qwen/Qwen2.5-1.5B-Instruct
!!! note
By default, the server uses a predefined chat template stored in the tokenizer.
You can learn about overriding it [here][chat-template].
You can learn about overriding it [here](../serving/openai_compatible_server.md#chat-template).
!!! important
By default, the server applies `generation_config.json` from the Hugging Face model repository if it exists. This means the default values of certain sampling parameters can be overridden by those recommended by the model creator.
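Once the server is running, a minimal sketch of querying it with the official OpenAI Python client; the port and model name follow the `vllm serve` command above:
```python
from openai import OpenAI

# vLLM ignores the API key unless one is configured; any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```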

View File

@ -140,5 +140,5 @@ outputs = llm.chat(conversation, chat_template=custom_template)
Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
- [Completions API][completions-api] is similar to `LLM.generate` but only accepts text.
- [Chat API][chat-api] is similar to `LLM.chat`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for models with a chat template.
- [Completions API](../serving/openai_compatible_server.md#completions-api) is similar to `LLM.generate` but only accepts text.
- [Chat API](../serving/openai_compatible_server.md#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for models with a chat template.

View File

@ -185,10 +185,10 @@ print(f"Data: {data!r}")
Our [OpenAI-Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
- [Pooling API][pooling-api] is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API][embeddings-api] is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
- [Classification API][classification-api] is similar to `LLM.classify` and is applicable to sequence classification models.
- [Score API][score-api] is similar to `LLM.score` for cross-encoder models.
- [Pooling API](../serving/openai_compatible_server.md#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API](../serving/openai_compatible_server.md#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](../features/multimodal_inputs.md) for embedding models.
- [Classification API](../serving/openai_compatible_server.md#classification-api) is similar to `LLM.classify` and is applicable to sequence classification models.
- [Score API](../serving/openai_compatible_server.md#score-api) is similar to `LLM.score` for cross-encoder models.
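For instance, the Embeddings API above can be reached with the official OpenAI client. A sketch, assuming the server is serving an embedding model (the model name is illustrative):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="intfloat/e5-mistral-7b-instruct",  # illustrative embedding model
    input=["Hello, world!"],
)
print(len(response.data[0].embedding))
```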
## Matryoshka Embeddings

View File

@ -11,9 +11,7 @@ Alongside each architecture, we include some popular models that use it.
If vLLM natively supports a model, its implementation can be found in [vllm/model_executor/models](../../vllm/model_executor/models).
These models are what we list in [supported-text-models][supported-text-models] and [supported-mm-models][supported-mm-models].
[](){ #transformers-backend }
These models are what we list in [supported text models](#list-of-text-only-language-models) and [supported multimodal models](#list-of-multimodal-language-models).
### Transformers
@ -60,7 +58,7 @@ For a model to be compatible with the Transformers backend for vLLM it must:
- be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)):
- The model directory must have the correct structure (e.g. `config.json` is present).
- `config.json` must contain `auto_map.AutoModel`.
- be a Transformers backend for vLLM compatible model (see [writing-custom-models][writing-custom-models]):
- be a Transformers backend for vLLM compatible model (see [Writing custom models](#writing-custom-models)):
- Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).
If the compatible model is:
@ -70,8 +68,6 @@ If the compatible model is:
This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
[](){ #writing-custom-models }
#### Writing custom models
This section details the modifications needed to make a Transformers compatible custom model work with the Transformers backend for vLLM. (We assume that a Transformers compatible custom model has already been created; see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models).)
@ -164,7 +160,7 @@ To determine whether a given model is natively supported, you can check the `con
If the `"architectures"` field contains a model architecture listed below, then it should be natively supported.
Models do not _need_ to be natively supported to be used in vLLM.
The [Transformers backend][transformers-backend] enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
The [Transformers backend](#transformers) enables you to run models directly using their Transformers implementation (or even remote code on the Hugging Face Model Hub!).
!!! tip
The easiest way to check if your model is really supported at runtime is to run the program below:
@ -306,8 +302,6 @@ output = llm.encode("Hello, my name is")
print(output)
```
[](){ #feature-status-legend }
## Feature Status Legend
- ✅︎ indicates that the feature is supported for the model.
@ -316,8 +310,6 @@ print(output)
- ⚠️ indicates that the feature is available but may have known issues or limitations.
[](){ #supported-text-models }
## List of Text-only Language Models
### Generative Models
@ -583,8 +575,6 @@ These models primarily support the [`LLM.encode`](./pooling_models.md#llmencode)
!!! note
For Named Entity Recognition (NER) usage, please refer to [examples/offline_inference/pooling/ner.py](../../examples/offline_inference/pooling/ner.py) and [examples/online_serving/pooling/ner_client.py](../../examples/online_serving/pooling/ner_client.py).
[](){ #supported-mm-models }
## List of Multimodal Language Models
The following modalities are supported depending on the model:

View File

@ -4,7 +4,7 @@ For general troubleshooting, see [Troubleshooting](../usage/troubleshooting.md).
## Verify inter-node GPU communication
After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script][troubleshooting-incorrect-hardware-driver]. If you need additional environment variables for communication configuration, append them to [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh), for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <https://github.com/vllm-project/vllm/issues/6803>.
After you start the Ray cluster, verify GPU-to-GPU communication across nodes. Proper configuration can be non-trivial. For more information, see [troubleshooting script](../usage/troubleshooting.md#incorrect-hardwaredriver). If you need additional environment variables for communication configuration, append them to [examples/online_serving/run_cluster.sh](../../examples/online_serving/run_cluster.sh), for example `-e NCCL_SOCKET_IFNAME=eth0`. Setting environment variables during cluster creation is recommended because the variables propagate to all nodes. In contrast, setting environment variables in the shell affects only the local node. For more information, see <https://github.com/vllm-project/vllm/issues/6803>.
## No available node types can fulfill resource request

View File

@ -19,7 +19,7 @@ The available APIs depend on the model type:
- [Pooling models](../models/pooling_models.md) output their hidden states directly.
!!! info
[API Reference][offline-inference-api]
[API Reference](../api/README.md#offline-inference)
## Ray Data LLM API

View File

@ -44,37 +44,35 @@ To call the server, in your preferred text editor, create a script that uses an
We currently support the following OpenAI APIs:
- [Completions API][completions-api] (`/v1/completions`)
- [Completions API](#completions-api) (`/v1/completions`)
- Only applicable to [text generation models](../models/generative_models.md).
- *Note: `suffix` parameter is not supported.*
- [Chat Completions API][chat-api] (`/v1/chat/completions`)
- Only applicable to [text generation models](../models/generative_models.md) with a [chat template][chat-template].
- [Chat Completions API](#chat-api) (`/v1/chat/completions`)
- Only applicable to [text generation models](../models/generative_models.md) with a [chat template](../serving/openai_compatible_server.md#chat-template).
- *Note: `parallel_tool_calls` and `user` parameters are ignored.*
- [Embeddings API][embeddings-api] (`/v1/embeddings`)
- [Embeddings API](#embeddings-api) (`/v1/embeddings`)
- Only applicable to [embedding models](../models/pooling_models.md).
- [Transcriptions API][transcriptions-api] (`/v1/audio/transcriptions`)
- [Transcriptions API](#transcriptions-api) (`/v1/audio/transcriptions`)
- Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
- [Translation API][translations-api] (`/v1/audio/translations`)
- [Translation API](#translations-api) (`/v1/audio/translations`)
- Only applicable to [Automatic Speech Recognition (ASR) models](../models/supported_models.md#transcription).
In addition, we have the following custom APIs:
- [Tokenizer API][tokenizer-api] (`/tokenize`, `/detokenize`)
- [Tokenizer API](#tokenizer-api) (`/tokenize`, `/detokenize`)
- Applicable to any model with a tokenizer.
- [Pooling API][pooling-api] (`/pooling`)
- [Pooling API](#pooling-api) (`/pooling`)
- Applicable to all [pooling models](../models/pooling_models.md).
- [Classification API][classification-api] (`/classify`)
- [Classification API](#classification-api) (`/classify`)
- Only applicable to [classification models](../models/pooling_models.md).
- [Score API][score-api] (`/score`)
- [Score API](#score-api) (`/score`)
- Applicable to [embedding models and cross-encoder models](../models/pooling_models.md).
- [Re-rank API][rerank-api] (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- [Re-rank API](#re-rank-api) (`/rerank`, `/v1/rerank`, `/v2/rerank`)
- Implements [Jina AI's v1 re-rank API](https://jina.ai/reranker/)
- Also compatible with [Cohere's v1 & v2 re-rank APIs](https://docs.cohere.com/v2/reference/rerank)
- Jina and Cohere's APIs are very similar; Jina's includes extra information in the rerank endpoint's response.
- Only applicable to [cross-encoder models](../models/pooling_models.md).
[](){ #chat-template }
## Chat Template
In order for the language model to support chat protocol, vLLM requires the model to include
@ -174,8 +172,6 @@ with `--enable-request-id-headers`.
## API Reference
[](){ #completions-api }
### Completions API
Our Completions API is compatible with [OpenAI's Completions API](https://platform.openai.com/docs/api-reference/completions);
@ -185,7 +181,7 @@ Code example: [examples/online_serving/openai_completion_client.py](../../exampl
#### Extra parameters
The following [sampling parameters][sampling-params] are supported.
The following [sampling parameters](../api/README.md#inference-parameters) are supported.
??? code
@ -201,8 +197,6 @@ The following extra parameters are supported:
--8<-- "vllm/entrypoints/openai/protocol.py:completion-extra-params"
```
[](){ #chat-api }
### Chat API
Our Chat API is compatible with [OpenAI's Chat Completions API](https://platform.openai.com/docs/api-reference/chat);
@ -218,7 +212,7 @@ Code example: [examples/online_serving/openai_chat_completion_client.py](../../e
#### Extra parameters
The following [sampling parameters][sampling-params] are supported.
The following [sampling parameters](../api/README.md#inference-parameters) are supported.
??? code
@ -234,8 +228,6 @@ The following extra parameters are supported:
--8<-- "vllm/entrypoints/openai/protocol.py:chat-completion-extra-params"
```
[](){ #embeddings-api }
### Embeddings API
Our Embeddings API is compatible with [OpenAI's Embeddings API](https://platform.openai.com/docs/api-reference/embeddings);
@ -243,7 +235,7 @@ you can use the [official OpenAI Python client](https://github.com/openai/openai
Code example: [examples/online_serving/pooling/openai_embedding_client.py](../../examples/online_serving/pooling/openai_embedding_client.py)
If the model has a [chat template][chat-template], you can replace `inputs` with a list of `messages` (same schema as [Chat API][chat-api])
If the model has a [chat template](../serving/openai_compatible_server.md#chat-template), you can replace `inputs` with a list of `messages` (same schema as [Chat API](#chat-api))
which will be treated as a single prompt to the model. Here is a convenience function for calling the API while retaining OpenAI's type annotations:
??? code
@ -369,8 +361,6 @@ For chat-like input (i.e. if `messages` is passed), these extra parameters are s
--8<-- "vllm/entrypoints/openai/protocol.py:chat-embedding-extra-params"
```
[](){ #transcriptions-api }
### Transcriptions API
Our Transcriptions API is compatible with [OpenAI's Transcriptions API](https://platform.openai.com/docs/api-reference/audio/createTranscription);
@ -468,7 +458,7 @@ For `verbose_json` response format:
#### Extra Parameters
The following [sampling parameters][sampling-params] are supported.
The following [sampling parameters](../api/README.md#inference-parameters) are supported.
??? code
@ -484,8 +474,6 @@ The following extra parameters are supported:
--8<-- "vllm/entrypoints/openai/protocol.py:transcription-extra-params"
```
[](){ #translations-api }
### Translations API
Our Translation API is compatible with [OpenAI's Translations API](https://platform.openai.com/docs/api-reference/audio/createTranslation);
@ -500,7 +488,7 @@ Code example: [examples/online_serving/openai_translation_client.py](../../examp
#### Extra Parameters
The following [sampling parameters][sampling-params] are supported.
The following [sampling parameters](../api/README.md#inference-parameters) are supported.
```python
--8<-- "vllm/entrypoints/openai/protocol.py:translation-sampling-params"
@ -512,8 +500,6 @@ The following extra parameters are supported:
--8<-- "vllm/entrypoints/openai/protocol.py:translation-extra-params"
```
[](){ #tokenizer-api }
### Tokenizer API
Our Tokenizer API is a simple wrapper over [HuggingFace-style tokenizers](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).
@ -522,18 +508,14 @@ It consists of two endpoints:
- `/tokenize` corresponds to calling `tokenizer.encode()`.
- `/detokenize` corresponds to calling `tokenizer.decode()`.
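A quick sketch of calling the `/tokenize` endpoint above with `requests`; the request fields follow vLLM's tokenize schema, but treat the exact shape as an assumption and check against your server version:
```python
import requests

resp = requests.post(
    "http://localhost:8000/tokenize",
    json={"model": "Qwen/Qwen2.5-1.5B-Instruct", "prompt": "Hello, world!"},
)
print(resp.json())  # token IDs plus counts, depending on server version
```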
[](){ #pooling-api }
### Pooling API
Our Pooling API encodes input prompts using a [pooling model](../models/pooling_models.md) and returns the corresponding hidden states.
The input format is the same as [Embeddings API][embeddings-api], but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
The input format is the same as [Embeddings API](#embeddings-api), but the output data can contain an arbitrary nested list, not just a 1-D list of floats.
Code example: [examples/online_serving/pooling/openai_pooling_client.py](../../examples/online_serving/pooling/openai_pooling_client.py)
[](){ #classification-api }
### Classification API
Our Classification API directly supports Hugging Face sequence-classification models such as [ai21labs/Jamba-tiny-reward-dev](https://huggingface.co/ai21labs/Jamba-tiny-reward-dev) and [jason9693/Qwen2.5-1.5B-apeach](https://huggingface.co/jason9693/Qwen2.5-1.5B-apeach).
@ -649,8 +631,6 @@ The following extra parameters are supported:
--8<-- "vllm/entrypoints/openai/protocol.py:classification-extra-params"
```
[](){ #score-api }
### Score API
Our Score API can apply a cross-encoder model or an embedding model to predict scores for sentence or multimodal pairs. When using an embedding model, the score corresponds to the cosine similarity between each embedding pair.
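A hedged sketch of calling `/score` with `requests`; the `text_1`/`text_2` fields follow the documented score schema, and the model name is illustrative:
```python
import requests

resp = requests.post(
    "http://localhost:8000/score",
    json={
        "model": "BAAI/bge-reranker-v2-m3",  # illustrative cross-encoder
        "text_1": "What is the capital of France?",
        "text_2": ["Paris is the capital of France.", "The sky is blue."],
    },
)
print(resp.json())
```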
@ -856,8 +836,6 @@ The following extra parameters are supported:
--8<-- "vllm/entrypoints/openai/protocol.py:score-extra-params"
```
[](){ #rerank-api }
### Re-rank API
Our Re-rank API can apply an embedding model or a cross-encoder model to predict relevance scores between a single query, and

View File

@ -80,8 +80,6 @@ You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>`
If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph.
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or pass `enforce_eager=True` to the [LLM][vllm.LLM] class, to disable the CUDAGraph optimization and isolate the failing operation.
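For example, a minimal sketch of disabling CUDAGraph capture offline (the model name is illustrative):
```python
from vllm import LLM

# Eager mode runs each CUDA op directly, so the failing op appears in the traceback.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", enforce_eager=True)
```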
[](){ #troubleshooting-incorrect-hardware-driver }
## Incorrect hardware/driver
If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
@ -178,8 +176,6 @@ If the test script hangs or crashes, usually it means the hardware/drivers are b
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
[](){ #troubleshooting-python-multiprocessing }
## Python multiprocessing
### `RuntimeError` Exception