[Docs] Add comprehensive CLI reference for all large vllm subcommands (#22601)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Harry Mellor
2025-08-11 08:13:33 +01:00
committed by GitHub
parent 1e55dfa7e5
commit bc1d02ac85
20 changed files with 205 additions and 110 deletions


@ -11,7 +11,7 @@ nav:
- Quick Links:
- User Guide: usage/README.md
- Developer Guide: contributing/README.md
- API Reference: api/summary.md
- API Reference: api/README.md
- CLI Reference: cli/README.md
- Timeline:
- Roadmap: https://roadmap.vllm.ai
@ -58,11 +58,9 @@ nav:
- CI: contributing/ci
- Design Documents: design
- API Reference:
- Summary: api/summary.md
- Contents:
- api/vllm/*
- CLI Reference:
- Summary: cli/README.md
- api/README.md
- api/vllm/*
- CLI Reference: cli
- Community:
- community/*
- Blog: https://blog.vllm.ai

docs/cli/.meta.yml (new file)

@ -0,0 +1 @@
toc_depth: 3

docs/cli/.nav.yml (new file)

@ -0,0 +1,8 @@
nav:
- README.md
- serve.md
- chat.md
- complete.md
- run-batch.md
- vllm bench:
- bench/*.md


@ -1,7 +1,3 @@
---
toc_depth: 4
---
# vLLM CLI Guide
The vllm command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:
@@ -16,52 +12,48 @@ Available Commands:
 vllm {chat,complete,serve,bench,collect-env,run-batch}
 ```
-When passing JSON CLI arguments, the following sets of arguments are equivalent:
-- `--json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'`
-- `--json-arg.key1 value1 --json-arg.key2.key3 value2`
-Additionally, list elements can be passed individually using `+`:
-- `--json-arg '{"key4": ["value3", "value4", "value5"]}'`
-- `--json-arg.key4+ value3 --json-arg.key4+='value4,value5'`
 ## serve
-Start the vLLM OpenAI Compatible API server.
+Starts the vLLM OpenAI Compatible API server.
-```bash
-# Start with a model
-vllm serve meta-llama/Llama-2-7b-hf
-# Specify the port
-vllm serve meta-llama/Llama-2-7b-hf --port 8100
-# Serve over a Unix domain socket
-vllm serve meta-llama/Llama-2-7b-hf --uds /tmp/vllm.sock
-# Check with --help for more options
-# To list all groups
-vllm serve --help=listgroup
-# To view a argument group
-vllm serve --help=ModelConfig
-# To view a single argument
-vllm serve --help=max-num-seqs
-# To search by keyword
-vllm serve --help=max
-# To view full help with pager (less/more)
-vllm serve --help=page
-```
-### Options
---8<-- "docs/argparse/serve.md"
+??? console "Examples"
+    Start with a model:
+    ```bash
+    vllm serve meta-llama/Llama-2-7b-hf
+    ```
+    Specify the port:
+    ```bash
+    vllm serve meta-llama/Llama-2-7b-hf --port 8100
+    ```
+    Serve over a Unix domain socket:
+    ```bash
+    vllm serve meta-llama/Llama-2-7b-hf --uds /tmp/vllm.sock
+    ```
+    Check with --help for more options:
+    ```bash
+    # To list all groups
+    vllm serve --help=listgroup
+    # To view an argument group
+    vllm serve --help=ModelConfig
+    # To view a single argument
+    vllm serve --help=max-num-seqs
+    # To search by keyword
+    vllm serve --help=max
+    # To view full help with pager (less/more)
+    vllm serve --help=page
+    ```
+See [vllm serve](./serve.md) for the full reference of all available arguments.
## chat
@ -78,6 +70,8 @@ vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1
vllm chat --quick "hi"
```
See [vllm chat](./chat.md) for the full reference of all available arguments.
## complete
Generate text completions based on the given prompt via the running API server.
@ -93,7 +87,7 @@ vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1
vllm complete --quick "The future of AI is"
```
</details>
See [vllm complete](./complete.md) for the full reference of all available arguments.
## bench
@ -120,6 +114,8 @@ vllm bench latency \
--load-format dummy
```
See [vllm bench latency](./bench/latency.md) for the full reference of all available arguments.
### serve
Benchmark the online serving throughput.
@ -134,6 +130,8 @@ vllm bench serve \
--num-prompts 5
```
See [vllm bench serve](./bench/serve.md) for the full reference of all available arguments.
### throughput
Benchmark offline inference throughput.
@ -147,6 +145,8 @@ vllm bench throughput \
--load-format dummy
```
See [vllm bench throughput](./bench/throughput.md) for the full reference of all available arguments.
## collect-env
Start collecting environment information.
@ -159,24 +159,25 @@ vllm collect-env
Run batch prompts and write results to file.
<details>
<summary>Examples</summary>
Running with a local file:
```bash
# Running with a local file
vllm run-batch \
-i offline_inference/openai_batch/openai_example_batch.jsonl \
-o results.jsonl \
--model meta-llama/Meta-Llama-3-8B-Instruct
```
# Using remote file
Using a remote file:
```bash
vllm run-batch \
-i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
-o results.jsonl \
--model meta-llama/Meta-Llama-3-8B-Instruct
```
</details>
See [vllm run-batch](./run-batch.md) for the full reference of all available arguments.
## More Help


@ -0,0 +1,9 @@
# vllm bench latency
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Options
--8<-- "docs/argparse/bench_latency.md"

docs/cli/bench/serve.md (new file)

@ -0,0 +1,9 @@
# vllm bench serve
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Options
--8<-- "docs/argparse/bench_serve.md"


@ -0,0 +1,9 @@
# vllm bench throughput
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Options
--8<-- "docs/argparse/bench_throughput.md"

docs/cli/chat.md (new file)

@ -0,0 +1,5 @@
# vllm chat
## Options
--8<-- "docs/argparse/chat.md"

docs/cli/complete.md (new file)

@ -0,0 +1,5 @@
# vllm complete
## Options
--8<-- "docs/argparse/complete.md"

docs/cli/json_tip.inc.md (new file)

@ -0,0 +1,9 @@
When passing JSON CLI arguments, the following sets of arguments are equivalent:
- `--json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'`
- `--json-arg.key1 value1 --json-arg.key2.key3 value2`
Additionally, list elements can be passed individually using `+`:
- `--json-arg '{"key4": ["value3", "value4", "value5"]}'`
- `--json-arg.key4+ value3 --json-arg.key4+='value4,value5'`

docs/cli/run-batch.md (new file)

@ -0,0 +1,9 @@
# vllm run-batch
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Options
--8<-- "docs/argparse/run-batch.md"

docs/cli/serve.md (new file)

@ -0,0 +1,9 @@
# vllm serve
## JSON CLI Arguments
--8<-- "docs/cli/json_tip.inc.md"
## Options
--8<-- "docs/argparse/serve.md"


@ -11,15 +11,7 @@ Engine arguments control the behavior of the vLLM engine.
The engine argument classes, [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs], are a combination of the configuration classes defined in [vllm.config][]. Therefore, if you are interested in developer documentation, we recommend looking at these configuration classes as they are the source of truth for types, defaults and docstrings.
When passing JSON CLI arguments, the following sets of arguments are equivalent:
- `--json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'`
- `--json-arg.key1 value1 --json-arg.key2.key3 value2`
Additionally, list elements can be passed individually using `+`:
- `--json-arg '{"key4": ["value3", "value4", "value5"]}'`
- `--json-arg.key4+ value3 --json-arg.key4+='value4,value5'`
--8<-- "docs/cli/json_tip.inc.md"
## `EngineArgs`


@ -15,8 +15,14 @@ sys.modules["aiohttp"] = MagicMock()
sys.modules["blake3"] = MagicMock()
sys.modules["vllm._C"] = MagicMock()
from vllm.benchmarks import latency # noqa: E402
from vllm.benchmarks import serve # noqa: E402
from vllm.benchmarks import throughput # noqa: E402
from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs # noqa: E402
from vllm.entrypoints.openai.cli_args import make_arg_parser # noqa: E402
from vllm.entrypoints.cli.openai import ChatCommand # noqa: E402
from vllm.entrypoints.cli.openai import CompleteCommand # noqa: E402
from vllm.entrypoints.openai import cli_args # noqa: E402
from vllm.entrypoints.openai import run_batch # noqa: E402
from vllm.utils import FlexibleArgumentParser # noqa: E402
logger = logging.getLogger("mkdocs")
@ -68,7 +74,8 @@ class MarkdownFormatter(HelpFormatter):
self._markdown_output.append(
f"Possible choices: {metavar}\n\n")
self._markdown_output.append(f"{action.help}\n\n")
if action.help:
self._markdown_output.append(f"{action.help}\n\n")
if (default := action.default) != SUPPRESS:
self._markdown_output.append(f"Default: `{default}`\n\n")
@ -78,7 +85,7 @@ class MarkdownFormatter(HelpFormatter):
return "".join(self._markdown_output)
def create_parser(cls, **kwargs) -> FlexibleArgumentParser:
def create_parser(add_cli_args, **kwargs) -> FlexibleArgumentParser:
"""Create a parser for the given class with markdown formatting.
Args:
@ -88,18 +95,12 @@ def create_parser(cls, **kwargs) -> FlexibleArgumentParser:
Returns:
FlexibleArgumentParser: A parser with markdown formatting for the class.
"""
parser = FlexibleArgumentParser()
parser = FlexibleArgumentParser(add_json_tip=False)
parser.formatter_class = MarkdownFormatter
with patch("vllm.config.DeviceConfig.__post_init__"):
return cls.add_cli_args(parser, **kwargs)
def create_serve_parser() -> FlexibleArgumentParser:
"""Create a parser for the serve command with markdown formatting."""
parser = FlexibleArgumentParser()
parser.formatter_class = lambda prog: MarkdownFormatter(
prog, starting_heading_level=4)
return make_arg_parser(parser)
_parser = add_cli_args(parser, **kwargs)
# add_cli_args might modify the parser in place, so fall back to parser if _parser is None
return _parser or parser
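To make the `_parser or parser` fallback concrete, here is a small self-contained illustration using plain argparse stand-ins (not vLLM code): some argument-registration callables mutate the parser in place and return nothing, others build and return one, and the refactored helper has to accept both shapes.

```python
# Sketch only: mirrors the "return _parser or parser" logic with generic argparse.
import argparse


def in_place_style(parser: argparse.ArgumentParser) -> None:
    parser.add_argument("--alpha")  # mutates the parser, returns None


def returning_style(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    parser.add_argument("--beta")
    return parser


def create_parser_sketch(add_cli_args) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    _parser = add_cli_args(parser)
    # add_cli_args might modify the parser in place, so fall back to parser
    return _parser or parser


assert create_parser_sketch(in_place_style).parse_args(["--alpha", "1"]).alpha == "1"
assert create_parser_sketch(returning_style).parse_args(["--beta", "2"]).beta == "2"
```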
def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
@ -113,10 +114,24 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool):
# Create parsers to document
parsers = {
"engine_args": create_parser(EngineArgs),
"async_engine_args": create_parser(AsyncEngineArgs,
async_args_only=True),
"serve": create_serve_parser(),
"engine_args":
create_parser(EngineArgs.add_cli_args),
"async_engine_args":
create_parser(AsyncEngineArgs.add_cli_args, async_args_only=True),
"serve":
create_parser(cli_args.make_arg_parser),
"chat":
create_parser(ChatCommand.add_cli_args),
"complete":
create_parser(CompleteCommand.add_cli_args),
"bench_latency":
create_parser(latency.add_cli_args),
"bench_throughput":
create_parser(throughput.add_cli_args),
"bench_serve":
create_parser(serve.add_cli_args),
"run-batch":
create_parser(run_batch.make_arg_parser),
}
# Generate documentation for each parser
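Presumably the rest of the hook (not shown in this hunk) writes each parser's Markdown-formatted help to the `docs/argparse/*.md` files that the new CLI pages pull in via `--8<--` snippets. A hedged sketch of that step, with the directory taken from the snippet paths above and the helper name invented for illustration:

```python
# Hypothetical sketch: render each documented parser to docs/argparse/<name>.md.
from pathlib import Path

ARGPARSE_DOC_DIR = Path("docs/argparse")  # assumed from the "--8<--" paths above


def write_argparse_docs(parsers: dict) -> None:
    ARGPARSE_DOC_DIR.mkdir(parents=True, exist_ok=True)
    for stem, parser in parsers.items():
        # format_help() goes through MarkdownFormatter, so the output is Markdown
        (ARGPARSE_DOC_DIR / f"{stem}.md").write_text(parser.format_help())
```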


@ -29,3 +29,5 @@ setproctitle
torch
transformers
zmq
uvloop
prometheus-client


@ -24,8 +24,6 @@ from vllm.benchmarks.datasets import (AIMODataset, BurstGPTDataset,
from vllm.benchmarks.lib.utils import (convert_to_pytorch_benchmark_format,
write_to_json)
from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
from vllm.entrypoints.openai.api_server import (
build_async_engine_client_from_engine_args)
from vllm.inputs import TextPrompt, TokensPrompt
from vllm.lora.request import LoRARequest
from vllm.outputs import RequestOutput
@ -146,6 +144,8 @@ async def run_vllm_async(
disable_detokenize: bool = False,
) -> float:
from vllm import SamplingParams
from vllm.entrypoints.openai.api_server import (
build_async_engine_client_from_engine_args)
async with build_async_engine_client_from_engine_args(
engine_args,


@ -130,28 +130,33 @@ class ChatCommand(CLISubcommand):
conversation.append(response_message) # type: ignore
print(output)
def subparser_init(
self,
subparsers: argparse._SubParsersAction) -> FlexibleArgumentParser:
chat_parser = subparsers.add_parser(
"chat",
help="Generate chat completions via the running API server.",
description="Generate chat completions via the running API server.",
usage="vllm chat [options]")
_add_query_options(chat_parser)
chat_parser.add_argument(
@staticmethod
def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
"""Add CLI arguments for the chat command."""
_add_query_options(parser)
parser.add_argument(
"--system-prompt",
type=str,
default=None,
help=("The system prompt to be added to the chat template, "
"used for models that support system prompts."))
chat_parser.add_argument("-q",
"--quick",
type=str,
metavar="MESSAGE",
help=("Send a single prompt as MESSAGE "
"and print the response, then exit."))
return chat_parser
parser.add_argument("-q",
"--quick",
type=str,
metavar="MESSAGE",
help=("Send a single prompt as MESSAGE "
"and print the response, then exit."))
return parser
def subparser_init(
self,
subparsers: argparse._SubParsersAction) -> FlexibleArgumentParser:
parser = subparsers.add_parser(
"chat",
help="Generate chat completions via the running API server.",
description="Generate chat completions via the running API server.",
usage="vllm chat [options]")
return ChatCommand.add_cli_args(parser)
class CompleteCommand(CLISubcommand):
@ -179,25 +184,30 @@ class CompleteCommand(CLISubcommand):
output = completion.choices[0].text
print(output)
def subparser_init(
self,
subparsers: argparse._SubParsersAction) -> FlexibleArgumentParser:
complete_parser = subparsers.add_parser(
"complete",
help=("Generate text completions based on the given prompt "
"via the running API server."),
description=("Generate text completions based on the given prompt "
"via the running API server."),
usage="vllm complete [options]")
_add_query_options(complete_parser)
complete_parser.add_argument(
@staticmethod
def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
"""Add CLI arguments for the complete command."""
_add_query_options(parser)
parser.add_argument(
"-q",
"--quick",
type=str,
metavar="PROMPT",
help=
"Send a single prompt and print the completion output, then exit.")
return complete_parser
return parser
def subparser_init(
self,
subparsers: argparse._SubParsersAction) -> FlexibleArgumentParser:
parser = subparsers.add_parser(
"complete",
help=("Generate text completions based on the given prompt "
"via the running API server."),
description=("Generate text completions based on the given prompt "
"via the running API server."),
usage="vllm complete [options]")
return CompleteCommand.add_cli_args(parser)
def cmd_init() -> list[CLISubcommand]:
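The shape of the refactor above is the same for both commands: argument registration moves into a static `add_cli_args` that the docs hook can call on a bare parser, while `subparser_init` only registers the subcommand and delegates to it. A hypothetical sketch of a new subcommand following that pattern (`EchoCommand`, its flag, and the `CLISubcommand` import path are assumptions for illustration, not part of this change):

```python
import argparse

from vllm.entrypoints.cli.types import CLISubcommand  # import path assumed
from vllm.utils import FlexibleArgumentParser


class EchoCommand(CLISubcommand):
    """Hypothetical subcommand illustrating the add_cli_args/subparser_init split."""

    name = "echo"

    @staticmethod
    def cmd(args: argparse.Namespace) -> None:
        print(args.message)

    @staticmethod
    def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
        # All argument registration lives here, so tooling can document the
        # command by calling add_cli_args on a bare parser.
        parser.add_argument("--message", type=str, help="Text to echo back.")
        return parser

    def subparser_init(
            self,
            subparsers: argparse._SubParsersAction) -> FlexibleArgumentParser:
        parser = subparsers.add_parser("echo",
                                       help="Echo a message.",
                                       usage="vllm echo [options]")
        return EchoCommand.add_cli_args(parser)
```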


@ -20,7 +20,6 @@ from vllm.engine.arg_utils import AsyncEngineArgs, optional_type
from vllm.engine.protocol import EngineClient
from vllm.entrypoints.logger import RequestLogger
# yapf: disable
from vllm.entrypoints.openai.api_server import build_async_engine_client
from vllm.entrypoints.openai.protocol import (BatchRequestInput,
BatchRequestOutput,
BatchResponseData,
@ -34,7 +33,6 @@ from vllm.entrypoints.openai.serving_models import (BaseModelPath,
OpenAIServingModels)
from vllm.entrypoints.openai.serving_score import ServingScores
from vllm.logger import init_logger
from vllm.usage.usage_lib import UsageContext
from vllm.utils import FlexibleArgumentParser, random_uuid
from vllm.version import __version__ as VLLM_VERSION
@ -469,6 +467,9 @@ async def run_batch(
async def main(args: Namespace):
from vllm.entrypoints.openai.api_server import build_async_engine_client
from vllm.usage.usage_lib import UsageContext
async with build_async_engine_client(
args,
usage_context=UsageContext.OPENAI_BATCH_RUNNER,


@ -1682,6 +1682,8 @@ class FlexibleArgumentParser(ArgumentParser):
# Set the default "formatter_class" to SortedHelpFormatter
if "formatter_class" not in kwargs:
kwargs["formatter_class"] = SortedHelpFormatter
# Pop kwarg "add_json_tip" to control whether to add the JSON tip
self.add_json_tip = kwargs.pop("add_json_tip", True)
super().__init__(*args, **kwargs)
if sys.version_info < (3, 13):
@ -1726,7 +1728,8 @@ class FlexibleArgumentParser(ArgumentParser):
def format_help(self) -> str:
# Add tip about JSON arguments to the epilog
epilog = self.epilog or ""
if not epilog.startswith(FlexibleArgumentParser._json_tip):
if (self.add_json_tip
and not epilog.startswith(FlexibleArgumentParser._json_tip)):
self.epilog = FlexibleArgumentParser._json_tip + epilog
return super().format_help()
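A minimal sketch of what the new keyword changes, relying only on the logic visible in this hunk: the JSON tip is prepended to the epilog during `format_help` unless `add_json_tip=False` was passed, as the docs hook above now does.

```python
# Sketch: the JSON tip reaches --help output only when add_json_tip stays True.
from vllm.utils import FlexibleArgumentParser

with_tip = FlexibleArgumentParser(prog="demo")
without_tip = FlexibleArgumentParser(prog="demo", add_json_tip=False)

with_tip.format_help()     # prepends _json_tip to the epilog
without_tip.format_help()  # leaves the epilog untouched

assert (with_tip.epilog or "").startswith(FlexibleArgumentParser._json_tip)
assert not (without_tip.epilog or "").startswith(FlexibleArgumentParser._json_tip)
```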