[doc] improve readability (#18675)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
Authored by Reid on 2025-05-25 16:40:31 +08:00, committed by GitHub
parent 624b77a2b3
commit 279f854519
20 changed files with 206 additions and 59 deletions

View File

@ -26,7 +26,12 @@ The edges of the build graph represent:
> Commands to regenerate the build graph (make sure to run it **from the `root` directory of the vLLM repository** where the Dockerfile is present):
>
> ```bash
> dockerfilegraph -o png --legend --dpi 200 --max-label-length 50 --filename docker/Dockerfile
> dockerfilegraph \
> -o png \
> --legend \
> --dpi 200 \
> --max-label-length 50 \
> --filename docker/Dockerfile
> ```
>
> or, if you want to run it directly with the Docker image:

View File

@ -41,7 +41,10 @@ If your model imports modules that initialize CUDA, consider lazy-importing it t
```python
from vllm import ModelRegistry
ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
ModelRegistry.register_model(
"YourModelForCausalLM",
"your_code:YourModelForCausalLM"
)
```
!!! warning

View File

@ -11,7 +11,7 @@ vLLM offers an official Docker image for deployment.
The image can be used to run an OpenAI-compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).
```console
$ docker run --runtime nvidia --gpus all \
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 8000:8000 \
@ -23,7 +23,7 @@ $ docker run --runtime nvidia --gpus all \
This image can also be used with other container engines such as [Podman](https://podman.io/).
```console
$ podman run --gpus all \
podman run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
@ -73,7 +73,10 @@ You can build and run vLLM from source via the provided <gh-file:docker/Dockerfi
```console
# optionally specify: --build-arg max_jobs=8 --build-arg nvcc_threads=2
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai --file docker/Dockerfile
DOCKER_BUILDKIT=1 docker build . \
--target vllm-openai \
--tag vllm/vllm-openai \
--file docker/Dockerfile
```
!!! note
@ -96,8 +99,8 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--
```console
# Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
$ python3 use_existing_torch.py
$ DOCKER_BUILDKIT=1 docker build . \
python3 use_existing_torch.py
DOCKER_BUILDKIT=1 docker build . \
--file docker/Dockerfile \
--target vllm-openai \
--platform "linux/arm64" \
@ -113,7 +116,7 @@ $ DOCKER_BUILDKIT=1 docker build . \
To run vLLM with the custom-built Docker image:
```console
$ docker run --runtime nvidia --gpus all \
docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \

View File

@ -82,7 +82,11 @@ Check the output of the command. There will be a shareable gradio link (like the
**Optional**: Serve the 70B model instead of the default 8B and use more GPUs:
```console
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
HF_TOKEN="your-huggingface-token" \
sky launch serving.yaml \
--gpus A100:8 \
--env HF_TOKEN \
--env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
```
## Scale up to multiple replicas
@ -155,7 +159,9 @@ run: |
Start serving the Llama-3 8B model on multiple replicas:
```console
HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN
HF_TOKEN="your-huggingface-token" \
sky serve up -n vllm serving.yaml \
--env HF_TOKEN
```
Wait until the service is ready:
@ -318,7 +324,9 @@ run: |
1. Start the chat web UI:
```console
sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
sky launch \
-c gui ./gui.yaml \
--env ENDPOINT=$(sky serve status --endpoint vllm)
```
2. Then, we can access the GUI at the returned gradio link:

View File

@ -33,7 +33,8 @@ pip install streamlit openai
streamlit run streamlit_openai_chatbot_webserver.py
# or specify the VLLM_API_BASE or VLLM_API_KEY
VLLM_API_BASE="http://vllm-server-host:vllm-server-port/v1" streamlit run streamlit_openai_chatbot_webserver.py
VLLM_API_BASE="http://vllm-server-host:vllm-server-port/v1" \
streamlit run streamlit_openai_chatbot_webserver.py
# start with debug mode to view more details
streamlit run streamlit_openai_chatbot_webserver.py --logger.level=debug

View File

@ -77,7 +77,11 @@ If you are behind proxy, you can pass the proxy settings to the docker build com
```console
cd $vllm_root
docker build -f docker/Dockerfile . --tag vllm --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy
docker build \
-f docker/Dockerfile . \
--tag vllm \
--build-arg http_proxy=$http_proxy \
--build-arg https_proxy=$https_proxy
```
[](){ #nginxloadbalancer-nginx-docker-network }
@ -102,8 +106,26 @@ Notes:
```console
mkdir -p ~/.cache/huggingface/hub/
hf_cache_dir=~/.cache/huggingface/
docker run -itd --ipc host --network vllm_nginx --gpus device=0 --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8081:8000 --name vllm0 vllm --model meta-llama/Llama-2-7b-chat-hf
docker run -itd --ipc host --network vllm_nginx --gpus device=1 --shm-size=10.24gb -v $hf_cache_dir:/root/.cache/huggingface/ -p 8082:8000 --name vllm1 vllm --model meta-llama/Llama-2-7b-chat-hf
docker run \
-itd \
--ipc host \
--network vllm_nginx \
--gpus device=0 \
--shm-size=10.24gb \
-v $hf_cache_dir:/root/.cache/huggingface/ \
-p 8081:8000 \
--name vllm0 vllm \
--model meta-llama/Llama-2-7b-chat-hf
docker run \
-itd \
--ipc host \
--network vllm_nginx \
--gpus device=1 \
--shm-size=10.24gb \
-v $hf_cache_dir:/root/.cache/huggingface/ \
-p 8082:8000 \
--name vllm1 vllm \
--model meta-llama/Llama-2-7b-chat-hf
```
!!! note
@ -114,7 +136,12 @@ docker run -itd --ipc host --network vllm_nginx --gpus device=1 --shm-size=10.24
## Launch Nginx
```console
docker run -itd -p 8000:80 --network vllm_nginx -v ./nginx_conf/:/etc/nginx/conf.d/ --name nginx-lb nginx-lb:latest
docker run \
-itd \
-p 8000:80 \
--network vllm_nginx \
-v ./nginx_conf/:/etc/nginx/conf.d/ \
--name nginx-lb nginx-lb:latest
```
[](){ #nginxloadbalancer-nginx-verify-nginx }

View File

@ -42,7 +42,9 @@ print(f'Model is quantized and saved at "{quant_path}"')
To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
```console
python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
python examples/offline_inference/llm_engine_example.py \
--model TheBloke/Llama-2-7b-Chat-AWQ \
--quantization awq
```
AWQ models are also supported directly through the LLM entrypoint:
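The entrypoint example itself is outside this hunk; as a rough sketch (the model name comes from the command above, the prompt and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# Load the pre-quantized AWQ checkpoint directly through the LLM entrypoint.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="awq")

# Illustrative prompt and sampling settings.
outputs = llm.generate(["What is AWQ quantization?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```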

View File

@ -33,7 +33,12 @@ import torch
# "hxbgsyxh/llama-13b-4bit-g-1-bitblas" is a pre-quantized checkpoint.
model_id = "hxbgsyxh/llama-13b-4bit-g-1-bitblas"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, quantization="bitblas")
llm = LLM(
model=model_id,
dtype=torch.bfloat16,
trust_remote_code=True,
quantization="bitblas"
)
```
## Read gptq format checkpoint
@ -44,5 +49,11 @@ import torch
# "hxbgsyxh/llama-13b-4bit-g-1" is a pre-quantized checkpoint.
model_id = "hxbgsyxh/llama-13b-4bit-g-1"
llm = LLM(model=model_id, dtype=torch.float16, trust_remote_code=True, quantization="bitblas", max_model_len=1024)
llm = LLM(
model=model_id,
dtype=torch.float16,
trust_remote_code=True,
quantization="bitblas",
max_model_len=1024
)
```

View File

@ -27,7 +27,11 @@ from vllm import LLM
import torch
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True)
llm = LLM(
model=model_id,
dtype=torch.bfloat16,
trust_remote_code=True
)
```
## Inflight quantization: load as 4bit quantization
@ -38,8 +42,12 @@ For inflight 4bit quantization with BitsAndBytes, you need to explicitly specify
from vllm import LLM
import torch
model_id = "huggyllama/llama-7b"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
quantization="bitsandbytes")
llm = LLM(
model=model_id,
dtype=torch.bfloat16,
trust_remote_code=True,
quantization="bitsandbytes"
)
```
## OpenAI Compatible Server

View File

@ -14,14 +14,17 @@ To run a GGUF model with vLLM, you can download and use the local GGUF model fro
```console
wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
# We recommend using the tokenizer from the base model to avoid slow and error-prone tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
You can also add `--tensor-parallel-size 2` to enable tensor-parallel inference with 2 GPUs:
```console
# We recommend using the tokenizer from the base model to avoid slow and error-prone tokenizer conversion.
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--tensor-parallel-size 2
```
!!! warning
@ -31,7 +34,9 @@ GGUF assumes that huggingface can convert the metadata to a config file. In case
```console
# If your model is not supported by huggingface, you can manually provide a huggingface-compatible config path
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --hf-config-path Tinyllama/TInyLlama-1.1B-Chat-v1.0
vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
--tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --hf-config-path TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
You can also use the GGUF model directly through the LLM entrypoint:
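The entrypoint example is not shown in this hunk; a minimal sketch that reuses the GGUF file and tokenizer from the commands above:

```python
from vllm import LLM, SamplingParams

# Point vLLM at the local GGUF file and keep the base model's tokenizer,
# as recommended above.
llm = LLM(
    model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
    tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```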

View File

@ -59,7 +59,8 @@ model.save(quant_path)
To run a GPTQModel-quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:
```console
python examples/offline_inference/llm_engine_example.py --model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
python examples/offline_inference/llm_engine_example.py \
--model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
```
## Using GPTQModel with vLLM's Python API
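The Python API example is not part of this hunk; a hedged sketch of typical usage with the checkpoint referenced above (the prompt and sampling settings are illustrative):

```python
from vllm import LLM, SamplingParams

# The GPTQ quantization config is read from the checkpoint automatically.
llm = LLM(model="ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2")

conversation = [{"role": "user", "content": "Explain GPTQ quantization in one sentence."}]
outputs = llm.chat(conversation, sampling_params=SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```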

View File

@ -7,7 +7,9 @@ We recommend installing the latest torchao nightly with
```console
# Install the latest TorchAO nightly build
# Choose the CUDA version that matches your system (cu126, cu128, etc.)
pip install --pre torchao>=10.0.0 --index-url https://download.pytorch.org/whl/nightly/cu126
pip install \
    --pre "torchao>=10.0.0" \
--index-url https://download.pytorch.org/whl/nightly/cu126
```
## Quantizing HuggingFace Models
@ -20,7 +22,12 @@ from torchao.quantization import Int8WeightOnlyConfig
model_name = "meta-llama/Meta-Llama-3-8B"
quantization_config = TorchAoConfig(Int8WeightOnlyConfig())
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto", quantization_config=quantization_config)
quantized_model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto",
quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

View File

@ -27,7 +27,8 @@ vLLM currently supports the following reasoning models:
To use reasoning models, you need to specify the `--reasoning-parser` flag when starting the server that handles chat completion requests. The `--reasoning-parser` flag specifies the reasoning parser to use for extracting reasoning content from the model output.
```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --reasoning-parser deepseek_r1
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
--reasoning-parser deepseek_r1
```
Next, make a request to the model that should return the reasoning content in the response.
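For example, with the official OpenAI Python client (the local URL and the `reasoning_content` field follow vLLM's reasoning examples; adjust them to your deployment):

```python
from openai import OpenAI

# Assumes the server launched above is running locally on the default port.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
)

message = response.choices[0].message
print("reasoning:", message.reasoning_content)  # extracted by the reasoning parser
print("answer:", message.content)
```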

View File

@ -45,8 +45,13 @@ for output in outputs:
To do the same in online mode, launch the server:
```bash
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --model facebook/opt-6.7b \
--seed 42 -tp 1 --gpu_memory_utilization 0.8 \
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--model facebook/opt-6.7b \
--seed 42 \
-tp 1 \
--gpu_memory_utilization 0.8 \
--speculative_config '{"model": "facebook/opt-125m", "num_speculative_tokens": 5}'
```
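Once the server is up, it can be queried like any other OpenAI-compatible endpoint. A minimal sketch using the Completions API (host, port, and model match the command above; the prompt is illustrative):

```python
from openai import OpenAI

# Speculative decoding is transparent to clients; requests look the same as usual.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="facebook/opt-6.7b",
    prompt="The future of AI is",
    max_tokens=64,
    temperature=0.8,
)
print(completion.choices[0].text)
```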

View File

@ -45,7 +45,15 @@ Use the following commands to run a Docker image:
```console
docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
docker run \
-it \
--runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
--cap-add=sys_nice \
--net=host \
--ipc=host \
vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
```
# --8<-- [end:requirements]
@ -91,7 +99,14 @@ Currently, there are no pre-built Intel Gaudi images.
```console
docker build -f docker/Dockerfile.hpu -t vllm-hpu-env .
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --rm vllm-hpu-env
docker run \
-it \
--runtime=habana \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
--cap-add=sys_nice \
--net=host \
--rm vllm-hpu-env
```
!!! tip

View File

@ -38,7 +38,8 @@ The installation of drivers and tools wouldn't be necessary, if [Deep Learning A
sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
EOF
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB \
| sudo apt-key add -
# Update OS packages
sudo apt-get update -y
@ -96,12 +97,17 @@ source aws_neuron_venv_pytorch/bin/activate
# Install Jupyter notebook kernel
pip install ipykernel
python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
python3.10 -m ipykernel install \
--user \
--name aws_neuron_venv_pytorch \
--display-name "Python (torch-neuronx)"
pip install jupyter notebook
pip install environment_kernels
# Set the pip repository to point to the Neuron repository
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
python -m pip config set \
global.extra-index-url \
https://pip.repos.neuron.amazonaws.com
# Install wget, awscli
python -m pip install wget

View File

@ -55,7 +55,9 @@ LLM inference is a fast-evolving field, and the latest code may contain bug fixe
##### Install the latest code using `pip`
```console
pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install -U vllm \
--pre \
--extra-index-url https://wheels.vllm.ai/nightly
```
`--pre` is required for `pip` to consider pre-release versions.
@ -63,7 +65,9 @@ pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
Another way to install the latest code is to use `uv`:
```console
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
uv pip install -U vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly
```
##### Install specific revisions using `pip`
@ -83,7 +87,9 @@ If you want to access the wheels for previous commits (e.g. to bisect the behavi
```console
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}
uv pip install vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}
```
The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-remember command. A unique feature of `uv` is that packages in `--extra-index-url` have [higher priority than the default index](https://docs.astral.sh/uv/pip/compatibility/#packages-that-exist-on-multiple-indexes). If the latest public release is `v0.6.6.post1`, `uv`'s behavior allows installing a commit before `v0.6.6.post1` by specifying the `--extra-index-url`. In contrast, `pip` combines packages from `--extra-index-url` and the default index, choosing only the latest version, which makes it difficult to install a development version prior to the released version.
@ -192,7 +198,11 @@ Additionally, if you have trouble building vLLM, we recommend using the NVIDIA P
```console
# Use `--ipc=host` to make sure the shared memory is large enough.
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3
docker run \
--gpus all \
-it \
--rm \
--ipc=host nvcr.io/nvidia/pytorch:23.10-py3
```
If you don't want to use Docker, it is recommended to have a full installation of the CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of the CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:

View File

@ -91,19 +91,22 @@ Currently, there are no pre-built ROCm wheels.
4. Build vLLM. For example, vLLM on ROCm 6.3 can be built with the following steps:
```bash
$ pip install --upgrade pip
pip install --upgrade pip
# Build & install AMD SMI
$ pip install /opt/rocm/share/amd_smi
pip install /opt/rocm/share/amd_smi
# Install dependencies
$ pip install --upgrade numba scipy huggingface-hub[cli,hf_transfer] setuptools_scm
$ pip install "numpy<2"
$ pip install -r requirements/rocm.txt
pip install --upgrade numba \
scipy \
huggingface-hub[cli,hf_transfer] \
setuptools_scm
pip install "numpy<2"
pip install -r requirements/rocm.txt
# Build vLLM for MI210/MI250/MI300.
$ export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
$ python3 setup.py develop
export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
python3 setup.py develop
```
This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.
@ -154,7 +157,9 @@ It is important that the user kicks off the docker build using buildkit. Either
To build vLLM on ROCm 6.3 for the MI200 and MI300 series, you can use the default:
```console
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm_base -t rocm/vllm-dev:base .
DOCKER_BUILDKIT=1 docker build \
-f docker/Dockerfile.rocm_base \
-t rocm/vllm-dev:base .
```
#### Build an image with vLLM
@ -189,7 +194,11 @@ DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-rocm .
To build vLLM on ROCm 6.3 for the Radeon RX7900 series (gfx1100), you should pick the alternative base image:
```console
DOCKER_BUILDKIT=1 docker build --build-arg BASE_IMAGE="rocm/vllm-dev:navi_base" -f docker/Dockerfile.rocm -t vllm-rocm .
DOCKER_BUILDKIT=1 docker build \
--build-arg BASE_IMAGE="rocm/vllm-dev:navi_base" \
-f docker/Dockerfile.rocm \
-t vllm-rocm \
.
```
To run the above Docker image `vllm-rocm`, use the command below:

View File

@ -16,19 +16,25 @@ pip3 install vllm[runai]
To run it as an OpenAI-compatible server, add the `--load-format runai_streamer` flag:
```console
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
--load-format runai_streamer
```
To run a model from an AWS S3 object store, run:
```console
vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
vllm serve s3://core-llm/Llama-3-8b \
--load-format runai_streamer
```
To run a model from an S3-compatible object store, run:
```console
RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 AWS_EC2_METADATA_DISABLED=true AWS_ENDPOINT_URL=https://storage.googleapis.com vllm serve s3://core-llm/Llama-3-8b --load-format runai_streamer
RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 \
AWS_EC2_METADATA_DISABLED=true \
AWS_ENDPOINT_URL=https://storage.googleapis.com \
vllm serve s3://core-llm/Llama-3-8b \
--load-format runai_streamer
```
## Tunable parameters
@ -39,14 +45,18 @@ You can tune `concurrency` that controls the level of concurrency and number of
For reading from S3, this is the number of client instances the host opens to the S3 server.
```console
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"concurrency":16}'
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
--load-format runai_streamer \
--model-loader-extra-config '{"concurrency":16}'
```
You can control and limit the size of the CPU memory buffer into which tensors are read from the file.
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
```console
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"memory_limit":5368709120}'
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
--load-format runai_streamer \
--model-loader-extra-config '{"memory_limit":5368709120}'
```
!!! note
@ -63,7 +73,9 @@ vllm serve /path/to/sharded/model --load-format runai_streamer_sharded
The sharded loader expects model files to follow the same naming pattern as the regular sharded state loader: `model-rank-{rank}-part-{part}.safetensors`. You can customize this pattern using the `pattern` parameter in `--model-loader-extra-config`:
```console
vllm serve /path/to/sharded/model --load-format runai_streamer_sharded --model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}'
vllm serve /path/to/sharded/model \
--load-format runai_streamer_sharded \
--model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}'
```
To create sharded model files, you can use the script provided in <gh-file:examples/offline_inference/save_sharded_state.py>. This script demonstrates how to save a model in the sharded format that is compatible with the Run:ai Model Streamer sharded loader.
@ -71,7 +83,9 @@ To create sharded model files, you can use the script provided in <gh-file:examp
The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way:
```console
vllm serve /path/to/sharded/model --load-format runai_streamer_sharded --model-loader-extra-config '{"concurrency":16, "memory_limit":5368709120}'
vllm serve /path/to/sharded/model \
--load-format runai_streamer_sharded \
--model-loader-extra-config '{"concurrency":16, "memory_limit":5368709120}'
```
!!! note

View File

@ -8,7 +8,9 @@ vLLM provides an HTTP server that implements OpenAI's [Completions API](https://
In your terminal, you can [install](../getting_started/installation/README.md) vLLM, then start the server with the [`vllm serve`][serve-args] command. (You can also use our [Docker][deployment-docker] image.)
```bash
vllm serve NousResearch/Meta-Llama-3-8B-Instruct --dtype auto --api-key token-abc123
vllm serve NousResearch/Meta-Llama-3-8B-Instruct \
--dtype auto \
--api-key token-abc123
```
To call the server, in your preferred text editor, create a script that uses an HTTP client. Include any messages that you want to send to the model. Then run that script. Below is an example script using the [official OpenAI Python client](https://github.com/openai/openai-python).
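The full script is outside this hunk; a minimal sketch looks like this (the API key matches the `--api-key` value used above):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)

completion = client.chat.completions.create(
    model="NousResearch/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```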
@ -243,7 +245,9 @@ and passing a list of `messages` in the request. Refer to the examples below for
```bash
vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
--trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
--trust-remote-code \
--max-model-len 4096 \
--chat-template examples/template_vlm2vec.jinja
```
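A hedged sketch of such a request (the endpoint path and payload shape follow vLLM's multimodal embedding client examples; the image URL is a placeholder):

```python
import requests

# Chat-style embedding request: the server applies the chat template configured
# above to the messages before computing the embedding.
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "TIGER-Lab/VLM2Vec-Full",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Represent the given image."},
            ],
        }],
        "encoding_format": "float",
    },
)
embedding = response.json()["data"][0]["embedding"]
print(f"embedding dimension: {len(embedding)}")
```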
!!! warning
@ -285,7 +289,9 @@ and passing a list of `messages` in the request. Refer to the examples below for
```bash
vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
--trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
--trust-remote-code \
--max-model-len 8192 \
--chat-template examples/template_dse_qwen2_vl.jinja
```
!!! warning