Mirror of https://github.com/vllm-project/vllm.git — synced 2025-10-20 23:03:52 +08:00
[Docs] Fix syntax highlighting of shell commands (#19870)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@@ -16,7 +16,7 @@ Please download the visualization scripts in the post
 - Download `nightly-benchmarks.zip`.
 - In the same folder, run the following code:

-```console
+```bash
 export HF_TOKEN=<your HF token>
 apt update
 apt install -y git
@@ -10,7 +10,7 @@ title: Using Docker
 vLLM offers an official Docker image for deployment.
 The image can be used to run OpenAI compatible server and is available on Docker Hub as [vllm/vllm-openai](https://hub.docker.com/r/vllm/vllm-openai/tags).

-```console
+```bash
 docker run --runtime nvidia --gpus all \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
 --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
@@ -22,7 +22,7 @@ docker run --runtime nvidia --gpus all \

 This image can also be used with other container engines such as [Podman](https://podman.io/).

-```console
+```bash
 podman run --gpus all \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
 --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
@@ -71,7 +71,7 @@ You can add any other [engine-args][engine-args] you need after the image tag (`

 You can build and run vLLM from source via the provided <gh-file:docker/Dockerfile>. To build vLLM:

-```console
+```bash
 # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
 DOCKER_BUILDKIT=1 docker build . \
 --target vllm-openai \
@@ -99,7 +99,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--

 ??? Command

-```console
+```bash
 # Example of building on Nvidia GH200 server. (Memory usage: ~15GB, Build time: ~1475s / ~25 min, Image size: 6.93GB)
 python3 use_existing_torch.py
 DOCKER_BUILDKIT=1 docker build . \
@@ -118,7 +118,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--

 Run the following command on your host machine to register QEMU user static handlers:

-```console
+```bash
 docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
 ```

@@ -128,7 +128,7 @@ of PyTorch Nightly and should be considered **experimental**. Using the flag `--

 To run vLLM with the custom-built Docker image:

-```console
+```bash
 docker run --runtime nvidia --gpus all \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
 -p 8000:8000 \
@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve Qwen/Qwen1.5-32B-Chat-AWQ --max-model-len 4096
 ```

@@ -11,7 +11,7 @@ title: AutoGen

 - Setup [AutoGen](https://microsoft.github.io/autogen/0.2/docs/installation/) environment

-```console
+```bash
 pip install vllm

 # Install AgentChat and OpenAI client from Extensions
@@ -23,7 +23,7 @@ pip install -U "autogen-agentchat" "autogen-ext[openai]"

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 python -m vllm.entrypoints.openai.api_server \
 --model mistralai/Mistral-7B-Instruct-v0.2
 ```
@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr

 To install the Cerebrium client, run:

-```console
+```bash
 pip install cerebrium
 cerebrium login
 ```

 Next, create your Cerebrium project, run:

-```console
+```bash
 cerebrium init vllm-project
 ```

@@ -58,7 +58,7 @@ Next, let us add our code to handle inference for the LLM of your choice (`mistr

 Then, run the following code to deploy it to the cloud:

-```console
+```bash
 cerebrium deploy
 ```

@@ -15,7 +15,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve qwen/Qwen1.5-0.5B-Chat
 ```

@@ -18,13 +18,13 @@ This guide walks you through deploying Dify using a vLLM backend.

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve Qwen/Qwen1.5-7B-Chat
 ```

 - Start the Dify server with docker compose ([details](https://github.com/langgenius/dify?tab=readme-ov-file#quick-start)):

-```console
+```bash
 git clone https://github.com/langgenius/dify.git
 cd dify
 cd docker
@@ -11,14 +11,14 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/),

 To install dstack client, run:

-```console
+```bash
 pip install "dstack[all]
 dstack server
 ```

 Next, to configure your dstack project, run:

-```console
+```bash
 mkdir -p vllm-dstack
 cd vllm-dstack
 dstack init
@@ -13,7 +13,7 @@ It allows you to deploy a large language model (LLM) server with vLLM as the bac

 - Setup vLLM and Haystack environment

-```console
+```bash
 pip install vllm haystack-ai
 ```

@@ -21,7 +21,7 @@ pip install vllm haystack-ai

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve mistralai/Mistral-7B-Instruct-v0.1
 ```

@@ -22,7 +22,7 @@ Before you begin, ensure that you have the following:

 To install the chart with the release name `test-vllm`:

-```console
+```bash
 helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f values.yaml --set secrets.s3endpoint=$ACCESS_POINT --set secrets.s3bucketname=$BUCKET --set secrets.s3accesskeyid=$ACCESS_KEY --set secrets.s3accesskey=$SECRET_KEY
 ```

@@ -30,7 +30,7 @@ helm upgrade --install --create-namespace --namespace=ns-vllm test-vllm . -f val

 To uninstall the `test-vllm` deployment:

-```console
+```bash
 helm uninstall test-vllm --namespace=ns-vllm
 ```

@@ -18,7 +18,7 @@ And LiteLLM supports all models on VLLM.

 - Setup vLLM and litellm environment

-```console
+```bash
 pip install vllm litellm
 ```

@@ -28,7 +28,7 @@ pip install vllm litellm

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve qwen/Qwen1.5-0.5B-Chat
 ```

@@ -56,7 +56,7 @@ vllm serve qwen/Qwen1.5-0.5B-Chat

 - Start the vLLM server with the supported embedding model, e.g.

-```console
+```bash
 vllm serve BAAI/bge-base-en-v1.5
 ```

@@ -7,13 +7,13 @@ title: Open WebUI

 2. Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve qwen/Qwen1.5-0.5B-Chat
 ```

 1. Start the [Open WebUI](https://github.com/open-webui/open-webui) docker container (replace the vllm serve host and vllm serve port):

-```console
+```bash
 docker run -d -p 3000:8080 \
 --name open-webui \
 -v open-webui:/app/backend/data \
@@ -15,7 +15,7 @@ Here are the integrations:

 - Setup vLLM and langchain environment

-```console
+```bash
 pip install -U vllm \
 langchain_milvus langchain_openai \
 langchain_community beautifulsoup4 \
@@ -26,14 +26,14 @@ pip install -U vllm \

 - Start the vLLM server with the supported embedding model, e.g.

-```console
+```bash
 # Start embedding service (port 8000)
 vllm serve ssmits/Qwen2-7B-Instruct-embed-base
 ```

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 # Start chat service (port 8001)
 vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
 ```
@@ -52,7 +52,7 @@ python retrieval_augmented_generation_with_langchain.py

 - Setup vLLM and llamaindex environment

-```console
+```bash
 pip install vllm \
 llama-index llama-index-readers-web \
 llama-index-llms-openai-like \
@@ -64,14 +64,14 @@ pip install vllm \

 - Start the vLLM server with the supported embedding model, e.g.

-```console
+```bash
 # Start embedding service (port 8000)
 vllm serve ssmits/Qwen2-7B-Instruct-embed-base
 ```

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 # Start chat service (port 8001)
 vllm serve qwen/Qwen1.5-0.5B-Chat --port 8001
 ```
@@ -15,7 +15,7 @@ vLLM can be **run and scaled to multiple service replicas on clouds and Kubernet
 - Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
 - Check that `sky check` shows clouds or Kubernetes are enabled.

-```console
+```bash
 pip install skypilot-nightly
 sky check
 ```
@@ -71,7 +71,7 @@ See the vLLM SkyPilot YAML for serving, [serving.yaml](https://github.com/skypil

 Start the serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):

-```console
+```bash
 HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
 ```

@@ -83,7 +83,7 @@ Check the output of the command. There will be a shareable gradio link (like the

 **Optional**: Serve the 70B model instead of the default 8B and use more GPU:

-```console
+```bash
 HF_TOKEN="your-huggingface-token" \
 sky launch serving.yaml \
 --gpus A100:8 \
@@ -159,7 +159,7 @@ SkyPilot can scale up the service to multiple service replicas with built-in aut

 Start the serving the Llama-3 8B model on multiple replicas:

-```console
+```bash
 HF_TOKEN="your-huggingface-token" \
 sky serve up -n vllm serving.yaml \
 --env HF_TOKEN
@@ -167,7 +167,7 @@ HF_TOKEN="your-huggingface-token" \

 Wait until the service is ready:

-```console
+```bash
 watch -n10 sky serve status vllm
 ```

@@ -271,13 +271,13 @@ This will scale the service up to when the QPS exceeds 2 for each replica.

 To update the service with the new config:

-```console
+```bash
 HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN
 ```

 To stop the service:

-```console
+```bash
 sky serve down vllm
 ```

@@ -317,7 +317,7 @@ It is also possible to access the Llama-3 service with a separate GUI frontend,

 1. Start the chat web UI:

-```console
+```bash
 sky launch \
 -c gui ./gui.yaml \
 --env ENDPOINT=$(sky serve status --endpoint vllm)
@@ -15,13 +15,13 @@ It can be quickly integrated with vLLM as a backend API server, enabling powerfu

 - Start the vLLM server with the supported chat completion model, e.g.

-```console
+```bash
 vllm serve qwen/Qwen1.5-0.5B-Chat
 ```

 - Install streamlit and openai:

-```console
+```bash
 pip install streamlit openai
 ```

@@ -29,7 +29,7 @@ pip install streamlit openai

 - Start the streamlit web UI and start to chat:

-```console
+```bash
 streamlit run streamlit_openai_chatbot_webserver.py

 # or specify the VLLM_API_BASE or VLLM_API_KEY
@@ -7,7 +7,7 @@ vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-sta

 To install Llama Stack, run

-```console
+```bash
 pip install llama-stack -q
 ```

@@ -115,7 +115,7 @@ Next, start the vLLM server as a Kubernetes Deployment and Service:

 We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):

-```console
+```bash
 kubectl logs -l app.kubernetes.io/name=vllm
 ...
 INFO: Started server process [1]
@@ -358,14 +358,14 @@ INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

 Apply the deployment and service configurations using `kubectl apply -f <filename>`:

-```console
+```bash
 kubectl apply -f deployment.yaml
 kubectl apply -f service.yaml
 ```

 To test the deployment, run the following `curl` command:

-```console
+```bash
 curl http://mistral-7b.default.svc.cluster.local/v1/completions \
 -H "Content-Type: application/json" \
 -d '{
@@ -11,13 +11,13 @@ This document shows how to launch multiple vLLM serving containers and use Nginx

 This guide assumes that you have just cloned the vLLM project and you're currently in the vllm root directory.

-```console
+```bash
 export vllm_root=`pwd`
 ```

 Create a file named `Dockerfile.nginx`:

-```console
+```dockerfile
 FROM nginx:latest
 RUN rm /etc/nginx/conf.d/default.conf
 EXPOSE 80
@@ -26,7 +26,7 @@ CMD ["nginx", "-g", "daemon off;"]

 Build the container:

-```console
+```bash
 docker build . -f Dockerfile.nginx --tag nginx-lb
 ```

@@ -60,14 +60,14 @@ Create a file named `nginx_conf/nginx.conf`. Note that you can add as many serve

 ## Build vLLM Container

-```console
+```bash
 cd $vllm_root
 docker build -f docker/Dockerfile . --tag vllm
 ```

 If you are behind proxy, you can pass the proxy settings to the docker build command as shown below:

-```console
+```bash
 cd $vllm_root
 docker build \
 -f docker/Dockerfile . \
@@ -80,7 +80,7 @@ docker build \

 ## Create Docker Network

-```console
+```bash
 docker network create vllm_nginx
 ```

@@ -129,7 +129,7 @@ Notes:

 ## Launch Nginx

-```console
+```bash
 docker run \
 -itd \
 -p 8000:80 \
@@ -142,7 +142,7 @@ docker run \

 ## Verify That vLLM Servers Are Ready

-```console
+```bash
 docker logs vllm0 | grep Uvicorn
 docker logs vllm1 | grep Uvicorn
 ```
@@ -307,7 +307,7 @@ Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for
 By default, the timeout for fetching images through HTTP URL is `5` seconds.
 You can override this by setting the environment variable:

-```console
+```bash
 export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
 ```

@@ -370,7 +370,7 @@ Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for
 By default, the timeout for fetching videos through HTTP URL is `30` seconds.
 You can override this by setting the environment variable:

-```console
+```bash
 export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>
 ```

@@ -476,7 +476,7 @@ Full example: <gh-file:examples/online_serving/openai_chat_completion_client_for
 By default, the timeout for fetching audios through HTTP URL is `10` seconds.
 You can override this by setting the environment variable:

-```console
+```bash
 export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>
 ```

@@ -9,7 +9,7 @@ The main benefits are lower latency and memory usage.

 You can quantize your own models by installing AutoAWQ or picking one of the [6500+ models on Huggingface](https://huggingface.co/models?search=awq).

-```console
+```bash
 pip install autoawq
 ```

@@ -43,7 +43,7 @@ After installing AutoAWQ, you are ready to quantize a model. Please refer to the

 To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:

-```console
+```bash
 python examples/offline_inference/llm_engine_example.py \
 --model TheBloke/Llama-2-7b-Chat-AWQ \
 --quantization awq
@@ -12,7 +12,7 @@ vLLM now supports [BitBLAS](https://github.com/microsoft/BitBLAS) for more effic

 Below are the steps to utilize BitBLAS with vLLM.

-```console
+```bash
 pip install bitblas>=0.1.0
 ```

@@ -9,7 +9,7 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal

 Below are the steps to utilize BitsAndBytes with vLLM.

-```console
+```bash
 pip install bitsandbytes>=0.45.3
 ```

@@ -54,6 +54,6 @@ llm = LLM(

 Append the following to your model arguments for 4bit inflight quantization:

-```console
+```bash
 --quantization bitsandbytes
 ```
@@ -23,7 +23,7 @@ The FP8 types typically supported in hardware have two distinct representations,

 To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

-```console
+```bash
 pip install llmcompressor
 ```

@@ -81,7 +81,7 @@ Since simple RTN does not require data for weight quantization and the activatio

 Install `vllm` and `lm-evaluation-harness` for evaluation:

-```console
+```bash
 pip install vllm lm-eval==0.4.4
 ```

@@ -99,9 +99,9 @@ Evaluate accuracy with `lm_eval` (for example on 250 samples of `gsm8k`):
 !!! note
 Quantized models can be sensitive to the presence of the `bos` token. `lm_eval` does not add a `bos` token by default, so make sure to include the `add_bos_token=True` argument when running your evaluations.

-```console
-$ MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
-$ lm_eval \
+```bash
+MODEL=$PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic
+lm_eval \
 --model vllm \
 --model_args pretrained=$MODEL,add_bos_token=True \
 --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250
@@ -11,7 +11,7 @@ title: GGUF

 To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:

-```console
+```bash
 wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
 # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
@@ -20,7 +20,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \

 You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:

-```console
+```bash
 # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
 --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
@@ -32,7 +32,7 @@ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \

 GGUF assumes that huggingface can convert the metadata to a config file. In case huggingface doesn't support your model you can manually create a config and pass it as hf-config-path

-```console
+```bash
 # If you model is not supported by huggingface you can manually provide a huggingface compatible config path
 vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
 --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
@@ -21,7 +21,7 @@ for more details on this and other advanced features.

 You can quantize your own models by installing [GPTQModel](https://github.com/ModelCloud/GPTQModel) or picking one of the [5000+ models on Huggingface](https://huggingface.co/models?search=gptq).

-```console
+```bash
 pip install -U gptqmodel --no-build-isolation -v
 ```

@@ -60,7 +60,7 @@ Here is an example of how to quantize `meta-llama/Llama-3.2-1B-Instruct`:

 To run an GPTQModel quantized model with vLLM, you can use [DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2](https://huggingface.co/ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2) with the following command:

-```console
+```bash
 python examples/offline_inference/llm_engine_example.py \
 --model ModelCloud/DeepSeek-R1-Distill-Qwen-7B-gptqmodel-4bit-vortex-v2
 ```
@@ -14,13 +14,13 @@ Please visit the HF collection of [quantized INT4 checkpoints of popular LLMs re

 To use INT4 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

-```console
+```bash
 pip install llmcompressor
 ```

 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

-```console
+```bash
 pip install vllm lm-eval==0.4.4
 ```

@@ -116,8 +116,8 @@ model = LLM("./Meta-Llama-3-8B-Instruct-W4A16-G128")

 To evaluate accuracy, you can use `lm_eval`:

-```console
-$ lm_eval --model vllm \
+```bash
+lm_eval --model vllm \
 --model_args pretrained="./Meta-Llama-3-8B-Instruct-W4A16-G128",add_bos_token=true \
 --tasks gsm8k \
 --num_fewshot 5 \
@@ -15,13 +15,13 @@ Please visit the HF collection of [quantized INT8 checkpoints of popular LLMs re

 To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:

-```console
+```bash
 pip install llmcompressor
 ```

 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

-```console
+```bash
 pip install vllm lm-eval==0.4.4
 ```

@@ -122,8 +122,8 @@ model = LLM("./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")

 To evaluate accuracy, you can use `lm_eval`:

-```console
-$ lm_eval --model vllm \
+```bash
+lm_eval --model vllm \
 --model_args pretrained="./Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token",add_bos_token=true \
 --tasks gsm8k \
 --num_fewshot 5 \
@@ -4,7 +4,7 @@ The [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-O

 We recommend installing the library with:

-```console
+```bash
 pip install nvidia-modelopt
 ```

@@ -65,7 +65,7 @@ For optimal model quality when using FP8 KV Cache, we recommend using calibrated

 First, install the required dependencies:

-```console
+```bash
 pip install llmcompressor
 ```

@@ -13,7 +13,7 @@ AWQ, GPTQ, Rotation and SmoothQuant.

 Before quantizing models, you need to install Quark. The latest release of Quark can be installed with pip:

-```console
+```bash
 pip install amd-quark
 ```

@@ -22,13 +22,13 @@ for more installation details.

 Additionally, install `vllm` and `lm-evaluation-harness` for evaluation:

-```console
+```bash
 pip install vllm lm-eval==0.4.4
 ```

 ## Quantization Process

 After installing Quark, we will use an example to illustrate how to use Quark.
 The Quark quantization process can be listed for 5 steps as below:

 1. Load the model
@@ -209,8 +209,8 @@ Now, you can load and run the Quark quantized model directly through the LLM ent

 Or, you can use `lm_eval` to evaluate accuracy:

-```console
-$ lm_eval --model vllm \
+```bash
+lm_eval --model vllm \
 --model_args pretrained=Llama-2-70b-chat-hf-w-fp8-a-fp8-kvcache-fp8-pertensor-autosmoothquant,kv_cache_dtype='fp8',quantization='quark' \
 --tasks gsm8k
 ```
@@ -222,7 +222,7 @@ to quantize large language models more conveniently. It supports quantizing mode
 of different quantization schemes and optimization algorithms. It can export the quantized model
 and run evaluation tasks on the fly. With the script, the example above can be:

-```console
+```bash
 python3 quantize_quark.py --model_dir meta-llama/Llama-2-70b-chat-hf \
 --output_dir /path/to/output \
 --quant_scheme w_fp8_a_fp8 \
@@ -4,7 +4,7 @@ TorchAO is an architecture optimization library for PyTorch, it provides high pe

 We recommend installing the latest torchao nightly with

-```console
+```bash
 # Install the latest TorchAO nightly build
 # Choose the CUDA version that matches your system (cu126, cu128, etc.)
 pip install \
@@ -351,7 +351,7 @@ Here is a summary of a plugin file:

 Then you can use this plugin in the command line like this.

-```console
+```bash
 --enable-auto-tool-choice \
 --tool-parser-plugin <absolute path of the plugin file>
 --tool-call-parser example \
@@ -26,7 +26,7 @@ The easiest way to launch a Trainium or Inferentia instance with pre-installed N
 - After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
 - Once inside your instance, activate the pre-installed virtual environment for inference by running

-```console
+```bash
 source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate
 ```

@@ -47,7 +47,7 @@ Currently, there are no pre-built Neuron wheels.

 To build and install vLLM from source, run:

-```console
+```bash
 git clone https://github.com/vllm-project/vllm.git
 cd vllm
 pip install -U -r requirements/neuron.txt
@@ -66,7 +66,7 @@ Refer to [vLLM User Guide for NxD Inference](https://awsdocs-neuron.readthedocs-

 To install the AWS Neuron fork, run the following:

-```console
+```bash
 git clone -b neuron-2.23-vllm-v0.7.2 https://github.com/aws-neuron/upstreaming-to-vllm.git
 cd upstreaming-to-vllm
 pip install -r requirements/neuron.txt
@@ -100,7 +100,7 @@ to perform most of the heavy lifting which includes PyTorch model initialization
 To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override
 as a dictionary (or JSON object when starting vLLM from the CLI). For example, to disable auto bucketing, include

-```console
+```python
 override_neuron_config={
 "enable_bucketing":False,
 }
@@ -108,7 +108,7 @@ override_neuron_config={

 or when launching vLLM from the CLI, pass

-```console
+```bash
 --override-neuron-config "{\"enable_bucketing\":false}"
 ```

@@ -78,13 +78,13 @@ Currently, there are no pre-built CPU wheels.

 ??? Commands

-```console
-$ docker build -f docker/Dockerfile.cpu \
+```bash
+docker build -f docker/Dockerfile.cpu \
 --tag vllm-cpu-env \
 --target vllm-openai .

 # Launching OpenAI server
-$ docker run --rm \
+docker run --rm \
 --privileged=true \
 --shm-size=4g \
 -p 8000:8000 \
@@ -123,7 +123,7 @@ vLLM CPU backend supports the following vLLM features:

 - We highly recommend to use TCMalloc for high performance memory allocation and better cache locality. For example, on Ubuntu 22.4, you can run:

-```console
+```bash
 sudo apt-get install libtcmalloc-minimal4 # install TCMalloc library
 find / -name *libtcmalloc* # find the dynamic link library path
 export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:$LD_PRELOAD # prepend the library to LD_PRELOAD
@@ -132,7 +132,7 @@ python examples/offline_inference/basic/basic.py # run vLLM

 - When using the online serving, it is recommended to reserve 1-2 CPU cores for the serving framework to avoid CPU oversubscription. For example, on a platform with 32 physical CPU cores, reserving CPU 30 and 31 for the framework and using CPU 0-29 for OpenMP:

-```console
+```bash
 export VLLM_CPU_KVCACHE_SPACE=40
 export VLLM_CPU_OMP_THREADS_BIND=0-29
 vllm serve facebook/opt-125m
@@ -140,7 +140,7 @@ vllm serve facebook/opt-125m

 or using default auto thread binding:

-```console
+```bash
 export VLLM_CPU_KVCACHE_SPACE=40
 export VLLM_CPU_NUM_OF_RESERVED_CPU=2
 vllm serve facebook/opt-125m
@@ -189,7 +189,7 @@ vllm serve facebook/opt-125m

 - Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:

-```console
+```bash
 VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" \
 vllm serve meta-llama/Llama-2-7b-chat-hf \
 -tp=2 \
@@ -198,7 +198,7 @@ vllm serve facebook/opt-125m

 or using default auto thread binding:

-```console
+```bash
 VLLM_CPU_KVCACHE_SPACE=40 \
 vllm serve meta-llama/Llama-2-7b-chat-hf \
 -tp=2 \
@@ -25,11 +25,11 @@ Currently the CPU implementation for macOS supports FP32 and FP16 datatypes.

 After installation of XCode and the Command Line Tools, which include Apple Clang, execute the following commands to build and install vLLM from the source.

-```console
+```bash
 git clone https://github.com/vllm-project/vllm.git
 cd vllm
 pip install -r requirements/cpu.txt
 pip install -e .
 ```

 !!! note
@@ -1,6 +1,6 @@
 First, install recommended compiler. We recommend to use `gcc/g++ >= 12.3.0` as the default compiler to avoid potential problems. For example, on Ubuntu 22.4, you can run:

-```console
+```bash
 sudo apt-get update -y
 sudo apt-get install -y gcc-12 g++-12 libnuma-dev python3-dev
 sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /usr/bin/g++ g++ /usr/bin/g++-12
@@ -8,14 +8,14 @@ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 10 --slave /

 Second, clone vLLM project:

-```console
+```bash
 git clone https://github.com/vllm-project/vllm.git vllm_source
 cd vllm_source
 ```

 Third, install Python packages for vLLM CPU backend building:

-```console
+```bash
 pip install --upgrade pip
 pip install "cmake>=3.26.1" wheel packaging ninja "setuptools-scm>=8" numpy
 pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
@@ -23,13 +23,13 @@ pip install -v -r requirements/cpu.txt --extra-index-url https://download.pytorc

 Finally, build and install vLLM CPU backend:

-```console
+```bash
 VLLM_TARGET_DEVICE=cpu python setup.py install
 ```

 If you want to develop vllm, install it in editable mode instead.

-```console
+```bash
 VLLM_TARGET_DEVICE=cpu python setup.py develop
 ```

@@ -26,7 +26,7 @@ Currently the CPU implementation for s390x architecture supports FP32 datatype o

 Install the following packages from the package manager before building the vLLM. For example on RHEL 9.4:

-```console
+```bash
 dnf install -y \
 which procps findutils tar vim git gcc g++ make patch make cython zlib-devel \
 libjpeg-turbo-devel libtiff-devel libpng-devel libwebp-devel freetype-devel harfbuzz-devel \
@@ -35,7 +35,7 @@ dnf install -y \

 Install rust>=1.80 which is needed for `outlines-core` and `uvloop` python packages installation.

-```console
+```bash
 curl https://sh.rustup.rs -sSf | sh -s -- -y && \
 . "$HOME/.cargo/env"
 ```
@@ -45,7 +45,7 @@ Execute the following commands to build and install vLLM from the source.
 !!! tip
 Please build the following dependencies, `torchvision`, `pyarrow` from the source before building vLLM.

-```console
+```bash
 sed -i '/^torch/d' requirements-build.txt # remove torch from requirements-build.txt since we use nightly builds
 pip install -v \
 --extra-index-url https://download.pytorch.org/whl/nightly/cpu \
@@ -68,7 +68,7 @@ For more information about using TPUs with GKE, see:

 Create a TPU v5e with 4 TPU chips:

-```console
+```bash
 gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
 --node-id TPU_NAME \
 --project PROJECT_ID \
@@ -156,13 +156,13 @@ See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for i

 You can use <gh-file:docker/Dockerfile.tpu> to build a Docker image with TPU support.

-```console
+```bash
 docker build -f docker/Dockerfile.tpu -t vllm-tpu .
 ```

 Run the Docker image with the following command:

-```console
+```bash
 # Make sure to add `--privileged --net host --shm-size=16G`.
 docker run --privileged --net host --shm-size=16G -it vllm-tpu
 ```
@@ -185,6 +185,6 @@ docker run --privileged --net host --shm-size=16G -it vllm-tpu

 Install OpenBLAS with the following command:

-```console
+```bash
 sudo apt-get install --no-install-recommends --yes libopenblas-base libopenmpi-dev libomp-dev
 ```
@@ -22,7 +22,7 @@ Therefore, it is recommended to install vLLM with a **fresh new** environment. I

 You can install vLLM using either `pip` or `uv pip`:

-```console
+```bash
 # Install vLLM with CUDA 12.8.
 # If you are using pip.
 pip install vllm --extra-index-url https://download.pytorch.org/whl/cu128
@@ -37,7 +37,7 @@ We recommend leveraging `uv` to [automatically select the appropriate PyTorch in

 As of now, vLLM's binaries are compiled with CUDA 12.8 and public PyTorch release versions by default. We also provide vLLM binaries compiled with CUDA 12.6, 11.8, and public PyTorch release versions:

-```console
+```bash
 # Install vLLM with CUDA 11.8.
 export VLLM_VERSION=0.6.1.post1
 export PYTHON_VERSION=312
@@ -52,7 +52,7 @@ LLM inference is a fast-evolving field, and the latest code may contain bug fixe

 ##### Install the latest code using `pip`

-```console
+```bash
 pip install -U vllm \
 --pre \
 --extra-index-url https://wheels.vllm.ai/nightly
@@ -62,7 +62,7 @@ pip install -U vllm \

 Another way to install the latest code is to use `uv`:

-```console
+```bash
 uv pip install -U vllm \
 --torch-backend=auto \
 --extra-index-url https://wheels.vllm.ai/nightly
@@ -72,7 +72,7 @@ uv pip install -U vllm \

 If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), due to the limitation of `pip`, you have to specify the full URL of the wheel file by embedding the commit hash in the URL:

-```console
+```bash
 export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
 pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
 ```
@@ -83,7 +83,7 @@ Note that the wheels are built with Python 3.8 ABI (see [PEP 425](https://peps.p

 If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), you can specify the commit hash in the URL:
|
If you want to access the wheels for previous commits (e.g. to bisect the behavior change, performance regression), you can specify the commit hash in the URL:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
|
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
|
||||||
uv pip install vllm \
|
uv pip install vllm \
|
||||||
--torch-backend=auto \
|
--torch-backend=auto \
|
||||||
@ -99,7 +99,7 @@ The `uv` approach works for vLLM `v0.6.6` and later and offers an easy-to-rememb
|
|||||||
|
|
||||||
If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM:
|
If you only need to change Python code, you can build and install vLLM without compilation. Using `pip`'s [`--editable` flag](https://pip.pypa.io/en/stable/topics/local-project-installs/#editable-installs), changes you make to the code will be reflected when you run vLLM:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
git clone https://github.com/vllm-project/vllm.git
|
git clone https://github.com/vllm-project/vllm.git
|
||||||
cd vllm
|
cd vllm
|
||||||
VLLM_USE_PRECOMPILED=1 pip install --editable .
|
VLLM_USE_PRECOMPILED=1 pip install --editable .
|
||||||
@ -118,7 +118,7 @@ This command will do the following:
|
|||||||
|
|
||||||
If you see an error about the wheel not being found when running the above command, it might be because the commit you based your changes on in the main branch was just merged and the wheel is still being built. In this case, wait for around an hour and try again, or manually select a previous commit for the installation using the `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable.
|
If you see an error about the wheel not being found when running the above command, it might be because the commit you based your changes on in the main branch was just merged and the wheel is still being built. In this case, wait for around an hour and try again, or manually select a previous commit for the installation using the `VLLM_PRECOMPILED_WHEEL_LOCATION` environment variable.
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
|
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069 # use full commit hash from the main branch
|
||||||
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
|
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
|
||||||
pip install --editable .
|
pip install --editable .
|
||||||
@ -134,7 +134,7 @@ You can find more information about vLLM's wheels in [install-the-latest-code][i
|
|||||||
|
|
||||||
If you want to modify C++ or CUDA code, you'll need to build vLLM from source. This can take several minutes:
|
If you want to modify C++ or CUDA code, you'll need to build vLLM from source. This can take several minutes:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
git clone https://github.com/vllm-project/vllm.git
|
git clone https://github.com/vllm-project/vllm.git
|
||||||
cd vllm
|
cd vllm
|
||||||
pip install -e .
|
pip install -e .
|
||||||
@ -160,7 +160,7 @@ There are scenarios where the PyTorch dependency cannot be easily installed via
|
|||||||
|
|
||||||
To build vLLM using an existing PyTorch installation:
|
To build vLLM using an existing PyTorch installation:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
git clone https://github.com/vllm-project/vllm.git
|
git clone https://github.com/vllm-project/vllm.git
|
||||||
cd vllm
|
cd vllm
|
||||||
python use_existing_torch.py
|
python use_existing_torch.py
|
||||||
@ -173,7 +173,7 @@ pip install --no-build-isolation -e .
|
|||||||
Currently, before starting the build process, vLLM fetches cutlass code from GitHub. However, there may be scenarios where you want to use a local version of cutlass instead.
|
Currently, before starting the build process, vLLM fetches cutlass code from GitHub. However, there may be scenarios where you want to use a local version of cutlass instead.
|
||||||
To achieve this, you can set the environment variable VLLM_CUTLASS_SRC_DIR to point to your local cutlass directory.
|
To achieve this, you can set the environment variable VLLM_CUTLASS_SRC_DIR to point to your local cutlass directory.
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
git clone https://github.com/vllm-project/vllm.git
|
git clone https://github.com/vllm-project/vllm.git
|
||||||
cd vllm
|
cd vllm
|
||||||
VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e .
|
VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e .
|
||||||
@ -184,7 +184,7 @@ VLLM_CUTLASS_SRC_DIR=/path/to/cutlass pip install -e .
|
|||||||
To avoid your system being overloaded, you can limit the number of compilation jobs
|
To avoid your system being overloaded, you can limit the number of compilation jobs
|
||||||
to be run simultaneously, via the environment variable `MAX_JOBS`. For example:
|
to be run simultaneously, via the environment variable `MAX_JOBS`. For example:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
export MAX_JOBS=6
|
export MAX_JOBS=6
|
||||||
pip install -e .
|
pip install -e .
|
||||||
```
|
```
|
||||||
@ -194,7 +194,7 @@ A side effect is a much slower build process.
|
|||||||
|
|
||||||
Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
|
Additionally, if you have trouble building vLLM, we recommend using the NVIDIA PyTorch Docker image.
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
# Use `--ipc=host` to make sure the shared memory is large enough.
|
# Use `--ipc=host` to make sure the shared memory is large enough.
|
||||||
docker run \
|
docker run \
|
||||||
--gpus all \
|
--gpus all \
|
||||||
@ -205,14 +205,14 @@ docker run \
|
|||||||
|
|
||||||
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:
|
If you don't want to use docker, it is recommended to have a full installation of CUDA Toolkit. You can download and install it from [the official website](https://developer.nvidia.com/cuda-toolkit-archive). After installation, set the environment variable `CUDA_HOME` to the installation path of CUDA Toolkit, and make sure that the `nvcc` compiler is in your `PATH`, e.g.:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
export CUDA_HOME=/usr/local/cuda
|
export CUDA_HOME=/usr/local/cuda
|
||||||
export PATH="${CUDA_HOME}/bin:$PATH"
|
export PATH="${CUDA_HOME}/bin:$PATH"
|
||||||
```
|
```
|
||||||
|
|
||||||
Here is a sanity check to verify that the CUDA Toolkit is correctly installed:
|
Here is a sanity check to verify that the CUDA Toolkit is correctly installed:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
nvcc --version # verify that nvcc is in your PATH
|
nvcc --version # verify that nvcc is in your PATH
|
||||||
${CUDA_HOME}/bin/nvcc --version # verify that nvcc is in your CUDA_HOME
|
${CUDA_HOME}/bin/nvcc --version # verify that nvcc is in your CUDA_HOME
|
||||||
```
|
```
|
||||||
@ -223,7 +223,7 @@ vLLM can fully run only on Linux but for development purposes, you can still bui
|
|||||||
|
|
||||||
Simply disable the `VLLM_TARGET_DEVICE` environment variable before installing:
|
Simply disable the `VLLM_TARGET_DEVICE` environment variable before installing:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
export VLLM_TARGET_DEVICE=empty
|
export VLLM_TARGET_DEVICE=empty
|
||||||
pip install -e .
|
pip install -e .
|
||||||
```
|
```
|
||||||
@ -238,7 +238,7 @@ See [deployment-docker-pre-built-image][deployment-docker-pre-built-image] for i
|
|||||||
|
|
||||||
Another way to access the latest code is to use the docker images:
|
Another way to access the latest code is to use the docker images:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
|
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd # use full commit hash from the main branch
|
||||||
docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT}
|
docker pull public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:${VLLM_COMMIT}
|
||||||
```
|
```
|
||||||
|
@ -31,17 +31,17 @@ Currently, there are no pre-built ROCm wheels.
|
|||||||
|
|
||||||
Alternatively, you can install PyTorch using PyTorch wheels. See the PyTorch [Getting Started](https://pytorch.org/get-started/locally/) guide for installation instructions. Example:
|
Alternatively, you can install PyTorch using PyTorch wheels. See the PyTorch [Getting Started](https://pytorch.org/get-started/locally/) guide for installation instructions. Example:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
# Install PyTorch
|
# Install PyTorch
|
||||||
$ pip uninstall torch -y
|
pip uninstall torch -y
|
||||||
$ pip install --no-cache-dir --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.3
|
pip install --no-cache-dir --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.3
|
||||||
```
|
```
|
||||||
|
|
||||||
1. Install [Triton flash attention for ROCm](https://github.com/ROCm/triton)
|
1. Install [Triton flash attention for ROCm](https://github.com/ROCm/triton)
|
||||||
|
|
||||||
Install ROCm's Triton flash attention (the default triton-mlir branch) following the instructions from [ROCm/triton](https://github.com/ROCm/triton/blob/triton-mlir/README.md)
|
Install ROCm's Triton flash attention (the default triton-mlir branch) following the instructions from [ROCm/triton](https://github.com/ROCm/triton/blob/triton-mlir/README.md)
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
python3 -m pip install ninja cmake wheel pybind11
|
python3 -m pip install ninja cmake wheel pybind11
|
||||||
pip uninstall -y triton
|
pip uninstall -y triton
|
||||||
git clone https://github.com/OpenAI/triton.git
|
git clone https://github.com/OpenAI/triton.git
|
||||||
@ -62,7 +62,7 @@ Currently, there are no pre-built ROCm wheels.
|
|||||||
|
|
||||||
For example, for ROCm 6.3, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo | grep gfx`.
|
For example, for ROCm 6.3, suppose your gfx arch is `gfx90a`. To get your gfx architecture, run `rocminfo | grep gfx`.
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
git clone https://github.com/ROCm/flash-attention.git
|
git clone https://github.com/ROCm/flash-attention.git
|
||||||
cd flash-attention
|
cd flash-attention
|
||||||
git checkout b7d29fb
|
git checkout b7d29fb
|
||||||
@ -76,7 +76,7 @@ Currently, there are no pre-built ROCm wheels.
|
|||||||
|
|
||||||
3. If you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps:
|
3. If you choose to build AITER yourself to use a certain branch or commit, you can build AITER using the following steps:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
python3 -m pip uninstall -y aiter
|
python3 -m pip uninstall -y aiter
|
||||||
git clone --recursive https://github.com/ROCm/aiter.git
|
git clone --recursive https://github.com/ROCm/aiter.git
|
||||||
cd aiter
|
cd aiter
|
||||||
@ -148,7 +148,7 @@ If you choose to build this rocm_base image yourself, the steps are as follows.
|
|||||||
|
|
||||||
It is important to kick off the docker build using BuildKit. Either set DOCKER_BUILDKIT=1 as an environment variable when calling the docker build command, or configure BuildKit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
|
It is important to kick off the docker build using BuildKit. Either set DOCKER_BUILDKIT=1 as an environment variable when calling the docker build command, or configure BuildKit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
|
||||||
|
|
||||||
```console
|
```json
|
||||||
{
|
{
|
||||||
"features": {
|
"features": {
|
||||||
"buildkit": true
|
"buildkit": true
|
||||||
@ -158,7 +158,7 @@ It is important that the user kicks off the docker build using buildkit. Either
|
|||||||
|
|
||||||
To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:
|
To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
DOCKER_BUILDKIT=1 docker build \
|
DOCKER_BUILDKIT=1 docker build \
|
||||||
-f docker/Dockerfile.rocm_base \
|
-f docker/Dockerfile.rocm_base \
|
||||||
-t rocm/vllm-dev:base .
|
-t rocm/vllm-dev:base .
|
||||||
@ -169,7 +169,7 @@ DOCKER_BUILDKIT=1 docker build \
|
|||||||
First, build a docker image from <gh-file:docker/Dockerfile.rocm> and launch a docker container from the image.
|
First, build a docker image from <gh-file:docker/Dockerfile.rocm> and launch a docker container from the image.
|
||||||
It is important to kick off the docker build using BuildKit. Either set `DOCKER_BUILDKIT=1` as an environment variable when calling the docker build command, or configure BuildKit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
|
It is important to kick off the docker build using BuildKit. Either set `DOCKER_BUILDKIT=1` as an environment variable when calling the docker build command, or configure BuildKit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
|
||||||
|
|
||||||
```console
|
```json
|
||||||
{
|
{
|
||||||
"features": {
|
"features": {
|
||||||
"buildkit": true
|
"buildkit": true
|
||||||
@ -187,13 +187,13 @@ Their values can be passed in when running `docker build` with `--build-arg` opt
|
|||||||
|
|
||||||
To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:
|
To build vllm on ROCm 6.3 for MI200 and MI300 series, you can use the default:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-rocm .
|
DOCKER_BUILDKIT=1 docker build -f docker/Dockerfile.rocm -t vllm-rocm .
|
||||||
```
|
```
|
||||||
|
|
||||||
To build vllm on ROCm 6.3 for Radeon RX7900 series (gfx1100), you should pick the alternative base image:
|
To build vllm on ROCm 6.3 for Radeon RX7900 series (gfx1100), you should pick the alternative base image:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
DOCKER_BUILDKIT=1 docker build \
|
DOCKER_BUILDKIT=1 docker build \
|
||||||
--build-arg BASE_IMAGE="rocm/vllm-dev:navi_base" \
|
--build-arg BASE_IMAGE="rocm/vllm-dev:navi_base" \
|
||||||
-f docker/Dockerfile.rocm \
|
-f docker/Dockerfile.rocm \
|
||||||
@ -205,7 +205,7 @@ To run the above docker image `vllm-rocm`, use the below command:
|
|||||||
|
|
||||||
??? Command
|
??? Command
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
docker run -it \
|
docker run -it \
|
||||||
--network=host \
|
--network=host \
|
||||||
--group-add=video \
|
--group-add=video \
|
||||||
|
@ -25,7 +25,7 @@ Currently, there are no pre-built XPU wheels.
|
|||||||
- First, install required driver and Intel OneAPI 2025.0 or later.
|
- First, install required driver and Intel OneAPI 2025.0 or later.
|
||||||
- Second, install Python packages for vLLM XPU backend building:
|
- Second, install Python packages for vLLM XPU backend building:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
git clone https://github.com/vllm-project/vllm.git
|
git clone https://github.com/vllm-project/vllm.git
|
||||||
cd vllm
|
cd vllm
|
||||||
pip install --upgrade pip
|
pip install --upgrade pip
|
||||||
@ -34,7 +34,7 @@ pip install -v -r requirements/xpu.txt
|
|||||||
|
|
||||||
- Then, build and install vLLM XPU backend:
|
- Then, build and install vLLM XPU backend:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
VLLM_TARGET_DEVICE=xpu python setup.py install
|
VLLM_TARGET_DEVICE=xpu python setup.py install
|
||||||
```
|
```
|
||||||
|
|
||||||
@ -53,9 +53,9 @@ Currently, there are no pre-built XPU images.
|
|||||||
# --8<-- [end:pre-built-images]
|
# --8<-- [end:pre-built-images]
|
||||||
# --8<-- [start:build-image-from-source]
|
# --8<-- [start:build-image-from-source]
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
$ docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
|
docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
|
||||||
$ docker run -it \
|
docker run -it \
|
||||||
--rm \
|
--rm \
|
||||||
--network=host \
|
--network=host \
|
||||||
--device /dev/dri \
|
--device /dev/dri \
|
||||||
@ -68,7 +68,7 @@ $ docker run -it \
|
|||||||
|
|
||||||
The XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. We require Ray as the distributed runtime backend. For example, a reference execution looks like the following:
|
The XPU platform supports **tensor parallel** inference/serving and also supports **pipeline parallel** as a beta feature for online serving. We require Ray as the distributed runtime backend. For example, a reference execution looks like the following:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
python -m vllm.entrypoints.openai.api_server \
|
python -m vllm.entrypoints.openai.api_server \
|
||||||
--model=facebook/opt-13b \
|
--model=facebook/opt-13b \
|
||||||
--dtype=bfloat16 \
|
--dtype=bfloat16 \
|
||||||
|
@ -24,7 +24,7 @@ please follow the methods outlined in the
|
|||||||
|
|
||||||
To verify that the Intel Gaudi software was correctly installed, run:
|
To verify that the Intel Gaudi software was correctly installed, run:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
|
hl-smi # verify that hl-smi is in your PATH and each Gaudi accelerator is visible
|
||||||
apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
|
apt list --installed | grep habana # verify that habanalabs-firmware-tools, habanalabs-graph, habanalabs-rdma-core, habanalabs-thunk and habanalabs-container-runtime are installed
|
||||||
pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
|
pip list | grep habana # verify that habana-torch-plugin, habana-torch-dataloader, habana-pyhlml and habana-media-loader are installed
|
||||||
@ -42,7 +42,7 @@ for more details.
|
|||||||
|
|
||||||
Use the following commands to run a Docker image:
|
Use the following commands to run a Docker image:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
|
docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
|
||||||
docker run \
|
docker run \
|
||||||
-it \
|
-it \
|
||||||
@ -65,7 +65,7 @@ Currently, there are no pre-built Intel Gaudi wheels.
|
|||||||
|
|
||||||
To build and install vLLM from source, run:
|
To build and install vLLM from source, run:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
git clone https://github.com/vllm-project/vllm.git
|
git clone https://github.com/vllm-project/vllm.git
|
||||||
cd vllm
|
cd vllm
|
||||||
pip install -r requirements/hpu.txt
|
pip install -r requirements/hpu.txt
|
||||||
@ -74,7 +74,7 @@ python setup.py develop
|
|||||||
|
|
||||||
Currently, the latest features and performance optimizations are developed in Gaudi's [vLLM-fork](https://github.com/HabanaAI/vllm-fork) and we periodically upstream them to the vLLM main repo. To install the latest [HabanaAI/vLLM-fork](https://github.com/HabanaAI/vllm-fork), run the following:
|
Currently, the latest features and performance optimizations are developed in Gaudi's [vLLM-fork](https://github.com/HabanaAI/vllm-fork) and we periodically upstream them to the vLLM main repo. To install the latest [HabanaAI/vLLM-fork](https://github.com/HabanaAI/vllm-fork), run the following:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
git clone https://github.com/HabanaAI/vllm-fork.git
|
git clone https://github.com/HabanaAI/vllm-fork.git
|
||||||
cd vllm-fork
|
cd vllm-fork
|
||||||
git checkout habana_main
|
git checkout habana_main
|
||||||
@ -90,7 +90,7 @@ Currently, there are no pre-built Intel Gaudi images.
|
|||||||
|
|
||||||
### Build image from source
|
### Build image from source
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
docker build -f docker/Dockerfile.hpu -t vllm-hpu-env .
|
docker build -f docker/Dockerfile.hpu -t vllm-hpu-env .
|
||||||
docker run \
|
docker run \
|
||||||
-it \
|
-it \
|
||||||
|
@ -1,6 +1,6 @@
|
|||||||
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
|
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
uv venv --python 3.12 --seed
|
uv venv --python 3.12 --seed
|
||||||
source .venv/bin/activate
|
source .venv/bin/activate
|
||||||
```
|
```
|
||||||
|
@ -19,7 +19,7 @@ If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/
|
|||||||
|
|
||||||
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
|
It's recommended to use [uv](https://docs.astral.sh/uv/), a very fast Python environment manager, to create and manage Python environments. Please follow the [documentation](https://docs.astral.sh/uv/#getting-started) to install `uv`. After installing `uv`, you can create a new Python environment and install vLLM using the following commands:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
uv venv --python 3.12 --seed
|
uv venv --python 3.12 --seed
|
||||||
source .venv/bin/activate
|
source .venv/bin/activate
|
||||||
uv pip install vllm --torch-backend=auto
|
uv pip install vllm --torch-backend=auto
|
||||||
@ -29,13 +29,13 @@ uv pip install vllm --torch-backend=auto
|
|||||||
|
|
||||||
Another delightful way is to use `uv run` with the `--with [dependency]` option, which allows you to run commands such as `vllm serve` without creating a permanent environment:
|
Another delightful way is to use `uv run` with the `--with [dependency]` option, which allows you to run commands such as `vllm serve` without creating a permanent environment:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
uv run --with vllm vllm --help
|
uv run --with vllm vllm --help
|
||||||
```
|
```
|
||||||
|
|
||||||
You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments. You can install `uv` to the conda environment through `pip` if you want to manage it within the environment.
|
You can also use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments. You can install `uv` to the conda environment through `pip` if you want to manage it within the environment.
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
conda create -n myenv python=3.12 -y
|
conda create -n myenv python=3.12 -y
|
||||||
conda activate myenv
|
conda activate myenv
|
||||||
pip install --upgrade uv
|
pip install --upgrade uv
|
||||||
@ -110,7 +110,7 @@ By default, it starts the server at `http://localhost:8000`. You can specify the
|
|||||||
|
|
||||||
Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model:
|
Run the following command to start the vLLM server with the [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) model:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm serve Qwen/Qwen2.5-1.5B-Instruct
|
vllm serve Qwen/Qwen2.5-1.5B-Instruct
|
||||||
```
|
```
|
||||||
|
|
||||||
@ -124,7 +124,7 @@ vllm serve Qwen/Qwen2.5-1.5B-Instruct
|
|||||||
|
|
||||||
This server can be queried in the same format as OpenAI API. For example, to list the models:
|
This server can be queried in the same format as OpenAI API. For example, to list the models:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
curl http://localhost:8000/v1/models
|
curl http://localhost:8000/v1/models
|
||||||
```
|
```
|
||||||
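If you prefer Python over `curl`, the same endpoint can be queried with the official `openai` client; a minimal sketch, assuming the server above is running on `http://localhost:8000` with no API key configured:

```python
# Query the OpenAI-compatible server from Python instead of curl.
# Assumes the quickstart server above is running on localhost:8000
# and that no API key was configured (any placeholder string works).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# List the models served by this vLLM instance.
for model in client.models.list():
    print(model.id)
```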
|
|
||||||
@ -134,7 +134,7 @@ You can pass in the argument `--api-key` or environment variable `VLLM_API_KEY`
|
|||||||
|
|
||||||
Once your server is started, you can query the model with input prompts:
|
Once your server is started, you can query the model with input prompts:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
curl http://localhost:8000/v1/completions \
|
curl http://localhost:8000/v1/completions \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{
|
-d '{
|
||||||
@ -172,7 +172,7 @@ vLLM is designed to also support the OpenAI Chat Completions API. The chat inter
|
|||||||
|
|
||||||
You can use the [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create) endpoint to interact with the model:
|
You can use the [create chat completion](https://platform.openai.com/docs/api-reference/chat/completions/create) endpoint to interact with the model:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
curl http://localhost:8000/v1/chat/completions \
|
curl http://localhost:8000/v1/chat/completions \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{
|
-d '{
|
||||||
|
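The chat endpoint can be called the same way from Python; a minimal sketch, assuming the quickstart server and the `Qwen/Qwen2.5-1.5B-Instruct` model used earlier:

```python
# Call the Chat Completions endpoint with the official openai client.
# Assumes the server is running on localhost:8000 with no API key set.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
)
print(chat.choices[0].message.content)
```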
@ -9,27 +9,27 @@ Further reading can be found in [Run:ai Model Streamer Documentation](https://gi
|
|||||||
vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer.
|
vLLM supports loading weights in Safetensors format using the Run:ai Model Streamer.
|
||||||
You first need to install vLLM RunAI optional dependency:
|
You first need to install vLLM RunAI optional dependency:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
pip3 install vllm[runai]
|
pip3 install vllm[runai]
|
||||||
```
|
```
|
||||||
|
|
||||||
To run it as an OpenAI-compatible server, add the `--load-format runai_streamer` flag:
|
To run it as an OpenAI-compatible server, add the `--load-format runai_streamer` flag:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
|
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
|
||||||
--load-format runai_streamer
|
--load-format runai_streamer
|
||||||
```
|
```
|
||||||
|
|
||||||
To run a model from an AWS S3 object store, run:
|
To run a model from an AWS S3 object store, run:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm serve s3://core-llm/Llama-3-8b \
|
vllm serve s3://core-llm/Llama-3-8b \
|
||||||
--load-format runai_streamer
|
--load-format runai_streamer
|
||||||
```
|
```
|
||||||
|
|
||||||
To run a model from an S3-compatible object store, run:
|
To run a model from an S3-compatible object store, run:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 \
|
RUNAI_STREAMER_S3_USE_VIRTUAL_ADDRESSING=0 \
|
||||||
AWS_EC2_METADATA_DISABLED=true \
|
AWS_EC2_METADATA_DISABLED=true \
|
||||||
AWS_ENDPOINT_URL=https://storage.googleapis.com \
|
AWS_ENDPOINT_URL=https://storage.googleapis.com \
|
||||||
@ -44,7 +44,7 @@ You can tune parameters using `--model-loader-extra-config`:
|
|||||||
You can tune `concurrency`, which controls the level of concurrency and the number of OS threads reading tensors from the file into the CPU buffer.
|
You can tune `concurrency`, which controls the level of concurrency and the number of OS threads reading tensors from the file into the CPU buffer.
|
||||||
When reading from S3, it is the number of client instances the host opens to the S3 server.
|
When reading from S3, it is the number of client instances the host opens to the S3 server.
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
|
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
|
||||||
--load-format runai_streamer \
|
--load-format runai_streamer \
|
||||||
--model-loader-extra-config '{"concurrency":16}'
|
--model-loader-extra-config '{"concurrency":16}'
|
||||||
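The same loader options can also be used from the offline Python API; a sketch, assuming the `LLM` class forwards the `load_format` and `model_loader_extra_config` engine arguments and that the model path below exists locally:

```python
# Offline equivalent of the serve command above (a sketch, not verified
# against every vLLM version): the Run:ai streamer loader with a custom
# concurrency level.
from vllm import LLM

llm = LLM(
    model="/home/meta-llama/Llama-3.2-3B-Instruct",  # local path, as above
    load_format="runai_streamer",
    model_loader_extra_config={"concurrency": 16},
)
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```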
@ -53,7 +53,7 @@ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
|
|||||||
You can control and limit the size of the CPU memory buffer into which tensors are read from the file.
|
You can control and limit the size of the CPU memory buffer into which tensors are read from the file.
|
||||||
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
|
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
|
vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
|
||||||
--load-format runai_streamer \
|
--load-format runai_streamer \
|
||||||
--model-loader-extra-config '{"memory_limit":5368709120}'
|
--model-loader-extra-config '{"memory_limit":5368709120}'
|
||||||
@ -66,13 +66,13 @@ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct \
|
|||||||
|
|
||||||
vLLM also supports loading sharded models using Run:ai Model Streamer. This is particularly useful for large models that are split across multiple files. To use this feature, use the `--load-format runai_streamer_sharded` flag:
|
vLLM also supports loading sharded models using Run:ai Model Streamer. This is particularly useful for large models that are split across multiple files. To use this feature, use the `--load-format runai_streamer_sharded` flag:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm serve /path/to/sharded/model --load-format runai_streamer_sharded
|
vllm serve /path/to/sharded/model --load-format runai_streamer_sharded
|
||||||
```
|
```
|
||||||
|
|
||||||
The sharded loader expects model files to follow the same naming pattern as the regular sharded state loader: `model-rank-{rank}-part-{part}.safetensors`. You can customize this pattern using the `pattern` parameter in `--model-loader-extra-config`:
|
The sharded loader expects model files to follow the same naming pattern as the regular sharded state loader: `model-rank-{rank}-part-{part}.safetensors`. You can customize this pattern using the `pattern` parameter in `--model-loader-extra-config`:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm serve /path/to/sharded/model \
|
vllm serve /path/to/sharded/model \
|
||||||
--load-format runai_streamer_sharded \
|
--load-format runai_streamer_sharded \
|
||||||
--model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}'
|
--model-loader-extra-config '{"pattern":"custom-model-rank-{rank}-part-{part}.safetensors"}'
|
||||||
@ -82,7 +82,7 @@ To create sharded model files, you can use the script provided in <gh-file:examp
|
|||||||
|
|
||||||
The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way:
|
The sharded loader supports all the same tunable parameters as the regular Run:ai Model Streamer, including `concurrency` and `memory_limit`. These can be configured in the same way:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm serve /path/to/sharded/model \
|
vllm serve /path/to/sharded/model \
|
||||||
--load-format runai_streamer_sharded \
|
--load-format runai_streamer_sharded \
|
||||||
--model-loader-extra-config '{"concurrency":16, "memory_limit":5368709120}'
|
--model-loader-extra-config '{"concurrency":16, "memory_limit":5368709120}'
|
||||||
|
@ -178,7 +178,7 @@ Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project
|
|||||||
|
|
||||||
If you prefer, you can use the Hugging Face CLI to [download a model](https://huggingface.co/docs/huggingface_hub/guides/cli#huggingface-cli-download) or specific files from a model repository:
|
If you prefer, you can use the Hugging Face CLI to [download a model](https://huggingface.co/docs/huggingface_hub/guides/cli#huggingface-cli-download) or specific files from a model repository:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
# Download a model
|
# Download a model
|
||||||
huggingface-cli download HuggingFaceH4/zephyr-7b-beta
|
huggingface-cli download HuggingFaceH4/zephyr-7b-beta
|
||||||
|
|
||||||
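The same download can be done from Python with `huggingface_hub`, which is installed alongside vLLM; a short sketch using the model ID from the CLI example above:

```python
# Download the full model repository into the local Hugging Face cache.
from huggingface_hub import snapshot_download

local_path = snapshot_download("HuggingFaceH4/zephyr-7b-beta")
print(local_path)  # path inside the local HF cache
```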
@ -193,7 +193,7 @@ huggingface-cli download HuggingFaceH4/zephyr-7b-beta eval_results.json
|
|||||||
|
|
||||||
Use the Hugging Face CLI to [manage models](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#scan-your-cache) stored in local cache:
|
Use the Hugging Face CLI to [manage models](https://huggingface.co/docs/huggingface_hub/guides/manage-cache#scan-your-cache) stored in local cache:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
# List cached models
|
# List cached models
|
||||||
huggingface-cli scan-cache
|
huggingface-cli scan-cache
|
||||||
|
|
||||||
|
@ -34,15 +34,15 @@ output = llm.generate("San Francisco is a")
|
|||||||
|
|
||||||
To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run the API server on 4 GPUs:
|
To run multi-GPU serving, pass in the `--tensor-parallel-size` argument when starting the server. For example, to run the API server on 4 GPUs:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm serve facebook/opt-13b \
|
vllm serve facebook/opt-13b \
|
||||||
--tensor-parallel-size 4
|
--tensor-parallel-size 4
|
||||||
```
|
```
|
||||||
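For offline inference, the corresponding setting is the `tensor_parallel_size` argument of the `LLM` class; a sketch reusing the `facebook/opt-13b` model from the command above and assuming 4 GPUs are visible:

```python
# Offline multi-GPU inference with tensor parallelism across 4 GPUs.
from vllm import LLM

llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
print(output[0].outputs[0].text)
```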
|
|
||||||
You can additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run the API server on 8 GPUs with pipeline parallelism and tensor parallelism:
|
You can additionally specify `--pipeline-parallel-size` to enable pipeline parallelism. For example, to run the API server on 8 GPUs with pipeline parallelism and tensor parallelism:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm serve gpt2 \
|
vllm serve gpt2 \
|
||||||
--tensor-parallel-size 4 \
|
--tensor-parallel-size 4 \
|
||||||
--pipeline-parallel-size 2
|
--pipeline-parallel-size 2
|
||||||
```
|
```
|
||||||
@ -55,7 +55,7 @@ The first step, is to start containers and organize them into a cluster. We have
|
|||||||
|
|
||||||
Pick a node as the head node, and run the following command:
|
Pick a node as the head node, and run the following command:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
bash run_cluster.sh \
|
bash run_cluster.sh \
|
||||||
vllm/vllm-openai \
|
vllm/vllm-openai \
|
||||||
ip_of_head_node \
|
ip_of_head_node \
|
||||||
@ -66,7 +66,7 @@ bash run_cluster.sh \
|
|||||||
|
|
||||||
On the rest of the worker nodes, run the following command:
|
On the rest of the worker nodes, run the following command:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
bash run_cluster.sh \
|
bash run_cluster.sh \
|
||||||
vllm/vllm-openai \
|
vllm/vllm-openai \
|
||||||
ip_of_head_node \
|
ip_of_head_node \
|
||||||
@ -87,7 +87,7 @@ Then, on any node, use `docker exec -it node /bin/bash` to enter the container,
|
|||||||
|
|
||||||
After that, on any node, use `docker exec -it node /bin/bash` to enter the container again. **In the container**, you can use vLLM as usual, as if all the GPUs were on one node: vLLM will be able to leverage the GPU resources of all nodes in the Ray cluster, so run the `vllm` command only on this node, not on the other nodes. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
|
After that, on any node, use `docker exec -it node /bin/bash` to enter the container again. **In the container**, you can use vLLM as usual, as if all the GPUs were on one node: vLLM will be able to leverage the GPU resources of all nodes in the Ray cluster, so run the `vllm` command only on this node, not on the other nodes. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm serve /path/to/the/model/in/the/container \
|
vllm serve /path/to/the/model/in/the/container \
|
||||||
--tensor-parallel-size 8 \
|
--tensor-parallel-size 8 \
|
||||||
--pipeline-parallel-size 2
|
--pipeline-parallel-size 2
|
||||||
@ -95,7 +95,7 @@ After that, on any node, use `docker exec -it node /bin/bash` to enter the conta
|
|||||||
|
|
||||||
You can also use tensor parallelism without pipeline parallelism; just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
|
You can also use tensor parallelism without pipeline parallelism; just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm serve /path/to/the/model/in/the/container \
|
vllm serve /path/to/the/model/in/the/container \
|
||||||
--tensor-parallel-size 16
|
--tensor-parallel-size 16
|
||||||
```
|
```
|
||||||
|
@ -7,7 +7,7 @@ vLLM is also available via [LangChain](https://github.com/langchain-ai/langchain
|
|||||||
|
|
||||||
To install LangChain, run
|
To install LangChain, run
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
pip install langchain langchain_community -q
|
pip install langchain langchain_community -q
|
||||||
```
|
```
|
||||||
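A minimal usage sketch follows, assuming the `VLLM` wrapper from `langchain_community`; the exact class name and parameters may differ between LangChain versions, so treat this as an illustration rather than a reference:

```python
# Run a local model through LangChain's vLLM wrapper (an assumption about
# the langchain_community API; adjust to your installed version).
from langchain_community.llms import VLLM

llm = VLLM(
    model="facebook/opt-125m",  # small model, used here only for illustration
    max_new_tokens=64,
    temperature=0.8,
)
print(llm.invoke("What is the capital of France?"))
```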
|
|
||||||
|
@ -7,7 +7,7 @@ vLLM is also available via [LlamaIndex](https://github.com/run-llama/llama_index
|
|||||||
|
|
||||||
To install LlamaIndex, run
|
To install LlamaIndex, run
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
pip install llama-index-llms-vllm -q
|
pip install llama-index-llms-vllm -q
|
||||||
```
|
```
|
||||||
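A minimal usage sketch follows; the import path and class name are taken from the `llama-index-llms-vllm` package and are an assumption here, not something this page verifies:

```python
# Run a local model through LlamaIndex's vLLM integration (class name and
# parameters are assumptions; check the installed package version).
from llama_index.llms.vllm import Vllm

llm = Vllm(model="facebook/opt-125m", max_new_tokens=64)
print(llm.complete("What is the capital of France?"))
```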
|
|
||||||
|
@ -6,7 +6,7 @@ OpenAI compatible API server.
|
|||||||
|
|
||||||
You can start the server using Python, or using [Docker][deployment-docker]:
|
You can start the server using Python, or using [Docker][deployment-docker]:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm serve unsloth/Llama-3.2-1B-Instruct
|
vllm serve unsloth/Llama-3.2-1B-Instruct
|
||||||
```
|
```
|
||||||
|
|
||||||
|
@ -127,13 +127,13 @@ If GPU/CPU communication cannot be established, you can use the following Python
|
|||||||
|
|
||||||
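The referenced `test.py` is defined earlier in this guide; as a rough stand-in, a minimal sanity check that exercises both GPU (NCCL) and CPU (gloo) communication might look like the following sketch:

```python
# test.py (a stand-in sketch, not the exact script from this guide):
# verifies that all-reduce works over NCCL on GPU and over gloo on CPU.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # torchrun provides rank/world size
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

data = torch.ones(1, device="cuda")
dist.all_reduce(data)  # sums the tensor across all ranks over NCCL
assert data.item() == dist.get_world_size()

gloo_group = dist.new_group(backend="gloo")
cpu_data = torch.ones(1)
dist.all_reduce(cpu_data, group=gloo_group)  # CPU communication over gloo
assert cpu_data.item() == dist.get_world_size()

print(f"rank {dist.get_rank()}: NCCL and gloo sanity check passed")
dist.destroy_process_group()
```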
If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use:
|
If you are testing with a single node, adjust `--nproc-per-node` to the number of GPUs you want to use:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
|
NCCL_DEBUG=TRACE torchrun --nproc-per-node=<number-of-GPUs> test.py
|
||||||
```
|
```
|
||||||
|
|
||||||
If you are testing with multiple nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
|
If you are testing with multiple nodes, adjust `--nproc-per-node` and `--nnodes` according to your setup and set `MASTER_ADDR` to the correct IP address of the master node, reachable from all nodes. Then, run:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
NCCL_DEBUG=TRACE torchrun --nnodes 2 \
|
NCCL_DEBUG=TRACE torchrun --nnodes 2 \
|
||||||
--nproc-per-node=2 \
|
--nproc-per-node=2 \
|
||||||
--rdzv_backend=c10d \
|
--rdzv_backend=c10d \
|
||||||
|
@ -29,14 +29,14 @@ We currently support `/v1/chat/completions`, `/v1/embeddings`, and `/v1/score` e
|
|||||||
|
|
||||||
To follow along with this example, you can download the example batch, or create your own batch file in your working directory.
|
To follow along with this example, you can download the example batch, or create your own batch file in your working directory.
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl
|
wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl
|
||||||
```
|
```
|
||||||
|
|
||||||
Once you've created your batch file, it should look like this:
|
Once you've created your batch file, it should look like this:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
$ cat offline_inference/openai_batch/openai_example_batch.jsonl
|
cat offline_inference/openai_batch/openai_example_batch.jsonl
|
||||||
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
|
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
|
||||||
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
|
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
|
||||||
```
|
```
|
||||||
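If you would rather create the file programmatically than download it, a short sketch that writes the same two requests follows (model name and file name mirror the example above):

```python
# Write the example batch file shown above as JSON Lines.
import json

system_prompts = ["You are a helpful assistant.", "You are an unhelpful assistant."]
requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": "Hello world!"},
            ],
            "max_completion_tokens": 1000,
        },
    }
    for i, system in enumerate(system_prompts, start=1)
]

with open("openai_example_batch.jsonl", "w") as f:
    for request in requests:
        f.write(json.dumps(request) + "\n")
```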
@ -47,7 +47,7 @@ The batch running tool is designed to be used from the command line.
|
|||||||
|
|
||||||
You can run the batch with the following command, which will write its results to a file called `results.jsonl`
|
You can run the batch with the following command, which will write its results to a file called `results.jsonl`
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
python -m vllm.entrypoints.openai.run_batch \
|
python -m vllm.entrypoints.openai.run_batch \
|
||||||
-i offline_inference/openai_batch/openai_example_batch.jsonl \
|
-i offline_inference/openai_batch/openai_example_batch.jsonl \
|
||||||
-o results.jsonl \
|
-o results.jsonl \
|
||||||
@ -56,7 +56,7 @@ python -m vllm.entrypoints.openai.run_batch \
|
|||||||
|
|
||||||
or use the command line:
|
or use the command line:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm run-batch \
|
vllm run-batch \
|
||||||
-i offline_inference/openai_batch/openai_example_batch.jsonl \
|
-i offline_inference/openai_batch/openai_example_batch.jsonl \
|
||||||
-o results.jsonl \
|
-o results.jsonl \
|
||||||
@ -67,8 +67,8 @@ vllm run-batch \
|
|||||||
|
|
||||||
You should now have your results at `results.jsonl`. You can check your results by running `cat results.jsonl`
|
You should now have your results at `results.jsonl`. You can check your results by running `cat results.jsonl`
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
$ cat results.jsonl
|
cat results.jsonl
|
||||||
{"id":"vllm-383d1c59835645aeb2e07d004d62a826","custom_id":"request-1","response":{"id":"cmpl-61c020e54b964d5a98fa7527bfcdd378","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! It's great to meet you! I'm here to help with any questions or tasks you may have. What's on your mind today?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":25,"total_tokens":56,"completion_tokens":31}},"error":null}
|
{"id":"vllm-383d1c59835645aeb2e07d004d62a826","custom_id":"request-1","response":{"id":"cmpl-61c020e54b964d5a98fa7527bfcdd378","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! It's great to meet you! I'm here to help with any questions or tasks you may have. What's on your mind today?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":25,"total_tokens":56,"completion_tokens":31}},"error":null}
|
||||||
{"id":"vllm-42e3d09b14b04568afa3f1797751a267","custom_id":"request-2","response":{"id":"cmpl-f44d049f6b3a42d4b2d7850bb1e31bcc","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"*silence*"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":27,"total_tokens":32,"completion_tokens":5}},"error":null}
|
{"id":"vllm-42e3d09b14b04568afa3f1797751a267","custom_id":"request-2","response":{"id":"cmpl-f44d049f6b3a42d4b2d7850bb1e31bcc","object":"chat.completion","created":1715633336,"model":"meta-llama/Meta-Llama-3-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"*silence*"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":27,"total_tokens":32,"completion_tokens":5}},"error":null}
|
||||||
```
|
```
|
||||||
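Instead of `cat`, you can also inspect the results from Python; a sketch, assuming the chat-completion result format shown above:

```python
# Print each request's custom_id together with the generated reply.
import json

with open("results.jsonl") as f:
    for line in f:
        result = json.loads(line)
        reply = result["response"]["choices"][0]["message"]["content"]
        print(result["custom_id"], "->", reply)
```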
@ -79,7 +79,7 @@ The batch runner supports remote input and output urls that are accessible via h
|
|||||||
|
|
||||||
For example, to run against our example input file located at `https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl`, you can run
|
For example, to run against our example input file located at `https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl`, you can run
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
python -m vllm.entrypoints.openai.run_batch \
|
python -m vllm.entrypoints.openai.run_batch \
|
||||||
-i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
|
-i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
|
||||||
-o results.jsonl \
|
-o results.jsonl \
|
||||||
@ -88,7 +88,7 @@ python -m vllm.entrypoints.openai.run_batch \
|
|||||||
|
|
||||||
or use the command line:
|
or use the command line:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm run-batch \
|
vllm run-batch \
|
||||||
-i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
|
-i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
|
||||||
-o results.jsonl \
|
-o results.jsonl \
|
||||||
@ -112,21 +112,21 @@ To integrate with cloud blob storage, we recommend using presigned urls.
|
|||||||
|
|
||||||
To follow along with this example, you can download the example batch, or create your own batch file in your working directory.
|
To follow along with this example, you can download the example batch, or create your own batch file in your working directory.
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl
|
wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl
|
||||||
```
|
```
|
||||||
|
|
||||||
Once you've created your batch file, it should look like this:
|
Once you've created your batch file, it should look like this:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
$ cat offline_inference/openai_batch/openai_example_batch.jsonl
|
cat offline_inference/openai_batch/openai_example_batch.jsonl
|
||||||
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
|
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
|
||||||
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
|
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_completion_tokens": 1000}}
|
||||||
```
|
```
|
||||||
|
|
||||||
Now upload your batch file to your S3 bucket.
|
Now upload your batch file to your S3 bucket.
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
aws s3 cp offline_inference/openai_batch/openai_example_batch.jsonl s3://MY_BUCKET/MY_INPUT_FILE.jsonl
|
aws s3 cp offline_inference/openai_batch/openai_example_batch.jsonl s3://MY_BUCKET/MY_INPUT_FILE.jsonl
|
||||||
```
|
```
|
||||||
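The presigned input and output URLs used in the next step can be generated with `boto3`; a sketch, assuming the bucket and key names from this example and default AWS credentials:

```python
# Generate presigned URLs for the batch runner: read access for the input
# file and write access for the output file.
import boto3

s3 = boto3.client("s3")

input_url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "MY_BUCKET", "Key": "MY_INPUT_FILE.jsonl"},
    ExpiresIn=3600,
)
output_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "MY_BUCKET", "Key": "MY_OUTPUT_FILE.jsonl"},
    ExpiresIn=3600,
)
print(f"{input_url=}")
print(f"{output_url=}")
```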
|
|
||||||
@ -181,7 +181,7 @@ output_url='https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AW
|
|||||||
|
|
||||||
You can now run the batch runner, using the urls generated in the previous section.
|
You can now run the batch runner, using the urls generated in the previous section.
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
python -m vllm.entrypoints.openai.run_batch \
|
python -m vllm.entrypoints.openai.run_batch \
|
||||||
-i "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \
|
-i "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \
|
||||||
-o "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \
|
-o "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \
|
||||||
@ -190,7 +190,7 @@ python -m vllm.entrypoints.openai.run_batch \
|
|||||||
|
|
||||||
or use the command line:
|
or use the command line:
|
||||||
|
|
||||||
```console
|
```bash
|
||||||
vllm run-batch \
|
vllm run-batch \
|
||||||
-i "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \
|
-i "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_INPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \
|
||||||
-o "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \
|
-o "https://s3.us-west-2.amazonaws.com/MY_BUCKET/MY_OUTPUT_FILE.jsonl?AWSAccessKeyId=ABCDEFGHIJKLMNOPQRST&Signature=abcdefghijklmnopqrstuvwxyz12345&Expires=1715800091" \
|
||||||
@ -201,7 +201,7 @@ vllm run-batch \
Your results are now on S3. You can view them in your terminal by running:

```console
```bash
aws s3 cp s3://MY_BUCKET/MY_OUTPUT_FILE.jsonl -
```

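To post-process the output in Python instead of just printing it, a sketch along these lines (assuming `boto3`, the bucket and key above, and the response layout shown in the sample outputs later in this page) downloads the results and maps each `custom_id` to the generated message:

```python
import json

import boto3

s3 = boto3.client("s3")
s3.download_file("MY_BUCKET", "MY_OUTPUT_FILE.jsonl", "results.jsonl")

# Each line is one batch response object; collect the generated messages.
# Error handling for failed requests is omitted for brevity.
answers = {}
with open("results.jsonl") as f:
    for line in f:
        result = json.loads(line)
        body = result["response"]["body"]
        answers[result["custom_id"]] = body["choices"][0]["message"]["content"]

for custom_id, content in answers.items():
    print(custom_id, "->", content)
```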
@ -230,8 +230,8 @@ You can run the batch using the same command as in earlier examples.
You can check your results by running `cat results.jsonl`.

```console
```bash
$ cat results.jsonl
cat results.jsonl
{"id":"vllm-db0f71f7dec244e6bce530e0b4ef908b","custom_id":"request-1","response":{"status_code":200,"request_id":"vllm-batch-3580bf4d4ae54d52b67eee266a6eab20","body":{"id":"embd-33ac2efa7996430184461f2e38529746","object":"list","created":444647,"model":"intfloat/e5-mistral-7b-instruct","data":[{"index":0,"object":"embedding","embedding":[0.016204833984375,0.0092010498046875,0.0018358230590820312,-0.0028228759765625,0.001422882080078125,-0.0031147003173828125,...]}],"usage":{"prompt_tokens":8,"total_tokens":8,"completion_tokens":0}}},"error":null}
...
```

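As a sketch of how such embedding output might be consumed (assuming a local `results.jsonl` in the format shown above, with one vector per request), the returned vectors can be compared with a plain-Python cosine similarity:

```python
import json
import math

# Load one embedding vector per request from the batch output.
vectors = {}
with open("results.jsonl") as f:
    for line in f:
        result = json.loads(line)
        data = result["response"]["body"]["data"]
        vectors[result["custom_id"]] = data[0]["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

print(cosine(vectors["request-1"], vectors["request-2"]))
```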
@ -261,8 +261,8 @@ You can run the batch using the same command as in earlier examples.
You can check your results by running `cat results.jsonl`.

```console
```bash
$ cat results.jsonl
cat results.jsonl
{"id":"vllm-f87c5c4539184f618e555744a2965987","custom_id":"request-1","response":{"status_code":200,"request_id":"vllm-batch-806ab64512e44071b37d3f7ccd291413","body":{"id":"score-4ee45236897b4d29907d49b01298cdb1","object":"list","created":1737847944,"model":"BAAI/bge-reranker-v2-m3","data":[{"index":0,"object":"score","score":0.0010900497436523438},{"index":1,"object":"score","score":1.0}],"usage":{"prompt_tokens":37,"total_tokens":37,"completion_tokens":0,"prompt_tokens_details":null}}},"error":null}
{"id":"vllm-41990c51a26d4fac8419077f12871099","custom_id":"request-2","response":{"status_code":200,"request_id":"vllm-batch-73ce66379026482699f81974e14e1e99","body":{"id":"score-13f2ffe6ba40460fbf9f7f00ad667d75","object":"list","created":1737847944,"model":"BAAI/bge-reranker-v2-m3","data":[{"index":0,"object":"score","score":0.001094818115234375},{"index":1,"object":"score","score":1.0}],"usage":{"prompt_tokens":37,"total_tokens":37,"completion_tokens":0,"prompt_tokens_details":null}}},"error":null}
```

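A small sketch for reading such score output (again assuming a local `results.jsonl` in the format above, where each entry in `data` is the score of one document for that request) and picking the highest-scoring document per request:

```python
import json

with open("results.jsonl") as f:
    for line in f:
        result = json.loads(line)
        scores = result["response"]["body"]["data"]
        # Pick the document with the highest relevance score for this request.
        best = max(scores, key=lambda item: item["score"])
        print(result["custom_id"], "-> index", best["index"], "score", best["score"])
```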
@ -2,7 +2,7 @@
1. Install OpenTelemetry packages:

```console
```bash
pip install \
'opentelemetry-sdk>=1.26.0,<1.27.0' \
'opentelemetry-api>=1.26.0,<1.27.0' \
@ -12,7 +12,7 @@
1. Start Jaeger in a docker container:

```console
```bash
# From: https://www.jaegertracing.io/docs/1.57/getting-started/
docker run --rm --name jaeger \
-e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
@ -31,14 +31,14 @@
1. In a new shell, export Jaeger IP:

```console
```bash
export JAEGER_IP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' jaeger)
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
```

Then set vLLM's service name for OpenTelemetry, enable insecure connections to Jaeger, and run vLLM:

```console
```bash
export OTEL_SERVICE_NAME="vllm-server"
export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
vllm serve facebook/opt-125m --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
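With the server running, any request will now produce a server-side trace. As an illustration (not part of the original guide, assuming the OpenAI Python client and the default local port 8000), a quick smoke-test request might look like:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="facebook/opt-125m",
    prompt="San Francisco is a",
    max_tokens=16,
)
print(completion.choices[0].text)
```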
@ -46,7 +46,7 @@
1. In a new shell, send requests with trace context from a dummy client:

```console
```bash
export JAEGER_IP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' jaeger)
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
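The dummy client script itself is not shown in this hunk. A sketch of what such a client could look like (hypothetical, assuming the OpenTelemetry SDK and OTLP exporter installed earlier, the `requests` library, and a vLLM server on `localhost:8000`) is:

```python
import requests
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Export the client-side span to the same collector configured via
# OTEL_EXPORTER_OTLP_TRACES_ENDPOINT above.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("dummy-client")

with tracer.start_as_current_span("client-request"):
    headers = {}
    inject(headers)  # adds the W3C traceparent header for the current span
    response = requests.post(
        "http://localhost:8000/v1/completions",
        headers=headers,
        json={"model": "facebook/opt-125m", "prompt": "Hello", "max_tokens": 8},
    )
    print(response.json())
```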
@ -67,7 +67,7 @@
OpenTelemetry supports either `grpc` or `http/protobuf` as the transport protocol for trace data in the exporter.
By default, `grpc` is used. To set `http/protobuf` as the protocol, configure the `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` environment variable as follows:

```console
```bash
export OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://$JAEGER_IP:4318/v1/traces
vllm serve facebook/opt-125m --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
@ -79,13 +79,13 @@ OpenTelemetry allows automatic instrumentation of FastAPI.
1. Install the instrumentation library:

```console
```bash
pip install opentelemetry-instrumentation-fastapi
```

1. Run vLLM with `opentelemetry-instrument`:

```console
```bash
opentelemetry-instrument vllm serve facebook/opt-125m
```
