Compare commits

...

30 Commits

Author SHA1 Message Date
1244c25908 minimize fill_
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-04 22:03:51 +00:00
34fb0cbbd0 minimize changes to gpu_model_runner.py
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-04 21:32:41 +00:00
f33ec6d2d2 Remove unneeded changes
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-04 18:13:50 +00:00
c33aeecf24 simplify - get rid of tokenshape
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-03 22:42:09 +00:00
230730c34d Updates for FA3 and other changes
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-03 20:58:48 +00:00
d151b63b8b Merge branch 'main' into hacking_full_cudagraph
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-03 14:05:45 -05:00
a1a2aaadb9 [Model]: Add transformers backend support (#11330)
# Adds support for `transformers` as a backend

Following https://github.com/huggingface/transformers/pull/35235, a bunch
of models should already be supported, and we are ramping up support for
more models.

Thanks @Isotr0py for the TP support, and @hmellor for his help as well!
This includes:
- `trust_remote_code=True` support: any model on the Hub that implements attention the correct way can be natively supported!!
- tensor parallel support

---------

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-02-03 21:30:38 +08:00
1298a400e8 [ci/build] fix gh200 test (#12681)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-03 15:59:49 +08:00
ad4a9dc817 [cuda] manually import the correct pynvml module (#12679)
Fixes problems like https://github.com/vllm-project/vllm/pull/12635,
https://github.com/vllm-project/vllm/pull/12636, and
https://github.com/vllm-project/vllm/pull/12565.

---------

Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-03 15:58:21 +08:00
b9986454fe Fix for attention layers to remain unquantized during moe_wn16 quant (#12570)
Fix to AWQ quant loading of the new R1 model.

The new optimized MoE kernel for a large number of experts, `moe_wn16`,
uses AWQ quant, which requires the attention layers to be in 16-bit.

The current merge has broken this, and `get_quant_method` must return
None for attention layers for it to work correctly again.

---------

Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Beim <beim2015@outlook.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Beim <805908499@qq.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: fade_away <1028552010@qq.com>
Co-authored-by: weilong.yu <weilong.yu@shopee.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Shawn Du <shawnd200@outlook.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-02-03 13:46:19 +08:00
c5932e5dac Properly check if all fused layers are in the list of targets (#12666)
Thanks @kylesayrs for catching this!
2025-02-03 13:42:18 +08:00
20579c0fae make sure mistral_common not imported for non-mistral models (#12669)
When people use DeepSeek models, they find that they need to resolve a cv2
version conflict; see https://zhuanlan.zhihu.com/p/21064432691 .

I added the check and made all imports of `cv2` lazy.
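
As a rough illustration of the lazy-import pattern this refers to (the helper name below is hypothetical, not vLLM's actual code):

```python
# Minimal sketch of a lazy cv2 import: the module is only imported when image
# input is actually processed, so text-only users never hit opencv version
# conflicts at import time. (Hypothetical helper, not vLLM's exact code.)
def _load_image(path: str):
    import cv2  # deferred import
    return cv2.imread(path)
```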

---------

Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-03 13:40:25 +08:00
95460fc513 [Kernel] port sgl moe_align_block_size kernels (#12574)
`sgl_moe_align_block_size` is based on:
ded9fcd09a

`moe_align_block_size` is based on:
ba5112ff69

Signed-off-by: Yang Chen <yangche@fb.com>
2025-02-03 13:09:50 +08:00
326fcc8b9f [Doc] Deprecate Discord (#12668) 2025-02-02 19:19:56 -08:00
e64330910b [doc][misc] clarify VLLM_HOST_IP for multi-node inference (#12667)
As more and more people try DeepSeek models with multi-node inference,
https://github.com/vllm-project/vllm/issues/7815 comes up more frequently.
Let's give users a clear message.
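
As a rough per-node illustration (the address below is a placeholder; set it differently on each node):

```python
# Hypothetical launch-time snippet: pin the interface vLLM/Ray should use for
# multi-node communication before the engine starts. VLLM_HOST_IP must be the
# IP of *this* node and must be reachable from every other node in the cluster.
import os

os.environ["VLLM_HOST_IP"] = "192.168.1.10"  # placeholder; differs per node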

Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-03 09:32:18 +08:00
e489ad7a21 [Misc] Add SPDX-License-Identifier headers to python source files (#12628)
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**

commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:18:24 2025 -0500

    Add SPDX license headers to python source files
    
    This commit adds SPDX license headers to python source files as
    recommended to the project by the Linux Foundation. These headers
    provide a concise way that is both human and machine readable for
    communicating license information for each source file. It helps avoid
    any ambiguity about the license of the code and can also be easily used
    by tools to help manage license compliance.

    The Linux Foundation runs license scans against the codebase to help
    ensure we are in compliance with the licenses of the code we use,
    including dependencies. Having these headers in place helps that tool
    do its job.

    More information can be found on the SPDX site:

    - https://spdx.dev/learn/handling-license-info/

    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:36:32 2025 -0500

    Check for SPDX headers using pre-commit
    
    Signed-off-by: Russell Bryant <rbryant@redhat.com>
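
As a minimal sketch of what such a pre-commit check can look like (illustrative only; the actual `tools/check_spdx_header.py` may differ):

```python
# Hypothetical minimal SPDX header check: exits non-zero if any of the given
# Python files lacks the Apache-2.0 SPDX identifier near the top of the file.
import sys

HEADER = "# SPDX-License-Identifier: Apache-2.0"


def main(paths: list[str]) -> int:
    missing = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            head = f.read(2048)
        if HEADER not in head:
            missing.append(path)
    for path in missing:
        print(f"missing SPDX header: {path}")
    return 1 if missing else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```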

---------

Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-02 11:58:18 -08:00
f256ebe4df [Hardware][Intel GPU] add XPU bf16 support (#12392)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-02-02 10:17:26 +00:00
f8ece6e17f [Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608)
As mentioned in RFC https://github.com/vllm-project/vllm/issues/12254,
this PR achieves the task of combining allocate_slots and append_slots.

There should be no functionality change, except that decode now also
raises an exception when num_tokens is zero (like prefill), and the unit
test case is changed accordingly.
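
As a schematic sketch of the unification (toy names and logic; not vLLM's actual `KVCacheManager`):

```python
# Toy sketch: one allocation path for both prefill and decode, with the
# zero-token case rejected uniformly. (Hypothetical; not vLLM's real code.)
class KVCacheManager:
    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.allocated: dict[str, int] = {}  # request_id -> blocks held

    def allocate_slots(self, request_id: str, num_tokens: int) -> int:
        if num_tokens == 0:
            raise ValueError("num_tokens must be greater than 0")
        # Previously prefill (allocate_slots) and decode (append_slots) were
        # separate code paths; here both go through the same method.
        new_blocks = -(-num_tokens // self.block_size)  # ceil division
        self.allocated[request_id] = self.allocated.get(request_id, 0) + new_blocks
        return new_blocks
```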

@comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo

---------

Signed-off-by: Shawn Du <shawnd200@outlook.com>
2025-02-02 16:40:58 +08:00
abfcdcdf27 [V1][Minor] Avoid frequently creating ConstantList (#12653)
A small optimization to avoid creating a new `ConstantList` every time `request.kv_block_hashes` is used.
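
A rough sketch of the kind of caching this describes (toy classes; not vLLM's actual `Request`/`ConstantList`):

```python
# Hypothetical sketch: build the read-only view once and reuse it, instead of
# wrapping the underlying list on every property access.
from functools import cached_property


class ConstantList(list):
    """Toy stand-in for an immutable list wrapper."""


class Request:
    def __init__(self, kv_block_hashes: list[int]):
        self._kv_block_hashes = kv_block_hashes

    @cached_property
    def kv_block_hashes(self) -> "ConstantList":
        # Created on first access, then cached for the lifetime of the request.
        return ConstantList(self._kv_block_hashes)
```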

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-01 23:43:20 -08:00
e497f33491 [Core] Silence unnecessary deprecation warnings (#12620)
I noticed during testing that I was getting a lot of these deprecation
warnings about `local_lora_path`:

```
DeprecationWarning: The 'lora_local_path' attribute is deprecated
     and will be removed in a future version.
     Please use 'lora_path' instead.
```

The check used for emitting this warning was always True, even when the
parameter was not actually specified, because the field will always be in
`__struct_fields__`. We should be checking for a non-None value instead.
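
For illustration, a minimal sketch of the difference between the two checks (a hypothetical msgspec struct, not vLLM's actual `LoRARequest`):

```python
# Sketch of the bug: checking field *presence* vs. checking for a non-None
# value. (Hypothetical struct and fields; not vLLM's exact code.)
import warnings
from typing import Optional

import msgspec


class LoRARequest(msgspec.Struct):
    lora_name: str = ""
    lora_path: str = ""
    lora_local_path: Optional[str] = None

    def __post_init__(self):
        # Buggy check: the field name is *always* listed in __struct_fields__,
        # so this warned even when the deprecated argument was never passed:
        #     if "lora_local_path" in self.__struct_fields__: ...
        # Correct check: warn only when a value was actually provided.
        if self.lora_local_path is not None:
            warnings.warn(
                "The 'lora_local_path' attribute is deprecated and will be "
                "removed in a future version. Please use 'lora_path' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
```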

Signed-off-by: Russell Bryant <rbryant@redhat.com>

Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-02 15:35:50 +08:00
baaa2b24da [Bugfix] fix moe_wna16 get_quant_method (#12648)
Fix https://github.com/vllm-project/vllm/issues/12647
The `get_quant_method` of `moe_wna16` always returns the MoE method, a
GPTQ-based linear method, or an AWQ-based linear method, even when the
target module is an attention layer.


baeded2569/vllm/attention/layer.py (L86-L92)
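
As a rough sketch of the fix described above (standalone toy types, not vLLM's actual classes):

```python
# Toy sketch: get_quant_method must fall through to None for attention layers
# so they stay unquantized (16-bit); only MoE / linear layers get a quantized
# method. (Hypothetical types; not vLLM's exact implementation.)
class Attention: ...
class FusedMoE: ...
class LinearBase: ...


class MoeWNA16Config:
    def get_quant_method(self, layer, prefix: str):
        if isinstance(layer, Attention):
            return None  # keep attention layers in 16-bit
        if isinstance(layer, FusedMoE):
            return "moe_wna16_method"      # placeholder for the MoE method
        if isinstance(layer, LinearBase):
            return "gptq_or_awq_linear"    # placeholder for the linear method
        return None
```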

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
2025-02-02 15:29:56 +08:00
b4e5c03306 doc: fixing minor typo in readme.md (#12643)
The word "evolved" was mistyped.

Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>

---------

Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
2025-02-01 17:17:29 +00:00
3194039c0e Apply torch.compile to fused_moe/grouped_topk (#12637) 2025-02-01 16:16:19 +00:00
bed9efafeb Merge branch 'main' into hacking_full_cudagraph
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-24 16:08:18 -05:00
d4c9448b26 WIP initial working version
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-17 13:54:51 -05:00
76732ff701 Merge branch 'main' into hacking_full_cudagraph 2025-01-03 15:06:23 -05:00
22bd7296e4 disable inductor, disable piecewise
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-03 15:05:52 -05:00
4024253797 Merge branch 'main' into hacking_full_cudagraph
Eager mode works now

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-03 13:16:56 -05:00
7eba374599 flash attn changes
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-02 13:26:21 -05:00
0c7e6c1e36 Hacky hacky
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-18 15:13:26 -05:00
1026 changed files with 3003 additions and 232 deletions

View File

@ -1,12 +1,14 @@
# SPDX-License-Identifier: Apache-2.0
import os
import sys
import zipfile
# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 300 MiB
# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 400 MiB
# Note that we have 400 MiB quota, please use it wisely.
# See https://github.com/pypi/support/issues/3792 .
# Please also sync the value with the one in Dockerfile.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 300))
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 400))
def print_top_10_largest_files(zip_file):

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import os

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
LM eval harness on model to compare vs HF baseline computed offline.
Configs are found in configs/$MODEL.yaml

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import json
import os
from pathlib import Path

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
from transformers import AutoTokenizer

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import json
from pathlib import Path

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient("http://localhost:8000")

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import datetime
import json
import os

View File

@ -23,6 +23,6 @@ trap remove_docker_container EXIT
remove_docker_container
# Run the image and test offline inference
docker run --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
python3 examples/offline_inference/basic.py
docker run -e HF_TOKEN -v /root/.cache/huggingface:/root/.cache/huggingface --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
python3 examples/offline_inference/cli.py --model meta-llama/Llama-3.2-1B
'

View File

@ -50,9 +50,9 @@ steps:
- tests/multimodal
- tests/test_utils
- tests/worker
- tests/standalone_tests/lazy_torch_compile.py
- tests/standalone_tests/lazy_imports.py
commands:
- python3 standalone_tests/lazy_torch_compile.py
- python3 standalone_tests/lazy_imports.py
- pytest -v -s mq_llm_engine # MQLLMEngine
- pytest -v -s async_engine # AsyncLLMEngine
- NUM_SCHEDULER_STEPS=4 pytest -v -s async_engine/test_async_llm_engine.py
@ -349,6 +349,7 @@ steps:
- vllm/
- tests/models
commands:
- pytest -v -s models/test_transformers.py
- pytest -v -s models/test_registry.py
- pytest -v -s models/test_initialization.py
@ -485,6 +486,7 @@ steps:
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
- TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
# Avoid importing model tests that cause CUDA reinitialization error
- pytest models/test_transformers.py -v -s -m 'distributed(num_gpus=2)'
- pytest models/encoder_decoder/language/test_bart.py -v -s -m 'distributed(num_gpus=2)'
- pytest models/encoder_decoder/vision_language/test_broadcast.py -v -s -m 'distributed(num_gpus=2)'
- pytest models/decoder_only/vision_language/test_models.py -v -s -m 'distributed(num_gpus=2)'

View File

@ -97,10 +97,14 @@ repos:
language: system
verbose: true
stages: [commit-msg]
- id: check-spdx-header
name: Check SPDX headers
entry: python tools/check_spdx_header.py
language: python
types: [python]
- id: suggestion
name: Suggestion
entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
language: system
verbose: true
pass_filenames: false

View File

@ -61,7 +61,7 @@ representative at an online or offline/IRL event.
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement in the #code-of-conduct
channel in the [vLLM Discord](https://discord.com/invite/jz7wjKhh6g).
channel in the [vLLM Slack](https://slack.vllm.ai).
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the

View File

@ -127,7 +127,7 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
# Check the size of the wheel if RUN_WHEEL_CHECK is true
COPY .buildkite/check-wheel-size.py check-wheel-size.py
# sync the default value with .buildkite/check-wheel-size.py
ARG VLLM_MAX_SIZE_MB=300
ARG VLLM_MAX_SIZE_MB=400
ENV VLLM_MAX_SIZE_MB=$VLLM_MAX_SIZE_MB
ARG RUN_WHEEL_CHECK=true
RUN if [ "$RUN_WHEEL_CHECK" = "true" ]; then \

View File

@ -10,7 +10,7 @@ Easy, fast, and cheap LLM serving for everyone
</h3>
<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>
---
@ -36,7 +36,7 @@ Easy, fast, and cheap LLM serving for everyone
## About
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evloved into a community-driven project with contributions from both academia and industry.
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
@ -139,8 +139,7 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs
## Contact Us
* For technical questions and feature requests, please use Github issues or discussions.
* For discussing with fellow users, please use Discord.
* For coordinating contributions and development, please use Slack.
* For discussing with fellow users and coordinating contributions and development, please use Slack.
* For security disclosures, please use Github's security advisory feature.
* For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu.

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import json
import os
import sys

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""Benchmark guided decoding throughput."""
import argparse
import dataclasses

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""Benchmark the latency of processing a single batch of requests."""
import argparse
import dataclasses

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
Offline benchmark to test the long document QA throughput.

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
Benchmark the efficiency of prefix caching.

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""Benchmark offline prioritization."""
import argparse
import dataclasses

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
r"""Benchmark online serving throughput.
On the server side, run one of the following commands:

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
r"""Benchmark online serving throughput with guided decoding.
On the server side, run one of the following commands:

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""Benchmark offline inference throughput."""
import argparse
import dataclasses

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import copy
import itertools

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
# Cutlass bench utils
from typing import Iterable, Tuple

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import copy
import itertools

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
# Weight Shapes are in the format
# ([K, N], TP_SPLIT_DIM)
# Example:

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import os
import aiohttp

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import asyncio
import itertools

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import json
import matplotlib.pyplot as plt

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import pickle as pkl
import time
from dataclasses import dataclass

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import os
import sys
from typing import Optional

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import time
import torch

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import copy
import json

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import copy
import itertools

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from typing import List
import torch

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import time
from datetime import datetime

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import random
import time
from typing import List, Optional

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import time
import torch

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import itertools
from typing import Optional, Tuple, Union

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from itertools import accumulate
from typing import List, Optional

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
WEIGHT_SHAPES = {
"ideal": [[4 * 256 * 32, 256 * 32]],
"mistralai/Mistral-7B-v0.1/TP1": [

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import math
import pickle
import re

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import dataclasses
from typing import Any, Callable, Iterable, Optional

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
# Weight Shapes are in the format
# ([K, N], TP_SPLIT_DIM)
# Example:

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import cProfile
import pstats

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
#!/usr/bin/env python3
#

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
# ruff: noqa
# code borrowed from https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import enum
from typing import Dict, Union

View File

@ -197,6 +197,72 @@ __global__ void moe_align_block_size_global_mem_kernel(
}
}
// taken from
// https://github.com/sgl-project/sglang/commit/ded9fcd09a43d5e7d5bb31a2bc3e9fc21bf65d2a
template <typename scalar_t>
__global__ void sgl_moe_align_block_size_kernel(
scalar_t* __restrict__ topk_ids, int32_t* sorted_token_ids,
int32_t* expert_ids, int32_t* total_tokens_post_pad, int32_t num_experts,
int32_t block_size, size_t numel, int32_t* cumsum) {
__shared__ int32_t shared_counts[32][8];
__shared__ int32_t local_offsets[256];
const int warp_id = threadIdx.x / WARP_SIZE;
const int lane_id = threadIdx.x % WARP_SIZE;
const int experts_per_warp = 8;
const int my_expert_start = warp_id * experts_per_warp;
for (int i = 0; i < experts_per_warp; ++i) {
if (my_expert_start + i < num_experts) {
shared_counts[warp_id][i] = 0;
}
}
const size_t tokens_per_thread = CEILDIV(numel, blockDim.x);
const size_t start_idx = threadIdx.x * tokens_per_thread;
for (int i = start_idx; i < numel && i < start_idx + tokens_per_thread; ++i) {
int expert_id = topk_ids[i];
int warp_idx = expert_id / experts_per_warp;
int expert_offset = expert_id % experts_per_warp;
atomicAdd(&shared_counts[warp_idx][expert_offset], 1);
}
__syncthreads();
if (threadIdx.x == 0) {
cumsum[0] = 0;
for (int i = 1; i <= num_experts; ++i) {
int expert_count = 0;
int warp_idx = (i - 1) / experts_per_warp;
int expert_offset = (i - 1) % experts_per_warp;
expert_count = shared_counts[warp_idx][expert_offset];
cumsum[i] =
cumsum[i - 1] + CEILDIV(expert_count, block_size) * block_size;
}
*total_tokens_post_pad = cumsum[num_experts];
}
__syncthreads();
if (threadIdx.x < num_experts) {
for (int i = cumsum[threadIdx.x]; i < cumsum[threadIdx.x + 1];
i += block_size) {
expert_ids[i / block_size] = threadIdx.x;
}
local_offsets[threadIdx.x] = cumsum[threadIdx.x];
}
__syncthreads();
for (int i = start_idx; i < numel && i < start_idx + tokens_per_thread; ++i) {
int32_t expert_id = topk_ids[i];
int32_t rank_post_pad = atomicAdd(&local_offsets[expert_id], 1);
sorted_token_ids[rank_post_pad] = i;
}
}
template <typename scalar_t, int TOPK>
__global__ void moe_sum_kernel(
scalar_t* __restrict__ out, // [..., d]
@ -305,6 +371,32 @@ void moe_align_block_size(torch::Tensor topk_ids, int64_t num_experts,
}
}
void sgl_moe_align_block_size(torch::Tensor topk_ids, int64_t num_experts,
int64_t block_size,
torch::Tensor sorted_token_ids,
torch::Tensor experts_ids,
torch::Tensor num_tokens_post_pad) {
const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
VLLM_DISPATCH_INTEGRAL_TYPES(
topk_ids.scalar_type(), "sgl_moe_align_block_size_kernel", [&] {
// calc needed amount of shared mem for `tokens_cnts` and `cumsum`
// tensors
auto options_int =
torch::TensorOptions().dtype(torch::kInt).device(topk_ids.device());
// torch::Tensor token_cnts_buffer =
// torch::empty({(num_experts + 1) * num_experts}, options_int);
torch::Tensor cumsum_buffer =
torch::empty({num_experts + 1}, options_int);
auto kernel = vllm::moe::sgl_moe_align_block_size_kernel<scalar_t>;
kernel<<<1, 1024, 0, stream>>>(
topk_ids.data_ptr<scalar_t>(), sorted_token_ids.data_ptr<int32_t>(),
experts_ids.data_ptr<int32_t>(),
num_tokens_post_pad.data_ptr<int32_t>(), num_experts, block_size,
topk_ids.numel(), cumsum_buffer.data_ptr<int32_t>());
});
}
void moe_sum(torch::Tensor& input, // [num_tokens, topk, hidden_size]
torch::Tensor& output) // [num_tokens, hidden_size]
{

View File

@ -12,3 +12,9 @@ void moe_align_block_size(torch::Tensor topk_ids, int64_t num_experts,
int64_t block_size, torch::Tensor sorted_token_ids,
torch::Tensor experts_ids,
torch::Tensor num_tokens_post_pad);
void sgl_moe_align_block_size(torch::Tensor topk_ids, int64_t num_experts,
int64_t block_size,
torch::Tensor sorted_token_ids,
torch::Tensor experts_ids,
torch::Tensor num_tokens_post_pad);

View File

@ -22,6 +22,15 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, m) {
" Tensor! num_tokens_post_pad) -> ()");
m.impl("moe_align_block_size", torch::kCUDA, &moe_align_block_size);
// temporarily adapted from
// https://github.com/sgl-project/sglang/commit/ded9fcd09a43d5e7d5bb31a2bc3e9fc21bf65d2a
m.def(
"sgl_moe_align_block_size(Tensor topk_ids, int num_experts,"
" int block_size, Tensor! sorted_token_ids,"
" Tensor! experts_ids,"
" Tensor! num_tokens_post_pad) -> ()");
m.impl("sgl_moe_align_block_size", torch::kCUDA, &sgl_moe_align_block_size);
#ifndef USE_ROCM
m.def(
"marlin_gemm_moe(Tensor! a, Tensor! b_q_weights, Tensor! sorted_ids, "

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import itertools
import math
import os

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import itertools
import re
from dataclasses import dataclass, field

View File

@ -36,7 +36,7 @@ VLLM_TARGET_DEVICE=xpu python setup.py install
:::{note}
- FP16 is the default data type in the current XPU backend. The BF16 data
type will be supported in the future.
type is supported on Intel Data Center GPU, not supported on Intel Arc GPU yet.
:::
## Set up using Docker

View File

@ -40,6 +40,82 @@ If vLLM successfully returns text (for generative models) or hidden states (for
Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.
### Transformers fallback
After the merge of <gh-pr:11330>, `vllm` can fall back to models that are available in `transformers`. This does not work for all models for now, but most decoder language models are supported, and vision language model support is planned!
To check if the backend is `transformers`, you can simply do this:
```python
from vllm import LLM
llm = LLM(model=..., task="generate") # Name or path of your model
llm.apply_model(lambda model: print(model.__class__))
```
If it is `TransformersModel` then it means it's based on `transformers`!
#### Supported features
##### LoRA and quantization
Both are not supported yet! Make sure to open an issue and we'll work on this together with the `transformers` team!
Usually, `transformers` models load adapter weights via the `load_adapter` API, which depends on PEFT. We need to work a bit to either use this API (for now this would result in some weights not being marked as loaded) or replace modules accordingly.
Here is a hint of how this could look:
```python
class TransformersModel(nn.Module, SupportsLoRA):

    def __init__(self, *args, **kwargs):
        ...
        self.model.load_adapter(vllm_config.load_config.model_loader_extra_config["qlora_adapter_name_or_path"])
```
The blocker is that you need to specify the supported LoRA layers, when ideally we would want to load whatever is inside the checkpoint!
##### Remote code
This fallback also means that any model on the Hub that can be used in `transformers` with `trust_remote_code=True` and correctly implements attention can be used in production!
```python
from vllm import LLM
llm = LLM(model=..., task="generate", trust_remote_code=True) # Name or path of your model
llm.apply_model(lambda model: print(model.__class__))
```
A model just needs the following two things:
```python
from transformers import PreTrainedModel
from torch import nn


class MyAttention(nn.Module):

    def forward(self, hidden_states, **kwargs):  # <- kwargs are required
        ...
        attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        attn_output, attn_weights = attention_interface(
            self,
            query_states,
            key_states,
            value_states,
            **kwargs,
        )
        ...


class MyModel(PreTrainedModel):
    _supports_attention_backend = True
```
Here is what happens in the background:
1. The config is loaded.
2. The `MyModel` Python class is loaded from the `auto_map`, and we check that the model sets `_supports_attention_backend`.
3. The `TransformersModel` backend is used. See `/model_executors/models/transformers`, which leverages `self.config._attn_implementation = "vllm"`, hence the need to use `ALL_ATTENTION_FUNCTIONS`.
That's it!
### ModelScope
To use models from [ModelScope](https://www.modelscope.cn) instead of HuggingFace Hub, set an environment variable:

View File

@ -60,7 +60,8 @@ bash run_cluster.sh \
vllm/vllm-openai \
ip_of_head_node \
--head \
/path/to/the/huggingface/home/in/this/node
/path/to/the/huggingface/home/in/this/node \
-e VLLM_HOST_IP=ip_of_this_node
```
On the rest of the worker nodes, run the following command:
@ -70,10 +71,11 @@ bash run_cluster.sh \
vllm/vllm-openai \
ip_of_head_node \
--worker \
/path/to/the/huggingface/home/in/this/node
/path/to/the/huggingface/home/in/this/node \
-e VLLM_HOST_IP=ip_of_this_node
```
Then you get a ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which is accessible by all the worker nodes. A common misunderstanding is to use the IP address of the worker node, which is not correct.
Then you get a ray cluster of containers. Note that you need to keep the shells running these commands alive to hold the cluster. Any shell disconnect will terminate the cluster. In addition, please note that the argument `ip_of_head_node` should be the IP address of the head node, which is accessible by all the worker nodes. The IP addresses of each worker node should be specified in the `VLLM_HOST_IP` environment variable, and should be different for each worker node. Please check the network configuration of your cluster to make sure the nodes can communicate with each other through the specified IP addresses.
Then, on any node, use `docker exec -it node /bin/bash` to enter the container, execute `ray status` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
@ -103,3 +105,7 @@ Please make sure you downloaded the model to all the nodes (with the same path),
When you use huggingface repo id to refer to the model, you should append your huggingface token to the `run_cluster.sh` script, e.g. `-e HF_TOKEN=`. The recommended way is to download the model first, and then use the path to refer to the model.
:::
:::{warning}
If you keep receiving the error message `Error: No available node types can fulfill resource request` but you have enough GPUs in the cluster, chances are your nodes have multiple IP addresses and vLLM cannot find the right one, especially when you are using multi-node inference. Please make sure vLLM and ray use the same IP address. You can set the `VLLM_HOST_IP` environment variable to the right IP address in the `run_cluster.sh` script (different for each node!), and check `ray status` to see the IP address used by Ray. See <gh-issue:7815> for more information.
:::

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM, SamplingParams
from vllm.utils import FlexibleArgumentParser

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM, SamplingParams
# Sample prompts.

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
This example shows how to use vLLM for running offline inference
with the correct prompt format on audio language models.

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM, SamplingParams
# Sample prompts.

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM
# Sample prompts.

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
# ruff: noqa
import json
import random

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM
# Sample prompts.

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from dataclasses import asdict
from vllm import LLM, SamplingParams

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM, SamplingParams
# Sample prompts.

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
This example shows how to use Ray Data for running offline batch inference
distributively on a multi-nodes cluster.

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM
# Sample prompts.

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
'''
Demonstrate prompting of text-to-text
encoder/decoder models, specifically BART

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
'''
Demonstrate prompting of text-to-text
encoder/decoder models, specifically Florence-2

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
from typing import List, Tuple

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
This example shows how to use LoRA with different quantization techniques
for offline inference.

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import gc
import time
from typing import List

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
This example shows how to use the multi-LoRA functionality
for offline inference.

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM, SamplingParams
# Sample prompts.

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import os
from vllm import LLM, SamplingParams

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
# ruff: noqa
import argparse

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM, SamplingParams
from vllm.distributed import cleanup_dist_env_and_memory

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import inspect
import json
import os

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import dataclasses
import os

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
a simple demonstration of RLHF with vLLM, inspired by
the OpenRLHF framework https://github.com/OpenRLHF/OpenRLHF .

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
Saves each worker's model state dict directly to a checkpoint, which enables a
fast load path for large tensor-parallel models where each worker only needs to

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM
# Sample prompts.

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import os
import time

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from enum import Enum
from pydantic import BaseModel

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
experimental support for tensor-parallel inference with torchrun,
see https://github.com/vllm-project/vllm/issues/11400 for

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from vllm import LLM, SamplingParams
prompts = [

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
This example shows how to use vLLM for running offline inference with
the correct prompt format on vision language models for text generation.

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
This example shows how to use vLLM for running offline inference with
the correct prompt format on vision language models for multimodal embedding.

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
This example shows how to use vLLM for running offline inference with
multi-image input on vision language models for text generation,

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import time
from vllm import LLM, SamplingParams

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""Example Python client for `vllm.entrypoints.api_server`
NOTE: The API server is used only for demonstration and simple performance
benchmarks. It is not intended for production use.

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
Example of using the OpenAI entrypoint's rerank API which is compatible with
the Cohere SDK: https://github.com/cohere-ai/cohere-python

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import gradio as gr

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import json

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
Example of using the OpenAI entrypoint's rerank API which is compatible with
Jina and Cohere https://jina.ai/reranker

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server.

Some files were not shown because too many files have changed in this diff.