Compare commits

210 Commits

Author SHA1 Message Date
5fbbfe9a4c [BugFix] FA2 MLA Accuracy Issue (#18807)
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com>
2025-05-30 08:50:58 -07:00
5873877241 [Bugfix] Mistral tool calling when content is list (#18729)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-05-27 09:05:37 -07:00
696259ca01 [Core] Automatically cast multi-modal input dtype (#18756)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-27 23:45:48 +08:00
6b6d496114 optimize get_kv_cache_torch_dtype (#18531)
Signed-off-by: idellzheng <idellzheng@tencent.com>
2025-05-27 13:08:44 +00:00
aaa4ac1c95 Disable prefix cache by default for benchmark (#18639)
Signed-off-by: cascade812 <cascade812@outlook.com>
2025-05-27 20:06:34 +08:00
06a0338015 [V1][Metrics] Add API for accessing in-memory Prometheus metrics (#17010)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-05-27 09:37:06 +00:00
4318c0559d [CI/Build] Remove imports of built-in re (#18750)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-27 09:19:18 +00:00
a68e293cb9 [Doc] Convert Sphinx directives ( {class}, {meth}, {attr}, ...) to MkDocs format for better documentation linking (#18663)
Signed-off-by: Zerohertz <ohg3417@gmail.com>
2025-05-27 01:44:20 -07:00
6881107948 [BUG FIX] minicpm (#18739)
Signed-off-by: huangyuxiang03 <huangyx0321@gmail.com>
Co-authored-by: huangyuxiang03 <huangyx0321@gmail.com>
2025-05-27 01:04:49 -07:00
e0f0ff87b8 [Build] fix cpu build missing libtbbmalloc.so (#18744)
Signed-off-by: Kebe <mail@kebe7jun.com>
2025-05-27 01:03:56 -07:00
c24b1572ac Minor fix about MooncakeStoreConnector (#18721)
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
2025-05-27 08:02:28 +00:00
4693a3438c [Doc] cleanup deprecated flag for doc (#18715)
Signed-off-by: calvin chen <120380290@qq.com>
2025-05-27 07:12:02 +00:00
bbd9a84dc5 [Hardware][Intel-Gaudi] [CI/Build] Fix multiple containers using the same name in run-hpu-test.sh (#18752)
Signed-off-by: Lukasz Durejko <ldurejko@habana.ai>
2025-05-27 00:10:26 -07:00
a547aeb828 feat(rocm-support): support mamba2 on rocm (#18565)
Signed-off-by: Islam Almersawi <islam.almersawi@openinnovation.ai>
Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai>
2025-05-27 00:07:53 -07:00
fc6d0c290f [Misc] improve docs (#18734)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-27 07:07:01 +00:00
753944fa9b [Doc] Update reproducibility doc and example (#18741)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-27 07:03:13 +00:00
25a817f202 [Doc] Update OOT model docs (#18742)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-27 06:30:31 +00:00
d260f799a9 [FEAT] [ROCm] Upgrade AITER Fused MoE kernels. (#18271)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
2025-05-26 23:14:07 -07:00
b50602d5f0 [Model][Gemma3] Cast image pixel values already on CPU (#18732)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-05-27 05:42:54 +00:00
1f1b1bc03b [V1][Quantization] Add CUDA graph compatible v1 GGUF support (#18646)
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-05-27 04:40:28 +00:00
1f88dbd2bb [Misc] improve web section group title display (#18684)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-27 04:35:16 +00:00
0eebd74842 [Model][Gemma3] Simplify image input validation (#18710)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-05-27 11:13:37 +08:00
27bebcd897 Convert examples to ruff-format (#18400)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-26 16:57:54 +00:00
e7523c2e03 [V1][Sampler] Improve performance of FlashInfer sampling by sampling logits instead of probs (#18608)
2025-05-26 11:49:36 -04:00
a869baca73 [Bugfix] Fix Llama GGUF initialization (#18717)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-26 07:49:22 -07:00
82e2339b06 [Doc] Move examples and further reorganize user guide (#18666)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-26 07:38:04 -07:00
9553fdb41e [Doc] Improve API docs (#18713)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-26 07:33:34 -07:00
243eb9199f [Bugfix]: handle hf-xet CAS error when loading Qwen3 weights in vLLM (#18701)
2025-05-26 07:10:56 -07:00
0665e29998 [Misc] add AutoGen integration (#18712)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-05-26 13:56:18 +00:00
e76be06550 [Hardware][Intel-Gaudi] [CI/Build] Add tensor parallel size = 2 test to HPU CI (#18709)
Signed-off-by: Lukasz Durejko <ldurejko@habana.ai>
2025-05-26 05:26:07 -07:00
0877750029 [CI/Build] Split pooling and generation extended language models tests in CI (#18705)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-05-26 04:00:08 -07:00
6d68030f1c [Model] Add support for YARN in NemotronNAS models (#18427)
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
2025-05-26 10:31:49 +00:00
5a2c76cbe1 [CI] fix dump_input for str type (#18697)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-05-26 18:23:35 +08:00
38b13dfe78 [CI/Build] Replace math.isclose with pytest.approx (#18703)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-26 02:05:17 -07:00
61a45e7a72 [Bugfix] Fix Mistral-format models with sliding window (#18693)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-26 01:44:04 -07:00
65523a0995 [Doc] Fix issue template format (#18699)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-26 00:45:39 -07:00
4b7740a105 [GH] Add issue template for reporting CI failures (#18696)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-26 00:42:04 -07:00
4ea62c0ea0 [CI] add missing argument (#18694)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-05-26 00:22:04 -07:00
561b77a0d6 [Bugfix] Fix the lm_head in gpt_bigcode in lora mode (#6357)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
2025-05-26 14:52:25 +08:00
abd4030d94 refactor: simplify request handler, use positive condition check for handler assignment (#18690)
Signed-off-by: googs1025 <googs1025@gmail.com>
2025-05-26 06:32:28 +00:00
8820821b59 [Misc] Fixed the abnormally high TTFT issue in the PD disaggregation example (#18644)
Signed-off-by: zhaohaidao <zhaohaidao2008@hotmail.com>
Signed-off-by: zhaohaiyuan <zhaohaiyuan@xiaohongshu.com>
Co-authored-by: zhaohaiyuan <zhaohaiyuan@xiaohongshu.com>
2025-05-26 13:51:27 +08:00
fba0642704 [CI/Build][Doc] Update gte-Qwen2-1.5B-instruct usage (#18683)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-05-25 20:27:50 -07:00
6071e989df [Core][Multimodal] Convert PIL Image to array without data copy when hashing (#18682)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-05-25 17:33:35 +00:00
57fd13a707 [Bugfix] Fix profiling dummy data for Pixtral (#18677)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-25 14:05:30 +00:00
3a886bd58c [Misc] small improve (#18680)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-25 06:05:38 -07:00
35be8fad62 [CI/build] fix no regex (#18676)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-25 10:10:51 +00:00
f2faac745d [Bugfix] Fix cpu usage and cache hit stats reporting on cpu environment (#18674)
Signed-off-by: zzzyq <zhangyuqi94@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-05-25 02:36:06 -07:00
279f854519 [doc] improve readability (#18675)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-25 01:40:31 -07:00
624b77a2b3 [doc] fix broken links (#18671)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-25 01:36:33 -07:00
503f8487c2 [Misc] Reduce logs on startup (#18649)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-24 23:03:53 -07:00
44073a7ac3 [BUGFIX] catch subclass first for try...except (#18672)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-05-25 05:34:24 +00:00
63934543a0 Speed up the kernels/quantization/ tests (#18669)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-05-25 05:02:59 +00:00
75f81750f3 [VLM] Initialize video input support for InternVL models (#18499)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-05-25 04:51:25 +00:00
6ab681bcbe [Misc][ModelScope] Change to use runtime VLLM_USE_MODELSCOPE (#18655)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-05-25 04:51:21 +00:00
cebc22f3b6 [Misc]Replace cuda hard code with current_platform in Ray (#14668)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-05-24 20:26:31 -07:00
6c6dcd8611 [MISC] correct signature for LoaderFunction (#18670)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-05-24 20:17:47 -07:00
7891fdf0c6 [V1] Fix _pickle.PicklingError: Can't pickle <class 'transformers_modules.deepseek-ai.DeepSeek-V2-Lite... (#18640)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
2025-05-24 20:07:20 -07:00
6825d9a998 [BugFix][Spec Decode] Improve Prefix Caching Logic in Speculative Decoding (#18668)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-05-24 17:33:46 -07:00
b554ab736e [CI/Build] fix permission denied issue (#18645)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-24 16:09:10 +00:00
9ea7f1abf3 fix(regression): clone from reference items (#18662)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2025-05-24 15:25:20 +00:00
2807271c86 [CI] enforce import regex instead of re (#18665)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2025-05-24 08:04:14 -07:00
b9018a3f9f [BugFix] Fix import error for fused_moe (#18642)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-24 07:53:36 -07:00
4ceafb6299 [MISC] typo fix and clean import (#18664)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-05-24 07:52:09 -07:00
2e6705784f [CI/Build] chmod +x to cleanup_pr_body.sh (#18650)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-24 07:26:45 -07:00
1cb194a018 [Doc] Reorganize user guide (#18661)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-24 07:25:33 -07:00
2cd4d58df4 [Model] use AutoWeightsLoader for gpt2 (#18625)
Signed-off-by: zt2370 <ztang2370@gmail.com>
2025-05-24 13:36:13 +00:00
6d166a8d35 [Doc] Add community links (#18657)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-24 06:06:38 -07:00
ef1dd6870f [Doc] Fix indentation problems in V0 Paged Attention docs (#18659)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-24 06:06:35 -07:00
e77dc4bad8 [MISC][pre-commit] Add pre-commit check for triton import (#17716)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-05-24 20:09:15 +08:00
07458a51ce [Doc] Update README links, mark external links (#18635)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-24 09:57:15 +00:00
c1e4a4052d [V1][Spec Decode] Support multi-layer eagle draft model (#18030)
Signed-off-by: qizixi <qizixi@meta.com>
2025-05-24 09:45:34 +00:00
a859320575 [Model] Add support for Qwen2.5-Omni-7B-AWQ (Qwen2_5OmniForConditionalGeneration) (#18647)
2025-05-24 09:15:36 +00:00
441dc63ac7 [Frontend] improve vllm serve --help display (#18643)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-24 07:53:22 +00:00
d55e446d13 [V1][Spec Decode] Small refactors to improve eagle bookkeeping performance (#18424)
Signed-off-by: qizixi <qizixi@meta.com>
2025-05-24 06:51:22 +00:00
ec82c3e388 FIX MOE issue in AutoRound format (#18586)
Signed-off-by: wenhuach21 <wenhua.cheng@intel.com>
2025-05-23 22:01:40 -07:00
45ab403a1f config.py: Clarify that only local GGUF checkpoints are supported. (#18623)
Signed-off-by: Mathieu Bordere <mathieu@letmetweakit.com>
2025-05-24 08:46:34 +08:00
2b10ba7491 [Bugfix][Nixl] Fix Preemption Bug (#18631)
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
2025-05-23 23:30:16 +00:00
4fc1bf813a [Bugfix] Migrate to REGEX Library to prevent catastrophic backtracking (#18454)
Signed-off-by: Crucifixion-Fxl <xmufxl@gmail.com>
Co-authored-by: Crucifixion-Fxl <xmufxl@gmail.com>
2025-05-23 16:16:26 -07:00
f2036734fb [ModelOpt] Introduce VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE env var to control blockscale tensor allocation (#18160)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-05-23 15:52:20 -07:00
7d9216495c [Doc] Update references to doc files (#18637)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-23 15:49:21 -07:00
0ddf88e16e [CI] Enable test_initialization to run on V1 (#16736)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-05-23 15:09:44 -07:00
1645b60196 Use prebuilt FlashInfer x86_64 PyTorch 2.7 CUDA 12.8 wheel for CI (#18537)
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-05-23 21:17:16 +00:00
2628a69e35 [V1] Support Deepseek MTP (#18435)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Co-authored-by: Rui Qiao <ruisearch42@gmail.com>
2025-05-23 10:26:28 -07:00
371f7e4ca2 [Doc] Fix broken links and unlinked docs, add shortcuts to home sidebar (#18627)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-23 10:22:40 -07:00
15b45ffb9a [Doc] Avoid documenting dynamic / internal modules (#18626)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-23 09:58:02 -07:00
273cb3b4d9 [Doc] Fix top-level API links/docs (#18621)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-23 09:46:56 -07:00
8ddd1cf26a [Doc] fix list formatting (#18624)
Signed-off-by: David Xia <david@davidxia.com>
2025-05-23 09:41:17 -07:00
6550114c9c [v1] Redo "Support multiple KV cache groups in GPU model runner (#17945)" (#18593)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-05-23 09:39:47 -07:00
9520a989df [Docs] Change mkdocs to not use directory urls (#18622)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-05-23 09:33:21 -07:00
3d28ad343f Fix figures in design doc (#18612)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-23 09:09:54 -07:00
6a7988c55b Refactor pplx init logic to make it modular (prepare for deepep) (#18200)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-05-23 23:43:43 +08:00
022d8abe29 [Doc] Use a different color for the announcement (#18616)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-23 08:25:03 -07:00
5221815a00 [Doc] Fix markdown list indentation for MkDocs rendering (#18620)
Signed-off-by: Zerohertz <ohg3417@gmail.com>
2025-05-23 08:23:21 -07:00
1068556b2c [Bugfix][Build/CI] Fixup CUDA compiler version check for CUDA_SUPPORTED_ARCHS (#18579)
2025-05-23 07:43:58 -07:00
2cd1fa4556 [Misc] add Haystack integration (#18601)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-23 06:21:19 -07:00
d4c2919760 Include private attributes in API documentation (#18614)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-23 06:18:31 -07:00
6220f3c6b0 [Bugfix] Fix transformers model impl ignored for mixtral quant (#18602)
Signed-off-by: Tristan Leclercq <tristanleclercq@gmail.com>
2025-05-23 05:54:13 -07:00
52fb23f47e Fix examples with code blocks in docs (#18609)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-23 05:53:44 -07:00
6dd51c7ef1 [CI/Build] Fix V1 flag being set in entrypoints tests (#18598)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-23 05:51:53 -07:00
2edb533af2 Replace {func} with mkdocs style links (#18610)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-23 05:51:38 -07:00
38a95cb4a8 [Doc] Fix indent of contributing to vllm (#18611)
Signed-off-by: Zerohertz <ohg3417@gmail.com>
2025-05-23 05:50:07 -07:00
cd821ea5d2 [CI] fix kv_cache_type argument (#18594)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-05-23 04:49:18 -07:00
7ab056c273 [Hardware][CPU] Update intel_extension_for_pytorch 2.7.0 and move to requirements/cpu.txt (#18542)
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
2025-05-23 04:38:42 -07:00
6526e05111 Add myself as docs code owner (#18605)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-23 04:08:31 -07:00
e493e48524 [V0][Bugfix] Fix parallel sampling performance regression when guided decoding is enabled (#17731)
Signed-off-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
2025-05-23 03:38:23 -07:00
4ce64e2df4 [Bugfix][Model] Fix baichuan model loader for tp (#18597)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-05-23 02:39:05 -07:00
fbb13a2c15 Revert "[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (#18034)" (#18600)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-23 02:18:22 -07:00
a1fe24d961 Migrate docs from Sphinx to MkDocs (#18145)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-23 02:09:53 -07:00
d0bc2f810b [Bugfix] Add half type support in reshape_and_cache_cpu_impl on x86 cpu platform (#18430)
Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
Co-authored-by: Yuqi Zhang <yuqizhang@google.com>
2025-05-23 01:41:37 -07:00
b046cf792d [Feature][V1]: suupports cached_tokens in response usage (#18149)
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-05-23 01:41:03 -07:00
54af915949 [Doc] Update quickstart and install for cu128 using --torch-backend=auto (#18505)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-05-23 08:36:37 +00:00
71ea614d4a [Feature]Add async tensor parallelism using compilation pass (#17882)
Signed-off-by: cascade812 <cascade812@outlook.com>
2025-05-23 01:03:34 -07:00
4c611348a7 [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (#18034)
Signed-off-by: Ronald Xu <ronaldxu@amazon.com>
2025-05-23 00:37:18 -07:00
60cad94b86 [Hardware] correct method signatures for HPU,ROCm,XPU (#18551)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-05-22 22:31:59 -07:00
9c1baa5bc6 [Misc] Replace cuda hard code with current_platform (#16983)
Signed-off-by: shen-shanshan <467638484@qq.com>
2025-05-23 04:38:50 +00:00
4be2255c81 [Bugfix][Benchmarks] Fix a benchmark of deepspeed-mii backend to use api_key (#17291)
Signed-off-by: Teruaki Ishizaki <teruaki.ishizaki@ntt.com>
2025-05-23 12:30:47 +08:00
ed5d408255 [Neuron] Remove bypass on EAGLEConfig and add a test (#18514)
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
2025-05-22 21:26:32 -07:00
583507d130 [Spec Decode] Make EAGLE3 draft token ID mapping optional (#18488)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-05-22 20:17:39 -07:00
e44d8ce8c7 [Bugfix] Set KVTransferConfig.engine_id in post_init (#18576)
Signed-off-by: Linkun Chen <github@lkchen.net>
2025-05-23 02:54:42 +00:00
93ecb8139c [BugFix] Increase TP execute_model timeout (#18558)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-05-23 10:22:11 +08:00
fae453f8ce [Misc] refactor: simplify input validation and num_requests handling in _convert_v1_inputs (#18482)
Signed-off-by: googs1025 <googs1025@gmail.com>
2025-05-23 10:15:32 +08:00
4b0da7b60e Enable hybrid attention models for Transformers backend (#18494)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-23 10:12:08 +08:00
c6b636f9fb [V1][Spec Decoding] Use model_loader.get_model() to load models (#18273)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-05-23 02:05:44 +00:00
04eb88dc80 Re-submit: Fix: Proper RGBA -> RGB conversion for PIL images. (#18569)
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
2025-05-23 01:59:18 +00:00
46791e1b4b [AMD] [P/D] Compute num gpus for ROCm correctly in run_accuracy_test.sh (#18568)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2025-05-22 18:45:35 -07:00
c32e249a23 [Frontend] [Core] Add Tensorizer support for V1, LoRA adapter serialization and deserialization (#17926)
Signed-off-by: Sanger Steel <sangersteel@gmail.com>
2025-05-22 18:44:18 -07:00
c91fe7b1b9 [Frontend][Bug Fix] Update llama4 pythonic jinja template and llama4_pythonic parser (#17917)
Signed-off-by: Kai Wu <kaiwu@meta.com>
2025-05-22 16:44:08 -07:00
a04720bc36 [V1][Spec Decode][Bugfix] Load quantize weights for EAGLE (#18290)
2025-05-22 15:17:33 -07:00
7b9d832c80 [Tool] Add NIXL installation script (#18172)
Signed-off-by: Linkun <github@lkchen.net>
2025-05-22 14:33:16 -07:00
6e588da0f4 [Build/CI] Fix CUDA 11.8 build (#17679)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-05-22 12:13:54 -07:00
f8d2cc5f55 [Compile][Platform] Make PiecewiseBackend pluggable and extendable (#18076)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-05-22 12:11:53 -07:00
721fb9b181 [Platform] Move platform check to right place (#18470)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-22 12:11:28 -07:00
1f3a1200e4 [Bugfix] make test_openai_schema.py pass (#18224)
Signed-off-by: David Xia <david@davidxia.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-22 18:34:06 +00:00
54631f8262 [Misc] Call ndarray.tobytes() directly instead of ndarray.data.tobytes() (#18347)
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
2025-05-22 09:00:13 -07:00
cb506ecb5a [Misc] improve Automatic Prefix Caching example (#18554)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-22 14:50:46 +00:00
93f71673ce [BugFix][CPU] Fix x86 SHM distributed module initialization (#18536)
Signed-off-by: jiang.li <jiang1.li@intel.com>
2025-05-22 07:35:00 -07:00
3f505233fd [Doc] Add stream flag for chat completion example (#18524)
Signed-off-by: calvin chen <120380290@qq.com>
2025-05-22 14:07:10 +00:00
4e04eceb58 [Bugfix] Use random hidden states in dummy sampler run (#18543)
Signed-off-by: Bowen Wang <abmfy@icloud.com>
2025-05-22 06:48:56 -07:00
71075029f2 [Doc] Support --stream arg in openai_completion_client.py script (#18388)
Signed-off-by: googs1025 <googs1025@gmail.com>
2025-05-22 13:20:17 +00:00
ca86a7cf6e [CI/Build] Update bamba test model location (#18544)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-05-22 06:01:07 -07:00
a35a494745 [Bugfix] Add kwargs to RequestOutput __init__ to be forward compatible (#18513)
Signed-off-by: Linkun <github@lkchen.net>
2025-05-22 05:24:43 -07:00
f6037d1907 [Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text (#18526)
Co-authored-by: 松灵 <wpf272043@alibaba-inc.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-22 05:22:53 -07:00
fa72f9a812 Order sequence ids + config update to support specifying custom quantization layers (#18279)
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
Co-authored-by: Tailin Pan <tailinpa@amazon.com>
Co-authored-by: Rishabh Rajesh <rishyraj@amazon.com>
Co-authored-by: Yishan McNabb <yishanm@amazon.com>
Co-authored-by: Patrick Lange <patlange@amazon.com>
Co-authored-by: Maxwell Goldberg <mgld@amazon.com>
Co-authored-by: Aakash Shetty <sheaak@amazon.com>
2025-05-22 02:20:36 -07:00
ebed81fbf5 Update default neuron config for speculation (#18274)
Signed-off-by: Elaine Zhao <elaineyz@amazon.com>
Co-authored-by: Shashwat Srijan <sssrijan@amazon.com>
Co-authored-by: Aakash Shetty <sheaak@amazon.com>
2025-05-22 02:18:55 -07:00
e2d7d31244 [Neuron] Update Dockerfile.neuron to use latest neuron release (2.23) (#18512)
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com>
2025-05-22 02:17:34 -07:00
23b67b37b2 [Doc] Fix invalid JSON in example args (#18527)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-22 07:11:46 +00:00
db5a29ba19 [Bugfix] Fix LoRA test (#18518)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-05-21 21:48:53 -07:00
51797775c3 [Bugfix][Model] Make Olmo2Model weight loading return loaded weights (#18504)
Signed-off-by: Shane A <shanea@allenai.org>
2025-05-21 21:17:03 -07:00
cf5984b2fe [BugFix][DP] Send DP wave completion only from dp_rank==0 (#18502)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: kourosh hakhamaneshi <kourosh@anyscale.com>
2025-05-21 20:25:25 -07:00
d022115cc6 [Bugfix] Inconsistent token calculation compared to HF in llava family (#18479)
Signed-off-by: jaycha <jaycha@ncsoft.com>
2025-05-21 20:21:47 -07:00
acb54ca8e1 Intialize io_thread_pool attribute in the beginning. (#18331)
Signed-off-by: rabi <ramishra@redhat.com>
2025-05-21 20:21:14 -07:00
6e0fd34d3c [CI] Fix race condition with StatelessProcessGroup.barrier (#18506)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-05-21 20:19:13 -07:00
176d62e4ea [MISC] update project urls in pyproject.toml (#18519)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-05-21 20:17:34 -07:00
20bd6f4d2e [FalconH1] Fix output dtype in RMSNorm fallback path for Falcon-H1 (e.g. 0.5B) (#18500)
Signed-off-by: dhia.rhaiem <dhia.rhaiem@tii.ae>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: Ilyas Chahed <ilyas.chahed@tii.ae>
Co-authored-by: Jingwei Zuo <jingwei.zuo@tii.ae>
2025-05-21 19:23:59 -07:00
1f079540db [Bugfix] Consistent ascii handling in tool parsers (#17704)
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com>
2025-05-21 20:41:23 +00:00
94d8ec8d2b [FEAT][ROCm] Upgrade AITER MLA v1 backend (#18338)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-05-21 10:34:28 -07:00
bb0a311213 Revert "[v1] Support multiple KV cache groups in GPU model runner (#17945) (#18459)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-05-21 10:25:23 -07:00
dd5fa7e04f [ROCm][Kernel][V1] Enable AMD Radeon GPU Custom Paged Attention on v1 (#17004)
Signed-off-by: Hosang Yoon <hosang.yoon@amd.com>
2025-05-21 08:35:00 -07:00
2b16104557 [Misc] Update deprecation message for --enable-reasoning (#18404)
2025-05-21 07:33:11 -07:00
371376f996 [Build] fix Dockerfile shell (#18402)
2025-05-21 07:32:06 -07:00
c6c10ca920 [Bugfix] Reduce moe_sum test size to avoid OOM (#18484)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-05-21 06:46:39 -07:00
c154d89306 [Doc] fix arg docstring in linear layers (#18410)
Signed-off-by: giantcroc <1204449533@qq.com>
2025-05-21 06:45:57 -07:00
eca18691d2 [MODEL] FalconH1 (#18406)
Signed-off-by: dhia.rhaiem <dhia.rhaiem@tii.ae>
Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: Ilyas Chahed <ilyas.chahed@tii.ae>
Co-authored-by: Jingwei Zuo <jingwei.zuo@tii.ae>
2025-05-21 04:59:06 -07:00
61acfc45bc [Bugfix][Failing Test] Fix test_events.py (#18460)
Signed-off-by: rabi <ramishra@redhat.com>
2025-05-21 04:57:28 -07:00
107f5fc4cb [Misc] refactor disaggregated-prefill-v1 example (#18474)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-21 11:10:14 +00:00
907f935de9 [V1] Fix general plugins not loaded in engine for multiproc (#18326)
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
2025-05-21 01:21:49 -07:00
5d7f545204 [Frontend] deprecate --device arg (#18399)
Signed-off-by: Kebe <mail@kebe7jun.com>
2025-05-21 01:21:17 -07:00
cd8dfc6dfc [Misc] MultiConnector._connectors type (#18423)
Signed-off-by: nicklucche <nlucches@redhat.com>
2025-05-20 22:48:43 -07:00
d06dd72ba9 [Bugfix][Failing Test] Fix nixl connector test when promt size < block size (#18429)
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
2025-05-20 22:41:44 -07:00
ad0012a0ac Revert "[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text (#18407)" (#18456)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-20 22:39:22 -07:00
92247c522e [Bug] Fix moe_sum signature (#18440)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-05-20 22:37:08 -07:00
0c15c2e486 [Bugfix] config.head_dim is now explicitly set to None (#18432)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-05-20 21:04:33 -07:00
3b17ea26e4 [TPU] Re-enable the Pallas MoE kernel (#18025)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2025-05-20 19:52:27 -07:00
23baa2180b fix:Build torch wheel inline rather than picking from nightly (#18351)
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
2025-05-20 22:22:24 +00:00
980a172474 [Kernel] update comment for KV shape in unified triton attn (#18099)
Signed-off-by: haochengxia <xhc_1007@163.com>
2025-05-20 11:19:34 -07:00
e1f5a71ed7 [Model] use AutoWeightsLoader for bloom (#18300)
Signed-off-by: calvin chen <120380290@qq.com>
2025-05-20 09:40:05 -07:00
f4a8a37465 [Minor] Rename quantization nvfp4 to modelopt_fp4 (#18356)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-05-20 09:08:37 -07:00
8f55962a7f [Misc] refactor prompt embedding examples (#18405)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-20 15:26:12 +00:00
be48360c1f [Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text (#18407)
Co-authored-by: 松灵 <wpf272043@alibaba-inc.com>
2025-05-20 06:59:48 -07:00
86847700d7 [CI] Add mteb testing to test the accuracy of the embedding model (#17175)
2025-05-20 06:51:12 -07:00
d6c86d09ae Update cpu.txt (#18398)
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
2025-05-20 10:53:23 +00:00
6b35cb10a0 [Misc] Add LoRA code owner (#18387)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-05-20 03:27:30 -07:00
1b1e8e05ff [doc] update env variable export (#18391)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-20 08:53:27 +00:00
bca55b556f [Bugfix] fix adding bias twice in ipex GPTQ quantization (#18363)
Signed-off-by: rand-fly <randfly@outlook.com>
2025-05-20 00:54:33 -07:00
d981396778 [release] Change dockerhub username for TPU release (#18389)
2025-05-19 23:49:23 -07:00
9609327fa4 [Core] [Bugfix]: tensor parallel with prompt embeds (#18171)
Signed-off-by: Nan2018 <nan@protopia.ai>
Co-authored-by: Andrew Sansom <andrew@protopia.ai>
2025-05-19 20:21:27 -07:00
f07a673eb2 [Misc] Allow AutoWeightsLoader to skip loading weights with specific substr in name (#18358)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-05-19 20:20:12 -07:00
d565e0976f [neuron] fix authorization issue (#18364)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2025-05-19 23:30:32 +00:00
258bf621d5 fix CUDA_check redefinition in #17918 (#18287)
Signed-off-by: Lucia Fang <fanglu@fb.com>
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com>
2025-05-19 13:42:35 -07:00
dc1440cf9f Neuron up mistral (#18222)
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com>
2025-05-19 09:54:47 -07:00
8171221834 [Misc] Fix typo (#18330)
2025-05-19 09:51:01 -07:00
7937c2fd52 Add files via uploadAdd fused MoE kernel tuning configs (fp8_w8a8) for DeepSeek V3/R1 on a single-node 8x NVIDIA H20 96GB setup (#18337)
2025-05-19 09:49:57 -07:00
e2ee1e8e9e [Feature]Add support for models quantized with AutoRound (#17850)
Signed-off-by: wenhuach21 <wenhua.cheng@intel.com>
2025-05-19 09:38:53 -07:00
20d8ce81eb [Frontend] add --quick option for vllm chat/complete (#18297)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-19 09:36:13 -07:00
84ab4feb7e [Doc] Fix typo (#18355)
2025-05-19 16:05:16 +00:00
6781af5608 [Quantization] Pool model support bitsandbytes (#18087)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-05-19 09:03:43 -07:00
1b15df2546 [BugFix] Fix handling of num_computed_tokens with connector (#18232)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com>
2025-05-19 09:03:25 -07:00
43b5f61dce [Doc] Move input-related docs to Features (#18353)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-05-19 15:08:39 +00:00
c5bb0ebdc6 [Doc] Fix prompt embedding examples (#18350)
Signed-off-by: wangli <wangli858794774@gmail.com>
2025-05-19 06:48:16 -07:00
d637b96099 [BugFix] [Vul] Add missing usedforsecurity=False in MD5 hashing to enable FIPS (#18319)
Signed-off-by: cascade812 <cascade812@outlook.com>
Signed-off-by: shaoyuyoung <shaoyuyoung@gmail.com>
Co-authored-by: cascade <cascade812@outlook.com>
2025-05-19 01:31:23 -07:00
275c5daeb0 fix: Add type specifications for CLI arguments in tensorizer options (#18314)
2025-05-18 23:42:17 -07:00
47fda6d089 [Build] Supports CUDA 12.6 and 11.8 after Blackwell Update (#18316)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-05-18 23:19:33 -07:00
27d0952600 [Misc] extract parser.parse_args() (#18323)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-19 04:06:26 +00:00
221cfc2fea Feature/vllm/input embedding completion api (#17590)
Signed-off-by: Andrew Sansom <andrew@protopia.ai>
Signed-off-by: Nan2018 <nan@protopia.ai>
Co-authored-by: 临景 <linjing.yx@alibaba-inc.com>
Co-authored-by: Bryce1010 <bryceyx@gmail.com>
Co-authored-by: Andrew Sansom <andrew@protopia.ai>
Co-authored-by: Andrew Sansom <qthequartermasterman@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-05-18 20:18:05 -07:00
9da1095daf [Spec Decode][V0] Fix spec decode correctness test in V0 eagle/medusa (#18175)
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
2025-05-18 19:49:46 -07:00
d1211f8794 [Doc] Add doc to explain the usage of Qwen3 thinking (#18291)
Signed-off-by: WangErXiao <863579016@qq.com>
2025-05-18 23:04:07 +00:00
b6a6e7a529 [Misc] add litellm integration (#18320)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-18 15:32:30 +00:00
4fb349f66a Fix copy-paste error in phi4mm image processing (#18315)
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
2025-05-18 07:00:12 -07:00
908733aca7 [Model] Use sigmoid for single-label classification (#18313)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-05-18 07:00:09 -07:00
1a8f68bb90 [doc] update reasoning doc (#18306)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-05-18 06:59:14 -07:00
691 changed files with 19615 additions and 13467 deletions

View File

@ -6,11 +6,6 @@
[tool.ruff]
line-length = 88
exclude = [
# External file, leaving license intact
"examples/other/fp8/quantizer/quantize.py",
"vllm/vllm_flash_attn/flash_attn_interface.pyi"
]
[tool.ruff.lint.per-file-ignores]
"vllm/third_party/**" = ["ALL"]

View File

@ -14,7 +14,7 @@ steps:
agents:
queue: cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.6.3 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.6.3 --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0+PTX' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/scripts/upload-wheels.sh"
@ -31,7 +31,7 @@ steps:
agents:
queue: cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0+PTX' --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/scripts/upload-wheels.sh"
@ -64,7 +64,7 @@ steps:
- "docker push vllm/vllm-tpu:$BUILDKITE_COMMIT"
plugins:
- docker-login#v3.0.0:
username: vllm
username: vllmbot
password-env: DOCKERHUB_TOKEN
env:
DOCKER_BUILDKIT: "1"

View File

@ -10,15 +10,17 @@ docker build -t hpu-test-env -f docker/Dockerfile.hpu .
# Setup cleanup
# certain versions of HPU software stack have a bug that can
# override the exit code of the script, so we need to use
# separate remove_docker_container and remove_docker_container_and_exit
# separate remove_docker_containers and remove_docker_containers_and_exit
# functions, while other platforms only need one remove_docker_container
# function.
EXITCODE=1
remove_docker_container() { docker rm -f hpu-test || true; }
remove_docker_container_and_exit() { remove_docker_container; exit $EXITCODE; }
trap remove_docker_container_and_exit EXIT
remove_docker_container
remove_docker_containers() { docker rm -f hpu-test || true; docker rm -f hpu-test-tp2 || true; }
remove_docker_containers_and_exit() { remove_docker_containers; exit $EXITCODE; }
trap remove_docker_containers_and_exit EXIT
remove_docker_containers
# Run the image and launch offline inference
docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m
docker run --runtime=habana --name=hpu-test-tp2 --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --tensor-parallel-size 2
EXITCODE=$?

View File

@ -11,13 +11,14 @@ container_name="neuron_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)"
HF_CACHE="$(realpath ~)/huggingface"
mkdir -p "${HF_CACHE}"
HF_MOUNT="/root/.cache/huggingface"
HF_TOKEN=$(aws secretsmanager get-secret-value --secret-id "ci/vllm-neuron/hf-token" --region us-west-2 --query 'SecretString' --output text | jq -r .VLLM_NEURON_CI_HF_TOKEN)
NEURON_COMPILE_CACHE_URL="$(realpath ~)/neuron_compile_cache"
mkdir -p "${NEURON_COMPILE_CACHE_URL}"
NEURON_COMPILE_CACHE_MOUNT="/root/.cache/neuron_compile_cache"
# Try building the docker image
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com
aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws
# prune old image and containers to save disk space, and only once a day
# by using a timestamp file in tmp.
@ -47,8 +48,16 @@ trap remove_docker_container EXIT
docker run --rm -it --device=/dev/neuron0 --network bridge \
-v "${HF_CACHE}:${HF_MOUNT}" \
-e "HF_HOME=${HF_MOUNT}" \
-e "HF_TOKEN=${HF_TOKEN}" \
-v "${NEURON_COMPILE_CACHE_URL}:${NEURON_COMPILE_CACHE_MOUNT}" \
-e "NEURON_COMPILE_CACHE_URL=${NEURON_COMPILE_CACHE_MOUNT}" \
--name "${container_name}" \
${image_name} \
/bin/bash -c "python3 /workspace/vllm/examples/offline_inference/neuron.py && python3 -m pytest /workspace/vllm/tests/neuron/1_core/ -v --capture=tee-sys && python3 -m pytest /workspace/vllm/tests/neuron/2_core/ -v --capture=tee-sys"
/bin/bash -c "
python3 /workspace/vllm/examples/offline_inference/neuron.py;
python3 -m pytest /workspace/vllm/tests/neuron/1_core/ -v --capture=tee-sys;
for f in /workspace/vllm/tests/neuron/2_core/*.py; do
echo 'Running test file: '$f;
python3 -m pytest \$f -v --capture=tee-sys;
done
"

View File

@ -33,14 +33,13 @@ steps:
- label: Documentation Build # 2min
mirror_hardwares: [amdexperimental]
working_dir: "/vllm-workspace/test_docs/docs"
working_dir: "/vllm-workspace/test_docs"
fast_check: true
no_gpu: True
commands:
- pip install -r ../../requirements/docs.txt
- SPHINXOPTS=\"-W\" make html
# Check API reference (if it fails, you may have missing mock imports)
- grep \"sig sig-object py\" build/html/api/vllm/vllm.sampling_params.html
- pip install -r ../requirements/docs.txt
# TODO: add `--strict` once warnings in docstrings are fixed
- mkdocs build
- label: Async Engine, Inputs, Utils, Worker Test # 24min
mirror_hardwares: [amdexperimental]
@ -59,6 +58,7 @@ steps:
- pytest -v -s async_engine # AsyncLLMEngine
- NUM_SCHEDULER_STEPS=4 pytest -v -s async_engine/test_async_llm_engine.py
- pytest -v -s test_inputs.py
- pytest -v -s test_outputs.py
- pytest -v -s multimodal
- pytest -v -s test_utils.py # Utils
- pytest -v -s worker # Worker
@ -125,7 +125,7 @@ steps:
- pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate_multiple_loras.py # it needs a clean process
- VLLM_USE_V1=0 pytest -v -s entrypoints/llm/test_guided_generate.py # it needs a clean process
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/correctness/ --ignore=entrypoints/openai/test_openai_schema.py
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_tensorizer_entrypoint.py --ignore=entrypoints/openai/correctness/
- pytest -v -s entrypoints/test_chat_utils.py
- VLLM_USE_V1=0 pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests
@ -138,6 +138,7 @@ steps:
- vllm/core/
- tests/distributed/test_utils
- tests/distributed/test_pynccl
- tests/distributed/test_events
- tests/spec_decode/e2e/test_integration_dist_tp4
- tests/compile/test_basic_correctness
- examples/offline_inference/rlhf.py
@ -156,6 +157,7 @@ steps:
- pytest -v -s distributed/test_utils.py
- pytest -v -s compile/test_basic_correctness.py
- pytest -v -s distributed/test_pynccl.py
- pytest -v -s distributed/test_events.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py
# TODO: create a dedicated test section for multi-GPU example tests
# when we have multiple distributed example tests
@ -220,6 +222,7 @@ steps:
- pytest -v -s v1/test_serial_utils.py
- pytest -v -s v1/test_utils.py
- pytest -v -s v1/test_oracle.py
- pytest -v -s v1/test_metrics_reader.py
# TODO: accuracy does not match, whether setting
# VLLM_USE_FLASHINFER_SAMPLER or not on H100.
- pytest -v -s v1/e2e
@ -244,7 +247,7 @@ steps:
- python3 offline_inference/vision_language.py --seed 0
- python3 offline_inference/vision_language_embedding.py --seed 0
- python3 offline_inference/vision_language_multi_image.py --seed 0
- VLLM_USE_V1=0 python3 other/tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 other/tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- VLLM_USE_V1=0 python3 others/tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 others/tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- python3 offline_inference/encoder_decoder.py
- python3 offline_inference/encoder_decoder_multimodal.py --model-type whisper --seed 0
- python3 offline_inference/basic/classify.py
@ -312,6 +315,7 @@ steps:
- pytest -v -s compile/test_fusion.py
- pytest -v -s compile/test_silu_mul_quant_fusion.py
- pytest -v -s compile/test_sequence_parallelism.py
- pytest -v -s compile/test_async_tp.py
- label: PyTorch Fullgraph Smoke Test # 9min
mirror_hardwares: [amdexperimental, amdproduction]
@ -386,10 +390,12 @@ steps:
source_file_dependencies:
- vllm/model_executor/model_loader
- tests/tensorizer_loader
- tests/entrypoints/openai/test_tensorizer_entrypoint.py
commands:
- apt-get update && apt-get install -y curl libsodium23
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s tensorizer_loader
- pytest -v -s entrypoints/openai/test_tensorizer_entrypoint.py
- label: Benchmarks # 9min
mirror_hardwares: [amdexperimental, amdproduction]
@ -467,10 +473,7 @@ steps:
- pytest -v -s models/test_registry.py
- pytest -v -s models/test_utils.py
- pytest -v -s models/test_vision.py
# V1 Test: https://github.com/vllm-project/vllm/issues/14531
- VLLM_USE_V1=0 pytest -v -s models/test_initialization.py -k 'not llama4 and not plamo2'
- VLLM_USE_V1=0 pytest -v -s models/test_initialization.py -k 'llama4'
- VLLM_USE_V1=0 pytest -v -s models/test_initialization.py -k 'plamo2'
- pytest -v -s models/test_initialization.py
- label: Language Models Test (Standard)
mirror_hardwares: [amdexperimental]
@ -484,16 +487,25 @@ steps:
- pip freeze | grep -E 'torch'
- pytest -v -s models/language -m core_model
- label: Language Models Test (Extended)
- label: Language Models Test (Extended Generation) # 1hr20min
mirror_hardwares: [amdexperimental]
optional: true
source_file_dependencies:
- vllm/
- tests/models/language
- tests/models/language/generation
commands:
# Install causal-conv1d for plamo2 models here, as it is not compatible with pip-compile.
- pip install 'git+https://github.com/Dao-AILab/causal-conv1d@v1.5.0.post8'
- pytest -v -s models/language -m 'not core_model'
- pytest -v -s models/language/generation -m 'not core_model'
- label: Language Models Test (Extended Pooling) # 36min
mirror_hardwares: [amdexperimental]
optional: true
source_file_dependencies:
- vllm/
- tests/models/language/pooling
commands:
- pytest -v -s models/language/pooling -m 'not core_model'
- label: Multi-Modal Models Test (Standard)
mirror_hardwares: [amdexperimental]

.github/CODEOWNERS vendored
View File

@ -13,6 +13,7 @@
/vllm/model_executor/guided_decoding @mgoin @russellb
/vllm/multimodal @DarkLight1337 @ywang96
/vllm/vllm_flash_attn @LucasWilkinson
/vllm/lora @jeejeelee
CMakeLists.txt @tlrmchlsmth
# vLLM V1
@ -40,3 +41,8 @@ CMakeLists.txt @tlrmchlsmth
/tests/v1/entrypoints/llm/test_struct_output_generate.py @mgoin @russellb
/tests/v1/structured_output @mgoin @russellb
/tests/weight_loading @mgoin @youkaichao
/tests/lora @jeejeelee
# Docs
/docs @hmellor
mkdocs.yaml @hmellor

View File

@ -81,14 +81,14 @@ body:
required: true
- type: markdown
attributes:
value: >
⚠️ Please separate bugs of `transformers` implementation or usage from bugs of `vllm`. If you think anything is wrong with the models' output:
value: |
⚠️ Please separate bugs of `transformers` implementation or usage from bugs of `vllm`. If you think anything is wrong with the model's output:
- Try the counterpart of `transformers` first. If the error appears, please go to [their issues](https://github.com/huggingface/transformers/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc).
- If the error only appears in vllm, please provide the detailed script showing how you run `transformers` and `vllm`, and highlight the difference and what you expect.
Thanks for contributing 🎉!
Thanks for reporting 🙏!
- type: checkboxes
id: askllm
attributes:

View File

@ -0,0 +1,69 @@
name: 🧪 CI failure report
description: Report a failing test.
title: "[CI Failure]: "
labels: ["ci-failure"]
body:
- type: markdown
attributes:
value: >
#### Include the name of the failing Buildkite step and test file in the title.
- type: input
attributes:
label: Name of failing test
description: |
Paste in the fully-qualified name of the failing test from the logs.
placeholder: |
`path/to/test_file.py::test_name[params]`
validations:
required: true
- type: checkboxes
attributes:
label: Basic information
description: Select all items that apply to the failing test.
options:
- label: Flaky test
- label: Can reproduce locally
- label: Caused by external libraries (e.g. bug in `transformers`)
- type: textarea
attributes:
label: 🧪 Describe the failing test
description: |
Please provide a clear and concise description of the failing test.
placeholder: |
A clear and concise description of the failing test.
```
The error message you got, with the full traceback and the error logs with [dump_input.py:##] if present.
```
validations:
required: true
- type: textarea
attributes:
label: 📝 History of failing test
description: |
Since when did the test start to fail?
You can look up its history via [Buildkite Test Suites](https://buildkite.com/organizations/vllm/analytics/suites/ci-1/tests?branch=main).
If you have time, identify the PR that caused the test to fail on main. You can do so via the following methods:
- Use Buildkite Test Suites to find the PR where the test failure first occurred, and reproduce the failure locally.
- Run [`git bisect`](https://git-scm.com/docs/git-bisect) locally.
- Manually unblock Buildkite steps for suspected PRs on main and check the results. (authorized users only)
placeholder: |
Approximate timeline and/or problematic PRs
A link to the Buildkite analytics of the failing test (if available)
validations:
required: true
- type: textarea
attributes:
label: CC List.
description: >
The list of people you want to CC. Usually, this includes those who worked on the PR that failed the test.
- type: markdown
attributes:
value: >
Thanks for reporting 🙏!

View File

@ -3,4 +3,4 @@ FILL IN THE PR DESCRIPTION HERE
FIX #xxxx (*link existing issues this PR will resolve*)
<!--- pyml disable-next-line no-emphasis-as-heading -->
**BEFORE SUBMITTING, PLEASE READ <https://docs.vllm.ai/en/latest/contributing/overview.html>** (anything written below this line will be removed by GitHub Actions)
**BEFORE SUBMITTING, PLEASE READ <https://docs.vllm.ai/en/latest/contributing>** (anything written below this line will be removed by GitHub Actions)

.github/mergify.yml vendored
View File

@ -58,7 +58,7 @@ pull_request_rules:
- files~=^benchmarks/structured_schemas/
- files=benchmarks/benchmark_serving_structured_output.py
- files=benchmarks/run_structured_output_benchmark.sh
- files=docs/source/features/structured_outputs.md
- files=docs/features/structured_outputs.md
- files=examples/offline_inference/structured_outputs.py
- files=examples/online_serving/openai_chat_completion_structured_outputs.py
- files=examples/online_serving/openai_chat_completion_structured_outputs_with_reasoning.py
@ -135,9 +135,7 @@ pull_request_rules:
- files~=^tests/entrypoints/openai/tool_parsers/
- files=tests/entrypoints/openai/test_chat_with_tool_reasoning.py
- files~=^vllm/entrypoints/openai/tool_parsers/
- files=docs/source/features/tool_calling.md
- files=docs/source/getting_started/examples/openai_chat_completion_client_with_tools.md
- files=docs/source/getting_started/examples/chat_with_tools.md
- files=docs/features/tool_calling.md
- files~=^examples/tool_chat_*
- files=examples/offline_inference/chat_with_tools.py
- files=examples/online_serving/openai_chat_completion_client_with_tools_required.py

View File

@ -26,7 +26,7 @@ sed -i '/\*\*BEFORE SUBMITTING, PLEASE READ.*\*\*/,$d' "${NEW}"
# Remove HTML <details> section that includes <summary> text of "PR Checklist (Click to Expand)"
python3 - <<EOF
import re
import regex as re
with open("${NEW}", "r") as file:
content = file.read()

View File

@ -20,7 +20,12 @@ jobs:
with:
python-version: '3.12'
- name: Install Python dependencies
run: |
python3 -m pip install --upgrade pip
python3 -m pip install regex
- name: Update PR description
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: .github/scripts/cleanup_pr_body.sh "${{ github.event.number }}"
run: bash .github/scripts/cleanup_pr_body.sh "${{ github.event.number }}"

.gitignore vendored
View File

@ -77,11 +77,6 @@ instance/
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
docs/source/getting_started/examples/
docs/source/api/vllm
# PyBuilder
.pybuilder/
target/
@ -151,6 +146,7 @@ venv.bak/
# mkdocs documentation
/site
docs/examples
# mypy
.mypy_cache/

View File

@ -17,7 +17,7 @@ repos:
- id: ruff
args: [--output-format, github, --fix]
- id: ruff-format
files: ^(.buildkite|benchmarks)/.*
files: ^(.buildkite|benchmarks|examples)/.*
- repo: https://github.com/codespell-project/codespell
rev: v2.4.1
hooks:
@ -39,6 +39,7 @@ repos:
rev: v0.9.29
hooks:
- id: pymarkdown
exclude: '.*\.inc\.md'
args: [fix]
- repo: https://github.com/rhysd/actionlint
rev: v1.7.7
@ -127,6 +128,21 @@ repos:
name: Update Dockerfile dependency graph
entry: tools/update-dockerfile-graph.sh
language: script
- id: enforce-import-regex-instead-of-re
name: Enforce import regex as re
entry: python tools/enforce_regex_import.py
language: python
types: [python]
pass_filenames: false
additional_dependencies: [regex]
# Forbid importing triton directly
- id: forbid-direct-triton-import
name: "Forbid direct 'import triton'"
entry: python tools/check_triton_import.py
language: python
types: [python]
pass_filenames: false
additional_dependencies: [regex]
# Keep `suggestion` last
- id: suggestion
name: Suggestion
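
The `enforce-import-regex-instead-of-re` hook above points at `tools/enforce_regex_import.py`, which is not shown in this diff. As a rough illustration of the kind of check such a hook could perform, here is a minimal sketch that uses the standard `ast` module to flag direct imports of the built-in `re` module. The function name `find_builtin_re_imports` is hypothetical and this is not the repository's actual script.

```python
import ast


def find_builtin_re_imports(source: str) -> list[int]:
    """Return the line numbers where the built-in `re` module is imported.

    Hypothetical helper for illustration only; the real hook script may
    work differently.
    """
    offending = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import regex as re` has alias.name == "regex", so it passes.
            if any(alias.name == "re" for alias in node.names):
                offending.append(node.lineno)
        elif isinstance(node, ast.ImportFrom):
            # Flag absolute `from re import ...` statements as well.
            if node.module == "re" and node.level == 0:
                offending.append(node.lineno)
    return sorted(offending)
```

A wrapper could call this on each tracked Python file and exit non-zero when the returned list is non-empty, which is how a pre-commit hook signals failure.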

View File

@ -8,12 +8,8 @@ build:
tools:
python: "3.12"
sphinx:
configuration: docs/source/conf.py
fail_on_warning: true
# If using Sphinx, optionally build your docs in additional formats such as PDF
formats: []
mkdocs:
configuration: mkdocs.yaml
# Optionally declare the Python requirements required to build your docs
python:

View File

@ -29,9 +29,6 @@ set(ignoreMe "${VLLM_PYTHON_PATH}")
#
set(PYTHON_SUPPORTED_VERSIONS "3.9" "3.10" "3.11" "3.12")
# Supported NVIDIA architectures.
set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0")
# Supported AMD GPU architectures.
set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1200;gfx1201")
@ -79,6 +76,15 @@ endif()
#
find_package(Torch REQUIRED)
# Supported NVIDIA architectures.
# This check must happen after find_package(Torch) because that's when CMAKE_CUDA_COMPILER_VERSION gets defined
if(DEFINED CMAKE_CUDA_COMPILER_VERSION AND
CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0")
else()
set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0")
endif()
#
# Forward the non-CUDA device extensions to external CMake scripts.
#
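
The CMake hunk above only enables the 10.0, 10.1 and 12.0 architectures when the CUDA compiler version is defined and at least 12.8. The same decision can be sketched in Python as follows, assuming a dotted numeric version string; the names `supported_cuda_archs`, `BASE_ARCHS` and `NEWER_ARCHS` are illustrative and do not appear in the build files.

```python
from typing import Optional

# Architecture lists copied from the two branches of the CMake condition.
BASE_ARCHS = ["7.0", "7.2", "7.5", "8.0", "8.6", "8.7", "8.9", "9.0"]
NEWER_ARCHS = ["10.0", "10.1", "12.0"]


def supported_cuda_archs(compiler_version: Optional[str]) -> list[str]:
    """Mirror the version gate: newer archs only for CUDA >= 12.8."""
    if compiler_version is None:
        # Corresponds to CMAKE_CUDA_COMPILER_VERSION not being defined.
        return list(BASE_ARCHS)
    # Compare component-wise, so "12.10" counts as newer than "12.8".
    parts = tuple(int(p) for p in compiler_version.split("."))
    if parts >= (12, 8):
        return BASE_ARCHS + NEWER_ARCHS
    return list(BASE_ARCHS)
```

Comparing integer tuples rather than raw strings avoids the pitfall where `"12.10" < "12.8"` lexicographically.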
@ -226,6 +232,8 @@ endif()
#
set(VLLM_EXT_SRC
"csrc/mamba/mamba_ssm/selective_scan_fwd.cu"
"csrc/mamba/causal_conv1d/causal_conv1d.cu"
"csrc/cache_kernels.cu"
"csrc/attention/paged_attention_v1.cu"
"csrc/attention/paged_attention_v2.cu"
@ -281,8 +289,6 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
FetchContent_MakeAvailable(cutlass)
list(APPEND VLLM_EXT_SRC
"csrc/mamba/mamba_ssm/selective_scan_fwd.cu"
"csrc/mamba/causal_conv1d/causal_conv1d.cu"
"csrc/quantization/aqlm/gemm_kernels.cu"
"csrc/quantization/awq/gemm_kernels.cu"
"csrc/permute_cols.cu"

View File

@ -1,3 +1,3 @@
# Contributing to vLLM
You may find information about contributing to vLLM on [docs.vllm.ai](https://docs.vllm.ai/en/latest/contributing/overview.html).
You may find information about contributing to vLLM on [docs.vllm.ai](https://docs.vllm.ai/en/latest/contributing).

View File

@ -1,7 +1,7 @@
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-dark.png">
<img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-light.png" width=55%>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png">
<img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png" width=55%>
</picture>
</p>
@ -58,7 +58,7 @@ vLLM is fast with:
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8.
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill
@ -100,14 +100,14 @@ Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
## Contributing
We welcome and value any contributions and collaborations.
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/stable/contributing/overview.html) for how to get involved.
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.
## Sponsors
vLLM is a community project. Our compute resources for development and testing are supported by the following organizations. Thank you for your support!
<!-- Note: Please sort them in alphabetical order. -->
<!-- Note: Please keep these consistent with docs/source/community/sponsors.md -->
<!-- Note: Please keep these consistent with docs/community/sponsors.md -->
Cash Donations:
- a16z
- Dropbox

View File

@ -146,10 +146,9 @@ python3 vllm/benchmarks/benchmark_serving.py \
``` bash
VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-model "[ngram]" \
--ngram_prompt_lookup_min 2 \
--ngram-prompt-lookup-max 5 \
--num_speculative_tokens 5
--speculative_config '{"model": "[ngram]", "num_speculative_tokens": 5}'
```
``` bash
@ -274,10 +273,9 @@ python3 vllm/benchmarks/benchmark_throughput.py \
--output-len=100 \
--num-prompts=2048 \
--async-engine \
--speculative-model="[ngram]" \
--ngram_prompt_lookup_min=2 \
--ngram-prompt-lookup-max=5 \
--num_speculative_tokens=5
--speculative_config '{"model": "[ngram]", "num_speculative_tokens": 5}'
```
```

View File

@ -194,6 +194,11 @@ async def async_request_deepspeed_mii(
request_func_input: RequestFuncInput,
pbar: Optional[tqdm] = None,
) -> RequestFuncOutput:
api_url = request_func_input.api_url
assert api_url.endswith(("completions", "profile")), (
"OpenAI Completions API URL must end with 'completions' or 'profile'."
)
async with aiohttp.ClientSession(
trust_env=True, timeout=AIOHTTP_TIMEOUT
) as session:
@ -204,6 +209,8 @@ async def async_request_deepspeed_mii(
"temperature": 0.01, # deepspeed-mii does not accept 0.0 temp.
"top_p": 1.0,
}
headers = {"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"}
output = RequestFuncOutput()
output.prompt_len = request_func_input.prompt_len
@ -215,7 +222,7 @@ async def async_request_deepspeed_mii(
st = time.perf_counter()
try:
async with session.post(
url=request_func_input.api_url, json=payload
url=api_url, json=payload, headers=headers
) as response:
if response.status == 200:
parsed_resp = await response.json()
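
The hunks above add two things to `async_request_deepspeed_mii`: a check that the URL ends with `completions` or `profile`, and a bearer-token header read from `OPENAI_API_KEY`. A minimal standalone sketch of that preparation step is shown below. The helper name `build_request_args` is hypothetical, and it raises `ValueError` where the diff uses `assert`.

```python
import os


def build_request_args(api_url: str) -> dict:
    """Validate the endpoint and build the auth header (illustrative helper)."""
    # str.endswith accepts a tuple, so both suffixes are checked in one call.
    if not api_url.endswith(("completions", "profile")):
        raise ValueError(
            "OpenAI Completions API URL must end with 'completions' or 'profile'."
        )
    # As in the diff, an unset variable yields the literal string "Bearer None".
    headers = {"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"}
    return {"url": api_url, "headers": headers}
```

The returned dictionary could then be unpacked into a call such as `session.post(..., json=payload)`.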

View File

@ -35,6 +35,7 @@ from transformers import PreTrainedTokenizerBase
from vllm.lora.request import LoRARequest
from vllm.lora.utils import get_adapter_absolute_path
from vllm.multimodal import MultiModalDataDict
from vllm.multimodal.image import convert_image_mode
from vllm.transformers_utils.tokenizer import AnyTokenizer, get_lora_tokenizer
logger = logging.getLogger(__name__)
@ -257,7 +258,7 @@ def process_image(image: Any) -> Mapping[str, Any]:
if isinstance(image, dict) and "bytes" in image:
image = Image.open(BytesIO(image["bytes"]))
if isinstance(image, Image.Image):
image = image.convert("RGB")
image = convert_image_mode(image, "RGB")
with io.BytesIO() as image_data:
image.save(image_data, format="JPEG")
image_base64 = base64.b64encode(image_data.getvalue()).decode("utf-8")

View File

@ -189,5 +189,8 @@ if __name__ == "__main__":
)
parser = EngineArgs.add_cli_args(parser)
# V1 enables prefix caching by default which skews the latency
# numbers. We need to disable prefix caching by default.
parser.set_defaults(enable_prefix_caching=False)
args = parser.parse_args()
main(args)
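
The `parser.set_defaults(enable_prefix_caching=False)` call above changes only the default, so an explicit flag on the command line still takes effect. Here is a minimal sketch of that behaviour with plain `argparse`, assuming a boolean flag registered with `BooleanOptionalAction`; the flag definition is a stand-in and may differ from what `EngineArgs.add_cli_args` actually registers.

```python
import argparse

parser = argparse.ArgumentParser()
# Stand-in for the flag that EngineArgs.add_cli_args would add (assumption).
parser.add_argument(
    "--enable-prefix-caching",
    action=argparse.BooleanOptionalAction,
    default=None,
)
# Parser-level defaults override the argument-level default set above.
parser.set_defaults(enable_prefix_caching=False)

# No flag given: the benchmark-specific default applies.
default_args = parser.parse_args([])
# Flag given explicitly: the user's choice wins over the default.
explicit_args = parser.parse_args(["--enable-prefix-caching"])

print(default_args.enable_prefix_caching, explicit_args.enable_prefix_caching)
```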

View File

@ -672,7 +672,7 @@ async def benchmark(
def evaluate(ret, args):
def _eval_correctness_json(expected, actual):
# extract json string from string using regex
import re
import regex as re
actual = actual.replace("\n", "").replace(" ", "").strip()
try:
@ -687,7 +687,7 @@ def evaluate(ret, args):
return actual in args.choice
def _eval_correctness_regex(expected, actual):
import re
import regex as re
return re.match(args.regex, actual) is not None

View File

@ -84,7 +84,10 @@ def main(
if version == "v2":
if current_platform.is_rocm():
global PARTITION_SIZE
PARTITION_SIZE = 1024 if not args.custom_paged_attn else PARTITION_SIZE_ROCM
if not args.custom_paged_attn and not current_platform.is_navi():
PARTITION_SIZE = 1024
else:
PARTITION_SIZE = PARTITION_SIZE_ROCM
num_partitions = (max_seq_len + PARTITION_SIZE - 1) // PARTITION_SIZE
tmp_output = torch.empty(
size=(num_seqs, num_query_heads, num_partitions, head_size),
@ -159,6 +162,7 @@ def main(
scale,
block_tables,
seq_lens,
None,
block_size,
max_seq_len,
alibi_slopes,

View File

@ -2,11 +2,11 @@
import math
import pickle
import re
from collections import defaultdict
import matplotlib.pyplot as plt
import pandas as pd
import regex as re
import seaborn as sns
from torch.utils.benchmark import Measurement as TMeasurement

View File

@ -6,11 +6,6 @@
[tool.ruff]
line-length = 88
exclude = [
# External file, leaving license intact
"examples/other/fp8/quantizer/quantize.py",
"vllm/vllm_flash_attn/flash_attn_interface.pyi"
]
[tool.ruff.lint.per-file-ignores]
"vllm/third_party/**" = ["ALL"]

View File

@ -143,6 +143,14 @@ void merge_attn_states_launcher(torch::Tensor& output,
const uint pack_size = 16 / sizeof(scalar_t);
TORCH_CHECK(head_size % pack_size == 0,
"headsize must be multiple of pack_size:", pack_size);
TORCH_CHECK(output.stride(-2) == head_size && output.stride(-1) == 1,
"output heads must be contiguous in memory");
TORCH_CHECK(
prefix_output.stride(-2) == head_size && prefix_output.stride(-1) == 1,
"prefix_output heads must be contiguous in memory");
TORCH_CHECK(
suffix_output.stride(-2) == head_size && suffix_output.stride(-1) == 1,
"suffix_output heads must be contiguous in memory");
float* output_lse_ptr = nullptr;
if (output_lse.has_value()) {
output_lse_ptr = output_lse.value().data_ptr<float>();
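
The new `TORCH_CHECK` calls above require that, for each tensor, the stride of the second-to-last dimension equals `head_size` and the stride of the last dimension is 1. As a rough pure-Python illustration of what that condition means, the sketch below computes row-major strides for a shape and applies the same test. The helper names and the example shape are assumptions for illustration.

```python
def row_major_strides(shape: tuple) -> tuple:
    """Element strides of a densely packed row-major tensor."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def heads_are_contiguous(strides: tuple, head_size: int) -> bool:
    """Same condition as the stride checks in the launcher."""
    return strides[-2] == head_size and strides[-1] == 1


# Example layout: [num_tokens, num_heads, head_size] = (4, 8, 128).
dense = row_major_strides((4, 8, 128))
print(dense, heads_are_contiguous(dense, 128))
```

A view that keeps only part of the last dimension retains the parent's strides, so it would fail this check even though its shape looks valid.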

View File

@ -19,6 +19,7 @@ namespace vec_op {
#define VLLM_DISPATCH_CASE_FLOATING_TYPES_FP8(...) \
AT_DISPATCH_CASE(at::ScalarType::Float, __VA_ARGS__) \
AT_DISPATCH_CASE(at::ScalarType::BFloat16, __VA_ARGS__) \
AT_DISPATCH_CASE(at::ScalarType::Half, __VA_ARGS__) \
AT_DISPATCH_CASE(at::ScalarType::Float8_e5m2, __VA_ARGS__)
#define VLLM_DISPATCH_FLOATING_TYPES(TYPE, NAME, ...) \

View File

@ -15,15 +15,6 @@
cutlassGetStatusString(error)); \
}
/**
* Panic wrapper for unwinding CUDA runtime errors
*/
#define CUDA_CHECK(status) \
{ \
cudaError_t error = status; \
TORCH_CHECK(error == cudaSuccess, cudaGetErrorString(error)); \
}
inline int get_cuda_max_shared_memory_per_block_opt_in(int const device) {
int max_shared_mem_per_block_opt_in = 0;
cudaDeviceGetAttribute(&max_shared_mem_per_block_opt_in,

View File

@ -13,6 +13,10 @@
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#ifdef USE_ROCM
namespace cub = hipcub;
#endif
#include "static_switch.h"
@ -501,15 +505,9 @@ void causal_conv1d_fwd_launch(ConvParamsBase &params, cudaStream_t stream) {
auto kernel = &causal_conv1d_fwd_kernel<Ktraits>;
if (kSmemSize >= 48 * 1024) {
#ifndef USE_ROCM
C10_CUDA_CHECK(cudaFuncSetAttribute(
kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
#else
// There is a slight signature discrepancy in HIP and CUDA "FuncSetAttribute" function.
C10_CUDA_CHECK(cudaFuncSetAttribute(
(void *) kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
std::cerr << "Warning (causal_conv1d fwd launch): attempting to set maxDynamicSharedMemorySize on an AMD GPU which is currently a non-op (in ROCm versions <= 6.1). This might lead to undefined behavior. \n" << std::endl;
#endif
}
kernel<<<grid, Ktraits::kNThreads, kSmemSize, stream>>>(params);

View File

@ -321,7 +321,7 @@ void selective_scan_fwd_launch(SSMParamsBase &params, cudaStream_t stream) {
auto kernel = &selective_scan_fwd_kernel<Ktraits>;
if (kSmemSize >= 48 * 1024) {
C10_CUDA_CHECK(cudaFuncSetAttribute(
kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
(void *) kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, kSmemSize));
}
kernel<<<grid, Ktraits::kNThreads, kSmemSize, stream>>>(params);
C10_CUDA_KERNEL_LAUNCH_CHECK();

View File

@ -28,4 +28,6 @@ torch::Tensor moe_wna16_gemm(torch::Tensor input, torch::Tensor output,
torch::Tensor num_tokens_post_pad, int64_t top_k,
int64_t BLOCK_SIZE_M, int64_t BLOCK_SIZE_N,
int64_t BLOCK_SIZE_K, int64_t bit);
#endif
#endif
bool moe_permute_unpermute_supported();

View File

@ -5,6 +5,9 @@
#include "permute_unpermute_kernels/dispatch.h"
#include "core/registration.h"
// moe_permute kernels require at least CUDA 12.0
#if defined(CUDA_VERSION) && (CUDA_VERSION >= 12000)
void moe_permute(
const torch::Tensor& input, // [n_token, hidden]
const torch::Tensor& topk_weights, //[n_token, topk]
@ -127,7 +130,45 @@ void moe_unpermute(
});
}
#else
void moe_permute(const torch::Tensor& input, const torch::Tensor& topk_weights,
torch::Tensor& topk_ids,
const torch::Tensor& token_expert_indicies,
const std::optional<torch::Tensor>& expert_map,
int64_t n_expert, int64_t n_local_expert, int64_t topk,
const std::optional<int64_t>& align_block_size,
torch::Tensor& permuted_input,
torch::Tensor& expert_first_token_offset,
torch::Tensor& src_row_id2dst_row_id_map,
torch::Tensor& m_indices) {
TORCH_CHECK(false, "moe_permute is not supported on CUDA < 12.0");
}
void moe_unpermute(const torch::Tensor& input,
const torch::Tensor& topk_weights, torch::Tensor& topk_ids,
const torch::Tensor& token_expert_indicies,
const std::optional<torch::Tensor>& expert_map,
int64_t n_expert, int64_t n_local_expert, int64_t topk,
const std::optional<int64_t>& align_block_size,
torch::Tensor& permuted_input,
torch::Tensor& expert_first_token_offset,
torch::Tensor& src_row_id2dst_row_id_map,
torch::Tensor& m_indices) {
TORCH_CHECK(false, "moe_unpermute is not supported on CUDA < 12.0");
}
#endif
bool moe_permute_unpermute_supported() {
#if defined(CUDA_VERSION) && (CUDA_VERSION >= 12000)
return true;
#else
return false;
#endif
}
TORCH_LIBRARY_IMPL_EXPAND(TORCH_EXTENSION_NAME, CUDA, m) {
m.impl("moe_permute", &moe_permute);
m.impl("moe_unpermute", &moe_unpermute);
}
}

View File

@ -1,6 +1,9 @@
#include "moe_permute_unpermute_kernel.h"
// moe_permute kernels require at least CUDA 12.0
#if defined(CUDA_VERSION) && (CUDA_VERSION >= 12000)
// CubKeyValueSorter definition begin
CubKeyValueSorter::CubKeyValueSorter()
: num_experts_(0), num_bits_(sizeof(int) * 8) {}
@ -131,9 +134,6 @@ __global__ void preprocessTopkIdKernel(int* topk_id_ptr, int size,
int num_experts) {
auto tidx = threadIdx.x;
auto bidx = blockIdx.x;
auto lidx = tidx & 31;
auto widx = tidx >> 5;
auto warp_count = (blockDim.x + 31) >> 5;
auto offset = bidx * blockDim.x;
auto bound = min(offset + blockDim.x, size);
extern __shared__ int smem_expert_map[];
@ -226,4 +226,6 @@ void getMIndices(int64_t* expert_first_token_offset,
expert_first_token_offset, align_expert_first_token_offset, m_indices,
num_local_expert, align_block_size);
}
}
}
#endif

View File

@ -10,7 +10,7 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, m) {
// Calculate the result of moe by summing up the partial results
// from all selected experts.
m.def("moe_sum(Tensor! input, Tensor output) -> ()");
m.def("moe_sum(Tensor input, Tensor! output) -> ()");
m.impl("moe_sum", torch::kCUDA, &moe_sum);
// Aligning the number of tokens to be processed by each expert such
@ -77,7 +77,9 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, m) {
"Tensor topk_ids,Tensor src_row_id2dst_row_id_map, Tensor "
"expert_first_token_offset, int n_expert, int n_local_expert,int "
"topk, Tensor! hidden_states)->()");
// conditionally compiled so impl registration is in source file
m.def("moe_permute_unpermute_supported() -> bool");
m.impl("moe_permute_unpermute_supported", &moe_permute_unpermute_supported);
#endif
}

View File

@ -123,7 +123,7 @@ bool cutlass_scaled_mm_supports_block_fp8(int64_t cuda_device_capability) {
}
bool cutlass_group_gemm_supported(int64_t cuda_device_capability) {
// CUTLASS groped FP8 kernels need at least CUDA 12.3
// CUTLASS grouped FP8 kernels need at least CUDA 12.3
// and SM90 (Hopper)
#if defined CUDA_VERSION

File diff suppressed because it is too large.

View File

@ -8,6 +8,8 @@
#include <ATen/cuda/CUDAContext.h>
#include "cuda_utils.h"
#include "cutlass/cutlass.h"
#include "cutlass/gemm/device/gemm_universal_adapter.h"
@ -95,9 +97,9 @@ struct cutlass_sparse_3x_gemm {
// clang-format off
using CollectiveMainloop =
typename cutlass::gemm::collective::CollectiveBuilder<
cutlass::arch::Sm90, cutlass::arch::OpClassSparseTensorOp,
ElementAB, cutlass::layout::RowMajor, AlignmentAB,
ElementAB, cutlass::layout::ColumnMajor, AlignmentAB,
cutlass::arch::Sm90, cutlass::arch::OpClassSparseTensorOp,
ElementAB, cutlass::layout::RowMajor, AlignmentAB,
ElementAB, cutlass::layout::ColumnMajor, AlignmentAB,
ElementAcc, TileShape, ClusterShape,
Stages,
KernelSchedule>::CollectiveOp;

View File

@ -482,41 +482,6 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
" Tensor page_table, float scale) -> ()");
ops.impl("cutlass_mla_decode", torch::kCUDA, &cutlass_mla_decode);
// Mamba selective scan kernel
ops.def(
"selective_scan_fwd(Tensor! u, Tensor! delta,"
"Tensor! A, Tensor! B, Tensor! C,"
"Tensor? D_, Tensor!? z_, Tensor? delta_bias_,"
"bool delta_softplus,"
"Tensor? query_start_loc,"
"Tensor? cache_indices,"
"Tensor? has_initial_state,"
"Tensor! ssm_states,"
"int pad_slot_id) -> ()");
ops.impl("selective_scan_fwd", torch::kCUDA, &selective_scan_fwd);
ops.def(
"causal_conv1d_update(Tensor! x,"
"Tensor! conv_state,"
"Tensor! weight,"
"Tensor? bias_,"
"bool silu_activation,"
"Tensor? cache_seqlens_,"
"Tensor? conv_state_indices,"
"int pad_slot_id) -> ()");
ops.impl("causal_conv1d_update", torch::kCUDA, &causal_conv1d_update);
ops.def(
"causal_conv1d_fwd(Tensor! x, Tensor! weight,"
"Tensor? bias_,"
"Tensor!? conv_states,"
"Tensor? query_start_loc,"
"Tensor? cache_indices,"
"Tensor? has_initial_state,"
"bool silu_activation,"
"int pad_slot_id) -> ()");
ops.impl("causal_conv1d_fwd", torch::kCUDA, &causal_conv1d_fwd);
// Compute NVFP4 block quantized tensor.
ops.def(
"scaled_fp4_quant(Tensor! output, Tensor input,"
@ -584,6 +549,41 @@ TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
ops.impl("dynamic_scaled_int8_quant", torch::kCUDA,
&dynamic_scaled_int8_quant);
// Mamba selective scan kernel
ops.def(
"selective_scan_fwd(Tensor! u, Tensor! delta,"
"Tensor! A, Tensor! B, Tensor! C,"
"Tensor? D_, Tensor!? z_, Tensor? delta_bias_,"
"bool delta_softplus,"
"Tensor? query_start_loc,"
"Tensor? cache_indices,"
"Tensor? has_initial_state,"
"Tensor! ssm_states,"
"int pad_slot_id) -> ()");
ops.impl("selective_scan_fwd", torch::kCUDA, &selective_scan_fwd);
ops.def(
"causal_conv1d_update(Tensor! x,"
"Tensor! conv_state,"
"Tensor! weight,"
"Tensor? bias_,"
"bool silu_activation,"
"Tensor? cache_seqlens_,"
"Tensor? conv_state_indices,"
"int pad_slot_id) -> ()");
ops.impl("causal_conv1d_update", torch::kCUDA, &causal_conv1d_update);
ops.def(
"causal_conv1d_fwd(Tensor! x, Tensor! weight,"
"Tensor? bias_,"
"Tensor!? conv_states,"
"Tensor? query_start_loc,"
"Tensor? cache_indices,"
"Tensor? has_initial_state,"
"bool silu_activation,"
"int pad_slot_id) -> ()");
ops.impl("causal_conv1d_fwd", torch::kCUDA, &causal_conv1d_fwd);
#ifndef USE_ROCM
// reorder weight for AllSpark Ampere W8A16 Fused Gemm kernel
ops.def(

View File

@ -2,8 +2,8 @@
# to run the OpenAI compatible server.
# Please update any changes made here to
# docs/source/contributing/dockerfile/dockerfile.md and
# docs/source/assets/contributing/dockerfile-stages-dependency.png
# docs/contributing/dockerfile/dockerfile.md and
# docs/assets/contributing/dockerfile-stages-dependency.png
ARG CUDA_VERSION=12.8.1
#################### BASE BUILD IMAGE ####################
@ -189,6 +189,8 @@ WORKDIR /vllm-workspace
ENV DEBIAN_FRONTEND=noninteractive
ARG TARGETPLATFORM
SHELL ["/bin/bash", "-c"]
RUN PYTHON_VERSION_STR=$(echo ${PYTHON_VERSION} | sed 's/\.//g') && \
echo "export PYTHON_VERSION_STR=${PYTHON_VERSION_STR}" >> /etc/environment
@ -255,10 +257,17 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist
RUN --mount=type=cache,target=/root/.cache/uv \
. /etc/environment && \
if [ "$TARGETPLATFORM" != "linux/arm64" ]; then \
# uv pip install --system https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.4/flashinfer_python-0.2.4+cu124torch2.6-cp38-abi3-linux_x86_64.whl ; \
# TESTING: install FlashInfer from source to test 2.7.0 final RC
FLASHINFER_ENABLE_AOT=1 TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX' \
uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@e00e8cedbfcb220f328fd36aa8f529f869b01e6b" ; \
# FlashInfer already has a wheel for PyTorch 2.7.0 and CUDA 12.8. This is enough for CI use
if [[ "$CUDA_VERSION" == 12.8* ]]; then \
uv pip install --system https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl; \
else \
export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0+PTX'; \
CUDA_MAJOR="${CUDA_VERSION%%.*}"; \
if [ "$CUDA_MAJOR" -lt 12 ]; then \
export FLASHINFER_ENABLE_SM90=0; \
fi; \
uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@21ea1d2545f74782b91eb8c08fd503ac4c0743fc" ; \
fi \
fi
COPY examples examples
COPY benchmarks benchmarks
@ -268,7 +277,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
. /etc/environment && \
uv pip list
# Although we build Flashinfer with AOT mode, there's still
# Even when we build FlashInfer in AOT mode, there are still
# some issues w.r.t. JIT compilation. Therefore we need to
# install build dependencies for JIT compilation.
# TODO: Remove this once FlashInfer AOT wheel is fixed
@ -296,8 +305,11 @@ RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/mamba@v2.2.4"
# install development dependencies (for testing)
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system -r requirements/dev.txt
RUN --mount=type=cache,target=/root/.cache/uv \
CUDA_MAJOR="${CUDA_VERSION%%.*}"; \
if [ "$CUDA_MAJOR" -ge 12 ]; then \
uv pip install --system -r requirements/dev.txt; \
fi
# install development dependencies (for testing)
RUN --mount=type=cache,target=/root/.cache/uv \
@ -316,7 +328,9 @@ COPY vllm/v1 /usr/local/lib/python3.12/dist-packages/vllm/v1
# will not be imported by other tests
RUN mkdir test_docs
RUN mv docs test_docs/
RUN cp -r examples test_docs/
RUN mv vllm test_docs/
RUN mv mkdocs.yaml test_docs/
#################### TEST IMAGE ####################
#################### OPENAI API SERVER ####################

View File

@ -51,9 +51,6 @@ RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --upgrade pip && \
uv pip install -r requirements/cpu.txt
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install intel-openmp==2024.2.1 intel_extension_for_pytorch==2.6.0
ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4:/opt/venv/lib/libiomp5.so:$LD_PRELOAD"
RUN echo 'ulimit -c 0' >> ~/.bashrc

View File

@ -1,6 +1,6 @@
# default base image
# https://gallery.ecr.aws/neuron/pytorch-inference-neuronx
ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.5.1-neuronx-py310-sdk2.22.0-ubuntu22.04"
ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.6.0-neuronx-py310-sdk2.23.0-ubuntu22.04"
FROM $BASE_IMAGE
@ -22,8 +22,7 @@ WORKDIR ${APP_MOUNT}/vllm
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas tenacity
RUN python3 -m pip install sentencepiece transformers==4.48.0 -U
RUN python3 -m pip install neuronx-cc==2.17.194.0 --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
RUN python3 -m pip install neuronx-cc==2.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
RUN python3 -m pip install pytest
# uninstall transformers-neuronx package explicitly to avoid version conflict
@ -49,6 +48,8 @@ RUN python3 -m pip install -e tests/vllm_test_utils
# FIXME: `--no-deps` argument is temporarily added to resolve transformers package version conflict
RUN python3 -m pip install transformers-neuronx==0.13.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U --no-deps
RUN python3 -m pip install sentencepiece transformers==4.48.0 -U
# overwrite entrypoint to run bash script
RUN echo "import subprocess; import sys; subprocess.check_call(sys.argv[1:])" > /usr/local/bin/dockerd-entrypoint.py

View File

@ -12,7 +12,7 @@ ARG PYTORCH_REPO="https://github.com/pytorch/pytorch.git"
ARG PYTORCH_VISION_REPO="https://github.com/pytorch/vision.git"
ARG FA_BRANCH="1a7f4dfa"
ARG FA_REPO="https://github.com/Dao-AILab/flash-attention.git"
ARG AITER_BRANCH="5a77249"
ARG AITER_BRANCH="c1debd8"
ARG AITER_REPO="https://github.com/ROCm/aiter.git"
FROM ${BASE_IMAGE} AS base

View File

@ -84,16 +84,40 @@ RUN curl https://sh.rustup.rs -sSf | sh -s -- -y && \
rustup default stable && \
rustup show
FROM python-install AS torch
ARG TORCH_VERSION=2.7.0
ENV _GLIBCXX_USE_CXX11_ABI=1
ENV CARGO_HOME=/root/.cargo
ENV RUSTUP_HOME=/root/.rustup
ENV PATH="$CARGO_HOME/bin:$RUSTUP_HOME/bin:$PATH"
WORKDIR /tmp
RUN --mount=type=cache,target=/root/.cache/uv \
--mount=type=bind,from=rust,source=/root/.cargo,target=/root/.cargo,rw \
--mount=type=bind,from=rust,source=/root/.rustup,target=/root/.rustup,rw \
git clone https://github.com/pytorch/pytorch.git && \
cd pytorch && \
git checkout v2.7.0 && \
git submodule sync && \
git submodule update --init --recursive && \
uv pip install cmake ninja && \
uv pip install -r requirements.txt && \
python setup.py bdist_wheel
FROM python-install AS torch-vision
# Install torchvision
ARG TORCH_VERSION=2.7.0.dev20250304
ARG TORCH_VERSION=2.7.0
ARG TORCH_VISION_VERSION=v0.20.1
WORKDIR /tmp
RUN --mount=type=cache,target=/root/.cache/uv \
--mount=type=bind,from=torch,source=/tmp/pytorch/dist,target=/tmp/torch-wheels/ \
git clone https://github.com/pytorch/vision.git && \
cd vision && \
git checkout $TORCH_VISION_VERSION && \
uv pip install -v torch==${TORCH_VERSION} --extra-index-url https://download.pytorch.org/whl/nightly/cpu && \
TORCH_WHL_FILE=$(ls /tmp/torch-wheels/*.whl | head -n 1) && \
uv pip install -v $TORCH_WHL_FILE && \
python setup.py bdist_wheel
FROM python-install AS hf-xet-builder
@ -138,15 +162,17 @@ RUN --mount=type=cache,target=/root/.cache/uv \
--mount=type=bind,from=pyarrow,source=/tmp/arrow/python/dist,target=/tmp/arrow-wheels \
--mount=type=bind,from=torch-vision,source=/tmp/vision/dist,target=/tmp/vision-wheels/ \
--mount=type=bind,from=hf-xet-builder,source=/tmp/hf-xet/dist,target=/tmp/hf-xet-wheels/ \
--mount=type=bind,from=torch,source=/tmp/pytorch/dist,target=/tmp/torch-wheels/ \
sed -i '/^torch/d' requirements/build.txt && \
ARROW_WHL_FILE=$(ls /tmp/arrow-wheels/pyarrow-*.whl | head -n 1) && \
VISION_WHL_FILE=$(ls /tmp/vision-wheels/*.whl | head -n 1) && \
HF_XET_WHL_FILE=$(ls /tmp/hf-xet-wheels/*.whl | head -n 1) && \
TORCH_WHL_FILE=$(ls /tmp/torch-wheels/*.whl | head -n 1) && \
uv pip install -v \
$ARROW_WHL_FILE \
$VISION_WHL_FILE \
$HF_XET_WHL_FILE \
--extra-index-url https://download.pytorch.org/whl/nightly/cpu \
$TORCH_WHL_FILE \
--index-strategy unsafe-best-match \
-r requirements/build.txt \
-r requirements/cpu.txt

63
docs/.nav.yml Normal file
View File

@ -0,0 +1,63 @@
nav:
- Home:
- vLLM: README.md
- Getting Started:
- getting_started/quickstart.md
- getting_started/installation
- Examples:
- Offline Inference: examples/offline_inference
- Online Serving: examples/online_serving
- Others: examples/others
- Quick Links:
- User Guide: usage/README.md
- Developer Guide: contributing/README.md
- API Reference: api/README.md
- Timeline:
- Roadmap: https://roadmap.vllm.ai
- Releases: https://github.com/vllm-project/vllm/releases
- User Guide:
- Summary: usage/README.md
- usage/v1_guide.md
- General:
- usage/*
- Inference and Serving:
- serving/offline_inference.md
- serving/openai_compatible_server.md
- serving/*
- serving/integrations
- Deployment:
- deployment/*
- deployment/frameworks
- deployment/integrations
- Training: training
- Configuration:
- Summary: configuration/README.md
- configuration/*
- Models:
- models/supported_models.md
- models/generative_models.md
- models/pooling_models.md
- models/extensions
- Features:
- features/compatibility_matrix.md
- features/*
- features/quantization
- Developer Guide:
- Summary: contributing/README.md
- General:
- glob: contributing/*
flatten_single_child_sections: true
- Model Implementation: contributing/model
- Design Documents:
- V0: design
- V1: design/v1
- API Reference:
- Summary: api/README.md
- Contents:
- glob: api/vllm/*
preserve_directory_names: true
- Community:
- community/*
- Blog: https://blog.vllm.ai
- Forum: https://discuss.vllm.ai
- Slack: https://slack.vllm.ai

View File

@ -1,25 +0,0 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
clean:
@$(SPHINXBUILD) -M clean "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
rm -rf "$(SOURCEDIR)/getting_started/examples"
rm -rf "$(SOURCEDIR)/api/vllm"

View File

@ -1,43 +1,50 @@
# vLLM documents
# Welcome to vLLM
## Build the docs
<figure markdown="span">
![](./assets/logos/vllm-logo-text-light.png){ align="center" alt="vLLM" class="no-scaled-link" width="60%" }
</figure>
- Make sure in `docs` directory
<p style="text-align:center">
<strong>Easy, fast, and cheap LLM serving for everyone
</strong>
</p>
```bash
cd docs
```
<p style="text-align:center">
<script async defer src="https://buttons.github.io/buttons.js"></script>
<a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
<a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p>
- Install the dependencies:
vLLM is a fast and easy-to-use library for LLM inference and serving.
```bash
pip install -r ../requirements/docs.txt
```
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
- Clean the previous build (optional but recommended):
vLLM is fast with:
```bash
make clean
```
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
- Speculative decoding
- Chunked prefill
- Generate the HTML documentation:
vLLM is flexible and easy to use with:
```bash
make html
```
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor parallelism and pipeline parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPU, and AWS Trainium and Inferentia accelerators
- Prefix caching support
- Multi-LoRA support
## Open the docs with your browser
For more information, check out the following:
- Serve the documentation locally:
```bash
python -m http.server -d build/html/
```
This will start a local server at http://localhost:8000. You can now open your browser and view the documentation.
If port 8000 is already in use, you can specify a different port, for example:
```bash
python -m http.server 3000 -d build/html/
```
- [vLLM announcement blog post](https://vllm.ai) (intro to PagedAttention)
- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
- [vLLM Meetups][meetups]

107
docs/api/README.md Normal file
View File

@ -0,0 +1,107 @@
# Summary
[](){ #configuration }
## Configuration
API documentation for vLLM's configuration classes.
- [vllm.config.ModelConfig][]
- [vllm.config.CacheConfig][]
- [vllm.config.TokenizerPoolConfig][]
- [vllm.config.LoadConfig][]
- [vllm.config.ParallelConfig][]
- [vllm.config.SchedulerConfig][]
- [vllm.config.DeviceConfig][]
- [vllm.config.SpeculativeConfig][]
- [vllm.config.LoRAConfig][]
- [vllm.config.PromptAdapterConfig][]
- [vllm.config.MultiModalConfig][]
- [vllm.config.PoolerConfig][]
- [vllm.config.DecodingConfig][]
- [vllm.config.ObservabilityConfig][]
- [vllm.config.KVTransferConfig][]
- [vllm.config.CompilationConfig][]
- [vllm.config.VllmConfig][]
[](){ #offline-inference-api }
## Offline Inference
LLM Class.
- [vllm.LLM][]
LLM Inputs.
- [vllm.inputs.PromptType][]
- [vllm.inputs.TextPrompt][]
- [vllm.inputs.TokensPrompt][]
## vLLM Engines
Engine classes for offline and online inference.
- [vllm.LLMEngine][]
- [vllm.AsyncLLMEngine][]
## Inference Parameters
Inference parameters for vLLM APIs.
[](){ #sampling-params }
[](){ #pooling-params }
- [vllm.SamplingParams][]
- [vllm.PoolingParams][]
[](){ #multi-modality }
## Multi-Modality
vLLM provides experimental support for multi-modal models through the [vllm.multimodal][] package.
Multi-modal inputs can be passed alongside text and token prompts to [supported models][supported-mm-models]
via the `multi_modal_data` field in [vllm.inputs.PromptType][].
Looking to add your own multi-modal model? Please follow the instructions listed [here][supports-multimodal].
- [vllm.multimodal.MULTIMODAL_REGISTRY][]
### Inputs
User-facing inputs.
- [vllm.multimodal.inputs.MultiModalDataDict][]
Internal data structures.
- [vllm.multimodal.inputs.PlaceholderRange][]
- [vllm.multimodal.inputs.NestedTensors][]
- [vllm.multimodal.inputs.MultiModalFieldElem][]
- [vllm.multimodal.inputs.MultiModalFieldConfig][]
- [vllm.multimodal.inputs.MultiModalKwargsItem][]
- [vllm.multimodal.inputs.MultiModalKwargs][]
- [vllm.multimodal.inputs.MultiModalInputs][]
### Data Parsing
- [vllm.multimodal.parse][]
### Data Processing
- [vllm.multimodal.processing][]
### Memory Profiling
- [vllm.multimodal.profiling][]
### Registry
- [vllm.multimodal.registry][]
## Model Development
- [vllm.model_executor.models.interfaces_base][]
- [vllm.model_executor.models.interfaces][]
- [vllm.model_executor.models.adapters][]

2
docs/api/vllm/.meta.yml Normal file
View File

@ -0,0 +1,2 @@
search:
boost: 0.5

View File

@ -1,6 +1,7 @@
(meetups)=
# vLLM Meetups
---
title: Meetups
---
[](){ #meetups }
We host regular meetups in the San Francisco Bay Area every two months, where the vLLM team shares project updates and guest speakers from industry share their experience and insights. The materials from our previous meetups are listed below:

View File

@ -0,0 +1,9 @@
# Configuration Options
This section lists the most common options for running vLLM.
There are three main levels of configuration, from highest priority to lowest priority:
- [Request parameters][completions-api] and [input arguments][sampling-params]
- [Engine arguments](./engine_args.md)
- [Environment variables](./env_vars.md)
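The priority order can be pictured as a layered lookup in which the first layer that defines a setting wins. The sketch below is purely illustrative: `resolve_setting`, the three dictionaries, and their keys are hypothetical and are not part of vLLM; the standard-library `ChainMap` only stands in for the idea of "highest priority first".

```python
from collections import ChainMap

# Hypothetical layers, highest priority first. The keys and values are
# made up for illustration and do not mirror vLLM's internals.
request_params = {"temperature": 0.2}
engine_args = {"temperature": 1.0, "max_model_len": 4096}
env_vars = {"max_model_len": 8192, "log_level": "INFO"}

# ChainMap searches its maps left to right and returns the first match.
settings = ChainMap(request_params, engine_args, env_vars)


def resolve_setting(name):
    """Return the value from the highest-priority layer that defines `name`."""
    return settings[name]


print(resolve_setting("temperature"))    # taken from the request layer
print(resolve_setting("max_model_len"))  # falls back to the engine layer
print(resolve_setting("log_level"))      # falls back to the environment layer
```

In practice the three levels mostly control different settings, so an actual conflict between them is the exception rather than the rule.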

View File

@ -0,0 +1,144 @@
# Conserving Memory
Large models might cause your machine to run out of memory (OOM). Here are some options that help alleviate this problem.
## Tensor Parallelism (TP)
Tensor parallelism (`tensor_parallel_size` option) can be used to split the model across multiple GPUs.
The following code splits the model across 2 GPUs.
```python
from vllm import LLM
llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
tensor_parallel_size=2)
```
!!! warning
    To ensure that vLLM initializes CUDA correctly, you should avoid calling related functions (e.g. [torch.cuda.set_device][])
    before initializing vLLM. Otherwise, you may run into an error like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`.

    To control which devices are used, please instead set the `CUDA_VISIBLE_DEVICES` environment variable.

!!! note
    With tensor parallelism enabled, each process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism).

    You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/offline_inference/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
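As a minimal sketch of the advice in the warning above, the environment variable can be set from Python before vLLM is imported. The device IDs `0,1` are an assumption for illustration, and the vLLM calls are left commented out so the snippet runs without a GPU.

```python
import os

# Select the devices before importing vLLM or touching torch.cuda, so that
# CUDA is not initialized too early. The device IDs are an assumption.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"{len(visible)} device(s) visible: {visible}")

# Import vLLM only after the environment is set:
# from vllm import LLM
# llm = LLM(model="ibm-granite/granite-3.1-8b-instruct",
#           tensor_parallel_size=len(visible))
```

Setting the variable in the shell that launches the process has the same effect and avoids any ordering concerns inside the script.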
## Quantization
Quantized models take less memory at the cost of lower precision.
Statically quantized models can be downloaded from HF Hub (some popular ones are available at [Red Hat AI](https://huggingface.co/RedHatAI))
and used directly without extra configuration.
Dynamic quantization is also supported via the `quantization` option -- see [here][quantization-index] for more details.
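To get a feel for the savings, the weight footprint can be estimated from the parameter count and the bytes stored per parameter. This is back-of-the-envelope arithmetic, not a vLLM API: `weight_memory_gib` is a hypothetical helper, the 8-billion parameter count is an assumption, and real quantized checkpoints carry some extra overhead for scales and zero points.

```python
# Approximate bytes stored per parameter for common weight formats.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params, fmt):
    """Estimate weight memory in GiB, ignoring scales and zero points."""
    return num_params * BYTES_PER_PARAM[fmt] / 2**30


num_params = 8e9  # assumed parameter count
for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: {weight_memory_gib(num_params, fmt):.1f} GiB")
```

The estimate covers weights only; the KV cache and activations need additional memory.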
## Context length and batch size
You can further reduce memory usage by limiting the context length of the model (`max_model_len` option)
and the maximum batch size (`max_num_seqs` option).
```python
from vllm import LLM
llm = LLM(model="adept/fuyu-8b",
max_model_len=2048,
max_num_seqs=2)
```
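A rough way to see why these two options matter is to estimate the KV cache needed if every sequence in a batch grows to the full context length. The helpers below are hypothetical, and the model shape (32 layers, 8 KV heads, head size 128, 2-byte values) is assumed for illustration rather than read from any checkpoint.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: keys and values in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def worst_case_kv_bytes(max_model_len, max_num_seqs, per_token_bytes):
    """KV cache needed if every sequence in the batch reaches full length."""
    return max_model_len * max_num_seqs * per_token_bytes


# Assumed model shape for illustration only.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
total = worst_case_kv_bytes(max_model_len=2048, max_num_seqs=2,
                            per_token_bytes=per_token)
print(f"{per_token / 2**10:.0f} KiB per token, {total / 2**20:.0f} MiB worst case")
```

Because the bound is a product, halving either `max_model_len` or `max_num_seqs` halves it.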
## Reduce CUDA Graphs
By default, we optimize model inference using CUDA graphs, which take up extra memory on the GPU.

!!! warning
    CUDA graph capture takes up more memory in V1 than in V0.
You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
```python
from vllm import LLM
from vllm.config import CompilationConfig, CompilationLevel
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
compilation_config=CompilationConfig(
level=CompilationLevel.PIECEWISE,
# By default, it goes up to max_num_seqs
cudagraph_capture_sizes=[1, 2, 4, 8, 16],
),
)
```
You can disable graph capturing completely via the `enforce_eager` flag:
```python
from vllm import LLM
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
enforce_eager=True)
```
## Adjust cache size
If you run out of CPU RAM, try the following options:
- (Multi-modal models only) you can set the size of the multi-modal input cache using the `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
- (CPU backend only) you can set the size of the KV cache using the `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
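Both variables can also be exported from Python, provided this happens before the engine is created. The sketch below assumes whole-GiB integer values; `set_cache_limits` is a hypothetical helper and the sizes chosen are arbitrary.

```python
import os


def set_cache_limits(mm_cache_gib, cpu_kv_gib):
    """Export both cache sizes (assumed to be whole GiB) as environment variables."""
    limits = (("VLLM_MM_INPUT_CACHE_GIB", mm_cache_gib),
              ("VLLM_CPU_KVCACHE_SPACE", cpu_kv_gib))
    for name, value in limits:
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
        os.environ[name] = str(value)


# Shrink the multi-modal input cache and enlarge the CPU KV cache.
set_cache_limits(mm_cache_gib=2, cpu_kv_gib=8)
print(os.environ["VLLM_MM_INPUT_CACHE_GIB"], os.environ["VLLM_CPU_KVCACHE_SPACE"])
```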
## Multi-modal input limits
You can allow a smaller number of multi-modal items per prompt to reduce the memory footprint of the model:
```python
from vllm import LLM
# Accept up to 3 images and 1 video per prompt
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          limit_mm_per_prompt={"image": 3, "video": 1})
```
You can go a step further and disable unused modalities completely by setting their limits to zero.
For example, if your application only accepts image input, there is no need to allocate any memory for videos.
```python
from vllm import LLM
# Accept any number of images but no videos
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          limit_mm_per_prompt={"video": 0})
```
You can even run a multi-modal model for text-only inference:
```python
from vllm import LLM
# Don't accept images. Just text.
llm = LLM(model="google/gemma-3-27b-it",
          limit_mm_per_prompt={"image": 0})
```
## Multi-modal processor arguments
For certain models, you can adjust the multi-modal processor arguments to
reduce the size of the processed multi-modal inputs, which in turn saves memory.
Here are some examples:
```python
from vllm import LLM
# Available for Qwen2-VL series models
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          mm_processor_kwargs={
              "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
          })
# Available for InternVL series models
llm = LLM(model="OpenGVLab/InternVL2-2B",
          mm_processor_kwargs={
              "max_dynamic_patch": 4,  # Default is 12
          })
```

@ -0,0 +1,18 @@
---
title: Engine Arguments
---
[](){ #engine-args }
Engine arguments control the behavior of the vLLM engine.
- For [offline inference][offline-inference], they are part of the arguments to the [LLM][vllm.LLM] class.
- For [online serving][openai-compatible-server], they are part of the arguments to `vllm serve`.
You can look at [EngineArgs][vllm.engine.arg_utils.EngineArgs] and [AsyncEngineArgs][vllm.engine.arg_utils.AsyncEngineArgs] to see the available engine arguments.
However, these classes combine the configuration classes defined in [vllm.config][], so we recommend reading about the arguments there, where they are best documented.
For offline inference, you have direct access to these configuration classes. For online serving, you can cross-reference the configs with `vllm serve --help`, which groups its arguments by config.
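The same engine argument is therefore spelled two ways: as a keyword argument to `LLM` and as a long-form flag to `vllm serve`. The sketch below illustrates the usual naming relationship (underscores become dashes). `to_cli_flags` is a hypothetical helper written for this example and is not part of vLLM. It assumes plain valued options and store-true style booleans, so check the result against `vllm serve --help`:

```python
def to_cli_flags(engine_args: dict) -> list[str]:
    """Translate keyword-style engine arguments into long-form CLI flags."""
    flags: list[str] = []
    for name, value in engine_args.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            flags.append(flag)  # boolean switch, no value
        elif value is False:
            continue            # assumption: omit switches that are off
        else:
            flags.extend([flag, str(value)])
    return flags


kwargs = {"max_model_len": 2048, "max_num_seqs": 2, "enforce_eager": True}
print(" ".join(to_cli_flags(kwargs)))
# --max-model-len 2048 --max-num-seqs 2 --enforce-eager
```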
!!! note
    Additional arguments are available to the [AsyncLLMEngine][vllm.engine.async_llm_engine.AsyncLLMEngine], which is used for online serving. These can be found by running `vllm serve --help`.

@ -0,0 +1,12 @@
# Environment Variables
vLLM uses the following environment variables to configure the system:
!!! warning
    `VLLM_PORT` and `VLLM_HOST_IP` set the port and IP address for vLLM's **internal usage**. They are not the port and IP address of the API server. Starting the API server with `--host $VLLM_HOST_IP` and `--port $VLLM_PORT` will not work.

    All environment variables used by vLLM are prefixed with `VLLM_`. **Kubernetes users should take special care**: do not name the service `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
```python
--8<-- "vllm/envs.py:env-vars-definition"
```
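To see why the Service name matters on Kubernetes, the sketch below reproduces the variable naming scheme described in the linked Kubernetes documentation (upper-cased Service name, dashes turned into underscores). The helper names, port 8000, and the Service name `llm-api` are illustrative assumptions. Only `VLLM_PORT` and `VLLM_HOST_IP` are checked, since those are the variables named in the warning above:

```python
# Variables named in the warning above that vLLM reads for internal use.
VLLM_INTERNAL_VARS = {"VLLM_PORT", "VLLM_HOST_IP"}


def k8s_service_env_names(service_name: str, port: int, protocol: str = "TCP") -> set[str]:
    """Names of the variables Kubernetes injects into pods for one Service port."""
    prefix = service_name.upper().replace("-", "_")
    port_prefix = f"{prefix}_PORT_{port}_{protocol.upper()}"
    return {
        f"{prefix}_SERVICE_HOST",
        f"{prefix}_SERVICE_PORT",
        f"{prefix}_PORT",  # Docker-links-compatible variables start here
        port_prefix,
        f"{port_prefix}_PROTO",
        f"{port_prefix}_PORT",
        f"{port_prefix}_ADDR",
    }


def conflicts(service_name: str, port: int) -> set[str]:
    """Injected variable names that collide with vLLM's own variables."""
    return k8s_service_env_names(service_name, port) & VLLM_INTERNAL_VARS


print(conflicts("vllm", 8000))     # {'VLLM_PORT'}
print(conflicts("llm-api", 8000))  # set()
```

A Service named `vllm` makes Kubernetes inject `VLLM_PORT` with a value such as `tcp://<cluster-ip>:8000`, which is not a valid port number for vLLM.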

@ -0,0 +1,23 @@
# Model Resolution
vLLM loads HuggingFace-compatible models by inspecting the `architectures` field in `config.json` of the model repository
and finding the corresponding implementation that is registered to vLLM.
Nevertheless, our model resolution may fail for the following reasons:
- The `config.json` of the model repository lacks the `architectures` field.
- Unofficial repositories refer to a model using alternative names which are not recorded in vLLM.
- The same architecture name is used for multiple models, creating ambiguity as to which model should be loaded.
To fix this, explicitly specify the model architecture by passing `config.json` overrides to the `hf_overrides` option.
For example:
```python
from vllm import LLM
model = LLM(
    model="cerebras/Cerebras-GPT-1.3B",
    hf_overrides={"architectures": ["GPT2LMHeadModel"]},  # GPT-2
)
```
Our [list of supported models][supported-models] shows the model architectures that are recognized by vLLM.
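To check whether the first failure mode applies, you can inspect the `architectures` field yourself before loading the model. A minimal sketch using only the standard library. `read_architectures` and the two sample payloads are hypothetical and stand in for the real `config.json` of a repository:

```python
import json


def read_architectures(config_text: str) -> list[str]:
    """Return the `architectures` list from a config.json payload, or [] if absent."""
    config = json.loads(config_text)
    return list(config.get("architectures") or [])


# Hypothetical config.json contents, for illustration only.
complete = '{"model_type": "gpt2", "architectures": ["GPT2LMHeadModel"]}'
incomplete = '{"model_type": "gpt2"}'

for text in (complete, incomplete):
    architectures = read_architectures(text)
    if architectures:
        print("architectures:", architectures)
    else:
        print("architectures missing: pass hf_overrides={'architectures': [...]}")
```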

@ -1,5 +1,3 @@
(optimization-and-tuning)=
# Optimization and Tuning
This guide covers optimization strategies and performance tuning for vLLM V1.
@ -26,7 +24,7 @@ You can monitor the number of preemption requests through Prometheus metrics exp
In vLLM V1, the default preemption mode is `RECOMPUTE` rather than `SWAP`, as recomputation has lower overhead in the V1 architecture.
(chunked-prefill)=
[](){ #chunked-prefill }
## Chunked Prefill

@ -0,0 +1,38 @@
---
title: Server Arguments
---
[](){ #serve-args }
The `vllm serve` command is used to launch the OpenAI-compatible server.
## CLI Arguments
To see the available CLI arguments, run `vllm serve --help`!
## Configuration file
You can load CLI arguments via a [YAML](https://yaml.org/) config file.
The argument names must be the long form of those outlined [above][serve-args].
For example:
```yaml
# config.yaml
model: meta-llama/Llama-3.1-8B-Instruct
host: "127.0.0.1"
port: 6379
uvicorn-log-level: "info"
```
To use the above config file:
```bash
vllm serve --config config.yaml
```
!!! note
    If an argument is supplied on both the command line and in the config file, the value from the command line takes precedence.
    The order of priorities is `command line > config file values > defaults`.
    For example, in `vllm serve SOME_MODEL --config config.yaml`, `SOME_MODEL` takes precedence over `model` in the config file.
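The precedence rule can be pictured as a layered lookup. This is an illustration of the ordering only, not vLLM's actual argument parsing. The `defaults` values below are made up for the example, and the config file values mirror `config.yaml` above:

```python
from collections import ChainMap

# Hypothetical defaults, for illustration only.
defaults = {"model": None, "host": "0.0.0.0", "port": 8000}
# Values from config.yaml above.
config_file = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "host": "127.0.0.1",
    "port": 6379,
}
# Values given on the command line: `vllm serve SOME_MODEL --config config.yaml`.
command_line = {"model": "SOME_MODEL"}

# ChainMap searches its mappings left to right, so earlier layers win.
effective = dict(ChainMap(command_line, config_file, defaults))

print(effective["model"])  # SOME_MODEL (command line wins)
print(effective["port"])   # 6379 (config file wins over the default)
```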

@ -16,9 +16,9 @@ Finally, one of the most impactful ways to support us is by raising awareness ab
Unsure where to start? Check out the following links for tasks to work on:
- [Good first issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22)
- [Selected onboarding tasks](gh-project:6)
- [Selected onboarding tasks](gh-project:6)
- [New model requests](https://github.com/vllm-project/vllm/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22new-model%22)
- [Models with multi-modal capabilities](gh-project:10)
- [Models with multi-modal capabilities](gh-project:10)
## License
@ -27,7 +27,21 @@ See <gh-file:LICENSE>.
## Developing
Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation.
Check out the [building from source](#build-from-source) documentation for details.
Check out the [building from source][build-from-source] documentation for details.
### Building the docs
Install the dependencies:
```bash
pip install -r requirements/docs.txt
```
Start the autoreloading MkDocs server:
```bash
mkdocs serve
```
## Testing
@ -48,29 +62,25 @@ pre-commit run mypy-3.9 --hook-stage manual --all-files
pytest tests/
```
:::{tip}
Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
!!! tip
Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
:::
Therefore, we recommend developing with Python 3.12 to minimise the chance of your local environment clashing with our CI environment.
:::{note}
Currently, the repository is not fully checked by `mypy`.
:::
!!! note
Currently, the repository is not fully checked by `mypy`.
:::{note}
Currently, not all unit tests pass when run on CPU platforms. If you don't have access to a GPU
platform to run unit tests locally, rely on the continuous integration system to run the tests for
now.
:::
!!! note
Currently, not all unit tests pass when run on CPU platforms. If you don't have access to a GPU
platform to run unit tests locally, rely on the continuous integration system to run the tests for
now.
## Issues
If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
:::{important}
If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
:::
!!! warning
If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
## Pull Requests & Code Reviews
@ -106,9 +116,8 @@ appropriately to indicate the type of change. Please use one of the following:
- `[Misc]` for PRs that do not fit the above categories. Please use this
sparingly.
:::{note}
If the PR spans more than one category, please include all relevant prefixes.
:::
!!! note
If the PR spans more than one category, please include all relevant prefixes.
### Code Quality
@ -121,9 +130,8 @@ The PR needs to meet the following code quality standards:
understand the code.
- Include sufficient tests to ensure the project stays correct and robust. This
includes both unit tests and integration tests.
- Please add documentation to `docs/source/` if the PR modifies the
user-facing behaviors of vLLM. It helps vLLM users understand and utilize the
new features or changes.
- Please add documentation to `docs/` if the PR modifies the user-facing behaviors of vLLM.
It helps vLLM users understand and utilize the new features or changes.
### Adding or Changing Kernels

@ -1,13 +1,14 @@
(benchmarks)=
# Benchmark Suites
---
title: Benchmark Suites
---
[](){ #benchmarks }
vLLM contains two sets of benchmarks:
- [Performance benchmarks](#performance-benchmarks)
- [Nightly benchmarks](#nightly-benchmarks)
- [Performance benchmarks][performance-benchmarks]
- [Nightly benchmarks][nightly-benchmarks]
(performance-benchmarks)=
[](){ #performance-benchmarks }
## Performance Benchmarks
@ -17,7 +18,7 @@ The latest performance results are hosted on the public [vLLM Performance Dashbo
More information on the performance benchmarks and their parameters can be found [here](gh-file:.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md).
(nightly-benchmarks)=
[](){ #nightly-benchmarks }
## Nightly Benchmarks
