5873877241
[Bugfix] Mistral tool calling when content is list ( #18729 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-27 09:05:37 -07:00
696259ca01
[Core] Automatically cast multi-modal input dtype ( #18756 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-27 23:45:48 +08:00
6b6d496114
optimize get_kv_cache_torch_dtype ( #18531 )
...
Signed-off-by: idellzheng <idellzheng@tencent.com >
2025-05-27 13:08:44 +00:00
aaa4ac1c95
Disable prefix cache by default for benchmark ( #18639 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-05-27 20:06:34 +08:00
06a0338015
[V1][Metrics] Add API for accessing in-memory Prometheus metrics ( #17010 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-27 09:37:06 +00:00
4318c0559d
[CI/Build] Remove imports of built-in re ( #18750 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-27 09:19:18 +00:00
a68e293cb9
[Doc] Convert Sphinx directives ( {class}, {meth}, {attr}, ...) to MkDocs format for better documentation linking ( #18663 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-05-27 01:44:20 -07:00
6881107948
[BUG FIX] minicpm ( #18739 )
...
Signed-off-by: huangyuxiang03 <huangyx0321@gmail.com >
Co-authored-by: huangyuxiang03 <huangyx0321@gmail.com >
2025-05-27 01:04:49 -07:00
e0f0ff87b8
[Build] fix cpu build missing libtbbmalloc.so ( #18744 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-05-27 01:03:56 -07:00
c24b1572ac
Minor fix about MooncakeStoreConnector ( #18721 )
...
Signed-off-by: baoloongmao <baoloongmao@tencent.com >
2025-05-27 08:02:28 +00:00
4693a3438c
[Doc] cleanup deprecated flag for doc ( #18715 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-05-27 07:12:02 +00:00
bbd9a84dc5
[Hardware][Intel-Gaudi] [CI/Build] Fix multiple containers using the same name in run-hpu-test.sh ( #18752 )
...
Signed-off-by: Lukasz Durejko <ldurejko@habana.ai >
2025-05-27 00:10:26 -07:00
a547aeb828
feat(rocm-support): support mamba2 on rocm ( #18565 )
...
Signed-off-by: Islam Almersawi <islam.almersawi@openinnovation.ai >
Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai >
2025-05-27 00:07:53 -07:00
fc6d0c290f
[Misc] improve docs ( #18734 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-27 07:07:01 +00:00
753944fa9b
[Doc] Update reproducibility doc and example ( #18741 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-27 07:03:13 +00:00
25a817f202
[Doc] Update OOT model docs ( #18742 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-27 06:30:31 +00:00
d260f799a9
[FEAT] [ROCm] Upgrade AITER Fused MoE kernels. ( #18271 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-05-26 23:14:07 -07:00
b50602d5f0
[Model][Gemma3] Cast image pixel values already on CPU ( #18732 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-27 05:42:54 +00:00
1f1b1bc03b
[V1][Quantization] Add CUDA graph compatible v1 GGUF support ( #18646 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-27 04:40:28 +00:00
1f88dbd2bb
[Misc] improve web section group title display ( #18684 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-27 04:35:16 +00:00
0eebd74842
[Model][Gemma3] Simplify image input validation ( #18710 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-27 11:13:37 +08:00
27bebcd897
Convert examples to ruff-format ( #18400 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-26 16:57:54 +00:00
e7523c2e03
[V1][Sampler] Improve performance of FlashInfer sampling by sampling logits instead of probs ( #18608 )
2025-05-26 11:49:36 -04:00
a869baca73
[Bugfix] Fix Llama GGUF initialization ( #18717 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 07:49:22 -07:00
82e2339b06
[Doc] Move examples and further reorganize user guide ( #18666 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 07:38:04 -07:00
9553fdb41e
[Doc] Improve API docs ( #18713 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 07:33:34 -07:00
243eb9199f
[Bugfix]: handle hf-xet CAS error when loading Qwen3 weights in vLLM ( #18701 )
2025-05-26 07:10:56 -07:00
0665e29998
[Misc] add AutoGen integration ( #18712 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-26 13:56:18 +00:00
e76be06550
[Hardware][Intel-Gaudi] [CI/Build] Add tensor parallel size = 2 test to HPU CI ( #18709 )
...
Signed-off-by: Lukasz Durejko <ldurejko@habana.ai >
2025-05-26 05:26:07 -07:00
0877750029
[CI/Build] Split pooling and generation extended language models tests in CI ( #18705 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-26 04:00:08 -07:00
6d68030f1c
[Model] Add support for YARN in NemotronNAS models ( #18427 )
...
Signed-off-by: Nave Assaf <nassaf@nvidia.com >
2025-05-26 10:31:49 +00:00
5a2c76cbe1
[CI] fix dump_input for str type ( #18697 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-26 18:23:35 +08:00
38b13dfe78
[CI/Build] Replace math.isclose with pytest.approx ( #18703 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 02:05:17 -07:00
61a45e7a72
[Bugfix] Fix Mistral-format models with sliding window ( #18693 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 01:44:04 -07:00
65523a0995
[Doc] Fix issue template format ( #18699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 00:45:39 -07:00
4b7740a105
[GH] Add issue template for reporting CI failures ( #18696 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 00:42:04 -07:00
4ea62c0ea0
[CI] add missing argument ( #18694 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-26 00:22:04 -07:00
561b77a0d6
[Bugfix] Fix the lm_head in gpt_bigcode in lora mode ( #6357 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
2025-05-26 14:52:25 +08:00
abd4030d94
refactor: simplify request handler, use positive condition check for handler assignment ( #18690 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-05-26 06:32:28 +00:00
8820821b59
[Misc] Fixed the abnormally high TTFT issue in the PD disaggregation example ( #18644 )
...
Signed-off-by: zhaohaidao <zhaohaidao2008@hotmail.com >
Signed-off-by: zhaohaiyuan <zhaohaiyuan@xiaohongshu.com >
Co-authored-by: zhaohaiyuan <zhaohaiyuan@xiaohongshu.com >
2025-05-26 13:51:27 +08:00
fba0642704
[CI/Build][Doc] Update gte-Qwen2-1.5B-instruct usage ( #18683 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-05-25 20:27:50 -07:00
6071e989df
[Core][Multimodal] Convert PIL Image to array without data copy when hashing ( #18682 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-25 17:33:35 +00:00
57fd13a707
[Bugfix] Fix profiling dummy data for Pixtral ( #18677 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-25 14:05:30 +00:00
3a886bd58c
[Misc] small improve ( #18680 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-25 06:05:38 -07:00
35be8fad62
[CI/build] fix no regex ( #18676 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-25 10:10:51 +00:00
f2faac745d
[Bugfix] Fix cpu usage and cache hit stats reporting on cpu environment ( #18674 )
...
Signed-off-by: zzzyq <zhangyuqi94@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-25 02:36:06 -07:00
279f854519
[doc] improve readability ( #18675 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-25 01:40:31 -07:00
624b77a2b3
[doc] fix broken links ( #18671 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-25 01:36:33 -07:00
503f8487c2
[Misc] Reduce logs on startup ( #18649 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 23:03:53 -07:00
44073a7ac3
[BUGFIX] catch subclass first for try...except ( #18672 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-25 05:34:24 +00:00
63934543a0
Speed up the kernels/quantization/ tests ( #18669 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-25 05:02:59 +00:00
75f81750f3
[VLM] Initialize video input support for InternVL models ( #18499 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-25 04:51:25 +00:00
6ab681bcbe
[Misc][ModelScope] Change to use runtime VLLM_USE_MODELSCOPE ( #18655 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-05-25 04:51:21 +00:00
cebc22f3b6
[Misc]Replace cuda hard code with current_platform in Ray ( #14668 )
...
Signed-off-by: noemotiovon <757486878@qq.com >
2025-05-24 20:26:31 -07:00
6c6dcd8611
[MISC] correct signature for LoaderFunction ( #18670 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-24 20:17:47 -07:00
7891fdf0c6
[V1] Fix _pickle.PicklingError: Can't pickle <class 'transformers_modules.deepseek-ai.DeepSeek-V2-Lite... ( #18640 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-05-24 20:07:20 -07:00
6825d9a998
[BugFix][Spec Decode] Improve Prefix Caching Logic in Speculative Decoding ( #18668 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-24 17:33:46 -07:00
b554ab736e
[CI/Build] fix permission denied issue ( #18645 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-24 16:09:10 +00:00
9ea7f1abf3
fix(regression): clone from reference items ( #18662 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-24 15:25:20 +00:00
2807271c86
[CI] enforce import regex instead of re ( #18665 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-24 08:04:14 -07:00
b9018a3f9f
[BugFix] Fix import error for fused_moe ( #18642 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-05-24 07:53:36 -07:00
4ceafb6299
[MISC] typo fix and clean import ( #18664 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-24 07:52:09 -07:00
2e6705784f
[CI/Build] chmod +x to cleanup_pr_body.sh ( #18650 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 07:26:45 -07:00
1cb194a018
[Doc] Reorganize user guide ( #18661 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 07:25:33 -07:00
2cd4d58df4
[Model] use AutoWeightsLoader for gpt2 ( #18625 )
...
Signed-off-by: zt2370 <ztang2370@gmail.com >
2025-05-24 13:36:13 +00:00
6d166a8d35
[Doc] Add community links ( #18657 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 06:06:38 -07:00
ef1dd6870f
[Doc] Fix indentation problems in V0 Paged Attention docs ( #18659 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 06:06:35 -07:00
e77dc4bad8
[MISC][pre-commit] Add pre-commit check for triton import ( #17716 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-24 20:09:15 +08:00
07458a51ce
[Doc] Update README links, mark external links ( #18635 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 09:57:15 +00:00
c1e4a4052d
[V1][Spec Decode] Support multi-layer eagle draft model ( #18030 )
...
Signed-off-by: qizixi <qizixi@meta.com >
2025-05-24 09:45:34 +00:00
a859320575
[Model] Add support for Qwen2.5-Omni-7B-AWQ (Qwen2_5OmniForConditionalGeneration) ( #18647 )
2025-05-24 09:15:36 +00:00
441dc63ac7
[Frontend] improve vllm serve --help display ( #18643 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-24 07:53:22 +00:00
d55e446d13
[V1][Spec Decode] Small refactors to improve eagle bookkeeping performance ( #18424 )
...
Signed-off-by: qizixi <qizixi@meta.com >
2025-05-24 06:51:22 +00:00
ec82c3e388
FIX MOE issue in AutoRound format ( #18586 )
...
Signed-off-by: wenhuach21 <wenhua.cheng@intel.com >
2025-05-23 22:01:40 -07:00
45ab403a1f
config.py: Clarify that only local GGUF checkpoints are supported. ( #18623 )
...
Signed-off-by: Mathieu Bordere <mathieu@letmetweakit.com >
2025-05-24 08:46:34 +08:00
2b10ba7491
[Bugfix][Nixl] Fix Preemption Bug ( #18631 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-05-23 23:30:16 +00:00
4fc1bf813a
[Bugfix] Migrate to REGEX Library to prevent catastrophic backtracking ( #18454 )
...
Signed-off-by: Crucifixion-Fxl <xmufxl@gmail.com >
Co-authored-by: Crucifixion-Fxl <xmufxl@gmail.com >
2025-05-23 16:16:26 -07:00
f2036734fb
[ModelOpt] Introduce VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE env var to control blockscale tensor allocation ( #18160 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-05-23 15:52:20 -07:00
7d9216495c
[Doc] Update references to doc files ( #18637 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 15:49:21 -07:00
0ddf88e16e
[CI] Enable test_initialization to run on V1 ( #16736 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-23 15:09:44 -07:00
1645b60196
Use prebuilt FlashInfer x86_64 PyTorch 2.7 CUDA 12.8 wheel for CI ( #18537 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-05-23 21:17:16 +00:00
2628a69e35
[V1] Support Deepseek MTP ( #18435 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn >
Co-authored-by: Rui Qiao <ruisearch42@gmail.com >
2025-05-23 10:26:28 -07:00
371f7e4ca2
[Doc] Fix broken links and unlinked docs, add shortcuts to home sidebar ( #18627 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 10:22:40 -07:00
15b45ffb9a
[Doc] Avoid documenting dynamic / internal modules ( #18626 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 09:58:02 -07:00
273cb3b4d9
[Doc] Fix top-level API links/docs ( #18621 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 09:46:56 -07:00
8ddd1cf26a
[Doc] fix list formatting ( #18624 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-05-23 09:41:17 -07:00
6550114c9c
[v1] Redo "Support multiple KV cache groups in GPU model runner ( #17945 )" ( #18593 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-23 09:39:47 -07:00
9520a989df
[Docs] Change mkdocs to not use directory urls ( #18622 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-23 09:33:21 -07:00
3d28ad343f
Fix figures in design doc ( #18612 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 09:09:54 -07:00
6a7988c55b
Refactor pplx init logic to make it modular (prepare for deepep) ( #18200 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-05-23 23:43:43 +08:00
022d8abe29
[Doc] Use a different color for the announcement ( #18616 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 08:25:03 -07:00
5221815a00
[Doc] Fix markdown list indentation for MkDocs rendering ( #18620 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-05-23 08:23:21 -07:00
1068556b2c
[Bugfix][Build/CI] Fixup CUDA compiler version check for CUDA_SUPPORTED_ARCHS ( #18579 )
2025-05-23 07:43:58 -07:00
2cd1fa4556
[Misc] add Haystack integration ( #18601 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-23 06:21:19 -07:00
d4c2919760
Include private attributes in API documentation ( #18614 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 06:18:31 -07:00
6220f3c6b0
[Bugfix] Fix transformers model impl ignored for mixtral quant ( #18602 )
...
Signed-off-by: Tristan Leclercq <tristanleclercq@gmail.com >
2025-05-23 05:54:13 -07:00
52fb23f47e
Fix examples with code blocks in docs ( #18609 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 05:53:44 -07:00
6dd51c7ef1
[CI/Build] Fix V1 flag being set in entrypoints tests ( #18598 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 05:51:53 -07:00
2edb533af2
Replace {func} with mkdocs style links ( #18610 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 05:51:38 -07:00
38a95cb4a8
[Doc] Fix indent of contributing to vllm ( #18611 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-05-23 05:50:07 -07:00
cd821ea5d2
[CI] fix kv_cache_type argument ( #18594 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-23 04:49:18 -07:00
7ab056c273
[Hardware][CPU] Update intel_extension_for_pytorch 2.7.0 and move to requirements/cpu.txt ( #18542 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-05-23 04:38:42 -07:00
6526e05111
Add myself as docs code owner ( #18605 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 04:08:31 -07:00
e493e48524
[V0][Bugfix] Fix parallel sampling performance regression when guided decoding is enabled ( #17731 )
...
Signed-off-by: Madeesh Kannan <shadeMe@users.noreply.github.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-05-23 03:38:23 -07:00
4ce64e2df4
[Bugfix][Model] Fix baichuan model loader for tp ( #18597 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-23 02:39:05 -07:00
fbb13a2c15
Revert "[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal ( #18034 )" ( #18600 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 02:18:22 -07:00
a1fe24d961
Migrate docs from Sphinx to MkDocs ( #18145 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 02:09:53 -07:00
d0bc2f810b
[Bugfix] Add half type support in reshape_and_cache_cpu_impl on x86 cpu platform ( #18430 )
...
Signed-off-by: Yuqi Zhang <yuqizhang@google.com >
Co-authored-by: Yuqi Zhang <yuqizhang@google.com >
2025-05-23 01:41:37 -07:00
b046cf792d
[Feature][V1]: supports cached_tokens in response usage ( #18149 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-05-23 01:41:03 -07:00
54af915949
[Doc] Update quickstart and install for cu128 using --torch-backend=auto ( #18505 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-23 08:36:37 +00:00
71ea614d4a
[Feature]Add async tensor parallelism using compilation pass ( #17882 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-05-23 01:03:34 -07:00
4c611348a7
[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal ( #18034 )
...
Signed-off-by: Ronald Xu <ronaldxu@amazon.com >
2025-05-23 00:37:18 -07:00
60cad94b86
[Hardware] correct method signatures for HPU,ROCm,XPU ( #18551 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-22 22:31:59 -07:00
9c1baa5bc6
[Misc] Replace cuda hard code with current_platform ( #16983 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-05-23 04:38:50 +00:00
4be2255c81
[Bugfix][Benchmarks] Fix a benchmark of deepspeed-mii backend to use api_key ( #17291 )
...
Signed-off-by: Teruaki Ishizaki <teruaki.ishizaki@ntt.com >
2025-05-23 12:30:47 +08:00
ed5d408255
[Neuron] Remove bypass on EAGLEConfig and add a test ( #18514 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
2025-05-22 21:26:32 -07:00
583507d130
[Spec Decode] Make EAGLE3 draft token ID mapping optional ( #18488 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-22 20:17:39 -07:00
e44d8ce8c7
[Bugfix] Set KVTransferConfig.engine_id in post_init ( #18576 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-05-23 02:54:42 +00:00
93ecb8139c
[BugFix] Increase TP execute_model timeout ( #18558 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-23 10:22:11 +08:00
fae453f8ce
[Misc] refactor: simplify input validation and num_requests handling in _convert_v1_inputs ( #18482 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-05-23 10:15:32 +08:00
4b0da7b60e
Enable hybrid attention models for Transformers backend ( #18494 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 10:12:08 +08:00
c6b636f9fb
[V1][Spec Decoding] Use model_loader.get_model() to load models ( #18273 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-23 02:05:44 +00:00
04eb88dc80
Re-submit: Fix: Proper RGBA -> RGB conversion for PIL images. ( #18569 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-05-23 01:59:18 +00:00
46791e1b4b
[AMD] [P/D] Compute num gpus for ROCm correctly in run_accuracy_test.sh ( #18568 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-05-22 18:45:35 -07:00
c32e249a23
[Frontend] [Core] Add Tensorizer support for V1, LoRA adapter serialization and deserialization ( #17926 )
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com >
2025-05-22 18:44:18 -07:00
c91fe7b1b9
[Frontend][Bug Fix] Update llama4 pythonic jinja template and llama4_pythonic parser ( #17917 )
...
Signed-off-by: Kai Wu <kaiwu@meta.com >
2025-05-22 16:44:08 -07:00
a04720bc36
[V1][Spec Decode][Bugfix] Load quantize weights for EAGLE ( #18290 )
2025-05-22 15:17:33 -07:00
7b9d832c80
[Tool] Add NIXL installation script ( #18172 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-05-22 14:33:16 -07:00
6e588da0f4
[Build/CI] Fix CUDA 11.8 build ( #17679 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-22 12:13:54 -07:00
f8d2cc5f55
[Compile][Platform] Make PiecewiseBackend pluggable and extendable ( #18076 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-05-22 12:11:53 -07:00
721fb9b181
[Platform] Move platform check to right place ( #18470 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-05-22 12:11:28 -07:00
1f3a1200e4
[Bugfix] make test_openai_schema.py pass ( #18224 )
...
Signed-off-by: David Xia <david@davidxia.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-22 18:34:06 +00:00
54631f8262
[Misc] Call ndarray.tobytes() directly instead of ndarray.data.tobytes() ( #18347 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-22 09:00:13 -07:00
cb506ecb5a
[Misc] improve Automatic Prefix Caching example ( #18554 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-22 14:50:46 +00:00
93f71673ce
[BugFix][CPU] Fix x86 SHM distributed module initialization ( #18536 )
...
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-05-22 07:35:00 -07:00
3f505233fd
[Doc] Add stream flag for chat completion example ( #18524 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-05-22 14:07:10 +00:00
4e04eceb58
[Bugfix] Use random hidden states in dummy sampler run ( #18543 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
2025-05-22 06:48:56 -07:00
71075029f2
[Doc] Support --stream arg in openai_completion_client.py script ( #18388 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-05-22 13:20:17 +00:00
ca86a7cf6e
[CI/Build] Update bamba test model location ( #18544 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-22 06:01:07 -07:00
a35a494745
[Bugfix] Add kwargs to RequestOutput __init__ to be forward compatible ( #18513 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-05-22 05:24:43 -07:00
f6037d1907
[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text ( #18526 )
...
Co-authored-by: 松灵 <wpf272043@alibaba-inc.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-22 05:22:53 -07:00
fa72f9a812
Order sequence ids + config update to support specifying custom quantization layers ( #18279 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
Co-authored-by: Tailin Pan <tailinpa@amazon.com >
Co-authored-by: Rishabh Rajesh <rishyraj@amazon.com >
Co-authored-by: Yishan McNabb <yishanm@amazon.com >
Co-authored-by: Patrick Lange <patlange@amazon.com >
Co-authored-by: Maxwell Goldberg <mgld@amazon.com >
Co-authored-by: Aakash Shetty <sheaak@amazon.com >
2025-05-22 02:20:36 -07:00
ebed81fbf5
Update default neuron config for speculation ( #18274 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
Co-authored-by: Shashwat Srijan <sssrijan@amazon.com >
Co-authored-by: Aakash Shetty <sheaak@amazon.com >
2025-05-22 02:18:55 -07:00
e2d7d31244
[Neuron] Update Dockerfile.neuron to use latest neuron release (2.23) ( #18512 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
2025-05-22 02:17:34 -07:00
23b67b37b2
[Doc] Fix invalid JSON in example args ( #18527 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-22 07:11:46 +00:00
db5a29ba19
[Bugfix] Fix LoRA test ( #18518 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-21 21:48:53 -07:00
51797775c3
[Bugfix][Model] Make Olmo2Model weight loading return loaded weights ( #18504 )
...
Signed-off-by: Shane A <shanea@allenai.org >
2025-05-21 21:17:03 -07:00
cf5984b2fe
[BugFix][DP] Send DP wave completion only from dp_rank==0 ( #18502 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: kourosh hakhamaneshi <kourosh@anyscale.com >
2025-05-21 20:25:25 -07:00
d022115cc6
[Bugfix] Inconsistent token calculation compared to HF in llava family ( #18479 )
...
Signed-off-by: jaycha <jaycha@ncsoft.com >
2025-05-21 20:21:47 -07:00
acb54ca8e1
Initialize io_thread_pool attribute in the beginning. ( #18331 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-21 20:21:14 -07:00
6e0fd34d3c
[CI] Fix race condition with StatelessProcessGroup.barrier ( #18506 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-21 20:19:13 -07:00
176d62e4ea
[MISC] update project urls in pyproject.toml ( #18519 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-21 20:17:34 -07:00
20bd6f4d2e
[FalconH1] Fix output dtype in RMSNorm fallback path for Falcon-H1 (e.g. 0.5B) ( #18500 )
...
Signed-off-by: dhia.rhaiem <dhia.rhaiem@tii.ae >
Co-authored-by: younesbelkada <younesbelkada@gmail.com >
Co-authored-by: Ilyas Chahed <ilyas.chahed@tii.ae >
Co-authored-by: Jingwei Zuo <jingwei.zuo@tii.ae >
2025-05-21 19:23:59 -07:00
1f079540db
[Bugfix] Consistent ascii handling in tool parsers ( #17704 )
...
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com >
2025-05-21 20:41:23 +00:00
94d8ec8d2b
[FEAT][ROCm] Upgrade AITER MLA v1 backend ( #18338 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-05-21 10:34:28 -07:00
bb0a311213
Revert "[v1] Support multiple KV cache groups in GPU model runner ( #17945 )" ( #18459 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-21 10:25:23 -07:00
dd5fa7e04f
[ROCm][Kernel][V1] Enable AMD Radeon GPU Custom Paged Attention on v1 ( #17004 )
...
Signed-off-by: Hosang Yoon <hosang.yoon@amd.com >
2025-05-21 08:35:00 -07:00
2b16104557
[Misc] Update deprecation message for --enable-reasoning ( #18404 )
2025-05-21 07:33:11 -07:00
371376f996
[Build] fix Dockerfile shell ( #18402 )
2025-05-21 07:32:06 -07:00
c6c10ca920
[Bugfix] Reduce moe_sum test size to avoid OOM ( #18484 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-05-21 06:46:39 -07:00
c154d89306
[Doc] fix arg docstring in linear layers ( #18410 )
...
Signed-off-by: giantcroc <1204449533@qq.com >
2025-05-21 06:45:57 -07:00
eca18691d2
[MODEL] FalconH1 ( #18406 )
...
Signed-off-by: dhia.rhaiem <dhia.rhaiem@tii.ae >
Co-authored-by: younesbelkada <younesbelkada@gmail.com >
Co-authored-by: Ilyas Chahed <ilyas.chahed@tii.ae >
Co-authored-by: Jingwei Zuo <jingwei.zuo@tii.ae >
2025-05-21 04:59:06 -07:00
61acfc45bc
[Bugfix][Failing Test] Fix test_events.py ( #18460 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-21 04:57:28 -07:00
107f5fc4cb
[Misc] refactor disaggregated-prefill-v1 example ( #18474 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-21 11:10:14 +00:00
907f935de9
[V1] Fix general plugins not loaded in engine for multiproc ( #18326 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-05-21 01:21:49 -07:00
5d7f545204
[Frontend] deprecate --device arg ( #18399 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-05-21 01:21:17 -07:00
cd8dfc6dfc
[Misc] MultiConnector._connectors type ( #18423 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
2025-05-20 22:48:43 -07:00
d06dd72ba9
[Bugfix][Failing Test] Fix nixl connector test when prompt size < block size ( #18429 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-05-20 22:41:44 -07:00
ad0012a0ac
Revert "[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text ( #18407 )" ( #18456 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-20 22:39:22 -07:00
92247c522e
[Bug] Fix moe_sum signature ( #18440 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-05-20 22:37:08 -07:00
0c15c2e486
[Bugfix] config.head_dim is now explicitly set to None ( #18432 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-20 21:04:33 -07:00
3b17ea26e4
[TPU] Re-enable the Pallas MoE kernel ( #18025 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-05-20 19:52:27 -07:00
23baa2180b
fix:Build torch wheel inline rather than picking from nightly ( #18351 )
...
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com >
2025-05-20 22:22:24 +00:00
980a172474
[Kernel] update comment for KV shape in unified triton attn ( #18099 )
...
Signed-off-by: haochengxia <xhc_1007@163.com >
2025-05-20 11:19:34 -07:00
e1f5a71ed7
[Model] use AutoWeightsLoader for bloom ( #18300 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-05-20 09:40:05 -07:00
f4a8a37465
[Minor] Rename quantization nvfp4 to modelopt_fp4 ( #18356 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-20 09:08:37 -07:00
8f55962a7f
[Misc] refactor prompt embedding examples ( #18405 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-20 15:26:12 +00:00
be48360c1f
[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text ( #18407 )
...
Co-authored-by: 松灵 <wpf272043@alibaba-inc.com >
2025-05-20 06:59:48 -07:00
86847700d7
[CI] Add mteb testing to test the accuracy of the embedding model ( #17175 )
2025-05-20 06:51:12 -07:00
d6c86d09ae
Update cpu.txt ( #18398 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-05-20 10:53:23 +00:00
6b35cb10a0
[Misc] Add LoRA code owner ( #18387 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-20 03:27:30 -07:00
1b1e8e05ff
[doc] update env variable export ( #18391 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-20 08:53:27 +00:00
bca55b556f
[Bugfix] fix adding bias twice in ipex GPTQ quantization ( #18363 )
...
Signed-off-by: rand-fly <randfly@outlook.com >
2025-05-20 00:54:33 -07:00
d981396778
[release] Change dockerhub username for TPU release ( #18389 )
2025-05-19 23:49:23 -07:00
9609327fa4
[Core] [Bugfix]: tensor parallel with prompt embeds ( #18171 )
...
Signed-off-by: Nan2018 <nan@protopia.ai >
Co-authored-by: Andrew Sansom <andrew@protopia.ai >
2025-05-19 20:21:27 -07:00
f07a673eb2
[Misc] Allow AutoWeightsLoader to skip loading weights with specific substr in name ( #18358 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-19 20:20:12 -07:00
d565e0976f
[neuron] fix authorization issue ( #18364 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-05-19 23:30:32 +00:00
258bf621d5
fix CUDA_check redefinition in #17918 ( #18287 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-05-19 13:42:35 -07:00
dc1440cf9f
Neuron up mistral ( #18222 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
2025-05-19 09:54:47 -07:00
8171221834
[Misc] Fix typo ( #18330 )
2025-05-19 09:51:01 -07:00
7937c2fd52
Add files via uploadAdd fused MoE kernel tuning configs (fp8_w8a8) for DeepSeek V3/R1 on a single-node 8x NVIDIA H20 96GB setup ( #18337 )
2025-05-19 09:49:57 -07:00
e2ee1e8e9e
[Feature]Add support for models quantized with AutoRound ( #17850 )
...
Signed-off-by: wenhuach21 <wenhua.cheng@intel.com >
2025-05-19 09:38:53 -07:00
20d8ce81eb
[Frontend] add --quick option for vllm chat/complete ( #18297 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-19 09:36:13 -07:00
84ab4feb7e
[Doc] Fix typo ( #18355 )
2025-05-19 16:05:16 +00:00
6781af5608
[Quantization] Pool model support bitsandbytes ( #18087 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-19 09:03:43 -07:00
1b15df2546
[BugFix] Fix handling of num_computed_tokens with connector ( #18232 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2025-05-19 09:03:25 -07:00
43b5f61dce
[Doc] Move input-related docs to Features ( #18353 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-19 15:08:39 +00:00
c5bb0ebdc6
[Doc] Fix prompt embedding examples ( #18350 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-05-19 06:48:16 -07:00
d637b96099
[BugFix] [Vul] Add missing usedforsecurity=False in MD5 hashing to enable FIPS ( #18319 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
Signed-off-by: shaoyuyoung <shaoyuyoung@gmail.com >
Co-authored-by: cascade <cascade812@outlook.com >
2025-05-19 01:31:23 -07:00
275c5daeb0
fix: Add type specifications for CLI arguments in tensorizer options ( #18314 )
2025-05-18 23:42:17 -07:00
47fda6d089
[Build] Supports CUDA 12.6 and 11.8 after Blackwell Update ( #18316 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-05-18 23:19:33 -07:00
27d0952600
[Misc] extract parser.parse_args() ( #18323 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-19 04:06:26 +00:00
221cfc2fea
Feature/vllm/input embedding completion api ( #17590 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: Nan2018 <nan@protopia.ai >
Co-authored-by: 临景 <linjing.yx@alibaba-inc.com >
Co-authored-by: Bryce1010 <bryceyx@gmail.com >
Co-authored-by: Andrew Sansom <andrew@protopia.ai >
Co-authored-by: Andrew Sansom <qthequartermasterman@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-18 20:18:05 -07:00
9da1095daf
[Spec Decode][V0] Fix spec decode correctness test in V0 eagle/medusa ( #18175 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-05-18 19:49:46 -07:00
d1211f8794
[Doc] Add doc to explain the usage of Qwen3 thinking ( #18291 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-05-18 23:04:07 +00:00
b6a6e7a529
[Misc] add litellm integration ( #18320 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-18 15:32:30 +00:00
4fb349f66a
Fix copy-paste error in phi4mm image processing ( #18315 )
...
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com >
2025-05-18 07:00:12 -07:00
908733aca7
[Model] Use sigmoid for single-label classification ( #18313 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-05-18 07:00:09 -07:00
1a8f68bb90
[doc] update reasoning doc ( #18306 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-18 06:59:14 -07:00
9ab2c02ff8
Support sequence parallelism combined with pipeline parallelism ( #18243 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-05-17 22:47:25 +00:00
66e63e86ec
[MISC] fix typo ( #18305 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-17 10:52:09 -07:00
9214e60631
[Model] use AutoWeightsLoader for solar ( #18113 )
2025-05-17 00:24:17 -07:00
f880d42582
Fixed build on ppc64le due to openssl conflicts ( #18262 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
2025-05-17 00:23:46 -07:00
dcfe95234c
Update Dockerfile to build for Blackwell ( #18095 )
2025-05-17 00:23:25 -07:00
48ac2bed5b
[Hardware][TPU] Optionally import for TPU backend ( #18269 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com >
Co-authored-by: Carol Zheng <cazheng@google.com >
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com >
Co-authored-by: Hongmin Fan <fanhongmin@google.com >
2025-05-17 15:23:12 +08:00
3e0d435027
[P/D][V1] Support dynamic loading of external KV connector implementations ( #18142 )
...
Signed-off-by: David Ben-David <davidb@pliops.com >
Co-authored-by: David Ben-David <davidb@pliops.com >
2025-05-17 06:40:39 +00:00
4ee4826ede
[BugFix] Correct max_model_len derivation from config.json for Mistral format ( #17937 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
Co-authored-by: tracelogfb <48808670+tracelogfb@users.noreply.github.com >
Co-authored-by: Stephen Chen <tracelog@meta.com >
2025-05-17 04:20:13 +00:00
60017dc841
[Misc] reformat the collect-env output ( #18285 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-16 19:46:18 -07:00
55f1a468d9
Move cli args docs to its own page ( #18228 ) ( #18264 )
...
Signed-off-by: Trevor Royer <troyer@redhat.com >
2025-05-16 19:43:45 -07:00
fd195b194e
[V1][P/D] Local attention optimization for NIXL ( #18170 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-16 21:16:33 -04:00
fabe89bbc4
[Spec Decode] Don't fall back to V0 when spec decoding is enabled ( #18265 )
2025-05-16 16:10:27 -07:00
e73b7dfd69
[Bugfix] fix an illegal memory access was encountered of marlin kernel + act_order ( #18245 )
2025-05-16 16:02:44 -07:00
7fdfa01530
[Sampler] Adapt to FlashInfer 0.2.3 sampler API ( #15777 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-05-16 15:14:03 -07:00
aef94c6d07
[CI] Assign reviewer to mergify with changes to Tensorizer files ( #18278 )
2025-05-16 12:04:14 -07:00
0ceaebf87b
[BugFix] Fix ordering of KVConnector finished send/rcv sets ( #18211 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-16 09:20:54 -07:00
1db4f47f81
[BugFix] Fix multi async save in MultiConnector ( #18246 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-16 08:13:47 -07:00
d3d91b6f71
[Misc][MacOS] fix bfloat16 error ( #18249 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-16 15:05:59 +00:00
87d871470d
[Model] Use autoweightloader for dbrx ( #18251 )
...
Signed-off-by: learner0810 <zhongjun.li@daocloud.io >
2025-05-16 07:54:13 -07:00
a5f8c111c2
[Fix] Fix typo in resolve_hf_chat_template ( #18259 )
...
Signed-off-by: Felix Marty <felmarty@amd.com >
2025-05-16 14:52:41 +00:00
e23564cb70
use ceil_div in cutlass block scaling shape check ( #17918 )
2025-05-16 03:02:58 -07:00
390ec88905
[Misc] Consolidate Audio tests into multimodal common generation tests ( #18214 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-16 09:18:08 +00:00
541817670c
[Misc] Add Ray Prometheus logger to V1 ( #17925 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-05-16 01:02:42 -07:00
67da5720d4
[PERF] Speed up Qwen2.5-VL model by speed up rotary position embedding ( #17973 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai >
2025-05-15 23:31:02 -07:00
5c04bb8b86
[doc] fix multimodal example script ( #18089 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-05-16 06:05:34 +00:00
3d2779c29a
[Feature] Support Pipeline Parallism in torchrun SPMD offline inference for V1 ( #17827 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
2025-05-15 22:28:27 -07:00
6b31c84aff
Throw better error for when running into k8s service discovery issue ( #18209 )
...
Signed-off-by: Will Eaton <weaton@redhat.com >
2025-05-15 21:07:28 -07:00
b18201fe06
Allow users to pass arbitrary JSON keys from CLI ( #18208 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-15 21:05:34 -07:00
f4937a51c1
[Model] vLLM v1 supports Medusa ( #17956 )
...
Signed-off-by: lisiqi23 <lisiqi23@xiaomi.com >
Signed-off-by: skylee-01 <497627264@qq.com >
Co-authored-by: lisiqi23 <lisiqi23@xiaomi.com >
2025-05-15 21:05:31 -07:00
ee659e3b60
[Bugfix][ROCm] Use chunked_prefill_paged_decode as fallback for V1 attention on ROCm ( #18093 )
...
Signed-off-by: kf <kuanfu.liu@embeddedllm.com >
2025-05-15 19:30:17 -07:00
4e1c6a0264
[Bugfix] fix rotary embedding test for _get_padded_tensor_shape ( #18229 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-16 01:32:45 +00:00
c7852a6d9b
[Build] Allow shipping PTX on a per-file basis ( #18155 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-15 16:41:55 -07:00
8795eb9975
[Bugfix] Fix test_eagle test ( #18223 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
2025-05-15 15:59:42 -07:00
0b34593017
Adding "AMD: Tensorizer Test" to amdproduction. ( #18216 )
2025-05-15 11:01:25 -07:00
e3f3aee6f4
[Misc] Avoid cuda graph log when sizes still match ( #18202 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-05-15 09:59:38 -07:00
92540529c0
[Bugfix] [ROCm]: Remove assertion logic when using AITER fused moe in unquantizedMethod to reenable LLama4 BF16 ( #18205 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-15 09:53:18 -07:00
fadb8d5c2d
[Bugfix]Change the exception thrown by call_hf_processor from RuntimeError to ValueError ( #18181 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-05-15 09:01:47 -07:00
2aa5470ac5
[Frontend] Fix chat template content format detection ( #18190 )
...
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com >
2025-05-15 09:00:21 -07:00
51ff154639
Improve examples rendering in docs and GitHub ( #18203 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-15 15:57:49 +00:00
566ec04c3d
Adding "Basic Models Test" and "Multi-Modal Models Test (Extended) 3" in AMD Pipeline ( #18106 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-15 08:49:23 -07:00
01c22335ba
[Kernel] [V1] Fix performance regression for triton unified attention ( #18161 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-15 06:39:00 -07:00
451da4bcbd
add tools into TokenizeChatRequest ( #18187 )
...
Signed-off-by: yangxia <yangxiast@gmail.com >
2025-05-15 04:01:49 -07:00
07ad27121f
Update deprecated type hinting in model_loader ( #18130 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-15 04:00:21 -07:00
a9944aabfa
fix: typos ( #18151 )
...
Signed-off-by: omahs <73983677+omahs@users.noreply.github.com >
2025-05-15 02:16:15 -07:00
a8f5aec20a
[V1] Update zmq socket creation in nixl connector ( #18148 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-14 23:17:57 -07:00
de71fec81b
[CI] don't skip fixed test_kv_cache_events() ( #18183 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-05-14 23:17:16 -07:00
70f8b96724
[Bugfix] Fix FusedMoEPrepareAndFinalize for cuda-disalike backends ( #18178 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-14 23:16:31 -07:00
dd2a94596a
[Model] Allow the use of sliding window in Qwen2 ( #17772 )
...
Signed-off-by: inkcherry <mingzhi.liu@intel.com >
2025-05-14 22:29:38 -07:00
420caf7557
[UT] Add ut for none hash ( #17892 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-15 13:28:11 +08:00
4f07a64075
Support custom implementations of VideoLoader backends. ( #18091 )
2025-05-15 13:26:49 +08:00
e6b8e65d2d
[Bugfix] Fix fp8 tests for triton_unified_attention for Triton 3.3 ( #18013 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-15 13:26:34 +08:00
26d0419309
Update deprecated type hinting in models ( #18132 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-14 22:06:50 -07:00
83f74c698f
[Fix][ROCm] Enforce eager for all encoder-decoder models on ROCm ( #18154 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2025-05-14 22:04:43 -07:00
2dff093574
[Misc] add lobe-chat support ( #18177 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-15 05:02:23 +00:00
afe3236e90
[Chore] astral's ty ( #18116 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-15 05:00:43 +00:00
65334ef3b9
[V1][Metrics] Remove unused code ( #18158 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-14 20:13:17 -07:00
e60f550b38
[v1] Support multiple KV cache groups in GPU model runner ( #17945 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-14 18:54:54 -07:00
f25e0d1125
[Bugfix]: make most of test_openai_schema.py pass ( #17664 )
2025-05-14 17:04:35 -07:00
09f106a91e
Upload vllm index for the rc builds ( #18173 )
2025-05-14 16:35:56 -07:00
2142035b51
[V1] Support multiple kv connectors ( #17564 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-05-14 16:28:02 -07:00
78aa341d12
[CI] Fix race condition in test_kv_cache_events test ( #18169 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-14 16:27:48 -07:00
7974736740
Add support for loading torchao models with AOPerModuleConfig ( #17826 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-05-14 16:24:59 -07:00
2fc9075b82
[V1] Structured Outputs + Thinking compatibility ( #16577 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-05-14 15:45:24 -07:00
d93c976a0d
[Kernel] Have rotary embeddings support tensors ( #18046 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-14 15:43:55 -07:00
749f792553
[Frontend] decrease import time of vllm.multimodal ( #18031 )
...
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com >
2025-05-14 15:43:32 -07:00
856865008e
[CI] Disable Failing Tests ( #18165 )
2025-05-14 13:49:56 -07:00
f9c069c85e
Modularize fused experts and integrate PPLX kernels ( #15956 )
2025-05-14 13:11:54 -07:00
418d2f8bfb
[V1][Spec Decode] Share input embedding of target model with EAGLE draft model to free ~1GB for llama 3 model ( #17326 )
...
Co-authored-by: root <root@ekagra-8xh100.us-east5-a.c.serving-efficiency-poc.internal>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-14 12:31:46 -07:00
964472b966
[Doc] Update prefix cache metrics to counting tokens ( #18138 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-14 15:23:30 +00:00
59dd311cf5
[KVConnector] Keep KVTransferParams as a dict ( #18033 )
2025-05-14 08:05:57 -07:00
d066e52013
[Bugfix] Fix chat utils tests ( #18139 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-14 05:38:21 -07:00
c8ea982d9b
Update deprecated type hinting in platform, plugins, triton_utils, vllm_flash_attn ( #18129 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-14 05:28:16 -07:00
dc372b9c8a
Update deprecated type hinting in vllm/device_allocator and vllm/distributed ( #18126 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-14 04:07:57 -07:00
9b5b39b650
Update deprecated type hinting in vllm/lora ( #18128 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-14 03:57:59 -07:00
9ccc6ded42
[doc] add missing import ( #18133 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-14 10:57:34 +00:00
d62a076e84
[Model] GritLM supports other attention backends ( #18109 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-14 03:33:19 -07:00
259127f8b8
[Bugfix] Fix LoRA test ( #18123 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-14 10:25:47 +00:00
612c2edb4f
[FEAT] [ROCm]: Add AITER CK 2 Stages MoE support ( #17110 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-14 03:03:11 -07:00
38fe728d60
[Bugfix] Fix QKVCrossParallelLinear::sync_weight_attrs for PyTorch compile ( #17844 )
...
Signed-off-by: Andrzej Kotłowski <akotlowski@habana.ai >
2025-05-14 09:39:51 +00:00
82e7f9bb03
[Misc] replace does not exist model ( #18119 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-05-14 02:13:47 -07:00
63dc3426e0
[Model] Add packed_modules_mapping for Qwen3-MOE ( #18118 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-14 02:13:19 -07:00
8f5dc41481
[Bugfix] Fix entrypoints audio test failure ( #18111 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-14 09:08:07 +00:00
63ad622233
[New Model]: support GTE NewModel ( #17986 )
2025-05-14 01:31:31 -07:00
e7ef61c1f0
[Bugfix][Example] make lmcache v0 work. ( #18051 )
...
Signed-off-by: Ma, Jianpeng <jianpeng.ma@intel.com >
2025-05-13 23:43:44 -07:00
d4154c35a2
[Bugfix] fix moe marlin topk_weight loading ( #18080 )
...
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-05-13 23:31:57 -07:00
6685890d11
[Fix] Move "model_config" as keyword args in chat_utils.py ( #18098 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-05-13 23:27:26 -07:00
33011318c2
Fix broken example: examples/offline_inference/profiling at scheduler_config ( #18117 )
2025-05-13 23:19:14 -07:00
4f8b373225
[BugFix][AMD] Compatible patch for AITER lib after 04/20 ( #17912 )
...
Signed-off-by: Qiang Li <qiang.li2@amd.com >
2025-05-13 23:05:20 -07:00
7b2f28deba
[AMD][torch.compile] Enable silu+fp8_quant fusion for rocm ( #18082 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-05-13 22:13:56 -07:00
2d912fb66f
[FEAT] [ROCm] [V1]: Add AITER biased group topk for DeepSeekV3 ( #17955 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-13 22:03:47 -07:00
12e6c0b41c
[Bugfix][V1] Fix FlashInfer V1 backend using the wrong VllmConfig ( #18086 )
2025-05-13 20:36:17 -07:00
9a2a6357de
[Bugfix] Fix FP8 Marlin MoE and enable for compressed-tensors models ( #18026 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-13 19:48:33 -07:00
6266c57bae
[core][distributed] add ep group and all2all interface ( #18077 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-05-14 10:46:49 +08:00
754b699cbe
[Bug]: Fix S3 model/tokenizer path resolution ( #18083 )
...
Signed-off-by: Jon Gill <jon@yurts.ai >
2025-05-13 19:34:17 -07:00
6e27c6d86b
[Misc] Remove unused numpy tensor ( #18084 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
2025-05-13 19:33:40 -07:00
d5af47a149
[P/D] Add some more debug logs to NixlConnector ( #18102 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-13 19:33:03 -07:00
65f0f74b66
[Hardware/NVIDIA/Modelopt] Fix modelopt forward method for v1 torch.compile ( #18101 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-05-13 19:33:00 -07:00
176a95c670
[Fix] Support CUDAGraph capture for encoder-decoder on ROCm ( #18104 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2025-05-13 19:31:42 -07:00
f2ae883b67
[v1][KVCacheManager] pass num_new_computed_tokens to kv cache manager ( #18001 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-13 19:09:39 -07:00
40de1ef455
[FEAT] [ROCm]: Add AITER Block-Scaled GEMM Feature ( #14968 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-13 19:08:20 -07:00
0189a65a2e
[Docs] Expand security doc with firewall info ( #18081 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-13 19:36:00 +00:00
55aa7af994
[V1] DP scale-out (2/N): Decouple engine process management and comms ( #15977 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-13 10:48:21 -07:00
0b217da646
Update deprecated type hinting in vllm/adapter_commons ( #18073 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 08:32:51 -07:00
19324d660c
Update deprecated type hinting in vllm/compilation ( #18072 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 08:32:48 -07:00
fc407a1425
Give auto-merge label workflow permission to add labels to issues ( #18078 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 07:53:13 -07:00
009d9e7590
Convert benchmarks to ruff format ( #18068 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 13:43:29 +00:00
b922c2ebd2
[Bugfix] Fix entrypoints metrics tests ( #18063 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-13 06:42:43 -07:00
00b14e0f16
[CI] set token permissions for pre-commit CI job ( #17729 )
...
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-05-13 13:38:30 +00:00
54e467e6f8
[CI] Add token permissions for add-ready-label CI job ( #17730 )
...
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-05-13 13:38:13 +00:00
79a1d25bbd
[CI] Add workflow permissions for helm CI job ( #17727 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-05-13 12:49:07 +00:00
9944011b30
[CI] Set token permissions for reminder comment CI job ( #17728 )
...
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-05-13 12:46:58 +00:00
8c946cecca
Update deprecated type hinting in vllm/transformers_utils ( #18058 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 04:34:37 -07:00
ff334ca1cd
Update deprecated type hinting in vllm/profiler ( #18057 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 04:34:34 -07:00
6223dd8114
Update deprecated type hinting in model_executor/layers ( #18056 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 04:17:23 -07:00
906f0598fc
[doc] add download/list/delete HF model CLI usage ( #17940 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-13 11:15:51 +00:00
cb528d0585
[Fix] check to make sure processor has chat templates ( #18047 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-13 03:04:10 -07:00
98fcba1575
Convert .buildkite to ruff format ( #17656 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 09:28:31 +00:00
23b3134eb5
[Benchmarks] Refactor run_structured_output_benchmarks.sh ( #17722 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-13 01:47:29 -07:00
ea6ae8cb45
[Bugfix] Fix marlin moe fallback logic for llama4 ( #18042 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-13 07:53:28 +00:00
2ff297dce9
[BugFix] Set default random seed to 0 for V1 ( #17929 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-13 07:52:19 +00:00
8dd0671bac
[Bugfix][V1] Only get input embeddings w/ multi-modal models if first PP ( #17916 )
...
Signed-off-by: Jin Huang <jinhun@amazon.com >
Co-authored-by: Jin Huang <jinhun@amazon.com >
2025-05-13 15:10:07 +08:00
f0d610a8ae
[v1][KVCacheManager] Avoid full cache hit by controlling max_length ( #17999 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-13 06:50:38 +00:00
e57e4d6e9e
Fix Broken macro for cutlass moe ( #18049 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-05-12 23:31:06 -07:00
ee5be834e7
[BugFix] Fix 4-GPU RLHF tests ( #18007 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-12 23:03:55 -07:00
48545728d8
cleanup invalid prints ( #18050 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-05-12 23:01:57 -07:00
dc1a821768
[Feature][V1] Support tool_choice: required when using Xgrammar as the StructuredOutputBackend. ( #17845 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-12 23:01:31 -07:00
61e0a506a3
[Bugfix] Avoid repeatedly creating dummy data during engine startup ( #17935 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-12 22:40:19 -07:00
1df491c522
[Bugfix] Fixes for new marlin moe usage ( #18017 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-13 03:50:04 +00:00
d8487ef557
[ROCm]: Fix build from source failure with gcc14 and ROCm 6.3 ( #13779 )
...
Signed-off-by: Arjun Kathuria <arjun.kathuria8@gmail.com >
2025-05-12 20:36:33 -07:00
c06af9a959
[Misc] Slight spelling modification ( #18039 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-12 20:36:27 -07:00
60f7624334
Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support ( #11844 )
2025-05-12 19:52:47 -07:00
f6518b2b48
[ROCm] Skip tests for quantizations incompatible with ROCm ( #17905 )
...
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com >
2025-05-12 18:39:28 -06:00
d67085c2c8
Remove noisy warnings from SchedulerConfig ( #17995 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 00:33:45 +00:00
307939f299
Use NVFP4 Marlin for CompressedTensorsW4A16Fp4 ( #18000 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
Co-authored-by: Dipika <dipikasikka1@gmail.com >
2025-05-12 18:07:34 -06:00
9d7ea9dbbf
Update some more deprecated type hinting ( #17998 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-12 23:49:33 +00:00
acee8f48aa
[Model] Support MiMo-7B inference with MTP ( #17433 )
...
Signed-off-by: wp-alpha <wangpeng66@xiaomi.com >
Co-authored-by: wangpeng66 <wangpeng66@xiaomi.com >
2025-05-12 23:25:33 +00:00
f065de4e88
Fix FBGEMM integration ( #18002 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-12 23:02:07 +00:00
dc9905368d
[V1][Spec Decode] Eagle unit tests ( #17350 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-05-12 23:01:17 +00:00
ebab1ac37c
[CI] Make JSON output tests less likely to fail ( #17859 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-12 22:31:54 +00:00
2b0db9b0e2
Enable standard language model for torhc nightly ( #18004 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-05-12 14:00:04 -07:00
195adb47c0
[Chore] Remove unused method ( #18024 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-05-12 13:59:47 -07:00
302f3aca7e
[v1][KVCacheManager] Change prefix caching metric from counting blocks to counting tokens ( #18003 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-12 13:46:12 -07:00
e9c730c9bd
Enabling "Weight Loading Multiple GPU Test - Large Models" ( #18020 )
2025-05-12 13:05:33 -07:00
289199feb6
[Core] Use platform-agnostic device control for DP engine core ( #17245 )
...
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com >
2025-05-12 12:09:16 -07:00
b9fd0d7a69
[CI/Build] Fix TPU V1 Test mixed use of & and && across tests ( #17968 )
2025-05-12 12:06:59 -07:00
72a3f6b898
Construct KVTransferConfig properly from Python instead of using JSON blobs without CLI ( #17994 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-12 11:25:33 -07:00
98ea35601c
[Lora][Frontend]Add default local directory LoRA resolver plugin. ( #16855 )
...
Signed-off-by: jberkhahn <jaberkha@us.ibm.com >
2025-05-12 10:39:10 -07:00
d19110204c
[P/D] NIXL Integration ( #17751 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Brent Salisbury <bsalisbu@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: ApostaC <yihua98@uchicago.edu >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Brent Salisbury <bsalisbu@redhat.com >
2025-05-12 09:46:16 -07:00
05a4324f8e
Initialize the delta tool call fields explicitly ( #17340 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: igmainc <igmainc@icloud.com >
2025-05-12 13:28:58 +00:00
7ea6cb28b2
[Misc] Improve modelscope import error ( #17983 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-12 10:46:45 +00:00
9fbf2bfbd5
Correcting testcases in builkite job for IBM Power ( #17675 )
...
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com >
2025-05-12 08:11:55 +00:00
3a5ea75129
[Feature] Support DeepSeekV3 Function Call ( #17784 )
...
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com >
Signed-off-by: Xu Wenqing <xuwq1993@qq.com >
2025-05-12 00:45:21 -07:00
891b9d33de
[Fix] Benchmark "EngineClient" has no attribute "model_config" ( #17976 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-05-11 22:55:53 -07:00
430783018c
[Bugfix][TPU] Use np array when updating cache slot_mapping ( #17971 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-05-12 12:58:33 +08:00
19a3c78d1f
[Bugfix] Fix pydantic.errors.PydanticUserError ( #17962 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-05-12 12:58:23 +08:00
ada50aa295
[bugfix] fix the wrong parser ( #17958 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-12 04:58:02 +00:00
08bf784078
[Bugfix] validate grammar and throw 400 error instead of crashing the engine when xgrammar validation fails ( #17623 )
...
Signed-off-by: Jason Cheng <jasoncky96@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-05-12 09:06:10 +08:00
d45fe333fb
[misc] add instructions on how to install nvshmem/pplx/deepep ( #17964 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-05-11 18:02:39 -07:00
021c16c7ca
[Model] Broadcast Ovis2 implementation to fit Ovis1.6 ( #17861 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-11 17:56:30 -07:00
7de18d541b
[BUG] [ROCm] [MLA] Fix variable name bug due to change in variable name in PR #17483 ( #17961 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-11 09:14:30 -07:00
a810b5b088
[BugFix] [ROCm]: Bugfix and handle addition case of input for rocm_aiter_rms_norm ( #17857 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-11 04:17:11 -07:00
009b3d5382
[Misc] not show --model in vllm serve --help ( #16691 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-11 08:47:58 +00:00
e4b8713380
[New Model]: nomic-embed-text-v2-moe ( #17785 )
2025-05-11 00:59:43 -07:00
06c0922a69
[FP8][ROCm][Attention] Enable FP8 KV cache on ROCm for V1 ( #17870 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-11 15:58:45 +08:00
cd3edfc908
[Misc] Add compressed-tensors NVFP4A16 emulation support ( #17914 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2025-05-11 15:58:38 +08:00
9cea90eab4
[Frontend] Add /classify endpoint ( #17032 )
...
Signed-off-by: Frieda (Jingying) Huang <jingyingfhuang@gmail.com >
2025-05-11 07:57:07 +00:00
d1110f5b5a
[doc] update lora doc ( #17936 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-11 15:56:21 +08:00
8132365b74
[Bugfix]: v1 engine - consider lora adapters in allowed_token_ids ( #17855 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2025-05-11 00:53:58 -07:00
eea22a56ab
fix amd triton mla path ( #17871 )
2025-05-11 07:53:31 +00:00
9112155283
[Perf] Use small max_num_batched_tokens for A100 ( #17885 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-05-11 07:53:23 +00:00
90d0a74b60
[Bugfix] Add revision to transformers.Auto*.from_pretrained processors ( #17948 )
...
Signed-off-by: Xin Li <xin@centml.ai >
2025-05-11 07:52:44 +00:00
d74e5f37bc
[Kernel] fp4 marlin kernel ( #17687 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-05-10 19:58:49 -07:00
ca66a1674c
[v1] Rename specialized_manager.py to single_type_kv_cache_manager.py ( #17946 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-10 16:14:12 -07:00
950751a987
[v1] Pass BlockTable and KVCacheSpec to AttentionMetadataBuilders ( #17483 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-10 16:12:04 -07:00
4c31218f80
[Misc] remove --model from vllm serve usage ( #17944 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-10 13:23:31 +00:00
68311891f5
Don't default construct ModelConfig when default constructing VllmConfig ( #17943 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-10 13:23:00 +00:00
fc4441a4ee
Add missing content type headers to /ping and /health ( #17036 ) ( #17786 )
...
Signed-off-by: Ximo Guanter <ximo.guanter@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-10 07:13:32 +01:00
246e3e0a36
fix broken test vllm:test_kernels - test_attention_selector.py::test_flash_attn ( #17873 )
...
Co-authored-by: Stephen Chen <tracelog@meta.com >
2025-05-10 10:46:54 +08:00
7042cc96b0
[V1][Spec Decoding] Log accumulated metrics after system goes idle ( #17913 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-09 18:23:07 -07:00
0c0fdae84f
[Hardware/NVIDIA/Kernel] Enable nvidia/DeepSeek-R1-FP4 Model ( #16362 )
2025-05-09 16:24:41 -07:00
3b602cdea7
AMD conditional all test execution // new test groups ( #17556 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
2025-05-09 15:35:58 -07:00
4b2ed7926a
Improve configs - the rest! ( #17562 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-09 15:18:44 -07:00
7e3571134f
[V1][Spec Decoding] Include bonus tokens in mean acceptance length ( #17908 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-09 13:32:36 -07:00
ea2236bf95
Add option to use torch._inductor.standalone_compile ( #17057 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-09 12:59:04 -07:00
7d4aedae7c
Handle error when str passed to /v1/audio/transcriptions ( #17909 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-09 19:23:59 +00:00
22481fbfa3
Update CT WNA16MarlinMoE integration ( #16666 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-09 13:19:45 -04:00
5c4c08f6f1
[Misc] Auto fallback to float16 for pre-Ampere GPUs when detected bfloat16 config ( #17265 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-09 17:16:12 +00:00
c44c384b1c
[Misc] Add references in ray_serve_deepseek example ( #17907 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-05-09 16:59:36 +00:00
85b72cb7b1
Revert "[BugFix][AMD] Compatible patch for latest AITER(05/07/2025)" ( #17910 )
2025-05-09 08:58:18 -07:00
6e5595ca39
[CI/Build] Automatically retry flaky tests ( #17856 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-09 09:55:17 -06:00
200da9a517
[v1] Move block management logic from KVCacheManager to SpecializedManager ( #17474 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-09 15:25:34 +00:00
9f64e93415
[BugFix][AMD] Compatible patch for latest AITER(05/07/2025) ( #17864 )
...
Signed-off-by: Qiang Li <qiang.li2@amd.com >
2025-05-09 08:59:36 -06:00
ec61ea20a8
[Misc] add dify integration ( #17895 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-09 03:42:39 -07:00
c6798baa9c
Change top_k to be disabled with 0 (still accept -1 for now) ( #17773 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-09 10:01:49 +00:00
5b2dcbf0b8
Fix Whisper crash caused by invalid ``max_num_batched_tokens`` config ( #17853 )
...
Signed-off-by: inkcherry <mingzhi.liu@intel.com >
2025-05-09 09:16:26 +00:00
6e4a93e3f7
[Bugfix][CPU] Fix broken AVX2 CPU TP support ( #17252 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-09 08:55:14 +00:00
217db4baa6
[Bugfix][ROCm] Fix AITER MLA V1 ( #17880 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-05-09 08:38:21 +00:00
ff8c400502
[Doc] remove visible token in doc ( #17884 )
...
Signed-off-by: yan <yanma1@habana.ai >
2025-05-09 01:21:31 -07:00
89a0315f4c
[Doc] Update several links in reasoning_outputs.md ( #17846 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-05-09 01:20:55 -07:00
3d1e387652
[Docs] Add Slides from NYC Meetup ( #17879 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-05-08 21:46:54 -07:00
d310e6de98
[BUGFIX]: return fast when request requires prompt logprobs ( #17251 )
2025-05-08 21:25:41 -07:00
5e6f939484
[Attention] MLA move rotary embedding to cuda-graph region ( #17668 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-09 11:14:42 +08:00
760e3ecc8f
[V1][Structured Output] Update llguidance (>= 0.7.11) to avoid AttributeError (no StructTag) ( #17839 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-05-08 20:14:18 -07:00
3c9396a64f
[FEAT][ROCm]: Support AITER MLA on V1 Engine ( #17523 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: qli88 <qiang.li2@amd.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
2025-05-09 10:42:05 +08:00
376786fac1
Add cutlass support for blackwell fp8 blockwise gemm ( #14383 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com >
2025-05-08 15:09:55 -07:00
4f605a6de5
Fix noisy warning for uncalibrated q_scale/p_scale ( #17414 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-08 15:56:59 -04:00
8342e3abd1
[CI] Prune down lm-eval small tests ( #17012 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-08 19:00:26 +00:00
a83a0f92b5
[Test] Attempt all TPU V1 tests, even if some of them fail. ( #17334 )
...
Signed-off-by: Yarong Mu <ymu@google.com >
2025-05-08 17:20:54 +00:00
226a4272cf
[V1] Improve VLLM_ALLOW_INSECURE_SERIALIZATION logging ( #17860 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-08 16:57:35 +00:00
ec54d73c31
[CI] Fix test_collective_rpc ( #17858 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-08 16:47:12 +00:00
a944f8ede7
[Misc] Delete LoRA-related redundancy code ( #17841 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-08 06:02:21 -07:00
015815fe01
[Bugfix] use_fast failing to be propagated to Qwen2-VL image processor ( #17838 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-08 05:39:21 -07:00
e4ca6e3a99
Fix transient dependency error in docs build ( #17848 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-08 03:42:03 -07:00
53d0cb7423
[Misc] add chatbox integration ( #17828 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-08 10:05:26 +00:00
f50dcb7c21
[Easy] Eliminate c10::optional usage in vllm/csrc ( #17819 )
2025-05-08 03:05:10 -07:00
a1e19b635d
[Doc] Fix a typo in the file name ( #17836 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-08 18:04:18 +08:00
bb239a730f
[Bugfix] Fix quark fp8 format loading on AMD GPUs ( #12612 )
...
Signed-off-by: Felix Marty <felmarty@amd.com >
Signed-off-by: kewang2 <kewang2@amd.com >
Co-authored-by: kewang2 <kewang2@amd.com >
2025-05-08 02:53:53 -07:00
a463555dee
[TPU] Fix the test_sampler ( #17820 )
2025-05-08 05:51:33 -04:00
ca04b97c93
[Bugfix] Fix tool call template validation for Mistral models ( #17644 )
...
Signed-off-by: Rick Yuan <yuan821120@gmail.com >
Signed-off-by: RIck Yuan <yuan821120@gmail.com >
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com >
2025-05-08 09:47:19 +00:00
0a9bbaa104
[Misc] support model prefix & add deepseek vl2 tiny fused moe config ( #17763 )
...
Signed-off-by: 唯勤 <xsank.mz@alibaba-inc.com >
Co-authored-by: 唯勤 <xsank.mz@alibaba-inc.com >
2025-05-08 07:50:22 +00:00
39956efb3f
[Bugfix] Fix bad words for Mistral models ( #17753 )
...
Signed-off-by: Qiong Zhou Huang <qiong@phonic.co >
2025-05-07 23:32:10 -07:00
597051e56f
[Qwen3]add qwen3-235b-bf16 fused moe config on A100 ( #17715 )
2025-05-07 23:09:32 -07:00
96722aa81d
[Frontend] Chat template fallbacks for multimodal models ( #17805 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-07 23:05:54 -07:00
843b222723
[Hardware][Intel-Gaudi] Support Automatic Prefix Caching on HPU ( #17648 )
...
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai >
2025-05-07 22:37:03 -07:00
e515668edf
[Hardware][Power] Enable compressed tensor W8A8 INT8 quantization for POWER ( #17153 )
...
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-05-07 22:35:03 -07:00
5a499e70d5
[Kernel][Hardware][AMD] Bf16 mfma opt for ROCm skinny GEMMs ( #17071 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
Signed-off-by: charlifu <charlifu@amd.com >
Co-authored-by: charlifu <charlifu@amd.com >
2025-05-07 22:34:49 -07:00
6930a41116
[V1] Add VLLM_ALLOW_INSECURE_SERIALIZATION env var ( #17490 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-05-08 13:34:02 +08:00
998eea4a0e
Only log non-default CLI args for online serving ( #17803 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-07 22:33:29 -07:00
c747d84576
[Installation] OpenTelemetry version update ( #17771 )
...
Signed-off-by: Mikhail Podvitskii <podvitskiymichael@gmail.com >
2025-05-07 22:32:49 -07:00
b2da14a05a
Improve exception reporting in MP engine ( #17800 )
...
Signed-off-by: Vadim Markovtsev <vadim@poolside.ai >
2025-05-08 05:32:39 +00:00
7ea2adb802
[Core] Support full cuda graph in v1 ( #16072 )
...
Signed-off-by: Chanh Nguyen <cnguyen@linkedin.com >
Co-authored-by: Chanh Nguyen <cnguyen@linkedin.com >
2025-05-07 22:30:15 -07:00
3d13ca0e24
[BugFix] Fix --disable-log-stats in V1 server mode ( #17600 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-08 04:08:15 +00:00
66ab3b13c9
Don't call the venv vllm ( #17810 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-08 04:06:39 +00:00
a8238bbdb0
[Chore][Doc] uses model id determined from OpenAI client ( #17815 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-08 01:48:57 +00:00
d43f914d42
[Core][Feature] Input metadata dump on crash ( #13407 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2025-05-07 22:15:09 +00:00
ed5272cf21
[BugFix] Avoid secondary missing MultiprocExecutor.workers error ( #17811 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-07 21:55:04 +00:00
c20ef40fd0
[Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend ( #14238 )
...
Signed-off-by: Akshat Tripathi <akshat@krai.ai >
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: Chengji Yao <chengjiyao@google.com >
2025-05-07 16:28:47 -04:00
db593aa67f
[Quantization] Quark MXFP4 format loading ( #16943 )
2025-05-07 15:05:05 -04:00
f98e307588
[Bugfix] Fix missing lora name mapping for lora without prefix ( #17793 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-07 16:17:12 +00:00
646a31e51e
Fix and simplify deprecated=True CLI kwarg ( #17781 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-07 16:51:06 +01:00
be8ff88e66
[Bugfix] Fix Video IO error for short video ( #17791 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-07 15:36:06 +00:00
1a6af1453d
Only depend on importlib-metadata for Python < 3.10 ( #17776 )
...
Signed-off-by: Christian Heimes <christian@python.org >
2025-05-07 07:51:06 -07:00
32aa74c09c
[ROCm][FP8][Kernel] FP8 quantization fused into Custom Paged Attention ( #17139 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-07 07:12:35 -07:00
7377dd0307
[doc] update the issue link ( #17782 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-07 20:29:05 +08:00
98c89e16ff
Make key optional for rotary embedding ( #17566 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-05-07 00:11:46 -07:00
324a3119b0
Fix test_memory_usage_no_spec ( #17754 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-05-07 00:10:33 -07:00
8a15c2603a
[Frontend] Add missing chat templates for various MLLMs ( #17758 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-07 00:10:01 -07:00
043e4c4955
Add NeuronxDistributedInference support, Speculative Decoding, Dynamic on-device sampling ( #16357 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
Co-authored-by: Aaron Dou <yzdou@amazon.com >
Co-authored-by: Shashwat Srijan <sssrijan@amazon.com >
Co-authored-by: Chongming Ni <chongmni@amazon.com >
Co-authored-by: Amulya Ballakur <amulyaab@amazon.com >
Co-authored-by: Patrick Lange <patlange@amazon.com >
Co-authored-by: Elaine Zhao <elaineyz@amazon.com >
Co-authored-by: Lin Lin Pan <tailinpa@amazon.com >
Co-authored-by: Navyadhara Gogineni <navyadha@amazon.com >
Co-authored-by: Yishan McNabb <yishanm@amazon.com >
Co-authored-by: Mrinal Shukla <181322398+mrinalks@users.noreply.github.com >
2025-05-07 00:07:30 -07:00
ba7703e659
[Misc] Remove qlora_adapter_name_or_path ( #17699 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-06 23:10:37 -07:00
f80ae5bdcf
[Kernel] Use fused rmsnorm for some models like qwen3 series ( #17735 )
...
Signed-off-by: evian <eviantai@u.nus.edu >
Co-authored-by: evian <eviantai@u.nus.edu >
2025-05-06 23:10:02 -07:00
1a45a61387
[Kernel] GGUF MoeVec kernel ( #16780 )
...
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com >
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-05-06 23:07:23 -07:00
c3e9d5060e
[Misc] Use apply_rotary_emb from vllm_flash_attn for Qwen2-VL vision RoPE ( #17726 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-07 04:51:33 +00:00
822de7fb94
[Misc] Split model loader ( #17712 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-07 12:42:26 +08:00
8d84d836d1
[BugFix][Spec Decode] Fix hidden size mismatch between target and eagle head ( #17740 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-06 19:51:26 -07:00
950b71186f
Replace lm-eval bash script with pytest and use enforce_eager for faster CI ( #17717 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-06 18:00:10 -07:00
e50a1f1a9c
[TPU] Add kernel test for moe_pallas ( #17496 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-05-06 17:59:57 -07:00
a17cef70ea
Removed unused marlin cuda code ( #17684 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-06 17:59:47 -07:00
18dd5e01f2
[Model] Mamba2 causal conv1d Refactor to Split Prefill and Decode Requests for Corresponding Kernels ( #17146 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-05-06 17:59:30 -07:00
6de3e13413
Add logging for torch nightly version ( #17669 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-05-07 00:45:51 +00:00
ed3a1d2106
[ROCm] fix num_stages for default moe config to avoid triton OutOfResource error ( #17744 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2025-05-07 00:39:48 +00:00
022afbeb4e
Fix doc build performance ( #17748 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-07 00:36:41 +00:00
2f925e5777
[Kernel] Unified Triton kernel that doesn't distinguish between prefill + decode ( #16828 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-06 18:21:48 -04:00
de906b95f9
[Bugfix] Fix for the condition to accept empty encoder inputs for mllama ( #17732 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-06 19:59:06 +00:00
d456aea71f
[Misc] Add Next Edit Prediction (NEP) datasets support in benchmark_serving.py ( #16839 )
...
Signed-off-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
Signed-off-by: dtransposed <>
Co-authored-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
2025-05-06 15:38:45 -04:00
621ca2c0ab
[TPU] Increase block size and reset block shapes ( #16458 )
2025-05-06 13:55:04 -04:00
6115b11582
Make right sidebar more readable in "Supported Models" ( #17723 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-06 16:48:26 +00:00
5b8c390747
[Bugfix] Fix modality limits in vision language example ( #17721 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-06 16:12:28 +00:00
7525d5f3d5
[doc] Add RAG Integration example ( #17692 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-06 16:10:23 +00:00
aabcd2cae3
[v1] Introduce KVCacheBlocks as interface between Scheduler and KVCacheManager ( #17479 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-06 08:50:34 -07:00
0d115460a7
[Docs] Use gh-file to add links to tool_calling.md ( #17709 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-05-06 15:27:19 +00:00
175bda67a1
[Feat] Add deprecated=True to CLI args ( #17426 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-06 08:11:27 -07:00
cba31c47c4
[v1] AttentionMetadata for each layer ( #17394 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-06 07:58:37 -07:00
a6fed02068
[V1][PP] Support PP for MultiprocExecutor ( #14219 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-05-06 07:58:05 -07:00
d419aa5dc4
[V1] Enable TPU V1 backend by default ( #17673 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-06 06:49:49 -07:00
f9bc5a0693
[Bugfix] Fix triton import with local TritonPlaceholder ( #17446 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-06 17:53:09 +08:00
05e1f96419
Fix dockerfilegraph pre-commit hook ( #17698 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-06 08:56:48 +00:00
6eae34533a
[Misc] Fix ScalarType float4 naming ( #17690 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-06 01:07:15 -07:00
63ced7b43f
[Doc] Update notes for H2O-VL and Gemma3 ( #17219 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-06 07:51:02 +00:00
dc47ba32f8
[Bugfix] Fixed prompt length for random dataset ( #17408 )
...
Signed-off-by: Mikhail Podvitskii <podvitskiymichael@gmail.com >
2025-05-06 07:00:08 +00:00
edbf2d609e
[easy] Fix logspam on PiecewiseBackend errors ( #17138 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-05 23:46:11 -07:00
999328be0d
[Model] Add GraniteMoeHybrid 4.0 model ( #17497 )
...
Signed-off-by: Thomas Ortner <boh@zurich.ibm.com >
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com >
Co-authored-by: Thomas Ortner <boh@zurich.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
2025-05-06 12:00:31 +08:00
98834fefaa
Update nm to rht in doc links + refine fp8 doc ( #17678 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-06 00:41:14 +00:00
90bd2ae172
[Bugfix] LoRA - Retire unused maxnreg LoRA kernel argument ( #17677 )
2025-05-05 17:34:29 -07:00
5941e0b7ea
[TPU][V1] Add support for top-logprobs ( #17072 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-05-05 14:20:15 -07:00
9765940824
[TPU] Enable gemma3-27b with TP>1 on multi-chips. ( #17335 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-05-05 14:19:58 -07:00
5ea5c514da
[BugFix] Increase timeout for startup failure test ( #17642 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-05 20:53:19 +00:00
d3efde8176
[Benchmarks] Remove invalid option under V1 engine ( #17651 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-05 16:30:22 -04:00
aea302be6c
Use git-path commit in hook ( #17616 )
...
Signed-off-by: Thomas J. Fan <thomasjpfan@gmail.com >
2025-05-05 17:55:32 +00:00
cc05b90d86
[Doc] Fix broken cuda installation doc rendering ( #17654 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-05 17:52:40 +00:00
1d0c9d6b2d
[Kernel] some optimizations for dense marlin and moe marlin ( #16850 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-05-05 09:39:30 -07:00
f62cad6431
[Build/CI] Upgrade CUTLASS to 3.9.2 ( #17641 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-05-04 19:23:17 -07:00
5394ad7387
[Bugfix] fix KeyError on top logprobs are special tokens ( #17637 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-04 19:22:35 -07:00
68e1ee0072
[Bugfix][Easy] Fix whitespace in shm_broadcast.py logging ( #17635 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-05-04 19:20:19 -07:00
2858830c39
[Bugfix] Prioritize dtype in root config before checking text config ( #17629 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-04 12:43:05 +00:00
d6484ef3c3
Add full API docs and improve the UX of navigating them ( #17485 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-03 19:42:43 -07:00
46fae69cf0
[Misc] V0 fallback for --enable-prompt-embeds ( #17615 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-03 22:59:24 +00:00
f66f1e0fa3
[Bugfix] Fix broken Qwen2.5-omni tests ( #17613 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-03 17:08:14 +00:00
887d7af882
[Core] Gate prompt_embeds behind a feature flag ( #17607 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-04 00:19:20 +08:00
a92842454c
[Bugfix][ROCm] Using device_type because on ROCm the API is still torch.cuda ( #17601 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-02 22:25:47 -07:00
c8386fa61d
[Build/CI] Upgrade CUTLASS to 3.9.1 ( #17602 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-05-02 22:25:14 -07:00
87baebebd8
[Frontend][TPU] Add TPU default max-num-batched-tokens based on device name ( #17508 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-05-02 21:42:44 -07:00
e3d0a1d190
[Quantization] [AMD] Add support for running DeepSeek int8 w8a8 MoE on ROCm ( #17558 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-05-02 21:41:10 -07:00
d47b605eca
Update test requirements to CUDA 12.8 ( #17576 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-05-02 21:40:15 -07:00
22c6f6397f
[Neuron][Build] Require setuptools >= 77.0.3 for PEP 639 ( #17603 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-05-03 02:41:59 +00:00
3ec97e2cc5
[release] Add command to clean up Docker containers/images in TPU release machine ( #17606 )
2025-05-02 18:54:34 -07:00
9b103a1d76
fix typo in logging ( #17605 )
2025-05-02 18:04:40 -07:00
b90b0852e9
[easy] Print number of needed GPUs in skip message ( #17594 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-02 15:27:43 -07:00
9352cdb56d
[Hardware][AMD] Improve OAM device ID + llama4 Maverick MOE tuning ( #16263 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
Co-authored-by: Lu Fang <lufang@fb.com >
2025-05-02 19:44:19 +00:00
182f40ea8b
Add NVIDIA TensorRT Model Optimizer in vLLM documentation ( #17561 )
2025-05-02 11:36:46 -07:00
3e887d2e0c
permute/unpermute kernel for moe optimization ( #14568 )
...
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn >
2025-05-02 11:31:55 -07:00
0f87d8f7b2
[BugFix][Attention] Fix sliding window attention in V1 giving incorrect results ( #17574 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-02 11:01:38 -07:00
4c33d67321
[Bugfix] fix tmp_out and exp_sums dimensions ( #17438 )
...
Signed-off-by: Hui Liu <96135754+hliuca@users.noreply.github.com >
2025-05-02 16:44:07 +00:00
cb234955df
[Misc] Clean up input processing ( #17582 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-02 08:11:53 -07:00
3a500cd0b6
[doc] miss result ( #17589 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-02 07:04:49 -07:00
868c546da4
Support W8A8 INT8 MoE for compressed-tensors ( #16745 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-02 10:03:32 -04:00
99404f53c7
[Security] Fix image hash collision ( #17378 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-02 08:36:39 -04:00
785d75a03b
Automatically tell users that dict args must be valid JSON in CLI ( #17577 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-02 05:24:55 -07:00
6d1479ca4b
[doc] add the print result ( #17584 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-02 05:24:45 -07:00
b8b0859b5c
add more pytorch related tests for torch nightly ( #17422 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-05-02 03:29:59 -07:00
d7543862bd
[Misc] Rename assets for testing ( #17575 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-02 03:29:25 -07:00
c777df79f7
[BugFix] Fix Memory Leak ( #17567 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-05-02 01:07:03 -07:00
cc2a77d7f1
[Core] [Bugfix] Add Input Embeddings ( #15428 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: 临景 <linjing.yx@alibaba-inc.com >
Co-authored-by: Bryce1010 <bryceyx@gmail.com >
Co-authored-by: Nan2018 <nan@protopia.ai >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-02 01:06:39 -07:00
9e2de9b9e9
[Bugfix] Remove TritonPlaceholder from sys.modules ( #17317 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-02 00:45:01 -07:00
109e15a335
Add pt_load_map_location to allow loading to cuda ( #16869 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-05-01 23:23:42 -07:00
f192ca90e6
Fix PixtralHF missing spatial_merge_size ( #17571 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-01 22:14:09 -07:00
f89d0e11bf
[Misc] Continue refactoring model tests ( #17573 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 22:06:08 -07:00
b4003d11fc
Check if bitblas is installed during support check ( #17572 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-02 04:32:54 +00:00
292fc59d61
[CI] Actually run tests/kv_transfer/test_disagg.py in CI ( #17555 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-02 04:05:04 +00:00
afcb3f8863
[Attention] MLA move o_proj q_proj into cuda-graph region ( #17484 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-02 03:16:26 +00:00
afb12e4294
[Doc] note that not all unit tests pass on CPU platforms ( #17554 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-05-02 02:57:21 +00:00
24aebae177
[Bugfix] Disable gptq_bitblas for <SM80 to fix GPTQ on V100/T4 ( #17541 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-01 17:59:35 -07:00
39c0813a7f
[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE3 ( #17504 )
...
Signed-off-by: qizixi <qizixi@meta.com >
2025-05-01 16:19:30 -07:00
9b70e2b4c1
[Misc][Tools][Benchmark] Publish script to auto tune server parameters ( #17207 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-05-01 19:53:03 +00:00
173daac19d
[Bug]change the position of cuda_graph_sizes in dataclasses ( #17548 )
...
Signed-off-by: CXIAAAAA <cxia0209@gmail.com >
2025-05-01 11:52:37 -07:00
04f2cfc894
Remove duplicate code from dbrx.py ( #17550 )
2025-05-01 11:51:58 -07:00
811a6c0972
[ROCM] Add gfx950 to the custom attention archs ( #16034 )
...
Signed-off-by: jpvillam <Juan.Villamizar@amd.com >
Signed-off-by: seungrokjung <seungrok.jung@amd.com >
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: seungrokjung <seungrok.jung@amd.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-01 11:18:28 -07:00
9b1769dd9a
[Bugfix] Fix lint error ( #17547 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 11:12:19 -07:00
61c299f81f
[Misc]add configurable cuda graph size ( #17201 )
...
Signed-off-by: CXIAAAAA <cxia0209@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-01 11:04:50 -07:00
4acfa3354a
[ROCm] update installation guide to include build aiter from source instructions ( #17542 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-01 11:01:28 -07:00
88c8304104
[Model] Refactor Ovis2 to support original tokenizer ( #17537 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-01 11:00:53 -07:00
6768ff4a22
Move the last arguments in arg_utils.py to be in their final groups ( #17531 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-01 10:31:44 -07:00
f2e7af9b86
[CI/Build] Remove awscli dependency ( #17532 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 09:20:54 -07:00
7423cf0a9b
[Misc] refactor example - cpu_offload_lmcache ( #17460 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-01 15:05:24 +00:00
460a2b1100
[torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations ( #10867 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-05-01 07:59:28 -07:00
28566d73b3
[ROCm] remove unsupported archs from rocm triton flash-attention supported list ( #17536 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2025-05-01 07:54:25 -07:00
98060b001d
[Feature][Frontend]: Deprecate --enable-reasoning ( #17452 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-01 06:46:16 -07:00
f5a3c655b2
[FEAT] [ROCm]: Add Qwen/Qwen3-235B-A22B-FP8 TP4 triton fused moe config ( #17535 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-01 06:37:17 -07:00
7169f87ad0
[doc] add streamlit integration ( #17522 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-01 13:34:02 +00:00
b74d888c63
Fix more broken speculative decode tests ( #17450 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-05-01 06:05:58 -07:00
2007d4d54f
[FEAT] [ROCm]: Add Qwen/Qwen3-30B-A3B-FP8 fused moe config for MI300X ( #17530 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-01 06:03:13 -07:00
48e925fab5
[Misc] Clean up test docstrings and names ( #17521 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 05:19:32 -07:00
1903c0b8a3
[Frontend] Show progress bar for adding requests ( #17525 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 05:15:32 -07:00
86a1f67a3b
[Bugfix][Benchmarks] Allow benchmark of deepspeed-mii backend to select a model ( #17285 )
...
Signed-off-by: Teruaki Ishizaki <teruaki.ishizaki@ntt.com >
2025-05-01 11:54:51 +00:00
a257d9bccc
Improve configs - ObservabilityConfig ( #17453 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-01 03:52:05 -07:00
015069b017
[Misc] Optimize the Qwen3_ReasoningParser extract_reasoning_content ( #17515 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-01 03:29:01 -07:00
fbefc8a78d
[Core] Enable IPv6 with vllm.utils.make_zmq_socket() ( #16506 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-01 09:38:18 +00:00
26bc4bbcd8
Avoid overwriting vllm_compile_cache.py ( #17418 )
...
Signed-off-by: Keyun Tong <tongkeyun@gmail.com >
2025-05-01 07:30:57 +00:00
3c3d767201
[BugFix] Fix mla cpu - missing 3 required positional arguments ( #17494 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-01 14:36:52 +08:00
13cf6b6236
[BugFix] fix speculative decoding memory leak when speculation is disabled ( #15506 )
...
Signed-off-by: Noah Yoshida <noahcy117@gmail.com >
2025-04-30 23:28:17 -07:00
90d0a54c4d
[ROCm] Effort to reduce the number of environment variables in command line ( #17229 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2025-04-30 23:27:06 -07:00
7a0a146c54
[Build] Require setuptools >= 77.0.3 for PEP 639 ( #17389 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-30 23:25:36 -07:00
7ab643e425
Fixing the AMD test failures caused by PR#16457 ( #17511 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-04-30 23:23:07 -07:00
afb4429b4f
[CI/Build] Reorganize models tests ( #17459 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-30 23:03:08 -07:00
aa4502e7f3
[CI][Bugfix] Fix failing V1 Test due to missing 'cache_salt' arg ( #17500 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-30 21:03:30 -07:00
17b4d85f63
[CI][TPU] Skip structured outputs+spec decode tests on TPU ( #17510 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-30 20:36:20 -07:00
1144a8efe7
[Bugfix] Temporarily disable gptq_bitblas on ROCm ( #17411 )
...
Signed-off-by: Yan Cangang <nalanzeyu@gmail.com >
2025-04-30 19:51:45 -07:00
08fb5587b4
[Bugfix][ROCm] Fix import error on ROCm ( #17495 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-30 19:51:42 -07:00
dbc18e7816
[CI][TPU] Skip Multimodal test ( #17488 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-04-30 19:51:39 -07:00
02bd654846
[Misc] Rename Audios -> Audio in Qwen2audio Processing ( #17507 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-30 19:51:36 -07:00
200bbf92e8
Bump Compressed Tensors version to 0.9.4 ( #17478 )
...
Signed-off-by: Rahul Tuli <rtuli@redhat.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-30 15:24:45 -07:00
81ecf425f0
[v1][Spec Decode] Make sliding window compatible with eagle prefix caching ( #17398 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-04-30 18:25:53 +00:00
42d9a2c4c7
doc: fix bug report Github template formatting ( #17486 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-04-30 10:03:20 -07:00
2ac74d098e
[doc] add install tips ( #17373 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-30 17:02:41 +00:00
584f5fb4c6
[Bugfix][ROCm] Restrict ray version due to a breaking release ( #17480 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-30 09:59:06 -07:00
d586ddc691
[BugFix] Fix authorization of openai_transcription_client.py ( #17321 )
...
Signed-off-by: zh Wang <rekind133@outlook.com >
2025-04-30 09:51:05 -07:00
0b7e701dd4
[Docs] Update optimization.md doc ( #17482 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-30 09:34:02 -07:00
947f2f5375
[V1] Allow turning off pickle fallback in vllm.v1.serial_utils ( #17427 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-30 16:10:54 +00:00
739e03b344
[Bugfix] Fixed mistral tokenizer path when pointing to file ( #17457 )
...
Signed-off-by: Pete Savage <psavage@redhat.com >
2025-04-30 08:08:37 -07:00
da4e7687b5
[Fix] Support passing args to logger ( #17425 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-04-30 08:06:58 -07:00
39317cf42b
[Docs] Add command for running mypy tests from CI ( #17475 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-30 08:06:09 -07:00
2990cee95b
[Feature] The Qwen3 reasoning parser supports guided decoding ( #17466 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-30 07:48:21 -07:00
0be6d05b5e
[V1][Metrics] add support for kv event publishing ( #16750 )
...
Signed-off-by: alec-flowers <aflowers@nvidia.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
2025-04-30 07:44:45 -07:00
77073c77bc
[Core] Prevent side-channel attacks via cache salting ( #17045 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
2025-04-30 20:27:21 +08:00
a7d5b016bd
[TPU][V1][CI] Update regression test baseline for v6 CI ( #17064 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-30 04:03:22 -07:00
d803786731
[V1][Bugfix]: vllm v1 version metric num_gpu_blocks is None ( #15755 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-30 18:20:39 +08:00
1534d389af
[Misc] Remove deprecated files ( #17447 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-30 01:52:19 -07:00
ece5a8b0b6
Make the _apply_rotary_emb compatible with dynamo ( #17435 )
2025-04-30 07:52:48 +00:00
54072f315f
[MODEL ADDITION] Ovis2 Model Addition ( #15826 )
...
Signed-off-by: Marco <121761685+mlinmg@users.noreply.github.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-04-30 07:33:29 +00:00
be633fba0f
[Bugfix] Fix AttributeError: 'State' object has no attribute 'engine_client' ( #17434 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-30 00:11:04 -07:00
ed6cfb90c8
[Hardware][Intel GPU] Upgrade to torch 2.7 ( #17444 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com >
2025-04-30 00:03:58 -07:00
6ed9f6047e
[Intel GPU] [CI]Fix XPU ci, setuptools >=80.0 have build issue ( #17298 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-04-29 22:54:10 -07:00
a44c4f1d2f
Support LoRA for Mistral3 ( #17428 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-29 21:10:30 -07:00
88fcf00dda
Fix some speculative decode tests with tl.dot ( #17371 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-04-29 19:41:02 -07:00
d1f569b1b9
Fix call to logger.info_once ( #17416 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 19:39:18 -07:00
13698db634
Improve configs - ModelConfig ( #17130 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-30 10:38:22 +08:00
2c4f59afc3
Update PyTorch to 2.7.0 ( #16859 )
2025-04-29 19:08:04 -07:00
1c2bc7ead0
Truncation control for embedding models ( #14776 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2025-04-30 09:24:57 +08:00
4055130a85
[release] Always git fetch all to get latest tag on TPU release ( #17322 )
2025-04-29 17:52:11 -07:00
34120f5acd
[V1][Feature] Enable Speculative Decoding with Structured Outputs ( #14702 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com >
2025-04-30 00:02:10 +00:00
7489ec0bab
Remove Bamba 9B from CI ( #17407 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 21:10:31 +00:00
70788bdbdc
[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE ( #17211 )
...
Signed-off-by: Bryan Lu <yuzhelu@amazon.com >
2025-04-29 21:10:00 +00:00
c9c1b59e59
Fix: Python package installation for opentelemetry ( #17049 )
...
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com >
2025-04-29 20:20:24 +00:00
0350809f3a
Remove Falcon3 2x7B from CI ( #17404 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 19:52:25 +00:00
a6977dbd15
Simplify (and fix) passing of guided decoding backend options ( #17008 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 19:02:23 +00:00
2fa2a50bf9
[Bugfix] Fix Minicpm-O-int4 GPTQ model inference ( #17397 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-29 18:21:42 +00:00
08e15defa9
[CI/Build] Add retry mechanism for add-apt-repository ( #17107 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-29 10:40:52 -07:00
b37685afbb
[CI] Uses Python 3.11 for TPU ( #17359 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-04-29 17:39:16 +00:00
792595b59d
[TPU][V1][CI] Replace python3 setup.py develop with standard pip install --e on TPU ( #17374 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-29 10:36:48 -07:00
0c1c788312
[Doc][Typo] Fixing label in new model requests link in overview.md ( #17400 )
2025-04-29 10:29:48 -07:00
56d64fbe30
[Docs] Propose a deprecation policy for the project ( #17063 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-29 10:29:44 -07:00
608968b7c5
Enabling multi-group kernel tests. ( #17115 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-04-29 10:27:27 -07:00
06ffc7e1d3
[Misc][ROCm] Exclude cutlass_mla_decode for ROCm build ( #17289 )
...
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
2025-04-29 10:26:42 -07:00
d3cf61b89b
fix gemma3 results all zero ( #17364 )
...
Signed-off-by: mayuyuace <qiming1.zhang@intel.com >
2025-04-29 09:40:25 -07:00
a39203f99e
[Bugfix] add qwen3 reasoning-parser fix content is None when disable … ( #17369 )
...
Signed-off-by: mofanke <mofanke@gmail.com >
2025-04-29 16:32:40 +00:00
24e6ad3f16
[V1] Remove num_input_tokens from attn_metadata ( #17193 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-04-29 09:28:41 -07:00
2ef5d106bb
Improve literal dataclass field conversion to argparse argument ( #17391 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 16:25:08 +00:00
0ed27ef66c
Fix: Spelling of inference ( #17387 )
2025-04-29 09:23:39 -07:00
900edfa8d4
Transformers backend tweaks ( #17365 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 09:08:03 -07:00
88ad9ec6b2
[Frontend] Support chat_template_kwargs in LLM.chat ( #17356 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-29 22:03:35 +08:00
40896bdf3f
pre-commit autoupdate ( #17380 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 06:46:55 -07:00
00ee37efa2
[Bugfix] Clean up MiniMax-VL and fix processing ( #17354 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-29 20:42:16 +08:00
890f104cdf
[Doc] Fix QWen3MOE info ( #17381 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-29 12:38:32 +00:00
4a5e13149a
Update docs requirements ( #17379 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 11:35:47 +00:00
97cc8729f0
[Model] Ignore rotary embed load for Cohere model ( #17319 )
2025-04-29 00:30:40 -07:00
4464109219
[Build][Bugfix] Restrict setuptools version to <80 ( #17320 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-29 00:17:23 -07:00
193e78e35d
[Fix] Documentation spacing in compilation config help text ( #17342 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-04-29 00:16:17 -07:00
bdb2cddafc
[Misc]Use a platform independent interface to obtain the device attributes ( #17100 )
2025-04-29 06:59:13 +00:00
ebb3930d28
[Misc] Move config fields to MultiModalConfig ( #17343 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-29 06:37:21 +00:00
cde384cd92
[Model] support MiniMax-VL-01 model ( #16328 )
...
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-04-29 12:05:50 +08:00
96e06e3cb7
[Misc] Add a Jinja template to support Mistral3 function calling ( #17195 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-28 19:53:44 -07:00
17eb306fcc
[Bugfix] Add contiguous call inside rope kernel wrapper ( #17091 )
...
Signed-off-by: 苏政渊 <suzhengyuan@moonshot.cn >
Co-authored-by: 苏政渊 <suzhengyuan@moonshot.cn >
2025-04-28 19:24:07 -07:00
165cb56329
Ignore '<string>' filepath ( #17330 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-28 19:23:29 -07:00
d6da8a8ff2
[Bugfix] Fix numel() downcast in fused_layernorm_dynamic_per_token_quant.cu ( #17316 )
2025-04-28 19:23:18 -07:00
b4ac4fa04d
[model] make llama4 compatible with pure dense layers ( #17315 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
2025-04-29 10:22:22 +08:00
e136000595
[V1][Spec Decode] Make Eagle model arch config driven ( #17323 )
2025-04-29 10:22:02 +08:00
86d9fc29cb
implement Structural Tag with Guidance backend ( #17333 )
...
Signed-off-by: Michal Moskal <michal@moskal.me >
2025-04-29 02:21:32 +00:00
506475de5f
[Optim] Compute multimodal hash only once per item ( #17314 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-29 09:40:35 +08:00
cfe4532093
[Benchmark] Add single turn MTBench to Serving Bench ( #17202 )
2025-04-28 16:46:15 -07:00
8fc88d63f1
[Model] Add tuned triton fused_moe configs for Qwen3Moe ( #17328 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-28 15:20:24 -07:00
6e74fd4945
Support loading transformers models with named parameters ( #16868 )
...
Signed-off-by: Alex <alexwu@character.ai >
2025-04-28 23:15:58 +01:00
dcbac4cb4b
[Model] Qwen3 Dense FP8 Compat Fixes ( #17318 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
2025-04-28 14:12:01 -07:00
ed2462030f
[Bugfix] Fix moe weight losing all extra attrs after process_weights_after_loading. ( #16854 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-04-28 21:05:07 +00:00
cc5befbced
[BugFix] Fix cascade attention - RuntimeError: scheduler_metadata must have shape (metadata_size) ( #17283 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-28 13:55:50 -07:00
2c89cd96a8
[Chore] cleanup license indicators in light of SPDX ( #17259 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-04-28 19:43:52 +00:00
a0304dc504
[Security] Don't bind tcp zmq socket to all interfaces ( #17197 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-28 10:08:20 -07:00
c7941cca18
Explicitly explain quant method override ordering and ensure all overrides are ordered ( #17256 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-28 16:55:31 +00:00
b6dd32aa07
Make name of compressed-tensors quant method consistent across vLLM ( #17255 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-28 16:28:13 +00:00
f94886946e
Improve conversion from dataclass configs to argparse arguments ( #17303 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-28 16:22:12 +00:00
72dfe4c74f
[Docs] Add a security guide ( #17230 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-28 15:12:17 +00:00
8b464d9660
[Misc] Clean up Qwen2.5-Omni code ( #17301 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-28 06:20:45 -07:00
889ebb2638
[Misc] Minor typo/grammar in platforms/interface.py ( #17307 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-28 05:45:42 -07:00
3ad986c28b
[doc] update wrong model id ( #17287 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-28 04:20:51 -07:00
344e193b7d
[Bugfix] Add missing get_language_model to new MLLMs ( #17300 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-28 04:09:57 -07:00
fb1c933ade
Add missing class docstring for PromptAdapterConfig ( #17302 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-28 04:06:59 -07:00
72c5b97231
Update tpu_worker.py 's typo ( #17288 )
2025-04-28 04:01:15 -07:00
fa93cd9f60
[Model] Add Granite Speech Support ( #16246 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-28 10:05:00 +00:00
aec9674dbe
[Core] Remove legacy input mapper/processor from V0 ( #15686 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-28 15:38:48 +08:00
7fcc4223dc
[Minor][Models] Pass partial_rotary_factor parameter to rope ( #17266 )
...
Signed-off-by: evian <eviantai@u.nus.edu >
Co-authored-by: evian <eviantai@u.nus.edu >
2025-04-28 04:28:59 +00:00
8262a3e23b
[Misc] Validate stop_token_ids contents ( #17268 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-28 03:54:05 +00:00
f211331c48
[Doc] small fix ( #17277 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-28 03:53:35 +00:00
9053d0b134
[Doc] Fix wrong github link in LMCache examples ( #17274 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-04-28 03:09:11 +00:00
cb3f2d8d10
[Bugfix] Fix Mistral3 spatial merge error ( #17270 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-27 19:40:05 -07:00
c12df53b60
[Bugfix] Fix cutlass dispatch for fp8/int8 to properly invoke M<=16 c… ( #16751 )
...
Signed-off-by: Ther-LF <2639852836@qq.com >
2025-04-27 19:38:42 -07:00
d1aeea7553
[Bugfix] Fix missing ARG in Dockerfile for arm64 platforms ( #17261 )
...
Signed-off-by: lkm-schulz <44176356+lkm-schulz@users.noreply.github.com >
2025-04-27 19:38:14 -07:00
d8bccde686
[BugFix] Fix vllm_flash_attn install issues ( #17267 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
2025-04-27 17:27:56 -07:00
20e489eaa1
[V1][Spec Decode] Make eagle compatible with prefix caching. ( #17137 )
...
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-04-27 09:29:43 -07:00
4213475ec7
[Metrics] Fix minor inconsistencies in bucket progression ( #17262 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-27 16:19:39 +00:00
d92879baf6
[doc] Add feature status legend ( #17257 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-27 08:17:02 -07:00
690fe019f0
[Feature] support sequence parallelism using compilation pass ( #16155 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-04-27 06:29:35 -07:00
ed7a29d9f8
[NVIDIA] Support Cutlass MLA for Blackwell GPUs ( #16032 )
...
Signed-off-by: kaixih <kaixih@nvidia.com >
2025-04-27 06:29:21 -07:00
756848e79e
[Bugfix] Fix Lora Name Parsing ( #17196 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-27 20:33:09 +08:00
18445edd0f
[Misc] Change buckets of histogram_iteration_tokens to [1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8096] to represent number of tokens ( #17033 )
...
Signed-off-by: sfc-gh-zhwang <flex.wang@snowflake.com >
2025-04-27 12:30:53 +00:00
30215ca61f
[MISC] Use string annotation types for class definitions ( #17244 )
...
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com >
2025-04-27 08:39:57 +00:00
838cedade7
[Bugfix] Get a specific type of layer from forward context ( #17222 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-04-27 00:58:05 -07:00
4283a28c2f
[Bugfix] Fix QWen2 VL multimodal mapping ( #17240 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-27 05:53:23 +00:00
93a126fbc7
[Misc] Make cached tokenizer pickle-compatible ( #17048 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-27 13:05:00 +08:00
8e4b351a0c
[Kernel][Triton][FP8] Adding fp8 and variable length sequence support to Triton FAv2 kernel ( #12591 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-04-27 00:35:08 +00:00
9869453c42
Update test_flash_attn.py ( #17102 )
...
Signed-off-by: ShuaibinLi <lishuaibin@live.cn >
2025-04-26 22:17:35 +00:00
3642c59aa8
[CI/Build] remove -t for run-lm-eval-gsm-hf-baseline.sh ( #16271 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-26 18:25:05 +00:00
43eea2953b
[Minor] Fix lint error in main branch ( #17233 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-26 11:10:14 -07:00
de7eb10ce4
[Bugfix] Fix Qwen2.5-Omni M-RoPE position ids generation ( #16878 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2025-04-26 10:41:35 -07:00
fd11a325b8
[MISC] rename interval to max_recent_requests ( #14285 )
2025-04-26 16:59:18 +00:00
4d17e20310
Disable the torch.compile cache checks when VLLM_DISABLE_COMPILE_CACHE=1 ( #16573 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-04-26 09:17:58 -07:00
10fd1d7380
[Bugfix] fix error due to an uninitialized tokenizer when using skip_tokenizer_init with num_scheduler_steps ( #9276 )
...
Signed-off-by: changjun.lee <pord7457@gmail.com >
2025-04-26 11:51:17 -04:00
52b4f4a8d7
[Docs] Update structured output doc for V1 ( #17135 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-26 15:12:18 +00:00
e782e0a170
[Chore] added stubs for vllm_flash_attn during development mode ( #17228 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-04-26 07:45:26 -07:00
dc2ceca5c5
[BUGFIX] use random for NONE_HASH only when PYTHONHASHSEED not set ( #17088 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-04-26 14:34:24 +00:00
f8acd01ff7
[V1] Add structural_tag support using xgrammar ( #17085 )
2025-04-26 14:06:37 +00:00
c48334d405
[Hardware][Intel-Gaudi] Update hpu-extension and update bucketing system for HPU device ( #17186 )
...
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai >
2025-04-26 05:55:14 -07:00
909fdaf152
[Bugfix] Fix standard models tests ( #17217 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-26 02:26:41 -07:00
8c1c926d00
[Bugfix] Fix missing int type for -n in multi-image example ( #17223 )
2025-04-26 08:49:52 +00:00
df6f3ce883
[Core] Remove prompt string from engine core data structures ( #17214 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-25 23:41:05 -07:00
513f074766
[CI/test] Fix Eagle Correctness Test ( #17209 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-25 23:40:36 -07:00
b07bf83c7d
[BugFix] Avoid race conditions in zero-copy tensor transmission ( #17203 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-26 06:00:07 +00:00
53e8cf53a4
[V1][Metrics] Allow V1 AsyncLLM to use custom logger ( #14661 )
...
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-25 22:05:40 -07:00
54271bb766
[ROCm][Misc] Follow-ups for Skinny Gemms on ROCm. ( #17011 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-04-25 22:05:10 -07:00
9e96f56efb
Allocate kv_cache with stride order ( #16605 )
...
Signed-off-by: shuw <shuw@nvidia.com >
2025-04-25 22:03:31 -07:00
b278911229
[Minor][Models] Fix Return Types of Llama & Eagle ( #17220 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-25 21:54:47 -07:00
7bd0c7745c
[Doc] Minor fix for the vLLM TPU setup page ( #17206 )
...
Signed-off-by: Yarong Mu <ymu@google.com >
2025-04-26 04:39:56 +00:00
1cf0719ebd
[Minor][Spec Decode] Add use_eagle to SpeculativeConfig ( #17213 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-25 21:08:15 -07:00
537d5ee025
[doc] add Anything LLM integration ( #17216 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-25 21:03:23 -07:00
c8e5be35f7
[MISC][AMD] Add unused annotation to rocm kernel file ( #17097 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-04-25 20:33:35 -07:00
a6e72e1e4f
[Bugfix] [pytorch] Patch AOTAutogradCache._get_shape_env ( #17142 )
...
Signed-off-by: James Wu <jjwu@meta.com >
2025-04-26 11:28:20 +08:00
5e83a7277f
[v1] [P/D] Adding LMCache KV connector for v1 ( #16625 )
2025-04-26 03:03:38 +00:00
68af5f6c5c
[AMD][FP8][BugFix] Remove V1 check in arg_utils.py for FP8 since it is not necessary ( #17215 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-04-25 19:55:05 -07:00
8de2901fea
[Bugfix] gemma[2,3] interleaved attention when sliding window is disabled ( #17180 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-04-25 19:53:51 -07:00
c53e0730cb
[Misc] Refine ray_serve_deepseek example ( #17204 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-04-25 16:06:59 -07:00
a0e619e62a
[V1][Spec Decode] EAGLE-3 Support ( #16937 )
...
Signed-off-by: Bryan Lu <yuzhelu@amazon.com >
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
Co-authored-by: Bryan Lu <yuzhelu@amazon.com >
2025-04-25 15:43:07 -07:00
70116459c3
[BugFix][Frontend] Fix LLM.chat() tokenization ( #16081 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-25 22:20:05 +00:00
65e262b93b
Fix Python packaging edge cases ( #17159 )
...
Signed-off-by: Christian Heimes <christian@python.org >
2025-04-26 06:15:07 +08:00
43faa0461a
[Bugfix] Fix hybrid model tests ( #17182 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-25 15:14:37 -07:00
48cb2109b6
[V1] Move usage stats to worker and start logging TPU hardware ( #16211 )
2025-04-25 14:06:01 -06:00
a5450f11c9
[Security] Use safe serialization and fix zmq setup for mooncake pipe ( #17192 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-04-25 16:53:23 +00:00
9d98ab5ec6
[Misc] Inline Molmo requirements ( #17190 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-25 16:41:44 +00:00
df5c879527
[doc] update wrong hf model links ( #17184 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-25 16:40:54 +00:00
423e9f1cbe
Use Transformers helper get_text_config() instead of checking for text_config ( #17105 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-25 08:47:35 -07:00
0bd7f8fca5
Bump Transformers to 4.51.3 ( #17116 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-25 08:34:34 -07:00
d5615af9ae
[Bugfix] Fix Mistral ChatCompletionRequest Body Exception ( #16769 )
...
Signed-off-by: Jasmond Loh <Jasmond.Loh@hotmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-25 07:26:30 -07:00
19dcc02a72
[Bugfix] Fix mistral model tests ( #17181 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-25 06:03:34 -07:00
7feae92c1f
[Doc] Move todo out of beam search docstring ( #17183 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-25 04:44:58 -07:00
f851b84266
[Doc] Add two links to disagg_prefill.md ( #17168 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-25 10:23:57 +00:00
fc966e9cc6
Only turn on FastIncrementalDetokenizer when tokenizers >= 0.21.1 ( #17158 )
2025-04-25 17:10:32 +08:00
ef19e67d2c
[Doc] Add headings to improve gptqmodel.md ( #17164 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-25 01:13:13 -07:00
a41351f363
[Quantization][FP8] Add support for FP8 models with input_scale for output projection and QK quantization ( #15734 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
2025-04-25 00:45:02 -07:00
6aae216b4e
[Bugfix] remove fallback in guided_json (int range, patterns) ( #16725 )
...
Signed-off-by: csy1204 <josang1204@gmail.com >
Co-authored-by: 조상연[플레이스 AI] <sang-yeon.cho@navercorp.com >
2025-04-25 06:54:43 +00:00
b22980a1dc
[Perf]Optimize rotary_emb implementation to use Triton operator for improved inference performance ( #16457 )
...
Signed-off-by: cynthieye <yexin93@qq.com >
Co-authored-by: MagnetoWang <magnetowang@outlook.com >
2025-04-25 14:52:28 +08:00
881f735827
[Misc] Benchmark Serving Script Support Appending Results ( #17028 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-24 22:53:55 -07:00
2f54045508
[Bugfix][Misc] Use TritonPlaceholderModule to defensively import triton ( #15099 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-04-24 22:51:02 -07:00
5aa6efb9a5
[Misc] Clean up redundant code in uniproc_executor.py ( #16762 )
...
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com >
2025-04-24 22:49:30 -07:00
6ca0234478
Move missed SchedulerConfig args into scheduler config group in EngineArgs ( #17131 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 22:48:53 -07:00
649818995f
[Docs] Fix True->true in supported_models.md ( #17141 )
2025-04-25 04:20:04 +00:00
7a0a9da72b
[Doc] V1 : Update LoRA status ( #17133 )
...
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com >
Co-authored-by: varun sundar rabindranath <vsundarr@redhat.com >
2025-04-24 20:17:22 -07:00
69bff9bc89
fix float16 support for kimi-vl ( #17156 )
...
Co-authored-by: zhouzaida <zhouzaida@msh.team >
2025-04-24 20:16:32 -07:00
41ca7eb491
[Attention] FA3 decode perf improvement - single mma warp group support for head dim 128 ( #16864 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-24 20:12:21 -07:00
eef364723c
[FEAT] [ROCm]: AITER Fused MOE V1 Support ( #16752 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-04-25 11:06:50 +08:00
0d6e187e88
Use custom address for listening socket ( #15988 )
...
Signed-off-by: Jens Glaser <glaserj@ornl.gov >
2025-04-25 01:57:16 +00:00
9420a1fc30
Better error message for missing mistral params.json ( #17132 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-24 23:43:08 +00:00
583e900996
[Misc] Add example to run DeepSeek with Ray Serve LLM ( #17134 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-04-24 22:25:21 +00:00
05e1fbfc52
Add chat template for Llama 4 models ( #16428 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-04-24 20:19:36 +00:00
fe92176321
Add collective_rpc to llm engine ( #16999 )
...
Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai >
2025-04-24 20:16:52 +00:00
6d0df0ebeb
[Docs] Generate correct github links for decorated functions ( #17125 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-24 10:39:43 -07:00
0fa939e2d1
Improve configs - LoRAConfig + PromptAdapterConfig ( #16980 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 10:29:34 -07:00
0422ce109f
Add :markdownhelp: to EngineArgs docs so markdown docstrings render properly ( #17124 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 10:28:45 -07:00
47bdee409c
Molmo Requirements ( #17026 )
...
Signed-off-by: Eyshika Agarwal <eyshikaengineer@gmail.com >
Signed-off-by: eyshika <eyshikaengineer@gmail.com >
2025-04-24 10:08:37 -07:00
49f189439d
existing torch installation pip command fix for docs ( #17059 )
2025-04-24 10:07:21 -07:00
5adf6f6b7f
Updating builkite job for IBM Power ( #17111 )
...
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com >
2025-04-24 10:06:17 -07:00
4115f19958
[CI] Add automation for the tool-calling github label ( #17118 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-24 09:22:00 -07:00
340d7b1b21
[V1][Spec Decoding] Add num_drafts and num_accepted_tokens_per_position metrics ( #16665 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-04-24 08:57:40 -07:00
1bcbcbf574
[Misc] refactor example series - structured outputs ( #17040 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-24 07:49:48 -07:00
82e43b2d7e
Add missing rocm_skinny_gemms kernel test to CI ( #17060 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-24 07:49:37 -07:00
67309a1cb5
[Frontend] Using matryoshka_dimensions control the allowed output dimensions. ( #16970 )
2025-04-24 07:06:28 -07:00
b724afe343
[V1][Structured Output] Clear xgrammar compiler object when engine core shut down to avoid nanobind leaked warning ( #16954 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-24 06:15:03 -07:00
21f4f1c9a4
Improve static type checking in LoRAModelRunnerMixin ( #17104 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 06:14:47 -07:00
b0c1f6202d
[Misc] Remove OLMo2 config copy ( #17066 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-24 06:14:32 -07:00
c0dfd97519
[V1][PP] Optimization: continue scheduling prefill chunks ( #17080 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-04-24 05:27:08 -07:00
a9138e85b1
Fix OOT registration test ( #17099 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 04:44:12 -07:00
0a05ed57e6
Simplify TokenizerGroup ( #16790 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 04:43:56 -07:00
14288d1332
Disable enforce_eager for V1 TPU sampler and structured output tests ( #17016 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-24 02:50:09 -07:00
b411418ff0
[Chore] Remove Sampler from Model Code ( #17084 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-24 02:49:33 -07:00
2bc0f72ae5
Add docs for runai_streamer_sharded ( #17093 )
...
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-24 01:03:21 -07:00
9c1244de57
[doc] update to hyperlink ( #17096 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-24 00:58:08 -07:00
db2f8d915c
[V1] Update structured output ( #16812 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-23 23:57:17 -07:00
6167c0e5d2
[Bugfix][Core] add seq_id_to_seq_group clearing to avoid memory leak when s… ( #16472 )
...
Signed-off-by: 开哲 <kaizhe.zy@alibaba-inc.com >
Co-authored-by: 开哲 <kaizhe.zy@alibaba-inc.com >
2025-04-24 11:25:37 +08:00
ed2e464653
Addendum Fix to support FIPS enabled machines with MD5 hashing ( #17043 )
...
Signed-off-by: sydarb <areebsyed237@gmail.com >
2025-04-23 19:55:00 -07:00
2c8ed8ee48
More informative error when using Transformers backend ( #16988 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 19:54:03 -07:00
ed50f46641
[Bugfix] Enable V1 usage stats ( #16986 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-23 19:54:00 -07:00
46e678bcff
[Minor] Use larger batch sizes for A100/B100/B200/MI300x ( #17073 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-23 19:18:59 -07:00
6b2427f995
[Quantization]add prefix for commandA quantized model ( #17017 )
2025-04-23 17:32:40 -07:00
b07d741661
[CI/Build] workaround for CI build failure ( #17070 )
...
Signed-off-by: csy1204 <josang1204@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-04-23 16:14:18 -07:00
41fb013d29
[V1][Spec Decode] Always use argmax for sampling draft tokens ( #16899 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-23 14:57:43 -07:00
32d4b669d0
[BugFix][V1] Fix int32 token index overflow when preparing input ids ( #16806 )
2025-04-23 12:12:35 -07:00
3cde34a4a4
[Frontend] Support guidance:no-additional-properties for compatibility with xgrammar ( #15949 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-04-23 18:34:41 +00:00
bdb3660312
Use @property and private field for data_parallel_rank_local ( #17053 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 08:50:08 -07:00
f3a21e9c68
CacheConfig.block_size should always be int when used ( #17052 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 08:50:05 -07:00
8e630d680e
Improve Transformers backend model loading QoL ( #17039 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 07:33:51 -07:00
af869f6dff
[CI] Update structured-output label automation ( #17055 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-23 07:33:14 -07:00
53c0fa1e25
Ensure that pid passed to kill_process_tree is int for mypy ( #17051 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 07:32:26 -07:00
f7912cba3d
[Doc] Add top anchor and a note to quantization/bitblas.md ( #17042 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-23 07:32:16 -07:00
6317a5174a
Categorize tests/kernels/ based on kernel type ( #16799 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-23 09:21:07 -04:00
aa72d9a4ea
Mistral-format support for compressed-tensors ( #16803 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-23 08:46:23 -04:00
ce17db8085
[CI] Run v1/test_serial_utils.py in CI ( #16996 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-23 01:13:34 -07:00
8c87a9ad46
[Bugfix] Fix AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers ( #16964 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-23 07:24:09 +00:00
ec69124eb4
[Misc] Improve readability of get_open_port function. ( #17024 )
...
Signed-off-by: gitover22 <qidizou88@gmail.com >
2025-04-23 06:16:53 +00:00
d0da99fb70
[BugFix] llama4 fa3 fix - RuntimeError: scheduler_metadata must have shape (metadata_size) ( #16998 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-22 21:49:24 -07:00
b2f195c429
[V1] Avoid socket errors during shutdown when requests are in in-flight ( #16807 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-23 12:36:29 +08:00
047797ef90
[Bugfix] Triton FA function takes no keyword arguments ( #16902 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-04-22 21:35:24 -07:00
eb8ef4224d
[doc] add download path tips ( #17013 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-23 04:06:30 +00:00
56a735261c
[INTEL-HPU][v0] Port delayed sampling to upstream ( #16949 )
...
Signed-off-by: Michal Adamczyk <michal.adamczyk@intel.com >
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Michal Adamczyk <madamczyk@habana.ai >
2025-04-22 20:14:11 -07:00
e1cf90e099
[misc] tune some env vars for GB200 ( #16992 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-04-23 10:59:48 +08:00
6bc1e30ef9
Revert "[Misc] Add S3 environment variables for better support of MinIO." ( #17021 )
2025-04-22 19:22:29 -07:00
7e081ba7ca
[BugFix] Revert ROCm Custom Paged Attention Env Flag Check ( #17022 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-04-22 19:17:48 -07:00
1e013fa388
[V1][DP] More robust DP/EP dummy request coordination ( #16277 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-22 19:12:15 -07:00
bc7c4d206b
[Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1 ( #13305 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu >
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Signed-off-by: root <root@banff-cyxtera-s65-4.amd.com >
Signed-off-by: maleksan85 <maleksan@amd.com >
Signed-off-by: <>
Co-authored-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: qli88 <qiang.li2@amd.com >
Co-authored-by: root <root@banff-cyxtera-s65-4.amd.com >
2025-04-22 19:11:56 -07:00
f67e9e9f22
add Dockerfile build vllm against torch nightly ( #16936 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-04-22 19:08:27 -07:00
36fe78769f
[Bugfix] validate urls object for multimodal content parts ( #16990 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-23 09:43:06 +08:00
83d933718c
[Core][V1][TPU] Enable structured decoding on TPU V1 ( #16499 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-04-22 18:05:23 -06:00
5175b884f7
[BugFix] Remove default multiproc executor collective_rpc timeout ( #17000 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-22 23:27:14 +00:00
5536b30a4c
Fencing Kernels Tests for enabling on AMD ( #16929 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-04-22 09:32:40 -07:00
7f58fb9718
Add assertion for no objects while hashing hf_config ( #16930 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-22 09:32:22 -07:00
30bc3e0f66
[FEAT][ROCm]: Support AITER MLA ( #15893 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: qli88 <qiang.li2@amd.com >
2025-04-22 09:31:13 -07:00
f34410715f
[frontend] enhance tool_calls type check ( #16882 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-22 15:40:24 +00:00
68d4c33202
[Misc] Add S3 environment variables for better support of MinIO. ( #16977 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-22 14:27:36 +00:00
f961d7f6ef
[BugFix] Pass in correct VLLM config in FlashInfer backend ( #13207 ) ( #16973 )
...
Signed-off-by: 苏政渊 <suzhengyuan@moonshot.cn >
Co-authored-by: 苏政渊 <suzhengyuan@moonshot.cn >
2025-04-22 06:44:10 -07:00
d059110498
Improve configs - SpeculativeConfig ( #16971 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-22 12:55:36 +00:00
571e8dd65e
[Bugfix] Fix distributed bug again in Qwen2.5-VL & Qwen2.5-Omni ( #16974 )
...
Signed-off-by: fyabc <suyang.fy@alibaba-inc.com >
2025-04-22 12:23:17 +00:00
4b91c927f6
[Misc] refactor example series ( #16972 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-22 11:44:21 +00:00
0e237f0035
[FEAT][ROCm] Integrate Paged Attention Kernel from AITER ( #15001 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-04-22 02:46:28 -07:00
8f7bace7c3
[Doc] Improve documentation for multimodal CLI args ( #16960 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-22 08:35:35 +00:00
e4d6144232
[BugFix] Fix incremental detokenization perf issue ( #16963 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-22 08:16:19 +00:00
8d32dc603d
[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS ( #6036 )
...
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com >
Co-authored-by: xinyuxiao <xinyuxiao2024@gmail.com >
2025-04-22 09:01:36 +01:00
c4ab9f3e71
[V1] Remove pre-allocation for KV cache ( #16941 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-22 00:52:18 -07:00
2689d5c027
[Model] Use autoweightloader for mamba ( #16950 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-04-22 07:48:15 +00:00
acba33a0f1
[Bugfix] Fix the issue where llm.generate cannot be called repeatedly after setting GuidedDecodingParams ( #16767 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-04-22 06:02:20 +00:00
a114bf20a3
[Perf] Optimize _update_states for GPU model runner ( #16910 )
...
Signed-off-by: snowcharm <snowcharmqq@gmail.com >
2025-04-22 14:01:54 +08:00
3097ce3a32
[Doc] Update ai_accelerator/hpu-gaudi.inc.md ( #16956 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-22 05:33:27 +00:00
d6da9322c8
[Bugfix] Fix f-string for Python 3.9-3.11 ( #16962 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-21 21:45:55 -07:00
71ce44047f
Support S3 Sharded loading with RunAI Model Streamer ( #16317 )
...
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-21 21:21:49 -07:00
188b7f9b8c
[Performance][ROCm] Add skinny gemms for unquantized linear on ROCm ( #15830 )
...
Signed-off-by: charlifu <charlifu@amd.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
2025-04-21 20:46:22 -07:00
b9b4746950
[V1] Remove additional_config check ( #16710 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-04-21 20:45:27 -07:00
7b8a2ab76f
[Kernel] Add expert_map support to Cutlass FP8 MOE ( #16861 )
...
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com >
Co-authored-by: varun sundar rabindranath <vsundarr@redhat.com >
2025-04-21 20:44:32 -07:00
c9acbf1141
[Misc] Remove the chunked prefill warning for LoRA ( #16925 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-21 20:44:24 -07:00
5b794cae8d
[ROCm] Add aiter tkw1 kernel for Llama4 fp8 ( #16727 )
...
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-04-21 20:42:34 -07:00
0e4254492f
[Bugfix]: fix issue with n>1 sampling on v1 requests overriding each other ( #16863 )
...
Signed-off-by: Jeffrey Li <jeffrey.dot.li@gmail.com >
2025-04-22 11:40:19 +08:00
1311913f55
[BugFix][Spec Decode] No in-place update to draft probs ( #16952 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-21 19:54:19 -07:00
29f395c97c
[Doc] Remove unnecessary V1 flag ( #16924 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-21 21:04:38 -04:00
fa3bba2a53
[TPU][V1] Enable Top-P ( #16843 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-22 00:46:07 +00:00
986537f1c3
[V1] V1 FlashInfer Attention ( #16684 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Aurick Qiao <qiao@aurick.net >
2025-04-22 00:38:41 +00:00
210207525e
[TPU][V1] Capture multimodal encoder during model compilation ( #15051 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Siyuan Liu <lsiyuan@google.com >
2025-04-21 18:36:59 -06:00
71eda0bb76
Update Qwen1.5-MoE-W4A16-compressed-tensors.yaml ( #16946 )
2025-04-21 18:35:32 -06:00
471fe65630
[TPU][V1] Implicitly adjust page size when there's SMEM OOM ( #16871 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-21 15:43:13 -06:00
3a0fba5cf4
[V1][Spec Decode] Handle draft tokens beyond max_model_len ( #16087 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-21 12:38:50 -07:00
299ebb62b2
[Core] Speed up decode by remove synchronizing operation in sampler ( #16436 )
...
Signed-off-by: Chanh Nguyen <cnguyen@linkedin.com >
Co-authored-by: Chanh Nguyen <cnguyen@linkedin.com >
2025-04-21 18:18:22 +00:00
f728ab8e35
[Doc] mention how to install in CPU editable mode ( #16923 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-04-21 17:45:51 +00:00
63e26fff78
[doc] install required python3-dev apt package ( #16888 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-04-21 16:15:18 +00:00
fe3462c774
[XPU][Bugfix] minor fix for XPU ( #15591 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2025-04-22 00:02:57 +08:00
3b34fd5273
Raise error for data-parallel with benchmark_throughput ( #16737 )
...
Signed-off-by: Kartik Ramesh <kartikx2000@gmail.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-04-21 23:51:43 +08:00
55d6d3fdb8
[Bugfix] Fix GLM rotary_dim issue and support v1 ( #16912 )
...
Signed-off-by: isotr0py <2037008807@qq.com >
2025-04-21 14:26:34 +00:00
7272bfae77
[Misc] Refactor platform to get device specific stream and event ( #14411 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-21 21:25:49 +08:00
d9ac9e3dc5
[Misc] fix collect_env version parse ( #15267 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-04-21 20:29:40 +08:00
d41faaf9df
Restore buffers when wake up from level 2 sleep ( #16564 ) ( #16889 )
...
Signed-off-by: Han <zh950713@gmail.com >
2025-04-21 20:18:28 +08:00
b34f33438a
[Doc] Split dummy_processor_inputs() in Multimodal Docs ( #16915 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-21 11:10:01 +00:00
26c0406555
[Bugfix] Fix distributed bug in Qwen2.5-VL & Qwen2.5-Omni ( #16907 )
2025-04-21 10:25:21 +00:00
4c41278b77
[CI/CD][V1] Add spec decode tests to CI ( #16900 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-20 22:37:16 -07:00
bb3605db85
[Bugfix] Fix v1/spec_decode/test_ngram.py ( #16895 )
...
Signed-off-by: qizixi <qizixi@meta.com >
2025-04-20 20:54:29 -07:00
fe742aef5a
[easy] Pass compile_fx only the config patches ( #16845 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-20 12:25:19 +08:00
4b07d36891
Improve configs - CacheConfig ( #16835 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-20 12:25:04 +08:00
87aaadef73
Serialize tensors using int8 views ( #16866 )
...
Signed-off-by: Staszek Pasko <staszek@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-19 10:28:34 -07:00
682e0b6d2f
Log how much time loading a compiled artifact takes ( #16848 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-19 16:50:46 +00:00
d6195a748b
[doc] update hyperlink ( #16877 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-19 16:40:38 +00:00
205d84aaa9
[VLM] Clean up models ( #16873 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-19 12:13:06 +00:00
5124f5bf51
[Model] Qwen2.5-Omni Cleanup ( #16872 )
2025-04-19 09:37:02 +00:00
83f3c3bd91
[Model] Refactor Phi-4-multimodal to use merged processor and support V1 ( #15477 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-19 02:26:11 -07:00
d9737ca1c6
[V1][Misc] stop update prefix cache stats when logs_stats is disabled ( #16460 )
...
Signed-off-by: vie-serendipity <2733147505@qq.com >
2025-04-19 02:25:19 -07:00
9d4ca19d50
[Misc] Benchmarks for audio models ( #16505 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-19 02:24:14 -07:00
2ef0dc53b8
[Frontend] Add sampling params to v1/audio/transcriptions endpoint ( #16591 )
...
Signed-off-by: Jannis Schönleber <joennlae@gmail.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Jannis Schönleber <joennlae@gmail.com >
2025-04-19 07:03:54 +00:00
1d4680fad2
[rocm][MI300] llama4 maverick fp8 moe config tp8 ( #16847 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-04-19 06:21:43 +00:00
2c1bd848a6
[Model][VLM] Add Qwen2.5-Omni model support (thinker only) ( #15130 )
...
Signed-off-by: fyabc <suyang.fy@alibaba-inc.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Xiong Wang <wangxiongts@163.com >
2025-04-18 23:14:36 -07:00
5c9121203c
[release] Publish neuron docker image ( #16733 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2025-04-18 17:11:25 -07:00
490b1698a5
[Doc] Updated Llama section in tool calling docs to have llama 3.2 config info ( #16857 )
...
Signed-off-by: jmho <jaylenho734@gmail.com >
2025-04-18 23:28:53 +00:00
5a5e29de88
[Misc] refactor examples series - Chat Completion Client With Tools ( #16829 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-18 23:24:42 +00:00
3d3ab3689f
[New Model]: Snowflake Arctic Embed (Family) ( #16649 )
2025-04-18 08:11:57 -07:00
686623c5e7
Fix nullable_kvs fallback ( #16837 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-18 05:58:39 -07:00
aadb656562
[Misc] Clean up Kimi-VL ( #16833 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-18 05:15:09 -07:00
87e067de41
[Model] use AutoWeightsLoader for BigCode, GPT-J ( #16823 )
...
Signed-off-by: Jonghyun Choe <andy.choe729@gmail.com >
2025-04-18 10:42:41 +00:00
26507f8973
[Docs] Fix a link and grammar issue in production-stack.md ( #16809 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-18 06:42:58 +00:00
9c1d5b456d
[Doc] add podman setup instructions for official image ( #16796 )
...
Signed-off-by: Nathan Weinberg <nweinber@redhat.com >
2025-04-18 06:10:49 +00:00
e31045f95c
[Bugfix] fix pp for llama4 ( #16746 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-04-18 13:51:30 +08:00
aaec845f8e
[ROCm] [Attention] Cleanup ROCm output passing ( #16431 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2025-04-18 05:46:45 +00:00
7bdfd29a35
[Misc] add collect_env to cli and docker image ( #16759 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-17 22:13:35 -07:00
e78587a64c
Improve-mm-and-pooler-and-decoding-configs ( #16789 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-17 22:13:32 -07:00
7eb4255628
[BugFix] Accuracy fix for llama4 int4 - improperly casted scales ( #16801 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-17 22:13:29 -07:00
6a0f547561
Add hardware print to TPU V1 test ( #16792 )
2025-04-17 22:13:26 -07:00
30ed81b7ca
[V1][Structured Output] Minor modification to _validate_structured_output() ( #16748 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-18 13:12:54 +08:00
7a4a5de729
[Misc] Update outdated note: LMCache now supports chunked prefill ( #16697 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-18 05:12:42 +00:00
c16fb5dae8
[Doc] Improve help examples for --compilation-config ( #16729 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-17 21:22:34 -07:00
e37073efd7
Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema ( #16721 )
...
Signed-off-by: Tarun Kumar <takumar@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-17 21:08:27 -07:00
183dad7a85
[Attention] Update to lastest FA3 code ( #13111 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-17 15:14:07 -07:00
3408e47159
[P/D][V1] KV Connector API V1 ( #15960 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Signed-off-by: remi <remi@mistral.ai >
Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Rémi Delacourt <54138269+Flechman@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
2025-04-17 13:22:40 -07:00
0377b8310b
[MLA] Simplification to batch P/D reordering ( #16673 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-17 16:12:09 -04:00
e4755f7fac
[V1][Metrics] Fix http metrics middleware ( #15894 )
2025-04-17 19:52:18 +00:00
92edf35826
[ROCM] enable aiter fused moe kernel for llama4 bf16 checkpoints ( #16674 )
2025-04-17 11:44:34 -07:00
eb5819b2d9
[V1][TPU] Enable Top K ( #15489 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Hyesoo Yang <hyeygit@gmail.com >
Co-authored-by: Hyesoo Yang <hyeygit@gmail.com >
2025-04-17 18:18:11 +00:00
5989f4684d
[TPU][V1] Fix padding recompilation when max-num-batched-tokens is not even ( #16726 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-17 18:09:57 +00:00
5125d72f02
[Model] use AutoWeightsLoader for olmoe,opt,orion,persimmon,phi3_small ( #16548 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-17 17:48:31 +00:00
a018e555fd
[Kernel] Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 ( #16753 )
...
Signed-off-by: ximing.wxm <ximing.wxm@antgroup.com >
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com >
2025-04-18 00:01:30 +08:00
6211b92273
[Bugfix]Fix index out of range error in api server log ( #16787 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-04-17 09:01:07 -07:00
05fcd1b430
[V1][Perf] Faster incremental detokenization ( #15137 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-17 07:45:24 -07:00
7c02d6a137
[Doc] Changed explanation of generation_tokens_total and prompt_tokens_total counter type metrics to avoid confusion ( #16784 )
...
Signed-off-by: insukim1994 <insu.kim@moreh.io >
2025-04-17 14:10:08 +00:00
11c3b98491
[Doc] Document Matryoshka Representation Learning support ( #16770 )
2025-04-17 13:37:37 +00:00
dbe7f07001
[Doc] Make sure to update vLLM when installing latest code ( #16781 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-17 06:53:31 -06:00
c69bf4ee06
fix: hyperlink ( #16778 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-17 11:34:20 +00:00
d27ea94034
Improve configs - TokenizerPoolConfig + DeviceConfig ( #16603 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-17 11:19:42 +00:00
99ed526101
[Misc] refactor examples series - lmcache ( #16758 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-17 11:02:35 +00:00
207da28186
[Doc] Fix a 404 link in installation/cpu.md ( #16773 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-17 10:46:21 +00:00
5b1aca2ae3
[Bugfix] Fix GLM4 model ( #16618 )
...
Signed-off-by: intervitens <intervitens@tutanota.com >
2025-04-17 03:35:07 -07:00
d8e557b5e5
[doc] add open-webui example ( #16747 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-17 18:27:32 +08:00
61a44a0b22
[Doc] Add more tips to avoid OOM ( #16765 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-17 09:54:34 +00:00
a6481525b8
[misc] ignore marlin_moe_wna16 local gen codes ( #16760 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-17 17:15:14 +08:00
8cac35ba43
[Ray] Improve documentation on batch inference ( #16609 )
...
Signed-off-by: Richard Liaw <rliaw@berkeley.edu >
2025-04-16 22:19:26 -07:00
9dbf7a2dc1
[V1] Remove log noise when idle ( #16735 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-16 21:34:08 -07:00
607029e515
[Bugfix] Revert max_prompt_len validation for decoder-only models. ( #16741 )
...
Signed-off-by: David Heineman <david@davidheineman.com >
2025-04-16 21:33:15 -07:00
cb072ce93b
[Bugfix] Update Florence-2 tokenizer to make grounding tasks work ( #16734 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-17 04:17:39 +00:00
95aca283b4
[rocm][V0] fix selection logic for custom PA in V0 ( #16426 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-04-16 19:52:11 -07:00
2b05b8ce69
[V1][Frontend] Improve Shutdown And Logs ( #11737 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com >
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-16 19:48:34 -07:00
3c776dcefb
Adding vllm buildkite job for IBM Power ( #16679 )
...
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com >
2025-04-17 10:47:47 +08:00
2cbd4d2999
[V1][Spec Dec Bug Fix] Respect Spec Dec Method Specification ( #16636 )
...
Signed-off-by: Bryan Lu <yuzhelu@amazon.com >
2025-04-16 19:47:26 -07:00
3092375e27
[V1][Performance] Implement custom serializaton for MultiModalKwargs [Rebased] ( #16432 )
...
Signed-off-by: Staszek Pasko <staszek@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-16 19:28:32 -07:00
3cd91dc955
Help user create custom model for Transformers backend remote code models ( #16719 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-17 01:05:59 +00:00
8a7368e069
[Misc] Remove redundant comment ( #16703 )
...
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com >
2025-04-17 00:44:52 +00:00
93e561ec4d
Improve error for structured output backend selection ( #16717 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-17 00:35:35 +00:00
e1b004839a
[Hardware] Add processor inputs to platform validation ( #16680 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-04-16 09:28:42 -07:00
ee378f3d49
[Model] support modernbert ( #16648 )
...
Signed-off-by: 唯勤 <xsank.mz@alibaba-inc.com >
Co-authored-by: 唯勤 <xsank.mz@alibaba-inc.com >
2025-04-16 05:30:15 -07:00
e82ee40de3
[Bugfix][Kernel] fix potential cuda graph broken for merge_attn_states kernel ( #16693 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-16 03:31:39 -07:00
facbe2a114
[Doc] Improve OOM troubleshooting ( #16704 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-16 18:29:48 +08:00
7168920491
[Misc] refactor examples series ( #16708 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-16 10:16:36 +00:00
21378a2323
[CI] Cleanup additional_dependencies: [toml] for pre-commit yapf hook ( #16405 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-04-16 10:05:31 +00:00
976711d9db
[V1][Structured Output] Move xgrammar related utils to backend_xgrammar.py ( #16578 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-16 17:01:36 +08:00
44fa4d556c
[ROCM] Bind triton version to 3.2 in requirements-built.txt ( #16664 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-04-16 14:05:28 +08:00
3ac98edcb1
[Feature] add model aware kv ops helper ( #16020 )
...
Signed-off-by: billishyahao <bill.he@amd.com >
2025-04-15 23:00:43 -07:00
966c742ed2
Disable remote caching when calling compile_fx ( #16611 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-15 22:18:28 -07:00
0d7d05f4b6
[Misc] Modify LRUCache touch ( #16689 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-16 04:51:38 +00:00
96bb8aa68b
[Bugfix] fix gpu docker image mis benchmarks dir ( #16628 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-15 21:21:14 -07:00
3badb0213b
[Model] Add PLaMo2 ( #14323 )
...
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com >
Signed-off-by: shemmi <shemmi@preferred.jp >
Co-authored-by: Kento Nozawa <nzw0301@preferred.jp >
Co-authored-by: Hiroaki Mikami <mhiroaki@preferred.jp >
Co-authored-by: Calvin Metzger <metzger@preferred.jp >
2025-04-15 19:31:30 -07:00
fdcb850f14
[Misc] Enable vLLM to Dynamically Load LoRA from a Remote Server ( #10546 )
...
Signed-off-by: Angky William <angkywilliam@Angkys-MacBook-Pro.local >
Co-authored-by: Angky William <angkywilliam@Angkys-MacBook-Pro.local >
2025-04-15 22:31:38 +00:00
54a66e5fee
[Misc] Update compressed-tensors WNA16 to support zero-points ( #14211 )
2025-04-15 07:33:51 -06:00
280d62b8a2
[Kernel] Remove redundant Exp calculations ( #16123 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-15 12:58:37 +00:00
1666e66443
Add "/server_info" endpoint in api_server to retrieve the vllm_config. ( #16572 )
...
Signed-off-by: Xihui Cang <xihuicang@gmail.com >
2025-04-15 11:50:38 +00:00
1575c1701a
[CI/Build] Fix LoRA OOM ( #16624 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-15 16:38:19 +08:00
6ae996a873
[Misc] refactor argument parsing in examples ( #16635 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-15 08:05:30 +00:00
b590adfdc1
Fix vLLM x torch.compile config caching ( #16491 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-14 23:11:11 -07:00
b4fe16c75b
Add vllm bench [latency, throughput] CLI commands ( #16508 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-14 23:10:35 -07:00
bc5dd4f669
[Bugfix] Fix broken GritLM model and tests (missing pooling_metadata) ( #16631 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2025-04-14 23:09:58 -07:00
dbb036cf61
[Bugfix] Fix tests/kernels/test_mamba_ssm_ssd.py ( #16623 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-04-15 05:35:38 +00:00
70e7ed841d
[BugFix]: Update minimum pyzmq version ( #16549 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-04-14 20:06:03 -07:00
d06ba4ed3f
[Kernel] moe wna16 marlin kernel ( #14447 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-14 20:05:22 -07:00
6b40996ae8
[Core][Bugfix] Fix Offline MM Beam Search ( #16390 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-15 10:33:02 +08:00
d2020acac7
config check sleep mode support oot platforms ( #16562 )
2025-04-14 16:31:50 -07:00
1eb3c2ed48
[DOC][TPU] Add core idea about avoiding recompilation after warmup ( #16614 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-14 21:56:06 +00:00
c64ee87267
[Hardware][TPU] Add torchvision to tpu dependency file ( #16616 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-04-14 17:50:46 -04:00
b1308b84a3
[Model][VLM] Add Kimi-VL model support ( #16387 )
...
Signed-off-by: courage17340 <courage17340@163.com >
2025-04-14 21:41:48 +00:00
7b5ecf79bd
s390x: Fix PyArrow build and add CPU test script for Buildkite CI ( #16036 )
...
Signed-off-by: Nishan Acharya <Nishan.Acharya@ibm.com >
2025-04-14 10:55:32 -07:00
9883a18859
Fix triton install condition on CPU ( #16600 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-14 17:06:01 +00:00
b3f2fddd17
[TPU][V1] Fix exponential padding when max-num-batched-tokens is not a power of 2 ( #16596 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-14 17:01:05 +00:00
aa29841ede
[Bugfix] Multi-modal caches not acting like LRU caches ( #16593 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-14 09:24:16 -07:00
6bf27affb6
[fix]: Dockerfile.ppc64le fixes for opencv-python and hf-xet ( #16048 )
...
Signed-off-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com >
2025-04-14 17:08:39 +01:00
1dd23386ec
[Misc] Update usage with mooncake lib for kv transfer ( #16523 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-04-14 11:31:37 +00:00
7cbfc10943
[Misc] refactor examples ( #16563 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-14 09:59:15 +00:00
ce4ddd2d1a
[Misc] remove warning if triton>=3.2.0 ( #16553 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-14 02:39:47 -07:00
e51929ebca
Improve configs - SchedulerConfig ( #16533 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-14 17:24:16 +08:00
dc1b4a6f13
[Core][V0] Enable regex support with xgrammar ( #13228 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-14 10:13:38 +08:00
63d2705edb
[Benchmark][Bugfix] Fix SonnetDataset default values in benchmark_throughput.py ( #16556 )
2025-04-13 17:20:26 -07:00
d085a44082
Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) ( #16537 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-13 14:55:18 +00:00
f49e5aff11
[V1][Spec Decode] KV cache slots for eagle heads ( #16370 )
...
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-04-12 19:42:51 -07:00
6c11ecf8d3
[Bugfix] Validate logit biases to prevent out of vocab ids crashing engine ( #16529 )
...
Signed-off-by: Ryan McConville <ryan@ryanmcconville.com >
2025-04-12 20:19:19 +00:00
93e5f3c5fb
[Perf] Optimize Preparing Inputs for GPU Model Runner ( #16484 )
...
Signed-off-by: snowcharm <snowcharmqq@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-12 22:54:37 +08:00
70363bccfa
Fix syntaxWarning: invalid escape sequence '\s' ( #16532 )
...
Signed-off-by: Jie Fu <jiefu@tencent.com >
2025-04-12 14:39:42 +00:00
3cdc57669f
[Misc] Delete redundant code ( #16530 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-04-12 11:21:37 +00:00
68bb122eb4
[MISC] Make GroupCoordinator compatible with out-of-tree devices ( #16464 )
...
Signed-off-by: hzji210@gmail.com <hzji210@gmail.com >
2025-04-12 09:20:25 +00:00
d9fc8cd9da
[V1] Enable multi-input by default ( #15799 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-12 08:52:39 +00:00
f069f3ea74
[Misc] Openai transcription client example use same Whisper model ( #16487 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-12 07:27:03 +00:00
c5bc0e7fcc
[Misc] Update chat utils tests ( #16520 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-12 06:48:43 +00:00
4a3a518722
fix: spelling ( #16466 )
...
Signed-off-by: Tianer Zhou <ezhoureal@gmail.com >
2025-04-11 23:24:22 -07:00
fbf722c6e6
[Frontend] support matryoshka representation / support embedding API dimensions ( #16331 )
2025-04-11 23:23:10 -07:00
e92d7085bf
[Feature][V1] Add xgrammar to support minLength, maxLength with test ( #16516 )
...
Signed-off-by: Leon Seidel <leon.seidel@fau.de >
2025-04-11 23:22:07 -07:00
bd6028d6b0
Optimized topk for topk=1 (Llama-4) ( #16512 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-12 14:21:08 +08:00
802329dee9
[Doc] Update Llama4 Model Names in Supported Models ( #16509 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-04-12 02:53:10 +00:00
41cc883c29
[BugFix] Handle non-contiguous tensors properly when serializing ( #16492 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-11 17:54:06 -07:00
57504a4bcf
[CI][Bugfix] Add mistral_tool_use to Ci ( #16517 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-11 17:52:38 -07:00
ed4792c990
[Doc] Fix link to vLLM blog ( #16519 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-04-11 17:39:23 -07:00
87b836ba77
Bugfix for PixtralHF models without spatial_merge_size ( #16513 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-11 23:32:22 +00:00
56c76c2e0e
[Bugfix] clean up duplicated code ( #16485 )
...
Signed-off-by: Gogs <gogs@fake.local >
Co-authored-by: Gogs <gogs@fake.local >
2025-04-11 23:19:40 +00:00
c09632a66c
Update openai_compatible_server.md ( #16507 )
...
Signed-off-by: Christian Sears <csears@redhat.com >
2025-04-11 22:54:58 +00:00
a3bf8d4a2b
[Kernel] Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 ( #16488 )
2025-04-12 06:26:55 +08:00
16eda8c43a
[Frontend] Added chat templates for LLaMa4 pythonic tool calling ( #16463 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
Co-authored-by: Kai Wu <kaiwu@meta.com >
2025-04-12 06:26:17 +08:00
cd77382ac1
Improve configs - LoadConfig ( #16422 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-11 20:27:27 +00:00
71b9cde010
[Bugfix] handle alignment of encoder_seq_lens in mllama.py ( #14784 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-04-11 19:59:50 +00:00
5285589f37
[Doc] Document InternVL3 support ( #16495 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-11 19:41:09 +00:00
f41647ee6b
[Kernel] Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel ( #16366 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-11 17:54:08 +00:00
4d022cbc75
[TPU][V1] Make --disable_chunked_mm_input mandatory for serving MM models ( #16483 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-11 17:06:14 +00:00
70de35a881
Fix erroneous "model doesn't support compile" warning ( #16486 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-11 16:24:36 +00:00
34b2cf3b33
[Hardware][Intel-Gaudi] Multi-step scheduling implementation for HPU ( #12779 )
...
Signed-off-by: Tomasz Zielinski <tomasz.zielinski@intel.com >
2025-04-11 07:38:36 -07:00
9e90c9f73f
[Bugfix] Fix bugs of running Quark quantized models ( #16236 )
...
Signed-off-by: chaow <chaow@amd.com >
2025-04-11 10:18:32 -04:00
e9528f6dc6
[Kernel] support merge_attn_states CUDA kernel, 3x speedup ( #16173 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-11 06:50:50 -06:00
51baa9c333
Don't install triton on ppc64le platform ( #16470 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-11 10:11:00 +00:00
35e076b3a8
[Misc] update api_client example ( #16459 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-11 10:05:40 +00:00
a26f59ccbc
[Misc] Raise error for V1 not supporting Long LoRA. ( #16415 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-11 01:51:20 -07:00
aa3b3d76e0
Enforce valid max_num_batched_tokens when disable_chunked_mm_input=True ( #16447 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-11 08:09:52 +00:00
f7030df3be
[Core][LoRA][1/N] Add LoRA for EncoderDecoderModelRunner ( #15990 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-11 15:32:37 +08:00
905e91e9ac
Revert "[Model] use AutoWeightsLoader for deepseek_v2, internlm2" ( #16453 )
2025-04-11 06:44:22 +00:00
f8f9c0ba62
[Bugfix] Don't set an upper bound on repetition penalty ( #16403 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-11 14:19:40 +08:00
dda811021a
[CPU][Bugfix] Fix CPU docker issues ( #16454 )
...
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-04-11 14:19:07 +08:00
93195146ea
[Bugfix][VLM] Fix failing Phi-4-MM multi-images tests and add vision-speech test ( #16424 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-11 04:57:16 +00:00
ed37599544
Update supported_hardware.md for TPU INT8 ( #16437 )
2025-04-11 12:28:07 +08:00
99ef59cf7f
[Llama4] Enable attention temperature tuning by default for long context (>32k) ( #16439 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-04-10 21:26:07 -07:00
d544d141ec
update benchmark_serving_structured_output to include auto backend ( #16438 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-04-11 12:25:52 +08:00
3e397a9484
check input length of sonnet samples ( #16423 )
...
Signed-off-by: alexey-belyakov <alexey.belyakov@intel.com >
2025-04-11 10:15:06 +08:00
268c325078
Fix range_ratio Bug in RandomDataset ( #16126 )
...
Signed-off-by: jadewang21 <jadewangcn@outlook.com >
2025-04-10 15:31:17 -07:00
3cc9af88ff
[TPU][V1] Disable per-request seed/Generator ( #16172 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-10 17:05:44 -04:00
7cd0bd7212
[Bugfix] Fix output token length check logic ( #16419 )
...
Signed-off-by: look <eeslook@163.com >
2025-04-10 20:16:48 +00:00
56d4aefa33
[VLM] Avoid unnecessary dummy multimodal data during processing ( #16416 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-10 19:32:14 +00:00
dd143ef541
[V1] Zero-copy tensor/ndarray serialization/transmission ( #13790 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-10 19:23:14 +00:00
daefed052c
[Model] Reduce redundant computations in mamba2 blocks for Bamba-9B ( #15423 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
2025-04-10 19:07:07 +00:00
5fbab20e02
[Bugfix] Fix bug when dataset is json ( #15899 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-04-10 18:35:41 +00:00
e8224f3dca
[V1][Spec Decode] Eagle Model loading ( #16035 )
...
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-04-10 11:21:48 -07:00
9665313c39
[V1] Set structured output backend to auto by default ( #15724 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-10 17:53:26 +00:00
0c54fc7273
Improve configs - ParallelConfig ( #16332 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-10 17:34:37 +00:00
c1b57855ec
[TPU][V1] Use language_model interface for getting text backbone in MM ( #16410 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-10 17:32:04 +00:00
83b824c8b4
[VLM] Remove BaseProcessingInfo.get_mm_max_tokens_per_item ( #16408 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-10 09:06:58 -07:00
7678fcd5b6
Fix the torch version parsing logic ( #15857 )
2025-04-10 07:37:47 -07:00
8661c0241d
[CI] Add auto update workflow for Dockerfile graph ( #11879 )
...
Signed-off-by: wineandchord <guoqizhou19@gmail.com >
2025-04-10 13:43:05 +00:00
ce8d6b75fc
[doc] update the wrong link ( #16401 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-10 21:02:37 +08:00
61de3ef74b
[Model] Remove image mm limit for LLaMa4 ( #16365 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-04-10 09:36:27 +00:00
ec1f9c8c91
Update Numba to 0.61.2 ( #16376 )
...
Signed-off-by: cyy <cyyever@outlook.com >
2025-04-10 07:59:37 +00:00
65e09094c4
[doc] add download model tips ( #16389 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-10 07:45:26 +00:00
c70cf0fe06
[Kernel] Use moe_wna16 kernel for compressed tensors wna16 moe models ( #16038 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-10 15:08:47 +08:00
a5d11a54dc
[Bugfix] Fix validation error for text-only Mllama 3.2 ( #16377 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-10 14:19:42 +08:00
3d4c87758e
[Misc] Update transformers version limits of multi-modal tests ( #16381 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-09 23:03:33 -07:00
a9bd832fc5
[Model] use AutoWeightsLoader for deepseek_v2, internlm2 ( #16383 )
...
Signed-off-by: Aaron Ang <aaron.angyd@gmail.com >
2025-04-09 23:01:00 -07:00
417bcefbae
fix sonnet dataset sample when prefix len is very small ( #16379 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-04-10 05:35:07 +00:00
baada0e737
[Bugfix][TPU] Fix TPU validate_request ( #16369 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-04-10 12:55:12 +08:00
82eb61dd4c
[misc] use tqdm.auto where appropriate ( #16290 )
...
Signed-off-by: Benjamin Kitor <bkitor@gigaio.com >
2025-04-09 21:54:54 -07:00
0d4d06fe2f
[CI][Bugfix] Pin triton version for CPU ( #16384 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-04-10 04:35:00 +00:00
4aed0ca6a2
[bugfix] Avoid the time consumption caused by creating dummy videos. ( #16371 )
2025-04-10 04:30:05 +00:00
1621b25288
[TPU] Fix dummy loading OOM ( #16372 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-10 04:06:16 +00:00
a564797151
[Model] use AutoWeightsLoader for granite, granitemoe, granitemoeshared, grok1, mixtral ( #16325 )
...
Signed-off-by: Aaron Ang <aaron.angyd@gmail.com >
2025-04-09 20:07:40 -07:00
1da6a09274
[Bugfix]: do not shutdown server if skip_special_use=False for MistralTokenizer ( #14094 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-09 19:43:09 -07:00
1e44ffc3ff
Add GLM-4-0414 support ( #16338 )
...
Signed-off-by: lvfei.lv <lvfei.lv@alibaba-inc.com >
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: Ajay Vohra <ajayvohr@amazon.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
Co-authored-by: Accelerator1996 <lvfei.lv@alibaba-inc.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: yihong <zouzou0208@gmail.com >
Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com >
Co-authored-by: ajayvohra2005 <ajayvohr@amazon.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-10 09:19:42 +08:00
a454748544
[TPU][V1] Refine tpu_model_runner to mitigate future recompilation issues ( #16275 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-09 18:51:51 -06:00
1bff42c4b7
[Misc] refactor Structured Outputs example ( #16322 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-09 23:32:42 +00:00
cb391d85dc
[Hardware] add platform-specific request validation api ( #16291 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-04-09 12:50:01 -07:00
fee5b8d37f
[Build/CI] Add tracing deps to vllm container image ( #15224 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-09 19:14:06 +00:00
b2ce859bd2
Fix benchmark_throughput.py --backend=hf ( #16352 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-09 19:09:28 +00:00
566f10a929
[CI]Fix hpu docker and numpy version for CI ( #16355 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2025-04-09 17:52:26 +00:00
c3b5189137
[Bugfix] catch AssertionError in MistralTokenizer as ValueError ( #16344 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-09 17:33:24 +00:00
a25866ac8d
[Bugfix] Fix profiling.py ( #16202 )
...
Signed-off-by: zh Wang <rekind133@outlook.com >
2025-04-09 17:03:34 +00:00
098900d7c2
Revert "Update label-tpu mergify and remove removal bot" ( #16350 )
2025-04-09 07:59:36 -07:00
98d01d3ce2
[Bugfix][Frontend] respect provided default guided decoding backend ( #15476 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-09 05:11:10 -07:00
d55244df31
[Model] Add SupportsMultiModal.get_language_model interface ( #16007 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-09 04:12:54 -07:00
04149cce27
[BugFix] fix some typos found by typos. ( #16314 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-09 03:43:59 -07:00
24834f4894
update neuron config ( #16289 )
...
Signed-off-by: Ajay Vohra <ajayvohr@amazon.com >
2025-04-09 03:43:22 -07:00
ec7da6fcf3
[BugFix] llama4 qknorm should be not shared across head ( #16311 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-04-09 00:59:14 -07:00
819d548e8a
[BugFix] logger is not callable ( #16312 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-09 00:59:02 -07:00
477d2a8aa2
Update label-tpu mergify and remove removal bot ( #16298 )
2025-04-09 07:56:25 +00:00
e484e02857
[Bugfix] Avoid transferring cached multi-modal items from P0 to P1 ( #16273 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-09 00:51:27 -07:00
24f6b9a713
[Misc] Fix test_sharded_state_loader.py( #16004 ) ( #16005 )
...
Signed-off-by: lvfei.lv <lvfei.lv@alibaba-inc.com >
2025-04-09 14:47:30 +08:00
9cdde47289
[BugFix] Fix fusion test and add them to CI ( #16287 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-04-08 23:46:45 -07:00
b1eb4ca152
[TPU] Update PyTorch/XLA ( #16288 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-09 14:46:32 +08:00
87b4ac56c2
[CI][Bugfix] Fix bad tolerance for test_batch_base64_embedding ( #16221 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-09 04:14:46 +00:00
cb84e45ac7
[Core] Upgrade to xgrammar 0.1.18, add cache size limit ( #16283 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-08 19:13:22 -07:00
4716377fbc
[Feature] Estimate max-model-len use available KV cache memory ( #16168 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-08 19:12:51 -07:00
4e9cf8c1dd
[Bugfix] fix gettid method is not define ( #16084 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-08 19:12:44 -07:00
2976dc27e9
[Bug] [ROCm] Fix Llama 4 Enablement Bug on ROCm: V0 ROCmFlashAttentionImpl and Triton Fused MoE bugs ( #16198 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
Co-authored-by: Hongxia Yang <hongxia.yang@amd.com >
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com >
2025-04-08 19:12:34 -07:00
102bf967f0
[Model] Add smolvlm support ( #16017 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-08 19:12:17 -07:00
1f4b09b525
Add support to modelopt quantization of Mixtral model ( #15961 )
...
Signed-off-by: Yue <yueshen@nvidia.com >
2025-04-09 01:53:31 +00:00
86c3369eb8
[CI/Build] Fix CI LoRA failure ( #16270 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-09 09:13:56 +08:00
2755c34a8f
[V1] Update structured output offline inference example ( #15721 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-08 22:34:09 +00:00
db10422184
[Bugfix] fix deepseek fp16 scale bug ( #14809 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-08 16:56:09 -04:00
e1a2c699dd
[BugFix] Fix Llama4 - Index Error When Single Request Near Max Context ( #16209 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-08 18:56:51 +00:00
0115ccd5c0
Add warning that content below line in template will be removed ( #16276 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-08 18:18:40 +00:00
40b4284fe3
[Bugfix] Handle process_weights_after_loading for QKVCrossParallelLinear ( #15328 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-08 10:02:23 -07:00
4ebc0b9640
[Bugfix] Proper input validation for multi-modal encoder-decoder models ( #16156 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-08 09:45:21 -07:00
dc96fd54c6
[Misc] Avoid stripping meaningful whitespace from nvidia-smi topo -m output in collect_env.py ( #16272 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2025-04-08 16:08:09 +00:00
1f5d13ab9f
[New Model]: jinaai/jina-embeddings-v3 ( #16120 )
2025-04-08 08:39:12 -07:00
90cb44eb02
Update to transformers==4.51.1 ( #16257 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-08 06:53:39 -07:00
e11880deea
[Bugfix] Remove triton do_bench fast_flush arg ( #16256 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-04-08 13:51:06 +00:00
9351f91be9
[BugFix][ROCm] Fix GGUF MoE Dispatch Block_Dim for ROCm ( #16247 )
...
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
2025-04-08 05:10:26 -07:00
5a1e1c8353
[Model] use AutoWeightsLoader for phimoe,qwen2_moe,qwen3_moe ( #16203 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-08 04:05:47 -07:00
69ecaa7c79
[Misc] Add warning for multimodal data in LLM.beam_search ( #16241 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-08 04:05:27 -07:00
7f00899ff7
[Misc] format and refactor some examples ( #16252 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-08 10:42:32 +00:00
995e3d1f41
[Docs] Add Slides from Singapore Meetup ( #16213 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-04-08 07:20:22 +00:00
b4ac449a83
[Misc] Merge the logs of pp layers partitions ( #16225 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-04-08 00:18:15 -07:00
8e5314a468
[V1] Add disable_chunked_mm_input arg to disable partial mm input prefill ( #15837 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-07 23:24:07 -07:00
87918e40c4
[torch.compile][TPU] Make @support_torch_compile work for XLA backend ( #15782 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-08 14:23:53 +08:00
f6b32efb7f
[Bugfix] Fix and reorganize broken GGUF tests and bump gguf version ( #16194 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-08 13:38:13 +08:00
b99733d092
[Bugfix] Do not skip "empty" parts of chats that are parsable ( #16219 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-08 05:14:15 +00:00
05a015d6a5
Add warning for Attention backends that do not support irope yet ( #16212 )
2025-04-08 03:59:26 +00:00
ad971af8c7
[Bugfix] fix use-ep bug to enable ep by dp/tp size > 1 ( #16161 )
2025-04-07 20:48:47 -07:00
f2ebb6f541
[V1] Scatter and gather placeholders in the model runner ( #16076 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Jennifer Zhao <ai.jenniferzhao@gmail.com >
2025-04-08 10:43:41 +08:00
1d01211264
Update BASE_IMAGE to 2.22 release of Neuron ( #16218 )
2025-04-07 19:11:18 -07:00
f94ab12f79
[Misc] Update compressed-tensors to version 0.9.3 ( #16196 )
...
Signed-off-by: Miles Williams <42222518+mlsw@users.noreply.github.com >
2025-04-07 19:09:06 -07:00
a865bc1ca6
[core] do not send error across process ( #16174 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-04-07 19:09:03 -07:00
21802c4b6d
[ROCm][Bugfix][FP8] Make fp8 quant respect fused modules mapping ( #16031 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-04-07 21:28:14 -04:00
652907b354
Torchao ( #14231 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-04-07 19:39:28 -04:00
24f1c01e0f
[Bugfix][V0] XGrammar structured output supports Enum ( #15878 )
...
Signed-off-by: Leon Seidel <leon.seidel@fau.de >
2025-04-07 22:38:25 +00:00
fad6e2538e
[Misc] add description attribute in CLI ( #15921 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-07 22:30:35 +00:00
7f6d47c1a2
[V1][BugFix] Exit properly if engine core fails during startup ( #16137 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-07 15:30:15 -07:00
3147586ebd
[Bugfix] Fix guidance backend for Qwen models ( #16210 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-04-07 22:15:43 +00:00
ed636d99ca
[Misc] Move Llama 4 projector call into encoder execution ( #16201 )
2025-04-07 14:02:05 -07:00
090c856d76
[Misc] Human-readable max-model-len cli arg ( #16181 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-04-07 14:40:58 -04:00
ad434d4cfe
Print the warning only once ( #16193 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-07 18:30:06 +00:00
66d433b94f
[V1] Revert the default max_num_seqs to V0 values for most hardware ( #16158 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-07 13:54:36 -04:00
027b204ff1
[Bugfix] Re-enable support for ChatGLMForConditionalGeneration ( #16187 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-07 23:15:58 +08:00
55dcce91df
Upstream Llama4 Support to Main ( #16113 )
...
Signed-off-by: Aston Zhang <22279212+astonzhang@users.noreply.github.com >
Signed-off-by: Chris Thi <chris.c.thi@gmail.com >
Signed-off-by: drisspg <drisspguessous@gmail.com >
Signed-off-by: Jon Swenson <jmswen@gmail.com >
Signed-off-by: Keyun Tong <tongkeyun@gmail.com >
Signed-off-by: Lu Fang <fanglu@meta.com >
Signed-off-by: Xiaodong Wang <xdwang@meta.com >
Signed-off-by: Yang Chen <yangche@fb.com >
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com >
Signed-off-by: Lu Fang <lufang@fb.com >
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: Lucia Fang <fanglu@fb.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-07 08:06:27 -07:00
8017c8db7f
[Doc]Update image to latest version ( #16186 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-04-07 14:17:39 +00:00
dc3529dbf6
[Misc] improve example mlpspeculator and llm_engine_example ( #16175 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-07 11:53:52 +00:00
7699258ef0
[Model] Add Qwen3 and Qwen3MoE ( #15289 )
...
Signed-off-by: YamPengLi <yampayne.lyp@alibaba-inc.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-07 04:06:41 -07:00
e9ba99f296
[V1][Structured Output] Add supports_structured_output() method to Platform ( #16148 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-07 11:06:24 +00:00
7c80368710
[VLM] Florence-2 supports online serving ( #16164 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-07 04:04:02 -07:00
95d63f38c0
doc: fix some typos in doc ( #16154 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-07 05:32:06 +00:00
bb8dab821e
[CI] Set max transformers version for Ultravox model test ( #16149 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-04-07 04:37:58 +00:00
fc0f87768a
[Bugfix] Make dummy encoder prompt padding alternative and add missing warnings ( #16129 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-07 04:07:15 +00:00
0a57386721
[Misc] Update Mistral-3.1 example ( #16147 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-07 03:57:37 +00:00
3749e28774
[V1][Minor] Minor simplification for get_computed_blocks ( #16139 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-06 20:38:12 -07:00
86fc2321ff
[Metrics] Add bucket for request_latency, time_to_first_token and time_per_output_token ( #15202 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-04-06 20:34:51 -07:00
2549c0dfef
Fix requires-python ( #16132 )
2025-04-06 19:22:25 -07:00
b10e519895
[V1][Minor] Optimize get_cached_block ( #16135 )
2025-04-06 20:48:14 +00:00
9bde5ba127
[TPU] Update PyTorch/XLA ( #16130 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-06 18:25:55 +00:00
72c8f1ad04
[Misc] update requires-python in pyproject.toml ( #16116 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-06 14:56:34 +00:00
da224daaa9
[Bugfix] add hf_token to EngineArgs ( #16093 )
...
Signed-off-by: paolovic <paul-philipp.luley@uzh.ch >
Co-authored-by: paolovic <paul-philipp.luley@uzh.ch >
2025-04-06 14:47:33 +00:00
3a100b9278
[Bugfix] LoRA : Fix the order in which the kernels process LoRAs ( #16040 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-04-06 14:04:50 +00:00
242a637aea
[Model] use AutoWeightsLoader for stablelm,starcoder2,zamba2 ( #16103 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-06 05:52:01 -07:00
c2a9671510
[Misc] Improve model redirect to accept json dictionary ( #16119 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-06 05:51:45 -07:00
d5ae4f7f42
[Doc][Bugfix] Add missing EOF in k8s deploy doc ( #16025 )
2025-04-06 12:10:57 +00:00
b6c502a150
[Misc] refactor example eagle ( #16100 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-06 09:42:48 +00:00
9ca710e525
[CI][V1] Fix passing tokenizer as kwarg to validate_guidance_grammar ( #16117 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-04-06 16:18:00 +08:00
eb07c8cb5b
[Frontend] Fix typo in tool chat templates for llama3.2 and toolace ( #14501 )
...
Signed-off-by: Ben Jackson <ben@ben.com >
2025-04-06 07:44:36 +00:00
ba10801961
[Benchmark] Add sampling parameters to benchmark_serving. ( #16022 )
...
Signed-off-by: Hyesoo Yang <hyeygit@gmail.com >
2025-04-06 12:30:35 +08:00
620fc2d09e
[Model] fix model testing for TeleChat2ForCausalLM and V0 llama4 ( #16112 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-04-05 21:23:40 -07:00
29283eaa7e
[Model] use AutoWeightsLoader for phi, gemma, deepseek ( #16088 )
...
Signed-off-by: Jonghyun Choe <andy.choe729@gmail.com >
2025-04-05 20:34:38 -07:00
2fa66ef713
[Bugfix] fix use_atomic_add support of marlin kernel when using v1 engine ( #15946 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-04-05 20:04:22 -07:00
13affc432d
[Misc] Remove redundant code ( #16098 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-05 20:03:50 -07:00
d8f094a92a
[Misc] format output for encoder_decoder.py ( #16095 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-05 19:57:18 -07:00
97ae6d777f
Fix some capitalisations in generated examples doc titles ( #16094 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-05 13:44:03 +00:00
6baeee70d1
Revert "doc: add info for macos clang errors ( #16049 )" ( #16091 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-05 11:51:51 +00:00
d2517a4939
[doc] fix 404 ( #16082 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-05 11:39:18 +00:00
6342adc438
fix: support clang17 for macos and fix the real libomp ( #16086 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-05 11:00:12 +00:00
0adba91547
[CI] Fix benchmark script level ( #16089 )
2025-04-05 03:36:01 -07:00
4285e423a6
[Misc] Auto detect bitsandbytes pre-quantized models ( #16027 )
...
Signed-off-by: Tristan Leclercq <tristanleclercq@gmail.com >
2025-04-04 23:30:45 -07:00
63375f0cdb
[V1][Spec Decode] Update N-gram Proposer Interface ( #15750 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-04 16:32:54 -07:00
70ad3f9e98
[Bugfix][TPU] Fix V1 TPU worker for sliding window ( #16059 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-04-04 23:31:19 +00:00
d6fc629f4d
[Kernel][Minor] Re-fuse triton moe weight application ( #16071 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-04-04 23:27:34 +00:00
af51d80fa1
Revert "[V1] Scatter and gather placeholders in the model runner" ( #16075 )
2025-04-04 14:50:57 -07:00
f5722a5052
[V1] Scatter and gather placeholders in the model runner ( #15712 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-04-04 21:26:44 +00:00
651cf0fec1
[V1] DP scale-out (1/N): Use zmq ROUTER/DEALER sockets for input queue ( #15906 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-04 12:56:43 -07:00
4dc52e1c53
[CI] Reorganize .buildkite directory ( #16001 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2025-04-04 12:16:20 -07:00
4708f13a9c
[Bugfix] Fix default behavior/fallback for pp in v1 ( #16057 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-04 17:58:08 +00:00
a6d042df0a
[ROCm][Bugfix] Bring back fallback to eager mode removed in #14917 , but for ROCm only ( #15413 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-04 09:40:37 -07:00
40a36ccfeb
[ROCm][Bugfix] Use platform specific FP8 dtype ( #15717 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-04 09:40:20 -07:00
ef608c37a7
[Distributed] [ROCM] Fix custom allreduce enable checks ( #16010 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-04-04 09:39:08 -07:00
2386803f2a
[CPU] Change default block_size for CPU backend ( #16002 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-04-04 09:39:05 -07:00
95862f7b4d
[Benchmark][Doc] Update throughput benchmark and README ( #15998 )
...
Signed-off-by: StevenShi-23 <shi.ziji.sm@gmail.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-04-04 09:39:02 -07:00
230b131b54
[Bugfix][kernels] Fix half2float conversion in gguf kernels ( #15995 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-04 09:38:58 -07:00
0812d8dd41
[Hardware][Gaudi][BugFix] fix arguments of hpu fused moe ( #15945 )
...
Signed-off-by: zhenwei <zhenweiliu@habana.ai >
2025-04-04 09:38:55 -07:00
bf7e3c51ae
[Model] use AutoWeightsLoader for baichuan, gpt-neox, mpt ( #15939 )
...
Signed-off-by: Jonghyun Choe <andy.choe729@gmail.com >
2025-04-04 09:38:52 -07:00
a35a8a8392
[V1][Spec Decode] Avoid logging useless nan metrics ( #16023 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-04-04 08:52:41 -07:00
4ef0bb1fcf
doc: add info for macos clang errors ( #16049 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-04 14:58:16 +00:00
fadc59c0e6
[TPU][V1] Remove ragged attention kernel parameter hard coding ( #16041 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-04 07:48:50 -04:00
86cbd2eee9
[Misc] improve gguf check ( #15974 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-04 01:33:36 +00:00
092475f738
[ROCm] Tweak the benchmark script to run on ROCm ( #14252 )
2025-04-03 17:12:48 -07:00
dcc56d62da
[Bugfix] Fix function names in test_block_fp8.py ( #16033 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-04-03 23:01:34 +00:00
f15e70d906
[TPU] Switch Test to Non-Sliding Window ( #15981 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-04-03 14:28:45 -07:00
b6be6f8d1e
[TPU] Support sliding window and logit soft capping in the paged attention kernel for TPU. ( #15732 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-04-03 14:23:28 -07:00
03a70eacaf
Re-enable the AMD Testing for the passing tests. ( #15586 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-04-03 11:05:17 -07:00
45b1ff7a25
[Misc][Performance] Advance tpu.txt to the most recent nightly torch … ( #16024 )
2025-04-03 17:32:54 +00:00
15ba07ef25
[Minor] Fused experts refactor ( #15914 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-04-03 10:19:38 -07:00
d2b58ca203
[Neuron][kernel] Fuse kv cache into a single tensor ( #15911 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-04-03 09:51:32 -07:00
82e7e19a6e
[SupportsQuant] Chameleon, Chatglm, Commandr ( #15952 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-04-03 08:25:22 -07:00
421c462948
[SupportsQuant] Bert, Blip, Blip2, Bloom ( #15573 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-04-03 08:23:19 -07:00
84884cd9ac
fix: tiny fix make format.sh excutable ( #16015 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-03 15:18:05 +00:00
a43aa183dc
[doc] update contribution link ( #15922 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-03 10:47:31 +00:00
463bbb1835
[Bugfix][V1] Fix bug from putting llm_engine.model_executor in a background process ( #15367 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-04-03 07:32:10 +00:00
5e125e74d1
[misc] improve error message for "Failed to infer device type" ( #15994 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-04-03 14:45:03 +08:00
06f21ce7a5
[Benchmark] Add AIMO Dataset to Benchmark ( #15955 )
...
Signed-off-by: Ziji Shi <shi.ziji.sm@gmail.com >
Signed-off-by: StevenShi-23 <shi.ziji.sm@gmail.com >
2025-04-03 06:09:18 +00:00
57a810db9c
[ROCM][V0] PA kennel selection when no sliding window provided ( #15982 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2025-04-03 05:28:44 +00:00
8b664706aa
[bugfix] add seed in torchrun_example.py ( #15980 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-04-03 12:25:01 +08:00
37bfee92bf
fix: better error message for get_config close #13889 ( #15943 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-03 03:53:19 +00:00
e73ff24e31
[ROCM][KERNEL] Paged attention for V1 ( #15720 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Signed-off-by: root <root@banff-cyxtera-s65-4.amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: root <root@banff-cyxtera-s65-4.amd.com >
2025-04-02 19:48:00 -07:00
bd7599d34a
[V1][TPU] Do not compile sampling more than needed ( #15883 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-03 01:36:01 +00:00
01b6113659
[TPU] optimize the all-reduce performance ( #15903 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-03 00:25:14 +00:00
1b84eff03a
[V1][TPU] TPU-optimized top-p implementation (avoids scattering). ( #15736 )
...
Signed-off-by: Hyesoo Yang <hyeygit@gmail.com >
Co-authored-by: root <root@t1v-n-822696b7-w-0.us-central2-b.c.tpu-prod-env-large-adhoc.internal>
2025-04-02 17:18:08 -07:00
55acf86bf8
Fix huggingface-cli[hf-xet] -> huggingface-cli[hf_xet] ( #15969 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-02 23:37:30 +00:00
f021b97993
[V1] Support Mistral3 in V1 ( #15950 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-02 15:36:24 -07:00
1cab43c2d2
[misc] instruct pytorch to use nvml-based cuda check ( #15951 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-04-03 01:02:58 +08:00
8bd651b318
Restricted cmake to be less than version 4 as 4.x breaks the build of… ( #15859 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
2025-04-02 16:19:39 +00:00
58e234a754
[Misc] V1 LoRA support CPU offload ( #15843 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-02 23:04:43 +08:00
e86c414d6a
[Model] use AutoWeightsLoader in model load_weights ( #15770 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-02 07:47:31 -07:00
550b2801ad
[CPU][Bugfix] Using custom allreduce for CPU backend ( #15934 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-04-02 07:46:47 -07:00
cefb9e5a28
[Frontend] Implement Tool Calling with tool_choice='required' ( #13483 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
Signed-off-by: Matt, Matthias <matthias.matt@tuwien.ac.at >
Co-authored-by: Liangfu Chen <liangfc@amazon.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-04-02 07:45:45 -07:00
98d7367b61
[Metrics] Hide deprecated metrics ( #15458 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-04-02 07:37:19 -07:00
594a8b9030
[Bugfix] Fix the issue where the model name is empty string, causing no response with the model name. ( #15938 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-02 06:33:52 -07:00
44f990515b
[CI] Remove duplicate entrypoints-test ( #15940 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-04-02 02:44:01 -07:00
252937806c
[Bugfix][Benchmarks] Ensure async_request_deepspeed_mii uses the OpenAI choices key ( #15926 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-04-02 02:19:35 -07:00
51826d51fa
Add minimum version for huggingface_hub to enable Xet downloads ( #15873 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-02 02:03:36 -07:00
14e53ed11f
[V1] Fix json_object support with xgrammar ( #15488 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-02 02:00:08 -07:00
ddb94c2605
[core] Add tags parameter to wake_up() ( #15500 )
...
Signed-off-by: Eric <erictang000@gmail.com >
2025-04-02 01:59:27 -07:00
90969fb39a
[Kernel] Add more dtype support for GGUF dequantization ( #15879 )
...
Signed-off-by: lukas.bluebaum <lukas.bluebaum@aleph-alpha.com >
2025-04-02 01:58:48 -07:00
101f1481f9
[Build/CI] Update lm-eval to 0.4.8 ( #15912 )
...
Signed-off-by: Chris Thi <chris.c.thi@gmail.com >
2025-04-02 01:47:57 -07:00
2edc87b161
[Bugfix] Fix cache block size calculation for CPU MLA ( #15848 )
...
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg >
2025-04-02 01:45:02 -07:00
4203926f10
[CI/Build] Further clean up LoRA tests ( #15920 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-02 01:39:09 -07:00
cdb57015a7
[Misc] Replace print with logger ( #15923 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-02 01:37:38 -07:00
aa557e6422
[Benchmark]Fix error message ( #15866 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-04-02 01:32:24 -07:00
0e00d40e4f
[V1][Bugfix] Fix typo in MoE TPU checking ( #15927 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-04-01 23:46:42 -07:00
c920e01242
[Doc] Update rocm.inc.md ( #15917 )
...
Signed-off-by: chun37 <chun.jb.37@gmail.com >
2025-04-01 23:38:26 -07:00
274d8e8818
[V1][Minor] Enhance SpecDecoding Metrics Log in V1 ( #15902 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-01 23:38:02 -07:00
2039c6305b
[Bugfix] Fix imports for MoE on CPU ( #15841 )
...
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg >
2025-04-02 03:33:55 +00:00
6efb195a6e
[V1] Fix: make sure k_index is int64 for apply_top_k_only ( #15907 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-04-01 19:06:44 -07:00
24b7fb455a
[Spec Decode] Fix input triton kernel for eagle ( #15909 )
2025-04-01 18:15:14 -07:00
58f5a59769
[Docs] Add Intel as Sponsor ( #15913 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-04-01 17:16:55 -07:00
db9dfcfa6a
[Docs] Add Ollama meetup slides ( #15905 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-04-01 13:58:59 -07:00
9ef98d527e
[Model][MiniMaxText01] Support MiniMaxText01 model inference ( #13454 )
...
Signed-off-by: qscqesze <475517977@qq.com >
Co-authored-by: qingjun <qingjun@minimaxi.com >
Co-authored-by: qscqesze <475517977@qq.com >
2025-04-01 16:23:55 -04:00
93491aefc7
[BugFix] make sure socket close ( #15875 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-01 13:10:24 -07:00
7acd539cd7
[Docs] update usage stats language ( #15898 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-04-01 12:54:13 -07:00
e75a6301bd
[V1][Spec Decode] Implement Eagle Proposer [1/N] ( #15729 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-01 12:33:16 -07:00
a79cc68b3a
[V1][Metrics] Initial speculative decoding metrics ( #15151 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-04-01 10:45:04 -07:00
7e3f7a4ee7
[CI] Disable flaky structure decoding test temporarily. ( #15892 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-04-01 17:42:34 +00:00
9ec8257914
[Model] Add module name prefixes to gemma3 ( #15889 )
...
Signed-off-by: Bartholomew Sabat <bartek@recursal.ai >
Co-authored-by: Bartholomew Sabat <bartek@recursal.ai >
2025-04-01 10:13:40 -07:00
38327cf454
[Model] Aya Vision ( #15441 )
...
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-04-01 16:30:43 +00:00
dfa82e2a3d
[CI/Build] Clean up LoRA tests ( #15867 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-01 16:28:50 +00:00
e59ca942f5
Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. ( #13932 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-04-01 12:07:43 -04:00
a57a3044aa
[ROCm][Build][Bugfix] Bring the base dockerfile in sync with the ROCm fork ( #15820 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-01 08:56:39 -07:00
4e5a0f6ae2
[Misc] Allow using OpenCV as video IO fallback ( #15055 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-01 15:55:13 +00:00
b63bd14999
Reinstate format.sh and make pre-commit installation simpler ( #15890 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-01 15:41:30 +00:00
2041c0e360
[Doc] Quark quantization documentation ( #15861 )
...
Signed-off-by: chaow <chaow@amd.com >
2025-04-01 08:32:45 -07:00
085cbc4f9f
[New Model]: jinaai/jina-reranker-v2-base-multilingual ( #15876 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-01 08:32:26 -07:00
2b93162fb0
Remove format.sh as it's been unsupported >70 days ( #15884 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-01 22:27:46 +08:00
2e45bd29fe
[Misc] remove unused script ( #15746 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-01 13:58:05 +00:00
51d7c6a2b2
[Model] Support Mistral3 in the HF Transformers format ( #15505 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-01 06:10:05 -07:00
f3aca1ee30
setup correct nvcc version with CUDA_HOME ( #15725 )
...
Signed-off-by: Yang Chen <yangche@fb.com >
2025-04-01 06:09:40 -07:00
8dd41d6bcc
[Misc] Use envs.VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE ( #15831 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-01 06:07:53 -07:00
0a298ea418
[Bugfix] Fix no video/image profiling edge case for MultiModalDataParser ( #15828 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-01 18:17:11 +08:00
d330558bab
[Docs] Fix small error in link text ( #15868 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-01 10:05:14 +00:00
656fd72976
[Misc] Fix speculative config repr string ( #15860 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-04-01 02:26:22 -07:00
79455cf421
[Misc] Enable V1 LoRA by default ( #15320 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-04-01 16:53:56 +08:00
30d6a015e0
[Feature] specify model in config.yaml ( #15798 )
...
Signed-off-by: weizeng <weizeng@roblox.com >
2025-04-01 01:20:06 -07:00
8af5a5c4e5
fix: can not use uv run collect_env close #13888 ( #15792 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-01 07:45:49 +00:00
3a5f0afcd2
[V1] Implement sliding window attention in kv_cache_manager ( #14097 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-04-01 00:33:17 -07:00
c7e63aa4d8
[ROCm] Use device name in the warning ( #15838 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-01 00:10:48 -07:00
4a9ce1784c
[sleep mode] clear pytorch cache after sleep ( #15248 )
...
Signed-off-by: <villard@us.ibm.com >
2025-03-31 22:58:58 -07:00
7e4e709b43
[V1] TPU - Fix fused MOE ( #15834 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-03-31 22:58:07 -07:00
63d8eabed0
[Bugfix]: Fix is_embedding_layer condition in VocabParallelEmbedding ( #15824 )
...
Signed-off-by: alexwl <alexey.a.kiryushin@gmail.com >
2025-03-31 22:57:59 -07:00
e830b01383
[Bugfix] Fix extra comma ( #15851 )
...
Signed-off-by: haochengxia <xhc_1007@163.com >
2025-03-31 22:57:28 -07:00
ff6473980d
[Bugfix][Model] fix mllama multi-image ( #14883 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2025-03-31 22:53:37 -07:00
a164aea35d
[Frontend] Add Phi-4-mini function calling support ( #14886 )
...
Signed-off-by: Kinfey <kinfeylo@microsoft.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-03-31 22:50:05 -07:00
a76f547e11
Rename fallback model and refactor supported models section ( #15829 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-31 22:49:41 -07:00
b7b7676d67
[Distributed] Add custom allreduce support for ROCM ( #14125 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-03-31 22:49:12 -07:00
e6e3c55ef2
Move dockerfiles into their own directory ( #14549 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-31 13:47:32 -07:00
f98a4920f9
[V1][Core] Remove unused speculative config from scheduler ( #15818 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-31 19:15:21 +00:00
d4bfc23ef0
Fix Transformers backend compatibility check ( #15290 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-31 10:27:07 -07:00
9a2160fa55
[V1] TPU CI - Add basic perf regression test ( #15414 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-03-31 13:25:20 -04:00
2de4118243
fix: change GB to GiB in logging close #14979 ( #15807 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-03-31 10:00:50 -07:00
239b7befdd
[V1][Spec Decode] Remove deprecated spec decode config params ( #15466 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-03-31 09:19:35 -07:00
09e974d483
[Bugfix] Check dimensions of multimodal embeddings in V1 ( #15816 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-31 09:01:35 -07:00
e5ef4fa99a
Upgrade transformers to v4.50.3 ( #13905 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-31 08:59:37 -07:00
037bcd942c
[Bugfix] Fix missing return value in load_weights method of adapters.py ( #15542 )
...
Signed-off-by: noc-turne <2270929247@qq.com >
2025-03-31 06:56:42 -07:00
c2e7507ad4
[Bugfix] Fix Crashing When Loading Modules With Batchnorm Stats ( #15813 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-03-31 13:23:53 +00:00
3aa2b6a637
[Model] Update support for NemotronNAS models ( #15008 )
...
Signed-off-by: Nave Assaf <nassaf@nvidia.com >
2025-03-31 20:35:14 +08:00
555aa21905
[V1] Fully Transparent Implementation of CPU Offloading ( #15354 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-31 20:22:34 +08:00
e7ae3bf3d6
fix: better install requirement for install in setup.py ( #15796 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-03-31 05:13:32 -07:00
b932c048ac
Recommend developing with Python 3.12 in developer guide ( #15811 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-03-31 11:54:49 +00:00
e85829450d
[Feature][ROCm]Enable fusion pass for torch.compile on ROCm ( #15050 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-03-31 04:42:18 -07:00
effc5d24fa
[Benchmark] Update Vision Arena Dataset and HuggingFaceDataset Setup ( #15748 )
...
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com >
2025-03-31 15:38:58 +08:00
18ed3132d2
[Misc] update the comments ( #15780 )
...
Signed-off-by: chengyang liu <lcy4869@gmail.com >
Co-authored-by: chengyang liu <lcy4869@gmail.com >
2025-03-30 19:39:56 -07:00
9b459eca88
[V1][Scheduler] Avoid calling _try_schedule_encoder_inputs for every request ( #15778 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-30 14:10:42 -07:00
70fedd0f79
fix: Comments to English for better dev experience ( #15768 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-03-30 10:47:57 -07:00
bb103b29bf
[Bugfix] Added embed_is_patch mask for fuyu model ( #15731 )
...
Signed-off-by: Kyle Huang <kylhuang@nvidia.com >
2025-03-30 03:45:08 -07:00
248e76c4df
fix: lint fix a ruff checkout syntax error ( #15767 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-03-30 03:36:02 -07:00
803d5c35f3
[V1] Override mm_counts for dummy data creation ( #15703 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-30 03:20:42 -07:00
7fd8c0f85c
fix test_phi3v ( #15321 )
...
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com >
2025-03-30 02:01:34 -07:00
44c3a5abc3
[doc] update conda to usage link in installation ( #15761 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-03-30 08:12:13 +00:00
6909a76201
[Bugfix] Fix Mistral guided generation using xgrammar ( #15704 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2025-03-29 20:20:19 -07:00
045533716b
[CI] xgrammar structured output supports Enum. ( #15757 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-03-29 20:20:02 -07:00
3c0ff914ac
[Bugfix] Fix Mllama interleaved images input support ( #15564 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-03-29 18:11:15 +00:00
2bc4be4e32
[V1][Minor] Simplify rejection sampler's parse_output ( #15741 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-29 09:25:17 -07:00
c67abd614f
[V1] Support interleaved modality items ( #15605 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-29 06:30:09 -07:00
6fa7cd3dbc
[Feature][Disaggregated] Support XpYd disaggregated prefill with MooncakeStore ( #12957 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-03-29 04:01:46 -07:00
94744ba41a
[V1] [Feature] Collective RPC ( #15444 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-03-29 03:39:14 -07:00
4965ec42d2
[FEAT] [ROCm] Add AITER int8 scaled gemm kernel ( #15433 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-03-29 03:33:56 -07:00
73aa7041bf
[doc] update doc ( #15740 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-03-29 04:27:22 +00:00
7c1f760024
[Kernel][TPU][ragged-paged-attn] vLLM code change for PR#8896 ( #15659 )
...
Signed-off-by: Yarong Mu <ymu@google.com >
2025-03-28 21:13:15 -07:00
da461f3cbf
[TPU][V1][Bugfix] Fix w8a8 recompiilation with GSM8K ( #15714 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-03-28 21:13:06 -07:00
5b800f0932
[Bugfix] set VLLM_WORKER_MULTIPROC_METHOD=spawn for vllm.entrypoionts.openai.api_server ( #15700 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-03-28 21:12:26 -07:00
8427f70493
Use numba 0.61 for python 3.10+ to support numpy>=2 ( #15692 )
...
Signed-off-by: cyy <cyyever@outlook.com >
2025-03-29 12:11:51 +08:00
7a7992085b
[CI] Speed up V1 structured output tests ( #15718 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-28 21:10:45 -07:00
1286211f57
[Bugfix] LoRA V1: add and fix entrypoints tests ( #15715 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-28 21:10:41 -07:00
6d531ad7b8
[Misc][V1] Misc code streamlining ( #15723 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-28 20:59:47 -07:00
762b424a52
[Docs] Document v0 engine support in reasoning outputs ( #15739 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
2025-03-29 03:46:57 +00:00
de1cb38769
[Model] Support Skywork-R1V ( #15397 )
...
Signed-off-by: jiacai.liu <932997367@qq.com >
Co-authored-by: jiacai.liu <932997367@qq.com >
2025-03-28 20:39:21 -07:00
c802f5430d
[ROCm][AMD][Build] Update AMD supported arch list ( #15632 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-03-28 20:39:18 -07:00
cff8991a50
[Docs][V1] Optimize diagrams in prefix caching design ( #15716 )
2025-03-29 03:33:58 +00:00
f3f8d8fff4
implement prometheus fast-api-instrumentor for http service metrics ( #15657 )
2025-03-29 00:12:02 +00:00
26df46ee59
[Misc] cli auto show default value ( #15582 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-03-28 22:23:00 +00:00
c3f687ac22
[V1] TPU - Fix the chunked prompt bug ( #15713 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-03-28 20:19:04 +00:00
04437e313d
[Bugfix] [torch.compile] Add Dynamo metrics context during compilation ( #15639 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-03-28 14:01:09 -06:00
038bededba
[TPU] [Perf] Improve Memory Usage Estimation ( #15671 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-03-28 17:37:52 +00:00
d03308be0c
[Misc] Remove stale func in KVTransferConfig ( #14746 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-03-28 17:33:32 +00:00
c6bc0034d0
[Misc] Remove unused utils and clean up imports ( #15708 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-28 09:41:16 -07:00
70e132244a
[Minor] Remove TGI launching script ( #15646 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-28 09:30:08 -07:00
47e9038d23
Fix cpu offload testing for gptq/awq/ct ( #15648 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-29 00:29:32 +08:00
432cf22a6a
[Bugfix] Fix regex compile display format ( #15368 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-03-28 08:58:44 -07:00
2914006fe0
[doc] add missing imports ( #15699 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-03-28 15:56:48 +00:00
7329ff5468
[V1] Support disable_any_whtespace for guidance backend ( #15584 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-28 23:46:45 +08:00
541d1df486
[Bugfix] embed_is_patch for Idefics3 ( #15696 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-28 08:27:52 -07:00
3b00ff9138
[Bugfix][v1] xgrammar structured output supports Enum. ( #15594 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-03-28 06:14:53 -07:00
91276c5721
[Model] Adding torch compile annotations to chatglm ( #15624 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-28 21:14:09 +08:00
0b4167526d
[Docs] Add "Generation quality changed" section to troubleshooting ( #15701 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-28 13:03:21 +00:00
fd5fd26902
[Frontend] update priority for --api-key and VLLM_API_KEY ( #15588 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-03-28 19:40:12 +08:00
3bbaacbe15
[Bugfix][Frontend] Eliminate regex based check in reasoning full generator ( #14821 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
2025-03-28 11:20:35 +00:00
a10314c6b3
[Misc] Fix test_sleep to use query parameters ( #14373 )
...
Signed-off-by: Lize Cai <lize.cai@sap.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-03-28 18:00:14 +08:00
70f2c2a709
[Bugfix] Fix 'InductorAdaptor object has no attribute 'cache_dir' ( #15674 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-28 17:10:40 +08:00
280d074103
[CPU][CI] Improve CPU Dockerfile ( #15690 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-03-28 01:36:31 -07:00
32b14baf8a
[Refactor][Frontend] Keep all logic about reasoning into one class ( #14428 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
2025-03-28 00:23:30 -07:00
2d9045fce8
[TPU][CI] Fix TPUModelRunner Test ( #15667 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-03-28 00:01:26 -07:00
355f66348c
[V1] Remove legacy input registry ( #15673 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-27 23:34:34 -07:00
8693e47e6a
[Bugfix] Fix mm_hashes forgetting to be passed ( #15668 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-28 05:51:05 +00:00
cec8c7d7f8
Refactor error handling for multiple exceptions in preprocessing ( #15650 )
...
Signed-off-by: JasonZhu1313 <jasonchu13@outlook.com >
2025-03-28 03:27:20 +00:00
4d0ec37267
[Quantization][FP8] Adding support for fp8 gemm layer input in fp8 ( #14578 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-03-28 02:58:16 +00:00
e7f720ea56
[Misc]add coding benchmark for speculative decoding ( #15303 )
...
Signed-off-by: CXIAAAAA <cxia0209@gmail.com >
2025-03-28 10:47:05 +08:00
4ae17bf1e2
Revert "Use Cache Hinting for fused_moe kernel ( #15511 )" ( #15645 )
...
Signed-off-by: Wes Medford <wryanmedford@gmail.com >
2025-03-27 19:45:55 -07:00
8a49eea74b
[CI][TPU] Temporarily Disable Quant Test on TPU ( #15649 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-03-27 19:45:05 -07:00
b4245a48df
[Doc] Fix dead links in Job Board ( #15637 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-03-28 02:43:40 +00:00
4e0f6076be
[Bugfix] Fix failure to launch in Tensor Parallel TP mode on macOS. ( #14948 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-03-28 10:13:41 +08:00
726efc6a32
[Quantization][V1] BitsAndBytes support V1 ( #15611 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-28 10:12:47 +08:00
bd45912b99
[TPU] Lazy Import ( #15656 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-03-28 09:57:01 +08:00
15dac210f0
[V1] AsyncLLM data parallel ( #13923 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-27 16:14:41 -07:00
112b3e5b3b
[CI] Update rules for applying tpu label. ( #15634 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-27 22:15:26 +00:00
32d669275b
Correct PowerPC to modern IBM Power ( #15635 )
...
Signed-off-by: Christy Norman <christy@linux.vnet.ibm.com >
2025-03-27 15:04:32 -07:00
4098b72210
[Bugfix][TPU][V1] Fix recompilation ( #15553 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-03-27 19:15:06 +00:00
46450b8d33
Use absolute placement for Ask AI button ( #15628 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-27 18:52:18 +00:00
13ac9cab21
[Misc] Avoid direct access of global mm_registry in compute_encoder_budget ( #15621 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-27 17:52:00 +00:00
66aa4c0bf4
[Feature] Add middleware to log API Server responses ( #15593 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-03-27 17:49:38 +00:00
247181536f
[Misc] Replace is_encoder_decoder_inputs with split_enc_dec_inputs ( #15620 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-27 17:36:32 +00:00
07bf813fb5
[Doc] Link to onboarding tasks ( #15629 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-27 16:30:53 +00:00
8958217ad5
[Bugfix] Fix use_cascade_attention handling for Alibi-based models on vllm/v1 ( #15211 )
...
Signed-off-by: h-sugi <h.sugi@ieee.org >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-27 22:29:29 +08:00
ac5bc615b0
[Model] MiniCPM-V/O supports V1 ( #15487 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-27 06:07:29 -07:00
8063dfc61a
[Doc] update --system for transformers installation in docker doc ( #15616 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-03-27 20:38:46 +08:00
6278bc829e
Fix incorrect filenames in vllm_compile_cache.py ( #15494 )
...
Signed-off-by: <zou3519@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-03-27 18:33:41 +08:00
3f532cb6a6
[Misc] Use model_redirect to redirect the model name to a local folder. ( #14116 )
2025-03-27 02:21:23 -07:00
e6c9053f9e
[Misc] Clean up scatter_patch_features ( #15559 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-27 07:45:00 +00:00
43ed4143c4
[Quantization] Fp8 Channelwise Dynamic Per Token GroupedGEMM ( #15587 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
Co-authored-by: Lucas Wilkinson <wilkinson.lucas@gmail.com >
Co-authored-by: ElizaWszola <ewszola@redhat.com >
2025-03-27 06:47:25 +00:00
f4c98b4d4c
[Misc] Consolidate LRUCache implementations ( #15481 )
...
Signed-off-by: Bella kira <2374035698@qq.com >
2025-03-27 06:43:43 +00:00
e1e0fd7543
[TPU] Avoid Triton Import ( #15589 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-03-27 06:43:02 +00:00
df8d3d1287
[Misc] Restrict ray version dependency and update PP feature warning in V1 ( #15556 )
2025-03-27 06:21:07 +00:00
619d3de8bd
[TPU] [V1] fix cases when max_num_reqs is set smaller than MIN_NUM_SEQS ( #15583 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-03-26 22:46:26 -07:00
ecff8309a3
[ROCm] Env variable to trigger custom PA ( #15557 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-03-26 22:46:12 -07:00
dcf2a590f5
Allow torchao quantization in SiglipMLP ( #15575 )
2025-03-26 22:45:51 -07:00
54aa619459
[V1] Refactor num_computed_tokens logic ( #15307 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-27 04:54:36 +00:00
fb22be5817
[moe][quant] add weight name case for offset ( #15515 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-03-27 04:50:29 +00:00
7f301dd8ef
[Doc] Update V1 user guide for fp8 kv cache support ( #15585 )
...
Signed-off-by: weizeng <weizeng@roblox.com >
2025-03-26 19:39:03 -07:00
8095341a01
[misc] LoRA: Remove unused long context test data ( #15558 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-27 10:04:51 +08:00
69db16a46a
add platform check back ( #15578 )
...
Signed-off-by: Chenyaaang <llccyy1212@gmail.com >
2025-03-27 01:50:27 +00:00
ce78f9af4e
Add automatic tpu label to mergify.yml ( #15560 )
2025-03-26 21:39:58 -04:00
9239bf718e
[Kernel] CUTLASS grouped gemm fp8 MoE kernel ( #13972 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Co-authored-by: Lucas Wilkinson <wilkinson.lucas@gmail.com >
2025-03-27 00:54:44 +00:00
7a6d45bc8a
Support FIPS enabled machines with MD5 hashing ( #15299 )
...
Signed-off-by: Matthew Vine <32849887+MattTheCuber@users.noreply.github.com >
2025-03-26 20:19:46 -04:00
e74ff409e0
[TPU] support disabling xla compilation cache ( #15567 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-03-27 00:09:28 +00:00
7a888271f5
Use Cache Hinting for fused_moe kernel ( #15511 )
2025-03-26 23:21:34 +00:00
9d119a86ae
[V1] TPU CI - Fix test_compilation.py ( #15570 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-03-26 21:51:54 +00:00
b2e85e26f4
[V1] TPU - Revert to exponential padding by default ( #15565 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-03-26 21:35:05 +00:00
dd8a29da99
Applying some fixes for K8s agents in CI ( #15493 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-03-26 20:35:11 +00:00
27df5199d9
Support SHA256 as hash function in prefix caching ( #15297 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
2025-03-26 11:11:28 -07:00
35fad35a48
[V1][Sampler] Faster top-k only implementation ( #15478 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-26 10:56:47 -07:00
733e7c9e95
[Refactor] Remove unnecessary backend parameter in structured output interface ( #15317 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-03-26 17:51:56 +00:00
0af4d764d6
Fix weight loading for some models in Transformers backend ( #15544 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-26 10:17:53 -07:00
e64afa455c
multi-node offline DP+EP example ( #15484 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-26 23:54:24 +08:00
1711b929b6
[Model] Add Reasoning Parser for Granite Models ( #14202 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
Co-authored-by: Joe Runde <joe@joerun.de >
2025-03-26 14:28:07 +00:00
c091c0a588
Improve validation of TP in Transformers backend ( #15540 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-26 07:26:48 -07:00
1aa162e030
Apply torchfix ( #15532 )
...
Signed-off-by: cyy <cyyever@outlook.com >
2025-03-26 12:09:06 +00:00
cf5c8f1686
Separate base model from TransformersModel ( #15467 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-03-26 18:13:38 +08:00
4ec2cee000
[Misc] improve example script output ( #15528 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-03-26 10:12:47 +00:00
99f536f830
[Misc] Enhance warning information to user-defined chat template ( #15408 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-03-26 02:21:15 -07:00
5ebf66748b
[FEAT][ROCm] Integrate Fused MoE Kernels from AITER ( #14967 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-03-26 16:30:30 +08:00
781d056280
[Feature] Enhance EAGLE Architecture with Proper RMS Norms ( #14990 )
...
Signed-off-by: Bryan Lu <yuzhelu@amazon.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-03-26 08:24:07 +00:00
5aefd6ac31
Fix raw_request extraction in load_aware_call decorator ( #15382 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2025-03-25 22:29:54 -07:00
6c663dfd5e
[misc] LoRA - Skip LoRA kernels when not required ( #15152 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-26 11:33:45 +08:00
33437bc6e7
[BugFix] Fix nightly MLA failure (FA2 + MLA chunked prefill, i.e. V1, producing bad results) ( #15492 )
...
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com >
2025-03-25 20:33:22 -07:00
23114d3364
[Misc] Warn about v0 in benchmark_paged_attn.py ( #15495 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-25 20:31:04 -07:00
997c8811d6
[Model] Support multi-image for Molmo ( #15438 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-26 11:26:33 +08:00
e42389f9d7
Transformers backend already supports V1 ( #15463 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-25 20:26:16 -07:00
ff38f0a32c
[CI/Build] LoRA: Delete long context tests ( #15503 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-25 17:18:34 -07:00
a5cfbab3c8
[Core] LoRA: V1 Scheduler optimization ( #15422 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-25 22:50:09 +00:00
ac3cd6e83c
[core] add bucket padding to tpu_model_runner ( #14995 )
...
Signed-off-by: Chenyaaang <llccyy1212@gmail.com >
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-03-25 17:27:22 -04:00
082ab86f5f
[V1] Support long_prefill_token_threshold in v1 scheduler ( #15419 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-03-25 14:22:26 -07:00
6aa196c8dc
[V1][Minor] Use SchedulerInterface type for engine scheduler field ( #15499 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-25 14:21:36 -07:00
a0dd7dcd49
[TPU][V1] Fix Sampler recompilation ( #15309 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-03-25 16:43:54 -04:00
e977c11111
Add workaround for shared field_names in pydantic model class ( #13925 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-03-25 20:31:08 +00:00
5f063a80bd
[bugfix] add supports_v1 platform interface ( #15417 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-03-25 15:00:32 -04:00
5d8e1c9279
[Bugfix] Support triton==3.3.0+git95326d9f for RTX 5090 (Unsloth + vLLM compatibility) ( #15471 )
...
Co-authored-by: ServerAI <ai@exc-mad-ai.com >
2025-03-25 17:59:25 +00:00
0a049c7d86
[CI/Build] Add tests for the V1 tpu_model_runner. ( #14843 )
...
Signed-off-by: Yarong Mu <ymu@google.com >
2025-03-25 12:27:16 -04:00
d0cfec7ab9
[bugfix] fix inductor cache on max_position_embeddings ( #15436 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-25 07:05:39 -07:00
a608160027
[Kernel] Fix conflicting macro names for gguf kernels ( #15456 )
...
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com >
2025-03-25 13:50:49 +00:00
3f04a7fbf2
[Doc] Update V1 user guide for multi-modality ( #15460 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-25 11:01:58 +00:00
5994430b84
[Misc] Remove redundant num_embeds ( #15443 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-25 18:27:57 +08:00
a9e879b316
[Misc] Clean up MiniCPM-V/O code ( #15337 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-25 10:22:52 +00:00
3e2f37a69a
Dockerfile.ppc64le changes to move to UBI ( #15402 )
...
Signed-off-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com >
2025-03-25 10:15:14 +00:00
4f044b1d67
[Kernel][CPU] CPU MLA ( #14744 )
...
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg >
2025-03-25 09:34:59 +00:00
4157f563b4
[Hardware][TPU][Bugfix] Fix v1 mp profiler ( #15409 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-03-25 01:43:00 -07:00
051da7efe3
Fix CUDA kernel index data type in vllm/csrc/quantization/gptq_marlin/awq_marlin_repack.cu +10 ( #15160 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
Co-authored-by: Richard Barnes <rbarnes@meta.com >
2025-03-25 15:36:45 +08:00
25f560a62c
[V1][Spec Decode] Update target_logits in place for rejection sampling ( #15427 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-24 21:04:41 -07:00
a09ad90a72
[V1] guidance backend for structured output + auto fallback mode ( #14779 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Loc Huynh <jc1da.3011@gmail.com >
Co-authored-by: Michal Moskal <michal@moskal.me >
2025-03-24 21:02:33 -07:00
10b34e36b9
[Bugfix] Fixed the issue of not being able to input video and image simultaneously ( #15387 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-03-25 03:48:08 +00:00
b5269db959
Revert "Fix non-contiguous input passed to Marlin kernel ( #15319 )" ( #15398 )
2025-03-24 20:43:51 -07:00
6db94571d7
[Misc] Remove LoRA log ( #15388 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-24 20:43:48 -07:00
97cfa65df7
Add pipeline parallel support to TransformersModel ( #12832 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-03-25 10:41:45 +08:00
911c8eb000
[Minor][Spec Decode] Remove compiled_softmax ( #15416 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-24 19:09:04 -07:00
ebcebeeb6b
[V1][Spec Decode] Enable spec decode for top-p & top-k sampling ( #15063 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-24 17:16:46 -07:00
f533b5837f
[ROCm][Kernel] MoE weights padding ( #14454 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: charlifu <charlifu@amd.com >
Co-authored-by: charlifu <charlifu@amd.com >
2025-03-24 23:45:30 +00:00
8279201ce6
[Build] Cython compilation support fix ( #14296 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-03-24 23:37:54 +00:00
23fdab00a8
[Hardware][TPU] Skip failed compilation test ( #15421 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-03-24 23:28:57 +00:00
623e2ed29f
[BugFix][V1] Quick fix for min_tokens with multiple EOS ( #15407 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-24 15:58:59 -07:00
9d72daf4ce
[V1][Perf] Simpler request output queues ( #15156 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-03-24 22:44:08 +00:00
6dd55af6c9
[Doc] Update docs on handling OOM ( #15357 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-03-24 14:29:34 -07:00
3eb08ed9b1
[DOC] Add Kubernetes deployment guide with CPUs ( #14865 )
2025-03-24 10:48:43 -07:00
5eeadc2642
[Hardware][Gaudi][Feature] Enable Dynamic MoE for Mixtral ( #12303 )
...
Signed-off-by: zhenwei <zhenweiliu@habana.ai >
2025-03-24 09:48:40 -07:00
3aee6573dc
[V1] Aggregate chunked prompt logprobs in model runner ( #14875 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-24 12:27:57 -04:00
9cc645141d
[MISC] Refine no available block debug msg ( #15076 )
...
Signed-off-by: Yi Liu <yiliu4@habana.ai >
Signed-off-by: yiliu30 <yi4.liu@intel.com >
Co-authored-by: Yi Liu <yiliu4@habana.ai >
2025-03-25 00:01:10 +08:00
0893567db9
[V1][Minor] fix comments ( #15392 )
...
Signed-off-by: chenjincong <chenjincong@baidu.com >
Signed-off-by: Chen-0210 <chenjincong11@gmail.com >
Co-authored-by: chenjincong <chenjincong@baidu.com >
2025-03-24 08:45:32 -07:00
8abe69b499
[Core] Don't force uppercase for VLLM_LOGGING_LEVEL ( #15306 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-24 08:27:30 -07:00
761702fd19
[Core] Integrate fastsafetensors loader for loading model weights ( #10647 )
...
Signed-off-by: Manish Sethi <Manish.sethi1@ibm.com >
2025-03-24 08:08:02 -07:00
9606d572ed
[distributed] fix dp group ( #15355 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-24 14:54:27 +00:00
cbcdf2c609
[Bugfix] Fix chat template loading ( #15143 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-03-24 13:50:09 +00:00
038de04d7b
Fix zmq IPv6 URL format error ( #15341 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-24 09:30:41 -04:00
6b3cc75be0
[Kernel] allow non-contiguous input for marlin kernel ( #14658 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-03-24 09:21:33 -04:00
7ffcccfa5c
Revert "[CI/Build] Use uv python for docker rather than ppa:deadsnakess/ppa ( #13569 )" ( #15377 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-24 05:53:10 -07:00
cc8accfd53
[Misc] Update guided decoding logs to debug ( #15310 )
...
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
Co-authored-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
2025-03-24 04:25:20 -07:00
948ab03e7e
[Bugfix][V1] Avoid importing PreTrainedModel ( #15366 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2025-03-24 10:33:12 +00:00
5797fb97e9
[Misc] Remove ignore_reinit_error for ray.init() ( #15373 )
2025-03-24 07:41:53 +00:00
3892e58ad7
[Misc] Upgrade BNB version ( #15183 )
2025-03-24 05:51:42 +00:00
d20e261199
Fix non-contiguous input passed to Marlin kernel ( #15319 )
2025-03-24 03:09:44 +00:00
f622dbcf39
[Fix] [torch.compile] Improve UUID system for custom passes ( #15249 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-03-24 01:54:07 +00:00
dccf535f8e
[V1] Enable V1 Fp8 cache for FA3 in the oracle ( #15191 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-23 15:07:04 -07:00
9c5c81b0da
[Misc][Doc] Add note regarding loading generation_config by default ( #15281 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-23 14:00:55 -07:00
d6cd59f122
[Frontend] Support tool calling and reasoning parser ( #14511 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-03-23 14:00:07 -07:00
bc8ed3c4ba
[V1][Spec Decode] Use better defaults for N-gram ( #15358 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-23 10:52:30 -07:00
b9bd76ca14
[V1][Spec Decode] Respect prompt_lookup_max ( #15348 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-23 10:41:44 -07:00
6ebaf9ac71
[Bugfix] consider related env vars for torch.compiled cache hash ( #14953 )
...
Signed-off-by: DefTruth <31974251+DefTruth@users.noreply.github.com >
2025-03-23 15:53:09 +00:00
f90d34b498
[Misc] Add tuned R1 w8a8 and MoE configs for NVIDIA L20 ( #15322 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-03-23 01:10:10 -07:00
f68cce8e64
[ci/build] fix broken tests in LLM.collective_rpc ( #15350 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-23 14:49:48 +08:00
09b6a95551
[ci/build] update torch nightly version for GH200 ( #15135 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-23 14:04:13 +08:00
50c9636d87
[V1][Usage] Refactor speculative decoding configuration and tests ( #14434 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-03-22 19:28:10 -10:00
0661cfef7a
Fix v1 supported oracle for worker-cls and worker-extension-cls ( #15324 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-03-23 10:23:35 +08:00
a827aa815d
[doc] Add back previous news ( #15331 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-03-22 17:38:33 -07:00
b877031d80
Remove openvino support in favor of external plugin ( #15339 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-22 14:06:39 -07:00
dd861b992f
[BugFix][Typing] Fix Imprecise Type Annotations ( #15208 )
...
Signed-off-by: Wang Ran (汪然) <wrran@outlook.com >
2025-03-22 09:05:03 -07:00
eb63ea1e18
[V1] Add disable-any-whitespace option support for xgrammar ( #15316 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-22 15:56:17 +00:00
2f4bd358f1
[Model] Support Tele-FLM Model ( #15023 )
...
Signed-off-by: Naitong Yu <ntyu@baai.ac.cn >
Signed-off-by: jiangxin <horizon94@outlook.com >
Co-authored-by: Jason Fang <jasonfang3900@gmail.com >
Co-authored-by: jiangxin <horizon94@outlook.com >
2025-03-22 02:04:44 -07:00
8a8b30eac1
[Bugfix] LoRA V0 - Fix case where max_num_seqs is between cudagraph capture sizes ( #15308 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-22 02:03:32 -07:00
2fa0e1396b
[Bugfix] Fix torch.compile raise FileNotFoundError ( #15278 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-22 13:49:34 +08:00
1c2bec0f82
[Doc] add load_format items in docs ( #14804 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-03-21 22:36:43 -07:00
ec870fba9a
[FEAT] [ROCm]: Add AITER RMS Norm (Layer Norm) Feature ( #14959 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-03-21 22:36:14 -07:00
df1430265c
[Bugfix][V0] Multi-sequence logprobs streaming edge case ( #15259 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2025-03-21 22:35:37 -07:00
4c69e228b3
[Misc] Increase RayDistributedExecutor RAY_CGRAPH_get_timeout ( #15301 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-03-21 22:25:43 -07:00
790b79750b
[Build/CI] Fix env var typo ( #15305 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-21 22:28:46 +00:00
cfbb8c930f
[TPU][V1] MHA Pallas backend ( #15288 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-03-21 08:50:39 -07:00
baec0d4de9
Revert "[Feature] specify model in config.yaml ( #14855 )" ( #15293 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-21 08:30:23 -07:00
c21b99b912
[Bugfix][VLM] fix llava processor ( #15285 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-03-21 05:14:36 -07:00
93a00d7dde
[v1] Refactor KVCacheConfig ( #14079 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-03-21 04:56:27 -07:00
61e8c18350
[Misc] Add cProfile helpers ( #15074 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-21 04:56:09 -07:00
8afcd0f633
[Bugfix] Fix broken kernel test due to missing rename for v1 Triton backend ( #15282 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-21 11:42:06 +00:00
91ca929dc7
[V1] Fix wrong import path of get_flash_attn_version ( #15280 )
...
Signed-off-by: Lehua Ding <lehuading@tencent.com >
2025-03-21 03:54:11 -07:00
84e00adc8a
[Bugfix] Fix incorrect resolving order for transformers fallback ( #15279 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-21 03:54:08 -07:00
47c7126213
[Misc] Add attention mask pre-computation optimization back to Qwen2.5-VL ( #15273 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-21 10:32:33 +00:00
a989ca2bf6
[Bugfix] Add int8 torch dtype for KVCache ( #15260 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-03-21 08:58:28 +00:00
0fa3970deb
[Feature] specify model in config.yaml ( #14855 )
...
Signed-off-by: weizeng <weizeng@roblox.com >
2025-03-21 00:26:03 -07:00
da6ea29f7a
[V1] Avoid redundant input processing in n>1 case ( #14985 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-20 22:24:10 -07:00
7297941b38
[Doc] Update LWS docs ( #15163 )
...
Signed-off-by: Edwinhr716 <Edandres249@gmail.com >
2025-03-20 21:18:47 -07:00
f8a08cb90d
[V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs ( #14071 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-21 03:14:19 +00:00
b15fd2be2a
[Hardware][TPU] Add check for no additional graph compilation during runtime ( #14710 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-03-21 03:05:28 +00:00
e588ac237c
Add an example for reproducibility ( #15262 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-20 19:55:47 -07:00
5df2da5b97
[Misc] Better RayExecutor and multiprocessing compatibility ( #14705 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-20 19:27:46 -07:00
11b986b3fb
[Docs] Trim the latest news in README ( #15261 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-20 19:24:21 -07:00
296f927f24
[Model] RE: Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies ( #14857 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-03-20 19:21:08 -07:00
0032903a5b
[Bugfix] detect alibi and revert to FA2 ( #15231 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-03-20 19:20:16 -07:00
47195057e9
[V1][TPU] Speed up top-k on TPU by using torch.topk ( #15242 )
...
Signed-off-by: Hyesoo Yang <hyeygit@gmail.com >
2025-03-20 19:19:40 -07:00
6edbfa924d
Mention extra_body as a way to pass vLLM-only parameters using the OpenAI client ( #15240 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-20 19:18:36 -07:00
1e508343e1
[Bugfix] Fix incorrect qwen2.5-vl attention mask pre-computation ( #15200 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-20 19:18:04 -07:00
2e0b4cfde0
[ROCM] Upgrade torch to 2.6 ( #15244 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-03-20 19:17:33 -07:00
10f55fe6c5
[Misc] Clean up the BitsAndBytes arguments ( #15140 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-20 19:17:12 -07:00
d3ccbd6350
Fix CUDA kernel index data type in vllm/csrc/quantization/fused_kernels/layernorm_utils.cuh +10 ( #15159 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
Co-authored-by: Richard Barnes <rbarnes@meta.com >
2025-03-21 10:01:11 +08:00
0cfe7d386d
[CI/Build] LoRA : make add_lora_test safer ( #15181 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-21 09:28:53 +08:00
0c6f5023c3
[V1] Scheduler Refactoring [1/N] - Add Scheduler Interface ( #15250 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-03-20 17:50:43 -07:00
06dd08256f
Enforce that TP > 1 is not supported for Mamba2 if Quantization is Enabled. ( #14617 )
...
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
2025-03-21 00:44:37 +00:00
2b22290ce0
[V1] Add flag to disable cascade attention ( #15243 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-20 15:24:16 -07:00
d8e82bc06d
[Bugfix] fix V1 Engine crash while handling requests with duplicate request id ( #15043 )
...
Signed-off-by: Jiahui Sun <jhsun2020@gmail.com >
2025-03-20 10:01:02 -07:00
086b56824c
[ci] feat: make the test_torchrun_example run with tp=2, external_dp=2 ( #15172 )
...
Signed-off-by: Chi Zhang <zhangchi.usc1992@bytedance.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-03-21 00:30:04 +08:00
5a0905ba2a
Replace misc issues with link to forum ( #15226 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-20 23:18:20 +08:00
a8f12a63fd
Fix env vars for running Ray distributed backend on GKE ( #15166 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
2025-03-20 14:59:33 +00:00
69ae2380c6
Add user forum to README ( #15220 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-20 22:39:51 +08:00
27261e40a6
[Bugfix] Multi-video inference on LLaVA-Onevision ( #15082 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-03-20 14:10:45 +00:00
e3f813c33b
[macOS] Upgrade pytorch to 2.6.0 ( #15129 )
2025-03-20 01:22:40 -07:00
c607a2652b
Fixing Imprecise Type Annotations ( #15192 )
2025-03-20 01:19:55 -07:00
3d45e3d749
[release] Tag vllm-cpu with latest upon new version released ( #15193 )
2025-03-20 01:19:10 -07:00
742369d35a
[Frontend][Bugfix] support prefill decode disaggregation on deepseek ( #14824 )
...
Signed-off-by: billishyahao <bill.he@amd.com >
Co-authored-by: Zhai Feiyue <80079571+ZhaiFeiyue@users.noreply.github.com >
2025-03-20 00:00:33 -07:00
bfe2fe0af4
typo: Update config.py ( #15189 )
2025-03-19 23:31:21 -07:00
a8652f4f0f
Enable CUDA graph support for llama 3.2 vision ( #14917 )
...
Signed-off-by: Matt Ritter <100659061+mritterfigma@users.noreply.github.com >
2025-03-19 23:29:16 -07:00
2f726b241e
[Doc] Update README.md ( #15187 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-20 13:25:58 +08:00
a597a57595
[Attention] Flash Attention 3 - fp8 ( #14570 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
2025-03-20 01:14:20 -04:00
ae65f3e237
[Misc]fixed disable these http request logs ( #14754 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-03-19 21:53:40 -07:00
34868b106a
[Doc] Update Mistral Small 3.1/Pixtral example ( #15184 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-20 04:46:06 +00:00
1f16b7fe74
[Core][V0] Add guidance backend for structured output ( #14589 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Loc Huynh <lohuynh@microsoft.com >
Co-authored-by: Michal Moskal <michal@moskal.me >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
2025-03-19 21:33:51 -07:00
b88be22165
[Benchmark] Allow oversample request in benchmark dataset ( #15170 )
...
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com >
2025-03-20 12:32:58 +08:00
d8c6d7d6b5
[V1][TPU] Support V1 Sampler for ragged attention ( #14227 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-03-19 21:00:39 -07:00
40828ce5fe
fix "Total generated tokens:" is 0 if using --backend tgi and --endpo… ( #14673 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
2025-03-19 20:56:16 -07:00
ffa443afed
[Bugfix] Fix embedding assignment for InternVL-based models ( #15086 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-20 03:40:13 +00:00
70e500cad9
Fix broken tests ( #14713 )
...
Signed-off-by: JovanSardinha <jovan.sardinha@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-03-20 02:06:49 +00:00
4cb1c05c9e
[Doc] Clarify run vllm only on one node in distributed inference ( #15148 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-03-20 09:55:59 +08:00
c47aafa37c
[BugFix] Lazily import XgrammarBackend to avoid early cuda init ( #15171 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-20 01:30:43 +00:00
cfbca8a2f2
[V1] TPU - Tensor parallel MP support ( #15059 )
2025-03-20 00:55:18 +00:00
0fe5609874
[Docs] Announce Ollama and Singapore Meetups ( #15161 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-19 16:18:04 -07:00
22d33baca2
[FrontEnd][Perf] merge_async_iterators fast-path for single-prompt requests ( #15150 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-19 21:04:41 +00:00
b0e96aaebb
[V1][TPU] Change kv cache shape. ( #15145 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-03-19 12:16:42 -07:00
8310e0b59b
simple bugfix: Update stats.py ( #15139 )
2025-03-19 18:26:27 +00:00
26dd972adb
[FEAT]Support reset prefix cache by specified device ( #15003 )
2025-03-19 10:54:41 -07:00
61c7a1b856
[V1] Minor V1 async engine test refactor ( #15075 )
...
Signed-off-by: andoorve <murali.andoorveedu@mail.utoronto.ca >
Co-authored-by: andoorve <murali.andoorveedu@mail.utoronto.ca >
2025-03-19 10:37:17 -07:00
374ee287d8
[Frontend] Remove custom_cache_manager ( #13791 )
...
Signed-off-by: fulvius31 <asangior@redhat.com >
2025-03-20 00:13:50 +08:00
a4d83661d7
[Misc] Update the "the first vLLM China Meetup" slides link to point to the first page ( #15134 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2025-03-19 15:07:39 +00:00
8363cd093d
[Bugfix] Adjust mllama to regional compilation ( #15112 )
...
Signed-off-by: Jan Kaniecki <jkaniecki@habana.ai >
2025-03-19 07:57:25 -07:00
6c5a3195db
[Misc][Benchmark] Add support for different tokenizer_mode ( #15040 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-03-19 14:56:50 +00:00
073d1ed354
[Doc] Update tip info on using latest transformers when creating a custom Dockerfile ( #15070 )
2025-03-19 13:33:40 +00:00
3d446433ec
[Bugfix] Fix size calculation of processing cache ( #15114 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-19 05:53:19 -07:00
1fe0fd12d3
[Misc] Avoid unnecessary HF do_rescale warning when passing dummy data ( #15107 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-19 03:42:31 -07:00
dafb4e504a
[V1][Bugfix] Fix oracle for device checking ( #15104 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-19 18:35:32 +08:00
68cf1601d3
[CI][Intel GPU] update XPU dockerfile and CI script ( #15109 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-03-19 01:29:25 -07:00
61f412187d
[Bugfix] Re-enable Gemma3 for V1 ( #14980 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-18 23:58:22 -07:00
05ccd0aa35
[V1] Ensure using int64 for sampled token ids ( #15065 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-18 23:52:19 -07:00
f690372b68
[Core] Update dtype detection and defaults ( #14858 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-19 13:49:33 +08:00
8b3e94a357
[Model] Remove duplicated message check in Mistral chat completion request ( #15069 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-03-19 05:09:32 +00:00
437f9162d0
[Model] Pixtral: Remove layer instantiation duplication ( #15053 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2025-03-19 10:34:03 +08:00
4f065f12f5
[Misc][V1] Skip device checking if not available ( #15061 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-18 19:33:43 -07:00
228b768db6
[Doc] Minor v1_user_guide update ( #15064 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-03-18 16:10:45 -07:00
027827cc1d
fix long dtype in topk sampling ( #15049 )
2025-03-18 15:57:31 -07:00
72a8639b68
[V1] TPU - CI/CD use smaller model ( #15054 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-03-18 21:39:21 +00:00
99abb8b650
[V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels ( #14930 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-18 14:31:54 -07:00
3a1e648158
[V1] Refactor Structured Output for multiple backends ( #14694 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-18 19:49:15 +00:00
46c759c165
[Bugfix] Fix LoRA extra vocab size ( #15047 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-18 09:40:29 -07:00
179a619c21
[Bugfix] Fix broken CPU quantization due to triton import ( #15038 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-18 08:57:39 -07:00
452e8fd968
[MODEL] Add support for Zamba2 models ( #13185 )
...
Signed-off-by: Yury Tokpanov <yury@zyphra.com >
Signed-off-by: Quentin Anthony <qganthony@yahoo.com >
Co-authored-by: Quentin Anthony <qganthony@yahoo.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-03-18 08:56:21 -07:00
8b793f7ec6
MI325 configs, fused_moe_kernel bugfix ( #14987 )
...
Signed-off-by: Eugene Kuznetsov <eugene.kuznetsov@amd.com >
2025-03-18 08:05:18 -07:00
af35d3a3cc
[TPU][V1][Bugfix] Fix chunked prefill with padding ( #15037 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-03-18 07:34:45 -07:00
3b457143d2
[Bugfix] Register serializers for V0 MQ Engine ( #15009 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-18 09:14:47 -04:00
ab656f2c2f
[Bugfix] Loosen type check to avoid errors in V1 ( #15021 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-18 12:54:40 +00:00
64fc2193dc
[Misc][Docs] fix the comments of KV_T and CACHE_T in CALL_RESHAPE_AND_CACHE_XX macros ( #14347 )
2025-03-18 05:50:19 -07:00
dd732028f5
[Bugfix][Frontend] Fix validation of logprobs in ChatCompletionRequest ( #14352 )
...
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com >
2025-03-18 05:50:05 -07:00
414919138b
[Bugfix] torchrun compatibility ( #14899 )
...
Signed-off-by: hiyouga <hiyouga@buaa.edu.cn >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-03-18 05:49:27 -07:00
db7c8ca910
[Misc] Embedding model support LoRA ( #14935 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-18 12:07:00 +00:00
f863ffc965
[Mistral-Small 3.1] Update docs and tests ( #14977 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-03-18 03:29:42 -07:00
400d483e87
[Kernels] LoRA - Retire SGMV and BGMV Kernels ( #14685 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-18 09:47:53 +00:00
d1695758b2
[Doc][V1] Fix V1 APC doc ( #14920 )
2025-03-18 08:15:46 +00:00
53a0cf8b95
[Neuron] trim attention kernel tests to fit trn1.2x instance ( #14988 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-03-18 15:05:52 +08:00
5eeabc2a44
[Bugfix] Fix bnb quantization for models with both HF-format and Mistral-format weights ( #14950 )
2025-03-17 23:27:26 +00:00
18551e820c
[V1] TPU - Fix CI/CD runner ( #14974 )
2025-03-17 21:07:07 +00:00
e41e160263
[V1] Guard Against Main Thread Usage ( #14972 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-03-17 13:23:02 -07:00
b89fb2a4a1
[CI/Build] Use AutoModelForImageTextToText to load VLMs in tests ( #14945 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-17 18:35:17 +00:00
5340b0e221
[Bugfix] Fix interface for Olmo2 on V1 ( #14976 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-17 11:26:38 -07:00
37e3806132
[Bugfix] Make Gemma3 MM V0 only for now ( #14971 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-17 10:04:21 -07:00
c0efdd655b
[Fix][Structured Output] using vocab_size to construct matcher ( #14868 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-03-17 11:42:45 -04:00
aaaec52ad9
[Bugfix][Model] Mixtral: use unused head_dim config argument ( #14961 )
...
Signed-off-by: Quentin Torroba <quentin.torroba@mistral.ai >
2025-03-17 07:44:18 -07:00
e1eb45d397
[Bugfix] Fix precommit - line too long in pixtral.py ( #14960 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-17 07:18:50 -07:00
89fca671fb
[V1] Default MLA to V1 ( #14921 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-17 06:54:40 -07:00
d20b0c139c
Add patch merger ( #14957 )
2025-03-17 06:47:50 -07:00
166a168b0f
[Doc] Fix misleading log during multi-modal profiling ( #14955 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-17 06:14:32 -07:00
2bb0e1a799
[Bugfix][ROCm] running new process using spawn method for rocm in tests. ( #14810 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-03-17 11:33:35 +00:00
6eaf1e5c52
[Misc] Add --seed option to offline multi-modal examples ( #14934 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-17 03:00:17 -07:00
868a8c5b2c
[Bugfix] Fix Ultravox on V1 ( #14929 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-17 17:15:20 +08:00
b4ad56c1bd
[V1][TPU] Apply the ragged paged attention kernel fix and remove the padding. ( #14846 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-03-17 01:48:28 -07:00
69698f257e
fix minor miscalled method ( #14327 )
2025-03-17 01:47:58 -07:00
cd0cd85102
[MISC] More AMD unused var clean up ( #14926 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-03-17 16:40:41 +08:00
0a74bfce9c
setup.py: drop assumption about local main branch ( #14692 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-17 01:37:42 -07:00
dd3b865854
[Doc] Add vLLM Beijing meetup slide ( #14938 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-03-17 16:29:36 +08:00
9b87a579aa
[Misc][XPU] Use None as device capacity for XPU ( #14932 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2025-03-17 01:22:14 -07:00
b539222d4e
[V1] Remove input cache client ( #14864 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-03-16 23:42:06 -07:00
8d6cf89526
[V1] [Spec Decode] Support random sampling for spec decode ( #13933 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-16 22:00:20 -07:00
583a9778e0
[Benchmark] Do not save detailed info to json by default ( #14879 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-16 21:48:11 -07:00
a73e183e36
[Misc] Replace os environ to monkeypatch in test suite ( #14516 )
...
Signed-off-by: sibi <85477603+t-sibiraj@users.noreply.github.com >
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
2025-03-16 20:35:57 -07:00
1e799b7ec1
[BugFix] Fix MLA + V1 + TP==1 causing reinitialization of cuda context ( #14910 )
2025-03-17 03:35:37 +00:00
7f6c5ee06c
[V1][Minor] Add __repr__ to ConstantList ( #14907 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-16 20:20:15 -07:00
faa0275730
[V1] Optimize the overhead of rewinding ( #14905 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-16 20:19:30 -07:00
8a5a9b70d7
[CI/Build] Update defaults for test reproducibility ( #14893 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-17 10:38:15 +08:00
bb3aeddfaf
[CI] Nightly Tests ( #14898 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-03-17 02:06:43 +00:00
aecc780dba
[V1] Enable Entrypoints Tests ( #14903 )
2025-03-16 17:56:16 -07:00
90df7f23aa
[Doc] Add guidance for using ccache with pip install -e . in doc ( #14901 )
2025-03-16 23:10:04 +00:00
b9b5bdfc7d
[Misc] Catching Ray Compiled Graph PP test failures for V1 ( #14847 )
2025-03-16 15:46:42 -07:00
31060b2757
[V1][BugFix] Detect interleaved sliding window attention ( #14896 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-16 14:53:53 -07:00
fc1f67715d
[BugFix][V1] Fix overhead related to bad_words sampling when not in use ( #14894 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-16 14:53:34 -07:00
f6137adbcb
Revert "[Bugfix] Limit profiling run sequence length by max_model_len ( #14785 )" ( #14892 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-16 09:13:46 -07:00
e53b1350f2
[Bugfix] Explicitly disable Phi-4-multimodal in V1 ( #14889 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-16 09:05:40 -07:00
d30aa7e9e6
[Bugfix] Limit profiling run sequence length by max_model_len ( #14785 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-03-16 07:44:19 -07:00
d1ad2a57af
[V1] [Spec Decode] Fix ngram tests ( #14878 )
2025-03-16 00:29:22 -07:00
b82662d952
[BugFix] Fix torch distributed stateless PG backend init ( #14870 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-15 20:26:19 -07:00
71c1e07107
[Kernel] Add more tuned configs ( #14877 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-15 20:25:03 -07:00
b30c75dda4
[V1] Remove V0 fallback for mistral-tokenizer ( #14873 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-15 20:21:11 -07:00
def232e122
[VLM] Clean up Phi-4-MM ViT implementation ( #14812 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-03-15 18:53:52 -07:00
3453b964a3
[Misc][Doc] Minor benchmark README update ( #14874 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-16 09:46:17 +08:00
61c6a5a796
[VLM] Merged multi-modal processor for Pixtral ( #12211 )
...
Signed-off-by: remi <remi@mistral.ai >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-15 06:28:27 -07:00
74bc397b0a
[Core] Expose API endpoint /is_sleeping ( #14312 )
...
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com >
2025-03-15 06:28:14 -07:00
f58aea002c
[CI][Intel GPU] refine intel GPU ci docker build ( #14860 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-03-15 11:58:53 +00:00
3556a41434
[VLM] Limit multimodal input cache by memory ( #14805 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-15 02:52:05 -07:00
9ed6ee92d6
[Bugfix] EAGLE output norm bug ( #14464 )
...
Signed-off-by: Bryan Lu <yuzhelu@amazon.com >
2025-03-15 06:50:33 +00:00
ee3778d5fc
[Build/CI] Upgrade jinja2 to get 3 moderate CVE fixes ( #14839 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-15 05:38:19 +00:00
aaacf17324
[Doc] V1 user guide ( #13991 )
...
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com >
Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Co-authored-by: Jennifer Zhao <JenZhao@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-03-14 22:17:59 -07:00
4c7629cae9
[V1][Structured Output] calculate vocab_size eagerly ( #14851 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-03-14 22:09:51 -07:00
e0fdfa1608
[CI/Build] Delete LoRA bias test ( #14849 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-14 22:09:25 -07:00
5952d8ab61
[Attention] Get rid of mla cache alignment ( #14842 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-15 05:08:25 +00:00
a2ae496589
[CPU] Support FP8 KV cache ( #14741 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-03-14 22:07:36 -07:00
877e352262
[Docs] Add new East Coast vLLM Meetup slides to README and meetups.md ( #14852 )
2025-03-14 22:06:38 -07:00
d4d93db2c5
[V1] V1 Enablement Oracle ( #13726 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-03-14 22:02:20 -07:00
8c0d15d5c5
[Misc][Easy] Annotate unused vars in the csrc files ( #14798 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-03-15 12:40:09 +08:00
97ac781c62
[Misc] Remove misleading message in gemma2 and gemma3 ( #14850 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-14 21:35:12 -07:00
776dcec8fe
Disable outlines cache by default ( #14837 )
2025-03-15 03:57:55 +00:00
ccf02fcbae
Revert "[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of U… ( #14848 )
2025-03-14 20:45:42 -07:00
acaea3bb07
[Bugfix][V1] Fix flashinfer sampling ( #14815 )
2025-03-14 20:42:38 -07:00
9f37422779
[Neuron][CI] update docker run command ( #14829 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-03-14 18:51:35 -07:00
dd344e0342
[Bugfix] Fix torch_xla in V0 which can't handle None seed introduced … ( #14844 )
...
Signed-off-by: Yarong Mu <ymu@google.com >
2025-03-15 00:41:15 +00:00
54a8804455
[Doc] More neutral K8s deployment guide ( #14084 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-03-14 16:12:36 -07:00
bbd94a19fc
[Build/CI] Upgrade aiohttp to include CVE fix ( #14840 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-14 23:11:28 +00:00
233ffce1eb
[Build/CI] Move ninja to common deps ( #14835 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-14 21:25:28 +00:00
40677783aa
[CI] Add TPU v1 test ( #14834 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
2025-03-14 17:13:30 -04:00
14f301b541
Update to torch==2.6.0 ( #12721 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: luka <luka@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-14 16:58:30 -04:00
46f98893dd
[V1] Fix model parameterization for structured output tests ( #14833 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-14 20:55:18 +00:00
fe66b34728
[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies ( #14778 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-03-14 16:36:18 -04:00
270a5da495
Re-enable the AMD Entrypoints Test ( #14711 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-03-14 12:18:13 -07:00
7097b4cc1c
[release] Remove log cleanup commands from TPU job ( #14838 )
2025-03-14 11:59:52 -07:00
977a16772c
[Bugfix][Kernel]: Fix AllSpark kernel compilation errors and enable for CUDA < 12.0 ( #14430 )
...
Signed-off-by: wyj371990 <wyj371990@alibaba-inc.com >
2025-03-14 09:55:14 -07:00
73deea2fdb
[Frontend] track server_load ( #13950 )
2025-03-14 09:53:17 -07:00
9d2b4a70f4
[V1][Metrics] Updated list of deprecated metrics in v0.8 ( #14695 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-15 00:45:25 +08:00
0b0d6421b2
[Frontend] Fix log message to use http vs https ( #14774 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-14 09:21:09 -07:00
1140991a7b
[V1] Fix vocab size calculation for structured output ( #14826 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-14 09:18:38 -07:00
613c5bb945
[Bugfix] Fix Aria test loading ( #14823 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-14 09:11:23 -07:00
fd8e055ffb
[BugFix]: properly catch templating error when preprocess input ( #13976 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-03-14 05:58:34 -07:00
ab93f1360f
[VLM] Various cleanup and fixes ( #14806 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-14 05:58:19 -07:00
40253bab44
[Bugfix][W8A8] fixed cutlass block fp8 binding ( #14796 )
2025-03-14 03:32:42 -07:00
c77620d22d
[V1][Minor] Minor code cleanup for scheduling metrics ( #14800 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-14 08:21:28 +00:00
989ecd2007
[Misc] Gemma3ForConditionalGeneration supports LoRA ( #14797 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-14 01:07:30 -07:00
54cc46f3eb
[Bugfix] Fix small typo in the example of Streaming delimiter ( #14793 )
2025-03-14 08:05:17 +00:00
601bd3268e
[Misc] Clean up type annotation for SupportsMultiModal ( #14794 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-14 00:59:56 -07:00
09269b3127
[BugFix]Fix performance serving benchmark when enable profiling ( #14737 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-03-14 07:02:05 +00:00
27b50f1fe6
[Bugfix][Kernel][CPU] Fix num_tokens in CPU rotary embedding kernel ( #14667 )
...
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg >
2025-03-13 23:47:49 -07:00
9532c49836
[Attention] MLA get rid of materialization ( #14770 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-13 23:39:02 -07:00
0c2af17c76
[CI] Fix missing example model id in processor test ( #14787 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-14 13:52:15 +08:00
a6e0d096dd
[Feature] Add visionarena offline support for benchmark_throughput ( #14654 )
...
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com >
Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Co-authored-by: Jennifer Zhao <JenZhao@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-03-14 04:07:54 +00:00
d3d4956261
[Neuron] flatten test parameterization for neuron attention kernels ( #14712 )
2025-03-13 20:46:56 -07:00
4059adc31b
[Misc][Minor] Simplify SamplingParams.__post_init__() ( #14772 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-14 11:44:20 +08:00
f1f632d9ec
[ci] Reduce number of tests in fastcheck ( #14782 )
2025-03-13 20:43:45 -07:00
95d680b862
[Bugfix][IPEX] Add VLLM_CPU_MOE_PREPACK to allow disabling MoE prepack when CPU does not support it ( #14681 )
...
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg >
2025-03-13 20:43:18 -07:00
fb4c7f8ef0
[Kernel] [V1] Further optimizations to ROCm (Triton) Backend to better handle GQA. ( #14431 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com >
Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com >
2025-03-13 20:42:27 -07:00
0b1cfa6180
[Kernel] LoRA - Enable CUDAGraphs for V1 ( #14626 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-13 20:42:04 -07:00
32ef4983cd
[V1] Temporarily disable FlashInfer Rejection Sampler ( #14788 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-13 20:40:35 -07:00
ad19c8a003
[V1] Move OOM check into sampler run ( #14728 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-03-13 20:40:23 -07:00
2a602b055a
forward fix PR 14245, restore build on ROCm 6.2 ( #14709 )
...
Signed-off-by: Jeff Daily <jeff.daily@amd.com >
2025-03-13 20:40:15 -07:00
7888e1d0a3
[V1] TPU - Enable prefix caching by default ( #14773 )
2025-03-13 20:40:05 -07:00
60c872d4b6
[Doc] Fix small typo in Transformers fallback ( #14791 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-03-13 20:33:12 -07:00
3fb17d26c8
[Doc] Fix typo in documentation ( #14783 )
...
Signed-off-by: yasu52 <tsuguro4649@gmail.com >
2025-03-13 20:33:09 -07:00
d47807ba08
[Attention] Remove slow setattr in MLA ( #14769 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-13 21:31:14 +00:00
02fcaa3d0a
[V1] Detokenizer: Respect Stop Tokens + not include_stop_str_in_output ( #14624 )
...
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
2025-03-13 19:07:34 +00:00
8a4a2efc6f
[V1][Core] using cached vocab_size for Structured Outputs ( #14630 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-03-13 11:39:28 -07:00
8e9ffd37d6
[Misc] Clean up processor tests ( #14771 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-13 18:25:37 +00:00
01b3fd0af7
[V1][Minor] Minor enhancements on scheduler ( #14732 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-13 08:53:22 -07:00
f53a0586b9
[Bugfix] Fix prompt format of GLM4V ( #14539 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-13 11:37:17 +00:00
b1cc4dfef5
[VLM] Support loading InternVideo2.5 models as original InternVLChatModel ( #14738 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-13 03:10:02 -07:00
382403921f
[VLM] Support pan-and-scan for Gemma3 multi-modal processor ( #14672 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-03-13 02:23:12 -07:00
a73122de96
[Bugfix] fix benchmark moe ( #14653 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-13 16:12:42 +08:00
bd44b812cb
[CI/Build] Delete ultravox LoRA test ( #14730 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-13 07:57:39 +00:00
55211b01e8
[Bugfix] Fix chunked prefill for GGUF ( #14666 )
...
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com >
2025-03-13 07:19:03 +00:00
5d043c1685
[Quant] Bamba SupportsQuant ( #14698 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-03-13 04:57:05 +00:00
36d1ccb286
[Quant] BartModel SupportsQuant ( #14699 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-03-13 04:55:59 +00:00
1bc3b739c4
[V1][TPU] Add assertion on multi-step-scheduler ( #14707 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-03-12 21:37:58 -07:00
1bd32bc8dd
[Config][Disaggregated] Add timeout configuration for the torch.store and add KVTransferConfig.kv_connector_extra_config ( #14367 )
...
Signed-off-by: Mathis Felardos <mathis@mistral.ai >
2025-03-12 20:15:20 -07:00
128bf75283
[BugFix][TritonMLA] Process weights after model loading for GGUF ( #14555 )
...
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com >
2025-03-12 20:14:36 -07:00
a94a699c3f
[ROCm][FP8] Fix for adjustments needed only for fnuz ( #14689 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-03-12 20:14:04 -07:00
ab426ec9c0
Add ray[data] as tpu dependency ( #14691 )
...
Signed-off-by: <ricliu@google.com >
Signed-off-by: Richard Liu <ricliu@google.com >
2025-03-12 20:13:48 -07:00
165290d357
[bugfix] fixup warning message for plugged schedulers for v1 ( #14700 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-03-12 20:12:13 -07:00
ce20124671
[release] Add force remove for TPU logs ( #14697 )
2025-03-12 22:35:18 +00:00
53be4a8634
[V1] Allow sliding window + prefix caching ( #13069 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-12 11:21:19 -07:00
f5d3acd474
[BugFix][V1] Fix parallel sampling finishing/aborts ( #14512 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-12 10:29:48 -07:00
916836bbfb
[FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. ( #14664 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-03-12 09:31:19 -07:00
d9f83d6206
[ROCm] Enable chunked prefill/paged attention in MLA on ROCm ( #14316 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-03-12 15:51:20 +00:00
4a754fcf15
[Bugfix] Missing thumbnail from NVLM-D processor ( #14633 )
...
Signed-off-by: ameyanjarlekar <aanjarlekar@nvidia.com >
2025-03-12 08:50:49 -07:00
c0c25e25fa
[Model] Add support for Gemma 3 ( #14660 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-12 08:36:33 -07:00
45f3f3f59e
[ROCm][Bugfix] Ensure that the moe_wna16_gemm kernel is not built on ROCm platforms. ( #14629 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-03-12 08:00:28 -04:00
ff47aab056
[CPU] Upgrade CPU backend to torch-2.6 ( #13381 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-03-12 10:41:13 +00:00
debd6bbf09
[Kernel] Add ModelOpt FP4 Checkpoint Support ( #12520 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-03-12 05:13:11 +00:00
5c538c37b2
[V1][Bugfix][Spec Decode] Fix incorrect outputs in V1 speculative decoding due to batch indexing ( #14645 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-03-11 22:12:41 -07:00
e22ee1e7a2
[Kernel] GGUF MoE kernel ( #14613 )
...
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com >
2025-03-12 03:33:27 +00:00
e392d85831
[Core] Refactor QKVCrossParallelLinear implementation to support BNB 4-bit quantization ( #14545 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-11 20:12:52 -07:00
77a318bd01
[V1][Core] Support MistralTokenizer for Structured Output ( #14625 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-03-12 10:40:09 +08:00
80e78d02ac
[Model] Extend Ultravox to accept audio longer than 30s ( #13631 )
...
Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai >
2025-03-12 10:27:10 +08:00
4a42b9f5d6
[Doc] Update benchmarks README ( #14646 )
...
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-03-11 19:23:04 -07:00
47532cd9f4
[core][V1] pluggable scheduler ( #14466 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-03-12 01:15:15 +00:00
36e0c8f7da
[Feature] Add vllm bench CLI ( #13993 )
...
Signed-off-by: Randy Chen <acad.randyjhc@gmail.com >
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-12 00:31:48 +00:00
9f583e360c
[release] Add commands to clean up logs on TPU release node ( #14642 )
2025-03-12 00:14:50 +00:00
b706d898af
[Bugfix][V1][PP] Only warmup sampler at last PP rank ( #14643 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-11 23:40:07 +00:00
863d315c86
[V1][TPU] Pad the block_table.shape[1] so the ragged paged attention can handle correctly ( #14597 )
2025-03-11 19:12:26 -04:00
d374f04a33
Fix run_tpu_test ( #14641 )
...
Signed-off-by: <ricliu@google.com >
Signed-off-by: Richard Liu <ricliu@google.com >
2025-03-11 21:14:33 +00:00
61a01b27a7
[V1] Delay all xgrammar usage until needed ( #14616 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-11 20:21:33 +00:00
53056731fd
fix some typos : supported_head_sizes ( #14627 )
2025-03-11 10:38:24 -07:00
4cbf286794
[V1] Remove cache from StructuredOutputManager ( #14622 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-11 10:36:07 -07:00
c6e14a61ab
[Hardware][Intel GPU] upgrade IPEX dependency to 2.6.10. ( #14564 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-03-11 17:11:47 +00:00
07b4b7a37f
[BugFix/Build] Fix sparse kernels not getting built on hopper ( #14572 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-11 17:09:03 +00:00
07964e2f30
docs: Add documentation for s390x cpu implementation ( #14198 )
...
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-11 17:02:17 +00:00
4bf82d4b90
[V1] Add regex structured output support with xgrammar ( #14590 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-11 23:03:44 +08:00
9ab326713f
Uninstall dependencies before installing requirements/tpu.txt ( #14586 )
...
Signed-off-by: <ricliu@google.com >
Signed-off-by: Richard Liu <ricliu@google.com >
2025-03-11 08:01:35 -07:00
af295e9b01
[Bugfix] Update --hf-overrides for Alibaba-NLP/gte-Qwen2 ( #14609 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-11 07:59:43 -07:00
a1c8f3796c
dynamic dispatch of fp8 kernels ( #14245 )
...
Signed-off-by: Jeff Daily <jeff.daily@amd.com >
2025-03-11 10:54:56 -04:00
08a1a1121d
benchmarks: simplify test jsonschema ( #14567 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-11 13:39:30 +00:00
1477ffc381
[VLM] Cleanup siglip legacy code and fix broken paligemma multimodal processor ( #14602 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-11 11:27:36 +00:00
70b808fe1a
[Perf]:Optimize qwen2-vl to reduce cudaMemcpyAsync ( #14377 )
...
Signed-off-by: cynthieye <987073381@qq.com >
2025-03-11 07:39:56 +00:00
63d635d179
[Misc] Correct deepseek-vl2 chat template ( #14558 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-11 04:37:11 +00:00
1fc973c0b5
[V1][Core] Fix memory issue with logits & sampling ( #14508 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Varun Sundar Rabindranath <3337719+varun-sundar-rabindranath@users.noreply.github.com >
2025-03-11 04:03:41 +00:00
c982ac5722
[Bugfix] Fix FP16 overflow for DeepSeek V2 ( #13232 )
...
Signed-off-by: Yida Wu <yida.wu@amd.com >
2025-03-10 20:46:59 -07:00
4290b704ff
[V1][PP] Do not block engine core when no requests to schedule ( #14585 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-10 19:48:24 -07:00
c91b64f749
[neuron] add reshape_and_cache ( #14391 )
2025-03-10 18:37:29 -07:00
d6123170d5
[Neuron] Add Neuron device communicator for vLLM v1 ( #14085 )
2025-03-10 18:37:04 -07:00
485afdd3cb
[MISC][V1] Handle exception of current_platform.get_device_name() in arg_utils ( #14379 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-10 20:42:11 -04:00
90e88ab756
[Kernel] moe wna16 cuda kernel ( #13321 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-03-10 20:12:40 -04:00
04421dff8a
[V1] Prevent xgrammar from breaking TPU support ( #14575 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-10 23:06:19 +00:00
432d6dad15
Fix typo in benchmark_serving_structured_output.py ( #14566 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-10 14:58:58 -07:00
5ff0d32580
[V1] LoRA - Add triton kernels for V1 ( #13096 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-10 17:27:53 -04:00
0967110e42
[Minor] Update the tqdm bar for parallel sampling ( #14571 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-10 14:23:48 -07:00
fb0acb6c72
[Perf] Improve MLA on V1 ( #14540 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-10 12:06:58 -07:00
92b0ce2ac7
[Bugfix][v1] fixed llava-hf/llava-1.5-7b-hf is broken on V1 ( #14554 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-10 18:24:51 +00:00
bc2d4473bf
[Docs] Make installation URLs nicer ( #14556 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-10 10:43:08 -07:00
3b352a2f92
Correct capitalisation: VLLM -> vLLM ( #14562 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-10 16:36:21 +00:00
dea985aef0
[V1][Bugfix] Fix handling of second_per_grid_ts for Qwen2-VL & Qwen2.5-VL ( #14548 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-10 16:03:11 +00:00
39be30351f
Correct capitalisation: Github -> GitHub ( #14561 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-10 15:53:33 +00:00
001a9c7b0d
[Doc] Update PaliGemma note to a warning ( #14565 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-10 15:02:28 +00:00
89cdaa83e7
[Kernel] Add more dtype support for GGUF kernels ( #14043 )
...
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com >
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com >
2025-03-10 07:30:04 -07:00
b0746fae3d
[Frontend] support image embeds ( #13955 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-03-10 12:36:03 +00:00
60a98b2de5
[Docs] Mention model_impl arg when explaining Transformers fallback ( #14552 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-10 12:13:10 +00:00
460f553a6d
[Misc] Add log information for handle_process_request. ( #14130 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-03-10 08:40:50 +00:00
1253b15774
[Feature] Consolidate performance benchmark datasets ( #14036 )
...
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-03-10 07:23:11 +00:00
dc74613fa2
[Bugfix] Wrong requirements path - rocm ( #14527 )
...
Signed-off-by: Martin Hoyer <mhoyer@redhat.com >
2025-03-10 02:49:46 +00:00
a21076ed3a
[Misc] Ensure out-of-tree quantization method is recognized by CLI args ( #14328 )
...
Signed-off-by: liuyanyi <wolfsonliu@163.com >
2025-03-09 12:13:31 +00:00
212007b168
[Hardware][TPU] Fix the recompiling issue in logits processor after warmup ( #14510 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-03-09 05:44:39 -04:00
fb16eea48b
[Bugfix] Revert QKVCrossParallelLinear usage in Mllama to keep BNB quantization work ( #14498 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-09 04:47:45 +00:00
73ae0b44e9
[Bugfix] Fix tqdm progress bar when SamplingParams.n > 1 ( #12428 )
...
Signed-off-by: Yuchen Yan <740987012@qq.com >
2025-03-08 20:14:53 -08:00
6d7f037748
[Feat] Support chunked prefill for LMCache connector ( #14505 )
...
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn >
2025-03-08 19:30:06 -08:00
10f7552789
[V1][TPU] Remove unnecessary padding for running on TPU. ( #14467 )
2025-03-08 21:56:04 -05:00
b0d541947a
[Attention] Default to FlashMLA backend for MLA ( #14451 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-08 18:18:39 -08:00
5f0b53c6ea
Revert "[V1][Core] Fix memory issue with logits & sampling" ( #14504 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-03-08 17:43:37 -08:00
eb8b5eb183
[V1] Support bad_words in sampler ( #13376 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-03-08 14:50:26 -08:00
9513290032
[Misc] Upgrade to Python 3.9 typing for additional directories ( #14492 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-08 17:35:50 +00:00
0d5e73d30e
Update CODEOWNERS for structured output ( #14496 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-08 17:19:51 +00:00
609ef61fea
[Bugfix] Fix profiling OOM and decouple encoder multimodal profiling ( #14361 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-08 16:52:34 +00:00
db84f5eb3b
[Bugfix] DeepSeek Accuracy ( #14476 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-08 16:47:03 +00:00
206e2577fa
Move requirements into their own directory ( #12547 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-08 16:44:35 +00:00
e02883c400
[Misc] Don't run ruff at all on 3rd party libs ( #14493 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-08 07:16:40 -08:00
9085aabd62
[benchmarks] Add option to use unique jsonschema for each request ( #14457 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-08 06:36:39 -08:00
8d5aa466fb
[V1][Core] Fix memory issue with logits & sampling ( #13776 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-08 06:11:04 -08:00
0b7f06b447
[Misc] add use_tqdm_on_load to reduce logs ( #14407 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-03-08 05:57:46 -08:00
03fe18ae0f
[VLM] Add TP support for Phi-4-MM ( #14453 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-08 05:57:14 -08:00
cb8bdfade2
[V1] TPU - Add tensor parallel support via Ray ( #13618 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-03-08 08:19:38 -05:00
33f227e16b
[CI/Build] Use a fixed seed to avoid flaky tests ( #14480 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-08 11:30:09 +00:00
cfd0ae8234
Add RLHF document ( #14482 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-08 09:51:39 +00:00
7caff01a7b
[Build/BugFix] Fix hopper 12.8 build ( #14354 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-08 08:11:56 +00:00
be0b399d74
Add training doc signposting to TRL ( #14439 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-08 07:35:07 +00:00
b8b0ccbd2d
[Bugfix] Make the deviceprofiler include LoRA memory. ( #14469 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-08 07:12:22 +00:00
c908a07f57
[Doc] Added QwQ-32B to the supported models list in the reasoning out… ( #14479 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-03-08 07:07:32 +00:00
7b6fd6e486
[Doc]add doc for Qwen models tool calling ( #14478 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-03-08 06:58:46 +00:00
47512b3200
Default to generation_config from model ( #12622 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-08 14:46:15 +08:00
3b9c6c6947
[CI/Build] refactor: set timezone of container to UTC ( #12888 )
...
Signed-off-by: Roger Meier <r.meier@siemens.com >
2025-03-07 22:42:01 -08:00
4aae667668
[core] add extra_args to SamplingParams ( #13300 )
...
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com >
2025-03-08 14:41:18 +08:00
9f3bc0f58c
[MISC][V1] Register process killing handler only in the main thread ( #14380 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-07 22:40:06 -08:00
980385f8c1
[Bugfix][Disaggregated] Add a check in send_kv_caches_and_hidden_states and fix the reshape of the KVCache ( #14369 )
...
Signed-off-by: Mathis Felardos <mathis@mistral.ai >
2025-03-07 22:39:31 -08:00
ca7a2d5f28
Revert "[Perf] Reduce MLA CPU overheads in V1 ( #14384 )" ( #14471 )
2025-03-07 22:18:53 -08:00
333681408f
[Bugfix][V1] Handle MLA in kv_cache_interface ( #14462 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-07 22:18:25 -08:00
ef64044079
[V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC ( #13949 )
2025-03-08 01:48:12 +00:00
66e16a038e
[Bugfix] Fix torch_xla which can't handle None seed introduced in #14274 ( #14459 )
...
Signed-off-by: Yarong Mu <ymu@google.com >
2025-03-07 23:17:04 +00:00
e1f0835ae0
[V1][Metrics] Fix traceback with preemptions+LoRA ( #14220 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-07 15:36:16 -05:00
8ed5421aaa
[V1] Eagerly remove finished requests from the batch ( #14388 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-07 10:56:00 -08:00
c6359e8ca6
[v1] torch.compile integration explanation ( #14437 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-08 01:55:50 +08:00
952a074980
[Misc] Add Phi4-MM example ( #14343 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-07 17:28:52 +00:00
d0feea31c7
[Kernel] optimize performance of gptq marlin kernel when n is small ( #14138 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-03-07 11:53:38 -05:00
58abe35455
[Benchmarks] Make detokenization optional in benchmark scripts ( #11697 )
...
Signed-off-by: Jeremy Arnold <Jeremy.Arnold@amd.com >
2025-03-07 08:09:00 -08:00
f7ebad2307
[Doc] Update prefix_caching.md to match the example image ( #14420 )
2025-03-07 15:29:00 +00:00
80e9afb5bc
[V1][Core] Support for Structured Outputs ( #12388 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-03-07 07:19:11 -08:00
1e3598edeb
Use the optimized block sizes after tuning the kernel. ( #14329 )
2025-03-07 13:25:13 +00:00
f7a6bd0fa1
Fix missing kv_caches and attn_metadata in OpenVINOCausalLM ( #14271 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-07 12:30:42 +00:00
0ca3b8e01c
[BUGFIX] Skip tokenization support for throughput benchmark ( #12712 )
...
Signed-off-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu >
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2025-03-07 02:51:47 -08:00
cc10281498
[Misc] Set default value of seed to None ( #14274 )
...
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
2025-03-07 10:40:01 +00:00
05fb6718f0
[Bugfix] Clean up multi-modal processors ( #14417 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-07 10:33:38 +00:00
12c29a881f
[Bugfix] Further clean up LoRA test ( #14422 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-07 10:30:55 +00:00
70da0c0748
correct wrong markdown syntax ( #14414 )
...
Signed-off-by: vincent-pli <justdoit.pli@gmail.com >
2025-03-07 08:01:18 +00:00
c1588a2c94
[GH] Auto-apply multi-modality label to relevant PRs ( #14402 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-07 15:26:32 +08:00
8ca7a71df7
OpenVINO: added CPU-like conditions ( #14338 )
...
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com >
2025-03-06 22:24:49 -08:00
63137cd922
[Build] Add nightly wheel fallback when latest commit wheel unavailable ( #14358 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-06 22:10:57 -08:00
ddd1ef66ec
[Bugfix] Fix JambaForCausalLM LoRA ( #14370 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-06 22:05:47 -08:00
e5e03c2c1b
[BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs ( #14396 )
2025-03-06 21:56:06 -08:00
e1744502c2
[FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object ( #14390 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-03-07 05:20:16 +00:00
dae6896977
[Perf] Reduce MLA CPU overheads in V1 ( #14384 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-06 19:59:14 -08:00
c34eeec58d
[Bugfix] Correctly call cudaProfilerStop in benchmarks script ( #14183 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-03-07 00:42:49 +00:00
ad60bbb2b2
[Doc] Fix a typo ( #14385 )
2025-03-06 16:31:52 -08:00
0578e5a462
[Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue ( #14310 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-03-06 23:31:05 +00:00
04222984f8
[Docs] Add nsight guide to profiling docs ( #14298 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-06 14:19:58 -08:00
6832707e90
[V1][Bugfix] Standardize quantized kv cache rejection for attention backends ( #14221 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-06 14:18:29 -08:00
6b2ef5cd17
[Bug] Fix Attention when ignored by quant_method ( #14313 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-06 14:18:06 -08:00
958adce478
[Bugfix] Fix use_direct_call condition in FusedMoE layer for ( #14382 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-06 14:17:21 -08:00
99b0915d3b
[Kernel] Add needs_fixed_stride_order tag to most GEMMs ( #14306 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-06 14:17:09 -08:00
8ca2b21c98
[CI] Disable spawn when running V1 Test ( #14345 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-03-06 21:52:46 +00:00
d9292786e1
[CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa ( #13569 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-06 16:08:36 -05:00
cc2f9b32c8
[Distributed] Add enable_expert_parallel arg ( #14305 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-06 18:54:45 +00:00
cd579352bf
[V1] Do not detokenize if sampling param detokenize is False ( #14224 )
...
Signed-off-by: Himanshu Jaju <hj@mistral.ai >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-03-06 10:40:24 -08:00
9f1710f1ac
Fix mla prefill context performance ( #13897 )
...
Signed-off-by: ZhongYingMatrix <zhongyingmatrix@gmail.com >
2025-03-06 09:35:49 -08:00
e642ec962c
Add authors to license header. ( #14371 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com >
2025-03-06 08:43:09 -08:00
ada19210a3
Adding cpu inference with VXE ISA for s390x architecture ( #12613 )
...
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com >
Signed-off-by: Rishika Kedia <rishika.kedia@in.ibm.com >
Co-authored-by: Rishika Kedia <rishika.kedia@in.ibm.com >
2025-03-06 08:40:53 -08:00
bf0560bda9
Reinstate best_of for V0 ( #14356 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-06 08:34:22 -08:00
151b08e0fe
[RLHF] use worker_extension_cls for compatibility with V0 and V1 ( #14185 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-07 00:32:46 +08:00
81b2f4a45f
[Doc] Fix date typo in README.md ( #14366 )
...
Signed-off-by: Jitse Klomp <jitse.klomp@conclusionxforce.nl >
2025-03-06 08:29:57 -08:00
82551ad616
[Core] Don't use cache during multi-modal profiling ( #14336 )
2025-03-06 08:03:31 -08:00
caac5c2e59
[Bugfix][Core] fix abort_seq_group and memory leak when n>1 ( #14326 )
...
Signed-off-by: courage17340 <courage17340@163.com >
2025-03-06 23:59:32 +08:00
6bd1dd9d26
[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend ( #14152 )
2025-03-06 07:39:16 -08:00
4f27044aab
[Doc] Correct beam_search usage in generative_models.md ( #14363 )
2025-03-06 15:37:10 +00:00
0ddc991f5c
[Doc] Update reasoning with stream example to use OpenAI library ( #14077 )
...
Signed-off-by: liuyanyi <wolfsonliu@163.com >
2025-03-06 13:20:37 +00:00
fa82b93853
[Frontend][Docs] Transcription API streaming ( #13301 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-03-06 10:39:35 +00:00
69ff99fdcd
[Core] Optimizing cross-attention QKVParallelLinear computation ( #12325 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal>
Co-authored-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal>
2025-03-06 09:37:26 +00:00
5d802522a7
[V1][VLM][Pixtral-HF] Support Pixtral-HF on V1 ( #14275 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-03-06 08:58:41 +00:00
1769928079
[Model] Update Paligemma multimodal processing with PromptUpdate ( #14015 )
...
Signed-off-by: Kyle Huang <kylhuang@nvidia.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-03-06 08:31:38 +00:00
ed6ea06577
[Hardware] Update the flash attn tag to support Blackwell ( #14244 )
2025-03-05 22:01:37 -08:00
5ee10e990d
[Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention ( #11301 )
2025-03-05 20:00:53 -08:00
3dbd2d813a
[V1] LoRA - Enable more V1 tests ( #14315 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-03-06 11:55:42 +08:00
f5f7f00cd9
[Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 ( #14114 )
2025-03-06 03:49:20 +00:00
abcc61e0af
[misc] Mention ray list nodes command to troubleshoot ray issues ( #14318 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-03-06 02:00:36 +00:00
f6bb18fd9a
[BugFix] MLA + V1, illegal memory access and accuracy issues ( #14253 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-03-05 17:10:13 -08:00
71eaf8969b
[Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation ( #13850 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-03-05 17:09:29 -08:00
ca100c90fe
Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM ( #13917 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-05 17:08:51 -08:00
ffad94397d
[CI/Build] Use spawn multiprocessing mode for V1 test pipeline ( #14243 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-03-05 17:08:02 -08:00
4dacaa4a83
[BugFix] Fix prefix caching V0 MLA ( #14255 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: Ying Zhong <zhongyingmatrix@gmail.com >
2025-03-05 17:07:42 -08:00
a7ea35aa67
[Bugfix] Remove num_tokens_across_dp ( #14302 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-05 23:55:55 +00:00
1e3e76b6cc
[Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch ( #14237 )
...
Signed-off-by: pyc96 <pychen96@gmail.com >
2025-03-05 22:22:40 +00:00
53ea6ad830
[V1][Easy] Add empty allowed_token_ids in the v1 sampler test ( #14308 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-03-05 21:41:18 +00:00
1b7624bf5c
[misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env ( #14267 )
2025-03-05 21:28:50 +00:00
ac60dc7fe1
[V1][BugFix] Fix for mixed top_k batch ( #14301 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Ye Cao <caoye.cao@alibaba-inc.com >
2025-03-05 20:43:04 +00:00
a4f1ee35d6
Deprecate best_of Sampling Parameter in anticipation for vLLM V1 ( #13997 )
...
Signed-off-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com >
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-05 20:22:43 +00:00
a32c8669ca
[V1][Minor] Remove obsolete FIXME comment ( #14304 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-05 11:59:23 -08:00
ca2ca8de57
[Docs] Add Meta Slides ( #14297 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-03-05 08:30:23 -08:00
f71b00a19e
[Bugfix] Fix broken vision language example ( #14292 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-05 15:57:10 +00:00
8f808cf86e
prefix_caching.md: Fixed typo ( #14293 )
...
Signed-off-by: Daivid Savernin-Frenk <daivid.frank@TurboNext.ai >
2025-03-05 15:43:13 +00:00
7bab4bb048
[Misc] Add Qwen2MoeForCausalLM moe tuning support ( #14276 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-05 23:11:29 +08:00
e17e4488bd
[LoRA] Remove linear hack outside transformers backend ( #14177 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-03-05 15:06:28 +00:00
257e200a25
[V1][Frontend] Add Testing For V1 Runtime Parameters ( #14159 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-03-05 14:18:55 +00:00
47d4a7e004
Small update for external_launcher backend docs ( #14288 )
2025-03-05 21:30:00 +08:00
7f89a594dd
[Doc] [3/N] Refer code examples for common cases in dev multimodal processor ( #14278 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-03-05 12:29:50 +00:00
961644e6a8
[Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID ( #14217 )
...
Signed-off-by: Iacopo Poli <iacopo@lighton.ai >
2025-03-05 11:44:10 +00:00
8d6cd32b7b
[Bugfix][V1] Fix allowed_token_ids for v1 Sampler ( #14169 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-03-05 08:49:44 +00:00
ec79b67c77
[Misc][V1] Avoid using envs.VLLM_USE_V1 in mm processing ( #14256 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-03-05 07:37:16 +00:00
32985bed7c
[Frontend] Allow return_tokens_as_token_ids to be passed as a request param ( #14066 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-03-05 06:30:40 +00:00
dae9ec464c
Temporarily disable test_awq_gemm_opcheck ( #14251 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-05 06:10:35 +00:00
6eaf93020d
[platforms] improve rocm debugging info ( #14257 )
2025-03-04 21:32:18 -08:00
72c62eae5f
[V1] EP/TP MoE + DP Attention ( #13931 )
2025-03-04 21:27:26 -08:00
0a995d5434
[Model] New model support for Phi-4-multimodal-instruct ( #14119 )
2025-03-04 20:57:01 -08:00
ade3f7d988
[V1][Bugfix] Do not reset prefix caching metrics ( #14235 )
2025-03-05 04:39:13 +00:00
0df25101d6
[Bugfix] Fix gptq_marlin for deepseek-v3 ( #13750 )
...
Signed-off-by: dangshunya <dangshunya@baichuan-inc.com >
Co-authored-by: dangshunya <dangshunya@baichuan-inc.com >
2025-03-05 12:25:53 +08:00
e123aafdf0
Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 ( #14157 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-05 12:25:24 +08:00
5b143d33be
Moved numba from common requirements to cuda/rocm specific requirements ( #14199 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
2025-03-05 12:25:00 +08:00
eb59b5a6cb
[misc] announce china meetup ( #14248 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-05 10:33:50 +08:00
fbfc3ee37e
[V1][TPU] TPU multimodal model support for ragged attention ( #14158 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-03-04 19:58:48 -05:00
3e1d223626
[ROCm] Disable a few more kernel tests that are broken on ROCm ( #14145 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-03-04 23:37:55 +00:00
4f5b059f14
Clean up unused padding_idx variables across many model definitions ( #13240 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-03-04 21:27:00 +00:00
288ca110f6
[Security] Serialize using safetensors instead of pickle in Mooncake Pipe ( #14228 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-03-04 21:10:32 +00:00
c2bd2196fc
[v1][Metrics] Add design doc ( #12745 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-04 20:36:55 +00:00
550c7ba3dc
[Docs] Update Dockerfile dependency image ( #14215 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-04 20:22:11 +00:00
e5b2f1601a
[Frontend] Do prompt_logprobs clamping for chat as well as completions ( #14225 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-04 20:13:06 +00:00
9badee53de
Fix performance when --generation-config is not None ( #14223 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-04 20:59:22 +01:00
beebf4742a
[TPU][Profiler] Support start_profile/stop_profile in TPU worker ( #13988 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-03-04 14:40:06 -05:00
f89978ad7c
add cutlass support for blackwell fp8 gemm ( #13798 )
2025-03-04 07:55:07 -08:00
b3cf368d79
[V1][Molmo] Fix get_multimodal_embeddings() in molmo.py ( #14161 )
2025-03-04 15:43:59 +00:00
c8525f06fc
[V0][Metrics] Deprecate some questionable request time metrics ( #14135 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-04 15:11:33 +00:00
5db6b2c961
[V1][BugFix] Fix remaining sync engine client shutdown errors/hangs ( #13869 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-04 15:06:47 +00:00
6247bae6c6
[Bugfix] Restrict MacOS CPU detection ( #14210 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-04 22:25:27 +08:00
3610fb4930
[doc] add "Failed to infer device type" to faq ( #14200 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-04 20:47:06 +08:00
71c4b40562
[sleep mode] error out with expandable_segments ( #14189 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-04 18:54:19 +08:00
ac65bc92df
[platform] add debug logging during inferring the device type ( #14195 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-03-04 18:39:16 +08:00
f78c0be80a
Fix benchmark_moe.py tuning for CUDA devices ( #14164 )
2025-03-03 21:11:03 -08:00
66233af7b6
Use math.prod instead of np.prod for trivial ops ( #14142 )
2025-03-03 21:09:22 -08:00
bf13d40972
[core] Pass all driver env vars to ray workers unless excluded ( #14099 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-03-04 11:44:17 +08:00
989f4f430c
[Misc] Remove lru_cache in NvmlCudaPlatform ( #14156 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-04 11:09:34 +08:00
bb5b640359
[core] moe fp8 block quant tuning support ( #14068 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-03-04 01:30:23 +00:00
c060b71408
[Model] Add support for GraniteMoeShared models ( #13313 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-03-04 08:04:52 +08:00
79e4937c65
[v1] Add comments to the new ragged paged attention Pallas kernel ( #14155 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-03-03 23:00:55 +00:00
cd1d3c3df8
[Docs] Add GPTQModel ( #14056 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-03-03 21:59:09 +00:00
19d98e0c7d
[Kernel] Optimize moe intermediate_cache usage ( #13625 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-03 16:29:53 -05:00
2b04c209ee
[Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 ( #14100 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-03-03 14:20:24 -07:00
ae122b1cbd
[WIP][V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics ( #14055 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-03 19:04:45 +00:00
872db2be0e
[V1] Simplify stats logging ( #14082 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-03-03 10:34:14 -08:00
2dfdfed8a0
[V0][Metrics] Deprecate some KV/prefix cache metrics ( #14136 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-03 18:25:46 +00:00
c41d27156b
[V0][Metrics] Remove unimplemented vllm:tokens_total ( #14134 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-03 17:50:22 +00:00
91373a0d15
Fix head_dim not existing in all model configs (Transformers backend) ( #14141 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-03 17:48:11 +00:00
848a6438ae
[ROCm] Faster Custom Paged Attention kernels ( #12348 )
2025-03-03 09:24:45 -08:00
98175b2816
Improve the docs for TransformersModel ( #14147 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-03-03 17:03:05 +00:00
4167252eaf
[V1] Refactor parallel sampling support ( #13774 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-03-03 08:15:27 -08:00
f35f8e2242
[Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 ( #13921 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-03-03 16:43:14 +08:00
b87c21fc89
[Misc][Platform] Move use allgather to platform ( #14010 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-03-03 15:40:04 +08:00
e584b85afd
[Misc] duplicate code in deepseek_v2 ( #14106 )
2025-03-03 14:10:11 +08:00
09e56f9262
[Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure ( #14051 )
2025-03-02 17:35:01 -08:00
cf069aa8aa
Update deprecated Python 3.8 typing ( #13971 )
2025-03-02 17:34:51 -08:00
bf33700ecd
[v0][structured output] Support reasoning output ( #12955 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
2025-03-02 14:49:42 -05:00
bc6ccb9878
[Doc] Source building add clone step ( #14086 )
...
Signed-off-by: qux-bbb <1147635419@qq.com >
2025-03-02 10:59:50 +00:00
82fbeae92b
[Misc] Accurately capture the time of loading weights ( #14063 )
...
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com >
2025-03-01 17:20:30 -08:00
cc5e8f6db8
[Model] Add LoRA support for TransformersModel ( #13770 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-02 09:17:34 +08:00
d54990da47
[v1] Add __repr__ to KVCacheBlock to avoid recursive print ( #14081 )
2025-03-01 20:46:02 +00:00
b9f1d4294e
[v1][Bugfix] Only cache blocks that are not in the prefix cache ( #14073 )
2025-03-01 08:25:54 +00:00
b28246f6ff
[ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class ( #14065 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-03-01 07:18:32 +00:00
3b5567a209
[V1][Minor] Do not print attn backend twice ( #13985 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-03-01 07:09:14 +00:00
fdcc405346
[Doc] Consolidate whisper and florence2 examples ( #14050 )
2025-02-28 22:49:15 -08:00
8994dabc22
[Documentation] Add more deployment guide for Kubernetes deployment ( #13841 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-03-01 06:44:24 +00:00
02296f420d
[Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor ( #14053 )
2025-02-28 22:31:01 -08:00
6a92ff93e1
[Misc][Kernel]: Add GPTQAllSpark Quantization ( #12931 )
2025-02-28 22:30:59 -08:00
6a84164add
[Bugfix] Add file lock for ModelScope download ( #14060 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-03-01 06:10:28 +00:00
f64ffa8c25
[Docs] Add pipeline_parallel_size to optimization docs ( #14059 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-03-01 05:43:54 +00:00
bd56c983d6
[torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass ( #10902 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-02-28 16:20:11 -07:00
084bbac8cc
[core] Bump ray to 2.43 ( #13994 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-02-28 21:47:44 +00:00
28943d36ce
[v1] Move block pool operations to a separate class ( #13973 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-28 20:53:31 +00:00
b526ca6726
Add RELEASE.md ( #13926 )
...
Signed-off-by: atalman <atalman@fb.com >
2025-02-28 12:25:50 -08:00
e7bd944e08
[v1] Cleanup the BlockTable in InputBatch ( #13977 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-02-28 19:03:16 +00:00
c3b6559a10
[V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU ( #13379 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-02-28 11:01:36 -07:00
4be4b26cb7
Fix entrypoint tests for embedding models ( #14052 )
2025-02-28 08:56:44 -08:00
2aed2c9fa7
[Doc] Fix ROCm documentation ( #14041 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-02-28 16:42:07 +00:00
9b61dd41e7
[Bugfix] Initialize attention bias on the same device as Query/Key/Value for QwenVL Series ( #14031 )
2025-02-28 07:36:08 -08:00
f7bee5c815
[VLM][Bugfix] Enable specifying prompt target via index ( #14038 )
2025-02-28 07:35:55 -08:00
e0734387fb
[Bugfix] Fix MoeWNA16Method activation ( #14024 )
2025-02-28 15:22:42 +00:00
f58f8b5c96
Update AutoAWQ docs ( #14042 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-28 15:20:29 +00:00
b3f7aaccd0
[V1][Minor] Restore V1 compatibility with LLMEngine class ( #13090 )
2025-02-28 00:52:25 -08:00
b91660ddb8
[Hardware][Intel-Gaudi] Regional compilation support ( #13213 )
2025-02-28 00:51:49 -08:00
76c89fcadd
Use smaller embedding model when not testing model specifically ( #13891 )
2025-02-28 00:50:43 -08:00
b9e41734c5
[Bugfix][Disaggregated] patch the inflight batching on the decode node in SimpleConnector to avoid hangs in SimpleBuffer (nccl based) ( #13987 )
...
Signed-off-by: Mathis Felardos <mathis@mistral.ai >
2025-02-28 07:53:45 +00:00
1088f06242
[Doc] Move multimodal Embedding API example to Online Serving page ( #14017 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-28 07:12:04 +00:00
73e0225ee9
[Bugfix] Check that number of images matches number of <|image|> tokens with mllama ( #13911 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-02-28 04:00:45 +00:00
6c85da3a18
[V1] SupportsV0Only protocol for model definitions ( #13959 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-02-27 20:02:15 -05:00
67fc426845
[Misc] Print FusedMoE detail info ( #13974 )
2025-02-27 18:53:13 -05:00
9804145cac
[Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict ( #13626 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-02-27 15:28:08 -08:00
2e94b9cfbb
[Attention] Flash MLA for V1 ( #13867 )
...
Signed-off-by: Yang Chen <yangche@fb.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Yang Chen <yangche@fb.com >
2025-02-27 23:03:41 +00:00
8294773e48
[core] Perf improvement for DSv3 on AMD GPUs ( #13718 )
...
Signed-off-by: qli88 <qiang.li2@amd.com >
2025-02-27 22:14:30 +00:00
cd813c6d4d
[V1][Minor] Minor cleanup for GPU Model Runner ( #13983 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-27 13:11:40 -08:00
38acae6e97
[ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups ( #13970 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-02-27 20:31:47 +00:00
a2dd48c386
[VLM] Deprecate legacy input mapper for OOT multimodal models ( #13979 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-27 19:14:55 +00:00
126f6beeb4
Bump azure/setup-helm from 4.2.0 to 4.3.0 ( #13742 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-27 19:04:10 +00:00
58d1b2aa77
[Attention] MLA support for V1 ( #13789 )
...
Signed-off-by: Yang Chen <yangche@fb.com >
2025-02-27 13:14:17 -05:00
f1579b229d
[VLM] Generalized prompt updates for multi-modal processor ( #13964 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-27 17:44:25 +00:00
7864875879
[Bugfix] Fix qwen2.5-vl overflow issue ( #13968 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-02-27 17:30:39 +00:00
1dd422b64a
Update LMFE version to v0.10.11 to support new versions of transforme… ( #13930 )
2025-02-27 17:16:12 +00:00
06c8f8d885
[bugfix] Fix profiling for RayDistributedExecutor ( #13945 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-02-28 01:01:21 +08:00
5677c9bb3e
Deduplicate .pre-commit-config.yaml's exclude ( #13967 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-27 16:27:47 +00:00
512d77d582
Update quickstart.md ( #13958 )
2025-02-27 16:05:11 +00:00
7f0be2aa24
[Model] Deepseek GGUF support ( #13167 )
2025-02-27 02:08:35 -08:00
edf309ebbe
[VLM] Support multimodal inputs for Florence-2 models ( #13320 )
2025-02-27 02:06:41 -08:00
788f284b53
Fix test_block_fp8.py test for MoE ( #13915 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-27 18:00:00 +08:00
4b1d141f49
[PP] Correct cache size check ( #13873 )
...
Signed-off-by: Yang Zheng <zhengy.gator@gmail.com >
2025-02-27 17:47:29 +08:00
10c3b8c1cf
[Misc] fixed 'required' is an invalid argument for positionals ( #13948 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-02-27 09:06:49 +00:00
a7f37314b7
[CI/Build] Add examples/ directory to be labelled by mergify ( #13944 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-02-27 08:24:11 +00:00
cd711c48b2
[V1][Metrics] Handle preemptions ( #13169 )
2025-02-26 20:04:59 -08:00
378b3ef6f8
[ROCm][V1] Update reshape_and_cache to properly work with CUDA graph padding ( #13922 )
2025-02-26 20:04:12 -08:00
c9944acbf9
[misc] Rename Ray ADAG to Compiled Graph ( #13928 )
2025-02-26 20:03:28 -08:00
ca377cf1b9
Use CUDA 12.4 as default for release and nightly wheels ( #12098 )
2025-02-26 19:06:37 -08:00
a31614e386
[ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined ( #13851 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2025-02-27 10:39:10 +08:00
f95903909f
[Kernel] FlashMLA integration ( #13747 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-02-27 10:35:08 +08:00
b382a7f28f
[BugFix] Make FP8 Linear compatible with torch.compile ( #13918 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-26 13:48:55 -08:00
4cb6fa0a9c
[Bugfix] Backend option to disable xgrammar any_whitespace ( #12744 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com >
2025-02-26 10:52:34 -08:00
d08b285adf
[Misc] fixed qwen_vl_utils parameter error ( #13906 )
2025-02-26 08:31:53 -08:00
b27122acc2
[TPU] use torch2.6 with whl package ( #13860 )
...
Signed-off-by: Chenyaaang <llccyy1212@gmail.com >
2025-02-26 08:18:54 -05:00
934bb99c71
[Bugfix] Update expected token counts for Ultravox tests ( #13895 )
2025-02-26 04:56:50 -08:00
3f808cc044
[Bugfix] Do not crash V0 engine on input errors ( #13101 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-02-26 19:07:29 +08:00
ec8a5e5386
[Misc]: Add support for goodput on guided benchmarking + TPOT calculation refactor ( #13736 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-02-26 19:06:47 +08:00
215bf150a6
[Bugfix] Handle None parameters in Mistral function calls. ( #13786 )
2025-02-26 03:06:21 -08:00
0ecdd98031
Add comments on accessing kv_cache and attn_metadata ( #13887 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-26 18:41:02 +08:00
7b700ec8c8
[Bugfix] Add test example for Ultravox v0.5 ( #13890 )
2025-02-26 02:31:43 -08:00
7ca1da020f
[Misc] Fix input processing for Ultravox ( #13871 )
2025-02-25 23:56:34 -08:00
5157338ed9
[Misc] Improve LoRA spelling ( #13831 )
2025-02-25 23:43:01 -08:00
e206b54331
[v0][Core] Use xgrammar shared context to avoid copy overhead for offline engine ( #13837 )
...
Signed-off-by: Seth Kimmel <seth.kimmel3@gmail.com >
2025-02-26 14:58:24 +08:00
1d35662e6d
[ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms ( #13844 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-02-26 14:56:58 +08:00
e656f638de
[Doc] fix the incorrect module path of tensorize_vllm_model ( #13863 )
2025-02-25 22:56:19 -08:00
145944cb94
Improve pipeline partitioning ( #13839 )
2025-02-25 18:53:56 -08:00
094b7d9496
[Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutlass epilogues ( #13797 )
2025-02-25 18:52:03 -08:00
e1fe7591f2
[Misc] Code Cleanup ( #13859 )
...
Signed-off-by: noemotiovon <noemotiovon@gmail.com >
Co-authored-by: noemotiovon <noemotiovon@gmail.com >
2025-02-26 10:44:30 +08:00
5629f26df7
[V1][Spec Decode] Change Spec Decode Rejection Sampling API ( #13729 )
2025-02-25 18:14:48 -08:00
9ba28043b5
[misc] Show driver IP info when Ray fails to allocate driver worker ( #13858 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-02-26 09:53:43 +08:00
24679788ed
DeepSeek V2/V3/R1 only place lm_head on last pp rank ( #13833 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-26 01:24:57 +00:00
07c4353057
[Model] Support Grok1 ( #13795 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-26 01:07:12 +00:00
34e3494e70
Fix failing MyGemma2Embedding test ( #13820 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-25 12:33:03 -08:00
f75aa72732
[Neuron] Add custom_ops for neuron backend ( #13246 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
Co-authored-by: George Novack <gnovack@amazon.com >
Co-authored-by: Aoyu Zhang <aoyuzhan@amazon.com >
2025-02-25 11:47:49 -08:00
340e39e387
Fix string parsing error ( #13825 )
2025-02-25 08:20:29 -08:00
f4133ce4e5
[Bugfix] Revert inspection code in #13743 ( #13832 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-26 00:18:50 +08:00
6522d55b6f
Fix /v1/audio/transcriptions Bad Request Error ( #13811 )
2025-02-25 06:03:33 -08:00
6ff518626c
[Bugfix] Fix deepseek-vl2 inference with more than 2 images ( #13818 )
2025-02-25 06:03:02 -08:00
fa82074167
[Bugfix] Flush TunableOp results before worker processes are destroyed. ( #13623 )
...
Signed-off-by: Nichols A. Romero <nick.romero@amd.com >
2025-02-25 11:08:20 +00:00
75e9d49796
[Bugfix] Initialize attention bias on the same device as Query/Key/Value ( #13468 )
2025-02-25 02:13:09 -08:00
32c3b6bfd1
[Misc] Clarify Error Handling for Non-existent Model Paths and HF Repo IDs ( #13724 )
...
Signed-off-by: Chen-0210 <chenjincong11@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-02-25 10:12:19 +00:00
37b6cb4985
[CI/Build] Fix V1 LoRA failure ( #13767 )
2025-02-25 02:01:15 -08:00
aabeb2688f
[ROCm][Quantization][Kernel] Using HIP FP8 header ( #12593 )
2025-02-25 00:39:59 -08:00
2f42a4888c
[Feature] Support KV cache offloading and disagg prefill with LMCache connector. ( #12953 )
2025-02-25 00:38:42 -08:00
3173c3b34e
[misc] Clean up ray compiled graph type hints ( #13731 )
2025-02-25 00:37:08 -08:00
2d87d7d1ac
[Bugfix] Modify modelscope api usage in transformer_utils ( #13807 )
2025-02-25 00:36:07 -08:00
aab392774b
[Core] xgrammar: Expand list of unsupported jsonschema keywords ( #13783 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-25 08:21:25 +00:00
6724e79164
[Misc] Check that the model can be inspected upon registration ( #13743 )
2025-02-25 00:18:19 -08:00
03f48b3db6
[Core] LoRA V1 - Add add/pin/list/remove_lora functions ( #13705 )
2025-02-25 00:18:02 -08:00
4d251ad00e
Fix CompressedTensorsWNA16MoE with grouped scales ( #13769 )
2025-02-25 00:17:14 -08:00
18e505930d
[Bugfix] Support MLA for CompressedTensorsWNA16 ( #13725 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-25 06:10:31 +00:00
4a8cfc7551
[Bugfix] Fix deepseek-v2 error: "missing 1 required positional argument: 'residual'" ( #13802 )
2025-02-24 20:33:59 -08:00
bc32bc73aa
[V1][Metrics] Implement vllm:lora_requests_info metric ( #13504 )
2025-02-24 20:01:33 -08:00
ab1091d5f2
[Misc][Attention][Quantization] init property earlier ( #13733 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-02-25 03:19:30 +00:00
1e15aaef56
[Bugfix][Quantization] Fix FP8 + EP ( #13784 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-25 10:54:17 +08:00
51010a1807
[Misc] set single whitespace between log sentences ( #13771 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2025-02-25 10:26:12 +08:00
7196a3b1db
[Doc] arg_utils.py: fixed a typo ( #13785 )
2025-02-24 18:23:04 -08:00
cdc1fa12eb
Remove unused kwargs from model definitions ( #13555 )
2025-02-24 17:13:52 -08:00
f61528d46d
[Misc][Chore] Clean Up AsyncOutputProcessing Logs ( #13780 )
2025-02-24 16:39:07 -08:00
1f0ae3ed0a
[Misc] Clean Up EngineArgs.create_engine_config ( #13734 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-02-24 13:52:21 -05:00
db986c19ea
Fix precommit fail in fused_moe intermediate_cache2 chunking ( #13772 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-24 09:25:47 -08:00
227578480d
Revert "[V1][Core] Fix memory issue with logits & sampling" ( #13775 )
2025-02-24 09:16:05 -08:00
befc402d34
[V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) ( #10980 )
...
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-02-24 08:29:41 -08:00
444b0f0f62
[Misc][Docs] Raise error when flashinfer is not installed and VLLM_ATTENTION_BACKEND is set ( #12513 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-02-24 10:43:21 -05:00
ccc00515fd
[BugFix] Illegal memory access for MoE On H20 ( #13693 )
2025-02-24 07:37:32 -08:00
781096e385
Expert Parallelism (EP) Support for DeepSeek V2 ( #12583 )
2025-02-24 07:33:20 -08:00
7940d8a6a7
[CI/Build] add python-json-logger to requirements-common ( #12842 )
2025-02-24 06:10:33 -08:00
c0e3ecd6d2
[Bugfix] fix(logging): add missing opening square bracket ( #13011 )
2025-02-24 06:10:25 -08:00
23eca9cf68
[model][refactor] remove cuda hard code in models and layers ( #13658 )
2025-02-24 06:10:14 -08:00
437b76ff59
[V1][Core] Fix memory issue with logits & sampling ( #13721 )
2025-02-24 06:10:06 -08:00
f90a375593
[ci] Add logic to change model to S3 path only when S3 CI env var is on ( #13727 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-63-253.us-west-2.compute.internal >
2025-02-24 06:32:11 +00:00
e7ef74e26e
Fix some issues with benchmark data output ( #13641 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-02-24 10:23:18 +08:00
cbae7af552
[V1][BugFix] Fix engine core client shutdown hangs ( #13298 )
...
Although ZMQ context.destroy() is meant to close open sockets before terminating the context, it appears to be necessary to close them explicitly; otherwise the call can hang in context.term().
Close ZMQ sockets explicitly before terminating the context, make shutdown of client resources more robust, and shut down the engine core process before terminating the ZMQ context.
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-23 13:07:43 -08:00
eb24dc4a45
[v1] torchrun compatibility ( #13642 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-23 22:47:24 +08:00
9bebc9512f
[Misc] Deprecate --dataset from benchmark_serving.py ( #13708 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-02-23 13:32:20 +00:00
5a2ba16f5c
[Core][Distributed] Use IPC (domain socket) ZMQ socket for local comms ( #13688 )
2025-02-23 02:54:29 -08:00
ba5106e519
[LMM] Implement merged multimodal processor for whisper ( #13278 )
2025-02-23 01:46:03 -08:00
d5ca2110f1
[Quant] BaiChuan SupportsQuant ( #13710 )
2025-02-22 19:21:15 -08:00
2c5e637b57
[ci] Use env var to control whether to use S3 bucket in CI ( #13634 )
2025-02-22 19:19:45 -08:00
322d2a27d6
[BugFix] Minor: logger import in attention backend ( #13706 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2025-02-22 16:51:13 -08:00
82e0d601fc
[CI/Build] Fix pre-commit errors from #13571 ( #13709 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-02-22 16:50:38 -08:00
78ac0f591d
[CI/Build] fix uv caching in Dockerfile ( #13611 )
2025-02-22 08:25:20 -08:00
b56155e7f3
[XPU] fix setuptools version for xpu ( #13548 )
2025-02-22 08:05:35 -08:00
382f66fb08
[Bugfix] Fix boolean conversion for OpenVINO env variable ( #13615 )
2025-02-22 08:04:12 -08:00
8354f6640c
[Doc] Dockerfile instructions for optional dependencies and dev transformers ( #13699 )
2025-02-22 06:04:31 -08:00
c904fdddf6
[ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm ( #13231 )
2025-02-22 05:54:38 -08:00
558db8083c
[V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths ( #13095 )
2025-02-22 05:25:41 -08:00
e109e598c7
[NVIDIA] Support nvfp4 cutlass gemm ( #13571 )
2025-02-22 05:24:05 -08:00
8db1b9d0a1
Support SSL Key Rotation in HTTP Server ( #13495 )
2025-02-22 05:17:44 -08:00
2382ad29d1
[ci] fix linter ( #13701 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-22 20:28:59 +08:00
3e472d882a
[core] set up data parallel communication ( #13591 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-22 19:28:59 +08:00
7f6bae561c
[CI/Build] Fix pre-commit errors ( #13696 )
2025-02-22 00:31:26 -08:00
105b8ce4c0
[Misc] Reduce LoRA-related static variable ( #13166 )
2025-02-22 00:21:30 -08:00
2cb8c1540e
[Metrics] Add --show-hidden-metrics-for-version CLI arg ( #13295 )
2025-02-22 00:20:45 -08:00
1cd981da4f
[V1][Metrics] Support vllm:cache_config_info ( #13299 )
2025-02-22 00:20:00 -08:00
fca20841c2
Correction to TP logic for Mamba Mixer 2 when Num Groups not divisible by TP Size ( #13660 )
2025-02-22 00:19:10 -08:00
da31b5333e
[Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler ( #13594 )
...
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-02-22 00:08:29 -08:00
bb78fb318e
[v1] Support allowed_token_ids in v1 Sampler ( #13210 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-22 14:13:05 +08:00
8aca27fa11
[Bugfix] Fix benchmark script bug: inaccurate stats for vllm backend when max_model_len < input_len + output_len ( #13691 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-02-22 14:10:38 +08:00
95c617e04b
[Misc] Bump compressed-tensors ( #13619 )
2025-02-21 22:09:04 -08:00
9a1f1da5d1
[Bugfix][Model] OLMo 2: split qkv correctly for GQA and MQA ( #13687 )
2025-02-21 22:07:45 -08:00
68d630a0c7
[ROCM] fix native attention function call ( #13650 )
2025-02-21 22:07:04 -08:00
68d535ef44
[Misc] Capture and log the time of loading weights ( #13666 )
2025-02-21 22:06:34 -08:00
c6ed93860f
[Bugfix][API Server] Fix invalid usage of 'ge' and 'le' in port valid… ( #13672 )
2025-02-21 22:05:28 -08:00
0ffdf8ce0c
[HTTP Server] Make model param optional in request ( #13568 )
2025-02-21 21:55:50 -08:00
8c0dd3d4df
docs: Add a note on full CI run in contributing guide ( #13646 )
2025-02-21 21:53:59 -08:00
ada7c780d5
[Misc] Fix yapf linting tools etc not running on pre-commit ( #13695 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-02-22 13:10:43 +08:00
288cc6c234
[Attention] MLA with chunked prefill ( #12639 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Patrick Horn <patrick.horn@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-21 15:30:12 -08:00
900edbfa48
fix typo of grafana dashboard, with correct datasource ( #13668 )
...
Signed-off-by: John Zheng <john.zheng@hp.com >
2025-02-21 18:21:05 +00:00
b2c3fc5d65
[Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation ( #13586 )
2025-02-20 22:24:17 -08:00
839b27c6cc
[Kernel] Add streamK for block-quantized CUTLASS kernels ( #12978 )
2025-02-20 22:14:24 -08:00
34ad27fe83
[ci] Fix metrics test model path ( #13635 )
2025-02-20 22:12:10 -08:00
1c3c975766
[FEATURE] Enables /score endpoint for embedding models ( #12846 )
2025-02-20 22:09:47 -08:00
1cdc88614a
Missing comment explaining VDR variable in GGUF kernels ( #13290 )
2025-02-20 22:06:54 -08:00
31aa045c11
[V1][Sampler] Avoid an operation during temperature application ( #13587 )
2025-02-20 22:05:56 -08:00
a30c093502
[Bugfix] Add mm_processor_kwargs to chat-related protocols ( #13644 )
2025-02-20 22:04:33 -08:00
c7b07a95a6
Use pre-commit to update requirements-test.txt ( #13617 )
2025-02-20 22:03:27 -08:00
27a09dc52c
[NVIDIA] Fix an issue to use current stream for the nvfp4 quant ( #13632 )
2025-02-20 22:01:48 -08:00
981f3c831e
[Misc] Adding script to setup ray for multi-node vllm deployments ( #12913 )
2025-02-20 21:16:40 -08:00
44c33f01f3
Add llmaz as another integration ( #13643 )
...
Signed-off-by: kerthcet <kerthcet@gmail.com >
2025-02-21 03:52:40 +00:00
33170081f1
[Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth ( #13245 )
...
Signed-off-by: Lingfan Yu <lingfany@amazon.com >
2025-02-20 17:45:45 -08:00
71face8540
[Bugfix] Fix max_num_batched_tokens for MLA ( #13620 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-20 17:45:20 -08:00
bfbc0b32c6
[Frontend] Add backend-specific options for guided decoding ( #13505 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-02-20 15:07:58 -05:00
6a417b8600
fix neuron performance issue ( #13589 )
2025-02-20 10:59:36 -08:00
d3ea50113c
[V1][Minor] Print KV cache size in token counts ( #13596 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-20 09:24:31 -08:00
34aad515c8
Update pre-commit's isort version to remove warnings ( #13614 )
2025-02-20 08:00:14 -08:00
ed6e9075d3
[Bugfix] Fix deepseekv3 grouped topk error ( #13474 )
...
Signed-off-by: Chen-XiaoBing <chenxb002@whu.edu.cn >
2025-02-20 06:47:01 -08:00
992e5c3d34
Merge similar examples in offline_inference into single basic example ( #12737 )
2025-02-20 04:53:51 -08:00
b69692a2d8
[Kernel] LoRA - Refactor sgmv kernels ( #13110 )
2025-02-20 07:28:06 -05:00
a64a84433d
[2/n][ci] S3: Use full model path ( #13564 )
...
Signed-off-by: <>
2025-02-20 01:20:15 -08:00
aa1e62d0db
[ci] Fix spec decode test ( #13600 )
2025-02-20 16:56:00 +08:00
497bc83124
[CI/Build] Use uv in the Dockerfile ( #13566 )
2025-02-19 23:05:44 -08:00
3738e6fa80
[API Server] Add port number range validation ( #13506 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-20 15:05:13 +08:00
0023cd2b9d
[ROCm] MI300A compile targets deprecation ( #13560 )
2025-02-19 23:05:00 -08:00
041e294716
[Misc] add mm_processor_kwargs to extra_body for Qwen2.5-VL ( #13533 )
2025-02-19 23:04:30 -08:00
9621667874
[Misc] Warn if the vLLM version can't be retrieved ( #13501 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
2025-02-20 06:24:48 +00:00
8c755c3b6d
[bugfix] spec decode worker get tp group only when initialized ( #13578 )
2025-02-20 04:46:28 +00:00
ba81163997
[core] add sleep and wake up endpoint and v1 support ( #12987 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: cennn <2523403608@qq.com >
Co-authored-by: cennn <2523403608@qq.com >
2025-02-20 12:41:17 +08:00
0d243f2a54
[ROCm][MoE] mi300 mixtral8x7B perf for specific BS ( #13577 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-02-20 04:01:02 +00:00
88f6ba3281
[ci] Add AWS creds for AMD ( #13572 )
2025-02-20 03:56:06 +00:00
512368e34a
[Misc] Qwen2.5 VL support LoRA ( #13261 )
2025-02-19 18:37:55 -08:00
473f51cfd9
[3/n][CI] Load Quantization test models with S3 ( #13570 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-20 10:12:30 +08:00
a4c402a756
[BugFix] Avoid error traceback in logs when V1 LLM terminates ( #13565 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-20 00:49:01 +00:00
550d97eb58
[Misc] Avoid calling unnecessary hf_list_repo_files for local model path ( #13348 )
...
Signed-off-by: isotr0py <2037008807@qq.com >
2025-02-19 18:57:48 +00:00
fbbe1fbac6
[MISC] Logging the message about Ray teardown ( #13502 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com >
2025-02-19 09:40:50 -08:00
01c184b8f3
Fix copyright year to auto get current year ( #13561 )
2025-02-19 16:55:34 +00:00
ad5a35c21b
[doc] clarify multi-node serving doc ( #13558 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-19 22:32:17 +08:00
5ae9f26a5a
[Bugfix] Fix device ordinal for multi-node spec decode ( #13269 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-02-19 22:13:15 +08:00
377d10bd14
[VLM][Bugfix] Pass processor kwargs properly on init ( #13516 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-19 13:13:50 +00:00
52ce14d31f
[doc] clarify profiling is only for developers ( #13554 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-19 20:55:58 +08:00
81dabf24a8
[CI/Build] force writing version file ( #13544 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
2025-02-19 18:48:03 +08:00
423330263b
[Feature] Pluggable platform-specific scheduler ( #13161 )
...
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com >
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com >
2025-02-19 17:16:38 +08:00
caf7ff4456
[V1][Core] Generic mechanism for handling engine utility ( #13060 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-19 17:09:22 +08:00
f525c0be8b
[Model][Speculative Decoding] DeepSeek MTP spec decode ( #12755 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-02-19 17:06:23 +08:00
983a40a8bb
[Bugfix] Fix Positive Feature Layers in Llava Models ( #13514 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
2025-02-19 08:50:07 +00:00
fdc5df6f54
use device param in load_model method ( #13037 )
2025-02-19 16:05:02 +08:00
3b05cd4555
[perf-benchmark] Fix ECR path for premerge benchmark ( #13512 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-19 07:56:11 +00:00
d5d214ac7f
[1/n][CI] Load models in CI from S3 instead of HF ( #13205 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-19 07:34:59 +00:00
fd84857f64
[Doc] Add clarification note regarding paligemma ( #13511 )
2025-02-18 22:24:03 -08:00
8aada19dfc
[ROCm][MoE configs] mi325 mixtral & mi300 qwen_moe ( #13503 )
2025-02-18 22:23:24 -08:00
9aa95b0e6a
[perf-benchmark] Allow premerge ECR ( #13509 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-19 05:13:41 +00:00
d0a7a2769d
[Hardware][Gaudi][Feature] Support Contiguous Cache Fetch ( #12139 )
...
Signed-off-by: yuzhou <yuzhou@habana.ai >
Signed-off-by: zhouyu5 <yu.zhou@intel.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-18 19:40:19 -08:00
00b69c2d27
[Misc] Remove dangling references to --use-v2-block-manager ( #13492 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-19 03:37:26 +00:00
4c82229898
[V1][Spec Decode] Optimize N-gram matching with Numba ( #13365 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-18 13:19:58 -08:00
c8d70e2437
Pin Ray version to 2.40.0 ( #13490 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-18 12:50:31 -08:00
30172b4947
[V1] Optimize handling of sampling metadata and req_ids list ( #13244 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-18 12:15:33 -08:00
a4d577b379
[V1][Tests] Adding additional testing for multimodal models to V1 ( #13308 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
2025-02-18 09:53:14 -08:00
7b203b7694
[misc] fix debugging code ( #13487 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-18 09:37:11 -08:00
4fb8142a0e
[V1][PP] Enable true PP with Ray executor ( #13472 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-18 09:15:32 -08:00
a02c86b4dd
[CI/Build] migrate static project metadata from setup.py to pyproject.toml ( #8772 )
2025-02-18 08:02:49 -08:00
3809458456
[Bugfix] Fix invalid rotary embedding unit test ( #13431 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-02-18 11:52:03 +00:00
d3231cb436
[Bugfix] Handle content type with optional parameters ( #13383 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2025-02-18 11:29:13 +00:00
435b502a6e
[ROCm] Make amdsmi import optional for other platforms ( #13460 )
2025-02-18 03:15:56 -08:00
29fc5772c4
[Bugfix] Remove noisy error logging during local model loading ( #13458 )
2025-02-18 03:15:48 -08:00
2358ca527b
[Doc]: Improve feature tables ( #13224 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-18 18:52:39 +08:00
8cf97f8661
[Bugfix] Fix failing transformers dynamic module resolving with spawn multiproc method ( #13403 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-02-18 10:25:53 +00:00
e2603fefb8
[Bugfix] Ensure LoRA path from the request can be included in err msg ( #13450 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-18 16:19:15 +08:00
b53d79983c
Add outlines fallback when JSON schema has enum ( #13449 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-18 06:49:41 +00:00
9915912f7f
[V1][PP] Fix & Pin Ray version in requirements-cuda.txt ( #13436 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-17 21:58:06 -08:00
d1b649f1ef
[Quant] Aria SupportsQuant ( #13416 )
2025-02-17 21:51:09 -08:00
ac19b519ed
[core] fix sleep mode in pytorch 2.6 ( #13456 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-18 13:48:10 +08:00
a1074b3efe
[Bugfix] Only print out chat template when supplied ( #13444 )
2025-02-17 21:43:31 -08:00
00294e1bc6
[Quant] Arctic SupportsQuant ( #13366 )
2025-02-17 21:35:09 -08:00
88787bce1d
[Quant] Molmo SupportsQuant ( #13336 )
2025-02-17 21:34:47 -08:00
932b51cedd
[v1] fix parallel config rank ( #13445 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-18 12:33:45 +08:00
7c7adf81fc
[ROCm] fix get_device_name for rocm ( #13438 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-02-18 04:07:12 +00:00
67ef8f666a
[Model] Enable quantization support for transformers backend ( #12960 )
2025-02-17 19:52:47 -08:00
efbe854448
[Misc] Remove dangling references to SamplingType.BEAM ( #13402 )
2025-02-17 19:52:35 -08:00
b3942e157e
[Bugfix][CI][V1] Work around V1 + CUDA Graph + torch._scaled_mm fallback issue ( #13425 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-18 00:32:48 +00:00
cd4a72a28d
[V1][Spec decode] Move drafter to model runner ( #13363 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-17 15:40:12 -08:00
6ac485a953
[V1][PP] Fix intermediate tensor values ( #13417 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-17 13:37:45 -08:00
4c21ce9eba
[V1] Get input tokens from scheduler ( #13339 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-17 11:01:07 -08:00
ce77eb9410
[Bugfix] Fix VLLM_USE_MODELSCOPE issue ( #13384 )
2025-02-17 14:22:01 +00:00
30513d1cb6
[Bugfix] fix xpu communicator ( #13368 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2025-02-17 20:59:18 +08:00
1f69c4a892
[Model] Support Mamba2 (Codestral Mamba) ( #9292 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
2025-02-17 20:17:50 +08:00
7b623fca0b
[VLM] Check required fields before initializing field config in DictEmbeddingItems ( #13380 )
2025-02-17 01:36:07 -08:00
238dfc8ac3
[MISC] tiny fixes ( #13378 )
2025-02-17 00:57:13 -08:00
45186834a0
Run v1 benchmark and integrate with PyTorch OSS benchmark database ( #13068 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-02-17 08:16:32 +00:00
f857311d13
Fix spelling error in index.md ( #13369 )
2025-02-17 06:53:20 +00:00
46cdd59577
[Feature][Spec Decode] Simplify the use of Eagle Spec Decode ( #12304 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-02-16 19:32:26 -08:00
2010f04c17
[V1][Misc] Avoid unnecessary log output ( #13289 )
2025-02-16 19:26:24 -08:00
69e1d23e1e
[V1][BugFix] Clean up rejection sampler & Fix warning msg ( #13362 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-16 12:25:29 -08:00
d67cc21b78
[Bugfix][Platform][CPU] Fix cuda platform detection on CPU backend edge case ( #13358 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-02-16 18:55:27 +00:00
e18227b04a
[V1][PP] Cache Intermediate Tensors ( #13353 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-16 10:02:27 -08:00
7b89386553
[V1][BugFix] Add __init__.py to v1/spec_decode/ ( #13359 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-16 09:39:08 -08:00
da833b0aee
[Docs] Change myenv to vllm. Update python_env_setup.inc.md ( #13325 )
2025-02-16 16:04:21 +00:00
5d2965b7d7
[Bugfix] Fix 2 Node and Spec Decode tests ( #13341 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-16 22:20:22 +08:00
a0231b7c25
[platform] add base class for communicators ( #13208 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-16 22:14:22 +08:00
124776ebd5
[ci] skip failed tests for flashinfer ( #13352 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-16 22:09:15 +08:00
b7d309860e
[V1] Update doc and examples for H2O-VL ( #13349 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-02-16 10:35:54 +00:00
dc0f7ccf8b
[BugFix] Enhance test_pos_encoding to support execution on multi-devices ( #13187 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2025-02-16 08:59:49 +00:00
d3d547e057
[Bugfix] Pin xgrammar to 0.1.11 ( #13338 )
2025-02-15 19:42:25 -08:00
12913d17ba
[Quant] Add SupportsQuant to phi3 and clip ( #13104 )
2025-02-15 19:28:33 -08:00
80f63a3966
[V1][Spec Decode] Ngram Spec Decode ( #12193 )
...
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-02-15 18:05:11 -08:00
367cb8ce8c
[Doc] [2/N] Add Fuyu E2E example for multimodal processor ( #13331 )
2025-02-15 07:06:23 -08:00
54ed913f34
[ci/build] update flashinfer ( #13323 )
2025-02-15 05:33:13 -08:00
9206b3d7ec
[V1][PP] Run engine busy loop with batch queue ( #13064 )
2025-02-15 03:59:01 -08:00
ed0de3e4b8
[AMD] [Model] DeepSeek tunings ( #13199 )
2025-02-15 03:58:09 -08:00
2ad1bc7afe
[V1][Metrics] Add iteration_tokens_total histogram from V0 ( #13288 )
2025-02-15 03:56:19 -08:00
7fdaaf48ef
[Bugfix] Fix qwen2.5-vl image processor ( #13286 )
2025-02-15 03:00:11 -08:00
067fa2255b
[Bugfix]Fix search start_index of stop_checker ( #13280 )
2025-02-14 21:39:42 -08:00
9076325677
[BugFix] Don't scan entire cache dir when loading model ( #13302 )
2025-02-14 21:33:31 -08:00
97a3d6d995
[Bugfix] Massage MLA's usage of flash attn for ROCm ( #13310 )
2025-02-14 21:33:25 -08:00
579d7a63b2
[Bugfix][Docs] Fix offline Whisper ( #13274 )
2025-02-14 21:32:37 -08:00
c9f9d5b397
[Bugfix][AMD] Update torch_bindings so that scaled_fp4_quant isn't built on ROCm ( #13235 )
2025-02-14 20:30:42 -08:00
0c73026844
[V1][PP] Fix memory profiling in PP ( #13315 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-14 20:17:25 -08:00
6a854c7a2b
[V1][Sampler] Don't apply temp for greedy-only ( #13311 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-14 18:10:53 -08:00
e7eea5a520
[V1][CI] Fix failed v1-test because of min_p ( #13316 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-14 17:29:51 -08:00
a12934d3ec
[V1][Core] min_p sampling support ( #13191 )
...
Signed-off-by: Aoyu <aoyuzhan@amazon.com >
Co-authored-by: Aoyu <aoyuzhan@amazon.com >
2025-02-14 15:50:05 -08:00
3bcb8c75da
[Core] Reduce TTFT with concurrent partial prefills ( #10235 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-14 15:36:07 -08:00
5e5c8e091e
[Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts ( #13236 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-14 12:53:42 -08:00
c9e2d644e7
[Hardware][Gaudi][Bugfix] Fix error for guided decoding ( #12317 )
2025-02-14 04:36:49 -08:00
7734e9a291
[Core] choice-based structured output with xgrammar ( #12632 )
2025-02-14 04:36:05 -08:00
6224a9f620
Support logit_bias in v1 Sampler ( #13079 )
2025-02-14 04:34:59 -08:00
085b7b2d6c
[V1] Simplify GPUModelRunner._update_states check ( #13265 )
2025-02-14 04:33:43 -08:00
4da1f667e9
[VLM] Keep track of whether prompt replacements have been applied ( #13215 )
2025-02-14 04:20:46 -08:00
556ef7f714
[Misc] Log time consumption of sleep and wake-up ( #13115 )
...
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com >
2025-02-14 20:10:21 +08:00
83481ceb49
[Bugfix] Fix missing parentheses ( #13263 )
2025-02-14 01:07:10 -08:00
185cc19f92
[Frontend] Optionally remove memory buffer used for uploading to URLs in run_batch ( #12927 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2025-02-14 08:22:42 +00:00
45f90bcbba
[WIP] TPU V1 Support Refactored ( #13049 )
2025-02-14 00:21:53 -08:00
b0ccfc565a
[Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch ( #13126 )
2025-02-13 22:39:20 -08:00
ba59b78a9c
[ROCm][V1] Add initial ROCm support to V1 ( #12790 )
2025-02-13 22:21:50 -08:00
cbc40128eb
[V1] LoRA - Enable Serving Usecase ( #12883 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-14 14:21:12 +08:00
f0b2da72a8
Expand MLA to support most types of quantization ( #13181 )
2025-02-13 22:19:22 -08:00
f2b20fe491
Consolidate Llama model usage in tests ( #13094 )
2025-02-13 22:18:03 -08:00
40932d7a05
[Misc] Remove redundant statements in scheduler.py ( #13229 )
2025-02-13 22:07:25 -08:00
84683fa271
[Bugfix] Offline example of disaggregated prefill ( #13214 )
2025-02-13 20:20:47 -08:00
067678262a
[Bugfix][CI] Inherit codespell settings from pyproject.toml in the pre-commit-config ( #13237 )
2025-02-13 20:19:43 -08:00
09545c0a94
[Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on ( #13250 )
2025-02-13 20:19:25 -08:00
dd5ede4440
[V1] Consolidate MM cache size to vllm.envs ( #13239 )
2025-02-13 20:19:03 -08:00
8c32b08a86
[Kernel] Fix awq error when n is not divisible by 128 ( #13227 )
2025-02-13 20:07:05 -08:00
410886950a
[ROCm] Avoid using the default stream on ROCm ( #13238 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-02-14 09:29:26 +08:00
e38be640e6
Revert "Add label if pre-commit passes" ( #13242 )
2025-02-13 16:12:32 -08:00
c1e37bf71b
[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels ( #13198 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-14 00:01:14 +00:00
2344192a55
Optimize moe_align_block_size for deepseek_v3 ( #12850 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-13 18:43:37 -05:00
bffddd9a05
Add label if pre-commit passes ( #12527 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-13 20:51:30 +00:00
d84cef76eb
[Frontend] Add /v1/audio/transcriptions OpenAI API endpoint ( #12909 )
2025-02-13 07:23:45 -08:00
37dfa60037
[Bugfix] Missing Content Type returns 500 Internal Server Error ( #13193 )
2025-02-13 06:52:22 -08:00
1bc3b5e71b
[VLM] Separate text-only and vision variants of the same model architecture ( #13157 )
2025-02-13 06:19:15 -08:00
02ed8a1fbe
[Misc] Qwen2.5-VL Optimization ( #13155 )
2025-02-13 06:17:57 -08:00
2092a6fa7d
[V1][Core] Add worker_base for v1 worker ( #12816 )
...
Signed-off-by: Aoyu <aoyuzhan@amazon.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Aoyu <aoyuzhan@amazon.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-02-13 20:35:18 +08:00
c9d3ecf016
[VLM] Merged multi-modal processor for Molmo ( #12966 )
2025-02-13 04:34:00 -08:00
fdcf64d3c6
[V1] Clarify input processing and multimodal feature caching logic ( #13211 )
2025-02-13 03:43:24 -08:00
578087e56c
[Frontend] Pass pre-created socket to uvicorn ( #13113 )
2025-02-13 00:51:46 -08:00
fa253f1a70
[VLM] Remove input processor from clip and siglip ( #13165 )
2025-02-13 00:31:37 -08:00
9605c1256e
[V1][core] Implement pipeline parallel on Ray ( #12996 )
2025-02-13 08:02:46 +00:00
0ccd8769fb
[CI/Build] Allow ruff to auto-fix some issues ( #13180 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-13 07:45:38 +00:00
cb944d5818
Allow Unsloth Dynamic 4bit BnB quants to work ( #12974 )
2025-02-12 23:13:08 -08:00
d46d490c27
[Frontend] Move CLI code into vllm.cmd package ( #12971 )
2025-02-12 23:12:21 -08:00
04f50ad9d1
[Bugfix] deepseek_r1_reasoning_parser put reason content in wrong field in certain edge case ( #13097 )
2025-02-12 23:11:26 -08:00
60c68df6d1
[Build] Automatically use the wheel of the base commit with Python-only build ( #13178 )
2025-02-12 23:10:28 -08:00
009439caeb
Simplify logic of locating CUDART so file path ( #13203 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-13 13:52:41 +08:00
bc55d13070
[VLM] Implement merged multimodal processor for Mllama ( #11427 )
2025-02-12 20:26:21 -08:00
d88c8666a1
[Bugfix][Example] Fix GCed profiling server for TPU ( #12792 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-02-13 11:52:11 +08:00
4fc5c23bb6
[NVIDIA] Support nvfp4 quantization ( #12784 )
2025-02-12 19:51:51 -08:00
9f9704dca6
[perf-benchmark] cleanup unused Docker images and volumes in H100 benchmark instance ( #12706 )
2025-02-12 19:51:33 -08:00
8eafe5eaea
[CI/Build] Ignore ruff warning up007 ( #13182 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-13 11:48:31 +08:00
4c0d93f4b2
[V1][Bugfix] Copy encoder input ids to fix set iteration issue during VLM abort ( #13173 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
2025-02-12 12:58:11 -08:00
14b7899d10
[CI] Fix failing FP8 cpu offload test ( #13170 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-12 19:16:06 +00:00
09972e716c
[Bugfix] Allow fallback to AWQ from AWQMarlin at per-layer granularity ( #13119 )
2025-02-12 09:19:53 -08:00
36a08630e8
[CORE] [QUANT] Support for GPTQModel's dynamic quantization per module override/control ( #7086 )
2025-02-12 09:19:43 -08:00
2c2b560f48
[CI/Build] Use mypy matcher for pre-commit CI job ( #13162 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-12 17:12:22 +00:00
042c3419fa
Introduce VLLM_CUDART_SO_PATH to allow users specify the .so path ( #12998 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-12 09:06:13 -08:00
82cabf53a3
[Misc] Delete unused LoRA modules ( #13151 )
2025-02-12 08:58:24 -08:00
314cfade02
[Frontend] Generate valid tool call IDs when using tokenizer-mode=mistral ( #12332 )
2025-02-12 08:29:56 -08:00
985b4a2b19
[Bugfix] Fix num video tokens calculation for Qwen2-VL ( #13148 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-12 11:55:23 +00:00
f4d97e4fc2
[Bug] [V1] Try fetching stop_reason from EngineOutput before checking the request ( #13108 )
2025-02-12 02:39:16 -08:00
f1042e86f0
[Misc] AMD Build Improvements ( #12923 )
2025-02-12 02:36:10 -08:00
7c4033acd4
Further reduce the HTTP calls to huggingface.co ( #13107 )
2025-02-12 02:34:09 -08:00
d59def4730
Bump actions/setup-python from 5.3.0 to 5.4.0 ( #12672 )
2025-02-12 16:41:22 +08:00
0c7d9effce
Bump helm/chart-testing-action from 2.6.1 to 2.7.0 ( #12463 )
2025-02-12 16:41:06 +08:00
dd3b4a01f8
Bump actions/stale from 9.0.0 to 9.1.0 ( #12462 )
2025-02-12 00:40:25 -08:00
a0597c6b75
Bump helm/kind-action from 1.10.0 to 1.12.0 ( #11612 )
2025-02-12 00:40:19 -08:00
e92694b6fe
[Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency ( #12921 )
...
Signed-off-by: Lingfan Yu <lingfany@amazon.com >
2025-02-11 21:12:37 -08:00
842b0fd402
[ci] Add more source file dependencies for some tests ( #13123 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-11 20:38:10 -08:00
974dfd4971
[Model] IBM/NASA Prithvi Geospatial model ( #12830 )
2025-02-11 20:34:30 -08:00
3ee696a63d
[RFC][vllm-API] Support tokenizer registry for customized tokenizer in vLLM ( #12518 )
...
Signed-off-by: Keyun Tong <tongkeyun@gmail.com >
2025-02-12 12:25:58 +08:00
72c2b68dc9
[Misc] Move pre-commit suggestion back to the end ( #13114 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-11 22:34:16 +00:00
14ecab5be2
[Bugfix] Guided decoding falls back to outlines when fails to import xgrammar ( #12976 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-11 18:17:44 +00:00
deb6c1c6b4
[Doc] Improve OpenVINO installation doc ( #13102 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-11 18:02:46 +00:00
565c1efa65
[CI/Build][Bugfix] Fix CPU backend default threads num ( #13077 )
2025-02-11 16:55:56 +00:00
2b25b7d2e1
Fix initializing GGUF weights for ColumnParallelLinear when using tensor parallel > 1 ( #13023 )
2025-02-11 08:38:48 -08:00
6c4dbe23eb
[BugFix] Pop instead of del CUDA_VISIBLE_DEVICES ( #12962 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2025-02-12 00:21:50 +08:00
21f5d50fa5
[Bugfix] Do not use resource module on Windows ( #12858 ) ( #13029 )
2025-02-11 08:21:18 -08:00
bf3e05215c
[Misc] Fix typo at comments at metrics.py ( #13024 )
2025-02-11 08:20:37 -08:00
ad9776353e
Set torch_dtype in TransformersModel ( #13088 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-11 23:51:19 +08:00
75e6e14516
[V1][Metrics] Add several request timing histograms ( #12644 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-02-11 10:14:00 -05:00
110f59a33e
[Bugfix] fix flaky test ( #13089 )
...
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
2025-02-11 14:41:20 +00:00
2e3b969ec0
[Platform] add pre_register_and_update function ( #12432 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-02-11 22:06:46 +08:00
da317197dd
[Build] Fix cuda link target of cumem_allocator in CPU env ( #12863 )
...
Signed-off-by: YuhongGuo <yuhong.gyh@antgroup.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-11 21:55:57 +08:00
7539bbc6a6
[ROCm] Using a more precise memory profiling ( #12624 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-02-11 21:47:10 +08:00
9cf4759493
[executor] init local_rank as device index ( #13027 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-02-11 21:20:53 +08:00
41c5dd45b9
[V1][Metrics] Add GPU prefix cache hit rate % gauge ( #12592 )
2025-02-11 08:27:25 +00:00
fc6485d277
[Bugfix]: Reasoning output bug according to the chat template change ( #13025 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
2025-02-11 15:49:03 +08:00
78a141d768
[Misc] LoRA - Refactor Punica ops tests ( #12970 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-11 07:26:03 +00:00
c320ca8edd
[Core] Don't do platform detection at import time ( #12933 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-11 07:25:25 +00:00
58047c6f04
[Benchmark] Add BurstGPT to benchmark_serving ( #13063 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-02-10 21:25:30 -08:00
cb080f32e3
[Bugfix] Support missing tool parameters in mistral tokenizer ( #12884 )
...
Signed-off-by: Florian Greinacher <florian.greinacher@siemens.com >
2025-02-11 03:33:33 +00:00
2c0f58203c
[Docs] Announce Meta Meetup ( #13065 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-02-10 18:24:29 -08:00
2ff4857678
[V1][Minor] Move scheduler outputs to a separate file ( #13062 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-11 02:10:06 +00:00
91e876750e
[misc] Fix setup.py condition to avoid AMD from being mistaken with CPU ( #13022 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2025-02-10 18:06:16 -08:00
08b2d845d6
[Model] Ultravox Model: Support v0.5 Release ( #12912 )
...
Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai >
2025-02-10 22:02:48 +00:00
2ae889052c
Fix seed parameter behavior in vLLM ( #13007 )
...
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
2025-02-10 23:26:50 +08:00
51f0b5f7f6
[Bugfix] Clean up and fix multi-modal processors ( #13012 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-10 10:45:21 +00:00
fde71262e0
[misc] Add retries with exponential backoff for HF file existence check ( #13008 )
2025-02-10 01:15:02 -08:00
243137143c
[Doc] Add link to tool_choice tracking issue in tool_calling.md ( #13003 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-10 06:09:33 +00:00
b2496bb07f
[core] fix sleep mode and pytorch checkpoint compatibility ( #13001 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-10 13:03:43 +08:00
44607e07d3
Check if selected backend is None in get_attn_backend_cls() ( #12975 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-10 11:45:07 +08:00
67c4637ccf
[V1] Use msgpack for core request serialization ( #12918 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-10 11:35:56 +08:00
aa0ca5ebb7
[core][rlhf] add colocate example for RLHF ( #12984 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-10 10:28:59 +08:00
59fff4a01a
[core] improve error handling when wake up from sleep mode ( #12981 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-10 09:38:57 +08:00
29f1d47e73
[MISC] Always import version library first in the vllm package ( #12979 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-09 18:56:40 +08:00
cf797aa856
[core] port pynvml into vllm codebase ( #12963 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-09 15:00:00 +08:00
24700c346b
[V1] Cache uses_mrope in GPUModelRunner ( #12969 )
2025-02-08 15:32:32 -08:00
d366ccc4e3
[RFC] [Mistral] FP8 format ( #10130 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-02-08 14:12:53 -07:00
870c37481e
[V1][Minor] Remove outdated comment ( #12968 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-08 12:48:30 -08:00
86222a3dab
[VLM] Merged multi-modal processor for GLM4V ( #12449 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-02-08 20:32:16 +00:00
fe743b798d
[bugfix] fix early import of flash attention ( #12959 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-09 00:06:56 +08:00
913df14da3
[Bugfix] Remove unused seq_group_metadata_list from ModelInputForGPU ( #12935 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-02-08 14:46:19 +00:00
8a69e0e20e
[CI/Build] Auto-fix Markdown files ( #12941 )
2025-02-08 04:25:15 -08:00
4c8dd12ef3
[Misc] Add qwen2.5-vl BNB support ( #12944 )
2025-02-08 04:24:47 -08:00
256a2d29dc
[Doc] Correct HF repository for TeleChat2 models ( #12949 )
2025-02-08 01:42:15 -08:00
c45d398e6f
[CI] Resolve transformers-neuronx version conflict ( #12925 )
2025-02-08 01:41:35 -08:00
011e612d92
[Misc] Log time consumption on weight downloading ( #12926 )
2025-02-08 09:16:42 +00:00
7e1837676a
[misc] Add LoRA to benchmark_serving ( #12898 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-08 17:15:44 +08:00
2880e21e3d
[Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel Gaudi ( #12812 )
...
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai >
2025-02-08 17:15:30 +08:00
407b5537db
[Build] Make pypi install work on CPU platform ( #12874 )
2025-02-08 01:15:15 -08:00
4ea48fb35c
[V1][Minor] Move cascade attn logic outside _prepare_inputs ( #12943 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-08 00:39:09 -08:00
e31498bdcb
[Misc] Add offline test for disaggregated prefill ( #12418 )
2025-02-08 08:38:20 +00:00
91dd8f7aa6
[bugfix] respect distributed_executor_backend in world_size=1 ( #12934 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-08 16:17:08 +08:00
d01f66b039
[Bugfix] Fix multi-round chat error when mistral tokenizer is used ( #12859 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-02-08 07:04:34 +00:00
cc01223f3b
[Misc] Fix typo in the example file ( #12896 )
...
Signed-off-by: Zhao Ke <yingxiongraomingzk@gmail.com >
2025-02-08 06:56:43 +00:00
306923da82
[Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mapping ( #12905 )
2025-02-07 21:02:53 -08:00
3243158336
[V1] Move KV block hashes from Request to KVCacheManager ( #12922 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-07 19:14:10 -08:00
b21f0f9d17
[V1][Minor] Remove outdated comment ( #12928 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-07 19:07:37 -08:00
45cbc4991d
[Bugfix] Fix disagg hang caused by the prefill and decode communication issues ( #12723 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-07 16:39:50 -08:00
932c6b7461
[V1] LM Eval With Streaming Integration Tests ( #11590 )
2025-02-07 15:07:03 -08:00
eaa92d4437
[ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing ( #12501 )
2025-02-07 08:13:43 -08:00
0630d4537a
[V1] Logprobs and prompt logprobs support ( #9880 )
...
This PR adds support for sample logprobs and prompt logprobs to vLLM v1.
New behavior:
- During model execution, model runner computes sample logprobs (if user-provided logprobs setting is not None) and prompt logprobs (if user-provided prompt_logprobs setting is not None). For both sample and prompt logprobs, the engine core returns 3 vectors: token ids, token logprob values, token ranks. Ranks reflect tokens' 1-indexed positions in the vocabulary vector after sorting the vocabulary by log probability in descending order.
- In scheduler.update_from_output(), sample and prompt logprobs are incorporated into the EngineCoreOutput data structure which is transferred to the engine client. If multiprocessing is enabled, then sample and prompt logprobs will be (de)serialized when the EngineCoreOutput data structure is (de)serialized.
- During output processing, the LogprobsProcessor transforms the triplet of token ids, token logprob values, and token ranks into the OpenAI-compatible List[Dict[token id, Logprob]] format (for sample and prompt logprobs, respectively).
- Each Logprob instance (whether sample or prompt) consists of a token's log-probability, rank, and detokenized string representation. Note that logprob detokenization is handled by the LogprobsProcessor, not the detokenizer.
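The flow described in the bullets above can be sketched as follows. This is a minimal illustration under stated assumptions, not vLLM's implementation: the `Logprob` dataclass layout, the helpers `log_softmax`, `token_rank`, and `to_openai_format`, the tie-breaking rule, and the toy four-token vocabulary are all hypothetical and chosen only to show how a (token ids, logprob values, ranks) triplet could become a `List[Dict[token id, Logprob]]`.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Logprob:
    # Hypothetical stand-in for the per-token record described above.
    logprob: float
    rank: Optional[int] = None
    decoded_token: Optional[str] = None


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary vector."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def token_rank(logprobs: List[float], token_id: int) -> int:
    """1-indexed position of token_id after sorting by logprob, descending.

    Assumption: ties share the best rank (only strictly greater values count).
    """
    target = logprobs[token_id]
    return 1 + sum(1 for value in logprobs if value > target)


# One position's triplet: (token ids, token logprob values, token ranks).
Triplet = Tuple[List[int], List[float], List[int]]


def to_openai_format(triplets: List[Triplet],
                     vocab: List[str]) -> List[Dict[int, Logprob]]:
    """Turn per-position triplets into List[Dict[token id, Logprob]].

    Detokenization happens here (a plain vocabulary lookup in this toy),
    mirroring the note that the logprobs processor handles it.
    """
    result = []
    for token_ids, values, ranks in triplets:
        result.append({
            tid: Logprob(logprob=lp, rank=rank, decoded_token=vocab[tid])
            for tid, lp, rank in zip(token_ids, values, ranks)
        })
    return result


# Toy example: one decoding step over a four-token vocabulary.
vocab = ["a", "b", "c", "d"]
step_logprobs = log_softmax([1.0, 3.0, 2.0, 0.0])

sampled_id = 2  # pretend the sampler picked "c"
top1_id = max(range(len(vocab)), key=lambda i: step_logprobs[i])

ids = [sampled_id, top1_id]
triplet = (
    ids,
    [step_logprobs[i] for i in ids],
    [token_rank(step_logprobs, i) for i in ids],
)
sample_logprobs = to_openai_format([triplet], vocab)
```

Keeping the three vectors separate until output processing means the structure that crosses the process boundary stays a flat set of lists, and the per-token dictionaries are only built on the client side.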
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-02-07 07:26:20 -08:00
538fab93cd
PR #12718 ( #12718 )
2025-02-07 06:22:37 -08:00
ce26b16268
[Misc] Remove unnecessary detokenization in multimodal processing ( #12868 )
2025-02-07 06:21:17 -08:00
1918aa1b80
[MISC][EASY] Break check file names into entry and args in the pre-commit hooks ( #12880 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-07 13:04:39 +00:00
6e1fc61f0f
Prevent unnecessary requests to huggingface hub ( #12837 )
2025-02-06 21:37:41 -08:00
aa375dca9f
[Bugfix] Missing quant_config in deepseek embedding layer ( #12836 )
2025-02-06 21:35:09 -08:00
433c4a4923
Make vllm compatible with verl ( #12824 )
...
Co-authored-by: zhangshulai <zhangshulai@bytedance.com >
2025-02-07 11:54:20 +08:00
ef533d25fb
[Bugfix] FA2 illegal memory access ( #12848 )
2025-02-06 19:54:07 -08:00
b260782357
[misc] Revert #12833 ( #12857 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-06 16:29:12 -08:00
741429a4cd
[MISC] Check space in the file names in the pre commit checks ( #12804 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-06 15:36:21 -08:00
aff404571b
Add Bamba Model ( #10909 )
...
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-06 15:22:42 -08:00
467a96a541
[V1] LoRA Support ( #10957 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-06 09:32:51 -08:00
8108ac841d
[Bugfix] Fix unsupported FA version check for Turing GPU ( #12828 )
2025-02-06 09:18:22 -08:00
afe74f7a96
[Doc] double quote cmake package in build.inc.md ( #12840 )
2025-02-06 09:17:55 -08:00
09b95e36ab
[torch.compile] PyTorch 2.6 and nightly compatibility ( #12393 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-07 01:09:07 +08:00
85ac82d228
[Kernel] Make rotary_embedding ops more flexible with input shape ( #12777 )
2025-02-06 08:46:13 -08:00
1e57b1ee63
[Misc] Remove unnecessary decode call ( #12833 )
2025-02-06 08:45:44 -08:00
e152f29502
[misc] Reduce number of config file requests to HuggingFace ( #12797 )
...
Signed-off-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-06 14:59:18 +00:00
c786e757fa
[Attention] Use FA3 for MLA on Hopper ( #12807 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-02-06 11:43:12 +00:00
cefd56ee35
[Docs] Add Google Cloud Slides ( #12814 )
2025-02-06 01:02:38 -08:00
7ca9934fe7
[Misc] Update w2 scale loading for GPTQMarlinMoE ( #12757 )
2025-02-06 01:02:14 -08:00
0408efc6d0
[Misc] Improve error message for incorrect pynvml ( #12809 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-06 15:23:50 +08:00
449d1bce02
[Misc] Remove duplicated DeepSeek V2/V3 model definition ( #12793 )
2025-02-05 23:16:20 -08:00
1a6fcad4c9
Improve TransformersModel UX ( #12785 )
2025-02-05 22:24:57 -08:00
56534cd577
[Bugfix] Fix the test_ultravox.py's license ( #12806 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-06 13:25:54 +08:00
d88506dda4
[Model] LoRA Support for Ultravox model ( #11253 )
2025-02-05 19:54:13 -08:00
9cdea30b4f
[Misc][Easy] Remove the space from the file name
2025-02-05 19:23:35 -08:00
76abd0c881
[Bugfix] Better FP8 supported defaults
2025-02-05 19:22:19 -08:00
5b19b93082
[ROCm][Kernel] Using the correct warp_size value
2025-02-05 19:15:08 -08:00
75404d041b
[VLM] Update compatibility with transformers 4.49
2025-02-05 19:09:45 -08:00
bf3b79efb8
[VLM] Qwen2.5-VL
2025-02-05 13:31:38 -08:00
9a5b1554b4
[Docs] Drop duplicate [source] links
2025-02-05 13:30:50 -08:00
a4ce74c14a
[VLM] Use shared field to pass token ids to model
2025-02-05 13:30:46 -08:00
3b2005e1db
Add: Support for Sparse24Bitmask Compressed Models
2025-02-05 13:30:43 -08:00
af8486de49
[Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU)
2025-02-05 13:29:45 -08:00
4c3aac51e1
Merging PR #12536
...
Merged via CLI script
2025-02-05 13:24:26 -08:00
bc1bdecebf
[core][distributed] exact ray placement control ( #12732 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-06 02:03:19 +08:00
022bcc701a
[Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 ( #12546 )
2025-02-04 23:11:02 -08:00
c53dc466b1
[Doc] Remove performance warning for auto_awq.md ( #12743 )
2025-02-04 22:43:11 -08:00
3d09e592a8
[V1][Misc] Shorten FinishReason enum and use constant strings ( #12760 )
2025-02-04 22:43:02 -08:00
fcf2e3d7fc
[Bugfix] Fix OpenVINO model runner ( #12750 )
2025-02-04 22:42:46 -08:00
58b218d7ae
[Doc] Update PR Reminder with link to Developer Slack ( #12748 )
2025-02-04 22:42:09 -08:00
7ff7a638b6
[Model][Quant] Fix GLM, Fix fused module mappings for quantization ( #12634 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-02-05 05:32:06 +00:00
686006a220
[Misc] Bump the compressed-tensors version ( #12736 )
2025-02-04 20:44:48 -08:00
98fd089fc9
[VLM] Add MLA with pure RoPE support for deepseek-vl2 models ( #12729 )
2025-02-04 20:44:26 -08:00
249824c3bf
Refactor Linear handling in TransformersModel ( #12727 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-05 04:31:12 +00:00
64862d106e
[ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling ( #12713 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2025-02-05 03:58:22 +00:00
b3a0d01e45
[Core] add and implement VLLM_LOGITS_PROCESSOR_THREADS ( #12368 )
...
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com >
2025-02-04 18:46:26 -08:00
75e94309e8
[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) ( #12676 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-02-04 18:22:24 -08:00
233df6f5c4
[V1][Metrics] Add request_success_total counter, labelled with finish reason ( #12579 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-02-04 19:46:54 -05:00
18016a5e62
[Bugfix] Fix CI failures for InternVL and Mantis models ( #12728 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-04 23:54:23 +08:00
649550f27e
[Build] update requirements of no-device for plugin usage ( #12630 )
...
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com >
2025-02-04 21:19:12 +08:00
62467a834a
Avoid unnecessary multi-modal input data copy when len(batch) == 1 ( #12722 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2025-02-04 21:03:19 +08:00
6469038b14
[Bugfix] Fix loading of fine-tuned models based on Phi-3-Small ( #12689 )
...
Signed-off-by: Michael Greenbaum <mgreenbaum@microsoft.com >
Co-authored-by: Michael Greenbaum <mgreenbaum@microsoft.com >
2025-02-04 20:58:48 +08:00
815079de8e
[VLM] merged multimodal processor and V1 support for idefics3 ( #12660 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-02-04 20:00:51 +08:00
18a88fcccc
[V1] Remove scheduling constraint on partial requests ( #12674 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-04 02:43:58 -08:00
d1ca7df84d
[VLM] Merged multi-modal processor for InternVL-based models ( #12553 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-02-04 16:44:52 +08:00
96b23621c1
[Misc] Add BNB quantization for Whisper ( #12381 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-02-04 16:27:36 +08:00
c36ac98d01
[AMD][ROCm] Enable DeepSeek model on ROCm ( #12662 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
2025-02-04 08:24:11 +00:00
4896d0c2dd
[Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs ( #12711 )
2025-02-03 23:27:11 -08:00
bb392af434
[Doc] Replace ibm-fms with ibm-ai-platform ( #12709 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-02-04 07:05:04 +00:00
5d98d56089
Support Pixtral-Large HF by using llava multimodal_projector_bias config ( #12710 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-02-04 11:55:46 +08:00
73b35cca7f
[Core] Improve hash collision avoidance in prefix caching ( #12621 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-03 16:28:20 -08:00
5095e96606
[V1] Revert uncache_blocks and support recaching full blocks ( #12415 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-03 15:04:53 -08:00
cf58b9c4ca
[MISC] Remove model input dumping when exception ( #12582 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-03 13:34:16 -08:00
4797dad3ec
[Model] Add Deepseek V3 fp8_w8a8 configs for B200 ( #12707 )
2025-02-03 13:30:39 -08:00
6dd5e52823
Squelch MLA warning for Compressed-Tensors Models ( #12704 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-02-03 13:29:56 -08:00
c11de33dad
[Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm ( #12696 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-03 13:04:59 -08:00
33e0602e59
[Misc] Fix improper placement of SPDX header in scripts ( #12694 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-03 11:16:59 -08:00
a1a2aaadb9
[Model]: Add transformers backend support ( #11330 )
...
# Adds support for `transformers` as a backend
Following https://github.com/huggingface/transformers/pull/35235 , a
bunch of models should already be supported; we are ramping up support
for more models.
Thanks @Isotr0py for the TP support, and @hmellor for his help as well!
This includes:
- `trust_remote_code=True` support: any model on the hub, if it
implements attention the correct way, can be natively supported!!
- tensor parallel support
---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-02-03 21:30:38 +08:00
1298a400e8
[ci/build] fix gh200 test ( #12681 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 15:59:49 +08:00
ad4a9dc817
[cuda] manually import the correct pynvml module ( #12679 )
...
fixes problems like https://github.com/vllm-project/vllm/pull/12635 and
https://github.com/vllm-project/vllm/pull/12636 and
https://github.com/vllm-project/vllm/pull/12565
---------
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 15:58:21 +08:00
b9986454fe
Fix for attention layers to remain unquantized during moe_wn16 quant ( #12570 )
...
Fix to AWQ quant loading of the new R1 model.
The new optimized MoE kernels for a large number of experts (`moe_wn16`)
use AWQ quant, which requires the attention layers to be in 16-bit.
The current merge has broken this, and `get_quant_method` must
return None for it to work correctly again.
---------
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Beim <beim2015@outlook.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com >
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Ryan N <ryan.nguyen@centml.ai >
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Shawn Du <shawnd200@outlook.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Beim <805908499@qq.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com >
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com >
Co-authored-by: fade_away <1028552010@qq.com >
Co-authored-by: weilong.yu <weilong.yu@shopee.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com >
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: Shawn Du <shawnd200@outlook.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-02-03 13:46:19 +08:00
c5932e5dac
Properly check if all fused layers are in the list of targets ( #12666 )
...
Thanks @kylesayrs for catching this!
2025-02-03 13:42:18 +08:00
20579c0fae
make sure mistral_common not imported for non-mistral models ( #12669 )
...
When people use deepseek models, they find that they need to resolve a cv2
version conflict; see https://zhuanlan.zhihu.com/p/21064432691 .
I added the check and made all imports of `cv2` lazy.
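A lazy import of the kind described above can be sketched as follows. This is only an illustration, not the code from this PR: `to_hsv` is a hypothetical helper, and the standard-library `colorsys` module stands in for `cv2` so the snippet runs without OpenCV installed.

```python
def to_hsv(r, g, b):
    """Convert an RGB triple to HSV, importing the heavy dependency on demand."""
    # Deferred import: the module is loaded only when this function is called,
    # so code paths that never call it never pay for (or conflict with) it.
    import colorsys  # stand-in for `cv2` in this sketch

    return colorsys.rgb_to_hsv(r, g, b)
```

Code that never calls `to_hsv` never triggers the import, which is the property the commit relies on for non-mistral models.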
---------
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 13:40:25 +08:00
95460fc513
[Kernel] port sgl moe_align_block_size kernels ( #12574 )
...
sgl_moe_align_block_size is based on:
ded9fcd09a
moe_align_block_size is based on:
ba5112ff69
Signed-off-by: Yang Chen <yangche@fb.com >
2025-02-03 13:09:50 +08:00
326fcc8b9f
[Doc] Deprecate Discord ( #12668 )
2025-02-02 19:19:56 -08:00
e64330910b
[doc][misc] clarify VLLM_HOST_IP for multi-node inference ( #12667 )
...
As more and more people are trying deepseek models with multi-node
inference, https://github.com/vllm-project/vllm/issues/7815 becomes more
frequent. Let's give users a clear message.
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 09:32:18 +08:00
e489ad7a21
[Misc] Add SPDX-License-Identifier headers to python source files ( #12628 )
...
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**
commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:18:24 2025 -0500
Add SPDX license headers to python source files
This commit adds SPDX license headers to python source files as recommended to
the project by the Linux Foundation. These headers provide a concise way that is
both human and machine readable for communicating license information for each
source file. It helps avoid any ambiguity about the license of the code and can
also be easily used by tools to help manage license compliance.
The Linux Foundation runs license scans against the codebase to help ensure
we are in compliance with the licenses of the code we use, including
dependencies. Having these headers in place helps that tool do its job.
More information can be found on the SPDX site:
- https://spdx.dev/learn/handling-license-info/
Signed-off-by: Russell Bryant <rbryant@redhat.com >
commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:36:32 2025 -0500
Check for SPDX headers using pre-commit
Signed-off-by: Russell Bryant <rbryant@redhat.com >
---------
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-02 11:58:18 -08:00
f256ebe4df
[Hardware][Intel GPU] add XPU bf16 support ( #12392 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-02-02 10:17:26 +00:00
f8ece6e17f
[Core][v1] Unify allocating slots in prefill and decode in KV cache manager ( #12608 )
...
As mentioned in RFC https://github.com/vllm-project/vllm/issues/12254 ,
this PR combines allocate_slots and append_slots.
There should be no functionality change, except that decode now also
raises an exception when num_tokens is zero (as prefill does); the
unit test case is changed accordingly.
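A unified entry point of this kind might look like the sketch below. It is a toy model under stated assumptions, not the vLLM KV cache manager: `KVCacheManagerSketch` and its bookkeeping dictionaries are hypothetical, and only the "same call for prefill and decode, reject zero tokens" behaviour comes from the description above.

```python
import math


class KVCacheManagerSketch:
    """Toy manager that only counts blocks; all names are illustrative."""

    def __init__(self, block_size, num_free_blocks):
        self.block_size = block_size
        self.num_free_blocks = num_free_blocks
        self.req_blocks = {}  # request id -> number of blocks held
        self.req_tokens = {}  # request id -> number of tokens with slots

    def allocate_slots(self, req_id, num_tokens):
        """Single code path for both prefill (new request) and decode."""
        if num_tokens == 0:
            raise ValueError("num_tokens must be greater than 0")
        held = self.req_blocks.get(req_id, 0)
        total_tokens = self.req_tokens.get(req_id, 0) + num_tokens
        # Blocks needed to cover all tokens, minus the blocks already held.
        needed = math.ceil(total_tokens / self.block_size) - held
        if needed > self.num_free_blocks:
            return None  # not enough free blocks; the caller decides what to do
        self.num_free_blocks -= needed
        self.req_blocks[req_id] = held + needed
        self.req_tokens[req_id] = total_tokens
        return needed
```

With a block size of 4, a 6-token prefill takes two blocks and the next single-token decode step takes none, since the second block still has room.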
@comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo
---------
Signed-off-by: Shawn Du <shawnd200@outlook.com >
2025-02-02 16:40:58 +08:00
abfcdcdf27
[V1][Minor] Avoid frequently creating ConstantList ( #12653 )
...
A small optimization to avoid creating a new `ConstantList` every time `request.kv_block_hashes` is used.
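The idea can be sketched as building the read-only wrapper once and reusing it. The classes below are hypothetical stand-ins written for this illustration; they are not vLLM's `ConstantList` or request class.

```python
class ConstantList:
    """Minimal read-only view over a list (stand-in for the real class)."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, index):
        return self._data[index]

    def __len__(self):
        return len(self._data)


class RequestSketch:
    def __init__(self):
        self._kv_block_hashes = []
        # Build the wrapper once. It shares the underlying list, so hashes
        # appended later are still visible through the same view object.
        self.kv_block_hashes = ConstantList(self._kv_block_hashes)

    def append_kv_block_hash(self, block_hash):
        self._kv_block_hashes.append(block_hash)
```

Every read of `request.kv_block_hashes` then returns the same object instead of allocating a new wrapper.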
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-01 23:43:20 -08:00
e497f33491
[Core] Silence unnecessary deprecation warnings ( #12620 )
...
I noticed during testing that I was getting a lot of these deprecation
warnings about `lora_local_path`:
```
DeprecationWarning: The 'lora_local_path' attribute is deprecated
and will be removed in a future version.
Please use 'lora_path' instead.
```
The check used for emitting this warning was always True, even when the
parameter was not actually specified. It will always be in
`__struct_fields__`. We should be checking for a non-None value,
instead.
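The difference between the two checks can be sketched as follows. This is an illustration under assumptions: a standard-library dataclass stands in for the real struct (whose declared fields are listed in `__struct_fields__`), and `LoRARequestSketch`, `should_warn_buggy`, and `should_warn_fixed` are hypothetical names.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class LoRARequestSketch:
    lora_path: str = ""
    lora_local_path: Optional[str] = None  # deprecated field


def should_warn_buggy(req):
    # A declared field is always among the field names, so this is always
    # True, whether or not the caller passed a value for it.
    return "lora_local_path" in {f.name for f in dataclasses.fields(req)}


def should_warn_fixed(req):
    # Warn only when the deprecated field was actually given a value.
    return req.lora_local_path is not None
```

A request created with only `lora_path` trips the first check but not the second.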
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-02 15:35:50 +08:00
baaa2b24da
[Bugfix] fix moe_wna16 get_quant_method ( #12648 )
...
Fix https://github.com/vllm-project/vllm/issues/12647
The `get_quant_method` of `moe_wna16` always returns the MoE method, a
GPTQ-based linear method, or an AWQ-based linear method, even when the
target module is an attention layer.
baeded2569/vllm/attention/layer.py (L86-L92)
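The intended dispatch can be sketched like this. Everything here is hypothetical: the empty layer classes and the string return values are placeholders for illustration, and the actual method in vLLM takes different arguments and returns method objects.

```python
class FusedMoE:
    """Placeholder for a mixture-of-experts layer."""


class LinearBase:
    """Placeholder for a linear layer."""


class Attention:
    """Placeholder for an attention layer."""


class MoeWNA16ConfigSketch:
    def get_quant_method(self, layer):
        if isinstance(layer, FusedMoE):
            return "moe_method"
        if isinstance(layer, LinearBase):
            return "linear_method"
        # Attention (and any other layer type) must stay unquantized,
        # which is signalled by returning None.
        return None
```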
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-02-02 15:29:56 +08:00
b4e5c03306
doc: fixing minor typo in readme.md ( #12643 )
...
Word "evolved" was mistyped
---------
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
2025-02-01 17:17:29 +00:00
3194039c0e
Apply torch.compile to fused_moe/grouped_topk ( #12637 )
2025-02-01 16:16:19 +00:00
4f4d427ac2
Disable chunked prefill and/or prefix caching when MLA is enabled ( #12642 )
...
From @mgoin in https://github.com/vllm-project/vllm/pull/12638
I cannot push to that branch, so this is a new PR to unblock the release.
---------
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-31 23:46:57 -08:00
1e3698393f
[CI/Build] Add label automation for structured-output, speculative-decoding, v1 ( #12280 )
...
We have `v1`, `structured-output`, and `speculative-decoding` labels on
GitHub. This adds automation for applying these labels based on the
files touched by a PR.
---------
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-31 23:13:10 -08:00
baeded2569
[Attention] Deepseek v3 MLA support with FP8 compute ( #12601 )
...
This PR implements Deepseek V3 support by performing matrix absorption on the fp8 weights
---------
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
2025-01-31 21:52:51 -08:00
3e1c76cf3a
Fix: Respect sparsity_config.ignore in Cutlass Integration ( #12517 )
...
This PR addresses a bug in the Cutlass integration where the
`sparsity_config.ignore` list was not being respected. When only a
subset of modules were configured as Sparse24, the system incorrectly
selected Cutlass for non-sparse modules as well. This update ensures the
correct scheme is selected for non-sparse modules, fixing this behavior.
---
### Changes
- Updated logic to correctly respect `sparsity_config.ignore`.
- Ensured non-sparse modules use the appropriate scheme instead of
defaulting to Cutlass.
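The corrected selection can be sketched as below. This is a simplified illustration, not the code from the PR: `choose_scheme` is a hypothetical helper, ignore entries are matched as exact names or dotted prefixes only (an assumption), and the scheme names are borrowed from the example output in this message.

```python
def choose_scheme(layer_name, sparse_ignore, fallback_scheme):
    """Return the scheme name for one layer (illustrative names only)."""
    ignored = any(
        layer_name == entry or layer_name.startswith(entry + ".")
        for entry in sparse_ignore
    )
    if ignored:
        # Listed in sparsity_config.ignore: not a Sparse24 module, so use
        # whatever scheme the quantization config calls for instead.
        return fallback_scheme
    return "CompressedTensors24"
```

For example, with `sparse_ignore=["model.layers.6"]`, layer 6 projections get the fallback scheme while layer 0 projections stay on `CompressedTensors24`.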
---
<details>
<summary>Testing Setup</summary>
The fix has been tested on top of [this
diff](https://github.com/vllm-project/vllm/pull/12097 ).
#### Steps to Test:
```bash
git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support
git revert --no-edit aa2cd2c # revert Tyler's commit to turn off Cutlass for W16A16
git cherry-pick ca624cddb # this branch
```
#### Additional Patch Required:
```diff
diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
index a54177c1c..f916dd0c9 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs,
QuantizationStrategy,
QuantizationType)
from pydantic import BaseModel
-
+from vllm.logger import init_logger
from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
UnquantizedLinearMethod)
@@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
should_ignore_layer)
from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
from vllm.platforms import current_platform
-
+logger = init_logger(__name__)
__all__ = ["CompressedTensorsLinearMethod"]
SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config"
```
Apply using:
```bash
git apply logging-patch.patch
```
</details>
---
<details>
<summary>Models Tested</summary>
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed`
</details>
---
<details>
<summary>Example Output</summary>
#### Layers 0-5 (Sparse24)
```
Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj
Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj
...
```
#### Layers 6+ (Non-Sparse, FP8)
```
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj
...
```
</details>
**Note:** Assumed all modules in fused layers such as `QKV_proj` and
`Gate_up_proj` follow the same quantization/pruning scheme.
---
For related tasks using the Asana app for GitHub, refer to [this
link](https://app.asana.com/0/0/1209227810815160 ).
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com >
2025-02-01 13:41:59 +08:00
cfa134d247
[Bugfix/CI] Fixup benchmark_moe.py ( #12562 )
...
Fixes `is_marlin` not being passed into `get_default_config`
Also allow `--tensor-parallel-size` in addition to `-tp` and `--tp-size`
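Accepting several spellings of one option is a standard `argparse` feature. The sketch below shows the general pattern only; `build_parser`, the `dest` name, and the default value are assumptions and may differ from the benchmark script.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # All three option strings write to the same destination.
    parser.add_argument(
        "--tp-size", "-tp", "--tensor-parallel-size",
        dest="tp_size", type=int, default=1,
    )
    return parser
```

Calling `build_parser().parse_args(["--tensor-parallel-size", "4"])` and `build_parser().parse_args(["-tp", "4"])` both set `tp_size` to 4.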
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-01 13:41:35 +08:00
35b7a05507
[ci] Upgrade transformers to 4.48.2 in CI dependencies ( #12599 )
2025-01-31 21:22:23 -08:00
1867c258bd
Fix target matching for fused layers with compressed-tensors ( #12617 )
...
Without this PR
---------------
Quantizing models with llm-compressor and a recipe that explicitly lists
names of layers produces a model that is not loadable by vLLM (i.e.
`vllm serve <model>` fails with `raise ValueError(f"Unable to find
matching target for {module} in the ...`).
Example recipe:
```
recipe = """
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
          targets: [
            "model.layers.0.mlp.down_proj",
            "model.layers.2.mlp.down_proj",
            "model.layers.3.mlp.down_proj",
            "model.layers.4.mlp.down_proj",
            "model.layers.5.mlp.down_proj",
            "model.layers.6.mlp.down_proj",
            "model.layers.7.mlp.down_proj",
            "model.layers.8.mlp.down_proj",
            "model.layers.9.mlp.down_proj",
            "model.layers.10.mlp.down_proj",
            "model.layers.11.mlp.down_proj",
            "model.layers.12.mlp.down_proj",
            "model.layers.13.mlp.down_proj",
            "model.layers.14.mlp.down_proj",
            "model.layers.15.mlp.down_proj",
            "model.layers.16.mlp.down_proj",
            "model.layers.17.mlp.down_proj",
            "model.layers.19.mlp.down_proj",
            "model.layers.21.mlp.down_proj",
            "model.layers.22.mlp.down_proj",
            .
            .
            .
          ]
"""
```
To reproduce the vLLM error:
```bash
vllm serve nm-testing/eldar-test
```
With this PR
------------
Models are loaded correctly without any errors.
2025-02-01 05:07:46 +00:00
cb3e73e4c8
[BugFix] fix wrong output when using lora and num_scheduler_steps=8 ( #11161 )
...
FIX issue https://github.com/vllm-project/vllm/issues/9688
https://github.com/vllm-project/vllm/issues/11086 #12487
---------
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: weilong.yu <weilong.yu@shopee.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-02-01 12:52:07 +08:00
b1340f9d55
[V1] Bugfix: Validate Model Input Length ( #12600 )
...
SUMMARY:
* avoid crashing the engine when we get an input longer than
max_model_len
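A check of this kind can be sketched as follows. The function name and error message are hypothetical; only the idea of rejecting an over-long input up front, so that the failure is confined to one request, comes from the summary above.

```python
def validate_prompt_length(prompt_token_ids, max_model_len):
    """Reject a request whose prompt is longer than the model can handle."""
    if len(prompt_token_ids) > max_model_len:
        raise ValueError(
            f"Prompt length of {len(prompt_token_ids)} is longer than the "
            f"maximum model length of {max_model_len}."
        )
```

Raising a `ValueError` at request validation time lets the server return an error for that request instead of bringing down the engine.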
FIX #12567
2025-01-31 18:32:04 -08:00
44bbca78d7
[Doc] int4 w4a16 example ( #12585 )
...
Based on a request by @mgoin , with @kylesayrs we have added an example
doc for int4 w4a16 quantization, following the pre-existing int8 w8a8
quantization example and the example available in
[`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py )
FIX #n/a (no issue created)
@kylesayrs and I have discussed a couple additional improvements for the
quantization docs. We will revisit at a later date, possibly including:
- A section for "choosing the correct quantization scheme/compression
technique"
- Additional vision or audio calibration datasets
---------
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-31 15:38:48 -08:00
60808bd4c7
[Doc] Improve installation signposting ( #12575 )
...
- Make device tab names more explicit
- Add comprehensive list of devices to
https://docs.vllm.ai/en/latest/getting_started/installation/index.html
- Add `attention` blocks to the intro of all devices that don't have
pre-built wheels/images
---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-31 15:38:35 -08:00
fc542144c4
[Feature] Fix guided decoding blocking bitmask memcpy ( #12563 )
...
**[Guided decoding performance optimization]** Sending the guided
decoding bitmask in xgrammar to the GPU
(`self.token_bitmask.to(scores.device)`) is a blocking operation that
prevents the CPU from pre-launching the sampler kernels. The CPU waits
until decode is complete, then copies the bitmask over. This PR makes
the copy asynchronous by setting `non_blocking=True`.
(Current) The CPU is blocked on a `cudaStreamSynchronize` and only
pre-empts the sampling kernels after bitmask application. Below is the
Nsys profile for one decode phase from Llama 3.1 8B.

With the optimization, this is no longer the case:

---------
Signed-off-by: Ryan N <ryan.nguyen@centml.ai >
2025-01-31 15:37:30 -08:00
eb5741ad42
[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 ( #12587 )
...
Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear
layers.
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-31 15:29:11 -08:00
145c2ff648
[Bugfix] Revert MoE Triton Config Default ( #12629 )
...
SUMMARY:
* previous PR for pulling in block configs also changed defaults
(https://github.com/vllm-project/vllm/pull/11589/files ) for FP8
* this broke L4 MoE since there was not enough SHM for the default
configuration
* this reverts the non-block example to the default
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-01-31 15:28:47 -08:00
415f19474d
[release] Add input step to ask for Release version ( #12631 )
...
Adds an input step that asks for the release version, instead of having
to create a new build with the release version put in as an env var.
2025-01-31 13:39:36 -08:00
89003c4082
[v1][Bugfix] Add extra_keys to block_hash for prefix caching ( #12603 )
...
This PR adds extra keys to the block hash, so that two blocks with the
same token string but different extra_keys in their parent blocks get
different hash values. For example, it generates different hash values
for the second block of the following two requests:
```python
request1 = make_request(
    request_id=0,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash1", "hash2"],
)
request2 = make_request(
    request_id=1,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash3", "hash2"],
)
```
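Why the second block ends up with a different hash can be sketched with a chained hash. This is an illustration under assumptions, not the vLLM implementation: `hash_block` and `hash_request` are hypothetical, the block size is fixed at 3 so that each multi-modal item fills exactly one block, and SHA-256 over a `repr` is used purely for determinism.

```python
import hashlib

BLOCK_SIZE = 3  # assumed, so each mm item in the example covers one block


def hash_block(parent_hash, token_ids, extra_keys):
    """Hash a block from its parent's hash, its tokens, and its extra keys."""
    payload = repr((parent_hash, tuple(token_ids), tuple(extra_keys)))
    return hashlib.sha256(payload.encode()).hexdigest()


def hash_request(token_ids, mm_hashes):
    """Return the chained block hashes for one request."""
    hashes, parent = [], None
    for start in range(0, len(token_ids), BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        extra = (mm_hashes[start // BLOCK_SIZE],)
        parent = hash_block(parent, block, extra)
        hashes.append(parent)
    return hashes
```

For the two requests above, the first blocks differ because `"hash1"` and `"hash3"` differ, and the second blocks differ as well even though both carry `"hash2"`, because each one folds in its parent's hash.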
---------
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-31 13:13:04 -08:00
60bcef000e
[Docs][V1] Prefix caching design ( #12598 )
...
- Create v1 design document section in docs.
- Add prefix caching design doc.
@WoosukKwon @ywang96
---------
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-01-31 12:30:46 -08:00
847f883232
[Git] Automatically sign-off commits ( #12595 )
...
It's very annoying when I forget to add `-s` in `git commit` to
sign off, because I then need to `git rebase HEAD~1 --signoff` and
`git push -f` to fix the DCO. This PR adds a hook that signs off commits
automatically when `-s` is missing. The only change on the user side is
that users now have to install 2 hooks, so
instead of just
```
pre-commit install
```
Now we need to
```
pre-commit install --hook-type pre-commit --hook-type commit-msg
```
Note that even if users still only install the pre-commit hook, they
won't get any error in `git commit`; the sign-off hook just won't run.
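The core of such a hook can be sketched in a few lines. This is not the hook added by this PR: `ensure_signoff` is a hypothetical helper, and a real `commit-msg` hook would read the message from the file path that git passes as its first argument and write the result back.

```python
def ensure_signoff(message, name, email):
    """Append a Signed-off-by trailer unless this exact one is present."""
    trailer = f"Signed-off-by: {name} <{email}>"
    if trailer in message.splitlines():
        return message  # already signed off (for example via `git commit -s`)
    # Trailers live in the last paragraph, so separate with a blank line.
    return message.rstrip("\n") + "\n\n" + trailer + "\n"
```

Because the check is idempotent, running the hook on a commit that was already created with `-s` leaves the message unchanged.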
cc @hmellor @youkaichao
---------
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-01-31 12:30:33 -08:00
325f679f32
[BugFix] Fix Torch.Compile For DeepSeek ( #12594 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-31 12:06:39 -08:00
e3f7ff65e7
Add favicon to docs ( #12611 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-31 09:20:34 -08:00
7a8987dac5
[Bugfix] Gracefully handle huggingface hub http error ( #12571 )
2025-01-31 08:19:35 +00:00
cabaf4eff3
[Attention] MLA decode optimizations ( #12528 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-30 23:49:37 -08:00
a1fc18c030
[ROCm][AMD][Model] llama 3.2 support upstreaming ( #12421 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2025-01-31 12:24:28 +08:00
9798b2fb00
[Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling ( #11868 )
2025-01-30 18:33:00 -08:00
4078052f09
[V1][Log] Add max request concurrency log to V1 ( #12569 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-30 23:07:19 +00:00
bd2107e30a
[CPU][PPC] Updated torch, torchvision, torchaudio dependencies ( #12555 )
...
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com >
2025-01-30 16:29:39 -05:00
9b0c4bab36
[Kernel] Triton Configs for Fp8 Block Quantization ( #11589 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-30 11:53:22 -08:00
41bf5612f5
[Misc] fix typo: add missing space in lora adapter error message ( #12564 )
...
Signed-off-by: Beim <beim2015@outlook.com >
2025-01-30 15:39:22 +00:00
a2769032ca
Set ?device={device} when changing tab in installation guides ( #12560 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-30 00:05:42 -08:00
f17f1d4608
[V1][Metrics] Add GPU cache usage % gauge ( #12561 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-29 18:31:01 -08:00
1c1bb0bbf2
[Misc][MoE] add Deepseek-V3 moe tuning support ( #12558 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-01-30 00:47:30 +00:00
e0cc5f259a
[V1][BugFix] Free encoder cache for aborted requests ( #12545 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-29 13:47:33 -08:00
73aa6cfdf7
Revert "[Build/CI] Fix libcuda.so linkage" ( #12552 )
2025-01-29 21:12:24 +00:00
27b78c73ca
[Kernel] add triton fused moe kernel for gptq/awq ( #12185 )
2025-01-29 09:07:09 -05:00
b02fd288b2
[Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. ( #11787 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-29 01:46:12 -08:00
ff7424f491
[Frontend] Support override generation config in args ( #12409 )
...
Signed-off-by: liuyanyi <wolfsonliu@163.com >
2025-01-29 01:41:01 -08:00
d93bf4da85
[Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM ( #12069 )
...
Signed-off-by: hzh <hezhihui_thu@163.com >
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
Signed-off-by: Akshat Tripathi <akshat@krai.ai >
Signed-off-by: Oleg Mosalov <oleg@krai.ai >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
Signed-off-by: Chenguang Li <757486878@qq.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Shanshan Shen <467638484@qq.com >
Signed-off-by: elijah <f1renze.142857@gmail.com >
Signed-off-by: Yikun <yikunkero@gmail.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com >
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: sixgod <evethwillbeok@outlook.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com >
Co-authored-by: Oleg Mosalov <oleg@krai.ai >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com >
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com >
Co-authored-by: Concurrensee <yida.wu@amd.com >
Co-authored-by: Chenguang Li <757486878@qq.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Alex Brooks <alex.brooks@ibm.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Shanshan Shen <467638484@qq.com >
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com >
Co-authored-by: Yikun Jiang <yikunkero@gmail.com >
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Konrad Zawora <kzawora@habana.ai >
Co-authored-by: TJian <tunjian1996@gmail.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com >
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com >
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com >
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-01-29 09:24:59 +00:00
036ca94c25
[Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense ( #12347 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: Wallas Santos <wallashss@ibm.com >
2025-01-29 08:54:35 +00:00
ef001d98ef
Fix the pydantic logging validator ( #12420 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-01-29 07:53:13 +00:00
5f671cb4c3
[V1] Improve Error Message for Unsupported Config ( #12535 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-29 04:56:56 +00:00
bd02164cf9
Bugfix for whisper quantization due to fake k_proj bias ( #12524 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-29 04:49:03 +00:00
46fb056749
[V1][Metrics] Add TTFT and TPOT histograms ( #12530 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-29 04:11:16 +00:00
dd6a3a02cb
[Doc] Convert docs to use colon fences ( #12471 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-29 11:38:29 +08:00
a7e3eba66f
[Frontend] Support reasoning content for deepseek r1 ( #12473 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
2025-01-29 11:38:08 +08:00
fbb5bd4cef
[TPU] Add example for profiling TPU inference ( #12531 )
...
Signed-off-by: mgoin <mgoin@redhat.com >
2025-01-29 03:16:47 +00:00
80fcc3ed1c
[Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels ( #12482 )
...
Signed-off-by: Fenghui Zhang <fhzhang@google.com >
2025-01-28 22:36:44 +00:00
c386c43ca3
[V1][Metrics] Add per-request prompt/generation_tokens histograms ( #12516 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-28 22:07:22 +00:00
f26d790718
Do not run suggestion pre-commit hook multiple times ( #12521 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-28 20:05:27 +00:00
0f657bdc52
Replace missed warning_once for rerank API ( #12472 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-28 19:06:32 +00:00
3fd1fb63ef
[V1][Metrics] Hook up IterationStats for Prometheus metrics ( #12478 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-28 16:38:38 +00:00
925d2f1908
[Doc] Fix typo for x86 CPU installation ( #12514 )
...
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com >
2025-01-28 16:37:10 +00:00
8f58a51358
[VLM] Merged multi-modal processor and V1 support for Qwen-VL ( #12504 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-28 16:25:05 +00:00
2079e43bee
[Core] Make raw_request optional in ServingCompletion ( #12503 )
...
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com >
2025-01-28 10:56:45 +00:00
e29d4358ef
[V1] Include Engine Version in Logs ( #12496 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-01-28 08:27:41 +00:00
8cbc424975
Update README.md with V1 alpha release ( #12495 )
2025-01-28 08:22:41 +00:00
dd66fd2b01
[CI] fix pre-commit error ( #12494 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-01-28 06:11:05 +00:00
0f465ab533
[FEATURE] Enables offline /score for embedding models ( #12021 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com >
2025-01-28 11:30:13 +08:00
23a7cbc88b
[CI/Build] Fixed the xla nightly issue report in #12451 ( #12453 )
2025-01-28 11:18:07 +08:00
426a5c3625
Fix bad path in prometheus example ( #12481 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-27 18:56:31 -07:00
ddee88d0ff
[Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache ( #11277 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
Co-authored-by: Jiangfei Duan <jfduan@outlook.com >
2025-01-27 17:31:16 -08:00
823ab79633
Update pre-commit hooks ( #12475 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-27 17:23:08 -07:00
6116ca8cd7
[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill ( #10132 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: wallashss <wallashss@ibm.com >
Co-authored-by: wallashss <wallashss@ibm.com >
2025-01-27 13:38:35 -08:00
2bc3fbba0c
[FlashInfer] Upgrade to 0.2.0 ( #11194 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-01-27 18:19:24 +00:00
3f1fc7425a
[V1][CI/Test] Do basic test for top-p & top-k sampling ( #12469 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-27 09:40:04 -08:00
01ba927040
[V1][Metrics] Add initial Prometheus logger ( #12416 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-27 12:26:28 -05:00
103bd17ac5
[Build] Only build 9.0a for scaled_mm and sparse kernels ( #12339 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-27 10:40:00 -05:00
ce69f7f754
[Bugfix] Fix gpt2 GGUF inference ( #12467 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-27 18:31:49 +08:00
624a1e4711
[V1][Minor] Minor optimizations for update_from_output ( #12454 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-27 01:09:27 -08:00
372bf0890b
[Bugfix] Fix missing seq_start_loc in xformers prefill metadata ( #12464 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-27 07:25:30 +00:00
5204ff5c3f
[Bugfix] Fix Granite 3.0 MoE model loading ( #12446 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-26 21:26:44 -08:00
0cc6b383d7
[Frontend] Support scores endpoint in run_batch ( #12430 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2025-01-27 04:30:17 +00:00
28e0750847
[V1] Avoid list creation in input preparation ( #12457 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-26 19:57:56 -08:00
582cf78798
[DOC] Add link to vLLM blog ( #12460 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-27 03:46:19 +00:00
0034b09ceb
[Frontend] Rerank API (Jina- and Cohere-compatible API) ( #12376 )
...
Signed-off-by: Kyle Mistele <kyle@mistele.com >
2025-01-26 19:58:45 -07:00
72bac73067
[Build/CI] Fix libcuda.so linkage ( #12424 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-26 21:18:19 +00:00
68f11149d8
[Bugfix][Kernel] Fix perf regression caused by PR #12405 ( #12434 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-26 11:09:34 -08:00
72f4880425
[Bugfix/CI] Fix broken kernels/test_mha.py ( #12450 )
2025-01-26 10:39:03 -08:00
aa2cd2c43d
[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 ( #12417 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-26 19:59:58 +08:00
9ddc35220b
[Frontend] generation_config.json for maximum tokens ( #12242 )
...
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com >
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Co-authored-by: shangmingc <caishangming@linux.alibaba.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-26 19:59:25 +08:00
a5255270c3
[Misc] Revert FA on ViT #12355 and #12435 ( #12445 )
2025-01-26 03:56:34 -08:00
0ee349b553
[V1][Bugfix] Fix assertion when mm hashing is turned off ( #12439 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-26 00:47:42 -08:00
fa63e710c7
[V1][Perf] Reduce scheduling overhead in model runner after cuda sync ( #12094 )
...
Signed-off-by: Keyun Tong <tongkeyun@gmail.com >
2025-01-26 00:42:37 -08:00
2a0309a646
[Misc][Bugfix] FA3 support to ViT MHA layer ( #12435 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-26 05:00:31 +00:00
324960a95c
[TPU][CI] Update torchxla version in requirement-tpu.txt ( #12422 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-01-25 07:23:03 +00:00
f1fc0510df
[Misc] Add FA2 support to ViT MHA layer ( #12355 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-25 15:07:35 +08:00
bf21481dde
[ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 ( #12408 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-01-25 12:17:19 +08:00
fb30ee92ee
[Bugfix] Fix BLIP-2 processing ( #12412 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-25 11:42:42 +08:00
221d388cc5
[Bugfix][Kernel] Fix moe align block issue for mixtral ( #12413 )
2025-01-25 01:49:28 +00:00
3132a933b6
[Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). ( #12405 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-24 20:20:59 +00:00
df5dafaa5b
[Misc] Remove deprecated code ( #12383 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-24 14:45:20 -05:00
ab5bbf5ae3
[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build ( #12375 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-24 15:27:59 +00:00
3bb8e2c9a2
[Misc] Enable proxy support in benchmark script ( #12356 )
...
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp >
2025-01-24 14:58:26 +00:00
e784c6b998
[ci/build] sync default value for wheel size ( #12398 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 17:54:29 +08:00
9a0f3bdbe5
[Hardware][Gaudi][Doc] Add missing step in setup instructions ( #12382 )
2025-01-24 09:43:49 +00:00
c7c9851036
[ci/build] fix wheel size check ( #12396 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 17:31:25 +08:00
3c818bdb42
[Misc] Use VisionArena Dataset for VLM Benchmarking ( #12389 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-24 00:22:04 -08:00
6dd94dbe94
[perf] fix perf regression from #12253 ( #12380 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 11:34:27 +08:00
0e74d797ce
[V1] Increase default batch size for H100/H200 ( #12369 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-24 03:19:55 +00:00
55ef66edf4
Update compressed-tensors version ( #12367 )
2025-01-24 11:19:42 +08:00
5e5630a478
[Bugfix] Path join when building local path for S3 clone ( #12353 )
...
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai >
2025-01-24 11:06:07 +08:00
d3d6bb13fb
Set weights_only=True when using torch.load() ( #12366 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-24 02:17:30 +00:00