b8a93076d3
[CI] execute all piecewise compilation tests together ( #24502 )
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-09 11:05:25 -07:00
c3f9773b2c
[TPU] Fix tpu structured decoding in mixed batches ( #24458 )
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-09-09 11:04:25 -07:00
3707cb2505
[Docs] Gemma3n transcriptions endpoint support ( #24512 )
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-09 11:03:32 -07:00
920ed46b09
[Misc] bump outlines_core to fix the version conflicts with outlines >= 1.2.0 ( #24368 )
Signed-off-by: Kazuhiro Serizawa <nserihiro@gmail.com >
Signed-off-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-09 10:59:46 -07:00
15cb047e25
Extend renderer with embedding support and integrate completion endpoint ( #24405 )
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-10 01:46:46 +08:00
9ad0688e43
[Bugfix] Fix hidden_size for multimodal classification model ( #24501 )
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-09 10:37:25 -07:00
b9a1c4c8a2
[ROCm][CI/Build] Sync ROCm dockerfiles with the ROCm fork ( #24279 )
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-09 12:21:56 -04:00
1aa427fdc1
[Kernels] Add Flash Linear Attention Kernels ( #24518 )
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-10 00:04:41 +08:00
1c63a16b65
[Core] Run garbage collector after CUDA graph capture to fix throughput regression ( #24128 )
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-09 10:38:10 -04:00
922d3b401b
[Bugfix] Handle the edge case in detokenizer where processed tokens contain both stop str and eos token ( #23938 )
Signed-off-by: dtransposed <damian.bogunowicz@gmail.com >
2025-09-09 07:30:24 -07:00
19332c0479
[Model] Systematic support for fp32 head, pooling models part ( #23810 )
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-09 07:29:50 -07:00
a55cf41a09
[Compilation][WideEP] Enable Piecewise CUDAGraph for DeepEPHT ( #24123 )
2025-09-09 10:21:10 -04:00
6fb2788163
[CI/Build][Doc] Fully deprecate old bench scripts for serving / throughput / latency ( #24411 )
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-09 10:02:35 +00:00
3d2a2de8f7
[RL] fast weight update with zmq + ipc handles ( #24295 )
Signed-off-by: huangweixiao <huangweixiao@msh.team >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-09-09 16:57:46 +08:00
1116590b16
[gpt-oss] Validate gpt-oss python tool during initialization ( #23856 )
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-09 08:37:48 +00:00
ccb97338af
[Misc] Add Codex settings to gitignore ( #24493 )
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-09-09 01:25:44 -07:00
45c9cb5835
[Misc] Add claude settings to gitignore ( #24492 )
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-09 01:14:45 -07:00
e283976f3a
[Performance][MM] Building the inverse permutation in O(n) time in Qwen2_5_VisionTransformer ( #24443 )
Signed-off-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: Junhong <liujunhong11@huawei.com >
2025-09-09 00:24:11 -07:00
46876dff32
[Doc]: fixing typos to improve docs ( #24480 )
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-08 23:06:04 -07:00
1823a00d67
[Misc] Support bench serve long context ( #24373 )
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-09-08 22:53:10 -07:00
ed16d0f26f
[Doc] mention fpdb for multiprocess breakpoints ( #24452 )
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
2025-09-08 21:46:45 -07:00
0cdd213641
[Misc] Improve Worker process title and logging prefix ( #22205 )
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-08 21:43:48 -07:00
948dd3443b
[Bugfix] Fix Apertus HF repo name ( #24447 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-08 21:40:29 -07:00
b2f7745774
Add data_parallel_size to VllmConfig string representation ( #24298 )
Co-authored-by: Cong Chen <congc@meta.com >
2025-09-08 21:35:18 -07:00
82dfb12e52
[Core] Use sha256 bytes instead of BlockHash to reduce GC overhead ( #23673 )
Signed-off-by: linzebing <linzebing1995@gmail.com >
2025-09-08 21:34:37 -07:00
bba1042c6f
[Flashinfer] Support Flashinfer TRTLLM FP8-qkv BF16/FP16-out Attention Kernel ( #23647 )
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-08 20:53:07 -07:00
b6fbc15634
[BugFix][Model] Fix Ernie4.5-VL hanging on long inputs ( #24074 )
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2025-09-09 11:37:16 +08:00
3e0d4a3475
Move KVTransferConfig from config/__init__.py to config/kv_transfer.py ( #24434 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 20:30:32 -07:00
562663a044
Bump actions/github-script from 7.0.1 to 8.0.0 ( #24413 )
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-09 03:12:44 +00:00
ed1623a88a
Bump actions/stale from 9.1.0 to 10.0.0 ( #24412 )
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-09 03:11:20 +00:00
13b89bd823
[doc] update vllm serve cli args documentation ( #24329 )
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2025-09-09 03:07:58 +00:00
22a0070530
Bump actions/setup-python from 5.4.0 to 6.0.0 ( #24414 )
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-09 02:54:58 +00:00
170129eb28
[gpt-oss] Harmony changes with container tool support ( #23386 )
Signed-off-by: zhiweiz <zhiweiz@fb.com >
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
Co-authored-by: zhiweiz <zhiweiz@fb.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2025-09-08 19:03:50 -07:00
955c624915
[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and EP MoE ( #24134 )
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-09-08 19:01:51 -07:00
4f87abdcc6
Update reviewers for modelopt related files ( #24468 )
2025-09-09 01:53:13 +00:00
6910b56da2
[CI] Add nightly multiarch manifests to dockerhub ( #24102 )
Signed-off-by: Sahithi Chigurupati <chigurupati.sahithi@gmail.com >
Signed-off-by: Simon Mo <simon.mo@hey.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-09 01:18:09 +00:00
e10fef0883
[Hardware][IBM Z] Fix Outlines Core issue for s390x ( #24034 )
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2025-09-08 16:50:34 -07:00
e680723eba
[Bugfix] Disable the statslogger if the api_server_count is greater than 1 ( #22227 )
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-09-08 15:28:03 -07:00
620db1fc58
[Attention] FlashAttention MLA cudagraph support ( #23958 )
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-08 22:05:26 +00:00
41183c1fe0
[Spec Decode] Fix offline spec_decode.py ( #24257 )
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-08 20:44:13 +00:00
43d9ad03ba
[Model loader]: support multi-thread model weight loading ( #23928 )
Signed-off-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com >
Signed-off-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-08 18:49:39 +00:00
7be141b2c5
[CI] Enable encoder model compilation test ( #24442 )
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-08 11:48:06 -07:00
8d7f39b48c
[Model] Remove quantized mixtral ( #24437 )
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-08 11:02:14 -07:00
cd08636926
[Spec Decode][Benchmark] Add Blitzedit dataset ( #23605 )
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-08 10:32:52 -07:00
3feeeb9fea
[Spec Decode][Benchmark] Add Spec Bench Dataset for benchmarking ( #23563 )
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2025-09-08 10:32:42 -07:00
6f4a82f8b5
[Model] Enable BNB support for qwen2_5_omni_thinker ( #24420 )
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-08 09:37:08 -07:00
c44797a4d6
[Docs]add eplb_config param use docs ( #24213 )
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-09-08 09:36:57 -07:00
55be93baf5
[Doc]: fix 2 hyperlinks leading to Ray site after they changed Ray's doc structure ( #24438 )
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 09:36:54 -07:00
717fc00e98
[Docs] Move feature compatibility tables to README ( #24431 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 06:45:14 -07:00
01dfb5e982
[Frontend] User-provided uuids for medias in chat. (RFC #22044 ) ( #23449 )
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-08 06:42:20 -07:00
03dd652c16
Move KVEventsConfig from config/__init__.py to config/kv_events.py ( #24433 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 06:41:27 -07:00
9cd76b71ab
[Misc] Terratorch related fixes ( #24337 )
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-08 06:40:26 -07:00
e041314184
[Bugfix] Fix mamba2 prefill chunking ( #23279 )
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
Signed-off-by: tomeras91 <57313761+tomeras91@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-08 11:42:41 +00:00
5e537f45b4
[Bugfix] Fix get_quant_config when using modelscope ( #24421 )
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-09-08 11:03:02 +00:00
c2a8b08fcd
[Doc] Fix issues in integrations/llamastack.md ( #24428 )
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-08 02:28:32 -07:00
f4962a6d55
[Doc]: fix typos in Python comments ( #24417 )
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-08 00:22:16 -07:00
2f0b833a05
[Docs] Fix a tip indentation and typo ( #24419 )
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-08 00:19:40 -07:00
425b04b8f4
[gpt-oss][Responses API] Fix the function call id format ( #24409 )
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-08 06:49:52 +00:00
60f0843ef8
[Model] Remove unnecessary CUDA sync of Qwen2VL image and video preprocess ( #24334 )
Signed-off-by: Win <chatcharinsang@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-07 23:11:12 -07:00
8a46602606
[Model] Remove unnecessary CUDA sync of GLM-4.1V image and video preprocess ( #24332 )
Signed-off-by: Win <chatcharinsang@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-07 23:10:54 -07:00
61aa4b2901
[P/D] Add a shutdown method to the Connector API ( #22699 )
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-07 23:07:00 -07:00
8c892b1831
[Doc] Fix UTF-8 encoding issues in documentation generation on Windows ( #24361 )
Signed-off-by: alekramelaheehridoy <aliqramalaheehridoy@gmail.com >
Signed-off-by: alekramelaheehridoy <alekramelaheehridoy@gmail.com >
Co-authored-by: alekramelaheehridoy <alekramelaheehridoy@gmail.com >
2025-09-07 22:33:52 -07:00
3bca396f79
[CI/Build] Fix local image inputs in test_pixtral.py ( #24401 )
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-08 03:31:35 +00:00
3a3e91bdfe
[CI/Build] Disable flaky test_structured_output tests ( #24404 )
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-08 02:51:59 +00:00
b3d7e3c845
[Sampler] Support returning all prompt logprobs ( #23868 )
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-07 19:34:31 -07:00
67841317d1
[xpu] upgrade ipex/python3.12 for xpu ( #23830 )
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-09-08 02:07:16 +00:00
86173ad593
[Kernel] Support decode context parallelism on Blackwell with CUTLASS MLA ( #24385 )
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-09-08 09:27:12 +08:00
795b6951cd
Add @luccafong to codeowner for spec decode ( #24397 )
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-09-08 08:30:27 +08:00
2e5d21378d
Skip MM Encoder for non-first PP ranks ( #24387 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-07 09:38:35 -07:00
0661cb9df3
Add renderer-based prompt processing for embedding and classification endpoints ( #24356 )
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-07 08:26:48 +00:00
105d3d62ef
[TPU] Remove TopKTopPSampler dependency for TPU sampler ( #24391 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-07 01:12:36 -07:00
62f66be1f7
[Bugfix] Fix Qwen3-coder moe tuned config ( #24072 )
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-07 05:19:46 +00:00
81c53ef55c
[Misc] collect flashinfer version in collect_env.py ( #24378 )
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-07 03:30:41 +00:00
75334956c2
QWEN3 Thinking Fused MoE kernels Optimization configs ( #24330 )
Signed-off-by: Saman Keon <samanamp@outlook.com >
2025-09-07 03:18:54 +00:00
77aec83b8c
[Benchmark] add benchmark for custom activation op ( #23908 )
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-06 20:12:05 -07:00
e67597545b
[CI][Fix] deterministic seed for flaky CI runs on structured outputs ( #24380 )
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-09-07 11:10:40 +08:00
37a6fa95fd
Migrate Qwen2 inputs to TensorSchema ( #23475 )
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-06 20:07:31 -07:00
558f0907dc
[attention][DCP] use AttentionImpl.need_to_return_lse_for_decode ( #24372 )
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-07 01:18:59 +00:00
4172235ab7
[V0 deprecation] Deprecate V0 Neuron backend ( #21159 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-06 16:15:18 -07:00
848562bd49
break execute_model in gpu_model_runner into sub-functions for custom scopes ( #24265 )
Co-authored-by: Bangsheng Tang <bangsheng@meta.com >
2025-09-06 14:02:47 -07:00
e68dc2f014
[Bugfix] Fix unstable silu_mul+nvfp4 quant fusion test ( #24370 )
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-06 20:39:34 +00:00
a3645ed94d
[Frontend][Responses API] Support reporting tool output tokens and fix reasoning token count ( #24285 )
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-06 13:27:15 -07:00
fb691ee4e7
[Fix] [gpt-oss] fix non-tool calling path for chat completion ( #24324 )
2025-09-06 19:10:32 +00:00
6024d115cd
Lora bias(enable_lora_bias) deprecate warning ( #24339 )
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-07 00:42:19 +08:00
7555d6b34a
[Bugfix] Fix test_mixtral_moe ( #24371 )
2025-09-06 09:32:03 -07:00
00a4e56d8d
[Bugfix] Fix broken deepseek fp8 TP weights loading ( #24367 )
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-06 09:23:12 -07:00
0eadaeff7e
[Bugfix] Avoid uninitialized usage of azp_val when AZP is false. ( #24335 )
Signed-off-by: Mohan Kumar Kumar <mohan.cbein@gmail.com >
Signed-off-by: mohankku <mohan.cbein@gmail.com >
2025-09-06 08:17:03 -07:00
0077c8634e
Add @benchislett to codeowner for spec decode and structured outputs ( #24362 )
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-09-06 22:03:35 +08:00
b121ca22ad
[CI] Disable flaky structured output test from CI ( #24366 )
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-06 13:31:56 +00:00
eddaafc1c7
[Multimodal] Improve max video embedding length estimation in V1 ( #24312 )
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-09-06 02:33:19 -07:00
305a1cc0d2
refactor: Turn GPUModelRunner.inputs_embeds to a CpuGpuBuffer ( #24345 )
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-05 23:01:23 -07:00
6d6c6b05d3
[New Model]: google/embeddinggemma-300m ( #24318 )
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-05 22:58:36 -07:00
53b19ccdd5
[Core] Allow disabling TP sharding for parallel Linear layer ( #23024 )
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-05 22:53:58 -07:00
6432739ef1
[Bugfix] Catch and log invalid token ids in detokenizer ( #24351 )
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-05 22:30:22 -07:00
ac201a0eaf
[Feature] Support Decode Context Parallel (DCP) for MLA ( #23734 )
Signed-off-by: hongchao <hongchao@msh.team >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: hongchao <hongchao@msh.team >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-09-06 13:24:05 +08:00
3c529fc994
[KV Sharing] Raise error if using eagle with fast prefill ( #24350 )
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-09-05 20:22:40 -07:00
35bf193864
[Doc]: fix typos in Python comments ( #24294 )
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-05 19:41:12 -07:00
35efa70297
Add @22quinn as code reviewer for RL related components ( #24346 )
2025-09-06 01:56:15 +00:00
cee182b297
[Perf][V1] Fully overlap model execution ( #23569 )
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-09-05 18:20:17 -07:00
c954c6629c
[CI] Add timeouts to tests ( #24260 )
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-09-05 17:26:22 -07:00
9dfbeb41e5
[RFC] allow cancelation after shutdown in blocking collective_rpc ( #23390 )
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2025-09-05 14:14:18 -07:00
eedb2a2a10
[Bugfix] Fix silu_mul+quant fusion test ( #24341 )
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-05 20:13:42 +00:00
23a6c5280e
[gpt-oss][Bugfix]Fix streamableparser for missing handling of certain token_ids ( #24306 )
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-05 10:26:00 -07:00
7812bcf278
[docs] add shenzhen meetup ( #24326 )
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-05 22:48:42 +08:00
006e7a34ae
Adding int4 and int8 models for CPU benchmarking ( #23709 )
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-09-05 20:08:50 +08:00
e599e2c65e
[XPU][P/D] Add XPU support in NixlConnector ( #22436 )
Signed-off-by: zhenwei <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-04 21:03:12 -07:00
c29fb540ff
[gpt-oss] tool parser supports for /chat/completions [1/n] ( #22386 )
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-04 20:39:12 -07:00
65e038931d
[Frontend] Skip unnecessary detokenization when token_id is requested ( #24236 )
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-04 23:04:12 +00:00
886ccbe5ba
[CI/Build] Reduce the number of redundant cases to test for LoRA ( #24276 )
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-04 21:58:44 +00:00
adc3ddb430
[Bugfix][Misc] Fix silu_and_mul_nvfp4_quant issue and extract common utils for nvfp4 kernel source files ( #23727 )
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-04 14:25:45 -07:00
60b755cbcb
[Misc] Have AsyncLLM custom_stat_loggers extend default logger list ( #20952 )
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-09-04 14:25:30 -07:00
482e52f56c
QWEN3 Coder Fused MoE kernels Optimization configs ( #24266 )
Signed-off-by: Saman Keon <samanamp@outlook.com >
2025-09-04 20:33:43 +00:00
78336a0c3e
Upgrade FlashInfer to v0.3.0 ( #24086 )
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-04 09:49:20 -07:00
94866d7c93
[Misc] Slight improve deepgemm print ( #24085 )
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-04 16:06:51 +00:00
83609ca91d
[Doc]: fix typos in Python comments ( #24173 )
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-04 08:52:17 -07:00
e41a0fa377
[Perf] Freeze core engine proc heap after init ( #24008 )
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-04 22:55:23 +08:00
37241077d5
[Misc] Removed force_fp8_e4m3fnuz from FP8LinearOp ( #23725 )
Signed-off-by: Julien Lin <jullin@nvidia.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-04 09:25:40 -04:00
c9f7081f9c
[LoRA]: Add lora support to qwen-2.5-omni ( #24231 )
2025-09-04 05:50:50 -07:00
16ded21eeb
[XPU] support Triton Attention backend on Intel GPU ( #24149 )
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-04 20:41:08 +08:00
2b30afa442
Use hidden_size_per_head as head_size fallback ( #24221 )
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
2025-09-04 12:59:16 +01:00
eafa8dcde6
[Model] Add pp support for hunyuan ( #24212 )
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-04 03:58:26 -07:00
6c7af8110a
[Doc] Update vLLM Singapore Meetup info ( #24234 )
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-09-04 02:58:18 -07:00
8f423e5f43
[Feature][Response API] Add streaming support for non-harmony ( #23741 )
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-09-04 17:49:06 +08:00
369a079568
[Hardware][Apple-CPU] Disable OneDNN build for Apple Silicon ( #24200 )
Signed-off-by: ignaciosica <mignacio.sica@gmail.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-09-04 02:48:25 -07:00
402759d472
[Attention] FlashAttn MLA ( #14258 )
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-04 02:47:59 -07:00
2c301ee2eb
[Bugfix] Fix Incremental Detokenization with tokenizers == 0.22.0 ( #24159 )
Signed-off-by: Fanli Lin <fanli.lin@intel.com >
Signed-off-by: Fanli Lin <fanli0116@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-04 02:47:08 -07:00
3efb9f4d95
[Attention][Platform] Refactor MLA to support Custom Op ( #23332 )
Signed-off-by: whx-sjtu <2952154980@qq.com >
2025-09-04 02:46:37 -07:00
04f3c35cff
Improve flexibility of auto_tune.sh execution. ( #23766 )
Signed-off-by: Anthony Su <50185138+anthonsu@users.noreply.github.com >
Signed-off-by: anthonsu <50185138+anthonsu@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-04 09:41:41 +00:00
51d5e9be7d
[Core][Model] Terratorch backend integration ( #23513 )
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com >
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-04 00:22:41 -07:00
e7fc70016f
[Model] Add MiDashengLM model support ( #23652 )
Signed-off-by: chenbing8 <chenbing8@xiaomi.com >
Signed-off-by: bingchen-mi <chenbing8@xiaomi.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-04 00:08:09 -07:00
12e1e63cc5
[Misc] Enhance output readability of helper script ( #24214 )
Signed-off-by: Weida Hong <wdhongtw@google.com >
2025-09-04 06:38:26 +00:00
57b1ce94f7
[CPU] Refactor CPU unquantized linear ( #24150 )
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-04 14:28:45 +08:00
cb55ad86fe
Migrate ultravox inputs to TensorSchema ( #23503 )
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-04 06:09:11 +00:00
712b273f65
[Refactor] Introduce basic Renderer for completion-style request ( #24010 )
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-04 05:21:12 +00:00
e919d6f549
[Kernel][Bugfix] Fix grouped topk cu ( #24146 )
Signed-off-by: mayuyuace <qiming1.zhang@intel.com >
2025-09-04 12:37:37 +08:00
a38f8bd54c
[Feature][Responses API]Support MCP tools with streaming mode + background mode ( #23927 )
Signed-off-by: wuhang <wuhang6@huawei.com >
2025-09-04 04:05:10 +00:00
b5ee1e3261
Remove deprecated PyNcclConnector ( #24151 )
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-09-03 22:49:16 +00:00
36c260dad6
[Feature][gpt-oss] Add support for num_cached_tokens and num_reasoning_tokens tracking ( #23460 )
Signed-off-by: George Nagy II <george.nagy0969@gmail.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-03 21:08:47 +00:00
a43a3f1770
[Bugfix][DP] DP distribution does not require ray[default] ( #23822 )
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-09-03 13:21:36 -07:00
6adaed42f4
[Feature][P/D]: Optimize NIXL Connector xfer Launch ( #23887 )
Signed-off-by: ycyaw66 <497410282@qq.com >
Co-authored-by: ycyaw66 <497410282@qq.com >
2025-09-03 19:14:30 +00:00
a742322092
[Attention] Blackwell FP8 MLA support with CUTLASS_MLA backend ( #23289 )
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-03 14:05:24 -04:00
731a6940e3
Migrate whisper inputs to TensorSchema ( #23505 )
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-03 18:04:00 +00:00
e9b92dcd89
[Kernels] Overlap shared experts with send/recv ( #23273 )
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-03 12:35:18 -04:00
fa4311d85f
[V1] v1 engine + full CUDA graph support for PLaMo2 ( #23998 )
Signed-off-by: Hemmi Shinichi <shemmi@preferred.jp >
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
Co-authored-by: Hemmi Shinichi <shemmi@preferred.jp >
Co-authored-by: Thomas Parnell <tom.parnell@gmail.com >
2025-09-03 08:24:02 -07:00
6d80ae83e1
[Bugfix] Fixing division by zero in triton_attn if query_heads/kv_heads > 16 ( #23424 )
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
2025-09-03 15:01:09 +00:00
4ba0c587ba
FIX: Add libnuma-dev to Dockerfile for dev stage ( #20388 )
Signed-off-by: dongbo910220 <1275604947@qq.com >
2025-09-03 07:17:20 -07:00
6997a25ac6
[Model] Remove useless code from MiniMax implementation ( #23982 )
Signed-off-by: QscQ <qscqesze@gmail.com >
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-09-03 11:27:04 +00:00
28f350e147
Support add_generation_prompt in embeddings endpoint with chat request ( #23931 )
Signed-off-by: biba10 <jaksmid@seznam.cz >
2025-09-03 10:47:55 +00:00
51383bd472
[CI] Accelerate mteb test by setting SentenceTransformers mteb score to a constant ( #24088 )
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-03 17:23:56 +08:00
9c99e4871f
[Misc] Clean up deadcode for legacy processing pipeline ( #24153 )
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-03 08:34:29 +00:00
70549c1245
[CI/Build] Serve images used by multimodal tests through local HTTP Server ( #23907 )
Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com >
Signed-off-by: dsinghvi <divyanshsinghvi@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-03 16:13:11 +08:00
f0c503f66e
[Nixl] Heterogeneous TP support FlashInfer ( #20189 )
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-03 15:19:54 +08:00
f38035c123
[distributed][rl] remove nccl cumem env var override ( #24141 )
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-03 06:45:25 +00:00
426cc8629f
[BugFix] Fix routed_scaling_factor double mul for dots1 and glm4 MoE models ( #24132 )
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-09-03 04:57:59 +00:00
e81d4e69c1
[Misc] Add check for dual_chunk_attention ( #24070 )
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-03 04:19:14 +00:00
02d411fdb2
[Doc]: fix typos in Python comments ( #24115 )
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-02 21:14:07 -07:00
d7e1e59972
[Doc]: fix typos in Python comments ( #24093 )
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-02 21:05:45 -07:00
c4ed78b14f
[Compile] Fix Compile Warning for w4a8_mm_entry.cu ( #23660 )
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-02 20:45:52 -07:00
1bd007f234
fix some typos ( #24071 )
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
2025-09-02 20:44:50 -07:00
136d853e65
[V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing ( #23656 )
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
2025-09-03 02:52:51 +00:00
e32a0e8678
Upgrade xgrammar to 0.1.23 ( #22988 )
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-03 02:32:59 +00:00
42dc59dbac
Update release pipeline post PyTorch 2.8.0 update ( #24073 )
Signed-off-by: Huy Do <huydhn@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Huy Do <huydhn@gmail.com >
2025-09-03 10:09:19 +08:00
862f2ef893
[XPU] Fix the bug of LoRA logits on the XPU platform ( #24081 )
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-09-03 08:21:18 +08:00
2fd1a40a54
[CI/Build] Disable SiluMul NVFP4 quant fusion tests ( #24121 )
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-02 16:50:28 -07:00
930a24144c
[Bug] R1 Accuracy: Fix routed_scaling_factor Double Mul Issue ( #24119 )
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-02 22:22:30 +00:00
457e471971
[AMD][Kernel][Bugfix] Cast offsets tensor bn to tl.int64 to avoid GPU segfault ( #23692 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-09-02 22:13:57 +00:00
d328f7894f
[CI] Enable all hf transformers baselines in test_hybrid ( #23936 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-09-02 20:15:06 +00:00
98aee612aa
[Log] Only Print Profiler Results on Rank 0 ( #23370 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-02 18:53:34 +00:00
598bd74cf8
Fix weights loading for Apertus ( #24100 )
...
Signed-off-by: Nathan Ranchin <nranchin@student.ethz.ch >
2025-09-02 18:34:28 +00:00
2417798471
[Metrics] Deprecate TPOT in favor of ITL ( #24110 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-09-02 18:10:10 +00:00
9480ae24e3
[Bugfix] Fix packed_factor missing attribute error ( #23902 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2025-09-02 10:56:31 -07:00
f399182e8c
Run ruff format on a few files. ( #24075 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-09-02 17:55:32 +00:00
1c41310584
[Bugfix] Fix transform_config parsing in Compressed Tensors ( #23945 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-09-02 13:54:10 -04:00
c83c4ff815
[Benchmark] Add support for local hf dataset path in benchmark ( #23999 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-02 17:49:16 +00:00
0e1759cd54
[docs] add SYS_NICE cap & security-opt for docker/k8s ( #24017 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: Peter Pan <peter.pan@daocloud.io >
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-02 17:27:20 +00:00
e66ed3e675
[CI Failure] Skip failing nvfp4 silu test ( #23959 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-02 13:18:15 -04:00
e0653f6c0b
[Model] Classification models support logit_bias / sigmoid_normalize ( #24031 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-02 16:48:57 +00:00
38ba061f6f
[BugFix] Fix EXAONE4 rotary embeddings ( #23918 )
...
Signed-off-by: lkm2835 <lkm2835@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-02 14:40:55 +00:00
0a74e9d0f2
[Gemma3n] Fix audio batching ( #24052 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-02 22:23:35 +08:00
8bd5844989
correct LWS deployment yaml ( #23104 )
...
Signed-off-by: cberge908 <42270330+cberge908@users.noreply.github.com >
2025-09-02 12:04:59 +00:00
ce30dca5c4
[CI]: reduce HTTP calls inside entrypoints openai tests ( #23646 )
...
Signed-off-by: AzizCode92 <azizbenothman76@gmail.com >
Signed-off-by: Aziz <azizbenothman76@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-02 10:49:32 +00:00
2f0bab3f26
[Model] Support dp on ViT on GLM-4.5V ( #23168 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-09-02 10:48:18 +00:00
fad73be1a5
[Doc]: fix typos in Python comments ( #24077 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-02 02:38:55 -07:00
56d04089ef
Migrate Interns1 inputs to TensorSchema ( #23510 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-02 04:35:45 +00:00
7be0cb8e9e
[XPU][Feature] fp8 online quantization support for XPU ( #23148 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com >
2025-09-02 04:06:53 +00:00
1fa1d6a9a0
Migrate OvisImagePatchInputs to TensorSchema ( #22024 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-02 12:01:36 +08:00
d59c986444
Remove runtime checks based on pooling params ( #24051 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-09-02 11:54:37 +08:00
04d0c60770
[Bugfix] Fix the issue that Blip2ForConditionalGeneration' object has… ( #24028 )
...
Signed-off-by: Dazhi Jiang <dazhi_jiang@163.com >
2025-09-02 11:54:20 +08:00
2b41cbbf03
[V1][Mamba1] - FP32 SSM Kernel Support ( #23506 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
2025-09-01 20:53:00 -07:00
0235103cbb
[Doc]: fix typos in Python comments ( #24042 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-01 19:07:45 -07:00
a344a5aa0a
[bugfix]fix MTP hidden states ( #24056 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-09-01 21:09:37 +00:00
5685370271
[Chore][V0 Deprecation] Move LogProb to a separate file ( #24055 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 12:07:53 -07:00
a0e0efd6bd
[Model] Support DP for ViT on Kimi-VL-A3B-Thinking-2506 ( #23817 )
...
Signed-off-by: Junhong <liujunhong11@huawei.com >
Signed-off-by: LJH-LBJ <98734602+LJH-LBJ@users.noreply.github.com >
Co-authored-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: LJH-LBJ <98734602+LJH-LBJ@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-09-01 16:56:56 +00:00
cf91a89dd2
[docs][misc] IOProcessor plugins fixes ( #24046 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2025-09-01 09:17:41 -07:00
39a22dcaac
[Misc] Minor code simplification for spec decode ( #24053 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 08:54:01 -07:00
41c80698b3
Document multi-proc method selection for profiling ( #23802 )
...
Signed-off-by: jdebache <jdebache@nvidia.com >
2025-09-01 06:28:26 -07:00
7c8271cd1e
[Model]: support KeyeVL-1_5-8B ( #23838 )
...
Signed-off-by: wangruitao <wangruitao@kuaishou.com >
Co-authored-by: wangruitao <wangruitao@kuaishou.com >
2025-09-01 03:50:27 -07:00
3e330fcb21
[Doc]: Fix CPU install docs: force torch-backend=cpu to avoid GPU torchvision errors ( #24033 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-09-01 03:34:52 -07:00
d46934b229
[Frontend] Gemma3n audio transcriptions/translations endpoint ( #23735 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-01 18:07:46 +08:00
107284959a
[Doc]: fix typos in Python comments ( #24026 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-01 09:38:20 +00:00
dc1a53186d
[Kernel] Update DeepGEMM to latest commit ( #23915 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-01 02:38:04 -07:00
55602bb2e6
[Frontend] Update the warning log when using VLLM_ALLOW_LONG_MAX_MODEL_LEN ( #20904 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-01 08:50:25 +00:00
d7fbc6ddac
[Misc] Enable V1 FP16 inference on pre-Ampere GPUs ( #24022 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-01 08:12:22 +00:00
5438967fbc
[Misc] add hash_function doc string ( #24014 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-31 23:11:20 -07:00
422e793fa6
[Bugfix] Add support for <tool_call> format in streaming mode for XLAM Tool Parser ( #22769 )
...
Signed-off-by: Devon Peroutky <devon@kindo.ai >
2025-09-01 14:07:54 +08:00
1cb39dbcdd
[Misc] IO Processor plugins for pooling models ( #22820 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-31 23:07:12 -07:00
437c3ce026
Migrate Phi4 inputs to TensorSchema ( #23471 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-01 14:05:59 +08:00
499b074bfd
[Misc] refactor code by import as for torch._inductor.config ( #23677 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-09-01 14:05:42 +08:00
ff0e59d83a
[CI/Build] Improve Tensor Schema tests speed by avoid engine core initialization ( #23357 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-31 22:52:20 -07:00
b55713683c
[Misc] Move fast prefill logic to separate method ( #24013 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 05:40:38 +00:00
acc1a6e10a
Fix the bug related to loading GPTQ INT3 weights. ( #23328 )
...
Signed-off-by: JunHowie <JunHowie@aliyun.com >
Co-authored-by: JunHowie <JunHowie@aliyun.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-01 05:39:57 +00:00
8c742a66d1
[Misc] Avoid redundant copy for encoder-only models ( #24012 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 04:02:43 +00:00
183a70967a
[BUGFIX] GPTQ quantization compatibility for Qwen3 MOE models (AutoGPTQ and AutoRound-GPTQ) ( #23994 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-01 03:33:40 +00:00
14b4326b94
v1: Support KV events from connectors ( #19737 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-01 01:13:21 +00:00
752d2e1c36
[Minor] Fix some random typos in comments ( #24009 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-31 16:42:17 -07:00
81eea3d348
vllm fix check on max vocab size ( #22471 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-08-31 20:57:05 +08:00
9701352e4b
[Doc]: fix typos in Python comments ( #24001 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-31 08:21:59 +00:00
749be00a98
[Core][Multimodal] Allow passing multi_modal_uuids as multimodal identifiers. ( #23394 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-30 18:01:22 -07:00
5b8077b8ac
Fix wrong truncate_prompt_tokens type hint ( #22761 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com >
Signed-off-by: Gabriel Marinho <104592062+gmarinho2@users.noreply.github.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-30 20:39:38 +00:00
038e9be4eb
[LoRA] Much faster startup when LoRA is enabled ( #23777 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-30 15:37:39 +00:00
68a349114f
[Misc] enhance type hint for rearrange return value ( #23519 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-30 06:43:33 -07:00
e80bca309e
[Refactor] refactor freezing_value/cuda_event initialize outside try finally ( #23758 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-30 06:42:25 -07:00
fb4983e112
[Misc] add reorder_batch AttentionMetadataBuilder ( #23798 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-30 06:41:45 -07:00
379ea2823a
Add LoRA support for DeepSeek models (V2, V3, R1-0528) ( #23971 )
...
Signed-off-by: sadeghja1070 <sadegh.ja1070@gmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-30 06:40:02 -07:00
3a6acad431
[Model] Enable encoder DP for MiniCPM-V ( #23948 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-30 06:31:26 -07:00
5490d633ce
[UT] fix unify_kv_cache_configs when kv cache config needs sort ( #23843 )
2025-08-30 11:22:14 +00:00
628d00cd7b
[Bugfix] Fix test_lora_resolvers.py ( #23984 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-30 11:16:11 +00:00
4071c76cf3
[V1] [Hybrid] Move MiniMaxLinearAttention into layers/mamba ( #23831 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-30 00:16:15 -07:00
f1bddbd852
[Core] Cleanup TPU model runner for MM ( #23894 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-30 00:14:58 -07:00
9748c5198b
[CI] Fix broken compile tests due to unsupported SiluMul+Nvfp4Quant fusion ( #23973 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-08-30 00:14:43 -07:00
ee52a32705
[CI] Move testing image from remote URL to S3 ( #23980 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-29 21:41:25 -07:00
8fb85b7bb6
Add routed_scaling_factor to MoE grouped topk ( #23123 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-29 21:36:48 -07:00
5b31cb1781
[Bugfix] Fix --config arg expansion called from api_server.py ( #23944 )
...
Signed-off-by: Jean-Francois Dube <dubejf+gh@gmail.com >
Co-authored-by: Jean-Francois Dube <dubejf+gh@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-29 21:36:39 -07:00
d660c98c1b
[CI] Fix unavailable image remote URL ( #23966 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-29 15:40:04 -07:00
5674a40366
[Misc] Make download_weights_from_hf more reliable ( #23863 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-29 12:37:24 -07:00
8c3e199998
Revert gemma3n fast prefill changes ( #23897 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-29 12:16:57 -07:00
1c26b42296
[Docs] [V1] [Hybrid] Add new documentation re: contributing mamba-based models ( #23824 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-29 18:47:58 +00:00
b7adf94c4a
Tuned H100/H200 triton fp8 block configs for fused_qkv_a_proj ( #23939 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-29 10:28:35 -07:00
4d7fe40fc0
[RL][BugFix] Fix missing tokenizer error for token-in-token-out ( #23904 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-30 01:09:55 +08:00
0dc9532065
[BUGFIX] fix undefined silu_and_mul_nvfp4_quant ( #23929 )
...
Signed-off-by: hongchao <hongchao@msh.team >
Signed-off-by: Richard Zou <zou3519@gmail.com >
Co-authored-by: hongchao <hongchao@msh.team >
Co-authored-by: Richard Zou <zou3519@gmail.com >
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com >
2025-08-29 09:36:39 -07:00
72a69132dc
[CI] Add aiter to matching list of issue auto labeller for rocm tag ( #23942 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-08-29 15:29:21 +00:00
d90d8eb674
[BugFix] Async scheduling and PP compatibility with DP ( #23770 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-29 08:17:27 -07:00
0a2f4c0793
[Models] Use in-place adds in Idefics2Vision ( #23932 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-08-29 07:42:57 -07:00
1cf3753b90
[MODEL] Apertus and XIELU ( #23068 )
...
Signed-off-by: EduardDurech <39579228+EduardDurech@users.noreply.github.com >
Co-authored-by: AllenHaoHuang <allenhuangdd@gmail.com >
2025-08-29 20:29:18 +08:00
4f7cde7272
Adds json_count_leaves utility function ( #23899 )
...
Signed-off-by: aditchawdhary <aditxy@hotmail.com >
2025-08-29 05:28:13 -07:00
67c14906aa
Update PyTorch to 2.8.0 ( #20358 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-29 18:57:35 +08:00
69f46359dd
[Multimodal] Consolidate mm inputs into MultiModalFeatureSpec ( #23779 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-08-29 18:36:57 +08:00
d9e00dbd1f
[Performance] V1 Classify Models E2E Performance Optimization ( #23541 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-29 03:12:32 -07:00
ad39106b16
[CPU] Enable data parallel for CPU backend ( #23903 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-29 02:19:58 -07:00
2554b27baa
[V0 Deprecation] Remove pooling model support in V0 ( #23434 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-29 00:04:02 -07:00
934bebf192
Better errors for Transformers backend missing features ( #23759 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-29 07:01:40 +00:00
885ca6d31d
[Misc] Fix warnings for mistral model ( #23552 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-08-29 06:58:48 +00:00
2d0afcc9dc
[mrope][Qwen2-VL] Fix edge case where getting index of image/video token can potentially throw in default vl mrope implementation. ( #23895 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-08-28 23:29:13 -07:00
b4f9e9631c
[CI/Build] Clean up LoRA test ( #23890 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-28 23:28:35 -07:00
05d839c19e
Fix(async): Add support for truncate_prompt_tokens in AsyncLLM ( #23800 )
2025-08-28 22:55:06 -07:00
6597d7a456
[Platform] import activation_quant_fusion for CUDA only ( #23882 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-08-28 22:54:16 -07:00
5264015d74
[BugFix][AMD][Deepseek] fix a dtype mismatch error for deepseek running on AMD ( #23864 )
...
Signed-off-by: Jinghui Zhang <jinghuizhang0804@gmail.com >
2025-08-28 22:54:12 -07:00
98ac0cb32d
[Bugfix] Use ReplicatedLinear for SequenceClassification head ( #23836 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-29 04:41:20 +00:00
c8b3b299c9
[tests] Improve speed and reliability of test_transcription_api_correctness ( #23854 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-29 04:25:33 +00:00
006477e60b
[ROCm][Fix] Fix rocm build caused by #23791 ( #23847 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-08-28 19:52:27 -07:00
de533ab2a1
[Models] Improve iteration over layers ( #19497 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-08-29 09:26:34 +08:00
235c9db8a7
[XPU] support data parallel for MoE models on XPU ( #22887 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-08-29 09:23:04 +08:00
b668055a11
[V0 Deprecation] Remove V0 Samplers test ( #23862 )
2025-08-28 18:05:52 -07:00
d3d2aad5a2
[Log] Use Debug Once for DeepGEMM E8M0 When not Enabled ( #23858 )
2025-08-28 22:18:10 +00:00
cb293f6a79
[V1] Enable prefill optimization for Gemma3n ( #22628 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-28 14:54:30 -07:00
7ffbf27239
[BugFix][FlashInfer] Fix potential race condition for paged_kv_indptr_cpu ( #23737 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-28 14:22:46 -07:00
27e88cee74
chore: build release image by default ( #23852 )
...
Signed-off-by: Codex <codex@openai.com >
2025-08-28 13:17:15 -07:00
16a45b3a28
[NVIDIA] Support SiluMul + NVFP4 quant fusion ( #23671 )
...
Signed-off-by: jindih <jindih@nvidia.com >
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: jindih <jindih@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedic <lgovedic@redhat.com >
2025-08-28 19:36:50 +00:00
57d4ede520
[bugfix] [spec-decoding] fix data race in sample_recovered_tokens_kernel (vLLM v1) ( #23829 )
...
Signed-off-by: He-Jingkai <he-jingkai@outlook.com >
2025-08-28 19:05:20 +00:00
04d1dd7f4a
[ROCm][Aiter] Add triton fp8 bmm kernel for mla ( #23264 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com >
2025-08-28 18:18:08 +00:00
f32a5bc505
Migrate Llama4ImagePatchInputs to TensorSchema ( #22021 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-28 17:29:37 +00:00
8805ad9fa9
Add scale_config.yml file for Meta autoscalers for GH Actions ( #23840 )
...
Signed-off-by: Jean Schmidt <contato@jschmidt.me >
2025-08-28 09:31:20 -07:00
0583578f42
[ci] breaks down V1 Test into 3 groups of approx 30 minutes runtime ( #23757 )
...
Signed-off-by: Jean Schmidt <contato@jschmidt.me >
2025-08-28 08:59:19 -07:00
db74d60490
[Bugfix] Add fake mode around passes ( #23349 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2025-08-28 11:25:56 -04:00
95089607fa
[Model][gpt-oss] Support DP+EP for GPT-OSS with FlashInfer trtllm-gen MoE ( #23819 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
2025-08-28 06:56:20 -07:00
1f096f9b95
[CI] Fix linting error on main ( #23835 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-28 06:52:01 -07:00
66548f6603
[Bugfix] Fix benchmark_moe.py for blockwise fp8. ( #23823 )
...
Signed-off-by: crischeng <420985011@qq.com >
Co-authored-by: cris <grace@guisenbindeMacBook-Pro.local >
2025-08-28 21:44:09 +08:00
d3da2eea54
[Doc]: fix typos in Python scripts ( #23828 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-28 05:37:38 -07:00
bfab219648
[Model] [gpt-oss] fix gpt-oss pp support ( #23815 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-28 05:36:55 -07:00
a3432f18fd
[BugFix][Spec Decode] Use float64 for uniform_probs ( #23803 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-28 12:26:45 +00:00
67cee40da0
[CI/Build][Bugfix] Fix Qwen VL tests on CPU ( #23818 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-28 11:57:05 +00:00
d99c3a4f7b
[Doc]: fix typos in .md files (including those of #23751 ) ( #23825 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-28 04:38:19 -07:00
3462c1c522
[FIXBUG] Add return_success parameter to moe_wna16_weight_loader function ( #22797 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-28 09:03:22 +00:00
c5d004aaaf
[Model] Add PP support and VLM backbone compatibility for GPT-OSS ( #23680 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-28 16:03:28 +08:00
11a7fafaa8
[New Model]: Support GteNewModelForSequenceClassification ( #23524 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-28 15:36:42 +08:00
186aced5ff
[Kernel] cuda kernels for upcoming decode context parallel feature ( #23791 )
...
Co-authored-by: hongchao <hongchao@msh.team >
2025-08-28 15:29:11 +08:00
daa1273b14
[Bugfix] when set offline model running error ( #23711 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-28 07:27:45 +00:00
c07a73317d
[CI] enable idefics3 and fuyu-8b test in multimodal test ( #23790 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-28 14:51:24 +08:00
22feac8e95
[Transform] [Quantization] Add transforms to compressed tensors ( #22486 )
2025-08-28 02:43:48 -04:00
c8851a4723
Add deprecation warning for lora_extra_vocab_size ( #23635 )
...
Signed-off-by: Jinheng Li <ahengljh@gmail.com >
2025-08-27 22:34:29 -07:00
f48a9af892
[CI] make all multi-gpu weight loading tests run nightly ( #23792 )
...
Signed-off-by: Alex Yun <alexyun04@gmail.com >
2025-08-27 21:27:36 -07:00
a11adafdca
Gracefully handle edge cases in harmony utils ( #23155 )
...
Signed-off-by: Jan Kessler <jakessle@uni-mainz.de >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-27 20:14:00 -07:00
a781e84ec2
[Perf] Tune configs for triton block fp8 gemm H100/H200 ( #23748 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-28 11:12:53 +08:00
1b7b161a09
[Feature] models: pass layer prefix to replace_linear_class for per-layer quantization routing. Addresses #23239 ( #23556 )
...
Signed-off-by: Shrey Gupta <shreyg1303@gmail.com >
2025-08-27 20:12:44 -07:00
a69693e38f
Migrate Qwen inputs to TensorSchema ( #23473 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-28 10:43:26 +08:00
5da4f5d857
[Bugfix] Fix for V1 priority scheduling crashes at preemption ( #23713 )
...
Signed-off-by: Hanchenli <lihanc2002@gmail.com >
2025-08-28 00:44:52 +00:00
321938e9ac
[Feature] Add VLLM_DISABLE_PAD_FOR_CUDAGRAPH to Avoid Hang Issue ( #23595 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-27 21:52:24 +00:00
f9ca2b40a0
[Bugfix] Fix Marlin NVFP4 for modelopt ( #23659 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-27 17:48:16 -04:00
082cc07ef8
DP/EP Support for gpt-oss with deepep-ht comm kernel on SM100 ( #23608 )
2025-08-27 17:33:21 -04:00
853c371fc3
[V1][Mamba] - Enable V1 by default for Mamba Models ( #23650 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
2025-08-27 20:53:30 +00:00
8bf6266a17
[Multimodal] Generate mm_hash based on request metadata when caching is turned off ( #23690 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-27 20:24:31 +00:00
0585a9e73c
Disable torch.compile for dynamic rope models in Transformers backend ( #23738 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-27 19:03:05 +00:00
3c0ef769ba
ci: Add arm64 docker build to release pipeline ( #23210 )
...
Signed-off-by: Eli Uriegas <eliuriegas@meta.com >
Signed-off-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com >
2025-08-27 10:41:48 -07:00
4e4d017b6f
[Docs] Fix warnings in mkdocs build (continued) ( #23743 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
Signed-off-by: Hyogeun Oh (오효근) <ohg3417@gmail.com >
2025-08-27 17:17:29 +00:00
dd58932280
[V1] [Hybrid] Enable compile and piecewise CUDA graph for MiniMax-Text models ( #22589 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-27 10:05:16 -07:00
52883ed084
[Model] Merge SupportsMultiModalWithRawInput with SupportsMultiModal ( #23749 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-27 10:01:50 -07:00
4f35be10a9
[BugFix] Fix topk_softmax assert ( #19764 )
...
Signed-off-by: Luka Govedic <lgovedic@redhat.com >
2025-08-27 09:47:28 -07:00
2b61d2e22f
[Docs] Remove in-tree Gaudi install instructions ( #23628 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-27 09:22:21 -07:00
3ce8285d6d
[LogitsProcs] Deduplicate built-in LP implementation logic ( #23362 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-27 23:11:33 +08:00
83f555f637
[Doc]: upgrade version of crate-ci tool for improved typo detection ( #23755 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-27 07:59:34 -07:00
841490434a
[Model] Enable native HF format InternVL support ( #23742 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-27 14:45:17 +00:00
3af47c3cc6
[Feature] Add Hopper DeepGEMM E8M0 for DeepSeekV3.1 scale_fmt ( #23666 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-08-27 14:09:08 +00:00
513c1fe255
Only run get_attr_docs if generating help text ( #23723 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-27 13:55:12 +00:00
fe8d7b6f03
[Model] Interface to enable batch-level DP support ( #23733 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-27 06:41:22 -07:00
16dc4052b0
Fix pre-commit on main ( #23747 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-27 06:39:48 -07:00
8dd2baa597
Add vLLM Korea Meetup in the README.md and meetups.md ( #23746 )
...
Signed-off-by: rebel-hongseok <hongseok@rebellions.ai >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-27 06:25:49 -07:00
5eeef1b908
[Model] Explicit default_pooling_type interface ( #23736 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-27 13:24:09 +00:00
704432af3c
[V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-based models ( #23716 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-27 12:51:54 +00:00
a403d0fa41
[Misc] Remove unnecessary _send_reconfig_message() in core_client.py ( #23127 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-27 05:50:47 -07:00
8c13820f0b
[Bugfix] Fix task field initialization when PYTHONOPTIMIZE is enabled ( #23718 )
...
Signed-off-by: cndoit18 <cndoit18@outlook.com >
2025-08-27 12:42:20 +00:00
9d30de4469
[model] Support MiniCPM-V 4.5 ( #23586 )
...
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Signed-off-by: Xin Yang <xyangx@amazon.com >
Signed-off-by: Abatom <abzhonghua@gmail.com >
Signed-off-by: chzhang <chaojun.zhang@intel.com >
Signed-off-by: Pate Motter <patemotter@google.com >
Signed-off-by: Terrencezzj <terrence@cohere.ai >
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Signed-off-by: siyuanf <siyuanf@nvidia.com >
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com >
Signed-off-by: Zijing Liu <liuzijing2014@users.noreply.github.com >
Signed-off-by: jiabin.00 <jiabin.00@bytedance.com >
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: tc-mb <157115220+tc-mb@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Huy Do <huydhn@gmail.com >
Signed-off-by: Matúš Námešný <matus.namesny@ameria.com >
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: oye93 <en.ouyang93@outlook.com >
Signed-off-by: Julien Lin <jullin@nvidia.com >
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Signed-off-by: Tianyu Li <tianyu.li@arm.com >
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Zerohertz <ohg3417@gmail.com >
Signed-off-by: Hyogeun Oh (오효근) <ohg3417@gmail.com >
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
Signed-off-by: Federico <65908512+coval3nte@users.noreply.github.com >
Signed-off-by: Zixuan Zhang <zixuanzhang@bytedance.com >
Signed-off-by: wuhang <wuhang6@huawei.com >
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
Signed-off-by: Wei Wei <wwei6@meta.com >
Signed-off-by: Yiheng Xu <charlesyihengxu@gmail.com >
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Zhonghua Deng <abzhonghua@gmail.com >
Co-authored-by: Chaojun Zhang <chaojun.zhang@intel.com >
Co-authored-by: Pate Motter <p@temotter.com >
Co-authored-by: Terrence Zhao <32208165+Terrencezzj@users.noreply.github.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: weiliang <weiliangl@nvidia.com >
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Zijing Liu <liuzijing2014@users.noreply.github.com >
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com >
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Raghavan <oneraghavan@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Co-authored-by: Huy Do <huydhn@gmail.com >
Co-authored-by: Matúš Námešný <matus@namesny.com >
Co-authored-by: Guillaume Calmettes <gcalmettes@scaleway.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: En Ouyang <en.ouyang93@outlook.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
Co-authored-by: nvjullin <jullin@nvidia.com >
Co-authored-by: Didier Durand <2927957+didier-durand@users.noreply.github.com >
Co-authored-by: TianyuLi0 <116711075+TianyuLi0@users.noreply.github.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Yuekai Zhang <zhangyuekai@foxmail.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: Hyogeun Oh (오효근) <ohg3417@gmail.com >
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
Co-authored-by: Federico <65908512+coval3nte@users.noreply.github.com >
Co-authored-by: zixuanzhang226 <zixuanzhang@bytedance.com >
Co-authored-by: wuhang <wuhang6@huawei.com >
Co-authored-by: yzds <41983536+youzhedian@users.noreply.github.com >
Co-authored-by: hongchao <hongchao@msh.team >
Co-authored-by: czhu-cohere <conway.zhu@cohere.com >
Co-authored-by: Wei <weiweinpu@gmail.com >
Co-authored-by: Yiheng Xu <charlesyihengxu@gmail.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Chenheli Hua <huachenheli@outlook.com >
Co-authored-by: CSWYF3634076 <58356743+CSWYF3634076@users.noreply.github.com >
2025-08-27 05:38:00 -07:00
1f7a9c95e4
[Docs] Fix a 1-2-3 list and style issues in tpu.md ( #23729 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-08-27 05:37:52 -07:00
8f0d7eaea8
[XPU] Fix OOM issue for data parallel with Ray backend ( #22500 )
...
Signed-off-by: Fanli Lin <fanli.lin@intel.com >
Signed-off-by: Fanli Lin <fanli0116@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-27 19:57:38 +08:00
e03940762b
[CI/Build] Reduce LoRA layer test cases ( #23721 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-27 10:59:35 +00:00
11eddf02f0
[FlashInfer] Cache hyper params in metadata builder ( #23732 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-27 03:45:04 -07:00
04ff1e43fb
[Misc] Move CpuGpuBuffer to vllm/v1/utils.py ( #23728 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-27 03:25:00 -07:00
6578e87365
Optimize input preparation for FlashInfer [2/N] ( #23174 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-27 02:52:45 -07:00
5bd9f84158
[Docs] Fix an admonition important ( #23726 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-08-27 02:50:09 -07:00
91e382c935
[CI/Build] Remove redundant register in model init tests ( #23715 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-27 08:11:15 +00:00
6446677839
[XPU]fix cuda event used in XPU model runner ( #23708 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-27 07:27:14 +00:00
69244e67e6
[Core] Use key-only cache for BaseMultiModalProcessor ( #23018 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-27 14:19:13 +08:00
8dbf6ed7be
[Bugfix] Fix parse error when a config.yaml config value is a list ( #23528 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-27 05:54:39 +00:00
9de25c294b
[CI/Build] Remove redundant LoRA model tests ( #23706 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-27 05:51:50 +00:00
fce10dbed5
[XPU] Add xpu torch.compile support ( #22609 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-27 05:33:27 +00:00
d272415e57
[Quantization] Expand compressed-tensors MoE matching logic to support NFP4 + FP8 MoEs ( #22674 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2025-08-27 05:00:21 +00:00
142ac08030
[Frontend] Optimize beam search performance by limiting concurrency ( #23599 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-27 04:59:14 +00:00
3210264421
[Frontend] Add --log-error-stack to print stack trace for error response ( #22960 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-27 04:58:59 +00:00
644d57d531
[Model] Add Ernie4.5 VL Model Support ( #22514 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2025-08-26 21:02:55 -07:00
c905684cfe
[Core] Asynchronous h2d in merge_multimodal_embeddings via pinned memory. ( #23686 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-08-26 20:05:34 -07:00
786835807b
[Bugfix]: Qwen3 Coder Tool Parser ( #23099 )
...
Signed-off-by: Yiheng Xu <charlesyihengxu@gmail.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
2025-08-26 19:58:32 -07:00
fecbb7c782
[Bugfix][gpt-oss] passing the cache config in gpt-oss ( #23613 )
...
Signed-off-by: Wei Wei <wwei6@meta.com >
2025-08-27 02:54:23 +00:00
6dab89b8ec
[Docs] Fix math rendering in docs ( #23676 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 18:47:08 -07:00
de02b07db4
[Bugfix] Lazy import gpt_oss_triton_kernels_moe for mxfp4 ( #23678 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-27 09:34:57 +08:00
eb1995167e
[gpt-oss] Enable unit test for response API harmony integration ( #23533 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-26 18:23:26 -07:00
2c2b140ae8
[quantization] use channel scales for w4a8 + misc fixes ( #23570 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-08-26 18:23:23 -07:00
c7c80af084
fix pynccl reduce_scatter ( #23648 )
...
Co-authored-by: hongchao <hongchao@msh.team >
2025-08-26 18:21:11 -07:00
6891205b16
[Feature][Responses API] Support MCP tool in background mode ( #23494 )
...
Signed-off-by: wuhang <wuhang6@huawei.com >
2025-08-27 01:06:58 +00:00
b1625dbe9c
feat: add triton fused moe config for GLM-4.5-Air-FP8 on B200 ( #23695 )
...
Signed-off-by: Zixuan Zhang <zixuanzhang@bytedance.com >
2025-08-26 18:06:10 -07:00
585e0bde36
[Bugfix] UnboundLocalError when GptOss reasoning specified ( #23054 )
...
Signed-off-by: Federico <65908512+coval3nte@users.noreply.github.com >
2025-08-27 00:29:52 +00:00
714872f1a9
[Compile] Fix Cmake Warning ( #23689 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-26 23:48:32 +00:00
5f1af97f86
[V1] [Hybrid] Enable Full CUDA graph by default for hybrid models in V1 ( #22594 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-26 23:28:55 +00:00
c3b0fd1ee6
[V1][P/D]P2pNcclConnector supports flashinfer ( #23536 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-26 22:56:16 +00:00
6421b66bf4
[Docs] Move quant supported hardware table to README ( #23663 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 22:26:46 +00:00
2f13319f47
Enhance the pre-notification policy ( #23532 )
...
Signed-off-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
2025-08-26 20:41:36 +00:00
d696f86e7b
[doc] Hybrid KV Cache Manager design doc ( #22688 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 20:19:05 +00:00
9816b81f5f
[Model] Enable video support for InternVL3.5 models ( #23658 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-26 19:46:52 +00:00
c37c0af990
[Misc] Fix comments in tests/kernels/quantization ( #23675 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-26 19:31:20 +00:00
9715f7bb0f
[Bugfix] Fix incorrect original shape in hashing ( #23672 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-08-26 19:01:25 +00:00
98aa16ff41
[v1] Add cross-attention KV cache support for encoder-decoder models ( #23664 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-26 18:49:06 +00:00
227e231b55
[Docs] [V1] [Hybrid] Update docs to remove FlashInfer constraint for hybrid models ( #23665 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-26 18:33:16 +00:00
730d0ac8b9
[Docs] Fix warnings in mkdocs build ( #23649 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
Signed-off-by: Hyogeun Oh (오효근) <ohg3417@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 18:19:23 +00:00
9b0187003e
[Bugfix] Fix cuda event usage with CPU model runner ( #23643 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-26 17:10:42 +00:00
44ac25eae2
[CI] [Doc]: Add GH Action for auto labeling issues with rocm tag ( #20988 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-26 16:20:13 +00:00
7ea22e42d5
[Misc] Add override for allreduce fusion thresholds ( #23639 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
2025-08-26 15:53:04 +00:00
9d4183dd2e
[model] support qwen2audio embedding input ( #23625 )
...
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-26 23:48:08 +08:00
513298f1b4
[Bugfix] fix bf16 multimodal model hash ( #23623 )
...
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-26 23:47:50 +08:00
379f828fba
[Docs] Reduce requirements for docs build ( #23651 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 15:43:28 +00:00
1fdc732419
[ROCm] Starting to add AMD code reviewers for ROCm components ( #23496 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2025-08-26 07:32:37 -07:00
f58675bfb3
[CPU] add cpu fused moe pytorch native implementation ( #23146 )
...
Signed-off-by: Tianyu Li <tianyu.li@arm.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-08-26 14:09:17 +00:00
7c04779afa
[Doc]: fix various spelling issues in multiple files ( #23636 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-26 14:05:29 +00:00
f66673a39d
[Kernel] Added flashinfer fp8 per-tensor gemms ( #22895 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-26 06:54:04 -07:00
b78bed1bc5
[Hardware][Mac] Fix the installation failure for Apple Silicon (CPU) ( #23565 )
...
Signed-off-by: oye93 <en.ouyang93@outlook.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-08-26 13:04:25 +00:00
164b2273c8
[Docs] Fix broken links to docs/api/summary.md ( #23637 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 13:00:18 +00:00
2b4fc9bd9b
Support FlashAttention Backend for Hybrid SSM Models ( #23299 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-26 12:41:52 +00:00
ebd5a77bb5
feat: add usage to TranscriptionResponse (text and json response_format) ( #23576 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-08-26 05:26:26 -07:00
384dd1b0a8
[Bugfix] Add missing enable_log_outputs parameter to init_app_state function ( #23634 )
...
Signed-off-by: Matúš Námešný <matus.namesny@ameria.com >
2025-08-26 12:13:15 +00:00
fdeb3dac13
[Model] fix DeepSeek e_score_correction_bias dtype to fp32 ( #23640 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-26 20:09:47 +08:00
d52358c1e0
[Perf] Remove duplicated NVFP4 blockscales to save memory ( #23379 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-26 19:16:33 +08:00
6ace2f72b0
Fix writing benchmark results with tuple keys ( #23633 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-08-26 19:16:09 +08:00
b00e69f8ca
Fix nits from #20059 ( #23548 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 03:27:20 -07:00
50fede6634
[V1] Enable V1 for compute capability < 8.0 + FP32 ( #23614 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-26 03:00:18 -07:00
b5d34af328
[Bugfix] Fix scheduling when repeated images in one request ( #23544 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com >
2025-08-26 09:46:28 +00:00
9b5f64238f
[Bugfix] Fix Qwen25VL packed_modules_mapping ( #23604 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-26 01:09:14 -07:00
ff77764f86
Fix CLI parameter documentation inconsistency in pooling_models.md ( #23630 )
2025-08-26 01:05:37 -07:00
bfc1edc9f5
[Docs] Fix titles for multi-file examples that are rendered in the docs ( #23573 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 00:16:44 -07:00
3ecbb14b81
[Benchmarks] add benchmark for embedding models ( #23000 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-25 23:57:08 -07:00
7d67a9d9f9
[mypy] Fix incorrect type hint for EAGLE3 support ( #23617 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 23:50:17 -07:00
959783fb99
[fix] fix seed-oss-parser ( #23560 )
...
Signed-off-by: jiabin.00 <jiabin.00@bytedance.com >
2025-08-25 23:16:36 -07:00
ce0e9dbd43
[CI/Build] Fix typo in #23561 ( #23616 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 23:13:03 -07:00
b395b3b0a3
[Disagg][Perf] Use CUDA event sync instead of blocking tolist to avoid unintentional copy ops blocking across different CUDA streams, improving disagg TTIT/TTFT ( #22760 )
...
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com >
Signed-off-by: Zijing Liu <liuzijing2014@users.noreply.github.com >
2025-08-25 21:06:00 -07:00
6fad29b11b
Remove graph_pool as member of VllmBackend and argument to CUDAGraphWrapper ( #23385 )
...
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-08-25 19:34:15 -07:00
6fd45e7b8a
[CI/Build] Use vLLM client's user agent to fetch images ( #23561 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 19:34:12 -07:00
56dcf4e7e9
[Bug] Fix DeepGEMM Env Control ( #23591 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-25 18:41:21 -07:00
ae067888d6
Update Flashinfer to 0.2.14.post1 ( #23537 )
...
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Signed-off-by: siyuanf <siyuanf@nvidia.com >
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-25 18:30:44 -07:00
906e461ed6
[CI Fix] Pin deepep and pplx tags in tools/ep_kernels/, gate multigpu tests ( #23568 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-25 18:29:00 -07:00
2a97ffc33d
[Misc] Add release note draft to PR template ( #23598 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-08-25 16:44:51 -07:00
efc88cf64a
[Misc] Simplify FlashInfer attention metadata ( #23585 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-08-25 15:42:29 -07:00
7b6a837275
[Docs] Update Documentation of Cohere Command-A Models ( #23584 )
...
Signed-off-by: Terrencezzj <terrence@cohere.ai >
Signed-off-by: Abatom <abzhonghua@gmail.com >
Co-authored-by: Zhonghua Deng <abzhonghua@gmail.com >
2025-08-25 21:53:52 +00:00
c34c82b7fe
[TPU][Bugfix] Fixes prompt_token_ids error in tpu tests. ( #23574 )
...
Signed-off-by: Pate Motter <patemotter@google.com >
2025-08-25 14:29:16 -07:00
8a044754bd
[XPU] Delay BF16 check to worker init for spawn compatibility ( #22979 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-08-25 13:09:26 -07:00
9188ae7cb5
[Bugfix][V1][P/D]Fix the issue where repeated requests for the same input produce abnormal outputs for P2pNcclConnector ( #23403 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-08-25 12:57:08 -07:00
8a3cd90af5
[Kernel] Add fused grouped_topk kernel for MoE ( #23274 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-08-25 11:47:52 -07:00
2a167b2eeb
[test][RL] Add sleep level 2 test and fix reload with sleep mode ( #23521 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-26 00:25:52 +08:00
0ff902f3b4
[Refactor] Refactor persistent buffers with CpuGpuBuffer ( #23515 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-25 08:44:48 -07:00
a9082a4d14
[Bugfix] Fix Qwen3 MoE GPTQ inference ( #23490 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-25 06:40:20 -07:00
e0329ed4b4
Updates to Flex + vLLM integration ( #21416 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-08-25 09:32:42 -04:00
6879cd80ae
[Refactor] Pass tokenizer explicitly instead of binding to prompt update ( #23542 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 06:31:57 -07:00
e269be2ba2
[Doc] Add caution for API server scale-out ( #23550 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 06:14:15 -07:00
5c4b6e66fe
[Attention] Unify mamba and attention backend selection ( #23171 )
...
Signed-off-by: Ayush Satyam <ayushsatyam146@gmail.com >
2025-08-25 09:09:36 +00:00
d0a4a3f645
[misc] add shanghai meetup ( #23535 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-08-25 17:00:03 +08:00
ebafb0936d
[Bugfix] Allow dynamic number of patches for llava_onevision ( #23525 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 08:34:54 +00:00
0cb7b065c3
Feature/benchmark/random mm data/images ( #23119 )
...
Signed-off-by: breno.skuk <breno.skuk@hcompany.ai >
2025-08-25 01:28:35 -07:00
2da02dd0d8
[Fix] DeepSeek V3.1 tool parser error message ( #23492 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-08-25 00:56:39 -07:00
d765cf01fe
[Core][Multimodal] Track encode cache entries by mm_hash and enable embedding sharing between requests ( #22711 )
...
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-08-25 00:41:17 -07:00
712d0f88d8
[Refactor] Dynamic target and content for prompt updates ( #23411 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-24 23:39:58 -07:00
49ab23b3cc
[gpt-oss] use reasoning channel for reasoning text in serving_chat ( #22920 )
...
Signed-off-by: Yu Guo <yuguo@meta.com >
2025-08-25 06:29:34 +00:00
c9abb10489
[Bugfix] Fix Dense module loading for sentence-transformers embedding models (simplified V2) ( #23408 )
...
Signed-off-by: FFFfff1FFFfff <yifanli0919@gmail.com >
2025-08-25 05:39:24 +00:00
787cdb3829
Migrate DonutImagePixelInputs to TensorSchema ( #23509 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-25 05:02:15 +00:00
a5203d04df
Migrate skyworkr1v inputs to TensorSchema ( #23499 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-25 04:43:21 +00:00
99f8094400
Migrate tarsier inputs to TensorSchema ( #23500 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-25 04:42:36 +00:00
170e8ea9ea
[Misc] Unified linear print info ( #23516 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-24 20:13:51 -07:00
a71e4765cc
[Bugfix] Fix Qwen2.5-VL quantized model weights loading ( #23512 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2025-08-25 10:40:22 +08:00
39971db3aa
Frontend: Adding LM Format Enforcer support to V1 engine ( #22564 )
...
Signed-off-by: Noam Gat <noamgat@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-24 19:31:22 -07:00
504d914314
[Perf] Add Triton config for DeepSeek V3 FP8 EP32 H200 ( #23504 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-08-24 18:06:35 -07:00
47455c424f
[Doc]: fix various typos in multiple files ( #23487 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-25 00:04:04 +00:00
c7fc6b1354
fix incompatibility with non cuda platform for nvfp4 ( #23478 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-08-24 15:35:41 -07:00
ad78868450
[Misc] Remove unused slot_mapping buffer ( #23502 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-24 14:03:36 -07:00
e2db1164a1
[Model] Enable BLOOM on V1 ( #23488 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-24 13:30:47 +00:00
416f05929a
[New Model]Donut model ( #23229 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-08-24 12:52:24 +00:00
5e021b4981
(Misc): add missing test for zero truncation size. ( #23457 )
...
Signed-off-by: teekenl <teekenlau@gmail.com >
2025-08-24 18:12:47 +08:00
1b9b16649c
[Misc] update dict parse to EPLBConfig from json dumps to dict unpacking ( #23305 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-24 08:06:34 +00:00
e76e233540
[kernel] Support W4A8 on Hopper ( #23198 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-08-24 06:18:04 +00:00
a75277285b
Migrate Paligemma inputs to TensorSchema ( #23470 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-24 04:56:56 +00:00
9dc30b7068
[Bugfix] Add strong reference to CUDA pluggable allocator callbacks ( #23477 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Eric Marcus <eric.marcus@kaiko.ai >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-08-24 12:56:17 +08:00
053278a5dc
Migrate Pixtral inputs to TensorSchema ( #23472 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-24 04:55:53 +00:00
c55c028998
[gpt-oss] Streaming Output for Python Tool ( #23409 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-24 04:42:38 +00:00
65197a5fb3
[Misc] Modify CacheConfig import ( #23459 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-23 06:05:27 +00:00
b8f17f5d98
Support DeepSeek-V3.1 tool call ( #23454 )
...
Signed-off-by: Xu Wenqing <xuwq1993@qq.com >
2025-08-23 05:50:16 +00:00
d9a55204ba
fix(tests): Correct unreachable assertion in truncation test ( #23425 )
...
Signed-off-by: AzizCode92 <azizbenothman76@gmail.com >
2025-08-23 05:23:54 +00:00
b4e9fd811f
Revert "[PERF] Use faster way of decode in tokenizer: avoid useless list-to-list conversion ( #20000 )" ( #23396 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-23 04:16:48 +00:00
308fa287a8
Add glm4.5v tp2,4 fp8 config on H100_80GB ( #23443 )
...
Co-authored-by: Chenxi Yang <cxyang@meta.com >
2025-08-23 02:54:19 +00:00
fa78de9dc3
Quantization: support FP4 quantized models on AMD CDNA2/CDNA3 GPUs ( #22527 )
...
Signed-off-by: feng <fengli1702@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-22 20:53:21 -06:00
f6818a92cb
[UX] Move Dockerfile DeepGEMM install to tools/install_deepgemm.sh ( #23360 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-22 20:52:50 -06:00
23c939fd30
[Model] Support DP for ViT on MiniCPM-V-4 ( #23327 )
...
Signed-off-by: ycyaw66 <497410282@qq.com >
Co-authored-by: ycyaw66 <497410282@qq.com >
2025-08-23 02:14:41 +00:00
add1adfec7
[BugFix] Fix MinPLogitsProcessor.update_states() ( #23401 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-23 08:22:11 +08:00
c80c53a30f
[BugFix] Fix batch updates for pooling models ( #23398 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-23 08:20:41 +08:00
24d0c9e6ed
[NVIDIA][torch.compile] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel ( #22703 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-08-22 22:09:05 +00:00
cc7ae5e7ca
[BugFix][AMD][Quantization] Fix torch.compile issue where wvSplitKQ is not called when it should be when using a quantized FP8 model ( #22281 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-08-22 21:47:57 +00:00
0313cf854d
[PERF] PyTorch Symmetric Memory All-Reduce ( #20759 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-22 15:39:08 -06:00
0483fabc74
[CI/Build] add EP dependencies to docker ( #21976 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-22 13:34:40 -07:00
da65bec309
add an env var for path to pre-downloaded flashinfer cubin files ( #22675 )
2025-08-22 19:25:45 +00:00
4645024d3a
[Quantization] Allow GGUF quantization to skip unquantized layer ( #23188 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-22 13:04:22 -06:00
cd7a3df26f
[Bugfix] Fix broken Florence-2 model ( #23426 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-08-22 17:50:52 +00:00
32d2b4064f
[Model] Add Ovis2.5 PP support ( #23405 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-22 17:46:34 +00:00
22cf679aad
[Doc]: fix various typos in multiple files ( #23179 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-22 10:38:46 -07:00
b6d7d34fc6
Add unit tests for batched guided and non-guided requests ( #23389 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-22 10:31:24 -07:00
341923b982
fix(tests): Ensure reliable CUDA cache clearing in MoE test ( #23416 )
...
Signed-off-by: AzizCode92 <azizbenothman76@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-22 17:20:59 +00:00
424fb7a5d2
[BugFix] Fix the issue where image embeddings were incorrectly split.… ( #23366 )
...
Signed-off-by: bppps <bpppsaka@gmail.com >
Co-authored-by: zouyu.zzx <zouyu.zzx@alibaba-inc.com >
Co-authored-by: bppps <bpppsaka@gmail.com >
2025-08-22 16:56:46 +00:00
88491c1b6b
[Speculators][Speculative Decoding] Fix Qwen 2 Eagle3 Support ( #23337 )
2025-08-22 16:39:19 +00:00
613a23b57f
[Bugfix]: Installing dev environment due to pydantic incompatible version ( #23353 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2025-08-22 16:22:29 +00:00
51a215300b
[Fix] Bump triton version in rocm-build requirements ( #21630 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
2025-08-22 15:13:39 +00:00
ebe14621e3
[Bug fix] Dynamically setting the backend variable for genai_perf_tests in the run-nightly-benchmark script ( #23375 )
...
Signed-off-by: Naman Lalit <nl2688@nyu.edu >
2025-08-22 15:12:28 +00:00
325aa3dee9
[Misc] local import code clean ( #23420 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-22 14:01:35 +00:00
a073be6d87
[Doc] Update the doc for log probs + prefix caching ( #23399 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-22 13:20:39 +00:00
695e7adcd2
[misc] Remove outdated comment about runai_model_streamer ( #23421 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
2025-08-22 13:08:53 +00:00
281710ef9a
[Attention] Allow V1 flash_attn to support cross-attention ( #23297 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-22 12:10:16 +00:00
808d2e9aa0
[Misc] Move M-RoPE init logic to _init_mrope_positions ( #23422 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-22 03:07:22 -07:00
285178b3b8
[V0 Deprecation] Remove V0 LoRA test ( #23418 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-22 09:56:51 +00:00
88016c372a
[Bugfix] Fix pooling models on CPU backend ( #23392 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-22 09:47:17 +00:00
998720859c
Migrate MiniCPMOAudioInputs to TensorSchema ( #21847 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-22 16:43:29 +08:00
0ba1b54ac6
[gpt-oss] add input/output usage in responses api when harmony context is leveraged ( #22667 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-08-22 08:32:24 +00:00
53415653ff
[P/D][Nixl] Make kv cache register compatible with hybrid memory allocator ( #23079 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-08-21 22:30:48 -07:00
17373dcd93
[Attention] Refactor AttentionMetadata Preparation for Encoder-only Models ( #23154 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-22 05:05:59 +00:00
5964069367
[New Model] Add Seed-Oss model ( #23241 )
...
Signed-off-by: jiabin.00 <jiabin.00@bytedance.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-22 04:58:10 +00:00
de9c085e17
[Misc] Add gemma3 chat template with pythonic-style function calling ( #17149 )
...
Signed-off-by: Philip Chung <philip.f.chung@gmail.com >
2025-08-21 21:06:50 -07:00
111692bb8c
[CI] Add end-to-end V1 min_tokens test coverage ( #22495 )
...
Signed-off-by: Arjun Reddy <189282188+arjunbreddy22@users.noreply.github.com >
Co-authored-by: Arjun Reddy <189282188+arjunbreddy22@users.noreply.github.com >
2025-08-21 22:04:07 -06:00
394591e343
[Feature] Enable DeepGEMM Linear on B200; 1.5% E2E throughput improvement ( #23351 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-21 21:01:08 -07:00
3ac849665d
[CI/Build] Skip Idefics3 and SmolVLM generation test again ( #23356 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-22 03:39:46 +00:00
0b9cc56fac
Migrate MllamaImagePixelInputs to TensorSchema ( #22020 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-22 11:28:49 +08:00
8896eb72eb
[Deprecation] Remove prompt_token_ids arg fallback in LLM.generate and LLM.embed ( #18800 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-22 10:56:57 +08:00
19fe1a0510
[Kernel] Add FP8 support with FlashMLA backend ( #22668 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-08-22 02:26:32 +00:00
480bdf5a7b
[Core] Support custom executor qualname ( #23314 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-22 09:40:54 +08:00
5368f76855
[Feature][Responses API] Support logprobs(non-stream) ( #23319 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-08-21 23:09:16 +00:00
8ef6b8a38c
Always use cache mounts when installing vllm to avoid populating pip cache in the image. Also remove apt cache. ( #23270 )
...
Signed-off-by: Valentyn Tymofieiev <valentyn@google.com >
2025-08-21 18:01:03 -04:00
3bbe11cc13
[Perf] Small optimizations for silu_mul_fp8_quant_deep_gemm ( #23265 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-21 17:56:15 -04:00
c5041f899f
[CI] improve pr comments bot ( #23380 )
2025-08-21 14:49:03 -07:00
8b5fe6eb51
[CI] Clean up actions: remove helm, publish workflows and improve pr … ( #23377 )
2025-08-21 14:29:04 -07:00
800349c2a5
[Structured Outputs] Refactor bitmask construction into get_grammar_bitmask ( #23361 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-21 20:53:33 +00:00
044931f97b
Make sure that vectorize_with_alignment produced vectorized global loads ( #23182 )
2025-08-21 20:06:54 +00:00
1d353b6352
[Core] Always use tensor cores for Flashinfer Decode Wrapper ( #23214 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-08-21 16:02:11 -04:00
3496274663
[Misc] Convert VLLM_TORCH_PROFILER_DIR path to absolute ( #23191 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-21 15:49:09 -04:00
8a19303173
[BugFix][gpt-oss] Fix Chat Completion with Multiple Output Message ( #23318 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-21 10:31:11 -07:00
603fbbbce0
[Misc] Misc code cleanup/simplification ( #23304 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-21 17:22:55 +00:00
10f535c086
[Bugfix] Fix port conflict by obtaining a list of open ports upfront ( #21894 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-08-21 10:22:18 -07:00
48bfb0c9b7
[Bug] Fix R1 Accuracy 0 Bug ( #23294 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-21 13:11:28 -04:00
f8ce022948
add tg-mxfp4-moe-test ( #22540 )
...
Signed-off-by: siyuanf <siyuanf@nvidia.com >
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-21 17:05:47 +00:00
0278f1ac3a
Fix nvfp4 swizzling ( #23140 )
...
Signed-off-by: yiliu30 <yi4.liu@intel.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-08-21 16:54:50 +00:00
a482e4e769
Migrate MolmoImageInputs to TensorSchema ( #22022 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-21 16:54:08 +00:00
e0b056e443
[ci/build] Fix abi tag for aarch64 ( #23329 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-08-21 23:32:55 +08:00
79f05e4436
[Multimodal] Always enable hashing mm data ( #23308 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-21 07:23:28 -07:00
f8daddcc4c
[Bugfix] set system_message in phi4mini chat template ( #23309 )
...
Signed-off-by: zhuangqh <zhuangqhc@gmail.com >
2025-08-21 14:22:39 +00:00
c8e33c72c6
[V1] Remove unnecessary check for main thread ( #23298 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-08-21 14:08:35 +00:00
d70a16625d
[Performance] V1 Pooling Models E2E Performance Optimization ( #23162 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-21 13:26:09 +00:00
5cc54f7c5b
[Doc] Fix batch-level DP example ( #23325 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-08-21 06:16:38 -07:00
0c6e40bbaa
[Refactor] Simplify code for MM budget ( #23310 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-21 08:00:16 +00:00
2e2000f352
[Model] Add LFM2 architecture ( #22845 )
...
Signed-off-by: Paul Pak <paulpak58@gmail.com >
2025-08-21 09:35:07 +02:00
31282401b6
[BugFix] Fix Python 3.9 Support ( #23306 )
...
Signed-off-by: Jared O'Connell <46976761+jaredoconnell@users.noreply.github.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-20 23:23:56 -07:00
0c31e28e95
[Bugfix] Fix extra whitespace in strings caused by newline ( #23272 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 22:03:00 -07:00
f571ff8eb6
[Sampler] Support returning final logprobs ( #22387 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-20 21:28:32 -07:00
f64ee61d9e
[CI] Block the cu126 wheel build while broken ( #23285 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-21 04:21:05 +00:00
8993073dc1
[CI] Delete images older than 24h. ( #23291 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-08-20 21:15:20 -07:00
655a09f653
[Model][VLM] Support R-4B Model ( #23246 )
...
Signed-off-by: yannqi <yannqi@qq.com >
Signed-off-by: 杨奇(yann qi) <51905299+yannqi@users.noreply.github.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: yannqiyang <yannqiyang@tencent.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-21 04:08:52 +00:00
f94bf9b924
[Compile] Fix Compile Warning SM100 Cutlass MLA ( #23287 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-21 03:09:39 +00:00
3663870c72
[V1][Mamba1] - Full CUDA and Piecewise CUDA Graphs Support ( #23035 )
...
Signed-off-by: asafg <asafg@ai21.com >
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
Co-authored-by: asafg <asafg@ai21.com >
2025-08-20 20:08:51 -07:00
2461d9e562
[CI/Build] Split out mm processor tests ( #23260 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 20:05:20 -07:00
7be5d113d8
[CPU] Refactor CPU W8A8 scaled_mm ( #23071 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-21 09:34:24 +08:00
b029de9902
[Optimization] Make new_block_ids None if empty ( #23262 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-08-20 18:25:56 -07:00
bbea1cefdd
[CI Bugfix] Fix CI by fully removing --enable-prompt-adapter ( #23284 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-20 17:18:12 -07:00
f5aa307d77
Remove duplicate entry in vllm.attention.__all__ ( #23296 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-20 17:14:59 -07:00
4b795020ed
[EP] Add logging for experts map ( #22685 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-20 23:46:06 +00:00
c86af22f31
[Fix] remove is_marlin param in benchmark_moe ( #23286 )
2025-08-20 22:04:21 +00:00
10cc12ba66
Feature/mla tests ( #23195 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-08-20 21:46:47 +00:00
a4fbb32fab
Remove chunked_prefill_enabled flag in V1 MLA ( #23183 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-08-20 21:43:17 +00:00
1b125004be
[misc] fix multiple arch wheels for the nightly index ( #23110 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-08-20 14:15:34 -07:00
4fbda0b20c
[Feature] use --eplb_config to set eplb param ( #20562 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: rongfu.leng <lenronfu@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-20 14:07:28 -07:00
4e51fa8cba
Do not use eval() to convert unknown types ( #23266 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-20 13:28:30 -07:00
bf7c99dfc4
[Perf] Speed up function _convert_tokens_to_string_with_added_encoders by 13.7x ( #20413 )
...
Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com >
Signed-off-by: Aseem Saxena <aseem.bits@gmail.com >
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com >
2025-08-20 13:17:11 -07:00
b95697d731
[Frontend] improve error logging of chat completion ( #22957 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-20 13:03:37 -07:00
582bbe6bd7
[Fix] correct tool_id for kimi-k2 when use tool_choice=required ( #21259 )
...
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
2025-08-20 12:59:54 -07:00
0cdbf5e61c
[Kernel/Quant] Remove the original marlin format and qqq ( #23204 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-20 15:13:36 -04:00
ebe56a0064
Small fix for Command-A-Vision ( #23268 )
...
Signed-off-by: donglu <donglu@cohere.com >
2025-08-20 18:15:18 +00:00
f77a0802b7
Limit HTTP header count and size ( #23267 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2025-08-20 17:57:37 +00:00
c4477f55e5
Migrate Mistral3ImagePixelInputs to TensorSchema ( #21945 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-20 17:37:29 +00:00
dfd2382039
[torch.compile] Support conditional torch.compile per module ( #22269 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-20 16:52:59 +00:00
3b11b26b50
[FIXBUG ] Allow disabling rocm_aiter_fa backend for ROCm GPUs not compatible with AITER ( #22795 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-08-20 09:08:29 -07:00
d6d13bd49e
[Misc] Add max_seq_len to CommonAttentionMetadata ( #23216 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-20 09:05:29 -07:00
5efd6905bc
[CLI][Doc] Formalize --mm-encoder-tp-mode ( #23190 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 23:42:28 +08:00
b17109beea
[Kernel] CUTLASS MoE FP8: Integrate cuda moe permute/unpermute ( #23045 )
...
Signed-off-by: Shixian Cui <shixian@amazon.com >
2025-08-20 10:35:26 -04:00
4449235843
[Bugfix] Ensure correctness of HCXVision processing ( #23254 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 14:19:30 +00:00
38217877aa
[Fix] fix offline env use local mode path ( #22526 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-20 13:34:49 +00:00
c6d80a7a96
[Model] Improve olmo and olmo2 ( #23228 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-20 12:47:05 +00:00
7cd17e22d7
[Model][V1] Support Ernie MTP ( #22169 )
...
Signed-off-by: zhouchong <zhouchong03@baidu.com >
Co-authored-by: zhouchong <zhouchong03@baidu.com >
2025-08-20 20:41:55 +08:00
50df09fe13
Update to flashinfer-python==0.2.12 and disable AOT compile for non-release image ( #23129 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-20 08:05:54 -04:00
68fcd3fa73
[Bugfix] Ensure correctness of Cohere2Vision processing ( #23245 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 11:09:18 +00:00
83e69a09d6
[Model] Support deepseek with eagle ( #21086 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2025-08-20 19:01:31 +08:00
3aa8c10038
Fix missing quotes ( #23242 )
...
Signed-off-by: Shiming Zhang <wzshiming@hotmail.com >
2025-08-20 10:46:59 +00:00
103f1ec8d3
[Model] use autoWeightsLoader for gptoss ( #22446 )
...
Signed-off-by: calvin chen <wen.chen@dynamia.ai >
2025-08-20 10:16:27 +00:00
d983769c41
fix cuda graph ( #22721 )
...
Signed-off-by: fsx950223 <fsx950223@outlook.com >
2025-08-20 06:24:37 +00:00
8fd920924c
[BugFix] Fix stuck stats/metrics after requests are aborted ( #22995 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-20 13:50:29 +08:00
de7b67a023
[CI/Build] Sync multimodal tests ( #23181 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 05:06:42 +00:00
f729023272
[CI/Build] Also check DP in benchmarks throughput script ( #23038 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-20 04:09:27 +00:00
1a3079a15e
chore: support pytorch format in lora ( #22790 )
...
Signed-off-by: jaeeun.kil <rha3122@naver.com >
Signed-off-by: 길재은 <rha3122@naver.com >
2025-08-20 04:02:50 +00:00
941f56858a
Fix a performance comparison issue in Benchmark Suite ( #23047 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
Signed-off-by: Louie Tsai <louie.tsai@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
2025-08-20 03:14:32 +00:00
a634733f67
[Attention] Optimize make_local_attention_virtual_batches for Flash Attention ( #23185 )
...
Signed-off-by: linzebing <linzebing1995@gmail.com >
2025-08-20 02:57:47 +00:00
64ab3c7253
[Doc] Update V1 status of various pooling models ( #23189 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 10:33:41 +08:00
e58c5a9768
[Core] Add torch profiler CPU traces for AsyncLLM. ( #21794 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-08-20 02:32:47 +00:00
d46d417b58
[CI Perf] Only test bfloat16 for tests/compile/test_fusion_all_reduce.py ( #23132 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-19 20:18:52 -06:00
0167efe20d
[Core] Optimize scheduler request removal for single completions ( #21917 )
...
Signed-off-by: chiliu <chiliu@paypal.com >
Signed-off-by: chiliu <cliu_whu@yeah.net >
Co-authored-by: chiliu <chiliu@paypal.com >
2025-08-19 18:25:59 -07:00
c32e6ad1f6
[Quantization] Bump Compressed Tensors Version ( #23202 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-20 00:39:28 +00:00
1630cc8d0f
[Benchmarks] Add video inputs to ShareGPTDataset. ( #23199 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-08-19 23:42:31 +00:00
14e2b0730b
[BugFix] fix CUTLASS MLA full cudagraph ( #23200 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-19 22:17:08 +00:00
0f4f0191d8
[CI/Build] Replace lm-eval gsm8k tests with faster implementation ( #23002 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-19 15:07:30 -07:00
a38b8af4c3
[NVIDIA] Add SM100 Flashinfer Cutlass MoE fp8 backend ( #22357 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
2025-08-19 18:01:53 -04:00
21dce80ea9
[CI/Build] Add support for Python 3.13 ( #13164 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 13:49:34 -07:00
e61bac87ee
[Misc] Minor refactoring for FlashInfer backend ( #23147 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-19 13:11:51 -07:00
80141bbf2f
fix: use cache_salt for gpt-oss ( #23186 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
2025-08-19 18:12:25 +00:00
b94faf9d50
[Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 and modelopt. ( #23125 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-19 14:00:51 -04:00
5b5f350d67
[Misc] Enable yapf for FlashInfer backend ( #23193 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-19 10:33:47 -07:00
f7cf5b512e
[Frontend] Add /collective_rpc API endpoint ( #23075 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-19 17:29:32 +00:00
03d4235fd2
[Misc] Fix the benchmark's README and improve the error messages for the benchmark's argument checks ( #22654 )
...
Signed-off-by: tanruixiang <tanruixiang0104@gmail.com >
2025-08-19 10:18:51 -07:00
d6a1a20973
[CI/Build] Update transformers to v4.55.2 ( #23093 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-19 10:06:17 -07:00
a70d0bd0a3
Migrate LlavaOnevisionMultiInputs to TensorSchema ( #21844 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-19 17:02:02 +00:00
24f4d1a224
Add return_token_ids parameter to OpenAI API endpoints ( #22587 )
...
Signed-off-by: Yuge Zhang <scottyugochang@gmail.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-19 09:48:31 -07:00
4f510bc2a1
[Model] Removes redundant all-reduce operation in Qwen3MoeSparseMoeBlock ( #23169 )
...
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com >
2025-08-19 16:18:41 +00:00
1298c67795
[FEAT] [Performance] Enable DP for ViT in Qwen2.5VL ( #22742 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-19 15:25:57 +00:00
4d9c61993a
[Bugfix] Fix benchmark_moe.py ( #23177 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-19 13:39:40 +00:00
b87cb97a53
[Model] support new model ovis2.5 ( #23084 )
...
Signed-off-by: myselvess <244285088@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-19 13:12:59 +00:00
f856c33ce9
[Model] Add multi_label_classification support ( #23173 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-19 12:54:30 +00:00
03752dba8f
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel ( #21716 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-08-19 08:22:15 -04:00
40f26734b9
[Misc] Fix seq_lens for graph capture ( #23175 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-19 03:58:16 -07:00
2c3f557f08
[Doc] use power of 2 ( #23172 )
2025-08-19 03:16:23 -07:00
21bcc8263f
[Misc] Avoid accessing req_ids inside a loop ( #23159 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-19 09:39:38 +00:00
5bfe0dea7a
[bug fix] Fix llama4 spec decoding ( #22691 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2025-08-19 08:53:24 +00:00
31fd3265c8
[Bugfix] Fix broken Minimax-01-VL model ( #22116 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-19 08:49:29 +00:00
31436e8b4f
[Misc] Add request_id into benchmark_serve.py ( #23065 )
...
Signed-off-by: yangxia <yangxiast@gmail.com >
2025-08-19 08:32:18 +00:00
4efd43e9b4
Fix GLM-4.5V-FP8 numerical issue ( #22949 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 07:56:31 +00:00
3c8a787247
[Benchmark] Add flag --served-model-name to benchmark_serving_multi_turn ( #22889 )
...
Signed-off-by: daniels <daniels@pliops.com >
2025-08-19 07:48:07 +00:00
01a08739e0
[misc] split engine_model into json file for nsys profile tool ( #23117 )
...
Signed-off-by: Grace Ho <grho@nvidia.com >
Signed-off-by: Grace Ho <146482179+gracehonv@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 15:44:53 +08:00
fda9537c5e
[Model] Support Pipeline Parallelism for moonshotai/Kimi-VL-A3B-Thinking-2506 ( #23114 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 14:24:31 +08:00
90bbe0a5ad
[Log] Warning Once for Cutlass MLA ( #23137 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-18 23:24:16 -07:00
e75f342261
Migrate InternVLImagePixelInputs (in nemotron_vl.py) to TensorSchema ( #22023 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 13:48:26 +08:00
78dba404ad
[Hardware][IBM Z]Enable v1 for s390x and s390x dockerfile fixes ( #22725 )
...
Signed-off-by: Nikhil Suryawanshi <suryawanshin74@gmail.com >
2025-08-19 04:40:37 +00:00
e9d6a3db69
[TPU] make ptxla not imported when using tpu_commons ( #23081 )
...
Signed-off-by: Chengji Yao <chengjiyao@gmail.com >
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: Chengji Yao <chengjiyao@gmail.com >
2025-08-19 11:46:42 +08:00
a4454e9401
chore: disable enable_cpp_symbolic_shape_guards ( #23048 )
...
Signed-off-by: Xiao Liu <xiszishu@gmail.com >
2025-08-18 23:08:05 -04:00
14006840ea
[V0 Deprecation] Remove V0 FlashInfer attention backend ( #22776 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-18 19:54:16 -07:00
6603288736
[CI][V0 Deprecation] Removed V0 Only Chunked Prefill and Prefix Caching Tests ( #22871 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-18 17:39:01 -07:00
95e3095136
[Misc] Add @tdoublep as a maintainer of hybrid model and Triton-attention related code ( #23122 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-19 08:31:38 +08:00
c9b38be8aa
[Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT ( #23041 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-18 17:20:38 -07:00
0dd3f4f5ab
[Misc] Minor refactoring for prepare_inputs ( #23116 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-18 16:58:05 -07:00
498259ccce
Install tpu_info==0.4.0 to fix core dump for TPU ( #23135 )
2025-08-18 16:23:33 -07:00
6d25e3fd6e
Use Blackwell FlashInfer MXFP4 MoE by default if available ( #23008 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-18 15:25:49 -07:00
ac6eb49de3
fix: OpenAI SDK compat (ResponseTextConfig) ( #23126 )
...
Signed-off-by: breno.skuk <breno.skuk@hcompany.ai >
Signed-off-by: Breno Baldas Skuk <breno.skuk@hcompany.ai >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-18 15:22:59 -07:00
bf756321c7
[CI Bugfix] Pin openai<1.100 to unblock CI ( #23118 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-18 12:14:01 -07:00
0e3bb543f0
[Bugfix] Support compile for Transformers multimodal ( #23095 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2025-08-18 13:35:48 +00:00
569aefd134
chore: remove unnecessary patch_padding_side for the chatglm model ( #23090 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-18 12:32:13 +00:00
d3f71f1224
[Refactor] Get prompt updates earlier ( #23097 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-18 12:31:53 +00:00
5a30bd10d8
[Bugfix] fix IntermediateTensors equal method ( #23027 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-18 02:58:11 -07:00
27e8d1ea3e
[Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs ( #23053 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-18 09:52:00 +00:00
5c79b0d648
[XPU][CI]add xpu env vars in CI scripts ( #22946 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-18 09:47:03 +00:00
5f5664b3e4
[XPU] Fix compile size for xpu ( #23069 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-18 00:04:08 -07:00
89657a557c
[Misc] Fix backward compatibility from #23030 ( #23070 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-08-17 23:33:29 -07:00
08d5f7113a
[Misc] refactor function name ( #23029 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-17 22:16:21 -07:00
b2fd0b81e0
[Bugfix][CI] Machete kernels: deterministic ordering for more cache hits ( #23055 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2025-08-17 22:10:26 -07:00
9f1c642254
[Bugfix] fix Qwen2.5-Omni processor output mapping ( #23058 )
...
Signed-off-by: double7 <33449816+DoubleVII@users.noreply.github.com >
Co-authored-by: 杨森 <yangsen.double7@bytedance.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-17 22:09:11 -07:00
7be3a59d8e
[Misc] enhance static type hint ( #23059 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-17 22:09:08 -07:00
8ea0c2753a
[Misc] Minor code cleanup for _get_prompt_logprobs_dict ( #23064 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-17 18:16:03 -07:00
0fc8fa751a
fix: gptq marlin weight loading failure ( #23066 )
2025-08-17 15:56:07 -07:00
21e39436c8
[XPU] fix xpu to set cudagraph batch sizes ( #23044 )
...
Signed-off-by: calvin chen <wen.chen@dynamia.ai >
2025-08-17 21:45:42 +00:00
6d243efeda
[Misc] Convert use_structured_output property into constant ( #23060 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-17 12:41:38 -07:00
c55bc1db26
[Misc] Remove dead return ( #23061 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-17 10:36:46 -07:00
292084e72a
[BugFix] Fix for IMA in FA3 varlen combine ( #22967 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-17 08:52:04 -07:00
16bff144be
[Misc] fix typo in the multimodal doc ( #23051 )
2025-08-17 01:56:20 -07:00
fe0411fc6f
[Bugfix] should use stack instead of concat ( #22972 )
...
Signed-off-by: 947132885 <947132885@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-17 08:46:36 +00:00
4d4061b6e7
[Kernel] Add cuda kernel for gpt_oss activation ( #22951 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-17 05:03:24 +00:00
87f48623a5
[Misc] method name typo fix ( #23042 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-16 21:49:14 -07:00
5c32143b9d
[Refactor] Defer tensor data construction in MultiModalKwargs ( #23030 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-16 21:05:50 -07:00
94096a47c9
[UX] Separate marlin moe config logic from triton moe ( #23006 )
2025-08-16 22:16:42 -04:00
a258ad8bcc
[Bugfix] fix qwen3 moe fp8 accuracy issue ( #23031 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
2025-08-16 17:41:23 -07:00
bf7f470b22
[V1] Logits processors extensibility ( #19912 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
Signed-off-by: Andrew Feldman <afeld2012@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Andrew Feldman <afeld2012@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-16 12:59:17 -07:00
4fc722eca4
[Kernel/Quant] Remove AQLM ( #22943 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-08-16 19:38:21 +00:00
3253ae765e
[Flaky CI] Increase timeout tolerance for test_mp_crash_detection+test_default_mm_lora_chat_completions ( #23028 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-16 18:33:08 +00:00
000cceca8c
[Bugfix gpt-oss] Fix float32 convert for flashinfer sink support ( #23016 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-16 11:16:00 -07:00
68373d3126
[Frontend] Added support for HermesToolParser for models without special tokens ( #16890 )
...
Signed-off-by: minpeter <kali2005611@gmail.com >
2025-08-16 17:38:42 +00:00
52ce1420e9
Fix handling of max_num_batched_tokens for pooling tasks ( #23004 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-16 17:36:30 +00:00
829bbd7882
[New Model]mBART model ( #22883 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-08-16 12:16:58 +00:00
4dff91c93d
[Refactor] Allow optional MultiModalKwargsItem in IPC ( #23022 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-16 11:30:49 +00:00
de9cb61763
Add docs for PrefixRepetitionDataset + enable usage with vllm bench throughput ( #23012 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-08-16 10:21:20 +00:00
2dbccce8a6
[CI][Bugfix] Skip Ovis2 generation test because of broken remote code ( #22954 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-16 09:44:19 +00:00
933f45334a
[Core] Make cudagraph check cuda platform only ( #23005 )
...
Signed-off-by: Chengji Yao <chengjiyao@gmail.com >
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: Chengji Yao <chengjiyao@gmail.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-08-16 07:46:00 +00:00
cc826a202b
[Multimodal] Update Tensor schema test to cover arbitrary shape mm inputs ( #22867 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-16 00:44:50 -07:00
6d3da472bc
[Misc] Add --save-dir option to benchmark_moe ( #23020 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-16 07:26:10 +00:00
78863f8c5c
[BugFix] Add support for loading prompt embeds tensors serialized on unavailable devices and sparse tensors ( #22962 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-08-16 06:25:10 +00:00
5157827cfc
[Build] Env var to disable sccache ( #22968 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-16 05:36:27 +00:00
7caec10e7b
[XPU]avoid circular import during XPU init ( #23017 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-16 05:16:34 +00:00
1f83e7d849
[misc] nsys profile output kernel classifier and visualizer ( #22971 )
...
Signed-off-by: Grace Ho <grho@nvidia.com >
2025-08-16 02:52:51 +00:00
e4e37ded56
[V1] support min_tokens for detokener ( #22014 )
...
Signed-off-by: calvin chen <wen.chen@dynamia.ai >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-08-16 02:28:10 +00:00
f6b5040590
[Frontend] Avoid list copies in serving_chat.py ( #22947 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-16 02:06:30 +00:00
fbd88728b3
[Bugfix] Fix DeepSeek MTP ( #22934 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-08-16 01:25:06 +00:00
070da660c1
[Kernel] Simplify get_kv_cache_layout and cache use_trtllm_attention env-dependent bit ( #22735 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-16 00:14:08 +00:00
ad0297d113
[Misc] Support passing multiple request ids at once to AsyncLLM.abort() ( #22944 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-15 17:00:36 -07:00
236b864e4f
[BugFix] Make run_once thread-safe ( #22978 )
...
Signed-off-by: <wenji.yyc@alibaba-inc.com >
Signed-off-by: Yichen Yan <wenji.yyc@alibaba-inc.com >
2025-08-15 16:56:17 -07:00
3e2f7985a2
Support multiple attention groups for KV sharing ( #22672 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-15 16:54:10 -07:00
c280066f9d
[v1] Move block_hashes from KVCacheManager to Request.block_hashes ( #19728 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-08-15 16:52:52 -07:00
b9dc9d2607
[BugFix] Handle case where async utility call is cancelled ( #22996 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Yinghai Lu <yinghai@thinkingmachines.ai >
2025-08-15 17:38:42 -06:00
1fc375dc05
[Structured Outputs] [Bug] Fix misalignment in apply_grammar_bitmask causing unintended masking and NaN logits ( #22963 )
...
Signed-off-by: rishitdholakia13 <rishit+github@cohere.com >
2025-08-15 23:25:05 +00:00
76144adf76
ci: Add CUDA + arm64 release builds ( #21201 )
...
Signed-off-by: Eli Uriegas <eliuriegas@meta.com >
2025-08-15 23:16:23 +00:00
f5d412bafb
[BugFix] Fix regression caused by mamba state dtype PR ( #22998 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-15 22:55:26 +00:00
177e55e3bd
[Attention] FA3 Attention Sinks Perf Boost ( #22478 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-15 17:41:07 -04:00
1723ef1aae
minor: zero workspace buffer init for flashinfer trtllm-gen attn ( #22603 )
2025-08-15 21:38:10 +00:00
00d6cba0cf
Add PrefixRepetitionRandomDataset to vllm bench serve datasets ( #20638 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-08-15 14:09:23 -07:00
7f89ed248f
[Fix] enable swap_ab for pplx problem size computation ( #22991 )
...
Signed-off-by: Shixian Cui <shixian@amazon.com >
Co-authored-by: Shixian Cui <shixian@amazon.com >
2025-08-15 14:02:12 -07:00
8a87cd27d9
[CI] Speed up Whisper tests by reusing server ( #22859 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-15 16:56:31 -04:00
a344a1a7da
Use regex in convert-results-json-to-markdown.py ( #22989 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-08-15 20:54:20 +00:00
79899b63f6
[Bugfix] Added more env vars to hash ( #22449 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
2025-08-15 20:08:37 +00:00
6e670778cd
[Core] direct indexing on self.block_table_np in compute_slot_mapping ( #22940 )
...
Signed-off-by: linzebing <linzebing1995@gmail.com >
2025-08-15 12:12:12 -07:00
df5afa82e5
[Log] Debug Once for Randomizing dummy data for DP Rank ( #22860 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-15 11:51:50 -07:00
6cd69f51bf
[Model] Granite-4 support loading quantized checkpoint ( #22925 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-08-15 18:47:56 +00:00
8ad7285ea2
[Kernels] Clean up FusedMoeMethodBase and modular kernel setup. Remove extra arguments from modular kernel methods. ( #22035 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-15 14:46:00 -04:00
48b01fd4d4
[Structured Output] Make the output of structured output example more complete ( #22481 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-08-15 18:29:25 +00:00
993d3d122b
[Benchmarks] Include image data when ShareGPT4V dataset is used. ( #22955 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-08-15 18:23:06 +00:00
68af77e51c
[FIXBUG] Correctly Apply Grammar Bitmask in Mixed Batches ( #22896 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2025-08-15 17:42:49 +00:00
6b04039a72
[BugFix] Skip the Q component for QKVParallelLinear in the case of QKVCrossParallelLinear since its width is 0 ( #22369 )
...
Signed-off-by: sstamenk <sstamenk@amd.com >
2025-08-15 17:17:31 +00:00
1c859a1387
[V0 Deprecation] Remove advance_step ( #22969 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-15 08:22:31 -07:00
74f441f4b5
[Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer ( #20059 )
...
Signed-off-by: fhl <2410591650@qq.com >
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2025-08-15 10:01:39 -04:00
a0632a3e03
[Frontend] Expose do_log_stats interval to env ( #22905 )
...
Signed-off-by: Csrayz <jover@cmbchina.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-15 13:00:20 +00:00
e8b40c7fa2
[CI] Remove duplicated docs build from buildkite ( #22924 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-15 05:58:06 -07:00
48f4636927
[Misc] Ignore ep_kernels_workspace ( #22807 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-15 05:58:03 -07:00
75531a6c13
[V1] [Hybrid] Support using float32 for state in Hybrid Models (Mamba2, Mamba1, Minimax) ( #22928 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com >
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Daniel Afrimi <danielafrimi8@gmail.com >
Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-08-15 12:57:06 +00:00
22341b996e
Improve multimodal hasher performance for re-used Image prompts ( #22825 )
...
Signed-off-by: Staszek Pasko <staszek@gmail.com >
2025-08-15 12:32:56 +00:00
49252cf59e
[MM] Allow skipping memory profiling for multimodal models. ( #22950 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-15 11:41:38 +00:00
3e6dd40016
[Bugfix] fix cuda 12.6 and 11.8 build ( #22952 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
2025-08-15 10:10:22 +00:00
aa300c438d
[Bugfix] Unquote file uri before reading image ( #22912 )
...
Signed-off-by: Sayandip Dutta <sayandip199309@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-15 09:28:00 +00:00
fe91ce9591
[V1] - Split Prefill and Decode for Mamba1 models ( #22653 )
...
Signed-off-by: amirk <amirk@ai21.com >
Signed-off-by: asafg <asafg@ai21.com >
Co-authored-by: asafg <asafg@ai21.com >
Co-authored-by: Asaf Joseph Gardin <39553475+Josephasafg@users.noreply.github.com >
2025-08-15 08:59:52 +00:00
5406ebf5c9
[CI] Pooling models mteb test uses enforce_eager ( #22878 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-15 01:16:15 -07:00
b2c06509e5
[P/D]Provide bucket algorithm rate limiter for proxy_server ( #22643 )
...
Signed-off-by: frankie-ys <yongshengwang@cmbchina.com >
Signed-off-by: frankie <wangyongsheng686@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Kuntai Du <kuntai@uchicago.edu >
2025-08-15 07:01:48 +00:00
b2f6c247a9
Revert "[ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Module." ( #22956 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-08-15 06:39:19 +00:00
3d232dbd19
[Mamba] - refactor: Renamed mamba_attn to mamba2_attn ( #22818 )
...
Signed-off-by: asafg <asafg@ai21.com >
Co-authored-by: asafg <asafg@ai21.com >
2025-08-15 06:38:05 +00:00
5c3fbfe46b
[Feature] Full Cuda Graph Support for Cutlass MLA and 6% E2E Throughput Improvement ( #22763 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-15 06:27:30 +00:00
b4cef5e6c7
refactor: Change scaling factors calculation for flashinfer FusedMoE ( #22812 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-15 06:19:31 +00:00
0fe85087a9
[CI Perf] Prune tests in tests/kernels/attention/ ( #22936 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-14 21:34:53 -06:00
d2b0e97ea6
[CI Perf] Prune tests in tests/kernels/moe/ ( #22939 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-14 21:33:42 -06:00
590bddbfc5
[CI Perf] Prune tests in tests/kernels/quantization/ ( #22942 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-14 21:25:34 -06:00
ae05a6d83d
[BugFix] Fix port lookup in internal DP LB tests ( #22252 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-15 11:17:11 +08:00
0933f9d518
[BugFix][KVConn] Fix use of get_required_kvcache_layout ( #22734 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-15 01:39:43 +00:00
f1f0d2fab8
Revert "[Kernel] Add cuda kernel for gpt_oss activation" ( #22948 )
2025-08-14 17:38:10 -07:00
81f4b96481
[Kernel] Add cuda kernel for gpt_oss activation ( #22538 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-14 17:21:29 -07:00
39cd09dc86
[Bugfix] use flash attn on sm90 ( #22933 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-14 16:37:22 -07:00
919234fe17
[BugFix] Fix initial DP request load imbalance ( #22910 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-14 15:20:28 -07:00
ebcce2cd36
[Core] Return final response for aborted requests from AsyncLLM.generate ( #22283 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-14 14:49:02 -07:00
4121de512e
[Quantization]: Support compressed-tensors mixed-precision model loading ( #22468 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
2025-08-14 17:32:09 -04:00
279a5f31b3
[Kernel] Add nvfp4 gemm flashinfer backends ( #22346 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-08-14 16:03:55 -04:00
b8ff05361a
[CI] Temporarily disable flaky test ( #22930 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-14 19:59:16 +00:00
637093ae26
docs: update fastsafetensors usage instructions ( #22891 )
...
Signed-off-by: Nir Levy <bhr166@gmail.com >
2025-08-14 19:56:54 +00:00
33c63e9547
[Kernel] [Quantization] Add MXFP4 and bias support for marlin kernel ( #22428 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Signed-off-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Animesh Jain <anijain@umich.edu >
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: kf <kuanfu.liu@embeddedllm.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: tjtanaavllm <tunjian.tan@amd.com >
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: Chih-Chieh Yang <7364402+cyang49@users.noreply.github.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: yan <yan.ma@intel.com >
Signed-off-by: Yan Ma <yan.ma@intel.com >
Signed-off-by: Xiao Liu <xiszishu@gmail.com >
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es >
Signed-off-by: Andy Xie <andy.xning@gmail.com >
Signed-off-by: Haibin Lin <haibin.lin@bytedance.com >
Signed-off-by: David Ben-David <davidb@pliops.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Signed-off-by: Abirdcfly <fp544037857@gmail.com >
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: huangweixiao <huangweixiao@msh.team >
Signed-off-by: alyosha-swamy <raghav@arcee.ai >
Signed-off-by: Eric Hanley <ericehanley@google.com >
Signed-off-by: Abatom <abzhonghua@gmail.com >
Signed-off-by: CLFutureX <775523362@qq.com >
Signed-off-by: Linkun Chen <github@lkchen.net >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: tlipoca9 <tlipoca9@gmail.com >
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Signed-off-by: zitian zhao <zitian.zhao@tencentmusic.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Benji Beck <benjibeck@meta.com >
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
Signed-off-by: isotr0py <2037008807@qq.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Zhang Jason <ning.zhang2@amd.com >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Signed-off-by: asafg <asafg@ai21.com >
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Signed-off-by: Lain <fusiyuan2000@hotmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: QscQ <qscqesze@gmail.com >
Signed-off-by: qingjun <qingjun@minimaxi.com >
Signed-off-by: Syed Muhammad Bin Asif <syedmba7@connect.hku.hk >
Signed-off-by: Lionel Villard <villard@us.ibm.com >
Signed-off-by: ycyaw66 <497410282@qq.com >
Signed-off-by: David Chen <530634352@qq.com >
Signed-off-by: Linkun <github@lkchen.net >
Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com >
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai >
Signed-off-by: shaojunqi <shaojunqi.sjq@alibaba-inc.com >
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
Signed-off-by: Andrew Chan <andrewkchan.akc@gmail.com >
Signed-off-by: Felix Marty <Felix.Marty@amd.com >
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
Signed-off-by: Shu Wang <shuw@nvidia.com >
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: XIn Li <xinli@nvidia.com >
Signed-off-by: Junhao Li <junhao@ubicloud.com >
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com >
Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com >
Signed-off-by: <zyy1102000@gmail.com >
Signed-off-by: Guy Stone <guys@spotify.com >
Signed-off-by: <yyweiss@gmail.com >
Signed-off-by: yyw <yyweiss@gmail.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com >
Signed-off-by: Pradyun92 <142861237+Pradyun92@users.noreply.github.com >
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
Co-authored-by: rongfu.leng <rongfu.leng@daocloud.io >
Co-authored-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Animesh Jain <jainanimesh2305@yahoo.com >
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com >
Co-authored-by: XiongfeiWei <isaacwxf23@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: JartX <sagformas@gmail.com >
Co-authored-by: fhl2000 <63384265+fhl2000@users.noreply.github.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: kf <kuanfu.liu@embeddedllm.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com >
Co-authored-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: tjtanaavllm <tunjian.tan@amd.com >
Co-authored-by: Yong Hoon Shin <48474650+sarckk@users.noreply.github.com >
Co-authored-by: Chih-Chieh Yang <7364402+cyang49@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com >
Co-authored-by: Yuxuan Zhang <2448370773@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Yan Ma <yan.ma@intel.com >
Co-authored-by: Xiao <xiszishu@gmail.com >
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
Co-authored-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
Co-authored-by: Ning Xie <andy.xning@gmail.com >
Co-authored-by: H <linhaibin.eric@gmail.com >
Co-authored-by: David Ben-David <sdavidbd@gmail.com >
Co-authored-by: David Ben-David <davidb@pliops.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
Co-authored-by: TankNee <nee@tanknee.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com >
Co-authored-by: ZiTian.Zhao <zitian.zhao@tencentmusic.com >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Abirdcfly <fp544037857@gmail.com >
Co-authored-by: Giancarlo Delfin <32987265+TheEpicDolphin@users.noreply.github.com >
Co-authored-by: Chenxi Yang <cxyang@cs.utexas.edu >
Co-authored-by: Chenxi Yang <cxyang@meta.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Weixiao Huang <hwx.simle@gmail.com >
Co-authored-by: Raghav Ravishankar <113712354+alyosha-swamy@users.noreply.github.com >
Co-authored-by: ericehanley <ericehanley@google.com >
Co-authored-by: Zhonghua Deng <abzhonghua@gmail.com >
Co-authored-by: Po-Han Huang (NVIDIA) <53919306+nvpohanh@users.noreply.github.com >
Co-authored-by: PiteXChen <44110731+CLFutureX@users.noreply.github.com >
Co-authored-by: lkchen <github@lkchen.net >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
Co-authored-by: tlipoca9 <160737620+tlipoca9@users.noreply.github.com >
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: wang.yuqi <noooop@126.com >
Co-authored-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Siyuan Liu <lsiyuan@google.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Zhang Jason <ning.zhang2@amd.com >
Co-authored-by: Asaf Joseph Gardin <39553475+Josephasafg@users.noreply.github.com >
Co-authored-by: asafg <asafg@ai21.com >
Co-authored-by: Lain <siyuanf@nvidia.com >
Co-authored-by: tc-mb <157115220+tc-mb@users.noreply.github.com >
Co-authored-by: imning3 <hbning@pku.edu.cn >
Co-authored-by: Maximilien de Bayser <mbayser@br.ibm.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Tao He <linzhu.ht@alibaba-inc.com >
Co-authored-by: qscqesze <qingjun@minimaxi.com >
Co-authored-by: Syed Muhammad Bin Asif <92625830+syedmba@users.noreply.github.com >
Co-authored-by: Lionel Villard <villard@us.ibm.com >
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com >
Co-authored-by: ycyaw66 <497410282@qq.com >
Co-authored-by: Moritz Sanft <58110325+msanft@users.noreply.github.com >
Co-authored-by: Ming Yang <minos.future@gmail.com >
Co-authored-by: Adrián García García <adrigarvk8@gmail.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
Co-authored-by: JaceyShao <65159281+JaceyShao@users.noreply.github.com >
Co-authored-by: shaojunqi <shaojunqi.sjq@alibaba-inc.com >
Co-authored-by: Ricardo Decal <crypdick@users.noreply.github.com >
Co-authored-by: Andrew Chan <andrewkchan.akc@gmail.com >
Co-authored-by: fxmarty-amd <felmarty@amd.com >
Co-authored-by: Andrew Sansom <andrew@protopia.ai >
Co-authored-by: Zhiyu <zhiyuc@nvidia.com >
Co-authored-by: Shu Wang <shuw@nvidia.com >
Co-authored-by: XIn Li <xinli@nvidia.com >
Co-authored-by: Junhao Li <streaver91@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: iAmir97 <71513472+iAmir97@users.noreply.github.com >
Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com >
Co-authored-by: Hong Hanh <hanh.usth@gmail.com >
Co-authored-by: Daniel Serebrenik <74646983+pliops-daniels@users.noreply.github.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Guy Stone <guys@spotify.com >
Co-authored-by: yyweiss <70619747+yyweiss@users.noreply.github.com >
Co-authored-by: Pradyun92 <142861237+Pradyun92@users.noreply.github.com >
Co-authored-by: Pradyun Ramadorai <pradyunr@amazon.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2025-08-14 11:23:22 -07:00
ab9f2cfd19
[CI] [Hybrid] Bump min transformers version for Bamba and Jamba ( #22908 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-14 11:01:16 -07:00
dbe298046c
[Bugfix] Fix parsing of --disable-mm-preprocessor-cache ( #22909 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-14 08:09:44 -07:00
625ccd1c4d
[Bugfix] Replace custom Encoding class with BatchEncoding in MistralTokenizer ( #22786 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-14 08:09:27 -07:00
92ff41abea
[Model] Modify the gate implementation of glm4_moe ( #22832 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-14 05:28:50 -07:00
829b9a62d0
[Perf] Dont create unnecessary pooling params ( #22876 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-14 05:28:09 -07:00
540d54ca8d
[CI] Re-enable transcriptions test_long_audio_request ( #22890 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-14 11:34:34 +00:00
0783f13960
[Doc] fix dead link ( #22898 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
2025-08-14 04:06:13 -07:00
7655dc3e45
[Bugfix] Add reset prefix cache for online serving ( #22726 )
...
Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com >
Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com >
Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-14 04:04:18 -07:00
f4efda821d
Remove Phi 4 Flash configuration workaround ( #22723 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-14 04:03:49 -07:00
eb08487b18
[BugFix] Threadsafe close async zmq sockets ( #22877 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-14 03:44:29 -07:00
7c3a0741c6
[Bugfix] Fix PixtralHFImagePixelInputs dynamic shape check ( #22827 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-14 02:35:43 -07:00
00e3f9da46
vLLM Benchmark suite improvement ( #22119 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
Signed-off-by: Louie Tsai <louie.tsai@intel.com >
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
2025-08-14 07:12:17 +00:00
a353bd083d
[CI] remove flaky v0 test ( #22864 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-08-13 21:41:51 -07:00
1d20c34717
[CI] Fix tests/distributed/test_ca_buffer_sharing.py ( #22849 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-08-13 20:09:30 -07:00
b6af24fba7
[CI][Entrypoints]: add filter to generation to filter out invalid tool calls ( #22826 )
...
Signed-off-by: Will Eaton <weaton@redhat.com >
2025-08-13 20:09:07 -07:00
0ca2393b47
[CI/Build] Increase pooling tolerance to pass CI ( #22844 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-08-13 18:52:48 -04:00
31a500c86f
[Core] [N-gram SD Optimization][1/n] Propose tokens with a single KMP ( #22437 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-08-13 14:44:06 -07:00
4e8614e88b
Move checklist in PR template ( #22852 )
...
Signed-off-by: Luka Govedic <lgovedic@redhat.com >
2025-08-13 21:38:35 +00:00
c6cd5ca3d3
[ROCm][Bugfix] Fix compilation error in topk softmax fused kernel ( #22819 )
...
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
2025-08-13 13:45:03 -07:00
df0e0f023e
[CI/Build] Skip gpt_big model test because of broken HF model ( #22848 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-13 20:36:28 +00:00
b4b78d6317
[CI/Build] Fix param mismatch in test_eagle_correctness ( #22847 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-13 10:55:25 -07:00
12817a8ac7
[CI] Fix tests/v1/e2e/test_kv_sharing_fast_prefill.py import on test ( #22815 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-13 10:35:50 -07:00
c9232d41f4
[CI/Build] Update VLM common tests ( #22841 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-13 10:03:05 -07:00
9bd9294f0e
[Bugfix] Fix MiniCPMV Image input inference failed ( #22813 )
...
Signed-off-by: HWH <67449739+jio-H@users.noreply.github.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-13 09:41:41 -07:00
da2705198f
[Misc] clear and separate error messages for input too long and input + max-tokens too long ( #22803 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
2025-08-13 07:22:56 -07:00
19b927e52d
[Core] Use individual MM items in P0/P1 cache and model runner ( #22570 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-13 07:18:07 -07:00
20d65aa755
[Frontend] Multithreaded async multimodal load_bytes ( #22710 )
...
Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com >
Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com >
2025-08-13 06:09:26 -07:00
b159c0a67a
Fix GGUF loader for Qwen3 MoE. ( #22785 )
...
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com >
2025-08-13 06:08:23 -07:00
6772bb0f7d
Remove unnecessary CUDA sync of qwen image and video preprocess ( #22792 )
...
Signed-off-by: cyy <cyyever@outlook.com >
Signed-off-by: Yuanyuan Chen <cyyever@outlook.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-13 06:07:28 -07:00
fceafaf582
[Bugfix][mamba] Fix type annotation of Mamba2Metadata ( #22787 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-13 06:07:09 -07:00
6b794c756c
[Nixl][CI] Fix tests ( #22806 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-13 06:03:53 -07:00
98deac3879
[FEATURE] support custom vllm tuned config path for fused moe triton kernels ( #22791 )
...
Signed-off-by: Chi Zhang <zhangchi.usc1992@bytedance.com >
2025-08-13 20:27:25 +08:00
653124bd46
[Frontend] Add chunked processing to handle long inputs in embedding models ( #22280 )
...
Signed-off-by: x22x22 <wadeking@qq.com >
Signed-off-by: Kdump <rootshellexp@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-13 04:14:24 -07:00
0b1bdac6af
[Platform] Custom ops support for FusedMoe ( #22509 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-08-13 04:12:00 -07:00
d94e3026de
[V1] Add tree drafting tests for eagle spec decoding ( #22705 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com >
2025-08-13 04:11:28 -07:00
3f52738dce
[Doc] Add max_lora_rank configuration guide ( #22782 )
...
Signed-off-by: chiliu <cliu_whu@yeah.net >
2025-08-13 04:10:07 -07:00
a01e0018b5
[Bugfix] Fix Nemotron VL image processing ( #22739 )
...
Co-authored-by: ducviet00-h2 <viet.d.hoang@h2corporation.jp >
2025-08-13 03:11:36 -07:00
9e7e5baaa8
[Model] Add missing prefix to glm4_1v ( #22716 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-08-13 01:23:33 -07:00
d16aa3dae4
[Model] Add option to run Step3VisionEncoder in DP ( #22697 )
...
Signed-off-by: zzh142857 <chaorenzhaozhenghao@gmail.com >
2025-08-13 00:09:13 -07:00
6807af8f46
[gpt-oss] upgrade gpt-oss to v0.0.3 and add version check ( #22768 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-12 21:37:26 -07:00
4c558cf62e
[Perf] Support topk softmax fused kernel for broader num_experts ( #22211 )
...
Signed-off-by: Shixian Cui <shixian@amazon.com >
Co-authored-by: Shixian Cui <shixian@amazon.com >
2025-08-12 21:34:47 -07:00
77a6bf07ae
[Bug] Fix Unexpected Keyword Argument 'w1_bias' ( #22757 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-12 21:31:47 -07:00
4082338a25
Remove unneeded ROCm platform import when using CUDA ( #22765 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-12 21:26:38 -07:00
c6b928798e
Force TRTLLM attention for gpt-oss on SM100 ( #22678 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-12 21:22:16 -07:00
b1361c7273
[Bugfix] Fix default enable for CUTLASS MLA on SM100 ( #22738 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-12 21:22:05 -07:00
4f0f844b16
Fix cuda illegal mem access with Llama4 TP8 + rms_norm custom op ( #22701 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
2025-08-12 21:21:50 -07:00
c5830381af
[V0 Deprecation] Remove args for multi-step scheduling ( #22779 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-08-12 20:38:18 -07:00
d31f97cf57
[Misc] Remove tests/multi_step/__init__.py ( #22778 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-08-12 20:21:18 -07:00
71683ca6f6
[V0 Deprecation] Remove multi-step scheduling ( #22138 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-08-12 20:18:39 -07:00
e18859298d
Add hardware plugins to installation doc ( #22732 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-12 17:14:46 -07:00
fde0b611a3
[Model] Decouple glm4v ( #22751 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-12 17:13:17 -07:00
d0a6301588
Fix Transformers backend tensor parallel for multimodal models ( #22673 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-12 17:12:30 -07:00
45c3936e94
[Docs] Hide the navigation and toc sidebars on home page ( #22749 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-12 17:12:26 -07:00
ba81acbdc1
[Bugfix] Bump DeepGEMM Version to Fix SMXX Layout Issues ( #22606 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
2025-08-12 15:43:06 -07:00
53c730286c
[Misc] parametrize 'dtype' in test_flash_mla ( #22641 )
...
Signed-off-by: RUTHLESS-BOT <wujiafeng@cmbchina.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-12 16:31:48 -04:00
6534d2fc97
Fix torch version check for SM100 mxfp4 ( #22535 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-08-12 12:54:42 -07:00
422f22e012
[CI][Nixl] Check kv cache layout during handshake ( #22745 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-12 12:53:52 -07:00
6bd8ebf026
[Kernel][AMD] Avoid D2H copy and cumsum kernel ( #22683 )
...
Signed-off-by: Xiaozhu <mxz297@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-12 12:53:36 -07:00
dab4f9f764
[Chore] Update CODEOWNERS to include @yewentao256 for CUDA kernels, attention backends, quantization, and related tests ( #22741 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-13 00:50:31 +08:00
c42fe0b63a
Add more test scenario for tensor schema ( #22733 )
...
Signed-off-by: teekenl <teekenlau@gmail.com >
2025-08-12 16:34:41 +00:00
5a4b4b3729
Add: SupportsEagle3 interface for explicit EAGLE3 support ( #22642 )
...
Signed-off-by: Rahul Tuli <rtuli@redhat.com >
2025-08-12 09:24:52 -07:00
e5d3d63c42
[Benchmark] Fix terminal colors in benchmark_serving_multi_turn (python 3.12) ( #22730 )
...
Signed-off-by: daniels <daniels@pliops.com >
2025-08-12 14:41:37 +00:00
3d9d40efde
[Bugfix][CI] Fix test_remote_decode_lifecycle.py::test_short_prompt_lifecycle ( #22727 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-12 07:30:17 -07:00
67c153b88a
Fix Llama4 FlashInfer FP4 MoE issues ( #22511 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
2025-08-12 05:50:59 -07:00
f7ad6a1eb3
[CI Failure] fix tests/entrypoints/openai/test_skip_tokenizer.py ( #22708 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-12 05:42:58 -07:00
80bb1e8afe
Officially support SmolLM3 using the Transformers backend ( #22665 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-12 05:38:48 -07:00
d030b01548
[BugFix][Nixl][PD] Fix heterogenous TP ( #22663 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-08-12 05:37:30 -07:00
767e63b860
[Docs] Improve docs navigation ( #22720 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-12 04:25:55 -07:00
007dd90859
[gpt-oss] Enable gpt-oss on ampere ( #22714 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-12 03:21:44 -07:00
b8a9d0e429
[Misc] remove GH discussions link ( #22722 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-12 03:15:33 -07:00
50f2aae1b4
[LMCache][Example] Align the PYTHONHASHSEED for prefillers and decoders for KV chunks hashing ( #21161 )
...
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com >
2025-08-12 02:05:14 -07:00
46ae7f6666
[Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, assert chunk pwr 2 ( #21783 )
...
Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com >
2025-08-12 02:04:37 -07:00
1ece7f30ba
Fix: AWQ Marlin get_quant_method does not recognize "modules_to_not_convert" ( #21888 )
...
Signed-off-by: JunHowie <JunHowie@aliyun.com >
Co-authored-by: JunHowie <JunHowie@aliyun.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-12 02:03:53 -07:00
bc8372efc3
[Bugfix] Fix erroneous randomly generated cases in bad word testing ( #22170 )
...
Signed-off-by: phantomlei <phantomlei3@gmail.com >
2025-08-12 02:03:22 -07:00
8d17fa633e
[V0] Correct CUDA Graph capture for encoder-decoder models ( #22630 )
2025-08-12 02:01:08 -07:00
9f909b8996
[New Model] Support Command-A-Vision ( #22660 )
...
Signed-off-by: donglu <donglu@cohere.com >
2025-08-12 01:39:54 -07:00
59f3b93636
[DOC] update v1_guide with INTEL HW ( #22679 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2025-08-12 01:22:49 -07:00
78077d5417
Move SchedulerConfig from config/__init__.py to config/scheduler.py ( #22626 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-12 00:23:49 -07:00
6d729c43fb
[Bugfix] Fix ModernBert load & Enable sliding window attention for bidirectional attention. ( #22637 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-12 00:23:17 -07:00
2f4657952b
[doc] Update x86 CPU-inference installation doc to reflect optionality of AVX512f ( #22707 )
...
Signed-off-by: Sooraj S <94284954+sooraj-satheesh@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
2025-08-12 00:21:08 -07:00
3a7e3bbdd2
[Doc] Added unmentioned required option "method" in the usage of EAGLE-3 based models ( #21737 )
...
Signed-off-by: Dilute-l <dilu2333@163.com >
Co-authored-by: Dilute-l <dilu2333@163.com >
2025-08-12 00:14:51 -07:00
4fbd8bb597
Fix passing SpeculativeConfig from the CLI ( #22652 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-11 22:13:32 -07:00
ad344ef552
[gpt-oss] Small bug fixes for frontend ( #22512 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-11 22:04:38 -07:00
bbaf9e9cb1
[gpt-oss] Fix mxfp4 support ( #22700 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-11 21:22:26 -07:00
4678503476
Migrate MiniCPMVImageInputs to TensorSchema ( #21939 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-11 20:43:37 -07:00
93d0652433
[CI] Increase timeout for test_completion_with_image_embeds ( #22670 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-11 20:31:36 -07:00
ea1292ad3e
[CI Failure] Use float32 for tests/entrypoints/openai/test_audio.py ( #22686 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-11 20:20:42 -07:00
dc5e4a653c
Upgrade FlashInfer to v0.2.11 ( #22613 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-08-11 19:58:41 -07:00
839ab00349
Re-enable Xet on TPU tests now that hf_xet has been updated ( #22666 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-11 19:54:40 -07:00
9b94d6ec8f
Enable 4bit bnb prequant MOE ( #21548 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-11 19:02:14 -07:00
1891a265d3
[gpt-oss] Add test for response API + harmony (but skipped) ( #22554 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-11 17:47:24 -07:00
95a935fc48
[gpt-oss] Support streaming in response API ( #22431 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-11 17:46:59 -07:00
458e74eb90
Support more parallel styles in Transformers backend TP ( #22651 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-11 10:42:48 -07:00
65abe111a3
[CI] Skip Tree Attn Test in test_max_len.py to unblock CI ( #22664 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-08-11 10:36:05 -07:00
807d21b80d
[BugFix] [Spec Decode] Remove LlamaForCausalLMEagle3 to fix CI ( #22611 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-11 10:31:36 -07:00
c90fb03df5
[CI/Build] Skip Mllama HF runner tests with Transformers v4.55.0 ( #22659 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-11 10:00:58 -07:00
84cf78acee
[Model] Pooling models default to using chunked prefill & prefix caching if supported. ( #20930 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-11 09:41:37 -07:00
16fb668b61
fix: NIXL connector transfers partial block to pass full multi-modal context ( #21074 )
...
Signed-off-by: GuanLuo <gluo@nvidia.com >
2025-08-11 09:40:55 -07:00
f7dcce7a4a
[Feature] Add VLLM_USE_DEEP_GEMM_E8M0 Env to Control E8M0 Scale ( #21968 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-11 09:39:08 -07:00
8e13d9fe6d
[Misc] Further clean up some redundant config definitions ( #22649 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-11 09:22:25 -07:00
3fa5b25845
Document aarch64 CPU support works ( #22646 )
...
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-08-11 07:22:45 -07:00
14a5d903ab
[Model] NemotronH Support ( #22349 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com >
2025-08-11 04:09:24 -07:00
951b038298
[Misc] Move jsontree to utils ( #22622 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-11 03:49:32 -07:00
ebf7605b0d
[Misc] Move tensor schema tests ( #22612 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-11 00:15:27 -07:00
bc1d02ac85
[Docs] Add comprehensive CLI reference for all large vllm subcommands ( #22601 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-11 00:13:33 -07:00
1e55dfa7e5
[BUGFIX] KeyError 'layers.14.mlp.gate.g_idx' for Qwen3-MoE with GPTQ on ROCm ( #22017 )
2025-08-11 00:13:30 -07:00
384a052971
[Misc] benchmark_moe supports expert parallel ( #22251 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-11 00:13:27 -07:00
39052dbca8
Support token_type_ids in V1 with less code changes ( #21985 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-10 22:54:59 -07:00
9c97a1c349
[ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Module. ( #22521 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-08-10 22:52:34 -07:00
f919d4cb8f
[BugFix] Fix logits repetition penalty cuda check ( #22592 )
2025-08-10 22:52:31 -07:00
afa5b7ca0b
[Misc][gpt-oss] guard import when triton kernel when not up to date ( #22584 )
...
Signed-off-by: zhewenli <zhewenli@meta.com >
2025-08-10 21:29:35 -07:00
1b99028069
[Misc][gpt-oss] Add rules to label gpt-oss related PRs ( #22600 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2025-08-10 19:49:51 -07:00
5898b135ab
[BugFix] Fix KVConnectorOutput TPU breakage ( #22598 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-10 19:33:48 -07:00
b799f4b9ea
[CI/Build] Fix tensorizer test for load_format change ( #22583 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-10 19:30:00 -07:00
06da44f0cb
Migrate LlavaImageInputs to TensorSchema ( #21770 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-10 19:29:19 -07:00
a554991748
Migrate LlavaNextVideoPixelInputs to TensorSchema ( #21843 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-10 19:29:16 -07:00
d1af8b7be9
enable Docker-aware precompiled wheel setup ( #22106 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
2025-08-10 16:29:02 -07:00
68b254d673
Fix TensorSchema validation test for symbolic dims ( #22366 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-10 17:16:44 +00:00
8c50d62f5a
Remove redundant row_indices unsqueeze operation in MiniCPMO ( #22528 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-08-10 09:20:00 -07:00
b4e2916721
Migrate LlavaNextImageInputs to TensorSchema ( #21774 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-10 09:05:21 -07:00
65a7917be4
Fix(benchmarks): allow multiple mm contents in OpenAI Chat Completion Benchmarks ( #22534 )
...
Signed-off-by: breno.skuk <breno.skuk@hcompany.ai >
2025-08-10 09:03:15 -07:00
b76753f0b5
[Bugfix][Kernel] Support partial rotary embedding for MRoPE triton kernel ( #22593 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-10 09:00:36 -07:00
b81fe83b2c
[doc] add alibaba cloud as sponsor ( #22597 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-08-10 23:13:47 +08:00
0757551c96
[doc] add beijing meetup links ( #22596 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-08-10 22:51:36 +08:00
8290d15d2c
Move CacheConfig from config/__init__.py to config/cache.py ( #22586 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-10 07:36:40 -07:00
049c245143
[Misc] Replace flaky image urls in pixtral test ( #22574 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-10 06:18:21 -07:00
00976db0c3
[Docs] Fix warnings in docs build ( #22588 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-10 05:49:51 -07:00
d411df0296
[Misc] Further refine type annotations in parallel state ( #22499 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-10 05:49:48 -07:00
010e0e39ea
[Doc] Fix API doc link in side navigation ( #22585 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-10 01:35:22 -07:00
326976291b
[Misc] code clean duplicate set_current_vllm_config in _set_vllm_config ( #22566 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-10 00:08:48 -07:00
7e8d685775
[Minor] Fix pre-commit error on main ( #22579 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-10 00:08:23 -07:00
c49848396d
Refactor sliding window configuration to Transformers best practice ( #21927 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-09 20:50:48 -07:00
2a84fb422f
[TPU] kv cache update kernel doesn't need to be padded slices to multiple of num_slices_per_block ( #22394 )
...
Signed-off-by: Chengji Yao <chengjiyao@gmail.com >
Co-authored-by: Chengji Yao <chengjiyao@gmail.com >
2025-08-09 20:49:04 -07:00
534c45b962
Improve fast_topk function with type hints and documentation ( #22530 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-08-09 20:25:42 -07:00
3d7363e61c
[Config] add "qwen" as a native eagle3 target supported model ( #22333 )
...
Signed-off-by: lechen <lecself@163.com >
Signed-off-by: LeChen <lecself@163.com >
2025-08-09 20:21:05 -07:00
0c5254b82a
[oss] Init gpt-oss bf16 support ( #22508 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-09 20:19:13 -07:00
61f67d8acd
[V1] [Hybrid] Enable Full CUDA Graph (decode-only) for Mamba layers ( #21401 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-09 20:16:11 -07:00
42172ad18f
[FEAT] [Performance] Add triton mrope to replace the torch code path ( #22375 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-08-09 11:50:03 -07:00
fbd8595c5c
[Bugfix] Fix basic models tests hanging due to mm processor creation ( #22571 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-09 11:42:21 -07:00
5a16fa614c
[Model] Gemma3n MM ( #20495 )
...
Signed-off-by: ShriKode <shrikode@gmail.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: ShriKode <shrikode@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-08-09 09:56:25 -07:00
2d18256e47
Move ParallelConfig from config/__init__.py to config/parallel.py ( #22565 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-09 08:33:46 -07:00
56186474f6
[Docs] Reduce noise in docs and --help from the JSON tip ( #22567 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-09 08:31:32 -07:00
1bf5e1f25b
[CI] [Hybrid] Speed up hybrid models test by removing large models ( #22563 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-09 02:04:42 -07:00
a6022e6fbc
GLM-4.5V with new class name at transformers ( #22520 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-09 00:50:21 -07:00
2be07a0db1
Update docs for Minimax-Text support ( #22562 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-09 00:18:18 -07:00
0edc0cd52b
[Bugfix] Fix CI moe kernel failure ( #22556 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-09 00:03:29 -07:00
7920e9b1c5
[Bugfix] Fix failing GPT-OSS initialization test ( #22557 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-09 00:03:26 -07:00
b7c0942b65
[ROCm][Misc] Rename the context_len to seq_len in ROCm custom paged attention kernel ( #22097 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-08-08 23:15:06 -07:00
9a0c5ded5a
[TPU] Add support for online w8a8 quantization ( #22425 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2025-08-08 23:12:54 -07:00
10a02535d4
Fix loading of quantized BigCode models ( #22463 )
...
Signed-off-by: Eldar Kurtic <eldar@neuralmagic.com >
2025-08-08 23:12:12 -07:00
65552b476b
[Misc] Use config definitions from Transformers library ( #21913 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-08 23:10:51 -07:00
7ad7adb67f
v1: Pass KVConnectorOutput to scheduler-side ( #22157 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-08-08 23:09:51 -07:00
6ade99eafa
[V1] [Hybrid] Support Minimax-Text-01 in V1 ( #22151 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-08 23:08:48 -07:00
3157aebb63
[Log] Add Warning for Deprecation of DeepGEMM old version ( #22194 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-08 23:07:48 -07:00
8a0ffd6285
Remove mamba_ssm from vLLM requirements; install inside test container using --no-build-isolation ( #22541 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-08 23:05:32 -07:00
23472ff51c
[Doc] Add usage of implicit text-only mode ( #22561 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Flora Feng <4florafeng@gmail.com >
2025-08-08 23:04:19 -07:00
08b751ba74
Implicit language-model-only mode via limit-mm-per-prompt ( #22299 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Andy Xie <andy.xning@gmail.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
Signed-off-by: Shu Wang <shuw@nvidia.com >
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: XIn Li <xinli@nvidia.com >
Signed-off-by: Junhao Li <junhao@ubicloud.com >
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
Signed-off-by: zitian zhao <zitian.zhao@tencentmusic.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com >
Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com >
Signed-off-by: Linkun <github@lkchen.net >
Co-authored-by: Ning Xie <andy.xning@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
Co-authored-by: Andrew Sansom <andrew@protopia.ai >
Co-authored-by: Zhiyu <zhiyuc@nvidia.com >
Co-authored-by: Shu Wang <shuw@nvidia.com >
Co-authored-by: XIn Li <xinli@nvidia.com >
Co-authored-by: Junhao Li <streaver91@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Yuxuan Zhang <2448370773@qq.com >
Co-authored-by: ZiTian Zhao <zitian.zhao@tencentmusic.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Po-Han Huang (NVIDIA) <53919306+nvpohanh@users.noreply.github.com >
Co-authored-by: iAmir97 <71513472+iAmir97@users.noreply.github.com >
Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Hong Hanh <hanh.usth@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: lkchen <github@lkchen.net >
2025-08-08 22:21:40 -07:00
429e4e2d42
[Bugfix] Fix ModernBert cuda graph capturing in v1 ( #21901 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-08 22:17:22 -07:00
35afe1b30b
[BugFix] [P/D] Handle lookahead token count edge-case with Eagle Spec Decoding and P/D ( #22317 )
...
Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com >
Signed-off-by: Pradyun92 <142861237+Pradyun92@users.noreply.github.com >
Co-authored-by: Pradyun Ramadorai <pradyunr@amazon.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2025-08-08 17:04:15 -07:00
81c57f60a2
[XPU] upgrade torch 2.8 on for XPU ( #22300 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-08 17:03:45 -07:00
311d875614
Drop flaky test_healthcheck_response_time ( #22539 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-08 16:56:47 -07:00
e3edc0a7a8
Extract CompilationConfig from config.py ( #22524 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-08 16:34:25 -07:00
baece8c3d2
[Frontend] Add unix domain socket support ( #18097 )
...
Signed-off-by: <yyweiss@gmail.com >
Signed-off-by: yyw <yyweiss@gmail.com >
2025-08-08 16:23:44 -07:00
2fcf6b27b6
[Docs] fix broken links in metrics.md ( #22315 )
...
Signed-off-by: Guy Stone <guys@spotify.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-08 16:22:35 -07:00
41b9655751
Skip Qwen 1 in CI because remote code is no longer compatible with Transformers ( #22536 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-08 16:20:58 -07:00
bd875d2eb7
[Bugfix] Update FA commit hash ( #22546 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-08 16:10:25 -07:00
f703b923f3
[Misc] DeepGEMM : Avoid JIT generation in the hot-path ( #22215 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-08-08 16:09:59 -07:00
cd9b9de1fb
[BugFix] Fix IMA FlashMLA full cuda-graph and DP + Update FlashMLA ( #21691 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-08-08 16:09:42 -07:00
fe6d8257a1
[gpt-oss] Support tool call and implement MCP tool server ( #22427 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-08 15:06:37 -07:00
e290594072
[Docs] Rename “Distributed inference and serving” to “Parallelism & Scaling” ( #22466 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-08-08 19:26:21 +00:00
f756a682d9
[gpt-oss] guard import when triton kernel is not installed ( #22529 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-08 11:18:33 -07:00
f0964e29cb
[Benchmark] Add benchmark tool for multi turn conversations ( #20267 )
2025-08-08 10:28:50 -07:00
e789cad6b8
[gpt-oss] triton kernel mxfp4 ( #22421 )
...
Signed-off-by: <zyy1102000@gmail.com >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-08 08:24:07 -07:00
e5ebeeba53
Remove exception for Python 3.8 typing from linter ( #22506 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-08 03:06:46 -07:00
7be7f3824a
[Docs] Improve API docs (+small tweaks) ( #22459 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-08 03:02:51 -07:00
ccdae737a0
[BugFix] Don't cancel asyncio tasks directly from destructors ( #22476 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-08 01:13:18 -07:00
904063907c
[Misc] fix openai version ( #22485 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-08 01:12:54 -07:00
43c4f3d77c
[Misc] Begin deprecation of get_tensor_model_*_group ( #22494 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-08 01:11:54 -07:00
1712543df6
[CI/Build] Fix multimodal tests ( #22491 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-08 00:31:19 -07:00
808a7b69df
[bench] Fix benchmark/serve.py to ignore unavailable results ( #22382 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-08-07 23:15:50 -07:00
099c046463
[Doc] Sleep mode documentation ( #22310 )
...
Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com >
Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com >
Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Hong Hanh <hanh.usth@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-08-08 12:25:18 +08:00
af473f0a85
[bugfix] Fix Llama3/4 issues caused by FlashInfer 0.2.10 ( #22426 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
2025-08-07 20:25:01 -07:00
157f9c1368
Fix pre-commit ( #22487 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-07 20:21:54 -07:00
6f287915d8
Optimize MiniCPMO mask creation with vectorized implementation ( #22464 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
Signed-off-by: zitian zhao <zitian.zhao@tencentmusic.com >
2025-08-07 20:18:50 -07:00
c152e2a8a0
not tie_word_embeddings for glm-4.5 and glm-4.5v ( #22460 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-08-07 19:37:23 -07:00
17eaaef595
[Bugfix] Fix RuntimeError: Index put requires the source and destination dtypes match ( #22065 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-08-07 19:20:21 -07:00
3303f134e0
[Kernel] Add support for block FP8 on SM120 (NVIDIA 5090 and RTX PRO 6000) ( #22131 )
...
Signed-off-by: Junhao Li <junhao@ubicloud.com >
2025-08-07 19:18:28 -07:00
b2c8ce57c6
Fix Flashinfer CUTLASS MOE Allgather ( #21963 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com >
2025-08-07 19:18:25 -07:00
a3b9c17b56
Support Tensorrt-LLM MoE fp4 for low-latency ( #21331 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com >
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: XIn Li <xinli@nvidia.com >
Co-authored-by: XIn Li <xinli@nvidia.com >
2025-08-07 19:18:22 -07:00
d57dc2364e
Add ModelOpt Qwen3 nvfp4 support ( #20101 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
2025-08-07 19:18:19 -07:00
e2c8f1edec
[PERF] Use pybase64 to more quickly decode prompt embeddings ( #22469 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-08-07 19:15:32 -07:00
1ee5ead5f8
[ROCm] [V1] [SpecDec] Enable Speculative Decoding on ROCm V1 Engine ( #21496 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-08-07 19:13:17 -07:00
acf8aeb79e
[Misc] normalize multiprocessing Queue usage ( #22371 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-08 01:57:27 +00:00
7e3a8dc906
Remove from_dict from SpeculativeConfig ( #22451 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-07 10:13:04 -07:00
139d155781
[Frontend] Use engine argument to control MM cache size ( #22441 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-07 09:47:10 -07:00
8c9da6be22
[Core] Simplify mm processing cache ( #22457 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-07 09:47:07 -07:00
399d2a10e2
Fix pre-commit error in main ( #22462 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-07 08:54:39 -07:00
4815b00f54
[gpt-oss] Generate ResponseOutputItem from Harmony Message ( #22410 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-07 08:33:25 -07:00
4da8bf20d0
[Tool] Fix auto tool call ( #22434 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-07 07:03:38 -07:00
7e0b121812
[Bugfix] Add missing packed_modules_mapping to DeepseekV2ForCausalLM ( #22352 )
...
Signed-off-by: Felix Marty <Felix.Marty@amd.com >
2025-08-07 06:30:48 -07:00
766bc8162c
[Core] Store only the keys for multi-modal data in P0 ( #22198 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-07 01:45:04 -07:00
289b18e670
[Docs] Update features/disagg_prefill, add v1 examples and development ( #22165 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-08-07 00:59:23 -07:00
35171b1172
[Doc] update docs for nightly benchmarks ( #12022 )
...
Signed-off-by: Andrew Chan <andrewkchan.akc@gmail.com >
2025-08-07 00:29:45 -07:00
a2c6696bfe
[Docs] Factor out troubleshooting to its own guide; add section for Ray Observability ( #21578 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-08-07 00:29:13 -07:00
5e8398805e
[Doc] Fix link to prefix caching design ( #22384 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-07 00:28:15 -07:00
136825de75
[Misc] Enhance code formatting in mxfp4.py ( #22423 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-07 00:26:24 -07:00
c2dba2dba8
Add H20-3e fused MoE kernel tuning configs for GLM-4.5 ( #22433 )
...
Signed-off-by: shaojunqi <shaojunqi.sjq@alibaba-inc.com >
Co-authored-by: shaojunqi <shaojunqi.sjq@alibaba-inc.com >
2025-08-07 00:24:47 -07:00
434d2f3f7a
[Docs] Add missing dependency for docs build ( #22435 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-07 00:22:07 -07:00
8e8e0b6af1
feat: Add --enable-log-outputs flag for logging model generations ( #20707 )
...
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai >
2025-08-06 23:10:13 -07:00
82216dc21f
[Misc] Support routing logic simulation ( #21990 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-06 23:06:20 -07:00
370661856b
[Frontend] Update OpenAI error response to upstream format ( #22099 )
...
Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com >
2025-08-06 23:06:00 -07:00
cbc8457b26
[Model] Switch to Fused RMS norm in Qwen2.5_VL model. ( #22184 )
...
Signed-off-by: kf <kuanfu.liu@embeddedllm.com >
Signed-off-by: tjtanaavllm <tunjian.tan@amd.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: kf <kuanfu.liu@embeddedllm.com >
2025-08-06 23:05:24 -07:00
4d4297e8fe
[Bench] Split serve.py:main into async/async versions ( #22405 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-08-06 23:05:07 -07:00
2a4c825523
[CI] Skip the pooling models that do not support transformers v4.55 ( #22411 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-06 23:05:03 -07:00
4be02a3776
[Bugfix] EPLB load statistics problem ( #22167 )
...
Signed-off-by: ycyaw66 <497410282@qq.com >
Signed-off-by: David Chen <530634352@qq.com >
Co-authored-by: ycyaw66 <497410282@qq.com >
2025-08-07 04:07:54 +00:00
f6278b6243
[gpt-oss] Convert user input to harmony format ( #22402 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-06 20:56:02 -07:00
ad6c655dde
preload heavy modules when mp method is forkserver ( #22214 )
...
Signed-off-by: Lionel Villard <villard@us.ibm.com >
2025-08-06 20:33:24 -07:00
14bcf93a6a
Optimize logger init performance by using module-level constants ( #22373 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-08-06 20:32:19 -07:00
ecbea55ca2
Update hf_xet pin to resolve hangs ( #22356 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-06 20:31:41 -07:00
609b533cb6
[Bugfix] Add proper comparison for package versions ( #22314 )
...
Signed-off-by: Syed Muhammad Bin Asif <syedmba7@connect.hku.hk >
2025-08-06 20:31:03 -07:00
5e9455ae8f
[Bugfix]: Fix the streaming output for function calls in the minimax ( #22015 )
...
Signed-off-by: QscQ <qscqesze@gmail.com >
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-08-06 20:30:27 -07:00
a00d8b236f
Use float32 for test_completion.py ( #22385 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-08-07 11:07:47 +08:00
04cf435d95
[Bugfix] Fix wrong method name in Intern-S1 image processor ( #22417 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-06 20:05:20 -07:00
7377131a2c
[Qwen3] Enable dual-chunk-attention support for Qwen3 models. ( #21924 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-08-06 19:58:08 -07:00
6b47ef24de
[XPU]Fix flash_attn_varlen_func interface on xpu ( #22350 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-06 19:28:11 -07:00
1dc8a70b6d
[Attention] Support multiple attention metadata builders per kv_cache_spec + proper local attention no hybrid kv cache fix ( #21588 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-06 18:40:52 -07:00
f825c6bd22
Support encoder_only attention for FlexAttention ( #22273 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-06 18:37:14 -07:00
41b67f4263
[model] Support MiniCPM-V 4.0 ( #22166 )
...
Co-authored-by: imning3 <hbning@pku.edu.cn >
2025-08-06 18:35:46 -07:00
e8961e963a
Update flashinfer-python==0.2.10 ( #22389 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-06 18:10:24 -07:00
9a3835aaa9
Fix trtllm-gen attention env and add attention sink ( #22378 )
...
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Signed-off-by: Lain <fusiyuan2000@hotmail.com >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-06 18:07:41 -07:00
5c7cc33f4d
[gpt-oss] fix model config with hf_config ( #22401 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-06 18:04:04 -07:00
19c9365aa4
[gpt-oss] add demo tool server ( #22393 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-06 17:47:14 -07:00
eec890c1c1
[Bug] Fix B200 DeepGEMM E8M0 Accuracy Issue ( #22399 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-06 17:03:53 -07:00
46a13949d5
[v1] - Mamba1 Attention Metadata ( #21249 )
...
Signed-off-by: asafg <asafg@ai21.com >
Co-authored-by: asafg <asafg@ai21.com >
2025-08-06 17:03:42 -07:00
31f09c615f
[gpt-oss] flashinfer mxfp4 ( #22339 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-08-06 12:37:27 -07:00
31f5dc5b2a
[gpt-oss] Enhance error msg on attention sink init ( #22335 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-08-06 11:41:42 -07:00
ec7cb19224
[gpt-oss] Add loop for built-in tool call ( #22374 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-06 10:32:21 -07:00
2435ea7ed5
[Bugfix] Make condition in triton kernel constexpr ( #22370 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-08-06 10:00:58 -07:00
4a6b72c2ab
[BugFix] Fix triton compile error in kernel_unified_attention_2/3d caused by attention sinks ( #22368 )
...
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com >
2025-08-06 09:47:38 -07:00
b4b9813b5e
add the codes to check AMD Instinct GPU number ( #22367 )
...
Signed-off-by: Zhang Jason <ning.zhang2@amd.com >
2025-08-06 08:58:38 -07:00
2cb6ef8996
[BugFix] Fix FA2 RuntimeError when sinks is provided ( #22365 )
...
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com >
2025-08-06 08:03:03 -07:00
9edd1db02b
[Minor] Fix type ( #22347 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-06 02:22:03 -07:00
f263a4b53f
[gpt-oss] Support chat completion api ( #22342 )
2025-08-06 01:57:39 -07:00
54991c548a
[gpt-oss] add model to supported models doc ( #22336 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
2025-08-06 01:49:44 -07:00
178d03fbd6
[gpt-oss] Add Tool/ConversationContext classes and harmony_utils ( #22340 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-06 01:08:49 -07:00
fa00c5d75b
[Misc] Clean up duplicated hf overrides ( #22311 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-06 07:50:25 +00:00
134a8ee8fd
[gpt-oss] Add openai-harmony as default dependency ( #22332 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-06 00:10:14 -07:00
90ec006937
[gpt-oss] flashinfer attention sink init ( #22330 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
2025-08-05 23:48:19 -07:00
a47e6ffe93
[GptOss] Add GptOss reasoning parser to support structure output ( #22322 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-05 23:39:13 -07:00
98a3a81024
[ROCm] Add attention sink to use_rocm_custom_paged_attention ( #22329 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-05 23:30:38 -07:00
de98252f49
Add GPT-OSS model code and config [1/N] ( #22327 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-05 23:26:00 -07:00
796bae07c5
Update transformers to v4.55 ( #21931 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-05 22:56:14 -07:00
6e20924350
Add attention sink in attention backends ( #22320 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-05 22:37:21 -07:00
dd16bdc798
Increase openai-python version ( #22316 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-05 21:43:21 -07:00
e3c876dca3
Upgrade FA3 for attention sink ( #22313 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-05 21:36:21 -07:00
5d5d419ca6
[Bugfix][CI/Build][ROCm] Make sure to use the headers from the build folder on ROCm ( #22264 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-08-05 20:39:32 -07:00
302962e806
[Bugfix] Skip dead and non-GPU nodes for Ray DP engine allocation ( #22275 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-08-05 20:35:32 -07:00
7e6544c797
[Perf] Parallelize fill_bitmask to accelerate high-throughput guided decoding ( #21862 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-08-05 19:57:49 -07:00
8e6c7e873f
[Bugfix] Fix MoE BNB version ( #22260 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-05 19:56:22 -07:00
6a51530437
[Bugfix] Fix 3D input passed into cutlass_scaled_mm ( #22278 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-06 10:35:20 +08:00
35509fc5be
[Bugfix] Remove faulty test for oot attention backend ( #22286 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-06 00:05:40 +00:00
4b29d2784b
[CI][TPU] Fix docker clean up ( #22271 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-08-05 23:54:56 +00:00
59a0b8554b
[bugfix] fix blackwell deepep installation ( #22255 )
2025-08-06 01:26:09 +08:00
469b3ffaaa
[V1] port xformers backend to v1 ( #21342 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com >
2025-08-05 10:04:46 -07:00
ae87ddd040
[Refactor] Remove Unused Environment Variable VLLM_NO_DEPRECATION_WARNING ( #22199 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-05 09:40:23 -07:00
a7cb6101ca
[CI/Build] Update flashinfer to 0.2.9 ( #22233 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-05 09:39:38 -07:00
c494f96fbc
Use UV_LINK_MODE=copy in Dockerfile to avoid hardlink fail ( #22128 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-05 06:57:10 -07:00
0c275ad5ad
[V0 Deprecation][TPU] Remove V1 flag check from tests ( #22248 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-05 06:53:23 -07:00
74333ae2f6
[Misc] correct static type check for GroupCoordinator ( #21946 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-05 03:17:46 -07:00
83156c7b89
[NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel ( #22095 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-08-05 02:45:34 -07:00
4771df7b2b
[Feature] Non-contiguous Support for FP8 Quantization ( #21961 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-08-05 02:36:43 -07:00
05fae02175
Migrate KimiVLImagePixelInputs to TensorSchema ( #21769 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-08-05 02:36:18 -07:00
d1bf1b9711
[Docs][TPU] Highlight TPU Software version selection ( #22242 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-05 02:33:46 -07:00
586f286789
[Model] Pooling model activation supports per request control by PoolingParams ( #20538 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-05 00:37:00 -07:00
811ac13d03
[Core] Factor out common logic for MM budget calculation ( #22228 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-04 23:54:55 -07:00
e79a12fc3a
[UX] Fail if an invalid attention backend is specified ( #22217 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-08-04 23:54:52 -07:00
cdfd6871a5
[Bugfix] Misaligned params in TreeAttentionImpl ( #22226 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-04 22:40:09 -07:00
4b3e4474d7
Optimize configuration access with LRU cache in custom ops ( #22204 )
...
Signed-off-by: zitian zhao <zitian.zhao@tencentmusic.com >
2025-08-04 21:43:24 -07:00
bd3db7f469
[Misc] log more detailed message for ensure_model_parallel_initialized ( #22144 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-04 19:36:55 -07:00
29b97c0995
[Doc] add backend to doc string of initialize_model_parallel ( #22142 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-04 19:36:20 -07:00
7b455cf1c0
[Misc] Remove pass_config from CompilationConfig dump_json excluded ( #21911 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-08-04 19:17:18 -07:00
8a6e108e76
fix: kimi_k2 return empty tool call list ( #22149 )
...
Signed-off-by: tlipoca9 <tlipoca9@gmail.com >
2025-08-04 19:15:31 -07:00
d7b28f3415
[Log] DeepGEMM Update Log for Unaligned Problem Size ( #22208 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-04 19:13:19 -07:00
6fa41e0c32
self.gate dtype update for GLM-4.5 ( #22203 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-08-04 19:12:38 -07:00
031ca762d7
[ROCm][Bugfix] Compilation passes fix ( #22202 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-08-04 19:12:28 -07:00
6ad6b8e115
[FEAT] Refactor ROPE into module ( #22192 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-08-04 19:12:16 -07:00
f4f4e7ef27
[V0 deprecation][P/D] Deprecate v0 KVConnectorBase code (1/2) ( #21785 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-08-04 19:11:33 -07:00
5ea71ff46f
[V1] reduce block size for tree attention correctness test to fix 'ou… ( #22207 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com >
2025-08-04 19:11:06 -07:00
7175817637
Revert "[Bugfix] V1 Fix the cursor leakage issue during request scheduling." ( #22223 )
2025-08-04 18:37:06 -07:00
2dffac464c
[Bugfix] V1 Fix the cursor leakage issue during request scheduling. ( #21173 )
...
Signed-off-by: CLFutureX <775523362@qq.com >
2025-08-04 18:34:10 -07:00
bdcb42e45d
[NVIDIA] Auto detect modelopt quant and fix DSR1-FP4 weight loading ( #22073 )
2025-08-04 21:02:55 -04:00
c09efff976
[Bugfix][V1][P/D]Fix the uneven polling issue in the toy proxy for P2pNcclConnector ( #21819 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-08-04 20:17:05 +00:00
309c1bb822
[Bug] Update auto_tune.sh to separate benchmarking and profiling. ( #21629 )
...
Signed-off-by: Eric Hanley <ericehanley@google.com >
2025-08-04 15:12:06 +00:00
9af654cc38
[Responses API] Ignore store=True and process the request by default ( #22185 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-04 05:12:48 -07:00
a5fff3bd49
Fix Arcee model weight loading: Add custom load_weights ( #21725 )
...
Signed-off-by: alyosha-swamy <raghav@arcee.ai >
2025-08-04 04:09:56 -07:00
1539ced93a
[Doc] Update pooling model docs ( #22186 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-04 03:37:06 -07:00
54de71d0df
[Sampler] Support returning all logprobs or logits ( #21792 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-04 03:04:12 -07:00
fed5849d3f
[Bugfix] Fix failing GGUF models test ( #22174 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-04 01:27:02 -07:00
c1b4eb048a
[feat] move WEIGHT_SCALE_SUPPORTED into raise block to accelerate RLHF weight loading ( #21164 )
...
Signed-off-by: huangweixiao <huangweixiao@msh.team >
2025-08-04 15:43:06 +08:00
a7b8788d2c
[Misc] Modify the organization of GLM series ( #22171 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-03 23:51:20 -07:00
8ecb3e9e93
[CI Bugfix] Fix wNa16 kernel not found for test_shared_storage_connector_hashes ( #22163 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-08-03 22:19:04 -07:00
e5949e5ae0
Remove index_put from MM embeddings merging ( #22105 )
...
Co-authored-by: Chenxi Yang <cxyang@meta.com >
2025-08-03 22:15:14 -07:00
49bcd893e7
[refactor] improve ConstantList exception specificity ( #22156 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-08-03 22:14:49 -07:00
aa7012eb6d
Add tree attention backend for v1 (part 1) ( #20401 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com >
2025-08-03 22:13:26 -07:00
c2e75b3c11
remove duplicate code within cleanup_dist_env_and_memory ( #22147 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-03 20:03:58 -07:00
0d7db16a92
[PD] add test for chat completions endpoint ( #21925 )
...
Signed-off-by: Abirdcfly <fp544037857@gmail.com >
2025-08-03 19:57:03 -07:00
845420ac2c
[RLHF] Fix torch.dtype not serializable in example ( #22158 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-04 02:43:33 +00:00
e27d25a0dc
[fix] fix correct assertion syntax error in attention utils. ( #22154 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-08-03 19:24:02 -07:00
6f5478298d
Use aiohttp connection pool for benchmarking ( #21981 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-08-03 19:23:32 -07:00
6a39ba85fe
[Bugfix] Fix failing multimodal standard test ( #22153 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-03 19:04:38 +00:00
d3c18c9cb0
fuse fp32 for GLM-4.5 e_score_correction_bias ( #22143 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-08-03 09:04:54 -07:00
83f7bbb318
Add chat doc in quick start ( #21213 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-03 07:47:55 -07:00
b5dfb94fa0
[CI/Build][Bugfix] Fix Qwen2.5 tests in CPU CI via fallback silu_and_mul to torch native implementation ( #22145 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-03 05:34:04 -07:00
6d98843b31
[Responses API] Disable response store by default ( #22137 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-03 04:04:21 -07:00
aefeea0fde
[V1] [P/D] Refactor KV Connector Path ( #21980 )
...
Signed-off-by: David Ben-David <davidb@pliops.com >
Co-authored-by: David Ben-David <davidb@pliops.com >
2025-08-03 04:03:40 -07:00
24d1dffbeb
[executor] feat: add supports_pp attr to executors ( #21786 )
...
Signed-off-by: Haibin Lin <haibin.lin@bytedance.com >
2025-08-03 18:04:45 +08:00
7de45db9a5
[Misc] update doc comment for send ( #22026 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-03 00:55:20 -07:00
789562c28c
Support CUTLASS NVFP4 (w4a4) for Blackwell Geforce GPUs (SM120) ( #21309 )
...
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es >
2025-08-03 00:54:22 -07:00
3f36c325fa
[Benchmark] Support ready check timeout in vllm bench serve ( #21696 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-08-03 00:52:38 -07:00
3dddbf1f25
[Misc] Add tensor schema test coverage for multimodal models ( #21754 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-03 00:52:14 -07:00
337eb23bcc
[Fix] Fix llama4 modelopt weight loading error ( #22107 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-08-03 00:50:34 -07:00
2ff46b8826
[Misc] Bump ray to 2.48.0 ( #22123 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-08-02 19:42:00 -07:00
554df8a6a2
Revert "[compile][startup] Disable C++ compilation of symbolic shapes" ( #22122 )
...
Signed-off-by: Xiao Liu <xiszishu@gmail.com >
2025-08-02 09:03:30 -07:00
73e1b9b1d4
[xpu]support moe models on XPU platform ( #21643 )
...
Signed-off-by: yan <yan.ma@intel.com >
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-08-02 07:49:08 -07:00
4abfd8796f
[V1] [Hybrid] Validate compatibility of attention backend batch reordering at init time ( #21557 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-02 05:29:40 -07:00
f5d0f4784f
[Frontend] Improve error message for too many mm items ( #22114 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-02 02:20:38 -07:00
b690e34824
[Model] Mamba2 preallocate SSM output tensor to avoid d2d copy overhead ( #21075 )
...
Signed-off-by: Chih-Chieh Yang <7364402+cyang49@users.noreply.github.com >
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-08-02 01:59:34 -07:00
25373b6c6c
for glm-4.1V update ( #22000 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-08-02 01:46:57 -07:00
58eee5f2e0
[PERF] Use faster way of decode in tokenizer: avoid useless list-to-list conversion ( #20000 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai >
2025-08-02 01:43:52 -07:00
067c34a155
docs: remove deprecated disable-log-requests flag ( #22113 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
2025-08-02 00:19:48 -07:00
c64861d63c
[Bugfix] Mamba2 remove bugged initial state condition in chunk scan ( #22034 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-08-01 23:55:57 -07:00
8564dc9448
Fix test_kv_sharing_fast_prefill flakiness ( #22038 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-01 23:55:34 -07:00
4ac8437352
[Misc] Getting and passing ray runtime_env to workers ( #22040 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-08-01 23:54:40 -07:00
d3a6f2120b
[FEAT][ROCm] Enable running Flash Attention as ViT attn backend for Qwen-VL models on ROCm platform. ( #22069 )
...
Signed-off-by: tjtanaavllm <tunjian.tan@amd.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: tjtanaavllm <tunjian.tan@amd.com >
2025-08-01 23:53:18 -07:00
0edaf752d7
[Attention][DBO] Add support for "splitting" the CommonAttentionMetadata ( #21153 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-08-01 19:47:53 -07:00
6e8d8c4afb
[Test] Add Unit Test for Batched DeepGEMM ( #21559 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-02 10:45:46 +08:00
8d524ce79f
[BugFix] Improve internal DP load balancing ( #21617 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-01 19:45:27 -07:00
9f9c38c392
[Speculators][Speculative Decoding] Add Qwen Eagle3 Support ( #21835 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
2025-08-01 19:43:37 -07:00
a65f46be5e
[Misc] DeepGemmExperts : Avoid JIT generation in the hot-path ( #21955 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-08-01 19:42:03 -07:00
57393715e8
[Misc] VLLM_TARGET_DEVICE.lower() ( #22101 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-01 19:41:40 -07:00
ee2eb6ecd8
[Model] Qwen2.5 VL SiLU-and-Mul ( #22066 )
...
Signed-off-by: kf <kuanfu.liu@embeddedllm.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: kf <kuanfu.liu@embeddedllm.com >
2025-08-01 19:34:37 -07:00
23322431c8
[V1][CUDA] Full cudagraph support for FlashInfer ( #21367 )
2025-08-01 21:49:34 -04:00
3654847db5
feat: Add Support GPTQ Quantization MOE on ROCM vllm serve ( #21733 )
2025-08-01 21:12:19 -04:00
eefbf4a68b
[Perf] Optimize reshape_and_cache_flash CUDA Kernel ( #22036 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-01 19:18:51 -04:00
88faa466d7
[CI] Initial tests for SM100 Blackwell runner ( #21877 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-01 16:18:38 -07:00
881e1af43a
[BugFix] Harden distributed DP startup ( #21538 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-01 21:40:45 +00:00
d84b97a3e3
Add lora test for tp>1 case for TPU. ( #21970 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-08-01 18:56:08 +00:00
d331759488
Introduce RayPPCommunicator for ray-based PP ( #21660 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-08-01 11:50:58 -07:00
9659bc7f27
[compile][startup] Disable C++ compilation of symbolic shapes ( #20836 )
...
Signed-off-by: Animesh Jain <anijain@umich.edu >
2025-08-01 10:38:52 -07:00
3277e8f9e1
Fix pre-commit failure for SECURITY.md ( #22102 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-01 10:36:07 -07:00
8d705996df
[Misc] Minor enhancement of benchmark_moe ( #22068 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-02 01:35:30 +08:00
38c8bce8b6
Enable headless models for pooling in the Transformers backend ( #21767 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-01 10:31:29 -07:00
ac45c44d98
[Bugfix] [Performance] DeepEPHighThroughput + DeepSeek : Quant before Dispatch ( #21837 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-08-01 10:14:38 -07:00
d6664664b4
security policy: take 1 ( #21119 )
...
Signed-off-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-08-01 10:09:49 -07:00
b879ecd6e2
[Bugfix] fix when skip tokenizer init ( #21922 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-01 10:09:36 -07:00
3f8e952179
[Bugfix] Fix glm4.1v video inference issue ( #22067 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-01 09:33:30 -07:00
326a1b001d
Improve documentation of ModelConfig.try_get_generation_config to prevent future confusion ( #21526 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-01 09:32:27 -07:00
2d7b09b998
Deprecate --disable-log-requests and replace with --enable-log-requests ( #21739 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-01 17:16:37 +01:00
97608dc276
[Docs] use uv in CPU installation docs ( #22089 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-08-01 07:55:55 -07:00
3146519add
[BugFix] Don't change title of top-level process ( #22032 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-01 07:37:55 -07:00
8026a335a1
[BugFix] Update AttnFusionPass cache key ( #21947 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-08-01 07:11:29 -07:00
a59cd9d9f7
[Refactor] Fix Compile Warning #1444-D ( #21462 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-01 06:10:30 -07:00
5c54d9759d
[Bugfix][PD] set max_completion_tokens=1 if req has this value ( #21841 )
...
Signed-off-by: Abirdcfly <fp544037857@gmail.com >
2025-08-01 06:08:45 -07:00
0a6d305e0f
feat(multimodal): Add customizable background color for RGBA to RGB conversion ( #22052 )
...
Signed-off-by: Jinheng Li <ahengljh@gmail.com >
Co-authored-by: Jinheng Li <ahengljh@gmail.com >
2025-08-01 06:07:33 -07:00
f81c1bb055
[Bugfix] Check NVIDIA artifactory is accessible before using flashinfer cubin kernels ( #21893 )
2025-08-01 08:28:45 -04:00
fb0e0d46fc
Fix get_kwargs for case where type hint is list[Union[str, type]] ( #22016 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-01 05:26:42 -07:00
26b5f7bd2a
[BUG] [ROCm] Fix import bug on ROCm ( #22083 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-08-01 05:25:20 -07:00
dfbc1f8880
[Speculative Decoding] Add speculators config support ( #21345 )
2025-08-01 08:25:18 -04:00
87c94bc879
Revert "Update sampling_metadata.py ( #21937 )" ( #22088 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-01 05:24:46 -07:00
28b18cc741
[Quantization] Enable BNB support for InternS1 ( #21953 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-01 11:09:54 +00:00
4931486988
[Doc] Added warning of speculating with draft model ( #22047 )
...
Signed-off-by: Dilute-l <dilu2333@163.com >
Co-authored-by: Dilute-l <dilu2333@163.com >
2025-08-01 02:11:56 -07:00
0f81b310db
[Misc] Remove upper bound in openai package version ( #22060 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-01 02:11:40 -07:00
e6680f9e25
[Bugfix] Add log prefix in non-dp mode engine core ( #21889 )
...
Signed-off-by: wuhang <wuhang6@huawei.com >
2025-08-01 09:04:16 +00:00
27a145e893
[Doc] Add example for Step3-VL ( #22061 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
2025-08-01 08:35:49 +00:00
da31f6ad3d
Revert precompile wheel changes ( #22055 )
2025-08-01 08:26:24 +00:00
98df153abf
[Frontend] Align tool_choice="required" behavior with OpenAI when tools is empty ( #21052 )
...
Signed-off-by: Sungyoon Jeong <sungyoon.jeong@furiosa.ai >
2025-08-01 07:54:17 +00:00
e0f63e4a35
[Core] Avoid repeated len(block_token_ids) check in hash_request_tokens ( #21781 )
...
Signed-off-by: linzebing <linzebing1995@gmail.com >
2025-08-01 00:23:29 -07:00
b4e081cb15
[Bugfix] Disable multi-modal preprocessor cache for DP ( #21896 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-01 08:03:56 +01:00
79731a79f0
[Doc] Fix a syntax error of example code in structured_outputs.md ( #22045 )
...
Signed-off-by: wangzi <3220100013@zju.edu.cn >
Co-authored-by: wangzi <3220100013@zju.edu.cn >
2025-08-01 00:01:22 -07:00
53d7c39271
Update sampling_metadata.py ( #21937 )
...
Signed-off-by: Aviad Rossmann <aviadr@neureality.ai >
2025-07-31 23:23:18 -07:00
61dcc280fa
[Doc] Add Voxtral to Supported Models page ( #22059 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-31 23:10:56 -07:00
0f46a780d4
[Model] [Quantization] Support quantization for Gemma3n ( #21974 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-07-31 22:45:15 -07:00
e1a7fe4af5
[BugFix] fix: aot passes kvcache dtype information ( #19750 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
2025-08-01 05:45:02 +00:00
82de9b9d46
[Misc] Automatically resolve HF processor init kwargs ( #22005 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-31 22:44:10 -07:00
ad57f23f6a
[Bugfix] Fix: Fix multi loras with tp >=2 and LRU cache ( #20873 )
...
Signed-off-by: charent <19562666+charent@users.noreply.github.com >
2025-07-31 19:48:13 -07:00
3700642013
[Refactor] Remove Duplicate per_block_cast_to_fp8, Remove Dependencies of DeepGEMM ( #21787 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-01 01:13:27 +00:00
0bd409cf01
Move flashinfer-python to optional extra vllm[flashinfer] ( #21959 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-31 18:02:11 -07:00
e360316ab9
Add DeepGEMM to Dockerfile in vllm-base image ( #21533 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-31 18:01:55 -07:00
c3e0e9337e
[Feature] Add Flashinfer MoE Support for Compressed Tensor NVFP4 ( #21639 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-31 15:26:11 -07:00
6e672daf62
Add FlashInfer allreduce RMSNorm Quant fusion ( #21069 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-07-31 13:58:38 -07:00
2dff2e21d9
[Bugfix] Fix MTP weight loading ( #21941 )
2025-07-31 16:33:53 -04:00
71470bc4af
[Misc] Add unit tests for chunked local attention ( #21692 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-31 11:39:16 -07:00
9e0726e5bf
[Meta] Official Eagle mm support, first enablement on llama4 ( #20788 )
...
Signed-off-by: morgendave <morgendave@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-07-31 10:35:07 -07:00
53c21e492e
Update torch_xla pin to 20250730 ( #21956 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-07-31 17:26:43 +00:00
0780bb5783
Removing amdproduction Tests ( #22027 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-07-31 09:53:27 -07:00
58bb902186
fix(setup): improve precompiled wheel setup for Docker builds ( #22025 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
2025-07-31 09:52:48 -07:00
7349d5268b
[ez] Remove a trailing space from compilation/decorators.py ( #22028 )
2025-07-31 09:46:07 -07:00
9484641616
[Model] Add step3 vl ( #21998 )
...
Signed-off-by: oliveryuan <yuansong@step.ai >
Co-authored-by: oliveryuan <yuansong@step.ai >
2025-07-31 23:19:06 +08:00
207b750e19
[NVIDIA] Add SM100 Flashinfer MoE per tensor scale fp8 backend ( #21458 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-31 06:00:01 -07:00
5daffe7cf6
[BugFix] Fix case where collective_rpc returns None ( #22006 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-31 12:51:37 +00:00
2836dd73f1
[Model][CI] Let more pooling models support v1 ( #21747 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-31 01:51:15 -07:00
d2aab336ad
[CI/Build] get rid of unused VLLM_FA_CMAKE_GPU_ARCHES ( #21599 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
2025-07-31 15:00:08 +08:00
9532a6d563
[Deprecation] Remove deprecated args and methods ( #21907 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-30 23:46:38 -07:00
3e36fcbee6
[Bugfix]: fix metadata file copy in test_sharded_state_loader ( #21830 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-31 06:22:11 +00:00
055bd3978e
[CI Bugfix] Fix CI OOM for test_shared_storage_connector_hashes ( #21973 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-31 11:45:29 +08:00
0f7919fca0
[Misc] Expand SUPPORTED_HIDDEN_SIZES for DeepEP low-latency kernels ( #21818 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-30 20:41:12 -07:00
61445453df
[UX] Rename CUTLASS_MLA_VLLM_V1 to CUTLASS_MLA ( #21966 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-30 20:40:34 -07:00
ec02e536df
[Bugfix] Relax lang pin for voxtral ( #21833 )
...
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-30 20:38:52 -07:00
9cb497bfa3
[Example] Add async_llm_streaming.py example for AsyncLLM streaming in python ( #21763 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-30 18:39:46 -06:00
ca9e2be3ed
[Core] Move EngineCoreRequest to Request conversion out of EngineCore ( #21627 )
...
Signed-off-by: linzebing <linzebing1995@gmail.com >
2025-07-30 15:00:54 -07:00
601f856d56
[Bugfix] Fix None value handling in trace span creation for cancelled requests ( #20272 )
2025-07-30 14:44:02 -07:00
287f527f54
[Feature] Add async tensor parallelism for scaled mm ( #20155 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-07-30 17:23:41 -04:00
f12d9256b3
[Misc] Use dracut on CentOS and skip clone if repo exists for EP kernel installation ( #21635 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-07-30 13:15:06 -07:00
b9b753e7a7
For VLLM_USE_PRECOMPILED, only compiled .so files should be extracted ( #21964 )
2025-07-30 13:04:40 -07:00
56bd537dde
[Misc] Support more collective_rpc return types ( #21845 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-30 10:20:20 -07:00
8f0d516715
[TPU] Support Pathways in vLLM ( #21417 )
...
Signed-off-by: wenxindongwork <wenxindong@google.com >
2025-07-30 10:02:12 -07:00
f4135232b9
feat(distributed): add get_required_kvcache_layout class method to kv connector api ( #20433 )
...
Signed-off-by: wxsm <wxsms@foxmail.com >
2025-07-30 16:41:51 +00:00
4904e53c32
[Bugfix] SharedStorage Connector for V1 PD multimodal ( #21611 )
...
Signed-off-by: fake0fan <645327136@qq.com >
Signed-off-by: herotai214 <herotai214@gmail.com >
Co-authored-by: herotai214 <herotai214@gmail.com >
2025-07-30 09:18:37 -07:00
004203e953
[CI/Build] Fix registry tests ( #21934 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-30 09:10:41 -07:00
5c765aec65
[Bugfix] Fix TypeError in scheduler when comparing mixed request_id types ( #21816 )
...
Signed-off-by: chiliu <chiliu@paypal.com >
Co-authored-by: chiliu <chiliu@paypal.com >
2025-07-30 08:54:44 -07:00
ad510309ee
Override attention metadata for fast prefill in some KV sharing setups ( #21590 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-30 08:54:15 -07:00
366f6b3a4d
[Bugfix] Fix multi-api server not working for text models ( #21933 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-30 08:42:05 -07:00
6e599eebe8
[Bugfix] Fix OOM tests in initialization test ( #21921 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-30 07:35:47 -07:00
88edf5994c
[Docs] Reduce the size of the built docs ( #21920 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-30 07:35:08 -07:00
ff08e51940
[NVIDIA] Fix Llama4 Scout FP4 functionality issues ( #21499 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
2025-07-30 07:33:40 -07:00
8f4a1c9a04
[Misc] Improve code readability of KVCacheManager ( #21673 )
...
Signed-off-by: tanruixiang <tanruixiang0104@gmail.com >
Signed-off-by: Ruixiang Tan <819464715@qq.com >
Signed-off-by: GitHub <noreply@github.com >
2025-07-30 07:20:43 -07:00
36ede45989
Reduce time wasted in GitHub Actions using concurrency ( #21919 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-30 07:18:02 -07:00
0e40b26073
[CI/Build] Only run markdownlint in CI ( #21892 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-30 07:17:14 -07:00
0271c2ff2f
[Test] Add Benchmark and Unit Test for per_token_group_quant ( #21860 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-30 07:15:02 -07:00
e91d3c9cda
[misc] skip p2p check by default ( #21904 )
2025-07-30 22:05:04 +08:00
bf668b5bf5
[Feature] Support multiple api keys in server ( #18548 )
...
Signed-off-by: Yan Pashkovsky <yanp.bugz@gmail.com >
2025-07-30 07:03:23 -07:00
da3e0bd6e5
[Bugfix] we should use metavar is not choices ( #21902 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-30 06:51:58 -07:00
fcfd1eb9c5
[Doc] Remove vLLM prefix and add citation for PagedAttention ( #21910 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-30 06:36:34 -07:00
d979dd6beb
[Feature][EPLB] Add eplb support for Qwen3 ( #20815 )
...
Signed-off-by: aladerran <aladerran@gmail.com >
2025-07-30 06:27:57 -07:00
b876860c62
[Hardware][CPU] Build fix for ARM without BF16 ( #21848 )
...
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-07-30 06:22:00 -07:00
13986365a9
Add @patrickvonplaten as maintainer of mistral's related files. ( #21928 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-07-30 20:42:51 +08:00
5c8fe389d6
[Docs] Fix the example code of streaming chat completions in reasoning ( #21825 )
...
Signed-off-by: wangzi <3220100013@zju.edu.cn >
Co-authored-by: wangzi <3220100013@zju.edu.cn >
Co-authored-by: Zi Wang <66560864+BruceW-07@users.noreply.github.com >
2025-07-30 12:11:58 +00:00
5bbaf492a6
[Doc] Update partial support ( #21916 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-30 01:32:39 -07:00
533db0935d
[benchmark] add max-concurrency in result table ( #21095 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-07-30 01:15:43 -07:00
fc91da5499
[Model] Remove DSV2 unused code ( #21903 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-30 00:55:03 -07:00
547795232d
[Tests] Fixing bug inside MultiModalProfiler. ( #21842 )
...
Signed-off-by: Varun Shenoy <varun.vinayak.shenoy@oracle.com >
2025-07-30 00:44:15 -07:00
30ef30ed5a
[CI] rollback lint-and-deploy pipeline using amd machine ( #21912 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-07-30 00:37:59 -07:00
02f82fe438
[Doc] Update Intern-S1 info ( #21908 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-29 23:58:57 -07:00
2ca5f82c2a
[Misc] Remove redundant config definitions ( #21891 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-29 23:54:18 -07:00
6f8d261882
Update vLLM Benchmark Suite for Xeon based on 0.9.2 release ( #21486 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-07-30 05:57:03 +00:00
4cd7fe6cea
[Docs] Expand introduction to Ray in Multi-node deployment section ( #21584 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-29 22:07:28 -07:00
16f3250527
[CI/Build] Fix pre-commit failure in docs ( #21897 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-29 21:53:08 -07:00
e3bc17ceea
Add @sighingnow as maintainer of qwen's related files. ( #21895 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-07-29 21:30:44 -07:00
05cbbe20c5
[XPU] use ZE_AFFINITY_MASK for device select on xpu ( #21815 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-07-30 03:56:14 +00:00
65f311ce59
[Frontend] Add LLM.reward specific to reward models ( #21720 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-29 20:56:03 -07:00
1b0a155534
[Perf] Using __nv_fp8_e4m3 instead of c10::e4m3 for per_token_group_quant ( #21867 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-29 21:50:46 -06:00
44bc46da60
[Bugfix] Actually disable processing cache when API server is scaled out ( #21839 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-29 20:36:04 -07:00
b7b23da4d2
[Bugfix] Fix comment typo of get_num_common_prefix_blocks() ( #21827 )
...
Signed-off-by: MingzhenHan <hanmingzhen2002@outlook.com >
2025-07-29 20:35:33 -07:00
fdde18229e
[Bugfix] Fix shape mismatch assertion error when loading Gemma3n model with BitsAndBytes quantization ( #21808 )
...
Signed-off-by: sydarb <areebsyed237@gmail.com >
2025-07-30 11:35:21 +08:00
b917da442b
Expose PyTorch profiler configuration to environment variables ( #21803 )
...
Signed-off-by: Csrayz <33659823+Csrayz@users.noreply.github.com >
2025-07-29 19:46:31 -07:00
fb58e3a651
[Docs] Update docker.md with HF_TOKEN, new model, and podman fix ( #21856 )
2025-07-29 19:45:41 -07:00
76080cff79
[DOC] Fix path of v1 related figures ( #21868 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-29 19:45:18 -07:00
ba5c5e5404
[Docs] Switch to better markdown linting pre-commit hook ( #21851 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-29 19:45:08 -07:00
555e7225bc
[v1][attention] Support Hybrid Allocator + FlashInfer ( #21412 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-07-30 01:45:29 +00:00
0e36abf993
[Bugfix] Correct max tokens for non-contiguous embeds ( #21798 )
...
Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com >
Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com >
2025-07-30 01:16:25 +00:00
452b2a3180
[ci] mark blackwell test optional for now ( #21878 )
2025-07-29 18:03:27 -07:00
0d0cc9e150
[ci] add b200 test placeholder ( #21866 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-07-29 17:11:50 -07:00
9266d98048
[BugFix] Fix interleaved sliding window not set for Gemma3n ( #21863 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-29 16:34:19 -07:00
176bbce1db
Revert "[AMD][CI/Build] Fix the AMD issue caused by inappropriate of symbol exposure ( #21647 )" ( #21850 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-29 21:56:29 +00:00
a1873db23d
docker: docker-aware precompiled wheel support ( #21127 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
2025-07-29 14:45:19 -07:00
a33ea28b1b
Add flashinfer_python to CUDA wheel requirements ( #21389 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-29 12:51:58 -07:00
7b49cb1c6b
[Doc] update Contributing page's testing section ( #18272 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-07-29 10:32:46 -07:00
f03e9cf2bb
[Doc] Add FusedMoE Modular Kernel Documentation ( #21623 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-29 10:32:30 -07:00
37f86d9048
[Docs] use uv in GPU installation docs ( #20277 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-07-29 10:32:06 -07:00
58b11b24a6
[Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backend ( #21525 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-07-29 10:34:00 -04:00
ad341c5194
[Bugfix]fix mixed bits and visual language model quantization in AutoRound ( #21802 )
...
Signed-off-by: Wenhua Cheng <wenhua.cheng@intel.com >
2025-07-29 07:26:31 -07:00
759b87ef3e
[TPU] Add an optimization doc on TPU ( #21155 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-29 07:23:19 -07:00
f693b067a2
[Docs] Merge design docs for a V1 only future ( #21832 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-29 07:22:50 -07:00
04e38500ee
[Bugfix] VLLM_V1 supports passing other compilation levels ( #19340 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-07-29 09:35:58 -04:00
ab714131e4
[Doc] Update compatibility matrix for pooling and multimodal models ( #21831 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-29 06:29:51 -07:00
755fa8b657
[KVCache] Make KVCacheSpec hashable ( #21791 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-07-29 19:58:29 +08:00
2470419119
[Docs] Fix the outdated URL for installing from vLLM binaries ( #21523 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-29 04:56:27 -07:00
61a6905ab0
[Model] Refactor JambaForCausalLM ( #21394 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-29 18:25:07 +08:00
37efc63b64
[V0 deprecation] Guided decoding ( #21347 )
...
Signed-off-by: Reza Barazesh <rezabarazesh@meta.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-29 03:15:30 -07:00
a4528f0cac
[Model]: Fused MoE for nomic-embed-text-v2-moe ( #18321 )
...
Signed-off-by: isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-29 03:13:27 -07:00
a2480251ec
[Doc] Link to RFC for pooling optimizations ( #21806 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-28 23:53:18 -07:00
7234fe2685
[Misc] Rework process titles ( #21780 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-29 05:14:47 +00:00
f1e2c095ec
Migrate InternVLImageInputs and InternVLVideoInputs to TensorSchema ( #21684 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-28 22:09:45 -07:00
12a223ef9b
[AMD][CI/Build][Bugfix] Guarding CUDA specific functions by ifndef ROCM ( #21766 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-29 03:35:37 +00:00
e18f085103
skip fusedmoe layer for start_load_kv ( #21378 )
...
Signed-off-by: calvin chen <wen.chen@dynamia.ai >
2025-07-28 18:59:44 -07:00
afa2607596
[CI] Parallelize Kernels MoE Test ( #21764 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-28 18:56:24 -07:00
48b763d6b5
[Refactor] Merge Compressed Tensor FP8 CompressedTensorsW8A8Fp8MoEMethod and CompressedTensorsW8A8Fp8MoECutlassMethod ( #21775 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-28 19:47:21 -06:00
947e982ede
[Docs] Minimize spacing for supported_hardware.md table ( #21779 )
2025-07-28 18:46:39 -07:00
c6c9122d50
[Kernel] SM90 CUTLASS FP8 GEMM: add support for swap AB + kernel tuning ( #20396 )
...
Signed-off-by: Faqin Zhong <faqin.zhong@gmail.com >
Co-authored-by: Duncan Moss <djm.moss@gmail.com >
2025-07-28 23:13:58 +00:00
8aa1485fcf
[Perf] Disable chunked local attention by default with llama4 ( #21761 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-28 18:49:04 -04:00
89ac266b26
[Feat]: Add support for Dynamic Quant 4 bit CPU kleidiai kernels ( #17112 )
...
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-28 20:55:15 +00:00
c6f36cfa26
[Bugfix] DeepGEMM is not enabled on B200 due to _lazy_init() ( #21472 )
...
Signed-off-by: Clayton Coleman <smarterclayton@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-28 20:51:22 +00:00
b18b417fbf
Revert "[V1] Exception Handling when Loading KV Cache from Remote Store" ( #21778 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-07-28 20:15:18 +00:00
9ba1c88a93
[AMD][CI/Build] Fix the AMD issue caused by inappropriate of symbol exposure ( #21647 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-28 20:11:16 +00:00
e0e58f9729
[Bug] Enforce contiguous input for dynamic_scaled_fp8_quant and static_scaled_fp8_quant ( #21773 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-28 19:55:48 +00:00
b361f14e39
[AMD][BugFix] Fix omission of wvSplitK kernel for small batch sizes (1-4) due to torch.compile ( #21350 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-07-28 15:38:20 -04:00
01c753ed98
update flashinfer to v0.2.9rc2 ( #21701 )
...
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com >
2025-07-28 19:31:47 +00:00
94b71ae106
Use metavar to list the choices for a CLI arg when custom values are also accepted ( #21760 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-28 19:31:10 +00:00
7d44c691b0
[P/D] Log warnings related to prefill KV expiry ( #21753 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-28 18:40:53 +00:00
e17a4d3bf9
[Bugfix] Fix granite speech shape validation ( #21762 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-28 14:19:21 -04:00
ec261b0291
[XPU] IPEX-optimized Punica Wrapper on XPU ( #21703 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-28 16:43:37 +00:00
04fe61aa3d
[CI/Build] Fix plugin tests ( #21758 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-28 15:08:05 +00:00
25708d317a
[Bugfix] Mistral crashes on tool with no description ( #21167 )
...
Signed-off-by: HugoMichard <hugo@harfanglab.fr >
2025-07-28 08:03:35 -07:00
0e18a5d058
[Misc] Reduce logs for model resolution ( #21765 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-28 07:59:56 -07:00
34a20c49b3
[Logs] Change flashinfer sampler logs to once ( #21759 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-28 06:59:51 -07:00
31084b3b1f
[Bugfix][CI/Build] Update peft version in test requirement ( #21729 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-28 06:17:43 -07:00
bccc43c033
[Bugfix]check health for engine core process exiting unexpectedly ( #21728 )
...
Signed-off-by: wuhang <wuhang6@huawei.com >
2025-07-28 06:17:31 -07:00
1395dd9c28
[Docs] Add revision date to rendered docs ( #21752 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-28 06:12:46 -07:00
9ace2eaf35
[Bugfix] Improve JSON extraction in LlamaToolParser ( #19024 )
...
Signed-off-by: keru <keyang.ru@oracle.com >
Co-authored-by: keru <keyang.ru@oracle.com >
2025-07-28 12:36:58 +00:00
656c24f1b5
[Ernie 4.5] Name Change for Base 0.3B Model ( #21735 )
...
Signed-off-by: vasqu <antonprogamer@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-28 12:22:32 +00:00
63fe3a700f
[PD] let p2p nccl toy proxy handle /chat/completions ( #21734 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-28 11:45:50 +00:00
0ae970ed15
[Bugfix] Fix glm4.1v video_grid_thw tensor shape scheme ( #21744 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-28 04:26:49 -07:00
65e8466c37
[Bugfix] Fix environment variable setting in CPU Dockerfile ( #21730 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-28 11:02:39 +00:00
1b769dccf3
[Bugfix] Fix Ernie4_5_MoeForCausalLM shared experts ( #21717 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-28 11:02:25 +00:00
2cc571199b
[feature] add log non default args in LLM ( #21680 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-28 02:21:22 -07:00
a4ed731546
[Model] Prioritize Transformers fallback over suffix matching ( #21719 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-28 02:15:31 -07:00
d128d0d554
Migrate KeyeImageInputs and KeyeVideoInputs to TensorSchema ( #21686 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-28 01:16:35 -07:00
a6c050286a
[v1][mamba] Added mamba_type into MambaSpec ( #21715 )
...
Signed-off-by: asafg <asafg@ai21.com >
Co-authored-by: asafg <asafg@ai21.com >
2025-07-28 08:15:55 +00:00
139a7f07bd
[BugFix] Fix ChunkedLocalAttention when the hybrid kv-cache is disabled ( #21707 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-28 07:18:47 +00:00
150d9e6337
[Bugfix] fix max-file-size type from str to int ( #21675 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-28 00:06:52 -07:00
139a97ec56
[Bugfix] Fix shape checking for Fuyu ( #21709 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-28 00:05:56 -07:00
18cc33dd60
[bugfix] fix profile impact benchmark results ( #21507 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-27 22:44:24 -07:00
7656cf4cf3
[Bugfix] [issue-21565] Fix the incompatibility issue with stream and named function calling when Thinking is disabled ( #21573 )
...
Signed-off-by: wangzi <3220100013@zju.edu.cn >
Co-authored-by: wangzi <3220100013@zju.edu.cn >
2025-07-27 22:43:50 -07:00
3ea57a56d9
Migrate Idefics3ImagePixelInputs and Idefics3ImageEmbeddingInputs to … ( #21683 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-27 22:37:23 -07:00
75856bc2cb
Migrate GraniteSpeechAudioInputs to TensorSchema ( #21682 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-07-27 22:37:20 -07:00
304dcdf575
Migrate GLMVImagePixelInputs to TensorSchema ( #21679 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-27 22:36:11 -07:00
88e46c7c8d
Migrate Glm4vImageInputs, Glm4vVideoInputs to TensorSchema ( #21678 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-07-27 22:36:08 -07:00
d8937de4c8
Migrate Gemma3ImagePixelInputs to TensorSchema ( #21676 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-27 22:36:05 -07:00
e626d286f5
[FEAT] [ROCm] [AITER]: Add AITER HIP block quant kernel ( #21242 )
2025-07-28 05:07:06 +00:00
c7ffe93d9c
[Model] Support TP/PP/mamba2 kernel for PLaMo2 ( #19674 )
...
Signed-off-by: Shinichi Hemmi <shemmi@preferred.jp >
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com >
Co-authored-by: Calvin Metzger <metzger@preferred.jp >
Co-authored-by: Sixue Wang <cecilwang@preferred.jp >
2025-07-28 05:00:47 +00:00
15a72ac478
[V1] Exception Handling when Loading KV Cache from Remote Store ( #21534 )
...
Signed-off-by: liuyumoye <adeline_ly2023@outlook.com >
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com >
2025-07-27 20:34:17 -07:00
04ff4be310
[Misc] Add fused_moe configs for Qwen3-Coder-480B-A35B-Instruct-FP8 ( #21700 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-27 20:12:18 -07:00
93269bb43e
Fix GLM tool parser ( #21668 )
...
Co-authored-by: Chenhui Zhang <zhang.chenhui@outlook.com >
2025-07-28 10:46:38 +08:00
82acf2184d
Fix typo for limit-mm-per-prompt in docs ( #21697 )
...
Signed-off-by: Joachim Studnia <joachim@mistral.ai >
2025-07-27 19:45:37 -07:00
86ae693f20
[Deprecation][2/N] Replace --task with --runner and --convert ( #21470 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-27 19:42:40 -07:00
8f605ee309
[Attention] Make CutlassMLA the default backend for SM100 (blackwell) ( #21626 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-27 20:13:00 +00:00
a9b2a1d704
[Misc] Refactor vllm config str ( #21666 )
2025-07-27 09:51:44 -07:00
57c22e57f9
Fix CUDA permute/unpermute for use with DeepGemm Moe ( #17934 )
...
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn >
2025-07-27 07:08:00 -07:00
bda9d0535f
[Refactor] Refactor MOE NVFP4 Code Base: ModelOpt + Compressed Tensor ( #21631 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-27 05:25:21 -07:00
3d847a3125
[VLM] Add video support for Intern-S1 ( #21671 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-27 11:49:43 +00:00
5f8c9a425e
Migrate Florence2ImagePixelInputs to TensorSchema ( #21663 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-27 02:43:02 -07:00
1cbf951ba2
[Misc] add default value for file pattern arg ( #21659 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-27 05:14:51 +00:00
a8936e5193
Refactor: Remove numpy dependency from LoggingStatLogger ( #20529 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-07-27 04:06:21 +00:00
01a395e9e7
[CI/Build][Doc] Clean up more docs that point to old bench scripts ( #21667 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-07-27 04:02:12 +00:00
971948b846
Handle non-serializable objects in vllm bench ( #21665 )
2025-07-27 03:35:22 +00:00
eed2f463b2
[VLM] Support HF format Phi-4-MM model ( #17121 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-26 20:07:57 -07:00
20950b29fb
Migrate ChameleonImagePixelInputs to TensorSchema ( #21657 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-26 19:34:25 -07:00
3339cba3ff
Migrate FuyuImagePatchInputs to TensorSchema ( #21662 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-26 19:34:14 -07:00
0b8caf9095
Migrate DeepseekVL2ImageInputs to TensorSchema ( #21658 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-26 19:34:11 -07:00
ccf27cc4d4
Migrate Blip2ImagePixelInputs and Blip2ImageEmbeddingInputs to TensorSchema ( #21656 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-27 10:33:52 +08:00
c657369841
support torch.compile for bailing moe ( #21664 )
2025-07-26 23:54:32 +00:00
6c66f28fa5
Remove xformers requirement for Mistral-format Pixtral and Mistral3 ( #21154 )
...
Signed-off-by: Wenchen Lo <charles761013@gmail.com >
2025-07-26 17:20:29 -06:00
de509ae8eb
[NVIDIA] Explicitly disable shuffled weights for flashinfer blockscale moe fp8 kernels ( #21411 )
...
Signed-off-by: kaixih <kaixih@nvidia.com >
2025-07-26 07:10:36 -07:00
e7c4f9ee86
[CI/Build][Doc] Move existing benchmark scripts in CI/document/example to vllm bench CLI ( #21355 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-07-26 07:10:14 -07:00
9094d11c5d
[Bugfix][Apple Silicon] fix missing symbols when build from source on Mac with Apple Silicon ( #21380 )
...
Signed-off-by: Yeju Zhou <yejuzhou@outlook.com >
2025-07-26 07:09:57 -07:00
56e544f24b
[Refactor] Remove moe_align_block_size_triton ( #21335 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-26 07:08:29 -07:00
97d6c30cc9
[BugFix] Fix shared storage connector load kv only load attention layer ( #21428 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-07-26 07:07:40 -07:00
a40a8506df
[Misc] Improve memory profiling debug message ( #21429 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-07-26 07:07:21 -07:00
c215f5c877
[Bug] Fix has_flashinfer_moe Import Error when it is not installed ( #21634 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-26 07:06:14 -07:00
1cd6eaba54
Support encoder-only models without KV-Cache ( #21270 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-07-26 21:09:52 +08:00
f27fdfc3ed
[Bugfix] Investigate Qwen2-VL failing test ( #21527 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-26 06:09:29 -07:00
de10ff0b7c
Migrate AyaVisionImagePixelInputs to TensorSchema for shape validation ( #21622 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-26 06:08:18 -07:00
9d197280fa
Migrate AriaImagePixelInputs to TensorSchema for shape validation ( #21620 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-26 06:08:15 -07:00
e98def439c
[Take 2] Correctly kill vLLM processes after benchmarks ( #21646 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-07-26 06:06:05 -07:00
05c1126f29
[Misc] remove unused try-except in pooling config check ( #21618 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-26 12:20:03 +00:00
875af38e01
Support Intern-S1 ( #21628 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Your Name <you@example.com >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-26 19:14:04 +08:00
7728dd77bb
[TPU][Test] Divide TPU v1 Test into 2 parts. ( #21431 )
2025-07-26 06:20:30 +00:00
2f6e6b33fb
[Bugfix] Fix isinstance check for tensor types in _load_prompt_embeds to use dtype comparison ( #21612 )
...
Signed-off-by: Alexandre Juan <a.juan@netheos.net >
2025-07-25 20:11:10 -07:00
a55c95096b
Correctly kill vLLM processes after finishing serving benchmarks ( #21641 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-07-25 19:06:21 -07:00
97349fe2bc
[Docs] add offline serving multi-modal video input example Qwen2.5-VL ( #21530 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-07-25 18:37:32 -07:00
62965de5fe
[Model] Ultravox: Support Llama 4 and Gemma 3 backends ( #17818 )
...
Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai >
Signed-off-by: Patrick Li <patrick8289@gmail.com >
Co-authored-by: Patrick Li <patrick8289@gmail.com >
2025-07-25 18:12:31 -07:00
7ae75fa6d0
[Feature] Add support for MoE models in the calibration-free RTN-based quantization ( #20766 )
...
Signed-off-by: Alex Kogan <alex.kogan@oracle.com >
2025-07-25 18:09:34 -07:00
f1b286b2fb
[TPU] Update ptxla nightly version to 20250724 ( #21555 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-25 17:09:00 -07:00
c7742d6113
[Bugfix] Always set RAY_ADDRESS for Ray actor before spawn ( #21540 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-25 17:08:30 -07:00
cea96a0156
[Bugfix] Fix sync_and_slice_intermediate_tensors ( #21537 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-25 17:07:58 -07:00
2eddd437ba
Add interleaved RoPE test for Llama4 (Maverick) ( #21478 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-25 17:07:26 -07:00
75d29cf4e1
[Perf] Cuda Kernel for Int8 Per Token Group Quant ( #21476 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-25 17:07:07 -07:00
41d3082c41
Add Unsloth to RLHF.md ( #21636 )
2025-07-25 17:06:48 -07:00
7cfea0df39
[TPU][Test] Rollback PR-21550. ( #21619 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-25 13:22:01 -07:00
5ac3168ee3
[Docs] add auto-round quantization readme ( #21600 )
...
Signed-off-by: Wenhua Cheng <wenhua.cheng@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-25 08:52:42 -07:00
396ee94180
[CI] Unifying Dockerfiles for ARM and X86 Builds ( #21343 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-07-25 07:33:56 -07:00
e189b50f53
Add support for Prithvi in Online serving mode ( #21518 )
...
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-25 07:01:27 -07:00
136d750f5f
[Kernel] Improve machete memory bound perf ( #21556 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-07-25 06:53:21 -07:00
b3caeb82e7
[ROCm][AITER] Enable fp8 kv cache on rocm aiter backend. ( #20295 )
...
Signed-off-by: fsx950223 <fsx950223@outlook.com >
Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com >
Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com >
2025-07-25 06:50:21 -07:00
eab2f3980c
[Model] Replace Mamba2 RMSNorm Gated with Fused Triton Kernel ( #20839 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
Signed-off-by: Yu Chin Fabian Lim <fabian.lim@gmail.com >
Signed-off-by: Chih-Chieh Yang <7364402+cyang49@users.noreply.github.com >
Co-authored-by: Yu Chin Fabian Lim <fabian.lim@gmail.com >
2025-07-25 06:49:36 -07:00
9fe98d4250
[Frontend] Add request_id to the Request object so they can be controlled better via external load balancers ( #21009 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-07-25 06:49:11 -07:00
29c6fbe58c
[MODEL] New model support for naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B ( #20931 )
...
Signed-off-by: bigshanedogg <bigshane319@gmail.com >
2025-07-25 06:05:42 -07:00
c72f049cb4
[Model] Fix Ernie4.5MoE e_score_correction_bias parameter ( #21586 )
...
Signed-off-by: zhouchong <zhouchong03@baidu.com >
Co-authored-by: zhouchong <zhouchong03@baidu.com >
2025-07-25 06:02:53 -07:00
f3a683b7c9
[Bugfix][Logprobs] Fix logprobs op to support more backend ( #21591 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2025-07-25 05:53:07 -07:00
46d81d6951
[V1] Get supported tasks from model runner instead of model config ( #21585 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-25 05:36:45 -07:00
5c3f2628d5
[Quantization] Enable BNB support for more MoE models ( #21370 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-25 03:57:34 -07:00
7311f74468
[Bugfix] GGUF: fix AttributeError: 'PosixPath' object has no attribute 'startswith' ( #21579 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-07-25 03:42:23 -07:00
8ed01e32f7
Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct ( #21598 )
...
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com >
2025-07-25 02:36:55 -07:00
e38e96a3c0
[Tests] Harden DP tests ( #21508 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-25 02:27:24 -07:00
40d86ee412
[TPU][Bugfix] fix OOM issue in CI test ( #21550 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-24 23:01:53 -07:00
85d051f026
[Misc] Removed undefined cmake variables MOE_PERMUTE_ARCHS ( #21262 )
...
Signed-off-by: Yang Chen <yangche@fb.com >
2025-07-24 22:54:23 -07:00
5140f54b89
[CI/Build] fix cpu_extension for apple silicon ( #21195 )
...
Signed-off-by: ignaciosica <mignacio.sica@gmail.com >
2025-07-24 22:53:59 -07:00
947edd099e
[Misc][Tools] make max-model-len a parameter in auto_tune script ( #21321 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-24 22:46:43 -07:00
fde60ee775
[Model] Fix a check for None but the return value was empty list in Gemma3 MM vision_embeddings ( #21479 )
...
Signed-off-by: Hongmin Fan <fanhongmin@google.com >
2025-07-25 13:46:06 +08:00
b38bc652ac
[Model] Support tensor parallel for timm ViT in Deepseek_vl2 ( #21494 )
...
Signed-off-by: wzqd <1057337859@qq.com >
2025-07-24 22:45:16 -07:00
adaf2c6d4f
[Bugfix] fix modelscope snapshot_download serialization ( #21536 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-24 22:44:38 -07:00
42343f1f89
[CI] Update CODEOWNERS for CPU and Intel GPU ( #21582 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-24 21:58:03 -07:00
965bc71b04
Integrate TensorSchema with shape validation for Phi3VImagePixelInputs ( #21232 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-24 21:43:52 -07:00
807a328bb6
[Docs] Add requirements/common.txt to run unit tests ( #21572 )
...
Signed-off-by: Zhou Fang <fang.github@gmail.com >
2025-07-24 20:51:15 -07:00
e0be2c4d09
[TPU][Test] Temporarily suspend this MoE model in test_basic.py. ( #21560 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-24 20:44:50 -07:00
9c8b2c2a8a
[DP] Support api-server-count > 0 in hybrid DP LB mode ( #21510 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-24 20:18:16 -07:00
2212cd6cfb
[Bugfix] DeepGemm utils : Fix hardcoded type-cast ( #21517 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-24 20:17:29 -07:00
ce3a9b1378
[Kernel] adding fused_moe configs for upcoming granite4 ( #21332 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-24 20:16:59 -07:00
2ce90e5b01
Fix GLM-4 PP Missing Layer When using with PP. ( #21531 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-07-24 20:07:38 -07:00
633f6e804b
[Bug] Fix DeepGemm Init Error ( #21554 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-24 20:07:22 -07:00
b57296bb9a
[Docs] Fix site_url for RunLLM ( #21564 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-24 20:05:58 -07:00
34ddcf9ff4
[Frontend] run-batch supports V1 ( #21541 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-24 20:05:55 -07:00
fe56180c7f
[MoE] More balanced expert sharding ( #21497 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-07-24 15:56:08 -07:00
07d80d7b0e
[TPU][TEST] HF_HUB_DISABLE_XET=1 the test 3. ( #21539 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-24 15:33:04 -07:00
2dd72d23d9
update flashinfer to v0.2.9rc1 ( #21485 )
...
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com >
2025-07-24 14:06:11 -07:00
a6c7fb8cff
[Docs] Add Expert Parallelism Initial Documentation ( #21373 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-24 12:36:06 -07:00
a7272c23d0
[Docs][minor] Fix broken gh-file link in distributed serving docs ( #21543 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-24 10:36:56 -07:00
6066284914
[P/D] Support CPU Transfer in NixlConnector ( #18293 )
...
Signed-off-by: Juncheng Gu <juncgu@gmail.com >
Signed-off-by: Richard Liu <ricliu@google.com >
Co-authored-by: Richard Liu <39319471+richardsliu@users.noreply.github.com >
Co-authored-by: Richard Liu <ricliu@google.com >
2025-07-24 17:58:42 +01:00
1e9ea8e69d
[P/D] Move FakeNixlWrapper to test dir ( #21328 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-24 08:53:45 -07:00
d9f9a3fd96
[XPU] Conditionally import CUDA-specific passes to avoid import errors on xpu platform ( #21036 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-07-24 23:23:36 +08:00
1b25f1fe75
Update flashinfer CUTLASS MoE Kernel ( #21408 )
...
Signed-off-by: Shu Wang. <shuw@nvidia.com >
2025-07-24 08:13:31 -07:00
e8cb0d0495
[Bug] Fix Compressed Tensor NVFP4 cutlass_fp4_group_mm illegal memory access ( #21465 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-24 08:13:24 -07:00
684174115d
[Docs] Rewrite Distributed Inference and Serving guide ( #20593 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-24 08:13:05 -07:00
cdb79ee63d
[Docs] Update Tensorizer usage documentation ( #21190 )
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com >
Signed-off-by: William Goldby <willgoldby@gmail.com >
Co-authored-by: William Goldby <willgoldby@gmail.com >
2025-07-24 06:56:18 -07:00
5a19a6c670
[Fix] Update mamba_ssm to 2.2.5 ( #21421 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-07-24 03:25:41 -07:00
2ded067fd2
[Bugfix] Fix CUDA arch flags for MoE permute ( #21426 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-07-24 03:23:59 -07:00
13abd0eaf9
[Model] Officially support Emu3 with Transformers backend ( #21319 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-24 03:22:12 -07:00
61b8cea3b4
[Attention] Optimize FlashInfer MetadataBuilder Build call ( #21137 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-24 03:21:46 -07:00
526078a96c
bump flashinfer to v0.2.8 ( #21385 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2025-07-24 03:20:38 -07:00
6da0078523
[Feat] Allow custom naming of vLLM processes ( #21445 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-24 03:15:23 -07:00
73e3949d07
[Misc] Improve comment for DPEngineCoreActor._set_cuda_visible_devices() ( #21501 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-24 03:13:40 -07:00
6eca337ce0
Replace --expand-tools-even-if-tool-choice-none with --exclude-tools-when-tool-choice-none for v0.10.0 ( #20544 )
...
Signed-off-by: okada <kokuzen@gmail.com >
Signed-off-by: okada shintarou <okada@preferred.jp >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-24 02:56:36 -07:00
85bda9e7d0
remove GLM-4.5 quantization wrong Code ( #21435 )
2025-07-24 01:52:43 -07:00
610852a423
[Core] Support model loader plugins ( #21067 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-24 01:49:44 -07:00
f0f4de8f26
[Misc] Fix duplicate FusedMoEConfig debug messages ( #21455 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-24 01:27:30 -07:00
fc5f756db4
[v1][Core] Clean up usages of SpecializedManager ( #21407 )
...
Signed-off-by: Zhou Fang <fang.github@gmail.com >
2025-07-24 00:40:11 -07:00
e74bfc70e4
[TPU][Bugfix] fix moe layer ( #21340 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-07-24 00:38:39 -07:00
90eeea8f85
[Bugfix][ROCm] Fix for warp_size uses on host ( #21205 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-24 00:37:19 -07:00
dde295a934
Deduplicate Transformers backend code using inheritance ( #21461 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-24 00:16:23 -07:00
6d8d0a24c0
Add think chunk ( #21333 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2025-07-23 21:51:32 -07:00
11ef7a611e
[BugFix] Set CUDA_VISIBLE_DEVICES before spawning the subprocesses ( #21211 )
...
Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-23 21:44:04 -07:00
dc2f159f8a
Dump input metadata on crash for async scheduling ( #21258 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-23 21:10:30 -07:00
d5b981f8b1
[DP] Internal Load Balancing Per Node [one-pod-per-node] ( #21238 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-07-23 20:57:32 -07:00
eec6942014
[BugFix] Fix KVConnector TP worker aggregation ( #21473 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-23 20:56:49 -07:00
fd48d99ffd
[BugFix]: Batch generation from prompt_embeds fails for long prompts ( #21390 )
...
Signed-off-by: KazusatoOko <kazusto.oko@sakana.ai >
Co-authored-by: KazusatoOko <kazusto.oko@sakana.ai >
2025-07-23 20:43:17 -07:00
f8c15c4efb
[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process ( #21437 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-07-23 20:42:11 -07:00
aa08a954f9
[Bugfix] Fix casing warning ( #21468 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-07-23 20:41:23 -07:00
13e4ee1dc3
[XPU][UT] increase intel xpu CI test scope ( #21492 )
...
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com >
2025-07-23 20:24:04 -07:00
772ce5af97
[Misc] Add dummy maverick test to CI ( #21324 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-23 20:22:42 -07:00
63d92abb7c
[Frontend] Set MAX_AUDIO_CLIP_FILESIZE_MB via env var instead of hardcoding ( #21374 )
...
Signed-off-by: Deven Labovitch <deven@videa.ai >
2025-07-23 20:22:19 -07:00
11599b0e1f
feat(gguf_loader): accept HF repo paths & URLs for GGUF ( #20793 )
...
Signed-off-by: Hardik <hardikgupta1999@gmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-23 20:21:02 -07:00
f3137cdd81
[Core] Freeze gc during cuda graph capture to speed up init ( #21146 )
...
Signed-off-by: Codex <codex@openai.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-23 17:20:14 -07:00
82ec66f514
[V0 Deprecation] Remove Prompt Adapters ( #20588 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-23 16:36:48 -07:00
78c13e30e1
[V1] Fix local chunked attention always disabled ( #21419 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-23 15:59:30 -07:00
5c9b807b34
[Core] Add reload_weights RPC method ( #20096 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-23 14:24:52 -07:00
14bf19e39f
[TPU][TEST] Fix the downloading issue in TPU v1 test 11. ( #21418 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-23 11:29:36 -07:00
4ac7713e32
Add test case for compiling multiple graphs ( #21044 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-23 11:00:47 -07:00
8560a5b258
[Core][Model] PrithviMAE Enablement on vLLM v1 engine ( #20577 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2025-07-23 11:00:23 -07:00
316b1bf706
[Tests] Add tests for headless internal DP LB ( #21450 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-23 07:49:25 -07:00
7c734ee09b
[Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models. ( #21364 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-07-23 06:34:37 -07:00
f59ec35b7f
[V1] Check all pooling tasks during profiling ( #21299 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-23 05:53:26 -07:00
2671334d45
[Model] add Hunyuan V1 Dense Model support. ( #21368 )
...
Signed-off-by: Asher Zhang <asherszhang@tencent.com >
2025-07-23 03:54:08 -07:00
2cc5016a19
[Docs] Clean up v1/metrics.md ( #21449 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-23 03:37:25 -07:00
6929f8b437
[Misc] fixed nvfp4_moe test failures due to invalid kwargs ( #21246 )
...
Signed-off-by: Yang Chen <yangche@fb.com >
2025-07-23 01:41:43 -07:00
32ec9e2f2a
Mamba V2 Test not Asserting Failures. ( #21379 )
...
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
2025-07-23 01:40:27 -07:00
accac82928
[Sampler] Introduce logprobs mode for logging ( #21398 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-23 01:39:25 -07:00
23637dcdef
[Docs] Fix bullets and grammars in tool_calling.md ( #21440 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-23 01:23:20 -07:00
6364af92f8
Fixed typo in profiling logs ( #21441 )
2025-07-23 01:18:54 -07:00
7aaa2bd5a8
[Bugfix] ensure tool_choice is popped when tool_choice:null is passed in json payload ( #19679 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-07-23 00:30:05 -07:00
2f5c14de6a
add clear messages for deprecated models ( #21424 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-07-23 00:03:16 -07:00
f002e9a870
[Cleanup] Only log MoE DP setup warning if DP is enabled ( #21315 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-23 00:02:48 -07:00
a1f3610fc6
[Core] Add basic unit test for maybe_evict_cached_block ( #21400 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-23 00:02:02 -07:00
4ecedd1806
[Bugfix] Fix nightly transformers CI failure ( #21427 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-23 00:01:01 -07:00
107111a859
Changing "amdproduction" allocation. ( #21409 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-07-22 20:48:31 -07:00
2dec7c1a5d
[Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported ( #21420 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-07-22 20:34:50 -07:00
08d2bd78da
[BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update ( #21414 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2025-07-22 20:33:57 -07:00
4f76a05f4f
[BugFix] Update python to python3 calls for image; fix prefix & input calculations. ( #21391 )
...
Signed-off-by: Eric Hanley <ericehanley@google.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-22 20:33:00 -07:00
f154bb9ff0
Simplify weight loading in Transformers backend ( #21382 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-22 20:29:43 -07:00
3ec7170ff1
[Bugfix][ROCm][Build] Fix build regression on ROCm ( #21393 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-22 20:27:41 -07:00
c401c64b4c
[CI/Build] Fix model executor tests ( #21387 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-22 20:25:37 -07:00
b77c7d327f
[BugFix] Fix ray import error mem cleanup bug ( #21381 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-07-22 16:19:55 -07:00
35bc8bd5fb
[Misc] Copy HF_TOKEN env var to Ray workers ( #21406 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-22 16:18:42 -07:00
4594fc3b28
[Model] Add Qwen3CoderToolParser ( #21396 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-07-22 15:05:57 -07:00
ae268b6326
Fix Flashinfer Allreduce+Norm enable disable calculation based on fi_allreduce_fusion_max_token_num ( #21325 )
...
Signed-off-by: XIn Li <xinli@nvidia.com >
2025-07-22 12:42:31 -07:00
35366ae57c
[CI/Build] Fix test failure due to updated model repo ( #21375 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-22 08:39:35 -07:00
2226d5bd85
[Bugfix] Decode Tokenized IDs to Strings for hf_processor in llm.chat() with model_impl=transformers ( #21353 )
...
Signed-off-by: ariG23498 <aritra.born2fly@gmail.com >
2025-07-22 08:27:28 -07:00
44554a0068
Add tokenization_kwargs to encode for embedding model truncation ( #21033 )
2025-07-22 08:24:00 -07:00
226b452a20
Revert "[Refactor] Fix Compile Warning #1444-D ( #21208 )" ( #21384 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-22 08:22:10 -07:00
f38ee34a0a
[feat] Enable mm caching for transformers backend ( #21358 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2025-07-22 08:18:46 -07:00
b194557a6c
Adds parallel model weight loading for runai_streamer ( #21330 )
...
Signed-off-by: bbartels <benjamin@bartels.dev >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-22 08:15:53 -07:00
774d0c014b
[Perf] Cuda Kernel for Per Token Group Quant ( #21083 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-22 07:27:15 -07:00
2c8db17cfd
[feat]: add SM100 support for cutlass FP8 groupGEMM ( #20447 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-22 07:27:12 -07:00
4fb56914c5
[perf] Add fused MLA QKV + strided layernorm ( #21116 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-22 07:07:44 -07:00
0df4d9b06b
[Misc] unify variable for LLM instance v2 ( #21356 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-22 06:32:36 -07:00
ed25054577
[Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool ( #21222 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-22 06:17:47 -07:00
10904e6d75
[benchmark] Port benchmark request sent optimization to benchmark_serving ( #21209 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-22 05:28:00 -07:00
a32237665d
[Core] Optimize update checks in LogitsProcessor ( #21245 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-22 05:27:18 -07:00
bc8a8ce5ec
[Misc] Remove deprecated args in v0.10 ( #21349 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-07-22 05:26:39 -07:00
32142b3c62
[Bugfix] Fix eviction cached blocked logic ( #21357 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-07-22 01:18:40 -07:00
82b8027be6
Add arcee model ( #21296 )
...
Signed-off-by: alyosha-swamy <raghav@arcee.ai >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-22 00:57:43 -07:00
3779eb8c81
[Feature][eplb] add verify ep or tp or dp ( #21102 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-21 23:41:14 -07:00
9e23ad9655
Update fp4 quantize API ( #21327 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com >
2025-07-21 23:40:21 -07:00
e69a92a1ce
[Bug] DeepGemm: Fix Cuda Init Error ( #21312 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-21 23:36:18 -07:00
8425f785ad
[Misc] DeepEPHighThroughtput - Enable Inductor pass ( #21311 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-21 23:35:45 -07:00
c17231e827
Fix kv_cache_dtype handling for out-of-tree HPU plugin ( #21302 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
Co-authored-by: Chendi.Xue <chendi.xue@intel.com >
2025-07-21 23:35:14 -07:00
6e5b5ca580
[Refactor] Fix Compile Warning #1444-D ( #21208 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-21 23:33:51 -07:00
488d8a986a
[V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible ( #21300 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-21 23:31:18 -07:00
af376ca19d
[Core] Minimize number of dict lookup in _maybe_evict_cached_block ( #21281 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-21 22:37:34 -07:00
e7b2042681
Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE ( #20762 ) ( #21334 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-07-21 21:49:01 -07:00
90f1e55421
[Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU ( #21338 )
...
Signed-off-by: ratnampa <ratnam.parikh@intel.com >
2025-07-21 21:48:27 -07:00
5e70dcd6e6
[Doc] Fix CPU doc format ( #21316 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-21 21:47:49 -07:00
25d585ab7b
[XPU] Enable external_launcher to serve as an executor via torchrun ( #21021 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-07-21 21:47:35 -07:00
8d0a01a5f2
[v1][sampler] Inplace logprobs comparison to get the token rank ( #21283 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-21 13:47:47 -07:00
0ec82edda5
[perf] Speed up align sum kernels ( #21079 )
...
Signed-off-by: Himanshu Jaju <hj@mistral.ai >
2025-07-21 11:19:23 -07:00
005ae9be6c
Fix bad lm-eval fork ( #21318 )
2025-07-21 10:47:51 -07:00
29d1ffc5b4
[DP] Fix Prometheus Logging ( #21257 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-07-21 09:11:35 -07:00
304dce7ec0
[Attention] Clean up iRoPE in V1 ( #21188 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-07-21 09:10:30 -07:00
6ece16c4fe
[Misc] Add dummy maverick test ( #21199 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-21 09:08:09 -07:00
a0e827e07c
[BugFix] make utils.current_stream thread-safety ( #21252 ) ( #21253 )
...
Signed-off-by: simpx <simpxx@gmail.com >
2025-07-21 09:07:36 -07:00
a15a50fc17
[CPU] Enable shared-memory based pipeline parallel for CPU backend ( #21289 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-21 09:07:08 -07:00
6dda13c86b
[Misc] Add sliding window to flashinfer test ( #21282 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-21 08:37:49 -07:00
6b46c4b653
Add Nvidia ModelOpt config adaptation ( #19815 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
2025-07-21 10:02:58 -04:00
d97841078b
[Misc] unify variable for LLM instance ( #20996 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-21 12:18:33 +01:00
e6b90a2805
[Docs] Make tables more space efficient in supported_models.md ( #21291 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-21 02:25:02 -07:00
be54a951a3
[Docs] Fix hardcoded links in docs ( #21287 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-21 02:23:57 -07:00
042af0c8d3
[Model][1/N] Support multiple poolers at model level ( #21227 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-21 02:22:21 -07:00
378d33c392
[Bugfix] Fix missing placeholder in logger debug ( #21280 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-20 22:50:06 -07:00
940af1f03a
Add the instruction to run e2e validation manually before release ( #21023 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-07-20 22:29:18 -07:00
92615d7fe8
[Docs] Add RFC Meeting to Issue Template ( #21279 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-07-20 21:58:07 -07:00
8188196a1c
[CI] Cleanup modelscope version constraint in Dockerfile ( #21243 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-07-20 20:13:02 -07:00
7ba34b1241
[bugfix] fix syntax warning caused by backslash ( #21251 )
2025-07-20 17:12:10 +00:00
9499e26e2a
[Model] Support VLMs with transformers backend ( #20543 )
...
Signed-off-by: raushan <raushan@huggingface.co >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-20 13:25:50 +00:00
51ba839555
[Model] use AutoWeightsLoader for bart ( #18299 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-07-20 08:15:50 +00:00
d1fb65bde3
Enable v1 metrics tests ( #20953 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-07-20 03:22:02 +00:00
3a1d8940ae
[TPU] support fp8 kv cache quantization ( #19292 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-20 03:01:00 +00:00
2b504eb770
[Docs] [V1] Update docs to remove enforce_eager limitation for hybrid models. ( #21233 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-19 16:09:58 -07:00
10eb24cc91
GLM-4 Update ( #20736 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Lu Fang <fanglu@fb.com >
2025-07-19 22:40:31 +00:00
2e8cbb58f3
[BugFix] Fix full cuda graph slot_mapping ( #21228 )
...
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com >
2025-07-19 14:13:18 -07:00
752c6ade2e
[V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small ( #21217 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-19 13:53:17 -07:00
881e3cbe3b
[V1] [Hybrid] Enable piecewise CUDA Graph for mamba layers ( #21194 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-19 19:27:21 +00:00
9f414a12ad
[BugFix] Make PD work with Ray ( #21072 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-07-19 08:46:50 -07:00
6a971ed692
[Docs] Update the link to the 'Prometheus/Grafana' example ( #21225 )
2025-07-19 06:58:07 -07:00
da6579bf41
[CI/CD][bugfix]fix: error argument to loads has incompatible type ( #21223 )
...
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com >
2025-07-19 05:16:48 -07:00
c81259d33a
Fix/remove some broken model executor tests ( #21224 )
...
Signed-off-by: Rabi Mishra <ramishra@redhat.com >
2025-07-19 12:15:07 +00:00
e3a0e43d7f
[bugfix] Fix auto thread-binding when world_size > 1 in CPU backend and refactor code ( #21032 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-19 05:13:55 -07:00
b3d82108e7
[Bugfix][Frontend] Fix openai CLI arg middleware ( #21220 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-19 02:40:38 -07:00
6d0734c562
[NVIDIA] Add SM100 Flashinfer MoE blockscale fp8 backend for low latency ( #20645 )
...
Signed-off-by: kaixih <kaixih@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-19 02:33:01 -07:00
7d94577138
Add torch golden impl for moe_align_block_size kernel test ( #20653 )
...
Signed-off-by: Shixian Cui <shixian@amazon.com >
Co-authored-by: Shixian Cui <shixian@amazon.com >
2025-07-19 02:32:36 -07:00
59f935300c
[BugFix] Fix potential cuda-graph IMA ( #21196 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-19 02:18:47 -07:00
18e519ec86
[Bugfix] Fix ndarray video color from VideoAsset ( #21064 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-19 02:17:16 -07:00
1eaff27815
[V0 deprecation] Remove long context LoRA ( #21169 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-19 02:15:41 -07:00
cf8cc32674
Fix a couple of Voxtral tests ( #21218 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-07-19 09:13:41 +00:00
3a2cb2649d
[Misc][Tools][Benchmark] Add readme file for auto_tune script ( #20779 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-07-19 09:06:59 +00:00
3e04107d97
[Model] EXAONE 4.0 model support ( #21060 )
...
Signed-off-by: Deepfocused <rlawhdrhs27@gmail.com >
Signed-off-by: woongsik <rlawhdrhs27@gmail.com >
2025-07-19 14:25:44 +08:00
37bd8d6e4c
[Bug] DeepGemm: Fix TypeError: per_block_cast_to_fp8() missing 1 required positional argument: 'use_ue8m0' for SM100 ( #21187 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-18 23:25:22 -07:00
468e2400fe
[BugFix][CPU] Fix TorchSDPABackendImpl doesn't have use_irope ( #21200 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-18 23:18:48 -07:00
dcc6cfb991
[Kernel][Performance] Tweak MoE Batched silu_mul_fp8_quant_deep_gemm kernel ( #21193 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-18 23:09:51 -07:00
dd572c0ab3
[V0 Deprecation] Remove V0 Spec Decode workers ( #21152 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-18 21:47:50 -07:00
9ffe905a41
[Bugfix][Model] Fix LoRA for Mistral-Small-3.1-24B-Instruct-2503 ( #21183 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-07-18 21:15:03 -07:00
9a9fda1423
[Core] Support Local Chunked Attention for Hybrid KV Cache ( #19351 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
Signed-off-by: Lu Fang <fanglu@meta.com >
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lu Fang <fanglu@meta.com >
2025-07-18 20:48:38 -07:00
466e878f2a
[Quantization] Enable BNB support for more MoE models ( #21100 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-18 17:52:02 -07:00
217937221b
Elastic Expert Parallel Initial Support ( #20775 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-18 17:46:09 -07:00
5782581acf
[Bugfix] Voxtral on Blackwell GPUs (RTX 50 series) ( #21077 )
...
Signed-off-by: hax0r31337 <liulihaocaiqwq@gmail.com >
2025-07-18 18:40:18 -04:00
0f199f197b
[Core] Avoid KVCacheBlock.__eq__ invocations in FreeKVCacheBlockQueue ( #21005 )
...
Signed-off-by: Jialin Ouyang <jialino@meta.com >
2025-07-18 12:34:40 -07:00
b2eb2b5ad7
[Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6.0 ( #19346 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-07-18 14:10:21 -04:00
21274ab476
[CI] Update CODEOWNERS for vllm/compilation ( #21185 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-07-18 06:51:12 -07:00
ed8cbfedf8
Let GraniteMoeAttention use YaRN ( #21174 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-18 05:52:52 -07:00
45badd05d0
[Core] Set pooling params based on task and model ( #21128 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-18 05:41:17 -07:00
4adc66f64d
[Bugfix] Allocate less memory in non-batched CUTLASS MoE ( #21121 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2025-07-18 18:55:52 +08:00
55ad648715
[Doc] Fix typo in model name ( #21178 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-18 03:55:10 -07:00
5895afd780
[Bugfix] The special_tokens in tokenizer should also be controlled by do_lower_case in encoder_config. ( #20750 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-18 09:10:47 +00:00
ca4eb82bcb
[Model] Re-add the implicit conversion feature for as_seq_cls_model ( #21103 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-18 07:15:07 +00:00
ba2dfbb0c2
[Misc] Make MM embedding merge interface explicit in model runner ( #21147 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-18 07:13:57 +00:00
1bf65138f6
[benchmark] Sending request strictly follows the random intervals ( #21108 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-18 06:22:08 +00:00
54cf1cae62
[Misc] Do not print async output warning for v1 ( #21151 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-17 21:57:02 -07:00
5780121c95
[Perf] Add swap_ab to SM90 FP8 non-block CUTLASS moe grouped gemm ( #20911 )
...
Signed-off-by: Shixian Cui <shixian@amazon.com >
Co-authored-by: Shixian Cui <shixian@amazon.com >
2025-07-18 04:34:43 +00:00
c7d8724e78
[Core] FlashInfer CUTLASS fused MoE backend (NVFP4) ( #20037 )
...
Signed-off-by: shuw <shuw@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-17 21:32:45 -07:00
b38baabcf9
[Doc] Add inplace weights loading example ( #19640 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-17 21:12:23 -07:00
89cab4d01f
[Attention] Make local attention backend agnostic ( #21093 )
2025-07-18 00:10:42 -04:00
b9a21e9173
[Docs] Update supported models documentation with missing models ( #20844 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-07-17 20:12:13 -07:00
c4e3b12524
[Docs] Add minimal demo of Ray Data API usage ( #21080 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-17 20:09:19 -07:00
8dfb45ca33
[Bugfix] Fix the tensor non-contiguous issue for Flashinfer TRT-LLM backend attention kernel ( #21133 )
2025-07-18 00:35:58 +00:00
8a8fc94639
[Log] Debugging Log with more Information ( #20770 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-18 00:19:46 +00:00
4de7146351
[V0 deprecation] Remove V0 HPU backend ( #21131 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-17 16:37:36 -07:00
ac9fb732a5
On environments where numa cannot be detected we get 0 ( #21115 )
...
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-07-17 18:52:17 +00:00
a3a6c695f4
[Misc] Qwen MoE model supports LoRA ( #20932 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-17 18:32:52 +00:00
90bd2ab6e3
[Model] Update pooling model interface ( #21058 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-17 16:05:40 +00:00
9fb2d22032
[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE ( #20762 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2025-07-17 09:56:44 -04:00
2d6a38209b
[Docs] Move code block out of admonition now that it's short ( #21118 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-17 06:12:29 -07:00
89e3c4e9b4
[Misc] Avoid unnecessary import ( #21106 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-07-17 12:57:41 +00:00
fe8a2c544a
[Docs] Improve docstring formatting for FusedMoEParallelConfig.make ( #21117 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-17 04:13:00 -07:00
4ef00b5cac
[VLM] Add Nemotron-Nano-VL-8B-V1 support ( #20349 )
...
Signed-off-by: Kyle Huang <kylhuang@nvidia.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-17 03:07:55 -07:00
5a7fb3ab9e
[Model] Add ToolParser and MoE Config for Hunyuan A13B ( #20820 )
...
Signed-off-by: Asher Zhang <asherszhang@tencent.com >
2025-07-17 09:10:09 +00:00
11dfdf21bf
[Kernel] DeepGemm MoE : Integrate triton permute / unpermute kernels ( #20903 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-17 08:10:37 +00:00
fdc5b43d20
[Bugfix]: Fix final_res_batch list index out of range error ( #21055 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-17 00:29:09 -07:00
c5b8b5953a
[Misc] Fix PhiMoE expert mapping ( #21085 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-17 05:47:49 +00:00
4fcef49ec4
[V1] [KVConnector] Fix MultiprocExecutor worker output aggregation ( #21048 )
...
Signed-off-by: David Ben-David <davidb@pliops.com >
Co-authored-by: David Ben-David <davidb@pliops.com >
2025-07-17 13:29:45 +08:00
8a4e5c5f3c
[V1][P/D]Enhance Performance and code readability for P2pNcclConnector ( #20906 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-07-16 22:13:00 -07:00
76b494444f
[Attention] Refactor attention metadata builder interface ( #20466 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-17 04:44:25 +00:00
28a6d5423d
[Bugfix] Fix Machete zero point issue for GPTQ models on SM90 ( #21066 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-16 19:54:45 -07:00
58760e12b1
[TPU] Start using python 3.12 ( #21000 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-07-16 19:37:44 -07:00
a50d918225
[Docker] Allow FlashInfer to be built in the ARM CUDA Dockerfile ( #21013 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-16 19:37:13 -07:00
c9ba8104ed
[Bugfix] weight loading use correct tp_group with patch_tensor_parallel_group ( #21024 )
...
Signed-off-by: KevinXiong-C <kevin_xiong1997@outlook.com >
2025-07-16 19:36:36 -07:00
4e7dfbe7b4
Update PyTorch to torch==2.7.1 for CUDA ( #21011 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-17 02:30:44 +00:00
72ad273582
Remove torch_xla.tpu.version() from pallas.py. ( #21065 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-17 00:25:26 +00:00
01513a334a
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) ( #12010 )
...
Signed-off-by: Nir David <ndavid@habana.ai >
Signed-off-by: Uri Livne <ulivne@habana.ai >
Co-authored-by: Uri Livne <ulivne@habana.ai >
2025-07-16 15:33:41 -04:00
ac2bf41e53
[Model] Remove model sampler ( #21059 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-16 19:03:37 +00:00
a931b4cdcf
Remove Qwen Omni workaround that's no longer necessary ( #21057 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-16 16:25:23 +00:00
a0f8a79646
[fix] fix qwen image_embeds input ( #21049 )
...
Signed-off-by: h-avsha <avshalom.manevich@hcompany.ai >
2025-07-16 15:17:20 +00:00
18bdcf4113
feat - add a new endpoint get_tokenizer_info to provide tokenizer/chat-template information ( #20575 )
...
Signed-off-by: m-misiura <mmisiura@redhat.com >
2025-07-16 21:52:14 +08:00
1c3198b6c4
[Model] Consolidate pooler implementations ( #20927 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-16 13:39:13 +00:00
260127ea54
[Docs] Add intro and fix 1-2-3 list in frameworks/open-webui.md ( #19199 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-16 06:11:38 -07:00
d0dc4cfca4
Fix inadvertently silenced PP tests for mp, add DeepSeek V2/V3 model family to PP tests ( #20831 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-07-16 00:14:49 -07:00
d31a647124
[BugFix] Fix import error on non-blackwell machines ( #21020 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-15 22:27:29 -07:00
85431bd9ad
[TPU] fix kv_cache_update kernel block size choosing logic ( #21007 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-16 04:39:48 +00:00
c11013db8b
[Meta] Llama4 EAGLE Support ( #20591 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: qizixi <qizixi@meta.com >
2025-07-15 21:14:15 -07:00
1eb2b9c102
[CI] update typos config for CI pre-commit and fix some spells ( #20919 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-07-15 21:12:40 -07:00
6ebf313790
Avoid direct comparison of floating point numbers ( #21002 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-07-15 21:12:14 -07:00
cfbcb9ed87
[Voxtral] Add more tests ( #21010 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-15 21:11:49 -07:00
76ddeff293
[Doc] Remove duplicate docstring ( #21012 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-15 20:09:13 -07:00
f46098335b
[Bugfix] Fix Mistral3 support on SM100/SM120 ( #20998 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-15 20:08:41 -07:00
e9534c7202
[CI][HPU] update for v0 deprecate by switching to VLLM_TARGET_DEVICE=empty ( #21006 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2025-07-15 20:07:05 -07:00
7976446015
Add Dockerfile argument for VLLM_USE_PRECOMPILED environment ( #20943 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
2025-07-15 19:53:57 -07:00
fcb9f879c1
[Bugfix] Correct per_act_token in CompressedTensorsW8A8Fp8MoECutlassM… ( #20937 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-07-15 19:53:42 -07:00
3ed94f9d0a
[Docs] Enhance Anyscale documentation, add quickstart links for vLLM ( #21018 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-15 19:46:56 -07:00
fa839565f2
[Misc] Refactor: Improve argument handling for conda command ( #20481 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-15 19:43:19 -07:00
75a99b98bf
[Chore] Remove outdated transformers check ( #20989 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-07-15 19:42:40 -07:00
b5c3b68359
[Misc] bump xgrammar version to v0.1.21 ( #20992 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-15 19:42:16 -07:00
6cbc4d4bea
[Model] Add ModelConfig class for GraniteMoeHybrid to override default max_seq_len_to_capture ( #20923 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-15 19:19:10 -07:00
153c6f1e61
[Frontend] Remove print left in FrontendArgs.add_cli_args ( #21004 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-15 19:18:41 -07:00
34cda778a0
[Frontend] OpenAI Responses API supports input image ( #20975 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-15 18:59:36 -06:00
30800b01c2
[Nvidia] Integrate SM100 cudnn prefill API to MLA prefill ( #20411 )
...
Signed-off-by: Elfie Guo <elfieg@nvidia.com >
Co-authored-by: Elfie Guo <eflieg@nvidia.com >
2025-07-15 17:56:45 -07:00
10be209493
[Bug Fix] get_distributed_init_method should get the ip from get_ip i… ( #20889 )
...
Signed-off-by: Chen Li <lcpingping@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-07-15 21:23:52 +00:00
19c863068b
[Frontend] Support cache_salt in /v1/completions and /v1/responses ( #20981 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
2025-07-15 21:01:04 +00:00
f29fd8a7f8
[BugFix] fix 3 issues: (1) using metadata for causal-conv1d, (2) indexing overflow in v1 vLLM, and (3) init_states in v0 ( #20838 )
...
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com >
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com >
2025-07-15 16:08:26 -04:00
ed10f3cea1
[ROCm] warpSize is being made non constexpr in ROCm 7.0 ( #20330 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-15 14:01:44 -04:00
b637e9dcb8
Add full serve CLI reference back to docs ( #20978 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 17:42:30 +00:00
1e36c8687e
[Deprecation] Remove nullable_kvs ( #20969 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 17:21:50 +00:00
5bac61362b
Configure Gemini ( #20971 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 09:37:05 -07:00
313ae8c16a
[Deprecation] Remove everything scheduled for removal in v0.10.0 ( #20979 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 15:57:53 +00:00
c847e34b39
[CI/Build] Fix wrong path in Transformers Nightly Models Test ( #20994 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-15 08:53:16 -07:00
e7e3e6d263
Voxtral ( #20970 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-15 07:35:30 -07:00
4ffd963fa0
[v1][core] Support for attention free models ( #20811 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2025-07-15 14:20:01 +00:00
56fe4bedd6
[Deprecation] Remove TokenizerPoolConfig ( #20968 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 14:00:50 +00:00
d91278181d
[doc] Add more details for Ray-based DP ( #20948 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-15 05:37:12 -07:00
20149d84d9
[MISC] Add init files for python package ( #20908 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-07-15 12:16:33 +00:00
3534c39a20
[V1] [Hybrid] Refactor mamba state shape calculation; enable V1 via cli ( #20840 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-15 04:04:35 -07:00
c586b55667
[TPU] Optimize kv cache update kernel ( #20415 )
...
Signed-off-by: Yifei Teng <tengyifei88@gmail.com >
2025-07-15 03:56:43 -07:00
33d560001e
[Docs] Improve documentation for ray cluster launcher helper script ( #20602 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-15 03:55:45 -07:00
f148c44c6a
[frontend] Refactor CLI Args for a better modular integration ( #20206 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-07-15 02:23:42 -07:00
235bfd5dfe
[Docs] Improve documentation for RLHF example ( #20598 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-15 01:54:10 -07:00
68d28e37b0
[frontend] Add --help=page option for paginated help output ( #20961 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-15 00:42:00 -07:00
37a7d5d74a
[Misc] Refactor AllReduceFusionPass. Remove parameter ( #20918 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-07-15 06:57:40 +00:00
d4d309409f
Implement Async Scheduling ( #19970 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-14 23:01:46 -07:00
85bd6599e4
[Model] Add AutoWeightsLoader support for BERT, RoBERTa ( #20534 )
...
Signed-off-by: Jennifer He <islandhe@gmail.com >
Signed-off-by: <islandhe@gmail.com >
Signed-off-by: Jen H <islandhe@gmail.com >
2025-07-15 13:34:24 +08:00
91b3d190ae
[cold start] replace VLLM_COMPILE_DEPYF with debug_dump_dir ( #20940 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-07-15 13:02:17 +08:00
fc017915f5
[Doc] Clearer mistral3 and pixtral model support description ( #20926 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-14 21:56:53 -07:00
9ad0a4588b
[Bugfix] Switch bailout logic for kv-cache-dtype with SM100 Flashinfer ( #20934 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-07-15 03:27:50 +00:00
016b8d1b7f
Enabled BnB NF4 inference on Gaudi ( #20172 )
...
Signed-off-by: Ruheena Suhani Shaik <rsshaik@habana.ai >
2025-07-14 20:26:08 -07:00
80305c1b24
[CI] Fix flaky test_streaming_response test ( #20913 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-14 20:15:15 -07:00
37e2ecace2
feat: add image zoom to improve image viewing experience ( #20763 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-14 20:14:23 -07:00
054c8657e3
[Docs] Add Kuberay to deployment integrations ( #20592 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-14 20:13:55 -07:00
d4170fad39
Use w8a8 quantized matmul Pallas kernel ( #19170 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-07-15 03:06:33 +00:00
946aadb4a0
[CI/Build] Split Entrypoints Test into LLM and API Server ( #20945 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-15 02:44:18 +00:00
bcdfb2a330
[Bugfix] Fix incorrect dispatch for CutlassBlockScaledGroupedGemm and DeepGEMM ( #20933 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-15 01:42:17 +00:00
ba8c300018
[BugFix] VLLM_DISABLE_COMPILE_CACHE=1 should disable all reads and writes from the cache ( #20942 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-07-15 01:26:18 +00:00
8cdc371217
SM100 Cutlass MLA decode with unrestricted num_heads (< 128) for DeepSeek TP ( #20769 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-07-15 01:06:38 +00:00
61e20828da
Fall back if flashinfer comm module not found ( #20936 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-14 23:11:18 +00:00
55e1c66da5
[Docs] remove outdated performance benchmark ( #20935 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-07-14 22:14:17 +00:00
86f3ac21ce
Fix overflow indexing in causal_conv1d kernel ( #20938 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-14 21:43:07 +00:00
149f2435a5
[Misc] Relax translations tests ( #20856 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-14 20:08:36 +00:00
c0569dbc82
[Misc] ModularKernel : Perform WeightAndReduce inside TritonExperts & DeepGemmExperts ( #20725 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-14 19:47:16 +00:00
8bb43b9c9e
Add benchmark dataset for mlperf llama tasks ( #20338 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-14 19:10:07 +00:00
559756214b
Change default model to Qwen3-0.6B ( #20335 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-07-14 16:54:52 +00:00
6d0cf239c6
[CI/Build] Add Transformers nightly tests in CI ( #20924 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-14 16:33:17 +00:00
3fc964433a
[Misc] Clean up Aimv2 config registration in Ovis config ( #20921 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-14 15:36:43 +00:00
0caf61c08a
[CI] Update codeowner for compilation code ( #20929 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-14 08:33:19 -07:00
667624659b
[CI] cc folks on changes to vllm/compilation ( #20925 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-07-14 07:52:17 -07:00
38efa28278
[Model] Add Ling implementation ( #20680 )
...
Signed-off-by: vito.yy <vito.yy@antgroup.com >
2025-07-14 22:10:32 +08:00
e8cc53af5e
[Misc] Log the reason for falling back to FlexAttention ( #20699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-14 04:16:51 -07:00
a4851cfe68
[Bugfix]: Fix messy code when using logprobs ( #20910 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-14 11:06:45 +00:00
9887e8ec50
[Misc] Remove unused function ( #20909 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-14 10:48:55 +00:00
f326ab9c88
[Bugfix] Bump up mistral_common to support v13 tokenizer ( #20905 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-14 10:45:03 +00:00
dcf2a5e208
[CI/Build] Fix OOM issue in Jina-VL test ( #20907 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-14 10:32:35 +00:00
1e9438e0b0
[MISC] Move bind_kv_cache to worker module ( #20900 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-07-14 09:40:00 +00:00
697ef765ee
[Refactor][V1] Move outlines utils for V1 imports ( #20878 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-07-14 00:58:35 -07:00
a99b9f7dee
[Quantization] add BNB for MixtralForCausalLM ( #20893 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-14 07:34:34 +00:00
c488b928a7
[ROCm] [Bugfix] [Critical]: Fix mamba compilation bug ( #20883 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-07-14 15:23:28 +08:00
2c7fa47161
Fix: Add missing EOFError handling in CLI complete command ( #20896 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-14 07:09:57 +00:00
88fc8a97e3
Removing redundant python version check ( #20888 )
...
Signed-off-by: Dannyso05 <dansong1177@gmail.com >
2025-07-14 06:15:05 +00:00
66f6fbd393
[Prefix Cache] Add reproducible prefix-cache block hashing using SHA-256 + CBOR (64bit) ( #20511 )
...
Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com >
2025-07-14 02:45:31 +00:00
8632e831ba
[Core] Add update_config RPC method ( #20095 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-14 00:49:18 +00:00
4bbfc36b16
[V1] Hybrid allocator without prefix caching ( #20661 )
...
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
2025-07-13 16:55:14 +00:00
80d38b8ac8
[V1] [ROCm] [AITER] Upgrade AITER to commit 916bf3c and bugfix APIs ( #20880 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-07-13 15:19:32 +00:00
211b6a6113
[Bugfix] fix define of RerankDocument ( #20877 )
...
Signed-off-by: liuchenlong <liuchenlong@xiaohongshu.com >
Co-authored-by: liuchenlong <liuchenlong@xiaohongshu.com >
2025-07-13 14:32:40 +00:00
247102f07f
[Bugfix] Fix: add patch_rope_scaling after hf override ( #20857 )
...
Signed-off-by: Wang Siyuan <wsy0227@sjtu.edu.cn >
Signed-off-by: Wang Siyuan <sywang0227@gmail.com >
2025-07-13 00:13:25 -07:00
bd4c1e6fdb
Support for LlamaForSequenceClassification ( #20807 )
...
Signed-off-by: thechaos16 <thechaos16@gmail.com >
2025-07-13 00:09:34 -07:00
99b4f080d8
Re-enable google/gemma-3-1b-it accuracy test. ( #20866 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-12 21:48:56 -07:00
020f58abcd
[Core] Support multiple tasks per model ( #20771 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-12 19:40:11 -07:00
c1acd6d7d4
[Refactor] Change the way of import triton ( #20774 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-12 19:39:55 -07:00
3b3b778d4a
[Bugfix] Fix a couple PPLX+CUTLASS MoE bugs ( #20825 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2025-07-12 19:39:14 -07:00
42d440c22b
[Perf] Use Triton instead of Torch for DeepGEMM Per Token Group Quant ( #20841 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-12 19:38:45 -07:00
f45a332886
[Sched] Enhance the logic to remove stopped requests from queues ( #20739 )
2025-07-12 15:33:13 -07:00
6e2c176e1f
[Bugfix] Restrict Machete to only run on Hopper ( #20830 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-12 17:34:40 +00:00
a86754a12b
[docs] convert supported configs to table ( #20858 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-12 06:54:50 -07:00
c2a2f19aba
[Bugfix] Fix Tensor Parallelism Padding Consistency in Granite Models ( #20843 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-07-12 06:11:30 -07:00
2c11a738b3
[Model] New model support for microsoft/Phi-4-mini-flash-reasoning ( #20702 )
...
Signed-off-by: Congcong Chen <congcongchen@microsoft.com >
2025-07-12 06:02:10 -07:00
b639327ad9
Revert "Use NVCC --compress-mode to reduce binary size by 30% #20694 " ( #20853 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 23:07:35 -07:00
4afe687a82
Enable ModelOpt Llama4 fp8 checkpoint deployment ( #20419 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
2025-07-11 23:07:16 -07:00
5de8d9f111
Remove extra tensor on CPU ( #20693 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-07-12 14:06:34 +08:00
c1c8ca57ff
[cold start time] add envs.VLLM_COMPILE_DEPYF to guard decompile ( #20790 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-07-11 23:06:13 -07:00
a3a5a47e48
[Bugfix] Fix torch.compile x LoRA for PyTorch 2.8 ( #20823 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-07-11 23:06:04 -07:00
fb25e95688
[Docs] Update basic.md ( #20846 )
2025-07-11 23:05:32 -07:00
0d4891cd03
[Bug] Fix DeepGemm for EP low latency case ( #20833 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-11 23:05:12 -07:00
f56d2996ca
[Misc] Respect no_use_tqdm_on_load flag while capturing CUDA graph ( #20834 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-07-11 23:04:45 -07:00
147afb448b
[Bugfix] Replace unavailable video url in multimodal test ( #20854 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-12 05:25:39 +00:00
3c7d942da8
[Frontend] Abstract prompt and SpeechToTextConfig for transcriptions models ( #20637 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-11 21:33:26 -07:00
890323dc1b
[Bugfix] : Fix typo - logger.warn_once -> logger.warning_once ( #20852 )
2025-07-11 20:56:24 -07:00
01cae37713
[CI/Build] Ensure compatibility with Transformers v4.53 ( #20541 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-11 20:53:07 -07:00
11c0198615
[Bugfix] Fix tensor parallel issue in Qwen3 reranker weight loading ( #20682 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-07-11 20:52:43 -07:00
b1235c3e10
[Bugfix] Lazy import fused_experts in BitsAndBytesMoEMethod to avoid break not-cuda-alike devices ( #20822 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-11 20:52:05 -07:00
44d02f54db
[Misc] Restrict deep_gemm's log output ( #20827 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-11 20:50:42 -07:00
a8593237c0
Add pynccl all-gatherv and reducescatterv ( #20154 )
...
Signed-off-by: Trevor Morris <tmorris@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-11 18:59:23 -07:00
fc0f41d10a
Integration SM100 FlashInfer fused allreduce RMSNorm ( #20691 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-07-11 18:58:15 -07:00
7b828e30d5
[CI Bug] Fix Async Engine, Inputs, Utils, Worker Test: 'State' object has no attribute 'enable_server_load_tracking' ( #20845 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-11 18:57:24 -07:00
5f0af36af5
Update kimi-k2 tool calling docs, enable unit tests ( #20821 )
...
Signed-off-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
2025-07-11 20:16:14 +00:00
0d21b2664c
[Bugfix] Fix OOM in language generation test ( #20814 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-11 11:21:52 -07:00
9907fc4494
[Docs] Data Parallel deployment documentation ( #20768 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-11 09:42:10 -07:00
d47661f0cd
[Kernel] Basic tuned configs for NVFP4 CUTLASS dense GEMM ( #20646 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 10:05:33 -06:00
53fa457391
[Misc] Add unit tests for MoE ModularKernel combinations + Profiling utility ( #20449 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-11 07:51:46 -07:00
6fb162447b
[doc] fix ordered list issue ( #20819 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-11 06:49:46 -07:00
66177189c5
[Bugfix] Add missing field to TritonLanguagePlaceholder ( #20812 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-11 05:25:11 -07:00
b4f0b5f9aa
Temporarily suspend google/gemma-3-1b-it. ( #20722 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-11 11:21:26 +00:00
cbd14ed561
[Bugfix] Refactor /invocations to be task-agnostic ( #20764 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-11 03:20:54 -07:00
7bd4c37ae7
[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). ( #19825 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: shuw <shuw@nvidia.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-11 09:23:23 +00:00
8020e98c9f
[Quantization][1/N] MoE support BNB-Inflight Quantization ( #20061 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-11 08:01:13 +00:00
762be26a8e
[Bugfix] Upgrade depyf to 0.19 and streamline custom pass logging ( #20777 )
...
Signed-off-by: Luka Govedic <lgovedic@redhat.com >
Signed-off-by: luka <lgovedic@redhat.com >
2025-07-11 00:15:22 -07:00
6a9e6b2abf
[doc] fold long code block ( #20795 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-10 23:16:41 -07:00
5d09152ff1
[V1] Enable Mamba2 layers other than MambaMixer2 in the v1 engine ( #20660 )
...
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
2025-07-11 05:53:31 +00:00
31d5c1797f
[Perf][fp8] Use CustomOp abstraction for fp8 quant for better perf ( #19830 )
...
Signed-off-by: Luka Govedic <lgovedic@redhat.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-11 04:56:28 +00:00
35514b682a
[XPU] XCCL support enabled in torch 2.8.0.dev nightly builds ( #20705 )
...
Signed-off-by: ratnampa <ratnam.parikh@intel.com >
2025-07-10 20:39:52 -07:00
e2de455c34
[Feature] Integrate SM100 DeepGEMM support ( #20087 )
2025-07-10 20:18:05 -07:00
5b032352cc
[Attention] MLA - Flashinfer Ragged Prefill ( #20034 )
2025-07-10 20:17:47 -07:00
922f316441
[Model] Support HF format of minimax ( #20211 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 02:55:21 +00:00
5923ab9524
[fix]: disable cutlass block scaled group gemm for EP ( #20781 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
2025-07-11 02:39:18 +00:00
0cf893cae1
Add kimi-k2 tool parser ( #20789 )
...
Signed-off-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
2025-07-11 10:36:23 +08:00
cf75cd2098
[CI Bugfix] Specify same TORCH_CUDA_ARCH_LIST for flashinfer aot and install ( #20772 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 01:16:01 +00:00
b854321ffe
[Docs] Lazy import gguf ( #20785 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-07-10 16:06:37 -07:00
5b6fe23d05
[Bugfix][Benchmark] Make sure the output length > 0 when testing prefill workload. ( #20786 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-10 14:52:46 -07:00
f0c98cae27
[Misc] MoE ModularKernel : Introduce TopKWeightAndReduce ( #20648 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-10 14:40:38 -07:00
574ad60db9
[KVConnector] Always call connector clear_metadata() at end of step ( #20756 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: David Ben-David <sdavidbd@gmail.com >
2025-07-10 22:37:27 +01:00