920db41128
[Quantization/NVFP4] Speed up TRTLLM NVFP4 MOE weight loading and fix K/V scale loading for MLA Attn ( #25968 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
9ea82ecd25
Fix V1 engine serialization error with Ray distributed executor ( #26148 )
...
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
13e211bbbc
Avoid division by zero in cache DS MLA kernel ( #26174 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
2d68bba3cd
Stop mergify from keeping stale PRs alive ( #26169 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
e45271b09c
[BugFix][QWEN-VL]fix wrong apply_rotary_emb_torch selection introduced by #24642 ( #26123 )
...
Signed-off-by: Chendi Xue <Chendi.Xue@intel.com >
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
84135b1489
Fix undefined symbol: cutlass_moe_mm_sm100 ( #26098 )
...
Signed-off-by: Jun Jiang <jasl9187@hotmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
611c23b68f
[Renderer] Move Processor out of LLMEngine ( #26165 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
c40c0d9c82
[Model] Fixed stream generator for gpt-oss + spec-decoding ( #26027 )
...
Signed-off-by: Aleksandr Samarin <astrlrd@nebius.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
d8b1f9ccc3
[CI/Build] do not enforce precompilation on tpu ci tests ( #25992 )
...
Signed-off-by: Xiang Si <sixiang@google.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
fac9b430ec
[Model] Supplement to PR 24862: Pass param prefix to LLMHead ( #25805 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
c6f384dafd
[backends][short_conv] CUDA graph piecewise edits ( #24215 )
...
Signed-off-by: Paul Pak <paulpak58@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
7faf51f1cc
[Bugfix] Re-enable prefill of max model length ( #24446 )
...
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
ff1daf6c8a
[Renderer] Move Processor out of AsyncLLM ( #24138 )
...
Signed-off-by: Yang <lymailforjob@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
f376868620
Quick fix for IMA with the Prefix Prefill kernel during graph capture ( #25983 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
564233d550
[Doc] Fixed shape description for fused_batched_moe.py ( #25668 )
...
Signed-off-by: Egor <e.a.krivov@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
2bcc745042
[Multi Modal] Configurable MM Profiling ( #25631 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
fa29d31f0d
[openai] Fix missing tool usage check (system message) ( #24768 )
...
Signed-off-by: kyt <eluban4532@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
2168fc8fae
[NIXL][Misc] Expose metrics from NIXL for logging to CLI ( #25388 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
8d332b3cf6
[CI] Fix distributed hybrid tests in CI ( #26155 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
c634415273
[test utils] correct wrong typing ( #26159 )
...
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
c81dc099a3
[Model] Use merge_by_field_config for MM models (InternVL family) ( #26153 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
edaae1825f
add(v1): RequestStatesStats to RequestOutput ( #24947 )
...
Signed-off-by: huijjj <huijong.jeong@squeezebits.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
5b80f22087
[Perf] Optimize reshape_and_cache CUDA Kernel ( #25955 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Co-authored-by: Liu-congo <1502632128@qq.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
ae03f4c010
[Input] Remove unused prompt field ( #26097 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
7e4b1861c3
[Misc] Remove typing.List ( #26150 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
d628fa1e56
[BUG] Reorder model config creation ( #26124 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
6b12b2ee38
FusedMoE support for the Transformers backend ( #22650 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
bbeace233b
[Model] Use merge_by_field_config for MM models (G) ( #26117 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
09b1a5676d
[Bugfix] Fix import gemm_afp4wfp4 failure on AMD ( #26068 )
...
Signed-off-by: zhewenli <zhewenli@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
f35f896e3a
[ROCm] [VL] [Bugfix] Fix vit flash attn dispatcher logic for ROCm ( #26104 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
218349d760
[Build/CI] Revert back to Ubuntu 20.04, install python 3.12 with uv ( #26103 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
79b2fe7f19
[gpt-oss] disable tool server initialization if no tool in request ( #25790 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
56d0073f2a
[Bug]: Limit num_reqs in dummy_run when max_num_seqs is small ( #26144 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
a06bb9bf36
[DeepSeek] Improve performance of DS MLA cache kernel ( #26132 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
173c8a9520
[CI/Build] Conditionally register cutlass_fp4_group_mm to fix building on Hopper ( #26138 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
2ea7d48656
[Attention] Move Backend enum into registry ( #25893 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
8db7b7f39c
[Bug][Benchmark] Fix duplicate req in oversampling ( #26140 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
587b30c571
[Log] Optimize DeepGEMM Missing Log ( #26106 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
0c76bb2de1
[Bugfix] Disable cascade attention with FlashInfer ( #26130 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
72c5dd0310
Fix MTP with deepep_low_latency ( #25904 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
abc55b1fe5
[Perf] Fix and reapply move apply w8a8 block fp8 linear to class ( #25696 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: ElizaWszola <elizaw.9289@gmail.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
d737c66b95
[Mamba][KVCacheManager] Simplify kv cache manage logic for mamba + MTP ( #25119 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
da3a188bdb
EAGLE 3: Fix preamble so that measured speedup over Eagle 1 becomes 32% instead of 5% on MTBench ( #25916 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:58 -07:00
77e958752b
[Deepseek v3.2] Support indexer prefill chunking ( #25999 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
c5880cfa4c
[Small] Prevent bypassing media domain restriction via HTTP redirects ( #26035 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
01888b5cbf
[BugFix] Fix FI accuracy issue when used for MLA prefill ( #26063 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
fa179abde3
[CI/Build] Replace vllm.entrypoints.openai.api_server entrypoint with vllm serve command ( #25967 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
5c8a4a2208
[CI] Add Blackwell DeepSeek FP8 FlashInfer MoE tests ( #26040 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
06d102ecc8
[Qwen][ROCm] Flash Attention Rotary Embeddings ( #24642 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
422f2cca4b
[Platform][CI] Added OOT platform interface e2e test that running on Ascend NPU ( #25470 )
...
Signed-off-by: leo-pony <nengjunma@outlook.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
3884dce376
[Model] Use merge_by_field_config for MM models (D-F) ( #26076 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
00c0b25e82
[Model] Use merge_by_field_config for MM models (A-C) ( #26073 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
0655b90d80
[FA/Chore] Bump vllm-flash-attention ( #25537 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
83fa298682
Change size of single CUDA graph for CI to 4 ( #26089 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
5a083ce2ea
Update base image to 22.04 (jammy) ( #26065 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
115019045d
Run:ai model streamer add GCS package support ( #24909 )
...
Signed-off-by: Peter Schuurman <psch@google.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
93d2be10b6
[Misc] Make handling of SamplingParams clearer in n>1 case ( #26032 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
91e10c725c
[ROCm][Bugfix] Add missing parameter to ROCm backend ( #26029 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
2ae74a80af
Support RL online quantization with torchao ( #23014 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
ac1598d166
[BugFix] ChunkedLocalAttention is currently not CG compatible ( #26034 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
ce8ee3d9e7
[Bug] Fix Negative Cuda Memory Usage ( #25683 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
d4a83e01bb
[ROCm][Build] Add support for AMD Ryzen AI MAX / AI 300 Series ( #25908 )
...
Signed-off-by: Hosang Yoon <hosang.yoon@amd.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
90529cec41
[BugFix][DP/EP] Fix CUTLASS MLA hang under load ( #26026 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
bba7623426
[CI] Tweaks to GPT-OSS Eval (Blackwell) for stability ( #26030 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
d2f544018f
Fix test_mamba_ssm_ssd.py due to missing _query_start_loc_to_chunk_indices_offsets ( #25995 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
ed7eb771a3
[NVIDIA] Blackwell Family ( #24673 )
...
Signed-off-by: Johnny <johnnynuca14@gmail.com >
Signed-off-by: johnnynunez <johnnynuca14@gmail.com >
Signed-off-by: Johnny <johnnync13@gmail.com >
Signed-off-by: Salvatore Cena <cena@cenas.it >
Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com >
Co-authored-by: Salvatore Cena <cena@cenas.it >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
0944358a90
[Bugfix] Apply same sampling parameters for both n=1 and n>1 ( #26005 )
...
Signed-off-by: Kenichi Maehashi <maehashi@preferred.jp >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
aeff0604bb
[Benchmark] Finish documented v0.11.0 deprecation of --endpoint-type ( #26007 )
...
Signed-off-by: Nathan Scott <nathans@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
a561b9832d
[MISC] Fix misleading batch_size_capture_list when cuda_graph_sizes < 4 ( #25829 )
...
Signed-off-by: billishyahao <bill.he@amd.com >
Co-authored-by: Luka Govedic <ProExpertProg@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
e8773e620f
[CI] Only capture a single CUDA graph size in CI by default ( #25951 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
63c56cbb25
[Misc] Factor out common _apply_feature_select_strategy ( #26003 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
25e5b9ccec
[BugFix][MM] Fix Nonetype error when video is cache in qwen2.5-omni-thinker ( #26004 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
b9ed8c9679
[Doc] updating torch.compile doc link ( #25989 )
...
Signed-off-by: nadathurv <work.vnadathur@gmail.com >
Signed-off-by: WorldExplored <srreyansh.sethi@gmail.com >
Co-authored-by: Srreyansh Sethi <107075589+WorldExplored@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
9506409fc6
[Misc]allow disable pynccl ( #25421 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
fda819837e
Update to Transformers v4.56.2 ( #24638 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
7c795fdf41
[BugFix] Fix default kv-cache-dtype default for DeepseekV3.2 ( #25988 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
6444f65a2b
[Bugfix] Fix __syncwarp on ROCM ( #25996 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
4c094b339e
[MM] Add text-only mode for Qwen3-VL ( #26000 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
cd0bbf5de2
Fix INT8 quantization error on Blackwell GPUs (SM100+) ( #25935 )
...
Signed-off-by: padg9912 <phone.and.desktop@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
2b6b859916
[Log] Optimize Log for FP8MOE ( #25709 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
04cb503fda
Update launch_bounds_utils.h for correct compile on Multiple Cuda Arch - PTXAS out of range Warning ( #25843 )
...
Signed-off-by: Salvatore Cena <cena@cenas.it >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
d437ba32fd
[Model] MTP fallback to eager for DeepSeek v32 ( #25982 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
e734a2a085
[Misc] Make EP kernels install script support uv ( #25785 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
fd56f2e644
[gpt-oss] use vLLM instead of openai types for streaming ( #25186 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
1690954497
[Docs] Remove API Reference from search index ( #25949 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
b3e1846da6
Add explicit pooling classes for the Transformers backend ( #25322 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
8328d39d40
[V1] [P/D] Add Support for KV Load Failure Recovery ( #19330 )
...
Signed-off-by: David Ben-David <davidb@pliops.com >
Co-authored-by: David Ben-David <davidb@pliops.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
ef318228e7
[Bench] Add DeepSeekV32 to MoE benchmark ( #25962 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
8ecccdd15f
[Llama4] [multimodal] Fix misplaced dtype cast of cos_sin_cache in Llama4VisionRotaryEmbedding ( #25889 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
bb2e04e41e
OffloadingConnector: Fix GPU block tracking bug ( #25856 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
6083b4d926
[Docs] Add moe kernel features doc ( #25297 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Signed-off-by: bnellnm <49004751+bnellnm@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
493acdb7e2
[Doc] Improve MM Pooling model documentation ( #25966 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
3c75d3b00c
[Bug] Fix AttributeError: 'QKVParallelLinear' object has no attribute 'orig_dtype' ( #25958 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
206ab1f0df
[bugfix][deepseek] fix flashmla kernel selection ( #25956 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
e33579cd96
[Bugfix] Token type and position embeddings fail to be applied to inputs_embeds ( #25922 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
8c52fccb1a
[Bugfix] Fix accuracy issue of TRTLLM FP8 MOE and improve logging ( #25895 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
ea6144a019
[Bugfix][Model] Fix inference for Hunyuan dense models ( #25354 )
...
Signed-off-by: anion <1005128408@qq.com >
Signed-off-by: Anion <123177548+Anionex@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
b6ea29b721
Add Hugging Face Inference Endpoints guide to Deployment docs ( #25886 )
...
Signed-off-by: sergiopaniego <sergiopaniegoblanco@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
d9f8ded136
[Kernel][Moe Configs] Add more tuned triton configs for ExpertsInt8 and FP8 ( #25858 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
02776c0386
[Fix] Improve CPU backend compatibility for RISC-V ( #25816 )
...
Signed-off-by: lyd1992 <liuyudong@iscas.ac.cn >
Signed-off-by: ihb2032 <1355790728@qq.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
8914d52869
[CI] Move applicable tests to CPU ( #24080 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
bf8bb7e250
[NIXL] Add support for MLA caches with different latent dim ( #25902 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
eea2536a35
[perf] Use CPU tensor to reduce GPU->CPU sync ( #25884 )
...
Signed-off-by: Lehua Ding <lehuading@tencent.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
a1898466a6
[Model] Move vision_feature_select_strategy into resolve_visual_encoder_outputs ( #25938 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:57 -07:00
9dce93e07c
[Bugfix][Model]fix ernie45 moe gate&bias dtype to float32 ( #25936 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
c0734fc51a
Updated TRL integration docs ( #25684 )
...
Signed-off-by: sergiopaniego <sergiopaniegoblanco@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
034f3a4980
[Doc] Add Cambricon MLU support ( #25942 )
...
Signed-off-by: a120092009 <zhaoty0121@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
0230cd0afb
[New Model] DeepSeek-V3.2 (Rebased to Main) ( #25896 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com >
Signed-off-by: Lucia Fang <fanglu@meta.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com >
Co-authored-by: Lucia Fang <fanglu@meta.com >
Co-authored-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Xiaozhu Meng <mxz297@gmail.com >
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
da71651386
[Bugfix]: Clean up chunked prefill logging when using whisper ( #25075 )
...
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
0da98ff2eb
[Model][Bugfix] Fix MiDashengLM audio encoder mask by removing incorrect logical_not ( #25925 )
...
Signed-off-by: zhoukz <me@zhoukz.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
db4a03e2e2
[BugFix] Pass config_format via try_get_generation_config ( #25912 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
e165f980d9
[BugFix] Fix DP/EP hang ( #25906 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
ea7cf8db35
Move VllmConfig from config/__init__.py to config/vllm.py ( #25271 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
1108ffb3e6
[Benchmark] Support benchmark throughput for external launcher DP ( #25913 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
0c7cc69e29
[Bug] Fix Weight Loading for Block FP8 Cutlass SM90 ( #25909 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
6941d53c0c
Test Prompt Embeds/LoRA compatibility and Enable LoRA Support for OPT Models ( #25717 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
97f1312f8c
[V0 Deprecation] Remove vllm.worker and update according imports ( #25901 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
09b01cd395
[NIXL] Increase default KV block eviction timeout on P ( #25897 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
4deb9c88ca
[Doc] Polish example for torchrun dp ( #25899 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
b7973eabe5
[Kernel] Chunk-aligned mamba2 ( #24683 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
e7203c2338
[Bugfix][ROCm] Fixing trying to import non-existent symbols from libnccl.so ( #25605 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
ae0c35923f
[Doc] Add documentation for vLLM continuous benchmarking and profiling ( #25819 )
...
Signed-off-by: Naman Lalit <nl2688@nyu.edu >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
c692506e10
[BugFix][torch.compile] KV scale calculation issues with FP8 quantization ( #25513 )
...
Signed-off-by: adabeyta <aabeyta@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
9555929e13
[Bugfix] Use correct key "ignore" for config.json non-quantized layers ( #25706 )
...
Signed-off-by: Lee Nau <lnau@nvidia.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
2405817748
[Model] Remove MotifForCausalLM ( #25866 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
616bce15ce
[CI/Build] Include Transformers backend test in nightly transformers test ( #25885 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
c33992154a
[Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue ( #25883 )
...
Signed-off-by: Rahul Tuli <rtuli@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
f84b2a0dd0
[Nixl][P/D] Add cuda2cpu support (HD->DH transfer) ( #24690 )
...
Signed-off-by: Chenxi Yang <cxyang@fb.com >
Co-authored-by: Chenxi Yang <cxyang@fb.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
9f78b9ca84
[torch.compile] serialize cudagraph_mode as its enum name instead of value ( #25868 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
4e2774f5c3
[Model][Bugfix] Fix issues in MiDashengLM implementation for quantized models ( #25854 )
...
Signed-off-by: zhoukz <me@zhoukz.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
85d4306047
[Bugfix] Fix requirements paths in install instructions ( #25827 )
...
Signed-off-by: yingjun-mou <renzomou@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
770a2cf7ae
update to latest deepgemm for dsv3.2 ( #25871 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
ea55445b8d
[Misc] Remove more get_input_embeddings_v0 ( #25857 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
b765adccd7
[V0 Deprecation][Models] Remove all V0 condition for mm embeddings merge ( #25331 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: isotr0py <2037008807@qq.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
4079a63a86
[Bugfix] Fallback ViT attn backend to SDPA for blackwell ( #25851 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
00eba10dd1
[XPU]Fix xpu spec decoding UTs, avoid using cuda graph ( #25847 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
20d1d0e38b
Add Phi4FlashForCausalLM to _PREVIOUSLY_SUPPORTED_MODELS ( #25832 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
70ba2d1ec9
[P/D] NIXL Updates ( #25844 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: rentianyue-jk <rentianyue-jk@360shuke.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: rentianyue-jk <rentianyue-jk@360shuke.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Chenheli Hua <huachenheli@outlook.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
eb447aff56
[Misc] fix tests failure by using current_platform ( #25825 )
...
Signed-off-by: Juechen Liu <jueliu@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
cf0a7912ca
Remove redundant cudagraph dispatcher warning ( #25841 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
0b343e3218
[Bugfix] fix Qwen3VLMoe load when pp > 1 ( #25838 )
...
Signed-off-by: liuye.hj <liuye.hj@alibaba-inc.com >
Co-authored-by: liuye.hj <liuye.hj@alibaba-inc.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
e40c12696a
Update GLM-4.5 Doc transformers version ( #25830 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
02ab3860a6
Fix random dataset mismatched token length with config. ( #24937 )
...
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
6dee906d2c
[VLM] Update Qwen3-VL max_num_video_tokens calculation for configurable video profiling ( #25557 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
495f368238
[Bugfix] Fix Qwen3-VL regression from #24982 ( #25814 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
02e87f1893
[MM] Optimize memory profiling for scattered multimodal embeddings ( #25810 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
32cb65b2b6
[Bugfix][NIXL] Fix Async Scheduler timeout issue ( #25808 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
04384cb9da
[Core] GC Debug callback ( #24829 )
...
Signed-off-by: Jialin Ouyang <jialino@meta.com >
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
Co-authored-by: Jialin Ouyang <jialino@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
942fba3823
[Bug]: Set LD_LIBRARY_PATH to include the 'standard' CUDA location ( #25766 )
...
Signed-off-by: Clayton Coleman <smarterclayton@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
d8fc00d623
[torch.compile]: Add VLLM_DEBUG_DUMP_PATH environment variable ( #25651 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
7b28ef2bc1
[Core] Refactor self.model() to call a helper for subclassing. ( #25084 )
...
Signed-off-by: Patrick Toulme <ptoulme@meta.com >
Signed-off-by: Patrick Toulme <pctoulme+1@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
9b4c752106
[env] default nixl side port conflicts with kv-event zmq port ( #25056 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
7d92e508b4
[docs] transcriptions API audio upload ( #25446 )
...
Signed-off-by: zxw <1020938856@qq.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
e94aabe03d
[Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models ( #24982 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
1e5e5d757e
[Bugfix] Fix triton import precommit failure ( #25803 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
c7ae7edb33
Fix GPTQ model loading in Transformers backend ( #25770 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
1cb6005627
Add filtering for chat template kwargs ( #25794 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
3e7f33c801
Validate API tokens in constant time ( #25781 )
...
Signed-off-by: rentianyue-jk <rentianyue-jk@360shuke.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: rentianyue-jk <rentianyue-jk@360shuke.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
0b8166aa8f
[Bugfix] Merge MM embeddings by index instead of token IDs ( #16229 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
6970fa9937
[Bugfix] Add missing image_size for phi4_multimodal ( #25796 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
d7cf378359
[Misc] Update openai client example file for multimodal ( #25795 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
1171480d88
[Misc] Fix codeowners override for v1 sample and attention ( #25037 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
0f97a2e1db
[CI/Build] Reorganize root-level V1 tests ( #25767 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
a8913725a1
[CI/Build] Add timing to Model Executor Test ( #25799 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
0a4674c871
[CI/Build] Consolidate model loader tests and requirements ( #25765 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
1a893d188c
[Bugfix] Allow Only SDPA Backend for ViT on B200 for Qwen3-VL ( #25788 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
38c2df831a
[Multimodal][Speculative Decoding]Eagle Eagle3 mm support, enablement on qwen2.5vl ( #22872 )
...
Signed-off-by: Junhong <liujunhong11@huawei.com >
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com >
Co-authored-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: LJH-LBJ <98734602+LJH-LBJ@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
55971f85c9
Add flashinfer-build.sh and register precompiled cu128 wheel in Dockerfile ( #25782 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
dbb7782d5b
Add option to restrict media domains ( #25783 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
806b292c0e
[Core] Don't count preempted tokens in prefix cache hit rate ( #25787 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
93ba7648d0
[Spec decode] automatically disable mm for text-only draft models ( #25667 )
...
Signed-off-by: Jonas Kuebler <kuebj@amazon.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
e7cba8f6b1
[Bugfix] Optimize CpuGpuBuffer initialization ( #25447 )
...
Signed-off-by: Naman Lalit <nl2688@nyu.edu >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
c4b9864e22
Kernel-override Determinism [1/n] ( #25603 )
...
Signed-off-by: Bram Wasti <bwasti@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
dbdea93f46
Reduce the Cuda Graph memory footprint when running with DBO ( #25779 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
1356ae0aa8
[spec decode] Consolidate speculative decode method name for MTP ( #25232 )
...
Signed-off-by: zixi-qi <qizixi@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
dc191cc5d9
[CI] Fix FlashInfer AOT in release docker image ( #25730 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
ceb346015c
[V1] address post issues related to #20059 (part 1) ( #23046 )
...
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
b6f16d37b0
[CI] Add E2E Blackwell Quantized MoE Test ( #25723 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
5157781987
[Docs] Add Toronto Meetup ( #25773 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
f16c440c9f
[Bugfix] Improve GLM4 MoE Reasoning Parser's is_reasoning_end Condition ( #25355 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Signed-off-by: Frank Wang <41319051+frankwang28@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:56 -07:00
8c1b61bd77
[Doc]: improve CPU(x86) build-wheel-from-source section ( #25617 )
...
Signed-off-by: Kosseila (CloudThrill) <klouddude@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
e0175fbf01
Eagle3 that supports the Minicpm3 model ( #24243 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: liudan <adan@minicpm.com >
Co-authored-by: liudan <liudan@qq.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
c72298213d
[Misc] fix unique_filepath ( #25732 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
41174e2803
[ray][metrics] Replace ':' with '_' for OpenTelemetry compatibility in Ray ( #25439 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com >
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
6ca8d9753c
[BugFix] Fix using dbo_decode_token_threshold always (and ignoring dbo_prefill_token_threshold) ( #25622 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
d70c154975
[Quantization] Add field to skip unquantized modules for GPTQ config ( #25455 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
129a643b4c
[CI/Build] Fix some V1 tests not being run ( #25569 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
d3c732e985
[CI/Build] Split up Distributed Tests ( #25572 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
fb0eece290
[Bugfix] Properly abort pooling request. ( #25734 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
515e30b023
[CI] Fix test_shared_storage_connector_hashes ( #25748 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
62ae26c870
[Model] Mamba2 varlen refactor ( #21467 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
Co-authored-by: RishiAstra <40644327+RishiAstra@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
87ee8535a6
[Doc] Update Batch-level DP docs ( #25757 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
ced693e845
Support LongCat-Flash-Chat tool call ( #24083 )
...
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
fa55373af1
[Bugfix] Fix Shared Expert/Zero expert code in FusedMoE.process_chunk ( #25698 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
c761b84d5f
[misc] refactor speculative config ( #25657 )
...
Signed-off-by: zxw <1020938856@qq.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
bc37468b3c
Remove cuda hard-code in compute_causal_conv1d_metadata ( #25555 )
...
Signed-off-by: Icey <1790571317@qq.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
067fe8b10e
[Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and a stride bug in causal_conv_1d. ( #25743 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
0aea9348cc
fix: print output offline_inference/base/chat.py example ( #25744 )
...
Signed-off-by: Iceber Gu <caiwei95@hotmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
79586c5449
[Hardware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI300X ( #25703 )
...
Signed-off-by: xaguilar <Xavier.AguilarFruto@amd.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
b2d5d42337
perf: Avoid copying inputs_embeds tensors to GPU unless prompt_embeds is enabled ( #25739 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
74ea69f413
fix: revert cast to cpu in MsgpackEncoder._encode_tensor to avoid hidden performance regressions ( #25738 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
e82e3b55f6
[CI/Build] fix doc build warning: Failed to get 'name: description' pair ( #25733 )
...
Signed-off-by: yiting.jiang <yiting.jiang@daocloud.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
9e6628ccfc
EVS Support (Video tokens pruning) ( #22980 )
...
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com >
Signed-off-by: Eugene Khvedchenya <ekhvedchenya@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
6ada221271
[Misc] Remove unnecessary memoryviews in shm_broadcast.py ( #25721 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
ef160aa08e
[Core] Force PIECEWISE CUDAGraph mode for encoder-decoder ( #25701 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
c064c82674
Llamas 3.1 405B fp4 changes upstreaming from 355_wip ( #25135 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
6f97de4e47
[Misc] Don't log shm dequeue delay warning on worker side ( #25720 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
3a32aa8a6b
[Refactor] Remove DeepGEMM OP Register ( #25710 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
1d21080118
Fix routing_bias dtype ( #25711 )
...
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
1d1436c3f7
[Model] rename NemotronH_Nano_VL -> NemotronH_Nano_VL_V2 ( #25708 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
37d836081a
[Core] Enable command line logging for LLMEngine ( #25610 )
...
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
f3a478b55e
[Spec Decode] Add Batch Parallel Ngram. Up to 8x lower overhead. ( #24986 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
b558c3a8b7
[Optimization] Use a cheaper cache key in get_model_architecture ( #25682 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
745b204ddc
[Optimization] Streamline InputPreprocessor ( #25702 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
b0e9f04bbd
[Misc] Simplify test_argsort_mm_positions ( #25690 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
80385959af
[V0 deprecation] Clean up LoRA ( #25686 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
a355561291
[V0 deprecation] Remove _VLLM_V1 suffixes from attention backend names ( #25489 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
9659b7e78f
[V0 deprecation] Clean up V0 fallback in compilation config ( #25675 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
34e6a31e40
[Model] Define merge_by_field_config MM interface ( #25676 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
c7ca3c5d2f
[Model] Add optional parameter to reasoning parser constructor ( #25554 )
...
Signed-off-by: taohui <taohui3@gmail.com >
Signed-off-by: Tao Hui <taohui3@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
fe6357a780
[BugFix] Fix DBO hang ( #25625 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
0cee734ab4
Revert "[Bug] Dynamo Unsupported due to BasevLLMParameter.torch_function calling disabled super()" ( #25681 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
252a0ff8c3
[BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin… ( #24662 )
...
Signed-off-by: AlonKejzman <alonkeizman@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
2655d7ab83
[Logging] Remove TORCH_NCCL_AVOID_RECORD_STREAMS to squash a warning ( #25532 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
91d4299774
[Misc] Remove cruft file in repo ( #25678 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
f7f76a8668
[Bugfix] Fix InternS1 video processing after Transformers v4.56 ( #25644 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
054c8b526f
[ux] Switch a warning to debug about a pytorch fallback ( #23750 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
2469b8291b
[CPU] update torch 2.8 and fix missing fields in TorchSDPAMetadata ( #25652 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
18c20257bf
[torch.compile] Make Query Quantization Fusable ( #24914 )
...
Signed-off-by: Jonas Kuebler <kuebj@amazon.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
a5fa821b96
[misc] log info messages by default for hanging / busy / idle ( #25627 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
af10a37c6c
[mypy] Fix wrong type annotations related to tuple ( #25660 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
a88371f84e
[Hardware][RISC-V] Add riscv64 support for vLLM with scalar ( #22112 )
...
Signed-off-by: chenlang <chen.lang5@zte.com.cn >
Co-authored-by: chenlang <10346245@zte.com.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
d7f6489f50
[XPU][Triton]add xpu config in triton_reshape_and_cache_flash ( #25643 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
222411313d
[CI/Build] Fix flaky entrypoints test ( #25663 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
22114ffebb
Add backward compatibility for guided_... API ( #25615 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
f3d9099b44
[V0 deprecation] Remove unreachable model_config.supported_tasks ( #25642 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
3d940e2c3f
[Bugfix] Parse SpeculativeConfig Error ( #25142 )
...
Signed-off-by: zxw <1020938856@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
686cfd91e3
[mypy] Further improve MM type annotations ( #25654 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
f17d37b006
[Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video profiling ( #25648 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
034c0152db
[Bugfix] Add triton.language.tensor placeholder ( #25649 )
...
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
fd28c58825
[Misc] Fix Qwen3-VL video_grid_thw typing ( #25646 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
5e16b8c552
[fix] Update torch version in cpu-build.txt for AArch64/ppc64le and Darwin ( #25579 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
6c6e553644
Revert "[Performance] Move apply_w8a8_block_fp8_linear to an op class… ( #25607 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
6a437a4178
typo: remove duplicate is ( #25641 )
...
Signed-off-by: nicole-lihui <nicole.li@daocloud.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
004eed39ff
Map CwmForCausalLM to llama and LlamaForCausalLM ( #25611 )
...
Signed-off-by: Jacob Kahn <jacobkahn1@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
8b17d2554c
[Misc] Simplify PoolerOutput and move to v1/outputs ( #25629 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
94b78f576c
[Bugfix] fix apply_temperature to avoid nan in probs ( #24734 )
...
Signed-off-by: courage17340 <courage17340@163.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
d8ffa3c5f4
optimize: eliminate duplicate split_enc_dec_inputs calls ( #25573 )
...
Signed-off-by: nicole-lihui <nicole.li@daocloud.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
c26e7b14d7
[Model] Add LongCat-Flash ( #23991 )
...
Signed-off-by: yangxurui <yangxurui@meituan.com >
Co-authored-by: yangxurui <yangxurui@meituan.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
12c21d28c1
Enable Fbgemm NVFP4 on Dense models ( #25609 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
517a857166
[Bug] Dynamo Unsupported due to BasevLLMParameter.torch_function calling disabled super() ( #25613 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
b839194931
[Kernel] Support DCP for Triton backend ( #25132 )
...
Signed-off-by: Wei Wei <wwei6@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
1d6f767dc4
[Model] Improve DotsOCRForCausalLM ( #25466 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
b95429c920
[MISC] replace c10::optional with std::optional ( #25602 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
7319686692
Improve --help for enhanced user experience ( #24903 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
b3fd4ed80c
[Refactor] Use DeepGEMM Col Major TMA Aligned Tensor ( #25517 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
461aa1463b
feat: BF16 FlashInfer Fused Cutlass MOE for Hopper and Blackwell Expert Parallel ( #25503 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
b4a80dad98
[Logging] Improve log for when DeepEP HT disables CUDA Graphs ( #25531 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
61a6443bc3
[V0 Deprecation] Remove unused classes in attention ( #25541 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
c8071faa5d
fix compile error
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
46ed215d6b
[Docs] Enable fail_on_warning for the docs build in CI ( #25580 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
0e0d51c9c6
Suppress benign cuBLAS warning when capturing cudagraphs with DBO ( #25596 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
72a5101c7a
Support mnnvl all2allv from Flashinfer ( #21003 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com >
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
7d9f44ad2a
[Bugfix] add cache model when from object storage get model ( #24764 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
984bfb4ba7
Fixes and updates to bench_per_token_quant_fp8 ( #25591 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
b1f9a1f46a
[ROCm][Build][Bugfix] Fix ROCm base docker whls installation order ( #25415 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
3331ced61b
[ROCm][Bugfix] Only enable +rms_norm based on aiter if not explicitly disabled ( #25275 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
b614e0f82b
[Misc] Improve type annotations for jsontree ( #25577 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
44d6701f70
Move DeviceConfig, ObservabilityConfig, SpeechToTextConfig to their own files ( #25564 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
71566e8afc
[Bugfix] Fix DeepSeekV31ToolParser to correctly parse multiple tools in non-streaming output ( #25405 )
...
Signed-off-by: taohui <taohui3@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
88d8c72d5f
[docs] fix nixl kv_connector_extra_config.backends key ( #25565 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: Peter Pan <peter.pan@daocloud.io >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:55 -07:00
0cb913b0a2
[Benchmark] Fix regression in structured output benchmark ( #25500 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
f98d4d38c0
[Bug] fix import and unit test ( #25558 )
...
Signed-off-by: Jonas M. Kübler <44084297+jmkuebler@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
d5c0f43b86
[Bugfix] Fix dummy video number of frames calculation ( #25553 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
54174c67f8
[misc] update the warning message ( #25566 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
d1e2d17b57
[BugFix] Potential Fix for FA3 full-cudagraph IMA ( #25490 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
9914857f2b
[V0 Deprecation] Remove max_seq_len_to_capture ( #25543 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
7441d07360
[CI/Build] add nightly prime-rl integration tests ( #25207 )
...
Signed-off-by: Jackmin801 <ongjackm@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
4ca175ea0b
[Misc] Move processing context to multimodal directory ( #25548 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
c39befcead
[CI/Build] Fix v1 OOT registration test ( #25547 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
c8ef8a50d2
[Bugfix][CPU] Skip unsupported custom op register on CPU ( #25534 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
fc90ce79f0
[Misc] Retry HF processing if "Already borrowed" error occurs ( #25535 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
5b4ba2e1e1
[TPU][Bugfix] fix the missing apply_model in tpu worker ( #25526 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
d7fb5a4ae8
[Bugfix] [Frontend] Cleanup gpt-oss non-streaming chat tool calls ( #25514 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
f52b991db6
[Perf] Fix jit compiles at runtime of fla gated delta rule ( #25432 )
...
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
177c37e960
[Spec Decode] Enable FlashInfer Spec Decoding ( #25196 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Co-authored-by: lhsjohn <huashuoli@tencent.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
0e54bbe108
[KV sharing] Re-land Gemma3n model changes from #22628 ( #24357 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
6b87ce2ecd
[fix]: add Arm 4bit fused moe support ( #23809 )
...
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
a986f17028
[BugFix] Fix MLA assert with CUTLASS MLA ( #25478 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
faa58fa791
[Compile] Fix AMD Compile Error ( #25518 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
4ed6b67da3
[Core] Support weight_loader_v2 for UnquantizedLinearMethod ( #23036 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
cb825af948
[Bugfix] Use a separate FlashInfer workspace buffer for trtllm-gen ( #25520 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
342d17fb7f
[V1][Metrics] Add per-request TPOT histogram ( #24015 )
...
Signed-off-by: baxingpiaochong <771405853@qq.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
3c62d28bb9
[Model] Support SeedOss Reason Parser ( #24263 )
...
Signed-off-by: Yan Lu <luyan@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
9596fbd6e5
[BUG] Allows for RunAI Streamer and Torch.compile cache to be used together ( #24922 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
03585bc79d
[Bug] Fix AttributeError: 'FusedMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv' ( #25519 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
770cb2e1f8
Add CUTLASS FP8 MOE benchmark scripts and kernel config ( #25302 )
...
Signed-off-by: Chenxi Yang <cxyang@fb.com >
Co-authored-by: Chenxi Yang <cxyang@fb.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
b50fa00537
Improve output when failing json.loads() on structured output test ( #25483 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
8e6a5e7dd4
[BugFix] AssertionError: Do not capture num_reqs > max_num_reqs for uniform batch ( #25505 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
faae7a7eab
[Bugfix] [B200] cutlass_mla - ensure kv_split == 1 for batch size > 1 ( #25509 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
d562c2ea09
[Perf] Increase default max splits for FA3 full cudagraphs ( #25495 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
81ee45298d
[ROCm] Small functional changes for gptoss ( #25201 )
...
Signed-off-by: jpvillam <jpvillam@amd.com >
Co-authored-by: jpvillam <jpvillam@amd.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
d12433adfc
[Kernel] [Mamba] Remove BLOCK_H=1 from list of tuneable configurations for _chunk_cumsum_fwd_kernel ( #25197 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Chih-Chieh-Yang <chih.chieh.yang@ibm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
4ebc513fc1
Add VLLM_NVTX_SCOPES_FOR_PROFILING=1 to enable nvtx.annotate scopes ( #25501 )
...
Signed-off-by: Corey Lowman <clowman1993@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
7a8f0a3548
[BugFix] Fix OOM in vLLM replicas by ensuring consistent NCCL memory accounting ( #25359 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
907bbca7b7
Remove redundant mutates_args and dispatch_key for direct_register_custom_op ( #25512 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
eb1f43bc82
[gpt-oss][bugfix] remove logic to require resp_ in ResponseAPI ( #25428 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
99eaeebe66
Fix triton_reshape_and_cache_flash.py triton import ( #25522 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
715e24e1b3
Add VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE & VLLM_ENABLE_INDUCTOR_COORDINA… ( #25493 )
...
Signed-off-by: rouchenzi <ruochenwen@gmail.com >
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
cf0e250200
[V0 Deprecation] Remove placeholder attn ( #25510 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
0c11617ff1
[Core] Use KVCacheBlock as much as possible instead of dict[block_id, KVCacheBlock] ( #24830 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
930e691c65
[CI/Build] Fix and re-enable v1 PP test on CI ( #25496 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
c0f11557e1
[Bugfix] Fix for the import error from #24588 ( #25481 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
0438c65376
[Build] Update Xgrammar to 0.1.25 ( #25467 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
d8fda7420a
[Bugfix] gpt-oss container tool output bug ( #25485 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Co-authored-by: Alec Solder <alecs@fb.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
86e5b73d71
[CI] Fix Pre-commit Issue ( #25497 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
e49561cd91
Enable symmetric memory all reduce by default only enabling for TP ( #25070 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
0e30643147
[Bugfix] Lower gpt-oss max cudagraph size to 992 to be compatible with FA3 ( #25508 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
8ba3b17cc1
[Speculators][Speculative Decoding] Fix gpt-oss eagle3 accuracy issue ( #25406 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
8222e2651d
[Perf] Change default CUDAGraphMode from PIECEWISE to FULL_AND_PIECEWISE ( #25444 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
b672b8c3b8
[Performance] Move apply_w8a8_block_fp8_linear to an op class ( #24666 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: ElizaWszola <elizaw.9289@gmail.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
56201cfb01
[core] add nccl symmetric memory for all reduce ( #24532 )
...
Signed-off-by: Amir Samani <asamani@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
9689be1e8e
[ROCm] Add skinny gemm bias support for dtypes fp16,bf16,fp8 ( #24988 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
Signed-off-by: Hashem Hashemi <159079214+amd-hhashemi@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
65c4513ad8
[Core] Ensure LoRA linear respect the base_layer's tp_size and tp_rank ( #25487 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
5acda4cc71
[Spec Decode][CI] Add e2e test for examples/spec_decode.py and prevent breaking Acceptance Length ( #24531 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
78f892c373
[Misc] Reduce initialization time of auto_tune ( #23682 )
...
Signed-off-by: Weida Hong <wdhongtw@google.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
26da2c6244
[V1][Kernel] Add triton implementation for reshape_and_cache_flash ( #24503 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
0081c6956a
Use macro guard CUDA functions for back compatibility in grouped_topk_kernel.cu ( #25346 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: Rahul Tuli <rtuli@redhat.com >
Co-authored-by: Rahul Tuli <rtuli@redhat.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
6462feef65
[Log] Optimize kv cache memory log from Bytes to GiB ( #25204 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
e9a74500e5
[BugFix] Fix UB in per_token_group_quant.cu ( #24913 )
...
Signed-off-by: Shreeasish Kumar <shreeasish@rivosinc.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
02a3ce2230
[Kernels] Support blocked fp8 quantization for compressed tensors MoE ( #25219 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
9cae377a16
Add backward compatibility for GuidedDecodingParams ( #25422 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
8c5c35c027
[Core/DBO][2/N] Dual-Batch Overlap add DeepEP High Throughput support and Prefill support ( #24845 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
f97da2c732
[V1] Remove V0 code paths for Hybrid models ( #25400 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
02134245a9
[UX] Change kv-cache-memory log level to debug ( #25479 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
2ab27b70f5
[XPU] Fix MOE DP accuracy issue on XPU ( #25465 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
a500f7cc09
[Docs] NixlConnector quickstart guide ( #24249 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: Peter Pan <peter.pan@daocloud.io >
Signed-off-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
1b75f784b8
[P/D] Support NIXL connector to disconnect during a clean shutdown ( #24423 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
0eddd2b528
[BugFix] Register expert_map as named buffer for wake_up and sleep ( #25458 )
...
Signed-off-by: wuxibin <wuxibin@bytedance.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
030774abcf
[CI/Build] Fix disabled v1 attention backend selection test ( #25471 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
77389d87b2
[docs] Benchmark Serving Incorrect Arg ( #25474 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
59659b74c4
[Core] Optimize LoRA weight loading ( #25403 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
3b96eafdb0
[Bugfix] Fix idefics3 tie_word_embeddings ( #25454 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
fb64e67533
[Test]: Hermes tool parser stream output error in Qwen3 case ( #25203 )
...
Signed-off-by: Andreas Hartel <andreas.hartel@aleph-alpha.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
215da8510d
[Misc] Move DP for ViT code inside model executor dir ( #25459 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
c4a15ee240
[Frontend] Add a new xml-based tool parser for qwen3-coder ( #25028 )
...
Signed-off-by: Zhikaiiii <1658973216@qq.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
3a640b8f74
Handle triton kernel import exception ( #25319 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
0a1397c7df
[Model] Enable DP for ViT in Qwen2-VL ( #25445 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
921945c81e
[NIXL][OOT platform] support nixl_connector with oot platform and other nixl_backend ( #25121 )
...
Signed-off-by: Chendi Xue <Chendi.Xue@intel.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
675fc471bf
[DP/EP][GPTOSS] Use triton matmul-ogs kernels for GPTOSS DP/EP ( #24588 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
b0ae0ad935
[Docs] Fix griffe warnings in vllm/lora/ops ( #25369 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
e99b286f01
[Bugfix] Remove contiguous output req for context parallel MLA ( #25414 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
23a7805022
[benchmarks]allow skip ready check for bench serve ( #25420 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: Lucia Fang <116399278+luccafong@users.noreply.github.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
e3a3c738b0
[XPU] Fix compile_size is None case. ( #25433 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
e41946ecdb
[feat] Support MRoPE + YaRN ( #25384 )
...
Signed-off-by: liuye.hj <liuye.hj@alibaba-inc.com >
Co-authored-by: liuye.hj <liuye.hj@alibaba-inc.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
f071a31ede
[Bug] Fix Long Context OOM Issue ( #25290 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
1b30043f0d
[V0 deprecation] Remove _set_default_args_v0 function ( #25409 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
a0b5617263
[V0 deprecation] Remove platform v1 controlling interface ( #25410 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
e6c22d2b2f
[Perf] Apply torch.compile for per_block_cast_to_fp8 ( #24611 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
dbb029cfe1
[Performance] Remove input pads in cutlass_mla and optimize v_proj output handling ( #25184 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
25dd155e60
[BugFix] [DP/EP] Fix slow execution when BS <= DP ( #25407 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Chris Bamford <chrisbam4d@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
864bbe36f0
[Bugfix] Fix missing clear_connector_metadata ( #25397 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
e97cf2e32b
[Core] Drop overly aggressive whisper assertion ( #25408 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
d96a3fc653
[Bugfix] fix custom op test ( #25429 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:54 -07:00
aac85cc6d6
[Frontend] Responses API MCP tools for built in tools and to pass through headers ( #24628 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Signed-off-by: Alec S <10566873+alecsolder@users.noreply.github.com >
Co-authored-by: Alec Solder <alecs@fb.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
f1e3d031e4
[TPU] update torch_xla dependency for PyPI compatibility ( #25278 )
...
Signed-off-by: Johnny Yang <johnnyyang@google.com >
Co-authored-by: Chengji Yao <chengjiyao@google.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
6e9229e919
[CI/Build] Skip Qwen3-VL initialization tests until models are actually released ( #25394 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
ff54b6bfe3
[KV offload][5/N] Add CPUOffloadingSpec ( #24251 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
6dbbecd5b2
[torch.compile] Cleanup compilation tests and custom passes, add debug utils, fix DCE bug ( #23091 ), fix test ( #24376 ), and prep for custom op matching ( #24604 ) ( #24542 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: luka <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
6850bfe15c
[misc] Remove RFC review hours reference ( #25416 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
d988b84e8e
[DP] support torchrun external launcher with Data Parallelism ( #24899 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
7337ec6c9f
[CI Failure] Fix fp8 kv cache on <SM90 ( #25396 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
90ba32a0bf
[Compiler] Disable Inductor standalone compile by default ( #25391 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
2a8bd2b93b
[CLI env var] Add VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH in env variables ( #25274 )
...
Signed-off-by: qqma <qqma@amazon.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: qqma <qqma@amazon.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
3968ae72ed
[EPLB] Reduce EPLB Inference Overhead ( #24573 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
e55ffe3595
[V1][Attention] Split triton_attn in triton-only and rocm specific backends ( #24648 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
4057e2b162
[Bugfix] Fix several issues with p2p xPyD in GET type ( #23993 )
...
Signed-off-by: Csrayz <jover@cmbchina.com >
Signed-off-by: ivyilike <pww123@cmbchina.com >
Co-authored-by: ivyilike <pww123@cmbchina.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
cc494282a9
[Kernel] MI-300X triton moe configs ( #23445 )
...
Signed-off-by: Sara Kokkila Schumacher <saraks@ibm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
44be2b7349
Make mypy behave like a proper pre-commit hook ( #25313 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
104e62fbc8
Make pickle import check fast ( #25379 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
ddf4e1f56f
[Misc] Remove unused encoder-decoder error strings ( #25374 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
cbba9bd0b0
refactor: abstract graph mode support into platform interface ( #25161 )
...
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
4bc6b5d2c3
[TPU] Deprecate xm.mark_step in favor of `torch_xla.sync` ( #25254 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
8d8de42790
[TPU][Bugfix][CI] Fix broken tests/build dependency ( #25255 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
ef85a438da
Enable Eagle3 speculative decoding for GPT-OSS model ( #25246 )
...
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
2f237d3df4
[V0 Deprecation] Remove MultiModalPlaceholderMap ( #25366 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
243c358fa8
[V0 Deprecation] Remove V0-only methods in multi-modal registry ( #25362 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
1b3aa0f297
[Bugfix] Fix hermes tool parser handling of non-string argument types ( #22002 )
...
Signed-off-by: wangzi <3220100013@zju.edu.cn >
Signed-off-by: David Chen <530634352@qq.com >
Co-authored-by: wangzi <3220100013@zju.edu.cn >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
dba6db9937
[Docs] GSM8K Accuracy Evaluation doc update ( #25360 )
...
Signed-off-by: David Chen <530634352@qq.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
5322390f1d
[Model] Support Dots OCR ( #24645 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: yinz-aizip <yinz@aizip.ai >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
5f6a36054a
Multimodal - audio tests ( #25285 )
...
Signed-off-by: Debolina Roy <debroy@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
e348e1027c
[Bugfix][V0 Deprecation][CI] use async mock and await for async method ( #25325 )
...
Signed-off-by: Yang <lymailforjob@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
a815d820ee
Remove V0 attention backends ( #25351 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
319966a678
[Perf] Further optimization for Qwen3-VL fast_pos_embed_interpolate ( #25347 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
b81364a7cd
[V0 Deprecation] Remove V0 sampling metadata ( #25345 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
791089df20
feat: Enable engine-level arguments with speculators models ( #25250 )
...
Signed-off-by: Rahul Tuli <rtuli@redhat.com >
Co-authored-by: Claude <noreply@anthropic.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
71f2b5ddea
[V0 Deprecation] Remove async_output_proc, preemption mode, delay factor ( #25334 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
81e17a1e26
[V0 Deprecation] Remove V0 Sequence class & Sampler ( #25332 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
ed84bda7a5
fix cub helpers
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
c7b1c0cf8b
fix cub_helpers
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
a31d353b71
[Optimization] Cache chat template result when processor fails to be loaded ( #25341 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
80cad257da
[Bugfix] Typos in error message for missing model config file ( #25339 )
...
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
5fd95c77af
[MM][Perf] Minor Optimization on Qwen3-VL fast_pos_embed_interpolate ( #25337 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
f6278e3065
[V1] Add sliding window support to Flex Attention backend ( #24089 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
9e9b3b4ff9
[V0 Deprecation] Remove V0 MP executor ( #25329 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
20235c1822
[V0 Deprecation] Remove from_seq_group methods ( #25330 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
059a13a3bc
[Multi Modal][Performance] Fused Q,K's apply_rope in more models ( #25005 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
a6cf307fa8
[V0 Deprecation] Remove V0 model runner base & simplify worker base ( #25328 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
b18dde7478
[Doc] improve test-pipeline.yaml documentation ( #25305 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
7cdd90211b
[V0 Deprecation] Remove V0 core ( #25321 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
86fdd686be
[CI] Skip tests failing on main ( #25326 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
171592330b
[Chore] Remove unused sampler in models ( #25324 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
4bb2eb42d4
[V0 Deprecation] Remove V0 Output Processor ( #25320 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
32d43a5a9e
[V0 Deprecation] Remove LLMEngine ( #25033 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
d9ba479eee
[Docs] Fix warnings in vllm/profiler and vllm/transformers_utils ( #25220 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
9cfa7697c1
[V0 Deprecation] Enable the remaining multimodal tests in V1 ( #25307 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
9fc86d2802
[Core] Enable sharded state loader for V1 engine and enhance test coverage ( #25308 )
...
Signed-off-by: pengdrumli <pengdrumli@tencent.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
bc76128565
[Model] Cleanup InternViT's data parallel implementation ( #25306 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
af4dedf6d3
Generate _ModelInfo properties file when loading to improve loading speed ( #23558 )
...
Signed-off-by: Manoel Marques <manoel.marques@ibm.com >
Signed-off-by: Manoel Marques <manoelmrqs@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
dad5f4d16d
[Docs] Fix warnings in mkdocs build (continued) ( #25042 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
c2fdc71c91
[CI Failure] Disable FlashInfer RoPE to unblock CI ( #25299 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
e33af1e0c2
[V1] Support LLM.apply_model ( #18465 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
0ac65d171b
[Bugfix] Fix Qwen3-VL-MoE weight loading for EP ( #25300 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
267b4421b7
[Hybrid Allocator] Support full attention with different hidden size ( #25101 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
8f3edbd93f
[Optimization] Avoid repeated model architecture conversion for pooling models ( #25261 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
239aef5c9f
[Bugfix] fix tool call arguments is empty ( #25223 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: xin.li <xin.li@daocloud.io >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
9d70c103aa
[BUG FIX][NON-CUDA]quick fix to avoid call cudagraph_unsafe in attention ( #25298 )
...
Signed-off-by: Chendi Xue <Chendi.Xue@intel.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
d897924b45
[BugFix] Exclude self when checking for port collision ( #25286 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
b7c986673d
[BUGFIX] GPTQ quantization compatibility for Qwen3 Next MOE models (AutoGPTQ and AutoRound-GPTQ) ( #25268 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
14e1e9b09a
Improve weight loading for encoder models in Transformers backend ( #25289 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
ea01b17b6f
[Misc] Support more collective_rpc return types ( #25294 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
123e7ad492
[BugFix] Ensure appropriate guards in destructors ( #25284 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
ce65ce2d61
[torch.compile] CUDAGraph Inductor partition integration ( #24281 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
Signed-off-by: Boyuan Feng <fby.1994@gmail.com >
Signed-off-by: boyuanfeng <boyuan@meta.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
d4006bd84d
[docs] Prompt Embedding feature support ( #25288 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
7493472a9b
test: Remove vestigial skip for prompt embeds tests after landing v1 Prompt Embeds support ( #25291 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
937ab7e85e
Don't skip special tokens with hermes-style tool calling ( #25281 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
bc997c18ca
[Bugfix] Remove VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE #2969 ( #25090 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
d55c6010ac
[BugFix] Fix async scheduling CPU tensor race take 2 ( #25279 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
5051270200
allow disable flashinfer prefill ( #25276 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
6e94161f94
Enable modelopt gemma3 nvfp4/fp8, make workflow more robust ( #22771 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
e54a476058
[Compile] Fix Compile Warning for Ignoring MIN_BLOCK_PER_SM ( #25193 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
8da7b98366
[Frontend] Responses API messages out, just harmony for now ( #24985 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Co-authored-by: Alec Solder <alecs@fb.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
9da51c77a9
Fix: Correct FusedMoE layer reference in auto_round quantization ( #24818 )
...
Signed-off-by: David-Wen <18927700430@163.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
d0a1364188
[BugFix] Make FlashInferMetadataBuilder non-blocking ( #25040 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
2c3ba7362f
[Perf] Use FlashInfer RoPE for RotaryEmbedding.forward_cuda when available ( #21126 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:53 -07:00
bfd32678e6
Specify platform in pip-compile pre-commit hook so it runs on MacOS ( #25273 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:52 -07:00
e29f599d30
[Bugfix] Fix chunked a2_scales in modular kernels ( #25264 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:52 -07:00
b6724e95f8
[Bugfix] GPT OSS Attribute error on H100 ( #25228 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:52 -07:00
17b9f3a83d
Optimize triton unified attention performance for sliding window attention ( #24390 )
...
Signed-off-by: zixi-qi <qizixi@meta.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:52 -07:00
378c68bead
[KV offload][4/N] Offloading KV connector ( #22595 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:52 -07:00
67f0418b1d
[bugfix] fix structured outputs key missing issue from #24929 ( #25195 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:52 -07:00
779ed75310
[Docs] add __init__.py to vllm/model_executor/layers/quantization/compressed_tensors/transform ( #24974 )
...
Signed-off-by: samzong <samzong.lu@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:52 -07:00
abb448b457
Update vllm/model_executor/layers/quantization/kernels/scaled_mm/cutlass.py
...
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:52 -07:00
ae36150ec2
test
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-03 13:35:52 -07:00
2506ce5189
[Core][Prefix Hash] Fix prefix hash metrics sliding window maintenance ( #24990 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-09-19 12:22:53 -06:00
47fd08aaf9
[CI/Build] fix test function_calling ( #25072 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-19 12:16:32 -06:00
12aed7e453
Encoder model support for the Transformers backend ( #25174 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 19:15:22 +01:00
d90e212a3a
Remove Redundant Assignment in Qwen3_VisionPatchMerger ( #25224 )
...
Signed-off-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-19 12:15:13 -06:00
2821986450
[Core] Modify the initialization parameters of the lora manager ( #25249 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-19 18:01:28 +00:00
6c117cff7d
[Frontend] Pass API server count to each process ( #23717 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-20 01:15:19 +08:00
7ac67ea525
[KV offload][3/N] Add worker-side CPU support ( #21448 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-19 09:53:45 -07:00
ce75e15373
refactor(benchmarks): add type annotations to wait_for_endpoint parameters ( #25218 )
...
Signed-off-by: samzong <samzong.lu@gmail.com >
2025-09-19 16:36:52 +00:00
aed16879a9
Move ModelConfig from config/__init__.py to config/model.py ( #25252 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 16:22:33 +00:00
cf278ff3b2
Update CODEOWNERS ( #25269 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 09:12:55 -07:00
838d7116ba
[Qwen] Remove cuda hard-code in qwen3 next ( #25243 )
...
Signed-off-by: Icey <1790571317@qq.com >
2025-09-19 12:25:12 +00:00
5089fd749c
[V0 Deprecation] Remove V0 logic from get_input_embeddings interface ( #25242 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-19 11:10:52 +00:00
a3d087adec
[P/D][Nixl] Introduce KVTransferMetrics and aggregation strategy ( #22188 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-19 11:09:14 +00:00
058525b997
Move PoolerConfig from config/__init__.py to config/pooler.py ( #25181 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 11:02:55 +00:00
1dfea5f4a9
[Bugfix][Perf] Misc fixes for Qwen3 VL ( #25238 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-19 10:46:16 +00:00
cea91a32f2
[Kernel][Performance] Add Triton kernel for Qwen3-VL interleaved MRoPE ( #25055 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-19 10:27:49 +00:00
a684c0124c
[bugfix] fix MHA for models like OpenGVLab/InternVL3_5-38B ( #25146 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-19 08:45:06 +00:00
f2718d2948
[Misc] Cleanup test conftest for deprecated encoder-decoder models ( #25231 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-19 07:44:56 +00:00
825fdb11ad
[Bugfix][CPU] Add placeholder to avoid import errors when using fused_moe ops on platforms without triton ( #25137 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-19 07:41:12 +00:00
8c1d4acbfe
[CPU] Disable oneDNN linear on non-x86 platforms ( #25166 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-19 07:27:22 +00:00
486c5599e3
[Build] Update Xgrammar to 0.1.24 to get a CVE fix ( #25188 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-19 14:27:17 +08:00
a6149aa587
[OOT] Support sync_model_loading for OOT ( #25126 )
...
Signed-off-by: Chendi Xue <Chendi.Xue@intel.com >
2025-09-19 05:41:53 +00:00
6c8a3c099b
[Docs] Fix griffe warnings in vllm/multimodal ( #25216 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-18 22:10:44 -07:00
31a8a2a7bc
[Misc] Clean up MM profiling warnings ( #25222 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-19 04:46:57 +00:00
1a0a04dae9
[Perf] Optimize memory peak during EAGLE model loading. ( #24585 )
...
Signed-off-by: Chen Ding <candy.dc@alibaba-inc.com >
2025-09-19 03:31:16 +00:00
6d8246aaff
[gpt-oss] Add ResponseReasoningPartAddedEvent, ResponseReasoningPartDoneEvent for streaming ( #24938 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-18 19:11:59 -07:00
9d1c50a5ac
[KV offload][2/N] Introduce LRU-based CPU offloading management ( #20075 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-19 00:20:51 +00:00
9a4600e4dc
[CORE] Prompt Embeddings Support for v1 Engine ( #24278 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: Andrew Sansom <qthequartermasterman@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-19 08:03:09 +08:00
9fac6aa30b
[BugFix] Fix DeepGEMM warmup, no m.weight_scale_inv ( #25206 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-18 14:26:28 -07:00
a53ad626d6
[KV offload][1b/N] rename offloading to kv_offload ( #25191 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-18 20:53:52 +00:00
1c3dad22ff
[V0 Deprecation] Remove unused async_timeout.py ( #25190 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-18 20:35:21 +00:00
d2a30a2d93
[Bug] Fix torch Compilation Cache Hit Error ( #25093 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-18 12:38:37 -07:00
75fb112d80
[Bug] Fix returned_lse not Defined issue ( #25106 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-18 19:32:24 +00:00
38db529f66
[feat]: Create interface for model-specific M-RoPE ( #24194 )
...
Signed-off-by: AzizCode92 <azizbenothman76@gmail.com >
Signed-off-by: Aziz <azizbenothman76@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-18 19:18:56 +00:00
064cac7bb7
[fix]: remove data type hardcoding from gptoss model implementation ( #23807 )
...
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com >
2025-09-18 18:15:23 +00:00
e19bce40a1
[V0 Deprecation] Remove AsyncLLMEngine ( #25025 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-18 11:07:42 -07:00
505805b645
[KV offload][1/N] Introduce an offloading component ( #19848 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-18 10:57:07 -07:00
bbdc0f2366
[ROCm][AITER][Bugfix] Switch AITER to use PIECEWISE_AND_FULL compilation ( #25104 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2025-09-18 17:46:47 +00:00
dc34059360
[ROCm][CI/Build] Use ROCm7.0 as the base ( #25178 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-18 09:36:55 -07:00
c4cb0af98a
[spec decode] Fix MTP inference path for MiMo-7B model ( #25136 )
...
Signed-off-by: zixi-qi <qizixi@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-18 09:12:19 -07:00
1c3b1634aa
[Misc] Add codeowner for Transformers backend ( #25180 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 09:01:50 -07:00
2ea50e977a
Enable Allgather/ReduceScatter backend for NaiveAllToAll ( #23964 )
...
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: Shu Wang <shuw@nvidia.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-18 15:52:58 +00:00
b419937c78
[Docs] Fix warnings in mkdocs build (continued) ( #25163 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-09-18 08:23:26 -07:00
5f696c33b1
[New Model] Support BertForTokenClassification / Named Entity Recognition (NER) task ( #24872 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-18 23:22:01 +08:00
67244c86f0
feat(api): Return 503 on /health when engine is dead ( #24897 )
...
Signed-off-by: dongbo910220 <1275604947@qq.com >
Co-authored-by: Claude <noreply@anthropic.com >
2025-09-18 14:29:40 +00:00
072d7e53e5
[PERF] Add conv1d metadata to GDN attn ( #25105 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2025-09-18 14:27:49 +00:00
01a583fea4
[Kernel] Decouple Tile Size from Block Size in Triton Unified Attention Kernel ( #21197 )
...
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com >
2025-09-18 14:27:01 +00:00
bc19d75985
[Misc] Add kv-connector label ( #25156 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-18 13:56:07 +00:00
fbd6523ac0
Refactor dense FP8 tensor/channel/block utils and add CT FP8 block ( #21404 )
2025-09-18 08:53:45 -04:00
470484a4f5
[Structured Output][Refactor] Move apply_grammar_bitmask() method from ModelRunner to structured output utils ( #21999 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-09-18 20:44:31 +08:00
21da73343a
[Misc] Clean up flags in vllm bench serve ( #25138 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-18 12:43:33 +00:00
66072b36db
[Bugfix][Mamba] - Fix Conv State Kernel FP32 Support ( #24883 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
2025-09-18 12:21:17 +00:00
3ed1ec4af2
Fix validate-config pre-commit check ( #25157 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 12:06:28 +00:00
5a33ae9a3f
Fix forward reference warning in documentation ( #25150 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 11:41:41 +00:00
c9ff9e6f0c
[Docs] add the parallel sampling usage in LLMEngine and AsyncLLM ( #24222 )
2025-09-18 04:37:08 -07:00
eaffe4486c
[Docs] Fix pooling-params doc references in openai_compatible_server.md ( #24939 )
2025-09-18 04:36:47 -07:00
8ed039d527
Move StructuredOutputsConfig from config/__init__.py to config/structured_outputs.py ( #25153 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 11:24:27 +00:00
37970105fe
[Model] Improve Pooling Model ( #25149 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-18 11:04:21 +00:00
cc935fdd7e
[Frontend] Support setting logprobs to -1 ( #25031 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-18 10:34:42 +00:00
abdfcd4f3d
silu-v1: Fix EPS not being used during max-reduction ( #25069 )
...
Signed-off-by: elvircrn <elvircrn@gmail.com >
2025-09-18 10:25:12 +00:00
4f02b77de4
Fix: Add explicit #include <omp.h> for OpenMP compatibility on certain toolchains ( #24951 )
...
Signed-off-by: lyd1992 <liuyudong@iscas.ac.cn >
Signed-off-by: ihb2032 <1355790728@qq.com >
2025-09-18 17:43:23 +08:00
29283e8976
[Chore] Cleanup guided namespace, move to structured outputs config ( #22772 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 09:20:27 +00:00
05b044e698
[Doc] Fix cross-reference warnings ( #25058 )
...
Signed-off-by: Punit Vara <punitvara@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 02:05:16 -07:00
aa3f105c59
Add 'path' option to ImagePrompt data_format ( #25081 )
...
Signed-off-by: Gerard Finol <gerard.finol@urv.cat >
2025-09-18 02:02:14 -07:00
ef7eefe17a
[Qwen] Add fp8 checkpoint support for qwen3-next. ( #25079 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-09-18 08:16:04 +00:00
350c94deb3
[Bugfix] when use s3 model cannot use default load_format ( #24435 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-18 07:47:43 +00:00
f4cd80f944
Retrieve sliding_window from text config in Gemma3 MM ( #25085 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 06:29:05 +00:00
349e0e3462
[Docs] Fix API Reference ( #25140 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-17 23:23:29 -07:00
81b16a2bc9
[Kernel] Better inf handling for grouped topk cu ( #24886 )
...
Signed-off-by: lumina37 <starry.qvq@gmail.com >
2025-09-18 05:53:55 +00:00
e111d5b0ae
[CLI] Use streaming in CLI chat and completion commands ( #23769 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-17 22:30:26 -07:00
a904ea78ea
[benchmark] add peak throughput metrics and plot ( #23867 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-17 22:30:02 -07:00
b7433ca1a4
[Spec Decode] Efficient padded speculation ( #24539 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2025-09-18 01:07:24 -04:00
5c65a72bb1
[V0 Deprecation] Remove more V0 tests ( #25117 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 22:05:25 -07:00
9d8a2d86d2
[EPLB] Add EPLB support for hunyuan_v1 ( #23078 )
2025-09-18 04:51:35 +00:00
3bc18127ff
[XPU] Whisper model support on XPU Platform ( #25123 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-09-18 04:30:10 +00:00
bec060fd99
Mark prompt logprobs as incompatible with prompt embeds at API level ( #25077 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-17 21:25:07 -07:00
52bc9d5b3e
[Model] enable data parallel for InternVL vision encoder ( #23909 )
...
Signed-off-by: Yiwen Chen <yiwen66@berkeley.edu >
Signed-off-by: YiwenC <54658925+666even666@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-17 21:11:46 -07:00
dc2979c585
[Kernels] Overlap shared experts with combine instead of dispatch ( #24254 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-18 12:10:21 +08:00
027d37df38
[Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models ( #24960 )
...
Signed-off-by: toncao <cpatonn@gmail.com >
Co-authored-by: toncao <cpatonn@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-18 12:08:50 +08:00
b98219670f
[Core][MM] Cleanup MultiModalCache ( #25006 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-17 21:08:41 -07:00
32baf1d036
[Docs] Clean up the contributing README ( #25099 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-17 21:05:18 -07:00
3127274d02
[MM Encoder] Apply DP ViT for Qwen3-VL model series ( #24955 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Huang Jie <92386084+JJJYmmm@users.noreply.github.com >
Co-authored-by: 松灵 <26085463+wulipc@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-17 21:04:21 -07:00
4ac510f484
[Kernels] Enable DeepGEMM by default ( #24462 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-17 20:19:52 -07:00
7fb2a5be28
[V0 Deprecation] Skip PP test ( #25128 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 20:18:36 -07:00
6c036615dc
[V0 Deprecation] Remove misc V0 tests ( #25118 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 19:41:55 -07:00
2fc24e94f9
[V0 Deprecation] Remove V0 Tracing & Metrics tests ( #25115 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 19:40:44 -07:00
2c3c1bd07a
[V0 Deprecation] Remove V0 Engine tests ( #25114 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 19:38:09 -07:00
5963b98b46
[Kernel] Delegate construction of FusedMoEQuantConfig to FusedMoEMethodBase subclasses ( #22537 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-17 17:43:31 -06:00
e6585ddb45
[Bugfix] Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel ( #24833 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-17 16:37:23 -07:00
2a4d6412e6
Add a batched auto tune script ( #25076 )
...
Signed-off-by: Karan Goel <karangoel@google.com >
Signed-off-by: Karan Goel <3261985+karan@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-17 22:41:18 +00:00
e67a79db03
[Bugfix] Refactor Flashinfer TRTLLM attention kernel selection logic ( #24600 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-17 15:36:29 -07:00
9f882d8791
Disable failing GPT-OSS Eval (Blackwell) for now ( #25107 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-17 15:36:00 -07:00
1a456c7c90
Aiter mha fp8 fix ( #24991 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
2025-09-17 22:29:14 +00:00
fedb75fa27
[Bugfix][B200] Fix cutlass_mla hang ( #24966 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-17 18:06:38 -04:00
bff2e5f1d6
[gpt-oss][2] fix types for streaming ( #24556 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-17 22:04:28 +00:00
3c068c637b
[Kernel] Faster pre-processing time for W4A8 ( #23972 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-09-17 14:35:32 -07:00
f20c3b0951
[BUG] Exclude .pth files when pulling remote files ( #25092 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2025-09-17 20:42:09 +00:00
883131544f
[Bugfix] Update import path for bc_linter_include ( #24766 )
...
Signed-off-by: Mohammad Miadh Angkad <mangkad.bsdsba2027@aim.edu >
2025-09-17 20:33:11 +00:00
ee5fd49150
[Misc] Update owners for KV connector and V1 offloading ( #25041 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
2025-09-17 12:37:29 -07:00
7ae9887542
[V1] Logits processor docs ( #22919 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
Signed-off-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
Co-authored-by: Joseph Marinier <Joseph.Marinier@gmail.com >
2025-09-17 11:53:12 -07:00
e3db5ebb66
[CI Bugfix] Fix failing test_model_load_with_params tests due to tokenizer refactor ( #25086 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-17 11:15:05 -07:00
9d442b7c48
[V0 Deprecation] Remove V0 tests in test_sequence.py ( #25088 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 11:08:45 -07:00
eb68c2dcd9
[CI] Revert back prepare_prompts and check_answers ( #25087 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 11:03:16 -07:00
8b32464ac1
Change log level from info to debug for IOProcessor ( #24999 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-09-17 10:21:28 -07:00
99cc41ad50
[V0 Deprecation] Remove unused output processor util ( #25023 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-17 09:50:07 -07:00
d6a518fdde
Remove unused find_cuda_init helper script ( #25044 )
2025-09-17 09:47:40 -07:00
4aa8c7b047
cleanup: remove adapter commons ( #25045 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-17 16:46:29 +00:00
4b946d693e
[V0 Deprecation] Remove V0 Core tests ( #25082 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 09:32:42 -07:00
087c6ffc92
[CI Bugfix] Fix failing test_invalid_env ( #25078 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-17 08:28:58 -07:00
4a2d33e371
[Docs] vllm/benchmarks/datasets.py fix docstring param format. ( #24970 )
...
Signed-off-by: samzong <samzong.lu@gmail.com >
2025-09-17 08:11:51 -07:00
8f3616f422
Remove old cutlass mla ( #23961 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-17 14:31:43 +00:00
47f670b03b
[Docs] improve code formatting and comments to eliminate griffe build warning. ( #25010 )
...
Signed-off-by: samzong <samzong.lu@gmail.com >
2025-09-17 07:31:20 -07:00
dd6a910aac
[Bugfix][Qwen3-Next] fixes the varlen issue in qwen3-next's MTP implementation. ( #24957 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-09-17 21:59:09 +08:00
1b962e2457
[fix] lora benchmarks pass no_lora_flag_cpu ( #23774 )
...
Signed-off-by: Dylan Maloy <34420038+dolpm@users.noreply.github.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-17 21:22:25 +08:00
bfe9380161
Apply fixes for CUDA 13 ( #24599 )
...
Signed-off-by: Aidyn-A <aidyn.b.aitzhan@gmail.com >
2025-09-17 09:15:42 -04:00
9fccd04e30
[Bugfix] Fix Stream usage in CPU model runner and OneDNN kernel check ( #25046 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-17 05:54:02 -07:00
252ada5559
Add RADIO Vision Encoder Support to vLLM ( #24595 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com >
Co-authored-by: root <root@cw-dfw-h100-001-305-026.cm.cluster >
2025-09-17 05:53:30 -07:00
e120533d7a
[Misc] Avoid use of deprecated AutoModelForVision2Seq ( #25065 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-17 12:19:15 +00:00
2b85697031
[BugFix] enable DOTALL to match multi-line tool_call parameters in extract_tool_call_required_streaming ( #24668 )
...
Signed-off-by: Shijun Yin <shijun.yin@outlook.com >
2025-09-17 09:21:18 +00:00
544fe76b95
[Frontend] Support returning all prompt logprobs ( #24956 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-17 09:03:52 +00:00
bb58dc8c20
[DP] Create placement groups by ray_device_key ( #25026 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-17 08:57:25 +00:00
0fb2551c23
[Docs] Fix griffe warning in base_static_graph.py ( #25018 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-17 08:49:19 +00:00
6c47f6bfa4
[Core] Remove tokenizer group in vLLM ( #24078 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-17 08:42:59 +00:00
c15309a730
[Model] Apply SharedFusedMoE to glm4_moe. ( #24849 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2025-09-17 16:02:31 +08:00
4a9375fe9d
[Model] Pass param prefix to LLMHead ( #24862 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2025-09-17 16:01:27 +08:00
03191cd8f0
[Core][MultiModalHasher] Hash images without converting image mode ( #24969 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-17 00:57:34 -07:00
b77bf34e53
[EPLB] Support EPLB for Mixtral Model ( #22842 )
...
Signed-off-by: rouchenzi <ruochenwen@gmail.com >
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com >
Co-authored-by: Bowen Wang <abmfy@icloud.com >
2025-09-17 07:27:34 +00:00
dd39baf717
[XPU] Fix xpu model runner call torch.cuda APIs ( #25011 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-17 06:45:25 +00:00
43a62c51be
Add more documentation and improve usability of lognormal dist (benchmark_serving_multi_turn) ( #23255 )
...
Signed-off-by: daniels <daniels@pliops.com >
2025-09-17 05:53:17 +00:00
ca2d1925ef
[Rocm] [quantization] Fix quark ptpc moe and add test case ( #24649 )
...
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com >
Co-authored-by: Haoyang Li <haoyang.li@amd.com >
2025-09-16 22:15:13 -07:00
0f7acdd73c
[Model] Support Qwen3-VL Model Series ( #24727 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Huang Jie <92386084+JJJYmmm@users.noreply.github.com >
Co-authored-by: 松灵 <26085463+wulipc@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-17 05:01:04 +00:00
5801e49776
[V0 Deprecation] Remove MQLLMEngine ( #25019 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-16 21:29:27 -07:00
58d4c705a8
[Core] Get num_encoder_tokens from scheduler config ( #24989 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-16 20:59:07 -07:00
ea3de5ef0d
[misc] fix typo in value error ( #24995 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2025-09-16 20:58:38 -07:00
67532a1a68
[UX] Remove "quantization is not fully optimized yet" log ( #25012 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-16 20:57:51 -07:00
5672ba90bd
[Docs] fix invalid doc link ( #25017 )
...
Signed-off-by: zxw <1020938856@qq.com >
2025-09-16 20:53:23 -07:00
dd83a157f1
[UX] Enforce valid choices for envs like VLLM_ATTENTION_BACKEND, etc ( #24761 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-09-16 20:42:23 -07:00
5a411ef6c4
[Benchmarks] Add MMVU video dataset support and clean up deprecated datasets ( #24719 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-17 03:29:43 +00:00
eeb135eb87
[Core] Use CpuGpuBuffer for block table tensors ( #24795 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-16 19:18:06 -07:00
3059b9cc6b
[Doc] Add --force-overwrite option to generate_cmake_presets.py ( #24375 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-16 18:45:29 -07:00
64ad551878
Removes source compilation of nixl dependency ( #24874 )
...
Signed-off-by: bbartels <benjamin@bartels.dev >
Signed-off-by: Benjamin Bartels <benjamin@bartels.dev >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Daniele <36171005+dtrifiro@users.noreply.github.com >
2025-09-17 01:33:18 +00:00
cef32104b4
[FP8] Extend per-token-group quantization support to QuantFP8 ( #24342 )
...
Signed-off-by: Tahsin Tunan <tahsintunan@gmail.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
2025-09-16 18:31:06 -07:00
493b10f8bf
[CI] GPT-OSS GPQA eval test for Blackwell ( #24920 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-16 18:13:21 -07:00
d119fc8614
[CI][Bugfix] Fix failing Blackwell test ( #24993 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-16 15:55:02 -07:00
dbebb7f812
[Perf] Reuse workspace for FP8+FP4 Marlin MoE ( #20500 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-16 15:45:10 -06:00
3053a22b33
fp8 kv cache support fix for torch.compile ( #22758 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
2025-09-16 21:27:11 +00:00
02d4b85454
Use kwargs for long lists of EngineCoreRequest arguments in tests and fix extra kwargs ( #24987 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-16 14:06:56 -07:00
86daa875fe
[gpt-oss][1][bugfix] fix streaming final output ( #24466 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-16 13:56:16 -06:00
dcf2f3ec06
[ROCm] Add dependencies for ROCm ( #24900 )
...
Signed-off-by: Yida Wu <yida.wu@amd.com >
2025-09-16 19:49:06 +00:00
218454b9b2
[MISC] Add code owners of vllm/v1 to vllm/v1/core ( #24928 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-16 19:07:34 +00:00
f4d6eb95cf
[gpt-oss][1b] streaming add item id, content id ( #24788 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-16 18:41:12 +00:00
cd1f885bcf
Directly get max encoder len from VLLM config in V1 ( #24866 )
...
Signed-off-by: Sugar-zsg <952242923@qq.com >
2025-09-16 17:52:31 +00:00
d593cf28fa
[Misc] Add removed encoder-decoder models to previously supported models list ( #24961 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-16 10:46:46 -07:00
faa7a5daac
[Bugfix] Fix unable to run encoder model when disable_hybrid_kv_cache_manager is true ( #24571 )
...
Signed-off-by: lianyibo <lianyibo1@kunlunit.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-09-16 17:36:58 +00:00
567939953b
[Core/DBO][1/N] Add Dual-Batch Overlap mechanism to VLLM ( #23693 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-16 12:21:48 -04:00
08369289af
[Core][MultiModalHasher] Don't convert memoryviews to bytes during hashing ( #24925 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-16 15:32:47 +00:00
73cfb3c5ee
[Model] Clean up and simplify Mamba2 Metadata Usage in both V0 and V1 ( #24331 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-09-16 14:53:43 +00:00
4e5affeaa1
[CI] Add Decode Context Parallelism (DCP) test to CI ( #24487 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-09-16 21:21:28 +08:00
e4f0b4cd96
(doc): set cmake c++ compatible standard when building on MacOS CPU. ( #23483 )
...
Signed-off-by: teekenl <teekenlau@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-16 06:08:46 -07:00
de3e53a75b
feat: Add Grafana and Perses monitoring dashboards for vLLM ( #23498 )
2025-09-16 05:53:40 -07:00
85e0df1392
[Docs] move benchmarks README to contributing guides ( #24820 )
2025-09-16 05:52:57 -07:00
0faf3cc3e8
Move SpeculativeConfig from config/__init__.py to config/speculative.py ( #24904 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-16 12:51:35 +01:00
7ea5c73ad7
[Feat][EPLB] A novel static EPLB placement strategy for MoE models. ( #23745 )
...
Signed-off-by: bruceszchen <bruceszchen@tencent.com >
Signed-off-by: Chen Bruce <bruceszchen@tencent.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Chen Bruce <cszwwdz@vip.qq.com >
Co-authored-by: lemon412 <lemon412@foxmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-16 10:55:16 +00:00
27fcfe7bcf
[Mamba] Support TP>1 with quantization for mamba2 mixer in case n_groups % tp_size == 0 ( #24593 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
Signed-off-by: tomeras91 <57313761+tomeras91@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-16 10:51:01 +00:00
68dbde5dbb
[Bugfix] remove duplicate tokens streamed in required tool choice streaming ( #23312 )
...
Signed-off-by: Jason Cheng <jasoncky96@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-09-16 15:16:32 +08:00
04ad0dc275
[benchmark] Add triton version in the moe tuned config ( #24769 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-16 14:10:54 +08:00
238c4c1705
[QWEN NEXT] Fused MoE kernels Optimization configs ( #24924 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-16 13:06:03 +08:00
8c54610265
[Bug] [Spec Dec]: Fix kv_cache dtype mismatch for Eagle3 drafter on FP8 target ( #24505 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-09-16 04:45:38 +00:00
17871983a2
[Bugfix] Fix sequence parallelism bug when enable pipeline parallelism ( #24021 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-09-16 04:32:32 +00:00
759ef49b15
Remove V0 Encoder-Decoder Support ( #24907 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-15 21:17:14 -07:00
5206ab20ba
[XPU] Fix circular import error. ( #24927 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-16 03:35:36 +00:00
0af3ce1355
Upgrade flashinfer to 0.3.1 ( #24470 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-16 02:36:09 +00:00
e1279ef00f
[Docs] Update instructions for how to use existing torch binary ( #24892 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-16 02:25:50 +00:00
2942970d44
[Metrics] Hide deprecated metrics with gpu_ prefix ( #24245 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-09-15 20:15:57 -06:00
3c96e7b8a1
[CI] Small Accuracy Eval Test for Deepseek Model ( #24259 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-15 20:14:50 -06:00
b42566f440
[Bug] Fix is_flashmla_supported Check Error ( #24774 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-15 20:10:55 -06:00
d96e11167d
Add pytest-cov and .coveragerc ( #24778 )
...
Signed-off-by: Reza Barazesh <rezabarazesh@meta.com >
2025-09-15 20:08:46 -06:00
2891603efd
[ROCm][Bugfix] Fix the case where there's bias ( #24895 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-15 20:05:12 -06:00
de2cc3d867
[Deprecation] Remove DeepGEMM Old Symbol Wrapper ( #24902 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-15 20:03:29 -06:00
e95084308b
Updated CODEOWNERS for flashinfer, mla, fused_moe ( #24906 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-16 02:01:28 +00:00
7f6f2c1182
HuggingFace -> Hugging Face in Integration with Hugging Face docs ( #24889 )
2025-09-15 17:28:35 -07:00
5bcc153d7b
[Compile] Fix noop_elimination pass and add tests for noop_elimination ( #24880 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-15 23:33:18 +00:00
45bfa49cb8
[Tests] fix initialization of kv hash in tests ( #24273 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
2025-09-15 21:48:27 +00:00
fd2f10546c
[ci] fix wheel names for arm wheels ( #24898 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-15 14:39:08 -07:00
e757a629e7
[Bug] Fix Cutlass Scaled MM Compilation Error ( #24887 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-15 17:21:17 -04:00
aae725af7c
[Performance] Remove redundant clone() calls in cutlass_mla ( #24891 )
2025-09-15 20:21:53 +00:00
73df49ef3a
[gpt-oss][1a] create_responses stream outputs BaseModel type, api server is SSE still ( #24759 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-15 13:08:08 -07:00
25aba2b6a3
[gpt-oss] Add IncompleteDetails to ResponsesResponse ( #24561 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-15 13:07:55 -07:00
94b03f88dd
Bump Flashinfer to 0.3.1 ( #24868 )
...
Signed-off-by: bbartels <benjamin@bartels.dev >
2025-09-15 12:45:55 -07:00
49bfc538e4
Update num_tokens_across_dp to use nccl instead of gloo ( #24105 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-09-15 19:05:48 +00:00
a0b26701c9
[Transform] Deterministic Hadacore Transforms ( #24106 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-09-15 12:59:31 -06:00
c4afdb69cc
Move MultiModalConfig from config/__init__.py to config/multimodal.py ( #24659 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-15 17:43:16 +00:00
b834b4cbf1
[USAGE] Improve error handling for weight initialization in Unquantized… ( #20321 )
...
Signed-off-by: Rafael Marcelino Koike <rafael.koike@oracle.com >
Signed-off-by: Rafael Koike <koike.rafael@gmail.com >
2025-09-15 16:45:49 +00:00
740f0647b1
Reinstate existing torch script ( #24729 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-15 09:43:40 -07:00
01413e0cf5
Fp8 paged attention update ( #22222 )
...
Signed-off-by: Xiao Yu <xiao.yu@amd.com >
Signed-off-by: xiao-llm <xiao.yu.dc@outlook.com >
Co-authored-by: Xiao Yu <xiao.yu@metamaterial.com >
Co-authored-by: Xiao Yu <xiao.yu@amd.com >
Co-authored-by: Bowen Bao <bowenbao@amd.com >
2025-09-15 10:43:26 -04:00
0e219cd50b
[Bugfix] Fix GLM4.1V multimodal processor with compatibility for Transformers v4.56 ( #24822 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-15 20:45:06 +08:00
72c99f2a75
[Model]: support Ling2.0 ( #24627 )
...
Signed-off-by: vito.yy <vito.yy@antgroup.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-15 05:09:30 -07:00
bf214ca226
[Misc] Fix examples openai_pooling_client.py ( #24853 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-15 11:57:30 +00:00
2e41f5abca
[XPU] Set consistent default KV cache layout ( #24745 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-15 18:09:34 +08:00
bc0f6059a2
[UT] enhance free kv cache block queue popleft_n ( #24220 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-09-15 10:04:37 +00:00
8de261b04a
[P/D]kv_output_aggregator support P TP > D TP ( #23917 )
...
Signed-off-by: LCAIZJ <leichao139636@163.com >
Co-authored-by: leichao.lc <leichao.lc@antgroup.com >
2025-09-15 11:36:06 +02:00
a0d8b9738d
[Misc] Own KVConnectors installation ( #24867 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-15 02:21:09 -07:00
59e17dd4a0
[Misc] rename interval to max_recent_requests ( #24229 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-09-15 09:18:42 +00:00
4979eb79da
[Doc]: fix typos in various files ( #24821 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-15 01:08:52 -07:00
a8c0f59973
[Bugfix] MiDashengLM model contact error under concurrent testing ( #24738 )
...
Signed-off-by: chenbing8 <chenbing8@xiaomi.com >
Signed-off-by: bingchen-mi <chenbing8@xiaomi.com >
2025-09-15 06:38:12 +00:00
f4a948f33f
[Frontend] Skip stop in reasoning content ( #14550 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-09-15 06:04:55 +00:00
3f3313981c
[kv cache] update num_free_blocks in the end ( #24228 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-09-15 05:15:12 +00:00
78818dd1b0
[Docs] Have a try to improve frameworks/streamlit.md ( #24841 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-14 21:50:36 -07:00
8e5cdcda4e
[Hybrid Allocator] Support Pipeline Parallel ( #23974 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-14 15:55:17 -07:00
90f3f7d73e
[Spec Decoding]Support Spec Decoding Metrics in DP Mode ( #24049 )
...
Signed-off-by: wuhang <wuhang6@huawei.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-14 21:11:09 +00:00
6dc8da5dc1
[Chore] Remove ipex_ops warning ( #24835 )
...
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-14 19:41:53 +00:00
79cbcab871
Force use C++17 globally to avoid compilation error ( #24823 )
...
Signed-off-by: chenfengjin <1871653365@qq.com >
2025-09-14 19:30:10 +00:00
ff68035932
[Benchmarks] Throw usage error when using dataset-name random and dataset-path together ( #24819 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-14 17:50:01 +00:00
1177dd53e9
fix type of sampling rate for encode_base64 ( #24826 )
...
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
2025-09-14 16:17:16 +00:00
fc2dbcda8b
[Perf] Fix DeepGEMM Contiguous Layout Issue, 5.5% Throughput Improvement ( #24783 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-14 11:20:17 -04:00
fec347dee1
[Misc] Improve s3_utils type hints with BaseClient ( #24825 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-09-14 12:11:14 +00:00
cc3173ae98
[Multi Modal][Performance] Fused Q,K's apply_rope into one ( #24511 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-14 08:10:21 +00:00
3e903b6cb4
[Chore] Minor simplification for non-PP path ( #24810 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-13 17:41:36 -07:00
973c9d01da
[Minor] Simplify duplicative device check for cuda ( #24793 )
...
Signed-off-by: Ziliang Peng <ziliangdotme@gmail.com >
2025-09-13 18:28:38 +00:00
15b8fef453
Remove redundant assignment in xfer_buffers, This is a little fix ( #24732 )
...
Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com >
2025-09-13 08:11:59 +00:00
cfa3234a5b
[CI][Spec Decode] Adjust threshold for flaky ngram spec decoding test again ( #24771 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-09-13 15:45:11 +08:00
41ae4a1eab
[Doc]: fix typos in various files ( #24798 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-13 00:43:33 -07:00
4dad72f0d9
[Misc] Correct an outdated comment. ( #24765 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-13 00:34:53 -07:00
59d7ffc17f
[CI Failure] Fix test_flashinfer_cutlass_mxfp4_mxfp8_fused_moe ( #24750 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-13 07:29:19 +00:00
1da0f1441d
[Core][Multimodal] Cache supports_kw ( #24773 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-13 07:27:04 +00:00
98229db244
[Kernels][DP/EP] Optimize Silu Kernel for R1 ( #24054 )
...
Signed-off-by: elvircrn <elvircrn@gmail.com >
2025-09-13 00:17:27 -07:00
dbeee3844c
[Perf] Use NVIDIA hardware-accelerated instruction for float to fp8_e4m3 quantization ( #24757 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-13 00:16:24 -07:00
30498f2a65
[Doc]: Remove 404 hyperlinks ( #24785 )
...
Signed-off-by: Rakesh Asapanna <45640029+rozeappletree@users.noreply.github.com >
2025-09-13 00:15:41 -07:00
abc7989adc
[Docs] Remove Neuron install doc as backend no longer exists ( #24396 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-13 00:15:03 -07:00
9a8966bcc2
[Docs] Fix warnings in mkdocs build (continued) ( #24791 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-09-13 00:13:44 -07:00
5febdc8750
[Chore] Remove unused batched RoPE op & kernel ( #24789 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-13 00:08:20 -07:00
99bfef841f
[Bugfix] Fix GPUModelRunner has no attribute lora_manager ( #24762 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-12 23:55:14 -07:00
89e08d6d18
[Model] Add Olmo3 model implementation ( #24534 )
...
Signed-off-by: Shane A <shanea@allenai.org >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-13 03:26:21 +00:00
7f2ea7074e
[Frontend][Multimodal] Allow skipping media data when UUIDs are provided. ( #23950 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-09-13 02:16:06 +00:00
4fdd6f5cbf
[Core] Support async scheduling with uniproc executor ( #24219 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Ronald1995 <ronaldautomobile@163.com >
Co-authored-by: Ronald1995 <ronaldautomobile@163.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-12 16:34:28 -07:00
8226dd56bf
[Qwen3Next] Fixes the cuda graph capture conditions under large batch sizes ( #24660 ) ( #24667 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-09-12 22:31:32 +00:00
5fe643fc26
Add FLASHINFER_MLA to backend selector test ( #24753 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-09-12 22:30:07 +00:00
7ba32aa60b
[Attention][FlashInfer] Enable FP8 FlashInfer (TRTLLM) MLA decode ( #24705 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-09-12 15:45:53 -06:00
c89ed8de43
Invert pattern order to make sure that out_proj layers are identified ( #24781 )
...
Signed-off-by: Alexandre Marques <almarque@redhat.com >
2025-09-12 14:45:29 -07:00
3beadc2f25
[Compilation Bug] Fix Inductor Graph Output with Shape Issue ( #24772 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-12 21:23:05 +00:00
bc636f21a6
[Benchmark] Allow arbitrary headers to be passed to benchmarked endpoints ( #23937 )
...
Signed-off-by: Clayton Coleman <smarterclayton@gmail.com >
2025-09-12 13:57:53 -07:00
017354c0ef
[CI] Trigger BC Linter when labels are added/removed ( #24767 )
2025-09-12 11:44:36 -07:00
010acc6e1e
[Bugfix] Fix incompatibility between #20452 and #24548 ( #24754 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-12 11:17:29 -07:00
c8c42597ab
[CI] Speed up model unit tests in CI ( #24253 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
2025-09-12 10:36:50 -07:00
9d2a44606d
[UX] Remove AsyncLLM torch profiler disabled log ( #24609 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-12 10:08:44 -07:00
f17c075884
[Model] Switch to Fused RMSNorm in GLM-4.1V model ( #24733 )
...
Signed-off-by: SamitHuang <285365963@qq.com >
2025-09-12 09:12:23 -07:00
b0d1213ac3
[Models] Prevent CUDA sync in Qwen2.5-VL ( #24741 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-12 16:03:55 +00:00
57f94e88ea
[Models] Optimise and simplify _validate_and_reshape_mm_tensor ( #24742 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-12 15:37:37 +00:00
684b6870e1
[Bugfix][Frontend] Fix --enable-log-outputs does not match the documentation ( #24626 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-09-12 08:01:24 -07:00
a5b84f1cbf
[Core] Shared memory based object store for Multimodal data caching and IPC ( #20452 )
...
Signed-off-by: donglu <donglu@cohere.com >
2025-09-12 07:54:17 -07:00
9f04d9d55f
[Qwen3-Next] MoE configs for H100 TP=1,2 and TP2/EP ( #24739 )
...
Signed-off-by: elvircrn <elvircrn@gmail.com >
2025-09-12 07:54:04 -07:00
4d7c1d531b
[Bugfix] Fix MRoPE dispatch on XPU ( #24724 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-09-12 21:43:56 +08:00
41f17bf290
[Docs] Fix warnings in mkdocs build (continued) ( #24740 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-09-12 06:43:15 -07:00
bcb06d7baf
[Doc]: fix typos in various files ( #24726 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-12 06:43:12 -07:00
0377802c20
[Multimodal] Remove legacy multimodal fields in favor of MultiModalFeatureSpec ( #24548 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-12 21:42:23 +08:00
72fc8aa412
[Multi Modal] Add FA3 in VIT ( #24347 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-09-12 21:27:24 +08:00
fdb09c77d6
[sleep mode] save memory for on-the-fly quantization ( #24731 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-12 11:25:19 +00:00
7a1c4025f1
[Kernel] [CPU] refactor cpu_attn.py:_run_sdpa_forward for better memory access ( #24701 )
...
Signed-off-by: ignaciosica <mignacio.sica@gmail.com >
2025-09-12 19:23:07 +08:00
60a0951924
[Bugfix] Fix BNB name match ( #24735 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-12 11:12:01 +00:00
64d90c3e4f
[Misc][gpt-oss] Add gpt-oss label to PRs that mention harmony or related to builtin tool call ( #24717 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-12 18:57:07 +08:00
59d5d2c736
[CI/Build] Skip prompt embeddings tests on V1-only CPU backend ( #24721 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-12 18:51:01 +08:00
d21a36f5f9
[CI] Add ci_envs for convenient local testing ( #24630 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-12 08:52:25 +00:00
561a0baee0
[CI] Fix flaky test v1/worker/test_gpu_model_runner.py::test_kv_cache_stride_order ( #24640 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-12 07:49:09 +00:00
f592b3174b
[BugFix] Fix Qwen3-Next PP ( #24709 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-11 23:35:04 -07:00
7920de0a2a
[Bugfix] Fix MRoPE dispatch on CPU ( #24712 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-12 04:56:31 +00:00
ddcec289c7
Fix implementation divergence for BLOOM models between vLLM and HuggingFace when using prompt embeds ( #24686 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-12 04:35:48 +00:00
e090b7b45b
Enable conversion of multimodal models to pooling tasks ( #24451 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-09-12 03:30:41 +00:00
6a50eaa0d3
[DOCs] Update ROCm installation docs section ( #24691 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-11 20:02:53 -07:00
12a8414d81
[Qwen3-Next] MoE configs for H20 TP=1,2,4,8 ( #24707 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-12 10:06:26 +08:00
880c741bb6
[Bugfix] fixes the causal_conv1d_update kernel update non-speculative decoding cases ( #24680 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-11 18:16:43 -07:00
40b6c9122b
[V1] feat:add engine v1 tracing ( #20372 )
...
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com >
Signed-off-by: Ye Zhang <zhysishu@gmail.com >
Signed-off-by: RichardoMu <44485717+RichardoMrMu@users.noreply.github.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Mu Huai <tianbowen.tbw@antgroup.com >
Co-authored-by: Ye Zhang <zhysishu@gmail.com >
Co-authored-by: Benjamin Bartels <benjamin@bartels.dev >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: 瑜琮 <ly186375@antfin.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-11 17:10:39 -07:00
2e6bc46821
[Startup] Make DeepGEMM warmup scale with max-num-batched-tokens ( #24693 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-11 20:10:19 -04:00
fcba05c435
[Bug] Fix Layer weight_block_size Assertion Issue ( #24674 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-11 19:47:59 -04:00
7a30fa8708
[Doc] Clarify cudagraph capture size logic and default behavior in scheduler ( #18698 )
...
Signed-off-by: Zazzle516 <2405677060@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 23:18:09 +00:00
f82f7a8990
[Qwen3-Next] MOE configs for H100 TP4 ( #24699 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-11 15:45:52 -07:00
c3aea10dc8
[Perf] Use upstream CUTLASS for SM90 Block FP8 kernel ( #23280 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-11 15:43:14 -07:00
d4fd2768ef
[Bugfix][Attention] Fix FlashInfer MLA block size logic ( #24692 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-09-11 22:39:42 +00:00
7a70a71892
[Qwen3-Next] Add B200 MoE configs for Qwen3-next ( #24698 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2025-09-11 15:34:58 -07:00
7d4651997a
[CI/Build] Add bc-linter to vLLM CI ( #21234 )
...
Signed-off-by: zhewenli <zhewenli@meta.com >
2025-09-11 15:34:36 -07:00
569bf1c9c0
[Qwen3-Next] MoE configs for H200 TP=1,2,4 ( #24695 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-11 14:38:16 -07:00
1ec20355f5
[Bugfix] Set VLLM_ALLREDUCE_USE_SYMM_MEM default to False ( #24696 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-11 14:32:27 -07:00
e42af78b18
[flashinfer] [kernel] support for fp8 kv cache for trtllm prefill attention ( #24197 )
...
Signed-off-by: Xiaozhu <mxz297@gmail.com >
2025-09-11 14:20:09 -07:00
074854b24f
[Kernel][B200] mxfp4 fused cutlass moe ( #23696 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-11 17:04:56 -04:00
79ac59f32e
Update Spec Decode metrics to include drafted and accepted token throughput ( #24127 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-11 19:58:43 +00:00
b971f91504
[BugFix] Fix tokenize asyncio task leak ( #24677 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-11 19:44:04 +00:00
c733bd5e87
[Qwen3-Next] Add MoE Config for H200 ( #24688 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-11 12:40:15 -07:00
a892b259b4
[Doc] Remove Useless Comments ( #24687 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-11 12:25:47 -07:00
127ded0a9e
[Ultravox] Use wrapped_model_config to instantiate inner model ( #24679 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2025-09-11 18:52:24 +00:00
bb2b5126da
[VLM] Migrate remain DP-supported ViT models to use disable_tp ( #24363 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-11 18:30:41 +00:00
361ae27f8a
[Docs] Fix formatting of transcription doc ( #24676 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 11:18:06 -07:00
e26fef8397
fix some typos ( #24616 )
...
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
2025-09-11 10:48:46 -07:00
c1eda615ba
Fix model name included in responses ( #24663 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 10:47:51 -07:00
4aa23892d6
[Bugfix] Fix platform-specific routing in CustomOp implementations ( #24444 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2025-09-11 17:15:01 +00:00
1fdd5c42d7
[Kernels] Enable Torch Symmetric Memory All-Reduce By Default ( #24111 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-11 09:45:31 -07:00
bcbe2a4d9e
[VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames ( #24161 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-11 09:44:34 -07:00
51d41265ad
[Docs] Fix typos in EP deployment doc ( #24669 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 09:07:23 -07:00
4984a291d5
[Doc] Fix Markdown Pre-commit Error ( #24670 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-11 09:05:59 -07:00
404c85ca72
[Docs] Add transcription support to model ( #24664 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-11 07:39:01 -07:00
817beef7f3
[Bugfix] Fix qwen-next packed_modules_mapping ( #24656 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-11 22:26:17 +08:00
4f6593b058
[HybridKVCache][Platform] Add support_hybrid_kv_cache for platform ( #24646 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2025-09-11 21:47:58 +08:00
94e6b2d55f
Allow users to specify kv cache memory size ( #21489 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 13:41:07 +00:00
fd1ce98cdd
[CI] Split mteb test from Language Models Test ( #24634 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-11 06:37:51 -07:00
d11ec124a0
[Bench] Add qwen-next in benchmark_moe.py ( #24661 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-11 21:29:43 +08:00
f510715882
[build] add torch to tool.uv no-build-isolation-package ( #24303 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 13:19:44 +00:00
f946197473
[Docs] Fixes a typo in the qwen3next model name. ( #24654 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-09-11 19:35:14 +08:00
0cd72a7b72
[XPU] add missing dependency tblib for XPU CI ( #24639 )
...
Signed-off-by: Fanli Lin <fanli.lin@intel.com >
2025-09-11 11:22:33 +00:00
5f5271f1ee
Move LoRAConfig from config/__init__.py to config/lora.py ( #24644 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 11:01:38 +00:00
d6249d0699
Fix typing for safetensors_load_strategy ( #24641 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 10:41:39 +00:00
25bb9e8c65
[CI Failure] fix models/language/pooling/test_auto_prefix_cache_support.py ( #24636 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-11 03:31:23 -07:00
a1213fae5f
[Misc] Add @NickLucche to codeowners ( #24647 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-11 17:18:09 +08:00
a8b0361c92
[CI] Split pooling from entrypoints Test ( #24632 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-11 01:53:09 -07:00
ed5ae4aace
[Bugfix] Fix _synced_weight_loader ( #24565 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2025-09-11 16:52:33 +08:00
0fc36463e0
[CI]Add transformers_utils to Async Engine, Inputs, Utils, Worker Test ( #24615 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
2025-09-11 01:52:10 -07:00
d14c4ebf08
[Docs] Use 1-2-3 list for deploy steps in deployment/frameworks/ ( #24633 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-11 01:50:12 -07:00
ba6011027d
[Docs] Update V1 doc to reflect whisper support ( #24606 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-11 01:50:08 -07:00
85df8afdae
[Docs] Revise frameworks/anything-llm.md ( #24489 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-11 01:50:05 -07:00
6aeb1dab4a
[Bugfix] Fix incorrect import of CacheConfig ( #24631 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-11 01:48:25 -07:00
e93f4cc9e3
Add the support for the qwen3 next model (a hybrid attention model). ( #24526 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-11 15:32:09 +08:00
2048c4e379
[torchao] Support quantization configs using module swap ( #21982 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-09-10 23:53:24 -07:00
d13360183a
Remove redundant all gather + split ( #23441 )
...
Co-authored-by: Chenxi Yang <cxyang@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2025-09-10 23:45:07 -07:00
9bd831f501
[Model] New model support for Motif-1-Tiny ( #23414 )
...
Signed-off-by: ca1207 <ca1207zzz@gmail.com >
Signed-off-by: TaehyunKim <73943231+ca1207@users.noreply.github.com >
Co-authored-by: WyldeCat <skan1543@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-10 23:29:40 -07:00
e2b1f863aa
[Doc]: fixing doc typos ( #24635 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-10 23:19:28 -07:00
41329a0ff9
[Core] feat: Add --safetensors-load-strategy flag for faster safetensors loading from Lustre ( #24469 )
...
Signed-off-by: Shiqi Sheng <shengshiqi@google.com >
Signed-off-by: shengshiqi-google <160179165+shengshiqi-google@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-10 23:10:01 -07:00
ee0bc5e1b4
Enable --profile in 'vllm bench throughput' ( #24575 )
...
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com >
2025-09-10 23:06:19 -07:00
3d1393f6fc
Kimi K2 Fused MoE kernels Optimization configs ( #24597 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
2025-09-10 23:06:16 -07:00
8a894084d2
[Engine][Chore] use local variable and remove output var assignment ( #24554 )
...
Signed-off-by: Guy Stone <guys@spotify.com >
2025-09-10 23:05:42 -07:00
e2d8c27f68
[BugFix] Fix pipeline parallel ( #24621 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-10 23:05:30 -07:00
29799ddacc
[Bugfix] Add missing VIT backend dispatch on CPU ( #24623 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-10 22:28:41 -07:00
f17a6aa4ec
[Ultravox] Fix Gemma instantiation, support quantization via --hf-overrides ( #24131 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2025-09-10 22:25:34 -07:00
6c8deacd72
[Bug] [Spec Decode] Fix model_initialization test and mismatch in aux_hidden_layers ( #24613 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-10 21:23:18 -07:00
55b823ba0f
Add @chaunceyjiang to codeowner for Reasoning and Tool parser ( #24406 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-11 04:23:04 +00:00
8c5a747246
[distributed] update known issues ( #24624 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-11 11:09:38 +08:00
5931b7e5d9
[Models][Quantization] Add quantization configuration update in Voxtral model ( #24122 )
...
Signed-off-by: Alexandre Marques <almarque@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-10 19:13:56 -07:00
cc99baf14d
[Misc] Make timeout passable in init_distributed_environment ( #24522 )
...
Signed-off-by: jberkhahn <jaberkha@us.ibm.com >
2025-09-10 15:41:12 -07:00
dcb28a332b
[Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration ( #21078 )
...
Signed-off-by: hjjq <hanjieq@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-10 15:31:10 -07:00
fba7856581
[Perf] Warmup FlashInfer attention during startup ( #23439 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-09-10 15:03:17 -07:00
b5e383cd8b
[gpt-oss] raise error for flashinfer backend without trtllm ( #24482 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-10 14:33:13 -07:00
9a161307f5
[torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends ( #19767 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-10 13:59:55 -07:00
37e8182bfe
[v1] Add Whisper model support (encoder-decoder) ( #21088 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: NickLucche <nlucches@redhat.com >
2025-09-10 13:53:35 -07:00
4db4426404
[CI] Fail subprocess tests with root-cause error ( #23795 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-10 13:53:21 -07:00
a0933c3bd6
[Bugfix] Enable FP8 KV cache for FlashInfer and Triton backend on non-sm100 GPUs ( #24577 )
...
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg >
2025-09-10 12:33:41 -07:00
09e68bce34
[Misc] update log level debug to warning when process port is used by ( #24226 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-10 11:32:57 -07:00
9fb74c27a7
[Core] Support configuration parsing plugin ( #24277 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
Signed-off-by: Xingyu Liu <38244988+charlotte12l@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-10 11:32:43 -07:00
4032949630
[Bugfix] Fix DeepEP config for DP4TP4 ( #23619 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-09-10 10:37:56 -07:00
08abfa78ec
[Bugfix] fix modelopt exclude_modules name mapping ( #24178 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-10 10:20:46 -07:00
2bef2d1405
[Logging] allow config logging stream ( #24336 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2025-09-10 15:02:01 +00:00
36cacd0958
[Doc] Add documentation for GLM-4.5 series models: tool-calling and reasoning parser ( #24589 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-09-10 07:50:55 -07:00
bb3eb80d92
[Core] Split LoRA layers ( #24574 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-10 07:47:51 -07:00
fcc0a3130a
[CI] Fix tensorizer test assertion ( #24545 )
...
Signed-off-by: Peter Schuurman <psch@google.com >
2025-09-10 06:57:36 -07:00
736569da8d
[Platform] Custom ops support for LMhead and LogitsProcessor ( #23564 )
...
Signed-off-by: zzhx1 <zzh_201018@outlook.com >
2025-09-10 06:26:31 -07:00
2eb9986a2d
[BugFix] python collect_env.py and vllm collect-env compatibility with uv venv ( #24066 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-09-10 21:25:33 +08:00
ccee371e86
[Docs] Fix warnings in mkdocs build (continued) ( #24092 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-10 06:23:28 -07:00
c0bd6a684a
Fix Auto_Round Quantization Loading on SM75 and Lower GPUs ( #24217 )
...
Signed-off-by: RoadToNowhereX <37441177+RoadToNowhereX@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-10 06:22:31 -07:00
3144d90217
fix some typos ( #24167 )
...
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-10 06:21:23 -07:00
2f5e5c18de
[CI/Build] bump timm dependency ( #24189 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-10 06:20:59 -07:00
bd98842c8a
[CI] Add PPL test for generation models ( #24485 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-10 06:16:39 -07:00
d6069887c6
[rocm] enable torchao quantization for rocm ( #24400 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2025-09-10 06:16:21 -07:00
492196ed0e
[CI/Build] split true unit tests to Entrypoints Unit Tests ( #24418 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-10 06:16:07 -07:00
f4f1a8df22
[BugFix] Ensure integrity of reused CPU tensors during async scheduling ( #24527 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: guoze.lin <guozelin@tencent.com >
2025-09-10 21:15:14 +08:00
0b9a612fa3
[BugFix][easy] Fix flaky test test_gpt_oss_multi_turn_chat ( #24549 )
...
Signed-off-by: lacora2017 <yehu@meta.com >
Co-authored-by: lacora2017 <yehu@meta.com >
2025-09-10 21:14:55 +08:00
4c04eef706
[BugFix][Multi Modal] Fix TensorSchema shape mismatch in Molmo ( #24559 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-09-10 06:14:27 -07:00
f36355abfd
Move LoadConfig from config/__init__.py to config/load.py ( #24566 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-10 06:14:18 -07:00
9e3c3a7df2
[LoRA]: Add LoRA support to Mistral's Voxtral models ( #24517 )
...
Signed-off-by: Yash Pratap Singh <yashsingh20001@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-10 06:12:03 -07:00
6cbd41909e
Feature/vit attention unification# 23880 ( #23978 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-10 06:10:14 -07:00
72d30108a0
Support for NemotronH Nano VLM ( #23644 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com >
2025-09-10 06:10:06 -07:00
8b83b93739
[Docs] Document the extra memory footprint overhead when using EPLB ( #24537 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-10 06:09:49 -07:00
9dbefd88e9
[Docs] Improve organisation of API Reference nav ( #24569 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-10 06:08:21 -07:00
7c195d43da
[ROCm][Bugfix] Fix Aiter RMSNorm ( #23412 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-09-10 21:08:03 +08:00
0ae43dbf8c
[Attention] add DCP support for FLASH_ATTN_MLA backend ( #24453 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-10 17:19:26 +08:00
267c80d31f
[Model] Limit CPU threads for image transformations in InternVL to reduce cpu contention. ( #24519 )
...
Signed-off-by: li-jinpeng <3332126450@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-10 16:45:44 +08:00
77f62613f9
Consolidate rendering parameters into RenderConfig dataclass ( #24543 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-10 08:44:47 +00:00
feaf202e93
[Bugfix] Guard _may_reorder_batch for encoder-only models on CPU ( #24319 ) ( #24348 )
...
Signed-off-by: Remy <eunhwan.shin@dtonic.io >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-09-10 14:24:42 +08:00
91130ae376
[docs] promo pytorch conf and ray summit ( #24562 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-09 23:24:20 -07:00
e40827280b
[Docs] Enable relative links in examples to function when rendered in the docs ( #24041 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-09 21:40:45 -07:00
4377b1ae3b
[Bugfix] Update Run:AI Model Streamer Loading Integration ( #23845 )
...
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai >
Signed-off-by: Peter Schuurman <psch@google.com >
Co-authored-by: Omer Dayan (SW-GPU) <omer@run.ai >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-09 21:37:17 -07:00
009d689b0c
[Core] Simplify and unify mm uuid handling & auto-generated mm hash overrides processing. ( #24271 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-09-09 21:36:09 -07:00
0efdb5c3ba
[gpt-oss] Cache permute indices for faster MXFP4 MoE layer loading ( #24154 )
...
Signed-off-by: Wei Wei <wwei6@meta.com >
2025-09-10 04:27:53 +00:00
53b42f4102
[BugFix][Spec Decode] Fix out-of-range index triggered by eagle3; re-enable test for LlamaForCausalLMEagle3 ( #24392 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-09-09 21:24:23 -07:00
309d7aa401
[P/D] MultiConnector supports shutdown ( #24425 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-09 21:24:11 -07:00
b4a01aaf95
[KV Connector] More async support for get_num_new_matched_tokens ( #23620 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
2025-09-09 21:23:37 -07:00
83dd28aae4
[CI] Adjust threshold for flaky ngram spec decoding test ( #24528 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-09 21:07:33 -07:00
f88e84016f
[BugFix] Fix async core engine client finalizer ( #24540 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-09 21:07:13 -07:00
3c2156b3af
[Hardware][Apple-CPU] Enable native bfloat16 on Apple Silicon (M2 and later) ( #24129 )
...
Signed-off-by: ignaciosica <mignacio.sica@gmail.com >
2025-09-10 03:50:21 +00:00
7e7db04310
[CI] Retry flaky fp8 cutlass mla tests ( #24536 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-09 20:33:10 -07:00
41f160b974
Add @heheda12345 to CODEOWNERS of KVCacheManager related code ( #24546 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-10 03:30:32 +00:00
dc625ea6b8
[Perf] Convert np array to torch tensor to index into block table for attn chunking ( #24474 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-09-09 20:01:06 -07:00
b23fb78623
[Bugfix] Fix for 24530. Fix naive all2all shared expert overlap. ( #24538 )
2025-09-09 17:53:53 -07:00
561f38dc3c
[Bugfix] Improve EPLB config validation error message ( #24524 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-10 00:32:36 +00:00
73e688cb79
[ROCm][Feature] Enable Pipeline Parallelism with Ray Compiled Graph on ROCm ( #24275 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-09-09 23:27:35 +00:00
fb1a8f932a
[Benchmark] Add option to skip oversampling in benchmark ( #24457 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2025-09-09 22:00:17 +00:00
0dc9cbb527
[Benchmark] Update bench doc with mtbench, blazedit, spec bench ( #24450 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2025-09-09 21:15:41 +00:00
b5fb3005a8
[Log] Use a relative path in debug-level logs to distinguish files with identical names ( #23846 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-09 16:46:35 -04:00
15de5ff9ea
[Feature] Disallow FlashMLA on Blackwell ( #24521 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-09 14:59:34 -04:00
b8a93076d3
[CI] execute all piecewise compilation tests together ( #24502 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-09 11:05:25 -07:00
c3f9773b2c
[TPU] Fix tpu structured decoding in mixed batches ( #24458 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-09-09 11:04:25 -07:00
3707cb2505
[Docs] Gemma3n transcriptions endpoint support ( #24512 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-09 11:03:32 -07:00
920ed46b09
[Misc] bump outlines_core to fix the version conflicts with outlines >= 1.2.0 ( #24368 )
...
Signed-off-by: Kazuhiro Serizawa <nserihiro@gmail.com >
Signed-off-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-09 10:59:46 -07:00
15cb047e25
Extend renderer with embedding support and integrate completion endpoint ( #24405 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-10 01:46:46 +08:00
9ad0688e43
[Bugfix] Fix hidden_size for multimodal classification model ( #24501 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-09 10:37:25 -07:00
b9a1c4c8a2
[ROCm][CI/Build] Sync ROCm dockerfiles with the ROCm fork ( #24279 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-09 12:21:56 -04:00
1aa427fdc1
[Kernels] Add Flash Linear Attention Kernels ( #24518 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-10 00:04:41 +08:00
1c63a16b65
[Core] Run garbage collector after CUDA graph capture to fix throughput regression ( #24128 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-09 10:38:10 -04:00
922d3b401b
[Bugfix] Handle the edge case in detokenizer where processed tokens contain both stop str and eos token ( #23938 )
...
Signed-off-by: dtransposed <damian.bogunowicz@gmail.com >
2025-09-09 07:30:24 -07:00
19332c0479
[Model] Systematic support for fp32 head, pooling models part ( #23810 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-09 07:29:50 -07:00
a55cf41a09
[Compilation][WideEP] Enable Piecewise CUDAGraph for DeepEPHT ( #24123 )
2025-09-09 10:21:10 -04:00
6fb2788163
[CI/Build][Doc] Fully deprecate old bench scripts for serving / throughput / latency ( #24411 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-09 10:02:35 +00:00
3d2a2de8f7
[RL] fast weight update with zmq + ipc handles ( #24295 )
...
Signed-off-by: huangweixiao <huangweixiao@msh.team >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-09-09 16:57:46 +08:00
1116590b16
[gpt-oss] Validate gpt-oss python tool during initialization ( #23856 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-09 08:37:48 +00:00
ccb97338af
[Misc] Add Codex settings to gitignore ( #24493 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-09-09 01:25:44 -07:00
45c9cb5835
[Misc] Add claude settings to gitignore ( #24492 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-09 01:14:45 -07:00
e283976f3a
[Performance][MM] Building the inverse permutation in O(n) time in Qwen2_5_VisionTransformer ( #24443 )
...
Signed-off-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: Junhong <liujunhong11@huawei.com >
2025-09-09 00:24:11 -07:00
46876dff32
[Doc]: fixing typos to improve docs ( #24480 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-08 23:06:04 -07:00
1823a00d67
[Misc] Support bench serve long context ( #24373 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-09-08 22:53:10 -07:00
ed16d0f26f
[Doc] mention fpdb for multiprocess breakpoints ( #24452 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
2025-09-08 21:46:45 -07:00
0cdd213641
[Misc] Improve Worker process title and logging prefix ( #22205 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-08 21:43:48 -07:00
948dd3443b
[Bugfix] Fix Apertus HF repo name ( #24447 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-08 21:40:29 -07:00
b2f7745774
Add data_parallel_size to VllmConfig string representation ( #24298 )
...
Co-authored-by: Cong Chen <congc@meta.com >
2025-09-08 21:35:18 -07:00
82dfb12e52
[Core] Use sha256 bytes instead of BlockHash to reduce GC overhead ( #23673 )
...
Signed-off-by: linzebing <linzebing1995@gmail.com >
2025-09-08 21:34:37 -07:00
bba1042c6f
[Flashinfer] Support Flashinfer TRTLLM FP8-qkv BF16/FP16-out Attention Kernel ( #23647 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-08 20:53:07 -07:00
b6fbc15634
[BugFix][Model] Fix Ernie4.5-VL hanging on long inputs ( #24074 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2025-09-09 11:37:16 +08:00
3e0d4a3475
Move KVTransferConfig from config/__init__.py to config/kv_transfer.py ( #24434 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 20:30:32 -07:00
562663a044
Bump actions/github-script from 7.0.1 to 8.0.0 ( #24413 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-09 03:12:44 +00:00
ed1623a88a
Bump actions/stale from 9.1.0 to 10.0.0 ( #24412 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-09 03:11:20 +00:00
13b89bd823
[doc] update vllm serve cli args documentation ( #24329 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2025-09-09 03:07:58 +00:00
22a0070530
Bump actions/setup-python from 5.4.0 to 6.0.0 ( #24414 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-09 02:54:58 +00:00
170129eb28
[gpt-oss] Harmony changes with container tool support ( #23386 )
...
Signed-off-by: zhiweiz <zhiweiz@fb.com >
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
Co-authored-by: zhiweiz <zhiweiz@fb.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2025-09-08 19:03:50 -07:00
955c624915
[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and EP MoE ( #24134 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-09-08 19:01:51 -07:00
4f87abdcc6
Update reviewers for modelopt related files ( #24468 )
2025-09-09 01:53:13 +00:00
6910b56da2
[CI] Add nightly multiarch manifests to dockerhub ( #24102 )
...
Signed-off-by: Sahithi Chigurupati <chigurupati.sahithi@gmail.com >
Signed-off-by: Simon Mo <simon.mo@hey.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-09 01:18:09 +00:00
e10fef0883
[Hardware][IBM Z] Fix Outlines Core issue for s390x ( #24034 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2025-09-08 16:50:34 -07:00
e680723eba
[Bugfix] Disable the statslogger if the api_server_count is greater than 1 ( #22227 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-09-08 15:28:03 -07:00
620db1fc58
[Attention] FlashAttention MLA cudagraph support ( #23958 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-08 22:05:26 +00:00
41183c1fe0
[Spec Decode] Fix offline spec_decode.py ( #24257 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-08 20:44:13 +00:00
43d9ad03ba
[Model loader]: support multi-thread model weight loading ( #23928 )
...
Signed-off-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com >
Signed-off-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-08 18:49:39 +00:00
7be141b2c5
[CI] Enable encoder model compilation test ( #24442 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-08 11:48:06 -07:00
8d7f39b48c
[Model] Remove quantized mixtral ( #24437 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-08 11:02:14 -07:00
cd08636926
[Spec Decode][Benchmark] Add Blitzedit dataset ( #23605 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-08 10:32:52 -07:00
3feeeb9fea
[Spec Decode][Benchmark] Add Spec Bench Dataset for benchmarking ( #23563 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2025-09-08 10:32:42 -07:00
6f4a82f8b5
[Model] Enable BNB support for qwen2_5_omni_thinker ( #24420 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-08 09:37:08 -07:00
c44797a4d6
[Docs]add eplb_config param use docs ( #24213 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-09-08 09:36:57 -07:00
55be93baf5
[Doc]: fix 2 hyperlinks leading to Ray site after they changed Ray's doc structure ( #24438 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 09:36:54 -07:00
717fc00e98
[Docs] Move feature compatibility tables to README ( #24431 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 06:45:14 -07:00
01dfb5e982
[Frontend] User-provided uuids for medias in chat. (RFC #22044 ) ( #23449 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-08 06:42:20 -07:00
03dd652c16
Move KVEventsConfig from config/__init__.py to config/kv_events.py ( #24433 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 06:41:27 -07:00
9cd76b71ab
[Misc] Terratorch related fixes ( #24337 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-08 06:40:26 -07:00
e041314184
[Bugfix] Fix mamba2 prefill chunking ( #23279 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
Signed-off-by: tomeras91 <57313761+tomeras91@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-08 11:42:41 +00:00
5e537f45b4
[Bugfix] Fix get_quant_config when using modelscope ( #24421 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-09-08 11:03:02 +00:00
c2a8b08fcd
[Doc] Fix issues in integrations/llamastack.md ( #24428 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-08 02:28:32 -07:00
f4962a6d55
[Doc]: fix typos in Python comments ( #24417 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-08 00:22:16 -07:00
2f0b833a05
[Docs] Fix a tip indentation and typo ( #24419 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-08 00:19:40 -07:00
425b04b8f4
[gpt-oss][Responses API] Fix the function call id format ( #24409 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-08 06:49:52 +00:00
60f0843ef8
[Model] Remove unnecessary CUDA sync of Qwen2VL image and video preprocess ( #24334 )
...
Signed-off-by: Win <chatcharinsang@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-07 23:11:12 -07:00
8a46602606
[Model] Remove unnecessary CUDA sync of GLM-4.1V image and video preprocess ( #24332 )
...
Signed-off-by: Win <chatcharinsang@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-07 23:10:54 -07:00
61aa4b2901
[P/D] Add a shutdown method to the Connector API ( #22699 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-07 23:07:00 -07:00
8c892b1831
[Doc] Fix UTF-8 encoding issues in documentation generation on Windows ( #24361 )
...
Signed-off-by: alekramelaheehridoy <aliqramalaheehridoy@gmail.com >
Signed-off-by: alekramelaheehridoy <alekramelaheehridoy@gmail.com >
Co-authored-by: alekramelaheehridoy <alekramelaheehridoy@gmail.com >
2025-09-07 22:33:52 -07:00
3bca396f79
[CI/Build] Fix local image inputs in test_pixtral.py ( #24401 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-08 03:31:35 +00:00
3a3e91bdfe
[CI/Build] Disable flaky test_structured_output tests ( #24404 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-08 02:51:59 +00:00
b3d7e3c845
[Sampler] Support returning all prompt logprobs ( #23868 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-07 19:34:31 -07:00
67841317d1
[xpu] upgrade ipex/python3.12 for xpu ( #23830 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-09-08 02:07:16 +00:00
86173ad593
[Kernel] Support decode context parallelism on Blackwell with CUTLASS MLA ( #24385 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-09-08 09:27:12 +08:00
795b6951cd
Add @luccafong to codeowner for spec decode ( #24397 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-09-08 08:30:27 +08:00
2e5d21378d
Skip MM Encoder for non-first PP ranks ( #24387 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-07 09:38:35 -07:00
0661cb9df3
Add renderer-based prompt processing for embedding and classification endpoints ( #24356 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-07 08:26:48 +00:00
105d3d62ef
[TPU] Remove TopKTopPSampler dependency for TPU sampler ( #24391 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-07 01:12:36 -07:00
62f66be1f7
[Bugfix] Fix Qwen3-coder moe tuned config ( #24072 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-07 05:19:46 +00:00
81c53ef55c
[Misc] collect flashinfer version in collect_env.py ( #24378 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-07 03:30:41 +00:00
75334956c2
QWEN3 Thinking Fused MoE kernels Optimization configs ( #24330 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
2025-09-07 03:18:54 +00:00
77aec83b8c
[Benchmark] add benchmark for custom activation op ( #23908 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-06 20:12:05 -07:00
e67597545b
[CI][Fix] deterministic seed for flaky CI runs on structured outputs ( #24380 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-09-07 11:10:40 +08:00
37a6fa95fd
Migrate Qwen2 inputs to TensorSchema ( #23475 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-06 20:07:31 -07:00
558f0907dc
[attention][DCP] use AttentionImpl.need_to_return_lse_for_decode ( #24372 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-07 01:18:59 +00:00
4172235ab7
[V0 deprecation] Deprecate V0 Neuron backend ( #21159 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-06 16:15:18 -07:00
848562bd49
break execute_model in gpu_model_runner into sub-functions for custom scopes ( #24265 )
...
Co-authored-by: Bangsheng Tang <bangsheng@meta.com >
2025-09-06 14:02:47 -07:00
e68dc2f014
[Bugfix] Fix unstable silu_mul+nvfp4 quant fusion test ( #24370 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-06 20:39:34 +00:00
a3645ed94d
[Frontend][Responses API] Support reporting tool output tokens and fix reasoning token count ( #24285 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-06 13:27:15 -07:00
fb691ee4e7
[Fix] [gpt-oss] fix non-tool calling path for chat completion ( #24324 )
2025-09-06 19:10:32 +00:00
6024d115cd
Lora bias(enable_lora_bias) deprecate warning ( #24339 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-07 00:42:19 +08:00
7555d6b34a
[Bugfix] Fix test_mixtral_moe ( #24371 )
2025-09-06 09:32:03 -07:00
00a4e56d8d
[Bugfix] Fix broken deepseek fp8 TP weights loading ( #24367 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-06 09:23:12 -07:00
0eadaeff7e
[Bugfix] Avoid uninitialized usage of azp_val when AZP is false. ( #24335 )
...
Signed-off-by: Mohan Kumar Kumar <mohan.cbein@gmail.com >
Signed-off-by: mohankku <mohan.cbein@gmail.com >
2025-09-06 08:17:03 -07:00
0077c8634e
Add @benchislett to codeowner for spec decode and structured outputs ( #24362 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-09-06 22:03:35 +08:00
b121ca22ad
[CI] Disable flaky structured output test from CI ( #24366 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-06 13:31:56 +00:00
eddaafc1c7
[Multimodal] Improve max video embedding length estimation in V1 ( #24312 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-09-06 02:33:19 -07:00
305a1cc0d2
refactor: Turn GPUModelRunner.inputs_embeds to a CpuGpuBuffer ( #24345 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-05 23:01:23 -07:00
6d6c6b05d3
[New Model]: google/embeddinggemma-300m ( #24318 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-05 22:58:36 -07:00
53b19ccdd5
[Core] Allow disabling TP sharding for parallel Linear layer ( #23024 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-05 22:53:58 -07:00
6432739ef1
[Bugfix] Catch and log invalid token ids in detokenizer ( #24351 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-05 22:30:22 -07:00
ac201a0eaf
[Feature] Support Decode Context Parallel (DCP) for MLA ( #23734 )
...
Signed-off-by: hongchao <hongchao@msh.team >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: hongchao <hongchao@msh.team >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-09-06 13:24:05 +08:00
3c529fc994
[KV Sharing] Raise error if using eagle with fast prefill ( #24350 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-09-05 20:22:40 -07:00
35bf193864
[Doc]: fix typos in Python comments ( #24294 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-05 19:41:12 -07:00
35efa70297
Add @22quinn as code reviewer for RL related components ( #24346 )
2025-09-06 01:56:15 +00:00
cee182b297
[Perf][V1] Fully overlap model execution ( #23569 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-09-05 18:20:17 -07:00
c954c6629c
[CI] Add timeouts to tests ( #24260 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-09-05 17:26:22 -07:00
9dfbeb41e5
[RFC] allow cancelation after shutdown in blocking collective_rpc ( #23390 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2025-09-05 14:14:18 -07:00
eedb2a2a10
[Bugfix] Fix silu_mul+quant fusion test ( #24341 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-05 20:13:42 +00:00
23a6c5280e
[gpt-oss][Bugfix]Fix streamableparser for missing handling of certain token_ids ( #24306 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-05 10:26:00 -07:00
7812bcf278
[docs] add shenzhen meetup ( #24326 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-05 22:48:42 +08:00
006e7a34ae
Adding int4 and int8 models for CPU benchmarking ( #23709 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-09-05 20:08:50 +08:00
e599e2c65e
[XPU][P/D] Add XPU support in NixlConnector ( #22436 )
...
Signed-off-by: zhenwei <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-04 21:03:12 -07:00
c29fb540ff
[gpt-oss] tool parser supports for /chat/completions [1/n] ( #22386 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-04 20:39:12 -07:00
65e038931d
[Frontend] Skip unnecessary detokenization when token_id is requested ( #24236 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-04 23:04:12 +00:00
886ccbe5ba
[CI/Build] Reduce the number of redundant cases to test for LoRA ( #24276 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-04 21:58:44 +00:00
adc3ddb430
[Bugfix][Misc] Fix silu_and_mul_nvfp4_quant issue and extract common utils for nvfp4 kernel source files ( #23727 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-04 14:25:45 -07:00
60b755cbcb
[Misc] Have AsyncLLM custom_stat_loggers extend default logger list ( #20952 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-09-04 14:25:30 -07:00
482e52f56c
QWEN3 Coder Fused MoE kernels Optimization configs ( #24266 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
2025-09-04 20:33:43 +00:00
78336a0c3e
Upgrade FlashInfer to v0.3.0 ( #24086 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-04 09:49:20 -07:00
94866d7c93
[Misc] Slight improve deepgemm print ( #24085 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-04 16:06:51 +00:00
83609ca91d
[Doc]: fix typos in Python comments ( #24173 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-04 08:52:17 -07:00
e41a0fa377
[Perf] Freeze core engine proc heap after init ( #24008 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-04 22:55:23 +08:00
37241077d5
[Misc] Removed force_fp8_e4m3fnuz from FP8LinearOp ( #23725 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-04 09:25:40 -04:00
c9f7081f9c
[LoRA]: Add lora support to qwen-2.5-omni ( #24231 )
2025-09-04 05:50:50 -07:00
16ded21eeb
[XPU] support Triton Attention backend on Intel GPU ( #24149 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-04 20:41:08 +08:00
2b30afa442
Use hidden_size_per_head as head_size fallback ( #24221 )
...
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
2025-09-04 12:59:16 +01:00
eafa8dcde6
[Model] Add pp support for hunyuan ( #24212 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-04 03:58:26 -07:00
6c7af8110a
[Doc] Update vLLM Singapore Meetup info ( #24234 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-09-04 02:58:18 -07:00
8f423e5f43
[Feature][Response API] Add streaming support for non-harmony ( #23741 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-09-04 17:49:06 +08:00
369a079568
[Hardware][Apple-CPU] Disable OneDNN build for Apple Silicon ( #24200 )
...
Signed-off-by: ignaciosica <mignacio.sica@gmail.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-09-04 02:48:25 -07:00
402759d472
[Attention] FlashAttn MLA ( #14258 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-04 02:47:59 -07:00
2c301ee2eb
[Bugfix] Fix Incremental Detokenization with tokenizers == 0.22.0 ( #24159 )
...
Signed-off-by: Fanli Lin <fanli.lin@intel.com >
Signed-off-by: Fanli Lin <fanli0116@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-04 02:47:08 -07:00
3efb9f4d95
[Attention][Platform] Refactor MLA to support Custom Op ( #23332 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2025-09-04 02:46:37 -07:00
04f3c35cff
Improve flexibility of auto_tune.sh execution. ( #23766 )
...
Signed-off-by: Anthony Su <50185138+anthonsu@users.noreply.github.com >
Signed-off-by: anthonsu <50185138+anthonsu@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-04 09:41:41 +00:00
51d5e9be7d
[Core][Model] Terratorch backend integration ( #23513 )
...
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com >
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-04 00:22:41 -07:00
e7fc70016f
[Model] Add MiDashengLM model support ( #23652 )
...
Signed-off-by: chenbing8 <chenbing8@xiaomi.com >
Signed-off-by: bingchen-mi <chenbing8@xiaomi.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-04 00:08:09 -07:00
12e1e63cc5
[Misc] Enhance output readability of helper script ( #24214 )
...
Signed-off-by: Weida Hong <wdhongtw@google.com >
2025-09-04 06:38:26 +00:00
57b1ce94f7
[CPU] Refactor CPU unquantized linear ( #24150 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-04 14:28:45 +08:00
cb55ad86fe
Migrate ultravox inputs to TensorSchema ( #23503 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-04 06:09:11 +00:00
712b273f65
[Refactor] Introduce basic Renderer for completion-style request ( #24010 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-04 05:21:12 +00:00
e919d6f549
[Kernel][Bugfix] Fix grouped topk cu ( #24146 )
...
Signed-off-by: mayuyuace <qiming1.zhang@intel.com >
2025-09-04 12:37:37 +08:00
a38f8bd54c
[Feature][Responses API]Support MCP tools with streaming mode + background mode ( #23927 )
...
Signed-off-by: wuhang <wuhang6@huawei.com >
2025-09-04 04:05:10 +00:00
b5ee1e3261
Remove deprecated PyNcclConnector ( #24151 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-09-03 22:49:16 +00:00
36c260dad6
[Feature][gpt-oss] Add support for num_cached_tokens and num_reasoning_tokens tracking ( #23460 )
...
Signed-off-by: George Nagy II <george.nagy0969@gmail.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-03 21:08:47 +00:00
a43a3f1770
[Bugfix][DP] DP distribution does not require ray[default] ( #23822 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-09-03 13:21:36 -07:00
6adaed42f4
[Feature][P/D]: Optimize NIXL Connector xfer Launch ( #23887 )
...
Signed-off-by: ycyaw66 <497410282@qq.com >
Co-authored-by: ycyaw66 <497410282@qq.com >
2025-09-03 19:14:30 +00:00
a742322092
[Attention] Blackwell FP8 MLA support with CUTLASS_MLA backend ( #23289 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-03 14:05:24 -04:00
731a6940e3
Migrate whisper inputs to TensorSchema ( #23505 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-03 18:04:00 +00:00
e9b92dcd89
[Kernels] Overlap shared experts with send/recv ( #23273 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-03 12:35:18 -04:00
fa4311d85f
[V1] v1 engine + full CUDA graph support for PLaMo2 ( #23998 )
...
Signed-off-by: Hemmi Shinichi <shemmi@preferred.jp >
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
Co-authored-by: Hemmi Shinichi <shemmi@preferred.jp >
Co-authored-by: Thomas Parnell <tom.parnell@gmail.com >
2025-09-03 08:24:02 -07:00
6d80ae83e1
[Bugfix] Fixing division by zero in triton_attn if query_heads/kv_heads > 16 ( #23424 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
2025-09-03 15:01:09 +00:00
4ba0c587ba
FIX: Add libnuma-dev to Dockerfile for dev stage ( #20388 )
...
Signed-off-by: dongbo910220 <1275604947@qq.com >
2025-09-03 07:17:20 -07:00
6997a25ac6
[Model] Remove useless code from MiniMax implementation ( #23982 )
...
Signed-off-by: QscQ <qscqesze@gmail.com >
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-09-03 11:27:04 +00:00
28f350e147
Support add_generation_prompt in embeddings endpoint with chat request ( #23931 )
...
Signed-off-by: biba10 <jaksmid@seznam.cz >
2025-09-03 10:47:55 +00:00
51383bd472
[CI] Accelerate mteb test by setting SentenceTransformers mteb score to a constant ( #24088 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-03 17:23:56 +08:00
9c99e4871f
[Misc] Clean up deadcode for legacy processing pipeline ( #24153 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-03 08:34:29 +00:00
70549c1245
[CI/Build] Serve images used by multimodal tests through local HTTP Server ( #23907 )
...
Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com >
Signed-off-by: dsinghvi <divyanshsinghvi@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-03 16:13:11 +08:00
f0c503f66e
[Nixl] Heterogeneous TP support FlashInfer ( #20189 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-03 15:19:54 +08:00
f38035c123
[distributed][rl] remove nccl cumem env var override ( #24141 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-03 06:45:25 +00:00
426cc8629f
[BugFix] Fix routed_scaling_factor double mul for dots1 and glm4 MoE models ( #24132 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-09-03 04:57:59 +00:00
e81d4e69c1
[Misc] Add check for dual_chunk_attention ( #24070 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-03 04:19:14 +00:00
02d411fdb2
[Doc]: fix typos in Python comments ( #24115 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-02 21:14:07 -07:00
d7e1e59972
[Doc]: fix typos in Python comments ( #24093 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-02 21:05:45 -07:00
c4ed78b14f
[Compile] Fix Compile Warning for w4a8_mm_entry.cu ( #23660 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-02 20:45:52 -07:00
1bd007f234
fix some typos ( #24071 )
...
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
2025-09-02 20:44:50 -07:00
136d853e65
[V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing ( #23656 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
2025-09-03 02:52:51 +00:00
e32a0e8678
Upgrade xgrammar to 0.1.23 ( #22988 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-03 02:32:59 +00:00
42dc59dbac
Update release pipeline post PyTorch 2.8.0 update ( #24073 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Huy Do <huydhn@gmail.com >
2025-09-03 10:09:19 +08:00
862f2ef893
[XPU] Fix the bug of LoRA logits on the XPU platform ( #24081 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-09-03 08:21:18 +08:00
2fd1a40a54
[CI/Build] Disable SiluMul NVFP4 quant fusion tests ( #24121 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-02 16:50:28 -07:00
930a24144c
[Bug] R1 Accuracy: Fix routed_scaling_factor Double Mul Issue ( #24119 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-02 22:22:30 +00:00
457e471971
[AMD][Kernel][Bugfix] Cast offsets tensor bn to tl.int64 to avoid GPU segfault ( #23692 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-09-02 22:13:57 +00:00
d328f7894f
[CI] Enable all hf transformers baselines in test_hybrid ( #23936 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-09-02 20:15:06 +00:00
98aee612aa
[Log] Only Print Profiler Results on Rank 0 ( #23370 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-02 18:53:34 +00:00
598bd74cf8
Fix weights loading for Apertus ( #24100 )
...
Signed-off-by: Nathan Ranchin <nranchin@student.ethz.ch >
2025-09-02 18:34:28 +00:00
2417798471
[Metrics] Deprecate TPOT in favor of ITL ( #24110 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-09-02 18:10:10 +00:00
9480ae24e3
[Bugfix] Fix packed_factor missing attribute error ( #23902 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2025-09-02 10:56:31 -07:00
f399182e8c
Run ruff format on a few files. ( #24075 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-09-02 17:55:32 +00:00
1c41310584
[Bugfix] Fix transform_config parsing in Compressed Tensors ( #23945 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-09-02 13:54:10 -04:00
c83c4ff815
[Benchmark] Add support for local hf dataset path in benchmark ( #23999 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-02 17:49:16 +00:00
0e1759cd54
[docs] add SYS_NICE cap & security-opt for docker/k8s ( #24017 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: Peter Pan <peter.pan@daocloud.io >
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-02 17:27:20 +00:00
e66ed3e675
[CI Failure] Skip failing nvfp4 silu test ( #23959 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-02 13:18:15 -04:00
e0653f6c0b
[Model] Classification models support logit_bias / sigmoid_normalize ( #24031 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-02 16:48:57 +00:00
38ba061f6f
[BugFix] Fix EXAONE4 rotary embeddings ( #23918 )
...
Signed-off-by: lkm2835 <lkm2835@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-02 14:40:55 +00:00
0a74e9d0f2
[Gemma3n] Fix audio batching ( #24052 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-02 22:23:35 +08:00
8bd5844989
correct LWS deployment yaml ( #23104 )
...
Signed-off-by: cberge908 <42270330+cberge908@users.noreply.github.com >
2025-09-02 12:04:59 +00:00
ce30dca5c4
[CI]: reduce HTTP calls inside entrypoints openai tests ( #23646 )
...
Signed-off-by: AzizCode92 <azizbenothman76@gmail.com >
Signed-off-by: Aziz <azizbenothman76@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-02 10:49:32 +00:00
2f0bab3f26
[Model] Support dp on ViT on GLM-4.5V ( #23168 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-09-02 10:48:18 +00:00
fad73be1a5
[Doc]: fix typos in Python comments ( #24077 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-02 02:38:55 -07:00
56d04089ef
Migrate Interns1 inputs to TensorSchema ( #23510 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-02 04:35:45 +00:00
7be0cb8e9e
[XPU][Feature] fp8 online quantization support for XPU ( #23148 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com >
2025-09-02 04:06:53 +00:00
1fa1d6a9a0
Migrate OvisImagePatchInputs to TensorSchema ( #22024 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-02 12:01:36 +08:00
d59c986444
Remove runtime checks based on pooling params ( #24051 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-09-02 11:54:37 +08:00
04d0c60770
[Bugfix] Fix the issue that Blip2ForConditionalGeneration' object has… ( #24028 )
...
Signed-off-by: Dazhi Jiang <dazhi_jiang@163.com >
2025-09-02 11:54:20 +08:00
2b41cbbf03
[V1][Mamba1] - FP32 SSM Kernel Support ( #23506 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
2025-09-01 20:53:00 -07:00
0235103cbb
[Doc]: fix typos in Python comments ( #24042 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-01 19:07:45 -07:00
a344a5aa0a
[bugfix]fix MTP hidden states ( #24056 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-09-01 21:09:37 +00:00
5685370271
[Chore][V0 Deprecation] Move LogProb to a separate file ( #24055 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 12:07:53 -07:00
a0e0efd6bd
[Model] Support DP for ViT on Kimi-VL-A3B-Thinking-2506 ( #23817 )
...
Signed-off-by: Junhong <liujunhong11@huawei.com >
Signed-off-by: LJH-LBJ <98734602+LJH-LBJ@users.noreply.github.com >
Co-authored-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: LJH-LBJ <98734602+LJH-LBJ@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-09-01 16:56:56 +00:00
cf91a89dd2
[docs][misc] IOProcessor plugins fixes ( #24046 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2025-09-01 09:17:41 -07:00
39a22dcaac
[Misc] Minor code simplification for spec decode ( #24053 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 08:54:01 -07:00
41c80698b3
Document multi-proc method selection for profiling ( #23802 )
...
Signed-off-by: jdebache <jdebache@nvidia.com >
2025-09-01 06:28:26 -07:00
7c8271cd1e
[Model]: support KeyeVL-1_5-8B ( #23838 )
...
Signed-off-by: wangruitao <wangruitao@kuaishou.com >
Co-authored-by: wangruitao <wangruitao@kuaishou.com >
2025-09-01 03:50:27 -07:00
3e330fcb21
[Doc]: Fix CPU install docs: force torch-backend=cpu to avoid GPU torchvision errors ( #24033 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-09-01 03:34:52 -07:00
d46934b229
[Frontend] Gemma3n audio transcriptions/translations endpoint ( #23735 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-01 18:07:46 +08:00
107284959a
[Doc]: fix typos in Python comments ( #24026 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-01 09:38:20 +00:00
dc1a53186d
[Kernel] Update DeepGEMM to latest commit ( #23915 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-01 02:38:04 -07:00
55602bb2e6
[Frontend] Update the warning log when using VLLM_ALLOW_LONG_MAX_MODEL_LEN ( #20904 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-01 08:50:25 +00:00
d7fbc6ddac
[Misc] Enable V1 FP16 inference on pre-Ampere GPUs ( #24022 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-01 08:12:22 +00:00
5438967fbc
[Misc] add hash_function doc string ( #24014 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-31 23:11:20 -07:00
422e793fa6
[Bugfix] Add support for <tool_call> format in streaming mode for XLAM Tool Parser ( #22769 )
...
Signed-off-by: Devon Peroutky <devon@kindo.ai >
2025-09-01 14:07:54 +08:00
1cb39dbcdd
[Misc] IO Processor plugins for pooling models ( #22820 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-31 23:07:12 -07:00
437c3ce026
Migrate Phi4 inputs to TensorSchema ( #23471 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-01 14:05:59 +08:00
499b074bfd
[Misc] refactor code by import as for torch._inductor.config ( #23677 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-09-01 14:05:42 +08:00
ff0e59d83a
[CI/Build] Improve Tensor Schema tests speed by avoid engine core initialization ( #23357 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-31 22:52:20 -07:00
b55713683c
[Misc] Move fast prefill logic to separate method ( #24013 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 05:40:38 +00:00
acc1a6e10a
Fix the bug related to loading GPTP INT3 weights. ( #23328 )
...
Signed-off-by: JunHowie <JunHowie@aliyun.com >
Co-authored-by: JunHowie <JunHowie@aliyun.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-01 05:39:57 +00:00
8c742a66d1
[Misc] Avoid redundant copy for encoder-only models ( #24012 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 04:02:43 +00:00
183a70967a
[BUGFIX] GPTQ quantization compatibility for Qwen3 MOE models (AutoGPTQ and AutoRound-GPTQ) ( #23994 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-01 03:33:40 +00:00
14b4326b94
v1: Support KV events from connectors ( #19737 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-01 01:13:21 +00:00
752d2e1c36
[Minor] Fix some random typos in comments ( #24009 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-31 16:42:17 -07:00
81eea3d348
vllm fix check on max vocab size ( #22471 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-08-31 20:57:05 +08:00
9701352e4b
[Doc]: fix typos in Python comments ( #24001 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-31 08:21:59 +00:00
749be00a98
[Core][Multimodal] Allow passing multi_modal_uuids as multimodal identifiers. ( #23394 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-30 18:01:22 -07:00
5b8077b8ac
Fix wrong truncate_prompt_tokens type hint ( #22761 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com >
Signed-off-by: Gabriel Marinho <104592062+gmarinho2@users.noreply.github.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-30 20:39:38 +00:00
038e9be4eb
[LoRA] Much faster startup when LoRA is enabled ( #23777 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-30 15:37:39 +00:00
68a349114f
[Misc] enhance type hint for rearrange return value ( #23519 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-30 06:43:33 -07:00
e80bca309e
[Refactor] refactor freezing_value/cuda_event initialize outside try finally ( #23758 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-30 06:42:25 -07:00
fb4983e112
[Misc] add reorder_batch AttentionMetadataBuilder ( #23798 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-30 06:41:45 -07:00
379ea2823a
Add LoRA support for DeepSeek models (V2, V3, R1-0528) ( #23971 )
...
Signed-off-by: sadeghja1070 <sadegh.ja1070@gmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-30 06:40:02 -07:00
3a6acad431
[Model] Enable encoder DP for MiniCPM-V ( #23948 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-30 06:31:26 -07:00
5490d633ce
[UT] fix unify_kv_cache_configs when kv cache config needs sort ( #23843 )
2025-08-30 11:22:14 +00:00
628d00cd7b
[Bugfix] Fix test_lora_resolvers.py ( #23984 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-30 11:16:11 +00:00
4071c76cf3
[V1] [Hybrid] Move MiniMaxLinearAttention into layers/mamba ( #23831 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-30 00:16:15 -07:00
f1bddbd852
[Core] Cleanup TPU model runner for MM ( #23894 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-30 00:14:58 -07:00
9748c5198b
[CI] Fix broken compile tests due to unsupported SiluMul+Nvfp4Quant fusion ( #23973 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-08-30 00:14:43 -07:00
ee52a32705
[CI] Move testing image from remote URL to S3 ( #23980 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-29 21:41:25 -07:00
8fb85b7bb6
Add routed_scaling_factor to MoE grouped topk ( #23123 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-29 21:36:48 -07:00
5b31cb1781
[Bugfix] Fix --config arg expansion called from api_server.py ( #23944 )
...
Signed-off-by: Jean-Francois Dube <dubejf+gh@gmail.com >
Co-authored-by: Jean-Francois Dube <dubejf+gh@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-29 21:36:39 -07:00
d660c98c1b
[CI] Fix unavailable image remote URL ( #23966 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-29 15:40:04 -07:00
5674a40366
[Misc] Make download_weights_from_hf more reliable ( #23863 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-29 12:37:24 -07:00
8c3e199998
Revert gemma3n fast prefill changes ( #23897 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-29 12:16:57 -07:00
1c26b42296
[Docs] [V1] [Hybrid] Add new documentation re: contributing mamba-based models ( #23824 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-29 18:47:58 +00:00
b7adf94c4a
Tuned H100/H200 triton fp8 block configs for fused_qkv_a_proj ( #23939 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-29 10:28:35 -07:00
4d7fe40fc0
[RL][BugFix] Fix missing tokenizer error for token-in-token-out ( #23904 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-30 01:09:55 +08:00
0dc9532065
[BUGFIX] fix undefined silu_and_mul_nvfp4_quant ( #23929 )
...
Signed-off-by: hongchao <hongchao@msh.team >
Signed-off-by: Richard Zou <zou3519@gmail.com >
Co-authored-by: hongchao <hongchao@msh.team >
Co-authored-by: Richard Zou <zou3519@gmail.com >
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com >
2025-08-29 09:36:39 -07:00
72a69132dc
[CI] Add aiter to matching list of issue auto labeller for rocm tag ( #23942 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-08-29 15:29:21 +00:00
d90d8eb674
[BugFix] Async scheduling and PP compatibility with DP ( #23770 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-29 08:17:27 -07:00
0a2f4c0793
[Models] Use in-place adds in Idefics2Vision ( #23932 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-08-29 07:42:57 -07:00
1cf3753b90
[MODEL] Apertus and XIELU ( #23068 )
...
Signed-off-by: EduardDurech <39579228+EduardDurech@users.noreply.github.com >
Co-authored-by: AllenHaoHuang <allenhuangdd@gmail.com >
2025-08-29 20:29:18 +08:00
4f7cde7272
Adds json_count_leaves utility function ( #23899 )
...
Signed-off-by: aditchawdhary <aditxy@hotmail.com >
2025-08-29 05:28:13 -07:00
67c14906aa
Update PyTorch to 2.8.0 ( #20358 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-29 18:57:35 +08:00
69f46359dd
[Multimodal] Consolidate mm inputs into MultiModalFeatureSpec ( #23779 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-08-29 18:36:57 +08:00
d9e00dbd1f
[Performance] V1 Classify Models E2E Performance Optimization ( #23541 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-29 03:12:32 -07:00
ad39106b16
[CPU] Enable data parallel for CPU backend ( #23903 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-29 02:19:58 -07:00
2554b27baa
[V0 Deprecation] Remove pooling model support in V0 ( #23434 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-29 00:04:02 -07:00
934bebf192
Better errors for Transformers backend missing features ( #23759 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-29 07:01:40 +00:00
885ca6d31d
[Misc] Fix warnings for mistral model ( #23552 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-08-29 06:58:48 +00:00
2d0afcc9dc
[mrope][Qwen2-VL] Fix edge case where getting index of image/video token can potentially throw in default vl mrope implementation. ( #23895 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-08-28 23:29:13 -07:00
b4f9e9631c
[CI/Build] Clean up LoRA test ( #23890 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-28 23:28:35 -07:00
05d839c19e
Fix(async): Add support for truncate_prompt_tokens in AsyncLLM ( #23800 )
2025-08-28 22:55:06 -07:00
6597d7a456
[Platform] import activation_quant_fusion for CUDA only ( #23882 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-08-28 22:54:16 -07:00
5264015d74
[BugFix][AMD][Deepseek] fix a dtype mismatch error for deepseek running on AMD ( #23864 )
...
Signed-off-by: Jinghui Zhang <jinghuizhang0804@gmail.com >
2025-08-28 22:54:12 -07:00
98ac0cb32d
[Bugfix] Use ReplicatedLinear for SequenceClassification head ( #23836 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-29 04:41:20 +00:00
c8b3b299c9
[tests] Improve speed and reliability of test_transcription_api_correctness ( #23854 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-29 04:25:33 +00:00
006477e60b
[ROCm][Fix] Fix rocm build caused by #23791 ( #23847 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-08-28 19:52:27 -07:00
de533ab2a1
[Models] Improve iteration over layers ( #19497 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-08-29 09:26:34 +08:00
235c9db8a7
[XPU] support data parallel for MoE models on XPU ( #22887 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-08-29 09:23:04 +08:00
b668055a11
[V0 Deprecation] Remove V0 Samplers test ( #23862 )
2025-08-28 18:05:52 -07:00
d3d2aad5a2
[Log] Use Debug Once for DeepGEMM E8M0 When not Enabled ( #23858 )
2025-08-28 22:18:10 +00:00
cb293f6a79
[V1] Enable prefill optimization for Gemma3n ( #22628 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-28 14:54:30 -07:00
7ffbf27239
[BugFix][FlashInfer] Fix potential race condition for paged_kv_indptr_cpu ( #23737 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-28 14:22:46 -07:00
27e88cee74
chore: build release image by default ( #23852 )
...
Signed-off-by: Codex <codex@openai.com >
2025-08-28 13:17:15 -07:00
16a45b3a28
[NVIDIA] Support SiluMul + NVFP4 quant fusion ( #23671 )
...
Signed-off-by: jindih <jindih@nvidia.com >
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: jindih <jindih@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedic <lgovedic@redhat.com >
2025-08-28 19:36:50 +00:00
57d4ede520
[bugfix] [spec-decoding] fix data race in sample_recovered_tokens_kernel (vLLM v1) ( #23829 )
...
Signed-off-by: He-Jingkai <he-jingkai@outlook.com >
2025-08-28 19:05:20 +00:00
04d1dd7f4a
[ROCm][Aiter] Add triton fp8 bmm kernel for mla ( #23264 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com >
2025-08-28 18:18:08 +00:00
f32a5bc505
Migrate Llama4ImagePatchInputs to TensorSchema ( #22021 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-28 17:29:37 +00:00
8805ad9fa9
Add scale_config.yml file for Meta autoscalers for GH Actions ( #23840 )
...
Signed-off-by: Jean Schmidt <contato@jschmidt.me >
2025-08-28 09:31:20 -07:00
0583578f42
[ci] breaks down V1 Test into 3 groups of approx 30 minutes runtime ( #23757 )
...
Signed-off-by: Jean Schmidt <contato@jschmidt.me >
2025-08-28 08:59:19 -07:00
db74d60490
[Bugfix] Add fake mode around passes ( #23349 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2025-08-28 11:25:56 -04:00
95089607fa
[Model][gpt-oss] Support DP+EP for GPT-OSS with FlashInfer trtllm-gen MoE ( #23819 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
2025-08-28 06:56:20 -07:00
1f096f9b95
[CI] Fix linting error on main ( #23835 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-28 06:52:01 -07:00
66548f6603
[Bugfix] Fix benchmark_moe.py for blockwise fp8. ( #23823 )
...
Signed-off-by: crischeng <420985011@qq.com >
Co-authored-by: cris <grace@guisenbindeMacBook-Pro.local >
2025-08-28 21:44:09 +08:00
d3da2eea54
[Doc]: fix typos in Python scripts ( #23828 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-28 05:37:38 -07:00
bfab219648
[Model] [gpt-oss] fix gpt-oss pp support ( #23815 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-28 05:36:55 -07:00
a3432f18fd
[BugFix][Spec Decode] Use float64 for uniform_probs ( #23803 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-28 12:26:45 +00:00
67cee40da0
[CI/Build][Bugfix] Fix Qwen VL tests on CPU ( #23818 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-28 11:57:05 +00:00
d99c3a4f7b
[Doc]: fix typos in .md files (including those of #23751 ) ( #23825 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-28 04:38:19 -07:00
3462c1c522
[FIXBUG] Add return_success parameter to moe_wna16_weight_loader function ( #22797 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-28 09:03:22 +00:00
c5d004aaaf
[Model] Add PP support and VLM backbone compatibility for GPT-OSS ( #23680 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-28 16:03:28 +08:00
11a7fafaa8
[New Model]: Support GteNewModelForSequenceClassification ( #23524 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-28 15:36:42 +08:00
186aced5ff
[Kernel] cuda kernels for upcoming decode context parallel feature ( #23791 )
...
Co-authored-by: hongchao <hongchao@msh.team >
2025-08-28 15:29:11 +08:00
daa1273b14
[Bugfix] when set offline model running error ( #23711 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-28 07:27:45 +00:00
c07a73317d
[CI] enable idefics3 and fuyu-8b test in multimodal test ( #23790 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-28 14:51:24 +08:00
22feac8e95
[Transform] [Quantization] Add transforms to compressed tensors ( #22486 )
2025-08-28 02:43:48 -04:00
c8851a4723
Add deprecation warning for lora_extra_vocab_size ( #23635 )
...
Signed-off-by: Jinheng Li <ahengljh@gmail.com >
2025-08-27 22:34:29 -07:00
f48a9af892
[CI] make all multi-gpu weight loading tests run nightly ( #23792 )
...
Signed-off-by: Alex Yun <alexyun04@gmail.com >
2025-08-27 21:27:36 -07:00
a11adafdca
Gracefully handle edge cases in harmony utils ( #23155 )
...
Signed-off-by: Jan Kessler <jakessle@uni-mainz.de >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-27 20:14:00 -07:00
a781e84ec2
[Perf] Tune configs for triton block fp8 gemm H100/H200 ( #23748 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-28 11:12:53 +08:00
1b7b161a09
[Feature] models: pass layer prefix to replace_linear_class for per-layer quantization routing. Addresses #23239 ( #23556 )
...
Signed-off-by: Shrey Gupta <shreyg1303@gmail.com >
2025-08-27 20:12:44 -07:00
a69693e38f
Migrate Qwen inputs to TensorSchema ( #23473 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-28 10:43:26 +08:00
5da4f5d857
[Bugfix] Fix for V1 priority scheduling crashes at preemption ( #23713 )
...
Signed-off-by: Hanchenli <lihanc2002@gmail.com >
2025-08-28 00:44:52 +00:00
321938e9ac
[Feature] Add VLLM_DISABLE_PAD_FOR_CUDAGRAPH to Avoid Hang Issue ( #23595 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-27 21:52:24 +00:00
f9ca2b40a0
[Bugfix] Fix Marlin NVFP4 for modelopt ( #23659 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-27 17:48:16 -04:00
082cc07ef8
DP/EP Support for gpt-oss with deepep-ht comm kernel on SM100 ( #23608 )
2025-08-27 17:33:21 -04:00
853c371fc3
[V1][Mamba] - Enable V1 by default for Mamba Models ( #23650 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
2025-08-27 20:53:30 +00:00
8bf6266a17
[Multimodal] Generate mm_hash based on request metadata when caching is turned off ( #23690 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-27 20:24:31 +00:00
0585a9e73c
Disable torch.compile for dynamic rope models in Transformers backend ( #23738 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-27 19:03:05 +00:00
3c0ef769ba
ci: Add arm64 docker build to release pipeline ( #23210 )
...
Signed-off-by: Eli Uriegas <eliuriegas@meta.com >
Signed-off-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com >
2025-08-27 10:41:48 -07:00
4e4d017b6f
[Docs] Fix warnings in mkdocs build (continued) ( #23743 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
Signed-off-by: Hyogeun Oh (오효근) <ohg3417@gmail.com >
2025-08-27 17:17:29 +00:00
dd58932280
[V1] [Hybrid] Enable compile and piecewise CUDA graph for MiniMax-Text models ( #22589 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-27 10:05:16 -07:00
52883ed084
[Model] Merge SupportsMultiModalWithRawInput with SupportsMultiModal ( #23749 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-27 10:01:50 -07:00
4f35be10a9
[BugFix] Fix topk_softmax assert ( #19764 )
...
Signed-off-by: Luka Govedic <lgovedic@redhat.com >
2025-08-27 09:47:28 -07:00
2b61d2e22f
[Docs] Remove in-tree Gaudi install instructions ( #23628 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-27 09:22:21 -07:00
3ce8285d6d
[LogitsProcs] Deduplicate built-in LP implementation logic ( #23362 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-27 23:11:33 +08:00
83f555f637
[Doc]: upgrade version of crate-ci tool for improved typo detection ( #23755 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-27 07:59:34 -07:00
841490434a
[Model] Enable native HF format InternVL support ( #23742 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-27 14:45:17 +00:00
3af47c3cc6
[Feature] Add Hopper DeepGEMM E8M0 for DeepSeekV3.1 scale_fmt ( #23666 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-08-27 14:09:08 +00:00
513c1fe255
Only run get_attr_docs if generating help text ( #23723 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-27 13:55:12 +00:00
fe8d7b6f03
[Model] Interface to enable batch-level DP support ( #23733 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-27 06:41:22 -07:00
16dc4052b0
Fix pre-commit on main ( #23747 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-27 06:39:48 -07:00
8dd2baa597
Add vLLM Korea Meetup in the README.md and meetups.md ( #23746 )
...
Signed-off-by: rebel-hongseok <hongseok@rebellions.ai >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-27 06:25:49 -07:00
5eeef1b908
[Model] Explicit default_pooling_type interface ( #23736 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-27 13:24:09 +00:00
704432af3c
[V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-based models ( #23716 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-27 12:51:54 +00:00
a403d0fa41
[Misc] Remove unnecessary _send_reconfig_message() in core_client.py ( #23127 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-27 05:50:47 -07:00
8c13820f0b
[Bugfix] Fix task field initialization when PYTHONOPTIMIZE is enabled ( #23718 )
...
Signed-off-by: cndoit18 <cndoit18@outlook.com >
2025-08-27 12:42:20 +00:00
9d30de4469
[model] Support MiniCPM-V 4.5 ( #23586 )
...
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Signed-off-by: Xin Yang <xyangx@amazon.com >
Signed-off-by: Abatom <abzhonghua@gmail.com >
Signed-off-by: chzhang <chaojun.zhang@intel.com >
Signed-off-by: Pate Motter <patemotter@google.com >
Signed-off-by: Terrencezzj <terrence@cohere.ai >
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Signed-off-by: siyuanf <siyuanf@nvidia.com >
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com >
Signed-off-by: Zijing Liu <liuzijing2014@users.noreply.github.com >
Signed-off-by: jiabin.00 <jiabin.00@bytedance.com >
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: tc-mb <157115220+tc-mb@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Huy Do <huydhn@gmail.com >
Signed-off-by: Matúš Námešný <matus.namesny@ameria.com >
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: oye93 <en.ouyang93@outlook.com >
Signed-off-by: Julien Lin <jullin@nvidia.com >
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Signed-off-by: Tianyu Li <tianyu.li@arm.com >
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Zerohertz <ohg3417@gmail.com >
Signed-off-by: Hyogeun Oh (오효근) <ohg3417@gmail.com >
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
Signed-off-by: Federico <65908512+coval3nte@users.noreply.github.com >
Signed-off-by: Zixuan Zhang <zixuanzhang@bytedance.com >
Signed-off-by: wuhang <wuhang6@huawei.com >
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
Signed-off-by: Wei Wei <wwei6@meta.com >
Signed-off-by: Yiheng Xu <charlesyihengxu@gmail.com >
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Zhonghua Deng <abzhonghua@gmail.com >
Co-authored-by: Chaojun Zhang <chaojun.zhang@intel.com >
Co-authored-by: Pate Motter <p@temotter.com >
Co-authored-by: Terrence Zhao <32208165+Terrencezzj@users.noreply.github.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: weiliang <weiliangl@nvidia.com >
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Zijing Liu <liuzijing2014@users.noreply.github.com >
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com >
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Raghavan <oneraghavan@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Co-authored-by: Huy Do <huydhn@gmail.com >
Co-authored-by: Matúš Námešný <matus@namesny.com >
Co-authored-by: Guillaume Calmettes <gcalmettes@scaleway.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: En Ouyang <en.ouyang93@outlook.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
Co-authored-by: nvjullin <jullin@nvidia.com >
Co-authored-by: Didier Durand <2927957+didier-durand@users.noreply.github.com >
Co-authored-by: TianyuLi0 <116711075+TianyuLi0@users.noreply.github.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Yuekai Zhang <zhangyuekai@foxmail.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: Hyogeun Oh (오효근) <ohg3417@gmail.com >
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
Co-authored-by: Federico <65908512+coval3nte@users.noreply.github.com >
Co-authored-by: zixuanzhang226 <zixuanzhang@bytedance.com >
Co-authored-by: wuhang <wuhang6@huawei.com >
Co-authored-by: yzds <41983536+youzhedian@users.noreply.github.com >
Co-authored-by: hongchao <hongchao@msh.team >
Co-authored-by: czhu-cohere <conway.zhu@cohere.com >
Co-authored-by: Wei <weiweinpu@gmail.com >
Co-authored-by: Yiheng Xu <charlesyihengxu@gmail.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Chenheli Hua <huachenheli@outlook.com >
Co-authored-by: CSWYF3634076 <58356743+CSWYF3634076@users.noreply.github.com >
2025-08-27 05:38:00 -07:00
1f7a9c95e4
[Docs] Fix a 1-2-3 list and style issues in tpu.md ( #23729 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-08-27 05:37:52 -07:00
8f0d7eaea8
[XPU] Fix OOM issue for data parallel with Ray backend ( #22500 )
...
Signed-off-by: Fanli Lin <fanli.lin@intel.com >
Signed-off-by: Fanli Lin <fanli0116@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-27 19:57:38 +08:00
e03940762b
[CI/Build] Reduce LoRA layer test cases ( #23721 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-27 10:59:35 +00:00
11eddf02f0
[FlashInfer] Cache hyper params in metadata builder ( #23732 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-27 03:45:04 -07:00
04ff1e43fb
[Misc] Move CpuGpuBuffer to vllm/v1/utils.py ( #23728 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-27 03:25:00 -07:00
6578e87365
Optimize input preparation for FlashInfer [2/N] ( #23174 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-27 02:52:45 -07:00
5bd9f84158
[Docs] Fix an admonition important ( #23726 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-08-27 02:50:09 -07:00
91e382c935
[CI/Build] Remove redundant register in model init tests ( #23715 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-27 08:11:15 +00:00
6446677839
[XPU]fix cuda event used in XPU model runner ( #23708 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-27 07:27:14 +00:00
69244e67e6
[Core] Use key-only cache for BaseMultiModalProcessor ( #23018 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-27 14:19:13 +08:00
8dbf6ed7be
[Bugfix] fix when config.yaml config value is list parse error ( #23528 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-27 05:54:39 +00:00
9de25c294b
[CI/Build] Remove redundant LoRA model tests ( #23706 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-27 05:51:50 +00:00
fce10dbed5
[XPU] Add xpu torch.compile support ( #22609 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-27 05:33:27 +00:00
d272415e57
[Quantization] Expand compressed-tensors MoE matching logic to support NFP4 + FP8 MoEs ( #22674 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2025-08-27 05:00:21 +00:00
142ac08030
[Frontend] Optimize beam search performance by limiting concurrency ( #23599 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-27 04:59:14 +00:00
3210264421
[Frontend] Add --log-error-stack to print stack trace for error response ( #22960 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-27 04:58:59 +00:00
644d57d531
[Model] Add Ernie4.5 VL Model Support ( #22514 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2025-08-26 21:02:55 -07:00
c905684cfe
[Core] Asynchronous h2d in merge_multimodal_embeddings via pinned memory. ( #23686 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-08-26 20:05:34 -07:00
786835807b
[Bugfix]: Qwen3 Coder Tool Parser ( #23099 )
...
Signed-off-by: Yiheng Xu <charlesyihengxu@gmail.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
2025-08-26 19:58:32 -07:00
fecbb7c782
[Bugfix][gpt-oss] passing the cache config in gpt-oss ( #23613 )
...
Signed-off-by: Wei Wei <wwei6@meta.com >
2025-08-27 02:54:23 +00:00
6dab89b8ec
[Docs] Fix math rendering in docs ( #23676 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 18:47:08 -07:00
de02b07db4
[Bugfix] Lazy import gpt_oss_triton_kernels_moe for mxfp4 ( #23678 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-27 09:34:57 +08:00
eb1995167e
[gpt-oss] Enable unit test for response API harmony integration ( #23533 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-26 18:23:26 -07:00
2c2b140ae8
[quantization] use channel scales for w4a8 + misc fixes ( #23570 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-08-26 18:23:23 -07:00
c7c80af084
fix pynccl reduce_scatter ( #23648 )
...
Co-authored-by: hongchao <hongchao@msh.team >
2025-08-26 18:21:11 -07:00
6891205b16
[Feature][Responses API] Support MCP tool in background mode ( #23494 )
...
Signed-off-by: wuhang <wuhang6@huawei.com >
2025-08-27 01:06:58 +00:00
b1625dbe9c
feat: add triton fused moe config for GLM-4.5-Air-FP8 on B200 ( #23695 )
...
Signed-off-by: Zixuan Zhang <zixuanzhang@bytedance.com >
2025-08-26 18:06:10 -07:00
585e0bde36
[Bugfix] UnboundLocalError when GptOss reasoning specified ( #23054 )
...
Signed-off-by: Federico <65908512+coval3nte@users.noreply.github.com >
2025-08-27 00:29:52 +00:00
714872f1a9
[Compile] Fix Cmake Warning ( #23689 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-26 23:48:32 +00:00
5f1af97f86
[V1] [Hybrid] Enable Full CUDA graph by default for hybrid models in V1 ( #22594 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-26 23:28:55 +00:00
c3b0fd1ee6
[V1][P/D]P2pNcclConnector supports flashinfer ( #23536 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-26 22:56:16 +00:00
6421b66bf4
[Docs] Move quant supported hardware table to README ( #23663 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 22:26:46 +00:00
2f13319f47
Enhance the pre-notification policy ( #23532 )
...
Signed-off-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
2025-08-26 20:41:36 +00:00
d696f86e7b
[doc] Hybrid KV Cache Manager design doc ( #22688 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 20:19:05 +00:00
9816b81f5f
[Model] Enable video support for InternVL3.5 models ( #23658 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-26 19:46:52 +00:00
c37c0af990
[Misc] Fix comments in tests/kernels/quantization ( #23675 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-26 19:31:20 +00:00
9715f7bb0f
[Bugfix] Fix incorrect original shape in hashing ( #23672 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-08-26 19:01:25 +00:00
98aa16ff41
[v1] Add cross-attention KV cache support for encoder-decoder models ( #23664 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-26 18:49:06 +00:00
227e231b55
[Docs] [V1] [Hybrid] Update docs to remove FlashInfer constraint for hybrid models ( #23665 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-26 18:33:16 +00:00
730d0ac8b9
[Docs] Fix warnings in mkdocs build ( #23649 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
Signed-off-by: Hyogeun Oh (오효근) <ohg3417@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 18:19:23 +00:00
9b0187003e
[Bugfix] Fix cuda event usage with CPU model runner ( #23643 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-26 17:10:42 +00:00
44ac25eae2
[CI] [Doc]: Add GH Action for auto labeling issues with rocm tag ( #20988 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-26 16:20:13 +00:00
7ea22e42d5
[Misc] Add override for allreduce fusion thresholds ( #23639 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
2025-08-26 15:53:04 +00:00
9d4183dd2e
[model] support qwen2audio embedding input ( #23625 )
...
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-26 23:48:08 +08:00
513298f1b4
[Bugfix] fix bf16 multimodal model hash ( #23623 )
...
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-26 23:47:50 +08:00
379f828fba
[Docs] Reduce requirements for docs build ( #23651 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 15:43:28 +00:00
1fdc732419
[ROCm] Starting to add AMD code reviewers for ROCm components ( #23496 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2025-08-26 07:32:37 -07:00
f58675bfb3
[CPU] add cpu fused moe pytorch native implementation ( #23146 )
...
Signed-off-by: Tianyu Li <tianyu.li@arm.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-08-26 14:09:17 +00:00
7c04779afa
[Doc]: fix various spelling issues in multiple files ( #23636 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-26 14:05:29 +00:00
f66673a39d
[Kernel] Added flashinfer fp8 per-tensor gemms ( #22895 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-26 06:54:04 -07:00
b78bed1bc5
[Hardware][Mac] Fix the installation failure for Apple Silicon (CPU) ( #23565 )
...
Signed-off-by: oye93 <en.ouyang93@outlook.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-08-26 13:04:25 +00:00
164b2273c8
[Docs] Fix broken links to docs/api/summary.md ( #23637 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 13:00:18 +00:00
2b4fc9bd9b
Support FlashAttention Backend for Hybrid SSM Models ( #23299 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-26 12:41:52 +00:00
ebd5a77bb5
feat: add usage to TranscriptionResponse (text and json response_format) ( #23576 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-08-26 05:26:26 -07:00
384dd1b0a8
[Bugfix] Add missing enable_log_outputs parameter to init_app_state function ( #23634 )
...
Signed-off-by: Matúš Námešný <matus.namesny@ameria.com >
2025-08-26 12:13:15 +00:00
fdeb3dac13
[Model] fix DeepSeek e_score_correction_bias dtype to fp32 ( #23640 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-26 20:09:47 +08:00
d52358c1e0
[Perf] Remove duplicated NVFP4 blockscales to save memory ( #23379 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-26 19:16:33 +08:00
6ace2f72b0
Fix writing benchmark results with tuple keys ( #23633 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-08-26 19:16:09 +08:00
b00e69f8ca
Fix nits from #20059 ( #23548 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 03:27:20 -07:00
50fede6634
[V1] Enable V1 for compute capability < 8.0 + FP32 ( #23614 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-26 03:00:18 -07:00
b5d34af328
[Bugfix] Fix scheduling when repeated images in one request ( #23544 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com >
2025-08-26 09:46:28 +00:00
9b5f64238f
[Bugfix] Fix Qwen25VL packed_modules_mapping ( #23604 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-26 01:09:14 -07:00
ff77764f86
Fix CLI parameter documentation inconsistency in pooling_models.md ( #23630 )
2025-08-26 01:05:37 -07:00
bfc1edc9f5
[Docs] Fix titles for multi-file examples that are rendered in the docs ( #23573 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 00:16:44 -07:00
3ecbb14b81
[Benchmarks] add benchmark for embedding models ( #23000 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-25 23:57:08 -07:00
7d67a9d9f9
[mypy] Fix incorrect type hint for EAGLE3 support ( #23617 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 23:50:17 -07:00
959783fb99
[fix] fix seed-oss-parser ( #23560 )
...
Signed-off-by: jiabin.00 <jiabin.00@bytedance.com >
2025-08-25 23:16:36 -07:00
ce0e9dbd43
[CI/Build] Fix typo in #23561 ( #23616 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 23:13:03 -07:00
b395b3b0a3
[Disagg][Perf] Use CUDA event sync instead of blocking tolist to avoid unintentional copy ops blocking across different CUDA streams, improving disagg TTIT/TTFT ( #22760 )
...
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com >
Signed-off-by: Zijing Liu <liuzijing2014@users.noreply.github.com >
2025-08-25 21:06:00 -07:00
6fad29b11b
Remove graph_pool as member of VllmBackend and argument to CUDAGraphWrapper ( #23385 )
...
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-08-25 19:34:15 -07:00
6fd45e7b8a
[CI/Build] Use vLLM client's user agent to fetch images ( #23561 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 19:34:12 -07:00
56dcf4e7e9
[Bug] Fix DeepGEMM Env Control ( #23591 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-25 18:41:21 -07:00
ae067888d6
Update Flashinfer to 0.2.14.post1 ( #23537 )
...
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Signed-off-by: siyuanf <siyuanf@nvidia.com >
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-25 18:30:44 -07:00
906e461ed6
[CI Fix] Pin deepep and pplx tags in tools/ep_kernels/, gate multigpu tests ( #23568 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-25 18:29:00 -07:00
2a97ffc33d
[Misc] Add release note draft to PR template ( #23598 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-08-25 16:44:51 -07:00
efc88cf64a
[Misc] Simplify FlashInfer attention metadata ( #23585 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-08-25 15:42:29 -07:00
7b6a837275
[Docs] Update Documentation of Cohere Command-A Models ( #23584 )
...
Signed-off-by: Terrencezzj <terrence@cohere.ai >
Signed-off-by: Abatom <abzhonghua@gmail.com >
Co-authored-by: Zhonghua Deng <abzhonghua@gmail.com >
2025-08-25 21:53:52 +00:00
c34c82b7fe
[TPU][Bugfix] Fixes prompt_token_ids error in tpu tests. ( #23574 )
...
Signed-off-by: Pate Motter <patemotter@google.com >
2025-08-25 14:29:16 -07:00
8a044754bd
[XPU] Delay BF16 check to worker init for spawn compatibility ( #22979 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-08-25 13:09:26 -07:00
9188ae7cb5
[Bugfix][V1][P/D]Fix the issue where repeated requests for the same input produce abnormal outputs for P2pNcclConnector ( #23403 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-08-25 12:57:08 -07:00
8a3cd90af5
[Kernel] Add fused grouped_topk kernel for MoE ( #23274 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-08-25 11:47:52 -07:00
2a167b2eeb
[test][RL] Add sleep level 2 test and fix reload with sleep mode ( #23521 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-26 00:25:52 +08:00
0ff902f3b4
[Refactor] Refactor persistent buffers with CpuGpuBuffer ( #23515 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-25 08:44:48 -07:00
a9082a4d14
[Bugfix] Fix Qwen3 MoE GPTQ inference ( #23490 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-25 06:40:20 -07:00
e0329ed4b4
Updates to Flex + vLLM integration ( #21416 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-08-25 09:32:42 -04:00
6879cd80ae
[Refactor] Pass tokenizer explicitly instead of binding to prompt update ( #23542 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 06:31:57 -07:00
e269be2ba2
[Doc] Add caution for API server scale-out ( #23550 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 06:14:15 -07:00
5c4b6e66fe
[Attention] Unify mamba and attention backend selection ( #23171 )
...
Signed-off-by: Ayush Satyam <ayushsatyam146@gmail.com >
2025-08-25 09:09:36 +00:00
d0a4a3f645
[misc] add shanghai meetup ( #23535 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-08-25 17:00:03 +08:00
ebafb0936d
[Bugfix] Allow dynamic number of patches for llava_onevision ( #23525 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 08:34:54 +00:00
0cb7b065c3
Feature/benchmark/random mm data/images ( #23119 )
...
Signed-off-by: breno.skuk <breno.skuk@hcompany.ai >
2025-08-25 01:28:35 -07:00
2da02dd0d8
[Fix] DeepSeek V3.1 tool parser error message ( #23492 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-08-25 00:56:39 -07:00
d765cf01fe
[Core][Multimodal] Track encode cache entries by mm_hash and enable embedding sharing between requests ( #22711 )
...
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-08-25 00:41:17 -07:00
712d0f88d8
[Refactor] Dynamic target and content for prompt updates ( #23411 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-24 23:39:58 -07:00
49ab23b3cc
[gpt-oss] use reasoning channel for reasoning text in serving_chat ( #22920 )
...
Signed-off-by: Yu Guo <yuguo@meta.com >
2025-08-25 06:29:34 +00:00
c9abb10489
[Bugfix] Fix Dense module loading for sentence-transformers embedding models (simplified V2) ( #23408 )
...
Signed-off-by: FFFfff1FFFfff <yifanli0919@gmail.com >
2025-08-25 05:39:24 +00:00
787cdb3829
Migrate DonutImagePixelInputs to TensorSchema ( #23509 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-25 05:02:15 +00:00
a5203d04df
Migrate skyworkr1v inputs to TensorSchema ( #23499 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-25 04:43:21 +00:00
99f8094400
Migrate tarsier inputs to TensorSchema ( #23500 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-25 04:42:36 +00:00
170e8ea9ea
[Misc] Unified linear print info ( #23516 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-24 20:13:51 -07:00
a71e4765cc
[Bugfix] Fix Qwen2.5-VL quantized model weights loading ( #23512 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2025-08-25 10:40:22 +08:00
39971db3aa
Frontend: Adding LM Format Enforcer support to V1 engine ( #22564 )
...
Signed-off-by: Noam Gat <noamgat@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-24 19:31:22 -07:00
504d914314
[Perf] Add Triton config for DeepSeek V3 FP8 EP32 H200 ( #23504 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-08-24 18:06:35 -07:00
47455c424f
[Doc]: fix various typos in multiple files ( #23487 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-25 00:04:04 +00:00
c7fc6b1354
fix incompatibility with non cuda platform for nvfp4 ( #23478 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-08-24 15:35:41 -07:00
ad78868450
[Misc] Remove unused slot_mapping buffer ( #23502 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-24 14:03:36 -07:00
e2db1164a1
[Model] Enable BLOOM on V1 ( #23488 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-24 13:30:47 +00:00
416f05929a
[New Model]Donut model ( #23229 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-08-24 12:52:24 +00:00
5e021b4981
(Misc): add missing test for zero truncation size. ( #23457 )
...
Signed-off-by: teekenl <teekenlau@gmail.com >
2025-08-24 18:12:47 +08:00
1b9b16649c
[Misc] update dict parse to EPLBConfig from json dumps to dict unpacking ( #23305 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-24 08:06:34 +00:00
e76e233540
[kernel] Support W4A8 on Hopper ( #23198 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-08-24 06:18:04 +00:00
a75277285b
Migrate Paligemma inputs to TensorSchema ( #23470 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-24 04:56:56 +00:00
9dc30b7068
[Bugfix] Add strong reference to CUDA pluggable allocator callbacks ( #23477 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Eric Marcus <eric.marcus@kaiko.ai >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-08-24 12:56:17 +08:00
053278a5dc
Migrate Pixtral inputs to TensorSchema ( #23472 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-24 04:55:53 +00:00
c55c028998
[gpt-oss] Streaming Output for Python Tool ( #23409 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-24 04:42:38 +00:00
65197a5fb3
[Misc] Modify CacheConfig import ( #23459 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-23 06:05:27 +00:00
b8f17f5d98
Support DeepSeek-V3.1 tool call ( #23454 )
...
Signed-off-by: Xu Wenqing <xuwq1993@qq.com >
2025-08-23 05:50:16 +00:00
d9a55204ba
fix(tests): Correct unreachable assertion in truncation test ( #23425 )
...
Signed-off-by: AzizCode92 <azizbenothman76@gmail.com >
2025-08-23 05:23:54 +00:00
b4e9fd811f
Revert "[PERF] Use faster way of decode in tokenizer: avoid useless list-to-list conversion ( #20000 )" ( #23396 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-23 04:16:48 +00:00
308fa287a8
Add glm4.5v tp2,4 fp8 config on H100_80GB ( #23443 )
...
Co-authored-by: Chenxi Yang <cxyang@meta.com >
2025-08-23 02:54:19 +00:00
fa78de9dc3
Quantization: support FP4 quantized models on AMD CDNA2/CDNA3 GPUs ( #22527 )
...
Signed-off-by: feng <fengli1702@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-22 20:53:21 -06:00
f6818a92cb
[UX] Move Dockerfile DeepGEMM install to tools/install_deepgemm.sh ( #23360 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-22 20:52:50 -06:00
23c939fd30
[Model] Support DP for ViT on MiniCPM-V-4 ( #23327 )
...
Signed-off-by: ycyaw66 <497410282@qq.com >
Co-authored-by: ycyaw66 <497410282@qq.com >
2025-08-23 02:14:41 +00:00
add1adfec7
[BugFix] Fix MinPLogitsProcessor.update_states() ( #23401 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-23 08:22:11 +08:00
c80c53a30f
[BugFix] Fix batch updates for pooling models ( #23398 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-23 08:20:41 +08:00
24d0c9e6ed
[NVIDIA][torch.compile] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel ( #22703 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-08-22 22:09:05 +00:00
cc7ae5e7ca
[BugFix][AMD][Quantization] Fix torch.compile issue where wvSplitKQ is not called when it should be when using a quantized FP8 model ( #22281 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-08-22 21:47:57 +00:00
0313cf854d
[PERF] PyTorch Symmetric Memory All-Reduce ( #20759 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-22 15:39:08 -06:00
0483fabc74
[CI/Build] add EP dependencies to docker ( #21976 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-22 13:34:40 -07:00
da65bec309
add an env var for path to pre-downloaded flashinfer cubin files ( #22675 )
2025-08-22 19:25:45 +00:00
4645024d3a
[Quantization] Allow GGUF quantization to skip unquantized layer ( #23188 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-22 13:04:22 -06:00
cd7a3df26f
[Bugfix] Fix broken Florence-2 model ( #23426 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-08-22 17:50:52 +00:00
32d2b4064f
[Model] Add Ovis2.5 PP support ( #23405 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-22 17:46:34 +00:00
22cf679aad
[Doc]: fix various typos in multiple files ( #23179 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-22 10:38:46 -07:00
b6d7d34fc6
Add unit tests for batched guided and non-guided requests ( #23389 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-22 10:31:24 -07:00
341923b982
fix(tests): Ensure reliable CUDA cache clearing in MoE test ( #23416 )
...
Signed-off-by: AzizCode92 <azizbenothman76@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-22 17:20:59 +00:00
424fb7a5d2
[BugFix] Fix the issue where image embeddings were incorrectly split.… ( #23366 )
...
Signed-off-by: bppps <bpppsaka@gmail.com >
Co-authored-by: zouyu.zzx <zouyu.zzx@alibaba-inc.com >
Co-authored-by: bppps <bpppsaka@gmail.com >
2025-08-22 16:56:46 +00:00
88491c1b6b
[Speculators][Speculative Decoding] Fix Qwen 2 Eagle3 Support ( #23337 )
2025-08-22 16:39:19 +00:00
613a23b57f
[Bugfix]: Installing dev environment due to pydantic incompatible version ( #23353 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2025-08-22 16:22:29 +00:00
51a215300b
[Fix] Bump triton version in rocm-build requirements ( #21630 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
2025-08-22 15:13:39 +00:00
ebe14621e3
[Bug fix] Dynamically setting the backend variable for genai_perf_tests in the run-nightly-benchmark script ( #23375 )
...
Signed-off-by: Naman Lalit <nl2688@nyu.edu >
2025-08-22 15:12:28 +00:00
325aa3dee9
[Misc] local import code clean ( #23420 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-22 14:01:35 +00:00
a073be6d87
[Doc] Update the doc for log probs + prefix caching ( #23399 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-22 13:20:39 +00:00
695e7adcd2
[misc] Remove outdated comment about runai_model_streamer ( #23421 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
2025-08-22 13:08:53 +00:00
281710ef9a
[Attention] Allow V1 flash_attn to support cross-attention ( #23297 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-22 12:10:16 +00:00
808d2e9aa0
[Misc] Move M-RoPE init logic to _init_mrope_positions ( #23422 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-22 03:07:22 -07:00
285178b3b8
[V0 Deprecation] Remove V0 LoRA test ( #23418 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-22 09:56:51 +00:00
88016c372a
[Bugfix] Fix pooling models on CPU backend ( #23392 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-22 09:47:17 +00:00
998720859c
Migrate MiniCPMOAudioInputs to TensorSchema ( #21847 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-22 16:43:29 +08:00
0ba1b54ac6
[gpt-oss] add input/output usage in responses api when harmony context is leveraged ( #22667 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-08-22 08:32:24 +00:00
53415653ff
[P/D][Nixl] Make kv cache register compatible with hybrid memory allocator ( #23079 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-08-21 22:30:48 -07:00
17373dcd93
[Attention] Refactor AttentionMetadata Preparation for Encoder-only Models ( #23154 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-22 05:05:59 +00:00
5964069367
[New Model] Add Seed-Oss model ( #23241 )
...
Signed-off-by: jiabin.00 <jiabin.00@bytedance.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-22 04:58:10 +00:00
de9c085e17
[Misc] Add gemma3 chat template with pythonic-style function calling ( #17149 )
...
Signed-off-by: Philip Chung <philip.f.chung@gmail.com >
2025-08-21 21:06:50 -07:00
111692bb8c
[CI] Add end-to-end V1 min_tokens test coverage ( #22495 )
...
Signed-off-by: Arjun Reddy <189282188+arjunbreddy22@users.noreply.github.com >
Co-authored-by: Arjun Reddy <189282188+arjunbreddy22@users.noreply.github.com >
2025-08-21 22:04:07 -06:00
394591e343
[Feature] Enable DeepGEMM Linear on B200; 1.5% E2E throughput improvement ( #23351 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-21 21:01:08 -07:00
3ac849665d
[CI/Build] Skip Idefics3 and SmolVLM generation test again ( #23356 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-22 03:39:46 +00:00
0b9cc56fac
Migrate MllamaImagePixelInputs to TensorSchema ( #22020 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-22 11:28:49 +08:00
8896eb72eb
[Deprecation] Remove prompt_token_ids arg fallback in LLM.generate and LLM.embed ( #18800 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-22 10:56:57 +08:00
19fe1a0510
[Kernel] Add FP8 support with FlashMLA backend ( #22668 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-08-22 02:26:32 +00:00
480bdf5a7b
[Core] Support custom executor qualname ( #23314 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-22 09:40:54 +08:00
5368f76855
[Feature][Responses API] Support logprobs(non-stream) ( #23319 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-08-21 23:09:16 +00:00
8ef6b8a38c
Always use cache mounts when installing vllm to avoid populating pip cache in the image. Also remove apt cache. ( #23270 )
...
Signed-off-by: Valentyn Tymofieiev <valentyn@google.com >
2025-08-21 18:01:03 -04:00
3bbe11cc13
[Perf] Small optimizations for silu_mul_fp8_quant_deep_gemm ( #23265 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-21 17:56:15 -04:00
c5041f899f
[CI] improve pr comments bot ( #23380 )
2025-08-21 14:49:03 -07:00
8b5fe6eb51
[CI] Clean up actions: remove helm, publish workflows and improve pr … ( #23377 )
2025-08-21 14:29:04 -07:00
800349c2a5
[Structured Outputs] Refactor bitmask construction into get_grammar_bitmask ( #23361 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-21 20:53:33 +00:00
044931f97b
Make sure that vectorize_with_alignment produced vectorized global loads ( #23182 )
2025-08-21 20:06:54 +00:00
1d353b6352
[Core] Always use tensor cores for Flashinfer Decode Wrapper ( #23214 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-08-21 16:02:11 -04:00
3496274663
[Misc] Convert VLLM_TORCH_PROFILER_DIR path to absolute ( #23191 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-21 15:49:09 -04:00
8a19303173
[BugFix][gpt-oss] Fix Chat Completion with Multiple Output Message ( #23318 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-21 10:31:11 -07:00
603fbbbce0
[Misc] Misc code cleanup/simplification ( #23304 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-21 17:22:55 +00:00
10f535c086
[Bugfix] Fix port conflict by obtaining a list of open ports upfront ( #21894 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-08-21 10:22:18 -07:00
48bfb0c9b7
[Bug] Fix R1 Accuracy 0 Bug ( #23294 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-21 13:11:28 -04:00
f8ce022948
add tg-mxfp4-moe-test ( #22540 )
...
Signed-off-by: siyuanf <siyuanf@nvidia.com >
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-21 17:05:47 +00:00
0278f1ac3a
Fix nvfp4 swizzling ( #23140 )
...
Signed-off-by: yiliu30 <yi4.liu@intel.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-08-21 16:54:50 +00:00
a482e4e769
Migrate MolmoImageInputs to TensorSchema ( #22022 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-21 16:54:08 +00:00
e0b056e443
[ci/build] Fix abi tag for aarch64 ( #23329 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-08-21 23:32:55 +08:00
79f05e4436
[Multimodal] Always enable hashing mm data ( #23308 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-21 07:23:28 -07:00
f8daddcc4c
[Bugfix] set system_message in phi4mini chat template ( #23309 )
...
Signed-off-by: zhuangqh <zhuangqhc@gmail.com >
2025-08-21 14:22:39 +00:00
c8e33c72c6
[V1] Remove unnecessary check for main thread ( #23298 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-08-21 14:08:35 +00:00
d70a16625d
[Performance] V1 Pooling Models E2E Performance Optimization ( #23162 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-21 13:26:09 +00:00
5cc54f7c5b
[Doc] Fix batch-level DP example ( #23325 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-08-21 06:16:38 -07:00
0c6e40bbaa
[Refactor] Simplify code for MM budget ( #23310 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-21 08:00:16 +00:00
2e2000f352
[Model] Add LFM2 architecture ( #22845 )
...
Signed-off-by: Paul Pak <paulpak58@gmail.com >
2025-08-21 09:35:07 +02:00
31282401b6
[BugFix] Fix Python 3.9 Support ( #23306 )
...
Signed-off-by: Jared O'Connell <46976761+jaredoconnell@users.noreply.github.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-20 23:23:56 -07:00
0c31e28e95
[Bugfix] Fix extra whitespace in strings caused by newline ( #23272 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 22:03:00 -07:00
f571ff8eb6
[Sampler] Support returning final logprobs ( #22387 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-20 21:28:32 -07:00
f64ee61d9e
[CI] Block the cu126 wheel build while broken ( #23285 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-21 04:21:05 +00:00
8993073dc1
[CI] Delete images older than 24h. ( #23291 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-08-20 21:15:20 -07:00
655a09f653
[Model][VLM] Support R-4B Model ( #23246 )
...
Signed-off-by: yannqi <yannqi@qq.com >
Signed-off-by: 杨奇(yann qi) <51905299+yannqi@users.noreply.github.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: yannqiyang <yannqiyang@tencent.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-21 04:08:52 +00:00
f94bf9b924
[Compile] Fix Compile Warning SM100 Cutlass MLA ( #23287 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-21 03:09:39 +00:00
3663870c72
[V1][Mamba1] - Full CUDA and Piecewise CUDA Graphs Support ( #23035 )
...
Signed-off-by: asafg <asafg@ai21.com >
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
Co-authored-by: asafg <asafg@ai21.com >
2025-08-20 20:08:51 -07:00
2461d9e562
[CI/Build] Split out mm processor tests ( #23260 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 20:05:20 -07:00
7be5d113d8
[CPU] Refactor CPU W8A8 scaled_mm ( #23071 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-21 09:34:24 +08:00
b029de9902
[Optimization] Make new_block_ids None if empty ( #23262 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-08-20 18:25:56 -07:00
bbea1cefdd
[CI Bugfix] Fix CI by fully removing --enable-prompt-adapter ( #23284 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-20 17:18:12 -07:00
f5aa307d77
Remove duplicate entry in vllm.attention.__all__ ( #23296 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-20 17:14:59 -07:00
4b795020ed
[EP] Add logging for experts map ( #22685 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-20 23:46:06 +00:00
c86af22f31
[Fix] remove is_marlin param in benchmark_moe ( #23286 )
2025-08-20 22:04:21 +00:00
10cc12ba66
Feature/mla tests ( #23195 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-08-20 21:46:47 +00:00
a4fbb32fab
Remove chunked_prefill_enabled flag in V1 MLA ( #23183 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-08-20 21:43:17 +00:00
1b125004be
[misc] fix multiple arch wheels for the nightly index ( #23110 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-08-20 14:15:34 -07:00
4fbda0b20c
[Feature] use --eplb_config to set eplb param ( #20562 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: rongfu.leng <lenronfu@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-20 14:07:28 -07:00
4e51fa8cba
Do not use eval() to convert unknown types ( #23266 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-20 13:28:30 -07:00
bf7c99dfc4
[Perf] Speed up function _convert_tokens_to_string_with_added_encoders by 13.7x ( #20413 )
...
Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com >
Signed-off-by: Aseem Saxena <aseem.bits@gmail.com >
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com >
2025-08-20 13:17:11 -07:00
b95697d731
[Frontend] improve error logging of chat completion ( #22957 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-20 13:03:37 -07:00
582bbe6bd7
[Fix] correct tool_id for kimi-k2 when use tool_choice=required ( #21259 )
...
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
2025-08-20 12:59:54 -07:00
0cdbf5e61c
[Kernel/Quant] Remove the original marlin format and qqq ( #23204 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-20 15:13:36 -04:00
ebe56a0064
Small fix for Command-A-Vision ( #23268 )
...
Signed-off-by: donglu <donglu@cohere.com >
2025-08-20 18:15:18 +00:00
f77a0802b7
Limit HTTP header count and size ( #23267 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2025-08-20 17:57:37 +00:00
c4477f55e5
Migrate Mistral3ImagePixelInputs to TensorSchema ( #21945 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-20 17:37:29 +00:00
dfd2382039
[torch.compile] Support conditional torch.compile per module ( #22269 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-20 16:52:59 +00:00
3b11b26b50
[FIXBUG ] Allow disabling rocm_aiter_fa backend for ROCm GPUs not compatible with AITER ( #22795 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-08-20 09:08:29 -07:00
d6d13bd49e
[Misc] Add max_seq_len to CommonAttentionMetadata ( #23216 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-20 09:05:29 -07:00
5efd6905bc
[CLI][Doc] Formalize --mm-encoder-tp-mode ( #23190 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 23:42:28 +08:00
b17109beea
[Kernel] CUTLASS MoE FP8: Integrate cuda moe permute/unpermute ( #23045 )
...
Signed-off-by: Shixian Cui <shixian@amazon.com >
2025-08-20 10:35:26 -04:00
4449235843
[Bugfix] Ensure correctness of HCXVision processing ( #23254 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 14:19:30 +00:00
38217877aa
[Fix] fix offline env use local mode path ( #22526 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-20 13:34:49 +00:00
c6d80a7a96
[Model] Improve olmo and olmo2 ( #23228 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-20 12:47:05 +00:00
7cd17e22d7
[Model][V1] Support Ernie MTP ( #22169 )
...
Signed-off-by: zhouchong <zhouchong03@baidu.com >
Co-authored-by: zhouchong <zhouchong03@baidu.com >
2025-08-20 20:41:55 +08:00
50df09fe13
Update to flashinfer-python==0.2.12 and disable AOT compile for non-release image ( #23129 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-20 08:05:54 -04:00
68fcd3fa73
[Bugfix] Ensure correctness of Cohere2Vision processing ( #23245 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 11:09:18 +00:00
83e69a09d6
[Model] Support deepseek with eagle ( #21086 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2025-08-20 19:01:31 +08:00
3aa8c10038
Fix missing quotes ( #23242 )
...
Signed-off-by: Shiming Zhang <wzshiming@hotmail.com >
2025-08-20 10:46:59 +00:00
103f1ec8d3
[Model] use autoWeightsLoader for gptoss ( #22446 )
...
Signed-off-by: calvin chen <wen.chen@dynamia.ai >
2025-08-20 10:16:27 +00:00
d983769c41
fix cuda graph ( #22721 )
...
Signed-off-by: fsx950223 <fsx950223@outlook.com >
2025-08-20 06:24:37 +00:00
8fd920924c
[BugFix] Fix stuck stats/metrics after requests are aborted ( #22995 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-20 13:50:29 +08:00
de7b67a023
[CI/Build] Sync multimodal tests ( #23181 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 05:06:42 +00:00
f729023272
[CI/Build] Also check DP in benchmarks throughput script ( #23038 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-20 04:09:27 +00:00
1a3079a15e
chore: support pytorch format in lora ( #22790 )
...
Signed-off-by: jaeeun.kil <rha3122@naver.com >
Signed-off-by: 길재은 <rha3122@naver.com >
2025-08-20 04:02:50 +00:00
941f56858a
Fix a performance comparison issue in Benchmark Suite ( #23047 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
Signed-off-by: Louie Tsai <louie.tsai@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
2025-08-20 03:14:32 +00:00
a634733f67
[Attention] Optimize make_local_attention_virtual_batches for Flash Attention ( #23185 )
...
Signed-off-by: linzebing <linzebing1995@gmail.com >
2025-08-20 02:57:47 +00:00
64ab3c7253
[Doc] Update V1 status of various pooling models ( #23189 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 10:33:41 +08:00
e58c5a9768
[Core] Add torch profiler CPU traces for AsyncLLM. ( #21794 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-08-20 02:32:47 +00:00
d46d417b58
[CI Perf] Only test bfloat16 for tests/compile/test_fusion_all_reduce.py ( #23132 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-19 20:18:52 -06:00
0167efe20d
[Core] Optimize scheduler request removal for single completions ( #21917 )
...
Signed-off-by: chiliu <chiliu@paypal.com >
Signed-off-by: chiliu <cliu_whu@yeah.net >
Co-authored-by: chiliu <chiliu@paypal.com >
2025-08-19 18:25:59 -07:00
c32e6ad1f6
[Quantization] Bump Compressed Tensors Version ( #23202 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-20 00:39:28 +00:00
1630cc8d0f
[Benchmarks] Add video inputs to ShareGPTDataset. ( #23199 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-08-19 23:42:31 +00:00
14e2b0730b
[BugFix] fix CUTLASS MLA full cudagraph ( #23200 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-19 22:17:08 +00:00
0f4f0191d8
[CI/Build] Replace lm-eval gsm8k tests with faster implementation ( #23002 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-19 15:07:30 -07:00
a38b8af4c3
[NVIDIA] Add SM100 Flashinfer Cutlass MoE fp8 backend ( #22357 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
2025-08-19 18:01:53 -04:00
21dce80ea9
[CI/Build] Add support for Python 3.13 ( #13164 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 13:49:34 -07:00
e61bac87ee
[Misc] Minor refactoring for FlashInfer backend ( #23147 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-19 13:11:51 -07:00
80141bbf2f
fix: use cache_salt for gpt-oss ( #23186 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
2025-08-19 18:12:25 +00:00
b94faf9d50
[Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 and modelopt. ( #23125 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-19 14:00:51 -04:00
5b5f350d67
[Misc] Enable yapf for FlashInfer backend ( #23193 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-19 10:33:47 -07:00
f7cf5b512e
[Frontend] Add /collective_rpc API endpoint ( #23075 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-19 17:29:32 +00:00
03d4235fd2
[Misc] Fix the benchmark's README and improve the error messages for the benchmark's argument checks ( #22654 )
...
Signed-off-by: tanruixiang <tanruixiang0104@gmail.com >
2025-08-19 10:18:51 -07:00
d6a1a20973
[CI/Build] Update transformers to v4.55.2 ( #23093 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-19 10:06:17 -07:00
a70d0bd0a3
Migrate LlavaOnevisionMultiInputs to TensorSchema ( #21844 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-19 17:02:02 +00:00
24f4d1a224
Add return_token_ids parameter to OpenAI API endpoints ( #22587 )
...
Signed-off-by: Yuge Zhang <scottyugochang@gmail.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-19 09:48:31 -07:00
4f510bc2a1
[Model] Removes redundant all-reduce operation in Qwen3MoeSparseMoeBlock ( #23169 )
...
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com >
2025-08-19 16:18:41 +00:00
1298c67795
[FEAT] [Performance] Enable DP for ViT in Qwen2.5VL ( #22742 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-19 15:25:57 +00:00
4d9c61993a
[Bugfix] Fix benchmark_moe.py ( #23177 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-19 13:39:40 +00:00
b87cb97a53
[Model] support new model ovis2.5 ( #23084 )
...
Signed-off-by: myselvess <244285088@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-19 13:12:59 +00:00
f856c33ce9
[Model] Add multi_label_classification support ( #23173 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-19 12:54:30 +00:00
03752dba8f
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel ( #21716 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-08-19 08:22:15 -04:00
40f26734b9
[Misc] Fix seq_lens for graph capture ( #23175 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-19 03:58:16 -07:00
2c3f557f08
[Doc] use power of 2 ( #23172 )
2025-08-19 03:16:23 -07:00
21bcc8263f
[Misc] Avoid accessing req_ids inside a loop ( #23159 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-19 09:39:38 +00:00
5bfe0dea7a
[bug fix] Fix llama4 spec decoding ( #22691 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2025-08-19 08:53:24 +00:00
31fd3265c8
[Bugfix] Fix broken Minimax-01-VL model ( #22116 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-19 08:49:29 +00:00
31436e8b4f
[Misc] Add request_id into benchmark_serve.py ( #23065 )
...
Signed-off-by: yangxia <yangxiast@gmail.com >
2025-08-19 08:32:18 +00:00
4efd43e9b4
Fix GLM-4.5V-FP8 numerical issue ( #22949 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 07:56:31 +00:00
3c8a787247
[Benchmark] Add flag --served-model-name to benchmark_serving_multi_turn ( #22889 )
...
Signed-off-by: daniels <daniels@pliops.com >
2025-08-19 07:48:07 +00:00
01a08739e0
[misc] split engine_model into json file for nsys profile tool ( #23117 )
...
Signed-off-by: Grace Ho <grho@nvidia.com >
Signed-off-by: Grace Ho <146482179+gracehonv@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 15:44:53 +08:00
fda9537c5e
[Model] Support Pipeline Parallelism for moonshotai/Kimi-VL-A3B-Thinking-2506 ( #23114 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 14:24:31 +08:00
90bbe0a5ad
[Log] Warning Once for Cutlass MLA ( #23137 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-18 23:24:16 -07:00
e75f342261
Migrate InternVLImagePixelInputs (in nemotron_vl.py) to TensorSchema ( #22023 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 13:48:26 +08:00
78dba404ad
[Hardware][IBM Z]Enable v1 for s390x and s390x dockerfile fixes ( #22725 )
...
Signed-off-by: Nikhil Suryawanshi <suryawanshin74@gmail.com >
2025-08-19 04:40:37 +00:00
e9d6a3db69
[TPU] make ptxla not imported when using tpu_commons ( #23081 )
...
Signed-off-by: Chengji Yao <chengjiyao@gmail.com >
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: Chengji Yao <chengjiyao@gmail.com >
2025-08-19 11:46:42 +08:00
a4454e9401
chore: disable enable_cpp_symbolic_shape_guards ( #23048 )
...
Signed-off-by: Xiao Liu <xiszishu@gmail.com >
2025-08-18 23:08:05 -04:00
14006840ea
[V0 Deprecation] Remove V0 FlashInfer attention backend ( #22776 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-18 19:54:16 -07:00
6603288736
[CI][V0 Deprecation] Removed V0 Only Chunked Prefill and Prefix Caching Tests ( #22871 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-18 17:39:01 -07:00
95e3095136
[Misc] Add @tdoublep as a maintainer of hybrid model and Triton-attention related code ( #23122 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-19 08:31:38 +08:00
c9b38be8aa
[Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT ( #23041 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-18 17:20:38 -07:00
0dd3f4f5ab
[Misc] Minor refactoring for prepare_inputs ( #23116 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-18 16:58:05 -07:00
498259ccce
Install tpu_info==0.4.0 to fix core dump for TPU ( #23135 )
2025-08-18 16:23:33 -07:00
6d25e3fd6e
Use Blackwell FlashInfer MXFP4 MoE by default if available ( #23008 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-18 15:25:49 -07:00
ac6eb49de3
fix: OpenAI SDK compat (ResponseTextConfig) ( #23126 )
...
Signed-off-by: breno.skuk <breno.skuk@hcompany.ai >
Signed-off-by: Breno Baldas Skuk <breno.skuk@hcompany.ai >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-18 15:22:59 -07:00
bf756321c7
[CI Bugfix] Pin openai<1.100 to unblock CI ( #23118 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-18 12:14:01 -07:00
0e3bb543f0
[Bugfix] Support compile for Transformers multimodal ( #23095 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2025-08-18 13:35:48 +00:00
569aefd134
chore: remove unnecessary patch_padding_side for the chatglm model ( #23090 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-18 12:32:13 +00:00
d3f71f1224
[Refactor] Get prompt updates earlier ( #23097 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-18 12:31:53 +00:00
5a30bd10d8
[Bugfix] fix IntermediateTensors equal method ( #23027 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-18 02:58:11 -07:00
27e8d1ea3e
[Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs ( #23053 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-18 09:52:00 +00:00
5c79b0d648
[XPU][CI]add xpu env vars in CI scripts ( #22946 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-18 09:47:03 +00:00
5f5664b3e4
[XPU] Fix compile size for xpu ( #23069 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-18 00:04:08 -07:00
89657a557c
[Misc] Fix backward compatibility from #23030 ( #23070 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-08-17 23:33:29 -07:00
08d5f7113a
[Misc] refactor function name ( #23029 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-17 22:16:21 -07:00
b2fd0b81e0
[Bugfix][CI] Machete kernels: deterministic ordering for more cache hits ( #23055 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2025-08-17 22:10:26 -07:00
9f1c642254
[Bugfix] fix Qwen2.5-Omni processor output mapping ( #23058 )
...
Signed-off-by: double7 <33449816+DoubleVII@users.noreply.github.com >
Co-authored-by: 杨森 <yangsen.double7@bytedance.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-17 22:09:11 -07:00
7be3a59d8e
[Misc] enhance static type hint ( #23059 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-17 22:09:08 -07:00
8ea0c2753a
[Misc] Minor code cleanup for _get_prompt_logprobs_dict ( #23064 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-17 18:16:03 -07:00