728c365e4d
Use uv to install python in Dockerfile
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-10-02 11:05:47 -04:00
be8921fbba
Change size of single CUDA graph for CI to 4 ( #26089 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-10-02 14:14:28 +00:00
d4e7a1152d
Update base image to 22.04 (jammy) ( #26065 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-10-02 05:48:04 -07:00
be22bb6f3d
Run:ai model streamer add GCS package support ( #24909 )
...
Signed-off-by: Peter Schuurman <psch@google.com >
2025-10-01 20:59:13 -07:00
169313b9f8
[Misc] Make handling of SamplingParams clearer in n>1 case ( #26032 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-10-01 19:31:39 -07:00
0b018d8baf
[ROCm][Bugfix] Add missing parameter to ROCm backend ( #26029 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-10-01 19:23:14 -07:00
c31246800c
Support RL online quantization with torchao ( #23014 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-10-01 16:39:29 -07:00
4134312b35
[BugFix] ChunkedLocalAttention is currently not CG compatible ( #26034 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-10-01 16:28:00 -07:00
da554f932e
[Bug] Fix Negative Cuda Memory Usage ( #25683 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-10-01 18:16:26 -04:00
aac622e0cd
[ROCm][Build] Add support for AMD Ryzen AI MAX / AI 300 Series ( #25908 )
...
Signed-off-by: Hosang Yoon <hosang.yoon@amd.com >
2025-10-01 21:39:49 +00:00
1726e93ef1
[BugFix][DP/EP] Fix CUTLASS MLA hang under load ( #26026 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-10-01 12:30:00 -07:00
ee04c0cd04
[CI] Tweaks to GPT-OSS Eval (Blackwell) for stability ( #26030 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-10-01 12:02:17 -07:00
c36f0aa300
Fix test_mamba_ssm_ssd.py due to missing _query_start_loc_to_chunk_indices_offsets ( #25995 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
2025-10-01 18:18:36 +00:00
5234dc7451
[NVIDIA] Blackwell Family ( #24673 )
...
Signed-off-by: Johnny <johnnynuca14@gmail.com >
Signed-off-by: johnnynunez <johnnynuca14@gmail.com >
Signed-off-by: Johnny <johnnync13@gmail.com >
Signed-off-by: Salvatore Cena <cena@cenas.it >
Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com >
Co-authored-by: Salvatore Cena <cena@cenas.it >
2025-10-01 10:50:54 -07:00
3b7c20a6b5
[Bugfix] Apply same sampling parameters for both n=1 and n>1 ( #26005 )
...
Signed-off-by: Kenichi Maehashi <maehashi@preferred.jp >
2025-10-01 14:37:35 +00:00
f9e714813a
[Benchmark] Finish documented v0.11.0 deprecation of --endpoint-type ( #26007 )
...
Signed-off-by: Nathan Scott <nathans@redhat.com >
2025-10-01 12:41:57 +00:00
2518230d3e
[MISC] Fix misleading batch_size_capture_list when cuda_graph_sizes < 4 ( #25829 )
...
Signed-off-by: billishyahao <bill.he@amd.com >
Co-authored-by: Luka Govedic <ProExpertProg@users.noreply.github.com >
2025-10-01 08:39:45 -04:00
a332b84578
[CI] Only capture a single CUDA graph size in CI by default ( #25951 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-10-01 10:03:44 +01:00
1405f0c7ba
[Misc] Factor out common _apply_feature_select_strategy ( #26003 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-10-01 01:31:03 -07:00
84d57342b6
[BugFix][MM] Fix Nonetype error when video is cache in qwen2.5-omni-thinker ( #26004 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-10-01 08:03:25 +00:00
57b46d769e
[Doc] updating torch.compile doc link ( #25989 )
...
Signed-off-by: nadathurv <work.vnadathur@gmail.com >
Signed-off-by: WorldExplored <srreyansh.sethi@gmail.com >
Co-authored-by: Srreyansh Sethi <107075589+WorldExplored@users.noreply.github.com >
2025-10-01 07:04:56 +00:00
f48b6a03ba
[Misc]allow disable pynccl ( #25421 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-10-01 06:04:13 +00:00
2a69ab4899
Update to Transformers v4.56.2 ( #24638 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-30 22:07:07 -07:00
8d7da92fd7
[BugFix] Fix default kv-cache-dtype default for DeepseekV3.2 ( #25988 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-30 21:58:31 -07:00
e952eee698
[Bugfix] Fix __syncwarp on ROCM ( #25996 )
2025-09-30 21:15:11 -07:00
66bca9b8bd
[MM] Add text-only mode for Qwen3-VL ( #26000 )
2025-09-30 21:13:42 -07:00
99028fda44
Fix INT8 quantization error on Blackwell GPUs (SM100+) ( #25935 )
...
Signed-off-by: padg9912 <phone.and.desktop@gmail.com >
2025-09-30 19:19:53 -07:00
1244948885
[Log] Optimize Log for FP8MOE ( #25709 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-30 19:18:43 -07:00
a73f6491c8
Update launch_bounds_utils.h for correct compile on Multiple Cuda Arch - PTXAS out of range Warning ( #25843 )
...
Signed-off-by: Salvatore Cena <cena@cenas.it >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-30 19:18:19 -07:00
001e50c92c
[Model] MTP fallback to eager for DeepSeek v32 ( #25982 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-10-01 01:53:22 +00:00
96ebcaa3ad
[Misc] Make EP kernels install script support uv ( #25785 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-30 23:38:34 +00:00
5db1870bb9
[gpt-oss] use vLLM instead of openai types for streaming ( #25186 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
Signed-off-by: Andrew Xia <axia@fb.com >
Co-authored-by: Andrew Xia <axia@fb.com >
2025-09-30 22:47:07 +00:00
2ce26b9b5d
[Docs] Remove API Reference from search index ( #25949 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-30 22:10:02 +00:00
a388252ac4
Add explicit pooling classes for the Transformers backend ( #25322 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-30 23:07:06 +01:00
9a9f48dff7
[V1] [P/D] Add Support for KV Load Failure Recovery ( #19330 )
...
Signed-off-by: David Ben-David <davidb@pliops.com >
Co-authored-by: David Ben-David <davidb@pliops.com >
2025-09-30 14:57:08 -07:00
67f3fb0844
[Bench] Add DeepSeekV32 to MoE benchmark ( #25962 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-30 14:13:48 -07:00
43b752c325
[Llama4] [multimodal] Fix misplaced dtype cast of cos_sin_cache in Llama4VisionRotaryEmbedding ( #25889 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2025-09-30 20:35:15 +00:00
cfd302db9b
OffloadingConnector: Fix GPU block tracking bug ( #25856 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-30 19:53:04 +00:00
fb610ae684
[Docs] Add moe kernel features doc ( #25297 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Signed-off-by: bnellnm <49004751+bnellnm@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-30 19:03:15 +00:00
2f652e6cdf
[Doc] Improve MM Pooling model documentation ( #25966 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-30 18:58:29 +00:00
e6a226efba
[Bug] Fix AttributeError: 'QKVParallelLinear' object has no attribute 'orig_dtype' ( #25958 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-30 11:13:03 -07:00
a2e6fa7e03
[bugfix][deepseek] fix flashmla kernel selection ( #25956 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-10-01 00:30:36 +08:00
9f1c4ecaf2
[Bugfix] Token type and position embeddings fail to be applied to inputs_embeds ( #25922 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-10-01 00:23:12 +08:00
ef283548f7
[Bugfix] Fix accuracy issue of TRTLLM FP8 MOE and improve logging ( #25895 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-09-30 10:51:31 -04:00
f4db5e6de1
[Bugfix][Model] Fix inference for Hunyuan dense models ( #25354 )
...
Signed-off-by: anion <1005128408@qq.com >
Signed-off-by: Anion <123177548+Anionex@users.noreply.github.com >
2025-09-30 14:38:07 +00:00
099aaee536
Add Hugging Face Inference Endpoints guide to Deployment docs ( #25886 )
...
Signed-off-by: sergiopaniego <sergiopaniegoblanco@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-30 14:35:06 +00:00
35fe398c7c
[Kernel][Moe Configs] Add more tuned triton configs for ExpertsInt8 and FP8 ( #25858 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
2025-09-30 07:30:44 -07:00
bb6d43047e
[Fix] Improve CPU backend compatibility for RISC-V ( #25816 )
...
Signed-off-by: lyd1992 <liuyudong@iscas.ac.cn >
Signed-off-by: ihb2032 <1355790728@qq.com >
2025-09-30 13:48:07 +00:00
bc546f76a1
[CI] Move applicable tests to CPU ( #24080 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-30 14:45:20 +01:00
80608ba5af
[NIXL] Add support for MLA caches with different latent dim ( #25902 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-09-30 12:18:29 +00:00
e184c9c510
[perf] Use CPU tensor to reduce GPU->CPU sync ( #25884 )
...
Signed-off-by: Lehua Ding <lehuading@tencent.com >
2025-09-30 19:51:16 +08:00
d7e34b4210
[Model] Move vision_feature_select_strategy into resolve_visual_encoder_outputs ( #25938 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-30 11:24:57 +00:00
ef6e0e7132
[Bugfix][Model]fix ernie45 moe gate&bias dtype to float32 ( #25936 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2025-09-30 19:11:21 +08:00
1ad3aca682
Updated TRL integration docs ( #25684 )
...
Signed-off-by: sergiopaniego <sergiopaniegoblanco@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-30 03:10:55 -07:00
8d0afa9b42
[Doc] Add Cambricon MLU support ( #25942 )
...
Signed-off-by: a120092009 <zhaoty0121@gmail.com >
2025-09-30 17:59:47 +08:00
fa7e254a7f
[New Model] DeepSeek-V3.2 (Rebased to Main) ( #25896 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Signed-off-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com >
Signed-off-by: Lucia Fang <fanglu@meta.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com >
Co-authored-by: Lucia Fang <fanglu@meta.com >
Co-authored-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Xiaozhu Meng <mxz297@gmail.com >
Co-authored-by: Barry Kang <43644113+Barry-Delaney@users.noreply.github.com >
2025-09-30 17:14:41 +08:00
e23cacda35
[Bugfix]: Clean up chunked prefill logging when using whisper ( #25075 )
...
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com >
2025-09-30 08:17:49 +00:00
2e1b8bc2b6
[Model][Bugfix] Fix MiDashengLM audio encoder mask by removing incorrect logical_not ( #25925 )
...
Signed-off-by: zhoukz <me@zhoukz.com >
2025-09-30 08:15:23 +00:00
e47433b3c1
[BugFix] Pass config_format via try_get_generation_config ( #25912 )
2025-09-30 05:09:50 +00:00
23194d83e8
[BugFix] Fix DP/EP hang ( #25906 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-30 04:18:59 +00:00
61aedb5ffe
Move VllmConfig from config/__init__.py to config/vllm.py ( #25271 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-29 19:49:49 -07:00
d3bd171123
[Benchmark] Support benchmark throughput for external launcher DP ( #25913 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-30 01:43:57 +00:00
89e4050af4
[Bug] Fix Weight Loading for Block FP8 Cutlass SM90 ( #25909 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-30 09:15:19 +08:00
78a47f87ce
Test Prompt Embeds/LoRA compatibility and Enable LoRA Support for OPT Models ( #25717 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-30 08:10:58 +08:00
6a113d9aed
[V0 Deprecation] Remove vllm.worker and update according imports ( #25901 )
2025-09-29 23:26:11 +00:00
2e4fe48c37
[NIXL] Increase default KV block eviction timeout on P ( #25897 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-29 21:35:14 +00:00
8eb0a1d906
[Doc] Polish example for torchrun dp ( #25899 )
2025-09-29 21:31:34 +00:00
fea3e476aa
[Kernel] Chunk-aligned mamba2 ( #24683 )
2025-09-29 23:18:25 +02:00
61a3431613
[Bugfix][ROCm] Fixing trying to import non-existent symbols from libnccl.so ( #25605 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-29 17:01:50 -04:00
9bedac9623
[Doc] Add documentation for vLLM continuous benchmarking and profiling ( #25819 )
...
Signed-off-by: Naman Lalit <nl2688@nyu.edu >
2025-09-29 20:49:49 +00:00
c42ff4f4fd
[BugFix][torch.compile] KV scale calculation issues with FP8 quantization ( #25513 )
...
Signed-off-by: adabeyta <aabeyta@redhat.com >
2025-09-29 15:52:04 -04:00
d5ab28511c
[Bugfix] Use correct key "ignore" for config.json non-quantized layers ( #25706 )
...
Signed-off-by: Lee Nau <lnau@nvidia.com >
2025-09-29 15:07:29 -04:00
e61eb5e09d
[Model] Remove MotifForCausalLM ( #25866 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-30 00:36:30 +08:00
0899ba5b42
[CI/Build] Include Transformers backend test in nightly transformers test ( #25885 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-29 09:33:39 -07:00
145ac73317
[Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue ( #25883 )
...
Signed-off-by: Rahul Tuli <rtuli@redhat.com >
2025-09-29 11:37:20 -04:00
d0d138bc55
[Nixl][P/D] Add cuda2cpu support (HD->DH transfer) ( #24690 )
...
Signed-off-by: Chenxi Yang <cxyang@fb.com >
Co-authored-by: Chenxi Yang <cxyang@fb.com >
2025-09-29 14:31:51 +00:00
43227236ec
[torch.compile] serialize cudagraph_mode as its enum name instead of value ( #25868 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-29 13:54:52 +00:00
8616300ae2
[Model][Bugfix] Fix issues in MiDashengLM implementation for quantized models ( #25854 )
...
Signed-off-by: zhoukz <me@zhoukz.com >
2025-09-29 10:59:04 +00:00
edbaadd91f
[Bugfix] Fix requirements paths in install instructions ( #25827 )
...
Signed-off-by: yingjun-mou <renzomou@gmail.com >
2025-09-29 03:49:35 -07:00
9360d34fa1
update to latest deepgemm for dsv3.2 ( #25871 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-29 17:51:43 +08:00
1b67b04656
[Misc] Remove more get_input_embeddings_v0 ( #25857 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-29 08:03:37 +00:00
bd51f78e39
[V0 Deprecation][Models] Remove all V0 condition for mm embeddings merge ( #25331 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: isotr0py <2037008807@qq.com >
2025-09-29 14:09:18 +08:00
65ecb4f134
[Bugfix] Fallback ViT attn backend to SDPA for blackwell ( #25851 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-29 06:03:51 +00:00
143844fa43
[XPU]Fix xpu spec decoding UTs, avoid using cuda graph ( #25847 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-29 05:15:10 +00:00
219cfbe7f6
Add Phi4FlashForCausalLM to _PREVIOUSLY_SUPPORTED_MODELS ( #25832 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-09-29 05:08:17 +00:00
9b44a7d926
[P/D] NIXL Updates ( #25844 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: rentianyue-jk <rentianyue-jk@360shuke.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: rentianyue-jk <rentianyue-jk@360shuke.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Chenheli Hua <huachenheli@outlook.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-09-29 04:46:30 +00:00
a3ae45a38c
[Misc] fix tests failure by using current_platform ( #25825 )
...
Signed-off-by: Juechen Liu <jueliu@meta.com >
2025-09-29 04:18:57 +00:00
0307428d65
Remove redundant cudagraph dispatcher warning ( #25841 )
2025-09-28 17:12:42 -04:00
471997adf6
[Bugfix] fix Qwen3VLMoe load when pp > 1 ( #25838 )
...
Signed-off-by: liuye.hj <liuye.hj@alibaba-inc.com >
Co-authored-by: liuye.hj <liuye.hj@alibaba-inc.com >
2025-09-28 17:56:12 +00:00
b1ded114b9
Update GLM-4.5 Doc transformers version ( #25830 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-09-28 12:05:51 +00:00
f4e4088c99
Fix random dataset mismatched token length with config. ( #24937 )
...
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-28 08:23:44 +00:00
0efd540dbc
[VLM] Update Qwen3-VL max_num_video_tokens calculation for configurable video profiling ( #25557 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-28 04:21:01 +00:00
6144754014
[Bugfix] Fix Qwen3-VL regression from #24982 ( #25814 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-28 03:21:09 +00:00
69311446ba
[MM] Optimize memory profiling for scattered multimodal embeddings ( #25810 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-28 02:17:58 +00:00
da63274d9f
[Bugfix][NIXL] Fix Async Scheduler timeout issue ( #25808 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-27 15:17:35 -04:00
c216119d64
[Core] GC Debug callback ( #24829 )
...
Signed-off-by: Jialin Ouyang <jialino@meta.com >
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
Co-authored-by: Jialin Ouyang <jialino@meta.com >
2025-09-27 17:53:31 +00:00
5546acb463
[Bug]: Set LD_LIBRARY_PATH to include the 'standard' CUDA location ( #25766 )
...
Signed-off-by: Clayton Coleman <smarterclayton@gmail.com >
2025-09-27 13:36:28 -04:00
c0ec81836f
[torch.compile]: Add VLLM_DEBUG_DUMP_PATH environment variable ( #25651 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-27 16:09:00 +00:00
b65e56babe
[Core] Refactor self.model() to call a helper for subclassing. ( #25084 )
...
Signed-off-by: Patrick Toulme <ptoulme@meta.com >
Signed-off-by: Patrick Toulme <pctoulme+1@gmail.com >
2025-09-27 08:40:59 -07:00
49996cd597
[env] default nixl side port conflicts with kv-event zmq port ( #25056 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-09-27 15:02:40 +00:00
ecb37e276a
[docs] transcriptions API audio upload ( #25446 )
...
Signed-off-by: zxw <1020938856@qq.com >
2025-09-27 15:00:35 +00:00
a5354b3ed2
[Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models ( #24982 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-09-27 14:22:28 +00:00
f9df8b4ad7
[Bugfix] Fix triton import precommit failure ( #25803 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-09-27 07:13:11 -07:00
ec152c8748
Fix GPTQ model loading in Transformers backend ( #25770 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-27 12:18:20 +00:00
7977e5027c
Add filtering for chat template kwargs ( #25794 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-27 10:46:49 +00:00
3f5d902d2a
Validate API tokens in constant time ( #25781 )
...
Signed-off-by: rentianyue-jk <rentianyue-jk@360shuke.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: rentianyue-jk <rentianyue-jk@360shuke.com >
2025-09-27 18:09:26 +08:00
27d7638b94
[Bugfix] Merge MM embeddings by index instead of token IDs ( #16229 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-27 08:15:12 +00:00
176173989a
[Bugfix] Add missing image_size for phi4_multimodal ( #25796 )
2025-09-27 07:59:22 +00:00
23b8ee672d
[Misc] Update openai client example file for multimodal ( #25795 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-27 07:57:07 +00:00
3939152069
[Misc] Fix codeowners override for v1 sample and attention ( #25037 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-27 07:47:29 +00:00
cd87bfbf37
[CI/Build] Reorganize root-level V1 tests ( #25767 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-27 13:51:15 +08:00
b3613e3ace
[CI/Build] Add timing to Model Executor Test ( #25799 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-26 21:57:27 -07:00
d346ec695e
[CI/Build] Consolidate model loader tests and requirements ( #25765 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-26 21:45:20 -07:00
c242c98031
[Bugfix] Allow Only SDPA Backend for ViT on B200 for Qwen3-VL ( #25788 )
2025-09-26 20:44:52 -07:00
f1d53d150c
[Multimodal][Speculative Decoding]Eagle Eagle3 mm support, enablement on qwen2.5vl ( #22872 )
...
Signed-off-by: Junhong <liujunhong11@huawei.com >
Signed-off-by: Junhong Liu <98734602+LJH-LBJ@users.noreply.github.com >
Co-authored-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: LJH-LBJ <98734602+LJH-LBJ@users.noreply.github.com >
2025-09-27 03:35:47 +00:00
92da847cf5
Add flashinfer-build.sh and register precompiled cu128 wheel in Dockerfile ( #25782 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-26 18:54:09 -07:00
3958b96bf5
Add option to restrict media domains ( #25783 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Chenheli Hua <huachenheli@outlook.com >
2025-09-27 01:23:52 +00:00
8bf8f45822
[Core] Don't count preempted tokens in prefix cache hit rate ( #25787 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-27 00:16:40 +00:00
6f5c0931c1
[Spec decode] automatically disable mm for text-only draft models ( #25667 )
...
Signed-off-by: Jonas Kuebler <kuebj@amazon.com >
2025-09-27 08:10:21 +08:00
4e33a7ea85
[Bugfix] Optimize CpuGpuBuffer initialization ( #25447 )
...
Signed-off-by: Naman Lalit <nl2688@nyu.edu >
2025-09-27 08:07:36 +08:00
dc48ba0c75
Kernel-override Determinism [1/n] ( #25603 )
...
Signed-off-by: Bram Wasti <bwasti@meta.com >
2025-09-26 16:59:09 -07:00
4778b42660
Reduce the Cuda Graph memory footprint when running with DBO ( #25779 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-09-26 22:29:56 +00:00
c70ac4b8ff
[spec decode] Consolidate speculative decode method name for MTP ( #25232 )
...
Signed-off-by: zixi-qi <qizixi@meta.com >
2025-09-26 22:27:05 +00:00
cf89202855
[CI] Fix FlashInfer AOT in release docker image ( #25730 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-26 14:11:40 -07:00
f075693da7
[V1] address post issues related to #20059 (part 1) ( #23046 )
...
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-26 15:58:19 -04:00
f708bd4904
[CI] Add E2E Blackwell Quantized MoE Test ( #25723 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-26 12:23:00 -07:00
0002b7f0d1
[Docs] Add Toronto Meetup ( #25773 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-26 12:00:46 -07:00
11aafd9886
[Bugfix] Improve GLM4 MoE Reasoning Parser's is_reasoning_end Condition ( #25355 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
Signed-off-by: Frank Wang <41319051+frankwang28@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-09-26 11:54:00 -07:00
b761df963c
[Doc]: improve CPU(x86) build-wheel-from-source section ( #25617 )
...
Signed-off-by: Kosseila (CloudThrill) <klouddude@gmail.com >
2025-09-26 10:26:33 -07:00
33f6aaf972
Eagle3 that supports the Minicpm3 model ( #24243 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: liudan <adan@minicpm.com >
Co-authored-by: liudan <liudan@qq.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com >
2025-09-26 10:04:57 -07:00
56aafa8c0b
[Misc] fix unique_filepath ( #25732 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-26 16:56:15 +00:00
8d52f2b3a7
[ray][metrics] Replace ':' with '_' for OpenTelemetry compatibility in Ray ( #25439 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com >
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com >
2025-09-26 09:43:30 -07:00
984d18498a
[BugFix] Fix using dbo_decode_token_threshold always (and ignoring dbo_prefill_token_threshold) ( #25622 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-26 16:22:49 +00:00
d4d9899860
[Quantization] Add field to skip unquantized modules for GPTQ config ( #25455 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-26 15:47:41 +00:00
db1e42f627
[CI/Build] Fix some V1 tests not being run ( #25569 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-26 20:52:36 +08:00
bc9d7b5595
[CI/Build] Split up Distributed Tests ( #25572 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-26 14:49:33 +02:00
fe6b19c314
[Bugfix] Properly abort pooling request. ( #25734 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-26 05:47:34 -07:00
2827b3f4a3
[CI] Fix test_shared_storage_connector_hashes ( #25748 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-26 20:46:17 +08:00
2b6b1d7809
[Model] Mamba2 varlen refactor ( #21467 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
Co-authored-by: RishiAstra <40644327+RishiAstra@users.noreply.github.com >
2025-09-26 11:31:14 +00:00
633f943e30
[Doc] Update Batch-level DP docs ( #25757 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-26 02:37:40 -07:00
b03b1b97f6
Support LongCat-Flash-Chat tool call ( #24083 )
...
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com >
2025-09-26 09:25:39 +00:00
dfb9af2014
[Bugfix] Fix Shared Expert/Zero expert code in FusedMoE.process_chunk ( #25698 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-26 01:25:28 -07:00
19f76ee68e
[misc] refactor speculative config ( #25657 )
...
Signed-off-by: zxw <1020938856@qq.com >
2025-09-26 01:22:06 -07:00
dd70437a4f
Remove cuda hard-code in compute_causal_conv1d_metadata ( #25555 )
...
Signed-off-by: Icey <1790571317@qq.com >
2025-09-26 01:19:20 -07:00
99b3a504c5
[Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and a stride bug in causal_conv_1d. ( #25743 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-09-26 01:18:58 -07:00
6e30010d2f
fix: print outputt offline_inference/base/chat.py example ( #25744 )
...
Signed-off-by: Iceber Gu <caiwei95@hotmail.com >
2025-09-26 01:18:24 -07:00
52621c8f5c
[Harware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI300X ( #25703 )
...
Signed-off-by: xaguilar <Xavier.AguilarFruto@amd.com >
2025-09-26 01:18:20 -07:00
d48f4d6daf
perf: Avoid copying inputs_embeds tensors to GPU unless prompt_embeds is enabled ( #25739 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-26 01:18:09 -07:00
e84e0735c7
fix: revert cast to cpu in MsgpackEncoder._encode_tensor to avoid hidden performance regressions ( #25738 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-26 01:18:05 -07:00
3edf87d25f
[CI/Build] fix doc build warning: Failed to get 'name: description' pair ( #25733 )
...
Signed-off-by: yiting.jiang <yiting.jiang@daocloud.io >
2025-09-26 01:18:02 -07:00
392edee34a
EVS Support (Video tokens pruning) ( #22980 )
...
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com >
Signed-off-by: Eugene Khvedchenya <ekhvedchenya@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-26 11:54:54 +08:00
983056e456
[Misc] Remove unnecessary memoryviews in shm_broadcast.py ( #25721 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-26 03:11:44 +00:00
13dd93c667
[Core] Force PIECEWISE CUDAGraph mode for encoder-decoder ( #25701 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-25 18:21:56 -07:00
53a30845be
Llamas 3.1 405B fp4 changes upstreaming from 355_wip ( #25135 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
2025-09-25 19:16:53 -06:00
8b77328ffe
[Misc] Don't log shm dequeue delay warning on worker side ( #25720 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-26 01:08:30 +00:00
9fe4c2bdb9
[Refactor] Remove DeepGEMM OP Register ( #25710 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-25 20:13:41 -04:00
081b5594a2
Fix routing_bias dtype ( #25711 )
...
Signed-off-by: Shu Wang. <shuw@nvidia.com >
2025-09-25 23:35:14 +00:00
57329a8c01
[Model] rename NemotronH_Nano_VL -> NemotronH_Nano_VL_V2 ( #25708 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
2025-09-25 16:10:29 -07:00
8c435c9bce
[Core] Enable command line logging for LLMEngine ( #25610 )
...
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-25 15:31:17 -07:00
e71b8e210d
[Spec Decode] Add Batch Parallel Ngram. Upto 8x lower overhead. ( #24986 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-09-25 15:22:03 -07:00
89fa54e6f7
[Optimization] Use a cheaper cache key in get_model_architecture ( #25682 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 17:54:20 -04:00
3d54bdcb73
[Optimization] Streamline InputPreprocessor ( #25702 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 21:06:49 +00:00
6b0fcbbf43
[Misc] Simplify test_argsort_mm_positions ( #25690 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 18:23:01 +00:00
0fa673af4c
[V0 deprecation] Clean up LoRA ( #25686 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-25 18:12:33 +00:00
3468f17ebe
[V0 deprecation] Remove _VLLM_V1 suffixes from attention backend names ( #25489 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-09-25 17:37:50 +00:00
71b25b0d48
[V0 deprecation] Clean up V0 fallback in compilation config ( #25675 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-25 17:29:51 +00:00
0ea80c87d9
[Model] Define merge_by_field_config MM interface ( #25676 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 17:13:07 +00:00
b8d9e4a326
[Model] Add optional parameter to reasoning parser constructor ( #25554 )
...
Signed-off-by: taohui <taohui3@gmail.com >
Signed-off-by: Tao Hui <taohui3@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-26 01:12:50 +08:00
13cc7f5370
[BugFix] Fix DBO hang ( #25625 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-25 17:04:48 +00:00
916bd9204d
Revert "[Bug] Dynamo Unsupported due to BasevLLMParameter.torch_function calling disabled super()" ( #25681 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-25 09:45:06 -07:00
e04a1b6b21
[BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin… ( #24662 )
...
Signed-off-by: AlonKejzman <alonkeizman@gmail.com >
2025-09-25 15:40:14 +00:00
2e5df88c92
[Logging] Remove TORCH_NCCL_AVOID_RECORD_STREAMS to squash a warning ( #25532 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-25 15:16:06 +00:00
0754ac4c49
[Misc] Remove cruft file in repo ( #25678 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-25 08:05:12 -07:00
03858e6d1c
[Bugfix] Fix InternS1 video processing after Transformers v4.56 ( #25644 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-25 14:46:04 +00:00
532a6cfccb
[ux] Switch a warning to debug about a pytorch fallback ( #23750 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-25 14:38:16 +00:00
eb32335e35
[CPU] update torch 2.8 and fix missing fields in TorchSDPAMetadata ( #25652 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-25 13:29:11 +00:00
69a8c8e99a
[torch.compile] Make Query Quantization Fusable ( #24914 )
...
Signed-off-by: Jonas Kuebler <kuebj@amazon.com >
2025-09-25 09:25:12 -04:00
6c340da4df
[misc] log info messages by default for hanging / busy / idle ( #25627 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-25 21:14:57 +08:00
2f17117606
[mypy] Fix wrong type annotations related to tuple ( #25660 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 13:00:45 +00:00
1e9a77e037
[Hardware][RISC-V] Add riscv64 support for vLLM with scalar ( #22112 )
...
Signed-off-by: chenlang <chen.lang5@zte.com.cn >
Co-authored-by: chenlang <10346245@zte.com.cn >
2025-09-25 20:46:11 +08:00
d2af67441d
[XPU][Triton]add xpu config in triton_reshape_and_cache_flash ( #25643 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-25 12:38:11 +00:00
0bcc3a160d
[CI/Build] Fix flaky entrypoints test ( #25663 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 12:19:40 +00:00
70fbdb26e9
Add backward compatibility for guided_... API ( #25615 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-25 19:45:25 +08:00
7f570f1caa
[V0 deprecation] Remove unreachable model_config.supported_tasks ( #25642 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-25 11:26:31 +00:00
eaeca3cd7f
[Bugfix] Parse SpeculativeConfig Error ( #25142 )
...
Signed-off-by: zxw <1020938856@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-25 11:09:39 +00:00
12c1287d64
[mypy] Further improve MM type annotations ( #25654 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 10:57:36 +00:00
17b4c6685c
[Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video profiling ( #25648 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-25 18:36:01 +08:00
3c2b2ccece
[Bugfix] Add triton.language.tensor placeholder ( #25649 )
...
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai >
2025-09-25 10:31:14 +00:00
7be9ffcd9f
[Misc] Fix Qwen3-VL video_grid_thw typing ( #25646 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-25 10:16:45 +00:00
393de22d2e
[fix] Update torch version in cpu-build.txt for AArch64/ppc64le and Darwin ( #25579 )
...
Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com >
2025-09-25 09:39:18 +00:00
1260180c67
Revert "[Performance] Move apply_w8a8_block_fp8_linear to an op class… ( #25607 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-09-25 08:05:21 +00:00
af4ee63e0e
typo: remove duplicate is ( #25641 )
...
Signed-off-by: nicole-lihui <nicole.li@daocloud.io >
2025-09-25 00:46:22 -07:00
bc092ea873
Map CwmForCausalLM to llama and LlamaForCausalLM ( #25611 )
...
Signed-off-by: Jacob Kahn <jacobkahn1@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-25 07:37:03 +00:00
755ed7b05b
[Misc] Simplify PoolerOutput and move to v1/outputs ( #25629 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-25 06:47:03 +00:00
a676e668ee
[Bugfix] fix apply_temperature to avoid nan in probs ( #24734 )
...
Signed-off-by: courage17340 <courage17340@163.com >
2025-09-25 05:32:21 +00:00
c85be1f6dd
optimize: eliminate duplicate split_enc_dec_inputs calls ( #25573 )
...
Signed-off-by: nicole-lihui <nicole.li@daocloud.io >
2025-09-25 05:03:25 +00:00
845adb3ec6
[Model] Add LongCat-Flash ( #23991 )
...
Signed-off-by: yangxurui <yangxurui@meituan.com >
Co-authored-by: yangxurui <yangxurui@meituan.com >
2025-09-24 21:53:40 -07:00
90b139cfff
Enable Fbgemm NVFP4 on Dense models ( #25609 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
2025-09-24 21:12:53 -07:00
4492e3a554
[Bug] Dynamo Unsupported due to BasevLLMParameter.torch_function calling disabled super() ( #25613 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-24 18:52:52 -07:00
05c19485a5
[Kernel] Support DCP for Triton backend ( #25132 )
...
Signed-off-by: Wei Wei <wwei6@meta.com >
2025-09-24 18:09:34 -07:00
52d0cb8458
[Model] Improve DotsOCRForCausalLM ( #25466 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-25 07:58:08 +08:00
5c1e496a75
[MISC] replace c10::optional with std::optional ( #25602 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2025-09-24 16:56:21 -07:00
e7f27ea648
Improve --help for enhanced user experience ( #24903 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-24 23:08:18 +00:00
1f29141258
[Refactor] Use DeepGEMM Col Major TMA Aligned Tensor ( #25517 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-24 18:52:36 -04:00
6160ba4151
feat: BF16 FlashInfer Fused Cutlass MOE for Hopper and Blackwell Expert Parallel ( #25503 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
2025-09-24 18:50:04 -04:00
fea8006062
[Logging] Improve log for when DeepEP HT disables CUDA Graphs ( #25531 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-24 22:43:06 +00:00
e6750d0b18
[V0 Deprecation] Remove unused classes in attention ( #25541 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-24 13:24:40 -07:00
8c853050e7
[Docs] Enable fail_on_warning for the docs build in CI ( #25580 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-24 19:30:33 +00:00
f84a472a03
Suppress benign cuBLAS warning when capturing cudagraphs with DBO ( #25596 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-09-24 19:02:08 +00:00
54e42b72db
Support mnnvl all2allv from Flashinfer ( #21003 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com >
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-09-24 14:38:16 -04:00
2dda3e35d0
[Bugfix] add cache model when from object storage get model ( #24764 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-09-24 18:11:16 +00:00
d83f3f7cb3
Fixes and updates to bench_per_token_quant_fp8 ( #25591 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-09-24 08:30:15 -07:00
302eb941f3
[ROCm][Build][Bugfix] Fix ROCm base docker whls installation order ( #25415 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-24 11:25:10 -04:00
487745ff49
[ROCm][Bugfix] Only enable +rms_norm based on aiter if not explicitly disabled ( #25275 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-24 11:24:39 -04:00
9313be5017
[Misc] Improve type annotations for jsontree ( #25577 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-24 22:49:58 +08:00
8938774c79
Move DeviceConfig, ObservabilityConfig, SpeechToTextConfig to their own files ( #25564 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-24 13:59:05 +00:00
e18b714b2e
[Bugfix] Fix DeepSeekV31ToolParser to correctly parse multiple tools in non-streaming output ( #25405 )
...
Signed-off-by: taohui <taohui3@gmail.com >
2025-09-24 20:58:00 +08:00
b1068903fd
[docs] fix nixl kv_connector_extra_config.backends key ( #25565 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: Peter Pan <peter.pan@daocloud.io >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-24 11:00:27 +00:00
164299500b
[Benchmark] Fix regression in structured output benchmark ( #25500 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-24 10:40:42 +00:00
58c360d9be
[Bug] fix import and unit test ( #25558 )
...
Signed-off-by: Jonas M. Kübler <44084297+jmkuebler@users.noreply.github.com >
2025-09-24 10:17:59 +00:00
42488dae69
[Bugfix] Fix dummy video number of frames calculation ( #25553 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-24 09:47:30 +00:00
b67dece2d8
[misc] update the warning message ( #25566 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-24 17:24:35 +08:00
2338daffd3
[BugFix] Potential Fix for FA3 full-cudagraph IMA ( #25490 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-24 02:04:04 -07:00
2e19a848d4
[V0 Deprecation] Remove max_seq_len_to_capture ( #25543 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-24 01:51:39 -07:00
77a7fce1bb
[CI/Build] add nightly prime-rl integration tests ( #25207 )
...
Signed-off-by: Jackmin801 <ongjackm@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-24 08:44:22 +00:00
6488f3481b
[Misc]] Move processing context to multimodal directory ( #25548 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-24 08:15:00 +00:00
27ec3c78f3
[CI/Build] Fix v1 OOT registration test ( #25547 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-24 08:03:13 +00:00
1cbcfb94de
[Bugfix][CPU] Skip unsupported custom op register on CPU ( #25534 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-24 06:21:51 +00:00
fed8a9b107
[Misc] Retry HF processing if "Already borrowed" error occurs ( #25535 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-23 22:32:11 -07:00
190c45a6af
[TPU][Bugfix] fix the missing apply_model in tpu worker ( #25526 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-09-24 05:18:08 +00:00
5caaeb714c
[Bugfix] [Frontend] Cleanup gpt-oss non-streaming chat tool calls ( #25514 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2025-09-24 03:20:38 +00:00
d747c2ef18
[Perf] Fix jit compiles at runtime of fla gated delta rule ( #25432 )
...
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-24 11:16:13 +08:00
c30b405b8f
[Spec Decode] Enable FlashInfer Spec Decoding ( #25196 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
Co-authored-by: lhsjohn <huashuoli@tencent.com >
2025-09-23 22:29:58 -04:00
77d906995c
[KV sharing] Re-land Gemma3n model changes from #22628 ( #24357 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-09-23 19:25:34 -07:00
359d293006
[fix]: add Arm 4bit fused moe support ( #23809 )
...
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com >
2025-09-24 01:32:22 +00:00
9df8da548e
[BugFix] Fix MLA assert with CUTLASS MLA ( #25478 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-23 21:09:43 -04:00
bf68fd76a9
[Compile] Fix AMD Compile Error ( #25518 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-24 00:42:48 +00:00
de94289a98
[Core] Support weight_loader_v2 for UnquantizedLinearMethod ( #23036 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-09-23 18:30:26 -06:00
1983609239
[Bugfix] Use a separate FlashInfer workspace buffer for trtllm-gen ( #25520 )
2025-09-24 00:19:56 +00:00
d06b5a95cb
[V1][Metrics] Add per-request TPOT histogram ( #24015 )
...
Signed-off-by: baxingpiaochong <771405853@qq.com >
2025-09-23 18:19:04 -06:00
be0bb568c9
[Model] Support SeedOss Reason Parser ( #24263 )
...
Signed-off-by: Yan Lu <luyan@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-23 18:15:51 -06:00
c8bde93367
[BUG] Allows for RunAI Streamer and Torch.compile cache to be used together ( #24922 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2025-09-23 18:13:32 -06:00
88d7bdbd23
[Bug] Fix AttributeError: 'FusedMoE' object has no attribute 'w13_weight_scale'. Did you mean: 'w13_weight_scale_inv' ( #25519 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-24 00:07:51 +00:00
0d235b874a
Add CUTLASS FP8 MOE benchmark scripts and kernel config ( #25302 )
...
Signed-off-by: Chenxi Yang <cxyang@fb.com >
Co-authored-by: Chenxi Yang <cxyang@fb.com >
2025-09-23 18:07:42 -06:00
7ad5e50adf
Improve output when failing json.loads() on structured output test ( #25483 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
2025-09-23 18:03:31 -06:00
dc464a3d39
[BugFix] AssertionError: Do not capture num_reqs > max_num_reqs for uniform batch ( #25505 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-23 18:00:29 -06:00
1210e4d95b
[Bugfix] [B200] cutlass_mla - ensure kv_split == 1 for batch size > 1 ( #25509 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-09-23 16:57:55 -07:00
e0b24ea030
[Perf] Increase default max splits for FA3 full cudagraphs ( #25495 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-23 16:53:34 -07:00
bde2a1a8a4
[ROCm] Small functional changes for gptoss ( #25201 )
...
Signed-off-by: jpvillam <jpvillam@amd.com >
Co-authored-by: jpvillam <jpvillam@amd.com >
2025-09-23 23:39:50 +00:00
5e25b12236
[Kernel] [Mamba] Remove BLOCK_H=1 from list of tuneable configurations for _chunk_cumsum_fwd_kernel ( #25197 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Chih-Chieh-Yang <chih.chieh.yang@ibm.com >
2025-09-23 23:23:30 +00:00
c85d75cf08
Add VLLM_NVTX_SCOPES_FOR_PROFILING=1 to enable nvtx.annotate scopes ( #25501 )
...
Signed-off-by: Corey Lowman <clowman1993@gmail.com >
2025-09-23 22:50:09 +00:00
abad204be6
[BugFix] Fix OOM in vLLM replicas by ensuring consistent NCCL memory accounting ( #25359 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-09-23 15:49:09 -07:00
7361ab379f
Remove redundant mutates_args and dispatch_key for direct_register_custom_op ( #25512 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-23 22:48:40 +00:00
95bc60e4cb
[gpt-oss][bugfix] remove logic to require resp_ in ResponseAPI ( #25428 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-23 15:46:46 -07:00
4f2954f724
Fix triton_reshape_and_cache_flash.py triton import ( #25522 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-23 15:26:10 -07:00
eca7be9077
Add VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE & VLLM_ENABLE_INDUCTOR_COORDINA… ( #25493 )
...
Signed-off-by: rouchenzi <ruochenwen@gmail.com >
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com >
2025-09-23 22:17:49 +00:00
969b4da3a6
[V0 Deprecation] Remove placeholder attn ( #25510 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-09-23 22:12:14 +00:00
4f8c4b890a
[Core] Use KVCacheBlock as much as possible instead of dict[block_id, KVCacheBlock] ( #24830 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-09-23 15:11:14 -07:00
ae002924e9
[CI/Build] Fix and re-enable v1 PP test on CI ( #25496 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-23 21:58:25 +00:00
690f948e4a
[Bugfix] Fix for the import error from #24588 ( #25481 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-23 21:31:08 +00:00
08275ec0a2
[Build] Update Xgrammar to 0.1.25 ( #25467 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-23 21:25:46 +00:00
c828d1bf98
[Bugfix] gpt-oss container tool output bug ( #25485 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Co-authored-by: Alec Solder <alecs@fb.com >
2025-09-23 20:43:45 +00:00
8b8a8afc89
[CI] Fix Pre-commit Issue ( #25497 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-24 04:09:37 +08:00
8bdd8b5c51
Enable symmetric memory all reduce by default only enabling for TP ( #25070 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-23 15:53:00 -04:00
a8ffc4f0f2
[Bugfix] Lower gpt-oss max cudagraph size to 992 to be compatible with FA3 ( #25508 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-23 12:49:55 -07:00
d5944d5146
[Speculators][Speculative Decoding] Fix gpt-oss eagle3 accuracy issue ( #25406 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
2025-09-23 15:44:35 -04:00
24fab45d96
[Perf] Change default CUDAGraphMode from PIECEWISE to FULL_AND_PIECEWISE ( #25444 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-23 15:29:26 -04:00
63400259d0
[Performance] Move apply_w8a8_block_fp8_linear to an op class ( #24666 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: ElizaWszola <elizaw.9289@gmail.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
2025-09-23 12:03:10 -07:00
8c1c81a3de
[core] add nccl symmetric memory for all reduce ( #24532 )
...
Signed-off-by: Amir Samani <asamani@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-23 14:33:06 -04:00
a3a7828010
[ROCm] Add skinny gemm bias support for dtypes fp16,bf16,fp8 ( #24988 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
Signed-off-by: Hashem Hashemi <159079214+amd-hhashemi@users.noreply.github.com >
2025-09-23 14:31:45 -04:00
5abb117901
[Core] Ensure LoRA linear respect the base_layer's tp_size and tp_rank ( #25487 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-23 18:19:25 +00:00
867ecdd1c8
[Spec Decode][CI] Add e2e test for examples/spec_decode.py and prevent breaking Acceptance Length ( #24531 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-23 10:46:40 -07:00
24e8222745
[Misc] Reduce initialization time of auto_tune ( #23682 )
...
Signed-off-by: Weida Hong <wdhongtw@google.com >
2025-09-23 17:34:58 +00:00
100b630a60
[V1][Kernel] Add triton implementation for reshape_and_cache_flash ( #24503 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-23 12:52:40 -04:00
527821d191
Use macro guard CUDA functions for back compatibility in grouped_topk_kernel.cu ( #25346 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: Rahul Tuli <rtuli@redhat.com >
Co-authored-by: Rahul Tuli <rtuli@redhat.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-23 09:45:39 -07:00
846197f505
[Log] Optimize kv cache memory log from Bytes to GiB ( #25204 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-23 12:44:37 -04:00
2357480b1a
[BugFix] Fix UB in per_token_group_quant.cu ( #24913 )
...
Signed-off-by: Shreeasish Kumar <shreeasish@rivosinc.com >
2025-09-23 09:14:22 -07:00
f11e3c516b
[Kernels] Support blocked fp8 quantization for compressed tensors MoE ( #25219 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-23 16:11:34 +00:00
875d6def90
Add backward compatibility for GuidedDecodingParams ( #25422 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-23 17:07:30 +01:00
cc1dc7ed6d
[Core/DBO][2/N] Dual-Batch Overlap add DeepEP High Throughput support and Prefill support ( #24845 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-23 16:02:10 +00:00
a903669e10
[V1] Remove V0 code paths for Hybrid models ( #25400 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-09-23 08:26:13 -07:00
2c58742dff
[UX] Change kv-cache-memory log level to debug ( #25479 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-09-23 08:01:24 -07:00
4c966e440e
[XPU] Fix MOE DP accuracy issue on XPU ( #25465 )
2025-09-23 14:32:57 +00:00
da5e7e4329
[Docs] NixlConnector quickstart guide ( #24249 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: Peter Pan <peter.pan@daocloud.io >
Signed-off-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2025-09-23 14:23:22 +00:00
f05a4f0e34
[P/D] Support NIXL connector to disconnect during a clean shutdown ( #24423 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
2025-09-23 16:08:02 +02:00
61d1b35561
[BugFix] Register expert_map as named buffer for wake_up and sleep ( #25458 )
...
Signed-off-by: wuxibin <wuxibin@bytedance.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-09-23 21:49:13 +08:00
b6a136b58c
[CI/Build] Fix disabled v1 attention backend selection test ( #25471 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-23 13:05:46 +00:00
0d9fe260dd
[docs] Benchmark Serving Incorrect Arg ( #25474 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-09-23 06:05:11 -07:00
273690a50a
[Core] Optimize LoRA weight loading ( #25403 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-23 18:19:45 +08:00
231c2c63e4
[Bugfix] Fix idefics3 tie_word_embeddings ( #25454 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-23 10:06:48 +00:00
4322c553a6
[Test]: Hermes tool parser stream output error in Qwen3 case ( #25203 )
...
Signed-off-by: Andreas Hartel <andreas.hartel@aleph-alpha.com >
2025-09-23 17:56:31 +08:00
babad6e5dd
[Misc] Move DP for ViT code inside model executor dir ( #25459 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-23 09:20:52 +00:00
9383cd6f10
[Frontend] Add a new xml-based tool parser for qwen3-coder ( #25028 )
...
Signed-off-by: Zhikaiiii <1658973216@qq.com >
2025-09-23 16:07:27 +08:00
ba8d2165b6
Handle triton kernel import exception ( #25319 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-09-23 00:56:00 -07:00
c98be0a232
[Model] Enable DP for ViT in Qwen2-VL ( #25445 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-23 05:17:10 +00:00
5774b0a1da
[NIXL][OOT platform] support nixl_connector with oot platform and other nixl_backend ( #25121 )
...
Signed-off-by: Chendi Xue <Chendi.Xue@intel.com >
2025-09-23 04:17:42 +00:00
e8db44f883
[DP/EP][GPTOSS] Use triton matmul-ogs kernels for GPTOSS DP/EP ( #24588 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-09-22 21:01:09 -07:00
fafbe11af4
[Docs] Fix griffe warnings in vllm/lora/ops ( #25369 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-23 03:42:58 +00:00
78237e43bf
[Bugfix] Remove contiguous output req for context parallel MLA ( #25414 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-09-22 20:26:32 -07:00
eea1783989
[benchmarks]allow skip ready check for bench serve ( #25420 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: Lucia Fang <116399278+luccafong@users.noreply.github.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-09-23 03:21:48 +00:00
f225ea7dd9
[XPU] Fix compile_size is None case. ( #25433 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-23 03:09:00 +00:00
fc97733da8
[feat] Support MRoPE + YaRN ( #25384 )
...
Signed-off-by: liuye.hj <liuye.hj@alibaba-inc.com >
Co-authored-by: liuye.hj <liuye.hj@alibaba-inc.com >
2025-09-23 03:04:47 +00:00
4741239db7
[Bug] Fix Long Context OOM Issue ( #25290 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-22 22:04:15 -04:00
c625f9043c
[V0 deprecation] Remove _set_default_args_v0 function ( #25409 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-23 01:52:09 +00:00
6fa78d8f23
[V0 deprecation] Remove platform v1 controling interface ( #25410 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-23 01:48:12 +00:00
9949aa2ef1
[Perf] Apply torch.compile for per_block_cast_to_fp8 ( #24611 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-22 19:42:45 -06:00
0b7bed9c38
[Performance] Remove input pads in cutlass_mla and optimize v_proj output handling ( #25184 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-09-22 19:20:53 -06:00
ac0048c0ae
[BugFix] [DP/EP] Fix slow execution when BS <= DP ( #25407 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Chris Bamford <chrisbam4d@gmail.com >
2025-09-22 17:26:17 -07:00
090197034f
[Bugfix] Fix missing clear_connector_metadata ( #25397 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-23 08:10:59 +08:00
f31ff87460
[Core] Drop overly aggressive whisper assertion ( #25408 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-22 17:09:52 -07:00
d588cd2406
[Bugfix] fix custom op test ( #25429 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2025-09-23 00:07:43 +00:00
45d7d852d3
[Frontend] Responses API MCP tools for built in tools and to pass through headers ( #24628 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Signed-off-by: Alec S <10566873+alecsolder@users.noreply.github.com >
Co-authored-by: Alec Solder <alecs@fb.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-22 23:38:19 +00:00
8bed179109
[TPU] update torch_xla dependency for PyPI compatibility ( #25278 )
...
Signed-off-by: Johnny Yang <johnnyyang@google.com >
Co-authored-by: Chengji Yao <chengjiyao@google.com >
2025-09-22 16:14:44 -07:00
f552d5e578
[CI/Build] Skip Qwen3-VL initialization tests until models are actually released ( #25394 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-22 13:18:24 -07:00
8db2939289
[KV offload][5/N] Add CPUOffloadingSpec ( #24251 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-22 12:30:36 -07:00
d5e0fca264
[torch.compile] Cleanup compilation tests and custom passes, add debug utils, fix DCE bug ( #23091 ), fix test ( #24376 ), and prep for custom op matching ( #24604 ) ( #24542 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Signed-off-by: luka <lgovedic@redhat.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-22 12:30:05 -07:00
8d0ee5a564
[misc] Remove RFC review hours reference ( #25416 )
2025-09-22 12:16:59 -07:00
922979bfcc
[DP] support torchrun external launcher with Data Parallelism ( #24899 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-22 12:06:05 -07:00
239ef0c1ac
[CI Failure] Fix fp8 kv cache on <SM90 ( #25396 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-22 18:27:51 +00:00
1d7f95b85c
[Compiler] Disable Inductor standalone compile by default ( #25391 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2025-09-22 17:37:46 +00:00
cfbee3d0e7
[CLI env var] Add VLLM_FLASH_ATTN_MAX_NUM_SPLITS_FOR_CUDA_GRAPH in env variables ( #25274 )
...
Signed-off-by: qqma <qqma@amazon.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: qqma <qqma@amazon.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-22 10:37:43 -07:00
06a41334c7
[EPLB] Reduce EPLB Inference Overhead ( #24573 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-22 16:31:05 +00:00
175811e3b5
[V1][Attention] Split triton_attn in triton-only and rocm specific backends ( #24648 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
2025-09-22 15:20:28 +00:00
c10101a3eb
[Bugfix] Fix several issues with p2p xPyD in GET type ( #23993 )
...
Signed-off-by: Csrayz <jover@cmbchina.com >
Signed-off-by: ivyilike <pww123@cmbchina.com >
Co-authored-by: ivyilike <pww123@cmbchina.com >
2025-09-22 14:53:13 +00:00
ac243886b0
[Kernel] MI-300X triton moe configs ( #23445 )
...
Signed-off-by: Sara Kokkila Schumacher <saraks@ibm.com >
2025-09-22 14:29:54 +00:00
3d2c56b7a9
Make mypy behave like a proper pre-commit hook ( #25313 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-22 12:23:45 +00:00
64c824cd78
Make pickle import check fast ( #25379 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-22 04:08:25 -07:00
417a164af6
[Misc] Remove unused encoder-decoder error strings ( #25374 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-22 11:04:32 +00:00
b6f01bd9a7
refactor: abstract graph mode support into platform interface ( #25161 )
...
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com >
2025-09-22 10:22:29 +00:00
4cf71cc88a
[TPU] Deprecate xm.mark_step in favor of torch_xla.sync ( #25254 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-22 10:12:57 +00:00
a66d131381
[TPU][Bugfix][CI] Fix broken tests/build dependency ( #25255 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-22 09:55:04 +00:00
21467f9a1c
Enable Eagle3 speculative decoding for GPT-OSS model ( #25246 )
...
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com >
2025-09-22 08:50:39 +00:00
f92d952632
[V0 Deprecation] Remove MultiModalPlaceholderMap ( #25366 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-22 08:49:19 +00:00
6d0b827cbd
[V0 Deprecation] Remove V0-only methods in multi-modal registry ( #25362 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-22 13:58:26 +08:00
0eecb31663
[Bugfix] Fix hermes tool parser handling of non-string argument types ( #22002 )
...
Signed-off-by: wangzi <3220100013@zju.edu.cn >
Signed-off-by: David Chen <530634352@qq.com >
Co-authored-by: wangzi <3220100013@zju.edu.cn >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-09-22 11:35:39 +08:00
793be8d057
[Docs] GSM8K Accuracy Evaluation doc update ( #25360 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-09-22 02:49:13 +00:00
7b57a433da
[Model] Support Dots OCR ( #24645 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: yinz-aizip <yinz@aizip.ai >
2025-09-22 02:24:40 +00:00
5aeb925452
Multimodal - audio tests ( #25285 )
...
Signed-off-by: Debolina Roy <debroy@redhat.com >
2025-09-22 07:07:11 +08:00
04d3752329
[Bugfix][V0 Deprecation][CI] use async mock and await for async method ( #25325 )
...
Signed-off-by: Yang <lymailforjob@gmail.com >
2025-09-22 07:06:16 +08:00
bc6e542d9f
Remove V0 attention backends ( #25351 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-21 16:03:28 -07:00
af7dfb0d1a
[Perf] Further optimization for Qwen3-VL fast_pos_embed_interpolate ( #25347 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-21 20:12:45 +00:00
1c3ffdbecc
[V0 Deprecation] Remove V0 sampling metadata ( #25345 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-21 10:37:11 -07:00
c438b2951c
feat: Enable engine-level arguments with speculators models ( #25250 )
...
Signed-off-by: Rahul Tuli <rtuli@redhat.com >
Co-authored-by: Claude <noreply@anthropic.com >
2025-09-21 11:04:45 -06:00
0ff8ebb2d7
[V0 Deprecation] Remove async_output_proc, preemption mode, delay factor ( #25334 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-21 08:52:32 -07:00
26e673fe93
[V0 Deprecation] Remove V0 Sequence class & Sampler ( #25332 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-21 08:52:15 -07:00
65a5910ce3
[Optimization] Cache chat template result when processor fails to be loaded ( #25341 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-21 19:41:02 +08:00
9aea7373ff
[Bugfix] Typos in error message for missing model config file ( #25339 )
...
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com >
2025-09-21 04:36:47 -07:00
30d08911f7
[MM][Perf] Minor Optimization on Qwen3-VL fast_pos_embed_interpolate ( #25337 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-21 11:05:20 +00:00
cf56cf78b4
[V1] Add sliding window support to Flex Attention backend ( #24089 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-21 05:08:07 +00:00
7ed82d1974
[V0 Deprecation] Remove V0 MP executor ( #25329 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 21:26:35 -07:00
12dbd834cf
[V0 Deprecation] Remove from_seq_group methods ( #25330 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 21:10:48 -07:00
035fd2bd2c
[Multi Modal][Performance] Fused Q,K's apply_rope in more models ( #25005 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-21 03:55:10 +00:00
1cd885bd54
[V0 Deprecation] Remove V0 model runner base & simplify worker base ( #25328 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 20:49:09 -07:00
62b38dc832
[Doc] improve test-pipeline.yaml documentation ( #25305 )
...
Signed-off-by: Huamin Li <3ericli@gmail.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2025-09-20 20:29:12 -07:00
c99db8c8dd
[V0 Deprecation] Remove V0 core ( #25321 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 19:58:26 -07:00
72dd1595b4
[CI] Skip tests failing on main ( #25326 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 19:57:46 -07:00
572ddf83ce
[Chore] Remove unused sampler in models ( #25324 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 19:53:20 -07:00
86647d1cd0
[V0 Deprecation] Remove V0 Output Processor ( #25320 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 17:57:20 -07:00
52c2a8d4ad
[V0 Deprecation] Remove LLMEngine ( #25033 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-20 17:56:30 -07:00
367a480bd3
[Docs] Fix warnings in vllm/profiler and vllm/transformers_utils ( #25220 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-20 16:39:47 -07:00
bef180f009
[V0 Deprecation] Enable the remaining multimodal tests in V1 ( #25307 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-20 17:50:58 +00:00
d88918e4c2
[Core] Enable sharded state loader for V1 engine and enhance test coverage ( #25308 )
...
Signed-off-by: pengdrumli <pengdrumli@tencent.com >
2025-09-20 21:15:22 +08:00
3c713a9711
[Model] Cleanup InternViT's data parallel implementation ( #25306 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-20 05:46:24 -07:00
bf8b26cad1
Generate _ModelInfo properties file when loading to improve loading speed ( #23558 )
...
Signed-off-by: Manoel Marques <manoel.marques@ibm.com >
Signed-off-by: Manoel Marques <manoelmrqs@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-20 11:51:13 +00:00
032d661d27
[Docs] Fix warnings in mkdocs build (continued) ( #25042 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-09-20 11:45:18 +00:00
e08a3a3fdb
[CI Failure] Disable FlashInfer RoPE to unblock CI ( #25299 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-20 08:16:56 +00:00
3d9a1d2de5
[V1] Support LLM.apply_model ( #18465 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-20 07:14:35 +00:00
be874c0201
[Bugfix] Fix Qwen3-VL-MoE weight loading for EP ( #25300 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-20 00:04:05 -07:00
9607d5eb44
[Hybrid Allocator] Support full attention with different hidden size ( #25101 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-19 23:43:59 -07:00
c60e6137f0
[Optimization] Avoid repeated model architecture conversion for pooling models ( #25261 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-20 13:30:22 +08:00
f91480b2d4
[Bugfix] fix tool call arguments is empty ( #25223 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: xin.li <xin.li@daocloud.io >
2025-09-20 13:29:54 +08:00
6c5f82e5aa
[BUG FIX][NON-CUDA]quick fix to avoid call cudagraph_unsafe in attention ( #25298 )
...
Signed-off-by: Chendi Xue <Chendi.Xue@intel.com >
2025-09-20 04:41:23 +00:00
b7f186bbb3
[BugFix] Exclude self when checking for port collision ( #25286 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-20 12:28:31 +08:00
3642909617
[BUGFIX] GPTQ quantization compatibility for Qwen3 Next MOE models (AutoGPTQ and AutoRound-GPTQ) ( #25268 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2025-09-20 11:18:13 +08:00
c308501cb6
Improve weight loading for encoder models in Transformers backend ( #25289 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-20 03:11:03 +00:00
535d80056b
[Misc] Support more collective_rpc return types ( #25294 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-20 02:02:38 +00:00
a25ade5d47
[BugFix] Ensure appropriate guards in destructors ( #25284 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-20 09:06:34 +08:00
8945b001db
[torch.compile] CUDAGraph Inductor partition integration ( #24281 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
Signed-off-by: Boyuan Feng <fby.1994@gmail.com >
Signed-off-by: boyuanfeng <boyuan@meta.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-20 01:02:15 +00:00
b8a287a0a8
[docs] Prompt Embedding feature support ( #25288 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-19 17:46:23 -07:00
c7e713616a
test: Remove vestigial skip for prompt embeds tests after landing v1 Prompt Embeds support ( #25291 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-19 17:33:40 -07:00
a36c675817
Don't skip special tokens with hermes-style tool calling ( #25281 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-09-19 17:33:25 -07:00
3da17c2cc2
[Bugfix] Remove VLLM_TEST_DYNAMO_FULLGRAPH_CAPTURE #2969 ( #25090 )
...
Signed-off-by: Lucas Kabela <lucaskabela@meta.com >
2025-09-19 20:27:21 -04:00
14c1432789
[BugFix] Fix async scheduling CPU tensor race take 2 ( #25279 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-19 16:34:07 -07:00
ee7a66dd9a
allow disable flashinfer prefill ( #25276 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-09-19 22:59:41 +00:00
431535b522
Enable modelopt gemma3 nvfp4/fp8, make workflow more robust ( #22771 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-19 22:40:33 +00:00
711e912946
[Compile] Fix Compile Warning for Ignoring MIN_BLOCK_PER_SM ( #25193 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-19 16:23:19 -06:00
e69e0b8b5f
[Frontend] Responses API messages out, just harmony for now ( #24985 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Co-authored-by: Alec Solder <alecs@fb.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-19 21:40:16 +00:00
ddc9048394
Fix: Correct FusedMoE layer reference in auto_round quantization ( #24818 )
...
Signed-off-by: David-Wen <18927700430@163.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-19 20:44:24 +00:00
b1a63d1b3b
[BugFix] Make FlashInferMetadataBuilder non-blocking ( #25040 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-19 20:36:34 +00:00
48ecb4438b
[Perf] Use FlashInfer RoPE for RotaryEmbedding.forward_cuda when available ( #21126 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-19 14:06:49 -06:00
e57fc15971
Specify platform in pip-compile pre-commit hook so it runs on MacOS ( #25273 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 12:43:33 -07:00
4bdf400218
[Bugfix] Fix chunked a2_scales in modular kernels ( #25264 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-19 19:42:01 +00:00
7852b82b93
[Bugfix] GPT OSS Attritbute error on H100 ( #25228 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-09-19 13:14:09 -06:00
a2a5f79e09
Optimize triton unified attention performance for sliding window attention ( #24390 )
...
Signed-off-by: zixi-qi <qizixi@meta.com >
2025-09-19 13:07:26 -06:00
c59a0eca42
[KV offload][4/N] Offloading KV connector ( #22595 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-19 19:07:17 +00:00
b716ab93a7
[bugfix] fix structured outputs key missing issue from #24929 ( #25195 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-09-19 18:37:57 +00:00
138f0d1e75
[Docs] add __init__.py to vllm/model_executor/layers/quantization/compressed_tensors/transform ( #24974 )
...
Signed-off-by: samzong <samzong.lu@gmail.com >
2025-09-19 18:32:27 +00:00
2506ce5189
[Core][Prefix Hash] Fix prefix hash metrics sliding window maintainance ( #24990 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-09-19 12:22:53 -06:00
47fd08aaf9
[CI/Build] fix test function_calling ( #25072 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-19 12:16:32 -06:00
12aed7e453
Encoder model support for the Transformers backend ( #25174 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 19:15:22 +01:00
d90e212a3a
Remove Redundant Assignment in Qwen3_VisionPatchMerger ( #25224 )
...
Signed-off-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-19 12:15:13 -06:00
2821986450
[Core] Modify the initialization parameters of the lora manager ( #25249 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-19 18:01:28 +00:00
6c117cff7d
[Frontend] Pass API server count to each process ( #23717 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-20 01:15:19 +08:00
7ac67ea525
[KV offload][3/N] Add worker-side CPU support ( #21448 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-19 09:53:45 -07:00
ce75e15373
refactor(benchmarks): add type annotations to wait_for_endpoint parameters ( #25218 )
...
Signed-off-by: samzong <samzong.lu@gmail.com >
2025-09-19 16:36:52 +00:00
aed16879a9
Move ModelConfig from config/__init__.py to config/model.py ( #25252 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 16:22:33 +00:00
cf278ff3b2
Update CODEOWNERS ( #25269 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 09:12:55 -07:00
838d7116ba
[Qwen] Remove cuda hard-code in qwen3 next ( #25243 )
...
Signed-off-by: Icey <1790571317@qq.com >
2025-09-19 12:25:12 +00:00
5089fd749c
[V0 Deprecation] Remove V0 logic from get_input_embeddings interface ( #25242 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-19 11:10:52 +00:00
a3d087adec
[P/D][Nixl] Introduce KVTransferMetrics and aggregation strategy ( #22188 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-19 11:09:14 +00:00
058525b997
Move PoolerConfig from config/__init__.py to config/pooler.py ( #25181 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 11:02:55 +00:00
1dfea5f4a9
[Bugfix][Perf] Misc fixes for Qwen3 VL ( #25238 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-19 10:46:16 +00:00
cea91a32f2
[Kernel][Performance] Add Triton kernel for Qwen3-VL interleaved MRoPE ( #25055 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-19 10:27:49 +00:00
a684c0124c
[bugfix] fix MHA for models like OpenGVLab/InternVL3_5-38B ( #25146 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-19 08:45:06 +00:00
f2718d2948
[Misc] Cleanup test conftest for deprecated encoder-decoder models ( #25231 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-19 07:44:56 +00:00
825fdb11ad
[Bugfix][CPU] Add placeholder to avoid import errors when using fused_moe ops on platforms without triton ( #25137 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-19 07:41:12 +00:00
8c1d4acbfe
[CPU] Disable oneDNN linear on non-x86 platforms ( #25166 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-19 07:27:22 +00:00
486c5599e3
[Build] Update Xgrammar to 0.1.24 to get a CVE fix ( #25188 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-19 14:27:17 +08:00
a6149aa587
[OOT] Support sync_model_loading for OOT ( #25126 )
...
Signed-off-by: Chendi Xue <Chendi.Xue@intel.com >
2025-09-19 05:41:53 +00:00
6c8a3c099b
[Docs] Fix griffe warnings in vllm/multimodal ( #25216 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-18 22:10:44 -07:00
31a8a2a7bc
[Misc] Clean up MM profiling warnings ( #25222 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-19 04:46:57 +00:00
1a0a04dae9
[Perf] Optimize memory peak during EAGLE model loading. ( #24585 )
...
Signed-off-by: Chen Ding <candy.dc@alibaba-inc.com >
2025-09-19 03:31:16 +00:00
6d8246aaff
[gpt-oss] Add ResponseReasoningPartAddedEvent, ResponseReasoningPartDoneEvent for streaming ( #24938 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-18 19:11:59 -07:00
9d1c50a5ac
[KV offload][2/N] Introduce LRU-based CPU offloading management ( #20075 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-19 00:20:51 +00:00
9a4600e4dc
[CORE] Prompt Embeddings Support for v1 Engine ( #24278 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: Andrew Sansom <qthequartermasterman@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-19 08:03:09 +08:00
9fac6aa30b
[BugFix] Fix DeepGEMM warmup, no m.weight_scale_inv ( #25206 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-18 14:26:28 -07:00
a53ad626d6
[KV offload][1b/N] rename offloading to kv_offload ( #25191 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-18 20:53:52 +00:00
1c3dad22ff
[V0 Deprecation] Remove unused async_timeout.py ( #25190 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-18 20:35:21 +00:00
d2a30a2d93
[Bug] Fix torch Compilation Cache Hit Error ( #25093 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-18 12:38:37 -07:00
75fb112d80
[Bug] Fix returned_lse not Defined issue ( #25106 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-18 19:32:24 +00:00
38db529f66
[feat]: Create interface for model-specific M-RoPE ( #24194 )
...
Signed-off-by: AzizCode92 <azizbenothman76@gmail.com >
Signed-off-by: Aziz <azizbenothman76@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-18 19:18:56 +00:00
064cac7bb7
[fix]: remove data type hardcoding from gptoss model implementation ( #23807 )
...
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com >
2025-09-18 18:15:23 +00:00
e19bce40a1
[V0 Deprecation] Remove AsyncLLMEngine ( #25025 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-18 11:07:42 -07:00
505805b645
[KV offload][1/N] Introduce an offloading component ( #19848 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-18 10:57:07 -07:00
bbdc0f2366
[ROCm][AITER][Bugfix] Switch AITER to use PIECEWISE_AND_FULL compilation ( #25104 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2025-09-18 17:46:47 +00:00
dc34059360
[ROCm][CI/Build] Use ROCm7.0 as the base ( #25178 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-18 09:36:55 -07:00
c4cb0af98a
[spec decode] Fix MTP inference path for MiMo-7B model ( #25136 )
...
Signed-off-by: zixi-qi <qizixi@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-18 09:12:19 -07:00
1c3b1634aa
[Misc] Add codeowner for Transformers backend ( #25180 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 09:01:50 -07:00
2ea50e977a
Enable Allgather/ReduceScatter backend for NaiveAllToAll ( #23964 )
...
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: Shu Wang <shuw@nvidia.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-18 15:52:58 +00:00
b419937c78
[Docs] Fix warnings in mkdocs build (continued) ( #25163 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-09-18 08:23:26 -07:00
5f696c33b1
[New Model] Support BertForTokenClassification / Named Entity Recognition (NER) task ( #24872 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-18 23:22:01 +08:00
67244c86f0
feat(api): Return 503 on /health when engine is dead ( #24897 )
...
Signed-off-by: dongbo910220 <1275604947@qq.com >
Co-authored-by: Claude <noreply@anthropic.com >
2025-09-18 14:29:40 +00:00
072d7e53e5
[PERF] Add conv1d metadata to GDN attn ( #25105 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2025-09-18 14:27:49 +00:00
01a583fea4
[Kernel] Decouple Tile Size from Block Size in Triton Unified Attention Kernel ( #21197 )
...
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com >
2025-09-18 14:27:01 +00:00
bc19d75985
[Misc] Add kv-connector label ( #25156 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-18 13:56:07 +00:00
fbd6523ac0
Refactor dense FP8 tensor/channel/block utils and add CT FP8 block ( #21404 )
2025-09-18 08:53:45 -04:00
470484a4f5
[Structured Output][Refactor] Move apply_grammar_bitmask() method from ModelRunner to structured output utils ( #21999 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-09-18 20:44:31 +08:00
21da73343a
[Misc] Clean up flags in vllm bench serve ( #25138 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-18 12:43:33 +00:00
66072b36db
[Bugfix][Mamba] - Fix Conv State Kernel FP32 Support ( #24883 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
2025-09-18 12:21:17 +00:00
3ed1ec4af2
Fix validate-config pre-commit check ( #25157 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 12:06:28 +00:00
5a33ae9a3f
Fix forward reference warning in documentation ( #25150 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 11:41:41 +00:00
c9ff9e6f0c
[Docs] add the parallel sampling usage in LLMEngine and AsyncLLM ( #24222 )
2025-09-18 04:37:08 -07:00
eaffe4486c
[Docs] Fix pooling-params doc references in openai_compatible_server.md ( #24939 )
2025-09-18 04:36:47 -07:00
8ed039d527
Move StructuredOutputsConfig from config/__init__.py to config/structured_outputs.py ( #25153 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 11:24:27 +00:00
37970105fe
[Model] Improve Pooling Model ( #25149 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-18 11:04:21 +00:00
cc935fdd7e
[Frontend] Support setting logprobs to -1 ( #25031 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-18 10:34:42 +00:00
abdfcd4f3d
silu-v1: Fix EPS not being used during max-reduction ( #25069 )
...
Signed-off-by: elvircrn <elvircrn@gmail.com >
2025-09-18 10:25:12 +00:00
4f02b77de4
Fix: Add explicit #include <omp.h> for OpenMP compatibility on certain toolchains ( #24951 )
...
Signed-off-by: lyd1992 <liuyudong@iscas.ac.cn >
Signed-off-by: ihb2032 <1355790728@qq.com >
2025-09-18 17:43:23 +08:00
29283e8976
[Chore] Cleanup guided namespace, move to structured outputs config ( #22772 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 09:20:27 +00:00
05b044e698
[Doc] Fix cross-reference warnings ( #25058 )
...
Signed-off-by: Punit Vara <punitvara@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 02:05:16 -07:00
aa3f105c59
Add 'path' option to ImagePrompt data_format ( #25081 )
...
Signed-off-by: Gerard Finol <gerard.finol@urv.cat >
2025-09-18 02:02:14 -07:00
ef7eefe17a
[Qwen] Add fp8 checkpoint support for qwen3-next. ( #25079 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-09-18 08:16:04 +00:00
350c94deb3
[Bugfix] when use s3 model cannot use default load_format ( #24435 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-18 07:47:43 +00:00
f4cd80f944
Retrieve sliding_window from text config in Gemma3 MM ( #25085 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 06:29:05 +00:00
349e0e3462
[Docs] Fix API Reference ( #25140 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-17 23:23:29 -07:00
81b16a2bc9
[Kernel] Better inf handling for grouped topk cu ( #24886 )
...
Signed-off-by: lumina37 <starry.qvq@gmail.com >
2025-09-18 05:53:55 +00:00
e111d5b0ae
[CLI] Use streaming in CLI chat and completion commands ( #23769 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-17 22:30:26 -07:00
a904ea78ea
[benchmark] add peak throughput metrics and plot ( #23867 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-17 22:30:02 -07:00
b7433ca1a4
[Spec Decode] Efficient padded speculation ( #24539 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2025-09-18 01:07:24 -04:00
5c65a72bb1
[V0 Deprecation] Remove more V0 tests ( #25117 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 22:05:25 -07:00
9d8a2d86d2
[EPLB] Add EPLB support for hunyuan_v1 ( #23078 )
2025-09-18 04:51:35 +00:00
3bc18127ff
[XPU] Whisper model support on XPU Platform ( #25123 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-09-18 04:30:10 +00:00
bec060fd99
Mark prompt logprobs as incompatible with prompt embeds at API level ( #25077 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-17 21:25:07 -07:00
52bc9d5b3e
[Model] enable data parallel for InternVL vision encoder ( #23909 )
...
Signed-off-by: Yiwen Chen <yiwen66@berkeley.edu >
Signed-off-by: YiwenC <54658925+666even666@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-17 21:11:46 -07:00
dc2979c585
[Kernels] Overlap shared experts with combine instead of dispatch ( #24254 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-18 12:10:21 +08:00
027d37df38
[Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models ( #24960 )
...
Signed-off-by: toncao <cpatonn@gmail.com >
Co-authored-by: toncao <cpatonn@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-18 12:08:50 +08:00
b98219670f
[Core][MM] Cleanup MultiModalCache ( #25006 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-17 21:08:41 -07:00
32baf1d036
[Docs] Clean up the contributing README ( #25099 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-17 21:05:18 -07:00
3127274d02
[MM Encoder] Apply DP ViT for Qwen3-VL model series ( #24955 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Huang Jie <92386084+JJJYmmm@users.noreply.github.com >
Co-authored-by: 松灵 <26085463+wulipc@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-17 21:04:21 -07:00
4ac510f484
[Kernels] Enable DeepGEMM by default ( #24462 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-17 20:19:52 -07:00
7fb2a5be28
[V0 Deprecation] Skip PP test ( #25128 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 20:18:36 -07:00
6c036615dc
[V0 Deprecation] Remove misc V0 tests ( #25118 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 19:41:55 -07:00
2fc24e94f9
[V0 Deprecation] Remove V0 Tracing & Metrics tests ( #25115 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 19:40:44 -07:00
2c3c1bd07a
[V0 Deprecation] Remove V0 Engine tests ( #25114 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 19:38:09 -07:00
5963b98b46
[Kernel] Delegate construction of FusedMoEQuantConfig to FusedMoEMethodBase subclasses ( #22537 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-17 17:43:31 -06:00
e6585ddb45
[Bugfix] Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel ( #24833 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-17 16:37:23 -07:00
2a4d6412e6
Add a batched auto tune script ( #25076 )
...
Signed-off-by: Karan Goel <karangoel@google.com >
Signed-off-by: Karan Goel <3261985+karan@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-17 22:41:18 +00:00
e67a79db03
[Bugfix] Refactor Flashinfer TRTLLM attention kernel selection logic ( #24600 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-17 15:36:29 -07:00
9f882d8791
Disable failing GPT-OSS Eval (Blackwell) for now ( #25107 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-17 15:36:00 -07:00
1a456c7c90
Aiter mha fp8 fix ( #24991 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
2025-09-17 22:29:14 +00:00
fedb75fa27
[Bugfix][B200] Fix cutlass_mla hang ( #24966 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-17 18:06:38 -04:00
bff2e5f1d6
[gpt-oss][2] fix types for streaming ( #24556 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-17 22:04:28 +00:00
3c068c637b
[Kernel] Faster pre-processing time for W4A8 ( #23972 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-09-17 14:35:32 -07:00
f20c3b0951
[BUG] Exclude .pth files when pulling remote files ( #25092 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2025-09-17 20:42:09 +00:00
883131544f
[Bugfix] Update import path for bc_linter_include ( #24766 )
...
Signed-off-by: Mohammad Miadh Angkad <mangkad.bsdsba2027@aim.edu >
2025-09-17 20:33:11 +00:00
ee5fd49150
[Misc] Update owners for KV connector and V1 offloading ( #25041 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
2025-09-17 12:37:29 -07:00
7ae9887542
[V1] Logits processor docs ( #22919 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
Signed-off-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
Co-authored-by: Joseph Marinier <Joseph.Marinier@gmail.com >
2025-09-17 11:53:12 -07:00
e3db5ebb66
[CI Bugfix] Fix failing test_model_load_with_params tests due to tokenizer refactor ( #25086 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-17 11:15:05 -07:00
9d442b7c48
[V0 Deprecation] Remove V0 tests in test_sequence.py ( #25088 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 11:08:45 -07:00
eb68c2dcd9
[CI] Revert back prepare_prompts and check_answers ( #25087 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 11:03:16 -07:00
8b32464ac1
Change log level from info to debug for IOProcessor ( #24999 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-09-17 10:21:28 -07:00
99cc41ad50
[V0 Deprecation] Remove unused output processor util ( #25023 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-17 09:50:07 -07:00
d6a518fdde
Remove unused find_cuda_init helper script ( #25044 )
2025-09-17 09:47:40 -07:00
4aa8c7b047
cleanup: remove adapter commons ( #25045 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-17 16:46:29 +00:00
4b946d693e
[V0 Deprecation] Remove V0 Core tests ( #25082 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 09:32:42 -07:00
087c6ffc92
[CI Bugfix] Fix failing test_invalid_env ( #25078 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-17 08:28:58 -07:00
4a2d33e371
[Docs] vllm/benchmarks/datasets.py fix docstring param format. ( #24970 )
...
Signed-off-by: samzong <samzong.lu@gmail.com >
2025-09-17 08:11:51 -07:00
8f3616f422
Remove old cutlass mla ( #23961 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-17 14:31:43 +00:00
47f670b03b
[Docs] improve code formatting and comments for eliminate griffe build warning. ( #25010 )
...
Signed-off-by: samzong <samzong.lu@gmail.com >
2025-09-17 07:31:20 -07:00
dd6a910aac
[Bugfix][Qwen3-Next] fixes the varlen issue in qwen3-next's MTP implementation. ( #24957 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-09-17 21:59:09 +08:00
1b962e2457
[fix] lora benchmarks pass no_lora_flag_cpu ( #23774 )
...
Signed-off-by: Dylan Maloy <34420038+dolpm@users.noreply.github.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-17 21:22:25 +08:00
bfe9380161
Apply fixes for CUDA 13 ( #24599 )
...
Signed-off-by: Aidyn-A <aidyn.b.aitzhan@gmail.com >
2025-09-17 09:15:42 -04:00
9fccd04e30
[Bugfix] Fix Stream usage in CPU model runner and OneDNN kernel check ( #25046 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-17 05:54:02 -07:00
252ada5559
Add RADIO Vision Encoder Support to vLLM ( #24595 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com >
Co-authored-by: root <root@cw-dfw-h100-001-305-026.cm.cluster >
2025-09-17 05:53:30 -07:00
e120533d7a
[Misc] Avoid use of deprecated AutoModelForVision2Seq ( #25065 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-17 12:19:15 +00:00
2b85697031
[BugFix] enable DOTALL to match multi-line tool_call parameters in extract_tool_call_required_streaming ( #24668 )
...
Signed-off-by: Shijun Yin <shijun.yin@outlook.com >
2025-09-17 09:21:18 +00:00
544fe76b95
[Frontend] Support returning all prompt logprobs ( #24956 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-17 09:03:52 +00:00
bb58dc8c20
[DP] Create placement groups by ray_device_key ( #25026 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-17 08:57:25 +00:00
0fb2551c23
[Docs] Fix griffe warning in base_static_graph.py ( #25018 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-17 08:49:19 +00:00
6c47f6bfa4
[Core] Remove tokenizer group in vLLM ( #24078 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-17 08:42:59 +00:00
c15309a730
[Model] Apply SharedFusedMoE to glm4_moe. ( #24849 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2025-09-17 16:02:31 +08:00
4a9375fe9d
[Model] Pass param prefix to LLMHead ( #24862 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2025-09-17 16:01:27 +08:00
03191cd8f0
[Core][MultiModalHasher] Hash images without converting image mode ( #24969 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-17 00:57:34 -07:00
b77bf34e53
[EPLB] Support EPLB for Mixtral Model ( #22842 )
...
Signed-off-by: rouchenzi <ruochenwen@gmail.com >
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com >
Co-authored-by: Bowen Wang <abmfy@icloud.com >
2025-09-17 07:27:34 +00:00
dd39baf717
[XPU] Fix xpu model runner call torch.cuda APIs ( #25011 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-17 06:45:25 +00:00
43a62c51be
Add more documentation and improve usability of lognormal dist (benchmark_serving_multi_turn) ( #23255 )
...
Signed-off-by: daniels <daniels@pliops.com >
2025-09-17 05:53:17 +00:00
ca2d1925ef
[Rocm] [quantization] Fix quark ptpc moe and add test case ( #24649 )
...
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com >
Co-authored-by: Haoyang Li <haoyang.li@amd.com >
2025-09-16 22:15:13 -07:00
0f7acdd73c
[Model] Support Qwen3-VL Model Series ( #24727 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Huang Jie <92386084+JJJYmmm@users.noreply.github.com >
Co-authored-by: 松灵 <26085463+wulipc@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-17 05:01:04 +00:00
5801e49776
[V0 Deprecation] Remove MQLLMEngine ( #25019 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-16 21:29:27 -07:00
58d4c705a8
[Core] Get num_encoder_tokens from scheduler config ( #24989 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-16 20:59:07 -07:00
ea3de5ef0d
[misc] fix typo in value error ( #24995 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2025-09-16 20:58:38 -07:00
67532a1a68
[UX] Remove "quantization is not fully optimized yet" log ( #25012 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-16 20:57:51 -07:00
5672ba90bd
[Docs] fix invalid doc link ( #25017 )
...
Signed-off-by: zxw <1020938856@qq.com >
2025-09-16 20:53:23 -07:00
dd83a157f1
[UX] Enforce valid choices for envs like VLLM_ATTENTION_BACKEND, etc ( #24761 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-09-16 20:42:23 -07:00
5a411ef6c4
[Benchmarks] Add MMVU video dataset support and clean up deprecated datasets ( #24719 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-17 03:29:43 +00:00
eeb135eb87
[Core] Use CpuGpuBuffer for block table tensors ( #24795 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-16 19:18:06 -07:00
3059b9cc6b
[Doc] Add --force-overwrite option to generate_cmake_presets.py ( #24375 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-16 18:45:29 -07:00
64ad551878
Removes source compilation of nixl dependency ( #24874 )
...
Signed-off-by: bbartels <benjamin@bartels.dev >
Signed-off-by: Benjamin Bartels <benjamin@bartels.dev >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Daniele <36171005+dtrifiro@users.noreply.github.com >
2025-09-17 01:33:18 +00:00
cef32104b4
[FP8] Extend per-token-group quantization support to QuantFP8 ( #24342 )
...
Signed-off-by: Tahsin Tunan <tahsintunan@gmail.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
2025-09-16 18:31:06 -07:00
493b10f8bf
[CI] GPT-OSS GPQA eval test for Blackwell ( #24920 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-16 18:13:21 -07:00
d119fc8614
[CI][Bugfix] Fix failing Blackwell test ( #24993 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-16 15:55:02 -07:00
dbebb7f812
[Perf] Reuse workspace for FP8+FP4 Marlin MoE ( #20500 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-16 15:45:10 -06:00
3053a22b33
fp8 kv cache support fix for torch.compile ( #22758 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
2025-09-16 21:27:11 +00:00
02d4b85454
Use kwargs for long lists of EngineCoreRequest arguments in tests and fix extra kwargs ( #24987 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-16 14:06:56 -07:00
86daa875fe
[gpt-oss][1][bugfix] fix streaming final output ( #24466 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-16 13:56:16 -06:00
dcf2f3ec06
[ROCm] Add dependencies for ROCm ( #24900 )
...
Signed-off-by: Yida Wu <yida.wu@amd.com >
2025-09-16 19:49:06 +00:00
218454b9b2
[MISC] Add code owners of vllm/v1 to vllm/v1/core ( #24928 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-16 19:07:34 +00:00
f4d6eb95cf
[gpt-oss][1b] streaming add item id, content id ( #24788 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-16 18:41:12 +00:00
cd1f885bcf
Directly get max encoder len from VLLM config in V1 ( #24866 )
...
Signed-off-by: Sugar-zsg <952242923@qq.com >
2025-09-16 17:52:31 +00:00
d593cf28fa
[Misc] Add removed encoder-decoder models to previously supported models list ( #24961 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-16 10:46:46 -07:00
faa7a5daac
[Bugfix] Fix unable to run encoder model when disable_hybrid_kv_cache_manager is true ( #24571 )
...
Signed-off-by: lianyibo <lianyibo1@kunlunit.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-09-16 17:36:58 +00:00
567939953b
[Core/DBO][1/N] Add Dual-Batch Overlap mechanism to VLLM ( #23693 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-16 12:21:48 -04:00
08369289af
[Core][MultiModalHasher] Don't convert memoryviews to bytes during hashing ( #24925 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-16 15:32:47 +00:00
73cfb3c5ee
[Model] Clean up and simplify Mamba2 Metadata Usage in both V0 and V1 ( #24331 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-09-16 14:53:43 +00:00
4e5affeaa1
[CI] Add Decode Context Parallelism (DCP) test to CI ( #24487 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-09-16 21:21:28 +08:00
e4f0b4cd96
(doc): set cmake c++ compatible standard when building on MacOS CPU. ( #23483 )
...
Signed-off-by: teekenl <teekenlau@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-16 06:08:46 -07:00
de3e53a75b
feat: Add Grafana and Perces monitoring dashboards for vLLM ( #23498 )
2025-09-16 05:53:40 -07:00
85e0df1392
[Docs] move benchmarks README to contributing guides ( #24820 )
2025-09-16 05:52:57 -07:00
0faf3cc3e8
Move SpeculativeConfig from config/__init__.py to config/speculative.py ( #24904 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-16 12:51:35 +01:00
7ea5c73ad7
[Feat][EPLB] A novel static EPLB placement strategy for MoE models. ( #23745 )
...
Signed-off-by: bruceszchen <bruceszchen@tencent.com >
Signed-off-by: Chen Bruce <bruceszchen@tencent.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Chen Bruce <cszwwdz@vip.qq.com >
Co-authored-by: lemon412 <lemon412@foxmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-16 10:55:16 +00:00
27fcfe7bcf
[Mamba] Support TP>1 with quantization for mamba2 mixer in case n_groups % tp_size == 0 ( #24593 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
Signed-off-by: tomeras91 <57313761+tomeras91@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-16 10:51:01 +00:00
68dbde5dbb
[Bugfix] remove duplicate tokens streamed in required tool choice streaming ( #23312 )
...
Signed-off-by: Jason Cheng <jasoncky96@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-09-16 15:16:32 +08:00
04ad0dc275
[benchmark] Add triton version in the moe tuned config ( #24769 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-16 14:10:54 +08:00
238c4c1705
[QWEN NEXT] Fused MoE kernels Optimization configs ( #24924 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-16 13:06:03 +08:00
8c54610265
[Bug] [Spec Dec]: Fix kv_cache dtype mismatch for Eagle3 drafter on FP8 target ( #24505 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-09-16 04:45:38 +00:00
17871983a2
[Bugfix] Fix sequence parallelism bug when enable pipeline parallelism ( #24021 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-09-16 04:32:32 +00:00
759ef49b15
Remove V0 Encoder-Decoder Support ( #24907 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-15 21:17:14 -07:00
5206ab20ba
[XPU] Fix circular import error. ( #24927 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-16 03:35:36 +00:00
0af3ce1355
Upgrade flashinfer to 0.3.1 ( #24470 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-16 02:36:09 +00:00
e1279ef00f
[Docs] Update instructions for how to using existing torch binary ( #24892 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-16 02:25:50 +00:00
2942970d44
[Metrics] Hide deprecated metrics with gpu_ prefix ( #24245 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-09-15 20:15:57 -06:00
3c96e7b8a1
[CI] Small Accuracy Eval Test for Deepseek Model ( #24259 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-15 20:14:50 -06:00
b42566f440
[Bug] Fix is_flashmla_supported Check Error ( #24774 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-15 20:10:55 -06:00
d96e11167d
Add pytest-cov and .coveragerc ( #24778 )
...
Signed-off-by: Reza Barazesh <rezabarazesh@meta.com >
2025-09-15 20:08:46 -06:00
2891603efd
[ROCm][Bugfix] Fix the case where there's bias ( #24895 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-15 20:05:12 -06:00
de2cc3d867
[Deprecation] Remove DeepGEMM Old Symbol Wrapper ( #24902 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-15 20:03:29 -06:00
e95084308b
Updated CODEOWNERS for flashinfer, mla, fused_moe ( #24906 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-16 02:01:28 +00:00
7f6f2c1182
HuggingFace -> Hugging Face in Integration with Hugging Face docs ( #24889 )
2025-09-15 17:28:35 -07:00
5bcc153d7b
[Compile] Fix noop_elimination pass and add tests for noop_elimination ( #24880 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-15 23:33:18 +00:00
45bfa49cb8
[Tests] fix initialization of kv hash in tests ( #24273 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
2025-09-15 21:48:27 +00:00
fd2f10546c
[ci] fix wheel names for arm wheels ( #24898 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-15 14:39:08 -07:00
e757a629e7
[Bug] Fix Cutlass Scaled MM Compilation Error ( #24887 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-15 17:21:17 -04:00
aae725af7c
[Performance] Remove redundant clone() calls in cutlass_mla ( #24891 )
2025-09-15 20:21:53 +00:00
73df49ef3a
[gpt-oss][1a] create_responses stream outputs BaseModel type, api server is SSE still ( #24759 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-15 13:08:08 -07:00
25aba2b6a3
[gpt-oss] Add IncompleteDetails to ResponsesRepsonse ( #24561 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-15 13:07:55 -07:00
94b03f88dd
Bump Flashinfer to 0.3.1 ( #24868 )
...
Signed-off-by: bbartels <benjamin@bartels.dev >
2025-09-15 12:45:55 -07:00
49bfc538e4
Update num_tokens_across_dp to use nccl instead of gloo ( #24105 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-09-15 19:05:48 +00:00
a0b26701c9
[Transform] Deterministic Hadacore Transforms ( #24106 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-09-15 12:59:31 -06:00
c4afdb69cc
Move MultiModalConfig from config/__init__.py to config/multimodal.py ( #24659 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-15 17:43:16 +00:00
b834b4cbf1
[USAGE] Improve error handling for weight initialization in Unquantized… ( #20321 )
...
Signed-off-by: Rafael Marcelino Koike <rafael.koike@oracle.com >
Signed-off-by: Rafael Koike <koike.rafael@gmail.com >
2025-09-15 16:45:49 +00:00
740f0647b1
Reinstate existing torch script ( #24729 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-15 09:43:40 -07:00
01413e0cf5
Fp8 paged attention update ( #22222 )
...
Signed-off-by: Xiao Yu <xiao.yu@amd.com >
Signed-off-by: xiao-llm <xiao.yu.dc@outlook.com >
Co-authored-by: Xiao Yu <xiao.yu@metamaterial.com >
Co-authored-by: Xiao Yu <xiao.yu@amd.com >
Co-authored-by: Bowen Bao <bowenbao@amd.com >
2025-09-15 10:43:26 -04:00
0e219cd50b
[Bugfix] Fix GLM4.1V multimodal processor with compatability for Transformers v4.56 ( #24822 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-15 20:45:06 +08:00
72c99f2a75
[Model]: support Ling2.0 ( #24627 )
...
Signed-off-by: vito.yy <vito.yy@antgroup.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-15 05:09:30 -07:00
bf214ca226
[Misc] Fix examples openai_pooling_client.py ( #24853 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-15 11:57:30 +00:00
2e41f5abca
[XPU] Set consistent default KV cache layout ( #24745 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-15 18:09:34 +08:00
bc0f6059a2
[UT] enhance free kv cache block queue popleft_n ( #24220 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-09-15 10:04:37 +00:00
8de261b04a
[P/D]kv_output_aggregator support P TP > D TP ( #23917 )
...
Signed-off-by: LCAIZJ <leichao139636@163.com >
Co-authored-by: leichao.lc <leichao.lc@antgroup.com >
2025-09-15 11:36:06 +02:00
a0d8b9738d
[Misc] Own KVConnectors installation ( #24867 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-15 02:21:09 -07:00
59e17dd4a0
[Misc] rename interval to max_recent_requests ( #24229 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-09-15 09:18:42 +00:00
4979eb79da
[Doc]: fix typos in various files ( #24821 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-15 01:08:52 -07:00
a8c0f59973
[Bugfix] MiDashengLM model contact error under concurrent testing ( #24738 )
...
Signed-off-by: chenbing8 <chenbing8@xiaomi.com >
Signed-off-by: bingchen-mi <chenbing8@xiaomi.com >
2025-09-15 06:38:12 +00:00
f4a948f33f
[Frontend] Skip stop in reasoning content ( #14550 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-09-15 06:04:55 +00:00
3f3313981c
[kv cache] update num_free_blocks in the end ( #24228 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-09-15 05:15:12 +00:00
78818dd1b0
[Docs] Have a try to improve frameworks/streamlit.md ( #24841 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-14 21:50:36 -07:00
8e5cdcda4e
[Hybrid Allocator] Support Pipeline Parallel ( #23974 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-14 15:55:17 -07:00
90f3f7d73e
[Spec Decoding]Support Spec Decoding Metrics in DP Mode ( #24049 )
...
Signed-off-by: wuhang <wuhang6@huawei.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-14 21:11:09 +00:00
6dc8da5dc1
[Chore] Remove ipex_ops warning ( #24835 )
...
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-14 19:41:53 +00:00
79cbcab871
Force use C++17 globally to avoid compilation error ( #24823 )
...
Signed-off-by: chenfengjin <1871653365@qq.com >
2025-09-14 19:30:10 +00:00
ff68035932
[Benchmarks] Throw usage error when using dataset-name random and dataset-path together ( #24819 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-14 17:50:01 +00:00
1177dd53e9
fix type of sampling rate for encode_base64 ( #24826 )
...
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
2025-09-14 16:17:16 +00:00
fc2dbcda8b
[Perf] Fix DeepGEMM Contiguous Layout Issue, 5.5% Throughput Improvement ( #24783 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-14 11:20:17 -04:00
fec347dee1
[Misc] Improve s3_utils type hints with BaseClient ( #24825 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-09-14 12:11:14 +00:00
cc3173ae98
[Multi Modal][Performance] Fused Q,K's apply_rope into one ( #24511 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-14 08:10:21 +00:00
3e903b6cb4
[Chore] Minor simplification for non-PP path ( #24810 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-13 17:41:36 -07:00
973c9d01da
[Minor] Simplify duplicative device check for cuda ( #24793 )
...
Signed-off-by: Ziliang Peng <ziliangdotme@gmail.com >
2025-09-13 18:28:38 +00:00
15b8fef453
Remove redundant assignment in xfer_buffers, This is a little fix ( #24732 )
...
Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com >
2025-09-13 08:11:59 +00:00
cfa3234a5b
[CI][Spec Decode] Adjust threshold for flaky ngram spec decoding test again ( #24771 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-09-13 15:45:11 +08:00
41ae4a1eab
[Doc]: fix typos in various files ( #24798 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-13 00:43:33 -07:00
4dad72f0d9
[Misc] Correct an outdated comment. ( #24765 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-13 00:34:53 -07:00
59d7ffc17f
[CI Failure] Fix test_flashinfer_cutlass_mxfp4_mxfp8_fused_moe ( #24750 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-13 07:29:19 +00:00
1da0f1441d
[Core][Multimodal] Cache supports_kw ( #24773 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-13 07:27:04 +00:00
98229db244
[Kernels][DP/EP] Optimize Silu Kernel for R1 ( #24054 )
...
Signed-off-by: elvircrn <elvircrn@gmail.com >
2025-09-13 00:17:27 -07:00
dbeee3844c
[Perf] Use NVIDIA hardware-accelerated instruction for float to fp8_e4m3 quantization ( #24757 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-13 00:16:24 -07:00
30498f2a65
[Doc]: Remove 404 hyperlinks ( #24785 )
...
Signed-off-by: Rakesh Asapanna <45640029+rozeappletree@users.noreply.github.com >
2025-09-13 00:15:41 -07:00
abc7989adc
[Docs] Remove Neuron install doc as backend no longer exists ( #24396 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-13 00:15:03 -07:00
9a8966bcc2
[Docs] Fix warnings in mkdocs build (continued) ( #24791 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-09-13 00:13:44 -07:00
5febdc8750
[Chore] Remove unused batched RoPE op & kernel ( #24789 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-13 00:08:20 -07:00
99bfef841f
[Bugfix] Fix GPUModelRunner has no attribute lora_manager ( #24762 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-12 23:55:14 -07:00
89e08d6d18
[Model] Add Olmo3 model implementation ( #24534 )
...
Signed-off-by: Shane A <shanea@allenai.org >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-13 03:26:21 +00:00
7f2ea7074e
[Frontend][Multimodal] Allow skipping media data when UUIDs are provided. ( #23950 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-09-13 02:16:06 +00:00
4fdd6f5cbf
[Core] Support async scheduling with uniproc executor ( #24219 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Ronald1995 <ronaldautomobile@163.com >
Co-authored-by: Ronald1995 <ronaldautomobile@163.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-12 16:34:28 -07:00
8226dd56bf
[Qwen3Next] Fixes the cuda graph capture conditions under large batch sizes ( #24660 ) ( #24667 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-09-12 22:31:32 +00:00
5fe643fc26
Add FLASHINFER_MLA to backend selector test ( #24753 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-09-12 22:30:07 +00:00
7ba32aa60b
[Attention][FlashInfer] Enable FP8 FlashInfer (TRTLLM) MLA decode ( #24705 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-09-12 15:45:53 -06:00
c89ed8de43
Invert pattern order to make sure that out_proj layers are identified ( #24781 )
...
Signed-off-by: Alexandre Marques <almarque@redhat.com >
2025-09-12 14:45:29 -07:00
3beadc2f25
[Compilation Bug] Fix Inductor Graph Output with Shape Issue ( #24772 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-12 21:23:05 +00:00
bc636f21a6
[Benchmark] Allow arbitrary headers to be passed to benchmarked endpoints ( #23937 )
...
Signed-off-by: Clayton Coleman <smarterclayton@gmail.com >
2025-09-12 13:57:53 -07:00
017354c0ef
[CI] Trigger BC Linter when labels are added/removed ( #24767 )
2025-09-12 11:44:36 -07:00
010acc6e1e
[Bugfix] Fix incompatibility between #20452 and #24548 ( #24754 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-12 11:17:29 -07:00
c8c42597ab
[CI] Speed up model unit tests in CI ( #24253 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
2025-09-12 10:36:50 -07:00
9d2a44606d
[UX] Remove AsyncLLM torch profiler disabled log ( #24609 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-12 10:08:44 -07:00
f17c075884
[Model] Switch to Fused RMSNorm in GLM-4.1V model ( #24733 )
...
Signed-off-by: SamitHuang <285365963@qq.com >
2025-09-12 09:12:23 -07:00
b0d1213ac3
[Models] Prevent CUDA sync in Qwen2.5-VL ( #24741 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-12 16:03:55 +00:00
57f94e88ea
[Models] Optimise and simplify _validate_and_reshape_mm_tensor ( #24742 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-12 15:37:37 +00:00
684b6870e1
[Bugfix][Frontend] Fix --enable-log-outputs does not match the documentation ( #24626 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-09-12 08:01:24 -07:00
a5b84f1cbf
[Core] Shared memory based object store for Multimodal data caching and IPC ( #20452 )
...
Signed-off-by: donglu <donglu@cohere.com >
2025-09-12 07:54:17 -07:00
9f04d9d55f
[Qwen3-Next] MoE configs for H100 TP=1,2 and TP2/EP ( #24739 )
...
Signed-off-by: elvircrn <elvircrn@gmail.com >
2025-09-12 07:54:04 -07:00
4d7c1d531b
[Bugfix] Fix MRoPE dispatch on XPU ( #24724 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-09-12 21:43:56 +08:00
41f17bf290
[Docs] Fix warnings in mkdocs build (continued) ( #24740 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-09-12 06:43:15 -07:00
bcb06d7baf
[Doc]: fix typos in various files ( #24726 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-12 06:43:12 -07:00
0377802c20
[Multimodal] Remove legacy multimodal fields in favor of MultiModalFeatureSpec ( #24548 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-12 21:42:23 +08:00
72fc8aa412
[Multi Modal] Add FA3 in VIT ( #24347 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-09-12 21:27:24 +08:00
fdb09c77d6
[sleep mode] save memory for on-the-fly quantization ( #24731 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-12 11:25:19 +00:00
7a1c4025f1
[Kernel] [CPU] refactor cpu_attn.py:_run_sdpa_forward for better memory access ( #24701 )
...
Signed-off-by: ignaciosica <mignacio.sica@gmail.com >
2025-09-12 19:23:07 +08:00
60a0951924
[Bugfix] Fix BNB name match ( #24735 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-12 11:12:01 +00:00
64d90c3e4f
[Misc][gpt-oss] Add gpt-oss label to PRs that mention harmony or related to builtin tool call ( #24717 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-12 18:57:07 +08:00
59d5d2c736
[CI/Build] Skip prompt embeddings tests on V1-only CPU backend ( #24721 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-12 18:51:01 +08:00
d21a36f5f9
[CI] Add ci_envs for convenient local testing ( #24630 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-12 08:52:25 +00:00
561a0baee0
[CI] Fix flaky test v1/worker/test_gpu_model_runner.py::test_kv_cache_stride_order ( #24640 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-12 07:49:09 +00:00
f592b3174b
[BugFix] Fix Qwen3-Next PP ( #24709 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-11 23:35:04 -07:00
7920de0a2a
[Bugfix] Fix MRoPE dispatch on CPU ( #24712 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-12 04:56:31 +00:00
ddcec289c7
Fix implementation divergence for BLOOM models between vLLM and HuggingFace when using prompt embeds ( #24686 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-12 04:35:48 +00:00
e090b7b45b
Enable conversion of multimodal models to pooling tasks ( #24451 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-09-12 03:30:41 +00:00
6a50eaa0d3
[DOCs] Update ROCm installation docs section ( #24691 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-11 20:02:53 -07:00
12a8414d81
[Qwen3-Next] MoE configs for H20 TP=1,2,4,8 ( #24707 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-12 10:06:26 +08:00
880c741bb6
[Bugfix] fixes the causal_conv1d_update kernel update non-speculative decoding cases ( #24680 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-11 18:16:43 -07:00
40b6c9122b
[V1] feat:add engine v1 tracing ( #20372 )
...
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com >
Signed-off-by: Ye Zhang <zhysishu@gmail.com >
Signed-off-by: RichardoMu <44485717+RichardoMrMu@users.noreply.github.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Mu Huai <tianbowen.tbw@antgroup.com >
Co-authored-by: Ye Zhang <zhysishu@gmail.com >
Co-authored-by: Benjamin Bartels <benjamin@bartels.dev >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: 瑜琮 <ly186375@antfin.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-11 17:10:39 -07:00
2e6bc46821
[Startup] Make DeepGEMM warmup scale with max-num-batched-tokens ( #24693 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-11 20:10:19 -04:00
fcba05c435
[Bug] Fix Layer weight_block_size Assertion Issue ( #24674 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-11 19:47:59 -04:00
7a30fa8708
[Doc] Clarify cudagraph capture size logic and default behavior in scheduler ( #18698 )
...
Signed-off-by: Zazzle516 <2405677060@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 23:18:09 +00:00
f82f7a8990
[Qwen3-Next] MOE configs for H100 TP4 ( #24699 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-11 15:45:52 -07:00
c3aea10dc8
[Perf] Use upstream CUTLASS for SM90 Block FP8 kernel ( #23280 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-11 15:43:14 -07:00
d4fd2768ef
[Bugfix][Attention] Fix FlashInfer MLA block size logic ( #24692 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-09-11 22:39:42 +00:00
7a70a71892
[Qwen3-Next] Add B200 MoE configs for Qwen3-next ( #24698 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2025-09-11 15:34:58 -07:00
7d4651997a
[CI/Build] Add bc-linter to vLLM CI ( #21234 )
...
Signed-off-by: zhewenli <zhewenli@meta.com >
2025-09-11 15:34:36 -07:00
569bf1c9c0
[Qwen3-Next] MoE configs for H200 TP=1,2,4 ( #24695 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-11 14:38:16 -07:00
1ec20355f5
[Bugfix] Set VLLM_ALLREDUCE_USE_SYMM_MEM default to False ( #24696 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-11 14:32:27 -07:00
e42af78b18
[flashinfer] [kernel] support for fp8 kv cache for trtllm prefill attention ( #24197 )
...
Signed-off-by: Xiaozhu <mxz297@gmail.com >
2025-09-11 14:20:09 -07:00
074854b24f
[Kernel][B200] mxfp4 fused cutlass moe ( #23696 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-11 17:04:56 -04:00
79ac59f32e
Update Spec Decode metrics to include drafted and accepted token throughput ( #24127 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-11 19:58:43 +00:00
b971f91504
[BugFix] Fix tokenize asyncio task leak ( #24677 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-11 19:44:04 +00:00
c733bd5e87
[Qwen3-Next] Add MoE Config for H200 ( #24688 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-11 12:40:15 -07:00
a892b259b4
[Doc] Remove Useless Comments ( #24687 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-11 12:25:47 -07:00
127ded0a9e
[Ultravox] Use wrapped_model_config to instantiate inner model ( #24679 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2025-09-11 18:52:24 +00:00
bb2b5126da
[VLM] Migrate remain DP-supported ViT models to use disable_tp ( #24363 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-11 18:30:41 +00:00
361ae27f8a
[Docs] Fix formatting of transcription doc ( #24676 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 11:18:06 -07:00
e26fef8397
fix some typos ( #24616 )
...
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
2025-09-11 10:48:46 -07:00
c1eda615ba
Fix model name included in responses ( #24663 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 10:47:51 -07:00
4aa23892d6
[Bugfix] Fix platform-specific routing in CustomOp implementations ( #24444 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2025-09-11 17:15:01 +00:00
1fdd5c42d7
[Kernels] Enable Torch Symmetric Memory All-Reduce By Default ( #24111 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-11 09:45:31 -07:00
bcbe2a4d9e
[VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames ( #24161 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-11 09:44:34 -07:00
51d41265ad
[Docs] Fix typos in EP deployment doc ( #24669 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 09:07:23 -07:00
4984a291d5
[Doc] Fix Markdown Pre-commit Error ( #24670 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-11 09:05:59 -07:00
404c85ca72
[Docs] Add transcription support to model ( #24664 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-11 07:39:01 -07:00
817beef7f3
[Bugifx] Fix qwen-next packed_modules_mapping ( #24656 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-11 22:26:17 +08:00
4f6593b058
[HybridKVCache][Platform] Add support_hybrid_kv_cache for platform ( #24646 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2025-09-11 21:47:58 +08:00
94e6b2d55f
Allow users to specify kv cache memory size ( #21489 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 13:41:07 +00:00
fd1ce98cdd
[CI] Split mteb test from Language Models Test ( #24634 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-11 06:37:51 -07:00
d11ec124a0
[Bench] Add qwen-next in benchmark_moe.py ( #24661 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-11 21:29:43 +08:00
f510715882
[build] add torch to tool.uv no-build-isolation-package ( #24303 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 13:19:44 +00:00
f946197473
[Docs] Fixes a typo in the qwen3next model name. ( #24654 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-09-11 19:35:14 +08:00
0cd72a7b72
[XPU] add missing dependency tblib for XPU CI ( #24639 )
...
Signed-off-by: Fanli Lin <fanli.lin@intel.com >
2025-09-11 11:22:33 +00:00
5f5271f1ee
Move LoRAConfig from config/__init__.py to config/lora.py ( #24644 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 11:01:38 +00:00
d6249d0699
Fix typing for safetensors_load_strategy ( #24641 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 10:41:39 +00:00
25bb9e8c65
[CI Failure] fix models/language/pooling/test_auto_prefix_cache_support.py ( #24636 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-11 03:31:23 -07:00
a1213fae5f
[Misc] Add @NickLucche to codeowners ( #24647 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-11 17:18:09 +08:00
a8b0361c92
[CI] Split pooling from entrypoints Test ( #24632 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-11 01:53:09 -07:00
ed5ae4aace
[Bugfix] Fix _synced_weight_loader ( #24565 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2025-09-11 16:52:33 +08:00
0fc36463e0
[CI]Add transformers_utils to Async Engine, Inputs, Utils, Worker Test ( #24615 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
2025-09-11 01:52:10 -07:00
d14c4ebf08
[Docs] Use 1-2-3 list for deploy steps in deployment/frameworks/ ( #24633 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-11 01:50:12 -07:00
ba6011027d
[Docs] Update V1 doc to reflect whisper support ( #24606 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-11 01:50:08 -07:00
85df8afdae
[Docs] Revise frameworks/anything-llm.md ( #24489 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-11 01:50:05 -07:00
6aeb1dab4a
[Bugfix] Fix incorrect import of CacheConfig ( #24631 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-11 01:48:25 -07:00
e93f4cc9e3
Add the support for the qwen3 next model (a hybrid attention model). ( #24526 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-11 15:32:09 +08:00
2048c4e379
[torchao] Support quantization configs using module swap ( #21982 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-09-10 23:53:24 -07:00
d13360183a
Remove redundant all gather + split ( #23441 )
...
Co-authored-by: Chenxi Yang <cxyang@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2025-09-10 23:45:07 -07:00
9bd831f501
[Model] New model support for Motif-1-Tiny ( #23414 )
...
Signed-off-by: ca1207 <ca1207zzz@gmail.com >
Signed-off-by: TaehyunKim <73943231+ca1207@users.noreply.github.com >
Co-authored-by: WyldeCat <skan1543@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-10 23:29:40 -07:00
e2b1f863aa
[Doc]: fixing doc typos ( #24635 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-10 23:19:28 -07:00
41329a0ff9
[Core] feat: Add --safetensors-load-strategy flag for faster safetensors loading from Lustre ( #24469 )
...
Signed-off-by: Shiqi Sheng <shengshiqi@google.com >
Signed-off-by: shengshiqi-google <160179165+shengshiqi-google@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-10 23:10:01 -07:00
ee0bc5e1b4
Enable --profile in 'vllm bench throughput' ( #24575 )
...
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com >
2025-09-10 23:06:19 -07:00
3d1393f6fc
Kimi K2 Fused MoE kernels Optimization configs ( #24597 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
2025-09-10 23:06:16 -07:00
8a894084d2
[Engine][Chore] use local variable and remove output var assignment ( #24554 )
...
Signed-off-by: Guy Stone <guys@spotify.com >
2025-09-10 23:05:42 -07:00
e2d8c27f68
[BugFix] Fix pipeline parallel ( #24621 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-10 23:05:30 -07:00
29799ddacc
[Bugfix] Add missing VIT backend dispatch on CPU ( #24623 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-10 22:28:41 -07:00
f17a6aa4ec
[Ultravox] Fix Gemma instantiation, support quantization via --hf-overrides ( #24131 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2025-09-10 22:25:34 -07:00
6c8deacd72
[Bug] [Spec Decode] Fix model_initialization test and mismatch in aux_hidden_layers ( #24613 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-10 21:23:18 -07:00
55b823ba0f
Add @chaunceyjiang to codeowner for reasoning Reasoning and Tool parser ( #24406 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-11 04:23:04 +00:00
8c5a747246
[distributed] update known issues ( #24624 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-11 11:09:38 +08:00
5931b7e5d9
[Models][Quantization] Add quantization configuration update in Voxtral model ( #24122 )
...
Signed-off-by: Alexandre Marques <almarque@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-10 19:13:56 -07:00
cc99baf14d
[Misc] Make timeout passable in init_distributed_environment ( #24522 )
...
Signed-off-by: jberkhahn <jaberkha@us.ibm.com >
2025-09-10 15:41:12 -07:00
dcb28a332b
[Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration ( #21078 )
...
Signed-off-by: hjjq <hanjieq@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-10 15:31:10 -07:00
fba7856581
[Perf] Warmup FlashInfer attention during startup ( #23439 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-09-10 15:03:17 -07:00
b5e383cd8b
[gpt-oss] raise error for flashinfer backend without trtllm ( #24482 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-10 14:33:13 -07:00
9a161307f5
[torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends ( #19767 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-10 13:59:55 -07:00
37e8182bfe
[v1] Add Whisper model support (encoder-decoder) ( #21088 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: NickLucche <nlucches@redhat.com >
2025-09-10 13:53:35 -07:00
4db4426404
[CI] Fail subprocess tests with root-cause error ( #23795 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-10 13:53:21 -07:00
a0933c3bd6
[Bugfix] Enable FP8 KV cache for FlashInfer and Triton backend on non-sm100 GPUs ( #24577 )
...
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg >
2025-09-10 12:33:41 -07:00
09e68bce34
[Misc] update log level debug to warning when process port is used by ( #24226 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-10 11:32:57 -07:00
9fb74c27a7
[Core] Support configuration parsing plugin ( #24277 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
Signed-off-by: Xingyu Liu <38244988+charlotte12l@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-10 11:32:43 -07:00
4032949630
[Bugfix] Fix DeepEP config for DP4TP4 ( #23619 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-09-10 10:37:56 -07:00
08abfa78ec
[Bugfix] fix modelopt exclude_modules name mapping ( #24178 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-10 10:20:46 -07:00
2bef2d1405
[Logging] allow config logging stream ( #24336 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2025-09-10 15:02:01 +00:00
36cacd0958
[Doc] Add documentation for GLM-4.5 series models: tool-calling and reasoning parser ( #24589 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-09-10 07:50:55 -07:00
bb3eb80d92
[Core] Split LoRA layers ( #24574 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-10 07:47:51 -07:00
fcc0a3130a
[CI] Fix tensorizer test assertion ( #24545 )
...
Signed-off-by: Peter Schuurman <psch@google.com >
2025-09-10 06:57:36 -07:00
736569da8d
[Platform] Custom ops support for LMhead and LogitsProcessor ( #23564 )
...
Signed-off-by: zzhx1 <zzh_201018@outlook.com >
2025-09-10 06:26:31 -07:00
2eb9986a2d
[BugFix] python collect_env.py and vllm collect-env compatibility with uv venv ( #24066 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-09-10 21:25:33 +08:00
ccee371e86
[Docs] Fix warnings in mkdocs build (continued) ( #24092 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-10 06:23:28 -07:00
c0bd6a684a
Fix Auto_Round Quatization Loading on SM75 and Lower GPUs ( #24217 )
...
Signed-off-by: RoadToNowhereX <37441177+RoadToNowhereX@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-10 06:22:31 -07:00
3144d90217
fix some typos ( #24167 )
...
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-10 06:21:23 -07:00
2f5e5c18de
[CI/Build] bump timm dependency ( #24189 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-10 06:20:59 -07:00
bd98842c8a
[CI] Add PPL test for generation models ( #24485 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-10 06:16:39 -07:00
d6069887c6
[rocm] enable torchao quantization for rocm ( #24400 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2025-09-10 06:16:21 -07:00
492196ed0e
[CI/Build] split true unit tests to Entrypoints Unit Tests ( #24418 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-10 06:16:07 -07:00
f4f1a8df22
[BugFix] Ensure integrity of reused CPU tensors during async scheduling ( #24527 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: guoze.lin <guozelin@tencent.com >
2025-09-10 21:15:14 +08:00
0b9a612fa3
[BugFix][easy] Fix flaky test test_gpt_oss_multi_turn_chat ( #24549 )
...
Signed-off-by: lacora2017 <yehu@meta.com >
Co-authored-by: lacora2017 <yehu@meta.com >
2025-09-10 21:14:55 +08:00
4c04eef706
[BugFix][Multi Modal] Fix TensorSchema shape mismatch in Molmo ( #24559 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-09-10 06:14:27 -07:00
f36355abfd
Move LoadConfig from config/__init__.py to config/load.py ( #24566 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-10 06:14:18 -07:00
9e3c3a7df2
[LoRA]: Add LoRA support to Mistral's Voxtral models ( #24517 )
...
Signed-off-by: Yash Pratap Singh <yashsingh20001@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-10 06:12:03 -07:00
6cbd41909e
Feature/vit attention unification# 23880 ( #23978 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-10 06:10:14 -07:00
72d30108a0
Support for NemotronH Nano VLM ( #23644 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com >
2025-09-10 06:10:06 -07:00
8b83b93739
[Docs] Document the extra memory footprint overhead when using EPLB ( #24537 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-10 06:09:49 -07:00
9dbefd88e9
[Docs] Improve organisation of API Reference nav ( #24569 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-10 06:08:21 -07:00
7c195d43da
[ROCm][Bugfix] Fix Aiter RMSNorm ( #23412 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-09-10 21:08:03 +08:00
0ae43dbf8c
[Attention] add DCP support for FLASH_ATTN_MLA backend ( #24453 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-10 17:19:26 +08:00
267c80d31f
[Model] Limit CPU threads for image transformations in InternVL to reduce cpu contention. ( #24519 )
...
Signed-off-by: li-jinpeng <3332126450@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-10 16:45:44 +08:00
77f62613f9
Consolidate rendering parameters into RenderConfig dataclass ( #24543 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-10 08:44:47 +00:00
feaf202e93
[Bugfix] Guard _may_reorder_batch for encoder-only models on CPU ( #24319 ) ( #24348 )
...
Signed-off-by: Remy <eunhwan.shin@dtonic.io >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-09-10 14:24:42 +08:00
91130ae376
[docs] promo pytorch conf and ray summit ( #24562 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-09 23:24:20 -07:00
e40827280b
[Docs] Enable relative links in examples to function when rendered in the docs ( #24041 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-09 21:40:45 -07:00
4377b1ae3b
[Bugfix] Update Run:AI Model Streamer Loading Integration ( #23845 )
...
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai >
Signed-off-by: Peter Schuurman <psch@google.com >
Co-authored-by: Omer Dayan (SW-GPU) <omer@run.ai >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-09 21:37:17 -07:00
009d689b0c
[Core] Simplify and unify mm uuid handling & auto-generated mm hash overrides processing. ( #24271 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-09-09 21:36:09 -07:00
0efdb5c3ba
[gpt-oss] Cache permute indices for faster MXFP4 MoE layer loading ( #24154 )
...
Signed-off-by: Wei Wei <wwei6@meta.com >
2025-09-10 04:27:53 +00:00
53b42f4102
[BugFix][Spec Decode] Fix out-of-range index triggered by eagle3; re-enable test for LlamaForCausalLMEagle3 ( #24392 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-09-09 21:24:23 -07:00
309d7aa401
[P/D] MultiConnector supports shutdown ( #24425 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-09 21:24:11 -07:00
b4a01aaf95
[KV Connector] More async support for get_num_new_matched_tokens ( #23620 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
2025-09-09 21:23:37 -07:00
83dd28aae4
[CI] Adjust threshold for flaky ngram spec decoding test ( #24528 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-09 21:07:33 -07:00
f88e84016f
[BugFix] Fix async core engine client finalizer ( #24540 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-09 21:07:13 -07:00
3c2156b3af
[Hardware][Apple-CPU] Enable native bfloat16 on Apple Silicon (M2 and later) ( #24129 )
...
Signed-off-by: ignaciosica <mignacio.sica@gmail.com >
2025-09-10 03:50:21 +00:00
7e7db04310
[CI] Retry flaky fp8 cutlass mla tests ( #24536 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-09 20:33:10 -07:00
41f160b974
Add @heheda12345 to CODEOWNERS of KVCacheManager related code ( #24546 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-10 03:30:32 +00:00
dc625ea6b8
[Perf] Convert np array to torch tensor to index into block table for attn chunking ( #24474 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-09-09 20:01:06 -07:00
b23fb78623
[Bugfix] Fix for 24530. Fix naive all2all shared expert overlap. ( #24538 )
2025-09-09 17:53:53 -07:00
561f38dc3c
[Bugfix] Improve EPLB config validation error message ( #24524 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-10 00:32:36 +00:00
73e688cb79
[ROCm][Feature] Enable Pipeline Parallelism with Ray Compiled Graph on ROCm ( #24275 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-09-09 23:27:35 +00:00
fb1a8f932a
[Benchmark] Add option to skip oversampling in benchmark ( #24457 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2025-09-09 22:00:17 +00:00
0dc9cbb527
[Benchmark] Update bench doc with mtbench, blazedit, spec bench ( #24450 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2025-09-09 21:15:41 +00:00
b5fb3005a8
[Log] Use a relative path in debug-level logs to distinguish files with identical names ( #23846 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-09 16:46:35 -04:00
15de5ff9ea
[Feature] Disallow FlashMLA on Blackwell ( #24521 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-09 14:59:34 -04:00
b8a93076d3
[CI] execute all piecewise compilation tests together ( #24502 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-09 11:05:25 -07:00
c3f9773b2c
[TPU] Fix tpu structured decoding in mixed batches ( #24458 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-09-09 11:04:25 -07:00
3707cb2505
[Docs] Gemma3n transcriptions endpoint support ( #24512 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-09 11:03:32 -07:00
920ed46b09
[Misc] bump outlines_core to fix the version conflicts with outlines >= 1.2.0 ( #24368 )
...
Signed-off-by: Kazuhiro Serizawa <nserihiro@gmail.com >
Signed-off-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-09 10:59:46 -07:00
15cb047e25
Extend renderer with embedding support and integrate completion endpoint ( #24405 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-10 01:46:46 +08:00
9ad0688e43
[Bugfix] Fix hidden_size for multimodal classification model ( #24501 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-09 10:37:25 -07:00
b9a1c4c8a2
[ROCm][CI/Build] Sync ROCm dockerfiles with the ROCm fork ( #24279 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-09 12:21:56 -04:00
1aa427fdc1
[Kernels] Add Flash Linear Attention Kernels ( #24518 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-10 00:04:41 +08:00
1c63a16b65
[Core] Run garbage collector after CUDA graph capture to fix throughput regression ( #24128 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-09 10:38:10 -04:00
922d3b401b
[Bugfix] Handle the edge case in detokenizer where processed tokens contain both stop str and eos token ( #23938 )
...
Signed-off-by: dtransposed <damian.bogunowicz@gmail.com >
2025-09-09 07:30:24 -07:00
19332c0479
[Model] Systematic support for fp32 head, pooling models part ( #23810 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-09 07:29:50 -07:00
a55cf41a09
[Compilation][WideEP] Enable Piecewise CUDAGraph for DeepEPHT ( #24123 )
2025-09-09 10:21:10 -04:00
6fb2788163
[CI/Build][Doc] Fully deprecate old bench scripts for serving / throughput / latency ( #24411 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-09 10:02:35 +00:00
3d2a2de8f7
[RL] fast weight update with zmq + ipc handles ( #24295 )
...
Signed-off-by: huangweixiao <huangweixiao@msh.team >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-09-09 16:57:46 +08:00
1116590b16
[gpt-oss] Validate gpt-oss python tool during initialization ( #23856 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-09 08:37:48 +00:00
ccb97338af
[Misc] Add Codex settings to gitignore ( #24493 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-09-09 01:25:44 -07:00
45c9cb5835
[Misc] Add claude settings to gitignore ( #24492 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-09 01:14:45 -07:00
e283976f3a
[Performance][MM] Building the inverse permutation in O(n) time in Qwen2_5_VisionTransformer ( #24443 )
...
Signed-off-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: Junhong <liujunhong11@huawei.com >
2025-09-09 00:24:11 -07:00
46876dff32
[Doc]: fixing typos to improve docs ( #24480 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-08 23:06:04 -07:00
1823a00d67
[Misc] Support bench serve long context ( #24373 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-09-08 22:53:10 -07:00
ed16d0f26f
[Doc] mention fpdb for multiprocess breakpoints ( #24452 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
2025-09-08 21:46:45 -07:00
0cdd213641
[Misc] Improve Worker process title and logging prefix ( #22205 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-08 21:43:48 -07:00
948dd3443b
[Bugfix] Fix Apertus HF repo name ( #24447 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-08 21:40:29 -07:00
b2f7745774
Add data_parallel_size to VllmConfig string representation ( #24298 )
...
Co-authored-by: Cong Chen <congc@meta.com >
2025-09-08 21:35:18 -07:00
82dfb12e52
[Core] Use sha256 bytes instead of BlockHash to reduce GC overhead ( #23673 )
...
Signed-off-by: linzebing <linzebing1995@gmail.com >
2025-09-08 21:34:37 -07:00
bba1042c6f
[Flashinfer] Support Flashinfer TRTLLM FP8-qkv BF16/FP16-out Attention Kernel ( #23647 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-08 20:53:07 -07:00
b6fbc15634
[BugFix][Model] Fix Ernie4.5-VL hanging on long inputs ( #24074 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2025-09-09 11:37:16 +08:00
3e0d4a3475
Move KVTransferConfig from config/__init__.py to config/kv_transfer.py ( #24434 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 20:30:32 -07:00
562663a044
Bump actions/github-script from 7.0.1 to 8.0.0 ( #24413 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-09 03:12:44 +00:00
ed1623a88a
Bump actions/stale from 9.1.0 to 10.0.0 ( #24412 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-09 03:11:20 +00:00
13b89bd823
[doc] update vllm serve cli args documentation ( #24329 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2025-09-09 03:07:58 +00:00
22a0070530
Bump actions/setup-python from 5.4.0 to 6.0.0 ( #24414 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-09 02:54:58 +00:00
170129eb28
[gpt-oss] Harmony changes with container tool support ( #23386 )
...
Signed-off-by: zhiweiz <zhiweiz@fb.com >
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
Co-authored-by: zhiweiz <zhiweiz@fb.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2025-09-08 19:03:50 -07:00
955c624915
[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and EP MoE ( #24134 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-09-08 19:01:51 -07:00
4f87abdcc6
Update reviewers for modelopt related files ( #24468 )
2025-09-09 01:53:13 +00:00
6910b56da2
[CI] Add nightly multiarch manifests to dockerhub ( #24102 )
...
Signed-off-by: Sahithi Chigurupati <chigurupati.sahithi@gmail.com >
Signed-off-by: Simon Mo <simon.mo@hey.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-09 01:18:09 +00:00
e10fef0883
[Hardware][IBM Z] Fix Outlines Core issue for s390x ( #24034 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2025-09-08 16:50:34 -07:00
e680723eba
[Bugfix] Disable the statslogger if the api_server_count is greater than 1 ( #22227 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-09-08 15:28:03 -07:00
620db1fc58
[Attention] FlashAttention MLA cudagraph support ( #23958 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-08 22:05:26 +00:00
41183c1fe0
[Spec Decode] Fix offline spec_decode.py ( #24257 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-08 20:44:13 +00:00
43d9ad03ba
[Model loader]: support multi-thread model weight loading ( #23928 )
...
Signed-off-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com >
Signed-off-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-08 18:49:39 +00:00
7be141b2c5
[CI] Enable encoder model compilation test ( #24442 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-08 11:48:06 -07:00
8d7f39b48c
[Model] Remove quantized mixtral ( #24437 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-08 11:02:14 -07:00
cd08636926
[Spec Decode][Benchmark] Add Blitzedit dataset ( #23605 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-08 10:32:52 -07:00
3feeeb9fea
[Spec Decode][Benchmark] Add Spec Bench Dataset for benchmarking ( #23563 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2025-09-08 10:32:42 -07:00
6f4a82f8b5
[Model] Enable BNB support for qwen2_5_omni_thinker ( #24420 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-08 09:37:08 -07:00
c44797a4d6
[Docs]add eplb_config param use docs ( #24213 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-09-08 09:36:57 -07:00
55be93baf5
[Doc]: fix 2 hyperlinks leading to Ray site after they changed Ray's doc structure ( #24438 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 09:36:54 -07:00
717fc00e98
[Docs] Move feature compatibility tables to README ( #24431 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 06:45:14 -07:00
01dfb5e982
[Frontend] User-provided uuids for medias in chat. (RFC #22044 ) ( #23449 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-08 06:42:20 -07:00
03dd652c16
Move KVEventsConfig from config/__init__.py to config/kv_events.py ( #24433 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 06:41:27 -07:00
9cd76b71ab
[Misc] Terratorch related fixes ( #24337 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-08 06:40:26 -07:00
e041314184
[Bugfix] Fix mamba2 prefill chunking ( #23279 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
Signed-off-by: tomeras91 <57313761+tomeras91@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-08 11:42:41 +00:00
5e537f45b4
[Bugfix] Fix get_quant_config when using modelscope ( #24421 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-09-08 11:03:02 +00:00
c2a8b08fcd
[Doc] Fix issues in integrations/llamastack.md ( #24428 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-08 02:28:32 -07:00
f4962a6d55
[Doc]: fix typos in Python comments ( #24417 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-08 00:22:16 -07:00
2f0b833a05
[Docs] Fix a tip indentation and typo ( #24419 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-08 00:19:40 -07:00
425b04b8f4
[gpt-oss][Responses API] Fix the function call id format ( #24409 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-08 06:49:52 +00:00
60f0843ef8
[Model] Remove unnecessary CUDA sync of Qwen2VL image and video preprocess ( #24334 )
...
Signed-off-by: Win <chatcharinsang@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-07 23:11:12 -07:00
8a46602606
[Model] Remove unnecessary CUDA sync of GLM-4.1V image and video preprocess ( #24332 )
...
Signed-off-by: Win <chatcharinsang@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-07 23:10:54 -07:00
61aa4b2901
[P/D] Add a shutdown method to the Connector API ( #22699 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-07 23:07:00 -07:00
8c892b1831
[Doc] Fix UTF-8 encoding issues in documentation generation on Windows ( #24361 )
...
Signed-off-by: alekramelaheehridoy <aliqramalaheehridoy@gmail.com >
Signed-off-by: alekramelaheehridoy <alekramelaheehridoy@gmail.com >
Co-authored-by: alekramelaheehridoy <alekramelaheehridoy@gmail.com >
2025-09-07 22:33:52 -07:00
3bca396f79
[CI/Build] Fix local image inputs in test_pixtral.py ( #24401 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-08 03:31:35 +00:00
3a3e91bdfe
[CI/Build] Disable flaky test_structured_output tests ( #24404 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-08 02:51:59 +00:00
b3d7e3c845
[Sampler] Support returning all prompt logprobs ( #23868 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-07 19:34:31 -07:00
67841317d1
[xpu] upgrade ipex/python3.12 for xpu ( #23830 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-09-08 02:07:16 +00:00
86173ad593
[Kernel] Support decode context parallelism on Blackwell with CUTLASS MLA ( #24385 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-09-08 09:27:12 +08:00
795b6951cd
Add @luccafong to codeowner for spec decode ( #24397 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-09-08 08:30:27 +08:00
2e5d21378d
Skip MM Encoder for non-first PP ranks ( #24387 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-07 09:38:35 -07:00
0661cb9df3
Add renderer-based prompt processing for embedding and classification endpoints ( #24356 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-07 08:26:48 +00:00
105d3d62ef
[TPU] Remove TopKTopPSampler dependency for TPU sampler ( #24391 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-07 01:12:36 -07:00
62f66be1f7
[Bugfix] Fix Qwen3-coder moe tuned config ( #24072 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-07 05:19:46 +00:00
81c53ef55c
[Misc] collect flashinfer version in collect_env.py ( #24378 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-07 03:30:41 +00:00
75334956c2
QWEN3 Thinking Fused MoE kernels Optimization configs ( #24330 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
2025-09-07 03:18:54 +00:00
77aec83b8c
[Benchmark] add benchmark for custom activation op ( #23908 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-06 20:12:05 -07:00
e67597545b
[CI][Fix] deterministic seed for flaky CI runs on structured outputs ( #24380 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-09-07 11:10:40 +08:00
37a6fa95fd
Migrate Qwen2 inputs to TensorSchema ( #23475 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-06 20:07:31 -07:00
558f0907dc
[attention][DCP] use AttentionImpl.need_to_return_lse_for_decode ( #24372 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-07 01:18:59 +00:00
4172235ab7
[V0 deprecation] Deprecate V0 Neuron backend ( #21159 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-06 16:15:18 -07:00
848562bd49
break execute_model in gpu_model_runner into sub-functions for custom scopes ( #24265 )
...
Co-authored-by: Bangsheng Tang <bangsheng@meta.com >
2025-09-06 14:02:47 -07:00
e68dc2f014
[Bugfix] Fix unstable silu_mul+nvfp4 quant fusion test ( #24370 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-06 20:39:34 +00:00
a3645ed94d
[Frontend][Responses API] Support reporting tool output tokens and fix reasoning token count ( #24285 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-06 13:27:15 -07:00
fb691ee4e7
[Fix] [gpt-oss] fix non-tool calling path for chat completion ( #24324 )
2025-09-06 19:10:32 +00:00
6024d115cd
Lora bias(enable_lora_bias) deprecate warning ( #24339 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-07 00:42:19 +08:00
7555d6b34a
[Bugfix] Fix test_mixtral_moe ( #24371 )
2025-09-06 09:32:03 -07:00
00a4e56d8d
[Bugfix] Fix broken deepseek fp8 TP weights loading ( #24367 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-06 09:23:12 -07:00
0eadaeff7e
[Bugfix] Avoid uninitialized usage of azp_val when AZP is false. ( #24335 )
...
Signed-off-by: Mohan Kumar Kumar <mohan.cbein@gmail.com >
Signed-off-by: mohankku <mohan.cbein@gmail.com >
2025-09-06 08:17:03 -07:00
0077c8634e
Add @benchislett to codeowner for spec decode and structured outputs ( #24362 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-09-06 22:03:35 +08:00
b121ca22ad
[CI] Disable flaky structured output test from CI ( #24366 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-06 13:31:56 +00:00
eddaafc1c7
[Multimodal] Improve max video embedding length estimation in V1 ( #24312 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-09-06 02:33:19 -07:00
305a1cc0d2
refactor: Turn GPUModelRunner.inputs_embeds to a CpuGpuBuffer ( #24345 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-05 23:01:23 -07:00
6d6c6b05d3
[New Model]: google/embeddinggemma-300m ( #24318 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-05 22:58:36 -07:00
53b19ccdd5
[Core] Allow disabling TP sharding for parallel Linear layer ( #23024 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-05 22:53:58 -07:00
6432739ef1
[Bugfix] Catch and log invalid token ids in detokenizer ( #24351 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-05 22:30:22 -07:00
ac201a0eaf
[Feature] Support Decode Context Parallel (DCP) for MLA ( #23734 )
...
Signed-off-by: hongchao <hongchao@msh.team >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: hongchao <hongchao@msh.team >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-09-06 13:24:05 +08:00
3c529fc994
[KV Sharing] Raise error if using eagle with fast prefill ( #24350 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-09-05 20:22:40 -07:00
35bf193864
[Doc]: fix typos in Python comments ( #24294 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-05 19:41:12 -07:00
35efa70297
Add @22quinn as code reviewer for RL related components ( #24346 )
2025-09-06 01:56:15 +00:00
cee182b297
[Perf][V1] Fully overlap model execution ( #23569 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-09-05 18:20:17 -07:00
c954c6629c
[CI] Add timeouts to tests ( #24260 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-09-05 17:26:22 -07:00
9dfbeb41e5
[RFC] allow cancelation after shutdown in blocking collective_rpc ( #23390 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2025-09-05 14:14:18 -07:00
eedb2a2a10
[Bugfix] Fix silu_mul+quant fusion test ( #24341 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-05 20:13:42 +00:00
23a6c5280e
[gpt-oss][Bugfix]Fix streamableparser for missing handling of certain token_ids ( #24306 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-05 10:26:00 -07:00
7812bcf278
[docs] add shenzhen meetup ( #24326 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-05 22:48:42 +08:00
006e7a34ae
Adding int4 and int8 models for CPU benchmarking ( #23709 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-09-05 20:08:50 +08:00
e599e2c65e
[XPU][P/D] Add XPU support in NixlConnector ( #22436 )
...
Signed-off-by: zhenwei <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-04 21:03:12 -07:00
c29fb540ff
[gpt-oss] tool parser supports for /chat/completions [1/n] ( #22386 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-04 20:39:12 -07:00
65e038931d
[Frontend] Skip unnecessary detokenization when token_id is requested ( #24236 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-04 23:04:12 +00:00
886ccbe5ba
[CI/Build] Reduce the number of redundant cases to test for LoRA ( #24276 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-04 21:58:44 +00:00
adc3ddb430
[Bugfix][Misc] Fix silu_and_mul_nvfp4_quant issue and extract common utils for nvfp4 kernel source files ( #23727 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-04 14:25:45 -07:00
60b755cbcb
[Misc] Have AsyncLLM custom_stat_loggers extend default logger list ( #20952 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-09-04 14:25:30 -07:00
482e52f56c
QWEN3 Coder Fused MoE kernels Optimization configs ( #24266 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
2025-09-04 20:33:43 +00:00
78336a0c3e
Upgrade FlashInfer to v0.3.0 ( #24086 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-04 09:49:20 -07:00
94866d7c93
[Misc] Slight improve deepgemm print ( #24085 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-04 16:06:51 +00:00
83609ca91d
[Doc]: fix typos in Python comments ( #24173 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-04 08:52:17 -07:00
e41a0fa377
[Perf] Freeze core engine proc heap after init ( #24008 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-04 22:55:23 +08:00
37241077d5
[Misc] Removed force_fp8_e4m3fnuz from FP8LinearOp ( #23725 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-04 09:25:40 -04:00
c9f7081f9c
[LoRA]: Add lora support to qwen-2.5-omni ( #24231 )
2025-09-04 05:50:50 -07:00
16ded21eeb
[XPU] support Triton Attention backend on Intel GPU ( #24149 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-04 20:41:08 +08:00
2b30afa442
Use hidden_size_per_head as head_size fallback ( #24221 )
...
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
2025-09-04 12:59:16 +01:00
eafa8dcde6
[Model] Add pp support for hunyuan ( #24212 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-04 03:58:26 -07:00
6c7af8110a
[Doc] Update vLLM Singapore Meetup info ( #24234 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-09-04 02:58:18 -07:00
8f423e5f43
[Feature][Response API] Add streaming support for non-harmony ( #23741 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-09-04 17:49:06 +08:00
369a079568
[Hardware][Apple-CPU] Disable OneDNN build for Apple Silicon ( #24200 )
...
Signed-off-by: ignaciosica <mignacio.sica@gmail.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-09-04 02:48:25 -07:00
402759d472
[Attention] FlashAttn MLA ( #14258 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-04 02:47:59 -07:00
2c301ee2eb
[Bugfix] Fix Incremental Detokenization with tokenizers == 0.22.0 ( #24159 )
...
Signed-off-by: Fanli Lin <fanli.lin@intel.com >
Signed-off-by: Fanli Lin <fanli0116@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-04 02:47:08 -07:00
3efb9f4d95
[Attention][Platform] Refactor MLA to support Custom Op ( #23332 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2025-09-04 02:46:37 -07:00
04f3c35cff
Improve flexibility of auto_tune.sh execution. ( #23766 )
...
Signed-off-by: Anthony Su <50185138+anthonsu@users.noreply.github.com >
Signed-off-by: anthonsu <50185138+anthonsu@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-04 09:41:41 +00:00
51d5e9be7d
[Core][Model] Terratorch backend integration ( #23513 )
...
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com >
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-04 00:22:41 -07:00
e7fc70016f
[Model] Add MiDashengLM model support ( #23652 )
...
Signed-off-by: chenbing8 <chenbing8@xiaomi.com >
Signed-off-by: bingchen-mi <chenbing8@xiaomi.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-04 00:08:09 -07:00
12e1e63cc5
[Misc] Enhance output readability of helper script ( #24214 )
...
Signed-off-by: Weida Hong <wdhongtw@google.com >
2025-09-04 06:38:26 +00:00
57b1ce94f7
[CPU] Refactor CPU unquantized linear ( #24150 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-04 14:28:45 +08:00
cb55ad86fe
Migrate ultravox inputs to TensorSchema ( #23503 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-04 06:09:11 +00:00
712b273f65
[Refactor] Introduce basic Renderer for completion-style request ( #24010 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-04 05:21:12 +00:00
e919d6f549
[Kernel][Bugfix] Fix grouped topk cu ( #24146 )
...
Signed-off-by: mayuyuace <qiming1.zhang@intel.com >
2025-09-04 12:37:37 +08:00
a38f8bd54c
[Feature][Responses API]Support MCP tools with streaming mode + background mode ( #23927 )
...
Signed-off-by: wuhang <wuhang6@huawei.com >
2025-09-04 04:05:10 +00:00
b5ee1e3261
Remove deprecated PyNcclConnector ( #24151 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-09-03 22:49:16 +00:00
36c260dad6
[Feature][gpt-oss] Add support for num_cached_tokens and num_reasoning_tokens tracking ( #23460 )
...
Signed-off-by: George Nagy II <george.nagy0969@gmail.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-03 21:08:47 +00:00
a43a3f1770
[Bugfix][DP] DP distribution does not require ray[default] ( #23822 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-09-03 13:21:36 -07:00
6adaed42f4
[Feature][P/D]: Optimize NIXL Connector xfer Launch ( #23887 )
...
Signed-off-by: ycyaw66 <497410282@qq.com >
Co-authored-by: ycyaw66 <497410282@qq.com >
2025-09-03 19:14:30 +00:00
a742322092
[Attention] Blackwell FP8 MLA support with CUTLASS_MLA backend ( #23289 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-03 14:05:24 -04:00
731a6940e3
Migrate whisper inputs to TensorSchema ( #23505 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-03 18:04:00 +00:00
e9b92dcd89
[Kernels] Overlap shared experts with send/recv ( #23273 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-03 12:35:18 -04:00
fa4311d85f
[V1] v1 engine + full CUDA graph support for PLaMo2 ( #23998 )
...
Signed-off-by: Hemmi Shinichi <shemmi@preferred.jp >
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
Co-authored-by: Hemmi Shinichi <shemmi@preferred.jp >
Co-authored-by: Thomas Parnell <tom.parnell@gmail.com >
2025-09-03 08:24:02 -07:00
6d80ae83e1
[Bugfix] Fixing division by zero in triton_attn if query_heads/kv_heads > 16 ( #23424 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
2025-09-03 15:01:09 +00:00
4ba0c587ba
FIX: Add libnuma-dev to Dockerfile for dev stage ( #20388 )
...
Signed-off-by: dongbo910220 <1275604947@qq.com >
2025-09-03 07:17:20 -07:00
6997a25ac6
[Model] Remove useless code from MiniMax implementation ( #23982 )
...
Signed-off-by: QscQ <qscqesze@gmail.com >
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-09-03 11:27:04 +00:00
28f350e147
Support add_generation_prompt in embeddings endpoint with chat request ( #23931 )
...
Signed-off-by: biba10 <jaksmid@seznam.cz >
2025-09-03 10:47:55 +00:00
51383bd472
[CI] Accelerate mteb test by setting SentenceTransformers mteb score to a constant ( #24088 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-03 17:23:56 +08:00
9c99e4871f
[Misc] Clean up deadcode for legacy processing pipeline ( #24153 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-03 08:34:29 +00:00
70549c1245
[CI/Build] Serve images used by multimodal tests through local HTTP Server ( #23907 )
...
Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com >
Signed-off-by: dsinghvi <divyanshsinghvi@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-03 16:13:11 +08:00
f0c503f66e
[Nixl] Heterogeneous TP support FlashInfer ( #20189 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-03 15:19:54 +08:00
f38035c123
[distributed][rl] remove nccl cumem env var override ( #24141 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-03 06:45:25 +00:00
426cc8629f
[BugFix] Fix routed_scaling_factor double mul for dots1 and glm4 MoE models ( #24132 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-09-03 04:57:59 +00:00
e81d4e69c1
[Misc] Add check for dual_chunk_attention ( #24070 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-03 04:19:14 +00:00
02d411fdb2
[Doc]: fix typos in Python comments ( #24115 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-02 21:14:07 -07:00
d7e1e59972
[Doc]: fix typos in Python comments ( #24093 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-02 21:05:45 -07:00
c4ed78b14f
[Compile] Fix Compile Warning for w4a8_mm_entry.cu ( #23660 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-02 20:45:52 -07:00
1bd007f234
fix some typos ( #24071 )
...
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
2025-09-02 20:44:50 -07:00
136d853e65
[V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing ( #23656 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
2025-09-03 02:52:51 +00:00
e32a0e8678
Upgrade xgrammar to 0.1.23 ( #22988 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-03 02:32:59 +00:00
42dc59dbac
Update release pipeline post PyTorch 2.8.0 update ( #24073 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Huy Do <huydhn@gmail.com >
2025-09-03 10:09:19 +08:00
862f2ef893
[XPU] Fix the bug of LoRA logits on the XPU platform ( #24081 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-09-03 08:21:18 +08:00
2fd1a40a54
[CI/Build] Disable SiluMul NVFP4 quant fusion tests ( #24121 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-02 16:50:28 -07:00
930a24144c
[Bug] R1 Accuracy: Fix routed_scaling_factor Double Mul Issue ( #24119 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-02 22:22:30 +00:00
457e471971
[AMD][Kernel][Bugfix] Cast offsets tensor bn to tl.int64 to avoid GPU segfault ( #23692 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-09-02 22:13:57 +00:00
d328f7894f
[CI] Enable all hf transformers baselines in test_hybrid ( #23936 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-09-02 20:15:06 +00:00
98aee612aa
[Log] Only Print Profiler Results on Rank 0 ( #23370 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-02 18:53:34 +00:00
598bd74cf8
Fix weights loading for Apertus ( #24100 )
...
Signed-off-by: Nathan Ranchin <nranchin@student.ethz.ch >
2025-09-02 18:34:28 +00:00
2417798471
[Metrics] Deprecate TPOT in favor of ITL ( #24110 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-09-02 18:10:10 +00:00
9480ae24e3
[Bugfix] Fix packed_factor missing attribute error ( #23902 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2025-09-02 10:56:31 -07:00
f399182e8c
Run ruff format on a few files. ( #24075 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-09-02 17:55:32 +00:00
1c41310584
[Bugfix] Fix transform_config parsing in Compressed Tensors ( #23945 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-09-02 13:54:10 -04:00
c83c4ff815
[Benchmark] Add support for local hf dataset path in benchmark ( #23999 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-02 17:49:16 +00:00
0e1759cd54
[docs] add SYS_NICE cap & security-opt for docker/k8s ( #24017 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: Peter Pan <peter.pan@daocloud.io >
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-02 17:27:20 +00:00
e66ed3e675
[CI Failure] Skip failing nvfp4 silu test ( #23959 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-02 13:18:15 -04:00
e0653f6c0b
[Model] Classification models support logit_bias / sigmoid_normalize ( #24031 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-02 16:48:57 +00:00
38ba061f6f
[BugFix] Fix EXAONE4 rotary embeddings ( #23918 )
...
Signed-off-by: lkm2835 <lkm2835@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-02 14:40:55 +00:00
0a74e9d0f2
[Gemma3n] Fix audio batching ( #24052 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-02 22:23:35 +08:00
8bd5844989
correct LWS deployment yaml ( #23104 )
...
Signed-off-by: cberge908 <42270330+cberge908@users.noreply.github.com >
2025-09-02 12:04:59 +00:00
ce30dca5c4
[CI]: reduce HTTP calls inside entrypoints openai tests ( #23646 )
...
Signed-off-by: AzizCode92 <azizbenothman76@gmail.com >
Signed-off-by: Aziz <azizbenothman76@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-02 10:49:32 +00:00
2f0bab3f26
[Model] Support dp on ViT on GLM-4.5V ( #23168 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-09-02 10:48:18 +00:00
fad73be1a5
[Doc]: fix typos in Python comments ( #24077 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-02 02:38:55 -07:00
56d04089ef
Migrate Interns1 inputs to TensorSchema ( #23510 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-02 04:35:45 +00:00
7be0cb8e9e
[XPU][Feature] fp8 online quantization support for XPU ( #23148 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com >
2025-09-02 04:06:53 +00:00
1fa1d6a9a0
Migrate OvisImagePatchInputs to TensorSchema ( #22024 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-02 12:01:36 +08:00
d59c986444
Remove runtime checks based on pooling params ( #24051 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-09-02 11:54:37 +08:00
04d0c60770
[Bugfix] Fix the issue that Blip2ForConditionalGeneration' object has… ( #24028 )
...
Signed-off-by: Dazhi Jiang <dazhi_jiang@163.com >
2025-09-02 11:54:20 +08:00
2b41cbbf03
[V1][Mamba1] - FP32 SSM Kernel Support ( #23506 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
2025-09-01 20:53:00 -07:00
0235103cbb
[Doc]: fix typos in Python comments ( #24042 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-01 19:07:45 -07:00
a344a5aa0a
[bugfix]fix MTP hidden states ( #24056 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-09-01 21:09:37 +00:00
5685370271
[Chore][V0 Deprecation] Move LogProb to a separate file ( #24055 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 12:07:53 -07:00
a0e0efd6bd
[Model] Support DP for ViT on Kimi-VL-A3B-Thinking-2506 ( #23817 )
...
Signed-off-by: Junhong <liujunhong11@huawei.com >
Signed-off-by: LJH-LBJ <98734602+LJH-LBJ@users.noreply.github.com >
Co-authored-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: LJH-LBJ <98734602+LJH-LBJ@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-09-01 16:56:56 +00:00
cf91a89dd2
[docs][misc] IOProcessor plugins fixes ( #24046 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2025-09-01 09:17:41 -07:00
39a22dcaac
[Misc] Minor code simplification for spec decode ( #24053 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 08:54:01 -07:00
41c80698b3
Document multi-proc method selection for profiling ( #23802 )
...
Signed-off-by: jdebache <jdebache@nvidia.com >
2025-09-01 06:28:26 -07:00
7c8271cd1e
[Model]: support KeyeVL-1_5-8B ( #23838 )
...
Signed-off-by: wangruitao <wangruitao@kuaishou.com >
Co-authored-by: wangruitao <wangruitao@kuaishou.com >
2025-09-01 03:50:27 -07:00
3e330fcb21
[Doc]: Fix CPU install docs: force torch-backend=cpu to avoid GPU torchvision errors ( #24033 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-09-01 03:34:52 -07:00
d46934b229
[Frontend] Gemma3n audio transcriptions/translations endpoint ( #23735 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-01 18:07:46 +08:00
107284959a
[Doc]: fix typos in Python comments ( #24026 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-01 09:38:20 +00:00
dc1a53186d
[Kernel] Update DeepGEMM to latest commit ( #23915 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-01 02:38:04 -07:00
55602bb2e6
[Frontend] Update the warning log when using VLLM_ALLOW_LONG_MAX_MODEL_LEN ( #20904 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-01 08:50:25 +00:00
d7fbc6ddac
[Misc] Enable V1 FP16 inference on pre-Ampere GPUs ( #24022 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-01 08:12:22 +00:00
5438967fbc
[Misc] add hash_function doc string ( #24014 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-31 23:11:20 -07:00
422e793fa6
[Bugfix] Add support for <tool_call> format in streaming mode for XLAM Tool Parser ( #22769 )
...
Signed-off-by: Devon Peroutky <devon@kindo.ai >
2025-09-01 14:07:54 +08:00
1cb39dbcdd
[Misc] IO Processor plugins for pooling models ( #22820 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-31 23:07:12 -07:00
437c3ce026
Migrate Phi4 inputs to TensorSchema ( #23471 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-01 14:05:59 +08:00
499b074bfd
[Misc] refactor code by import as for torch._inductor.config ( #23677 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-09-01 14:05:42 +08:00
ff0e59d83a
[CI/Build] Improve Tensor Schema tests speed by avoid engine core initialization ( #23357 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-31 22:52:20 -07:00
b55713683c
[Misc] Move fast prefill logic to separate method ( #24013 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 05:40:38 +00:00
acc1a6e10a
Fix the bug related to loading GPTP INT3 weights. ( #23328 )
...
Signed-off-by: JunHowie <JunHowie@aliyun.com >
Co-authored-by: JunHowie <JunHowie@aliyun.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-01 05:39:57 +00:00
8c742a66d1
[Misc] Avoid redundant copy for encoder-only models ( #24012 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 04:02:43 +00:00
183a70967a
[BUGFIX] GPTQ quantization compatibility for Qwen3 MOE models (AutoGPTQ and AutoRound-GPTQ) ( #23994 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-01 03:33:40 +00:00
14b4326b94
v1: Support KV events from connectors ( #19737 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-01 01:13:21 +00:00
752d2e1c36
[Minor] Fix some random typos in comments ( #24009 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-31 16:42:17 -07:00
81eea3d348
vllm fix check on max vocab size ( #22471 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-08-31 20:57:05 +08:00
9701352e4b
[Doc]: fix typos in Python comments ( #24001 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-31 08:21:59 +00:00
749be00a98
[Core][Multimodal] Allow passing multi_modal_uuids as multimodal identifiers. ( #23394 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-30 18:01:22 -07:00
5b8077b8ac
Fix wrong truncate_prompt_tokens type hint ( #22761 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com >
Signed-off-by: Gabriel Marinho <104592062+gmarinho2@users.noreply.github.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-30 20:39:38 +00:00
038e9be4eb
[LoRA] Much faster startup when LoRA is enabled ( #23777 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-30 15:37:39 +00:00
68a349114f
[Misc] enhance type hint for rearrange return value ( #23519 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-30 06:43:33 -07:00
e80bca309e
[Refactor] refactor freezing_value/cuda_event initialize outside try finally ( #23758 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-30 06:42:25 -07:00
fb4983e112
[Misc] add reorder_batch AttentionMetadataBuilder ( #23798 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-30 06:41:45 -07:00
379ea2823a
Add LoRA support for DeepSeek models (V2, V3, R1-0528) ( #23971 )
...
Signed-off-by: sadeghja1070 <sadegh.ja1070@gmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-30 06:40:02 -07:00
3a6acad431
[Model] Enable encoder DP for MiniCPM-V ( #23948 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-30 06:31:26 -07:00
5490d633ce
[UT] fix unify_kv_cache_configs when kv cache config needs sort ( #23843 )
2025-08-30 11:22:14 +00:00
628d00cd7b
[Bugfix] Fix test_lora_resolvers.py ( #23984 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-30 11:16:11 +00:00
4071c76cf3
[V1] [Hybrid] Move MiniMaxLinearAttention into layers/mamba ( #23831 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-30 00:16:15 -07:00
f1bddbd852
[Core] Cleanup TPU model runner for MM ( #23894 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-30 00:14:58 -07:00
9748c5198b
[CI] Fix broken compile tests due to unsupported SiluMul+Nvfp4Quant fusion ( #23973 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-08-30 00:14:43 -07:00
ee52a32705
[CI] Move testing image from remote URL to S3 ( #23980 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-29 21:41:25 -07:00
8fb85b7bb6
Add routed_scaling_factor to MoE grouped topk ( #23123 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-29 21:36:48 -07:00
5b31cb1781
[Bugfix] Fix --config arg expansion called from api_server.py ( #23944 )
...
Signed-off-by: Jean-Francois Dube <dubejf+gh@gmail.com >
Co-authored-by: Jean-Francois Dube <dubejf+gh@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-29 21:36:39 -07:00
d660c98c1b
[CI] Fix unavailable image remote URL ( #23966 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-29 15:40:04 -07:00
5674a40366
[Misc] Make download_weights_from_hf more reliable ( #23863 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-29 12:37:24 -07:00
8c3e199998
Revert gemma3n fast prefill changes ( #23897 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-29 12:16:57 -07:00
1c26b42296
[Docs] [V1] [Hybrid] Add new documentation re: contributing mamba-based models ( #23824 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-29 18:47:58 +00:00
b7adf94c4a
Tuned H100/H200 triton fp8 block configs for fused_qkv_a_proj ( #23939 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-29 10:28:35 -07:00
4d7fe40fc0
[RL][BugFix] Fix missing tokenizer error for token-in-token-out ( #23904 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-30 01:09:55 +08:00
0dc9532065
[BUGFIX ] fix undefined silu_and_mul_nvfp4_quant ( #23929 )
...
Signed-off-by: hongchao <hongchao@msh.team >
Signed-off-by: Richard Zou <zou3519@gmail.com >
Co-authored-by: hongchao <hongchao@msh.team >
Co-authored-by: Richard Zou <zou3519@gmail.com >
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com >
2025-08-29 09:36:39 -07:00
72a69132dc
[CI] Add aiter to matching list of issue auto labeller for rocm tag ( #23942 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-08-29 15:29:21 +00:00
d90d8eb674
[BugFix] Async scheduling and PP compatibility with DP ( #23770 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-29 08:17:27 -07:00
0a2f4c0793
[Models] Use in-place adds in Idefics2Vision ( #23932 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-08-29 07:42:57 -07:00
1cf3753b90
[MODEL] Apertus and XIELU ( #23068 )
...
Signed-off-by: EduardDurech <39579228+EduardDurech@users.noreply.github.com >
Co-authored-by: AllenHaoHuang <allenhuangdd@gmail.com >
2025-08-29 20:29:18 +08:00
4f7cde7272
Adds json_count_leaves utility function ( #23899 )
...
Signed-off-by: aditchawdhary <aditxy@hotmail.com >
2025-08-29 05:28:13 -07:00
67c14906aa
Update PyTorch to 2.8.0 ( #20358 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-29 18:57:35 +08:00
69f46359dd
[Multimodal] Consolidate mm inputs into MultiModalFeatureSpec ( #23779 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-08-29 18:36:57 +08:00
d9e00dbd1f
[Performance] V1 Classify Models E2E Performance Optimization ( #23541 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-29 03:12:32 -07:00
ad39106b16
[CPU] Enable data parallel for CPU backend ( #23903 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-29 02:19:58 -07:00
2554b27baa
[V0 Deprecation] Remove pooling model support in V0 ( #23434 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-29 00:04:02 -07:00
934bebf192
Better errors for Transformers backend missing features ( #23759 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-29 07:01:40 +00:00
885ca6d31d
[Misc] Fix warnings for mistral model ( #23552 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-08-29 06:58:48 +00:00
2d0afcc9dc
[mrope][Qwen2-VL] Fix edge case where getting index of image/video token can potentially throw in default vl mrope implementation. ( #23895 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-08-28 23:29:13 -07:00
b4f9e9631c
[CI/Build] Clean up LoRA test ( #23890 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-28 23:28:35 -07:00
05d839c19e
Fix(async): Add support for truncate_prompt_tokens in AsyncLLM ( #23800 )
2025-08-28 22:55:06 -07:00
6597d7a456
[Platform] import activation_quant_fusion for CUDA only ( #23882 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-08-28 22:54:16 -07:00
5264015d74
[BugFix][AMD][Deepseek] fix a dtype mismatch error for deepseek running on AMD ( #23864 )
...
Signed-off-by: Jinghui Zhang <jinghuizhang0804@gmail.com >
2025-08-28 22:54:12 -07:00
98ac0cb32d
[Bugfix] Use ReplicatedLinear for SequenceClassification head ( #23836 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-29 04:41:20 +00:00
c8b3b299c9
[tests] Improve speed and reliability of test_transcription_api_correctness ( #23854 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-29 04:25:33 +00:00
006477e60b
[ROCm][Fix] Fix rocm build caused by #23791 ( #23847 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-08-28 19:52:27 -07:00
de533ab2a1
[Models] Improve iteration over layers ( #19497 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-08-29 09:26:34 +08:00
235c9db8a7
[XPU] support data parallel for MoE models on XPU ( #22887 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-08-29 09:23:04 +08:00
b668055a11
[V0 Deprecation] Remove V0 Samplers test ( #23862 )
2025-08-28 18:05:52 -07:00
d3d2aad5a2
[Log] Use Debug Once for DeepGEMM E8M0 When not Enabled ( #23858 )
2025-08-28 22:18:10 +00:00
cb293f6a79
[V1] Enable prefill optimization for Gemma3n ( #22628 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-28 14:54:30 -07:00
7ffbf27239
[BugFix][FlashInfer] Fix potential race condition for paged_kv_indptr_cpu ( #23737 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-28 14:22:46 -07:00
27e88cee74
chore: build release image by default ( #23852 )
...
Signed-off-by: Codex <codex@openai.com >
2025-08-28 13:17:15 -07:00
16a45b3a28
[NVIDIA] Support SiluMul + NVFP4 quant fusion ( #23671 )
...
Signed-off-by: jindih <jindih@nvidia.com >
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: jindih <jindih@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedic <lgovedic@redhat.com >
2025-08-28 19:36:50 +00:00
57d4ede520
[bugfix] [spec-decoding] fix data race in sample_recovered_tokens_kernel (vLLM v1) ( #23829 )
...
Signed-off-by: He-Jingkai <he-jingkai@outlook.com >
2025-08-28 19:05:20 +00:00
04d1dd7f4a
[ROCm][Aiter] Add triton fp8 bmm kernel for mla ( #23264 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com >
2025-08-28 18:18:08 +00:00
f32a5bc505
Migrate Llama4ImagePatchInputs to TensorSchema ( #22021 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-28 17:29:37 +00:00
8805ad9fa9
Add scale_config.yml file for Meta autoscalers for GH Actions ( #23840 )
...
Signed-off-by: Jean Schmidt <contato@jschmidt.me >
2025-08-28 09:31:20 -07:00
0583578f42
[ci] breaks down V1 Test into 3 groups of approx 30 minutes runtime ( #23757 )
...
Signed-off-by: Jean Schmidt <contato@jschmidt.me >
2025-08-28 08:59:19 -07:00
db74d60490
[Bugfix] Add fake mode around passes ( #23349 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2025-08-28 11:25:56 -04:00
95089607fa
[Model][gpt-oss] Support DP+EP for GPT-OSS with FlashInfer trtllm-gen MoE ( #23819 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
2025-08-28 06:56:20 -07:00
1f096f9b95
[CI] Fix linting error on main ( #23835 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-28 06:52:01 -07:00
66548f6603
[Bugfix] Fix benchmark_moe.py for blockwise fp8. ( #23823 )
...
Signed-off-by: crischeng <420985011@qq.com >
Co-authored-by: cris <grace@guisenbindeMacBook-Pro.local >
2025-08-28 21:44:09 +08:00
d3da2eea54
[Doc]: fix typos in Python scripts ( #23828 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-28 05:37:38 -07:00
bfab219648
[Model] [gpt-oss] fix gpt-oss pp support ( #23815 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-28 05:36:55 -07:00
a3432f18fd
[BugFix][Spec Decode] Use float64 for uniform_probs ( #23803 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-28 12:26:45 +00:00
67cee40da0
[CI/Build][Bugfix] Fix Qwen VL tests on CPU ( #23818 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-28 11:57:05 +00:00
d99c3a4f7b
[Doc]: fix typos in .md files (including those of #23751 ) ( #23825 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-28 04:38:19 -07:00
3462c1c522
[FIXBUG] Add return_success parameter to moe_wna16_weight_loader function ( #22797 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-28 09:03:22 +00:00
c5d004aaaf
[Model] Add PP support and VLM backbone compatibility for GPT-OSS ( #23680 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-28 16:03:28 +08:00
11a7fafaa8
[New Model]: Support GteNewModelForSequenceClassification ( #23524 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-28 15:36:42 +08:00
186aced5ff
[Kernel] cuda kernels for upcoming decode context parallel feature ( #23791 )
...
Co-authored-by: hongchao <hongchao@msh.team >
2025-08-28 15:29:11 +08:00
daa1273b14
[Bugfix] when set offline model running error ( #23711 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-28 07:27:45 +00:00
c07a73317d
[CI] enable idefics3 and fuyu-8b test in multimodal test ( #23790 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-28 14:51:24 +08:00
22feac8e95
[Transform] [Quantization] Add transforms to compressed tensors ( #22486 )
2025-08-28 02:43:48 -04:00
c8851a4723
Add deprecation warning for lora_extra_vocab_size ( #23635 )
...
Signed-off-by: Jinheng Li <ahengljh@gmail.com >
2025-08-27 22:34:29 -07:00
f48a9af892
[CI] make all multi-gpu weight loading tests run nightly ( #23792 )
...
Signed-off-by: Alex Yun <alexyun04@gmail.com >
2025-08-27 21:27:36 -07:00
a11adafdca
Gracefully handle edge cases in harmony utils ( #23155 )
...
Signed-off-by: Jan Kessler <jakessle@uni-mainz.de >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-27 20:14:00 -07:00
a781e84ec2
[Perf] Tune configs for triton block fp8 gemm H100/H200 ( #23748 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-28 11:12:53 +08:00
1b7b161a09
[Feature] models: pass layer prefix to replace_linear_class for per-layer quantization routing. Addresses #23239 ( #23556 )
...
Signed-off-by: Shrey Gupta <shreyg1303@gmail.com >
2025-08-27 20:12:44 -07:00
a69693e38f
Migrate Qwen inputs to TensorSchema ( #23473 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-28 10:43:26 +08:00
5da4f5d857
[Bugfix] Fix for V1 priority scheduling crashes at preemption ( #23713 )
...
Signed-off-by: Hanchenli <lihanc2002@gmail.com >
2025-08-28 00:44:52 +00:00