936da0f740
update
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-19 23:30:15 +00:00
20098c10d9
Remove global CUDA graph pool
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-19 23:27:51 +00:00
ee7a66dd9a
allow disable flashinfer prefill ( #25276 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-09-19 22:59:41 +00:00
431535b522
Enable modelopt gemma3 nvfp4/fp8, make workflow more robust ( #22771 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-19 22:40:33 +00:00
711e912946
[Compile] Fix Compile Warning for Ignoring MIN_BLOCK_PER_SM ( #25193 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-19 16:23:19 -06:00
e69e0b8b5f
[Frontend] Responses API messages out, just harmony for now ( #24985 )
...
Signed-off-by: Alec Solder <alecs@fb.com >
Co-authored-by: Alec Solder <alecs@fb.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-19 21:40:16 +00:00
ddc9048394
Fix: Correct FusedMoE layer reference in auto_round quantization ( #24818 )
...
Signed-off-by: David-Wen <18927700430@163.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-19 20:44:24 +00:00
b1a63d1b3b
[BugFix] Make FlashInferMetadataBuilder non-blocking ( #25040 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-19 20:36:34 +00:00
48ecb4438b
[Perf] Use FlashInfer RoPE for RotaryEmbedding.forward_cuda when available ( #21126 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-19 14:06:49 -06:00
e57fc15971
Specify platform in pip-compile pre-commit hook so it runs on MacOS ( #25273 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 12:43:33 -07:00
4bdf400218
[Bugfix] Fix chunked a2_scales in modular kernels ( #25264 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-19 19:42:01 +00:00
7852b82b93
[Bugfix] GPT OSS Attribute error on H100 ( #25228 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-09-19 13:14:09 -06:00
a2a5f79e09
Optimize triton unified attention performance for sliding window attention ( #24390 )
...
Signed-off-by: zixi-qi <qizixi@meta.com >
2025-09-19 13:07:26 -06:00
c59a0eca42
[KV offload][4/N] Offloading KV connector ( #22595 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-19 19:07:17 +00:00
b716ab93a7
[bugfix] fix structured outputs key missing issue from #24929 ( #25195 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-09-19 18:37:57 +00:00
138f0d1e75
[Docs] add __init__.py to vllm/model_executor/layers/quantization/compressed_tensors/transform ( #24974 )
...
Signed-off-by: samzong <samzong.lu@gmail.com >
2025-09-19 18:32:27 +00:00
2506ce5189
[Core][Prefix Hash] Fix prefix hash metrics sliding window maintenance ( #24990 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-09-19 12:22:53 -06:00
47fd08aaf9
[CI/Build] fix test function_calling ( #25072 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-19 12:16:32 -06:00
12aed7e453
Encoder model support for the Transformers backend ( #25174 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 19:15:22 +01:00
d90e212a3a
Remove Redundant Assignment in Qwen3_VisionPatchMerger ( #25224 )
...
Signed-off-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-19 12:15:13 -06:00
2821986450
[Core] Modify the initialization parameters of the lora manager ( #25249 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-19 18:01:28 +00:00
6c117cff7d
[Frontend] Pass API server count to each process ( #23717 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-20 01:15:19 +08:00
7ac67ea525
[KV offload][3/N] Add worker-side CPU support ( #21448 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-19 09:53:45 -07:00
ce75e15373
refactor(benchmarks): add type annotations to wait_for_endpoint parameters ( #25218 )
...
Signed-off-by: samzong <samzong.lu@gmail.com >
2025-09-19 16:36:52 +00:00
aed16879a9
Move ModelConfig from config/__init__.py to config/model.py ( #25252 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 16:22:33 +00:00
cf278ff3b2
Update CODEOWNERS ( #25269 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 09:12:55 -07:00
838d7116ba
[Qwen] Remove cuda hard-code in qwen3 next ( #25243 )
...
Signed-off-by: Icey <1790571317@qq.com >
2025-09-19 12:25:12 +00:00
5089fd749c
[V0 Deprecation] Remove V0 logic from get_input_embeddings interface ( #25242 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-19 11:10:52 +00:00
a3d087adec
[P/D][Nixl] Introduce KVTransferMetrics and aggregation strategy ( #22188 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-19 11:09:14 +00:00
058525b997
Move PoolerConfig from config/__init__.py to config/pooler.py ( #25181 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-19 11:02:55 +00:00
1dfea5f4a9
[Bugfix][Perf] Misc fixes for Qwen3 VL ( #25238 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-19 10:46:16 +00:00
cea91a32f2
[Kernel][Performance] Add Triton kernel for Qwen3-VL interleaved MRoPE ( #25055 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-19 10:27:49 +00:00
a684c0124c
[bugfix] fix MHA for models like OpenGVLab/InternVL3_5-38B ( #25146 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-19 08:45:06 +00:00
f2718d2948
[Misc] Cleanup test conftest for deprecated encoder-decoder models ( #25231 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-19 07:44:56 +00:00
825fdb11ad
[Bugfix][CPU] Add placeholder to avoid import errors when using fused_moe ops on platforms without triton ( #25137 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-19 07:41:12 +00:00
8c1d4acbfe
[CPU] Disable oneDNN linear on non-x86 platforms ( #25166 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-19 07:27:22 +00:00
486c5599e3
[Build] Update Xgrammar to 0.1.24 to get a CVE fix ( #25188 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-19 14:27:17 +08:00
a6149aa587
[OOT] Support sync_model_loading for OOT ( #25126 )
...
Signed-off-by: Chendi Xue <Chendi.Xue@intel.com >
2025-09-19 05:41:53 +00:00
6c8a3c099b
[Docs] Fix griffe warnings in vllm/multimodal ( #25216 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-18 22:10:44 -07:00
31a8a2a7bc
[Misc] Clean up MM profiling warnings ( #25222 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-19 04:46:57 +00:00
1a0a04dae9
[Perf] Optimize memory peak during EAGLE model loading. ( #24585 )
...
Signed-off-by: Chen Ding <candy.dc@alibaba-inc.com >
2025-09-19 03:31:16 +00:00
6d8246aaff
[gpt-oss] Add ResponseReasoningPartAddedEvent, ResponseReasoningPartDoneEvent for streaming ( #24938 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-18 19:11:59 -07:00
9d1c50a5ac
[KV offload][2/N] Introduce LRU-based CPU offloading management ( #20075 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-19 00:20:51 +00:00
9a4600e4dc
[CORE] Prompt Embeddings Support for v1 Engine ( #24278 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: Andrew Sansom <qthequartermasterman@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-19 08:03:09 +08:00
9fac6aa30b
[BugFix] Fix DeepGEMM warmup, no m.weight_scale_inv ( #25206 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-18 14:26:28 -07:00
a53ad626d6
[KV offload][1b/N] rename offloading to kv_offload ( #25191 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-18 20:53:52 +00:00
1c3dad22ff
[V0 Deprecation] Remove unused async_timeout.py ( #25190 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-18 20:35:21 +00:00
d2a30a2d93
[Bug] Fix torch Compilation Cache Hit Error ( #25093 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-18 12:38:37 -07:00
75fb112d80
[Bug] Fix returned_lse not Defined issue ( #25106 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-18 19:32:24 +00:00
38db529f66
[feat]: Create interface for model-specific M-RoPE ( #24194 )
...
Signed-off-by: AzizCode92 <azizbenothman76@gmail.com >
Signed-off-by: Aziz <azizbenothman76@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-18 19:18:56 +00:00
064cac7bb7
[fix]: remove data type hardcoding from gptoss model implementation ( #23807 )
...
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com >
2025-09-18 18:15:23 +00:00
e19bce40a1
[V0 Deprecation] Remove AsyncLLMEngine ( #25025 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-18 11:07:42 -07:00
505805b645
[KV offload][1/N] Introduce an offloading component ( #19848 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-18 10:57:07 -07:00
bbdc0f2366
[ROCm][AITER][Bugfix] Switch AITER to use PIECEWISE_AND_FULL compilation ( #25104 )
...
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com >
2025-09-18 17:46:47 +00:00
dc34059360
[ROCm][CI/Build] Use ROCm7.0 as the base ( #25178 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-18 09:36:55 -07:00
c4cb0af98a
[spec decode] Fix MTP inference path for MiMo-7B model ( #25136 )
...
Signed-off-by: zixi-qi <qizixi@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-18 09:12:19 -07:00
1c3b1634aa
[Misc] Add codeowner for Transformers backend ( #25180 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 09:01:50 -07:00
2ea50e977a
Enable Allgather/ReduceScatter backend for NaiveAllToAll ( #23964 )
...
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Signed-off-by: Shu Wang <shuw@nvidia.com >
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-18 15:52:58 +00:00
b419937c78
[Docs] Fix warnings in mkdocs build (continued) ( #25163 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-09-18 08:23:26 -07:00
5f696c33b1
[New Model] Support BertForTokenClassification / Named Entity Recognition (NER) task ( #24872 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-18 23:22:01 +08:00
67244c86f0
feat(api): Return 503 on /health when engine is dead ( #24897 )
...
Signed-off-by: dongbo910220 <1275604947@qq.com >
Co-authored-by: Claude <noreply@anthropic.com >
2025-09-18 14:29:40 +00:00
072d7e53e5
[PERF] Add conv1d metadata to GDN attn ( #25105 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2025-09-18 14:27:49 +00:00
01a583fea4
[Kernel] Decouple Tile Size from Block Size in Triton Unified Attention Kernel ( #21197 )
...
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com >
2025-09-18 14:27:01 +00:00
bc19d75985
[Misc] Add kv-connector label ( #25156 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-18 13:56:07 +00:00
fbd6523ac0
Refactor dense FP8 tensor/channel/block utils and add CT FP8 block ( #21404 )
2025-09-18 08:53:45 -04:00
470484a4f5
[Structured Output][Refactor] Move apply_grammar_bitmask() method from ModelRunner to structured output utils ( #21999 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-09-18 20:44:31 +08:00
21da73343a
[Misc] Clean up flags in vllm bench serve ( #25138 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-18 12:43:33 +00:00
66072b36db
[Bugfix][Mamba] - Fix Conv State Kernel FP32 Support ( #24883 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
2025-09-18 12:21:17 +00:00
3ed1ec4af2
Fix validate-config pre-commit check ( #25157 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 12:06:28 +00:00
5a33ae9a3f
Fix forward reference warning in documentation ( #25150 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 11:41:41 +00:00
c9ff9e6f0c
[Docs] add the parallel sampling usage in LLMEngine and AsyncLLM ( #24222 )
2025-09-18 04:37:08 -07:00
eaffe4486c
[Docs] Fix pooling-params doc references in openai_compatible_server.md ( #24939 )
2025-09-18 04:36:47 -07:00
8ed039d527
Move StructuredOutputsConfig from config/__init__.py to config/structured_outputs.py ( #25153 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 11:24:27 +00:00
37970105fe
[Model] Improve Pooling Model ( #25149 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-18 11:04:21 +00:00
cc935fdd7e
[Frontend] Support setting logprobs to -1 ( #25031 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-18 10:34:42 +00:00
abdfcd4f3d
silu-v1: Fix EPS not being used during max-reduction ( #25069 )
...
Signed-off-by: elvircrn <elvircrn@gmail.com >
2025-09-18 10:25:12 +00:00
4f02b77de4
Fix: Add explicit #include <omp.h> for OpenMP compatibility on certain toolchains ( #24951 )
...
Signed-off-by: lyd1992 <liuyudong@iscas.ac.cn >
Signed-off-by: ihb2032 <1355790728@qq.com >
2025-09-18 17:43:23 +08:00
29283e8976
[Chore] Cleanup guided namespace, move to structured outputs config ( #22772 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 09:20:27 +00:00
05b044e698
[Doc] Fix cross-reference warnings ( #25058 )
...
Signed-off-by: Punit Vara <punitvara@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 02:05:16 -07:00
aa3f105c59
Add 'path' option to ImagePrompt data_format ( #25081 )
...
Signed-off-by: Gerard Finol <gerard.finol@urv.cat >
2025-09-18 02:02:14 -07:00
ef7eefe17a
[Qwen] Add fp8 checkpoint support for qwen3-next. ( #25079 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-09-18 08:16:04 +00:00
350c94deb3
[Bugfix] when use s3 model cannot use default load_format ( #24435 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-18 07:47:43 +00:00
f4cd80f944
Retrieve sliding_window from text config in Gemma3 MM ( #25085 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-18 06:29:05 +00:00
349e0e3462
[Docs] Fix API Reference ( #25140 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-17 23:23:29 -07:00
81b16a2bc9
[Kernel] Better inf handling for grouped topk cu ( #24886 )
...
Signed-off-by: lumina37 <starry.qvq@gmail.com >
2025-09-18 05:53:55 +00:00
e111d5b0ae
[CLI] Use streaming in CLI chat and completion commands ( #23769 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-17 22:30:26 -07:00
a904ea78ea
[benchmark] add peak throughput metrics and plot ( #23867 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-17 22:30:02 -07:00
b7433ca1a4
[Spec Decode] Efficient padded speculation ( #24539 )
...
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com >
2025-09-18 01:07:24 -04:00
5c65a72bb1
[V0 Deprecation] Remove more V0 tests ( #25117 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 22:05:25 -07:00
9d8a2d86d2
[EPLB] Add EPLB support for hunyuan_v1 ( #23078 )
2025-09-18 04:51:35 +00:00
3bc18127ff
[XPU] Whisper model support on XPU Platform ( #25123 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-09-18 04:30:10 +00:00
bec060fd99
Mark prompt logprobs as incompatible with prompt embeds at API level ( #25077 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-17 21:25:07 -07:00
52bc9d5b3e
[Model] enable data parallel for InternVL vision encoder ( #23909 )
...
Signed-off-by: Yiwen Chen <yiwen66@berkeley.edu >
Signed-off-by: YiwenC <54658925+666even666@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-17 21:11:46 -07:00
dc2979c585
[Kernels] Overlap shared experts with combine instead of dispatch ( #24254 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-18 12:10:21 +08:00
027d37df38
[Bugfix][Qwen3-Next] add prefixes to shared_expert in qwen3-next and mlp in qwen2moe to successfully load ignored params in quantized models ( #24960 )
...
Signed-off-by: toncao <cpatonn@gmail.com >
Co-authored-by: toncao <cpatonn@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-18 12:08:50 +08:00
b98219670f
[Core][MM] Cleanup MultiModalCache ( #25006 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-17 21:08:41 -07:00
32baf1d036
[Docs] Clean up the contributing README ( #25099 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-17 21:05:18 -07:00
3127274d02
[MM Encoder] Apply DP ViT for Qwen3-VL model series ( #24955 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Huang Jie <92386084+JJJYmmm@users.noreply.github.com >
Co-authored-by: 松灵 <26085463+wulipc@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-17 21:04:21 -07:00
4ac510f484
[Kernels] Enable DeepGEMM by default ( #24462 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-17 20:19:52 -07:00
7fb2a5be28
[V0 Deprecation] Skip PP test ( #25128 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 20:18:36 -07:00
6c036615dc
[V0 Deprecation] Remove misc V0 tests ( #25118 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 19:41:55 -07:00
2fc24e94f9
[V0 Deprecation] Remove V0 Tracing & Metrics tests ( #25115 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 19:40:44 -07:00
2c3c1bd07a
[V0 Deprecation] Remove V0 Engine tests ( #25114 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 19:38:09 -07:00
5963b98b46
[Kernel] Delegate construction of FusedMoEQuantConfig to FusedMoEMethodBase subclasses ( #22537 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-17 17:43:31 -06:00
e6585ddb45
[Bugfix] Fix accuracy issue for silu_mul + nvfp4 quant fusion kernel ( #24833 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-17 16:37:23 -07:00
2a4d6412e6
Add a batched auto tune script ( #25076 )
...
Signed-off-by: Karan Goel <karangoel@google.com >
Signed-off-by: Karan Goel <3261985+karan@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-17 22:41:18 +00:00
e67a79db03
[Bugfix] Refactor Flashinfer TRTLLM attention kernel selection logic ( #24600 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-17 15:36:29 -07:00
9f882d8791
Disable failing GPT-OSS Eval (Blackwell) for now ( #25107 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-17 15:36:00 -07:00
1a456c7c90
Aiter mha fp8 fix ( #24991 )
...
Signed-off-by: Doug Lehr <douglehr@amd.com >
Co-authored-by: Doug Lehr <douglehr@amd.com >
2025-09-17 22:29:14 +00:00
fedb75fa27
[Bugfix][B200] Fix cutlass_mla hang ( #24966 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-17 18:06:38 -04:00
bff2e5f1d6
[gpt-oss][2] fix types for streaming ( #24556 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-17 22:04:28 +00:00
3c068c637b
[Kernel] Faster pre-processing time for W4A8 ( #23972 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-09-17 14:35:32 -07:00
f20c3b0951
[BUG] Exclude .pth files when pulling remote files ( #25092 )
...
Signed-off-by: ahao-anyscale <ahao@anyscale.com >
2025-09-17 20:42:09 +00:00
883131544f
[Bugfix] Update import path for bc_linter_include ( #24766 )
...
Signed-off-by: Mohammad Miadh Angkad <mangkad.bsdsba2027@aim.edu >
2025-09-17 20:33:11 +00:00
ee5fd49150
[Misc] Update owners for KV connector and V1 offloading ( #25041 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
2025-09-17 12:37:29 -07:00
7ae9887542
[V1] Logits processor docs ( #22919 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
Signed-off-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
Co-authored-by: Joseph Marinier <Joseph.Marinier@gmail.com >
2025-09-17 11:53:12 -07:00
e3db5ebb66
[CI Bugfix] Fix failing test_model_load_with_params tests due to tokenizer refactor ( #25086 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-17 11:15:05 -07:00
9d442b7c48
[V0 Deprecation] Remove V0 tests in test_sequence.py ( #25088 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 11:08:45 -07:00
eb68c2dcd9
[CI] Revert back prepare_prompts and check_answers ( #25087 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 11:03:16 -07:00
8b32464ac1
Change log level from info to debug for IOProcessor ( #24999 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-09-17 10:21:28 -07:00
99cc41ad50
[V0 Deprecation] Remove unused output processor util ( #25023 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-17 09:50:07 -07:00
d6a518fdde
Remove unused find_cuda_init helper script ( #25044 )
2025-09-17 09:47:40 -07:00
4aa8c7b047
cleanup: remove adapter commons ( #25045 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-17 16:46:29 +00:00
4b946d693e
[V0 Deprecation] Remove V0 Core tests ( #25082 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-17 09:32:42 -07:00
087c6ffc92
[CI Bugfix] Fix failing test_invalid_env ( #25078 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-17 08:28:58 -07:00
4a2d33e371
[Docs] vllm/benchmarks/datasets.py fix docstring param format. ( #24970 )
...
Signed-off-by: samzong <samzong.lu@gmail.com >
2025-09-17 08:11:51 -07:00
8f3616f422
Remove old cutlass mla ( #23961 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-17 14:31:43 +00:00
47f670b03b
[Docs] improve code formatting and comments to eliminate griffe build warning. ( #25010 )
...
Signed-off-by: samzong <samzong.lu@gmail.com >
2025-09-17 07:31:20 -07:00
dd6a910aac
[Bugfix][Qwen3-Next] fixes the varlen issue in qwen3-next's MTP implementation. ( #24957 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-09-17 21:59:09 +08:00
1b962e2457
[fix] lora benchmarks pass no_lora_flag_cpu ( #23774 )
...
Signed-off-by: Dylan Maloy <34420038+dolpm@users.noreply.github.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-17 21:22:25 +08:00
bfe9380161
Apply fixes for CUDA 13 ( #24599 )
...
Signed-off-by: Aidyn-A <aidyn.b.aitzhan@gmail.com >
2025-09-17 09:15:42 -04:00
9fccd04e30
[Bugfix] Fix Stream usage in CPU model runner and OneDNN kernel check ( #25046 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-17 05:54:02 -07:00
252ada5559
Add RADIO Vision Encoder Support to vLLM ( #24595 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com >
Co-authored-by: root <root@cw-dfw-h100-001-305-026.cm.cluster >
2025-09-17 05:53:30 -07:00
e120533d7a
[Misc] Avoid use of deprecated AutoModelForVision2Seq ( #25065 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-17 12:19:15 +00:00
2b85697031
[BugFix] enable DOTALL to match multi-line tool_call parameters in extract_tool_call_required_streaming ( #24668 )
...
Signed-off-by: Shijun Yin <shijun.yin@outlook.com >
2025-09-17 09:21:18 +00:00
544fe76b95
[Frontend] Support returning all prompt logprobs ( #24956 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-17 09:03:52 +00:00
bb58dc8c20
[DP] Create placement groups by ray_device_key ( #25026 )
...
Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-17 08:57:25 +00:00
0fb2551c23
[Docs] Fix griffe warning in base_static_graph.py ( #25018 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-17 08:49:19 +00:00
6c47f6bfa4
[Core] Remove tokenizer group in vLLM ( #24078 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-17 08:42:59 +00:00
c15309a730
[Model] Apply SharedFusedMoE to glm4_moe. ( #24849 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2025-09-17 16:02:31 +08:00
4a9375fe9d
[Model] Pass param prefix to LLMHead ( #24862 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2025-09-17 16:01:27 +08:00
03191cd8f0
[Core][MultiModalHasher] Hash images without converting image mode ( #24969 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-17 00:57:34 -07:00
b77bf34e53
[EPLB] Support EPLB for Mixtral Model ( #22842 )
...
Signed-off-by: rouchenzi <ruochenwen@gmail.com >
Signed-off-by: rouchenzi <40842833+rouchenzi@users.noreply.github.com >
Co-authored-by: Bowen Wang <abmfy@icloud.com >
2025-09-17 07:27:34 +00:00
dd39baf717
[XPU] Fix xpu model runner call torch.cuda APIs ( #25011 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-17 06:45:25 +00:00
43a62c51be
Add more documentation and improve usability of lognormal dist (benchmark_serving_multi_turn) ( #23255 )
...
Signed-off-by: daniels <daniels@pliops.com >
2025-09-17 05:53:17 +00:00
ca2d1925ef
[Rocm] [quantization] Fix quark ptpc moe and add test case ( #24649 )
...
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com >
Co-authored-by: Haoyang Li <haoyang.li@amd.com >
2025-09-16 22:15:13 -07:00
0f7acdd73c
[Model] Support Qwen3-VL Model Series ( #24727 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Huang Jie <92386084+JJJYmmm@users.noreply.github.com >
Co-authored-by: 松灵 <26085463+wulipc@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-17 05:01:04 +00:00
5801e49776
[V0 Deprecation] Remove MQLLMEngine ( #25019 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-16 21:29:27 -07:00
58d4c705a8
[Core] Get num_encoder_tokens from scheduler config ( #24989 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-16 20:59:07 -07:00
ea3de5ef0d
[misc] fix typo in value error ( #24995 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2025-09-16 20:58:38 -07:00
67532a1a68
[UX] Remove "quantization is not fully optimized yet" log ( #25012 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-16 20:57:51 -07:00
5672ba90bd
[Docs] fix invalid doc link ( #25017 )
...
Signed-off-by: zxw <1020938856@qq.com >
2025-09-16 20:53:23 -07:00
dd83a157f1
[UX] Enforce valid choices for envs like VLLM_ATTENTION_BACKEND, etc ( #24761 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-09-16 20:42:23 -07:00
5a411ef6c4
[Benchmarks] Add MMVU video dataset support and clean up deprecated datasets ( #24719 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-17 03:29:43 +00:00
eeb135eb87
[Core] Use CpuGpuBuffer for block table tensors ( #24795 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-16 19:18:06 -07:00
3059b9cc6b
[Doc] Add --force-overwrite option to generate_cmake_presets.py ( #24375 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-16 18:45:29 -07:00
64ad551878
Removes source compilation of nixl dependency ( #24874 )
...
Signed-off-by: bbartels <benjamin@bartels.dev >
Signed-off-by: Benjamin Bartels <benjamin@bartels.dev >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Daniele <36171005+dtrifiro@users.noreply.github.com >
2025-09-17 01:33:18 +00:00
cef32104b4
[FP8] Extend per-token-group quantization support to QuantFP8 ( #24342 )
...
Signed-off-by: Tahsin Tunan <tahsintunan@gmail.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
2025-09-16 18:31:06 -07:00
493b10f8bf
[CI] GPT-OSS GPQA eval test for Blackwell ( #24920 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-16 18:13:21 -07:00
d119fc8614
[CI][Bugfix] Fix failing Blackwell test ( #24993 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-16 15:55:02 -07:00
dbebb7f812
[Perf] Reuse workspace for FP8+FP4 Marlin MoE ( #20500 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-16 15:45:10 -06:00
3053a22b33
fp8 kv cache support fix for torch.compile ( #22758 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
2025-09-16 21:27:11 +00:00
02d4b85454
Use kwargs for long lists of EngineCoreRequest arguments in tests and fix extra kwargs ( #24987 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-16 14:06:56 -07:00
86daa875fe
[gpt-oss][1][bugfix] fix streaming final output ( #24466 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-16 13:56:16 -06:00
dcf2f3ec06
[ROCm] Add dependencies for ROCm ( #24900 )
...
Signed-off-by: Yida Wu <yida.wu@amd.com >
2025-09-16 19:49:06 +00:00
218454b9b2
[MISC] Add code owners of vllm/v1 to vllm/v1/core ( #24928 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-16 19:07:34 +00:00
f4d6eb95cf
[gpt-oss][1b] streaming add item id, content id ( #24788 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-16 18:41:12 +00:00
cd1f885bcf
Directly get max encoder len from VLLM config in V1 ( #24866 )
...
Signed-off-by: Sugar-zsg <952242923@qq.com >
2025-09-16 17:52:31 +00:00
d593cf28fa
[Misc] Add removed encoder-decoder models to previously supported models list ( #24961 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-16 10:46:46 -07:00
faa7a5daac
[Bugfix] Fix unable to run encoder model when disable_hybrid_kv_cache_manager is true ( #24571 )
...
Signed-off-by: lianyibo <lianyibo1@kunlunit.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-09-16 17:36:58 +00:00
567939953b
[Core/DBO][1/N] Add Dual-Batch Overlap mechanism to VLLM ( #23693 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-16 12:21:48 -04:00
08369289af
[Core][MultiModalHasher] Don't convert memoryviews to bytes during hashing ( #24925 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-16 15:32:47 +00:00
73cfb3c5ee
[Model] Clean up and simplify Mamba2 Metadata Usage in both V0 and V1 ( #24331 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-09-16 14:53:43 +00:00
4e5affeaa1
[CI] Add Decode Context Parallelism (DCP) test to CI ( #24487 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-09-16 21:21:28 +08:00
e4f0b4cd96
(doc): set cmake c++ compatible standard when building on MacOS CPU. ( #23483 )
...
Signed-off-by: teekenl <teekenlau@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-16 06:08:46 -07:00
de3e53a75b
feat: Add Grafana and Perses monitoring dashboards for vLLM ( #23498 )
2025-09-16 05:53:40 -07:00
85e0df1392
[Docs] move benchmarks README to contributing guides ( #24820 )
2025-09-16 05:52:57 -07:00
0faf3cc3e8
Move SpeculativeConfig from config/__init__.py to config/speculative.py ( #24904 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-16 12:51:35 +01:00
7ea5c73ad7
[Feat][EPLB] A novel static EPLB placement strategy for MoE models. ( #23745 )
...
Signed-off-by: bruceszchen <bruceszchen@tencent.com >
Signed-off-by: Chen Bruce <bruceszchen@tencent.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Chen Bruce <cszwwdz@vip.qq.com >
Co-authored-by: lemon412 <lemon412@foxmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-16 10:55:16 +00:00
27fcfe7bcf
[Mamba] Support TP>1 with quantization for mamba2 mixer in case n_groups % tp_size == 0 ( #24593 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
Signed-off-by: tomeras91 <57313761+tomeras91@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-16 10:51:01 +00:00
68dbde5dbb
[Bugfix] remove duplicate tokens streamed in required tool choice streaming ( #23312 )
...
Signed-off-by: Jason Cheng <jasoncky96@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-09-16 15:16:32 +08:00
04ad0dc275
[benchmark] Add triton version in the moe tuned config ( #24769 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-16 14:10:54 +08:00
238c4c1705
[QWEN NEXT] Fused MoE kernels Optimization configs ( #24924 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-16 13:06:03 +08:00
8c54610265
[Bug] [Spec Dec]: Fix kv_cache dtype mismatch for Eagle3 drafter on FP8 target ( #24505 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-09-16 04:45:38 +00:00
17871983a2
[Bugfix] Fix sequence parallelism bug when enable pipeline parallelism ( #24021 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-09-16 04:32:32 +00:00
759ef49b15
Remove V0 Encoder-Decoder Support ( #24907 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-15 21:17:14 -07:00
5206ab20ba
[XPU] Fix circular import error. ( #24927 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-16 03:35:36 +00:00
0af3ce1355
Upgrade flashinfer to 0.3.1 ( #24470 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-16 02:36:09 +00:00
e1279ef00f
[Docs] Update instructions for how to use existing torch binary ( #24892 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-16 02:25:50 +00:00
2942970d44
[Metrics] Hide deprecated metrics with gpu_ prefix ( #24245 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-09-15 20:15:57 -06:00
3c96e7b8a1
[CI] Small Accuracy Eval Test for Deepseek Model ( #24259 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-15 20:14:50 -06:00
b42566f440
[Bug] Fix is_flashmla_supported Check Error ( #24774 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-15 20:10:55 -06:00
d96e11167d
Add pytest-cov and .coveragerc ( #24778 )
...
Signed-off-by: Reza Barazesh <rezabarazesh@meta.com >
2025-09-15 20:08:46 -06:00
2891603efd
[ROCm][Bugfix] Fix the case where there's bias ( #24895 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-15 20:05:12 -06:00
de2cc3d867
[Deprecation] Remove DeepGEMM Old Symbol Wrapper ( #24902 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-15 20:03:29 -06:00
e95084308b
Updated CODEOWNERS for flashinfer, mla, fused_moe ( #24906 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-16 02:01:28 +00:00
7f6f2c1182
HuggingFace -> Hugging Face in Integration with Hugging Face docs ( #24889 )
2025-09-15 17:28:35 -07:00
5bcc153d7b
[Compile] Fix noop_elimination pass and add tests for noop_elimination ( #24880 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-15 23:33:18 +00:00
45bfa49cb8
[Tests] fix initialization of kv hash in tests ( #24273 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
2025-09-15 21:48:27 +00:00
fd2f10546c
[ci] fix wheel names for arm wheels ( #24898 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-15 14:39:08 -07:00
e757a629e7
[Bug] Fix Cutlass Scaled MM Compilation Error ( #24887 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-15 17:21:17 -04:00
aae725af7c
[Performance] Remove redundant clone() calls in cutlass_mla ( #24891 )
2025-09-15 20:21:53 +00:00
73df49ef3a
[gpt-oss][1a] create_responses stream outputs BaseModel type, api server is SSE still ( #24759 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-15 13:08:08 -07:00
25aba2b6a3
[gpt-oss] Add IncompleteDetails to ResponsesResponse ( #24561 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-15 13:07:55 -07:00
94b03f88dd
Bump Flashinfer to 0.3.1 ( #24868 )
...
Signed-off-by: bbartels <benjamin@bartels.dev >
2025-09-15 12:45:55 -07:00
49bfc538e4
Update num_tokens_across_dp to use nccl instead of gloo ( #24105 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-09-15 19:05:48 +00:00
a0b26701c9
[Transform] Deterministic Hadacore Transforms ( #24106 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-09-15 12:59:31 -06:00
c4afdb69cc
Move MultiModalConfig from config/__init__.py to config/multimodal.py ( #24659 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-15 17:43:16 +00:00
b834b4cbf1
[USAGE] Improve error handling for weight initialization in Unquantized… ( #20321 )
...
Signed-off-by: Rafael Marcelino Koike <rafael.koike@oracle.com >
Signed-off-by: Rafael Koike <koike.rafael@gmail.com >
2025-09-15 16:45:49 +00:00
740f0647b1
Reinstate existing torch script ( #24729 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-15 09:43:40 -07:00
01413e0cf5
Fp8 paged attention update ( #22222 )
...
Signed-off-by: Xiao Yu <xiao.yu@amd.com >
Signed-off-by: xiao-llm <xiao.yu.dc@outlook.com >
Co-authored-by: Xiao Yu <xiao.yu@metamaterial.com >
Co-authored-by: Xiao Yu <xiao.yu@amd.com >
Co-authored-by: Bowen Bao <bowenbao@amd.com >
2025-09-15 10:43:26 -04:00
0e219cd50b
[Bugfix] Fix GLM4.1V multimodal processor with compatibility for Transformers v4.56 ( #24822 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-15 20:45:06 +08:00
72c99f2a75
[Model]: support Ling2.0 ( #24627 )
...
Signed-off-by: vito.yy <vito.yy@antgroup.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-15 05:09:30 -07:00
bf214ca226
[Misc] Fix examples openai_pooling_client.py ( #24853 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-15 11:57:30 +00:00
2e41f5abca
[XPU] Set consistent default KV cache layout ( #24745 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-15 18:09:34 +08:00
bc0f6059a2
[UT] enhance free kv cache block queue popleft_n ( #24220 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-09-15 10:04:37 +00:00
8de261b04a
[P/D]kv_output_aggregator support P TP > D TP ( #23917 )
...
Signed-off-by: LCAIZJ <leichao139636@163.com >
Co-authored-by: leichao.lc <leichao.lc@antgroup.com >
2025-09-15 11:36:06 +02:00
a0d8b9738d
[Misc] Own KVConnectors installation ( #24867 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-15 02:21:09 -07:00
59e17dd4a0
[Misc] rename interval to max_recent_requests ( #24229 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-09-15 09:18:42 +00:00
4979eb79da
[Doc]: fix typos in various files ( #24821 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-15 01:08:52 -07:00
a8c0f59973
[Bugfix] MiDashengLM model contact error under concurrent testing ( #24738 )
...
Signed-off-by: chenbing8 <chenbing8@xiaomi.com >
Signed-off-by: bingchen-mi <chenbing8@xiaomi.com >
2025-09-15 06:38:12 +00:00
f4a948f33f
[Frontend] Skip stop in reasoning content ( #14550 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
2025-09-15 06:04:55 +00:00
3f3313981c
[kv cache] update num_free_blocks in the end ( #24228 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-09-15 05:15:12 +00:00
78818dd1b0
[Docs] Have a try to improve frameworks/streamlit.md ( #24841 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-14 21:50:36 -07:00
8e5cdcda4e
[Hybrid Allocator] Support Pipeline Parallel ( #23974 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-14 15:55:17 -07:00
90f3f7d73e
[Spec Decoding]Support Spec Decoding Metrics in DP Mode ( #24049 )
...
Signed-off-by: wuhang <wuhang6@huawei.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-14 21:11:09 +00:00
6dc8da5dc1
[Chore] Remove ipex_ops warning ( #24835 )
...
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-14 19:41:53 +00:00
79cbcab871
Force use C++17 globally to avoid compilation error ( #24823 )
...
Signed-off-by: chenfengjin <1871653365@qq.com >
2025-09-14 19:30:10 +00:00
ff68035932
[Benchmarks] Throw usage error when using dataset-name random and dataset-path together ( #24819 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-14 17:50:01 +00:00
1177dd53e9
fix type of sampling rate for encode_base64 ( #24826 )
...
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
2025-09-14 16:17:16 +00:00
fc2dbcda8b
[Perf] Fix DeepGEMM Contiguous Layout Issue, 5.5% Throughput Improvement ( #24783 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-14 11:20:17 -04:00
fec347dee1
[Misc] Improve s3_utils type hints with BaseClient ( #24825 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-09-14 12:11:14 +00:00
cc3173ae98
[Multi Modal][Performance] Fused Q,K's apply_rope into one ( #24511 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-14 08:10:21 +00:00
3e903b6cb4
[Chore] Minor simplification for non-PP path ( #24810 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-13 17:41:36 -07:00
973c9d01da
[Minor] Simplify duplicative device check for cuda ( #24793 )
...
Signed-off-by: Ziliang Peng <ziliangdotme@gmail.com >
2025-09-13 18:28:38 +00:00
15b8fef453
Remove redundant assignment in xfer_buffers, This is a little fix ( #24732 )
...
Signed-off-by: ChenTaoyu-SJTU <ctynb@qq.com >
2025-09-13 08:11:59 +00:00
cfa3234a5b
[CI][Spec Decode] Adjust threshold for flaky ngram spec decoding test again ( #24771 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-09-13 15:45:11 +08:00
41ae4a1eab
[Doc]: fix typos in various files ( #24798 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-13 00:43:33 -07:00
4dad72f0d9
[Misc] Correct an outdated comment. ( #24765 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-13 00:34:53 -07:00
59d7ffc17f
[CI Failure] Fix test_flashinfer_cutlass_mxfp4_mxfp8_fused_moe ( #24750 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-13 07:29:19 +00:00
1da0f1441d
[Core][Multimodal] Cache supports_kw ( #24773 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-13 07:27:04 +00:00
98229db244
[Kernels][DP/EP] Optimize Silu Kernel for R1 ( #24054 )
...
Signed-off-by: elvircrn <elvircrn@gmail.com >
2025-09-13 00:17:27 -07:00
dbeee3844c
[Perf] Use NVIDIA hardware-accelerated instruction for float to fp8_e4m3 quantization ( #24757 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-13 00:16:24 -07:00
30498f2a65
[Doc]: Remove 404 hyperlinks ( #24785 )
...
Signed-off-by: Rakesh Asapanna <45640029+rozeappletree@users.noreply.github.com >
2025-09-13 00:15:41 -07:00
abc7989adc
[Docs] Remove Neuron install doc as backend no longer exists ( #24396 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-13 00:15:03 -07:00
9a8966bcc2
[Docs] Fix warnings in mkdocs build (continued) ( #24791 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-09-13 00:13:44 -07:00
5febdc8750
[Chore] Remove unused batched RoPE op & kernel ( #24789 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-13 00:08:20 -07:00
99bfef841f
[Bugfix] Fix GPUModelRunner has no attribute lora_manager ( #24762 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-12 23:55:14 -07:00
89e08d6d18
[Model] Add Olmo3 model implementation ( #24534 )
...
Signed-off-by: Shane A <shanea@allenai.org >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-13 03:26:21 +00:00
7f2ea7074e
[Frontend][Multimodal] Allow skipping media data when UUIDs are provided. ( #23950 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-09-13 02:16:06 +00:00
4fdd6f5cbf
[Core] Support async scheduling with uniproc executor ( #24219 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Ronald1995 <ronaldautomobile@163.com >
Co-authored-by: Ronald1995 <ronaldautomobile@163.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-12 16:34:28 -07:00
8226dd56bf
[Qwen3Next] Fixes the cuda graph capture conditions under large batch sizes ( #24660 ) ( #24667 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-09-12 22:31:32 +00:00
5fe643fc26
Add FLASHINFER_MLA to backend selector test ( #24753 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-09-12 22:30:07 +00:00
7ba32aa60b
[Attention][FlashInfer] Enable FP8 FlashInfer (TRTLLM) MLA decode ( #24705 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-09-12 15:45:53 -06:00
c89ed8de43
Invert pattern order to make sure that out_proj layers are identified ( #24781 )
...
Signed-off-by: Alexandre Marques <almarque@redhat.com >
2025-09-12 14:45:29 -07:00
3beadc2f25
[Compilation Bug] Fix Inductor Graph Output with Shape Issue ( #24772 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-12 21:23:05 +00:00
bc636f21a6
[Benchmark] Allow arbitrary headers to be passed to benchmarked endpoints ( #23937 )
...
Signed-off-by: Clayton Coleman <smarterclayton@gmail.com >
2025-09-12 13:57:53 -07:00
017354c0ef
[CI] Trigger BC Linter when labels are added/removed ( #24767 )
2025-09-12 11:44:36 -07:00
010acc6e1e
[Bugfix] Fix incompatibility between #20452 and #24548 ( #24754 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-12 11:17:29 -07:00
c8c42597ab
[CI] Speed up model unit tests in CI ( #24253 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
2025-09-12 10:36:50 -07:00
9d2a44606d
[UX] Remove AsyncLLM torch profiler disabled log ( #24609 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-09-12 10:08:44 -07:00
f17c075884
[Model] Switch to Fused RMSNorm in GLM-4.1V model ( #24733 )
...
Signed-off-by: SamitHuang <285365963@qq.com >
2025-09-12 09:12:23 -07:00
b0d1213ac3
[Models] Prevent CUDA sync in Qwen2.5-VL ( #24741 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-12 16:03:55 +00:00
57f94e88ea
[Models] Optimise and simplify _validate_and_reshape_mm_tensor ( #24742 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-09-12 15:37:37 +00:00
684b6870e1
[Bugfix][Frontend] Fix --enable-log-outputs does not match the documentation ( #24626 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-09-12 08:01:24 -07:00
a5b84f1cbf
[Core] Shared memory based object store for Multimodal data caching and IPC ( #20452 )
...
Signed-off-by: donglu <donglu@cohere.com >
2025-09-12 07:54:17 -07:00
9f04d9d55f
[Qwen3-Next] MoE configs for H100 TP=1,2 and TP2/EP ( #24739 )
...
Signed-off-by: elvircrn <elvircrn@gmail.com >
2025-09-12 07:54:04 -07:00
4d7c1d531b
[Bugfix] Fix MRoPE dispatch on XPU ( #24724 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-09-12 21:43:56 +08:00
41f17bf290
[Docs] Fix warnings in mkdocs build (continued) ( #24740 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-09-12 06:43:15 -07:00
bcb06d7baf
[Doc]: fix typos in various files ( #24726 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-12 06:43:12 -07:00
0377802c20
[Multimodal] Remove legacy multimodal fields in favor of MultiModalFeatureSpec ( #24548 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-12 21:42:23 +08:00
72fc8aa412
[Multi Modal] Add FA3 in VIT ( #24347 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-09-12 21:27:24 +08:00
fdb09c77d6
[sleep mode] save memory for on-the-fly quantization ( #24731 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-12 11:25:19 +00:00
7a1c4025f1
[Kernel] [CPU] refactor cpu_attn.py:_run_sdpa_forward for better memory access ( #24701 )
...
Signed-off-by: ignaciosica <mignacio.sica@gmail.com >
2025-09-12 19:23:07 +08:00
60a0951924
[Bugfix] Fix BNB name match ( #24735 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-12 11:12:01 +00:00
64d90c3e4f
[Misc][gpt-oss] Add gpt-oss label to PRs that mention harmony or related to builtin tool call ( #24717 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-12 18:57:07 +08:00
59d5d2c736
[CI/Build] Skip prompt embeddings tests on V1-only CPU backend ( #24721 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-12 18:51:01 +08:00
d21a36f5f9
[CI] Add ci_envs for convenient local testing ( #24630 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-12 08:52:25 +00:00
561a0baee0
[CI] Fix flaky test v1/worker/test_gpu_model_runner.py::test_kv_cache_stride_order ( #24640 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-12 07:49:09 +00:00
f592b3174b
[BugFix] Fix Qwen3-Next PP ( #24709 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-11 23:35:04 -07:00
7920de0a2a
[Bugfix] Fix MRoPE dispatch on CPU ( #24712 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-12 04:56:31 +00:00
ddcec289c7
Fix implementation divergence for BLOOM models between vLLM and HuggingFace when using prompt embeds ( #24686 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-12 04:35:48 +00:00
e090b7b45b
Enable conversion of multimodal models to pooling tasks ( #24451 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-09-12 03:30:41 +00:00
6a50eaa0d3
[DOCs] Update ROCm installation docs section ( #24691 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-11 20:02:53 -07:00
12a8414d81
[Qwen3-Next] MoE configs for H20 TP=1,2,4,8 ( #24707 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-12 10:06:26 +08:00
880c741bb6
[Bugfix] fixes the causal_conv1d_update kernel update non-speculative decoding cases ( #24680 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-11 18:16:43 -07:00
40b6c9122b
[V1] feat:add engine v1 tracing ( #20372 )
...
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com >
Signed-off-by: Ye Zhang <zhysishu@gmail.com >
Signed-off-by: RichardoMu <44485717+RichardoMrMu@users.noreply.github.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Mu Huai <tianbowen.tbw@antgroup.com >
Co-authored-by: Ye Zhang <zhysishu@gmail.com >
Co-authored-by: Benjamin Bartels <benjamin@bartels.dev >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: 瑜琮 <ly186375@antfin.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-11 17:10:39 -07:00
2e6bc46821
[Startup] Make DeepGEMM warmup scale with max-num-batched-tokens ( #24693 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-09-11 20:10:19 -04:00
fcba05c435
[Bug] Fix Layer weight_block_size Assertion Issue ( #24674 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-11 19:47:59 -04:00
7a30fa8708
[Doc] Clarify cudagraph capture size logic and default behavior in scheduler ( #18698 )
...
Signed-off-by: Zazzle516 <2405677060@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 23:18:09 +00:00
f82f7a8990
[Qwen3-Next] MOE configs for H100 TP4 ( #24699 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-11 15:45:52 -07:00
c3aea10dc8
[Perf] Use upstream CUTLASS for SM90 Block FP8 kernel ( #23280 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-11 15:43:14 -07:00
d4fd2768ef
[Bugfix][Attention] Fix FlashInfer MLA block size logic ( #24692 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-09-11 22:39:42 +00:00
7a70a71892
[Qwen3-Next] Add B200 MoE configs for Qwen3-next ( #24698 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com >
2025-09-11 15:34:58 -07:00
7d4651997a
[CI/Build] Add bc-linter to vLLM CI ( #21234 )
...
Signed-off-by: zhewenli <zhewenli@meta.com >
2025-09-11 15:34:36 -07:00
569bf1c9c0
[Qwen3-Next] MoE configs for H200 TP=1,2,4 ( #24695 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-11 14:38:16 -07:00
1ec20355f5
[Bugfix] Set VLLM_ALLREDUCE_USE_SYMM_MEM default to False ( #24696 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-11 14:32:27 -07:00
e42af78b18
[flashinfer] [kernel] support for fp8 kv cache for trtllm prefill attention ( #24197 )
...
Signed-off-by: Xiaozhu <mxz297@gmail.com >
2025-09-11 14:20:09 -07:00
074854b24f
[Kernel][B200] mxfp4 fused cutlass moe ( #23696 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-11 17:04:56 -04:00
79ac59f32e
Update Spec Decode metrics to include drafted and accepted token throughput ( #24127 )
...
Signed-off-by: Andrew Xia <axia@meta.com >
2025-09-11 19:58:43 +00:00
b971f91504
[BugFix] Fix tokenize asyncio task leak ( #24677 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-11 19:44:04 +00:00
c733bd5e87
[Qwen3-Next] Add MoE Config for H200 ( #24688 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-09-11 12:40:15 -07:00
a892b259b4
[Doc] Remove Useless Comments ( #24687 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-11 12:25:47 -07:00
127ded0a9e
[Ultravox] Use wrapped_model_config to instantiate inner model ( #24679 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2025-09-11 18:52:24 +00:00
bb2b5126da
[VLM] Migrate remain DP-supported ViT models to use disable_tp ( #24363 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-11 18:30:41 +00:00
361ae27f8a
[Docs] Fix formatting of transcription doc ( #24676 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 11:18:06 -07:00
e26fef8397
fix some typos ( #24616 )
...
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
2025-09-11 10:48:46 -07:00
c1eda615ba
Fix model name included in responses ( #24663 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 10:47:51 -07:00
4aa23892d6
[Bugfix] Fix platform-specific routing in CustomOp implementations ( #24444 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2025-09-11 17:15:01 +00:00
1fdd5c42d7
[Kernels] Enable Torch Symmetric Memory All-Reduce By Default ( #24111 )
...
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-11 09:45:31 -07:00
bcbe2a4d9e
[VLM] Optimize GLM4.5-V-style video processing to only decode necessary frames ( #24161 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-11 09:44:34 -07:00
51d41265ad
[Docs] Fix typos in EP deployment doc ( #24669 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 09:07:23 -07:00
4984a291d5
[Doc] Fix Markdown Pre-commit Error ( #24670 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-11 09:05:59 -07:00
404c85ca72
[Docs] Add transcription support to model ( #24664 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-11 07:39:01 -07:00
817beef7f3
[Bugfix] Fix qwen-next packed_modules_mapping ( #24656 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-11 22:26:17 +08:00
4f6593b058
[HybridKVCache][Platform] Add support_hybrid_kv_cache for platform ( #24646 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2025-09-11 21:47:58 +08:00
94e6b2d55f
Allow users to specify kv cache memory size ( #21489 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 13:41:07 +00:00
fd1ce98cdd
[CI] Split mteb test from Language Models Test ( #24634 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-11 06:37:51 -07:00
d11ec124a0
[Bench] Add qwen-next in benchmark_moe.py ( #24661 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-11 21:29:43 +08:00
f510715882
[build] add torch to tool.uv no-build-isolation-package ( #24303 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 13:19:44 +00:00
f946197473
[Docs] Fixes a typo in the qwen3next model name. ( #24654 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-09-11 19:35:14 +08:00
0cd72a7b72
[XPU] add missing dependency tblib for XPU CI ( #24639 )
...
Signed-off-by: Fanli Lin <fanli.lin@intel.com >
2025-09-11 11:22:33 +00:00
5f5271f1ee
Move LoRAConfig from config/__init__.py to config/lora.py ( #24644 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 11:01:38 +00:00
d6249d0699
Fix typing for safetensors_load_strategy ( #24641 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-11 10:41:39 +00:00
25bb9e8c65
[CI Failure] fix models/language/pooling/test_auto_prefix_cache_support.py ( #24636 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-11 03:31:23 -07:00
a1213fae5f
[Misc] Add @NickLucche to codeowners ( #24647 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-11 17:18:09 +08:00
a8b0361c92
[CI] Split pooling from entrypoints Test ( #24632 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-11 01:53:09 -07:00
ed5ae4aace
[Bugfix] Fix _synced_weight_loader ( #24565 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2025-09-11 16:52:33 +08:00
0fc36463e0
[CI]Add transformers_utils to Async Engine, Inputs, Utils, Worker Test ( #24615 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
2025-09-11 01:52:10 -07:00
d14c4ebf08
[Docs] Use 1-2-3 list for deploy steps in deployment/frameworks/ ( #24633 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-11 01:50:12 -07:00
ba6011027d
[Docs] Update V1 doc to reflect whisper support ( #24606 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-11 01:50:08 -07:00
85df8afdae
[Docs] Revise frameworks/anything-llm.md ( #24489 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-11 01:50:05 -07:00
6aeb1dab4a
[Bugfix] Fix incorrect import of CacheConfig ( #24631 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-11 01:48:25 -07:00
e93f4cc9e3
Add the support for the qwen3 next model (a hybrid attention model). ( #24526 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-11 15:32:09 +08:00
2048c4e379
[torchao] Support quantization configs using module swap ( #21982 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-09-10 23:53:24 -07:00
d13360183a
Remove redundant all gather + split ( #23441 )
...
Co-authored-by: Chenxi Yang <cxyang@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2025-09-10 23:45:07 -07:00
9bd831f501
[Model] New model support for Motif-1-Tiny ( #23414 )
...
Signed-off-by: ca1207 <ca1207zzz@gmail.com >
Signed-off-by: TaehyunKim <73943231+ca1207@users.noreply.github.com >
Co-authored-by: WyldeCat <skan1543@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-10 23:29:40 -07:00
e2b1f863aa
[Doc]: fixing doc typos ( #24635 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-10 23:19:28 -07:00
41329a0ff9
[Core] feat: Add --safetensors-load-strategy flag for faster safetensors loading from Lustre ( #24469 )
...
Signed-off-by: Shiqi Sheng <shengshiqi@google.com >
Signed-off-by: shengshiqi-google <160179165+shengshiqi-google@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-10 23:10:01 -07:00
ee0bc5e1b4
Enable --profile in 'vllm bench throughput' ( #24575 )
...
Signed-off-by: Tomas Ruiz <tomas.ruiz.te@gmail.com >
2025-09-10 23:06:19 -07:00
3d1393f6fc
Kimi K2 Fused MoE kernels Optimization configs ( #24597 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
2025-09-10 23:06:16 -07:00
8a894084d2
[Engine][Chore] use local variable and remove output var assignment ( #24554 )
...
Signed-off-by: Guy Stone <guys@spotify.com >
2025-09-10 23:05:42 -07:00
e2d8c27f68
[BugFix] Fix pipeline parallel ( #24621 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-10 23:05:30 -07:00
29799ddacc
[Bugfix] Add missing VIT backend dispatch on CPU ( #24623 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-10 22:28:41 -07:00
f17a6aa4ec
[Ultravox] Fix Gemma instantiation, support quantization via --hf-overrides ( #24131 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2025-09-10 22:25:34 -07:00
6c8deacd72
[Bug] [Spec Decode] Fix model_initialization test and mismatch in aux_hidden_layers ( #24613 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-10 21:23:18 -07:00
55b823ba0f
Add @chaunceyjiang to codeowner for reasoning Reasoning and Tool parser ( #24406 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-11 04:23:04 +00:00
8c5a747246
[distributed] update known issues ( #24624 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-11 11:09:38 +08:00
5931b7e5d9
[Models][Quantization] Add quantization configuration update in Voxtral model ( #24122 )
...
Signed-off-by: Alexandre Marques <almarque@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-10 19:13:56 -07:00
cc99baf14d
[Misc] Make timeout passable in init_distributed_environment ( #24522 )
...
Signed-off-by: jberkhahn <jaberkha@us.ibm.com >
2025-09-10 15:41:12 -07:00
dcb28a332b
[Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration ( #21078 )
...
Signed-off-by: hjjq <hanjieq@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-09-10 15:31:10 -07:00
fba7856581
[Perf] Warmup FlashInfer attention during startup ( #23439 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-09-10 15:03:17 -07:00
b5e383cd8b
[gpt-oss] raise error for flashinfer backend without trtllm ( #24482 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-10 14:33:13 -07:00
9a161307f5
[torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends ( #19767 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-10 13:59:55 -07:00
37e8182bfe
[v1] Add Whisper model support (encoder-decoder) ( #21088 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: NickLucche <nlucches@redhat.com >
2025-09-10 13:53:35 -07:00
4db4426404
[CI] Fail subprocess tests with root-cause error ( #23795 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-10 13:53:21 -07:00
a0933c3bd6
[Bugfix] Enable FP8 KV cache for FlashInfer and Triton backend on non-sm100 GPUs ( #24577 )
...
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg >
2025-09-10 12:33:41 -07:00
09e68bce34
[Misc] update log level debug to warning when process port is used by ( #24226 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-10 11:32:57 -07:00
9fb74c27a7
[Core] Support configuration parsing plugin ( #24277 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
Signed-off-by: Xingyu Liu <38244988+charlotte12l@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-10 11:32:43 -07:00
4032949630
[Bugfix] Fix DeepEP config for DP4TP4 ( #23619 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-09-10 10:37:56 -07:00
08abfa78ec
[Bugfix] fix modelopt exclude_modules name mapping ( #24178 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-10 10:20:46 -07:00
2bef2d1405
[Logging] allow config logging stream ( #24336 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2025-09-10 15:02:01 +00:00
36cacd0958
[Doc] Add documentation for GLM-4.5 series models: tool-calling and reasoning parser ( #24589 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-09-10 07:50:55 -07:00
bb3eb80d92
[Core] Split LoRA layers ( #24574 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-10 07:47:51 -07:00
fcc0a3130a
[CI] Fix tensorizer test assertion ( #24545 )
...
Signed-off-by: Peter Schuurman <psch@google.com >
2025-09-10 06:57:36 -07:00
736569da8d
[Platform] Custom ops support for LMhead and LogitsProcessor ( #23564 )
...
Signed-off-by: zzhx1 <zzh_201018@outlook.com >
2025-09-10 06:26:31 -07:00
2eb9986a2d
[BugFix] python collect_env.py and vllm collect-env compatibility with uv venv ( #24066 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-09-10 21:25:33 +08:00
ccee371e86
[Docs] Fix warnings in mkdocs build (continued) ( #24092 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-10 06:23:28 -07:00
c0bd6a684a
Fix Auto_Round Quatization Loading on SM75 and Lower GPUs ( #24217 )
...
Signed-off-by: RoadToNowhereX <37441177+RoadToNowhereX@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-10 06:22:31 -07:00
3144d90217
fix some typos ( #24167 )
...
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-10 06:21:23 -07:00
2f5e5c18de
[CI/Build] bump timm dependency ( #24189 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-10 06:20:59 -07:00
bd98842c8a
[CI] Add PPL test for generation models ( #24485 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-10 06:16:39 -07:00
d6069887c6
[rocm] enable torchao quantization for rocm ( #24400 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2025-09-10 06:16:21 -07:00
492196ed0e
[CI/Build] split true unit tests to Entrypoints Unit Tests ( #24418 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-10 06:16:07 -07:00
f4f1a8df22
[BugFix] Ensure integrity of reused CPU tensors during async scheduling ( #24527 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: guoze.lin <guozelin@tencent.com >
2025-09-10 21:15:14 +08:00
0b9a612fa3
[BugFix][easy] Fix flaky test test_gpt_oss_multi_turn_chat ( #24549 )
...
Signed-off-by: lacora2017 <yehu@meta.com >
Co-authored-by: lacora2017 <yehu@meta.com >
2025-09-10 21:14:55 +08:00
4c04eef706
[BugFix][Multi Modal] Fix TensorSchema shape mismatch in Molmo ( #24559 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-09-10 06:14:27 -07:00
f36355abfd
Move LoadConfig from config/__init__.py to config/load.py ( #24566 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-10 06:14:18 -07:00
9e3c3a7df2
[LoRA]: Add LoRA support to Mistral's Voxtral models ( #24517 )
...
Signed-off-by: Yash Pratap Singh <yashsingh20001@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-10 06:12:03 -07:00
6cbd41909e
Feature/vit attention unification# 23880 ( #23978 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-10 06:10:14 -07:00
72d30108a0
Support for NemotronH Nano VLM ( #23644 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com >
2025-09-10 06:10:06 -07:00
8b83b93739
[Docs] Document the extra memory footprint overhead when using EPLB ( #24537 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-10 06:09:49 -07:00
9dbefd88e9
[Docs] Improve organisation of API Reference nav ( #24569 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-10 06:08:21 -07:00
7c195d43da
[ROCm][Bugfix] Fix Aiter RMSNorm ( #23412 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-09-10 21:08:03 +08:00
0ae43dbf8c
[Attention] add DCP support for FLASH_ATTN_MLA backend ( #24453 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-10 17:19:26 +08:00
267c80d31f
[Model] Limit CPU threads for image transformations in InternVL to reduce cpu contention. ( #24519 )
...
Signed-off-by: li-jinpeng <3332126450@qq.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-10 16:45:44 +08:00
77f62613f9
Consolidate rendering parameters into RenderConfig dataclass ( #24543 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-10 08:44:47 +00:00
feaf202e93
[Bugfix] Guard _may_reorder_batch for encoder-only models on CPU ( #24319 ) ( #24348 )
...
Signed-off-by: Remy <eunhwan.shin@dtonic.io >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-09-10 14:24:42 +08:00
91130ae376
[docs] promo pytorch conf and ray summit ( #24562 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-09-09 23:24:20 -07:00
e40827280b
[Docs] Enable relative links in examples to function when rendered in the docs ( #24041 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-09 21:40:45 -07:00
4377b1ae3b
[Bugfix] Update Run:AI Model Streamer Loading Integration ( #23845 )
...
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai >
Signed-off-by: Peter Schuurman <psch@google.com >
Co-authored-by: Omer Dayan (SW-GPU) <omer@run.ai >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-09 21:37:17 -07:00
009d689b0c
[Core] Simplify and unify mm uuid handling & auto-generated mm hash overrides processing. ( #24271 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-09-09 21:36:09 -07:00
0efdb5c3ba
[gpt-oss] Cache permute indices for faster MXFP4 MoE layer loading ( #24154 )
...
Signed-off-by: Wei Wei <wwei6@meta.com >
2025-09-10 04:27:53 +00:00
53b42f4102
[BugFix][Spec Decode] Fix out-of-range index triggered by eagle3; re-enable test for LlamaForCausalLMEagle3 ( #24392 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-09-09 21:24:23 -07:00
309d7aa401
[P/D] MultiConnector supports shutdown ( #24425 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-09 21:24:11 -07:00
b4a01aaf95
[KV Connector] More async support for get_num_new_matched_tokens ( #23620 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
2025-09-09 21:23:37 -07:00
83dd28aae4
[CI] Adjust threshold for flaky ngram spec decoding test ( #24528 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-09 21:07:33 -07:00
f88e84016f
[BugFix] Fix async core engine client finalizer ( #24540 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-09 21:07:13 -07:00
3c2156b3af
[Hardware][Apple-CPU] Enable native bfloat16 on Apple Silicon (M2 and later) ( #24129 )
...
Signed-off-by: ignaciosica <mignacio.sica@gmail.com >
2025-09-10 03:50:21 +00:00
7e7db04310
[CI] Retry flaky fp8 cutlass mla tests ( #24536 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-09 20:33:10 -07:00
41f160b974
Add @heheda12345 to CODEOWNERS of KVCacheManager related code ( #24546 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-10 03:30:32 +00:00
dc625ea6b8
[Perf] Convert np array to torch tensor to index into block table for attn chunking ( #24474 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-09-09 20:01:06 -07:00
b23fb78623
[Bugfix] Fix for 24530. Fix naive all2all shared expert overlap. ( #24538 )
2025-09-09 17:53:53 -07:00
561f38dc3c
[Bugfix] Improve EPLB config validation error message ( #24524 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-09-10 00:32:36 +00:00
73e688cb79
[ROCm][Feature] Enable Pipeline Parallelism with Ray Compiled Graph on ROCm ( #24275 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-09-09 23:27:35 +00:00
fb1a8f932a
[Benchmark] Add option to skip oversampling in benchmark ( #24457 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2025-09-09 22:00:17 +00:00
0dc9cbb527
[Benchmark] Update bench doc with mtbench, blazedit, spec bench ( #24450 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2025-09-09 21:15:41 +00:00
b5fb3005a8
[Log] Use a relative path in debug-level logs to distinguish files with identical names ( #23846 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-09 16:46:35 -04:00
15de5ff9ea
[Feature] Disallow FlashMLA on Blackwell ( #24521 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-09 14:59:34 -04:00
b8a93076d3
[CI] execute all piecewise compilation tests together ( #24502 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-09 11:05:25 -07:00
c3f9773b2c
[TPU] Fix tpu structured decoding in mixed batches ( #24458 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-09-09 11:04:25 -07:00
3707cb2505
[Docs] Gemma3n transcriptions endpoint support ( #24512 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-09 11:03:32 -07:00
920ed46b09
[Misc] bump outlines_core to fix the version conflicts with outlines >= 1.2.0 ( #24368 )
...
Signed-off-by: Kazuhiro Serizawa <nserihiro@gmail.com >
Signed-off-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-09 10:59:46 -07:00
15cb047e25
Extend renderer with embedding support and integrate completion endpoint ( #24405 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-10 01:46:46 +08:00
9ad0688e43
[Bugfix] Fix hidden_size for multimodal classification model ( #24501 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-09 10:37:25 -07:00
b9a1c4c8a2
[ROCm][CI/Build] Sync ROCm dockerfiles with the ROCm fork ( #24279 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-09 12:21:56 -04:00
1aa427fdc1
[Kernels] Add Flash Linear Attention Kernels ( #24518 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-10 00:04:41 +08:00
1c63a16b65
[Core] Run garbage collector after CUDA graph capture to fix throughput regression ( #24128 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-09-09 10:38:10 -04:00
922d3b401b
[Bugfix] Handle the edge case in detokenizer where processed tokens contain both stop str and eos token ( #23938 )
...
Signed-off-by: dtransposed <damian.bogunowicz@gmail.com >
2025-09-09 07:30:24 -07:00
19332c0479
[Model] Systematic support for fp32 head, pooling models part ( #23810 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-09 07:29:50 -07:00
a55cf41a09
[Compilation][WideEP] Enable Piecewise CUDAGraph for DeepEPHT ( #24123 )
2025-09-09 10:21:10 -04:00
6fb2788163
[CI/Build][Doc] Fully deprecate old bench scripts for serving / throughput / latency ( #24411 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-09 10:02:35 +00:00
3d2a2de8f7
[RL] fast weight update with zmq + ipc handles ( #24295 )
...
Signed-off-by: huangweixiao <huangweixiao@msh.team >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-09-09 16:57:46 +08:00
1116590b16
[gpt-oss] Validate gpt-oss python tool during initialization ( #23856 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-09 08:37:48 +00:00
ccb97338af
[Misc] Add Codex settings to gitignore ( #24493 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-09-09 01:25:44 -07:00
45c9cb5835
[Misc] Add claude settings to gitignore ( #24492 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-09 01:14:45 -07:00
e283976f3a
[Performance][MM] Building the inverse permutation in O(n) time in Qwen2_5_VisionTransformer ( #24443 )
...
Signed-off-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: Junhong <liujunhong11@huawei.com >
2025-09-09 00:24:11 -07:00
46876dff32
[Doc]: fixing typos to improve docs ( #24480 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-08 23:06:04 -07:00
1823a00d67
[Misc] Support bench serve long context ( #24373 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-09-08 22:53:10 -07:00
ed16d0f26f
[Doc] mention fpdb for multiprocess breakpoints ( #24452 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
2025-09-08 21:46:45 -07:00
0cdd213641
[Misc] Improve Worker process title and logging prefix ( #22205 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-08 21:43:48 -07:00
948dd3443b
[Bugfix] Fix Apertus HF repo name ( #24447 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-09-08 21:40:29 -07:00
b2f7745774
Add data_parallel_size to VllmConfig string representation ( #24298 )
...
Co-authored-by: Cong Chen <congc@meta.com >
2025-09-08 21:35:18 -07:00
82dfb12e52
[Core] Use sha256 bytes instead of BlockHash to reduce GC overhead ( #23673 )
...
Signed-off-by: linzebing <linzebing1995@gmail.com >
2025-09-08 21:34:37 -07:00
bba1042c6f
[Flashinfer] Support Flashinfer TRTLLM FP8-qkv BF16/FP16-out Attention Kernel ( #23647 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-08 20:53:07 -07:00
b6fbc15634
[BugFix][Model] Fix Ernie4.5-VL hanging on long inputs ( #24074 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2025-09-09 11:37:16 +08:00
3e0d4a3475
Move KVTransferConfig from config/__init__.py to config/kv_transfer.py ( #24434 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 20:30:32 -07:00
562663a044
Bump actions/github-script from 7.0.1 to 8.0.0 ( #24413 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-09 03:12:44 +00:00
ed1623a88a
Bump actions/stale from 9.1.0 to 10.0.0 ( #24412 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-09 03:11:20 +00:00
13b89bd823
[doc] update vllm serve cli args documentation ( #24329 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2025-09-09 03:07:58 +00:00
22a0070530
Bump actions/setup-python from 5.4.0 to 6.0.0 ( #24414 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-09-09 02:54:58 +00:00
170129eb28
[gpt-oss] Harmony changes with container tool support ( #23386 )
...
Signed-off-by: zhiweiz <zhiweiz@fb.com >
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
Co-authored-by: zhiweiz <zhiweiz@fb.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2025-09-08 19:03:50 -07:00
955c624915
[Bugfix][Wide EP] Fix redundant work when using DeepEP, TP Attn, and EP MoE ( #24134 )
...
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com >
2025-09-08 19:01:51 -07:00
4f87abdcc6
Update reviewers for modelopt related files ( #24468 )
2025-09-09 01:53:13 +00:00
6910b56da2
[CI] Add nightly multiarch manifests to dockerhub ( #24102 )
...
Signed-off-by: Sahithi Chigurupati <chigurupati.sahithi@gmail.com >
Signed-off-by: Simon Mo <simon.mo@hey.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-09 01:18:09 +00:00
e10fef0883
[Hardware][IBM Z] Fix Outlines Core issue for s390x ( #24034 )
...
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com >
2025-09-08 16:50:34 -07:00
e680723eba
[Bugfix] Disable the statslogger if the api_server_count is greater than 1 ( #22227 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-09-08 15:28:03 -07:00
620db1fc58
[Attention] FlashAttention MLA cudagraph support ( #23958 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-09-08 22:05:26 +00:00
41183c1fe0
[Spec Decode] Fix offline spec_decode.py ( #24257 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-08 20:44:13 +00:00
43d9ad03ba
[Model loader]: support multi-thread model weight loading ( #23928 )
...
Signed-off-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com >
Signed-off-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-08 18:49:39 +00:00
7be141b2c5
[CI] Enable encoder model compilation test ( #24442 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-08 11:48:06 -07:00
8d7f39b48c
[Model] Remove quantized mixtral ( #24437 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-08 11:02:14 -07:00
cd08636926
[Spec Decode][Benchmark] Add Blitzedit dataset ( #23605 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-08 10:32:52 -07:00
3feeeb9fea
[Spec Decode][Benchmark] Add Spec Bench Dataset for benchmarking ( #23563 )
...
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com >
2025-09-08 10:32:42 -07:00
6f4a82f8b5
[Model] Enable BNB support for qwen2_5_omni_thinker ( #24420 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-08 09:37:08 -07:00
c44797a4d6
[Docs]add eplb_config param use docs ( #24213 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-09-08 09:36:57 -07:00
55be93baf5
[Doc]: fix 2 hyperlinks leading to Ray site after they changed Ray's doc structure ( #24438 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 09:36:54 -07:00
717fc00e98
[Docs] Move feature compatibility tables to README ( #24431 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 06:45:14 -07:00
01dfb5e982
[Frontend] User-provided uuids for medias in chat. (RFC #22044 ) ( #23449 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-08 06:42:20 -07:00
03dd652c16
Move KVEventsConfig from config/__init__.py to config/kv_events.py ( #24433 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-08 06:41:27 -07:00
9cd76b71ab
[Misc] Terratorch related fixes ( #24337 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-08 06:40:26 -07:00
e041314184
[Bugfix] Fix mamba2 prefill chunking ( #23279 )
...
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com >
Signed-off-by: tomeras91 <57313761+tomeras91@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-08 11:42:41 +00:00
5e537f45b4
[Bugfix] Fix get_quant_config when using modelscope ( #24421 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-09-08 11:03:02 +00:00
c2a8b08fcd
[Doc] Fix issues in integrations/llamastack.md ( #24428 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-08 02:28:32 -07:00
f4962a6d55
[Doc]: fix typos in Python comments ( #24417 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-08 00:22:16 -07:00
2f0b833a05
[Docs] Fix a tip indentation and typo ( #24419 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-09-08 00:19:40 -07:00
425b04b8f4
[gpt-oss][Responses API] Fix the function call id format ( #24409 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-08 06:49:52 +00:00
60f0843ef8
[Model] Remove unnecessary CUDA sync of Qwen2VL image and video preprocess ( #24334 )
...
Signed-off-by: Win <chatcharinsang@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-07 23:11:12 -07:00
8a46602606
[Model] Remove unnecessary CUDA sync of GLM-4.1V image and video preprocess ( #24332 )
...
Signed-off-by: Win <chatcharinsang@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-07 23:10:54 -07:00
61aa4b2901
[P/D] Add a shutdown method to the Connector API ( #22699 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-07 23:07:00 -07:00
8c892b1831
[Doc] Fix UTF-8 encoding issues in documentation generation on Windows ( #24361 )
...
Signed-off-by: alekramelaheehridoy <aliqramalaheehridoy@gmail.com >
Signed-off-by: alekramelaheehridoy <alekramelaheehridoy@gmail.com >
Co-authored-by: alekramelaheehridoy <alekramelaheehridoy@gmail.com >
2025-09-07 22:33:52 -07:00
3bca396f79
[CI/Build] Fix local image inputs in test_pixtral.py ( #24401 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-09-08 03:31:35 +00:00
3a3e91bdfe
[CI/Build] Disable flaky test_structured_output tests ( #24404 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-09-08 02:51:59 +00:00
b3d7e3c845
[Sampler] Support returning all prompt logprobs ( #23868 )
...
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-07 19:34:31 -07:00
67841317d1
[xpu] upgrade ipex/python3.12 for xpu ( #23830 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-09-08 02:07:16 +00:00
86173ad593
[Kernel] Support decode context parallelism on Blackwell with CUTLASS MLA ( #24385 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-09-08 09:27:12 +08:00
795b6951cd
Add @luccafong to codeowner for spec decode ( #24397 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-09-08 08:30:27 +08:00
2e5d21378d
Skip MM Encoder for non-first PP ranks ( #24387 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-07 09:38:35 -07:00
0661cb9df3
Add renderer-based prompt processing for embedding and classification endpoints ( #24356 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-07 08:26:48 +00:00
105d3d62ef
[TPU] Remove TopKTopPSampler dependency for TPU sampler ( #24391 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-07 01:12:36 -07:00
62f66be1f7
[Bugfix] Fix Qwen3-coder moe tuned config ( #24072 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-07 05:19:46 +00:00
81c53ef55c
[Misc] collect flashinfer version in collect_env.py ( #24378 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-07 03:30:41 +00:00
75334956c2
QWEN3 Thinking Fused MoE kernels Optimization configs ( #24330 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
2025-09-07 03:18:54 +00:00
77aec83b8c
[Benchmark] add benchmark for custom activation op ( #23908 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-06 20:12:05 -07:00
e67597545b
[CI][Fix] deterministic seed for flaky CI runs on structured outputs ( #24380 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-09-07 11:10:40 +08:00
37a6fa95fd
Migrate Qwen2 inputs to TensorSchema ( #23475 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-06 20:07:31 -07:00
558f0907dc
[attention][DCP] use AttentionImpl.need_to_return_lse_for_decode ( #24372 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-07 01:18:59 +00:00
4172235ab7
[V0 deprecation] Deprecate V0 Neuron backend ( #21159 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-06 16:15:18 -07:00
848562bd49
break execute_model in gpu_model_runner into sub-functions for custom scopes ( #24265 )
...
Co-authored-by: Bangsheng Tang <bangsheng@meta.com >
2025-09-06 14:02:47 -07:00
e68dc2f014
[Bugfix] Fix unstable silu_mul+nvfp4 quant fusion test ( #24370 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-06 20:39:34 +00:00
a3645ed94d
[Frontend][Responses API] Support reporting tool output tokens and fix reasoning token count ( #24285 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-09-06 13:27:15 -07:00
fb691ee4e7
[Fix] [gpt-oss] fix non-tool calling path for chat completion ( #24324 )
2025-09-06 19:10:32 +00:00
6024d115cd
Lora bias(enable_lora_bias) deprecate warning ( #24339 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-07 00:42:19 +08:00
7555d6b34a
[Bugfix] Fix test_mixtral_moe ( #24371 )
2025-09-06 09:32:03 -07:00
00a4e56d8d
[Bugfix] Fix broken deepseek fp8 TP weights loading ( #24367 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-06 09:23:12 -07:00
0eadaeff7e
[Bugfix] Avoid uninitialized usage of azp_val when AZP is false. ( #24335 )
...
Signed-off-by: Mohan Kumar Kumar <mohan.cbein@gmail.com >
Signed-off-by: mohankku <mohan.cbein@gmail.com >
2025-09-06 08:17:03 -07:00
0077c8634e
Add @benchislett to codeowner for spec decode and structured outputs ( #24362 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-09-06 22:03:35 +08:00
b121ca22ad
[CI] Disable flaky structured output test from CI ( #24366 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-09-06 13:31:56 +00:00
eddaafc1c7
[Multimodal] Improve max video embedding length estimation in V1 ( #24312 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-09-06 02:33:19 -07:00
305a1cc0d2
refactor: Turn GPUModelRunner.inputs_embeds to a CpuGpuBuffer ( #24345 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-09-05 23:01:23 -07:00
6d6c6b05d3
[New Model]: google/embeddinggemma-300m ( #24318 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-05 22:58:36 -07:00
53b19ccdd5
[Core] Allow disabling TP sharding for parallel Linear layer ( #23024 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-05 22:53:58 -07:00
6432739ef1
[Bugfix] Catch and log invalid token ids in detokenizer ( #24351 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-05 22:30:22 -07:00
ac201a0eaf
[Feature] Support Decode Context Parallel (DCP) for MLA ( #23734 )
...
Signed-off-by: hongchao <hongchao@msh.team >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: hongchao <hongchao@msh.team >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-09-06 13:24:05 +08:00
3c529fc994
[KV Sharing] Raise error if using eagle with fast prefill ( #24350 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-09-05 20:22:40 -07:00
35bf193864
[Doc]: fix typos in Python comments ( #24294 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-05 19:41:12 -07:00
35efa70297
Add @22quinn as code reviewer for RL related components ( #24346 )
2025-09-06 01:56:15 +00:00
cee182b297
[Perf][V1] Fully overlap model execution ( #23569 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-09-05 18:20:17 -07:00
c954c6629c
[CI] Add timeouts to tests ( #24260 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-09-05 17:26:22 -07:00
9dfbeb41e5
[RFC] allow cancelation after shutdown in blocking collective_rpc ( #23390 )
...
Signed-off-by: Shiyan Deng <dsy842974287@meta.com >
2025-09-05 14:14:18 -07:00
eedb2a2a10
[Bugfix] Fix silu_mul+quant fusion test ( #24341 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-09-05 20:13:42 +00:00
23a6c5280e
[gpt-oss][Bugfix] Fix streamableparser for missing handling of certain token_ids ( #24306 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-09-05 10:26:00 -07:00
7812bcf278
[docs] add shenzhen meetup ( #24326 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-09-05 22:48:42 +08:00
006e7a34ae
Adding int4 and int8 models for CPU benchmarking ( #23709 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-09-05 20:08:50 +08:00
e599e2c65e
[XPU][P/D] Add XPU support in NixlConnector ( #22436 )
...
Signed-off-by: zhenwei <zhenwei.liu@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-04 21:03:12 -07:00
c29fb540ff
[gpt-oss] tool parser supports for /chat/completions [1/n] ( #22386 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-04 20:39:12 -07:00
65e038931d
[Frontend] Skip unnecessary detokenization when token_id is requested ( #24236 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-04 23:04:12 +00:00
886ccbe5ba
[CI/Build] Reduce the number of redundant cases to test for LoRA ( #24276 )
...
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com >
2025-09-04 21:58:44 +00:00
adc3ddb430
[Bugfix][Misc] Fix silu_and_mul_nvfp4_quant issue and extract common utils for nvfp4 kernel source files ( #23727 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-04 14:25:45 -07:00
60b755cbcb
[Misc] Have AsyncLLM custom_stat_loggers extend default logger list ( #20952 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-09-04 14:25:30 -07:00
482e52f56c
QWEN3 Coder Fused MoE kernels Optimization configs ( #24266 )
...
Signed-off-by: Saman Keon <samanamp@outlook.com >
2025-09-04 20:33:43 +00:00
78336a0c3e
Upgrade FlashInfer to v0.3.0 ( #24086 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-09-04 09:49:20 -07:00
94866d7c93
[Misc] Slightly improve deepgemm print ( #24085 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-04 16:06:51 +00:00
83609ca91d
[Doc]: fix typos in Python comments ( #24173 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-04 08:52:17 -07:00
e41a0fa377
[Perf] Freeze core engine proc heap after init ( #24008 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-09-04 22:55:23 +08:00
37241077d5
[Misc] Removed force_fp8_e4m3fnuz from FP8LinearOp ( #23725 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-04 09:25:40 -04:00
c9f7081f9c
[LoRA]: Add lora support to qwen-2.5-omni ( #24231 )
2025-09-04 05:50:50 -07:00
16ded21eeb
[XPU] support Triton Attention backend on Intel GPU ( #24149 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-09-04 20:41:08 +08:00
2b30afa442
Use hidden_size_per_head as head_size fallback ( #24221 )
...
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
2025-09-04 12:59:16 +01:00
eafa8dcde6
[Model] Add pp support for hunyuan ( #24212 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-04 03:58:26 -07:00
6c7af8110a
[Doc] Update vLLM Singapore Meetup info ( #24234 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-09-04 02:58:18 -07:00
8f423e5f43
[Feature][Response API] Add streaming support for non-harmony ( #23741 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-09-04 17:49:06 +08:00
369a079568
[Hardware][Apple-CPU] Disable OneDNN build for Apple Silicon ( #24200 )
...
Signed-off-by: ignaciosica <mignacio.sica@gmail.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-09-04 02:48:25 -07:00
402759d472
[Attention] FlashAttn MLA ( #14258 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Matthew Bonanni <mbonanni001@gmail.com >
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-04 02:47:59 -07:00
2c301ee2eb
[Bugfix] Fix Incremental Detokenization with tokenizers == 0.22.0 ( #24159 )
...
Signed-off-by: Fanli Lin <fanli.lin@intel.com >
Signed-off-by: Fanli Lin <fanli0116@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-04 02:47:08 -07:00
3efb9f4d95
[Attention][Platform] Refactor MLA to support Custom Op ( #23332 )
...
Signed-off-by: whx-sjtu <2952154980@qq.com >
2025-09-04 02:46:37 -07:00
04f3c35cff
Improve flexibility of auto_tune.sh execution. ( #23766 )
...
Signed-off-by: Anthony Su <50185138+anthonsu@users.noreply.github.com >
Signed-off-by: anthonsu <50185138+anthonsu@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-04 09:41:41 +00:00
51d5e9be7d
[Core][Model] Terratorch backend integration ( #23513 )
...
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com >
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Christian Pinto <christian.pinto@ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-04 00:22:41 -07:00
e7fc70016f
[Model] Add MiDashengLM model support ( #23652 )
...
Signed-off-by: chenbing8 <chenbing8@xiaomi.com >
Signed-off-by: bingchen-mi <chenbing8@xiaomi.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-04 00:08:09 -07:00
12e1e63cc5
[Misc] Enhance output readability of helper script ( #24214 )
...
Signed-off-by: Weida Hong <wdhongtw@google.com >
2025-09-04 06:38:26 +00:00
57b1ce94f7
[CPU] Refactor CPU unquantized linear ( #24150 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-09-04 14:28:45 +08:00
cb55ad86fe
Migrate ultravox inputs to TensorSchema ( #23503 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-04 06:09:11 +00:00
712b273f65
[Refactor] Introduce basic Renderer for completion-style request ( #24010 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-09-04 05:21:12 +00:00
e919d6f549
[Kernel][Bugfix] Fix grouped topk cu ( #24146 )
...
Signed-off-by: mayuyuace <qiming1.zhang@intel.com >
2025-09-04 12:37:37 +08:00
a38f8bd54c
[Feature][Responses API] Support MCP tools with streaming mode + background mode ( #23927 )
...
Signed-off-by: wuhang <wuhang6@huawei.com >
2025-09-04 04:05:10 +00:00
b5ee1e3261
Remove deprecated PyNcclConnector ( #24151 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-09-03 22:49:16 +00:00
36c260dad6
[Feature][gpt-oss] Add support for num_cached_tokens and num_reasoning_tokens tracking ( #23460 )
...
Signed-off-by: George Nagy II <george.nagy0969@gmail.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-09-03 21:08:47 +00:00
a43a3f1770
[Bugfix][DP] DP distribution does not require ray[default] ( #23822 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-09-03 13:21:36 -07:00
6adaed42f4
[Feature][P/D]: Optimize NIXL Connector xfer Launch ( #23887 )
...
Signed-off-by: ycyaw66 <497410282@qq.com >
Co-authored-by: ycyaw66 <497410282@qq.com >
2025-09-03 19:14:30 +00:00
a742322092
[Attention] Blackwell FP8 MLA support with CUTLASS_MLA backend ( #23289 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-03 14:05:24 -04:00
731a6940e3
Migrate whisper inputs to TensorSchema ( #23505 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-03 18:04:00 +00:00
e9b92dcd89
[Kernels] Overlap shared experts with send/recv ( #23273 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-09-03 12:35:18 -04:00
fa4311d85f
[V1] v1 engine + full CUDA graph support for PLaMo2 ( #23998 )
...
Signed-off-by: Hemmi Shinichi <shemmi@preferred.jp >
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
Co-authored-by: Hemmi Shinichi <shemmi@preferred.jp >
Co-authored-by: Thomas Parnell <tom.parnell@gmail.com >
2025-09-03 08:24:02 -07:00
6d80ae83e1
[Bugfix] Fixing division by zero in triton_attn if query_heads/kv_heads > 16 ( #23424 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
2025-09-03 15:01:09 +00:00
4ba0c587ba
FIX: Add libnuma-dev to Dockerfile for dev stage ( #20388 )
...
Signed-off-by: dongbo910220 <1275604947@qq.com >
2025-09-03 07:17:20 -07:00
6997a25ac6
[Model] Remove useless code from MiniMax implementation ( #23982 )
...
Signed-off-by: QscQ <qscqesze@gmail.com >
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-09-03 11:27:04 +00:00
28f350e147
Support add_generation_prompt in embeddings endpoint with chat request ( #23931 )
...
Signed-off-by: biba10 <jaksmid@seznam.cz >
2025-09-03 10:47:55 +00:00
51383bd472
[CI] Accelerate mteb test by setting SentenceTransformers mteb score to a constant ( #24088 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-09-03 17:23:56 +08:00
9c99e4871f
[Misc] Clean up deadcode for legacy processing pipeline ( #24153 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-03 08:34:29 +00:00
70549c1245
[CI/Build] Serve images used by multimodal tests through local HTTP Server ( #23907 )
...
Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com >
Signed-off-by: dsinghvi <divyanshsinghvi@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-09-03 16:13:11 +08:00
f0c503f66e
[Nixl] Heterogeneous TP support FlashInfer ( #20189 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-03 15:19:54 +08:00
f38035c123
[distributed][rl] remove nccl cumem env var override ( #24141 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-03 06:45:25 +00:00
426cc8629f
[BugFix] Fix routed_scaling_factor double mul for dots1 and glm4 MoE models ( #24132 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-09-03 04:57:59 +00:00
e81d4e69c1
[Misc] Add check for dual_chunk_attention ( #24070 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-03 04:19:14 +00:00
02d411fdb2
[Doc]: fix typos in Python comments ( #24115 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-02 21:14:07 -07:00
d7e1e59972
[Doc]: fix typos in Python comments ( #24093 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-02 21:05:45 -07:00
c4ed78b14f
[Compile] Fix Compile Warning for w4a8_mm_entry.cu ( #23660 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-09-02 20:45:52 -07:00
1bd007f234
fix some typos ( #24071 )
...
Signed-off-by: co63oc <co63oc@users.noreply.github.com >
2025-09-02 20:44:50 -07:00
136d853e65
[V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing ( #23656 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
2025-09-03 02:52:51 +00:00
e32a0e8678
Upgrade xgrammar to 0.1.23 ( #22988 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-09-03 02:32:59 +00:00
42dc59dbac
Update release pipeline post PyTorch 2.8.0 update ( #24073 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Huy Do <huydhn@gmail.com >
2025-09-03 10:09:19 +08:00
862f2ef893
[XPU] Fix the bug of LoRA logits on the XPU platform ( #24081 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-09-03 08:21:18 +08:00
2fd1a40a54
[CI/Build] Disable SiluMul NVFP4 quant fusion tests ( #24121 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-09-02 16:50:28 -07:00
930a24144c
[Bug] R1 Accuracy: Fix routed_scaling_factor Double Mul Issue ( #24119 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-02 22:22:30 +00:00
457e471971
[AMD][Kernel][Bugfix] Cast offsets tensor bn to tl.int64 to avoid GPU segfault ( #23692 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-09-02 22:13:57 +00:00
d328f7894f
[CI] Enable all hf transformers baselines in test_hybrid ( #23936 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-09-02 20:15:06 +00:00
98aee612aa
[Log] Only Print Profiler Results on Rank 0 ( #23370 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-09-02 18:53:34 +00:00
598bd74cf8
Fix weights loading for Apertus ( #24100 )
...
Signed-off-by: Nathan Ranchin <nranchin@student.ethz.ch >
2025-09-02 18:34:28 +00:00
2417798471
[Metrics] Deprecate TPOT in favor of ITL ( #24110 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-09-02 18:10:10 +00:00
9480ae24e3
[Bugfix] Fix packed_factor missing attribute error ( #23902 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2025-09-02 10:56:31 -07:00
f399182e8c
Run ruff format on a few files. ( #24075 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-09-02 17:55:32 +00:00
1c41310584
[Bugfix] Fix transform_config parsing in Compressed Tensors ( #23945 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-09-02 13:54:10 -04:00
c83c4ff815
[Benchmark] Add support for local hf dataset path in benchmark ( #23999 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-09-02 17:49:16 +00:00
0e1759cd54
[docs] add SYS_NICE cap & security-opt for docker/k8s ( #24017 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
Signed-off-by: Peter Pan <peter.pan@daocloud.io >
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-02 17:27:20 +00:00
e66ed3e675
[CI Failure] Skip failing nvfp4 silu test ( #23959 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-02 13:18:15 -04:00
e0653f6c0b
[Model] Classification models support logit_bias / sigmoid_normalize ( #24031 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-09-02 16:48:57 +00:00
38ba061f6f
[BugFix] Fix EXAONE4 rotary embeddings ( #23918 )
...
Signed-off-by: lkm2835 <lkm2835@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-02 14:40:55 +00:00
0a74e9d0f2
[Gemma3n] Fix audio batching ( #24052 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-09-02 22:23:35 +08:00
8bd5844989
correct LWS deployment yaml ( #23104 )
...
Signed-off-by: cberge908 <42270330+cberge908@users.noreply.github.com >
2025-09-02 12:04:59 +00:00
ce30dca5c4
[CI]: reduce HTTP calls inside entrypoints openai tests ( #23646 )
...
Signed-off-by: AzizCode92 <azizbenothman76@gmail.com >
Signed-off-by: Aziz <azizbenothman76@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-02 10:49:32 +00:00
2f0bab3f26
[Model] Support dp on ViT on GLM-4.5V ( #23168 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-09-02 10:48:18 +00:00
fad73be1a5
[Doc]: fix typos in Python comments ( #24077 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-02 02:38:55 -07:00
56d04089ef
Migrate Interns1 inputs to TensorSchema ( #23510 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-02 04:35:45 +00:00
7be0cb8e9e
[XPU][Feature] fp8 online quantization support for XPU ( #23148 )
...
Signed-off-by: Yan Ma <yan.ma@intel.com >
Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com >
2025-09-02 04:06:53 +00:00
1fa1d6a9a0
Migrate OvisImagePatchInputs to TensorSchema ( #22024 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-02 12:01:36 +08:00
d59c986444
Remove runtime checks based on pooling params ( #24051 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-09-02 11:54:37 +08:00
04d0c60770
[Bugfix] Fix the issue that Blip2ForConditionalGeneration' object has… ( #24028 )
...
Signed-off-by: Dazhi Jiang <dazhi_jiang@163.com >
2025-09-02 11:54:20 +08:00
2b41cbbf03
[V1][Mamba1] - FP32 SSM Kernel Support ( #23506 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
2025-09-01 20:53:00 -07:00
0235103cbb
[Doc]: fix typos in Python comments ( #24042 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-09-01 19:07:45 -07:00
a344a5aa0a
[bugfix] fix MTP hidden states ( #24056 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-09-01 21:09:37 +00:00
5685370271
[Chore][V0 Deprecation] Move LogProb to a separate file ( #24055 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 12:07:53 -07:00
a0e0efd6bd
[Model] Support DP for ViT on Kimi-VL-A3B-Thinking-2506 ( #23817 )
...
Signed-off-by: Junhong <liujunhong11@huawei.com >
Signed-off-by: LJH-LBJ <98734602+LJH-LBJ@users.noreply.github.com >
Co-authored-by: Junhong <liujunhong11@huawei.com >
Co-authored-by: LJH-LBJ <98734602+LJH-LBJ@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-09-01 16:56:56 +00:00
cf91a89dd2
[docs][misc] IOProcessor plugins fixes ( #24046 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2025-09-01 09:17:41 -07:00
39a22dcaac
[Misc] Minor code simplification for spec decode ( #24053 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 08:54:01 -07:00
41c80698b3
Document multi-proc method selection for profiling ( #23802 )
...
Signed-off-by: jdebache <jdebache@nvidia.com >
2025-09-01 06:28:26 -07:00
7c8271cd1e
[Model]: support KeyeVL-1_5-8B ( #23838 )
...
Signed-off-by: wangruitao <wangruitao@kuaishou.com >
Co-authored-by: wangruitao <wangruitao@kuaishou.com >
2025-09-01 03:50:27 -07:00
3e330fcb21
[Doc]: Fix CPU install docs: force torch-backend=cpu to avoid GPU torchvision errors ( #24033 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-09-01 03:34:52 -07:00
d46934b229
[Frontend] Gemma3n audio transcriptions/translations endpoint ( #23735 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-09-01 18:07:46 +08:00
107284959a
[Doc]: fix typos in Python comments ( #24026 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-09-01 09:38:20 +00:00
dc1a53186d
[Kernel] Update DeepGEMM to latest commit ( #23915 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-09-01 02:38:04 -07:00
55602bb2e6
[Frontend] Update the warning log when using VLLM_ALLOW_LONG_MAX_MODEL_LEN ( #20904 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-09-01 08:50:25 +00:00
d7fbc6ddac
[Misc] Enable V1 FP16 inference on pre-Ampere GPUs ( #24022 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-01 08:12:22 +00:00
5438967fbc
[Misc] add hash_function doc string ( #24014 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-31 23:11:20 -07:00
422e793fa6
[Bugfix] Add support for <tool_call> format in streaming mode for XLAM Tool Parser ( #22769 )
...
Signed-off-by: Devon Peroutky <devon@kindo.ai >
2025-09-01 14:07:54 +08:00
1cb39dbcdd
[Misc] IO Processor plugins for pooling models ( #22820 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-31 23:07:12 -07:00
437c3ce026
Migrate Phi4 inputs to TensorSchema ( #23471 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-09-01 14:05:59 +08:00
499b074bfd
[Misc] refactor code by import as for torch._inductor.config ( #23677 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-09-01 14:05:42 +08:00
ff0e59d83a
[CI/Build] Improve Tensor Schema tests speed by avoid engine core initialization ( #23357 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-31 22:52:20 -07:00
b55713683c
[Misc] Move fast prefill logic to separate method ( #24013 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 05:40:38 +00:00
acc1a6e10a
Fix the bug related to loading GPTQ INT3 weights. ( #23328 )
...
Signed-off-by: JunHowie <JunHowie@aliyun.com >
Co-authored-by: JunHowie <JunHowie@aliyun.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-01 05:39:57 +00:00
8c742a66d1
[Misc] Avoid redundant copy for encoder-only models ( #24012 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-09-01 04:02:43 +00:00
183a70967a
[BUGFIX] GPTQ quantization compatibility for Qwen3 MOE models (AutoGPTQ and AutoRound-GPTQ) ( #23994 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-09-01 03:33:40 +00:00
14b4326b94
v1: Support KV events from connectors ( #19737 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-09-01 01:13:21 +00:00
752d2e1c36
[Minor] Fix some random typos in comments ( #24009 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-31 16:42:17 -07:00
81eea3d348
vllm fix check on max vocab size ( #22471 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-08-31 20:57:05 +08:00
9701352e4b
[Doc]: fix typos in Python comments ( #24001 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-31 08:21:59 +00:00
749be00a98
[Core][Multimodal] Allow passing multi_modal_uuids as multimodal identifiers. ( #23394 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-30 18:01:22 -07:00
5b8077b8ac
Fix wrong truncate_prompt_tokens type hint ( #22761 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com >
Signed-off-by: Gabriel Marinho <104592062+gmarinho2@users.noreply.github.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-30 20:39:38 +00:00
038e9be4eb
[LoRA] Much faster startup when LoRA is enabled ( #23777 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-30 15:37:39 +00:00
68a349114f
[Misc] enhance type hint for rearrange return value ( #23519 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-30 06:43:33 -07:00
e80bca309e
[Refactor] refactor freezing_value/cuda_event initialize outside try finally ( #23758 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-30 06:42:25 -07:00
fb4983e112
[Misc] add reorder_batch AttentionMetadataBuilder ( #23798 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-30 06:41:45 -07:00
379ea2823a
Add LoRA support for DeepSeek models (V2, V3, R1-0528) ( #23971 )
...
Signed-off-by: sadeghja1070 <sadegh.ja1070@gmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-30 06:40:02 -07:00
3a6acad431
[Model] Enable encoder DP for MiniCPM-V ( #23948 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-30 06:31:26 -07:00
5490d633ce
[UT] fix unify_kv_cache_configs when kv cache config needs sort ( #23843 )
2025-08-30 11:22:14 +00:00
628d00cd7b
[Bugfix] Fix test_lora_resolvers.py ( #23984 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-30 11:16:11 +00:00
4071c76cf3
[V1] [Hybrid] Move MiniMaxLinearAttention into layers/mamba ( #23831 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-30 00:16:15 -07:00
f1bddbd852
[Core] Cleanup TPU model runner for MM ( #23894 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-30 00:14:58 -07:00
9748c5198b
[CI] Fix broken compile tests due to unsupported SiluMul+Nvfp4Quant fusion ( #23973 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-08-30 00:14:43 -07:00
ee52a32705
[CI] Move testing image from remote URL to S3 ( #23980 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-29 21:41:25 -07:00
8fb85b7bb6
Add routed_scaling_factor to MoE grouped topk ( #23123 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-29 21:36:48 -07:00
5b31cb1781
[Bugfix] Fix --config arg expansion called from api_server.py ( #23944 )
...
Signed-off-by: Jean-Francois Dube <dubejf+gh@gmail.com >
Co-authored-by: Jean-Francois Dube <dubejf+gh@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-29 21:36:39 -07:00
d660c98c1b
[CI] Fix unavailable image remote URL ( #23966 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-29 15:40:04 -07:00
5674a40366
[Misc] Make download_weights_from_hf more reliable ( #23863 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-29 12:37:24 -07:00
8c3e199998
Revert gemma3n fast prefill changes ( #23897 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-29 12:16:57 -07:00
1c26b42296
[Docs] [V1] [Hybrid] Add new documentation re: contributing mamba-based models ( #23824 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-29 18:47:58 +00:00
b7adf94c4a
Tuned H100/H200 triton fp8 block configs for fused_qkv_a_proj ( #23939 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-29 10:28:35 -07:00
4d7fe40fc0
[RL][BugFix] Fix missing tokenizer error for token-in-token-out ( #23904 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-30 01:09:55 +08:00
0dc9532065
[BUGFIX] fix undefined silu_and_mul_nvfp4_quant ( #23929 )
...
Signed-off-by: hongchao <hongchao@msh.team >
Signed-off-by: Richard Zou <zou3519@gmail.com >
Co-authored-by: hongchao <hongchao@msh.team >
Co-authored-by: Richard Zou <zou3519@gmail.com >
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com >
2025-08-29 09:36:39 -07:00
72a69132dc
[CI] Add aiter to matching list of issue auto labeller for rocm tag ( #23942 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-08-29 15:29:21 +00:00
d90d8eb674
[BugFix] Async scheduling and PP compatibility with DP ( #23770 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-29 08:17:27 -07:00
0a2f4c0793
[Models] Use in-place adds in Idefics2Vision ( #23932 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-08-29 07:42:57 -07:00
1cf3753b90
[MODEL] Apertus and XIELU ( #23068 )
...
Signed-off-by: EduardDurech <39579228+EduardDurech@users.noreply.github.com >
Co-authored-by: AllenHaoHuang <allenhuangdd@gmail.com >
2025-08-29 20:29:18 +08:00
4f7cde7272
Adds json_count_leaves utility function ( #23899 )
...
Signed-off-by: aditchawdhary <aditxy@hotmail.com >
2025-08-29 05:28:13 -07:00
67c14906aa
Update PyTorch to 2.8.0 ( #20358 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-29 18:57:35 +08:00
69f46359dd
[Multimodal] Consolidate mm inputs into MultiModalFeatureSpec ( #23779 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-08-29 18:36:57 +08:00
d9e00dbd1f
[Performance] V1 Classify Models E2E Performance Optimization ( #23541 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-29 03:12:32 -07:00
ad39106b16
[CPU] Enable data parallel for CPU backend ( #23903 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-29 02:19:58 -07:00
2554b27baa
[V0 Deprecation] Remove pooling model support in V0 ( #23434 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-29 00:04:02 -07:00
934bebf192
Better errors for Transformers backend missing features ( #23759 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-29 07:01:40 +00:00
885ca6d31d
[Misc] Fix warnings for mistral model ( #23552 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-08-29 06:58:48 +00:00
2d0afcc9dc
[mrope][Qwen2-VL] Fix edge case where getting index of image/video token can potentially throw in default vl mrope implementation. ( #23895 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-08-28 23:29:13 -07:00
b4f9e9631c
[CI/Build] Clean up LoRA test ( #23890 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-28 23:28:35 -07:00
05d839c19e
Fix(async): Add support for truncate_prompt_tokens in AsyncLLM ( #23800 )
2025-08-28 22:55:06 -07:00
6597d7a456
[Platform] import activation_quant_fusion for CUDA only ( #23882 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-08-28 22:54:16 -07:00
5264015d74
[BugFix][AMD][Deepseek] fix a dtype mismatch error for deepseek running on AMD ( #23864 )
...
Signed-off-by: Jinghui Zhang <jinghuizhang0804@gmail.com >
2025-08-28 22:54:12 -07:00
98ac0cb32d
[Bugfix] Use ReplicatedLinear for SequenceClassification head ( #23836 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-29 04:41:20 +00:00
c8b3b299c9
[tests] Improve speed and reliability of test_transcription_api_correctness ( #23854 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-29 04:25:33 +00:00
006477e60b
[ROCm][Fix] Fix rocm build caused by #23791 ( #23847 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-08-28 19:52:27 -07:00
de533ab2a1
[Models] Improve iteration over layers ( #19497 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-08-29 09:26:34 +08:00
235c9db8a7
[XPU] support data parallel for MoE models on XPU ( #22887 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-08-29 09:23:04 +08:00
b668055a11
[V0 Deprecation] Remove V0 Samplers test ( #23862 )
2025-08-28 18:05:52 -07:00
d3d2aad5a2
[Log] Use Debug Once for DeepGEMM E8M0 When not Enabled ( #23858 )
2025-08-28 22:18:10 +00:00
cb293f6a79
[V1] Enable prefill optimization for Gemma3n ( #22628 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-28 14:54:30 -07:00
7ffbf27239
[BugFix][FlashInfer] Fix potential race condition for paged_kv_indptr_cpu ( #23737 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-28 14:22:46 -07:00
27e88cee74
chore: build release image by default ( #23852 )
...
Signed-off-by: Codex <codex@openai.com >
2025-08-28 13:17:15 -07:00
16a45b3a28
[NVIDIA] Support SiluMul + NVFP4 quant fusion ( #23671 )
...
Signed-off-by: jindih <jindih@nvidia.com >
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: jindih <jindih@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedic <lgovedic@redhat.com >
2025-08-28 19:36:50 +00:00
57d4ede520
[bugfix] [spec-decoding] fix data race in sample_recovered_tokens_kernel (vLLM v1) ( #23829 )
...
Signed-off-by: He-Jingkai <he-jingkai@outlook.com >
2025-08-28 19:05:20 +00:00
04d1dd7f4a
[ROCm][Aiter] Add triton fp8 bmm kernel for mla ( #23264 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com >
2025-08-28 18:18:08 +00:00
f32a5bc505
Migrate Llama4ImagePatchInputs to TensorSchema ( #22021 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-28 17:29:37 +00:00
8805ad9fa9
Add scale_config.yml file for Meta autoscalers for GH Actions ( #23840 )
...
Signed-off-by: Jean Schmidt <contato@jschmidt.me >
2025-08-28 09:31:20 -07:00
0583578f42
[ci] breaks down V1 Test into 3 groups of approx 30 minutes runtime ( #23757 )
...
Signed-off-by: Jean Schmidt <contato@jschmidt.me >
2025-08-28 08:59:19 -07:00
db74d60490
[Bugfix] Add fake mode around passes ( #23349 )
...
Signed-off-by: angelayi <yiangela7@gmail.com >
2025-08-28 11:25:56 -04:00
95089607fa
[Model][gpt-oss] Support DP+EP for GPT-OSS with FlashInfer trtllm-gen MoE ( #23819 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
2025-08-28 06:56:20 -07:00
1f096f9b95
[CI] Fix linting error on main ( #23835 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-28 06:52:01 -07:00
66548f6603
[Bugfix] Fix benchmark_moe.py for blockwise fp8. ( #23823 )
...
Signed-off-by: crischeng <420985011@qq.com >
Co-authored-by: cris <grace@guisenbindeMacBook-Pro.local >
2025-08-28 21:44:09 +08:00
d3da2eea54
[Doc]: fix typos in Python scripts ( #23828 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-28 05:37:38 -07:00
bfab219648
[Model] [gpt-oss] fix gpt-oss pp support ( #23815 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-28 05:36:55 -07:00
a3432f18fd
[BugFix][Spec Decode] Use float64 for uniform_probs ( #23803 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-28 12:26:45 +00:00
67cee40da0
[CI/Build][Bugfix] Fix Qwen VL tests on CPU ( #23818 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-28 11:57:05 +00:00
d99c3a4f7b
[Doc]: fix typos in .md files (including those of #23751 ) ( #23825 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-28 04:38:19 -07:00
3462c1c522
[FIXBUG] Add return_success parameter to moe_wna16_weight_loader function ( #22797 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-28 09:03:22 +00:00
c5d004aaaf
[Model] Add PP support and VLM backbone compatibility for GPT-OSS ( #23680 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-28 16:03:28 +08:00
11a7fafaa8
[New Model]: Support GteNewModelForSequenceClassification ( #23524 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-28 15:36:42 +08:00
186aced5ff
[Kernel] cuda kernels for upcoming decode context parallel feature ( #23791 )
...
Co-authored-by: hongchao <hongchao@msh.team >
2025-08-28 15:29:11 +08:00
daa1273b14
[Bugfix] when set offline model running error ( #23711 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-28 07:27:45 +00:00
c07a73317d
[CI] enable idefics3 and fuyu-8b test in multimodal test ( #23790 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-28 14:51:24 +08:00
22feac8e95
[Transform] [Quantization] Add transforms to compressed tensors ( #22486 )
2025-08-28 02:43:48 -04:00
c8851a4723
Add deprecation warning for lora_extra_vocab_size ( #23635 )
...
Signed-off-by: Jinheng Li <ahengljh@gmail.com >
2025-08-27 22:34:29 -07:00
f48a9af892
[CI] make all multi-gpu weight loading tests run nightly ( #23792 )
...
Signed-off-by: Alex Yun <alexyun04@gmail.com >
2025-08-27 21:27:36 -07:00
a11adafdca
Gracefully handle edge cases in harmony utils ( #23155 )
...
Signed-off-by: Jan Kessler <jakessle@uni-mainz.de >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-27 20:14:00 -07:00
a781e84ec2
[Perf] Tune configs for triton block fp8 gemm H100/H200 ( #23748 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-28 11:12:53 +08:00
1b7b161a09
[Feature] models: pass layer prefix to replace_linear_class for per-layer quantization routing. Addresses #23239 ( #23556 )
...
Signed-off-by: Shrey Gupta <shreyg1303@gmail.com >
2025-08-27 20:12:44 -07:00
a69693e38f
Migrate Qwen inputs to TensorSchema ( #23473 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-28 10:43:26 +08:00
5da4f5d857
[Bugfix] Fix for V1 priority scheduling crashes at preemption ( #23713 )
...
Signed-off-by: Hanchenli <lihanc2002@gmail.com >
2025-08-28 00:44:52 +00:00
321938e9ac
[Feature] Add VLLM_DISABLE_PAD_FOR_CUDAGRAPH to Avoid Hang Issue ( #23595 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-27 21:52:24 +00:00
f9ca2b40a0
[Bugfix] Fix Marlin NVFP4 for modelopt ( #23659 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-27 17:48:16 -04:00
082cc07ef8
DP/EP Support for gpt-oss with deepep-ht comm kernel on SM100 ( #23608 )
2025-08-27 17:33:21 -04:00
853c371fc3
[V1][Mamba] - Enable V1 by default for Mamba Models ( #23650 )
...
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
2025-08-27 20:53:30 +00:00
8bf6266a17
[Multimodal] Generate mm_hash based on request metadata when caching is turned off ( #23690 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
2025-08-27 20:24:31 +00:00
0585a9e73c
Disable torch.compile for dynamic rope models in Transformers backend ( #23738 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-27 19:03:05 +00:00
3c0ef769ba
ci: Add arm64 docker build to release pipeline ( #23210 )
...
Signed-off-by: Eli Uriegas <eliuriegas@meta.com >
Signed-off-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com >
2025-08-27 10:41:48 -07:00
4e4d017b6f
[Docs] Fix warnings in mkdocs build (continued) ( #23743 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
Signed-off-by: Hyogeun Oh (오효근) <ohg3417@gmail.com >
2025-08-27 17:17:29 +00:00
dd58932280
[V1] [Hybrid] Enable compile and piecewise CUDA graph for MiniMax-Text models ( #22589 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-27 10:05:16 -07:00
52883ed084
[Model] Merge SupportsMultiModalWithRawInput with SupportsMultiModal ( #23749 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-27 10:01:50 -07:00
4f35be10a9
[BugFix] Fix topk_softmax assert ( #19764 )
...
Signed-off-by: Luka Govedic <lgovedic@redhat.com >
2025-08-27 09:47:28 -07:00
2b61d2e22f
[Docs] Remove in-tree Gaudi install instructions ( #23628 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-27 09:22:21 -07:00
3ce8285d6d
[LogitsProcs] Deduplicate built-in LP implementation logic ( #23362 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-27 23:11:33 +08:00
83f555f637
[Doc]: upgrade version of crate-ci tool for improved typo detection ( #23755 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-27 07:59:34 -07:00
841490434a
[Model] Enable native HF format InternVL support ( #23742 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-27 14:45:17 +00:00
3af47c3cc6
[Feature] Add Hopper DeepGEMM E8M0 for DeepSeekV3.1 scale_fmt ( #23666 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-08-27 14:09:08 +00:00
513c1fe255
Only run get_attr_docs if generating help text ( #23723 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-27 13:55:12 +00:00
fe8d7b6f03
[Model] Interface to enable batch-level DP support ( #23733 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-27 06:41:22 -07:00
16dc4052b0
Fix pre-commit on main ( #23747 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-27 06:39:48 -07:00
8dd2baa597
Add vLLM Korea Meetup in the README.md and meetups.md ( #23746 )
...
Signed-off-by: rebel-hongseok <hongseok@rebellions.ai >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-27 06:25:49 -07:00
5eeef1b908
[Model] Explicit default_pooling_type interface ( #23736 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-27 13:24:09 +00:00
704432af3c
[V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-based models ( #23716 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-27 12:51:54 +00:00
a403d0fa41
[Misc] Remove unnecessary _send_reconfig_message() in core_client.py ( #23127 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-27 05:50:47 -07:00
8c13820f0b
[Bugfix] Fix task field initialization when PYTHONOPTIMIZE is enabled ( #23718 )
...
Signed-off-by: cndoit18 <cndoit18@outlook.com >
2025-08-27 12:42:20 +00:00
9d30de4469
[model] Support MiniCPM-V 4.5 ( #23586 )
...
Signed-off-by: tc-mb <caitianchi@modelbest.cn >
Signed-off-by: Xin Yang <xyangx@amazon.com >
Signed-off-by: Abatom <abzhonghua@gmail.com >
Signed-off-by: chzhang <chaojun.zhang@intel.com >
Signed-off-by: Pate Motter <patemotter@google.com >
Signed-off-by: Terrencezzj <terrence@cohere.ai >
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Signed-off-by: siyuanf <siyuanf@nvidia.com >
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com >
Signed-off-by: Zijing Liu <liuzijing2014@users.noreply.github.com >
Signed-off-by: jiabin.00 <jiabin.00@bytedance.com >
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: tc-mb <157115220+tc-mb@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: Huy Do <huydhn@gmail.com >
Signed-off-by: Matúš Námešný <matus.namesny@ameria.com >
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: oye93 <en.ouyang93@outlook.com >
Signed-off-by: Julien Lin <jullin@nvidia.com >
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Signed-off-by: Tianyu Li <tianyu.li@arm.com >
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Zerohertz <ohg3417@gmail.com >
Signed-off-by: Hyogeun Oh (오효근) <ohg3417@gmail.com >
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
Signed-off-by: Federico <65908512+coval3nte@users.noreply.github.com >
Signed-off-by: Zixuan Zhang <zixuanzhang@bytedance.com >
Signed-off-by: wuhang <wuhang6@huawei.com >
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
Signed-off-by: Wei Wei <wwei6@meta.com >
Signed-off-by: Yiheng Xu <charlesyihengxu@gmail.com >
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Zhonghua Deng <abzhonghua@gmail.com >
Co-authored-by: Chaojun Zhang <chaojun.zhang@intel.com >
Co-authored-by: Pate Motter <p@temotter.com >
Co-authored-by: Terrence Zhao <32208165+Terrencezzj@users.noreply.github.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: weiliang <weiliangl@nvidia.com >
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Zijing Liu <liuzijing2014@users.noreply.github.com >
Co-authored-by: Bin Jia <45593998+FoolPlayer@users.noreply.github.com >
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Raghavan <oneraghavan@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Co-authored-by: Huy Do <huydhn@gmail.com >
Co-authored-by: Matúš Námešný <matus@namesny.com >
Co-authored-by: Guillaume Calmettes <gcalmettes@scaleway.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: En Ouyang <en.ouyang93@outlook.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
Co-authored-by: nvjullin <jullin@nvidia.com >
Co-authored-by: Didier Durand <2927957+didier-durand@users.noreply.github.com >
Co-authored-by: TianyuLi0 <116711075+TianyuLi0@users.noreply.github.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Yuekai Zhang <zhangyuekai@foxmail.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: Hyogeun Oh (오효근) <ohg3417@gmail.com >
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
Co-authored-by: Federico <65908512+coval3nte@users.noreply.github.com >
Co-authored-by: zixuanzhang226 <zixuanzhang@bytedance.com >
Co-authored-by: wuhang <wuhang6@huawei.com >
Co-authored-by: yzds <41983536+youzhedian@users.noreply.github.com >
Co-authored-by: hongchao <hongchao@msh.team >
Co-authored-by: czhu-cohere <conway.zhu@cohere.com >
Co-authored-by: Wei <weiweinpu@gmail.com >
Co-authored-by: Yiheng Xu <charlesyihengxu@gmail.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Chenheli Hua <huachenheli@outlook.com >
Co-authored-by: CSWYF3634076 <58356743+CSWYF3634076@users.noreply.github.com >
2025-08-27 05:38:00 -07:00
1f7a9c95e4
[Docs] Fix a 1-2-3 list and style issues in tpu.md ( #23729 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-08-27 05:37:52 -07:00
8f0d7eaea8
[XPU] Fix OOM issue for data parallel with Ray backend ( #22500 )
...
Signed-off-by: Fanli Lin <fanli.lin@intel.com >
Signed-off-by: Fanli Lin <fanli0116@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-27 19:57:38 +08:00
e03940762b
[CI/Build] Reduce LoRA layer test cases ( #23721 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-27 10:59:35 +00:00
11eddf02f0
[FlashInfer] Cache hyper params in metadata builder ( #23732 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-27 03:45:04 -07:00
04ff1e43fb
[Misc] Move CpuGpuBuffer to vllm/v1/utils.py ( #23728 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-27 03:25:00 -07:00
6578e87365
Optimize input preparation for FlashInfer [2/N] ( #23174 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-27 02:52:45 -07:00
5bd9f84158
[Docs] Fix an admonition important ( #23726 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-08-27 02:50:09 -07:00
91e382c935
[CI/Build] Remove redundant register in model init tests ( #23715 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-27 08:11:15 +00:00
6446677839
[XPU]fix cuda event used in XPU model runner ( #23708 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-27 07:27:14 +00:00
69244e67e6
[Core] Use key-only cache for BaseMultiModalProcessor ( #23018 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-27 14:19:13 +08:00
8dbf6ed7be
[Bugfix] fix when config.yaml config value is list parse error ( #23528 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-27 05:54:39 +00:00
9de25c294b
[CI/Build] Remove redundant LoRA model tests ( #23706 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-27 05:51:50 +00:00
fce10dbed5
[XPU] Add xpu torch.compile support ( #22609 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-27 05:33:27 +00:00
d272415e57
[Quantization] Expand compressed-tensors MoE matching logic to support NFP4 + FP8 MoEs ( #22674 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2025-08-27 05:00:21 +00:00
142ac08030
[Frontend] Optimize beam search performance by limiting concurrency ( #23599 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-27 04:59:14 +00:00
3210264421
[Frontend] Add --log-error-stack to print stack trace for error response ( #22960 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-27 04:58:59 +00:00
644d57d531
[Model] Add Ernie4.5 VL Model Support ( #22514 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2025-08-26 21:02:55 -07:00
c905684cfe
[Core] Asynchronous h2d in merge_multimodal_embeddings via pinned memory. ( #23686 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-08-26 20:05:34 -07:00
786835807b
[Bugfix]: Qwen3 Coder Tool Parser ( #23099 )
...
Signed-off-by: Yiheng Xu <charlesyihengxu@gmail.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
2025-08-26 19:58:32 -07:00
fecbb7c782
[Bugfix][gpt-oss] passing the cache config in gpt-oss ( #23613 )
...
Signed-off-by: Wei Wei <wwei6@meta.com >
2025-08-27 02:54:23 +00:00
6dab89b8ec
[Docs] Fix math rendering in docs ( #23676 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 18:47:08 -07:00
de02b07db4
[Bugfix] Lazy import gpt_oss_triton_kernels_moe for mxfp4 ( #23678 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-27 09:34:57 +08:00
eb1995167e
[gpt-oss] Enable unit test for response API harmony integration ( #23533 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-26 18:23:26 -07:00
2c2b140ae8
[quantization] use channel scales for w4a8 + misc fixes ( #23570 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-08-26 18:23:23 -07:00
c7c80af084
fix pynccl reduce_scatter ( #23648 )
...
Co-authored-by: hongchao <hongchao@msh.team >
2025-08-26 18:21:11 -07:00
6891205b16
[Feature][Responses API] Support MCP tool in background mode ( #23494 )
...
Signed-off-by: wuhang <wuhang6@huawei.com >
2025-08-27 01:06:58 +00:00
b1625dbe9c
feat: add triton fused moe config for GLM-4.5-Air-FP8 on B200 ( #23695 )
...
Signed-off-by: Zixuan Zhang <zixuanzhang@bytedance.com >
2025-08-26 18:06:10 -07:00
585e0bde36
[Bugfix] UnboundLocalError when GptOss reasoning specified ( #23054 )
...
Signed-off-by: Federico <65908512+coval3nte@users.noreply.github.com >
2025-08-27 00:29:52 +00:00
714872f1a9
[Compile] Fix Cmake Warning ( #23689 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-26 23:48:32 +00:00
5f1af97f86
[V1] [Hybrid] Enable Full CUDA graph by default for hybrid models in V1 ( #22594 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-26 23:28:55 +00:00
c3b0fd1ee6
[V1][P/D]P2pNcclConnector supports flashinfer ( #23536 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-26 22:56:16 +00:00
6421b66bf4
[Docs] Move quant supported hardware table to README ( #23663 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 22:26:46 +00:00
2f13319f47
Enhance the pre-notification policy ( #23532 )
...
Signed-off-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
2025-08-26 20:41:36 +00:00
d696f86e7b
[doc] Hybrid KV Cache Manager design doc ( #22688 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 20:19:05 +00:00
9816b81f5f
[Model] Enable video support for InternVL3.5 models ( #23658 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-26 19:46:52 +00:00
c37c0af990
[Misc] Fix comments in tests/kernels/quantization ( #23675 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-26 19:31:20 +00:00
9715f7bb0f
[Bugfix] Fix incorrect original shape in hashing ( #23672 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-08-26 19:01:25 +00:00
98aa16ff41
[v1] Add cross-attention KV cache support for encoder-decoder models ( #23664 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-26 18:49:06 +00:00
227e231b55
[Docs] [V1] [Hybrid] Update docs to remove FlashInfer constraint for hybrid models ( #23665 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-26 18:33:16 +00:00
730d0ac8b9
[Docs] Fix warnings in mkdocs build ( #23649 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
Signed-off-by: Hyogeun Oh (오효근) <ohg3417@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 18:19:23 +00:00
9b0187003e
[Bugfix] Fix cuda event usage with CPU model runner ( #23643 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-26 17:10:42 +00:00
44ac25eae2
[CI] [Doc]: Add GH Action for auto labeling issues with rocm tag ( #20988 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-26 16:20:13 +00:00
7ea22e42d5
[Misc] Add override for allreduce fusion thresholds ( #23639 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
2025-08-26 15:53:04 +00:00
9d4183dd2e
[model] support qwen2audio embedding input ( #23625 )
...
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-26 23:48:08 +08:00
513298f1b4
[Bugfix] fix bf16 multimodal model hash ( #23623 )
...
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-26 23:47:50 +08:00
379f828fba
[Docs] Reduce requirements for docs build ( #23651 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 15:43:28 +00:00
1fdc732419
[ROCm] Starting to add AMD code reviewers for ROCm components ( #23496 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2025-08-26 07:32:37 -07:00
f58675bfb3
[CPU] add cpu fused moe pytorch native implementation ( #23146 )
...
Signed-off-by: Tianyu Li <tianyu.li@arm.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-08-26 14:09:17 +00:00
7c04779afa
[Doc]: fix various spelling issues in multiple files ( #23636 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-26 14:05:29 +00:00
f66673a39d
[Kernel] Added flashinfer fp8 per-tensor gemms ( #22895 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-26 06:54:04 -07:00
b78bed1bc5
[Hardware][Mac] Fix the installation fail for Apple Silicon (CPU) ( #23565 )
...
Signed-off-by: oye93 <en.ouyang93@outlook.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-08-26 13:04:25 +00:00
164b2273c8
[Docs] Fix broken links to docs/api/summary.md ( #23637 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 13:00:18 +00:00
2b4fc9bd9b
Support FlashAttention Backend for Hybrid SSM Models ( #23299 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-26 12:41:52 +00:00
ebd5a77bb5
feat: add usage to TranscriptionResponse (text and json response_format) ( #23576 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-08-26 05:26:26 -07:00
384dd1b0a8
[Bugfix] Add missing enable_log_outputs parameter to init_app_state function ( #23634 )
...
Signed-off-by: Matúš Námešný <matus.namesny@ameria.com >
2025-08-26 12:13:15 +00:00
fdeb3dac13
[Model] fix DeepSeek e_score_correction_bias dtype to fp32 ( #23640 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-26 20:09:47 +08:00
d52358c1e0
[Perf] Remove duplicated NVFP4 blockscales to save memory ( #23379 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-26 19:16:33 +08:00
6ace2f72b0
Fix writing benchmark results with tuple keys ( #23633 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-08-26 19:16:09 +08:00
b00e69f8ca
Fix nits from #20059 ( #23548 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 03:27:20 -07:00
50fede6634
[V1] Enable V1 for compute capability < 8.0 + FP32 ( #23614 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-26 03:00:18 -07:00
b5d34af328
[Bugfix] Fix scheduling when repeated images in one request ( #23544 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com >
2025-08-26 09:46:28 +00:00
9b5f64238f
[Bugfix] Fix Qwen25VL packed_modules_mapping ( #23604 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-26 01:09:14 -07:00
ff77764f86
Fix CLI parameter documentation inconsistency in pooling_models.md ( #23630 )
2025-08-26 01:05:37 -07:00
bfc1edc9f5
[Docs] Fix titles for multi-file examples that are rendered in the docs ( #23573 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 00:16:44 -07:00
3ecbb14b81
[Benchmarks] add benchmark for embedding models ( #23000 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-25 23:57:08 -07:00
7d67a9d9f9
[mypy] Fix incorrect type hint for EAGLE3 support ( #23617 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 23:50:17 -07:00
959783fb99
[fix] fix seed-oss-parser ( #23560 )
...
Signed-off-by: jiabin.00 <jiabin.00@bytedance.com >
2025-08-25 23:16:36 -07:00
ce0e9dbd43
[CI/Build] Fix typo in #23561 ( #23616 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 23:13:03 -07:00
b395b3b0a3
[Disagg][Perf] Use CUDA event sync instead of blocking tolist to avoid unintentional copy ops blocking across different CUDA streams, improving disagg TTIT/TTFT ( #22760 )
...
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com >
Signed-off-by: Zijing Liu <liuzijing2014@users.noreply.github.com >
2025-08-25 21:06:00 -07:00
6fad29b11b
Remove graph_pool as member of VllmBackend and argument to CUDAGraphWrapper ( #23385 )
...
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-08-25 19:34:15 -07:00
6fd45e7b8a
[CI/Build] Use vLLM client's user agent to fetch images ( #23561 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 19:34:12 -07:00
56dcf4e7e9
[Bug] Fix DeepGEMM Env Control ( #23591 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-25 18:41:21 -07:00
ae067888d6
Update Flashinfer to 0.2.14.post1 ( #23537 )
...
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Signed-off-by: siyuanf <siyuanf@nvidia.com >
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-25 18:30:44 -07:00
906e461ed6
[CI Fix] Pin deepep and pplx tags in tools/ep_kernels/, gate multigpu tests ( #23568 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-25 18:29:00 -07:00
2a97ffc33d
[Misc] Add release note draft to PR template ( #23598 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-08-25 16:44:51 -07:00
efc88cf64a
[Misc] Simplify FlashInfer attention metadata ( #23585 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-08-25 15:42:29 -07:00
7b6a837275
[Docs] Update Documentation of Cohere Command-A Models ( #23584 )
...
Signed-off-by: Terrencezzj <terrence@cohere.ai >
Signed-off-by: Abatom <abzhonghua@gmail.com >
Co-authored-by: Zhonghua Deng <abzhonghua@gmail.com >
2025-08-25 21:53:52 +00:00
c34c82b7fe
[TPU][Bugfix] Fixes prompt_token_ids error in tpu tests. ( #23574 )
...
Signed-off-by: Pate Motter <patemotter@google.com >
2025-08-25 14:29:16 -07:00
8a044754bd
[XPU] Delay BF16 check to worker init for spawn compatibility ( #22979 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-08-25 13:09:26 -07:00
9188ae7cb5
[Bugfix][V1][P/D]Fix the issue where repeated requests for the same input produce abnormal outputs for P2pNcclConnector ( #23403 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-08-25 12:57:08 -07:00
8a3cd90af5
[Kernel] Add fused grouped_topk kernel for MoE ( #23274 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-08-25 11:47:52 -07:00
2a167b2eeb
[test][RL] Add sleep level 2 test and fix reload with sleep mode ( #23521 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-26 00:25:52 +08:00
0ff902f3b4
[Refactor] Refactor persistent buffers with CpuGpuBuffer ( #23515 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-25 08:44:48 -07:00
a9082a4d14
[Bugfix] Fix Qwen3 MoE GPTQ inference ( #23490 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-25 06:40:20 -07:00
e0329ed4b4
Updates to Flex + vLLM integration ( #21416 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-08-25 09:32:42 -04:00
6879cd80ae
[Refactor] Pass tokenizer explicitly instead of binding to prompt update ( #23542 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 06:31:57 -07:00
e269be2ba2
[Doc] Add caution for API server scale-out ( #23550 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 06:14:15 -07:00
5c4b6e66fe
[Attention] Unify mamba and attention backend selection ( #23171 )
...
Signed-off-by: Ayush Satyam <ayushsatyam146@gmail.com >
2025-08-25 09:09:36 +00:00
d0a4a3f645
[misc] add shanghai meetup ( #23535 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-08-25 17:00:03 +08:00
ebafb0936d
[Bugfix] Allow dynamic number of patches for llava_onevision ( #23525 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 08:34:54 +00:00
0cb7b065c3
Feature/benchmark/random mm data/images ( #23119 )
...
Signed-off-by: breno.skuk <breno.skuk@hcompany.ai >
2025-08-25 01:28:35 -07:00
2da02dd0d8
[Fix] DeepSeek V3.1 tool parser error message ( #23492 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-08-25 00:56:39 -07:00
d765cf01fe
[Core][Multimodal] Track encode cache entries by mm_hash and enable embedding sharing between requests ( #22711 )
...
Signed-off-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.io >
2025-08-25 00:41:17 -07:00
712d0f88d8
[Refactor] Dynamic target and content for prompt updates ( #23411 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-24 23:39:58 -07:00
49ab23b3cc
[gpt-oss] use reasoning channel for reasoning text in serving_chat ( #22920 )
...
Signed-off-by: Yu Guo <yuguo@meta.com >
2025-08-25 06:29:34 +00:00
c9abb10489
[Bugfix] Fix Dense module loading for sentence-transformers embedding models (simplified V2) ( #23408 )
...
Signed-off-by: FFFfff1FFFfff <yifanli0919@gmail.com >
2025-08-25 05:39:24 +00:00
787cdb3829
Migrate DonutImagePixelInputs to TensorSchema ( #23509 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-25 05:02:15 +00:00
a5203d04df
Migrate skyworkr1v inputs to TensorSchema ( #23499 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-25 04:43:21 +00:00
99f8094400
Migrate tarsier inputs to TensorSchema ( #23500 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-25 04:42:36 +00:00
170e8ea9ea
[Misc] Unified linear print info ( #23516 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-24 20:13:51 -07:00
a71e4765cc
[Bugfix] Fix Qwen2.5-VL quantized model weights loading ( #23512 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2025-08-25 10:40:22 +08:00
39971db3aa
Frontend: Adding LM Format Enforcer support to V1 engine ( #22564 )
...
Signed-off-by: Noam Gat <noamgat@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-24 19:31:22 -07:00
504d914314
[Perf] Add Triton config for DeepSeek V3 FP8 EP32 H200 ( #23504 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-08-24 18:06:35 -07:00
47455c424f
[Doc: ]fix various typos in multiple files ( #23487 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-25 00:04:04 +00:00
c7fc6b1354
fix incompatibility with non cuda platform for nvfp4 ( #23478 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-08-24 15:35:41 -07:00
ad78868450
[Misc] Remove unused slot_mapping buffer ( #23502 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-24 14:03:36 -07:00
e2db1164a1
[Model] Enable BLOOM on V1 ( #23488 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-24 13:30:47 +00:00
416f05929a
[New Model]Donut model ( #23229 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-08-24 12:52:24 +00:00
5e021b4981
(Misc): add missing test for zero truncation size. ( #23457 )
...
Signed-off-by: teekenl <teekenlau@gmail.com >
2025-08-24 18:12:47 +08:00
1b9b16649c
[Misc] update dict parse to EPLBConfig from json dumps to dict unpacking ( #23305 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-24 08:06:34 +00:00
e76e233540
[kernel] Support W4A8 on Hopper ( #23198 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-08-24 06:18:04 +00:00
a75277285b
Migrate Paligemma inputs to TensorSchema ( #23470 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-24 04:56:56 +00:00
9dc30b7068
[Bugfix] Add strong reference to CUDA pluggable allocator callbacks ( #23477 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Eric Marcus <eric.marcus@kaiko.ai >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-08-24 12:56:17 +08:00
053278a5dc
Migrate Pixtral inputs to TensorSchema ( #23472 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-24 04:55:53 +00:00
c55c028998
[gpt-oss] Streaming Output for Python Tool ( #23409 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-24 04:42:38 +00:00
65197a5fb3
[Misc] Modify CacheConfig import ( #23459 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-23 06:05:27 +00:00
b8f17f5d98
Support DeepSeek-V3.1 tool call ( #23454 )
...
Signed-off-by: Xu Wenqing <xuwq1993@qq.com >
2025-08-23 05:50:16 +00:00
d9a55204ba
fix(tests): Correct unreachable assertion in truncation test ( #23425 )
...
Signed-off-by: AzizCode92 <azizbenothman76@gmail.com >
2025-08-23 05:23:54 +00:00
b4e9fd811f
Revert "[PERF] Use faster way of decode in tokenizer: avoid useless list-to-list conversion ( #20000 )" ( #23396 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-23 04:16:48 +00:00
308fa287a8
Add glm4.5v tp2,4 fp8 config on H100_80GB ( #23443 )
...
Co-authored-by: Chenxi Yang <cxyang@meta.com >
2025-08-23 02:54:19 +00:00
fa78de9dc3
Quantization: support FP4 quantized models on AMD CDNA2/CDNA3 GPUs ( #22527 )
...
Signed-off-by: feng <fengli1702@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-22 20:53:21 -06:00
f6818a92cb
[UX] Move Dockerfile DeepGEMM install to tools/install_deepgemm.sh ( #23360 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-22 20:52:50 -06:00
23c939fd30
[Model] Support DP for ViT on MiniCPM-V-4 ( #23327 )
...
Signed-off-by: ycyaw66 <497410282@qq.com >
Co-authored-by: ycyaw66 <497410282@qq.com >
2025-08-23 02:14:41 +00:00
add1adfec7
[BugFix] Fix MinPLogitsProcessor.update_states() ( #23401 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-23 08:22:11 +08:00
c80c53a30f
[BugFix] Fix batch updates for pooling models ( #23398 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-23 08:20:41 +08:00
24d0c9e6ed
[NVIDIA][torch.compile] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out Attention Kernel ( #22703 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-08-22 22:09:05 +00:00
cc7ae5e7ca
[BugFix][AMD][Quantization] Fix torch.compile issue where wvSplitKQ not being called when it should when using quantized FP8 model ( #22281 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-08-22 21:47:57 +00:00
0313cf854d
[PERF] PyTorch Symmetric Memory All-Reduce ( #20759 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-22 15:39:08 -06:00
0483fabc74
[CI/Build] add EP dependencies to docker ( #21976 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-22 13:34:40 -07:00
da65bec309
add an env var for path to pre-downloaded flashinfer cubin files ( #22675 )
2025-08-22 19:25:45 +00:00
4645024d3a
[Quantization] Allow GGUF quantization to skip unquantized layer ( #23188 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-22 13:04:22 -06:00
cd7a3df26f
[Bugfix] Fix broken Florence-2 model ( #23426 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-08-22 17:50:52 +00:00
32d2b4064f
[Model] Add Ovis2.5 PP support ( #23405 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-22 17:46:34 +00:00
22cf679aad
[Doc]: fix various typos in multiple files ( #23179 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-22 10:38:46 -07:00
b6d7d34fc6
Add unit tests for batched guided and non-guided requests ( #23389 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-22 10:31:24 -07:00
341923b982
fix(tests): Ensure reliable CUDA cache clearing in MoE test ( #23416 )
...
Signed-off-by: AzizCode92 <azizbenothman76@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-22 17:20:59 +00:00
424fb7a5d2
[BugFix] Fix the issue where image embeddings were incorrectly split.… ( #23366 )
...
Signed-off-by: bppps <bpppsaka@gmail.com >
Co-authored-by: zouyu.zzx <zouyu.zzx@alibaba-inc.com >
Co-authored-by: bppps <bpppsaka@gmail.com >
2025-08-22 16:56:46 +00:00
88491c1b6b
[Speculators][Speculative Decoding] Fix Qwen 2 Eagle3 Support ( #23337 )
2025-08-22 16:39:19 +00:00
613a23b57f
[Bugfix]: Installing dev environment due to pydantic incompatible version ( #23353 )
...
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com >
2025-08-22 16:22:29 +00:00
51a215300b
[Fix] Bump triton version in rocm-build requirements ( #21630 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
2025-08-22 15:13:39 +00:00
ebe14621e3
[Bug fix] Dynamically setting the backend variable for genai_perf_tests in the run-nightly-benchmark script ( #23375 )
...
Signed-off-by: Naman Lalit <nl2688@nyu.edu >
2025-08-22 15:12:28 +00:00
325aa3dee9
[Misc] local import code clean ( #23420 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-22 14:01:35 +00:00
a073be6d87
[Doc] Update the doc for log probs + prefix caching ( #23399 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-22 13:20:39 +00:00
695e7adcd2
[misc] Remove outdated comment about runai_model_streamer ( #23421 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
2025-08-22 13:08:53 +00:00
281710ef9a
[Attention] Allow V1 flash_attn to support cross-attention ( #23297 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-22 12:10:16 +00:00
808d2e9aa0
[Misc] Move M-RoPE init logic to _init_mrope_positions ( #23422 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-22 03:07:22 -07:00
285178b3b8
[V0 Deprecation] Remove V0 LoRA test ( #23418 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-22 09:56:51 +00:00
88016c372a
[Bugfix] Fix pooling models on CPU backend ( #23392 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-22 09:47:17 +00:00
998720859c
Migrate MiniCPMOAudioInputs to TensorSchema ( #21847 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-22 16:43:29 +08:00
0ba1b54ac6
[gpt-oss] add input/output usage in responses api when harmony context is leveraged ( #22667 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-08-22 08:32:24 +00:00
53415653ff
[P/D][Nixl] Make kv cache register compatible with hybrid memory allocator ( #23079 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-08-21 22:30:48 -07:00
17373dcd93
[Attention] Refactor AttentionMetadata Preparation for Encoder-only Models ( #23154 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-22 05:05:59 +00:00
5964069367
[New Model] Add Seed-Oss model ( #23241 )
...
Signed-off-by: jiabin.00 <jiabin.00@bytedance.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-22 04:58:10 +00:00
de9c085e17
[Misc] Add gemma3 chat template with pythonic-style function calling ( #17149 )
...
Signed-off-by: Philip Chung <philip.f.chung@gmail.com >
2025-08-21 21:06:50 -07:00
111692bb8c
[CI] Add end-to-end V1 min_tokens test coverage ( #22495 )
...
Signed-off-by: Arjun Reddy <189282188+arjunbreddy22@users.noreply.github.com >
Co-authored-by: Arjun Reddy <189282188+arjunbreddy22@users.noreply.github.com >
2025-08-21 22:04:07 -06:00
394591e343
[Feature] Enable DeepGEMM Linear on B200; 1.5% E2E throughput improvement ( #23351 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-21 21:01:08 -07:00
3ac849665d
[CI/Build] Skip Idefics3 and SmolVLM generation test again ( #23356 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-22 03:39:46 +00:00
0b9cc56fac
Migrate MllamaImagePixelInputs to TensorSchema ( #22020 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-22 11:28:49 +08:00
8896eb72eb
[Deprecation] Remove prompt_token_ids arg fallback in LLM.generate and LLM.embed ( #18800 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-22 10:56:57 +08:00
19fe1a0510
[Kernel] Add FP8 support with FlashMLA backend ( #22668 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-08-22 02:26:32 +00:00
480bdf5a7b
[Core] Support custom executor qualname ( #23314 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-22 09:40:54 +08:00
5368f76855
[Feature][Responses API] Support logprobs(non-stream) ( #23319 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-08-21 23:09:16 +00:00
8ef6b8a38c
Always use cache mounts when installing vllm to avoid populating pip cache in the image. Also remove apt cache. ( #23270 )
...
Signed-off-by: Valentyn Tymofieiev <valentyn@google.com >
2025-08-21 18:01:03 -04:00
3bbe11cc13
[Perf] Small optimizations for silu_mul_fp8_quant_deep_gemm ( #23265 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-21 17:56:15 -04:00
c5041f899f
[CI] improve pr comments bot ( #23380 )
2025-08-21 14:49:03 -07:00
8b5fe6eb51
[CI] Clean up actions: remove helm, publish workflows and improve pr … ( #23377 )
2025-08-21 14:29:04 -07:00
800349c2a5
[Structured Outputs] Refactor bitmask construction into get_grammar_bitmask ( #23361 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-21 20:53:33 +00:00
044931f97b
Make sure that vectorize_with_alignment produced vectorized global loads ( #23182 )
2025-08-21 20:06:54 +00:00
1d353b6352
[Core] Always use tensor cores for Flashinfer Decode Wrapper ( #23214 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-08-21 16:02:11 -04:00
3496274663
[Misc] Convert VLLM_TORCH_PROFILER_DIR path to absolute ( #23191 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-21 15:49:09 -04:00
8a19303173
[BugFix][gpt-oss] Fix Chat Completion with Multiple Output Message ( #23318 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-21 10:31:11 -07:00
603fbbbce0
[Misc] Misc code cleanup/simplification ( #23304 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-21 17:22:55 +00:00
10f535c086
[Bugfix] Fix port conflict by obtaining a list of open ports upfront ( #21894 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-08-21 10:22:18 -07:00
48bfb0c9b7
[Bug] Fix R1 Accuracy 0 Bug ( #23294 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-21 13:11:28 -04:00
f8ce022948
add tg-mxfp4-moe-test ( #22540 )
...
Signed-off-by: siyuanf <siyuanf@nvidia.com >
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-21 17:05:47 +00:00
0278f1ac3a
Fix nvfp4 swizzling ( #23140 )
...
Signed-off-by: yiliu30 <yi4.liu@intel.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-08-21 16:54:50 +00:00
a482e4e769
Migrate MolmoImageInputs to TensorSchema ( #22022 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-21 16:54:08 +00:00
e0b056e443
[ci/build] Fix abi tag for aarch64 ( #23329 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-08-21 23:32:55 +08:00
79f05e4436
[Multimodal] Always enable hashing mm data ( #23308 )
...
Signed-off-by: Roger Wang <hey@rogerw.io >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-21 07:23:28 -07:00
f8daddcc4c
[Bugfix] set system_message in phi4mini chat template ( #23309 )
...
Signed-off-by: zhuangqh <zhuangqhc@gmail.com >
2025-08-21 14:22:39 +00:00
c8e33c72c6
[V1] Remove unnecessary check for main thread ( #23298 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-08-21 14:08:35 +00:00
d70a16625d
[Performance] V1 Pooling Models E2E Performance Optimization ( #23162 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-21 13:26:09 +00:00
5cc54f7c5b
[Doc] Fix batch-level DP example ( #23325 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-08-21 06:16:38 -07:00
0c6e40bbaa
[Refactor] Simplify code for MM budget ( #23310 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-21 08:00:16 +00:00
2e2000f352
[Model] Add LFM2 architecture ( #22845 )
...
Signed-off-by: Paul Pak <paulpak58@gmail.com >
2025-08-21 09:35:07 +02:00
31282401b6
[BugFix] Fix Python 3.9 Support ( #23306 )
...
Signed-off-by: Jared O'Connell <46976761+jaredoconnell@users.noreply.github.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-20 23:23:56 -07:00
0c31e28e95
[Bugfix] Fix extra whitespace in strings caused by newline ( #23272 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 22:03:00 -07:00
f571ff8eb6
[Sampler] Support returning final logprobs ( #22387 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-20 21:28:32 -07:00
f64ee61d9e
[CI] Block the cu126 wheel build while broken ( #23285 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-21 04:21:05 +00:00
8993073dc1
[CI] Delete images older than 24h. ( #23291 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-08-20 21:15:20 -07:00
655a09f653
[Model][VLM] Support R-4B Model ( #23246 )
...
Signed-off-by: yannqi <yannqi@qq.com >
Signed-off-by: 杨奇(yann qi) <51905299+yannqi@users.noreply.github.com >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: yannqiyang <yannqiyang@tencent.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-21 04:08:52 +00:00
f94bf9b924
[Compile] Fix Compile Warning SM100 Cutlass MLA ( #23287 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-21 03:09:39 +00:00
3663870c72
[V1][Mamba1] - Full CUDA and Piecewise CUDA Graphs Support ( #23035 )
...
Signed-off-by: asafg <asafg@ai21.com >
Signed-off-by: asafg <39553475+Josephasafg@users.noreply.github.com >
Co-authored-by: asafg <asafg@ai21.com >
2025-08-20 20:08:51 -07:00
2461d9e562
[CI/Build] Split out mm processor tests ( #23260 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 20:05:20 -07:00
7be5d113d8
[CPU] Refactor CPU W8A8 scaled_mm ( #23071 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-21 09:34:24 +08:00
b029de9902
[Optimization] Make new_block_ids None if empty ( #23262 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-08-20 18:25:56 -07:00
bbea1cefdd
[CI Bugfix] Fix CI by fully removing --enable-prompt-adapter ( #23284 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-20 17:18:12 -07:00
f5aa307d77
Remove duplicate entry in vllm.attention.__all__ ( #23296 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-20 17:14:59 -07:00
4b795020ed
[EP] Add logging for experts map ( #22685 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-20 23:46:06 +00:00
c86af22f31
[Fix] remove is_marlin param in benchmark_moe ( #23286 )
2025-08-20 22:04:21 +00:00
10cc12ba66
Feature/mla tests ( #23195 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-08-20 21:46:47 +00:00
a4fbb32fab
Remove chunked_prefill_enabled flag in V1 MLA ( #23183 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
2025-08-20 21:43:17 +00:00
1b125004be
[misc] fix multiple arch wheels for the nightly index ( #23110 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-08-20 14:15:34 -07:00
4fbda0b20c
[Feature] use --eplb_config to set eplb param ( #20562 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: rongfu.leng <lenronfu@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-20 14:07:28 -07:00
4e51fa8cba
Do not use eval() to convert unknown types ( #23266 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-20 13:28:30 -07:00
bf7c99dfc4
[Perf] Speed up function _convert_tokens_to_string_with_added_encoders by 13.7x ( #20413 )
...
Signed-off-by: Saurabh Misra <misra.saurabh1@gmail.com >
Signed-off-by: Aseem Saxena <aseem.bits@gmail.com >
Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com >
2025-08-20 13:17:11 -07:00
b95697d731
[Frontend] improve error logging of chat completion ( #22957 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-20 13:03:37 -07:00
582bbe6bd7
[Fix] correct tool_id for kimi-k2 when use tool_choice=required ( #21259 )
...
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
2025-08-20 12:59:54 -07:00
0cdbf5e61c
[Kernel/Quant] Remove the original marlin format and qqq ( #23204 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-20 15:13:36 -04:00
ebe56a0064
Small fix for Command-A-Vision ( #23268 )
...
Signed-off-by: donglu <donglu@cohere.com >
2025-08-20 18:15:18 +00:00
f77a0802b7
Limit HTTP header count and size ( #23267 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
2025-08-20 17:57:37 +00:00
c4477f55e5
Migrate Mistral3ImagePixelInputs to TensorSchema ( #21945 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-20 17:37:29 +00:00
dfd2382039
[torch.compile] Support conditional torch.compile per module ( #22269 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-20 16:52:59 +00:00
3b11b26b50
[FIXBUG ] Allow disabling rocm_aiter_fa backend for ROCm GPUs not compatible with AITER ( #22795 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-08-20 09:08:29 -07:00
d6d13bd49e
[Misc] Add max_seq_len to CommonAttentionMetadata ( #23216 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-20 09:05:29 -07:00
5efd6905bc
[CLI][Doc] Formalize --mm-encoder-tp-mode ( #23190 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 23:42:28 +08:00
b17109beea
[Kernel] CUTLASS MoE FP8: Integrate cuda moe permute/unpermute ( #23045 )
...
Signed-off-by: Shixian Cui <shixian@amazon.com >
2025-08-20 10:35:26 -04:00
4449235843
[Bugfix] Ensure correctness of HCXVision processing ( #23254 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 14:19:30 +00:00
38217877aa
[Fix] fix offline env use local mode path ( #22526 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-20 13:34:49 +00:00
c6d80a7a96
[Model] Improve olmo and olmo2 ( #23228 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-20 12:47:05 +00:00
7cd17e22d7
[Model][V1] Support Ernie MTP ( #22169 )
...
Signed-off-by: zhouchong <zhouchong03@baidu.com >
Co-authored-by: zhouchong <zhouchong03@baidu.com >
2025-08-20 20:41:55 +08:00
50df09fe13
Update to flashinfer-python==0.2.12 and disable AOT compile for non-release image ( #23129 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-20 08:05:54 -04:00
68fcd3fa73
[Bugfix] Ensure correctness of Cohere2Vision processing ( #23245 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 11:09:18 +00:00
83e69a09d6
[Model] Support deepseek with eagle ( #21086 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
2025-08-20 19:01:31 +08:00
3aa8c10038
Fix missing quotes ( #23242 )
...
Signed-off-by: Shiming Zhang <wzshiming@hotmail.com >
2025-08-20 10:46:59 +00:00
103f1ec8d3
[Model] use autoWeightsLoader for gptoss ( #22446 )
...
Signed-off-by: calvin chen <wen.chen@dynamia.ai >
2025-08-20 10:16:27 +00:00
d983769c41
fix cuda graph ( #22721 )
...
Signed-off-by: fsx950223 <fsx950223@outlook.com >
2025-08-20 06:24:37 +00:00
8fd920924c
[BugFix] Fix stuck stats/metrics after requests are aborted ( #22995 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-20 13:50:29 +08:00
de7b67a023
[CI/Build] Sync multimodal tests ( #23181 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 05:06:42 +00:00
f729023272
[CI/Build] Also check DP in benchmarks throughput script ( #23038 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-20 04:09:27 +00:00
1a3079a15e
chore: support pytorch format in lora ( #22790 )
...
Signed-off-by: jaeeun.kil <rha3122@naver.com >
Signed-off-by: 길재은 <rha3122@naver.com >
2025-08-20 04:02:50 +00:00
941f56858a
Fix a performance comparison issue in Benchmark Suite ( #23047 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
Signed-off-by: Louie Tsai <louie.tsai@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
2025-08-20 03:14:32 +00:00
a634733f67
[Attention] Optimize make_local_attention_virtual_batches for Flash Attention ( #23185 )
...
Signed-off-by: linzebing <linzebing1995@gmail.com >
2025-08-20 02:57:47 +00:00
64ab3c7253
[Doc] Update V1 status of various pooling models ( #23189 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-20 10:33:41 +08:00
e58c5a9768
[Core] Add torch profiler CPU traces for AsyncLLM. ( #21794 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-08-20 02:32:47 +00:00
d46d417b58
[CI Perf] Only test bfloat16 for tests/compile/test_fusion_all_reduce.py ( #23132 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-19 20:18:52 -06:00
0167efe20d
[Core] Optimize scheduler request removal for single completions ( #21917 )
...
Signed-off-by: chiliu <chiliu@paypal.com >
Signed-off-by: chiliu <cliu_whu@yeah.net >
Co-authored-by: chiliu <chiliu@paypal.com >
2025-08-19 18:25:59 -07:00
c32e6ad1f6
[Quantization] Bump Compressed Tensors Version ( #23202 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-20 00:39:28 +00:00
1630cc8d0f
[Benchmarks] Add video inputs to ShareGPTDataset. ( #23199 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-08-19 23:42:31 +00:00
14e2b0730b
[BugFix] fix CUTLASS MLA full cudagraph ( #23200 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-19 22:17:08 +00:00
0f4f0191d8
[CI/Build] Replace lm-eval gsm8k tests with faster implementation ( #23002 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-19 15:07:30 -07:00
a38b8af4c3
[NVIDIA] Add SM100 Flashinfer Cutlass MoE fp8 backend ( #22357 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
2025-08-19 18:01:53 -04:00
21dce80ea9
[CI/Build] Add support for Python 3.13 ( #13164 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 13:49:34 -07:00
e61bac87ee
[Misc] Minor refactoring for FlashInfer backend ( #23147 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-19 13:11:51 -07:00
80141bbf2f
fix: use cache_salt for gpt-oss ( #23186 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
2025-08-19 18:12:25 +00:00
b94faf9d50
[Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 and modelopt. ( #23125 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-19 14:00:51 -04:00
5b5f350d67
[Misc] Enable yapf for FlashInfer backend ( #23193 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-19 10:33:47 -07:00
f7cf5b512e
[Frontend] Add /collective_rpc API endpoint ( #23075 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-19 17:29:32 +00:00
03d4235fd2
[Misc] Fix the benchmark's README and improve the error messages for the benchmark's argument checks ( #22654 )
...
Signed-off-by: tanruixiang <tanruixiang0104@gmail.com >
2025-08-19 10:18:51 -07:00
d6a1a20973
[CI/Build] Update transformers to v4.55.2 ( #23093 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-19 10:06:17 -07:00
a70d0bd0a3
Migrate LlavaOnevisionMultiInputs to TensorSchema ( #21844 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-19 17:02:02 +00:00
24f4d1a224
Add return_token_ids parameter to OpenAI API endpoints ( #22587 )
...
Signed-off-by: Yuge Zhang <scottyugochang@gmail.com >
Co-authored-by: Claude <noreply@anthropic.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-08-19 09:48:31 -07:00
4f510bc2a1
[Model] Removes redundant all-reduce operation in Qwen3MoeSparseMoeBlock ( #23169 )
...
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com >
2025-08-19 16:18:41 +00:00
1298c67795
[FEAT] [Performance] Enable DP for ViT in Qwen2.5VL ( #22742 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-19 15:25:57 +00:00
4d9c61993a
[Bugfix] Fix benchmark_moe.py ( #23177 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-19 13:39:40 +00:00
b87cb97a53
[Model] support new model ovis2.5 ( #23084 )
...
Signed-off-by: myselvess <244285088@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-19 13:12:59 +00:00
f856c33ce9
[Model] Add multi_label_classification support ( #23173 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-19 12:54:30 +00:00
03752dba8f
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel ( #21716 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-08-19 08:22:15 -04:00
40f26734b9
[Misc] Fix seq_lens for graph capture ( #23175 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-19 03:58:16 -07:00
2c3f557f08
[Doc] use power of 2 ( #23172 )
2025-08-19 03:16:23 -07:00
21bcc8263f
[Misc] Avoid accessing req_ids inside a loop ( #23159 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-19 09:39:38 +00:00
5bfe0dea7a
[bug fix] Fix llama4 spec decoding ( #22691 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com >
2025-08-19 08:53:24 +00:00
31fd3265c8
[Bugfix] Fix broken Minimax-01-VL model ( #22116 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-19 08:49:29 +00:00
31436e8b4f
[Misc] Add request_id into benchmark_serve.py ( #23065 )
...
Signed-off-by: yangxia <yangxiast@gmail.com >
2025-08-19 08:32:18 +00:00
4efd43e9b4
Fix GLM-4.5V-FP8 numerical issue ( #22949 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 07:56:31 +00:00
3c8a787247
[Benchmark] Add flag --served-model-name to benchmark_serving_multi_turn ( #22889 )
...
Signed-off-by: daniels <daniels@pliops.com >
2025-08-19 07:48:07 +00:00
01a08739e0
[misc] split engine_model into json file for nsys profile tool ( #23117 )
...
Signed-off-by: Grace Ho <grho@nvidia.com >
Signed-off-by: Grace Ho <146482179+gracehonv@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 15:44:53 +08:00
fda9537c5e
[Model] Support Pipeline Parallelism for moonshotai/Kimi-VL-A3B-Thinking-2506 ( #23114 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 14:24:31 +08:00
90bbe0a5ad
[Log] Warning Once for Cutlass MLA ( #23137 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-18 23:24:16 -07:00
e75f342261
Migrate InternVLImagePixelInputs (in nemotron_vl.py) to TensorSchema ( #22023 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-19 13:48:26 +08:00
78dba404ad
[Hardware][IBM Z]Enable v1 for s390x and s390x dockerfile fixes ( #22725 )
...
Signed-off-by: Nikhil Suryawanshi <suryawanshin74@gmail.com >
2025-08-19 04:40:37 +00:00
e9d6a3db69
[TPU] make ptxla not imported when using tpu_commons ( #23081 )
...
Signed-off-by: Chengji Yao <chengjiyao@gmail.com >
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: Chengji Yao <chengjiyao@gmail.com >
2025-08-19 11:46:42 +08:00
a4454e9401
chore: disable enable_cpp_symbolic_shape_guards ( #23048 )
...
Signed-off-by: Xiao Liu <xiszishu@gmail.com >
2025-08-18 23:08:05 -04:00
14006840ea
[V0 Deprecation] Remove V0 FlashInfer attention backend ( #22776 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-18 19:54:16 -07:00
6603288736
[CI][V0 Deprecation] Removed V0 Only Chunked Prefill and Prefix Caching Tests ( #22871 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-18 17:39:01 -07:00
95e3095136
[Misc] Add @tdoublep as a maintainer of hybrid model and Triton-attention related code ( #23122 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-19 08:31:38 +08:00
c9b38be8aa
[Spec Decode] Make propose_draft_token_ids non-blocking for lower TTFT ( #23041 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-18 17:20:38 -07:00
0dd3f4f5ab
[Misc] Minor refactoring for prepare_inputs ( #23116 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-18 16:58:05 -07:00
498259ccce
Install tpu_info==0.4.0 to fix core dump for TPU ( #23135 )
2025-08-18 16:23:33 -07:00
6d25e3fd6e
Use Blackwell FlashInfer MXFP4 MoE by default if available ( #23008 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-18 15:25:49 -07:00
ac6eb49de3
fix: OpenAI SDK compat (ResponseTextConfig) ( #23126 )
...
Signed-off-by: breno.skuk <breno.skuk@hcompany.ai >
Signed-off-by: Breno Baldas Skuk <breno.skuk@hcompany.ai >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-18 15:22:59 -07:00
bf756321c7
[CI Bugfix] Pin openai<1.100 to unblock CI ( #23118 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-18 12:14:01 -07:00
0e3bb543f0
[Bugfix] Support compile for Transformers multimodal ( #23095 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2025-08-18 13:35:48 +00:00
569aefd134
chore: remove unnecessary patch_padding_side for the chatglm model ( #23090 )
...
Signed-off-by: carlory <baofa.fan@daocloud.io >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-18 12:32:13 +00:00
d3f71f1224
[Refactor] Get prompt updates earlier ( #23097 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-18 12:31:53 +00:00
5a30bd10d8
[Bugfix] fix IntermediateTensors equal method ( #23027 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-18 02:58:11 -07:00
27e8d1ea3e
[Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs ( #23053 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-18 09:52:00 +00:00
5c79b0d648
[XPU][CI]add xpu env vars in CI scripts ( #22946 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-18 09:47:03 +00:00
5f5664b3e4
[XPU] Fix compile size for xpu ( #23069 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-18 00:04:08 -07:00
89657a557c
[Misc] Fix backward compatibility from #23030 ( #23070 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-08-17 23:33:29 -07:00
08d5f7113a
[Misc] refactor function name ( #23029 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-17 22:16:21 -07:00
b2fd0b81e0
[Bugfix][CI] Machete kernels: deterministic ordering for more cache hits ( #23055 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2025-08-17 22:10:26 -07:00
9f1c642254
[Bugfix] fix Qwen2.5-Omni processor output mapping ( #23058 )
...
Signed-off-by: double7 <33449816+DoubleVII@users.noreply.github.com >
Co-authored-by: 杨森 <yangsen.double7@bytedance.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-17 22:09:11 -07:00
7be3a59d8e
[Misc] enhance static type hint ( #23059 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-17 22:09:08 -07:00
8ea0c2753a
[Misc] Minor code cleanup for _get_prompt_logprobs_dict ( #23064 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-17 18:16:03 -07:00
0fc8fa751a
fix: gptq marlin weight loading failure ( #23066 )
2025-08-17 15:56:07 -07:00
21e39436c8
[XPU] fix xpu to set cudagraph batch sizes ( #23044 )
...
Signed-off-by: calvin chen <wen.chen@dynamia.ai >
2025-08-17 21:45:42 +00:00
6d243efeda
[Misc] Convert use_structured_output property into constant ( #23060 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-17 12:41:38 -07:00
c55bc1db26
[Misc] Remove dead return ( #23061 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-17 10:36:46 -07:00
292084e72a
[BugFix] Fix for IMA in FA3 varlen combine ( #22967 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-17 08:52:04 -07:00
16bff144be
[Misc] fix typo in the multimodal doc ( #23051 )
2025-08-17 01:56:20 -07:00
fe0411fc6f
[Bugfix] should use stack instead of concat ( #22972 )
...
Signed-off-by: 947132885 <947132885@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-17 08:46:36 +00:00
4d4061b6e7
[Kernel] Add cuda kernel for gpt_oss activation ( #22951 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-17 05:03:24 +00:00
87f48623a5
[Misc] method name typo fix ( #23042 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-16 21:49:14 -07:00
5c32143b9d
[Refactor] Defer tensor data construction in MultiModalKwargs ( #23030 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-16 21:05:50 -07:00
94096a47c9
[UX] Separate marlin moe config logic from triton moe ( #23006 )
2025-08-16 22:16:42 -04:00
a258ad8bcc
[Bugfix] fix qwen3 moe fp8 accuracy issue ( #23031 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
2025-08-16 17:41:23 -07:00
bf7f470b22
[V1] Logits processors extensibility ( #19912 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
Signed-off-by: Andrew Feldman <afeld2012@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Andrew Feldman <afeld2012@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-16 12:59:17 -07:00
4fc722eca4
[Kernel/Quant] Remove AQLM ( #22943 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-08-16 19:38:21 +00:00
3253ae765e
[Flaky CI] Increase timeout tolerance for test_mp_crash_detection+test_default_mm_lora_chat_completions ( #23028 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-16 18:33:08 +00:00
000cceca8c
[Bugfix gpt-oss] Fix float32 convert for flashinfer sink support ( #23016 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-16 11:16:00 -07:00
68373d3126
[Frontend] Added support for HermesToolParser for models without special tokens ( #16890 )
...
Signed-off-by: minpeter <kali2005611@gmail.com >
2025-08-16 17:38:42 +00:00
52ce1420e9
Fix handling of max_num_batched_tokens for pooling tasks ( #23004 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-16 17:36:30 +00:00
829bbd7882
[New Model]mBART model ( #22883 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-08-16 12:16:58 +00:00
4dff91c93d
[Refactor] Allow optional MultiModalKwargsItem in IPC ( #23022 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-16 11:30:49 +00:00
de9cb61763
Add docs for PrefixRepetitionDataset + enable usage with vllm bench throughput ( #23012 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-08-16 10:21:20 +00:00
2dbccce8a6
[CI][Bugfix] Skip Ovis2 generation test because of broken remote code ( #22954 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-16 09:44:19 +00:00
933f45334a
[Core] Make cudagraph check cuda platform only ( #23005 )
...
Signed-off-by: Chengji Yao <chengjiyao@gmail.com >
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: Chengji Yao <chengjiyao@gmail.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-08-16 07:46:00 +00:00
cc826a202b
[Multimodal] Update Tensor schema test to cover arbitrary shape mm inputs ( #22867 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-16 00:44:50 -07:00
6d3da472bc
[Misc] Add --save-dir option to benchmark_moe ( #23020 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-16 07:26:10 +00:00
78863f8c5c
[BugFix] Add support for loading prompt embeds tensors serialized on unavailable devices and sparse tensors ( #22962 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-08-16 06:25:10 +00:00
5157827cfc
[Build] Env var to disable sccache ( #22968 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-16 05:36:27 +00:00
7caec10e7b
[XPU]avoid circular import during XPU init ( #23017 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-16 05:16:34 +00:00
1f83e7d849
[misc] nsys profile output kernel classifier and visualizer ( #22971 )
...
Signed-off-by: Grace Ho <grho@nvidia.com >
2025-08-16 02:52:51 +00:00
e4e37ded56
[V1] support min_tokens for detokener ( #22014 )
...
Signed-off-by: calvin chen <wen.chen@dynamia.ai >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-08-16 02:28:10 +00:00
f6b5040590
[Frontend] Avoid list copies in serving_chat.py ( #22947 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-16 02:06:30 +00:00
fbd88728b3
[Bugfix] Fix DeepSeek MTP ( #22934 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-08-16 01:25:06 +00:00
070da660c1
[Kernel] Simplify get_kv_cache_layout and cache use_trtllm_attention env-dependent bit ( #22735 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-16 00:14:08 +00:00
ad0297d113
[Misc] Support passing multiple request ids at once to AsyncLLM.abort() ( #22944 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-15 17:00:36 -07:00
236b864e4f
[BugFix] Make run_once thread-safe ( #22978 )
...
Signed-off-by: <wenji.yyc@alibaba-inc.com >
Signed-off-by: Yichen Yan <wenji.yyc@alibaba-inc.com >
2025-08-15 16:56:17 -07:00
3e2f7985a2
Support multiple attention groups for KV sharing ( #22672 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-15 16:54:10 -07:00
c280066f9d
[v1] Move block_hashes from KVCacheManager to Request.block_hashes ( #19728 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-08-15 16:52:52 -07:00
b9dc9d2607
[BugFix] Handle case where async utility call is cancelled ( #22996 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Yinghai Lu <yinghai@thinkingmachines.ai >
2025-08-15 17:38:42 -06:00
1fc375dc05
[Structured Outputs] [Bug] Fix misalignment in apply_grammar_bitmask causing unintended masking and NaN logits ( #22963 )
...
Signed-off-by: rishitdholakia13 <rishit+github@cohere.com >
2025-08-15 23:25:05 +00:00
76144adf76
ci: Add CUDA + arm64 release builds ( #21201 )
...
Signed-off-by: Eli Uriegas <eliuriegas@meta.com >
2025-08-15 23:16:23 +00:00
f5d412bafb
[BugFix] Fix regression caused by mamba state dtype PR ( #22998 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-15 22:55:26 +00:00
177e55e3bd
[Attention] FA3 Attention Sinks Perf Boost ( #22478 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-15 17:41:07 -04:00
1723ef1aae
minor: zero workspace buffer init for flashinfer trtllm-gen attn ( #22603 )
2025-08-15 21:38:10 +00:00
00d6cba0cf
Add PrefixRepetitionRandomDataset to vllm bench serve datasets ( #20638 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-08-15 14:09:23 -07:00
7f89ed248f
[Fix] enable swap_ab for pplx problem size computation ( #22991 )
...
Signed-off-by: Shixian Cui <shixian@amazon.com >
Co-authored-by: Shixian Cui <shixian@amazon.com >
2025-08-15 14:02:12 -07:00
8a87cd27d9
[CI] Speed up Whisper tests by reusing server ( #22859 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-15 16:56:31 -04:00
a344a1a7da
Use regex in convert-results-json-to-markdown.py ( #22989 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-08-15 20:54:20 +00:00
79899b63f6
[Bugfix] Added more env vars to hash ( #22449 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
2025-08-15 20:08:37 +00:00
6e670778cd
[Core] direct indexing on self.block_table_np in compute_slot_mapping ( #22940 )
...
Signed-off-by: linzebing <linzebing1995@gmail.com >
2025-08-15 12:12:12 -07:00
df5afa82e5
[Log] Debug Once for Randomizing dummy data for DP Rank ( #22860 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-15 11:51:50 -07:00
6cd69f51bf
[Model] Granite-4 support loading quantized checkpoint ( #22925 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-08-15 18:47:56 +00:00
8ad7285ea2
[Kernels] Clean up FusedMoeMethodBase and modular kernel setup. Remove extra arguments from modular kernel methods. ( #22035 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-15 14:46:00 -04:00
48b01fd4d4
[Structured Output] Make the output of structured output example more complete ( #22481 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-08-15 18:29:25 +00:00
993d3d122b
[Benchmarks] Include image data when ShareGPT4V dataset is used. ( #22955 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-08-15 18:23:06 +00:00
68af77e51c
[FIXBUG] Correctly Apply Grammar Bitmask in Mixed Batches ( #22896 )
...
Signed-off-by: JartX <sagformas@epdcenter.es >
2025-08-15 17:42:49 +00:00
6b04039a72
[BugFix] Skip the Q component for QKVParallelLinear in the case of QKVCrossParallelLinear since its width is 0 ( #22369 )
...
Signed-off-by: sstamenk <sstamenk@amd.com >
2025-08-15 17:17:31 +00:00
1c859a1387
[V0 Deprecation] Remove advance_step ( #22969 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-15 08:22:31 -07:00
74f441f4b5
[Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer ( #20059 )
...
Signed-off-by: fhl <2410591650@qq.com >
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
2025-08-15 10:01:39 -04:00
a0632a3e03
[Frontend] Expose do_log_stats interval to env ( #22905 )
...
Signed-off-by: Csrayz <jover@cmbchina.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-15 13:00:20 +00:00
e8b40c7fa2
[CI] Remove duplicated docs build from buildkite ( #22924 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-15 05:58:06 -07:00
48f4636927
[Misc] Ignore ep_kernels_workspace ( #22807 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-15 05:58:03 -07:00
75531a6c13
[V1] [Hybrid] Support using float32 for state in Hybrid Models (Mamba2, Mamba1, Minimax) ( #22928 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com >
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Daniel Afrimi <danielafrimi8@gmail.com >
Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-08-15 12:57:06 +00:00
22341b996e
Improve multimodal hasher performance for re-used Image prompts ( #22825 )
...
Signed-off-by: Staszek Pasko <staszek@gmail.com >
2025-08-15 12:32:56 +00:00
49252cf59e
[MM] Allow skipping memory profiling for multimodal models. ( #22950 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-15 11:41:38 +00:00
3e6dd40016
[Bugfix] fix cuda 12.6 and 11.8 build ( #22952 )
...
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
2025-08-15 10:10:22 +00:00
aa300c438d
[Bugfix] Unquote file uri before reading image ( #22912 )
...
Signed-off-by: Sayandip Dutta <sayandip199309@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-15 09:28:00 +00:00
fe91ce9591
[V1] - Split Prefill and Decode for Mamba1 models ( #22653 )
...
Signed-off-by: amirk <amirk@ai21.com >
Signed-off-by: asafg <asafg@ai21.com >
Co-authored-by: asafg <asafg@ai21.com >
Co-authored-by: Asaf Joseph Gardin <39553475+Josephasafg@users.noreply.github.com >
2025-08-15 08:59:52 +00:00
5406ebf5c9
[CI] Pooling models mteb test uses enforce_eager ( #22878 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-15 01:16:15 -07:00
b2c06509e5
[P/D]Provide bucket algorithm rate limiter for proxy_server ( #22643 )
...
Signed-off-by: frankie-ys <yongshengwang@cmbchina.com >
Signed-off-by: frankie <wangyongsheng686@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Kuntai Du <kuntai@uchicago.edu >
2025-08-15 07:01:48 +00:00
b2f6c247a9
Revert "[ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Module." ( #22956 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-08-15 06:39:19 +00:00
3d232dbd19
[Mamba] - refactor: Renamed mamba_attn to mamba2_attn ( #22818 )
...
Signed-off-by: asafg <asafg@ai21.com >
Co-authored-by: asafg <asafg@ai21.com >
2025-08-15 06:38:05 +00:00
5c3fbfe46b
[Feature] Full Cuda Graph Support for Cutlass MLA and 6% E2E Throughput Improvement ( #22763 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-15 06:27:30 +00:00
b4cef5e6c7
refactor: Change scaling factors calculation for flashinfer FusedMoE ( #22812 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-15 06:19:31 +00:00
0fe85087a9
[CI Perf] Prune tests in tests/kernels/attention/ ( #22936 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-14 21:34:53 -06:00
d2b0e97ea6
[CI Perf] Prune tests in tests/kernels/moe/ ( #22939 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-14 21:33:42 -06:00
590bddbfc5
[CI Perf] Prune tests in tests/kernels/quantization/ ( #22942 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-14 21:25:34 -06:00
ae05a6d83d
[BugFix] Fix port lookup in internal DP LB tests ( #22252 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-15 11:17:11 +08:00
0933f9d518
[BugFix][KVConn] Fix use of get_required_kvcache_layout ( #22734 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-15 01:39:43 +00:00
f1f0d2fab8
Revert "[Kernel] Add cuda kernel for gpt_oss activation" ( #22948 )
2025-08-14 17:38:10 -07:00
81f4b96481
[Kernel] Add cuda kernel for gpt_oss activation ( #22538 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-14 17:21:29 -07:00
39cd09dc86
[Bugfix] use flash attn on sm90 ( #22933 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-14 16:37:22 -07:00
919234fe17
[BugFix] Fix initial DP request load imbalance ( #22910 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-14 15:20:28 -07:00
ebcce2cd36
[Core] Return final response for aborted requests from AsyncLLM.generate ( #22283 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-14 14:49:02 -07:00
4121de512e
[Quantization]: Support compressed-tensors mixed-precision model loading ( #22468 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
2025-08-14 17:32:09 -04:00
279a5f31b3
[Kernel] Add nvfp4 gemm flashinfer backends ( #22346 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-08-14 16:03:55 -04:00
b8ff05361a
[CI] Temporarily disable flaky test ( #22930 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-14 19:59:16 +00:00
637093ae26
docs: update fastsafetensors usage instructions ( #22891 )
...
Signed-off-by: Nir Levy <bhr166@gmail.com >
2025-08-14 19:56:54 +00:00
33c63e9547
[Kernel] [Quantization] Add MXFP4 and bias support for marlin kernel ( #22428 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Signed-off-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Animesh Jain <anijain@umich.edu >
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Signed-off-by: kf <kuanfu.liu@embeddedllm.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: tjtanaavllm <tunjian.tan@amd.com >
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: Chih-Chieh Yang <7364402+cyang49@users.noreply.github.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: yan <yan.ma@intel.com >
Signed-off-by: Yan Ma <yan.ma@intel.com >
Signed-off-by: Xiao Liu <xiszishu@gmail.com >
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es >
Signed-off-by: Andy Xie <andy.xning@gmail.com >
Signed-off-by: Haibin Lin <haibin.lin@bytedance.com >
Signed-off-by: David Ben-David <davidb@pliops.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Signed-off-by: Abirdcfly <fp544037857@gmail.com >
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: huangweixiao <huangweixiao@msh.team >
Signed-off-by: alyosha-swamy <raghav@arcee.ai >
Signed-off-by: Eric Hanley <ericehanley@google.com >
Signed-off-by: Abatom <abzhonghua@gmail.com >
Signed-off-by: CLFutureX <775523362@qq.com >
Signed-off-by: Linkun Chen <github@lkchen.net >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Signed-off-by: tlipoca9 <tlipoca9@gmail.com >
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Signed-off-by: zitian zhao <zitian.zhao@tencentmusic.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Benji Beck <benjibeck@meta.com >
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
Signed-off-by: isotr0py <2037008807@qq.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Zhang Jason <ning.zhang2@amd.com >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Signed-off-by: asafg <asafg@ai21.com >
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Signed-off-by: Lain <fusiyuan2000@hotmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: QscQ <qscqesze@gmail.com >
Signed-off-by: qingjun <qingjun@minimaxi.com >
Signed-off-by: Syed Muhammad Bin Asif <syedmba7@connect.hku.hk >
Signed-off-by: Lionel Villard <villard@us.ibm.com >
Signed-off-by: ycyaw66 <497410282@qq.com >
Signed-off-by: David Chen <530634352@qq.com >
Signed-off-by: Linkun <github@lkchen.net >
Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com >
Signed-off-by: Ming Yang <minos.future@gmail.com >
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai >
Signed-off-by: shaojunqi <shaojunqi.sjq@alibaba-inc.com >
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
Signed-off-by: Andrew Chan <andrewkchan.akc@gmail.com >
Signed-off-by: Felix Marty <Felix.Marty@amd.com >
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
Signed-off-by: Shu Wang <shuw@nvidia.com >
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: XIn Li <xinli@nvidia.com >
Signed-off-by: Junhao Li <junhao@ubicloud.com >
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com >
Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com >
Signed-off-by: <zyy1102000@gmail.com >
Signed-off-by: Guy Stone <guys@spotify.com >
Signed-off-by: <yyweiss@gmail.com >
Signed-off-by: yyw <yyweiss@gmail.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com >
Signed-off-by: Pradyun92 <142861237+Pradyun92@users.noreply.github.com >
Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com >
Co-authored-by: rongfu.leng <rongfu.leng@daocloud.io >
Co-authored-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Animesh Jain <jainanimesh2305@yahoo.com >
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com >
Co-authored-by: XiongfeiWei <isaacwxf23@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
Co-authored-by: JartX <sagformas@gmail.com >
Co-authored-by: fhl2000 <63384265+fhl2000@users.noreply.github.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: kf <kuanfu.liu@embeddedllm.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com >
Co-authored-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: tjtanaavllm <tunjian.tan@amd.com >
Co-authored-by: Yong Hoon Shin <48474650+sarckk@users.noreply.github.com >
Co-authored-by: Chih-Chieh Yang <7364402+cyang49@users.noreply.github.com >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com >
Co-authored-by: Yuxuan Zhang <2448370773@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Yan Ma <yan.ma@intel.com >
Co-authored-by: Xiao <xiszishu@gmail.com >
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
Co-authored-by: Roberto L. Castro <38211239+LopezCastroRoberto@users.noreply.github.com >
Co-authored-by: Ning Xie <andy.xning@gmail.com >
Co-authored-by: H <linhaibin.eric@gmail.com >
Co-authored-by: David Ben-David <sdavidbd@gmail.com >
Co-authored-by: David Ben-David <davidb@pliops.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
Co-authored-by: TankNee <nee@tanknee.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com >
Co-authored-by: ZiTian.Zhao <zitian.zhao@tencentmusic.com >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: Abirdcfly <fp544037857@gmail.com >
Co-authored-by: Giancarlo Delfin <32987265+TheEpicDolphin@users.noreply.github.com >
Co-authored-by: Chenxi Yang <cxyang@cs.utexas.edu >
Co-authored-by: Chenxi Yang <cxyang@meta.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Weixiao Huang <hwx.simle@gmail.com >
Co-authored-by: Raghav Ravishankar <113712354+alyosha-swamy@users.noreply.github.com >
Co-authored-by: ericehanley <ericehanley@google.com >
Co-authored-by: Zhonghua Deng <abzhonghua@gmail.com >
Co-authored-by: Po-Han Huang (NVIDIA) <53919306+nvpohanh@users.noreply.github.com >
Co-authored-by: PiteXChen <44110731+CLFutureX@users.noreply.github.com >
Co-authored-by: lkchen <github@lkchen.net >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com >
Co-authored-by: tlipoca9 <160737620+tlipoca9@users.noreply.github.com >
Co-authored-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
Co-authored-by: wang.yuqi <noooop@126.com >
Co-authored-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Siyuan Liu <lsiyuan@google.com >
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Zhang Jason <ning.zhang2@amd.com >
Co-authored-by: Asaf Joseph Gardin <39553475+Josephasafg@users.noreply.github.com >
Co-authored-by: asafg <asafg@ai21.com >
Co-authored-by: Lain <siyuanf@nvidia.com >
Co-authored-by: tc-mb <157115220+tc-mb@users.noreply.github.com >
Co-authored-by: imning3 <hbning@pku.edu.cn >
Co-authored-by: Maximilien de Bayser <mbayser@br.ibm.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Tao He <linzhu.ht@alibaba-inc.com >
Co-authored-by: qscqesze <qingjun@minimaxi.com >
Co-authored-by: Syed Muhammad Bin Asif <92625830+syedmba@users.noreply.github.com >
Co-authored-by: Lionel Villard <villard@us.ibm.com >
Co-authored-by: WeiQing Chen <40507679+david6666666@users.noreply.github.com >
Co-authored-by: ycyaw66 <497410282@qq.com >
Co-authored-by: Moritz Sanft <58110325+msanft@users.noreply.github.com >
Co-authored-by: Ming Yang <minos.future@gmail.com >
Co-authored-by: Adrián García García <adrigarvk8@gmail.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
Co-authored-by: JaceyShao <65159281+JaceyShao@users.noreply.github.com >
Co-authored-by: shaojunqi <shaojunqi.sjq@alibaba-inc.com >
Co-authored-by: Ricardo Decal <crypdick@users.noreply.github.com >
Co-authored-by: Andrew Chan <andrewkchan.akc@gmail.com >
Co-authored-by: fxmarty-amd <felmarty@amd.com >
Co-authored-by: Andrew Sansom <andrew@protopia.ai >
Co-authored-by: Zhiyu <zhiyuc@nvidia.com >
Co-authored-by: Shu Wang <shuw@nvidia.com >
Co-authored-by: XIn Li <xinli@nvidia.com >
Co-authored-by: Junhao Li <streaver91@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: iAmir97 <71513472+iAmir97@users.noreply.github.com >
Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com >
Co-authored-by: Hong Hanh <hanh.usth@gmail.com >
Co-authored-by: Daniel Serebrenik <74646983+pliops-daniels@users.noreply.github.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Guy Stone <guys@spotify.com >
Co-authored-by: yyweiss <70619747+yyweiss@users.noreply.github.com >
Co-authored-by: Pradyun92 <142861237+Pradyun92@users.noreply.github.com >
Co-authored-by: Pradyun Ramadorai <pradyunr@amazon.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2025-08-14 11:23:22 -07:00
ab9f2cfd19
[CI] [Hybrid] Bump min transformers version for Bamba and Jamba ( #22908 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-14 11:01:16 -07:00
dbe298046c
[Bugfix] Fix parsing of --disable-mm-preprocessor-cache ( #22909 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-14 08:09:44 -07:00
625ccd1c4d
[Bugfix] Replace custom Encoding class with BatchEncoding in MistralTokenizer ( #22786 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-14 08:09:27 -07:00
92ff41abea
[Model] Modify the gate implementation of glm4_moe ( #22832 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-14 05:28:50 -07:00
829b9a62d0
[Perf] Dont create unnecessary pooling params ( #22876 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-14 05:28:09 -07:00
540d54ca8d
[CI] Re-enable transcriptions test_long_audio_request ( #22890 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-14 11:34:34 +00:00
0783f13960
[Doc] fix dead link ( #22898 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
2025-08-14 04:06:13 -07:00
7655dc3e45
[Bugfix] Add reset prefix cache for online serving ( #22726 )
...
Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com >
Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com >
Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-14 04:04:18 -07:00
f4efda821d
Remove Phi 4 Flash configuration workaround ( #22723 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-14 04:03:49 -07:00
eb08487b18
[BugFix] Threadsafe close async zmq sockets ( #22877 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-14 03:44:29 -07:00
7c3a0741c6
[Bugfix] Fix PixtralHFImagePixelInputs dynamic shape check ( #22827 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-14 02:35:43 -07:00
00e3f9da46
vLLM Benchmark suite improvement ( #22119 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
Signed-off-by: Louie Tsai <louie.tsai@intel.com >
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
2025-08-14 07:12:17 +00:00
a353bd083d
[CI] remove flaky v0 test ( #22864 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-08-13 21:41:51 -07:00
1d20c34717
[CI] Fix tests/distributed/test_ca_buffer_sharing.py ( #22849 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-08-13 20:09:30 -07:00
b6af24fba7
[CI][Entrypoints]: add filter to generation to filter out invalid tool calls ( #22826 )
...
Signed-off-by: Will Eaton <weaton@redhat.com >
2025-08-13 20:09:07 -07:00
0ca2393b47
[CI/Build] Increase pooling tolerance to pass CI ( #22844 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-08-13 18:52:48 -04:00
31a500c86f
[Core] [N-gram SD Optimization][1/n] Propose tokens with a single KMP ( #22437 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-08-13 14:44:06 -07:00
4e8614e88b
Move checklist in PR template ( #22852 )
...
Signed-off-by: Luka Govedic <lgovedic@redhat.com >
2025-08-13 21:38:35 +00:00
c6cd5ca3d3
[ROCm][Bugfix] Fix compilation error in topk softmax fused kernel ( #22819 )
...
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
2025-08-13 13:45:03 -07:00
df0e0f023e
[CI/Build] Skip gpt_big model test because of broken HF model ( #22848 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-13 20:36:28 +00:00
b4b78d6317
[CI/Build] Fix param mismatch in test_eagle_correctness ( #22847 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-13 10:55:25 -07:00
12817a8ac7
[CI] Fix tests/v1/e2e/test_kv_sharing_fast_prefill.py import on test ( #22815 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-13 10:35:50 -07:00
c9232d41f4
[CI/Build] Update VLM common tests ( #22841 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-13 10:03:05 -07:00
9bd9294f0e
[Bugfix] Fix MiniCPMV Image input inference failed ( #22813 )
...
Signed-off-by: HWH <67449739+jio-H@users.noreply.github.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-13 09:41:41 -07:00
da2705198f
[Misc] clear and separate error messages for input too long and input + max-tokens too long ( #22803 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
2025-08-13 07:22:56 -07:00
19b927e52d
[Core] Use individual MM items in P0/P1 cache and model runner ( #22570 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-13 07:18:07 -07:00
20d65aa755
[Frontend] Multithreaded async multimodal load_bytes ( #22710 )
...
Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com >
Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com >
2025-08-13 06:09:26 -07:00
b159c0a67a
Fix GGUF loader for Qwen3 MoE. ( #22785 )
...
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com >
2025-08-13 06:08:23 -07:00
6772bb0f7d
Remove unnecessary CUDA sync of qwen image and video preprocess ( #22792 )
...
Signed-off-by: cyy <cyyever@outlook.com >
Signed-off-by: Yuanyuan Chen <cyyever@outlook.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-13 06:07:28 -07:00
fceafaf582
[Bugfix][mamba] Fix type annotation of Mamba2Metadata ( #22787 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-13 06:07:09 -07:00
6b794c756c
[Nixl][CI] Fix tests ( #22806 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-13 06:03:53 -07:00
98deac3879
[FEATURE] support custom vllm tuned config path for fused moe triton kernels ( #22791 )
...
Signed-off-by: Chi Zhang <zhangchi.usc1992@bytedance.com >
2025-08-13 20:27:25 +08:00
653124bd46
[Frontend] Add chunked processing to handle long inputs in embedding models ( #22280 )
...
Signed-off-by: x22x22 <wadeking@qq.com >
Signed-off-by: Kdump <rootshellexp@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-13 04:14:24 -07:00
0b1bdac6af
[Platform] Custom ops support for FusedMoe ( #22509 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-08-13 04:12:00 -07:00
d94e3026de
[V1] Add tree drafting tests for eagle spec decoding ( #22705 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com >
2025-08-13 04:11:28 -07:00
3f52738dce
[Doc] Add max_lora_rank configuration guide ( #22782 )
...
Signed-off-by: chiliu <cliu_whu@yeah.net >
2025-08-13 04:10:07 -07:00
a01e0018b5
[Bugfix] Fix Nemotron VL image processing ( #22739 )
...
Co-authored-by: ducviet00-h2 <viet.d.hoang@h2corporation.jp >
2025-08-13 03:11:36 -07:00
9e7e5baaa8
[Model] Add missing prefix to glm4_1v ( #22716 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-08-13 01:23:33 -07:00
d16aa3dae4
[Model] Add option to run Step3VisionEncoder in DP ( #22697 )
...
Signed-off-by: zzh142857 <chaorenzhaozhenghao@gmail.com >
2025-08-13 00:09:13 -07:00
6807af8f46
[gpt-oss] upgrade gpt-oss to v0.0.3 and add version check ( #22768 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-12 21:37:26 -07:00
4c558cf62e
[Perf] Support topk softmax fused kernel for broader num_experts ( #22211 )
...
Signed-off-by: Shixian Cui <shixian@amazon.com >
Co-authored-by: Shixian Cui <shixian@amazon.com >
2025-08-12 21:34:47 -07:00
77a6bf07ae
[Bug] Fix Unexpected Keyword Argument 'w1_bias' ( #22757 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-12 21:31:47 -07:00
4082338a25
Remove unneeded ROCm platform import when using CUDA ( #22765 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-12 21:26:38 -07:00
c6b928798e
Force TRTLLM attention for gpt-oss on SM100 ( #22678 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-12 21:22:16 -07:00
b1361c7273
[Bugfix] Fix default enable for CUTLASS MLA on SM100 ( #22738 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-12 21:22:05 -07:00
4f0f844b16
Fix cuda illegal mem access with Llama4 TP8 + rms_norm custom op ( #22701 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
2025-08-12 21:21:50 -07:00
c5830381af
[V0 Deprecation] Remove args for multi-step scheduling ( #22779 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-08-12 20:38:18 -07:00
d31f97cf57
[Misc] Remove tests/multi_step/__init__.py ( #22778 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-08-12 20:21:18 -07:00
71683ca6f6
[V0 Deprecation] Remove multi-step scheduling ( #22138 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-08-12 20:18:39 -07:00
e18859298d
Add hardware plugins to installation doc ( #22732 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-12 17:14:46 -07:00
fde0b611a3
[Model] Decouple glm4v ( #22751 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-12 17:13:17 -07:00
d0a6301588
Fix Transformers backend tensor parallel for multimodal models ( #22673 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-12 17:12:30 -07:00
45c3936e94
[Docs] Hide the navigation and toc sidebars on home page ( #22749 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-12 17:12:26 -07:00
ba81acbdc1
[Bugfix] Bump DeepGEMM Version to Fix SMXX Layout Issues ( #22606 )
...
Signed-off-by: frankwang28 <frank.wbb@hotmail.com >
2025-08-12 15:43:06 -07:00
53c730286c
[Misc] parametrize 'dtype' in test_flash_mla ( #22641 )
...
Signed-off-by: RUTHLESS-BOT <wujiafeng@cmbchina.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-12 16:31:48 -04:00
6534d2fc97
Fix torch version check for SM100 mxfp4 ( #22535 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-08-12 12:54:42 -07:00
422f22e012
[CI][Nixl] Check kv cache layout during handshake ( #22745 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-12 12:53:52 -07:00
6bd8ebf026
[Kernel][AMD] Avoid D2H copy and cumsum kernel ( #22683 )
...
Signed-off-by: Xiaozhu <mxz297@gmail.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-12 12:53:36 -07:00
dab4f9f764
[Chore] Update CODEOWNERS to include @yewentao256 for CUDA kernels, attention backends, quantization, and related tests ( #22741 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-13 00:50:31 +08:00
c42fe0b63a
Add more test scenario for tensor schema ( #22733 )
...
Signed-off-by: teekenl <teekenlau@gmail.com >
2025-08-12 16:34:41 +00:00
5a4b4b3729
Add: SupportsEagle3 interface for explicit EAGLE3 support ( #22642 )
...
Signed-off-by: Rahul Tuli <rtuli@redhat.com >
2025-08-12 09:24:52 -07:00
e5d3d63c42
[Benchmark] Fix terminal colors in benchmark_serving_multi_turn (python 3.12) ( #22730 )
...
Signed-off-by: daniels <daniels@pliops.com >
2025-08-12 14:41:37 +00:00
3d9d40efde
[Bugfix][CI] Fix test_remote_decode_lifecycle.py::test_short_prompt_lifecycle ( #22727 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-12 07:30:17 -07:00
67c153b88a
Fix Llama4 FlashInfer FP4 MoE issues ( #22511 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
2025-08-12 05:50:59 -07:00
f7ad6a1eb3
[CI Failure] fix tests/entrypoints/openai/test_skip_tokenizer.py ( #22708 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-12 05:42:58 -07:00
80bb1e8afe
Officially support SmolLM3 using the Transformers backend ( #22665 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-12 05:38:48 -07:00
d030b01548
[BugFix][Nixl][PD] Fix heterogenous TP ( #22663 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-08-12 05:37:30 -07:00
767e63b860
[Docs] Improve docs navigation ( #22720 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-12 04:25:55 -07:00
007dd90859
[gpt-oss] Enable gpt-oss on ampere ( #22714 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-12 03:21:44 -07:00
b8a9d0e429
[Misc] remove GH discussions link ( #22722 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-12 03:15:33 -07:00
50f2aae1b4
[LMCache][Example] Align the PYTHONHASHSEED for prefillers and decoders for KV chunks hashing ( #21161 )
...
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com >
2025-08-12 02:05:14 -07:00
46ae7f6666
[Bugfix] Mamba2 SSD varlen bug fix initstates decay, improve test, assert chunk pwr 2 ( #21783 )
...
Signed-off-by: Rishi Astra <40644327+RishiAstra@users.noreply.github.com >
2025-08-12 02:04:37 -07:00
1ece7f30ba
Fix: AWQ Marlin get_quant_method does not recognize "modules_to_not_convert" ( #21888 )
...
Signed-off-by: JunHowie <JunHowie@aliyun.com >
Co-authored-by: JunHowie <JunHowie@aliyun.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-12 02:03:53 -07:00
bc8372efc3
[Bugfix] Fix erroneous randomly generated cases in bad word testing ( #22170 )
...
Signed-off-by: phantomlei <phantomlei3@gmail.com >
2025-08-12 02:03:22 -07:00
8d17fa633e
[V0] Correct CUDA Graph capture for encoder-decoder models ( #22630 )
2025-08-12 02:01:08 -07:00
9f909b8996
[New Model] Support Command-A-Vision ( #22660 )
...
Signed-off-by: donglu <donglu@cohere.com >
2025-08-12 01:39:54 -07:00
59f3b93636
[DOC] update v1_guide with INTEL HW ( #22679 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2025-08-12 01:22:49 -07:00
78077d5417
Move SchedulerConfig from config/__init__.py to config/scheduler.py ( #22626 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-12 00:23:49 -07:00
6d729c43fb
[Bugfix] Fix ModernBert load & Enable sliding window attention for bidirectional attention. ( #22637 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-12 00:23:17 -07:00
2f4657952b
[doc] Update x86 CPU-inference installation doc to reflect optionality of AVX512f ( #22707 )
...
Signed-off-by: Sooraj S <94284954+sooraj-satheesh@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
2025-08-12 00:21:08 -07:00
3a7e3bbdd2
[Doc] Added unmentioned required option "method" in the usage of EAGLE-3 based models ( #21737 )
...
Signed-off-by: Dilute-l <dilu2333@163.com >
Co-authored-by: Dilute-l <dilu2333@163.com >
2025-08-12 00:14:51 -07:00
4fbd8bb597
Fix passing SpeculativeConfig from the CLI ( #22652 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-11 22:13:32 -07:00
ad344ef552
[gpt-oss] Small bug fixes for frontend ( #22512 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-11 22:04:38 -07:00
bbaf9e9cb1
[gpt-oss] Fix mxfp4 support ( #22700 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-11 21:22:26 -07:00
4678503476
Migrate MiniCPMVImageInputs to TensorSchema ( #21939 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-11 20:43:37 -07:00
93d0652433
[CI] Increase timeout for test_completion_with_image_embeds ( #22670 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-11 20:31:36 -07:00
ea1292ad3e
[CI Failure] Use float32 for tests/entrypoints/openai/test_audio.py ( #22686 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-11 20:20:42 -07:00
dc5e4a653c
Upgrade FlashInfer to v0.2.11 ( #22613 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-08-11 19:58:41 -07:00
839ab00349
Re-enable Xet on TPU tests now that hf_xet has been updated ( #22666 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-11 19:54:40 -07:00
9b94d6ec8f
Enable 4bit bnb prequant MOE ( #21548 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-11 19:02:14 -07:00
1891a265d3
[gpt-oss] Add test for response API + harmony (but skipped) ( #22554 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-11 17:47:24 -07:00
95a935fc48
[gpt-oss] Support streaming in response API ( #22431 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-11 17:46:59 -07:00
458e74eb90
Support more parallel styles in Transformers backend TP ( #22651 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-11 10:42:48 -07:00
65abe111a3
[CI] Skip Tree Attn Test in test_max_len.py to unblock CI ( #22664 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-08-11 10:36:05 -07:00
807d21b80d
[BugFix] [Spec Decode] Remove LlamaForCausalLMEagle3 to fix CI ( #22611 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-11 10:31:36 -07:00
c90fb03df5
[CI/Build] Skip Mllama HF runner tests with Transformers v4.55.0 ( #22659 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-11 10:00:58 -07:00
84cf78acee
[Model] Pooling models default to using chunked prefill & prefix caching if supported. ( #20930 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-11 09:41:37 -07:00
16fb668b61
fix: NIXL connector transfers partial block to pass full multi-modal context ( #21074 )
...
Signed-off-by: GuanLuo <gluo@nvidia.com >
2025-08-11 09:40:55 -07:00
f7dcce7a4a
[Feature] Add VLLM_USE_DEEP_GEMM_E8M0 Env to Control E8M0 Scale ( #21968 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-11 09:39:08 -07:00
8e13d9fe6d
[Misc] Further clean up some redundant config definitions ( #22649 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-11 09:22:25 -07:00
3fa5b25845
Document aarch64 CPU support works ( #22646 )
...
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-08-11 07:22:45 -07:00
14a5d903ab
[Model] NemotronH Support ( #22349 )
...
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com >
2025-08-11 04:09:24 -07:00
951b038298
[Misc] Move jsontree to utils ( #22622 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-11 03:49:32 -07:00
ebf7605b0d
[Misc] Move tensor schema tests ( #22612 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-11 00:15:27 -07:00
bc1d02ac85
[Docs] Add comprehensive CLI reference for all large vllm subcommands ( #22601 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-11 00:13:33 -07:00
1e55dfa7e5
[BUGFIX] KeyError 'layers.14.mlp.gate.g_idx' for Qwen3-MoE with GPTQ on ROCm ( #22017 )
2025-08-11 00:13:30 -07:00
384a052971
[Misc] benchmark_moe supports expert parallel ( #22251 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-11 00:13:27 -07:00
39052dbca8
Support token_type_ids in V1 with less code changes ( #21985 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-10 22:54:59 -07:00
9c97a1c349
[ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Module. ( #22521 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-08-10 22:52:34 -07:00
f919d4cb8f
[BugFix] Fix logits repetition penalty cuda check ( #22592 )
2025-08-10 22:52:31 -07:00
afa5b7ca0b
[Misc][gpt-oss] guard import when triton kernel when not up to date ( #22584 )
...
Signed-off-by: zhewenli <zhewenli@meta.com >
2025-08-10 21:29:35 -07:00
1b99028069
[Misc][gpt-oss] Add rules to label gpt-oss related PRs ( #22600 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2025-08-10 19:49:51 -07:00
5898b135ab
[BugFix] Fix KVConnectorOutput TPU breakage ( #22598 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-10 19:33:48 -07:00
b799f4b9ea
[CI/Build] Fix tensorizer test for load_format change ( #22583 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-10 19:30:00 -07:00
06da44f0cb
Migrate LlavaImageInputs to TensorSchema ( #21770 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-10 19:29:19 -07:00
a554991748
Migrate LlavaNextVideoPixelInputs to TensorSchema ( #21843 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-10 19:29:16 -07:00
d1af8b7be9
enable Docker-aware precompiled wheel setup ( #22106 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
2025-08-10 16:29:02 -07:00
68b254d673
Fix TensorSchema validation test for symbolic dims ( #22366 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-08-10 17:16:44 +00:00
8c50d62f5a
Remove redundant row_indices unsqueeze operation in MiniCPMO ( #22528 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-08-10 09:20:00 -07:00
b4e2916721
Migrate LlavaNextImageInputs to TensorSchema ( #21774 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-10 09:05:21 -07:00
65a7917be4
Fix(benchmarks): allow multiple mm contents in OpenAI Chat Completion Benchmarks ( #22534 )
...
Signed-off-by: breno.skuk <breno.skuk@hcompany.ai >
2025-08-10 09:03:15 -07:00
b76753f0b5
[Bugfix][Kernel] Support partial rotary embedding for MRoPE triton kernel ( #22593 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-10 09:00:36 -07:00
b81fe83b2c
[doc] add alibaba cloud as sponsor ( #22597 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-08-10 23:13:47 +08:00
0757551c96
[doc] add beijing meetup links ( #22596 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-08-10 22:51:36 +08:00
8290d15d2c
Move CacheConfig from config/__init__.py to config/cache.py ( #22586 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-10 07:36:40 -07:00
049c245143
[Misc] Replace flaky image urls in pixtral test ( #22574 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-10 06:18:21 -07:00
00976db0c3
[Docs] Fix warnings in docs build ( #22588 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-10 05:49:51 -07:00
d411df0296
[Misc] Further refine type annotations in parallel state ( #22499 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-10 05:49:48 -07:00
010e0e39ea
[Doc] Fix API doc link in side navigation ( #22585 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-10 01:35:22 -07:00
326976291b
[Misc] code clean duplicate set_current_vllm_config in _set_vllm_config ( #22566 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-10 00:08:48 -07:00
7e8d685775
[Minor] Fix pre-commit error on main ( #22579 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-10 00:08:23 -07:00
c49848396d
Refactor sliding window configuration to Transformers best practice ( #21927 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-09 20:50:48 -07:00
2a84fb422f
[TPU] kv cache update kernel doesn't need to be padded slices to multiple of num_slices_per_block ( #22394 )
...
Signed-off-by: Chengji Yao <chengjiyao@gmail.com >
Co-authored-by: Chengji Yao <chengjiyao@gmail.com >
2025-08-09 20:49:04 -07:00
534c45b962
Improve fast_topk function with type hints and documentation ( #22530 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-08-09 20:25:42 -07:00
3d7363e61c
[Config] add "qwen" as a native eagle3 target supported model ( #22333 )
...
Signed-off-by: lechen <lecself@163.com >
Signed-off-by: LeChen <lecself@163.com >
2025-08-09 20:21:05 -07:00
0c5254b82a
[oss] Init gpt-oss bf16 support ( #22508 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-09 20:19:13 -07:00
61f67d8acd
[V1] [Hybrid] Enable Full CUDA Graph (decode-only) for Mamba layers ( #21401 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-09 20:16:11 -07:00
42172ad18f
[FEAT] [Performance] Add triton mrope to replace the torch code path ( #22375 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-08-09 11:50:03 -07:00
fbd8595c5c
[Bugfix] Fix basic models tests hanging due to mm processor creation ( #22571 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-09 11:42:21 -07:00
5a16fa614c
[Model] Gemma3n MM ( #20495 )
...
Signed-off-by: ShriKode <shrikode@gmail.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: ShriKode <shrikode@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-08-09 09:56:25 -07:00
2d18256e47
Move ParallelConfig from config/__init__.py to config/parallel.py ( #22565 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-09 08:33:46 -07:00
56186474f6
[Docs] Reduce noise in docs and --help from the JSON tip ( #22567 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-09 08:31:32 -07:00
1bf5e1f25b
[CI] [Hybrid] Speed up hybrid models test by removing large models ( #22563 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-09 02:04:42 -07:00
a6022e6fbc
GLM-4.5V with new class name at transformers ( #22520 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-09 00:50:21 -07:00
2be07a0db1
Update docs for Minimax-Text support ( #22562 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-09 00:18:18 -07:00
0edc0cd52b
[Bugfix] Fix CI moe kernel failure ( #22556 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-09 00:03:29 -07:00
7920e9b1c5
[Bugfix] Fix failing GPT-OSS initialization test ( #22557 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-09 00:03:26 -07:00
b7c0942b65
[ROCm][Misc] Rename the context_len to seq_len in ROCm custom paged attention kernel ( #22097 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-08-08 23:15:06 -07:00
9a0c5ded5a
[TPU] Add support for online w8a8 quantization ( #22425 )
...
Signed-off-by: Kyuyeun Kim <kyuyeunk@google.com >
2025-08-08 23:12:54 -07:00
10a02535d4
Fix loading of quantized BigCode models ( #22463 )
...
Signed-off-by: Eldar Kurtic <eldar@neuralmagic.com >
2025-08-08 23:12:12 -07:00
65552b476b
[Misc] Use config definitions from Transformers library ( #21913 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-08 23:10:51 -07:00
7ad7adb67f
v1: Pass KVConnectorOutput to scheduler-side ( #22157 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-08-08 23:09:51 -07:00
6ade99eafa
[V1] [Hybrid] Support Minimax-Text-01 in V1 ( #22151 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-08 23:08:48 -07:00
3157aebb63
[Log] Add Warning for Deprecation of DeepGEMM old version ( #22194 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-08 23:07:48 -07:00
8a0ffd6285
Remove mamba_ssm from vLLM requirements; install inside test container using --no-build-isolation ( #22541 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-08 23:05:32 -07:00
23472ff51c
[Doc] Add usage of implicit text-only mode ( #22561 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Flora Feng <4florafeng@gmail.com >
2025-08-08 23:04:19 -07:00
08b751ba74
Implicit language-model-only mode via limit-mm-per-prompt ( #22299 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Andy Xie <andy.xning@gmail.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
Signed-off-by: Shu Wang <shuw@nvidia.com >
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: XIn Li <xinli@nvidia.com >
Signed-off-by: Junhao Li <junhao@ubicloud.com >
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
Signed-off-by: zitian zhao <zitian.zhao@tencentmusic.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com >
Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com >
Signed-off-by: Linkun <github@lkchen.net >
Co-authored-by: Ning Xie <andy.xning@gmail.com >
Co-authored-by: TJian <tunjian.tan@embeddedllm.com >
Co-authored-by: Andrew Sansom <andrew@protopia.ai >
Co-authored-by: Zhiyu <zhiyuc@nvidia.com >
Co-authored-by: Shu Wang <shuw@nvidia.com >
Co-authored-by: XIn Li <xinli@nvidia.com >
Co-authored-by: Junhao Li <streaver91@gmail.com >
Co-authored-by: Chauncey <chaunceyjiang@gmail.com >
Co-authored-by: Yuxuan Zhang <2448370773@qq.com >
Co-authored-by: ZiTian Zhao <zitian.zhao@tencentmusic.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Po-Han Huang (NVIDIA) <53919306+nvpohanh@users.noreply.github.com >
Co-authored-by: iAmir97 <71513472+iAmir97@users.noreply.github.com >
Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Hong Hanh <hanh.usth@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: lkchen <github@lkchen.net >
2025-08-08 22:21:40 -07:00
429e4e2d42
[Bugfix] Fix ModernBert cuda graph capturing in v1 ( #21901 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-08 22:17:22 -07:00
35afe1b30b
[BugFix] [P/D] Handle lookahead token count edge-case with Eagle Spec Decoding and P/D ( #22317 )
...
Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com >
Signed-off-by: Pradyun92 <142861237+Pradyun92@users.noreply.github.com >
Co-authored-by: Pradyun Ramadorai <pradyunr@amazon.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2025-08-08 17:04:15 -07:00
81c57f60a2
[XPU] Upgrade to torch 2.8 for XPU ( #22300 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-08 17:03:45 -07:00
311d875614
Drop flaky test_healthcheck_response_time ( #22539 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-08-08 16:56:47 -07:00
e3edc0a7a8
Extract CompilationConfig from config.py ( #22524 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-08 16:34:25 -07:00
baece8c3d2
[Frontend] Add unix domain socket support ( #18097 )
...
Signed-off-by: <yyweiss@gmail.com >
Signed-off-by: yyw <yyweiss@gmail.com >
2025-08-08 16:23:44 -07:00
2fcf6b27b6
[Docs] fix broken links in metrics.md ( #22315 )
...
Signed-off-by: Guy Stone <guys@spotify.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-08 16:22:35 -07:00
41b9655751
Skip Qwen 1 in CI because remote code is no longer compatible with Transformers ( #22536 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-08 16:20:58 -07:00
bd875d2eb7
[Bugfix] Update FA commit hash ( #22546 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-08 16:10:25 -07:00
f703b923f3
[Misc] DeepGEMM : Avoid JIT generation in the hot-path ( #22215 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-08-08 16:09:59 -07:00
cd9b9de1fb
[BugFix] Fix IMA FlashMLA full cuda-graph and DP + Update FlashMLA ( #21691 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-08-08 16:09:42 -07:00
fe6d8257a1
[gpt-oss] Support tool call and implement MCP tool server ( #22427 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-08 15:06:37 -07:00
e290594072
[Docs] Rename “Distributed inference and serving” to “Parallelism & Scaling” ( #22466 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-08-08 19:26:21 +00:00
f756a682d9
[gpt-oss] guard import when triton kernel is not installed ( #22529 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-08 11:18:33 -07:00
f0964e29cb
[Benchmark] Add benchmark tool for multi turn conversations ( #20267 )
2025-08-08 10:28:50 -07:00
e789cad6b8
[gpt-oss] triton kernel mxfp4 ( #22421 )
...
Signed-off-by: <zyy1102000@gmail.com >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-08 08:24:07 -07:00
e5ebeeba53
Remove exception for Python 3.8 typing from linter ( #22506 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-08 03:06:46 -07:00
7be7f3824a
[Docs] Improve API docs (+small tweaks) ( #22459 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-08 03:02:51 -07:00
ccdae737a0
[BugFix] Don't cancel asyncio tasks directly from destructors ( #22476 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-08 01:13:18 -07:00
904063907c
[Misc] fix openai version ( #22485 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-08 01:12:54 -07:00
43c4f3d77c
[Misc] Begin deprecation of get_tensor_model_*_group ( #22494 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-08 01:11:54 -07:00
1712543df6
[CI/Build] Fix multimodal tests ( #22491 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-08 00:31:19 -07:00
808a7b69df
[bench] Fix benchmark/serve.py to ignore unavailable results ( #22382 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-08-07 23:15:50 -07:00
099c046463
[Doc] Sleep mode documentation ( #22310 )
...
Signed-off-by: iAmir97 <Amir.balwel@embeddedllm.com >
Signed-off-by: iAmir97 <71513472+iAmir97@users.noreply.github.com >
Co-authored-by: iAmir97 <Amir.balwel@embeddedllm.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Hong Hanh <hanh.usth@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-08-08 12:25:18 +08:00
af473f0a85
[bugfix] Fix Llama3/4 issues caused by FlashInfer 0.2.10 ( #22426 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
2025-08-07 20:25:01 -07:00
157f9c1368
Fix pre-commit ( #22487 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-07 20:21:54 -07:00
6f287915d8
Optimize MiniCPMO mask creation with vectorized implementation ( #22464 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
Signed-off-by: zitian zhao <zitian.zhao@tencentmusic.com >
2025-08-07 20:18:50 -07:00
c152e2a8a0
not tie_word_embeddings for glm-4.5 and glm-4.5v ( #22460 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-08-07 19:37:23 -07:00
17eaaef595
[Bugfix] Fix RuntimeError: Index put requires the source and destination dtypes match ( #22065 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-08-07 19:20:21 -07:00
3303f134e0
[Kernel] Add support for block FP8 on SM120 (NVIDIA 5090 and RTX PRO 6000) ( #22131 )
...
Signed-off-by: Junhao Li <junhao@ubicloud.com >
2025-08-07 19:18:28 -07:00
b2c8ce57c6
Fix Flashinfer CUTLASS MOE Allgather ( #21963 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com >
2025-08-07 19:18:25 -07:00
a3b9c17b56
Support Tensorrt-LLM MoE fp4 for low-latency ( #21331 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com >
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
Signed-off-by: Shu Wang. <shuw@nvidia.com >
Signed-off-by: XIn Li <xinli@nvidia.com >
Co-authored-by: XIn Li <xinli@nvidia.com >
2025-08-07 19:18:22 -07:00
d57dc2364e
Add ModelOpt Qwen3 nvfp4 support ( #20101 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
2025-08-07 19:18:19 -07:00
e2c8f1edec
[PERF] Use pybase64 to more quickly decode prompt embeddings ( #22469 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
2025-08-07 19:15:32 -07:00
1ee5ead5f8
[ROCm] [V1] [SpecDec] Enable Speculative Decoding on ROCm V1 Engine ( #21496 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-08-07 19:13:17 -07:00
acf8aeb79e
[Misc] normalize multiprocessing Queue usage ( #22371 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-08 01:57:27 +00:00
7e3a8dc906
Remove from_dict from SpeculativeConfig ( #22451 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-07 10:13:04 -07:00
139d155781
[Frontend] Use engine argument to control MM cache size ( #22441 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-07 09:47:10 -07:00
8c9da6be22
[Core] Simplify mm processing cache ( #22457 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-07 09:47:07 -07:00
399d2a10e2
Fix pre-commit error in main ( #22462 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-07 08:54:39 -07:00
4815b00f54
[gpt-oss] Generate ResponseOutputItem from Harmony Message ( #22410 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-07 08:33:25 -07:00
4da8bf20d0
[Tool] Fix auto tool call ( #22434 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-07 07:03:38 -07:00
7e0b121812
[Bugfix] Add missing packed_modules_mapping to DeepseekV2ForCausalLM ( #22352 )
...
Signed-off-by: Felix Marty <Felix.Marty@amd.com >
2025-08-07 06:30:48 -07:00
766bc8162c
[Core] Store only the keys for multi-modal data in P0 ( #22198 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-07 01:45:04 -07:00
289b18e670
[Docs] Update features/disagg_prefill, add v1 examples and development ( #22165 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-08-07 00:59:23 -07:00
35171b1172
[Doc] update docs for nightly benchmarks ( #12022 )
...
Signed-off-by: Andrew Chan <andrewkchan.akc@gmail.com >
2025-08-07 00:29:45 -07:00
a2c6696bfe
[Docs] Factor out troubleshooting to its own guide; add section for Ray Observability ( #21578 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-08-07 00:29:13 -07:00
5e8398805e
[Doc] Fix link to prefix caching design ( #22384 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-07 00:28:15 -07:00
136825de75
[Misc] Enhance code formatting in mxfp4.py ( #22423 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-07 00:26:24 -07:00
c2dba2dba8
Add H20-3e fused MoE kernel tuning configs for GLM-4.5 ( #22433 )
...
Signed-off-by: shaojunqi <shaojunqi.sjq@alibaba-inc.com >
Co-authored-by: shaojunqi <shaojunqi.sjq@alibaba-inc.com >
2025-08-07 00:24:47 -07:00
434d2f3f7a
[Docs] Add missing dependency for docs build ( #22435 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-07 00:22:07 -07:00
8e8e0b6af1
feat: Add --enable-log-outputs flag for logging model generations ( #20707 )
...
Signed-off-by: Adrian Garcia <adrian.garcia@inceptionai.ai >
2025-08-06 23:10:13 -07:00
82216dc21f
[Misc] Support routing logic simulation ( #21990 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-06 23:06:20 -07:00
370661856b
[Frontend] Update OpenAI error response to upstream format ( #22099 )
...
Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com >
2025-08-06 23:06:00 -07:00
cbc8457b26
[Model] Switch to Fused RMS norm in Qwen2.5_VL model. ( #22184 )
...
Signed-off-by: kf <kuanfu.liu@embeddedllm.com >
Signed-off-by: tjtanaavllm <tunjian.tan@amd.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: kf <kuanfu.liu@embeddedllm.com >
2025-08-06 23:05:24 -07:00
4d4297e8fe
[Bench] Split serve.py:main into sync/async versions ( #22405 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-08-06 23:05:07 -07:00
2a4c825523
[CI] Skip the pooling models that do not support transformers v4.55 ( #22411 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-06 23:05:03 -07:00
4be02a3776
[Bugfix] EPLB load statistics problem ( #22167 )
...
Signed-off-by: ycyaw66 <497410282@qq.com >
Signed-off-by: David Chen <530634352@qq.com >
Co-authored-by: ycyaw66 <497410282@qq.com >
2025-08-07 04:07:54 +00:00
f6278b6243
[gpt-oss] Convert user input to harmony format ( #22402 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-06 20:56:02 -07:00
ad6c655dde
preload heavy modules when mp method is forkserver ( #22214 )
...
Signed-off-by: Lionel Villard <villard@us.ibm.com >
2025-08-06 20:33:24 -07:00
14bcf93a6a
Optimize logger init performance by using module-level constants ( #22373 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-08-06 20:32:19 -07:00
ecbea55ca2
Update hf_xet pin to resolve hangs ( #22356 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-06 20:31:41 -07:00
609b533cb6
[Bugfix] Add proper comparison for package versions ( #22314 )
...
Signed-off-by: Syed Muhammad Bin Asif <syedmba7@connect.hku.hk >
2025-08-06 20:31:03 -07:00
5e9455ae8f
[Bugfix]: Fix the streaming output for function calls in the minimax ( #22015 )
...
Signed-off-by: QscQ <qscqesze@gmail.com >
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-08-06 20:30:27 -07:00
a00d8b236f
Use float32 for test_completion.py ( #22385 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-08-07 11:07:47 +08:00
04cf435d95
[Bugfix] Fix wrong method name in Intern-S1 image processor ( #22417 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-06 20:05:20 -07:00
7377131a2c
[Qwen3] Enable dual-chunk-attention support for Qwen3 models. ( #21924 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-08-06 19:58:08 -07:00
6b47ef24de
[XPU]Fix flash_attn_varlen_func interface on xpu ( #22350 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-08-06 19:28:11 -07:00
1dc8a70b6d
[Attention] Support multiple attention metadata builders per kv_cache_spec + proper local attention no hybrid kv cache fix ( #21588 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-08-06 18:40:52 -07:00
f825c6bd22
Support encoder_only attention for FlexAttention ( #22273 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-08-06 18:37:14 -07:00
41b67f4263
[model] Support MiniCPM-V 4.0 ( #22166 )
...
Co-authored-by: imning3 <hbning@pku.edu.cn >
2025-08-06 18:35:46 -07:00
e8961e963a
Update flashinfer-python==0.2.10 ( #22389 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-06 18:10:24 -07:00
9a3835aaa9
Fix trtllm-gen attention env and add attention sink ( #22378 )
...
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Signed-off-by: Lain <fusiyuan2000@hotmail.com >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-06 18:07:41 -07:00
5c7cc33f4d
[gpt-oss] fix model config with hf_config ( #22401 )
...
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-06 18:04:04 -07:00
19c9365aa4
[gpt-oss] add demo tool server ( #22393 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-06 17:47:14 -07:00
eec890c1c1
[Bug] Fix B200 DeepGEMM E8M0 Accuracy Issue ( #22399 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-06 17:03:53 -07:00
46a13949d5
[v1] - Mamba1 Attention Metadata ( #21249 )
...
Signed-off-by: asafg <asafg@ai21.com >
Co-authored-by: asafg <asafg@ai21.com >
2025-08-06 17:03:42 -07:00
31f09c615f
[gpt-oss] flashinfer mxfp4 ( #22339 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-08-06 12:37:27 -07:00
31f5dc5b2a
[gpt-oss] Enhance error msg on attention sink init ( #22335 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-08-06 11:41:42 -07:00
ec7cb19224
[gpt-oss] Add loop for built-in tool call ( #22374 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-06 10:32:21 -07:00
2435ea7ed5
[Bugfix] Make condition in triton kernel constexpr ( #22370 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-08-06 10:00:58 -07:00
4a6b72c2ab
[BugFix] Fix triton compile error in kernel_unified_attention_2/3d caused by attention sinks ( #22368 )
...
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com >
2025-08-06 09:47:38 -07:00
b4b9813b5e
add the codes to check AMD Instinct GPU number ( #22367 )
...
Signed-off-by: Zhang Jason <ning.zhang2@amd.com >
2025-08-06 08:58:38 -07:00
2cb6ef8996
[BugFix] Fix FA2 RuntimeError when sinks is provided ( #22365 )
...
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com >
2025-08-06 08:03:03 -07:00
9edd1db02b
[Minor] Fix type ( #22347 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-06 02:22:03 -07:00
f263a4b53f
[gpt-oss] Support chat completion api ( #22342 )
2025-08-06 01:57:39 -07:00
54991c548a
[gpt-oss] add model to supported models doc ( #22336 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
2025-08-06 01:49:44 -07:00
178d03fbd6
[gpt-oss] Add Tool/ConversationContext classes and harmony_utils ( #22340 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-06 01:08:49 -07:00
fa00c5d75b
[Misc] Clean up duplicated hf overrides ( #22311 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-06 07:50:25 +00:00
134a8ee8fd
[gpt-oss] Add openai-harmony as default dependency ( #22332 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-06 00:10:14 -07:00
90ec006937
[gpt-oss] flashinfer attention sink init ( #22330 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
2025-08-05 23:48:19 -07:00
a47e6ffe93
[GptOss] Add GptOss reasoning parser to support structure output ( #22322 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-05 23:39:13 -07:00
98a3a81024
[ROCm] Add attention sink to use_rocm_custom_paged_attention ( #22329 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-05 23:30:38 -07:00
de98252f49
Add GPT-OSS model code and config [1/N] ( #22327 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-05 23:26:00 -07:00
796bae07c5
Update transformers to v4.55 ( #21931 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-05 22:56:14 -07:00
6e20924350
Add attention sink in attention backends ( #22320 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
Co-authored-by: Minseok Lee <47620120+minseokl@users.noreply.github.com >
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com >
2025-08-05 22:37:21 -07:00
dd16bdc798
Increase openai-python version ( #22316 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-05 21:43:21 -07:00
e3c876dca3
Upgrade FA3 for attention sink ( #22313 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-05 21:36:21 -07:00
5d5d419ca6
[Bugfix][CI/Build][ROCm] Make sure to use the headers from the build folder on ROCm ( #22264 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-08-05 20:39:32 -07:00
302962e806
[Bugfix] Skip dead and non-GPU nodes for Ray DP engine allocation ( #22275 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-08-05 20:35:32 -07:00
7e6544c797
[Perf] Parallelize fill_bitmask to accelerate high-throughput guided decoding ( #21862 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-08-05 19:57:49 -07:00
8e6c7e873f
[Bugfix] Fix MoE BNB version ( #22260 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-05 19:56:22 -07:00
6a51530437
[Bugfix] Fix 3D input passed into cutlass_scaled_mm ( #22278 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-06 10:35:20 +08:00
35509fc5be
[Bugfix] Remove faulty test for oot attention backend ( #22286 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-06 00:05:40 +00:00
4b29d2784b
[CI][TPU] Fix docker clean up ( #22271 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-08-05 23:54:56 +00:00
59a0b8554b
[bugfix] fix blackwell deepep installation ( #22255 )
2025-08-06 01:26:09 +08:00
469b3ffaaa
[V1] port xformers backend to v1 ( #21342 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com >
2025-08-05 10:04:46 -07:00
ae87ddd040
[Refactor] Remove Unused Environment Variable VLLM_NO_DEPRECATION_WARNING ( #22199 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-05 09:40:23 -07:00
a7cb6101ca
[CI/Build] Update flashinfer to 0.2.9 ( #22233 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-05 09:39:38 -07:00
c494f96fbc
Use UV_LINK_MODE=copy in Dockerfile to avoid hardlink fail ( #22128 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-05 06:57:10 -07:00
0c275ad5ad
[V0 Deprecation][TPU] Remove V1 flag check from tests ( #22248 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-05 06:53:23 -07:00
74333ae2f6
[Misc] correct static type check for GroupCoordinator ( #21946 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-05 03:17:46 -07:00
83156c7b89
[NVIDIA] Support Flashinfer TRT-LLM Prefill Attention Kernel ( #22095 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-08-05 02:45:34 -07:00
4771df7b2b
[Feature] Non-contiguous Support for FP8 Quantization ( #21961 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-08-05 02:36:43 -07:00
05fae02175
Migrate KimiVLImagePixelInputs to TensorSchema ( #21769 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-08-05 02:36:18 -07:00
d1bf1b9711
[Docs][TPU] Highlight TPU Software version selection ( #22242 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-05 02:33:46 -07:00
586f286789
[Model] Pooling model activation supports per request control by PoolingParams ( #20538 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-08-05 00:37:00 -07:00
811ac13d03
[Core] Factor out common logic for MM budget calculation ( #22228 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-04 23:54:55 -07:00
e79a12fc3a
[UX] Fail if an invalid attention backend is specified ( #22217 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-08-04 23:54:52 -07:00
cdfd6871a5
[Bugfix] Misaligned params in TreeAttentionImpl ( #22226 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-04 22:40:09 -07:00
4b3e4474d7
Optimize configuration access with LRU cache in custom ops ( #22204 )
...
Signed-off-by: zitian zhao <zitian.zhao@tencentmusic.com >
2025-08-04 21:43:24 -07:00
bd3db7f469
[Misc] log more detailed message for ensure_model_parallel_initialized ( #22144 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-04 19:36:55 -07:00
29b97c0995
[Doc] add backend to doc string of initialize_model_parallel ( #22142 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-04 19:36:20 -07:00
7b455cf1c0
[Misc] Remove pass_config from CompilationConfig dump_json excluded ( #21911 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-08-04 19:17:18 -07:00
8a6e108e76
fix: kimi_k2 return empty tool call list ( #22149 )
...
Signed-off-by: tlipoca9 <tlipoca9@gmail.com >
2025-08-04 19:15:31 -07:00
d7b28f3415
[Log] DeepGEMM Update Log for Unaligned Problem Size ( #22208 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-04 19:13:19 -07:00
6fa41e0c32
self.gate dtype update for GLM-4.5 ( #22203 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-08-04 19:12:38 -07:00
031ca762d7
[ROCm][Bugfix] Compilation passes fix ( #22202 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-08-04 19:12:28 -07:00
6ad6b8e115
[FEAT] Refactor ROPE into module ( #22192 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-08-04 19:12:16 -07:00
f4f4e7ef27
[V0 deprecation][P/D] Deprecate v0 KVConnectorBase code (1/2) ( #21785 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-08-04 19:11:33 -07:00
5ea71ff46f
[V1] reduce block size for tree attention correctness test to fix 'ou… ( #22207 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com >
2025-08-04 19:11:06 -07:00
7175817637
Revert "[Bugfix] V1 Fix the cursor leakage issue during request scheduling." ( #22223 )
2025-08-04 18:37:06 -07:00
2dffac464c
[Bugfix] V1 Fix the cursor leakage issue during request scheduling. ( #21173 )
...
Signed-off-by: CLFutureX <775523362@qq.com >
2025-08-04 18:34:10 -07:00
bdcb42e45d
[NVIDIA] Auto detect modelopt quant and fix DSR1-FP4 weight loading ( #22073 )
2025-08-04 21:02:55 -04:00
c09efff976
[Bugfix][V1][P/D]Fix the uneven polling issue in the toy proxy for P2pNcclConnector ( #21819 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-08-04 20:17:05 +00:00
309c1bb822
[Bug] Update auto_tune.sh to separate benchmarking and profiling. ( #21629 )
...
Signed-off-by: Eric Hanley <ericehanley@google.com >
2025-08-04 15:12:06 +00:00
9af654cc38
[Responses API] Ignore store=True and process the request by default ( #22185 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-04 05:12:48 -07:00
a5fff3bd49
Fix Arcee model weight loading: Add custom load_weights ( #21725 )
...
Signed-off-by: alyosha-swamy <raghav@arcee.ai >
2025-08-04 04:09:56 -07:00
1539ced93a
[Doc] Update pooling model docs ( #22186 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-04 03:37:06 -07:00
54de71d0df
[Sampler] Support returning all logprobs or logits ( #21792 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-04 03:04:12 -07:00
fed5849d3f
[Bugfix] Fix failing GGUF models test ( #22174 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-04 01:27:02 -07:00
c1b4eb048a
[feat] move WEIGHT_SCALE_SUPPORTED into raise block to accelerate RLHF weight loading ( #21164 )
...
Signed-off-by: huangweixiao <huangweixiao@msh.team >
2025-08-04 15:43:06 +08:00
a7b8788d2c
[Misc] Modify the organization of GLM series ( #22171 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-03 23:51:20 -07:00
8ecb3e9e93
[CI Bugfix] Fix wNa16 kernel not found for test_shared_storage_connector_hashes ( #22163 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-08-03 22:19:04 -07:00
e5949e5ae0
Remove index_put from MM embeddings merging ( #22105 )
...
Co-authored-by: Chenxi Yang <cxyang@meta.com >
2025-08-03 22:15:14 -07:00
49bcd893e7
[refactor] improve ConstantList exception specificity ( #22156 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-08-03 22:14:49 -07:00
aa7012eb6d
Add tree attention backend for v1 (part 1) ( #20401 )
...
Signed-off-by: Giancarlo Delfin <gdelfin@meta.com >
2025-08-03 22:13:26 -07:00
c2e75b3c11
remove duplicate code within cleanup_dist_env_and_memory ( #22147 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-03 20:03:58 -07:00
0d7db16a92
[PD] add test for chat completions endpoint ( #21925 )
...
Signed-off-by: Abirdcfly <fp544037857@gmail.com >
2025-08-03 19:57:03 -07:00
845420ac2c
[RLHF] Fix torch.dtype not serializable in example ( #22158 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-04 02:43:33 +00:00
e27d25a0dc
[fix] Correct assertion syntax error in attention utils ( #22154 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-08-03 19:24:02 -07:00
6f5478298d
Use aiohttp connection pool for benchmarking ( #21981 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-08-03 19:23:32 -07:00
6a39ba85fe
[Bugfix] Fix failing multimodal standard test ( #22153 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-03 19:04:38 +00:00
d3c18c9cb0
fuse fp32 for GLM-4.5 e_score_correction_bias ( #22143 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-08-03 09:04:54 -07:00
83f7bbb318
Add chat doc in quick start ( #21213 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-08-03 07:47:55 -07:00
b5dfb94fa0
[CI/Build][Bugfix] Fix Qwen2.5 tests in CPU CI via fallback silu_and_mul to torch native implementation ( #22145 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-08-03 05:34:04 -07:00
6d98843b31
[Responses API] Disable response store by default ( #22137 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-03 04:04:21 -07:00
aefeea0fde
[V1] [P/D] Refactor KV Connector Path ( #21980 )
...
Signed-off-by: David Ben-David <davidb@pliops.com >
Co-authored-by: David Ben-David <davidb@pliops.com >
2025-08-03 04:03:40 -07:00
24d1dffbeb
[executor] feat: add supports_pp attr to executors ( #21786 )
...
Signed-off-by: Haibin Lin <haibin.lin@bytedance.com >
2025-08-03 18:04:45 +08:00
7de45db9a5
[Misc] update doc comment for send ( #22026 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-08-03 00:55:20 -07:00
789562c28c
Support CUTLASS NVFP4 (w4a4) for Blackwell Geforce GPUs (SM120) ( #21309 )
...
Signed-off-by: LopezCastroRoberto <roberto.lopez.castro@udc.es >
2025-08-03 00:54:22 -07:00
3f36c325fa
[Benchmark] Support ready check timeout in vllm bench serve ( #21696 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-08-03 00:52:38 -07:00
3dddbf1f25
[Misc] Add tensor schema test coverage for multimodal models ( #21754 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-03 00:52:14 -07:00
337eb23bcc
[Fix] Fix llama4 modelopt weight loading error ( #22107 )
...
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-08-03 00:50:34 -07:00
2ff46b8826
[Misc] Bump ray to 2.48.0 ( #22123 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-08-02 19:42:00 -07:00
554df8a6a2
Revert "[compile][startup] Disable C++ compilation of symbolic shapes" ( #22122 )
...
Signed-off-by: Xiao Liu <xiszishu@gmail.com >
2025-08-02 09:03:30 -07:00
73e1b9b1d4
[xpu]support moe models on XPU platform ( #21643 )
...
Signed-off-by: yan <yan.ma@intel.com >
Signed-off-by: Yan Ma <yan.ma@intel.com >
2025-08-02 07:49:08 -07:00
4abfd8796f
[V1] [Hybrid] Validate compatibility of attention backend batch reordering at init time ( #21557 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-08-02 05:29:40 -07:00
f5d0f4784f
[Frontend] Improve error message for too many mm items ( #22114 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-02 02:20:38 -07:00
b690e34824
[Model] Mamba2 preallocate SSM output tensor to avoid d2d copy overhead ( #21075 )
...
Signed-off-by: Chih-Chieh Yang <7364402+cyang49@users.noreply.github.com >
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-08-02 01:59:34 -07:00
25373b6c6c
for glm-4.1V update ( #22000 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-08-02 01:46:57 -07:00
58eee5f2e0
[PERF] Use faster way of decode in tokenizer: avoid useless list-to-list conversion ( #20000 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai >
2025-08-02 01:43:52 -07:00
067c34a155
docs: remove deprecated disable-log-requests flag ( #22113 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
2025-08-02 00:19:48 -07:00
c64861d63c
[Bugfix] Mamba2 remove bugged initial state condition in chunk scan ( #22034 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-08-01 23:55:57 -07:00
8564dc9448
Fix test_kv_sharing_fast_prefill flakiness ( #22038 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-08-01 23:55:34 -07:00
4ac8437352
[Misc] Getting and passing ray runtime_env to workers ( #22040 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-08-01 23:54:40 -07:00
d3a6f2120b
[FEAT][ROCm] Enable running Flash Attention as ViT attn backend for Qwen-VL models on ROCm platform. ( #22069 )
...
Signed-off-by: tjtanaavllm <tunjian.tan@amd.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: tjtanaavllm <tunjian.tan@amd.com >
2025-08-01 23:53:18 -07:00
0edaf752d7
[Attention][DBO] Add support for "splitting" the CommonAttentionMetadata ( #21153 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-08-01 19:47:53 -07:00
6e8d8c4afb
[Test] Add Unit Test for Batched DeepGEMM ( #21559 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-02 10:45:46 +08:00
8d524ce79f
[BugFix] Improve internal DP load balancing ( #21617 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-01 19:45:27 -07:00
9f9c38c392
[Speculators][Speculative Decoding] Add Qwen Eagle3 Support ( #21835 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
2025-08-01 19:43:37 -07:00
a65f46be5e
[Misc] DeepGemmExperts : Avoid JIT generation in the hot-path ( #21955 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-08-01 19:42:03 -07:00
57393715e8
[Misc] VLLM_TARGET_DEVICE.lower() ( #22101 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-08-01 19:41:40 -07:00
ee2eb6ecd8
[Model] Qwen2.5 VL SiLU-and-Mul ( #22066 )
...
Signed-off-by: kf <kuanfu.liu@embeddedllm.com >
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: kf <kuanfu.liu@embeddedllm.com >
2025-08-01 19:34:37 -07:00
23322431c8
[V1][CUDA] Full cudagraph support for FlashInfer ( #21367 )
2025-08-01 21:49:34 -04:00
3654847db5
feat: Add Support GPTQ Quantization MOE on ROCM vllm serve ( #21733 )
2025-08-01 21:12:19 -04:00
eefbf4a68b
[Perf] Optimize reshape_and_cache_flash CUDA Kernel ( #22036 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-01 19:18:51 -04:00
88faa466d7
[CI] Initial tests for SM100 Blackwell runner ( #21877 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-01 16:18:38 -07:00
881e1af43a
[BugFix] Harden distributed DP startup ( #21538 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-01 21:40:45 +00:00
d84b97a3e3
Add lora test for tp>1 case for TPU. ( #21970 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-08-01 18:56:08 +00:00
d331759488
Introduce RayPPCommunicator for ray-based PP ( #21660 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-08-01 11:50:58 -07:00
9659bc7f27
[compile][startup] Disable C++ compilation of symbolic shapes ( #20836 )
...
Signed-off-by: Animesh Jain <anijain@umich.edu >
2025-08-01 10:38:52 -07:00
3277e8f9e1
Fix pre-commit failure for SECURTIY.md ( #22102 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-01 10:36:07 -07:00
8d705996df
[Misc] Minor enhancement of benchmark_moe ( #22068 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-02 01:35:30 +08:00
38c8bce8b6
Enable headless models for pooling in the Transformers backend ( #21767 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-01 10:31:29 -07:00
ac45c44d98
[Bugfix] [Performance] DeepEPHighThroughput + DeepSeek : Quant before Dispatch ( #21837 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-08-01 10:14:38 -07:00
d6664664b4
security policy: take 1 ( #21119 )
...
Signed-off-by: Huzaifa Sidhpurwala <huzaifas@redhat.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-08-01 10:09:49 -07:00
b879ecd6e2
[Bugfix] fix when skip tokenizer init ( #21922 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-08-01 10:09:36 -07:00
3f8e952179
[Bugfix] Fix glm4.1v video inference issue ( #22067 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-08-01 09:33:30 -07:00
326a1b001d
Improve documentation of ModelConfig.try_get_generation_config to prevent future confusion ( #21526 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-01 09:32:27 -07:00
2d7b09b998
Deprecate --disable-log-requests and replace with --enable-log-requests ( #21739 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-01 17:16:37 +01:00
97608dc276
[Docs] use uv in CPU installation docs ( #22089 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-08-01 07:55:55 -07:00
3146519add
[BugFix] Don't change title of top-level process ( #22032 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-08-01 07:37:55 -07:00
8026a335a1
[BugFix] Update AttnFusionPass cache key ( #21947 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-08-01 07:11:29 -07:00
a59cd9d9f7
[Refactor] Fix Compile Warning #1444-D ( #21462 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-01 06:10:30 -07:00
5c54d9759d
[Bugfix][PD] set max_completion_tokens=1 if req has this value ( #21841 )
...
Signed-off-by: Abirdcfly <fp544037857@gmail.com >
2025-08-01 06:08:45 -07:00
0a6d305e0f
feat(multimodal): Add customizable background color for RGBA to RGB conversion ( #22052 )
...
Signed-off-by: Jinheng Li <ahengljh@gmail.com >
Co-authored-by: Jinheng Li <ahengljh@gmail.com >
2025-08-01 06:07:33 -07:00
f81c1bb055
[Bugfix] Check NVIDIA artifactory is accessible before using flashinfer cubin kernels ( #21893 )
2025-08-01 08:28:45 -04:00
fb0e0d46fc
Fix get_kwargs for case where type hint is list[Union[str, type]] ( #22016 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-01 05:26:42 -07:00
26b5f7bd2a
[BUG] [ROCm] Fix import bug on ROCm ( #22083 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-08-01 05:25:20 -07:00
dfbc1f8880
[Speculative Decoding] Add speculators config support ( #21345 )
2025-08-01 08:25:18 -04:00
87c94bc879
Revert "Update sampling_metadata.py ( #21937 )" ( #22088 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-01 05:24:46 -07:00
28b18cc741
[Quantization] Enable BNB support for InternS1 ( #21953 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-01 11:09:54 +00:00
4931486988
[Doc] Added warning of speculating with draft model ( #22047 )
...
Signed-off-by: Dilute-l <dilu2333@163.com >
Co-authored-by: Dilute-l <dilu2333@163.com >
2025-08-01 02:11:56 -07:00
0f81b310db
[Misc] Remove upper bound in openai package version ( #22060 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-08-01 02:11:40 -07:00
e6680f9e25
[Bugfix] Add log prefix in non-dp mode engine core ( #21889 )
...
Signed-off-by: wuhang <wuhang6@huawei.com >
2025-08-01 09:04:16 +00:00
27a145e893
[Doc] Add example for Step3-VL ( #22061 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
2025-08-01 08:35:49 +00:00
da31f6ad3d
Revert precompile wheel changes ( #22055 )
2025-08-01 08:26:24 +00:00
98df153abf
[Frontend] Align tool_choice="required" behavior with OpenAI when tools is empty ( #21052 )
...
Signed-off-by: Sungyoon Jeong <sungyoon.jeong@furiosa.ai >
2025-08-01 07:54:17 +00:00
e0f63e4a35
[Core] Avoid repeated len(block_token_ids) check in hash_request_tokens ( #21781 )
...
Signed-off-by: linzebing <linzebing1995@gmail.com >
2025-08-01 00:23:29 -07:00
b4e081cb15
[Bugfix] Disable multi-modal preprocessor cache for DP ( #21896 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-01 08:03:56 +01:00
79731a79f0
[Doc] Fix a syntax error of example code in structured_outputs.md ( #22045 )
...
Signed-off-by: wangzi <3220100013@zju.edu.cn >
Co-authored-by: wangzi <3220100013@zju.edu.cn >
2025-08-01 00:01:22 -07:00
53d7c39271
Update sampling_metadata.py ( #21937 )
...
Signed-off-by: Aviad Rossmann <aviadr@neureality.ai >
2025-07-31 23:23:18 -07:00
61dcc280fa
[Doc] Add Voxtral to Supported Models page ( #22059 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-31 23:10:56 -07:00
0f46a780d4
[Model] [Quantization] Support quantization for Gemma3n ( #21974 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-07-31 22:45:15 -07:00
e1a7fe4af5
[BugFix] fix: aot passes kvcache dtype information ( #19750 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
2025-08-01 05:45:02 +00:00
82de9b9d46
[Misc] Automatically resolve HF processor init kwargs ( #22005 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-31 22:44:10 -07:00
ad57f23f6a
[Bugfix] Fix: Fix multi loras with tp >=2 and LRU cache ( #20873 )
...
Signed-off-by: charent <19562666+charent@users.noreply.github.com >
2025-07-31 19:48:13 -07:00
3700642013
[Refactor] Remove Duplicate per_block_cast_to_fp8, Remove Dependencies of DeepGEMM ( #21787 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-01 01:13:27 +00:00
0bd409cf01
Move flashinfer-python to optional extra vllm[flashinfer] ( #21959 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-31 18:02:11 -07:00
e360316ab9
Add DeepGEMM to Dockerfile in vllm-base image ( #21533 )
...
Signed-off-by: Matthew Bonanni <mbonanni001@gmail.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-31 18:01:55 -07:00
c3e0e9337e
[Feature] Add Flashinfer MoE Support for Compressed Tensor NVFP4 ( #21639 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-31 15:26:11 -07:00
6e672daf62
Add FlashInfer allreduce RMSNorm Quant fusion ( #21069 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Signed-off-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-07-31 13:58:38 -07:00
2dff2e21d9
[Bugfix] Fix MTP weight loading ( #21941 )
2025-07-31 16:33:53 -04:00
71470bc4af
[Misc] Add unit tests for chunked local attention ( #21692 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-31 11:39:16 -07:00
9e0726e5bf
[Meta] Official Eagle mm support, first enablement on llama4 ( #20788 )
...
Signed-off-by: morgendave <morgendave@gmail.com >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-07-31 10:35:07 -07:00
53c21e492e
Update torch_xla pin to 20250730 ( #21956 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-07-31 17:26:43 +00:00
0780bb5783
Removing amdproduction Tests ( #22027 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-07-31 09:53:27 -07:00
58bb902186
fix(setup): improve precompiled wheel setup for Docker builds ( #22025 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
2025-07-31 09:52:48 -07:00
7349d5268b
[ez] Remove a trailing space from compilation/decorators.py ( #22028 )
2025-07-31 09:46:07 -07:00
9484641616
[Model] Add step3 vl ( #21998 )
...
Signed-off-by: oliveryuan <yuansong@step.ai >
Co-authored-by: oliveryuan <yuansong@step.ai >
2025-07-31 23:19:06 +08:00
207b750e19
[NVIDIA] Add SM100 Flashinfer MoE per tensor scale fp8 backend ( #21458 )
...
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-31 06:00:01 -07:00
5daffe7cf6
[BugFix] Fix case where collective_rpc returns None ( #22006 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-31 12:51:37 +00:00
2836dd73f1
[Model][CI] Let more pooling models support v1 ( #21747 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-31 01:51:15 -07:00
d2aab336ad
[CI/Build] get rid of unused VLLM_FA_CMAKE_GPU_ARCHES ( #21599 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
2025-07-31 15:00:08 +08:00
9532a6d563
[Deprecation] Remove deprecated args and methods ( #21907 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-30 23:46:38 -07:00
3e36fcbee6
[Bugfix]: fix metadata file copy in test_sharded_state_loader ( #21830 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-31 06:22:11 +00:00
055bd3978e
[CI Bugfix] Fix CI OOM for test_shared_storage_connector_hashes ( #21973 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-31 11:45:29 +08:00
0f7919fca0
[Misc] Expand SUPPORTED_HIDDEN_SIZES for DeepEP low-latency kernels ( #21818 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-30 20:41:12 -07:00
61445453df
[UX] Rename CUTLASS_MLA_VLLM_V1 to CUTLASS_MLA ( #21966 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-30 20:40:34 -07:00
ec02e536df
[Bugfix] Relax lang pin for voxtral ( #21833 )
...
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-30 20:38:52 -07:00
9cb497bfa3
[Example] Add async_llm_streaming.py example for AsyncLLM streaming in python ( #21763 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-30 18:39:46 -06:00
ca9e2be3ed
[Core] Move EngineCoreRequest to Request conversion out of EngineCore ( #21627 )
...
Signed-off-by: linzebing <linzebing1995@gmail.com >
2025-07-30 15:00:54 -07:00
601f856d56
[Bugfix] Fix None value handling in trace span creation for cancelled requests ( #20272 )
2025-07-30 14:44:02 -07:00
287f527f54
[Feature] Add async tensor parallelism for scaled mm ( #20155 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-07-30 17:23:41 -04:00
f12d9256b3
[Misc] Use dracut on CentOS and skip clone if repo exists for EP kernel installation ( #21635 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-07-30 13:15:06 -07:00
b9b753e7a7
For VLLM_USE_PRECOMPILED, only compiled .so files should be extracted ( #21964 )
2025-07-30 13:04:40 -07:00
56bd537dde
[Misc] Support more collective_rpc return types ( #21845 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-30 10:20:20 -07:00
8f0d516715
[TPU] Support Pathways in vLLM ( #21417 )
...
Signed-off-by: wenxindongwork <wenxindong@google.com >
2025-07-30 10:02:12 -07:00
f4135232b9
feat(distributed): add get_required_kvcache_layout class method to kv connector api ( #20433 )
...
Signed-off-by: wxsm <wxsms@foxmail.com >
2025-07-30 16:41:51 +00:00
4904e53c32
[Bugfix] SharedStorage Connector for V1 PD multimodal ( #21611 )
...
Signed-off-by: fake0fan <645327136@qq.com >
Signed-off-by: herotai214 <herotai214@gmail.com >
Co-authored-by: herotai214 <herotai214@gmail.com >
2025-07-30 09:18:37 -07:00
004203e953
[CI/Build] Fix registry tests ( #21934 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-30 09:10:41 -07:00
5c765aec65
[Bugfix] Fix TypeError in scheduler when comparing mixed request_id types ( #21816 )
...
Signed-off-by: chiliu <chiliu@paypal.com >
Co-authored-by: chiliu <chiliu@paypal.com >
2025-07-30 08:54:44 -07:00
ad510309ee
Override attention metadata for fast prefill in some KV sharing setups ( #21590 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-30 08:54:15 -07:00
366f6b3a4d
[Bugfix] Fix multi-api server not working for text models ( #21933 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-30 08:42:05 -07:00
6e599eebe8
[Bugfix] Fix OOM tests in initialization test ( #21921 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-30 07:35:47 -07:00
88edf5994c
[Docs] Reduce the size of the built docs ( #21920 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-30 07:35:08 -07:00
ff08e51940
[NVIDIA] Fix Llama4 Scout FP4 functionality issues ( #21499 )
...
Signed-off-by: Po-Han Huang <pohanh@nvidia.com >
2025-07-30 07:33:40 -07:00
8f4a1c9a04
[Misc] Improve code readability of KVCacheManager ( #21673 )
...
Signed-off-by: tanruixiang <tanruixiang0104@gmail.com >
Signed-off-by: Ruixiang Tan <819464715@qq.com >
Signed-off-by: GitHub <noreply@github.com >
2025-07-30 07:20:43 -07:00
36ede45989
Reduce time wasted in GitHub Actions using concurrency ( #21919 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-30 07:18:02 -07:00
0e40b26073
[CI/Build] Only run markdownlint in CI ( #21892 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-30 07:17:14 -07:00
0271c2ff2f
[Test] Add Benchmark and Unit Test for per_token_group_quant ( #21860 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-30 07:15:02 -07:00
e91d3c9cda
[misc] skip p2p check by default ( #21904 )
2025-07-30 22:05:04 +08:00
bf668b5bf5
[Feature] Support multiple api keys in server ( #18548 )
...
Signed-off-by: Yan Pashkovsky <yanp.bugz@gmail.com >
2025-07-30 07:03:23 -07:00
da3e0bd6e5
[Bugfix] we should use metavar is not choices ( #21902 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-30 06:51:58 -07:00
fcfd1eb9c5
[Doc] Remove vLLM prefix and add citation for PagedAttention ( #21910 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-30 06:36:34 -07:00
d979dd6beb
[Feature][EPLB] Add eplb support for Qwen3 ( #20815 )
...
Signed-off-by: aladerran <aladerran@gmail.com >
2025-07-30 06:27:57 -07:00
b876860c62
[Hardware][CPU] Build fix for ARM without BF16 ( #21848 )
...
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-07-30 06:22:00 -07:00
13986365a9
Add @patrickvonplaten as maintainer of mistral's related files. ( #21928 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-07-30 20:42:51 +08:00
5c8fe389d6
[Docs] Fix the example code of streaming chat completions in reasoning ( #21825 )
...
Signed-off-by: wangzi <3220100013@zju.edu.cn >
Co-authored-by: wangzi <3220100013@zju.edu.cn >
Co-authored-by: Zi Wang <66560864+BruceW-07@users.noreply.github.com >
2025-07-30 12:11:58 +00:00
5bbaf492a6
[Doc] Update partial support ( #21916 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-30 01:32:39 -07:00
533db0935d
[benchmark] add max-concurrency in result table ( #21095 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-07-30 01:15:43 -07:00
fc91da5499
[Model] Remove DSV2 unused code ( #21903 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-30 00:55:03 -07:00
547795232d
[Tests] Fixing bug inside MultiModalProfiler. ( #21842 )
...
Signed-off-by: Varun Shenoy <varun.vinayak.shenoy@oracle.com >
2025-07-30 00:44:15 -07:00
30ef30ed5a
[CI] rollback lint-and-deploy pipeline using amd machine ( #21912 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-07-30 00:37:59 -07:00
02f82fe438
[Doc] Update Intern-S1 info ( #21908 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-29 23:58:57 -07:00
2ca5f82c2a
[Misc] Remove redundant config definitions ( #21891 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-29 23:54:18 -07:00
6f8d261882
Update vLLM Benchmark Suite for Xeon based on 0.9.2 release ( #21486 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-07-30 05:57:03 +00:00
4cd7fe6cea
[Docs] Expand introduction to Ray in Multi-node deployment section ( #21584 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-29 22:07:28 -07:00
16f3250527
[CI/Build] Fix pre-commit failure in docs ( #21897 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-29 21:53:08 -07:00
e3bc17ceea
Add @sighingnow as maintainer of qwen's related files. ( #21895 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-07-29 21:30:44 -07:00
05cbbe20c5
[XPU] use ZE_AFFINITY_MASK for device select on xpu ( #21815 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-07-30 03:56:14 +00:00
65f311ce59
[Frontend] Add LLM.reward specific to reward models ( #21720 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-29 20:56:03 -07:00
1b0a155534
[Perf] Using __nv_fp8_e4m3 instead of c10::e4m3 for per_token_group_quant ( #21867 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-29 21:50:46 -06:00
44bc46da60
[Bugfix] Actually disable processing cache when API server is scaled out ( #21839 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-29 20:36:04 -07:00
b7b23da4d2
[Bugfix] Fix comment typo of get_num_common_prefix_blocks() ( #21827 )
...
Signed-off-by: MingzhenHan <hanmingzhen2002@outlook.com >
2025-07-29 20:35:33 -07:00
fdde18229e
[Bugfix] Fix shape mismatch assertion error when loading Gemma3n model with BitsAndBytes quantization ( #21808 )
...
Signed-off-by: sydarb <areebsyed237@gmail.com >
2025-07-30 11:35:21 +08:00
b917da442b
Expose PyTorch profiler configuration to environment variables ( #21803 )
...
Signed-off-by: Csrayz <33659823+Csrayz@users.noreply.github.com >
2025-07-29 19:46:31 -07:00
fb58e3a651
[Docs] Update docker.md with HF_TOKEN, new model, and podman fix ( #21856 )
2025-07-29 19:45:41 -07:00
76080cff79
[DOC] Fix path of v1 related figures ( #21868 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-29 19:45:18 -07:00
ba5c5e5404
[Docs] Switch to better markdown linting pre-commit hook ( #21851 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-29 19:45:08 -07:00
555e7225bc
[v1][attention] Support Hybrid Allocator + FlashInfer ( #21412 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-07-30 01:45:29 +00:00
0e36abf993
[Bugfix] Correct max tokens for non-contiguous embeds ( #21798 )
...
Signed-off-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com >
Co-authored-by: Alexandre Milesi <30204471+milesial@users.noreply.github.com >
2025-07-30 01:16:25 +00:00
452b2a3180
[ci] mark blackwell test optional for now ( #21878 )
2025-07-29 18:03:27 -07:00
0d0cc9e150
[ci] add b200 test placeholder ( #21866 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-07-29 17:11:50 -07:00
9266d98048
[BugFix] Fix interleaved sliding window not set for Gemma3n ( #21863 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-29 16:34:19 -07:00
176bbce1db
Revert "[AMD][CI/Build] Fix the AMD issue caused by inappropriate of symbol exposure ( #21647 )" ( #21850 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-29 21:56:29 +00:00
a1873db23d
docker: docker-aware precompiled wheel support ( #21127 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
2025-07-29 14:45:19 -07:00
a33ea28b1b
Add flashinfer_python to CUDA wheel requirements ( #21389 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-29 12:51:58 -07:00
7b49cb1c6b
[Doc] update Contributing page's testing section ( #18272 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-07-29 10:32:46 -07:00
f03e9cf2bb
[Doc] Add FusedMoE Modular Kernel Documentation ( #21623 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-29 10:32:30 -07:00
37f86d9048
[Docs] use uv in GPU installation docs ( #20277 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-07-29 10:32:06 -07:00
58b11b24a6
[Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backend ( #21525 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-07-29 10:34:00 -04:00
ad341c5194
[Bugfix]fix mixed bits and visual language model quantization in AutoRound ( #21802 )
...
Signed-off-by: Wenhua Cheng <wenhua.cheng@intel.com >
2025-07-29 07:26:31 -07:00
759b87ef3e
[TPU] Add an optimization doc on TPU ( #21155 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-29 07:23:19 -07:00
f693b067a2
[Docs] Merge design docs for a V1 only future ( #21832 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-29 07:22:50 -07:00
04e38500ee
[Bugfix] VLLM_V1 supports passing other compilation levels ( #19340 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-07-29 09:35:58 -04:00
ab714131e4
[Doc] Update compatibility matrix for pooling and multimodal models ( #21831 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-29 06:29:51 -07:00
755fa8b657
[KVCache] Make KVCacheSpec hashable ( #21791 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-07-29 19:58:29 +08:00
2470419119
[Docs] Fix the outdated URL for installing from vLLM binaries ( #21523 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-29 04:56:27 -07:00
61a6905ab0
[Model] Refactor JambaForCausalLM ( #21394 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-29 18:25:07 +08:00
37efc63b64
[V0 deprecation] Guided decoding ( #21347 )
...
Signed-off-by: Reza Barazesh <rezabarazesh@meta.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-29 03:15:30 -07:00
a4528f0cac
[Model]: Fused MoE for nomic-embed-text-v2-moe ( #18321 )
...
Signed-off-by: isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-29 03:13:27 -07:00
a2480251ec
[Doc] Link to RFC for pooling optimizations ( #21806 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-28 23:53:18 -07:00
7234fe2685
[Misc] Rework process titles ( #21780 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-29 05:14:47 +00:00
f1e2c095ec
Migrate InternVLImageInputs and InternVLVideoInputs to TensorSchema ( #21684 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-28 22:09:45 -07:00
12a223ef9b
[AMD][CI/Build][Bugfix] Guarding CUDA specific functions by ifndef ROCM ( #21766 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-29 03:35:37 +00:00
e18f085103
skip fusedmoe layer for start_load_kv ( #21378 )
...
Signed-off-by: calvin chen <wen.chen@dynamia.ai >
2025-07-28 18:59:44 -07:00
afa2607596
[CI] Parallelize Kernels MoE Test ( #21764 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-28 18:56:24 -07:00
48b763d6b5
[Refactor] Merge Compressed Tensor FP8 CompressedTensorsW8A8Fp8MoEMethod and CompressedTensorsW8A8Fp8MoECutlassMethod ( #21775 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-28 19:47:21 -06:00
947e982ede
[Docs] Minimize spacing for supported_hardware.md table ( #21779 )
2025-07-28 18:46:39 -07:00
c6c9122d50
[Kernel] SM90 CUTLASS FP8 GEMM: add support for swap AB + kernel tuning ( #20396 )
...
Signed-off-by: Faqin Zhong <faqin.zhong@gmail.com >
Co-authored-by: Duncan Moss <djm.moss@gmail.com >
2025-07-28 23:13:58 +00:00
8aa1485fcf
[Perf] Disable chunked local attention by default with llama4 ( #21761 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-28 18:49:04 -04:00
89ac266b26
[Feat]: Add support for Dynamic Quant 4 bit CPU kleidiai kernels ( #17112 )
...
Signed-off-by: Nikhil Gupta <nikhil.gupta2@arm.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-28 20:55:15 +00:00
c6f36cfa26
[Bugfix] DeepGEMM is not enabled on B200 due to _lazy_init() ( #21472 )
...
Signed-off-by: Clayton Coleman <smarterclayton@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-28 20:51:22 +00:00
b18b417fbf
Revert "[V1] Exception Handling when Loading KV Cache from Remote Store" ( #21778 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-07-28 20:15:18 +00:00
9ba1c88a93
[AMD][CI/Build] Fix the AMD issue caused by inappropriate of symbol exposure ( #21647 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-28 20:11:16 +00:00
e0e58f9729
[Bug] Enforce contiguous input for dynamic_scaled_fp8_quant and static_scaled_fp8_quant ( #21773 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-28 19:55:48 +00:00
b361f14e39
[AMD][BugFix] Fix omission of wvSplitK kernel for small batch sizes (1-4) due to torch.compile ( #21350 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-07-28 15:38:20 -04:00
01c753ed98
update flashinfer to v0.2.9rc2 ( #21701 )
...
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com >
2025-07-28 19:31:47 +00:00
94b71ae106
Use metavar to list the choices for a CLI arg when custom values are also accepted ( #21760 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-28 19:31:10 +00:00
7d44c691b0
[P/D] Log warnings related to prefill KV expiry ( #21753 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-28 18:40:53 +00:00
e17a4d3bf9
[Bugfix] Fix granite speech shape validation ( #21762 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-28 14:19:21 -04:00
ec261b0291
[XPU] IPEX-optimized Punica Wrapper on XPU ( #21703 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-28 16:43:37 +00:00
04fe61aa3d
[CI/Build] Fix plugin tests ( #21758 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-28 15:08:05 +00:00
25708d317a
[Bugfix] Mistral crashes on tool with no description ( #21167 )
...
Signed-off-by: HugoMichard <hugo@harfanglab.fr >
2025-07-28 08:03:35 -07:00
0e18a5d058
[Misc] Reduce logs for model resolution ( #21765 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-28 07:59:56 -07:00
34a20c49b3
[Logs] Change flashinfer sampler logs to once ( #21759 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-28 06:59:51 -07:00
31084b3b1f
[Bugfix][CI/Build] Update peft version in test requirement ( #21729 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-28 06:17:43 -07:00
bccc43c033
[Bugfix]check health for engine core process exiting unexpectedly ( #21728 )
...
Signed-off-by: wuhang <wuhang6@huawei.com >
2025-07-28 06:17:31 -07:00
1395dd9c28
[Docs] Add revision date to rendered docs ( #21752 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-28 06:12:46 -07:00
9ace2eaf35
[Bugfix] Improve JSON extraction in LlamaToolParser ( #19024 )
...
Signed-off-by: keru <keyang.ru@oracle.com >
Co-authored-by: keru <keyang.ru@oracle.com >
2025-07-28 12:36:58 +00:00
656c24f1b5
[Ernie 4.5] Name Change for Base 0.3B Model ( #21735 )
...
Signed-off-by: vasqu <antonprogamer@gmail.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-28 12:22:32 +00:00
63fe3a700f
[PD] let p2p nccl toy proxy handle /chat/completions ( #21734 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-28 11:45:50 +00:00
0ae970ed15
[Bugfix] Fix glm4.1v video_grid_thw tensor shape scheme ( #21744 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-28 04:26:49 -07:00
65e8466c37
[Bugfix] Fix environment variable setting in CPU Dockerfile ( #21730 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-28 11:02:39 +00:00
1b769dccf3
[Bugfix] Fix Ernie4_5_MoeForCausalLM shared experts ( #21717 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-28 11:02:25 +00:00
2cc571199b
[feature] add log non default args in LLM ( #21680 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-28 02:21:22 -07:00
a4ed731546
[Model] Prioritize Transformers fallback over suffix matching ( #21719 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-28 02:15:31 -07:00
d128d0d554
Migrate KeyeImageInputs and KeyeVideoInputs to TensorSchema ( #21686 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-28 01:16:35 -07:00
a6c050286a
[v1][mamba] Added mamba_type into MambaSpec ( #21715 )
...
Signed-off-by: asafg <asafg@ai21.com >
Co-authored-by: asafg <asafg@ai21.com >
2025-07-28 08:15:55 +00:00
139a7f07bd
[BugFix] Fix ChunkedLocalAttention when the hybrid kv-cache is disabled ( #21707 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-28 07:18:47 +00:00
150d9e6337
[Bugfix] fix max-file-size type from str to int ( #21675 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-28 00:06:52 -07:00
139a97ec56
[Bugfix] Fix shape checking for Fuyu ( #21709 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-28 00:05:56 -07:00
18cc33dd60
[bugfix] fix profile impact benchmark results ( #21507 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-27 22:44:24 -07:00
7656cf4cf3
[Bugfix] [issue-21565] Fix the incompatibility issue with stream and named function calling when Thinking is disabled ( #21573 )
...
Signed-off-by: wangzi <3220100013@zju.edu.cn >
Co-authored-by: wangzi <3220100013@zju.edu.cn >
2025-07-27 22:43:50 -07:00
3ea57a56d9
Migrate Idefics3ImagePixelInputs and Idefics3ImageEmbeddingInputs to … ( #21683 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-27 22:37:23 -07:00
75856bc2cb
Migrate GraniteSpeechAudioInputs to TensorSchema ( #21682 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-07-27 22:37:20 -07:00
304dcdf575
Migrate GLMVImagePixelInputs to TensorSchema ( #21679 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-27 22:36:11 -07:00
88e46c7c8d
Migrate Glm4vImageInputs, Glm4vVideoInputs to TensorSchema ( #21678 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-07-27 22:36:08 -07:00
d8937de4c8
Migrate Gemma3ImagePixelInputs to TensorSchema ( #21676 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-27 22:36:05 -07:00
e626d286f5
[FEAT] [ROCm] [AITER]: Add AITER HIP block quant kernel ( #21242 )
2025-07-28 05:07:06 +00:00
c7ffe93d9c
[Model] Support TP/PP/mamba2 kernel for PLaMo2 ( #19674 )
...
Signed-off-by: Shinichi Hemmi <shemmi@preferred.jp >
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com >
Co-authored-by: Calvin Metzger <metzger@preferred.jp >
Co-authored-by: Sixue Wang <cecilwang@preferred.jp >
2025-07-28 05:00:47 +00:00
15a72ac478
[V1] Exception Handling when Loading KV Cache from Remote Store ( #21534 )
...
Signed-off-by: liuyumoye <adeline_ly2023@outlook.com >
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com >
2025-07-27 20:34:17 -07:00
04ff4be310
[Misc] Add fused_moe configs for Qwen3-Coder-480B-A35B-Instruct-FP8 ( #21700 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-27 20:12:18 -07:00
93269bb43e
Fix GLM tool parser ( #21668 )
...
Co-authored-by: Chenhui Zhang <zhang.chenhui@outlook.com >
2025-07-28 10:46:38 +08:00
82acf2184d
Fix typo for limit-mm-per-prompt in docs ( #21697 )
...
Signed-off-by: Joachim Studnia <joachim@mistral.ai >
2025-07-27 19:45:37 -07:00
86ae693f20
[Deprecation][2/N] Replace --task with --runner and --convert ( #21470 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-27 19:42:40 -07:00
8f605ee309
[Attention] Make CutlassMLA the default backend for SM100 (blackwell) ( #21626 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-27 20:13:00 +00:00
a9b2a1d704
[Misc] Refactor vllm config str ( #21666 )
2025-07-27 09:51:44 -07:00
57c22e57f9
Fix CUDA permute/unpermute for use with DeepGemm Moe ( #17934 )
...
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn >
2025-07-27 07:08:00 -07:00
bda9d0535f
[Refactor] Refactor MOE NVFP4 Code Base: ModelOpt + Compressed Tensor ( #21631 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-27 05:25:21 -07:00
3d847a3125
[VLM] Add video support for Intern-S1 ( #21671 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-27 11:49:43 +00:00
5f8c9a425e
Migrate Florence2ImagePixelInputs to TensorSchema ( #21663 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-27 02:43:02 -07:00
1cbf951ba2
[Misc] add default value for file pattern arg ( #21659 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-27 05:14:51 +00:00
a8936e5193
Refactor: Remove numpy dependency from LoggingStatLogger ( #20529 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-07-27 04:06:21 +00:00
01a395e9e7
[CI/Build][Doc] Clean up more docs that point to old bench scripts ( #21667 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-07-27 04:02:12 +00:00
971948b846
Handle non-serializable objects in vllm bench ( #21665 )
2025-07-27 03:35:22 +00:00
eed2f463b2
[VLM] Support HF format Phi-4-MM model ( #17121 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-26 20:07:57 -07:00
20950b29fb
Migrate ChameleonImagePixelInputs to TensorSchema ( #21657 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-26 19:34:25 -07:00
3339cba3ff
Migrate FuyuImagePatchInputs to TensorSchema ( #21662 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-26 19:34:14 -07:00
0b8caf9095
Migrate DeepseekVL2ImageInputs to TensorSchema ( #21658 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-26 19:34:11 -07:00
ccf27cc4d4
Migrate Blip2ImagePixelInputs and Blip2ImageEmbeddingInputs to TensorSchema ( #21656 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-27 10:33:52 +08:00
c657369841
support torch.compile for bailing moe ( #21664 )
2025-07-26 23:54:32 +00:00
6c66f28fa5
Remove xformers requirement for Mistral-format Pixtral and Mistral3 ( #21154 )
...
Signed-off-by: Wenchen Lo <charles761013@gmail.com >
2025-07-26 17:20:29 -06:00
de509ae8eb
[NVIDIA] Explicitly disable shuffled weights for flashinfer blockscale moe fp8 kernels ( #21411 )
...
Signed-off-by: kaixih <kaixih@nvidia.com >
2025-07-26 07:10:36 -07:00
e7c4f9ee86
[CI/Build][Doc] Move existing benchmark scripts in CI/document/example to vllm bench CLI ( #21355 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-07-26 07:10:14 -07:00
9094d11c5d
[Bugfix][Apple Silicon] fix missing symbols when build from source on Mac with Apple Silicon ( #21380 )
...
Signed-off-by: Yeju Zhou <yejuzhou@outlook.com >
2025-07-26 07:09:57 -07:00
56e544f24b
[Refactor] Remove moe_align_block_size_triton ( #21335 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-26 07:08:29 -07:00
97d6c30cc9
[BugFix] Fix shared storage connector load kv only load attention layer ( #21428 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-07-26 07:07:40 -07:00
a40a8506df
[Misc] Improve memory profiling debug message ( #21429 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-07-26 07:07:21 -07:00
c215f5c877
[Bug] Fix has_flashinfer_moe Import Error when it is not installed ( #21634 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-26 07:06:14 -07:00
1cd6eaba54
Support encoder-only models without KV-Cache ( #21270 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-07-26 21:09:52 +08:00
f27fdfc3ed
[Bugfix] Investigate Qwen2-VL failing test ( #21527 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-26 06:09:29 -07:00
de10ff0b7c
Migrate AyaVisionImagePixelInputs to TensorSchema for shape validation ( #21622 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-26 06:08:18 -07:00
9d197280fa
Migrate AriaImagePixelInputs to TensorSchema for shape validation ( #21620 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-26 06:08:15 -07:00
e98def439c
[Take 2] Correctly kill vLLM processes after benchmarks ( #21646 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-07-26 06:06:05 -07:00
05c1126f29
[Misc] remove unused try-except in pooling config check ( #21618 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-26 12:20:03 +00:00
875af38e01
Support Intern-S1 ( #21628 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Your Name <you@example.com >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-26 19:14:04 +08:00
7728dd77bb
[TPU][Test] Divide TPU v1 Test into 2 parts. ( #21431 )
2025-07-26 06:20:30 +00:00
2f6e6b33fb
[Bugfix] Fix isinstance check for tensor types in _load_prompt_embeds to use dtype comparison ( #21612 )
...
Signed-off-by: Alexandre Juan <a.juan@netheos.net >
2025-07-25 20:11:10 -07:00
a55c95096b
Correctly kill vLLM processes after finishing serving benchmarks ( #21641 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-07-25 19:06:21 -07:00
97349fe2bc
[Docs] add offline serving multi-modal video input expamle Qwen2.5-VL ( #21530 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-07-25 18:37:32 -07:00
62965de5fe
[Model] Ultravox: Support Llama 4 and Gemma 3 backends ( #17818 )
...
Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai >
Signed-off-by: Patrick Li <patrick8289@gmail.com >
Co-authored-by: Patrick Li <patrick8289@gmail.com >
2025-07-25 18:12:31 -07:00
7ae75fa6d0
[Feature] Add support for MoE models in the calibration-free RTN-based quantization ( #20766 )
...
Signed-off-by: Alex Kogan <alex.kogan@oracle.com >
2025-07-25 18:09:34 -07:00
f1b286b2fb
[TPU] Update ptxla nightly version to 20250724 ( #21555 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-25 17:09:00 -07:00
c7742d6113
[Bugfix] Always set RAY_ADDRESS for Ray actor before spawn ( #21540 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-25 17:08:30 -07:00
cea96a0156
[Bugfix] Fix sync_and_slice_intermediate_tensors ( #21537 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-25 17:07:58 -07:00
2eddd437ba
Add interleaved RoPE test for Llama4 (Maverick) ( #21478 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-25 17:07:26 -07:00
75d29cf4e1
[Perf] Cuda Kernel for Int8 Per Token Group Quant ( #21476 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-25 17:07:07 -07:00
41d3082c41
Add Unsloth to RLHF.md ( #21636 )
2025-07-25 17:06:48 -07:00
7cfea0df39
[TPU][Test] Rollback PR-21550. ( #21619 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-25 13:22:01 -07:00
5ac3168ee3
[Docs] add auto-round quantization readme ( #21600 )
...
Signed-off-by: Wenhua Cheng <wenhua.cheng@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-25 08:52:42 -07:00
396ee94180
[CI] Unifying Dockerfiles for ARM and X86 Builds ( #21343 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-07-25 07:33:56 -07:00
e189b50f53
Add support for Prithvi in Online serving mode ( #21518 )
...
Signed-off-by: Michele Gazzetti <michele.gazzetti1@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-25 07:01:27 -07:00
136d750f5f
[Kernel] Improve machete memory bound perf ( #21556 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-07-25 06:53:21 -07:00
b3caeb82e7
[ROCm][AITER] Enable fp8 kv cache on rocm aiter backend. ( #20295 )
...
Signed-off-by: fsx950223 <fsx950223@outlook.com >
Signed-off-by: amd-ruitang3 <Rui.Tang2@amd.com >
Co-authored-by: amd-ruitang3 <Rui.Tang2@amd.com >
2025-07-25 06:50:21 -07:00
eab2f3980c
[Model] Replace Mamba2 RMSNorm Gated with Fused Triton Kernel ( #20839 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
Signed-off-by: Yu Chin Fabian Lim <fabian.lim@gmail.com >
Signed-off-by: Chih-Chieh Yang <7364402+cyang49@users.noreply.github.com >
Co-authored-by: Yu Chin Fabian Lim <fabian.lim@gmail.com >
2025-07-25 06:49:36 -07:00
9fe98d4250
[Frontend] Add request_id to the Request object so they can be controlled better via external load balancers ( #21009 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-07-25 06:49:11 -07:00
29c6fbe58c
[MODEL] New model support for naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B ( #20931 )
...
Signed-off-by: bigshanedogg <bigshane319@gmail.com >
2025-07-25 06:05:42 -07:00
c72f049cb4
[Model] Fix Ernie4.5MoE e_score_correction_bias parameter ( #21586 )
...
Signed-off-by: zhouchong <zhouchong03@baidu.com >
Co-authored-by: zhouchong <zhouchong03@baidu.com >
2025-07-25 06:02:53 -07:00
f3a683b7c9
[Bugfix][Logprobs] Fix logprobs op to support more backend ( #21591 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2025-07-25 05:53:07 -07:00
46d81d6951
[V1] Get supported tasks from model runner instead of model config ( #21585 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-25 05:36:45 -07:00
5c3f2628d5
[Quantization] Enable BNB support for more MoE models ( #21370 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-25 03:57:34 -07:00
7311f74468
[Bugfix] GGUF: fix AttributeError: 'PosixPath' object has no attribute 'startswith' ( #21579 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-07-25 03:42:23 -07:00
8ed01e32f7
Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct ( #21598 )
...
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com >
2025-07-25 02:36:55 -07:00
e38e96a3c0
[Tests] Harden DP tests ( #21508 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-25 02:27:24 -07:00
40d86ee412
[TPU][Bugfix] fix OOM issue in CI test ( #21550 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-24 23:01:53 -07:00
85d051f026
[Misc] Removed undefined cmake variables MOE_PERMUTE_ARCHS ( #21262 )
...
Signed-off-by: Yang Chen <yangche@fb.com >
2025-07-24 22:54:23 -07:00
5140f54b89
[CI/Build] fix cpu_extension for apple silicon ( #21195 )
...
Signed-off-by: ignaciosica <mignacio.sica@gmail.com >
2025-07-24 22:53:59 -07:00
947edd099e
[Misc][Tools] make max-model-len a parameter in auto_tune script ( #21321 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-24 22:46:43 -07:00
fde60ee775
[Model] Fix a check for None but the return value was empty list in Gemma3 MM vision_embeddings ( #21479 )
...
Signed-off-by: Hongmin Fan <fanhongmin@google.com >
2025-07-25 13:46:06 +08:00
b38bc652ac
[Model] Support tensor parallel for timm ViT in Deepseek_vl2 ( #21494 )
...
Signed-off-by: wzqd <1057337859@qq.com >
2025-07-24 22:45:16 -07:00
adaf2c6d4f
[Bugfix] fix modelscope snapshot_download serialization ( #21536 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-24 22:44:38 -07:00
42343f1f89
[CI] Update CODEOWNERS for CPU and Intel GPU ( #21582 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-24 21:58:03 -07:00
965bc71b04
Integrate TensorSchema with shape validation for Phi3VImagePixelInputs ( #21232 )
...
Signed-off-by: Benji Beck <benjibeck@meta.com >
2025-07-24 21:43:52 -07:00
807a328bb6
[Docs] Add requirements/common.txt to run unit tests ( #21572 )
...
Signed-off-by: Zhou Fang <fang.github@gmail.com >
2025-07-24 20:51:15 -07:00
e0be2c4d09
[TPU][Test] Temporarily suspend this MoE model in test_basic.py. ( #21560 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-24 20:44:50 -07:00
9c8b2c2a8a
[DP] Support api-server-count > 0 in hybrid DP LB mode ( #21510 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-24 20:18:16 -07:00
2212cd6cfb
[Bugfix] DeepGemm utils : Fix hardcoded type-cast ( #21517 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-24 20:17:29 -07:00
ce3a9b1378
[Kernel] adding fused_moe configs for upcoming granite4 ( #21332 )
...
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com >
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-24 20:16:59 -07:00
2ce90e5b01
Fix GLM-4 PP Missing Layer When using with PP. ( #21531 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-07-24 20:07:38 -07:00
633f6e804b
[Bug] Fix DeepGemm Init Error ( #21554 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-24 20:07:22 -07:00
b57296bb9a
[Docs] Fix site_url for RunLLM ( #21564 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-24 20:05:58 -07:00
34ddcf9ff4
[Frontend] run-batch supports V1 ( #21541 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-24 20:05:55 -07:00
fe56180c7f
[MoE] More balanced expert sharding ( #21497 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-07-24 15:56:08 -07:00
07d80d7b0e
[TPU][TEST] HF_HUB_DISABLE_XET=1 the test 3. ( #21539 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-24 15:33:04 -07:00
2dd72d23d9
update flashinfer to v0.2.9rc1 ( #21485 )
...
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com >
2025-07-24 14:06:11 -07:00
a6c7fb8cff
[Docs] Add Expert Parallelism Initial Documentation ( #21373 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-24 12:36:06 -07:00
a7272c23d0
[Docs][minor] Fix broken gh-file link in distributed serving docs ( #21543 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-24 10:36:56 -07:00
6066284914
[P/D] Support CPU Transfer in NixlConnector ( #18293 )
...
Signed-off-by: Juncheng Gu <juncgu@gmail.com >
Signed-off-by: Richard Liu <ricliu@google.com >
Co-authored-by: Richard Liu <39319471+richardsliu@users.noreply.github.com >
Co-authored-by: Richard Liu <ricliu@google.com >
2025-07-24 17:58:42 +01:00
1e9ea8e69d
[P/D] Move FakeNixlWrapper to test dir ( #21328 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-24 08:53:45 -07:00
d9f9a3fd96
[XPU] Conditionally import CUDA-specific passes to avoid import errors on xpu platform ( #21036 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-07-24 23:23:36 +08:00
1b25f1fe75
Update flashinfer CUTLASS MoE Kernel ( #21408 )
...
Signed-off-by: Shu Wang. <shuw@nvidia.com >
2025-07-24 08:13:31 -07:00
e8cb0d0495
[Bug] Fix Compressed Tensor NVFP4 cutlass_fp4_group_mm illegal memory access ( #21465 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-24 08:13:24 -07:00
684174115d
[Docs] Rewrite Distributed Inference and Serving guide ( #20593 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-24 08:13:05 -07:00
cdb79ee63d
[Docs] Update Tensorizer usage documentation ( #21190 )
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com >
Signed-off-by: William Goldby <willgoldby@gmail.com >
Co-authored-by: William Goldby <willgoldby@gmail.com >
2025-07-24 06:56:18 -07:00
5a19a6c670
[Fix] Update mamba_ssm to 2.2.5 ( #21421 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-07-24 03:25:41 -07:00
2ded067fd2
[Bugfix] Fix CUDA arch flags for MoE permute ( #21426 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-07-24 03:23:59 -07:00
13abd0eaf9
[Model] Officially support Emu3 with Transformers backend ( #21319 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-24 03:22:12 -07:00
61b8cea3b4
[Attention] Optimize FlashInfer MetadataBuilder Build call ( #21137 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-24 03:21:46 -07:00
526078a96c
bump flashinfer to v0.2.8 ( #21385 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2025-07-24 03:20:38 -07:00
6da0078523
[Feat] Allow custom naming of vLLM processes ( #21445 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-24 03:15:23 -07:00
73e3949d07
[Misc] Improve comment for DPEngineCoreActor._set_cuda_visible_devices() ( #21501 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-24 03:13:40 -07:00
6eca337ce0
Replace --expand-tools-even-if-tool-choice-none with --exclude-tools-when-tool-choice-none for v0.10.0 ( #20544 )
...
Signed-off-by: okada <kokuzen@gmail.com >
Signed-off-by: okada shintarou <okada@preferred.jp >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-24 02:56:36 -07:00
85bda9e7d0
remove GLM-4.5 quantization wrong Code ( #21435 )
2025-07-24 01:52:43 -07:00
610852a423
[Core] Support model loader plugins ( #21067 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-24 01:49:44 -07:00
f0f4de8f26
[Misc] Fix duplicate FusedMoEConfig debug messages ( #21455 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-24 01:27:30 -07:00
fc5f756db4
[v1][Core] Clean up usages of SpecializedManager ( #21407 )
...
Signed-off-by: Zhou Fang <fang.github@gmail.com >
2025-07-24 00:40:11 -07:00
e74bfc70e4
[TPU][Bugfix] fix moe layer ( #21340 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-07-24 00:38:39 -07:00
90eeea8f85
[Bugfix][ROCm] Fix for warp_size uses on host ( #21205 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-24 00:37:19 -07:00
dde295a934
Deduplicate Transformers backend code using inheritance ( #21461 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-24 00:16:23 -07:00
6d8d0a24c0
Add think chunk ( #21333 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2025-07-23 21:51:32 -07:00
11ef7a611e
[BugFix] Set CUDA_VISIBLE_DEVICES before spawning the subprocesses ( #21211 )
...
Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-23 21:44:04 -07:00
dc2f159f8a
Dump input metadata on crash for async scheduling ( #21258 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-23 21:10:30 -07:00
d5b981f8b1
[DP] Internal Load Balancing Per Node [one-pod-per-node] ( #21238 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-07-23 20:57:32 -07:00
eec6942014
[BugFix] Fix KVConnector TP worker aggregation ( #21473 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-23 20:56:49 -07:00
fd48d99ffd
[BugFix]: Batch generation from prompt_embeds fails for long prompts ( #21390 )
...
Signed-off-by: KazusatoOko <kazusto.oko@sakana.ai >
Co-authored-by: KazusatoOko <kazusto.oko@sakana.ai >
2025-07-23 20:43:17 -07:00
f8c15c4efb
[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process ( #21437 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-07-23 20:42:11 -07:00
aa08a954f9
[Bugfix] Fix casing warning ( #21468 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-07-23 20:41:23 -07:00
13e4ee1dc3
[XPU][UT] increase intel xpu CI test scope ( #21492 )
...
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com >
2025-07-23 20:24:04 -07:00
772ce5af97
[Misc] Add dummy maverick test to CI ( #21324 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-23 20:22:42 -07:00
63d92abb7c
[Frontend] Set MAX_AUDIO_CLIP_FILESIZE_MB via env var instead of hardcoding ( #21374 )
...
Signed-off-by: Deven Labovitch <deven@videa.ai >
2025-07-23 20:22:19 -07:00
11599b0e1f
feat(gguf_loader): accept HF repo paths & URLs for GGUF ( #20793 )
...
Signed-off-by: Hardik <hardikgupta1999@gmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-23 20:21:02 -07:00
f3137cdd81
[Core] Freeze gc during cuda graph capture to speed up init ( #21146 )
...
Signed-off-by: Codex <codex@openai.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-23 17:20:14 -07:00
82ec66f514
[V0 Deprecation] Remove Prompt Adapters ( #20588 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-23 16:36:48 -07:00
78c13e30e1
[V1] Fix local chunked attention always disabled ( #21419 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-23 15:59:30 -07:00
5c9b807b34
[Core] Add reload_weights RPC method ( #20096 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-23 14:24:52 -07:00
14bf19e39f
[TPU][TEST] Fix the downloading issue in TPU v1 test 11. ( #21418 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-23 11:29:36 -07:00
4ac7713e32
Add test case for compiling multiple graphs ( #21044 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-23 11:00:47 -07:00
8560a5b258
[Core][Model] PrithviMAE Enablement on vLLM v1 engine ( #20577 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2025-07-23 11:00:23 -07:00
316b1bf706
[Tests] Add tests for headless internal DP LB ( #21450 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-23 07:49:25 -07:00
7c734ee09b
[Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models. ( #21364 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-07-23 06:34:37 -07:00
f59ec35b7f
[V1] Check all pooling tasks during profiling ( #21299 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-23 05:53:26 -07:00
2671334d45
[Model] add Hunyuan V1 Dense Model support. ( #21368 )
...
Signed-off-by: Asher Zhang <asherszhang@tencent.com >
2025-07-23 03:54:08 -07:00
2cc5016a19
[Docs] Clean up v1/metrics.md ( #21449 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-23 03:37:25 -07:00
6929f8b437
[Misc] fixed nvfp4_moe test failures due to invalid kwargs ( #21246 )
...
Signed-off-by: Yang Chen <yangche@fb.com >
2025-07-23 01:41:43 -07:00
32ec9e2f2a
Mamba V2 Test not Asserting Failures. ( #21379 )
...
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
2025-07-23 01:40:27 -07:00
accac82928
[Sampler] Introduce logprobs mode for logging ( #21398 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-23 01:39:25 -07:00
23637dcdef
[Docs] Fix bullets and grammars in tool_calling.md ( #21440 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-23 01:23:20 -07:00
6364af92f8
Fixed typo in profiling logs ( #21441 )
2025-07-23 01:18:54 -07:00
7aaa2bd5a8
[Bugfix] ensure tool_choice is popped when tool_choice:null is passed in json payload ( #19679 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-07-23 00:30:05 -07:00
2f5c14de6a
add clear messages for deprecated models ( #21424 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-07-23 00:03:16 -07:00
f002e9a870
[Cleanup] Only log MoE DP setup warning if DP is enabled ( #21315 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-23 00:02:48 -07:00
a1f3610fc6
[Core] Add basic unit test for maybe_evict_cached_block ( #21400 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-23 00:02:02 -07:00
4ecedd1806
[Bugfix] Fix nightly transformers CI failure ( #21427 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-23 00:01:01 -07:00
107111a859
Changing "amdproduction" allocation. ( #21409 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-07-22 20:48:31 -07:00
2dec7c1a5d
[Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported ( #21420 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-07-22 20:34:50 -07:00
08d2bd78da
[BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update ( #21414 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2025-07-22 20:33:57 -07:00
4f76a05f4f
[BugFix] Update python to python3 calls for image; fix prefix & input calculations. ( #21391 )
...
Signed-off-by: Eric Hanley <ericehanley@google.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-22 20:33:00 -07:00
f154bb9ff0
Simplify weight loading in Transformers backend ( #21382 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-22 20:29:43 -07:00
3ec7170ff1
[Bugfix][ROCm][Build] Fix build regression on ROCm ( #21393 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-22 20:27:41 -07:00
c401c64b4c
[CI/Build] Fix model executor tests ( #21387 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-22 20:25:37 -07:00
b77c7d327f
[BugFix] Fix ray import error mem cleanup bug ( #21381 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-07-22 16:19:55 -07:00
35bc8bd5fb
[Misc] Copy HF_TOKEN env var to Ray workers ( #21406 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-22 16:18:42 -07:00
4594fc3b28
[Model] Add Qwen3CoderToolParser ( #21396 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-07-22 15:05:57 -07:00
ae268b6326
Fix Flashinfer Allreduce+Norm enable disable calculation based on fi_allreduce_fusion_max_token_num ( #21325 )
...
Signed-off-by: XIn Li <xinli@nvidia.com >
2025-07-22 12:42:31 -07:00
35366ae57c
[CI/Build] Fix test failure due to updated model repo ( #21375 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-22 08:39:35 -07:00
2226d5bd85
[Bugfix] Decode Tokenized IDs to Strings for hf_processor in llm.chat() with model_impl=transformers ( #21353 )
...
Signed-off-by: ariG23498 <aritra.born2fly@gmail.com >
2025-07-22 08:27:28 -07:00
44554a0068
Add tokenization_kwargs to encode for embedding model truncation ( #21033 )
2025-07-22 08:24:00 -07:00
226b452a20
Revert "[Refactor] Fix Compile Warning #1444-D ( #21208 )" ( #21384 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-22 08:22:10 -07:00
f38ee34a0a
[feat] Enable mm caching for transformers backend ( #21358 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2025-07-22 08:18:46 -07:00
b194557a6c
Adds parallel model weight loading for runai_streamer ( #21330 )
...
Signed-off-by: bbartels <benjamin@bartels.dev >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-22 08:15:53 -07:00
774d0c014b
[Perf] Cuda Kernel for Per Token Group Quant ( #21083 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-22 07:27:15 -07:00
2c8db17cfd
[feat]: add SM100 support for cutlass FP8 groupGEMM ( #20447 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-22 07:27:12 -07:00
4fb56914c5
[perf] Add fused MLA QKV + strided layernorm ( #21116 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-22 07:07:44 -07:00
0df4d9b06b
[Misc] unify variable for LLM instance v2 ( #21356 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-22 06:32:36 -07:00
ed25054577
[Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool ( #21222 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-22 06:17:47 -07:00
10904e6d75
[benchmark] Port benchmark request sent optimization to benchmark_serving ( #21209 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-22 05:28:00 -07:00
a32237665d
[Core] Optimize update checks in LogitsProcessor ( #21245 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-22 05:27:18 -07:00
bc8a8ce5ec
[Misc] Remove deprecated args in v0.10 ( #21349 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-07-22 05:26:39 -07:00
32142b3c62
[Bugfix] Fix eviction cached blocked logic ( #21357 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-07-22 01:18:40 -07:00
82b8027be6
Add arcee model ( #21296 )
...
Signed-off-by: alyosha-swamy <raghav@arcee.ai >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-22 00:57:43 -07:00
3779eb8c81
[Feature][eplb] add verify ep or tp or dp ( #21102 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-21 23:41:14 -07:00
9e23ad9655
Update fp4 quantize API ( #21327 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com >
2025-07-21 23:40:21 -07:00
e69a92a1ce
[Bug] DeepGemm: Fix Cuda Init Error ( #21312 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-21 23:36:18 -07:00
8425f785ad
[Misc] DeepEPHighThroughtput - Enable Inductor pass ( #21311 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-21 23:35:45 -07:00
c17231e827
Fix kv_cache_dtype handling for out-of-tree HPU plugin ( #21302 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
Co-authored-by: Chendi.Xue <chendi.xue@intel.com >
2025-07-21 23:35:14 -07:00
6e5b5ca580
[Refactor] Fix Compile Warning #1444-D ( #21208 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-21 23:33:51 -07:00
488d8a986a
[V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible ( #21300 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-21 23:31:18 -07:00
af376ca19d
[Core] Minimize number of dict lookup in _maybe_evict_cached_block ( #21281 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-21 22:37:34 -07:00
e7b2042681
Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE ( #20762 ) ( #21334 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-07-21 21:49:01 -07:00
90f1e55421
[Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU ( #21338 )
...
Signed-off-by: ratnampa <ratnam.parikh@intel.com >
2025-07-21 21:48:27 -07:00
5e70dcd6e6
[Doc] Fix CPU doc format ( #21316 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-21 21:47:49 -07:00
25d585ab7b
[XPU] Enable external_launcher to serve as an executor via torchrun ( #21021 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-07-21 21:47:35 -07:00
8d0a01a5f2
[v1][sampler] Inplace logprobs comparison to get the token rank ( #21283 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-21 13:47:47 -07:00
0ec82edda5
[perf] Speed up align sum kernels ( #21079 )
...
Signed-off-by: Himanshu Jaju <hj@mistral.ai >
2025-07-21 11:19:23 -07:00
005ae9be6c
Fix bad lm-eval fork ( #21318 )
2025-07-21 10:47:51 -07:00
29d1ffc5b4
[DP] Fix Prometheus Logging ( #21257 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-07-21 09:11:35 -07:00
304dce7ec0
[Attention] Clean up iRoPE in V1 ( #21188 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-07-21 09:10:30 -07:00
6ece16c4fe
[Misc] Add dummy maverick test ( #21199 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-21 09:08:09 -07:00
a0e827e07c
[BugFix] make utils.current_stream thread-safety ( #21252 ) ( #21253 )
...
Signed-off-by: simpx <simpxx@gmail.com >
2025-07-21 09:07:36 -07:00
a15a50fc17
[CPU] Enable shared-memory based pipeline parallel for CPU backend ( #21289 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-21 09:07:08 -07:00
6dda13c86b
[Misc] Add sliding window to flashinfer test ( #21282 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-21 08:37:49 -07:00
6b46c4b653
Add Nvidia ModelOpt config adaptation ( #19815 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
2025-07-21 10:02:58 -04:00
d97841078b
[Misc] unify variable for LLM instance ( #20996 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-21 12:18:33 +01:00
e6b90a2805
[Docs] Make tables more space efficient in supported_models.md ( #21291 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-21 02:25:02 -07:00
be54a951a3
[Docs] Fix hardcoded links in docs ( #21287 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-21 02:23:57 -07:00
042af0c8d3
[Model][1/N] Support multiple poolers at model level ( #21227 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-21 02:22:21 -07:00
378d33c392
[Bugfix] Fix missing placeholder in logger debug ( #21280 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-20 22:50:06 -07:00
940af1f03a
Add the instruction to run e2e validation manually before release ( #21023 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-07-20 22:29:18 -07:00
92615d7fe8
[Docs] Add RFC Meeting to Issue Template ( #21279 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-07-20 21:58:07 -07:00
8188196a1c
[CI] Cleanup modelscope version constraint in Dockerfile ( #21243 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-07-20 20:13:02 -07:00
7ba34b1241
[bugfix] fix syntax warning caused by backslash ( #21251 )
2025-07-20 17:12:10 +00:00
9499e26e2a
[Model] Support VLMs with transformers backend ( #20543 )
...
Signed-off-by: raushan <raushan@huggingface.co >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-20 13:25:50 +00:00
51ba839555
[Model] use AutoWeightsLoader for bart ( #18299 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-07-20 08:15:50 +00:00
d1fb65bde3
Enable v1 metrics tests ( #20953 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-07-20 03:22:02 +00:00
3a1d8940ae
[TPU] support fp8 kv cache quantization ( #19292 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-20 03:01:00 +00:00
2b504eb770
[Docs] [V1] Update docs to remove enforce_eager limitation for hybrid models. ( #21233 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-19 16:09:58 -07:00
10eb24cc91
GLM-4 Update ( #20736 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Lu Fang <fanglu@fb.com >
2025-07-19 22:40:31 +00:00
2e8cbb58f3
[BugFix] Fix full cuda graph slot_mapping ( #21228 )
...
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com >
2025-07-19 14:13:18 -07:00
752c6ade2e
[V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small ( #21217 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-19 13:53:17 -07:00
881e3cbe3b
[V1] [Hybrid] Enable piecewise CUDA Graph for mamba layers ( #21194 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-19 19:27:21 +00:00
9f414a12ad
[BugFix] Make PD work with Ray ( #21072 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-07-19 08:46:50 -07:00
6a971ed692
[Docs] Update the link to the 'Prometheus/Grafana' example ( #21225 )
2025-07-19 06:58:07 -07:00
da6579bf41
[CI/CD][bugfix]fix: error argument to loads has incompatible type ( #21223 )
...
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com >
2025-07-19 05:16:48 -07:00
c81259d33a
Fix/remove some broken model executor tests ( #21224 )
...
Signed-off-by: Rabi Mishra <ramishra@redhat.com >
2025-07-19 12:15:07 +00:00
e3a0e43d7f
[bugfix] Fix auto thread-binding when world_size > 1 in CPU backend and refactor code ( #21032 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-19 05:13:55 -07:00
b3d82108e7
[Bugfix][Frontend] Fix openai CLI arg middleware ( #21220 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-19 02:40:38 -07:00
6d0734c562
[NVIDIA] Add SM100 Flashinfer MoE blockscale fp8 backend for low latency ( #20645 )
...
Signed-off-by: kaixih <kaixih@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-19 02:33:01 -07:00
7d94577138
Add torch golden impl for moe_align_block_size kernel test ( #20653 )
...
Signed-off-by: Shixian Cui <shixian@amazon.com >
Co-authored-by: Shixian Cui <shixian@amazon.com >
2025-07-19 02:32:36 -07:00
59f935300c
[BugFix] Fix potential cuda-graph IMA ( #21196 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-19 02:18:47 -07:00
18e519ec86
[Bugfix] Fix ndarray video color from VideoAsset ( #21064 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-19 02:17:16 -07:00
1eaff27815
[V0 deprecation] Remove long context LoRA ( #21169 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-19 02:15:41 -07:00
cf8cc32674
Fix a couple of Voxtral tests ( #21218 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-07-19 09:13:41 +00:00
3a2cb2649d
[Misc][Tools][Benchmark] Add readme file for auto_tune script ( #20779 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-07-19 09:06:59 +00:00
3e04107d97
[Model] EXAONE 4.0 model support ( #21060 )
...
Signed-off-by: Deepfocused <rlawhdrhs27@gmail.com >
Signed-off-by: woongsik <rlawhdrhs27@gmail.com >
2025-07-19 14:25:44 +08:00
37bd8d6e4c
[Bug] DeepGemm: Fix TypeError: per_block_cast_to_fp8() missing 1 required positional argument: 'use_ue8m0' for SM100 ( #21187 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-18 23:25:22 -07:00
468e2400fe
[BugFix][CPU] Fix TorchSDPABackendImpl doesn't have use_irope ( #21200 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-18 23:18:48 -07:00
dcc6cfb991
[Kernel][Performance] Tweak MoE Batched silu_mul_fp8_quant_deep_gemm kernel ( #21193 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-18 23:09:51 -07:00
dd572c0ab3
[V0 Deprecation] Remove V0 Spec Decode workers ( #21152 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-18 21:47:50 -07:00
9ffe905a41
[Bugfix][Model] Fix LoRA for Mistral-Small-3.1-24B-Instruct-2503 ( #21183 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-07-18 21:15:03 -07:00
9a9fda1423
[Core] Support Local Chunked Attention for Hybrid KV Cache ( #19351 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
Signed-off-by: Lu Fang <fanglu@meta.com >
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lu Fang <fanglu@meta.com >
2025-07-18 20:48:38 -07:00
466e878f2a
[Quantization] Enable BNB support for more MoE models ( #21100 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-18 17:52:02 -07:00
217937221b
Elastic Expert Parallel Initial Support ( #20775 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-18 17:46:09 -07:00
5782581acf
[Bugfix] Voxtral on Blackwell GPUs (RTX 50 series) ( #21077 )
...
Signed-off-by: hax0r31337 <liulihaocaiqwq@gmail.com >
2025-07-18 18:40:18 -04:00
0f199f197b
[Core] Avoid KVCacheBlock.__eq__ invocations in FreeKVCacheBlockQueue ( #21005 )
...
Signed-off-by: Jialin Ouyang <jialino@meta.com >
2025-07-18 12:34:40 -07:00
b2eb2b5ad7
[Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6.0 ( #19346 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-07-18 14:10:21 -04:00
21274ab476
[CI] Update CODEOWNERS for vllm/compilation ( #21185 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-07-18 06:51:12 -07:00
ed8cbfedf8
Let GraniteMoeAttention use YaRN ( #21174 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-18 05:52:52 -07:00
45badd05d0
[Core] Set pooling params based on task and model ( #21128 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-18 05:41:17 -07:00
4adc66f64d
[Bugfix] Allocate less memory in non-batched CUTLASS MoE ( #21121 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2025-07-18 18:55:52 +08:00
55ad648715
[Doc] Fix typo in model name ( #21178 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-18 03:55:10 -07:00
5895afd780
[Bugfix] The special_tokens in tokenizer should also be controlled by do_lower_case in encoder_config. ( #20750 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-18 09:10:47 +00:00
ca4eb82bcb
[Model] Re-add the implicit conversion feature for as_seq_cls_model ( #21103 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-18 07:15:07 +00:00
ba2dfbb0c2
[Misc] Make MM embedding merge interface explicit in model runner ( #21147 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-18 07:13:57 +00:00
1bf65138f6
[benchmark] Sending request strictly follows the random intervals ( #21108 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-18 06:22:08 +00:00
54cf1cae62
[Misc] Do not print async output warning for v1 ( #21151 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-17 21:57:02 -07:00
5780121c95
[Perf] Add swap_ab to SM90 FP8 non-block CUTLASS moe grouped gemm ( #20911 )
...
Signed-off-by: Shixian Cui <shixian@amazon.com >
Co-authored-by: Shixian Cui <shixian@amazon.com >
2025-07-18 04:34:43 +00:00
c7d8724e78
[Core] FlashInfer CUTLASS fused MoE backend (NVFP4) ( #20037 )
...
Signed-off-by: shuw <shuw@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-17 21:32:45 -07:00
b38baabcf9
[Doc] Add inplace weights loading example ( #19640 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-17 21:12:23 -07:00
89cab4d01f
[Attention] Make local attention backend agnostic ( #21093 )
2025-07-18 00:10:42 -04:00
b9a21e9173
[Docs] Update supported models documentation with missing models ( #20844 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-07-17 20:12:13 -07:00
c4e3b12524
[Docs] Add minimal demo of Ray Data API usage ( #21080 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-17 20:09:19 -07:00
8dfb45ca33
[Bugfix] Fix the tensor non-contiguous issue for Flashinfer TRT-LLM backend attention kernel ( #21133 )
2025-07-18 00:35:58 +00:00
8a8fc94639
[Log] Debugging Log with more Information ( #20770 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-18 00:19:46 +00:00
4de7146351
[V0 deprecation] Remove V0 HPU backend ( #21131 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-17 16:37:36 -07:00
ac9fb732a5
On environments where numa cannot be detected we get 0 ( #21115 )
...
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-07-17 18:52:17 +00:00
a3a6c695f4
[Misc] Qwen MoE model supports LoRA ( #20932 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-17 18:32:52 +00:00
90bd2ab6e3
[Model] Update pooling model interface ( #21058 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-17 16:05:40 +00:00
9fb2d22032
[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE ( #20762 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2025-07-17 09:56:44 -04:00
2d6a38209b
[Docs] Move code block out of admonition now that it's short ( #21118 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-17 06:12:29 -07:00
89e3c4e9b4
[Misc] Avoid unnecessary import ( #21106 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-07-17 12:57:41 +00:00
fe8a2c544a
[Docs] Improve docstring formatting for FusedMoEParallelConfig.make ( #21117 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-17 04:13:00 -07:00
4ef00b5cac
[VLM] Add Nemotron-Nano-VL-8B-V1 support ( #20349 )
...
Signed-off-by: Kyle Huang <kylhuang@nvidia.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-17 03:07:55 -07:00
5a7fb3ab9e
[Model] Add ToolParser and MoE Config for Hunyuan A13B ( #20820 )
...
Signed-off-by: Asher Zhang <asherszhang@tencent.com >
2025-07-17 09:10:09 +00:00
11dfdf21bf
[Kernel] DeepGemm MoE : Integrate triton permute / unpermute kernels ( #20903 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-17 08:10:37 +00:00
fdc5b43d20
[Bugfix]: Fix final_res_batch list index out of range error ( #21055 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-17 00:29:09 -07:00
c5b8b5953a
[Misc] Fix PhiMoE expert mapping ( #21085 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-17 05:47:49 +00:00
4fcef49ec4
[V1] [KVConnector] Fix MultiprocExecutor worker output aggregation ( #21048 )
...
Signed-off-by: David Ben-David <davidb@pliops.com >
Co-authored-by: David Ben-David <davidb@pliops.com >
2025-07-17 13:29:45 +08:00
8a4e5c5f3c
[V1][P/D]Enhance Performance and code readability for P2pNcclConnector ( #20906 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-07-16 22:13:00 -07:00
76b494444f
[Attention] Refactor attention metadata builder interface ( #20466 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-17 04:44:25 +00:00
28a6d5423d
[Bugfix] Fix Machete zero point issue for GPTQ models on SM90 ( #21066 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-16 19:54:45 -07:00
58760e12b1
[TPU] Start using python 3.12 ( #21000 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-07-16 19:37:44 -07:00
a50d918225
[Docker] Allow FlashInfer to be built in the ARM CUDA Dockerfile ( #21013 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-16 19:37:13 -07:00
c9ba8104ed
[Bugfix] weight loading use correct tp_group with patch_tensor_parallel_group ( #21024 )
...
Signed-off-by: KevinXiong-C <kevin_xiong1997@outlook.com >
2025-07-16 19:36:36 -07:00
4e7dfbe7b4
Update PyTorch to torch==2.7.1 for CUDA ( #21011 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-17 02:30:44 +00:00
72ad273582
Remove torch_xla.tpu.version() from pallas.py. ( #21065 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-17 00:25:26 +00:00
01513a334a
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) ( #12010 )
...
Signed-off-by: Nir David <ndavid@habana.ai >
Signed-off-by: Uri Livne <ulivne@habana.ai >
Co-authored-by: Uri Livne <ulivne@habana.ai >
2025-07-16 15:33:41 -04:00
ac2bf41e53
[Model] Remove model sampler ( #21059 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-16 19:03:37 +00:00
a931b4cdcf
Remove Qwen Omni workaround that's no longer necessary ( #21057 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-16 16:25:23 +00:00
a0f8a79646
[fix] fix qwen image_embeds input ( #21049 )
...
Signed-off-by: h-avsha <avshalom.manevich@hcompany.ai >
2025-07-16 15:17:20 +00:00
18bdcf4113
feat - add a new endpoint get_tokenizer_info to provide tokenizer/chat-template information ( #20575 )
...
Signed-off-by: m-misiura <mmisiura@redhat.com >
2025-07-16 21:52:14 +08:00
1c3198b6c4
[Model] Consolidate pooler implementations ( #20927 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-16 13:39:13 +00:00
260127ea54
[Docs] Add intro and fix 1-2-3 list in frameworks/open-webui.md ( #19199 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-16 06:11:38 -07:00
d0dc4cfca4
Fix inadvertently silenced PP tests for mp, add DeepSeek V2/V3 model family to PP tests ( #20831 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-07-16 00:14:49 -07:00
d31a647124
[BugFix] Fix import error on non-blackwell machines ( #21020 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-15 22:27:29 -07:00
85431bd9ad
[TPU] fix kv_cache_update kernel block size choosing logic ( #21007 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-16 04:39:48 +00:00
c11013db8b
[Meta] Llama4 EAGLE Support ( #20591 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: qizixi <qizixi@meta.com >
2025-07-15 21:14:15 -07:00
1eb2b9c102
[CI] update typos config for CI pre-commit and fix some spells ( #20919 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-07-15 21:12:40 -07:00
6ebf313790
Avoid direct comparison of floating point numbers ( #21002 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-07-15 21:12:14 -07:00
cfbcb9ed87
[Voxtral] Add more tests ( #21010 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-15 21:11:49 -07:00
76ddeff293
[Doc] Remove duplicate docstring ( #21012 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-15 20:09:13 -07:00
f46098335b
[Bugfix] Fix Mistral3 support on SM100/SM120 ( #20998 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-15 20:08:41 -07:00
e9534c7202
[CI][HPU] update for v0 deprecate by switching to VLLM_TARGET_DEVICE=empty ( #21006 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2025-07-15 20:07:05 -07:00
7976446015
Add Dockerfile argument for VLLM_USE_PRECOMPILED environment ( #20943 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
2025-07-15 19:53:57 -07:00
fcb9f879c1
[Bugfix] Correct per_act_token in CompressedTensorsW8A8Fp8MoECutlassM… ( #20937 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-07-15 19:53:42 -07:00
3ed94f9d0a
[Docs] Enhance Anyscale documentation, add quickstart links for vLLM ( #21018 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-15 19:46:56 -07:00
fa839565f2
[Misc] Refactor: Improve argument handling for conda command ( #20481 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-15 19:43:19 -07:00
75a99b98bf
[Chore] Remove outdated transformers check ( #20989 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-07-15 19:42:40 -07:00
b5c3b68359
[Misc] bump xgrammar version to v0.1.21 ( #20992 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-15 19:42:16 -07:00
6cbc4d4bea
[Model] Add ModelConfig class for GraniteMoeHybrid to override default max_seq_len_to_capture ( #20923 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-15 19:19:10 -07:00
153c6f1e61
[Frontend] Remove print left in FrontendArgs.add_cli_args ( #21004 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-15 19:18:41 -07:00
34cda778a0
[Frontend] OpenAI Responses API supports input image ( #20975 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-15 18:59:36 -06:00
30800b01c2
[Nvidia] Integrate SM100 cudnn prefill API to MLA prefill ( #20411 )
...
Signed-off-by: Elfie Guo <elfieg@nvidia.com >
Co-authored-by: Elfie Guo <eflieg@nvidia.com >
2025-07-15 17:56:45 -07:00
10be209493
[Bug Fix] get_distributed_init_method should get the ip from get_ip i… ( #20889 )
...
Signed-off-by: Chen Li <lcpingping@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-07-15 21:23:52 +00:00
19c863068b
[Frontend] Support cache_salt in /v1/completions and /v1/responses ( #20981 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
2025-07-15 21:01:04 +00:00
f29fd8a7f8
[BugFix] fix 3 issues: (1) using metadata for causal-conv1d, (2) indexing overflow in v1 vLLM, and (3) init_states in v0 ( #20838 )
...
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com >
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com >
2025-07-15 16:08:26 -04:00
ed10f3cea1
[ROCm] warpSize is being made non constexpr in ROCm 7.0 ( #20330 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-15 14:01:44 -04:00
b637e9dcb8
Add full serve CLI reference back to docs ( #20978 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 17:42:30 +00:00
1e36c8687e
[Deprecation] Remove nullable_kvs ( #20969 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 17:21:50 +00:00
5bac61362b
Configure Gemini ( #20971 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 09:37:05 -07:00
313ae8c16a
[Deprecation] Remove everything scheduled for removal in v0.10.0 ( #20979 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 15:57:53 +00:00
c847e34b39
[CI/Build] Fix wrong path in Transformers Nightly Models Test ( #20994 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-15 08:53:16 -07:00
e7e3e6d263
Voxtral ( #20970 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-15 07:35:30 -07:00
4ffd963fa0
[v1][core] Support for attention free models ( #20811 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2025-07-15 14:20:01 +00:00
56fe4bedd6
[Deprecation] Remove TokenizerPoolConfig ( #20968 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 14:00:50 +00:00
d91278181d
[doc] Add more details for Ray-based DP ( #20948 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-15 05:37:12 -07:00
20149d84d9
[MISC] Add init files for python package ( #20908 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-07-15 12:16:33 +00:00
3534c39a20
[V1] [Hybrid] Refactor mamba state shape calculation; enable V1 via cli ( #20840 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-15 04:04:35 -07:00
c586b55667
[TPU] Optimize kv cache update kernel ( #20415 )
...
Signed-off-by: Yifei Teng <tengyifei88@gmail.com >
2025-07-15 03:56:43 -07:00
33d560001e
[Docs] Improve documentation for ray cluster launcher helper script ( #20602 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-15 03:55:45 -07:00
f148c44c6a
[frontend] Refactor CLI Args for a better modular integration ( #20206 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-07-15 02:23:42 -07:00
235bfd5dfe
[Docs] Improve documentation for RLHF example ( #20598 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-15 01:54:10 -07:00
68d28e37b0
[frontend] Add --help=page option for paginated help output ( #20961 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-15 00:42:00 -07:00
37a7d5d74a
[Misc] Refactor AllReduceFusionPass. Remove parameter ( #20918 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-07-15 06:57:40 +00:00
d4d309409f
Implement Async Scheduling ( #19970 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-14 23:01:46 -07:00
85bd6599e4
[Model] Add AutoWeightsLoader support for BERT, RoBERTa ( #20534 )
...
Signed-off-by: Jennifer He <islandhe@gmail.com >
Signed-off-by: <islandhe@gmail.com >
Signed-off-by: Jen H <islandhe@gmail.com >
2025-07-15 13:34:24 +08:00
91b3d190ae
[cold start] replace VLLM_COMPILE_DEPYF with debug_dump_dir ( #20940 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-07-15 13:02:17 +08:00
fc017915f5
[Doc] Clearer mistral3 and pixtral model support description ( #20926 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-14 21:56:53 -07:00
9ad0a4588b
[Bugfix] Switch bailout logic for kv-cache-dtype with SM100 Flashinfer ( #20934 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-07-15 03:27:50 +00:00
016b8d1b7f
Enabled BnB NF4 inference on Gaudi ( #20172 )
...
Signed-off-by: Ruheena Suhani Shaik <rsshaik@habana.ai >
2025-07-14 20:26:08 -07:00
80305c1b24
[CI] Fix flaky test_streaming_response test ( #20913 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-14 20:15:15 -07:00
37e2ecace2
feat: add image zoom to improve image viewing experience ( #20763 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-14 20:14:23 -07:00
054c8657e3
[Docs] Add Kuberay to deployment integrations ( #20592 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-14 20:13:55 -07:00
d4170fad39
Use w8a8 quantized matmul Pallas kernel ( #19170 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-07-15 03:06:33 +00:00
946aadb4a0
[CI/Build] Split Entrypoints Test into LLM and API Server ( #20945 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-15 02:44:18 +00:00
bcdfb2a330
[Bugfix] Fix incorrect dispatch for CutlassBlockScaledGroupedGemm and DeepGEMM ( #20933 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-15 01:42:17 +00:00
ba8c300018
[BugFix] VLLM_DISABLE_COMPILE_CACHE=1 should disable all reads and writes from the cache ( #20942 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-07-15 01:26:18 +00:00
8cdc371217
SM100 Cutlass MLA decode with unrestricted num_heads (< 128) for DeepSeek TP ( #20769 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-07-15 01:06:38 +00:00
61e20828da
Fall back if flashinfer comm module not found ( #20936 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-14 23:11:18 +00:00
55e1c66da5
[Docs] remove outdated performance benchmark ( #20935 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-07-14 22:14:17 +00:00
86f3ac21ce
Fix overflow indexing in causal_conv1d kernel ( #20938 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-14 21:43:07 +00:00
149f2435a5
[Misc] Relax translations tests ( #20856 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-14 20:08:36 +00:00
c0569dbc82
[Misc] ModularKernel : Perform WeightAndReduce inside TritonExperts & DeepGemmExperts ( #20725 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-14 19:47:16 +00:00
8bb43b9c9e
Add benchmark dataset for mlperf llama tasks ( #20338 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-14 19:10:07 +00:00
559756214b
Change default model to Qwen3-0.6B ( #20335 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-07-14 16:54:52 +00:00
6d0cf239c6
[CI/Build] Add Transformers nightly tests in CI ( #20924 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-14 16:33:17 +00:00
3fc964433a
[Misc] Clean up Aimv2 config registration in Ovis config ( #20921 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-14 15:36:43 +00:00
0caf61c08a
[CI] Update codeowner for compilation code ( #20929 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-14 08:33:19 -07:00
667624659b
[CI] cc folks on changes to vllm/compilation ( #20925 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-07-14 07:52:17 -07:00
38efa28278
[Model] Add Ling implementation ( #20680 )
...
Signed-off-by: vito.yy <vito.yy@antgroup.com >
2025-07-14 22:10:32 +08:00
e8cc53af5e
[Misc] Log the reason for falling back to FlexAttention ( #20699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-14 04:16:51 -07:00
a4851cfe68
[Bugfix]: Fix messy code when using logprobs ( #20910 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-14 11:06:45 +00:00
9887e8ec50
[Misc] Remove unused function ( #20909 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-14 10:48:55 +00:00
f326ab9c88
[Bugfix] Bump up mistral_common to support v13 tokenizer ( #20905 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-14 10:45:03 +00:00
dcf2a5e208
[CI/Build] Fix OOM issue in Jina-VL test ( #20907 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-14 10:32:35 +00:00
1e9438e0b0
[MISC] Move bind_kv_cache to worker module ( #20900 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-07-14 09:40:00 +00:00
697ef765ee
[Refactor][V1] Move outlines utils for V1 imports ( #20878 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-07-14 00:58:35 -07:00
a99b9f7dee
[Quantization] add BNB for MixtralForCausalLM ( #20893 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-14 07:34:34 +00:00
c488b928a7
[ROCm] [Bugfix] [Critical]: Fix mamba compilation bug ( #20883 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-07-14 15:23:28 +08:00
2c7fa47161
Fix: Add missing EOFError handling in CLI complete command ( #20896 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-14 07:09:57 +00:00
88fc8a97e3
Removing redundant python version check ( #20888 )
...
Signed-off-by: Dannyso05 <dansong1177@gmail.com >
2025-07-14 06:15:05 +00:00
66f6fbd393
[Prefix Cache] Add reproducible prefix-cache block hashing using SHA-256 + CBOR (64bit) ( #20511 )
...
Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com >
2025-07-14 02:45:31 +00:00
8632e831ba
[Core] Add update_config RPC method ( #20095 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-14 00:49:18 +00:00
4bbfc36b16
[V1] Hybrid allocator without prefix caching ( #20661 )
...
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
2025-07-13 16:55:14 +00:00
80d38b8ac8
[V1] [ROCm] [AITER] Upgrade AITER to commit 916bf3c and bugfix APIs ( #20880 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-07-13 15:19:32 +00:00
211b6a6113
[Bugfix] fix define of RerankDocument ( #20877 )
...
Signed-off-by: liuchenlong <liuchenlong@xiaohongshu.com >
Co-authored-by: liuchenlong <liuchenlong@xiaohongshu.com >
2025-07-13 14:32:40 +00:00
247102f07f
[Bugfix] Fix: add patch_rope_scaling after hf override ( #20857 )
...
Signed-off-by: Wang Siyuan <wsy0227@sjtu.edu.cn >
Signed-off-by: Wang Siyuan <sywang0227@gmail.com >
2025-07-13 00:13:25 -07:00
bd4c1e6fdb
Support for LlamaForSequenceClassification ( #20807 )
...
Signed-off-by: thechaos16 <thechaos16@gmail.com >
2025-07-13 00:09:34 -07:00
99b4f080d8
Renable google/gemma-3-1b-it accuracy test. ( #20866 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-12 21:48:56 -07:00
020f58abcd
[Core] Support multiple tasks per model ( #20771 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-12 19:40:11 -07:00
c1acd6d7d4
[Refactor] Change the way of import triton ( #20774 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-12 19:39:55 -07:00
3b3b778d4a
[Bugfix] Fix a couple PPLX+CUTLASS MoE bugs ( #20825 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2025-07-12 19:39:14 -07:00
42d440c22b
[Perf] Use Triton instead of Torch for DeepGEMM Per Token Group Quant ( #20841 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-12 19:38:45 -07:00
f45a332886
[Sched] Enhance the logic to remove stopped requests from queues ( #20739 )
2025-07-12 15:33:13 -07:00
6e2c176e1f
[Bugfix] Restrict Machete to only run on Hopper ( #20830 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-12 17:34:40 +00:00
a86754a12b
[docs] convert supported configs to table ( #20858 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-12 06:54:50 -07:00
c2a2f19aba
[Bugfix] Fix Tensor Parallelism Padding Consistency in Granite Models ( #20843 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-07-12 06:11:30 -07:00
2c11a738b3
[Model] New model support for microsoft/Phi-4-mini-flash-reasoning ( #20702 )
...
Signed-off-by: Congcong Chen <congcongchen@microsoft.com >
2025-07-12 06:02:10 -07:00
b639327ad9
Revert "Use NVCC --compress-mode to reduce binary size by 30% #20694 " ( #20853 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 23:07:35 -07:00
4afe687a82
Enable ModelOpt Llama4 fp8 checkpoint deployment ( #20419 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
2025-07-11 23:07:16 -07:00
5de8d9f111
Remove extra tensor on CPU ( #20693 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-07-12 14:06:34 +08:00
c1c8ca57ff
[cold start time] add envs.VLLM_COMPILE_DEPYF to guard decompile ( #20790 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-07-11 23:06:13 -07:00
a3a5a47e48
[Bugfix] Fix torch.compile x LoRA for PyTorch 2.8 ( #20823 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-07-11 23:06:04 -07:00
fb25e95688
[Docs] Update basic.md ( #20846 )
2025-07-11 23:05:32 -07:00
0d4891cd03
[Bug] Fix DeepGemm for EP low latency case ( #20833 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-11 23:05:12 -07:00
f56d2996ca
[Misc] Respect no_use_tqdm_on_load flag while capturing CUDA graph ( #20834 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-07-11 23:04:45 -07:00
147afb448b
[Bugfix] Replace unavailable video url in multimodal test ( #20854 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-12 05:25:39 +00:00
3c7d942da8
[Frontend] Abstract prompt and SpeechToTextConfig for transcriptions models ( #20637 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-11 21:33:26 -07:00
890323dc1b
[Bugfix] : Fix typo - logger.warn_once -> logger.warning_once ( #20852 )
2025-07-11 20:56:24 -07:00
01cae37713
[CI/Build] Ensure compatability with Transformers v4.53 ( #20541 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-11 20:53:07 -07:00
11c0198615
[Bugfix] Fix tensor parallel issue in Qwen3 reranker weight loading ( #20682 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-07-11 20:52:43 -07:00
b1235c3e10
[Bugfix] Lazy import fused_experts in BitsAndBytesMoEMethod to avoid break not-cuda-alike devices ( #20822 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-11 20:52:05 -07:00
44d02f54db
[Misc] Restrict deep_gemm's log output ( #20827 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-11 20:50:42 -07:00
a8593237c0
Add pynccl all-gatherv and reducescatterv ( #20154 )
...
Signed-off-by: Trevor Morris <tmorris@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-11 18:59:23 -07:00
fc0f41d10a
Integration SM100 FlashInfer fused allreduce RMSNorm ( #20691 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-07-11 18:58:15 -07:00
7b828e30d5
[CI Bug] Fix Async Engine, Inputs, Utils, Worker Test: 'State' object has no attribute 'enable_server_load_tracking' ( #20845 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-11 18:57:24 -07:00
5f0af36af5
Update kimi-k2 tool calling docs, enable unit tests ( #20821 )
...
Signed-off-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
2025-07-11 20:16:14 +00:00
0d21b2664c
[Bugfix] Fix OOM in language generation test ( #20814 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-11 11:21:52 -07:00
9907fc4494
[Docs] Data Parallel deployment documentation ( #20768 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-11 09:42:10 -07:00
d47661f0cd
[Kernel] Basic tuned configs for NVFP4 CUTLASS dense GEMM ( #20646 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 10:05:33 -06:00
53fa457391
[Misc] Add unit tests for MoE ModularKernel combinations + Profiling utility ( #20449 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-11 07:51:46 -07:00
6fb162447b
[doc] fix ordered list issue ( #20819 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-11 06:49:46 -07:00
66177189c5
[Bugfix] Add missing field to TritonLanguagePlaceholder ( #20812 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-11 05:25:11 -07:00
b4f0b5f9aa
Temporarily suspend google/gemma-3-1b-it. ( #20722 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-11 11:21:26 +00:00
cbd14ed561
[Bugfix] Refactor /invocations to be task-agnostic ( #20764 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-11 03:20:54 -07:00
7bd4c37ae7
[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). ( #19825 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: shuw <shuw@nvidia.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-11 09:23:23 +00:00
8020e98c9f
[Quantization][1/N] MoE support BNB-Inflight Quantization ( #20061 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-11 08:01:13 +00:00
762be26a8e
[Bugfix] Upgrade depyf to 0.19 and streamline custom pass logging ( #20777 )
...
Signed-off-by: Luka Govedic <lgovedic@redhat.com >
Signed-off-by: luka <lgovedic@redhat.com >
2025-07-11 00:15:22 -07:00
6a9e6b2abf
[doc] fold long code block ( #20795 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-10 23:16:41 -07:00
5d09152ff1
[V1] Enable Mamba2 layers other than MambaMixer2 in the v1 engine ( #20660 )
...
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
2025-07-11 05:53:31 +00:00
31d5c1797f
[Perf][fp8] Use CustomOp abstraction for fp8 quant for better perf ( #19830 )
...
Signed-off-by: Luka Govedic <lgovedic@redhat.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-11 04:56:28 +00:00
35514b682a
[XPU] XCCL support enabled in torch 2.8.0.dev nightly builds ( #20705 )
...
Signed-off-by: ratnampa <ratnam.parikh@intel.com >
2025-07-10 20:39:52 -07:00
e2de455c34
[Feature] Integrate SM100 DeepGEMM support ( #20087 )
2025-07-10 20:18:05 -07:00
5b032352cc
[Attention] MLA - Flashinfer Ragged Prefill ( #20034 )
2025-07-10 20:17:47 -07:00
922f316441
[Model] Support HF format of minimax ( #20211 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 02:55:21 +00:00
5923ab9524
[fix]: disable cutlass block scaled group gemm for EP ( #20781 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
2025-07-11 02:39:18 +00:00
0cf893cae1
Add kimi-k2 tool parser ( #20789 )
...
Signed-off-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
2025-07-11 10:36:23 +08:00
cf75cd2098
[CI Bugfix] Specify same TORCH_CUDA_ARCH_LIST for flashinfer aot and install ( #20772 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 01:16:01 +00:00
b854321ffe
[Docs] Lazy import gguf ( #20785 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-07-10 16:06:37 -07:00
5b6fe23d05
[Bugfix][Benchmark] Make sure the output length > 0 when testing prefill workload. ( #20786 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-10 14:52:46 -07:00
f0c98cae27
[Misc] MoE ModularKernel : Introduce TopKWeightAndReduce ( #20648 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-10 14:40:38 -07:00
574ad60db9
[KVConnector] Always call connector clear_metadata() at end of step ( #20756 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: David Ben-David <sdavidbd@gmail.com >
2025-07-10 22:37:27 +01:00
fdadb6f43a
[Bugfix] Fused MoE Modular Kernel chunking loop ( #20392 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-10 20:31:10 +00:00
41060c6e08
[Core] Add Support for Default Modality Specific LoRAs [generate / chat completions] ( #19126 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-07-10 21:09:37 +01:00
3de2ed767f
[Bugfix] Remove assertion of expert_map being None ( #20714 )
...
Signed-off-by: Ming Yang <yming@meta.com >
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-07-10 19:55:22 +00:00
299252ea82
[CI] Fix pre commit issue ( #20782 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-10 12:48:13 -07:00
d6902ce79f
[V0][V1][Core] Add outlines integration for V1, and update V0 integration. ( #15975 )
...
Signed-off-by: Nathan Hoos <thwackyy.y@gmail.com >
2025-07-10 15:30:26 -04:00
5e53c89a74
[Bugfix] [CI] Fix Tensorizer LoRA test ( #20760 )
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com >
2025-07-10 19:07:06 +00:00
c66e38ea4c
[Test] Remove docker build from test. ( #20542 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-10 11:21:58 -07:00
251595368f
Fix DeepSeek-R1-0528 chat template ( #20717 )
...
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
Co-authored-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
2025-07-10 17:47:36 +00:00
4bed167768
[Model][VLM] Support JinaVL Reranker ( #20260 )
...
Signed-off-by: shineran96 <shinewang96@gmail.com >
2025-07-10 10:43:43 -07:00
b140416abf
[Model] Add reason parser for Hunyuan A13B Model. ( #20625 )
...
Signed-off-by: Asher Zhang <asherszhang@tencent.com >
2025-07-10 16:33:26 +00:00
5b8366b61a
[ROCm][Regression] Remove tensor creation that harms performance on ROCm ( #20741 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-10 09:22:23 -07:00
c7753a9809
[Hardware][CPU] Vllm int8 quantization enablement for ARM CPU ( #14129 )
...
Signed-off-by: nishith-fujitsu <nishith.jaiswal@fujitsu.com >
2025-07-10 15:59:04 +00:00
4b9a9435bb
Update Dockerfile FlashInfer to v0.2.8rc1 ( #20718 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-10 08:09:02 -07:00
3482fd7e4e
[Doc] Add engine args back in to the docs ( #20674 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-10 08:02:40 -07:00
77f77a951e
[Misc] Clean up mark to fork process in BNB tests ( #20692 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-10 13:59:40 +00:00
1a4f35e2ea
Normalize lm-eval command between baseline and correctness test ( #18560 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-10 13:27:32 +00:00
be1e128dfb
[CI Bugfix] Skip failing Tensorizer+LoRA test ( #20724 )
2025-07-10 21:15:03 +09:00
65393ee064
[doc] fix ordered list ( #20749 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-10 03:13:52 -07:00
dc221ad72d
[Bugfix][Build][Non-CUDA] Only referencing CMAKE_CUDA_COMPILER_VERSION on CUDA where it is defined ( #20738 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-10 02:58:11 -07:00
7571a4a7e5
[CI/Build] Fix Basic Models Test ( #20728 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-10 09:57:19 +00:00
f67d986dd1
[Misc] loose new-model tagger conditions ( #20747 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-10 02:54:47 -07:00
cc876d0f29
[KVConnector] Aggregate finished requests on the scheduler ( #19555 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-07-10 09:22:18 +01:00
fdfd409f8f
[TPU][Core]Make load weight exceed hbm error more instructive for customers ( #20644 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-07-10 07:01:17 +00:00
ffbcc9e757
[BugFix] Fix VllmConfig() construction on all platforms ( #20695 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-10 07:00:20 +00:00
59389c927b
[BugFix][CPU] Fix CPU worker dependency on cumem_allocator ( #20696 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-10 14:24:20 +08:00
8f2720def9
[Frontend] Support Tool Calling with both tool_choice='required' and $defs. ( #20629 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-10 13:56:35 +08:00
ad6c2e1a0b
Correct PPMissingLayer handling in Deepseek-V2-Lite PP deployment ( #20665 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-07-09 20:34:40 -07:00
49e8c7ea25
Use NVCC --compress-mode to reduce binary size by 30% ( #20694 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-09 18:26:48 -07:00
805d62ca88
[Misc] DP : Add ExpertTokensMetadata ( #20332 )
...
Signed-off-by: Varun <vsundarr@redhat.com >
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun <vsundarr@redhat.com >
2025-07-10 00:33:14 +00:00
b7d9e9416f
[CI/Build] Fix FlashInfer double build in Dockerfile ( #20651 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-09 17:41:56 -06:00
7c12a765aa
[Misc] Simplify the prefix caching logic on draft tokens ( #20701 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-09 14:48:35 -07:00
cd587c93ef
[BugFix]: Properly set engine_id when using multi connector ( #19487 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: leiyiming <leiyiming@kingsoft.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-07-09 20:32:44 +00:00
332d4cb17b
[Feature][Quantization] MXFP4 support for MOE models ( #17888 )
...
Signed-off-by: Felix Marty <felmarty@amd.com >
Signed-off-by: Bowen Bao <bowenbao@amd.com >
Signed-off-by: Felix Marty <Felix.Marty@amd.com >
Co-authored-by: Bowen Bao <bowenbao@amd.com >
2025-07-09 13:19:02 -07:00
bf03ff3575
[Kernel] Add Conch backend for mixed-precision linear layer ( #19818 )
...
Signed-off-by: Jacob Manning <jmanning+oss@stackav.com >
2025-07-09 13:17:55 -07:00
47043eb678
[Kernel] Triton implementation of causal-conv1d for Mamba-based models ( #18218 )
...
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com >
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-07-09 12:53:55 -07:00
31b96d1c64
Support Llama 4 for cutlass_moe_fp4 ( #20453 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-09 15:53:38 -04:00
e59ba9e142
[CI/Build] Enlarge tolerance for a CPU multi-modal test ( #20684 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-09 17:48:52 +00:00
403b481573
Remove heading form installation inc.md file ( #20697 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-09 10:42:51 -07:00
138709f8d1
[Doc] Update CPU doc ( #20676 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-09 10:28:30 -07:00
0bbac1c1b4
[Bench] Add NVFP4 GEMM benchmark script ( #20578 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-09 13:23:48 -04:00
a3e4e85ece
[XPU][CI] enhance xpu test support ( #20652 )
...
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com >
Co-authored-by: zhenwei-intel <zhenweiliu@habana.ai >
2025-07-09 16:53:09 +00:00
eb58f5953d
[TPU][Bugfix] fix test_pallas ( #20666 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-09 09:32:48 -07:00
4ac9c33f78
[Bugfix] Fix handling of Tensorizer arguments for LoadConfig ( #20643 )
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com >
2025-07-09 15:36:37 +00:00
efe73d0575
[doc] update doc format ( #20673 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-09 08:08:19 -07:00
853487bc1b
[Docs] Improve docs for RLHF co-location example ( #20599 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-09 08:06:43 -07:00
9ff2af6d2b
[Benchmark] Parameterization of streaming loading of multimodal datasets ( #20528 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-07-09 13:35:16 +00:00
70ca5484f5
[Doc] Update notes ( #20668 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-09 03:46:36 -07:00
5358cce5ff
[V1] [Doc] Update V1 docs for Mamba models ( #20499 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-09 01:02:41 -07:00
2155e95ef1
[Bugfix] Fix the issue where reasoning_content is None when Thinkng is enabled and tool_choice is set to 'required'. ( #20662 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-09 07:39:58 +00:00
f95570a52d
[Docs] fix minimax tool_calling docs error ( #20667 )
...
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-07-09 00:37:07 -07:00
b6e7e3d58f
[Intel GPU] support ray as distributed executor backend for XPU. ( #20659 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-07-09 00:36:58 -07:00
e760fcef22
[XPU] Use spawn with XPU multiprocessing ( #20649 )
...
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com >
2025-07-09 00:34:28 -07:00
6bbf1795b7
[Misc] Fix the size of batched_dummy_mm_inputs in profile_run ( #20434 )
...
Signed-off-by: bk-201 <joy25810@foxmail.com >
2025-07-08 20:15:44 -07:00
9e0ef888f0
Fix bullets in incremental_build.md ( #20642 )
2025-07-09 11:03:41 +08:00
97abeb1daa
[feat] enable SM100 CUTLASS block scaled group gemm for smaller batch sizes ( #20640 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
2025-07-09 11:03:35 +08:00
34dad19e7b
[Bugfix] set default set cuda_graph_sizes to min(self.max_num_seqs * 2, 512) ( #20628 )
...
Signed-off-by: izhuhaoran <izhuhaoran@qq.com >
2025-07-09 11:02:51 +08:00
6db31e7a27
[Hardware][PPC64LE] Enable V1 for ppc64le and ARM ( #20554 )
...
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com >
2025-07-08 20:00:41 -07:00
977180c912
[Docs] Improve documentation for multi-node service helper script ( #20600 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-08 19:44:26 -07:00
c40784c794
[BugFix][Intel GPU] Use refactored API for dist_backend in V1 worker ( #20596 )
...
Signed-off-by: ratnampa <ratnam.parikh@intel.com >
2025-07-08 19:44:23 -07:00
baed180aa0
[tech debt] Revisit lora request model checker ( #20636 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-07-09 09:42:41 +08:00
0b407479ef
[misc]refactor Platform.set_device method ( #20262 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-07-09 01:39:47 +00:00
5eaf570050
Replace multiply_add with homogeneous_multiply_add to Address Clang Template Parameter Issue ( #20142 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-09 00:30:18 +00:00
d8ee5a2ca4
[TPU][Bugfix] disable phi-3 test ( #20632 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-08 23:14:26 +00:00
b9fca83256
[Bugfix] Fix GLM-4.1-V video prompt update ( #20635 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-08 23:13:58 +00:00
32dffc2772
[Core] Rename get_max_tokens_per_item for backward compatibility ( #20630 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-08 23:11:30 +00:00
c438183e99
[Bugfix] Fix topk_ids indices_type for CUTLASS w8a8 FP8 MoE ( #20166 )
...
Signed-off-by: Ming Yang <yming@meta.com >
2025-07-08 23:10:57 +00:00
baba0389f7
[CI] Increase the threshold of the MTEB RERANK tests ( #20615 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-08 08:10:11 -07:00
c6c22f16d3
Revert invalid spellchecker fix on deepseek_vl2 ( #20618 )
2025-07-08 15:07:14 +00:00
dd382e0fe3
[Model] Implement missing get_language_model for Keye-VL ( #20631 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-08 07:47:46 -07:00
849590a2a7
Update torch/xla pin to 20250703 ( #20589 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-07-08 07:44:02 -07:00
a4c23314c0
[xpu]feat: support multi-lora on xpu ( #20616 )
...
Signed-off-by: yan <yan.ma@intel.com >
2025-07-08 22:07:10 +08:00
b942c094e3
Stop using title frontmatter and fix doc that can only be reached by search ( #20623 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-08 03:27:40 -07:00
b4bab81660
Remove unnecessary explicit title anchors and use relative links instead ( #20620 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-08 02:49:13 -07:00
b91cb3fa5c
[Docs] Improve documentation for Deepseek R1 on Ray Serve LLM ( #20601 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-08 02:09:06 -07:00
71d1d75b7a
[PD][Nixl] Remote consumer READ timeout for clearing request blocks ( #20139 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-08 08:56:40 +01:00
72d14d0eed
[Frontend] [Core] Integrate Tensorizer in to S3 loading machinery, allow passing arbitrary arguments during save/load ( #19619 )
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com >
Co-authored-by: Eta <esyra@coreweave.com >
2025-07-07 22:47:43 -07:00
e34d130c16
[TPU] Temporary fix vmem oom for long model len by reducing page size ( #20278 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-07-08 05:16:16 +00:00
7721ef1786
[CI/Build][CPU] Fix CPU CI and remove all CPU V0 files ( #20560 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-07 22:13:44 -07:00
8369b7c2a9
[Misc] improve error msg ( #20604 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-07 21:45:18 -07:00
3eb4ad53f3
[Docs] Add Anyscale to frameworks ( #20590 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-07 20:09:13 -07:00
90a2769f20
[Docs] Add Ray Serve LLM section to openai compatible server guide ( #20595 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-07 20:08:05 -07:00
e60d422f19
[Docs] Improve docstring for ray data llm example ( #20597 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-07 20:06:26 -07:00
0d914c81a2
[Docs] Rewrite offline inference guide ( #20594 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-07 20:06:02 -07:00
6e428cdd7a
[Doc] Syntax highlight request responses as JSON instead of bash ( #20582 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 20:02:45 -07:00
93b9d9f499
[Bugfix]: Fix messy code when using logprobs ( #19209 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-08 11:02:15 +08:00
af107d5a0e
Make distinct code and console admonitions so readers are less likely to miss them ( #20585 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 19:55:28 -07:00
31c5d0a1b7
[Optimize] Don't send token ids when kv connector is not used ( #20586 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-07 19:04:54 -07:00
afb7cff1b9
[Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe ( #20167 )
...
Signed-off-by: Ming Yang <yming@meta.com >
2025-07-08 01:07:22 +00:00
d2e841a10a
[Misc] Improve logging for dynamic shape cache compilation ( #20573 )
...
Signed-off-by: kyolebu <kyu@redhat.com >
2025-07-08 00:48:09 +00:00
14601f5fba
[Config] Refactor mistral configs ( #20570 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-07-07 15:25:10 -07:00
042d131f39
Fix links in multi-modal model contributing page ( #18615 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 21:13:52 +00:00
8e807cdfa4
[Misc] feat output content in stream response ( #19608 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-07 20:45:10 +00:00
e601efcb10
[Misc] Add fully interleaved support for multimodal 'string' content format ( #14047 )
...
Signed-off-by: drobyshev.anton <drobyshev.anton@wb.ru >
Co-authored-by: drobyshev.anton <drobyshev.anton@wb.ru >
2025-07-07 19:43:08 +00:00
22dd9c2730
[Kernel] Optimize Prefill Attention in Unified Triton Attention Kernel ( #20308 )
...
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com >
2025-07-07 19:08:12 +00:00
a6d795d593
[DP] Copy environment variables to Ray DPEngineCoreActors ( #20344 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-07 10:14:22 -07:00
a37d75bbec
[Front-end] microbatch tokenization ( #19334 )
...
Signed-off-by: zt2370 <ztang2370@gmail.com >
2025-07-07 17:54:10 +01:00
edd270bc78
[Bugfix] Prevent IndexError for cached requests when pipeline parallelism is disabled ( #20486 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-07-07 09:41:15 -07:00
110df74332
[Model][Last/4] Automatic conversion of CrossEncoding model ( #19675 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-07 14:46:04 +00:00
1ad69e8375
[Doc] Fix some MkDocs snippets used in the installation docs ( #20572 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 07:44:34 -07:00
b8a498c9b2
[Doc] Add outline for content tabs ( #20571 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 07:43:26 -07:00
923147b5e8
[Doc] Fix internal links so they don't always point to latest ( #20563 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 04:15:50 -07:00
45877ef740
[Doc] Use gh-pr and gh-issue everywhere we can in the docs ( #20564 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 03:54:22 -07:00
6e4bef1bea
[Doc] Remove extra whitespace from CI failures doc ( #20565 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 03:35:47 -07:00
4ff79a136e
[Misc] Set the minimum openai version ( #20539 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-07 09:15:26 +00:00
448acad31e
[Misc] remove unused jinaai_serving_reranking ( #18878 )
...
Signed-off-by: Abirdcfly <fp544037857@gmail.com >
2025-07-07 09:14:12 +00:00
eb0b2d2f08
[Docs] Clean up tables in supported_models.md ( #20552 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-07 01:46:31 -07:00
3112271f6e
[XPU] log clean up for XPU platform ( #20553 )
...
Signed-off-by: yan <yan.ma@intel.com >
2025-07-07 01:38:22 -07:00
1fd471e957
Add docstrings to url_schemes.py to improve readability ( #20545 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-07 08:31:49 +00:00
2c5ebec064
[XPU][CI] add v1/core test in xpu hardware ci ( #20537 )
...
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com >
2025-07-07 01:16:40 -07:00
2e610deb72
[CI/Build] Enable phi2 lora test ( #20540 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-07 05:10:41 +00:00
6e2c19ce22
[Refactor]Abstract Platform Interface for Distributed Backend and Add xccl Support for Intel XPU ( #19410 )
...
Signed-off-by: dbyoung18 <yang5.yang@intel.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2025-07-07 04:32:32 +00:00
47db8c2c15
[Misc] add a tip for pre-commit ( #20536 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-06 19:42:06 -07:00
462b269280
Implement OpenAI Responses API [1/N] ( #20504 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-06 18:32:13 -07:00
c18b3b8e8b
[Bugfix] Add use_cross_encoder flag to use correct activation in ClassifierPooler ( #20527 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-06 14:01:48 -07:00
9528e3a05e
[BugFix][Spec Decode] Fix spec token ids in model runner ( #20530 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-06 19:44:52 +00:00
9fb52e523a
[V1] Support any head size for FlexAttention backend ( #20467 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-06 09:54:36 -07:00
e202dd2736
[V0 deprecation] Remove V0 CPU/XPU/TPU backends ( #20412 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-07-06 08:48:13 -07:00
43813e6361
[Misc] call the pre-defined func ( #20518 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-06 10:25:29 +00:00
cede942b87
[Benchmark] Add support for multiple batch size benchmark through CLI in benchmark_moe.py ( #20516 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-07-06 09:20:11 +00:00
fe1e924811
[Frontend] Support image object in llm.chat ( #19635 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Signed-off-by: Flora Feng <4florafeng@gmail.com >
2025-07-06 06:47:13 +00:00
4548c03c50
[TPU][Bugfix] fix the MoE OOM issue ( #20339 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-05 21:19:09 -07:00
40b86aa05e
[BugFix] Fix: ImportError when building on hopper systems ( #20513 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-06 12:17:30 +08:00
432870829d
[Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe ( #20509 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-07-06 12:08:30 +08:00
f73d02aadc
[BUG] Fix #20484. Support empty sequence in cuda penalty kernel ( #20491 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai >
2025-07-05 19:38:02 -07:00
c5ebe040ac
test_attention compat with coming xformers change ( #20487 )
...
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-05 19:37:59 -07:00
8d763cb891
[Misc] remove unused import ( #20517 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-05 19:17:06 -07:00
cf4cd53982
[Misc] Add logger.exception for TPU information collection failures ( #20510 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-05 07:24:32 -07:00
32c9be2200
[v1] Re-add fp32 support to v1 engine through FlexAttention ( #19754 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-05 09:41:10 +00:00
8aeaa910a2
Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod ( #20507 )
...
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-07-05 14:03:20 +08:00
906e05d840
[Misc] Remove the unused LoRA test code ( #20494 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-05 13:48:16 +08:00
ef9a2990ae
[doc] small fix ( #20506 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-04 20:56:39 -07:00
7e90870491
[Misc] Add security warning for development mode endpoints ( #20508 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-04 20:52:13 -07:00
d3f05c9248
[Doc] fix mutltimodal_inputs.md gh examples link ( #20497 )
...
Signed-off-by: Guy Stone <guys@spotify.com >
2025-07-04 16:41:35 -07:00
c108781c85
[CI Bugfix] Fix pre-commit failures on main ( #20502 )
2025-07-04 14:17:30 -07:00
3d184b95b8
[feat]: CUTLASS block scaled group gemm for SM100 ( #19757 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
Co-authored-by: Duncan Moss <dmoss@nvidia.com >
2025-07-04 12:58:04 -06:00
2f35a022e6
Enable V1 for Hybrid SSM/Attention Models ( #20016 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-07-04 17:46:53 +00:00
ffe00ef77a
[Misc] Small: Remove global media connector. Each test should have its own test connector object. ( #20395 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-07-04 08:15:03 -07:00
5561681d04
[CI] add kvcache-connector dependency definition and add into CI build ( #18193 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-07-04 06:49:18 -07:00
fbd62d8750
[Doc] Fix classification table in list of supported models ( #20489 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-04 06:08:02 -07:00
2e26f9156a
[Model][3/N] Automatic conversion of CrossEncoding model ( #20168 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-04 05:47:39 -07:00
9e5452ee34
[Bug][Frontend] Fix structure of transcription's decoder_prompt ( #18809 )
...
Signed-off-by: sangbumlikeagod <oironese@naver.com >
2025-07-04 11:28:07 +00:00
0e3fe896e2
Support Llama 4 for fused_marlin_moe ( #20457 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-04 07:55:10 +00:00
1caca5a589
[Misc] Add SPDX-FileCopyrightText ( #20428 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-04 07:40:42 +00:00
783921d889
[Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels ( #20331 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-04 15:06:24 +08:00
4a98edff1f
[Structured Outputs][V1] Skipping with models doesn't contain tokenizers ( #20365 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-07-04 15:05:49 +08:00
a7bab0c9e5
[Misc] small update ( #20462 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-03 20:33:44 -07:00
25950dca9b
Add ignore consolidated file in mistral example code ( #20420 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-07-04 02:55:07 +00:00
a4113b035c
[Platform] Add custom default max tokens ( #18557 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com >
2025-07-04 10:50:17 +08:00
7e1665b089
[Misc] Change warn_for_unimplemented_methods to debug ( #20455 )
2025-07-04 02:35:08 +00:00
8d1096e7db
[Bugfix] Register reducer even if transformers_modules not available ( #19510 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-07-03 22:08:12 +00:00
8d775dd30a
[Misc] Fix Unable to detect current VLLM config. Defaulting to NHD kv cache layout warning ( #20400 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-03 14:56:09 -07:00
78fe77534b
[Kernel] Enable fp8 support for pplx and BatchedTritonExperts. ( #18864 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-07-03 14:55:40 -07:00
2f2fcb31b8
[Misc] Remove _maybe_ignore_quant_config from GLM4.1v ( #20432 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-07-03 21:41:13 +00:00
1dba2c4ebe
[Misc] adjust for ipv6 for mookcacke url parse ( #20107 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-03 20:27:17 +00:00
71d6de3a26
[Misc] Clean up InternVL family config registration ( #19992 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-03 20:01:47 +00:00
536fd33003
[CI] Trimming some failing test groups from AMDPRODUCTION. ( #20390 )
2025-07-03 08:21:31 -07:00
619b9f5c7e
[Frontend] fix duplicate output for bench subcmd ( #20446 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-03 08:02:06 -07:00
d1b689c445
[Bugfix] Fix flaky test_streaming_response test ( #20363 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-03 14:46:24 +00:00
9854dc9040
[Frontend] improve vllm bench <bench_type> --help display ( #20430 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-03 14:22:16 +00:00
ff5c60fad8
[Misc] Automatically tag PRs to add new models ( #20222 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-03 07:11:03 -07:00
6f1229f91d
[Model][2/N] Automatic conversion of CrossEncoding model ( #19978 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-03 13:59:23 +00:00
1819fbda63
[Quantization] Bump to use latest bitsandbytes ( #20424 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-03 21:58:46 +08:00
7f0367109e
[CI/Build][CPU] Enable cross compilation in CPU release pipeline ( #20423 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-03 05:26:12 -07:00
fb14d53cf6
[Kernel] refactor cpu worker v0 cache dtype ( #20080 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-03 08:39:14 +00:00
b024a42e93
[Core] Move multimodal placeholder from chat utils to model definition ( #20355 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-03 08:18:30 +00:00
cb97f2bfc5
[Docs] Replace two list with tables in intel_gaudi.md ( #20414 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-03 00:48:25 -07:00
359200f6ac
[doc] fix link ( #20417 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-03 00:21:57 -07:00
220aee902a
[Misc] Add rules to label Speculative Decoding Related PRs ( #20406 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2025-07-02 23:56:49 -07:00
67d25eca05
[Tests] Update online DP tests to verify that requests are balanced ( #20157 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-03 14:49:13 +08:00
363528de27
[Feature] Support MiniMax-M1 function calls features ( #20297 )
...
Signed-off-by: QscQ <qscqesze@gmail.com >
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-07-03 06:48:27 +00:00
4ff61ababa
[TPU] Add a case to cover RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 ( #20385 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-03 06:46:41 +00:00
0ec3779df7
[Bugfix][CI/CD][CPU] Fix CPU CI tests ( #20383 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-02 20:11:36 -07:00
b616f6a53d
[Misc] Small: Fix video loader return type annotations. ( #20389 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-07-03 03:10:39 +00:00
2e25bb12a8
[Bugfix] Fix import of CutlassExpertsFp8 in compressed_tensors_moe.py ( #20381 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-07-03 02:07:43 +00:00
9965c47d0d
Enable CPU nightly performance benchmark and its Markdown report ( #18444 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-07-02 17:50:25 -07:00
059d4cdb49
[BugFix] Fix DP headless mode arg validation ( #20398 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-02 17:15:32 -07:00
bdb84e26b0
[Bugfix] Fixes for FlashInfer's TORCH_CUDA_ARCH_LIST ( #20136 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
2025-07-02 17:15:11 -07:00
3dd359147d
[Docs] Update EAGLE example ( #20375 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-02 17:13:51 -07:00
657f2f301a
[DP] Support external DP Load Balancer mode ( #19790 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-02 10:21:52 -07:00
a1aafc827a
[ROCm][FEAT] Enable Full Graph Mode in AITER MLA V1 Attn Backend (Decode Phase only) ( #20254 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-07-02 16:25:46 +00:00
139508a418
[Misc] add handler HF_TOKEN is emptry string ( #20369 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-02 09:14:31 -07:00
d265414dbc
[Minor] Clean up incorrect comment in test ( #20382 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-02 09:13:37 -07:00
48fb076cbc
[V1] LogitsProcessor programming model ( #16728 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-07-02 09:10:42 -07:00
c1909e7e8c
[Kernels] MoE refactor ( #19636 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Co-authored-by: ElizaWszola <ewszola@redhat.com >
2025-07-02 06:08:27 -07:00
b95877509b
Documentation update tool_calling: mapping back to function from response ( #20373 )
2025-07-02 05:55:49 -07:00
706ff13224
[Model] Adds support for SlimMoE models Phi-tiny-MoE-instruct ( #20286 )
...
Signed-off-by: Zichong Li <t-lizichong@microsoft.com @Reasoning-H100-VM3.drbuo4tcjzruhloch3eo0b25ef.cx.internal.cloudapp.net>
Co-authored-by: Zichong Li <t-lizichong@microsoft.com @Reasoning-H100-VM3.drbuo4tcjzruhloch3eo0b25ef.cx.internal.cloudapp.net>
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-07-02 12:54:12 +00:00
ccbfb1d1c9
[Bugfix] Fix the max_seq_len limit of 16384 for DeepSeek models ( #20322 )
...
Signed-off-by: Wang Huaqiang <huaqiang.wang@intel.com >
2025-07-02 12:53:36 +00:00
9e5552aa13
[NVIDIA] Support Cutlass w8a8 FP8 for Blackwell Geforce GPUs (sm120) ( #17280 )
...
Signed-off-by: kaln27 <liaojuncheng123@foxmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-02 06:47:19 -06:00
0c600b9ab6
[Build/CI] Automatically tag DeepSeek related PRs ( #20370 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-02 04:02:43 -07:00
e303dcf523
[Model] Add Ernie4.5 and Ernie4.5MoE Model Support ( #20220 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2025-07-02 03:37:01 -07:00
ae9c4d416f
[Docs] Make TPU ref prettier in google_tpu.md ( #20356 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-02 02:04:08 -07:00
d853520b3e
[Docs] Fix indentations for 2-level items in deprecation_policy.md ( #20352 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-01 23:50:31 -07:00
ba51aea65e
[Bugfix] Keye-VL compatibility with tok_kwargs ( #20058 ) ( #20353 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-01 23:46:59 -07:00
8452946c06
[Model][VLM] Support Keye-VL-8B-Preview ( #20126 )
...
Signed-off-by: Kwai-Keye <Keye@kuaishou.com >
2025-07-01 23:35:04 -07:00
2e7cbf2d7d
[Frontend] Support configurable mm placeholder strings & flexible video sampling policies via CLI flags. ( #20105 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-07-01 23:34:03 -07:00
7da296be04
[TPU] kv cache update kernel supports dynamic grid ( #20235 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-02 06:33:37 +00:00
b205e8467d
[Doc][TPU] Add models and features supporting matrix. ( #20230 )
...
Signed-off-by: Qiliang Cui <cuiq@google.com >
2025-07-02 06:33:20 +00:00
be0cfb2b68
fix[Docs]: link anchor is incorrect #20309 ( #20315 )
...
Signed-off-by: zxw <1020938856@qq.com >
2025-07-02 06:32:34 +00:00
1a03dd496b
[Bugfix] Fix dynamic rotary embedding ( #20343 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-02 06:31:26 +00:00
27b8017636
[FIX][Intel GPU]fix ipex flash_attn_varlen_func api missing parameter ( #20348 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-07-01 22:26:40 -07:00
9ec1e3065a
[Misc][Doc] Add missing comment for LLM ( #20285 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2025-07-01 19:04:24 -07:00
9dae7d46bf
[Refactor] Remove Unused Env VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON ( #20334 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-01 19:03:43 -07:00
7058d7dd5d
[Refactor] Remove duplicate find_free_port ( #20333 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-01 19:03:07 -07:00
a0389e0554
[UT][intel GPU] use current_platform instead of device hardcode in v1 tests ( #20169 )
...
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com >
2025-07-02 09:06:04 +08:00
3be8d312a2
[Kernel][Bugfix] Fixup some warnings in nvfp4_blockwise_moe when CUDA < 12.8 ( #20324 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-07-01 18:05:47 -07:00
3abfe22154
Enable group size 64 for Machete ( #20290 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-07-01 18:05:44 -07:00
e81fbefe8a
[Refactor] Refactor import utils ( #20269 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-01 18:05:42 -07:00
9290de5667
remove unused variables in marlin_template.h ( #20236 )
2025-07-02 00:51:52 +00:00
7f280d69c9
[Optimization] Cache sampled token ids in model runner ( #20291 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-01 11:01:31 -07:00
02cabff207
[V1] [ROCm] Enable EP with AITER Fused MoE ( #20270 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-07-01 16:48:30 +00:00
3d19d47d91
[Frontend] Expand tools even if tool_choice="none" ( #17177 )
...
Signed-off-by: okada shintarou <okada@preferred.jp >
2025-07-01 12:47:38 -04:00
8acb4badee
[CUDA graphs] Enable full cuda graphs with FA3 AoT scheduling ( #20301 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-01 09:07:36 -07:00
314af8617c
[Docs] Update transcriptions API to use openai client with stream=True ( #20271 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-01 15:47:13 +00:00
0e96cc9b7e
[Misc] Minor refactoring for scheduler ( #20299 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-01 07:55:32 -07:00
ecad851cbd
[Model]Add Tencent HunYuanMoEV1 Model Support ( #20114 )
...
Signed-off-by: aiyiwang <aiyiwang@tencent.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: quinnrong <quinnrong@tencent.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-01 07:28:13 -07:00
ed70f3c64f
Add GLM4.1V model (Draft) ( #19331 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-01 12:48:26 +00:00
650d5dbd04
[Misc] Minor refactor of NIXL background handshake ( #20068 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-01 12:40:14 +01:00
9025a9a705
[Quant] [Bugfix] Fix quantization config matching with hf_to_vllm_mapper ( #20046 )
2025-07-01 19:20:34 +09:00
c05596f1a3
[Perf] Validate @config in pre-commit instead of dynamically ( #20200 )
...
Signed-off-by: Lionel Villard <villard@us.ibm.com >
2025-07-01 05:10:28 -04:00
787b13389e
[doc] fix the incorrect logo in dark mode ( #20289 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-01 08:18:09 +00:00
96453cfa83
[BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine ( #19067 )
...
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
2025-07-01 16:12:19 +08:00
b1c1fe35a5
[Misc] remove redundant char ( #20287 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-07-01 15:33:22 +08:00
08d81f1014
[Bugfix] Fix deepep tests ( #20288 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-01 15:29:08 +08:00
6cc1e7d96d
[CPU] Update custom ops for the CPU backend ( #20255 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-01 07:25:03 +00:00
9909726d2a
Enable ZP Support for Machete ( #20268 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-07-01 07:12:20 +00:00
22e9d42040
[Misc] add xgrammar for arm64 ( #18359 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2025-07-01 07:02:20 +00:00
86debab54c
Fix numel() downcast in vllm/csrc/moe/moe_align_sum_kernels.cu +2 ( #17082 )
...
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-01 06:48:10 +00:00
be250bbc67
[V1] Only print cudagraph tqdm on rank 0 with is_global_first_rank ( #19516 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-01 06:02:09 +00:00
27949354fa
[Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference ( #18768 )
...
Signed-off-by: Alex Kogan <alex.kogan@oracle.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-07-01 05:44:38 +00:00
bd5038af07
[Doc] add config and troubleshooting guide for NCCL & GPUDirect RDMA ( #15897 )
...
Signed-off-by: Ernest Wong <chwong719@gmail.com >
2025-06-30 21:44:39 -07:00
a2f14dc8f9
[CI][Intel Gaudi][vllm-Plugin]Add CI for hpu-plugin-v1-test ( #20196 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2025-07-01 04:17:07 +00:00
92ee7baaf9
[Example] add one-click runnable example for P2P NCCL XpYd ( #20246 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-06-30 21:03:55 -07:00
7151f92241
[Misc] Fix spec decode example ( #20296 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-30 21:01:48 -07:00
e28533a16f
[Bugfix] Fix include prompt in stream response when echo=true ( #15233 )
...
Signed-off-by: Yuan Fang <yuanfang@alauda.io >
2025-07-01 01:30:14 +00:00
6d42ce8315
[CLI] Improve CLI arg parsing for -O/--compilation-config ( #20156 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-07-01 01:03:13 +00:00
ded1fb635b
[Bugfix][V1][P/D]Fix the issue of occasional garbled output for P2pNcclConnector ( #20263 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-06-30 16:45:14 -07:00
97d9524fe9
[Refactor] Remove useless pdb comment ( #20266 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-30 18:15:24 +00:00
d8cf819a9a
[Core] [Bugfix] [Multimodal] Fix multimodal profiling and generation for SFT/PTQed models ( #20058 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-06-30 17:26:49 +00:00
551ef1631a
[Unit Test] Add unit test for deep gemm ( #20090 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-30 10:26:42 -06:00
2863befce3
[Optimization] Use Shared CachedRequestData Instance Across All Requests ( #20232 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-30 09:07:50 -07:00
2965c99c86
[Spec Decode] Clean up spec decode example ( #20240 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-30 08:28:13 -07:00
2062c0723d
[Spec Decode] Refactor spec decoding into a separate function ( #20238 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-30 08:13:50 -07:00
1c50e100a9
[Bugfix] fix quark ptpc ( #20251 )
...
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com >
Co-authored-by: Haoyang Li <307790822@qq.com >
2025-06-30 22:24:50 +09:00
3ee56e26be
[Docs] Fix 1-2-3 list in v1/prefix_caching.md ( #20243 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-06-30 11:20:51 +00:00
8fe7fc8634
[Quantization] Improve BitsAndBytesModelLoader ( #20242 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-30 18:22:09 +08:00
e936e401de
[Bugfix] Fix processor initialization in transformers 4.53.0 ( #20244 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-30 10:16:16 +00:00
f5dfa07531
[Bugfix] Skip loading extra parameters for modelopt Qwen3 MoE model ( #19598 )
...
Signed-off-by: noiji <>
2025-06-30 18:21:56 +09:00
022c58b80f
[doc] Add Slack and Forum to the top navigation ( #20208 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-06-30 07:53:45 +00:00
19108ef311
[Misc] Fix import ( #20233 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-29 20:34:54 -07:00
5a52f389dd
[BUGFIX][DEEPSEEK][MODEL_LOAD] fix w13, w2 weight not initialized assert ( #20202 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2025-06-29 19:46:19 -07:00
65b1cbb138
[Model] support dots1 ( #18254 )
...
Signed-off-by: redmoe-moutain <agiredmoe@gmail.com >
2025-06-29 19:34:36 -07:00
6c9837a761
Fix cuda_archs_loose_intersection when handling sm_*a ( #20207 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-06-29 16:52:34 -07:00
6f2f53a82d
[Quantization] Add compressed-tensors NVFP4 MoE Support ( #19990 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2025-06-29 22:05:40 +00:00
7b1895e6ce
[CI Fix] Try fixing eagle e2e test OOM by reducing block allocation ( #20213 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-29 10:31:37 +08:00
4d36693687
[Refactor] Create a function util and cache the results for has_deepgemm, has_deepep, has_pplx ( #20187 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-28 22:06:38 +00:00
daec9dea6e
[Bugfix] Correct behavior of GraniteMoeHybrid for TensorParallel execution ( #20137 )
...
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com >
2025-06-28 08:16:41 -07:00
daceac57c7
[Frontend] Generalize v1/audio/transcriptions endpoint ( #20179 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-06-28 08:15:26 -07:00
8615d9776f
[CI/Build] Add new CI job to validate Hybrid Models for every PR ( #20147 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-06-27 23:00:25 -07:00
7b460c25f9
[BugFix] Fix the incorrect func name in the comments. (config.py) ( #20185 )
2025-06-27 22:51:16 -07:00
f719772281
[Bugfix] Properly reject requests with empty list guided_choice ( #20195 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-27 22:50:52 -07:00
d45417b804
fix ci issue distributed 4 gpu test ( #20204 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-27 22:50:00 -07:00
a29e62ea34
Fix num_token_padding support for static per-tensor scaled_fp8_quant ( #20188 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-27 22:48:13 -07:00
e53be6f00a
[Misc] Add type assertion of request_id for LLMEngine.add_request ( #19700 )
...
Signed-off-by: n2ptr <xuzhanchaomail@163.com >
2025-06-27 22:47:36 -07:00
c329ceca6d
[CI Fix] Pin tests/models/registry.py MiniMaxText01ForCausalLM to revision due to model changes ( #20199 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-28 13:43:06 +08:00
3c545c0c3b
[CI/Build] Allow hermetic builds ( #18064 )
...
Signed-off-by: Fabien Dupont <fdupont@redhat.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Fabien Dupont <fabiendupont@pm.me >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Elias Levy <eliaslevy@google.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-06-27 09:04:39 -07:00
e8c3bd2cd1
[Bugfix] Fix some narrowing conversion warnings ( #20141 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-06-27 09:01:28 -07:00
c6c983053d
[Bugfix] Mark 'hidden_states' as mutable in moe_forward registration. ( #20152 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-06-27 09:42:22 -06:00
aafabaa0d5
[Fix][torch.compile] Enable custom ops by default when Inductor off ( #20102 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-06-27 09:00:42 -06:00
94a55c7681
[Fix][ROCm] Remove unused variables to fix build error on GFX11/12 ( #19891 )
...
Signed-off-by: Hosang Yoon <hosang.yoon@amd.com >
2025-06-27 07:14:44 -07:00
aa0dc77ef5
[Perf] Improved perf for resolve_chat_template_content_format ( #20065 )
...
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@cerebras.net >
2025-06-27 09:16:41 +00:00
4ab3ac285e
[Bugfix] Fix flaky failure when getting DP ports ( #20151 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-27 15:30:53 +08:00
d1c956dc0f
Gemma3n (Text-only) ( #20134 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-06-27 07:16:26 +00:00
dec197e3e5
Quick Fix by adding conditional import for flash_attn_varlen_func in flash_attn ( #20143 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2025-06-27 05:48:13 +00:00
6e244ae091
[Perf][Frontend] eliminate api_key and x_request_id headers middleware overhead ( #19946 )
...
Signed-off-by: Yazan-Sharaya <yazan.sharaya.yes@gmail.com >
2025-06-27 00:44:14 -04:00
cd4cfee689
[Model][1/N] Automatic conversion of CrossEncoding model ( #20012 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-06-26 21:10:04 -07:00
e110930680
[Fix] Fix gemma CI test failing on main ( #20124 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-06-26 21:06:59 -07:00
8b64c895c0
[CI] Sync test dependency with test.in for torch nightly ( #19632 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Concurrensee <yida.wu@amd.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-26 20:55:25 -07:00
0740e29b66
[Feature] add quick all reduce ( #19744 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-06-26 20:54:24 -07:00
44d2e6af63
[Bugfix] Build moe_data for both sm100 and sm90 ( #20086 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-26 20:50:12 -07:00
2d7779f888
[Perf] SM100 FP8 GEMM Optimizations after cutlass_profiler ( #20071 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-06-26 20:50:09 -07:00
a57d57fa72
[Quantization] Bump to use latest compressed-tensors ( #20033 )
...
Signed-off-by: Dipika <dipikasikka1@gmail.com >
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com >
2025-06-26 20:50:06 -07:00
71799fd005
[CI Failure] Fix OOM with test_oot_registration_embedding ( #20144 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-27 11:21:04 +08:00
e9fd658a73
[Feature] Expert Parallelism Load Balancer (EPLB) ( #18343 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
2025-06-26 15:30:21 -07:00
07b8fae219
[Doc] correct LoRA capitalization ( #20135 )
...
Signed-off-by: kyolebu <kyu@redhat.com >
2025-06-26 15:22:12 -07:00
562308816c
[Refactor] Rename commnication utils ( #20091 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-26 22:19:32 +00:00
04e1642e32
[TPU] add kv cache update kernel ( #19928 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-06-26 10:01:37 -07:00
b69781f107
[Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention backend. ( #19560 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-06-26 09:27:18 -07:00
0bceac9810
Spam folks if config.py changes ( #20131 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-06-26 08:19:46 -07:00
34878a0b48
[Doc] Rename page titles ( #20130 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-26 08:18:49 -07:00
6393b03986
[Doc] Auto sign-off for VSCode ( #20132 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-26 08:18:36 -07:00
0907d507bf
[Doc] Automatically signed-off by PyCharm ( #20120 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-06-26 14:34:17 +00:00
c894c5dc1f
[Bug Fix] Fix address/port already in use error for deep_ep test ( #20094 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-26 22:33:13 +08:00
1f5d178e9c
Revert "[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine" ( #20128 )
2025-06-26 07:32:22 -07:00
27c065df50
[Bugfix][V1][ROCm] Fix AITER Flash Attention Backend (Fix API Break and Local Attention Logic: affecting Llama4) ( #19904 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-06-26 12:42:31 +00:00
84c260caeb
[Docs] Improve frameworks/helm.md ( #20113 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-06-26 10:41:51 +00:00
167aca45cb
[Misc] Use collapsible blocks for benchmark examples. ( #20017 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-26 03:35:16 -07:00
0567c8249f
[CPU] Fix torch version in x86 CPU backend ( #19258 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-06-26 03:34:47 -07:00
d188913d99
[Refactor] Remove unused library ( #20099 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-26 09:16:10 +00:00
1d7c29f5fe
[Doc] Update docs for New Model Implementation ( #20115 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-26 00:47:06 -07:00
65397e40f5
[Bugfix] Allow CUDA_VISIBLE_DEVICES='' in Platform.device_id_to_physical_device_id ( #18979 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-06-26 00:01:57 -07:00
9502c38138
[Benchmark][Bug] Fix multiple bugs in bench and add args to spec_decode offline ( #20083 )
2025-06-25 22:06:27 -07:00
2582683566
[PD] Skip tp_size exchange with rank0 ( #19413 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-06-25 20:04:39 -07:00
754b00edb3
[Bugfix] Fix Mistral tool-parser regex for nested JSON ( #20093 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-26 01:01:17 +00:00
296ce95d8e
[CI] Add SM120 to the Dockerfile ( #19794 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-25 16:23:56 -07:00
2d7620c3eb
[TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN ( #19919 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-06-25 15:51:02 -07:00
55c65ab495
[P/D] Avoid stranding blocks in P when aborted in D's waiting queue ( #19223 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-25 15:19:44 -07:00
2cc2069970
[TPU][Bugfix] fix kv cache padding ( #20048 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-06-25 21:24:10 +00:00
9f0608fc16
[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine ( #20062 )
...
Signed-off-by: izhuhaoran <izhuhaoran@qq.com >
2025-06-25 21:03:17 +00:00
4e0db57fff
Fix the path to the testing script. ( #20082 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-06-25 20:48:17 +00:00
c40692bf9a
[Misc] Add parallel state node_count function ( #20045 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-25 13:38:53 -07:00
4734704b30
[PD] let toy proxy handle /chat/completions ( #19730 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-06-25 15:17:45 -04:00
8b8c209e35
static_scaled_fp8_quant should not run when scale.numel is not 1 ( #20076 )
2025-06-25 15:08:03 -04:00
23a04e0895
[Fix] Support cls pooling in ModernBertPooler ( #20067 )
...
Signed-off-by: shengzhe.li <shengzhe.li@sbintuitions.co.jp >
2025-06-25 15:07:45 -04:00
02c97d9a92
[Quantization] Add compressed-tensors emulations support for NVFP4 ( #19879 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2025-06-25 14:28:19 -04:00
e795d723ed
[Frontend] Add /v1/audio/translations OpenAI API endpoint ( #19615 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-06-25 17:54:14 +00:00
8359f4c8d8
[V1][Speculative Decoding] Fix DeepSeek MTP ( #20022 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2025-06-25 08:41:02 -07:00
bf5181583f
[Doc] Guide for Incremental Compilation Workflow ( #19109 )
2025-06-25 22:06:46 +09:00
c53fec1fcb
[doc] add reference link for Intel XPU ( #20064 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-25 12:24:07 +00:00
0f9e7354f5
[BugFix] Fix full-cuda-graph illegal memory access in FA3 ( #20057 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-06-25 08:39:04 +00:00
ba7ba35cda
[Chore] debloat some initial logs ( #19438 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-06-25 06:36:22 +00:00
015fab8c2f
[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs. ( #19717 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-06-24 23:22:58 -07:00
f59fc60fb3
[Feat][CLI] enforce-include-usage ( #19695 )
...
Signed-off-by: Max Wittig <max.wittig@siemens.com >
2025-06-25 01:43:04 -04:00
879f69bed3
[Refactor] Remove duplicate ceil_div ( #20023 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-25 05:19:09 +00:00
7108934142
[Frontend] speed up import time of vllm.config ( #18036 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-06-25 00:41:11 -04:00
3443aaf8dd
Move to a faster base64 implementation ( #19984 )
...
Signed-off-by: h-avsha <avshalom.manevich@hcompany.ai >
2025-06-24 20:33:51 -07:00
2273ec322c
Revert "Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor" ( #20030 )
2025-06-25 11:23:29 +08:00
a6c4b87fbc
Revert "[Feature] Integrate new deepgemm ( #19820 )" ( #20049 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-24 19:45:22 -07:00
1afa9948f5
[Llama4] Update attn_temperature_tuning ( #19997 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-06-24 22:42:53 -04:00
0d06b533a0
cmake: Update vllm_flash_attn for vllm_kernels ( #20032 )
...
Signed-off-by: Eli Uriegas <eliuriegas@meta.com >
2025-06-24 22:44:10 +00:00
c01d1c5aba
use .dev for version comparison with pytorch nightly release ( #20031 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-06-24 21:52:16 +00:00
ead369845d
[Easy] Remove submodule added in #19463 ( #20039 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-06-24 13:23:15 -07:00
c6e3bba8e6
[Feature] Integrate new deepgemm ( #19820 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-24 12:51:56 -07:00
91f7d9d0b6
[P/D] Asynchronously do _nixl_handshake ( #19836 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-24 12:46:10 -07:00
8619e7158c
[BugFix] Fix multi-node offline data parallel ( #19937 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-24 12:45:20 -07:00
c635c5f744
[Misc][Benchmarking] Add variable request-rate ("ramp-up") to the benchmarking client. ( #19423 )
...
Signed-off-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
Co-authored-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-06-24 18:41:49 +00:00
a045b7e89a
[Perf] Improve/Fix-regression for FA3 in High QPS regimes ( #19463 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-06-24 13:09:01 -04:00
981eeca41a
[Fix][V1] Remove --scheduling-policy oracle ( #20010 )
...
Signed-off-by: amit <amit.man@gmail.com >
2025-06-24 09:52:15 -07:00
26d34eb67e
refactor example - qwen3_reranker ( #19847 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-24 14:03:20 +00:00
53da4cd397
[Bugfix][CPU] Fix InputBatch for pooling models in the CPU v1 ( #20014 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-06-24 13:20:04 +00:00
9a3b88328f
[PERF] Speedup of MRoPE prepare inputs ( #19939 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai >
2025-06-23 23:01:26 -07:00
3014c920da
add some examples for other benchmark scripts ( #19893 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-24 05:57:46 +00:00
0eed516951
[doc] Fix broken link in the installation for CPU ( #19980 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-06-24 12:04:11 +08:00
ee5ad8d2c5
[Misc][Tools][Benchmark] Add profile to autotune script ( #19711 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-06-24 00:59:41 +00:00
a738dbb2a1
Update test case parameter to have the throughput above 8.0 ( #19994 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-06-24 00:18:10 +00:00
33d5e29be9
[TPU] Fix tpu model runner test ( #19995 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-06-23 16:04:28 -07:00
4671ac6e2a
[Bugfix][Benchmark] Fix Marlin benchmark ( #19929 )
2025-06-24 07:25:12 +09:00
dd2ccf8dde
Feat Dynamic Quantization for MoE Layers in GPTQ Marlin Backend ( #19395 )
2025-06-24 07:23:28 +09:00
a3bc76e4b5
[CI/Build] Push latest tag for cpu and neuron docker image ( #19897 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-23 14:15:37 -07:00
e6327c9b3e
[Feature] Support sequence parallelism for static fp8 quantization ( #19181 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-06-23 16:09:02 -04:00
d0132f025d
[Misc] Add type alias ReqId and EngineId for better readability ( #19880 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-06-23 12:57:57 -07:00
61f4fc5dc6
[Bugfix][v1] Fix step pooler implementation and step pooling usage in v1 ( #19956 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-23 18:38:06 +00:00
68aaeb3749
[EP+DP] Optimize the little operations in the DeepGEMM + DeepEP low latency case ( #19885 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-06-23 11:07:47 -07:00
c3649e4fee
[Docs] Fix syntax highlighting of shell commands ( #19870 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-06-23 17:59:09 +00:00
53243e5c42
[doc] improve readability for long commands ( #19920 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-23 14:27:07 +00:00
a6e6604d32
[Bugfix] Fix CI bitsandbytes failure ( #19969 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-23 21:30:55 +08:00
b82e0f82cb
[doc] use MkDocs collapsible blocks - supplement ( #19973 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-23 10:54:16 +00:00
5111642a6f
[Doc] Update V1 status for decoder-only embedding models ( #19952 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-23 09:31:06 +00:00
1bcd15edc7
[BugFix][P/D] Fix for cases where _recving_transfers can be cleaned up when *all* transfer done ( #19874 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-06-22 22:41:53 -07:00
2ebff5b77c
[P/D][NixlConnector] Support tp_size > num_kv_heads deployments ( #19691 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-22 22:41:50 -07:00
f17aec0d63
[doc] Fold long code blocks to improve readability ( #19926 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-23 05:24:23 +00:00
493c275352
Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor ( #19643 )
...
Signed-off-by: Vensenmu <vensenmu@gmail.com >
2025-06-23 03:40:28 +00:00
f39ab2d4bd
[Misc] Configurable timeout for execute_model RPC calls via env var ( #19544 )
...
Signed-off-by: jinqinn <goodqinjin@163.com >
2025-06-22 20:36:26 -07:00
4a0f7888a3
[Core] feat: Implement Priority Scheduling in V1 Engine ( #19057 )
...
Signed-off-by: amit <amit.man@gmail.com >
Co-authored-by: Roger Wang <Rogerw0108@gmail.com >
2025-06-22 20:18:08 -07:00
c4cf260677
[Perf][CLI] Improve overall startup time ( #19941 )
2025-06-22 23:11:22 +00:00
33d51f599e
[BugFix] Add an env to disable moe chunking to work around compile incompatibility ( #19642 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-06-22 15:17:49 -07:00
e91386cde1
[Chore] dedup logs ( #19955 )
2025-06-22 19:43:07 +00:00
2c11a29f0b
[Misc] Simplify vllm bench cli subcommand implementation ( #19948 )
2025-06-22 12:34:48 -04:00
c76a506bd6
[Misc] Update model-specific PR tagging ( #19949 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
2025-06-22 12:16:08 +00:00
ec0db6f51c
[doc] use snippets for contact us ( #19944 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-22 10:26:13 +00:00
c305a2109d
[CI/Build] Auto tag perf benchmarks related PRs ( #19943 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-22 08:46:21 +00:00
202c5df935
[Benchmark] fix request loss if "ping" is returned ( #19535 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-22 07:21:04 +00:00
2bb246b8f7
[MISC] add cpu_kvcache_space_bytes to CacheConfig ( #19812 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-22 13:39:09 +08:00
4c409cabc2
[Misc] add vllm_config in __init__ ( #19866 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-21 23:10:46 -04:00
3b1e4c6a23
[Docs] Add GPT2ForSequenceClassification to supported models in docs ( #19932 )
...
Signed-off-by: nie3e <adrcwiek@gmail.com >
2025-06-21 20:57:19 +00:00
2c5302fadd
[Multimodal] Optimize Qwen2/2.5-VL startup time ( #19756 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-06-21 20:01:07 +00:00
caa680fd2e
[doc] add contact us in community ( #19922 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-21 17:29:06 +00:00
c3bf9bad11
[New model support]Support Tarsier2 ( #19887 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-06-21 04:01:51 +00:00
6f170f11dd
[Bugfix] Fix bnb 8bit model weights loading ( #19917 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-21 03:29:09 +00:00
8ca81bb069
Fix: Check the type of params to be a Sequence not list. ( #19910 )
...
Signed-off-by: Rabin Adhikari <rabin.adk1@gmail.com >
2025-06-20 23:03:17 +00:00
e773a9e1c2
[Misc] Clean up useless code ( #19889 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-06-20 21:09:09 +00:00
71baf85ae1
[Kernel] mark TorchSDPABackend swap_blocks NotImplementedError ( #19749 )
2025-06-20 18:18:11 +00:00
79f2f1c2a1
[CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tests ( #19901 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-06-20 15:30:36 +00:00
2e3e3c86dc
Export NaNs in logits to scheduler_stats if output is corrupted ( #18777 )
...
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com >
2025-06-20 22:47:16 +08:00
7e8977fcd4
[custom_op][vllm-plugin] update custom_op class to use op_registry ( #19164 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2025-06-20 07:44:56 -07:00
f1e840e842
[Model] GPT2ForSequenceClassification model ( #19663 )
...
Signed-off-by: nie3e <adrcwiek@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-20 12:07:41 +00:00
7771d1de88
[Fix] import regex instead of re ( #19875 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-06-20 11:16:48 +00:00
71d1219545
[Kernel] correct cpu worker function parameter type ( #19745 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-20 10:50:13 +00:00
e384f2f108
[Misc] refactor example - openai_transcription_client ( #19851 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-20 08:02:21 +00:00
089a306f19
[Misc] update cuda version ( #19526 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-20 07:25:15 +00:00
5e666f72cd
[Bugfix][Ray] Set the cuda context eagerly in the ray worker ( #19583 )
2025-06-19 22:01:16 -07:00
e3a3e4db46
[Bugfix] Enable PP with AITER+V1 ( #19822 )
...
Signed-off-by: Qiang Li <qiang.li2@amd.com >
2025-06-20 12:43:20 +08:00
e41bf15cd0
[Chore]: qwen3-moe-type-hints-mistake ( #19860 )
...
Co-authored-by: xinnan.hou <hxn02029096@alibaba-inc.com >
2025-06-19 21:43:07 -07:00
5aa4a015ce
[Benchmark] Fix Value of type "SampleRequest" is not indexable ( #18032 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-06-19 21:28:55 -07:00
b6bad3d186
[CI][Neuron] Fail and exit on first error ( #19622 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-20 12:27:51 +08:00
ee9a1531aa
[CI/Build][Bugfix] Fix deadlock on v1 engine test CI ( #19872 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-20 09:51:07 +08:00
10d82f9ac5
[Benchmark][Bugfix] Fix Dataset Length Calculation ( #19868 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-06-19 18:30:41 -07:00
ea10dd9d9e
[Frontend] early return chat format resolution when specified ( #19735 )
2025-06-19 18:49:59 +00:00
ead2110297
[Core][Bugfix] Fix Online MM Beam Search ( #19688 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-06-19 17:18:07 +00:00
01220ce89a
[CI][CPU] Improve dummy Triton interfaces and fix the CPU CI ( #19838 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-06-19 15:46:09 +00:00
6f68c49220
[Doc] Update V1 user guide for embedding models ( #19842 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-19 09:43:27 +00:00
4719460644
Fixing Chunked Prefill Test. ( #19762 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-06-19 01:36:16 -07:00
466166dcfd
[Frontend] Add optional token-level progress bar to LLM.beam_search ( #19301 )
...
Signed-off-by: Ruosen Li <rxl190028@utdallas.edu >
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Ubuntu <ubuntu@ip-172-31-71-179.ec2.internal >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-19 03:21:41 -04:00
1d0ae26c85
Add xLAM tool parser support ( #17148 )
2025-06-19 14:26:41 +08:00
6021999573
[Minor] Allow redirecting model path for HfRunner in test ( #19795 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-18 23:04:10 -07:00
c7b370c603
raise exception for pin_lora ( #19809 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-18 22:57:35 -07:00
aa20d10a91
[Misc] [ROCm] Prevent surplus tensor reshape ( #19803 )
...
Signed-off-by: Zsolt Borbely <zsolt.borbely@htecgroup.com >
2025-06-19 13:57:16 +08:00
2de12be428
[ROCm] [AITER] [Bugfix] Patch for AITER commit 648764942e552a8bb5fe16026703716a81f05374 ( #18990 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-06-18 22:56:31 -07:00
83ca9ae47b
Mark invariant normalizer in Gemma as non-persistent ( #19788 )
...
Signed-off-by: Yu-Hang Tang <Tang.Maxin@gmail.com >
2025-06-18 22:56:03 -07:00
e2148dc5ea
[Bugfix] Add check_health to v1 async client. ( #19821 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-06-18 21:47:01 -07:00
b1098b4072
[Bugfix] Fix the linter ( #19826 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-18 21:44:41 -07:00
799397ee4f
Support embedding models in V1 ( #16188 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-18 21:36:33 -07:00