7ea22e42d5
[Misc] Add override for allreduce fusion thresholds ( #23639 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
2025-08-26 15:53:04 +00:00
9d4183dd2e
[model] support qwen2audio embedding input ( #23625 )
...
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-26 23:48:08 +08:00
513298f1b4
[Bugfix] fix bf16 multimodal model hash ( #23623 )
...
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-08-26 23:47:50 +08:00
379f828fba
[Docs] Reduce requirements for docs build ( #23651 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 15:43:28 +00:00
1fdc732419
[ROCm] Starting to add AMD code reviewers for ROCm components ( #23496 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2025-08-26 07:32:37 -07:00
f58675bfb3
[CPU] add cpu fused moe pytorch native implementation ( #23146 )
...
Signed-off-by: Tianyu Li <tianyu.li@arm.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-08-26 14:09:17 +00:00
7c04779afa
[Doc]: fix various spelling issues in multiple files ( #23636 )
...
Signed-off-by: Didier Durand <durand.didier@gmail.com >
2025-08-26 14:05:29 +00:00
f66673a39d
[Kernel] Added flashinfer fp8 per-tensor gemms ( #22895 )
...
Signed-off-by: Julien Lin <jullin@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-26 06:54:04 -07:00
b78bed1bc5
[Hardware][Mac] Fix the installation fail for Apple Silicon (CPU) ( #23565 )
...
Signed-off-by: oye93 <en.ouyang93@outlook.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-08-26 13:04:25 +00:00
164b2273c8
[Docs] Fix broken links to docs/api/summary.md ( #23637 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 13:00:18 +00:00
2b4fc9bd9b
Support FlashAttention Backend for Hybrid SSM Models ( #23299 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-08-26 12:41:52 +00:00
ebd5a77bb5
feat: add usage to TranscriptionResponse (text and json response_format) ( #23576 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-08-26 05:26:26 -07:00
384dd1b0a8
[Bugfix] Add missing enable_log_outputs parameter to init_app_state function ( #23634 )
...
Signed-off-by: Matúš Námešný <matus.namesny@ameria.com >
2025-08-26 12:13:15 +00:00
fdeb3dac13
[Model] fix DeepSeek e_score_correction_bias dtype to fp32 ( #23640 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-08-26 20:09:47 +08:00
d52358c1e0
[Perf] Remove duplicated NVFP4 blockscales to save memory ( #23379 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-26 19:16:33 +08:00
6ace2f72b0
Fix writing benchmark results with tuple keys ( #23633 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-08-26 19:16:09 +08:00
b00e69f8ca
Fix nits from #20059 ( #23548 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 03:27:20 -07:00
50fede6634
[V1] Enable V1 for compute capability < 8.0 + FP32 ( #23614 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-26 03:00:18 -07:00
b5d34af328
[Bugfix] Fix scheduling when repeated images in one request ( #23544 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Signed-off-by: Roger Wang <hey@rogerw.io >
Co-authored-by: Roger Wang <hey@rogerw.me >
Co-authored-by: knlnguyen1802 <knlnguyen1802@gmail.com >
2025-08-26 09:46:28 +00:00
9b5f64238f
[Bugfix] Fix Qwen25VL packed_modules_mapping ( #23604 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-08-26 01:09:14 -07:00
ff77764f86
Fix CLI parameter documentation inconsistency in pooling_models.md ( #23630 )
2025-08-26 01:05:37 -07:00
bfc1edc9f5
[Docs] Fix titles for multi-file examples that are rendered in the docs ( #23573 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-08-26 00:16:44 -07:00
3ecbb14b81
[Benchmarks] add benchmark for embedding models ( #23000 )
...
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com >
2025-08-25 23:57:08 -07:00
7d67a9d9f9
[mypy] Fix incorrect type hint for EAGLE3 support ( #23617 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 23:50:17 -07:00
959783fb99
[fix] fix seed-oss-parser ( #23560 )
...
Signed-off-by: jiabin.00 <jiabin.00@bytedance.com >
2025-08-25 23:16:36 -07:00
ce0e9dbd43
[CI/Build] Fix typo in #23561 ( #23616 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 23:13:03 -07:00
b395b3b0a3
[Disagg][Perf] Use CUDA event sync instead of blocking tolist to avoid unintentional copy ops blocking across different CUDA streams, improving disagg TTIT/TTFT ( #22760 )
...
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com >
Signed-off-by: Zijing Liu <liuzijing2014@users.noreply.github.com >
2025-08-25 21:06:00 -07:00
6fad29b11b
Remove graph_pool as member of VllmBackend and argument to CUDAGraphWrapper ( #23385 )
...
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com >
Co-authored-by: ProExpertProg <11367180+ProExpertProg@users.noreply.github.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-08-25 19:34:15 -07:00
6fd45e7b8a
[CI/Build] Use vLLM client's user agent to fetch images ( #23561 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 19:34:12 -07:00
56dcf4e7e9
[Bug] Fix DeepGEMM Env Control ( #23591 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-08-25 18:41:21 -07:00
ae067888d6
Update Flashinfer to 0.2.14.post1 ( #23537 )
...
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com >
Signed-off-by: siyuanf <siyuanf@nvidia.com >
Signed-off-by: Weiliang Liu <weiliangl@nvidia.com >
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-25 18:30:44 -07:00
906e461ed6
[CI Fix] Pin deepep and pplx tags in tools/ep_kernels/, gate multigpu tests ( #23568 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-08-25 18:29:00 -07:00
2a97ffc33d
[Misc] Add release note draft to PR template ( #23598 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-08-25 16:44:51 -07:00
efc88cf64a
[Misc] Simplify FlashInfer attention metadata ( #23585 )
...
Signed-off-by: Woosuk Kwon <woosuk@thinkingmachines.ai >
2025-08-25 15:42:29 -07:00
7b6a837275
[Docs] Update Documentation of Cohere Command-A Models ( #23584 )
...
Signed-off-by: Terrencezzj <terrence@cohere.ai >
Signed-off-by: Abatom <abzhonghua@gmail.com >
Co-authored-by: Zhonghua Deng <abzhonghua@gmail.com >
2025-08-25 21:53:52 +00:00
c34c82b7fe
[TPU][Bugfix] Fixes prompt_token_ids error in tpu tests. ( #23574 )
...
Signed-off-by: Pate Motter <patemotter@google.com >
2025-08-25 14:29:16 -07:00
8a044754bd
[XPU] Delay BF16 check to worker init for spawn compatibility ( #22979 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-08-25 13:09:26 -07:00
9188ae7cb5
[Bugfix][V1][P/D]Fix the issue where repeated requests for the same input produce abnormal outputs for P2pNcclConnector ( #23403 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-08-25 12:57:08 -07:00
8a3cd90af5
[Kernel] Add fused grouped_topk kernel for MoE ( #23274 )
...
Signed-off-by: Xin Yang <xyangx@amazon.com >
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com >
2025-08-25 11:47:52 -07:00
2a167b2eeb
[test][RL] Add sleep level 2 test and fix reload with sleep mode ( #23521 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-08-26 00:25:52 +08:00
0ff902f3b4
[Refactor] Refactor persistent buffers with CpuGpuBuffer ( #23515 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-25 08:44:48 -07:00
a9082a4d14
[Bugfix] Fix Qwen3 MoE GPTQ inference ( #23490 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-08-25 06:40:20 -07:00
e0329ed4b4
Updates to Flex + VLLm integration ( #21416 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-08-25 09:32:42 -04:00
6879cd80ae
[Refactor] Pass tokenizer explicitly instead of binding to prompt update ( #23542 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 06:31:57 -07:00
e269be2ba2
[Doc] Add caution for API server scale-out ( #23550 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 06:14:15 -07:00
5c4b6e66fe
[Attention] Unify mamba and attention backend selection ( #23171 )
...
Signed-off-by: Ayush Satyam <ayushsatyam146@gmail.com >
2025-08-25 09:09:36 +00:00
d0a4a3f645
[misc] add shanghai meetup ( #23535 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-08-25 17:00:03 +08:00
ebafb0936d
[Bugfix] Allow dynamic number of patches for llava_onevision ( #23525 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-08-25 08:34:54 +00:00
0cb7b065c3
Feature/benchmark/random mm data/images ( #23119 )
...
Signed-off-by: breno.skuk <breno.skuk@hcompany.ai >
2025-08-25 01:28:35 -07:00
2da02dd0d8
[Fix] DeepSeek V3.1 tool parser error message ( #23492 )
...
Signed-off-by: zitian.zhao <zitian.zhao@tencentmusic.com >
2025-08-25 00:56:39 -07:00