6d8d0a24c0
Add think chunk ( #21333 )
...
Signed-off-by: Julien Denize <julien.denize@mistral.ai >
2025-07-23 21:51:32 -07:00
11ef7a611e
[BugFix] Set CUDA_VISIBLE_DEVICES before spawning the subprocesses ( #21211 )
...
Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-23 21:44:04 -07:00
dc2f159f8a
Dump input metadata on crash for async scheduling ( #21258 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-23 21:10:30 -07:00
d5b981f8b1
[DP] Internal Load Balancing Per Node [one-pod-per-node] ( #21238 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-07-23 20:57:32 -07:00
eec6942014
[BugFix] Fix KVConnector TP worker aggregation ( #21473 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-23 20:56:49 -07:00
fd48d99ffd
[BugFix]: Batch generation from prompt_embeds fails for long prompts ( #21390 )
...
Signed-off-by: KazusatoOko <kazusto.oko@sakana.ai >
Co-authored-by: KazusatoOko <kazusto.oko@sakana.ai >
2025-07-23 20:43:17 -07:00
f8c15c4efb
[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process ( #21437 )
...
Signed-off-by: David Chen <530634352@qq.com >
2025-07-23 20:42:11 -07:00
aa08a954f9
[Bugfix] Fix casing warning ( #21468 )
...
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com >
2025-07-23 20:41:23 -07:00
13e4ee1dc3
[XPU][UT] increase intel xpu CI test scope ( #21492 )
...
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com >
2025-07-23 20:24:04 -07:00
772ce5af97
[Misc] Add dummy maverick test to CI ( #21324 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-23 20:22:42 -07:00
63d92abb7c
[Frontend] Set MAX_AUDIO_CLIP_FILESIZE_MB via env var instead of hardcoding ( #21374 )
...
Signed-off-by: Deven Labovitch <deven@videa.ai >
2025-07-23 20:22:19 -07:00
11599b0e1f
feat(gguf_loader): accept HF repo paths & URLs for GGUF ( #20793 )
...
Signed-off-by: Hardik <hardikgupta1999@gmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-23 20:21:02 -07:00
f3137cdd81
[Core] Freeze gc during cuda graph capture to speed up init ( #21146 )
...
Signed-off-by: Codex <codex@openai.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-23 17:20:14 -07:00
82ec66f514
[V0 Deprecation] Remove Prompt Adapters ( #20588 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-23 16:36:48 -07:00
78c13e30e1
[V1] Fix local chunked attention always disabled ( #21419 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-23 15:59:30 -07:00
5c9b807b34
[Core] Add reload_weights RPC method ( #20096 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-23 14:24:52 -07:00
14bf19e39f
[TPU][TEST] Fix the downloading issue in TPU v1 test 11. ( #21418 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-23 11:29:36 -07:00
4ac7713e32
Add test case for compiling multiple graphs ( #21044 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-23 11:00:47 -07:00
8560a5b258
[Core][Model] PrithviMAE Enablement on vLLM v1 engine ( #20577 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2025-07-23 11:00:23 -07:00
316b1bf706
[Tests] Add tests for headless internal DP LB ( #21450 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-23 07:49:25 -07:00
7c734ee09b
[Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models. ( #21364 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2025-07-23 06:34:37 -07:00
f59ec35b7f
[V1] Check all pooling tasks during profiling ( #21299 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-23 05:53:26 -07:00
2671334d45
[Model] add Hunyuan V1 Dense Model support. ( #21368 )
...
Signed-off-by: Asher Zhang <asherszhang@tencent.com >
2025-07-23 03:54:08 -07:00
2cc5016a19
[Docs] Clean up v1/metrics.md ( #21449 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-23 03:37:25 -07:00
6929f8b437
[Misc] fixed nvfp4_moe test failures due to invalid kwargs ( #21246 )
...
Signed-off-by: Yang Chen <yangche@fb.com >
2025-07-23 01:41:43 -07:00
32ec9e2f2a
Mamba V2 Test not Asserting Failures. ( #21379 )
...
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
2025-07-23 01:40:27 -07:00
accac82928
[Sampler] Introduce logprobs mode for logging ( #21398 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-23 01:39:25 -07:00
23637dcdef
[Docs] Fix bullets and grammars in tool_calling.md ( #21440 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-23 01:23:20 -07:00
6364af92f8
Fixed typo in profiling logs ( #21441 )
2025-07-23 01:18:54 -07:00
7aaa2bd5a8
[Bugfix] ensure tool_choice is popped when tool_choice:null is passed in json payload ( #19679 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-07-23 00:30:05 -07:00
2f5c14de6a
add clear messages for deprecated models ( #21424 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-07-23 00:03:16 -07:00
f002e9a870
[Cleanup] Only log MoE DP setup warning if DP is enabled ( #21315 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-23 00:02:48 -07:00
a1f3610fc6
[Core] Add basic unit test for maybe_evict_cached_block ( #21400 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-23 00:02:02 -07:00
4ecedd1806
[Bugfix] Fix nightly transformers CI failure ( #21427 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-23 00:01:01 -07:00
107111a859
Changing "amdproduction" allocation. ( #21409 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-07-22 20:48:31 -07:00
2dec7c1a5d
[Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported ( #21420 )
...
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com >
2025-07-22 20:34:50 -07:00
08d2bd78da
[BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update ( #21414 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2025-07-22 20:33:57 -07:00
4f76a05f4f
[BugFix] Update python to python3 calls for image; fix prefix & input calculations. ( #21391 )
...
Signed-off-by: Eric Hanley <ericehanley@google.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-22 20:33:00 -07:00
f154bb9ff0
Simplify weight loading in Transformers backend ( #21382 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-22 20:29:43 -07:00
3ec7170ff1
[Bugfix][ROCm][Build] Fix build regression on ROCm ( #21393 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-22 20:27:41 -07:00
c401c64b4c
[CI/Build] Fix model executor tests ( #21387 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-22 20:25:37 -07:00
b77c7d327f
[BugFix] Fix ray import error mem cleanup bug ( #21381 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-07-22 16:19:55 -07:00
35bc8bd5fb
[Misc] Copy HF_TOKEN env var to Ray workers ( #21406 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-22 16:18:42 -07:00
4594fc3b28
[Model] Add Qwen3CoderToolParser ( #21396 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-07-22 15:05:57 -07:00
ae268b6326
Fix Flashinfer Allreduce+Norm enable disable calculation based on fi_allreduce_fusion_max_token_num ( #21325 )
...
Signed-off-by: XIn Li <xinli@nvidia.com >
2025-07-22 12:42:31 -07:00
35366ae57c
[CI/Build] Fix test failure due to updated model repo ( #21375 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-22 08:39:35 -07:00
2226d5bd85
[Bugfix] Decode Tokenized IDs to Strings for hf_processor in llm.chat() with model_impl=transformers ( #21353 )
...
Signed-off-by: ariG23498 <aritra.born2fly@gmail.com >
2025-07-22 08:27:28 -07:00
44554a0068
Add tokenization_kwargs to encode for embedding model truncation ( #21033 )
2025-07-22 08:24:00 -07:00
226b452a20
Revert "[Refactor] Fix Compile Warning #1444-D ( #21208 )" ( #21384 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-22 08:22:10 -07:00
f38ee34a0a
[feat] Enable mm caching for transformers backend ( #21358 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2025-07-22 08:18:46 -07:00
b194557a6c
Adds parallel model weight loading for runai_streamer ( #21330 )
...
Signed-off-by: bbartels <benjamin@bartels.dev >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-22 08:15:53 -07:00
774d0c014b
[Perf] Cuda Kernel for Per Token Group Quant ( #21083 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-22 07:27:15 -07:00
2c8db17cfd
[feat]: add SM100 support for cutlass FP8 groupGEMM ( #20447 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-22 07:27:12 -07:00
4fb56914c5
[perf] Add fused MLA QKV + strided layernorm ( #21116 )
...
Signed-off-by: Mickael Seznec <mickael@mistral.ai >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-22 07:07:44 -07:00
0df4d9b06b
[Misc] unify variable for LLM instance v2 ( #21356 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-22 06:32:36 -07:00
ed25054577
[Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool ( #21222 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-22 06:17:47 -07:00
10904e6d75
[benchmark] Port benchmark request sent optimization to benchmark_serving ( #21209 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-22 05:28:00 -07:00
a32237665d
[Core] Optimize update checks in LogitsProcessor ( #21245 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-22 05:27:18 -07:00
bc8a8ce5ec
[Misc] Remove deprecated args in v0.10 ( #21349 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-07-22 05:26:39 -07:00
32142b3c62
[Bugfix] Fix eviction cached blocked logic ( #21357 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-07-22 01:18:40 -07:00
82b8027be6
Add arcee model ( #21296 )
...
Signed-off-by: alyosha-swamy <raghav@arcee.ai >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-22 00:57:43 -07:00
3779eb8c81
[Feature][eplb] add verify ep or tp or dp ( #21102 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-21 23:41:14 -07:00
9e23ad9655
Update fp4 quantize API ( #21327 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com >
2025-07-21 23:40:21 -07:00
e69a92a1ce
[Bug] DeepGemm: Fix Cuda Init Error ( #21312 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-21 23:36:18 -07:00
8425f785ad
[Misc] DeepEPHighThroughtput - Enable Inductor pass ( #21311 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-21 23:35:45 -07:00
c17231e827
Fix kv_cache_dtype handling for out-of-tree HPU plugin ( #21302 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
Co-authored-by: Chendi.Xue <chendi.xue@intel.com >
2025-07-21 23:35:14 -07:00
6e5b5ca580
[Refactor] Fix Compile Warning #1444-D ( #21208 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-21 23:33:51 -07:00
488d8a986a
[V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible ( #21300 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-21 23:31:18 -07:00
af376ca19d
[Core] Minimize number of dict lookup in _maybe_evict_cached_block ( #21281 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-21 22:37:34 -07:00
e7b2042681
Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE ( #20762 ) ( #21334 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-07-21 21:49:01 -07:00
90f1e55421
[Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU ( #21338 )
...
Signed-off-by: ratnampa <ratnam.parikh@intel.com >
2025-07-21 21:48:27 -07:00
5e70dcd6e6
[Doc] Fix CPU doc format ( #21316 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-21 21:47:49 -07:00
25d585ab7b
[XPU] Enable external_launcher to serve as an executor via torchrun ( #21021 )
...
Signed-off-by: chzhang <chaojun.zhang@intel.com >
2025-07-21 21:47:35 -07:00
8d0a01a5f2
[v1][sampler] Inplace logprobs comparison to get the token rank ( #21283 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-21 13:47:47 -07:00
0ec82edda5
[perf] Speed up align sum kernels ( #21079 )
...
Signed-off-by: Himanshu Jaju <hj@mistral.ai >
2025-07-21 11:19:23 -07:00
005ae9be6c
Fix bad lm-eval fork ( #21318 )
2025-07-21 10:47:51 -07:00
29d1ffc5b4
[DP] Fix Prometheus Logging ( #21257 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-07-21 09:11:35 -07:00
304dce7ec0
[Attention] Clean up iRoPE in V1 ( #21188 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-07-21 09:10:30 -07:00
6ece16c4fe
[Misc] Add dummy maverick test ( #21199 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-21 09:08:09 -07:00
a0e827e07c
[BugFix] make utils.current_stream thread-safety ( #21252 ) ( #21253 )
...
Signed-off-by: simpx <simpxx@gmail.com >
2025-07-21 09:07:36 -07:00
a15a50fc17
[CPU] Enable shared-memory based pipeline parallel for CPU backend ( #21289 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-21 09:07:08 -07:00
6dda13c86b
[Misc] Add sliding window to flashinfer test ( #21282 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-21 08:37:49 -07:00
6b46c4b653
Add Nvidia ModelOpt config adaptation ( #19815 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
2025-07-21 10:02:58 -04:00
d97841078b
[Misc] unify variable for LLM instance ( #20996 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-21 12:18:33 +01:00
e6b90a2805
[Docs] Make tables more space efficient in supported_models.md ( #21291 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-21 02:25:02 -07:00
be54a951a3
[Docs] Fix hardcoded links in docs ( #21287 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-21 02:23:57 -07:00
042af0c8d3
[Model][1/N] Support multiple poolers at model level ( #21227 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-21 02:22:21 -07:00
378d33c392
[Bugfix] Fix missing placeholder in logger debug ( #21280 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-20 22:50:06 -07:00
940af1f03a
Add the instruction to run e2e validation manually before release ( #21023 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-07-20 22:29:18 -07:00
92615d7fe8
[Docs] Add RFC Meeting to Issue Template ( #21279 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-07-20 21:58:07 -07:00
8188196a1c
[CI] Cleanup modelscope version constraint in Dockerfile ( #21243 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-07-20 20:13:02 -07:00
7ba34b1241
[bugfix] fix syntax warning caused by backslash ( #21251 )
2025-07-20 17:12:10 +00:00
9499e26e2a
[Model] Support VLMs with transformers backend ( #20543 )
...
Signed-off-by: raushan <raushan@huggingface.co >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-20 13:25:50 +00:00
51ba839555
[Model] use AutoWeightsLoader for bart ( #18299 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-07-20 08:15:50 +00:00
d1fb65bde3
Enable v1 metrics tests ( #20953 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-07-20 03:22:02 +00:00
3a1d8940ae
[TPU] support fp8 kv cache quantization ( #19292 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-20 03:01:00 +00:00
2b504eb770
[Docs] [V1] Update docs to remove enforce_eager limitation for hybrid models. ( #21233 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-19 16:09:58 -07:00
10eb24cc91
GLM-4 Update ( #20736 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Lu Fang <fanglu@fb.com >
2025-07-19 22:40:31 +00:00
2e8cbb58f3
[BugFix] Fix full cuda graph slot_mapping ( #21228 )
...
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com >
2025-07-19 14:13:18 -07:00
752c6ade2e
[V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small ( #21217 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-19 13:53:17 -07:00
881e3cbe3b
[V1] [Hybrid] Enable piecewise CUDA Graph for mamba layers ( #21194 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-19 19:27:21 +00:00
9f414a12ad
[BugFix] Make PD work with Ray ( #21072 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-07-19 08:46:50 -07:00
6a971ed692
[Docs] Update the link to the 'Prometheus/Grafana' example ( #21225 )
2025-07-19 06:58:07 -07:00
da6579bf41
[CI/CD][bugfix]fix: error argument to loads has incompatible type ( #21223 )
...
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com >
2025-07-19 05:16:48 -07:00
c81259d33a
Fix/remove some broken model executor tests ( #21224 )
...
Signed-off-by: Rabi Mishra <ramishra@redhat.com >
2025-07-19 12:15:07 +00:00
e3a0e43d7f
[bugfix] Fix auto thread-binding when world_size > 1 in CPU backend and refactor code ( #21032 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-19 05:13:55 -07:00
b3d82108e7
[Bugfix][Frontend] Fix openai CLI arg middleware ( #21220 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-19 02:40:38 -07:00
6d0734c562
[NVIDIA] Add SM100 Flashinfer MoE blockscale fp8 backend for low latency ( #20645 )
...
Signed-off-by: kaixih <kaixih@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-19 02:33:01 -07:00
7d94577138
Add torch golden impl for moe_align_block_size kernel test ( #20653 )
...
Signed-off-by: Shixian Cui <shixian@amazon.com >
Co-authored-by: Shixian Cui <shixian@amazon.com >
2025-07-19 02:32:36 -07:00
59f935300c
[BugFix] Fix potential cuda-graph IMA ( #21196 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-19 02:18:47 -07:00
18e519ec86
[Bugfix] Fix ndarray video color from VideoAsset ( #21064 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-19 02:17:16 -07:00
1eaff27815
[V0 deprecation] Remove long context LoRA ( #21169 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-19 02:15:41 -07:00
cf8cc32674
Fix a couple of Voxtral tests ( #21218 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-07-19 09:13:41 +00:00
3a2cb2649d
[Misc][Tools][Benchmark] Add readme file for auto_tune script ( #20779 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-07-19 09:06:59 +00:00
3e04107d97
[Model] EXAONE 4.0 model support ( #21060 )
...
Signed-off-by: Deepfocused <rlawhdrhs27@gmail.com >
Signed-off-by: woongsik <rlawhdrhs27@gmail.com >
2025-07-19 14:25:44 +08:00
37bd8d6e4c
[Bug] DeepGemm: Fix TypeError: per_block_cast_to_fp8() missing 1 required positional argument: 'use_ue8m0' for SM100 ( #21187 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-18 23:25:22 -07:00
468e2400fe
[BugFix][CPU] Fix TorchSDPABackendImpl doesn't have use_irope ( #21200 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-18 23:18:48 -07:00
dcc6cfb991
[Kernel][Performance] Tweak MoE Batched silu_mul_fp8_quant_deep_gemm kernel ( #21193 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-18 23:09:51 -07:00
dd572c0ab3
[V0 Deprecation] Remove V0 Spec Decode workers ( #21152 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-18 21:47:50 -07:00
9ffe905a41
[Bugfix][Model] Fix LoRA for Mistral-Small-3.1-24B-Instruct-2503 ( #21183 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-07-18 21:15:03 -07:00
9a9fda1423
[Core] Support Local Chunked Attention for Hybrid KV Cache ( #19351 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
Signed-off-by: Lu Fang <fanglu@meta.com >
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lu Fang <fanglu@meta.com >
2025-07-18 20:48:38 -07:00
466e878f2a
[Quantization] Enable BNB support for more MoE models ( #21100 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-18 17:52:02 -07:00
217937221b
Elastic Expert Parallel Initial Support ( #20775 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-18 17:46:09 -07:00
5782581acf
[Bugfix] Voxtral on Blackwell GPUs (RTX 50 series) ( #21077 )
...
Signed-off-by: hax0r31337 <liulihaocaiqwq@gmail.com >
2025-07-18 18:40:18 -04:00
0f199f197b
[Core] Avoid KVCacheBlock.__eq__ invocations in FreeKVCacheBlockQueue ( #21005 )
...
Signed-off-by: Jialin Ouyang <jialino@meta.com >
2025-07-18 12:34:40 -07:00
b2eb2b5ad7
[Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6.0 ( #19346 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-07-18 14:10:21 -04:00
21274ab476
[CI] Update CODEOWNERS for vllm/compilation ( #21185 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-07-18 06:51:12 -07:00
ed8cbfedf8
Let GraniteMoeAttention use YaRN ( #21174 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-18 05:52:52 -07:00
45badd05d0
[Core] Set pooling params based on task and model ( #21128 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-18 05:41:17 -07:00
4adc66f64d
[Bugfix] Allocate less memory in non-batched CUTLASS MoE ( #21121 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2025-07-18 18:55:52 +08:00
55ad648715
[Doc] Fix typo in model name ( #21178 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-18 03:55:10 -07:00
5895afd780
[Bugfix] The special_tokens in tokenizer should also be controlled by do_lower_case in encoder_config. ( #20750 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-18 09:10:47 +00:00
ca4eb82bcb
[Model] Re-add the implicit conversion feature for as_seq_cls_model ( #21103 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-18 07:15:07 +00:00
ba2dfbb0c2
[Misc] Make MM embedding merge interface explicit in model runner ( #21147 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-18 07:13:57 +00:00
1bf65138f6
[benchmark] Sending request strictly follows the random intervals ( #21108 )
...
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com >
2025-07-18 06:22:08 +00:00
54cf1cae62
[Misc] Do not print async output warning for v1 ( #21151 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-17 21:57:02 -07:00
5780121c95
[Perf] Add swap_ab to SM90 FP8 non-block CUTLASS moe grouped gemm ( #20911 )
...
Signed-off-by: Shixian Cui <shixian@amazon.com >
Co-authored-by: Shixian Cui <shixian@amazon.com >
2025-07-18 04:34:43 +00:00
c7d8724e78
[Core] FlashInfer CUTLASS fused MoE backend (NVFP4) ( #20037 )
...
Signed-off-by: shuw <shuw@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-17 21:32:45 -07:00
b38baabcf9
[Doc] Add inplace weights loading example ( #19640 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-17 21:12:23 -07:00
89cab4d01f
[Attention] Make local attention backend agnostic ( #21093 )
2025-07-18 00:10:42 -04:00
b9a21e9173
[Docs] Update supported models documentation with missing models ( #20844 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-07-17 20:12:13 -07:00
c4e3b12524
[Docs] Add minimal demo of Ray Data API usage ( #21080 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-17 20:09:19 -07:00
8dfb45ca33
[Bugfix] Fix the tensor non-contiguous issue for Flashinfer TRT-LLM backend attention kernel ( #21133 )
2025-07-18 00:35:58 +00:00
8a8fc94639
[Log] Debugging Log with more Information ( #20770 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-18 00:19:46 +00:00
4de7146351
[V0 deprecation] Remove V0 HPU backend ( #21131 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-17 16:37:36 -07:00
ac9fb732a5
On environments where numa cannot be detected we get 0 ( #21115 )
...
Signed-off-by: Eric Curtin <ecurtin@redhat.com >
2025-07-17 18:52:17 +00:00
a3a6c695f4
[Misc] Qwen MoE model supports LoRA ( #20932 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-17 18:32:52 +00:00
90bd2ab6e3
[Model] Update pooling model interface ( #21058 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-17 16:05:40 +00:00
9fb2d22032
[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE ( #20762 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2025-07-17 09:56:44 -04:00
2d6a38209b
[Docs] Move code block out of admonition now that it's short ( #21118 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-17 06:12:29 -07:00
89e3c4e9b4
[Misc] Avoid unnecessary import ( #21106 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-07-17 12:57:41 +00:00
fe8a2c544a
[Docs] Improve docstring formatting for FusedMoEParallelConfig.make ( #21117 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-17 04:13:00 -07:00
4ef00b5cac
[VLM] Add Nemotron-Nano-VL-8B-V1 support ( #20349 )
...
Signed-off-by: Kyle Huang <kylhuang@nvidia.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-17 03:07:55 -07:00
5a7fb3ab9e
[Model] Add ToolParser and MoE Config for Hunyuan A13B ( #20820 )
...
Signed-off-by: Asher Zhang <asherszhang@tencent.com >
2025-07-17 09:10:09 +00:00
11dfdf21bf
[Kernel] DeepGemm MoE : Integrate triton permute / unpermute kernels ( #20903 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-17 08:10:37 +00:00
fdc5b43d20
[Bugfix]: Fix final_res_batch list index out of range error ( #21055 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-17 00:29:09 -07:00
c5b8b5953a
[Misc] Fix PhiMoE expert mapping ( #21085 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-17 05:47:49 +00:00
4fcef49ec4
[V1] [KVConnector] Fix MultiprocExecutor worker output aggregation ( #21048 )
...
Signed-off-by: David Ben-David <davidb@pliops.com >
Co-authored-by: David Ben-David <davidb@pliops.com >
2025-07-17 13:29:45 +08:00
8a4e5c5f3c
[V1][P/D]Enhance Performance and code readability for P2pNcclConnector ( #20906 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-07-16 22:13:00 -07:00
76b494444f
[Attention] Refactor attention metadata builder interface ( #20466 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-17 04:44:25 +00:00
28a6d5423d
[Bugfix] Fix Machete zero point issue for GPTQ models on SM90 ( #21066 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-16 19:54:45 -07:00
58760e12b1
[TPU] Start using python 3.12 ( #21000 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-07-16 19:37:44 -07:00
a50d918225
[Docker] Allow FlashInfer to be built in the ARM CUDA Dockerfile ( #21013 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-16 19:37:13 -07:00
c9ba8104ed
[Bugfix] weight loading use correct tp_group with patch_tensor_parallel_group ( #21024 )
...
Signed-off-by: KevinXiong-C <kevin_xiong1997@outlook.com >
2025-07-16 19:36:36 -07:00
4e7dfbe7b4
Update PyTorch to torch==2.7.1 for CUDA ( #21011 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-17 02:30:44 +00:00
72ad273582
Remove torch_xla.tpu.version() from pallas.py. ( #21065 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-17 00:25:26 +00:00
01513a334a
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) ( #12010 )
...
Signed-off-by: Nir David <ndavid@habana.ai >
Signed-off-by: Uri Livne <ulivne@habana.ai >
Co-authored-by: Uri Livne <ulivne@habana.ai >
2025-07-16 15:33:41 -04:00
ac2bf41e53
[Model] Remove model sampler ( #21059 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-16 19:03:37 +00:00
a931b4cdcf
Remove Qwen Omni workaround that's no longer necessary ( #21057 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-16 16:25:23 +00:00
a0f8a79646
[fix] fix qwen image_embeds input ( #21049 )
...
Signed-off-by: h-avsha <avshalom.manevich@hcompany.ai >
2025-07-16 15:17:20 +00:00
18bdcf4113
feat - add a new endpoint get_tokenizer_info to provide tokenizer/chat-template information ( #20575 )
...
Signed-off-by: m-misiura <mmisiura@redhat.com >
2025-07-16 21:52:14 +08:00
1c3198b6c4
[Model] Consolidate pooler implementations ( #20927 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-16 13:39:13 +00:00
260127ea54
[Docs] Add intro and fix 1-2-3 list in frameworks/open-webui.md ( #19199 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-16 06:11:38 -07:00
d0dc4cfca4
Fix inadvertently silenced PP tests for mp, add DeepSeek V2/V3 model family to PP tests ( #20831 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-07-16 00:14:49 -07:00
d31a647124
[BugFix] Fix import error on non-blackwell machines ( #21020 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-15 22:27:29 -07:00
85431bd9ad
[TPU] fix kv_cache_update kernel block size choosing logic ( #21007 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-16 04:39:48 +00:00
c11013db8b
[Meta] Llama4 EAGLE Support ( #20591 )
...
Signed-off-by: qizixi <qizixi@meta.com >
Co-authored-by: qizixi <qizixi@meta.com >
2025-07-15 21:14:15 -07:00
1eb2b9c102
[CI] update typos config for CI pre-commit and fix some spells ( #20919 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-07-15 21:12:40 -07:00
6ebf313790
Avoid direct comparison of floating point numbers ( #21002 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-07-15 21:12:14 -07:00
cfbcb9ed87
[Voxtral] Add more tests ( #21010 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-15 21:11:49 -07:00
76ddeff293
[Doc] Remove duplicate docstring ( #21012 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-15 20:09:13 -07:00
f46098335b
[Bugfix] Fix Mistral3 support on SM100/SM120 ( #20998 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-15 20:08:41 -07:00
e9534c7202
[CI][HPU] update for v0 deprecate by switching to VLLM_TARGET_DEVICE=empty ( #21006 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2025-07-15 20:07:05 -07:00
7976446015
Add Dockerfile argument for VLLM_USE_PRECOMPILED environment ( #20943 )
...
Signed-off-by: dougbtv <dosmith@redhat.com >
2025-07-15 19:53:57 -07:00
fcb9f879c1
[Bugfix] Correct per_act_token in CompressedTensorsW8A8Fp8MoECutlassM… ( #20937 )
...
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-07-15 19:53:42 -07:00
3ed94f9d0a
[Docs] Enhance Anyscale documentation, add quickstart links for vLLM ( #21018 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-15 19:46:56 -07:00
fa839565f2
[Misc] Refactor: Improve argument handling for conda command ( #20481 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-15 19:43:19 -07:00
75a99b98bf
[Chore] Remove outdated transformers check ( #20989 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-07-15 19:42:40 -07:00
b5c3b68359
[Misc] bump xgrammar version to v0.1.21 ( #20992 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-15 19:42:16 -07:00
6cbc4d4bea
[Model] Add ModelConfig class for GraniteMoeHybrid to override default max_seq_len_to_capture ( #20923 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-15 19:19:10 -07:00
153c6f1e61
[Frontend] Remove print left in FrontendArgs.add_cli_args ( #21004 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-15 19:18:41 -07:00
34cda778a0
[Frontend] OpenAI Responses API supports input image ( #20975 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-15 18:59:36 -06:00
30800b01c2
[Nvidia] Integrate SM100 cudnn prefill API to MLA prefill ( #20411 )
...
Signed-off-by: Elfie Guo <elfieg@nvidia.com >
Co-authored-by: Elfie Guo <eflieg@nvidia.com >
2025-07-15 17:56:45 -07:00
10be209493
[Bug Fix] get_distributed_init_method should get the ip from get_ip i… ( #20889 )
...
Signed-off-by: Chen Li <lcpingping@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-07-15 21:23:52 +00:00
19c863068b
[Frontend] Support cache_salt in /v1/completions and /v1/responses ( #20981 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
2025-07-15 21:01:04 +00:00
f29fd8a7f8
[BugFix] fix 3 issues: (1) using metadata for causal-conv1d, (2) indexing overflow in v1 vLLM, and (3) init_states in v0 ( #20838 )
...
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com >
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com >
2025-07-15 16:08:26 -04:00
ed10f3cea1
[ROCm] warpSize is being made non constexpr in ROCm 7.0 ( #20330 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-15 14:01:44 -04:00
b637e9dcb8
Add full serve CLI reference back to docs ( #20978 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 17:42:30 +00:00
1e36c8687e
[Deprecation] Remove nullable_kvs ( #20969 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 17:21:50 +00:00
5bac61362b
Configure Gemini ( #20971 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 09:37:05 -07:00
313ae8c16a
[Deprecation] Remove everything scheduled for removal in v0.10.0 ( #20979 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 15:57:53 +00:00
c847e34b39
[CI/Build] Fix wrong path in Transformers Nightly Models Test ( #20994 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-15 08:53:16 -07:00
e7e3e6d263
Voxtral ( #20970 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-15 07:35:30 -07:00
4ffd963fa0
[v1][core] Support for attention free models ( #20811 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2025-07-15 14:20:01 +00:00
56fe4bedd6
[Deprecation] Remove TokenizerPoolConfig ( #20968 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 14:00:50 +00:00
d91278181d
[doc] Add more details for Ray-based DP ( #20948 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-15 05:37:12 -07:00
20149d84d9
[MISC] Add init files for python package ( #20908 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-07-15 12:16:33 +00:00
3534c39a20
[V1] [Hybrid] Refactor mamba state shape calculation; enable V1 via cli ( #20840 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-15 04:04:35 -07:00
c586b55667
[TPU] Optimize kv cache update kernel ( #20415 )
...
Signed-off-by: Yifei Teng <tengyifei88@gmail.com >
2025-07-15 03:56:43 -07:00
33d560001e
[Docs] Improve documentation for ray cluster launcher helper script ( #20602 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-15 03:55:45 -07:00
f148c44c6a
[frontend] Refactor CLI Args for a better modular integration ( #20206 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-07-15 02:23:42 -07:00
235bfd5dfe
[Docs] Improve documentation for RLHF example ( #20598 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-15 01:54:10 -07:00
68d28e37b0
[frontend] Add --help=page option for paginated help output ( #20961 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-15 00:42:00 -07:00
37a7d5d74a
[Misc] Refactor AllReduceFusionPass. Remove parameter ( #20918 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-07-15 06:57:40 +00:00
d4d309409f
Implement Async Scheduling ( #19970 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-14 23:01:46 -07:00
85bd6599e4
[Model] Add AutoWeightsLoader support for BERT, RoBERTa ( #20534 )
...
Signed-off-by: Jennifer He <islandhe@gmail.com >
Signed-off-by: <islandhe@gmail.com >
Signed-off-by: Jen H <islandhe@gmail.com >
2025-07-15 13:34:24 +08:00
91b3d190ae
[cold start] replace VLLM_COMPILE_DEPYF with debug_dump_dir ( #20940 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-07-15 13:02:17 +08:00
fc017915f5
[Doc] Clearer mistral3 and pixtral model support description ( #20926 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-14 21:56:53 -07:00
9ad0a4588b
[Bugfix] Switch bailout logic for kv-cache-dtype with SM100 Flashinfer ( #20934 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-07-15 03:27:50 +00:00
016b8d1b7f
Enabled BnB NF4 inference on Gaudi ( #20172 )
...
Signed-off-by: Ruheena Suhani Shaik <rsshaik@habana.ai >
2025-07-14 20:26:08 -07:00
80305c1b24
[CI] Fix flaky test_streaming_response test ( #20913 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-14 20:15:15 -07:00
37e2ecace2
feat: add image zoom to improve image viewing experience ( #20763 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-14 20:14:23 -07:00
054c8657e3
[Docs] Add Kuberay to deployment integrations ( #20592 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-14 20:13:55 -07:00
d4170fad39
Use w8a8 quantized matmul Pallas kernel ( #19170 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-07-15 03:06:33 +00:00
946aadb4a0
[CI/Build] Split Entrypoints Test into LLM and API Server ( #20945 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-15 02:44:18 +00:00
bcdfb2a330
[Bugfix] Fix incorrect dispatch for CutlassBlockScaledGroupedGemm and DeepGEMM ( #20933 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-15 01:42:17 +00:00
ba8c300018
[BugFix] VLLM_DISABLE_COMPILE_CACHE=1 should disable all reads and writes from the cache ( #20942 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-07-15 01:26:18 +00:00
8cdc371217
SM100 Cutlass MLA decode with unrestricted num_heads (< 128) for DeepSeek TP ( #20769 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-07-15 01:06:38 +00:00
61e20828da
Fall back if flashinfer comm module not found ( #20936 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-14 23:11:18 +00:00
55e1c66da5
[Docs] remove outdated performance benchmark ( #20935 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-07-14 22:14:17 +00:00
86f3ac21ce
Fix overflow indexing in causal_conv1d kernel ( #20938 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-14 21:43:07 +00:00
149f2435a5
[Misc] Relax translations tests ( #20856 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-14 20:08:36 +00:00
c0569dbc82
[Misc] ModularKernel : Perform WeightAndReduce inside TritonExperts & DeepGemmExperts ( #20725 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-14 19:47:16 +00:00
8bb43b9c9e
Add benchmark dataset for mlperf llama tasks ( #20338 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-14 19:10:07 +00:00
559756214b
Change default model to Qwen3-0.6B ( #20335 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-07-14 16:54:52 +00:00
6d0cf239c6
[CI/Build] Add Transformers nightly tests in CI ( #20924 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-14 16:33:17 +00:00
3fc964433a
[Misc] Clean up Aimv2 config registration in Ovis config ( #20921 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-14 15:36:43 +00:00
0caf61c08a
[CI] Update codeowner for compilation code ( #20929 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-14 08:33:19 -07:00
667624659b
[CI] cc folks on changes to vllm/compilation ( #20925 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-07-14 07:52:17 -07:00
38efa28278
[Model] Add Ling implementation ( #20680 )
...
Signed-off-by: vito.yy <vito.yy@antgroup.com >
2025-07-14 22:10:32 +08:00
e8cc53af5e
[Misc] Log the reason for falling back to FlexAttention ( #20699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-14 04:16:51 -07:00
a4851cfe68
[Bugfix]: Fix messy code when using logprobs ( #20910 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-14 11:06:45 +00:00
9887e8ec50
[Misc] Remove unused function ( #20909 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-14 10:48:55 +00:00
f326ab9c88
[Bugfix] Bump up mistral_common to support v13 tokenizer ( #20905 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-14 10:45:03 +00:00
dcf2a5e208
[CI/Build] Fix OOM issue in Jina-VL test ( #20907 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-14 10:32:35 +00:00
1e9438e0b0
[MISC] Move bind_kv_cache to worker module ( #20900 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-07-14 09:40:00 +00:00
697ef765ee
[Refactor][V1] Move outlines utils for V1 imports ( #20878 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-07-14 00:58:35 -07:00
a99b9f7dee
[Quantization] add BNB for MixtralForCausalLM ( #20893 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-14 07:34:34 +00:00
c488b928a7
[ROCm] [Bugfix] [Critical]: Fix mamba compilation bug ( #20883 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-07-14 15:23:28 +08:00
2c7fa47161
Fix: Add missing EOFError handling in CLI complete command ( #20896 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-14 07:09:57 +00:00
88fc8a97e3
Removing redundant python version check ( #20888 )
...
Signed-off-by: Dannyso05 <dansong1177@gmail.com >
2025-07-14 06:15:05 +00:00
66f6fbd393
[Prefix Cache] Add reproducible prefix-cache block hashing using SHA-256 + CBOR (64bit) ( #20511 )
...
Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com >
2025-07-14 02:45:31 +00:00
8632e831ba
[Core] Add update_config RPC method ( #20095 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-14 00:49:18 +00:00
4bbfc36b16
[V1] Hybrid allocator without prefix caching ( #20661 )
...
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
2025-07-13 16:55:14 +00:00
80d38b8ac8
[V1] [ROCm] [AITER] Upgrade AITER to commit 916bf3c and bugfix APIs ( #20880 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-07-13 15:19:32 +00:00
211b6a6113
[Bugfix] fix define of RerankDocument ( #20877 )
...
Signed-off-by: liuchenlong <liuchenlong@xiaohongshu.com >
Co-authored-by: liuchenlong <liuchenlong@xiaohongshu.com >
2025-07-13 14:32:40 +00:00
247102f07f
[Bugfix] Fix: add patch_rope_scaling after hf override ( #20857 )
...
Signed-off-by: Wang Siyuan <wsy0227@sjtu.edu.cn >
Signed-off-by: Wang Siyuan <sywang0227@gmail.com >
2025-07-13 00:13:25 -07:00
bd4c1e6fdb
Support for LlamaForSequenceClassification ( #20807 )
...
Signed-off-by: thechaos16 <thechaos16@gmail.com >
2025-07-13 00:09:34 -07:00
99b4f080d8
Re-enable google/gemma-3-1b-it accuracy test. ( #20866 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-12 21:48:56 -07:00
020f58abcd
[Core] Support multiple tasks per model ( #20771 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-12 19:40:11 -07:00
c1acd6d7d4
[Refactor] Change the way of import triton ( #20774 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-12 19:39:55 -07:00
3b3b778d4a
[Bugfix] Fix a couple PPLX+CUTLASS MoE bugs ( #20825 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2025-07-12 19:39:14 -07:00
42d440c22b
[Perf] Use Triton instead of Torch for DeepGEMM Per Token Group Quant ( #20841 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-12 19:38:45 -07:00
f45a332886
[Sched] Enhance the logic to remove stopped requests from queues ( #20739 )
2025-07-12 15:33:13 -07:00
6e2c176e1f
[Bugfix] Restrict Machete to only run on Hopper ( #20830 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-12 17:34:40 +00:00
a86754a12b
[docs] convert supported configs to table ( #20858 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-12 06:54:50 -07:00
c2a2f19aba
[Bugfix] Fix Tensor Parallelism Padding Consistency in Granite Models ( #20843 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-07-12 06:11:30 -07:00
2c11a738b3
[Model] New model support for microsoft/Phi-4-mini-flash-reasoning ( #20702 )
...
Signed-off-by: Congcong Chen <congcongchen@microsoft.com >
2025-07-12 06:02:10 -07:00
b639327ad9
Revert "Use NVCC --compress-mode to reduce binary size by 30% #20694 " ( #20853 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 23:07:35 -07:00
4afe687a82
Enable ModelOpt Llama4 fp8 checkpoint deployment ( #20419 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
2025-07-11 23:07:16 -07:00
5de8d9f111
Remove extra tensor on CPU ( #20693 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-07-12 14:06:34 +08:00
c1c8ca57ff
[cold start time] add envs.VLLM_COMPILE_DEPYF to guard decompile ( #20790 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-07-11 23:06:13 -07:00
a3a5a47e48
[Bugfix] Fix torch.compile x LoRA for PyTorch 2.8 ( #20823 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-07-11 23:06:04 -07:00
fb25e95688
[Docs] Update basic.md ( #20846 )
2025-07-11 23:05:32 -07:00
0d4891cd03
[Bug] Fix DeepGemm for EP low latency case ( #20833 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-11 23:05:12 -07:00
f56d2996ca
[Misc] Respect no_use_tqdm_on_load flag while capturing CUDA graph ( #20834 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-07-11 23:04:45 -07:00
147afb448b
[Bugfix] Replace unavailable video url in multimodal test ( #20854 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-12 05:25:39 +00:00
3c7d942da8
[Frontend] Abstract prompt and SpeechToTextConfig for transcriptions models ( #20637 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-11 21:33:26 -07:00
890323dc1b
[Bugfix] : Fix typo - logger.warn_once -> logger.warning_once ( #20852 )
2025-07-11 20:56:24 -07:00
01cae37713
[CI/Build] Ensure compatibility with Transformers v4.53 ( #20541 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-11 20:53:07 -07:00
11c0198615
[Bugfix] Fix tensor parallel issue in Qwen3 reranker weight loading ( #20682 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-07-11 20:52:43 -07:00
b1235c3e10
[Bugfix] Lazy import fused_experts in BitsAndBytesMoEMethod to avoid break not-cuda-alike devices ( #20822 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-11 20:52:05 -07:00
44d02f54db
[Misc] Restrict deep_gemm's log output ( #20827 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-11 20:50:42 -07:00
a8593237c0
Add pynccl all-gatherv and reducescatterv ( #20154 )
...
Signed-off-by: Trevor Morris <tmorris@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-11 18:59:23 -07:00
fc0f41d10a
Integration SM100 FlashInfer fused allreduce RMSNorm ( #20691 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-07-11 18:58:15 -07:00
7b828e30d5
[CI Bug] Fix Async Engine, Inputs, Utils, Worker Test: 'State' object has no attribute 'enable_server_load_tracking' ( #20845 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-11 18:57:24 -07:00
5f0af36af5
Update kimi-k2 tool calling docs, enable unit tests ( #20821 )
...
Signed-off-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
2025-07-11 20:16:14 +00:00
0d21b2664c
[Bugfix] Fix OOM in language generation test ( #20814 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-11 11:21:52 -07:00
9907fc4494
[Docs] Data Parallel deployment documentation ( #20768 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-11 09:42:10 -07:00
d47661f0cd
[Kernel] Basic tuned configs for NVFP4 CUTLASS dense GEMM ( #20646 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 10:05:33 -06:00
53fa457391
[Misc] Add unit tests for MoE ModularKernel combinations + Profiling utility ( #20449 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-11 07:51:46 -07:00
6fb162447b
[doc] fix ordered list issue ( #20819 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-11 06:49:46 -07:00
66177189c5
[Bugfix] Add missing field to TritonLanguagePlaceholder ( #20812 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-11 05:25:11 -07:00
b4f0b5f9aa
Temporarily suspend google/gemma-3-1b-it. ( #20722 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-11 11:21:26 +00:00
cbd14ed561
[Bugfix] Refactor /invocations to be task-agnostic ( #20764 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-11 03:20:54 -07:00
7bd4c37ae7
[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). ( #19825 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: shuw <shuw@nvidia.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-11 09:23:23 +00:00
8020e98c9f
[Quantization][1/N] MoE support BNB-Inflight Quantization ( #20061 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-11 08:01:13 +00:00
762be26a8e
[Bugfix] Upgrade depyf to 0.19 and streamline custom pass logging ( #20777 )
...
Signed-off-by: Luka Govedic <lgovedic@redhat.com >
Signed-off-by: luka <lgovedic@redhat.com >
2025-07-11 00:15:22 -07:00
6a9e6b2abf
[doc] fold long code block ( #20795 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-10 23:16:41 -07:00
5d09152ff1
[V1] Enable Mamba2 layers other than MambaMixer2 in the v1 engine ( #20660 )
...
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
2025-07-11 05:53:31 +00:00
31d5c1797f
[Perf][fp8] Use CustomOp abstraction for fp8 quant for better perf ( #19830 )
...
Signed-off-by: Luka Govedic <lgovedic@redhat.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-11 04:56:28 +00:00
35514b682a
[XPU] XCCL support enabled in torch 2.8.0.dev nightly builds ( #20705 )
...
Signed-off-by: ratnampa <ratnam.parikh@intel.com >
2025-07-10 20:39:52 -07:00
e2de455c34
[Feature] Integrate SM100 DeepGEMM support ( #20087 )
2025-07-10 20:18:05 -07:00
5b032352cc
[Attention] MLA - Flashinfer Ragged Prefill ( #20034 )
2025-07-10 20:17:47 -07:00
922f316441
[Model] Support HF format of minimax ( #20211 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 02:55:21 +00:00
5923ab9524
[fix]: disable cutlass block scaled group gemm for EP ( #20781 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
2025-07-11 02:39:18 +00:00
0cf893cae1
Add kimi-k2 tool parser ( #20789 )
...
Signed-off-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
2025-07-11 10:36:23 +08:00
cf75cd2098
[CI Bugfix] Specify same TORCH_CUDA_ARCH_LIST for flashinfer aot and install ( #20772 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 01:16:01 +00:00
b854321ffe
[Docs] Lazy import gguf ( #20785 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-07-10 16:06:37 -07:00
5b6fe23d05
[Bugfix][Benchmark] Make sure the output length > 0 when testing prefill workload. ( #20786 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-10 14:52:46 -07:00
f0c98cae27
[Misc] MoE ModularKernel : Introduce TopKWeightAndReduce ( #20648 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-10 14:40:38 -07:00
574ad60db9
[KVConnector] Always call connector clear_metadata() at end of step ( #20756 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: David Ben-David <sdavidbd@gmail.com >
2025-07-10 22:37:27 +01:00
fdadb6f43a
[Bugfix] Fused MoE Modular Kernel chunking loop ( #20392 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-10 20:31:10 +00:00
41060c6e08
[Core] Add Support for Default Modality Specific LoRAs [generate / chat completions] ( #19126 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-07-10 21:09:37 +01:00
3de2ed767f
[Bugfix] Remove assertion of expert_map being None ( #20714 )
...
Signed-off-by: Ming Yang <yming@meta.com >
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-07-10 19:55:22 +00:00
299252ea82
[CI] Fix pre commit issue ( #20782 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-10 12:48:13 -07:00
d6902ce79f
[V0][V1][Core] Add outlines integration for V1, and update V0 integration. ( #15975 )
...
Signed-off-by: Nathan Hoos <thwackyy.y@gmail.com >
2025-07-10 15:30:26 -04:00
5e53c89a74
[Bugfix] [CI] Fix Tensorizer LoRA test ( #20760 )
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com >
2025-07-10 19:07:06 +00:00
c66e38ea4c
[Test] Remove docker build from test. ( #20542 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-10 11:21:58 -07:00
251595368f
Fix DeepSeek-R1-0528 chat template ( #20717 )
...
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
Co-authored-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
2025-07-10 17:47:36 +00:00
4bed167768
[Model][VLM] Support JinaVL Reranker ( #20260 )
...
Signed-off-by: shineran96 <shinewang96@gmail.com >
2025-07-10 10:43:43 -07:00
b140416abf
[Model] Add reason parser for Hunyuan A13B Model. ( #20625 )
...
Signed-off-by: Asher Zhang <asherszhang@tencent.com >
2025-07-10 16:33:26 +00:00
5b8366b61a
[ROCm][Regression] Remove tensor creation that harms performance on ROCm ( #20741 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-10 09:22:23 -07:00
c7753a9809
[Hardware][CPU] Vllm int8 quantization enablement for ARM CPU ( #14129 )
...
Signed-off-by: nishith-fujitsu <nishith.jaiswal@fujitsu.com >
2025-07-10 15:59:04 +00:00
4b9a9435bb
Update Dockerfile FlashInfer to v0.2.8rc1 ( #20718 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-10 08:09:02 -07:00
3482fd7e4e
[Doc] Add engine args back in to the docs ( #20674 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-10 08:02:40 -07:00
77f77a951e
[Misc] Clean up mark to fork process in BNB tests ( #20692 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-10 13:59:40 +00:00
1a4f35e2ea
Normalize lm-eval command between baseline and correctness test ( #18560 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-10 13:27:32 +00:00
be1e128dfb
[CI Bugfix] Skip failing Tensorizer+LoRA test ( #20724 )
2025-07-10 21:15:03 +09:00
65393ee064
[doc] fix ordered list ( #20749 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-10 03:13:52 -07:00
dc221ad72d
[Bugfix][Build][Non-CUDA] Only referencing CMAKE_CUDA_COMPILER_VERSION on CUDA where it is defined ( #20738 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-10 02:58:11 -07:00
7571a4a7e5
[CI/Build] Fix Basic Models Test ( #20728 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-10 09:57:19 +00:00
f67d986dd1
[Misc] loose new-model tagger conditions ( #20747 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-10 02:54:47 -07:00
cc876d0f29
[KVConnector] Aggregate finished requests on the scheduler ( #19555 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-07-10 09:22:18 +01:00
fdfd409f8f
[TPU][Core]Make load weight exceed hbm error more instructive for customers ( #20644 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-07-10 07:01:17 +00:00
ffbcc9e757
[BugFix] Fix VllmConfig() construction on all platforms ( #20695 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-10 07:00:20 +00:00
59389c927b
[BugFix][CPU] Fix CPU worker dependency on cumem_allocator ( #20696 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-10 14:24:20 +08:00
8f2720def9
[Frontend] Support Tool Calling with both tool_choice='required' and $defs. ( #20629 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-10 13:56:35 +08:00
ad6c2e1a0b
Correct PPMissingLayer handling in Deepseek-V2-Lite PP deployment ( #20665 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-07-09 20:34:40 -07:00
49e8c7ea25
Use NVCC --compress-mode to reduce binary size by 30% ( #20694 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-09 18:26:48 -07:00
805d62ca88
[Misc] DP : Add ExpertTokensMetadata ( #20332 )
...
Signed-off-by: Varun <vsundarr@redhat.com >
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun <vsundarr@redhat.com >
2025-07-10 00:33:14 +00:00
b7d9e9416f
[CI/Build] Fix FlashInfer double build in Dockerfile ( #20651 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-09 17:41:56 -06:00
7c12a765aa
[Misc] Simplify the prefix caching logic on draft tokens ( #20701 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-09 14:48:35 -07:00
cd587c93ef
[BugFix]: Properly set engine_id when using multi connector ( #19487 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: leiyiming <leiyiming@kingsoft.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-07-09 20:32:44 +00:00
332d4cb17b
[Feature][Quantization] MXFP4 support for MOE models ( #17888 )
...
Signed-off-by: Felix Marty <felmarty@amd.com >
Signed-off-by: Bowen Bao <bowenbao@amd.com >
Signed-off-by: Felix Marty <Felix.Marty@amd.com >
Co-authored-by: Bowen Bao <bowenbao@amd.com >
2025-07-09 13:19:02 -07:00
bf03ff3575
[Kernel] Add Conch backend for mixed-precision linear layer ( #19818 )
...
Signed-off-by: Jacob Manning <jmanning+oss@stackav.com >
2025-07-09 13:17:55 -07:00
47043eb678
[Kernel] Triton implementation of causal-conv1d for Mamba-based models ( #18218 )
...
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com >
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-07-09 12:53:55 -07:00
31b96d1c64
Support Llama 4 for cutlass_moe_fp4 ( #20453 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-09 15:53:38 -04:00
e59ba9e142
[CI/Build] Enlarge tolerance for a CPU multi-modal test ( #20684 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-09 17:48:52 +00:00
403b481573
Remove heading from installation inc.md file ( #20697 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-09 10:42:51 -07:00
138709f8d1
[Doc] Update CPU doc ( #20676 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-09 10:28:30 -07:00
0bbac1c1b4
[Bench] Add NVFP4 GEMM benchmark script ( #20578 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-09 13:23:48 -04:00
a3e4e85ece
[XPU][CI] enhance xpu test support ( #20652 )
...
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com >
Co-authored-by: zhenwei-intel <zhenweiliu@habana.ai >
2025-07-09 16:53:09 +00:00
eb58f5953d
[TPU][Bugfix] fix test_pallas ( #20666 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-09 09:32:48 -07:00
4ac9c33f78
[Bugfix] Fix handling of Tensorizer arguments for LoadConfig ( #20643 )
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com >
2025-07-09 15:36:37 +00:00
efe73d0575
[doc] update doc format ( #20673 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-09 08:08:19 -07:00
853487bc1b
[Docs] Improve docs for RLHF co-location example ( #20599 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-09 08:06:43 -07:00
9ff2af6d2b
[Benchmark] Parameterization of streaming loading of multimodal datasets ( #20528 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-07-09 13:35:16 +00:00
70ca5484f5
[Doc] Update notes ( #20668 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-09 03:46:36 -07:00
5358cce5ff
[V1] [Doc] Update V1 docs for Mamba models ( #20499 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-09 01:02:41 -07:00
2155e95ef1
[Bugfix] Fix the issue where reasoning_content is None when Thinking is enabled and tool_choice is set to 'required'. ( #20662 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-09 07:39:58 +00:00
f95570a52d
[Docs] fix minimax tool_calling docs error ( #20667 )
...
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-07-09 00:37:07 -07:00
b6e7e3d58f
[Intel GPU] support ray as distributed executor backend for XPU. ( #20659 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-07-09 00:36:58 -07:00
e760fcef22
[XPU] Use spawn with XPU multiprocessing ( #20649 )
...
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com >
2025-07-09 00:34:28 -07:00
6bbf1795b7
[Misc] Fix the size of batched_dummy_mm_inputs in profile_run ( #20434 )
...
Signed-off-by: bk-201 <joy25810@foxmail.com >
2025-07-08 20:15:44 -07:00
9e0ef888f0
Fix bullets in incremental_build.md ( #20642 )
2025-07-09 11:03:41 +08:00
97abeb1daa
[feat] enable SM100 CUTLASS block scaled group gemm for smaller batch sizes ( #20640 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
2025-07-09 11:03:35 +08:00
34dad19e7b
[Bugfix] set default set cuda_graph_sizes to min(self.max_num_seqs * 2, 512) ( #20628 )
...
Signed-off-by: izhuhaoran <izhuhaoran@qq.com >
2025-07-09 11:02:51 +08:00
6db31e7a27
[Hardware][PPC64LE] Enable V1 for ppc64le and ARM ( #20554 )
...
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com >
2025-07-08 20:00:41 -07:00
977180c912
[Docs] Improve documentation for multi-node service helper script ( #20600 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-08 19:44:26 -07:00
c40784c794
[BugFix][Intel GPU] Use refactored API for dist_backend in V1 worker ( #20596 )
...
Signed-off-by: ratnampa <ratnam.parikh@intel.com >
2025-07-08 19:44:23 -07:00
baed180aa0
[tech debt] Revisit lora request model checker ( #20636 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-07-09 09:42:41 +08:00
0b407479ef
[misc]refactor Platform.set_device method ( #20262 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-07-09 01:39:47 +00:00
5eaf570050
Replace multiply_add with homogeneous_multiply_add to Address Clang Template Parameter Issue ( #20142 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-09 00:30:18 +00:00
d8ee5a2ca4
[TPU][Bugfix] disable phi-3 test ( #20632 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-08 23:14:26 +00:00
b9fca83256
[Bugfix] Fix GLM-4.1-V video prompt update ( #20635 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-08 23:13:58 +00:00
32dffc2772
[Core] Rename get_max_tokens_per_item for backward compatibility ( #20630 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-08 23:11:30 +00:00
c438183e99
[Bugfix] Fix topk_ids indices_type for CUTLASS w8a8 FP8 MoE ( #20166 )
...
Signed-off-by: Ming Yang <yming@meta.com >
2025-07-08 23:10:57 +00:00
baba0389f7
[CI] Increase the threshold of the MTEB RERANK tests ( #20615 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-08 08:10:11 -07:00
c6c22f16d3
Revert invalid spellchecker fix on deepseek_vl2 ( #20618 )
2025-07-08 15:07:14 +00:00
dd382e0fe3
[Model] Implement missing get_language_model for Keye-VL ( #20631 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-08 07:47:46 -07:00
849590a2a7
Update torch/xla pin to 20250703 ( #20589 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-07-08 07:44:02 -07:00
a4c23314c0
[xpu]feat: support multi-lora on xpu ( #20616 )
...
Signed-off-by: yan <yan.ma@intel.com >
2025-07-08 22:07:10 +08:00
b942c094e3
Stop using title frontmatter and fix doc that can only be reached by search ( #20623 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-08 03:27:40 -07:00
b4bab81660
Remove unnecessary explicit title anchors and use relative links instead ( #20620 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-08 02:49:13 -07:00
b91cb3fa5c
[Docs] Improve documentation for Deepseek R1 on Ray Serve LLM ( #20601 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-08 02:09:06 -07:00
71d1d75b7a
[PD][Nixl] Remote consumer READ timeout for clearing request blocks ( #20139 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-08 08:56:40 +01:00
72d14d0eed
[Frontend] [Core] Integrate Tensorizer in to S3 loading machinery, allow passing arbitrary arguments during save/load ( #19619 )
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com >
Co-authored-by: Eta <esyra@coreweave.com >
2025-07-07 22:47:43 -07:00
e34d130c16
[TPU] Temporary fix vmem oom for long model len by reducing page size ( #20278 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-07-08 05:16:16 +00:00
7721ef1786
[CI/Build][CPU] Fix CPU CI and remove all CPU V0 files ( #20560 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-07 22:13:44 -07:00
8369b7c2a9
[Misc] improve error msg ( #20604 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-07 21:45:18 -07:00
3eb4ad53f3
[Docs] Add Anyscale to frameworks ( #20590 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-07 20:09:13 -07:00
90a2769f20
[Docs] Add Ray Serve LLM section to openai compatible server guide ( #20595 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-07 20:08:05 -07:00
e60d422f19
[Docs] Improve docstring for ray data llm example ( #20597 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-07 20:06:26 -07:00
0d914c81a2
[Docs] Rewrite offline inference guide ( #20594 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-07 20:06:02 -07:00
6e428cdd7a
[Doc] Syntax highlight request responses as JSON instead of bash ( #20582 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 20:02:45 -07:00
93b9d9f499
[Bugfix]: Fix messy code when using logprobs ( #19209 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-08 11:02:15 +08:00
af107d5a0e
Make distinct code and console admonitions so readers are less likely to miss them ( #20585 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 19:55:28 -07:00
31c5d0a1b7
[Optimize] Don't send token ids when kv connector is not used ( #20586 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-07 19:04:54 -07:00
afb7cff1b9
[Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe ( #20167 )
...
Signed-off-by: Ming Yang <yming@meta.com >
2025-07-08 01:07:22 +00:00
d2e841a10a
[Misc] Improve logging for dynamic shape cache compilation ( #20573 )
...
Signed-off-by: kyolebu <kyu@redhat.com >
2025-07-08 00:48:09 +00:00
14601f5fba
[Config] Refactor mistral configs ( #20570 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-07-07 15:25:10 -07:00
042d131f39
Fix links in multi-modal model contributing page ( #18615 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 21:13:52 +00:00
8e807cdfa4
[Misc] feat output content in stream response ( #19608 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-07 20:45:10 +00:00
e601efcb10
[Misc] Add fully interleaved support for multimodal 'string' content format ( #14047 )
...
Signed-off-by: drobyshev.anton <drobyshev.anton@wb.ru >
Co-authored-by: drobyshev.anton <drobyshev.anton@wb.ru >
2025-07-07 19:43:08 +00:00
22dd9c2730
[Kernel] Optimize Prefill Attention in Unified Triton Attention Kernel ( #20308 )
...
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com >
2025-07-07 19:08:12 +00:00
a6d795d593
[DP] Copy environment variables to Ray DPEngineCoreActors ( #20344 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-07 10:14:22 -07:00
a37d75bbec
[Front-end] microbatch tokenization ( #19334 )
...
Signed-off-by: zt2370 <ztang2370@gmail.com >
2025-07-07 17:54:10 +01:00
edd270bc78
[Bugfix] Prevent IndexError for cached requests when pipeline parallelism is disabled ( #20486 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-07-07 09:41:15 -07:00
110df74332
[Model][Last/4] Automatic conversion of CrossEncoding model ( #19675 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-07 14:46:04 +00:00
1ad69e8375
[Doc] Fix some MkDocs snippets used in the installation docs ( #20572 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 07:44:34 -07:00
b8a498c9b2
[Doc] Add outline for content tabs ( #20571 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 07:43:26 -07:00
923147b5e8
[Doc] Fix internal links so they don't always point to latest ( #20563 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 04:15:50 -07:00
45877ef740
[Doc] Use gh-pr and gh-issue everywhere we can in the docs ( #20564 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 03:54:22 -07:00
6e4bef1bea
[Doc] Remove extra whitespace from CI failures doc ( #20565 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 03:35:47 -07:00
4ff79a136e
[Misc] Set the minimum openai version ( #20539 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-07 09:15:26 +00:00
448acad31e
[Misc] remove unused jinaai_serving_reranking ( #18878 )
...
Signed-off-by: Abirdcfly <fp544037857@gmail.com >
2025-07-07 09:14:12 +00:00
eb0b2d2f08
[Docs] Clean up tables in supported_models.md ( #20552 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-07 01:46:31 -07:00
3112271f6e
[XPU] log clean up for XPU platform ( #20553 )
...
Signed-off-by: yan <yan.ma@intel.com >
2025-07-07 01:38:22 -07:00
1fd471e957
Add docstrings to url_schemes.py to improve readability ( #20545 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-07 08:31:49 +00:00
2c5ebec064
[XPU][CI] add v1/core test in xpu hardware ci ( #20537 )
...
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com >
2025-07-07 01:16:40 -07:00
2e610deb72
[CI/Build] Enable phi2 lora test ( #20540 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-07 05:10:41 +00:00
6e2c19ce22
[Refactor]Abstract Platform Interface for Distributed Backend and Add xccl Support for Intel XPU ( #19410 )
...
Signed-off-by: dbyoung18 <yang5.yang@intel.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2025-07-07 04:32:32 +00:00
47db8c2c15
[Misc] add a tip for pre-commit ( #20536 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-06 19:42:06 -07:00
462b269280
Implement OpenAI Responses API [1/N] ( #20504 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-06 18:32:13 -07:00
c18b3b8e8b
[Bugfix] Add use_cross_encoder flag to use correct activation in ClassifierPooler ( #20527 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-06 14:01:48 -07:00
9528e3a05e
[BugFix][Spec Decode] Fix spec token ids in model runner ( #20530 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-06 19:44:52 +00:00
9fb52e523a
[V1] Support any head size for FlexAttention backend ( #20467 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-06 09:54:36 -07:00
e202dd2736
[V0 deprecation] Remove V0 CPU/XPU/TPU backends ( #20412 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-07-06 08:48:13 -07:00
43813e6361
[Misc] call the pre-defined func ( #20518 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-06 10:25:29 +00:00
cede942b87
[Benchmark] Add support for multiple batch size benchmark through CLI in benchmark_moe.py ( #20516 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-07-06 09:20:11 +00:00
fe1e924811
[Frontend] Support image object in llm.chat ( #19635 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Signed-off-by: Flora Feng <4florafeng@gmail.com >
2025-07-06 06:47:13 +00:00
4548c03c50
[TPU][Bugfix] fix the MoE OOM issue ( #20339 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-05 21:19:09 -07:00
40b86aa05e
[BugFix] Fix: ImportError when building on hopper systems ( #20513 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-06 12:17:30 +08:00
432870829d
[Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe ( #20509 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-07-06 12:08:30 +08:00
f73d02aadc
[BUG] Fix #20484. Support empty sequence in cuda penalty kernel ( #20491 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai >
2025-07-05 19:38:02 -07:00
c5ebe040ac
test_attention compat with coming xformers change ( #20487 )
...
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-05 19:37:59 -07:00
8d763cb891
[Misc] remove unused import ( #20517 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-05 19:17:06 -07:00
cf4cd53982
[Misc] Add logger.exception for TPU information collection failures ( #20510 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-05 07:24:32 -07:00
32c9be2200
[v1] Re-add fp32 support to v1 engine through FlexAttention ( #19754 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-05 09:41:10 +00:00
8aeaa910a2
Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod ( #20507 )
...
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-07-05 14:03:20 +08:00
906e05d840
[Misc] Remove the unused LoRA test code ( #20494 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-05 13:48:16 +08:00
ef9a2990ae
[doc] small fix ( #20506 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-04 20:56:39 -07:00
7e90870491
[Misc] Add security warning for development mode endpoints ( #20508 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-04 20:52:13 -07:00
d3f05c9248
[Doc] fix multimodal_inputs.md gh examples link ( #20497 )
...
Signed-off-by: Guy Stone <guys@spotify.com >
2025-07-04 16:41:35 -07:00
c108781c85
[CI Bugfix] Fix pre-commit failures on main ( #20502 )
2025-07-04 14:17:30 -07:00
3d184b95b8
[feat]: CUTLASS block scaled group gemm for SM100 ( #19757 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
Co-authored-by: Duncan Moss <dmoss@nvidia.com >
2025-07-04 12:58:04 -06:00
2f35a022e6
Enable V1 for Hybrid SSM/Attention Models ( #20016 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-07-04 17:46:53 +00:00
ffe00ef77a
[Misc] Small: Remove global media connector. Each test should have its own test connector object. ( #20395 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-07-04 08:15:03 -07:00
5561681d04
[CI] add kvcache-connector dependency definition and add into CI build ( #18193 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-07-04 06:49:18 -07:00
fbd62d8750
[Doc] Fix classification table in list of supported models ( #20489 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-04 06:08:02 -07:00
2e26f9156a
[Model][3/N] Automatic conversion of CrossEncoding model ( #20168 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-04 05:47:39 -07:00
9e5452ee34
[Bug][Frontend] Fix structure of transcription's decoder_prompt ( #18809 )
...
Signed-off-by: sangbumlikeagod <oironese@naver.com >
2025-07-04 11:28:07 +00:00
0e3fe896e2
Support Llama 4 for fused_marlin_moe ( #20457 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-04 07:55:10 +00:00
1caca5a589
[Misc] Add SPDX-FileCopyrightText ( #20428 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-04 07:40:42 +00:00
783921d889
[Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels ( #20331 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-04 15:06:24 +08:00
4a98edff1f
[Structured Outputs][V1] Skipping with models doesn't contain tokenizers ( #20365 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-07-04 15:05:49 +08:00
a7bab0c9e5
[Misc] small update ( #20462 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-03 20:33:44 -07:00
25950dca9b
Add ignore consolidated file in mistral example code ( #20420 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-07-04 02:55:07 +00:00
a4113b035c
[Platform] Add custom default max tokens ( #18557 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com >
2025-07-04 10:50:17 +08:00
7e1665b089
[Misc] Change warn_for_unimplemented_methods to debug ( #20455 )
2025-07-04 02:35:08 +00:00
8d1096e7db
[Bugfix] Register reducer even if transformers_modules not available ( #19510 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-07-03 22:08:12 +00:00
8d775dd30a
[Misc] Fix Unable to detect current VLLM config. Defaulting to NHD kv cache layout warning ( #20400 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-03 14:56:09 -07:00
78fe77534b
[Kernel] Enable fp8 support for pplx and BatchedTritonExperts. ( #18864 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-07-03 14:55:40 -07:00
2f2fcb31b8
[Misc] Remove _maybe_ignore_quant_config from GLM4.1v ( #20432 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-07-03 21:41:13 +00:00
1dba2c4ebe
[Misc] adjust for ipv6 for mooncake url parse ( #20107 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-03 20:27:17 +00:00
71d6de3a26
[Misc] Clean up InternVL family config registration ( #19992 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-03 20:01:47 +00:00
536fd33003
[CI] Trimming some failing test groups from AMDPRODUCTION. ( #20390 )
2025-07-03 08:21:31 -07:00
619b9f5c7e
[Frontend] fix duplicate output for bench subcmd ( #20446 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-03 08:02:06 -07:00
d1b689c445
[Bugfix] Fix flaky test_streaming_response test ( #20363 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-03 14:46:24 +00:00
9854dc9040
[Frontend] improve vllm bench <bench_type> --help display ( #20430 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-03 14:22:16 +00:00
ff5c60fad8
[Misc] Automatically tag PRs to add new models ( #20222 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-03 07:11:03 -07:00
6f1229f91d
[Model][2/N] Automatic conversion of CrossEncoding model ( #19978 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-03 13:59:23 +00:00
1819fbda63
[Quantization] Bump to use latest bitsandbytes ( #20424 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-03 21:58:46 +08:00
7f0367109e
[CI/Build][CPU] Enable cross compilation in CPU release pipeline ( #20423 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-03 05:26:12 -07:00
fb14d53cf6
[Kernel] refactor cpu worker v0 cache dtype ( #20080 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-03 08:39:14 +00:00
b024a42e93
[Core] Move multimodal placeholder from chat utils to model definition ( #20355 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-03 08:18:30 +00:00
cb97f2bfc5
[Docs] Replace two list with tables in intel_gaudi.md ( #20414 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-03 00:48:25 -07:00
359200f6ac
[doc] fix link ( #20417 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-03 00:21:57 -07:00
220aee902a
[Misc] Add rules to label Speculative Decoding Related PRs ( #20406 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2025-07-02 23:56:49 -07:00
67d25eca05
[Tests] Update online DP tests to verify that requests are balanced ( #20157 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-03 14:49:13 +08:00
363528de27
[Feature] Support MiniMax-M1 function calls features ( #20297 )
...
Signed-off-by: QscQ <qscqesze@gmail.com >
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-07-03 06:48:27 +00:00
4ff61ababa
[TPU] Add a case to cover RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 ( #20385 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-03 06:46:41 +00:00
0ec3779df7
[Bugfix][CI/CD][CPU] Fix CPU CI tests ( #20383 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-02 20:11:36 -07:00
b616f6a53d
[Misc] Small: Fix video loader return type annotations. ( #20389 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-07-03 03:10:39 +00:00
2e25bb12a8
[Bugfix] Fix import of CutlassExpertsFp8 in compressed_tensors_moe.py ( #20381 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-07-03 02:07:43 +00:00
9965c47d0d
Enable CPU nightly performance benchmark and its Markdown report ( #18444 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-07-02 17:50:25 -07:00
059d4cdb49
[BugFix] Fix DP headless mode arg validation ( #20398 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-02 17:15:32 -07:00
bdb84e26b0
[Bugfix] Fixes for FlashInfer's TORCH_CUDA_ARCH_LIST ( #20136 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
2025-07-02 17:15:11 -07:00
3dd359147d
[Docs] Update EAGLE example ( #20375 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-02 17:13:51 -07:00
657f2f301a
[DP] Support external DP Load Balancer mode ( #19790 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-02 10:21:52 -07:00
a1aafc827a
[ROCm][FEAT] Enable Full Graph Mode in AITER MLA V1 Attn Backend (Decode Phase only) ( #20254 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-07-02 16:25:46 +00:00
139508a418
[Misc] add handler HF_TOKEN is empty string ( #20369 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-02 09:14:31 -07:00
d265414dbc
[Minor] Clean up incorrect comment in test ( #20382 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-02 09:13:37 -07:00
48fb076cbc
[V1] LogitsProcessor programming model ( #16728 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-07-02 09:10:42 -07:00
c1909e7e8c
[Kernels] MoE refactor ( #19636 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Co-authored-by: ElizaWszola <ewszola@redhat.com >
2025-07-02 06:08:27 -07:00
b95877509b
Documentation update tool_calling: mapping back to function from response ( #20373 )
2025-07-02 05:55:49 -07:00
706ff13224
[Model] Adds support for SlimMoE models Phi-tiny-MoE-instruct ( #20286 )
...
Signed-off-by: Zichong Li <t-lizichong@microsoft.com @Reasoning-H100-VM3.drbuo4tcjzruhloch3eo0b25ef.cx.internal.cloudapp.net>
Co-authored-by: Zichong Li <t-lizichong@microsoft.com @Reasoning-H100-VM3.drbuo4tcjzruhloch3eo0b25ef.cx.internal.cloudapp.net>
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-07-02 12:54:12 +00:00
ccbfb1d1c9
[Bugfix] Fix the max_seq_len limit of 16384 for DeepSeek models ( #20322 )
...
Signed-off-by: Wang Huaqiang <huaqiang.wang@intel.com >
2025-07-02 12:53:36 +00:00
9e5552aa13
[NVIDIA] Support Cutlass w8a8 FP8 for Blackwell Geforce GPUs (sm120) ( #17280 )
...
Signed-off-by: kaln27 <liaojuncheng123@foxmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-02 06:47:19 -06:00
0c600b9ab6
[Build/CI] Automatically tag DeepSeek related PRs ( #20370 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-02 04:02:43 -07:00
e303dcf523
[Model] Add Ernie4.5 and Ernie4.5MoE Model Support ( #20220 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2025-07-02 03:37:01 -07:00
ae9c4d416f
[Docs] Make TPU ref prettier in google_tpu.md ( #20356 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-02 02:04:08 -07:00
d853520b3e
[Docs] Fix indentations for 2-level items in deprecation_policy.md ( #20352 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-01 23:50:31 -07:00
ba51aea65e
[Bugfix] Keye-VL compatibility with tok_kwargs ( #20058 ) ( #20353 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-01 23:46:59 -07:00
8452946c06
[Model][VLM] Support Keye-VL-8B-Preview ( #20126 )
...
Signed-off-by: Kwai-Keye <Keye@kuaishou.com >
2025-07-01 23:35:04 -07:00
2e7cbf2d7d
[Frontend] Support configurable mm placeholder strings & flexible video sampling policies via CLI flags. ( #20105 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-07-01 23:34:03 -07:00
7da296be04
[TPU] kv cache update kernel supports dynamic grid ( #20235 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-02 06:33:37 +00:00
b205e8467d
[Doc][TPU] Add models and features supporting matrix. ( #20230 )
...
Signed-off-by: Qiliang Cui <cuiq@google.com >
2025-07-02 06:33:20 +00:00
be0cfb2b68
fix[Docs]: link anchor is incorrect #20309 ( #20315 )
...
Signed-off-by: zxw <1020938856@qq.com >
2025-07-02 06:32:34 +00:00
1a03dd496b
[Bugfix] Fix dynamic rotary embedding ( #20343 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-02 06:31:26 +00:00
27b8017636
[FIX][Intel GPU]fix ipex flash_attn_varlen_func api missing parameter ( #20348 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-07-01 22:26:40 -07:00
9ec1e3065a
[Misc][Doc] Add missing comment for LLM ( #20285 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2025-07-01 19:04:24 -07:00
9dae7d46bf
[Refactor] Remove Unused Env VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON ( #20334 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-01 19:03:43 -07:00
7058d7dd5d
[Refactor] Remove duplicate find_free_port ( #20333 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-01 19:03:07 -07:00
a0389e0554
[UT][intel GPU] use current_platform instead of device hardcode in v1 tests ( #20169 )
...
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com >
2025-07-02 09:06:04 +08:00
3be8d312a2
[Kernel][Bugfix] Fixup some warnings in nvfp4_blockwise_moe when CUDA < 12.8 ( #20324 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-07-01 18:05:47 -07:00
3abfe22154
Enable group size 64 for Machete ( #20290 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-07-01 18:05:44 -07:00
e81fbefe8a
[Refactor] Refactor import utils ( #20269 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-01 18:05:42 -07:00
9290de5667
remove unused variables in marlin_template.h ( #20236 )
2025-07-02 00:51:52 +00:00
7f280d69c9
[Optimization] Cache sampled token ids in model runner ( #20291 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-01 11:01:31 -07:00
02cabff207
[V1] [ROCm] Enable EP with AITER Fused MoE ( #20270 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-07-01 16:48:30 +00:00
3d19d47d91
[Frontend] Expand tools even if tool_choice="none" ( #17177 )
...
Signed-off-by: okada shintarou <okada@preferred.jp >
2025-07-01 12:47:38 -04:00
8acb4badee
[CUDA graphs] Enable full cuda graphs with FA3 AoT scheduling ( #20301 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-01 09:07:36 -07:00
314af8617c
[Docs] Update transcriptions API to use openai client with stream=True ( #20271 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-01 15:47:13 +00:00
0e96cc9b7e
[Misc] Minor refactoring for scheduler ( #20299 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-01 07:55:32 -07:00
ecad851cbd
[Model]Add Tencent HunYuanMoEV1 Model Support ( #20114 )
...
Signed-off-by: aiyiwang <aiyiwang@tencent.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: quinnrong <quinnrong@tencent.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-01 07:28:13 -07:00
ed70f3c64f
Add GLM4.1V model (Draft) ( #19331 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-01 12:48:26 +00:00
650d5dbd04
[Misc] Minor refactor of NIXL background handshake ( #20068 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-01 12:40:14 +01:00
9025a9a705
[Quant] [Bugfix] Fix quantization config matching with hf_to_vllm_mapper ( #20046 )
2025-07-01 19:20:34 +09:00
c05596f1a3
[Perf] Validate @config in pre-commit instead of dynamically ( #20200 )
...
Signed-off-by: Lionel Villard <villard@us.ibm.com >
2025-07-01 05:10:28 -04:00
787b13389e
[doc] fix the incorrect logo in dark mode ( #20289 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-01 08:18:09 +00:00
96453cfa83
[BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine ( #19067 )
...
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
2025-07-01 16:12:19 +08:00
b1c1fe35a5
[Misc] remove redundant char ( #20287 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-07-01 15:33:22 +08:00
08d81f1014
[Bugfix] Fix deepep tests ( #20288 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-01 15:29:08 +08:00
6cc1e7d96d
[CPU] Update custom ops for the CPU backend ( #20255 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-01 07:25:03 +00:00
9909726d2a
Enable ZP Support for Machete ( #20268 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-07-01 07:12:20 +00:00
22e9d42040
[Misc] add xgrammar for arm64 ( #18359 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2025-07-01 07:02:20 +00:00
86debab54c
Fix numel() downcast in vllm/csrc/moe/moe_align_sum_kernels.cu +2 ( #17082 )
...
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-01 06:48:10 +00:00
be250bbc67
[V1] Only print cudagraph tqdm on rank 0 with is_global_first_rank ( #19516 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-01 06:02:09 +00:00
27949354fa
[Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference ( #18768 )
...
Signed-off-by: Alex Kogan <alex.kogan@oracle.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-07-01 05:44:38 +00:00
bd5038af07
[Doc] add config and troubleshooting guide for NCCL & GPUDirect RDMA ( #15897 )
...
Signed-off-by: Ernest Wong <chwong719@gmail.com >
2025-06-30 21:44:39 -07:00
a2f14dc8f9
[CI][Intel Gaudi][vllm-Plugin]Add CI for hpu-plugin-v1-test ( #20196 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2025-07-01 04:17:07 +00:00
92ee7baaf9
[Example] add one-click runnable example for P2P NCCL XpYd ( #20246 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-06-30 21:03:55 -07:00
7151f92241
[Misc] Fix spec decode example ( #20296 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-30 21:01:48 -07:00
e28533a16f
[Bugfix] Fix include prompt in stream response when echo=true ( #15233 )
...
Signed-off-by: Yuan Fang <yuanfang@alauda.io >
2025-07-01 01:30:14 +00:00
6d42ce8315
[CLI] Improve CLI arg parsing for -O/--compilation-config ( #20156 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-07-01 01:03:13 +00:00
ded1fb635b
[Bugfix][V1][P/D]Fix the issue of occasional garbled output for P2pNcclConnector ( #20263 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-06-30 16:45:14 -07:00
97d9524fe9
[Refactor] Remove useless pdb comment ( #20266 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-30 18:15:24 +00:00
d8cf819a9a
[Core] [Bugfix] [Multimodal] Fix multimodal profiling and generation for SFT/PTQed models ( #20058 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-06-30 17:26:49 +00:00
551ef1631a
[Unit Test] Add unit test for deep gemm ( #20090 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-30 10:26:42 -06:00
2863befce3
[Optimization] Use Shared CachedRequestData Instance Across All Requests ( #20232 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-30 09:07:50 -07:00
2965c99c86
[Spec Decode] Clean up spec decode example ( #20240 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-30 08:28:13 -07:00
2062c0723d
[Spec Decode] Refactor spec decoding into a separate function ( #20238 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-30 08:13:50 -07:00
1c50e100a9
[Bugfix] fix quark ptpc ( #20251 )
...
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com >
Co-authored-by: Haoyang Li <307790822@qq.com >
2025-06-30 22:24:50 +09:00
3ee56e26be
[Docs] Fix 1-2-3 list in v1/prefix_caching.md ( #20243 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-06-30 11:20:51 +00:00
8fe7fc8634
[Quantization] Improve BitsAndBytesModelLoader ( #20242 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-30 18:22:09 +08:00
e936e401de
[Bugfix] Fix processor initialization in transformers 4.53.0 ( #20244 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-30 10:16:16 +00:00
f5dfa07531
[Bugfix] Skip loading extra parameters for modelopt Qwen3 MoE model ( #19598 )
...
Signed-off-by: noiji <>
2025-06-30 18:21:56 +09:00
022c58b80f
[doc] Add Slack and Forum to the top navigation ( #20208 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-06-30 07:53:45 +00:00
19108ef311
[Misc] Fix import ( #20233 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-29 20:34:54 -07:00
5a52f389dd
[BUGFIX][DEEPSEEK][MODEL_LOAD] fix w13, w2 weight not initialized assert ( #20202 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2025-06-29 19:46:19 -07:00
65b1cbb138
[Model] support dots1 ( #18254 )
...
Signed-off-by: redmoe-moutain <agiredmoe@gmail.com >
2025-06-29 19:34:36 -07:00
6c9837a761
Fix cuda_archs_loose_intersection when handling sm_*a ( #20207 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-06-29 16:52:34 -07:00
6f2f53a82d
[Quantization] Add compressed-tensors NVFP4 MoE Support ( #19990 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2025-06-29 22:05:40 +00:00
7b1895e6ce
[CI Fix] Try fixing eagle e2e test OOM by reducing block allocation ( #20213 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-29 10:31:37 +08:00
4d36693687
[Refactor] Create a function util and cache the results for has_deepgemm, has_deepep, has_pplx ( #20187 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-28 22:06:38 +00:00
daec9dea6e
[Bugfix] Correct behavior of GraniteMoeHybrid for TensorParallel execution ( #20137 )
...
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com >
2025-06-28 08:16:41 -07:00
daceac57c7
[Frontend] Generalize v1/audio/transcriptions endpoint ( #20179 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-06-28 08:15:26 -07:00
8615d9776f
[CI/Build] Add new CI job to validate Hybrid Models for every PR ( #20147 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-06-27 23:00:25 -07:00
7b460c25f9
[BugFix] Fix the incorrect func name in the comments. (config.py) ( #20185 )
2025-06-27 22:51:16 -07:00
f719772281
[Bugfix] Properly reject requests with empty list guided_choice ( #20195 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-27 22:50:52 -07:00
d45417b804
fix ci issue distributed 4 gpu test ( #20204 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-27 22:50:00 -07:00
a29e62ea34
Fix num_token_padding support for static per-tensor scaled_fp8_quant ( #20188 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-27 22:48:13 -07:00
e53be6f00a
[Misc] Add type assertion of request_id for LLMEngine.add_request ( #19700 )
...
Signed-off-by: n2ptr <xuzhanchaomail@163.com >
2025-06-27 22:47:36 -07:00
c329ceca6d
[CI Fix] Pin tests/models/registry.py MiniMaxText01ForCausalLM to revision due to model changes ( #20199 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-28 13:43:06 +08:00
3c545c0c3b
[CI/Build] Allow hermetic builds ( #18064 )
...
Signed-off-by: Fabien Dupont <fdupont@redhat.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Fabien Dupont <fabiendupont@pm.me >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Elias Levy <eliaslevy@google.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-06-27 09:04:39 -07:00
e8c3bd2cd1
[Bugfix] Fix some narrowing conversion warnings ( #20141 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-06-27 09:01:28 -07:00
c6c983053d
[Bugfix] Mark 'hidden_states' as mutable in moe_forward registration. ( #20152 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-06-27 09:42:22 -06:00
aafabaa0d5
[Fix][torch.compile] Enable custom ops by default when Inductor off ( #20102 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-06-27 09:00:42 -06:00
94a55c7681
[Fix][ROCm] Remove unused variables to fix build error on GFX11/12 ( #19891 )
...
Signed-off-by: Hosang Yoon <hosang.yoon@amd.com >
2025-06-27 07:14:44 -07:00
aa0dc77ef5
[Perf] Improved perf for resolve_chat_template_content_format ( #20065 )
...
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@cerebras.net >
2025-06-27 09:16:41 +00:00
4ab3ac285e
[Bugfix] Fix flaky failure when getting DP ports ( #20151 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-27 15:30:53 +08:00
d1c956dc0f
Gemma3n (Text-only) ( #20134 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-06-27 07:16:26 +00:00
dec197e3e5
Quick Fix by adding conditional import for flash_attn_varlen_func in flash_attn ( #20143 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2025-06-27 05:48:13 +00:00
6e244ae091
[Perf][Frontend] eliminate api_key and x_request_id headers middleware overhead ( #19946 )
...
Signed-off-by: Yazan-Sharaya <yazan.sharaya.yes@gmail.com >
2025-06-27 00:44:14 -04:00
cd4cfee689
[Model][1/N] Automatic conversion of CrossEncoding model ( #20012 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-06-26 21:10:04 -07:00
e110930680
[Fix] Fix gemma CI test failing on main ( #20124 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-06-26 21:06:59 -07:00
8b64c895c0
[CI] Sync test dependency with test.in for torch nightly ( #19632 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Concurrensee <yida.wu@amd.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-26 20:55:25 -07:00
0740e29b66
[Feature] add quick all reduce ( #19744 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-06-26 20:54:24 -07:00
44d2e6af63
[Bugfix] Build moe_data for both sm100 and sm90 ( #20086 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-26 20:50:12 -07:00
2d7779f888
[Perf] SM100 FP8 GEMM Optimizations after cutlass_profiler ( #20071 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-06-26 20:50:09 -07:00
a57d57fa72
[Quantization] Bump to use latest compressed-tensors ( #20033 )
...
Signed-off-by: Dipika <dipikasikka1@gmail.com >
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com >
2025-06-26 20:50:06 -07:00
71799fd005
[CI Failure] Fix OOM with test_oot_registration_embedding ( #20144 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-27 11:21:04 +08:00
e9fd658a73
[Feature] Expert Parallelism Load Balancer (EPLB) ( #18343 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
2025-06-26 15:30:21 -07:00
07b8fae219
[Doc] correct LoRA capitalization ( #20135 )
...
Signed-off-by: kyolebu <kyu@redhat.com >
2025-06-26 15:22:12 -07:00
562308816c
[Refactor] Rename commnication utils ( #20091 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-26 22:19:32 +00:00
04e1642e32
[TPU] add kv cache update kernel ( #19928 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-06-26 10:01:37 -07:00
b69781f107
[Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention backend. ( #19560 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-06-26 09:27:18 -07:00
0bceac9810
Spam folks if config.py changes ( #20131 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-06-26 08:19:46 -07:00
34878a0b48
[Doc] Rename page titles ( #20130 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-26 08:18:49 -07:00
6393b03986
[Doc] Auto sign-off for VSCode ( #20132 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-26 08:18:36 -07:00
0907d507bf
[Doc] Automatically signed-off by PyCharm ( #20120 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-06-26 14:34:17 +00:00
c894c5dc1f
[Bug Fix] Fix address/port already in use error for deep_ep test ( #20094 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-26 22:33:13 +08:00
1f5d178e9c
Revert "[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine" ( #20128 )
2025-06-26 07:32:22 -07:00
27c065df50
[Bugfix][V1][ROCm] Fix AITER Flash Attention Backend (Fix API Break and Local Attention Logic: affecting Llama4) ( #19904 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-06-26 12:42:31 +00:00
84c260caeb
[Docs] Improve frameworks/helm.md ( #20113 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-06-26 10:41:51 +00:00
167aca45cb
[Misc] Use collapsible blocks for benchmark examples. ( #20017 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-26 03:35:16 -07:00
0567c8249f
[CPU] Fix torch version in x86 CPU backend ( #19258 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-06-26 03:34:47 -07:00
d188913d99
[Refactor] Remove unused library ( #20099 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-26 09:16:10 +00:00
1d7c29f5fe
[Doc] Update docs for New Model Implementation ( #20115 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-26 00:47:06 -07:00
65397e40f5
[Bugfix] Allow CUDA_VISIBLE_DEVICES='' in Platform.device_id_to_physical_device_id ( #18979 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-06-26 00:01:57 -07:00
9502c38138
[Benchmark][Bug] Fix multiple bugs in bench and add args to spec_decode offline ( #20083 )
2025-06-25 22:06:27 -07:00
2582683566
[PD] Skip tp_size exchange with rank0 ( #19413 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-06-25 20:04:39 -07:00
754b00edb3
[Bugfix] Fix Mistral tool-parser regex for nested JSON ( #20093 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-26 01:01:17 +00:00
296ce95d8e
[CI] Add SM120 to the Dockerfile ( #19794 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-25 16:23:56 -07:00
2d7620c3eb
[TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN ( #19919 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-06-25 15:51:02 -07:00
55c65ab495
[P/D] Avoid stranding blocks in P when aborted in D's waiting queue ( #19223 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-25 15:19:44 -07:00
2cc2069970
[TPU][Bugfix] fix kv cache padding ( #20048 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-06-25 21:24:10 +00:00
9f0608fc16
[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine ( #20062 )
...
Signed-off-by: izhuhaoran <izhuhaoran@qq.com >
2025-06-25 21:03:17 +00:00
4e0db57fff
Fix the path to the testing script. ( #20082 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-06-25 20:48:17 +00:00
c40692bf9a
[Misc] Add parallel state node_count function ( #20045 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-25 13:38:53 -07:00
4734704b30
[PD] let toy proxy handle /chat/completions ( #19730 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-06-25 15:17:45 -04:00
8b8c209e35
static_scaled_fp8_quant should not run when scale.numel is not 1 ( #20076 )
2025-06-25 15:08:03 -04:00
23a04e0895
[Fix] Support cls pooling in ModernBertPooler ( #20067 )
...
Signed-off-by: shengzhe.li <shengzhe.li@sbintuitions.co.jp >
2025-06-25 15:07:45 -04:00
02c97d9a92
[Quantization] Add compressed-tensors emulations support for NVFP4 ( #19879 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2025-06-25 14:28:19 -04:00
e795d723ed
[Frontend] Add /v1/audio/translations OpenAI API endpoint ( #19615 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-06-25 17:54:14 +00:00
8359f4c8d8
[V1][Speculative Decoding] Fix DeepSeek MTP ( #20022 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2025-06-25 08:41:02 -07:00
bf5181583f
[Doc] Guide for Incremental Compilation Workflow ( #19109 )
2025-06-25 22:06:46 +09:00
c53fec1fcb
[doc] add reference link for Intel XPU ( #20064 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-25 12:24:07 +00:00
0f9e7354f5
[BugFix] Fix full-cuda-graph illegal memory access in FA3 ( #20057 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-06-25 08:39:04 +00:00
ba7ba35cda
[Chore] debloat some initial logs ( #19438 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-06-25 06:36:22 +00:00
015fab8c2f
[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs. ( #19717 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-06-24 23:22:58 -07:00
f59fc60fb3
[Feat][CLI] enforce-include-usage ( #19695 )
...
Signed-off-by: Max Wittig <max.wittig@siemens.com >
2025-06-25 01:43:04 -04:00
879f69bed3
[Refactor] Remove duplicate ceil_div ( #20023 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-25 05:19:09 +00:00
7108934142
[Frontend] speed up import time of vllm.config ( #18036 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-06-25 00:41:11 -04:00
3443aaf8dd
Move to a faster base64 implementation ( #19984 )
...
Signed-off-by: h-avsha <avshalom.manevich@hcompany.ai >
2025-06-24 20:33:51 -07:00
2273ec322c
Revert "Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor" ( #20030 )
2025-06-25 11:23:29 +08:00
a6c4b87fbc
Revert "[Feature] Integrate new deepgemm ( #19820 )" ( #20049 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-24 19:45:22 -07:00
1afa9948f5
[Llama4] Update attn_temperature_tuning ( #19997 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-06-24 22:42:53 -04:00
0d06b533a0
cmake: Update vllm_flash_attn for vllm_kernels ( #20032 )
...
Signed-off-by: Eli Uriegas <eliuriegas@meta.com >
2025-06-24 22:44:10 +00:00
c01d1c5aba
use .dev for version comparison with pytorch nightly release ( #20031 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-06-24 21:52:16 +00:00
ead369845d
[Easy] Remove submodule added in #19463 ( #20039 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-06-24 13:23:15 -07:00
c6e3bba8e6
[Feature] Integrate new deepgemm ( #19820 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-24 12:51:56 -07:00
91f7d9d0b6
[P/D] Asynchronously do _nixl_handshake ( #19836 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-24 12:46:10 -07:00
8619e7158c
[BugFix] Fix multi-node offline data parallel ( #19937 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-24 12:45:20 -07:00
c635c5f744
[Misc][Benchmarking] Add variable request-rate ("ramp-up") to the benchmarking client. ( #19423 )
...
Signed-off-by: dtransposed <damian@damian-ml-machine.europe-west3-b .c.jetbrains-grazie.internal>
Co-authored-by: dtransposed <damian@damian-ml-machine.europe-west3-b .c.jetbrains-grazie.internal>
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-06-24 18:41:49 +00:00
a045b7e89a
[Perf] Improve/Fix-regression for FA3 in High QPS regimes ( #19463 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-06-24 13:09:01 -04:00
981eeca41a
[Fix][V1] Remove --scheduling-policy oracle ( #20010 )
...
Signed-off-by: amit <amit.man@gmail.com >
2025-06-24 09:52:15 -07:00
26d34eb67e
refactor example - qwen3_reranker ( #19847 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-24 14:03:20 +00:00
53da4cd397
[Bugfix][CPU] Fix InputBatch for pooling models in the CPU v1 ( #20014 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-06-24 13:20:04 +00:00
9a3b88328f
[PERF] Speedup of MRoPE prepare inputs ( #19939 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai >
2025-06-23 23:01:26 -07:00
3014c920da
add some examples for other benchmark scripts ( #19893 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-24 05:57:46 +00:00
0eed516951
[doc] Fix broken link in the installation for CPU ( #19980 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-06-24 12:04:11 +08:00
ee5ad8d2c5
[Misc][Tools][Benchmark] Add profile to autotune script ( #19711 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-06-24 00:59:41 +00:00
a738dbb2a1
Update test case parameter to have the throughput above 8.0 ( #19994 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-06-24 00:18:10 +00:00
33d5e29be9
[TPU] Fix tpu model runner test ( #19995 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-06-23 16:04:28 -07:00
4671ac6e2a
[Bugfix][Benchmark] Fix Marlin benchmark ( #19929 )
2025-06-24 07:25:12 +09:00
dd2ccf8dde
Feat Dynamic Quantization for MoE Layers in GPTQ Marlin Backend ( #19395 )
2025-06-24 07:23:28 +09:00
a3bc76e4b5
[CI/Build] Push latest tag for cpu and neuron docker image ( #19897 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-23 14:15:37 -07:00
e6327c9b3e
[Feature] Support sequence parallelism for static fp8 quantization ( #19181 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-06-23 16:09:02 -04:00
d0132f025d
[Misc] Add type alias ReqId and EngineId for better readability ( #19880 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-06-23 12:57:57 -07:00
61f4fc5dc6
[Bugfix][v1] Fix step pooler implementation and step pooling usage in v1 ( #19956 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-23 18:38:06 +00:00
68aaeb3749
[EP+DP] Optimize the little operations in the DeepGEMM + DeepEP low latency case ( #19885 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-06-23 11:07:47 -07:00
c3649e4fee
[Docs] Fix syntax highlighting of shell commands ( #19870 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-06-23 17:59:09 +00:00
53243e5c42
[doc] improve readability for long commands ( #19920 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-23 14:27:07 +00:00
a6e6604d32
[Bugfix] Fix CI bitsandbytes failure ( #19969 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-23 21:30:55 +08:00
b82e0f82cb
[doc] use MkDocs collapsible blocks - supplement ( #19973 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-23 10:54:16 +00:00
5111642a6f
[Doc] Update V1 status for decoder-only embedding models ( #19952 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-23 09:31:06 +00:00
1bcd15edc7
[BugFix][P/D] Fix for cases where _recving_transfers can be cleaned up when *all* transfer done ( #19874 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-06-22 22:41:53 -07:00
2ebff5b77c
[P/D][NixlConnector] Support tp_size > num_kv_heads deployments ( #19691 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-22 22:41:50 -07:00
f17aec0d63
[doc] Fold long code blocks to improve readability ( #19926 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-23 05:24:23 +00:00
493c275352
Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor ( #19643 )
...
Signed-off-by: Vensenmu <vensenmu@gmail.com >
2025-06-23 03:40:28 +00:00
f39ab2d4bd
[Misc] Configurable timeout for execute_model RPC calls via env var ( #19544 )
...
Signed-off-by: jinqinn <goodqinjin@163.com >
2025-06-22 20:36:26 -07:00
4a0f7888a3
[Core] feat: Implement Priority Scheduling in V1 Engine ( #19057 )
...
Signed-off-by: amit <amit.man@gmail.com >
Co-authored-by: Roger Wang <Rogerw0108@gmail.com >
2025-06-22 20:18:08 -07:00
c4cf260677
[Perf][CLI] Improve overall startup time ( #19941 )
2025-06-22 23:11:22 +00:00
33d51f599e
[BugFix] Add an env to disable moe chunking to work around compile incompatibility ( #19642 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-06-22 15:17:49 -07:00
e91386cde1
[Chore] dedup logs ( #19955 )
2025-06-22 19:43:07 +00:00
2c11a29f0b
[Misc] Simplify vllm bench cli subcommand implementation ( #19948 )
2025-06-22 12:34:48 -04:00
c76a506bd6
[Misc] Update model-specific PR tagging ( #19949 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
2025-06-22 12:16:08 +00:00
ec0db6f51c
[doc] use snippets for contact us ( #19944 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-22 10:26:13 +00:00
c305a2109d
[CI/Build] Auto tag perf benchmarks related PRs ( #19943 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-22 08:46:21 +00:00
202c5df935
[Benchmark] fix request loss if "ping" is returned ( #19535 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-22 07:21:04 +00:00
2bb246b8f7
[MISC] add cpu_kvcache_space_bytes to CacheConfig ( #19812 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-22 13:39:09 +08:00
4c409cabc2
[Misc] add vllm_config in __init__ ( #19866 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-21 23:10:46 -04:00
3b1e4c6a23
[Docs] Add GPT2ForSequenceClassification to supported models in docs ( #19932 )
...
Signed-off-by: nie3e <adrcwiek@gmail.com >
2025-06-21 20:57:19 +00:00
2c5302fadd
[Multimodal] Optimize Qwen2/2.5-VL startup time ( #19756 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-06-21 20:01:07 +00:00
caa680fd2e
[doc] add contact us in community ( #19922 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-21 17:29:06 +00:00
c3bf9bad11
[New model support]Support Tarsier2 ( #19887 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-06-21 04:01:51 +00:00
6f170f11dd
[Bugfix] Fix bnb 8bit model weights loading ( #19917 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-21 03:29:09 +00:00
8ca81bb069
Fix: Check the type of params to be a Sequence not list. ( #19910 )
...
Signed-off-by: Rabin Adhikari <rabin.adk1@gmail.com >
2025-06-20 23:03:17 +00:00
e773a9e1c2
[Misc] Clean up useless code ( #19889 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-06-20 21:09:09 +00:00
71baf85ae1
[Kernel] mark TorchSDPABackend swap_blocks NotImplementedError ( #19749 )
2025-06-20 18:18:11 +00:00
79f2f1c2a1
[CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tests ( #19901 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-06-20 15:30:36 +00:00
2e3e3c86dc
Export NaNs in logits to scheduler_stats if output is corrupted ( #18777 )
...
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com >
2025-06-20 22:47:16 +08:00
7e8977fcd4
[custom_op][vllm-plugin] update custom_op class to use op_registry ( #19164 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2025-06-20 07:44:56 -07:00
f1e840e842
[Model] GPT2ForSequenceClassification model ( #19663 )
...
Signed-off-by: nie3e <adrcwiek@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-20 12:07:41 +00:00
7771d1de88
[Fix] import regex instead of re ( #19875 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-06-20 11:16:48 +00:00
71d1219545
[Kernel] correct cpu worker function parameter type ( #19745 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-20 10:50:13 +00:00
e384f2f108
[Misc] refactor example - openai_transcription_client ( #19851 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-20 08:02:21 +00:00
089a306f19
[Misc] update cuda version ( #19526 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-20 07:25:15 +00:00
5e666f72cd
[Bugfix][Ray] Set the cuda context eagerly in the ray worker ( #19583 )
2025-06-19 22:01:16 -07:00
e3a3e4db46
[Bugfix] Enable PP with AITER+V1 ( #19822 )
...
Signed-off-by: Qiang Li <qiang.li2@amd.com >
2025-06-20 12:43:20 +08:00
e41bf15cd0
[Chore]: qwen3-moe-type-hints-mistake ( #19860 )
...
Co-authored-by: xinnan.hou <hxn02029096@alibaba-inc.com >
2025-06-19 21:43:07 -07:00
5aa4a015ce
[Benchmark] Fix Value of type "SampleRequest" is not indexable ( #18032 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-06-19 21:28:55 -07:00
b6bad3d186
[CI][Neuron] Fail and exit on first error ( #19622 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-20 12:27:51 +08:00
ee9a1531aa
[CI/Build][Bugfix] Fix deadlock on v1 engine test CI ( #19872 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-20 09:51:07 +08:00
10d82f9ac5
[Benchmark][Bugfix] Fix Dataset Length Calculation ( #19868 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-06-19 18:30:41 -07:00
ea10dd9d9e
[Frontend] early return chat format resolution when specified ( #19735 )
2025-06-19 18:49:59 +00:00
ead2110297
[Core][Bugfix] Fix Online MM Beam Search ( #19688 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-06-19 17:18:07 +00:00
01220ce89a
[CI][CPU] Improve dummy Triton interfaces and fix the CPU CI ( #19838 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-06-19 15:46:09 +00:00
6f68c49220
[Doc] Update V1 user guide for embedding models ( #19842 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-19 09:43:27 +00:00
4719460644
Fixing Chunked Prefill Test. ( #19762 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-06-19 01:36:16 -07:00
466166dcfd
[Frontend] Add optional token-level progress bar to LLM.beam_search ( #19301 )
...
Signed-off-by: Ruosen Li <rxl190028@utdallas.edu >
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Ubuntu <ubuntu@ip-172-31-71-179.ec2.internal >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-19 03:21:41 -04:00
1d0ae26c85
Add xLAM tool parser support ( #17148 )
2025-06-19 14:26:41 +08:00
6021999573
[Minor] Allow redirecting model path for HfRunner in test ( #19795 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-18 23:04:10 -07:00
c7b370c603
raise exception for pin_lora ( #19809 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-18 22:57:35 -07:00
aa20d10a91
[Misc] [ROCm] Prevent surplus tensor reshape ( #19803 )
...
Signed-off-by: Zsolt Borbely <zsolt.borbely@htecgroup.com >
2025-06-19 13:57:16 +08:00
2de12be428
[ROCm] [AITER] [Bugfix] Patch for AITER commit 648764942e552a8bb5fe16026703716a81f05374 ( #18990 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-06-18 22:56:31 -07:00
83ca9ae47b
Mark invariant normalizer in Gemma as non-persistent ( #19788 )
...
Signed-off-by: Yu-Hang Tang <Tang.Maxin@gmail.com >
2025-06-18 22:56:03 -07:00
e2148dc5ea
[Bugfix] Add check_health to v1 async client. ( #19821 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-06-18 21:47:01 -07:00
b1098b4072
[Bugfix] Fix the linter ( #19826 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-18 21:44:41 -07:00
799397ee4f
Support embedding models in V1 ( #16188 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-18 21:36:33 -07:00
4959915089
[Quantization] Modify the logic of BNB double quantization ( #19742 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-19 03:52:09 +00:00
8d1e89d946
[Misc][ROCm] Enforce no unused variable in ROCm C++ files ( #19796 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-18 20:25:15 -07:00
36239f79dd
Fix FA2 fallback for Blackwell V1 ( #19781 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-19 09:53:55 +08:00
dfada85eee
[Frontend] Expose custom args in OpenAI APIs ( #16862 )
...
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-18 17:41:11 -07:00
ed33349738
[BugFix] Fix use_cudagraph=False ( #19612 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-06-19 08:23:12 +08:00
d49adea1f9
[Multimodal] Use fast processor for Qwen2/2.5-VL ( #19789 )
2025-06-18 15:49:40 -07:00
14fdd21d39
[Core] More fixes to MultiModalEmbeddings type handling ( #19715 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-06-18 22:48:29 +00:00
04fefe7c9a
[TPU] Update torch-xla version to include paged attention tuned block change ( #19813 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-06-18 22:41:13 +00:00
3b523e38d9
[Core] Do not copy array during hashing ( #19484 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-06-18 15:36:55 -07:00
16c16301c8
Disable "Forbid direct 'import triton'" check for vllm/triton_utils/importing.py in an extensible way ( #19783 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
2025-06-18 15:08:00 -07:00
9206d0ff01
docs: fix Slack bulletpoint in README ( #19811 )
...
Signed-off-by: Nathan Weinberg <nweinber@redhat.com >
2025-06-18 20:47:08 +00:00
a89209b78d
[v1] Support mamba2 ( #19327 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-18 20:34:15 +00:00
ffacb222cb
[Docs] Add Huzaifa Sidhpurwala to vuln mgmt team doc ( #19808 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-06-18 20:22:28 +00:00
12575cfa7a
[Bugfix] fix RAY_CGRAPH_get_timeout is not set successfully ( #19725 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-06-18 10:26:16 -07:00
8b6e1d639c
[Hardware][AMD] integrate aiter chunked prefill into vllm ( #18596 )
...
Signed-off-by: fsx950223 <fsx950223@outlook.com >
Signed-off-by: charlifu <charlifu@amd.com >
Co-authored-by: fsx950223 <fsx950223@outlook.com >
Co-authored-by: charlifu <charlifu@amd.com >
2025-06-18 08:46:51 -07:00
735a9de71f
[Qwen] Add tagging rule for Qwen related PRs ( #19799 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-18 14:26:43 +00:00
257ab95439
[Platform] Allow platform use V1 Engine by default ( #19792 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-06-18 13:03:36 +00:00
cca91a7a10
[doc] fix the incorrect label ( #19787 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-18 10:30:58 +00:00
f04d604567
[Minor] Zero-initialize attn output buffer ( #19784 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-18 06:59:27 +00:00
19a53b2783
[V1] Decouple GPU and TPU InputBatch ( #19778 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
2025-06-18 06:38:13 +00:00
eccdc8318c
[V1][P/D] An native implementation of xPyD based on P2P NCCL ( #18242 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-06-18 06:32:36 +00:00
5f52a84685
[V1] Add API docs for EncoderCacheManager ( #19294 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-06-18 13:37:01 +08:00
d4629dc43f
[Misc] Add __str__ for RequestStatus ( #19780 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-06-18 03:03:01 +00:00
6e9cc73f67
[MISC] correct DeviceConfig device field static type analysis ( #19699 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-17 17:21:50 -07:00
c53711bd63
[MISC] correct copy_blocks src_to_dists param type ( #19696 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-17 17:21:06 -07:00
dac8cc49f4
[TPU] Update torch version to include paged attention kernel change ( #19706 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-06-17 22:24:49 +00:00
a44b1c951d
[Feature][ROCm] Add full graph capture support for TritonAttentionBackend ( #19158 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-06-17 17:03:06 -04:00
b447624ee3
[Bugfix] Fix faulty triton importing logic when using Ray for DP ( #19734 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-17 20:59:29 +00:00
cda92307c1
[Misc] Update lmcache connector with the latest connector apis ( #19441 )
...
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn >
2025-06-17 19:57:54 +00:00
bf57ccc5c2
Remove sm120 arch from sm100 cutlass kernel arch list ( #19716 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-17 11:49:39 -07:00
ffb2cd6b54
[Perf] Optimize moe_align_block_size CUDA kernel ( #19572 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-06-17 11:49:26 -07:00
ca94d7fa00
[Bugfix] Update multimodel models mapping to fit new checkpoint after Transformers v4.52 ( #19151 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-17 15:58:38 +00:00
5a1c2e15d8
[Mis] remove duplicate engine status checks ( #19647 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-06-17 08:17:38 -07:00
4c8f64faa7
[V1][Kernel] Flashinfer HND KV cache layout ( #19280 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-06-17 09:09:22 -04:00
93aee29fdb
[doc] split "Other AI Accelerators" tabs ( #19708 )
2025-06-17 22:05:29 +09:00
154d063b9f
[doc][mkdocs] Add edit button to documentation ( #19637 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-17 11:10:31 +00:00
ccd7c05089
[Kernel] Add Split-KV Support to Unified Triton Attention Kernel ( #19152 )
...
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com >
2025-06-17 10:45:07 +00:00
c48c6c4008
Add a doc on how to update PyTorch version ( #19705 )
2025-06-17 18:10:37 +08:00
aed8468642
[Doc] Add missing llava family multi-image examples ( #19698 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-17 07:05:21 +00:00
5c76b9cdaf
[Core] add remove_seq_from_computed_blocks_tracker to BlockSpaceManager ( #19686 )
...
Signed-off-by: 刘全 <quan.liu2@dbappsecurity.com.cn >
Co-authored-by: 刘全 <quan.liu2@dbappsecurity.com.cn >
2025-06-17 04:40:58 +00:00
ddfed314f9
Fixes IMA for TP w/ flex-attention ( #19712 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-06-17 04:01:50 +00:00
5b3ad5ecf2
[DOC] fix doc typos ( #19600 )
...
Signed-off-by: Di Liu <liu-di@sjtu.edu.cn >
2025-06-17 11:34:53 +08:00
ede5c4ebdf
[Frontend] add chunking audio for > 30s audio ( #19597 )
...
Signed-off-by: nguyenhoangthuan99 <thuanhppro12@gmail.com >
2025-06-17 11:34:00 +08:00
07334959d8
[Wheel Size] Only build FA2 8.0+PTX ( #19336 )
2025-06-17 12:32:49 +09:00
119f683949
[doc] add project flag to gcloud TPU command ( #19664 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-06-17 01:00:09 +00:00
0860087aff
[Fix] Fall back to Gloo when NCCL backend is unavailable ( #19641 )
...
Signed-off-by: conroy-cheers <conroy@corncheese.org >
2025-06-17 08:42:14 +08:00
6bc7b57315
[Quantization] Remove FP4 emulation; Fall-back to marlin for device < 100 ( #19563 )
2025-06-16 17:33:51 -04:00
90f9c2eb5c
[V1] Change return type on get_multimodal_embeddings() ( #19446 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-06-16 13:32:15 -04:00
387bdf0ab9
[Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) ( #19677 )
...
Signed-off-by: QscQ <qscqesze@gmail.com >
2025-06-16 09:47:14 -07:00
5e5baa91aa
[Kernels] Use empty for modular MoE workspaces ( #19667 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-06-16 14:58:01 +00:00
836d4ce140
[Bugfix] fix missing 'finish_reason': null in streaming chat ( #19662 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-06-16 14:10:39 +00:00
c3fec47bb7
[MISC] bump huggingface_hub pkg to 0.33.0 ( #19547 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-16 05:22:28 -07:00
1173804dca
[Bugfix] Fix TP inference for Flex attention backend ( #19657 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-16 11:21:37 +00:00
4d5424029b
[Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. ( #19652 )
...
Signed-off-by: Shawn Tan <shawntan@ibm.com >
2025-06-16 11:14:18 +00:00
3e7506975c
[DOC] Add reasoning capability to vLLM streamlit code ( #19557 )
2025-06-16 07:09:12 -04:00
ee35e96ac3
[BugFix] Don't catch BaseException when dumping execute_model errors ( #19626 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-16 11:01:08 +00:00
dec66d253b
[Kernel] GGUF MMVQ kernel for multiple input vectors ( #18754 )
...
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com >
2025-06-16 17:33:26 +08:00
8d120701fd
[Docs] Move multiproc doc to v1 dir ( #19651 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-06-16 09:10:12 +00:00
f40f763f12
[CI] Add mteb testing for rerank models ( #19344 )
2025-06-16 01:36:43 -07:00
26bc46ef89
[MISC] typo fix ( #19672 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-16 07:18:49 +00:00
a77aea59fd
[TPU] support attention head dim smaller than 128 ( #19620 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-06-16 06:40:53 +00:00
b692e9cd07
[Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config ( #19660 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-06-16 06:30:29 +00:00
367871a469
[Misc][Frontend] passthrough bad_words ( #19564 )
...
Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai >
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai >
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com >
2025-06-16 05:05:13 +00:00
92183b41f3
[Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker ( #18957 )
...
Signed-off-by: 刘全 <quan.liu2@dbappsecurity.com.cn >
Co-authored-by: 刘全 <quan.liu2@dbappsecurity.com.cn >
2025-06-15 21:56:37 -07:00
c6703d1e0d
[MISC] Remove unused variableds in C++ ( #19609 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-15 20:05:28 -07:00
a5e7242d5f
[Misc] Remove duplicate multiproc method setting for CPU platform ( #19649 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-16 02:26:58 +00:00
91b2c17a55
[CI/Build] Fix torch nightly CI dependencies part 2 ( #19589 )
2025-06-15 20:01:10 +08:00
055915e6ce
Enable prefix caching with full cuda graphs ( #19617 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-15 01:05:05 -07:00
3d330c4c09
[Benchmark] Refactor benchmark script for fp8 & int8 ( #19627 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-15 15:15:37 +08:00
0b73736a0d
[Kernel] Raise verbose error and consolidate num_heads/num_kv_heads divisibility check ( #19339 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-15 13:43:48 +08:00
ee1531bc38
[Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness ( #19644 )
2025-06-14 21:15:41 -07:00
e13945f9dd
[Perf] Further tunings for SM100 FP8 CUTLASS kernel ( #19566 )
2025-06-14 17:25:10 -07:00
08500011d3
[Fix] Convert kv_transfer_config from dict to KVTransferConfig ( #19262 )
2025-06-14 12:32:07 -07:00
861a0a0a39
[Bugfix] Don't attempt to use triton if no driver is active ( #19561 )
2025-06-14 12:30:54 -07:00
bc956b38d0
Only build CUTLASS MoE kernels on Hopper ( #19648 )
2025-06-14 11:44:15 -07:00
294fc1e2c9
[Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization ( #19500 )
2025-06-14 09:34:28 -07:00
2db9044ab6
[Bugfix] Fix auto dtype casting for BatchFeature ( #19316 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-06-14 15:13:08 +00:00
6fa718a460
[Misc] Modularize CLI Argument Parsing in Benchmark Scripts ( #19593 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-14 16:54:52 +08:00
06be858828
[Bugfix] Fix the speculative decoding test by setting the target dtype ( #19633 )
2025-06-13 20:57:32 -07:00
d1e34cc9ac
[V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. ( #18354 )
...
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai >
2025-06-14 11:07:36 +08:00
bd517eb9fe
[BugFix] Fix DP Coordinator incorrect debug log message ( #19624 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-14 00:18:03 +00:00
d65668b4e8
Adding "AMD: Multi-step Tests" to amdproduction. ( #19508 )
...
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-06-13 17:08:51 -07:00
aafbbd981f
[torch.compile] Use custom ops when use_inductor=False ( #19618 )
2025-06-13 15:05:54 -07:00
0f0874515a
[Doc] Add troubleshooting section to k8s deployment ( #19377 )
...
Signed-off-by: Anna Pendleton <pendleton@google.com >
2025-06-13 21:47:51 +00:00
3597b06a4f
[CUDA] Enable full cudagraph for FlashMLA ( #18581 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-06-13 18:12:26 +00:00
1015296b79
[doc][mkdocs] fix the duplicate Supported features sections in GPU docs ( #19606 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-13 16:25:08 +00:00
ce9dc02c93
[Refactor] Remove unused variables in moe_permute_unpermute_kernel.inl ( #19573 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-13 06:12:15 -07:00
a24cb91600
[Model] Fix minimax model cache & lm_head precision ( #19592 )
...
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-06-13 12:08:20 +00:00
7e8d97dd3f
[BugFix] Honor enable_caching in connector-delayed kvcache load case ( #19435 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-13 09:46:32 +00:00
d70bc7c029
[torch.compile] reorganize the cache directory to support compiling multiple models ( #19064 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-06-13 15:23:25 +08:00
ce688ad46e
use base version for version comparison ( #19587 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-06-13 15:09:34 +08:00
cefdb9962d
[Fix] The zip function in Python 3.9 does not have the strict argument ( #19549 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-06-13 14:57:48 +08:00
ace5cdaff0
[Fix] bump mistral common to support magistral ( #19533 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-06-12 22:28:12 -07:00
6458721108
[CPU] Refine default config for the CPU backend ( #19539 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-06-13 13:27:39 +08:00
bb4a0decef
[Misc] Correct broken docs link ( #19553 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-06-12 22:27:13 -07:00
c707cfc12e
[doc] fix incorrect link ( #19586 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-13 04:26:09 +00:00
7b3c9ff91d
[Doc] uses absolute links for structured outputs ( #19582 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-06-13 03:35:17 +00:00
c68698b326
[Bugfix] Fix EAGLE vocab embedding for multimodal target model ( #19570 )
...
Signed-off-by: qizixi <qizixi@meta.com >
2025-06-12 23:09:19 -04:00
e3b12667d4
[BugFix] : Fix Batched DeepGemm Experts ( #19515 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-06-12 20:43:02 -06:00
e6aab5de29
Revert "[Build/CI] Add tracing deps to vllm container image ( #15224 )" ( #19378 )
2025-06-12 17:26:40 -07:00
c57bb199b3
[V1] Resolve failed concurrent structured output requests ( #19565 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-06-12 23:30:09 +00:00
dba68f9159
[Doc] Unify structured outputs examples ( #18196 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-06-12 22:50:31 +00:00
a3319f4f04
[Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant ( #19452 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-12 15:39:15 -04:00
9d880f594d
[Misc] Turn MOE_DP_CHUNK_SIZE into an env var ( #19506 )
2025-06-12 18:01:16 +00:00
017ef648e9
[Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets ( #18847 )
2025-06-12 10:30:56 -07:00
4b25ab14e2
[doc] Make top navigation sticky ( #19540 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-12 15:48:11 +00:00
f98548b9da
[torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass ( #16756 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Sage Moore <sage@neuralmagic.com >
2025-06-12 08:31:04 -07:00
96846bb360
Fix TorchAOConfig skip layers ( #19265 )
...
Signed-off-by: mobicham <hicham@mobiuslabs.com >
2025-06-12 22:22:53 +08:00
b6efafd9e4
[Perf] Vectorize static / dynamic INT8 quant kernels ( #19233 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-12 06:51:41 -07:00
1129e2b1ab
[V1][NixlConnector] Drop num_blocks check ( #19532 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-06-12 12:36:14 +00:00
c742438f8b
[Doc] Add V1 column to supported models list ( #19523 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-12 19:16:44 +08:00
73e2e0118f
[Quantization] Improve AWQ logic ( #19431 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-12 11:02:11 +00:00
c9280e6346
[Bugfix] Respect num-gpu-blocks-override in v1 ( #19503 )
...
Signed-off-by: Jon Swenson <jmswen@gmail.com >
2025-06-12 11:00:23 +00:00
af09b3f0a0
[Bugfix][V1] Allow manual FlashAttention for Blackwell ( #19492 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-12 10:40:24 +00:00
4f6c42fa0a
[Security] Prevent new imports of (cloud)pickle ( #18018 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com >
2025-06-12 10:30:17 +00:00
dff680001d
Fix typo ( #19525 )
...
Signed-off-by: 2niuhe <carlton2tang@gmail.com >
2025-06-12 09:24:45 +00:00
2e090bd5df
[AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm ( #19509 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-06-12 07:14:24 +00:00
1b0b065eb5
[BugFix] Handle missing sep_token for Qwen3-Reranker in Score API ( #19522 )
...
Signed-off-by: strutive07 <strutive07@gmail.com >
2025-06-12 07:00:47 +00:00
d5bdf899e4
[BugFix] Work-around incremental detokenization edge case error ( #19449 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-12 06:43:20 +00:00
7e3e74c97c
[Frontend] Improve error message in tool_choice validation ( #19239 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-12 01:13:00 -04:00
3f6341bf7f
Add Triton Fused MoE kernel config for E=16 on B200 ( #19518 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-06-12 04:31:51 +00:00
e5d35d62f5
[BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import ( #19514 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-06-12 04:28:12 +00:00
2f1c19b245
[CI] change spell checker from codespell to typos ( #18711 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-11 19:57:10 -07:00
42f52cc95b
[CI/Build] Fix torch nightly CI dependencies ( #19505 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-06-11 14:40:42 -07:00
97a9465bbc
[UX] Add Feedback During CUDAGraph Capture ( #19501 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-06-11 21:09:05 +00:00
c7ea0b56cd
[AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger ( #17331 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-06-11 15:53:28 -04:00
29fa5cac1c
[Kernels] Add activation chunking logic to FusedMoEModularKernel ( #19168 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-06-11 12:53:10 -04:00
b2d9be6f7d
[Docs] Remove WIP features in V1 guide ( #19498 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-11 09:15:03 -07:00
04a55612dd
[Misc] Fix misleading ROCm warning ( #19486 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-12 00:12:10 +08:00
89b0f84e17
[doc] fix "Other AI accelerators" getting started page ( #19457 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-06-11 16:11:17 +00:00
497a91e9f7
[CI] Update FlashInfer to 0.2.6.post1 ( #19297 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-11 22:57:28 +08:00
943ffa5703
[Bugfix] Update the example code, make it work with the latest lmcache ( #19453 )
...
Signed-off-by: Runzhen Wang <wangrunzhen@gmail.com >
2025-06-11 12:42:20 +00:00
5c8d34a42c
Support no privileged mode on CPU for docker and kubernetes deployments ( #19241 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-06-11 04:11:47 -07:00
3c8694eabe
Fix some typo ( #19475 )
...
Signed-off-by: ximing.wxm <ximing.wxm@antgroup.com >
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com >
2025-06-11 10:36:04 +00:00
7484e1fce2
Add cache to cuda get_device_capability ( #19436 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-11 17:37:05 +08:00
a2142f0196
Support non-string values in JSON keys from CLI ( #19471 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-11 09:34:04 +00:00
871d6b7c74
[Misc] Reduce warning message introduced in env_override ( #19476 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-11 17:29:54 +08:00
29a38f0352
[Doc] Support "important" and "announcement" admonitions ( #19479 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-11 01:39:58 -07:00
a5115f4ff5
[Doc] Fix quantization link titles ( #19478 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-11 01:27:22 -07:00
68b4a26149
[Doc] Update V1 User Guide for Hardware and Models ( #19474 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-11 00:49:06 -07:00
b8e809a057
[Kernel] Support deep_gemm for linear methods ( #19085 )
...
Signed-off-by: artetaout <lulala341@gmail.com >
2025-06-11 15:14:45 +08:00
5039ec2336
[ROCm] Add rules to automatically label ROCm related PRs ( #19405 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-11 15:09:18 +08:00
7c644ab6d5
Fix Typo in Documentation and Function Name ( #19442 )
2025-06-10 22:44:11 -07:00
2d40665fe8
Add fused MOE config for Qwen3 30B A3B on B200 ( #19455 )
...
Signed-off-by: Junhao Li <junhao@ubicloud.com >
2025-06-11 13:43:46 +08:00
96ada386b7
[Misc] Remove unused MultiModalHasher.hash_prompt_mm_data ( #19422 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-06-11 05:18:57 +00:00
1e473b3010
[CI] Disable failing GGUF model test ( #19454 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-11 05:12:38 +00:00
2b1e2111b0
Fix test_max_model_len in tests/entrypoints/llm/test_generate.py ( #19451 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-11 12:54:59 +08:00
a45b979d9f
[BugFix] Fix docker build cpu-dev image error ( #19394 )
...
Signed-off-by: niu_he <carlton2tang@gmail.com >
2025-06-10 20:56:40 -07:00
3952731e8f
[New Model]: Support Qwen3 Embedding & Reranker ( #19260 )
2025-06-10 20:07:30 -07:00
77f0d465d0
[BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 ( #19390 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-06-11 07:54:41 +08:00
22c3c0aa4a
Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 ( #19401 )
...
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com >
2025-06-11 07:23:57 +08:00
33f8dba7c6
[Model] use AutoWeightsLoader for commandr ( #19399 )
...
Signed-off-by: py-andy-c <pychen1017@gmail.com >
2025-06-10 22:42:21 +00:00
5241ca50d6
[ROCm][V1] Adding ROCm to the list of plaforms using V1 by default ( #19440 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-06-10 22:06:15 +00:00
da9b523ce1
[Docs] Note that alternative structured output backends are supported ( #19426 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-06-10 16:20:00 +00:00
b6553be1bc
[Misc] Slight improvement of the BNB ( #19418 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-10 13:51:49 +00:00
64a9af5afa
Simplify ep kernels installation ( #19412 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-06-10 20:06:08 +08:00
e4248849ec
[BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral ( #19411 )
...
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-06-10 12:02:40 +00:00
467bef18a3
[BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword use_irope ( #19134 )
...
Signed-off-by: Yunqiu Guo <guorachel@meta.com >
2025-06-10 16:48:51 +08:00
5f1ac1e1d1
Revert "[v1] Add fp32 support to v1 engine through flex attn" ( #19404 )
2025-06-10 01:30:20 -07:00
9368cc90b2
Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. ( #17930 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
2025-06-10 06:22:05 +00:00
32b3946bb4
Add clear documentation around the impact of debugging flag ( #19369 )
...
Signed-off-by: Anna Pendleton <pendleton@google.com >
2025-06-10 06:16:09 +00:00
6b1391ca7e
[Misc] refactor neuron_multimodal and profiling ( #19397 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-10 06:12:42 +00:00
a3f66e75d1
Add security warning to bug report template ( #19365 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-06-10 06:06:36 +00:00
319cb1e351
[Core] Batch multi modal input using pinned memory ( #19169 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-06-10 13:44:59 +08:00
1efef71645
[Bugfix] Fix modelscope token passed in ( #19389 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-10 13:39:37 +08:00
646d62f636
[Core] Use tuple for kv cache group block ids ( #19175 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-10 07:01:17 +02:00
6cd4ae8acd
[Frontend] Add tqdm_leave_pbar to control progress bar visibility ( #19357 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-10 04:55:09 +00:00
c016047ed7
Fix docs/mkdocs/hooks/remove_announcement.py ( #19382 )
2025-06-09 21:36:54 -07:00
9af6d22e4c
Use xla flag to improve the quantized model performance ( #19303 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-06-10 01:28:45 +00:00
4589b94032
[Bugfix] Fix benchmark_moe.py ( #19016 )
...
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn >
2025-06-09 18:04:36 -07:00
cc867be19c
[V1] Reuse V0's memory_profiling util for gpu worker memory profiling ( #19312 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-06-10 08:40:01 +08:00
3a7cd627a8
[Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration ( #19383 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-06-09 16:41:51 -07:00
8058c91108
[HOT-FIX] Add kv_sharing_target_layer_name argument to cutlass_mla backend ( #19374 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-06-09 19:00:07 -04:00
7d44c469fe
[TPU]Fix KV cache sharing tests ( #19371 )
2025-06-09 18:38:15 -04:00
31f58be96a
[Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var ( #18472 )
...
Signed-off-by: liusiqian <liusiqian@tal.com >
2025-06-09 21:41:21 +00:00
ebb2f383b8
[Quantization] Bump compressed-tensors version ( #19295 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-06-09 14:33:15 -07:00
c1c7dbbeeb
[Bugfix][Core] Prevent token lengths exceeding max_model_len in V0 ( #19348 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-09 23:01:29 +08:00
5cf2daea9a
[Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. ( #19298 )
...
Signed-off-by: Varun <vsundarr@redhat.com >
Co-authored-by: Varun <vsundarr@redhat.com >
2025-06-09 10:50:39 -04:00
b8089195b4
[v1] Add fp32 support to v1 engine through flex attn ( #19319 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-06-09 22:10:44 +08:00
770e5dcdb8
[full_graph] Fix query_start_loc padding ( #19321 )
...
Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai >
2025-06-09 21:32:56 +08:00
c57c9415b1
[Docs] Fix a bullet list in usage/security.md ( #19358 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-06-09 13:28:51 +00:00
01810f9236
[CI] Introduce rules for llama auto-label ( #19323 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-09 20:05:42 +08:00
59abbd84f9
[Fix] Allow kernel compilation for CUDA capability 8.7 ( #19328 )
...
Signed-off-by: Conroy Cheers <conroy@corncheese.org >
2025-06-09 02:57:23 -07:00
95a6568b5c
[CI/Build] Fix LoRA test ( #19350 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-09 09:52:10 +00:00
0eca5eacd0
[Doc] Fix description in the Automatic Prefix Caching design doc ( #19333 )
...
Signed-off-by: cr7258 <chengzw258@163.com >
2025-06-09 17:30:02 +08:00
12e5829221
[doc] improve ci doc ( #19307 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-09 07:26:12 +00:00
3a4d417707
[Misc] Cleanup compilation tests ( #19343 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-06-09 15:05:44 +08:00
8335667c22
[Frontend] Remove unreachable code from llm.py ( #19288 )
...
Signed-off-by: KsuParkhamchuk <k.parkhamchuk@gmail.com >
2025-06-09 10:22:10 +08:00
e1c4380d4c
[Misc] Add documentation update reminder to PR template ( #19289 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-09 10:20:53 +08:00
e31ae3de36
[Deprecation] Remove inputs arg fallback in Engine classes ( #18799 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-09 10:19:56 +08:00
2ffb9b6e07
[Bugfix] model_max_length should consider max_model_len in tokenizer_config ( #19201 )
2025-06-08 07:17:53 -07:00
cda10fa3e2
[Multi Modal] Add an env var for message queue max chunk bytes ( #19242 )
...
Signed-off-by: yZhen <yZhen@fb.com >
Co-authored-by: yZhen <yZhen@fb.com >
2025-06-08 21:39:12 +08:00
c123bc33f9
[Quantization] Add compressed-tensors NVFP4 support ( #18312 )
2025-06-08 09:05:55 -04:00
b9a1791e2c
[Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection ( #19082 )
...
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
2025-06-08 09:17:14 +00:00
989dcee981
Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B ( #19315 )
...
Signed-off-by: Xu Wenqing <xuwq1993@qq.com >
2025-06-08 16:07:02 +08:00
3d64d366e0
[Misc] Change tests/compile to use VLLM_V1 by default ( #19302 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-06-08 16:06:48 +08:00
eaa2e51088
[Bugfix] Re-enable use_cudagraph in vLLM v1 ( #19299 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-06-08 08:56:12 +08:00
d77f7fb871
[Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer ( #19283 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-06-08 08:16:31 +08:00
2d8476e465
[BugFix][V1] Fix memory profiling bug ( #18974 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-06-07 10:34:51 -07:00
88be823d57
[AMD] Update compatible packaging version ( #19309 )
...
Signed-off-by: pramkuma <Pramendra.Kumar@amd.com >
2025-06-07 20:55:09 +08:00
4e4f63ad45
[Nit][Benchmark]Fix example in benchmark_serving_structured_output.py ( #19311 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2025-06-07 18:25:38 +08:00
d2f0e7e615
[CI/Build] Improve Llama GGUF test robustness ( #19287 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-07 17:23:28 +08:00
122cdca5f6
[Misc] refactor context extension ( #19246 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-07 05:13:21 +00:00
cf02f9b283
Add FlexAttention to V1 ( #16078 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-06-06 21:58:55 -07:00
c4296b1a27
[CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py ( #19253 )
...
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com >
2025-06-07 11:52:52 +08:00
66c508b137
[TPU][Test] Add script to run benchmark on TPU for buildkite ( #19039 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-06-06 20:10:24 -07:00
84166fee97
[Kernel] Integrate CUTLASS MoE kernel with PPLX ( #18762 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-06-06 18:26:11 -07:00
6e0cd10f72
[Easy][Test] Simplify test_function_tool_use with multiple parametrizes ( #19269 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-07 09:19:09 +08:00
e010688f50
[Build][ROCm] Update Dockerfile.rocm ( #19296 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-06-06 19:35:16 -04:00
441b65d8c7
[Misc][Tools][Benchmark] Fix and improve auto tune script ( #19163 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-06-06 23:31:19 +00:00
46ecc57973
[BugFix] Fix tpu_model_runner block_id concatenation ( #19228 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-06 16:28:17 -07:00
b6a3a9f76d
[Core] Fix abrupt request abort ( #18485 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-06 16:27:59 -07:00
ca27f0f9c1
[Bugfix][Core] Update cancellation logic in generate() to handle Generator exits ( #19225 )
...
Co-authored-by: Adolfo Victoria <adovi@meta.com >
2025-06-06 20:17:54 +00:00
aad30bd306
[BugFix] Fix MultiConnector test after HMA changes ( #19291 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-06 20:16:24 +00:00
94ecee6282
Fixed ppc build when it runs on non-RHEL based linux distros ( #18422 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
Signed-off-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com >
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com >
Co-authored-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com >
2025-06-06 11:54:26 -07:00
8267f9916f
improve logits bias ( #19041 )
2025-06-06 19:59:25 +08:00
7353492a47
[Core] Raise when non-multi-instance DP clients target a DP rank ( #19227 )
...
Signed-off-by: Jon Swenson <jmswen@gmail.com >
2025-06-06 19:03:01 +08:00
7661e92ef8
[Model] Optimize nemotron_h implementation ( #19249 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-06 10:05:14 +00:00
f168b85725
Unit Test for run_dp_sharded_vision_model ( #19103 )
...
Signed-off-by: Siqi Yan <siqi@meta.com >
Co-authored-by: Siqi Yan <siqi@meta.com >
2025-06-06 16:24:02 +08:00
da511d54d8
Fix CompilationConfig repr ( #19091 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-06-06 16:23:35 +08:00
65c69444b1
[Docs] Improve V1 KVConnector interface documentation ( #19172 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-06 16:22:45 +08:00
94870359cd
[Quantization] Bump compressed-tensors version; update NVFP4A16 test model ( #19224 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
2025-06-06 01:21:54 -07:00
0d49483ea9
[TPU] fix kv cache dtype in model runner ( #19244 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-06-06 16:20:16 +08:00
90b78ec5f9
[v1][P/D] Fix a edge case in kv cache schedule ( #19182 )
...
Co-authored-by: jinghui <jinghui@fb.com >
2025-06-05 23:32:55 -07:00
91a2ef98ea
[Chore] update CODEOWNERS ( #19247 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-06-06 06:09:43 +00:00
3da2313d78
Support allowed_token_ids in ChatCompletionRequest ( #19143 )
...
Signed-off-by: Xu Song <xusong.vip@gmail.com >
2025-06-06 05:06:48 +00:00
b61dc5f972
[TPU] update torch_xla pin ( #19231 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-06-06 04:27:38 +00:00
f8a1a2d108
[v1] Hybrid Memory Allocator ( #17996 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-05 20:47:09 -07:00
3465b87ef8
[Bugfix] Fix EAGLE vocab embedding construction for Llama 70B ( #19033 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-06-05 19:10:08 -07:00
c8134bea15
Fix AOPerModuleConfig name changes ( #18869 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-06-05 18:51:32 -07:00
cb6d572e85
[Model] NemotronH support ( #18863 )
...
Signed-off-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com >
Co-authored-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com >
2025-06-05 21:29:28 +00:00
87360308b7
[V1] Use FlashInfer by default on Blackwell GPUs ( #19118 )
2025-06-05 15:40:39 -04:00
aa49f14832
[Quantization] Skip Fp4 Test for compressed-tensors ( #19217 )
2025-06-05 18:21:53 +00:00
9ef9173cfa
[P/D][NixlConnector] Enable FlashInfer backend ( #19090 )
2025-06-05 17:10:15 +00:00
85e2b7bb13
[MISC][Bugfix] Use less CPU when message queue has been empty for some time ( #16226 )
...
Signed-off-by: Povilas Kanapickas <povilas@radix.lt >
2025-06-05 16:53:08 +00:00
61059bee40
[Hardware][NVIDIA] FP4 MoE kernel optimization ( #19110 )
...
Signed-off-by: Chiyue Wei <chiyuew@nvidia.com >
Co-authored-by: Chiyue Wei <chiyuew@nvidia.com >
2025-06-05 09:48:26 -07:00
ec89524f50
Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 ( #19205 )
2025-06-05 16:38:54 +00:00
f20f9f063b
[mistral_common] Add v11 tokenizer ( #19193 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-06-05 08:27:41 -07:00
9bc8bb07cf
[Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided ( #19202 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-06-05 12:59:28 +00:00
1aeb925f34
[Frontend] improve vllm run-batch --help display ( #19187 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-05 11:16:25 +00:00
188a4590d8
[Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly ( #19105 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-05 11:14:32 +00:00
18093084be
[Misc] Remove unnecessary fallback to prefill-decode attention ( #19138 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-06-05 16:08:26 +08:00
da40380214
[Build] Annotate wheel and container path for release workflow ( #19162 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-04 23:24:56 -07:00
8fc57501d3
[Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled ( #19135 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-06-05 06:24:24 +00:00
af7fc84fd2
[BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 ( #19171 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-05 13:41:25 +08:00
0678b52251
Handle non-serializable objects when dumping benchmark results ( #19114 )
2025-06-04 22:40:04 -07:00
25b918eee6
[Torch Nightly]add missing dependency ( #18770 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-06-04 21:56:12 -07:00
a408820f2f
[Bugfix] Fix port handling in make_zmq_path ( #19117 )
2025-06-04 21:00:59 -06:00
c56ed8bb0e
[Bugfix][Nixl] Fix full prefix cache hit bug ( #18632 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-05 02:07:32 +00:00
78dcf56cb3
[doc] small fix ( #19167 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-05 09:13:50 +08:00
b2fac67130
[P/D] Heterogeneous TP ( #18833 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
2025-06-04 23:25:34 +00:00
23027e2daf
[Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM ( #18817 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-06-04 15:37:25 -07:00
c3fd4d669a
[Kernel] Integrate batched/masked deepgemm kernel ( #19111 )
...
Signed-off-by: Varun <vsundarr@redhat.com >
Co-authored-by: Varun <vsundarr@redhat.com >
2025-06-04 21:59:18 +00:00
ef3f98b59f
[Bugfix] fix v1 cpu worker fails on macOS ( #19121 )
2025-06-04 20:17:38 +00:00
7ee2590478
[TPU] Update dynamo dump file name in compilation test ( #19108 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-06-04 16:13:43 -04:00
53a5a0ce30
[Perf] Tunings for SM100 FP8 CUTLASS kernel ( #18778 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-04 10:46:28 -07:00
d459fae0a2
[Bugfix][EP+DP] Fix internode check ( #19112 )
...
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
2025-06-04 23:39:23 +08:00
c8dcc15921
Allow AsyncLLMEngine.generate to target a specific DP rank ( #19102 )
...
Signed-off-by: Jon Swenson <jmswen@gmail.com >
2025-06-04 08:26:47 -07:00
8f4ffbd373
[Doc] Update V1 Guide for embedding models ( #19141 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-04 22:57:55 +08:00
5f2cd251d2
Sm100 blockwise fp8 swap ab ( #18564 )
2025-06-04 07:48:45 -07:00
02658c2dfe
Add DeepSeek-R1-0528 function call chat template ( #18874 )
...
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com >
2025-06-04 13:24:18 +00:00
01dc9a76db
[CI/Build][Bugfix] Ensure compatibility with transformers 4.52 ( #18678 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-04 04:49:20 -07:00
35cf32df30
Improve the output precision of embedding models ( #19092 )
2025-06-04 11:48:57 +00:00
8711bc5e68
[Misc] Add packages for benchmark as extra dependency ( #19089 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-04 04:18:48 -07:00
2669a0d7b5
Fix ValueError: Missing value for tag key(s): model_name,engine. ( #19113 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-06-04 17:10:45 +08:00
8e972d9c44
[TPU] Skip hanging tests ( #19115 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-06-04 01:43:00 -07:00
3336c8cfbe
Fix #19130 ( #19132 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-06-04 01:42:06 -07:00
b124e1085b
[Bugfix] Fix FA3 full cuda graph correctness ( #19106 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-03 23:10:15 -07:00
41aa578428
[NVIDIA] Add Cutlass MLA backend ( #17625 )
2025-06-03 21:40:26 -07:00
8d646c2e53
[Cleanup][v1]:remote guided-decoding-backend for example ( #19059 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-06-04 04:23:26 +00:00
5d6d1adf15
[KERNEL] Sampler. CUDA kernel for applying repetition penalty ( #18437 )
2025-06-03 21:13:01 -07:00
1409ef9134
[Core] Cast multimodal input in hf processor ( #18862 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-06-03 20:24:56 -07:00
4555143ea7
[CPU] V1 support for the CPU backend ( #16441 )
2025-06-03 18:43:01 -07:00
52dceb172d
[Docs] Add developer doc about CI failures ( #18782 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-06-04 01:09:13 +00:00
abd7df2fca
[Misc] Fix path and python alias errors in disagg_prefill exmaples ( #18919 )
2025-06-03 17:15:18 -07:00
b712be98c7
feat: add data parallel rank to KVEventBatch ( #18925 )
2025-06-03 17:14:20 -07:00
a8da78eac9
[Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers ( #19029 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-04 00:14:06 +00:00
5d96533e22
[Bugfix][P/D] Fix Prefix Cache Bug ( #18411 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-06-03 23:53:16 +00:00
4de790fcad
[Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled ( #19075 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-06-03 23:27:24 +00:00
b5fd9506c1
[Bugfix] get_num_blocks_to_allocate with null_block ( #19031 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-03 15:30:55 -07:00
135cf55cd1
[V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix ( #18971 )
2025-06-03 15:26:33 -07:00
6cac54f4d1
[v1] Re-init input batch for multiple kv cache groups ( #18654 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-03 21:41:36 +00:00
6865fe0074
Fix interaction between Optional and Annotated in CLI typing ( #19093 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Yikun Jiang <yikun@apache.org >
2025-06-03 21:07:19 +00:00
e31446b6c8
[Perf] Tune scaled_fp8_quant by increasing vectorization ( #18844 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-03 13:48:25 -07:00
bdf13965ab
[V1] Support cross-layer KV sharing ( #18212 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-06-03 20:33:07 +00:00
fa98d77773
[Kernel] DeepEP dispatch-combine kernel integration ( #18434 )
...
Signed-off-by: Varun <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-06-03 12:30:02 -07:00
01eee40536
[doc] update docker version ( #19074 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-03 19:08:21 +00:00
19bdaf32b1
[Doc] Readme standardization ( #18695 )
...
Co-authored-by: Soren Dreano <soren@numind.ai >
2025-06-03 11:50:55 -07:00
02f0c7b220
[Misc] Add SPDX-FileCopyrightText ( #19100 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-06-03 11:20:17 -07:00
d054da1992
[Misc] fix: add miss best_of param validation ( #18555 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-06-03 11:02:07 -07:00
4b7817c119
[Misc] Add missing _Backend enums ( #19081 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
2025-06-03 16:15:16 +00:00
d00dd65cd4
[Doc] Improve the Pull Request template with key components ( #19086 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-03 23:44:34 +08:00
d81edded69
[Bugfix] disable processor cache ( #19068 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2025-06-03 15:06:04 +00:00
476844d44c
Fix underscores in dict keys passed via CLI ( #19030 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-06-03 14:39:24 +00:00
4e68ae5e59
[CI/Build] Remove V0 LoRA test ( #19066 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-03 14:30:18 +00:00
4e88723f32
[doc] clarify windows support ( #19088 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-06-03 21:42:17 +08:00
118ff92111
[Doc] Update V1 user guide for embedding and enc-dec models ( #19060 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-03 02:29:41 -07:00
ec2dcd80bc
[Misc] Update WeightsMapper for qwen2-vl/qwen2.5-vl ( #19054 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-03 09:08:20 +00:00
42243fbda0
[Doc] Add InternVL LoRA support ( #19055 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-03 09:08:03 +00:00
6d18ed2a2e
Update docker docs with ARM CUDA cross-compile ( #19037 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-06-03 08:21:53 +00:00
f32fcd9444
[v1][KVCacheManager] Rename BlockHashType to BlockHash ( #19015 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-03 08:01:48 +00:00
d32aa2e670
[Bugfix] Use cmake 3.26.1 instead of 3.26 to avoid build failure ( #19019 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-03 00:16:17 -07:00
cc977286e7
Reduce logs in CLI scripts and plugin loader ( #18970 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-03 06:00:45 +00:00
17430e3653
[bugfix] small fix logic issue ( #18999 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-03 05:35:12 +00:00
1282bd812e
Add tarsier model support ( #18985 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-06-03 13:13:13 +08:00
bdce64f236
[V1] Support DP with Ray ( #18779 )
2025-06-02 21:15:13 -07:00
9e6f61e8c3
[ROCm][Build] Clean up the ROCm build ( #19040 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-06-02 20:47:47 -07:00
8655f47f37
[CPU][CI] Re-enable the CPU CI tests ( #19046 )
...
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-06-02 20:46:47 -07:00
4ce42f9204
Adding "LoRA Test %N" to AMD production tests ( #18929 )
...
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
2025-06-02 20:46:44 -07:00
8a57872b2a
[Bugfix][EP+DP] Use pplx-kernel internode instead of intranode ( #19034 )
...
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-06-03 11:36:51 +08:00
5bc1ad6cee
[Doc] Remove duplicate TOCs during MkDocs migration ( #19021 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-06-02 19:49:48 -07:00
9112b443a0
[Hardware][TPU] Initial support of model parallelism with single worker using SPMD ( #18011 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
Co-authored-by: Hossein Sarshar <hossein.sarshar@gmail.com >
Co-authored-by: Chengji Yao <chengjiyao@google.com >
2025-06-03 00:06:20 +00:00
c57d577e8d
add an absolute path for run.sh ( #18258 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-06-02 19:38:23 +00:00
ca2f6b9c30
[Bugfix][Model] Attempt to fix eagle in V0. ( #18978 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-06-02 08:15:53 -07:00
20133cfee2
[Frontend] enable custom logging for the uvicorn server (OpenAI API server) ( #18403 )
...
Signed-off-by: François Paupier <francois.paupier@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-06-02 15:04:23 +00:00
ebb1ec9318
[Model] enable data parallel for Llama4 vision encoder ( #18368 )
...
Signed-off-by: yzhen <yzhen@devgpu093.cco2.facebook.com >
Co-authored-by: yZhen <yZhen@fb.com >
Co-authored-by: yzhen <yzhen@devgpu093.cco2.facebook.com >
2025-06-02 19:22:54 +08:00
5b168b6d7a
[doc] add pytest tips ( #19010 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-02 11:07:26 +00:00
9760fd8f6a
[Core] Support inplace model weights loading ( #18745 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-02 17:38:50 +08:00
b9f61e1387
[Bugfix][Nixl] Fix DP Metadata Handshake ( #19008 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-06-02 03:30:41 +00:00
d6fd3a33b8
[Misc] reuse num_tokens_across_dp of get_dp_padding to avoid unnecessary dp all reduce in set_forward_context ( #18935 )
...
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
2025-06-01 19:41:18 +00:00
432ec9926e
[doc] wrong output ( #19000 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-01 11:26:14 +00:00
2b102d51ad
[BugFix] Fix incorrect metrics shutdown error log message ( #18992 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-01 11:42:23 +08:00
aa54a7bf7b
[BugFix] fix data parallel construct ipv6 url addres ( #18991 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-06-01 11:42:10 +08:00
2ad6194a02
Let max_num_batched_tokens use human_readable_int for large numbers ( #18968 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-01 11:41:29 +08:00
c594cbf565
[doc] small fix - mkdocs ( #18996 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-31 20:23:43 -07:00
a35ca765a5
[LoRA] Support dynamically initialize packed_modules_mapping for VLM with arbitrary components ( #18987 )
...
Signed-off-by: isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-01 11:06:57 +08:00
6aa8f9a4e7
[Core] Rework dtype resolution ( #18751 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-01 11:04:23 +08:00
1bc86a3da1
[Bugfix] Fix EAGLE3 broken logits ( #18909 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-05-31 19:58:07 -07:00
bbfa0c61d1
[Misc][Benchmark] Add support for CustomDataset ( #18511 )
2025-05-31 19:07:38 +00:00
20079c6e36
[Misc] add return token strs for tokenize ( #18941 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-31 18:00:11 +00:00
9a1b9b99d7
[BugFix] Fix multi-node offline data-parallel ( #18981 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com >
2025-05-31 08:34:52 -07:00
8bf507d766
[P/D] NixlConnector use cache device index for memory registration ( #18969 )
...
Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com >
2025-05-31 11:19:18 -04:00
306d60401d
[ROCm][Kernel] Add gfx950 support for skinny gemms ( #18010 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-05-31 07:40:05 -07:00
f2c3f66d59
[Bugfix] Fix for issue 17396 ( #18773 )
...
Signed-off-by: Fred Reiss <frreiss@us.ibm.com >
2025-05-31 11:58:17 +00:00
0f5e0d567e
[FEAT][ROCm] Add AITER grouped topk for DeepSeekV2 ( #18825 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-05-31 03:39:31 -07:00
c55d804672
[BugFix] Pydantic part 2 ( #18911 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-05-31 03:39:28 -07:00
749f5bdd38
[doc] fix the list rendering issue - security.md ( #18982 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-31 10:39:21 +00:00
2a50ef5760
[Neuron] Add Multi-Modal model support for Neuron ( #18921 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
Co-authored-by: Ashraf Mahgoub <ashymahg@amazon.com >
Co-authored-by: Rohith Nallamaddi <nalrohit@amazon.com >
Co-authored-by: FeliciaLuo <luof@amazon.com >
Co-authored-by: Elaine Zhao <elaineyz@amazon.com >
2025-05-31 10:39:11 +00:00
b8b904795d
fix security issue of logging llm output ( #18980 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-05-31 10:38:56 +00:00
ba5111f237
[Bugfix]: Fix the incompatibility issue with Structured Outputs when Thinking is disabled ( #18879 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-31 09:20:54 +00:00
1e123529d7
[Misc] Fix estimated max model len msg ( #18966 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-05-31 16:43:44 +08:00
dff80b0e42
[Frontend] Add rerank support to run_batch endpoint ( #16278 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2025-05-31 07:40:01 +00:00
7782464a17
create util function for batched arange ( #18937 )
2025-05-31 13:50:38 +08:00
0f71e24034
[Docs] Correct multiprocessing design doc ( #18964 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-31 01:30:15 +00:00
1dab4d5718
Tool parser regex timeout handling ( #18960 )
...
Signed-off-by: Will Eaton <weaton@redhat.com >
2025-05-30 21:02:54 +00:00
7f21e8052b
[Misc] add group_size is -1 in awq quantization ( #18910 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-05-30 17:34:22 +00:00
5a8641638a
[VLM] Add PP support and fix GPTQ inference for Ovis models ( #18958 )
...
Signed-off-by: isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-30 17:11:44 +00:00
f49239cb45
Benchmark script for fp8 vs bf16 gemm ( #17126 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-30 10:56:11 -06:00
2dbe8c0774
[Perf] API-server scaleout with many-to-many server-engine comms ( #17546 )
2025-05-30 08:17:00 -07:00
84ec470fca
Improve "failed to get the hash of the compiled graph" error ( #18956 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-30 15:00:54 +00:00
b29ca5c4d5
[Docs] Update SECURITY.md with link to our security guide ( #18961 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-30 07:37:27 -07:00
ec6833c5e9
[doc] show the count for fork and watch ( #18950 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-30 06:45:59 -07:00
e1fadf1197
[Feature] minicpm eagle support ( #18943 )
...
Signed-off-by: huangyuxiang03 <huangyx0321@gmail.com >
Co-authored-by: huangyuxiang03 <huangyx0321@gmail.com >
2025-05-30 06:45:56 -07:00
43ff405b90
[CI/Build] remove regex from build dependencies ( #18945 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-30 04:02:50 -07:00
fba02e3bd1
[Bugfix][TPU] Fix tpu model runner testcase failure ( #18810 )
...
Signed-off-by: Carol Zheng <cazheng@google.com >
2025-05-30 18:04:03 +08:00
4577fc9abb
[Misc]Fix typo ( #18947 )
2025-05-30 02:21:35 -07:00
5f1d0c8118
[Bugfix][Failing Test] Fix test_vllm_port.py ( #18618 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-30 17:13:47 +08:00
c3bb9f2331
[Model] Use in-place adds in SigLIP ( #18922 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-30 17:12:59 +08:00
8f8900cee9
[doc] add mkdocs doc ( #18930 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-30 07:58:44 +00:00
6acb7a6285
[Misc]Fix benchmarks/README.md for speculative decoding ( #18897 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-30 07:58:04 +00:00
4f4a6b844a
[Deprecation] Remove mean pooling default for Qwen2EmbeddingModel ( #18913 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-30 06:53:37 +00:00
4d0a1541be
[Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy ( #18861 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-30 13:37:36 +08:00
77b6e74fe2
[ROCm] Remove unnecessary assertion of max_model_len in ROCM_AITER_MLA attention backend. ( #18938 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-05-29 22:33:17 -07:00
5acf828d99
[docs] fix: fix markdown syntax ( #18927 )
2025-05-30 05:20:48 +00:00
3987e2ae96
[Model] Use AutoWeightsLoader for mamba2 ( #18918 )
...
Signed-off-by: iLeGend <824040212@qq.com >
2025-05-30 04:50:10 +00:00
77164dad5e
[Bugfix] Consistent ascii handling in tool parsers ( #18883 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-30 04:44:43 +00:00
3de3eadf5b
improve the robustness of parsing vlms config in AutoRound ( #18894 )
...
Signed-off-by: wenhuach21 <wenhua.cheng@intel.com >
2025-05-29 19:24:47 -07:00
3132290a14
[TPU][CI/CD] Clean up docker for TPU tests. ( #18926 )
...
Signed-off-by: Carol Zheng <cazheng@google.com >
2025-05-30 10:24:19 +08:00
1aa2f81b43
[Misc] Update type annotation for rotary embedding base ( #18914 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-30 10:17:01 +08:00
d54af615d5
[Bugfix] Fix PP default fallback behavior for V1 ( #18915 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-30 10:13:17 +08:00
a1cc9f33a3
[TPU] remove transpose ops in moe kernel ( #18923 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-05-29 23:00:11 +00:00
a521ef06e5
Use standalone_compile by default in torch >= 2.8.0 ( #18846 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-30 06:41:58 +08:00
64eaf5fe05
[P/D] NixlConnector DP fixes ( #18903 )
...
Signed-off-by: Will Eaton <weaton@redhat.com >
2025-05-29 18:08:40 +00:00
d1d61f3351
[BugFix] Make DP work with connector-delayed new requests ( #18559 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Will Eaton <weaton@redhat.com >
2025-05-29 18:04:18 +00:00
32ce3cf7c9
[V1] Allocate kv_cache with stride order for V1 ( #18775 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
2025-05-29 17:54:16 +00:00
d58f9c7f7a
[Misc] Remove duplicate init for self.vllm_config ( #18896 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-05-29 17:26:07 +00:00
c29034037d
[Deprecation] Disallow pos-args other than model when initializing LLM ( #18802 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-29 09:36:58 -07:00
1b7cfd5a36
[ROCm][V0][Attention] Revert to the previous FA triton kernel ( #18226 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-29 12:13:18 -04:00
da4b69d0b4
[Attention][V1] Toggle for v1 attention backend ( #18275 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-29 10:48:24 -04:00
c9479b2920
[Bugfix] Fix the failing gte embedding test ( #18720 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-29 07:39:25 -07:00
6f2909405e
[Doc] Fix codeblocks formatting in LoRA adapters documentation ( #18907 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-05-29 07:38:55 -07:00
b169d5f7b6
[Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. ( #18692 )
...
Signed-off-by: Duyi-Wang <duyi.wang@intel.com >
2025-05-29 20:02:08 +08:00
f8977c233f
Fix an error in dummy weight loading for quantization models ( #18855 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-05-29 03:07:20 -07:00
f274581f44
[BugFix] Update pydantic to fix error on python 3.10 ( #18852 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-05-29 03:05:46 -07:00
0b1447f890
[Bugfix] Ensure tensors are contiguous during serialisation ( #18860 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-29 03:05:20 -07:00
24d0ef8970
[Misc] Replace TODO in serving transcription ( #18895 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-05-29 02:58:14 -07:00
7fcfd954ff
[Bugfix] Fix misleading information in the documentation ( #18845 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-29 02:54:14 -07:00
e740d07f07
[doc] add CLI doc ( #18871 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-29 09:51:36 +00:00
a652e71dd0
[Doc] Remove redundant spaces from compatibility_matrix.md ( #18891 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-05-29 02:51:20 -07:00
34d6c447c4
[LoRA] Add LoRA support for InternVL ( #18842 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-29 08:46:24 +00:00
972eddf7c9
[Neuron] Add multi-LoRA support for Neuron. ( #18284 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
2025-05-29 16:41:22 +08:00
fd7bb88d72
Fixes a dead link in nightly benchmark readme ( #18856 )
...
Signed-off-by: Brent Salisbury <bsalisbu@redhat.com >
2025-05-29 04:41:39 +00:00
3c49dbdd03
Skip device and quant Pydantic validation to make plugin device work ( #18843 )
...
Signed-off-by: Yikun Jiang <yikunkero@gmail.com >
2025-05-28 20:12:30 -07:00
1661a9c28f
[Doc][Neuron] Update documentation for Neuron ( #18868 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
2025-05-28 19:44:01 -07:00
8e882ffdc0
[Bugfix][TPU] fix moe custom kernel import ( #18853 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-05-28 19:34:19 -07:00
26b4fa45be
Add ability to use CUDAGraphs with use_inductor=False ( #17345 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-29 10:16:52 +08:00
515b413ebf
Prevent the cross-encoder logic from being applied to classification tasks ( #18838 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-28 19:16:17 -07:00
269d901734
[Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix ( #18100 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-29 07:21:46 +08:00
7951d78738
[Core] Enable CUDA graphs for DP + All2All kernels ( #18724 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-05-28 22:55:30 +00:00
6dbe5b5c93
Remove checks for None for fields which should never be None ( #17985 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-28 21:32:19 +00:00
643622ba46
[Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend ( #15655 )
...
Signed-off-by: Akshat Tripathi <akshat@krai.ai >
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Signed-off-by: xihajun <junfan@krai.ai >
Signed-off-by: Jorge de Freitas <jorge.de-freitas22@imperial.ac.uk >
Signed-off-by: Jorge de Freitas <jorge@krai.ai >
Co-authored-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: xihajun <junfan@krai.ai >
Co-authored-by: Jorge de Freitas <jorge.de-freitas22@imperial.ac.uk >
Co-authored-by: Jorge de Freitas <jorge@krai.ai >
2025-05-28 19:59:09 +00:00
a09c7ca9f2
[Chore][Spec Decode] Update check NoneType instead of assigning variables ( #18836 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-28 18:57:19 +00:00
0e98964e94
[V1][Metrics] Remove metrics that were deprecated in 0.8 ( #18837 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-28 18:54:12 +00:00
c68b5c63eb
[Misc] fix olmoe model layer can't laod in tp gt 1 ( #18828 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-05-28 17:36:21 +00:00
fced756923
[Chore] update ty configuration ( #18839 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-28 08:59:11 -07:00
321331b8ae
[Core] Add Lora Support to Beam Search ( #18346 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-05-28 08:58:24 -07:00
6e4cea1cc5
decrement server_load on listen for disconnect ( #18784 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2025-05-28 22:15:12 +08:00
435fa95444
[Frontend] add run batch to CLI ( #18804 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-28 07:08:57 -07:00
4c2b38ce9e
Enable Pydantic mypy checks and convert configs to Pydantic dataclasses ( #17599 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-28 12:46:04 +00:00
d781930f90
[Platform][Dist] Make torch distributed process group extendable ( #18763 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-28 10:52:34 +00:00
ce75efeecb
[BugFix] FA2 MLA Accuracy Issue ( #18807 )
...
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com >
2025-05-28 08:59:39 +00:00
aa42561e40
Fix PiecewiseCompileInterpreter ( #17338 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-28 08:40:53 +00:00
de65fc8e1e
[CI] improve embed testing ( #18747 )
2025-05-28 00:16:35 -07:00
0c492b7824
[Deprecation] Remove fallbacks for Embeddings API ( #18795 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-28 15:09:04 +08:00
0f0926b43f
[Deprecation] Remove unused sync methods in async_timeout ( #18792 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-28 15:08:48 +08:00
7f2c1a87e9
[Deprecation] Require overriding get_dummy_text and get_dummy_mm_data ( #18796 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-28 15:08:35 +08:00
b78f844a67
[Bugfix][FailingTest]Fix test_model_load_with_params.py ( #18758 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-28 05:42:54 +00:00
5e13c07d00
[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2) ( #18781 )
...
Signed-off-by: Ronald Xu <ronaldxu@amazon.com >
2025-05-28 05:09:14 +00:00
774c5fde30
[V1] fix torch profiling for V1 offline scenarios ( #18445 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-05-28 04:16:30 +00:00
9a21e331ff
[Bugfix]: correctly propagate errors message caught at the chat_templating step to the client ( #18769 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-05-28 03:35:43 +00:00
3e9ce609bd
[Bugfix] Fix nomic max_model_len ( #18755 )
2025-05-27 20:29:53 -07:00
794ae1f551
[rocm] Fix wrong attention log ( #18764 )
...
Signed-off-by: Felix Marty <felmarty@amd.com >
2025-05-27 19:45:41 -07:00
d73a9457a5
[Core] Improve Tensor serialisation ( #18774 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-28 09:46:21 +08:00
a3896c7f02
[Build] Fixes for CMake install ( #18570 )
2025-05-27 20:49:24 -04:00
51e98e4ffd
[Bugfix] Disable prefix caching by default for benchmark ( #18771 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-05-28 08:18:09 +08:00
e56f44d9ec
Support datasets in vllm bench serve and sync with benchmark_[serving,datasets].py ( #18566 )
2025-05-27 19:59:48 -04:00
e0cbad4e30
[Neuron] Support quantization on neuron ( #18283 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
2025-05-27 22:10:33 +00:00
b48d5cca16
[CI/Build] [TPU] Fix TPU CI exit code ( #18282 )
...
Signed-off-by: Carol Zheng <cazheng@google.com >
2025-05-27 14:54:59 -07:00