f0945e311d
stash
Signed-off-by: Robert Shaw <robshaw@redhat.com>
2025-07-24 00:33:37 +00:00
4ec76caafa
updated
Signed-off-by: Robert Shaw <robshaw@redhat.com>
2025-07-23 20:02:41 +00:00
1588294a88
updated
Signed-off-by: Robert Shaw <robshaw@redhat.com>
2025-07-23 18:58:49 +00:00
e82e9afeb7
updated
Signed-off-by: Robert Shaw <robshaw@redhat.com>
2025-07-23 18:43:20 +00:00
10abfaf309
Merge branch 'fix-connector-agg' into debug-logging
2025-07-23 18:20:39 +00:00
9ff1a2b537
[BugFix] Fix KVConnector TP worker aggregation
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-07-23 18:29:06 +01:00
0abe10e4a7
updated
Signed-off-by: Robert Shaw <robshaw@redhat.com>
2025-07-23 15:21:46 +00:00
316b1bf706
[Tests] Add tests for headless internal DP LB (#21450)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-07-23 07:49:25 -07:00
7c734ee09b
[Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models. (#21364)
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
2025-07-23 06:34:37 -07:00
f59ec35b7f
[V1] Check all pooling tasks during profiling (#21299)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-23 05:53:26 -07:00
2671334d45
[Model] add Hunyuan V1 Dense Model support. (#21368)
Signed-off-by: Asher Zhang <asherszhang@tencent.com>
2025-07-23 03:54:08 -07:00
2cc5016a19
[Docs] Clean up v1/metrics.md (#21449)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
2025-07-23 03:37:25 -07:00
6929f8b437
[Misc] fixed nvfp4_moe test failures due to invalid kwargs (#21246)
Signed-off-by: Yang Chen <yangche@fb.com>
2025-07-23 01:41:43 -07:00
32ec9e2f2a
Mamba V2 Test not Asserting Failures. (#21379)
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
2025-07-23 01:40:27 -07:00
accac82928
[Sampler] Introduce logprobs mode for logging (#21398)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-07-23 01:39:25 -07:00
23637dcdef
[Docs] Fix bullets and grammars in tool_calling.md (#21440)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
2025-07-23 01:23:20 -07:00
6364af92f8
Fixed typo in profiling logs (#21441)
2025-07-23 01:18:54 -07:00
7aaa2bd5a8
[Bugfix] ensure tool_choice is popped when tool_choice:null is passed in json payload (#19679)
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
2025-07-23 00:30:05 -07:00
2f5c14de6a
add clear messages for deprecated models (#21424)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-07-23 00:03:16 -07:00
f002e9a870
[Cleanup] Only log MoE DP setup warning if DP is enabled (#21315)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-23 00:02:48 -07:00
a1f3610fc6
[Core] Add basic unit test for maybe_evict_cached_block (#21400)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-07-23 00:02:02 -07:00
4ecedd1806
[Bugfix] Fix nightly transformers CI failure (#21427)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-07-23 00:01:01 -07:00
107111a859
Changing "amdproduction" allocation. (#21409)
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
2025-07-22 20:48:31 -07:00
2dec7c1a5d
[Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported (#21420)
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
2025-07-22 20:34:50 -07:00
08d2bd78da
[BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update (#21414)
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
2025-07-22 20:33:57 -07:00
4f76a05f4f
[BugFix] Update python to python3 calls for image; fix prefix & input calculations. (#21391)
Signed-off-by: Eric Hanley <ericehanley@google.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-22 20:33:00 -07:00
f154bb9ff0
Simplify weight loading in Transformers backend (#21382)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-22 20:29:43 -07:00
3ec7170ff1
[Bugfix][ROCm][Build] Fix build regression on ROCm (#21393)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-07-22 20:27:41 -07:00
c401c64b4c
[CI/Build] Fix model executor tests (#21387)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-22 20:25:37 -07:00
b77c7d327f
[BugFix] Fix ray import error mem cleanup bug (#21381)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
2025-07-22 16:19:55 -07:00
35bc8bd5fb
[Misc] Copy HF_TOKEN env var to Ray workers (#21406)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-07-22 16:18:42 -07:00
4594fc3b28
[Model] Add Qwen3CoderToolParser (#21396)
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-07-22 15:05:57 -07:00
ae268b6326
Fix Flashinfer Allreduce+Norm enable disable calculation based on fi_allreduce_fusion_max_token_num (#21325)
Signed-off-by: XIn Li <xinli@nvidia.com>
2025-07-22 12:42:31 -07:00
35366ae57c
[CI/Build] Fix test failure due to updated model repo (#21375)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-22 08:39:35 -07:00
2226d5bd85
[Bugfix] Decode Tokenized IDs to Strings for hf_processor in llm.chat() with model_impl=transformers (#21353)
Signed-off-by: ariG23498 <aritra.born2fly@gmail.com>
2025-07-22 08:27:28 -07:00
44554a0068
Add tokenization_kwargs to encode for embedding model truncation (#21033)
2025-07-22 08:24:00 -07:00
226b452a20
Revert "[Refactor] Fix Compile Warning #1444-D (#21208)" (#21384)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-22 08:22:10 -07:00
f38ee34a0a
[feat] Enable mm caching for transformers backend (#21358)
Signed-off-by: raushan <raushan@huggingface.co>
2025-07-22 08:18:46 -07:00
b194557a6c
Adds parallel model weight loading for runai_streamer (#21330)
Signed-off-by: bbartels <benjamin@bartels.dev>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-07-22 08:15:53 -07:00
774d0c014b
[Perf] Cuda Kernel for Per Token Group Quant (#21083)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-22 07:27:15 -07:00
2c8db17cfd
[feat]: add SM100 support for cutlass FP8 groupGEMM (#20447)
Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-22 07:27:12 -07:00
4fb56914c5
[perf] Add fused MLA QKV + strided layernorm (#21116)
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-22 07:07:44 -07:00
0df4d9b06b
[Misc] unify variable for LLM instance v2 (#21356)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-07-22 06:32:36 -07:00
ed25054577
[Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool (#21222)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-07-22 06:17:47 -07:00
10904e6d75
[benchmark] Port benchmark request sent optimization to benchmark_serving (#21209)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-07-22 05:28:00 -07:00
a32237665d
[Core] Optimize update checks in LogitsProcessor (#21245)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-07-22 05:27:18 -07:00
bc8a8ce5ec
[Misc] Remove deprecated args in v0.10 (#21349)
Signed-off-by: Kebe <mail@kebe7jun.com>
2025-07-22 05:26:39 -07:00
32142b3c62
[Bugfix] Fix eviction cached blocked logic (#21357)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-07-22 01:18:40 -07:00
82b8027be6
Add arcee model (#21296)
Signed-off-by: alyosha-swamy <raghav@arcee.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-22 00:57:43 -07:00
3779eb8c81
[Feature][eplb] add verify ep or tp or dp (#21102)
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
2025-07-21 23:41:14 -07:00
9e23ad9655
Update fp4 quantize API (#21327)
Signed-off-by: Shu Wang <shuw@nvidia.com>
2025-07-21 23:40:21 -07:00
e69a92a1ce
[Bug] DeepGemm: Fix Cuda Init Error (#21312)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-21 23:36:18 -07:00
8425f785ad
[Misc] DeepEPHighThroughtput - Enable Inductor pass (#21311)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-21 23:35:45 -07:00
c17231e827
Fix kv_cache_dtype handling for out-of-tree HPU plugin (#21302)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
Co-authored-by: Chendi.Xue <chendi.xue@intel.com>
2025-07-21 23:35:14 -07:00
6e5b5ca580
[Refactor] Fix Compile Warning #1444-D (#21208)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-21 23:33:51 -07:00
488d8a986a
[V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible (#21300)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-07-21 23:31:18 -07:00
af376ca19d
[Core] Minimize number of dict lookup in _maybe_evict_cached_block (#21281)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-07-21 22:37:34 -07:00
e7b2042681
Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762) (#21334)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-07-21 21:49:01 -07:00
90f1e55421
[Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU (#21338)
Signed-off-by: ratnampa <ratnam.parikh@intel.com>
2025-07-21 21:48:27 -07:00
5e70dcd6e6
[Doc] Fix CPU doc format (#21316)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-21 21:47:49 -07:00
25d585ab7b
[XPU] Enable external_launcher to serve as an executor via torchrun (#21021)
Signed-off-by: chzhang <chaojun.zhang@intel.com>
2025-07-21 21:47:35 -07:00
8d0a01a5f2
[v1][sampler] Inplace logprobs comparison to get the token rank (#21283)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-07-21 13:47:47 -07:00
0ec82edda5
[perf] Speed up align sum kernels (#21079)
Signed-off-by: Himanshu Jaju <hj@mistral.ai>
2025-07-21 11:19:23 -07:00
005ae9be6c
Fix bad lm-eval fork (#21318)
2025-07-21 10:47:51 -07:00
29d1ffc5b4
[DP] Fix Prometheus Logging (#21257)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2025-07-21 09:11:35 -07:00
304dce7ec0
[Attention] Clean up iRoPE in V1 (#21188)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-07-21 09:10:30 -07:00
6ece16c4fe
[Misc] Add dummy maverick test (#21199)
Signed-off-by: Ming Yang <minos.future@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-21 09:08:09 -07:00
a0e827e07c
[BugFix] make utils.current_stream thread-safety (#21252) (#21253)
Signed-off-by: simpx <simpxx@gmail.com>
2025-07-21 09:07:36 -07:00
a15a50fc17
[CPU] Enable shared-memory based pipeline parallel for CPU backend (#21289)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-21 09:07:08 -07:00
6dda13c86b
[Misc] Add sliding window to flashinfer test (#21282)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-21 08:37:49 -07:00
6b46c4b653
Add Nvidia ModelOpt config adaptation (#19815)
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
2025-07-21 10:02:58 -04:00
d97841078b
[Misc] unify variable for LLM instance (#20996)
Signed-off-by: Andy Xie <andy.xning@gmail.com>
2025-07-21 12:18:33 +01:00
e6b90a2805
[Docs] Make tables more space efficient in supported_models.md (#21291)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-21 02:25:02 -07:00
be54a951a3
[Docs] Fix hardcoded links in docs (#21287)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-21 02:23:57 -07:00
042af0c8d3
[Model][1/N] Support multiple poolers at model level (#21227)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-21 02:22:21 -07:00
378d33c392
[Bugfix] Fix missing placeholder in logger debug (#21280)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-20 22:50:06 -07:00
940af1f03a
Add the instruction to run e2e validation manually before release (#21023)
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-07-20 22:29:18 -07:00
92615d7fe8
[Docs] Add RFC Meeting to Issue Template (#21279)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-07-20 21:58:07 -07:00
8188196a1c
[CI] Cleanup modelscope version constraint in Dockerfile (#21243)
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
2025-07-20 20:13:02 -07:00
7ba34b1241
[bugfix] fix syntax warning caused by backslash (#21251)
2025-07-20 17:12:10 +00:00
9499e26e2a
[Model] Support VLMs with transformers backend (#20543)
Signed-off-by: raushan <raushan@huggingface.co>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-07-20 13:25:50 +00:00
51ba839555
[Model] use AutoWeightsLoader for bart (#18299)
Signed-off-by: calvin chen <120380290@qq.com>
2025-07-20 08:15:50 +00:00
d1fb65bde3
Enable v1 metrics tests (#20953)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
2025-07-20 03:22:02 +00:00
3a1d8940ae
[TPU] support fp8 kv cache quantization (#19292)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-07-20 03:01:00 +00:00
2b504eb770
[Docs] [V1] Update docs to remove enforce_eager limitation for hybrid models. (#21233)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-07-19 16:09:58 -07:00
10eb24cc91
GLM-4 Update (#20736)
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Lu Fang <fanglu@fb.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Lu Fang <fanglu@fb.com>
2025-07-19 22:40:31 +00:00
2e8cbb58f3
[BugFix] Fix full cuda graph slot_mapping (#21228)
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
2025-07-19 14:13:18 -07:00
752c6ade2e
[V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small (#21217)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-19 13:53:17 -07:00
881e3cbe3b
[V1] [Hybrid] Enable piecewise CUDA Graph for mamba layers (#21194)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-07-19 19:27:21 +00:00
9f414a12ad
[BugFix] Make PD work with Ray (#21072)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
2025-07-19 08:46:50 -07:00
6a971ed692
[Docs] Update the link to the 'Prometheus/Grafana' example (#21225)
2025-07-19 06:58:07 -07:00
da6579bf41
[CI/CD][bugfix]fix: error argument to loads has incompatible type (#21223)
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: Sungjae Lee <sung-jae.lee@navercorp.com>
2025-07-19 05:16:48 -07:00
c81259d33a
Fix/remove some broken model executor tests (#21224)
Signed-off-by: Rabi Mishra <ramishra@redhat.com>
2025-07-19 12:15:07 +00:00
e3a0e43d7f
[bugfix] Fix auto thread-binding when world_size > 1 in CPU backend and refactor code (#21032)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-07-19 05:13:55 -07:00
b3d82108e7
[Bugfix][Frontend] Fix openai CLI arg middleware (#21220)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-07-19 02:40:38 -07:00
6d0734c562
[NVIDIA] Add SM100 Flashinfer MoE blockscale fp8 backend for low latency (#20645)
Signed-off-by: kaixih <kaixih@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-19 02:33:01 -07:00
7d94577138
Add torch golden impl for moe_align_block_size kernel test (#20653)
Signed-off-by: Shixian Cui <shixian@amazon.com>
Co-authored-by: Shixian Cui <shixian@amazon.com>
2025-07-19 02:32:36 -07:00
59f935300c
[BugFix] Fix potential cuda-graph IMA (#21196)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-07-19 02:18:47 -07:00
18e519ec86
[Bugfix] Fix ndarray video color from VideoAsset (#21064)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-07-19 02:17:16 -07:00
1eaff27815
[V0 deprecation] Remove long context LoRA (#21169)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-19 02:15:41 -07:00
cf8cc32674
Fix a couple of Voxtral tests (#21218)
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-07-19 09:13:41 +00:00
3a2cb2649d
[Misc][Tools][Benchmark] Add readme file for auto_tune script (#20779)
Signed-off-by: Chenyaaang <chenyangli@google.com>
2025-07-19 09:06:59 +00:00
3e04107d97
[Model] EXAONE 4.0 model support (#21060)
Signed-off-by: Deepfocused <rlawhdrhs27@gmail.com>
Signed-off-by: woongsik <rlawhdrhs27@gmail.com>
2025-07-19 14:25:44 +08:00
37bd8d6e4c
[Bug] DeepGemm: Fix TypeError: per_block_cast_to_fp8() missing 1 required positional argument: 'use_ue8m0' for SM100 (#21187)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-18 23:25:22 -07:00
468e2400fe
[BugFix][CPU] Fix TorchSDPABackendImpl doesn't have use_irope (#21200)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-07-18 23:18:48 -07:00
dcc6cfb991
[Kernel][Performance] Tweak MoE Batched silu_mul_fp8_quant_deep_gemm kernel (#21193)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-18 23:09:51 -07:00
dd572c0ab3
[V0 Deprecation] Remove V0 Spec Decode workers (#21152)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-18 21:47:50 -07:00
9ffe905a41
[Bugfix][Model] Fix LoRA for Mistral-Small-3.1-24B-Instruct-2503 (#21183)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-07-18 21:15:03 -07:00
9a9fda1423
[Core] Support Local Chunked Attention for Hybrid KV Cache (#19351)
Signed-off-by: Lucia Fang <fanglu@fb.com>
Signed-off-by: Lu Fang <fanglu@meta.com>
Signed-off-by: Lu Fang <fanglu@fb.com>
Co-authored-by: Lu Fang <fanglu@meta.com>
2025-07-18 20:48:38 -07:00
466e878f2a
[Quantization] Enable BNB support for more MoE models (#21100)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-18 17:52:02 -07:00
217937221b
Elastic Expert Parallel Initial Support (#20775)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-07-18 17:46:09 -07:00
5782581acf
[Bugfix] Voxtral on Blackwell GPUs (RTX 50 series) (#21077)
Signed-off-by: hax0r31337 <liulihaocaiqwq@gmail.com>
2025-07-18 18:40:18 -04:00
0f199f197b
[Core] Avoid KVCacheBlock.__eq__ invocations in FreeKVCacheBlockQueue (#21005)
Signed-off-by: Jialin Ouyang <jialino@meta.com>
2025-07-18 12:34:40 -07:00
b2eb2b5ad7
[Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6.0 (#19346)
Signed-off-by: rzou <zou3519@gmail.com>
2025-07-18 14:10:21 -04:00
21274ab476
[CI] Update CODEOWNERS for vllm/compilation (#21185)
Signed-off-by: Richard Zou <zou3519@gmail.com>
2025-07-18 06:51:12 -07:00
ed8cbfedf8
Let GraniteMoeAttention use YaRN (#21174)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-07-18 05:52:52 -07:00
45badd05d0
[Core] Set pooling params based on task and model (#21128)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-18 05:41:17 -07:00
4adc66f64d
[Bugfix] Allocate less memory in non-batched CUTLASS MoE (#21121)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
2025-07-18 18:55:52 +08:00
55ad648715
[Doc] Fix typo in model name (#21178)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-18 03:55:10 -07:00
5895afd780
[Bugfix] The special_tokens in tokenizer should also be controlled by do_lower_case in encoder_config. (#20750)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-07-18 09:10:47 +00:00
ca4eb82bcb
[Model] Re-add the implicit conversion feature for as_seq_cls_model (#21103)
Signed-off-by: wang.yuqi <noooop@126.com>
2025-07-18 07:15:07 +00:00
ba2dfbb0c2
[Misc] Make MM embedding merge interface explicit in model runner (#21147)
Signed-off-by: Roger Wang <hey@rogerw.me>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-18 07:13:57 +00:00
1bf65138f6
[benchmark] Sending request strictly follows the random intervals (#21108)
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
2025-07-18 06:22:08 +00:00
54cf1cae62
[Misc] Do not print async output warning for v1 (#21151)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-17 21:57:02 -07:00
5780121c95
[Perf] Add swap_ab to SM90 FP8 non-block CUTLASS moe grouped gemm (#20911)
Signed-off-by: Shixian Cui <shixian@amazon.com>
Co-authored-by: Shixian Cui <shixian@amazon.com>
2025-07-18 04:34:43 +00:00
c7d8724e78
[Core] FlashInfer CUTLASS fused MoE backend (NVFP4) (#20037)
Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-07-17 21:32:45 -07:00
b38baabcf9
[Doc] Add inplace weights loading example (#19640)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
2025-07-17 21:12:23 -07:00
89cab4d01f
[Attention] Make local attention backend agnostic (#21093)
2025-07-18 00:10:42 -04:00
b9a21e9173
[Docs] Update supported models documentation with missing models (#20844)
Signed-off-by: Lu Fang <fanglu@fb.com>
2025-07-17 20:12:13 -07:00
c4e3b12524
[Docs] Add minimal demo of Ray Data API usage (#21080)
Signed-off-by: Ricardo Decal <rdecal@anyscale.com>
2025-07-17 20:09:19 -07:00
8dfb45ca33
[Bugfix] Fix the tensor non-contiguous issue for Flashinfer TRT-LLM backend attention kernel (#21133)
2025-07-18 00:35:58 +00:00
8a8fc94639
[Log] Debugging Log with more Information (#20770)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-18 00:19:46 +00:00
4de7146351
[V0 deprecation] Remove V0 HPU backend (#21131)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-07-17 16:37:36 -07:00
ac9fb732a5
On environments where numa cannot be detected we get 0 (#21115)
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-07-17 18:52:17 +00:00
a3a6c695f4
[Misc] Qwen MoE model supports LoRA (#20932)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-17 18:32:52 +00:00
90bd2ab6e3
[Model] Update pooling model interface (#21058)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-17 16:05:40 +00:00
9fb2d22032
[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762)
Signed-off-by: ElizaWszola <ewszola@redhat.com>
2025-07-17 09:56:44 -04:00
2d6a38209b
[Docs] Move code block out of admonition now that it's short (#21118)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-17 06:12:29 -07:00
89e3c4e9b4
[Misc] Avoid unnecessary import (#21106)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-17 12:57:41 +00:00
fe8a2c544a
[Docs] Improve docstring formatting for FusedMoEParallelConfig.make (#21117)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-17 04:13:00 -07:00
4ef00b5cac
[VLM] Add Nemotron-Nano-VL-8B-V1 support (#20349)
Signed-off-by: Kyle Huang <kylhuang@nvidia.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-07-17 03:07:55 -07:00
5a7fb3ab9e
[Model] Add ToolParser and MoE Config for Hunyuan A13B (#20820)
Signed-off-by: Asher Zhang <asherszhang@tencent.com>
2025-07-17 09:10:09 +00:00
11dfdf21bf
[Kernel] DeepGemm MoE : Integrate triton permute / unpermute kernels (#20903)
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com>
2025-07-17 08:10:37 +00:00
fdc5b43d20
[Bugfix]: Fix final_res_batch list index out of range error (#21055)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-07-17 00:29:09 -07:00
c5b8b5953a
[Misc] Fix PhiMoE expert mapping (#21085)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-07-17 05:47:49 +00:00
4fcef49ec4
[V1] [KVConnector] Fix MultiprocExecutor worker output aggregation (#21048)
Signed-off-by: David Ben-David <davidb@pliops.com>
Co-authored-by: David Ben-David <davidb@pliops.com>
2025-07-17 13:29:45 +08:00
8a4e5c5f3c
[V1][P/D]Enhance Performance and code readability for P2pNcclConnector (#20906)
Signed-off-by: Abatom <abzhonghua@gmail.com>
2025-07-16 22:13:00 -07:00
76b494444f
[Attention] Refactor attention metadata builder interface (#20466)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-07-17 04:44:25 +00:00
28a6d5423d
[Bugfix] Fix Machete zero point issue for GPTQ models on SM90 (#21066)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-16 19:54:45 -07:00
58760e12b1
[TPU] Start using python 3.12 (#21000)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
2025-07-16 19:37:44 -07:00
a50d918225
[Docker] Allow FlashInfer to be built in the ARM CUDA Dockerfile (#21013)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-16 19:37:13 -07:00
c9ba8104ed
[Bugfix] weight loading use correct tp_group with patch_tensor_parallel_group (#21024)
Signed-off-by: KevinXiong-C <kevin_xiong1997@outlook.com>
2025-07-16 19:36:36 -07:00
4e7dfbe7b4
Update PyTorch to torch==2.7.1 for CUDA (#21011)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-17 02:30:44 +00:00
72ad273582
Remove torch_xla.tpu.version() from pallas.py. (#21065)
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
2025-07-17 00:25:26 +00:00
01513a334a
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor) (#12010)
Signed-off-by: Nir David <ndavid@habana.ai>
Signed-off-by: Uri Livne <ulivne@habana.ai>
Co-authored-by: Uri Livne <ulivne@habana.ai>
2025-07-16 15:33:41 -04:00
ac2bf41e53
[Model] Remove model sampler (#21059)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-16 19:03:37 +00:00
a931b4cdcf
Remove Qwen Omni workaround that's no longer necessary (#21057)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-16 16:25:23 +00:00
a0f8a79646
[fix] fix qwen image_embeds input (#21049)
Signed-off-by: h-avsha <avshalom.manevich@hcompany.ai>
2025-07-16 15:17:20 +00:00
18bdcf4113
feat - add a new endpoint get_tokenizer_info to provide tokenizer/chat-template information (#20575)
Signed-off-by: m-misiura <mmisiura@redhat.com>
2025-07-16 21:52:14 +08:00
1c3198b6c4
[Model] Consolidate pooler implementations (#20927)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-07-16 13:39:13 +00:00
260127ea54
[Docs] Add intro and fix 1-2-3 list in frameworks/open-webui.md (#19199)
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
2025-07-16 06:11:38 -07:00
d0dc4cfca4
Fix inadvertently silenced PP tests for mp, add DeepSeek V2/V3 model family to PP tests (#20831)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
2025-07-16 00:14:49 -07:00
d31a647124
[BugFix] Fix import error on non-blackwell machines (#21020)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-07-15 22:27:29 -07:00
85431bd9ad
[TPU] fix kv_cache_update kernel block size choosing logic (#21007)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-07-16 04:39:48 +00:00
c11013db8b
[Meta] Llama4 EAGLE Support (#20591)
Signed-off-by: qizixi <qizixi@meta.com>
Co-authored-by: qizixi <qizixi@meta.com>
2025-07-15 21:14:15 -07:00
1eb2b9c102
[CI] update typos config for CI pre-commit and fix some spells (#20919)
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
2025-07-15 21:12:40 -07:00
6ebf313790
Avoid direct comparison of floating point numbers (#21002)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2025-07-15 21:12:14 -07:00
cfbcb9ed87
[Voxtral] Add more tests (#21010)
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-15 21:11:49 -07:00
76ddeff293
[Doc] Remove duplicate docstring (#21012)
Signed-off-by: yewentao256 <zhyanwentao@126.com>
2025-07-15 20:09:13 -07:00
f46098335b
[Bugfix] Fix Mistral3 support on SM100/SM120 (#20998)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-15 20:08:41 -07:00
e9534c7202
[CI][HPU] update for v0 deprecate by switching to VLLM_TARGET_DEVICE=empty (#21006)
Signed-off-by: Chendi.Xue <chendi.xue@intel.com>
2025-07-15 20:07:05 -07:00
7976446015
Add Dockerfile argument for VLLM_USE_PRECOMPILED environment (#20943)
Signed-off-by: dougbtv <dosmith@redhat.com>
2025-07-15 19:53:57 -07:00
fcb9f879c1
[Bugfix] Correct per_act_token in CompressedTensorsW8A8Fp8MoECutlassM… (#20937)
Signed-off-by: Ming Yang <minos.future@gmail.com>
2025-07-15 19:53:42 -07:00
3ed94f9d0a
[Docs] Enhance Anyscale documentation, add quickstart links for vLLM (#21018)
Signed-off-by: Ricardo Decal <rdecal@anyscale.com>
2025-07-15 19:46:56 -07:00
fa839565f2
[Misc] Refactor: Improve argument handling for conda command (#20481)
Signed-off-by: reidliu41 <reid201711@gmail.com>
2025-07-15 19:43:19 -07:00
75a99b98bf
[Chore] Remove outdated transformers check (#20989)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-07-15 19:42:40 -07:00
b5c3b68359
[Misc] bump xgrammar version to v0.1.21 (#20992)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-07-15 19:42:16 -07:00
6cbc4d4bea
[Model] Add ModelConfig class for GraniteMoeHybrid to override default max_seq_len_to_capture (#20923)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-07-15 19:19:10 -07:00
153c6f1e61
[Frontend] Remove print left in FrontendArgs.add_cli_args (#21004)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-07-15 19:18:41 -07:00
34cda778a0
[Frontend] OpenAI Responses API supports input image (#20975)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-07-15 18:59:36 -06:00
30800b01c2
[Nvidia] Integrate SM100 cudnn prefill API to MLA prefill (#20411)
Signed-off-by: Elfie Guo <elfieg@nvidia.com>
Co-authored-by: Elfie Guo <eflieg@nvidia.com>
2025-07-15 17:56:45 -07:00
10be209493
[Bug Fix] get_distributed_init_method should get the ip from get_ip i… (#20889)
Signed-off-by: Chen Li <lcpingping@gmail.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-07-15 21:23:52 +00:00
19c863068b
[Frontend] Support cache_salt in /v1/completions and /v1/responses (#20981)
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
2025-07-15 21:01:04 +00:00
f29fd8a7f8
[BugFix] fix 3 issues: (1) using metadata for causal-conv1d, (2) indexing overflow in v1 vLLM, and (3) init_states in v0 (#20838)
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com>
2025-07-15 16:08:26 -04:00
ed10f3cea1
[ROCm] warpSize is being made non constexpr in ROCm 7.0 (#20330)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-07-15 14:01:44 -04:00
b637e9dcb8
Add full serve CLI reference back to docs (#20978)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-15 17:42:30 +00:00
1e36c8687e
[Deprecation] Remove nullable_kvs (#20969)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-15 17:21:50 +00:00
5bac61362b
Configure Gemini (#20971)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-15 09:37:05 -07:00
313ae8c16a
[Deprecation] Remove everything scheduled for removal in v0.10.0 (#20979)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-07-15 15:57:53 +00:00
c847e34b39
[CI/Build] Fix wrong path in Transformers Nightly Models Test ( #20994 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-15 08:53:16 -07:00
e7e3e6d263
Voxtral ( #20970 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-15 07:35:30 -07:00
4ffd963fa0
[v1][core] Support for attention free models ( #20811 )
...
Signed-off-by: Christian Pinto <christian.pinto@ibm.com >
2025-07-15 14:20:01 +00:00
56fe4bedd6
[Deprecation] Remove TokenizerPoolConfig ( #20968 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-15 14:00:50 +00:00
d91278181d
[doc] Add more details for Ray-based DP ( #20948 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-15 05:37:12 -07:00
20149d84d9
[MISC] Add init files for python package ( #20908 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-07-15 12:16:33 +00:00
3534c39a20
[V1] [Hybrid] Refactor mamba state shape calculation; enable V1 via cli ( #20840 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-15 04:04:35 -07:00
c586b55667
[TPU] Optimize kv cache update kernel ( #20415 )
...
Signed-off-by: Yifei Teng <tengyifei88@gmail.com >
2025-07-15 03:56:43 -07:00
33d560001e
[Docs] Improve documentation for ray cluster launcher helper script ( #20602 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-15 03:55:45 -07:00
f148c44c6a
[frontend] Refactor CLI Args for a better modular integration ( #20206 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-07-15 02:23:42 -07:00
235bfd5dfe
[Docs] Improve documentation for RLHF example ( #20598 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-15 01:54:10 -07:00
68d28e37b0
[frontend] Add --help=page option for paginated help output ( #20961 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-15 00:42:00 -07:00
37a7d5d74a
[Misc] Refactor AllReduceFusionPass. Remove parameter ( #20918 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-07-15 06:57:40 +00:00
d4d309409f
Implement Async Scheduling ( #19970 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-14 23:01:46 -07:00
85bd6599e4
[Model] Add AutoWeightsLoader support for BERT, RoBERTa ( #20534 )
...
Signed-off-by: Jennifer He <islandhe@gmail.com >
Signed-off-by: <islandhe@gmail.com >
Signed-off-by: Jen H <islandhe@gmail.com >
2025-07-15 13:34:24 +08:00
91b3d190ae
[cold start] replace VLLM_COMPILE_DEPYF with debug_dump_dir ( #20940 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-07-15 13:02:17 +08:00
fc017915f5
[Doc] Clearer mistral3 and pixtral model support description ( #20926 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-14 21:56:53 -07:00
9ad0a4588b
[Bugfix] Switch bailout logic for kv-cache-dtype with SM100 Flashinfer ( #20934 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-07-15 03:27:50 +00:00
016b8d1b7f
Enabled BnB NF4 inference on Gaudi ( #20172 )
...
Signed-off-by: Ruheena Suhani Shaik <rsshaik@habana.ai >
2025-07-14 20:26:08 -07:00
80305c1b24
[CI] Fix flaky test_streaming_response test ( #20913 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-14 20:15:15 -07:00
37e2ecace2
feat: add image zoom to improve image viewing experience ( #20763 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-14 20:14:23 -07:00
054c8657e3
[Docs] Add Kuberay to deployment integrations ( #20592 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-14 20:13:55 -07:00
d4170fad39
Use w8a8 quantized matmul Pallas kernel ( #19170 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-07-15 03:06:33 +00:00
946aadb4a0
[CI/Build] Split Entrypoints Test into LLM and API Server ( #20945 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-15 02:44:18 +00:00
bcdfb2a330
[Bugfix] Fix incorrect dispatch for CutlassBlockScaledGroupedGemm and DeepGEMM ( #20933 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-15 01:42:17 +00:00
ba8c300018
[BugFix] VLLM_DISABLE_COMPILE_CACHE=1 should disable all reads and writes from the cache ( #20942 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-07-15 01:26:18 +00:00
8cdc371217
SM100 Cutlass MLA decode with unrestricted num_heads (< 128) for DeepSeek TP ( #20769 )
...
Signed-off-by: Alexander Matveev <amatveev@redhat.com >
2025-07-15 01:06:38 +00:00
61e20828da
Fall back if flashinfer comm module not found ( #20936 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-07-14 23:11:18 +00:00
55e1c66da5
[Docs] remove outdated performance benchmark ( #20935 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-07-14 22:14:17 +00:00
86f3ac21ce
Fix overflow indexing in causal_conv1d kernel ( #20938 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-07-14 21:43:07 +00:00
149f2435a5
[Misc] Relax translations tests ( #20856 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-14 20:08:36 +00:00
c0569dbc82
[Misc] ModularKernel : Perform WeightAndReduce inside TritonExperts & DeepGemmExperts ( #20725 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-14 19:47:16 +00:00
8bb43b9c9e
Add benchmark dataset for mlperf llama tasks ( #20338 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-14 19:10:07 +00:00
559756214b
Change default model to Qwen3-0.6B ( #20335 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-07-14 16:54:52 +00:00
6d0cf239c6
[CI/Build] Add Transformers nightly tests in CI ( #20924 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-14 16:33:17 +00:00
3fc964433a
[Misc] Clean up Aimv2 config registration in Ovis config ( #20921 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-14 15:36:43 +00:00
0caf61c08a
[CI] Update codeowner for compilation code ( #20929 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-14 08:33:19 -07:00
667624659b
[CI] cc folks on changes to vllm/compilation ( #20925 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-07-14 07:52:17 -07:00
38efa28278
[Model] Add Ling implementation ( #20680 )
...
Signed-off-by: vito.yy <vito.yy@antgroup.com >
2025-07-14 22:10:32 +08:00
e8cc53af5e
[Misc] Log the reason for falling back to FlexAttention ( #20699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-14 04:16:51 -07:00
a4851cfe68
[Bugfix]: Fix messy code when using logprobs ( #20910 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-14 11:06:45 +00:00
9887e8ec50
[Misc] Remove unused function ( #20909 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-14 10:48:55 +00:00
f326ab9c88
[Bugfix] Bump up mistral_common to support v13 tokenizer ( #20905 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-14 10:45:03 +00:00
dcf2a5e208
[CI/Build] Fix OOM issue in Jina-VL test ( #20907 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-14 10:32:35 +00:00
1e9438e0b0
[MISC] Move bind_kv_cache to worker module ( #20900 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-07-14 09:40:00 +00:00
697ef765ee
[Refactor][V1] Move outlines utils for V1 imports ( #20878 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-07-14 00:58:35 -07:00
a99b9f7dee
[Quantization] add BNB for MixtralForCausalLM ( #20893 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-14 07:34:34 +00:00
c488b928a7
[ROCm] [Bugfix] [Critical]: Fix mamba compilation bug ( #20883 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-07-14 15:23:28 +08:00
2c7fa47161
Fix: Add missing EOFError handling in CLI complete command ( #20896 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-14 07:09:57 +00:00
88fc8a97e3
Removing redundant python version check ( #20888 )
...
Signed-off-by: Dannyso05 <dansong1177@gmail.com >
2025-07-14 06:15:05 +00:00
66f6fbd393
[Prefix Cache] Add reproducible prefix-cache block hashing using SHA-256 + CBOR (64bit) ( #20511 )
...
Signed-off-by: Maroon Ayoub <maroon.ayoub@ibm.com >
2025-07-14 02:45:31 +00:00
8632e831ba
[Core] Add update_config RPC method ( #20095 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-07-14 00:49:18 +00:00
4bbfc36b16
[V1] Hybrid allocator without prefix caching ( #20661 )
...
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
2025-07-13 16:55:14 +00:00
80d38b8ac8
[V1] [ROCm] [AITER] Upgrade AITER to commit 916bf3c and bugfix APIs ( #20880 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-07-13 15:19:32 +00:00
211b6a6113
[Bugfix] fix define of RerankDocument ( #20877 )
...
Signed-off-by: liuchenlong <liuchenlong@xiaohongshu.com >
Co-authored-by: liuchenlong <liuchenlong@xiaohongshu.com >
2025-07-13 14:32:40 +00:00
247102f07f
[Bugfix] Fix: add patch_rope_scaling after hf override ( #20857 )
...
Signed-off-by: Wang Siyuan <wsy0227@sjtu.edu.cn >
Signed-off-by: Wang Siyuan <sywang0227@gmail.com >
2025-07-13 00:13:25 -07:00
bd4c1e6fdb
Support for LlamaForSequenceClassification ( #20807 )
...
Signed-off-by: thechaos16 <thechaos16@gmail.com >
2025-07-13 00:09:34 -07:00
99b4f080d8
Renable google/gemma-3-1b-it accuracy test. ( #20866 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-12 21:48:56 -07:00
020f58abcd
[Core] Support multiple tasks per model ( #20771 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-12 19:40:11 -07:00
c1acd6d7d4
[Refactor] Change the way of import triton ( #20774 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-12 19:39:55 -07:00
3b3b778d4a
[Bugfix] Fix a couple PPLX+CUTLASS MoE bugs ( #20825 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
2025-07-12 19:39:14 -07:00
42d440c22b
[Perf] Use Triton instead of Torch for DeepGEMM Per Token Group Quant ( #20841 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-12 19:38:45 -07:00
f45a332886
[Sched] Enhance the logic to remove stopped requests from queues ( #20739 )
2025-07-12 15:33:13 -07:00
6e2c176e1f
[Bugfix] Restrict Machete to only run on Hopper ( #20830 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-12 17:34:40 +00:00
a86754a12b
[docs] convert supported configs to table ( #20858 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-12 06:54:50 -07:00
c2a2f19aba
[Bugfix] Fix Tensor Parallelism Padding Consistency in Granite Models ( #20843 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-07-12 06:11:30 -07:00
2c11a738b3
[Model] New model support for microsoft/Phi-4-mini-flash-reasoning ( #20702 )
...
Signed-off-by: Congcong Chen <congcongchen@microsoft.com >
2025-07-12 06:02:10 -07:00
b639327ad9
Revert "Use NVCC --compress-mode to reduce binary size by 30% #20694 " ( #20853 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 23:07:35 -07:00
4afe687a82
Enable ModelOpt Llama4 fp8 checkpoint deployment ( #20419 )
...
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com >
2025-07-11 23:07:16 -07:00
5de8d9f111
Remove extra tensor on CPU ( #20693 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-07-12 14:06:34 +08:00
c1c8ca57ff
[cold start time] add envs.VLLM_COMPILE_DEPYF to guard decompile ( #20790 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-07-11 23:06:13 -07:00
a3a5a47e48
[Bugfix] Fix torch.compile x LoRA for PyTorch 2.8 ( #20823 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-07-11 23:06:04 -07:00
fb25e95688
[Docs] Update basic.md ( #20846 )
2025-07-11 23:05:32 -07:00
0d4891cd03
[Bug] Fix DeepGemm for EP low latency case ( #20833 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-11 23:05:12 -07:00
f56d2996ca
[Misc] Respect no_use_tqdm_on_load flag while capturing CUDA graph ( #20834 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-07-11 23:04:45 -07:00
147afb448b
[Bugfix] Replace unavailable video url in multimodal test ( #20854 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-12 05:25:39 +00:00
3c7d942da8
[Frontend] Abstract prompt and SpeechToTextConfig for transcriptions models ( #20637 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-11 21:33:26 -07:00
890323dc1b
[Bugfix] : Fix typo - logger.warn_once -> logger.warning_once ( #20852 )
2025-07-11 20:56:24 -07:00
01cae37713
[CI/Build] Ensure compatability with Transformers v4.53 ( #20541 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-11 20:53:07 -07:00
11c0198615
[Bugfix] Fix tensor parallel issue in Qwen3 reranker weight loading ( #20682 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-07-11 20:52:43 -07:00
b1235c3e10
[Bugfix] Lazy import fused_experts in BitsAndBytesMoEMethod to avoid break not-cuda-alike devices ( #20822 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-11 20:52:05 -07:00
44d02f54db
[Misc] Restrict deep_gemm's log output ( #20827 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-11 20:50:42 -07:00
a8593237c0
Add pynccl all-gatherv and reducescatterv ( #20154 )
...
Signed-off-by: Trevor Morris <tmorris@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-11 18:59:23 -07:00
fc0f41d10a
Integration SM100 FlashInfer fused allreduce RMSNorm ( #20691 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-07-11 18:58:15 -07:00
7b828e30d5
[CI Bug] Fix Async Engine, Inputs, Utils, Worker Test: 'State' object has no attribute 'enable_server_load_tracking' ( #20845 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-11 18:57:24 -07:00
5f0af36af5
Update kimi-k2 tool calling docs, enable unit tests ( #20821 )
...
Signed-off-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
2025-07-11 20:16:14 +00:00
0d21b2664c
[Bugfix] Fix OOM in language generation test ( #20814 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-11 11:21:52 -07:00
9907fc4494
[Docs] Data Parallel deployment documentation ( #20768 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-11 09:42:10 -07:00
d47661f0cd
[Kernel] Basic tuned configs for NVFP4 CUTLASS dense GEMM ( #20646 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 10:05:33 -06:00
53fa457391
[Misc] Add unit tests for MoE ModularKernel combinations + Profiling utility ( #20449 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-11 07:51:46 -07:00
6fb162447b
[doc] fix ordered list issue ( #20819 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-11 06:49:46 -07:00
66177189c5
[Bugfix] Add missing field to TritonLanguagePlaceholder ( #20812 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-11 05:25:11 -07:00
b4f0b5f9aa
Temporarily suspend google/gemma-3-1b-it. ( #20722 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-11 11:21:26 +00:00
cbd14ed561
[Bugfix] Refactor /invocations to be task-agnostic ( #20764 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-11 03:20:54 -07:00
7bd4c37ae7
[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100). ( #19825 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: shuw <shuw@nvidia.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-11 09:23:23 +00:00
8020e98c9f
[Quantization][1/N] MoE support BNB-Inflight Quantization ( #20061 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-11 08:01:13 +00:00
762be26a8e
[Bugfix] Upgrade depyf to 0.19 and streamline custom pass logging ( #20777 )
...
Signed-off-by: Luka Govedic <lgovedic@redhat.com >
Signed-off-by: luka <lgovedic@redhat.com >
2025-07-11 00:15:22 -07:00
6a9e6b2abf
[doc] fold long code block ( #20795 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-10 23:16:41 -07:00
5d09152ff1
[V1] Enable Mamba2 layers other than MambaMixer2 in the v1 engine ( #20660 )
...
Signed-off-by: nopperl <54780682+nopperl@users.noreply.github.com >
2025-07-11 05:53:31 +00:00
31d5c1797f
[Perf][fp8] Use CustomOp abstraction for fp8 quant for better perf ( #19830 )
...
Signed-off-by: Luka Govedic <lgovedic@redhat.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-11 04:56:28 +00:00
35514b682a
[XPU] XCCL support enabled in torch 2.8.0.dev nightly builds ( #20705 )
...
Signed-off-by: ratnampa <ratnam.parikh@intel.com >
2025-07-10 20:39:52 -07:00
e2de455c34
[Feature] Integrate SM100 DeepGEMM support ( #20087 )
2025-07-10 20:18:05 -07:00
5b032352cc
[Attention] MLA - Flashinfer Ragged Prefill ( #20034 )
2025-07-10 20:17:47 -07:00
922f316441
[Model] Support HF format of minimax ( #20211 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 02:55:21 +00:00
5923ab9524
[fix]: disable cutlass block scaled group gemm for EP ( #20781 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
2025-07-11 02:39:18 +00:00
0cf893cae1
Add kimi-k2 tool parser ( #20789 )
...
Signed-off-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@moonshot.cn >
Co-authored-by: wangzhengtao <wangzhengtao@msh.team >
2025-07-11 10:36:23 +08:00
cf75cd2098
[CI Bugfix] Specify same TORCH_CUDA_ARCH_LIST for flashinfer aot and install ( #20772 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-11 01:16:01 +00:00
b854321ffe
[Docs] Lazy import gguf ( #20785 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-07-10 16:06:37 -07:00
5b6fe23d05
[Bugfix][Benchmark] Make sure the output length > 0 when testing prefill workload. ( #20786 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-10 14:52:46 -07:00
f0c98cae27
[Misc] MoE ModularKernel : Introduce TopKWeightAndReduce ( #20648 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-10 14:40:38 -07:00
574ad60db9
[KVConnector] Always call connector clear_metadata() at end of step ( #20756 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: David Ben-David <sdavidbd@gmail.com >
2025-07-10 22:37:27 +01:00
fdadb6f43a
[Bugfix] Fused MoE Modular Kernel chunking loop ( #20392 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-10 20:31:10 +00:00
41060c6e08
[Core] Add Support for Default Modality Specific LoRAs [generate / chat completions] ( #19126 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-07-10 21:09:37 +01:00
3de2ed767f
[Bugfix] Remove assertion of expert_map being None ( #20714 )
...
Signed-off-by: Ming Yang <yming@meta.com >
Signed-off-by: Ming Yang <minos.future@gmail.com >
2025-07-10 19:55:22 +00:00
299252ea82
[CI] Fix pre commit issue ( #20782 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-10 12:48:13 -07:00
d6902ce79f
[V0][V1][Core] Add outlines integration for V1, and update V0 integration. ( #15975 )
...
Signed-off-by: Nathan Hoos <thwackyy.y@gmail.com >
2025-07-10 15:30:26 -04:00
5e53c89a74
[Bugfix] [CI] Fix Tensorizer LoRA test ( #20760 )
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com >
2025-07-10 19:07:06 +00:00
c66e38ea4c
[Test] Remove docker build from test. ( #20542 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-10 11:21:58 -07:00
251595368f
Fix DeepSeek-R1-0528 chat template ( #20717 )
...
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
Co-authored-by: Benjamin Merkel <benjamin.merkel@tngtech.com >
2025-07-10 17:47:36 +00:00
4bed167768
[Model][VLM] Support JinaVL Reranker ( #20260 )
...
Signed-off-by: shineran96 <shinewang96@gmail.com >
2025-07-10 10:43:43 -07:00
b140416abf
[Model] Add reason parser for Hunyuan A13B Model. ( #20625 )
...
Signed-off-by: Asher Zhang <asherszhang@tencent.com >
2025-07-10 16:33:26 +00:00
5b8366b61a
[ROCm][Regression] Remove tensor creation that harms performance on ROCm ( #20741 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-10 09:22:23 -07:00
c7753a9809
[Hardware][CPU] Vllm int8 quantization enablement for ARM CPU ( #14129 )
...
Signed-off-by: nishith-fujitsu <nishith.jaiswal@fujitsu.com >
2025-07-10 15:59:04 +00:00
4b9a9435bb
Update Dockerfile FlashInfer to v0.2.8rc1 ( #20718 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-10 08:09:02 -07:00
3482fd7e4e
[Doc] Add engine args back in to the docs ( #20674 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-10 08:02:40 -07:00
77f77a951e
[Misc] Clean up mark to fork process in BNB tests ( #20692 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-10 13:59:40 +00:00
1a4f35e2ea
Normalize lm-eval command between baseline and correctness test ( #18560 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-10 13:27:32 +00:00
be1e128dfb
[CI Bugfix] Skip failing Tensorizer+LoRA test ( #20724 )
2025-07-10 21:15:03 +09:00
65393ee064
[doc] fix ordered list ( #20749 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-10 03:13:52 -07:00
dc221ad72d
[Bugfix][Build][Non-CUDA] Only referencing CMAKE_CUDA_COMPILER_VERSION on CUDA where it is defined ( #20738 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-07-10 02:58:11 -07:00
7571a4a7e5
[CI/Build] Fix Basic Models Test ( #20728 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-10 09:57:19 +00:00
f67d986dd1
[Misc] loose new-model tagger conditions ( #20747 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-10 02:54:47 -07:00
cc876d0f29
[KVConnector] Aggregate finished requests on the scheduler ( #19555 )
...
Signed-off-by: Or Ozeri <oro@il.ibm.com >
2025-07-10 09:22:18 +01:00
fdfd409f8f
[TPU][Core]Make load weight exceed hbm error more instructive for customers ( #20644 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-07-10 07:01:17 +00:00
ffbcc9e757
[BugFix] Fix VllmConfig() construction on all platforms ( #20695 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-10 07:00:20 +00:00
59389c927b
[BugFix][CPU] Fix CPU worker dependency on cumem_allocator ( #20696 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-10 14:24:20 +08:00
8f2720def9
[Frontend] Support Tool Calling with both tool_choice='required' and $defs. ( #20629 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-10 13:56:35 +08:00
ad6c2e1a0b
Correct PPMissingLayer handling in Deepseek-V2-Lite PP deployment ( #20665 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-07-09 20:34:40 -07:00
49e8c7ea25
Use NVCC --compress-mode to reduce binary size by 30% ( #20694 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-09 18:26:48 -07:00
805d62ca88
[Misc] DP : Add ExpertTokensMetadata ( #20332 )
...
Signed-off-by: Varun <vsundarr@redhat.com >
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun <vsundarr@redhat.com >
2025-07-10 00:33:14 +00:00
b7d9e9416f
[CI/Build] Fix FlashInfer double build in Dockerfile ( #20651 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-09 17:41:56 -06:00
7c12a765aa
[Misc] Simplify the prefix caching logic on draft tokens ( #20701 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-09 14:48:35 -07:00
cd587c93ef
[BugFix]: Properly set engine_id when using multi connector ( #19487 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: leiyiming <leiyiming@kingsoft.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-07-09 20:32:44 +00:00
332d4cb17b
[Feature][Quantization] MXFP4 support for MOE models ( #17888 )
...
Signed-off-by: Felix Marty <felmarty@amd.com >
Signed-off-by: Bowen Bao <bowenbao@amd.com >
Signed-off-by: Felix Marty <Felix.Marty@amd.com >
Co-authored-by: Bowen Bao <bowenbao@amd.com >
2025-07-09 13:19:02 -07:00
bf03ff3575
[Kernel] Add Conch backend for mixed-precision linear layer ( #19818 )
...
Signed-off-by: Jacob Manning <jmanning+oss@stackav.com >
2025-07-09 13:17:55 -07:00
47043eb678
[Kernel] Triton implementation of causal-conv1d for Mamba-based models ( #18218 )
...
Signed-off-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com >
Co-authored-by: Tuan M. Hoang-Trong <tmhoangt@us.ibm.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-07-09 12:53:55 -07:00
31b96d1c64
Support Llama 4 for cutlass_moe_fp4 ( #20453 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-09 15:53:38 -04:00
e59ba9e142
[CI/Build] Enlarge tolerance for a CPU multi-modal test ( #20684 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-09 17:48:52 +00:00
403b481573
Remove heading form installation inc.md file ( #20697 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-09 10:42:51 -07:00
138709f8d1
[Doc] Update CPU doc ( #20676 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-09 10:28:30 -07:00
0bbac1c1b4
[Bench] Add NVFP4 GEMM benchmark script ( #20578 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-09 13:23:48 -04:00
a3e4e85ece
[XPU][CI] enhance xpu test support ( #20652 )
...
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com >
Co-authored-by: zhenwei-intel <zhenweiliu@habana.ai >
2025-07-09 16:53:09 +00:00
eb58f5953d
[TPU][Bugfix] fix test_pallas ( #20666 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-09 09:32:48 -07:00
4ac9c33f78
[Bugfix] Fix handling of Tensorizer arguments for LoadConfig ( #20643 )
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com >
2025-07-09 15:36:37 +00:00
efe73d0575
[doc] update doc format ( #20673 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-09 08:08:19 -07:00
853487bc1b
[Docs] Improve docs for RLHF co-location example ( #20599 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-09 08:06:43 -07:00
9ff2af6d2b
[Benchmark] Parameterization of streaming loading of multimodal datasets ( #20528 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-07-09 13:35:16 +00:00
70ca5484f5
[Doc] Update notes ( #20668 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-09 03:46:36 -07:00
5358cce5ff
[V1] [Doc] Update V1 docs for Mamba models ( #20499 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-07-09 01:02:41 -07:00
2155e95ef1
[Bugfix] Fix the issue where reasoning_content is None when Thinkng is enabled and tool_choice is set to 'required'. ( #20662 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-09 07:39:58 +00:00
f95570a52d
[Docs] fix minimax tool_calling docs error ( #20667 )
...
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-07-09 00:37:07 -07:00
b6e7e3d58f
[Intel GPU] support ray as distributed executor backend for XPU. ( #20659 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-07-09 00:36:58 -07:00
e760fcef22
[XPU] Use spawn with XPU multiprocessing ( #20649 )
...
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com >
2025-07-09 00:34:28 -07:00
6bbf1795b7
[Misc] Fix the size of batched_dummy_mm_inputs in profile_run ( #20434 )
...
Signed-off-by: bk-201 <joy25810@foxmail.com >
2025-07-08 20:15:44 -07:00
9e0ef888f0
Fix bullets in incremental_build.md ( #20642 )
2025-07-09 11:03:41 +08:00
97abeb1daa
[feat] enable SM100 CUTLASS block scaled group gemm for smaller batch sizes ( #20640 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
2025-07-09 11:03:35 +08:00
34dad19e7b
[Bugfix] set default set cuda_graph_sizes to min(self.max_num_seqs * 2, 512) ( #20628 )
...
Signed-off-by: izhuhaoran <izhuhaoran@qq.com >
2025-07-09 11:02:51 +08:00
6db31e7a27
[Hardware][PPC64LE] Enable V1 for ppc64le and ARM ( #20554 )
...
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com >
2025-07-08 20:00:41 -07:00
977180c912
[Docs] Improve documentation for multi-node service helper script ( #20600 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-08 19:44:26 -07:00
c40784c794
[BugFix][Intel GPU] Use refactored API for dist_backend in V1 worker ( #20596 )
...
Signed-off-by: ratnampa <ratnam.parikh@intel.com >
2025-07-08 19:44:23 -07:00
baed180aa0
[tech debt] Revisit lora request model checker ( #20636 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-07-09 09:42:41 +08:00
0b407479ef
[misc]refactor Platform.set_device method ( #20262 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-07-09 01:39:47 +00:00
5eaf570050
Replace multiply_add with homogeneous_multiply_add to Address Clang Template Parameter Issue ( #20142 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-09 00:30:18 +00:00
d8ee5a2ca4
[TPU][Bugfix] disable phi-3 test ( #20632 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-08 23:14:26 +00:00
b9fca83256
[Bugfix] Fix GLM-4.1-V video prompt update ( #20635 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-08 23:13:58 +00:00
32dffc2772
[Core] Rename get_max_tokens_per_item for backward compatibility ( #20630 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-08 23:11:30 +00:00
c438183e99
[Bugfix] Fix topk_ids indices_type for CUTLASS w8a8 FP8 MoE ( #20166 )
...
Signed-off-by: Ming Yang <yming@meta.com >
2025-07-08 23:10:57 +00:00
baba0389f7
[CI] Increase the threshold of the MTEB RERANK tests ( #20615 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-08 08:10:11 -07:00
c6c22f16d3
Revert invalid spellchecker fix on deepseek_vl2 ( #20618 )
2025-07-08 15:07:14 +00:00
dd382e0fe3
[Model] Implement missing get_language_model for Keye-VL ( #20631 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-08 07:47:46 -07:00
849590a2a7
Update torch/xla pin to 20250703 ( #20589 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-07-08 07:44:02 -07:00
a4c23314c0
[xpu]feat: support multi-lora on xpu ( #20616 )
...
Signed-off-by: yan <yan.ma@intel.com >
2025-07-08 22:07:10 +08:00
b942c094e3
Stop using title frontmatter and fix doc that can only be reached by search ( #20623 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-08 03:27:40 -07:00
b4bab81660
Remove unnecessary explicit title anchors and use relative links instead ( #20620 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-08 02:49:13 -07:00
b91cb3fa5c
[Docs] Improve documentation for Deepseek R1 on Ray Serve LLM ( #20601 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-08 02:09:06 -07:00
71d1d75b7a
[PD][Nixl] Remote consumer READ timeout for clearing request blocks ( #20139 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-08 08:56:40 +01:00
72d14d0eed
[Frontend] [Core] Integrate Tensorizer in to S3 loading machinery, allow passing arbitrary arguments during save/load ( #19619 )
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com >
Co-authored-by: Eta <esyra@coreweave.com >
2025-07-07 22:47:43 -07:00
e34d130c16
[TPU] Temporary fix vmem oom for long model len by reducing page size ( #20278 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-07-08 05:16:16 +00:00
7721ef1786
[CI/Build][CPU] Fix CPU CI and remove all CPU V0 files ( #20560 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-07 22:13:44 -07:00
8369b7c2a9
[Misc] improve error msg ( #20604 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-07 21:45:18 -07:00
3eb4ad53f3
[Docs] Add Anyscale to frameworks ( #20590 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-07 20:09:13 -07:00
90a2769f20
[Docs] Add Ray Serve LLM section to openai compatible server guide ( #20595 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-07 20:08:05 -07:00
e60d422f19
[Docs] Improve docstring for ray data llm example ( #20597 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-07 20:06:26 -07:00
0d914c81a2
[Docs] Rewrite offline inference guide ( #20594 )
...
Signed-off-by: Ricardo Decal <rdecal@anyscale.com >
2025-07-07 20:06:02 -07:00
6e428cdd7a
[Doc] Syntax highlight request responses as JSON instead of bash ( #20582 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 20:02:45 -07:00
93b9d9f499
[Bugfix]: Fix messy code when using logprobs ( #19209 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-07-08 11:02:15 +08:00
af107d5a0e
Make distinct code and console admonitions so readers are less likely to miss them ( #20585 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 19:55:28 -07:00
31c5d0a1b7
[Optimize] Don't send token ids when kv connector is not used ( #20586 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-07 19:04:54 -07:00
afb7cff1b9
[Bugfix] Fix Maverick correctness by filling zero to cache space in cutlass_moe ( #20167 )
...
Signed-off-by: Ming Yang <yming@meta.com >
2025-07-08 01:07:22 +00:00
d2e841a10a
[Misc] Improve logging for dynamic shape cache compilation ( #20573 )
...
Signed-off-by: kyolebu <kyu@redhat.com >
2025-07-08 00:48:09 +00:00
14601f5fba
[Config] Refactor mistral configs ( #20570 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-07-07 15:25:10 -07:00
042d131f39
Fix links in multi-modal model contributing page ( #18615 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 21:13:52 +00:00
8e807cdfa4
[Misc] feat output content in stream response ( #19608 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-07 20:45:10 +00:00
e601efcb10
[Misc] Add fully interleaved support for multimodal 'string' content format ( #14047 )
...
Signed-off-by: drobyshev.anton <drobyshev.anton@wb.ru >
Co-authored-by: drobyshev.anton <drobyshev.anton@wb.ru >
2025-07-07 19:43:08 +00:00
22dd9c2730
[Kernel] Optimize Prefill Attention in Unified Triton Attention Kernel ( #20308 )
...
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com >
2025-07-07 19:08:12 +00:00
a6d795d593
[DP] Copy environment variables to Ray DPEngineCoreActors ( #20344 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-07-07 10:14:22 -07:00
a37d75bbec
[Front-end] microbatch tokenization ( #19334 )
...
Signed-off-by: zt2370 <ztang2370@gmail.com >
2025-07-07 17:54:10 +01:00
edd270bc78
[Bugfix] Prevent IndexError for cached requests when pipeline parallelism is disabled ( #20486 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-07-07 09:41:15 -07:00
110df74332
[Model][Last/4] Automatic conversion of CrossEncoding model ( #19675 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-07 14:46:04 +00:00
1ad69e8375
[Doc] Fix some MkDocs snippets used in the installation docs ( #20572 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 07:44:34 -07:00
b8a498c9b2
[Doc] Add outline for content tabs ( #20571 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 07:43:26 -07:00
923147b5e8
[Doc] Fix internal links so they don't always point to latest ( #20563 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 04:15:50 -07:00
45877ef740
[Doc] Use gh-pr and gh-issue everywhere we can in the docs ( #20564 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 03:54:22 -07:00
6e4bef1bea
[Doc] Remove extra whitespace from CI failures doc ( #20565 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-07-07 03:35:47 -07:00
4ff79a136e
[Misc] Set the minimum openai version ( #20539 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-07 09:15:26 +00:00
448acad31e
[Misc] remove unused jinaai_serving_reranking ( #18878 )
...
Signed-off-by: Abirdcfly <fp544037857@gmail.com >
2025-07-07 09:14:12 +00:00
eb0b2d2f08
[Docs] Clean up tables in supported_models.md ( #20552 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-07 01:46:31 -07:00
3112271f6e
[XPU] log clean up for XPU platform ( #20553 )
...
Signed-off-by: yan <yan.ma@intel.com >
2025-07-07 01:38:22 -07:00
1fd471e957
Add docstrings to url_schemes.py to improve readability ( #20545 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-07 08:31:49 +00:00
2c5ebec064
[XPU][CI] add v1/core test in xpu hardware ci ( #20537 )
...
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com >
2025-07-07 01:16:40 -07:00
2e610deb72
[CI/Build] Enable phi2 lora test ( #20540 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-07 05:10:41 +00:00
6e2c19ce22
[Refactor]Abstract Platform Interface for Distributed Backend and Add xccl Support for Intel XPU ( #19410 )
...
Signed-off-by: dbyoung18 <yang5.yang@intel.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2025-07-07 04:32:32 +00:00
47db8c2c15
[Misc] add a tip for pre-commit ( #20536 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-06 19:42:06 -07:00
462b269280
Implement OpenAI Responses API [1/N] ( #20504 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-06 18:32:13 -07:00
c18b3b8e8b
[Bugfix] Add use_cross_encoder flag to use correct activation in ClassifierPooler ( #20527 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-06 14:01:48 -07:00
9528e3a05e
[BugFix][Spec Decode] Fix spec token ids in model runner ( #20530 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-06 19:44:52 +00:00
9fb52e523a
[V1] Support any head size for FlexAttention backend ( #20467 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-06 09:54:36 -07:00
e202dd2736
[V0 deprecation] Remove V0 CPU/XPU/TPU backends ( #20412 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Co-authored-by: Li, Jiang <jiang1.li@intel.com >
2025-07-06 08:48:13 -07:00
43813e6361
[Misc] call the pre-defined func ( #20518 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-06 10:25:29 +00:00
cede942b87
[Benchmark] Add support for multiple batch size benchmark through CLI in benchmark_moe.py ( #20516 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-07-06 09:20:11 +00:00
fe1e924811
[Frontend] Support image object in llm.chat ( #19635 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
Signed-off-by: Flora Feng <4florafeng@gmail.com >
2025-07-06 06:47:13 +00:00
4548c03c50
[TPU][Bugfix] fix the MoE OOM issue ( #20339 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-05 21:19:09 -07:00
40b86aa05e
[BugFix] Fix: ImportError when building on hopper systems ( #20513 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-07-06 12:17:30 +08:00
432870829d
[Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe ( #20509 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-07-06 12:08:30 +08:00
f73d02aadc
[BUG] Fix #20484. Support empty sequence in cuda penalty kernel ( #20491 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai >
2025-07-05 19:38:02 -07:00
c5ebe040ac
test_attention compat with coming xformers change ( #20487 )
...
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-07-05 19:37:59 -07:00
8d763cb891
[Misc] remove unused import ( #20517 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-05 19:17:06 -07:00
cf4cd53982
[Misc] Add logger.exception for TPU information collection failures ( #20510 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-05 07:24:32 -07:00
32c9be2200
[v1] Re-add fp32 support to v1 engine through FlexAttention ( #19754 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-05 09:41:10 +00:00
8aeaa910a2
Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod ( #20507 )
...
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-07-05 14:03:20 +08:00
906e05d840
[Misc] Remove the unused LoRA test code ( #20494 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-05 13:48:16 +08:00
ef9a2990ae
[doc] small fix ( #20506 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-04 20:56:39 -07:00
7e90870491
[Misc] Add security warning for development mode endpoints ( #20508 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-04 20:52:13 -07:00
d3f05c9248
[Doc] fix mutltimodal_inputs.md gh examples link ( #20497 )
...
Signed-off-by: Guy Stone <guys@spotify.com >
2025-07-04 16:41:35 -07:00
c108781c85
[CI Bugfix] Fix pre-commit failures on main ( #20502 )
2025-07-04 14:17:30 -07:00
3d184b95b8
[feat]: CUTLASS block scaled group gemm for SM100 ( #19757 )
...
Signed-off-by: Duncan Moss <djm.moss@gmail.com >
Co-authored-by: Duncan Moss <dmoss@nvidia.com >
2025-07-04 12:58:04 -06:00
2f35a022e6
Enable V1 for Hybrid SSM/Attention Models ( #20016 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Stanislaw Wozniak <stw@zurich.ibm.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
2025-07-04 17:46:53 +00:00
ffe00ef77a
[Misc] Small: Remove global media connector. Each test should have its own test connector object. ( #20395 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-07-04 08:15:03 -07:00
5561681d04
[CI] add kvcache-connector dependency definition and add into CI build ( #18193 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2025-07-04 06:49:18 -07:00
fbd62d8750
[Doc] Fix classification table in list of supported models ( #20489 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-04 06:08:02 -07:00
2e26f9156a
[Model][3/N] Automatic conversion of CrossEncoding model ( #20168 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-04 05:47:39 -07:00
9e5452ee34
[Bug][Frontend] Fix structure of transcription's decoder_prompt ( #18809 )
...
Signed-off-by: sangbumlikeagod <oironese@naver.com >
2025-07-04 11:28:07 +00:00
0e3fe896e2
Support Llama 4 for fused_marlin_moe ( #20457 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-04 07:55:10 +00:00
1caca5a589
[Misc] Add SPDX-FileCopyrightText ( #20428 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-04 07:40:42 +00:00
783921d889
[Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels ( #20331 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-04 15:06:24 +08:00
4a98edff1f
[Structured Outputs][V1] Skipping with models doesn't contain tokenizers ( #20365 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-07-04 15:05:49 +08:00
a7bab0c9e5
[Misc] small update ( #20462 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-03 20:33:44 -07:00
25950dca9b
Add ignore consolidated file in mistral example code ( #20420 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-07-04 02:55:07 +00:00
a4113b035c
[Platform] Add custom default max tokens ( #18557 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com >
2025-07-04 10:50:17 +08:00
7e1665b089
[Misc] Change warn_for_unimplemented_methods to debug ( #20455 )
2025-07-04 02:35:08 +00:00
8d1096e7db
[Bugfix] Register reducer even if transformers_modules not available ( #19510 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-07-03 22:08:12 +00:00
8d775dd30a
[Misc] Fix Unable to detect current VLLM config. Defaulting to NHD kv cache layout warning ( #20400 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-03 14:56:09 -07:00
78fe77534b
[Kernel] Enable fp8 support for pplx and BatchedTritonExperts. ( #18864 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-07-03 14:55:40 -07:00
2f2fcb31b8
[Misc] Remove _maybe_ignore_quant_config from GLM4.1v ( #20432 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
2025-07-03 21:41:13 +00:00
1dba2c4ebe
[Misc] adjust for ipv6 for mookcacke url parse ( #20107 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-03 20:27:17 +00:00
71d6de3a26
[Misc] Clean up InternVL family config registration ( #19992 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-07-03 20:01:47 +00:00
536fd33003
[CI] Trimming some failing test groups from AMDPRODUCTION. ( #20390 )
2025-07-03 08:21:31 -07:00
619b9f5c7e
[Frontend] fix duplicate output for bench subcmd ( #20446 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-03 08:02:06 -07:00
d1b689c445
[Bugfix] Fix flaky test_streaming_response test ( #20363 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-03 14:46:24 +00:00
9854dc9040
[Frontend] improve vllm bench <bench_type> --help display ( #20430 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-03 14:22:16 +00:00
ff5c60fad8
[Misc] Automatically tag PRs to add new models ( #20222 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-03 07:11:03 -07:00
6f1229f91d
[Model][2/N] Automatic conversion of CrossEncoding model ( #19978 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-07-03 13:59:23 +00:00
1819fbda63
[Quantization] Bump to use latest bitsandbytes ( #20424 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-03 21:58:46 +08:00
7f0367109e
[CI/Build][CPU] Enable cross compilation in CPU release pipeline ( #20423 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-03 05:26:12 -07:00
fb14d53cf6
[Kernel] refactor cpu worker v0 cache dtype ( #20080 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-07-03 08:39:14 +00:00
b024a42e93
[Core] Move multimodal placeholder from chat utils to model definition ( #20355 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-03 08:18:30 +00:00
cb97f2bfc5
[Docs] Replace two list with tables in intel_gaudi.md ( #20414 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-03 00:48:25 -07:00
359200f6ac
[doc] fix link ( #20417 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-03 00:21:57 -07:00
220aee902a
[Misc] Add rules to label Speculative Decoding Related PRs ( #20406 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2025-07-02 23:56:49 -07:00
67d25eca05
[Tests] Update online DP tests to verify that requests are balanced ( #20157 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-03 14:49:13 +08:00
363528de27
[Feature] Support MiniMax-M1 function calls features ( #20297 )
...
Signed-off-by: QscQ <qscqesze@gmail.com >
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-07-03 06:48:27 +00:00
4ff61ababa
[TPU] Add a case to cover RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 ( #20385 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-07-03 06:46:41 +00:00
0ec3779df7
[Bugfix][CI/CD][CPU] Fix CPU CI tests ( #20383 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-02 20:11:36 -07:00
b616f6a53d
[Misc] Small: Fix video loader return type annotations. ( #20389 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-07-03 03:10:39 +00:00
2e25bb12a8
[Bugfix] Fix import of CutlassExpertsFp8 in compressed_tensors_moe.py ( #20381 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-07-03 02:07:43 +00:00
9965c47d0d
Enable CPU nightly performance benchmark and its Markdown report ( #18444 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-07-02 17:50:25 -07:00
059d4cdb49
[BugFix] Fix DP headless mode arg validation ( #20398 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-02 17:15:32 -07:00
bdb84e26b0
[Bugfix] Fixes for FlashInfer's TORCH_CUDA_ARCH_LIST ( #20136 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
2025-07-02 17:15:11 -07:00
3dd359147d
[Docs] Update EAGLE example ( #20375 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-02 17:13:51 -07:00
657f2f301a
[DP] Support external DP Load Balancer mode ( #19790 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-02 10:21:52 -07:00
a1aafc827a
[ROCm][FEAT] Enable Full Graph Mode in AITER MLA V1 Attn Backend (Decode Phase only) ( #20254 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-07-02 16:25:46 +00:00
139508a418
[Misc] add handler HF_TOKEN is emptry string ( #20369 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-07-02 09:14:31 -07:00
d265414dbc
[Minor] Clean up incorrect comment in test ( #20382 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-07-02 09:13:37 -07:00
48fb076cbc
[V1] LogitsProcessor programming model ( #16728 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-07-02 09:10:42 -07:00
c1909e7e8c
[Kernels] MoE refactor ( #19636 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Co-authored-by: ElizaWszola <ewszola@redhat.com >
2025-07-02 06:08:27 -07:00
b95877509b
Documentation update tool_calling: mapping back to function from response ( #20373 )
2025-07-02 05:55:49 -07:00
706ff13224
[Model] Adds support for SlimMoE models Phi-tiny-MoE-instruct ( #20286 )
...
Signed-off-by: Zichong Li <t-lizichong@microsoft.com@Reasoning-H100-VM3.drbuo4tcjzruhloch3eo0b25ef.cx.internal.cloudapp.net>
Co-authored-by: Zichong Li <t-lizichong@microsoft.com@Reasoning-H100-VM3.drbuo4tcjzruhloch3eo0b25ef.cx.internal.cloudapp.net>
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-07-02 12:54:12 +00:00
ccbfb1d1c9
[Bugfix] Fix the max_seq_len limit of 16384 for DeepSeek models ( #20322 )
...
Signed-off-by: Wang Huaqiang <huaqiang.wang@intel.com >
2025-07-02 12:53:36 +00:00
9e5552aa13
[NVIDIA] Support Cutlass w8a8 FP8 for Blackwell Geforce GPUs (sm120) ( #17280 )
...
Signed-off-by: kaln27 <liaojuncheng123@foxmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-02 06:47:19 -06:00
0c600b9ab6
[Build/CI] Automatically tag DeepSeek related PRs ( #20370 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-07-02 04:02:43 -07:00
e303dcf523
[Model] Add Ernie4.5 and Ernie4.5MoE Model Support ( #20220 )
...
Signed-off-by: wangyafeng <wangyafeng@baidu.com >
2025-07-02 03:37:01 -07:00
ae9c4d416f
[Docs] Make TPU ref prettier in google_tpu.md ( #20356 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-02 02:04:08 -07:00
d853520b3e
[Docs] Fix indentations for 2-level items in deprecation_policy.md ( #20352 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-07-01 23:50:31 -07:00
ba51aea65e
[Bugfix] Keye-VL compatibility with tok_kwargs ( #20058 ) ( #20353 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-01 23:46:59 -07:00
8452946c06
[Model][VLM] Support Keye-VL-8B-Preview ( #20126 )
...
Signed-off-by: Kwai-Keye <Keye@kuaishou.com >
2025-07-01 23:35:04 -07:00
2e7cbf2d7d
[Frontend] Support configurable mm placeholder strings & flexible video sampling policies via CLI flags. ( #20105 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-07-01 23:34:03 -07:00
7da296be04
[TPU] kv cache update kernel supports dynamic grid ( #20235 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-07-02 06:33:37 +00:00
b205e8467d
[Doc][TPU] Add models and features supporting matrix. ( #20230 )
...
Signed-off-by: Qiliang Cui <cuiq@google.com >
2025-07-02 06:33:20 +00:00
be0cfb2b68
fix[Docs]: link anchor is incorrect #20309 ( #20315 )
...
Signed-off-by: zxw <1020938856@qq.com >
2025-07-02 06:32:34 +00:00
1a03dd496b
[Bugfix] Fix dynamic rotary embedding ( #20343 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-07-02 06:31:26 +00:00
27b8017636
[FIX][Intel GPU]fix ipex flash_attn_varlen_func api missing parameter ( #20348 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-07-01 22:26:40 -07:00
9ec1e3065a
[Misc][Doc] Add missing comment for LLM ( #20285 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2025-07-01 19:04:24 -07:00
9dae7d46bf
[Refactor] Remove Unused Env VLLM_ENABLE_MOE_ALIGN_BLOCK_SIZE_TRITON ( #20334 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-01 19:03:43 -07:00
7058d7dd5d
[Refactor] Remove duplicate find_free_port ( #20333 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-01 19:03:07 -07:00
a0389e0554
[UT][intel GPU] use current_platform instead of device hardcode in v1 tests ( #20169 )
...
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com >
2025-07-02 09:06:04 +08:00
3be8d312a2
[Kernel][Bugfix] Fixup some warnings in nvfp4_blockwise_moe when CUDA < 12.8 ( #20324 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-07-01 18:05:47 -07:00
3abfe22154
Enable group size 64 for Machete ( #20290 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-07-01 18:05:44 -07:00
e81fbefe8a
[Refactor] Refactor import utils ( #20269 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-07-01 18:05:42 -07:00
9290de5667
remove unused variables in marlin_template.h ( #20236 )
2025-07-02 00:51:52 +00:00
7f280d69c9
[Optimization] Cache sampled token ids in model runner ( #20291 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-01 11:01:31 -07:00
02cabff207
[V1] [ROCm] Enable EP with AITER Fused MoE ( #20270 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-07-01 16:48:30 +00:00
3d19d47d91
[Frontend] Expand tools even if tool_choice="none" ( #17177 )
...
Signed-off-by: okada shintarou <okada@preferred.jp >
2025-07-01 12:47:38 -04:00
8acb4badee
[CUDA graphs] Enable full cuda graphs with FA3 AoT scheduling ( #20301 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-01 09:07:36 -07:00
314af8617c
[Docs] Update transcriptions API to use openai client with stream=True ( #20271 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-01 15:47:13 +00:00
0e96cc9b7e
[Misc] Minor refactoring for scheduler ( #20299 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-07-01 07:55:32 -07:00
ecad851cbd
[Model]Add Tencent HunYuanMoEV1 Model Support ( #20114 )
...
Signed-off-by: aiyiwang <aiyiwang@tencent.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: quinnrong <quinnrong@tencent.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-07-01 07:28:13 -07:00
ed70f3c64f
Add GLM4.1V model (Draft) ( #19331 )
...
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-07-01 12:48:26 +00:00
650d5dbd04
[Misc] Minor refactor of NIXL background handshake ( #20068 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-07-01 12:40:14 +01:00
9025a9a705
[Quant] [Bugfix] Fix quantization config matching with hf_to_vllm_mapper ( #20046 )
2025-07-01 19:20:34 +09:00
c05596f1a3
[Perf] Validate @config in pre-commit instead of dynamically ( #20200 )
...
Signed-off-by: Lionel Villard <villard@us.ibm.com >
2025-07-01 05:10:28 -04:00
787b13389e
[doc] fix the incorrect logo in dark mode ( #20289 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-07-01 08:18:09 +00:00
96453cfa83
[BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine ( #19067 )
...
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
2025-07-01 16:12:19 +08:00
b1c1fe35a5
[Misc] remove redundant char ( #20287 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-07-01 15:33:22 +08:00
08d81f1014
[Bugfix] Fix deepep tests ( #20288 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-07-01 15:29:08 +08:00
6cc1e7d96d
[CPU] Update custom ops for the CPU backend ( #20255 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-07-01 07:25:03 +00:00
9909726d2a
Enable ZP Support for Machete ( #20268 )
...
Signed-off-by: czhu-cohere <conway.zhu@cohere.com >
2025-07-01 07:12:20 +00:00
22e9d42040
[Misc] add xgrammar for arm64 ( #18359 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2025-07-01 07:02:20 +00:00
86debab54c
Fix numel() downcast in vllm/csrc/moe/moe_align_sum_kernels.cu +2 ( #17082 )
...
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-07-01 06:48:10 +00:00
be250bbc67
[V1] Only print cudagraph tqdm on rank 0 with is_global_first_rank ( #19516 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-07-01 06:02:09 +00:00
27949354fa
[Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference ( #18768 )
...
Signed-off-by: Alex Kogan <alex.kogan@oracle.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-07-01 05:44:38 +00:00
bd5038af07
[Doc] add config and troubleshooting guide for NCCL & GPUDirect RDMA ( #15897 )
...
Signed-off-by: Ernest Wong <chwong719@gmail.com >
2025-06-30 21:44:39 -07:00
a2f14dc8f9
[CI][Intel Gaudi][vllm-Plugin]Add CI for hpu-plugin-v1-test ( #20196 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2025-07-01 04:17:07 +00:00
92ee7baaf9
[Example] add one-click runnable example for P2P NCCL XpYd ( #20246 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-06-30 21:03:55 -07:00
7151f92241
[Misc] Fix spec decode example ( #20296 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-30 21:01:48 -07:00
e28533a16f
[Bugfix] Fix include prompt in stream response when echo=true ( #15233 )
...
Signed-off-by: Yuan Fang <yuanfang@alauda.io >
2025-07-01 01:30:14 +00:00
6d42ce8315
[CLI] Improve CLI arg parsing for -O/--compilation-config ( #20156 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-07-01 01:03:13 +00:00
ded1fb635b
[Bugfix][V1][P/D]Fix the issue of occasional garbled output for P2pNcclConnector ( #20263 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-06-30 16:45:14 -07:00
97d9524fe9
[Refactor] Remove useless pdb comment ( #20266 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-30 18:15:24 +00:00
d8cf819a9a
[Core] [Bugfix] [Multimodal] Fix multimodal profiling and generation for SFT/PTQed models ( #20058 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-06-30 17:26:49 +00:00
551ef1631a
[Unit Test] Add unit test for deep gemm ( #20090 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-30 10:26:42 -06:00
2863befce3
[Optimization] Use Shared CachedRequestData Instance Across All Requests ( #20232 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-30 09:07:50 -07:00
2965c99c86
[Spec Decode] Clean up spec decode example ( #20240 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-30 08:28:13 -07:00
2062c0723d
[Spec Decode] Refactor spec decoding into a separate function ( #20238 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-30 08:13:50 -07:00
1c50e100a9
[Bugfix] fix quark ptpc ( #20251 )
...
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com >
Co-authored-by: Haoyang Li <307790822@qq.com >
2025-06-30 22:24:50 +09:00
3ee56e26be
[Docs] Fix 1-2-3 list in v1/prefix_caching.md ( #20243 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-06-30 11:20:51 +00:00
8fe7fc8634
[Quantization] Improve BitsAndBytesModelLoader ( #20242 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-30 18:22:09 +08:00
e936e401de
[Bugfix] Fix processor initialization in transformers 4.53.0 ( #20244 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-30 10:16:16 +00:00
f5dfa07531
[Bugfix] Skip loading extra parameters for modelopt Qwen3 MoE model ( #19598 )
...
Signed-off-by: noiji <>
2025-06-30 18:21:56 +09:00
022c58b80f
[doc] Add Slack and Forum to the top navigation ( #20208 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
2025-06-30 07:53:45 +00:00
19108ef311
[Misc] Fix import ( #20233 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-29 20:34:54 -07:00
5a52f389dd
[BUGFIX][DEEPSEEK][MODEL_LOAD] fix w13, w2 weight not initialized assert ( #20202 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2025-06-29 19:46:19 -07:00
65b1cbb138
[Model] support dots1 ( #18254 )
...
Signed-off-by: redmoe-moutain <agiredmoe@gmail.com >
2025-06-29 19:34:36 -07:00
6c9837a761
Fix cuda_archs_loose_intersection when handling sm_*a ( #20207 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-06-29 16:52:34 -07:00
6f2f53a82d
[Quantization] Add compressed-tensors NVFP4 MoE Support ( #19990 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2025-06-29 22:05:40 +00:00
7b1895e6ce
[CI Fix] Try fixing eagle e2e test OOM by reducing block allocation ( #20213 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-29 10:31:37 +08:00
4d36693687
[Refactor] Create a function util and cache the results for has_deepgemm, has_deepep, has_pplx ( #20187 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-28 22:06:38 +00:00
daec9dea6e
[Bugfix] Correct behavior of GraniteMoeHybrid for TensorParallel execution ( #20137 )
...
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com >
2025-06-28 08:16:41 -07:00
daceac57c7
[Frontend] Generalize v1/audio/transcriptions endpoint ( #20179 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-06-28 08:15:26 -07:00
8615d9776f
[CI/Build] Add new CI job to validate Hybrid Models for every PR ( #20147 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-06-27 23:00:25 -07:00
7b460c25f9
[BugFix] Fix the incorrect func name in the comments. (config.py) ( #20185 )
2025-06-27 22:51:16 -07:00
f719772281
[Bugfix] Properly reject requests with empty list guided_choice ( #20195 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-27 22:50:52 -07:00
d45417b804
fix ci issue distributed 4 gpu test ( #20204 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-27 22:50:00 -07:00
a29e62ea34
Fix num_token_padding support for static per-tensor scaled_fp8_quant ( #20188 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-27 22:48:13 -07:00
e53be6f00a
[Misc] Add type assertion of request_id for LLMEngine.add_request ( #19700 )
...
Signed-off-by: n2ptr <xuzhanchaomail@163.com >
2025-06-27 22:47:36 -07:00
c329ceca6d
[CI Fix] Pin tests/models/registry.py MiniMaxText01ForCausalLM to revision due to model changes ( #20199 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-28 13:43:06 +08:00
3c545c0c3b
[CI/Build] Allow hermetic builds ( #18064 )
...
Signed-off-by: Fabien Dupont <fdupont@redhat.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Fabien Dupont <fabiendupont@pm.me >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Elias Levy <eliaslevy@google.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-06-27 09:04:39 -07:00
e8c3bd2cd1
[Bugfix] Fix some narrowing conversion warnings ( #20141 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-06-27 09:01:28 -07:00
c6c983053d
[Bugfix] Mark 'hidden_states' as mutable in moe_forward registration. ( #20152 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-06-27 09:42:22 -06:00
aafabaa0d5
[Fix][torch.compile] Enable custom ops by default when Inductor off ( #20102 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-06-27 09:00:42 -06:00
94a55c7681
[Fix][ROCm] Remove unused variables to fix build error on GFX11/12 ( #19891 )
...
Signed-off-by: Hosang Yoon <hosang.yoon@amd.com >
2025-06-27 07:14:44 -07:00
aa0dc77ef5
[Perf] Improved perf for resolve_chat_template_content_format ( #20065 )
...
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@cerebras.net >
2025-06-27 09:16:41 +00:00
4ab3ac285e
[Bugfix] Fix flaky failure when getting DP ports ( #20151 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-27 15:30:53 +08:00
d1c956dc0f
Gemma3n (Text-only) ( #20134 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-06-27 07:16:26 +00:00
dec197e3e5
Quick Fix by adding conditional import for flash_attn_varlen_func in flash_attn ( #20143 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2025-06-27 05:48:13 +00:00
6e244ae091
[Perf][Frontend] eliminate api_key and x_request_id headers middleware overhead ( #19946 )
...
Signed-off-by: Yazan-Sharaya <yazan.sharaya.yes@gmail.com >
2025-06-27 00:44:14 -04:00
cd4cfee689
[Model][1/N] Automatic conversion of CrossEncoding model ( #20012 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-06-26 21:10:04 -07:00
e110930680
[Fix] Fix gemma CI test failing on main ( #20124 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-06-26 21:06:59 -07:00
8b64c895c0
[CI] Sync test dependency with test.in for torch nightly ( #19632 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Concurrensee <yida.wu@amd.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-26 20:55:25 -07:00
0740e29b66
[Feature] add quick all reduce ( #19744 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-06-26 20:54:24 -07:00
44d2e6af63
[Bugfix] Build moe_data for both sm100 and sm90 ( #20086 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-26 20:50:12 -07:00
2d7779f888
[Perf] SM100 FP8 GEMM Optimizations after cutlass_profiler ( #20071 )
...
Signed-off-by: ilmarkov <imarkov@redhat.com >
Co-authored-by: ilmarkov <imarkov@redhat.com >
2025-06-26 20:50:09 -07:00
a57d57fa72
[Quantization] Bump to use latest compressed-tensors ( #20033 )
...
Signed-off-by: Dipika <dipikasikka1@gmail.com >
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com >
2025-06-26 20:50:06 -07:00
71799fd005
[CI Failure] Fix OOM with test_oot_registration_embedding ( #20144 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-27 11:21:04 +08:00
e9fd658a73
[Feature] Expert Parallelism Load Balancer (EPLB) ( #18343 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
2025-06-26 15:30:21 -07:00
07b8fae219
[Doc] correct LoRA capitalization ( #20135 )
...
Signed-off-by: kyolebu <kyu@redhat.com >
2025-06-26 15:22:12 -07:00
562308816c
[Refactor] Rename commnication utils ( #20091 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-26 22:19:32 +00:00
04e1642e32
[TPU] add kv cache update kernel ( #19928 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-06-26 10:01:37 -07:00
b69781f107
[Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention backend. ( #19560 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-06-26 09:27:18 -07:00
0bceac9810
Spam folks if config.py changes ( #20131 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-06-26 08:19:46 -07:00
34878a0b48
[Doc] Rename page titles ( #20130 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-26 08:18:49 -07:00
6393b03986
[Doc] Auto sign-off for VSCode ( #20132 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-26 08:18:36 -07:00
0907d507bf
[Doc] Automatically signed-off by PyCharm ( #20120 )
...
Signed-off-by: wang.yuqi <noooop@126.com >
2025-06-26 14:34:17 +00:00
c894c5dc1f
[Bug Fix] Fix address/port already in use error for deep_ep test ( #20094 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-26 22:33:13 +08:00
1f5d178e9c
Revert "[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine" ( #20128 )
2025-06-26 07:32:22 -07:00
27c065df50
[Bugfix][V1][ROCm] Fix AITER Flash Attention Backend (Fix API Break and Local Attention Logic: affecting Llama4) ( #19904 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-06-26 12:42:31 +00:00
84c260caeb
[Docs] Improve frameworks/helm.md ( #20113 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-06-26 10:41:51 +00:00
167aca45cb
[Misc] Use collapsible blocks for benchmark examples. ( #20017 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-26 03:35:16 -07:00
0567c8249f
[CPU] Fix torch version in x86 CPU backend ( #19258 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-06-26 03:34:47 -07:00
d188913d99
[Refactor] Remove unused library ( #20099 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-26 09:16:10 +00:00
1d7c29f5fe
[Doc] Update docs for New Model Implementation ( #20115 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-26 00:47:06 -07:00
65397e40f5
[Bugfix] Allow CUDA_VISIBLE_DEVICES='' in Platform.device_id_to_physical_device_id ( #18979 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-06-26 00:01:57 -07:00
9502c38138
[Benchmark][Bug] Fix multiple bugs in bench and add args to spec_decode offline ( #20083 )
2025-06-25 22:06:27 -07:00
2582683566
[PD] Skip tp_size exchange with rank0 ( #19413 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-06-25 20:04:39 -07:00
754b00edb3
[Bugfix] Fix Mistral tool-parser regex for nested JSON ( #20093 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-26 01:01:17 +00:00
296ce95d8e
[CI] Add SM120 to the Dockerfile ( #19794 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-25 16:23:56 -07:00
2d7620c3eb
[TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN ( #19919 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-06-25 15:51:02 -07:00
55c65ab495
[P/D] Avoid stranding blocks in P when aborted in D's waiting queue ( #19223 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-25 15:19:44 -07:00
2cc2069970
[TPU][Bugfix] fix kv cache padding ( #20048 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-06-25 21:24:10 +00:00
9f0608fc16
[Bugfix] default set cuda_graph_sizes to max_num_seqs for v1 engine ( #20062 )
...
Signed-off-by: izhuhaoran <izhuhaoran@qq.com >
2025-06-25 21:03:17 +00:00
4e0db57fff
Fix the path to the testing script. ( #20082 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-06-25 20:48:17 +00:00
c40692bf9a
[Misc] Add parallel state node_count function ( #20045 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-25 13:38:53 -07:00
4734704b30
[PD] let toy proxy handle /chat/completions ( #19730 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-06-25 15:17:45 -04:00
8b8c209e35
static_scaled_fp8_quant should not run when scale.numel is not 1 ( #20076 )
2025-06-25 15:08:03 -04:00
23a04e0895
[Fix] Support cls pooling in ModernBertPooler ( #20067 )
...
Signed-off-by: shengzhe.li <shengzhe.li@sbintuitions.co.jp >
2025-06-25 15:07:45 -04:00
02c97d9a92
[Quantization] Add compressed-tensors emulations support for NVFP4 ( #19879 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2025-06-25 14:28:19 -04:00
e795d723ed
[Frontend] Add /v1/audio/translations OpenAI API endpoint ( #19615 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-06-25 17:54:14 +00:00
8359f4c8d8
[V1][Speculative Decoding] Fix DeepSeek MTP ( #20022 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2025-06-25 08:41:02 -07:00
bf5181583f
[Doc] Guide for Incremental Compilation Workflow ( #19109 )
2025-06-25 22:06:46 +09:00
c53fec1fcb
[doc] add reference link for Intel XPU ( #20064 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-25 12:24:07 +00:00
0f9e7354f5
[BugFix] Fix full-cuda-graph illegal memory access in FA3 ( #20057 )
...
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
2025-06-25 08:39:04 +00:00
ba7ba35cda
[Chore] debloat some initial logs ( #19438 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-06-25 06:36:22 +00:00
015fab8c2f
[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs. ( #19717 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-06-24 23:22:58 -07:00
f59fc60fb3
[Feat][CLI] enforce-include-usage ( #19695 )
...
Signed-off-by: Max Wittig <max.wittig@siemens.com >
2025-06-25 01:43:04 -04:00
879f69bed3
[Refactor] Remove duplicate ceil_div ( #20023 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-25 05:19:09 +00:00
7108934142
[Frontend] speed up import time of vllm.config ( #18036 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-06-25 00:41:11 -04:00
3443aaf8dd
Move to a faster base64 implementation ( #19984 )
...
Signed-off-by: h-avsha <avshalom.manevich@hcompany.ai >
2025-06-24 20:33:51 -07:00
2273ec322c
Revert "Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor" ( #20030 )
2025-06-25 11:23:29 +08:00
a6c4b87fbc
Revert "[Feature] Integrate new deepgemm ( #19820 )" ( #20049 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-24 19:45:22 -07:00
1afa9948f5
[Llama4] Update attn_temperature_tuning ( #19997 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-06-24 22:42:53 -04:00
0d06b533a0
cmake: Update vllm_flash_attn for vllm_kernels ( #20032 )
...
Signed-off-by: Eli Uriegas <eliuriegas@meta.com >
2025-06-24 22:44:10 +00:00
c01d1c5aba
use .dev for version comparison with pytorch nightly release ( #20031 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-06-24 21:52:16 +00:00
ead369845d
[Easy] Remove submodule added in #19463 ( #20039 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-06-24 13:23:15 -07:00
c6e3bba8e6
[Feature] Integrate new deepgemm ( #19820 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-24 12:51:56 -07:00
91f7d9d0b6
[P/D] Asynchronously do _nixl_handshake ( #19836 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-24 12:46:10 -07:00
8619e7158c
[BugFix] Fix multi-node offline data parallel ( #19937 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-24 12:45:20 -07:00
c635c5f744
[Misc][Benchmarking] Add variable request-rate ("ramp-up") to the benchmarking client. ( #19423 )
...
Signed-off-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
Co-authored-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-06-24 18:41:49 +00:00
a045b7e89a
[Perf] Improve/Fix-regression for FA3 in High QPS regimes ( #19463 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-06-24 13:09:01 -04:00
981eeca41a
[Fix][V1] Remove --scheduling-policy oracle ( #20010 )
...
Signed-off-by: amit <amit.man@gmail.com >
2025-06-24 09:52:15 -07:00
26d34eb67e
refactor example - qwen3_reranker ( #19847 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-24 14:03:20 +00:00
53da4cd397
[Bugfix][CPU] Fix InputBatch for pooling models in the CPU v1 ( #20014 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-06-24 13:20:04 +00:00
9a3b88328f
[PERF] Speedup of MRoPE prepare inputs ( #19939 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai >
2025-06-23 23:01:26 -07:00
3014c920da
add some examples for other benchmark scripts ( #19893 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-24 05:57:46 +00:00
0eed516951
[doc] Fix broken link in the installation for CPU ( #19980 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-06-24 12:04:11 +08:00
ee5ad8d2c5
[Misc][Tools][Benchmark] Add profile to autotune script ( #19711 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-06-24 00:59:41 +00:00
a738dbb2a1
Update test case parameter to have the throughput above 8.0 ( #19994 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-06-24 00:18:10 +00:00
33d5e29be9
[TPU] Fix tpu model runner test ( #19995 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-06-23 16:04:28 -07:00
4671ac6e2a
[Bugfix][Benchmark] Fix Marlin benchmark ( #19929 )
2025-06-24 07:25:12 +09:00
dd2ccf8dde
Feat Dynamic Quantization for MoE Layers in GPTQ Marlin Backend ( #19395 )
2025-06-24 07:23:28 +09:00
a3bc76e4b5
[CI/Build] Push latest tag for cpu and neuron docker image ( #19897 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-23 14:15:37 -07:00
e6327c9b3e
[Feature] Support sequence parallelism for static fp8 quantization ( #19181 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-06-23 16:09:02 -04:00
d0132f025d
[Misc] Add type alias ReqId and EngineId for better readability ( #19880 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-06-23 12:57:57 -07:00
61f4fc5dc6
[Bugfix][v1] Fix step pooler implementation and step pooling usage in v1 ( #19956 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-23 18:38:06 +00:00
68aaeb3749
[EP+DP] Optimize the little operations in the DeepGEMM + DeepEP low latency case ( #19885 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-06-23 11:07:47 -07:00
c3649e4fee
[Docs] Fix syntax highlighting of shell commands ( #19870 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-06-23 17:59:09 +00:00
53243e5c42
[doc] improve readability for long commands ( #19920 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-23 14:27:07 +00:00
a6e6604d32
[Bugfix] Fix CI bitsandbytes failure ( #19969 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-23 21:30:55 +08:00
b82e0f82cb
[doc] use MkDocs collapsible blocks - supplement ( #19973 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-23 10:54:16 +00:00
5111642a6f
[Doc] Update V1 status for decoder-only embedding models ( #19952 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-23 09:31:06 +00:00
1bcd15edc7
[BugFix][P/D] Fix for cases where _recving_transfers can be cleaned up when *all* transfer done ( #19874 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-06-22 22:41:53 -07:00
2ebff5b77c
[P/D][NixlConnector] Support tp_size > num_kv_heads deployments ( #19691 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-22 22:41:50 -07:00
f17aec0d63
[doc] Fold long code blocks to improve readability ( #19926 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-23 05:24:23 +00:00
493c275352
Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor ( #19643 )
...
Signed-off-by: Vensenmu <vensenmu@gmail.com >
2025-06-23 03:40:28 +00:00
f39ab2d4bd
[Misc] Configurable timeout for execute_model RPC calls via env var ( #19544 )
...
Signed-off-by: jinqinn <goodqinjin@163.com >
2025-06-22 20:36:26 -07:00
4a0f7888a3
[Core] feat: Implement Priority Scheduling in V1 Engine ( #19057 )
...
Signed-off-by: amit <amit.man@gmail.com >
Co-authored-by: Roger Wang <Rogerw0108@gmail.com >
2025-06-22 20:18:08 -07:00
c4cf260677
[Perf][CLI] Improve overall startup time ( #19941 )
2025-06-22 23:11:22 +00:00
33d51f599e
[BugFix] Add an env to disable moe chunking to work around compile incompatibility ( #19642 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-06-22 15:17:49 -07:00
e91386cde1
[Chore] dedup logs ( #19955 )
2025-06-22 19:43:07 +00:00
2c11a29f0b
[Misc] Simplify vllm bench cli subcommand implementation ( #19948 )
2025-06-22 12:34:48 -04:00
c76a506bd6
[Misc] Update model-specific PR tagging ( #19949 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
2025-06-22 12:16:08 +00:00
ec0db6f51c
[doc] use snippets for contact us ( #19944 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-22 10:26:13 +00:00
c305a2109d
[CI/Build] Auto tag perf benchmarks related PRs ( #19943 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-22 08:46:21 +00:00
202c5df935
[Benchmark] fix request loss if "ping" is returned ( #19535 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-22 07:21:04 +00:00
2bb246b8f7
[MISC] add cpu_kvcache_space_bytes to CacheConfig ( #19812 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-22 13:39:09 +08:00
4c409cabc2
[Misc] add vllm_config in __init__ ( #19866 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-21 23:10:46 -04:00
3b1e4c6a23
[Docs] Add GPT2ForSequenceClassification to supported models in docs ( #19932 )
...
Signed-off-by: nie3e <adrcwiek@gmail.com >
2025-06-21 20:57:19 +00:00
2c5302fadd
[Multimodal] Optimize Qwen2/2.5-VL startup time ( #19756 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Roger Wang <hey@rogerw.me >
Co-authored-by: Roger Wang <hey@rogerw.me >
2025-06-21 20:01:07 +00:00
caa680fd2e
[doc] add contact us in community ( #19922 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-21 17:29:06 +00:00
c3bf9bad11
[New model support]Support Tarsier2 ( #19887 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-06-21 04:01:51 +00:00
6f170f11dd
[Bugfix] Fix bnb 8bit model weights loading ( #19917 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-21 03:29:09 +00:00
8ca81bb069
Fix: Check the type of params to be a Sequence not list. ( #19910 )
...
Signed-off-by: Rabin Adhikari <rabin.adk1@gmail.com >
2025-06-20 23:03:17 +00:00
e773a9e1c2
[Misc] Clean up useless code ( #19889 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-06-20 21:09:09 +00:00
71baf85ae1
[Kernel] mark TorchSDPABackend swap_blocks NotImplementedError ( #19749 )
2025-06-20 18:18:11 +00:00
79f2f1c2a1
[CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tests ( #19901 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-06-20 15:30:36 +00:00
2e3e3c86dc
Export NaNs in logits to scheduler_stats if output is corrupted ( #18777 )
...
Signed-off-by: Vlad Mihailescu <vtmihailescu@gmail.com >
2025-06-20 22:47:16 +08:00
7e8977fcd4
[custom_op][vllm-plugin] update custom_op class to use op_registry ( #19164 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2025-06-20 07:44:56 -07:00
f1e840e842
[Model] GPT2ForSequenceClassification model ( #19663 )
...
Signed-off-by: nie3e <adrcwiek@gmail.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-20 12:07:41 +00:00
7771d1de88
[Fix] import regex instead of re ( #19875 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-06-20 11:16:48 +00:00
71d1219545
[Kernel] correct cpu worker function parameter type ( #19745 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-20 10:50:13 +00:00
e384f2f108
[Misc] refactor example - openai_transcription_client ( #19851 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-20 08:02:21 +00:00
089a306f19
[Misc] update cuda version ( #19526 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-20 07:25:15 +00:00
5e666f72cd
[Bugfix][Ray] Set the cuda context eagerly in the ray worker ( #19583 )
2025-06-19 22:01:16 -07:00
e3a3e4db46
[Bugfix] Enable PP with AITER+V1 ( #19822 )
...
Signed-off-by: Qiang Li <qiang.li2@amd.com >
2025-06-20 12:43:20 +08:00
e41bf15cd0
[Chore]: qwen3-moe-type-hints-mistake ( #19860 )
...
Co-authored-by: xinnan.hou <hxn02029096@alibaba-inc.com >
2025-06-19 21:43:07 -07:00
5aa4a015ce
[Benchmark] Fix Value of type "SampleRequest" is not indexable ( #18032 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-06-19 21:28:55 -07:00
b6bad3d186
[CI][Neuron] Fail and exit on first error ( #19622 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-20 12:27:51 +08:00
ee9a1531aa
[CI/Build][Bugfix] Fix deadlock on v1 engine test CI ( #19872 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-20 09:51:07 +08:00
10d82f9ac5
[Benchmark][Bugfix] Fix Dataset Length Calculation ( #19868 )
...
Signed-off-by: Robert Shaw <robshaw@redhat.com >
Co-authored-by: Robert Shaw <robshaw@redhat.com >
2025-06-19 18:30:41 -07:00
ea10dd9d9e
[Frontend] early return chat format resolution when specified ( #19735 )
2025-06-19 18:49:59 +00:00
ead2110297
[Core][Bugfix] Fix Online MM Beam Search ( #19688 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-06-19 17:18:07 +00:00
01220ce89a
[CI][CPU] Improve dummy Triton interfaces and fix the CPU CI ( #19838 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-06-19 15:46:09 +00:00
6f68c49220
[Doc] Update V1 user guide for embedding models ( #19842 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-19 09:43:27 +00:00
4719460644
Fixing Chunked Prefill Test. ( #19762 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-06-19 01:36:16 -07:00
466166dcfd
[Frontend] Add optional token-level progress bar to LLM.beam_search ( #19301 )
...
Signed-off-by: Ruosen Li <rxl190028@utdallas.edu >
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Ubuntu <ubuntu@ip-172-31-71-179.ec2.internal >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-19 03:21:41 -04:00
1d0ae26c85
Add xLAM tool parser support ( #17148 )
2025-06-19 14:26:41 +08:00
6021999573
[Minor] Allow redirecting model path for HfRunner in test ( #19795 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-18 23:04:10 -07:00
c7b370c603
raise exception for pin_lora ( #19809 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-18 22:57:35 -07:00
aa20d10a91
[Misc] [ROCm] Prevent surplus tensor reshape ( #19803 )
...
Signed-off-by: Zsolt Borbely <zsolt.borbely@htecgroup.com >
2025-06-19 13:57:16 +08:00
2de12be428
[ROCm] [AITER] [Bugfix] Patch for AITER commit 648764942e552a8bb5fe16026703716a81f05374 ( #18990 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-06-18 22:56:31 -07:00
83ca9ae47b
Mark invariant normalizer in Gemma as non-persistent ( #19788 )
...
Signed-off-by: Yu-Hang Tang <Tang.Maxin@gmail.com >
2025-06-18 22:56:03 -07:00
e2148dc5ea
[Bugfix] Add check_health to v1 async client. ( #19821 )
...
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com >
2025-06-18 21:47:01 -07:00
b1098b4072
[Bugfix] Fix the linter ( #19826 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-18 21:44:41 -07:00
799397ee4f
Support embedding models in V1 ( #16188 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
Co-authored-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-18 21:36:33 -07:00
4959915089
[Quantization] Modify the logic of BNB double quantization ( #19742 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-19 03:52:09 +00:00
8d1e89d946
[Misc][ROCm] Enforce no unused variable in ROCm C++ files ( #19796 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-18 20:25:15 -07:00
36239f79dd
Fix FA2 fallback for Blackwell V1 ( #19781 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-19 09:53:55 +08:00
dfada85eee
[Frontend] Expose custom args in OpenAI APIs ( #16862 )
...
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-18 17:41:11 -07:00
ed33349738
[BugFix] Fix use_cudagraph=False ( #19612 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-06-19 08:23:12 +08:00
d49adea1f9
[Multimodal] Use fast processor for Qwen2/2.5-VL ( #19789 )
2025-06-18 15:49:40 -07:00
14fdd21d39
[Core] More fixes to MultiModalEmbeddings type handling ( #19715 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-06-18 22:48:29 +00:00
04fefe7c9a
[TPU] Update torch-xla version to include paged attention tuned block change ( #19813 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-06-18 22:41:13 +00:00
3b523e38d9
[Core] Do not copy array during hashing ( #19484 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-06-18 15:36:55 -07:00
16c16301c8
Disable "Forbid direct 'import triton'" check for vllm/triton_utils/importing.py in an extensible way ( #19783 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
2025-06-18 15:08:00 -07:00
9206d0ff01
docs: fix Slack bulletpoint in README ( #19811 )
...
Signed-off-by: Nathan Weinberg <nweinber@redhat.com >
2025-06-18 20:47:08 +00:00
a89209b78d
[v1] Support mamba2 ( #19327 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-18 20:34:15 +00:00
ffacb222cb
[Docs] Add Huzaifa Sidhpurwala to vuln mgmt team doc ( #19808 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-06-18 20:22:28 +00:00
12575cfa7a
[Bugfix] fix RAY_CGRAPH_get_timeout is not set successfully ( #19725 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-06-18 10:26:16 -07:00
8b6e1d639c
[Hardware][AMD] integrate aiter chunked prefill into vllm ( #18596 )
...
Signed-off-by: fsx950223 <fsx950223@outlook.com >
Signed-off-by: charlifu <charlifu@amd.com >
Co-authored-by: fsx950223 <fsx950223@outlook.com >
Co-authored-by: charlifu <charlifu@amd.com >
2025-06-18 08:46:51 -07:00
735a9de71f
[Qwen] Add tagging rule for Qwen related PRs ( #19799 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-18 14:26:43 +00:00
257ab95439
[Platform] Allow platform use V1 Engine by default ( #19792 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-06-18 13:03:36 +00:00
cca91a7a10
[doc] fix the incorrect label ( #19787 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-18 10:30:58 +00:00
f04d604567
[Minor] Zero-initialize attn output buffer ( #19784 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-18 06:59:27 +00:00
19a53b2783
[V1] Decouple GPU and TPU InputBatch ( #19778 )
...
Signed-off-by: Andrew Feldman <afeldman@redhat.com >
2025-06-18 06:38:13 +00:00
eccdc8318c
[V1][P/D] An native implementation of xPyD based on P2P NCCL ( #18242 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-06-18 06:32:36 +00:00
5f52a84685
[V1] Add API docs for EncoderCacheManager ( #19294 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-06-18 13:37:01 +08:00
d4629dc43f
[Misc] Add __str__ for RequestStatus ( #19780 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-06-18 03:03:01 +00:00
6e9cc73f67
[MISC] correct DeviceConfig device field static type analysis ( #19699 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-17 17:21:50 -07:00
c53711bd63
[MISC] correct copy_blocks src_to_dists param type ( #19696 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-17 17:21:06 -07:00
dac8cc49f4
[TPU] Update torch version to include paged attention kernel change ( #19706 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-06-17 22:24:49 +00:00
a44b1c951d
[Feature][ROCm] Add full graph capture support for TritonAttentionBackend ( #19158 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-06-17 17:03:06 -04:00
b447624ee3
[Bugfix] Fix faulty triton importing logic when using Ray for DP ( #19734 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-17 20:59:29 +00:00
cda92307c1
[Misc] Update lmcache connector with the latest connector apis ( #19441 )
...
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn >
2025-06-17 19:57:54 +00:00
bf57ccc5c2
Remove sm120 arch from sm100 cutlass kernel arch list ( #19716 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-17 11:49:39 -07:00
ffb2cd6b54
[Perf] Optimize moe_align_block_size CUDA kernel ( #19572 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-06-17 11:49:26 -07:00
ca94d7fa00
[Bugfix] Update multimodel models mapping to fit new checkpoint after Transformers v4.52 ( #19151 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-17 15:58:38 +00:00
5a1c2e15d8
[Mis] remove duplicate engine status checks ( #19647 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-06-17 08:17:38 -07:00
4c8f64faa7
[V1][Kernel] Flashinfer HND KV cache layout ( #19280 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-06-17 09:09:22 -04:00
93aee29fdb
[doc] split "Other AI Accelerators" tabs ( #19708 )
2025-06-17 22:05:29 +09:00
154d063b9f
[doc][mkdocs] Add edit button to documentation ( #19637 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-17 11:10:31 +00:00
ccd7c05089
[Kernel] Add Split-KV Support to Unified Triton Attention Kernel ( #19152 )
...
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com >
2025-06-17 10:45:07 +00:00
c48c6c4008
Add a doc on how to update PyTorch version ( #19705 )
2025-06-17 18:10:37 +08:00
aed8468642
[Doc] Add missing llava family multi-image examples ( #19698 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-17 07:05:21 +00:00
5c76b9cdaf
[Core] add remove_seq_from_computed_blocks_tracker to BlockSpaceManager ( #19686 )
...
Signed-off-by: 刘全 <quan.liu2@dbappsecurity.com.cn >
Co-authored-by: 刘全 <quan.liu2@dbappsecurity.com.cn >
2025-06-17 04:40:58 +00:00
ddfed314f9
Fixes IMA for TP w/ flex-attention ( #19712 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-06-17 04:01:50 +00:00
5b3ad5ecf2
[DOC] fix doc typos ( #19600 )
...
Signed-off-by: Di Liu <liu-di@sjtu.edu.cn >
2025-06-17 11:34:53 +08:00
ede5c4ebdf
[Frontend] add chunking audio for > 30s audio ( #19597 )
...
Signed-off-by: nguyenhoangthuan99 <thuanhppro12@gmail.com >
2025-06-17 11:34:00 +08:00
07334959d8
[Wheel Size] Only build FA2 8.0+PTX ( #19336 )
2025-06-17 12:32:49 +09:00
119f683949
[doc] add project flag to gcloud TPU command ( #19664 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-06-17 01:00:09 +00:00
0860087aff
[Fix] Fall back to Gloo when NCCL backend is unavailable ( #19641 )
...
Signed-off-by: conroy-cheers <conroy@corncheese.org >
2025-06-17 08:42:14 +08:00
6bc7b57315
[Quantization] Remove FP4 emulation; Fall-back to marlin for device < 100 ( #19563 )
2025-06-16 17:33:51 -04:00
90f9c2eb5c
[V1] Change return type on get_multimodal_embeddings() ( #19446 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-06-16 13:32:15 -04:00
387bdf0ab9
[Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM) ( #19677 )
...
Signed-off-by: QscQ <qscqesze@gmail.com >
2025-06-16 09:47:14 -07:00
5e5baa91aa
[Kernels] Use empty for modular MoE workspaces ( #19667 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-06-16 14:58:01 +00:00
836d4ce140
[Bugfix] fix missing 'finish_reason': null in streaming chat ( #19662 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-06-16 14:10:39 +00:00
c3fec47bb7
[MISC] bump huggingface_hub pkg to 0.33.0 ( #19547 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-16 05:22:28 -07:00
1173804dca
[Bugfix] Fix TP inference for Flex attention backend ( #19657 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-16 11:21:37 +00:00
4d5424029b
[Feature]:Allow for Granite MoE Hybrid models with _only_ shared experts. ( #19652 )
...
Signed-off-by: Shawn Tan <shawntan@ibm.com >
2025-06-16 11:14:18 +00:00
3e7506975c
[DOC] Add reasoning capability to vLLM streamlit code ( #19557 )
2025-06-16 07:09:12 -04:00
ee35e96ac3
[BugFix] Don't catch BaseException when dumping execute_model errors ( #19626 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-16 11:01:08 +00:00
dec66d253b
[Kernel] GGUF MMVQ kernel for multiple input vectors ( #18754 )
...
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com >
2025-06-16 17:33:26 +08:00
8d120701fd
[Docs] Move multiproc doc to v1 dir ( #19651 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-06-16 09:10:12 +00:00
f40f763f12
[CI] Add mteb testing for rerank models ( #19344 )
2025-06-16 01:36:43 -07:00
26bc46ef89
[MISC] typo fix ( #19672 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-16 07:18:49 +00:00
a77aea59fd
[TPU] support attention head dim smaller than 128 ( #19620 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-06-16 06:40:53 +00:00
b692e9cd07
[Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config ( #19660 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-06-16 06:30:29 +00:00
367871a469
[Misc][Frontend] passthrough bad_words ( #19564 )
...
Signed-off-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai >
Co-authored-by: Francesco Bertolotti <francesco.bertolotti@igenius.ai >
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com >
2025-06-16 05:05:13 +00:00
92183b41f3
[Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker ( #18957 )
...
Signed-off-by: 刘全 <quan.liu2@dbappsecurity.com.cn >
Co-authored-by: 刘全 <quan.liu2@dbappsecurity.com.cn >
2025-06-15 21:56:37 -07:00
c6703d1e0d
[MISC] Remove unused variables in C++ ( #19609 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-15 20:05:28 -07:00
a5e7242d5f
[Misc] Remove duplicate multiproc method setting for CPU platform ( #19649 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-16 02:26:58 +00:00
91b2c17a55
[CI/Build] Fix torch nightly CI dependencies part 2 ( #19589 )
2025-06-15 20:01:10 +08:00
055915e6ce
Enable prefix caching with full cuda graphs ( #19617 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-15 01:05:05 -07:00
3d330c4c09
[Benchmark] Refactor benchmark script for fp8 & int8 ( #19627 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-15 15:15:37 +08:00
0b73736a0d
[Kernel] Raise verbose error and consolidate num_heads/num_kv_heads divisibility check ( #19339 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-15 13:43:48 +08:00
ee1531bc38
[Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness ( #19644 )
2025-06-14 21:15:41 -07:00
e13945f9dd
[Perf] Further tunings for SM100 FP8 CUTLASS kernel ( #19566 )
2025-06-14 17:25:10 -07:00
08500011d3
[Fix] Convert kv_transfer_config from dict to KVTransferConfig ( #19262 )
2025-06-14 12:32:07 -07:00
861a0a0a39
[Bugfix] Don't attempt to use triton if no driver is active ( #19561 )
2025-06-14 12:30:54 -07:00
bc956b38d0
Only build CUTLASS MoE kernels on Hopper ( #19648 )
2025-06-14 11:44:15 -07:00
294fc1e2c9
[Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization ( #19500 )
2025-06-14 09:34:28 -07:00
2db9044ab6
[Bugfix] Fix auto dtype casting for BatchFeature ( #19316 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-06-14 15:13:08 +00:00
6fa718a460
[Misc] Modularize CLI Argument Parsing in Benchmark Scripts ( #19593 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-14 16:54:52 +08:00
06be858828
[Bugfix] Fix the speculative decoding test by setting the target dtype ( #19633 )
2025-06-13 20:57:32 -07:00
d1e34cc9ac
[V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. ( #18354 )
...
Signed-off-by: Saheli Bhattacharjee <saheli@krai.ai >
2025-06-14 11:07:36 +08:00
bd517eb9fe
[BugFix] Fix DP Coordinator incorrect debug log message ( #19624 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-14 00:18:03 +00:00
d65668b4e8
Adding "AMD: Multi-step Tests" to amdproduction. ( #19508 )
...
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-06-13 17:08:51 -07:00
aafbbd981f
[torch.compile] Use custom ops when use_inductor=False ( #19618 )
2025-06-13 15:05:54 -07:00
0f0874515a
[Doc] Add troubleshooting section to k8s deployment ( #19377 )
...
Signed-off-by: Anna Pendleton <pendleton@google.com >
2025-06-13 21:47:51 +00:00
3597b06a4f
[CUDA] Enable full cudagraph for FlashMLA ( #18581 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-06-13 18:12:26 +00:00
1015296b79
[doc][mkdocs] fix the duplicate Supported features sections in GPU docs ( #19606 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-13 16:25:08 +00:00
ce9dc02c93
[Refactor] Remove unused variables in moe_permute_unpermute_kernel.inl ( #19573 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-13 06:12:15 -07:00
a24cb91600
[Model] Fix minimax model cache & lm_head precision ( #19592 )
...
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-06-13 12:08:20 +00:00
7e8d97dd3f
[BugFix] Honor enable_caching in connector-delayed kvcache load case ( #19435 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-13 09:46:32 +00:00
d70bc7c029
[torch.compile] reorganize the cache directory to support compiling multiple models ( #19064 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-06-13 15:23:25 +08:00
ce688ad46e
use base version for version comparison ( #19587 )
...
Signed-off-by: Boyuan Feng <boyuan@meta.com >
2025-06-13 15:09:34 +08:00
cefdb9962d
[Fix] The zip function in Python 3.9 does not have the strict argument ( #19549 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-06-13 14:57:48 +08:00
ace5cdaff0
[Fix] bump mistral common to support magistral ( #19533 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-06-12 22:28:12 -07:00
6458721108
[CPU] Refine default config for the CPU backend ( #19539 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-06-13 13:27:39 +08:00
bb4a0decef
[Misc] Correct broken docs link ( #19553 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-06-12 22:27:13 -07:00
c707cfc12e
[doc] fix incorrect link ( #19586 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-13 04:26:09 +00:00
7b3c9ff91d
[Doc] uses absolute links for structured outputs ( #19582 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-06-13 03:35:17 +00:00
c68698b326
[Bugfix] Fix EAGLE vocab embedding for multimodal target model ( #19570 )
...
Signed-off-by: qizixi <qizixi@meta.com >
2025-06-12 23:09:19 -04:00
e3b12667d4
[BugFix] : Fix Batched DeepGemm Experts ( #19515 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-06-12 20:43:02 -06:00
e6aab5de29
Revert "[Build/CI] Add tracing deps to vllm container image ( #15224 )" ( #19378 )
2025-06-12 17:26:40 -07:00
c57bb199b3
[V1] Resolve failed concurrent structured output requests ( #19565 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-06-12 23:30:09 +00:00
dba68f9159
[Doc] Unify structured outputs examples ( #18196 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-06-12 22:50:31 +00:00
a3319f4f04
[Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant ( #19452 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-12 15:39:15 -04:00
9d880f594d
[Misc] Turn MOE_DP_CHUNK_SIZE into an env var ( #19506 )
2025-06-12 18:01:16 +00:00
017ef648e9
[Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets ( #18847 )
2025-06-12 10:30:56 -07:00
4b25ab14e2
[doc] Make top navigation sticky ( #19540 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-12 15:48:11 +00:00
f98548b9da
[torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass ( #16756 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Sage Moore <sage@neuralmagic.com >
2025-06-12 08:31:04 -07:00
96846bb360
Fix TorchAOConfig skip layers ( #19265 )
...
Signed-off-by: mobicham <hicham@mobiuslabs.com >
2025-06-12 22:22:53 +08:00
b6efafd9e4
[Perf] Vectorize static / dynamic INT8 quant kernels ( #19233 )
...
Signed-off-by: yewentao256 <zhyanwentao@126.com >
2025-06-12 06:51:41 -07:00
1129e2b1ab
[V1][NixlConnector] Drop num_blocks check ( #19532 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-06-12 12:36:14 +00:00
c742438f8b
[Doc] Add V1 column to supported models list ( #19523 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-12 19:16:44 +08:00
73e2e0118f
[Quantization] Improve AWQ logic ( #19431 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-12 11:02:11 +00:00
c9280e6346
[Bugfix] Respect num-gpu-blocks-override in v1 ( #19503 )
...
Signed-off-by: Jon Swenson <jmswen@gmail.com >
2025-06-12 11:00:23 +00:00
af09b3f0a0
[Bugfix][V1] Allow manual FlashAttention for Blackwell ( #19492 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-12 10:40:24 +00:00
4f6c42fa0a
[Security] Prevent new imports of (cloud)pickle ( #18018 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com >
2025-06-12 10:30:17 +00:00
dff680001d
Fix typo ( #19525 )
...
Signed-off-by: 2niuhe <carlton2tang@gmail.com >
2025-06-12 09:24:45 +00:00
2e090bd5df
[AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm ( #19509 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-06-12 07:14:24 +00:00
1b0b065eb5
[BugFix] Handle missing sep_token for Qwen3-Reranker in Score API ( #19522 )
...
Signed-off-by: strutive07 <strutive07@gmail.com >
2025-06-12 07:00:47 +00:00
d5bdf899e4
[BugFix] Work-around incremental detokenization edge case error ( #19449 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-12 06:43:20 +00:00
7e3e74c97c
[Frontend] Improve error message in tool_choice validation ( #19239 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-12 01:13:00 -04:00
3f6341bf7f
Add Triton Fused MoE kernel config for E=16 on B200 ( #19518 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-06-12 04:31:51 +00:00
e5d35d62f5
[BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import ( #19514 )
...
Signed-off-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-06-12 04:28:12 +00:00
2f1c19b245
[CI] change spell checker from codespell to typos ( #18711 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-06-11 19:57:10 -07:00
42f52cc95b
[CI/Build] Fix torch nightly CI dependencies ( #19505 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-06-11 14:40:42 -07:00
97a9465bbc
[UX] Add Feedback During CUDAGraph Capture ( #19501 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-06-11 21:09:05 +00:00
c7ea0b56cd
[AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger ( #17331 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-06-11 15:53:28 -04:00
29fa5cac1c
[Kernels] Add activation chunking logic to FusedMoEModularKernel ( #19168 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-06-11 12:53:10 -04:00
b2d9be6f7d
[Docs] Remove WIP features in V1 guide ( #19498 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-11 09:15:03 -07:00
04a55612dd
[Misc] Fix misleading ROCm warning ( #19486 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-12 00:12:10 +08:00
89b0f84e17
[doc] fix "Other AI accelerators" getting started page ( #19457 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-06-11 16:11:17 +00:00
497a91e9f7
[CI] Update FlashInfer to 0.2.6.post1 ( #19297 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-11 22:57:28 +08:00
943ffa5703
[Bugfix] Update the example code, make it work with the latest lmcache ( #19453 )
...
Signed-off-by: Runzhen Wang <wangrunzhen@gmail.com >
2025-06-11 12:42:20 +00:00
5c8d34a42c
Support no privileged mode on CPU for docker and kubernetes deployments ( #19241 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
2025-06-11 04:11:47 -07:00
3c8694eabe
Fix some typo ( #19475 )
...
Signed-off-by: ximing.wxm <ximing.wxm@antgroup.com >
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com >
2025-06-11 10:36:04 +00:00
7484e1fce2
Add cache to cuda get_device_capability ( #19436 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-11 17:37:05 +08:00
a2142f0196
Support non-string values in JSON keys from CLI ( #19471 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-11 09:34:04 +00:00
871d6b7c74
[Misc] Reduce warning message introduced in env_override ( #19476 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-11 17:29:54 +08:00
29a38f0352
[Doc] Support "important" and "announcement" admonitions ( #19479 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-11 01:39:58 -07:00
a5115f4ff5
[Doc] Fix quantization link titles ( #19478 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-11 01:27:22 -07:00
68b4a26149
[Doc] Update V1 User Guide for Hardware and Models ( #19474 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-11 00:49:06 -07:00
b8e809a057
[Kernel] Support deep_gemm for linear methods ( #19085 )
...
Signed-off-by: artetaout <lulala341@gmail.com >
2025-06-11 15:14:45 +08:00
5039ec2336
[ROCm] Add rules to automatically label ROCm related PRs ( #19405 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-11 15:09:18 +08:00
7c644ab6d5
Fix Typo in Documentation and Function Name ( #19442 )
2025-06-10 22:44:11 -07:00
2d40665fe8
Add fused MOE config for Qwen3 30B A3B on B200 ( #19455 )
...
Signed-off-by: Junhao Li <junhao@ubicloud.com >
2025-06-11 13:43:46 +08:00
96ada386b7
[Misc] Remove unused MultiModalHasher.hash_prompt_mm_data ( #19422 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-06-11 05:18:57 +00:00
1e473b3010
[CI] Disable failing GGUF model test ( #19454 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-11 05:12:38 +00:00
2b1e2111b0
Fix test_max_model_len in tests/entrypoints/llm/test_generate.py ( #19451 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-11 12:54:59 +08:00
a45b979d9f
[BugFix] Fix docker build cpu-dev image error ( #19394 )
...
Signed-off-by: niu_he <carlton2tang@gmail.com >
2025-06-10 20:56:40 -07:00
3952731e8f
[New Model]: Support Qwen3 Embedding & Reranker ( #19260 )
2025-06-10 20:07:30 -07:00
77f0d465d0
[BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 ( #19390 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-06-11 07:54:41 +08:00
22c3c0aa4a
Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 ( #19401 )
...
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com >
2025-06-11 07:23:57 +08:00
33f8dba7c6
[Model] use AutoWeightsLoader for commandr ( #19399 )
...
Signed-off-by: py-andy-c <pychen1017@gmail.com >
2025-06-10 22:42:21 +00:00
5241ca50d6
[ROCm][V1] Adding ROCm to the list of platforms using V1 by default ( #19440 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-06-10 22:06:15 +00:00
da9b523ce1
[Docs] Note that alternative structured output backends are supported ( #19426 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-06-10 16:20:00 +00:00
b6553be1bc
[Misc] Slight improvement of the BNB ( #19418 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-10 13:51:49 +00:00
64a9af5afa
Simplify ep kernels installation ( #19412 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-06-10 20:06:08 +08:00
e4248849ec
[BugFix][CPU] Fix CPU CI by ignore collecting test_pixtral ( #19411 )
...
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-06-10 12:02:40 +00:00
467bef18a3
[BugFix][FlashInfer] Fix attention backend interface mismatch with unexpected keyword use_irope ( #19134 )
...
Signed-off-by: Yunqiu Guo <guorachel@meta.com >
2025-06-10 16:48:51 +08:00
5f1ac1e1d1
Revert "[v1] Add fp32 support to v1 engine through flex attn" ( #19404 )
2025-06-10 01:30:20 -07:00
9368cc90b2
Automatically bind CPU OMP Threads of a rank to CPU ids of a NUMA node. ( #17930 )
...
Signed-off-by: Tsai, Louie <louie.tsai@intel.com >
Co-authored-by: Li, Jiang <bigpyj64@gmail.com >
2025-06-10 06:22:05 +00:00
32b3946bb4
Add clear documentation around the impact of debugging flag ( #19369 )
...
Signed-off-by: Anna Pendleton <pendleton@google.com >
2025-06-10 06:16:09 +00:00
6b1391ca7e
[Misc] refactor neuron_multimodal and profiling ( #19397 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-10 06:12:42 +00:00
a3f66e75d1
Add security warning to bug report template ( #19365 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com >
2025-06-10 06:06:36 +00:00
319cb1e351
[Core] Batch multi modal input using pinned memory ( #19169 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-06-10 13:44:59 +08:00
1efef71645
[Bugfix] Fix modelscope token passed in ( #19389 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-10 13:39:37 +08:00
646d62f636
[Core] Use tuple for kv cache group block ids ( #19175 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-10 07:01:17 +02:00
6cd4ae8acd
[Frontend] Add tqdm_leave_pbar to control progress bar visibility ( #19357 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-10 04:55:09 +00:00
c016047ed7
Fix docs/mkdocs/hooks/remove_announcement.py ( #19382 )
2025-06-09 21:36:54 -07:00
9af6d22e4c
Use xla flag to improve the quantized model performance ( #19303 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-06-10 01:28:45 +00:00
4589b94032
[Bugfix] Fix benchmark_moe.py ( #19016 )
...
Signed-off-by: Tianyu Guo <guoty9@mail2.sysu.edu.cn >
2025-06-09 18:04:36 -07:00
cc867be19c
[V1] Reuse V0's memory_profiling util for gpu worker memory profiling ( #19312 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-06-10 08:40:01 +08:00
3a7cd627a8
[Misc] Fix a config typo in disable_hybrid_kv_cache_manager configuration ( #19383 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-06-09 16:41:51 -07:00
8058c91108
[HOT-FIX] Add kv_sharing_target_layer_name argument to cutlass_mla backend ( #19374 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-06-09 19:00:07 -04:00
7d44c469fe
[TPU]Fix KV cache sharing tests ( #19371 )
2025-06-09 18:38:15 -04:00
31f58be96a
[Frontend] Make TIMEOUT_KEEP_ALIVE configurable through env var ( #18472 )
...
Signed-off-by: liusiqian <liusiqian@tal.com >
2025-06-09 21:41:21 +00:00
ebb2f383b8
[Quantization] Bump compressed-tensors version ( #19295 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-06-09 14:33:15 -07:00
c1c7dbbeeb
[Bugfix][Core] Prevent token lengths exceeding max_model_len in V0 ( #19348 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-09 23:01:29 +08:00
5cf2daea9a
[Misc] Fixes and Optimizations for DeepEP + DeepGEMM combination. ( #19298 )
...
Signed-off-by: Varun <vsundarr@redhat.com >
Co-authored-by: Varun <vsundarr@redhat.com >
2025-06-09 10:50:39 -04:00
b8089195b4
[v1] Add fp32 support to v1 engine through flex attn ( #19319 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-06-09 22:10:44 +08:00
770e5dcdb8
[full_graph] Fix query_start_loc padding ( #19321 )
...
Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai >
2025-06-09 21:32:56 +08:00
c57c9415b1
[Docs] Fix a bullet list in usage/security.md ( #19358 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-06-09 13:28:51 +00:00
01810f9236
[CI] Introduce rules for llama auto-label ( #19323 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-09 20:05:42 +08:00
59abbd84f9
[Fix] Allow kernel compilation for CUDA capability 8.7 ( #19328 )
...
Signed-off-by: Conroy Cheers <conroy@corncheese.org >
2025-06-09 02:57:23 -07:00
95a6568b5c
[CI/Build] Fix LoRA test ( #19350 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-09 09:52:10 +00:00
0eca5eacd0
[Doc] Fix description in the Automatic Prefix Caching design doc ( #19333 )
...
Signed-off-by: cr7258 <chengzw258@163.com >
2025-06-09 17:30:02 +08:00
12e5829221
[doc] improve ci doc ( #19307 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-09 07:26:12 +00:00
3a4d417707
[Misc] Cleanup compilation tests ( #19343 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-06-09 15:05:44 +08:00
8335667c22
[Frontend] Remove unreachable code from llm.py ( #19288 )
...
Signed-off-by: KsuParkhamchuk <k.parkhamchuk@gmail.com >
2025-06-09 10:22:10 +08:00
e1c4380d4c
[Misc] Add documentation update reminder to PR template ( #19289 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-09 10:20:53 +08:00
e31ae3de36
[Deprecation] Remove inputs arg fallback in Engine classes ( #18799 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-09 10:19:56 +08:00
2ffb9b6e07
[Bugfix] model_max_length should consider max_model_len in tokenizer_config ( #19201 )
2025-06-08 07:17:53 -07:00
cda10fa3e2
[Multi Modal] Add an env var for message queue max chunk bytes ( #19242 )
...
Signed-off-by: yZhen <yZhen@fb.com >
Co-authored-by: yZhen <yZhen@fb.com >
2025-06-08 21:39:12 +08:00
c123bc33f9
[Quantization] Add compressed-tensors NVFP4 support ( #18312 )
2025-06-08 09:05:55 -04:00
b9a1791e2c
[Hardware][POWER] Add IBM POWER11 Support to CPU Extension Detection ( #19082 )
...
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
2025-06-08 09:17:14 +00:00
989dcee981
Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B ( #19315 )
...
Signed-off-by: Xu Wenqing <xuwq1993@qq.com >
2025-06-08 16:07:02 +08:00
3d64d366e0
[Misc] Change tests/compile to use VLLM_V1 by default ( #19302 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-06-08 16:06:48 +08:00
eaa2e51088
[Bugfix] Re-enable use_cudagraph in vLLM v1 ( #19299 )
...
Signed-off-by: Richard Zou <zou3519@gmail.com >
2025-06-08 08:56:12 +08:00
d77f7fb871
[Bugfix]: Fix TypeError: 'float' object cannot be interpreted as an integer ( #19283 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-06-08 08:16:31 +08:00
2d8476e465
[BugFix][V1] Fix memory profiling bug ( #18974 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-06-07 10:34:51 -07:00
88be823d57
[AMD] Update compatible packaging version ( #19309 )
...
Signed-off-by: pramkuma <Pramendra.Kumar@amd.com >
2025-06-07 20:55:09 +08:00
4e4f63ad45
[Nit][Benchmark]Fix example in benchmark_serving_structured_output.py ( #19311 )
...
Signed-off-by: Lifan Shen <lifans@meta.com >
2025-06-07 18:25:38 +08:00
d2f0e7e615
[CI/Build] Improve Llama GGUF test robustness ( #19287 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-07 17:23:28 +08:00
122cdca5f6
[Misc] refactor context extension ( #19246 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-07 05:13:21 +00:00
cf02f9b283
Add FlexAttention to V1 ( #16078 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-06-06 21:58:55 -07:00
c4296b1a27
[CI][PowerPC] Use a more appropriate way to select testcase in tests/models/language/pooling/test_embedding.py ( #19253 )
...
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com >
2025-06-07 11:52:52 +08:00
66c508b137
[TPU][Test] Add script to run benchmark on TPU for buildkite ( #19039 )
...
Signed-off-by: Qiliang Cui <derrhein@gmail.com >
2025-06-06 20:10:24 -07:00
84166fee97
[Kernel] Integrate CUTLASS MoE kernel with PPLX ( #18762 )
...
Signed-off-by: ElizaWszola <ewszola@redhat.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-06-06 18:26:11 -07:00
6e0cd10f72
[Easy][Test] Simplify test_function_tool_use with multiple parametrizes ( #19269 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-07 09:19:09 +08:00
e010688f50
[Build][ROCm] Update Dockerfile.rocm ( #19296 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-06-06 19:35:16 -04:00
441b65d8c7
[Misc][Tools][Benchmark] Fix and improve auto tune script ( #19163 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-06-06 23:31:19 +00:00
46ecc57973
[BugFix] Fix tpu_model_runner block_id concatenation ( #19228 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-06 16:28:17 -07:00
b6a3a9f76d
[Core] Fix abrupt request abort ( #18485 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-06 16:27:59 -07:00
ca27f0f9c1
[Bugfix][Core] Update cancellation logic in generate() to handle Generator exits ( #19225 )
...
Co-authored-by: Adolfo Victoria <adovi@meta.com >
2025-06-06 20:17:54 +00:00
aad30bd306
[BugFix] Fix MultiConnector test after HMA changes ( #19291 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-06 20:16:24 +00:00
94ecee6282
Fixed ppc build when it runs on non-RHEL based linux distros ( #18422 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
Signed-off-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com >
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com >
Co-authored-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com >
2025-06-06 11:54:26 -07:00
8267f9916f
improve logits bias ( #19041 )
2025-06-06 19:59:25 +08:00
7353492a47
[Core] Raise when non-multi-instance DP clients target a DP rank ( #19227 )
...
Signed-off-by: Jon Swenson <jmswen@gmail.com >
2025-06-06 19:03:01 +08:00
7661e92ef8
[Model] Optimize nemotron_h implementation ( #19249 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-06 10:05:14 +00:00
f168b85725
Unit Test for run_dp_sharded_vision_model ( #19103 )
...
Signed-off-by: Siqi Yan <siqi@meta.com >
Co-authored-by: Siqi Yan <siqi@meta.com >
2025-06-06 16:24:02 +08:00
da511d54d8
Fix CompilationConfig repr ( #19091 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-06-06 16:23:35 +08:00
65c69444b1
[Docs] Improve V1 KVConnector interface documentation ( #19172 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-06 16:22:45 +08:00
94870359cd
[Quantization] Bump compressed-tensors version; update NVFP4A16 test model ( #19224 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
2025-06-06 01:21:54 -07:00
0d49483ea9
[TPU] fix kv cache dtype in model runner ( #19244 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-06-06 16:20:16 +08:00
90b78ec5f9
[v1][P/D] Fix an edge case in kv cache schedule ( #19182 )
...
Co-authored-by: jinghui <jinghui@fb.com >
2025-06-05 23:32:55 -07:00
91a2ef98ea
[Chore] update CODEOWNERS ( #19247 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-06-06 06:09:43 +00:00
3da2313d78
Support allowed_token_ids in ChatCompletionRequest ( #19143 )
...
Signed-off-by: Xu Song <xusong.vip@gmail.com >
2025-06-06 05:06:48 +00:00
b61dc5f972
[TPU] update torch_xla pin ( #19231 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-06-06 04:27:38 +00:00
f8a1a2d108
[v1] Hybrid Memory Allocator ( #17996 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-05 20:47:09 -07:00
3465b87ef8
[Bugfix] Fix EAGLE vocab embedding construction for Llama 70B ( #19033 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-06-05 19:10:08 -07:00
c8134bea15
Fix AOPerModuleConfig name changes ( #18869 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-06-05 18:51:32 -07:00
cb6d572e85
[Model] NemotronH support ( #18863 )
...
Signed-off-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com >
Co-authored-by: Luis Vega <2478335+vegaluisjose@users.noreply.github.com >
2025-06-05 21:29:28 +00:00
87360308b7
[V1] Use FlashInfer by default on Blackwell GPUs ( #19118 )
2025-06-05 15:40:39 -04:00
aa49f14832
[Quantization] Skip Fp4 Test for compressed-tensors ( #19217 )
2025-06-05 18:21:53 +00:00
9ef9173cfa
[P/D][NixlConnector] Enable FlashInfer backend ( #19090 )
2025-06-05 17:10:15 +00:00
85e2b7bb13
[MISC][Bugfix] Use less CPU when message queue has been empty for some time ( #16226 )
...
Signed-off-by: Povilas Kanapickas <povilas@radix.lt >
2025-06-05 16:53:08 +00:00
61059bee40
[Hardware][NVIDIA] FP4 MoE kernel optimization ( #19110 )
...
Signed-off-by: Chiyue Wei <chiyuew@nvidia.com >
Co-authored-by: Chiyue Wei <chiyuew@nvidia.com >
2025-06-05 09:48:26 -07:00
ec89524f50
Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 ( #19205 )
2025-06-05 16:38:54 +00:00
f20f9f063b
[mistral_common] Add v11 tokenizer ( #19193 )
...
Signed-off-by: Patrick von Platen <patrick.v.platen@gmail.com >
2025-06-05 08:27:41 -07:00
9bc8bb07cf
[Bugfix] properly catch PIL-related errors for vision models when incorrect data urls are provided ( #19202 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-06-05 12:59:28 +00:00
1aeb925f34
[Frontend] improve vllm run-batch --help display ( #19187 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-05 11:16:25 +00:00
188a4590d8
[Misc] Do not override NCCL_CUMEM_ENABLE if set explicitly ( #19105 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-05 11:14:32 +00:00
18093084be
[Misc] Remove unnecessary fallback to prefill-decode attention ( #19138 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-06-05 16:08:26 +08:00
da40380214
[Build] Annotate wheel and container path for release workflow ( #19162 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-06-04 23:24:56 -07:00
8fc57501d3
[Bugfix]: Fix the incompatibility issue with stream when Thinking is disabled ( #19135 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-06-05 06:24:24 +00:00
af7fc84fd2
[BugFix][Minor] Fix full cuda graph bug when max_num_seqs < 512 ( #19171 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-05 13:41:25 +08:00
0678b52251
Handle non-serializable objects when dumping benchmark results ( #19114 )
2025-06-04 22:40:04 -07:00
25b918eee6
[Torch Nightly]add missing dependency ( #18770 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-06-04 21:56:12 -07:00
a408820f2f
[Bugfix] Fix port handling in make_zmq_path ( #19117 )
2025-06-04 21:00:59 -06:00
c56ed8bb0e
[Bugfix][Nixl] Fix full prefix cache hit bug ( #18632 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-06-05 02:07:32 +00:00
78dcf56cb3
[doc] small fix ( #19167 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-05 09:13:50 +08:00
b2fac67130
[P/D] Heterogeneous TP ( #18833 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
2025-06-04 23:25:34 +00:00
23027e2daf
[Misc] refactor: simplify EngineCoreClient.make_async_mp_client in AsyncLLM ( #18817 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-06-04 15:37:25 -07:00
c3fd4d669a
[Kernel] Integrate batched/masked deepgemm kernel ( #19111 )
...
Signed-off-by: Varun <vsundarr@redhat.com >
Co-authored-by: Varun <vsundarr@redhat.com >
2025-06-04 21:59:18 +00:00
ef3f98b59f
[Bugfix] fix v1 cpu worker fails on macOS ( #19121 )
2025-06-04 20:17:38 +00:00
7ee2590478
[TPU] Update dynamo dump file name in compilation test ( #19108 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-06-04 16:13:43 -04:00
53a5a0ce30
[Perf] Tunings for SM100 FP8 CUTLASS kernel ( #18778 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-04 10:46:28 -07:00
d459fae0a2
[Bugfix][EP+DP] Fix internode check ( #19112 )
...
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
2025-06-04 23:39:23 +08:00
c8dcc15921
Allow AsyncLLMEngine.generate to target a specific DP rank ( #19102 )
...
Signed-off-by: Jon Swenson <jmswen@gmail.com >
2025-06-04 08:26:47 -07:00
8f4ffbd373
[Doc] Update V1 Guide for embedding models ( #19141 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-04 22:57:55 +08:00
5f2cd251d2
Sm100 blockwise fp8 swap ab ( #18564 )
2025-06-04 07:48:45 -07:00
02658c2dfe
Add DeepSeek-R1-0528 function call chat template ( #18874 )
...
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com >
2025-06-04 13:24:18 +00:00
01dc9a76db
[CI/Build][Bugfix] Ensure compatibility with transformers 4.52 ( #18678 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-04 04:49:20 -07:00
35cf32df30
Improve the output precision of embedding models ( #19092 )
2025-06-04 11:48:57 +00:00
8711bc5e68
[Misc] Add packages for benchmark as extra dependency ( #19089 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-04 04:18:48 -07:00
2669a0d7b5
Fix ValueError: Missing value for tag key(s): model_name,engine. ( #19113 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-06-04 17:10:45 +08:00
8e972d9c44
[TPU] Skip hanging tests ( #19115 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-06-04 01:43:00 -07:00
3336c8cfbe
Fix #19130 ( #19132 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-06-04 01:42:06 -07:00
b124e1085b
[Bugfix] Fix FA3 full cuda graph correctness ( #19106 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-06-03 23:10:15 -07:00
41aa578428
[NVIDIA] Add Cutlass MLA backend ( #17625 )
2025-06-03 21:40:26 -07:00
8d646c2e53
[Cleanup][v1]:remote guided-decoding-backend for example ( #19059 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-06-04 04:23:26 +00:00
5d6d1adf15
[KERNEL] Sampler. CUDA kernel for applying repetition penalty ( #18437 )
2025-06-03 21:13:01 -07:00
1409ef9134
[Core] Cast multimodal input in hf processor ( #18862 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-06-03 20:24:56 -07:00
4555143ea7
[CPU] V1 support for the CPU backend ( #16441 )
2025-06-03 18:43:01 -07:00
52dceb172d
[Docs] Add developer doc about CI failures ( #18782 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-06-04 01:09:13 +00:00
abd7df2fca
[Misc] Fix path and python alias errors in disagg_prefill examples ( #18919 )
2025-06-03 17:15:18 -07:00
b712be98c7
feat: add data parallel rank to KVEventBatch ( #18925 )
2025-06-03 17:14:20 -07:00
a8da78eac9
[Bugfix] Max concurrency estimation and check_enough_kv_cache_memory for models with sliding window layers ( #19029 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-04 00:14:06 +00:00
5d96533e22
[Bugfix][P/D] Fix Prefix Cache Bug ( #18411 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
2025-06-03 23:53:16 +00:00
4de790fcad
[Bugfix]: Fix the incompatibility issue with tool_choice 'required' when Thinking is enabled ( #19075 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-06-03 23:27:24 +00:00
b5fd9506c1
[Bugfix] get_num_blocks_to_allocate with null_block ( #19031 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-03 15:30:55 -07:00
135cf55cd1
[V1][Spec Decode][Ngram] 1.35x gain -> 1.95x gain on InstructCoder with prompt fix ( #18971 )
2025-06-03 15:26:33 -07:00
6cac54f4d1
[v1] Re-init input batch for multiple kv cache groups ( #18654 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-03 21:41:36 +00:00
6865fe0074
Fix interaction between Optional and Annotated in CLI typing ( #19093 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Yikun Jiang <yikun@apache.org >
2025-06-03 21:07:19 +00:00
e31446b6c8
[Perf] Tune scaled_fp8_quant by increasing vectorization ( #18844 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-03 13:48:25 -07:00
bdf13965ab
[V1] Support cross-layer KV sharing ( #18212 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-06-03 20:33:07 +00:00
fa98d77773
[Kernel] DeepEP dispatch-combine kernel integration ( #18434 )
...
Signed-off-by: Varun <vsundarr@redhat.com >
Co-authored-by: Varun Sundar Rabindranath <vsundarr@redhat.com >
2025-06-03 12:30:02 -07:00
01eee40536
[doc] update docker version ( #19074 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-03 19:08:21 +00:00
19bdaf32b1
[Doc] Readme standardization ( #18695 )
...
Co-authored-by: Soren Dreano <soren@numind.ai >
2025-06-03 11:50:55 -07:00
02f0c7b220
[Misc] Add SPDX-FileCopyrightText ( #19100 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-06-03 11:20:17 -07:00
d054da1992
[Misc] fix: add miss best_of param validation ( #18555 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-06-03 11:02:07 -07:00
4b7817c119
[Misc] Add missing _Backend enums ( #19081 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
2025-06-03 16:15:16 +00:00
d00dd65cd4
[Doc] Improve the Pull Request template with key components ( #19086 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-03 23:44:34 +08:00
d81edded69
[Bugfix] disable processor cache ( #19068 )
...
Signed-off-by: raushan <raushan@huggingface.co >
2025-06-03 15:06:04 +00:00
476844d44c
Fix underscores in dict keys passed via CLI ( #19030 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-06-03 14:39:24 +00:00
4e68ae5e59
[CI/Build] Remove V0 LoRA test ( #19066 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-03 14:30:18 +00:00
4e88723f32
[doc] clarify windows support ( #19088 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-06-03 21:42:17 +08:00
118ff92111
[Doc] Update V1 user guide for embedding and enc-dec models ( #19060 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-03 02:29:41 -07:00
ec2dcd80bc
[Misc] Update WeightsMapper for qwen2-vl/qwen2.5-vl ( #19054 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-03 09:08:20 +00:00
42243fbda0
[Doc] Add InternVL LoRA support ( #19055 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-06-03 09:08:03 +00:00
6d18ed2a2e
Update docker docs with ARM CUDA cross-compile ( #19037 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-06-03 08:21:53 +00:00
f32fcd9444
[v1][KVCacheManager] Rename BlockHashType to BlockHash ( #19015 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-06-03 08:01:48 +00:00
d32aa2e670
[Bugfix] Use cmake 3.26.1 instead of 3.26 to avoid build failure ( #19019 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-06-03 00:16:17 -07:00
cc977286e7
Reduce logs in CLI scripts and plugin loader ( #18970 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-03 06:00:45 +00:00
17430e3653
[bugfix] small fix logic issue ( #18999 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-03 05:35:12 +00:00
1282bd812e
Add tarsier model support ( #18985 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-06-03 13:13:13 +08:00
bdce64f236
[V1] Support DP with Ray ( #18779 )
2025-06-02 21:15:13 -07:00
9e6f61e8c3
[ROCm][Build] Clean up the ROCm build ( #19040 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-06-02 20:47:47 -07:00
8655f47f37
[CPU][CI] Re-enable the CPU CI tests ( #19046 )
...
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-06-02 20:46:47 -07:00
4ce42f9204
Adding "LoRA Test %N" to AMD production tests ( #18929 )
...
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
2025-06-02 20:46:44 -07:00
8a57872b2a
[Bugfix][EP+DP] Use pplx-kernel internode instead of intranode ( #19034 )
...
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-06-03 11:36:51 +08:00
5bc1ad6cee
[Doc] Remove duplicate TOCs during MkDocs migration ( #19021 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-06-02 19:49:48 -07:00
9112b443a0
[Hardware][TPU] Initial support of model parallelism with single worker using SPMD ( #18011 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
Co-authored-by: Hossein Sarshar <hossein.sarshar@gmail.com >
Co-authored-by: Chengji Yao <chengjiyao@google.com >
2025-06-03 00:06:20 +00:00
c57d577e8d
add an absolute path for run.sh ( #18258 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-06-02 19:38:23 +00:00
ca2f6b9c30
[Bugfix][Model] Attempt to fix eagle in V0. ( #18978 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-06-02 08:15:53 -07:00
20133cfee2
[Frontend] enable custom logging for the uvicorn server (OpenAI API server) ( #18403 )
...
Signed-off-by: François Paupier <francois.paupier@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-06-02 15:04:23 +00:00
ebb1ec9318
[Model] enable data parallel for Llama4 vision encoder ( #18368 )
...
Signed-off-by: yzhen <yzhen@devgpu093.cco2.facebook.com >
Co-authored-by: yZhen <yZhen@fb.com >
Co-authored-by: yzhen <yzhen@devgpu093.cco2.facebook.com >
2025-06-02 19:22:54 +08:00
5b168b6d7a
[doc] add pytest tips ( #19010 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-02 11:07:26 +00:00
9760fd8f6a
[Core] Support inplace model weights loading ( #18745 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-06-02 17:38:50 +08:00
b9f61e1387
[Bugfix][Nixl] Fix DP Metadata Handshake ( #19008 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-06-02 03:30:41 +00:00
d6fd3a33b8
[Misc] reuse num_tokens_across_dp of get_dp_padding to avoid unnecessary dp all reduce in set_forward_context ( #18935 )
...
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
2025-06-01 19:41:18 +00:00
432ec9926e
[doc] wrong output ( #19000 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-06-01 11:26:14 +00:00
2b102d51ad
[BugFix] Fix incorrect metrics shutdown error log message ( #18992 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-06-01 11:42:23 +08:00
aa54a7bf7b
[BugFix] fix data parallel construct ipv6 url address ( #18991 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-06-01 11:42:10 +08:00
2ad6194a02
Let max_num_batched_tokens use human_readable_int for large numbers ( #18968 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-06-01 11:41:29 +08:00
c594cbf565
[doc] small fix - mkdocs ( #18996 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-31 20:23:43 -07:00
a35ca765a5
[LoRA] Support dynamically initialize packed_modules_mapping for VLM with arbitrary components ( #18987 )
...
Signed-off-by: isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-06-01 11:06:57 +08:00
6aa8f9a4e7
[Core] Rework dtype resolution ( #18751 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-06-01 11:04:23 +08:00
1bc86a3da1
[Bugfix] Fix EAGLE3 broken logits ( #18909 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-05-31 19:58:07 -07:00
bbfa0c61d1
[Misc][Benchmark] Add support for CustomDataset ( #18511 )
2025-05-31 19:07:38 +00:00
20079c6e36
[Misc] add return token strs for tokenize ( #18941 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-31 18:00:11 +00:00
9a1b9b99d7
[BugFix] Fix multi-node offline data-parallel ( #18981 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com >
2025-05-31 08:34:52 -07:00
8bf507d766
[P/D] NixlConnector use cache device index for memory registration ( #18969 )
...
Signed-off-by: Piotr Tarasiewicz <ptarasiewicz@nvidia.com >
2025-05-31 11:19:18 -04:00
306d60401d
[ROCm][Kernel] Add gfx950 support for skinny gemms ( #18010 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-05-31 07:40:05 -07:00
f2c3f66d59
[Bugfix] Fix for issue 17396 ( #18773 )
...
Signed-off-by: Fred Reiss <frreiss@us.ibm.com >
2025-05-31 11:58:17 +00:00
0f5e0d567e
[FEAT][ROCm] Add AITER grouped topk for DeepSeekV2 ( #18825 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-05-31 03:39:31 -07:00
c55d804672
[BugFix] Pydantic part 2 ( #18911 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-05-31 03:39:28 -07:00
749f5bdd38
[doc] fix the list rendering issue - security.md ( #18982 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-31 10:39:21 +00:00
2a50ef5760
[Neuron] Add Multi-Modal model support for Neuron ( #18921 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
Co-authored-by: Ashraf Mahgoub <ashymahg@amazon.com >
Co-authored-by: Rohith Nallamaddi <nalrohit@amazon.com >
Co-authored-by: FeliciaLuo <luof@amazon.com >
Co-authored-by: Elaine Zhao <elaineyz@amazon.com >
2025-05-31 10:39:11 +00:00
b8b904795d
fix security issue of logging llm output ( #18980 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-05-31 10:38:56 +00:00
ba5111f237
[Bugfix]: Fix the incompatibility issue with Structured Outputs when Thinking is disabled ( #18879 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-31 09:20:54 +00:00
1e123529d7
[Misc] Fix estimated max model len msg ( #18966 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-05-31 16:43:44 +08:00
dff80b0e42
[Frontend] Add rerank support to run_batch endpoint ( #16278 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2025-05-31 07:40:01 +00:00
7782464a17
create util function for batched arange ( #18937 )
2025-05-31 13:50:38 +08:00
0f71e24034
[Docs] Correct multiprocessing design doc ( #18964 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-31 01:30:15 +00:00
1dab4d5718
Tool parser regex timeout handling ( #18960 )
...
Signed-off-by: Will Eaton <weaton@redhat.com >
2025-05-30 21:02:54 +00:00
7f21e8052b
[Misc] add group_size is -1 in awq quantization ( #18910 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-05-30 17:34:22 +00:00
5a8641638a
[VLM] Add PP support and fix GPTQ inference for Ovis models ( #18958 )
...
Signed-off-by: isotr0py <2037008807@qq.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-30 17:11:44 +00:00
f49239cb45
Benchmark script for fp8 vs bf16 gemm ( #17126 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-30 10:56:11 -06:00
2dbe8c0774
[Perf] API-server scaleout with many-to-many server-engine comms ( #17546 )
2025-05-30 08:17:00 -07:00
84ec470fca
Improve "failed to get the hash of the compiled graph" error ( #18956 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-30 15:00:54 +00:00
b29ca5c4d5
[Docs] Update SECURITY.md with link to our security guide ( #18961 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-30 07:37:27 -07:00
ec6833c5e9
[doc] show the count for fork and watch ( #18950 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-30 06:45:59 -07:00
e1fadf1197
[Feature] minicpm eagle support ( #18943 )
...
Signed-off-by: huangyuxiang03 <huangyx0321@gmail.com >
Co-authored-by: huangyuxiang03 <huangyx0321@gmail.com >
2025-05-30 06:45:56 -07:00
43ff405b90
[CI/Build] remove regex from build dependencies ( #18945 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-30 04:02:50 -07:00
fba02e3bd1
[Bugfix][TPU] Fix tpu model runner testcase failure ( #18810 )
...
Signed-off-by: Carol Zheng <cazheng@google.com >
2025-05-30 18:04:03 +08:00
4577fc9abb
[Misc]Fix typo ( #18947 )
2025-05-30 02:21:35 -07:00
5f1d0c8118
[Bugfix][Failing Test] Fix test_vllm_port.py ( #18618 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-30 17:13:47 +08:00
c3bb9f2331
[Model] Use in-place adds in SigLIP ( #18922 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-30 17:12:59 +08:00
8f8900cee9
[doc] add mkdocs doc ( #18930 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-30 07:58:44 +00:00
6acb7a6285
[Misc]Fix benchmarks/README.md for speculative decoding ( #18897 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-30 07:58:04 +00:00
4f4a6b844a
[Deprecation] Remove mean pooling default for Qwen2EmbeddingModel ( #18913 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-30 06:53:37 +00:00
4d0a1541be
[Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy ( #18861 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-30 13:37:36 +08:00
77b6e74fe2
[ROCm] Remove unnecessary assertion of max_model_len in ROCM_AITER_MLA attention backend. ( #18938 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-05-29 22:33:17 -07:00
5acf828d99
[docs] fix: fix markdown syntax ( #18927 )
2025-05-30 05:20:48 +00:00
3987e2ae96
[Model] Use AutoWeightsLoader for mamba2 ( #18918 )
...
Signed-off-by: iLeGend <824040212@qq.com >
2025-05-30 04:50:10 +00:00
77164dad5e
[Bugfix] Consistent ascii handling in tool parsers ( #18883 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-30 04:44:43 +00:00
3de3eadf5b
improve the robustness of parsing vlms config in AutoRound ( #18894 )
...
Signed-off-by: wenhuach21 <wenhua.cheng@intel.com >
2025-05-29 19:24:47 -07:00
3132290a14
[TPU][CI/CD] Clean up docker for TPU tests. ( #18926 )
...
Signed-off-by: Carol Zheng <cazheng@google.com >
2025-05-30 10:24:19 +08:00
1aa2f81b43
[Misc] Update type annotation for rotary embedding base ( #18914 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-30 10:17:01 +08:00
d54af615d5
[Bugfix] Fix PP default fallback behavior for V1 ( #18915 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-30 10:13:17 +08:00
a1cc9f33a3
[TPU] remove transpose ops in moe kernel ( #18923 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-05-29 23:00:11 +00:00
a521ef06e5
Use standalone_compile by default in torch >= 2.8.0 ( #18846 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-30 06:41:58 +08:00
64eaf5fe05
[P/D] NixlConnector DP fixes ( #18903 )
...
Signed-off-by: Will Eaton <weaton@redhat.com >
2025-05-29 18:08:40 +00:00
d1d61f3351
[BugFix] Make DP work with connector-delayed new requests ( #18559 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Will Eaton <weaton@redhat.com >
2025-05-29 18:04:18 +00:00
32ce3cf7c9
[V1] Allocate kv_cache with stride order for V1 ( #18775 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
2025-05-29 17:54:16 +00:00
d58f9c7f7a
[Misc] Remove duplicate init for self.vllm_config ( #18896 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-05-29 17:26:07 +00:00
c29034037d
[Deprecation] Disallow pos-args other than model when initializing LLM ( #18802 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-29 09:36:58 -07:00
1b7cfd5a36
[ROCm][V0][Attention] Revert to the previous FA triton kernel ( #18226 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-29 12:13:18 -04:00
da4b69d0b4
[Attention][V1] Toggle for v1 attention backend ( #18275 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-29 10:48:24 -04:00
c9479b2920
[Bugfix] Fix the failing gte embedding test ( #18720 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-29 07:39:25 -07:00
6f2909405e
[Doc] Fix codeblocks formatting in LoRA adapters documentation ( #18907 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-05-29 07:38:55 -07:00
b169d5f7b6
[Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. ( #18692 )
...
Signed-off-by: Duyi-Wang <duyi.wang@intel.com >
2025-05-29 20:02:08 +08:00
f8977c233f
Fix an error in dummy weight loading for quantization models ( #18855 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-05-29 03:07:20 -07:00
f274581f44
[BugFix] Update pydantic to fix error on python 3.10 ( #18852 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-05-29 03:05:46 -07:00
0b1447f890
[Bugfix] Ensure tensors are contiguous during serialisation ( #18860 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-29 03:05:20 -07:00
24d0ef8970
[Misc] Replace TODO in serving transcription ( #18895 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-05-29 02:58:14 -07:00
7fcfd954ff
[Bugfix] Fix misleading information in the documentation ( #18845 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-29 02:54:14 -07:00
e740d07f07
[doc] add CLI doc ( #18871 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-29 09:51:36 +00:00
a652e71dd0
[Doc] Remove redundant spaces from compatibility_matrix.md ( #18891 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-05-29 02:51:20 -07:00
34d6c447c4
[LoRA] Add LoRA support for InternVL ( #18842 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-29 08:46:24 +00:00
972eddf7c9
[Neuron] Add multi-LoRA support for Neuron. ( #18284 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
2025-05-29 16:41:22 +08:00
fd7bb88d72
Fixes a dead link in nightly benchmark readme ( #18856 )
...
Signed-off-by: Brent Salisbury <bsalisbu@redhat.com >
2025-05-29 04:41:39 +00:00
3c49dbdd03
Skip device and quant Pydantic validation to make plugin device work ( #18843 )
...
Signed-off-by: Yikun Jiang <yikunkero@gmail.com >
2025-05-28 20:12:30 -07:00
1661a9c28f
[Doc][Neuron] Update documentation for Neuron ( #18868 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
2025-05-28 19:44:01 -07:00
8e882ffdc0
[Bugfix][TPU] fix moe custom kernel import ( #18853 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-05-28 19:34:19 -07:00
26b4fa45be
Add ability to use CUDAGraphs with use_inductor=False ( #17345 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-29 10:16:52 +08:00
515b413ebf
Prevent the cross-encoder logic from being applied to classification tasks ( #18838 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-28 19:16:17 -07:00
269d901734
[Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix ( #18100 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-29 07:21:46 +08:00
7951d78738
[Core] Enable CUDA graphs for DP + All2All kernels ( #18724 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-05-28 22:55:30 +00:00
6dbe5b5c93
Remove checks for None for fields which should never be None ( #17985 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-28 21:32:19 +00:00
643622ba46
[Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend ( #15655 )
...
Signed-off-by: Akshat Tripathi <akshat@krai.ai >
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Signed-off-by: xihajun <junfan@krai.ai >
Signed-off-by: Jorge de Freitas <jorge.de-freitas22@imperial.ac.uk >
Signed-off-by: Jorge de Freitas <jorge@krai.ai >
Co-authored-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: xihajun <junfan@krai.ai >
Co-authored-by: Jorge de Freitas <jorge.de-freitas22@imperial.ac.uk >
Co-authored-by: Jorge de Freitas <jorge@krai.ai >
2025-05-28 19:59:09 +00:00
a09c7ca9f2
[Chore][Spec Decode] Update check NoneType instead of assigning variables ( #18836 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-28 18:57:19 +00:00
0e98964e94
[V1][Metrics] Remove metrics that were deprecated in 0.8 ( #18837 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-28 18:54:12 +00:00
c68b5c63eb
[Misc] fix olmoe model layer can't load in tp gt 1 ( #18828 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-05-28 17:36:21 +00:00
fced756923
[Chore] update ty configuration ( #18839 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-28 08:59:11 -07:00
321331b8ae
[Core] Add Lora Support to Beam Search ( #18346 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-05-28 08:58:24 -07:00
6e4cea1cc5
decrement server_load on listen for disconnect ( #18784 )
...
Signed-off-by: Daniel Salib <danielsalib@meta.com >
2025-05-28 22:15:12 +08:00
435fa95444
[Frontend] add run batch to CLI ( #18804 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-28 07:08:57 -07:00
4c2b38ce9e
Enable Pydantic mypy checks and convert configs to Pydantic dataclasses ( #17599 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-28 12:46:04 +00:00
d781930f90
[Platform][Dist] Make torch distributed process group extendable ( #18763 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-28 10:52:34 +00:00
ce75efeecb
[BugFix] FA2 MLA Accuracy Issue ( #18807 )
...
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com >
2025-05-28 08:59:39 +00:00
aa42561e40
Fix PiecewiseCompileInterpreter ( #17338 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-28 08:40:53 +00:00
de65fc8e1e
[CI] improve embed testing ( #18747 )
2025-05-28 00:16:35 -07:00
0c492b7824
[Deprecation] Remove fallbacks for Embeddings API ( #18795 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-28 15:09:04 +08:00
0f0926b43f
[Deprecation] Remove unused sync methods in async_timeout ( #18792 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-28 15:08:48 +08:00
7f2c1a87e9
[Deprecation] Require overriding get_dummy_text and get_dummy_mm_data ( #18796 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-28 15:08:35 +08:00
b78f844a67
[Bugfix][FailingTest]Fix test_model_load_with_params.py ( #18758 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-28 05:42:54 +00:00
5e13c07d00
[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2) ( #18781 )
...
Signed-off-by: Ronald Xu <ronaldxu@amazon.com >
2025-05-28 05:09:14 +00:00
774c5fde30
[V1] fix torch profiling for V1 offline scenarios ( #18445 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-05-28 04:16:30 +00:00
9a21e331ff
[Bugfix]: correctly propagate errors message caught at the chat_templating step to the client ( #18769 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-05-28 03:35:43 +00:00
3e9ce609bd
[Bugfix] Fix nomic max_model_len ( #18755 )
2025-05-27 20:29:53 -07:00
794ae1f551
[rocm] Fix wrong attention log ( #18764 )
...
Signed-off-by: Felix Marty <felmarty@amd.com >
2025-05-27 19:45:41 -07:00
d73a9457a5
[Core] Improve Tensor serialisation ( #18774 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-28 09:46:21 +08:00
a3896c7f02
[Build] Fixes for CMake install ( #18570 )
2025-05-27 20:49:24 -04:00
51e98e4ffd
[Bugfix] Disable prefix caching by default for benchmark ( #18771 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-05-28 08:18:09 +08:00
e56f44d9ec
Support datasets in vllm bench serve and sync with benchmark_[serving,datasets].py ( #18566 )
2025-05-27 19:59:48 -04:00
e0cbad4e30
[Neuron] Support quantization on neuron ( #18283 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
2025-05-27 22:10:33 +00:00
b48d5cca16
[CI/Build] [TPU] Fix TPU CI exit code ( #18282 )
...
Signed-off-by: Carol Zheng <cazheng@google.com >
2025-05-27 14:54:59 -07:00
5873877241
[Bugfix] Mistral tool calling when content is list ( #18729 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-27 09:05:37 -07:00
696259ca01
[Core] Automatically cast multi-modal input dtype ( #18756 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-27 23:45:48 +08:00
6b6d496114
optimize get_kv_cache_torch_dtype ( #18531 )
...
Signed-off-by: idellzheng <idellzheng@tencent.com >
2025-05-27 13:08:44 +00:00
aaa4ac1c95
Disable prefix cache by default for benchmark ( #18639 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-05-27 20:06:34 +08:00
06a0338015
[V1][Metrics] Add API for accessing in-memory Prometheus metrics ( #17010 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-27 09:37:06 +00:00
4318c0559d
[CI/Build] Remove imports of built-in re ( #18750 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-27 09:19:18 +00:00
a68e293cb9
[Doc] Convert Sphinx directives ( {class}, {meth}, {attr}, ...) to MkDocs format for better documentation linking ( #18663 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-05-27 01:44:20 -07:00
6881107948
[BUG FIX] minicpm ( #18739 )
...
Signed-off-by: huangyuxiang03 <huangyx0321@gmail.com >
Co-authored-by: huangyuxiang03 <huangyx0321@gmail.com >
2025-05-27 01:04:49 -07:00
e0f0ff87b8
[Build] fix cpu build missing libtbbmalloc.so ( #18744 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-05-27 01:03:56 -07:00
c24b1572ac
Minor fix about MooncakeStoreConnector ( #18721 )
...
Signed-off-by: baoloongmao <baoloongmao@tencent.com >
2025-05-27 08:02:28 +00:00
4693a3438c
[Doc] cleanup deprecated flag for doc ( #18715 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-05-27 07:12:02 +00:00
bbd9a84dc5
[Hardware][Intel-Gaudi] [CI/Build] Fix multiple containers using the same name in run-hpu-test.sh ( #18752 )
...
Signed-off-by: Lukasz Durejko <ldurejko@habana.ai >
2025-05-27 00:10:26 -07:00
a547aeb828
feat(rocm-support): support mamba2 on rocm ( #18565 )
...
Signed-off-by: Islam Almersawi <islam.almersawi@openinnovation.ai >
Co-authored-by: Islam Almersawi <islam.almersawi@openinnovation.ai >
2025-05-27 00:07:53 -07:00
fc6d0c290f
[Misc] improve docs ( #18734 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-27 07:07:01 +00:00
753944fa9b
[Doc] Update reproducibility doc and example ( #18741 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-27 07:03:13 +00:00
25a817f202
[Doc] Update OOT model docs ( #18742 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-27 06:30:31 +00:00
d260f799a9
[FEAT] [ROCm] Upgrade AITER Fused MoE kernels. ( #18271 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-05-26 23:14:07 -07:00
b50602d5f0
[Model][Gemma3] Cast image pixel values already on CPU ( #18732 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-27 05:42:54 +00:00
1f1b1bc03b
[V1][Quantization] Add CUDA graph compatible v1 GGUF support ( #18646 )
...
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-27 04:40:28 +00:00
1f88dbd2bb
[Misc] improve web section group title display ( #18684 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-27 04:35:16 +00:00
0eebd74842
[Model][Gemma3] Simplify image input validation ( #18710 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-27 11:13:37 +08:00
27bebcd897
Convert examples to ruff-format ( #18400 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-26 16:57:54 +00:00
e7523c2e03
[V1][Sampler] Improve performance of FlashInfer sampling by sampling logits instead of probs ( #18608 )
2025-05-26 11:49:36 -04:00
a869baca73
[Bugfix] Fix Llama GGUF initialization ( #18717 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 07:49:22 -07:00
82e2339b06
[Doc] Move examples and further reorganize user guide ( #18666 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 07:38:04 -07:00
9553fdb41e
[Doc] Improve API docs ( #18713 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 07:33:34 -07:00
243eb9199f
[Bugfix]: handle hf-xet CAS error when loading Qwen3 weights in vLLM ( #18701 )
2025-05-26 07:10:56 -07:00
0665e29998
[Misc] add AutoGen integration ( #18712 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-26 13:56:18 +00:00
e76be06550
[Hardware][Intel-Gaudi] [CI/Build] Add tensor parallel size = 2 test to HPU CI ( #18709 )
...
Signed-off-by: Lukasz Durejko <ldurejko@habana.ai >
2025-05-26 05:26:07 -07:00
0877750029
[CI/Build] Split pooling and generation extended language models tests in CI ( #18705 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-26 04:00:08 -07:00
6d68030f1c
[Model] Add support for YARN in NemotronNAS models ( #18427 )
...
Signed-off-by: Nave Assaf <nassaf@nvidia.com >
2025-05-26 10:31:49 +00:00
5a2c76cbe1
[CI] fix dump_input for str type ( #18697 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-26 18:23:35 +08:00
38b13dfe78
[CI/Build] Replace math.isclose with pytest.approx ( #18703 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 02:05:17 -07:00
61a45e7a72
[Bugfix] Fix Mistral-format models with sliding window ( #18693 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 01:44:04 -07:00
65523a0995
[Doc] Fix issue template format ( #18699 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 00:45:39 -07:00
4b7740a105
[GH] Add issue template for reporting CI failures ( #18696 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-26 00:42:04 -07:00
4ea62c0ea0
[CI] add missing argument ( #18694 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-26 00:22:04 -07:00
561b77a0d6
[Bugfix] Fix the lm_head in gpt_bigcode in lora mode ( #6357 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
2025-05-26 14:52:25 +08:00
abd4030d94
refactor: simplify request handler, use positive condition check for handler assignment ( #18690 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-05-26 06:32:28 +00:00
8820821b59
[Misc] Fixed the abnormally high TTFT issue in the PD disaggregation example ( #18644 )
...
Signed-off-by: zhaohaidao <zhaohaidao2008@hotmail.com >
Signed-off-by: zhaohaiyuan <zhaohaiyuan@xiaohongshu.com >
Co-authored-by: zhaohaiyuan <zhaohaiyuan@xiaohongshu.com >
2025-05-26 13:51:27 +08:00
fba0642704
[CI/Build][Doc] Update gte-Qwen2-1.5B-instruct usage ( #18683 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-05-25 20:27:50 -07:00
6071e989df
[Core][Multimodal] Convert PIL Image to array without data copy when hashing ( #18682 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-25 17:33:35 +00:00
57fd13a707
[Bugfix] Fix profiling dummy data for Pixtral ( #18677 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-25 14:05:30 +00:00
3a886bd58c
[Misc] small improve ( #18680 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-25 06:05:38 -07:00
35be8fad62
[CI/build] fix no regex ( #18676 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-25 10:10:51 +00:00
f2faac745d
[Bugfix] Fix cpu usage and cache hit stats reporting on cpu environment ( #18674 )
...
Signed-off-by: zzzyq <zhangyuqi94@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-25 02:36:06 -07:00
279f854519
[doc] improve readability ( #18675 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-25 01:40:31 -07:00
624b77a2b3
[doc] fix broken links ( #18671 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-25 01:36:33 -07:00
503f8487c2
[Misc] Reduce logs on startup ( #18649 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 23:03:53 -07:00
44073a7ac3
[BUGFIX] catch subclass first for try...except ( #18672 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-25 05:34:24 +00:00
63934543a0
Speed up the kernels/quantization/ tests ( #18669 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-25 05:02:59 +00:00
75f81750f3
[VLM] Initialize video input support for InternVL models ( #18499 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-25 04:51:25 +00:00
6ab681bcbe
[Misc][ModelScope] Change to use runtime VLLM_USE_MODELSCOPE ( #18655 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-05-25 04:51:21 +00:00
cebc22f3b6
[Misc] Replace cuda hard code with current_platform in Ray ( #14668 )
...
Signed-off-by: noemotiovon <757486878@qq.com >
2025-05-24 20:26:31 -07:00
6c6dcd8611
[MISC] correct signature for LoaderFunction ( #18670 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-24 20:17:47 -07:00
7891fdf0c6
[V1] Fix _pickle.PicklingError: Can't pickle <class 'transformers_modules.deepseek-ai.DeepSeek-V2-Lite... ( #18640 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-05-24 20:07:20 -07:00
6825d9a998
[BugFix][Spec Decode] Improve Prefix Caching Logic in Speculative Decoding ( #18668 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-24 17:33:46 -07:00
b554ab736e
[CI/Build] fix permission denied issue ( #18645 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-24 16:09:10 +00:00
9ea7f1abf3
fix(regression): clone from reference items ( #18662 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-24 15:25:20 +00:00
2807271c86
[CI] enforce import regex instead of re ( #18665 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-24 08:04:14 -07:00
b9018a3f9f
[BugFix] Fix import error for fused_moe ( #18642 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-05-24 07:53:36 -07:00
4ceafb6299
[MISC] typo fix and clean import ( #18664 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-24 07:52:09 -07:00
2e6705784f
[CI/Build] chmod +x to cleanup_pr_body.sh ( #18650 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 07:26:45 -07:00
1cb194a018
[Doc] Reorganize user guide ( #18661 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 07:25:33 -07:00
2cd4d58df4
[Model] use AutoWeightsLoader for gpt2 ( #18625 )
...
Signed-off-by: zt2370 <ztang2370@gmail.com >
2025-05-24 13:36:13 +00:00
6d166a8d35
[Doc] Add community links ( #18657 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 06:06:38 -07:00
ef1dd6870f
[Doc] Fix indentation problems in V0 Paged Attention docs ( #18659 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 06:06:35 -07:00
e77dc4bad8
[MISC][pre-commit] Add pre-commit check for triton import ( #17716 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-24 20:09:15 +08:00
07458a51ce
[Doc] Update README links, mark external links ( #18635 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-24 09:57:15 +00:00
c1e4a4052d
[V1][Spec Decode] Support multi-layer eagle draft model ( #18030 )
...
Signed-off-by: qizixi <qizixi@meta.com >
2025-05-24 09:45:34 +00:00
a859320575
[Model] Add support for Qwen2.5-Omni-7B-AWQ (Qwen2_5OmniForConditionalGeneration) ( #18647 )
2025-05-24 09:15:36 +00:00
441dc63ac7
[Frontend] improve vllm serve --help display ( #18643 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-24 07:53:22 +00:00
d55e446d13
[V1][Spec Decode] Small refactors to improve eagle bookkeeping performance ( #18424 )
...
Signed-off-by: qizixi <qizixi@meta.com >
2025-05-24 06:51:22 +00:00
ec82c3e388
FIX MOE issue in AutoRound format ( #18586 )
...
Signed-off-by: wenhuach21 <wenhua.cheng@intel.com >
2025-05-23 22:01:40 -07:00
45ab403a1f
config.py: Clarify that only local GGUF checkpoints are supported. ( #18623 )
...
Signed-off-by: Mathieu Bordere <mathieu@letmetweakit.com >
2025-05-24 08:46:34 +08:00
2b10ba7491
[Bugfix][Nixl] Fix Preemption Bug ( #18631 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-05-23 23:30:16 +00:00
4fc1bf813a
[Bugfix] Migrate to REGEX Library to prevent catastrophic backtracking ( #18454 )
...
Signed-off-by: Crucifixion-Fxl <xmufxl@gmail.com >
Co-authored-by: Crucifixion-Fxl <xmufxl@gmail.com >
2025-05-23 16:16:26 -07:00
f2036734fb
[ModelOpt] Introduce VLLM_MAX_TOKENS_PER_EXPERT_FP4_MOE env var to control blockscale tensor allocation ( #18160 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-05-23 15:52:20 -07:00
7d9216495c
[Doc] Update references to doc files ( #18637 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 15:49:21 -07:00
0ddf88e16e
[CI] Enable test_initialization to run on V1 ( #16736 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-23 15:09:44 -07:00
1645b60196
Use prebuilt FlashInfer x86_64 PyTorch 2.7 CUDA 12.8 wheel for CI ( #18537 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-05-23 21:17:16 +00:00
2628a69e35
[V1] Support Deepseek MTP ( #18435 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn >
Co-authored-by: Rui Qiao <ruisearch42@gmail.com >
2025-05-23 10:26:28 -07:00
371f7e4ca2
[Doc] Fix broken links and unlinked docs, add shortcuts to home sidebar ( #18627 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 10:22:40 -07:00
15b45ffb9a
[Doc] Avoid documenting dynamic / internal modules ( #18626 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 09:58:02 -07:00
273cb3b4d9
[Doc] Fix top-level API links/docs ( #18621 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 09:46:56 -07:00
8ddd1cf26a
[Doc] fix list formatting ( #18624 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-05-23 09:41:17 -07:00
6550114c9c
[v1] Redo "Support multiple KV cache groups in GPU model runner ( #17945 )" ( #18593 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-23 09:39:47 -07:00
9520a989df
[Docs] Change mkdocs to not use directory urls ( #18622 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-23 09:33:21 -07:00
3d28ad343f
Fix figures in design doc ( #18612 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 09:09:54 -07:00
6a7988c55b
Refactor pplx init logic to make it modular (prepare for deepep) ( #18200 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-05-23 23:43:43 +08:00
022d8abe29
[Doc] Use a different color for the announcement ( #18616 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 08:25:03 -07:00
5221815a00
[Doc] Fix markdown list indentation for MkDocs rendering ( #18620 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-05-23 08:23:21 -07:00
1068556b2c
[Bugfix][Build/CI] Fixup CUDA compiler version check for CUDA_SUPPORTED_ARCHS ( #18579 )
2025-05-23 07:43:58 -07:00
2cd1fa4556
[Misc] add Haystack integration ( #18601 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-23 06:21:19 -07:00
d4c2919760
Include private attributes in API documentation ( #18614 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 06:18:31 -07:00
6220f3c6b0
[Bugfix] Fix transformers model impl ignored for mixtral quant ( #18602 )
...
Signed-off-by: Tristan Leclercq <tristanleclercq@gmail.com >
2025-05-23 05:54:13 -07:00
52fb23f47e
Fix examples with code blocks in docs ( #18609 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 05:53:44 -07:00
6dd51c7ef1
[CI/Build] Fix V1 flag being set in entrypoints tests ( #18598 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 05:51:53 -07:00
2edb533af2
Replace {func} with mkdocs style links ( #18610 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 05:51:38 -07:00
38a95cb4a8
[Doc] Fix indent of contributing to vllm ( #18611 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-05-23 05:50:07 -07:00
cd821ea5d2
[CI] fix kv_cache_type argument ( #18594 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-23 04:49:18 -07:00
7ab056c273
[Hardware][CPU] Update intel_extension_for_pytorch 2.7.0 and move to requirements/cpu.txt ( #18542 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-05-23 04:38:42 -07:00
6526e05111
Add myself as docs code owner ( #18605 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 04:08:31 -07:00
e493e48524
[V0][Bugfix] Fix parallel sampling performance regression when guided decoding is enabled ( #17731 )
...
Signed-off-by: Madeesh Kannan <shadeMe@users.noreply.github.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-05-23 03:38:23 -07:00
4ce64e2df4
[Bugfix][Model] Fix baichuan model loader for tp ( #18597 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-23 02:39:05 -07:00
fbb13a2c15
Revert "[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal ( #18034 )" ( #18600 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-23 02:18:22 -07:00
a1fe24d961
Migrate docs from Sphinx to MkDocs ( #18145 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 02:09:53 -07:00
d0bc2f810b
[Bugfix] Add half type support in reshape_and_cache_cpu_impl on x86 cpu platform ( #18430 )
...
Signed-off-by: Yuqi Zhang <yuqizhang@google.com >
Co-authored-by: Yuqi Zhang <yuqizhang@google.com >
2025-05-23 01:41:37 -07:00
b046cf792d
[Feature][V1]: supports cached_tokens in response usage ( #18149 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-05-23 01:41:03 -07:00
54af915949
[Doc] Update quickstart and install for cu128 using --torch-backend=auto ( #18505 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-23 08:36:37 +00:00
71ea614d4a
[Feature] Add async tensor parallelism using compilation pass ( #17882 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-05-23 01:03:34 -07:00
4c611348a7
[V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal ( #18034 )
...
Signed-off-by: Ronald Xu <ronaldxu@amazon.com >
2025-05-23 00:37:18 -07:00
60cad94b86
[Hardware] correct method signatures for HPU,ROCm,XPU ( #18551 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-22 22:31:59 -07:00
9c1baa5bc6
[Misc] Replace cuda hard code with current_platform ( #16983 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-05-23 04:38:50 +00:00
4be2255c81
[Bugfix][Benchmarks] Fix a benchmark of deepspeed-mii backend to use api_key ( #17291 )
...
Signed-off-by: Teruaki Ishizaki <teruaki.ishizaki@ntt.com >
2025-05-23 12:30:47 +08:00
ed5d408255
[Neuron] Remove bypass on EAGLEConfig and add a test ( #18514 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
2025-05-22 21:26:32 -07:00
583507d130
[Spec Decode] Make EAGLE3 draft token ID mapping optional ( #18488 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-22 20:17:39 -07:00
e44d8ce8c7
[Bugfix] Set KVTransferConfig.engine_id in post_init ( #18576 )
...
Signed-off-by: Linkun Chen <github@lkchen.net >
2025-05-23 02:54:42 +00:00
93ecb8139c
[BugFix] Increase TP execute_model timeout ( #18558 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-23 10:22:11 +08:00
fae453f8ce
[Misc] refactor: simplify input validation and num_requests handling in _convert_v1_inputs ( #18482 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-05-23 10:15:32 +08:00
4b0da7b60e
Enable hybrid attention models for Transformers backend ( #18494 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-23 10:12:08 +08:00
c6b636f9fb
[V1][Spec Decoding] Use model_loader.get_model() to load models ( #18273 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-23 02:05:44 +00:00
04eb88dc80
Re-submit: Fix: Proper RGBA -> RGB conversion for PIL images. ( #18569 )
...
Signed-off-by: Chenheli Hua <huachenheli@outlook.com >
2025-05-23 01:59:18 +00:00
46791e1b4b
[AMD] [P/D] Compute num gpus for ROCm correctly in run_accuracy_test.sh ( #18568 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-05-22 18:45:35 -07:00
c32e249a23
[Frontend] [Core] Add Tensorizer support for V1, LoRA adapter serialization and deserialization ( #17926 )
...
Signed-off-by: Sanger Steel <sangersteel@gmail.com >
2025-05-22 18:44:18 -07:00
c91fe7b1b9
[Frontend][Bug Fix] Update llama4 pythonic jinja template and llama4_pythonic parser ( #17917 )
...
Signed-off-by: Kai Wu <kaiwu@meta.com >
2025-05-22 16:44:08 -07:00
a04720bc36
[V1][Spec Decode][Bugfix] Load quantize weights for EAGLE ( #18290 )
2025-05-22 15:17:33 -07:00
7b9d832c80
[Tool] Add NIXL installation script ( #18172 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-05-22 14:33:16 -07:00
6e588da0f4
[Build/CI] Fix CUDA 11.8 build ( #17679 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-22 12:13:54 -07:00
f8d2cc5f55
[Compile][Platform] Make PiecewiseBackend pluggable and extendable ( #18076 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-05-22 12:11:53 -07:00
721fb9b181
[Platform] Move platform check to right place ( #18470 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-05-22 12:11:28 -07:00
1f3a1200e4
[Bugfix] make test_openai_schema.py pass ( #18224 )
...
Signed-off-by: David Xia <david@davidxia.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-22 18:34:06 +00:00
54631f8262
[Misc] Call ndarray.tobytes() directly instead of ndarray.data.tobytes() ( #18347 )
...
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com >
2025-05-22 09:00:13 -07:00
cb506ecb5a
[Misc] improve Automatic Prefix Caching example ( #18554 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-22 14:50:46 +00:00
93f71673ce
[BugFix][CPU] Fix x86 SHM distributed module initialization ( #18536 )
...
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-05-22 07:35:00 -07:00
3f505233fd
[Doc] Add stream flag for chat completion example ( #18524 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-05-22 14:07:10 +00:00
4e04eceb58
[Bugfix] Use random hidden states in dummy sampler run ( #18543 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
2025-05-22 06:48:56 -07:00
71075029f2
[Doc] Support --stream arg in openai_completion_client.py script ( #18388 )
...
Signed-off-by: googs1025 <googs1025@gmail.com >
2025-05-22 13:20:17 +00:00
ca86a7cf6e
[CI/Build] Update bamba test model location ( #18544 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-22 06:01:07 -07:00
a35a494745
[Bugfix] Add kwargs to RequestOutput __init__ to be forward compatible ( #18513 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-05-22 05:24:43 -07:00
f6037d1907
[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text ( #18526 )
...
Co-authored-by: 松灵 <wpf272043@alibaba-inc.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-22 05:22:53 -07:00
fa72f9a812
Order sequence ids + config update to support specifying custom quantization layers ( #18279 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
Co-authored-by: Tailin Pan <tailinpa@amazon.com >
Co-authored-by: Rishabh Rajesh <rishyraj@amazon.com >
Co-authored-by: Yishan McNabb <yishanm@amazon.com >
Co-authored-by: Patrick Lange <patlange@amazon.com >
Co-authored-by: Maxwell Goldberg <mgld@amazon.com >
Co-authored-by: Aakash Shetty <sheaak@amazon.com >
2025-05-22 02:20:36 -07:00
ebed81fbf5
Update default neuron config for speculation ( #18274 )
...
Signed-off-by: Elaine Zhao <elaineyz@amazon.com >
Co-authored-by: Shashwat Srijan <sssrijan@amazon.com >
Co-authored-by: Aakash Shetty <sheaak@amazon.com >
2025-05-22 02:18:55 -07:00
e2d7d31244
[Neuron] Update Dockerfile.neuron to use latest neuron release (2.23) ( #18512 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
2025-05-22 02:17:34 -07:00
23b67b37b2
[Doc] Fix invalid JSON in example args ( #18527 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-22 07:11:46 +00:00
db5a29ba19
[Bugfix] Fix LoRA test ( #18518 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-21 21:48:53 -07:00
51797775c3
[Bugfix][Model] Make Olmo2Model weight loading return loaded weights ( #18504 )
...
Signed-off-by: Shane A <shanea@allenai.org >
2025-05-21 21:17:03 -07:00
cf5984b2fe
[BugFix][DP] Send DP wave completion only from dp_rank==0 ( #18502 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: kourosh hakhamaneshi <kourosh@anyscale.com >
2025-05-21 20:25:25 -07:00
d022115cc6
[Bugfix] Inconsistent token calculation compared to HF in llava family ( #18479 )
...
Signed-off-by: jaycha <jaycha@ncsoft.com >
2025-05-21 20:21:47 -07:00
acb54ca8e1
Initialize io_thread_pool attribute in the beginning. ( #18331 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-21 20:21:14 -07:00
6e0fd34d3c
[CI] Fix race condition with StatelessProcessGroup.barrier ( #18506 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-21 20:19:13 -07:00
176d62e4ea
[MISC] update project urls in pyproject.toml ( #18519 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-21 20:17:34 -07:00
20bd6f4d2e
[FalconH1] Fix output dtype in RMSNorm fallback path for Falcon-H1 (e.g. 0.5B) ( #18500 )
...
Signed-off-by: dhia.rhaiem <dhia.rhaiem@tii.ae >
Co-authored-by: younesbelkada <younesbelkada@gmail.com >
Co-authored-by: Ilyas Chahed <ilyas.chahed@tii.ae >
Co-authored-by: Jingwei Zuo <jingwei.zuo@tii.ae >
2025-05-21 19:23:59 -07:00
1f079540db
[Bugfix] Consistent ascii handling in tool parsers ( #17704 )
...
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com >
2025-05-21 20:41:23 +00:00
94d8ec8d2b
[FEAT][ROCm] Upgrade AITER MLA v1 backend ( #18338 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com >
2025-05-21 10:34:28 -07:00
bb0a311213
Revert "[v1] Support multiple KV cache groups in GPU model runner ( #17945 )" ( #18459 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-21 10:25:23 -07:00
dd5fa7e04f
[ROCm][Kernel][V1] Enable AMD Radeon GPU Custom Paged Attention on v1 ( #17004 )
...
Signed-off-by: Hosang Yoon <hosang.yoon@amd.com >
2025-05-21 08:35:00 -07:00
2b16104557
[Misc] Update deprecation message for --enable-reasoning ( #18404 )
2025-05-21 07:33:11 -07:00
371376f996
[Build] fix Dockerfile shell ( #18402 )
2025-05-21 07:32:06 -07:00
c6c10ca920
[Bugfix] Reduce moe_sum test size to avoid OOM ( #18484 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-05-21 06:46:39 -07:00
c154d89306
[Doc] fix arg docstring in linear layers ( #18410 )
...
Signed-off-by: giantcroc <1204449533@qq.com >
2025-05-21 06:45:57 -07:00
eca18691d2
[MODEL] FalconH1 ( #18406 )
...
Signed-off-by: dhia.rhaiem <dhia.rhaiem@tii.ae >
Co-authored-by: younesbelkada <younesbelkada@gmail.com >
Co-authored-by: Ilyas Chahed <ilyas.chahed@tii.ae >
Co-authored-by: Jingwei Zuo <jingwei.zuo@tii.ae >
2025-05-21 04:59:06 -07:00
61acfc45bc
[Bugfix][Failing Test] Fix test_events.py ( #18460 )
...
Signed-off-by: rabi <ramishra@redhat.com >
2025-05-21 04:57:28 -07:00
107f5fc4cb
[Misc] refactor disaggregated-prefill-v1 example ( #18474 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-21 11:10:14 +00:00
907f935de9
[V1] Fix general plugins not loaded in engine for multiproc ( #18326 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-05-21 01:21:49 -07:00
5d7f545204
[Frontend] deprecate --device arg ( #18399 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-05-21 01:21:17 -07:00
cd8dfc6dfc
[Misc] MultiConnector._connectors type ( #18423 )
...
Signed-off-by: nicklucche <nlucches@redhat.com >
2025-05-20 22:48:43 -07:00
d06dd72ba9
[Bugfix][Failing Test] Fix nixl connector test when prompt size < block size ( #18429 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-05-20 22:41:44 -07:00
ad0012a0ac
Revert "[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text ( #18407 )" ( #18456 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-20 22:39:22 -07:00
92247c522e
[Bug] Fix moe_sum signature ( #18440 )
...
Signed-off-by: Bill Nell <bnell@redhat.com >
2025-05-20 22:37:08 -07:00
0c15c2e486
[Bugfix] config.head_dim is now explicitly set to None ( #18432 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-20 21:04:33 -07:00
3b17ea26e4
[TPU] Re-enable the Pallas MoE kernel ( #18025 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-05-20 19:52:27 -07:00
23baa2180b
fix: Build torch wheel inline rather than picking from nightly ( #18351 )
...
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com >
2025-05-20 22:22:24 +00:00
980a172474
[Kernel] update comment for KV shape in unified triton attn ( #18099 )
...
Signed-off-by: haochengxia <xhc_1007@163.com >
2025-05-20 11:19:34 -07:00
e1f5a71ed7
[Model] use AutoWeightsLoader for bloom ( #18300 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-05-20 09:40:05 -07:00
f4a8a37465
[Minor] Rename quantization nvfp4 to modelopt_fp4 ( #18356 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-20 09:08:37 -07:00
8f55962a7f
[Misc] refactor prompt embedding examples ( #18405 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-20 15:26:12 +00:00
be48360c1f
[Bugfix] Fix MRoPE Errors in the Qwen-VL Model When Processing Pure Text ( #18407 )
...
Co-authored-by: 松灵 <wpf272043@alibaba-inc.com >
2025-05-20 06:59:48 -07:00
86847700d7
[CI] Add mteb testing to test the accuracy of the embedding model ( #17175 )
2025-05-20 06:51:12 -07:00
d6c86d09ae
Update cpu.txt ( #18398 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
2025-05-20 10:53:23 +00:00
6b35cb10a0
[Misc] Add LoRA code owner ( #18387 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-20 03:27:30 -07:00
1b1e8e05ff
[doc] update env variable export ( #18391 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-20 08:53:27 +00:00
bca55b556f
[Bugfix] fix adding bias twice in ipex GPTQ quantization ( #18363 )
...
Signed-off-by: rand-fly <randfly@outlook.com >
2025-05-20 00:54:33 -07:00
d981396778
[release] Change dockerhub username for TPU release ( #18389 )
2025-05-19 23:49:23 -07:00
9609327fa4
[Core] [Bugfix]: tensor parallel with prompt embeds ( #18171 )
...
Signed-off-by: Nan2018 <nan@protopia.ai >
Co-authored-by: Andrew Sansom <andrew@protopia.ai >
2025-05-19 20:21:27 -07:00
f07a673eb2
[Misc] Allow AutoWeightsLoader to skip loading weights with specific substr in name ( #18358 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-19 20:20:12 -07:00
d565e0976f
[neuron] fix authorization issue ( #18364 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-05-19 23:30:32 +00:00
258bf621d5
fix CUDA_check redefinition in #17918 ( #18287 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
Co-authored-by: Lucia (Lu) Fang <fanglu@meta.com >
2025-05-19 13:42:35 -07:00
dc1440cf9f
Neuron up mistral ( #18222 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
2025-05-19 09:54:47 -07:00
8171221834
[Misc] Fix typo ( #18330 )
2025-05-19 09:51:01 -07:00
7937c2fd52
Add fused MoE kernel tuning configs (fp8_w8a8) for DeepSeek V3/R1 on a single-node 8x NVIDIA H20 96GB setup ( #18337 )
2025-05-19 09:49:57 -07:00
e2ee1e8e9e
[Feature] Add support for models quantized with AutoRound ( #17850 )
...
Signed-off-by: wenhuach21 <wenhua.cheng@intel.com >
2025-05-19 09:38:53 -07:00
20d8ce81eb
[Frontend] add --quick option for vllm chat/complete ( #18297 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-19 09:36:13 -07:00
84ab4feb7e
[Doc] Fix typo ( #18355 )
2025-05-19 16:05:16 +00:00
6781af5608
[Quantization] Pool model support bitsandbytes ( #18087 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-19 09:03:43 -07:00
1b15df2546
[BugFix] Fix handling of num_computed_tokens with connector ( #18232 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com >
2025-05-19 09:03:25 -07:00
43b5f61dce
[Doc] Move input-related docs to Features ( #18353 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-19 15:08:39 +00:00
c5bb0ebdc6
[Doc] Fix prompt embedding examples ( #18350 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-05-19 06:48:16 -07:00
d637b96099
[BugFix] [Vul] Add missing usedforsecurity=False in MD5 hashing to enable FIPS ( #18319 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
Signed-off-by: shaoyuyoung <shaoyuyoung@gmail.com >
Co-authored-by: cascade <cascade812@outlook.com >
2025-05-19 01:31:23 -07:00
275c5daeb0
fix: Add type specifications for CLI arguments in tensorizer options ( #18314 )
2025-05-18 23:42:17 -07:00
47fda6d089
[Build] Supports CUDA 12.6 and 11.8 after Blackwell Update ( #18316 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-05-18 23:19:33 -07:00
27d0952600
[Misc] extract parser.parse_args() ( #18323 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-19 04:06:26 +00:00
221cfc2fea
Feature/vllm/input embedding completion api ( #17590 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: Nan2018 <nan@protopia.ai >
Co-authored-by: 临景 <linjing.yx@alibaba-inc.com >
Co-authored-by: Bryce1010 <bryceyx@gmail.com >
Co-authored-by: Andrew Sansom <andrew@protopia.ai >
Co-authored-by: Andrew Sansom <qthequartermasterman@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-18 20:18:05 -07:00
9da1095daf
[Spec Decode][V0] Fix spec decode correctness test in V0 eagle/medusa ( #18175 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-05-18 19:49:46 -07:00
d1211f8794
[Doc] Add doc to explain the usage of Qwen3 thinking ( #18291 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-05-18 23:04:07 +00:00
b6a6e7a529
[Misc] add litellm integration ( #18320 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-18 15:32:30 +00:00
4fb349f66a
Fix copy-paste error in phi4mm image processing ( #18315 )
...
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com >
2025-05-18 07:00:12 -07:00
908733aca7
[Model] Use sigmoid for single-label classification ( #18313 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-05-18 07:00:09 -07:00
1a8f68bb90
[doc] update reasoning doc ( #18306 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-18 06:59:14 -07:00
9ab2c02ff8
Support sequence parallelism combined with pipeline parallelism ( #18243 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
2025-05-17 22:47:25 +00:00
66e63e86ec
[MISC] fix typo ( #18305 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-17 10:52:09 -07:00
9214e60631
[Model] use AutoWeightsLoader for solar ( #18113 )
2025-05-17 00:24:17 -07:00
f880d42582
Fixed build on ppc64le due to openssl conflicts ( #18262 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
2025-05-17 00:23:46 -07:00
dcfe95234c
Update Dockerfile to build for Blackwell ( #18095 )
2025-05-17 00:23:25 -07:00
48ac2bed5b
[Hardware][TPU] Optionally import for TPU backend ( #18269 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com >
Co-authored-by: Carol Zheng <cazheng@google.com >
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com >
Co-authored-by: Hongmin Fan <fanhongmin@google.com >
2025-05-17 15:23:12 +08:00
3e0d435027
[P/D][V1] Support dynamic loading of external KV connector implementations ( #18142 )
...
Signed-off-by: David Ben-David <davidb@pliops.com >
Co-authored-by: David Ben-David <davidb@pliops.com >
2025-05-17 06:40:39 +00:00
4ee4826ede
[BugFix] Correct max_model_len derivation from config.json for Mistral format ( #17937 )
...
Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com >
Co-authored-by: tracelogfb <48808670+tracelogfb@users.noreply.github.com >
Co-authored-by: Stephen Chen <tracelog@meta.com >
2025-05-17 04:20:13 +00:00
60017dc841
[Misc] reformat the collect-env output ( #18285 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-16 19:46:18 -07:00
55f1a468d9
Move cli args docs to its own page ( #18228 ) ( #18264 )
...
Signed-off-by: Trevor Royer <troyer@redhat.com >
2025-05-16 19:43:45 -07:00
fd195b194e
[V1][P/D] Local attention optimization for NIXL ( #18170 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-16 21:16:33 -04:00
fabe89bbc4
[Spec Decode] Don't fall back to V0 when spec decoding is enabled ( #18265 )
2025-05-16 16:10:27 -07:00
e73b7dfd69
[Bugfix] fix an illegal memory access was encountered of marlin kernel + act_order ( #18245 )
2025-05-16 16:02:44 -07:00
7fdfa01530
[Sampler] Adapt to FlashInfer 0.2.3 sampler API ( #15777 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-05-16 15:14:03 -07:00
aef94c6d07
[CI] Assign reviewer to mergify with changes to Tensorizer files ( #18278 )
2025-05-16 12:04:14 -07:00
0ceaebf87b
[BugFix] Fix ordering of KVConnector finished send/rcv sets ( #18211 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-16 09:20:54 -07:00
1db4f47f81
[BugFix] Fix multi async save in MultiConnector ( #18246 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-16 08:13:47 -07:00
d3d91b6f71
[Misc][MacOS] fix bfloat16 error ( #18249 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-16 15:05:59 +00:00
87d871470d
[Model] Use autoweightloader for dbrx ( #18251 )
...
Signed-off-by: learner0810 <zhongjun.li@daocloud.io >
2025-05-16 07:54:13 -07:00
a5f8c111c2
[Fix] Fix typo in resolve_hf_chat_template ( #18259 )
...
Signed-off-by: Felix Marty <felmarty@amd.com >
2025-05-16 14:52:41 +00:00
e23564cb70
use ceil_div in cutlass block scaling shape check ( #17918 )
2025-05-16 03:02:58 -07:00
390ec88905
[Misc] Consolidate Audio tests into multimodal common generation tests ( #18214 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-16 09:18:08 +00:00
541817670c
[Misc] Add Ray Prometheus logger to V1 ( #17925 )
...
Signed-off-by: Seiji Eicher <seiji@anyscale.com >
2025-05-16 01:02:42 -07:00
67da5720d4
[PERF] Speed up Qwen2.5-VL model by speed up rotary position embedding ( #17973 )
...
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai >
2025-05-15 23:31:02 -07:00
5c04bb8b86
[doc] fix multimodal example script ( #18089 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-05-16 06:05:34 +00:00
3d2779c29a
[Feature] Support Pipeline Parallelism in torchrun SPMD offline inference for V1 ( #17827 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
2025-05-15 22:28:27 -07:00
6b31c84aff
Throw better error for when running into k8s service discovery issue ( #18209 )
...
Signed-off-by: Will Eaton <weaton@redhat.com >
2025-05-15 21:07:28 -07:00
b18201fe06
Allow users to pass arbitrary JSON keys from CLI ( #18208 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-15 21:05:34 -07:00
f4937a51c1
[Model] vLLM v1 supports Medusa ( #17956 )
...
Signed-off-by: lisiqi23 <lisiqi23@xiaomi.com >
Signed-off-by: skylee-01 <497627264@qq.com >
Co-authored-by: lisiqi23 <lisiqi23@xiaomi.com >
2025-05-15 21:05:31 -07:00
ee659e3b60
[Bugfix][ROCm] Use chunked_prefill_paged_decode as fallback for V1 attention on ROCm ( #18093 )
...
Signed-off-by: kf <kuanfu.liu@embeddedllm.com >
2025-05-15 19:30:17 -07:00
4e1c6a0264
[Bugfix] fix rotary embedding test for _get_padded_tensor_shape ( #18229 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-16 01:32:45 +00:00
c7852a6d9b
[Build] Allow shipping PTX on a per-file basis ( #18155 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-15 16:41:55 -07:00
8795eb9975
[Bugfix] Fix test_eagle test ( #18223 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
2025-05-15 15:59:42 -07:00
0b34593017
Adding "AMD: Tensorizer Test" to amdproduction. ( #18216 )
2025-05-15 11:01:25 -07:00
e3f3aee6f4
[Misc] Avoid cuda graph log when sizes still match ( #18202 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-05-15 09:59:38 -07:00
92540529c0
[Bugfix] [ROCm]: Remove assertion logic when using AITER fused moe in unquantizedMethod to reenable LLama4 BF16 ( #18205 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-15 09:53:18 -07:00
fadb8d5c2d
[Bugfix]Change the exception thrown by call_hf_processor from RuntimeError to ValueError ( #18181 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2025-05-15 09:01:47 -07:00
2aa5470ac5
[Frontend] Fix chat template content format detection ( #18190 )
...
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com >
2025-05-15 09:00:21 -07:00
51ff154639
Improve examples rendering in docs and GitHub ( #18203 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-15 15:57:49 +00:00
566ec04c3d
Adding "Basic Models Test" and "Multi-Modal Models Test (Extended) 3" in AMD Pipeline ( #18106 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-15 08:49:23 -07:00
01c22335ba
[Kernel] [V1] Fix performance regression for triton unified attention ( #18161 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-15 06:39:00 -07:00
451da4bcbd
add tools into TokenizeChatRequest ( #18187 )
...
Signed-off-by: yangxia <yangxiast@gmail.com >
2025-05-15 04:01:49 -07:00
07ad27121f
Update deprecated type hinting in model_loader ( #18130 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-15 04:00:21 -07:00
a9944aabfa
fix: typos ( #18151 )
...
Signed-off-by: omahs <73983677+omahs@users.noreply.github.com >
2025-05-15 02:16:15 -07:00
a8f5aec20a
[V1] Update zmq socket creation in nixl connector ( #18148 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-14 23:17:57 -07:00
de71fec81b
[CI] don't skip fixed test_kv_cache_events() ( #18183 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-05-14 23:17:16 -07:00
70f8b96724
[Bugfix] Fix FusedMoEPrepareAndFinalize for cuda-disalike backends ( #18178 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-14 23:16:31 -07:00
dd2a94596a
[Model] Allow the use of sliding window in Qwen2 ( #17772 )
...
Signed-off-by: inkcherry <mingzhi.liu@intel.com >
2025-05-14 22:29:38 -07:00
420caf7557
[UT] Add ut for none hash ( #17892 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-05-15 13:28:11 +08:00
4f07a64075
Support custom implementations of VideoLoader backends. ( #18091 )
2025-05-15 13:26:49 +08:00
e6b8e65d2d
[Bugfix] Fix fp8 tests for triton_unified_attention for Triton 3.3 ( #18013 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-15 13:26:34 +08:00
26d0419309
Update deprecated type hinting in models ( #18132 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-14 22:06:50 -07:00
83f74c698f
[Fix][ROCm] Enforce eager for all encoder-decoder models on ROCm ( #18154 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2025-05-14 22:04:43 -07:00
2dff093574
[Misc] add lobe-chat support ( #18177 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-15 05:02:23 +00:00
afe3236e90
[Chore] astral's ty ( #18116 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-15 05:00:43 +00:00
65334ef3b9
[V1][Metrics] Remove unused code ( #18158 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-14 20:13:17 -07:00
e60f550b38
[v1] Support multiple KV cache groups in GPU model runner ( #17945 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-14 18:54:54 -07:00
f25e0d1125
[Bugfix]: make most of test_openai_schema.py pass ( #17664 )
2025-05-14 17:04:35 -07:00
09f106a91e
Upload vllm index for the rc builds ( #18173 )
2025-05-14 16:35:56 -07:00
2142035b51
[V1] Support multiple kv connectors ( #17564 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-05-14 16:28:02 -07:00
78aa341d12
[CI] Fix race condition in test_kv_cache_events test ( #18169 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-14 16:27:48 -07:00
7974736740
Add support for loading torchao models with AOPerModuleConfig ( #17826 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-05-14 16:24:59 -07:00
2fc9075b82
[V1] Structured Outputs + Thinking compatibility ( #16577 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-05-14 15:45:24 -07:00
d93c976a0d
[Kernel] Have rotary embeddings support tensors ( #18046 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-14 15:43:55 -07:00
749f792553
[Frontend] decrease import time of vllm.multimodal ( #18031 )
...
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com >
2025-05-14 15:43:32 -07:00
856865008e
[CI] Disable Failing Tests ( #18165 )
2025-05-14 13:49:56 -07:00
f9c069c85e
Modularize fused experts and integrate PPLX kernels ( #15956 )
2025-05-14 13:11:54 -07:00
418d2f8bfb
[V1][Spec Decode] Share input embedding of target model with EAGLE draft model to free ~1GB for llama 3 model ( #17326 )
...
Co-authored-by: root <root@ekagra-8xh100.us-east5-a.c.serving-efficiency-poc.internal>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-14 12:31:46 -07:00
964472b966
[Doc] Update prefix cache metrics to counting tokens ( #18138 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-14 15:23:30 +00:00
59dd311cf5
[KVConnector] Keep KVTransferParams as a dict ( #18033 )
2025-05-14 08:05:57 -07:00
d066e52013
[Bugfix] Fix chat utils tests ( #18139 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-14 05:38:21 -07:00
c8ea982d9b
Update deprecated type hinting in platform, plugins, triton_utils, vllm_flash_attn ( #18129 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-14 05:28:16 -07:00
dc372b9c8a
Update deprecated type hinting in vllm/device_allocator and vllm/distributed ( #18126 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-14 04:07:57 -07:00
9b5b39b650
Update deprecated type hinting in vllm/lora ( #18128 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-14 03:57:59 -07:00
9ccc6ded42
[doc] add missing import ( #18133 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-14 10:57:34 +00:00
d62a076e84
[Model] GritLM supports other attention backends ( #18109 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-14 03:33:19 -07:00
259127f8b8
[Bugfix] Fix LoRA test ( #18123 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-14 10:25:47 +00:00
612c2edb4f
[FEAT] [ROCm]: Add AITER CK 2 Stages MoE support ( #17110 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-14 03:03:11 -07:00
38fe728d60
[Bugfix] Fix QKVCrossParallelLinear::sync_weight_attrs for PyTorch compile ( #17844 )
...
Signed-off-by: Andrzej Kotłowski <akotlowski@habana.ai >
2025-05-14 09:39:51 +00:00
82e7f9bb03
[Misc] replace does not exist model ( #18119 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-05-14 02:13:47 -07:00
63dc3426e0
[Model] Add packed_modules_mapping for Qwen3-MOE ( #18118 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-14 02:13:19 -07:00
8f5dc41481
[Bugfix] Fix entrypoints audio test failure ( #18111 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-14 09:08:07 +00:00
63ad622233
[New Model]: support GTE NewModel ( #17986 )
2025-05-14 01:31:31 -07:00
e7ef61c1f0
[Bugfix][Example] make lmcache v0 work. ( #18051 )
...
Signed-off-by: Ma, Jianpeng <jianpeng.ma@intel.com >
2025-05-13 23:43:44 -07:00
d4154c35a2
[Bugfix] fix moe marlin topk_weight loading ( #18080 )
...
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-05-13 23:31:57 -07:00
6685890d11
[Fix] Move "model_config" as keyword args in chat_utils.py ( #18098 )
...
Signed-off-by: Linkun <github@lkchen.net >
2025-05-13 23:27:26 -07:00
33011318c2
Fix broken example: examples/offline_inference/profiling at scheduler_config ( #18117 )
2025-05-13 23:19:14 -07:00
4f8b373225
[BugFix][AMD] Compatible patch for AITER lib after 04/20 ( #17912 )
...
Signed-off-by: Qiang Li <qiang.li2@amd.com >
2025-05-13 23:05:20 -07:00
7b2f28deba
[AMD][torch.compile] Enable silu+fp8_quant fusion for rocm ( #18082 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-05-13 22:13:56 -07:00
2d912fb66f
[FEAT] [ROCm] [V1]: Add AITER biased group topk for DeepSeekV3 ( #17955 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-13 22:03:47 -07:00
12e6c0b41c
[Bugfix][V1] Fix FlashInfer V1 backend using the wrong VllmConfig ( #18086 )
2025-05-13 20:36:17 -07:00
9a2a6357de
[Bugfix] Fix FP8 Marlin MoE and enable for compressed-tensors models ( #18026 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-13 19:48:33 -07:00
6266c57bae
[core][distributed] add ep group and all2all interface ( #18077 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-05-14 10:46:49 +08:00
754b699cbe
[Bug]: Fix S3 model/tokenizer path resolution ( #18083 )
...
Signed-off-by: Jon Gill <jon@yurts.ai >
2025-05-13 19:34:17 -07:00
6e27c6d86b
[Misc] Remove unused numpy tensor ( #18084 )
...
Signed-off-by: Roger Wang <hey@rogerw.me >
2025-05-13 19:33:40 -07:00
d5af47a149
[P/D] Add some more debug logs to NixlConnector ( #18102 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-13 19:33:03 -07:00
65f0f74b66
[Hardware/NVIDIA/Modelopt] Fix modelopt forward method for v1 torch.compile ( #18101 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2025-05-13 19:33:00 -07:00
176a95c670
[Fix] Support CUDAGraph capture for encoder-decoder on ROCm ( #18104 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2025-05-13 19:31:42 -07:00
f2ae883b67
[v1][KVCacheManager] pass num_new_computed_tokens to kv cache manager ( #18001 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-13 19:09:39 -07:00
40de1ef455
[FEAT] [ROCm]: Add AITER Block-Scaled GEMM Feature ( #14968 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-13 19:08:20 -07:00
0189a65a2e
[Docs] Expand security doc with firewall info ( #18081 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-13 19:36:00 +00:00
55aa7af994
[V1] DP scale-out (2/N): Decouple engine process management and comms ( #15977 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-13 10:48:21 -07:00
0b217da646
Update deprecated type hinting in vllm/adapter_commons ( #18073 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 08:32:51 -07:00
19324d660c
Update deprecated type hinting in vllm/compilation ( #18072 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 08:32:48 -07:00
fc407a1425
Give auto-merge label workflow permission to add labels to issues ( #18078 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 07:53:13 -07:00
009d9e7590
Convert benchmarks to ruff format ( #18068 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 13:43:29 +00:00
b922c2ebd2
[Bugfix] Fix entrypoints metrics tests ( #18063 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-13 06:42:43 -07:00
00b14e0f16
[CI] set token permissions for pre-commit CI job ( #17729 )
...
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-05-13 13:38:30 +00:00
54e467e6f8
[CI] Add token permissions for add-ready-label CI job ( #17730 )
...
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-05-13 13:38:13 +00:00
79a1d25bbd
[CI] Add workflow permissions for helm CI job ( #17727 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-05-13 12:49:07 +00:00
9944011b30
[CI] Set token permissions for reminder comment CI job ( #17728 )
...
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-05-13 12:46:58 +00:00
8c946cecca
Update deprecated type hinting in vllm/transformers_utils ( #18058 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 04:34:37 -07:00
ff334ca1cd
Update deprecated type hinting in vllm/profiler ( #18057 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 04:34:34 -07:00
6223dd8114
Update deprecated type hinting in model_executor/layers ( #18056 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 04:17:23 -07:00
906f0598fc
[doc] add download/list/delete HF model CLI usage ( #17940 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-13 11:15:51 +00:00
cb528d0585
[Fix] check to make sure processor has chat templates ( #18047 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-13 03:04:10 -07:00
98fcba1575
Convert .buildkite to ruff format ( #17656 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 09:28:31 +00:00
23b3134eb5
[Benchmarks] Refactor run_structured_output_benchmarks.sh ( #17722 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-13 01:47:29 -07:00
ea6ae8cb45
[Bugfix] Fix marlin moe fallback logic for llama4 ( #18042 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-13 07:53:28 +00:00
2ff297dce9
[BugFix] Set default random seed to 0 for V1 ( #17929 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-13 07:52:19 +00:00
8dd0671bac
[Bugfix][V1] Only get input embeddings w/ multi-modal models if first PP ( #17916 )
...
Signed-off-by: Jin Huang <jinhun@amazon.com >
Co-authored-by: Jin Huang <jinhun@amazon.com >
2025-05-13 15:10:07 +08:00
f0d610a8ae
[v1][KVCacheManager] Avoid full cache hit by controlling max_length ( #17999 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-13 06:50:38 +00:00
e57e4d6e9e
Fix Broken macro for cutlass moe ( #18049 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-05-12 23:31:06 -07:00
ee5be834e7
[BugFix] Fix 4-GPU RLHF tests ( #18007 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-12 23:03:55 -07:00
48545728d8
cleanup invalid prints ( #18050 )
...
Signed-off-by: calvin chen <120380290@qq.com >
2025-05-12 23:01:57 -07:00
dc1a821768
[Feature][V1] Support tool_choice: required when using Xgrammar as the StructuredOutputBackend. ( #17845 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-12 23:01:31 -07:00
61e0a506a3
[Bugfix] Avoid repeatedly creating dummy data during engine startup ( #17935 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-12 22:40:19 -07:00
1df491c522
[Bugfix] Fixes for new marlin moe usage ( #18017 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-13 03:50:04 +00:00
d8487ef557
[ROCm]: Fix build from source failure with gcc14 and ROCm 6.3 ( #13779 )
...
Signed-off-by: Arjun Kathuria <arjun.kathuria8@gmail.com >
2025-05-12 20:36:33 -07:00
c06af9a959
[Misc] Slight spelling modification ( #18039 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-12 20:36:27 -07:00
60f7624334
Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support ( #11844 )
2025-05-12 19:52:47 -07:00
f6518b2b48
[ROCm] Skip tests for quantizations incompatible with ROCm ( #17905 )
...
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com >
2025-05-12 18:39:28 -06:00
d67085c2c8
Remove noisy warnings from SchedulerConfig ( #17995 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-13 00:33:45 +00:00
307939f299
Use NVFP4 Marlin for CompressedTensorsW4A16Fp4 ( #18000 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
Co-authored-by: Dipika <dipikasikka1@gmail.com >
2025-05-12 18:07:34 -06:00
9d7ea9dbbf
Update some more deprecated type hinting ( #17998 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-12 23:49:33 +00:00
acee8f48aa
[Model] Support MiMo-7B inference with MTP ( #17433 )
...
Signed-off-by: wp-alpha <wangpeng66@xiaomi.com >
Co-authored-by: wangpeng66 <wangpeng66@xiaomi.com >
2025-05-12 23:25:33 +00:00
f065de4e88
Fix FBGEMM integration ( #18002 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-12 23:02:07 +00:00
dc9905368d
[V1][Spec Decode] Eagle unit tests ( #17350 )
...
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com >
2025-05-12 23:01:17 +00:00
ebab1ac37c
[CI] Make JSON output tests less likely to fail ( #17859 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-12 22:31:54 +00:00
2b0db9b0e2
Enable standard language model for torch nightly ( #18004 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-05-12 14:00:04 -07:00
195adb47c0
[Chore] Remove unused method ( #18024 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-05-12 13:59:47 -07:00
302f3aca7e
[v1][KVCacheManager] Change prefix caching metric from counting blocks to counting tokens ( #18003 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-12 13:46:12 -07:00
e9c730c9bd
Enabling "Weight Loading Multiple GPU Test - Large Models" ( #18020 )
2025-05-12 13:05:33 -07:00
289199feb6
[Core] Use platform-agnostic device control for DP engine core ( #17245 )
...
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com >
2025-05-12 12:09:16 -07:00
b9fd0d7a69
[CI/Build] Fix TPU V1 Test mixed use of & and && across tests ( #17968 )
2025-05-12 12:06:59 -07:00
72a3f6b898
Construct KVTransferConfig properly from Python instead of using JSON blobs without CLI ( #17994 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-12 11:25:33 -07:00
98ea35601c
[Lora][Frontend]Add default local directory LoRA resolver plugin. ( #16855 )
...
Signed-off-by: jberkhahn <jaberkha@us.ibm.com >
2025-05-12 10:39:10 -07:00
d19110204c
[P/D] NIXL Integration ( #17751 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Brent Salisbury <bsalisbu@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: ApostaC <yihua98@uchicago.edu >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Brent Salisbury <bsalisbu@redhat.com >
2025-05-12 09:46:16 -07:00
05a4324f8e
Initialize the delta tool call fields explicitly ( #17340 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: igmainc <igmainc@icloud.com >
2025-05-12 13:28:58 +00:00
7ea6cb28b2
[Misc] Improve modelscope import error ( #17983 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-12 10:46:45 +00:00
9fbf2bfbd5
Correcting testcases in buildkite job for IBM Power ( #17675 )
...
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com >
2025-05-12 08:11:55 +00:00
3a5ea75129
[Feature] Support DeepSeekV3 Function Call ( #17784 )
...
Signed-off-by: 许文卿 <xwq391974@alibaba-inc.com >
Signed-off-by: Xu Wenqing <xuwq1993@qq.com >
2025-05-12 00:45:21 -07:00
891b9d33de
[Fix] Benchmark "EngineClient" has no attribute "model_config" ( #17976 )
...
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca >
2025-05-11 22:55:53 -07:00
430783018c
[Bugfix][TPU] Use np array when updating cache slot_mapping ( #17971 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-05-12 12:58:33 +08:00
19a3c78d1f
[Bugfix] Fix pydantic.errors.PydanticUserError ( #17962 )
...
Signed-off-by: wangli <wangli858794774@gmail.com >
2025-05-12 12:58:23 +08:00
ada50aa295
[bugfix] fix the wrong parser ( #17958 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-12 04:58:02 +00:00
08bf784078
[Bugfix] validate grammar and throw 400 error instead of crashing the engine when xgrammar validation fails ( #17623 )
...
Signed-off-by: Jason Cheng <jasoncky96@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-05-12 09:06:10 +08:00
d45fe333fb
[misc] add instructions on how to install nvshmem/pplx/deepep ( #17964 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-05-11 18:02:39 -07:00
021c16c7ca
[Model] Broadcast Ovis2 implementation to fit Ovis1.6 ( #17861 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-11 17:56:30 -07:00
7de18d541b
[BUG] [ROCm] [MLA] Fix variable name bug due to change in variable name in PR #17483 ( #17961 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-11 09:14:30 -07:00
a810b5b088
[BugFix] [ROCm]: Bugfix and handle addition case of input for rocm_aiter_rms_norm ( #17857 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-11 04:17:11 -07:00
009b3d5382
[Misc] not show --model in vllm serve --help ( #16691 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-11 08:47:58 +00:00
e4b8713380
[New Model]: nomic-embed-text-v2-moe ( #17785 )
2025-05-11 00:59:43 -07:00
06c0922a69
[FP8][ROCm][Attention] Enable FP8 KV cache on ROCm for V1 ( #17870 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-11 15:58:45 +08:00
cd3edfc908
[Misc] Add compressed-tensors NVFP4A16 emulation support ( #17914 )
...
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com >
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2025-05-11 15:58:38 +08:00
9cea90eab4
[Frontend] Add /classify endpoint ( #17032 )
...
Signed-off-by: Frieda (Jingying) Huang <jingyingfhuang@gmail.com >
2025-05-11 07:57:07 +00:00
d1110f5b5a
[doc] update lora doc ( #17936 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-11 15:56:21 +08:00
8132365b74
[Bugfix]: v1 engine - consider lora adapters in allowed_token_ids ( #17855 )
...
Signed-off-by: Ben Browning <bbrownin@redhat.com >
2025-05-11 00:53:58 -07:00
eea22a56ab
fix amd triton mla path ( #17871 )
2025-05-11 07:53:31 +00:00
9112155283
[Perf] Use small max_num_batched_tokens for A100 ( #17885 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-05-11 07:53:23 +00:00
90d0a74b60
[Bugfix] Add revision to transformers.Auto*.from_pretrained processors ( #17948 )
...
Signed-off-by: Xin Li <xin@centml.ai >
2025-05-11 07:52:44 +00:00
d74e5f37bc
[Kernel] fp4 marlin kernel ( #17687 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-05-10 19:58:49 -07:00
ca66a1674c
[v1] Rename specialized_manager.py to single_type_kv_cache_manager.py ( #17946 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-10 16:14:12 -07:00
950751a987
[v1] Pass BlockTable and KVCacheSpec to AttentionMetadataBuilders ( #17483 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-10 16:12:04 -07:00
4c31218f80
[Misc] remove --model from vllm serve usage ( #17944 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-10 13:23:31 +00:00
68311891f5
Don't default construct ModelConfig when default constructing VllmConfig ( #17943 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-10 13:23:00 +00:00
fc4441a4ee
Add missing content type headers to /ping and /health ( #17036 ) ( #17786 )
...
Signed-off-by: Ximo Guanter <ximo.guanter@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-10 07:13:32 +01:00
246e3e0a36
fix broken test vllm:test_kernels - test_attention_selector.py::test_flash_attn ( #17873 )
...
Co-authored-by: Stephen Chen <tracelog@meta.com >
2025-05-10 10:46:54 +08:00
7042cc96b0
[V1][Spec Decoding] Log accumulated metrics after system goes idle ( #17913 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-09 18:23:07 -07:00
0c0fdae84f
[Hardware/NVIDIA/Kernel] Enable nvidia/DeepSeek-R1-FP4 Model ( #16362 )
2025-05-09 16:24:41 -07:00
3b602cdea7
AMD conditional all test execution // new test groups ( #17556 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
2025-05-09 15:35:58 -07:00
4b2ed7926a
Improve configs - the rest! ( #17562 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-09 15:18:44 -07:00
7e3571134f
[V1][Spec Decoding] Include bonus tokens in mean acceptance length ( #17908 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-05-09 13:32:36 -07:00
ea2236bf95
Add option to use torch._inductor.standalone_compile ( #17057 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-09 12:59:04 -07:00
7d4aedae7c
Handle error when str passed to /v1/audio/transcriptions ( #17909 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-09 19:23:59 +00:00
22481fbfa3
Update CT WNA16MarlinMoE integration ( #16666 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-09 13:19:45 -04:00
5c4c08f6f1
[Misc] Auto fallback to float16 for pre-Ampere GPUs when detected bfloat16 config ( #17265 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-09 17:16:12 +00:00
c44c384b1c
[Misc] Add references in ray_serve_deepseek example ( #17907 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-05-09 16:59:36 +00:00
85b72cb7b1
Revert "[BugFix][AMD] Compatible patch for latest AITER(05/07/2025)" ( #17910 )
2025-05-09 08:58:18 -07:00
6e5595ca39
[CI/Build] Automatically retry flaky tests ( #17856 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-09 09:55:17 -06:00
200da9a517
[v1] Move block management logic from KVCacheManager to SpecializedManager ( #17474 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-09 15:25:34 +00:00
9f64e93415
[BugFix][AMD] Compatible patch for latest AITER(05/07/2025) ( #17864 )
...
Signed-off-by: Qiang Li <qiang.li2@amd.com >
2025-05-09 08:59:36 -06:00
ec61ea20a8
[Misc] add dify integration ( #17895 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-09 03:42:39 -07:00
c6798baa9c
Change top_k to be disabled with 0 (still accept -1 for now) ( #17773 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-09 10:01:49 +00:00
5b2dcbf0b8
Fix Whisper crash caused by invalid ``max_num_batched_tokens`` config ( #17853 )
...
Signed-off-by: inkcherry <mingzhi.liu@intel.com >
2025-05-09 09:16:26 +00:00
6e4a93e3f7
[Bugfix][CPU] Fix broken AVX2 CPU TP support ( #17252 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-09 08:55:14 +00:00
217db4baa6
[Bugfix][ROCm] Fix AITER MLA V1 ( #17880 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-05-09 08:38:21 +00:00
ff8c400502
[Doc] remove visible token in doc ( #17884 )
...
Signed-off-by: yan <yanma1@habana.ai >
2025-05-09 01:21:31 -07:00
89a0315f4c
[Doc] Update several links in reasoning_outputs.md ( #17846 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-05-09 01:20:55 -07:00
3d1e387652
[Docs] Add Slides from NYC Meetup ( #17879 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-05-08 21:46:54 -07:00
d310e6de98
[BUGFIX]: return fast when request requires prompt logprobs ( #17251 )
2025-05-08 21:25:41 -07:00
5e6f939484
[Attention] MLA move rotary embedding to cuda-graph region ( #17668 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-09 11:14:42 +08:00
760e3ecc8f
[V1][Structured Output] Update llguidance (>= 0.7.11) to avoid AttributeError (no StructTag) ( #17839 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-05-08 20:14:18 -07:00
3c9396a64f
[FEAT][ROCm]: Support AITER MLA on V1 Engine ( #17523 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: qli88 <qiang.li2@amd.com >
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
2025-05-09 10:42:05 +08:00
376786fac1
Add cutlass support for blackwell fp8 blockwise gemm ( #14383 )
...
Signed-off-by: Shu Wang <shuw@nvidia.com >
2025-05-08 15:09:55 -07:00
4f605a6de5
Fix noisy warning for uncalibrated q_scale/p_scale ( #17414 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-08 15:56:59 -04:00
8342e3abd1
[CI] Prune down lm-eval small tests ( #17012 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-08 19:00:26 +00:00
a83a0f92b5
[Test] Attempt all TPU V1 tests, even if some of them fail. ( #17334 )
...
Signed-off-by: Yarong Mu <ymu@google.com >
2025-05-08 17:20:54 +00:00
226a4272cf
[V1] Improve VLLM_ALLOW_INSECURE_SERIALIZATION logging ( #17860 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-08 16:57:35 +00:00
ec54d73c31
[CI] Fix test_collective_rpc ( #17858 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-08 16:47:12 +00:00
a944f8ede7
[Misc] Delete LoRA-related redundancy code ( #17841 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-08 06:02:21 -07:00
015815fe01
[Bugfix] use_fast failing to be propagated to Qwen2-VL image processor ( #17838 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-08 05:39:21 -07:00
e4ca6e3a99
Fix transient dependency error in docs build ( #17848 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-08 03:42:03 -07:00
53d0cb7423
[Misc] add chatbox integration ( #17828 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-08 10:05:26 +00:00
f50dcb7c21
[Easy] Eliminate c10::optional usage in vllm/csrc ( #17819 )
2025-05-08 03:05:10 -07:00
a1e19b635d
[Doc] Fix a typo in the file name ( #17836 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-08 18:04:18 +08:00
bb239a730f
[Bugfix] Fix quark fp8 format loading on AMD GPUs ( #12612 )
...
Signed-off-by: Felix Marty <felmarty@amd.com >
Signed-off-by: kewang2 <kewang2@amd.com >
Co-authored-by: kewang2 <kewang2@amd.com >
2025-05-08 02:53:53 -07:00
a463555dee
[TPU] Fix the test_sampler ( #17820 )
2025-05-08 05:51:33 -04:00
ca04b97c93
[Bugfix] Fix tool call template validation for Mistral models ( #17644 )
...
Signed-off-by: Rick Yuan <yuan821120@gmail.com >
Signed-off-by: RIck Yuan <yuan821120@gmail.com >
Co-authored-by: Aaron Pham <Aaronpham0103@gmail.com >
2025-05-08 09:47:19 +00:00
0a9bbaa104
[Misc] support model prefix & add deepseek vl2 tiny fused moe config ( #17763 )
...
Signed-off-by: 唯勤 <xsank.mz@alibaba-inc.com >
Co-authored-by: 唯勤 <xsank.mz@alibaba-inc.com >
2025-05-08 07:50:22 +00:00
39956efb3f
[Bugfix] Fix bad words for Mistral models ( #17753 )
...
Signed-off-by: Qiong Zhou Huang <qiong@phonic.co >
2025-05-07 23:32:10 -07:00
597051e56f
[Qwen3]add qwen3-235b-bf16 fused moe config on A100 ( #17715 )
2025-05-07 23:09:32 -07:00
96722aa81d
[Frontend] Chat template fallbacks for multimodal models ( #17805 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-07 23:05:54 -07:00
843b222723
[Hardware][Intel-Gaudi] Support Automatic Prefix Caching on HPU ( #17648 )
...
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai >
2025-05-07 22:37:03 -07:00
e515668edf
[Hardware][Power] Enable compressed tensor W8A8 INT8 quantization for POWER ( #17153 )
...
Signed-off-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: Akash Kaothalkar <akash.kaothalkar@ibm.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-05-07 22:35:03 -07:00
5a499e70d5
[Kernel][Hardware][AMD] Bf16 mfma opt for ROCm skinny GEMMs ( #17071 )
...
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com >
Signed-off-by: charlifu <charlifu@amd.com >
Co-authored-by: charlifu <charlifu@amd.com >
2025-05-07 22:34:49 -07:00
6930a41116
[V1] Add VLLM_ALLOW_INSECURE_SERIALIZATION env var ( #17490 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-05-08 13:34:02 +08:00
998eea4a0e
Only log non-default CLI args for online serving ( #17803 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-07 22:33:29 -07:00
c747d84576
[Installation] OpenTelemetry version update ( #17771 )
...
Signed-off-by: Mikhail Podvitskii <podvitskiymichael@gmail.com >
2025-05-07 22:32:49 -07:00
b2da14a05a
Improve exception reporting in MP engine ( #17800 )
...
Signed-off-by: Vadim Markovtsev <vadim@poolside.ai >
2025-05-08 05:32:39 +00:00
7ea2adb802
[Core] Support full cuda graph in v1 ( #16072 )
...
Signed-off-by: Chanh Nguyen <cnguyen@linkedin.com >
Co-authored-by: Chanh Nguyen <cnguyen@linkedin.com >
2025-05-07 22:30:15 -07:00
3d13ca0e24
[BugFix] Fix --disable-log-stats in V1 server mode ( #17600 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-08 04:08:15 +00:00
66ab3b13c9
Don't call the venv vllm ( #17810 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-08 04:06:39 +00:00
a8238bbdb0
[Chore][Doc] uses model id determined from OpenAI client ( #17815 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-08 01:48:57 +00:00
d43f914d42
[Core][Feature] Input metadata dump on crash ( #13407 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2025-05-07 22:15:09 +00:00
ed5272cf21
[BugFix] Avoid secondary missing MultiprocExecutor.workers error ( #17811 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-07 21:55:04 +00:00
c20ef40fd0
[Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend ( #14238 )
...
Signed-off-by: Akshat Tripathi <akshat@krai.ai >
Signed-off-by: Chengji Yao <chengjiyao@google.com >
Co-authored-by: Chengji Yao <chengjiyao@google.com >
2025-05-07 16:28:47 -04:00
db593aa67f
[Quantization] Quark MXFP4 format loading ( #16943 )
2025-05-07 15:05:05 -04:00
f98e307588
[Bugfix] Fix missing lora name mapping for lora without prefix ( #17793 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-07 16:17:12 +00:00
646a31e51e
Fix and simplify deprecated=True CLI kwarg ( #17781 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-07 16:51:06 +01:00
be8ff88e66
[Bugfix] Fix Video IO error for short video ( #17791 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-07 15:36:06 +00:00
1a6af1453d
Only depend on importlib-metadata for Python < 3.10 ( #17776 )
...
Signed-off-by: Christian Heimes <christian@python.org >
2025-05-07 07:51:06 -07:00
32aa74c09c
[ROCm][FP8][Kernel] FP8 quantization fused into Custom Paged Attention ( #17139 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-07 07:12:35 -07:00
7377dd0307
[doc] update the issue link ( #17782 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-07 20:29:05 +08:00
98c89e16ff
Make key optional for rotary embedding ( #17566 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-05-07 00:11:46 -07:00
324a3119b0
Fix test_memory_usage_no_spec ( #17754 )
...
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
2025-05-07 00:10:33 -07:00
8a15c2603a
[Frontend] Add missing chat templates for various MLLMs ( #17758 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-07 00:10:01 -07:00
043e4c4955
Add NeuronxDistributedInference support, Speculative Decoding, Dynamic on-device sampling ( #16357 )
...
Signed-off-by: Satyajith Chilappagari <satchill@amazon.com >
Co-authored-by: Aaron Dou <yzdou@amazon.com >
Co-authored-by: Shashwat Srijan <sssrijan@amazon.com >
Co-authored-by: Chongming Ni <chongmni@amazon.com >
Co-authored-by: Amulya Ballakur <amulyaab@amazon.com >
Co-authored-by: Patrick Lange <patlange@amazon.com >
Co-authored-by: Elaine Zhao <elaineyz@amazon.com >
Co-authored-by: Lin Lin Pan <tailinpa@amazon.com >
Co-authored-by: Navyadhara Gogineni <navyadha@amazon.com >
Co-authored-by: Yishan McNabb <yishanm@amazon.com >
Co-authored-by: Mrinal Shukla <181322398+mrinalks@users.noreply.github.com >
2025-05-07 00:07:30 -07:00
ba7703e659
[Misc] Remove qlora_adapter_name_or_path ( #17699 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-06 23:10:37 -07:00
f80ae5bdcf
[Kernel] Use fused rmsnorm for some models like qwen3 series ( #17735 )
...
Signed-off-by: evian <eviantai@u.nus.edu >
Co-authored-by: evian <eviantai@u.nus.edu >
2025-05-06 23:10:02 -07:00
1a45a61387
[Kernel] GGUF MoeVec kernel ( #16780 )
...
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com >
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-05-06 23:07:23 -07:00
c3e9d5060e
[Misc] Use apply_rotary_emb from vllm_flash_attn for Qwen2-VL vision RoPE ( #17726 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-07 04:51:33 +00:00
822de7fb94
[Misc] Split model loader ( #17712 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-05-07 12:42:26 +08:00
8d84d836d1
[BugFix][Spec Decode] Fix hidden size mismatch between target and eagle head ( #17740 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-05-06 19:51:26 -07:00
950b71186f
Replace lm-eval bash script with pytest and use enforce_eager for faster CI ( #17717 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-06 18:00:10 -07:00
e50a1f1a9c
[TPU] Add kernel test for moe_pallas ( #17496 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-05-06 17:59:57 -07:00
a17cef70ea
Removed unused marlin cuda code ( #17684 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-06 17:59:47 -07:00
18dd5e01f2
[Model] Mamba2 causal conv1d Refactor to Split Prefill and Decode Requests for Corresponding Kernels ( #17146 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
2025-05-06 17:59:30 -07:00
6de3e13413
Add logging for torch nightly version ( #17669 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-05-07 00:45:51 +00:00
ed3a1d2106
[ROCm] fix num_stages for default moe config to avoid triton OutOfResource error ( #17744 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2025-05-07 00:39:48 +00:00
022afbeb4e
Fix doc build performance ( #17748 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-07 00:36:41 +00:00
2f925e5777
[Kernel] Unified Triton kernel that doesn't distinguish between prefill + decode ( #16828 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-06 18:21:48 -04:00
de906b95f9
[Bugfix] Fix for the condition to accept empty encoder inputs for mllama ( #17732 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-06 19:59:06 +00:00
d456aea71f
[Misc] Add Next Edit Prediction (NEP) datasets support in benchmark_serving.py ( #16839 )
...
Signed-off-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
Signed-off-by: dtransposed <>
Co-authored-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
2025-05-06 15:38:45 -04:00
621ca2c0ab
[TPU] Increase block size and reset block shapes ( #16458 )
2025-05-06 13:55:04 -04:00
6115b11582
Make right sidebar more readable in "Supported Models" ( #17723 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-06 16:48:26 +00:00
5b8c390747
[Bugfix] Fix modality limits in vision language example ( #17721 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-06 16:12:28 +00:00
7525d5f3d5
[doc] Add RAG Integration example ( #17692 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-06 16:10:23 +00:00
aabcd2cae3
[v1] Introduce KVCacheBlocks as interface between Scheduler and KVCacheManager ( #17479 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-06 08:50:34 -07:00
0d115460a7
[Docs] Use gh-file to add links to tool_calling.md ( #17709 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-05-06 15:27:19 +00:00
175bda67a1
[Feat] Add deprecated=True to CLI args ( #17426 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-05-06 08:11:27 -07:00
cba31c47c4
[v1] AttentionMetadata for each layer ( #17394 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-05-06 07:58:37 -07:00
a6fed02068
[V1][PP] Support PP for MultiprocExecutor ( #14219 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-05-06 07:58:05 -07:00
d419aa5dc4
[V1] Enable TPU V1 backend by default ( #17673 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-06 06:49:49 -07:00
f9bc5a0693
[Bugfix] Fix triton import with local TritonPlaceholder ( #17446 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-05-06 17:53:09 +08:00
05e1f96419
Fix dockerfilegraph pre-commit hook ( #17698 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-06 08:56:48 +00:00
6eae34533a
[Misc] Fix ScalarType float4 naming ( #17690 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-06 01:07:15 -07:00
63ced7b43f
[Doc] Update notes for H2O-VL and Gemma3 ( #17219 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-06 07:51:02 +00:00
dc47ba32f8
[Bugfix] Fixed prompt length for random dataset ( #17408 )
...
Signed-off-by: Mikhail Podvitskii <podvitskiymichael@gmail.com >
2025-05-06 07:00:08 +00:00
edbf2d609e
[easy] Fix logspam on PiecewiseBackend errors ( #17138 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-05 23:46:11 -07:00
999328be0d
[Model] Add GraniteMoeHybrid 4.0 model ( #17497 )
...
Signed-off-by: Thomas Ortner <boh@zurich.ibm.com >
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com >
Co-authored-by: Thomas Ortner <boh@zurich.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
2025-05-06 12:00:31 +08:00
98834fefaa
Update nm to rht in doc links + refine fp8 doc ( #17678 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-06 00:41:14 +00:00
90bd2ae172
[Bugfix] LoRA - Retire unused maxnreg LoRA kernel argument ( #17677 )
2025-05-05 17:34:29 -07:00
5941e0b7ea
[TPU][V1] Add support for top-logprobs ( #17072 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-05-05 14:20:15 -07:00
9765940824
[TPU] Enable gemma3-27b with TP>1 on multi-chips. ( #17335 )
...
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com >
2025-05-05 14:19:58 -07:00
5ea5c514da
[BugFix] Increase timeout for startup failure test ( #17642 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-05-05 20:53:19 +00:00
d3efde8176
[Benchmarks] Remove invalid option under V1 engine ( #17651 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-05 16:30:22 -04:00
aea302be6c
Use git-path commit in hook ( #17616 )
...
Signed-off-by: Thomas J. Fan <thomasjpfan@gmail.com >
2025-05-05 17:55:32 +00:00
cc05b90d86
[Doc] Fix broken cuda installation doc rendering ( #17654 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-05 17:52:40 +00:00
1d0c9d6b2d
[Kernel] some optimizations for dense marlin and moe marlin ( #16850 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-05-05 09:39:30 -07:00
f62cad6431
[Build/CI] Upgrade CUTLASS to 3.9.2 ( #17641 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-05-04 19:23:17 -07:00
5394ad7387
[Bugfix] fix KeyError on top logprobs are special tokens ( #17637 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-04 19:22:35 -07:00
68e1ee0072
[Bugfix][Easy] Fix whitespace in shm_broadcast.py logging ( #17635 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-05-04 19:20:19 -07:00
2858830c39
[Bugfix] Prioritize dtype in root config before checking text config ( #17629 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-04 12:43:05 +00:00
d6484ef3c3
Add full API docs and improve the UX of navigating them ( #17485 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-03 19:42:43 -07:00
46fae69cf0
[Misc] V0 fallback for --enable-prompt-embeds ( #17615 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-03 22:59:24 +00:00
f66f1e0fa3
[Bugfix] Fix broken Qwen2.5-omni tests ( #17613 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-03 17:08:14 +00:00
887d7af882
[Core] Gate prompt_embeds behind a feature flag ( #17607 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-04 00:19:20 +08:00
a92842454c
[Bugfix][ROCm] Using device_type because on ROCm the API is still torch.cuda ( #17601 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-02 22:25:47 -07:00
c8386fa61d
[Build/CI] Upgrade CUTLASS to 3.9.1 ( #17602 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-05-02 22:25:14 -07:00
87baebebd8
[Frontend][TPU] Add TPU default max-num-batched-tokens based on device name ( #17508 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-05-02 21:42:44 -07:00
e3d0a1d190
[Quantization] [AMD] Add support for running DeepSeek int8 w8a8 MoE on ROCm ( #17558 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-05-02 21:41:10 -07:00
d47b605eca
Update test requirements to CUDA 12.8 ( #17576 )
...
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com >
2025-05-02 21:40:15 -07:00
22c6f6397f
[Neuron][Build] Require setuptools >= 77.0.3 for PEP 639 ( #17603 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-05-03 02:41:59 +00:00
3ec97e2cc5
[release] Add command to clean up Docker containers/images in TPU release machine ( #17606 )
2025-05-02 18:54:34 -07:00
9b103a1d76
fix typo in logging ( #17605 )
2025-05-02 18:04:40 -07:00
b90b0852e9
[easy] Print number of needed GPUs in skip message ( #17594 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-05-02 15:27:43 -07:00
9352cdb56d
[Hardware][AMD] Improve OAM device ID + llama4 Maverick MOE tuning ( #16263 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
Co-authored-by: Lu Fang <lufang@fb.com >
2025-05-02 19:44:19 +00:00
182f40ea8b
Add NVIDIA TensorRT Model Optimizer in vLLM documentation ( #17561 )
2025-05-02 11:36:46 -07:00
3e887d2e0c
permute/unpermute kernel for moe optimization ( #14568 )
...
Signed-off-by: Caleb_Du <Caleb_Du@zju.edu.cn >
2025-05-02 11:31:55 -07:00
0f87d8f7b2
[BugFix][Attention] Fix sliding window attention in V1 giving incorrect results ( #17574 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-02 11:01:38 -07:00
4c33d67321
[Bugfix] fix tmp_out and exp_sums dimensions ( #17438 )
...
Signed-off-by: Hui Liu <96135754+hliuca@users.noreply.github.com >
2025-05-02 16:44:07 +00:00
cb234955df
[Misc] Clean up input processing ( #17582 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-02 08:11:53 -07:00
3a500cd0b6
[doc] miss result ( #17589 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-02 07:04:49 -07:00
868c546da4
Support W8A8 INT8 MoE for compressed-tensors ( #16745 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-02 10:03:32 -04:00
99404f53c7
[Security] Fix image hash collision ( #17378 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-02 08:36:39 -04:00
785d75a03b
Automatically tell users that dict args must be valid JSON in CLI ( #17577 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-02 05:24:55 -07:00
6d1479ca4b
[doc] add the print result ( #17584 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-02 05:24:45 -07:00
b8b0859b5c
add more pytorch related tests for torch nightly ( #17422 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-05-02 03:29:59 -07:00
d7543862bd
[Misc] Rename assets for testing ( #17575 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-02 03:29:25 -07:00
c777df79f7
[BugFix] Fix Memory Leak ( #17567 )
...
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
2025-05-02 01:07:03 -07:00
cc2a77d7f1
[Core] [Bugfix] Add Input Embeddings ( #15428 )
...
Signed-off-by: Andrew Sansom <andrew@protopia.ai >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: 临景 <linjing.yx@alibaba-inc.com >
Co-authored-by: Bryce1010 <bryceyx@gmail.com >
Co-authored-by: Nan2018 <nan@protopia.ai >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-02 01:06:39 -07:00
9e2de9b9e9
[Bugfix] Remove TritonPlaceholder from sys.modules ( #17317 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-02 00:45:01 -07:00
109e15a335
Add pt_load_map_location to allow loading to cuda ( #16869 )
...
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com >
2025-05-01 23:23:42 -07:00
f192ca90e6
Fix PixtralHF missing spatial_merge_size ( #17571 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-01 22:14:09 -07:00
f89d0e11bf
[Misc] Continue refactoring model tests ( #17573 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 22:06:08 -07:00
b4003d11fc
Check if bitblas is installed during support check ( #17572 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-02 04:32:54 +00:00
292fc59d61
[CI] Actually run tests/kv_transfer/test_disagg.py in CI ( #17555 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-02 04:05:04 +00:00
afcb3f8863
[Attention] MLA move o_proj q_proj into cuda-graph region ( #17484 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-02 03:16:26 +00:00
afb12e4294
[Doc] note that not all unit tests pass on CPU platforms ( #17554 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-05-02 02:57:21 +00:00
24aebae177
[Bugfix] Disable gptq_bitblas for <SM80 to fix GPTQ on V100/T4 ( #17541 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-05-01 17:59:35 -07:00
39c0813a7f
[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE3 ( #17504 )
...
Signed-off-by: qizixi <qizixi@meta.com >
2025-05-01 16:19:30 -07:00
9b70e2b4c1
[Misc][Tools][Benchmark] Publish script to auto tune server parameters ( #17207 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-05-01 19:53:03 +00:00
173daac19d
[Bug]change the position of cuda_graph_sizes in dataclasses ( #17548 )
...
Signed-off-by: CXIAAAAA <cxia0209@gmail.com >
2025-05-01 11:52:37 -07:00
04f2cfc894
Remove duplicate code from dbrx.py ( #17550 )
2025-05-01 11:51:58 -07:00
811a6c0972
[ROCM] Add gfx950 to the custom attention archs ( #16034 )
...
Signed-off-by: jpvillam <Juan.Villamizar@amd.com >
Signed-off-by: seungrokjung <seungrok.jung@amd.com >
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: seungrokjung <seungrok.jung@amd.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-05-01 11:18:28 -07:00
9b1769dd9a
[Bugfix] Fix lint error ( #17547 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 11:12:19 -07:00
61c299f81f
[Misc]add configurable cuda graph size ( #17201 )
...
Signed-off-by: CXIAAAAA <cxia0209@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-01 11:04:50 -07:00
4acfa3354a
[ROCm] update installation guide to include build aiter from source instructions ( #17542 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-05-01 11:01:28 -07:00
88c8304104
[Model] Refactor Ovis2 to support original tokenizer ( #17537 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-05-01 11:00:53 -07:00
6768ff4a22
Move the last arguments in arg_utils.py to be in their final groups ( #17531 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-01 10:31:44 -07:00
f2e7af9b86
[CI/Build] Remove awscli dependency ( #17532 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 09:20:54 -07:00
7423cf0a9b
[Misc] refactor example - cpu_offload_lmcache ( #17460 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-01 15:05:24 +00:00
460a2b1100
[torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations ( #10867 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-05-01 07:59:28 -07:00
28566d73b3
[ROCm] remove unsupported archs from rocm triton flash-attention supported list ( #17536 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2025-05-01 07:54:25 -07:00
98060b001d
[Feature][Frontend]: Deprecate --enable-reasoning ( #17452 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-01 06:46:16 -07:00
f5a3c655b2
[FEAT] [ROCm]: Add Qwen/Qwen3-235B-A22B-FP8 TP4 triton fused moe config ( #17535 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-01 06:37:17 -07:00
7169f87ad0
[doc] add streamlit integration ( #17522 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-05-01 13:34:02 +00:00
b74d888c63
Fix more broken speculative decode tests ( #17450 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-05-01 06:05:58 -07:00
2007d4d54f
[FEAT] [ROCm]: Add Qwen/Qwen3-30B-A3B-FP8 fused moe config for MI300X ( #17530 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-05-01 06:03:13 -07:00
48e925fab5
[Misc] Clean up test docstrings and names ( #17521 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 05:19:32 -07:00
1903c0b8a3
[Frontend] Show progress bar for adding requests ( #17525 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-05-01 05:15:32 -07:00
86a1f67a3b
[Bugfix][Benchmarks] Allow benchmark of deepspeed-mii backend to select a model ( #17285 )
...
Signed-off-by: Teruaki Ishizaki <teruaki.ishizaki@ntt.com >
2025-05-01 11:54:51 +00:00
a257d9bccc
Improve configs - ObservabilityConfig ( #17453 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-05-01 03:52:05 -07:00
015069b017
[Misc] Optimize the Qwen3_ReasoningParser extract_reasoning_content ( #17515 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-05-01 03:29:01 -07:00
fbefc8a78d
[Core] Enable IPv6 with vllm.utils.make_zmq_socket() ( #16506 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-05-01 09:38:18 +00:00
26bc4bbcd8
Avoid overwriting vllm_compile_cache.py ( #17418 )
...
Signed-off-by: Keyun Tong <tongkeyun@gmail.com >
2025-05-01 07:30:57 +00:00
3c3d767201
[BugFix] Fix mla cpu - missing 3 required positional arguments ( #17494 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-05-01 14:36:52 +08:00
13cf6b6236
[BugFix] fix speculative decoding memory leak when speculation is disabled ( #15506 )
...
Signed-off-by: Noah Yoshida <noahcy117@gmail.com >
2025-04-30 23:28:17 -07:00
90d0a54c4d
[ROCm] Effort to reduce the number of environment variables in command line ( #17229 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
2025-04-30 23:27:06 -07:00
7a0a146c54
[Build] Require setuptools >= 77.0.3 for PEP 639 ( #17389 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-30 23:25:36 -07:00
7ab643e425
Fixing the AMD test failures caused by PR#16457 ( #17511 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-04-30 23:23:07 -07:00
afb4429b4f
[CI/Build] Reorganize models tests ( #17459 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-30 23:03:08 -07:00
aa4502e7f3
[CI][Bugfix] Fix failing V1 Test due to missing 'cache_salt' arg ( #17500 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-30 21:03:30 -07:00
17b4d85f63
[CI][TPU] Skip structured outputs+spec decode tests on TPU ( #17510 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-30 20:36:20 -07:00
1144a8efe7
[Bugfix] Temporarily disable gptq_bitblas on ROCm ( #17411 )
...
Signed-off-by: Yan Cangang <nalanzeyu@gmail.com >
2025-04-30 19:51:45 -07:00
08fb5587b4
[Bugfix][ROCm] Fix import error on ROCm ( #17495 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-30 19:51:42 -07:00
dbc18e7816
[CI][TPU] Skip Multimodal test ( #17488 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-04-30 19:51:39 -07:00
02bd654846
[Misc] Rename Audios -> Audio in Qwen2audio Processing ( #17507 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-30 19:51:36 -07:00
200bbf92e8
Bump Compressed Tensors version to 0.9.4 ( #17478 )
...
Signed-off-by: Rahul Tuli <rtuli@redhat.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-30 15:24:45 -07:00
81ecf425f0
[v1][Spec Decode] Make sliding window compatible with eagle prefix caching ( #17398 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-04-30 18:25:53 +00:00
42d9a2c4c7
doc: fix bug report Github template formatting ( #17486 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-04-30 10:03:20 -07:00
2ac74d098e
[doc] add install tips ( #17373 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-30 17:02:41 +00:00
584f5fb4c6
[Bugfix][ROCm] Restrict ray version due to a breaking release ( #17480 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-30 09:59:06 -07:00
d586ddc691
[BugFix] Fix authorization of openai_transcription_client.py ( #17321 )
...
Signed-off-by: zh Wang <rekind133@outlook.com >
2025-04-30 09:51:05 -07:00
0b7e701dd4
[Docs] Update optimization.md doc ( #17482 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-30 09:34:02 -07:00
947f2f5375
[V1] Allow turning off pickle fallback in vllm.v1.serial_utils ( #17427 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-30 16:10:54 +00:00
739e03b344
[Bugfix] Fixed mistral tokenizer path when pointing to file ( #17457 )
...
Signed-off-by: Pete Savage <psavage@redhat.com >
2025-04-30 08:08:37 -07:00
da4e7687b5
[Fix] Support passing args to logger ( #17425 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-04-30 08:06:58 -07:00
39317cf42b
[Docs] Add command for running mypy tests from CI ( #17475 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-30 08:06:09 -07:00
2990cee95b
[Feature] The Qwen3 reasoning parser supports guided decoding ( #17466 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-30 07:48:21 -07:00
0be6d05b5e
[V1][Metrics] add support for kv event publishing ( #16750 )
...
Signed-off-by: alec-flowers <aflowers@nvidia.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
2025-04-30 07:44:45 -07:00
77073c77bc
[Core] Prevent side-channel attacks via cache salting ( #17045 )
...
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com >
2025-04-30 20:27:21 +08:00
a7d5b016bd
[TPU][V1][CI] Update regression test baseline for v6 CI ( #17064 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-30 04:03:22 -07:00
d803786731
[V1][Bugfix]: vllm v1 version metric num_gpu_blocks is None ( #15755 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-30 18:20:39 +08:00
1534d389af
[Misc] Remove deprecated files ( #17447 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-30 01:52:19 -07:00
ece5a8b0b6
Make the _apply_rotary_emb compatible with dynamo ( #17435 )
2025-04-30 07:52:48 +00:00
54072f315f
[MODEL ADDITION] Ovis2 Model Addition ( #15826 )
...
Signed-off-by: Marco <121761685+mlinmg@users.noreply.github.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-04-30 07:33:29 +00:00
be633fba0f
[Bugfix] Fix AttributeError: 'State' object has no attribute 'engine_client' ( #17434 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-30 00:11:04 -07:00
ed6cfb90c8
[Hardware][Intel GPU] Upgrade to torch 2.7 ( #17444 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com >
2025-04-30 00:03:58 -07:00
6ed9f6047e
[Intel GPU] [CI]Fix XPU ci, setuptools >=80.0 have build issue ( #17298 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-04-29 22:54:10 -07:00
a44c4f1d2f
Support LoRA for Mistral3 ( #17428 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-29 21:10:30 -07:00
88fcf00dda
Fix some speculative decode tests with tl.dot ( #17371 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-04-29 19:41:02 -07:00
d1f569b1b9
Fix call to logger.info_once ( #17416 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 19:39:18 -07:00
13698db634
Improve configs - ModelConfig ( #17130 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-30 10:38:22 +08:00
2c4f59afc3
Update PyTorch to 2.7.0 ( #16859 )
2025-04-29 19:08:04 -07:00
1c2bc7ead0
Truncation control for embedding models ( #14776 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2025-04-30 09:24:57 +08:00
4055130a85
[release] Always git fetch all to get latest tag on TPU release ( #17322 )
2025-04-29 17:52:11 -07:00
34120f5acd
[V1][Feature] Enable Speculative Decoding with Structured Outputs ( #14702 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com >
2025-04-30 00:02:10 +00:00
7489ec0bab
Remove Bamba 9B from CI ( #17407 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 21:10:31 +00:00
70788bdbdc
[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE ( #17211 )
...
Signed-off-by: Bryan Lu <yuzhelu@amazon.com >
2025-04-29 21:10:00 +00:00
c9c1b59e59
Fix: Python package installation for opentelemetry ( #17049 )
...
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com >
2025-04-29 20:20:24 +00:00
0350809f3a
Remove Falcon3 2x7B from CI ( #17404 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 19:52:25 +00:00
a6977dbd15
Simplify (and fix) passing of guided decoding backend options ( #17008 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 19:02:23 +00:00
2fa2a50bf9
[Bugfix] Fix Minicpm-O-int4 GPTQ model inference ( #17397 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-29 18:21:42 +00:00
08e15defa9
[CI/Build] Add retry mechanism for add-apt-repository ( #17107 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-29 10:40:52 -07:00
b37685afbb
[CI] Uses Python 3.11 for TPU ( #17359 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-04-29 17:39:16 +00:00
792595b59d
[TPU][V1][CI] Replace python3 setup.py develop with standard pip install --e on TPU ( #17374 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-29 10:36:48 -07:00
0c1c788312
[Doc][Typo] Fixing label in new model requests link in overview.md ( #17400 )
2025-04-29 10:29:48 -07:00
56d64fbe30
[Docs] Propose a deprecation policy for the project ( #17063 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-29 10:29:44 -07:00
608968b7c5
Enabling multi-group kernel tests. ( #17115 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-04-29 10:27:27 -07:00
06ffc7e1d3
[Misc][ROCm] Exclude cutlass_mla_decode for ROCm build ( #17289 )
...
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
2025-04-29 10:26:42 -07:00
d3cf61b89b
fix gemma3 results all zero ( #17364 )
...
Signed-off-by: mayuyuace <qiming1.zhang@intel.com >
2025-04-29 09:40:25 -07:00
a39203f99e
[Bugfix] add qwen3 reasoning-parser fix content is None when disable … ( #17369 )
...
Signed-off-by: mofanke <mofanke@gmail.com >
2025-04-29 16:32:40 +00:00
24e6ad3f16
[V1] Remove num_input_tokens from attn_metadata ( #17193 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-04-29 09:28:41 -07:00
2ef5d106bb
Improve literal dataclass field conversion to argparse argument ( #17391 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 16:25:08 +00:00
0ed27ef66c
Fix: Spelling of inference ( #17387 )
2025-04-29 09:23:39 -07:00
900edfa8d4
Transformers backend tweaks ( #17365 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 09:08:03 -07:00
88ad9ec6b2
[Frontend] Support chat_template_kwargs in LLM.chat ( #17356 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-29 22:03:35 +08:00
40896bdf3f
pre-commit autoupdate ( #17380 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 06:46:55 -07:00
00ee37efa2
[Bugfix] Clean up MiniMax-VL and fix processing ( #17354 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-29 20:42:16 +08:00
890f104cdf
[Doc] Fix QWen3MOE info ( #17381 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-29 12:38:32 +00:00
4a5e13149a
Update docs requirements ( #17379 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-29 11:35:47 +00:00
97cc8729f0
[Model] Ignore rotary embed load for Cohere model ( #17319 )
2025-04-29 00:30:40 -07:00
4464109219
[Build][Bugfix] Restrict setuptools version to <80 ( #17320 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-29 00:17:23 -07:00
193e78e35d
[Fix] Documentation spacing in compilation config help text ( #17342 )
...
Signed-off-by: Zerohertz <ohg3417@gmail.com >
2025-04-29 00:16:17 -07:00
bdb2cddafc
[Misc]Use a platform independent interface to obtain the device attributes ( #17100 )
2025-04-29 06:59:13 +00:00
ebb3930d28
[Misc] Move config fields to MultiModalConfig ( #17343 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-29 06:37:21 +00:00
cde384cd92
[Model] support MiniMax-VL-01 model ( #16328 )
...
Signed-off-by: qingjun <qingjun@minimaxi.com >
2025-04-29 12:05:50 +08:00
96e06e3cb7
[Misc] Add a Jinja template to support Mistral3 function calling ( #17195 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-28 19:53:44 -07:00
17eb306fcc
[Bugfix] Add contiguous call inside rope kernel wrapper ( #17091 )
...
Signed-off-by: 苏政渊 <suzhengyuan@moonshot.cn >
Co-authored-by: 苏政渊 <suzhengyuan@moonshot.cn >
2025-04-28 19:24:07 -07:00
165cb56329
Ignore '<string>' filepath ( #17330 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-28 19:23:29 -07:00
d6da8a8ff2
[Bugfix] Fix numel() downcast in fused_layernorm_dynamic_per_token_quant.cu ( #17316 )
2025-04-28 19:23:18 -07:00
b4ac4fa04d
[model] make llama4 compatible with pure dense layers ( #17315 )
...
Signed-off-by: Lucia Fang <fanglu@fb.com >
2025-04-29 10:22:22 +08:00
e136000595
[V1][Spec Decode] Make Eagle model arch config driven ( #17323 )
2025-04-29 10:22:02 +08:00
86d9fc29cb
implement Structural Tag with Guidance backend ( #17333 )
...
Signed-off-by: Michal Moskal <michal@moskal.me >
2025-04-29 02:21:32 +00:00
506475de5f
[Optim] Compute multimodal hash only once per item ( #17314 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-29 09:40:35 +08:00
cfe4532093
[Benchmark] Add single turn MTBench to Serving Bench ( #17202 )
2025-04-28 16:46:15 -07:00
8fc88d63f1
[Model] Add tuned triton fused_moe configs for Qwen3Moe ( #17328 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-28 15:20:24 -07:00
6e74fd4945
Support loading transformers models with named parameters ( #16868 )
...
Signed-off-by: Alex <alexwu@character.ai >
2025-04-28 23:15:58 +01:00
dcbac4cb4b
[Model] Qwen3 Dense FP8 Compat Fixes ( #17318 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
2025-04-28 14:12:01 -07:00
ed2462030f
[Bugfix] Fix moe weight losing all extra attrs after process_weights_after_loading. ( #16854 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-04-28 21:05:07 +00:00
cc5befbced
[BugFix] Fix cascade attention - RuntimeError: scheduler_metadata must have shape (metadata_size) ( #17283 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-28 13:55:50 -07:00
2c89cd96a8
[Chore] cleanup license indicators in light of SPDX ( #17259 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-04-28 19:43:52 +00:00
a0304dc504
[Security] Don't bind tcp zmq socket to all interfaces ( #17197 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-28 10:08:20 -07:00
c7941cca18
Explicitly explain quant method override ordering and ensure all overrides are ordered ( #17256 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-28 16:55:31 +00:00
b6dd32aa07
Make name of compressed-tensors quant method consistent across vLLM ( #17255 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-28 16:28:13 +00:00
f94886946e
Improve conversion from dataclass configs to argparse arguments ( #17303 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-28 16:22:12 +00:00
72dfe4c74f
[Docs] Add a security guide ( #17230 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-28 15:12:17 +00:00
8b464d9660
[Misc] Clean up Qwen2.5-Omni code ( #17301 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-28 06:20:45 -07:00
889ebb2638
[Misc] Minor typo/grammar in platforms/interface.py ( #17307 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-28 05:45:42 -07:00
3ad986c28b
[doc] update wrong model id ( #17287 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-28 04:20:51 -07:00
344e193b7d
[Bugfix] Add missing get_language_model to new MLLMs ( #17300 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-28 04:09:57 -07:00
fb1c933ade
Add missing class docstring for PromptAdapterConfig ( #17302 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-28 04:06:59 -07:00
72c5b97231
Update tpu_worker.py's typo ( #17288 )
2025-04-28 04:01:15 -07:00
fa93cd9f60
[Model] Add Granite Speech Support ( #16246 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-28 10:05:00 +00:00
aec9674dbe
[Core] Remove legacy input mapper/processor from V0 ( #15686 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-28 15:38:48 +08:00
7fcc4223dc
[Minor][Models] Pass partial_rotary_factor parameter to rope ( #17266 )
...
Signed-off-by: evian <eviantai@u.nus.edu >
Co-authored-by: evian <eviantai@u.nus.edu >
2025-04-28 04:28:59 +00:00
8262a3e23b
[Misc] Validate stop_token_ids contents ( #17268 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-28 03:54:05 +00:00
f211331c48
[Doc] small fix ( #17277 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-28 03:53:35 +00:00
9053d0b134
[Doc] Fix wrong github link in LMCache examples ( #17274 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2025-04-28 03:09:11 +00:00
cb3f2d8d10
[Bugfix] Fix Mistral3 spatial merge error ( #17270 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-27 19:40:05 -07:00
c12df53b60
[Bugfix] Fix cutlass dispatch for fp8/int8 to properly invoke M<=16 c… ( #16751 )
...
Signed-off-by: Ther-LF <2639852836@qq.com >
2025-04-27 19:38:42 -07:00
d1aeea7553
[Bugfix] Fix missing ARG in Dockerfile for arm64 platforms ( #17261 )
...
Signed-off-by: lkm-schulz <44176356+lkm-schulz@users.noreply.github.com >
2025-04-27 19:38:14 -07:00
d8bccde686
[BugFix] Fix vllm_flash_attn install issues ( #17267 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
2025-04-27 17:27:56 -07:00
20e489eaa1
[V1][Spec Decode] Make eagle compatible with prefix caching. ( #17137 )
...
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-04-27 09:29:43 -07:00
4213475ec7
[Metrics] Fix minor inconsistencies in bucket progression ( #17262 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-27 16:19:39 +00:00
d92879baf6
[doc] Add feature status legend ( #17257 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-27 08:17:02 -07:00
690fe019f0
[Feature] support sequence parallelism using compilation pass ( #16155 )
...
Signed-off-by: cascade812 <cascade812@outlook.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-04-27 06:29:35 -07:00
ed7a29d9f8
[NVIDIA] Support Cutlass MLA for Blackwell GPUs ( #16032 )
...
Signed-off-by: kaixih <kaixih@nvidia.com >
2025-04-27 06:29:21 -07:00
756848e79e
[Bugfix] Fix Lora Name Parsing ( #17196 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-27 20:33:09 +08:00
18445edd0f
[Misc] Change buckets of histogram_iteration_tokens to [1, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8096] to represent number of tokens ( #17033 )
...
Signed-off-by: sfc-gh-zhwang <flex.wang@snowflake.com >
2025-04-27 12:30:53 +00:00
30215ca61f
[MISC] Use string annotation types for class definitions ( #17244 )
...
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com >
2025-04-27 08:39:57 +00:00
838cedade7
[Bugfix] Get a specific type of layer from forward context ( #17222 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-04-27 00:58:05 -07:00
4283a28c2f
[Bugfix] Fix QWen2 VL multimodal mapping ( #17240 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-27 05:53:23 +00:00
93a126fbc7
[Misc] Make cached tokenizer pickle-compatible ( #17048 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-27 13:05:00 +08:00
8e4b351a0c
[Kernel][Triton][FP8] Adding fp8 and variable length sequence support to Triton FAv2 kernel ( #12591 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-04-27 00:35:08 +00:00
9869453c42
Update test_flash_attn.py ( #17102 )
...
Signed-off-by: ShuaibinLi <lishuaibin@live.cn >
2025-04-26 22:17:35 +00:00
3642c59aa8
[CI/Build] remove -t for run-lm-eval-gsm-hf-baseline.sh ( #16271 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-26 18:25:05 +00:00
43eea2953b
[Minor] Fix lint error in main branch ( #17233 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-26 11:10:14 -07:00
de7eb10ce4
[Bugfix] Fix Qwen2.5-Omni M-RoPE position ids generation ( #16878 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2025-04-26 10:41:35 -07:00
fd11a325b8
[MISC] rename interval to max_recent_requests ( #14285 )
2025-04-26 16:59:18 +00:00
4d17e20310
Disable the torch.compile cache checks when VLLM_DISABLE_COMPILE_CACHE=1 ( #16573 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-04-26 09:17:58 -07:00
10fd1d7380
[Bugfix] fix error due to an uninitialized tokenizer when using skip_tokenizer_init with num_scheduler_steps ( #9276 )
...
Signed-off-by: changjun.lee <pord7457@gmail.com >
2025-04-26 11:51:17 -04:00
52b4f4a8d7
[Docs] Update structured output doc for V1 ( #17135 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-26 15:12:18 +00:00
e782e0a170
[Chore] added stubs for vllm_flash_attn during development mode ( #17228 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2025-04-26 07:45:26 -07:00
dc2ceca5c5
[BUGFIX] use random for NONE_HASH only when PYTHONHASHSEED not set ( #17088 )
...
Signed-off-by: Andy Xie <andy.xning@gmail.com >
2025-04-26 14:34:24 +00:00
f8acd01ff7
[V1] Add structural_tag support using xgrammar ( #17085 )
2025-04-26 14:06:37 +00:00
c48334d405
[Hardware][Intel-Gaudi] Update hpu-extension and update bucketing system for HPU device ( #17186 )
...
Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai >
2025-04-26 05:55:14 -07:00
909fdaf152
[Bugfix] Fix standard models tests ( #17217 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-26 02:26:41 -07:00
8c1c926d00
[Bugfix] Fix missing int type for -n in multi-image example ( #17223 )
2025-04-26 08:49:52 +00:00
df6f3ce883
[Core] Remove prompt string from engine core data structures ( #17214 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-25 23:41:05 -07:00
513f074766
[CI/test] Fix Eagle Correctness Test ( #17209 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-25 23:40:36 -07:00
b07bf83c7d
[BugFix] Avoid race conditions in zero-copy tensor transmission ( #17203 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-26 06:00:07 +00:00
53e8cf53a4
[V1][Metrics] Allow V1 AsyncLLM to use custom logger ( #14661 )
...
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com >
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Mark McLoughlin <markmc@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-25 22:05:40 -07:00
54271bb766
[ROCm][Misc] Follow-ups for Skinny Gemms on ROCm. ( #17011 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2025-04-25 22:05:10 -07:00
9e96f56efb
Allocate kv_cache with stride order ( #16605 )
...
Signed-off-by: shuw <shuw@nvidia.com >
2025-04-25 22:03:31 -07:00
b278911229
[Minor][Models] Fix Return Types of Llama & Eagle ( #17220 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-25 21:54:47 -07:00
7bd0c7745c
[Doc] Minor fix for the vLLM TPU setup page ( #17206 )
...
Signed-off-by: Yarong Mu <ymu@google.com >
2025-04-26 04:39:56 +00:00
1cf0719ebd
[Minor][Spec Decode] Add use_eagle to SpeculativeConfig ( #17213 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-25 21:08:15 -07:00
537d5ee025
[doc] add Anything LLM integration ( #17216 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-25 21:03:23 -07:00
c8e5be35f7
[MISC][AMD] Add unused annotation to rocm kernel file ( #17097 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-04-25 20:33:35 -07:00
a6e72e1e4f
[Bugfix] [pytorch] Patch AOTAutogradCache._get_shape_env ( #17142 )
...
Signed-off-by: James Wu <jjwu@meta.com >
2025-04-26 11:28:20 +08:00
5e83a7277f
[v1] [P/D] Adding LMCache KV connector for v1 ( #16625 )
2025-04-26 03:03:38 +00:00
68af5f6c5c
[AMD][FP8][BugFix] Remove V1 check in arg_utils.py for FP8 since it is not necessary ( #17215 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-04-25 19:55:05 -07:00
8de2901fea
[Bugfix] gemma[2,3] interleaved attention when sliding window is disabled ( #17180 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-04-25 19:53:51 -07:00
c53e0730cb
[Misc] Refine ray_serve_deepseek example ( #17204 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-04-25 16:06:59 -07:00
a0e619e62a
[V1][Spec Decode] EAGLE-3 Support ( #16937 )
...
Signed-off-by: Bryan Lu <yuzhelu@amazon.com >
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
Co-authored-by: Bryan Lu <yuzhelu@amazon.com >
2025-04-25 15:43:07 -07:00
70116459c3
[BugFix][Frontend] Fix LLM.chat() tokenization ( #16081 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-25 22:20:05 +00:00
65e262b93b
Fix Python packaging edge cases ( #17159 )
...
Signed-off-by: Christian Heimes <christian@python.org >
2025-04-26 06:15:07 +08:00
43faa0461a
[Bugfix] Fix hybrid model tests ( #17182 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-25 15:14:37 -07:00
48cb2109b6
[V1] Move usage stats to worker and start logging TPU hardware ( #16211 )
2025-04-25 14:06:01 -06:00
a5450f11c9
[Security] Use safe serialization and fix zmq setup for mooncake pipe ( #17192 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
Co-authored-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-04-25 16:53:23 +00:00
9d98ab5ec6
[Misc] Inline Molmo requirements ( #17190 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-25 16:41:44 +00:00
df5c879527
[doc] update wrong hf model links ( #17184 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-25 16:40:54 +00:00
423e9f1cbe
Use Transformers helper get_text_config() instead of checking for text_config ( #17105 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-25 08:47:35 -07:00
0bd7f8fca5
Bump Transformers to 4.51.3 ( #17116 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-25 08:34:34 -07:00
d5615af9ae
[Bugfix] Fix Mistral ChatCompletionRequest Body Exception ( #16769 )
...
Signed-off-by: Jasmond Loh <Jasmond.Loh@hotmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-25 07:26:30 -07:00
19dcc02a72
[Bugfix] Fix mistral model tests ( #17181 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-25 06:03:34 -07:00
7feae92c1f
[Doc] Move todo out of beam search docstring ( #17183 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-25 04:44:58 -07:00
f851b84266
[Doc] Add two links to disagg_prefill.md ( #17168 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-25 10:23:57 +00:00
fc966e9cc6
Only turn on FastIncrementalDetokenizer when tokenizers >= 0.21.1 ( #17158 )
2025-04-25 17:10:32 +08:00
ef19e67d2c
[Doc] Add headings to improve gptqmodel.md ( #17164 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-25 01:13:13 -07:00
a41351f363
[Quantization][FP8] Add support for FP8 models with input_scale for output projection and QK quantization ( #15734 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
Co-authored-by: Luka Govedič <lgovedic@redhat.com >
2025-04-25 00:45:02 -07:00
6aae216b4e
[Bugfix] remove fallback in guided_json (int range, patterns) ( #16725 )
...
Signed-off-by: csy1204 <josang1204@gmail.com >
Co-authored-by: 조상연[플레이스 AI] <sang-yeon.cho@navercorp.com >
2025-04-25 06:54:43 +00:00
b22980a1dc
[Perf]Optimize rotary_emb implementation to use Triton operator for improved inference performance ( #16457 )
...
Signed-off-by: cynthieye <yexin93@qq.com >
Co-authored-by: MagnetoWang <magnetowang@outlook.com >
2025-04-25 14:52:28 +08:00
881f735827
[Misc] Benchmark Serving Script Support Appending Results ( #17028 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-24 22:53:55 -07:00
2f54045508
[Bugfix][Misc] Use TritonPlaceholderModule to defensively import triton ( #15099 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-04-24 22:51:02 -07:00
5aa6efb9a5
[Misc] Clean up redundant code in uniproc_executor.py ( #16762 )
...
Signed-off-by: Lifu Huang <lifu.hlf@gmail.com >
2025-04-24 22:49:30 -07:00
6ca0234478
Move missed SchedulerConfig args into scheduler config group in EngineArgs ( #17131 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 22:48:53 -07:00
649818995f
[Docs] Fix True->true in supported_models.md ( #17141 )
2025-04-25 04:20:04 +00:00
7a0a9da72b
[Doc] V1 : Update LoRA status ( #17133 )
...
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com >
Co-authored-by: varun sundar rabindranath <vsundarr@redhat.com >
2025-04-24 20:17:22 -07:00
69bff9bc89
fix float16 support for kimi-vl ( #17156 )
...
Co-authored-by: zhouzaida <zhouzaida@msh.team >
2025-04-24 20:16:32 -07:00
41ca7eb491
[Attention] FA3 decode perf improvement - single mma warp group support for head dim 128 ( #16864 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-24 20:12:21 -07:00
eef364723c
[FEAT] [ROCm]: AITER Fused MOE V1 Support ( #16752 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-04-25 11:06:50 +08:00
0d6e187e88
Use custom address for listening socket ( #15988 )
...
Signed-off-by: Jens Glaser <glaserj@ornl.gov >
2025-04-25 01:57:16 +00:00
9420a1fc30
Better error message for missing mistral params.json ( #17132 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-24 23:43:08 +00:00
583e900996
[Misc] Add example to run DeepSeek with Ray Serve LLM ( #17134 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-04-24 22:25:21 +00:00
05e1fbfc52
Add chat template for Llama 4 models ( #16428 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-04-24 20:19:36 +00:00
fe92176321
Add collective_rpc to llm engine ( #16999 )
...
Signed-off-by: Yinghai Lu <yinghai@thinkingmachines.ai >
2025-04-24 20:16:52 +00:00
6d0df0ebeb
[Docs] Generate correct github links for decorated functions ( #17125 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-24 10:39:43 -07:00
0fa939e2d1
Improve configs - LoRAConfig + PromptAdapterConfig ( #16980 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 10:29:34 -07:00
0422ce109f
Add :markdownhelp: to EngineArgs docs so markdown docstrings render properly ( #17124 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 10:28:45 -07:00
47bdee409c
Molmo Requirements ( #17026 )
...
Signed-off-by: Eyshika Agarwal <eyshikaengineer@gmail.com >
Signed-off-by: eyshika <eyshikaengineer@gmail.com >
2025-04-24 10:08:37 -07:00
49f189439d
existing torch installation pip command fix for docs ( #17059 )
2025-04-24 10:07:21 -07:00
5adf6f6b7f
Updating Buildkite job for IBM Power ( #17111 )
...
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com >
2025-04-24 10:06:17 -07:00
4115f19958
[CI] Add automation for the tool-calling github label ( #17118 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-24 09:22:00 -07:00
340d7b1b21
[V1][Spec Decoding] Add num_drafts and num_accepted_tokens_per_position metrics ( #16665 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-04-24 08:57:40 -07:00
1bcbcbf574
[Misc] refactor example series - structured outputs ( #17040 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-24 07:49:48 -07:00
82e43b2d7e
Add missing rocm_skinny_gemms kernel test to CI ( #17060 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-24 07:49:37 -07:00
67309a1cb5
[Frontend] Using matryoshka_dimensions control the allowed output dimensions. ( #16970 )
2025-04-24 07:06:28 -07:00
b724afe343
[V1][Structured Output] Clear xgrammar compiler object when engine core shut down to avoid nanobind leaked warning ( #16954 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-24 06:15:03 -07:00
21f4f1c9a4
Improve static type checking in LoRAModelRunnerMixin ( #17104 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 06:14:47 -07:00
b0c1f6202d
[Misc] Remove OLMo2 config copy ( #17066 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-24 06:14:32 -07:00
c0dfd97519
[V1][PP] Optimization: continue scheduling prefill chunks ( #17080 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-04-24 05:27:08 -07:00
a9138e85b1
Fix OOT registration test ( #17099 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 04:44:12 -07:00
0a05ed57e6
Simplify TokenizerGroup ( #16790 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-24 04:43:56 -07:00
14288d1332
Disable enforce_eager for V1 TPU sampler and structured output tests ( #17016 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-24 02:50:09 -07:00
b411418ff0
[Chore] Remove Sampler from Model Code ( #17084 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-24 02:49:33 -07:00
2bc0f72ae5
Add docs for runai_streamer_sharded ( #17093 )
...
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-24 01:03:21 -07:00
9c1244de57
[doc] update to hyperlink ( #17096 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-24 00:58:08 -07:00
db2f8d915c
[V1] Update structured output ( #16812 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-23 23:57:17 -07:00
6167c0e5d2
[Bugfix][Core] add seq_id_to_seq_group clearing to avoid memory leak when s… ( #16472 )
...
Signed-off-by: 开哲 <kaizhe.zy@alibaba-inc.com >
Co-authored-by: 开哲 <kaizhe.zy@alibaba-inc.com >
2025-04-24 11:25:37 +08:00
ed2e464653
Addendum Fix to support FIPS enabled machines with MD5 hashing ( #17043 )
...
Signed-off-by: sydarb <areebsyed237@gmail.com >
2025-04-23 19:55:00 -07:00
2c8ed8ee48
More informative error when using Transformers backend ( #16988 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 19:54:03 -07:00
ed50f46641
[Bugfix] Enable V1 usage stats ( #16986 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-23 19:54:00 -07:00
46e678bcff
[Minor] Use larger batch sizes for A100/B100/B200/MI300x ( #17073 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-23 19:18:59 -07:00
6b2427f995
[Quantization]add prefix for commandA quantized model ( #17017 )
2025-04-23 17:32:40 -07:00
b07d741661
[CI/Build] workaround for CI build failure ( #17070 )
...
Signed-off-by: csy1204 <josang1204@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
2025-04-23 16:14:18 -07:00
41fb013d29
[V1][Spec Decode] Always use argmax for sampling draft tokens ( #16899 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-23 14:57:43 -07:00
32d4b669d0
[BugFix][V1] Fix int32 token index overflow when preparing input ids ( #16806 )
2025-04-23 12:12:35 -07:00
3cde34a4a4
[Frontend] Support guidance:no-additional-properties for compatibility with xgrammar ( #15949 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-04-23 18:34:41 +00:00
bdb3660312
Use @property and private field for data_parallel_rank_local ( #17053 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 08:50:08 -07:00
f3a21e9c68
CacheConfig.block_size should always be int when used ( #17052 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 08:50:05 -07:00
8e630d680e
Improve Transformers backend model loading QoL ( #17039 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 07:33:51 -07:00
af869f6dff
[CI] Update structured-output label automation ( #17055 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-23 07:33:14 -07:00
53c0fa1e25
Ensure that pid passed to kill_process_tree is int for mypy ( #17051 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-23 07:32:26 -07:00
f7912cba3d
[Doc] Add top anchor and a note to quantization/bitblas.md ( #17042 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-23 07:32:16 -07:00
6317a5174a
Categorize tests/kernels/ based on kernel type ( #16799 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-23 09:21:07 -04:00
aa72d9a4ea
Mistral-format support for compressed-tensors ( #16803 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-23 08:46:23 -04:00
ce17db8085
[CI] Run v1/test_serial_utils.py in CI ( #16996 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-23 01:13:34 -07:00
8c87a9ad46
[Bugfix] Fix AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers ( #16964 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-23 07:24:09 +00:00
ec69124eb4
[Misc] Improve readability of get_open_port function. ( #17024 )
...
Signed-off-by: gitover22 <qidizou88@gmail.com >
2025-04-23 06:16:53 +00:00
d0da99fb70
[BugFix] llama4 fa3 fix - RuntimeError: scheduler_metadata must have shape (metadata_size) ( #16998 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-22 21:49:24 -07:00
b2f195c429
[V1] Avoid socket errors during shutdown when requests are in in-flight ( #16807 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-23 12:36:29 +08:00
047797ef90
[Bugfix] Triton FA function takes no keyword arguments ( #16902 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-04-22 21:35:24 -07:00
eb8ef4224d
[doc] add download path tips ( #17013 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-23 04:06:30 +00:00
56a735261c
[INTEL-HPU][v0] Port delayed sampling to upstream ( #16949 )
...
Signed-off-by: Michal Adamczyk <michal.adamczyk@intel.com >
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Michal Adamczyk <madamczyk@habana.ai >
2025-04-22 20:14:11 -07:00
e1cf90e099
[misc] tune some env vars for GB200 ( #16992 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-04-23 10:59:48 +08:00
6bc1e30ef9
Revert "[Misc] Add S3 environment variables for better support of MinIO." ( #17021 )
2025-04-22 19:22:29 -07:00
7e081ba7ca
[BugFix] Revert ROCm Custom Paged Attention Env Flag Check ( #17022 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-04-22 19:17:48 -07:00
1e013fa388
[V1][DP] More robust DP/EP dummy request coordination ( #16277 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-22 19:12:15 -07:00
bc7c4d206b
[Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1 ( #13305 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
Signed-off-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu >
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Signed-off-by: root <root@banff-cyxtera-s65-4.amd.com >
Signed-off-by: maleksan85 <maleksan@amd.com >
Signed-off-by: <>
Co-authored-by: Sage Moore <sage@neuralmagic.com >
Co-authored-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: qli88 <qiang.li2@amd.com >
Co-authored-by: root <root@banff-cyxtera-s65-4.amd.com >
2025-04-22 19:11:56 -07:00
f67e9e9f22
add Dockerfile build vllm against torch nightly ( #16936 )
...
Signed-off-by: Yang Wang <elainewy@meta.com >
2025-04-22 19:08:27 -07:00
36fe78769f
[Bugfix] validate urls object for multimodal content parts ( #16990 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-23 09:43:06 +08:00
83d933718c
[Core][V1][TPU] Enable structured decoding on TPU V1 ( #16499 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-04-22 18:05:23 -06:00
5175b884f7
[BugFix] Remove default multiproc executor collective_rpc timeout ( #17000 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-22 23:27:14 +00:00
5536b30a4c
Fencing Kernels Tests for enabling on AMD ( #16929 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-04-22 09:32:40 -07:00
7f58fb9718
Add assertion for no objects while hashing hf_config ( #16930 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-22 09:32:22 -07:00
30bc3e0f66
[FEAT][ROCm]: Support AITER MLA ( #15893 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Co-authored-by: qli88 <qiang.li2@amd.com >
2025-04-22 09:31:13 -07:00
f34410715f
[frontend] enhance tool_calls type check ( #16882 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-22 15:40:24 +00:00
68d4c33202
[Misc] Add S3 environment variables for better support of MinIO. ( #16977 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-22 14:27:36 +00:00
f961d7f6ef
[BugFix] Pass in correct VLLM config in FlashInfer backend ( #13207 ) ( #16973 )
...
Signed-off-by: 苏政渊 <suzhengyuan@moonshot.cn >
Co-authored-by: 苏政渊 <suzhengyuan@moonshot.cn >
2025-04-22 06:44:10 -07:00
d059110498
Improve configs - SpeculativeConfig ( #16971 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-22 12:55:36 +00:00
571e8dd65e
[Bugfix] Fix distributed bug again in Qwen2.5-VL & Qwen2.5-Omni ( #16974 )
...
Signed-off-by: fyabc <suyang.fy@alibaba-inc.com >
2025-04-22 12:23:17 +00:00
4b91c927f6
[Misc] refactor example series ( #16972 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-22 11:44:21 +00:00
0e237f0035
[FEAT][ROCm] Integrate Paged Attention Kernel from AITER ( #15001 )
...
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-04-22 02:46:28 -07:00
8f7bace7c3
[Doc] Improve documentation for multimodal CLI args ( #16960 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-22 08:35:35 +00:00
e4d6144232
[BugFix] Fix incremental detokenization perf issue ( #16963 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-22 08:16:19 +00:00
8d32dc603d
[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS ( #6036 )
...
Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com >
Co-authored-by: xinyuxiao <xinyuxiao2024@gmail.com >
2025-04-22 09:01:36 +01:00
c4ab9f3e71
[V1] Remove pre-allocation for KV cache ( #16941 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-22 00:52:18 -07:00
2689d5c027
[Model] Use autoweightloader for mamba ( #16950 )
...
Signed-off-by: sfeng33 <4florafeng@gmail.com >
2025-04-22 07:48:15 +00:00
acba33a0f1
[Bugfix] Fix the issue where llm.generate cannot be called repeatedly after setting GuidedDecodingParams ( #16767 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2025-04-22 06:02:20 +00:00
a114bf20a3
[Perf] Optimize _update_states for GPU model runner ( #16910 )
...
Signed-off-by: snowcharm <snowcharmqq@gmail.com >
2025-04-22 14:01:54 +08:00
3097ce3a32
[Doc] Update ai_accelerator/hpu-gaudi.inc.md ( #16956 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-22 05:33:27 +00:00
d6da9322c8
[Bugfix] Fix f-string for Python 3.9-3.11 ( #16962 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-21 21:45:55 -07:00
71ce44047f
Support S3 Sharded loading with RunAI Model Streamer ( #16317 )
...
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-21 21:21:49 -07:00
188b7f9b8c
[Performance][ROCm] Add skinny gemms for unquantized linear on ROCm ( #15830 )
...
Signed-off-by: charlifu <charlifu@amd.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
2025-04-21 20:46:22 -07:00
b9b4746950
[V1] Remove additional_config check ( #16710 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-04-21 20:45:27 -07:00
7b8a2ab76f
[Kernel] Add expert_map support to Cutlass FP8 MOE ( #16861 )
...
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com >
Co-authored-by: varun sundar rabindranath <vsundarr@redhat.com >
2025-04-21 20:44:32 -07:00
c9acbf1141
[Misc] Remove the chunked prefill warning for LoRA ( #16925 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-21 20:44:24 -07:00
5b794cae8d
[ROCm] Add aiter tkw1 kernel for Llama4 fp8 ( #16727 )
...
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com >
2025-04-21 20:42:34 -07:00
0e4254492f
[Bugfix]: fix issue with n>1 sampling on v1 requests overriding each other ( #16863 )
...
Signed-off-by: Jeffrey Li <jeffrey.dot.li@gmail.com >
2025-04-22 11:40:19 +08:00
1311913f55
[BugFix][Spec Decode] No in-place update to draft probs ( #16952 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-21 19:54:19 -07:00
29f395c97c
[Doc] Remove unnecessary V1 flag ( #16924 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-21 21:04:38 -04:00
fa3bba2a53
[TPU][V1] Enable Top-P ( #16843 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-22 00:46:07 +00:00
986537f1c3
[V1] V1 FlashInfer Attention ( #16684 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Aurick Qiao <qiao@aurick.net >
2025-04-22 00:38:41 +00:00
210207525e
[TPU][V1] Capture multimodal encoder during model compilation ( #15051 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Siyuan Liu <lsiyuan@google.com >
2025-04-21 18:36:59 -06:00
71eda0bb76
Update Qwen1.5-MoE-W4A16-compressed-tensors.yaml ( #16946 )
2025-04-21 18:35:32 -06:00
471fe65630
[TPU][V1] Implicitly adjust page size when there's SMEM OOM ( #16871 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-21 15:43:13 -06:00
3a0fba5cf4
[V1][Spec Decode] Handle draft tokens beyond max_model_len ( #16087 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-21 12:38:50 -07:00
299ebb62b2
[Core] Speed up decode by remove synchronizing operation in sampler ( #16436 )
...
Signed-off-by: Chanh Nguyen <cnguyen@linkedin.com >
Co-authored-by: Chanh Nguyen <cnguyen@linkedin.com >
2025-04-21 18:18:22 +00:00
f728ab8e35
[Doc] mention how to install in CPU editable mode ( #16923 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-04-21 17:45:51 +00:00
63e26fff78
[doc] install required python3-dev apt package ( #16888 )
...
Signed-off-by: David Xia <david@davidxia.com >
2025-04-21 16:15:18 +00:00
fe3462c774
[XPU][Bugfix] minor fix for XPU ( #15591 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2025-04-22 00:02:57 +08:00
3b34fd5273
Raise error for data-parallel with benchmark_throughput ( #16737 )
...
Signed-off-by: Kartik Ramesh <kartikx2000@gmail.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-04-21 23:51:43 +08:00
55d6d3fdb8
[Bugfix] Fix GLM rotary_dim issue and support v1 ( #16912 )
...
Signed-off-by: isotr0py <2037008807@qq.com >
2025-04-21 14:26:34 +00:00
7272bfae77
[Misc] Refactor platform to get device specific stream and event ( #14411 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-21 21:25:49 +08:00
d9ac9e3dc5
[Misc] fix collect_env version parse ( #15267 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-04-21 20:29:40 +08:00
d41faaf9df
Restore buffers when wake up from level 2 sleep ( #16564 ) ( #16889 )
...
Signed-off-by: Han <zh950713@gmail.com >
2025-04-21 20:18:28 +08:00
b34f33438a
[Doc] Split dummy_processor_inputs() in Multimodal Docs ( #16915 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-21 11:10:01 +00:00
26c0406555
[Bugfix] Fix distributed bug in Qwen2.5-VL & Qwen2.5-Omni ( #16907 )
2025-04-21 10:25:21 +00:00
4c41278b77
[CI/CD][V1] Add spec decode tests to CI ( #16900 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-20 22:37:16 -07:00
bb3605db85
[Bugfix] Fix v1/spec_decode/test_ngram.py ( #16895 )
...
Signed-off-by: qizixi <qizixi@meta.com >
2025-04-20 20:54:29 -07:00
fe742aef5a
[easy] Pass compile_fx only the config patches ( #16845 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-20 12:25:19 +08:00
4b07d36891
Improve configs - CacheConfig ( #16835 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-20 12:25:04 +08:00
87aaadef73
Serialize tensors using int8 views ( #16866 )
...
Signed-off-by: Staszek Pasko <staszek@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-19 10:28:34 -07:00
682e0b6d2f
Log how much time loading a compiled artifact takes ( #16848 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-19 16:50:46 +00:00
d6195a748b
[doc] update hyperlink ( #16877 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-19 16:40:38 +00:00
205d84aaa9
[VLM] Clean up models ( #16873 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-19 12:13:06 +00:00
5124f5bf51
[Model] Qwen2.5-Omni Cleanup ( #16872 )
2025-04-19 09:37:02 +00:00
83f3c3bd91
[Model] Refactor Phi-4-multimodal to use merged processor and support V1 ( #15477 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-19 02:26:11 -07:00
d9737ca1c6
[V1][Misc] stop update prefix cache stats when logs_stats is disabled ( #16460 )
...
Signed-off-by: vie-serendipity <2733147505@qq.com >
2025-04-19 02:25:19 -07:00
9d4ca19d50
[Misc] Benchmarks for audio models ( #16505 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-19 02:24:14 -07:00
2ef0dc53b8
[Frontend] Add sampling params to v1/audio/transcriptions endpoint ( #16591 )
...
Signed-off-by: Jannis Schönleber <joennlae@gmail.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Co-authored-by: Jannis Schönleber <joennlae@gmail.com >
2025-04-19 07:03:54 +00:00
1d4680fad2
[rocm][MI300] llama4 maverick fp8 moe config tp8 ( #16847 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-04-19 06:21:43 +00:00
2c1bd848a6
[Model][VLM] Add Qwen2.5-Omni model support (thinker only) ( #15130 )
...
Signed-off-by: fyabc <suyang.fy@alibaba-inc.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Xiong Wang <wangxiongts@163.com >
2025-04-18 23:14:36 -07:00
5c9121203c
[release] Publish neuron docker image ( #16733 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2025-04-18 17:11:25 -07:00
490b1698a5
[Doc] Updated Llama section in tool calling docs to have llama 3.2 config info ( #16857 )
...
Signed-off-by: jmho <jaylenho734@gmail.com >
2025-04-18 23:28:53 +00:00
5a5e29de88
[Misc] refactor examples series - Chat Completion Client With Tools ( #16829 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-18 23:24:42 +00:00
3d3ab3689f
[New Model]: Snowflake Arctic Embed (Family) ( #16649 )
2025-04-18 08:11:57 -07:00
686623c5e7
Fix nullable_kvs fallback ( #16837 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-18 05:58:39 -07:00
aadb656562
[Misc] Clean up Kimi-VL ( #16833 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-18 05:15:09 -07:00
87e067de41
[Model] use AutoWeightsLoader for BigCode, GPT-J ( #16823 )
...
Signed-off-by: Jonghyun Choe <andy.choe729@gmail.com >
2025-04-18 10:42:41 +00:00
26507f8973
[Docs] Fix a link and grammar issue in production-stack.md ( #16809 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-18 06:42:58 +00:00
9c1d5b456d
[Doc] add podman setup instructions for official image ( #16796 )
...
Signed-off-by: Nathan Weinberg <nweinber@redhat.com >
2025-04-18 06:10:49 +00:00
e31045f95c
[Bugfix] fix pp for llama4 ( #16746 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-04-18 13:51:30 +08:00
aaec845f8e
[ROCm] [Attention] Cleanup ROCm output passing ( #16431 )
...
Signed-off-by: Luka Govedič <lgovedic@redhat.com >
2025-04-18 05:46:45 +00:00
7bdfd29a35
[Misc] add collect_env to cli and docker image ( #16759 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-17 22:13:35 -07:00
e78587a64c
Improve-mm-and-pooler-and-decoding-configs ( #16789 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-17 22:13:32 -07:00
7eb4255628
[BugFix] Accuracy fix for llama4 int4 - improperly casted scales ( #16801 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-17 22:13:29 -07:00
6a0f547561
Add hardware print to TPU V1 test ( #16792 )
2025-04-17 22:13:26 -07:00
30ed81b7ca
[V1][Structured Output] Minor modification to _validate_structured_output() ( #16748 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-18 13:12:54 +08:00
7a4a5de729
[Misc] Update outdated note: LMCache now supports chunked prefill ( #16697 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-18 05:12:42 +00:00
c16fb5dae8
[Doc] Improve help examples for --compilation-config ( #16729 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-17 21:22:34 -07:00
e37073efd7
Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema ( #16721 )
...
Signed-off-by: Tarun Kumar <takumar@redhat.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-17 21:08:27 -07:00
183dad7a85
[Attention] Update to lastest FA3 code ( #13111 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-17 15:14:07 -07:00
3408e47159
[P/D][V1] KV Connector API V1 ( #15960 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Signed-off-by: remi <remi@mistral.ai >
Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: Rémi Delacourt <54138269+Flechman@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
2025-04-17 13:22:40 -07:00
0377b8310b
[MLA] Simplification to batch P/D reordering ( #16673 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-17 16:12:09 -04:00
e4755f7fac
[V1][Metrics] Fix http metrics middleware ( #15894 )
2025-04-17 19:52:18 +00:00
92edf35826
[ROCM] enable aiter fused moe kernel for llama4 bf16 checkpoints ( #16674 )
2025-04-17 11:44:34 -07:00
eb5819b2d9
[V1][TPU] Enable Top K ( #15489 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Hyesoo Yang <hyeygit@gmail.com >
Co-authored-by: Hyesoo Yang <hyeygit@gmail.com >
2025-04-17 18:18:11 +00:00
5989f4684d
[TPU][V1] Fix padding recompilation when max-num-batched-tokens is not even ( #16726 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-17 18:09:57 +00:00
5125d72f02
[Model] use AutoWeightsLoader for olmoe,opt,orion,persimmon,phi3_small ( #16548 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-17 17:48:31 +00:00
a018e555fd
[Kernel] Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 ( #16753 )
...
Signed-off-by: ximing.wxm <ximing.wxm@antgroup.com >
Co-authored-by: ximing.wxm <ximing.wxm@antgroup.com >
2025-04-18 00:01:30 +08:00
6211b92273
[Bugfix]Fix index out of range error in api server log ( #16787 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-04-17 09:01:07 -07:00
05fcd1b430
[V1][Perf] Faster incremental detokenization ( #15137 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-17 07:45:24 -07:00
7c02d6a137
[Doc] Changed explanation of generation_tokens_total and prompt_tokens_total counter type metrics to avoid confusion ( #16784 )
...
Signed-off-by: insukim1994 <insu.kim@moreh.io >
2025-04-17 14:10:08 +00:00
11c3b98491
[Doc] Document Matryoshka Representation Learning support ( #16770 )
2025-04-17 13:37:37 +00:00
dbe7f07001
[Doc] Make sure to update vLLM when installing latest code ( #16781 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-17 06:53:31 -06:00
c69bf4ee06
fix: hyperlink ( #16778 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-17 11:34:20 +00:00
d27ea94034
Improve configs - TokenizerPoolConfig + DeviceConfig ( #16603 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-17 11:19:42 +00:00
99ed526101
[Misc] refactor examples series - lmcache ( #16758 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-17 11:02:35 +00:00
207da28186
[Doc] Fix a 404 link in installation/cpu.md ( #16773 )
...
Signed-off-by: windsonsea <haifeng.yao@daocloud.io >
2025-04-17 10:46:21 +00:00
5b1aca2ae3
[Bugfix] Fix GLM4 model ( #16618 )
...
Signed-off-by: intervitens <intervitens@tutanota.com >
2025-04-17 03:35:07 -07:00
d8e557b5e5
[doc] add open-webui example ( #16747 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-17 18:27:32 +08:00
61a44a0b22
[Doc] Add more tips to avoid OOM ( #16765 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-17 09:54:34 +00:00
a6481525b8
[misc] ignore marlin_moe_wna16 local gen codes ( #16760 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-17 17:15:14 +08:00
8cac35ba43
[Ray] Improve documentation on batch inference ( #16609 )
...
Signed-off-by: Richard Liaw <rliaw@berkeley.edu >
2025-04-16 22:19:26 -07:00
9dbf7a2dc1
[V1] Remove log noise when idle ( #16735 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-16 21:34:08 -07:00
607029e515
[Bugfix] Revert max_prompt_len validation for decoder-only models. ( #16741 )
...
Signed-off-by: David Heineman <david@davidheineman.com >
2025-04-16 21:33:15 -07:00
cb072ce93b
[Bugfix] Update Florence-2 tokenizer to make grounding tasks work ( #16734 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-17 04:17:39 +00:00
95aca283b4
[rocm][V0] fix selection logic for custom PA in V0 ( #16426 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-04-16 19:52:11 -07:00
2b05b8ce69
[V1][Frontend] Improve Shutdown And Logs ( #11737 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com >
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-16 19:48:34 -07:00
3c776dcefb
Adding vllm buildkite job for IBM Power ( #16679 )
...
Signed-off-by: Aaruni Aggarwal <aaruniagg@gmail.com >
2025-04-17 10:47:47 +08:00
2cbd4d2999
[V1][Spec Dec Bug Fix] Respect Spec Dec Method Specification ( #16636 )
...
Signed-off-by: Bryan Lu <yuzhelu@amazon.com >
2025-04-16 19:47:26 -07:00
3092375e27
[V1][Performance] Implement custom serializaton for MultiModalKwargs [Rebased] ( #16432 )
...
Signed-off-by: Staszek Pasko <staszek@gmail.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-16 19:28:32 -07:00
3cd91dc955
Help user create custom model for Transformers backend remote code models ( #16719 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-17 01:05:59 +00:00
8a7368e069
[Misc] Remove redundant comment ( #16703 )
...
Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com >
2025-04-17 00:44:52 +00:00
93e561ec4d
Improve error for structured output backend selection ( #16717 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-17 00:35:35 +00:00
e1b004839a
[Hardware] Add processor inputs to platform validation ( #16680 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-04-16 09:28:42 -07:00
ee378f3d49
[Model] support modernbert ( #16648 )
...
Signed-off-by: 唯勤 <xsank.mz@alibaba-inc.com >
Co-authored-by: 唯勤 <xsank.mz@alibaba-inc.com >
2025-04-16 05:30:15 -07:00
e82ee40de3
[Bugfix][Kernel] fix potential cuda graph broken for merge_attn_states kernel ( #16693 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-16 03:31:39 -07:00
facbe2a114
[Doc] Improve OOM troubleshooting ( #16704 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-16 18:29:48 +08:00
7168920491
[Misc] refactor examples series ( #16708 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-16 10:16:36 +00:00
21378a2323
[CI] Cleanup additional_dependencies: [toml] for pre-commit yapf hook ( #16405 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-04-16 10:05:31 +00:00
976711d9db
[V1][Structured Output] Move xgrammar related utils to backend_xgrammar.py ( #16578 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-16 17:01:36 +08:00
44fa4d556c
[ROCM] Bind triton version to 3.2 in requirements-built.txt ( #16664 )
...
Signed-off-by: Sage Moore <sage@neuralmagic.com >
2025-04-16 14:05:28 +08:00
3ac98edcb1
[Feature] add model aware kv ops helper ( #16020 )
...
Signed-off-by: billishyahao <bill.he@amd.com >
2025-04-15 23:00:43 -07:00
966c742ed2
Disable remote caching when calling compile_fx ( #16611 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-15 22:18:28 -07:00
0d7d05f4b6
[Misc] Modify LRUCache touch ( #16689 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-16 04:51:38 +00:00
96bb8aa68b
[Bugfix] fix gpu docker image mis benchmarks dir ( #16628 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-15 21:21:14 -07:00
3badb0213b
[Model] Add PLaMo2 ( #14323 )
...
Signed-off-by: Shinichi Hemmi <50256998+Alnusjaponica@users.noreply.github.com >
Signed-off-by: shemmi <shemmi@preferred.jp >
Co-authored-by: Kento Nozawa <nzw0301@preferred.jp >
Co-authored-by: Hiroaki Mikami <mhiroaki@preferred.jp >
Co-authored-by: Calvin Metzger <metzger@preferred.jp >
2025-04-15 19:31:30 -07:00
fdcb850f14
[Misc] Enable vLLM to Dynamically Load LoRA from a Remote Server ( #10546 )
...
Signed-off-by: Angky William <angkywilliam@Angkys-MacBook-Pro.local >
Co-authored-by: Angky William <angkywilliam@Angkys-MacBook-Pro.local >
2025-04-15 22:31:38 +00:00
54a66e5fee
[Misc] Update compressed-tensors WNA16 to support zero-points ( #14211 )
2025-04-15 07:33:51 -06:00
280d62b8a2
[Kernel] Remove redundant Exp calculations ( #16123 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-15 12:58:37 +00:00
1666e66443
Add "/server_info" endpoint in api_server to retrieve the vllm_config. ( #16572 )
...
Signed-off-by: Xihui Cang <xihuicang@gmail.com >
2025-04-15 11:50:38 +00:00
1575c1701a
[CI/Build] Fix LoRA OOM ( #16624 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-15 16:38:19 +08:00
6ae996a873
[Misc] refactor argument parsing in examples ( #16635 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-15 08:05:30 +00:00
b590adfdc1
Fix vLLM x torch.compile config caching ( #16491 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-14 23:11:11 -07:00
b4fe16c75b
Add vllm bench [latency, throughput] CLI commands ( #16508 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-14 23:10:35 -07:00
bc5dd4f669
[Bugfix] Fix broken GritLM model and tests (missing pooling_metadata) ( #16631 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2025-04-14 23:09:58 -07:00
dbb036cf61
[Bugfix] Fix tests/kernels/test_mamba_ssm_ssd.py ( #16623 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-04-15 05:35:38 +00:00
70e7ed841d
[BugFix]: Update minimum pyzmq version ( #16549 )
...
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-04-14 20:06:03 -07:00
d06ba4ed3f
[Kernel] moe wna16 marlin kernel ( #14447 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-14 20:05:22 -07:00
6b40996ae8
[Core][Bugfix] Fix Offline MM Beam Search ( #16390 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-15 10:33:02 +08:00
d2020acac7
config check sleep mode support oot platforms ( #16562 )
2025-04-14 16:31:50 -07:00
1eb3c2ed48
[DOC][TPU] Add core idea about avoiding recompilation after warmup ( #16614 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-14 21:56:06 +00:00
c64ee87267
[Hardware][TPU] Add torchvision to tpu dependency file ( #16616 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-04-14 17:50:46 -04:00
b1308b84a3
[Model][VLM] Add Kimi-VL model support ( #16387 )
...
Signed-off-by: courage17340 <courage17340@163.com >
2025-04-14 21:41:48 +00:00
7b5ecf79bd
s390x: Fix PyArrow build and add CPU test script for Buildkite CI ( #16036 )
...
Signed-off-by: Nishan Acharya <Nishan.Acharya@ibm.com >
2025-04-14 10:55:32 -07:00
9883a18859
Fix triton install condition on CPU ( #16600 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-14 17:06:01 +00:00
b3f2fddd17
[TPU][V1] Fix exponential padding when max-num-batched-tokens is not a power of 2 ( #16596 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-14 17:01:05 +00:00
aa29841ede
[Bugfix] Multi-modal caches not acting like LRU caches ( #16593 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-14 09:24:16 -07:00
6bf27affb6
[fix]: Dockerfile.ppc64le fixes for opencv-python and hf-xet ( #16048 )
...
Signed-off-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com >
2025-04-14 17:08:39 +01:00
1dd23386ec
[Misc] Update usage with mooncake lib for kv transfer ( #16523 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-04-14 11:31:37 +00:00
7cbfc10943
[Misc] refactor examples ( #16563 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-14 09:59:15 +00:00
ce4ddd2d1a
[Misc] remove warning if triton>=3.2.0 ( #16553 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-14 02:39:47 -07:00
e51929ebca
Improve configs - SchedulerConfig ( #16533 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-14 17:24:16 +08:00
dc1b4a6f13
[Core][V0] Enable regex support with xgrammar ( #13228 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-14 10:13:38 +08:00
63d2705edb
[Benchmark][Bugfix] Fix SonnetDataset default values in benchmark_throughput.py ( #16556 )
2025-04-13 17:20:26 -07:00
d085a44082
Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) ( #16537 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-13 14:55:18 +00:00
f49e5aff11
[V1][Spec Decode] KV cache slots for eagle heads ( #16370 )
...
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-04-12 19:42:51 -07:00
6c11ecf8d3
[Bugfix] Validate logit biases to prevent out of vocab ids crashing engine ( #16529 )
...
Signed-off-by: Ryan McConville <ryan@ryanmcconville.com >
2025-04-12 20:19:19 +00:00
93e5f3c5fb
[Perf] Optimize Preparing Inputs for GPU Model Runner ( #16484 )
...
Signed-off-by: snowcharm <snowcharmqq@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-12 22:54:37 +08:00
70363bccfa
Fix syntaxWarning: invalid escape sequence '\s' ( #16532 )
...
Signed-off-by: Jie Fu <jiefu@tencent.com >
2025-04-12 14:39:42 +00:00
3cdc57669f
[Misc] Delete redundant code ( #16530 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-04-12 11:21:37 +00:00
68bb122eb4
[MISC] Make GroupCoordinator compatible with out-of-tree devices ( #16464 )
...
Signed-off-by: hzji210@gmail.com <hzji210@gmail.com >
2025-04-12 09:20:25 +00:00
d9fc8cd9da
[V1] Enable multi-input by default ( #15799 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-12 08:52:39 +00:00
f069f3ea74
[Misc] Openai transcription client example use same Whisper model ( #16487 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-12 07:27:03 +00:00
c5bc0e7fcc
[Misc] Update chat utils tests ( #16520 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-12 06:48:43 +00:00
4a3a518722
fix: spelling ( #16466 )
...
Signed-off-by: Tianer Zhou <ezhoureal@gmail.com >
2025-04-11 23:24:22 -07:00
fbf722c6e6
[Frontend] support matryoshka representation / support embedding API dimensions ( #16331 )
2025-04-11 23:23:10 -07:00
e92d7085bf
[Feature][V1] Add xgrammar to support minLength, maxLength with test ( #16516 )
...
Signed-off-by: Leon Seidel <leon.seidel@fau.de >
2025-04-11 23:22:07 -07:00
bd6028d6b0
Optimized topk for topk=1 (Llama-4) ( #16512 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-12 14:21:08 +08:00
802329dee9
[Doc] Update Llama4 Model Names in Supported Models ( #16509 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-04-12 02:53:10 +00:00
41cc883c29
[BugFix] Handle non-contiguous tensors properly when serializing ( #16492 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-11 17:54:06 -07:00
57504a4bcf
[CI][Bugfix] Add mistral_tool_use to Ci ( #16517 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-11 17:52:38 -07:00
ed4792c990
[Doc] Fix link to vLLM blog ( #16519 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-04-11 17:39:23 -07:00
87b836ba77
Bugfix for PixtralHF models without spatial_merge_size ( #16513 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-11 23:32:22 +00:00
56c76c2e0e
[Bugfix] clean up duplicated code ( #16485 )
...
Signed-off-by: Gogs <gogs@fake.local >
Co-authored-by: Gogs <gogs@fake.local >
2025-04-11 23:19:40 +00:00
c09632a66c
Update openai_compatible_server.md ( #16507 )
...
Signed-off-by: Christian Sears <csears@redhat.com >
2025-04-11 22:54:58 +00:00
a3bf8d4a2b
[Kernel] Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 ( #16488 )
2025-04-12 06:26:55 +08:00
16eda8c43a
[Frontend] Added chat templates for LLaMa4 pythonic tool calling ( #16463 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
Co-authored-by: Kai Wu <kaiwu@meta.com >
2025-04-12 06:26:17 +08:00
cd77382ac1
Improve configs - LoadConfig ( #16422 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-11 20:27:27 +00:00
71b9cde010
[Bugfix] handle alignment of encoder_seq_lens in mllama.py ( #14784 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-04-11 19:59:50 +00:00
5285589f37
[Doc] Document InternVL3 support ( #16495 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-11 19:41:09 +00:00
f41647ee6b
[Kernel] Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel ( #16366 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-11 17:54:08 +00:00
4d022cbc75
[TPU][V1] Make --disable_chunked_mm_input mandatory for serving MM models ( #16483 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-11 17:06:14 +00:00
70de35a881
Fix erroneous "model doesn't support compile" warning ( #16486 )
...
Signed-off-by: rzou <zou3519@gmail.com >
2025-04-11 16:24:36 +00:00
34b2cf3b33
[Hardware][Intel-Gaudi] Multi-step scheduling implementation for HPU ( #12779 )
...
Signed-off-by: Tomasz Zielinski <tomasz.zielinski@intel.com >
2025-04-11 07:38:36 -07:00
9e90c9f73f
[Bugfix] Fix bugs of running Quark quantized models ( #16236 )
...
Signed-off-by: chaow <chaow@amd.com >
2025-04-11 10:18:32 -04:00
e9528f6dc6
[Kernel] support merge_attn_states CUDA kernel, 3x speedup ( #16173 )
...
Signed-off-by: DefTruth <qiustudent_r@163.com >
2025-04-11 06:50:50 -06:00
51baa9c333
Don't install triton on ppc64le platform ( #16470 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-11 10:11:00 +00:00
35e076b3a8
[Misc] update api_client example ( #16459 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-11 10:05:40 +00:00
a26f59ccbc
[Misc] Raise error for V1 not supporting Long LoRA. ( #16415 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-11 01:51:20 -07:00
aa3b3d76e0
Enforce valid max_num_batched_tokens when disable_chunked_mm_input=True ( #16447 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-11 08:09:52 +00:00
f7030df3be
[Core][LoRA][1/N] Add LoRA for EncoderDecoderModelRunner ( #15990 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-11 15:32:37 +08:00
905e91e9ac
Revert "[Model] use AutoWeightsLoader for deepseek_v2, internlm2" ( #16453 )
2025-04-11 06:44:22 +00:00
f8f9c0ba62
[Bugfix] Don't set an upper bound on repetition penalty ( #16403 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-04-11 14:19:40 +08:00
dda811021a
[CPU][Bugfix] Fix CPU docker issues ( #16454 )
...
Signed-off-by: jiang.li <jiang1.li@intel.com >
2025-04-11 14:19:07 +08:00
93195146ea
[Bugfix][VLM] Fix failing Phi-4-MM multi-images tests and add vision-speech test ( #16424 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-11 04:57:16 +00:00
ed37599544
Update supported_hardware.md for TPU INT8 ( #16437 )
2025-04-11 12:28:07 +08:00
99ef59cf7f
[Llama4] Enable attention temperature tuning by default for long context (>32k) ( #16439 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
Co-authored-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-04-10 21:26:07 -07:00
d544d141ec
update benchmark_serving_structured_output to include auto backend ( #16438 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-04-11 12:25:52 +08:00
3e397a9484
check input length of sonnet samples ( #16423 )
...
Signed-off-by: alexey-belyakov <alexey.belyakov@intel.com >
2025-04-11 10:15:06 +08:00
268c325078
Fix range_ratio Bug in RandomDataset ( #16126 )
...
Signed-off-by: jadewang21 <jadewangcn@outlook.com >
2025-04-10 15:31:17 -07:00
3cc9af88ff
[TPU][V1] Disable per-request seed/Generator ( #16172 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-10 17:05:44 -04:00
7cd0bd7212
[Bugfix] Fix output token length check logic ( #16419 )
...
Signed-off-by: look <eeslook@163.com >
2025-04-10 20:16:48 +00:00
56d4aefa33
[VLM] Avoid unnecessary dummy multimodal data during processing ( #16416 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-10 19:32:14 +00:00
dd143ef541
[V1] Zero-copy tensor/ndarray serialization/transmission ( #13790 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-10 19:23:14 +00:00
daefed052c
[Model] Reduce redundant computations in mamba2 blocks for Bamba-9B ( #15423 )
...
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
2025-04-10 19:07:07 +00:00
5fbab20e02
[Bugfix] Fix bug when dataset is json ( #15899 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-04-10 18:35:41 +00:00
e8224f3dca
[V1][Spec Decode] Eagle Model loading ( #16035 )
...
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-04-10 11:21:48 -07:00
9665313c39
[V1] Set structured output backend to auto by default ( #15724 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-10 17:53:26 +00:00
0c54fc7273
Improve configs - ParallelConfig ( #16332 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-10 17:34:37 +00:00
c1b57855ec
[TPU][V1] Use language_model interface for getting text backbone in MM ( #16410 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-10 17:32:04 +00:00
83b824c8b4
[VLM] Remove BaseProcessingInfo.get_mm_max_tokens_per_item ( #16408 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-10 09:06:58 -07:00
7678fcd5b6
Fix the torch version parsing logic ( #15857 )
2025-04-10 07:37:47 -07:00
8661c0241d
[CI] Add auto update workflow for Dockerfile graph ( #11879 )
...
Signed-off-by: wineandchord <guoqizhou19@gmail.com >
2025-04-10 13:43:05 +00:00
ce8d6b75fc
[doc] update the wrong link ( #16401 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-10 21:02:37 +08:00
61de3ef74b
[Model] Remove image mm limit for LLaMa4 ( #16365 )
...
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
2025-04-10 09:36:27 +00:00
ec1f9c8c91
Update Numba to 0.61.2 ( #16376 )
...
Signed-off-by: cyy <cyyever@outlook.com >
2025-04-10 07:59:37 +00:00
65e09094c4
[doc] add download model tips ( #16389 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-10 07:45:26 +00:00
c70cf0fe06
[Kernel] Use moe_wna16 kernel for compressed tensors wna16 moe models ( #16038 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-10 15:08:47 +08:00
a5d11a54dc
[Bugfix] Fix validation error for text-only Mllama 3.2 ( #16377 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-10 14:19:42 +08:00
3d4c87758e
[Misc] Update transformers version limits of multi-modal tests ( #16381 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-09 23:03:33 -07:00
a9bd832fc5
[Model] use AutoWeightsLoader for deepseek_v2, internlm2 ( #16383 )
...
Signed-off-by: Aaron Ang <aaron.angyd@gmail.com >
2025-04-09 23:01:00 -07:00
417bcefbae
fix sonnet dataset sample when prefix len is very small ( #16379 )
...
Signed-off-by: Chenyaaang <chenyangli@google.com >
2025-04-10 05:35:07 +00:00
baada0e737
[Bugfix][TPU] Fix TPU validate_request ( #16369 )
...
Signed-off-by: Michael Goin <mgoin64@gmail.com >
2025-04-10 12:55:12 +08:00
82eb61dd4c
[misc] use tqdm.auto where appropriate ( #16290 )
...
Signed-off-by: Benjamin Kitor <bkitor@gigaio.com >
2025-04-09 21:54:54 -07:00
0d4d06fe2f
[CI][Bugfix] Pin triton version for CPU ( #16384 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-04-10 04:35:00 +00:00
4aed0ca6a2
[bugfix] Avoid the time consumption caused by creating dummy videos. ( #16371 )
2025-04-10 04:30:05 +00:00
1621b25288
[TPU] Fix dummy loading OOM ( #16372 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-10 04:06:16 +00:00
a564797151
[Model] use AutoWeightsLoader for granite, granitemoe, granitemoeshared, grok1, mixtral ( #16325 )
...
Signed-off-by: Aaron Ang <aaron.angyd@gmail.com >
2025-04-09 20:07:40 -07:00
1da6a09274
[Bugfix]: do not shutdown server if skip_special_use=False for MistralTokenizer ( #14094 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-09 19:43:09 -07:00
1e44ffc3ff
Add GLM-4-0414 support ( #16338 )
...
Signed-off-by: lvfei.lv <lvfei.lv@alibaba-inc.com >
Signed-off-by: zRzRzRzRzRzRzR <2448370773@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: Ajay Vohra <ajayvohr@amazon.com >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
Co-authored-by: Accelerator1996 <lvfei.lv@alibaba-inc.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: yihong <zouzou0208@gmail.com >
Co-authored-by: Lucia Fang <116399278+luccafong@users.noreply.github.com >
Co-authored-by: ajayvohra2005 <ajayvohr@amazon.com >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-10 09:19:42 +08:00
a454748544
[TPU][V1] Refine tpu_model_runner to mitigate future recompilation issues ( #16275 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-09 18:51:51 -06:00
1bff42c4b7
[Misc] refactor Structured Outputs example ( #16322 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-09 23:32:42 +00:00
cb391d85dc
[Hardware] add platform-specific request validation api ( #16291 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-04-09 12:50:01 -07:00
fee5b8d37f
[Build/CI] Add tracing deps to vllm container image ( #15224 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-09 19:14:06 +00:00
b2ce859bd2
Fix benchmark_throughput.py --backend=hf ( #16352 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-09 19:09:28 +00:00
566f10a929
[CI]Fix hpu docker and numpy version for CI ( #16355 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2025-04-09 17:52:26 +00:00
c3b5189137
[Bugfix] catch AssertionError in MistralTokenizer as ValueError ( #16344 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-09 17:33:24 +00:00
a25866ac8d
[Bugfix] Fix profiling.py ( #16202 )
...
Signed-off-by: zh Wang <rekind133@outlook.com >
2025-04-09 17:03:34 +00:00
098900d7c2
Revert "Update label-tpu mergify and remove removal bot" ( #16350 )
2025-04-09 07:59:36 -07:00
98d01d3ce2
[Bugfix][Frontend] respect provided default guided decoding backend ( #15476 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2025-04-09 05:11:10 -07:00
d55244df31
[Model] Add SupportsMultiModal.get_language_model interface ( #16007 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-04-09 04:12:54 -07:00
04149cce27
[BugFix] fix some typos found by typos. ( #16314 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-09 03:43:59 -07:00
24834f4894
update neuron config ( #16289 )
...
Signed-off-by: Ajay Vohra <ajayvohr@amazon.com >
2025-04-09 03:43:22 -07:00
ec7da6fcf3
[BugFix] llama4 qknorm should be not shared across head ( #16311 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-04-09 00:59:14 -07:00
819d548e8a
[BugFix] logger is not callable ( #16312 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-09 00:59:02 -07:00
477d2a8aa2
Update label-tpu mergify and remove removal bot ( #16298 )
2025-04-09 07:56:25 +00:00
e484e02857
[Bugfix] Avoid transferring cached multi-modal items from P0 to P1 ( #16273 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-09 00:51:27 -07:00
24f6b9a713
[Misc] Fix test_sharded_state_loader.py( #16004 ) ( #16005 )
...
Signed-off-by: lvfei.lv <lvfei.lv@alibaba-inc.com >
2025-04-09 14:47:30 +08:00
9cdde47289
[BugFix] Fix fusion test and add them to CI ( #16287 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2025-04-08 23:46:45 -07:00
b1eb4ca152
[TPU] Update PyTorch/XLA ( #16288 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-09 14:46:32 +08:00
87b4ac56c2
[CI][Bugfix] Fix bad tolerance for test_batch_base64_embedding ( #16221 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-09 04:14:46 +00:00
cb84e45ac7
[Core] Upgrade to xgrammar 0.1.18, add cache size limit ( #16283 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-08 19:13:22 -07:00
4716377fbc
[Feature] Estimate max-model-len use available KV cache memory ( #16168 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-08 19:12:51 -07:00
4e9cf8c1dd
[Bugfix] fix gettid method is not define ( #16084 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-08 19:12:44 -07:00
2976dc27e9
[Bug] [ROCm] Fix Llama 4 Enablement Bug on ROCm: V0 ROCmFlashAttentionImpl and Triton Fused MoE bugs ( #16198 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com >
Co-authored-by: Hongxia Yang <hongxia.yang@amd.com >
Co-authored-by: kliuae <kuanfu.liu@embeddedllm.com >
2025-04-08 19:12:34 -07:00
102bf967f0
[Model] Add smolvlm support ( #16017 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-08 19:12:17 -07:00
1f4b09b525
Add support to modelopt quantization of Mixtral model ( #15961 )
...
Signed-off-by: Yue <yueshen@nvidia.com >
2025-04-09 01:53:31 +00:00
86c3369eb8
[CI/Build] Fix CI LoRA failure ( #16270 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-04-09 09:13:56 +08:00
2755c34a8f
[V1] Update structured output offline inference example ( #15721 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-04-08 22:34:09 +00:00
db10422184
[Bugfix] fix deepseek fp16 scale bug ( #14809 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-08 16:56:09 -04:00
e1a2c699dd
[BugFix] Fix Llama4 - Index Error When Single Request Near Max Context ( #16209 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-04-08 18:56:51 +00:00
0115ccd5c0
Add warning that content below line in template will be removed ( #16276 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-08 18:18:40 +00:00
40b4284fe3
[Bugfix] Handle process_weights_after_loading for QKVCrossParallelLinear ( #15328 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-08 10:02:23 -07:00
4ebc0b9640
[Bugfix] Proper input validation for multi-modal encoder-decoder models ( #16156 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-08 09:45:21 -07:00
dc96fd54c6
[Misc] Avoid stripping meaningful whitespace from nvidia-smi topo -m output in collect_env.py ( #16272 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2025-04-08 16:08:09 +00:00
1f5d13ab9f
[New Model]: jinaai/jina-embeddings-v3 ( #16120 )
2025-04-08 08:39:12 -07:00
90cb44eb02
Update to transformers==4.51.1 ( #16257 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-08 06:53:39 -07:00
e11880deea
[Bugfix] Remove triton do_bench fast_flush arg ( #16256 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-04-08 13:51:06 +00:00
9351f91be9
[BugFix][ROCm] Fix GGUF MoE Dispatch Block_Dim for ROCm ( #16247 )
...
Signed-off-by: Tianyuan Wu <Tianyuan.Wu@amd.com >
2025-04-08 05:10:26 -07:00
5a1e1c8353
[Model] use AutoWeightsLoader for phimoe,qwen2_moe,qwen3_moe ( #16203 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-08 04:05:47 -07:00
69ecaa7c79
[Misc] Add warning for multimodal data in LLM.beam_search ( #16241 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2025-04-08 04:05:27 -07:00
7f00899ff7
[Misc] format and refactor some examples ( #16252 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-08 10:42:32 +00:00
995e3d1f41
[Docs] Add Slides from Singapore Meetup ( #16213 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-04-08 07:20:22 +00:00
b4ac449a83
[Misc] Merge the logs of pp layers partitions ( #16225 )
...
Signed-off-by: Kebe <mail@kebe7jun.com >
2025-04-08 00:18:15 -07:00
8e5314a468
[V1] Add disable_chunked_mm_input arg to disable partial mm input prefill ( #15837 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-07 23:24:07 -07:00
87918e40c4
[torch.compile][TPU] Make @support_torch_compile work for XLA backend ( #15782 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-04-08 14:23:53 +08:00
f6b32efb7f
[Bugfix] Fix and reorganize broken GGUF tests and bump gguf version ( #16194 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-08 13:38:13 +08:00
b99733d092
[Bugfix] Do not skip "empty" parts of chats that are parsable ( #16219 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-04-08 05:14:15 +00:00
05a015d6a5
Add warning for Attention backends that do not support irope yet ( #16212 )
2025-04-08 03:59:26 +00:00
ad971af8c7
[Bugfix] fix use-ep bug to enable ep by dp/tp size > 1 ( #16161 )
2025-04-07 20:48:47 -07:00
f2ebb6f541
[V1] Scatter and gather placeholders in the model runner ( #16076 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: mgoin <mgoin64@gmail.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: mgoin <mgoin64@gmail.com >
Co-authored-by: Jennifer Zhao <ai.jenniferzhao@gmail.com >
2025-04-08 10:43:41 +08:00
1d01211264
Update BASE_IMAGE to 2.22 release of Neuron ( #16218 )
2025-04-07 19:11:18 -07:00
f94ab12f79
[Misc] Update compressed-tensors to version 0.9.3 ( #16196 )
...
Signed-off-by: Miles Williams <42222518+mlsw@users.noreply.github.com >
2025-04-07 19:09:06 -07:00
a865bc1ca6
[core] do not send error across process ( #16174 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-04-07 19:09:03 -07:00
21802c4b6d
[ROCm][Bugfix][FP8] Make fp8 quant respect fused modules mapping ( #16031 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-04-07 21:28:14 -04:00
652907b354
Torchao ( #14231 )
...
Signed-off-by: drisspg <drisspguessous@gmail.com >
2025-04-07 19:39:28 -04:00
24f1c01e0f
[Bugfix][V0] XGrammar structured output supports Enum ( #15878 )
...
Signed-off-by: Leon Seidel <leon.seidel@fau.de >
2025-04-07 22:38:25 +00:00
fad6e2538e
[Misc] add description attribute in CLI ( #15921 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-07 22:30:35 +00:00
7f6d47c1a2
[V1][BugFix] Exit properly if engine core fails during startup ( #16137 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-04-07 15:30:15 -07:00
3147586ebd
[Bugfix] Fix guidance backend for Qwen models ( #16210 )
...
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai >
2025-04-07 22:15:43 +00:00
ed636d99ca
[Misc] Move Llama 4 projector call into encoder execution ( #16201 )
2025-04-07 14:02:05 -07:00
090c856d76
[Misc] Human-readable max-model-len cli arg ( #16181 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-04-07 14:40:58 -04:00
ad434d4cfe
Print the warning only once ( #16193 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-04-07 18:30:06 +00:00
66d433b94f
[V1] Revert the default max_num_seqs to V0 values for most hardware ( #16158 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-07 13:54:36 -04:00
027b204ff1
[Bugfix] Re-enable support for ChatGLMForConditionalGeneration ( #16187 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-07 23:15:58 +08:00
55dcce91df
Upstream Llama4 Support to Main ( #16113 )
...
Signed-off-by: Aston Zhang <22279212+astonzhang@users.noreply.github.com >
Signed-off-by: Chris Thi <chris.c.thi@gmail.com >
Signed-off-by: drisspg <drisspguessous@gmail.com >
Signed-off-by: Jon Swenson <jmswen@gmail.com >
Signed-off-by: Keyun Tong <tongkeyun@gmail.com >
Signed-off-by: Lu Fang <fanglu@meta.com >
Signed-off-by: Xiaodong Wang <xdwang@meta.com >
Signed-off-by: Yang Chen <yangche@fb.com >
Signed-off-by: Ye (Charlotte) Qi <yeq@meta.com >
Signed-off-by: Yong Hoon Shin <yhshin@meta.com >
Signed-off-by: Zijing Liu <liuzijing2014@gmail.com >
Signed-off-by: Lu Fang <lufang@fb.com >
Signed-off-by: Lu Fang <fanglu@fb.com >
Signed-off-by: Lucia Fang <fanglu@fb.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Lu Fang <fanglu@fb.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-07 08:06:27 -07:00
8017c8db7f
[Doc]Update image to latest version ( #16186 )
...
Signed-off-by: WangErXiao <863579016@qq.com >
2025-04-07 14:17:39 +00:00
dc3529dbf6
[Misc] improve example mlpspeculator and llm_engine_example ( #16175 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-07 11:53:52 +00:00
7699258ef0
[Model] Add Qwen3 and Qwen3MoE ( #15289 )
...
Signed-off-by: YamPengLi <yampayne.lyp@alibaba-inc.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-04-07 04:06:41 -07:00
e9ba99f296
[V1][Structured Output] Add supports_structured_output() method to Platform ( #16148 )
...
Signed-off-by: shen-shanshan <467638484@qq.com >
2025-04-07 11:06:24 +00:00
7c80368710
[VLM] Florence-2 supports online serving ( #16164 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-07 04:04:02 -07:00
95d63f38c0
doc: fix some typos in doc ( #16154 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-07 05:32:06 +00:00
bb8dab821e
[CI] Set max transformers version for Ultravox model test ( #16149 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-04-07 04:37:58 +00:00
fc0f87768a
[Bugfix] Make dummy encoder prompt padding alternative and add missing warnings ( #16129 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-07 04:07:15 +00:00
0a57386721
[Misc] Update Mistral-3.1 example ( #16147 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-04-07 03:57:37 +00:00
3749e28774
[V1][Minor] Minor simplification for get_computed_blocks ( #16139 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-04-06 20:38:12 -07:00
86fc2321ff
[Metrics] Add bucket for request_latency, time_to_first_token and time_per_output_token ( #15202 )
...
Signed-off-by: Kay Yan <kay.yan@daocloud.io >
2025-04-06 20:34:51 -07:00
2549c0dfef
Fix requires-python ( #16132 )
2025-04-06 19:22:25 -07:00
b10e519895
[V1][Minor] Optimize get_cached_block ( #16135 )
2025-04-06 20:48:14 +00:00
9bde5ba127
[TPU] Update PyTorch/XLA ( #16130 )
...
Signed-off-by: Chengji Yao <chengjiyao@google.com >
2025-04-06 18:25:55 +00:00
72c8f1ad04
[Misc] update requires-python in pyproject.toml ( #16116 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-06 14:56:34 +00:00
da224daaa9
[Bugfix] add hf_token to EngineArgs ( #16093 )
...
Signed-off-by: paolovic <paul-philipp.luley@uzh.ch >
Co-authored-by: paolovic <paul-philipp.luley@uzh.ch >
2025-04-06 14:47:33 +00:00
3a100b9278
[Bugfix] LoRA : Fix the order in which the kernels process LoRAs ( #16040 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-04-06 14:04:50 +00:00
242a637aea
[Model] use AutoWeightsLoader for stablelm,starcoder2,zamba2 ( #16103 )
...
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io >
2025-04-06 05:52:01 -07:00
c2a9671510
[Misc] Improve model redirect to accept json dictionary ( #16119 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-04-06 05:51:45 -07:00
d5ae4f7f42
[Doc][Bugfix] Add missing EOF in k8s deploy doc ( #16025 )
2025-04-06 12:10:57 +00:00
b6c502a150
[Misc] refactor example eagle ( #16100 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-06 09:42:48 +00:00
9ca710e525
[CI][V1] Fix passing tokenizer as kwarg to validate_guidance_grammar ( #16117 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-04-06 16:18:00 +08:00
eb07c8cb5b
[Frontend] Fix typo in tool chat templates for llama3.2 and toolace ( #14501 )
...
Signed-off-by: Ben Jackson <ben@ben.com >
2025-04-06 07:44:36 +00:00
ba10801961
[Benchmark] Add sampling parameters to benchmark_serving. ( #16022 )
...
Signed-off-by: Hyesoo Yang <hyeygit@gmail.com >
2025-04-06 12:30:35 +08:00
620fc2d09e
[Model] fix model testing for TeleChat2ForCausalLM and V0 llama4 ( #16112 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
2025-04-05 21:23:40 -07:00
29283eaa7e
[Model] use AutoWeightsLoader for phi, gemma, deepseek ( #16088 )
...
Signed-off-by: Jonghyun Choe <andy.choe729@gmail.com >
2025-04-05 20:34:38 -07:00
2fa66ef713
[Bugfix] fix use_atomic_add support of marlin kernel when using v1 engine ( #15946 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-04-05 20:04:22 -07:00
13affc432d
[Misc] Remove redundant code ( #16098 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2025-04-05 20:03:50 -07:00
d8f094a92a
[Misc] format output for encoder_decoder.py ( #16095 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-05 19:57:18 -07:00
97ae6d777f
Fix some capitalisations in generated examples doc titles ( #16094 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-04-05 13:44:03 +00:00
6baeee70d1
Revert "doc: add info for macos clang errors ( #16049 )" ( #16091 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-05 11:51:51 +00:00
d2517a4939
[doc] fix 404 ( #16082 )
...
Signed-off-by: reidliu41 <reid201711@gmail.com >
Co-authored-by: reidliu41 <reid201711@gmail.com >
2025-04-05 11:39:18 +00:00
6342adc438
fix: support clang17 for macos and fix the real libomp ( #16086 )
...
Signed-off-by: yihong0618 <zouzou0208@gmail.com >
2025-04-05 11:00:12 +00:00
0adba91547
[CI] Fix benchmark script level ( #16089 )
2025-04-05 03:36:01 -07:00
4285e423a6
[Misc] Auto detect bitsandbytes pre-quantized models ( #16027 )
...
Signed-off-by: Tristan Leclercq <tristanleclercq@gmail.com >
2025-04-04 23:30:45 -07:00