ed6e9075d3
[Bugfix] Fix deepseekv3 grouped topk error ( #13474 )
...
Signed-off-by: Chen-XiaoBing <chenxb002@whu.edu.cn >
2025-02-20 06:47:01 -08:00
992e5c3d34
Merge similar examples in offline_inference into single basic example ( #12737 )
2025-02-20 04:53:51 -08:00
b69692a2d8
[Kernel] LoRA - Refactor sgmv kernels ( #13110 )
2025-02-20 07:28:06 -05:00
a64a84433d
[2/n][ci] S3: Use full model path ( #13564 )
...
Signed-off-by: <>
2025-02-20 01:20:15 -08:00
aa1e62d0db
[ci] Fix spec decode test ( #13600 )
2025-02-20 16:56:00 +08:00
497bc83124
[CI/Build] Use uv in the Dockerfile ( #13566 )
2025-02-19 23:05:44 -08:00
3738e6fa80
[API Server] Add port number range validation ( #13506 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-20 15:05:13 +08:00
0023cd2b9d
[ROCm] MI300A compile targets deprecation ( #13560 )
2025-02-19 23:05:00 -08:00
041e294716
[Misc] add mm_processor_kwargs to extra_body for Qwen2.5-VL ( #13533 )
2025-02-19 23:04:30 -08:00
9621667874
[Misc] Warn if the vLLM version can't be retrieved ( #13501 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
2025-02-20 06:24:48 +00:00
8c755c3b6d
[bugfix] spec decode worker get tp group only when initialized ( #13578 )
2025-02-20 04:46:28 +00:00
ba81163997
[core] add sleep and wake up endpoint and v1 support ( #12987 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: cennn <2523403608@qq.com >
Co-authored-by: cennn <2523403608@qq.com >
2025-02-20 12:41:17 +08:00
0d243f2a54
[ROCm][MoE] mi300 mixtral8x7B perf for specific BS ( #13577 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-02-20 04:01:02 +00:00
88f6ba3281
[ci] Add AWS creds for AMD ( #13572 )
2025-02-20 03:56:06 +00:00
512368e34a
[Misc] Qwen2.5 VL support LoRA ( #13261 )
2025-02-19 18:37:55 -08:00
473f51cfd9
[3/n][CI] Load Quantization test models with S3 ( #13570 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-20 10:12:30 +08:00
a4c402a756
[BugFix] Avoid error traceback in logs when V1 LLM terminates ( #13565 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-20 00:49:01 +00:00
550d97eb58
[Misc] Avoid calling unnecessary hf_list_repo_files for local model path ( #13348 )
...
Signed-off-by: isotr0py <2037008807@qq.com >
2025-02-19 18:57:48 +00:00
fbbe1fbac6
[MISC] Logging the message about Ray teardown ( #13502 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com >
2025-02-19 09:40:50 -08:00
01c184b8f3
Fix copyright year to auto get current year ( #13561 )
2025-02-19 16:55:34 +00:00
ad5a35c21b
[doc] clarify multi-node serving doc ( #13558 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-19 22:32:17 +08:00
5ae9f26a5a
[Bugfix] Fix device ordinal for multi-node spec decode ( #13269 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-02-19 22:13:15 +08:00
377d10bd14
[VLM][Bugfix] Pass processor kwargs properly on init ( #13516 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-19 13:13:50 +00:00
52ce14d31f
[doc] clarify profiling is only for developers ( #13554 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-19 20:55:58 +08:00
81dabf24a8
[CI/Build] force writing version file ( #13544 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
2025-02-19 18:48:03 +08:00
423330263b
[Feature] Pluggable platform-specific scheduler ( #13161 )
...
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com >
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com >
2025-02-19 17:16:38 +08:00
caf7ff4456
[V1][Core] Generic mechanism for handling engine utility ( #13060 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-19 17:09:22 +08:00
f525c0be8b
[Model][Speculative Decoding] DeepSeek MTP spec decode ( #12755 )
...
Signed-off-by: Lu Fang <fanglu@fb.com >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-02-19 17:06:23 +08:00
983a40a8bb
[Bugfix] Fix Positive Feature Layers in Llava Models ( #13514 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
2025-02-19 08:50:07 +00:00
fdc5df6f54
use device param in load_model method ( #13037 )
2025-02-19 16:05:02 +08:00
3b05cd4555
[perf-benchmark] Fix ECR path for premerge benchmark ( #13512 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-19 07:56:11 +00:00
d5d214ac7f
[1/n][CI] Load models in CI from S3 instead of HF ( #13205 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-19 07:34:59 +00:00
fd84857f64
[Doc] Add clarification note regarding paligemma ( #13511 )
2025-02-18 22:24:03 -08:00
8aada19dfc
[ROCm][MoE configs] mi325 mixtral & mi300 qwen_moe ( #13503 )
2025-02-18 22:23:24 -08:00
9aa95b0e6a
[perf-benchmark] Allow premerge ECR ( #13509 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-19 05:13:41 +00:00
d0a7a2769d
[Hardware][Gaudi][Feature] Support Contiguous Cache Fetch ( #12139 )
...
Signed-off-by: yuzhou <yuzhou@habana.ai >
Signed-off-by: zhouyu5 <yu.zhou@intel.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-18 19:40:19 -08:00
00b69c2d27
[Misc] Remove dangling references to --use-v2-block-manager ( #13492 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-19 03:37:26 +00:00
4c82229898
[V1][Spec Decode] Optimize N-gram matching with Numba ( #13365 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-18 13:19:58 -08:00
c8d70e2437
Pin Ray version to 2.40.0 ( #13490 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-18 12:50:31 -08:00
30172b4947
[V1] Optimize handling of sampling metadata and req_ids list ( #13244 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-18 12:15:33 -08:00
a4d577b379
[V1][Tests] Adding additional testing for multimodal models to V1 ( #13308 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
2025-02-18 09:53:14 -08:00
7b203b7694
[misc] fix debugging code ( #13487 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-18 09:37:11 -08:00
4fb8142a0e
[V1][PP] Enable true PP with Ray executor ( #13472 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-18 09:15:32 -08:00
a02c86b4dd
[CI/Build] migrate static project metadata from setup.py to pyproject.toml ( #8772 )
2025-02-18 08:02:49 -08:00
3809458456
[Bugfix] Fix invalid rotary embedding unit test ( #13431 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-02-18 11:52:03 +00:00
d3231cb436
[Bugfix] Handle content type with optional parameters ( #13383 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2025-02-18 11:29:13 +00:00
435b502a6e
[ROCm] Make amdsmi import optional for other platforms ( #13460 )
2025-02-18 03:15:56 -08:00
29fc5772c4
[Bugfix] Remove noisy error logging during local model loading ( #13458 )
2025-02-18 03:15:48 -08:00
2358ca527b
[Doc]: Improve feature tables ( #13224 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-18 18:52:39 +08:00
8cf97f8661
[Bugfix] Fix failing transformers dynamic module resolving with spawn multiproc method ( #13403 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-02-18 10:25:53 +00:00
e2603fefb8
[Bugfix] Ensure LoRA path from the request can be included in err msg ( #13450 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-18 16:19:15 +08:00
b53d79983c
Add outlines fallback when JSON schema has enum ( #13449 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-18 06:49:41 +00:00
9915912f7f
[V1][PP] Fix & Pin Ray version in requirements-cuda.txt ( #13436 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-17 21:58:06 -08:00
d1b649f1ef
[Quant] Aria SupportsQuant ( #13416 )
2025-02-17 21:51:09 -08:00
ac19b519ed
[core] fix sleep mode in pytorch 2.6 ( #13456 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-18 13:48:10 +08:00
a1074b3efe
[Bugfix] Only print out chat template when supplied ( #13444 )
2025-02-17 21:43:31 -08:00
00294e1bc6
[Quant] Arctic SupportsQuant ( #13366 )
2025-02-17 21:35:09 -08:00
88787bce1d
[Quant] Molmo SupportsQuant ( #13336 )
2025-02-17 21:34:47 -08:00
932b51cedd
[v1] fix parallel config rank ( #13445 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-18 12:33:45 +08:00
7c7adf81fc
[ROCm] fix get_device_name for rocm ( #13438 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-02-18 04:07:12 +00:00
67ef8f666a
[Model] Enable quantization support for transformers backend ( #12960 )
2025-02-17 19:52:47 -08:00
efbe854448
[Misc] Remove dangling references to SamplingType.BEAM ( #13402 )
2025-02-17 19:52:35 -08:00
b3942e157e
[Bugfix][CI][V1] Work around V1 + CUDA Graph + torch._scaled_mm fallback issue ( #13425 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-18 00:32:48 +00:00
cd4a72a28d
[V1][Spec decode] Move drafter to model runner ( #13363 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-17 15:40:12 -08:00
6ac485a953
[V1][PP] Fix intermediate tensor values ( #13417 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-17 13:37:45 -08:00
4c21ce9eba
[V1] Get input tokens from scheduler ( #13339 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-17 11:01:07 -08:00
ce77eb9410
[Bugfix] Fix VLLM_USE_MODELSCOPE issue ( #13384 )
2025-02-17 14:22:01 +00:00
30513d1cb6
[Bugfix] fix xpu communicator ( #13368 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2025-02-17 20:59:18 +08:00
1f69c4a892
[Model] Support Mamba2 (Codestral Mamba) ( #9292 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
2025-02-17 20:17:50 +08:00
7b623fca0b
[VLM] Check required fields before initializing field config in DictEmbeddingItems ( #13380 )
2025-02-17 01:36:07 -08:00
238dfc8ac3
[MISC] tiny fixes ( #13378 )
2025-02-17 00:57:13 -08:00
45186834a0
Run v1 benchmark and integrate with PyTorch OSS benchmark database ( #13068 )
...
Signed-off-by: Huy Do <huydhn@gmail.com >
2025-02-17 08:16:32 +00:00
f857311d13
Fix spelling error in index.md ( #13369 )
2025-02-17 06:53:20 +00:00
46cdd59577
[Feature][Spec Decode] Simplify the use of Eagle Spec Decode ( #12304 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-02-16 19:32:26 -08:00
2010f04c17
[V1][Misc] Avoid unnecessary log output ( #13289 )
2025-02-16 19:26:24 -08:00
69e1d23e1e
[V1][BugFix] Clean up rejection sampler & Fix warning msg ( #13362 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-16 12:25:29 -08:00
d67cc21b78
[Bugfix][Platform][CPU] Fix cuda platform detection on CPU backend edge case ( #13358 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-02-16 18:55:27 +00:00
e18227b04a
[V1][PP] Cache Intermediate Tensors ( #13353 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-16 10:02:27 -08:00
7b89386553
[V1][BugFix] Add __init__.py to v1/spec_decode/ ( #13359 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-16 09:39:08 -08:00
da833b0aee
[Docs] Change myenv to vllm. Update python_env_setup.inc.md ( #13325 )
2025-02-16 16:04:21 +00:00
5d2965b7d7
[Bugfix] Fix 2 Node and Spec Decode tests ( #13341 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-16 22:20:22 +08:00
a0231b7c25
[platform] add base class for communicators ( #13208 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-16 22:14:22 +08:00
124776ebd5
[ci] skip failed tests for flashinfer ( #13352 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-16 22:09:15 +08:00
b7d309860e
[V1] Update doc and examples for H2O-VL ( #13349 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-02-16 10:35:54 +00:00
dc0f7ccf8b
[BugFix] Enhance test_pos_encoding to support execution on multi-devices ( #13187 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2025-02-16 08:59:49 +00:00
d3d547e057
[Bugfix] Pin xgrammar to 0.1.11 ( #13338 )
2025-02-15 19:42:25 -08:00
12913d17ba
[Quant] Add SupportsQuant to phi3 and clip ( #13104 )
2025-02-15 19:28:33 -08:00
80f63a3966
[V1][Spec Decode] Ngram Spec Decode ( #12193 )
...
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-02-15 18:05:11 -08:00
367cb8ce8c
[Doc] [2/N] Add Fuyu E2E example for multimodal processor ( #13331 )
2025-02-15 07:06:23 -08:00
54ed913f34
[ci/build] update flashinfer ( #13323 )
2025-02-15 05:33:13 -08:00
9206b3d7ec
[V1][PP] Run engine busy loop with batch queue ( #13064 )
2025-02-15 03:59:01 -08:00
ed0de3e4b8
[AMD] [Model] DeepSeek tunings ( #13199 )
2025-02-15 03:58:09 -08:00
2ad1bc7afe
[V1][Metrics] Add iteration_tokens_total histogram from V0 ( #13288 )
2025-02-15 03:56:19 -08:00
7fdaaf48ef
[Bugfix] Fix qwen2.5-vl image processor ( #13286 )
2025-02-15 03:00:11 -08:00
067fa2255b
[Bugfix] Fix search start_index of stop_checker ( #13280 )
2025-02-14 21:39:42 -08:00
9076325677
[BugFix] Don't scan entire cache dir when loading model ( #13302 )
2025-02-14 21:33:31 -08:00
97a3d6d995
[Bugfix] Massage MLA's usage of flash attn for ROCm ( #13310 )
2025-02-14 21:33:25 -08:00
579d7a63b2
[Bugfix][Docs] Fix offline Whisper ( #13274 )
2025-02-14 21:32:37 -08:00
c9f9d5b397
[Bugfix][AMD] Update torch_bindings so that scaled_fp4_quant isn't build on ROCm ( #13235 )
2025-02-14 20:30:42 -08:00
0c73026844
[V1][PP] Fix memory profiling in PP ( #13315 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-14 20:17:25 -08:00
6a854c7a2b
[V1][Sampler] Don't apply temp for greedy-only ( #13311 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-14 18:10:53 -08:00
e7eea5a520
[V1][CI] Fix failed v1-test because of min_p ( #13316 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-14 17:29:51 -08:00
a12934d3ec
[V1][Core] min_p sampling support ( #13191 )
...
Signed-off-by: Aoyu <aoyuzhan@amazon.com >
Co-authored-by: Aoyu <aoyuzhan@amazon.com >
2025-02-14 15:50:05 -08:00
3bcb8c75da
[Core] Reduce TTFT with concurrent partial prefills ( #10235 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-14 15:36:07 -08:00
5e5c8e091e
[Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts ( #13236 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-14 12:53:42 -08:00
c9e2d644e7
[Hardware][Gaudi][Bugfix] Fix error for guided decoding ( #12317 )
2025-02-14 04:36:49 -08:00
7734e9a291
[Core] choice-based structured output with xgrammar ( #12632 )
2025-02-14 04:36:05 -08:00
6224a9f620
Support logit_bias in v1 Sampler ( #13079 )
2025-02-14 04:34:59 -08:00
085b7b2d6c
[V1] Simplify GPUModelRunner._update_states check ( #13265 )
2025-02-14 04:33:43 -08:00
4da1f667e9
[VLM] Keep track of whether prompt replacements have been applied ( #13215 )
2025-02-14 04:20:46 -08:00
556ef7f714
[Misc] Log time consumption of sleep and wake-up ( #13115 )
...
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com >
2025-02-14 20:10:21 +08:00
83481ceb49
[Bugfix] Fix missing parentheses ( #13263 )
2025-02-14 01:07:10 -08:00
185cc19f92
[Frontend] Optionally remove memory buffer used for uploading to URLs in run_batch ( #12927 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2025-02-14 08:22:42 +00:00
45f90bcbba
[WIP] TPU V1 Support Refactored ( #13049 )
2025-02-14 00:21:53 -08:00
b0ccfc565a
[Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch ( #13126 )
2025-02-13 22:39:20 -08:00
ba59b78a9c
[ROCm][V1] Add initial ROCm support to V1 ( #12790 )
2025-02-13 22:21:50 -08:00
cbc40128eb
[V1] LoRA - Enable Serving Usecase ( #12883 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-14 14:21:12 +08:00
f0b2da72a8
Expand MLA to support most types of quantization ( #13181 )
2025-02-13 22:19:22 -08:00
f2b20fe491
Consolidate Llama model usage in tests ( #13094 )
2025-02-13 22:18:03 -08:00
40932d7a05
[Misc] Remove redundant statements in scheduler.py ( #13229 )
2025-02-13 22:07:25 -08:00
84683fa271
[Bugfix] Offline example of disaggregated prefill ( #13214 )
2025-02-13 20:20:47 -08:00
067678262a
[Bugfix][CI] Inherit codespell settings from pyproject.toml in the pre-commit-config ( #13237 )
2025-02-13 20:19:43 -08:00
09545c0a94
[Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on ( #13250 )
2025-02-13 20:19:25 -08:00
dd5ede4440
[V1] Consolidate MM cache size to vllm.envs ( #13239 )
2025-02-13 20:19:03 -08:00
8c32b08a86
[Kernel] Fix awq error when n is not divisible by 128 ( #13227 )
2025-02-13 20:07:05 -08:00
410886950a
[ROCm] Avoid using the default stream on ROCm ( #13238 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-02-14 09:29:26 +08:00
e38be640e6
Revert "Add label if pre-commit passes" ( #13242 )
2025-02-13 16:12:32 -08:00
c1e37bf71b
[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels ( #13198 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-14 00:01:14 +00:00
2344192a55
Optimize moe_align_block_size for deepseek_v3 ( #12850 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-13 18:43:37 -05:00
bffddd9a05
Add label if pre-commit passes ( #12527 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-13 20:51:30 +00:00
d84cef76eb
[Frontend] Add /v1/audio/transcriptions OpenAI API endpoint ( #12909 )
2025-02-13 07:23:45 -08:00
37dfa60037
[Bugfix] Missing Content Type returns 500 Internal Server Error ( #13193 )
2025-02-13 06:52:22 -08:00
1bc3b5e71b
[VLM] Separate text-only and vision variants of the same model architecture ( #13157 )
2025-02-13 06:19:15 -08:00
02ed8a1fbe
[Misc] Qwen2.5-VL Optimization ( #13155 )
2025-02-13 06:17:57 -08:00
2092a6fa7d
[V1][Core] Add worker_base for v1 worker ( #12816 )
...
Signed-off-by: Aoyu <aoyuzhan@amazon.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Aoyu <aoyuzhan@amazon.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-02-13 20:35:18 +08:00
c9d3ecf016
[VLM] Merged multi-modal processor for Molmo ( #12966 )
2025-02-13 04:34:00 -08:00
fdcf64d3c6
[V1] Clarify input processing and multimodal feature caching logic ( #13211 )
2025-02-13 03:43:24 -08:00
578087e56c
[Frontend] Pass pre-created socket to uvicorn ( #13113 )
2025-02-13 00:51:46 -08:00
fa253f1a70
[VLM] Remove input processor from clip and siglip ( #13165 )
2025-02-13 00:31:37 -08:00
9605c1256e
[V1][core] Implement pipeline parallel on Ray ( #12996 )
2025-02-13 08:02:46 +00:00
0ccd8769fb
[CI/Build] Allow ruff to auto-fix some issues ( #13180 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-13 07:45:38 +00:00
cb944d5818
Allow Unsloth Dynamic 4bit BnB quants to work ( #12974 )
2025-02-12 23:13:08 -08:00
d46d490c27
[Frontend] Move CLI code into vllm.cmd package ( #12971 )
2025-02-12 23:12:21 -08:00
04f50ad9d1
[Bugfix] deepseek_r1_reasoning_parser put reason content in wrong field in certain edge case ( #13097 )
2025-02-12 23:11:26 -08:00
60c68df6d1
[Build] Automatically use the wheel of the base commit with Python-only build ( #13178 )
2025-02-12 23:10:28 -08:00
009439caeb
Simplify logic of locating CUDART so file path ( #13203 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-13 13:52:41 +08:00
bc55d13070
[VLM] Implement merged multimodal processor for Mllama ( #11427 )
2025-02-12 20:26:21 -08:00
d88c8666a1
[Bugfix][Example] Fix GCed profiling server for TPU ( #12792 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-02-13 11:52:11 +08:00
4fc5c23bb6
[NVIDIA] Support nvfp4 quantization ( #12784 )
2025-02-12 19:51:51 -08:00
9f9704dca6
[perf-benchmark] cleanup unused Docker images and volumes in H100 benchmark instance ( #12706 )
2025-02-12 19:51:33 -08:00
8eafe5eaea
[CI/Build] Ignore ruff warning up007 ( #13182 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-13 11:48:31 +08:00
4c0d93f4b2
[V1][Bugfix] Copy encoder input ids to fix set iteration issue during VLM abort ( #13173 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
2025-02-12 12:58:11 -08:00
14b7899d10
[CI] Fix failing FP8 cpu offload test ( #13170 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-12 19:16:06 +00:00
09972e716c
[Bugfix] Allow fallback to AWQ from AWQMarlin at per-layer granularity ( #13119 )
2025-02-12 09:19:53 -08:00
36a08630e8
[CORE] [QUANT] Support for GPTQModel's dynamic quantization per module override/control ( #7086 )
2025-02-12 09:19:43 -08:00
2c2b560f48
[CI/Build] Use mypy matcher for pre-commit CI job ( #13162 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-12 17:12:22 +00:00
042c3419fa
Introduce VLLM_CUDART_SO_PATH to allow users to specify the .so path ( #12998 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-12 09:06:13 -08:00
82cabf53a3
[Misc] Delete unused LoRA modules ( #13151 )
2025-02-12 08:58:24 -08:00
314cfade02
[Frontend] Generate valid tool call IDs when using tokenizer-mode=mistral ( #12332 )
2025-02-12 08:29:56 -08:00
985b4a2b19
[Bugfix] Fix num video tokens calculation for Qwen2-VL ( #13148 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-12 11:55:23 +00:00
f4d97e4fc2
[Bug] [V1] Try fetching stop_reason from EngineOutput before checking the request ( #13108 )
2025-02-12 02:39:16 -08:00
f1042e86f0
[Misc] AMD Build Improvements ( #12923 )
2025-02-12 02:36:10 -08:00
7c4033acd4
Further reduce the HTTP calls to huggingface.co ( #13107 )
2025-02-12 02:34:09 -08:00
d59def4730
Bump actions/setup-python from 5.3.0 to 5.4.0 ( #12672 )
2025-02-12 16:41:22 +08:00
0c7d9effce
Bump helm/chart-testing-action from 2.6.1 to 2.7.0 ( #12463 )
2025-02-12 16:41:06 +08:00
dd3b4a01f8
Bump actions/stale from 9.0.0 to 9.1.0 ( #12462 )
2025-02-12 00:40:25 -08:00
a0597c6b75
Bump helm/kind-action from 1.10.0 to 1.12.0 ( #11612 )
2025-02-12 00:40:19 -08:00
e92694b6fe
[Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency ( #12921 )
...
Signed-off-by: Lingfan Yu <lingfany@amazon.com >
2025-02-11 21:12:37 -08:00
842b0fd402
[ci] Add more source file dependencies for some tests ( #13123 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-11 20:38:10 -08:00
974dfd4971
[Model] IBM/NASA Prithvi Geospatial model ( #12830 )
2025-02-11 20:34:30 -08:00
3ee696a63d
[RFC][vllm-API] Support tokenizer registry for customized tokenizer in vLLM ( #12518 )
...
Signed-off-by: Keyun Tong <tongkeyun@gmail.com >
2025-02-12 12:25:58 +08:00
72c2b68dc9
[Misc] Move pre-commit suggestion back to the end ( #13114 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-11 22:34:16 +00:00
14ecab5be2
[Bugfix] Guided decoding falls back to outlines when it fails to import xgrammar ( #12976 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-11 18:17:44 +00:00
deb6c1c6b4
[Doc] Improve OpenVINO installation doc ( #13102 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-11 18:02:46 +00:00
565c1efa65
[CI/Build][Bugfix] Fix CPU backend default threads num ( #13077 )
2025-02-11 16:55:56 +00:00
2b25b7d2e1
Fix initializing GGUF weights for ColumnParallelLinear when using tensor parallel > 1 ( #13023 )
2025-02-11 08:38:48 -08:00
6c4dbe23eb
[BugFix] Pop instead of del CUDA_VISIBLE_DEVICES ( #12962 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2025-02-12 00:21:50 +08:00
21f5d50fa5
[Bugfix] Do not use resource module on Windows ( #12858 ) ( #13029 )
2025-02-11 08:21:18 -08:00
bf3e05215c
[Misc] Fix typo at comments at metrics.py ( #13024 )
2025-02-11 08:20:37 -08:00
ad9776353e
Set torch_dtype in TransformersModel ( #13088 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-11 23:51:19 +08:00
75e6e14516
[V1][Metrics] Add several request timing histograms ( #12644 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-02-11 10:14:00 -05:00
110f59a33e
[Bugfix] fix flaky test ( #13089 )
...
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
2025-02-11 14:41:20 +00:00
2e3b969ec0
[Platform] add pre_register_and_update function ( #12432 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-02-11 22:06:46 +08:00
da317197dd
[Build] Fix cuda link target of cumem_allocator in CPU env ( #12863 )
...
Signed-off-by: YuhongGuo <yuhong.gyh@antgroup.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-11 21:55:57 +08:00
7539bbc6a6
[ROCm] Using a more precise memory profiling ( #12624 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-02-11 21:47:10 +08:00
9cf4759493
[executor] init local_rank as device index ( #13027 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-02-11 21:20:53 +08:00
41c5dd45b9
[V1][Metrics] Add GPU prefix cache hit rate % gauge ( #12592 )
2025-02-11 08:27:25 +00:00
fc6485d277
[Bugfix]: Reasoning output bug according to the chat template change ( #13025 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
2025-02-11 15:49:03 +08:00
78a141d768
[Misc] LoRA - Refactor Punica ops tests ( #12970 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-11 07:26:03 +00:00
c320ca8edd
[Core] Don't do platform detection at import time ( #12933 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-11 07:25:25 +00:00
58047c6f04
[Benchmark] Add BurstGPT to benchmark_serving ( #13063 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-02-10 21:25:30 -08:00
cb080f32e3
[Bugfix] Support missing tool parameters in mistral tokenizer ( #12884 )
...
Signed-off-by: Florian Greinacher <florian.greinacher@siemens.com >
2025-02-11 03:33:33 +00:00
2c0f58203c
[Docs] Announce Meta Meetup ( #13065 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-02-10 18:24:29 -08:00
2ff4857678
[V1][Minor] Move scheduler outputs to a separate file ( #13062 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-11 02:10:06 +00:00
91e876750e
[misc] Fix setup.py condition to avoid AMD from being mistaken with CPU ( #13022 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2025-02-10 18:06:16 -08:00
08b2d845d6
[Model] Ultravox Model: Support v0.5 Release ( #12912 )
...
Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai >
2025-02-10 22:02:48 +00:00
2ae889052c
Fix seed parameter behavior in vLLM ( #13007 )
...
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
2025-02-10 23:26:50 +08:00
51f0b5f7f6
[Bugfix] Clean up and fix multi-modal processors ( #13012 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-10 10:45:21 +00:00
fde71262e0
[misc] Add retries with exponential backoff for HF file existence check ( #13008 )
2025-02-10 01:15:02 -08:00
243137143c
[Doc] Add link to tool_choice tracking issue in tool_calling.md ( #13003 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-10 06:09:33 +00:00
b2496bb07f
[core] fix sleep mode and pytorch checkpoint compatibility ( #13001 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-10 13:03:43 +08:00
44607e07d3
Check if selected backend is None in get_attn_backend_cls() ( #12975 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-10 11:45:07 +08:00
67c4637ccf
[V1] Use msgpack for core request serialization ( #12918 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-10 11:35:56 +08:00
aa0ca5ebb7
[core][rlhf] add colocate example for RLHF ( #12984 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-10 10:28:59 +08:00
59fff4a01a
[core] improve error handling when wake up from sleep mode ( #12981 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-10 09:38:57 +08:00
29f1d47e73
[MISC] Always import version library first in the vllm package ( #12979 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-09 18:56:40 +08:00
cf797aa856
[core] port pynvml into vllm codebase ( #12963 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-09 15:00:00 +08:00
24700c346b
[V1] Cache uses_mrope in GPUModelRunner ( #12969 )
2025-02-08 15:32:32 -08:00
d366ccc4e3
[RFC] [Mistral] FP8 format ( #10130 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-02-08 14:12:53 -07:00
870c37481e
[V1][Minor] Remove outdated comment ( #12968 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-08 12:48:30 -08:00
86222a3dab
[VLM] Merged multi-modal processor for GLM4V ( #12449 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-02-08 20:32:16 +00:00
fe743b798d
[bugfix] fix early import of flash attention ( #12959 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-09 00:06:56 +08:00
913df14da3
[Bugfix] Remove unused seq_group_metadata_list from ModelInputForGPU ( #12935 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-02-08 14:46:19 +00:00
8a69e0e20e
[CI/Build] Auto-fix Markdown files ( #12941 )
2025-02-08 04:25:15 -08:00
4c8dd12ef3
[Misc] Add qwen2.5-vl BNB support ( #12944 )
2025-02-08 04:24:47 -08:00
256a2d29dc
[Doc] Correct HF repository for TeleChat2 models ( #12949 )
2025-02-08 01:42:15 -08:00
c45d398e6f
[CI] Resolve transformers-neuronx version conflict ( #12925 )
2025-02-08 01:41:35 -08:00
011e612d92
[Misc] Log time consumption on weight downloading ( #12926 )
2025-02-08 09:16:42 +00:00
7e1837676a
[misc] Add LoRA to benchmark_serving ( #12898 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-08 17:15:44 +08:00
2880e21e3d
[Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel Gaudi ( #12812 )
...
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai >
2025-02-08 17:15:30 +08:00
407b5537db
[Build] Make pypi install work on CPU platform ( #12874 )
2025-02-08 01:15:15 -08:00
4ea48fb35c
[V1][Minor] Move cascade attn logic outside _prepare_inputs ( #12943 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-08 00:39:09 -08:00
e31498bdcb
[Misc] Add offline test for disaggregated prefill ( #12418 )
2025-02-08 08:38:20 +00:00
91dd8f7aa6
[bugfix] respect distributed_executor_backend in world_size=1 ( #12934 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-08 16:17:08 +08:00
d01f66b039
[Bugfix] Fix multi-round chat error when mistral tokenizer is used ( #12859 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-02-08 07:04:34 +00:00
cc01223f3b
[Misc] Fix typo in the example file ( #12896 )
...
Signed-off-by: Zhao Ke <yingxiongraomingzk@gmail.com >
2025-02-08 06:56:43 +00:00
306923da82
[Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mapping ( #12905 )
2025-02-07 21:02:53 -08:00
3243158336
[V1] Move KV block hashes from Request to KVCacheManager ( #12922 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-07 19:14:10 -08:00
b21f0f9d17
[V1][Minor] Remove outdated comment ( #12928 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-07 19:07:37 -08:00
45cbc4991d
[Bugfix] Fix disagg hang caused by the prefill and decode communication issues ( #12723 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-07 16:39:50 -08:00
932c6b7461
[V1] LM Eval With Streaming Integration Tests ( #11590 )
2025-02-07 15:07:03 -08:00
eaa92d4437
[ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing ( #12501 )
2025-02-07 08:13:43 -08:00
0630d4537a
[V1] Logprobs and prompt logprobs support ( #9880 )
...
This PR is adding support for sample logprobs & prompt logprobs to vLLM v1.
New behavior:
- During model execution, model runner computes sample logprobs (if user-provided logprobs setting is not None) and prompt logprobs (if user-provided prompt_logprobs setting is not None). For both sample and prompt logprobs, the engine core returns 3 vectors: token ids, token logprob values, token ranks. Ranks reflect tokens' 1-indexed positions in the vocabulary vector after sorting the vocabulary by log probability in descending order.
- In scheduler.update_from_output(), sample and prompt logprobs are incorporated into the EngineCoreOutput data structure which is transferred to the engine client. If multiprocessing is enabled, then sample and prompt logprobs will be (de)serialized when the EngineCoreOutput data structure is (de)serialized.
- During output processing, the LogprobsProcessor transforms the triplet of token ids, token logprob values, and token ranks into the OpenAI-compatible List[Dict[token id, Logprob]] format (for sample and prompt logprobs, respectively).
- Each Logprob instance (whether sample- or prompt-) consists of a token's log-probability, rank, and detokenized string representation. Note that logprob detokenization is handled by the LogprobsProcessor not the detokenizer.
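The triplet-to-dict transformation described above can be sketched as follows (a minimal illustration, not the actual LogprobsProcessor code; `build_logprob_dicts` is a hypothetical helper and `detokenize` stands in for the real detokenization step):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Logprob:
    logprob: float          # token's log-probability
    rank: int               # 1-indexed rank in the descending-sorted vocabulary
    decoded_token: str      # detokenized string representation

def build_logprob_dicts(
    token_ids: List[List[int]],
    logprob_values: List[List[float]],
    ranks: List[List[int]],
    detokenize: Callable[[int], str],
) -> List[Dict[int, Logprob]]:
    """For each sequence position, map candidate token id -> Logprob entry."""
    return [
        {tid: Logprob(lp, rk, detokenize(tid))
         for tid, lp, rk in zip(ids, lps, rks)}
        for ids, lps, rks in zip(token_ids, logprob_values, ranks)
    ]
```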
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-02-07 07:26:20 -08:00
538fab93cd
PR #12718 ( #12718 )
2025-02-07 06:22:37 -08:00
ce26b16268
[Misc] Remove unnecessary detokenization in multimodal processing ( #12868 )
2025-02-07 06:21:17 -08:00
1918aa1b80
[MISC][EASY] Break check file names into entry and args in the pre-commit hooks ( #12880 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-07 13:04:39 +00:00
6e1fc61f0f
Prevent unnecessary requests to huggingface hub ( #12837 )
2025-02-06 21:37:41 -08:00
aa375dca9f
[Bugfix] Missing quant_config in deepseek embedding layer ( #12836 )
2025-02-06 21:35:09 -08:00
433c4a4923
Make vllm compatible with verl ( #12824 )
...
Co-authored-by: zhangshulai <zhangshulai@bytedance.com >
2025-02-07 11:54:20 +08:00
ef533d25fb
[Bugfix] FA2 illegal memory access ( #12848 )
2025-02-06 19:54:07 -08:00
b260782357
[misc] Revert #12833 ( #12857 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-06 16:29:12 -08:00
741429a4cd
[MISC] Check space in the file names in the pre commit checks ( #12804 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-06 15:36:21 -08:00
aff404571b
Add Bamba Model ( #10909 )
...
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-06 15:22:42 -08:00
467a96a541
[V1] LoRA Support ( #10957 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-06 09:32:51 -08:00
8108ac841d
[Bugfix] Fix unsupported FA version check for Turing GPU ( #12828 )
2025-02-06 09:18:22 -08:00
afe74f7a96
[Doc] double quote cmake package in build.inc.md ( #12840 )
2025-02-06 09:17:55 -08:00
09b95e36ab
[torch.compile] PyTorch 2.6 and nightly compatibility ( #12393 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-07 01:09:07 +08:00
85ac82d228
[Kernel] Make rotary_embedding ops more flexible with input shape ( #12777 )
2025-02-06 08:46:13 -08:00
1e57b1ee63
[Misc] Remove unnecessary decode call ( #12833 )
2025-02-06 08:45:44 -08:00
e152f29502
[misc] Reduce number of config file requests to HuggingFace ( #12797 )
...
Signed-off-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-06 14:59:18 +00:00
c786e757fa
[Attention] Use FA3 for MLA on Hopper ( #12807 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-02-06 11:43:12 +00:00
cefd56ee35
[Docs] Add Google Cloud Slides ( #12814 )
2025-02-06 01:02:38 -08:00
7ca9934fe7
[Misc] Update w2 scale loading for GPTQMarlinMoE ( #12757 )
2025-02-06 01:02:14 -08:00
0408efc6d0
[Misc] Improve error message for incorrect pynvml ( #12809 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-06 15:23:50 +08:00
449d1bce02
[Misc] Remove duplicated DeepSeek V2/V3 model definition ( #12793 )
2025-02-05 23:16:20 -08:00
1a6fcad4c9
Improve TransformersModel UX ( #12785 )
2025-02-05 22:24:57 -08:00
56534cd577
[Bugfix] Fix the test_ultravox.py's license ( #12806 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-06 13:25:54 +08:00
d88506dda4
[Model] LoRA Support for Ultravox model ( #11253 )
2025-02-05 19:54:13 -08:00
9cdea30b4f
[Misc][Easy] Remove the space from the file name
2025-02-05 19:23:35 -08:00
76abd0c881
[Bugfix] Better FP8 supported defaults
2025-02-05 19:22:19 -08:00
5b19b93082
[ROCm][Kernel] Using the correct warp_size value
2025-02-05 19:15:08 -08:00
75404d041b
[VLM] Update compatibility with transformers 4.49
2025-02-05 19:09:45 -08:00
bf3b79efb8
[VLM] Qwen2.5-VL
2025-02-05 13:31:38 -08:00
9a5b1554b4
[Docs] Drop duplicate [source] links
2025-02-05 13:30:50 -08:00
a4ce74c14a
[VLM] Use shared field to pass token ids to model
2025-02-05 13:30:46 -08:00
3b2005e1db
Add: Support for Sparse24Bitmask Compressed Models
2025-02-05 13:30:43 -08:00
af8486de49
[Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU)
2025-02-05 13:29:45 -08:00
4c3aac51e1
Merging PR #12536
...
Merged via CLI script
2025-02-05 13:24:26 -08:00
bc1bdecebf
[core][distributed] exact ray placement control ( #12732 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-06 02:03:19 +08:00
022bcc701a
[Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 ( #12546 )
2025-02-04 23:11:02 -08:00
c53dc466b1
[Doc] Remove performance warning for auto_awq.md ( #12743 )
2025-02-04 22:43:11 -08:00
3d09e592a8
[V1][Misc] Shorten FinishReason enum and use constant strings ( #12760 )
2025-02-04 22:43:02 -08:00
fcf2e3d7fc
[Bugfix] Fix OpenVINO model runner ( #12750 )
2025-02-04 22:42:46 -08:00
58b218d7ae
[Doc] Update PR Reminder with link to Developer Slack ( #12748 )
2025-02-04 22:42:09 -08:00
7ff7a638b6
[Model][Quant] Fix GLM, Fix fused module mappings for quantization ( #12634 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-02-05 05:32:06 +00:00
686006a220
[Misc] Bump the compressed-tensors version ( #12736 )
2025-02-04 20:44:48 -08:00
98fd089fc9
[VLM] Add MLA with pure RoPE support for deepseek-vl2 models ( #12729 )
2025-02-04 20:44:26 -08:00
249824c3bf
Refactor Linear handling in TransformersModel ( #12727 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-05 04:31:12 +00:00
64862d106e
[ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling ( #12713 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2025-02-05 03:58:22 +00:00
b3a0d01e45
[Core] add and implement VLLM_LOGITS_PROCESSOR_THREADS ( #12368 )
...
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com >
2025-02-04 18:46:26 -08:00
75e94309e8
[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) ( #12676 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-02-04 18:22:24 -08:00
233df6f5c4
[V1][Metrics] Add request_success_total counter, labelled with finish reason ( #12579 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-02-04 19:46:54 -05:00
18016a5e62
[Bugfix] Fix CI failures for InternVL and Mantis models ( #12728 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-04 23:54:23 +08:00
649550f27e
[Build] update requirements of no-device for plugin usage ( #12630 )
...
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com >
2025-02-04 21:19:12 +08:00
62467a834a
Avoid unnecessary multi-modal input data copy when len(batch) == 1 ( #12722 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2025-02-04 21:03:19 +08:00
6469038b14
[Bugfix] Fix loading of fine-tuned models based on Phi-3-Small ( #12689 )
...
Signed-off-by: Michael Greenbaum <mgreenbaum@microsoft.com >
Co-authored-by: Michael Greenbaum <mgreenbaum@microsoft.com >
2025-02-04 20:58:48 +08:00
815079de8e
[VLM] merged multimodal processor and V1 support for idefics3 ( #12660 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-02-04 20:00:51 +08:00
18a88fcccc
[V1] Remove scheduling constraint on partial requests ( #12674 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-04 02:43:58 -08:00
d1ca7df84d
[VLM] Merged multi-modal processor for InternVL-based models ( #12553 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-02-04 16:44:52 +08:00
96b23621c1
[Misc] Add BNB quantization for Whisper ( #12381 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-02-04 16:27:36 +08:00
c36ac98d01
[AMD][ROCm] Enable DeepSeek model on ROCm ( #12662 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
2025-02-04 08:24:11 +00:00
4896d0c2dd
[Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs ( #12711 )
2025-02-03 23:27:11 -08:00
bb392af434
[Doc] Replace ibm-fms with ibm-ai-platform ( #12709 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-02-04 07:05:04 +00:00
5d98d56089
Support Pixtral-Large HF by using llava multimodal_projector_bias config ( #12710 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-02-04 11:55:46 +08:00
73b35cca7f
[Core] Improve hash collision avoidance in prefix caching ( #12621 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-03 16:28:20 -08:00
5095e96606
[V1] Revert uncache_blocks and support recaching full blocks ( #12415 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-03 15:04:53 -08:00
cf58b9c4ca
[MISC] Remove model input dumping when exception ( #12582 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-03 13:34:16 -08:00
4797dad3ec
[Model] Add Deepseek V3 fp8_w8a8 configs for B200 ( #12707 )
2025-02-03 13:30:39 -08:00
6dd5e52823
Squelch MLA warning for Compressed-Tensors Models ( #12704 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-02-03 13:29:56 -08:00
c11de33dad
[Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm ( #12696 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-03 13:04:59 -08:00
33e0602e59
[Misc] Fix improper placement of SPDX header in scripts ( #12694 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-03 11:16:59 -08:00
a1a2aaadb9
[Model]: Add transformers backend support ( #11330 )
...
# Adds support for `transformers` as a backend
Following https://github.com/huggingface/transformers/pull/35235 , a
bunch of models should already be supported, and we are ramping up
support for more models.
Thanks @Isotr0py for the TP support, and @hmellor for his help as well!
This includes:
- `trust_remote_code=True` support: any model on the Hub that
implements attention the correct way can be natively supported!
- tensor parallel support
---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-02-03 21:30:38 +08:00
1298a400e8
[ci/build] fix gh200 test ( #12681 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 15:59:49 +08:00
ad4a9dc817
[cuda] manually import the correct pynvml module ( #12679 )
...
fixes problems like https://github.com/vllm-project/vllm/pull/12635 and
https://github.com/vllm-project/vllm/pull/12636 and
https://github.com/vllm-project/vllm/pull/12565
---------
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 15:58:21 +08:00
b9986454fe
Fix for attention layers to remain unquantized during moe_wn16 quant ( #12570 )
...
Fix for AWQ quant loading of the new R1 model.
The new optimized MoE kernel for large numbers of experts, `moe_wn16`,
uses AWQ quantization, which requires the attention layers to be in
16-bit precision. The current merge broke this; `get_quant_method` must
return None for attention layers for it to work correctly again.
---------
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Beim <beim2015@outlook.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com >
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Ryan N <ryan.nguyen@centml.ai >
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Shawn Du <shawnd200@outlook.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Beim <805908499@qq.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com >
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com >
Co-authored-by: fade_away <1028552010@qq.com >
Co-authored-by: weilong.yu <weilong.yu@shopee.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com >
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: Shawn Du <shawnd200@outlook.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-02-03 13:46:19 +08:00
c5932e5dac
Properly check if all fused layers are in the list of targets ( #12666 )
...
Thanks @kylesayrs for catching this!
2025-02-03 13:42:18 +08:00
20579c0fae
make sure mistral_common not imported for non-mistral models ( #12669 )
...
When people use DeepSeek models, they find that they need to resolve a
cv2 version conflict, see https://zhuanlan.zhihu.com/p/21064432691 .
I added the check and made all imports of `cv2` lazy.
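The lazy-import pattern referred to here can be sketched generically (an illustration, not the actual vLLM helper; `lazy_import` and its proxy class are hypothetical names):

```python
import importlib

def lazy_import(name: str):
    """Return a proxy object that defers the real import until first use."""
    class _LazyModule:
        _module = None

        def __getattr__(self, attr):
            # Import happens only on the first attribute access.
            if _LazyModule._module is None:
                _LazyModule._module = importlib.import_module(name)
            return getattr(_LazyModule._module, attr)

    return _LazyModule()

# e.g. `cv2 = lazy_import("cv2")`: cv2 is only imported if something
# actually touches it, so text-only models never hit the version conflict.
```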
---------
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 13:40:25 +08:00
95460fc513
[Kernel] port sgl moe_align_block_size kernels ( #12574 )
...
sgl_moe_align_block_size is based on:
ded9fcd09a
moe_align_block_size is based on:
ba5112ff69
Signed-off-by: Yang Chen <yangche@fb.com >
2025-02-03 13:09:50 +08:00
326fcc8b9f
[Doc] Deprecate Discord ( #12668 )
2025-02-02 19:19:56 -08:00
e64330910b
[doc][misc] clarify VLLM_HOST_IP for multi-node inference ( #12667 )
...
As more and more people are trying DeepSeek models with multi-node
inference, https://github.com/vllm-project/vllm/issues/7815 becomes more
frequent. Let's give a clear message to users.
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 09:32:18 +08:00
e489ad7a21
[Misc] Add SPDX-License-Identifier headers to python source files ( #12628 )
...
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**
commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:18:24 2025 -0500
Add SPDX license headers to python source files
This commit adds SPDX license headers to python source files as
recommended to the project by the Linux Foundation. These headers
provide a concise way that is both human and machine readable for
communicating license information for each source file. It helps avoid
any ambiguity about the license of the code and can also be easily used
by tools to help manage license compliance.
The Linux Foundation runs license scans against the codebase to help
ensure we are in compliance with the licenses of the code we use,
including dependencies. Having these headers in place helps that tool
do its job.
More information can be found on the SPDX site:
- https://spdx.dev/learn/handling-license-info/
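A header in this style, plus a minimal check in the spirit of the pre-commit hook (the `has_spdx_header` helper is illustrative, not the repository's actual hook code):

```python
# SPDX-License-Identifier: Apache-2.0

def has_spdx_header(source: str) -> bool:
    """Check that an SPDX tag appears near the top of a Python file."""
    head = source.splitlines()[:5]  # allow a shebang/encoding line first
    return any(line.strip().startswith("# SPDX-License-Identifier:")
               for line in head)
```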
Signed-off-by: Russell Bryant <rbryant@redhat.com >
commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:36:32 2025 -0500
Check for SPDX headers using pre-commit
Signed-off-by: Russell Bryant <rbryant@redhat.com >
---------
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-02 11:58:18 -08:00
f256ebe4df
[Hardware][Intel GPU] add XPU bf16 support ( #12392 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-02-02 10:17:26 +00:00
f8ece6e17f
[Core][v1] Unify allocating slots in prefill and decode in KV cache manager ( #12608 )
...
As mentioned in RFC https://github.com/vllm-project/vllm/issues/12254 ,
this PR achieves the task: combine allocate_slots and append_slots.
There should be no functionality change, except that decode now also
raises an exception when num_tokens is zero (like prefill), and the
unit test case is changed accordingly.
@comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo
---------
Signed-off-by: Shawn Du <shawnd200@outlook.com >
2025-02-02 16:40:58 +08:00
abfcdcdf27
[V1][Minor] Avoid frequently creating ConstantList ( #12653 )
...
A small optimization to avoid creating a new `ConstantList` every time `request.kv_block_hashes` is used.
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-01 23:43:20 -08:00
e497f33491
[Core] Silence unnecessary deprecation warnings ( #12620 )
...
I noticed during testing that I was getting a lot of these deprecation
warnings about `local_lora_path`:
```
DeprecationWarning: The 'lora_local_path' attribute is deprecated
and will be removed in a future version.
Please use 'lora_path' instead.
```
The check used for emitting this warning was always True, even when the
parameter was not actually specified. It will always be in
`__struct_fields__`. We should be checking for a non-None value,
instead.
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-02 15:35:50 +08:00
baaa2b24da
[Bugfix] fix moe_wna16 get_quant_method ( #12648 )
...
Fix https://github.com/vllm-project/vllm/issues/12647
The `get_quant_method` of `moe_wna16` always returns the MoE method,
the GPTQ-based linear method, or the AWQ-based linear method, even when
the target module is an attention layer.
baeded2569/vllm/attention/layer.py (L86-L92)
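The intended behavior can be sketched like this (class and function names are illustrative, not the actual vLLM API):

```python
class MoEWNA16Method:
    """Stand-in for the MoE quant method."""

class LinearWNA16Method:
    """Stand-in for the GPTQ/AWQ-based linear quant method."""

def get_quant_method(layer_type: str):
    """Return a quant method only for MoE and linear layers; attention
    layers must get None so they stay in 16-bit precision."""
    if layer_type == "moe":
        return MoEWNA16Method()
    if layer_type == "linear":
        return LinearWNA16Method()
    return None  # e.g. layer_type == "attention": unquantized path
```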
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-02-02 15:29:56 +08:00
b4e5c03306
doc: fixing minor typo in readme.md ( #12643 )
...
Word "evolved" was mistyped
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
---------
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
2025-02-01 17:17:29 +00:00
3194039c0e
Apply torch.compile to fused_moe/grouped_topk ( #12637 )
2025-02-01 16:16:19 +00:00
4f4d427ac2
Disable chunked prefill and/or prefix caching when MLA is enabled ( #12642 )
...
From @mgoin in https://github.com/vllm-project/vllm/pull/12638
I cannot push to that branch, so this is a new PR to unblock the release.
---------
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-31 23:46:57 -08:00
1e3698393f
[CI/Build] Add label automation for structured-output, speculative-decoding, v1 ( #12280 )
...
We have `v1`, `structured-output`, and `speculative-decoding` labels on
github. This adds automation for applying these labels based on the
files touched by a PR.
Signed-off-by: Russell Bryant <rbryant@redhat.com >
---------
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-31 23:13:10 -08:00
baeded2569
[Attention] Deepseek v3 MLA support with FP8 compute ( #12601 )
...
This PR implements Deepseek V3 support by performing matrix absorption on the fp8 weights.
---------
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
2025-01-31 21:52:51 -08:00
3e1c76cf3a
Fix: Respect sparsity_config.ignore in Cutlass Integration ( #12517 )
...
This PR addresses a bug in the Cutlass integration where the
`sparsity_config.ignore` list was not being respected. When only a
subset of modules were configured as Sparse24, the system incorrectly
selected Cutlass for non-sparse modules as well. This update ensures the
correct scheme is selected for non-sparse modules, fixing this behavior.
---
### Changes
- Updated logic to correctly respect `sparsity_config.ignore`.
- Ensured non-sparse modules use the appropriate scheme instead of
defaulting to Cutlass.
---
<details>
<summary>Testing Setup</summary>
The fix has been tested on top of [this
diff](https://github.com/vllm-project/vllm/pull/12097 ).
#### Steps to Test:
```bash
git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support
git revert --no-edit aa2cd2c # revert Tyler's commit to turn off Cutlass for W16A16
git cherry-pick ca624cddb # this branch
```
#### Additional Patch Required:
```diff
diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
index a54177c1c..f916dd0c9 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs,
QuantizationStrategy,
QuantizationType)
from pydantic import BaseModel
-
+from vllm.logger import init_logger
from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
UnquantizedLinearMethod)
@@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
should_ignore_layer)
from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
from vllm.platforms import current_platform
-
+logger = init_logger(__name__)
__all__ = ["CompressedTensorsLinearMethod"]
SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config"
```
Apply using:
```bash
git apply logging-patch.patch
```
</details>
---
<details>
<summary>Models Tested</summary>
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed`
</details>
---
<details>
<summary>Example Output</summary>
#### Layers 0-5 (Sparse24)
```
Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj
Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj
...
```
#### Layers 6+ (Non-Sparse, FP8)
```
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj
...
```
</details>
**Note:** This assumes all modules in fused layers such as `QKV_proj`
and `Gate_up_proj` follow the same quantization/pruning scheme.
---
For related tasks using the Asana app for GitHub, refer to [this
link](https://app.asana.com/0/0/1209227810815160 ).
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com >
2025-02-01 13:41:59 +08:00
cfa134d247
[Bugfix/CI] Fixup benchmark_moe.py ( #12562 )
...
Fixes `is_marlin` not being passed into `get_default_config`
Also allow `--tensor-parallel-size` in addition to `-tp` and `--tp-size`
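The flag aliasing can be expressed with a single `argparse` argument (a minimal sketch; this parser setup is illustrative, not the benchmark script's actual code):

```python
import argparse

# Sketch: one argument, three accepted spellings, one destination.
parser = argparse.ArgumentParser(description="benchmark_moe-style flags")
parser.add_argument("-tp", "--tp-size", "--tensor-parallel-size",
                    dest="tp_size", type=int, default=1,
                    help="Tensor parallel size")

args = parser.parse_args(["--tensor-parallel-size", "4"])
```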
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-01 13:41:35 +08:00
35b7a05507
[ci] Upgrade transformers to 4.48.2 in CI dependencies ( #12599 )
2025-01-31 21:22:23 -08:00
1867c258bd
Fix target matching for fused layers with compressed-tensors ( #12617 )
...
Without this PR
---------------
Quantizing models with llm-compressor and a recipe that explicitly lists
names of layers produces a model that is not loadable by vLLM (i.e.
`vllm serve <model>` fails with `raise ValueError(f"Unable to find
matching target for {module} in the ...`).
Example recipe:
```
recipe = """
quantization_stage:
run_type: oneshot
quantization_modifiers:
GPTQModifier:
ignore: ["lm_head"]
config_groups:
group_0:
weights:
num_bits: 4
type: "int"
symmetric: true
strategy: "group"
group_size: 128
targets: [
"model.layers.0.mlp.down_proj",
"model.layers.2.mlp.down_proj",
"model.layers.3.mlp.down_proj",
"model.layers.4.mlp.down_proj",
"model.layers.5.mlp.down_proj",
"model.layers.6.mlp.down_proj",
"model.layers.7.mlp.down_proj",
"model.layers.8.mlp.down_proj",
"model.layers.9.mlp.down_proj",
"model.layers.10.mlp.down_proj",
"model.layers.11.mlp.down_proj",
"model.layers.12.mlp.down_proj",
"model.layers.13.mlp.down_proj",
"model.layers.14.mlp.down_proj",
"model.layers.15.mlp.down_proj",
"model.layers.16.mlp.down_proj",
"model.layers.17.mlp.down_proj",
"model.layers.19.mlp.down_proj",
"model.layers.21.mlp.down_proj",
"model.layers.22.mlp.down_proj",
.
.
.
]
"""
```
To reproduce the vLLM error:
```bash
vllm serve nm-testing/eldar-test
```
With this PR
------------
Models are loaded correctly without any errors.
2025-02-01 05:07:46 +00:00
cb3e73e4c8
[BugFix] fix wrong output when using lora and num_scheduler_steps=8 ( #11161 )
...
FIX issue https://github.com/vllm-project/vllm/issues/9688
https://github.com/vllm-project/vllm/issues/11086 #12487
---------
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: weilong.yu <weilong.yu@shopee.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-02-01 12:52:07 +08:00
b1340f9d55
[V1] Bugfix: Validate Model Input Length ( #12600 )
...
SUMMARY:
* avoid crashing the engine when we get an input longer than
max_model_len
FIX #12567
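A minimal sketch of this kind of up-front validation (illustrative only; `validate_prompt` is a hypothetical helper, not the engine's actual entry point):

```python
def validate_prompt(prompt_token_ids: list, max_model_len: int) -> None:
    """Reject an over-long input up front instead of crashing the engine."""
    if len(prompt_token_ids) > max_model_len:
        raise ValueError(
            f"Prompt length {len(prompt_token_ids)} exceeds "
            f"max_model_len of {max_model_len}.")
```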
2025-01-31 18:32:04 -08:00
44bbca78d7
[Doc] int4 w4a16 example ( #12585 )
...
Based on a request by @mgoin , with @kylesayrs we have added an example
doc for int4 w4a16 quantization, following the pre-existing int8 w8a8
quantization example and the example available in
[`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py )
FIX #n/a (no issue created)
@kylesayrs and I have discussed a couple additional improvements for the
quantization docs. We will revisit at a later date, possibly including:
- A section for "choosing the correct quantization scheme/ compression
technique"
- Additional vision or audio calibration datasets
---------
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-31 15:38:48 -08:00
60808bd4c7
[Doc] Improve installation signposting ( #12575 )
...
- Make device tab names more explicit
- Add comprehensive list of devices to
https://docs.vllm.ai/en/latest/getting_started/installation/index.html
- Add `attention` blocks to the intro of all devices that don't have
pre-built wheels/images
---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-31 15:38:35 -08:00
fc542144c4
[Feature] Fix guided decoding blocking bitmask memcpy ( #12563 )
...
**[Guided decoding performance optimization]** Sending the guided
decoding bitmask in xgrammar to the GPU
(`self.token_bitmask.to(scores.device)`) is a blocking operation that
prevents the CPU from pre-launching the sampler kernels. The CPU waits
until decode is complete, then copies the bitmask over. This PR changes
the operation to async via setting `non_blocking=True`.
(Current) The CPU is blocked on a `cudaStreamSynchronize` and only
pre-empts the sampling kernels after bitmask application. Below is the
Nsys profile for one decode phase from Llama 3.1 8B.

With the optimization, this is no longer the case:

---------
Signed-off-by: Ryan N <ryan.nguyen@centml.ai >
2025-01-31 15:37:30 -08:00
eb5741ad42
[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 ( #12587 )
...
Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear
layers.
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-31 15:29:11 -08:00
145c2ff648
[Bugfix] Revert MoE Triton Config Default ( #12629 )
...
SUMMARY:
* previous PR for pulling in block configs also changed defaults
(https://github.com/vllm-project/vllm/pull/11589/files ) for FP8
* this broke L4 MoE since there was not enough SHM for the default
configuration
* this reverts the non-block example to the default
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-01-31 15:28:47 -08:00
415f19474d
[release] Add input step to ask for Release version ( #12631 )
...
Instead of having to create a new build with the release version passed
in as an env var.
2025-01-31 13:39:36 -08:00
89003c4082
[v1][Bugfix] Add extra_keys to block_hash for prefix caching ( #12603 )
...
This PR adds extra keys to the block hash, so that two blocks with the
same token IDs but different `extra_keys` in their parent blocks get
different hash values. For example, it produces different hashes for the
second block of the following two requests:
```python
request1 = make_request(
    request_id=0,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash1", "hash2"],
)
request2 = make_request(
    request_id=1,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash3", "hash2"],
)
```
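A minimal sketch of the idea (hypothetical helper names, not vLLM's actual implementation): each block hash chains the parent's hash, so differing extra keys in an ancestor block propagate to all descendant hashes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    token_ids: tuple
    extra_keys: tuple = ()  # e.g. multi-modal content hashes covering this block

def block_hash(parent_hash, block):
    # Chaining the parent hash makes a block's identity depend on its full
    # prefix, including extra keys carried by ancestor blocks.
    return hash((parent_hash, block.token_ids, block.extra_keys))

def chain(blocks):
    h, hashes = 0, []
    for b in blocks:
        h = block_hash(h, b)
        hashes.append(h)
    return hashes

# Second block of each request: same token IDs and same mm hash ("hash2"),
# but the parent blocks carry different mm hashes ("hash1" vs "hash3").
req1 = [Block((0, 1, 2), ("hash1",)), Block((3, 4, 5), ("hash2",))]
req2 = [Block((0, 1, 2), ("hash3",)), Block((3, 4, 5), ("hash2",))]

h1, h2 = chain(req1), chain(req2)
assert h1[1] != h2[1]  # the second blocks no longer collide in the prefix cache
```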
---------
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-31 13:13:04 -08:00
60bcef000e
[Docs][V1] Prefix caching design ( #12598 )
...
- Create v1 design document section in docs.
- Add prefix caching design doc.
@WoosukKwon @ywang96
---------
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-01-31 12:30:46 -08:00
847f883232
[Git] Automatically sign-off commits ( #12595 )
...
It's very annoying to forget `-s` in `git commit`, because fixing the
DCO then requires `git rebase HEAD~1 --signoff` and `git push -f`. This
PR adds a hook that signs off commits automatically when `-s` is
missing. The only user-facing change is that two hooks must now be
installed, so instead of just
```
pre-commit install
```
users now need to run
```
pre-commit install --hook-type pre-commit --hook-type commit-msg
```
Note that users who install only the pre-commit hook won't get any error
in `git commit`; the sign-off hook simply won't run.
cc @hmellor @youkaichao
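For illustration, the core of such a commit-msg hook can be sketched as below (not vLLM's actual hook; git passes the message file path as `$1`, simulated here with a local file, and the author name/email are placeholder values normally read from `git config`):

```shell
# Stand-in for the commit message file git hands to the commit-msg hook.
msg_file="commit_msg.txt"
printf 'Fix a bug in the scheduler\n' > "$msg_file"

author_name="Jane Doe"            # normally: git config user.name
author_email="jane@example.com"   # normally: git config user.email

# Append a Signed-off-by trailer only if one is not already present.
if ! grep -q '^Signed-off-by:' "$msg_file"; then
    printf '\nSigned-off-by: %s <%s>\n' "$author_name" "$author_email" >> "$msg_file"
fi
cat "$msg_file"
```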
---------
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-01-31 12:30:33 -08:00
325f679f32
[BugFix] Fix Torch.Compile For DeepSeek ( #12594 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-31 12:06:39 -08:00
e3f7ff65e7
Add favicon to docs ( #12611 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-31 09:20:34 -08:00
7a8987dac5
[Bugfix] Gracefully handle huggingface hub http error ( #12571 )
2025-01-31 08:19:35 +00:00
cabaf4eff3
[Attention] MLA decode optimizations ( #12528 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-30 23:49:37 -08:00
a1fc18c030
[ROCm][AMD][Model] llama 3.2 support upstreaming ( #12421 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2025-01-31 12:24:28 +08:00
9798b2fb00
[Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling ( #11868 )
2025-01-30 18:33:00 -08:00
4078052f09
[V1][Log] Add max request concurrency log to V1 ( #12569 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-30 23:07:19 +00:00
bd2107e30a
[CPU][PPC] Updated torch, torchvision, torchaudio dependencies ( #12555 )
...
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com >
2025-01-30 16:29:39 -05:00
9b0c4bab36
[Kernel] Triton Configs for Fp8 Block Quantization ( #11589 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-30 11:53:22 -08:00
41bf5612f5
[Misc] fix typo: add missing space in lora adapter error message ( #12564 )
...
Signed-off-by: Beim <beim2015@outlook.com >
2025-01-30 15:39:22 +00:00
a2769032ca
Set ?device={device} when changing tab in installation guides ( #12560 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-30 00:05:42 -08:00
f17f1d4608
[V1][Metrics] Add GPU cache usage % gauge ( #12561 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-29 18:31:01 -08:00
1c1bb0bbf2
[Misc][MoE] add Deepseek-V3 moe tuning support ( #12558 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-01-30 00:47:30 +00:00
e0cc5f259a
[V1][BugFix] Free encoder cache for aborted requests ( #12545 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-29 13:47:33 -08:00
73aa6cfdf7
Revert "[Build/CI] Fix libcuda.so linkage" ( #12552 )
2025-01-29 21:12:24 +00:00
27b78c73ca
[Kernel] add triton fused moe kernel for gptq/awq ( #12185 )
2025-01-29 09:07:09 -05:00
b02fd288b2
[Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. ( #11787 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-29 01:46:12 -08:00
ff7424f491
[Frontend] Support override generation config in args ( #12409 )
...
Signed-off-by: liuyanyi <wolfsonliu@163.com >
2025-01-29 01:41:01 -08:00
d93bf4da85
[Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM ( #12069 )
...
Signed-off-by: hzh <hezhihui_thu@163.com >
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
Signed-off-by: Akshat Tripathi <akshat@krai.ai >
Signed-off-by: Oleg Mosalov <oleg@krai.ai >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
Signed-off-by: Chenguang Li <757486878@qq.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Shanshan Shen <467638484@qq.com >
Signed-off-by: elijah <f1renze.142857@gmail.com >
Signed-off-by: Yikun <yikunkero@gmail.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com >
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: sixgod <evethwillbeok@outlook.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com >
Co-authored-by: Oleg Mosalov <oleg@krai.ai >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com >
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com >
Co-authored-by: Concurrensee <yida.wu@amd.com >
Co-authored-by: Chenguang Li <757486878@qq.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Alex Brooks <alex.brooks@ibm.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Shanshan Shen <467638484@qq.com >
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com >
Co-authored-by: Yikun Jiang <yikunkero@gmail.com >
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Konrad Zawora <kzawora@habana.ai >
Co-authored-by: TJian <tunjian1996@gmail.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com >
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com >
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com >
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-01-29 09:24:59 +00:00
036ca94c25
[Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense ( #12347 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: Wallas Santos <wallashss@ibm.com >
2025-01-29 08:54:35 +00:00
ef001d98ef
Fix the pydantic logging validator ( #12420 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-01-29 07:53:13 +00:00
5f671cb4c3
[V1] Improve Error Message for Unsupported Config ( #12535 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-29 04:56:56 +00:00
bd02164cf9
Bugfix for whisper quantization due to fake k_proj bias ( #12524 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-29 04:49:03 +00:00
46fb056749
[V1][Metrics] Add TTFT and TPOT histograms ( #12530 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-29 04:11:16 +00:00
dd6a3a02cb
[Doc] Convert docs to use colon fences ( #12471 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-29 11:38:29 +08:00
a7e3eba66f
[Frontend] Support reasoning content for deepseek r1 ( #12473 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
2025-01-29 11:38:08 +08:00
fbb5bd4cef
[TPU] Add example for profiling TPU inference ( #12531 )
...
Signed-off-by: mgoin <mgoin@redhat.com >
2025-01-29 03:16:47 +00:00
80fcc3ed1c
[Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels ( #12482 )
...
Signed-off-by: Fenghui Zhang <fhzhang@google.com >
2025-01-28 22:36:44 +00:00
c386c43ca3
[V1][Metrics] Add per-request prompt/generation_tokens histograms ( #12516 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-28 22:07:22 +00:00
f26d790718
Do not run suggestion pre-commit hook multiple times ( #12521 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-28 20:05:27 +00:00
0f657bdc52
Replace missed warning_once for rerank API ( #12472 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-28 19:06:32 +00:00
3fd1fb63ef
[V1][Metrics] Hook up IterationStats for Prometheus metrics ( #12478 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-28 16:38:38 +00:00
925d2f1908
[Doc] Fix typo for x86 CPU installation ( #12514 )
...
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com >
2025-01-28 16:37:10 +00:00
8f58a51358
[VLM] Merged multi-modal processor and V1 support for Qwen-VL ( #12504 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-28 16:25:05 +00:00
2079e43bee
[Core] Make raw_request optional in ServingCompletion ( #12503 )
...
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com >
2025-01-28 10:56:45 +00:00
e29d4358ef
[V1] Include Engine Version in Logs ( #12496 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-01-28 08:27:41 +00:00
8cbc424975
Update README.md with V1 alpha release ( #12495 )
2025-01-28 08:22:41 +00:00
dd66fd2b01
[CI] fix pre-commit error ( #12494 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-01-28 06:11:05 +00:00
0f465ab533
[FEATURE] Enables offline /score for embedding models ( #12021 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com >
2025-01-28 11:30:13 +08:00
23a7cbc88b
[CI/Build] Fixed the xla nightly issue report in #12451 ( #12453 )
2025-01-28 11:18:07 +08:00
426a5c3625
Fix bad path in prometheus example ( #12481 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-27 18:56:31 -07:00
ddee88d0ff
[Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache ( #11277 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
Co-authored-by: Jiangfei Duan <jfduan@outlook.com >
2025-01-27 17:31:16 -08:00
823ab79633
Update pre-commit hooks ( #12475 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-27 17:23:08 -07:00
6116ca8cd7
[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill ( #10132 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: wallashss <wallashss@ibm.com >
Co-authored-by: wallashss <wallashss@ibm.com >
2025-01-27 13:38:35 -08:00
2bc3fbba0c
[FlashInfer] Upgrade to 0.2.0 ( #11194 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-01-27 18:19:24 +00:00
3f1fc7425a
[V1][CI/Test] Do basic test for top-p & top-k sampling ( #12469 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-27 09:40:04 -08:00
01ba927040
[V1][Metrics] Add initial Prometheus logger ( #12416 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-27 12:26:28 -05:00
103bd17ac5
[Build] Only build 9.0a for scaled_mm and sparse kernels ( #12339 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-27 10:40:00 -05:00
ce69f7f754
[Bugfix] Fix gpt2 GGUF inference ( #12467 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-27 18:31:49 +08:00
624a1e4711
[V1][Minor] Minor optimizations for update_from_output ( #12454 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-27 01:09:27 -08:00
372bf0890b
[Bugfix] Fix missing seq_start_loc in xformers prefill metadata ( #12464 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-27 07:25:30 +00:00
5204ff5c3f
[Bugfix] Fix Granite 3.0 MoE model loading ( #12446 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-26 21:26:44 -08:00
0cc6b383d7
[Frontend] Support scores endpoint in run_batch ( #12430 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2025-01-27 04:30:17 +00:00
28e0750847
[V1] Avoid list creation in input preparation ( #12457 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-26 19:57:56 -08:00
582cf78798
[DOC] Add link to vLLM blog ( #12460 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-27 03:46:19 +00:00
0034b09ceb
[Frontend] Rerank API (Jina- and Cohere-compatible API) ( #12376 )
...
Signed-off-by: Kyle Mistele <kyle@mistele.com >
2025-01-26 19:58:45 -07:00
72bac73067
[Build/CI] Fix libcuda.so linkage ( #12424 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-26 21:18:19 +00:00
68f11149d8
[Bugfix][Kernel] Fix perf regression caused by PR #12405 ( #12434 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-26 11:09:34 -08:00
72f4880425
[Bugfix/CI] Fix broken kernels/test_mha.py ( #12450 )
2025-01-26 10:39:03 -08:00
aa2cd2c43d
[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 ( #12417 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-26 19:59:58 +08:00
9ddc35220b
[Frontend] generation_config.json for maximum tokens ( #12242 )
...
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com >
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Co-authored-by: shangmingc <caishangming@linux.alibaba.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-26 19:59:25 +08:00
a5255270c3
[Misc] Revert FA on ViT #12355 and #12435 ( #12445 )
2025-01-26 03:56:34 -08:00
0ee349b553
[V1][Bugfix] Fix assertion when mm hashing is turned off ( #12439 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-26 00:47:42 -08:00
fa63e710c7
[V1][Perf] Reduce scheduling overhead in model runner after cuda sync ( #12094 )
...
Signed-off-by: Keyun Tong <tongkeyun@gmail.com >
2025-01-26 00:42:37 -08:00
2a0309a646
[Misc][Bugfix] FA3 support to ViT MHA layer ( #12435 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-26 05:00:31 +00:00
324960a95c
[TPU][CI] Update torchxla version in requirement-tpu.txt ( #12422 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-01-25 07:23:03 +00:00
f1fc0510df
[Misc] Add FA2 support to ViT MHA layer ( #12355 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-25 15:07:35 +08:00
bf21481dde
[ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 ( #12408 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-01-25 12:17:19 +08:00
fb30ee92ee
[Bugfix] Fix BLIP-2 processing ( #12412 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-25 11:42:42 +08:00
221d388cc5
[Bugfix][Kernel] Fix moe align block issue for mixtral ( #12413 )
2025-01-25 01:49:28 +00:00
3132a933b6
[Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). ( #12405 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-24 20:20:59 +00:00
df5dafaa5b
[Misc] Remove deprecated code ( #12383 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-24 14:45:20 -05:00
ab5bbf5ae3
[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build ( #12375 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-24 15:27:59 +00:00
3bb8e2c9a2
[Misc] Enable proxy support in benchmark script ( #12356 )
...
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp >
2025-01-24 14:58:26 +00:00
e784c6b998
[ci/build] sync default value for wheel size ( #12398 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 17:54:29 +08:00
9a0f3bdbe5
[Hardware][Gaudi][Doc] Add missing step in setup instructions ( #12382 )
2025-01-24 09:43:49 +00:00
c7c9851036
[ci/build] fix wheel size check ( #12396 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 17:31:25 +08:00
3c818bdb42
[Misc] Use VisionArena Dataset for VLM Benchmarking ( #12389 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-24 00:22:04 -08:00
6dd94dbe94
[perf] fix perf regression from #12253 ( #12380 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 11:34:27 +08:00
0e74d797ce
[V1] Increase default batch size for H100/H200 ( #12369 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-24 03:19:55 +00:00
55ef66edf4
Update compressed-tensors version ( #12367 )
2025-01-24 11:19:42 +08:00
5e5630a478
[Bugfix] Path join when building local path for S3 clone ( #12353 )
...
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai >
2025-01-24 11:06:07 +08:00
d3d6bb13fb
Set weights_only=True when using torch.load() ( #12366 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-24 02:17:30 +00:00
24b0205f58
[V1][Frontend] Coalesce bunched RequestOutputs ( #12298 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2025-01-23 17:17:41 -08:00
c5cffcd0cd
[Docs] Update spec decode + structured output in compat matrix ( #12373 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-24 01:15:52 +00:00
682b55bc07
[Docs] Add meetup slides ( #12345 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-23 14:10:03 -08:00
9726ad676d
[Misc] Fix OpenAI API Compatibility Issues in Benchmark Script ( #12357 )
...
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp >
2025-01-23 17:02:13 -05:00
eb5cb5e528
[BugFix] Fix parameter names and process_after_weight_loading for W4A16 MoE Group Act Order ( #11528 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-23 21:40:33 +00:00
2cbeedad09
[Docs] Document Phi-4 support ( #12362 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-23 19:18:51 +00:00
2c85529bfc
[TPU] Update TPU CI to use torchxla nightly on 20250122 ( #12334 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-01-23 18:50:16 +00:00
e97f802b2d
[FP8][Kernel] Dynamic kv cache scaling factors computation ( #11906 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
2025-01-23 18:04:03 +00:00
6e650f56a1
[torch.compile] decouple compile sizes and cudagraph sizes ( #12243 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 02:01:30 +08:00
3f50c148fd
[core] add wake_up doc and some sanity check ( #12361 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 02:00:50 +08:00
8c01b8022c
[Bugfix] Fix broken internvl2 inference with v1 ( #12360 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-23 17:20:33 +00:00
99d01a5e3d
[V1] Simplify M-RoPE ( #12352 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: imkero <kerorek@outlook.com >
2025-01-23 23:13:23 +08:00
d07efb31c5
[Doc] Troubleshooting errors during model inspection ( #12351 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-23 22:46:58 +08:00
978b45f399
[Kernel] Flash Attention 3 Support ( #12093 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-23 06:45:48 -08:00
c5b4b11d7f
[Bugfix] Fix k_proj's bias for whisper self attention ( #12342 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-23 10:15:33 +00:00
8ae5ff2009
[Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update ( #12338 )
...
Signed-off-by: zhenwei <zhenweiliu@habana.ai >
2025-01-23 08:35:46 +00:00
511627445e
[doc] explain common errors around torch.compile ( #12340 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-23 14:56:02 +08:00
f0ef37233e
[V1] Add uncache_blocks ( #12333 )
2025-01-23 04:19:21 +00:00
7551a34032
[Docs] Document vulnerability disclosure process ( #12326 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-23 03:44:09 +00:00
01a55941f5
[Docs] Update FP8 KV Cache documentation ( #12238 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-23 11:18:09 +08:00
8d7aa9de71
[Bugfix] Fixing AMD LoRA CI test. ( #12329 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-01-23 10:53:02 +08:00
68c4421b6d
[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD ( #12282 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-01-23 00:10:37 +00:00
aea94362c9
[Frontend][V1] Online serving performance improvements ( #12287 )
2025-01-22 22:22:12 +00:00
7206ce4ce1
[Core] Support reset_prefix_cache ( #12284 )
2025-01-22 18:52:27 +00:00
96f6a7596f
[Bugfix] Fix HPU multiprocessing executor ( #12167 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2025-01-23 02:07:07 +08:00
84bee4bd5c
[Misc] Improve the readability of BNB error messages ( #12320 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-22 16:56:54 +00:00
fc66dee76d
[Misc] Fix the error in the tip for the --lora-modules parameter ( #12319 )
...
Signed-off-by: wangerxiao <863579016@qq.com >
2025-01-22 16:48:41 +00:00
6609cdf019
[Doc] Add docs for prompt replacement ( #12318 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-22 14:56:29 +00:00
16366ee8bb
[Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 ( #12313 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-22 21:06:36 +08:00
528dbcac7d
[Model][Bugfix]: correct Aria model output ( #12309 )
...
Signed-off-by: xffxff <1247714429@qq.com >
2025-01-22 11:39:19 +00:00
cd7b6f0857
[VLM] Avoid unnecessary tokenization ( #12310 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-22 11:08:31 +00:00
68ad4e3a8d
[Core] Support fully transparent sleep mode ( #11743 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-22 14:39:32 +08:00
4004f144f3
[Build] update requirements of no-device ( #12299 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-01-22 14:29:31 +08:00
66818e5b63
[core] separate builder init and builder prepare for each batch ( #12253 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-22 14:13:52 +08:00
222a9dc350
[Benchmark] More accurate TPOT calc in benchmark_serving.py ( #12288 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-01-22 13:46:14 +08:00
cbdc4ad5a5
[Ci/Build] Fix mypy errors on main ( #12296 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-22 12:06:54 +08:00
016e3676e7
[CI] add docker volume prune to neuron CI ( #12291 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-01-22 10:47:49 +08:00
64ea24d0b3
[ci/lint] Add back default arg for pre-commit ( #12279 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2025-01-22 01:15:27 +00:00
df76e5af26
[VLM] Simplify post-processing of replacement info ( #12269 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-21 16:48:13 -08:00
09ccc9c8f7
[Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose ( #12281 )
...
Signed-off-by: Hongxia Yang <hongxyan@amd.com >
2025-01-22 07:49:22 +08:00
69196a9bc7
[BUGFIX] When skip_tokenize_init and multistep are set, execution crashes ( #12277 )
...
Signed-off-by: maleksan85 <maleksan@amd.com >
Co-authored-by: maleksan85 <maleksan@amd.com >
2025-01-21 23:30:46 +00:00
2acba47d9b
[bugfix] moe tuning. rm is_navi() ( #12273 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-01-21 22:47:32 +00:00
9c485d9e25
[Core] Free CPU pinned memory on environment cleanup ( #10477 )
2025-01-21 11:56:41 -08:00
fa9ee08121
[Misc] Set default backend to SDPA for get_vit_attn_backend ( #12235 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-21 11:52:11 -08:00
347eeebe3b
[Misc] Remove experimental dep from tracing.py ( #12007 )
...
Signed-off-by: Adrian Cole <adrian.cole@elastic.co >
2025-01-21 11:51:55 -08:00
18fd4a8331
[Bugfix] Multi-sequence broken ( #11898 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2025-01-21 11:51:35 -08:00
132a132100
[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types ( #10907 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2025-01-21 11:51:13 -08:00
1e60f87bb3
[Kernel] fix moe_align_block_size error condition ( #12239 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-01-21 10:30:28 -08:00
9705b90bcf
[Bugfix] fix race condition that leads to wrong order of token returned ( #10802 )
...
Signed-off-by: Jannis Schönleber <joennlae@gmail.com >
2025-01-21 09:47:04 -08:00
3aec49e56f
[ci/build] update nightly torch for gh200 test ( #12270 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-21 23:03:17 +08:00
c64612802b
[Platform] improve platforms getattr ( #12264 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-01-21 14:42:41 +00:00
9a7c3a0042
Remove pytorch comments for outlines + compressed-tensors ( #12260 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-01-21 21:49:08 +08:00
b197a5ccfd
[V1][Bugfix] Fix data item ordering in mixed-modality inference ( #12259 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-21 13:18:43 +00:00
c81081fece
[torch.compile] transparent compilation with more logging ( #12246 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-21 19:32:55 +08:00
a94eee4456
[Bugfix] Fix mm_limits access for merged multi-modal processor ( #12252 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-21 10:09:39 +00:00
f2e9f2a3be
[Misc] Remove redundant TypeVar from base model ( #12248 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-21 08:40:39 +00:00
1f1542afa9
[Misc]Add BNB quantization for PaliGemmaForConditionalGeneration ( #12237 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-21 07:49:08 +00:00
96912550c8
[Misc] Rename MultiModalInputsV2 -> MultiModalInputs ( #12244 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-21 07:31:19 +00:00
2fc6944c5e
[ci/build] disable failed and flaky tests ( #12240 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-21 13:25:03 +08:00
5fe6bf29d6
[BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 ( #12230 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-01-21 12:23:14 +08:00
d4b62d4641
[AMD][Build] Porting dockerfiles from the ROCm/vllm fork ( #11777 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-01-21 12:22:23 +08:00
ecf67814f1
Add quantization and guided decoding CODEOWNERS ( #12228 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-20 18:23:40 -07:00
750f4cabfa
[Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) ( #12222 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-20 16:42:16 -08:00
06a760d6e8
[bugfix] catch xgrammar unsupported array constraints ( #12210 )
...
Signed-off-by: Jason Cheng <jasoncky96@gmail.com >
2025-01-20 16:42:02 -08:00
da7512215f
[misc] add cuda runtime version to usage data ( #12190 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-01-21 00:31:01 +00:00
af69a6aded
fix: update platform detection for M-series arm based MacBook processors ( #12227 )
...
Signed-off-by: isikhi <huseyin.isik000@gmail.com >
2025-01-20 22:23:28 +00:00
7bd3630067
[Misc] Update CODEOWNERS ( #12229 )
2025-01-20 22:19:09 +00:00
96663699b2
[CI] Pass local python version explicitly to pre-commit mypy.sh ( #12224 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-20 23:49:18 +08:00
18572e3384
[Bugfix] Fix HfExampleModels.find_hf_info ( #12223 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 15:35:36 +00:00
86bfb6dba7
[Misc] Pass attention to impl backend ( #12218 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-20 23:25:28 +08:00
5f0ec3935a
[V1] Remove _get_cache_block_size ( #12214 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-20 21:54:16 +08:00
c222f47992
[core][bugfix] configure env var during import vllm ( #12209 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-20 19:35:59 +08:00
170eb35079
[misc] print a message to suggest how to bypass commit hooks ( #12217 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-20 18:06:24 +08:00
b37d82791e
[Model] Upgrade Aria to transformers 4.48 ( #12203 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 17:58:48 +08:00
3127e975fb
[CI/Build] Make pre-commit faster ( #12212 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 17:36:24 +08:00
4001ea1266
[CI/Build] Remove dummy CI steps ( #12208 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 16:41:57 +08:00
5c89a29c22
[misc] add placeholder format.sh ( #12206 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-20 16:04:49 +08:00
59a0192fb9
[Core] Interface for accessing model from VllmRunner ( #10353 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 15:00:59 +08:00
83609791d2
[Model] Add Qwen2 PRM model support ( #12202 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-20 14:59:46 +08:00
0974c9bc5c
[Bugfix] Fix incorrect types in LayerwiseProfileResults ( #12196 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-20 14:59:20 +08:00
d2643128f7
[DOC] Add missing docstring in LLMEngine.add_request() ( #12195 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-20 14:59:00 +08:00
c5c06209ec
[DOC] Fix typo in docstring and assert message ( #12194 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-20 14:58:29 +08:00
3ea7b94523
Move linting to pre-commit ( #11975 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-20 14:58:01 +08:00
51ef828f10
[torch.compile] fix sym_tensor_indices ( #12191 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-20 11:37:50 +08:00
df450aa567
[Bugfix] Fix num_heads value for simple connector when tp enabled ( #12074 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-01-20 02:56:43 +00:00
bbe5f9de7d
[Model] Support for fairseq2 Llama ( #11442 )
...
Signed-off-by: Martin Gleize <mgleize@meta.com >
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas >
2025-01-19 10:40:40 -08:00
81763c58a0
[V1] Add V1 support of Qwen2-VL ( #12128 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: imkero <kerorek@outlook.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-19 19:52:13 +08:00
edaae198e7
[Misc] Add BNB support to GLM4-V model ( #12184 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-19 19:49:22 +08:00
936db119ed
benchmark_serving support --served-model-name param ( #12109 )
...
Signed-off-by: zibai <zibai.gj@alibaba-inc.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-01-19 09:59:56 +00:00
e66faf4809
[torch.compile] store inductor compiled Python file ( #12182 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-19 16:27:26 +08:00
630eb5b5ce
[Bugfix] Fix multi-modal processors for transformers 4.48 ( #12187 )
2025-01-18 19:16:34 -08:00
4e94951bb1
[BUGFIX] Move scores to float32 in case of running xgrammar on cpu ( #12152 )
...
Signed-off-by: Michal Adamczyk <madamczyk@habana.ai >
2025-01-19 11:12:05 +08:00
7a8a48d51e
[V1] Collect env var for usage stats ( #12115 )
2025-01-19 03:07:15 +00:00
32eb0da808
[Misc] Support register quantization method out-of-tree ( #11969 )
2025-01-18 16:13:16 -08:00
6d0e3d3724
[core] clean up executor class hierarchy between v1 and v0 ( #12171 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-18 14:35:15 +08:00
02798ecabe
[Model] Port deepseek-vl2 processor, remove dependency ( #12169 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-18 13:59:39 +08:00
813f249f02
[Docs] Fix broken link in SECURITY.md ( #12175 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-18 04:35:21 +00:00
da02cb4b27
[core] further polish memory profiling ( #12126 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-18 12:25:08 +08:00
c09503ddd6
[AMD][CI/Build][Bugfix] use pytorch stale wheel ( #12172 )
...
Signed-off-by: hongxyan <hongxyan@amd.com >
2025-01-18 11:15:53 +08:00
2b83503227
[misc] fix cross-node TP ( #12166 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-18 10:53:27 +08:00
7b98a65ae6
[torch.compile] disable logging when cache is disabled ( #12043 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-17 20:29:31 +00:00
b5b57e301e
[AMD][FP8] Using MI300 FP8 format on ROCm for block_quant ( #12134 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-01-17 17:12:26 +00:00
54cacf008f
[Bugfix] Mistral tokenizer encode accept list of str ( #12149 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-01-17 16:47:53 +00:00
58fd57ff1d
[Bugfix] Fix score api for missing max_model_len validation ( #12119 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2025-01-17 16:24:22 +00:00
87a0c076af
[core] allow callable in collective_rpc ( #12151 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-17 20:47:01 +08:00
d4e6194570
[CI/Build][CPU][Bugfix] Fix CPU CI ( #12150 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-01-17 19:39:52 +08:00
07934cc237
[Misc][LoRA] Improve the readability of LoRA error messages ( #12102 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-17 19:32:28 +08:00
69d765f5a5
[V1] Move more control of kv cache initialization from model_executor to EngineCore ( #11960 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-01-17 07:39:35 +00:00
8027a72461
[ROCm][MoE] moe tuning support for rocm ( #12049 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-01-17 14:49:16 +08:00
d75ab55f10
[Misc] Add deepseek_vl2 chat template ( #12143 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-17 06:34:48 +00:00
d1adb9b403
[BugFix] add more is not None check in VllmConfig.__post_init__ ( #12138 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-17 05:33:22 +00:00
b8bfa46a18
[Bugfix] Fix issues in CPU build Dockerfile ( #12135 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-17 12:54:01 +08:00
1475847a14
[Doc] Add instructions on using Podman when SELinux is active ( #12136 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-17 04:45:36 +00:00
fead53ba78
[CI]add genai-perf benchmark in nightly benchmark ( #10704 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-01-17 04:15:09 +00:00
ebc73f2828
[Bugfix] Fix a path bug in disaggregated prefill example script. ( #12121 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-01-17 11:12:41 +08:00
d06e824006
[Bugfix] Set enforce_eager automatically for mllama ( #12127 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-16 15:30:08 -05:00
62b06ba23d
[Model] Add support for deepseek-vl2-tiny model ( #12068 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-16 17:14:48 +00:00
5fd24ec02e
[misc] Add LoRA kernel micro benchmarks ( #11579 )
2025-01-16 15:51:40 +00:00
874f7c292a
[Bugfix] Fix max image feature size for Llava-one-vision ( #12104 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-16 14:54:06 +00:00
92e793d91a
[core] LLM.collective_rpc interface and RLHF example ( #12084 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-16 20:19:52 +08:00
bf53e0c70b
Support torchrun and SPMD-style offline inference ( #12071 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-16 19:58:53 +08:00
dd7c9ad870
[Bugfix] Remove hardcoded head_size=256 for Deepseek v2 and v3 ( #12067 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-16 10:11:54 +00:00
9aa1519f08
Various cosmetic/comment fixes ( #12089 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-16 09:59:06 +00:00
f8ef146f03
[Doc] Add documentation for specifying model architecture ( #12105 )
2025-01-16 15:53:43 +08:00
fa0050db08
[Core] Default to using per_token quantization for fp8 when cutlass is supported. ( #8651 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-16 04:31:27 +00:00
cd9d06fb8d
Allow hip sources to be directly included when compiling for rocm. ( #12087 )
2025-01-15 16:46:03 -05:00
ebd8c669ef
[Bugfix] Fix _get_lora_device for HQQ marlin ( #12090 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-01-15 19:59:42 +00:00
70755e819e
[V1][Core] Autotune encoder cache budget ( #11895 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-15 11:29:00 -08:00
edce722eaa
[Bugfix] use right truncation for non-generative tasks ( #12050 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-01-16 00:31:01 +08:00
57e729e874
[Doc]: Update OpenAI-Compatible Server documents ( #12082 )
2025-01-15 16:07:45 +00:00
de0526f668
[Misc][Quark] Upstream Quark format to VLLM ( #10765 )
...
Signed-off-by: kewang-xlnx <kewang@xilinx.com >
Signed-off-by: kewang2 <kewang2@amd.com >
Co-authored-by: kewang2 <kewang2@amd.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-15 11:05:15 -05:00
5ecf3e0aaf
Misc: allow to use proxy in HTTPConnection ( #12042 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2025-01-15 13:16:40 +00:00
97eb97b5a4
[Model]: Support internlm3 ( #12037 )
2025-01-15 11:35:17 +00:00
3adf0ffda8
[Platform] Do not raise error if _Backend is not found ( #12023 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Co-authored-by: Mengqing Cao <cmq0113@163.com >
2025-01-15 10:14:15 +00:00
ad388d25a8
Type-fix: make execute_model output type optional ( #12020 )
2025-01-15 09:44:56 +00:00
cbe94391eb
Fix: cases with empty sparsity config ( #12057 )
...
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com >
2025-01-15 17:41:24 +08:00
994fc655b7
[V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager ( #12003 )
2025-01-15 07:55:30 +00:00
3f9b7ab9f5
[Doc] Update examples to remove SparseAutoModelForCausalLM ( #12062 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-01-15 06:36:01 +00:00
ad34c0df0f
[core] platform agnostic executor via collective_rpc ( #11256 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-15 13:45:21 +08:00
f218f9c24d
[core] Turn off GPU communication overlap for Ray executor ( #12051 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-01-15 05:19:55 +00:00
0794e7446e
[Misc] Add multipstep chunked-prefill support for FlashInfer ( #10467 )
2025-01-15 12:47:49 +08:00
b7ee940a82
[V1][BugFix] Fix edge case in VLM scheduling ( #12065 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-14 20:21:28 -08:00
9ddac56311
[Platform] move current_memory_usage() into platform ( #11369 )
...
Signed-off-by: Shanshan Shen <467638484@qq.com >
2025-01-15 03:38:25 +00:00
1a51b9f872
[HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py ( #12046 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2025-01-15 02:59:18 +00:00
42f5e7c52a
[Kernel] Support MulAndSilu ( #11624 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-15 02:29:53 +00:00
a3a3ee4e6f
[Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping ( #11924 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-15 07:49:49 +08:00
87054a57ab
[Doc]: Update the Json Example of the Engine Arguments document ( #12045 )
2025-01-14 17:03:04 +00:00
c9d6ff530b
Explain where the engine args go when using Docker ( #12041 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-14 16:05:50 +00:00
a2d2acb4c8
[Bugfix][Kernel] Give unique name to BlockSparseFlashAttention ( #12040 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-14 15:45:05 +00:00
2e0e017610
[Platform] Add output for Attention Backend ( #11981 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-14 13:27:04 +00:00
1f18adb245
[Kernel] Revert the API change of Attention.forward ( #12038 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-14 20:59:32 +08:00
bb354e6b2d
[Bugfix] Fix various bugs in multi-modal processor ( #12031 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-14 12:16:11 +00:00
ff39141a49
[HPU][misc] add comments for explanation ( #12034 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-14 19:24:06 +08:00
8a1f938e6f
[Doc] Update Quantization Hardware Support Documentation ( #12025 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-01-14 04:37:52 +00:00
078da31903
[HPU][Bugfix] set_forward_context and CI test execution ( #12014 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2025-01-14 11:04:18 +08:00
1a401252b5
[Docs] Add Sky Computing Lab to project intro ( #12019 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-13 17:24:36 -08:00
f35ec461fc
[Bugfix] Fix deepseekv3 gate bias error ( #12002 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-13 13:43:51 -07:00
289b5191d5
[Doc] Fix build from source and installation link in README.md ( #12013 )
...
Signed-off-by: Yikun <yikunkero@gmail.com >
2025-01-13 17:23:59 +00:00
c6db21313c
bugfix: Fix signature mismatch in benchmark's get_tokenizer function ( #11982 )
...
Signed-off-by: elijah <f1renze.142857@gmail.com >
2025-01-13 15:22:07 +00:00
a7d59688fb
[Platform] Move get_punica_wrapper() function to Platform ( #11516 )
...
Signed-off-by: Shanshan Shen <467638484@qq.com >
2025-01-13 13:12:10 +00:00
458e63a2c6
[platform] add device_control env var ( #12009 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-13 20:59:09 +08:00
e8c23ff989
[Doc] Organise installation documentation into categories and tabs ( #11935 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-13 12:27:36 +00:00
cd8249903f
[Doc][V1] Update model implementation guide for V1 support ( #11998 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-01-13 11:58:54 +00:00
0f8cafe2d1
[Kernel] unified_attention for Attention.forward ( #11967 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-13 19:28:53 +08:00
5340a30d01
Fix Max Token ID for Qwen-VL-Chat ( #11980 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
2025-01-13 08:37:48 +00:00
89ce62a316
[platform] add ray_device_key ( #11948 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-13 16:20:52 +08:00
c3f05b09a0
[Misc]Minor Changes about Worker ( #11555 )
...
Signed-off-by: Chenguang Li <757486878@qq.com >
2025-01-13 15:47:05 +08:00
cf6bbcb493
[Misc] Fix Deepseek V2 fp8 kv-scale remapping ( #11947 )
...
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
2025-01-12 23:05:06 -08:00
80ea3af1a0
[CI][Spec Decode] fix: broken test for EAGLE model ( #11972 )
...
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
2025-01-13 06:50:35 +00:00
9dd02d85ca
[Bug] Fix usage of .transpose() and .view() consecutively. ( #11979 )
2025-01-13 06:24:10 +00:00
f7b3ba82c3
[MISC] fix typo in kv transfer send recv test ( #11983 )
2025-01-13 05:07:48 +00:00
619ae268c3
[V1] [2/n] Logging and Metrics - OutputProcessor Abstraction ( #11973 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-01-13 04:54:10 +00:00
d14e98d924
[Model] Support GGUF models newly added in transformers 4.46.0 ( #9685 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-13 00:13:44 +00:00
9597a095f2
[V1][Core][1/n] Logging and Metrics ( #11962 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-01-12 21:02:02 +00:00
263a870ee1
[Hardware][TPU] workaround fix for MoE on TPU ( #11764 )
2025-01-12 10:53:51 -05:00
8bddb73512
[Hardware][CPU] Multi-LoRA implementation for the CPU backend ( #11100 )
...
Signed-off-by: Akshat Tripathi <akshat@krai.ai >
Signed-off-by: Oleg Mosalov <oleg@krai.ai >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Oleg Mosalov <oleg@krai.ai >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-12 13:01:52 +00:00
f967e51f38
[Model] Initialize support for Deepseek-VL2 models ( #11578 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-12 00:17:24 -08:00
43f3d9e699
[CI/Build] Add markdown linter ( #11857 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2025-01-12 00:17:13 -08:00
b25cfab9a0
[V1] Avoid sending text prompt to core engine ( #11963 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-12 06:36:38 +00:00
4b657d3292
[Model] Add cogagent model support vLLM ( #11742 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-11 19:05:56 +00:00
d697dc01b4
[Bugfix] Fix RobertaModel loading ( #11940 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-01-11 14:05:09 +00:00
a991f7d508
[Doc] Basic guide for writing unit tests for new models ( #11951 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-11 21:27:24 +08:00
7a3a83e3b8
[CI/Build] Move model-specific multi-modal processing tests ( #11934 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-11 13:50:05 +08:00
c32a7c7c0c
[Bugfix] fused_experts_impl wrong compute type for float32 ( #11921 )
...
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com >
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com >
2025-01-11 13:49:39 +08:00
2118d0565c
[Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design ( #11672 )
...
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
2025-01-10 20:49:38 -08:00
899136b857
[ci] fix broken distributed-tests-4-gpus ( #11937 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-11 09:07:24 +08:00
c9f09a4fe8
[mypy] Fix mypy warnings in api_server.py ( #11941 )
...
Signed-off-by: Fred Reiss <frreiss@us.ibm.com >
2025-01-11 01:04:58 +00:00
d45cbe70f5
[Bugfix] Check that number of images matches number of <|image|> tokens with mllama ( #11939 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-01-10 23:26:00 +00:00
8a579408f3
[Misc] Update benchmark_prefix_caching.py fixed example usage ( #11920 )
...
Signed-off-by: Ren MinMin <renmm6@chinaunicom.cn >
Co-authored-by: Ren MinMin <renmm6@chinaunicom.cn >
2025-01-10 20:39:22 +00:00
46fa98ccad
[Misc] Clean up debug code in Deepseek-V3 ( #11930 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-10 19:19:15 +00:00
aa1e77a19c
[Hardware][CPU] Support MOE models on x86 CPU ( #11831 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-01-10 11:07:58 -05:00
5959564f94
Doc fix in benchmark_long_document_qa_throughput.py ( #11933 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-01-10 23:51:43 +08:00
f33e033e27
[Docs] Fix docstring in get_ip function ( #11932 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-01-10 23:51:02 +08:00
482cdc494e
[Doc] Rename offline inference examples ( #11927 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-10 23:50:29 +08:00
20410b2fda
[platform] support custom torch.compile backend key ( #11318 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-01-10 23:46:51 +08:00
12664ddda5
[Doc] [1/N] Initial guide for merged multi-modal processor ( #11925 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-10 14:30:25 +00:00
241ad7b301
[ci] Fix sampler tests ( #11922 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-10 20:45:33 +08:00
d85c47d6ad
Replace "online inference" with "online serving" ( #11923 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-10 12:05:56 +00:00
ef725feafc
[platform] support pytorch custom op pluggable ( #11328 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-10 10:02:38 +00:00
d907be7dc7
[misc] remove python function call for custom activation op ( #11885 )
...
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-01-10 17:18:25 +08:00
d53575a5f0
[ci] fix gh200 tests ( #11919 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-10 16:25:17 +08:00
61af633256
[BUGFIX] Fix UnspecifiedPlatform package name ( #11916 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-01-10 16:20:46 +08:00
ac2f3f7fee
[Bugfix] Validate lora adapters to avoid crashing server ( #11727 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-10 15:56:36 +08:00
cf5f000d21
[torch.compile] Hide KV cache behind torch.compile boundary ( #11677 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-10 13:14:42 +08:00
3de2b1eafb
[Doc] Show default pooling method in a table ( #11904 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-10 11:25:20 +08:00
b844b99ad3
[VLM] Enable tokenized inputs for merged multi-modal processor ( #11900 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-10 03:24:00 +00:00
c3cf54dda4
[Doc][5/N] Move Community and API Reference to the bottom ( #11896 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-01-10 03:10:12 +00:00
36f5303578
[Docs] Add Modal to deployment frameworks ( #11907 )
2025-01-09 23:26:37 +00:00
9a228348d2
[Misc] Provide correct Pixtral-HF chat template ( #11891 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-09 10:19:37 -07:00
bd82872211
[ci]try to fix flaky multi-step tests ( #11894 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-09 14:47:29 +00:00
405eb8e396
[platform] Allow platform specify attention backend ( #11609 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Co-authored-by: Mengqing Cao <cmq0113@163.com >
2025-01-09 21:46:50 +08:00
65097ca0af
[Doc] Add model development API Reference ( #11884 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-09 09:43:40 +00:00
1d967acb45
[Bugfix] fix beam search input errors and latency benchmark script ( #11875 )
...
Signed-off-by: Ye Qi <yeq@meta.com >
Co-authored-by: yeq <yeq@devgpu004.lla3.facebook.com >
2025-01-09 17:36:39 +08:00
0bd1ff4346
[Bugfix] Override dunder methods of placeholder modules ( #11882 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-09 09:02:53 +00:00
310aca88c9
[perf]fix current stream ( #11870 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-09 07:18:21 +00:00
a732900efc
[Doc] Intended links Python multiprocessing library ( #11878 )
2025-01-09 05:39:39 +00:00
d848800e88
[Misc] Move print_*_once from utils to logger ( #11298 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com >
Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com >
2025-01-09 12:48:12 +08:00
730e9592e9
[Doc] Recommend uv and python 3.12 for quickstart guide ( #11849 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-09 11:37:48 +08:00
1fe554bac3
treat do_lower_case in the same way as the sentence-transformers library ( #11815 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-01-09 11:05:43 +08:00
615e4a5401
[CI] Turn on basic correctness tests for V1 ( #10864 )
2025-01-08 21:20:44 -05:00
3db0cafdf1
[Docs] Add Google Cloud Meetup ( #11864 )
2025-01-08 12:38:28 -08:00
526de822d5
[Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models ( #11698 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-01-08 20:23:15 +00:00
56fe4c297c
[TPU][Quantization] TPU W8A8 ( #11785 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-08 19:33:29 +00:00
47de8821d3
[Misc]add some explanations for BlockHashType ( #11847 )
2025-01-08 18:21:30 +00:00
5984499e47
[Doc] Expand Multimodal API Reference ( #11852 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 17:14:14 +00:00
ca47e176af
[Misc] Move some model utils into vision file ( #11848 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 17:04:46 +00:00
78f4590b60
[Bugfix][XPU] fix silu_and_mul ( #11823 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2025-01-09 00:11:50 +08:00
2f7024987e
[CI/Build][Bugfix] Fix CPU CI image clean up ( #11836 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-01-08 15:18:28 +00:00
6cd40a5bfe
[Doc][4/N] Reorganize API Reference ( #11843 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 21:34:44 +08:00
aba8d6ee00
[Doc] Move examples into categories ( #11840 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-08 13:09:53 +00:00
2a0596bc48
[VLM] Reorganize profiling/processing-related code ( #11812 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 18:59:58 +08:00
f12141170a
[torch.compile] consider relevant code in compilation cache ( #11614 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-08 10:46:43 +00:00
cfd3219f58
[Hardware][Apple] Native support for macOS Apple Silicon ( #11696 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-08 16:35:49 +08:00
a1b2b8606e
[Docs] Update sponsor name: 'Novita' to 'Novita AI' ( #11833 )
2025-01-07 23:05:46 -08:00
ad9f1aa679
[doc] update wheels url ( #11830 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-08 14:36:49 +08:00
889e662eae
[misc] improve memory profiling ( #11809 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-08 06:36:03 +00:00
ef68eb28d8
[Bug] Fix pickling of ModelConfig when RunAI Model Streamer is used ( #11825 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 13:40:09 +08:00
259abd8953
[Docs] reorganize sponsorship page ( #11639 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-01-07 21:16:08 -08:00
f645eb6954
[Bugfix] Add checks for LoRA and CPU offload ( #11810 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-08 13:08:48 +08:00
f4923cb8bc
[OpenVINO] Fixed Docker.openvino build ( #11732 )
...
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com >
2025-01-08 13:08:30 +08:00
b640b19cc0
Fixed docker build for ppc64le ( #11518 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
2025-01-08 13:05:37 +08:00
dc71af0a71
Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… ( #11824 )
2025-01-08 04:09:25 +00:00
4d29e91be8
[Misc] sort torch profiler table by kernel timing ( #11813 )
2025-01-08 10:57:04 +08:00
91445c7bc8
[Bugfix] Fix image input for Pixtral-HF ( #11741 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 10:17:16 +08:00
5950f555a1
[Doc] Group examples into categories ( #11782 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-08 09:20:12 +08:00
a4e2b26856
[Bugfix] Significant performance drop on CPUs with --num-scheduler-steps > 1 ( #11794 )
2025-01-07 16:15:50 -08:00
973f5dc581
[Doc]Add documentation for using EAGLE in vLLM ( #11417 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
2025-01-07 19:19:12 +00:00
c994223d56
[Bugfix] update the prefix for qwen2 ( #11795 )
...
Co-authored-by: jiadi.jjd <jiadi.jjd@antgroup.com >
2025-01-07 18:36:34 +00:00
869579a702
[optimization] remove python function call for custom op ( #11750 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-07 17:04:28 +00:00
c0efe92d8b
[Doc] Add note to gte-Qwen2 models ( #11808 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 21:50:58 +08:00
d9fa1c05ad
[doc] update how pip can install nightly wheels ( #11806 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-07 21:42:58 +08:00
2de197bdd4
[V1] Support audio language models on V1 ( #11733 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-07 19:47:36 +08:00
869e829b85
[doc] add doc to explain how to use uv ( #11773 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-07 18:41:17 +08:00
8f37be38eb
[Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation ( #11800 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 18:25:02 +08:00
8082ad7950
[V1][Doc] Update V1 support for LLaVa-NeXT-Video ( #11798 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-07 09:55:39 +00:00
1e4ce295ae
[CI][CPU] adding build number to docker image name ( #11788 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2025-01-07 07:28:01 +00:00
ce1917fcf2
[Doc] Create a vulnerability management team ( #9925 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-06 22:57:32 -08:00
e512f76a89
fix init error for MessageQueue when n_local_reader is zero ( #11768 )
2025-01-07 06:12:48 +00:00
898cdf033e
[CI] Fix neuron CI and run offline tests ( #11779 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-01-06 21:36:10 -08:00
0f3f3c86ec
[Bugfix] Update attention interface in Whisper ( #11784 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-07 04:36:24 +00:00
b278557935
[Kernel][LoRA]Punica prefill kernels fusion ( #11234 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Abatom <abzhonghua@gmail.com >
Co-authored-by: Zhonghua Deng <abatom@163.com >
2025-01-07 04:01:39 +00:00
8ceffbf315
[Doc][3/N] Reorganize Serving section ( #11766 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 11:20:01 +08:00
d93d2d74fd
[XPU] Make pp group initilized for pipeline-parallelism ( #11648 )
...
Signed-off-by: yisheng <yi.sheng@intel.com >
2025-01-07 11:09:58 +08:00
d0169e1b0f
[Model] Future-proof Qwen2-Audio multi-modal processor ( #11776 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 11:05:17 +08:00
08fb75c72e
[Bugfix] Fix LLaVA-NeXT feature size precision error (for real) ( #11772 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 01:10:54 +00:00
91b361ae89
[V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision ( #11685 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-06 19:58:16 +00:00
e20c92bb61
[Kernel] Move attn_type to Attention.__init__() ( #11690 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-07 00:11:28 +08:00
32c9eff2ff
[Bugfix][V1] Fix molmo text-only inputs ( #11676 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-06 15:22:25 +00:00
4ca5d40adc
[doc] explain how to add interleaving sliding window support ( #11771 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-06 21:57:44 +08:00
9279b9f83d
[Bugfix] Fix max image size for LLaVA-Onevision ( #11769 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-06 13:48:53 +00:00
ee77fdb5de
[Doc][2/N] Reorganize Models and Usage sections ( #11755 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-06 21:40:31 +08:00
996357e480
[VLM] Separate out profiling-related logic ( #11746 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-06 16:02:21 +08:00
2a622d704a
k8s-config: Update the secret to use stringData ( #11679 )
...
Signed-off-by: Suraj Deshmukh <surajd.service@gmail.com >
2025-01-06 08:01:22 +00:00
9c749713f6
[mypy] Forward pass function type hints in lora ( #11740 )
...
Signed-off-by: lucast2021 <lucast2021@headroyce.org >
Co-authored-by: lucast2021 <lucast2021@headroyce.org >
2025-01-06 07:59:36 +00:00
022c5c6944
[V1] Refactor get_executor_cls ( #11754 )
2025-01-06 07:59:16 +00:00
f8fcca100b
[Misc] Fix typo for valid_tool_parses ( #11753 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-01-06 07:12:38 +00:00
06bfb51963
[V1] Add BlockTable class ( #11693 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-06 14:24:42 +09:00
408e560015
[Bugfix] Remove block size constraint ( #11723 )
2025-01-06 12:49:55 +08:00
402d378360
[Doc] [1/N] Reorganize Getting Started section ( #11645 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-06 02:18:33 +00:00
9e764e7b10
[distributed] remove pynccl's redundant change_state ( #11749 )
2025-01-06 09:05:48 +08:00
33fc1e2e86
[Frontend] Improve StreamingResponse Exception Handling ( #11752 )
2025-01-05 16:35:01 -05:00
eba17173d3
fix: [doc] fix typo ( #11751 )
...
Co-authored-by: Lancer <maruixiang6688@gmail.com >
2025-01-06 00:48:16 +08:00
635b897246
[distributed] remove pynccl's redundant stream ( #11744 )
2025-01-05 23:09:11 +08:00
4068f4b5b5
[MISC] Replace c10::optional with std::optional ( #11730 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-01-05 10:20:34 +09:00
47831430cc
[Bugfix][V1] Fix test_kv_cache_utils.py ( #11738 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-04 16:07:59 +00:00
65c08928c2
[Model] Remove unnecessary weight initialization logic ( #11736 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-04 23:46:21 +08:00
ba214dffbe
[Bugfix] Fix precision error in LLaVA-NeXT ( #11735 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-04 23:45:57 +08:00
eed11ebee9
[VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-OneVision ( #11717 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-04 11:40:53 +00:00
300acb8347
[Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture ( #11233 )
...
Signed-off-by: Yan Burman <yanburman@users.noreply.github.com >
Signed-off-by: Ido Asraff <idoa@atero.ai >
2025-01-04 14:50:16 +08:00
d91457d529
[V1] Add kv cache utils tests. ( #11513 )
...
Signed-off-by: xcnick <xcnick0412@gmail.com >
2025-01-04 14:49:46 +08:00
fbf2564554
[V1] Add RayExecutor support for AsyncLLM (api server) ( #11712 )
2025-01-04 06:41:31 +00:00
d1d49397e7
Update bnb.md with example for OpenAI ( #11718 )
2025-01-04 06:29:02 +00:00
9c93636d84
Update tool_calling.md ( #11701 )
2025-01-04 06:16:30 +00:00
e5d7ed0c53
[V1] log GPU blocks num for MultiprocExecutor ( #11656 )
2025-01-04 00:13:12 +00:00
ad0d567e1c
[V1] Chore: cruft removal ( #11724 )
2025-01-03 23:25:02 +00:00
bf0d97d786
Update requirements-tpu.txt to support python 3.9 and 3.11 ( #11695 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-03 22:36:46 +00:00
a655eb3025
[Misc] Add BNB quantization for Qwen2VL ( #11719 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-03 15:19:02 -07:00
1543914c04
[V1] Improve TP>1 Error Handling + Stack Trace ( #11721 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-03 21:29:11 +00:00
61fed92c7e
[Bugfix] Fix ColumnParallelLinearWithLoRA slice ( #11708 )
...
Signed-off-by: ZincCat <zincchloride@outlook.com >
2025-01-03 21:02:34 +00:00
80c751e7f6
[V1] Simplify Shutdown ( #11659 )
2025-01-03 17:25:38 +00:00
e1a5c2f0a1
[Model] Whisper model implementation ( #11280 )
...
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com >
2025-01-03 16:39:19 +08:00
fd3a62a122
[perf-benchmark] Fix dependency for steps in benchmark pipeline ( #11710 )
2025-01-02 22:38:37 -08:00
07064cb1d4
[Bugfix] Check chain_speculative_sampling before calling it ( #11673 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-01-02 16:58:56 -08:00
2f1e8e8f54
Update default max_num_batch_tokens for chunked prefill ( #11694 )
2025-01-03 00:25:53 +00:00
68d37809b9
[Misc] Minimum requirements for SageMaker compatibility ( #11576 )
2025-01-02 15:59:25 -08:00
5dba257506
Resolve race conditions in Marlin kernel ( #11493 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2025-01-02 22:58:56 +00:00
187e32997c
[Bugfix] Change kv scaling factor by param json on nvidia gpu ( #11688 )
...
Signed-off-by: bjmsong <bjmsong@126.com >
Co-authored-by: bjmsong <bjmsong@126.com >
2025-01-02 21:11:39 +00:00
b55ed6ef8a
[V1][Minor] Optimize token_ids_cpu copy ( #11692 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-02 12:04:58 -07:00
2f385183f3
[Bugfix] Free cross attention block table for preempted-for-recompute sequence group. ( #10013 )
...
Signed-off-by: Kathy Yu <feiyangyu@google.com >
2025-01-02 10:28:09 -08:00
84c35c374a
According to vllm.EngineArgs, the name should be distributed_executor_backend ( #11689 )
2025-01-02 18:14:16 +00:00
8c38ee7007
[VLM] Merged multi-modal processor for LLaVA-NeXT ( #11682 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-02 16:39:27 +00:00
b6087a6bee
[mypy] Pass type checking in vllm/inputs ( #11680 )
...
Signed-off-by: Tobias Pitters <tobias.pitters@gmail.com >
2025-01-02 16:18:15 +00:00
23c1b10a4c
[VLM][Bugfix] Multi-modal processor compatible with V1 multi-input ( #11674 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-02 17:00:00 +08:00
a115ac46b5
[VLM] Move supported limits and max tokens to merged multi-modal processor ( #11669 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-01 15:44:42 +00:00
73001445fb
[V1] Implement Cascade Attention ( #11635 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-01 21:56:46 +09:00
6d70198b17
[Doc] Fix typo ( #11666 )
...
Signed-off-by: Kazuhiro Serizawa <nserihiro@gmail.com >
2025-01-01 08:10:10 +00:00
f962f426bc
[Misc] Replace space with - in the file names ( #11667 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-01-01 07:39:30 +00:00
11d8a091c6
[Misc] Optimize Qwen2-VL LoRA test ( #11663 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-01 14:42:23 +08:00
365801fedd
[VLM] Add max-count checking in data parser for single image models ( #11661 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-31 22:15:21 -08:00
4db72e57f6
[Bugfix][Refactor] Unify model management in frontend ( #11660 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-01-01 02:21:51 +00:00
0c6f998554
[Benchmark] Add benchmark script for CPU offloading ( #11533 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
Co-authored-by: KuntaiDu <kuntai@uchicago.edu >
2025-01-01 00:10:55 +00:00
e7c7c5e822
[V1][VLM] V1 support for selected single-image models. ( #11632 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-12-31 21:17:22 +00:00
8c3230d8c1
[V1] Simplify vision block hash for prefix caching by removing offset from hash ( #11646 )
2024-12-31 08:56:01 +00:00
2c5718809b
[Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. ( #11565 )
2024-12-31 06:29:04 +00:00
82c49d3260
[Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) ( #6909 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-30 22:15:58 -08:00
74fa1d123c
[Bugfix] Fix OpenAI parallel sampling when using xgrammar ( #11637 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-31 03:43:54 +00:00
a2a40bcd0d
[Model][LoRA]LoRA support added for MolmoForCausalLM ( #11439 )
...
Signed-off-by: Matthias Vogler <matthias.vogler@joesecurity.org >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Matthias Vogler <matthias.vogler@joesecurity.org >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-30 17:33:06 -08:00
ccb1aabcca
[benchmark] Remove dependency for H100 benchmark step ( #11572 )
2024-12-30 12:27:07 -08:00
36e7670045
[Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel ( #11631 )
2024-12-30 18:51:04 +00:00
5886aa496e
[V1] [6/N] API Server: Better Shutdown ( #11586 )
2024-12-30 15:51:02 +00:00
8d9b6721e7
[VLM] Abstract out multi-modal data parsing in merged processor ( #11620 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-30 15:01:35 +00:00
b12e87f942
[platforms] enable platform plugins ( #11602 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-30 20:24:45 +08:00
5dbf854553
[CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels ( #11618 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-12-30 10:17:04 +00:00
970d6d0776
[Build][Kernel] Update CUTLASS to v3.6.0 ( #11607 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-30 17:22:13 +08:00
628ec6c17b
[Docker] bump up neuron sdk v2.21 ( #11593 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2024-12-30 13:46:14 +08:00
3682e33f9f
[v1] fix compilation cache ( #11598 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-30 04:24:12 +00:00
0aa38d16f5
Remove print statement in DeepseekScalingRotaryEmbedding ( #11604 )
2024-12-29 20:16:46 +00:00
faef77c0d6
[Misc] KV cache transfer connector registry ( #11481 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2024-12-29 16:08:09 +00:00
dba4d9dec6
[v1][bugfix] fix cudagraph with inplace buffer assignment ( #11596 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-29 09:03:49 +00:00
32b4c63f02
[Doc] Convert list tables to MyST ( #11594 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-29 15:56:22 +08:00
4fb8e329fd
[V1] [5/N] API Server: unify Detokenizer and EngineCore input ( #11545 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-12-28 20:51:57 +00:00
328841d002
[bugfix] interleaving sliding window for cohere2 model ( #11583 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-28 16:55:42 +00:00
d427e5cfda
[Doc] Minor documentation fixes ( #11580 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-28 21:53:59 +08:00
42bb201fd6
[V1][Minor] Set pin_memory=False for token_ids_cpu tensor ( #11581 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-28 13:33:12 +00:00
59d6bb4c86
[Hardware][AMD]: Replace HIPCC version with more precise ROCm version ( #11515 )
...
Signed-off-by: hjwei <hjwei_xd@163.com >
2024-12-28 11:17:35 +00:00
b7dcc003dc
[Model] Remove hardcoded image tokens ids from Pixtral ( #11582 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-28 10:54:23 +00:00
d34be24bb1
[Model] Support InternLM2 Reward models ( #11571 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-28 06:14:10 +00:00
b5cbe8eeb3
[Bugfix] Last token measurement fix ( #11376 )
...
Signed-off-by: rajveerb <46040700+rajveerb@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-12-28 11:34:46 +08:00
df04dffade
[V1] [4/N] API Server: ZMQ/MP Utilities ( #11541 )
2024-12-28 01:45:08 +00:00
a60731247f
[Doc] Update mllama example based on official doc ( #11567 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2024-12-28 00:31:10 +00:00
ac79799403
[Bugfix] Fix for ROCM compressed tensor support ( #11561 )
2024-12-27 20:12:11 +00:00
dde1fa18c9
[Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix ( #11566 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-27 19:45:13 +00:00
0240402c46
[Misc] Add BNB quantization for MolmoForCausalLM ( #11551 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-27 18:48:24 +00:00
55509c2114
[MODEL] LoRA support for Jamba model ( #11209 )
...
Signed-off-by: Erez Schwartz <erezs@ai21.com >
2024-12-27 17:58:21 +00:00
101418096f
[VLM] Support caching in merged multi-modal processor ( #11396 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-27 17:22:48 +00:00
5ce4627a7e
[Doc] Add xgrammar in doc ( #11549 )
...
Signed-off-by: ccjincong <chenjincong11@gmail.com >
2024-12-27 13:05:10 +00:00
7af553ea30
[Misc] Abstract the logic for reading and writing media content ( #11527 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-27 19:21:23 +08:00
2c9b8ea2b0
[Bugfix] Fix TeleChat2ForCausalLM weights mapper ( #11546 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-27 10:39:15 +00:00
d003f3ea39
Update deploying_with_k8s.md with AMD ROCm GPU example ( #11465 )
...
Signed-off-by: Alex He <alehe@amd.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-27 10:00:04 +00:00
6c6f7fe8a8
[Platform] Move model arch check to platform ( #11503 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2024-12-27 08:45:25 +00:00
2339d59f92
[BugFix] Fix quantization for all other methods ( #11547 )
2024-12-26 22:23:29 -08:00
1b875a0ef3
[V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly ( #11534 )
2024-12-26 21:19:21 -08:00
eb881ed006
[misc] fix typing ( #11540 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-27 11:05:08 +08:00
46d4359450
[CI] Fix broken CI ( #11543 )
2024-12-26 18:49:16 -08:00
81b979f2a8
[V1] Fix yapf ( #11538 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-27 09:47:10 +09:00
371d04d39b
[V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling ( #11394 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-27 09:32:38 +09:00
0c0c2015c5
Update openai_compatible_server.md ( #11536 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-12-26 16:26:18 -08:00
82d24f7aac
[Docs] Document Deepseek V3 support ( #11535 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-12-26 16:21:56 -08:00
f49777ba62
Deepseek v3 ( #11502 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: robertgshaw2-neuralmagic <rshaw@neuralmagic.com >
2024-12-26 16:09:44 -08:00
55fb97f7bd
[2/N] API Server: Avoid ulimit footgun ( #11530 )
2024-12-26 23:43:05 +00:00
2072924d14
[Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization ( #11523 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: HandH1998 <1335248067@qq.com >
2024-12-26 15:33:30 -08:00
720b10fdc6
[1/N] API Server (Remove Proxy) ( #11529 )
2024-12-26 23:03:43 +00:00
b85a977822
[Doc] Add video example to openai client for multimodal ( #11521 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-26 17:31:29 +00:00
eec906d811
[Misc] Add placeholder module ( #11501 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-26 13:12:51 +00:00
f57ee5650d
[Model] Modify MolmoForCausalLM MLP ( #11510 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-26 13:12:05 +00:00
dcb1a944d4
[V1] Adding min tokens/repetition/presence/frequency penalties to V1 sampler ( #10681 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-26 19:02:58 +09:00
7492a36207
[Doc] Add QVQ and QwQ to the list of supported models ( #11509 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-12-26 09:44:32 +00:00
aa25985bd1
[Misc][LoRA] Fix LoRA weight mapper ( #11495 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-26 15:52:48 +08:00
dbeac95dbb
Mypy checking for vllm/compilation ( #11496 )
...
Signed-off-by: lucast2021 <lucast2021@headroyce.org >
Co-authored-by: lucast2021 <lucast2021@headroyce.org >
2024-12-26 05:04:07 +00:00
51a624bf02
[Misc] Move some multimodal utils to modality-specific modules ( #11494 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-26 04:23:20 +00:00
6ad909fdda
[Doc] Improve GitHub links ( #11491 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-25 14:49:26 -08:00
b689ada91e
[Frontend] Enable decord to load video from base64 ( #11492 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-25 16:33:55 +00:00
fc601665eb
[Misc] Update disaggregation benchmark scripts and test logs ( #11456 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
2024-12-25 06:58:48 +00:00
9832e5572a
[V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor ( #11472 )
2024-12-24 19:49:46 -08:00
3f3e92e1f2
[Model] Automatic conversion of classification and reward models ( #11469 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-24 18:22:22 +00:00
409475a827
[Bugfix] Fix issues in CPU build Dockerfile. Fixes #9182 ( #11435 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2024-12-24 16:53:28 +00:00
196c34b0ac
[Misc] Move weights mapper ( #11443 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-24 13:05:25 +00:00
5c7963249d
[attn][tiny fix] fix attn backend in MultiHeadAttention ( #11463 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2024-12-24 12:39:36 +00:00
461cde2080
[OpenVINO] Fixed installation conflicts ( #11458 )
...
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com >
2024-12-24 11:38:21 +00:00
7a5286cc04
[Bugfix][Hardware][CPU] Fix CPU input_positions creation for text-only inputs with mrope ( #11434 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-24 17:59:51 +08:00
b1b1038fbd
[Bugfix] Fix Qwen2-VL LoRA weight loading ( #11430 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-24 09:56:10 +00:00
9edca6bf8f
[Frontend] Online Pooling API ( #11457 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-24 17:54:30 +08:00
4f074fbf53
[Misc] Suppress irrelevant exception stack trace information when CUDA… ( #11438 )
...
Co-authored-by: shiquan <shiquan>
2024-12-24 08:43:39 +00:00
a491d6f535
[V1] TP Ray executor ( #11107 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-12-23 23:00:12 +00:00
32aa2059ad
[Docs] Convert rST to MyST (Markdown) ( #11145 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-12-23 22:35:38 +00:00
94d545a1a1
[Doc] Fix typo in the help message of '--guided-decoding-backend' ( #11440 )
2024-12-23 20:20:44 +00:00
60fb4f3bcf
[Bugfix] Add kv cache scales to gemma2.py ( #11269 )
2024-12-23 19:30:45 +00:00
63afbe9215
[CI] Expand OpenAI test_chat.py guided decoding tests ( #11048 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-23 18:35:38 +00:00
8cef6e02dc
[Misc] add w8a8 asym models ( #11075 )
2024-12-23 13:33:20 -05:00
b866cdbd05
[Misc] Add assertion and helpful message for marlin24 compressed models ( #11388 )
2024-12-24 02:23:38 +08:00
2e726680b3
[Bugfix] torch nightly version in ROCm installation guide ( #11423 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2024-12-23 17:20:22 +00:00
5bfb30a529
[Bugfix] Fix CFGGuide and use outlines for grammars that can't convert to GBNF ( #11389 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-23 23:06:20 +08:00
e51719ae72
mypy type checking for vllm/worker ( #11418 )
...
Signed-off-by: lucast2021 <lucast2021@headroyce.org >
Co-authored-by: lucast2021 <lucast2021@headroyce.org >
2024-12-23 13:55:49 +00:00
f30581c518
[misc][perf] remove old code ( #11425 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-23 08:01:08 +00:00
048fc57a0f
[CI] Unblock H100 Benchmark ( #11419 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-12-22 14:17:43 -08:00
f1d1bf6288
[Bugfix] Fix fully sharded LoRAs with Mixtral ( #11390 )
...
Signed-off-by: Jason Greene <jason.greene@redhat.com >
2024-12-22 23:25:10 +08:00
72d9c316d3
[cd][release] fix race conditions ( #11407 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-22 00:39:11 -08:00
4a9139780a
[cd][release] add pypi index for every commit and nightly build ( #11404 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-12-21 23:53:44 -08:00
29c748930e
[CI] Fix flaky entrypoint tests ( #11403 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-21 21:08:44 -08:00
c2d1b075ba
[Bugfix] Fix issues for Pixtral-Large-Instruct-2411 ( #11393 )
...
Signed-off-by: ywang96 <ywang@example.com >
Co-authored-by: ywang96 <ywang@example.com >
2024-12-21 10:15:03 +00:00
584f0ae40d
[V1] Make AsyncLLMEngine v1-v0 opaque ( #11383 )
...
Signed-off-by: Ricky Xu <xuchen727@hotmail.com >
2024-12-21 15:14:08 +08:00
51ff216d85
[Bugfix] update should_ignore_layer ( #11354 )
...
Signed-off-by: George Ohashi <george@neuralmagic.com >
2024-12-21 06:36:23 +00:00
dd2b5633dd
[V1][Bugfix] Skip hashing empty or None mm_data ( #11386 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-21 14:22:21 +09:00
47a0b615b4
Add ray[default] to wget to run distributed inference out of box ( #11265 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
2024-12-20 13:54:55 -08:00
5d2248d81a
[doc] explain nccl requirements for rlhf ( #11381 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-20 13:00:56 -08:00
d573aeadcc
[Bugfix] Don't log OpenAI field aliases as ignored ( #11378 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-20 19:03:50 +00:00
995f56236b
[Core] Loading model from S3 using RunAI Model Streamer as optional loader ( #10192 )
...
Signed-off-by: OmerD <omer@run.ai >
2024-12-20 16:46:24 +00:00
7c7aa37c69
[CI/Build] fix pre-compiled wheel install for exact tag ( #11373 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
2024-12-21 00:14:40 +08:00
04139ade59
[V1] Fix profiling for models with merged input processor ( #11370 )
...
Signed-off-by: ywang96 <ywang@roblox.com >
2024-12-20 12:04:21 +00:00
1ecc645b8f
[doc] backward compatibility for 0.6.4 ( #11359 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-19 21:33:53 -08:00
c954f21ac0
[misc] add early error message for custom ops ( #11355 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-19 21:18:25 -08:00
86c2d8fd1c
[Bugfix] Fix spec decoding when seed is none in a batch ( #10863 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-12-20 05:15:31 +00:00
b880ffb87e
[Misc] Add tqdm progress bar during graph capture ( #11349 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-20 04:35:18 +00:00
7801f56ed7
[ci][gh200] dockerfile clean up ( #11351 )
...
Signed-off-by: drikster80 <ed.sealing@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: drikster80 <ed.sealing@gmail.com >
Co-authored-by: cenzhiyao <2523403608@qq.com >
2024-12-19 18:13:06 -08:00
48edab8041
[Bugfix][Hardware][POWERPC] Fix auto dtype failure in case of POWER10 ( #11331 )
...
Signed-off-by: Akash Kaothalkar <0052v2@linux.vnet.ibm.com >
2024-12-20 01:32:07 +00:00
a985f7af9f
[CI] Adding CPU docker pipeline ( #11261 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
2024-12-19 11:46:55 -08:00
e461c262f0
[Misc] Remove unused vllm/block.py ( #11336 )
2024-12-19 17:54:24 +00:00
276738ce0f
[Bugfix] Fix broken CPU compressed-tensors test ( #11338 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-19 17:37:31 +00:00
cdf22afdda
[Misc] Clean up and consolidate LRUCache ( #11339 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-20 00:59:32 +08:00
e24113a8fe
[Model] Refactor Qwen2-VL to use merged multimodal processor ( #11258 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 16:28:00 +00:00
7379b3d4b2
[V1] Fix multimodal profiling for Molmo ( #11325 )
...
Signed-off-by: ywang96 <ywang@example.com >
Co-authored-by: ywang96 <ywang@example.com >
2024-12-19 16:27:22 +00:00
6c7f881541
[Model] Add JambaForSequenceClassification model ( #10860 )
...
Signed-off-by: Yehoshua Cohen <yehoshuaco@ai21.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Yehoshua Cohen <yehoshuaco@ai21.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 22:48:06 +08:00
a0f7d53beb
[Bugfix] Cleanup Pixtral HF code ( #11333 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 13:22:00 +00:00
5aef49806d
[Feature] Add load generation config from model ( #11164 )
...
Signed-off-by: liuyanyi <wolfsonliu@163.com >
Signed-off-by: Yanyi Liu <wolfsonliu@163.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-12-19 10:50:38 +00:00
98356735ac
[misc] benchmark_throughput: Add LoRA ( #11267 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-19 15:43:16 +08:00
f26c4aeecb
[Misc] Optimize ray worker initialization time ( #11275 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-18 23:38:02 -08:00
8936316d58
[Kernel] Refactor Cutlass c3x ( #10049 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-19 07:00:18 +00:00
6142ef0ada
[VLM] Merged multimodal processor for Qwen2-Audio ( #11303 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 06:14:17 +00:00
c6b0a7d3ba
[V1] Simplify prefix caching logic by removing num_evictable_computed_blocks ( #11310 )
2024-12-19 04:17:12 +00:00
a30482f054
[CI] Expand test_guided_generate to test all backends ( #11313 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-19 04:00:38 +00:00
17ca964273
[Model] IBM Granite 3.1 ( #11307 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-12-19 11:27:24 +08:00
5a9da2e6e9
[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) ( #11311 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-19 02:43:30 +00:00
fdea8ec167
[V1] VLM - enable processor cache by default ( #11305 )
...
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com >
2024-12-18 18:54:46 -05:00
ca5f54a9b9
[Bugfix] fix minicpmv test ( #11304 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-12-18 10:34:26 -08:00
f954fe0e65
[FIX] update openai version ( #11287 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2024-12-18 10:17:05 -08:00
362cff1eb3
[CI][Misc] Remove Github Action Release Workflow ( #11274 )
2024-12-18 10:16:53 -08:00
996aa70f00
[Bugfix] Fix broken phi3-v mm_processor_kwargs tests ( #11263 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-18 10:16:40 -08:00
60508ffda9
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support ( #10995 )
...
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com >
Co-authored-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-12-18 09:57:16 -05:00
f04e407e6b
[MISC][XPU] update ipex link for CI fix ( #11278 )
2024-12-17 22:34:23 -08:00
8b79f9e107
[Bugfix] Fix guided decoding with tokenizer mode mistral ( #11046 )
2024-12-17 22:34:08 -08:00
866fa4550d
[Bugfix] Restore support for larger block sizes ( #11259 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2024-12-17 16:39:07 -08:00
bf8717ebae
[V1] Prefix caching for vision language models ( #11187 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-17 16:37:59 -08:00
c77eb8a33c
[Bugfix] Set temperature=0.7 in test_guided_choice_chat ( #11264 )
2024-12-17 16:34:06 -08:00
2d1b9baa8f
[Bugfix] Fix request cancellation without polling ( #11190 )
2024-12-17 12:26:32 -08:00
f9ecbb18bf
[Misc] Allow passing logits_soft_cap for xformers backend ( #11252 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-17 00:37:04 -08:00
02222a0256
[Misc] Kernel Benchmark for RMSNorm ( #11241 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Xiaoyu Zhang <BBuf@users.noreply.github.com >
2024-12-17 06:57:02 +00:00
2bfdbf2a36
[V1][Core] Use weakref.finalize instead of atexit ( #11242 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-16 22:11:33 -08:00
e88db68cf5
[Platform] platform agnostic for EngineArgs initialization ( #11225 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-16 22:11:06 -08:00
59c9b6ebeb
[V1][VLM] Proper memory profiling for image language models ( #11210 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: ywang96 <ywang@example.com >
2024-12-16 22:10:57 -08:00
66d4b16724
[Frontend] Add OpenAI API support for input_audio ( #11027 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-16 22:09:58 -08:00
0064f697d3
[CI] Add test case with JSON schema using references + use xgrammar by default with OpenAI parse ( #10935 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-17 11:39:58 +08:00
35bae114a8
fix gh200 tests on main ( #11246 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-16 17:22:38 -08:00
88a412ed3d
[torch.compile] fast inductor ( #11108 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-16 16:15:22 -08:00
c301616ed2
[ci][tests] add gh200 tests ( #11244 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-16 15:53:18 -08:00
35ffa682b1
[Docs] hint to enable use of GPU performance counters in profiling tools for multi-node distributed serving ( #11235 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-12-16 22:20:39 +00:00
551603feff
[core] overhaul memory profiling and fix backward compatibility ( #10511 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-16 13:32:25 -08:00
efbce85f4d
[misc] Layerwise profile updates ( #10242 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-16 18:14:57 +00:00
2ca830dbaa
[Doc] Reorder vision language examples in alphabet order ( #11228 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-16 11:23:33 +00:00
d927dbcd88
[Model] Refactor Ultravox to use merged input processor ( #11198 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-16 10:09:53 +00:00
bddbbcb132
[Model] Support Cohere2ForCausalLM (Cohere R7B) ( #11203 )
2024-12-16 09:56:19 +00:00
b3b1526f03
WIP: [CI/Build] simplify Dockerfile build for ARM64 / GH200 ( #11212 )
...
Signed-off-by: drikster80 <ed.sealing@gmail.com >
Co-authored-by: drikster80 <ed.sealing@gmail.com >
2024-12-16 09:20:49 +00:00
17138af7c4
[Bugfix] Fix the default value for temperature in ChatCompletionRequest ( #11219 )
2024-12-16 00:15:40 -08:00
69ba344de8
[Bugfix] Fix block size validation ( #10938 )
2024-12-15 16:38:40 -08:00
da6f409246
Update deploying_with_k8s.rst ( #10922 )
2024-12-15 16:33:58 -08:00
25ebed2f8c
[V1][Minor] Cache np arange to reduce input preparation overhead ( #11214 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-15 13:33:00 -08:00
d263bd9df7
[Core] Support disaggregated prefill with Mooncake Transfer Engine ( #10884 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2024-12-15 21:28:18 +00:00
38e599d6a8
[Doc] add documentation for disaggregated prefilling ( #11197 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2024-12-15 13:31:16 -06:00
96d673e0f8
[Bugfix] Fix error handling of unsupported sliding window ( #11213 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-15 10:59:42 -07:00
b10609e6a1
[Misc] Clean up multi-modal processor ( #11207 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-15 06:30:28 +00:00
a1c02058ba
[torch.compile] allow tracking forward time ( #11081 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-14 19:45:00 -08:00
15859f2357
[Misc] Upgrade bitsandbytes to the latest version 0.45.0 ( #11201 )
2024-12-15 03:03:06 +00:00
886936837c
[Performance][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion ( #7209 )
2024-12-14 11:38:10 -08:00
6d917d0eeb
Enable mypy checking on V1 code ( #11105 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2024-12-14 09:54:04 -08:00
93abf23a64
[VLM] Fully dynamic prompt replacement in merged input processor ( #11199 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-14 17:52:18 +00:00
9c3dadd1c9
[Frontend] Add logits_processors as an extra completion argument ( #11150 )
...
Signed-off-by: Brad Hilton <brad.hilton.nw@gmail.com >
2024-12-14 16:46:42 +00:00
3cb5769883
[Misc] Minor improvements to the readability of PunicaWrapperBase ( #11200 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-14 16:38:27 +00:00
ea7bd68d10
[V1][Bugfix] Fix V1 TP trust-remote-code ( #11182 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-14 08:21:23 +00:00
48259264a4
[Core] Update outlines and increase its threadpool size ( #11140 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-14 07:46:18 +00:00
24a3d12b82
update compressed-tensors to latest version ( #11183 )
...
Co-authored-by: dhuangnm <dhuang@MacBook-Pro-2.local >
2024-12-14 03:22:44 +00:00
9855aea21b
[Bugfix][V1] Re-compute an entire block when fully cache hit ( #11186 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-13 17:08:23 -08:00
4b5b8a6a3b
[V1][Bugfix] Fix EngineCoreProc profile ( #11185 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-13 17:02:35 -08:00
4863e5fba5
[Core] V1: Use multiprocessing by default ( #11074 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-13 16:27:32 -08:00
0d8451c3a4
[Distributed] Allow the placement group more time to wait for resources to be ready ( #11138 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
2024-12-13 20:17:37 +00:00
0a56bcc03d
[Bugfix][Hardware][CPU] Enable Gemma2 with SDPA on CPU backend ( #11169 )
2024-12-13 18:00:40 +00:00
0920ab9131
[Doc] Reorganize online pooling APIs ( #11172 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-14 00:22:22 +08:00
238c0d93b4
[Misc] Add tokenizer_mode param to benchmark_serving.py ( #11174 )
...
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com >
2024-12-13 16:19:10 +00:00
5b0ed8391d
[Bugfix] using len(tokenizer) instead of tokenizer.vocab_size in AllowedTokenIdsLogitsProcessor ( #11156 )
2024-12-13 15:56:19 +00:00
c31d4a57a6
[Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching ( #8240 )
2024-12-13 07:51:25 -08:00
d1fa714cb1
[Refactor]A simple device-related refactor ( #11163 )
...
Signed-off-by: noemotiovon <noemotiovon@gmail.com >
Co-authored-by: noemotiovon <noemotiovon@gmail.com >
2024-12-13 13:39:00 +00:00
969da7d70b
[V1][VLM] Fix edge case bug for InternVL2 ( #11165 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-13 11:09:30 +00:00
eeec9e3390
[Frontend] Separate pooling APIs in offline inference ( #11129 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-13 10:40:07 +00:00
f93bf2b189
[Bugfix][CI][CPU] add missing datasets package to requirements-cpu.txt ( #11159 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-12-13 08:50:35 +00:00
7cd7409142
PaliGemma 2 support ( #11142 )
2024-12-13 07:40:07 +00:00
be39e3cd18
[core] clean up cudagraph batchsize padding logic ( #10996 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-13 06:57:50 +00:00
34f1a806d5
[Bugfix][V1] Fix 'NoneType' object has no attribute 'hash_value' ( #11157 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-13 06:30:06 +00:00
00c1bde5d8
[ROCm][AMD] Disable auto enabling chunked prefill on ROCm ( #11146 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-13 05:31:26 +00:00
3989a79824
[Bugfix] Update starcoder2 to remap k/v scale names for kv_cache quantization ( #11148 )
2024-12-13 05:07:20 +00:00
1efce68605
[Bugfix] Use runner_type instead of task in GritLM ( #11144 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2024-12-13 04:09:53 +00:00
30870b4f66
[torch.compile] Dynamic fp8 + rms_norm fusion ( #10906 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-13 03:19:23 +00:00
78ed8f57d8
[Misc][V1] Fix type in v1 prefix caching ( #11151 )
2024-12-13 00:57:40 +00:00
db6c264a1e
[Bugfix] Fix value unpack error of simple connector for KVCache transfer. ( #11058 )
...
Signed-off-by: ShangmingCai <csmthu@gmail.com >
2024-12-12 21:19:17 +00:00
9f3974a319
Fix logging of the vLLM Config ( #11143 )
2024-12-12 12:05:57 -08:00
2c97eca1ff
[Misc] Validate grammar and fail early ( #11119 )
2024-12-12 18:34:26 +00:00
5d712571af
[Bugfix] Quick fix to make Pixtral-HF load correctly again after 39e227c7ae. ( #11024 )
2024-12-12 18:09:20 +00:00
d4d5291cc2
fix(docs): typo in helm install instructions ( #11141 )
...
Signed-off-by: Ramon Ziai <ramon.ziai@bettermarks.com >
2024-12-12 17:36:32 +00:00
4816d20aa4
[V1] Fix torch profiling for offline inference ( #11125 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-12 15:51:53 +00:00
85362f028c
[Misc][LoRA] Ensure Lora Adapter requests return adapter name ( #11094 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-12 09:25:16 +00:00
62de37a38e
[core][distributed] initialization from StatelessProcessGroup ( #10986 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-12 09:04:19 +00:00
8195824206
[Hardware][Intel-Gaudi] Enable LoRA support for Intel Gaudi (HPU) ( #10565 )
...
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai >
2024-12-12 08:09:28 +00:00
f092153fbe
[V1] Use more persistent buffers to optimize input preparation overheads ( #11111 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-11 23:14:20 -08:00
1da8f0e1dd
[Model] Add support for embedding model GritLM ( #10816 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2024-12-12 06:39:16 +00:00
ccede2b264
[Core] cleanup zmq ipc sockets on exit ( #11115 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-11 19:12:24 -08:00
24a36d6d5f
Update link to LlamaStack remote vLLM guide in serving_with_llamastack.rst ( #11112 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2024-12-12 02:39:21 +00:00
8fb26dac61
[Docs] Add media kit ( #11121 )
2024-12-11 17:33:11 -08:00
7439a8b5fc
[Bugfix] Multiple fixes to tool streaming with hermes and mistral ( #10979 )
...
Signed-off-by: cedonley <clayton@donley.io >
2024-12-12 01:10:12 +00:00
4e11683368
[V1] VLM preprocessor hashing ( #11020 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-12 00:55:30 +00:00
452a723bf2
[V1][Core] Remove should_shutdown to simplify core process termination ( #11113 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-11 23:34:54 +00:00
d1e21a979b
[CI/Build] Split up VLM tests ( #11083 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-12 06:18:16 +08:00
72ff3a9686
[core] Bump ray to use _overlap_gpu_communication in compiled graph tests ( #10410 )
...
Signed-off-by: Rui Qiao <ubuntu@ip-172-31-15-128.us-west-2.compute.internal >
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Rui Qiao <ubuntu@ip-172-31-15-128.us-west-2.compute.internal >
2024-12-11 11:36:35 -08:00
66aaa7722d
[torch.compile] remove graph logging in ci ( #11110 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-11 10:59:50 -08:00
d643c2aba1
[V1] Use input_ids as input for text-only models ( #11032 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-11 10:49:23 -08:00
91642db952
[torch.compile] use depyf to dump torch.compile internals ( #10972 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-11 10:43:05 -08:00
fd22220687
[Doc] Installed version of llmcompressor for int8/fp8 quantization ( #11103 )
...
Signed-off-by: Guangda Liu <bingps@users.noreply.github.com >
Co-authored-by: Guangda Liu <bingps@users.noreply.github.com >
2024-12-11 15:43:24 +00:00
b2f775456e
[CI/Build] Enable prefix caching test for AMD ( #11098 )
...
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com >
2024-12-11 15:23:37 +00:00
cad5c0a6ed
[Doc] Update docs to refer to pooling models ( #11093 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-11 13:36:27 +00:00
8f10d5e393
[Misc] Split up pooling tasks ( #10820 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-11 01:28:00 -08:00
40766ca1b8
[Bugfix]: Clamp -inf logprob values in prompt_logprobs ( #11073 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-12-11 01:27:39 -08:00
2e32f5d28d
[Bugfix] Fix Idefics3 fails during multi-image inference ( #11080 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-12-11 01:27:07 -08:00
61b1d2f6ae
[Core] v1: Use atexit to handle engine core client shutdown ( #11076 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-11 01:26:36 -08:00
9974fca047
[ci/build] Fix entrypoints test and pin outlines version ( #11088 )
2024-12-11 01:01:53 -08:00
3fb4b4f163
[ci/build] Fix AMD CI dependencies ( #11087 )
2024-12-11 00:39:53 -08:00
2e33fe4191
[CI/Build] Check transformers v4.47 ( #10991 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-11 05:02:02 +00:00
e39400a4b6
Fix streaming for granite tool call when <|tool_call|> is present ( #11069 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-12-11 04:51:40 +00:00
ffa48c9146
[Model] PP support for Mamba-like models ( #10992 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-12-10 21:53:37 -05:00
d5c5154fcf
[Misc] LoRA + Chunked Prefill ( #9057 )
2024-12-11 10:09:20 +08:00
9a93973708
[Bugfix] Fix Mamba multistep ( #11071 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-11 00:16:22 +00:00
134810b3d9
[V1][Bugfix] Always set enable_chunked_prefill = True for V1 ( #11061 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-10 14:41:23 -08:00
75f89dc44c
[torch.compile] add a flag to track batchsize statistics ( #11059 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-10 12:40:52 -08:00
e739194926
[Core] Update to outlines >= 0.1.8 ( #10576 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-10 12:08:16 -08:00
250ee65d72
[BUG] Remove token param #10921 ( #11022 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
2024-12-10 17:38:15 +00:00
9b9cef3145
[Bugfix] Backport request id validation to v0 ( #11036 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-12-10 16:38:23 +00:00
d05f88679b
[Misc][LoRA] Add PEFTHelper for LoRA ( #11003 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-10 11:12:01 +00:00
beb16b2c81
[Bugfix] Handle <|tool_call|> token in granite tool parser ( #11039 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-12-10 10:27:11 +00:00
fe2e10c71b
Add example of helm chart for vllm deployment on k8s ( #9199 )
...
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com >
2024-12-10 09:19:27 +00:00
82c73fd510
[Bugfix] cuda error running llama 3.2 ( #11047 )
2024-12-10 07:41:11 +00:00
bfd610430c
Update README.md ( #11034 )
2024-12-09 23:08:10 -08:00
e35879c276
[Bugfix] Fix xgrammar failing to read a vocab_size from LlavaConfig on PixtralHF. ( #11043 )
2024-12-10 14:54:22 +08:00
ebf778061d
monitor metrics of tokens per step using cudagraph batchsizes ( #11031 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-09 22:35:36 -08:00
28b3a1c7e5
[V1] Multiprocessing Tensor Parallel Support for v1 ( #9856 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-10 06:28:14 +00:00
bc192a2b09
[Pixtral] Improve loading ( #11040 )
2024-12-10 06:09:32 +00:00
980ad394a8
[Frontend] Use request id from header ( #10968 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-12-10 13:46:29 +08:00
391d7b2763
[Bugfix] Fix usage of deprecated decorator ( #11025 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-10 13:45:47 +08:00
d1f6d1c8af
[Model] Add has_weight to RMSNorm and re-enable weights loading tracker for Mamba ( #10739 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-10 10:23:07 +08:00
6d525288c1
[Docs] Add dedicated tool calling page to docs ( #10554 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-09 20:15:34 -05:00
6faec54505
[V1] Do not store None in self.generators ( #11038 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-09 15:08:19 -08:00
5ed5d5f128
Build tpu image in release pipeline ( #10936 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
2024-12-09 23:07:48 +00:00
b63ba84832
[ROCm][bugfix] speculative decoding worker class ( #11035 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-09 14:00:29 -08:00
9c6459e4cb
[Neuron] Upgrade neuron to 2.20.2 ( #11016 )
...
Signed-off-by: Jerzy Zagorski <jzagorsk@amazon.com >
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-12-09 13:53:24 -08:00
1a2f8fb828
[v1] fix use compile sizes ( #11000 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-09 13:47:24 -08:00
cbcbdb1ceb
[Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version ( #11028 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2024-12-09 13:21:06 -08:00
a811dd6608
[Model] merged input processor for Phi-3-Vision models ( #10977 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-09 12:55:10 -08:00
ca871491ed
[Misc][LoRA] Abstract PunicaWrapper ( #10955 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-09 12:54:44 -08:00
3b61cb450d
[V1] Further reduce CPU overheads in flash-attn ( #10989 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-09 12:38:46 -08:00
edc4fa3188
[ci/build] Recompile CI dependencies list with Python 3.12 ( #11013 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-12-09 11:46:58 -08:00
25b79d9fd3
[V1] Input Batch Relocation ( #10962 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-09 09:33:41 -08:00
aea2fc38c3
[Platform] Move async output check to platform ( #10768 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-09 17:24:46 +00:00
e691b26f6f
[Core] Require xgrammar >= 0.1.6 ( #11021 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-09 16:44:27 +00:00
c690357928
[V1] Fix Detokenizer loading in AsyncLLM ( #10997 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-09 16:27:10 +00:00
d1c2e15eb3
[torch.compile] add dynamo time tracking ( #11005 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 23:09:04 -08:00
af7c4a92e6
[Doc][V1] Add V1 support column for multimodal models ( #10998 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-08 22:29:16 -08:00
46004e83a2
[misc] clean up and unify logging ( #10999 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 17:28:27 -08:00
43b05fa314
[torch.compile][misc] fix comments ( #10993 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 11:18:18 -08:00
a11f326528
[V1] Initial support of multimodal models for V1 re-arch ( #10699 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-08 12:50:51 +00:00
fd57d2b534
[torch.compile] allow candidate compile sizes ( #10984 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 11:05:21 +00:00
7be15d9356
[core][misc] remove use_dummy driver for _run_workers ( #10920 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-07 12:06:08 -08:00
1b62745b1d
[core][executor] simplify instance id ( #10976 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-07 09:33:45 -08:00
78029b34ed
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None ( #10928 )
...
Signed-off-by: xffxff <1247714429@qq.com >
2024-12-08 01:21:18 +08:00
c889d5888b
[Doc] Explicitly state that PP isn't compatible with speculative decoding yet ( #10975 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 17:20:49 +00:00
39e227c7ae
[Model] Update multi-modal processor to support Mantis(LLaVA) model ( #10711 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 17:10:05 +00:00
1c768fe537
[Doc] Explicitly state that InternVL 2.5 is supported ( #10978 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 16:58:02 +00:00
bf0e382e16
[Model] Composite weight loading for multimodal Qwen2 ( #10944 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 07:22:52 -07:00
b26b4cd03c
[Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora implementation ( #10958 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-07 18:33:49 +08:00
f13cf9ad50
[Build] Fix for the Wswitch-bool clang warning ( #10060 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-07 09:03:44 +00:00
955fa9533a
[3/N] Support and implement merged input processor for LLaVA model ( #10676 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-07 00:50:58 -08:00
acf092d348
[Bugfix] Fix test-pipeline.yaml ( #10973 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-07 12:08:54 +08:00
69d357ba12
[Core] Cleanup startup logging a bit ( #10961 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-07 02:30:23 +00:00
dcdc3fafe5
[ci] fix broken tests ( #10956 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-06 11:25:47 -08:00
c05cfb67da
[misc] fix typo ( #10960 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-06 11:25:20 -08:00
7406274041
[Doc] add KubeAI to serving integrations ( #10837 )
...
Signed-off-by: Sam Stoelinga <sammiestoel@gmail.com >
2024-12-06 17:03:56 +00:00
8b59631855
[Core] Support Lark grammars for XGrammar ( #10870 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-06 08:34:29 -07:00
a1887f2c96
[torch.compile] fix deprecated code ( #10948 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-06 11:01:23 +00:00
222f5b082a
[CI/Build] Fix broken multimodal test ( #10950 )
2024-12-06 10:41:23 +00:00
b031a455a9
[torch.compile] add logging for compilation time ( #10941 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-06 10:07:15 +00:00
db87eb6c67
[torch.compile] use size tuning for specific sizes ( #10933 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-05 20:30:41 -08:00
9743d64e4e
[ci][build] add tests for python only compilation ( #10915 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-05 08:54:47 -08:00
a43065272f
[Misc][Gaudi] Avoid torch.compile and enable lazy collectives ( #10897 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2024-12-05 08:47:46 -08:00
998eeafe58
[CI/Build] Bump test transformers version ( #10106 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-05 16:05:52 +00:00
571da8fc43
[Misc][LoRA] Clean up the function interface of Punica ( #10917 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-05 13:22:28 +00:00
39c89e71a8
[Misc] Update llama 3.2 template to support system prompt with images ( #10901 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-12-05 05:54:06 +00:00
1f958a7d52
[Bugfix] Fix BNB loader target_modules ( #10720 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-05 13:20:26 +08:00
aa39a8e175
[Doc] Create a new "Usage" section ( #10827 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-05 11:19:35 +08:00
8d370e91cb
[Bugfix] Fallback to outlines for complex json schemas ( #10899 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-05 11:14:06 +08:00
7883c2bbe7
[benchmark] Make H100 benchmark optional ( #10908 )
2024-12-04 17:02:17 -08:00
2a56e1264f
[V1] Fix when max_model_len is not divisible by block_size ( #10903 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-04 16:54:05 -08:00
e4c34c23de
[CI/Build] improve python-only dev setup ( #9621 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-12-04 21:48:13 +00:00
82eb5ea8f3
Benchmark serving structured output ( #10880 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-12-04 16:28:21 -05:00
10398b4706
[Model] Consolidate ViTs attention implementation without mask ( #10893 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-04 18:11:08 +00:00
01d079fd8e
[LoRA] Change lora_tokenizers capacity ( #10796 )
...
Signed-off-by: Xin Yang <xyang19@gmail.com >
2024-12-04 17:40:16 +00:00
c92acb9693
[ci/build] Update vLLM postmerge ECR repo ( #10887 )
2024-12-04 09:01:20 +00:00
8db957ee3a
[bugfix] fix parameter “n” when parameter “best_of” is set > 1 ( #10854 )
...
Signed-off-by: jianzheng <57654625+o2363286@users.noreply.github.com >
2024-12-04 08:48:22 +00:00
c9ca4fce3f
[ci/build] Job to build and push release image ( #10877 )
2024-12-04 15:02:40 +08:00
fa2dea61df
[ci/build] Change queue name for Release jobs ( #10875 )
2024-12-04 15:02:16 +08:00
b5b647b084
Drop ROCm load format check ( #10767 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-04 04:32:21 +00:00
d2bd88b122
[CI/Build] Replace mean with torch.all in test_pynccl.py ( #10876 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-04 03:23:21 +00:00
381ac93bb5
[Benchmark] Benchmark structured output with datasets ( #10557 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
2024-12-03 17:21:06 -07:00
a061fe601e
[Build][Bugfix] Using the correct type hint ( #10866 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-03 15:47:55 -05:00
7c32b6861e
[Frontend] correctly record prefill and decode time metrics ( #10853 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com >
2024-12-03 19:13:31 +00:00
7090c27bb2
[Bugfix] Only require XGrammar on x86 ( #10865 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-03 10:32:21 -08:00
2f2cdc745a
[MISC][XPU] quick fix for XPU CI ( #10859 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2024-12-03 17:16:31 +00:00
3bc94cab69
[V1] VLM - Run the mm_mapper preprocessor in the frontend process ( #10640 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-03 10:33:10 +00:00
f6084f6324
[Speculative Decoding] Move indices to device before filtering output ( #10850 )
...
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com >
2024-12-03 17:01:39 +08:00
9323a3153b
[Core][Performance] Add XGrammar support for guided decoding and set it as default ( #10785 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-12-03 15:17:00 +08:00
3257d449fa
[Misc] Remove deprecated names ( #10817 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-03 06:52:57 +00:00
ef51831ee8
[Doc] Add github links for source code references ( #10672 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-03 06:46:07 +00:00
dc5ce861bf
[torch.compile] remove compilation_context and simplify code ( #10838 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-03 06:19:02 +00:00
21fe7b481a
[core][distributed] add pynccl broadcast ( #10843 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-03 04:53:23 +00:00
a4cf256159
[Bugfix] Fix QKVParallelLinearWithShardedLora bias bug ( #10844 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-03 12:10:29 +08:00
d746268e92
[Model] support bitsandbytes quantization with minicpm model ( #10842 )
...
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com >
2024-12-03 03:06:41 +00:00
4433195ab7
[Bugfix] Prevent benchmark_throughput.py from using duplicated random prompts ( #10753 )
2024-12-03 02:26:15 +00:00
4c05edb33a
[Model] Add TP and BNB quantization support to LlavaMultiModalProjector ( #10834 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-02 23:06:09 +00:00
9b14d978aa
Fix openvino on GPU ( #10793 )
2024-12-02 18:52:19 +00:00
519cc6ca12
[Misc][XPU] Avoid torch compile for XPU platform ( #10747 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-12-02 17:53:55 +00:00
b45f0d7946
[Misc][LoRA] Move the implementation of lora bias to punica.py ( #10829 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-02 17:53:36 +00:00
a4c4daf364
[misc] use out argument for flash attention ( #10822 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-02 10:50:10 +00:00
e95f275f57
[CI/Build] Update mistral_common version for tests and docs ( #10825 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-02 10:26:10 +00:00
ef31eabc68
[Model] add some tests for aria model ( #10770 )
...
Signed-off-by: xffxff <1247714429@qq.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-12-02 05:36:36 +00:00
995a148575
[doc] Update config docstring ( #10732 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-02 04:14:45 +00:00
63a164172d
[misc] remove xverse modeling file ( #10814 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-02 03:27:13 +00:00
e25810ae29
Fill TorchSDPAAttentionMetadata seq_lens_field for prefill ( #10799 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-12-02 10:05:32 +08:00
073a4bd1c0
[Kernel] Use out arg in flash_attn_varlen_func ( #10811 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-01 17:55:39 -08:00
b7954776fd
[core] Avoid metrics log noise when idle - include speculative decodi… ( #10809 )
2024-12-02 01:49:48 +00:00
b18c9bbaba
[Model] Add BNB support to Llava and Pixtral-HF ( #10795 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-02 01:31:09 +00:00
0590ec3fd9
[Core] Implement disagg prefill by StatelessProcessGroup ( #10502 )
...
This PR provides initial support for single-node disaggregated prefill in a 1P1D scenario.
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
Co-authored-by: ApostaC <yihua98@uchicago.edu >
Co-authored-by: YaoJiayi <120040070@link.cuhk.edu.cn >
2024-12-01 19:01:00 -06:00
c11f172187
[Misc] Adding MMMU-Pro vision dataset to serving benchmark ( #10804 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-12-01 08:47:05 +00:00
169a0ff911
[doc] add warning about comparing hf and vllm outputs ( #10805 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-01 00:41:38 -08:00
d2f058e76c
[Misc] Rename embedding classes to pooling ( #10801 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-01 14:36:51 +08:00
f877a7d12a
[Misc] Improve type annotations for support_torch_compile ( #10763 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-30 17:48:35 -08:00
133707123e
[Model] Replace embedding models with pooling adapter ( #10769 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-01 08:02:54 +08:00
7e4bbda573
[doc] format fix ( #10789 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-11-30 11:38:40 +00:00
e7cfc4ef4c
[Interleaved ATTN] Support for Mistral-8B ( #10591 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-30 07:45:50 +00:00
16ee07f22a
[Model] Refactor Molmo weights loading to use AutoWeightsLoader ( #10771 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-30 04:19:14 +00:00
40bc242579
[Bugfix] Fix OpenVino/Neuron driver_worker init ( #10779 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-11-30 12:07:13 +08:00
661175bc82
[platform] Add verify_quantization in platform. ( #10757 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-11-29 15:22:21 +00:00
3132aac043
[Bugfix] Fix Idefics3 bug ( #10778 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-29 13:56:46 +00:00
c82b432d4a
[Misc] Fix typo in sampling_metadata.py ( #10740 )
2024-11-29 05:17:57 +00:00
fa6ecb9aa7
[Model] Clean up MiniCPMV ( #10751 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-29 04:47:06 +00:00
c83919c7a6
[Model] Add Internlm2 LoRA support ( #5064 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-28 17:29:04 +00:00
98f47f2a40
[V1] Optimize the CPU overheads in FlashAttention custom op ( #10733 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 09:01:02 -08:00
8c1e77fb58
[Kernel] Update vllm-flash-attn version to reduce CPU overheads ( #10742 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 08:31:28 -08:00
5fc5ce0fe4
[Model] Add GLM-4 series HF-format model support (vllm==0.6.4) ( #10561 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-11-28 14:53:31 +00:00
3ed5e73146
[TPU] Update requirements-tpu ( #10726 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
2024-11-28 02:30:48 -08:00
9a8bff0285
[Kernel] Update vllm-flash-attn version ( #10736 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 02:25:59 -08:00
a79b122400
[V1] Do not allocate beyond the max_model_len ( #10730 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 00:13:15 -08:00
d9b4b3f069
[Bug][CLI] Allow users to disable prefix caching explicitly ( #10724 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-27 23:59:28 -08:00
278be671a3
[Doc] Update model in arch_overview.rst to match comment ( #10701 )
...
Signed-off-by: spacewander <spacewanderlzx@gmail.com >
2024-11-27 23:58:39 -08:00
70dc14fbd0
[Model] support bitsandbytes quantization with minicpm3 model ( #10682 )
...
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com >
2024-11-27 23:58:02 -08:00
cb4e1c3f3a
[misc] upgrade filelock version ( #10731 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-27 19:54:58 -08:00
395b1c7454
[Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server ( #10635 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com >
2024-11-27 13:21:10 -08:00
9b4b150395
[Bugfix] Ignore lm_head when loading embedding models ( #10719 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-27 19:05:29 +00:00
197b4484a3
[Bugfix][Mamba] Fix Multistep on Mamba-like models ( #10705 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-11-27 19:02:27 +00:00
b98c62ba49
[Bugfix] Fix GGUF inference with FP16 unquantized checkpoint ( #10675 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-27 10:43:17 -08:00
c411def234
[torch.compile] fix shape specialization ( #10722 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-27 10:16:10 -08:00
308cc5e21e
[ci] fix slow tests ( #10698 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-27 09:26:14 -08:00
9e0a147d50
[V1] Update interface for mistral-format Pixtral ( #10703 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-27 12:26:27 +00:00
418cb3b93f
[Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault ( #10700 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-27 11:55:38 +00:00
1209261e93
[Model] Support telechat2 ( #10311 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: xiangw2 <xiangw2@chinatelecom.cn >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-11-27 11:32:35 +00:00
e2251109c7
[Kernel] Remove if-else with identical branches in marlin 2:4 ( #10687 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-26 22:55:32 -08:00
15cc2a9f1a
[Misc]Further reduce BNB static variable ( #10597 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-26 22:54:12 -08:00
e85250b1d1
[Hardware][Gaudi]add get_name method for HPUAttentionBackend ( #10667 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2024-11-26 22:49:40 -08:00
cfb3bf25fb
[bugfix] fix the default value of llm_int8_threshold in BitsAndBytesConfig ( #10657 )
2024-11-27 13:55:23 +08:00
1bf905ddaa
[Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. ( #10198 )
...
Signed-off-by: jeongin601 <0200angela@gmail.com >
Signed-off-by: jeong_in.bae <jeong_in.bae@navercorp.com >
2024-11-27 05:07:30 +00:00
0a4d968500
[V1] Update interface for idefics3 ( #10680 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-27 10:04:01 +08:00
0a71900bc9
Remove hard-dependencies of Speculative decode to CUDA workers ( #10587 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2024-11-26 17:57:11 -08:00
2f0a0a17a4
[V1] Refactor model executable interface for multimodal models ( #10570 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-26 20:46:11 +00:00
7576cd38df
[Bugfix] Check bnb_4bit_quant_storage for bitsandbytes ( #10642 )
2024-11-26 12:29:00 -08:00
9a99273b48
[Bugfix] Fix using -O[0,3] with LLM entrypoint ( #10677 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-26 10:44:01 -08:00
f5792c7c4a
[Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson ( #9735 )
...
Signed-off-by: Conroy Cheers <conroy@corncheese.org >
2024-11-26 10:26:28 -08:00
db66e018ea
[Bugfix] Fix for Spec model TP + Chunked Prefill ( #10232 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
Signed-off-by: Sourashis Roy <sroy@roblox.com >
Co-authored-by: Sourashis Roy <sroy@roblox.com >
2024-11-26 09:11:16 -08:00
1f6584ee85
[V1] Enable profile for LLMEngine ( #10665 )
2024-11-26 10:36:45 +00:00
334d64d1e8
[ci] add vllm_test_utils ( #10659 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-26 00:20:04 -08:00
940635343a
[Misc] Remove outdated init protocols ( #10655 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-26 14:55:00 +08:00
9a88f89799
custom allreduce + torch.compile ( #10121 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-25 22:00:16 -08:00
519e8e4182
[v1] EngineArgs for better config handling for v1 ( #10382 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-25 21:09:43 -08:00
a6760f6456
[Feature] vLLM ARM Enablement for AARCH64 CPUs ( #9228 )
...
Signed-off-by: Sanket Kale <sanketk.kale@fujitsu.com >
Co-authored-by: Sanket Kale <sanketk.kale@fujitsu.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-11-25 18:32:39 -08:00
45ac4ff270
[bugfix] fix aria model and add torch.compile ( #10645 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 18:32:09 -08:00
6e9ff050c8
[misc] do not read HOST_IP ( #10644 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 17:04:50 -08:00
9db713a1dc
[Model] Add OLMo November 2024 model ( #10503 )
2024-11-25 17:26:40 -05:00
1b583cfefa
[Doc] Fix typos in docs ( #10636 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 10:15:45 -08:00
cf73f0c95e
[Model] Enable optional prefix when loading embedding models ( #10639 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 18:14:33 +00:00
b1d920531f
[Model]: Add support for Aria model ( #10514 )
...
Signed-off-by: xffxff <1247714429@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-11-25 18:10:55 +00:00
452a4e80c3
[Docs] Add Snowflake Slides ( #10641 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-25 09:34:46 -08:00
c27df94e1f
[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices ( #9850 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-11-25 12:23:32 -05:00
d04b13a380
[Bug]: Authorization ignored when root_path is set ( #10606 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-25 16:21:41 +00:00
2b0879bfc2
Super tiny little typo fix ( #10633 )
2024-11-25 13:08:30 +00:00
ed46f14321
[Model] Support is_causal HF config field for Qwen2 model ( #10621 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 09:51:20 +00:00
05d1f8c9c6
[misc] move functions to config.py ( #10624 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 09:27:30 +00:00
25d806e953
[misc] add torch.compile compatibility check ( #10618 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-24 23:40:08 -08:00
65813781a2
[torch.compile] add warning for unsupported models ( #10622 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-24 23:27:51 -08:00
7c2134beda
[torch.compile] force inductor threads ( #10620 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-24 23:04:21 -08:00
a30a605d21
[Doc] Add encoder-based models to Supported Models page ( #10616 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 06:34:07 +00:00
571841b7fc
[torch.compile] support encoder based models ( #10613 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 05:24:33 +00:00
7ea3cd7c3e
[Refactor][MISC] del redundant code in ParallelConfig.postinit ( #10614 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-25 05:14:56 +00:00
214efc2c3c
Support Cross encoder models ( #10400 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Co-authored-by: Flavia Beo <flavia.beo@ibm.com >
2024-11-24 18:56:20 -08:00
49628fe13e
[Doc] Update README.md with Ray Summit talk links ( #10610 )
2024-11-24 16:45:09 -08:00
e4fbb14414
[doc] update the code to add models ( #10603 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-24 11:21:40 -08:00
c055747867
[model][utils] add extract_layer_index utility function ( #10599 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-23 22:22:54 -08:00
eda2b3589c
Revert "Print running script to enhance CI log readability" ( #10601 )
2024-11-23 21:31:47 -08:00
1c445dca51
[CI/Build] Print running script to enhance CI log readability ( #10594 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-24 03:57:13 +00:00
1700c543a5
[Bugfix] Fix LoRA weight sharding ( #10450 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-23 17:23:17 -08:00
17d8fc1806
[bugfix] Fix example/tensorize_vllm_model tests ( #10595 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-23 17:22:33 -08:00
04668ebe7a
[Bugfix] Avoid import AttentionMetadata explicitly in Mllama ( #10593 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-23 18:12:20 +00:00
651f6c31ac
For ppc64le, disabled tests for now and addressed space issues ( #10538 )
2024-11-23 09:33:53 +00:00
86a44fb896
[Platforms] Refactor openvino code ( #10573 )
...
Signed-off-by: statelesshz <hzji210@gmail.com >
2024-11-22 22:23:12 -08:00
4cfe5d2bca
[Bugfix] multi_modal_kwargs broadcast for CPU tensor parallel ( #10541 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-22 21:25:46 -08:00
c8acd80548
[2/N] handling placeholders in merged multi-modal processor ( #10485 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-22 21:25:09 -08:00
4634a89d18
Prefix Cache Aware Scheduling [1/n] ( #10128 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-22 21:15:55 -08:00
7c25fe45a6
[AMD] Add support for GGUF quantization on ROCm ( #10254 )
2024-11-22 21:14:49 -08:00
02a43f82a9
Update default max_num_batch_tokens for chunked prefill to 2048 ( #10544 )
2024-11-22 21:14:19 -08:00
cfea9c04ef
[Model] Fix Baichuan BNB online quantization ( #10572 )
...
Signed-off-by: Chen Wu <cntryroa@gmail.com >
2024-11-22 21:13:59 -08:00
7d8ffb344f
[Bugfix] Internal Server Error when tool_choice is incorrect. ( #10567 )
...
Signed-off-by: Varun Shenoy <varun.vinayak.shenoy@oracle.com >
2024-11-22 21:13:29 -08:00
4aba6e3d1a
[core] gemma2 full context length support ( #10584 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 20:13:54 -08:00
978b39744b
[Misc] Add pynccl wrappers for all_gather and reduce_scatter ( #9432 )
2024-11-22 22:14:03 -05:00
ebda51968b
[Core] Fix broken log configuration ( #10458 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-23 10:23:51 +08:00
9195dbdbca
[Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use ( #10164 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-11-23 10:17:38 +08:00
d559979c54
[bugfix] fix cpu tests ( #10585 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 17:34:03 -08:00
d345f409b7
[V1] EngineCore supports profiling ( #10564 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2024-11-22 17:16:15 -08:00
28598f3939
[Core] remove temporary local variables in LLMEngine.__init__ ( #10577 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-22 16:22:53 -08:00
948c859571
support bitsandbytes quantization with qwen model ( #10549 )
...
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com >
2024-11-22 16:16:14 -08:00
97814fbf0f
[v1] Refactor KVCacheManager for more hash input than token ids ( #10507 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-11-22 23:27:25 +00:00
eebad39f26
[torch.compile] support all attention backends ( #10558 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 14:04:42 -08:00
db100c5cde
[bugfix] fix full graph tests ( #10581 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 10:02:14 -08:00
11fcf0e066
Remove token-adding chat embedding params ( #10551 )
...
Signed-off-by: Noam Gat <noamgat@gmail.com >
2024-11-21 23:59:47 -08:00
b6374e09b0
[Bugfix] Fix Phi-3 BNB quantization with tensor parallel ( #9948 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-22 15:01:56 +08:00
a111d0151f
[platforms] absorb worker cls difference into platforms folder ( #10555 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2024-11-21 21:00:32 -08:00
446c7806b2
[Minor] Fix line-too-long ( #10563 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-21 19:40:40 -08:00
33e0a2540a
[9/N] torch.compile LLM usage ( #10552 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-21 19:13:31 -08:00
aed074860a
[Benchmark] Add new H100 machine ( #10547 )
2024-11-21 18:27:20 -08:00
9afa014552
Add small example to metrics.rst ( #10550 )
2024-11-21 23:43:43 +00:00
46fe9b46d8
[Minor] Revert change in offline inference example ( #10545 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-21 21:28:16 +00:00
cf656f5a02
[misc] improve error message ( #10553 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-21 13:13:17 -08:00
edec3385b6
[CI][Installation] Avoid uploading CUDA 11.8 wheel ( #10535 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-11-21 13:03:58 -08:00
f9310cbd0c
[V1] Fix Compilation config & Enable CUDA graph by default ( #10528 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-21 12:53:39 -08:00
7560ae5caf
[8/N] enable cli flag without a space ( #10529 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-21 12:30:42 -08:00
e7a8341c7c
[Bugfix] Allow token ID-only inputs in Qwen2-Audio ( #10536 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-21 18:09:43 +00:00
c51e397fe8
[Misc] Suppress duplicated logging regarding multimodal input pipeline ( #10530 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-21 09:21:31 -08:00
2385b60d83
[Kernel] Register punica ops directly ( #10522 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-21 09:18:11 -08:00
da7e702c6f
[Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored ( #10180 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-21 16:24:32 +00:00
4d676f0852
[Bugfix] Embedding model pooling_type equals ALL and multi input's bug ( #10494 )
2024-11-21 14:40:02 +00:00
d5ec121f95
[Model] Expose dynamic_image_size as mm_processor_kwargs for InternVL2 models ( #10518 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-21 14:20:08 +00:00
8a93a598d9
fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len ( #10524 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
2024-11-21 11:15:36 +00:00
1cfde82ffd
[Model] Add Support for Multimodal Granite Models ( #10291 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-21 10:46:20 +00:00
f0e0238016
[Doc] fix a small typo in docstring of llama_tool_parser ( #10513 )
2024-11-21 09:05:23 +00:00
aaddce5d26
[platforms] improve error message for unspecified platforms ( #10520 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 23:07:56 -08:00
3430857b64
[Misc] Increase default video fetch timeout ( #10495 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-20 23:06:42 -08:00
8b0fe06c89
[torch.compile] Inductor code caching fix ( #10273 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Signed-off-by: Luka Govedic <luka.govedic@gmail.com >
2024-11-20 21:44:57 -08:00
9d827170a3
[Platforms] Add device_type in Platform ( #10508 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-21 04:44:20 +00:00
6c1208d083
[Core] Add Sliding Window Support with Flashinfer ( #10462 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2024-11-20 19:56:47 -08:00
388ee3de66
[torch.compile] limit inductor threads and lazy import quant ( #10482 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 18:36:33 -08:00
2f77b6cfec
[TPU] Implement prefix caching for TPUs ( #10307 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-20 13:54:15 -08:00
c68f7ede6a
[Bugfix]: allow extra fields in requests to openai compatible server ( #10463 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-20 16:42:21 -05:00
0cd3d9717e
[7/N] torch.compile, reduce compilation time ( #10460 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 11:20:38 -08:00
5f1d6af2b6
[perf bench] H200 development ( #9768 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-20 11:06:56 -08:00
772a66732d
[platforms] restore xpu check for parallel config ( #10479 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 17:13:28 +00:00
63f1fde277
[Hardware][CPU] Support chunked-prefill and prefix-caching on CPU ( #10355 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-20 10:57:39 +00:00
d5b28447e0
[Platforms] Refactor xpu code ( #10468 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-19 22:52:13 -08:00
09dbf9ff16
[Bugfix] Handle conflicts between modern and legacy fields ( #10471 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-20 14:45:08 +08:00
343041c4c4
[model] Reduce medusa weight ( #10454 )
...
Signed-off-by: skylee-01 <497627264@qq.com >
2024-11-20 06:05:55 +00:00
ed701ca963
[ci/build] Combine nightly and optional ( #10465 )
2024-11-19 21:36:03 -08:00
7629a9c6e5
[CI/Build] Support compilation with local cutlass path ( #10423 ) ( #10424 )
2024-11-19 21:35:50 -08:00
709c9f1f25
[CI/Build] Add sphinx/rst linter for docs ( #10366 )
2024-11-19 21:35:31 -08:00
b4be5a8adb
[Bugfix] Enforce no chunked prefill for embedding models ( #10470 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-20 05:12:51 +00:00
ad44437ba3
[Bugfix] Fix Mamba model initialization and MLP Speculator weights loading ( #10456 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-20 05:04:05 +00:00
9e05252b46
[Misc] Add __setitem__ for LazyDict ( #10469 )
...
Signed-off-by: Yanyi Liu <wolfsonliu@163.com >
2024-11-20 04:44:57 +00:00
d200972e7f
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) ( #10464 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-11-19 19:40:33 -08:00
d5b68aba2f
[CI/Build] Update Dockerfile.rocm ( #10434 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2024-11-19 17:19:59 -08:00
a324d3a1a7
Change granite chat template to keep json list formatting for tool calls ( #10452 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
2024-11-19 18:16:54 -07:00
b00b33d77e
[Model][Quantization] HQQ support through Marlin kernel expansion ( #9766 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
2024-11-19 13:31:12 -08:00
efa9084628
[Core] Avoid metrics log noise when idle ( #8868 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-19 21:05:25 +00:00
803f37eaaa
[6/N] torch.compile rollout to users ( #10437 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-19 10:09:03 -08:00
fd9f124971
[Doc] fix link for page that was renamed ( #10455 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-19 09:48:30 -08:00
1ea291a417
Fix: Build error seen on Power Architecture ( #10421 )
...
Signed-off-by: Manjul Mohan <manjul.mohan@ibm.com >
Signed-off-by: B-201 <Joy25810@foxmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: ismael-dm <ismaeldm99@gmail.com >
Signed-off-by: Andrew Nesbitt <andrewnez@gmail.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: yan ma <yan.ma@intel.com >
Signed-off-by: Angus Wang <wangjadehao@gmail.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: rickyx <rickyx@anyscale.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Manjul Mohan manjul.mohan@ibm.com <manjulmohan@ltcd97-lp2.aus.stglabs.ibm.com >
Co-authored-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: ismael-dm <ismaeldm99@gmail.com >
Co-authored-by: Andrew Nesbitt <andrewnez@gmail.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Yan Ma <yan.ma@intel.com >
Co-authored-by: Angus Wang <wangjadehao@gmail.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Ricky Xu <rickyx@anyscale.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Mengqing Cao <cmq0113@163.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2024-11-19 09:34:57 -08:00
11fd7ea639
[Pixtral-Large] Pixtral actually has no bias in vision-lang adapter ( #10449 )
2024-11-19 17:33:06 +00:00
f028dff33d
[BugFix] Fix hermes tool parser output error stream arguments in some cases ( #10395 ) ( #10398 )
...
Signed-off-by: xiyuan lee <lixiyuan@haier.com >
2024-11-19 13:42:50 +00:00
b4614656b8
[CI][CPU] adding numa node number as container name suffix ( #10441 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2024-11-19 13:16:43 +00:00
25f9c78961
[misc][plugin] improve plugin loading ( #10443 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-19 10:43:21 +00:00
5390d6664f
[Doc] Add the start of an arch overview page ( #10368 )
2024-11-19 09:52:11 +00:00
382b6a4852
[Misc] Avoid misleading warning messages ( #10438 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-19 08:54:58 +00:00
272e31c0bd
[Bugfix] Guard for negative counter metrics to prevent crash ( #10430 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-11-19 04:57:10 +00:00
74f8c2cf5f
Add openai.beta.chat.completions.parse example to structured_outputs.rst ( #10433 )
2024-11-19 04:37:46 +00:00
8c1fb50705
[Platform][Refactor] Extract func get_default_attn_backend to Platform ( #10358 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2024-11-19 11:22:26 +08:00
7eb719df13
[Bugfix]Fix Phi-3 BNB online quantization ( #10417 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-19 03:21:42 +00:00
284203f171
[ci/build] Have dependabot ignore all patch update ( #10436 )
...
We have too many dependencies and all patch updates can be a little noisy. This is to have dependabot ignore all patch version updates.
2024-11-19 01:04:25 +00:00
90a6c759ca
[misc] partial prefix & random input generation benchmark ( #9929 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-18 15:39:14 -08:00
2298e69b5f
[ci][bugfix] fix kernel tests ( #10431 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-18 15:29:37 -08:00
a03ea40792
[3/N][torch.compile] consolidate custom op logging ( #10399 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-18 15:14:59 -08:00
96d999fbe8
[Kernel] Initial Machete W4A8 support + Refactors ( #9855 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-11-18 12:59:29 -07:00
c2170a5b39
[Kernel] Explicitly specify other value in tl.load calls ( #9014 )
...
Signed-off-by: Angus Wang <wangjadehao@gmail.com >
2024-11-18 11:39:40 -08:00
6b2d25efc7
[Hardware][XPU] AWQ/GPTQ support for xpu backend ( #10107 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2024-11-18 11:18:05 -07:00
281cc4b3cd
[Model][Bugfix] Support TP for PixtralHF ViT ( #10405 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-18 10:04:14 -08:00
4f686d139f
Fix open_collective value in FUNDING.yml ( #10426 )
...
Signed-off-by: Andrew Nesbitt <andrewnez@gmail.com >
2024-11-18 09:52:42 -08:00
31894a2155
[Doc] Add documentation for Structured Outputs ( #9943 )
...
Signed-off-by: ismael-dm <ismaeldm99@gmail.com >
2024-11-18 09:52:12 -08:00
7851b45196
[5/N][torch.compile] torch.jit.script --> torch.compile ( #10406 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-18 23:20:06 +08:00
4186be8111
[Doc] Update doc for LoRA support in GLM-4V ( #10425 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-11-18 15:08:30 +00:00
e7ebb662d7
[Model] Remove transformers attention porting in VITs ( #10414 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-18 21:45:21 +08:00
5be4e52b65
[Model][LoRA]LoRA support added for glm-4v ( #10418 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-11-18 12:57:10 +00:00
01aae1cc68
[Model] Remove redundant softmax when using PoolingType.STEP ( #10415 )
2024-11-18 10:05:36 +00:00
c7dec926f6
[VLM] Report multi_modal_placeholders in output ( #10407 )
...
Signed-off-by: Linkun Chen <lkchen+anyscale@github.com >
2024-11-18 16:06:16 +08:00
51bb12d17b
[4/N][torch.compile] clean up set_torch_compile_backend ( #10401 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-17 23:57:20 -08:00
47826cacf0
[Bugfix] Ignore ray reinit error when current platform is ROCm or XPU ( #10375 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2024-11-18 11:29:26 +08:00
c4e464333e
[Misc] Add uninitialized params tracking for AutoWeightsLoader ( #10327 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-18 09:07:46 +08:00
d1557e66d3
[Misc] Enhance offline_inference to support user-configurable paramet… ( #10392 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2024-11-17 11:32:40 +00:00
80d85c5d7b
[Bugfix] Fix mrope_position_delta in non-last prefill chunk ( #10403 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2024-11-17 08:50:24 +00:00
76aab90ab6
[Hardware] [HPU]add mark_step for hpu ( #10239 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2024-11-17 00:44:44 -08:00
8d74b5aee9
[platforms] refactor cpu code ( #10402 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-16 23:14:23 -08:00
cf349c4a97
[Bugfix][CPU] Fix CPU embedding runner with tensor parallel ( #10394 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-16 23:12:04 -08:00
905d0f0af4
[CI/Build] Fix IDC hpu [Device not found] issue ( #10384 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2024-11-17 14:58:22 +08:00
643ecf7b11
[V1] Refactor model executable interface for all text-only language models ( #10374 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-17 05:18:46 +00:00
4fd9375028
[2/N][torch.compile] make compilation cfg part of vllm cfg ( #10383 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-16 18:02:14 -08:00
661a34fd4f
[V1] Add code owners for V1 ( #10397 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-16 10:45:26 -08:00
361c29e174
[Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled ( #10388 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2024-11-17 02:10:00 +08:00
b98d89efd4
[Misc] Medusa supports custom bias ( #10361 )
2024-11-16 16:33:01 +00:00
8b6725b0cf
[Misc] Update benchmark to support image_url file or http ( #10287 )
...
Signed-off-by: rbbang <anjaehyun87@gmail.com >
2024-11-16 18:15:40 +08:00
1d75472626
[BugFix] [Kernel] Fix GPU SEGV occuring in fused_moe kernel ( #10385 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2024-11-16 09:55:05 +00:00
2f427c2d16
[misc][plugin] improve log messages ( #10386 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-16 01:23:20 -08:00
755b85359b
[doc] add doc for the plugin system ( #10372 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-15 21:46:27 -08:00
32e46e000f
[Frontend] Automatic detection of chat content format from AST ( #9919 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-16 13:35:40 +08:00
4f168f69a3
[Docs] Misc updates to TPU installation instructions ( #10165 )
2024-11-15 13:26:17 -08:00
3e8d14d8a1
[Doc] Move PR template content to docs ( #10159 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-15 13:20:20 -08:00
a067f85e08
[Frontend] Add --version flag to CLI ( #10369 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-15 13:13:53 -08:00
c76ac49d26
[Docs] Add Nebius as sponsors ( #10371 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-15 12:47:40 -08:00
a6221a144a
[Misc] bump mistral common version ( #10367 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-15 09:48:07 -08:00
79ee45b428
[Misc] Bump up test_fused_moe tolerance ( #10364 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
2024-11-15 16:31:18 +00:00
691a3ec047
[Bugfix] Ensure special tokens are properly filtered out for guided structured output with MistralTokenizer ( #10363 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-15 14:50:40 +00:00
3a763ba0c3
[core][misc] keep compatibility for old-style classes ( #10356 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-15 13:55:51 +00:00
f2056f726d
[Misc] Fix some help info of arg_utils to improve readability ( #10362 )
2024-11-15 12:40:30 +00:00
1d65ec7eeb
[Bugfix] Fix fully sharded LoRA bug ( #10352 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-15 10:34:58 +00:00
26908554b2
[Doc] Remove float32 choice from --lora-dtype ( #10348 )
...
Signed-off-by: Xin Yang <xyang19@gmail.com >
2024-11-15 10:22:57 +00:00
b311efd0bd
[Misc] Fix import error in tensorizer tests and cleanup some code ( #10349 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-15 09:34:17 +00:00
3d158cdc8d
Add default value to avoid Falcon crash ( #5363 ) ( #10347 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2024-11-15 08:52:20 +00:00
02dbf30e9a
[Build] skip renaming files for release wheels pipeline ( #9671 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-14 23:31:52 -08:00
2ac6d0e75b
[Misc] Consolidate pooler config overrides ( #10351 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-15 06:59:00 +00:00
2ec8827288
[Bugfix] Qwen-vl output is inconsistent in speculative decoding ( #10350 )
2024-11-15 05:40:10 +00:00
b40cf6402e
[Model] Support Qwen2 embeddings and use tags to select model tests ( #10184 )
2024-11-14 20:23:09 -08:00
2885ba0e24
[Misc] Change RedundantReshapesPass and FusionPass logging from info to debug ( #10308 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-15 02:44:26 +00:00
bf2ddc6610
[bugfix] Fix static asymmetric quantization case ( #10334 )
...
Signed-off-by: Daniël de Kok <me@danieldk.eu >
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: Daniël de Kok <me@danieldk.eu >
2024-11-15 09:35:11 +08:00
972112d82f
[Bugfix] Fix unable to load some models ( #10312 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-14 16:55:54 -08:00
11cd1ae6ad
[Tool parsing] Improve / correct mistral tool parsing ( #10333 )
2024-11-15 00:42:49 +00:00
554af9228d
[Bugfix] use AF_INET6 for OpenAI Compatible Server with ipv6 ( #9583 )
...
Signed-off-by: xiaozijin <xiaozijin@bytedance.com >
2024-11-14 16:38:53 -08:00
b2e0ad3b59
[Perf] Reduce peak memory usage of llama ( #10339 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
2024-11-15 00:38:20 +00:00
4a18fd14ba
Support Roberta embedding models ( #9387 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Co-authored-by: Flavia Beo <flavia.beo@ibm.com >
2024-11-14 21:23:29 +00:00
1dbae0329c
[Docs] Publish meetup slides ( #10331 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-14 16:19:38 +00:00
675d603400
[CI/Build] Make shellcheck happy ( #10285 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-14 09:47:53 +00:00
03025c023f
[CI/Build] Fix CPU CI online inference timeout ( #10314 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-14 16:45:32 +08:00
29f3ef26a3
[ci][distributed] disable hanging tests ( #10317 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-14 00:23:39 -08:00
294bf467ba
[Model] Add BNB quantization support for Idefics3 ( #10310 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-14 06:31:44 +00:00
52b48c1ead
[BugFix]: properly deserialize tool_calls iterator before processing by mistral-common when MistralTokenizer is used ( #9951 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-14 04:48:16 +00:00
f67ce05d0b
[Frontend] Pythonic tool parser ( #9859 )
...
Signed-off-by: Mike Depinet <mike@fixie.ai >
2024-11-14 04:14:34 +00:00
e0853b6508
[Misc] format.sh: Simplify tool_version_check ( #10305 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-14 11:12:35 +08:00
504ac53d18
[misc] error early for old-style class ( #10304 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-13 18:55:39 -08:00
15bb8330aa
[Bugfix] Fix tensor parallel for qwen2 classification model ( #10297 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-14 10:54:59 +08:00
ac49b59d8b
[Bugfix] bitsandbytes models fail to run pipeline parallel ( #10200 )
...
Signed-off-by: Hoang Cong Duc <hoangcongducltt@gmail.com >
2024-11-13 09:56:39 -07:00
0b8bb86bf1
[1/N] Initial prototype for multi-modal processor ( #10044 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-13 12:39:03 +00:00
bb7991aa29
[V1] Add missing tokenizer options for Detokenizer ( #10288 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-13 11:02:56 +00:00
d909acf9fe
[Model][LoRA]LoRA support added for idefics3 ( #10281 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-11-13 17:25:59 +08:00
b6dde33019
[Core] Flashinfer - Remove advance step size restriction ( #10282 )
2024-11-13 16:29:32 +08:00
1b886aa104
[Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 ( #9944 )
...
Signed-off-by: FurtherAI <austin.veselka@lighton.ai >
Co-authored-by: FurtherAI <austin.veselka@lighton.ai >
2024-11-13 08:28:13 +00:00
3945c82346
[Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions ( #10221 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2024-11-13 07:07:22 +00:00
032fcf16ae
[Doc] Fix typo in arg_utils.py ( #10264 )
...
Signed-off-by: Xin Yang <xyang19@gmail.com >
2024-11-12 21:54:52 -08:00
56a955e774
Bump to compressed-tensors v0.8.0 ( #10279 )
...
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2024-11-12 21:54:10 -08:00
bbd3e86926
[V1] Support VLMs with fine-grained scheduling ( #9871 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-11-13 04:53:13 +00:00
0d4ea3fb5c
[core][distributed] use tcp store directly ( #10275 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 17:36:08 -08:00
112fa0bbe5
[V1] Fix CI tests on V1 engine ( #10272 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 16:17:20 -08:00
377b74fe87
Revert "[ci][build] limit cmake version" ( #10271 )
2024-11-12 15:06:48 -08:00
18081451f9
[doc] improve debugging doc ( #10270 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 14:43:52 -08:00
96ae0eaeb2
[doc] fix location of runllm widget ( #10266 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 14:34:39 -08:00
1f55e05713
[V1] Enable Inductor when using piecewise CUDA graphs ( #10268 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 13:39:56 -08:00
8a06428c70
[LoRA] Adds support for bias in LoRA ( #5733 )
...
Signed-off-by: Umesh Deshpande <udeshpa@us.ibm.com >
Co-authored-by: Umesh Deshpande <udeshpa@us.ibm.com >
2024-11-12 11:08:40 -08:00
b41fb9d3b1
[Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers ( #9982 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
2024-11-12 10:53:57 -08:00
7c65527918
[V1] Use pickle for serializing EngineCoreRequest & Add multimodal inputs to EngineCoreRequest ( #10245 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 08:57:14 -08:00
47db6ec831
[Frontend] Add per-request number of cached token stats ( #10174 )
2024-11-12 16:42:28 +00:00
176fcb1c71
[Bugfix] Fix QwenModel argument ( #10262 )
...
Signed-off-by: Jie Fu <jiefu@tencent.com >
2024-11-12 16:36:51 +00:00
a838ba7254
[Misc]Fix Idefics3Model argument ( #10255 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-12 13:07:11 +00:00
36c513a076
[BugFix] Do not raise a ValueError when tool_choice is set to the supported none option and tools are not defined. ( #10000 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-12 11:13:46 +00:00
d201d41973
[CI][CPU]refactor CPU tests to allow to bind with different cores ( #10222 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2024-11-12 10:07:32 +00:00
3a28f18b0b
[doc] explain the class hierarchy in vLLM ( #10240 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 22:56:44 -08:00
812c981fa0
Splitting attention kernel file ( #10091 )
...
Signed-off-by: maleksan85 <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2024-11-11 22:55:07 -08:00
7f5edb5900
[Misc][LoRA] Replace hardcoded cuda device with configurable argument ( #10223 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-12 11:10:15 +08:00
eea55cca5b
[1/N] torch.compile user interface design ( #10237 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 18:01:06 -08:00
9cdba9669c
[Doc] Update help text for --distributed-executor-backend ( #10231 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-12 09:55:09 +08:00
d1c6799b88
[doc] update debugging guide ( #10236 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 15:21:12 -08:00
6ace6fba2c
[V1] AsyncLLM Implementation ( #9826 )
...
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-11 23:05:38 +00:00
08f93e7439
Make shutil rename in python_only_dev ( #10233 )
...
Signed-off-by: shcheglovnd <shcheglovnd@avride.ai >
2024-11-11 14:29:19 -08:00
9d5b4e4dea
[V1] Enable custom ops with piecewise CUDA graphs ( #10228 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:58:07 -08:00
8a7fe47d32
[misc][distributed] auto port selection and disable tests ( #10226 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 11:54:59 -08:00
4800339c62
Add docs on serving with Llama Stack ( #10183 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2024-11-11 11:28:55 -08:00
fe15729a2b
[V1] Use custom ops for piecewise CUDA graphs ( #10227 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:26:48 -08:00
330e82d34a
[v1][torch.compile] support managing cudagraph buffer ( #10203 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:10:27 -08:00
d7a4f2207b
[V1] Do not use inductor for piecewise CUDA graphs ( #10225 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:05:57 -08:00
f9dadfbee3
[V1] Fix detokenizer ports ( #10224 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 10:42:07 -08:00
25144ceed0
Bump actions/setup-python from 5.2.0 to 5.3.0 ( #10209 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 17:24:10 +00:00
e6de9784d2
[core][distributed] add stateless process group ( #10216 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 09:02:14 -08:00
36fc439de0
[Doc] fix doc string typo in block_manager swap_out function ( #10212 )
2024-11-11 08:53:07 -08:00
874f551b36
[Metrics] add more metrics ( #4464 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-12 00:17:38 +08:00
2cebda42bb
[Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner ( #10218 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-11 12:37:58 +00:00
5fb1f935b0
[V1] Allow tokenizer_mode and trust_remote_code for Detokenizer ( #10211 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-11 18:01:18 +08:00
36e4acd02a
[LoRA][Kernel] Remove the unused libentry module ( #10214 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-11 09:43:23 +00:00
58170d6503
[Hardware][CPU] Add embedding models support for CPU backend ( #10193 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-11 08:54:28 +00:00
9804ac7c7c
Bump the patch-update group with 5 updates ( #10210 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 07:22:40 +00:00
f89d18ff74
[6/N] pass whole config to inner model ( #10205 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 06:41:46 +00:00
f0f2e5638e
[doc] improve debugging code ( #10206 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-10 17:49:40 -08:00
ad9a78bf64
[Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py ( #10196 )
2024-11-11 00:14:22 +00:00
73b9083e99
[misc] improve cloudpickle registration and tests ( #10202 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 00:10:53 +00:00
20cf2f553c
[Misc] small fixes to function tracing file path ( #9543 )
...
Signed-off-by: Shawn Du <shawnd200@outlook.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-10 15:21:06 -08:00
bfb7d61a7c
[doc] Polish the integration with huggingface doc ( #10195 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-10 10:22:04 -08:00
19682023b6
[Doc] Fix typo error in CONTRIBUTING.md ( #10190 )
...
Signed-off-by: FuryMartin <furymartin9910@outlook.com >
2024-11-10 07:47:24 +00:00
9fa4bdde9d
[ci][build] limit cmake version ( #10188 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 16:27:26 -08:00
51c2e1fcef
[CI/Build] Split up models tests ( #10069 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 11:39:14 -08:00
b09895a618
[Frontend][Core] Override HF config.json via CLI ( #5836 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 16:19:27 +00:00
d88bff1b96
[Frontend] add add_request_id middleware ( #9594 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2024-11-09 10:18:29 +00:00
9e37266420
bugfix: fix the bug that stream generate not work ( #2756 )
2024-11-09 10:09:48 +00:00
8a4358ecb5
[doc] explaining the integration with huggingface ( #10173 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 01:02:54 -08:00
bd46357ad9
[bugfix] fix broken tests of mlp speculator ( #10177 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 00:04:50 -08:00
f192aeba74
[Bugfix] Enable some fp8 and quantized fullgraph tests ( #10171 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-09 08:01:27 +00:00
8e1529dc57
[CI/Build] Add run-hpu-test.sh script ( #10167 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2024-11-09 06:26:52 +00:00
1a95f10ee7
[5/N] pass the whole config to model ( #9983 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 14:17:28 +08:00
49d2a41a86
[Doc] Adjust RunLLM location ( #10176 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 20:07:10 -08:00
47672f38b5
[CI/Build] Fix VLM broadcast tests tensor_parallel_size passing ( #10161 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-09 04:02:59 +00:00
f83feccd7f
[Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module ( #10169 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-09 03:36:46 +00:00
e0191a95d8
[0/N] Rename MultiModalInputs to MultiModalKwargs ( #10040 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 11:31:02 +08:00
d7edca1dee
[CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking ( #6892 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 03:27:11 +00:00
127c07480e
[Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case ( #9857 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2024-11-08 19:59:22 -05:00
10b67d865d
[Bugfix] SymIntArrayRef expected to contain concrete integers ( #10170 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-08 14:44:18 -08:00
4f93dfe952
[torch.compile] Fuse RMSNorm with quant ( #9138 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-11-08 21:20:08 +00:00
e1b5a82179
Rename vllm.logging to vllm.logging_utils ( #10134 )
2024-11-08 20:53:24 +00:00
87713c6053
[CI/Build] Ignore .gitignored files for shellcheck ( #10162 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2024-11-08 19:53:36 +00:00
b5815c8413
[V1] Fix non-cudagraph op name ( #10166 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-08 10:23:04 -08:00
6b30471586
[Misc] Improve Web UI ( #10090 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-11-08 09:51:04 -08:00
f6778620a9
Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 ( #10136 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
2024-11-08 15:56:18 +00:00
0535e5fe6c
Fix edge case Mistral tokenizer ( #10152 )
2024-11-08 15:42:27 +00:00
b489fc3c91
[CI/Build] Update CPU tests to include all "standard" tests ( #5481 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 23:30:04 +08:00
208ce622c7
[V1]Enable APC by default only for text models ( #10148 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-08 14:39:41 +00:00
1ff4aed5bd
[Model] Expose size to Idefics3 as mm_processor_kwargs ( #10146 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-08 09:56:58 +00:00
f10797c0ce
[Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator ( #10144 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2024-11-08 09:41:03 +00:00
f4c2187e29
[Misc] Fix typo in #5895 ( #10145 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 09:07:01 +00:00
aea6ad629f
Add hf_transfer to testing image ( #10096 )
2024-11-08 08:35:25 +00:00
da07a9ead7
Fixes a typo about 'max_decode_seq_len' which causes crashes with cuda graph. ( #9285 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2024-11-08 05:31:28 +00:00
3a7f15a398
[Doc] Move CONTRIBUTING to docs site ( #9924 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-08 05:15:12 +00:00
7371749d54
[Misc] Fix ImportError causing by triton ( #9493 )
2024-11-08 05:08:51 +00:00
ad39bd640c
[Bugfix] Add error handling when server cannot respond any valid tokens ( #5895 )
2024-11-08 04:58:37 +00:00
40d0e7411d
[Doc] Update FAQ links in spec_decode.rst ( #9662 )
...
Signed-off-by: whyiug <whyiug@hotmail.com >
2024-11-08 04:44:58 +00:00
6bb52b0f97
[CI/Build] Give PR cleanup job PR write access ( #10139 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-08 12:10:20 +08:00
201fc07730
[V1] Prefix caching (take 2) ( #9972 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-11-07 17:34:44 -08:00
42b4f46b71
[V1] Add all_token_ids attribute to Request ( #10135 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-07 17:08:24 -08:00
073a472728
[Misc] report relevant env vars in collect_env.py tool ( #9293 )
2024-11-07 16:14:01 -08:00
93bff421bc
Bump actions/checkout from 4.2.1 to 4.2.2 ( #9746 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 21:44:58 +00:00
28b2877d30
Online video support for VLMs ( #10020 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-07 20:25:59 +00:00
97b8475beb
Bump actions/setup-python from 5.2.0 to 5.3.0 ( #9745 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 18:55:35 +00:00
a2f1f3b089
[CI/Build] Automate PR body text cleanup ( #10082 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 18:26:28 +00:00
3be5b26a76
[CI/Build] Add shell script linting using shellcheck ( #7925 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 18:17:29 +00:00
de0e61a323
[CI/Build] Always run mypy ( #10122 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 16:43:16 +00:00
9d43afcc53
[Feature] [Spec decode]: Combine chunked prefill with speculative decoding ( #9291 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2024-11-07 08:15:14 -08:00
ae62fd17c0
[Frontend] Tool calling parser for Granite 3.0 models ( #9027 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-11-07 07:09:02 -08:00
a62bc0109c
[Misc] Add Gamma-Distribution Request Generation Support for Serving Benchmark. ( #10105 )
...
Signed-off-by: Mozhou <spli161006@gmail.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-11-07 11:20:30 +00:00
999df95b4e
[Bugfix] Make image processor respect mm_processor_kwargs for Qwen2-VL ( #10112 )
...
Signed-off-by: Jiahao Li <liplus17@163.com >
2024-11-07 10:50:44 +00:00
a6f332d0d9
[Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target ( #10108 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-07 18:42:50 +08:00
0dfba97b42
[Frontend] Fix multiple values for keyword argument error ( #10075 ) ( #10076 )
...
Signed-off-by: Lei <ylxx@live.com >
2024-11-07 09:07:19 +00:00
aa9078fa03
Adds method to read the pooling types from model's files ( #9506 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2024-11-07 08:42:40 +00:00
e036e527a0
[CI/Build] Improve mypy + python version matrix ( #10041 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 07:54:16 +00:00
6192e9b8fe
[Core][Distributed] Refactor ipc buffer init in CustomAllreduce ( #10030 )
...
Signed-off-by: Hanzhi Zhou <hanzhi713@gmail.com >
2024-11-06 23:50:47 -08:00
d7263a1bb8
Doc: Improve benchmark documentation ( #9927 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-11-06 23:50:35 -08:00
104d729656
[CI/Build] re-add codespell to CI ( #10083 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 22:54:46 -08:00
db7db4aab9
[Misc] Consolidate ModelConfig code related to HF config ( #10104 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-07 06:00:21 +00:00
1fa020c539
[V1][BugFix] Fix Generator construction in greedy + seed case ( #10097 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2024-11-07 05:06:57 +00:00
e7b84c394d
[doc] add back Python 3.8 ABI ( #10100 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-06 21:06:41 -08:00
a4b3e0c1e9
[Hardware][CPU] Update torch 2.5 ( #9911 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-07 04:43:08 +00:00
29862b884b
[Frontend] Adjust try/except blocks in API impl ( #10056 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2024-11-06 20:07:51 -08:00
d3859f1891
[Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend ( #9823 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: yan ma <yan.ma@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2024-11-06 17:29:03 -08:00
4ab3256644
[Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12.4 ( #10095 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-07 00:54:13 +00:00
719c1ca468
[core][distributed] add stateless_init_process_group ( #10072 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-06 16:42:09 -08:00
74f2f8a0f1
[CI/Build] Always run the ruff workflow ( #10092 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 22:25:23 +00:00
d58268c56a
[V1] Make v1 more testable ( #9888 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-11-06 11:57:35 -08:00
87bd7e0515
[CI/Build] change conflict PR comment from mergify ( #10080 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 10:15:42 -08:00
098f94de42
[CI/Build] Drop Python 3.8 support ( #10038 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-06 14:31:01 +00:00
399c798608
Remove ScaledActivation for AWQ ( #10057 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-06 14:27:06 +00:00
406d4cc480
[Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration ( #10022 )
...
Signed-off-by: ericperfect <ericperfectttt@gmail.com >
2024-11-06 14:13:15 +00:00
a5bba7d234
[Model] Add Idefics3 support ( #9767 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: B-201 <Joy25810@foxmail.com >
2024-11-06 11:41:17 +00:00
2003cc3513
[Model][LoRA]LoRA support added for LlamaEmbeddingModel ( #10071 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-06 09:49:19 +00:00
6a585a23d2
[Hotfix] Fix ruff errors ( #10073 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-06 01:24:28 -08:00
a02a50e6e5
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend ( #6143 )
...
Signed-off-by: yuwenzho <yuwen.zhou@intel.com >
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
Signed-off-by: Bob Zhu <bob.zhu@intel.com >
Signed-off-by: zehao-intel <zehao.huang@intel.com >
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Sanju C Sudhakaran <scsudhakaran@habana.ai >
Co-authored-by: Michal Adamczyk <madamczyk@habana.ai >
Co-authored-by: Marceli Fylcek <mfylcek@habana.ai >
Co-authored-by: Himangshu Lahkar <49579433+hlahkar@users.noreply.github.com >
Co-authored-by: Vivek Goel <vgoel@habana.ai >
Co-authored-by: yuwenzho <yuwen.zhou@intel.com >
Co-authored-by: Dominika Olszewska <dolszewska@habana.ai >
Co-authored-by: barak goldberg <149692267+bgoldberg-habana@users.noreply.github.com >
Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com >
Co-authored-by: Jan Kaniecki <jkaniecki@habana.ai >
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyniewicz-habana@users.noreply.github.com >
Co-authored-by: Krzysztof Wisniewski <kwisniewski@habana.ai >
Co-authored-by: Dudi Lester <160421192+dudilester@users.noreply.github.com >
Co-authored-by: Ilia Taraban <tarabanil@gmail.com >
Co-authored-by: Chendi.Xue <chendi.xue@intel.com >
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai >
Co-authored-by: Jakub Maksymczuk <jmaksymczuk@habana.ai >
Co-authored-by: Tomasz Zielinski <85164140+tzielinski-habana@users.noreply.github.com >
Co-authored-by: Sun Choi <schoi@habana.ai >
Co-authored-by: Iryna Boiko <iboiko@habana.ai >
Co-authored-by: Bob Zhu <41610754+czhu15@users.noreply.github.com >
Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com >
Co-authored-by: Zehao Huang <zehao.huang@intel.com >
Co-authored-by: Andrzej Kotłowski <Andrzej.Kotlowski@intel.com >
Co-authored-by: Yan Tomsinsky <73292515+Yantom1@users.noreply.github.com >
Co-authored-by: Nir David <ndavid@habana.ai >
Co-authored-by: Yu-Zhou <yu.zhou@intel.com >
Co-authored-by: Ruheena Suhani Shaik <rsshaik@habana.ai >
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai >
Co-authored-by: Marcin Swiniarski <mswiniarski@habana.ai >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Jacek Czaja <jacek.czaja@intel.com >
Co-authored-by: Jacek Czaja <jczaja@habana.ai >
Co-authored-by: Yuan <yuan.zhou@outlook.com >
2024-11-06 01:09:10 -08:00
a5fda50a10
[CI/Build] Fix large_gpu_mark reason ( #10070 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-06 08:50:37 +00:00
21063c11c7
[CI/Build] drop support for Python 3.8 EOL ( #8464 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2024-11-06 07:11:55 +00:00
4be3a45158
[distributed] add function to create ipc buffers directly ( #10064 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 22:35:03 -08:00
4089985552
[V1] Integrate Piecewise CUDA graphs ( #10058 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-05 22:16:04 -08:00
9d59b75593
[Bugfix] Remove CustomChatCompletionContentPartParam multimodal input type ( #10054 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2024-11-06 05:13:09 +00:00
ea928f608c
[Bugfix] Gpt-j-6B patch kv_scale to k_scale path ( #10063 )
...
Signed-off-by: Alex Rakowski <alex.rakowski@amd.com >
Signed-off-by: Alex Rakowski <182798202+arakowsk-amd@users.noreply.github.com >
2024-11-06 05:10:40 +00:00
2bcbae704c
[Bugfix] Fix edge-case crash when using chat with the Mistral Tekken Tokenizer ( #10051 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-11-06 04:28:29 +00:00
ffc0f2b47a
[Model][OpenVINO] Fix regressions from #8346 ( #10045 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2024-11-06 04:19:15 +00:00
82bfc38d07
[Misc] Sort the list of embedding models ( #10037 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-06 04:05:05 +00:00
c4cacbaa7f
[v1] reduce graph capture time for piecewise cudagraph ( #10059 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 18:19:50 -08:00
0c63c34f72
[Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode ( #9730 )
...
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2024-11-06 01:45:45 +00:00
966e31697b
[Bugfix] Fix pickle of input when async output processing is on ( #9931 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-11-06 00:39:26 +00:00
43300bd98a
[Bugfix] Properly propagate trust_remote_code settings ( #10047 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2024-11-05 16:34:40 -08:00
ca9844b340
[bugfix] fix weak ref in piecewise cudagraph and tractable test ( #10048 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 14:49:20 -08:00
235366fe2e
[CI] Prune back the number of tests in tests/kernels/* ( #9932 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 16:02:32 -05:00
02462465ea
[CI] Prune tests/models/decoder_only/language/* tests ( #9940 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 16:02:23 -05:00
b9c64c0ca7
[Misc] Modify BNB parameter name ( #9997 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-05 14:40:08 -05:00
d2e80332a7
[Feature] Update benchmark_throughput.py to support image input ( #9851 )
...
Signed-off-by: Linkun Chen <github+anyscale@lkchen.net >
Co-authored-by: Linkun Chen <github+anyscale@lkchen.net >
2024-11-05 19:30:02 +00:00
a53046b16f
[Model] Support quantization of PixtralHFTransformer for PixtralHF ( #9921 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 10:42:20 -08:00
731aec5be7
[CI/Build] Limit github CI jobs based on files changed ( #9928 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-05 10:30:42 -08:00
09d3550372
[Misc] Add logging for CUDA memory ( #10027 )
...
Signed-off-by: Chenghao Yang <yangalan1996@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Chenghao Yang <yangalan1996@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-05 09:50:50 -08:00
cd34029e91
Refactor TPU requirements file and pin build dependencies ( #10010 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
2024-11-05 16:48:44 +00:00
5952d81139
[Frontend] Fix tcp port reservation for api server ( #10012 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-05 07:50:57 -08:00
93dee88f6b
[Misc] vllm CLI flags should be ordered for better user readability ( #10017 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-05 18:59:56 +08:00
7a83b1aec0
[BugFix] Lazy import ray ( #10021 )
2024-11-05 10:04:10 +00:00
ad23318928
[Bugfix] Fixup Mamba ( #10004 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-05 03:46:38 +00:00
bbc3619dc8
[Core] Make encoder-decoder inputs a nested structure to be more composable ( #9604 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-05 10:07:31 +08:00
04bbf38e05
[Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep ( #9994 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-05 01:08:21 +00:00
8f0a9ca890
[Bugfix] Respect modules_to_not_convert within awq_marlin ( #9895 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-04 16:57:44 -07:00
2094062b4e
[4.5/N] bugfix for quant config in speculative decode ( #10007 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-04 15:11:59 -08:00
d93478b399
[Bugfix] Upgrade to pytorch 2.5.1 ( #10001 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-04 15:11:28 -08:00
ac04a97a9f
[Frontend] Add max_tokens prometheus metric ( #9881 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com >
2024-11-04 22:53:24 +00:00
9a5664d4a4
[Misc] Refactor benchmark_throughput.py ( #9779 )
...
Signed-off-by: Linkun Chen <github+anyscale@lkchen.net >
Co-authored-by: Linkun Chen <lkchen@github.com >
Co-authored-by: Linkun Chen <github+anyscale@lkchen.net >
2024-11-04 14:32:16 -08:00
04cef2c6ab
[Bugfix] Fix MQLLMEngine hanging ( #9973 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-11-04 16:01:43 -05:00
6e056bcf04
[Doc] Update VLM doc about loading from local files ( #9999 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-04 19:47:11 +00:00
5208dc7a20
[Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs ( #9279 )
...
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com >
2024-11-04 11:37:46 -08:00
1c45f4c385
[CI] Basic Integration Test For TPU ( #9968 )
...
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
2024-11-04 11:34:26 -08:00
603a661ae8
[Model] factoring out MambaMixer out of Jamba ( #8993 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-11-04 18:00:00 +00:00
fb2716d641
[Misc]Reduce BNB static variable ( #9987 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-04 17:04:40 +00:00
8d72bb20fa
[4/N] make quant config first-class citizen ( #9978 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-04 08:51:31 -08:00
ac6b8f19b9
[Frontend] Multi-Modality Support for Loading Local Image Files ( #9915 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-04 15:34:57 +00:00
ccb5376a9a
[Bugfix][OpenVINO] Fix circular reference #9939 ( #9974 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-04 18:14:13 +08:00
ea4adeddc1
[Bugfix] Fix E2EL mean and median stats ( #9984 )
...
Signed-off-by: daitran2k1 <tranquangdai7a@gmail.com >
2024-11-04 09:37:58 +00:00
4dbcbbeb09
[Misc] Compute query_start_loc/seq_start_loc on CPU ( #9447 )
...
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com >
2024-11-04 08:54:37 +00:00
b67feb1274
[Bugfix]Using the correct type hints ( #9885 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-11-04 06:19:51 +00:00
c49f0407ba
[Bugfix] Fix MiniCPMV and Mllama BNB bug ( #9917 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-04 03:36:41 +00:00
91c9ebbb1b
[V1] Fix Configs ( #9971 )
2024-11-04 00:24:40 +00:00
54597724f4
[Model] Add support for H2OVL-Mississippi models ( #9747 )
...
Signed-off-by: Shanshan Wang <shanshan.wang@h2o.ai >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-11-04 00:15:36 +00:00
1f1b6d6eda
[V1] Support per-request seed ( #9945 )
...
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
2024-11-03 09:14:17 -08:00
3bb4befea7
[bugfix] fix tests ( #9959 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 15:54:05 -07:00
ae5279a163
[torch.compile] Adding torch compile to vision-language models ( #9946 )
2024-11-02 12:56:05 -07:00
1b73ab2a1f
[CI/Build] Quoting around > ( #9956 )
2024-11-02 12:50:28 -07:00
cea808f325
[3/N] model runner pass the whole config to model ( #9958 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 12:08:49 -07:00
74b529ceee
[bugfix] fix chatglm dummy_data_for_glmv ( #9955 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 08:03:33 -07:00
d6459b4516
[V1] Fix EngineArgs refactor on V1 ( #9954 )
2024-11-02 07:44:38 -07:00
e893795443
[2/N] executor pass the complete config to worker/modelrunner ( #9938 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2024-11-02 07:35:05 -07:00
1d4cfe2be1
[Doc] Updated tpu-installation.rst with more details ( #9926 )
...
Signed-off-by: Michael Green <mikegre@google.com >
2024-11-02 10:06:45 -04:00
eed92f12fc
[Docs] Update Granite 3.0 models in supported models table ( #9930 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-02 09:02:18 +00:00
af7380d83b
[torch.compile] fix cpu broken code ( #9947 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 23:35:47 -07:00
a78dd3303e
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder models ( #9559 )
2024-11-01 23:22:49 -07:00
d522034c85
[ci/build] Have dependabot ignore pinned dependencies ( #9935 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-11-01 23:56:13 +00:00
6c0b7f548d
[Core][VLM] Add precise multi-modal placeholder tracking ( #8346 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2024-11-01 16:21:10 -07:00
d151fde834
[ci/build] Bump the patch-update group with 10 updates ( #9897 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
2024-11-01 23:04:42 +00:00
27cd36e6e2
[Bugfix] PicklingError on RayTaskError ( #9934 )
...
Signed-off-by: Gene Su <e870252314@gmail.com >
2024-11-01 22:08:23 +00:00
18bd7587b7
[1/N] pass the complete config from engine to executor ( #9933 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 13:51:57 -07:00
598b6d7b07
[Bugfix/Core] Flashinfer k_scale and v_scale ( #9861 )
2024-11-01 12:15:05 -07:00
aff1fd8188
[torch.compile] use interpreter with stable api from pytorch ( #9889 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 11:50:37 -07:00
4581d2cc02
[Core] Refactor: Clean up unused argument in Scheduler._preempt ( #9696 )
...
Signed-off-by: André Jonasson <andre.jonasson@gmail.com >
2024-11-01 11:41:38 -07:00
1dd4cb2935
[Bugfix] Fix edge cases for MistralTokenizer ( #9625 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
2024-11-01 10:33:15 -07:00
ba0d892074
[Frontend] Use a proper chat template for VLM2Vec ( #9912 )
2024-11-01 14:09:07 +00:00
30a2e80742
[CI/Build] Add Model Tests for PixtralHF ( #9813 )
2024-11-01 07:55:29 -06:00
06386a64dd
[Frontend] Chat-based Embeddings API ( #9759 )
2024-11-01 08:13:35 +00:00
d3aa2a8b2f
[Doc] Update multi-input support ( #9906 )
2024-11-01 07:34:49 +00:00
2b5bf20988
[torch.compile] Adding torch compile annotations to some models ( #9876 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-01 00:25:47 -07:00
93a76dd21d
[Model] Support bitsandbytes for MiniCPMV ( #9891 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-01 13:31:56 +08:00
566cd27797
[torch.compile] rework test plans ( #9866 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-31 22:20:17 -07:00
37a4947dcd
[Bugfix] Fix layer skip logic with bitsandbytes ( #9887 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-01 13:12:44 +08:00
96e0c9cbbd
[torch.compile] directly register custom op ( #9896 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-31 21:56:09 -07:00
031a7995f3
[Bugfix][Frontend] Reject guided decoding in multistep mode ( #9892 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-11-01 01:09:46 +00:00
b63c64d95b
[ci/build] Configure dependabot to update pip dependencies ( #9811 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-31 15:55:38 -07:00
9fb12f7848
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 ( #9838 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-10-31 20:06:25 +00:00
55650c83a0
[Bugfix] Fix illegal memory access error with chunked prefill, prefix caching, block manager v2 and xformers enabled together ( #9532 )
...
Signed-off-by: sasha0552 <admin@sasha0552.org >
2024-10-31 11:46:36 -07:00
77f7ef2908
[CI/Build] Adding a forced docker system prune to clean up space ( #9849 )
2024-11-01 01:02:58 +08:00
16b8f7a86f
[CI/Build] Add Model Tests for Qwen2-VL ( #9846 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-31 09:10:52 -07:00
5608e611c2
[Doc] Update Qwen documentation ( #9869 )
2024-10-31 08:54:18 +00:00
3ea2dc2ec4
[Misc] Remove deprecated arg for cuda graph capture ( #9864 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-10-31 07:22:07 +00:00
d087bf863e
[Model] Support quantization of Qwen2VisionTransformer ( #9817 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-30 22:41:20 -07:00
890ca36072
Revert "[Bugfix] Use host argument to bind to interface ( #9798 )" ( #9852 )
2024-10-31 01:44:51 +00:00
abbfb6134d
[Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint ( #9837 )
2024-10-30 18:15:56 -07:00
64384bbcdf
[torch.compile] upgrade tests ( #9858 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-30 16:34:22 -07:00
00d91c8a2c
[CI/Build] Simplify exception trace in api server tests ( #9787 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-30 14:52:05 -07:00
c2cd1a2142
[doc] update pp support ( #9853 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-30 13:36:51 -07:00
c787f2d81d
[Neuron] Update Dockerfile.neuron to fix build failure ( #9822 )
2024-10-30 12:22:02 -07:00
33d257735f
[Doc] link bug for multistep guided decoding ( #9843 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-30 17:28:29 +00:00
3b3f1e7436
[Bugfix][core] replace heartbeat with pid check ( #9818 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-30 09:34:07 -07:00
9ff4511e43
[Misc] Add chunked-prefill support on FlashInfer. ( #9781 )
2024-10-30 09:33:53 -07:00
81f09cfd80
[Model] Support math-shepherd-mistral-7b-prm model ( #9697 )
...
Signed-off-by: Went-Liang <wenteng_liang@163.com >
2024-10-30 09:33:42 -07:00
cc98f1e079
[CI/Build] VLM Test Consolidation ( #9372 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-30 09:32:17 -07:00
211fe91aa8
[TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA ( #9438 )
2024-10-30 09:41:38 +00:00
6aa6020f9b
[Misc] Specify minimum pynvml version ( #9827 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-29 23:05:43 -07:00
ff5ed6e1bc
[torch.compile] rework compile control with piecewise cudagraph ( #9715 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-29 23:03:49 -07:00
7b0365efef
[Doc] Add the DCO to CONTRIBUTING.md ( #9803 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-30 05:22:23 +00:00
04a3ae0aca
[Bugfix] Fix multi nodes TP+PP for XPU ( #8884 )
...
Signed-off-by: YiSheng5 <syhm@mail.ustc.edu.cn >
Signed-off-by: yan ma <yan.ma@intel.com >
Co-authored-by: YiSheng5 <syhm@mail.ustc.edu.cn >
2024-10-29 21:34:45 -07:00
62fac4b9aa
[ci/build] Pin CI dependencies version with pip-compile ( #9810 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-30 03:34:55 +00:00
226688bd61
[Bugfix][VLM] Make apply_fp8_linear work with >2D input ( #9812 )
2024-10-29 19:49:44 -07:00
64cb1cdc3f
Update README.md ( #9819 )
2024-10-29 17:28:43 -07:00
1ab6f6b4ad
[core][distributed] fix custom allreduce in pytorch 2.5 ( #9815 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-29 17:06:24 -07:00
bc73e9821c
[Bugfix] Fix prefix strings for quantized VLMs ( #9772 )
2024-10-29 16:02:59 -07:00
8d7724104a
[Docs] Add notes about Snowflake Meetup ( #9814 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-10-29 15:19:02 -07:00
882a1ad0de
[Model] tool calling support for ibm-granite/granite-20b-functioncalling ( #8339 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com >
2024-10-29 15:07:37 -07:00
67bdf8e523
[Bugfix][Frontend] Guard against bad token ids ( #9634 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-29 14:13:20 -07:00
0ad216f575
[MISC] Set label value to timestamp over 0, to keep track of recent history ( #9777 )
...
Signed-off-by: Kunjan Patel <kunjanp@google.com >
2024-10-29 19:52:19 +00:00
7585ec996f
[CI/Build] mergify: fix rules for ci/build label ( #9804 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-29 19:24:42 +00:00
ab6f981671
[CI][Bugfix] Skip chameleon for transformers 4.46.1 ( #9808 )
2024-10-29 11:12:43 -07:00
ac3d748dba
[Model] Add LlamaEmbeddingModel as an embedding Implementation of LlamaModel ( #9806 )
2024-10-29 10:40:35 -07:00
0ce7798f44
[Misc]: Typo fix: Renaming classes (casualLM -> causalLM) ( #9801 )
...
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com >
2024-10-29 10:39:20 -07:00
0f43387157
[Bugfix] Use host argument to bind to interface ( #9798 )
2024-10-29 10:37:59 -07:00
08600ddc68
Fix the log to correct guide user to install modelscope ( #9793 )
...
Signed-off-by: yuze.zyz <yuze.zyz@alibaba-inc.com >
2024-10-29 10:36:59 -07:00
74fc2d77ae
[Misc] Add metrics for request queue time, forward time, and execute time ( #9659 )
2024-10-29 10:32:56 -07:00
622b7ab955
[Hardware] using current_platform.seed_everything ( #9785 )
...
Signed-off-by: wangshuai09 <391746016@qq.com >
2024-10-29 14:47:44 +00:00
09500f7dde
[Model] Add BNB quantization support for Mllama ( #9720 )
2024-10-29 08:20:02 -04:00
ef7865b4f9
[Frontend] re-enable multi-modality input in the new beam search implementation ( #9427 )
...
Signed-off-by: Qishuai <Ferdinandzhong@gmail.com >
2024-10-29 11:49:47 +00:00
eae3d48181
[Bugfix] Use temporary directory in registry ( #9721 )
2024-10-28 22:08:20 -07:00
e74f2d448c
[Doc] Specify async engine args in docs ( #9726 )
2024-10-28 22:07:57 -07:00
7a4df5f200
[Model][LoRA]LoRA support added for Qwen ( #9622 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-29 04:14:07 +00:00
c5d7fb9ddc
[Doc] fix third-party model example ( #9771 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-28 19:39:21 -07:00
76ed5340f0
[torch.compile] add deepseek v2 compile ( #9775 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-28 14:35:17 -07:00
97b61bfae6
[misc] avoid circular import ( #9765 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-28 20:51:23 +00:00
aa0addb397
Adding "torch compile" annotations to moe models ( #9758 )
2024-10-28 13:49:56 -07:00
5f8d8075f9
[Model][VLM] Add multi-video support for LLaVA-Onevision ( #8905 )
...
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-28 18:04:10 +00:00
8b0e4f2ad7
[CI/Build] Adopt Mergify for auto-labeling PRs ( #9259 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-28 09:38:09 -07:00
2adb4409e0
[Bugfix] Fix ray instance detect issue ( #9439 )
2024-10-28 07:13:03 +00:00
feb92fbe4a
Fix beam search eos ( #9627 )
2024-10-28 06:59:37 +00:00
32176fee73
[torch.compile] support moe models ( #9632 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-27 21:58:04 -07:00
4e2d95e372
[Hardware][ROCM] using current_platform.is_rocm ( #9642 )
...
Signed-off-by: wangshuai09 <391746016@qq.com >
2024-10-28 04:07:00 +00:00
34a9941620
[Bugfix] Fix load config when using bools ( #9533 )
2024-10-27 13:46:41 -04:00
e130c40e4e
Fix cache management in "Close inactive issues and PRs" actions workflow ( #9734 )
2024-10-27 10:30:03 -07:00
3cb07a36a2
[Misc] Upgrade to pytorch 2.5 ( #9588 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-27 09:44:24 +00:00
8549c82660
[core] cudagraph output with tensor weak reference ( #9724 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-27 00:19:28 -07:00
67a6882da4
[Misc] SpecDecodeWorker supports profiling ( #9719 )
...
Signed-off-by: Abatom <abatom@163.com >
2024-10-27 04:18:03 +00:00
6650e6a930
[Model] Add classification Task with Qwen2ForSequenceClassification ( #9704 )
...
Signed-off-by: Kevin-Yang <ykcha9@gmail.com >
Co-authored-by: Kevin-Yang <ykcha9@gmail.com >
2024-10-26 17:53:35 +00:00
07e981fdf4
[Frontend] Bad words sampling parameter ( #9717 )
...
Signed-off-by: Vasily Alexeev <alvasian@yandex.ru >
2024-10-26 16:29:38 +00:00
55137e8ee3
Fix: MI100 Support By Bypassing Custom Paged Attention ( #9560 )
2024-10-26 12:12:57 +00:00
5cbdccd151
[Hardware][openvino] is_openvino --> current_platform.is_openvino ( #9716 )
2024-10-26 10:59:06 +00:00
067e77f9a8
[Bugfix] Streaming continuous_usage_stats default to False ( #9709 )
...
Signed-off-by: Sam Stoelinga <sammiestoel@gmail.com >
2024-10-26 05:05:47 +00:00
6567e13724
[Bugfix] Fix crash with llama 3.2 vision models and guided decoding ( #9631 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: pavlo-ruban <pavlo.ruban@servicenow.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-25 15:42:56 -07:00
228cfbd03f
[Doc] Improve quickstart documentation ( #9256 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-25 14:32:10 -07:00
ca0d92227e
[Bugfix] Fix compressed_tensors_moe bad config.strategy ( #9677 )
2024-10-25 12:40:33 -07:00
9645b9f646
[V1] Support sliding window attention ( #9679 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-10-24 22:20:37 -07:00
a6f3721861
[Model] add a lora module for granite 3.0 MoE models ( #9673 )
2024-10-24 22:00:17 -07:00
9f7b4ba865
[ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #9675 ( #9676 )
2024-10-24 20:59:00 -07:00
c91ed47c43
[Bugfix] Remove xformers requirement for Pixtral ( #9597 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 15:38:05 -07:00
59449095ab
[Performance][Kernel] Fused_moe Performance Improvement ( #9384 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2024-10-24 15:37:52 -07:00
e26d37a185
[Log][Bugfix] Fix default value check for image_url.detail ( #9663 )
2024-10-24 10:44:38 -07:00
722d46edb9
[Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints ( #9650 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-24 10:42:24 -07:00
c866e0079d
[CI/Build] Fix VLM test failures when using transformers v4.46 ( #9666 )
2024-10-25 01:40:40 +08:00
d27cfbf791
[torch.compile] Adding torch compile annotations to some models ( #9641 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 09:31:42 -07:00
de662d32b5
Increase operation per run limit for "Close inactive issues and PRs" workflow ( #9661 )
...
Signed-off-by: Harry Mellor <hej.mellor@gmail.com >
2024-10-24 12:17:45 -04:00
f58454968f
[Bugfix]Disable the post_norm layer of the vision encoder for LLaVA models ( #9653 )
2024-10-24 07:52:07 -07:00
b979143d5b
[Doc] Move additional tips/notes to the top ( #9647 )
2024-10-24 09:43:59 +00:00
ad6f78053e
[torch.compile] expanding support and fix allgather compilation ( #9637 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 01:32:15 -07:00
295a061fb3
[Kernel] add kernel for FATReLU ( #9610 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-24 16:18:27 +08:00
8a02cd045a
[torch.compile] Adding torch compile annotations to some models ( #9639 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 00:54:57 -07:00
4fdc581f9e
[core] simplify seq group code ( #9569 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-10-24 00:16:44 -07:00
3770071eb4
[V1][Bugfix] Clean up requests when aborted ( #9629 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-10-23 23:33:22 -07:00
836e8ef6ee
[Bugfix] Fix PP for ChatGLM and Molmo ( #9422 )
2024-10-24 06:12:05 +00:00
056a68c7db
[XPU] avoid triton import for xpu ( #9440 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-24 05:14:00 +00:00
33bab41060
[Bugfix]: Make chat content text allow type content ( #9358 )
...
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
2024-10-24 05:05:49 +00:00
b7df53cd42
[Bugfix] Use "vision_model" prefix for MllamaVisionModel ( #9628 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 10:07:44 +08:00
bb01f2915e
[Bugfix][Model] Fix Mllama SDPA illegal memory access for batched multi-image ( #9626 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 10:03:44 +08:00
b548d7a5f4
[CI/Build] Add bot to close stale issues and PRs ( #9436 )
2024-10-23 15:45:26 -07:00
fc6c274626
[Model] Add Qwen2-Audio model support ( #9248 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-23 17:54:22 +00:00
150b779081
[Frontend] Enable Online Multi-image Support for MLlama ( #9393 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-23 17:28:57 +00:00
9013e24f7b
[torch.compile] Adding torch compile annotations to some models ( #9614 )
2024-10-23 10:07:48 -07:00
fd0e2cfdb2
[Misc] Separate total and output tokens in benchmark_throughput.py ( #8914 )
2024-10-23 16:47:20 +00:00
e5ac6a4199
[Bugfix] Fix divide by zero when serving Mamba models ( #9617 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-10-23 16:40:43 +00:00
dbdd3b5e5a
[misc] comment to avoid future confusion about baichuan ( #9620 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-23 09:14:44 -07:00
e7116c017c
[Bugfix] Fix _init_vision_model in NVLM_D model ( #9611 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-23 14:09:04 +00:00
31a08f5bd2
[Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs ( #9612 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-23 14:05:18 +00:00
c18e1a3418
[VLM] Enable overriding whether post layernorm is used in vision encoder + fix quant args ( #9217 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-23 11:27:37 +00:00
3ff57ebfca
[Model] Initialize Florence-2 language backbone support ( #9555 )
2024-10-23 10:42:47 +00:00
2394962d70
[Hardware][XPU] using current_platform.is_xpu ( #9605 )
2024-10-23 08:28:21 +00:00
51c24c9736
[Build] Fix FetchContent multiple build issue ( #9596 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2024-10-23 12:43:07 +08:00
831540cf04
[Model] Support E5-V ( #9576 )
2024-10-23 11:35:29 +08:00
29061ed9df
[Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend to all logging messages ( #9590 )
2024-10-23 11:17:28 +08:00
65050a40e6
[Bugfix] Generate exactly input_len tokens in benchmark_throughput ( #9592 )
2024-10-22 17:45:35 -07:00
208cb34c81
[Doc]: Update tensorizer docs to include vllm[tensorizer] ( #7889 )
...
Co-authored-by: Kaunil Dhruv <dhruv.kaunil@gmail.com >
2024-10-22 15:43:25 -07:00
b17046e298
[BugFix] Fix metrics error for --num-scheduler-steps > 1 ( #8234 )
2024-10-22 15:43:03 -07:00
d1e8240875
[Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing ( #9487 )
2024-10-22 15:41:13 -07:00
cb6fdaa0a0
[Misc] Make benchmarks use EngineArgs ( #9529 )
2024-10-22 15:40:38 -07:00
23b899a8e6
[Bugfix] fix detokenizer shallow copy ( #5919 )
2024-10-22 15:38:12 -07:00
17c79f3c36
[torch.compile] auto infer dynamic_arg_dims from type annotation ( #9589 )
2024-10-22 13:43:37 -07:00
cd5601ac37
[BugFix] Prevent exporting duplicate OpenTelemetry spans ( #9017 )
2024-10-22 11:11:53 -07:00
434984e665
[Frontend] Support custom request_id from request ( #9550 )
...
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com >
2024-10-22 18:07:30 +00:00
32a1ee74a0
[Hardware][Intel CPU][DOC] Update docs for CPU backend ( #6212 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Gubrud, Aaron D <aaron.d.gubrud@intel.com >
Co-authored-by: adgubrud <96072084+adgubrud@users.noreply.github.com >
2024-10-22 10:38:04 -07:00
08075c3448
[Bugfix] Eagle: change config name for fc bias ( #9580 )
2024-10-22 16:14:22 +00:00
bb392ea2d2
[Model][VLM] Initialize support for Mono-InternVL model ( #9528 )
2024-10-22 16:01:46 +00:00
9dbcce84a7
[Neuron] [Bugfix] Fix neuron startup ( #9374 )
...
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-10-22 12:51:41 +00:00
a48e3ec052
[CI/Build][LoRA] Temporarily fix long context failure issue ( #9579 )
2024-10-22 11:32:51 +00:00
6c5af09b39
[V1] Implement vLLM V1 [1/N] ( #9289 )
2024-10-22 01:24:07 -07:00
3ddbe25502
[Hardware][CPU] using current_platform.is_cpu ( #9536 )
2024-10-22 00:50:43 -07:00
0d02747f2e
support TP in qwen2 bnb ( #9574 )
2024-10-22 07:13:23 +00:00
f7db5f0fa9
[Doc] Use shell code-blocks and fix section headers ( #9508 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-22 06:43:24 +00:00
ca30c3c84b
[Core] Remove evictor_v1 ( #9572 )
2024-10-22 04:55:49 +00:00
c0292211ce
[CI/Build] Replaced some models on tests for smaller ones ( #9570 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-22 04:52:14 +00:00
74692421f7
[Bugfix]: phi.py get rope_theta from config file ( #9503 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-22 02:53:36 +00:00
29acd2c34c
[Bugfix][OpenVINO] fix_dockerfile_openvino ( #9552 )
2024-10-21 19:47:52 -07:00
f085995a7b
[CI/Build] Remove unnecessary fork_new_process ( #9484 )
2024-10-21 19:47:29 -07:00
b729901139
[Bugfix]: serialize config by value for --trust-remote-code ( #6751 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-21 19:46:24 -07:00
76a5e13270
[core] move parallel sampling out from vllm core ( #9302 )
2024-10-22 00:31:44 +00:00
ef7faad1b8
🐛 Fixup more test failures from memory profiling ( #9563 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-21 17:10:56 -07:00
575dcebe9a
[CI] Make format checker error message more user-friendly by using emoji ( #9564 )
...
This PR makes format checker error message more user-friendly by adding emojis.
2024-10-21 23:45:15 +00:00
711f3a7806
[Frontend] Don't log duplicate error stacktrace for every request in the batch ( #9023 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-21 14:49:41 -07:00
15713e3b75
[BugFix] Update draft model TP size check to allow matching target TP size ( #9394 )
...
Co-authored-by: Baoyuan Qi <qibaoyuan@126.com >
2024-10-21 14:14:29 -07:00
d621c43df7
[doc] fix format ( #9562 )
2024-10-21 13:54:57 -07:00
9d9186be97
[Frontend] Reduce frequency of client cancellation checking ( #7959 )
2024-10-21 13:28:10 -07:00
5241aa1494
[Model][Bugfix] Fix batching with multi-image in PixtralHF ( #9518 )
2024-10-21 14:20:07 -04:00
ec6bd6c4c6
[BugFix] Use correct python3 binary in Docker.ppc64le entrypoint ( #9492 )
...
Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com >
2024-10-21 17:43:02 +00:00
8ca8954841
[Bugfix][Misc]: fix graph capture for decoder ( #9549 )
2024-10-21 17:33:30 +00:00
f6b97293aa
[Model] FalconMamba Support ( #9325 )
2024-10-21 12:50:16 -04:00
496e991da8
[Doc] Consistent naming of attention backends ( #9498 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-10-21 22:29:57 +08:00
696b01af8f
[CI/Build] Split up decoder-only LM tests ( #9488 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-20 21:27:50 -07:00
855e0e6f97
[Frontend][Misc] Goodput metric support ( #9338 )
2024-10-20 18:39:32 +00:00
4fa3e33349
[Kernel] Support sliding window in flash attention backend ( #9403 )
2024-10-20 10:57:52 -07:00
962d2c6349
[Model][Pixtral] Use memory_efficient_attention for PixtralHFVision ( #9520 )
2024-10-20 05:29:14 +00:00
5b59fe0f08
[Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger ( #9530 )
2024-10-20 00:05:02 +00:00
8e3e7f2713
[Model][Pixtral] Optimizations for input_processor_for_pixtral_hf ( #9514 )
2024-10-19 10:44:29 -04:00
263d8ee150
[Bugfix] Fix missing task for speculative decoding ( #9524 )
2024-10-19 06:49:40 +00:00
c5eea3c8ba
[Frontend] Support simpler image input format ( #9478 )
2024-10-18 23:17:07 -07:00
85dc92fc98
[CI/Build] Configure matcher for actionlint workflow ( #9511 )
...
Signed-off-by: Russell Bryant <russell.bryant@gmail.com >
2024-10-19 06:04:18 +00:00
dfd951ed9b
[CI/Build] Add error matching for ruff output ( #9513 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-19 05:42:20 +00:00
82c25151ec
[Doc] update gpu-memory-utilization flag docs ( #9507 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-19 11:26:36 +08:00
1325872ec8
[Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily ( #9521 )
2024-10-18 20:21:01 -07:00
380e18639f
🐛 fix torch memory profiling ( #9516 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-18 21:25:19 -04:00
337ed76671
[Bugfix] Fix offline mode when using mistral_common ( #9457 )
2024-10-18 18:12:32 -07:00
0c9a5258f9
[Kernel] Add env variable to force flashinfer backend to enable tensor cores ( #9497 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-10-18 17:55:48 -07:00
d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py ( #9510 )
2024-10-18 14:30:55 -07:00
9bb10a7d27
[MISC] Add lora requests to metrics ( #9477 )
...
Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal >
2024-10-18 20:50:18 +00:00
3921a2f29e
[Model] Support Pixtral models in the HF Transformers format ( #9036 )
2024-10-18 13:29:56 -06:00
67a7e5ef38
[CI/Build] Add error matching config for mypy ( #9512 )
2024-10-18 12:17:53 -07:00
051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding ( #9424 )
2024-10-18 11:31:58 -07:00
7dbe738d65
[Misc] benchmark: Add option to set max concurrency ( #9390 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-18 11:15:28 -07:00
ae8b633ba3
[Bugfix] Fix offline_inference_with_prefix.py ( #9505 )
2024-10-18 16:59:19 +00:00
1bbbcc0b1d
[CI/Build] Fix lint errors in mistral tokenizer ( #9504 )
2024-10-19 00:09:35 +08:00
25aeb7d4c9
[BugFix] Fix and simplify completion API usage streaming ( #9475 )
2024-10-18 14:10:26 +00:00
d2b1bf55ec
[Frontend][Feature] Add jamba tool parser ( #9154 )
2024-10-18 10:27:48 +00:00
1ffc8a7362
[BugFix] Typing fixes to RequestOutput.prompt and beam search ( #9473 )
2024-10-18 07:19:53 +00:00
944dd8edaf
[CI/Build] Use commit hash references for github actions ( #9430 )
2024-10-17 21:54:58 -07:00
154a8ae880
[Qwen2.5] Support bnb quant for Qwen2.5 ( #9467 )
2024-10-18 04:40:14 +00:00
de4008e2ab
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage ( #9352 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-17 22:47:27 -04:00
48138a8415
[BugFix] Stop silent failures on compressed-tensors parsing ( #9381 )
2024-10-17 18:54:00 -07:00
343f8e0905
Support BERTModel (first encoder-only embedding model) ( #9056 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com >
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: laishzh <laishengzhang@gmail.com >
Co-authored-by: Max de Bayser <maxdebayser@gmail.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-10-17 23:21:01 +00:00
bb76538bbd
[Hardware][Neuron] Simplify model load for transformers-neuronx library ( #9380 )
2024-10-17 15:39:39 -07:00
d615b5c9f8
[Bugfix] Print warnings related to mistral_common tokenizer only once ( #9468 )
2024-10-17 21:44:20 +00:00
d65049daab
[Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script ( #9013 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-17 21:11:11 +00:00
eca2c5f7c0
[Bugfix] Fix support for dimension like integers and ScalarType ( #9299 )
2024-10-17 19:08:34 +00:00
0f41fbe5a3
[torch.compile] Fine-grained CustomOp enabling mechanism ( #9300 )
2024-10-17 18:36:37 +00:00
7871659abb
[Misc] Remove commit id file ( #9470 )
2024-10-17 10:34:37 -07:00
a2c71c5405
[CI/Build] remove .github from .dockerignore, add dirty repo check ( #9375 )
2024-10-17 10:25:06 -07:00
81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default ( #8704 )
...
Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
5eda21e773
[Hardware][CPU] compressed-tensor INT8 W8A8 AZP support ( #9344 )
2024-10-17 12:21:04 -04:00
8e1cddcd44
[TPU] Call torch._sync(param) during weight loading ( #9437 )
2024-10-17 09:00:11 -07:00
5e443b594f
[Bugfix] Allow prefill of assistant response when using mistral_common ( #9446 )
2024-10-17 15:06:37 +00:00
9d30a056e7
[misc] CUDA Time Layerwise Profiler ( #8337 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-10-17 10:36:09 -04:00
390be74649
[Misc] Print stack trace using logger.exception ( #9461 )
2024-10-17 13:55:48 +00:00
e312e52b44
[Kernel] Add Exllama as a backend for compressed-tensors ( #9395 )
2024-10-17 09:48:26 -04:00
dbfa8d31d5
Add notes on the use of Slack ( #9442 )
2024-10-17 04:46:46 +00:00
92d86da217
[BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels ( #9391 )
2024-10-17 01:34:06 +00:00
c3fab5f769
[Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel ( #9425 )
2024-10-16 23:46:06 +00:00
776dbd74f1
[CI/Build] mypy: Resolve some errors from checking vllm/engine ( #9267 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-16 22:55:59 +00:00
8345045833
[Performance][Spec Decode] Optimize ngram lookup performance ( #9333 )
2024-10-16 13:37:45 -06:00
5b8a1fde84
[Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft ( #9396 )
2024-10-16 16:40:24 +00:00
fb60ae9b91
[Kernel][Model] Improve continuous batching for Jamba and Mamba ( #9189 )
2024-10-16 12:12:43 -04:00
415f76a9cb
Support mistral interleaved attn ( #9414 )
2024-10-16 13:28:30 +00:00
cf1d62a644
[Model] Support SDPA attention for Molmo vision backbone ( #9410 )
2024-10-16 11:52:01 +00:00
59230ef32b
[Misc] Consolidate example usage of OpenAI client for multimodal models ( #9412 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-16 11:20:51 +00:00
cee711fdbb
[Core] Rename input data types ( #8688 )
2024-10-16 10:49:37 +00:00
1de76a0e55
[CI/Build] Test VLM embeddings ( #9406 )
2024-10-16 09:44:30 +00:00
7abba39ee6
[Model] VLM2Vec, the first multimodal embedding model in vLLM ( #9303 )
2024-10-16 14:31:00 +08:00
7e7eae338d
[Misc] Standardize RoPE handling for Qwen2-VL ( #9250 )
2024-10-16 13:56:17 +08:00
ed920135c8
[Bugfix] Molmo text-only input bug fix ( #9397 )
...
Co-authored-by: sanghol <sanghol@allenai.org >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-16 04:56:09 +00:00
717a5f82cd
[Bugfix][CI/Build] Fix CUDA 11.8 Build ( #9386 )
2024-10-16 00:15:21 +00:00
ba30942240
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids ( #9034 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-15 15:40:43 -07:00
22f8a69549
[Misc] Directly use compressed-tensors for checkpoint definitions ( #8909 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-15 15:40:25 -07:00
5d264f4ab8
pass ignore_eos parameter to all benchmark_serving calls ( #9349 )
2024-10-15 13:30:44 -07:00
e9d517f276
[BugFix] Fix chat API continuous usage stats ( #9357 )
2024-10-14 23:19:48 -07:00
55e081fbad
[Bugfix] Update InternVL input mapper to support image embeds ( #9351 )
2024-10-14 21:29:19 -07:00
8e836d982a
[Doc] Fix code formatting in spec_decode.rst ( #9348 )
2024-10-14 21:29:11 -07:00
44eaa5a5d9
[Frontend] Clarify model_type error messages ( #9345 )
2024-10-14 21:29:01 -07:00
169b530607
[Bugfix] Clean up some cruft in mamba.py ( #9343 )
2024-10-15 00:24:25 +00:00
f0fe4fe86d
[Model] Make llama3.2 support multiple and interleaved images ( #9095 )
2024-10-14 15:24:26 -07:00
4d31cd424b
[Frontend] merge beam search implementations ( #9296 )
2024-10-14 15:05:52 -07:00
473e7b3606
[TPU] Fix TPU SMEM OOM by Pallas paged attention kernel ( #9350 )
2024-10-14 15:02:06 -07:00
fd47e57f4b
[Docs] Remove PDF build from Readthedocs ( #9347 )
2024-10-14 11:57:47 -07:00
203ab8f80f
[CI/Build] setuptools-scm fixes ( #8900 )
2024-10-14 11:34:47 -07:00
4141608c6a
[Hardware][intel GPU] add async output process for xpu ( #8897 )
2024-10-14 12:23:33 -06:00
dfe43a2071
[Model] Molmo vLLM Integration ( #9016 )
...
Co-authored-by: sanghol <sanghol@allenai.org >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-14 07:56:24 -07:00
16b24e7dcd
[Bugfix] Bandaid fix for speculative decoding tests ( #9327 )
2024-10-13 23:02:11 +00:00
f519902c52
[CI] Fix merge conflict ( #9317 )
2024-10-13 06:41:23 +00:00
250e26a63e
[Bugfix] Fix MiniCPM's LoRA bug ( #9286 )
2024-10-12 09:36:47 -07:00
2b184ddd4f
[Misc][Installation] Improve source installation script and doc ( #9309 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-10-12 09:36:40 -07:00
00298e092c
[Bugfix] Fix bug of xformer prefill for encoder-decoder ( #9026 )
2024-10-12 15:00:43 +08:00
89feb4c84d
[SpecDec] Remove Batch Expansion (2/3) ( #9298 )
2024-10-12 05:13:37 +00:00
ec10cb8511
[BugFix] Fix tool call finish reason in streaming case ( #9209 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-10-11 18:24:26 -07:00
d11b46f3a5
[bugfix] fix f-string for error ( #9295 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2024-10-11 17:03:48 -07:00
c6cf9295e1
[Bugfix] Sets is_first_step_output for TPUModelRunner ( #9202 )
2024-10-11 13:28:10 -07:00
de9fb4bef8
[Bugfix][CI/Build] Fix docker build where CUDA archs < 7.0 are being detected ( #9254 )
2024-10-11 15:57:39 -04:00
8baf85e4e9
[Doc] Compatibility matrix for mutual exclusive features ( #8512 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-11 11:18:50 -07:00
1a1823871d
[Doc] Remove outdated comment to avoid misunderstanding ( #9287 )
2024-10-11 18:02:03 +00:00
6cf1167c1a
[Model] Add GLM-4v support and meet vllm==0.6.2 ( #9242 )
2024-10-11 17:36:13 +00:00
f710090d8e
[Kernel] adding fused moe kernel config for L40S TP4 ( #9245 )
2024-10-11 08:54:22 -07:00
7342a7d7f8
[Model] Support Mamba ( #6484 )
2024-10-11 15:40:06 +00:00
df3dcdf49d
[Bugfix] Fix priority in multiprocessing engine ( #9277 )
2024-10-11 15:35:35 +00:00
36ea79079b
[Misc][LoRA] Support loading LoRA weights for target_modules in reg format ( #9275 )
2024-10-11 12:31:21 +00:00
e808156f30
[Misc] Collect model support info in a single process per model ( #9233 )
2024-10-11 11:08:11 +00:00
cbc2ef5529
[misc] hide best_of from engine ( #9261 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com >
2024-10-10 21:30:44 -07:00
94bf9ae4e9
[Misc] Fix sampling from sonnet for long context case ( #9235 )
2024-10-11 00:33:16 +00:00
f990bab2a4
[Doc][Neuron] add note to neuron documentation about resolving triton issue ( #9257 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-10-10 23:36:32 +00:00
e00c094f15
[torch.compile] generic decorators ( #9258 )
2024-10-10 15:54:23 -07:00
a78c6ba7c8
[ci/build] Add placeholder command for custom models test ( #9262 )
2024-10-10 15:45:09 -07:00
fb870fd491
Bump actions/setup-python from 3 to 5 ( #9195 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:46 -07:00
270953bafb
Bump actions/checkout from 3 to 4 ( #9196 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:35 -07:00
9cc811c4ff
Bump actions/github-script from 6 to 7 ( #9197 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:24 -07:00
e4d652ea3e
[torch.compile] integration with compilation control ( #9058 )
2024-10-10 12:39:36 -07:00
78c0b4166c
Suggest codeowners for the core components ( #9210 )
2024-10-10 12:29:24 -07:00
21efb603f5
[CI/Build] Make the Dockerfile.cpu file's PIP_EXTRA_INDEX_URL Configurable as a Build Argument ( #9252 )
2024-10-10 18:18:18 +00:00
055f3270d4
[Doc] Improve debugging documentation ( #9204 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-10 10:48:51 -07:00
18511aeda6
[Bugfix] Fix Machete unittests failing with NotImplementedError ( #9218 )
2024-10-10 17:39:56 +00:00
83ea5c72b9
[OpenVINO] Use torch 2.4.0 and newer optimum version ( #9121 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-10 11:18:58 -06:00
04de9057ab
[Model] support input image embedding for minicpmv ( #9237 )
2024-10-10 15:00:47 +00:00
07c11cf4d4
[Bugfix] Fix lm_head weights tying with lora for llama ( #9227 )
2024-10-10 21:11:56 +08:00
f3a507f1d3
[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 ( #9149 )
2024-10-10 14:17:17 +08:00
a64e7b9407
[Bugfix] Machete garbage results for some models (large K dim) ( #9212 )
2024-10-10 14:16:17 +08:00
ce00231a8b
[Bugfix] Fix Weight Loading Multiple GPU Test - Large Models ( #9213 )
2024-10-10 14:15:40 +08:00
de895f1697
[misc] improve model support check in another process ( #9208 )
2024-10-09 21:58:27 -07:00
cf25b93bdd
[Core] Fix invalid args to _process_request ( #9201 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-10 12:10:09 +08:00
d5fbb8706d
[CI/Build] Update Dockerfile install+deploy image to ubuntu 22.04 ( #9130 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-09 12:51:47 -06:00
cdca8994bd
[CI/Build] mypy: check vllm/entrypoints ( #9194 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-09 17:15:28 +00:00
ca77dd7a44
[Hardware][CPU] Support AWQ for CPU backend ( #7515 )
2024-10-09 10:28:08 -06:00
7dea289066
Add Dependabot configuration for GitHub Actions updates ( #1217 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-09 08:16:26 -07:00
cfaa6008e6
[Bugfix] Access get_vocab instead of vocab in tool parsers ( #9188 )
2024-10-09 08:59:57 -06:00
21906a6f50
[Bugfix] Fix lora loading for Compressed Tensors in #9120 ( #9179 )
2024-10-09 12:10:44 +00:00
dc4aea677a
[Doc] Fix VLM prompt placeholder sample bug ( #9170 )
2024-10-09 08:59:42 +00:00
c8627cd41b
[ci][test] use load dummy for testing ( #9165 )
2024-10-09 00:38:40 -07:00
8bfaa4e31e
[Bugfix] fix composite weight loading and EAGLE weight loading ( #9160 )
2024-10-09 00:36:55 -07:00
0b5b5d767e
[Frontend] Log the maximum supported concurrency ( #8831 )
2024-10-09 00:03:14 -07:00
cdc72e3c80
[Model] Remap FP8 kv_scale in CommandR and DBRX ( #9174 )
2024-10-09 06:43:06 +00:00
7627172bf4
[Bugfix][Doc] Report neuron error in output ( #9159 )
2024-10-08 22:43:34 -07:00
480b7f40cf
[Misc] Improve validation errors around best_of and n ( #9167 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-10-09 04:54:48 +00:00
acce7630c1
Update link to KServe deployment guide ( #9173 )
2024-10-09 03:58:49 +00:00
ffc4b27ea8
Add classifiers in setup.py ( #9171 )
2024-10-08 19:30:48 -07:00
2f4117c38e
support bitsandbytes quantization with more models ( #9148 )
2024-10-08 19:52:19 -06:00
9ba0bd6aa6
Add lm-eval directly to requirements-test.txt ( #9161 )
2024-10-08 18:22:31 -07:00
2a131965a8
mypy: check additional directories ( #9162 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-08 22:08:22 +00:00
bd37b9fbe2
[Bugfix] Try to handle older versions of pytorch ( #9086 )
2024-10-08 14:28:12 -07:00
de24046fcd
[Doc] Improve contributing and installation documentation ( #9132 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-08 20:22:08 +00:00
1874c6a1b0
[Doc] Update vlm.rst to include an example on videos ( #9155 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-08 18:12:29 +00:00
9a94ca4a5d
[Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing ( #8537 )
2024-10-08 09:38:40 -07:00
cfba685bd4
[CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models ( #8758 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2024-10-08 09:37:34 -07:00
069d3bd8d0
[Frontend] Add Early Validation For Chat Template / Tool Call Parser ( #9151 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-08 14:31:26 +00:00
a3691b6b5e
[Core][Frontend] Add Support for Inference Time mm_processor_kwargs ( #9131 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-08 14:12:56 +00:00
8c746226c9
[Frontend] API support for beam search for MQLLMEngine ( #9117 )
2024-10-08 05:51:43 +00:00
e1faa2a598
[misc] improve ux on readme ( #9147 )
2024-10-07 22:26:25 -07:00
80b57f00d5
[Intel GPU] Fix xpu decode input ( #9145 )
2024-10-08 03:51:14 +00:00
04c12f8157
[misc] update utils to support comparing multiple settings ( #9140 )
2024-10-08 02:51:49 +00:00
8eeb857084
Add Slack to README ( #9137 )
2024-10-07 17:06:21 -07:00
fa45513a51
[misc] fix comment and variable name ( #9139 )
2024-10-07 16:07:05 -07:00
c0d9a98d0c
[Doc] Include performance benchmark in README ( #9135 )
2024-10-07 15:04:06 -07:00
e0dbdb013d
[CI/Build] Add linting for github actions workflows ( #7876 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-07 21:18:10 +00:00
93cf74a8a7
[Doc]: Add deploying_with_k8s guide ( #8451 )
2024-10-07 13:31:45 -07:00
151ef4efd2
[Model] Support NVLM-D and fix QK Norm in InternViT ( #9045 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2024-10-07 11:55:12 +00:00
f19da64871
[Core] Refactor GGUF parameters packing and forwarding ( #8859 )
2024-10-07 10:01:46 +00:00
4f95ffee6f
[Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend ( #9089 )
2024-10-07 06:50:35 +00:00
8c6de96ea1
[Model] Explicit interface for vLLM models and support OOT embedding models ( #9108 )
2024-10-07 06:10:35 +00:00
18b296fdb2
[core] remove beam search from the core ( #9105 )
2024-10-07 05:47:04 +00:00
c8f26bb636
[BugFix][Core] Fix BlockManagerV2 when Encoder Input is None ( #9103 )
2024-10-07 03:52:42 +00:00
487678d046
[Bugfix][Hardware][CPU] Fix CPU model input for decode ( #9044 )
2024-10-06 19:14:27 -07:00
cb3b2b9ba4
[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling ( #9038 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-10-06 12:48:11 -07:00
fdf59d30ea
[Bugfix] fix tool_parser error handling when serving a model that does not support it ( #8709 )
2024-10-06 12:51:08 +00:00
b22b798471
[Model] PP support for embedding models and update docs ( #9090 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-10-06 16:35:27 +08:00
f22619fe96
[Misc] Remove user-facing error for removed VLM args ( #9104 )
2024-10-06 01:33:52 -07:00
168cab6bbf
[Frontend] API support for beam search ( #9087 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-10-05 23:39:03 -07:00
23fea8714a
[Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model ( #9101 )
2024-10-06 13:00:04 +08:00
f4dd830e09
[core] use forward context for flash infer ( #9097 )
2024-10-05 19:37:31 -07:00
5df1834895
[Bugfix] Fix bug where argument order matters in config.yaml ( #8960 )
2024-10-05 17:35:11 +00:00
cfadb9c687
[Bugfix] Deprecate registration of custom configs to huggingface ( #9083 )
2024-10-05 21:56:40 +08:00
15986f598c
[Model] Support Gemma2 embedding model ( #9004 )
2024-10-05 06:57:05 +00:00
53b3a33027
[Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs ( #8979 )
2024-10-04 22:05:37 -07:00
dac914b0d6
[Bugfix] use blockmanagerv1 for encoder-decoder ( #9084 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-05 04:45:38 +00:00
a95354a36e
[Doc] Update README.md with Ray summit slides ( #9088 )
2024-10-05 02:54:45 +00:00
663874e048
[torch.compile] improve allreduce registration ( #9061 )
2024-10-04 16:43:50 -07:00
cc90419e89
[Hardware][Neuron] Add on-device sampling support for Neuron ( #8746 )
...
Co-authored-by: Ashraf Mahgoub <ashymahg@amazon.com >
2024-10-04 16:42:20 -07:00
27302dd584
[Misc] Fix CI lint ( #9085 )
2024-10-04 16:07:54 -07:00
0cc566ca8f
[Misc] Add random seed for prefix cache benchmark ( #9081 )
2024-10-04 21:58:57 +00:00
05c531be47
[Misc] Improved prefix cache example ( #9077 )
2024-10-04 21:38:42 +00:00
fbb74420e7
[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang ( #7412 )
2024-10-04 14:01:44 -07:00
05d686432f
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE ( #8973 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com >
Co-authored-by: Dipika Sikka <ds3822@columbia.edu >
2024-10-04 12:34:44 -06:00
0dcc8cbe5a
Adds truncate_prompt_tokens param for embeddings creation ( #8999 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
2024-10-04 18:31:40 +00:00
26aa325f4f
[Core][VLM] Test registration for OOT multimodal models ( #8717 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-04 10:38:25 -07:00
e5dc713c23
[Hardware][PowerPC] Make oneDNN dependency optional for Power ( #9039 )
...
Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com >
2024-10-04 17:24:42 +00:00
36eecfbddb
Remove AMD Ray Summit Banner ( #9075 )
2024-10-04 10:17:16 -07:00
9ade8bbc8d
[Model] add a bunch of supported lora modules for mixtral ( #9008 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2024-10-04 16:24:40 +00:00
22482e495e
[Bugfix] Flash attention arches not getting set properly ( #9062 )
2024-10-04 09:43:15 -06:00
3d826d2c52
[Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL ( #9071 )
2024-10-04 14:34:58 +00:00
0e36fd4909
[Misc] Move registry to its own file ( #9064 )
2024-10-04 10:01:37 +00:00
0f6d7a9a34
[Models] Add remaining model PP support ( #7168 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Signed-off-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-04 10:56:58 +08:00
303d44790a
[Misc] Enable multi-step output streaming by default ( #9047 )
2024-10-03 22:55:42 -04:00
aeb37c2a72
[CI/Build] Per file CUDA Archs (improve wheel size and dev build times) ( #8845 )
2024-10-03 22:55:25 -04:00
3dbb215b38
[Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model ( #8405 )
2024-10-04 10:36:39 +08:00
2838d6b38e
[Bugfix] Weight loading fix for OPT model ( #9042 )
...
Co-authored-by: dvres <dvres@fri.uni-lj.si >
2024-10-03 19:53:29 -04:00
91add85ec4
Fix failing spec decode test ( #9054 )
2024-10-03 23:07:29 +00:00
9aaf14c62e
[misc] add forward context for attention ( #9029 )
2024-10-03 12:09:42 -07:00
63e39937f9
[Frontend] [Neuron] Parse literals out of override-neuron-config ( #8959 )
...
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-10-03 18:02:07 +00:00
f5d72b2fc6
[Core] Make BlockSpaceManagerV2 the default BlockManager to use. ( #8678 )
2024-10-03 09:44:21 -07:00
83caf35e08
[BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser ( #9020 )
2024-10-03 16:44:52 +08:00
01843c89b8
[Misc] log when using default MoE config ( #8971 )
2024-10-03 04:31:07 +00:00
19a4dd0990
[Bugfix] example template should not add parallel_tool_prompt if tools is none ( #9007 )
2024-10-03 03:04:17 +00:00
18c2e30c57
[Doc] Update Granite model docs ( #9025 )
2024-10-03 02:42:24 +00:00
19f0d25796
[Model] Adding Granite MoE. ( #8206 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-03 09:33:57 +08:00
f58d4fccc9
[OpenVINO] Enable GPU support for OpenVINO vLLM backend ( #8192 )
2024-10-02 17:50:01 -04:00
afb050b29d
[Core] CUDA Graphs for Multi-Step + Chunked-Prefill ( #8645 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-10-02 19:44:39 +00:00
7f60520deb
[Misc] Update Default Image Mapper Error Log ( #8977 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-10-02 11:44:38 +00:00
563649aafe
[Core] Combined support for multi-step scheduling, chunked prefill & prefix caching ( #8804 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Andrew Feldman <afeld2012@gmail.com >
2024-10-02 07:52:20 +00:00
1570203864
[Spec Decode] (1/2) Remove batch expansion ( #8839 )
2024-10-01 16:04:42 -07:00
22f5851b80
Update benchmark_serving.py to read and write JSON datasets and results in UTF-8, for better compatibility with Windows ( #8997 )
2024-10-01 11:07:06 -07:00
4f341bd4bf
[Doc] Update list of supported models ( #8987 )
2024-10-02 00:35:39 +08:00
35bd215168
[Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API ( #8965 )
2024-10-01 09:58:06 +00:00
1fe0a4264a
[Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders ( #8991 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-01 09:52:44 +00:00
bc4eb65b54
[Bugfix] Fix Fuyu tensor parallel inference ( #8986 )
2024-10-01 17:51:41 +08:00
82f3937e59
[Misc] add process_weights_after_loading for DummyLoader ( #8969 )
2024-10-01 03:46:41 +00:00
7da2487591
[torch.compile] fix tensor alias ( #8982 )
2024-10-01 03:40:48 +00:00
aaccca2b4d
[CI/Build] Fix machete generated kernel files ordering ( #8976 )
...
Signed-off-by: kevin <kevin@anyscale.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-10-01 03:33:12 +00:00
062c89e7c9
[Frontend][Core] Move guided decoding params into sampling params ( #8252 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-01 09:34:25 +08:00
bce324487a
[CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. ( #8975 )
2024-10-01 00:51:40 +00:00
1425a1bcf9
[ci] Add CODEOWNERS for test directories ( #8795 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-01 00:47:08 +00:00
1cabfcefb6
[Misc] Adjust max_position_embeddings for LoRA compatibility ( #8957 )
2024-09-30 12:57:39 +00:00
be76e5aabf
[Core] Make scheduling policy settable via EngineArgs ( #8956 )
2024-09-30 12:28:44 +00:00
2ae25f79cf
[Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg ( #8946 )
2024-09-30 13:01:20 +08:00
8e60afa15e
[Model][LoRA] LoRA support added for MiniCPMV2.6 ( #8943 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-30 04:31:55 +00:00
b6d7392579
[Misc][CI/Build] Include cv2 via mistral_common[opencv] ( #8951 )
2024-09-30 04:28:26 +00:00
e01ab595d8
[Model] support input embeddings for qwen2vl ( #8856 )
2024-09-30 03:16:10 +00:00
f13a07b1f8
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model ( #8533 )
2024-09-29 17:35:58 -04:00
6c9ba48fde
[Frontend] Added support for HF's new continue_final_message parameter ( #8942 )
2024-09-29 17:59:47 +00:00
1fb9c1b0bf
[Misc] Fix typo in BlockSpaceManagerV1 ( #8944 )
2024-09-29 15:05:54 +00:00
31f46a0d35
[BugFix] Fix seeded random sampling with encoder-decoder models ( #8870 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-29 09:43:14 +00:00
3d49776bbb
[Model][LoRA] LoRA support added for MiniCPMV2.5 ( #7199 )
2024-09-29 06:59:45 +00:00
bc2ef1f77c
[Model] Support Qwen2.5-Math-RM-72B ( #8896 )
2024-09-28 21:19:39 -07:00
2e7fe7e79f
[Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching ( #8930 )
2024-09-29 03:13:01 +00:00
26a68d5d7e
[CI/Build] Add test decorator for minimum GPU memory ( #8925 )
2024-09-29 02:50:51 +00:00
d081da0064
[Bugfix] Fix Marlin MoE act order when is_k_full == False ( #8741 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-09-28 18:19:40 -07:00
5bf8789b2a
[Bugfix] Block manager v2 with preemption and lookahead slots ( #8824 )
2024-09-29 09:17:45 +08:00
d1537039ce
[Core] Improve choice of Python multiprocessing method ( #8823 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-29 09:17:07 +08:00
cc276443b5
[doc] organize installation doc and expose per-commit docker ( #8931 )
2024-09-28 17:48:41 -07:00
e585b583a9
[Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 ( #8891 )
2024-09-28 18:51:22 +00:00
090e945e36
[Frontend] Make beam search emulator temperature modifiable ( #8928 )
...
Co-authored-by: Eduard Balzin <nfunctor@yahoo.fr >
2024-09-28 11:30:21 -07:00
e1a3f5e831
[CI/Build] Update models tests & examples ( #8874 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-28 09:54:35 -07:00
19d02ff938
[Bugfix] Fix PP for Multi-Step ( #8887 )
2024-09-28 08:52:46 -07:00
39d3f8d94f
[Bugfix] Fix code for downloading models from modelscope ( #8443 )
2024-09-28 08:24:12 -07:00
b0298aa8cc
[Misc] Remove vLLM patch of BaichuanTokenizer ( #8921 )
2024-09-28 08:11:25 +00:00
260024a374
[Bugfix][Intel] Fix XPU Dockerfile Build ( #7824 )
...
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-27 23:45:50 -07:00
d86f6b2afb
[misc] fix wheel name ( #8919 )
2024-09-27 22:10:44 -07:00
bd429f2b75
[Core] Priority-based scheduling in async engine ( #8850 )
2024-09-27 15:07:10 -07:00
18e60d7d13
[misc][distributed] add VLLM_SKIP_P2P_CHECK flag ( #8911 )
2024-09-27 14:27:56 -07:00
c2ec430ab5
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path ( #8378 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-09-27 13:32:07 -07:00
c5d55356f9
[Bugfix] fix for deepseek w4a16 ( #8906 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-09-27 13:12:34 -06:00
172d1cd276
[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method ( #7271 )
2024-09-27 14:25:10 -04:00
a9b15c606f
[torch.compile] use empty tensor instead of None for profiling ( #8875 )
2024-09-27 08:11:32 -07:00
8df2dc3c88
[TPU] Update pallas.py to support trillium ( #8871 )
2024-09-27 01:16:55 -07:00
6d792d2f31
[Bugfix][VLM] Fix Fuyu batching inference with max_num_seqs>1 ( #8892 )
2024-09-27 01:15:58 -07:00
0e088750af
[MISC] Fix invalid escape sequence '\' ( #8830 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2024-09-27 01:13:25 -07:00
dc4e3df5c2
[misc] fix collect env ( #8894 )
2024-09-27 00:26:38 -07:00
3b00b9c26c
[Core] rename PromptInputs and inputs ( #8876 )
2024-09-26 20:35:15 -07:00
344cd2b6f4
[Feature] Add support for Llama 3.1 and 3.2 tool use ( #8343 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-09-26 17:01:42 -07:00
1b49148e47
[Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility ( #8764 )
2024-09-26 16:54:09 -07:00
4b377d6feb
[BugFix] Fix test breakages from transformers 4.45 upgrade ( #8829 )
2024-09-26 16:46:43 -07:00
71d21c73ab
[Bugfix] Fixup advance_step.cu warning ( #8815 )
2024-09-26 16:23:45 -07:00
ee2da3e9ef
fix validation: Only set tool_choice auto if at least one tool is provided ( #8568 )
2024-09-26 16:23:17 -07:00
e2f6f26e86
[Bugfix] Fix print_warning_once's line info ( #8867 )
2024-09-26 16:18:26 -07:00
b28d2104de
[Misc] Change dummy profiling and BOS fallback warns to log once ( #8820 )
2024-09-26 16:18:14 -07:00
93d364da34
[Bugfix] Include encoder prompts len to non-stream api usage response ( #8861 )
2024-09-26 15:47:00 -07:00
d9cfbc891e
[ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM ( #8872 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-26 15:02:16 -07:00
70de39f6b4
[misc][installation] build from source without compilation ( #8818 )
2024-09-26 13:19:04 -07:00
68988d4e0d
[CI/Build] Fix missing ci dependencies ( #8834 )
2024-09-26 11:04:39 -07:00
520db4dbc1
[Docs] Add README to the build docker image ( #8825 )
2024-09-26 11:02:52 -07:00
f70bccac75
[Build/CI] Upgrade to gcc 10 in the base build Docker image ( #8814 )
2024-09-26 10:07:18 -07:00
4bb98f2190
[Misc] Update config loading for Qwen2-VL and remove Granite ( #8837 )
2024-09-26 07:45:30 -07:00