3679753af5
Reduce Scatter Plumbing
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-28 16:33:52 +00:00
9b61dd41e7
[Bugfix] Initialize attention bias on the same device as Query/Key/Value for QwenVL Series ( #14031 )
2025-02-28 07:36:08 -08:00
f7bee5c815
[VLM][Bugfix] Enable specifying prompt target via index ( #14038 )
2025-02-28 07:35:55 -08:00
e0734387fb
[Bugfix] Fix MoeWNA16Method activation ( #14024 )
2025-02-28 15:22:42 +00:00
f58f8b5c96
Update AutoAWQ docs ( #14042 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-28 15:20:29 +00:00
b3f7aaccd0
[V1][Minor] Restore V1 compatibility with LLMEngine class ( #13090 )
2025-02-28 00:52:25 -08:00
b91660ddb8
[Hardware][Intel-Gaudi] Regional compilation support ( #13213 )
2025-02-28 00:51:49 -08:00
76c89fcadd
Use smaller embedding model when not testing model specifically ( #13891 )
2025-02-28 00:50:43 -08:00
b9e41734c5
[Bugfix][Disaggregated] patch the inflight batching on the decode node in SimpleConnector to avoid hangs in SimpleBuffer (nccl based) ( #13987 )
Signed-off-by: Mathis Felardos <mathis@mistral.ai>
2025-02-28 07:53:45 +00:00
1088f06242
[Doc] Move multimodal Embedding API example to Online Serving page ( #14017 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-28 07:12:04 +00:00
73e0225ee9
[Bugfix] Check that number of images matches number of <|image|> tokens with mllama ( #13911 )
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2025-02-28 04:00:45 +00:00
6c85da3a18
[V1] SupportsV0Only protocol for model definitions ( #13959 )
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-02-27 20:02:15 -05:00
67fc426845
[Misc] Print FusedMoE detail info ( #13974 )
2025-02-27 18:53:13 -05:00
9804145cac
[Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict ( #13626 )
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
2025-02-27 15:28:08 -08:00
2e94b9cfbb
[Attention] Flash MLA for V1 ( #13867 )
Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Yang Chen <yangche@fb.com>
2025-02-27 23:03:41 +00:00
8294773e48
[core] Perf improvement for DSv3 on AMD GPUs ( #13718 )
Signed-off-by: qli88 <qiang.li2@amd.com>
2025-02-27 22:14:30 +00:00
cd813c6d4d
[V1][Minor] Minor cleanup for GPU Model Runner ( #13983 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-27 13:11:40 -08:00
38acae6e97
[ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups ( #13970 )
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-02-27 20:31:47 +00:00
a2dd48c386
[VLM] Deprecate legacy input mapper for OOT multimodal models ( #13979 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-27 19:14:55 +00:00
126f6beeb4
Bump azure/setup-helm from 4.2.0 to 4.3.0 ( #13742 )
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-27 19:04:10 +00:00
58d1b2aa77
[Attention] MLA support for V1 ( #13789 )
Signed-off-by: Yang Chen <yangche@fb.com>
2025-02-27 13:14:17 -05:00
f1579b229d
[VLM] Generalized prompt updates for multi-modal processor ( #13964 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-27 17:44:25 +00:00
7864875879
[Bugfix] Fix qwen2.5-vl overflow issue ( #13968 )
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-02-27 17:30:39 +00:00
1dd422b64a
Update LMFE version to v0.10.11 to support new versions of transforme… ( #13930 )
2025-02-27 17:16:12 +00:00
06c8f8d885
[bugfix] Fix profiling for RayDistributedExecutor ( #13945 )
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-02-28 01:01:21 +08:00
5677c9bb3e
Deduplicate .pre-commit-config.yaml's exclude ( #13967 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-27 16:27:47 +00:00
512d77d582
Update quickstart.md ( #13958 )
2025-02-27 16:05:11 +00:00
7f0be2aa24
[Model] Deepseek GGUF support ( #13167 )
2025-02-27 02:08:35 -08:00
edf309ebbe
[VLM] Support multimodal inputs for Florence-2 models ( #13320 )
2025-02-27 02:06:41 -08:00
788f284b53
Fix test_block_fp8.py test for MoE ( #13915 )
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-27 18:00:00 +08:00
4b1d141f49
[PP] Correct cache size check ( #13873 )
Signed-off-by: Yang Zheng <zhengy.gator@gmail.com>
2025-02-27 17:47:29 +08:00
10c3b8c1cf
[Misc] fixed 'required' is an invalid argument for positionals ( #13948 )
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-02-27 09:06:49 +00:00
a7f37314b7
[CI/Build] Add examples/ directory to be labelled by mergify ( #13944 )
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-02-27 08:24:11 +00:00
cd711c48b2
[V1][Metrics] Handle preemptions ( #13169 )
2025-02-26 20:04:59 -08:00
378b3ef6f8
[ROCm][V1] Update reshape_and_cache to properly work with CUDA graph padding ( #13922 )
2025-02-26 20:04:12 -08:00
c9944acbf9
[misc] Rename Ray ADAG to Compiled Graph ( #13928 )
2025-02-26 20:03:28 -08:00
ca377cf1b9
Use CUDA 12.4 as default for release and nightly wheels ( #12098 )
2025-02-26 19:06:37 -08:00
a31614e386
[ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined ( #13851 )
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-02-27 10:39:10 +08:00
f95903909f
[Kernel] FlashMLA integration ( #13747 )
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-02-27 10:35:08 +08:00
b382a7f28f
[BugFix] Make FP8 Linear compatible with torch.compile ( #13918 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-26 13:48:55 -08:00
4cb6fa0a9c
[Bugfix] Backend option to disable xgrammar any_whitespace ( #12744 )
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
2025-02-26 10:52:34 -08:00
d08b285adf
[Misc] fixed qwen_vl_utils parameter error ( #13906 )
2025-02-26 08:31:53 -08:00
b27122acc2
[TPU] use torch2.6 with whl package ( #13860 )
Signed-off-by: Chenyaaang <llccyy1212@gmail.com>
2025-02-26 08:18:54 -05:00
934bb99c71
[Bugfix] Update expected token counts for Ultravox tests ( #13895 )
2025-02-26 04:56:50 -08:00
3f808cc044
[Bugfix] Do not crash V0 engine on input errors ( #13101 )
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-02-26 19:07:29 +08:00
ec8a5e5386
[Misc]: Add support for goodput on guided benchmarking + TPOT calculation refactor ( #13736 )
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-02-26 19:06:47 +08:00
215bf150a6
[Bugfix] Handle None parameters in Mistral function calls. ( #13786 )
2025-02-26 03:06:21 -08:00
0ecdd98031
Add comments on accessing kv_cache and attn_metadata ( #13887 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-26 18:41:02 +08:00
7b700ec8c8
[Bugfix] Add test example for Ultravox v0.5 ( #13890 )
2025-02-26 02:31:43 -08:00
7ca1da020f
[Misc] Fix input processing for Ultravox ( #13871 )
2025-02-25 23:56:34 -08:00
5157338ed9
[Misc] Improve LoRA spelling ( #13831 )
2025-02-25 23:43:01 -08:00
e206b54331
[v0][Core] Use xgrammar shared context to avoid copy overhead for offline engine ( #13837 )
Signed-off-by: Seth Kimmel <seth.kimmel3@gmail.com>
2025-02-26 14:58:24 +08:00
1d35662e6d
[ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms ( #13844 )
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-02-26 14:56:58 +08:00
e656f638de
[Doc] fix the incorrect module path of tensorize_vllm_model ( #13863 )
2025-02-25 22:56:19 -08:00
145944cb94
Improve pipeline partitioning ( #13839 )
2025-02-25 18:53:56 -08:00
094b7d9496
[Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutlass epilogues ( #13797 )
2025-02-25 18:52:03 -08:00
e1fe7591f2
[Misc] Code Cleanup ( #13859 )
Signed-off-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2025-02-26 10:44:30 +08:00
5629f26df7
[V1][Spec Decode] Change Spec Decode Rejection Sampling API ( #13729 )
2025-02-25 18:14:48 -08:00
9ba28043b5
[misc] Show driver IP info when Ray fails to allocate driver worker ( #13858 )
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-02-26 09:53:43 +08:00
24679788ed
DeepSeek V2/V3/R1 only place lm_head on last pp rank ( #13833 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-26 01:24:57 +00:00
07c4353057
[Model] Support Grok1 ( #13795 )
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-26 01:07:12 +00:00
34e3494e70
Fix failing MyGemma2Embedding test ( #13820 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-25 12:33:03 -08:00
f75aa72732
[Neuron] Add custom_ops for neuron backend ( #13246 )
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
Co-authored-by: George Novack <gnovack@amazon.com>
Co-authored-by: Aoyu Zhang <aoyuzhan@amazon.com>
2025-02-25 11:47:49 -08:00
340e39e387
Fix string parsing error ( #13825 )
2025-02-25 08:20:29 -08:00
f4133ce4e5
[Bugfix] Revert inspection code in #13743 ( #13832 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-26 00:18:50 +08:00
6522d55b6f
Fix /v1/audio/transcriptions Bad Request Error ( #13811 )
2025-02-25 06:03:33 -08:00
6ff518626c
[Bugfix] Fix deepseek-vl2 inference with more than 2 images ( #13818 )
2025-02-25 06:03:02 -08:00
fa82074167
[Bugfix] Flush TunableOp results before worker processes are destroyed. ( #13623 )
Signed-off-by: Nichols A. Romero <nick.romero@amd.com>
2025-02-25 11:08:20 +00:00
75e9d49796
[Bugfix] Initialize attention bias on the same device as Query/Key/Value ( #13468 )
2025-02-25 02:13:09 -08:00
32c3b6bfd1
[Misc] Clarify Error Handling for Non-existent Model Paths and HF Repo IDs ( #13724 )
Signed-off-by: Chen-0210 <chenjincong11@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-02-25 10:12:19 +00:00
37b6cb4985
[CI/Build] Fix V1 LoRA failure ( #13767 )
2025-02-25 02:01:15 -08:00
aabeb2688f
[ROCm][Quantization][Kernel] Using HIP FP8 header ( #12593 )
2025-02-25 00:39:59 -08:00
2f42a4888c
[Feature] Support KV cache offloading and disagg prefill with LMCache connector. ( #12953 )
2025-02-25 00:38:42 -08:00
3173c3b34e
[misc] Clean up ray compiled graph type hints ( #13731 )
2025-02-25 00:37:08 -08:00
2d87d7d1ac
[Bugfix] Modify modelscope api usage in transformer_utils ( #13807 )
2025-02-25 00:36:07 -08:00
aab392774b
[Core] xgrammar: Expand list of unsupported jsonschema keywords ( #13783 )
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-25 08:21:25 +00:00
6724e79164
[Misc] Check that the model can be inspected upon registration ( #13743 )
2025-02-25 00:18:19 -08:00
03f48b3db6
[Core] LoRA V1 - Add add/pin/list/remove_lora functions ( #13705 )
2025-02-25 00:18:02 -08:00
4d251ad00e
Fix CompressedTensorsWNA16MoE with grouped scales ( #13769 )
2025-02-25 00:17:14 -08:00
18e505930d
[Bugfix] Support MLA for CompressedTensorsWNA16 ( #13725 )
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-25 06:10:31 +00:00
4a8cfc7551
[Bugfix] Fix deepseek-v2 error: "missing 1 required positional argument: 'residual'" ( #13802 )
2025-02-24 20:33:59 -08:00
bc32bc73aa
[V1][Metrics] Implement vllm:lora_requests_info metric ( #13504 )
2025-02-24 20:01:33 -08:00
ab1091d5f2
[Misc][Attention][Quantization] init property earlier ( #13733 )
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-25 03:19:30 +00:00
1e15aaef56
[Bugfix][Quantization] Fix FP8 + EP ( #13784 )
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-25 10:54:17 +08:00
51010a1807
[Misc] set single whitespace between log sentences ( #13771 )
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
2025-02-25 10:26:12 +08:00
7196a3b1db
[Doc] arg_utils.py: fixed a typo ( #13785 )
2025-02-24 18:23:04 -08:00
cdc1fa12eb
Remove unused kwargs from model definitions ( #13555 )
2025-02-24 17:13:52 -08:00
f61528d46d
[Misc][Chore] Clean Up AsyncOutputProcessing Logs ( #13780 )
2025-02-24 16:39:07 -08:00
1f0ae3ed0a
[Misc] Clean Up EngineArgs.create_engine_config ( #13734 )
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-02-24 13:52:21 -05:00
db986c19ea
Fix precommit fail in fused_moe intermediate_cache2 chunking ( #13772 )
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-24 09:25:47 -08:00
227578480d
Revert "[V1][Core] Fix memory issue with logits & sampling" ( #13775 )
2025-02-24 09:16:05 -08:00
befc402d34
[V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) ( #10980 )
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-02-24 08:29:41 -08:00
444b0f0f62
[Misc][Docs] Raise error when flashinfer is not installed and VLLM_ATTENTION_BACKEND is set ( #12513 )
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-02-24 10:43:21 -05:00
ccc00515fd
[BugFix] Illegal memory access for MoE On H20 ( #13693 )
2025-02-24 07:37:32 -08:00
781096e385
Expert Parallelism (EP) Support for DeepSeek V2 ( #12583 )
2025-02-24 07:33:20 -08:00
7940d8a6a7
[CI/Build] add python-json-logger to requirements-common ( #12842 )
2025-02-24 06:10:33 -08:00
c0e3ecd6d2
[Bugfix] fix(logging): add missing opening square bracket ( #13011 )
2025-02-24 06:10:25 -08:00
23eca9cf68
[model][refactor] remove cuda hard code in models and layers ( #13658 )
2025-02-24 06:10:14 -08:00
437b76ff59
[V1][Core] Fix memory issue with logits & sampling ( #13721 )
2025-02-24 06:10:06 -08:00
f90a375593
[ci] Add logic to change model to S3 path only when S3 CI env var is on ( #13727 )
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-63-253.us-west-2.compute.internal>
2025-02-24 06:32:11 +00:00
e7ef74e26e
Fix some issues with benchmark data output ( #13641 )
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-02-24 10:23:18 +08:00
cbae7af552
[V1][BugFix] Fix engine core client shutdown hangs ( #13298 )
Even though ZMQ context.destroy() is meant to close open sockets before terminating the context, it appears necessary to close them explicitly, or else the shutdown can hang in the context.term() method.
Close zmq sockets explicitly before terminating the context, make shutdown of client resources more robust, and shut down the engine core process prior to terminating the zmq context.
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-23 13:07:43 -08:00
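The shutdown ordering described in the commit above can be sketched with a minimal pyzmq example. This is an illustration of the general pattern (close sockets explicitly, then terminate the context), not vLLM's actual engine core client code; the `shutdown` helper and the `inproc://demo` endpoint are hypothetical names for the sketch.

```python
import zmq


def shutdown(ctx: zmq.Context, sockets) -> None:
    # Close each socket explicitly with linger=0 so pending unsent
    # messages are discarded; relying on ctx.term()/ctx.destroy()
    # alone can block while a socket still holds queued messages.
    for sock in sockets:
        sock.close(linger=0)
    # Only terminate the context once all of its sockets are closed.
    ctx.term()


ctx = zmq.Context()
a = ctx.socket(zmq.PAIR)
b = ctx.socket(zmq.PAIR)
a.bind("inproc://demo")
b.connect("inproc://demo")
a.send(b"ping")
b.recv()  # drain the in-flight message before shutting down
shutdown(ctx, [a, b])
```

With this ordering `ctx.term()` returns promptly because no socket is left holding unsent messages.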
eb24dc4a45
[v1] torchrun compatibility ( #13642 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-23 22:47:24 +08:00
9bebc9512f
[Misc] Deprecate --dataset from benchmark_serving.py ( #13708 )
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-02-23 13:32:20 +00:00
5a2ba16f5c
[Core][Distributed] Use IPC (domain socket) ZMQ socket for local comms ( #13688 )
2025-02-23 02:54:29 -08:00
ba5106e519
[LMM] Implement merged multimodal processor for whisper ( #13278 )
2025-02-23 01:46:03 -08:00
d5ca2110f1
[Quant] BaiChuan SupportsQuant ( #13710 )
2025-02-22 19:21:15 -08:00
2c5e637b57
[ci] Use env var to control whether to use S3 bucket in CI ( #13634 )
2025-02-22 19:19:45 -08:00
322d2a27d6
[BugFix] Minor: logger import in attention backend ( #13706 )
Signed-off-by: Andy Lo <andy@mistral.ai>
2025-02-22 16:51:13 -08:00
82e0d601fc
[CI/Build] Fix pre-commit errors from #13571 ( #13709 )
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-02-22 16:50:38 -08:00
78ac0f591d
[CI/Build] fix uv caching in Dockerfile ( #13611 )
2025-02-22 08:25:20 -08:00
b56155e7f3
[XPU] fix setuptools version for xpu ( #13548 )
2025-02-22 08:05:35 -08:00
382f66fb08
[Bugfix] Fix boolean conversion for OpenVINO env variable ( #13615 )
2025-02-22 08:04:12 -08:00
8354f6640c
[Doc] Dockerfile instructions for optional dependencies and dev transformers ( #13699 )
2025-02-22 06:04:31 -08:00
c904fdddf6
[ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm ( #13231 )
2025-02-22 05:54:38 -08:00
558db8083c
[V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths ( #13095 )
2025-02-22 05:25:41 -08:00
e109e598c7
[NVIDIA] Support nvfp4 cutlass gemm ( #13571 )
2025-02-22 05:24:05 -08:00
8db1b9d0a1
Support SSL Key Rotation in HTTP Server ( #13495 )
2025-02-22 05:17:44 -08:00
2382ad29d1
[ci] fix linter ( #13701 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-22 20:28:59 +08:00
3e472d882a
[core] set up data parallel communication ( #13591 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-22 19:28:59 +08:00
7f6bae561c
[CI/Build] Fix pre-commit errors ( #13696 )
2025-02-22 00:31:26 -08:00
105b8ce4c0
[Misc] Reduce LoRA-related static variable ( #13166 )
2025-02-22 00:21:30 -08:00
2cb8c1540e
[Metrics] Add --show-hidden-metrics-for-version CLI arg ( #13295 )
2025-02-22 00:20:45 -08:00
1cd981da4f
[V1][Metrics] Support vllm:cache_config_info ( #13299 )
2025-02-22 00:20:00 -08:00
fca20841c2
Correction to TP logic for Mamba Mixer 2 when Num Groups not divisible by TP Size ( #13660 )
2025-02-22 00:19:10 -08:00
da31b5333e
[Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler ( #13594 )
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-02-22 00:08:29 -08:00
bb78fb318e
[v1] Support allowed_token_ids in v1 Sampler ( #13210 )
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-22 14:13:05 +08:00
8aca27fa11
[Bugfix] Fix benchmark script bug: inaccurate stats for vllm backend when max_model_len < input_len + output_len ( #13691 )
Signed-off-by: WangErXiao <863579016@qq.com>
2025-02-22 14:10:38 +08:00
95c617e04b
[Misc] Bump compressed-tensors ( #13619 )
2025-02-21 22:09:04 -08:00
9a1f1da5d1
[Bugfix][Model] OLMo 2: split qkv correctly for GQA and MQA ( #13687 )
2025-02-21 22:07:45 -08:00
68d630a0c7
[ROCM] fix native attention function call ( #13650 )
2025-02-21 22:07:04 -08:00
68d535ef44
[Misc] Capture and log the time of loading weights ( #13666 )
2025-02-21 22:06:34 -08:00
c6ed93860f
[Bugfix][API Server] Fix invalid usage of 'ge' and 'le' in port valid… ( #13672 )
2025-02-21 22:05:28 -08:00
0ffdf8ce0c
[HTTP Server] Make model param optional in request ( #13568 )
2025-02-21 21:55:50 -08:00
8c0dd3d4df
docs: Add a note on full CI run in contributing guide ( #13646 )
2025-02-21 21:53:59 -08:00
ada7c780d5
[Misc] Fix yapf linting tools etc not running on pre-commit ( #13695 )
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-02-22 13:10:43 +08:00
288cc6c234
[Attention] MLA with chunked prefill ( #12639 )
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Patrick Horn <patrick.horn@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-21 15:30:12 -08:00
900edbfa48
Fix typo in grafana dashboard, with correct datasource ( #13668 )
Signed-off-by: John Zheng <john.zheng@hp.com>
2025-02-21 18:21:05 +00:00
b2c3fc5d65
[Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation ( #13586 )
2025-02-20 22:24:17 -08:00
839b27c6cc
[Kernel] Add streamK for block-quantized CUTLASS kernels ( #12978 )
2025-02-20 22:14:24 -08:00
34ad27fe83
[ci] Fix metrics test model path ( #13635 )
2025-02-20 22:12:10 -08:00
1c3c975766
[FEATURE] Enables /score endpoint for embedding models ( #12846 )
2025-02-20 22:09:47 -08:00
1cdc88614a
Missing comment explaining VDR variable in GGUF kernels ( #13290 )
2025-02-20 22:06:54 -08:00
31aa045c11
[V1][Sampler] Avoid an operation during temperature application ( #13587 )
2025-02-20 22:05:56 -08:00
a30c093502
[Bugfix] Add mm_processor_kwargs to chat-related protocols ( #13644 )
2025-02-20 22:04:33 -08:00
c7b07a95a6
Use pre-commit to update requirements-test.txt ( #13617 )
2025-02-20 22:03:27 -08:00
27a09dc52c
[NVIDIA] Fix an issue to use current stream for the nvfp4 quant ( #13632 )
2025-02-20 22:01:48 -08:00
981f3c831e
[Misc] Adding script to setup ray for multi-node vllm deployments ( #12913 )
2025-02-20 21:16:40 -08:00
44c33f01f3
Add llmaz as another integration ( #13643 )
Signed-off-by: kerthcet <kerthcet@gmail.com>
2025-02-21 03:52:40 +00:00
33170081f1
[Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth ( #13245 )
Signed-off-by: Lingfan Yu <lingfany@amazon.com>
2025-02-20 17:45:45 -08:00
71face8540
[Bugfix] Fix max_num_batched_tokens for MLA ( #13620 )
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-20 17:45:20 -08:00
bfbc0b32c6
[Frontend] Add backend-specific options for guided decoding ( #13505 )
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-02-20 15:07:58 -05:00
6a417b8600
fix neuron performance issue ( #13589 )
2025-02-20 10:59:36 -08:00
d3ea50113c
[V1][Minor] Print KV cache size in token counts ( #13596 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-20 09:24:31 -08:00
34aad515c8
Update pre-commit's isort version to remove warnings ( #13614 )
2025-02-20 08:00:14 -08:00
ed6e9075d3
[Bugfix] Fix deepseekv3 grouped topk error ( #13474 )
Signed-off-by: Chen-XiaoBing <chenxb002@whu.edu.cn>
2025-02-20 06:47:01 -08:00
992e5c3d34
Merge similar examples in offline_inference into single basic example ( #12737 )
2025-02-20 04:53:51 -08:00
b69692a2d8
[Kernel] LoRA - Refactor sgmv kernels ( #13110 )
2025-02-20 07:28:06 -05:00
a64a84433d
[2/n][ci] S3: Use full model path ( #13564 )
Signed-off-by: <>
2025-02-20 01:20:15 -08:00
aa1e62d0db
[ci] Fix spec decode test ( #13600 )
2025-02-20 16:56:00 +08:00
497bc83124
[CI/Build] Use uv in the Dockerfile ( #13566 )
2025-02-19 23:05:44 -08:00
3738e6fa80
[API Server] Add port number range validation ( #13506 )
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-20 15:05:13 +08:00
0023cd2b9d
[ROCm] MI300A compile targets deprecation ( #13560 )
2025-02-19 23:05:00 -08:00
041e294716
[Misc] add mm_processor_kwargs to extra_body for Qwen2.5-VL ( #13533 )
2025-02-19 23:04:30 -08:00
9621667874
[Misc] Warn if the vLLM version can't be retrieved ( #13501 )
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
2025-02-20 06:24:48 +00:00
8c755c3b6d
[bugfix] spec decode worker get tp group only when initialized ( #13578 )
2025-02-20 04:46:28 +00:00
ba81163997
[core] add sleep and wake up endpoint and v1 support ( #12987 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: cennn <2523403608@qq.com>
Co-authored-by: cennn <2523403608@qq.com>
2025-02-20 12:41:17 +08:00
0d243f2a54
[ROCm][MoE] mi300 mixtral8x7B perf for specific BS ( #13577 )
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-02-20 04:01:02 +00:00
88f6ba3281
[ci] Add AWS creds for AMD ( #13572 )
2025-02-20 03:56:06 +00:00
512368e34a
[Misc] Qwen2.5 VL support LoRA ( #13261 )
2025-02-19 18:37:55 -08:00
473f51cfd9
[3/n][CI] Load Quantization test models with S3 ( #13570 )
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-20 10:12:30 +08:00
a4c402a756
[BugFix] Avoid error traceback in logs when V1 LLM terminates ( #13565 )
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-20 00:49:01 +00:00
550d97eb58
[Misc] Avoid calling unnecessary hf_list_repo_files for local model path ( #13348 )
Signed-off-by: isotr0py <2037008807@qq.com>
2025-02-19 18:57:48 +00:00
fbbe1fbac6
[MISC] Logging the message about Ray teardown ( #13502 )
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
2025-02-19 09:40:50 -08:00
01c184b8f3
Fix copyright year to auto get current year ( #13561 )
2025-02-19 16:55:34 +00:00
ad5a35c21b
[doc] clarify multi-node serving doc ( #13558 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-19 22:32:17 +08:00
5ae9f26a5a
[Bugfix] Fix device ordinal for multi-node spec decode ( #13269 )
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-02-19 22:13:15 +08:00
377d10bd14
[VLM][Bugfix] Pass processor kwargs properly on init ( #13516 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-19 13:13:50 +00:00
52ce14d31f
[doc] clarify profiling is only for developers ( #13554 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-19 20:55:58 +08:00
81dabf24a8
[CI/Build] force writing version file ( #13544 )
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com>
2025-02-19 18:48:03 +08:00
423330263b
[Feature] Pluggable platform-specific scheduler ( #13161 )
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
2025-02-19 17:16:38 +08:00
caf7ff4456
[V1][Core] Generic mechanism for handling engine utility ( #13060 )
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-19 17:09:22 +08:00
f525c0be8b
[Model][Speculative Decoding] DeepSeek MTP spec decode ( #12755 )
Signed-off-by: Lu Fang <fanglu@fb.com>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2025-02-19 17:06:23 +08:00
983a40a8bb
[Bugfix] Fix Positive Feature Layers in Llava Models ( #13514 )
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
2025-02-19 08:50:07 +00:00
fdc5df6f54
use device param in load_model method ( #13037 )
2025-02-19 16:05:02 +08:00
3b05cd4555
[perf-benchmark] Fix ECR path for premerge benchmark ( #13512 )
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-19 07:56:11 +00:00
d5d214ac7f
[1/n][CI] Load models in CI from S3 instead of HF ( #13205 )
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-19 07:34:59 +00:00
fd84857f64
[Doc] Add clarification note regarding paligemma ( #13511 )
2025-02-18 22:24:03 -08:00
8aada19dfc
[ROCm][MoE configs] mi325 mixtral & mi300 qwen_moe ( #13503 )
2025-02-18 22:23:24 -08:00
9aa95b0e6a
[perf-benchmark] Allow premerge ECR ( #13509 )
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-19 05:13:41 +00:00
d0a7a2769d
[Hardware][Gaudi][Feature] Support Contiguous Cache Fetch ( #12139 )
Signed-off-by: yuzhou <yuzhou@habana.ai>
Signed-off-by: zhouyu5 <yu.zhou@intel.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-18 19:40:19 -08:00
00b69c2d27
[Misc] Remove dangling references to --use-v2-block-manager ( #13492 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-19 03:37:26 +00:00
4c82229898
[V1][Spec Decode] Optimize N-gram matching with Numba ( #13365 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-18 13:19:58 -08:00
c8d70e2437
Pin Ray version to 2.40.0 ( #13490 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-18 12:50:31 -08:00
30172b4947
[V1] Optimize handling of sampling metadata and req_ids list ( #13244 )
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-18 12:15:33 -08:00
a4d577b379
[V1][Tests] Adding additional testing for multimodal models to V1 ( #13308 )
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
2025-02-18 09:53:14 -08:00
7b203b7694
[misc] fix debugging code ( #13487 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-18 09:37:11 -08:00
4fb8142a0e
[V1][PP] Enable true PP with Ray executor ( #13472 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-18 09:15:32 -08:00
a02c86b4dd
[CI/Build] migrate static project metadata from setup.py to pyproject.toml ( #8772 )
2025-02-18 08:02:49 -08:00
3809458456
[Bugfix] Fix invalid rotary embedding unit test ( #13431 )
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2025-02-18 11:52:03 +00:00
d3231cb436
[Bugfix] Handle content type with optional parameters ( #13383 )
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
2025-02-18 11:29:13 +00:00
435b502a6e
[ROCm] Make amdsmi import optional for other platforms ( #13460 )
2025-02-18 03:15:56 -08:00
29fc5772c4
[Bugfix] Remove noisy error logging during local model loading ( #13458 )
2025-02-18 03:15:48 -08:00
2358ca527b
[Doc]: Improve feature tables ( #13224 )
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-18 18:52:39 +08:00
8cf97f8661
[Bugfix] Fix failing transformers dynamic module resolving with spawn multiproc method ( #13403 )
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-02-18 10:25:53 +00:00
e2603fefb8
[Bugfix] Ensure LoRA path from the request can be included in err msg ( #13450 )
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-18 16:19:15 +08:00
b53d79983c
Add outlines fallback when JSON schema has enum ( #13449 )
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-18 06:49:41 +00:00
9915912f7f
[V1][PP] Fix & Pin Ray version in requirements-cuda.txt ( #13436 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-17 21:58:06 -08:00
d1b649f1ef
[Quant] Aria SupportsQuant ( #13416 )
2025-02-17 21:51:09 -08:00
ac19b519ed
[core] fix sleep mode in pytorch 2.6 ( #13456 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-18 13:48:10 +08:00
a1074b3efe
[Bugfix] Only print out chat template when supplied ( #13444 )
2025-02-17 21:43:31 -08:00
00294e1bc6
[Quant] Arctic SupportsQuant ( #13366 )
2025-02-17 21:35:09 -08:00
88787bce1d
[Quant] Molmo SupportsQuant ( #13336 )
2025-02-17 21:34:47 -08:00
932b51cedd
[v1] fix parallel config rank ( #13445 )
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-18 12:33:45 +08:00
7c7adf81fc
[ROCm] fix get_device_name for rocm ( #13438 )
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-02-18 04:07:12 +00:00
67ef8f666a
[Model] Enable quantization support for transformers backend ( #12960 )
2025-02-17 19:52:47 -08:00
efbe854448
[Misc] Remove dangling references to SamplingType.BEAM ( #13402 )
2025-02-17 19:52:35 -08:00
b3942e157e
[Bugfix][CI][V1] Work around V1 + CUDA Graph + torch._scaled_mm fallback issue ( #13425 )
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-18 00:32:48 +00:00
cd4a72a28d
[V1][Spec decode] Move drafter to model runner ( #13363 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-17 15:40:12 -08:00
6ac485a953
[V1][PP] Fix intermediate tensor values ( #13417 )
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-17 13:37:45 -08:00
4c21ce9eba
[V1] Get input tokens from scheduler ( #13339 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-17 11:01:07 -08:00
ce77eb9410
[Bugfix] Fix VLLM_USE_MODELSCOPE issue ( #13384 )
2025-02-17 14:22:01 +00:00
30513d1cb6
[Bugfix] fix xpu communicator ( #13368 )
Signed-off-by: yan ma <yan.ma@intel.com>
2025-02-17 20:59:18 +08:00
1f69c4a892
[Model] Support Mamba2 (Codestral Mamba) ( #9292 )
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
2025-02-17 20:17:50 +08:00
7b623fca0b
[VLM] Check required fields before initializing field config in DictEmbeddingItems ( #13380 )
2025-02-17 01:36:07 -08:00
238dfc8ac3
[MISC] tiny fixes ( #13378 )
2025-02-17 00:57:13 -08:00
45186834a0
Run v1 benchmark and integrate with PyTorch OSS benchmark database ( #13068 )
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-02-17 08:16:32 +00:00
f857311d13
Fix spelling error in index.md ( #13369 )
2025-02-17 06:53:20 +00:00
46cdd59577
[Feature][Spec Decode] Simplify the use of Eagle Spec Decode ( #12304 )
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-02-16 19:32:26 -08:00
2010f04c17
[V1][Misc] Avoid unnecessary log output ( #13289 )
2025-02-16 19:26:24 -08:00
69e1d23e1e
[V1][BugFix] Clean up rejection sampler & Fix warning msg ( #13362 )
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-16 12:25:29 -08:00
d67cc21b78
[Bugfix][Platform][CPU] Fix cuda platform detection on CPU backend edge case ( #13358 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-02-16 18:55:27 +00:00
e18227b04a
[V1][PP] Cache Intermediate Tensors ( #13353 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-16 10:02:27 -08:00
7b89386553
[V1][BugFix] Add __init__.py to v1/spec_decode/ ( #13359 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-16 09:39:08 -08:00
da833b0aee
[Docs] Change myenv to vllm. Update python_env_setup.inc.md ( #13325 )
2025-02-16 16:04:21 +00:00
5d2965b7d7
[Bugfix] Fix 2 Node and Spec Decode tests ( #13341 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-16 22:20:22 +08:00
a0231b7c25
[platform] add base class for communicators ( #13208 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-16 22:14:22 +08:00
124776ebd5
[ci] skip failed tests for flashinfer ( #13352 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-16 22:09:15 +08:00
b7d309860e
[V1] Update doc and examples for H2O-VL ( #13349 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-02-16 10:35:54 +00:00
dc0f7ccf8b
[BugFix] Enhance test_pos_encoding to support execution on multi-devices ( #13187 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2025-02-16 08:59:49 +00:00
d3d547e057
[Bugfix] Pin xgrammar to 0.1.11 ( #13338 )
2025-02-15 19:42:25 -08:00
12913d17ba
[Quant] Add SupportsQuant to phi3 and clip ( #13104 )
2025-02-15 19:28:33 -08:00
80f63a3966
[V1][Spec Decode] Ngram Spec Decode ( #12193 )
...
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2025-02-15 18:05:11 -08:00
367cb8ce8c
[Doc] [2/N] Add Fuyu E2E example for multimodal processor ( #13331 )
2025-02-15 07:06:23 -08:00
54ed913f34
[ci/build] update flashinfer ( #13323 )
2025-02-15 05:33:13 -08:00
9206b3d7ec
[V1][PP] Run engine busy loop with batch queue ( #13064 )
2025-02-15 03:59:01 -08:00
ed0de3e4b8
[AMD] [Model] DeepSeek tunings ( #13199 )
2025-02-15 03:58:09 -08:00
2ad1bc7afe
[V1][Metrics] Add iteration_tokens_total histogram from V0 ( #13288 )
2025-02-15 03:56:19 -08:00
7fdaaf48ef
[Bugfix] Fix qwen2.5-vl image processor ( #13286 )
2025-02-15 03:00:11 -08:00
067fa2255b
[Bugfix] Fix search start_index of stop_checker ( #13280 )
2025-02-14 21:39:42 -08:00
9076325677
[BugFix] Don't scan entire cache dir when loading model ( #13302 )
2025-02-14 21:33:31 -08:00
97a3d6d995
[Bugfix] Massage MLA's usage of flash attn for ROCm ( #13310 )
2025-02-14 21:33:25 -08:00
579d7a63b2
[Bugfix][Docs] Fix offline Whisper ( #13274 )
2025-02-14 21:32:37 -08:00
c9f9d5b397
[Bugfix][AMD] Update torch_bindings so that scaled_fp4_quant isn't built on ROCm ( #13235 )
2025-02-14 20:30:42 -08:00
0c73026844
[V1][PP] Fix memory profiling in PP ( #13315 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-14 20:17:25 -08:00
6a854c7a2b
[V1][Sampler] Don't apply temp for greedy-only ( #13311 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-14 18:10:53 -08:00
e7eea5a520
[V1][CI] Fix failed v1-test because of min_p ( #13316 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-14 17:29:51 -08:00
a12934d3ec
[V1][Core] min_p sampling support ( #13191 )
...
Signed-off-by: Aoyu <aoyuzhan@amazon.com >
Co-authored-by: Aoyu <aoyuzhan@amazon.com >
2025-02-14 15:50:05 -08:00
3bcb8c75da
[Core] Reduce TTFT with concurrent partial prefills ( #10235 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-14 15:36:07 -08:00
5e5c8e091e
[Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts ( #13236 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-14 12:53:42 -08:00
c9e2d644e7
[Hardware][Gaudi][Bugfix] Fix error for guided decoding ( #12317 )
2025-02-14 04:36:49 -08:00
7734e9a291
[Core] choice-based structured output with xgrammar ( #12632 )
2025-02-14 04:36:05 -08:00
6224a9f620
Support logit_bias in v1 Sampler ( #13079 )
2025-02-14 04:34:59 -08:00
085b7b2d6c
[V1] Simplify GPUModelRunner._update_states check ( #13265 )
2025-02-14 04:33:43 -08:00
4da1f667e9
[VLM] Keep track of whether prompt replacements have been applied ( #13215 )
2025-02-14 04:20:46 -08:00
556ef7f714
[Misc] Log time consumption of sleep and wake-up ( #13115 )
...
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com >
2025-02-14 20:10:21 +08:00
83481ceb49
[Bugfix] Fix missing parentheses ( #13263 )
2025-02-14 01:07:10 -08:00
185cc19f92
[Frontend] Optionally remove memory buffer used for uploading to URLs in run_batch ( #12927 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2025-02-14 08:22:42 +00:00
45f90bcbba
[WIP] TPU V1 Support Refactored ( #13049 )
2025-02-14 00:21:53 -08:00
b0ccfc565a
[Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch ( #13126 )
2025-02-13 22:39:20 -08:00
ba59b78a9c
[ROCm][V1] Add initial ROCm support to V1 ( #12790 )
2025-02-13 22:21:50 -08:00
cbc40128eb
[V1] LoRA - Enable Serving Usecase ( #12883 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-14 14:21:12 +08:00
f0b2da72a8
Expand MLA to support most types of quantization ( #13181 )
2025-02-13 22:19:22 -08:00
f2b20fe491
Consolidate Llama model usage in tests ( #13094 )
2025-02-13 22:18:03 -08:00
40932d7a05
[Misc] Remove redundant statements in scheduler.py ( #13229 )
2025-02-13 22:07:25 -08:00
84683fa271
[Bugfix] Offline example of disaggregated prefill ( #13214 )
2025-02-13 20:20:47 -08:00
067678262a
[Bugfix][CI] Inherit codespell settings from pyproject.toml in the pre-commit-config ( #13237 )
2025-02-13 20:19:43 -08:00
09545c0a94
[Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on ( #13250 )
2025-02-13 20:19:25 -08:00
dd5ede4440
[V1] Consolidate MM cache size to vllm.envs ( #13239 )
2025-02-13 20:19:03 -08:00
8c32b08a86
[Kernel] Fix awq error when n is not divisible by 128 ( #13227 )
2025-02-13 20:07:05 -08:00
410886950a
[ROCm] Avoid using the default stream on ROCm ( #13238 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-02-14 09:29:26 +08:00
e38be640e6
Revert "Add label if pre-commit passes" ( #13242 )
2025-02-13 16:12:32 -08:00
c1e37bf71b
[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels ( #13198 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-14 00:01:14 +00:00
2344192a55
Optimize moe_align_block_size for deepseek_v3 ( #12850 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-13 18:43:37 -05:00
bffddd9a05
Add label if pre-commit passes ( #12527 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-13 20:51:30 +00:00
d84cef76eb
[Frontend] Add /v1/audio/transcriptions OpenAI API endpoint ( #12909 )
2025-02-13 07:23:45 -08:00
37dfa60037
[Bugfix] Missing Content Type returns 500 Internal Server Error ( #13193 )
2025-02-13 06:52:22 -08:00
1bc3b5e71b
[VLM] Separate text-only and vision variants of the same model architecture ( #13157 )
2025-02-13 06:19:15 -08:00
02ed8a1fbe
[Misc] Qwen2.5-VL Optimization ( #13155 )
2025-02-13 06:17:57 -08:00
2092a6fa7d
[V1][Core] Add worker_base for v1 worker ( #12816 )
...
Signed-off-by: Aoyu <aoyuzhan@amazon.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Aoyu <aoyuzhan@amazon.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-02-13 20:35:18 +08:00
c9d3ecf016
[VLM] Merged multi-modal processor for Molmo ( #12966 )
2025-02-13 04:34:00 -08:00
fdcf64d3c6
[V1] Clarify input processing and multimodal feature caching logic ( #13211 )
2025-02-13 03:43:24 -08:00
578087e56c
[Frontend] Pass pre-created socket to uvicorn ( #13113 )
2025-02-13 00:51:46 -08:00
fa253f1a70
[VLM] Remove input processor from clip and siglip ( #13165 )
2025-02-13 00:31:37 -08:00
9605c1256e
[V1][core] Implement pipeline parallel on Ray ( #12996 )
2025-02-13 08:02:46 +00:00
0ccd8769fb
[CI/Build] Allow ruff to auto-fix some issues ( #13180 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-13 07:45:38 +00:00
cb944d5818
Allow Unsloth Dynamic 4bit BnB quants to work ( #12974 )
2025-02-12 23:13:08 -08:00
d46d490c27
[Frontend] Move CLI code into vllm.cmd package ( #12971 )
2025-02-12 23:12:21 -08:00
04f50ad9d1
[Bugfix] deepseek_r1_reasoning_parser put reason content in wrong field in certain edge case ( #13097 )
2025-02-12 23:11:26 -08:00
60c68df6d1
[Build] Automatically use the wheel of the base commit with Python-only build ( #13178 )
2025-02-12 23:10:28 -08:00
009439caeb
Simplify logic of locating CUDART so file path ( #13203 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-13 13:52:41 +08:00
bc55d13070
[VLM] Implement merged multimodal processor for Mllama ( #11427 )
2025-02-12 20:26:21 -08:00
d88c8666a1
[Bugfix][Example] Fix GCed profiling server for TPU ( #12792 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-02-13 11:52:11 +08:00
4fc5c23bb6
[NVIDIA] Support nvfp4 quantization ( #12784 )
2025-02-12 19:51:51 -08:00
9f9704dca6
[perf-benchmark] cleanup unused Docker images and volumes in H100 benchmark instance ( #12706 )
2025-02-12 19:51:33 -08:00
8eafe5eaea
[CI/Build] Ignore ruff warning up007 ( #13182 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-13 11:48:31 +08:00
4c0d93f4b2
[V1][Bugfix] Copy encoder input ids to fix set iteration issue during VLM abort ( #13173 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
2025-02-12 12:58:11 -08:00
14b7899d10
[CI] Fix failing FP8 cpu offload test ( #13170 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
2025-02-12 19:16:06 +00:00
09972e716c
[Bugfix] Allow fallback to AWQ from AWQMarlin at per-layer granularity ( #13119 )
2025-02-12 09:19:53 -08:00
36a08630e8
[CORE] [QUANT] Support for GPTQModel's dynamic quantization per module override/control ( #7086 )
2025-02-12 09:19:43 -08:00
2c2b560f48
[CI/Build] Use mypy matcher for pre-commit CI job ( #13162 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-12 17:12:22 +00:00
042c3419fa
Introduce VLLM_CUDART_SO_PATH to allow users specify the .so path ( #12998 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-12 09:06:13 -08:00
82cabf53a3
[Misc] Delete unused LoRA modules ( #13151 )
2025-02-12 08:58:24 -08:00
314cfade02
[Frontend] Generate valid tool call IDs when using tokenizer-mode=mistral ( #12332 )
2025-02-12 08:29:56 -08:00
985b4a2b19
[Bugfix] Fix num video tokens calculation for Qwen2-VL ( #13148 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-12 11:55:23 +00:00
f4d97e4fc2
[Bug] [V1] Try fetching stop_reason from EngineOutput before checking the request ( #13108 )
2025-02-12 02:39:16 -08:00
f1042e86f0
[Misc] AMD Build Improvements ( #12923 )
2025-02-12 02:36:10 -08:00
7c4033acd4
Further reduce the HTTP calls to huggingface.co ( #13107 )
2025-02-12 02:34:09 -08:00
d59def4730
Bump actions/setup-python from 5.3.0 to 5.4.0 ( #12672 )
2025-02-12 16:41:22 +08:00
0c7d9effce
Bump helm/chart-testing-action from 2.6.1 to 2.7.0 ( #12463 )
2025-02-12 16:41:06 +08:00
dd3b4a01f8
Bump actions/stale from 9.0.0 to 9.1.0 ( #12462 )
2025-02-12 00:40:25 -08:00
a0597c6b75
Bump helm/kind-action from 1.10.0 to 1.12.0 ( #11612 )
2025-02-12 00:40:19 -08:00
e92694b6fe
[Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency ( #12921 )
...
Signed-off-by: Lingfan Yu <lingfany@amazon.com >
2025-02-11 21:12:37 -08:00
842b0fd402
[ci] Add more source file dependencies for some tests ( #13123 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-11 20:38:10 -08:00
974dfd4971
[Model] IBM/NASA Prithvi Geospatial model ( #12830 )
2025-02-11 20:34:30 -08:00
3ee696a63d
[RFC][vllm-API] Support tokenizer registry for customized tokenizer in vLLM ( #12518 )
...
Signed-off-by: Keyun Tong <tongkeyun@gmail.com >
2025-02-12 12:25:58 +08:00
72c2b68dc9
[Misc] Move pre-commit suggestion back to the end ( #13114 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-11 22:34:16 +00:00
14ecab5be2
[Bugfix] Guided decoding falls back to outlines when fails to import xgrammar ( #12976 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-11 18:17:44 +00:00
deb6c1c6b4
[Doc] Improve OpenVINO installation doc ( #13102 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-11 18:02:46 +00:00
565c1efa65
[CI/Build][Bugfix] Fix CPU backend default threads num ( #13077 )
2025-02-11 16:55:56 +00:00
2b25b7d2e1
Fix initializing GGUF weights for ColumnParallelLinear when using tensor parallel > 1 ( #13023 )
2025-02-11 08:38:48 -08:00
6c4dbe23eb
[BugFix] Pop instead of del CUDA_VISIBLE_DEVICES ( #12962 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2025-02-12 00:21:50 +08:00
21f5d50fa5
[Bugfix] Do not use resource module on Windows ( #12858 ) ( #13029 )
2025-02-11 08:21:18 -08:00
bf3e05215c
[Misc] Fix typo at comments at metrics.py ( #13024 )
2025-02-11 08:20:37 -08:00
ad9776353e
Set torch_dtype in TransformersModel ( #13088 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-11 23:51:19 +08:00
75e6e14516
[V1][Metrics] Add several request timing histograms ( #12644 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-02-11 10:14:00 -05:00
110f59a33e
[Bugfix] fix flaky test ( #13089 )
...
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
2025-02-11 14:41:20 +00:00
2e3b969ec0
[Platform] add pre_register_and_update function ( #12432 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-02-11 22:06:46 +08:00
da317197dd
[Build] Fix cuda link target of cumem_allocator in CPU env ( #12863 )
...
Signed-off-by: YuhongGuo <yuhong.gyh@antgroup.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-11 21:55:57 +08:00
7539bbc6a6
[ROCm] Using a more precise memory profiling ( #12624 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-02-11 21:47:10 +08:00
9cf4759493
[executor] init local_rank as device index ( #13027 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-02-11 21:20:53 +08:00
41c5dd45b9
[V1][Metrics] Add GPU prefix cache hit rate % gauge ( #12592 )
2025-02-11 08:27:25 +00:00
fc6485d277
[Bugfix]: Reasoning output bug according to the chat template change ( #13025 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
2025-02-11 15:49:03 +08:00
78a141d768
[Misc] LoRA - Refactor Punica ops tests ( #12970 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-11 07:26:03 +00:00
c320ca8edd
[Core] Don't do platform detection at import time ( #12933 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-11 07:25:25 +00:00
58047c6f04
[Benchmark] Add BurstGPT to benchmark_serving ( #13063 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-02-10 21:25:30 -08:00
cb080f32e3
[Bugfix] Support missing tool parameters in mistral tokenizer ( #12884 )
...
Signed-off-by: Florian Greinacher <florian.greinacher@siemens.com >
2025-02-11 03:33:33 +00:00
2c0f58203c
[Docs] Announce Meta Meetup ( #13065 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-02-10 18:24:29 -08:00
2ff4857678
[V1][Minor] Move scheduler outputs to a separate file ( #13062 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-11 02:10:06 +00:00
91e876750e
[misc] Fix setup.py condition to avoid AMD from being mistaken with CPU ( #13022 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2025-02-10 18:06:16 -08:00
08b2d845d6
[Model] Ultravox Model: Support v0.5 Release ( #12912 )
...
Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai >
2025-02-10 22:02:48 +00:00
2ae889052c
Fix seed parameter behavior in vLLM ( #13007 )
...
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com >
2025-02-10 23:26:50 +08:00
51f0b5f7f6
[Bugfix] Clean up and fix multi-modal processors ( #13012 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-10 10:45:21 +00:00
fde71262e0
[misc] Add retries with exponential backoff for HF file existence check ( #13008 )
2025-02-10 01:15:02 -08:00
243137143c
[Doc] Add link to tool_choice tracking issue in tool_calling.md ( #13003 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-10 06:09:33 +00:00
b2496bb07f
[core] fix sleep mode and pytorch checkpoint compatibility ( #13001 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-10 13:03:43 +08:00
44607e07d3
Check if selected backend is None in get_attn_backend_cls() ( #12975 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-02-10 11:45:07 +08:00
67c4637ccf
[V1] Use msgpack for core request serialization ( #12918 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-02-10 11:35:56 +08:00
aa0ca5ebb7
[core][rlhf] add colocate example for RLHF ( #12984 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-10 10:28:59 +08:00
59fff4a01a
[core] improve error handling when wake up from sleep mode ( #12981 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-10 09:38:57 +08:00
29f1d47e73
[MISC] Always import version library first in the vllm package ( #12979 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-09 18:56:40 +08:00
cf797aa856
[core] port pynvml into vllm codebase ( #12963 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-09 15:00:00 +08:00
24700c346b
[V1] Cache uses_mrope in GPUModelRunner ( #12969 )
2025-02-08 15:32:32 -08:00
d366ccc4e3
[RFC] [Mistral] FP8 format ( #10130 )
...
Signed-off-by: mgoin <mgoin64@gmail.com >
Co-authored-by: mgoin <mgoin64@gmail.com >
2025-02-08 14:12:53 -07:00
870c37481e
[V1][Minor] Remove outdated comment ( #12968 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-08 12:48:30 -08:00
86222a3dab
[VLM] Merged multi-modal processor for GLM4V ( #12449 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-02-08 20:32:16 +00:00
fe743b798d
[bugfix] fix early import of flash attention ( #12959 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-09 00:06:56 +08:00
913df14da3
[Bugfix] Remove unused seq_group_metadata_list from ModelInputForGPU ( #12935 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-02-08 14:46:19 +00:00
8a69e0e20e
[CI/Build] Auto-fix Markdown files ( #12941 )
2025-02-08 04:25:15 -08:00
4c8dd12ef3
[Misc] Add qwen2.5-vl BNB support ( #12944 )
2025-02-08 04:24:47 -08:00
256a2d29dc
[Doc] Correct HF repository for TeleChat2 models ( #12949 )
2025-02-08 01:42:15 -08:00
c45d398e6f
[CI] Resolve transformers-neuronx version conflict ( #12925 )
2025-02-08 01:41:35 -08:00
011e612d92
[Misc] Log time consumption on weight downloading ( #12926 )
2025-02-08 09:16:42 +00:00
7e1837676a
[misc] Add LoRA to benchmark_serving ( #12898 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-08 17:15:44 +08:00
2880e21e3d
[Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel Gaudi ( #12812 )
...
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai >
2025-02-08 17:15:30 +08:00
407b5537db
[Build] Make pypi install work on CPU platform ( #12874 )
2025-02-08 01:15:15 -08:00
4ea48fb35c
[V1][Minor] Move cascade attn logic outside _prepare_inputs ( #12943 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-08 00:39:09 -08:00
e31498bdcb
[Misc] Add offline test for disaggregated prefill ( #12418 )
2025-02-08 08:38:20 +00:00
91dd8f7aa6
[bugfix] respect distributed_executor_backend in world_size=1 ( #12934 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-08 16:17:08 +08:00
d01f66b039
[Bugfix] Fix multi-round chat error when mistral tokenizer is used ( #12859 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-02-08 07:04:34 +00:00
cc01223f3b
[Misc] Fix typo in the example file ( #12896 )
...
Signed-off-by: Zhao Ke <yingxiongraomingzk@gmail.com >
2025-02-08 06:56:43 +00:00
306923da82
[Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mapping ( #12905 )
2025-02-07 21:02:53 -08:00
3243158336
[V1] Move KV block hashes from Request to KVCacheManager ( #12922 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-07 19:14:10 -08:00
b21f0f9d17
[V1][Minor] Remove outdated comment ( #12928 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-07 19:07:37 -08:00
45cbc4991d
[Bugfix] Fix disagg hang caused by the prefill and decode communication issues ( #12723 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-07 16:39:50 -08:00
932c6b7461
[V1] LM Eval With Streaming Integration Tests ( #11590 )
2025-02-07 15:07:03 -08:00
eaa92d4437
[ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing ( #12501 )
2025-02-07 08:13:43 -08:00
0630d4537a
[V1] Logprobs and prompt logprobs support ( #9880 )
...
This PR is adding support for sample logprobs & prompt logprobs to vLLM v1.
New behavior:
- During model execution, model runner computes sample logprobs (if user-provided logprobs setting is not None) and prompt logprobs (if user-provided prompt_logprobs setting is not None). For both sample and prompt logprobs, the engine core returns 3 vectors: token ids, token logprob values, token ranks. Ranks reflect tokens' 1-indexed positions in the vocabulary vector after sorting the vocabulary by log probability in descending order.
- In scheduler.update_from_output(), sample and prompt logprobs are incorporated into the EngineCoreOutput data structure which is transferred to the engine client. If multiprocessing is enabled, then sample and prompt logprobs will be (de)serialized when the EngineCoreOutput data structure is (de)serialized.
- During output processing, the LogprobsProcessor transforms the triplet of token ids, token logprob values, and token ranks into the OpenAI-compatible `List[Dict[token_id, Logprob]]` format (for sample and prompt logprobs, respectively).
- Each Logprob instance (whether sample- or prompt-) consists of a token's log-probability, rank, and detokenized string representation. Note that logprob detokenization is handled by the LogprobsProcessor not the detokenizer.
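The rank definition above (a token's 1-indexed position after sorting the vocabulary by log probability, descending) can be sketched in a few lines. This is an illustrative standalone example, not the PR's actual model-runner code; `topk_with_ranks` is a hypothetical helper name.

```python
import math

def topk_with_ranks(logprobs, k):
    # Sort token ids by logprob, descending; a token's rank is its
    # 1-indexed position in that ordering, as described above.
    order = sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)
    ranks = {tok: r + 1 for r, tok in enumerate(order)}
    # Return the three vectors zipped per top-k token:
    # (token id, logprob value, rank).
    return [(tok, logprobs[tok], ranks[tok]) for tok in order[:k]]

logprobs = [math.log(p) for p in (0.1, 0.6, 0.3)]
print(topk_with_ranks(logprobs, 2))
```

For a vocabulary of three tokens with probabilities (0.1, 0.6, 0.3), token 1 gets rank 1 and token 2 gets rank 2, matching the engine-core triplet format described above.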
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2025-02-07 07:26:20 -08:00
538fab93cd
PR #12718 ( #12718 )
2025-02-07 06:22:37 -08:00
ce26b16268
[Misc] Remove unnecessary detokenization in multimodal processing ( #12868 )
2025-02-07 06:21:17 -08:00
1918aa1b80
[MISC][EASY] Break check file names into entry and args in the pre-commit hooks ( #12880 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-07 13:04:39 +00:00
6e1fc61f0f
Prevent unnecessary requests to huggingface hub ( #12837 )
2025-02-06 21:37:41 -08:00
aa375dca9f
[Bugfix] Missing quant_config in deepseek embedding layer ( #12836 )
2025-02-06 21:35:09 -08:00
433c4a4923
Make vllm compatible with verl ( #12824 )
...
Co-authored-by: zhangshulai <zhangshulai@bytedance.com >
2025-02-07 11:54:20 +08:00
ef533d25fb
[Bugfix] FA2 illegal memory access ( #12848 )
2025-02-06 19:54:07 -08:00
b260782357
[misc] Revert #12833 ( #12857 )
...
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-06 16:29:12 -08:00
741429a4cd
[MISC] Check space in the file names in the pre commit checks ( #12804 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-06 15:36:21 -08:00
aff404571b
Add Bamba Model ( #10909 )
...
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-06 15:22:42 -08:00
467a96a541
[V1] LoRA Support ( #10957 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-02-06 09:32:51 -08:00
8108ac841d
[Bugfix] Fix unsupported FA version check for Turing GPU ( #12828 )
2025-02-06 09:18:22 -08:00
afe74f7a96
[Doc] double quote cmake package in build.inc.md ( #12840 )
2025-02-06 09:17:55 -08:00
09b95e36ab
[torch.compile] PyTorch 2.6 and nightly compatibility ( #12393 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-07 01:09:07 +08:00
85ac82d228
[Kernel] Make rotary_embedding ops more flexible with input shape ( #12777 )
2025-02-06 08:46:13 -08:00
1e57b1ee63
[Misc] Remove unnecessary decode call ( #12833 )
2025-02-06 08:45:44 -08:00
e152f29502
[misc] Reduce number of config file requests to HuggingFace ( #12797 )
...
Signed-off-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal >
2025-02-06 14:59:18 +00:00
c786e757fa
[Attention] Use FA3 for MLA on Hopper ( #12807 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-02-06 11:43:12 +00:00
cefd56ee35
[Docs] Add Google Cloud Slides ( #12814 )
2025-02-06 01:02:38 -08:00
7ca9934fe7
[Misc] Update w2 scale loading for GPTQMarlinMoE ( #12757 )
2025-02-06 01:02:14 -08:00
0408efc6d0
[Misc] Improve error message for incorrect pynvml ( #12809 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-06 15:23:50 +08:00
449d1bce02
[Misc] Remove duplicated DeepSeek V2/V3 model definition ( #12793 )
2025-02-05 23:16:20 -08:00
1a6fcad4c9
Improve TransformersModel UX ( #12785 )
2025-02-05 22:24:57 -08:00
56534cd577
[Bugfix] Fix the test_ultravox.py's license ( #12806 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-02-06 13:25:54 +08:00
d88506dda4
[Model] LoRA Support for Ultravox model ( #11253 )
2025-02-05 19:54:13 -08:00
9cdea30b4f
[Misc][Easy] Remove the space from the file name
2025-02-05 19:23:35 -08:00
76abd0c881
[Bugfix] Better FP8 supported defaults
2025-02-05 19:22:19 -08:00
5b19b93082
[ROCm][Kernel] Using the correct warp_size value
2025-02-05 19:15:08 -08:00
75404d041b
[VLM] Update compatibility with transformers 4.49
2025-02-05 19:09:45 -08:00
bf3b79efb8
[VLM] Qwen2.5-VL
2025-02-05 13:31:38 -08:00
9a5b1554b4
[Docs] Drop duplicate [source] links
2025-02-05 13:30:50 -08:00
a4ce74c14a
[VLM] Use shared field to pass token ids to model
2025-02-05 13:30:46 -08:00
3b2005e1db
Add: Support for Sparse24Bitmask Compressed Models
2025-02-05 13:30:43 -08:00
af8486de49
[Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU)
2025-02-05 13:29:45 -08:00
4c3aac51e1
Merging PR #12536
...
Merged via CLI script
2025-02-05 13:24:26 -08:00
bc1bdecebf
[core][distributed] exact ray placement control ( #12732 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-06 02:03:19 +08:00
022bcc701a
[Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 ( #12546 )
2025-02-04 23:11:02 -08:00
c53dc466b1
[Doc] Remove performance warning for auto_awq.md ( #12743 )
2025-02-04 22:43:11 -08:00
3d09e592a8
[V1][Misc] Shorten FinishReason enum and use constant strings ( #12760 )
2025-02-04 22:43:02 -08:00
fcf2e3d7fc
[Bugfix] Fix OpenVINO model runner ( #12750 )
2025-02-04 22:42:46 -08:00
58b218d7ae
[Doc] Update PR Reminder with link to Developer Slack ( #12748 )
2025-02-04 22:42:09 -08:00
7ff7a638b6
[Model][Quant] Fix GLM, Fix fused module mappings for quantization ( #12634 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-02-05 05:32:06 +00:00
686006a220
[Misc] Bump the compressed-tensors version ( #12736 )
2025-02-04 20:44:48 -08:00
98fd089fc9
[VLM] Add MLA with pure RoPE support for deepseek-vl2 models ( #12729 )
2025-02-04 20:44:26 -08:00
249824c3bf
Refactor Linear handling in TransformersModel ( #12727 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-02-05 04:31:12 +00:00
64862d106e
[ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling ( #12713 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2025-02-05 03:58:22 +00:00
b3a0d01e45
[Core] add and implement VLLM_LOGITS_PROCESSOR_THREADS ( #12368 )
...
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com >
2025-02-04 18:46:26 -08:00
75e94309e8
[Perf] Mem align KV caches for CUDA devices (MLA perf improvement) ( #12676 )
...
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-02-04 18:22:24 -08:00
233df6f5c4
[V1][Metrics] Add request_success_total counter, labelled with finish reason ( #12579 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-02-04 19:46:54 -05:00
18016a5e62
[Bugfix] Fix CI failures for InternVL and Mantis models ( #12728 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-02-04 23:54:23 +08:00
649550f27e
[Build] update requirements of no-device for plugin usage ( #12630 )
...
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com >
2025-02-04 21:19:12 +08:00
62467a834a
Avoid unnecessary multi-modal input data copy when len(batch) == 1 ( #12722 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2025-02-04 21:03:19 +08:00
6469038b14
[Bugfix] Fix loading of fine-tuned models based on Phi-3-Small ( #12689 )
...
Signed-off-by: Michael Greenbaum <mgreenbaum@microsoft.com >
Co-authored-by: Michael Greenbaum <mgreenbaum@microsoft.com >
2025-02-04 20:58:48 +08:00
815079de8e
[VLM] merged multimodal processor and V1 support for idefics3 ( #12660 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-02-04 20:00:51 +08:00
18a88fcccc
[V1] Remove scheduling constraint on partial requests ( #12674 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-04 02:43:58 -08:00
d1ca7df84d
[VLM] Merged multi-modal processor for InternVL-based models ( #12553 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-02-04 16:44:52 +08:00
96b23621c1
[Misc] Add BNB quantization for Whisper ( #12381 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-02-04 16:27:36 +08:00
c36ac98d01
[AMD][ROCm] Enable DeepSeek model on ROCm ( #12662 )
...
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
2025-02-04 08:24:11 +00:00
4896d0c2dd
[Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs ( #12711 )
2025-02-03 23:27:11 -08:00
bb392af434
[Doc] Replace ibm-fms with ibm-ai-platform ( #12709 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-02-04 07:05:04 +00:00
5d98d56089
Support Pixtral-Large HF by using llava multimodal_projector_bias config ( #12710 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-02-04 11:55:46 +08:00
73b35cca7f
[Core] Improve hash collision avoidance in prefix caching ( #12621 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-03 16:28:20 -08:00
5095e96606
[V1] Revert uncache_blocks and support recaching full blocks ( #12415 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-03 15:04:53 -08:00
cf58b9c4ca
[MISC] Remove model input dumping when exception ( #12582 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-02-03 13:34:16 -08:00
4797dad3ec
[Model] Add Deepseek V3 fp8_w8a8 configs for B200 ( #12707 )
2025-02-03 13:30:39 -08:00
6dd5e52823
Squelch MLA warning for Compressed-Tensors Models ( #12704 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-02-03 13:29:56 -08:00
c11de33dad
[Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm ( #12696 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-03 13:04:59 -08:00
33e0602e59
[Misc] Fix improper placement of SPDX header in scripts ( #12694 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-03 11:16:59 -08:00
a1a2aaadb9
[Model]: Add transformers backend support ( #11330 )
...
# Adds support for `transformers` as a backend
Following https://github.com/huggingface/transformers/pull/35235 , a
bunch of models should already be supported, and we are ramping up
support for more models.
Thanks @Isotr0py for the TP support, and @hmellor for his help as well!
This includes:
- `trust_remote_code=True` support: any model on the Hub that
implements attention the correct way can be natively supported!
- tensor parallel support
---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2025-02-03 21:30:38 +08:00
1298a400e8
[ci/build] fix gh200 test ( #12681 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 15:59:49 +08:00
ad4a9dc817
[cuda] manually import the correct pynvml module ( #12679 )
...
fixes problems like https://github.com/vllm-project/vllm/pull/12635 and
https://github.com/vllm-project/vllm/pull/12636 and
https://github.com/vllm-project/vllm/pull/12565
---------
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 15:58:21 +08:00
b9986454fe
Fix for attention layers to remain unquantized during moe_wn16 quant ( #12570 )
...
Fix to AWQ quant loading of the new R1 model.
The new optimized MoE kernels for a large number of experts,
`moe_wn16`, use AWQ quant, which requires the attention layers to be
in 16-bit precision. The current merge had broken this; the
`get_quant_method` must return None for those layers for it to work
correctly again.
---------
Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Beim <beim2015@outlook.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com >
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Signed-off-by: Ryan N <ryan.nguyen@centml.ai >
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Shawn Du <shawnd200@outlook.com >
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Beim <805908499@qq.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com >
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com >
Co-authored-by: fade_away <1028552010@qq.com >
Co-authored-by: weilong.yu <weilong.yu@shopee.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com >
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: Shawn Du <shawnd200@outlook.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-02-03 13:46:19 +08:00
c5932e5dac
Properly check if all fused layers are in the list of targets ( #12666 )
...
Thanks @kylesayrs for catching this!
2025-02-03 13:42:18 +08:00
20579c0fae
make sure mistral_common not imported for non-mistral models ( #12669 )
...
When people use deepseek models, they find that they need to resolve a
cv2 version conflict; see https://zhuanlan.zhihu.com/p/21064432691 .
I added the check and made all imports of `cv2` lazy.
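The lazy-import pattern described here can be sketched as follows (a generic illustration, not vLLM's actual code; `decode_image` is a hypothetical helper, and the stdlib `base64` module stands in for the heavy `cv2` dependency):

```python
def decode_image(data: bytes) -> bytes:
    # Import the heavy dependency inside the function so it is only
    # loaded when this code path actually runs; models that never touch
    # images never pay the import cost (or hit its version conflicts).
    import base64  # stands in for cv2 in this sketch
    return base64.b64decode(data)
```

Importing the surrounding module stays cheap; the dependency is loaded only on the first call that needs it.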
---------
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 13:40:25 +08:00
95460fc513
[Kernel] port sgl moe_align_block_size kernels ( #12574 )
...
sgl_moe_align_block_size is based on:
ded9fcd09a
moe_align_block_size is based on:
ba5112ff69
Signed-off-by: Yang Chen <yangche@fb.com >
2025-02-03 13:09:50 +08:00
326fcc8b9f
[Doc] Deprecate Discord ( #12668 )
2025-02-02 19:19:56 -08:00
e64330910b
[doc][misc] clarify VLLM_HOST_IP for multi-node inference ( #12667 )
...
As more and more people try deepseek models with multi-node inference,
https://github.com/vllm-project/vllm/issues/7815 comes up more and more
frequently. Let's give users a clear message.
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-02-03 09:32:18 +08:00
e489ad7a21
[Misc] Add SPDX-License-Identifier headers to python source files ( #12628 )
...
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**
commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:18:24 2025 -0500
Add SPDX license headers to python source files
This commit adds SPDX license headers to python source files, as
recommended to the project by the Linux Foundation. These headers
provide a concise way, both human and machine readable, to communicate
license information for each source file. It helps avoid any ambiguity
about the license of the code and can also be easily used by tools to
help manage license compliance.
The Linux Foundation runs license scans against the codebase to help
ensure we are in compliance with the licenses of the code we use,
including dependencies. Having these headers in place helps that tool
do its job.
More information can be found on the SPDX site:
- https://spdx.dev/learn/handling-license-info/
Signed-off-by: Russell Bryant <rbryant@redhat.com >
commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com >
Date: Fri Jan 31 14:36:32 2025 -0500
Check for SPDX headers using pre-commit
Signed-off-by: Russell Bryant <rbryant@redhat.com >
---------
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-02 11:58:18 -08:00
f256ebe4df
[Hardware][Intel GPU] add XPU bf16 support ( #12392 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-02-02 10:17:26 +00:00
f8ece6e17f
[Core][v1] Unify allocating slots in prefill and decode in KV cache manager ( #12608 )
...
As mentioned in RFC https://github.com/vllm-project/vllm/issues/12254 ,
this PR achieves the task: combine allocate_slots and append_slots.
There should be no functionality change, except that decode now also
raises an exception when num_tokens is zero (like prefill), with the
unit test case changed accordingly.
@comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo
---------
Signed-off-by: Shawn Du <shawnd200@outlook.com >
2025-02-02 16:40:58 +08:00
abfcdcdf27
[V1][Minor] Avoid frequently creating ConstantList ( #12653 )
...
A small optimization to avoid creating a new `ConstantList` every time `request.kv_block_hashes` is used.
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-02-01 23:43:20 -08:00
e497f33491
[Core] Silence unnecessary deprecation warnings ( #12620 )
...
I noticed during testing that I was getting a lot of these deprecation
warnings about `lora_local_path`:
```
DeprecationWarning: The 'lora_local_path' attribute is deprecated
and will be removed in a future version.
Please use 'lora_path' instead.
```
The check used to emit this warning was always True, even when the
parameter was not actually specified, because the attribute is always
present in `__struct_fields__`. We should be checking for a non-None
value instead.
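A minimal sketch of the fix described here, using a plain dataclass as a stand-in for the real request type (the class and `warn_if_deprecated` helper are illustrative, not vLLM's actual code; field names follow the warning text):

```python
import warnings
from dataclasses import dataclass
from typing import Optional

@dataclass
class LoRARequest:
    lora_path: str = ""
    lora_local_path: Optional[str] = None  # deprecated alias

def warn_if_deprecated(req: LoRARequest) -> bool:
    # Checking for field *presence* is always True, since the field is
    # declared on every instance; warn only when a value was actually set.
    if req.lora_local_path is not None:
        warnings.warn(
            "The 'lora_local_path' attribute is deprecated; "
            "please use 'lora_path' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return True
    return False
```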
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-02-02 15:35:50 +08:00
baaa2b24da
[Bugfix] fix moe_wna16 get_quant_method ( #12648 )
...
Fix https://github.com/vllm-project/vllm/issues/12647
The `get_quant_method` of `moe_wna16` always returns the moe method, a
GPTQ-based linear method, or an AWQ-based linear method, even when the
target module is an attention layer.
baeded2569/vllm/attention/layer.py (L86-L92)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-02-02 15:29:56 +08:00
b4e5c03306
doc: fixing minor typo in readme.md ( #12643 )
...
The word "evolved" was mistyped.
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
---------
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com >
2025-02-01 17:17:29 +00:00
3194039c0e
Apply torch.compile to fused_moe/grouped_topk ( #12637 )
2025-02-01 16:16:19 +00:00
4f4d427ac2
Disable chunked prefill and/or prefix caching when MLA is enabled ( #12642 )
...
From @mgoin in https://github.com/vllm-project/vllm/pull/12638 .
I cannot push to that branch, so this is a new PR to unblock the
release.
---------
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-31 23:46:57 -08:00
1e3698393f
[CI/Build] Add label automation for structured-output, speculative-decoding, v1 ( #12280 )
...
We have `v1`, `structured-output`, and `speculative-decoding` labels on
github. This adds automation for applying these labels based on the
files touched by a PR.
Signed-off-by: Russell Bryant <rbryant@redhat.com >
---------
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-31 23:13:10 -08:00
baeded2569
[Attention] Deepseek v3 MLA support with FP8 compute ( #12601 )
...
This PR implements the Deepseek V3 support by performing matrix absorption on the fp8 weights
---------
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
2025-01-31 21:52:51 -08:00
3e1c76cf3a
Fix: Respect sparsity_config.ignore in Cutlass Integration ( #12517 )
...
This PR addresses a bug in the Cutlass integration where the
`sparsity_config.ignore` list was not being respected. When only a
subset of modules were configured as Sparse24, the system incorrectly
selected Cutlass for non-sparse modules as well. This update ensures the
correct scheme is selected for non-sparse modules, fixing this behavior.
---
### Changes
- Updated logic to correctly respect `sparsity_config.ignore`.
- Ensured non-sparse modules use the appropriate scheme instead of
defaulting to Cutlass.
---
<details>
<summary>Testing Setup</summary>
The fix has been tested on top of [this
diff](https://github.com/vllm-project/vllm/pull/12097 ).
#### Steps to Test:
```bash
git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support
git revert --no-edit aa2cd2c # revert Tyler's commit to turn off Cutlass for W16A16
git cherry-pick ca624cddb # this branch
```
#### Additional Patch Required:
```diff
diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
index a54177c1c..f916dd0c9 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs,
QuantizationStrategy,
QuantizationType)
from pydantic import BaseModel
-
+from vllm.logger import init_logger
from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
UnquantizedLinearMethod)
@@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
should_ignore_layer)
from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
from vllm.platforms import current_platform
-
+logger = init_logger(__name__)
__all__ = ["CompressedTensorsLinearMethod"]
SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config"
```
Apply using:
```bash
git apply logging-patch.patch
```
</details>
---
<details>
<summary>Models Tested</summary>
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed`
</details>
---
<details>
<summary>Example Output</summary>
#### Layers 0-5 (Sparse24)
```
Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj
Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj
...
```
#### Layers 6+ (Non-Sparse, FP8)
```
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj
...
```
</details>
**Note:** It is assumed that all modules in fused layers such as
`QKV_proj` and `Gate_up_proj` follow the same quantization/pruning
scheme.
---
For related tasks using the Asana app for GitHub, refer to [this
link](https://app.asana.com/0/0/1209227810815160 ).
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com >
2025-02-01 13:41:59 +08:00
cfa134d247
[Bugfix/CI] Fixup benchmark_moe.py ( #12562 )
...
Fixes `is_marlin` not being passed into `get_default_config`
Also allow `--tensor-parallel-size` in addition to `-tp` and `--tp-size`
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-02-01 13:41:35 +08:00
35b7a05507
[ci] Upgrade transformers to 4.48.2 in CI dependencies ( #12599 )
2025-01-31 21:22:23 -08:00
1867c258bd
Fix target matching for fused layers with compressed-tensors ( #12617 )
...
Without this PR
---------------
Quantizing models with llm-compressor and a recipe that explicitly lists
names of layers produces a model that is not loadable by vLLM (i.e.
`vllm serve <model>` fails with `raise ValueError(f"Unable to find
matching target for {module} in the ...`).
Example recipe:
```
recipe = """
quantization_stage:
run_type: oneshot
quantization_modifiers:
GPTQModifier:
ignore: ["lm_head"]
config_groups:
group_0:
weights:
num_bits: 4
type: "int"
symmetric: true
strategy: "group"
group_size: 128
targets: [
"model.layers.0.mlp.down_proj",
"model.layers.2.mlp.down_proj",
"model.layers.3.mlp.down_proj",
"model.layers.4.mlp.down_proj",
"model.layers.5.mlp.down_proj",
"model.layers.6.mlp.down_proj",
"model.layers.7.mlp.down_proj",
"model.layers.8.mlp.down_proj",
"model.layers.9.mlp.down_proj",
"model.layers.10.mlp.down_proj",
"model.layers.11.mlp.down_proj",
"model.layers.12.mlp.down_proj",
"model.layers.13.mlp.down_proj",
"model.layers.14.mlp.down_proj",
"model.layers.15.mlp.down_proj",
"model.layers.16.mlp.down_proj",
"model.layers.17.mlp.down_proj",
"model.layers.19.mlp.down_proj",
"model.layers.21.mlp.down_proj",
"model.layers.22.mlp.down_proj",
.
.
.
]
"""
```
To reproduce the vLLM error:
```bash
vllm serve nm-testing/eldar-test
```
With this PR
------------
Models are loaded correctly without any errors.
2025-02-01 05:07:46 +00:00
cb3e73e4c8
[BugFix] fix wrong output when using lora and num_scheduler_steps=8 ( #11161 )
...
FIX issue https://github.com/vllm-project/vllm/issues/9688
https://github.com/vllm-project/vllm/issues/11086 #12487
---------
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: weilong.yu <weilong.yu@shopee.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-02-01 12:52:07 +08:00
b1340f9d55
[V1] Bugfix: Validate Model Input Length ( #12600 )
...
SUMMARY:
* avoid crashing the engine when we get an input longer than
max_model_len
FIX #12567
2025-01-31 18:32:04 -08:00
44bbca78d7
[Doc] int4 w4a16 example ( #12585 )
...
Based on a request by @mgoin , with @kylesayrs we have added an example
doc for int4 w4a16 quantization, following the pre-existing int8 w8a8
quantization example and the example available in
[`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py )
FIX #n/a (no issue created)
@kylesayrs and I have discussed a couple additional improvements for the
quantization docs. We will revisit at a later date, possibly including:
- A section for "choosing the correct quantization scheme/ compression
technique"
- Additional vision or audio calibration datasets
---------
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-31 15:38:48 -08:00
60808bd4c7
[Doc] Improve installation signposting ( #12575 )
...
- Make device tab names more explicit
- Add comprehensive list of devices to
https://docs.vllm.ai/en/latest/getting_started/installation/index.html
- Add `attention` blocks to the intro of all devices that don't have
pre-built wheels/images
---------
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-31 15:38:35 -08:00
fc542144c4
[Feature] Fix guided decoding blocking bitmask memcpy ( #12563 )
...
**[Guided decoding performance optimization]** Sending the guided
decoding bitmask in xgrammar to the GPU
(`self.token_bitmask.to(scores.device)`) is a blocking operation that
prevents the CPU from pre-launching the sampler kernels: the CPU waits
until decode is complete, then copies the bitmask over. This PR makes
the operation async by setting `non_blocking=True`.
(Current) The CPU is blocked on a `cudaStreamSynchronize` and only
pre-empts the sampling kernels after bitmask application. Below is the
Nsys profile for one decode phase from Llama 3.1 8B (profile screenshot
omitted).
With the optimization, this is no longer the case (screenshot omitted).
---------
Signed-off-by: Ryan N <ryan.nguyen@centml.ai >
2025-01-31 15:37:30 -08:00
eb5741ad42
[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 ( #12587 )
...
Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear
layers.
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-31 15:29:11 -08:00
145c2ff648
[Bugfix] Revert MoE Triton Config Default ( #12629 )
...
SUMMARY:
* previous PR for pulling in block configs also changed defaults
(https://github.com/vllm-project/vllm/pull/11589/files ) for FP8
* this broke L4 MoE since there was not enough SHM for the default
configuration
* this reverts the non-block example to the default
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-01-31 15:28:47 -08:00
415f19474d
[release] Add input step to ask for Release version ( #12631 )
...
Instead of having to create a new build with the release version passed
in as an env var.
2025-01-31 13:39:36 -08:00
89003c4082
[v1][Bugfix] Add extra_keys to block_hash for prefix caching ( #12603 )
...
This PR adds an extra key to the block hash, to generate different hash
values for two blocks with the same token string but different
extra_keys in their parent blocks. For example, it can generate
different hash values for the second block of the following two
requests:
```python
request1 = make_request(
request_id=0,
prompt_token_ids=[_ for _ in range(6)],
mm_positions=[{
"offset": 0,
"length": 3
}, {
"offset": 3,
"length": 3
}],
mm_hashes=["hash1", "hash2"],
)
request2 = make_request(
request_id=1,
prompt_token_ids=[_ for _ in range(6)],
mm_positions=[{
"offset": 0,
"length": 3
}, {
"offset": 3,
"length": 3
}],
mm_hashes=["hash3", "hash2"],
)
```
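Conceptually, the block hash now folds the extra keys in alongside the parent hash and token ids. A hedged sketch of the idea (hashlib-based, not vLLM's actual hashing code; `block_hash` is a hypothetical helper):

```python
import hashlib

def block_hash(parent_hash: bytes, token_ids: tuple,
               extra_keys: tuple = ()) -> bytes:
    # Two blocks with identical token ids but different extra_keys
    # (e.g. different multimodal content hashes in a parent block)
    # now produce different hash values.
    h = hashlib.sha256()
    h.update(parent_hash)
    h.update(repr(token_ids).encode())
    h.update(repr(extra_keys).encode())
    return h.digest()
```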
---------
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-31 13:13:04 -08:00
60bcef000e
[Docs][V1] Prefix caching design ( #12598 )
...
- Create v1 design document section in docs.
- Add prefix caching design doc.
@WoosukKwon @ywang96
---------
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-01-31 12:30:46 -08:00
847f883232
[Git] Automatically sign-off commits ( #12595 )
...
It's very annoying when I forget to add `-s` in `git commit` to
sign-off, because I then need to `git rebase HEAD~1 --signoff` and `git
push -f` to fix the DCO. This PR adds a hook to sign off commits
automatically when `-s` is missing. The only change on the user side is
that users now have to install 2 hooks, so instead of just
```
pre-commit install
```
Now we need to
```
pre-commit install --hook-type pre-commit --hook-type commit-msg
```
Note that even if users still only install the pre-commit hook, they
won't get any error in `git commit`. Just the sign-off hook won't run.
cc @hmellor @youkaichao
---------
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2025-01-31 12:30:33 -08:00
325f679f32
[BugFix] Fix Torch.Compile For DeepSeek ( #12594 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-31 12:06:39 -08:00
e3f7ff65e7
Add favicon to docs ( #12611 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-31 09:20:34 -08:00
7a8987dac5
[Bugfix] Gracefully handle huggingface hub http error ( #12571 )
2025-01-31 08:19:35 +00:00
cabaf4eff3
[Attention] MLA decode optimizations ( #12528 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com >
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-30 23:49:37 -08:00
a1fc18c030
[ROCm][AMD][Model] llama 3.2 support upstreaming ( #12421 )
...
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2025-01-31 12:24:28 +08:00
9798b2fb00
[Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling ( #11868 )
2025-01-30 18:33:00 -08:00
4078052f09
[V1][Log] Add max request concurrency log to V1 ( #12569 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-30 23:07:19 +00:00
bd2107e30a
[CPU][PPC] Updated torch, torchvision, torchaudio dependencies ( #12555 )
...
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com >
2025-01-30 16:29:39 -05:00
9b0c4bab36
[Kernel] Triton Configs for Fp8 Block Quantization ( #11589 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
2025-01-30 11:53:22 -08:00
41bf5612f5
[Misc] fix typo: add missing space in lora adapter error message ( #12564 )
...
Signed-off-by: Beim <beim2015@outlook.com >
2025-01-30 15:39:22 +00:00
a2769032ca
Set ?device={device} when changing tab in installation guides ( #12560 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-30 00:05:42 -08:00
f17f1d4608
[V1][Metrics] Add GPU cache usage % gauge ( #12561 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-29 18:31:01 -08:00
1c1bb0bbf2
[Misc][MoE] add Deepseek-V3 moe tuning support ( #12558 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-01-30 00:47:30 +00:00
e0cc5f259a
[V1][BugFix] Free encoder cache for aborted requests ( #12545 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-29 13:47:33 -08:00
73aa6cfdf7
Revert "[Build/CI] Fix libcuda.so linkage" ( #12552 )
2025-01-29 21:12:24 +00:00
27b78c73ca
[Kernel] add triton fused moe kernel for gptq/awq ( #12185 )
2025-01-29 09:07:09 -05:00
b02fd288b2
[Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. ( #11787 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-29 01:46:12 -08:00
ff7424f491
[Frontend] Support override generation config in args ( #12409 )
...
Signed-off-by: liuyanyi <wolfsonliu@163.com >
2025-01-29 01:41:01 -08:00
d93bf4da85
[Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM ( #12069 )
...
Signed-off-by: hzh <hezhihui_thu@163.com >
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
Signed-off-by: Akshat Tripathi <akshat@krai.ai >
Signed-off-by: Oleg Mosalov <oleg@krai.ai >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
Signed-off-by: Chenguang Li <757486878@qq.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Shanshan Shen <467638484@qq.com >
Signed-off-by: elijah <f1renze.142857@gmail.com >
Signed-off-by: Yikun <yikunkero@gmail.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com >
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com >
Co-authored-by: sixgod <evethwillbeok@outlook.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com >
Co-authored-by: Oleg Mosalov <oleg@krai.ai >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com >
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com >
Co-authored-by: Concurrensee <yida.wu@amd.com >
Co-authored-by: Chenguang Li <757486878@qq.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Alex Brooks <alex.brooks@ibm.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Shanshan Shen <467638484@qq.com >
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com >
Co-authored-by: Yikun Jiang <yikunkero@gmail.com >
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Konrad Zawora <kzawora@habana.ai >
Co-authored-by: TJian <tunjian1996@gmail.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com >
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com >
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com >
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-01-29 09:24:59 +00:00
036ca94c25
[Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense ( #12347 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: Wallas Santos <wallashss@ibm.com >
2025-01-29 08:54:35 +00:00
ef001d98ef
Fix the pydantic logging validator ( #12420 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-01-29 07:53:13 +00:00
5f671cb4c3
[V1] Improve Error Message for Unsupported Config ( #12535 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-29 04:56:56 +00:00
bd02164cf9
Bugfix for whisper quantization due to fake k_proj bias ( #12524 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-29 04:49:03 +00:00
46fb056749
[V1][Metrics] Add TTFT and TPOT histograms ( #12530 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-29 04:11:16 +00:00
dd6a3a02cb
[Doc] Convert docs to use colon fences ( #12471 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-29 11:38:29 +08:00
a7e3eba66f
[Frontend] Support reasoning content for deepseek r1 ( #12473 )
...
Signed-off-by: Ce Gao <cegao@tensorchord.ai >
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
2025-01-29 11:38:08 +08:00
fbb5bd4cef
[TPU] Add example for profiling TPU inference ( #12531 )
...
Signed-off-by: mgoin <mgoin@redhat.com >
2025-01-29 03:16:47 +00:00
80fcc3ed1c
[Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels ( #12482 )
...
Signed-off-by: Fenghui Zhang <fhzhang@google.com >
2025-01-28 22:36:44 +00:00
c386c43ca3
[V1][Metrics] Add per-request prompt/generation_tokens histograms ( #12516 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-28 22:07:22 +00:00
f26d790718
Do not run suggestion pre-commit hook multiple times ( #12521 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-28 20:05:27 +00:00
0f657bdc52
Replace missed warning_once for rerank API ( #12472 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-28 19:06:32 +00:00
3fd1fb63ef
[V1][Metrics] Hook up IterationStats for Prometheus metrics ( #12478 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-28 16:38:38 +00:00
925d2f1908
[Doc] Fix typo for x86 CPU installation ( #12514 )
...
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com >
2025-01-28 16:37:10 +00:00
8f58a51358
[VLM] Merged multi-modal processor and V1 support for Qwen-VL ( #12504 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-28 16:25:05 +00:00
2079e43bee
[Core] Make raw_request optional in ServingCompletion ( #12503 )
...
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com >
2025-01-28 10:56:45 +00:00
e29d4358ef
[V1] Include Engine Version in Logs ( #12496 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-01-28 08:27:41 +00:00
8cbc424975
Update README.md with V1 alpha release ( #12495 )
2025-01-28 08:22:41 +00:00
dd66fd2b01
[CI] fix pre-commit error ( #12494 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-01-28 06:11:05 +00:00
0f465ab533
[FEATURE] Enables offline /score for embedding models ( #12021 )
...
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com >
2025-01-28 11:30:13 +08:00
23a7cbc88b
[CI/Build] Fixed the xla nightly issue report in #12451 ( #12453 )
2025-01-28 11:18:07 +08:00
426a5c3625
Fix bad path in prometheus example ( #12481 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-27 18:56:31 -07:00
ddee88d0ff
[Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache ( #11277 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
Co-authored-by: Jiangfei Duan <jfduan@outlook.com >
2025-01-27 17:31:16 -08:00
823ab79633
Update pre-commit hooks ( #12475 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-27 17:23:08 -07:00
6116ca8cd7
[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill ( #10132 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: wallashss <wallashss@ibm.com >
Co-authored-by: wallashss <wallashss@ibm.com >
2025-01-27 13:38:35 -08:00
2bc3fbba0c
[FlashInfer] Upgrade to 0.2.0 ( #11194 )
...
Signed-off-by: Bowen Wang <abmfy@icloud.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-01-27 18:19:24 +00:00
3f1fc7425a
[V1][CI/Test] Do basic test for top-p & top-k sampling ( #12469 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-27 09:40:04 -08:00
01ba927040
[V1][Metrics] Add initial Prometheus logger ( #12416 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2025-01-27 12:26:28 -05:00
103bd17ac5
[Build] Only build 9.0a for scaled_mm and sparse kernels ( #12339 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-27 10:40:00 -05:00
ce69f7f754
[Bugfix] Fix gpt2 GGUF inference ( #12467 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-27 18:31:49 +08:00
624a1e4711
[V1][Minor] Minor optimizations for update_from_output ( #12454 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-27 01:09:27 -08:00
372bf0890b
[Bugfix] Fix missing seq_start_loc in xformers prefill metadata ( #12464 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-27 07:25:30 +00:00
5204ff5c3f
[Bugfix] Fix Granite 3.0 MoE model loading ( #12446 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-26 21:26:44 -08:00
0cc6b383d7
[Frontend] Support scores endpoint in run_batch ( #12430 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2025-01-27 04:30:17 +00:00
28e0750847
[V1] Avoid list creation in input preparation ( #12457 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-26 19:57:56 -08:00
582cf78798
[DOC] Add link to vLLM blog ( #12460 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-27 03:46:19 +00:00
0034b09ceb
[Frontend] Rerank API (Jina- and Cohere-compatible API) ( #12376 )
...
Signed-off-by: Kyle Mistele <kyle@mistele.com >
2025-01-26 19:58:45 -07:00
72bac73067
[Build/CI] Fix libcuda.so linkage ( #12424 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-26 21:18:19 +00:00
68f11149d8
[Bugfix][Kernel] Fix perf regression caused by PR #12405 ( #12434 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-26 11:09:34 -08:00
72f4880425
[Bugfix/CI] Fix broken kernels/test_mha.py ( #12450 )
2025-01-26 10:39:03 -08:00
aa2cd2c43d
[Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 ( #12417 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-26 19:59:58 +08:00
9ddc35220b
[Frontend] generation_config.json for maximum tokens ( #12242 )
...
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com >
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Co-authored-by: shangmingc <caishangming@linux.alibaba.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-26 19:59:25 +08:00
a5255270c3
[Misc] Revert FA on ViT #12355 and #12435 ( #12445 )
2025-01-26 03:56:34 -08:00
0ee349b553
[V1][Bugfix] Fix assertion when mm hashing is turned off ( #12439 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-26 00:47:42 -08:00
fa63e710c7
[V1][Perf] Reduce scheduling overhead in model runner after cuda sync ( #12094 )
...
Signed-off-by: Keyun Tong <tongkeyun@gmail.com >
2025-01-26 00:42:37 -08:00
2a0309a646
[Misc][Bugfix] FA3 support to ViT MHA layer ( #12435 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-26 05:00:31 +00:00
324960a95c
[TPU][CI] Update torchxla version in requirement-tpu.txt ( #12422 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-01-25 07:23:03 +00:00
f1fc0510df
[Misc] Add FA2 support to ViT MHA layer ( #12355 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-25 15:07:35 +08:00
bf21481dde
[ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 ( #12408 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-01-25 12:17:19 +08:00
fb30ee92ee
[Bugfix] Fix BLIP-2 processing ( #12412 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-25 11:42:42 +08:00
221d388cc5
[Bugfix][Kernel] Fix moe align block issue for mixtral ( #12413 )
2025-01-25 01:49:28 +00:00
3132a933b6
[Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). ( #12405 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-24 20:20:59 +00:00
df5dafaa5b
[Misc] Remove deprecated code ( #12383 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-24 14:45:20 -05:00
ab5bbf5ae3
[Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build ( #12375 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-24 15:27:59 +00:00
3bb8e2c9a2
[Misc] Enable proxy support in benchmark script ( #12356 )
...
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp >
2025-01-24 14:58:26 +00:00
e784c6b998
[ci/build] sync default value for wheel size ( #12398 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 17:54:29 +08:00
9a0f3bdbe5
[Hardware][Gaudi][Doc] Add missing step in setup instructions ( #12382 )
2025-01-24 09:43:49 +00:00
c7c9851036
[ci/build] fix wheel size check ( #12396 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 17:31:25 +08:00
3c818bdb42
[Misc] Use VisionArena Dataset for VLM Benchmarking ( #12389 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-24 00:22:04 -08:00
6dd94dbe94
[perf] fix perf regression from #12253 ( #12380 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 11:34:27 +08:00
0e74d797ce
[V1] Increase default batch size for H100/H200 ( #12369 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-24 03:19:55 +00:00
55ef66edf4
Update compressed-tensors version ( #12367 )
2025-01-24 11:19:42 +08:00
5e5630a478
[Bugfix] Path join when building local path for S3 clone ( #12353 )
...
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai >
2025-01-24 11:06:07 +08:00
d3d6bb13fb
Set weights_only=True when using torch.load() ( #12366 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-24 02:17:30 +00:00
24b0205f58
[V1][Frontend] Coalesce bunched RequestOutputs ( #12298 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2025-01-23 17:17:41 -08:00
c5cffcd0cd
[Docs] Update spec decode + structured output in compat matrix ( #12373 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-24 01:15:52 +00:00
682b55bc07
[Docs] Add meetup slides ( #12345 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-23 14:10:03 -08:00
9726ad676d
[Misc] Fix OpenAI API Compatibility Issues in Benchmark Script ( #12357 )
...
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp >
2025-01-23 17:02:13 -05:00
eb5cb5e528
[BugFix] Fix parameter names and process_after_weight_loading for W4A16 MoE Group Act Order ( #11528 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-23 21:40:33 +00:00
2cbeedad09
[Docs] Document Phi-4 support ( #12362 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-23 19:18:51 +00:00
2c85529bfc
[TPU] Update TPU CI to use torchxla nightly on 20250122 ( #12334 )
...
Signed-off-by: Siyuan Liu <lsiyuan@google.com >
2025-01-23 18:50:16 +00:00
e97f802b2d
[FP8][Kernel] Dynamic kv cache scaling factors computation ( #11906 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Micah Williamson <micah.williamson@amd.com >
2025-01-23 18:04:03 +00:00
6e650f56a1
[torch.compile] decouple compile sizes and cudagraph sizes ( #12243 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 02:01:30 +08:00
3f50c148fd
[core] add wake_up doc and some sanity check ( #12361 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-24 02:00:50 +08:00
8c01b8022c
[Bugfix] Fix broken internvl2 inference with v1 ( #12360 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-23 17:20:33 +00:00
99d01a5e3d
[V1] Simplify M-RoPE ( #12352 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: imkero <kerorek@outlook.com >
2025-01-23 23:13:23 +08:00
d07efb31c5
[Doc] Troubleshooting errors during model inspection ( #12351 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-23 22:46:58 +08:00
978b45f399
[Kernel] Flash Attention 3 Support ( #12093 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2025-01-23 06:45:48 -08:00
c5b4b11d7f
[Bugfix] Fix k_proj's bias for whisper self attention ( #12342 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-23 10:15:33 +00:00
8ae5ff2009
[Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update ( #12338 )
...
Signed-off-by: zhenwei <zhenweiliu@habana.ai >
2025-01-23 08:35:46 +00:00
511627445e
[doc] explain common errors around torch.compile ( #12340 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-23 14:56:02 +08:00
f0ef37233e
[V1] Add uncache_blocks ( #12333 )
2025-01-23 04:19:21 +00:00
7551a34032
[Docs] Document vulnerability disclosure process ( #12326 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-23 03:44:09 +00:00
01a55941f5
[Docs] Update FP8 KV Cache documentation ( #12238 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-23 11:18:09 +08:00
8d7aa9de71
[Bugfix] Fixing AMD LoRA CI test. ( #12329 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2025-01-23 10:53:02 +08:00
68c4421b6d
[AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD ( #12282 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-01-23 00:10:37 +00:00
aea94362c9
[Frontend][V1] Online serving performance improvements ( #12287 )
2025-01-22 22:22:12 +00:00
7206ce4ce1
[Core] Support reset_prefix_cache ( #12284 )
2025-01-22 18:52:27 +00:00
96f6a7596f
[Bugfix] Fix HPU multiprocessing executor ( #12167 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2025-01-23 02:07:07 +08:00
84bee4bd5c
[Misc] Improve the readability of BNB error messages ( #12320 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-22 16:56:54 +00:00
fc66dee76d
[Misc] Fix the error in the tip for the --lora-modules parameter ( #12319 )
...
Signed-off-by: wangerxiao <863579016@qq.com >
2025-01-22 16:48:41 +00:00
6609cdf019
[Doc] Add docs for prompt replacement ( #12318 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-22 14:56:29 +00:00
16366ee8bb
[Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 ( #12313 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-22 21:06:36 +08:00
528dbcac7d
[Model][Bugfix]: correct Aria model output ( #12309 )
...
Signed-off-by: xffxff <1247714429@qq.com >
2025-01-22 11:39:19 +00:00
cd7b6f0857
[VLM] Avoid unnecessary tokenization ( #12310 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-22 11:08:31 +00:00
68ad4e3a8d
[Core] Support fully transparent sleep mode ( #11743 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-22 14:39:32 +08:00
4004f144f3
[Build] update requirements of no-device ( #12299 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-01-22 14:29:31 +08:00
66818e5b63
[core] separate builder init and builder prepare for each batch ( #12253 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-22 14:13:52 +08:00
222a9dc350
[Benchmark] More accurate TPOT calc in benchmark_serving.py ( #12288 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2025-01-22 13:46:14 +08:00
cbdc4ad5a5
[Ci/Build] Fix mypy errors on main ( #12296 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-22 12:06:54 +08:00
016e3676e7
[CI] add docker volume prune to neuron CI ( #12291 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-01-22 10:47:49 +08:00
64ea24d0b3
[ci/lint] Add back default arg for pre-commit ( #12279 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2025-01-22 01:15:27 +00:00
df76e5af26
[VLM] Simplify post-processing of replacement info ( #12269 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-21 16:48:13 -08:00
09ccc9c8f7
[Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose ( #12281 )
...
Signed-off-by: Hongxia Yang <hongxyan@amd.com >
2025-01-22 07:49:22 +08:00
69196a9bc7
[BUGFIX] When skip_tokenize_init and multistep are set, execution crashes ( #12277 )
...
Signed-off-by: maleksan85 <maleksan@amd.com >
Co-authored-by: maleksan85 <maleksan@amd.com >
2025-01-21 23:30:46 +00:00
2acba47d9b
[bugfix] moe tuning. rm is_navi() ( #12273 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-01-21 22:47:32 +00:00
9c485d9e25
[Core] Free CPU pinned memory on environment cleanup ( #10477 )
2025-01-21 11:56:41 -08:00
fa9ee08121
[Misc] Set default backend to SDPA for get_vit_attn_backend ( #12235 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-21 11:52:11 -08:00
347eeebe3b
[Misc] Remove experimental dep from tracing.py ( #12007 )
...
Signed-off-by: Adrian Cole <adrian.cole@elastic.co >
2025-01-21 11:51:55 -08:00
18fd4a8331
[Bugfix] Multi-sequence broken ( #11898 )
...
Signed-off-by: Andy Lo <andy@mistral.ai >
2025-01-21 11:51:35 -08:00
132a132100
[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types ( #10907 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2025-01-21 11:51:13 -08:00
1e60f87bb3
[Kernel] fix moe_align_block_size error condition ( #12239 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
2025-01-21 10:30:28 -08:00
9705b90bcf
[Bugfix] fix race condition that leads to wrong order of token returned ( #10802 )
...
Signed-off-by: Jannis Schönleber <joennlae@gmail.com >
2025-01-21 09:47:04 -08:00
3aec49e56f
[ci/build] update nightly torch for gh200 test ( #12270 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-21 23:03:17 +08:00
c64612802b
[Platform] improve platforms getattr ( #12264 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2025-01-21 14:42:41 +00:00
9a7c3a0042
Remove pytorch comments for outlines + compressed-tensors ( #12260 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2025-01-21 21:49:08 +08:00
b197a5ccfd
[V1][Bugfix] Fix data item ordering in mixed-modality inference ( #12259 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-21 13:18:43 +00:00
c81081fece
[torch.compile] transparent compilation with more logging ( #12246 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-21 19:32:55 +08:00
a94eee4456
[Bugfix] Fix mm_limits access for merged multi-modal processor ( #12252 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-21 10:09:39 +00:00
f2e9f2a3be
[Misc] Remove redundant TypeVar from base model ( #12248 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-21 08:40:39 +00:00
1f1542afa9
[Misc]Add BNB quantization for PaliGemmaForConditionalGeneration ( #12237 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-21 07:49:08 +00:00
96912550c8
[Misc] Rename MultiModalInputsV2 -> MultiModalInputs ( #12244 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-21 07:31:19 +00:00
2fc6944c5e
[ci/build] disable failed and flaky tests ( #12240 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-21 13:25:03 +08:00
5fe6bf29d6
[BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 ( #12230 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-01-21 12:23:14 +08:00
d4b62d4641
[AMD][Build] Porting dockerfiles from the ROCm/vllm fork ( #11777 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-01-21 12:22:23 +08:00
ecf67814f1
Add quantization and guided decoding CODEOWNERS ( #12228 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-20 18:23:40 -07:00
750f4cabfa
[Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) ( #12222 )
...
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-20 16:42:16 -08:00
06a760d6e8
[bugfix] catch xgrammar unsupported array constraints ( #12210 )
...
Signed-off-by: Jason Cheng <jasoncky96@gmail.com >
2025-01-20 16:42:02 -08:00
da7512215f
[misc] add cuda runtime version to usage data ( #12190 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2025-01-21 00:31:01 +00:00
af69a6aded
fix: update platform detection for M-series arm based MacBook processors ( #12227 )
...
Signed-off-by: isikhi <huseyin.isik000@gmail.com >
2025-01-20 22:23:28 +00:00
7bd3630067
[Misc] Update CODEOWNERS ( #12229 )
2025-01-20 22:19:09 +00:00
96663699b2
[CI] Pass local python version explicitly to pre-commit mypy.sh ( #12224 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-20 23:49:18 +08:00
18572e3384
[Bugfix] Fix HfExampleModels.find_hf_info ( #12223 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 15:35:36 +00:00
86bfb6dba7
[Misc] Pass attention to impl backend ( #12218 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-20 23:25:28 +08:00
5f0ec3935a
[V1] Remove _get_cache_block_size ( #12214 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-20 21:54:16 +08:00
c222f47992
[core][bugfix] configure env var during import vllm ( #12209 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-20 19:35:59 +08:00
170eb35079
[misc] print a message to suggest how to bypass commit hooks ( #12217 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-20 18:06:24 +08:00
b37d82791e
[Model] Upgrade Aria to transformers 4.48 ( #12203 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 17:58:48 +08:00
3127e975fb
[CI/Build] Make pre-commit faster ( #12212 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 17:36:24 +08:00
4001ea1266
[CI/Build] Remove dummy CI steps ( #12208 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 16:41:57 +08:00
5c89a29c22
[misc] add placeholder format.sh ( #12206 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-20 16:04:49 +08:00
59a0192fb9
[Core] Interface for accessing model from VllmRunner ( #10353 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-20 15:00:59 +08:00
83609791d2
[Model] Add Qwen2 PRM model support ( #12202 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-20 14:59:46 +08:00
0974c9bc5c
[Bugfix] Fix incorrect types in LayerwiseProfileResults ( #12196 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-20 14:59:20 +08:00
d2643128f7
[DOC] Add missing docstring in LLMEngine.add_request() ( #12195 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-20 14:59:00 +08:00
c5c06209ec
[DOC] Fix typo in docstring and assert message ( #12194 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-20 14:58:29 +08:00
3ea7b94523
Move linting to pre-commit ( #11975 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-20 14:58:01 +08:00
51ef828f10
[torch.compile] fix sym_tensor_indices ( #12191 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-20 11:37:50 +08:00
df450aa567
[Bugfix] Fix num_heads value for simple connector when tp enabled ( #12074 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2025-01-20 02:56:43 +00:00
bbe5f9de7d
[Model] Support for fairseq2 Llama ( #11442 )
...
Signed-off-by: Martin Gleize <mgleize@meta.com >
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas >
2025-01-19 10:40:40 -08:00
81763c58a0
[V1] Add V1 support of Qwen2-VL ( #12128 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: imkero <kerorek@outlook.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-19 19:52:13 +08:00
edaae198e7
[Misc] Add BNB support to GLM4-V model ( #12184 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-19 19:49:22 +08:00
936db119ed
benchmark_serving support --served-model-name param ( #12109 )
...
Signed-off-by: zibai <zibai.gj@alibaba-inc.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2025-01-19 09:59:56 +00:00
e66faf4809
[torch.compile] store inductor compiled Python file ( #12182 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-19 16:27:26 +08:00
630eb5b5ce
[Bugfix] Fix multi-modal processors for transformers 4.48 ( #12187 )
2025-01-18 19:16:34 -08:00
4e94951bb1
[BUGFIX] Move scores to float32 in case of running xgrammar on cpu ( #12152 )
...
Signed-off-by: Michal Adamczyk <madamczyk@habana.ai >
2025-01-19 11:12:05 +08:00
7a8a48d51e
[V1] Collect env var for usage stats ( #12115 )
2025-01-19 03:07:15 +00:00
32eb0da808
[Misc] Support register quantization method out-of-tree ( #11969 )
2025-01-18 16:13:16 -08:00
6d0e3d3724
[core] clean up executor class hierarchy between v1 and v0 ( #12171 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-18 14:35:15 +08:00
02798ecabe
[Model] Port deepseek-vl2 processor, remove dependency ( #12169 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-18 13:59:39 +08:00
813f249f02
[Docs] Fix broken link in SECURITY.md ( #12175 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-18 04:35:21 +00:00
da02cb4b27
[core] further polish memory profiling ( #12126 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-18 12:25:08 +08:00
c09503ddd6
[AMD][CI/Build][Bugfix] use pytorch stale wheel ( #12172 )
...
Signed-off-by: hongxyan <hongxyan@amd.com >
2025-01-18 11:15:53 +08:00
2b83503227
[misc] fix cross-node TP ( #12166 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-18 10:53:27 +08:00
7b98a65ae6
[torch.compile] disable logging when cache is disabled ( #12043 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-17 20:29:31 +00:00
b5b57e301e
[AMD][FP8] Using MI300 FP8 format on ROCm for block_quant ( #12134 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2025-01-17 17:12:26 +00:00
54cacf008f
[Bugfix] Mistral tokenizer encode accept list of str ( #12149 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-01-17 16:47:53 +00:00
58fd57ff1d
[Bugfix] Fix score api for missing max_model_len validation ( #12119 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2025-01-17 16:24:22 +00:00
87a0c076af
[core] allow callable in collective_rpc ( #12151 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-17 20:47:01 +08:00
d4e6194570
[CI/Build][CPU][Bugfix] Fix CPU CI ( #12150 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-01-17 19:39:52 +08:00
07934cc237
[Misc][LoRA] Improve the readability of LoRA error messages ( #12102 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-17 19:32:28 +08:00
69d765f5a5
[V1] Move more control of kv cache initialization from model_executor to EngineCore ( #11960 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2025-01-17 07:39:35 +00:00
8027a72461
[ROCm][MoE] moe tuning support for rocm ( #12049 )
...
Signed-off-by: Divakar Verma <divakar.verma@amd.com >
2025-01-17 14:49:16 +08:00
d75ab55f10
[Misc] Add deepseek_vl2 chat template ( #12143 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-17 06:34:48 +00:00
d1adb9b403
[BugFix] add more is not None check in VllmConfig.__post_init__ ( #12138 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-17 05:33:22 +00:00
b8bfa46a18
[Bugfix] Fix issues in CPU build Dockerfile ( #12135 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-17 12:54:01 +08:00
1475847a14
[Doc] Add instructions on using Podman when SELinux is active ( #12136 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2025-01-17 04:45:36 +00:00
fead53ba78
[CI]add genai-perf benchmark in nightly benchmark ( #10704 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-01-17 04:15:09 +00:00
ebc73f2828
[Bugfix] Fix a path bug in disaggregated prefill example script. ( #12121 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-01-17 11:12:41 +08:00
d06e824006
[Bugfix] Set enforce_eager automatically for mllama ( #12127 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-16 15:30:08 -05:00
62b06ba23d
[Model] Add support for deepseek-vl2-tiny model ( #12068 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-16 17:14:48 +00:00
5fd24ec02e
[misc] Add LoRA kernel micro benchmarks ( #11579 )
2025-01-16 15:51:40 +00:00
874f7c292a
[Bugfix] Fix max image feature size for Llava-one-vision ( #12104 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-16 14:54:06 +00:00
92e793d91a
[core] LLM.collective_rpc interface and RLHF example ( #12084 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-16 20:19:52 +08:00
bf53e0c70b
Support torchrun and SPMD-style offline inference ( #12071 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-16 19:58:53 +08:00
dd7c9ad870
[Bugfix] Remove hardcoded head_size=256 for Deepseek v2 and v3 ( #12067 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-16 10:11:54 +00:00
9aa1519f08
Various cosmetic/comment fixes ( #12089 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-16 09:59:06 +00:00
f8ef146f03
[Doc] Add documentation for specifying model architecture ( #12105 )
2025-01-16 15:53:43 +08:00
fa0050db08
[Core] Default to using per_token quantization for fp8 when cutlass is supported. ( #8651 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Michael Goin <mgoin@redhat.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-16 04:31:27 +00:00
cd9d06fb8d
Allow hip sources to be directly included when compiling for rocm. ( #12087 )
2025-01-15 16:46:03 -05:00
ebd8c669ef
[Bugfix] Fix _get_lora_device for HQQ marlin ( #12090 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2025-01-15 19:59:42 +00:00
70755e819e
[V1][Core] Autotune encoder cache budget ( #11895 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-15 11:29:00 -08:00
edce722eaa
[Bugfix] use right truncation for non-generative tasks ( #12050 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-01-16 00:31:01 +08:00
57e729e874
[Doc]: Update OpenAI-Compatible Server documents ( #12082 )
2025-01-15 16:07:45 +00:00
de0526f668
[Misc][Quark] Upstream Quark format to VLLM ( #10765 )
...
Signed-off-by: kewang-xlnx <kewang@xilinx.com >
Signed-off-by: kewang2 <kewang2@amd.com >
Co-authored-by: kewang2 <kewang2@amd.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-15 11:05:15 -05:00
5ecf3e0aaf
Misc: allow to use proxy in HTTPConnection ( #12042 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2025-01-15 13:16:40 +00:00
97eb97b5a4
[Model]: Support internlm3 ( #12037 )
2025-01-15 11:35:17 +00:00
3adf0ffda8
[Platform] Do not raise error if _Backend is not found ( #12023 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Co-authored-by: Mengqing Cao <cmq0113@163.com >
2025-01-15 10:14:15 +00:00
ad388d25a8
Type-fix: make execute_model output type optional ( #12020 )
2025-01-15 09:44:56 +00:00
cbe94391eb
Fix: cases with empty sparsity config ( #12057 )
...
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com >
2025-01-15 17:41:24 +08:00
994fc655b7
[V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager ( #12003 )
2025-01-15 07:55:30 +00:00
3f9b7ab9f5
[Doc] Update examples to remove SparseAutoModelForCausalLM ( #12062 )
...
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com >
2025-01-15 06:36:01 +00:00
ad34c0df0f
[core] platform agnostic executor via collective_rpc ( #11256 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-15 13:45:21 +08:00
f218f9c24d
[core] Turn off GPU communication overlap for Ray executor ( #12051 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-01-15 05:19:55 +00:00
0794e7446e
[Misc] Add multipstep chunked-prefill support for FlashInfer ( #10467 )
2025-01-15 12:47:49 +08:00
b7ee940a82
[V1][BugFix] Fix edge case in VLM scheduling ( #12065 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-14 20:21:28 -08:00
9ddac56311
[Platform] move current_memory_usage() into platform ( #11369 )
...
Signed-off-by: Shanshan Shen <467638484@qq.com >
2025-01-15 03:38:25 +00:00
1a51b9f872
[HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py ( #12046 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2025-01-15 02:59:18 +00:00
42f5e7c52a
[Kernel] Support MulAndSilu ( #11624 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-15 02:29:53 +00:00
a3a3ee4e6f
[Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping ( #11924 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-15 07:49:49 +08:00
87054a57ab
[Doc]: Update the Json Example of the Engine Arguments document ( #12045 )
2025-01-14 17:03:04 +00:00
c9d6ff530b
Explain where the engine args go when using Docker ( #12041 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-14 16:05:50 +00:00
a2d2acb4c8
[Bugfix][Kernel] Give unique name to BlockSparseFlashAttention ( #12040 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-14 15:45:05 +00:00
2e0e017610
[Platform] Add output for Attention Backend ( #11981 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-14 13:27:04 +00:00
1f18adb245
[Kernel] Revert the API change of Attention.forward ( #12038 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-14 20:59:32 +08:00
bb354e6b2d
[Bugfix] Fix various bugs in multi-modal processor ( #12031 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-14 12:16:11 +00:00
ff39141a49
[HPU][misc] add comments for explanation ( #12034 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-14 19:24:06 +08:00
8a1f938e6f
[Doc] Update Quantization Hardware Support Documentation ( #12025 )
...
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com >
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com >
2025-01-14 04:37:52 +00:00
078da31903
[HPU][Bugfix] set_forward_context and CI test execution ( #12014 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2025-01-14 11:04:18 +08:00
1a401252b5
[Docs] Add Sky Computing Lab to project intro ( #12019 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-13 17:24:36 -08:00
f35ec461fc
[Bugfix] Fix deepseekv3 gate bias error ( #12002 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2025-01-13 13:43:51 -07:00
289b5191d5
[Doc] Fix build from source and installation link in README.md ( #12013 )
...
Signed-off-by: Yikun <yikunkero@gmail.com >
2025-01-13 17:23:59 +00:00
c6db21313c
bugfix: Fix signature mismatch in benchmark's get_tokenizer function ( #11982 )
...
Signed-off-by: elijah <f1renze.142857@gmail.com >
2025-01-13 15:22:07 +00:00
a7d59688fb
[Platform] Move get_punica_wrapper() function to Platform ( #11516 )
...
Signed-off-by: Shanshan Shen <467638484@qq.com >
2025-01-13 13:12:10 +00:00
458e63a2c6
[platform] add device_control env var ( #12009 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-13 20:59:09 +08:00
e8c23ff989
[Doc] Organise installation documentation into categories and tabs ( #11935 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-13 12:27:36 +00:00
cd8249903f
[Doc][V1] Update model implementation guide for V1 support ( #11998 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2025-01-13 11:58:54 +00:00
0f8cafe2d1
[Kernel] unified_attention for Attention.forward ( #11967 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-13 19:28:53 +08:00
5340a30d01
Fix Max Token ID for Qwen-VL-Chat ( #11980 )
...
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com >
2025-01-13 08:37:48 +00:00
89ce62a316
[platform] add ray_device_key ( #11948 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-13 16:20:52 +08:00
c3f05b09a0
[Misc]Minor Changes about Worker ( #11555 )
...
Signed-off-by: Chenguang Li <757486878@qq.com >
2025-01-13 15:47:05 +08:00
cf6bbcb493
[Misc] Fix Deepseek V2 fp8 kv-scale remapping ( #11947 )
...
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu >
2025-01-12 23:05:06 -08:00
80ea3af1a0
[CI][Spec Decode] fix: broken test for EAGLE model ( #11972 )
...
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
2025-01-13 06:50:35 +00:00
9dd02d85ca
[Bug] Fix usage of .transpose() and .view() consecutively. ( #11979 )
2025-01-13 06:24:10 +00:00
f7b3ba82c3
[MISC] fix typo in kv transfer send recv test ( #11983 )
2025-01-13 05:07:48 +00:00
619ae268c3
[V1] [2/n] Logging and Metrics - OutputProcessor Abstraction ( #11973 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-01-13 04:54:10 +00:00
d14e98d924
[Model] Support GGUF models newly added in transformers 4.46.0 ( #9685 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-13 00:13:44 +00:00
9597a095f2
[V1][Core][1/n] Logging and Metrics ( #11962 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2025-01-12 21:02:02 +00:00
263a870ee1
[Hardware][TPU] workaround fix for MoE on TPU ( #11764 )
2025-01-12 10:53:51 -05:00
8bddb73512
[Hardware][CPU] Multi-LoRA implementation for the CPU backend ( #11100 )
...
Signed-off-by: Akshat Tripathi <akshat@krai.ai >
Signed-off-by: Oleg Mosalov <oleg@krai.ai >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Oleg Mosalov <oleg@krai.ai >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-12 13:01:52 +00:00
f967e51f38
[Model] Initialize support for Deepseek-VL2 models ( #11578 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-12 00:17:24 -08:00
43f3d9e699
[CI/Build] Add markdown linter ( #11857 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2025-01-12 00:17:13 -08:00
b25cfab9a0
[V1] Avoid sending text prompt to core engine ( #11963 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-12 06:36:38 +00:00
4b657d3292
[Model] Add cogagent model support vLLM ( #11742 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-11 19:05:56 +00:00
d697dc01b4
[Bugfix] Fix RobertaModel loading ( #11940 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2025-01-11 14:05:09 +00:00
a991f7d508
[Doc] Basic guide for writing unit tests for new models ( #11951 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-11 21:27:24 +08:00
7a3a83e3b8
[CI/Build] Move model-specific multi-modal processing tests ( #11934 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-11 13:50:05 +08:00
c32a7c7c0c
[Bugfix] fused_experts_impl wrong compute type for float32 ( #11921 )
...
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com >
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com >
2025-01-11 13:49:39 +08:00
2118d0565c
[Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design ( #11672 )
...
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com >
2025-01-10 20:49:38 -08:00
899136b857
[ci] fix broken distributed-tests-4-gpus ( #11937 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-11 09:07:24 +08:00
c9f09a4fe8
[mypy] Fix mypy warnings in api_server.py ( #11941 )
...
Signed-off-by: Fred Reiss <frreiss@us.ibm.com >
2025-01-11 01:04:58 +00:00
d45cbe70f5
[Bugfix] Check that number of images matches number of <|image|> tokens with mllama ( #11939 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2025-01-10 23:26:00 +00:00
8a579408f3
[Misc] Update benchmark_prefix_caching.py fixed example usage ( #11920 )
...
Signed-off-by: Ren MinMin <renmm6@chinaunicom.cn >
Co-authored-by: Ren MinMin <renmm6@chinaunicom.cn >
2025-01-10 20:39:22 +00:00
46fa98ccad
[Misc] Clean up debug code in Deepseek-V3 ( #11930 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2025-01-10 19:19:15 +00:00
aa1e77a19c
[Hardware][CPU] Support MOE models on x86 CPU ( #11831 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-01-10 11:07:58 -05:00
5959564f94
Doc fix in benchmark_long_document_qa_throughput.py ( #11933 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-01-10 23:51:43 +08:00
f33e033e27
[Docs] Fix docstring in get_ip function ( #11932 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2025-01-10 23:51:02 +08:00
482cdc494e
[Doc] Rename offline inference examples ( #11927 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-10 23:50:29 +08:00
20410b2fda
[platform] support custom torch.compile backend key ( #11318 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-01-10 23:46:51 +08:00
12664ddda5
[Doc] [1/N] Initial guide for merged multi-modal processor ( #11925 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-10 14:30:25 +00:00
241ad7b301
[ci] Fix sampler tests ( #11922 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-10 20:45:33 +08:00
d85c47d6ad
Replace "online inference" with "online serving" ( #11923 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-10 12:05:56 +00:00
ef725feafc
[platform] support pytorch custom op pluggable ( #11328 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2025-01-10 10:02:38 +00:00
d907be7dc7
[misc] remove python function call for custom activation op ( #11885 )
...
Co-authored-by: youkaichao <youkaichao@gmail.com >
2025-01-10 17:18:25 +08:00
d53575a5f0
[ci] fix gh200 tests ( #11919 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-10 16:25:17 +08:00
61af633256
[BUGFIX] Fix UnspecifiedPlatform package name ( #11916 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2025-01-10 16:20:46 +08:00
ac2f3f7fee
[Bugfix] Validate lora adapters to avoid crashing server ( #11727 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-10 15:56:36 +08:00
cf5f000d21
[torch.compile] Hide KV cache behind torch.compile boundary ( #11677 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-10 13:14:42 +08:00
3de2b1eafb
[Doc] Show default pooling method in a table ( #11904 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-10 11:25:20 +08:00
b844b99ad3
[VLM] Enable tokenized inputs for merged multi-modal processor ( #11900 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-10 03:24:00 +00:00
c3cf54dda4
[Doc][5/N] Move Community and API Reference to the bottom ( #11896 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2025-01-10 03:10:12 +00:00
36f5303578
[Docs] Add Modal to deployment frameworks ( #11907 )
2025-01-09 23:26:37 +00:00
9a228348d2
[Misc] Provide correct Pixtral-HF chat template ( #11891 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-09 10:19:37 -07:00
bd82872211
[ci]try to fix flaky multi-step tests ( #11894 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-09 14:47:29 +00:00
405eb8e396
[platform] Allow platform specify attention backend ( #11609 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Co-authored-by: Mengqing Cao <cmq0113@163.com >
2025-01-09 21:46:50 +08:00
65097ca0af
[Doc] Add model development API Reference ( #11884 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-09 09:43:40 +00:00
1d967acb45
[Bugfix] fix beam search input errors and latency benchmark script ( #11875 )
...
Signed-off-by: Ye Qi <yeq@meta.com >
Co-authored-by: yeq <yeq@devgpu004.lla3.facebook.com >
2025-01-09 17:36:39 +08:00
0bd1ff4346
[Bugfix] Override dunder methods of placeholder modules ( #11882 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-09 09:02:53 +00:00
310aca88c9
[perf]fix current stream ( #11870 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-09 07:18:21 +00:00
a732900efc
[Doc] Intended links Python multiprocessing library ( #11878 )
2025-01-09 05:39:39 +00:00
d848800e88
[Misc] Move print_*_once from utils to logger ( #11298 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com >
Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com >
2025-01-09 12:48:12 +08:00
730e9592e9
[Doc] Recommend uv and python 3.12 for quickstart guide ( #11849 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-09 11:37:48 +08:00
1fe554bac3
treat do_lower_case in the same way as the sentence-transformers library ( #11815 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2025-01-09 11:05:43 +08:00
615e4a5401
[CI] Turn on basic correctness tests for V1 ( #10864 )
2025-01-08 21:20:44 -05:00
3db0cafdf1
[Docs] Add Google Cloud Meetup ( #11864 )
2025-01-08 12:38:28 -08:00
526de822d5
[Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models ( #11698 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2025-01-08 20:23:15 +00:00
56fe4c297c
[TPU][Quantization] TPU W8A8 ( #11785 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-08 19:33:29 +00:00
47de8821d3
[Misc]add some explanations for BlockHashType ( #11847 )
2025-01-08 18:21:30 +00:00
5984499e47
[Doc] Expand Multimodal API Reference ( #11852 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 17:14:14 +00:00
ca47e176af
[Misc] Move some model utils into vision file ( #11848 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 17:04:46 +00:00
78f4590b60
[Bugfix][XPU] fix silu_and_mul ( #11823 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2025-01-09 00:11:50 +08:00
2f7024987e
[CI/Build][Bugfix] Fix CPU CI image clean up ( #11836 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2025-01-08 15:18:28 +00:00
6cd40a5bfe
[Doc][4/N] Reorganize API Reference ( #11843 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 21:34:44 +08:00
aba8d6ee00
[Doc] Move examples into categories ( #11840 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-08 13:09:53 +00:00
2a0596bc48
[VLM] Reorganize profiling/processing-related code ( #11812 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 18:59:58 +08:00
f12141170a
[torch.compile] consider relevant code in compilation cache ( #11614 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-08 10:46:43 +00:00
cfd3219f58
[Hardware][Apple] Native support for macOS Apple Silicon ( #11696 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2025-01-08 16:35:49 +08:00
a1b2b8606e
[Docs] Update sponsor name: 'Novita' to 'Novita AI' ( #11833 )
2025-01-07 23:05:46 -08:00
ad9f1aa679
[doc] update wheels url ( #11830 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-08 14:36:49 +08:00
889e662eae
[misc] improve memory profiling ( #11809 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-08 06:36:03 +00:00
ef68eb28d8
[Bug] Fix pickling of ModelConfig when RunAI Model Streamer is used ( #11825 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 13:40:09 +08:00
259abd8953
[Docs] reorganize sponsorship page ( #11639 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2025-01-07 21:16:08 -08:00
f645eb6954
[Bugfix] Add checks for LoRA and CPU offload ( #11810 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-08 13:08:48 +08:00
f4923cb8bc
[OpenVINO] Fixed Docker.openvino build ( #11732 )
...
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com >
2025-01-08 13:08:30 +08:00
b640b19cc0
Fixed docker build for ppc64le ( #11518 )
...
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com >
2025-01-08 13:05:37 +08:00
dc71af0a71
Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… ( #11824 )
2025-01-08 04:09:25 +00:00
4d29e91be8
[Misc] sort torch profiler table by kernel timing ( #11813 )
2025-01-08 10:57:04 +08:00
91445c7bc8
[Bugfix] Fix image input for Pixtral-HF ( #11741 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-08 10:17:16 +08:00
5950f555a1
[Doc] Group examples into categories ( #11782 )
...
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com >
2025-01-08 09:20:12 +08:00
a4e2b26856
[Bugfix] Significant performance drop on CPUs with --num-scheduler-steps > 1 ( #11794 )
2025-01-07 16:15:50 -08:00
973f5dc581
[Doc]Add documentation for using EAGLE in vLLM ( #11417 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
2025-01-07 19:19:12 +00:00
c994223d56
[Bugfix] update the prefix for qwen2 ( #11795 )
...
Co-authored-by: jiadi.jjd <jiadi.jjd@antgroup.com >
2025-01-07 18:36:34 +00:00
869579a702
[optimization] remove python function call for custom op ( #11750 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-07 17:04:28 +00:00
c0efe92d8b
[Doc] Add note to gte-Qwen2 models ( #11808 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 21:50:58 +08:00
d9fa1c05ad
[doc] update how pip can install nightly wheels ( #11806 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-07 21:42:58 +08:00
2de197bdd4
[V1] Support audio language models on V1 ( #11733 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-07 19:47:36 +08:00
869e829b85
[doc] add doc to explain how to use uv ( #11773 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2025-01-07 18:41:17 +08:00
8f37be38eb
[Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation ( #11800 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 18:25:02 +08:00
8082ad7950
[V1][Doc] Update V1 support for LLaVa-NeXT-Video ( #11798 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-07 09:55:39 +00:00
1e4ce295ae
[CI][CPU] adding build number to docker image name ( #11788 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2025-01-07 07:28:01 +00:00
ce1917fcf2
[Doc] Create a vulnerability management team ( #9925 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2025-01-06 22:57:32 -08:00
e512f76a89
fix init error for MessageQueue when n_local_reader is zero ( #11768 )
2025-01-07 06:12:48 +00:00
898cdf033e
[CI] Fix neuron CI and run offline tests ( #11779 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2025-01-06 21:36:10 -08:00
0f3f3c86ec
[Bugfix] Update attention interface in Whisper ( #11784 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-07 04:36:24 +00:00
b278557935
[Kernel][LoRA]Punica prefill kernels fusion ( #11234 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Abatom <abzhonghua@gmail.com >
Co-authored-by: Zhonghua Deng <abatom@163.com >
2025-01-07 04:01:39 +00:00
8ceffbf315
[Doc][3/N] Reorganize Serving section ( #11766 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 11:20:01 +08:00
d93d2d74fd
[XPU] Make pp group initialized for pipeline-parallelism ( #11648 )
...
Signed-off-by: yisheng <yi.sheng@intel.com >
2025-01-07 11:09:58 +08:00
d0169e1b0f
[Model] Future-proof Qwen2-Audio multi-modal processor ( #11776 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 11:05:17 +08:00
08fb75c72e
[Bugfix] Fix LLaVA-NeXT feature size precision error (for real) ( #11772 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-07 01:10:54 +00:00
91b361ae89
[V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision ( #11685 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-06 19:58:16 +00:00
e20c92bb61
[Kernel] Move attn_type to Attention.__init__() ( #11690 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2025-01-07 00:11:28 +08:00
32c9eff2ff
[Bugfix][V1] Fix molmo text-only inputs ( #11676 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-06 15:22:25 +00:00
4ca5d40adc
[doc] explain how to add interleaving sliding window support ( #11771 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2025-01-06 21:57:44 +08:00
9279b9f83d
[Bugfix] Fix max image size for LLaVA-Onevision ( #11769 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2025-01-06 13:48:53 +00:00
ee77fdb5de
[Doc][2/N] Reorganize Models and Usage sections ( #11755 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-06 21:40:31 +08:00
996357e480
[VLM] Separate out profiling-related logic ( #11746 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-06 16:02:21 +08:00
2a622d704a
k8s-config: Update the secret to use stringData ( #11679 )
...
Signed-off-by: Suraj Deshmukh <surajd.service@gmail.com >
2025-01-06 08:01:22 +00:00
9c749713f6
[mypy] Forward pass function type hints in lora ( #11740 )
...
Signed-off-by: lucast2021 <lucast2021@headroyce.org >
Co-authored-by: lucast2021 <lucast2021@headroyce.org >
2025-01-06 07:59:36 +00:00
022c5c6944
[V1] Refactor get_executor_cls ( #11754 )
2025-01-06 07:59:16 +00:00
f8fcca100b
[Misc] Fix typo for valid_tool_parses ( #11753 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2025-01-06 07:12:38 +00:00
06bfb51963
[V1] Add BlockTable class ( #11693 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-06 14:24:42 +09:00
408e560015
[Bugfix] Remove block size constraint ( #11723 )
2025-01-06 12:49:55 +08:00
402d378360
[Doc] [1/N] Reorganize Getting Started section ( #11645 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-06 02:18:33 +00:00
9e764e7b10
[distributed] remove pynccl's redundant change_state ( #11749 )
2025-01-06 09:05:48 +08:00
33fc1e2e86
[Frontend] Improve StreamingResponse Exception Handling ( #11752 )
2025-01-05 16:35:01 -05:00
eba17173d3
fix: [doc] fix typo ( #11751 )
...
Co-authored-by: Lancer <maruixiang6688@gmail.com >
2025-01-06 00:48:16 +08:00
635b897246
[distributed] remove pynccl's redundant stream ( #11744 )
2025-01-05 23:09:11 +08:00
4068f4b5b5
[MISC] Replace c10::optional with std::optional ( #11730 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-01-05 10:20:34 +09:00
47831430cc
[Bugfix][V1] Fix test_kv_cache_utils.py ( #11738 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-04 16:07:59 +00:00
65c08928c2
[Model] Remove unnecessary weight initialization logic ( #11736 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-04 23:46:21 +08:00
ba214dffbe
[Bugfix] Fix precision error in LLaVA-NeXT ( #11735 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-04 23:45:57 +08:00
eed11ebee9
[VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-OneVision ( #11717 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-04 11:40:53 +00:00
300acb8347
[Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture ( #11233 )
...
Signed-off-by: Yan Burman <yanburman@users.noreply.github.com >
Signed-off-by: Ido Asraff <idoa@atero.ai >
2025-01-04 14:50:16 +08:00
d91457d529
[V1] Add kv cache utils tests. ( #11513 )
...
Signed-off-by: xcnick <xcnick0412@gmail.com >
2025-01-04 14:49:46 +08:00
fbf2564554
[V1] Add RayExecutor support for AsyncLLM (api server) ( #11712 )
2025-01-04 06:41:31 +00:00
d1d49397e7
Update bnb.md with example for OpenAI ( #11718 )
2025-01-04 06:29:02 +00:00
9c93636d84
Update tool_calling.md ( #11701 )
2025-01-04 06:16:30 +00:00
e5d7ed0c53
[V1] log GPU blocks num for MultiprocExecutor ( #11656 )
2025-01-04 00:13:12 +00:00
ad0d567e1c
[V1] Chore: cruft removal ( #11724 )
2025-01-03 23:25:02 +00:00
bf0d97d786
Update requirements-tpu.txt to support python 3.9 and 3.11 ( #11695 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2025-01-03 22:36:46 +00:00
a655eb3025
[Misc]Add BNB quantization for Qwen2VL ( #11719 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-03 15:19:02 -07:00
1543914c04
[V1] Improve TP>1 Error Handling + Stack Trace ( #11721 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2025-01-03 21:29:11 +00:00
61fed92c7e
[Bugfix] Fix ColumnParallelLinearWithLoRA slice ( #11708 )
...
Signed-off-by: ZincCat <zincchloride@outlook.com >
2025-01-03 21:02:34 +00:00
80c751e7f6
[V1] Simplify Shutdown ( #11659 )
2025-01-03 17:25:38 +00:00
e1a5c2f0a1
[Model] Whisper model implementation ( #11280 )
...
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com >
2025-01-03 16:39:19 +08:00
fd3a62a122
[perf-benchmark] Fix dependency for steps in benchmark pipeline ( #11710 )
2025-01-02 22:38:37 -08:00
07064cb1d4
[Bugfix] Check chain_speculative_sampling before calling it ( #11673 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-01-02 16:58:56 -08:00
2f1e8e8f54
Update default max_num_batch_tokens for chunked prefill ( #11694 )
2025-01-03 00:25:53 +00:00
68d37809b9
[Misc] Minimum requirements for SageMaker compatibility ( #11576 )
2025-01-02 15:59:25 -08:00
5dba257506
Resolve race conditions in Marlin kernel ( #11493 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2025-01-02 22:58:56 +00:00
187e32997c
[Bugfix] Change kv scaling factor by param json on nvidia gpu ( #11688 )
...
Signed-off-by: bjmsong <bjmsong@126.com >
Co-authored-by: bjmsong <bjmsong@126.com >
2025-01-02 21:11:39 +00:00
b55ed6ef8a
[V1][Minor] Optimize token_ids_cpu copy ( #11692 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-02 12:04:58 -07:00
2f385183f3
[Bugfix] Free cross attention block table for preempted-for-recompute sequence group. ( #10013 )
...
Signed-off-by: Kathy Yu <feiyangyu@google.com >
2025-01-02 10:28:09 -08:00
84c35c374a
According to vllm.EngineArgs, the name should be distributed_executor_backend ( #11689 )
2025-01-02 18:14:16 +00:00
8c38ee7007
[VLM] Merged multi-modal processor for LLaVA-NeXT ( #11682 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-02 16:39:27 +00:00
b6087a6bee
[mypy] Pass type checking in vllm/inputs ( #11680 )
...
Signed-off-by: Tobias Pitters <tobias.pitters@gmail.com >
2025-01-02 16:18:15 +00:00
23c1b10a4c
[VLM][Bugfix] Multi-modal processor compatible with V1 multi-input ( #11674 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2025-01-02 17:00:00 +08:00
a115ac46b5
[VLM] Move supported limits and max tokens to merged multi-modal processor ( #11669 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2025-01-01 15:44:42 +00:00
73001445fb
[V1] Implement Cascade Attention ( #11635 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-01 21:56:46 +09:00
6d70198b17
[Doc] Fix typo ( #11666 )
...
Signed-off-by: Kazuhiro Serizawa <nserihiro@gmail.com >
2025-01-01 08:10:10 +00:00
f962f426bc
[Misc] Replace space with - in the file names ( #11667 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-01-01 07:39:30 +00:00
11d8a091c6
[Misc] Optimize Qwen2-VL LoRA test ( #11663 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-01 14:42:23 +08:00
365801fedd
[VLM] Add max-count checking in data parser for single image models ( #11661 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-31 22:15:21 -08:00
4db72e57f6
[Bugfix][Refactor] Unify model management in frontend ( #11660 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-01-01 02:21:51 +00:00
0c6f998554
[Benchmark] Add benchmark script for CPU offloading ( #11533 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
Co-authored-by: KuntaiDu <kuntai@uchicago.edu >
2025-01-01 00:10:55 +00:00
e7c7c5e822
[V1][VLM] V1 support for selected single-image models. ( #11632 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-12-31 21:17:22 +00:00
8c3230d8c1
[V1] Simplify vision block hash for prefix caching by removing offset from hash ( #11646 )
2024-12-31 08:56:01 +00:00
2c5718809b
[Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. ( #11565 )
2024-12-31 06:29:04 +00:00
82c49d3260
[Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) ( #6909 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-30 22:15:58 -08:00
74fa1d123c
[Bugfix] Fix OpenAI parallel sampling when using xgrammar ( #11637 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-31 03:43:54 +00:00
a2a40bcd0d
[Model][LoRA]LoRA support added for MolmoForCausalLM ( #11439 )
...
Signed-off-by: Matthias Vogler <matthias.vogler@joesecurity.org >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Matthias Vogler <matthias.vogler@joesecurity.org >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-30 17:33:06 -08:00
ccb1aabcca
[benchmark] Remove dependency for H100 benchmark step ( #11572 )
2024-12-30 12:27:07 -08:00
36e7670045
[Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel ( #11631 )
2024-12-30 18:51:04 +00:00
5886aa496e
[V1] [6/N] API Server: Better Shutdown ( #11586 )
2024-12-30 15:51:02 +00:00
8d9b6721e7
[VLM] Abstract out multi-modal data parsing in merged processor ( #11620 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-30 15:01:35 +00:00
b12e87f942
[platforms] enable platform plugins ( #11602 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-30 20:24:45 +08:00
5dbf854553
[CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels ( #11618 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-12-30 10:17:04 +00:00
970d6d0776
[Build][Kernel] Update CUTLASS to v3.6.0 ( #11607 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-30 17:22:13 +08:00
628ec6c17b
[Docker] bump up neuron sdk v2.21 ( #11593 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2024-12-30 13:46:14 +08:00
3682e33f9f
[v1] fix compilation cache ( #11598 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-30 04:24:12 +00:00
0aa38d16f5
Remove print statement in DeepseekScalingRotaryEmbedding ( #11604 )
2024-12-29 20:16:46 +00:00
faef77c0d6
[Misc] KV cache transfer connector registry ( #11481 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2024-12-29 16:08:09 +00:00
dba4d9dec6
[v1][bugfix] fix cudagraph with inplace buffer assignment ( #11596 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-29 09:03:49 +00:00
32b4c63f02
[Doc] Convert list tables to MyST ( #11594 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-29 15:56:22 +08:00
4fb8e329fd
[V1] [5/N] API Server: unify Detokenizer and EngineCore input ( #11545 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-12-28 20:51:57 +00:00
328841d002
[bugfix] interleaving sliding window for cohere2 model ( #11583 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-28 16:55:42 +00:00
d427e5cfda
[Doc] Minor documentation fixes ( #11580 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-28 21:53:59 +08:00
42bb201fd6
[V1][Minor] Set pin_memory=False for token_ids_cpu tensor ( #11581 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-28 13:33:12 +00:00
59d6bb4c86
[Hardware][AMD]: Replace HIPCC version with more precise ROCm version ( #11515 )
...
Signed-off-by: hjwei <hjwei_xd@163.com >
2024-12-28 11:17:35 +00:00
b7dcc003dc
[Model] Remove hardcoded image tokens ids from Pixtral ( #11582 )
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-28 10:54:23 +00:00
d34be24bb1
[Model] Support InternLM2 Reward models ( #11571 )
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-28 06:14:10 +00:00
b5cbe8eeb3
[Bugfix] Last token measurement fix ( #11376 )
Signed-off-by: rajveerb <46040700+rajveerb@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-12-28 11:34:46 +08:00
df04dffade
[V1] [4/N] API Server: ZMQ/MP Utilities ( #11541 )
2024-12-28 01:45:08 +00:00
a60731247f
[Doc] Update mllama example based on official doc ( #11567 )
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2024-12-28 00:31:10 +00:00
ac79799403
[Bugfix] Fix for ROCM compressed tensor support ( #11561 )
2024-12-27 20:12:11 +00:00
dde1fa18c9
[Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix ( #11566 )
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-27 19:45:13 +00:00
0240402c46
[Misc]Add BNB quantization for MolmoForCausalLM ( #11551 )
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-27 18:48:24 +00:00
55509c2114
[MODEL] LoRA support for Jamba model ( #11209 )
Signed-off-by: Erez Schwartz <erezs@ai21.com>
2024-12-27 17:58:21 +00:00
101418096f
[VLM] Support caching in merged multi-modal processor ( #11396 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-27 17:22:48 +00:00
5ce4627a7e
[Doc] Add xgrammar in doc ( #11549 )
Signed-off-by: ccjincong <chenjincong11@gmail.com>
2024-12-27 13:05:10 +00:00
7af553ea30
[Misc] Abstract the logic for reading and writing media content ( #11527 )
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-27 19:21:23 +08:00
2c9b8ea2b0
[Bugfix] Fix TeleChat2ForCausalLM weights mapper ( #11546 )
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-27 10:39:15 +00:00
d003f3ea39
[Doc] Update deploying_with_k8s.md with AMD ROCm GPU example ( #11465 )
Signed-off-by: Alex He <alehe@amd.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-27 10:00:04 +00:00
6c6f7fe8a8
[Platform] Move model arch check to platform ( #11503 )
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2024-12-27 08:45:25 +00:00