Compare commits

...

2135 Commits

Author SHA1 Message Date
d3eddd6ef1 initial
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-04-01 16:06:59 -07:00
e75a6301bd [V1][Spec Decode] Implement Eagle Proposer [1/N] (#15729)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-04-01 12:33:16 -07:00
a79cc68b3a [V1][Metrics] Initial speculative decoding metrics (#15151)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-04-01 10:45:04 -07:00
7e3f7a4ee7 [CI] Disable flaky structure decoding test temporarily. (#15892)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-04-01 17:42:34 +00:00
9ec8257914 [Model] Add module name prefixes to gemma3 (#15889)
Signed-off-by: Bartholomew Sabat <bartek@recursal.ai>
Co-authored-by: Bartholomew Sabat <bartek@recursal.ai>
2025-04-01 10:13:40 -07:00
38327cf454 [Model] Aya Vision (#15441)
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-04-01 16:30:43 +00:00
dfa82e2a3d [CI/Build] Clean up LoRA tests (#15867)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-04-01 16:28:50 +00:00
e59ca942f5 Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations. (#13932)
Signed-off-by: Bill Nell <bnell@redhat.com>
2025-04-01 12:07:43 -04:00
a57a3044aa [ROCm][Build][Bugfix] Bring the base dockerfile in sync with the ROCm fork (#15820)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-04-01 08:56:39 -07:00
4e5a0f6ae2 [Misc] Allow using OpenCV as video IO fallback (#15055)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-04-01 15:55:13 +00:00
b63bd14999 Reinstate format.sh and make pre-commit installation simpler (#15890)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-04-01 15:41:30 +00:00
2041c0e360 [Doc] Quark quantization documentation (#15861)
Signed-off-by: chaow <chaow@amd.com>
2025-04-01 08:32:45 -07:00
085cbc4f9f [New Model]: jinaai/jina-reranker-v2-base-multilingual (#15876)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-04-01 08:32:26 -07:00
2b93162fb0 Remove format.sh as it's been unsupported >70 days (#15884)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-04-01 22:27:46 +08:00
2e45bd29fe [Misc] remove unused script (#15746)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-04-01 13:58:05 +00:00
51d7c6a2b2 [Model] Support Mistral3 in the HF Transformers format (#15505)
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-04-01 06:10:05 -07:00
f3aca1ee30 setup correct nvcc version with CUDA_HOME (#15725)
Signed-off-by: Yang Chen <yangche@fb.com>
2025-04-01 06:09:40 -07:00
8dd41d6bcc [Misc] Use envs.VLLM_USE_RAY_COMPILED_DAG_CHANNEL_TYPE (#15831)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-04-01 06:07:53 -07:00
0a298ea418 [Bugfix] Fix no video/image profiling edge case for MultiModalDataParser (#15828)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-04-01 18:17:11 +08:00
d330558bab [Docs] Fix small error in link text (#15868)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-04-01 10:05:14 +00:00
656fd72976 [Misc] Fix speculative config repr string (#15860)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-04-01 02:26:22 -07:00
79455cf421 [Misc] Enable V1 LoRA by default (#15320)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-04-01 16:53:56 +08:00
30d6a015e0 [Feature] specify model in config.yaml (#15798)
Signed-off-by: weizeng <weizeng@roblox.com>
2025-04-01 01:20:06 -07:00
8af5a5c4e5 fix: can not use uv run collect_env close #13888 (#15792)
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-04-01 07:45:49 +00:00
3a5f0afcd2 [V1] Implement sliding window attention in kv_cache_manager (#14097)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-04-01 00:33:17 -07:00
c7e63aa4d8 [ROCm] Use device name in the warning (#15838)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-04-01 00:10:48 -07:00
4a9ce1784c [sleep mode] clear pytorch cache after sleep (#15248)
Signed-off-by: <villard@us.ibm.com>
2025-03-31 22:58:58 -07:00
7e4e709b43 [V1] TPU - Fix fused MOE (#15834)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-03-31 22:58:07 -07:00
63d8eabed0 [Bugfix]: Fix is_embedding_layer condition in VocabParallelEmbedding (#15824)
Signed-off-by: alexwl <alexey.a.kiryushin@gmail.com>
2025-03-31 22:57:59 -07:00
e830b01383 [Bugfix] Fix extra comma (#15851)
Signed-off-by: haochengxia <xhc_1007@163.com>
2025-03-31 22:57:28 -07:00
ff6473980d [Bugfix][Model] fix mllama multi-image (#14883)
Signed-off-by: yan ma <yan.ma@intel.com>
2025-03-31 22:53:37 -07:00
a164aea35d [Frontend] Add Phi-4-mini function calling support (#14886)
Signed-off-by: Kinfey <kinfeylo@microsoft.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-03-31 22:50:05 -07:00
a76f547e11 Rename fallback model and refactor supported models section (#15829)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-31 22:49:41 -07:00
b7b7676d67 [Distributed] Add custom allreduce support for ROCM (#14125)
Signed-off-by: ilmarkov <imarkov@redhat.com>
Co-authored-by: ilmarkov <imarkov@redhat.com>
2025-03-31 22:49:12 -07:00
e6e3c55ef2 Move dockerfiles into their own directory (#14549)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-31 13:47:32 -07:00
f98a4920f9 [V1][Core] Remove unused speculative config from scheduler (#15818)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-31 19:15:21 +00:00
d4bfc23ef0 Fix Transformers backend compatibility check (#15290)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-31 10:27:07 -07:00
9a2160fa55 [V1] TPU CI - Add basic perf regression test (#15414)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-03-31 13:25:20 -04:00
2de4118243 fix: change GB to GiB in logging close #14979 (#15807)
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-03-31 10:00:50 -07:00
239b7befdd [V1][Spec Decode] Remove deprecated spec decode config params (#15466)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-03-31 09:19:35 -07:00
09e974d483 [Bugfix] Check dimensions of multimodal embeddings in V1 (#15816)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-31 09:01:35 -07:00
e5ef4fa99a Upgrade transformers to v4.50.3 (#13905)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-31 08:59:37 -07:00
037bcd942c [Bugfix] Fix missing return value in load_weights method of adapters.py (#15542)
Signed-off-by: noc-turne <2270929247@qq.com>
2025-03-31 06:56:42 -07:00
c2e7507ad4 [Bugfix] Fix Crashing When Loading Modules With Batchnorm Stats (#15813)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2025-03-31 13:23:53 +00:00
3aa2b6a637 [Model] Update support for NemotronNAS models (#15008)
Signed-off-by: Nave Assaf <nassaf@nvidia.com>
2025-03-31 20:35:14 +08:00
555aa21905 [V1] Fully Transparent Implementation of CPU Offloading (#15354)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-31 20:22:34 +08:00
e7ae3bf3d6 fix: better install requirement for install in setup.py (#15796)
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-03-31 05:13:32 -07:00
b932c048ac Recommend developing with Python 3.12 in developer guide (#15811)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-03-31 11:54:49 +00:00
e85829450d [Feature][ROCm]Enable fusion pass for torch.compile on ROCm (#15050)
Signed-off-by: charlifu <charlifu@amd.com>
2025-03-31 04:42:18 -07:00
effc5d24fa [Benchmark] Update Vision Arena Dataset and HuggingFaceDataset Setup (#15748)
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
2025-03-31 15:38:58 +08:00
18ed3132d2 [Misc] update the comments (#15780)
Signed-off-by: chengyang liu <lcy4869@gmail.com>
Co-authored-by: chengyang liu <lcy4869@gmail.com>
2025-03-30 19:39:56 -07:00
9b459eca88 [V1][Scheduler] Avoid calling _try_schedule_encoder_inputs for every request (#15778)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-30 14:10:42 -07:00
70fedd0f79 fix: Comments to English for better dev experience (#15768)
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-03-30 10:47:57 -07:00
bb103b29bf [Bugfix] Added embed_is_patch mask for fuyu model (#15731)
Signed-off-by: Kyle Huang <kylhuang@nvidia.com>
2025-03-30 03:45:08 -07:00
248e76c4df fix: lint fix a ruff checkout syntax error (#15767)
Signed-off-by: yihong0618 <zouzou0208@gmail.com>
2025-03-30 03:36:02 -07:00
803d5c35f3 [V1] Override mm_counts for dummy data creation (#15703)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-30 03:20:42 -07:00
7fd8c0f85c fix test_phi3v (#15321)
Signed-off-by: pansicheng <sicheng.pan.chn@gmail.com>
2025-03-30 02:01:34 -07:00
44c3a5abc3 [doc] update conda to usage link in installation (#15761)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-03-30 08:12:13 +00:00
6909a76201 [Bugfix] Fix Mistral guided generation using xgrammar (#15704)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
2025-03-29 20:20:19 -07:00
045533716b [CI] xgrammar structured output supports Enum. (#15757)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-03-29 20:20:02 -07:00
3c0ff914ac [Bugfix] Fix Mllama interleaved images input support (#15564)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
2025-03-29 18:11:15 +00:00
2bc4be4e32 [V1][Minor] Simplify rejection sampler's parse_output (#15741)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-29 09:25:17 -07:00
c67abd614f [V1] Support interleaved modality items (#15605)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-29 06:30:09 -07:00
6fa7cd3dbc [Feature][Disaggregated] Support XpYd disaggregated prefill with MooncakeStore (#12957)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-03-29 04:01:46 -07:00
94744ba41a [V1] [Feature] Collective RPC (#15444)
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
2025-03-29 03:39:14 -07:00
4965ec42d2 [FEAT] [ROCm] Add AITER int8 scaled gemm kernel (#15433)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-03-29 03:33:56 -07:00
73aa7041bf [doc] update doc (#15740)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-03-29 04:27:22 +00:00
7c1f760024 [Kernel][TPU][ragged-paged-attn] vLLM code change for PR#8896 (#15659)
Signed-off-by: Yarong Mu <ymu@google.com>
2025-03-28 21:13:15 -07:00
da461f3cbf [TPU][V1][Bugfix] Fix w8a8 recompiilation with GSM8K (#15714)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-03-28 21:13:06 -07:00
5b800f0932 [Bugfix] set VLLM_WORKER_MULTIPROC_METHOD=spawn for vllm.entrypoionts.openai.api_server (#15700)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
2025-03-28 21:12:26 -07:00
8427f70493 Use numba 0.61 for python 3.10+ to support numpy>=2 (#15692)
Signed-off-by: cyy <cyyever@outlook.com>
2025-03-29 12:11:51 +08:00
7a7992085b [CI] Speed up V1 structured output tests (#15718)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-28 21:10:45 -07:00
1286211f57 [Bugfix] LoRA V1: add and fix entrypoints tests (#15715)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-28 21:10:41 -07:00
6d531ad7b8 [Misc][V1] Misc code streamlining (#15723)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-28 20:59:47 -07:00
762b424a52 [Docs] Document v0 engine support in reasoning outputs (#15739)
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
2025-03-29 03:46:57 +00:00
de1cb38769 [Model] Support Skywork-R1V (#15397)
Signed-off-by: jiacai.liu <932997367@qq.com>
Co-authored-by: jiacai.liu <932997367@qq.com>
2025-03-28 20:39:21 -07:00
c802f5430d [ROCm][AMD][Build] Update AMD supported arch list (#15632)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-03-28 20:39:18 -07:00
cff8991a50 [Docs][V1] Optimize diagrams in prefix caching design (#15716) 2025-03-29 03:33:58 +00:00
f3f8d8fff4 implement prometheus fast-api-instrumentor for http service metrics (#15657) 2025-03-29 00:12:02 +00:00
26df46ee59 [Misc] cli auto show default value (#15582)
Signed-off-by: reidliu41 <reid201711@gmail.com>
2025-03-28 22:23:00 +00:00
c3f687ac22 [V1] TPU - Fix the chunked prompt bug (#15713)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-03-28 20:19:04 +00:00
04437e313d [Bugfix] [torch.compile] Add Dynamo metrics context during compilation (#15639)
Signed-off-by: luka <luka@neuralmagic.com>
2025-03-28 14:01:09 -06:00
038bededba [TPU] [Perf] Improve Memory Usage Estimation (#15671)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2025-03-28 17:37:52 +00:00
d03308be0c [Misc] Remove stale func in KVTransferConfig (#14746)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-03-28 17:33:32 +00:00
c6bc0034d0 [Misc] Remove unused utils and clean up imports (#15708)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-28 09:41:16 -07:00
70e132244a [Minor] Remove TGI launching script (#15646)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-28 09:30:08 -07:00
47e9038d23 Fix cpu offload testing for gptq/awq/ct (#15648)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-29 00:29:32 +08:00
432cf22a6a [Bugfix] Fix regex compile display format (#15368)
Signed-off-by: Kebe <mail@kebe7jun.com>
2025-03-28 08:58:44 -07:00
2914006fe0 [doc] add missing imports (#15699)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-03-28 15:56:48 +00:00
7329ff5468 [V1] Support disable_any_whtespace for guidance backend (#15584)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-28 23:46:45 +08:00
541d1df486 [Bugfix] embed_is_patch for Idefics3 (#15696)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-28 08:27:52 -07:00
3b00ff9138 [Bugfix][v1] xgrammar structured output supports Enum. (#15594)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-03-28 06:14:53 -07:00
91276c5721 [Model] Adding torch compile annotations to chatglm (#15624)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-28 21:14:09 +08:00
0b4167526d [Docs] Add "Generation quality changed" section to troubleshooting (#15701)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-28 13:03:21 +00:00
fd5fd26902 [Frontend] update priority for --api-key and VLLM_API_KEY (#15588)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-03-28 19:40:12 +08:00
3bbaacbe15 [Bugfix][Frontend] Eliminate regex based check in reasoning full generator (#14821)
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
2025-03-28 11:20:35 +00:00
a10314c6b3 [Misc] Fix test_sleep to use query parameters (#14373)
Signed-off-by: Lize Cai <lize.cai@sap.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-03-28 18:00:14 +08:00
70f2c2a709 [Bugfix] Fix 'InductorAdaptor object has no attribute 'cache_dir' (#15674)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-28 17:10:40 +08:00
280d074103 [CPU][CI] Improve CPU Dockerfile (#15690)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-03-28 01:36:31 -07:00
32b14baf8a [Refactor][Frontend] Keep all logic about reasoning into one class (#14428)
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
2025-03-28 00:23:30 -07:00
2d9045fce8 [TPU][CI] Fix TPUModelRunner Test (#15667)
Signed-off-by: Robert Shaw <robshaw@redhat.com>
Co-authored-by: Robert Shaw <robshaw@redhat.com>
2025-03-28 00:01:26 -07:00
355f66348c [V1] Remove legacy input registry (#15673)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-27 23:34:34 -07:00
8693e47e6a [Bugfix] Fix mm_hashes forgetting to be passed (#15668)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-28 05:51:05 +00:00
cec8c7d7f8 Refactor error handling for multiple exceptions in preprocessing (#15650)
Signed-off-by: JasonZhu1313 <jasonchu13@outlook.com>
2025-03-28 03:27:20 +00:00
4d0ec37267 [Quantization][FP8] Adding support for fp8 gemm layer input in fp8 (#14578)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-03-28 02:58:16 +00:00
e7f720ea56 [Misc]add coding benchmark for speculative decoding (#15303)
Signed-off-by: CXIAAAAA <cxia0209@gmail.com>
2025-03-28 10:47:05 +08:00
4ae17bf1e2 Revert "Use Cache Hinting for fused_moe kernel (#15511)" (#15645)
Signed-off-by: Wes Medford <wryanmedford@gmail.com>
2025-03-27 19:45:55 -07:00
8a49eea74b [CI][TPU] Temporarily Disable Quant Test on TPU (#15649)
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
2025-03-27 19:45:05 -07:00
b4245a48df [Doc] Fix dead links in Job Board (#15637)
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
2025-03-28 02:43:40 +00:00
4e0f6076be [Bugfix] Fix failure to launch in Tensor Parallel TP mode on macOS. (#14948)
Signed-off-by: Kebe <mail@kebe7jun.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-03-28 10:13:41 +08:00
726efc6a32 [Quantization][V1] BitsAndBytes support V1 (#15611)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-28 10:12:47 +08:00
bd45912b99 [TPU] Lazy Import (#15656)
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
2025-03-28 09:57:01 +08:00
15dac210f0 [V1] AsyncLLM data parallel (#13923)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-27 16:14:41 -07:00
112b3e5b3b [CI] Update rules for applying tpu label. (#15634)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-27 22:15:26 +00:00
32d669275b Correct PowerPC to modern IBM Power (#15635)
Signed-off-by: Christy Norman <christy@linux.vnet.ibm.com>
2025-03-27 15:04:32 -07:00
4098b72210 [Bugfix][TPU][V1] Fix recompilation (#15553)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-03-27 19:15:06 +00:00
46450b8d33 Use absolute placement for Ask AI button (#15628)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-27 18:52:18 +00:00
13ac9cab21 [Misc] Avoid direct access of global mm_registry in compute_encoder_budget (#15621)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-27 17:52:00 +00:00
66aa4c0bf4 [Feature] Add middleware to log API Server responses (#15593)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-27 17:49:38 +00:00
247181536f [Misc] Replace is_encoder_decoder_inputs with split_enc_dec_inputs (#15620)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-27 17:36:32 +00:00
07bf813fb5 [Doc] Link to onboarding tasks (#15629)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-27 16:30:53 +00:00
8958217ad5 [Bugfix] Fix use_cascade_attention handling for Alibi-based models on vllm/v1 (#15211)
Signed-off-by: h-sugi <h.sugi@ieee.org>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-27 22:29:29 +08:00
ac5bc615b0 [Model] MiniCPM-V/O supports V1 (#15487)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-27 06:07:29 -07:00
8063dfc61a [Doc] update --system for transformers installation in docker doc (#15616)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-03-27 20:38:46 +08:00
6278bc829e Fix incorrect filenames in vllm_compile_cache.py (#15494)
Signed-off-by: <zou3519@gmail.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-03-27 18:33:41 +08:00
3f532cb6a6 [Misc] Use model_redirect to redirect the model name to a local folder. (#14116) 2025-03-27 02:21:23 -07:00
e6c9053f9e [Misc] Clean up scatter_patch_features (#15559)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-27 07:45:00 +00:00
43ed4143c4 [Quantization] Fp8 Channelwise Dynamic Per Token GroupedGEMM (#15587)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Lucas Wilkinson <wilkinson.lucas@gmail.com>
Co-authored-by: ElizaWszola <ewszola@redhat.com>
2025-03-27 06:47:25 +00:00
f4c98b4d4c [Misc] Consolidate LRUCache implementations (#15481)
Signed-off-by: Bella kira <2374035698@qq.com>
2025-03-27 06:43:43 +00:00
e1e0fd7543 [TPU] Avoid Triton Import (#15589)
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
2025-03-27 06:43:02 +00:00
df8d3d1287 [Misc] Restrict ray version dependency and update PP feature warning in V1 (#15556) 2025-03-27 06:21:07 +00:00
619d3de8bd [TPU] [V1] fix cases when max_num_reqs is set smaller than MIN_NUM_SEQS (#15583)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-03-26 22:46:26 -07:00
ecff8309a3 [ROCm] Env variable to trigger custom PA (#15557)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-03-26 22:46:12 -07:00
dcf2a590f5 Allow torchao quantization in SiglipMLP (#15575) 2025-03-26 22:45:51 -07:00
54aa619459 [V1] Refactor num_computed_tokens logic (#15307)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-27 04:54:36 +00:00
fb22be5817 [moe][quant] add weight name case for offset (#15515)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-03-27 04:50:29 +00:00
7f301dd8ef [Doc] Update V1 user guide for fp8 kv cache support (#15585)
Signed-off-by: weizeng <weizeng@roblox.com>
2025-03-26 19:39:03 -07:00
8095341a01 [misc] LoRA: Remove unused long context test data (#15558)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-27 10:04:51 +08:00
69db16a46a add platform check back (#15578)
Signed-off-by: Chenyaaang <llccyy1212@gmail.com>
2025-03-27 01:50:27 +00:00
ce78f9af4e Add automatic tpu label to mergify.yml (#15560) 2025-03-26 21:39:58 -04:00
9239bf718e [Kernel] CUTLASS grouped gemm fp8 MoE kernel (#13972)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Co-authored-by: Lucas Wilkinson <wilkinson.lucas@gmail.com>
2025-03-27 00:54:44 +00:00
7a6d45bc8a Support FIPS enabled machines with MD5 hashing (#15299)
Signed-off-by: Matthew Vine <32849887+MattTheCuber@users.noreply.github.com>
2025-03-26 20:19:46 -04:00
e74ff409e0 [TPU] support disabling xla compilation cache (#15567)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-03-27 00:09:28 +00:00
7a888271f5 Use Cache Hinting for fused_moe kernel (#15511) 2025-03-26 23:21:34 +00:00
9d119a86ae [V1] TPU CI - Fix test_compilation.py (#15570)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-03-26 21:51:54 +00:00
b2e85e26f4 [V1] TPU - Revert to exponential padding by default (#15565)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-03-26 21:35:05 +00:00
dd8a29da99 Applying some fixes for K8s agents in CI (#15493)
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
2025-03-26 20:35:11 +00:00
27df5199d9 Support SHA256 as hash function in prefix caching (#15297)
Signed-off-by: Marko Rosenmueller <5467316+dr75@users.noreply.github.com>
2025-03-26 11:11:28 -07:00
35fad35a48 [V1][Sampler] Faster top-k only implementation (#15478)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-26 10:56:47 -07:00
733e7c9e95 [Refactor] Remove unnecessary backend parameter in structured output interface (#15317)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2025-03-26 17:51:56 +00:00
0af4d764d6 Fix weight loading for some models in Transformers backend (#15544)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-26 10:17:53 -07:00
e64afa455c multi-node offline DP+EP example (#15484)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-26 23:54:24 +08:00
1711b929b6 [Model] Add Reasoning Parser for Granite Models (#14202)
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
2025-03-26 14:28:07 +00:00
c091c0a588 Improve validation of TP in Transformers backend (#15540)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-26 07:26:48 -07:00
1aa162e030 Apply torchfix (#15532)
Signed-off-by: cyy <cyyever@outlook.com>
2025-03-26 12:09:06 +00:00
cf5c8f1686 Separate base model from TransformersModel (#15467)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-03-26 18:13:38 +08:00
4ec2cee000 [Misc] improve example script output (#15528)
Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
2025-03-26 10:12:47 +00:00
99f536f830 [Misc] Enhance warning information to user-defined chat template (#15408)
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
2025-03-26 02:21:15 -07:00
5ebf66748b [FEAT][ROCm] Integrate Fused MoE Kernels from AITER (#14967)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-03-26 16:30:30 +08:00
781d056280 [Feature] Enhance EAGLE Architecture with Proper RMS Norms (#14990)
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-03-26 08:24:07 +00:00
5aefd6ac31 Fix raw_request extraction in load_aware_call decorator (#15382)
Signed-off-by: Daniel Salib <danielsalib@meta.com>
2025-03-25 22:29:54 -07:00
6c663dfd5e [misc] LoRA - Skip LoRA kernels when not required (#15152)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-26 11:33:45 +08:00
33437bc6e7 [BugFix] Fix nightly MLA failure (FA2 + MLA chunked prefill, i.e. V1, producing bad results) (#15492)
Signed-off-by: LucasWilkinson <lwilkinson@neuralmagic.com>
2025-03-25 20:33:22 -07:00
23114d3364 [Misc] Warn about v0 in benchmark_paged_attn.py (#15495)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-25 20:31:04 -07:00
997c8811d6 [Model] Support multi-image for Molmo (#15438)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-26 11:26:33 +08:00
e42389f9d7 Transformers backend already supports V1 (#15463)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-25 20:26:16 -07:00
ff38f0a32c [CI/Build] LoRA: Delete long context tests (#15503)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-25 17:18:34 -07:00
a5cfbab3c8 [Core] LoRA: V1 Scheduler optimization (#15422)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-25 22:50:09 +00:00
ac3cd6e83c [core] add bucket padding to tpu_model_runner (#14995)
Signed-off-by: Chenyaaang <llccyy1212@gmail.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
2025-03-25 17:27:22 -04:00
082ab86f5f [V1] Support long_prefill_token_threshold in v1 scheduler (#15419)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-03-25 14:22:26 -07:00
6aa196c8dc [V1][Minor] Use SchedulerInterface type for engine scheduler field (#15499)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-25 14:21:36 -07:00
a0dd7dcd49 [TPU][V1] Fix Sampler recompilation (#15309)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-03-25 16:43:54 -04:00
e977c11111 Add workaround for shared field_names in pydantic model class (#13925)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2025-03-25 20:31:08 +00:00
5f063a80bd [bugfix] add supports_v1 platform interface (#15417)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-03-25 15:00:32 -04:00
5d8e1c9279 [Bugfix] Support triton==3.3.0+git95326d9f for RTX 5090 (Unsloth + vLLM compatibility) (#15471)
Co-authored-by: ServerAI <ai@exc-mad-ai.com>
2025-03-25 17:59:25 +00:00
0a049c7d86 [CI/Build] Add tests for the V1 tpu_model_runner. (#14843)
Signed-off-by: Yarong Mu <ymu@google.com>
2025-03-25 12:27:16 -04:00
d0cfec7ab9 [bugfix] fix inductor cache on max_position_embeddings (#15436)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-25 07:05:39 -07:00
a608160027 [Kernel] Fix conflicting macro names for gguf kernels (#15456)
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>
2025-03-25 13:50:49 +00:00
3f04a7fbf2 [Doc] Update V1 user guide for multi-modality (#15460)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-25 11:01:58 +00:00
5994430b84 [Misc] Remove redundant num_embeds (#15443)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-25 18:27:57 +08:00
a9e879b316 [Misc] Clean up MiniCPM-V/O code (#15337)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-25 10:22:52 +00:00
3e2f37a69a Dockerfile.ppc64le changes to move to UBI (#15402)
Signed-off-by: Md. Shafi Hussain <Md.Shafi.Hussain@ibm.com>
2025-03-25 10:15:14 +00:00
4f044b1d67 [Kernel][CPU] CPU MLA (#14744)
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
2025-03-25 09:34:59 +00:00
4157f563b4 [Hardware][TPU][Bugfix] Fix v1 mp profiler (#15409)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
2025-03-25 01:43:00 -07:00
051da7efe3 Fix CUDA kernel index data type in vllm/csrc/quantization/gptq_marlin/awq_marlin_repack.cu +10 (#15160)
Signed-off-by: Lu Fang <lufang@fb.com>
Co-authored-by: Richard Barnes <rbarnes@meta.com>
2025-03-25 15:36:45 +08:00
25f560a62c [V1][Spec Decode] Update target_logits in place for rejection sampling (#15427)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-24 21:04:41 -07:00
a09ad90a72 [V1] guidance backend for structured output + auto fallback mode (#14779)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Loc Huynh <jc1da.3011@gmail.com>
Co-authored-by: Michal Moskal <michal@moskal.me>
2025-03-24 21:02:33 -07:00
10b34e36b9 [Bugfix] Fixed the issue of not being able to input video and image simultaneously (#15387)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-03-25 03:48:08 +00:00
b5269db959 Revert "Fix non-contiguous input passed to Marlin kernel (#15319)" (#15398) 2025-03-24 20:43:51 -07:00
6db94571d7 [Misc] Remove LoRA log (#15388)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-24 20:43:48 -07:00
97cfa65df7 Add pipeline parallel support to TransformersModel (#12832)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-03-25 10:41:45 +08:00
911c8eb000 [Minor][Spec Decode] Remove compiled_softmax (#15416)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-24 19:09:04 -07:00
ebcebeeb6b [V1][Spec Decode] Enable spec decode for top-p & top-k sampling (#15063)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-24 17:16:46 -07:00
f533b5837f [ROCm][Kernel] MoE weights padding (#14454)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: charlifu <charlifu@amd.com>
Co-authored-by: charlifu <charlifu@amd.com>
2025-03-24 23:45:30 +00:00
8279201ce6 [Build] Cython compilation support fix (#14296)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-03-24 23:37:54 +00:00
23fdab00a8 [Hardware][TPU] Skip failed compilation test (#15421)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
2025-03-24 23:28:57 +00:00
623e2ed29f [BugFix][V1] Quick fix for min_tokens with multiple EOS (#15407)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-24 15:58:59 -07:00
9d72daf4ce [V1][Perf] Simpler request output queues (#15156)
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Co-authored-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
2025-03-24 22:44:08 +00:00
6dd55af6c9 [Doc] Update docs on handling OOM (#15357)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-03-24 14:29:34 -07:00
3eb08ed9b1 [DOC] Add Kubernetes deployment guide with CPUs (#14865) 2025-03-24 10:48:43 -07:00
5eeadc2642 [Hardware][Gaudi][Feature] Enable Dynamic MoE for Mixtral (#12303)
Signed-off-by: zhenwei <zhenweiliu@habana.ai>
2025-03-24 09:48:40 -07:00
3aee6573dc [V1] Aggregate chunked prompt logprobs in model runner (#14875)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-24 12:27:57 -04:00
9cc645141d [MISC] Refine no available block debug msg (#15076)
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Co-authored-by: Yi Liu <yiliu4@habana.ai>
2025-03-25 00:01:10 +08:00
0893567db9 [V1][Minor] fix comments (#15392)
Signed-off-by: chenjincong <chenjincong@baidu.com>
Signed-off-by: Chen-0210 <chenjincong11@gmail.com>
Co-authored-by: chenjincong <chenjincong@baidu.com>
2025-03-24 08:45:32 -07:00
8abe69b499 [Core] Don't force uppercase for VLLM_LOGGING_LEVEL (#15306)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-24 08:27:30 -07:00
761702fd19 [Core] Integrate fastsafetensors loader for loading model weights (#10647)
Signed-off-by: Manish Sethi <Manish.sethi1@ibm.com>
2025-03-24 08:08:02 -07:00
9606d572ed [distributed] fix dp group (#15355)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-24 14:54:27 +00:00
cbcdf2c609 [Bugfix] Fix chat template loading (#15143)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: chaunceyjiang <chaunceyjiang@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-03-24 13:50:09 +00:00
038de04d7b Fix zmq IPv6 URL format error (#15341)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-24 09:30:41 -04:00
6b3cc75be0 [Kernel] allow non-contiguous input for marlin kernel (#14658)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
2025-03-24 09:21:33 -04:00
7ffcccfa5c Revert "[CI/Build] Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)" (#15377)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-03-24 05:53:10 -07:00
cc8accfd53 [Misc] Update guided decoding logs to debug (#15310)
Signed-off-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
Co-authored-by: Benjamin Merkel <benjamin.merkel@tngtech.com>
2025-03-24 04:25:20 -07:00
948ab03e7e [Bugfix][V1] Avoid importing PreTrainedModel (#15366)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-03-24 10:33:12 +00:00
5797fb97e9 [Misc] Remove ignore_reinit_error for ray.init() (#15373) 2025-03-24 07:41:53 +00:00
3892e58ad7 [Misc] Upgrade BNB version (#15183) 2025-03-24 05:51:42 +00:00
d20e261199 Fix non-contiguous input passed to Marlin kernel (#15319) 2025-03-24 03:09:44 +00:00
f622dbcf39 [Fix] [torch.compile] Improve UUID system for custom passes (#15249)
Signed-off-by: luka <luka@neuralmagic.com>
2025-03-24 01:54:07 +00:00
dccf535f8e [V1] Enable V1 Fp8 cache for FA3 in the oracle (#15191)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-23 15:07:04 -07:00
9c5c81b0da [Misc][Doc] Add note regarding loading generation_config by default (#15281)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-23 14:00:55 -07:00
d6cd59f122 [Frontend] Support tool calling and reasoning parser (#14511)
Signed-off-by: WangErXiao <863579016@qq.com>
2025-03-23 14:00:07 -07:00
bc8ed3c4ba [V1][Spec Decode] Use better defaults for N-gram (#15358)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-23 10:52:30 -07:00
b9bd76ca14 [V1][Spec Decode] Respect prompt_lookup_max (#15348)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-23 10:41:44 -07:00
6ebaf9ac71 [Bugfix] consider related env vars for torch.compiled cache hash (#14953)
Signed-off-by: DefTruth <31974251+DefTruth@users.noreply.github.com>
2025-03-23 15:53:09 +00:00
f90d34b498 [Misc] Add tuned R1 w8a8 and MoE configs for NVIDIA L20 (#15322)
Signed-off-by: DefTruth <qiustudent_r@163.com>
2025-03-23 01:10:10 -07:00
f68cce8e64 [ci/build] fix broken tests in LLM.collective_rpc (#15350)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-23 14:49:48 +08:00
09b6a95551 [ci/build] update torch nightly version for GH200 (#15135)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-23 14:04:13 +08:00
50c9636d87 [V1][Usage] Refactor speculative decoding configuration and tests (#14434)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-03-22 19:28:10 -10:00
0661cfef7a Fix v1 supported oracle for worker-cls and worker-extension-cls (#15324)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-03-23 10:23:35 +08:00
a827aa815d [doc] Add back previous news (#15331)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-03-22 17:38:33 -07:00
b877031d80 Remove openvino support in favor of external plugin (#15339)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-22 14:06:39 -07:00
dd861b992f [BugFix][Typing] Fix Imprecise Type Annotations (#15208)
Signed-off-by: Wang Ran (汪然) <wrran@outlook.com>
2025-03-22 09:05:03 -07:00
eb63ea1e18 [V1] Add disable-any-whitespace option support for xgrammar (#15316)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-22 15:56:17 +00:00
2f4bd358f1 [Model] Support Tele-FLM Model (#15023)
Signed-off-by: Naitong Yu <ntyu@baai.ac.cn>
Signed-off-by: jiangxin <horizon94@outlook.com>
Co-authored-by: Jason Fang <jasonfang3900@gmail.com>
Co-authored-by: jiangxin <horizon94@outlook.com>
2025-03-22 02:04:44 -07:00
8a8b30eac1 [Bugfix] LoRA V0 - Fix case where max_num_seqs is between cudagraph capture sizes (#15308)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-22 02:03:32 -07:00
2fa0e1396b [Bugfix] Fix torch.compile raise FileNotFoundError (#15278)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-22 13:49:34 +08:00
1c2bec0f82 [Doc] add load_format items in docs (#14804)
Signed-off-by: wwl2755 <wangwenlong2755@gmail.com>
2025-03-21 22:36:43 -07:00
ec870fba9a [FEAT] [ROCm]: Add AITER RMS Norm (Layer Norm) Feature (#14959)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-03-21 22:36:14 -07:00
df1430265c [Bugfix][V0] Multi-sequence logprobs streaming edge case (#15259)
Signed-off-by: Andy Lo <andy@mistral.ai>
2025-03-21 22:35:37 -07:00
4c69e228b3 [Misc] Increase RayDistributedExecutor RAY_CGRAPH_get_timeout (#15301)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-03-21 22:25:43 -07:00
790b79750b [Build/CI] Fix env var typo (#15305)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-21 22:28:46 +00:00
cfbb8c930f [TPU][V1] MHA Pallas backend (#15288)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-03-21 08:50:39 -07:00
baec0d4de9 Revert "[Feature] specify model in config.yaml (#14855)" (#15293)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-21 08:30:23 -07:00
c21b99b912 [Bugfix][VLM] fix llava processor (#15285)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-03-21 05:14:36 -07:00
93a00d7dde [v1] Refactor KVCacheConfig (#14079)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-03-21 04:56:27 -07:00
61e8c18350 [Misc] Add cProfile helpers (#15074)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-21 04:56:09 -07:00
8afcd0f633 [Bugfix] Fix broken kernel test due to missing rename for v1 Triton backend (#15282)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-21 11:42:06 +00:00
91ca929dc7 [V1] Fix wrong import path of get_flash_attn_version (#15280)
Signed-off-by: Lehua Ding <lehuading@tencent.com>
2025-03-21 03:54:11 -07:00
84e00adc8a [Bugfix] Fix incorrect resolving order for transformers fallback (#15279)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-21 03:54:08 -07:00
47c7126213 [Misc] Add attention mask pre-computation optimization back to Qwen2.5-VL (#15273)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-21 10:32:33 +00:00
a989ca2bf6 [Bugfix] Add int8 torch dtype for KVCache (#15260)
Signed-off-by: shen-shanshan <467638484@qq.com>
2025-03-21 08:58:28 +00:00
0fa3970deb [Feature] specify model in config.yaml (#14855)
Signed-off-by: weizeng <weizeng@roblox.com>
2025-03-21 00:26:03 -07:00
da6ea29f7a [V1] Avoid redundant input processing in n>1 case (#14985)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-20 22:24:10 -07:00
7297941b38 [Doc] Update LWS docs (#15163)
Signed-off-by: Edwinhr716 <Edandres249@gmail.com>
2025-03-20 21:18:47 -07:00
f8a08cb90d [V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs (#14071)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-21 03:14:19 +00:00
b15fd2be2a [Hardware][TPU] Add check for no additional graph compilation during runtime (#14710)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
2025-03-21 03:05:28 +00:00
e588ac237c Add an example for reproducibility (#15262)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-20 19:55:47 -07:00
5df2da5b97 [Misc] Better RayExecutor and multiprocessing compatibility (#14705)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-20 19:27:46 -07:00
11b986b3fb [Docs] Trim the latest news in README (#15261)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-20 19:24:21 -07:00
296f927f24 [Model] RE: Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies (#14857)
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
2025-03-20 19:21:08 -07:00
0032903a5b [Bugfix] detect alibi and revert to FA2 (#15231)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2025-03-20 19:20:16 -07:00
47195057e9 [V1][TPU] Speed up top-k on TPU by using torch.topk (#15242)
Signed-off-by: Hyesoo Yang <hyeygit@gmail.com>
2025-03-20 19:19:40 -07:00
6edbfa924d Mention extra_body as a way top pass vLLM only parameters using the OpenAI client (#15240)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-20 19:18:36 -07:00
1e508343e1 [Bugfix] Fix incorrect qwen2.5-vl attention mask pre-computation (#15200)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-20 19:18:04 -07:00
2e0b4cfde0 [ROCM] Upgrade torch to 2.6 (#15244)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-03-20 19:17:33 -07:00
10f55fe6c5 [Misc] Clean up the BitsAndBytes arguments (#15140)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-20 19:17:12 -07:00
d3ccbd6350 Fix CUDA kernel index data type in vllm/csrc/quantization/fused_kernels/layernorm_utils.cuh +10 (#15159)
Signed-off-by: Lu Fang <lufang@fb.com>
Co-authored-by: Richard Barnes <rbarnes@meta.com>
2025-03-21 10:01:11 +08:00
0cfe7d386d [CI/Build] LoRA : make add_lora_test safer (#15181)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-21 09:28:53 +08:00
0c6f5023c3 [V1] Scheduler Refactoring [1/N] - Add Scheduler Interface (#15250)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-03-20 17:50:43 -07:00
06dd08256f Enforce that TP > 1 is not supported for Mamba2 if Quantization is Enabled. (#14617)
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
2025-03-21 00:44:37 +00:00
2b22290ce0 [V1] Add flag to disable cascade attention (#15243)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-20 15:24:16 -07:00
d8e82bc06d [Bugfix] fix V1 Engine crash while handling requests with duplicate request id (#15043)
Signed-off-by: Jiahui Sun <jhsun2020@gmail.com>
2025-03-20 10:01:02 -07:00
086b56824c [ci] feat: make the test_torchrun_example run with tp=2, external_dp=2 (#15172)
Signed-off-by: Chi Zhang <zhangchi.usc1992@bytedance.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-03-21 00:30:04 +08:00
5a0905ba2a Replace misc issues with link to forum (#15226)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-20 23:18:20 +08:00
a8f12a63fd Fix env vars for running Ray distributed backend on GKE (#15166)
Signed-off-by: Richard Liu <ricliu@google.com>
2025-03-20 14:59:33 +00:00
69ae2380c6 Add user forum to README (#15220)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-20 22:39:51 +08:00
27261e40a6 [Bugfix] Multi-video inference on LLaVA-Onevision (#15082)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-03-20 14:10:45 +00:00
e3f813c33b [macOS] Ugrade pytorch to 2.6.0 (#15129) 2025-03-20 01:22:40 -07:00
c607a2652b Fixing Imprecise Type Annotations (#15192) 2025-03-20 01:19:55 -07:00
3d45e3d749 [release] Tag vllm-cpu with latest upon new version released (#15193) 2025-03-20 01:19:10 -07:00
742369d35a [Frontend][Bugfix] support prefill decode disaggregation on deepseek (#14824)
Signed-off-by: billishyahao <bill.he@amd.com>
Co-authored-by: Zhai Feiyue <80079571+ZhaiFeiyue@users.noreply.github.com>
2025-03-20 00:00:33 -07:00
bfe2fe0af4 typo: Update config.py (#15189) 2025-03-19 23:31:21 -07:00
a8652f4f0f Enable CUDA graph support for llama 3.2 vision (#14917)
Signed-off-by: Matt Ritter <100659061+mritterfigma@users.noreply.github.com>
2025-03-19 23:29:16 -07:00
2f726b241e [Doc] Update README.md (#15187)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-20 13:25:58 +08:00
a597a57595 [Attention] Flash Attention 3 - fp8 (#14570)
Signed-off-by: Mickael Seznec <mickael@mistral.ai>
2025-03-20 01:14:20 -04:00
ae65f3e237 [Misc]fixed disable these http request logs (#14754)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-03-19 21:53:40 -07:00
34868b106a [Doc] Update Mistral Small 3.1/Pixtral example (#15184)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-20 04:46:06 +00:00
1f16b7fe74 [Core][V0] Add guidance backend for structured output (#14589)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Loc Huynh <lohuynh@microsoft.com>
Co-authored-by: Michal Moskal <michal@moskal.me>
Co-authored-by: Aaron Pham <contact@aarnphm.xyz>
2025-03-19 21:33:51 -07:00
b88be22165 [Benchmark] Allow oversample request in benchmark dataset (#15170)
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
2025-03-20 12:32:58 +08:00
d8c6d7d6b5 [V1][TPU] Support V1 Sampler for ragged attention (#14227)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-03-19 21:00:39 -07:00
40828ce5fe fix "Total generated tokens:" is 0 if using --backend tgi and --endpo… (#14673)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2025-03-19 20:56:16 -07:00
ffa443afed [Bugfix] Fix embedding assignment for InternVL-based models (#15086)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-20 03:40:13 +00:00
70e500cad9 Fix broken tests (#14713)
Signed-off-by: JovanSardinha <jovan.sardinha@gmail.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
2025-03-20 02:06:49 +00:00
4cb1c05c9e [Doc] Clarify run vllm only on one node in distributed inference (#15148)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-03-20 09:55:59 +08:00
c47aafa37c [BugFix] Lazily import XgrammarBackend to avoid early cuda init (#15171)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-20 01:30:43 +00:00
cfbca8a2f2 [V1] TPU - Tensor parallel MP support (#15059) 2025-03-20 00:55:18 +00:00
0fe5609874 [Docs] Annouce Ollama and Singapore Meetups (#15161)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-03-19 16:18:04 -07:00
22d33baca2 [FrontEnd][Perf] merge_async_iterators fast-path for single-prompt requests (#15150)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-19 21:04:41 +00:00
b0e96aaebb [V1][TPU] Change kv cache shape. (#15145)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
2025-03-19 12:16:42 -07:00
8310e0b59b simple bugfix: Update stats.py (#15139) 2025-03-19 18:26:27 +00:00
26dd972adb [FEAT]Support reset prefix cache by specified device (#15003) 2025-03-19 10:54:41 -07:00
61c7a1b856 [V1] Minor V1 async engine test refactor (#15075)
Signed-off-by: andoorve <murali.andoorveedu@mail.utoronto.ca>
Co-authored-by: andoorve <murali.andoorveedu@mail.utoronto.ca>
2025-03-19 10:37:17 -07:00
374ee287d8 [Frontend] Remove custom_cache_manager (#13791)
Signed-off-by: fulvius31 <asangior@redhat.com>
2025-03-20 00:13:50 +08:00
a4d83661d7 [Misc] Update the "the first vLLM China Meetup" slides link to point to the first page (#15134)
Signed-off-by: imkero <kerorek@outlook.com>
2025-03-19 15:07:39 +00:00
8363cd093d [Bugfix] Adjust mllama to regional compilation (#15112)
Signed-off-by: Jan Kaniecki <jkaniecki@habana.ai>
2025-03-19 07:57:25 -07:00
6c5a3195db [Misc][Benchmark] Add support for different tokenizer_mode (#15040)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
2025-03-19 14:56:50 +00:00
073d1ed354 [Doc] Update tip info on using latest transformers when creating a custom Dockerfile (#15070) 2025-03-19 13:33:40 +00:00
3d446433ec [Bugfix] Fix size calculation of processing cache (#15114)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-19 05:53:19 -07:00
1fe0fd12d3 [Misc] Avoid unnecessary HF do_rescale warning when passing dummy data (#15107)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-19 03:42:31 -07:00
dafb4e504a [V1][Bugfix] Fix oracle for device checking (#15104)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-19 18:35:32 +08:00
68cf1601d3 [CI][Intel GPU] update XPU dockerfile and CI script (#15109)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-03-19 01:29:25 -07:00
61f412187d [Bugfix] Re-enable Gemma3 for V1 (#14980)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-18 23:58:22 -07:00
05ccd0aa35 [V1] Ensure using int64 for sampled token ids (#15065)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-18 23:52:19 -07:00
f690372b68 [Core] Update dtype detection and defaults (#14858)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-19 13:49:33 +08:00
8b3e94a357 [Model] Remove duplicated message check in Mistral chat completion request (#15069)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-03-19 05:09:32 +00:00
437f9162d0 [Model] Pixtral: Remove layer instantiation duplication (#15053)
Signed-off-by: Julien Denize <julien.denize@mistral.ai>
2025-03-19 10:34:03 +08:00
4f065f12f5 [Misc][V1] Skip device checking if not available (#15061)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-18 19:33:43 -07:00
228b768db6 [Doc] Minor v1_user_guide update (#15064)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2025-03-18 16:10:45 -07:00
027827cc1d fix long dtype in topk sampling (#15049) 2025-03-18 15:57:31 -07:00
72a8639b68 [V1] TPU - CI/CD use smaller model (#15054)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-03-18 21:39:21 +00:00
99abb8b650 [V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels (#14930)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-18 14:31:54 -07:00
3a1e648158 [V1] Refactor Structured Output for multiple backends (#14694)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-18 19:49:15 +00:00
46c759c165 [Bugfix] Fix LoRA extra vocab size (#15047)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-18 09:40:29 -07:00
179a619c21 [Bugfix] Fix broken CPU quantization due to triton import (#15038)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-18 08:57:39 -07:00
452e8fd968 [MODEL] Add support for Zamba2 models (#13185)
Signed-off-by: Yury Tokpanov <yury@zyphra.com>
Signed-off-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-03-18 08:56:21 -07:00
8b793f7ec6 MI325 configs, fused_moe_kernel bugfix (#14987)
Signed-off-by: Eugene Kuznetsov <eugene.kuznetsov@amd.com>
2025-03-18 08:05:18 -07:00
af35d3a3cc [TPU][V1][Bugfix] Fix chunked prefill with padding (#15037)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-03-18 07:34:45 -07:00
3b457143d2 [Bugfix] Register serializers for V0 MQ Engine (#15009)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-03-18 09:14:47 -04:00
ab656f2c2f [Bugfix] Loosen type check to avoid errors in V1 (#15021)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-18 12:54:40 +00:00
64fc2193dc [Misc][Docs] fix the comments of KV_T and CACHE_T in CALL_RESHAPE_AND_CACHE_XX macros (#14347) 2025-03-18 05:50:19 -07:00
dd732028f5 [Bugfix][Frontend] Fix validation of logprobs in ChatCompletionRequest (#14352)
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com>
2025-03-18 05:50:05 -07:00
414919138b [Bugfix] torchrun compatibility (#14899)
Signed-off-by: hiyouga <hiyouga@buaa.edu.cn>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-03-18 05:49:27 -07:00
db7c8ca910 [Misc] Embedding model support LoRA (#14935)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-18 12:07:00 +00:00
f863ffc965 [Mistral-Small 3.1] Update docs and tests (#14977)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-03-18 03:29:42 -07:00
400d483e87 [Kernels] LoRA - Retire SGMV and BGMV Kernels (#14685)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-18 09:47:53 +00:00
d1695758b2 [Doc][V1] Fix V1 APC doc (#14920) 2025-03-18 08:15:46 +00:00
53a0cf8b95 [Neuron] trim attention kernel tests to fit trn1.2x instance (#14988)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2025-03-18 15:05:52 +08:00
5eeabc2a44 [Bugfix] Fix bnb quantization for models with both HF-format and Mistral-format weights (#14950) 2025-03-17 23:27:26 +00:00
18551e820c [V1] TPU - Fix CI/CD runner (#14974) 2025-03-17 21:07:07 +00:00
e41e160263 [V1] Guard Against Main Thread Usage (#14972)
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
2025-03-17 13:23:02 -07:00
b89fb2a4a1 [CI/Build] Use AutoModelForImageTextToText to load VLMs in tests (#14945)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-17 18:35:17 +00:00
5340b0e221 [Bugfix] Fix interface for Olmo2 on V1 (#14976)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-17 11:26:38 -07:00
37e3806132 [Bugfix] Make Gemma3 MM V0 only for now (#14971)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-17 10:04:21 -07:00
c0efdd655b [Fix][Structured Output] using vocab_size to construct matcher (#14868)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
2025-03-17 11:42:45 -04:00
aaaec52ad9 [Bugfix][Model] Mixtral: use unused head_dim config argument (#14961)
Signed-off-by: Quentin Torroba <quentin.torroba@mistral.ai>
2025-03-17 07:44:18 -07:00
e1eb45d397 [Bugfix] Fix precommit - line too long in pixtral.py (#14960)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-17 07:18:50 -07:00
89fca671fb [V1] Default MLA to V1 (#14921)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-03-17 06:54:40 -07:00
d20b0c139c Add patch merger (#14957) 2025-03-17 06:47:50 -07:00
166a168b0f [Doc] Fix misleading log during multi-modal profiling (#14955)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-17 06:14:32 -07:00
2bb0e1a799 [Bugfix][ROCm] running new process using spawn method for rocm in tests. (#14810)
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-03-17 11:33:35 +00:00
6eaf1e5c52 [Misc] Add --seed option to offline multi-modal examples (#14934)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-17 03:00:17 -07:00
868a8c5b2c [Bugfix] Fix Ultravox on V1 (#14929)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-17 17:15:20 +08:00
b4ad56c1bd [V1][TPU] Apply the ragged paged attention kernel fix and remove the padding. (#14846)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
2025-03-17 01:48:28 -07:00
69698f257e fix minor miscalled method (#14327) 2025-03-17 01:47:58 -07:00
cd0cd85102 [MISC] More AMD unused var clean up (#14926)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-03-17 16:40:41 +08:00
0a74bfce9c setup.py: drop assumption about local main branch (#14692)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-17 01:37:42 -07:00
dd3b865854 [Doc] Add vLLM Beijing meetup slide (#14938)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-03-17 16:29:36 +08:00
9b87a579aa [Misc][XPU] Use None as device capacity for XPU (#14932)
Signed-off-by: yan ma <yan.ma@intel.com>
2025-03-17 01:22:14 -07:00
b539222d4e [V1] Remove input cache client (#14864)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-03-16 23:42:06 -07:00
8d6cf89526 [V1] [Spec Decode] Support random sampling for spec decode (#13933)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-16 22:00:20 -07:00
583a9778e0 [Benchmark] Do not save detailed info to json by default (#14879)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-03-16 21:48:11 -07:00
a73e183e36 [Misc] Replace os environ to monkeypatch in test suite (#14516)
Signed-off-by: sibi <85477603+t-sibiraj@users.noreply.github.com>
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Aaron Pham <contact@aarnphm.xyz>
2025-03-16 20:35:57 -07:00
1e799b7ec1 [BugFix] Fix MLA + V1 + TP==1 causing reinitialization of cuda context (#14910) 2025-03-17 03:35:37 +00:00
7f6c5ee06c [V1][Minor] Add __repr__ to ConstantList (#14907)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-16 20:20:15 -07:00
faa0275730 [V1] Optimize the overhead of rewinding (#14905)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-16 20:19:30 -07:00
8a5a9b70d7 [CI/Build] Update defaults for test reproducibility (#14893)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-17 10:38:15 +08:00
bb3aeddfaf [CI] Nightly Tests (#14898)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-03-17 02:06:43 +00:00
aecc780dba [V1] Enable Entrypoints Tests (#14903) 2025-03-16 17:56:16 -07:00
90df7f23aa [Doc] Add guidance for using ccache with pip install -e . in doc (#14901) 2025-03-16 23:10:04 +00:00
b9b5bdfc7d [Misc] Catching Ray Compiled Graph PP test failures for V1 (#14847) 2025-03-16 15:46:42 -07:00
31060b2757 [V1][BugFix] Detect interleaved sliding window attention (#14896)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-16 14:53:53 -07:00
fc1f67715d [BugFix][V1] Fix overhead related to bad_words sampling when not in use (#14894)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-16 14:53:34 -07:00
f6137adbcb Revert "[Bugfix] Limit profiling run sequence length by max_model_len (#14785)" (#14892)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-16 09:13:46 -07:00
e53b1350f2 [Bugfix] Explicitly disable Phi-4-multimodal in V1 (#14889)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-16 09:05:40 -07:00
d30aa7e9e6 [Bugfix] Limit profiling run sequence length by max_model_len (#14785)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2025-03-16 07:44:19 -07:00
d1ad2a57af [V1] [Spec Decode] Fix ngram tests (#14878) 2025-03-16 00:29:22 -07:00
b82662d952 [BugFix] Fix torch distributed stateless PG backend init (#14870)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-15 20:26:19 -07:00
71c1e07107 [Kernel] Add more tuned configs (#14877)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-03-15 20:25:03 -07:00
b30c75dda4 [V1] Remove V0 fallback for mistral-tokenizer (#14873)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-15 20:21:11 -07:00
def232e122 [VLM] Clean up Phi-4-MM ViT implementation (#14812)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-03-15 18:53:52 -07:00
3453b964a3 [Misc][Doc] Minor benchmark README update (#14874)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-16 09:46:17 +08:00
61c6a5a796 [VLM] Merged multi-modal processor for Pixtral (#12211)
Signed-off-by: remi <remi@mistral.ai>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-15 06:28:27 -07:00
74bc397b0a [Core] Expose API endpoint /is_sleeping (#14312)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
2025-03-15 06:28:14 -07:00
f58aea002c [CI][Intel GPU] refine intel GPU ci docker build (#14860)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-03-15 11:58:53 +00:00
3556a41434 [VLM] Limit multimodal input cache by memory (#14805)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-15 02:52:05 -07:00
9ed6ee92d6 [Bugfix] EAGLE output norm bug (#14464)
Signed-off-by: Bryan Lu <yuzhelu@amazon.com>
2025-03-15 06:50:33 +00:00
ee3778d5fc [Build/CI] Upgrade jinja2 to get 3 moderate CVE fixes (#14839)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-15 05:38:19 +00:00
aaacf17324 [Doc] V1 user guide (#13991)
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Co-authored-by: Jennifer Zhao <JenZhao@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-03-14 22:17:59 -07:00
4c7629cae9 [V1][Structured Output] calculate vocab_size eagerly (#14851)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2025-03-14 22:09:51 -07:00
e0fdfa1608 [CI/Build] Delete LoRA bias test (#14849)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-14 22:09:25 -07:00
5952d8ab61 [Attention] Get rid of mla cache alignment (#14842)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-15 05:08:25 +00:00
a2ae496589 [CPU] Support FP8 KV cache (#14741)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-03-14 22:07:36 -07:00
877e352262 [Docs] Add new East Coast vLLM Meetup slides to README and meetups.md (#14852) 2025-03-14 22:06:38 -07:00
d4d93db2c5 [V1] V1 Enablement Oracle (#13726)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-03-14 22:02:20 -07:00
8c0d15d5c5 [Misc][Easy] Annotate unused vars in the csrc files (#14798)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-03-15 12:40:09 +08:00
97ac781c62 [Misc] Remove misleading message in gemma2 and gemma3 (#14850)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-14 21:35:12 -07:00
776dcec8fe Disable outlines cache by default (#14837) 2025-03-15 03:57:55 +00:00
ccf02fcbae Revert "[Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of U… (#14848) 2025-03-14 20:45:42 -07:00
acaea3bb07 [Bugfix][V1] Fix flashinfer sampling (#14815) 2025-03-14 20:42:38 -07:00
9f37422779 [Neuron][CI] update docker run command (#14829)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2025-03-14 18:51:35 -07:00
dd344e0342 [Bugfix] Fix torch_xla in V0 which can't handle None seed introduced … (#14844)
Signed-off-by: Yarong Mu <ymu@google.com>
2025-03-15 00:41:15 +00:00
54a8804455 [Doc] More neutral K8s deployment guide (#14084)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-14 16:12:36 -07:00
bbd94a19fc [Build/CI] Upgrade aiohttp to include CVE fix (#14840)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-14 23:11:28 +00:00
233ffce1eb [Build/CI] Move ninja to common deps (#14835)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-14 21:25:28 +00:00
40677783aa [CI] Add TPU v1 test (#14834)
Signed-off-by: Richard Liu <ricliu@google.com>
2025-03-14 17:13:30 -04:00
14f301b541 Update to torch==2.6.0 (#12721)
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: luka <luka@neuralmagic.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-14 16:58:30 -04:00
46f98893dd [V1] Fix model parameterization for structured output tests (#14833)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-14 20:55:18 +00:00
fe66b34728 [Model] Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies (#14778)
Signed-off-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com>
2025-03-14 16:36:18 -04:00
270a5da495 Re-enable the AMD Entrypoints Test (#14711)
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
2025-03-14 12:18:13 -07:00
7097b4cc1c [release] Remove log cleanup commands from TPU job (#14838) 2025-03-14 11:59:52 -07:00
977a16772c [Bugfix][Kernel]: Fix AllSpark kernel compilation errors and enable for CUDA < 12.0 (#14430)
Signed-off-by: wyj371990 <wyj371990@alibaba-inc.com>
2025-03-14 09:55:14 -07:00
73deea2fdb [Frontend] track server_load (#13950) 2025-03-14 09:53:17 -07:00
9d2b4a70f4 [V1][Metrics] Updated list of deprecated metrics in v0.8 (#14695)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-15 00:45:25 +08:00
0b0d6421b2 [Frontend] Fix log message to use http vs https (#14774)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-14 09:21:09 -07:00
1140991a7b [V1] Fix vocab size calculation for structured output (#14826)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-14 09:18:38 -07:00
613c5bb945 [Bugfix] Fix Aria test loading (#14823)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-14 09:11:23 -07:00
fd8e055ffb [BugFix]: properly catch templating error when preprocess input (#13976)
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
2025-03-14 05:58:34 -07:00
ab93f1360f [VLM] Various cleanup and fixes (#14806)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-14 05:58:19 -07:00
40253bab44 [Bugfix][W8A8] fixed cutlass block fp8 binding (#14796) 2025-03-14 03:32:42 -07:00
c77620d22d [V1][Minor] Minor code cleanup for scheduling metrics (#14800)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-14 08:21:28 +00:00
989ecd2007 [Misc] Gemma3ForConditionalGeneration supports LoRA (#14797)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-14 01:07:30 -07:00
54cc46f3eb [Bugfix] Fix small typo in the example of Streaming delimiter (#14793) 2025-03-14 08:05:17 +00:00
601bd3268e [Misc] Clean up type annotation for SupportsMultiModal (#14794)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-14 00:59:56 -07:00
09269b3127 [BugFix] Fix performance serving benchmark when profiling is enabled (#14737)
Signed-off-by: wangli <wangli858794774@gmail.com>
2025-03-14 07:02:05 +00:00
27b50f1fe6 [Bugfix][Kernel][CPU] Fix num_tokens in CPU rotary embedding kernel (#14667)
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
2025-03-13 23:47:49 -07:00
9532c49836 [Attention] MLA get rid of materialization (#14770)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-13 23:39:02 -07:00
0c2af17c76 [CI] Fix missing example model id in processor test (#14787)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-14 13:52:15 +08:00
a6e0d096dd [Feature] Add visionarena offline support for benchmark_throughput (#14654)
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Signed-off-by: Jennifer Zhao <ai.jenniferzhao@gmail.com>
Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Co-authored-by: Jennifer Zhao <JenZhao@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2025-03-14 04:07:54 +00:00
d3d4956261 [Neuron] flatten test parameterization for neuron attention kernels (#14712) 2025-03-13 20:46:56 -07:00
4059adc31b [Misc][Minor] Simplify SamplingParams.__post_init__() (#14772)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-14 11:44:20 +08:00
f1f632d9ec [ci] Reduce number of tests in fastcheck (#14782) 2025-03-13 20:43:45 -07:00
95d680b862 [Bugfix][IPEX] Add VLLM_CPU_MOE_PREPACK to allow disabling MoE prepack when CPU does not support it (#14681)
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
2025-03-13 20:43:18 -07:00
fb4c7f8ef0 [Kernel] [V1] Further optimizations to ROCm (Triton) Backend to better handle GQA. (#14431)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com>
2025-03-13 20:42:27 -07:00
0b1cfa6180 [Kernel] LoRA - Enable CUDAGraphs for V1 (#14626)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-13 20:42:04 -07:00
32ef4983cd [V1] Temporarily disable FlashInfer Rejection Sampler (#14788)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-13 20:40:35 -07:00
ad19c8a003 [V1] Move OOM check into sampler run (#14728)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2025-03-13 20:40:23 -07:00
2a602b055a forward fix PR 14245, restore build on ROCm 6.2 (#14709)
Signed-off-by: Jeff Daily <jeff.daily@amd.com>
2025-03-13 20:40:15 -07:00
7888e1d0a3 [V1] TPU - Enable prefix caching by default (#14773) 2025-03-13 20:40:05 -07:00
60c872d4b6 [Doc] Fix small typo in Transformers fallback (#14791)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-03-13 20:33:12 -07:00
3fb17d26c8 [Doc] Fix typo in documentation (#14783)
Signed-off-by: yasu52 <tsuguro4649@gmail.com>
2025-03-13 20:33:09 -07:00
d47807ba08 [Attention] Remove slow setattr in MLA (#14769)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-13 21:31:14 +00:00
02fcaa3d0a [V1] Detokenizer: Respect Stop Tokens + not include_stop_str_in_output (#14624)
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
2025-03-13 19:07:34 +00:00
8a4a2efc6f [V1][Core] using cached vocab_size for Structured Outputs (#14630)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2025-03-13 11:39:28 -07:00
8e9ffd37d6 [Misc] Clean up processor tests (#14771)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-13 18:25:37 +00:00
01b3fd0af7 [V1][Minor] Minor enhancements on scheduler (#14732)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-13 08:53:22 -07:00
f53a0586b9 [Bugfix] Fix prompt format of GLM4V (#14539)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-13 11:37:17 +00:00
b1cc4dfef5 [VLM] Support loading InternVideo2.5 models as original InternVLChatModel (#14738)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-13 03:10:02 -07:00
382403921f [VLM] Support pan-and-scan for Gemma3 multi-modal processor (#14672)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-03-13 02:23:12 -07:00
a73122de96 [Bugfix] fix benchmark moe (#14653)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-13 16:12:42 +08:00
bd44b812cb [CI/Build] Delete ultravox LoRA test (#14730)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-13 07:57:39 +00:00
55211b01e8 [Bugfix] Fix chunked prefill for GGUF (#14666)
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>
2025-03-13 07:19:03 +00:00
5d043c1685 [Quant] Bamba SupportsQuant (#14698)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2025-03-13 04:57:05 +00:00
36d1ccb286 [Quant] BartModel SupportsQuant (#14699)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2025-03-13 04:55:59 +00:00
1bc3b739c4 [V1][TPU] Add assertion on multi-step-scheduler (#14707)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
2025-03-12 21:37:58 -07:00
1bd32bc8dd [Config][Disaggregated] Add timeout configuration for the torch.store and add KVTransferConfig.kv_connector_extra_config (#14367)
Signed-off-by: Mathis Felardos <mathis@mistral.ai>
2025-03-12 20:15:20 -07:00
128bf75283 [BugFix][TritonMLA] Process weights after model loading for GGUF (#14555)
Signed-off-by: TianyuanWu <Tianyuan.Wu@amd.com>
2025-03-12 20:14:36 -07:00
a94a699c3f [ROCm][FP8] Fix for adjustments needed only for fnuz (#14689)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-03-12 20:14:04 -07:00
ab426ec9c0 Add ray[data] as tpu dependency (#14691)
Signed-off-by: <ricliu@google.com>
Signed-off-by: Richard Liu <ricliu@google.com>
2025-03-12 20:13:48 -07:00
165290d357 [bugfix] fixup warning message for plugged schedulers for v1 (#14700)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-03-12 20:12:13 -07:00
ce20124671 [release] Add force remove for TPU logs (#14697) 2025-03-12 22:35:18 +00:00
53be4a8634 [V1] Allow sliding window + prefix caching (#13069)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-12 11:21:19 -07:00
f5d3acd474 [BugFix][V1] Fix parallel sampling finishing/aborts (#14512)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-12 10:29:48 -07:00
916836bbfb [FEAT] [ROCm] [Embedding] Add encoder-only model support into ROCm Flash Attention to enable embedding models. (#14664)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-03-12 09:31:19 -07:00
d9f83d6206 [ROCm] Enable chunked prefill/paged attention in MLA on ROCm (#14316)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-03-12 15:51:20 +00:00
4a754fcf15 [Bugfix] Missing thumbnail from NVLM-D processor (#14633)
Signed-off-by: ameyanjarlekar <aanjarlekar@nvidia.com>
2025-03-12 08:50:49 -07:00
c0c25e25fa [Model] Add support for Gemma 3 (#14660)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-12 08:36:33 -07:00
45f3f3f59e [ROCm][Bugfix] Ensure that the moe_wna16_gemm kernel is not built on ROCm platforms. (#14629)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-03-12 08:00:28 -04:00
ff47aab056 [CPU] Upgrade CPU backend to torch-2.6 (#13381)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-03-12 10:41:13 +00:00
debd6bbf09 [Kernel] Add ModelOpt FP4 Checkpoint Support (#12520)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2025-03-12 05:13:11 +00:00
5c538c37b2 [V1][Bugfix][Spec Decode] Fix incorrect outputs in V1 speculative decoding due to batch indexing (#14645)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
2025-03-11 22:12:41 -07:00
e22ee1e7a2 [Kernel] GGUF MoE kernel (#14613)
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>
2025-03-12 03:33:27 +00:00
e392d85831 [Core] Refactor QKVCrossParallelLinear implementation to support BNB 4-bit quantization (#14545)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-11 20:12:52 -07:00
77a318bd01 [V1][Core] Support MistralTokenizer for Structured Output (#14625)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2025-03-12 10:40:09 +08:00
80e78d02ac [Model] Extend Ultravox to accept audio longer than 30s (#13631)
Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai>
2025-03-12 10:27:10 +08:00
4a42b9f5d6 [Doc] Update benchmarks README (#14646)
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2025-03-11 19:23:04 -07:00
47532cd9f4 [core][V1] pluggable scheduler (#14466)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-03-12 01:15:15 +00:00
36e0c8f7da [Feature] Add vllm bench CLI (#13993)
Signed-off-by: Randy Chen <acad.randyjhc@gmail.com>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-12 00:31:48 +00:00
9f583e360c [release] Add commands to clean up logs on TPU release node (#14642) 2025-03-12 00:14:50 +00:00
b706d898af [Bugfix][V1][PP] Only warmup sampler at last PP rank (#14643)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-11 23:40:07 +00:00
863d315c86 [V1][TPU] Pad the block_table.shape[1] so the ragged paged attention can handle correctly (#14597) 2025-03-11 19:12:26 -04:00
d374f04a33 Fix run_tpu_test (#14641)
Signed-off-by: <ricliu@google.com>
Signed-off-by: Richard Liu <ricliu@google.com>
2025-03-11 21:14:33 +00:00
61a01b27a7 [V1] Delay all xgrammar usage until needed (#14616)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-11 20:21:33 +00:00
53056731fd fix some typos: supported_head_sizes (#14627) 2025-03-11 10:38:24 -07:00
4cbf286794 [V1] Remove cache from StructuredOutputManager (#14622)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-11 10:36:07 -07:00
c6e14a61ab [Hardware][Intel GPU] upgrade IPEX dependency to 2.6.10. (#14564)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-03-11 17:11:47 +00:00
07b4b7a37f [BugFix/Build] Fix sparse kernels not getting built on hopper (#14572)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-11 17:09:03 +00:00
07964e2f30 docs: Add documentation for s390x cpu implementation (#14198)
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-11 17:02:17 +00:00
4bf82d4b90 [V1] Add regex structured output support with xgrammar (#14590)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-11 23:03:44 +08:00
9ab326713f Uninstall dependencies before installing requirements/tpu.txt (#14586)
Signed-off-by: <ricliu@google.com>
Signed-off-by: Richard Liu <ricliu@google.com>
2025-03-11 08:01:35 -07:00
af295e9b01 [Bugfix] Update --hf-overrides for Alibaba-NLP/gte-Qwen2 (#14609)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-11 07:59:43 -07:00
a1c8f3796c dynamic dispatch of fp8 kernels (#14245)
Signed-off-by: Jeff Daily <jeff.daily@amd.com>
2025-03-11 10:54:56 -04:00
08a1a1121d benchmarks: simplify test jsonschema (#14567)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-11 13:39:30 +00:00
1477ffc381 [VLM] Cleanup siglip legacy code and fix broken paligemma multimodal processor (#14602)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-11 11:27:36 +00:00
70b808fe1a [Perf]: Optimize qwen2-vl to reduce cudaMemcpyAsync (#14377)
Signed-off-by: cynthieye <987073381@qq.com>
2025-03-11 07:39:56 +00:00
63d635d179 [Misc] Correct deepseek-vl2 chat template (#14558)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-11 04:37:11 +00:00
1fc973c0b5 [V1][Core] Fix memory issue with logits & sampling (#14508)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Varun Sundar Rabindranath <3337719+varun-sundar-rabindranath@users.noreply.github.com>
2025-03-11 04:03:41 +00:00
c982ac5722 [Bugfix] Fix FP16 overflow for DeepSeek V2 (#13232)
Signed-off-by: Yida Wu <yida.wu@amd.com>
2025-03-10 20:46:59 -07:00
4290b704ff [V1][PP] Do not block engine core when no requests to schedule (#14585)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-10 19:48:24 -07:00
c91b64f749 [neuron] add reshape_and_cache (#14391) 2025-03-10 18:37:29 -07:00
d6123170d5 [Neuron] Add Neuron device communicator for vLLM v1 (#14085) 2025-03-10 18:37:04 -07:00
485afdd3cb [MISC][V1] Handle exception of current_platform.get_device_name() in arg_utils (#14379)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-10 20:42:11 -04:00
90e88ab756 [Kernel] moe wna16 cuda kernel (#13321)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-03-10 20:12:40 -04:00
04421dff8a [V1] Prevent xgrammar from breaking TPU support (#14575)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-10 23:06:19 +00:00
432d6dad15 Fix typo in benchmark_serving_structured_output.py (#14566)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-10 14:58:58 -07:00
5ff0d32580 [V1] LoRA - Add triton kernels for V1 (#13096)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-10 17:27:53 -04:00
0967110e42 [Minor] Update the tqdm bar for parallel sampling (#14571)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-10 14:23:48 -07:00
fb0acb6c72 [Perf] Improve MLA on V1 (#14540)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-03-10 12:06:58 -07:00
92b0ce2ac7 [Bugfix][v1] Fix llava-hf/llava-1.5-7b-hf being broken on V1 (#14554)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-10 18:24:51 +00:00
bc2d4473bf [Docs] Make installation URLs nicer (#14556)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-10 10:43:08 -07:00
3b352a2f92 Correct capitalisation: VLLM -> vLLM (#14562)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-10 16:36:21 +00:00
dea985aef0 [V1][Bugfix] Fix handling of second_per_grid_ts for Qwen2-VL & Qwen2.5-VL (#14548)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-10 16:03:11 +00:00
39be30351f Correct capitalisation: Github -> GitHub (#14561)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-10 15:53:33 +00:00
001a9c7b0d [Doc] Update PaliGemma note to a warning (#14565)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-10 15:02:28 +00:00
89cdaa83e7 [Kernel] Add more dtype support for GGUF kernels (#14043)
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>
Signed-off-by: SzymonOzog <szymon.ozog@gmail.com>
2025-03-10 07:30:04 -07:00
b0746fae3d [Frontend] support image embeds (#13955)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-03-10 12:36:03 +00:00
60a98b2de5 [Docs] Mention model_impl arg when explaining Transformers fallback (#14552)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-10 12:13:10 +00:00
460f553a6d [Misc] Add log information for handle_process_request. (#14130)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-03-10 08:40:50 +00:00
1253b15774 [Feature] Consolidate performance benchmark datasets (#14036)
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-03-10 07:23:11 +00:00
dc74613fa2 [Bugfix] Wrong requirements path - rocm (#14527)
Signed-off-by: Martin Hoyer <mhoyer@redhat.com>
2025-03-10 02:49:46 +00:00
a21076ed3a [Misc] Ensure out-of-tree quantization methods are recognized by CLI args (#14328)
Signed-off-by: liuyanyi <wolfsonliu@163.com>
2025-03-09 12:13:31 +00:00
212007b168 [Hardware][TPU] Fix the recompiling issue in logits processor after warmup (#14510)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-03-09 05:44:39 -04:00
fb16eea48b [Bugfix] Revert QKVCrossParallelLinear usage in Mllama to keep BNB quantization work (#14498)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-09 04:47:45 +00:00
73ae0b44e9 [Bugfix] Fix tqdm progress bar when SamplingParams.n > 1 (#12428)
Signed-off-by: Yuchen Yan <740987012@qq.com>
2025-03-08 20:14:53 -08:00
6d7f037748 [Feat] Support chunked prefill for LMCache connector (#14505)
Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
2025-03-08 19:30:06 -08:00
10f7552789 [V1][TPU] Remove unnecessary padding for running on TPU. (#14467) 2025-03-08 21:56:04 -05:00
b0d541947a [Attention] Default to FlashMLA backend for MLA (#14451)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-08 18:18:39 -08:00
5f0b53c6ea Revert "[V1][Core] Fix memory issue with logits & sampling" (#14504)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-03-08 17:43:37 -08:00
eb8b5eb183 [V1] Support bad_words in sampler (#13376)
Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-03-08 14:50:26 -08:00
9513290032 [Misc] Upgrade to Python 3.9 typing for additional directories (#14492)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-08 17:35:50 +00:00
0d5e73d30e Update CODEOWNERS for structured output (#14496)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-08 17:19:51 +00:00
609ef61fea [Bugfix] Fix profiling OOM and decouple encoder multimodal profiling (#14361)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-08 16:52:34 +00:00
db84f5eb3b [Bugfix] DeepSeek Accuracy (#14476)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-08 16:47:03 +00:00
206e2577fa Move requirements into their own directory (#12547)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-08 16:44:35 +00:00
e02883c400 [Misc] Don't run ruff at all on 3rd party libs (#14493)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-08 07:16:40 -08:00
9085aabd62 [benchmarks] Add option to use unique jsonschema for each request (#14457)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-08 06:36:39 -08:00
8d5aa466fb [V1][Core] Fix memory issue with logits & sampling (#13776)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-08 06:11:04 -08:00
0b7f06b447 [Misc] add use_tqdm_on_load to reduce logs (#14407)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
2025-03-08 05:57:46 -08:00
03fe18ae0f [VLM] Add TP support for Phi-4-MM (#14453)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-08 05:57:14 -08:00
cb8bdfade2 [V1] TPU - Add tensor parallel support via Ray (#13618)
Signed-off-by: Alexander Matveev <amatveev@redhat.com>
2025-03-08 08:19:38 -05:00
33f227e16b [CI/Build] Use a fixed seed to avoid flaky tests (#14480)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-08 11:30:09 +00:00
cfd0ae8234 Add RLHF document (#14482)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-08 09:51:39 +00:00
7caff01a7b [Build/BugFix] Fix hopper 12.8 build (#14354)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-08 08:11:56 +00:00
be0b399d74 Add training doc signposting to TRL (#14439)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-08 07:35:07 +00:00
b8b0ccbd2d [Bugfix] Make the deviceprofiler include LoRA memory. (#14469)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-08 07:12:22 +00:00
c908a07f57 [Doc] Added QwQ-32B to the supported models list in the reasoning out… (#14479)
Signed-off-by: WangErXiao <863579016@qq.com>
2025-03-08 07:07:32 +00:00
7b6fd6e486 [Doc] Add doc for Qwen models tool calling (#14478)
Signed-off-by: WangErXiao <863579016@qq.com>
2025-03-08 06:58:46 +00:00
47512b3200 Default to generation_config from model (#12622)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-08 14:46:15 +08:00
3b9c6c6947 [CI/Build] refactor: set timezone of container to UTC (#12888)
Signed-off-by: Roger Meier <r.meier@siemens.com>
2025-03-07 22:42:01 -08:00
4aae667668 [core] add extra_args to SamplingParams (#13300)
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com>
2025-03-08 14:41:18 +08:00
9f3bc0f58c [MISC][V1] Register process killing handler only in the main thread (#14380)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-07 22:40:06 -08:00
980385f8c1 [Bugfix][Disaggregated] Add a check in send_kv_caches_and_hidden_states and fix the reshape of the KVCache (#14369)
Signed-off-by: Mathis Felardos <mathis@mistral.ai>
2025-03-07 22:39:31 -08:00
ca7a2d5f28 Revert "[Perf] Reduce MLA CPU overheads in V1 (#14384)" (#14471) 2025-03-07 22:18:53 -08:00
333681408f [Bugfix][V1] Handle MLA in kv_cache_interface (#14462)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-07 22:18:25 -08:00
ef64044079 [V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC (#13949) 2025-03-08 01:48:12 +00:00
66e16a038e [Bugfix] Fix torch_xla which can't handle None seed introduced in #14274 (#14459)
Signed-off-by: Yarong Mu <ymu@google.com>
2025-03-07 23:17:04 +00:00
e1f0835ae0 [V1][Metrics] Fix traceback with preemptions+LoRA (#14220)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-07 15:36:16 -05:00
8ed5421aaa [V1] Eagerly remove finished requests from the batch (#14388)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-07 10:56:00 -08:00
c6359e8ca6 [v1] torch.compile integration explanation (#14437)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-08 01:55:50 +08:00
952a074980 [Misc] Add Phi4-MM example (#14343)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-07 17:28:52 +00:00
d0feea31c7 [Kernel] optimize performance of gptq marlin kernel when n is small (#14138)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
2025-03-07 11:53:38 -05:00
58abe35455 [Benchmarks] Make detokenization optional in benchmark scripts (#11697)
Signed-off-by: Jeremy Arnold <Jeremy.Arnold@amd.com>
2025-03-07 08:09:00 -08:00
f7ebad2307 [Doc] Update prefix_caching.md to match the example image (#14420) 2025-03-07 15:29:00 +00:00
80e9afb5bc [V1][Core] Support for Structured Outputs (#12388)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-03-07 07:19:11 -08:00
1e3598edeb Use the optimized block sizes after tuning the kernel. (#14329) 2025-03-07 13:25:13 +00:00
f7a6bd0fa1 Fix missing kv_caches and attn_metadata in OpenVINOCausalLM (#14271)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-07 12:30:42 +00:00
0ca3b8e01c [BUGFIX] Skip tokenization support for throughput benchmark (#12712)
Signed-off-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: root <root@banff-cyxtera-s73-5.ctr.dcgpu>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2025-03-07 02:51:47 -08:00
cc10281498 [Misc] Set default value of seed to None (#14274)
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>
2025-03-07 10:40:01 +00:00
05fb6718f0 [Bugfix] Clean up multi-modal processors (#14417)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-07 10:33:38 +00:00
12c29a881f [Bugfix] Further clean up LoRA test (#14422)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-07 10:30:55 +00:00
70da0c0748 correct wrong markdown syntax (#14414)
Signed-off-by: vincent-pli <justdoit.pli@gmail.com>
2025-03-07 08:01:18 +00:00
c1588a2c94 [GH] Auto-apply multi-modality label to relevant PRs (#14402)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-07 15:26:32 +08:00
8ca7a71df7 OpenVINO: added CPU-like conditions (#14338)
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
2025-03-06 22:24:49 -08:00
63137cd922 [Build] Add nightly wheel fallback when latest commit wheel unavailable (#14358)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-06 22:10:57 -08:00
ddd1ef66ec [Bugfix] Fix JambaForCausalLM LoRA (#14370)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-06 22:05:47 -08:00
e5e03c2c1b [BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs (#14396) 2025-03-06 21:56:06 -08:00
e1744502c2 [FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object (#14390)
Signed-off-by: luka <luka@neuralmagic.com>
2025-03-07 05:20:16 +00:00
dae6896977 [Perf] Reduce MLA CPU overheads in V1 (#14384)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-06 19:59:14 -08:00
c34eeec58d [Bugfix] Correctly call cudaProfilerStop in benchmarks script (#14183)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-03-07 00:42:49 +00:00
ad60bbb2b2 [Doc] Fix a typo (#14385) 2025-03-06 16:31:52 -08:00
0578e5a462 [Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue (#14310)
Signed-off-by: Chengji Yao <chengjiyao@google.com>
2025-03-06 23:31:05 +00:00
04222984f8 [Docs] Add nsight guide to profiling docs (#14298)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-06 14:19:58 -08:00
6832707e90 [V1][Bugfix] Standardize quantized kv cache rejection for attention backends (#14221)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-06 14:18:29 -08:00
6b2ef5cd17 [Bug] Fix Attention when ignored in by quant_method (#14313)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-06 14:18:06 -08:00
958adce478 [Bugfix] Fix use_direct_call condition in FusedMoE layer for (#14382)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-06 14:17:21 -08:00
99b0915d3b [Kernel] Add needs_fixed_stride_order tag to most GEMMs (#14306)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-06 14:17:09 -08:00
8ca2b21c98 [CI] Disable spawn when running V1 Test (#14345)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-03-06 21:52:46 +00:00
d9292786e1 [CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa (#13569)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-06 16:08:36 -05:00
cc2f9b32c8 [Distributed] Add enable_expert_parallel arg (#14305)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-06 18:54:45 +00:00
cd579352bf [V1] Do not detokenize if sampling param detokenize is False (#14224)
Signed-off-by: Himanshu Jaju <hj@mistral.ai>
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-03-06 10:40:24 -08:00
9f1710f1ac Fix mla prefill context performance (#13897)
Signed-off-by: ZhongYingMatrix <zhongyingmatrix@gmail.com>
2025-03-06 09:35:49 -08:00
e642ec962c Add authors to license header. (#14371)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
2025-03-06 08:43:09 -08:00
ada19210a3 Adding cpu inference with VXE ISA for s390x architecture (#12613)
Signed-off-by: Dilip Gowda Bhagavan <dilip.bhagavan@ibm.com>
Signed-off-by: Rishika Kedia <rishika.kedia@in.ibm.com>
Co-authored-by: Rishika Kedia <rishika.kedia@in.ibm.com>
2025-03-06 08:40:53 -08:00
bf0560bda9 Reinstate best_of for V0 (#14356)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-06 08:34:22 -08:00
151b08e0fe [RLHF] use worker_extension_cls for compatibility with V0 and V1 (#14185)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-07 00:32:46 +08:00
81b2f4a45f [Doc] Fix date typo in README.md (#14366)
Signed-off-by: Jitse Klomp <jitse.klomp@conclusionxforce.nl>
2025-03-06 08:29:57 -08:00
82551ad616 [Core] Don't use cache during multi-modal profiling (#14336) 2025-03-06 08:03:31 -08:00
caac5c2e59 [Bugfix][Core] fix abort_seq_group and memory leak when n>1 (#14326)
Signed-off-by: courage17340 <courage17340@163.com>
2025-03-06 23:59:32 +08:00
6bd1dd9d26 [Kernel] [V1] Improved performance for V1 Triton (ROCm) backend (#14152) 2025-03-06 07:39:16 -08:00
4f27044aab [Doc] Correct beam_search using in generative_models.md (#14363) 2025-03-06 15:37:10 +00:00
0ddc991f5c [Doc] Update reasoning with stream example to use OpenAI library (#14077)
Signed-off-by: liuyanyi <wolfsonliu@163.com>
2025-03-06 13:20:37 +00:00
fa82b93853 [Frontend][Docs] Transcription API streaming (#13301)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-03-06 10:39:35 +00:00
69ff99fdcd [Core] Optimizing cross-attention QKVParallelLinear computation (#12325)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal>
Co-authored-by: NickLucche <nick@nlucches-4xa100.c.openshift-330514.internal>
2025-03-06 09:37:26 +00:00
5d802522a7 [V1][VLM][Pixtral-HF] Support Pixtral-HF on V1 (#14275)
Signed-off-by: Linkun Chen <github@lkchen.net>
2025-03-06 08:58:41 +00:00
1769928079 [Model] Update Paligemma multimodal processing with PromptUpdate (#14015)
Signed-off-by: Kyle Huang <kylhuang@nvidia.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-03-06 08:31:38 +00:00
ed6ea06577 [Hardware] Update the flash attn tag to support Blackwell (#14244) 2025-03-05 22:01:37 -08:00
5ee10e990d [Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention (#11301) 2025-03-05 20:00:53 -08:00
3dbd2d813a [V1] LoRA - Enable more V1 tests (#14315)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-03-06 11:55:42 +08:00
f5f7f00cd9 [Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1 (#14114) 2025-03-06 03:49:20 +00:00
abcc61e0af [misc] Mention ray list nodes command to troubleshoot ray issues (#14318)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-03-06 02:00:36 +00:00
f6bb18fd9a [BugFix] MLA + V1, illegal memory access and accuracy issues (#14253)
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-03-05 17:10:13 -08:00
71eaf8969b [Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation (#13850)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-05 17:09:29 -08:00
ca100c90fe Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (#13917)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-05 17:08:51 -08:00
ffad94397d [CI/Build] Use spawn multiprocessing mode for V1 test pipeline (#14243)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-03-05 17:08:02 -08:00
4dacaa4a83 [BugFix] Fix prefix caching V0 MLA (#14255)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Ying Zhong <zhongyingmatrix@gmail.com>
2025-03-05 17:07:42 -08:00
a7ea35aa67 [Bugfix] Remove num_tokens_across_dp (#14302)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-05 23:55:55 +00:00
1e3e76b6cc [Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch (#14237)
Signed-off-by: pyc96 <pychen96@gmail.com>
2025-03-05 22:22:40 +00:00
53ea6ad830 [V1][Easy] Add empty allowed_token_ids in the v1 sampler test (#14308)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-03-05 21:41:18 +00:00
1b7624bf5c [misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env (#14267) 2025-03-05 21:28:50 +00:00
ac60dc7fe1 [V1][BugFix] Fix for mixed top_k batch (#14301)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Ye Cao <caoye.cao@alibaba-inc.com>
2025-03-05 20:43:04 +00:00
a4f1ee35d6 Deprecate best_of Sampling Parameter in anticipation for vLLM V1 (#13997)
Signed-off-by: vincent-4 <vincentzhongy+githubvincent4@gmail.com>
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-05 20:22:43 +00:00
a32c8669ca [V1][Minor] Remove obsolete FIXME comment (#14304)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-05 11:59:23 -08:00
ca2ca8de57 [Docs] Add Meta Slides (#14297)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-03-05 08:30:23 -08:00
f71b00a19e [Bugfix] Fix broken vision language example (#14292)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-05 15:57:10 +00:00
8f808cf86e prefix_caching.md: Fixed typo (#14293)
Signed-off-by: Daivid Savernin-Frenk <daivid.frank@TurboNext.ai>
2025-03-05 15:43:13 +00:00
7bab4bb048 [Misc] Add Qwen2MoeForCausalLM moe tuning support (#14276)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-05 23:11:29 +08:00
e17e4488bd [LoRA] Remove linear hack outside transformers backend (#14177)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-03-05 15:06:28 +00:00
257e200a25 [V1][Frontend] Add Testing For V1 Runtime Parameters (#14159)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-03-05 14:18:55 +00:00
47d4a7e004 Small update for external_launcher backend docs (#14288) 2025-03-05 21:30:00 +08:00
7f89a594dd [Doc] [3/N] Refer code examples for common cases in dev multimodal processor (#14278)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-03-05 12:29:50 +00:00
961644e6a8 [Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID (#14217)
Signed-off-by: Iacopo Poli <iacopo@lighton.ai>
2025-03-05 11:44:10 +00:00
8d6cd32b7b [Bugfix][V1] Fix allowed_token_ids for v1 Sampler (#14169)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-03-05 08:49:44 +00:00
ec79b67c77 [Misc][V1] Avoid using envs.VLLM_USE_V1 in mm processing (#14256)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-03-05 07:37:16 +00:00
32985bed7c [Frontend] Allow return_tokens_as_token_ids to be passed as a request param (#14066)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
2025-03-05 06:30:40 +00:00
dae9ec464c Temporarily disable test_awq_gemm_opcheck (#14251)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-05 06:10:35 +00:00
6eaf93020d [platforms] improve rocm debugging info (#14257) 2025-03-04 21:32:18 -08:00
72c62eae5f [V1] EP/TP MoE + DP Attention (#13931) 2025-03-04 21:27:26 -08:00
0a995d5434 [Model] New model support for Phi-4-multimodal-instruct (#14119) 2025-03-04 20:57:01 -08:00
ade3f7d988 [V1][Bugfix] Do not reset prefix caching metrics (#14235) 2025-03-05 04:39:13 +00:00
0df25101d6 [Bugfix] Fix gptq_marlin for deepseek-v3 (#13750)
Signed-off-by: dangshunya <dangshunya@baichuan-inc.com>
Co-authored-by: dangshunya <dangshunya@baichuan-inc.com>
2025-03-05 12:25:53 +08:00
e123aafdf0 Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0 (#14157)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-05 12:25:24 +08:00
5b143d33be Moved numba from common requirements to cuda/rocm specific requirements (#14199)
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
2025-03-05 12:25:00 +08:00
eb59b5a6cb [misc] announce china meetup (#14248)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-05 10:33:50 +08:00
fbfc3ee37e [V1][TPU] TPU multimodal model support for ragged attention (#14158)
Signed-off-by: Michael Goin <mgoin64@gmail.com>
2025-03-04 19:58:48 -05:00
3e1d223626 [ROCm] Disable a few more kernel tests that are broken on ROCm (#14145)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-03-04 23:37:55 +00:00
4f5b059f14 Clean up unused padding_idx variables across many model definitions (#13240)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-03-04 21:27:00 +00:00
288ca110f6 [Security] Serialize using safetensors instead of pickle in Mooncake Pipe (#14228)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
2025-03-04 21:10:32 +00:00
c2bd2196fc [v1][Metrics] Add design doc (#12745)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-04 20:36:55 +00:00
550c7ba3dc [Docs] Update Dockerfile dependency image (#14215)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-04 20:22:11 +00:00
e5b2f1601a [Frontend] Do prompt_logprobs clamping for chat as well as completions (#14225)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-04 20:13:06 +00:00
9badee53de Fix performance when --generation-config is not None (#14223)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-04 20:59:22 +01:00
beebf4742a [TPU][Profiler] Support start_profile/stop_profile in TPU worker (#13988)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-03-04 14:40:06 -05:00
f89978ad7c add cutlass support for blackwell fp8 gemm (#13798) 2025-03-04 07:55:07 -08:00
b3cf368d79 [V1][Molmo] Fix get_multimodal_embeddings() in molmo.py (#14161) 2025-03-04 15:43:59 +00:00
c8525f06fc [V0][Metrics] Deprecate some questionable request time metrics (#14135)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-04 15:11:33 +00:00
5db6b2c961 [V1][BugFix] Fix remaining sync engine client shutdown errors/hangs (#13869)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-04 15:06:47 +00:00
6247bae6c6 [Bugfix] Restrict MacOS CPU detection (#14210)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-04 22:25:27 +08:00
3610fb4930 [doc] add "Failed to infer device type" to faq (#14200)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-04 20:47:06 +08:00
71c4b40562 [sleep mode] error out with expandable_segments (#14189)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-04 18:54:19 +08:00
ac65bc92df [platform] add debug logging during inferring the device type (#14195)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-03-04 18:39:16 +08:00
f78c0be80a Fix benchmark_moe.py tuning for CUDA devices (#14164) 2025-03-03 21:11:03 -08:00
66233af7b6 Use math.prod instead of np.prod for trivial ops (#14142) 2025-03-03 21:09:22 -08:00
bf13d40972 [core] Pass all driver env vars to ray workers unless excluded (#14099)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-03-04 11:44:17 +08:00
989f4f430c [Misc] Remove lru_cache in NvmlCudaPlatform (#14156)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-04 11:09:34 +08:00
bb5b640359 [core] moe fp8 block quant tuning support (#14068)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-03-04 01:30:23 +00:00
c060b71408 [Model] Add support for GraniteMoeShared models (#13313)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-03-04 08:04:52 +08:00
79e4937c65 [v1] Add comments to the new ragged paged attention Pallas kernel (#14155)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-03-03 23:00:55 +00:00
cd1d3c3df8 [Docs] Add GPTQModel (#14056)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-03-03 21:59:09 +00:00
19d98e0c7d [Kernel] Optimize moe intermediate_cache usage (#13625)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-03 16:29:53 -05:00
2b04c209ee [Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3 (#14100)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-03-03 14:20:24 -07:00
ae122b1cbd [WIP][V1][Metrics] Implement max_num_generation_tokens, request_params_n, and request_params_max_tokens metrics (#14055)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-03 19:04:45 +00:00
872db2be0e [V1] Simplify stats logging (#14082)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-03-03 10:34:14 -08:00
2dfdfed8a0 [V0][Metrics] Deprecate some KV/prefix cache metrics (#14136)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-03 18:25:46 +00:00
c41d27156b [V0][Metrics] Remove unimplemented vllm:tokens_total (#14134)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-03 17:50:22 +00:00
91373a0d15 Fix head_dim not existing in all model configs (Transformers backend) (#14141)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-03 17:48:11 +00:00
848a6438ae [ROCm] Faster Custom Paged Attention kernels (#12348) 2025-03-03 09:24:45 -08:00
98175b2816 Improve the docs for TransformersModel (#14147)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-03-03 17:03:05 +00:00
4167252eaf [V1] Refactor parallel sampling support (#13774)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-03-03 08:15:27 -08:00
f35f8e2242 [Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1 (#13921)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-03-03 16:43:14 +08:00
b87c21fc89 [Misc][Platform] Move use allgather to platform (#14010)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-03-03 15:40:04 +08:00
e584b85afd [Misc] duplicate code in deepseek_v2 (#14106) 2025-03-03 14:10:11 +08:00
09e56f9262 [Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure (#14051) 2025-03-02 17:35:01 -08:00
cf069aa8aa Update deprecated Python 3.8 typing (#13971) 2025-03-02 17:34:51 -08:00
bf33700ecd [v0][structured output] Support reasoning output (#12955)
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
2025-03-02 14:49:42 -05:00
bc6ccb9878 [Doc] Source building add clone step (#14086)
Signed-off-by: qux-bbb <1147635419@qq.com>
2025-03-02 10:59:50 +00:00
82fbeae92b [Misc] Accurately capture the time of loading weights (#14063)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
2025-03-01 17:20:30 -08:00
cc5e8f6db8 [Model] Add LoRA support for TransformersModel (#13770)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-02 09:17:34 +08:00
d54990da47 [v1] Add __repr__ to KVCacheBlock to avoid recursive print (#14081) 2025-03-01 20:46:02 +00:00
b9f1d4294e [v1][Bugfix] Only cache blocks that are not in the prefix cache (#14073) 2025-03-01 08:25:54 +00:00
b28246f6ff [ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBackend class (#14065)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-03-01 07:18:32 +00:00
3b5567a209 [V1][Minor] Do not print attn backend twice (#13985)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-03-01 07:09:14 +00:00
fdcc405346 [Doc] Consolidate whisper and florence2 examples (#14050) 2025-02-28 22:49:15 -08:00
8994dabc22 [Documentation] Add more deployment guide for Kubernetes deployment (#13841)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
2025-03-01 06:44:24 +00:00
02296f420d [Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocExecutor (#14053) 2025-02-28 22:31:01 -08:00
6a92ff93e1 [Misc][Kernel]: Add GPTQAllSpark Quantization (#12931) 2025-02-28 22:30:59 -08:00
6a84164add [Bugfix] Add file lock for ModelScope download (#14060)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-03-01 06:10:28 +00:00
f64ffa8c25 [Docs] Add pipeline_parallel_size to optimization docs (#14059)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-03-01 05:43:54 +00:00
bd56c983d6 [torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 case, rename RedundantReshapesPass to NoopEliminationPass (#10902)
Signed-off-by: luka <luka@neuralmagic.com>
2025-02-28 16:20:11 -07:00
084bbac8cc [core] Bump ray to 2.43 (#13994)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-02-28 21:47:44 +00:00
28943d36ce [v1] Move block pool operations to a separate class (#13973)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-28 20:53:31 +00:00
b526ca6726 Add RELEASE.md (#13926)
Signed-off-by: atalman <atalman@fb.com>
2025-02-28 12:25:50 -08:00
e7bd944e08 [v1] Cleanup the BlockTable in InputBatch (#13977)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-02-28 19:03:16 +00:00
c3b6559a10 [V1][TPU] Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379)
Signed-off-by: Xiongfei Wei <isaacwxf23@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-02-28 11:01:36 -07:00
4be4b26cb7 Fix entrypoint tests for embedding models (#14052) 2025-02-28 08:56:44 -08:00
2aed2c9fa7 [Doc] Fix ROCm documentation (#14041)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-02-28 16:42:07 +00:00
9b61dd41e7 [Bugfix] Initialize attention bias on the same device as Query/Key/Value for QwenVL Series (#14031) 2025-02-28 07:36:08 -08:00
f7bee5c815 [VLM][Bugfix] Enable specifying prompt target via index (#14038) 2025-02-28 07:35:55 -08:00
e0734387fb [Bugfix] Fix MoeWNA16Method activation (#14024) 2025-02-28 15:22:42 +00:00
f58f8b5c96 Update AutoAWQ docs (#14042)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-28 15:20:29 +00:00
b3f7aaccd0 [V1][Minor] Restore V1 compatibility with LLMEngine class (#13090) 2025-02-28 00:52:25 -08:00
b91660ddb8 [Hardware][Intel-Gaudi] Regional compilation support (#13213) 2025-02-28 00:51:49 -08:00
76c89fcadd Use smaller embedding model when not testing model specifically (#13891) 2025-02-28 00:50:43 -08:00
b9e41734c5 [Bugfix][Disaggregated] Patch the in-flight batching on the decode node in SimpleConnector to avoid hangs in SimpleBuffer (NCCL-based) (#13987)
Signed-off-by: Mathis Felardos <mathis@mistral.ai>
2025-02-28 07:53:45 +00:00
1088f06242 [Doc] Move multimodal Embedding API example to Online Serving page (#14017)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-28 07:12:04 +00:00
73e0225ee9 [Bugfix] Check that number of images matches number of <|image|> tokens with mllama (#13911)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2025-02-28 04:00:45 +00:00
6c85da3a18 [V1] SupportsV0Only protocol for model definitions (#13959)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-02-27 20:02:15 -05:00
67fc426845 [Misc] Print FusedMoE detail info (#13974) 2025-02-27 18:53:13 -05:00
9804145cac [Model][Speculative Decoding] Expand DeepSeek MTP code to support k > n_predict (#13626)
Signed-off-by: Benjamin Chislett <benjamin.chislett@centml.ai>
2025-02-27 15:28:08 -08:00
2e94b9cfbb [Attention] Flash MLA for V1 (#13867)
Signed-off-by: Yang Chen <yangche@fb.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Yang Chen <yangche@fb.com>
2025-02-27 23:03:41 +00:00
8294773e48 [core] Perf improvement for DSv3 on AMD GPUs (#13718)
Signed-off-by: qli88 <qiang.li2@amd.com>
2025-02-27 22:14:30 +00:00
cd813c6d4d [V1][Minor] Minor cleanup for GPU Model Runner (#13983)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-27 13:11:40 -08:00
38acae6e97 [ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups (#13970)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-02-27 20:31:47 +00:00
a2dd48c386 [VLM] Deprecate legacy input mapper for OOT multimodal models (#13979)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-27 19:14:55 +00:00
126f6beeb4 Bump azure/setup-helm from 4.2.0 to 4.3.0 (#13742)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-02-27 19:04:10 +00:00
58d1b2aa77 [Attention] MLA support for V1 (#13789)
Signed-off-by: Yang Chen <yangche@fb.com>
2025-02-27 13:14:17 -05:00
f1579b229d [VLM] Generalized prompt updates for multi-modal processor (#13964)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-27 17:44:25 +00:00
7864875879 [Bugfix] Fix qwen2.5-vl overflow issue (#13968)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-02-27 17:30:39 +00:00
1dd422b64a Update LMFE version to v0.10.11 to support new versions of transforme… (#13930) 2025-02-27 17:16:12 +00:00
06c8f8d885 [bugfix] Fix profiling for RayDistributedExecutor (#13945)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-02-28 01:01:21 +08:00
5677c9bb3e Deduplicate .pre-commit-config.yaml's exclude (#13967)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-27 16:27:47 +00:00
512d77d582 Update quickstart.md (#13958) 2025-02-27 16:05:11 +00:00
7f0be2aa24 [Model] Deepseek GGUF support (#13167) 2025-02-27 02:08:35 -08:00
edf309ebbe [VLM] Support multimodal inputs for Florence-2 models (#13320) 2025-02-27 02:06:41 -08:00
788f284b53 Fix test_block_fp8.py test for MoE (#13915)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-27 18:00:00 +08:00
4b1d141f49 [PP] Correct cache size check (#13873)
Signed-off-by: Yang Zheng <zhengy.gator@gmail.com>
2025-02-27 17:47:29 +08:00
10c3b8c1cf [Misc] Fix "'required' is an invalid argument for positionals" error (#13948)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2025-02-27 09:06:49 +00:00
a7f37314b7 [CI/Build] Add examples/ directory to be labelled by mergify (#13944)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-02-27 08:24:11 +00:00
cd711c48b2 [V1][Metrics] Handle preemptions (#13169) 2025-02-26 20:04:59 -08:00
378b3ef6f8 [ROCm][V1] Update reshape_and_cache to properly work with CUDA graph padding (#13922) 2025-02-26 20:04:12 -08:00
c9944acbf9 [misc] Rename Ray ADAG to Compiled Graph (#13928) 2025-02-26 20:03:28 -08:00
ca377cf1b9 Use CUDA 12.4 as default for release and nightly wheels (#12098) 2025-02-26 19:06:37 -08:00
a31614e386 [ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undefined (#13851)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-02-27 10:39:10 +08:00
f95903909f [Kernel] FlashMLA integration (#13747)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
2025-02-27 10:35:08 +08:00
b382a7f28f [BugFix] Make FP8 Linear compatible with torch.compile (#13918)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-26 13:48:55 -08:00
4cb6fa0a9c [Bugfix] Backend option to disable xgrammar any_whitespace (#12744)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
2025-02-26 10:52:34 -08:00
d08b285adf [Misc] Fix qwen_vl_utils parameter error (#13906) 2025-02-26 08:31:53 -08:00
b27122acc2 [TPU] use torch2.6 with whl package (#13860)
Signed-off-by: Chenyaaang <llccyy1212@gmail.com>
2025-02-26 08:18:54 -05:00
934bb99c71 [Bugfix] Update expected token counts for Ultravox tests (#13895) 2025-02-26 04:56:50 -08:00
3f808cc044 [Bugfix] Do not crash V0 engine on input errors (#13101)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-02-26 19:07:29 +08:00
ec8a5e5386 [Misc]: Add support for goodput on guided benchmarking + TPOT calculation refactor (#13736)
Signed-off-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-02-26 19:06:47 +08:00
215bf150a6 [Bugfix] Handle None parameters in Mistral function calls. (#13786) 2025-02-26 03:06:21 -08:00
0ecdd98031 Add comments on accessing kv_cache and attn_metadata (#13887)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-26 18:41:02 +08:00
7b700ec8c8 [Bugfix] Add test example for Ultravox v0.5 (#13890) 2025-02-26 02:31:43 -08:00
7ca1da020f [Misc] Fix input processing for Ultravox (#13871) 2025-02-25 23:56:34 -08:00
5157338ed9 [Misc] Improve LoRA spelling (#13831) 2025-02-25 23:43:01 -08:00
e206b54331 [v0][Core] Use xgrammar shared context to avoid copy overhead for offline engine (#13837)
Signed-off-by: Seth Kimmel <seth.kimmel3@gmail.com>
2025-02-26 14:58:24 +08:00
1d35662e6d [ROCm] Disable chunked prefill/prefix caching when running MLA on non-cuda platforms (#13844)
Signed-off-by: Sage Moore <sage@neuralmagic.com>
2025-02-26 14:56:58 +08:00
e656f638de [Doc] fix the incorrect module path of tensorize_vllm_model (#13863) 2025-02-25 22:56:19 -08:00
145944cb94 Improve pipeline partitioning (#13839) 2025-02-25 18:53:56 -08:00
094b7d9496 [Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutlass epilogues (#13797) 2025-02-25 18:52:03 -08:00
e1fe7591f2 [Misc] Code Cleanup (#13859)
Signed-off-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2025-02-26 10:44:30 +08:00
5629f26df7 [V1][Spec Decode] Change Spec Decode Rejection Sampling API (#13729) 2025-02-25 18:14:48 -08:00
9ba28043b5 [misc] Show driver IP info when Ray fails to allocate driver worker (#13858)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-02-26 09:53:43 +08:00
24679788ed DeepSeek V2/V3/R1 only place lm_head on last pp rank (#13833)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-26 01:24:57 +00:00
07c4353057 [Model] Support Grok1 (#13795)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-26 01:07:12 +00:00
34e3494e70 Fix failing MyGemma2Embedding test (#13820)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-25 12:33:03 -08:00
f75aa72732 [Neuron] Add custom_ops for neuron backend (#13246)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
Co-authored-by: George Novack <gnovack@amazon.com>
Co-authored-by: Aoyu Zhang <aoyuzhan@amazon.com>
2025-02-25 11:47:49 -08:00
340e39e387 Fix string parsing error (#13825) 2025-02-25 08:20:29 -08:00
f4133ce4e5 [Bugfix] Revert inspection code in #13743 (#13832)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-26 00:18:50 +08:00
6522d55b6f Fix /v1/audio/transcriptions Bad Request Error (#13811) 2025-02-25 06:03:33 -08:00
6ff518626c [Bugfix] Fix deepseek-vl2 inference with more than 2 images (#13818) 2025-02-25 06:03:02 -08:00
fa82074167 [Bugfix] Flush TunableOp results before worker processes are destroyed. (#13623)
Signed-off-by: Nichols A. Romero <nick.romero@amd.com>
2025-02-25 11:08:20 +00:00
75e9d49796 [Bugfix] Initialize attention bias on the same device as Query/Key/Value (#13468) 2025-02-25 02:13:09 -08:00
32c3b6bfd1 [Misc] Clarify Error Handling for Non-existent Model Paths and HF Repo IDs (#13724)
Signed-off-by: Chen-0210 <chenjincong11@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
2025-02-25 10:12:19 +00:00
37b6cb4985 [CI/Build] Fix V1 LoRA failure (#13767) 2025-02-25 02:01:15 -08:00
aabeb2688f [ROCm][Quantization][Kernel] Using HIP FP8 header (#12593) 2025-02-25 00:39:59 -08:00
2f42a4888c [Feature] Support KV cache offloading and disagg prefill with LMCache connector. (#12953) 2025-02-25 00:38:42 -08:00
3173c3b34e [misc] Clean up ray compiled graph type hints (#13731) 2025-02-25 00:37:08 -08:00
2d87d7d1ac [Bugfix] Modify modelscope api usage in transformer_utils (#13807) 2025-02-25 00:36:07 -08:00
aab392774b [Core] xgrammar: Expand list of unsupported jsonschema keywords (#13783)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-25 08:21:25 +00:00
6724e79164 [Misc] Check that the model can be inspected upon registration (#13743) 2025-02-25 00:18:19 -08:00
03f48b3db6 [Core] LoRA V1 - Add add/pin/list/remove_lora functions (#13705) 2025-02-25 00:18:02 -08:00
4d251ad00e Fix CompressedTensorsWNA16MoE with grouped scales (#13769) 2025-02-25 00:17:14 -08:00
18e505930d [Bugfix] Support MLA for CompressedTensorsWNA16 (#13725)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-25 06:10:31 +00:00
4a8cfc7551 [Bugfix] Fix deepseek-v2 error: "missing 1 required positional argument: 'residual'" (#13802) 2025-02-24 20:33:59 -08:00
bc32bc73aa [V1][Metrics] Implement vllm:lora_requests_info metric (#13504) 2025-02-24 20:01:33 -08:00
ab1091d5f2 [Misc][Attention][Quantization] init property earlier (#13733)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-25 03:19:30 +00:00
1e15aaef56 [Bugfix][Quantization] Fix FP8 + EP (#13784)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-25 10:54:17 +08:00
51010a1807 [Misc] set single whitespace between log sentences (#13771)
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com>
2025-02-25 10:26:12 +08:00
7196a3b1db [Doc] arg_utils.py: fixed a typo (#13785) 2025-02-24 18:23:04 -08:00
cdc1fa12eb Remove unused kwargs from model definitions (#13555) 2025-02-24 17:13:52 -08:00
f61528d46d [Misc][Chore] Clean Up AsyncOutputProcessing Logs (#13780) 2025-02-24 16:39:07 -08:00
1f0ae3ed0a [Misc] Clean Up EngineArgs.create_engine_config (#13734)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-02-24 13:52:21 -05:00
db986c19ea Fix precommit fail in fused_moe intermediate_cache2 chunking (#13772)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-24 09:25:47 -08:00
227578480d Revert "[V1][Core] Fix memory issue with logits & sampling" (#13775) 2025-02-24 09:16:05 -08:00
befc402d34 [V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) (#10980)
Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-02-24 08:29:41 -08:00
444b0f0f62 [Misc][Docs] Raise error when flashinfer is not installed and VLLM_ATTENTION_BACKEND is set (#12513)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-02-24 10:43:21 -05:00
ccc00515fd [BugFix] Illegal memory access for MoE On H20 (#13693) 2025-02-24 07:37:32 -08:00
781096e385 Expert Parallelism (EP) Support for DeepSeek V2 (#12583) 2025-02-24 07:33:20 -08:00
7940d8a6a7 [CI/Build] add python-json-logger to requirements-common (#12842) 2025-02-24 06:10:33 -08:00
c0e3ecd6d2 [Bugfix] fix(logging): add missing opening square bracket (#13011) 2025-02-24 06:10:25 -08:00
23eca9cf68 [model][refactor] remove cuda hard code in models and layers (#13658) 2025-02-24 06:10:14 -08:00
437b76ff59 [V1][Core] Fix memory issue with logits & sampling (#13721) 2025-02-24 06:10:06 -08:00
f90a375593 [ci] Add logic to change model to S3 path only when S3 CI env var is on (#13727)
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-63-253.us-west-2.compute.internal>
2025-02-24 06:32:11 +00:00
e7ef74e26e Fix some issues with benchmark data output (#13641)
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-02-24 10:23:18 +08:00
cbae7af552 [V1][BugFix] Fix engine core client shutdown hangs (#13298)
Even though ZMQ context.destroy() is meant to close open sockets before terminating the context, it appears to be necessary to do this explicitly or else it can hang in the context.term() method.

Close ZMQ sockets explicitly before terminating the context, make shutdown of client resources more robust, and shut down the engine core process prior to terminating the ZMQ context.

Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-23 13:07:43 -08:00
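For context, a minimal pyzmq sketch of the shutdown ordering this commit describes: close every socket explicitly (with linger disabled) before terminating the context, so term() has nothing left to wait on. The class, socket, and endpoint names below are hypothetical illustrations, not vLLM's actual engine core client code.

```python
import zmq


class ExampleEngineClient:
    """Hypothetical client, used only to illustrate the shutdown order."""

    def __init__(self, endpoint: str = "ipc:///tmp/example_engine.sock"):
        self.ctx = zmq.Context()
        self.input_socket = self.ctx.socket(zmq.PUSH)
        self.input_socket.connect(endpoint)

    def shutdown(self) -> None:
        # Close sockets explicitly with linger=0 so unsent messages are
        # dropped instead of keeping the context alive.
        self.input_socket.close(linger=0)
        # In the real system, the engine core process would be shut down
        # here, before the ZMQ context is torn down.
        # Only then terminate the context; with all sockets already closed,
        # term() cannot hang waiting on them.
        self.ctx.term()


client = ExampleEngineClient()
client.shutdown()
```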
eb24dc4a45 [v1] torchrun compatibility (#13642)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-23 22:47:24 +08:00
9bebc9512f [Misc] Deprecate --dataset from benchmark_serving.py (#13708)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-02-23 13:32:20 +00:00
5a2ba16f5c [Core][Distributed] Use IPC (domain socket) ZMQ socket for local comms (#13688) 2025-02-23 02:54:29 -08:00
ba5106e519 [LMM] Implement merged multimodal processor for whisper (#13278) 2025-02-23 01:46:03 -08:00
d5ca2110f1 [Quant] BaiChuan SupportsQuant (#13710) 2025-02-22 19:21:15 -08:00
2c5e637b57 [ci] Use env var to control whether to use S3 bucket in CI (#13634) 2025-02-22 19:19:45 -08:00
322d2a27d6 [BugFix] Minor: logger import in attention backend (#13706)
Signed-off-by: Andy Lo <andy@mistral.ai>
2025-02-22 16:51:13 -08:00
82e0d601fc [CI/Build] Fix pre-commit errors from #13571 (#13709)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-02-22 16:50:38 -08:00
78ac0f591d [CI/Build] fix uv caching in Dockerfile (#13611) 2025-02-22 08:25:20 -08:00
b56155e7f3 [XPU] Fix setuptools version for xpu (#13548) 2025-02-22 08:05:35 -08:00
382f66fb08 [Bugfix] Fix boolean conversion for OpenVINO env variable (#13615) 2025-02-22 08:04:12 -08:00
8354f6640c [Doc] Dockerfile instructions for optional dependencies and dev transformers (#13699) 2025-02-22 06:04:31 -08:00
c904fdddf6 [ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm (#13231) 2025-02-22 05:54:38 -08:00
558db8083c [V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths (#13095) 2025-02-22 05:25:41 -08:00
e109e598c7 [NVIDIA] Support nvfp4 cutlass gemm (#13571) 2025-02-22 05:24:05 -08:00
8db1b9d0a1 Support SSL Key Rotation in HTTP Server (#13495) 2025-02-22 05:17:44 -08:00
2382ad29d1 [ci] fix linter (#13701)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-22 20:28:59 +08:00
3e472d882a [core] set up data parallel communication (#13591)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-22 19:28:59 +08:00
7f6bae561c [CI/Build] Fix pre-commit errors (#13696) 2025-02-22 00:31:26 -08:00
105b8ce4c0 [Misc] Reduce LoRA-related static variable (#13166) 2025-02-22 00:21:30 -08:00
2cb8c1540e [Metrics] Add --show-hidden-metrics-for-version CLI arg (#13295) 2025-02-22 00:20:45 -08:00
1cd981da4f [V1][Metrics] Support vllm:cache_config_info (#13299) 2025-02-22 00:20:00 -08:00
fca20841c2 Correction to TP logic for Mamba Mixer 2 when Num Groups not divisible by TP Size (#13660) 2025-02-22 00:19:10 -08:00
da31b5333e [Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler (#13594)
Signed-off-by: Jennifer Zhao <7443418+JenZhao@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-02-22 00:08:29 -08:00
bb78fb318e [v1] Support allowed_token_ids in v1 Sampler (#13210)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-22 14:13:05 +08:00
8aca27fa11 [Bugfix] Fix benchmark script bug: inaccurate stats for vllm backend when max_model_len < input_len + output_len (#13691)
Signed-off-by: WangErXiao <863579016@qq.com>
2025-02-22 14:10:38 +08:00
95c617e04b [Misc] Bump compressed-tensors (#13619) 2025-02-21 22:09:04 -08:00
9a1f1da5d1 [Bugfix][Model] OLMo 2: split qkv correctly for GQA and MQA (#13687) 2025-02-21 22:07:45 -08:00
68d630a0c7 [ROCM] fix native attention function call (#13650) 2025-02-21 22:07:04 -08:00
68d535ef44 [Misc] Capture and log the time of loading weights (#13666) 2025-02-21 22:06:34 -08:00
c6ed93860f [Bugfix][API Server] Fix invalid usage of 'ge' and 'le' in port valid… (#13672) 2025-02-21 22:05:28 -08:00
0ffdf8ce0c [HTTP Server] Make model param optional in request (#13568) 2025-02-21 21:55:50 -08:00
8c0dd3d4df docs: Add a note on full CI run in contributing guide (#13646) 2025-02-21 21:53:59 -08:00
ada7c780d5 [Misc] Fix yapf linting tools etc not running on pre-commit (#13695)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-02-22 13:10:43 +08:00
288cc6c234 [Attention] MLA with chunked prefill (#12639)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Patrick Horn <patrick.horn@gmail.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-21 15:30:12 -08:00
900edbfa48 Fix typo in Grafana dashboard and use correct datasource (#13668)
Signed-off-by: John Zheng <john.zheng@hp.com>
2025-02-21 18:21:05 +00:00
b2c3fc5d65 [Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation (#13586) 2025-02-20 22:24:17 -08:00
839b27c6cc [Kernel]Add streamK for block-quantized CUTLASS kernels (#12978) 2025-02-20 22:14:24 -08:00
34ad27fe83 [ci] Fix metrics test model path (#13635) 2025-02-20 22:12:10 -08:00
1c3c975766 [FEATURE] Enables /score endpoint for embedding models (#12846) 2025-02-20 22:09:47 -08:00
1cdc88614a Missing comment explaining VDR variable in GGUF kernels (#13290) 2025-02-20 22:06:54 -08:00
31aa045c11 [V1][Sampler] Avoid an operation during temperature application (#13587) 2025-02-20 22:05:56 -08:00
a30c093502 [Bugfix] Add mm_processor_kwargs to chat-related protocols (#13644) 2025-02-20 22:04:33 -08:00
c7b07a95a6 Use pre-commit to update requirements-test.txt (#13617) 2025-02-20 22:03:27 -08:00
27a09dc52c [NVIDIA] Fix an issue to use current stream for the nvfp4 quant (#13632) 2025-02-20 22:01:48 -08:00
981f3c831e [Misc] Adding script to setup ray for multi-node vllm deployments (#12913) 2025-02-20 21:16:40 -08:00
44c33f01f3 Add llmaz as another integration (#13643)
Signed-off-by: kerthcet <kerthcet@gmail.com>
2025-02-21 03:52:40 +00:00
33170081f1 [Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth (#13245)
Signed-off-by: Lingfan Yu <lingfany@amazon.com>
2025-02-20 17:45:45 -08:00
71face8540 [Bugfix] Fix max_num_batched_tokens for MLA (#13620)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-20 17:45:20 -08:00
bfbc0b32c6 [Frontend] Add backend-specific options for guided decoding (#13505)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-02-20 15:07:58 -05:00
6a417b8600 Fix Neuron performance issue (#13589) 2025-02-20 10:59:36 -08:00
d3ea50113c [V1][Minor] Print KV cache size in token counts (#13596)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-20 09:24:31 -08:00
34aad515c8 Update pre-commit's isort version to remove warnings (#13614) 2025-02-20 08:00:14 -08:00
ed6e9075d3 [Bugfix] Fix deepseekv3 grouped topk error (#13474)
Signed-off-by: Chen-XiaoBing <chenxb002@whu.edu.cn>
2025-02-20 06:47:01 -08:00
992e5c3d34 Merge similar examples in offline_inference into single basic example (#12737) 2025-02-20 04:53:51 -08:00
b69692a2d8 [Kernel] LoRA - Refactor sgmv kernels (#13110) 2025-02-20 07:28:06 -05:00
a64a84433d [2/n][ci] S3: Use full model path (#13564)
Signed-off-by: <>
2025-02-20 01:20:15 -08:00
aa1e62d0db [ci] Fix spec decode test (#13600) 2025-02-20 16:56:00 +08:00
497bc83124 [CI/Build] Use uv in the Dockerfile (#13566) 2025-02-19 23:05:44 -08:00
3738e6fa80 [API Server] Add port number range validation (#13506)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-20 15:05:13 +08:00
0023cd2b9d [ROCm] MI300A compile targets deprecation (#13560) 2025-02-19 23:05:00 -08:00
041e294716 [Misc] add mm_processor_kwargs to extra_body for Qwen2.5-VL (#13533) 2025-02-19 23:04:30 -08:00
9621667874 [Misc] Warn if the vLLM version can't be retrieved (#13501)
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
2025-02-20 06:24:48 +00:00
8c755c3b6d [bugfix] spec decode worker get tp group only when initialized (#13578) 2025-02-20 04:46:28 +00:00
ba81163997 [core] add sleep and wake up endpoint and v1 support (#12987)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: cennn <2523403608@qq.com>
Co-authored-by: cennn <2523403608@qq.com>
2025-02-20 12:41:17 +08:00
0d243f2a54 [ROCm][MoE] mi300 mixtral8x7B perf for specific BS (#13577)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-02-20 04:01:02 +00:00
88f6ba3281 [ci] Add AWS creds for AMD (#13572) 2025-02-20 03:56:06 +00:00
512368e34a [Misc] Qwen2.5 VL support LoRA (#13261) 2025-02-19 18:37:55 -08:00
473f51cfd9 [3/n][CI] Load Quantization test models with S3 (#13570)
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-20 10:12:30 +08:00
a4c402a756 [BugFix] Avoid error traceback in logs when V1 LLM terminates (#13565)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-20 00:49:01 +00:00
550d97eb58 [Misc] Avoid calling unnecessary hf_list_repo_files for local model path (#13348)
Signed-off-by: isotr0py <2037008807@qq.com>
2025-02-19 18:57:48 +00:00
fbbe1fbac6 [MISC] Log a message about Ray teardown (#13502)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
2025-02-19 09:40:50 -08:00
01c184b8f3 Fix copyright year to automatically use the current year (#13561) 2025-02-19 16:55:34 +00:00
ad5a35c21b [doc] clarify multi-node serving doc (#13558)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-19 22:32:17 +08:00
5ae9f26a5a [Bugfix] Fix device ordinal for multi-node spec decode (#13269)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-02-19 22:13:15 +08:00
377d10bd14 [VLM][Bugfix] Pass processor kwargs properly on init (#13516)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-19 13:13:50 +00:00
52ce14d31f [doc] clarify profiling is only for developers (#13554)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-19 20:55:58 +08:00
81dabf24a8 [CI/Build] force writing version file (#13544)
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com>
2025-02-19 18:48:03 +08:00
423330263b [Feature] Pluggable platform-specific scheduler (#13161)
Signed-off-by: Yannick Schnider <yannick.schnider1@ibm.com>
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com>
2025-02-19 17:16:38 +08:00
caf7ff4456 [V1][Core] Generic mechanism for handling engine utility (#13060)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-19 17:09:22 +08:00
f525c0be8b [Model][Speculative Decoding] DeepSeek MTP spec decode (#12755)
Signed-off-by: Lu Fang <fanglu@fb.com>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2025-02-19 17:06:23 +08:00
983a40a8bb [Bugfix] Fix Positive Feature Layers in Llava Models (#13514)
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
2025-02-19 08:50:07 +00:00
fdc5df6f54 use device param in load_model method (#13037) 2025-02-19 16:05:02 +08:00
3b05cd4555 [perf-benchmark] Fix ECR path for premerge benchmark (#13512)
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-19 07:56:11 +00:00
d5d214ac7f [1/n][CI] Load models in CI from S3 instead of HF (#13205)
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-19 07:34:59 +00:00
fd84857f64 [Doc] Add clarification note regarding paligemma (#13511) 2025-02-18 22:24:03 -08:00
8aada19dfc [ROCm][MoE configs] mi325 mixtral & mi300 qwen_moe (#13503) 2025-02-18 22:23:24 -08:00
9aa95b0e6a [perf-benchmark] Allow premerge ECR (#13509)
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-19 05:13:41 +00:00
d0a7a2769d [Hardware][Gaudi][Feature] Support Contiguous Cache Fetch (#12139)
Signed-off-by: yuzhou <yuzhou@habana.ai>
Signed-off-by: zhouyu5 <yu.zhou@intel.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-18 19:40:19 -08:00
00b69c2d27 [Misc] Remove dangling references to --use-v2-block-manager (#13492)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-19 03:37:26 +00:00
4c82229898 [V1][Spec Decode] Optimize N-gram matching with Numba (#13365)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-18 13:19:58 -08:00
c8d70e2437 Pin Ray version to 2.40.0 (#13490)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-18 12:50:31 -08:00
30172b4947 [V1] Optimize handling of sampling metadata and req_ids list (#13244)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-18 12:15:33 -08:00
a4d577b379 [V1][Tests] Adding additional testing for multimodal models to V1 (#13308)
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
2025-02-18 09:53:14 -08:00
7b203b7694 [misc] fix debugging code (#13487)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-18 09:37:11 -08:00
4fb8142a0e [V1][PP] Enable true PP with Ray executor (#13472)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-18 09:15:32 -08:00
a02c86b4dd [CI/Build] migrate static project metadata from setup.py to pyproject.toml (#8772) 2025-02-18 08:02:49 -08:00
3809458456 [Bugfix] Fix invalid rotary embedding unit test (#13431)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2025-02-18 11:52:03 +00:00
d3231cb436 [Bugfix] Handle content type with optional parameters (#13383)
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
2025-02-18 11:29:13 +00:00
435b502a6e [ROCm] Make amdsmi import optional for other platforms (#13460) 2025-02-18 03:15:56 -08:00
29fc5772c4 [Bugfix] Remove noisy error logging during local model loading (#13458) 2025-02-18 03:15:48 -08:00
2358ca527b [Doc]: Improve feature tables (#13224)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-18 18:52:39 +08:00
8cf97f8661 [Bugfix] Fix failing transformers dynamic module resolving with spawn multiproc method (#13403)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-02-18 10:25:53 +00:00
e2603fefb8 [Bugfix] Ensure LoRA path from the request can be included in err msg (#13450)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-18 16:19:15 +08:00
b53d79983c Add outlines fallback when JSON schema has enum (#13449)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-18 06:49:41 +00:00
9915912f7f [V1][PP] Fix & Pin Ray version in requirements-cuda.txt (#13436)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-17 21:58:06 -08:00
d1b649f1ef [Quant] Aria SupportsQuant (#13416) 2025-02-17 21:51:09 -08:00
ac19b519ed [core] fix sleep mode in pytorch 2.6 (#13456)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-18 13:48:10 +08:00
a1074b3efe [Bugfix] Only print out chat template when supplied (#13444) 2025-02-17 21:43:31 -08:00
00294e1bc6 [Quant] Arctic SupportsQuant (#13366) 2025-02-17 21:35:09 -08:00
88787bce1d [Quant] Molmo SupportsQuant (#13336) 2025-02-17 21:34:47 -08:00
932b51cedd [v1] fix parallel config rank (#13445)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-18 12:33:45 +08:00
7c7adf81fc [ROCm] fix get_device_name for rocm (#13438)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-02-18 04:07:12 +00:00
67ef8f666a [Model] Enable quantization support for transformers backend (#12960) 2025-02-17 19:52:47 -08:00
efbe854448 [Misc] Remove dangling references to SamplingType.BEAM (#13402) 2025-02-17 19:52:35 -08:00
b3942e157e [Bugfix][CI][V1] Work around V1 + CUDA Graph + torch._scaled_mm fallback issue (#13425)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-18 00:32:48 +00:00
cd4a72a28d [V1][Spec decode] Move drafter to model runner (#13363)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-17 15:40:12 -08:00
6ac485a953 [V1][PP] Fix intermediate tensor values (#13417)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-17 13:37:45 -08:00
4c21ce9eba [V1] Get input tokens from scheduler (#13339)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-17 11:01:07 -08:00
ce77eb9410 [Bugfix] Fix VLLM_USE_MODELSCOPE issue (#13384) 2025-02-17 14:22:01 +00:00
30513d1cb6 [Bugfix] fix xpu communicator (#13368)
Signed-off-by: yan ma <yan.ma@intel.com>
2025-02-17 20:59:18 +08:00
1f69c4a892 [Model] Support Mamba2 (Codestral Mamba) (#9292)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
2025-02-17 20:17:50 +08:00
7b623fca0b [VLM] Check required fields before initializing field config in DictEmbeddingItems (#13380) 2025-02-17 01:36:07 -08:00
238dfc8ac3 [MISC] tiny fixes (#13378) 2025-02-17 00:57:13 -08:00
45186834a0 Run v1 benchmark and integrate with PyTorch OSS benchmark database (#13068)
Signed-off-by: Huy Do <huydhn@gmail.com>
2025-02-17 08:16:32 +00:00
f857311d13 Fix spelling error in index.md (#13369) 2025-02-17 06:53:20 +00:00
46cdd59577 [Feature][Spec Decode] Simplify the use of Eagle Spec Decode (#12304)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-02-16 19:32:26 -08:00
2010f04c17 [V1][Misc] Avoid unnecessary log output (#13289) 2025-02-16 19:26:24 -08:00
69e1d23e1e [V1][BugFix] Clean up rejection sampler & Fix warning msg (#13362)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-16 12:25:29 -08:00
d67cc21b78 [Bugfix][Platform][CPU] Fix cuda platform detection on CPU backend edge case (#13358)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-02-16 18:55:27 +00:00
e18227b04a [V1][PP] Cache Intermediate Tensors (#13353)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-16 10:02:27 -08:00
7b89386553 [V1][BugFix] Add __init__.py to v1/spec_decode/ (#13359)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-16 09:39:08 -08:00
da833b0aee [Docs] Change myenv to vllm. Update python_env_setup.inc.md (#13325) 2025-02-16 16:04:21 +00:00
5d2965b7d7 [Bugfix] Fix 2 Node and Spec Decode tests (#13341)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-16 22:20:22 +08:00
a0231b7c25 [platform] add base class for communicators (#13208)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-16 22:14:22 +08:00
124776ebd5 [ci] skip failed tests for flashinfer (#13352)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-16 22:09:15 +08:00
b7d309860e [V1] Update doc and examples for H2O-VL (#13349)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-02-16 10:35:54 +00:00
dc0f7ccf8b [BugFix] Enhance test_pos_encoding to support execution on multi-devices (#13187)
Signed-off-by: wchen61 <wchen61@foxmail.com>
2025-02-16 08:59:49 +00:00
d3d547e057 [Bugfix] Pin xgrammar to 0.1.11 (#13338) 2025-02-15 19:42:25 -08:00
12913d17ba [Quant] Add SupportsQuant to phi3 and clip (#13104) 2025-02-15 19:28:33 -08:00
80f63a3966 [V1][Spec Decode] Ngram Spec Decode (#12193)
Signed-off-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2025-02-15 18:05:11 -08:00
367cb8ce8c [Doc] [2/N] Add Fuyu E2E example for multimodal processor (#13331) 2025-02-15 07:06:23 -08:00
54ed913f34 [ci/build] update flashinfer (#13323) 2025-02-15 05:33:13 -08:00
9206b3d7ec [V1][PP] Run engine busy loop with batch queue (#13064) 2025-02-15 03:59:01 -08:00
ed0de3e4b8 [AMD] [Model] DeepSeek tunings (#13199) 2025-02-15 03:58:09 -08:00
2ad1bc7afe [V1][Metrics] Add iteration_tokens_total histogram from V0 (#13288) 2025-02-15 03:56:19 -08:00
7fdaaf48ef [Bugfix] Fix qwen2.5-vl image processor (#13286) 2025-02-15 03:00:11 -08:00
067fa2255b [Bugfix] Fix search start_index of stop_checker (#13280) 2025-02-14 21:39:42 -08:00
9076325677 [BugFix] Don't scan entire cache dir when loading model (#13302) 2025-02-14 21:33:31 -08:00
97a3d6d995 [Bugfix] Massage MLA's usage of flash attn for ROCm (#13310) 2025-02-14 21:33:25 -08:00
579d7a63b2 [Bugfix][Docs] Fix offline Whisper (#13274) 2025-02-14 21:32:37 -08:00
c9f9d5b397 [Bugfix][AMD] Update torch_bindings so that scaled_fp4_quant isn't build on ROCm (#13235) 2025-02-14 20:30:42 -08:00
0c73026844 [V1][PP] Fix memory profiling in PP (#13315)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-14 20:17:25 -08:00
6a854c7a2b [V1][Sampler] Don't apply temp for greedy-only (#13311)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-14 18:10:53 -08:00
e7eea5a520 [V1][CI] Fix failed v1-test because of min_p (#13316)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-14 17:29:51 -08:00
a12934d3ec [V1][Core] min_p sampling support (#13191)
Signed-off-by: Aoyu <aoyuzhan@amazon.com>
Co-authored-by: Aoyu <aoyuzhan@amazon.com>
2025-02-14 15:50:05 -08:00
3bcb8c75da [Core] Reduce TTFT with concurrent partial prefills (#10235)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-14 15:36:07 -08:00
5e5c8e091e [Quant][Perf] Use moe_wna16 kernel by default for MoEs with many experts (#13236)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-14 12:53:42 -08:00
c9e2d644e7 [Hardware][Gaudi][Bugfix] Fix error for guided decoding (#12317) 2025-02-14 04:36:49 -08:00
7734e9a291 [Core] choice-based structured output with xgrammar (#12632) 2025-02-14 04:36:05 -08:00
6224a9f620 Support logit_bias in v1 Sampler (#13079) 2025-02-14 04:34:59 -08:00
085b7b2d6c [V1] Simplify GPUModelRunner._update_states check (#13265) 2025-02-14 04:33:43 -08:00
4da1f667e9 [VLM] Keep track of whether prompt replacements have been applied (#13215) 2025-02-14 04:20:46 -08:00
556ef7f714 [Misc] Log time consumption of sleep and wake-up (#13115)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
2025-02-14 20:10:21 +08:00
83481ceb49 [Bugfix] Fix missing parentheses (#13263) 2025-02-14 01:07:10 -08:00
185cc19f92 [Frontend] Optionally remove memory buffer used for uploading to URLs in run_batch (#12927)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
2025-02-14 08:22:42 +00:00
45f90bcbba [WIP] TPU V1 Support Refactored (#13049) 2025-02-14 00:21:53 -08:00
b0ccfc565a [Bugfix][V1] GPUModelRunner._update_states should return True when there is a finished request in batch (#13126) 2025-02-13 22:39:20 -08:00
ba59b78a9c [ROCm][V1] Add initial ROCm support to V1 (#12790) 2025-02-13 22:21:50 -08:00
cbc40128eb [V1] LoRA - Enable Serving Usecase (#12883)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-02-14 14:21:12 +08:00
f0b2da72a8 Expand MLA to support most types of quantization (#13181) 2025-02-13 22:19:22 -08:00
f2b20fe491 Consolidate Llama model usage in tests (#13094) 2025-02-13 22:18:03 -08:00
40932d7a05 [Misc] Remove redundant statements in scheduler.py (#13229) 2025-02-13 22:07:25 -08:00
84683fa271 [Bugfix] Offline example of disaggregated prefill (#13214) 2025-02-13 20:20:47 -08:00
067678262a [Bugfix][CI] Inherit codespell settings from pyproject.toml in the pre-commit-config (#13237) 2025-02-13 20:19:43 -08:00
09545c0a94 [Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on (#13250) 2025-02-13 20:19:25 -08:00
dd5ede4440 [V1] Consolidate MM cache size to vllm.envs (#13239) 2025-02-13 20:19:03 -08:00
8c32b08a86 [Kernel] Fix awq error when n is not divisible by 128 (#13227) 2025-02-13 20:07:05 -08:00
410886950a [ROCm] Avoid using the default stream on ROCm (#13238)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-02-14 09:29:26 +08:00
e38be640e6 Revert "Add label if pre-commit passes" (#13242) 2025-02-13 16:12:32 -08:00
c1e37bf71b [Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (#13198)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-14 00:01:14 +00:00
2344192a55 Optimize moe_align_block_size for deepseek_v3 (#12850)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-13 18:43:37 -05:00
bffddd9a05 Add label if pre-commit passes (#12527)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-13 20:51:30 +00:00
d84cef76eb [Frontend] Add /v1/audio/transcriptions OpenAI API endpoint (#12909) 2025-02-13 07:23:45 -08:00
37dfa60037 [Bugfix] Missing Content Type returns 500 Internal Server Error (#13193) 2025-02-13 06:52:22 -08:00
1bc3b5e71b [VLM] Separate text-only and vision variants of the same model architecture (#13157) 2025-02-13 06:19:15 -08:00
02ed8a1fbe [Misc] Qwen2.5-VL Optimization (#13155) 2025-02-13 06:17:57 -08:00
2092a6fa7d [V1][Core] Add worker_base for v1 worker (#12816)
Signed-off-by: Aoyu <aoyuzhan@amazon.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Aoyu <aoyuzhan@amazon.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-02-13 20:35:18 +08:00
c9d3ecf016 [VLM] Merged multi-modal processor for Molmo (#12966) 2025-02-13 04:34:00 -08:00
fdcf64d3c6 [V1] Clarify input processing and multimodal feature caching logic (#13211) 2025-02-13 03:43:24 -08:00
578087e56c [Frontend] Pass pre-created socket to uvicorn (#13113) 2025-02-13 00:51:46 -08:00
fa253f1a70 [VLM] Remove input processor from clip and siglip (#13165) 2025-02-13 00:31:37 -08:00
9605c1256e [V1][core] Implement pipeline parallel on Ray (#12996) 2025-02-13 08:02:46 +00:00
0ccd8769fb [CI/Build] Allow ruff to auto-fix some issues (#13180)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-13 07:45:38 +00:00
cb944d5818 Allow Unsloth Dynamic 4bit BnB quants to work (#12974) 2025-02-12 23:13:08 -08:00
d46d490c27 [Frontend] Move CLI code into vllm.cmd package (#12971) 2025-02-12 23:12:21 -08:00
04f50ad9d1 [Bugfix] deepseek_r1_reasoning_parser puts reasoning content in wrong field in certain edge cases (#13097) 2025-02-12 23:11:26 -08:00
60c68df6d1 [Build] Automatically use the wheel of the base commit with Python-only build (#13178) 2025-02-12 23:10:28 -08:00
009439caeb Simplify logic of locating CUDART so file path (#13203)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-13 13:52:41 +08:00
bc55d13070 [VLM] Implement merged multimodal processor for Mllama (#11427) 2025-02-12 20:26:21 -08:00
d88c8666a1 [Bugfix][Example] Fix GCed profiling server for TPU (#12792)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-02-13 11:52:11 +08:00
4fc5c23bb6 [NVIDIA] Support nvfp4 quantization (#12784) 2025-02-12 19:51:51 -08:00
9f9704dca6 [perf-benchmark] cleanup unused Docker images and volumes in H100 benchmark instance (#12706) 2025-02-12 19:51:33 -08:00
8eafe5eaea [CI/Build] Ignore ruff warning up007 (#13182)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-13 11:48:31 +08:00
4c0d93f4b2 [V1][Bugfix] Copy encoder input ids to fix set iteration issue during VLM abort (#13173)
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
2025-02-12 12:58:11 -08:00
14b7899d10 [CI] Fix failing FP8 cpu offload test (#13170)
Signed-off-by: mgoin <mgoin64@gmail.com>
2025-02-12 19:16:06 +00:00
09972e716c [Bugfix] Allow fallback to AWQ from AWQMarlin at per-layer granularity (#13119) 2025-02-12 09:19:53 -08:00
36a08630e8 [CORE] [QUANT] Support for GPTQModel's dynamic quantization per module override/control (#7086) 2025-02-12 09:19:43 -08:00
2c2b560f48 [CI/Build] Use mypy matcher for pre-commit CI job (#13162)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-12 17:12:22 +00:00
042c3419fa Introduce VLLM_CUDART_SO_PATH to allow users specify the .so path (#12998)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-12 09:06:13 -08:00
82cabf53a3 [Misc] Delete unused LoRA modules (#13151) 2025-02-12 08:58:24 -08:00
314cfade02 [Frontend] Generate valid tool call IDs when using tokenizer-mode=mistral (#12332) 2025-02-12 08:29:56 -08:00
985b4a2b19 [Bugfix] Fix num video tokens calculation for Qwen2-VL (#13148)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-12 11:55:23 +00:00
f4d97e4fc2 [Bug] [V1] Try fetching stop_reason from EngineOutput before checking the request (#13108) 2025-02-12 02:39:16 -08:00
f1042e86f0 [Misc] AMD Build Improvements (#12923) 2025-02-12 02:36:10 -08:00
7c4033acd4 Further reduce the HTTP calls to huggingface.co (#13107) 2025-02-12 02:34:09 -08:00
d59def4730 Bump actions/setup-python from 5.3.0 to 5.4.0 (#12672) 2025-02-12 16:41:22 +08:00
0c7d9effce Bump helm/chart-testing-action from 2.6.1 to 2.7.0 (#12463) 2025-02-12 16:41:06 +08:00
dd3b4a01f8 Bump actions/stale from 9.0.0 to 9.1.0 (#12462) 2025-02-12 00:40:25 -08:00
a0597c6b75 Bump helm/kind-action from 1.10.0 to 1.12.0 (#11612) 2025-02-12 00:40:19 -08:00
e92694b6fe [Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency (#12921)
Signed-off-by: Lingfan Yu <lingfany@amazon.com>
2025-02-11 21:12:37 -08:00
842b0fd402 [ci] Add more source file dependencies for some tests (#13123)
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-11 20:38:10 -08:00
974dfd4971 [Model] IBM/NASA Prithvi Geospatial model (#12830) 2025-02-11 20:34:30 -08:00
3ee696a63d [RFC][vllm-API] Support tokenizer registry for customized tokenizer in vLLM (#12518)
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
2025-02-12 12:25:58 +08:00
72c2b68dc9 [Misc] Move pre-commit suggestion back to the end (#13114)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-11 22:34:16 +00:00
14ecab5be2 [Bugfix] Guided decoding falls back to outlines when it fails to import xgrammar (#12976)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-11 18:17:44 +00:00
deb6c1c6b4 [Doc] Improve OpenVINO installation doc (#13102)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-11 18:02:46 +00:00
565c1efa65 [CI/Build][Bugfix] Fix CPU backend default threads num (#13077) 2025-02-11 16:55:56 +00:00
2b25b7d2e1 Fix initializing GGUF weights for ColumnParallelLinear when using tensor parallel > 1 (#13023) 2025-02-11 08:38:48 -08:00
6c4dbe23eb [BugFix] Pop instead of del CUDA_VISIBLE_DEVICES (#12962)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2025-02-12 00:21:50 +08:00
21f5d50fa5 [Bugfix] Do not use resource module on Windows (#12858) (#13029) 2025-02-11 08:21:18 -08:00
bf3e05215c [Misc] Fix typo at comments at metrics.py (#13024) 2025-02-11 08:20:37 -08:00
ad9776353e Set torch_dtype in TransformersModel (#13088)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-11 23:51:19 +08:00
75e6e14516 [V1][Metrics] Add several request timing histograms (#12644)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-02-11 10:14:00 -05:00
110f59a33e [Bugfix] fix flaky test (#13089)
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>
2025-02-11 14:41:20 +00:00
2e3b969ec0 [Platform] add pre_register_and_update function (#12432)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-02-11 22:06:46 +08:00
da317197dd [Build] Fix cuda link target of cumem_allocator in CPU env (#12863)
Signed-off-by: YuhongGuo <yuhong.gyh@antgroup.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-11 21:55:57 +08:00
7539bbc6a6 [ROCm] Using a more precise memory profiling (#12624)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-02-11 21:47:10 +08:00
9cf4759493 [executor] init local_rank as device index (#13027)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-02-11 21:20:53 +08:00
41c5dd45b9 [V1][Metrics] Add GPU prefix cache hit rate % gauge (#12592) 2025-02-11 08:27:25 +00:00
fc6485d277 [Bugfix]: Reasoning output bug according to the chat template change (#13025)
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
2025-02-11 15:49:03 +08:00
78a141d768 [Misc] LoRA - Refactor Punica ops tests (#12970)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-02-11 07:26:03 +00:00
c320ca8edd [Core] Don't do platform detection at import time (#12933)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-11 07:25:25 +00:00
58047c6f04 [Benchmark] Add BurstGPT to benchmark_serving (#13063)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2025-02-10 21:25:30 -08:00
cb080f32e3 [Bugfix] Support missing tool parameters in mistral tokenizer (#12884)
Signed-off-by: Florian Greinacher <florian.greinacher@siemens.com>
2025-02-11 03:33:33 +00:00
2c0f58203c [Docs] Announce Meta Meetup (#13065)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-02-10 18:24:29 -08:00
2ff4857678 [V1][Minor] Move scheduler outputs to a separate file (#13062)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-11 02:10:06 +00:00
91e876750e [misc] Fix setup.py condition to avoid AMD from being mistaken with CPU (#13022)
Signed-off-by: kevin <kevin@anyscale.com>
2025-02-10 18:06:16 -08:00
08b2d845d6 [Model] Ultravox Model: Support v0.5 Release (#12912)
Signed-off-by: Farzad Abdolhosseini <farzad@fixie.ai>
2025-02-10 22:02:48 +00:00
2ae889052c Fix seed parameter behavior in vLLM (#13007)
Signed-off-by: மனோஜ்குமார் பழனிச்சாமி <smartmanoj42857@gmail.com>
2025-02-10 23:26:50 +08:00
51f0b5f7f6 [Bugfix] Clean up and fix multi-modal processors (#13012)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-10 10:45:21 +00:00
fde71262e0 [misc] Add retries with exponential backoff for HF file existence check (#13008) 2025-02-10 01:15:02 -08:00
243137143c [Doc] Add link to tool_choice tracking issue in tool_calling.md (#13003)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-10 06:09:33 +00:00
b2496bb07f [core] fix sleep mode and pytorch checkpoint compatibility (#13001)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-10 13:03:43 +08:00
44607e07d3 Check if selected backend is None in get_attn_backend_cls() (#12975)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-10 11:45:07 +08:00
67c4637ccf [V1] Use msgpack for core request serialization (#12918)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-02-10 11:35:56 +08:00
aa0ca5ebb7 [core][rlhf] add colocate example for RLHF (#12984)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-10 10:28:59 +08:00
59fff4a01a [core] improve error handling when wake up from sleep mode (#12981)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-10 09:38:57 +08:00
29f1d47e73 [MISC] Always import version library first in the vllm package (#12979)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-09 18:56:40 +08:00
cf797aa856 [core] port pynvml into vllm codebase (#12963)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-09 15:00:00 +08:00
24700c346b [V1] Cache uses_mrope in GPUModelRunner (#12969) 2025-02-08 15:32:32 -08:00
d366ccc4e3 [RFC] [Mistral] FP8 format (#10130)
Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: mgoin <mgoin64@gmail.com>
2025-02-08 14:12:53 -07:00
870c37481e [V1][Minor] Remove outdated comment (#12968)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-08 12:48:30 -08:00
86222a3dab [VLM] Merged multi-modal processor for GLM4V (#12449)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-02-08 20:32:16 +00:00
fe743b798d [bugfix] fix early import of flash attention (#12959)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-09 00:06:56 +08:00
913df14da3 [Bugfix] Remove unused seq_group_metadata_list from ModelInputForGPU (#12935)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-02-08 14:46:19 +00:00
8a69e0e20e [CI/Build] Auto-fix Markdown files (#12941) 2025-02-08 04:25:15 -08:00
4c8dd12ef3 [Misc] Add qwen2.5-vl BNB support (#12944) 2025-02-08 04:24:47 -08:00
256a2d29dc [Doc] Correct HF repository for TeleChat2 models (#12949) 2025-02-08 01:42:15 -08:00
c45d398e6f [CI] Resolve transformers-neuronx version conflict (#12925) 2025-02-08 01:41:35 -08:00
011e612d92 [Misc] Log time consumption on weight downloading (#12926) 2025-02-08 09:16:42 +00:00
7e1837676a [misc] Add LoRA to benchmark_serving (#12898)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-02-08 17:15:44 +08:00
2880e21e3d [Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel Gaudi (#12812)
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai>
2025-02-08 17:15:30 +08:00
407b5537db [Build] Make pypi install work on CPU platform (#12874) 2025-02-08 01:15:15 -08:00
4ea48fb35c [V1][Minor] Move cascade attn logic outside _prepare_inputs (#12943)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-08 00:39:09 -08:00
e31498bdcb [Misc] Add offline test for disaggregated prefill (#12418) 2025-02-08 08:38:20 +00:00
91dd8f7aa6 [bugfix] respect distributed_executor_backend in world_size=1 (#12934)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-08 16:17:08 +08:00
d01f66b039 [Bugfix] Fix multi-round chat error when mistral tokenizer is used (#12859)
Signed-off-by: Zifei Tong <zifeitong@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-02-08 07:04:34 +00:00
cc01223f3b [Misc] Fix typo in the example file (#12896)
Signed-off-by: Zhao Ke <yingxiongraomingzk@gmail.com>
2025-02-08 06:56:43 +00:00
306923da82 [Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mapping (#12905) 2025-02-07 21:02:53 -08:00
3243158336 [V1] Move KV block hashes from Request to KVCacheManager (#12922)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-07 19:14:10 -08:00
b21f0f9d17 [V1][Minor] Remove outdated comment (#12928)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-07 19:07:37 -08:00
45cbc4991d [Bugfix] Fix disagg hang caused by the prefill and decode communication issues (#12723)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-07 16:39:50 -08:00
932c6b7461 [V1] LM Eval With Streaming Integration Tests (#11590) 2025-02-07 15:07:03 -08:00
eaa92d4437 [ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing (#12501) 2025-02-07 08:13:43 -08:00
0630d4537a [V1] Logprobs and prompt logprobs support (#9880)
This PR adds support for sample logprobs & prompt logprobs to vLLM v1.

New behavior:

- During model execution, model runner computes sample logprobs (if user-provided logprobs setting is not None) and prompt logprobs (if user-provided prompt_logprobs setting is not None). For both sample and prompt logprobs, the engine core returns 3 vectors: token ids, token logprob values, token ranks. Ranks reflect tokens' 1-indexed positions in the vocabulary vector after sorting the vocabulary by log probability in descending order.
- In scheduler.update_from_output(), sample and prompt logprobs are incorporated into the EngineCoreOutput data structure which is transferred to the engine client. If multiprocessing is enabled, then sample and prompt logprobs will be (de)serialized when the EngineCoreOutput data structure is (de)serialized.
- During output processing, the LogprobsProcessor transforms the triplet of token ids, token logprobs values, and token ranks into the OpenAI-compatible List[Dict[token id,Logprob]] format (for sample and prompt logprobs respectively.)
- Each Logprob instance (whether sample- or prompt-) consists of a token's log-probability, rank, and detokenized string representation. Note that logprob detokenization is handled by the LogprobsProcessor not the detokenizer.

Signed-off-by: Andrew Feldman <afeldman@neuralmagic.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>


Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2025-02-07 07:26:20 -08:00
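To make the rank definition above concrete (a token's rank is its 1-indexed position in the vocabulary after sorting by log probability in descending order), here is a small self-contained sketch; the helper name and tensor shapes are assumptions for illustration only, not vLLM's actual model-runner code.

```python
import torch


def gather_logprobs_and_ranks(logits: torch.Tensor, token_ids: torch.Tensor):
    """Hypothetical helper: for each chosen token id, return its logprob and
    its 1-indexed vocabulary rank (rank 1 = highest-probability token).

    logits:    [num_tokens, vocab_size]
    token_ids: [num_tokens]
    """
    logprobs = torch.log_softmax(logits, dim=-1)          # [T, V]
    chosen = logprobs.gather(1, token_ids.unsqueeze(1))   # [T, 1]
    # Rank = 1 + number of vocabulary entries with a strictly higher logprob.
    ranks = (logprobs > chosen).sum(dim=-1) + 1           # [T]
    return chosen.squeeze(1), ranks


# Toy example with a 10-token vocabulary.
logits = torch.randn(3, 10)
token_ids = torch.tensor([2, 5, 7])
values, ranks = gather_logprobs_and_ranks(logits, token_ids)
print(values.tolist(), ranks.tolist())
```

The resulting triplet (token id, logprob value, rank) could then be packed into the OpenAI-compatible Dict[token id, Logprob] mapping described above.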
538fab93cd PR #12718 (#12718) 2025-02-07 06:22:37 -08:00
ce26b16268 [Misc] Remove unnecessary detokenization in multimodal processing (#12868) 2025-02-07 06:21:17 -08:00
1918aa1b80 [MISC][EASY] Break check file names into entry and args in the pre-commit hooks (#12880)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-07 13:04:39 +00:00
6e1fc61f0f Prevent unnecessary requests to huggingface hub (#12837) 2025-02-06 21:37:41 -08:00
aa375dca9f [Bugfix] Missing quant_config in deepseek embedding layer (#12836) 2025-02-06 21:35:09 -08:00
433c4a4923 Make vllm compatible with verl (#12824)
Co-authored-by: zhangshulai <zhangshulai@bytedance.com>
2025-02-07 11:54:20 +08:00
ef533d25fb [Bugfix] FA2 illegal memory access (#12848) 2025-02-06 19:54:07 -08:00
b260782357 [misc] Revert #12833 (#12857)
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-06 16:29:12 -08:00
741429a4cd [MISC] Check space in the file names in the pre-commit checks (#12804)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-06 15:36:21 -08:00
aff404571b Add Bamba Model (#10909)
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-06 15:22:42 -08:00
467a96a541 [V1] LoRA Support (#10957)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-02-06 09:32:51 -08:00
8108ac841d [Bugfix] Fix unsupported FA version check for Turing GPU (#12828) 2025-02-06 09:18:22 -08:00
afe74f7a96 [Doc] double quote cmake package in build.inc.md (#12840) 2025-02-06 09:17:55 -08:00
09b95e36ab [torch.compile] PyTorch 2.6 and nightly compatibility (#12393)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-07 01:09:07 +08:00
85ac82d228 [Kernel] Make rotary_embedding ops more flexible with input shape (#12777) 2025-02-06 08:46:13 -08:00
1e57b1ee63 [Misc] Remove unnecessary decode call (#12833) 2025-02-06 08:45:44 -08:00
e152f29502 [misc] Reduce number of config file requests to HuggingFace (#12797)
Signed-off-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
Signed-off-by: <>
Co-authored-by: EC2 Default User <ec2-user@ip-172-31-20-117.us-west-2.compute.internal>
2025-02-06 14:59:18 +00:00
c786e757fa [Attention] Use FA3 for MLA on Hopper (#12807)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-02-06 11:43:12 +00:00
cefd56ee35 [Docs] Add Google Cloud Slides (#12814) 2025-02-06 01:02:38 -08:00
7ca9934fe7 [Misc] Update w2 scale loading for GPTQMarlinMoE (#12757) 2025-02-06 01:02:14 -08:00
0408efc6d0 [Misc] Improve error message for incorrect pynvml (#12809)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-06 15:23:50 +08:00
449d1bce02 [Misc] Remove duplicated DeepSeek V2/V3 model definition (#12793) 2025-02-05 23:16:20 -08:00
1a6fcad4c9 Improve TransformersModel UX (#12785) 2025-02-05 22:24:57 -08:00
56534cd577 [Bugfix] Fix the test_ultravox.py's license (#12806)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-02-06 13:25:54 +08:00
d88506dda4 [Model] LoRA Support for Ultravox model (#11253) 2025-02-05 19:54:13 -08:00
9cdea30b4f [Misc][Easy] Remove the space from the file name 2025-02-05 19:23:35 -08:00
76abd0c881 [Bugfix] Better FP8 supported defaults 2025-02-05 19:22:19 -08:00
5b19b93082 [ROCm][Kernel] Using the correct warp_size value 2025-02-05 19:15:08 -08:00
75404d041b [VLM] Update compatibility with transformers 4.49 2025-02-05 19:09:45 -08:00
bf3b79efb8 [VLM] Qwen2.5-VL 2025-02-05 13:31:38 -08:00
9a5b1554b4 [Docs] Drop duplicate [source] links 2025-02-05 13:30:50 -08:00
a4ce74c14a [VLM] Use shared field to pass token ids to model 2025-02-05 13:30:46 -08:00
3b2005e1db Add: Support for Sparse24Bitmask Compressed Models 2025-02-05 13:30:43 -08:00
af8486de49 [Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU) 2025-02-05 13:29:45 -08:00
4c3aac51e1 Merging PR #12536
Merged via CLI script
2025-02-05 13:24:26 -08:00
bc1bdecebf [core][distributed] exact ray placement control (#12732)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-06 02:03:19 +08:00
022bcc701a [Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 (#12546) 2025-02-04 23:11:02 -08:00
c53dc466b1 [Doc] Remove performance warning for auto_awq.md (#12743) 2025-02-04 22:43:11 -08:00
3d09e592a8 [V1][Misc] Shorten FinishReason enum and use constant strings (#12760) 2025-02-04 22:43:02 -08:00
fcf2e3d7fc [Bugfix] Fix OpenVINO model runner (#12750) 2025-02-04 22:42:46 -08:00
58b218d7ae [Doc] Update PR Reminder with link to Developer Slack (#12748) 2025-02-04 22:42:09 -08:00
7ff7a638b6 [Model][Quant] Fix GLM, Fix fused module mappings for quantization (#12634)
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-02-05 05:32:06 +00:00
686006a220 [Misc] Bump the compressed-tensors version (#12736) 2025-02-04 20:44:48 -08:00
98fd089fc9 [VLM] Add MLA with pure RoPE support for deepseek-vl2 models (#12729) 2025-02-04 20:44:26 -08:00
249824c3bf Refactor Linear handling in TransformersModel (#12727)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-02-05 04:31:12 +00:00
64862d106e [ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling (#12713)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2025-02-05 03:58:22 +00:00
b3a0d01e45 [Core] add and implement VLLM_LOGITS_PROCESSOR_THREADS (#12368)
Signed-off-by: Aviv Keshet <akeshet@scaledcognition.com>
2025-02-04 18:46:26 -08:00
75e94309e8 [Perf] Mem align KV caches for CUDA devices (MLA perf improvement) (#12676)
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Lucas Wilkinson <lcwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-02-04 18:22:24 -08:00
233df6f5c4 [V1][Metrics] Add request_success_total counter, labelled with finish reason (#12579)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-02-04 19:46:54 -05:00
18016a5e62 [Bugfix] Fix CI failures for InternVL and Mantis models (#12728)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-02-04 23:54:23 +08:00
649550f27e [Build] update requirements of no-device for plugin usage (#12630)
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
2025-02-04 21:19:12 +08:00
62467a834a Avoid unnecessary multi-modal input data copy when len(batch) == 1 (#12722)
Signed-off-by: imkero <kerorek@outlook.com>
2025-02-04 21:03:19 +08:00
6469038b14 [Bugfix] Fix loading of fine-tuned models based on Phi-3-Small (#12689)
Signed-off-by: Michael Greenbaum <mgreenbaum@microsoft.com>
Co-authored-by: Michael Greenbaum <mgreenbaum@microsoft.com>
2025-02-04 20:58:48 +08:00
815079de8e [VLM] merged multimodal processor and V1 support for idefics3 (#12660)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-02-04 20:00:51 +08:00
18a88fcccc [V1] Remove scheduling constraint on partial requests (#12674)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-04 02:43:58 -08:00
d1ca7df84d [VLM] Merged multi-modal processor for InternVL-based models (#12553)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-02-04 16:44:52 +08:00
96b23621c1 [Misc] Add BNB quantization for Whisper (#12381)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-02-04 16:27:36 +08:00
c36ac98d01 [AMD][ROCm] Enable DeepSeek model on ROCm (#12662)
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
2025-02-04 08:24:11 +00:00
4896d0c2dd [Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs (#12711) 2025-02-03 23:27:11 -08:00
bb392af434 [Doc] Replace ibm-fms with ibm-ai-platform (#12709)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-02-04 07:05:04 +00:00
5d98d56089 Support Pixtral-Large HF by using llava multimodal_projector_bias config (#12710)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-02-04 11:55:46 +08:00
73b35cca7f [Core] Improve hash collision avoidance in prefix caching (#12621)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-03 16:28:20 -08:00
5095e96606 [V1] Revert uncache_blocks and support recaching full blocks (#12415)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-03 15:04:53 -08:00
cf58b9c4ca [MISC] Remove model input dumping when exception (#12582)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-02-03 13:34:16 -08:00
4797dad3ec [Model] Add Deepseek V3 fp8_w8a8 configs for B200 (#12707) 2025-02-03 13:30:39 -08:00
6dd5e52823 Squelch MLA warning for Compressed-Tensors Models (#12704)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2025-02-03 13:29:56 -08:00
c11de33dad [Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm (#12696)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-03 13:04:59 -08:00
33e0602e59 [Misc] Fix improper placement of SPDX header in scripts (#12694)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-03 11:16:59 -08:00
a1a2aaadb9 [Model]: Add transformers backend support (#11330)
# Adds support for `transformers` as a backend

Following https://github.com/huggingface/transformers/pull/35235, a
number of models should already be supported, and we are ramping up
support for more.

Thanks @Isotr0py for the TP support, and @hmellor for his help as well!
This includes:
- `trust_remote_code=True` support: any model on the Hub that
implements attention the correct way can be natively supported
- tensor parallel support

---------

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <41363108+Isotr0py@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
2025-02-03 21:30:38 +08:00
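A minimal usage sketch of the Transformers backend described in the entry above. The model name is a placeholder, and the `model_impl="transformers"` / `trust_remote_code` knobs are assumptions for illustration rather than quotes from the PR.

```python
from vllm import LLM, SamplingParams

# Minimal sketch, assuming the Transformers backend can be selected explicitly.
llm = LLM(
    model="my-org/my-custom-model",   # hypothetical Hub model with custom code
    model_impl="transformers",        # assumed knob to force the transformers backend
    trust_remote_code=True,           # allow custom modeling code from the Hub
    tensor_parallel_size=2,           # TP support mentioned in the entry above
)
outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```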
1298a400e8 [ci/build] fix gh200 test (#12681)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-03 15:59:49 +08:00
ad4a9dc817 [cuda] manually import the correct pynvml module (#12679)
fixes problems like https://github.com/vllm-project/vllm/pull/12635 and
https://github.com/vllm-project/vllm/pull/12636 and
https://github.com/vllm-project/vllm/pull/12565

---------

Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-03 15:58:21 +08:00
b9986454fe Fix for attention layers to remain unquantized during moe_wn16 quant (#12570)
Fix to AWQ quant loading of the new R1 model.

The new optimized MoE kernel for a large number of experts, `moe_wn16`,
uses AWQ quant, which requires the attention layers to stay in 16-bit.

The current merge has broken this; `get_quant_method` must
return None for the attention layers for it to work correctly again.

---------

Signed-off-by: Srikanth Srinivas <srikanth@astrum.ai>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Beim <beim2015@outlook.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Shawn Du <shawnd200@outlook.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Beim <805908499@qq.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Nishidha <nishidha.panpaliya@partner.ibm.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Ryan Nguyen <96593302+xpbowler@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Co-authored-by: fade_away <1028552010@qq.com>
Co-authored-by: weilong.yu <weilong.yu@shopee.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Co-authored-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
Co-authored-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Shawn Du <shawnd200@outlook.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-02-03 13:46:19 +08:00
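A minimal sketch of the rule described in the entry above, with illustrative class and prefix names rather than the actual `moe_wn16` code: `get_quant_method` returns None for attention layers so they stay in 16-bit, and only returns a quant method for MoE or linear layers.

```python
from typing import Optional


class LinearMethodSketch:
    """Stand-in for the AWQ/GPTQ-based linear method."""


class MoEMethodSketch:
    """Stand-in for the optimized fused-MoE quant method."""


class MoeWN16ConfigSketch:
    def get_quant_method(self, layer_prefix: str) -> Optional[object]:
        # Attention layers must stay in 16-bit, so no quant method is returned.
        if ".self_attn." in layer_prefix:
            return None
        # Expert weights use the optimized MoE path.
        if ".experts." in layer_prefix:
            return MoEMethodSketch()
        # Everything else falls back to the AWQ/GPTQ-based linear method.
        return LinearMethodSketch()
```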
c5932e5dac Properly check if all fused layers are in the list of targets (#12666)
Thanks @kylesayrs for catching this!
2025-02-03 13:42:18 +08:00
20579c0fae make sure mistral_common not imported for non-mistral models (#12669)
When people use deepseek models, they find that they need to resolve a
cv2 version conflict; see https://zhuanlan.zhihu.com/p/21064432691.

I added the check and made all imports of `cv2` lazy.

---------

Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-03 13:40:25 +08:00
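A minimal sketch of the lazy-import pattern described in the entry above; the helper name and signature are illustrative, not the actual vLLM code. `cv2` is only imported when a video really has to be decoded, so users of text-only, non-Mistral models never hit the cv2 version conflict.

```python
def _load_video_frames(path: str):
    # Deferred import: cv2 (and its version constraints) only matter when a
    # video actually needs to be decoded.
    import cv2

    capture = cv2.VideoCapture(path)
    frames = []
    ok, frame = capture.read()
    while ok:
        frames.append(frame)
        ok, frame = capture.read()
    capture.release()
    return frames
```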
95460fc513 [Kernel] port sgl moe_align_block_size kernels (#12574)
sgl_moe_align_block_size is based on commit ded9fcd09a;
moe_align_block_size is based on commit ba5112ff69.

Signed-off-by: Yang Chen <yangche@fb.com>
2025-02-03 13:09:50 +08:00
326fcc8b9f [Doc] Deprecate Discord (#12668) 2025-02-02 19:19:56 -08:00
e64330910b [doc][misc] clarify VLLM_HOST_IP for multi-node inference (#12667)
As more and more people are trying deepseek models with multi-node
inference, https://github.com/vllm-project/vllm/issues/7815 becomes more
frequent. Let's give a clear message to users.

Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-02-03 09:32:18 +08:00
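A minimal illustration of the point being documented, with a placeholder address: on every node of a multi-node deployment, `VLLM_HOST_IP` should be set to an address the other nodes can reach, before vLLM is started.

```python
import os

# Placeholder address: use this node's interface that the *other* nodes can
# reach (not 127.0.0.1), and set it before vLLM is imported or launched.
os.environ["VLLM_HOST_IP"] = "192.168.1.10"
```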
e489ad7a21 [Misc] Add SPDX-License-Identifier headers to python source files (#12628)
- **Add SPDX license headers to python source files**
- **Check for SPDX headers using pre-commit**

commit 9d7ef44c3cfb72ca4c32e1c677d99259d10d4745
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:18:24 2025 -0500

    Add SPDX license headers to python source files
    
    This commit adds SPDX license headers to python source files as
    recommended to the project by the Linux Foundation. These headers
    provide a concise way that is both human and machine readable for
    communicating license information for each source file. It helps
    avoid any ambiguity about the license of the code and can also be
    easily used by tools to help manage license compliance.

    The Linux Foundation runs license scans against the codebase to help
    ensure we are in compliance with the licenses of the code we use,
    including dependencies. Having these headers in place helps that
    tool do its job.
    
    More information can be found on the SPDX site:
    
    - https://spdx.dev/learn/handling-license-info/
    
    Signed-off-by: Russell Bryant <rbryant@redhat.com>

commit 5a1cf1cb3b80759131c73f6a9dddebccac039dea
Author: Russell Bryant <rbryant@redhat.com>
Date:   Fri Jan 31 14:36:32 2025 -0500

    Check for SPDX headers using pre-commit
    
    Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------

Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-02 11:58:18 -08:00
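The header being added is the first line of the sketch below (vLLM is Apache-2.0 licensed); the checker after it is only an illustration of how such headers can be verified, not the actual pre-commit hook.

```python
# SPDX-License-Identifier: Apache-2.0
"""Sketch of an SPDX header check; the real project enforces this via pre-commit."""
import sys


def has_spdx_header(path: str) -> bool:
    # Only the first few hundred bytes need to be inspected for the header.
    with open(path, encoding="utf-8", errors="ignore") as f:
        return "SPDX-License-Identifier:" in f.read(512)


if __name__ == "__main__":
    missing = [p for p in sys.argv[1:] if not has_spdx_header(p)]
    for p in missing:
        print(f"missing SPDX header: {p}")
    sys.exit(1 if missing else 0)
```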
f256ebe4df [Hardware][Intel GPU] add XPU bf16 support (#12392)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-02-02 10:17:26 +00:00
f8ece6e17f [Core][v1] Unify allocating slots in prefill and decode in KV cache manager (#12608)
As mentioned in RFC https://github.com/vllm-project/vllm/issues/12254,
this PR achieves the task of combining allocate_slots and append_slots.

There should be no functionality change, except that decode now also
raises an exception when num_tokens is zero (as prefill does), and the
unit test case is changed accordingly.

@comaniac @rickyyx @WoosukKwon @youkaichao @heheda12345 @simon-mo

---------

Signed-off-by: Shawn Du <shawnd200@outlook.com>
2025-02-02 16:40:58 +08:00
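A minimal sketch of the unified behavior described in the entry above (not the real KVCacheManager): one `allocate_slots` path serves both prefill and decode, and both reject `num_tokens == 0`.

```python
class KVCacheManagerSketch:
    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self._next_block_id = 0

    def allocate_slots(self, request_id: str, num_tokens: int) -> list[int]:
        # Unified check: prefill and decode must both request at least one token.
        if num_tokens <= 0:
            raise ValueError("num_tokens must be greater than 0")
        num_blocks = -(-num_tokens // self.block_size)  # ceiling division
        block_ids = list(range(self._next_block_id,
                               self._next_block_id + num_blocks))
        self._next_block_id += num_blocks
        return block_ids
```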
abfcdcdf27 [V1][Minor] Avoid frequently creating ConstantList (#12653)
A small optimization to avoid creating a new `ConstantList` every time `request.kv_block_hashes` is used.

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-02-01 23:43:20 -08:00
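A minimal sketch of the optimization, with illustrative class names rather than the actual vLLM types: build the read-only view once and reuse it, instead of wrapping the underlying list on every access to `kv_block_hashes`.

```python
from functools import cached_property


class ConstantListSketch(tuple):
    """Immutable stand-in for vLLM's ConstantList, used only for illustration."""


class RequestSketch:
    def __init__(self, kv_block_hashes: list[int]):
        self._kv_block_hashes = kv_block_hashes

    @cached_property
    def kv_block_hashes(self) -> ConstantListSketch:
        # Constructed once on first access and cached, rather than rebuilt
        # every time the attribute is read.
        return ConstantListSketch(self._kv_block_hashes)
```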
e497f33491 [Core] Silence unnecessary deprecation warnings (#12620)
I noticed during testing that I was getting a lot of these deprecation
warnings about `lora_local_path`:

```
DeprecationWarning: The 'lora_local_path' attribute is deprecated
     and will be removed in a future version.
     Please use 'lora_path' instead.
```

The check used for emitting this warning was always True, even when the
parameter was not actually specified, because the field will always be
in `__struct_fields__`. We should be checking for a non-None value
instead.

Signed-off-by: Russell Bryant <rbryant@redhat.com>

Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-02-02 15:35:50 +08:00
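A minimal sketch of the corrected check (field names follow the warning text above; this is not the actual vLLM request class): the warning fires only when the deprecated field was actually given a value, not merely because it is declared.

```python
import warnings
from typing import Optional


class LoRARequestSketch:
    def __init__(self, lora_path: Optional[str] = None,
                 lora_local_path: Optional[str] = None):
        # Previously the condition was effectively "is the field declared?",
        # which is always true; the fix is to check for a non-None value.
        if lora_local_path is not None:
            warnings.warn(
                "The 'lora_local_path' attribute is deprecated and will be "
                "removed in a future version. Please use 'lora_path' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            lora_path = lora_path or lora_local_path
        self.lora_path = lora_path
```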
baaa2b24da [Bugfix] fix moe_wna16 get_quant_method (#12648)
Fix https://github.com/vllm-project/vllm/issues/12647
The `get_quant_method` of `moe_wna16` always returns the MoE method, the
GPTQ-based linear method, or the AWQ-based linear method, even when the
target module is an attention layer.

baeded2569/vllm/attention/layer.py (L86-L92)

Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
2025-02-02 15:29:56 +08:00
b4e5c03306 doc: fixing minor typo in readme.md (#12643)
Word "evolved" was mistyped

Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>

---------

Signed-off-by: Vicente Herrera <vicenteherrera@vicenteherrera.com>
2025-02-01 17:17:29 +00:00
3194039c0e Apply torch.compile to fused_moe/grouped_topk (#12637) 2025-02-01 16:16:19 +00:00
4f4d427ac2 Disable chunked prefill and/or prefix caching when MLA is enabled (#12642)
From @mgoin in https://github.com/vllm-project/vllm/pull/12638

I cannot push to that branch, so this is a new PR to unblock the release.

---------

Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-01-31 23:46:57 -08:00
1e3698393f [CI/Build] Add label automation for structured-output, speculative-decoding, v1 (#12280)
We have `v1`, `structured-output`, and `speculative-decoding` labels on
GitHub. This adds automation for applying these labels based on the
files touched by a PR.

Signed-off-by: Russell Bryant <rbryant@redhat.com>

---------

Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-01-31 23:13:10 -08:00
baeded2569 [Attention] Deepseek v3 MLA support with FP8 compute (#12601)
This PR implements Deepseek V3 support by performing matrix absorption of the fp8 weights.

---------

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
2025-01-31 21:52:51 -08:00
3e1c76cf3a Fix: Respect sparsity_config.ignore in Cutlass Integration (#12517)
This PR addresses a bug in the Cutlass integration where the
`sparsity_config.ignore` list was not being respected. When only a
subset of modules were configured as Sparse24, the system incorrectly
selected Cutlass for non-sparse modules as well. This update ensures the
correct scheme is selected for non-sparse modules, fixing this behavior.

---

### Changes

- Updated logic to correctly respect `sparsity_config.ignore`.
- Ensured non-sparse modules use the appropriate scheme instead of
defaulting to Cutlass.

---

<details>
<summary>Testing Setup</summary>

The fix has been tested on top of [this
diff](https://github.com/vllm-project/vllm/pull/12097).

#### Steps to Test:
```bash
git checkout -b my-test-branch origin/rahul-bitmask-additions # compressed Cutlass support
git revert --no-edit aa2cd2c # revert Tyler's commit to turn off Cutlass for W16A16
git cherry-pick ca624cddb # this branch
```

#### Additional Patch Required:
```diff
diff --git a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
index a54177c1c..f916dd0c9 100644
--- a/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
+++ b/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
@@ -9,7 +9,7 @@ from compressed_tensors.quantization import (QuantizationArgs,
                                              QuantizationStrategy,
                                              QuantizationType)
 from pydantic import BaseModel
-
+from vllm.logger import init_logger
 from vllm.model_executor.layers.fused_moe import FusedMoE
 from vllm.model_executor.layers.linear import (LinearBase, LinearMethodBase,
                                                UnquantizedLinearMethod)
@@ -27,7 +27,7 @@ from vllm.model_executor.layers.quantization.compressed_tensors.utils import (
     should_ignore_layer)
 from vllm.model_executor.layers.quantization.kv_cache import BaseKVCacheMethod
 from vllm.platforms import current_platform
-
+logger = init_logger(__name__)
 __all__ = ["CompressedTensorsLinearMethod"]
 
 SPARSITY_CONFIG_NAME: Literal["sparsity_config"] = "sparsity_config"
```

Apply using:
```bash
git apply logging-patch.patch
```

</details>

---

<details>
<summary>Models Tested</summary>

- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-full-sparse24`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-entire-fp8-compressed`
- `nm-testing/TinyLlama-1.1B-Chat-v1.0-gsm8k-partial-24-remaining-fp8-compressed`

</details>

---


<details>
<summary>Example Output</summary>

#### Layers 0-5 (Sparse24)
```
Using scheme: CompressedTensors24 for model.layers.0.self_attn.qkv_proj
Using scheme: CompressedTensors24 for model.layers.0.self_attn.o_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.gate_up_proj
Using scheme: CompressedTensors24 for model.layers.0.mlp.down_proj
...
```

#### Layers 6+ (Non-Sparse, FP8)
```
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.qkv_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.self_attn.o_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.gate_up_proj
Using scheme: CompressedTensorsW8A8Fp8 for model.layers.6.mlp.down_proj
...
```

</details>

**Note:** It is assumed that all modules in fused layers such as `QKV_proj` and
`Gate_up_proj` follow the same quantization/pruning scheme.

---

For related tasks using the Asana app for GitHub, refer to
[this link](https://app.asana.com/0/0/1209227810815160).

Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
2025-02-01 13:41:59 +08:00
cfa134d247 [Bugfix/CI] Fixup benchmark_moe.py (#12562)
Fixes `is_marlin` not being passed into `get_default_config`

Also allow `--tensor-parallel-size` in addition to `-tp` and `--tp-size`

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-02-01 13:41:35 +08:00
35b7a05507 [ci] Upgrade transformers to 4.48.2 in CI dependencies (#12599) 2025-01-31 21:22:23 -08:00
1867c258bd Fix target matching for fused layers with compressed-tensors (#12617)
Without this PR
---------------
Quantizing models with llm-compressor and a recipe that explicitly lists
names of layers produces a model that is not loadable by vLLM (i.e.
`vllm serve <model>` fails with `raise ValueError(f"Unable to find
matching target for {module} in the ...`).

Example recipe:
```
recipe = """
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128
          targets: [
            "model.layers.0.mlp.down_proj",
            "model.layers.2.mlp.down_proj",
            "model.layers.3.mlp.down_proj",
            "model.layers.4.mlp.down_proj",
            "model.layers.5.mlp.down_proj",
            "model.layers.6.mlp.down_proj",
            "model.layers.7.mlp.down_proj",
            "model.layers.8.mlp.down_proj",
            "model.layers.9.mlp.down_proj",
            "model.layers.10.mlp.down_proj",
            "model.layers.11.mlp.down_proj",
            "model.layers.12.mlp.down_proj",
            "model.layers.13.mlp.down_proj",
            "model.layers.14.mlp.down_proj",
            "model.layers.15.mlp.down_proj",
            "model.layers.16.mlp.down_proj",
            "model.layers.17.mlp.down_proj",
            "model.layers.19.mlp.down_proj",
            "model.layers.21.mlp.down_proj",
            "model.layers.22.mlp.down_proj",
            .
            .
            .
          ]
"""
```

To reproduce the vLLM error: 
```bash
vllm serve nm-testing/eldar-test
```

With this PR
------------
Models are loaded correctly without any errors.
2025-02-01 05:07:46 +00:00
cb3e73e4c8 [BugFix] fix wrong output when using lora and num_scheduler_steps=8 (#11161)
FIX issues https://github.com/vllm-project/vllm/issues/9688,
https://github.com/vllm-project/vllm/issues/11086, and #12487

---------

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: weilong.yu <weilong.yu@shopee.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-02-01 12:52:07 +08:00
b1340f9d55 [V1] Bugfix: Validate Model Input Length (#12600)
SUMMARY:
* avoid crashing the engine when we get an input longer than
max_model_len

FIX #12567
2025-01-31 18:32:04 -08:00
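A minimal sketch of the validation described in the entry above (not the real engine code): an over-long prompt is rejected when the request is added, instead of crashing the engine later.

```python
def validate_prompt_length(prompt_token_ids: list[int], max_model_len: int) -> None:
    # Reject the single request up front rather than letting it crash the engine.
    if len(prompt_token_ids) > max_model_len:
        raise ValueError(
            f"Prompt has {len(prompt_token_ids)} tokens, which exceeds "
            f"max_model_len of {max_model_len}."
        )
```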
44bbca78d7 [Doc] int4 w4a16 example (#12585)
Based on a request by @mgoin, with @kylesayrs we have added an example
doc for int4 w4a16 quantization, following the pre-existing int8 w8a8
quantization example and the example available in
[`llm-compressor`](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16/llama3_example.py).

FIX #n/a (no issue created)

@kylesayrs and I have discussed a couple of additional improvements for the
quantization docs. We will revisit at a later date, possibly including:
- A section for "choosing the correct quantization scheme/compression technique"
- Additional vision or audio calibration datasets

---------

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-01-31 15:38:48 -08:00
60808bd4c7 [Doc] Improve installation signposting (#12575)
- Make device tab names more explicit
- Add comprehensive list of devices to
https://docs.vllm.ai/en/latest/getting_started/installation/index.html
- Add `attention` blocks to the intro of all devices that don't have
pre-built wheels/images

---------

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-31 15:38:35 -08:00
fc542144c4 [Feature] Fix guided decoding blocking bitmask memcpy (#12563)
**[Guided decoding performance optimization]** Sending the guided
decoding bitmask in xgrammar to the GPU
(`self.token_bitmask.to(scores.device)`) is a blocking operation that
prevents the CPU from pre-launching the sampler kernels. The CPU waits
until decode is complete, then copies the bitmask over. This PR changes
the operation to async by setting `non_blocking=True`.

(Current) The CPU is blocked on a `cudaStreamSynchronize` and only
launches the sampling kernels after the bitmask has been applied. Below
is the Nsys profile for one decode phase of Llama 3.1 8B.

![image](https://github.com/user-attachments/assets/8997eae1-b822-4f52-beb8-ef19a7c6b824)

With the optimization, this is no longer the case:

![image](https://github.com/user-attachments/assets/6d5ea83f-f169-4f98-a8c1-41c719b3e1e7)

---------

Signed-off-by: Ryan N <ryan.nguyen@centml.ai>
2025-01-31 15:37:30 -08:00
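A minimal sketch of the change, assuming PyTorch and a CUDA device; this is not the actual xgrammar integration code, and for the copy to be truly asynchronous the bitmask should live in pinned host memory.

```python
import torch


def apply_guided_decoding_bitmask(scores: torch.Tensor,
                                  token_bitmask: torch.Tensor) -> torch.Tensor:
    # Async host-to-device copy: with non_blocking=True (and a pinned-memory
    # source) the CPU keeps pre-launching sampler kernels instead of stalling
    # on the transfer, as it did with the plain blocking .to(device) call.
    mask = token_bitmask.to(scores.device, non_blocking=True)
    # Disallowed tokens get -inf so they can never be sampled.
    return scores.masked_fill(~mask.bool(), float("-inf"))
```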
eb5741ad42 [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 (#12587)
Integrates the block-quantized kernels introduced in
https://github.com/vllm-project/vllm/pull/11868 for use in linear
layers.

Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-31 15:29:11 -08:00
145c2ff648 [Bugfix] Revert MoE Triton Config Default (#12629)
SUMMARY:
* previous PR for pulling in block configs also changed defaults
(https://github.com/vllm-project/vllm/pull/11589/files) for FP8
* this broke L4 MoE since there was not enough SHM for the default
configuration
* this reverts the non-block example to the default

Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-01-31 15:28:47 -08:00
415f19474d [release] Add input step to ask for Release version (#12631)
This avoids having to create a new build with the release version put in
as an env var.
2025-01-31 13:39:36 -08:00
89003c4082 [v1][Bugfix] Add extra_keys to block_hash for prefix caching (#12603)
This PR adds an extra key to the block hash, to generate different hash
values for two blocks with the same token string but different extra_keys
in their parent blocks. For example, it generates different hash values
for the second block of the following two requests:
```python
request1 = make_request(
    request_id=0,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash1", "hash2"],
)
request2 = make_request(
    request_id=1,
    prompt_token_ids=[_ for _ in range(6)],
    mm_positions=[{
        "offset": 0,
        "length": 3
    }, {
        "offset": 3,
        "length": 3
    }],
    mm_hashes=["hash3", "hash2"],
)
```

---------

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-31 13:13:04 -08:00
60bcef000e [Docs][V1] Prefix caching design (#12598)
- Create v1 design document section in docs.
- Add prefix caching design doc.

@WoosukKwon @ywang96

---------

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-01-31 12:30:46 -08:00
847f883232 [Git] Automatically sign-off commits (#12595)
It's very annoying when I forget to add `-s` in `git commit` to
sign off, because I then need to `git rebase HEAD~1 --signoff` and `git
push -f` to fix the DCO. This PR adds a hook to sign off commits
automatically when `-s` is missing. The only change on the user side is
that users now have to install two hooks, so instead of just

```
pre-commit install
```

we now need

```
pre-commit install --hook-type pre-commit --hook-type commit-msg
```

Note that even if users still only install the pre-commit hook, they
won't get any error in `git commit`; the sign-off hook just won't run.

cc @hmellor @youkaichao

---------

Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2025-01-31 12:30:33 -08:00
325f679f32 [BugFix] Fix Torch.Compile For DeepSeek (#12594)
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-01-31 12:06:39 -08:00
e3f7ff65e7 Add favicon to docs (#12611)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-31 09:20:34 -08:00
7a8987dac5 [Bugfix] Gracefully handle huggingface hub http error (#12571) 2025-01-31 08:19:35 +00:00
cabaf4eff3 [Attention] MLA decode optimizations (#12528)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Alexander Matveev <59768536+alexm-neuralmagic@users.noreply.github.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-01-30 23:49:37 -08:00
a1fc18c030 [ROCm][AMD][Model] llama 3.2 support upstreaming (#12421)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
2025-01-31 12:24:28 +08:00
9798b2fb00 [Kernel] Update cutlass_scaled_mm to support 2d group (blockwise) scaling (#11868) 2025-01-30 18:33:00 -08:00
4078052f09 [V1][Log] Add max request concurrency log to V1 (#12569)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-30 23:07:19 +00:00
bd2107e30a [CPU][PPC] Updated torch, torchvision, torchaudio dependencies (#12555)
Signed-off-by: npanpaliya <nishidha.panpaliya@partner.ibm.com>
2025-01-30 16:29:39 -05:00
9b0c4bab36 [Kernel] Triton Configs for Fp8 Block Quantization (#11589)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
2025-01-30 11:53:22 -08:00
41bf5612f5 [Misc] fix typo: add missing space in lora adapter error message (#12564)
Signed-off-by: Beim <beim2015@outlook.com>
2025-01-30 15:39:22 +00:00
a2769032ca Set ?device={device} when changing tab in installation guides (#12560)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-30 00:05:42 -08:00
f17f1d4608 [V1][Metrics] Add GPU cache usage % gauge (#12561)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-29 18:31:01 -08:00
1c1bb0bbf2 [Misc][MoE] add Deepseek-V3 moe tuning support (#12558)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-01-30 00:47:30 +00:00
e0cc5f259a [V1][BugFix] Free encoder cache for aborted requests (#12545)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-29 13:47:33 -08:00
73aa6cfdf7 Revert "[Build/CI] Fix libcuda.so linkage" (#12552) 2025-01-29 21:12:24 +00:00
27b78c73ca [Kernel] add triton fused moe kernel for gptq/awq (#12185) 2025-01-29 09:07:09 -05:00
b02fd288b2 [Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. (#11787)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-01-29 01:46:12 -08:00
ff7424f491 [Frontend] Support override generation config in args (#12409)
Signed-off-by: liuyanyi <wolfsonliu@163.com>
2025-01-29 01:41:01 -08:00
d93bf4da85 [Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM (#12069)
Signed-off-by: hzh <hezhihui_thu@163.com>
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Oleg Mosalov <oleg@krai.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
Signed-off-by: Chenguang Li <757486878@qq.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Shanshan Shen <467638484@qq.com>
Signed-off-by: elijah <f1renze.142857@gmail.com>
Signed-off-by: Yikun <yikunkero@gmail.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
Co-authored-by: shaochangxu <85155497+shaochangxu@users.noreply.github.com>
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Co-authored-by: sixgod <evethwillbeok@outlook.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Akshat Tripathi <Akshat.tripathi6568@gmail.com>
Co-authored-by: Oleg Mosalov <oleg@krai.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Avshalom Manevich <12231371+avshalomman@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Yangcheng Li <liyangcheng.lyc@alibaba-inc.com>
Co-authored-by: Siyuan Li <94890248+liaoyanqing666@users.noreply.github.com>
Co-authored-by: Concurrensee <yida.wu@amd.com>
Co-authored-by: Chenguang Li <757486878@qq.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Alex Brooks <alex.brooks@ibm.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: elijah <30852919+e1ijah1@users.noreply.github.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Steve Luo <36296769+SunflowerAries@users.noreply.github.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Konrad Zawora <kzawora@habana.ai>
Co-authored-by: TJian <tunjian1996@gmail.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: maang-h <55082429+maang-h@users.noreply.github.com>
Co-authored-by: Elfie Guo <164945471+elfiegg@users.noreply.github.com>
Co-authored-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-01-29 09:24:59 +00:00
036ca94c25 [Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense (#12347)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Wallas Santos <wallashss@ibm.com>
2025-01-29 08:54:35 +00:00
ef001d98ef Fix the pydantic logging validator (#12420)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2025-01-29 07:53:13 +00:00
5f671cb4c3 [V1] Improve Error Message for Unsupported Config (#12535)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-01-29 04:56:56 +00:00
bd02164cf9 Bugfix for whisper quantization due to fake k_proj bias (#12524)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-29 04:49:03 +00:00
46fb056749 [V1][Metrics] Add TTFT and TPOT histograms (#12530)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-29 04:11:16 +00:00
dd6a3a02cb [Doc] Convert docs to use colon fences (#12471)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-29 11:38:29 +08:00
a7e3eba66f [Frontend] Support reasoning content for deepseek r1 (#12473)
Signed-off-by: Ce Gao <cegao@tensorchord.ai>
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
2025-01-29 11:38:08 +08:00
fbb5bd4cef [TPU] Add example for profiling TPU inference (#12531)
Signed-off-by: mgoin <mgoin@redhat.com>
2025-01-29 03:16:47 +00:00
80fcc3ed1c [Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels (#12482)
Signed-off-by: Fenghui Zhang <fhzhang@google.com>
2025-01-28 22:36:44 +00:00
c386c43ca3 [V1][Metrics] Add per-request prompt/generation_tokens histograms (#12516)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-28 22:07:22 +00:00
f26d790718 Do not run suggestion pre-commit hook multiple times (#12521)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-28 20:05:27 +00:00
0f657bdc52 Replace missed warning_once for rerank API (#12472)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-28 19:06:32 +00:00
3fd1fb63ef [V1][Metrics] Hook up IterationStats for Prometheus metrics (#12478)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-28 16:38:38 +00:00
925d2f1908 [Doc] Fix typo for x86 CPU installation (#12514)
Signed-off-by: Jun Duan <jun.duan.phd@outlook.com>
2025-01-28 16:37:10 +00:00
8f58a51358 [VLM] Merged multi-modal processor and V1 support for Qwen-VL (#12504)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-28 16:25:05 +00:00
2079e43bee [Core] Make raw_request optional in ServingCompletion (#12503)
Signed-off-by: Sebastian Schönnenbeck <sebastian.schoennenbeck@comma-soft.com>
2025-01-28 10:56:45 +00:00
e29d4358ef [V1] Include Engine Version in Logs (#12496)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-01-28 08:27:41 +00:00
8cbc424975 Update README.md with V1 alpha release (#12495) 2025-01-28 08:22:41 +00:00
dd66fd2b01 [CI] fix pre-commit error (#12494)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-01-28 06:11:05 +00:00
0f465ab533 [FEATURE] Enables offline /score for embedding models (#12021)
Signed-off-by: Gabriel Marinho <gmarinho@ibm.com>
2025-01-28 11:30:13 +08:00
23a7cbc88b [CI/Build] Fixed the xla nightly issue report in #12451 (#12453) 2025-01-28 11:18:07 +08:00
426a5c3625 Fix bad path in prometheus example (#12481)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-27 18:56:31 -07:00
ddee88d0ff [Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache (#11277)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
Co-authored-by: Jiangfei Duan <jfduan@outlook.com>
2025-01-27 17:31:16 -08:00
823ab79633 Update pre-commit hooks (#12475)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-27 17:23:08 -07:00
6116ca8cd7 [Feature] [Spec decode]: Enable MLPSpeculator/Medusa and prompt_logprobs with ChunkedPrefill (#10132)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wallashss <wallashss@ibm.com>
Co-authored-by: wallashss <wallashss@ibm.com>
2025-01-27 13:38:35 -08:00
2bc3fbba0c [FlashInfer] Upgrade to 0.2.0 (#11194)
Signed-off-by: Bowen Wang <abmfy@icloud.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-01-27 18:19:24 +00:00
3f1fc7425a [V1][CI/Test] Do basic test for top-p & top-k sampling (#12469)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-27 09:40:04 -08:00
01ba927040 [V1][Metrics] Add initial Prometheus logger (#12416)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2025-01-27 12:26:28 -05:00
103bd17ac5 [Build] Only build 9.0a for scaled_mm and sparse kernels (#12339)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-27 10:40:00 -05:00
ce69f7f754 [Bugfix] Fix gpt2 GGUF inference (#12467)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-27 18:31:49 +08:00
624a1e4711 [V1][Minor] Minor optimizations for update_from_output (#12454)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-27 01:09:27 -08:00
372bf0890b [Bugfix] Fix missing seq_start_loc in xformers prefill metadata (#12464)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-27 07:25:30 +00:00
5204ff5c3f [Bugfix] Fix Granite 3.0 MoE model loading (#12446)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-26 21:26:44 -08:00
0cc6b383d7 [Frontend] Support scores endpoint in run_batch (#12430)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
2025-01-27 04:30:17 +00:00
28e0750847 [V1] Avoid list creation in input preparation (#12457)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-26 19:57:56 -08:00
582cf78798 [DOC] Add link to vLLM blog (#12460)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-01-27 03:46:19 +00:00
0034b09ceb [Frontend] Rerank API (Jina- and Cohere-compatible API) (#12376)
Signed-off-by: Kyle Mistele <kyle@mistele.com>
2025-01-26 19:58:45 -07:00
72bac73067 [Build/CI] Fix libcuda.so linkage (#12424)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-26 21:18:19 +00:00
68f11149d8 [Bugfix][Kernel] Fix perf regression caused by PR #12405 (#12434)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-26 11:09:34 -08:00
72f4880425 [Bugfix/CI] Fix broken kernels/test_mha.py (#12450) 2025-01-26 10:39:03 -08:00
aa2cd2c43d [Bugfix] Disable w16a16 2of4 sparse CompressedTensors24 (#12417)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-01-26 19:59:58 +08:00
9ddc35220b [Frontend] generation_config.json for maximum tokens (#12242)
Signed-off-by: Matthew Hendrey <matthew.hendrey@gmail.com>
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: shangmingc <caishangming@linux.alibaba.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-01-26 19:59:25 +08:00
a5255270c3 [Misc] Revert FA on ViT #12355 and #12435 (#12445) 2025-01-26 03:56:34 -08:00
0ee349b553 [V1][Bugfix] Fix assertion when mm hashing is turned off (#12439)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-26 00:47:42 -08:00
fa63e710c7 [V1][Perf] Reduce scheduling overhead in model runner after cuda sync (#12094)
Signed-off-by: Keyun Tong <tongkeyun@gmail.com>
2025-01-26 00:42:37 -08:00
2a0309a646 [Misc][Bugfix] FA3 support to ViT MHA layer (#12435)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-26 05:00:31 +00:00
324960a95c [TPU][CI] Update torchxla version in requirement-tpu.txt (#12422)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
2025-01-25 07:23:03 +00:00
f1fc0510df [Misc] Add FA2 support to ViT MHA layer (#12355)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-25 15:07:35 +08:00
bf21481dde [ROCm][MoE] MI300 tuned configs Mixtral-8x(7B,22B) | fp16, fp8 (#12408)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-01-25 12:17:19 +08:00
fb30ee92ee [Bugfix] Fix BLIP-2 processing (#12412)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-25 11:42:42 +08:00
221d388cc5 [Bugfix][Kernel] Fix moe align block issue for mixtral (#12413) 2025-01-25 01:49:28 +00:00
3132a933b6 [Bugfix][Kernel] FA3 Fix - RuntimeError: This flash attention build only supports pack_gqa (for build size reasons). (#12405)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-24 20:20:59 +00:00
df5dafaa5b [Misc] Remove deprecated code (#12383)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-24 14:45:20 -05:00
ab5bbf5ae3 [Bugfix][Kernel] Fix CUDA 11.8 being broken by FA3 build (#12375)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-24 15:27:59 +00:00
3bb8e2c9a2 [Misc] Enable proxy support in benchmark script (#12356)
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
2025-01-24 14:58:26 +00:00
e784c6b998 [ci/build] sync default value for wheel size (#12398)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-24 17:54:29 +08:00
9a0f3bdbe5 [Hardware][Gaudi][Doc] Add missing step in setup instructions (#12382) 2025-01-24 09:43:49 +00:00
c7c9851036 [ci/build] fix wheel size check (#12396)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-24 17:31:25 +08:00
3c818bdb42 [Misc] Use VisionArena Dataset for VLM Benchmarking (#12389)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-24 00:22:04 -08:00
6dd94dbe94 [perf] fix perf regression from #12253 (#12380)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-24 11:34:27 +08:00
0e74d797ce [V1] Increase default batch size for H100/H200 (#12369)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-24 03:19:55 +00:00
55ef66edf4 Update compressed-tensors version (#12367) 2025-01-24 11:19:42 +08:00
5e5630a478 [Bugfix] Path join when building local path for S3 clone (#12353)
Signed-off-by: Omer Dayan (SW-GPU) <omer@run.ai>
2025-01-24 11:06:07 +08:00
d3d6bb13fb Set weights_only=True when using torch.load() (#12366)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-01-24 02:17:30 +00:00
24b0205f58 [V1][Frontend] Coalesce bunched RequestOutputs (#12298)
Signed-off-by: Nick Hill <nhill@redhat.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2025-01-23 17:17:41 -08:00
c5cffcd0cd [Docs] Update spec decode + structured output in compat matrix (#12373)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-01-24 01:15:52 +00:00
682b55bc07 [Docs] Add meetup slides (#12345)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-23 14:10:03 -08:00
9726ad676d [Misc] Fix OpenAI API Compatibility Issues in Benchmark Script (#12357)
Signed-off-by: Junichi Sato <junichi.sato@sbintuitions.co.jp>
2025-01-23 17:02:13 -05:00
eb5cb5e528 [BugFix] Fix parameter names and process_after_weight_loading for W4A16 MoE Group Act Order (#11528)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-01-23 21:40:33 +00:00
2cbeedad09 [Docs] Document Phi-4 support (#12362)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-23 19:18:51 +00:00
2c85529bfc [TPU] Update TPU CI to use torchxla nightly on 20250122 (#12334)
Signed-off-by: Siyuan Liu <lsiyuan@google.com>
2025-01-23 18:50:16 +00:00
e97f802b2d [FP8][Kernel] Dynamic kv cache scaling factors computation (#11906)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
2025-01-23 18:04:03 +00:00
6e650f56a1 [torch.compile] decouple compile sizes and cudagraph sizes (#12243)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-24 02:01:30 +08:00
3f50c148fd [core] add wake_up doc and some sanity check (#12361)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-24 02:00:50 +08:00
8c01b8022c [Bugfix] Fix broken internvl2 inference with v1 (#12360)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-23 17:20:33 +00:00
99d01a5e3d [V1] Simplify M-RoPE (#12352)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: imkero <kerorek@outlook.com>
2025-01-23 23:13:23 +08:00
d07efb31c5 [Doc] Troubleshooting errors during model inspection (#12351)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-23 22:46:58 +08:00
978b45f399 [Kernel] Flash Attention 3 Support (#12093)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2025-01-23 06:45:48 -08:00
c5b4b11d7f [Bugfix] Fix k_proj's bias for whisper self attention (#12342)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-23 10:15:33 +00:00
8ae5ff2009 [Hardware][Gaudi][BugFix] Fix dataclass error due to triton package update (#12338)
Signed-off-by: zhenwei <zhenweiliu@habana.ai>
2025-01-23 08:35:46 +00:00
511627445e [doc] explain common errors around torch.compile (#12340)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-23 14:56:02 +08:00
f0ef37233e [V1] Add uncache_blocks (#12333) 2025-01-23 04:19:21 +00:00
7551a34032 [Docs] Document vulnerability disclosure process (#12326)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-01-23 03:44:09 +00:00
01a55941f5 [Docs] Update FP8 KV Cache documentation (#12238)
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-01-23 11:18:09 +08:00
8d7aa9de71 [Bugfix] Fixing AMD LoRA CI test. (#12329)
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
2025-01-23 10:53:02 +08:00
68c4421b6d [AMD][Quantization] Add TritonScaledMMLinearKernel since int8 is broken for AMD (#12282)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2025-01-23 00:10:37 +00:00
aea94362c9 [Frontend][V1] Online serving performance improvements (#12287) 2025-01-22 22:22:12 +00:00
7206ce4ce1 [Core] Support reset_prefix_cache (#12284) 2025-01-22 18:52:27 +00:00
96f6a7596f [Bugfix] Fix HPU multiprocessing executor (#12167)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
2025-01-23 02:07:07 +08:00
84bee4bd5c [Misc] Improve the readability of BNB error messages (#12320)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-22 16:56:54 +00:00
fc66dee76d [Misc] Fix the error in the tip for the --lora-modules parameter (#12319)
Signed-off-by: wangerxiao <863579016@qq.com>
2025-01-22 16:48:41 +00:00
6609cdf019 [Doc] Add docs for prompt replacement (#12318)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-22 14:56:29 +00:00
16366ee8bb [Bugfix][VLM] Fix mixed-modality inference backward compatibility for V0 (#12313)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-22 21:06:36 +08:00
528dbcac7d [Model][Bugfix]: correct Aria model output (#12309)
Signed-off-by: xffxff <1247714429@qq.com>
2025-01-22 11:39:19 +00:00
cd7b6f0857 [VLM] Avoid unnecessary tokenization (#12310)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-22 11:08:31 +00:00
68ad4e3a8d [Core] Support fully transparent sleep mode (#11743)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-22 14:39:32 +08:00
4004f144f3 [Build] update requirements of no-device (#12299)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-01-22 14:29:31 +08:00
66818e5b63 [core] separate builder init and builder prepare for each batch (#12253)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-22 14:13:52 +08:00
222a9dc350 [Benchmark] More accurate TPOT calc in benchmark_serving.py (#12288)
Signed-off-by: Nick Hill <nhill@redhat.com>
2025-01-22 13:46:14 +08:00
cbdc4ad5a5 [Ci/Build] Fix mypy errors on main (#12296)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-22 12:06:54 +08:00
016e3676e7 [CI] add docker volume prune to neuron CI (#12291)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2025-01-22 10:47:49 +08:00
64ea24d0b3 [ci/lint] Add back default arg for pre-commit (#12279)
Signed-off-by: kevin <kevin@anyscale.com>
2025-01-22 01:15:27 +00:00
df76e5af26 [VLM] Simplify post-processing of replacement info (#12269)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-21 16:48:13 -08:00
09ccc9c8f7 [Documentation][AMD] Add information about prebuilt ROCm vLLM docker for perf validation purpose (#12281)
Signed-off-by: Hongxia Yang <hongxyan@amd.com>
2025-01-22 07:49:22 +08:00
69196a9bc7 [BUGFIX] When skip_tokenize_init and multistep are set, execution crashes (#12277)
Signed-off-by: maleksan85 <maleksan@amd.com>
Co-authored-by: maleksan85 <maleksan@amd.com>
2025-01-21 23:30:46 +00:00
2acba47d9b [bugfix] moe tuning. rm is_navi() (#12273)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-01-21 22:47:32 +00:00
9c485d9e25 [Core] Free CPU pinned memory on environment cleanup (#10477) 2025-01-21 11:56:41 -08:00
fa9ee08121 [Misc] Set default backend to SDPA for get_vit_attn_backend (#12235)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-01-21 11:52:11 -08:00
347eeebe3b [Misc] Remove experimental dep from tracing.py (#12007)
Signed-off-by: Adrian Cole <adrian.cole@elastic.co>
2025-01-21 11:51:55 -08:00
18fd4a8331 [Bugfix] Multi-sequence broken (#11898)
Signed-off-by: Andy Lo <andy@mistral.ai>
2025-01-21 11:51:35 -08:00
132a132100 [v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (#10907)
Signed-off-by: rickyx <rickyx@anyscale.com>
2025-01-21 11:51:13 -08:00
1e60f87bb3 [Kernel] fix moe_align_block_size error condition (#12239)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
2025-01-21 10:30:28 -08:00
9705b90bcf [Bugfix] fix race condition that leads to wrong order of token returned (#10802)
Signed-off-by: Jannis Schönleber <joennlae@gmail.com>
2025-01-21 09:47:04 -08:00
3aec49e56f [ci/build] update nightly torch for gh200 test (#12270)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-21 23:03:17 +08:00
c64612802b [Platform] improve platforms getattr (#12264)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2025-01-21 14:42:41 +00:00
9a7c3a0042 Remove pytorch comments for outlines + compressed-tensors (#12260)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2025-01-21 21:49:08 +08:00
b197a5ccfd [V1][Bugfix] Fix data item ordering in mixed-modality inference (#12259)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-21 13:18:43 +00:00
c81081fece [torch.compile] transparent compilation with more logging (#12246)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-21 19:32:55 +08:00
a94eee4456 [Bugfix] Fix mm_limits access for merged multi-modal processor (#12252)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-21 10:09:39 +00:00
f2e9f2a3be [Misc] Remove redundant TypeVar from base model (#12248)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-21 08:40:39 +00:00
1f1542afa9 [Misc]Add BNB quantization for PaliGemmaForConditionalGeneration (#12237)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-21 07:49:08 +00:00
96912550c8 [Misc] Rename MultiModalInputsV2 -> MultiModalInputs (#12244)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-21 07:31:19 +00:00
2fc6944c5e [ci/build] disable failed and flaky tests (#12240)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-21 13:25:03 +08:00
5fe6bf29d6 [BugFix] Fix GGUF tp>1 when vocab_size is not divisible by 64 (#12230)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-01-21 12:23:14 +08:00
d4b62d4641 [AMD][Build] Porting dockerfiles from the ROCm/vllm fork (#11777)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-01-21 12:22:23 +08:00
ecf67814f1 Add quantization and guided decoding CODEOWNERS (#12228)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-20 18:23:40 -07:00
750f4cabfa [Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) (#12222)
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-20 16:42:16 -08:00
06a760d6e8 [bugfix] catch xgrammar unsupported array constraints (#12210)
Signed-off-by: Jason Cheng <jasoncky96@gmail.com>
2025-01-20 16:42:02 -08:00
da7512215f [misc] add cuda runtime version to usage data (#12190)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2025-01-21 00:31:01 +00:00
af69a6aded fix: update platform detection for M-series arm based MacBook processors (#12227)
Signed-off-by: isikhi <huseyin.isik000@gmail.com>
2025-01-20 22:23:28 +00:00
7bd3630067 [Misc] Update CODEOWNERS (#12229) 2025-01-20 22:19:09 +00:00
96663699b2 [CI] Pass local python version explicitly to pre-commit mypy.sh (#12224)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-20 23:49:18 +08:00
18572e3384 [Bugfix] Fix HfExampleModels.find_hf_info (#12223)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-20 15:35:36 +00:00
86bfb6dba7 [Misc] Pass attention to impl backend (#12218)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-01-20 23:25:28 +08:00
5f0ec3935a [V1] Remove _get_cache_block_size (#12214)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-20 21:54:16 +08:00
c222f47992 [core][bugfix] configure env var during import vllm (#12209)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-20 19:35:59 +08:00
170eb35079 [misc] print a message to suggest how to bypass commit hooks (#12217)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-20 18:06:24 +08:00
b37d82791e [Model] Upgrade Aria to transformers 4.48 (#12203)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-20 17:58:48 +08:00
3127e975fb [CI/Build] Make pre-commit faster (#12212)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-20 17:36:24 +08:00
4001ea1266 [CI/Build] Remove dummy CI steps (#12208)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-20 16:41:57 +08:00
5c89a29c22 [misc] add placeholder format.sh (#12206)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-20 16:04:49 +08:00
59a0192fb9 [Core] Interface for accessing model from VllmRunner (#10353)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-20 15:00:59 +08:00
83609791d2 [Model] Add Qwen2 PRM model support (#12202)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-20 14:59:46 +08:00
0974c9bc5c [Bugfix] Fix incorrect types in LayerwiseProfileResults (#12196)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-01-20 14:59:20 +08:00
d2643128f7 [DOC] Add missing docstring in LLMEngine.add_request() (#12195)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-01-20 14:59:00 +08:00
c5c06209ec [DOC] Fix typo in docstring and assert message (#12194)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-01-20 14:58:29 +08:00
3ea7b94523 Move linting to pre-commit (#11975)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-20 14:58:01 +08:00
51ef828f10 [torch.compile] fix sym_tensor_indices (#12191)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-20 11:37:50 +08:00
df450aa567 [Bugfix] Fix num_heads value for simple connector when tp enabled (#12074)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-01-20 02:56:43 +00:00
bbe5f9de7d [Model] Support for fairseq2 Llama (#11442)
Signed-off-by: Martin Gleize <mgleize@meta.com>
Co-authored-by: mgleize user <mgleize@a100-st-p4de24xlarge-4.fair-a100.hpcaas>
2025-01-19 10:40:40 -08:00
81763c58a0 [V1] Add V1 support of Qwen2-VL (#12128)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: imkero <kerorek@outlook.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-19 19:52:13 +08:00
edaae198e7 [Misc] Add BNB support to GLM4-V model (#12184)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-19 19:49:22 +08:00
936db119ed benchmark_serving support --served-model-name param (#12109)
Signed-off-by: zibai <zibai.gj@alibaba-inc.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2025-01-19 09:59:56 +00:00
e66faf4809 [torch.compile] store inductor compiled Python file (#12182)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-19 16:27:26 +08:00
630eb5b5ce [Bugfix] Fix multi-modal processors for transformers 4.48 (#12187) 2025-01-18 19:16:34 -08:00
4e94951bb1 [BUGFIX] Move scores to float32 in case of running xgrammar on cpu (#12152)
Signed-off-by: Michal Adamczyk <madamczyk@habana.ai>
2025-01-19 11:12:05 +08:00
7a8a48d51e [V1] Collect env var for usage stats (#12115) 2025-01-19 03:07:15 +00:00
32eb0da808 [Misc] Support register quantization method out-of-tree (#11969) 2025-01-18 16:13:16 -08:00
6d0e3d3724 [core] clean up executor class hierarchy between v1 and v0 (#12171)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-18 14:35:15 +08:00
02798ecabe [Model] Port deepseek-vl2 processor, remove dependency (#12169)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-18 13:59:39 +08:00
813f249f02 [Docs] Fix broken link in SECURITY.md (#12175)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-01-18 04:35:21 +00:00
da02cb4b27 [core] further polish memory profiling (#12126)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-18 12:25:08 +08:00
c09503ddd6 [AMD][CI/Build][Bugfix] use pytorch stale wheel (#12172)
Signed-off-by: hongxyan <hongxyan@amd.com>
2025-01-18 11:15:53 +08:00
2b83503227 [misc] fix cross-node TP (#12166)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-18 10:53:27 +08:00
7b98a65ae6 [torch.compile] disable logging when cache is disabled (#12043)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-17 20:29:31 +00:00
b5b57e301e [AMD][FP8] Using MI300 FP8 format on ROCm for block_quant (#12134)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2025-01-17 17:12:26 +00:00
54cacf008f [Bugfix] Mistral tokenizer encode accept list of str (#12149)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-01-17 16:47:53 +00:00
58fd57ff1d [Bugfix] Fix score api for missing max_model_len validation (#12119)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2025-01-17 16:24:22 +00:00
87a0c076af [core] allow callable in collective_rpc (#12151)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-17 20:47:01 +08:00
d4e6194570 [CI/Build][CPU][Bugfix] Fix CPU CI (#12150)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-01-17 19:39:52 +08:00
07934cc237 [Misc][LoRA] Improve the readability of LoRA error messages (#12102)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-17 19:32:28 +08:00
69d765f5a5 [V1] Move more control of kv cache initialization from model_executor to EngineCore (#11960)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2025-01-17 07:39:35 +00:00
8027a72461 [ROCm][MoE] moe tuning support for rocm (#12049)
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
2025-01-17 14:49:16 +08:00
d75ab55f10 [Misc] Add deepseek_vl2 chat template (#12143)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-17 06:34:48 +00:00
d1adb9b403 [BugFix] add more is not None check in VllmConfig.__post_init__ (#12138)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-17 05:33:22 +00:00
b8bfa46a18 [Bugfix] Fix issues in CPU build Dockerfile (#12135)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-01-17 12:54:01 +08:00
1475847a14 [Doc] Add instructions on using Podman when SELinux is active (#12136)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-01-17 04:45:36 +00:00
fead53ba78 [CI] add genai-perf benchmark in nightly benchmark (#10704)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-01-17 04:15:09 +00:00
ebc73f2828 [Bugfix] Fix a path bug in disaggregated prefill example script. (#12121)
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
2025-01-17 11:12:41 +08:00
d06e824006 [Bugfix] Set enforce_eager automatically for mllama (#12127)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-16 15:30:08 -05:00
62b06ba23d [Model] Add support for deepseek-vl2-tiny model (#12068)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-16 17:14:48 +00:00
5fd24ec02e [misc] Add LoRA kernel micro benchmarks (#11579) 2025-01-16 15:51:40 +00:00
874f7c292a [Bugfix] Fix max image feature size for Llava-one-vision (#12104)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-16 14:54:06 +00:00
92e793d91a [core] LLM.collective_rpc interface and RLHF example (#12084)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-16 20:19:52 +08:00
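For the collective_rpc interface referenced in the two entries above (#12151 allows passing a callable; #12084 adds the LLM-level interface and an RLHF example), a minimal usage sketch follows. The assumption that the callable runs on each worker process and receives the worker object as its first argument is mine, not taken from the PRs.

```python
# Minimal sketch of LLM.collective_rpc (assumption: the callable is executed on
# each worker process and receives the worker object as its first argument).
from vllm import LLM


def report_gpu(worker):
    import torch
    return torch.cuda.get_device_name()


llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
# One result per worker is gathered back on the driver process.
print(llm.collective_rpc(report_gpu))
```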
bf53e0c70b Support torchrun and SPMD-style offline inference (#12071)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-16 19:58:53 +08:00
dd7c9ad870 [Bugfix] Remove hardcoded head_size=256 for Deepseek v2 and v3 (#12067)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-16 10:11:54 +00:00
9aa1519f08 Various cosmetic/comment fixes (#12089)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-16 09:59:06 +00:00
f8ef146f03 [Doc] Add documentation for specifying model architecture (#12105) 2025-01-16 15:53:43 +08:00
fa0050db08 [Core] Default to using per_token quantization for fp8 when cutlass is supported. (#8651)
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-01-16 04:31:27 +00:00
cd9d06fb8d Allow hip sources to be directly included when compiling for rocm. (#12087) 2025-01-15 16:46:03 -05:00
ebd8c669ef [Bugfix] Fix _get_lora_device for HQQ marlin (#12090)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2025-01-15 19:59:42 +00:00
70755e819e [V1][Core] Autotune encoder cache budget (#11895)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-15 11:29:00 -08:00
edce722eaa [Bugfix] use right truncation for non-generative tasks (#12050)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-01-16 00:31:01 +08:00
57e729e874 [Doc]: Update OpenAI-Compatible Server documents (#12082) 2025-01-15 16:07:45 +00:00
de0526f668 [Misc][Quark] Upstream Quark format to VLLM (#10765)
Signed-off-by: kewang-xlnx <kewang@xilinx.com>
Signed-off-by: kewang2 <kewang2@amd.com>
Co-authored-by: kewang2 <kewang2@amd.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-01-15 11:05:15 -05:00
5ecf3e0aaf Misc: allow using a proxy in HTTPConnection (#12042)
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
2025-01-15 13:16:40 +00:00
97eb97b5a4 [Model]: Support internlm3 (#12037) 2025-01-15 11:35:17 +00:00
3adf0ffda8 [Platform] Do not raise error if _Backend is not found (#12023)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-01-15 10:14:15 +00:00
ad388d25a8 Type-fix: make execute_model output type optional (#12020) 2025-01-15 09:44:56 +00:00
cbe94391eb Fix: cases with empty sparsity config (#12057)
Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
2025-01-15 17:41:24 +08:00
994fc655b7 [V1][Prefix Cache] Move the logic of num_computed_tokens into KVCacheManager (#12003) 2025-01-15 07:55:30 +00:00
3f9b7ab9f5 [Doc] Update examples to remove SparseAutoModelForCausalLM (#12062)
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
2025-01-15 06:36:01 +00:00
ad34c0df0f [core] platform agnostic executor via collective_rpc (#11256)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-15 13:45:21 +08:00
f218f9c24d [core] Turn off GPU communication overlap for Ray executor (#12051)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-01-15 05:19:55 +00:00
0794e7446e [Misc] Add multi-step chunked-prefill support for FlashInfer (#10467) 2025-01-15 12:47:49 +08:00
b7ee940a82 [V1][BugFix] Fix edge case in VLM scheduling (#12065)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-14 20:21:28 -08:00
9ddac56311 [Platform] move current_memory_usage() into platform (#11369)
Signed-off-by: Shanshan Shen <467638484@qq.com>
2025-01-15 03:38:25 +00:00
1a51b9f872 [HPU][Bugfix] Don't use /dev/accel/accel0 for HPU autodetection in setup.py (#12046)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
2025-01-15 02:59:18 +00:00
42f5e7c52a [Kernel] Support MulAndSilu (#11624)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-15 02:29:53 +00:00
a3a3ee4e6f [Misc] Merge bitsandbytes_stacked_params_mapping and packed_modules_mapping (#11924)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-15 07:49:49 +08:00
87054a57ab [Doc]: Update the JSON example in the Engine Arguments document (#12045) 2025-01-14 17:03:04 +00:00
c9d6ff530b Explain where the engine args go when using Docker (#12041)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-14 16:05:50 +00:00
a2d2acb4c8 [Bugfix][Kernel] Give unique name to BlockSparseFlashAttention (#12040)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-14 15:45:05 +00:00
2e0e017610 [Platform] Add output for Attention Backend (#11981)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-01-14 13:27:04 +00:00
1f18adb245 [Kernel] Revert the API change of Attention.forward (#12038)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-14 20:59:32 +08:00
bb354e6b2d [Bugfix] Fix various bugs in multi-modal processor (#12031)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-14 12:16:11 +00:00
ff39141a49 [HPU][misc] add comments for explanation (#12034)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-14 19:24:06 +08:00
8a1f938e6f [Doc] Update Quantization Hardware Support Documentation (#12025)
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
2025-01-14 04:37:52 +00:00
078da31903 [HPU][Bugfix] set_forward_context and CI test execution (#12014)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
2025-01-14 11:04:18 +08:00
1a401252b5 [Docs] Add Sky Computing Lab to project intro (#12019)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-13 17:24:36 -08:00
f35ec461fc [Bugfix] Fix deepseekv3 gate bias error (#12002)
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2025-01-13 13:43:51 -07:00
289b5191d5 [Doc] Fix build from source and installation link in README.md (#12013)
Signed-off-by: Yikun <yikunkero@gmail.com>
2025-01-13 17:23:59 +00:00
c6db21313c bugfix: Fix signature mismatch in benchmark's get_tokenizer function (#11982)
Signed-off-by: elijah <f1renze.142857@gmail.com>
2025-01-13 15:22:07 +00:00
a7d59688fb [Platform] Move get_punica_wrapper() function to Platform (#11516)
Signed-off-by: Shanshan Shen <467638484@qq.com>
2025-01-13 13:12:10 +00:00
458e63a2c6 [platform] add device_control env var (#12009)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-13 20:59:09 +08:00
e8c23ff989 [Doc] Organise installation documentation into categories and tabs (#11935)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-13 12:27:36 +00:00
cd8249903f [Doc][V1] Update model implementation guide for V1 support (#11998)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2025-01-13 11:58:54 +00:00
0f8cafe2d1 [Kernel] unified_attention for Attention.forward (#11967)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-13 19:28:53 +08:00
5340a30d01 Fix Max Token ID for Qwen-VL-Chat (#11980)
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
2025-01-13 08:37:48 +00:00
89ce62a316 [platform] add ray_device_key (#11948)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-13 16:20:52 +08:00
c3f05b09a0 [Misc] Minor changes to Worker (#11555)
Signed-off-by: Chenguang Li <757486878@qq.com>
2025-01-13 15:47:05 +08:00
cf6bbcb493 [Misc] Fix Deepseek V2 fp8 kv-scale remapping (#11947)
Signed-off-by: Yida Wu <yidawu@alumni.cmu.edu>
2025-01-12 23:05:06 -08:00
80ea3af1a0 [CI][Spec Decode] fix: broken test for EAGLE model (#11972)
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
2025-01-13 06:50:35 +00:00
9dd02d85ca [Bug] Fix usage of .transpose() and .view() consecutively. (#11979) 2025-01-13 06:24:10 +00:00
f7b3ba82c3 [MISC] fix typo in kv transfer send recv test (#11983) 2025-01-13 05:07:48 +00:00
619ae268c3 [V1] [2/n] Logging and Metrics - OutputProcessor Abstraction (#11973)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-01-13 04:54:10 +00:00
d14e98d924 [Model] Support GGUF models newly added in transformers 4.46.0 (#9685)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-01-13 00:13:44 +00:00
9597a095f2 [V1][Core][1/n] Logging and Metrics (#11962)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2025-01-12 21:02:02 +00:00
263a870ee1 [Hardware][TPU] workaround fix for MoE on TPU (#11764) 2025-01-12 10:53:51 -05:00
8bddb73512 [Hardware][CPU] Multi-LoRA implementation for the CPU backend (#11100)
Signed-off-by: Akshat Tripathi <akshat@krai.ai>
Signed-off-by: Oleg Mosalov <oleg@krai.ai>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Oleg Mosalov <oleg@krai.ai>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-12 13:01:52 +00:00
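The entry above brings multi-LoRA support to the CPU backend; a minimal sketch of the existing offline LoRA API it plugs into is shown below. The model name and adapter path are placeholders, not taken from the PR.

```python
# Minimal sketch of vLLM's offline LoRA API (placeholder model and adapter names).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# enable_lora=True lets adapters be served alongside the base model.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

outputs = llm.generate(
    ["Write a SQL query that lists all users."],
    SamplingParams(temperature=0.0, max_tokens=64),
    # (adapter name, integer id, local path or HF repo of the adapter)
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql-lora-adapter"),
)
print(outputs[0].outputs[0].text)
```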
f967e51f38 [Model] Initialize support for Deepseek-VL2 models (#11578)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-01-12 00:17:24 -08:00
43f3d9e699 [CI/Build] Add markdown linter (#11857)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
2025-01-12 00:17:13 -08:00
b25cfab9a0 [V1] Avoid sending text prompt to core engine (#11963)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-12 06:36:38 +00:00
4b657d3292 [Model] Add CogAgent model support to vLLM (#11742)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-11 19:05:56 +00:00
d697dc01b4 [Bugfix] Fix RobertaModel loading (#11940)
Signed-off-by: NickLucche <nlucches@redhat.com>
2025-01-11 14:05:09 +00:00
a991f7d508 [Doc] Basic guide for writing unit tests for new models (#11951)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-11 21:27:24 +08:00
7a3a83e3b8 [CI/Build] Move model-specific multi-modal processing tests (#11934)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-11 13:50:05 +08:00
c32a7c7c0c [Bugfix] fused_experts_impl wrong compute type for float32 (#11921)
Signed-off-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
Co-authored-by: shaochangxu.scx <shaochangxu.scx@antgroup.com>
2025-01-11 13:49:39 +08:00
2118d0565c [Bugfix][SpecDecode] Adjust Eagle model architecture to align with intended design (#11672)
Signed-off-by: Sungjae Lee <33976427+llsj14@users.noreply.github.com>
2025-01-10 20:49:38 -08:00
899136b857 [ci] fix broken distributed-tests-4-gpus (#11937)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-11 09:07:24 +08:00
c9f09a4fe8 [mypy] Fix mypy warnings in api_server.py (#11941)
Signed-off-by: Fred Reiss <frreiss@us.ibm.com>
2025-01-11 01:04:58 +00:00
d45cbe70f5 [Bugfix] Check that number of images matches number of <|image|> tokens with mllama (#11939)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2025-01-10 23:26:00 +00:00
8a579408f3 [Misc] Update benchmark_prefix_caching.py fixed example usage (#11920)
Signed-off-by: Ren MinMin <renmm6@chinaunicom.cn>
Co-authored-by: Ren MinMin <renmm6@chinaunicom.cn>
2025-01-10 20:39:22 +00:00
46fa98ccad [Misc] Clean up debug code in Deepseek-V3 (#11930)
Signed-off-by: Isotr0py <2037008807@qq.com>
2025-01-10 19:19:15 +00:00
aa1e77a19c [Hardware][CPU] Support MOE models on x86 CPU (#11831)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-01-10 11:07:58 -05:00
5959564f94 Doc fix in benchmark_long_document_qa_throughput.py (#11933)
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
2025-01-10 23:51:43 +08:00
f33e033e27 [Docs] Fix docstring in get_ip function (#11932)
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
2025-01-10 23:51:02 +08:00
482cdc494e [Doc] Rename offline inference examples (#11927)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-10 23:50:29 +08:00
20410b2fda [platform] support custom torch.compile backend key (#11318)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-01-10 23:46:51 +08:00
12664ddda5 [Doc] [1/N] Initial guide for merged multi-modal processor (#11925)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-10 14:30:25 +00:00
241ad7b301 [ci] Fix sampler tests (#11922)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-10 20:45:33 +08:00
d85c47d6ad Replace "online inference" with "online serving" (#11923)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-10 12:05:56 +00:00
ef725feafc [platform] make pytorch custom ops pluggable (#11328)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-01-10 10:02:38 +00:00
d907be7dc7 [misc] remove python function call for custom activation op (#11885)
Co-authored-by: youkaichao <youkaichao@gmail.com>
2025-01-10 17:18:25 +08:00
d53575a5f0 [ci] fix gh200 tests (#11919)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-10 16:25:17 +08:00
61af633256 [BUGFIX] Fix UnspecifiedPlatform package name (#11916)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2025-01-10 16:20:46 +08:00
ac2f3f7fee [Bugfix] Validate lora adapters to avoid crashing server (#11727)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-10 15:56:36 +08:00
cf5f000d21 [torch.compile] Hide KV cache behind torch.compile boundary (#11677)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-10 13:14:42 +08:00
3de2b1eafb [Doc] Show default pooling method in a table (#11904)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-10 11:25:20 +08:00
b844b99ad3 [VLM] Enable tokenized inputs for merged multi-modal processor (#11900)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-10 03:24:00 +00:00
c3cf54dda4 [Doc][5/N] Move Community and API Reference to the bottom (#11896)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2025-01-10 03:10:12 +00:00
36f5303578 [Docs] Add Modal to deployment frameworks (#11907) 2025-01-09 23:26:37 +00:00
9a228348d2 [Misc] Provide correct Pixtral-HF chat template (#11891)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-09 10:19:37 -07:00
bd82872211 [ci]try to fix flaky multi-step tests (#11894)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-09 14:47:29 +00:00
405eb8e396 [platform] Allow platform specify attention backend (#11609)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-01-09 21:46:50 +08:00
65097ca0af [Doc] Add model development API Reference (#11884)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-09 09:43:40 +00:00
1d967acb45 [Bugfix] fix beam search input errors and latency benchmark script (#11875)
Signed-off-by: Ye Qi <yeq@meta.com>
Co-authored-by: yeq <yeq@devgpu004.lla3.facebook.com>
2025-01-09 17:36:39 +08:00
0bd1ff4346 [Bugfix] Override dunder methods of placeholder modules (#11882)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-09 09:02:53 +00:00
310aca88c9 [perf]fix current stream (#11870)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-09 07:18:21 +00:00
a732900efc [Doc] Fix intended links to the Python multiprocessing library (#11878) 2025-01-09 05:39:39 +00:00
d848800e88 [Misc] Move print_*_once from utils to logger (#11298)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
Co-authored-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
2025-01-09 12:48:12 +08:00
730e9592e9 [Doc] Recommend uv and python 3.12 for quickstart guide (#11849)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-09 11:37:48 +08:00
1fe554bac3 treat do_lower_case in the same way as the sentence-transformers library (#11815)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2025-01-09 11:05:43 +08:00
615e4a5401 [CI] Turn on basic correctness tests for V1 (#10864) 2025-01-08 21:20:44 -05:00
3db0cafdf1 [Docs] Add Google Cloud Meetup (#11864) 2025-01-08 12:38:28 -08:00
526de822d5 [Kernel][Triton][AMD] Use block size heuristic for avg 2.8x speedup for int8 models (#11698)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2025-01-08 20:23:15 +00:00
56fe4c297c [TPU][Quantization] TPU W8A8 (#11785)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-08 19:33:29 +00:00
47de8821d3 [Misc]add some explanations for BlockHashType (#11847) 2025-01-08 18:21:30 +00:00
5984499e47 [Doc] Expand Multimodal API Reference (#11852)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-08 17:14:14 +00:00
ca47e176af [Misc] Move some model utils into vision file (#11848)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-08 17:04:46 +00:00
78f4590b60 [Bugfix][XPU] fix silu_and_mul (#11823)
Signed-off-by: yan ma <yan.ma@intel.com>
2025-01-09 00:11:50 +08:00
2f7024987e [CI/Build][Bugfix] Fix CPU CI image clean up (#11836)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2025-01-08 15:18:28 +00:00
6cd40a5bfe [Doc][4/N] Reorganize API Reference (#11843)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-08 21:34:44 +08:00
aba8d6ee00 [Doc] Move examples into categories (#11840)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-08 13:09:53 +00:00
2a0596bc48 [VLM] Reorganize profiling/processing-related code (#11812)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-08 18:59:58 +08:00
f12141170a [torch.compile] consider relevant code in compilation cache (#11614)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-08 10:46:43 +00:00
cfd3219f58 [Hardware][Apple] Native support for macOS Apple Silicon (#11696)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2025-01-08 16:35:49 +08:00
a1b2b8606e [Docs] Update sponsor name: 'Novita' to 'Novita AI' (#11833) 2025-01-07 23:05:46 -08:00
ad9f1aa679 [doc] update wheels url (#11830)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-08 14:36:49 +08:00
889e662eae [misc] improve memory profiling (#11809)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-01-08 06:36:03 +00:00
ef68eb28d8 [Bug] Fix pickling of ModelConfig when RunAI Model Streamer is used (#11825)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-08 13:40:09 +08:00
259abd8953 [Docs] reorganize sponsorship page (#11639)
Signed-off-by: simon-mo <simon.mo@hey.com>
2025-01-07 21:16:08 -08:00
f645eb6954 [Bugfix] Add checks for LoRA and CPU offload (#11810)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-08 13:08:48 +08:00
f4923cb8bc [OpenVINO] Fixed Docker.openvino build (#11732)
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
2025-01-08 13:08:30 +08:00
b640b19cc0 Fixed docker build for ppc64le (#11518)
Signed-off-by: Nishidha Panpaliya <nishidha.panpaliya@partner.ibm.com>
2025-01-08 13:05:37 +08:00
dc71af0a71 Remove the duplicate imports of MultiModalKwargs and PlaceholderRange… (#11824) 2025-01-08 04:09:25 +00:00
4d29e91be8 [Misc] sort torch profiler table by kernel timing (#11813) 2025-01-08 10:57:04 +08:00
91445c7bc8 [Bugfix] Fix image input for Pixtral-HF (#11741)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-08 10:17:16 +08:00
5950f555a1 [Doc] Group examples into categories (#11782)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
2025-01-08 09:20:12 +08:00
a4e2b26856 [Bugfix] Fix significant performance drop on CPUs with --num-scheduler-steps > 1 (#11794) 2025-01-07 16:15:50 -08:00
973f5dc581 [Doc] Add documentation for using EAGLE in vLLM (#11417)
Signed-off-by: Sourashis Roy <sroy@roblox.com>
2025-01-07 19:19:12 +00:00
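A rough sketch of configuring EAGLE speculative decoding offline, not taken from the documentation added in #11417; the base and draft model names are placeholders, and the argument names (speculative_model, num_speculative_tokens) are assumed from the EngineArgs of this period.

```python
# Rough sketch of EAGLE speculative decoding (placeholder model names;
# argument names assumed from the EngineArgs of this period).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="yuhuili/EAGLE-LLaMA3-Instruct-8B",  # EAGLE draft head
    num_speculative_tokens=5,
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```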
c994223d56 [Bugfix] update the prefix for qwen2 (#11795)
Co-authored-by: jiadi.jjd <jiadi.jjd@antgroup.com>
2025-01-07 18:36:34 +00:00
869579a702 [optimization] remove python function call for custom op (#11750)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-07 17:04:28 +00:00
c0efe92d8b [Doc] Add note to gte-Qwen2 models (#11808)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-07 21:50:58 +08:00
d9fa1c05ad [doc] update how pip can install nightly wheels (#11806)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-07 21:42:58 +08:00
2de197bdd4 [V1] Support audio language models on V1 (#11733)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-07 19:47:36 +08:00
869e829b85 [doc] add doc to explain how to use uv (#11773)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2025-01-07 18:41:17 +08:00
8f37be38eb [Bugfix] Comprehensively test and fix LLaVA-NeXT feature size calculation (#11800)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-07 18:25:02 +08:00
8082ad7950 [V1][Doc] Update V1 support for LLaVa-NeXT-Video (#11798)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-07 09:55:39 +00:00
1e4ce295ae [CI][CPU] adding build number to docker image name (#11788)
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
2025-01-07 07:28:01 +00:00
ce1917fcf2 [Doc] Create a vulnerability management team (#9925)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2025-01-06 22:57:32 -08:00
e512f76a89 fix init error for MessageQueue when n_local_reader is zero (#11768) 2025-01-07 06:12:48 +00:00
898cdf033e [CI] Fix neuron CI and run offline tests (#11779)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2025-01-06 21:36:10 -08:00
0f3f3c86ec [Bugfix] Update attention interface in Whisper (#11784)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-07 04:36:24 +00:00
b278557935 [Kernel][LoRA]Punica prefill kernels fusion (#11234)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Abatom <abzhonghua@gmail.com>
Co-authored-by: Zhonghua Deng <abatom@163.com>
2025-01-07 04:01:39 +00:00
8ceffbf315 [Doc][3/N] Reorganize Serving section (#11766)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-07 11:20:01 +08:00
d93d2d74fd [XPU] Make pp group initialized for pipeline-parallelism (#11648)
Signed-off-by: yisheng <yi.sheng@intel.com>
2025-01-07 11:09:58 +08:00
d0169e1b0f [Model] Future-proof Qwen2-Audio multi-modal processor (#11776)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-07 11:05:17 +08:00
08fb75c72e [Bugfix] Fix LLaVA-NeXT feature size precision error (for real) (#11772)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-07 01:10:54 +00:00
91b361ae89 [V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision (#11685)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-06 19:58:16 +00:00
e20c92bb61 [Kernel] Move attn_type to Attention.__init__() (#11690)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2025-01-07 00:11:28 +08:00
32c9eff2ff [Bugfix][V1] Fix molmo text-only inputs (#11676)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-06 15:22:25 +00:00
4ca5d40adc [doc] explain how to add interleaving sliding window support (#11771)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2025-01-06 21:57:44 +08:00
9279b9f83d [Bugfix] Fix max image size for LLaVA-Onevision (#11769)
Signed-off-by: Roger Wang <ywang@roblox.com>
2025-01-06 13:48:53 +00:00
ee77fdb5de [Doc][2/N] Reorganize Models and Usage sections (#11755)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-06 21:40:31 +08:00
996357e480 [VLM] Separate out profiling-related logic (#11746)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-06 16:02:21 +08:00
2a622d704a k8s-config: Update the secret to use stringData (#11679)
Signed-off-by: Suraj Deshmukh <surajd.service@gmail.com>
2025-01-06 08:01:22 +00:00
9c749713f6 [mypy] Forward pass function type hints in lora (#11740)
Signed-off-by: lucast2021 <lucast2021@headroyce.org>
Co-authored-by: lucast2021 <lucast2021@headroyce.org>
2025-01-06 07:59:36 +00:00
022c5c6944 [V1] Refactor get_executor_cls (#11754) 2025-01-06 07:59:16 +00:00
f8fcca100b [Misc] Fix typo for valid_tool_parses (#11753)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2025-01-06 07:12:38 +00:00
06bfb51963 [V1] Add BlockTable class (#11693)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-06 14:24:42 +09:00
408e560015 [Bugfix] Remove block size constraint (#11723) 2025-01-06 12:49:55 +08:00
402d378360 [Doc] [1/N] Reorganize Getting Started section (#11645)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-06 02:18:33 +00:00
9e764e7b10 [distributed] remove pynccl's redundant change_state (#11749) 2025-01-06 09:05:48 +08:00
33fc1e2e86 [Frontend] Improve StreamingResponse Exception Handling (#11752) 2025-01-05 16:35:01 -05:00
eba17173d3 fix: [doc] fix typo (#11751)
Co-authored-by: Lancer <maruixiang6688@gmail.com>
2025-01-06 00:48:16 +08:00
635b897246 [distributed] remove pynccl's redundant stream (#11744) 2025-01-05 23:09:11 +08:00
4068f4b5b5 [MISC] Replace c10::optional with std::optional (#11730)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-01-05 10:20:34 +09:00
47831430cc [Bugfix][V1] Fix test_kv_cache_utils.py (#11738)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-04 16:07:59 +00:00
65c08928c2 [Model] Remove unnecessary weight initialization logic (#11736)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-04 23:46:21 +08:00
ba214dffbe [Bugfix] Fix precision error in LLaVA-NeXT (#11735)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-04 23:45:57 +08:00
eed11ebee9 [VLM] Merged multi-modal processors for LLaVA-NeXT-Video and LLaVA-OneVision (#11717)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-04 11:40:53 +00:00
300acb8347 [Core][Bugfix] Use correct device to initialize GPU data during CUDA-graph-capture (#11233)
Signed-off-by: Yan Burman <yanburman@users.noreply.github.com>
Signed-off-by: Ido Asraff <idoa@atero.ai>
2025-01-04 14:50:16 +08:00
d91457d529 [V1] Add kv cache utils tests. (#11513)
Signed-off-by: xcnick <xcnick0412@gmail.com>
2025-01-04 14:49:46 +08:00
fbf2564554 [V1] Add RayExecutor support for AsyncLLM (api server) (#11712) 2025-01-04 06:41:31 +00:00
d1d49397e7 Update bnb.md with example for OpenAI (#11718) 2025-01-04 06:29:02 +00:00
9c93636d84 Update tool_calling.md (#11701) 2025-01-04 06:16:30 +00:00
e5d7ed0c53 [V1] log the number of GPU blocks for MultiprocExecutor (#11656) 2025-01-04 00:13:12 +00:00
ad0d567e1c [V1] Chore: cruft removal (#11724) 2025-01-03 23:25:02 +00:00
bf0d97d786 Update requirements-tpu.txt to support python 3.9 and 3.11 (#11695)
Signed-off-by: mgoin <michael@neuralmagic.com>
2025-01-03 22:36:46 +00:00
a655eb3025 [Misc] Add BNB quantization for Qwen2VL (#11719)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-03 15:19:02 -07:00
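For the in-flight BitsAndBytes quantization entries above, a minimal offline sketch follows; setting both quantization and load_format to "bitsandbytes" reflects the API of this period, and the model name is only an example.

```python
# Minimal sketch of in-flight BitsAndBytes quantization (API of this period:
# both quantization and load_format set to "bitsandbytes"; example model name).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-VL-2B-Instruct",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
out = llm.generate(["Describe what a vision-language model does."],
                   SamplingParams(max_tokens=48))
print(out[0].outputs[0].text)
```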
1543914c04 [V1] Improve TP>1 Error Handling + Stack Trace (#11721)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2025-01-03 21:29:11 +00:00
61fed92c7e [Bugfix] Fix ColumnParallelLinearWithLoRA slice (#11708)
Signed-off-by: ZincCat <zincchloride@outlook.com>
2025-01-03 21:02:34 +00:00
80c751e7f6 [V1] Simplify Shutdown (#11659) 2025-01-03 17:25:38 +00:00
e1a5c2f0a1 [Model] Whisper model implementation (#11280)
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
2025-01-03 16:39:19 +08:00
fd3a62a122 [perf-benchmark] Fix dependency for steps in benchmark pipeline (#11710) 2025-01-02 22:38:37 -08:00
07064cb1d4 [Bugfix] Check chain_speculative_sampling before calling it (#11673)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-01-02 16:58:56 -08:00
2f1e8e8f54 Update default max_num_batched_tokens for chunked prefill (#11694) 2025-01-03 00:25:53 +00:00
68d37809b9 [Misc] Minimum requirements for SageMaker compatibility (#11576) 2025-01-02 15:59:25 -08:00
5dba257506 Resolve race conditions in Marlin kernel (#11493)
Signed-off-by: wchen61 <wchen61@foxmail.com>
2025-01-02 22:58:56 +00:00
187e32997c [Bugfix] Change kv scaling factor by param json on nvidia gpu (#11688)
Signed-off-by: bjmsong <bjmsong@126.com>
Co-authored-by: bjmsong <bjmsong@126.com>
2025-01-02 21:11:39 +00:00
b55ed6ef8a [V1][Minor] Optimize token_ids_cpu copy (#11692)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-02 12:04:58 -07:00
2f385183f3 [Bugfix] Free cross attention block table for preempted-for-recompute sequence group. (#10013)
Signed-off-by: Kathy Yu <feiyangyu@google.com>
2025-01-02 10:28:09 -08:00
84c35c374a According to vllm.EngineArgs, the name should be distributed_executor_backend (#11689) 2025-01-02 18:14:16 +00:00
8c38ee7007 [VLM] Merged multi-modal processor for LLaVA-NeXT (#11682)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-02 16:39:27 +00:00
b6087a6bee [mypy] Pass type checking in vllm/inputs (#11680)
Signed-off-by: Tobias Pitters <tobias.pitters@gmail.com>
2025-01-02 16:18:15 +00:00
23c1b10a4c [VLM][Bugfix] Multi-modal processor compatible with V1 multi-input (#11674)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2025-01-02 17:00:00 +08:00
a115ac46b5 [VLM] Move supported limits and max tokens to merged multi-modal processor (#11669)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2025-01-01 15:44:42 +00:00
73001445fb [V1] Implement Cascade Attention (#11635)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2025-01-01 21:56:46 +09:00
6d70198b17 [Doc] Fix typo (#11666)
Signed-off-by: Kazuhiro Serizawa <nserihiro@gmail.com>
2025-01-01 08:10:10 +00:00
f962f426bc [Misc] Replace space with - in the file names (#11667)
Signed-off-by: Lu Fang <lufang@fb.com>
2025-01-01 07:39:30 +00:00
11d8a091c6 [Misc] Optimize Qwen2-VL LoRA test (#11663)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2025-01-01 14:42:23 +08:00
365801fedd [VLM] Add max-count checking in data parser for single image models (#11661)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-31 22:15:21 -08:00
4db72e57f6 [Bugfix][Refactor] Unify model management in frontend (#11660)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2025-01-01 02:21:51 +00:00
0c6f998554 [Benchmark] Add benchmark script for CPU offloading (#11533)
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Co-authored-by: KuntaiDu <kuntai@uchicago.edu>
2025-01-01 00:10:55 +00:00
e7c7c5e822 [V1][VLM] V1 support for selected single-image models. (#11632)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-12-31 21:17:22 +00:00
8c3230d8c1 [V1] Simplify vision block hash for prefix caching by removing offset from hash (#11646) 2024-12-31 08:56:01 +00:00
2c5718809b [Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. (#11565) 2024-12-31 06:29:04 +00:00
82c49d3260 [Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) (#6909)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-30 22:15:58 -08:00
74fa1d123c [Bugfix] Fix OpenAI parallel sampling when using xgrammar (#11637)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-31 03:43:54 +00:00
a2a40bcd0d [Model][LoRA]LoRA support added for MolmoForCausalLM (#11439)
Signed-off-by: Matthias Vogler <matthias.vogler@joesecurity.org>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Matthias Vogler <matthias.vogler@joesecurity.org>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-30 17:33:06 -08:00
ccb1aabcca [benchmark] Remove dependency for H100 benchmark step (#11572) 2024-12-30 12:27:07 -08:00
36e7670045 [Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel (#11631) 2024-12-30 18:51:04 +00:00
5886aa496e [V1] [6/N] API Server: Better Shutdown (#11586) 2024-12-30 15:51:02 +00:00
8d9b6721e7 [VLM] Abstract out multi-modal data parsing in merged processor (#11620)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-30 15:01:35 +00:00
b12e87f942 [platforms] enable platform plugins (#11602)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-30 20:24:45 +08:00
5dbf854553 [CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels (#11618)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-12-30 10:17:04 +00:00
970d6d0776 [Build][Kernel] Update CUTLASS to v3.6.0 (#11607)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-30 17:22:13 +08:00
628ec6c17b [Docker] bump up neuron sdk v2.21 (#11593)
Signed-off-by: Liangfu Chen <liangfc@amazon.com>
2024-12-30 13:46:14 +08:00
3682e33f9f [v1] fix compilation cache (#11598)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-30 04:24:12 +00:00
0aa38d16f5 Remove print statement in DeepseekScalingRotaryEmbedding (#11604) 2024-12-29 20:16:46 +00:00
faef77c0d6 [Misc] KV cache transfer connector registry (#11481)
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
2024-12-29 16:08:09 +00:00
dba4d9dec6 [v1][bugfix] fix cudagraph with inplace buffer assignment (#11596)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-29 09:03:49 +00:00
32b4c63f02 [Doc] Convert list tables to MyST (#11594)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-29 15:56:22 +08:00
4fb8e329fd [V1] [5/N] API Server: unify Detokenizer and EngineCore input (#11545)
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2024-12-28 20:51:57 +00:00
328841d002 [bugfix] interleaving sliding window for cohere2 model (#11583)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-28 16:55:42 +00:00
d427e5cfda [Doc] Minor documentation fixes (#11580)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-28 21:53:59 +08:00
42bb201fd6 [V1][Minor] Set pin_memory=False for token_ids_cpu tensor (#11581)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-28 13:33:12 +00:00
59d6bb4c86 [Hardware][AMD]: Replace HIPCC version with more precise ROCm version (#11515)
Signed-off-by: hjwei <hjwei_xd@163.com>
2024-12-28 11:17:35 +00:00
b7dcc003dc [Model] Remove hardcoded image tokens ids from Pixtral (#11582)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-28 10:54:23 +00:00
d34be24bb1 [Model] Support InternLM2 Reward models (#11571)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-28 06:14:10 +00:00
b5cbe8eeb3 [Bugfix] Last token measurement fix (#11376)
Signed-off-by: rajveerb <46040700+rajveerb@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-12-28 11:34:46 +08:00
df04dffade [V1] [4/N] API Server: ZMQ/MP Utilities (#11541) 2024-12-28 01:45:08 +00:00
a60731247f [Doc] Update mllama example based on official doc (#11567)
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
2024-12-28 00:31:10 +00:00
ac79799403 [Bugfix] Fix for ROCM compressed tensor support (#11561) 2024-12-27 20:12:11 +00:00
dde1fa18c9 [Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix (#11566)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-27 19:45:13 +00:00
0240402c46 [Misc] Add BNB quantization for MolmoForCausalLM (#11551)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-27 18:48:24 +00:00
55509c2114 [MODEL] LoRA support for Jamba model (#11209)
Signed-off-by: Erez Schwartz <erezs@ai21.com>
2024-12-27 17:58:21 +00:00
101418096f [VLM] Support caching in merged multi-modal processor (#11396)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-27 17:22:48 +00:00
5ce4627a7e [Doc] Add xgrammar in doc (#11549)
Signed-off-by: ccjincong <chenjincong11@gmail.com>
2024-12-27 13:05:10 +00:00
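Alongside the xgrammar documentation entry above, here is a minimal sketch of requesting JSON-schema-guided output through the OpenAI-compatible server; the vLLM-specific extra_body fields (guided_json, guided_decoding_backend), the model name, and the local server address are assumptions based on the documented API of this period.

```python
# Minimal sketch of schema-guided decoding through vLLM's OpenAI-compatible
# server; guided_json and guided_decoding_backend are vLLM-specific extra_body
# fields assumed from the documentation of this period.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Give me a person as JSON."}],
    extra_body={"guided_json": schema, "guided_decoding_backend": "xgrammar"},
)
print(resp.choices[0].message.content)
```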
7af553ea30 [Misc] Abstract the logic for reading and writing media content (#11527)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-27 19:21:23 +08:00
2c9b8ea2b0 [Bugfix] Fix TeleChat2ForCausalLM weights mapper (#11546)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-27 10:39:15 +00:00
d003f3ea39 Update deploying_with_k8s.md with AMD ROCm GPU example (#11465)
Signed-off-by: Alex He <alehe@amd.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-27 10:00:04 +00:00
6c6f7fe8a8 [Platform] Move model arch check to platform (#11503)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2024-12-27 08:45:25 +00:00
2339d59f92 [BugFix] Fix quantization for all other methods (#11547) 2024-12-26 22:23:29 -08:00
1b875a0ef3 [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly (#11534) 2024-12-26 21:19:21 -08:00
eb881ed006 [misc] fix typing (#11540)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-27 11:05:08 +08:00
46d4359450 [CI] Fix broken CI (#11543) 2024-12-26 18:49:16 -08:00
81b979f2a8 [V1] Fix yapf (#11538)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-27 09:47:10 +09:00
371d04d39b [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling (#11394)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-27 09:32:38 +09:00
0c0c2015c5 Update openai_compatible_server.md (#11536)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-12-26 16:26:18 -08:00
82d24f7aac [Docs] Document Deepseek V3 support (#11535)
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-12-26 16:21:56 -08:00
f49777ba62 Deepseek v3 (#11502)
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: robertgshaw2-neuralmagic <rshaw@neuralmagic.com>
2024-12-26 16:09:44 -08:00
55fb97f7bd [2/N] API Server: Avoid ulimit footgun (#11530) 2024-12-26 23:43:05 +00:00
2072924d14 [Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization (#11523)
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: simon-mo <simon.mo@hey.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: simon-mo <simon.mo@hey.com>
Co-authored-by: simon-mo <xmo@berkeley.edu>
Co-authored-by: HandH1998 <1335248067@qq.com>
2024-12-26 15:33:30 -08:00
720b10fdc6 [1/N] API Server (Remove Proxy) (#11529) 2024-12-26 23:03:43 +00:00
b85a977822 [Doc] Add video example to openai client for multimodal (#11521)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-26 17:31:29 +00:00
eec906d811 [Misc] Add placeholder module (#11501)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-26 13:12:51 +00:00
f57ee5650d [Model] Modify MolmoForCausalLM MLP (#11510)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-26 13:12:05 +00:00
dcb1a944d4 [V1] Adding min tokens/repetition/presence/frequency penalties to V1 sampler (#10681)
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-26 19:02:58 +09:00
7492a36207 [Doc] Add QVQ and QwQ to the list of supported models (#11509)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-12-26 09:44:32 +00:00
aa25985bd1 [Misc][LoRA] Fix LoRA weight mapper (#11495)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-26 15:52:48 +08:00
dbeac95dbb Mypy checking for vllm/compilation (#11496)
Signed-off-by: lucast2021 <lucast2021@headroyce.org>
Co-authored-by: lucast2021 <lucast2021@headroyce.org>
2024-12-26 05:04:07 +00:00
51a624bf02 [Misc] Move some multimodal utils to modality-specific modules (#11494)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-26 04:23:20 +00:00
6ad909fdda [Doc] Improve GitHub links (#11491)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-25 14:49:26 -08:00
b689ada91e [Frontend] Enable decord to load video from base64 (#11492)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-25 16:33:55 +00:00
fc601665eb [Misc] Update disaggregation benchmark scripts and test logs (#11456)
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>
2024-12-25 06:58:48 +00:00
9832e5572a [V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor (#11472) 2024-12-24 19:49:46 -08:00
3f3e92e1f2 [Model] Automatic conversion of classification and reward models (#11469)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-24 18:22:22 +00:00
409475a827 [Bugfix] Fix issues in CPU build Dockerfile. Fixes #9182 (#11435)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2024-12-24 16:53:28 +00:00
196c34b0ac [Misc] Move weights mapper (#11443)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-24 13:05:25 +00:00
5c7963249d [attn][tiny fix] fix attn backend in MultiHeadAttention (#11463)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2024-12-24 12:39:36 +00:00
461cde2080 [OpenVINO] Fixed installation conflicts (#11458)
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com>
2024-12-24 11:38:21 +00:00
7a5286cc04 [Bugfix][Hardware][CPU] Fix CPU input_positions creation for text-only inputs with mrope (#11434)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-24 17:59:51 +08:00
b1b1038fbd [Bugfix] Fix Qwen2-VL LoRA weight loading (#11430)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-24 09:56:10 +00:00
9edca6bf8f [Frontend] Online Pooling API (#11457)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-24 17:54:30 +08:00
4f074fbf53 [Misc] Suppress irrelevant exception stack trace information when CUDA… (#11438)
Co-authored-by: shiquan <shiquan>
2024-12-24 08:43:39 +00:00
a491d6f535 [V1] TP Ray executor (#11107)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-12-23 23:00:12 +00:00
32aa2059ad [Docs] Convert rST to MyST (Markdown) (#11145)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
2024-12-23 22:35:38 +00:00
94d545a1a1 [Doc] Fix typo in the help message of '--guided-decoding-backend' (#11440) 2024-12-23 20:20:44 +00:00
60fb4f3bcf [Bugfix] Add kv cache scales to gemma2.py (#11269) 2024-12-23 19:30:45 +00:00
63afbe9215 [CI] Expand OpenAI test_chat.py guided decoding tests (#11048)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-23 18:35:38 +00:00
8cef6e02dc [Misc] add w8a8 asym models (#11075) 2024-12-23 13:33:20 -05:00
b866cdbd05 [Misc] Add assertion and helpful message for marlin24 compressed models (#11388) 2024-12-24 02:23:38 +08:00
2e726680b3 [Bugfix] Fix torch nightly version in ROCm installation guide (#11423)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2024-12-23 17:20:22 +00:00
5bfb30a529 [Bugfix] Fix CFGGuide and use outlines for grammars that can't convert to GBNF (#11389)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-23 23:06:20 +08:00
e51719ae72 mypy type checking for vllm/worker (#11418)
Signed-off-by: lucast2021 <lucast2021@headroyce.org>
Co-authored-by: lucast2021 <lucast2021@headroyce.org>
2024-12-23 13:55:49 +00:00
f30581c518 [misc][perf] remove old code (#11425)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-23 08:01:08 +00:00
048fc57a0f [CI] Unblock H100 Benchmark (#11419)
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-12-22 14:17:43 -08:00
f1d1bf6288 [Bugfix] Fix fully sharded LoRAs with Mixtral (#11390)
Signed-off-by: Jason Greene <jason.greene@redhat.com>
2024-12-22 23:25:10 +08:00
72d9c316d3 [cd][release] fix race conditions (#11407)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-22 00:39:11 -08:00
4a9139780a [cd][release] add pypi index for every commit and nightly build (#11404)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-12-21 23:53:44 -08:00
29c748930e [CI] Fix flaky entrypoint tests (#11403)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-21 21:08:44 -08:00
c2d1b075ba [Bugfix] Fix issues for Pixtral-Large-Instruct-2411 (#11393)
Signed-off-by: ywang96 <ywang@example.com>
Co-authored-by: ywang96 <ywang@example.com>
2024-12-21 10:15:03 +00:00
584f0ae40d [V1] Make AsyncLLMEngine v1-v0 opaque (#11383)
Signed-off-by: Ricky Xu <xuchen727@hotmail.com>
2024-12-21 15:14:08 +08:00
51ff216d85 [Bugfix] update should_ignore_layer (#11354)
Signed-off-by: George Ohashi <george@neuralmagic.com>
2024-12-21 06:36:23 +00:00
dd2b5633dd [V1][Bugfix] Skip hashing empty or None mm_data (#11386)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-21 14:22:21 +09:00
47a0b615b4 Add ray[default] to wget to run distributed inference out of the box (#11265)
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>
2024-12-20 13:54:55 -08:00
5d2248d81a [doc] explain nccl requirements for rlhf (#11381)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-20 13:00:56 -08:00
d573aeadcc [Bugfix] Don't log OpenAI field aliases as ignored (#11378)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-20 19:03:50 +00:00
995f56236b [Core] Loading model from S3 using RunAI Model Streamer as optional loader (#10192)
Signed-off-by: OmerD <omer@run.ai>
2024-12-20 16:46:24 +00:00
7c7aa37c69 [CI/Build] fix pre-compiled wheel install for exact tag (#11373)
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com>
2024-12-21 00:14:40 +08:00
04139ade59 [V1] Fix profiling for models with merged input processor (#11370)
Signed-off-by: ywang96 <ywang@roblox.com>
2024-12-20 12:04:21 +00:00
1ecc645b8f [doc] backward compatibility for 0.6.4 (#11359)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-19 21:33:53 -08:00
c954f21ac0 [misc] add early error message for custom ops (#11355)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-19 21:18:25 -08:00
86c2d8fd1c [Bugfix] Fix spec decoding when seed is none in a batch (#10863)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
2024-12-20 05:15:31 +00:00
b880ffb87e [Misc] Add tqdm progress bar during graph capture (#11349)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-20 04:35:18 +00:00
7801f56ed7 [ci][gh200] dockerfile clean up (#11351)
Signed-off-by: drikster80 <ed.sealing@gmail.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: drikster80 <ed.sealing@gmail.com>
Co-authored-by: cenzhiyao <2523403608@qq.com>
2024-12-19 18:13:06 -08:00
48edab8041 [Bugfix][Hardware][POWERPC] Fix auto dtype failure in case of POWER10 (#11331)
Signed-off-by: Akash Kaothalkar <0052v2@linux.vnet.ibm.com>
2024-12-20 01:32:07 +00:00
a985f7af9f [CI] Adding CPU docker pipeline (#11261)
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
2024-12-19 11:46:55 -08:00
e461c262f0 [Misc] Remove unused vllm/block.py (#11336) 2024-12-19 17:54:24 +00:00
276738ce0f [Bugfix] Fix broken CPU compressed-tensors test (#11338)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-19 17:37:31 +00:00
cdf22afdda [Misc] Clean up and consolidate LRUCache (#11339)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-20 00:59:32 +08:00
e24113a8fe [Model] Refactor Qwen2-VL to use merged multimodal processor (#11258)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-19 16:28:00 +00:00
7379b3d4b2 [V1] Fix multimodal profiling for Molmo (#11325)
Signed-off-by: ywang96 <ywang@example.com>
Co-authored-by: ywang96 <ywang@example.com>
2024-12-19 16:27:22 +00:00
6c7f881541 [Model] Add JambaForSequenceClassification model (#10860)
Signed-off-by: Yehoshua Cohen <yehoshuaco@ai21.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Yehoshua Cohen <yehoshuaco@ai21.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-19 22:48:06 +08:00
a0f7d53beb [Bugfix] Cleanup Pixtral HF code (#11333)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-19 13:22:00 +00:00
5aef49806d [Feature] Add load generation config from model (#11164)
Signed-off-by: liuyanyi <wolfsonliu@163.com>
Signed-off-by: Yanyi Liu <wolfsonliu@163.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-12-19 10:50:38 +00:00
98356735ac [misc] benchmark_throughput: Add LoRA (#11267)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-19 15:43:16 +08:00
f26c4aeecb [Misc] Optimize ray worker initialization time (#11275)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-12-18 23:38:02 -08:00
8936316d58 [Kernel] Refactor Cutlass c3x (#10049)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-19 07:00:18 +00:00
6142ef0ada [VLM] Merged multimodal processor for Qwen2-Audio (#11303)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-19 06:14:17 +00:00
c6b0a7d3ba [V1] Simplify prefix caching logic by removing num_evictable_computed_blocks (#11310) 2024-12-19 04:17:12 +00:00
a30482f054 [CI] Expand test_guided_generate to test all backends (#11313)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-19 04:00:38 +00:00
17ca964273 [Model] IBM Granite 3.1 (#11307)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-12-19 11:27:24 +08:00
5a9da2e6e9 [Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) (#11311)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-19 02:43:30 +00:00
fdea8ec167 [V1] VLM - enable processor cache by default (#11305)
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com>
2024-12-18 18:54:46 -05:00
ca5f54a9b9 [Bugfix] fix minicpmv test (#11304)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-12-18 10:34:26 -08:00
f954fe0e65 [FIX] update openai version (#11287)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2024-12-18 10:17:05 -08:00
362cff1eb3 [CI][Misc] Remove Github Action Release Workflow (#11274) 2024-12-18 10:16:53 -08:00
996aa70f00 [Bugfix] Fix broken phi3-v mm_processor_kwargs tests (#11263)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-18 10:16:40 -08:00
60508ffda9 [Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support (#10995)
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com>
Co-authored-by: ilmarkov <markovilya197@gmail.com>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
2024-12-18 09:57:16 -05:00
f04e407e6b [MISC][XPU]update ipex link for CI fix (#11278) 2024-12-17 22:34:23 -08:00
8b79f9e107 [Bugfix] Fix guided decoding with tokenizer mode mistral (#11046) 2024-12-17 22:34:08 -08:00
866fa4550d [Bugfix] Restore support for larger block sizes (#11259)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
2024-12-17 16:39:07 -08:00
bf8717ebae [V1] Prefix caching for vision language models (#11187)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2024-12-17 16:37:59 -08:00
c77eb8a33c [Bugfix] Set temperature=0.7 in test_guided_choice_chat (#11264) 2024-12-17 16:34:06 -08:00
2d1b9baa8f [Bugfix] Fix request cancellation without polling (#11190) 2024-12-17 12:26:32 -08:00
f9ecbb18bf [Misc] Allow passing logits_soft_cap for xformers backend (#11252)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-17 00:37:04 -08:00
02222a0256 [Misc] Kernel Benchmark for RMSNorm (#11241)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Xiaoyu Zhang <BBuf@users.noreply.github.com>
2024-12-17 06:57:02 +00:00
2bfdbf2a36 [V1][Core] Use weakref.finalize instead of atexit (#11242)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-16 22:11:33 -08:00
e88db68cf5 [Platform] platform agnostic for EngineArgs initialization (#11225)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2024-12-16 22:11:06 -08:00
59c9b6ebeb [V1][VLM] Proper memory profiling for image language models (#11210)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: ywang96 <ywang@example.com>
2024-12-16 22:10:57 -08:00
66d4b16724 [Frontend] Add OpenAI API support for input_audio (#11027)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-16 22:09:58 -08:00
0064f697d3 [CI] Add test case with JSON schema using references + use xgrammar by default with OpenAI parse (#10935)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-17 11:39:58 +08:00
35bae114a8 fix gh200 tests on main (#11246)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-16 17:22:38 -08:00
88a412ed3d [torch.compile] fast inductor (#11108)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-16 16:15:22 -08:00
c301616ed2 [ci][tests] add gh200 tests (#11244)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-16 15:53:18 -08:00
35ffa682b1 [Docs] hint to enable use of GPU performance counters in profiling tools for multi-node distributed serving (#11235)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-12-16 22:20:39 +00:00
551603feff [core] overhaul memory profiling and fix backward compatibility (#10511)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-16 13:32:25 -08:00
efbce85f4d [misc] Layerwise profile updates (#10242)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-16 18:14:57 +00:00
2ca830dbaa [Doc] Reorder vision language examples in alphabet order (#11228)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-16 11:23:33 +00:00
d927dbcd88 [Model] Refactor Ultravox to use merged input processor (#11198)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-16 10:09:53 +00:00
bddbbcb132 [Model] Support Cohere2ForCausalLM (Cohere R7B) (#11203) 2024-12-16 09:56:19 +00:00
b3b1526f03 WIP: [CI/Build] simplify Dockerfile build for ARM64 / GH200 (#11212)
Signed-off-by: drikster80 <ed.sealing@gmail.com>
Co-authored-by: drikster80 <ed.sealing@gmail.com>
2024-12-16 09:20:49 +00:00
17138af7c4 [Bugfix] Fix the default value for temperature in ChatCompletionRequest (#11219) 2024-12-16 00:15:40 -08:00
69ba344de8 [Bugfix] Fix block size validation (#10938) 2024-12-15 16:38:40 -08:00
da6f409246 Update deploying_with_k8s.rst (#10922) 2024-12-15 16:33:58 -08:00
25ebed2f8c [V1][Minor] Cache np arange to reduce input preparation overhead (#11214)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-15 13:33:00 -08:00
d263bd9df7 [Core] Support disaggregated prefill with Mooncake Transfer Engine (#10884)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2024-12-15 21:28:18 +00:00
38e599d6a8 [Doc] add documentation for disaggregated prefilling (#11197)
Signed-off-by: Kuntai Du <kuntai@uchicago.edu>
2024-12-15 13:31:16 -06:00
96d673e0f8 [Bugfix] Fix error handling of unsupported sliding window (#11213)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-15 10:59:42 -07:00
b10609e6a1 [Misc] Clean up multi-modal processor (#11207)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-15 06:30:28 +00:00
a1c02058ba [torch.compile] allow tracking forward time (#11081)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-14 19:45:00 -08:00
15859f2357 [Misc] Upgrade bitsandbytes to the latest version 0.45.0 (#11201) 2024-12-15 03:03:06 +00:00
886936837c [Performance][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion (#7209) 2024-12-14 11:38:10 -08:00
6d917d0eeb Enable mypy checking on V1 code (#11105)
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
2024-12-14 09:54:04 -08:00
93abf23a64 [VLM] Fully dynamic prompt replacement in merged input processor (#11199)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-14 17:52:18 +00:00
9c3dadd1c9 [Frontend] Add logits_processors as an extra completion argument (#11150)
Signed-off-by: Brad Hilton <brad.hilton.nw@gmail.com>
2024-12-14 16:46:42 +00:00
3cb5769883 [Misc] Minor improvements to the readability of PunicaWrapperBase (#11200)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-14 16:38:27 +00:00
ea7bd68d10 [V1][Bugfix] Fix V1 TP trust-remote-code (#11182)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-14 08:21:23 +00:00
48259264a4 [Core] Update outlines and increase its threadpool size (#11140)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-12-14 07:46:18 +00:00
24a3d12b82 update compressed-tensors to latest version (#11183)
Co-authored-by: dhuangnm <dhuang@MacBook-Pro-2.local>
2024-12-14 03:22:44 +00:00
9855aea21b [Bugfix][V1] Re-compute an entire block when fully cache hit (#11186)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2024-12-13 17:08:23 -08:00
4b5b8a6a3b [V1][Bugfix] Fix EngineCoreProc profile (#11185)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-13 17:02:35 -08:00
4863e5fba5 [Core] V1: Use multiprocessing by default (#11074)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-12-13 16:27:32 -08:00
0d8451c3a4 [Distributed] Allow the placement group more time to wait for resources to be ready (#11138)
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>
2024-12-13 20:17:37 +00:00
0a56bcc03d [Bugfix][Hardware][CPU] Enable Gemma2 with SDPA on CPU backend (#11169) 2024-12-13 18:00:40 +00:00
0920ab9131 [Doc] Reorganize online pooling APIs (#11172)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-14 00:22:22 +08:00
238c0d93b4 [Misc] Add tokenizer_mode param to benchmark_serving.py (#11174)
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com>
2024-12-13 16:19:10 +00:00
5b0ed8391d [Bugfix] using len(tokenizer) instead of tokenizer.vocab_size in AllowedTokenIdsLogitsProcessor (#11156) 2024-12-13 15:56:19 +00:00
c31d4a57a6 [Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching (#8240) 2024-12-13 07:51:25 -08:00
d1fa714cb1 [Refactor]A simple device-related refactor (#11163)
Signed-off-by: noemotiovon <noemotiovon@gmail.com>
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-12-13 13:39:00 +00:00
969da7d70b [V1][VLM] Fix edge case bug for InternVL2 (#11165)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-13 11:09:30 +00:00
eeec9e3390 [Frontend] Separate pooling APIs in offline inference (#11129)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-13 10:40:07 +00:00
f93bf2b189 [Bugfix][CI][CPU] add missing datasets package to requirements-cpu.txt (#11159)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-12-13 08:50:35 +00:00
7cd7409142 PaliGemma 2 support (#11142) 2024-12-13 07:40:07 +00:00
be39e3cd18 [core] clean up cudagraph batchsize padding logic (#10996)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-13 06:57:50 +00:00
34f1a806d5 [Bugfix][V1] Fix 'NoneType' object has no attribute 'hash_value' (#11157)
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
2024-12-13 06:30:06 +00:00
00c1bde5d8 [ROCm][AMD] Disable auto enabling chunked prefill on ROCm (#11146)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2024-12-13 05:31:26 +00:00
3989a79824 [Bugfix] Update starcoder2 to remap k/v scale names for kv_cache quantization (#11148) 2024-12-13 05:07:20 +00:00
1efce68605 [Bugfix] Use runner_type instead of task in GritLM (#11144)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
2024-12-13 04:09:53 +00:00
30870b4f66 [torch.compile] Dynamic fp8 + rms_norm fusion (#10906)
Signed-off-by: luka <luka@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-13 03:19:23 +00:00
78ed8f57d8 [Misc][V1] Fix type in v1 prefix caching (#11151) 2024-12-13 00:57:40 +00:00
db6c264a1e [Bugfix] Fix value unpack error of simple connector for KVCache transfer. (#11058)
Signed-off-by: ShangmingCai <csmthu@gmail.com>
2024-12-12 21:19:17 +00:00
9f3974a319 Fix logging of the vLLM Config (#11143) 2024-12-12 12:05:57 -08:00
2c97eca1ff [Misc] Validate grammar and fail early (#11119) 2024-12-12 18:34:26 +00:00
5d712571af [Bugfix] Quick fix to make Pixtral-HF load correctly again after 39e227c7ae. (#11024) 2024-12-12 18:09:20 +00:00
d4d5291cc2 fix(docs): typo in helm install instructions (#11141)
Signed-off-by: Ramon Ziai <ramon.ziai@bettermarks.com>
2024-12-12 17:36:32 +00:00
4816d20aa4 [V1] Fix torch profiling for offline inference (#11125)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-12 15:51:53 +00:00
85362f028c [Misc][LoRA] Ensure Lora Adapter requests return adapter name (#11094)
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-12 09:25:16 +00:00
62de37a38e [core][distributed] initialization from StatelessProcessGroup (#10986)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-12 09:04:19 +00:00
8195824206 [Hardware][Intel-Gaudi] Enable LoRA support for Intel Gaudi (HPU) (#10565)
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai>
2024-12-12 08:09:28 +00:00
f092153fbe [V1] Use more persistent buffers to optimize input preparation overheads (#11111)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-11 23:14:20 -08:00
1da8f0e1dd [Model] Add support for embedding model GritLM (#10816)
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io>
2024-12-12 06:39:16 +00:00
ccede2b264 [Core] cleanup zmq ipc sockets on exit (#11115)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-12-11 19:12:24 -08:00
24a36d6d5f Update link to LlamaStack remote vLLM guide in serving_with_llamastack.rst (#11112)
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2024-12-12 02:39:21 +00:00
8fb26dac61 [Docs] Add media kit (#11121) 2024-12-11 17:33:11 -08:00
7439a8b5fc [Bugfix] Multiple fixes to tool streaming with hermes and mistral (#10979)
Signed-off-by: cedonley <clayton@donley.io>
2024-12-12 01:10:12 +00:00
4e11683368 [V1] VLM preprocessor hashing (#11020)
Signed-off-by: Roger Wang <ywang@roblox.com>
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-12 00:55:30 +00:00
452a723bf2 [V1][Core] Remove should_shutdown to simplify core process termination (#11113)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-11 23:34:54 +00:00
d1e21a979b [CI/Build] Split up VLM tests (#11083)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-12 06:18:16 +08:00
72ff3a9686 [core] Bump ray to use _overlap_gpu_communication in compiled graph tests (#10410)
Signed-off-by: Rui Qiao <ubuntu@ip-172-31-15-128.us-west-2.compute.internal>
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Co-authored-by: Rui Qiao <ubuntu@ip-172-31-15-128.us-west-2.compute.internal>
2024-12-11 11:36:35 -08:00
66aaa7722d [torch.compile] remove graph logging in ci (#11110)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-11 10:59:50 -08:00
d643c2aba1 [V1] Use input_ids as input for text-only models (#11032)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-11 10:49:23 -08:00
91642db952 [torch.compile] use depyf to dump torch.compile internals (#10972)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-11 10:43:05 -08:00
fd22220687 [Doc] Installed version of llmcompressor for int8/fp8 quantization (#11103)
Signed-off-by: Guangda Liu <bingps@users.noreply.github.com>
Co-authored-by: Guangda Liu <bingps@users.noreply.github.com>
2024-12-11 15:43:24 +00:00
b2f775456e [CI/Build] Enable prefix caching test for AMD (#11098)
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com>
2024-12-11 15:23:37 +00:00
cad5c0a6ed [Doc] Update docs to refer to pooling models (#11093)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-11 13:36:27 +00:00
8f10d5e393 [Misc] Split up pooling tasks (#10820)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-11 01:28:00 -08:00
40766ca1b8 [Bugfix]: Clamp -inf logprob values in prompt_logprobs (#11073)
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com>
2024-12-11 01:27:39 -08:00
2e32f5d28d [Bugfix] Fix Idefics3 fails during multi-image inference (#11080)
Signed-off-by: B-201 <Joy25810@foxmail.com>
2024-12-11 01:27:07 -08:00
61b1d2f6ae [Core] v1: Use atexit to handle engine core client shutdown (#11076)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-12-11 01:26:36 -08:00
9974fca047 [ci/build] Fix entrypoints test and pin outlines version (#11088) 2024-12-11 01:01:53 -08:00
3fb4b4f163 [ci/build] Fix AMD CI dependencies (#11087) 2024-12-11 00:39:53 -08:00
2e33fe4191 [CI/Build] Check transformers v4.47 (#10991)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-11 05:02:02 +00:00
e39400a4b6 Fix streaming for granite tool call when <|tool_call|> is present (#11069)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-12-11 04:51:40 +00:00
ffa48c9146 [Model] PP support for Mamba-like models (#10992)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-12-10 21:53:37 -05:00
d5c5154fcf [Misc] LoRA + Chunked Prefill (#9057) 2024-12-11 10:09:20 +08:00
9a93973708 [Bugfix] Fix Mamba multistep (#11071)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-11 00:16:22 +00:00
134810b3d9 [V1][Bugfix] Always set enable_chunked_prefill = True for V1 (#11061)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-10 14:41:23 -08:00
75f89dc44c [torch.compile] add a flag to track batchsize statistics (#11059)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-10 12:40:52 -08:00
e739194926 [Core] Update to outlines >= 0.1.8 (#10576)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-12-10 12:08:16 -08:00
250ee65d72 [BUG] Remove token param #10921 (#11022)
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
2024-12-10 17:38:15 +00:00
9b9cef3145 [Bugfix] Backport request id validation to v0 (#11036)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-12-10 16:38:23 +00:00
d05f88679b [Misc][LoRA] Add PEFTHelper for LoRA (#11003)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-10 11:12:01 +00:00
beb16b2c81 [Bugfix] Handle <|tool_call|> token in granite tool parser (#11039)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-12-10 10:27:11 +00:00
fe2e10c71b Add example of helm chart for vllm deployment on k8s (#9199)
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com>
2024-12-10 09:19:27 +00:00
82c73fd510 [Bugfix] cuda error running llama 3.2 (#11047) 2024-12-10 07:41:11 +00:00
bfd610430c Update README.md (#11034) 2024-12-09 23:08:10 -08:00
e35879c276 [Bugfix] Fix xgrammar failing to read a vocab_size from LlavaConfig on PixtralHF. (#11043) 2024-12-10 14:54:22 +08:00
ebf778061d monitor metrics of tokens per step using cudagraph batchsizes (#11031)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-09 22:35:36 -08:00
28b3a1c7e5 [V1] Multiprocessing Tensor Parallel Support for v1 (#9856)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-10 06:28:14 +00:00
bc192a2b09 [Pixtral] Improve loading (#11040) 2024-12-10 06:09:32 +00:00
980ad394a8 [Frontend] Use request id from header (#10968)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-12-10 13:46:29 +08:00
391d7b2763 [Bugfix] Fix usage of deprecated decorator (#11025)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-10 13:45:47 +08:00
d1f6d1c8af [Model] Add has_weight to RMSNorm and re-enable weights loading tracker for Mamba (#10739)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-10 10:23:07 +08:00
6d525288c1 [Docs] Add dedicated tool calling page to docs (#10554)
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-09 20:15:34 -05:00
6faec54505 [V1] Do not store None in self.generators (#11038)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-09 15:08:19 -08:00
5ed5d5f128 Build tpu image in release pipeline (#10936)
Signed-off-by: Richard Liu <ricliu@google.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
2024-12-09 23:07:48 +00:00
b63ba84832 [ROCm][bugfix] speculative decoding worker class (#11035)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2024-12-09 14:00:29 -08:00
9c6459e4cb [Neuron] Upgrade neuron to 2.20.2 (#11016)
Signed-off-by: Jerzy Zagorski <jzagorsk@amazon.com>
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com>
2024-12-09 13:53:24 -08:00
1a2f8fb828 [v1] fix use compile sizes (#11000)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-09 13:47:24 -08:00
cbcbdb1ceb [Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version (#11028)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
2024-12-09 13:21:06 -08:00
a811dd6608 [Model] merged input processor for Phi-3-Vision models (#10977)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-09 12:55:10 -08:00
ca871491ed [Misc][LoRA] Abstract PunicaWrapper (#10955)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-09 12:54:44 -08:00
3b61cb450d [V1] Further reduce CPU overheads in flash-attn (#10989)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-09 12:38:46 -08:00
edc4fa3188 [ci/build] Recompile CI dependencies list with Python 3.12 (#11013)
Signed-off-by: kevin <kevin@anyscale.com>
2024-12-09 11:46:58 -08:00
25b79d9fd3 [V1] Input Batch Relocation (#10962)
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-12-09 09:33:41 -08:00
aea2fc38c3 [Platform] Move async output check to platform (#10768)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2024-12-09 17:24:46 +00:00
e691b26f6f [Core] Require xgrammar >= 0.1.6 (#11021)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-12-09 16:44:27 +00:00
c690357928 [V1] Fix Detokenizer loading in AsyncLLM (#10997)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-09 16:27:10 +00:00
d1c2e15eb3 [torch.compile] add dynamo time tracking (#11005)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-08 23:09:04 -08:00
af7c4a92e6 [Doc][V1] Add V1 support column for multimodal models (#10998)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-08 22:29:16 -08:00
46004e83a2 [misc] clean up and unify logging (#10999)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-08 17:28:27 -08:00
43b05fa314 [torch.compile][misc] fix comments (#10993)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-08 11:18:18 -08:00
a11f326528 [V1] Initial support of multimodal models for V1 re-arch (#10699)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-12-08 12:50:51 +00:00
fd57d2b534 [torch.compile] allow candidate compile sizes (#10984)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-08 11:05:21 +00:00
7be15d9356 [core][misc] remove use_dummy driver for _run_workers (#10920)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-07 12:06:08 -08:00
1b62745b1d [core][executor] simplify instance id (#10976)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-07 09:33:45 -08:00
78029b34ed [BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None (#10928)
Signed-off-by: xffxff <1247714429@qq.com>
2024-12-08 01:21:18 +08:00
c889d5888b [Doc] Explicitly state that PP isn't compatible with speculative decoding yet (#10975)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-07 17:20:49 +00:00
39e227c7ae [Model] Update multi-modal processor to support Mantis(LLaVA) model (#10711)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-07 17:10:05 +00:00
1c768fe537 [Doc] Explicitly state that InternVL 2.5 is supported (#10978)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-07 16:58:02 +00:00
bf0e382e16 [Model] Composite weight loading for multimodal Qwen2 (#10944)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-07 07:22:52 -07:00
b26b4cd03c [Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora implementation (#10958)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-07 18:33:49 +08:00
f13cf9ad50 [Build] Fix for the Wswitch-bool clang warning (#10060)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2024-12-07 09:03:44 +00:00
955fa9533a [3/N] Support and implement merged input processor for LLaVA model (#10676)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-07 00:50:58 -08:00
acf092d348 [Bugfix] Fix test-pipeline.yaml (#10973)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-07 12:08:54 +08:00
69d357ba12 [Core] Cleanup startup logging a bit (#10961)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-12-07 02:30:23 +00:00
dcdc3fafe5 [ci] fix broken tests (#10956)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-06 11:25:47 -08:00
c05cfb67da [misc] fix typo (#10960)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-06 11:25:20 -08:00
7406274041 [Doc] add KubeAI to serving integrations (#10837)
Signed-off-by: Sam Stoelinga <sammiestoel@gmail.com>
2024-12-06 17:03:56 +00:00
8b59631855 [Core] Support Lark grammars for XGrammar (#10870)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-06 08:34:29 -07:00
a1887f2c96 [torch.compile] fix deprecated code (#10948)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-06 11:01:23 +00:00
222f5b082a [CI/Build] Fix broken multimodal test (#10950) 2024-12-06 10:41:23 +00:00
b031a455a9 [torch.compile] add logging for compilation time (#10941)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-06 10:07:15 +00:00
db87eb6c67 [torch.compile] use size tuning for specific sizes (#10933)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-05 20:30:41 -08:00
9743d64e4e [ci][build] add tests for python only compilation (#10915)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-05 08:54:47 -08:00
a43065272f [Misc][Gaudi] Avoid torch.compile and enable lazy collectives (#10897)
Signed-off-by: Konrad Zawora <kzawora@habana.ai>
2024-12-05 08:47:46 -08:00
998eeafe58 [CI/Build] Bump test transformers version (#10106)
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-05 16:05:52 +00:00
571da8fc43 [Misc][LoRA] Clean up the function interface of Punica (#10917)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-05 13:22:28 +00:00
39c89e71a8 [Misc] Update llama 3.2 template to support system prompt with images (#10901)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-12-05 05:54:06 +00:00
1f958a7d52 [Bugfix] Fix BNB loader target_modules (#10720)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-05 13:20:26 +08:00
aa39a8e175 [Doc] Create a new "Usage" section (#10827)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-05 11:19:35 +08:00
8d370e91cb [Bugfix] Fallback to outlines for complex json schemas (#10899)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-05 11:14:06 +08:00
7883c2bbe7 [benchmark] Make H100 benchmark optional (#10908) 2024-12-04 17:02:17 -08:00
2a56e1264f [V1] Fix when max_model_len is not divisible by block_size (#10903)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-04 16:54:05 -08:00
e4c34c23de [CI/Build] improve python-only dev setup (#9621)
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-12-04 21:48:13 +00:00
82eb5ea8f3 Benchmark serving structured output (#10880)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-12-04 16:28:21 -05:00
10398b4706 [Model] Consolidate ViTs attention implementation without mask (#10893)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-04 18:11:08 +00:00
01d079fd8e [LoRA] Change lora_tokenizers capacity (#10796)
Signed-off-by: Xin Yang <xyang19@gmail.com>
2024-12-04 17:40:16 +00:00
c92acb9693 [ci/build] Update vLLM postmerge ECR repo (#10887) 2024-12-04 09:01:20 +00:00
8db957ee3a [bugfix] fix parameter "n" when parameter "best_of" is set > 1 (#10854)
Signed-off-by: jianzheng <57654625+o2363286@users.noreply.github.com>
2024-12-04 08:48:22 +00:00
c9ca4fce3f [ci/build] Job to build and push release image (#10877) 2024-12-04 15:02:40 +08:00
fa2dea61df [ci/build] Change queue name for Release jobs (#10875) 2024-12-04 15:02:16 +08:00
b5b647b084 Drop ROCm load format check (#10767)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2024-12-04 04:32:21 +00:00
d2bd88b122 [CI/Build] Replace mean with torch.all in test_pynccl.py (#10876)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-12-04 03:23:21 +00:00
381ac93bb5 [Benchmark] Benchmark structured output with datasets (#10557)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
Co-authored-by: Aaron Pham <contact@aarnphm.xyz>
2024-12-03 17:21:06 -07:00
a061fe601e [Build][Bugfix] Using the correct type hint (#10866)
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
2024-12-03 15:47:55 -05:00
7c32b6861e [Frontend] correctly record prefill and decode time metrics (#10853)
Signed-off-by: Tomer Asida <tomera@ai21.com>
2024-12-03 19:13:31 +00:00
7090c27bb2 [Bugfix] Only require XGrammar on x86 (#10865)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-12-03 10:32:21 -08:00
2f2cdc745a [MISC][XPU] quick fix for XPU CI (#10859)
Signed-off-by: yan ma <yan.ma@intel.com>
2024-12-03 17:16:31 +00:00
3bc94cab69 [V1] VLM - Run the mm_mapper preprocessor in the frontend process (#10640)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-12-03 10:33:10 +00:00
f6084f6324 [Speculative Decoding] Move indices to device before filtering output (#10850)
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com>
2024-12-03 17:01:39 +08:00
9323a3153b [Core][Performance] Add XGrammar support for guided decoding and set it as default (#10785)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-12-03 15:17:00 +08:00
3257d449fa [Misc] Remove deprecated names (#10817)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-03 06:52:57 +00:00
ef51831ee8 [Doc] Add github links for source code references (#10672)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-03 06:46:07 +00:00
dc5ce861bf [torch.compile] remove compilation_context and simplify code (#10838)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-03 06:19:02 +00:00
21fe7b481a [core][distributed] add pynccl broadcast (#10843)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-03 04:53:23 +00:00
a4cf256159 [Bugfix] Fix QKVParallelLinearWithShardedLora bias bug (#10844)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-03 12:10:29 +08:00
d746268e92 [Model] support bitsandbytes quantization with minicpm model (#10842)
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com>
2024-12-03 03:06:41 +00:00
4433195ab7 [Bugfix] Prevent benchmark_throughput.py from using duplicated random prompts (#10753) 2024-12-03 02:26:15 +00:00
4c05edb33a [Model] Add TP and BNB quantization support to LlavaMultiModalProjector (#10834)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-12-02 23:06:09 +00:00
9b14d978aa Fix openvino on GPU (#10793) 2024-12-02 18:52:19 +00:00
519cc6ca12 [Misc][XPU] Avoid torch compile for XPU platform (#10747)
Signed-off-by: yan ma <yan.ma@intel.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-12-02 17:53:55 +00:00
b45f0d7946 [Misc][LoRA] Move the implementation of lora bias to punica.py (#10829)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-12-02 17:53:36 +00:00
a4c4daf364 [misc] use out argument for flash attention (#10822)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-02 10:50:10 +00:00
e95f275f57 [CI/Build] Update mistral_common version for tests and docs (#10825)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-02 10:26:10 +00:00
ef31eabc68 [Model]: add some tests for aria model (#10770)
Signed-off-by: xffxff <1247714429@qq.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-12-02 05:36:36 +00:00
995a148575 [doc]Update config docstring (#10732)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2024-12-02 04:14:45 +00:00
63a164172d [misc] remove xverse modeling file (#10814)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-02 03:27:13 +00:00
e25810ae29 Fill TorchSDPAAttentionMetadata seq_lens_field for prefill (#10799)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-12-02 10:05:32 +08:00
073a4bd1c0 [Kernel] Use out arg in flash_attn_varlen_func (#10811)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-12-01 17:55:39 -08:00
b7954776fd [core] Avoid metrics log noise when idle - include speculative decodi… (#10809) 2024-12-02 01:49:48 +00:00
b18c9bbaba [Model] Add BNB support to Llava and Pixtral-HF (#10795)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-12-02 01:31:09 +00:00
0590ec3fd9 [Core] Implement disagg prefill by StatelessProcessGroup (#10502)
This PR provides initial support for single-node disaggregated prefill in a 1P1D (one prefill instance, one decode instance) scenario.
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Co-authored-by: ApostaC <yihua98@uchicago.edu>
Co-authored-by: YaoJiayi <120040070@link.cuhk.edu.cn>
2024-12-01 19:01:00 -06:00
c11f172187 [Misc] Adding MMMU-Pro vision dataset to serving benchmark (#10804)
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-12-01 08:47:05 +00:00
169a0ff911 [doc] add warning about comparing hf and vllm outputs (#10805)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-12-01 00:41:38 -08:00
d2f058e76c [Misc] Rename embedding classes to pooling (#10801)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-01 14:36:51 +08:00
f877a7d12a [Misc] Improve type annotations for support_torch_compile (#10763)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-30 17:48:35 -08:00
133707123e [Model] Replace embedding models with pooling adapter (#10769)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-12-01 08:02:54 +08:00
7e4bbda573 [doc] format fix (#10789)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2024-11-30 11:38:40 +00:00
e7cfc4ef4c [Interleaved ATTN] Support for Mistral-8B (#10591)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-30 07:45:50 +00:00
16ee07f22a [Model] Refactor Molmo weights loading to use AutoWeightsLoader (#10771)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-30 04:19:14 +00:00
40bc242579 [Bugfix] Fix OpenVino/Neuron driver_worker init (#10779)
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-11-30 12:07:13 +08:00
661175bc82 [platform] Add verify_quantization in platform. (#10757)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2024-11-29 15:22:21 +00:00
3132aac043 [Bugfix] Fix Idefics3 bug (#10778)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-29 13:56:46 +00:00
c82b432d4a [Misc] Fix typo in sampling_metadata.py (#10740) 2024-11-29 05:17:57 +00:00
fa6ecb9aa7 [Model] Clean up MiniCPMV (#10751)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-29 04:47:06 +00:00
c83919c7a6 [Model] Add Internlm2 LoRA support (#5064)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-28 17:29:04 +00:00
98f47f2a40 [V1] Optimize the CPU overheads in FlashAttention custom op (#10733)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-28 09:01:02 -08:00
8c1e77fb58 [Kernel] Update vllm-flash-attn version to reduce CPU overheads (#10742)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-28 08:31:28 -08:00
5fc5ce0fe4 [Model] Added GLM-4 series hf format model support vllm==0.6.4 (#10561)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-11-28 14:53:31 +00:00
3ed5e73146 [TPU] Update requirements-tpu (#10726)
Signed-off-by: Richard Liu <ricliu@google.com>
2024-11-28 02:30:48 -08:00
9a8bff0285 [Kernel] Update vllm-flash-attn version (#10736)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-28 02:25:59 -08:00
a79b122400 [V1] Do not allocate beyond the max_model_len (#10730)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-28 00:13:15 -08:00
d9b4b3f069 [Bug][CLI] Allow users to disable prefix caching explicitly (#10724)
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-27 23:59:28 -08:00
278be671a3 [Doc] Update model in arch_overview.rst to match comment (#10701)
Signed-off-by: spacewander <spacewanderlzx@gmail.com>
2024-11-27 23:58:39 -08:00
70dc14fbd0 [Model] support bitsandbytes quantization with minicpm3 model (#10682)
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com>
2024-11-27 23:58:02 -08:00
cb4e1c3f3a [misc] upgrade filelock version (#10731)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-27 19:54:58 -08:00
395b1c7454 [Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server (#10635)
Signed-off-by: Tomer Asida <tomera@ai21.com>
2024-11-27 13:21:10 -08:00
9b4b150395 [Bugfix] Ignore lm_head when loading embedding models (#10719)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-27 19:05:29 +00:00
197b4484a3 [Bugfix][Mamba] Fix Multistep on Mamba-like models (#10705)
Signed-off-by: mzusman <mor.zusmann@gmail.com>
2024-11-27 19:02:27 +00:00
b98c62ba49 [Bugfix] Fix GGUF inference with FP16 unquantized checkpoint (#10675)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-27 10:43:17 -08:00
c411def234 [torch.compile] fix shape specialization (#10722)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-27 10:16:10 -08:00
308cc5e21e [ci] fix slow tests (#10698)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-27 09:26:14 -08:00
9e0a147d50 [V1] Update interface for mistral-format Pixtral (#10703)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-27 12:26:27 +00:00
418cb3b93f [Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault (#10700)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-11-27 11:55:38 +00:00
1209261e93 [Model] Support telechat2 (#10311)
Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: xiangw2 <xiangw2@chinatelecom.cn>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-11-27 11:32:35 +00:00
e2251109c7 [Kernel] Remove if-else with identical branches in marlin 2:4 (#10687)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-11-26 22:55:32 -08:00
15cc2a9f1a [Misc]Further reduce BNB static variable (#10597)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-26 22:54:12 -08:00
e85250b1d1 [Hardware][Gaudi]add get_name method for HPUAttentionBackend (#10667)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2024-11-26 22:49:40 -08:00
cfb3bf25fb [bugfix] fix the default value of llm_int8_threshold in BitsAndBytesConfig (#10657) 2024-11-27 13:55:23 +08:00
1bf905ddaa [Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. (#10198)
Signed-off-by: jeongin601 <0200angela@gmail.com>
Signed-off-by: jeong_in.bae <jeong_in.bae@navercorp.com>
2024-11-27 05:07:30 +00:00
0a4d968500 [V1] Update interface for idefics3 (#10680)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-27 10:04:01 +08:00
0a71900bc9 Remove hard-dependencies of Speculative decode to CUDA workers (#10587)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
2024-11-26 17:57:11 -08:00
2f0a0a17a4 [V1] Refactor model executable interface for multimodal models (#10570)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-26 20:46:11 +00:00
7576cd38df [Bugfix] Check bnb_4bit_quant_storage for bitsandbytes (#10642) 2024-11-26 12:29:00 -08:00
9a99273b48 [Bugfix] Fix using -O[0,3] with LLM entrypoint (#10677)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-26 10:44:01 -08:00
f5792c7c4a [Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson (#9735)
Signed-off-by: Conroy Cheers <conroy@corncheese.org>
2024-11-26 10:26:28 -08:00
db66e018ea [Bugfix] Fix for Spec model TP + Chunked Prefill (#10232)
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com>
Signed-off-by: Sourashis Roy <sroy@roblox.com>
Co-authored-by: Sourashis Roy <sroy@roblox.com>
2024-11-26 09:11:16 -08:00
1f6584ee85 [V1] Enable profile for LLMEngine (#10665) 2024-11-26 10:36:45 +00:00
334d64d1e8 [ci] add vllm_test_utils (#10659)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-26 00:20:04 -08:00
940635343a [Misc] Remove outdated init protocols (#10655)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-26 14:55:00 +08:00
9a88f89799 custom allreduce + torch.compile (#10121)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
2024-11-25 22:00:16 -08:00
519e8e4182 [v1] EngineArgs for better config handling for v1 (#10382)
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-25 21:09:43 -08:00
a6760f6456 [Feature] vLLM ARM Enablement for AARCH64 CPUs (#9228)
Signed-off-by: Sanket Kale <sanketk.kale@fujitsu.com>
Co-authored-by: Sanket Kale <sanketk.kale@fujitsu.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-11-25 18:32:39 -08:00
45ac4ff270 [bugfix] fix aria model and add torch.compile (#10645)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-25 18:32:09 -08:00
6e9ff050c8 [misc] do not read HOST_IP (#10644)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-25 17:04:50 -08:00
9db713a1dc [Model] Add OLMo November 2024 model (#10503) 2024-11-25 17:26:40 -05:00
1b583cfefa [Doc] Fix typos in docs (#10636)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-25 10:15:45 -08:00
cf73f0c95e [Model] Enable optional prefix when loading embedding models (#10639)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-25 18:14:33 +00:00
b1d920531f [Model]: Add support for Aria model (#10514)
Signed-off-by: xffxff <1247714429@qq.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
2024-11-25 18:10:55 +00:00
452a4e80c3 [Docs] Add Snowflake Slides (#10641)
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-11-25 09:34:46 -08:00
c27df94e1f [Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices (#9850)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-11-25 12:23:32 -05:00
d04b13a380 [Bug]: Authorization ignored when root_path is set (#10606)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-25 16:21:41 +00:00
2b0879bfc2 Super tiny little typo fix (#10633) 2024-11-25 13:08:30 +00:00
ed46f14321 [Model] Support is_causal HF config field for Qwen2 model (#10621)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-25 09:51:20 +00:00
05d1f8c9c6 [misc] move functions to config.py (#10624)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-25 09:27:30 +00:00
25d806e953 [misc] add torch.compile compatibility check (#10618)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-24 23:40:08 -08:00
65813781a2 [torch.compile] add warning for unsupported models (#10622)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-24 23:27:51 -08:00
7c2134beda [torch.compile] force inductor threads (#10620)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-24 23:04:21 -08:00
a30a605d21 [Doc] Add encoder-based models to Supported Models page (#10616)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-25 06:34:07 +00:00
571841b7fc [torch.compile] support encoder based models (#10613)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-25 05:24:33 +00:00
7ea3cd7c3e [Refactor][MISC] del redundant code in ParallelConfig.postinit (#10614)
Signed-off-by: MengqingCao <cmq0113@163.com>
2024-11-25 05:14:56 +00:00
214efc2c3c Support Cross encoder models (#10400)
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Flavia Beo <flavia.beo@ibm.com>
Co-authored-by: Flavia Beo <flavia.beo@ibm.com>
2024-11-24 18:56:20 -08:00
49628fe13e [Doc] Update README.md with Ray Summit talk links (#10610) 2024-11-24 16:45:09 -08:00
e4fbb14414 [doc] update the code to add models (#10603)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-11-24 11:21:40 -08:00
c055747867 [model][utils] add extract_layer_index utility function (#10599)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-23 22:22:54 -08:00
eda2b3589c Revert "Print running script to enhance CI log readability" (#10601) 2024-11-23 21:31:47 -08:00
1c445dca51 [CI/Build] Print running script to enhance CI log readability (#10594)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-24 03:57:13 +00:00
1700c543a5 [Bugfix] Fix LoRA weight sharding (#10450)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-11-23 17:23:17 -08:00
17d8fc1806 [bugfix] Fix example/tensorize_vllm_model tests (#10595)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-23 17:22:33 -08:00
04668ebe7a [Bugfix] Avoid import AttentionMetadata explicitly in Mllama (#10593)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-23 18:12:20 +00:00
651f6c31ac For ppc64le, disabled tests for now and addressed space issues (#10538) 2024-11-23 09:33:53 +00:00
86a44fb896 [Platforms] Refactor openvino code (#10573)
Signed-off-by: statelesshz <hzji210@gmail.com>
2024-11-22 22:23:12 -08:00
4cfe5d2bca [Bugfix] multi_modal_kwargs broadcast for CPU tensor parallel (#10541)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-22 21:25:46 -08:00
c8acd80548 [2/N] handling placeholders in merged multi-modal processor (#10485)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-22 21:25:09 -08:00
4634a89d18 Prefix Cache Aware Scheduling [1/n] (#10128)
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-22 21:15:55 -08:00
7c25fe45a6 [AMD] Add support for GGUF quantization on ROCm (#10254) 2024-11-22 21:14:49 -08:00
02a43f82a9 Update default max_num_batch_tokens for chunked prefill to 2048 (#10544) 2024-11-22 21:14:19 -08:00
cfea9c04ef [Model] Fix Baichuan BNB online quantization (#10572)
Signed-off-by: Chen Wu <cntryroa@gmail.com>
2024-11-22 21:13:59 -08:00
7d8ffb344f [Bugfix] Internal Server Error when tool_choice is incorrect. (#10567)
Signed-off-by: Varun Shenoy <varun.vinayak.shenoy@oracle.com>
2024-11-22 21:13:29 -08:00
4aba6e3d1a [core] gemma2 full context length support (#10584)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 20:13:54 -08:00
978b39744b [Misc] Add pynccl wrappers for all_gather and reduce_scatter (#9432) 2024-11-22 22:14:03 -05:00
ebda51968b [Core] Fix broken log configuration (#10458)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-23 10:23:51 +08:00
9195dbdbca [Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use (#10164)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-11-23 10:17:38 +08:00
d559979c54 [bugfix] fix cpu tests (#10585)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 17:34:03 -08:00
d345f409b7 [V1] EngineCore supports profiling (#10564)
Signed-off-by: Abatom <abzhonghua@gmail.com>
2024-11-22 17:16:15 -08:00
28598f3939 [Core] remove temporary local variables in LLMEngine.__init__ (#10577)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-22 16:22:53 -08:00
948c859571 support bitsandbytes quantization with qwen model (#10549)
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com>
2024-11-22 16:16:14 -08:00
97814fbf0f [v1] Refactor KVCacheManager for more hash input than token ids (#10507)
Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-11-22 23:27:25 +00:00
eebad39f26 [torch.compile] support all attention backends (#10558)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 14:04:42 -08:00
db100c5cde [bugfix] fix full graph tests (#10581)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-22 10:02:14 -08:00
11fcf0e066 Remove token-adding chat embedding params (#10551)
Signed-off-by: Noam Gat <noamgat@gmail.com>
2024-11-21 23:59:47 -08:00
b6374e09b0 [Bugfix] Fix Phi-3 BNB quantization with tensor parallel (#9948)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-22 15:01:56 +08:00
a111d0151f [platforms] absorb worker cls difference into platforms folder (#10555)
Signed-off-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
2024-11-21 21:00:32 -08:00
446c7806b2 [Minor] Fix line-too-long (#10563)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-21 19:40:40 -08:00
33e0a2540a [9/N] torch.compile LLM usage (#10552)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-21 19:13:31 -08:00
aed074860a [Benchmark] Add new H100 machine (#10547) 2024-11-21 18:27:20 -08:00
9afa014552 Add small example to metrics.rst (#10550) 2024-11-21 23:43:43 +00:00
46fe9b46d8 [Minor] Revert change in offline inference example (#10545)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-21 21:28:16 +00:00
cf656f5a02 [misc] improve error message (#10553)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-21 13:13:17 -08:00
edec3385b6 [CI][Installation] Avoid uploading CUDA 11.8 wheel (#10535)
Signed-off-by: simon-mo <simon.mo@hey.com>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-11-21 13:03:58 -08:00
f9310cbd0c [V1] Fix Compilation config & Enable CUDA graph by default (#10528)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-21 12:53:39 -08:00
7560ae5caf [8/N] enable cli flag without a space (#10529)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-21 12:30:42 -08:00
e7a8341c7c [Bugfix] Allow token ID-only inputs in Qwen2-Audio (#10536)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-21 18:09:43 +00:00
c51e397fe8 [Misc] Suppress duplicated logging regarding multimodal input pipeline (#10530)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-21 09:21:31 -08:00
2385b60d83 [Kernel] Register punica ops directly (#10522)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-21 09:18:11 -08:00
da7e702c6f [Bug]: When applying continue_final_message for the OpenAI server, "echo": false is ignored (#10180)
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
2024-11-21 16:24:32 +00:00
4d676f0852 [Bugfix] Fix embedding model bug when pooling_type equals ALL with multiple inputs (#10494) 2024-11-21 14:40:02 +00:00
d5ec121f95 [Model] Expose dynamic_image_size as mm_processor_kwargs for InternVL2 models (#10518)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-21 14:20:08 +00:00
8a93a598d9 fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len (#10524)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2024-11-21 11:15:36 +00:00
1cfde82ffd [Model] Add Support for Multimodal Granite Models (#10291)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-11-21 10:46:20 +00:00
f0e0238016 [Doc] fix a small typo in docstring of llama_tool_parser (#10513) 2024-11-21 09:05:23 +00:00
aaddce5d26 [platforms] improve error message for unspecified platforms (#10520)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 23:07:56 -08:00
3430857b64 [Misc] Increase default video fetch timeout (#10495)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-20 23:06:42 -08:00
8b0fe06c89 [torch.compile] Inductor code caching fix (#10273)
Signed-off-by: luka <luka@neuralmagic.com>
Signed-off-by: Luka Govedic <luka.govedic@gmail.com>
2024-11-20 21:44:57 -08:00
9d827170a3 [Platforms] Add device_type in Platform (#10508)
Signed-off-by: MengqingCao <cmq0113@163.com>
2024-11-21 04:44:20 +00:00
6c1208d083 [Core] Add Sliding Window Support with Flashinfer (#10462)
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
2024-11-20 19:56:47 -08:00
388ee3de66 [torch.compile] limit inductor threads and lazy import quant (#10482)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 18:36:33 -08:00
2f77b6cfec [TPU] Implement prefix caching for TPUs (#10307)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-20 13:54:15 -08:00
c68f7ede6a [Bugfix]: allow extra fields in requests to openai compatible server (#10463)
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
2024-11-20 16:42:21 -05:00
0cd3d9717e [7/N] torch.compile, reduce compilation time (#10460)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 11:20:38 -08:00
5f1d6af2b6 [perf bench] H200 development (#9768)
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-11-20 11:06:56 -08:00
772a66732d [platforms] restore xpu check for parallel config (#10479)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-20 17:13:28 +00:00
63f1fde277 [Hardware][CPU] Support chunked-prefill and prefix-caching on CPU (#10355)
Signed-off-by: jiang1.li <jiang1.li@intel.com>
2024-11-20 10:57:39 +00:00
d5b28447e0 [Platforms] Refactor xpu code (#10468)
Signed-off-by: MengqingCao <cmq0113@163.com>
2024-11-19 22:52:13 -08:00
09dbf9ff16 [Bugfix] Handle conflicts between modern and legacy fields (#10471)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-20 14:45:08 +08:00
343041c4c4 [model] Reduce medusa weight (#10454)
Signed-off-by: skylee-01 <497627264@qq.com>
2024-11-20 06:05:55 +00:00
ed701ca963 [ci/build] Combine nightly and optional (#10465) 2024-11-19 21:36:03 -08:00
7629a9c6e5 [CI/Build] Support compilation with local cutlass path (#10423) (#10424) 2024-11-19 21:35:50 -08:00
709c9f1f25 [CI/Build] Add sphinx/rst linter for docs (#10366) 2024-11-19 21:35:31 -08:00
b4be5a8adb [Bugfix] Enforce no chunked prefill for embedding models (#10470)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-20 05:12:51 +00:00
ad44437ba3 [Bugfix] Fix Mamba model initialization and MLP Speculator weights loading (#10456)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-20 05:04:05 +00:00
9e05252b46 [Misc] Add __setitem__ for LazyDict (#10469)
Signed-off-by: Yanyi Liu <wolfsonliu@163.com>
2024-11-20 04:44:57 +00:00
d200972e7f [Bugfix] Marlin 2:4 temp fix for large M dim (>256) (#10464)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-19 19:40:33 -08:00
d5b68aba2f [CI/Build] Update Dockerfile.rocm (#10434)
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
2024-11-19 17:19:59 -08:00
a324d3a1a7 Change granite chat template to keep json list formatting for tool calls (#10452)
Signed-off-by: Max de Bayser <maxdebayser@gmail.com>
2024-11-19 18:16:54 -07:00
b00b33d77e [Model][Quantization] HQQ support through Marlin kernel expansion (#9766)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-19 13:31:12 -08:00
efa9084628 [Core] Avoid metrics log noise when idle (#8868)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-19 21:05:25 +00:00
803f37eaaa [6/N] torch.compile rollout to users (#10437)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-19 10:09:03 -08:00
fd9f124971 [Doc] fix link for page that was renamed (#10455)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-19 09:48:30 -08:00
1ea291a417 Fix: Build error seen on Power Architecture (#10421)
Signed-off-by: Manjul Mohan <manjul.mohan@ibm.com>
Signed-off-by: B-201 <Joy25810@foxmail.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: ismael-dm <ismaeldm99@gmail.com>
Signed-off-by: Andrew Nesbitt <andrewnez@gmail.com>
Signed-off-by: mgoin <michael@neuralmagic.com>
Signed-off-by: yan ma <yan.ma@intel.com>
Signed-off-by: Angus Wang <wangjadehao@gmail.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Manjul Mohan manjul.mohan@ibm.com <manjulmohan@ltcd97-lp2.aus.stglabs.ibm.com>
Co-authored-by: B-201 <Joy25810@foxmail.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: ismael-dm <ismaeldm99@gmail.com>
Co-authored-by: Andrew Nesbitt <andrewnez@gmail.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Yan Ma <yan.ma@intel.com>
Co-authored-by: Angus Wang <wangjadehao@gmail.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Ricky Xu <rickyx@anyscale.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
2024-11-19 09:34:57 -08:00
11fd7ea639 [Pixtral-Large] Pixtral actually has no bias in vision-lang adapter (#10449) 2024-11-19 17:33:06 +00:00
f028dff33d [BugFix] Fix hermes tool parser output error stream arguments in some cases (#10395) (#10398)
Signed-off-by: xiyuan lee <lixiyuan@haier.com>
2024-11-19 13:42:50 +00:00
b4614656b8 [CI][CPU] adding numa node number as container name suffix (#10441)
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
2024-11-19 13:16:43 +00:00
25f9c78961 [misc][plugin] improve plugin loading (#10443)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-19 10:43:21 +00:00
5390d6664f [Doc] Add the start of an arch overview page (#10368) 2024-11-19 09:52:11 +00:00
382b6a4852 [Misc] Avoid misleading warning messages (#10438)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-19 08:54:58 +00:00
272e31c0bd [Bugfix] Guard for negative counter metrics to prevent crash (#10430)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-11-19 04:57:10 +00:00
74f8c2cf5f Add openai.beta.chat.completions.parse example to structured_outputs.rst (#10433) 2024-11-19 04:37:46 +00:00
8c1fb50705 [Platform][Refactor] Extract func get_default_attn_backend to Platform (#10358)
Signed-off-by: Mengqing Cao <cmq0113@163.com>
2024-11-19 11:22:26 +08:00
7eb719df13 [Bugfix]Fix Phi-3 BNB online quantization (#10417)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-19 03:21:42 +00:00
284203f171 [ci/build] Have dependabot ignore all patch update (#10436)
We have too many dependencies and all patch updates can be a little noisy. This is to have dependabot ignore all patch version updates.
2024-11-19 01:04:25 +00:00
90a6c759ca [misc] partial prefix & random input generation benchmark (#9929)
Signed-off-by: rickyx <rickyx@anyscale.com>
2024-11-18 15:39:14 -08:00
2298e69b5f [ci][bugfix] fix kernel tests (#10431)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-18 15:29:37 -08:00
a03ea40792 [3/N][torch.compile] consolidate custom op logging (#10399)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-18 15:14:59 -08:00
96d999fbe8 [Kernel] Initial Machete W4A8 support + Refactors (#9855)
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
2024-11-18 12:59:29 -07:00
c2170a5b39 [Kernel] Explicitly specify other value in tl.load calls (#9014)
Signed-off-by: Angus Wang <wangjadehao@gmail.com>
2024-11-18 11:39:40 -08:00
6b2d25efc7 [Hardware][XPU] AWQ/GPTQ support for xpu backend (#10107)
Signed-off-by: yan ma <yan.ma@intel.com>
2024-11-18 11:18:05 -07:00
281cc4b3cd [Model][Bugfix] Support TP for PixtralHF ViT (#10405)
Signed-off-by: mgoin <michael@neuralmagic.com>
2024-11-18 10:04:14 -08:00
4f686d139f Fix open_collective value in FUNDING.yml (#10426)
Signed-off-by: Andrew Nesbitt <andrewnez@gmail.com>
2024-11-18 09:52:42 -08:00
31894a2155 [Doc] Add documentation for Structured Outputs (#9943)
Signed-off-by: ismael-dm <ismaeldm99@gmail.com>
2024-11-18 09:52:12 -08:00
7851b45196 [5/N][torch.compile] torch.jit.script --> torch.compile (#10406)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-18 23:20:06 +08:00
4186be8111 [Doc] Update doc for LoRA support in GLM-4V (#10425)
Signed-off-by: B-201 <Joy25810@foxmail.com>
2024-11-18 15:08:30 +00:00
e7ebb662d7 [Model] Remove transformers attention porting in VITs (#10414)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-18 21:45:21 +08:00
5be4e52b65 [Model][LoRA]LoRA support added for glm-4v (#10418)
Signed-off-by: B-201 <Joy25810@foxmail.com>
2024-11-18 12:57:10 +00:00
01aae1cc68 [Model] Remove redundant softmax when using PoolingType.STEP (#10415) 2024-11-18 10:05:36 +00:00
c7dec926f6 [VLM] Report multi_modal_placeholders in output (#10407)
Signed-off-by: Linkun Chen <lkchen+anyscale@github.com>
2024-11-18 16:06:16 +08:00
51bb12d17b [4/N][torch.compile] clean up set_torch_compile_backend (#10401)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-17 23:57:20 -08:00
47826cacf0 [Bugfix] Ignore ray reinit error when current platform is ROCm or XPU (#10375)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2024-11-18 11:29:26 +08:00
c4e464333e [Misc] Add uninitialized params tracking for AutoWeightsLoader (#10327)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-18 09:07:46 +08:00
d1557e66d3 [Misc] Enhance offline_inference to support user-configurable paramet… (#10392)
Signed-off-by: wchen61 <wchen61@foxmail.com>
2024-11-17 11:32:40 +00:00
80d85c5d7b [Bugfix] Fix mrope_position_delta in non-last prefill chunk (#10403)
Signed-off-by: imkero <kerorek@outlook.com>
2024-11-17 08:50:24 +00:00
76aab90ab6 [Hardware] [HPU]add mark_step for hpu (#10239)
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
2024-11-17 00:44:44 -08:00
8d74b5aee9 [platforms] refactor cpu code (#10402)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-16 23:14:23 -08:00
cf349c4a97 [Bugfix][CPU] Fix CPU embedding runner with tensor parallel (#10394)
Signed-off-by: Isotr0py <2037008807@qq.com>
2024-11-16 23:12:04 -08:00
905d0f0af4 [CI/Build] Fix IDC hpu [Device not found] issue (#10384)
Signed-off-by: Chendi Xue <chendi.xue@intel.com>
2024-11-17 14:58:22 +08:00
643ecf7b11 [V1] Refactor model executable interface for all text-only language models (#10374)
Signed-off-by: Roger Wang <ywang@roblox.com>
2024-11-17 05:18:46 +00:00
4fd9375028 [2/N][torch.compile] make compilation cfg part of vllm cfg (#10383)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-16 18:02:14 -08:00
661a34fd4f [V1] Add code owners for V1 (#10397)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-11-16 10:45:26 -08:00
361c29e174 [Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled (#10388)
Signed-off-by: imkero <kerorek@outlook.com>
2024-11-17 02:10:00 +08:00
b98d89efd4 [Misc] Medusa supports custom bias (#10361) 2024-11-16 16:33:01 +00:00
8b6725b0cf [Misc] Update benchmark to support image_url file or http (#10287)
Signed-off-by: rbbang <anjaehyun87@gmail.com>
2024-11-16 18:15:40 +08:00
1d75472626 [BugFix] [Kernel] Fix GPU SEGV occuring in fused_moe kernel (#10385)
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
2024-11-16 09:55:05 +00:00
2f427c2d16 [misc][plugin] improve log messages (#10386)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-16 01:23:20 -08:00
755b85359b [doc] add doc for the plugin system (#10372)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-15 21:46:27 -08:00
32e46e000f [Frontend] Automatic detection of chat content format from AST (#9919)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-16 13:35:40 +08:00
4f168f69a3 [Docs] Misc updates to TPU installation instructions (#10165) 2024-11-15 13:26:17 -08:00
3e8d14d8a1 [Doc] Move PR template content to docs (#10159)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-15 13:20:20 -08:00
a067f85e08 [Frontend] Add --version flag to CLI (#10369)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
2024-11-15 13:13:53 -08:00
c76ac49d26 [Docs] Add Nebius as sponsors (#10371)
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-11-15 12:47:40 -08:00
a6221a144a [Misc] bump mistral common version (#10367)
Signed-off-by: simon-mo <simon.mo@hey.com>
2024-11-15 09:48:07 -08:00
79ee45b428 [Misc] Bump up test_fused_moe tolerance (#10364)
Signed-off-by: ElizaWszola <eliza@neuralmagic.com>
2024-11-15 16:31:18 +00:00
691a3ec047 [Bugfix] Ensure special tokens are properly filtered out for guided structured output with MistralTokenizer (#10363)
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com>
2024-11-15 14:50:40 +00:00
3a763ba0c3 [core][misc] keep compatibility for old-style classes (#10356)
Signed-off-by: youkaichao <youkaichao@gmail.com>
2024-11-15 13:55:51 +00:00
f2056f726d [Misc] Fix some help info of arg_utils to improve readability (#10362) 2024-11-15 12:40:30 +00:00
1d65ec7eeb [Bugfix] Fix fully sharded LoRA bug (#10352)
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
2024-11-15 10:34:58 +00:00
26908554b2 [Doc] Remove float32 choice from --lora-dtype (#10348)
Signed-off-by: Xin Yang <xyang19@gmail.com>
2024-11-15 10:22:57 +00:00
b311efd0bd [Misc] Fix import error in tensorizer tests and cleanup some code (#10349)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-11-15 09:34:17 +00:00
3d158cdc8d Add default value to avoid Falcon crash (#5363) (#10347)
Signed-off-by: wchen61 <wchen61@foxmail.com>
2024-11-15 08:52:20 +00:00
2104 changed files with 247463 additions and 61066 deletions

View File

@ -1,9 +1,14 @@
# SPDX-License-Identifier: Apache-2.0
import os
import sys
import zipfile
# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250))
# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 400 MiB
# Note that we have 400 MiB quota, please use it wisely.
# See https://github.com/pypi/support/issues/3792 .
# Please also sync the value with the one in Dockerfile.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 400))
def print_top_10_largest_files(zip_file):
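For context, a minimal sketch of how a wheel-size gate like this can work, assuming the 400 MB default shown above; the helper name and messages are illustrative, not the repository's actual implementation:

```python
import os
import sys
import zipfile

# Assumed default, mirroring the quota mentioned in the comment above.
MAX_SIZE_MB = int(os.environ.get("VLLM_MAX_SIZE_MB", 400))


def check_wheel_size(wheel_path: str) -> int:
    """Return 0 if the wheel is under the limit, 1 otherwise (illustrative)."""
    size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
    if size_mb > MAX_SIZE_MB:
        print(f"{wheel_path} is {size_mb:.1f} MB, over the {MAX_SIZE_MB} MB limit")
        # List the largest members to help diagnose what grew.
        with zipfile.ZipFile(wheel_path) as zf:
            top = sorted(zf.infolist(), key=lambda i: i.file_size, reverse=True)[:10]
            for info in top:
                print(f"  {info.file_size / (1024 * 1024):8.1f} MB  {info.filename}")
        return 1
    print(f"{wheel_path} is {size_mb:.1f} MB, within the {MAX_SIZE_MB} MB limit")
    return 0


if __name__ == "__main__":
    sys.exit(check_wheel_size(sys.argv[1]))
```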

View File

@ -0,0 +1,26 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import os
template = """<!DOCTYPE html>
<html>
<body>
<h1>Links for vLLM</h1/>
<a href="../{wheel_html_escaped}">{wheel}</a><br/>
</body>
</html>
"""
parser = argparse.ArgumentParser()
parser.add_argument("--wheel", help="The wheel path.", required=True)
args = parser.parse_args()
filename = os.path.basename(args.wheel)
with open("index.html", "w") as f:
print(f"Generated index.html for {args.wheel}")
# cloudfront requires escaping the '+' character
f.write(
template.format(wheel=filename,
wheel_html_escaped=filename.replace("+", "%2B")))
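A quick illustration of the `+` escaping above; the wheel filename is hypothetical:

```python
# A wheel built with a local version tag contains '+', which must be escaped
# as '%2B' so that the CloudFront-served link stays resolvable.
filename = "vllm-0.6.4+cu124-cp38-abi3-manylinux1_x86_64.whl"  # hypothetical name
print(filename.replace("+", "%2B"))
# -> vllm-0.6.4%2Bcu124-cp38-abi3-manylinux1_x86_64.whl
```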

View File

@ -4,8 +4,8 @@ tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.233
value: 0.231
- name: "exact_match,flexible-extract"
value: 0.236
value: 0.22
limit: 1000
num_fewshot: 5

View File

@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM -b "auto" -t 2
model_name: "nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.6353
- name: "exact_match,flexible-extract"
value: 0.637
limit: null
num_fewshot: null

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
LM eval harness on model to compare vs HF baseline computed offline.
Configs are found in configs/$MODEL.yaml
@ -12,6 +13,7 @@ from pathlib import Path
import lm_eval
import numpy
import pytest
import yaml
RTOL = 0.05
@ -45,6 +47,10 @@ def test_lm_eval_correctness():
    eval_config = yaml.safe_load(
        Path(TEST_DATA_FILE).read_text(encoding="utf-8"))
    if eval_config[
            "model_name"] == "nm-testing/Meta-Llama-3-70B-Instruct-FBGEMM-nonuniform":  #noqa: E501
        pytest.skip("FBGEMM is currently failing on main.")
    # Launch eval requests.
    results = launch_lm_eval(eval_config)
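A rough sketch of the comparison this test performs, assuming configs shaped like the YAML files above; the metric lookup and the `launch_lm_eval` helper are simplified stand-ins for code elided in the diff:

```python
import numpy

RTOL = 0.05  # relative tolerance used against the recorded baseline


def check_against_baseline(results: dict, eval_config: dict) -> None:
    """Compare measured lm-eval metrics with the ground-truth values in the config."""
    for task in eval_config["tasks"]:
        measured = results["results"][task["name"]]
        for metric in task["metrics"]:
            expected = metric["value"]
            got = measured[metric["name"]]
            print(f'{task["name"]} {metric["name"]}: expected={expected}, got={got}')
            assert numpy.isclose(got, expected, rtol=RTOL)
```

In the real test, `eval_config` comes from the `yaml.safe_load(...)` call shown above.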

View File

@ -1,15 +1,13 @@
# vLLM benchmark suite
## Introduction
This directory contains two sets of benchmarks for vLLM.
- Performance benchmark: benchmarks vLLM's performance under various workloads, so that **developers** can see whether their PR improves or degrades vLLM's performance.
- Nightly benchmark: compares vLLM's performance against alternatives (TGI, TRT-LLM and lmdeploy), so that **the public** knows when to choose vLLM.
See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
## Performance benchmark quick overview
@ -19,17 +17,14 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performan
**For benchmarking developers**: please try your best to constrain the duration of benchmarking to about 1 hr so that it won't take forever to run.
## Nightly benchmark quick overview
**Benchmarking Coverage**: Fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.
**Benchmarking engines**: vllm, TGI, trt-llm and lmdeploy.
**Benchmarking Duration**: about 3.5hrs.
## Trigger the benchmark
Performance benchmark will be triggered when:
@ -39,16 +34,11 @@ Performance benchmark will be triggered when:
Nightly benchmark will be triggered when:
- Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.
## Performance benchmark details
See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
#### Latency test
### Latency test
Here is an example of one test inside `latency-tests.json`:
@ -68,23 +58,25 @@ Here is an example of one test inside `latency-tests.json`:
```
In this example:
- The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- The `parameters` attribute controls the command line arguments used for `benchmark_latency.py`. Note that you should use an underscore `_` instead of a dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underscores to dashes when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15` (a sketch of this conversion appears after the warning below).
Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.
WARNING: The benchmarking script will save json results by itself, so please do not configure `--output-json` parameter in the json file.
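A minimal sketch of the underscore-to-dash conversion described above; the helper is illustrative, the real conversion lives in `run-performance-benchmarks.sh`:

```python
def params_to_cli(parameters: dict) -> str:
    """Turn a test's "parameters" dict into benchmark_latency.py arguments."""
    parts = []
    for key, value in parameters.items():
        flag = "--" + key.replace("_", "-")  # underscore in json -> dash on the CLI
        parts.append(flag if value == "" else f"{flag} {value}")
    return " ".join(parts)


example = {
    "model": "meta-llama/Meta-Llama-3-8B",
    "tensor_parallel_size": 1,
    "load_format": "dummy",
    "num_iters_warmup": 5,
    "num_iters": 15,
}
print("python3 benchmark_latency.py " + params_to_cli(example))
# -> python3 benchmark_latency.py --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 ...
```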
### Throughput test
#### Throughput test
The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters will be forwarded to `benchmark_throughput.py`.
The number of this test is also stable -- a slight change on the value of this number might vary the performance numbers by a lot.
#### Serving test
### Serving test
We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
```
```json
[
{
"test_name": "serving_llama8B_tp1_sharegpt",
@ -109,6 +101,7 @@ We test the throughput by using `benchmark_serving.py` with request rate = inf t
```
Inside this example:
- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
- The `server-parameters` include the command line arguments for the vLLM server.
- The `client-parameters` include the command line arguments for `benchmark_serving.py` (the overall shape of an entry is sketched below).
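For reference, the overall shape of one `serving-tests.json` entry, written here as a Python literal; the field values are hypothetical since the full example is elided in the diff above:

```python
import json

serving_test_case = {
    "test_name": "serving_llama8B_tp1_sharegpt",  # must start with "serving_"
    "qps_list": [1, 4, 16, "inf"],                # request rates to sweep (hypothetical)
    "server_parameters": {                        # arguments for the vLLM server
        "model": "meta-llama/Meta-Llama-3-8B",
        "tensor_parallel_size": 1,
        "swap_space": 16,
    },
    "client_parameters": {                        # arguments for benchmark_serving.py
        "model": "meta-llama/Meta-Llama-3-8B",
        "backend": "vllm",
        "dataset_name": "sharegpt",
        "num_prompts": 200,
    },
}
print(json.dumps([serving_test_case], indent=2))
```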
@ -118,36 +111,33 @@ The number of this test is less stable compared to the delay and latency benchma
WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.
#### Visualizing the results
### Visualizing the results
The `convert-results-json-to-markdown.py` script puts the benchmarking results into a markdown table by formatting [descriptions.md](tests/descriptions.md) with the real benchmarking results.
You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
If you do not see the table, please wait until the benchmark finishes running.
The json version of the table (together with the json version of the benchmark) will also be attached to the markdown file.
The raw benchmarking results (as json files) are in the `Artifacts` tab of the benchmarking job.
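A condensed sketch of that table-generation step, assuming `pandas` and `tabulate` (which the pipeline installs); the results folder and the json field names are assumptions for illustration:

```python
import json
from pathlib import Path

import pandas as pd
from tabulate import tabulate

rows = []
for result_file in Path("results").glob("serving_*.json"):  # hypothetical location
    raw = json.loads(result_file.read_text())
    rows.append({
        "Test name": result_file.stem,
        "Tput (req/s)": raw.get("request_throughput"),  # assumed field names
        "Mean TTFT (ms)": raw.get("mean_ttft_ms"),
    })

if rows:
    df = pd.DataFrame(rows).sort_values(by="Test name")
    # Render a markdown table that can be pasted into the descriptions file.
    print(tabulate(df, headers="keys", tablefmt="pipe", showindex=False))
```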
## Nightly test details
See [nightly-descriptions.md](nightly-descriptions.md) for a detailed description of the test workload, models and docker containers used when benchmarking other LLM engines.
### Workflow
#### Workflow
- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
- Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which will probe the serving engine of the current container.
- The `run-nightly-suite.sh` will redirect the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
- Finally, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and upload the results to buildkite.
#### Nightly tests
### Nightly tests
In [nightly-tests.json](tests/nightly-tests.json), we include the command line arguments for benchmarking commands, together with the benchmarking test cases. The format is highly similar to performance benchmark.
#### Docker containers
### Docker containers
The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.
WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.
WARNING: updating `trt-llm` to the latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).

View File

@ -1,5 +1,6 @@
steps:
- label: "Wait for container to be ready"
key: wait-for-container-image
agents:
queue: A100
plugins:
@ -9,10 +10,102 @@ steps:
- image: badouralix/curl-jq
command:
- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
- wait
- label: "Cleanup H100"
agents:
queue: H100
depends_on: ~
command: docker system prune -a --volumes --force
- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
depends_on: wait-for-container-image
if: build.branch == "main"
plugins:
- kubernetes:
podSpec:
priorityClassName: perf-benchmark
containers:
- image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
resources:
limits:
nvidia.com/gpu: 8
volumeMounts:
- name: devshm
mountPath: /dev/shm
env:
- name: VLLM_USAGE_SOURCE
value: ci-test
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
nodeSelector:
nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
volumes:
- name: devshm
emptyDir:
medium: Memory
- label: "H200"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
depends_on: wait-for-container-image
if: build.branch == "main"
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: 4,5,6,7
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
#- block: "Run H100 Benchmark"
#key: block-h100
#depends_on: ~
- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
depends_on: wait-for-container-image
if: build.branch == "main"
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: all # see CUDA_VISIBLE_DEVICES for actual GPUs used
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
# Premerge benchmark
- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
depends_on: wait-for-container-image
if: build.branch != "main"
plugins:
- kubernetes:
podSpec:
@ -41,20 +134,51 @@ steps:
- name: devshm
emptyDir:
medium: Memory
# - label: "H100"
# agents:
# queue: H100
# plugins:
# - docker#v5.11.0:
# image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
# command:
# - bash
# - .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
# mount-buildkite-agent: true
# propagate-environment: true
# ipc: host
# gpus: all
# environment:
# - VLLM_USAGE_SOURCE
# - HF_TOKEN
- label: "H200"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
depends_on: wait-for-container-image
if: build.branch != "main"
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: 4,5,6,7
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
#- block: "Run H100 Benchmark"
#key: block-h100
#depends_on: ~
- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
depends_on: wait-for-container-image
if: build.branch != "main"
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: all # see CUDA_VISIBLE_DEVICES for actual GPUs used
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN

View File

@ -9,20 +9,19 @@ This file contains the downloading link for benchmarking results.
Please download the visualization scripts in the post
## Results reproduction
- Find the docker we use in `benchmarking pipeline`
- Deploy the docker, and inside the docker:
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code:
```console
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```
And the results will be inside `./benchmarks/results`.

View File

@ -2,6 +2,7 @@
# Nightly benchmark
This benchmark aims to:
- Provide performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy or SGLang) leads in performance on which workload.
- Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions.
@ -9,7 +10,6 @@ Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html)
Latest reproduction guide: [github issue link](https://github.com/vllm-project/vllm/issues/8176)
## Setup
- Docker images:
@ -33,7 +33,7 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/
- Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
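To make the two latency metrics concrete, a small sketch of how they can be derived from per-token arrival timestamps; the timestamps are made up for illustration:

```python
# Arrival time (seconds) of each output token for one request,
# measured from the moment the request was sent.
token_times = [0.31, 0.35, 0.40, 0.44, 0.49]  # hypothetical measurements

ttft = token_times[0]                                        # time to first token
itl = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token latencies

print(f"TTFT = {ttft * 1000:.0f} ms")
print(f"mean ITL = {1000 * sum(itl) / len(itl):.0f} ms")
```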
# Known issues
## Known issues
- TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
- TGI does not support `ignore-eos` flag.

View File

@ -7,10 +7,8 @@
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: end-to-end latency (mean, median, p99).
{latency_tests_markdown_table}
## Throughput tests
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
@ -19,10 +17,8 @@
- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
- Evaluation metrics: throughput.
{throughput_tests_markdown_table}
## Serving tests
- Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
@ -33,13 +29,11 @@
- We also added a speculative decoding test for llama-3 70B, under QPS 2
- Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
{serving_tests_markdown_table}
## json version of the benchmarking tables
This section contains the data of the markdown tables above in JSON format.
You can load the benchmarking tables into pandas dataframes as follows:
```python
@ -54,9 +48,9 @@ serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
```
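A sketch of what that loading step looks like once the placeholder is substituted; the empty JSON string below stands in for the real data:

```python
import json

import pandas as pd

# In the rendered report, this placeholder is replaced by the actual JSON string.
benchmarking_results_json = '{"latency": {}, "throughput": {}, "serving": {}}'
benchmarking_results = json.loads(benchmarking_results_json)

latency_results = pd.DataFrame.from_dict(benchmarking_results["latency"])
throughput_results = pd.DataFrame.from_dict(benchmarking_results["throughput"])
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
print(serving_results.head())
```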
The json string for all benchmarking tables:
```json
{benchmarking_results_in_json_string}
```
You can also check the raw experiment data in the Artifact tab of the Buildkite page.

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import json
import os
from pathlib import Path
@ -82,8 +84,13 @@ if __name__ == "__main__":
# this result is generated via `benchmark_serving.py`
# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands")) as f:
    command = json.loads(f.read())
try:
    with open(test_file.with_suffix(".commands")) as f:
        command = json.loads(f.read())
except OSError as e:
    print(e)
    continue
raw_result.update(command)
# update the test name of this result
@ -97,8 +104,13 @@ if __name__ == "__main__":
# this result is generated via `benchmark_latency.py`
# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands")) as f:
    command = json.loads(f.read())
try:
    with open(test_file.with_suffix(".commands")) as f:
        command = json.loads(f.read())
except OSError as e:
    print(e)
    continue
raw_result.update(command)
# update the test name of this result
@ -119,8 +131,13 @@ if __name__ == "__main__":
# this result is generated via `benchmark_throughput.py`
# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands")) as f:
    command = json.loads(f.read())
try:
    with open(test_file.with_suffix(".commands")) as f:
        command = json.loads(f.read())
except OSError as e:
    print(e)
    continue
raw_result.update(command)
# update the test name of this result
@ -157,6 +174,18 @@ if __name__ == "__main__":
throughput_results,
serving_results)
for df in [latency_results, serving_results, throughput_results]:
    if df.empty:
        continue
    # Sort all dataframes by their respective "Test name" columns
    df.sort_values(by="Test name", inplace=True)
    # The GPUs sometimes come in format of "GPUTYPE\nGPUTYPE\n...",
    # we want to turn it into "8xGPUTYPE"
    df["GPU"] = df["GPU"].apply(
        lambda x: f"{len(x.split('\n'))}x{x.split('\n')[0]}")
# get markdown tables
latency_md_table = tabulate(latency_results,
headers='keys',

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
from transformers import AutoTokenizer

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import json
from pathlib import Path

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
from lmdeploy.serve.openai.api_client import APIClient
api_client = APIClient("http://localhost:8000")

View File

@ -43,7 +43,7 @@ main() {
# The figures should be genereated by a separate process outside the CI/CD pipeline
# The figures should be generated by a separate process outside the CI/CD pipeline
# # generate figures
# python3 -m pip install tabulate pandas matplotlib

View File

@ -301,6 +301,104 @@ run_serving_tests() {
kill_gpu_processes
}
run_genai_perf_tests() {
# run genai-perf tests
# $1: a json file specifying genai-perf test cases
local genai_perf_test_file
genai_perf_test_file=$1
# Iterate over genai-perf tests
jq -c '.[]' "$genai_perf_test_file" | while read -r params; do
# get the test name, and append the GPU type back to it.
test_name=$(echo "$params" | jq -r '.test_name')
# if TEST_SELECTOR is set, only run the test cases that match the selector
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
echo "Skip test case $test_name."
continue
fi
# prepend the current serving engine to the test name
test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}
# get common parameters
common_params=$(echo "$params" | jq -r '.common_parameters')
model=$(echo "$common_params" | jq -r '.model')
tp=$(echo "$common_params" | jq -r '.tp')
dataset_name=$(echo "$common_params" | jq -r '.dataset_name')
dataset_path=$(echo "$common_params" | jq -r '.dataset_path')
port=$(echo "$common_params" | jq -r '.port')
num_prompts=$(echo "$common_params" | jq -r '.num_prompts')
reuse_server=$(echo "$common_params" | jq -r '.reuse_server')
# get client and server arguments
server_params=$(echo "$params" | jq -r ".${CURRENT_LLM_SERVING_ENGINE}_server_parameters")
qps_list=$(echo "$params" | jq -r '.qps_list')
qps_list=$(echo "$qps_list" | jq -r '.[] | @sh')
echo "Running over qps list $qps_list"
# check if there is enough GPU to run the test
if [[ $gpu_count -lt $tp ]]; then
echo "Required num-shard $tp but only $gpu_count GPU found. Skip testcase $test_name."
continue
fi
if [[ $reuse_server == "true" ]]; then
echo "Reuse previous server for test case $test_name"
else
kill_gpu_processes
bash "$VLLM_SOURCE_CODE_LOC/.buildkite/nightly-benchmarks/scripts/launch-server.sh" \
"$server_params" "$common_params"
fi
if wait_for_server; then
echo ""
echo "$CURRENT_LLM_SERVING_ENGINE server is up and running."
else
echo ""
echo "$CURRENT_LLM_SERVING_ENGINE failed to start within the timeout period."
break
fi
# iterate over different QPS
for qps in $qps_list; do
# remove the surrounding single quote from qps
if [[ "$qps" == *"inf"* ]]; then
echo "qps was $qps"
qps=$num_prompts
echo "now qps is $qps"
fi
new_test_name=$test_name"_qps_"$qps
backend=$CURRENT_LLM_SERVING_ENGINE
if [[ "$backend" == *"vllm"* ]]; then
backend="vllm"
fi
#TODO: add output dir.
client_command="genai-perf profile \
-m $model \
--service-kind openai \
--backend vllm \
--endpoint-type chat \
--streaming \
--url localhost:$port \
--request-rate $qps \
--num-prompts $num_prompts \
"
echo "Client command: $client_command"
eval "$client_command"
#TODO: process/record outputs
done
done
kill_gpu_processes
}
prepare_dataset() {
@ -328,12 +426,17 @@ main() {
pip install -U transformers
pip install -r requirements/dev.txt
which genai-perf
# check storage
df -h
ensure_installed wget
ensure_installed curl
ensure_installed jq
# genai-perf dependency
ensure_installed libb64-0d
prepare_dataset
@ -345,6 +448,10 @@ main() {
# run the test
run_serving_tests "$BENCHMARK_ROOT/tests/nightly-tests.json"
# run genai-perf tests
run_genai_perf_tests "$BENCHMARK_ROOT/tests/genai-perf-tests.json"
mv artifacts/ $RESULTS_FOLDER/
# upload benchmark results to buildkite
python3 -m pip install tabulate pandas
python3 "$BENCHMARK_ROOT/scripts/summary-nightly-results.py"

View File

@ -6,6 +6,7 @@
# Do not set -e, as the mixtral 8x22B model tends to crash occasionally
# and we still want to see other benchmarking results even when mixtral crashes.
set -x
set -o pipefail
check_gpus() {
@ -85,11 +86,7 @@ kill_gpu_processes() {
ps -aux
lsof -t -i:8000 | xargs -r kill -9
pkill -f pt_main_thread
# this line doesn't work now
# ps aux | grep python | grep openai | awk '{print $2}' | xargs -r kill -9
pkill -f python3
pkill -f /usr/bin/python3
pgrep python3 | xargs -r kill -9
# wait until GPU memory usage smaller than 1GB
@ -289,7 +286,7 @@ run_serving_tests() {
# run the server
echo "Running test case $test_name"
echo "Server command: $server_command"
eval "$server_command" &
bash -c "$server_command" &
server_pid=$!
# wait until the server is alive
@ -312,17 +309,20 @@ run_serving_tests() {
new_test_name=$test_name"_qps_"$qps
# pass the tensor parallel size to the client so that it can be displayed
# on the benchmark dashboard
client_command="python3 benchmark_serving.py \
--save-result \
--result-dir $RESULTS_FOLDER \
--result-filename ${new_test_name}.json \
--request-rate $qps \
--metadata "tensor_parallel_size=$tp" \
$client_args"
echo "Running test case $test_name with qps $qps"
echo "Client command: $client_command"
eval "$client_command"
bash -c "$client_command"
# record the benchmarking commands
jq_output=$(jq -n \
@ -348,6 +348,11 @@ main() {
check_gpus
check_hf_token
# Set to v1 to run v1 benchmark
if [[ "${ENGINE_VERSION:-v0}" == "v1" ]]; then
export VLLM_USE_V1=1
fi
# dependencies
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
(which jq) || (apt-get update && apt-get -y install jq)
@ -356,7 +361,7 @@ main() {
# get the current IP address, required by benchmark_serving.py
export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
# turn of the reporting of the status of each request, to clean up the terminal output
export VLLM_LOG_LEVEL="WARNING"
export VLLM_LOGGING_LEVEL="WARNING"
# prepare for benchmarking
cd benchmarks || exit 1

View File

@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import datetime
import json
import os

View File

@ -1,6 +1,10 @@
#!/bin/sh
TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-test-repo:pull" | jq -r .token)
URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-test-repo/manifests/$BUILDKITE_COMMIT"
TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-postmerge-repo:pull" | jq -r .token)
if [[ "$BUILDKITE_BRANCH" == "main" ]]; then
URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-postmerge-repo/manifests/$BUILDKITE_COMMIT"
else
URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-test-repo/manifests/$BUILDKITE_COMMIT"
fi
TIMEOUT_SECONDS=10
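A rough Python equivalent of what this wait loop does, assuming the `requests` library; the retry cadence, overall deadline and use of a HEAD request are illustrative:

```python
import os
import sys
import time

import requests

commit = os.environ.get("BUILDKITE_COMMIT", "latest")  # hypothetical fallback
repo = ("vllm-ci-postmerge-repo"
        if os.environ.get("BUILDKITE_BRANCH") == "main"
        else "vllm-ci-test-repo")

# Anonymous pull token for the public ECR registry, as in the curl call above.
token = requests.get(
    "https://public.ecr.aws/token",
    params={"service": "public.ecr.aws", "scope": f"repository:q9t5s3a7/{repo}:pull"},
).json()["token"]

url = f"https://public.ecr.aws/v2/q9t5s3a7/{repo}/manifests/{commit}"
headers = {
    "Authorization": f"Bearer {token}",
    "Accept": "application/vnd.docker.distribution.manifest.v2+json",
}

deadline = time.time() + 10 * 60  # illustrative overall deadline
while time.time() < deadline:
    if requests.head(url, headers=headers).status_code == 200:
        print("Image manifest found; the container image is ready.")
        sys.exit(0)
    time.sleep(10)  # assumed to match TIMEOUT_SECONDS above
print("Timed out waiting for the image.")
sys.exit(1)
```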

View File

@ -0,0 +1,23 @@
[
{
"test_name": "llama8B_tp1_genai_perf",
"qps_list": [4,8,16,32],
"common_parameters": {
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"tp": 1,
"port": 8000,
"num_prompts": 500,
"reuse_server": false
},
"vllm_server_parameters": {
"disable_log_stats": "",
"disable_log_requests": "",
"gpu_memory_utilization": 0.9,
"num_scheduler_steps": 10,
"max_num_seqs": 512,
"dtype": "bfloat16"
},
"genai_perf_input_parameters": {
}
}
]
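For illustration, the same iteration that `run_genai_perf_tests` performs with `jq`, sketched in Python; the file path and selector handling are simplified assumptions:

```python
import json
import os
import re

# Assumed location relative to the repository root.
with open(".buildkite/nightly-benchmarks/tests/genai-perf-tests.json") as f:
    test_cases = json.load(f)

selector = os.environ.get("TEST_SELECTOR", "")
for params in test_cases:
    test_name = params["test_name"]
    # Skip test cases that do not match the selector, as the shell loop does.
    if selector and not re.search(selector, test_name):
        print(f"Skip test case {test_name}.")
        continue
    common = params["common_parameters"]
    print(test_name, common["model"], "tp =", common["tp"],
          "qps list:", params["qps_list"])
```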

View File

@ -29,4 +29,4 @@
"num-iters": 15
}
}
]

View File

@ -63,11 +63,12 @@
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"disable_log_requests": "",
"tensor_parallel_size": 4,
"swap_space": 16,
"speculative_model": "turboderp/Qwama-0.5B-Instruct",
"num_speculative_tokens": 4,
"speculative_draft_tensor_parallel_size": 1,
"use_v2_block_manager": ""
"swap_space": 16,
"speculative_config": {
"model": "turboderp/Qwama-0.5B-Instruct",
"num_speculative_tokens": 4,
"draft_tensor_parallel_size": 1
}
},
"client_parameters": {
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",

View File

@ -32,4 +32,4 @@
"backend": "vllm"
}
}
]

View File

@ -1,9 +1,20 @@
steps:
- label: "Build wheel - CUDA 12.4"
agents:
queue: cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.4.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/upload-wheels.sh"
env:
DOCKER_BUILDKIT: "1"
- label: "Build wheel - CUDA 12.1"
agents:
queue: cpu_queue
queue: cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/upload-wheels.sh"
@ -18,11 +29,60 @@ steps:
- label: "Build wheel - CUDA 11.8"
# depends_on: block-build-cu118-wheel
agents:
queue: cpu_queue
queue: cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/upload-wheels.sh"
env:
DOCKER_BUILDKIT: "1"
- block: "Build release image"
depends_on: ~
key: block-release-image-build
- label: "Build release image"
depends_on: block-release-image-build
agents:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.4.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain -f docker/Dockerfile ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
- label: "Build and publish TPU release image"
depends_on: ~
if: build.env("NIGHTLY") == "1"
agents:
queue: tpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --tag vllm/vllm-tpu:nightly --tag vllm/vllm-tpu:$BUILDKITE_COMMIT --progress plain -f docker/Dockerfile.tpu ."
- "docker push vllm/vllm-tpu:nightly"
- "docker push vllm/vllm-tpu:$BUILDKITE_COMMIT"
plugins:
- docker-login#v3.0.0:
username: vllm
password-env: DOCKERHUB_TOKEN
env:
DOCKER_BUILDKIT: "1"
- input: "Provide Release version here"
fields:
- text: "What is the release version?"
key: "release-version"
- block: "Build CPU release image"
key: block-cpu-release-image-build
depends_on: ~
- label: "Build and publish CPU release image"
depends_on: block-cpu-release-image-build
agents:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg GIT_REPO_CHECK=1 --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version) --tag public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:latest --progress plain --target vllm-openai -f docker/Dockerfile.cpu ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-cpu-release-repo:$(buildkite-agent meta-data get release-version)"
env:
DOCKER_BUILDKIT: "1"

View File

@ -77,7 +77,6 @@ echo "Commands:$commands"
#ignore certain kernels tests
if [[ $commands == *" kernels "* ]]; then
commands="${commands} \
--ignore=kernels/test_attention.py \
--ignore=kernels/test_attention_selector.py \
--ignore=kernels/test_blocksparse_attention.py \
--ignore=kernels/test_causal_conv1d.py \
@ -85,7 +84,6 @@ if [[ $commands == *" kernels "* ]]; then
--ignore=kernels/test_encoder_decoder_attn.py \
--ignore=kernels/test_flash_attn.py \
--ignore=kernels/test_flashinfer.py \
--ignore=kernels/test_gguf.py \
--ignore=kernels/test_int8_quant.py \
--ignore=kernels/test_machete_gemm.py \
--ignore=kernels/test_mamba_ssm.py \
@ -93,19 +91,40 @@ if [[ $commands == *" kernels "* ]]; then
--ignore=kernels/test_moe.py \
--ignore=kernels/test_prefix_prefill.py \
--ignore=kernels/test_rand.py \
--ignore=kernels/test_sampler.py"
--ignore=kernels/test_sampler.py \
--ignore=kernels/test_cascade_flash_attn.py \
--ignore=kernels/test_mamba_mixer2.py \
--ignore=kernels/test_aqlm.py \
--ignore=kernels/test_machete_mm.py \
--ignore=kernels/test_mha_attn.py \
--ignore=kernels/test_block_fp8.py \
--ignore=kernels/test_permute_cols.py"
fi
#ignore certain Entrypoints tests
#ignore certain Entrypoints/openai tests
if [[ $commands == *" entrypoints/openai "* ]]; then
commands=${commands//" entrypoints/openai "/" entrypoints/openai \
--ignore=entrypoints/openai/test_accuracy.py \
--ignore=entrypoints/openai/test_audio.py \
--ignore=entrypoints/openai/test_encoder_decoder.py \
--ignore=entrypoints/openai/test_embedding.py \
--ignore=entrypoints/openai/test_oot_registration.py "}
--ignore=entrypoints/openai/test_chat.py \
--ignore=entrypoints/openai/test_shutdown.py \
--ignore=entrypoints/openai/test_completion.py \
--ignore=entrypoints/openai/test_sleep.py \
--ignore=entrypoints/openai/test_models.py \
--ignore=entrypoints/openai/test_prompt_validation.py "}
fi
#ignore certain Entrypoints/llm tests
if [[ $commands == *" && pytest -v -s entrypoints/llm/test_guided_generate.py"* ]]; then
commands=${commands//" && pytest -v -s entrypoints/llm/test_guided_generate.py"/" "}
fi
# --ignore=entrypoints/openai/test_encoder_decoder.py \
# --ignore=entrypoints/openai/test_embedding.py \
# --ignore=entrypoints/openai/test_oot_registration.py
# --ignore=entrypoints/openai/test_accuracy.py \
# --ignore=entrypoints/openai/test_models.py <= Fails on MI250 but passes on MI300 as of 2025-03-13
PARALLEL_JOB_COUNT=8
# check if the command contains shard flag, we will run all shards in parallel because the host have 8 GPUs.
if [[ $commands == *"--shard-id="* ]]; then
@ -115,13 +134,16 @@ if [[ $commands == *"--shard-id="* ]]; then
# assign shard-id for each shard
commands_gpu=${commands//"--shard-id= "/"--shard-id=${GPU} "}
echo "Shard ${GPU} commands:$commands_gpu"
echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
docker run \
--device /dev/kfd --device /dev/dri \
--network host \
--device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
--network=host \
--shm-size=16gb \
--rm \
-e HIP_VISIBLE_DEVICES="${GPU}" \
-e HF_TOKEN \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
-v "${HF_CACHE}:${HF_MOUNT}" \
-e "HF_HOME=${HF_MOUNT}" \
--name "${container_name}_${GPU}" \
@ -142,13 +164,16 @@ if [[ $commands == *"--shard-id="* ]]; then
fi
done
else
echo "Render devices: $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES"
docker run \
--device /dev/kfd --device /dev/dri \
--network host \
--device /dev/kfd $BUILDKITE_AGENT_META_DATA_RENDER_DEVICES \
--network=host \
--shm-size=16gb \
--rm \
-e HIP_VISIBLE_DEVICES=0 \
-e HF_TOKEN \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
-v "${HF_CACHE}:${HF_MOUNT}" \
-e "HF_HOME=${HF_MOUNT}" \
--name "${container_name}" \

View File

@ -4,49 +4,11 @@
# It serves a sanity check for compilation and basic model usage.
set -ex
# Try building the docker image
docker build -t cpu-test -f Dockerfile.ppc64le .
# Setup cleanup
remove_docker_container() { docker rm -f cpu-test || true; }
remove_docker_container() { docker rm -f cpu-test || true; docker system prune -f; }
trap remove_docker_container EXIT
remove_docker_container
# Run the image, setting --shm-size=4g for tensor parallel.
source /etc/environment
#docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true --network host -e HF_TOKEN="$HF_TOKEN" --name cpu-test cpu-test
# Try building the docker image
docker build -t cpu-test -f docker/Dockerfile.ppc64le .
function cpu_tests() {
set -e
# Run basic model test
docker exec cpu-test bash -c "
set -e
pip install pytest pytest-asyncio \
decord einops librosa peft Pillow sentence-transformers soundfile \
transformers_stream_generator matplotlib datamodel_code_generator
pip install torchvision --index-url https://download.pytorch.org/whl/cpu
pytest -v -s tests/models/decoder_only/language -m cpu_model
pytest -v -s tests/models/embedding/language -m cpu_model
pytest -v -s tests/models/encoder_decoder/language -m cpu_model
pytest -v -s tests/models/decoder_only/audio_language -m cpu_model
pytest -v -s tests/models/decoder_only/vision_language -m cpu_model"
# online inference
docker exec cpu-test bash -c "
set -e
python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m &
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
python3 benchmarks/benchmark_serving.py \
--backend vllm \
--dataset-name random \
--model facebook/opt-125m \
--num-prompts 20 \
--endpoint /v1/completions \
--tokenizer facebook/opt-125m"
}
# All of CPU tests are expected to be finished less than 25 mins.
export -f cpu_tests
timeout 25m bash -c "cpu_tests"

View File

@ -8,36 +8,40 @@ set -ex
CORE_RANGE=${CORE_RANGE:-48-95}
NUMA_NODE=${NUMA_NODE:-1}
# Try building the docker image
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build -t cpu-test -f Dockerfile.cpu .
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-avx2 -f Dockerfile.cpu .
# Setup cleanup
remove_docker_container() { docker rm -f cpu-test cpu-test-avx2 || true; }
remove_docker_container() {
set -e;
docker rm -f cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" || true;
docker image rm cpu-test-"$BUILDKITE_BUILD_NUMBER" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2 || true;
}
trap remove_docker_container EXIT
remove_docker_container
# Try building the docker image
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --tag cpu-test-"$BUILDKITE_BUILD_NUMBER" --target vllm-test -f docker/Dockerfile.cpu .
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" --tag cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2 --target vllm-test -f docker/Dockerfile.cpu .
# Run the image, setting --shm-size=4g for tensor parallel.
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
--cpuset-mems="$NUMA_NODE" --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus="$CORE_RANGE" \
--cpuset-mems="$NUMA_NODE" --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-avx2 cpu-test-avx2
--cpuset-mems="$NUMA_NODE" --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2
function cpu_tests() {
set -e
export NUMA_NODE=$2
export BUILDKITE_BUILD_NUMBER=$3
# offline inference
docker exec cpu-test-avx2 bash -c "
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" bash -c "
set -e
python3 examples/offline_inference.py"
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m"
# Run basic model test
docker exec cpu-test bash -c "
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pip install pytest pytest-asyncio \
decord einops librosa peft Pillow sentence-transformers soundfile \
transformers_stream_generator matplotlib datamodel_code_generator
pip install torchvision --index-url https://download.pytorch.org/whl/cpu
pytest -v -s tests/kernels/test_cache.py -m cpu_model
pytest -v -s tests/kernels/test_mla_decode_cpu.py -m cpu_model
pytest -v -s tests/models/decoder_only/language -m cpu_model
pytest -v -s tests/models/embedding/language -m cpu_model
pytest -v -s tests/models/encoder_decoder/language -m cpu_model
@ -45,20 +49,26 @@ function cpu_tests() {
pytest -v -s tests/models/decoder_only/vision_language -m cpu_model"
# Run compressed-tensor test
docker exec cpu-test bash -c "
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_static_setup \
tests/quantization/test_compressed_tensors.py::test_compressed_tensors_w8a8_dynamic_per_token"
# Run AWQ test
docker exec cpu-test bash -c "
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/quantization/test_ipex_quant.py"
# online inference
docker exec cpu-test bash -c "
# Run chunked-prefill and prefix-cache test
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pytest -s -v -k cpu_model \
tests/basic_correctness/test_chunked_prefill.py"
# online serving
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
export VLLM_CPU_KVCACHE_SPACE=10
export VLLM_CPU_OMP_THREADS_BIND=$1
@ -71,8 +81,14 @@ function cpu_tests() {
--num-prompts 20 \
--endpoint /v1/completions \
--tokenizer facebook/opt-125m"
# Run multi-lora tests
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
set -e
pytest -s -v \
tests/lora/test_qwen2vl.py"
}
# All of CPU tests are expected to be finished less than 25 mins.
# All of CPU tests are expected to be finished less than 40 mins.
export -f cpu_tests
timeout 25m bash -c "cpu_tests $CORE_RANGE"
timeout 40m bash -c "cpu_tests $CORE_RANGE $NUMA_NODE $BUILDKITE_BUILD_NUMBER"

View File

@ -0,0 +1,30 @@
#!/bin/bash
# This script build the GH200 docker image and run the offline inference inside the container.
# It serves a sanity check for compilation and basic model usage.
set -ex
# Skip the new torch installation during build since we are using the specified version for arm64 in the Dockerfile
python3 use_existing_torch.py
# Try building the docker image
DOCKER_BUILDKIT=1 docker build . \
--file docker/Dockerfile \
--target vllm-openai \
--platform "linux/arm64" \
-t gh200-test \
--build-arg max_jobs=66 \
--build-arg nvcc_threads=2 \
--build-arg RUN_WHEEL_CHECK=false \
--build-arg torch_cuda_arch_list="9.0+PTX" \
--build-arg vllm_fa_cmake_gpu_arches="90-real"
# Setup cleanup
remove_docker_container() { docker rm -f gh200-test || true; }
trap remove_docker_container EXIT
remove_docker_container
# Run the image and test offline inference
docker run -e HF_TOKEN -e VLLM_WORKER_MULTIPROC_METHOD=spawn -v /root/.cache/huggingface:/root/.cache/huggingface --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
python3 examples/offline_inference/basic/generate.py --model meta-llama/Llama-3.2-1B
'

View File

@ -5,12 +5,20 @@
set -ex
# Try building the docker image
docker build -t hpu-test-env -f Dockerfile.hpu .
docker build -t hpu-test-env -f docker/Dockerfile.hpu .
# Setup cleanup
# certain versions of HPU software stack have a bug that can
# override the exit code of the script, so we need to use
# separate remove_docker_container and remove_docker_container_and_exit
# functions, while other platforms only need one remove_docker_container
# function.
EXITCODE=1
remove_docker_container() { docker rm -f hpu-test || true; }
trap remove_docker_container EXIT
remove_docker_container_and_exit() { remove_docker_container; exit $EXITCODE; }
trap remove_docker_container_and_exit EXIT
remove_docker_container
# Run the image and launch offline inference
docker run --runtime=habana --name=hpu-test --network=host -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference.py
docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m
EXITCODE=$?

View File

@ -3,6 +3,18 @@
# This script build the Neuron docker image and run the API server inside the container.
# It serves a sanity check for compilation and basic model usage.
set -e
set -v
image_name="neuron/vllm-ci"
container_name="neuron_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)"
HF_CACHE="$(realpath ~)/huggingface"
mkdir -p "${HF_CACHE}"
HF_MOUNT="/root/.cache/huggingface"
NEURON_COMPILE_CACHE_URL="$(realpath ~)/neuron_compile_cache"
mkdir -p "${NEURON_COMPILE_CACHE_URL}"
NEURON_COMPILE_CACHE_MOUNT="/root/.cache/neuron_compile_cache"
# Try building the docker image
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com
@ -13,41 +25,30 @@ if [ -f /tmp/neuron-docker-build-timestamp ]; then
last_build=$(cat /tmp/neuron-docker-build-timestamp)
current_time=$(date +%s)
if [ $((current_time - last_build)) -gt 86400 ]; then
docker system prune -f
# Remove dangling images (those that are not tagged and not used by any container)
docker image prune -f
# Remove unused volumes / force the system prune for old images as well.
docker volume prune -f && docker system prune -f
echo "$current_time" > /tmp/neuron-docker-build-timestamp
fi
else
date "+%s" > /tmp/neuron-docker-build-timestamp
fi
docker build -t neuron -f Dockerfile.neuron .
docker build -t "${image_name}" -f docker/Dockerfile.neuron .
# Setup cleanup
remove_docker_container() { docker rm -f neuron || true; }
remove_docker_container() {
docker image rm -f "${image_name}" || true;
}
trap remove_docker_container EXIT
remove_docker_container
# Run the image
docker run --device=/dev/neuron0 --device=/dev/neuron1 --network host --name neuron neuron python3 -m vllm.entrypoints.api_server \
--model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --max-num-seqs 8 --max-model-len 128 --block-size 128 --device neuron --tensor-parallel-size 2 &
# Wait for the server to start
wait_for_server_to_start() {
timeout=300
counter=0
while [ "$(curl -s -o /dev/null -w '%{http_code}' localhost:8000/health)" != "200" ]; do
sleep 1
counter=$((counter + 1))
if [ $counter -ge $timeout ]; then
echo "Timeout after $timeout seconds"
break
fi
done
}
wait_for_server_to_start
# Test a simple prompt
curl -X POST -H "Content-Type: application/json" \
localhost:8000/generate \
-d '{"prompt": "San Francisco is a"}'
docker run --rm -it --device=/dev/neuron0 --network bridge \
-v "${HF_CACHE}:${HF_MOUNT}" \
-e "HF_HOME=${HF_MOUNT}" \
-v "${NEURON_COMPILE_CACHE_URL}:${NEURON_COMPILE_CACHE_MOUNT}" \
-e "NEURON_COMPILE_CACHE_URL=${NEURON_COMPILE_CACHE_MOUNT}" \
--name "${container_name}" \
${image_name} \
/bin/bash -c "python3 /workspace/vllm/examples/offline_inference/neuron.py && python3 -m pytest /workspace/vllm/tests/neuron/1_core/ -v --capture=tee-sys && python3 -m pytest /workspace/vllm/tests/neuron/2_core/ -v --capture=tee-sys"

View File

@ -1,16 +0,0 @@
#!/bin/bash
# This script build the OpenVINO docker image and run the offline inference inside the container.
# It serves a sanity check for compilation and basic model usage.
set -ex
# Try building the docker image
docker build -t openvino-test -f Dockerfile.openvino .
# Setup cleanup
remove_docker_container() { docker rm -f openvino-test || true; }
trap remove_docker_container EXIT
remove_docker_container
# Run the image and launch offline inference
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference.py

View File

@ -1,17 +0,0 @@
#!/bin/bash
set -e
# Build the docker image.
docker build -f Dockerfile.tpu -t vllm-tpu .
# Set up cleanup.
remove_docker_container() { docker rm -f tpu-test || true; }
trap remove_docker_container EXIT
# Remove the container that might not be cleaned up in the previous run.
remove_docker_container
# For HF_TOKEN.
source /etc/environment
# Run a simple end-to-end example.
docker run --privileged --net host --shm-size=16G -it -e "HF_TOKEN=$HF_TOKEN" --name tpu-test vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git && python3 -m pip install pytest && python3 -m pip install lm_eval[api]==0.4.4 && pytest -v -s /workspace/vllm/tests/entrypoints/openai/test_accuracy.py && pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py && python3 /workspace/vllm/tests/tpu/test_compilation.py && python3 /workspace/vllm/examples/offline_inference_tpu.py"

.buildkite/run-tpu-v1-test.sh Executable file
View File

@ -0,0 +1,43 @@
#!/bin/bash
set -e
# Build the docker image.
docker build -f docker/Dockerfile.tpu -t vllm-tpu .
# Set up cleanup.
remove_docker_container() { docker rm -f tpu-test || true; }
trap remove_docker_container EXIT
# Remove the container that might not be cleaned up in the previous run.
remove_docker_container
# For HF_TOKEN.
source /etc/environment
# Run a simple end-to-end example.
docker run --privileged --net host --shm-size=16G -it \
-e "HF_TOKEN=$HF_TOKEN" --name tpu-test \
vllm-tpu /bin/bash -c "python3 -m pip install git+https://github.com/thuml/depyf.git \
&& python3 -m pip install pytest \
&& python3 -m pip install lm_eval[api]==0.4.4 \
&& export VLLM_USE_V1=1 \
&& export VLLM_XLA_CHECK_RECOMPILATION=1 \
&& echo TEST_0 \
&& pytest -v -s /workspace/vllm/tests/v1/tpu/test_perf.py \
&& echo TEST_1 \
&& pytest -v -s /workspace/vllm/tests/tpu/test_compilation.py \
&& echo TEST_2 \
&& pytest -v -s /workspace/vllm/tests/v1/tpu/test_basic.py \
&& echo TEST_3 \
&& pytest -v -s /workspace/vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine \
&& echo TEST_4 \
&& pytest -s -v /workspace/vllm/tests/tpu/test_quantization_accuracy.py \
&& echo TEST_5 \
&& python3 /workspace/vllm/examples/offline_inference/tpu.py \
&& echo TEST_6 \
&& pytest -s -v /workspace/vllm/tests/v1/tpu/worker/test_tpu_model_runner.py \
&& echo TEST_7 \
&& pytest -s -v /workspace/vllm/tests/v1/tpu/test_sampler.py" \
# TODO: This test fails because it uses RANDOM_SEED sampling
# && VLLM_USE_V1=1 pytest -v -s /workspace/vllm/tests/tpu/test_custom_dispatcher.py \

View File

@ -4,13 +4,28 @@
# It serves as a sanity check for compilation and basic model usage.
set -ex
image_name="xpu/vllm-ci:${BUILDKITE_COMMIT}"
container_name="xpu_${BUILDKITE_COMMIT}_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)"
# Try building the docker image
docker build -t xpu-test -f Dockerfile.xpu .
docker build -t ${image_name} -f docker/Dockerfile.xpu .
# Setup cleanup
remove_docker_container() { docker rm -f xpu-test || true; }
remove_docker_container() {
docker rm -f "${container_name}" || true;
docker image rm -f "${image_name}" || true;
docker system prune -f || true;
}
trap remove_docker_container EXIT
remove_docker_container
# Run the image and launch offline inference
docker run --network host --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test python3 examples/offline_inference.py
# Run the image and test offline inference/tensor parallel
docker run \
--device /dev/dri \
-v /dev/dri/by-path:/dev/dri/by-path \
--entrypoint="" \
--name "${container_name}" \
"${image_name}" \
sh -c '
VLLM_USE_V1=0 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m
VLLM_USE_V1=0 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m -tp 2
'

View File

@ -2,21 +2,20 @@
# adding a new command to an existing step. See different options here for examples.
# This script will be fed into the Jinja template in `test-template-aws.j2` at
# https://github.com/vllm-project/buildkite-ci/blob/main/scripts/test-template-aws.j2
# https://github.com/vllm-project/buildkite-ci/blob/main/scripts/test-template-aws.j2
# to generate the final pipeline yaml file.
# Documentation
# label(str): the name of the test. emoji allowed.
# fast_check(bool): whether to run this on each commit on fastcheck pipeline.
# fast_check_only(bool): run this test on fastcheck pipeline only
# nightly(bool): run this test in nightly pipeline only
# optional(bool): never run this test by default (i.e. need to unblock manually)
# optional(bool): never run this test by default (i.e. need to unblock manually) unless it's a scheduled nightly run.
# command(str): the single command to run for tests. incompatible with commands.
# commands(list): the list of commands to run for the test. incompatible with command.
# mirror_hardwares(list): the list of hardware platforms to also run the test on. currently only supports [amd]
# gpu(str): override the GPU selection for the test. default is on L4 GPUs. currently only supports a100
# num_gpus(int): override the number of GPUs for the test. defaults to 1 GPU. currently supports 2 or 4.
# num_nodes(int): whether to simulate a multi-node setup by launching multiple containers on one host,
# num_nodes(int): whether to simulate a multi-node setup by launching multiple containers on one host,
# in this case, commands must be specified. the first command runs on the first host, the second
# command runs on the second host.
# working_dir(str): specify the directory where the command should execute. defaults to /vllm-workspace/tests
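# Example (hypothetical) step combining the options documented above; the label,
# paths and timing are made up and only illustrate the schema:
#   - label: Example Correctness Test  # 10min
#     fast_check: true
#     num_gpus: 2
#     working_dir: "/vllm-workspace/tests"
#     source_file_dependencies:
#       - vllm/
#       - tests/example
#     commands:
#       - pytest -v -s example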
@ -25,8 +24,8 @@
# When adding a test
# - If the test belongs to an existing group, add it there
# - If the test is short, add to any existing step
# - If the test takes more than 10min, then it is okay to create a new step.
# Note that all steps execute in parallel.
# - If the test takes more than 10min, then it is okay to create a new step.
# Note that all steps execute in parallel.
steps:
##### fast check tests #####
@ -36,13 +35,12 @@ steps:
fast_check: true
no_gpu: True
commands:
- pip install -r requirements-docs.txt
- pip install -r ../../requirements/docs.txt
- SPHINXOPTS=\"-W\" make html
# Check API reference (if it fails, you may have missing mock imports)
- grep \"sig sig-object py\" build/html/dev/sampling_params.html
- grep \"sig sig-object py\" build/html/api/inference_params.html
- label: Async Engine, Inputs, Utils, Worker Test # 24min
fast_check: true
source_file_dependencies:
- vllm/
- tests/mq_llm_engine
@ -51,7 +49,9 @@ steps:
- tests/multimodal
- tests/test_utils
- tests/worker
- tests/standalone_tests/lazy_imports.py
commands:
- python3 standalone_tests/lazy_imports.py
- pytest -v -s mq_llm_engine # MQLLMEngine
- pytest -v -s async_engine # AsyncLLMEngine
- NUM_SCHEDULER_STEPS=4 pytest -v -s async_engine/test_async_llm_engine.py
@ -60,6 +60,13 @@ steps:
- pytest -v -s test_utils.py # Utils
- pytest -v -s worker # Worker
- label: Python-only Installation Test
source_file_dependencies:
- tests/standalone_tests/python_only_compile.sh
- setup.py
commands:
- bash standalone_tests/python_only_compile.sh
- label: Basic Correctness Test # 30min
#mirror_hardwares: [amd]
fast_check: true
@ -68,7 +75,10 @@ steps:
- tests/basic_correctness/test_basic_correctness
- tests/basic_correctness/test_cpu_offload
- tests/basic_correctness/test_preemption
- tests/basic_correctness/test_cumem.py
commands:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s basic_correctness/test_cumem.py
- pytest -v -s basic_correctness/test_basic_correctness.py
- pytest -v -s basic_correctness/test_cpu_offload.py
- VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py
@ -97,43 +107,61 @@ steps:
mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/entrypoints/llm
- tests/entrypoints/openai
- tests/entrypoints/test_chat_utils
- tests/entrypoints/offline_mode
commands:
- pip install -e ./plugins/vllm_add_dummy_model
- pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s entrypoints/llm --ignore=entrypoints/llm/test_lazy_outlines.py --ignore=entrypoints/llm/test_generate.py --ignore=entrypoints/llm/test_generate_multiple_loras.py --ignore=entrypoints/llm/test_guided_generate.py --ignore=entrypoints/llm/test_collective_rpc.py
- pytest -v -s entrypoints/llm/test_lazy_outlines.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_generate_multiple_loras.py # it needs a clean process
- pytest -v -s entrypoints/llm/test_guided_generate.py # it needs a clean process
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_oot_registration.py
- pytest -v -s entrypoints/openai/test_oot_registration.py # it needs a clean process
- VLLM_USE_V1=0 pytest -v -s entrypoints/llm/test_guided_generate.py # it needs a clean process
- pytest -v -s entrypoints/openai --ignore=entrypoints/openai/test_oot_registration.py --ignore=entrypoints/openai/test_chat_with_tool_reasoning.py --ignore=entrypoints/openai/correctness/
- pytest -v -s entrypoints/test_chat_utils.py
- pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests
- VLLM_USE_V1=0 pytest -v -s entrypoints/offline_mode # Needs to avoid interference with other tests
- label: Distributed Tests (4 GPUs) # 10min
working_dir: "/vllm-workspace/tests"
num_gpus: 4
fast_check: true
source_file_dependencies:
- vllm/distributed/
- vllm/core/
- tests/distributed
- tests/distributed/test_utils
- tests/distributed/test_pynccl
- tests/spec_decode/e2e/test_integration_dist_tp4
- tests/compile
- tests/compile/test_basic_correctness
- examples/offline_inference/rlhf.py
- examples/offline_inference/rlhf_colocate.py
- tests/examples/offline_inference/data_parallel.py
- tests/v1/test_async_llm_dp.py
commands:
# test with tp=2 and external_dp=2
- VLLM_USE_V1=0 torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
- torchrun --nproc-per-node=4 distributed/test_torchrun_example.py
# test with internal dp
- python3 ../examples/offline_inference/data_parallel.py
- TP_SIZE=2 DP_SIZE=2 pytest -v -s v1/test_async_llm_dp.py
- pytest -v -s distributed/test_utils.py
- pytest -v -s compile/test_basic_correctness.py
- pytest -v -s distributed/test_pynccl.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py
# TODO: create a dedicated test section for multi-GPU example tests
# when we have multiple distributed example tests
- pushd ../examples/offline_inference
- python3 rlhf.py
- RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py
- popd
- label: Metrics, Tracing Test # 10min
num_gpus: 2
fast_check: true
num_gpus: 2
source_file_dependencies:
- vllm/
- tests/metrics
- tests/tracing
commands:
- pytest -v -s metrics
- pytest -v -s metrics
- "pip install \
'opentelemetry-sdk>=1.26.0,<1.27.0' \
'opentelemetry-api>=1.26.0,<1.27.0' \
@ -160,6 +188,9 @@ steps:
- vllm/
- tests/engine
- tests/tokenization
- tests/test_sequence
- tests/test_config
- tests/test_logger
commands:
- pytest -v -s engine test_sequence.py test_config.py test_logger.py
# OOM in the CI unless we run this separately
@ -171,29 +202,51 @@ steps:
- vllm/
- tests/v1
commands:
- pytest -v -s v1
# split the test to avoid interference
- pytest -v -s v1/core
- pytest -v -s v1/entrypoints
- pytest -v -s v1/engine
- pytest -v -s v1/entrypoints
- pytest -v -s v1/sample
- pytest -v -s v1/worker
- pytest -v -s v1/structured_output
- pytest -v -s v1/test_stats.py
- pytest -v -s v1/test_utils.py
- pytest -v -s v1/test_oracle.py
# TODO: accuracy does not match on H100, regardless of whether
# VLLM_USE_FLASHINFER_SAMPLER is set.
- pytest -v -s v1/e2e
# Integration test for streaming correctness (requires special branch).
- pip install -U git+https://github.com/robertgshaw2-neuralmagic/lm-evaluation-harness.git@streaming-api
- pytest -v -s entrypoints/openai/correctness/test_lmeval.py::test_lm_eval_accuracy_v1_engine
- label: Examples Test # 15min
- label: Examples Test # 25min
working_dir: "/vllm-workspace/examples"
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/entrypoints
- examples/
commands:
- pip install awscli tensorizer # for llava example and tensorizer test
- python3 offline_inference.py
- python3 cpu_offload.py
- python3 offline_inference_chat.py
- python3 offline_inference_with_prefix.py
- python3 llm_engine_example.py
- python3 offline_inference_vision_language.py
- python3 offline_inference_vision_language_multi_image.py
- python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- python3 offline_inference_encoder_decoder.py
- python3 offline_profile.py --model facebook/opt-125m
- pip install tensorizer # for tensorizer test
- python3 offline_inference/basic/generate.py --model facebook/opt-125m
- python3 offline_inference/basic/generate.py --model meta-llama/Llama-2-13b-chat-hf --cpu-offload-gb 10
- python3 offline_inference/basic/chat.py
- python3 offline_inference/prefix_caching.py
- python3 offline_inference/llm_engine_example.py
- python3 offline_inference/audio_language.py --seed 0
- python3 offline_inference/vision_language.py --seed 0
- python3 offline_inference/vision_language_embedding.py --seed 0
- python3 offline_inference/vision_language_multi_image.py --seed 0
- VLLM_USE_V1=0 python3 other/tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 other/tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors
- python3 offline_inference/encoder_decoder.py
- python3 offline_inference/encoder_decoder_multimodal.py --model-type whisper --seed 0
- python3 offline_inference/basic/classify.py
- python3 offline_inference/basic/embed.py
- python3 offline_inference/basic/score.py
- VLLM_USE_V1=0 python3 offline_inference/profiling.py --model facebook/opt-125m run_num_steps --num-steps 2
- label: Prefix Caching Test # 9min
#mirror_hardwares: [amd]
mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/prefix_caching
@ -205,6 +258,7 @@ steps:
- vllm/model_executor/layers
- vllm/sampling_metadata.py
- tests/samplers
- tests/conftest.py
commands:
- pytest -v -s samplers
- VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers
@ -213,27 +267,32 @@ steps:
mirror_hardwares: [amd]
source_file_dependencies:
- vllm/model_executor/layers
- vllm/model_executor/guided_decoding
- tests/test_logits_processor
command: pytest -v -s test_logits_processor.py
- tests/model_executor/test_guided_processors
commands:
- pytest -v -s test_logits_processor.py
- pytest -v -s model_executor/test_guided_processors.py
- label: Speculative decoding tests # 30min
- label: Speculative decoding tests # 40min
source_file_dependencies:
- vllm/spec_decode
- tests/spec_decode
- vllm/model_executor/models/eagle.py
commands:
- pytest -v -s spec_decode/e2e/test_multistep_correctness.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s spec_decode --ignore=spec_decode/e2e/test_multistep_correctness.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s spec_decode --ignore=spec_decode/e2e/test_multistep_correctness.py --ignore=spec_decode/e2e/test_mtp_correctness.py
- pytest -v -s spec_decode/e2e/test_eagle_correctness.py
- label: LoRA Test %N # 15min each
mirror_hardwares: [amd]
source_file_dependencies:
- vllm/lora
- tests/lora
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py
command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py --ignore=lora/test_minicpmv_tp.py --ignore=lora/test_transfomers_model.py
parallelism: 4
- label: "PyTorch Fullgraph Smoke Test" # 9min
fast_check: true
- label: PyTorch Fullgraph Smoke Test # 9min
source_file_dependencies:
- vllm/
- tests/compile
@ -242,8 +301,9 @@ steps:
# these tests need to be separated, cannot combine
- pytest -v -s compile/piecewise/test_simple.py
- pytest -v -s compile/piecewise/test_toy_llama.py
- pytest -v -s compile/test_pass_manager.py
- label: "PyTorch Fullgraph Test" # 18min
- label: PyTorch Fullgraph Test # 18min
source_file_dependencies:
- vllm/
- tests/compile
@ -295,6 +355,14 @@ steps:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- bash ./run-tests.sh -c configs/models-small.txt -t 1
- label: OpenAI API correctness
source_file_dependencies:
- csrc/
- vllm/entrypoints/openai/
- vllm/model_executor/models/whisper.py
commands: # LMEval+Transcription WER check
- pytest -s entrypoints/openai/correctness/
- label: Encoder Decoder tests # 5min
source_file_dependencies:
- vllm/
@ -313,17 +381,17 @@ steps:
##### models test #####
- label: Basic Models Test # 30min
- label: Basic Models Test # 24min
source_file_dependencies:
- vllm/
- tests/models
commands:
- pip install -e ./plugins/vllm_add_dummy_model
- pytest -v -s models/test_oot_registration.py # it needs a clean process
- pytest -v -s models/test_transformers.py
- pytest -v -s models/test_registry.py
- pytest -v -s models/test_initialization.py
# V1 Test: https://github.com/vllm-project/vllm/issues/14531
- VLLM_USE_V1=0 pytest -v -s models/test_initialization.py
- label: Language Models Test (Standard) # 42min
- label: Language Models Test (Standard) # 32min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
@ -333,10 +401,9 @@ steps:
commands:
- pytest -v -s models/decoder_only/language -m 'core_model or quant_model'
- pytest -v -s models/embedding/language -m core_model
- pytest -v -s models/embedding/vision_language -m core_model
- label: Language Models Test (Extended) # 50min
nightly: true
- label: Language Models Test (Extended) # 1h10min
optional: true
source_file_dependencies:
- vllm/
- tests/models/decoder_only/language
@ -345,24 +412,29 @@ steps:
commands:
- pytest -v -s models/decoder_only/language -m 'not core_model and not quant_model'
- pytest -v -s models/embedding/language -m 'not core_model'
- pytest -v -s models/embedding/vision_language -m 'not core_model'
- label: Multi-Modal Models Test (Standard) # 26min
- label: Multi-Modal Models Test (Standard) # 40min
#mirror_hardwares: [amd]
source_file_dependencies:
- vllm/
- tests/models/decoder_only/audio_language
- tests/models/decoder_only/vision_language
- tests/models/embedding/vision_language
- tests/models/encoder_decoder/audio_language
- tests/models/encoder_decoder/vision_language
commands:
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/multimodal
- pytest -v -s models/decoder_only/audio_language -m 'core_model or quant_model'
- pytest -v -s --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m 'core_model or quant_model'
- pytest -v -s models/embedding/vision_language -m core_model
- pytest -v -s models/encoder_decoder/audio_language -m core_model
- pytest -v -s models/encoder_decoder/language -m core_model
- pytest -v -s models/encoder_decoder/vision_language -m core_model
- pytest -v -s models/decoder_only/vision_language/test_interleaved.py
- label: Multi-Modal Models Test (Extended) # 1h15m
nightly: true
- label: Multi-Modal Models Test (Extended) 1 # 48m
optional: true
source_file_dependencies:
- vllm/
- tests/models/decoder_only/audio_language
@ -370,14 +442,26 @@ steps:
- tests/models/embedding/vision_language
- tests/models/encoder_decoder/vision_language
commands:
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/decoder_only/audio_language -m 'not core_model and not quant_model'
- pytest -v -s models/decoder_only/vision_language/test_models.py -m 'split(group=0) and not core_model and not quant_model'
# HACK - run phi3v tests separately to sidestep this transformers bug
# https://github.com/huggingface/transformers/issues/34307
- pytest -v -s models/decoder_only/vision_language/test_phi3v.py
- pytest -v -s --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m 'not core_model and not quant_model'
- pytest -v -s --ignore models/decoder_only/vision_language/test_models.py --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m 'not core_model and not quant_model'
- pytest -v -s models/embedding/vision_language -m 'not core_model'
- pytest -v -s models/encoder_decoder/language -m 'not core_model'
- pytest -v -s models/encoder_decoder/vision_language -m 'not core_model'
- label: Multi-Modal Models Test (Extended) 2 # 38m
optional: true
source_file_dependencies:
- vllm/
- tests/models/decoder_only/vision_language
commands:
- pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
- pytest -v -s models/decoder_only/vision_language/test_models.py -m 'split(group=1) and not core_model and not quant_model'
# This test is used only in the PR development phase to test individual models and should never run on main
- label: Custom Models Test
optional: true
@ -412,11 +496,11 @@ steps:
- tests/distributed/
commands:
- # the following commands are for the first node, with ip 192.168.10.10 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep -q 'Same node test passed'
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_multi_node_assignment.py
- VLLM_MULTI_NODE=1 pytest -v -s distributed/test_pipeline_parallel.py
- # the following commands are for the second node, with ip 192.168.10.11 (ray environment already set up)
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep -q 'Same node test passed'
- VLLM_TEST_SAME_HOST=0 torchrun --nnodes 2 --nproc-per-node=2 --rdzv_backend=c10d --rdzv_endpoint=192.168.10.10 distributed/test_same_node.py | grep 'Same node test passed'
- label: Distributed Tests (2 GPUs) # 40min
#mirror_hardwares: [amd]
@ -429,19 +513,48 @@ steps:
- vllm/model_executor/models/
- tests/distributed/
- vllm/compilation
- vllm/worker/worker_base.py
- vllm/worker/worker.py
- vllm/worker/model_runner.py
- entrypoints/llm/test_collective_rpc.py
- tests/v1/test_async_llm_dp.py
- vllm/v1/engine/
commands:
- TP_SIZE=1 DP_SIZE=2 pytest -v -s v1/test_async_llm_dp.py
- pytest -v -s entrypoints/llm/test_collective_rpc.py
- pytest -v -s ./compile/test_basic_correctness.py
- pytest -v -s ./compile/test_wrapper.py
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep -q 'Same node test passed'
- TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m distributed_2_gpus
- VLLM_TEST_SAME_HOST=1 torchrun --nproc-per-node=4 distributed/test_same_node.py | grep 'Same node test passed'
- TARGET_TEST_SUITE=L4 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
# Avoid importing model tests that cause CUDA reinitialization error
- pytest models/encoder_decoder/language/test_bart.py -v -s -m distributed_2_gpus
- pytest models/encoder_decoder/vision_language/test_broadcast.py -v -s -m distributed_2_gpus
- pytest models/decoder_only/vision_language/test_models.py -v -s -m distributed_2_gpus
- pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- pytest models/test_transformers.py -v -s -m 'distributed(num_gpus=2)'
- pytest models/encoder_decoder/language/test_bart.py -v -s -m 'distributed(num_gpus=2)'
- pytest models/encoder_decoder/vision_language/test_broadcast.py -v -s -m 'distributed(num_gpus=2)'
- pytest models/decoder_only/vision_language/test_models.py -v -s -m 'distributed(num_gpus=2)'
# this test fails consistently.
# TODO: investigate and fix
# - pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
- VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s kv_transfer/test_disagg.py
- label: Plugin Tests (2 GPUs) # 40min
working_dir: "/vllm-workspace/tests"
num_gpus: 2
source_file_dependencies:
- vllm/plugins/
- tests/plugins/
commands:
# begin platform plugin tests, all the code in-between runs on dummy platform
- pip install -e ./plugins/vllm_add_dummy_platform
- pytest -v -s plugins_tests/test_platform_plugins.py
- pip uninstall vllm_add_dummy_platform -y
# end platform plugin tests
# other tests continue here:
- pytest -v -s plugins_tests/test_scheduler_plugins.py
- pip install -e ./plugins/vllm_add_dummy_model
- pytest -v -s distributed/test_distributed_oot.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
- pytest -v -s entrypoints/openai/test_oot_registration.py # it needs a clean process
- pytest -v -s models/test_oot_registration.py # it needs a clean process
- label: Multi-step Tests (4 GPUs) # 36min
working_dir: "/vllm-workspace/tests"
@ -458,7 +571,9 @@ steps:
- vllm/engine
- tests/multi_step
commands:
- pytest -v -s multi_step/test_correctness_async_llm.py
# this test is quite flaky
# TODO: investigate and fix.
# - pytest -v -s multi_step/test_correctness_async_llm.py
- pytest -v -s multi_step/test_correctness_llm.py
- label: Pipeline Parallelism Test # 45min
@ -474,18 +589,22 @@ steps:
- pytest -v -s distributed/test_pp_cudagraph.py
- pytest -v -s distributed/test_pipeline_parallel.py
- label: LoRA Long Context (Distributed) # 11min
# This test runs llama 13B, so it is required to run on 4 GPUs.
- label: LoRA TP Test (Distributed)
num_gpus: 4
soft_fail: true
source_file_dependencies:
- vllm/lora
- tests/lora/test_long_context
- tests/lora
commands:
# FIXIT: find out which code initializes CUDA before running the test
# before the fix, we need to use spawn to test it
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -v -s -x lora/test_long_context.py
# There is some Tensor Parallelism related processing logic in LoRA that
# requires multi-GPU testing for validation.
- pytest -v -s -x lora/test_chatglm3_tp.py
- pytest -v -s -x lora/test_llama_tp.py
- pytest -v -s -x lora/test_minicpmv_tp.py
- pytest -v -s -x lora/test_transfomers_model.py
- label: Weight Loading Multiple GPU Test # 33min
working_dir: "/vllm-workspace/tests"
@ -505,7 +624,7 @@ steps:
- vllm/
- tests/weight_loading
commands:
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt
- bash weight_loading/run_model_weight_loading_test.sh -c weight_loading/models-large.txt
##### multi gpus test #####
@ -513,19 +632,21 @@ steps:
- label: Distributed Tests (A100) # optional
gpu: a100
optional: true
num_gpus: 4
source_file_dependencies:
- vllm/
commands:
commands:
# NOTE: don't test the llama model here; the HF implementation seems to be buggy
# see https://github.com/vllm-project/vllm/pull/5689 for details
- pytest -v -s distributed/test_custom_all_reduce.py
- torchrun --nproc_per_node=2 distributed/test_ca_buffer_sharing.py
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m distributed_2_gpus
- TARGET_TEST_SUITE=A100 pytest basic_correctness/ -v -s -m 'distributed(num_gpus=2)'
- pytest -v -s -x lora/test_mixtral.py
- label: LM Eval Large Models # optional
gpu: a100
optional: true
num_gpus: 4
working_dir: "/vllm-workspace/.buildkite/lm-eval-harness"
source_file_dependencies:

View File

@ -23,16 +23,55 @@ wheel="$new_wheel"
version=$(unzip -p "$wheel" '**/METADATA' | grep '^Version: ' | cut -d' ' -f2)
echo "Version: $version"
normal_wheel="$wheel" # Save the original wheel filename
# If the version contains "dev", rename it to 1.0.0.dev for consistency
if [[ $version == *dev* ]]; then
new_version="1.0.0.dev"
suffix="${version##*.}"
if [[ $suffix == cu* ]]; then
new_version="1.0.0.dev+${suffix}"
else
new_version="1.0.0.dev"
fi
new_wheel="${wheel/$version/$new_version}"
mv -- "$wheel" "$new_wheel"
# use cp to keep both files in the artifacts directory
cp -- "$wheel" "$new_wheel"
wheel="$new_wheel"
version="$new_version"
fi
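To make the renaming above concrete, here is a small illustration with a hypothetical version string (the exact version format is an assumption; only the last dot-separated component matters for the cu* check):

version="0.6.4.dev123+g0123abcd.cu118"   # hypothetical input
suffix="${version##*.}"                  # -> "cu118"
if [[ $suffix == cu* ]]; then
    new_version="1.0.0.dev+${suffix}"    # -> "1.0.0.dev+cu118"
else
    new_version="1.0.0.dev"
fi
echo "$new_version"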
# generate index for this commit
python3 .buildkite/generate_index.py --wheel "$normal_wheel"
# Upload the wheel to S3
aws s3 cp "$wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
aws s3 cp "$normal_wheel" "s3://vllm-wheels/$BUILDKITE_COMMIT/"
if [[ $normal_wheel == *"cu118"* ]]; then
# if $normal_wheel matches cu118, do not upload the index.html
echo "Skipping index files for cu118 wheels"
elif [[ $normal_wheel == *"cu121"* ]]; then
# if $normal_wheel matches cu121, do not upload the index.html
echo "Skipping index files for cu121 wheels"
else
# only upload index.html for cu124 wheels (default wheels)
aws s3 cp index.html "s3://vllm-wheels/$BUILDKITE_COMMIT/vllm/index.html"
aws s3 cp "s3://vllm-wheels/nightly/index.html" "s3://vllm-wheels/$BUILDKITE_COMMIT/index.html"
fi
# generate index for nightly
aws s3 cp "$wheel" "s3://vllm-wheels/nightly/"
aws s3 cp "$normal_wheel" "s3://vllm-wheels/nightly/"
if [[ $normal_wheel == *"cu118"* ]]; then
# if $normal_wheel matches cu118, do not upload the index.html
echo "Skipping index files for cu118 wheels"
elif [[ $normal_wheel == *"cu121"* ]]; then
# if $normal_wheel matches cu121, do not upload the index.html
echo "Skipping index files for cu121 wheels"
else
# only upload index.html for cu124 wheels (default wheels)
aws s3 cp index.html "s3://vllm-wheels/nightly/vllm/index.html"
fi
aws s3 cp "$wheel" "s3://vllm-wheels/$version/"

.github/CODEOWNERS vendored
View File

@ -2,29 +2,40 @@
# for more info about CODEOWNERS file
# This list covers the "core" components of vLLM that require careful review
/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/core @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/engine/llm_engine.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/executor/executor_base.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/worker/worker_base.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/worker/worker.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
/vllm/model_executor/layers/sampler.py @WoosukKwon @zhuohan123 @youkaichao @alexm-neuralmagic @comaniac @njhill
CMakeLists.txt @tlrmchlsmth @WoosukKwon
/vllm/attention/backends/abstract.py @WoosukKwon @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/core @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/engine/llm_engine.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/executor/executor_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/worker/worker_base.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/worker/worker.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/model_executor/layers/sampler.py @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill
/vllm/model_executor/layers/quantization @mgoin @robertgshaw2-redhat @tlrmchlsmth
/vllm/model_executor/guided_decoding @mgoin @russellb
/vllm/multimodal @DarkLight1337 @ywang96
CMakeLists.txt @tlrmchlsmth
# vLLM V1
/vllm/v1 @WoosukKwon @robertgshaw2-redhat @njhill @ywang96 @comaniac @alexm-redhat
/vllm/v1/structured_output @mgoin @russellb
# Test ownership
/tests/async_engine @njhill @robertgshaw2-neuralmagic @simon-mo
/tests/test_inputs.py @DarkLight1337 @ywang96
/tests/entrypoints @DarkLight1337 @robertgshaw2-neuralmagic @simon-mo
/tests/models @DarkLight1337 @ywang96
/tests/multimodal @DarkLight1337 @ywang96
/tests/prefix_caching @comaniac @KuntaiDu
/tests/spec_decode @njhill @LiuXiaoxuanPKU
/tests/kernels @tlrmchlsmth @WoosukKwon
/tests/quantization @mgoin @robertgshaw2-neuralmagic
/.buildkite/lm-eval-harness @mgoin @simon-mo
/tests/async_engine @njhill @robertgshaw2-redhat @simon-mo
/tests/basic_correctness/test_chunked_prefill @rkooo567 @comaniac
/tests/distributed/test_multi_node_assignment.py @youkaichao
/tests/distributed/test_pipeline_parallel.py @youkaichao
/tests/distributed/test_same_node.py @youkaichao
/tests/multi_step @alexm-neuralmagic @comaniac
/tests/entrypoints @DarkLight1337 @robertgshaw2-redhat @simon-mo
/tests/entrypoints/llm/test_guided_generate.py @mgoin @russellb
/tests/kernels @tlrmchlsmth @WoosukKwon
/tests/model_executor/test_guided_processors.py @mgoin @russellb
/tests/models @DarkLight1337 @ywang96
/tests/multi_step @alexm-redhat @comaniac
/tests/multimodal @DarkLight1337 @ywang96
/tests/prefix_caching @comaniac @KuntaiDu
/tests/quantization @mgoin @robertgshaw2-redhat
/tests/spec_decode @njhill @LiuXiaoxuanPKU
/tests/test_inputs.py @DarkLight1337 @ywang96
/tests/v1/entrypoints/llm/test_struct_output_generate.py @mgoin @russellb
/tests/v1/structured_output @mgoin @russellb
/tests/weight_loading @mgoin @youkaichao
/tests/basic_correctness/test_chunked_prefill @rkooo567 @comaniac

.github/FUNDING.yml vendored
View File

@ -1,2 +1,2 @@
github: [vllm-project]
open_collective: [vllm]
open_collective: vllm

View File

@ -30,15 +30,6 @@ body:
</details>
validations:
required: true
- type: textarea
attributes:
label: Model Input Dumps
description: |
If you are facing crashes due to illegal memory access or other issues with model execution, vLLM may dump the problematic model input. In this case, you will see the message `Error in model execution (input dumped to /tmp/err_xxx.pkl)`. If you see this message, please zip the file (because GitHub doesn't support the .pkl file format) and upload it here. This will help us reproduce the issue and facilitate the debugging process.
placeholder: |
Upload the dumped input file.
validations:
required: false
- type: textarea
attributes:
label: 🐛 Describe the bug

View File

@ -9,7 +9,7 @@ body:
value: >
#### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).
#### We also highly recommend you read https://docs.vllm.ai/en/latest/models/adding_model.html first to understand how to add a new model.
#### We also highly recommend you read https://docs.vllm.ai/en/latest/contributing/model/adding_model.html first to understand how to add a new model.
- type: textarea
attributes:
label: The model to consider.

View File

@ -1,28 +0,0 @@
name: 🎲 Misc/random discussions that do not fit into the above categories.
description: Submit a discussion as you like. Note that developers are heavily overloaded and we mainly rely on community users to answer these issues.
title: "[Misc]: "
labels: ["misc"]
body:
- type: markdown
attributes:
value: >
#### Before submitting an issue, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue+sort%3Acreated-desc+).
- type: textarea
attributes:
label: Anything you want to discuss about vllm.
description: >
Anything you want to discuss about vllm.
validations:
required: true
- type: markdown
attributes:
value: >
Thanks for contributing 🎉!
- type: checkboxes
id: askllm
attributes:
label: Before submitting a new issue...
options:
- label: Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
required: true

View File

@ -1 +1,5 @@
blank_issues_enabled: false
contact_links:
- name: Questions
url: https://discuss.vllm.ai
about: Ask questions and discuss with other vLLM community members

View File

@ -2,73 +2,5 @@ FILL IN THE PR DESCRIPTION HERE
FIX #xxxx (*link existing issues this PR will resolve*)
**BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE**
---
<details>
<!-- inside this <details> section, markdown rendering does not work, so we use raw html here. -->
<summary><b> PR Checklist (Click to Expand) </b></summary>
<p>Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.</p>
<h3>PR Title and Classification</h3>
<p>Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:</p>
<ul>
<li><code>[Bugfix]</code> for bug fixes.</li>
<li><code>[CI/Build]</code> for build or continuous integration improvements.</li>
<li><code>[Doc]</code> for documentation fixes and improvements.</li>
<li><code>[Model]</code> for adding a new model or improving an existing model. Model name should appear in the title.</li>
<li><code>[Frontend]</code> For changes on the vLLM frontend (e.g., OpenAI API server, <code>LLM</code> class, etc.) </li>
<li><code>[Kernel]</code> for changes affecting CUDA kernels or other compute kernels.</li>
<li><code>[Core]</code> for changes in the core vLLM logic (e.g., <code>LLMEngine</code>, <code>AsyncLLMEngine</code>, <code>Scheduler</code>, etc.)</li>
<li><code>[Hardware][Vendor]</code> for hardware-specific changes. Vendor name should appear in the prefix (e.g., <code>[Hardware][AMD]</code>).</li>
<li><code>[Misc]</code> for PRs that do not fit the above categories. Please use this sparingly.</li>
</ul>
<p><strong>Note:</strong> If the PR spans more than one category, please include all relevant prefixes.</p>
<h3>Code Quality</h3>
<p>The PR needs to meet the following code quality standards:</p>
<ul>
<li>We adhere to <a href="https://google.github.io/styleguide/pyguide.html">Google Python style guide</a> and <a href="https://google.github.io/styleguide/cppguide.html">Google C++ style guide</a>.</li>
<li>Pass all linter checks. Please use <a href="https://github.com/vllm-project/vllm/blob/main/format.sh"><code>format.sh</code></a> to format your code.</li>
<li>The code needs to be well-documented to ensure future contributors can easily understand the code.</li>
<li>Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.</li>
<li>Please add documentation to <code>docs/source/</code> if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.</li>
</ul>
<h3>Adding or changing kernels</h3>
<p>Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.</p>
<ul>
<li>Make sure custom ops are registered following PyTorch guidelines: <a href="https://pytorch.org/tutorials/advanced/cpp_custom_ops.html#cpp-custom-ops-tutorial">Custom C++ and CUDA Operators</a> and <a href="https://docs.google.com/document/d/1_W62p8WJOQQUzPsJYa7s701JXt0qf2OfLub2sbkHOaU">The Custom Operators Manual</a></li>
<li>Custom operations that return <code>Tensors</code> require meta-functions. Meta-functions should be implemented and registered in python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.</li>
<li>Use <a href="https://pytorch.org/docs/stable/library.html#torch.library.opcheck"><code>torch.library.opcheck()</code></a> to test the function registration and meta-function for any registered ops. See <code>tests/kernels</code> for examples.</li>
<li>When changing the C++ signature of an existing op, the schema must be updated to reflect the changes.</li>
<li>If a new custom type is needed, see the following document: <a href="https://docs.google.com/document/d/18fBMPuOJ0fY5ZQ6YyrHUppw9FA332CpNtgB6SOIgyuA">Custom Class Support in PT2</a>.
</ul>
<h3>Notes for Large Changes</h3>
<p>Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with <code>rfc-required</code> and might not review the PR.</p>
<h3>What to Expect for the Reviews</h3>
<p>The goal of the vLLM team is to be a <i>transparent reviewing machine</i>. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process: </p>
<ul>
<li> After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.</li>
<li> After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.</li>
<li> After the review, the reviewer will put an <code> action-required</code> label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.</li>
<li> Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.
</li>
</ul>
<h3>Thank You</h3>
<p> Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone! </p>
</details>
<!--- pyml disable-next-line no-emphasis-as-heading -->
**BEFORE SUBMITTING, PLEASE READ <https://docs.vllm.ai/en/latest/contributing/overview.html>**

View File

@ -15,18 +15,17 @@ updates:
allow:
- dependency-type: "all"
ignore:
- dependency-name: "*"
update-types: ["version-update:semver-patch"]
- dependency-name: "torch"
- dependency-name: "torchvision"
- dependency-name: "xformers"
- dependency-name: "lm-format-enforcer"
- dependency-name: "gguf"
- dependency-name: "compressed-tensors"
- dependency-name: "ray[adag]"
- dependency-name: "ray[cgraph]" # Ray Compiled Graph
- dependency-name: "lm-eval"
groups:
patch-update:
applies-to: version-updates
update-types: ["patch"]
minor-update:
applies-to: version-updates
update-types: ["minor"]

.github/mergify.yml vendored
View File

@ -5,6 +5,7 @@ pull_request_rules:
- or:
- files~=^[^/]+\.md$
- files~=^docs/
- files~=^examples/
actions:
label:
add:
@ -18,7 +19,7 @@ pull_request_rules:
- files~=\.buildkite/
- files~=^cmake/
- files=CMakeLists.txt
- files~=^Dockerfile
- files~=^docker/Dockerfile
- files~=^requirements.*\.txt
- files=setup.py
actions:
@ -35,6 +36,88 @@ pull_request_rules:
add:
- frontend
- name: label-multi-modality
description: Automatically apply multi-modality label
conditions:
- or:
- files~=^vllm/multimodal/
- files~=^tests/multimodal/
- files~=^tests/models/multimodal/
- files~=^tests/models/*/audio_language/
- files~=^tests/models/*/vision_language/
- files=tests/models/test_vision.py
actions:
label:
add:
- multi-modality
- name: label-structured-output
description: Automatically apply structured-output label
conditions:
- or:
- files~=^vllm/model_executor/guided_decoding/
- files=tests/model_executor/test_guided_processors.py
- files=tests/entrypoints/llm/test_guided_generate.py
- files=benchmarks/benchmark_serving_guided.py
- files=benchmarks/benchmark_guided.py
actions:
label:
add:
- structured-output
- name: label-speculative-decoding
description: Automatically apply speculative-decoding label
conditions:
- or:
- files~=^vllm/spec_decode/
- files=vllm/model_executor/layers/spec_decode_base_sampler.py
- files~=^tests/spec_decode/
actions:
label:
add:
- speculative-decoding
- name: label-v1
description: Automatically apply v1 label
conditions:
- or:
- files~=^vllm/v1/
- files~=^tests/v1/
actions:
label:
add:
- v1
- name: label-tpu
description: Automatically apply tpu label
# Keep this list in sync with `label-tpu-remove` conditions
conditions:
- or:
- files~=tpu.py
- files~=_tpu
- files~=tpu_
- files~=/tpu/
- files~=pallas
actions:
label:
add:
- tpu
- name: label-tpu-remove
description: Automatically remove tpu label
# Keep this list in sync with `label-tpu` conditions
conditions:
- and:
- -files~=tpu.py
- -files~=_tpu
- -files~=tpu_
- -files~=/tpu/
- -files~=pallas
actions:
label:
remove:
- tpu
- name: ping author on conflicts and add 'needs-rebase' label
conditions:
- conflict

View File

@ -15,19 +15,36 @@ NEW=/tmp/new_pr_body.txt
gh pr view --json body --template "{{.body}}" "${PR_NUMBER}" > "${OLD}"
cp "${OLD}" "${NEW}"
# Remove all lines after and including "**BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE**"
sed -i '/\*\*BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE\*\*/,$d' "${NEW}"
# Remove "FIX #xxxx (*link existing issues this PR will resolve*)"
sed -i '/FIX #xxxx.*$/d' "${NEW}"
# Remove "FILL IN THE PR DESCRIPTION HERE"
sed -i '/FILL IN THE PR DESCRIPTION HERE/d' "${NEW}"
# Remove all lines after and including "**BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE**"
sed -i '/\*\*BEFORE SUBMITTING, PLEASE READ.*\*\*/,$d' "${NEW}"
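# For illustration only: the /pattern/,$d address range deletes everything from
# the first matching line through the end of the file, e.g.
#   printf 'keep\n**BEFORE SUBMITTING, PLEASE READ X**\ndrop\n' \
#     | sed '/\*\*BEFORE SUBMITTING, PLEASE READ.*\*\*/,$d'
# prints only "keep".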
# Remove HTML <details> section that includes <summary> text of "PR Checklist (Click to Expand)"
python3 - <<EOF
import re
with open("${NEW}", "r") as file:
content = file.read()
pattern = re.compile(r'(---\n\n)?<details>.*?<summary>.*?PR Checklist \(Click to Expand\).*?</summary>.*?</details>', re.DOTALL)
content = re.sub(pattern, '', content)
with open("${NEW}", "w") as file:
file.write(content)
EOF
# Run this only if ${NEW} is different than ${OLD}
if ! cmp -s "${OLD}" "${NEW}"; then
echo "Updating PR body"
gh pr edit --body-file "${NEW}" "${PR_NUMBER}"
echo
echo "Updated PR body:"
echo
cat "${NEW}"
else
echo "No changes needed"
fi

View File

@ -1,40 +0,0 @@
name: Lint GitHub Actions workflows
on:
push:
branches:
- "main"
paths:
- '.github/workflows/*.ya?ml'
- '.github/workflows/actionlint.*'
- '.github/workflows/matchers/actionlint.json'
pull_request:
branches:
- "main"
paths:
- '.github/workflows/*.ya?ml'
- '.github/workflows/actionlint.*'
- '.github/workflows/matchers/actionlint.json'
env:
LC_ALL: en_US.UTF-8
defaults:
run:
shell: bash
permissions:
contents: read
jobs:
actionlint:
runs-on: ubuntu-latest
steps:
- name: "Checkout"
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: "Run actionlint"
run: |
echo "::add-matcher::.github/workflows/matchers/actionlint.json"
tools/actionlint.sh -color

View File

@ -1,53 +0,0 @@
name: clang-format
on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- main
paths:
- '**/*.h'
- '**/*.cpp'
- '**/*.cu'
- '**/*.cuh'
- '.github/workflows/clang-format.yml'
pull_request:
branches:
- main
paths:
- '**/*.h'
- '**/*.cpp'
- '**/*.cu'
- '**/*.cuh'
- '.github/workflows/clang-format.yml'
jobs:
clang-format:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.11"]
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install clang-format==18.1.5
- name: Running clang-format
run: |
EXCLUDES=(
'csrc/moe/topk_softmax_kernels.cu'
'csrc/quantization/gguf/ggml-common.h'
'csrc/quantization/gguf/dequantize.cuh'
'csrc/quantization/gguf/vecdotq.cuh'
'csrc/quantization/gguf/mmq.cuh'
'csrc/quantization/gguf/mmvq.cuh'
)
find csrc/ \( -name '*.h' -o -name '*.cpp' -o -name '*.cu' -o -name '*.cuh' \) -print \
| grep -vFf <(printf "%s\n" "${EXCLUDES[@]}") \
| xargs clang-format --dry-run --Werror
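The exclusion filtering above works by feeding the EXCLUDES list to grep -vFf through process substitution; a self-contained illustration with made-up paths:

EXCLUDES=('csrc/quantization/gguf/mmq.cuh')
printf '%s\n' 'csrc/attention/attention_kernels.cu' 'csrc/quantization/gguf/mmq.cuh' \
  | grep -vFf <(printf '%s\n' "${EXCLUDES[@]}")
# prints only csrc/attention/attention_kernels.cu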

View File

@ -16,7 +16,7 @@ jobs:
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python
uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
with:
python-version: '3.12'

View File

@ -1,45 +0,0 @@
name: codespell
on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- main
paths:
- "**/*.py"
- "**/*.md"
- "**/*.rst"
- pyproject.toml
- requirements-lint.txt
- .github/workflows/codespell.yml
pull_request:
branches:
- main
paths:
- "**/*.py"
- "**/*.md"
- "**/*.rst"
- pyproject.toml
- requirements-lint.txt
- .github/workflows/codespell.yml
jobs:
codespell:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.12"]
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements-lint.txt
- name: Spelling check with codespell
run: |
codespell --toml pyproject.toml

.github/workflows/lint-and-deploy.yaml vendored Normal file
View File

@ -0,0 +1,82 @@
name: Lint and Deploy Charts
on: pull_request
jobs:
lint-and-deploy:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: Set up Helm
uses: azure/setup-helm@b9e51907a09c216f16ebe8536097933489208112 # v4.3.0
with:
version: v3.14.4
# Python is required because ct lint runs Yamale and yamllint, which require Python.
- uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
with:
python-version: '3.13'
- name: Set up chart-testing
uses: helm/chart-testing-action@0d28d3144d3a25ea2cc349d6e59901c4ff469b3b # v2.7.0
with:
version: v3.10.1
- name: Run chart-testing (lint)
run: ct lint --target-branch ${{ github.event.repository.default_branch }} --chart-dirs examples/online_serving/chart-helm --charts examples/online_serving/chart-helm
- name: Setup minio
run: |
docker network create vllm-net
docker run -d -p 9000:9000 --name minio --net vllm-net \
-e "MINIO_ACCESS_KEY=minioadmin" \
-e "MINIO_SECRET_KEY=minioadmin" \
-v /tmp/data:/data \
-v /tmp/config:/root/.minio \
minio/minio server /data
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export AWS_EC2_METADATA_DISABLED=true
mkdir opt-125m
cd opt-125m && curl -O -Ls "https://huggingface.co/facebook/opt-125m/resolve/main/{pytorch_model.bin,config.json,generation_config.json,merges.txt,special_tokens_map.json,tokenizer_config.json,vocab.json}" && cd ..
aws --endpoint-url http://127.0.0.1:9000/ s3 mb s3://testbucket
aws --endpoint-url http://127.0.0.1:9000/ s3 cp opt-125m/ s3://testbucket/opt-125m --recursive
- name: Create kind cluster
uses: helm/kind-action@a1b0e391336a6ee6713a0583f8c6240d70863de3 # v1.12.0
- name: Build the Docker image vllm cpu
run: docker buildx build -f docker/Dockerfile.cpu -t vllm-cpu-env .
- name: Configuration of docker images, network and namespace for the kind cluster
run: |
docker pull amazon/aws-cli:2.6.4
kind load docker-image amazon/aws-cli:2.6.4 --name chart-testing
kind load docker-image vllm-cpu-env:latest --name chart-testing
docker network connect vllm-net "$(docker ps -aqf "name=chart-testing-control-plane")"
kubectl create ns ns-vllm
- name: Run chart-testing (install)
run: |
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
sleep 30 && kubectl -n ns-vllm logs -f "$(kubectl -n ns-vllm get pods | awk '/deployment/ {print $1;exit}')" &
helm install --wait --wait-for-jobs --timeout 5m0s --debug --create-namespace --namespace=ns-vllm test-vllm examples/online_serving/chart-helm -f examples/online_serving/chart-helm/values.yaml --set secrets.s3endpoint=http://minio:9000 --set secrets.s3bucketname=testbucket --set secrets.s3accesskeyid=$AWS_ACCESS_KEY_ID --set secrets.s3accesskey=$AWS_SECRET_ACCESS_KEY --set resources.requests.cpu=1 --set resources.requests.memory=4Gi --set resources.limits.cpu=2 --set resources.limits.memory=5Gi --set image.env[0].name=VLLM_CPU_KVCACHE_SPACE --set image.env[1].name=VLLM_LOGGING_LEVEL --set-string image.env[0].value="1" --set-string image.env[1].value="DEBUG" --set-string extraInit.s3modelpath="opt-125m/" --set-string 'resources.limits.nvidia\.com/gpu=0' --set-string 'resources.requests.nvidia\.com/gpu=0' --set-string image.repository="vllm-cpu-env"
- name: curl test
run: |
kubectl -n ns-vllm port-forward service/test-vllm-service 8001:80 &
sleep 10
CODE="$(curl -v -f --location http://localhost:8001/v1/completions \
--header "Content-Type: application/json" \
--data '{
"model": "opt-125m",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'):$CODE"
echo "$CODE"

View File

@ -1,17 +0,0 @@
{
"problemMatcher": [
{
"owner": "ruff",
"pattern": [
{
"regexp": "^(.+?):(\\d+):(\\d+): (\\w+): (.+)$",
"file": 1,
"line": 2,
"column": 3,
"code": 4,
"message": 5
}
]
}
]
}

View File

@ -1,51 +0,0 @@
name: mypy
on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- main
paths:
- '**/*.py'
- '.github/workflows/mypy.yaml'
- 'tools/mypy.sh'
- 'pyproject.toml'
pull_request:
branches:
- main
# This workflow is only relevant when one of the following files changes.
# However, we have GitHub configured to expect and require this workflow
# to run and pass before GitHub will auto-merge a pull request. Until GitHub
# allows a more flexible auto-merge policy, we can just run this on every PR.
# It doesn't take that long to run, anyway.
#paths:
# - '**/*.py'
# - '.github/workflows/mypy.yaml'
# - 'tools/mypy.sh'
# - 'pyproject.toml'
jobs:
mypy:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install mypy==1.11.1
pip install types-setuptools
pip install types-PyYAML
pip install types-requests
pip install types-setuptools
- name: Mypy
run: |
echo "::add-matcher::.github/workflows/matchers/mypy.json"
tools/mypy.sh 1 ${{ matrix.python-version }}

.github/workflows/pre-commit.yml vendored Normal file
View File

@ -0,0 +1,20 @@
name: pre-commit
on:
pull_request:
push:
branches: [main]
jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- uses: actions/setup-python@42375524e23c412d93fb67b49958b491fce71c38 # v5.4.0
with:
python-version: "3.12"
- run: echo "::add-matcher::.github/workflows/matchers/actionlint.json"
- run: echo "::add-matcher::.github/workflows/matchers/mypy.json"
- uses: pre-commit/action@2c7b3805fd2a0fd8c1884dcaebf91fc102a13ecd # v3.0.1
with:
extra_args: --all-files --hook-stage manual

View File

@ -39,67 +39,68 @@ jobs:
const script = require('.github/workflows/scripts/create_release.js')
await script(github, context, core)
wheel:
name: Build Wheel
runs-on: ${{ matrix.os }}
needs: release
# NOTE(simon): We no longer build wheels using GitHub Actions. See Buildkite's release workflow.
# wheel:
# name: Build Wheel
# runs-on: ${{ matrix.os }}
# needs: release
strategy:
fail-fast: false
matrix:
os: ['ubuntu-20.04']
python-version: ['3.9', '3.10', '3.11', '3.12']
pytorch-version: ['2.4.0'] # Must be the most recent version that meets requirements-cuda.txt.
cuda-version: ['11.8', '12.1']
# strategy:
# fail-fast: false
# matrix:
# os: ['ubuntu-20.04']
# python-version: ['3.9', '3.10', '3.11', '3.12']
# pytorch-version: ['2.4.0'] # Must be the most recent version that meets requirements/cuda.txt.
# cuda-version: ['11.8', '12.1']
steps:
- name: Checkout
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
# steps:
# - name: Checkout
# uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Setup ccache
uses: hendrikmuhs/ccache-action@ed74d11c0b343532753ecead8a951bb09bb34bc9 # v1.2.14
with:
create-symlink: true
key: ${{ github.job }}-${{ matrix.python-version }}-${{ matrix.cuda-version }}
# - name: Setup ccache
# uses: hendrikmuhs/ccache-action@ed74d11c0b343532753ecead8a951bb09bb34bc9 # v1.2.14
# with:
# create-symlink: true
# key: ${{ github.job }}-${{ matrix.python-version }}-${{ matrix.cuda-version }}
- name: Set up Linux Env
if: ${{ runner.os == 'Linux' }}
run: |
bash -x .github/workflows/scripts/env.sh
# - name: Set up Linux Env
# if: ${{ runner.os == 'Linux' }}
# run: |
# bash -x .github/workflows/scripts/env.sh
- name: Set up Python
uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: ${{ matrix.python-version }}
# - name: Set up Python
# uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
# with:
# python-version: ${{ matrix.python-version }}
- name: Install CUDA ${{ matrix.cuda-version }}
run: |
bash -x .github/workflows/scripts/cuda-install.sh ${{ matrix.cuda-version }} ${{ matrix.os }}
# - name: Install CUDA ${{ matrix.cuda-version }}
# run: |
# bash -x .github/workflows/scripts/cuda-install.sh ${{ matrix.cuda-version }} ${{ matrix.os }}
- name: Install PyTorch ${{ matrix.pytorch-version }} with CUDA ${{ matrix.cuda-version }}
run: |
bash -x .github/workflows/scripts/pytorch-install.sh ${{ matrix.python-version }} ${{ matrix.pytorch-version }} ${{ matrix.cuda-version }}
# - name: Install PyTorch ${{ matrix.pytorch-version }} with CUDA ${{ matrix.cuda-version }}
# run: |
# bash -x .github/workflows/scripts/pytorch-install.sh ${{ matrix.python-version }} ${{ matrix.pytorch-version }} ${{ matrix.cuda-version }}
- name: Build wheel
shell: bash
env:
CMAKE_BUILD_TYPE: Release # do not compile with debug symbol to reduce wheel size
run: |
bash -x .github/workflows/scripts/build.sh ${{ matrix.python-version }} ${{ matrix.cuda-version }}
wheel_name=$(find dist -name "*whl" -print0 | xargs -0 -n 1 basename)
asset_name=${wheel_name//"linux"/"manylinux1"}
echo "wheel_name=${wheel_name}" >> "$GITHUB_ENV"
echo "asset_name=${asset_name}" >> "$GITHUB_ENV"
# - name: Build wheel
# shell: bash
# env:
# CMAKE_BUILD_TYPE: Release # do not compile with debug symbol to reduce wheel size
# run: |
# bash -x .github/workflows/scripts/build.sh ${{ matrix.python-version }} ${{ matrix.cuda-version }}
# wheel_name=$(find dist -name "*whl" -print0 | xargs -0 -n 1 basename)
# asset_name=${wheel_name//"linux"/"manylinux1"}
# echo "wheel_name=${wheel_name}" >> "$GITHUB_ENV"
# echo "asset_name=${asset_name}" >> "$GITHUB_ENV"
- name: Upload Release Asset
uses: actions/upload-release-asset@e8f9f06c4b078e705bd2ea027f0926603fc9b4d5 # v1.0.2
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
upload_url: ${{ needs.release.outputs.upload_url }}
asset_path: ./dist/${{ env.wheel_name }}
asset_name: ${{ env.asset_name }}
asset_content_type: application/*
# - name: Upload Release Asset
# uses: actions/upload-release-asset@e8f9f06c4b078e705bd2ea027f0926603fc9b4d5 # v1.0.2
# env:
# GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
# with:
# upload_url: ${{ needs.release.outputs.upload_url }}
# asset_path: ./dist/${{ env.wheel_name }}
# asset_name: ${{ env.asset_name }}
# asset_content_type: application/*
# (Danielkinz): This last step will publish the .whl to pypi. Warning: untested
# - name: Publish package

View File

@ -2,7 +2,6 @@ name: PR Reminder Comment Bot
on:
pull_request_target:
types: [opened]
jobs:
pr_reminder:
runs-on: ubuntu-latest
@ -15,7 +14,12 @@ jobs:
owner: context.repo.owner,
repo: context.repo.repo,
issue_number: context.issue.number,
body: '👋 Hi! Thank you for contributing to the vLLM project.\n Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run `fastcheck` CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your `fastcheck` build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping `simon-mo` or `khluu` to add you in our Buildkite org. \n\nOnce the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.\n\n To run CI, PR reviewers can do one of these:\n- Add `ready` label to the PR\n- Enable auto-merge.\n\n🚀'
body: '👋 Hi! Thank you for contributing to the vLLM project.\n\n' +
'💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.\n\n' +
'Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run `fastcheck` CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your `fastcheck` build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping `simon-mo` or `khluu` to add you in our Buildkite org.\n\n' +
'Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.\n\n' +
'To run CI, PR reviewers can either: Add `ready` label to the PR or enable auto-merge.\n\n' +
'🚀'
})
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

View File

@ -1,52 +0,0 @@
name: ruff
on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- main
paths:
- "**/*.py"
- pyproject.toml
- requirements-lint.txt
- .github/workflows/matchers/ruff.json
- .github/workflows/ruff.yml
pull_request:
branches:
- main
# This workflow is only relevant when one of the following files changes.
# However, we have github configured to expect and require this workflow
# to run and pass before github will auto-merge a pull request. Until github
# allows more flexible auto-merge policy, we can just run this on every PR.
# It doesn't take that long to run, anyway.
#paths:
# - "**/*.py"
# - pyproject.toml
# - requirements-lint.txt
# - .github/workflows/matchers/ruff.json
# - .github/workflows/ruff.yml
jobs:
ruff:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.12"]
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -r requirements-lint.txt
- name: Analysing the code with ruff
run: |
echo "::add-matcher::.github/workflows/matchers/ruff.json"
ruff check --output-format github .
- name: Run isort
run: |
isort . --check-only

View File

@ -9,7 +9,7 @@ PATH=${cuda_home}/bin:$PATH
LD_LIBRARY_PATH=${cuda_home}/lib64:$LD_LIBRARY_PATH
# Install requirements
$python_executable -m pip install -r requirements-build.txt -r requirements-cuda.txt
$python_executable -m pip install -r requirements/build.txt -r requirements/cuda.txt
# Limit the number of parallel jobs to avoid OOM
export MAX_JOBS=1
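
For reference, a hypothetical local invocation of this build script might look like the following; the Python and CUDA versions are illustrative values taken from the (now commented-out) wheel matrix above, not a prescribed configuration.

```bash
# Sketch only: invoke the wheel build script directly, as the (commented-out)
# workflow did. "3.12" and "12.1" are example Python/CUDA versions.
bash -x .github/workflows/scripts/build.sh 3.12 12.1
# The resulting wheel lands in dist/, e.g. dist/vllm-*.whl
```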

View File

@ -1,4 +1,4 @@
// Uses Github's API to create the release and wait for result.
// Uses GitHub's API to create the release and wait for result.
// We use a JS script since github CLI doesn't provide a way to wait for the release's creation and returns immediately.
module.exports = async (github, context, core) => {

View File

@ -1,37 +0,0 @@
name: Lint shell scripts
on:
push:
branches:
- "main"
paths:
- '**/*.sh'
- '.github/workflows/shellcheck.yml'
pull_request:
branches:
- "main"
paths:
- '**/*.sh'
- '.github/workflows/shellcheck.yml'
env:
LC_ALL: en_US.UTF-8
defaults:
run:
shell: bash
permissions:
contents: read
jobs:
shellcheck:
runs-on: ubuntu-latest
steps:
- name: "Checkout"
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
fetch-depth: 0
- name: "Check shell scripts"
run: |
tools/shellcheck.sh

View File

@ -13,7 +13,7 @@ jobs:
actions: write
runs-on: ubuntu-latest
steps:
- uses: actions/stale@28ca1036281a5e5922ead5184a1bbf96e5fc984e # v9.0.0
- uses: actions/stale@5bef64f19d7facfb25b37b414482c7164d639639 # v9.1.0
with:
# Increasing this value ensures that changes to this workflow
# propagate to all issues and PRs in days rather than months

View File

@ -1,38 +0,0 @@
name: yapf
on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- main
paths:
- "**/*.py"
- .github/workflows/yapf.yml
pull_request:
branches:
- main
paths:
- "**/*.py"
- .github/workflows/yapf.yml
jobs:
yapf:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.12"]
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install yapf==0.32.0
pip install toml==0.10.2
- name: Running yapf
run: |
yapf --diff --recursive .

.gitignore vendored
View File

@ -2,7 +2,8 @@
/vllm/_version.py
# vllm-flash-attn built from source
vllm/vllm_flash_attn/
vllm/vllm_flash_attn/*
!vllm/vllm_flash_attn/fa_utils.py
# Byte-compiled / optimized / DLL files
__pycache__/
@ -79,8 +80,7 @@ instance/
# Sphinx documentation
docs/_build/
docs/source/getting_started/examples/*.rst
!**/*.template.rst
docs/source/getting_started/examples/
# PyBuilder
.pybuilder/
@ -198,7 +198,7 @@ _build/
hip_compat.h
# Benchmark dataset
benchmarks/*.json
benchmarks/**/*.json
# Linting
actionlint

.pre-commit-config.yaml Normal file
View File

@ -0,0 +1,132 @@
default_install_hook_types:
- pre-commit
- commit-msg
default_stages:
- pre-commit # Run locally
- manual # Run in CI
exclude: 'vllm/third_party/.*'
repos:
- repo: https://github.com/google/yapf
rev: v0.43.0
hooks:
- id: yapf
args: [--in-place, --verbose]
additional_dependencies: [toml] # TODO: Remove when yapf is upgraded
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.9.3
hooks:
- id: ruff
args: [--output-format, github, --fix]
- repo: https://github.com/codespell-project/codespell
rev: v2.4.0
hooks:
- id: codespell
additional_dependencies: ['tomli']
args: ['--toml', 'pyproject.toml']
- repo: https://github.com/PyCQA/isort
rev: 0a0b7a830386ba6a31c2ec8316849ae4d1b8240d # 6.0.0
hooks:
- id: isort
- repo: https://github.com/pre-commit/mirrors-clang-format
rev: v19.1.7
hooks:
- id: clang-format
exclude: 'csrc/(moe/topk_softmax_kernels.cu|quantization/gguf/(ggml-common.h|dequantize.cuh|vecdotq.cuh|mmq.cuh|mmvq.cuh))|vllm/third_party/.*'
types_or: [c++, cuda]
args: [--style=file, --verbose]
- repo: https://github.com/jackdewinter/pymarkdown
rev: v0.9.27
hooks:
- id: pymarkdown
args: [fix]
- repo: https://github.com/rhysd/actionlint
rev: v1.7.7
hooks:
- id: actionlint
- repo: https://github.com/astral-sh/uv-pre-commit
rev: 0.6.2
hooks:
- id: pip-compile
args: [requirements/test.in, -o, requirements/test.txt]
files: ^requirements/test\.(in|txt)$
- repo: local
hooks:
- id: mypy-local
name: Run mypy for local Python installation
entry: tools/mypy.sh 0 "local"
language: python
types: [python]
additional_dependencies: &mypy_deps [mypy==1.11.1, types-cachetools, types-setuptools, types-PyYAML, types-requests]
stages: [pre-commit] # Don't run in CI
- id: mypy-3.9 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.9
entry: tools/mypy.sh 1 "3.9"
language: python
types: [python]
additional_dependencies: *mypy_deps
stages: [manual] # Only run in CI
- id: mypy-3.10 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.10
entry: tools/mypy.sh 1 "3.10"
language: python
types: [python]
additional_dependencies: *mypy_deps
stages: [manual] # Only run in CI
- id: mypy-3.11 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.11
entry: tools/mypy.sh 1 "3.11"
language: python
types: [python]
additional_dependencies: *mypy_deps
stages: [manual] # Only run in CI
- id: mypy-3.12 # TODO: Use https://github.com/pre-commit/mirrors-mypy when mypy setup is less awkward
name: Run mypy for Python 3.12
entry: tools/mypy.sh 1 "3.12"
language: python
types: [python]
additional_dependencies: *mypy_deps
stages: [manual] # Only run in CI
- id: shellcheck
name: Lint shell scripts
entry: tools/shellcheck.sh
language: script
types: [shell]
- id: png-lint
name: Lint PNG exports from excalidraw
entry: tools/png-lint.sh
language: script
types: [png]
- id: signoff-commit
name: Sign-off Commit
entry: bash
args:
- -c
- |
if ! grep -q "^Signed-off-by: $(git config user.name) <$(git config user.email)>" .git/COMMIT_EDITMSG; then
printf "\nSigned-off-by: $(git config user.name) <$(git config user.email)>\n" >> .git/COMMIT_EDITMSG
fi
language: system
verbose: true
stages: [commit-msg]
- id: check-spdx-header
name: Check SPDX headers
entry: python tools/check_spdx_header.py
language: python
types: [python]
- id: check-filenames
name: Check for spaces in all filenames
entry: bash
args:
- -c
- 'git ls-files | grep " " && echo "Filenames should not contain spaces!" && exit 1 || exit 0'
language: system
always_run: true
pass_filenames: false
# Keep `suggestion` last
- id: suggestion
name: Suggestion
entry: bash -c 'echo "To bypass pre-commit hooks, add --no-verify to git commit."'
language: system
verbose: true
pass_filenames: false
# Insert new entries above the `suggestion` entry
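
A minimal sketch of how a contributor might run these hooks locally, assuming `pre-commit` is installed in the active Python environment; the manual stage mirrors what CI runs via the pre-commit action shown earlier (`--all-files --hook-stage manual`).

```bash
# Install pre-commit and register the hook types declared above
pip install pre-commit
pre-commit install            # sets up the pre-commit and commit-msg hooks
# Run the local (pre-commit stage) hooks against the whole repo
pre-commit run --all-files
# Approximate the CI invocation, which also runs manual-stage hooks (the mypy-3.x entries)
pre-commit run --all-files --hook-stage manual
```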

View File

@ -18,4 +18,4 @@ formats: []
# Optionally declare the Python requirements required to build your docs
python:
install:
- requirements: docs/requirements-docs.txt
- requirements: requirements/docs.txt

View File

@ -24,9 +24,6 @@ include(${CMAKE_CURRENT_LIST_DIR}/cmake/utils.cmake)
# Suppress potential warnings about unused manually-specified variables
set(ignoreMe "${VLLM_PYTHON_PATH}")
# Prevent installation of dependencies (cutlass) by default.
install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY TRUE)" ALL_COMPONENTS)
#
# Supported python versions. These versions will be searched in order, the
# first match will be selected. These should be kept in sync with setup.py.
@ -34,10 +31,10 @@ install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY TRUE)" ALL_COMPONENTS)
set(PYTHON_SUPPORTED_VERSIONS "3.9" "3.10" "3.11" "3.12")
# Supported NVIDIA architectures.
set(CUDA_SUPPORTED_ARCHS "7.0;7.5;8.0;8.6;8.9;9.0")
set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0")
# Supported AMD GPU architectures.
set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx1100;gfx1101")
set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1200;gfx1201")
#
# Supported/expected torch versions for CUDA/ROCm.
@ -47,10 +44,10 @@ set(HIP_SUPPORTED_ARCHS "gfx906;gfx908;gfx90a;gfx940;gfx941;gfx942;gfx1030;gfx11
#
# Note: the CUDA torch version is derived from pyproject.toml and various
# requirements.txt files and should be kept consistent. The ROCm torch
# versions are derived from Dockerfile.rocm
# versions are derived from docker/Dockerfile.rocm
#
set(TORCH_SUPPORTED_VERSION_CUDA "2.5.1")
set(TORCH_SUPPORTED_VERSION_ROCM "2.5.1")
set(TORCH_SUPPORTED_VERSION_CUDA "2.6.0")
set(TORCH_SUPPORTED_VERSION_ROCM "2.6.0")
#
# Try to find python package with an executable that exactly matches
@ -177,10 +174,54 @@ include(FetchContent)
file(MAKE_DIRECTORY ${FETCHCONTENT_BASE_DIR}) # Ensure the directory exists
message(STATUS "FetchContent base directory: ${FETCHCONTENT_BASE_DIR}")
#
# Set rocm version dev int.
#
if(VLLM_GPU_LANG STREQUAL "HIP")
#
# Overriding the default -O set up by cmake, adding ggdb3 for the most verbose debug info
#
set(CMAKE_${VLLM_GPU_LANG}_FLAGS_DEBUG "${CMAKE_${VLLM_GPU_LANG}_FLAGS_DEBUG} -O0 -ggdb3")
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -O0 -ggdb3")
#
# Certain HIP functions are marked as [[nodiscard]], yet vllm ignores the result which generates
# a lot of warnings that always mask real issues. Suppressing until this is properly addressed.
#
set(CMAKE_${VLLM_GPU_LANG}_FLAGS "${CMAKE_${VLLM_GPU_LANG}_FLAGS} -Wno-unused-result")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-unused-result")
endif()
#
# Define other extension targets
#
#
# cumem_allocator extension
#
set(VLLM_CUMEM_EXT_SRC
"csrc/cumem_allocator.cpp")
set_gencode_flags_for_srcs(
SRCS "${VLLM_CUMEM_EXT_SRC}"
CUDA_ARCHS "${CUDA_ARCHS}")
if(VLLM_GPU_LANG STREQUAL "CUDA")
message(STATUS "Enabling cumem allocator extension.")
# link against cuda driver library
list(APPEND CUMEM_LIBS CUDA::cuda_driver)
define_gpu_extension_target(
cumem_allocator
DESTINATION vllm
LANGUAGE CXX
SOURCES ${VLLM_CUMEM_EXT_SRC}
LIBRARIES ${CUMEM_LIBS}
USE_SABI 3.8
WITH_SOABI)
endif()
#
# _C extension
#
@ -193,30 +234,49 @@ set(VLLM_EXT_SRC
"csrc/activation_kernels.cu"
"csrc/layernorm_kernels.cu"
"csrc/layernorm_quant_kernels.cu"
"csrc/cuda_view.cu"
"csrc/quantization/gptq/q_gemm.cu"
"csrc/quantization/compressed_tensors/int8_quant_kernels.cu"
"csrc/quantization/fp8/common.cu"
"csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu"
"csrc/quantization/gguf/gguf_kernel.cu"
"csrc/cuda_utils_kernels.cu"
"csrc/prepare_inputs/advance_step.cu"
"csrc/custom_all_reduce.cu"
"csrc/torch_bindings.cpp")
if(VLLM_GPU_LANG STREQUAL "CUDA")
SET(CUTLASS_ENABLE_HEADERS_ONLY ON CACHE BOOL "Enable only the header library")
# Set CUTLASS_REVISION manually -- its revision detection doesn't work in this case.
set(CUTLASS_REVISION "v3.5.1" CACHE STRING "CUTLASS revision to use")
# Please keep this in sync with FetchContent_Declare line below.
set(CUTLASS_REVISION "v3.8.0" CACHE STRING "CUTLASS revision to use")
FetchContent_Declare(
# Use the specified CUTLASS source directory for compilation if VLLM_CUTLASS_SRC_DIR is provided
if (DEFINED ENV{VLLM_CUTLASS_SRC_DIR})
set(VLLM_CUTLASS_SRC_DIR $ENV{VLLM_CUTLASS_SRC_DIR})
endif()
if(VLLM_CUTLASS_SRC_DIR)
if(NOT IS_ABSOLUTE VLLM_CUTLASS_SRC_DIR)
get_filename_component(VLLM_CUTLASS_SRC_DIR "${VLLM_CUTLASS_SRC_DIR}" ABSOLUTE)
endif()
message(STATUS "The VLLM_CUTLASS_SRC_DIR is set, using ${VLLM_CUTLASS_SRC_DIR} for compilation")
FetchContent_Declare(cutlass SOURCE_DIR ${VLLM_CUTLASS_SRC_DIR})
else()
FetchContent_Declare(
cutlass
GIT_REPOSITORY https://github.com/nvidia/cutlass.git
GIT_TAG v3.5.1
# Please keep this in sync with CUTLASS_REVISION line above.
GIT_TAG v3.8.0
GIT_PROGRESS TRUE
# Speed up CUTLASS download by retrieving only the specified GIT_TAG instead of the history.
# Important: If GIT_SHALLOW is enabled then GIT_TAG works only with branch names and tags.
# So if the GIT_TAG above is updated to a commit hash, GIT_SHALLOW must be set to FALSE
GIT_SHALLOW TRUE
)
)
endif()
FetchContent_MakeAvailable(cutlass)
list(APPEND VLLM_EXT_SRC
@ -224,10 +284,12 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
"csrc/mamba/causal_conv1d/causal_conv1d.cu"
"csrc/quantization/aqlm/gemm_kernels.cu"
"csrc/quantization/awq/gemm_kernels.cu"
"csrc/quantization/gguf/gguf_kernel.cu"
"csrc/custom_all_reduce.cu"
"csrc/permute_cols.cu"
"csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu")
"csrc/quantization/cutlass_w8a8/scaled_mm_entry.cu"
"csrc/quantization/fp4/nvfp4_quant_entry.cu"
"csrc/quantization/fp4/nvfp4_scaled_mm_entry.cu"
"csrc/sparse/cutlass/sparse_scaled_mm_entry.cu"
"csrc/cutlass_extensions/common.cpp")
set_gencode_flags_for_srcs(
SRCS "${VLLM_EXT_SRC}"
@ -236,7 +298,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# Only build Marlin kernels if we are building for at least some compatible archs.
# Keep building Marlin for 9.0 as there are some group sizes and shapes that
# are not supported by Machete yet.
cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.9;9.0" ${CUDA_ARCHS})
cuda_archs_loose_intersection(MARLIN_ARCHS "8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0" "${CUDA_ARCHS}")
if (MARLIN_ARCHS)
set(MARLIN_SRCS
"csrc/quantization/fp8/fp8_marlin.cu"
@ -256,39 +318,87 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
" in CUDA target architectures")
endif()
#
# Only build AllSpark kernels if we are building for at least some compatible archs.
cuda_archs_loose_intersection(ALLSPARK_ARCHS "8.0;8.6;8.7;8.9" "${CUDA_ARCHS}")
if (ALLSPARK_ARCHS)
set(ALLSPARK_SRCS
"csrc/quantization/gptq_allspark/allspark_repack.cu"
"csrc/quantization/gptq_allspark/allspark_qgemm_w8a16.cu")
set_gencode_flags_for_srcs(
SRCS "${ALLSPARK_SRCS}"
CUDA_ARCHS "${ALLSPARK_ARCHS}")
list(APPEND VLLM_EXT_SRC "${ALLSPARK_SRCS}")
message(STATUS "Building AllSpark kernels for archs: ${ALLSPARK_ARCHS}")
else()
message(STATUS "Not building AllSpark kernels as no compatible archs found"
" in CUDA target architectures")
endif()
set(SCALED_MM_3X_ARCHS)
# The cutlass_scaled_mm kernels for Hopper (c3x, i.e. CUTLASS 3.x) require
# CUDA 12.0 or later (and only work on Hopper, 9.0/9.0a for now).
cuda_archs_loose_intersection(SCALED_MM_3X_ARCHS "9.0;9.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS)
set(SRCS "csrc/quantization/cutlass_w8a8/scaled_mm_c3x.cu")
# CUDA 12.0 or later
cuda_archs_loose_intersection(SCALED_MM_ARCHS "9.0a;" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_ARCHS)
set(SRCS
"csrc/quantization/cutlass_w8a8/scaled_mm_c3x_sm90.cu"
"csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm90_fp8.cu"
"csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm90_int8.cu"
"csrc/quantization/cutlass_w8a8/c3x/scaled_mm_azp_sm90_int8.cu"
"csrc/quantization/cutlass_w8a8/c3x/scaled_mm_blockwise_sm90_fp8.cu")
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${SCALED_MM_3X_ARCHS}")
CUDA_ARCHS "${SCALED_MM_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
list(APPEND VLLM_GPU_FLAGS "-DENABLE_SCALED_MM_C3X=1")
message(STATUS "Building scaled_mm_c3x for archs: ${SCALED_MM_3X_ARCHS}")
list(APPEND VLLM_GPU_FLAGS "-DENABLE_SCALED_MM_SM90=1")
# Let scaled_mm_c2x know it doesn't need to build these arches
list(APPEND SCALED_MM_3X_ARCHS "${SCALED_MM_ARCHS}")
message(STATUS "Building scaled_mm_c3x_sm90 for archs: ${SCALED_MM_ARCHS}")
else()
if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_3X_ARCHS)
message(STATUS "Not building scaled_mm_c3x as CUDA Compiler version is "
if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0 AND SCALED_MM_ARCHS)
message(STATUS "Not building scaled_mm_c3x_sm90 as CUDA Compiler version is "
"not >= 12.0, we recommend upgrading to CUDA 12.0 or "
"later if you intend on running FP8 quantized models on "
"Hopper.")
else()
message(STATUS "Not building scaled_mm_c3x as no compatible archs found "
message(STATUS "Not building scaled_mm_c3x_sm90 as no compatible archs found "
"in CUDA target architectures")
endif()
endif()
# clear SCALED_MM_3X_ARCHS so the scaled_mm_c2x kernels know we didn't
# build any 3x kernels
set(SCALED_MM_3X_ARCHS)
# The cutlass_scaled_mm kernels for Blackwell (c3x, i.e. CUTLASS 3.x) require
# CUDA 12.8 or later
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;12.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND SCALED_MM_ARCHS)
set(SRCS
"csrc/quantization/cutlass_w8a8/scaled_mm_c3x_sm100.cu"
"csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm100_fp8.cu"
)
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${SCALED_MM_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
list(APPEND VLLM_GPU_FLAGS "-DENABLE_SCALED_MM_SM100=1")
# Let scaled_mm_c2x know it doesn't need to build these arches
list(APPEND SCALED_MM_3X_ARCHS "${SCALED_MM_ARCHS}")
message(STATUS "Building scaled_mm_c3x_sm100 for archs: ${SCALED_MM_ARCHS}")
else()
if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND SCALED_MM_ARCHS)
message(STATUS "Not building scaled_mm_c3x_sm100 as CUDA Compiler version is "
"not >= 12.8, we recommend upgrading to CUDA 12.8 or "
"later if you intend on running FP8 quantized models on "
"Blackwell.")
else()
message(STATUS "Not building scaled_mm_c3x_100 as no compatible archs found "
"in CUDA target architectures")
endif()
endif()
#
# For the cutlass_scaled_mm kernels we want to build the c2x (CUTLASS 2.x)
# kernels for the remaining archs that are not already built for 3x.
cuda_archs_loose_intersection(SCALED_MM_2X_ARCHS
"7.5;8.0;8.6;8.9;9.0" "${CUDA_ARCHS}")
"7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0" "${CUDA_ARCHS}")
# subtract out the archs that are already built for 3x
list(REMOVE_ITEM SCALED_MM_2X_ARCHS ${SCALED_MM_3X_ARCHS})
if (SCALED_MM_2X_ARCHS)
@ -309,6 +419,75 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
endif()
endif()
#
# 2:4 Sparse Kernels
# The 2:4 sparse kernels cutlass_scaled_sparse_mm and cutlass_compressor
# require CUDA 12.2 or later (and only work on Hopper).
cuda_archs_loose_intersection(SCALED_MM_ARCHS "9.0a;" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_ARCHS)
set(SRCS "csrc/sparse/cutlass/sparse_scaled_mm_c3x.cu")
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${SCALED_MM_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
list(APPEND VLLM_GPU_FLAGS "-DENABLE_SPARSE_SCALED_MM_C3X=1")
message(STATUS "Building sparse_scaled_mm_c3x for archs: ${SCALED_MM_ARCHS}")
else()
if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.2 AND SCALED_MM_ARCHS)
message(STATUS "Not building sparse_scaled_mm_c3x kernels as CUDA Compiler version is "
"not >= 12.2, we recommend upgrading to CUDA 12.2 or later "
"if you intend on running FP8 sparse quantized models on Hopper.")
else()
message(STATUS "Not building sparse_scaled_mm_c3x as no compatible archs found "
"in CUDA target architectures")
endif()
endif()
# FP4 Archs and flags
cuda_archs_loose_intersection(FP4_ARCHS "10.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND FP4_ARCHS)
set(SRCS
"csrc/quantization/fp4/nvfp4_quant_kernels.cu"
"csrc/quantization/fp4/nvfp4_scaled_mm_kernels.cu")
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${FP4_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
list(APPEND VLLM_GPU_FLAGS "-DENABLE_NVFP4=1")
message(STATUS "Building NVFP4 for archs: ${FP4_ARCHS}")
else()
message(STATUS "Not building NVFP4 as no compatible archs were found.")
# clear FP4_ARCHS
set(FP4_ARCHS)
endif()
#
# CUTLASS MoE kernels
# The MoE kernel cutlass_moe_mm requires CUDA 12.3 or later (and only works
# on Hopper). get_cutlass_moe_mm_data should only be compiled if it's possible
# to compile MoE kernels that use its output.
cuda_archs_loose_intersection(SCALED_MM_ARCHS "9.0a;" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3 AND SCALED_MM_ARCHS)
set(SRCS "csrc/quantization/cutlass_w8a8/moe/grouped_mm_c3x.cu"
"csrc/quantization/cutlass_w8a8/moe/moe_data.cu")
set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${SCALED_MM_ARCHS}")
list(APPEND VLLM_EXT_SRC "${SRCS}")
list(APPEND VLLM_GPU_FLAGS "-DENABLE_CUTLASS_MOE_SM90=1")
message(STATUS "Building grouped_mm_c3x for archs: ${SCALED_MM_ARCHS}")
else()
if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.3 AND SCALED_MM_ARCHS)
message(STATUS "Not building grouped_mm_c3x kernels as CUDA Compiler version is "
"not >= 12.3, we recommend upgrading to CUDA 12.3 or later "
"if you intend on running FP8 quantized MoE models on Hopper.")
else()
message(STATUS "Not building grouped_mm_c3x as no compatible archs found "
"in CUDA target architectures")
endif()
endif()
#
# Machete kernels
@ -391,6 +570,7 @@ define_gpu_extension_target(
COMPILE_FLAGS ${VLLM_GPU_FLAGS}
ARCHITECTURES ${VLLM_GPU_ARCHES}
INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR}
INCLUDE_DIRECTORIES ${CUTLASS_TOOLS_UTIL_INCLUDE_DIR}
USE_SABI 3
WITH_SOABI)
@ -409,12 +589,24 @@ set(VLLM_MOE_EXT_SRC
"csrc/moe/moe_align_sum_kernels.cu"
"csrc/moe/topk_softmax_kernels.cu")
if(VLLM_GPU_LANG STREQUAL "CUDA")
list(APPEND VLLM_MOE_EXT_SRC "csrc/moe/moe_wna16.cu")
endif()
set_gencode_flags_for_srcs(
SRCS "${VLLM_MOE_EXT_SRC}"
CUDA_ARCHS "${CUDA_ARCHS}")
if(VLLM_GPU_LANG STREQUAL "CUDA")
cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0;8.6;8.9;9.0" "${CUDA_ARCHS}")
set(VLLM_MOE_WNA16_SRC
"csrc/moe/moe_wna16.cu")
set_gencode_flags_for_srcs(
SRCS "${VLLM_MOE_WNA16_SRC}"
CUDA_ARCHS "${CUDA_ARCHS}")
list(APPEND VLLM_MOE_EXT_SRC "${VLLM_MOE_WNA16_SRC}")
cuda_archs_loose_intersection(MARLIN_MOE_ARCHS "8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0" "${CUDA_ARCHS}")
if (MARLIN_MOE_ARCHS)
set(MARLIN_MOE_SRC
"csrc/moe/marlin_kernels/marlin_moe_kernel.h"
@ -468,79 +660,8 @@ if(VLLM_GPU_LANG STREQUAL "HIP")
WITH_SOABI)
endif()
# vllm-flash-attn currently only supported on CUDA
if (NOT VLLM_TARGET_DEVICE STREQUAL "cuda")
return()
# For CUDA we also build and ship some external projects.
if (VLLM_GPU_LANG STREQUAL "CUDA")
include(cmake/external_projects/flashmla.cmake)
include(cmake/external_projects/vllm_flash_attn.cmake)
endif ()
# vLLM flash attention requires VLLM_GPU_ARCHES to contain the set of target
# arches in the CMake syntax (75-real, 89-virtual, etc), since we clear the
# arches in the CUDA case (and instead set the gencodes on a per file basis)
# we need to manually set VLLM_GPU_ARCHES here.
if(VLLM_GPU_LANG STREQUAL "CUDA")
foreach(_ARCH ${CUDA_ARCHS})
string(REPLACE "." "" _ARCH "${_ARCH}")
list(APPEND VLLM_GPU_ARCHES "${_ARCH}-real")
endforeach()
endif()
#
# Build vLLM flash attention from source
#
# IMPORTANT: This has to be the last thing we do, because vllm-flash-attn uses the same macros/functions as vLLM.
# Because functions all belong to the global scope, vllm-flash-attn's functions overwrite vLLMs.
# They should be identical but if they aren't, this is a massive footgun.
#
# The vllm-flash-attn install rules are nested under vllm to make sure the library gets installed in the correct place.
# To only install vllm-flash-attn, use --component vllm_flash_attn_c.
# If no component is specified, vllm-flash-attn is still installed.
# If VLLM_FLASH_ATTN_SRC_DIR is set, vllm-flash-attn is installed from that directory instead of downloading.
# This is to enable local development of vllm-flash-attn within vLLM.
# It can be set as an environment variable or passed as a cmake argument.
# The environment variable takes precedence.
if (DEFINED ENV{VLLM_FLASH_ATTN_SRC_DIR})
set(VLLM_FLASH_ATTN_SRC_DIR $ENV{VLLM_FLASH_ATTN_SRC_DIR})
endif()
if(VLLM_FLASH_ATTN_SRC_DIR)
FetchContent_Declare(vllm-flash-attn SOURCE_DIR ${VLLM_FLASH_ATTN_SRC_DIR})
else()
FetchContent_Declare(
vllm-flash-attn
GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
GIT_TAG 5259c586c403a4e4d8bf69973c159b40cc346fb9
GIT_PROGRESS TRUE
# Don't share the vllm-flash-attn build between build types
BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
)
endif()
# Set the parent build flag so that the vllm-flash-attn library does not redo compile flag and arch initialization.
set(VLLM_PARENT_BUILD ON)
# Ensure the vllm/vllm_flash_attn directory exists before installation
install(CODE "file(MAKE_DIRECTORY \"\${CMAKE_INSTALL_PREFIX}/vllm/vllm_flash_attn\")" COMPONENT vllm_flash_attn_c)
# Make sure vllm-flash-attn install rules are nested under vllm/
install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY FALSE)" COMPONENT vllm_flash_attn_c)
install(CODE "set(OLD_CMAKE_INSTALL_PREFIX \"\${CMAKE_INSTALL_PREFIX}\")" COMPONENT vllm_flash_attn_c)
install(CODE "set(CMAKE_INSTALL_PREFIX \"\${CMAKE_INSTALL_PREFIX}/vllm/\")" COMPONENT vllm_flash_attn_c)
# Fetch the vllm-flash-attn library
FetchContent_MakeAvailable(vllm-flash-attn)
message(STATUS "vllm-flash-attn is available at ${vllm-flash-attn_SOURCE_DIR}")
# Restore the install prefix
install(CODE "set(CMAKE_INSTALL_PREFIX \"\${OLD_CMAKE_INSTALL_PREFIX}\")" COMPONENT vllm_flash_attn_c)
install(CODE "set(CMAKE_INSTALL_LOCAL_ONLY TRUE)" COMPONENT vllm_flash_attn_c)
# Copy over the vllm-flash-attn python files
install(
DIRECTORY ${vllm-flash-attn_SOURCE_DIR}/vllm_flash_attn/
DESTINATION vllm/vllm_flash_attn
COMPONENT vllm_flash_attn_c
FILES_MATCHING PATTERN "*.py"
)
# Nothing after vllm-flash-attn, see comment about macros above
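
The source-directory overrides described above (`VLLM_CUTLASS_SRC_DIR` and `VLLM_FLASH_ATTN_SRC_DIR`) are meant for local development; a hypothetical from-source build using them might look like this. Both paths are placeholders, and the `pip` command is just one common way to trigger the CMake build.

```bash
# Sketch only: build vLLM against local checkouts of CUTLASS and vllm-flash-attn.
# Both paths are placeholders; adjust to wherever the projects are cloned.
export VLLM_CUTLASS_SRC_DIR=/path/to/cutlass
export VLLM_FLASH_ATTN_SRC_DIR=/path/to/flash-attention
pip install -e .
```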

View File

@ -61,7 +61,7 @@ representative at an online or offline/IRL event.
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement in the #code-of-conduct
channel in the [vLLM Discord](https://discord.com/invite/jz7wjKhh6g).
channel in the [vLLM Slack](https://slack.vllm.ai).
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
@ -125,4 +125,3 @@ Community Impact Guidelines were inspired by
For answers to common questions about this code of conduct, see the
[Contributor Covenant FAQ](https://www.contributor-covenant.org/faq). Translations are available at
[Contributor Covenant translations](https://www.contributor-covenant.org/translations).

View File

@ -1,41 +0,0 @@
# default base image
ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04"
FROM $BASE_IMAGE
RUN echo "Base image is $BASE_IMAGE"
# Install some basic utilities
RUN apt-get update && \
apt-get install -y \
git \
python3 \
python3-pip \
ffmpeg libsm6 libxext6 libgl1
### Mount Point ###
# When launching the container, mount the code directory to /app
ARG APP_MOUNT=/app
VOLUME [ ${APP_MOUNT} ]
WORKDIR ${APP_MOUNT}/vllm
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers pandas
RUN python3 -m pip install sentencepiece transformers==4.36.2 -U
RUN python3 -m pip install transformers-neuronx --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
RUN python3 -m pip install --pre neuronx-cc==2.15.* --extra-index-url=https://pip.repos.neuron.amazonaws.com -U
COPY . .
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
RUN python3 -m pip install -U \
'cmake>=3.26' ninja packaging 'setuptools-scm>=8' wheel jinja2 \
-r requirements-neuron.txt
ENV VLLM_TARGET_DEVICE neuron
RUN --mount=type=bind,source=.git,target=.git \
pip install --no-build-isolation -v -e .
CMD ["/bin/bash"]

View File

@ -1,25 +0,0 @@
# The vLLM Dockerfile is used to construct vLLM image that can be directly used
# to run the OpenAI compatible server.
FROM ubuntu:22.04 AS dev
RUN apt-get update -y && \
apt-get install -y \
git python3-pip \
ffmpeg libsm6 libxext6 libgl1
WORKDIR /workspace
COPY . .
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
# install build requirements
RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/requirements-build.txt
# build vLLM with OpenVINO backend
RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" VLLM_TARGET_DEVICE="openvino" python3 -m pip install /workspace
COPY examples/ /workspace/examples
COPY benchmarks/ /workspace/benchmarks
CMD ["/bin/bash"]

View File

@ -1,36 +0,0 @@
FROM mambaorg/micromamba
ARG MAMBA_DOCKERFILE_ACTIVATE=1
USER root
ENV PATH="/usr/local/cargo/bin:$PATH:/opt/conda/bin/"
RUN apt-get update -y && apt-get install -y git wget curl vim libnuma-dev libsndfile-dev libprotobuf-dev build-essential ffmpeg libsm6 libxext6 libgl1
# Some packages in requirements-cpu are installed here
# IBM provides optimized packages for ppc64le processors in the open-ce project for mamba
# Currently these may not be available for venv or pip directly
RUN micromamba install -y -n base -c https://ftp.osuosl.org/pub/open-ce/1.11.0-p10/ -c defaults python=3.10 torchvision-cpu=0.16.2 rust && micromamba clean --all --yes
COPY ./ /workspace/vllm
WORKDIR /workspace/vllm
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh; fi
# These packages will be in rocketce eventually
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -v --prefer-binary --extra-index-url https://repo.fury.io/mgiessing \
'cmake>=3.26' ninja packaging 'setuptools-scm>=8' wheel jinja2 \
torch==2.3.1 \
-r requirements-cpu.txt \
xformers uvloop==0.20.0
RUN --mount=type=bind,source=.git,target=.git \
VLLM_TARGET_DEVICE=cpu python3 setup.py install
WORKDIR /workspace/
RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks
ENTRYPOINT ["/opt/conda/bin/python3", "-m", "vllm.entrypoints.openai.api_server"]

View File

@ -1,171 +0,0 @@
# Default ROCm 6.2 base image
ARG BASE_IMAGE="rocm/pytorch:rocm6.2_ubuntu20.04_py3.9_pytorch_release_2.3.0"
# Default ROCm ARCHes to build vLLM for.
ARG PYTORCH_ROCM_ARCH="gfx908;gfx90a;gfx942;gfx1100"
# Whether to install CK-based flash-attention
# If 0, will not install flash-attention
ARG BUILD_FA="1"
ARG FA_GFX_ARCHS="gfx90a;gfx942"
ARG FA_BRANCH="3cea2fb"
# Whether to build triton on rocm
ARG BUILD_TRITON="1"
ARG TRITON_BRANCH="e192dba"
### Base image build stage
FROM $BASE_IMAGE AS base
# Import arg(s) defined before this build stage
ARG PYTORCH_ROCM_ARCH
# Install some basic utilities
RUN apt-get update && apt-get install python3 python3-pip -y
RUN apt-get update && apt-get install -y \
curl \
ca-certificates \
sudo \
git \
bzip2 \
libx11-6 \
build-essential \
wget \
unzip \
tmux \
ccache \
&& rm -rf /var/lib/apt/lists/*
# When launching the container, mount the code directory to /vllm-workspace
ARG APP_MOUNT=/vllm-workspace
WORKDIR ${APP_MOUNT}
RUN python3 -m pip install --upgrade pip
# Remove sccache so it doesn't interfere with ccache
# TODO: implement sccache support across components
RUN apt-get purge -y sccache; python3 -m pip uninstall -y sccache; rm -f "$(which sccache)"
# Install torch == 2.6.0 on ROCm
RUN --mount=type=cache,target=/root/.cache/pip \
case "$(ls /opt | grep -Po 'rocm-[0-9]\.[0-9]')" in \
*"rocm-6.2"*) \
python3 -m pip uninstall -y torch torchvision \
&& python3 -m pip install --pre \
torch==2.6.0.dev20240918 \
'setuptools-scm>=8' \
torchvision==0.20.0.dev20240918 \
--extra-index-url https://download.pytorch.org/whl/nightly/rocm6.2;; \
*) ;; esac
ENV LLVM_SYMBOLIZER_PATH=/opt/rocm/llvm/bin/llvm-symbolizer
ENV PATH=$PATH:/opt/rocm/bin:/libtorch/bin:
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib/:/libtorch/lib:
ENV CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/libtorch/include:/libtorch/include/torch/csrc/api/include/:/opt/rocm/include/:
ENV PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}
ENV CCACHE_DIR=/root/.cache/ccache
### AMD-SMI build stage
FROM base AS build_amdsmi
# Build amdsmi wheel always
RUN cd /opt/rocm/share/amd_smi \
&& python3 -m pip wheel . --wheel-dir=/install
### Flash-Attention wheel build stage
FROM base AS build_fa
ARG BUILD_FA
ARG FA_GFX_ARCHS
ARG FA_BRANCH
# Build ROCm flash-attention wheel if `BUILD_FA = 1`
RUN --mount=type=cache,target=${CCACHE_DIR} \
if [ "$BUILD_FA" = "1" ]; then \
mkdir -p libs \
&& cd libs \
&& git clone https://github.com/ROCm/flash-attention.git \
&& cd flash-attention \
&& git checkout "${FA_BRANCH}" \
&& git submodule update --init \
&& GPU_ARCHS="${FA_GFX_ARCHS}" python3 setup.py bdist_wheel --dist-dir=/install; \
# Create an empty directory otherwise as later build stages expect one
else mkdir -p /install; \
fi
### Triton wheel build stage
FROM base AS build_triton
ARG BUILD_TRITON
ARG TRITON_BRANCH
# Build triton wheel if `BUILD_TRITON = 1`
RUN --mount=type=cache,target=${CCACHE_DIR} \
if [ "$BUILD_TRITON" = "1" ]; then \
mkdir -p libs \
&& cd libs \
&& python3 -m pip install ninja cmake wheel pybind11 \
&& git clone https://github.com/OpenAI/triton.git \
&& cd triton \
&& git checkout "${TRITON_BRANCH}" \
&& cd python \
&& python3 setup.py bdist_wheel --dist-dir=/install; \
# Create an empty directory otherwise as later build stages expect one
else mkdir -p /install; \
fi
### Final vLLM build stage
FROM base AS final
# Import the vLLM development directory from the build context
COPY . .
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
RUN python3 -m pip install --upgrade pip
# Package upgrades for useful functionality or to avoid dependency issues
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install --upgrade numba scipy huggingface-hub[cli] pytest-shard
# Workaround for ray >= 2.10.0
ENV RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
# Silences the HF Tokenizers warning
ENV TOKENIZERS_PARALLELISM=false
RUN --mount=type=cache,target=${CCACHE_DIR} \
--mount=type=bind,source=.git,target=.git \
--mount=type=cache,target=/root/.cache/pip \
python3 -m pip install -Ur requirements-rocm.txt \
&& python3 setup.py clean --all \
&& python3 setup.py develop
# Copy amdsmi wheel into final image
RUN --mount=type=bind,from=build_amdsmi,src=/install,target=/install \
mkdir -p libs \
&& cp /install/*.whl libs \
# Preemptively uninstall to avoid same-version no-installs
&& python3 -m pip uninstall -y amdsmi;
# Copy triton wheel(s) into final image if they were built
RUN --mount=type=bind,from=build_triton,src=/install,target=/install \
mkdir -p libs \
&& if ls /install/*.whl; then \
cp /install/*.whl libs \
# Preemptively uninstall to avoid same-version no-installs
&& python3 -m pip uninstall -y triton; fi
# Copy flash-attn wheel(s) into final image if they were built
RUN --mount=type=bind,from=build_fa,src=/install,target=/install \
mkdir -p libs \
&& if ls /install/*.whl; then \
cp /install/*.whl libs \
# Preemptively uninstall to avoid same-version no-installs
&& python3 -m pip uninstall -y flash-attn; fi
# Install wheels that were built to the final image
RUN --mount=type=cache,target=/root/.cache/pip \
if ls libs/*.whl; then \
python3 -m pip install libs/*.whl; fi
CMD ["/bin/bash"]
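
For context, the ROCm Dockerfile removed here exposes several build arguments (base image, GPU arch list, flash-attention and Triton toggles); a purely illustrative build command might look like the following. The `docker/Dockerfile.rocm` path comes from the CMake comment above, and the argument values are placeholders.

```bash
# Sketch only: build a ROCm image with explicit arch/toggle arguments.
# File path and values are illustrative, not a supported recipe.
docker build -f docker/Dockerfile.rocm \
  --build-arg PYTORCH_ROCM_ARCH="gfx90a;gfx942" \
  --build-arg BUILD_FA=1 \
  --build-arg FA_GFX_ARCHS="gfx90a;gfx942" \
  -t vllm-rocm .
```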

View File

@ -1,68 +0,0 @@
FROM intel/oneapi-basekit:2024.2.1-0-devel-ubuntu22.04 AS vllm-base
RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/intel-oneapi-archive-keyring.gpg > /dev/null && \
echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main " | tee /etc/apt/sources.list.d/oneAPI.list && \
chmod 644 /usr/share/keyrings/intel-oneapi-archive-keyring.gpg && \
wget -O- https://repositories.intel.com/graphics/intel-graphics.key | gpg --dearmor | tee /usr/share/keyrings/intel-graphics.gpg > /dev/null && \
echo "deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/graphics/ubuntu jammy arc" | tee /etc/apt/sources.list.d/intel.gpu.jammy.list && \
chmod 644 /usr/share/keyrings/intel-graphics.gpg
RUN apt-get update -y && \
apt-get install -y --no-install-recommends --fix-missing \
curl \
ffmpeg \
git \
libsndfile1 \
libsm6 \
libxext6 \
libgl1 \
lsb-release \
numactl \
python3 \
python3-dev \
python3-pip \
# vim \
wget
WORKDIR /workspace/vllm
COPY requirements-xpu.txt /workspace/vllm/requirements-xpu.txt
COPY requirements-common.txt /workspace/vllm/requirements-common.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip install --no-cache-dir \
-r requirements-xpu.txt
RUN git clone https://github.com/intel/pti-gpu && \
cd pti-gpu/sdk && \
git checkout 6c491f07a777ed872c2654ca9942f1d0dde0a082 && \
mkdir build && \
cd build && \
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=../cmake/toolchains/icpx_toolchain.cmake -DBUILD_TESTING=OFF .. && \
make -j && \
cmake --install . --config Release --prefix "/usr/local"
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/lib/"
COPY . .
ARG GIT_REPO_CHECK
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh; fi
ENV VLLM_TARGET_DEVICE=xpu
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=.git,target=.git \
python3 setup.py install
CMD ["/bin/bash"]
FROM vllm-base AS vllm-openai
# install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \
pip install accelerate hf_transfer 'modelscope!=1.15.0'
ENV VLLM_USAGE_SOURCE production-docker-image \
TRITON_XPU_PROFILE 1
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

View File

@ -1,9 +1,9 @@
include LICENSE
include requirements-common.txt
include requirements-cuda.txt
include requirements-rocm.txt
include requirements-neuron.txt
include requirements-cpu.txt
include requirements/common.txt
include requirements/cuda.txt
include requirements/rocm.txt
include requirements/neuron.txt
include requirements/cpu.txt
include CMakeLists.txt
recursive-include cmake *

View File

@ -10,15 +10,32 @@ Easy, fast, and cheap LLM serving for everyone
</h3>
<p align="center">
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
| <a href="https://docs.vllm.ai"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://x.com/vllm_project"><b>Twitter/X</b></a> | <a href="https://discuss.vllm.ai"><b>User Forum</b></a> | <a href="https://slack.vllm.ai"><b>Developer Slack</b></a> |
</p>
---
[2025/03] We are collaborating with Ollama to host an [Inference Night](https://lu.ma/vllm-ollama) at Y Combinator in San Francisco on Thursday, March 27, at 6 PM. Discuss all things inference local or data center!
[2025/04] We're hosting our first-ever *vLLM Asia Developer Day* in Singapore on *April 3rd*! This is a full-day event (9 AM - 9 PM SGT) in partnership with SGInnovate, AMD, and Embedded LLM. Meet the vLLM team and learn about LLM inference for RL, MI300X, and more! [Register Now](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day)
---
*Latest News* 🔥
- [2024/11] We hosted [the seventh vLLM meetup](https://lu.ma/h0qvrajz) with Snowflake! Please find the meetup slides [here](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing).
- [2025/03] We hosted [the first vLLM China Meetup](https://mp.weixin.qq.com/s/n77GibL2corAtQHtVEAzfg)! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1REHvfQMKGnvz6p3Fd23HhSO4c8j5WPGZV0bKYLwnHyQ/edit?usp=sharing).
- [2025/03] We hosted [the East Coast vLLM Meetup](https://lu.ma/7mu4k4xx)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1NHiv8EUFF1NLd3fEYODm56nDmL26lEeXCaDgyDlTsRs/edit#slide=id.g31441846c39_0_0).
- [2025/02] We hosted [the ninth vLLM meetup](https://lu.ma/h7g3kuj9) with Meta! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1jzC_PZVXrVNSFVCW-V4cFXb6pn7zZ2CyP_Flwo05aqg/edit?usp=sharing) and AMD [here](https://drive.google.com/file/d/1Zk5qEJIkTmlQ2eQcXQZlljAx3m9s7nwn/view?usp=sharing). The slides from Meta will not be posted.
- [2025/01] We are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more. Please check out our blog post [here](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html).
- [2025/01] We hosted [the eighth vLLM meetup](https://lu.ma/zep56hui) with Google Cloud! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1epVkt4Zu8Jz_S5OhEHPc798emsYh2BwYfRuDDVEF7u4/edit?usp=sharing), and Google Cloud team [here](https://drive.google.com/file/d/1h24pHewANyRL11xy5dXUbvRC9F9Kkjix/view?usp=sharing).
- [2024/12] vLLM joins [pytorch ecosystem](https://pytorch.org/blog/vllm-joins-pytorch)! Easy, Fast, and Cheap LLM Serving for Everyone!
<details>
<summary>Previous News</summary>
- [2024/11] We hosted [the seventh vLLM meetup](https://lu.ma/h0qvrajz) with Snowflake! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1e3CxQBV3JsfGp30SwyvS3eM_tW-ghOhJ9PAJGK6KR54/edit?usp=sharing), and Snowflake team [here](https://docs.google.com/presentation/d/1qF3RkDAbOULwz9WK5TOltt2fE9t6uIc_hVNLFAaQX6A/edit?usp=sharing).
- [2024/10] We have just created a developer slack ([slack.vllm.ai](https://slack.vllm.ai)) focusing on coordinating contributions and discussing features. Please feel free to join us there!
- [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team [here](https://docs.google.com/presentation/d/1B_KQxpHBTRa_mDF-tR6i8rWdOU5QoTZNcEg2MKZxEHM/edit?usp=sharing). Learn more from the [talks](https://raysummit.anyscale.com/flow/anyscale/raysummit2024/landing/page/sessioncatalog?tab.day=20241001&search.sessiontracks=1719251906298001uzJ2) from other vLLM contributors and users!
- [2024/10] Ray Summit 2024 held a special track for vLLM! Please find the opening talk slides from the vLLM team [here](https://docs.google.com/presentation/d/1B_KQxpHBTRa_mDF-tR6i8rWdOU5QoTZNcEg2MKZxEHM/edit?usp=sharing). Learn more from the [talks](https://www.youtube.com/playlist?list=PLzTswPQNepXl6AQwifuwUImLPFRVpksjR) from other vLLM contributors and users!
- [2024/09] We hosted [the sixth vLLM meetup](https://lu.ma/87q3nvnh) with NVIDIA! Please find the meetup slides [here](https://docs.google.com/presentation/d/1wrLGwytQfaOTd5wCGSPNhoaW3nq0E-9wqyP7ny93xRs/edit?usp=sharing).
- [2024/07] We hosted [the fifth vLLM meetup](https://lu.ma/lp0gyjqr) with AWS! Please find the meetup slides [here](https://docs.google.com/presentation/d/1RgUD8aCfcHocghoP3zmXzck9vX3RCI9yfUAB2Bbcl4Y/edit?usp=sharing).
- [2024/07] In partnership with Meta, vLLM officially supports Llama 3.1 with FP8 quantization and pipeline parallelism! Please check out our blog post [here](https://blog.vllm.ai/2024/07/23/llama31.html).
@ -29,14 +46,19 @@ Easy, fast, and cheap LLM serving for everyone
- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM.
- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).
</details>
---
## About
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with **PagedAttention**
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.
@ -59,7 +81,7 @@ vLLM is flexible and easy to use with:
vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Expert LLMs (e.g., Mixtral)
- Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
- Embedding Models (e.g. E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)
@ -67,16 +89,16 @@ Find the full list of supported models [here](https://docs.vllm.ai/en/latest/mod
## Getting Started
Install vLLM with `pip` or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):
Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):
```bash
pip install vllm
```
Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to learn more.
- [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
## Contributing
@ -89,33 +111,40 @@ vLLM is a community project. Our compute resources for development and testing a
<!-- Note: Please sort them in alphabetical order. -->
<!-- Note: Please keep these consistent with docs/source/community/sponsors.md -->
Cash Donations:
- a16z
- Dropbox
- Sequoia Capital
- Skywork AI
- ZhenFund
Compute Resources:
- AMD
- Anyscale
- AWS
- Crusoe Cloud
- Databricks
- DeepInfra
- Dropbox
- Google Cloud
- Lambda Lab
- Nebius
- Novita AI
- NVIDIA
- Replicate
- Roblox
- RunPod
- Sequoia Capital
- Skywork AI
- Trainy
- UC Berkeley
- UC San Diego
- ZhenFund
Slack Sponsor: Anyscale
We also have an official fundraising venue through [OpenCollective](https://opencollective.com/vllm). We plan to use the fund to support the development, maintenance, and adoption of vLLM.
## Citation
If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
```bibtex
@inproceedings{kwon2023efficient,
title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
@ -127,8 +156,12 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs
## Contact Us
* For technical questions and feature requests, please use Github issues or discussions.
* For discussing with fellow users, please use Discord.
* For coordinating contributions and development, please use Slack.
* For security disclosures, please use Github's security advisory feature.
* For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu.
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues) or [Discussions](https://github.com/vllm-project/vllm/discussions)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu)
## Media Kit
- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit).

RELEASE.md Normal file
View File

@ -0,0 +1,54 @@
# Releasing vLLM
vLLM releases offer a reliable version of the code base, packaged into a binary format that can be conveniently accessed via PyPI. These releases also serve as key milestones for the development team to communicate with the community about newly available features, improvements, and upcoming changes that could affect users, including potential breaking changes.
## Release Versioning
vLLM uses a “right-shifted” versioning scheme in which a new patch release ships every two weeks, and patch releases contain features as well as bug fixes (unlike semver, where a patch release contains only backwards-compatible bug fixes). When critical fixes are needed, a special post1 release is published.
* _major_: major architectural milestones and incompatible API changes, similar to PyTorch 2.0.
* _minor_: major features
* _patch_: features and backwards-compatible bug fixes
* _post1_ or _patch-1_: backwards-compatible bug fixes, released as an explicit or implicit post release
## Release Cadence
Patch releases ship on a bi-weekly basis. A post release follows 1-3 days after the patch release and uses the same branch as the patch release.
The following is the release cadence for 2025. All future release dates below are tentative. Please note: post releases are optional.
| Release Date | Patch release versions | Post Release versions |
| --- | --- | --- |
| Jan 2025 | 0.7.0 | --- |
| Feb 2025 | 0.7.1, 0.7.2, 0.7.3 | --- |
| Mar 2025 | 0.7.4, 0.7.5 | --- |
| Apr 2025 | 0.7.6, 0.7.7 | --- |
| May 2025 | 0.7.8, 0.7.9 | --- |
| Jun 2025 | 0.7.10, 0.7.11 | --- |
| Jul 2025 | 0.7.12, 0.7.13 | --- |
| Aug 2025 | 0.7.14, 0.7.15 | --- |
| Sep 2025 | 0.7.16, 0.7.17 | --- |
| Oct 2025 | 0.7.18, 0.7.19 | --- |
| Nov 2025 | 0.7.20, 0.7.21 | --- |
| Dec 2025 | 0.7.22, 0.7.23 | --- |
## Release branch
Each release is built from a dedicated release branch.
* For _major_, _minor_, and _patch_ releases, the release branch cut is performed 1-2 days before the release goes live.
* For post releases, the previously cut release branch is reused.
* Release builds are triggered via a push to an RC tag like vX.Y.Z-rc1 (as sketched below). This enables us to build and test multiple RCs for each release.
* The final tag vX.Y.Z does not trigger a build; it is only used for release notes and assets.
* After the branch cut is created, we monitor the main branch for any reverts and apply them to the release branch.
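
A purely illustrative tagging flow for the RC trigger described above; the branch and version names are placeholders, not actual release identifiers.

```bash
# Sketch only: cut an RC, then the final tag. Branch and version names are placeholders.
git checkout releases/v0.7.4                        # hypothetical release branch
git tag v0.7.4-rc1 && git push origin v0.7.4-rc1    # pushing the RC tag triggers the release build
# ...after testing the RC...
git tag v0.7.4 && git push origin v0.7.4            # final tag: release notes/assets only, no build
```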
## Release Cherry-Pick Criteria
After the branch cut, we finalize the release branch using clear criteria for which cherry-picks are allowed. Note: a cherry-pick is the process of landing a PR on the release branch after the branch cut. Cherry-picks are typically limited to ensure the team has sufficient time to complete a thorough round of testing on a stable code base.
* Regression fixes - fixes that address functional/performance regressions against the most recent release (e.g., 0.7.0 for the 0.7.1 release)
* Critical fixes - fixes for severe issues such as silent incorrectness, backwards-compatibility breaks, crashes, deadlocks, and (large) memory leaks
* Fixes to new features introduced in the most recent release (e.g., 0.7.0 for the 0.7.1 release)
* Documentation improvements
* Release branch specific changes (e.g. change version identifiers or CI fixes)
Please note: **No feature work is allowed in cherry-picks**. All PRs considered for cherry-picking must be merged on trunk; the only exception is release-branch-specific changes.

View File

@ -4,7 +4,7 @@
If you believe you have found a security vulnerability in vLLM, we encourage you to let us know right away. We will investigate all legitimate reports and do our best to quickly fix the problem.
Please report security issues privately using [the vulnerability submission form](https://github.com/vllm-project/vllm/security/advisories/new).
Please report security issues privately using [the vulnerability submission form](https://github.com/vllm-project/vllm/security/advisories/new). Reports will then be triaged by the [vulnerability management team](https://docs.vllm.ai/en/latest/contributing/vulnerability_management.html).
---

View File

@ -1,19 +1,296 @@
# Benchmarking vLLM
## Downloading the ShareGPT dataset
This README guides you through running benchmark tests with the extensive
datasets supported on vLLM. It's a living document, updated as new features and datasets
become available.
## Dataset Overview
<table style="width:100%; border-collapse: collapse;">
<thead>
<tr>
<th style="width:15%; text-align: left;">Dataset</th>
<th style="width:10%; text-align: center;">Online</th>
<th style="width:10%; text-align: center;">Offline</th>
<th style="width:65%; text-align: left;">Data Path</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>ShareGPT</strong></td>
<td style="text-align: center;"></td>
<td style="text-align: center;"></td>
<td><code>wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json</code></td>
</tr>
<tr>
<td><strong>BurstGPT</strong></td>
<td style="text-align: center;"></td>
<td style="text-align: center;"></td>
<td><code>wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv</code></td>
</tr>
<tr>
<td><strong>Sonnet</strong></td>
<td style="text-align: center;"></td>
<td style="text-align: center;"></td>
<td>Local file: <code>benchmarks/sonnet.txt</code></td>
</tr>
<tr>
<td><strong>Random</strong></td>
<td style="text-align: center;"></td>
<td style="text-align: center;"></td>
<td><code>synthetic</code></td>
</tr>
<tr>
<td><strong>HuggingFace-VisionArena</strong></td>
<td style="text-align: center;"></td>
<td style="text-align: center;"></td>
<td><code>lmarena-ai/VisionArena-Chat</code></td>
</tr>
<tr>
<td><strong>HuggingFace-InstructCoder</strong></td>
<td style="text-align: center;"></td>
<td style="text-align: center;"></td>
<td><code>likaixin/InstructCoder</code></td>
</tr>
<tr>
<td><strong>HuggingFace-Other</strong></td>
<td style="text-align: center;"></td>
<td style="text-align: center;"></td>
<td><code>lmms-lab/LLaVA-OneVision-Data</code>, <code>Aeala/ShareGPT_Vicuna_unfiltered</code></td>
</tr>
</tbody>
</table>
✅: supported
🟡: partial support
🚧: to be supported
**Note**: HuggingFace dataset's `dataset-name` should be set to `hf`
---
## Downloading the ShareGPT dataset
You can download the dataset by running:
```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
## Downloading the ShareGPT4V dataset
The json file refers to several image datasets (coco, llava, etc.). The benchmark scripts
will ignore a datapoint if the referred image is missing.
```bash
wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json
mkdir -p coco
wget http://images.cocodataset.org/zips/train2017.zip -O coco/train2017.zip
unzip coco/train2017.zip -d coco/
```
## Example - Online Benchmark
First start serving your model
```bash
vllm serve NousResearch/Hermes-3-Llama-3.1-8B --disable-log-requests
```
Then run the benchmarking script
```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_serving.py \
--backend vllm \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--endpoint /v1/completions \
--dataset-name sharegpt \
--dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 10
```
If successful, you will see the following output
```
============ Serving Benchmark Result ============
Successful requests: 10
Benchmark duration (s): 5.78
Total input tokens: 1369
Total generated tokens: 2212
Request throughput (req/s): 1.73
Output token throughput (tok/s): 382.89
Total Token throughput (tok/s): 619.85
---------------Time to First Token----------------
Mean TTFT (ms): 71.54
Median TTFT (ms): 73.88
P99 TTFT (ms): 79.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.91
Median TPOT (ms): 7.96
P99 TPOT (ms): 8.03
---------------Inter-token Latency----------------
Mean ITL (ms): 7.74
Median ITL (ms): 7.70
P99 ITL (ms): 8.39
==================================================
```
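As a quick sanity check, the headline figures follow directly from the token counts and the benchmark duration; a small illustrative snippet (differences of a few tokens/s come from rounding the printed duration):
```python
# Recomputing the throughput numbers from the sample output above.
duration_s = 5.78
num_requests = 10
input_tokens, output_tokens = 1369, 2212

print(f"Request throughput:      {num_requests / duration_s:.2f} req/s")                    # ~1.73
print(f"Output token throughput: {output_tokens / duration_s:.2f} tok/s")                   # ~382.7
print(f"Total token throughput:  {(input_tokens + output_tokens) / duration_s:.2f} tok/s")  # ~619.6
```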
### VisionArena Benchmark for Vision Language Models
```bash
# need a model with vision capability here
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
```
```bash
python3 vllm/benchmarks/benchmark_serving.py \
--backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--hf-split train \
--num-prompts 1000
```
### InstructCoder Benchmark with Speculative Decoding
```bash
VLLM_USE_V1=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--speculative-model "[ngram]" \
--ngram-prompt-lookup-min 2 \
--ngram-prompt-lookup-max 5 \
--num-speculative-tokens 5
```
```bash
python3 benchmarks/benchmark_serving.py \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--dataset-name hf \
--dataset-path likaixin/InstructCoder \
--num-prompts 2048
```
### Other HuggingFaceDataset Examples
```bash
vllm serve Qwen/Qwen2-VL-7B-Instruct --disable-log-requests
```
**`lmms-lab/LLaVA-OneVision-Data`**
```bash
python3 vllm/benchmarks/benchmark_serving.py \
--backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path lmms-lab/LLaVA-OneVision-Data \
--hf-split train \
--hf-subset "chart2text(cauldron)" \
--num-prompts 10
```
**`Aeala/ShareGPT_Vicuna_unfiltered`**
```bash
python3 vllm/benchmarks/benchmark_serving.py \
--backend openai-chat \
--model Qwen/Qwen2-VL-7B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name hf \
--dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
--hf-split train \
--num-prompts 10
```
---
## Example - Offline Throughput Benchmark
```bash
python3 vllm/benchmarks/benchmark_throughput.py \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--dataset-name sonnet \
--dataset-path vllm/benchmarks/sonnet.txt \
--num-prompts 10
```
If successful, you will see the following output
```
Throughput: 7.15 requests/s, 4656.00 total tokens/s, 1072.15 output tokens/s
Total num prompt tokens: 5014
Total num output tokens: 1500
```
### VisionArena Benchmark for Vision Language Models
```bash
python3 vllm/benchmarks/benchmark_throughput.py \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
--dataset-path lmarena-ai/VisionArena-Chat \
--num-prompts 1000 \
--hf-split train
```
The `num prompt tokens` now includes image token counts
```
Throughput: 2.55 requests/s, 4036.92 total tokens/s, 326.90 output tokens/s
Total num prompt tokens: 14527
Total num output tokens: 1280
```
### InstructCoder Benchmark with Speculative Decoding
```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_USE_V1=1 \
python3 vllm/benchmarks/benchmark_throughput.py \
--dataset-name=hf \
--dataset-path=likaixin/InstructCoder \
--model=meta-llama/Meta-Llama-3-8B-Instruct \
--input-len=1000 \
--output-len=100 \
--num-prompts=2048 \
--async-engine \
--speculative-model="[ngram]" \
--ngram-prompt-lookup-min=2 \
--ngram-prompt-lookup-max=5 \
--num-speculative-tokens=5
```
```
Throughput: 104.77 requests/s, 23836.22 total tokens/s, 10477.10 output tokens/s
Total num prompt tokens: 261136
Total num output tokens: 204800
```
### Other HuggingFaceDataset Examples
**`lmms-lab/LLaVA-OneVision-Data`**
```bash
python3 vllm/benchmarks/benchmark_throughput.py \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
--dataset-path lmms-lab/LLaVA-OneVision-Data \
--hf-split train \
--hf-subset "chart2text(cauldron)" \
--num-prompts 10
```
**`Aeala/ShareGPT_Vicuna_unfiltered`**
```bash
python3 vllm/benchmarks/benchmark_throughput.py \
--model Qwen/Qwen2-VL-7B-Instruct \
--backend vllm-chat \
--dataset-name hf \
--dataset-path Aeala/ShareGPT_Vicuna_unfiltered \
--hf-split train \
--num-prompts 10
```
### Benchmark with LoRA Adapters
```bash
# download dataset
# wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 vllm/benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-2-7b-hf \
--backend vllm \
--dataset-path <your data path>/ShareGPT_V3_unfiltered_cleaned_split.json \
--dataset-name sharegpt \
--num-prompts 10 \
--max-loras 2 \
--max-lora-rank 8 \
--enable-lora \
--lora-path yard1/llama-2-7b-sql-lora-test
```

View File

@ -1,10 +1,12 @@
# SPDX-License-Identifier: Apache-2.0
import json
import os
import sys
import time
import traceback
from dataclasses import dataclass, field
from typing import List, Optional, Union
from typing import Optional, Union
import aiohttp
import huggingface_hub.constants
@ -12,6 +14,9 @@ from tqdm.asyncio import tqdm
from transformers import (AutoTokenizer, PreTrainedTokenizer,
PreTrainedTokenizerFast)
# NOTE(simon): do not import vLLM here so the benchmark script
# can run without vLLM installed.
AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)
@ -22,8 +27,9 @@ class RequestFuncInput:
prompt_len: int
output_len: int
model: str
best_of: int = 1
model_name: Optional[str] = None
logprobs: Optional[int] = None
extra_body: Optional[dict] = None
multi_modal_content: Optional[dict] = None
ignore_eos: bool = False
@ -33,9 +39,11 @@ class RequestFuncOutput:
generated_text: str = ""
success: bool = False
latency: float = 0.0
output_tokens: int = 0
ttft: float = 0.0 # Time to first token
itl: List[float] = field(
default_factory=list) # List of inter-token latencies
itl: list[float] = field(
default_factory=list) # list of inter-token latencies
tpot: float = 0.0 # avg next-token latencies
prompt_len: int = 0
error: str = ""
@ -47,14 +55,15 @@ async def async_request_tgi(
api_url = request_func_input.api_url
assert api_url.endswith("generate_stream")
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
async with aiohttp.ClientSession(trust_env=True,
timeout=AIOHTTP_TIMEOUT) as session:
params = {
"best_of": request_func_input.best_of,
"max_new_tokens": request_func_input.output_len,
"do_sample": True,
"temperature": 0.01, # TGI does not accept 0.0 temperature.
"top_p": 0.99, # TGI does not accept 1.0 top_p.
# TGI does not accept ignore_eos flag.
"truncate": request_func_input.prompt_len,
"ignore_eos_token": request_func_input.ignore_eos,
}
payload = {
"inputs": request_func_input.prompt,
@ -62,6 +71,10 @@ async def async_request_tgi(
}
output = RequestFuncOutput()
output.prompt_len = request_func_input.prompt_len
if request_func_input.ignore_eos:
output.output_tokens = request_func_input.output_len
else:
output.output_tokens = None
ttft = 0.0
st = time.perf_counter()
@ -75,7 +88,7 @@ async def async_request_tgi(
continue
chunk_bytes = chunk_bytes.decode("utf-8")
#NOTE: Sometimes TGI returns a ping response without
# NOTE: Sometimes TGI returns a ping response without
# any data, we should skip it.
if chunk_bytes.startswith(":"):
continue
@ -118,8 +131,8 @@ async def async_request_trt_llm(
api_url = request_func_input.api_url
assert api_url.endswith("generate_stream")
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
assert request_func_input.best_of == 1
async with aiohttp.ClientSession(trust_env=True,
timeout=AIOHTTP_TIMEOUT) as session:
payload = {
"accumulate_tokens": True,
"text_input": request_func_input.prompt,
@ -152,7 +165,7 @@ async def async_request_trt_llm(
timestamp = time.perf_counter()
# First token
if ttft == 0.0:
ttft = time.perf_counter() - st
ttft = timestamp - st
output.ttft = ttft
# Decoding phase
@ -182,8 +195,8 @@ async def async_request_deepspeed_mii(
request_func_input: RequestFuncInput,
pbar: Optional[tqdm] = None,
) -> RequestFuncOutput:
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
assert request_func_input.best_of == 1
async with aiohttp.ClientSession(trust_env=True,
timeout=AIOHTTP_TIMEOUT) as session:
payload = {
"prompt": request_func_input.prompt,
@ -230,17 +243,24 @@ async def async_request_openai_completions(
("completions", "profile")
), "OpenAI Completions API URL must end with 'completions' or 'profile'."
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
async with aiohttp.ClientSession(trust_env=True,
timeout=AIOHTTP_TIMEOUT) as session:
payload = {
"model": request_func_input.model,
"model": request_func_input.model_name \
if request_func_input.model_name else request_func_input.model,
"prompt": request_func_input.prompt,
"temperature": 0.0,
"best_of": request_func_input.best_of,
"max_tokens": request_func_input.output_len,
"logprobs": request_func_input.logprobs,
"stream": True,
"ignore_eos": request_func_input.ignore_eos,
"stream_options": {
"include_usage": True,
},
}
if request_func_input.ignore_eos:
payload["ignore_eos"] = request_func_input.ignore_eos
if request_func_input.extra_body:
payload.update(request_func_input.extra_body)
headers = {
"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
}
@ -249,7 +269,6 @@ async def async_request_openai_completions(
output.prompt_len = request_func_input.prompt_len
generated_text = ""
ttft = 0.0
st = time.perf_counter()
most_recent_timestamp = st
try:
@ -264,15 +283,16 @@ async def async_request_openai_completions(
chunk = chunk_bytes.decode("utf-8").removeprefix(
"data: ")
if chunk == "[DONE]":
latency = time.perf_counter() - st
else:
if chunk != "[DONE]":
data = json.loads(chunk)
# NOTE: Some completion API might have a last
# usage summary response without a token so we
# want to check a token was generated
if data["choices"][0]["text"]:
if choices := data.get("choices"):
# Note that text could be empty here
# e.g. for special tokens
text = choices[0].get("text")
timestamp = time.perf_counter()
# First token
if not first_chunk_received:
@ -286,7 +306,10 @@ async def async_request_openai_completions(
most_recent_timestamp)
most_recent_timestamp = timestamp
generated_text += data["choices"][0]["text"]
generated_text += text or ""
elif usage := data.get("usage"):
output.output_tokens = usage.get(
"completion_tokens")
if first_chunk_received:
output.success = True
else:
@ -295,7 +318,7 @@ async def async_request_openai_completions(
"Never received a valid chunk to calculate TTFT."
"This response will be marked as failed!")
output.generated_text = generated_text
output.latency = latency
output.latency = most_recent_timestamp - st
else:
output.error = response.reason or ""
output.success = False
@ -315,15 +338,17 @@ async def async_request_openai_chat_completions(
) -> RequestFuncOutput:
api_url = request_func_input.api_url
assert api_url.endswith(
"chat/completions"
("chat/completions", "profile")
), "OpenAI Chat Completions API URL must end with 'chat/completions'."
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
async with aiohttp.ClientSession(trust_env=True,
timeout=AIOHTTP_TIMEOUT) as session:
content = [{"type": "text", "text": request_func_input.prompt}]
if request_func_input.multi_modal_content:
content.append(request_func_input.multi_modal_content)
payload = {
"model": request_func_input.model,
"model": request_func_input.model_name \
if request_func_input.model_name else request_func_input.model,
"messages": [
{
"role": "user",
@ -333,8 +358,14 @@ async def async_request_openai_chat_completions(
"temperature": 0.0,
"max_completion_tokens": request_func_input.output_len,
"stream": True,
"ignore_eos": request_func_input.ignore_eos,
"stream_options": {
"include_usage": True,
},
}
if request_func_input.ignore_eos:
payload["ignore_eos"] = request_func_input.ignore_eos
if request_func_input.extra_body:
payload.update(request_func_input.extra_body)
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}",
@ -358,17 +389,15 @@ async def async_request_openai_chat_completions(
chunk = chunk_bytes.decode("utf-8").removeprefix(
"data: ")
if chunk == "[DONE]":
latency = time.perf_counter() - st
else:
if chunk != "[DONE]":
timestamp = time.perf_counter()
data = json.loads(chunk)
delta = data["choices"][0]["delta"]
if delta.get("content", None):
if choices := data.get("choices"):
content = choices[0]["delta"].get("content")
# First token
if ttft == 0.0:
ttft = time.perf_counter() - st
ttft = timestamp - st
output.ttft = ttft
# Decoding phase
@ -376,13 +405,16 @@ async def async_request_openai_chat_completions(
output.itl.append(timestamp -
most_recent_timestamp)
generated_text += delta["content"]
generated_text += content or ""
elif usage := data.get("usage"):
output.output_tokens = usage.get(
"completion_tokens")
most_recent_timestamp = timestamp
output.generated_text = generated_text
output.success = True
output.latency = latency
output.latency = most_recent_timestamp - st
else:
output.error = response.reason or ""
output.success = False
@ -400,24 +432,50 @@ def get_model(pretrained_model_name_or_path: str) -> str:
if os.getenv('VLLM_USE_MODELSCOPE', 'False').lower() == 'true':
from modelscope import snapshot_download
model_path = snapshot_download(
model_id=pretrained_model_name_or_path,
local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE,
ignore_file_pattern=[".*.pt", ".*.safetensors", ".*.bin"])
from vllm.model_executor.model_loader.weight_utils import get_lock
return model_path
# Use file lock to prevent multiple processes from
# downloading the same model weights at the same time.
with get_lock(pretrained_model_name_or_path):
model_path = snapshot_download(
model_id=pretrained_model_name_or_path,
local_files_only=huggingface_hub.constants.HF_HUB_OFFLINE,
ignore_file_pattern=[".*.pt", ".*.safetensors", ".*.bin"])
return model_path
return pretrained_model_name_or_path
def get_tokenizer(
pretrained_model_name_or_path: str, trust_remote_code: bool
pretrained_model_name_or_path: str,
tokenizer_mode: str = "auto",
trust_remote_code: bool = False,
**kwargs,
) -> Union[PreTrainedTokenizer, PreTrainedTokenizerFast]:
if pretrained_model_name_or_path is not None and not os.path.exists(
pretrained_model_name_or_path):
pretrained_model_name_or_path = get_model(
pretrained_model_name_or_path)
return AutoTokenizer.from_pretrained(pretrained_model_name_or_path,
trust_remote_code=trust_remote_code)
if tokenizer_mode == "slow":
if kwargs.get("use_fast", False):
raise ValueError(
"Cannot use the fast tokenizer in slow tokenizer mode.")
kwargs["use_fast"] = False
if tokenizer_mode == "mistral":
try:
from vllm.transformers_utils.tokenizer import MistralTokenizer
except ImportError as e:
raise ImportError("MistralTokenizer requires vllm package.\n"
"Please install it with `pip install vllm` "
"to use mistral tokenizer mode.") from e
return MistralTokenizer.from_pretrained(
str(pretrained_model_name_or_path))
else:
return AutoTokenizer.from_pretrained(
pretrained_model_name_or_path,
trust_remote_code=trust_remote_code,
**kwargs,
)
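A hedged usage sketch of the updated `get_tokenizer` helper above (assumes `backend_request_func.py` is importable from the benchmarks directory; the model name is taken from the examples earlier in this document):
```python
# Illustrative only: exercising the new tokenizer_mode parameter.
from backend_request_func import get_tokenizer

# Default "auto" mode returns a fast tokenizer when one is available.
tok = get_tokenizer("NousResearch/Hermes-3-Llama-3.1-8B")

# "slow" mode forces use_fast=False.
slow_tok = get_tokenizer("NousResearch/Hermes-3-Llama-3.1-8B", tokenizer_mode="slow")

print(type(tok).__name__, type(slow_tok).__name__)
```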
ASYNC_REQUEST_FUNCS = {

View File

@ -0,0 +1,763 @@
# SPDX-License-Identifier: Apache-2.0
"""
This module defines a framework for sampling benchmark requests from various
datasets. Each dataset subclass of BenchmarkDataset must implement sample
generation. Supported dataset types include:
- ShareGPT
- Random (synthetic)
- Sonnet
- BurstGPT
- HuggingFace
- VisionArena
TODO: Implement CustomDataset to parse a JSON file and convert its contents into
SampleRequest instances, similar to the approach used in ShareGPT.
"""
import base64
import io
import json
import logging
import random
from abc import ABC, abstractmethod
from collections.abc import Mapping
from dataclasses import dataclass
from functools import cache
from io import BytesIO
from typing import Any, Callable, Optional, Union
import numpy as np
import pandas as pd
from datasets import load_dataset
from PIL import Image
from transformers import PreTrainedTokenizerBase
from vllm.lora.request import LoRARequest
from vllm.lora.utils import get_adapter_absolute_path
from vllm.multimodal import MultiModalDataDict
from vllm.transformers_utils.tokenizer import AnyTokenizer, get_lora_tokenizer
logger = logging.getLogger(__name__)
# -----------------------------------------------------------------------------
# Data Classes
# -----------------------------------------------------------------------------
@dataclass
class SampleRequest:
"""
Represents a single inference request for benchmarking.
"""
prompt: Union[str, Any]
prompt_len: int
expected_output_len: int
multi_modal_data: Optional[Union[MultiModalDataDict, dict]] = None
lora_request: Optional[LoRARequest] = None
# -----------------------------------------------------------------------------
# Benchmark Dataset Base Class
# -----------------------------------------------------------------------------
class BenchmarkDataset(ABC):
DEFAULT_SEED = 0
def __init__(
self,
dataset_path: Optional[str] = None,
random_seed: int = DEFAULT_SEED,
) -> None:
"""
Initialize the BenchmarkDataset with an optional dataset path and random
seed. Args:
dataset_path (Optional[str]): Path to the dataset. If None, it
indicates that a default or random dataset might be used.
random_seed (int): Seed value for reproducible shuffling or
sampling. Defaults to DEFAULT_SEED.
"""
self.dataset_path = dataset_path
# Set the random seed, ensuring that a None value is replaced with the
# default seed.
self.random_seed = (random_seed
if random_seed is not None else self.DEFAULT_SEED)
self.data = None
def apply_multimodal_chat_transformation(
self,
prompt: str,
mm_content: Optional[MultiModalDataDict] = None) -> list[dict]:
"""
Transform a prompt and optional multimodal content into a chat format.
This method is used for chat models that expect a specific conversation
format.
"""
content = [{"text": prompt, "type": "text"}]
if mm_content is not None:
content.append(mm_content)
return [{"role": "user", "content": content}]
def load_data(self) -> None:
"""
Load data from the dataset path into self.data.
This method must be overridden by subclasses since the method to load
data will vary depending on the dataset format and source.
Raises:
NotImplementedError: If a subclass does not implement this method.
"""
# TODO (jenniferzhao): add support for downloading data
raise NotImplementedError(
"load_data must be implemented in subclasses.")
def get_random_lora_request(
self,
tokenizer: PreTrainedTokenizerBase,
max_loras: Optional[int] = None,
lora_path: Optional[str] = None,
) -> tuple[Optional[LoRARequest], AnyTokenizer]:
"""
Optionally select a random LoRA request and return its associated
tokenizer.
This method is used when LoRA parameters are provided. It randomly
selects a LoRA based on max_loras and retrieves a cached tokenizer for
that LoRA if available. Otherwise, it returns the base tokenizer.
Args:
tokenizer (PreTrainedTokenizerBase): The base tokenizer to use if no
LoRA is selected. max_loras (Optional[int]): The maximum number of
LoRAs available. If None, LoRA is not used. lora_path
(Optional[str]): Path to the LoRA parameters on disk. If None, LoRA
is not used.
Returns:
tuple[Optional[LoRARequest], AnyTokenizer]: A tuple where the first
element is a LoRARequest (or None if not applicable) and the second
element is the tokenizer associated with the LoRA request (or the
base tokenizer).
"""
if max_loras is None or lora_path is None:
return None, tokenizer
# Generate a random LoRA ID in the range [1, max_loras].
lora_id = random.randint(1, max_loras)
lora_request = LoRARequest(
lora_name=str(lora_id),
lora_int_id=lora_id,
lora_path=lora_path_on_disk(lora_path),
)
if lora_id not in lora_tokenizer_cache:
lora_tokenizer_cache[lora_id] = get_lora_tokenizer(lora_request)
# Return lora_request and the cached tokenizer if available; otherwise,
# return the base tokenizer
return lora_request, lora_tokenizer_cache[lora_id] or tokenizer
@abstractmethod
def sample(self, tokenizer: PreTrainedTokenizerBase,
num_requests: int) -> list[SampleRequest]:
"""
Abstract method to generate sample requests from the dataset.
Subclasses must override this method to implement dataset-specific logic
for generating a list of SampleRequest objects.
Args:
tokenizer (PreTrainedTokenizerBase): The tokenizer to be used
for processing the dataset's text.
num_requests (int): The number of sample requests to generate.
Returns:
list[SampleRequest]: A list of sample requests generated from the
dataset.
"""
raise NotImplementedError("sample must be implemented in subclasses.")
def maybe_oversample_requests(self, requests: list[SampleRequest],
num_requests: int) -> None:
"""
Oversamples the list of requests if its size is less than the desired
number.
Args:
requests (List[SampleRequest]): The current list of sampled
requests. num_requests (int): The target number of requests.
"""
if len(requests) < num_requests:
random.seed(self.random_seed)
additional = random.choices(requests,
k=num_requests - len(requests))
requests.extend(additional)
logger.info("Oversampled requests to reach %d total samples.",
num_requests)
# -----------------------------------------------------------------------------
# Utility Functions and Global Caches
# -----------------------------------------------------------------------------
def is_valid_sequence(
prompt_len: int,
output_len: int,
min_len: int = 4,
max_prompt_len: int = 1024,
max_total_len: int = 2048,
skip_min_output_len_check: bool = False,
) -> bool:
"""
Validate a sequence based on prompt and output lengths.
Default pruning criteria are copied from the original `sample_hf_requests`
and `sample_sharegpt_requests` functions in benchmark_serving.py, as well as
from `sample_requests` in benchmark_throughput.py.
"""
# Check for invalid conditions
prompt_too_short = prompt_len < min_len
output_too_short = (not skip_min_output_len_check) and (output_len
< min_len)
prompt_too_long = prompt_len > max_prompt_len
combined_too_long = (prompt_len + output_len) > max_total_len
# Return True if none of the invalid conditions are met
return not (prompt_too_short or output_too_short or prompt_too_long
or combined_too_long)
@cache
def lora_path_on_disk(lora_path: str) -> str:
return get_adapter_absolute_path(lora_path)
# Global cache for LoRA tokenizers.
lora_tokenizer_cache: dict[int, AnyTokenizer] = {}
def process_image(image: Any) -> Mapping[str, Any]:
"""
Process a single image input and return a multimedia content dictionary.
Supports three input types:
1. Dictionary with raw image bytes: - Expects a dict with a 'bytes' key
containing raw image data. - Loads the bytes as a PIL.Image.Image.
2. PIL.Image.Image input: - Converts the image to RGB. - Saves the image as
a JPEG in memory. - Encodes the JPEG data as a base64 string. - Returns
a dictionary with the image as a base64 data URL.
3. String input: - Treats the string as a URL or local file path. -
Prepends "file://" if the string doesn't start with "http://" or
"file://". - Returns a dictionary with the image URL.
Raises:
ValueError: If the input is not a supported type.
"""
if isinstance(image, dict) and 'bytes' in image:
image = Image.open(BytesIO(image['bytes']))
if isinstance(image, Image.Image):
image = image.convert("RGB")
with io.BytesIO() as image_data:
image.save(image_data, format="JPEG")
image_base64 = base64.b64encode(
image_data.getvalue()).decode("utf-8")
return {
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_base64}"
},
}
if isinstance(image, str):
image_url = (image if image.startswith(
("http://", "file://")) else f"file://{image}")
return {"type": "image_url", "image_url": {"url": image_url}}
raise ValueError(f"Invalid image input {image}. Must be a PIL.Image.Image"
" or str or dictionary with raw image bytes.")
# -----------------------------------------------------------------------------
# Random Dataset Implementation (Synthetic Data)
# -----------------------------------------------------------------------------
class RandomDataset(BenchmarkDataset):
# Default values copied from benchmark_serving.py for the random dataset.
DEFAULT_PREFIX_LEN = 0
DEFAULT_RANGE_RATIO = 1.0
DEFAULT_INPUT_LEN = 1024
DEFAULT_OUTPUT_LEN = 128
def __init__(
self,
**kwargs,
) -> None:
super().__init__(**kwargs)
def sample(
self,
tokenizer: PreTrainedTokenizerBase,
num_requests: int,
prefix_len: int = DEFAULT_PREFIX_LEN,
range_ratio: float = DEFAULT_RANGE_RATIO,
input_len: int = DEFAULT_INPUT_LEN,
output_len: int = DEFAULT_OUTPUT_LEN,
**kwargs,
) -> list[SampleRequest]:
vocab_size = tokenizer.vocab_size
prefix_token_ids = (np.random.randint(
0, vocab_size, size=prefix_len).tolist() if prefix_len > 0 else [])
input_low = int(input_len * range_ratio)
output_low = int(output_len * range_ratio)
input_lens = np.random.randint(input_low,
input_len + 1,
size=num_requests)
output_lens = np.random.randint(output_low,
output_len + 1,
size=num_requests)
offsets = np.random.randint(0, vocab_size, size=num_requests)
requests = []
for i in range(num_requests):
inner_seq = ((offsets[i] + i + np.arange(input_lens[i])) %
vocab_size).tolist()
token_sequence = prefix_token_ids + inner_seq
prompt = tokenizer.decode(token_sequence)
total_input_len = prefix_len + int(input_lens[i])
requests.append(
SampleRequest(
prompt=prompt,
prompt_len=total_input_len,
expected_output_len=int(output_lens[i]),
))
return requests
# -----------------------------------------------------------------------------
# ShareGPT Dataset Implementation
# -----------------------------------------------------------------------------
class ShareGPTDataset(BenchmarkDataset):
"""
Implements the ShareGPT dataset. Loads data from a JSON file and generates
sample requests based on conversation turns.
"""
def __init__(self, **kwargs) -> None:
super().__init__(**kwargs)
self.load_data()
def load_data(self) -> None:
if self.dataset_path is None:
raise ValueError("dataset_path must be provided for loading data.")
with open(self.dataset_path, encoding="utf-8") as f:
self.data = json.load(f)
# Filter entries with at least two conversation turns.
self.data = [
entry for entry in self.data
if "conversations" in entry and len(entry["conversations"]) >= 2
]
random.seed(self.random_seed)
random.shuffle(self.data)
def sample(
self,
tokenizer: PreTrainedTokenizerBase,
num_requests: int,
lora_path: Optional[str] = None,
max_loras: Optional[int] = None,
output_len: Optional[int] = None,
enable_multimodal_chat: bool = False,
**kwargs,
) -> list:
samples: list = []
for entry in self.data:
if len(samples) >= num_requests:
break
prompt, completion = (
entry["conversations"][0]["value"],
entry["conversations"][1]["value"],
)
lora_request, tokenizer = self.get_random_lora_request(
tokenizer=tokenizer, max_loras=max_loras, lora_path=lora_path)
prompt_ids = tokenizer(prompt).input_ids
completion_ids = tokenizer(completion).input_ids
prompt_len = len(prompt_ids)
new_output_len = (len(completion_ids)
if output_len is None else output_len)
if not is_valid_sequence(prompt_len,
new_output_len,
skip_min_output_len_check=output_len
is not None):
continue
if enable_multimodal_chat:
prompt = self.apply_multimodal_chat_transformation(
prompt, None)
samples.append(
SampleRequest(
prompt=prompt,
prompt_len=prompt_len,
expected_output_len=new_output_len,
lora_request=lora_request,
))
self.maybe_oversample_requests(samples, num_requests)
return samples
# -----------------------------------------------------------------------------
# Sonnet Dataset Implementation
# -----------------------------------------------------------------------------
class SonnetDataset(BenchmarkDataset):
"""
Simplified implementation of the Sonnet dataset. Loads poem lines from a
text file and generates sample requests. Default values here copied from
`benchmark_serving.py` for the sonnet dataset.
"""
DEFAULT_PREFIX_LEN = 200
DEFAULT_INPUT_LEN = 550
DEFAULT_OUTPUT_LEN = 150
def __init__(
self,
**kwargs,
) -> None:
super().__init__(**kwargs)
self.load_data()
def load_data(self) -> None:
if not self.dataset_path:
raise ValueError("dataset_path must be provided.")
with open(self.dataset_path, encoding="utf-8") as f:
self.data = f.readlines()
def sample(
self,
tokenizer,
num_requests: int,
prefix_len: int = DEFAULT_PREFIX_LEN,
input_len: int = DEFAULT_INPUT_LEN,
output_len: int = DEFAULT_OUTPUT_LEN,
return_prompt_formatted: bool = False,
**kwargs,
) -> list:
# Calculate average token length for a poem line.
tokenized_lines = [tokenizer(line).input_ids for line in self.data]
avg_len = sum(len(tokens)
for tokens in tokenized_lines) / len(tokenized_lines)
# Build the base prompt.
base_prompt = "Pick as many lines as you can from these poem lines:\n"
base_msg = [{"role": "user", "content": base_prompt}]
base_fmt = tokenizer.apply_chat_template(base_msg,
add_generation_prompt=True,
tokenize=False)
base_offset = len(tokenizer(base_fmt).input_ids)
if input_len <= base_offset:
raise ValueError(
f"'input_len' must be higher than the base prompt length "
f"({base_offset}).")
# Determine how many poem lines to use.
num_input_lines = round((input_len - base_offset) / avg_len)
num_prefix_lines = round((prefix_len - base_offset) / avg_len)
prefix_lines = self.data[:num_prefix_lines]
samples = []
for _ in range(num_requests):
extra_lines = random.choices(self.data,
k=num_input_lines - num_prefix_lines)
prompt = f"{base_prompt}{''.join(prefix_lines + extra_lines)}"
msg = [{"role": "user", "content": prompt}]
prompt_formatted = tokenizer.apply_chat_template(
msg, add_generation_prompt=True, tokenize=False)
prompt_len = len(tokenizer(prompt_formatted).input_ids)
samples.append(
SampleRequest(
prompt=prompt_formatted
if return_prompt_formatted else prompt,
prompt_len=prompt_len,
expected_output_len=output_len,
))
return samples
# -----------------------------------------------------------------------------
# BurstGPT Dataset Implementation
# -----------------------------------------------------------------------------
class BurstGPTDataset(BenchmarkDataset):
"""
Implements the BurstGPT dataset. Loads data from a CSV file and generates
sample requests based on synthetic prompt generation. Only rows with Model
"GPT-4" and positive response tokens are used.
"""
def __init__(self, **kwargs) -> None:
super().__init__(**kwargs)
self.load_data()
def load_data(self, ):
if self.dataset_path is None:
raise ValueError("dataset_path must be provided for loading data.")
df = pd.read_csv(self.dataset_path)
# Filter to keep only GPT-4 rows.
gpt4_df = df[df["Model"] == "GPT-4"]
# Remove failed requests (where Response tokens is 0 or less).
gpt4_df = gpt4_df[gpt4_df["Response tokens"] > 0]
# Sample the desired number of rows.
self.data = gpt4_df
def _sample_loaded_data(self, num_requests: int) -> list:
if num_requests <= len(self.data):
data = self.data.sample(n=num_requests,
random_state=self.random_seed)
else:
data = self.data.sample(
n=num_requests,
random_state=self.random_seed,
replace=True,
)
# Convert the dataframe to a list of lists.
return data.values.tolist()
def sample(
self,
tokenizer: PreTrainedTokenizerBase,
num_requests: int,
max_loras: Optional[int] = None,
lora_path: Optional[str] = None,
**kwargs,
) -> list[SampleRequest]:
samples = []
data = self._sample_loaded_data(num_requests=num_requests)
for i in range(num_requests):
input_len = int(data[i][2])
output_len = int(data[i][3])
lora_req, tokenizer = self.get_random_lora_request(
tokenizer=tokenizer, max_loras=max_loras, lora_path=lora_path)
vocab_size = tokenizer.vocab_size
# Generate a synthetic prompt: a list of token IDs computed as (i +
# j) modulo vocab_size.
token_ids = [(i + j) % vocab_size for j in range(input_len)]
prompt = tokenizer.decode(token_ids)
samples.append(
SampleRequest(
prompt=prompt,
prompt_len=input_len,
expected_output_len=output_len,
lora_request=lora_req,
))
return samples
# -----------------------------------------------------------------------------
# HuggingFace Dataset Base Implementation
# -----------------------------------------------------------------------------
class HuggingFaceDataset(BenchmarkDataset):
"""Base class for datasets hosted on HuggingFace."""
SUPPORTED_DATASET_PATHS: Union[set[str], dict[str, Callable]] = set()
def __init__(
self,
dataset_path: str,
dataset_split: str,
dataset_subset: Optional[str] = None,
**kwargs,
) -> None:
super().__init__(dataset_path=dataset_path, **kwargs)
# Validate dataset path
if self.SUPPORTED_DATASET_PATHS and \
self.dataset_path not in self.SUPPORTED_DATASET_PATHS:
raise ValueError(
f"{self.__class__.__name__} "
f"only supports: {', '.join(self.SUPPORTED_DATASET_PATHS)}. "
"Please consider contributing if you would "
"like to add support for additional dataset formats.")
self.dataset_split = dataset_split
self.dataset_subset = dataset_subset
self.load_data()
def load_data(self) -> None:
"""Load data from HuggingFace datasets."""
self.data = load_dataset(
self.dataset_path,
name=self.dataset_subset,
split=self.dataset_split,
streaming=True,
)
self.data = self.data.shuffle(seed=self.random_seed)
# -----------------------------------------------------------------------------
# Conversation Dataset Implementation
# -----------------------------------------------------------------------------
class ConversationDataset(HuggingFaceDataset):
"""Dataset for conversation data with multimodal support."""
SUPPORTED_DATASET_PATHS = {
'lmms-lab/LLaVA-OneVision-Data', 'Aeala/ShareGPT_Vicuna_unfiltered'
}
def sample(self,
tokenizer: PreTrainedTokenizerBase,
num_requests: int,
output_len: Optional[int] = None,
enable_multimodal_chat: bool = False,
**kwargs) -> list:
# Filter examples with at least 2 conversations
filtered_data = self.data.filter(
lambda x: len(x["conversations"]) >= 2)
sampled_requests = []
dynamic_output = output_len is None
for item in filtered_data:
if len(sampled_requests) >= num_requests:
break
conv = item["conversations"]
prompt, completion = conv[0]["value"], conv[1]["value"]
prompt_ids = tokenizer(prompt).input_ids
completion_ids = tokenizer(completion).input_ids
prompt_len = len(prompt_ids)
completion_len = len(completion_ids)
output_len = completion_len if dynamic_output else output_len
assert isinstance(output_len, int) and output_len > 0
if dynamic_output and not is_valid_sequence(
prompt_len, completion_len):
continue
mm_content = process_image(
item["image"]) if "image" in item else None
if enable_multimodal_chat:
# Note: when chat is enabled the request prompt_len is no longer
# accurate and we will be using request output to count the
# actual prompt len and output len
prompt = self.apply_multimodal_chat_transformation(
prompt, mm_content)
sampled_requests.append(
SampleRequest(
prompt=prompt,
prompt_len=prompt_len,
expected_output_len=output_len,
multi_modal_data=mm_content,
))
self.maybe_oversample_requests(sampled_requests, num_requests)
return sampled_requests
# -----------------------------------------------------------------------------
# Vision Arena Dataset Implementation
# -----------------------------------------------------------------------------
class VisionArenaDataset(HuggingFaceDataset):
"""
Vision Arena Dataset.
"""
DEFAULT_OUTPUT_LEN = 128
SUPPORTED_DATASET_PATHS = {
"lmarena-ai/VisionArena-Chat":
lambda x: x["conversation"][0][0]["content"],
"lmarena-ai/vision-arena-bench-v0.1":
lambda x: x["turns"][0][0]["content"]
}
def sample(
self,
tokenizer: PreTrainedTokenizerBase,
num_requests: int,
output_len: Optional[int] = None,
enable_multimodal_chat: bool = False,
**kwargs,
) -> list:
output_len = (output_len
if output_len is not None else self.DEFAULT_OUTPUT_LEN)
sampled_requests = []
for item in self.data:
if len(sampled_requests) >= num_requests:
break
parser_fn = self.SUPPORTED_DATASET_PATHS.get(self.dataset_path)
if parser_fn is None:
raise ValueError(
f"Unsupported dataset path: {self.dataset_path}")
prompt = parser_fn(item)
mm_content = process_image(item["images"][0])
prompt_len = len(tokenizer(prompt).input_ids)
if enable_multimodal_chat:
# Note: when chat is enabled the request prompt_len is no longer
# accurate and we will be using request output to count the
# actual prompt len
prompt = self.apply_multimodal_chat_transformation(
prompt, mm_content)
sampled_requests.append(
SampleRequest(
prompt=prompt,
prompt_len=prompt_len,
expected_output_len=output_len,
multi_modal_data=mm_content,
))
self.maybe_oversample_requests(sampled_requests, num_requests)
return sampled_requests
# -----------------------------------------------------------------------------
# Instruct Coder Dataset Implementation
# -----------------------------------------------------------------------------
class InstructCoderDataset(HuggingFaceDataset):
"""
InstructCoder Dataset.
https://huggingface.co/datasets/likaixin/InstructCoder
InstructCoder is the dataset designed for general code editing. It consists
of 114,239 instruction-input-output triplets, and covers multiple distinct
code editing scenario.
"""
DEFAULT_OUTPUT_LEN = 200 # this is the average default output length
SUPPORTED_DATASET_PATHS = {
"likaixin/InstructCoder",
}
def sample(self,
tokenizer: PreTrainedTokenizerBase,
num_requests: int,
output_len: Optional[int] = None,
enable_multimodal_chat: bool = False,
**kwargs) -> list:
output_len = (output_len
if output_len is not None else self.DEFAULT_OUTPUT_LEN)
sampled_requests = []
for item in self.data:
if len(sampled_requests) >= num_requests:
break
prompt = f"{item['instruction']}:\n{item['input']}"
prompt_len = len(tokenizer(prompt).input_ids)
sampled_requests.append(
SampleRequest(
prompt=prompt,
prompt_len=prompt_len,
expected_output_len=output_len,
))
self.maybe_oversample_requests(sampled_requests, num_requests)
return sampled_requests
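To illustrate how the dataset classes above fit together, a hedged usage sketch (assumes this `benchmark_dataset.py` module is importable, vLLM and `transformers` are installed, and the ShareGPT JSON has been downloaded; argument and attribute names are taken from the definitions above):
```python
# Illustrative only: sampling requests from two of the dataset classes above.
from transformers import AutoTokenizer
from benchmark_dataset import RandomDataset, ShareGPTDataset

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Hermes-3-Llama-3.1-8B")

# Synthetic prompts -- no dataset file required.
random_requests = RandomDataset(random_seed=0).sample(
    tokenizer=tokenizer,
    num_requests=4,
    input_len=256,
    output_len=64,
)

# ShareGPT prompts -- requires the downloaded JSON file.
sharegpt_requests = ShareGPTDataset(
    dataset_path="ShareGPT_V3_unfiltered_cleaned_split.json",
    random_seed=0,
).sample(tokenizer=tokenizer, num_requests=4)

for req in random_requests + sharegpt_requests:
    print(req.prompt_len, req.expected_output_len)
```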

View File

@ -1,21 +1,38 @@
# SPDX-License-Identifier: Apache-2.0
"""Benchmark the latency of processing a single batch of requests."""
import argparse
import dataclasses
import json
import os
import time
from pathlib import Path
from typing import List, Optional
from typing import Any, Optional
import numpy as np
import torch
from benchmark_utils import convert_to_pytorch_benchmark_format, write_to_json
from tqdm import tqdm
from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import EngineArgs
from vllm.inputs import PromptType
from vllm.sampling_params import BeamSearchParams
from vllm.utils import FlexibleArgumentParser
def save_to_pytorch_benchmark_format(args: argparse.Namespace,
results: dict[str, Any]) -> None:
pt_records = convert_to_pytorch_benchmark_format(
args=args,
metrics={"latency": results["latencies"]},
extra_info={k: results[k]
for k in ["avg_latency", "percentiles"]})
if pt_records:
pt_file = f"{os.path.splitext(args.output_json)[0]}.pytorch.json"
write_to_json(pt_file, pt_records)
def main(args: argparse.Namespace):
print(args)
@ -24,6 +41,10 @@ def main(args: argparse.Namespace):
# NOTE(woosuk): If the request cannot be processed in a single batch,
# the engine will automatically process the request in multiple batches.
llm = LLM(**dataclasses.asdict(engine_args))
assert llm.llm_engine.model_config.max_model_len >= (
args.input_len +
args.output_len), ("Please ensure that max_model_len is greater than"
" the sum of input_len and output_len.")
sampling_params = SamplingParams(
n=args.n,
@ -31,15 +52,31 @@ def main(args: argparse.Namespace):
top_p=1.0,
ignore_eos=True,
max_tokens=args.output_len,
detokenize=not args.disable_detokenize,
)
print(sampling_params)
dummy_prompt_token_ids = np.random.randint(10000,
size=(args.batch_size,
args.input_len))
dummy_prompts: List[PromptType] = [{
dummy_prompts: list[PromptType] = [{
"prompt_token_ids": batch
} for batch in dummy_prompt_token_ids.tolist()]
def llm_generate():
if not args.use_beam_search:
llm.generate(dummy_prompts,
sampling_params=sampling_params,
use_tqdm=False)
else:
llm.beam_search(
dummy_prompts,
BeamSearchParams(
beam_width=args.n,
max_tokens=args.output_len,
ignore_eos=True,
),
)
def run_to_completion(profile_dir: Optional[str] = None):
if profile_dir:
with torch.profiler.profile(
@ -48,16 +85,13 @@ def main(args: argparse.Namespace):
torch.profiler.ProfilerActivity.CUDA,
],
on_trace_ready=torch.profiler.tensorboard_trace_handler(
str(profile_dir))) as p:
llm.generate(dummy_prompts,
sampling_params=sampling_params,
use_tqdm=False)
print(p.key_averages())
str(profile_dir)),
) as p:
llm_generate()
print(p.key_averages().table(sort_by="self_cuda_time_total"))
else:
start_time = time.perf_counter()
llm.generate(dummy_prompts,
sampling_params=sampling_params,
use_tqdm=False)
llm_generate()
end_time = time.perf_counter()
latency = end_time - start_time
return latency
@ -69,9 +103,8 @@ def main(args: argparse.Namespace):
if args.profile:
profile_dir = args.profile_result_dir
if not profile_dir:
profile_dir = Path(
"."
) / "vllm_benchmark_result" / f"latency_result_{time.time()}"
profile_dir = (Path(".") / "vllm_benchmark_result" /
f"latency_result_{time.time()}")
print(f"Profiling (results will be saved to '{profile_dir}')...")
run_to_completion(profile_dir=profile_dir)
return
@ -83,9 +116,9 @@ def main(args: argparse.Namespace):
latencies = np.array(latencies)
percentages = [10, 25, 50, 75, 90, 99]
percentiles = np.percentile(latencies, percentages)
print(f'Avg latency: {np.mean(latencies)} seconds')
print(f"Avg latency: {np.mean(latencies)} seconds")
for percentage, percentile in zip(percentages, percentiles):
print(f'{percentage}% percentile latency: {percentile} seconds')
print(f"{percentage}% percentile latency: {percentile} seconds")
# Output JSON results if specified
if args.output_json:
@ -96,43 +129,57 @@ def main(args: argparse.Namespace):
}
with open(args.output_json, "w") as f:
json.dump(results, f, indent=4)
save_to_pytorch_benchmark_format(args, results)
if __name__ == '__main__':
if __name__ == "__main__":
parser = FlexibleArgumentParser(
description='Benchmark the latency of processing a single batch of '
'requests till completion.')
parser.add_argument('--input-len', type=int, default=32)
parser.add_argument('--output-len', type=int, default=128)
parser.add_argument('--batch-size', type=int, default=8)
parser.add_argument('--n',
type=int,
default=1,
help='Number of generated sequences per prompt.')
parser.add_argument('--use-beam-search', action='store_true')
parser.add_argument('--num-iters-warmup',
type=int,
default=10,
help='Number of iterations to run for warmup.')
parser.add_argument('--num-iters',
description="Benchmark the latency of processing a single batch of "
"requests till completion.")
parser.add_argument("--input-len", type=int, default=32)
parser.add_argument("--output-len", type=int, default=128)
parser.add_argument("--batch-size", type=int, default=8)
parser.add_argument(
"--n",
type=int,
default=1,
help="Number of generated sequences per prompt.",
)
parser.add_argument("--use-beam-search", action="store_true")
parser.add_argument(
"--num-iters-warmup",
type=int,
default=10,
help="Number of iterations to run for warmup.",
)
parser.add_argument("--num-iters",
type=int,
default=30,
help='Number of iterations to run.')
help="Number of iterations to run.")
parser.add_argument(
'--profile',
action='store_true',
help='profile the generation process of a single batch')
"--profile",
action="store_true",
help="profile the generation process of a single batch",
)
parser.add_argument(
'--profile-result-dir',
"--profile-result-dir",
type=str,
default=None,
help=('path to save the pytorch profiler output. Can be visualized '
'with ui.perfetto.dev or Tensorboard.'))
help=("path to save the pytorch profiler output. Can be visualized "
"with ui.perfetto.dev or Tensorboard."),
)
parser.add_argument(
'--output-json',
"--output-json",
type=str,
default=None,
help='Path to save the latency results in JSON format.')
help="Path to save the latency results in JSON format.",
)
parser.add_argument(
"--disable-detokenize",
action="store_true",
help=("Do not detokenize responses (i.e. do not include "
"detokenization time in the latency measurement)"),
)
parser = EngineArgs.add_cli_args(parser)
args = parser.parse_args()

View File

@ -0,0 +1,184 @@
# SPDX-License-Identifier: Apache-2.0
"""
Offline benchmark to test the long document QA throughput.
Example usage:
# This workload samples 8 different prompts with a default input
# length of 20000 tokens, then replicates each prompt 2 times
# in random order.
python benchmark_long_document_qa_throughput.py \
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-documents 8 \
--repeat-count 2
Commandline arguments:
--num-documents: The number of documents to sample prompts from.
--document-length: The length of each document in tokens.
(Optional, default: 20000)
--output-len: The number of tokens to generate for each prompt.
(Optional, default: 10)
--repeat-count: The number of times to repeat each prompt.
(Optional, default: 2)
--repeat-mode: The mode to repeat prompts. The supported modes are:
- 'random': shuffle the prompts randomly. (Default)
- 'tile': the entire prompt list is repeated in sequence. (Potentially
lowest cache hit)
- 'interleave': each prompt is repeated consecutively before
moving to the next element. (Highest cache hit)
--shuffle-seed: Random seed when the repeat mode is "random".
(Optional, default: 0)
In the meantime, it also supports all the vLLM engine args to initialize the
LLM engine. You can refer to the `vllm.engine.arg_utils.EngineArgs` for more
details.
"""
import dataclasses
import random
import time
from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import EngineArgs
from vllm.utils import FlexibleArgumentParser
def test_long_document_qa(llm=None, sampling_params=None, prompts=None):
"""
Test long document QA with the given prompts and sampling parameters.
Print the time spent in processing all the prompts.
Args:
llm: The language model used for generating responses.
sampling_params: Sampling parameter used to generate the response.
prompts: A list of prompt strings to be processed by the LLM.
"""
start_time = time.time()
llm.generate(prompts, sampling_params=sampling_params)
end_time = time.time()
print(f"Time to execute all requests: {end_time - start_time:.4f} secs")
def repeat_prompts(prompts, repeat_count, mode: str):
"""
Repeat each prompt in the list for a specified number of times.
The order of prompts in the output list depends on the mode.
Args:
prompts: A list of prompts to be repeated.
repeat_count: The number of times each prompt is repeated.
mode: The mode of repetition. Supported modes are:
- 'random': Shuffle the prompts randomly after repetition.
- 'tile': Repeat the entire prompt list in sequence.
Example: [1, 2, 3] -> [1, 2, 3, 1, 2, 3].
- 'interleave': Repeat each prompt consecutively before moving to
the next. Example: [1, 2, 3] -> [1, 1, 2, 2, 3, 3].
Returns:
A list of repeated prompts in the specified order.
Raises:
ValueError: If an invalid mode is provided.
"""
print("Repeat mode: ", mode)
if mode == 'random':
repeated_prompts = prompts * repeat_count
random.shuffle(repeated_prompts)
return repeated_prompts
elif mode == 'tile':
return prompts * repeat_count
elif mode == 'interleave':
repeated_prompts = []
for prompt in prompts:
repeated_prompts.extend([prompt] * repeat_count)
return repeated_prompts
else:
raise ValueError(f"Invalid mode: {mode}, only support "
"'random', 'tile', 'interleave'")
def main(args):
random.seed(args.shuffle_seed)
# Prepare the prompts:
# we append the document id at the beginning to avoid any of the document
# being the prefix of other documents
prompts = [
str(i) + ' '.join(['hi'] * args.document_length)
for i in range(args.num_documents)
]
prompts = repeat_prompts(prompts, args.repeat_count, mode=args.repeat_mode)
warmup_prompts = [
"This is warm up request " + str(i) + \
' '.join(['hi'] * args.document_length)
for i in range(args.num_documents)]
# Create the LLM engine
engine_args = EngineArgs.from_cli_args(args)
llm = LLM(**dataclasses.asdict(engine_args))
sampling_params = SamplingParams(temperature=0, max_tokens=args.output_len)
print("------warm up------")
test_long_document_qa(
llm=llm,
prompts=warmup_prompts,
sampling_params=sampling_params,
)
print("------start generating------")
test_long_document_qa(
llm=llm,
prompts=prompts,
sampling_params=sampling_params,
)
if __name__ == "__main__":
parser = FlexibleArgumentParser(
description=
'Benchmark the performance with or without automatic prefix caching.')
parser.add_argument(
'--document-length',
type=int,
# Roughly the number of tokens for a system paper,
# excluding images
default=20000,
help='The length of each document in tokens.')
parser.add_argument('--num-documents',
type=int,
default=8,
help='The number of documents to sample prompts from.')
parser.add_argument('--output-len', type=int, default=10)
parser.add_argument('--repeat-count',
type=int,
default=2,
help='Number of times to repeat each prompt')
parser.add_argument("--repeat-mode",
type=str,
default='random',
help='The mode to repeat prompts. The supported '
'modes are "random", "tile", and "interleave". '
'See repeat_prompts() in the source code for details.')
parser.add_argument("--shuffle-seed",
type=int,
default=0,
help='Random seed when the repeat mode is "random"')
parser = EngineArgs.add_cli_args(parser)
args = parser.parse_args()
main(args)
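A small hedged sketch of the repeat modes documented above (importing the script as a module requires vLLM to be installed but does not start an engine):
```python
# Illustrative only: the three repeat modes described in the docstring above.
from benchmark_long_document_qa_throughput import repeat_prompts

prompts = ["doc-0", "doc-1", "doc-2"]
print(repeat_prompts(prompts, repeat_count=2, mode="tile"))
# ['doc-0', 'doc-1', 'doc-2', 'doc-0', 'doc-1', 'doc-2']
print(repeat_prompts(prompts, repeat_count=2, mode="interleave"))
# ['doc-0', 'doc-0', 'doc-1', 'doc-1', 'doc-2', 'doc-2']
print(repeat_prompts(prompts, repeat_count=2, mode="random"))
# same multiset as 'tile', shuffled
```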

View File

@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
Benchmark the efficiency of prefix caching.
@ -10,7 +11,8 @@ Fixed example usage:
--model meta-llama/Llama-2-7b-chat-hf \
--enable-prefix-caching \
--num-prompts 1 \
--repeat-count 100
--repeat-count 100 \
--input-length-range 128:256
ShareGPT example usage:
# This command samples 20 prompts with input lengths
@ -29,7 +31,7 @@ import dataclasses
import json
import random
import time
from typing import List, Optional, Tuple
from typing import Optional
from transformers import PreTrainedTokenizerBase
@ -54,13 +56,30 @@ def test_prefix(llm=None, sampling_params=None, prompts=None):
print(f"cost time {end_time - start_time}")
def sample_requests(
@dataclasses.dataclass
class Request:
prompt: str
prompt_len: int
output_len: int
def sample_tokens(tokenizer: PreTrainedTokenizerBase, length: int) -> str:
vocab = tokenizer.get_vocab()
# Remove the special tokens.
vocab = {
k: v
for k, v in vocab.items() if v not in tokenizer.all_special_ids
}
return random.choices(list(vocab.values()), k=length)
def sample_requests_from_dataset(
dataset_path: str,
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
input_length_range: Tuple[int, int],
input_length_range: tuple[int, int],
fixed_output_len: Optional[int],
) -> List[Tuple[str, int, int]]:
) -> list[Request]:
if fixed_output_len is not None and fixed_output_len < 4:
raise ValueError("output_len too small")
@ -77,39 +96,63 @@ def sample_requests(
random.shuffle(dataset)
min_len, max_len = input_length_range
assert min_len >= 0 and max_len >= min_len, "input_length_range too small"
# Filter out sequences that are too long or too short
filtered_dataset: List[Tuple[str, int, int]] = []
filtered_requests: list[Request] = []
for i in range(len(dataset)):
if len(filtered_dataset) == num_requests:
if len(filtered_requests) == num_requests:
break
# Tokenize the prompts and completions.
prompt = dataset[i][0]
prompt_token_ids = tokenizer(prompt).input_ids
prompt_token_ids = tokenizer(dataset[i][0]).input_ids
prompt = tokenizer.decode(prompt_token_ids)
completion = dataset[i][1]
completion_token_ids = tokenizer(completion).input_ids
prompt_len = len(prompt_token_ids)
output_len = len(completion_token_ids
) if fixed_output_len is None else fixed_output_len
if prompt_len < 4 or output_len < 4:
# Prune too short sequences.
continue
output_len = (len(completion_token_ids)
if fixed_output_len is None else fixed_output_len)
if min_len <= prompt_len <= max_len:
filtered_dataset.append((prompt, prompt_len, output_len))
filtered_requests.append(Request(prompt, prompt_len, output_len))
return filtered_dataset
return filtered_requests
def repeat_and_sort_requests(requests: List[Tuple[str, int, int]],
def sample_requests_from_random(
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
input_length_range: tuple[int, int],
fixed_output_len: Optional[int],
prefix_len: int,
) -> list[Request]:
requests = []
prefix_token_ids = sample_tokens(tokenizer, prefix_len)
min_len, max_len = input_length_range
for i in range(num_requests):
unique_part_token_ids = sample_tokens(
tokenizer,
random.randint(min_len - prefix_len, max_len - prefix_len))
prompt_token_ids = prefix_token_ids + unique_part_token_ids
prompt = tokenizer.decode(prompt_token_ids)
prompt_len = len(prompt_token_ids)
assert (min_len <= prompt_len <= max_len
), f"prompt_len {prompt_len} out of range {min_len}:{max_len}"
requests.append(Request(prompt, prompt_len, fixed_output_len))
return requests
def repeat_and_sort_requests(requests: list[Request],
repeat_count: int,
sort: bool = False) -> List[str]:
sort: bool = False) -> list[str]:
repeated_requests = requests * repeat_count
if sort:
repeated_requests.sort(key=lambda x: x[1])
else:
random.shuffle(repeated_requests)
return [req[0] for req in repeated_requests]
return [req.prompt for req in repeated_requests]
def main(args):
@ -117,9 +160,12 @@ def main(args):
input_length_range = tuple(map(int, args.input_length_range.split(':')))
random.seed(args.seed)
if args.dataset_path is not None:
print(f"Start to sample {args.num_prompts} prompts"
if args.prefix_len > 0:
raise ValueError("prefix-len is not supported when "
"dataset-path is provided.")
print(f"Start to sample {args.num_prompts} prompts "
f"from {args.dataset_path}")
filtered_datasets = sample_requests(
filtered_requests = sample_requests_from_dataset(
dataset_path=args.dataset_path,
num_requests=args.num_prompts,
tokenizer=tokenizer,
@ -127,18 +173,33 @@ def main(args):
fixed_output_len=args.output_len,
)
else:
prompt_len = len(tokenizer(PROMPT).input_ids)
filtered_datasets = [(PROMPT, prompt_len, args.output_len)
] * args.num_prompts
print(f"Start to sample {args.num_prompts} prompts from random")
filtered_requests = sample_requests_from_random(
num_requests=args.num_prompts,
tokenizer=tokenizer,
input_length_range=input_length_range,
fixed_output_len=args.output_len,
prefix_len=args.prefix_len,
)
# Print some helpful stats of the requests.
print(f"Sampled {len(filtered_requests)} requests.")
prompt_lens = [req.prompt_len for req in filtered_requests]
print(f"Average input length: {sum(prompt_lens) / len(prompt_lens)}")
print(f"P50 input length: {sorted(prompt_lens)[len(prompt_lens) // 2]}")
print(f"Min Prompt Length: {min(prompt_lens)}")
print(f"Max Prompt Length: {max(prompt_lens)}")
engine_args = EngineArgs.from_cli_args(args)
llm = LLM(**dataclasses.asdict(engine_args))
sampling_params = SamplingParams(temperature=0, max_tokens=args.output_len)
sampling_params = SamplingParams(temperature=0,
max_tokens=args.output_len,
detokenize=not args.disable_detokenize)
print("Testing filtered datasets")
prompts = repeat_and_sort_requests(filtered_datasets,
print("Testing filtered requests")
prompts = repeat_and_sort_requests(filtered_requests,
repeat_count=args.repeat_count,
sort=args.sort)
@ -161,20 +222,35 @@ if __name__ == "__main__":
parser.add_argument('--output-len', type=int, default=10)
parser.add_argument('--num-prompts',
type=int,
default=1,
required=True,
help="Number of the prompts sampled from dataset")
parser.add_argument('--repeat-count',
type=int,
default=100,
default=1,
help='Number of times to repeat each prompt')
parser.add_argument('--sort',
action='store_true',
help='Sort prompts by input length')
parser.add_argument('--input-length-range',
type=str,
default='128:256',
required=True,
help='Range of input lengths for sampling prompts,'
'specified as "min:max" (e.g., "128:256").')
parser.add_argument(
"--prefix-len",
type=int,
default=0,
help="Specifies the length of a common prefix to be "
"added to the input prompt. The input-length-range will "
"subtract this length when filtering prompts. Only used "
"when dataset-path is not provided.",
)
parser.add_argument(
'--disable-detokenize',
action='store_true',
help=("Do not detokenize responses (i.e. do not include "
"detokenization time in the latency measurement)"),
)
parser = EngineArgs.add_cli_args(parser)
args = parser.parse_args()


@ -1,10 +1,11 @@
# SPDX-License-Identifier: Apache-2.0
"""Benchmark offline prioritization."""
import argparse
import dataclasses
import json
import random
import time
from typing import List, Optional, Tuple
from typing import Optional
from transformers import AutoTokenizer, PreTrainedTokenizerBase
@ -12,12 +13,17 @@ from vllm.engine.arg_utils import EngineArgs
from vllm.utils import FlexibleArgumentParser
# Select an equiprobable random priority
def get_random_flag():
return 0 if random.random() < 0.5 else 1
def sample_requests(
dataset_path: str,
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
fixed_output_len: Optional[int],
) -> List[Tuple[str, int, int]]:
) -> list[tuple[str, int, int, int]]:
if fixed_output_len is not None and fixed_output_len < 4:
raise ValueError("output_len too small")
@ -34,7 +40,7 @@ def sample_requests(
random.shuffle(dataset)
# Filter out sequences that are too long or too short
filtered_dataset: List[Tuple[str, int, int]] = []
filtered_dataset: list[tuple[str, int, int]] = []
for i in range(len(dataset)):
if len(filtered_dataset) == num_requests:
break
@ -54,8 +60,7 @@ def sample_requests(
# Prune too long sequences.
continue
#Select a equi-probable random priority
priority = 0 if random.random() < 0.5 else 1
priority = get_random_flag()
filtered_dataset.append((prompt, prompt_len, output_len, priority))
@ -63,13 +68,20 @@ def sample_requests(
def run_vllm(
requests: List[Tuple[str, int, int]],
requests: list[tuple[str, int, int]],
n: int,
engine_args: EngineArgs,
disable_detokenize: bool = False,
) -> float:
from vllm import LLM, SamplingParams
llm = LLM(**dataclasses.asdict(engine_args))
assert all(
llm.llm_engine.model_config.max_model_len >= (request[1] + request[2])
for request in requests), (
"Please ensure that max_model_len is greater than the sum of"
" input_len and output_len for all requests.")
# Add the requests to the engine.
prompts = []
sampling_params = []
@ -84,6 +96,7 @@ def run_vllm(
top_p=1.0,
ignore_eos=True,
max_tokens=output_len,
detokenize=not disable_detokenize,
))
start = time.perf_counter()
@ -102,15 +115,16 @@ def main(args: argparse.Namespace):
if args.dataset is None:
# Synthesize a prompt with the given input length.
prompt = "hi" * (args.input_len - 1)
requests = [(prompt, args.input_len, args.output_len)
for _ in range(args.num_prompts)]
requests = [(prompt, args.input_len, args.output_len,
get_random_flag()) for _ in range(args.num_prompts)]
else:
requests = sample_requests(args.dataset, args.num_prompts, tokenizer,
args.output_len)
if args.backend == "vllm":
elapsed_time = run_vllm(requests, args.n,
EngineArgs.from_cli_args(args))
EngineArgs.from_cli_args(args),
args.disable_detokenize)
else:
raise ValueError(f"Unknown backend: {args.backend}")
total_num_tokens = sum(prompt_len + output_len
@ -163,6 +177,12 @@ if __name__ == "__main__":
type=str,
default=None,
help='Path to save the throughput results in JSON format.')
parser.add_argument(
'--disable-detokenize',
action='store_true',
help=("Do not detokenize responses (i.e. do not include "
"detokenization time in the latency measurement)"),
)
parser = EngineArgs.add_cli_args(parser)
args = parser.parse_args()


@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
r"""Benchmark online serving throughput.
On the server side, run one of the following commands:
@ -6,9 +7,6 @@ On the server side, run one of the following commands:
--swap-space 16 \
--disable-log-requests
(TGI backend)
./launch_tgi_server.sh <your_model> <max_batch_total_tokens>
On the client side, run:
python benchmarks/benchmark_serving.py \
--backend <backend> \
@ -24,22 +22,20 @@ On the client side, run:
"""
import argparse
import asyncio
import base64
import io
import gc
import json
import os
import random
import time
import warnings
from collections.abc import AsyncGenerator, Iterable
from dataclasses import dataclass
from datetime import datetime
from typing import Any, AsyncGenerator, Collection, Dict, List, Optional, Tuple
from typing import Any, Optional
import numpy as np
from backend_request_func import (ASYNC_REQUEST_FUNCS, RequestFuncInput,
RequestFuncOutput)
from datasets import load_dataset
from PIL.Image import Image
from tqdm.asyncio import tqdm
from transformers import PreTrainedTokenizerBase
@ -53,6 +49,12 @@ try:
except ImportError:
from argparse import ArgumentParser as FlexibleArgumentParser
from benchmark_dataset import (BurstGPTDataset, ConversationDataset,
InstructCoderDataset, RandomDataset,
SampleRequest, ShareGPTDataset, SonnetDataset,
VisionArenaDataset)
from benchmark_utils import convert_to_pytorch_benchmark_format, write_to_json
MILLISECONDS_TO_SECONDS_CONVERSION = 1000
@ -68,256 +70,48 @@ class BenchmarkMetrics:
mean_ttft_ms: float
median_ttft_ms: float
std_ttft_ms: float
percentiles_ttft_ms: List[Tuple[float, float]]
percentiles_ttft_ms: list[tuple[float, float]]
mean_tpot_ms: float
median_tpot_ms: float
std_tpot_ms: float
percentiles_tpot_ms: List[Tuple[float, float]]
percentiles_tpot_ms: list[tuple[float, float]]
mean_itl_ms: float
median_itl_ms: float
std_itl_ms: float
percentiles_itl_ms: List[Tuple[float, float]]
percentiles_itl_ms: list[tuple[float, float]]
# E2EL stands for end-to-end latency per request.
# It is the time taken on the client side from sending
# a request to receiving a complete response.
mean_e2el_ms: float
median_e2el_ms: float
std_e2el_ms: float
percentiles_e2el_ms: List[Tuple[float, float]]
def sample_sharegpt_requests(
dataset_path: str,
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
fixed_output_len: Optional[int] = None,
) -> List[Tuple[str, int, int, None]]:
# Load the dataset.
with open(dataset_path, encoding='utf-8') as f:
dataset = json.load(f)
# Filter out the conversations with less than 2 turns.
dataset = [data for data in dataset if len(data["conversations"]) >= 2]
# Only keep the first two turns of each conversation.
dataset = [(data["conversations"][0]["value"],
data["conversations"][1]["value"]) for data in dataset]
# Shuffle the dataset.
random.shuffle(dataset)
# Filter out sequences that are too long or too short
filtered_dataset: List[Tuple[str, int, int]] = []
for i in range(len(dataset)):
if len(filtered_dataset) == num_requests:
break
# Tokenize the prompts and completions.
prompt = dataset[i][0]
prompt_token_ids = tokenizer(prompt).input_ids
completion = dataset[i][1]
completion_token_ids = tokenizer(completion).input_ids
prompt_len = len(prompt_token_ids)
output_len = len(completion_token_ids
) if fixed_output_len is None else fixed_output_len
if prompt_len < 4 or (fixed_output_len is None and output_len < 4):
# Prune too short sequences.
continue
if prompt_len > 1024 or prompt_len + output_len > 2048:
# Prune too long sequences.
continue
filtered_dataset.append((prompt, prompt_len, output_len, None))
return filtered_dataset
def sample_sonnet_requests(
dataset_path: str,
num_requests: int,
input_len: int,
output_len: int,
prefix_len: int,
tokenizer: PreTrainedTokenizerBase,
) -> List[Tuple[str, str, int, int, None]]:
assert (
input_len > prefix_len
), "'args.sonnet-input-len' must be greater than 'args.prefix-input-len'."
# Load the dataset.
with open(dataset_path, encoding='utf-8') as f:
poem_lines = f.readlines()
# Tokenize the poem lines.
poem_token_ids = tokenizer(poem_lines).input_ids
average_poem_len = sum(
len(token_ids) for token_ids in poem_token_ids) / len(poem_token_ids)
# Base prefix for all requests.
base_prompt = "Pick as many lines as you can from these poem lines:\n"
base_message = [{
"role": "user",
"content": base_prompt,
}]
base_prompt_formatted = tokenizer.apply_chat_template(
base_message, add_generation_prompt=True, tokenize=False)
base_prompt_offset = len(tokenizer(base_prompt_formatted).input_ids)
assert (
input_len > base_prompt_offset
), f"Please set 'args.sonnet-input-len' higher than {base_prompt_offset}."
num_input_lines = round(
(input_len - base_prompt_offset) / average_poem_len)
# First approximately `prefix_len` number of tokens in the
# prompt are fixed poem lines.
assert (
prefix_len > base_prompt_offset
), f"Please set 'args.sonnet-prefix-len' higher than {base_prompt_offset}."
num_prefix_lines = round(
(prefix_len - base_prompt_offset) / average_poem_len)
prefix_lines = poem_lines[:num_prefix_lines]
# Sample the rest of lines per request.
sampled_requests: List[Tuple[str, int, int]] = []
for _ in range(num_requests):
num_lines_needed = num_input_lines - num_prefix_lines
sampled_lines = "".join(prefix_lines +
random.choices(poem_lines, k=num_lines_needed))
prompt = f"{base_prompt}{sampled_lines}"
message = [
{
"role": "user",
"content": prompt,
},
]
prompt_formatted = tokenizer.apply_chat_template(
message, add_generation_prompt=True, tokenize=False)
prompt_len = len(tokenizer(prompt_formatted).input_ids)
sampled_requests.append(
(prompt, prompt_formatted, prompt_len, output_len, None))
return sampled_requests
def sample_hf_requests(
dataset_path: str,
dataset_subset: str,
dataset_split: str,
num_requests: int,
tokenizer: PreTrainedTokenizerBase,
random_seed: int,
fixed_output_len: Optional[int] = None,
) -> List[Tuple[str, str, int, Optional[Dict[str, Collection[str]]]]]:
dataset = load_dataset(dataset_path,
name=dataset_subset,
split=dataset_split,
streaming=True)
assert "conversations" in dataset.features, (
"HF Dataset must have 'conversations' column.")
filter_func = lambda x: len(x["conversations"]) >= 2
filtered_dataset = dataset.shuffle(seed=random_seed).filter(filter_func)
sampled_requests: List[Tuple[str, int, int, Dict[str,
Collection[str]]]] = []
for data in filtered_dataset:
if len(sampled_requests) == num_requests:
break
# Tokenize the prompts and completions.
prompt = data["conversations"][0]["value"]
prompt_token_ids = tokenizer(prompt).input_ids
completion = data["conversations"][1]["value"]
completion_token_ids = tokenizer(completion).input_ids
prompt_len = len(prompt_token_ids)
output_len = len(completion_token_ids
) if fixed_output_len is None else fixed_output_len
if fixed_output_len is None and (prompt_len < 4 or output_len < 4):
# Prune too short sequences.
continue
if fixed_output_len is None and \
(prompt_len > 1024 or prompt_len + output_len > 2048):
# Prune too long sequences.
continue
if "image" in data and isinstance(data["image"], Image):
image: Image = data["image"]
image = image.convert("RGB")
image_data = io.BytesIO()
image.save(image_data, format='JPEG')
image_base64 = base64.b64encode(
image_data.getvalue()).decode("utf-8")
mm_content = {
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_base64}"
},
}
else:
mm_content = None
sampled_requests.append((prompt, prompt_len, output_len, mm_content))
return sampled_requests
def sample_random_requests(
prefix_len: int,
input_len: int,
output_len: int,
num_prompts: int,
range_ratio: float,
tokenizer: PreTrainedTokenizerBase,
) -> List[Tuple[str, int, int]]:
prefix_token_ids = np.random.randint(0,
tokenizer.vocab_size,
size=prefix_len).tolist()
input_lens = np.random.randint(
int(input_len * range_ratio),
input_len + 1,
size=num_prompts,
)
output_lens = np.random.randint(
int(output_len * range_ratio),
output_len + 1,
size=num_prompts,
)
offsets = np.random.randint(0, tokenizer.vocab_size, size=num_prompts)
input_requests = []
for i in range(num_prompts):
prompt = tokenizer.decode(prefix_token_ids +
[(offsets[i] + i + j) % tokenizer.vocab_size
for j in range(input_lens[i])])
input_requests.append((prompt, int(prefix_len + input_lens[i]),
int(output_lens[i]), None))
return input_requests
percentiles_e2el_ms: list[tuple[float, float]]
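One possible way to fill the (percentile, value-in-ms) pairs above from raw per-request samples, sketched with numpy; the helper name is illustrative and not from the diff.
import numpy as np
def percentile_pairs(samples_s: list[float],
                     percentiles: list[float]) -> list[tuple[float, float]]:
    # Pair each requested percentile with its value, converted from
    # seconds to milliseconds to match the *_ms fields above.
    return [(p, float(np.percentile(samples_s, p)) * 1000.0)
            for p in percentiles]
# Example: end-to-end latencies (in seconds) for three requests.
print(percentile_pairs([0.8, 1.2, 2.5], [50.0, 99.0]))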
async def get_request(
input_requests: List[Tuple[str, int, int]],
input_requests: list[SampleRequest],
request_rate: float,
burstiness: float = 1.0,
) -> AsyncGenerator[Tuple[str, int, int], None]:
) -> AsyncGenerator[SampleRequest, None]:
"""
Asynchronously generates requests at a specified rate
Asynchronously generates requests at a specified rate
with OPTIONAL burstiness.
Args:
input_requests:
A list of input requests, each represented as a tuple.
request_rate:
input_requests:
A list of input requests, each represented as a SampleRequest.
request_rate:
The rate at which requests are generated (requests/s).
burstiness (optional):
The burstiness factor of the request generation.
burstiness (optional):
The burstiness factor of the request generation.
Only takes effect when request_rate is not inf.
Default value is 1, which follows a Poisson process.
Otherwise, the request intervals follow a gamma distribution.
A lower burstiness value (0 < burstiness < 1) results
in more bursty requests, while a higher burstiness value
A lower burstiness value (0 < burstiness < 1) results
in more bursty requests, while a higher burstiness value
(burstiness > 1) results in a more uniform arrival of requests.
"""
input_requests = iter(input_requests)
input_requests: Iterable[SampleRequest] = iter(input_requests)
# Calculate scale parameter theta to maintain the desired request_rate.
assert burstiness > 0, (
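The arrival model described in the docstring above can be sketched standalone as follows (assumptions: numpy's gamma sampler and an async generator named emit that is not part of the diff). With shape = burstiness and scale = 1 / (request_rate * burstiness), the mean inter-arrival time stays at 1 / request_rate.
import asyncio
import numpy as np
async def emit(items, request_rate: float, burstiness: float = 1.0):
    # Gamma-distributed gaps: shape=burstiness, scale chosen so the mean
    # interval is 1/request_rate; burstiness=1 reduces to a Poisson process.
    theta = 1.0 / (request_rate * burstiness)
    for item in items:
        yield item
        if request_rate == float("inf"):
            continue  # no pacing when the rate is unbounded
        await asyncio.sleep(np.random.gamma(shape=burstiness, scale=theta))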
@ -339,38 +133,42 @@ async def get_request(
def calculate_metrics(
input_requests: List[Tuple[str, int, int]],
outputs: List[RequestFuncOutput],
input_requests: list[SampleRequest],
outputs: list[RequestFuncOutput],
dur_s: float,
tokenizer: PreTrainedTokenizerBase,
selected_percentile_metrics: List[str],
selected_percentiles: List[float],
gootput_config_dict: Dict[str, float],
) -> Tuple[BenchmarkMetrics, List[int]]:
actual_output_lens: List[int] = []
selected_percentile_metrics: list[str],
selected_percentiles: list[float],
goodput_config_dict: dict[str, float],
) -> tuple[BenchmarkMetrics, list[int]]:
actual_output_lens: list[int] = []
total_input = 0
completed = 0
good_completed = 0
itls: List[float] = []
tpots: List[float] = []
all_tpots: List[float] = []
ttfts: List[float] = []
e2els: List[float] = []
itls: list[float] = []
tpots: list[float] = []
all_tpots: list[float] = []
ttfts: list[float] = []
e2els: list[float] = []
for i in range(len(outputs)):
if outputs[i].success:
# We use the tokenizer to count the number of output tokens for all
# serving backends instead of looking at len(outputs[i].itl) since
# multiple output tokens may be bundled together
# Note : this may inflate the output token count slightly
output_len = len(
tokenizer(outputs[i].generated_text,
add_special_tokens=False).input_ids)
output_len = outputs[i].output_tokens
if output_len is None:
# We use the tokenizer to count the number of output tokens
# for some serving backends instead of looking at
# len(outputs[i].itl) since multiple output tokens may be
# bundled together
# Note : this may inflate the output token count slightly
output_len = len(
tokenizer(outputs[i].generated_text,
add_special_tokens=False).input_ids)
actual_output_lens.append(output_len)
total_input += input_requests[i][1]
total_input += input_requests[i].prompt_len
tpot = 0
if output_len > 1:
tpot = (outputs[i].latency - outputs[i].ttft) / (output_len -
1)
latency_minus_ttft = outputs[i].latency - outputs[i].ttft
tpot = latency_minus_ttft / (output_len - 1)
tpots.append(tpot)
# Note: if output_len <= 1, we regard tpot as 0 for goodput
all_tpots.append(tpot)
@ -381,21 +179,21 @@ def calculate_metrics(
else:
actual_output_lens.append(0)
if gootput_config_dict:
if goodput_config_dict:
valid_metrics = []
slo_values = []
if "ttft" in gootput_config_dict:
if "ttft" in goodput_config_dict:
valid_metrics.append(ttfts)
slo_values.append(gootput_config_dict["ttft"] /
slo_values.append(goodput_config_dict["ttft"] /
MILLISECONDS_TO_SECONDS_CONVERSION)
if "tpot" in gootput_config_dict:
if "tpot" in goodput_config_dict:
valid_metrics.append(all_tpots)
slo_values.append(gootput_config_dict["tpot"] /
slo_values.append(goodput_config_dict["tpot"] /
MILLISECONDS_TO_SECONDS_CONVERSION)
if "e2el" in gootput_config_dict:
if "e2el" in goodput_config_dict:
valid_metrics.append(e2els)
slo_values.append(gootput_config_dict["e2el"] /
slo_values.append(goodput_config_dict["e2el"] /
MILLISECONDS_TO_SECONDS_CONVERSION)
for req_metric in zip(*valid_metrics):
@ -447,19 +245,20 @@ async def benchmark(
api_url: str,
base_url: str,
model_id: str,
model_name: str,
tokenizer: PreTrainedTokenizerBase,
input_requests: List[Tuple[str, int, int]],
input_requests: list[SampleRequest],
logprobs: Optional[int],
best_of: int,
request_rate: float,
burstiness: float,
disable_tqdm: bool,
profile: bool,
selected_percentile_metrics: List[str],
selected_percentiles: List[str],
selected_percentile_metrics: list[str],
selected_percentiles: list[float],
ignore_eos: bool,
gootput_config_dict: Dict[str, float],
goodput_config_dict: dict[str, float],
max_concurrency: Optional[int],
lora_modules: Optional[Iterable[str]],
):
if backend in ASYNC_REQUEST_FUNCS:
request_func = ASYNC_REQUEST_FUNCS[backend]
@ -467,23 +266,28 @@ async def benchmark(
raise ValueError(f"Unknown backend: {backend}")
print("Starting initial single prompt test run...")
test_prompt, test_prompt_len, test_output_len, test_mm_content = (
input_requests[0])
test_prompt, test_prompt_len, test_output_len, test_mm_content = \
input_requests[0].prompt, input_requests[0].prompt_len, \
input_requests[0].expected_output_len, \
input_requests[0].multi_modal_data
if backend != "openai-chat" and test_mm_content is not None:
# multi-modal benchmark is only available on OpenAI Chat backend.
raise ValueError(
"Multi-modal content is only supported on 'openai-chat' backend.")
assert test_mm_content is None or isinstance(test_mm_content, dict)
test_input = RequestFuncInput(
model=model_id,
model_name=model_name,
prompt=test_prompt,
api_url=api_url,
prompt_len=test_prompt_len,
output_len=test_output_len,
logprobs=logprobs,
best_of=best_of,
multi_modal_content=test_mm_content,
ignore_eos=ignore_eos,
)
test_output = await request_func(request_func_input=test_input)
if not test_output.success:
raise ValueError(
@ -492,15 +296,21 @@ async def benchmark(
else:
print("Initial test run completed. Starting main benchmark run...")
if lora_modules:
# For each input request, choose a LoRA module at random.
lora_modules = iter(
[random.choice(lora_modules) \
for _ in range(len(input_requests))])
if profile:
print("Starting profiler...")
profile_input = RequestFuncInput(model=model_id,
model_name=model_name,
prompt=test_prompt,
api_url=base_url + "/start_profile",
prompt_len=test_prompt_len,
output_len=test_output_len,
logprobs=logprobs,
best_of=best_of,
multi_modal_content=test_mm_content,
ignore_eos=ignore_eos)
profile_output = await request_func(request_func_input=profile_input)
@ -534,23 +344,30 @@ async def benchmark(
pbar=pbar)
benchmark_start_time = time.perf_counter()
tasks: List[asyncio.Task] = []
tasks: list[asyncio.Task] = []
async for request in get_request(input_requests, request_rate, burstiness):
prompt, prompt_len, output_len, mm_content = request
request_func_input = RequestFuncInput(model=model_id,
prompt, prompt_len, output_len, mm_content = request.prompt, \
request.prompt_len, request.expected_output_len, \
request.multi_modal_data
req_model_id, req_model_name = model_id, model_name
if lora_modules:
req_lora_module = next(lora_modules)
req_model_id, req_model_name = req_lora_module, req_lora_module
request_func_input = RequestFuncInput(model=req_model_id,
model_name=req_model_name,
prompt=prompt,
api_url=api_url,
prompt_len=prompt_len,
output_len=output_len,
logprobs=logprobs,
best_of=best_of,
multi_modal_content=mm_content,
ignore_eos=ignore_eos)
tasks.append(
asyncio.create_task(
limited_request_func(request_func_input=request_func_input,
pbar=pbar)))
outputs: List[RequestFuncOutput] = await asyncio.gather(*tasks)
outputs: list[RequestFuncOutput] = await asyncio.gather(*tasks)
if profile:
print("Stopping profiler...")
@ -561,7 +378,6 @@ async def benchmark(
prompt_len=test_prompt_len,
output_len=test_output_len,
logprobs=logprobs,
best_of=best_of,
)
profile_output = await request_func(request_func_input=profile_input)
if profile_output.success:
@ -579,7 +395,7 @@ async def benchmark(
tokenizer=tokenizer,
selected_percentile_metrics=selected_percentile_metrics,
selected_percentiles=selected_percentiles,
gootput_config_dict=gootput_config_dict,
goodput_config_dict=goodput_config_dict,
)
print("{s:{c}^{n}}".format(s=' Serving Benchmark Result ', n=50, c='='))
@ -591,7 +407,7 @@ async def benchmark(
metrics.total_output))
print("{:<40} {:<10.2f}".format("Request throughput (req/s):",
metrics.request_throughput))
if gootput_config_dict:
if goodput_config_dict:
print("{:<40} {:<10.2f}".format("Request goodput (req/s):",
metrics.request_goodput))
print("{:<40} {:<10.2f}".format("Output token throughput (tok/s):",
@ -606,7 +422,7 @@ async def benchmark(
"total_output_tokens": metrics.total_output,
"request_throughput": metrics.request_throughput,
"request_goodput:":
metrics.request_goodput if gootput_config_dict else None,
metrics.request_goodput if goodput_config_dict else None,
"output_throughput": metrics.output_throughput,
"total_token_throughput": metrics.total_token_throughput,
"input_lens": [output.prompt_len for output in outputs],
@ -662,11 +478,11 @@ async def benchmark(
def check_goodput_args(args):
# Check and parse goodput arguments
gootput_config_dict = {}
goodput_config_dict = {}
VALID_NAMES = ["ttft", "tpot", "e2el"]
if args.goodput:
gootput_config_dict = parse_goodput(args.goodput)
for slo_name, slo_val in gootput_config_dict.items():
goodput_config_dict = parse_goodput(args.goodput)
for slo_name, slo_val in goodput_config_dict.items():
if slo_name not in VALID_NAMES:
raise ValueError(
f"Invalid metric name found, {slo_name}: {slo_val}. "
@ -677,22 +493,47 @@ def check_goodput_args(args):
f"Invalid value found, {slo_name}: {slo_val}. "
"The service level objective value should be "
"non-negative.")
return gootput_config_dict
return goodput_config_dict
def parse_goodput(slo_pairs):
gootput_config_dict = {}
goodput_config_dict = {}
try:
for slo_pair in slo_pairs:
slo_name, slo_val = slo_pair.split(":")
gootput_config_dict[slo_name] = float(slo_val)
goodput_config_dict[slo_name] = float(slo_val)
except ValueError as err:
raise argparse.ArgumentTypeError(
"Invalid format found for service level objectives. "
"Specify service level objectives for goodput as \"KEY:VALUE\" "
"pairs, where the key is a metric name, and the value is a "
"number in milliseconds.") from err
return gootput_config_dict
return goodput_config_dict
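For reference, a small usage sketch of parse_goodput; the SLO values are made-up examples, following the KEY:VALUE-in-milliseconds format stated in the error message above.
# Example only: --goodput ttft:500 tpot:50 e2el:5000
print(parse_goodput(["ttft:500", "tpot:50", "e2el:5000"]))
# -> {'ttft': 500.0, 'tpot': 50.0, 'e2el': 5000.0}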
def save_to_pytorch_benchmark_format(args: argparse.Namespace,
results: dict[str, Any],
file_name: str) -> None:
metrics = [
"median_ttft_ms", "mean_ttft_ms", "std_ttft_ms", "p99_ttft_ms",
"mean_tpot_ms", "median_tpot_ms", "std_tpot_ms", "p99_tpot_ms",
"median_itl_ms", "mean_itl_ms", "std_itl_ms", "p99_itl_ms"
]
# These raw data might be useful, but they are rather big. They can be added
# later if needed
ignored_metrics = ["ttfts", "itls", "generated_texts", "errors"]
pt_records = convert_to_pytorch_benchmark_format(
args=args,
metrics={k: [results[k]]
for k in metrics},
extra_info={
k: results[k]
for k in results if k not in metrics and k not in ignored_metrics
})
if pt_records:
# Don't use json suffix here as we don't want CI to pick it up
pt_file = f"{os.path.splitext(file_name)[0]}.pytorch.json"
write_to_json(pt_file, pt_records)
def main(args: argparse.Namespace):
@ -702,7 +543,9 @@ def main(args: argparse.Namespace):
backend = args.backend
model_id = args.model
model_name = args.served_model_name
tokenizer_id = args.tokenizer if args.tokenizer is not None else args.model
tokenizer_mode = args.tokenizer_mode
if args.base_url is not None:
api_url = f"{args.base_url}{args.endpoint}"
@ -712,84 +555,91 @@ def main(args: argparse.Namespace):
base_url = f"http://{args.host}:{args.port}"
tokenizer = get_tokenizer(tokenizer_id,
tokenizer_mode=tokenizer_mode,
trust_remote_code=args.trust_remote_code)
if args.dataset is not None:
warnings.warn(
"The '--dataset' argument will be deprecated in the next "
"release. Please use '--dataset-name' and "
"'--dataset-path' in the future runs.",
stacklevel=2)
input_requests = sample_sharegpt_requests(
dataset_path=args.dataset,
num_requests=args.num_prompts,
tokenizer=tokenizer,
fixed_output_len=args.sharegpt_output_len,
)
if args.dataset_name is None:
raise ValueError(
"Please specify '--dataset-name' and the corresponding "
"'--dataset-path' if required.")
elif args.dataset_name == "sharegpt":
input_requests = sample_sharegpt_requests(
dataset_path=args.dataset_path,
num_requests=args.num_prompts,
tokenizer=tokenizer,
fixed_output_len=args.sharegpt_output_len,
)
elif args.dataset_name == "sonnet":
# Do not format the prompt, pass to message directly
if args.dataset_name == "sonnet":
dataset = SonnetDataset(dataset_path=args.dataset_path)
# For the "sonnet" dataset, formatting depends on the backend.
if args.backend == "openai-chat":
input_requests = sample_sonnet_requests(
dataset_path=args.dataset_path,
num_requests=args.num_prompts,
input_len=args.sonnet_input_len,
output_len=args.sonnet_output_len,
prefix_len=args.sonnet_prefix_len,
tokenizer=tokenizer,
)
input_requests = [(prompt, prompt_len, output_len, None)
for prompt, prompt_formatted, prompt_len,
output_len, _ in input_requests]
input_requests = dataset.sample(num_requests=args.num_prompts,
input_len=args.sonnet_input_len,
output_len=args.sonnet_output_len,
prefix_len=args.sonnet_prefix_len,
tokenizer=tokenizer,
return_prompt_formatted=False)
else:
assert (
tokenizer.chat_template or tokenizer.default_chat_template
), "Tokenizer/model must have chat template for sonnet dataset."
input_requests = sample_sonnet_requests(
dataset_path=args.dataset_path,
num_requests=args.num_prompts,
input_len=args.sonnet_input_len,
output_len=args.sonnet_output_len,
prefix_len=args.sonnet_prefix_len,
tokenizer=tokenizer,
)
input_requests = [(prompt_formatted, prompt_len, output_len, None)
for prompt, prompt_formatted, prompt_len,
output_len, _ in input_requests]
assert tokenizer.chat_template or tokenizer.default_chat_template, (
"Tokenizer/model must have chat template for sonnet dataset.")
input_requests = dataset.sample(num_requests=args.num_prompts,
input_len=args.sonnet_input_len,
output_len=args.sonnet_output_len,
prefix_len=args.sonnet_prefix_len,
tokenizer=tokenizer,
return_prompt_formatted=True)
elif args.dataset_name == "hf":
input_requests = sample_hf_requests(
# all following datasets are implemented from the
# HuggingFaceDataset base class
if args.dataset_path in VisionArenaDataset.SUPPORTED_DATASET_PATHS:
dataset_class = VisionArenaDataset
args.hf_split = "train"
args.hf_subset = None
elif args.dataset_path in InstructCoderDataset.SUPPORTED_DATASET_PATHS:
dataset_class = InstructCoderDataset
args.hf_split = "train"
elif args.dataset_path in ConversationDataset.SUPPORTED_DATASET_PATHS:
dataset_class = ConversationDataset
input_requests = dataset_class(
dataset_path=args.dataset_path,
dataset_subset=args.hf_subset,
dataset_split=args.hf_split,
).sample(
num_requests=args.num_prompts,
tokenizer=tokenizer,
random_seed=args.seed,
fixed_output_len=args.hf_output_len,
)
elif args.dataset_name == "random":
input_requests = sample_random_requests(
prefix_len=args.random_prefix_len,
input_len=args.random_input_len,
output_len=args.random_output_len,
num_prompts=args.num_prompts,
range_ratio=args.random_range_ratio,
tokenizer=tokenizer,
output_len=args.hf_output_len,
)
else:
raise ValueError(f"Unknown dataset: {args.dataset_name}")
# For datasets that follow a similar structure, use a mapping.
dataset_mapping = {
"sharegpt":
lambda: ShareGPTDataset(random_seed=args.seed,
dataset_path=args.dataset_path).sample(
tokenizer=tokenizer,
num_requests=args.num_prompts,
output_len=args.sharegpt_output_len,
),
"burstgpt":
lambda: BurstGPTDataset(random_seed=args.seed,
dataset_path=args.dataset_path).
sample(tokenizer=tokenizer, num_requests=args.num_prompts),
"random":
lambda: RandomDataset(dataset_path=args.dataset_path).sample(
tokenizer=tokenizer,
num_requests=args.num_prompts,
prefix_len=args.random_prefix_len,
input_len=args.random_input_len,
output_len=args.random_output_len,
range_ratio=args.random_range_ratio,
)
}
gootput_config_dict = check_goodput_args(args)
try:
input_requests = dataset_mapping[args.dataset_name]()
except KeyError as err:
raise ValueError(f"Unknown dataset: {args.dataset_name}") from err
goodput_config_dict = check_goodput_args(args)
# Avoid GC processing "static" data - reduce pause times.
gc.collect()
gc.freeze()
benchmark_result = asyncio.run(
benchmark(
@ -797,10 +647,10 @@ def main(args: argparse.Namespace):
api_url=api_url,
base_url=base_url,
model_id=model_id,
model_name=model_name,
tokenizer=tokenizer,
input_requests=input_requests,
logprobs=args.logprobs,
best_of=args.best_of,
request_rate=args.request_rate,
burstiness=args.burstiness,
disable_tqdm=args.disable_tqdm,
@ -810,13 +660,14 @@ def main(args: argparse.Namespace):
float(p) for p in args.metric_percentiles.split(",")
],
ignore_eos=args.ignore_eos,
gootput_config_dict=gootput_config_dict,
goodput_config_dict=goodput_config_dict,
max_concurrency=args.max_concurrency,
lora_modules=args.lora_modules,
))
# Save config and results to json
if args.save_result:
result_json: Dict[str, Any] = {}
result_json: dict[str, Any] = {}
# Setup
current_dt = datetime.now().strftime("%Y%m%d-%H%M%S")
@ -824,7 +675,6 @@ def main(args: argparse.Namespace):
result_json["backend"] = backend
result_json["model_id"] = model_id
result_json["tokenizer_id"] = tokenizer_id
result_json["best_of"] = args.best_of
result_json["num_prompts"] = args.num_prompts
# Metadata
@ -838,9 +688,18 @@ def main(args: argparse.Namespace):
"Invalid metadata format. Please use KEY=VALUE format."
)
if not args.save_detailed:
# Remove fields with too many data points
for field in [
"input_lens", "output_lens", "ttfts", "itls",
"generated_texts", "errors"
]:
if field in result_json:
del result_json[field]
# Traffic
result_json["request_rate"] = (
args.request_rate if args.request_rate < float("inf") else "inf")
result_json["request_rate"] = (args.request_rate if args.request_rate
< float("inf") else "inf")
result_json["burstiness"] = args.burstiness
result_json["max_concurrency"] = args.max_concurrency
@ -858,6 +717,7 @@ def main(args: argparse.Namespace):
file_name = os.path.join(args.result_dir, file_name)
with open(file_name, "w", encoding='utf-8') as outfile:
json.dump(result_json, outfile)
save_to_pytorch_benchmark_format(args, result_json, file_name)
if __name__ == "__main__":
@ -875,7 +735,8 @@ if __name__ == "__main__":
default=None,
help="Server or API base url if not using http host and port.",
)
parser.add_argument("--host", type=str, default="localhost")
# Use 127.0.0.1 here instead of localhost to force the use of ipv4
parser.add_argument("--host", type=str, default="127.0.0.1")
parser.add_argument("--port", type=int, default=8000)
parser.add_argument(
"--endpoint",
@ -883,18 +744,11 @@ if __name__ == "__main__":
default="/v1/completions",
help="API endpoint.",
)
parser.add_argument(
"--dataset",
type=str,
default=None,
help="Path to the ShareGPT dataset, will be deprecated in the "
"next release.",
)
parser.add_argument(
"--dataset-name",
type=str,
default="sharegpt",
choices=["sharegpt", "sonnet", "random", "hf"],
choices=["sharegpt", "burstgpt", "sonnet", "random", "hf"],
help="Name of the dataset to benchmark on.",
)
parser.add_argument("--dataset-path",
@ -927,13 +781,6 @@ if __name__ == "__main__":
help=
"Name or path of the tokenizer, if not using the default tokenizer.", # noqa: E501
)
parser.add_argument(
"--best-of",
type=int,
default=1,
help="Generates `best_of` sequences per prompt and "
"returns the best one.",
)
parser.add_argument("--use-beam-search", action="store_true")
parser.add_argument(
"--num-prompts",
@ -994,6 +841,12 @@ if __name__ == "__main__":
action="store_true",
help="Specify to save benchmark results to a json file",
)
parser.add_argument(
"--save-detailed",
action="store_true",
help="When saving the results, whether to include per request "
"information such as response, error, ttfs, tpots, etc.",
)
parser.add_argument(
"--metadata",
metavar="KEY=VALUE",
@ -1132,5 +985,31 @@ if __name__ == "__main__":
"from the sampled HF dataset.",
)
parser.add_argument(
'--tokenizer-mode',
type=str,
default="auto",
choices=['auto', 'slow', 'mistral', 'custom'],
help='The tokenizer mode.\n\n* "auto" will use the '
'fast tokenizer if available.\n* "slow" will '
'always use the slow tokenizer. \n* '
'"mistral" will always use the `mistral_common` tokenizer. \n*'
'"custom" will use --tokenizer to select the preregistered tokenizer.')
parser.add_argument("--served-model-name",
type=str,
default=None,
help="The model name used in the API. "
"If not specified, the model name will be the "
"same as the ``--model`` argument. ")
parser.add_argument("--lora-modules",
nargs='+',
default=None,
help="A subset of LoRA module names passed in when "
"launching the server. For each request, the "
"script chooses a LoRA module at random.")
args = parser.parse_args()
main(args)

File diff suppressed because it is too large


@ -1,14 +1,21 @@
# SPDX-License-Identifier: Apache-2.0
"""Benchmark offline inference throughput."""
import argparse
import dataclasses
import json
import os
import random
import time
from typing import List, Optional
import warnings
from typing import Any, Optional, Union
import torch
import uvloop
from PIL import Image
from benchmark_dataset import (BurstGPTDataset, ConversationDataset,
InstructCoderDataset, RandomDataset,
SampleRequest, ShareGPTDataset, SonnetDataset,
VisionArenaDataset)
from benchmark_utils import convert_to_pytorch_benchmark_format, write_to_json
from tqdm import tqdm
from transformers import (AutoModelForCausalLM, AutoTokenizer,
PreTrainedTokenizerBase)
@ -16,126 +23,35 @@ from transformers import (AutoModelForCausalLM, AutoTokenizer,
from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
from vllm.entrypoints.openai.api_server import (
build_async_engine_client_from_engine_args)
from vllm.inputs import TextPrompt
from vllm.multimodal import MultiModalDataDict
from vllm.inputs import TextPrompt, TokensPrompt
from vllm.lora.request import LoRARequest
from vllm.outputs import RequestOutput
from vllm.sampling_params import BeamSearchParams
from vllm.utils import FlexibleArgumentParser, merge_async_iterators
@dataclasses.dataclass
class SampleRequest:
"""A class representing a single inference request for benchmarking.
Attributes:
prompt: The input text prompt for the model.
multi_modal_data: Optional dictionary containing multi-modal data (e.g.
images).
prompt_len: The length of the prompt in tokens.
expected_output_len: The expected length of the output in tokens.
"""
prompt: str
prompt_len: int
expected_output_len: int
multi_modal_data: Optional[MultiModalDataDict] = None
def _get_prompt_for_image_model(question: str, *, model: str) -> str:
"""Prepend and append special tokens around the question to form a prompt.
Args:
question: The input question text to wrap with special tokens
model: The name of the model being used, to determine which special
tokens to add
Returns:
The formatted prompt string with appropriate special tokens for the
model
Raises:
ValueError: If an unsupported model name is provided
"""
model = model.lower()
if "pixtral" in model:
return f"<s>[INST]{question}\n[IMG][/INST]"
raise ValueError(f"Unsupported model {model}")
def sample_requests(tokenizer: PreTrainedTokenizerBase,
args: argparse.Namespace) -> List[SampleRequest]:
dataset_path: str = args.dataset
num_requests: int = args.num_prompts
fixed_output_len: Optional[int] = args.output_len
model: str = args.model
if fixed_output_len is not None and fixed_output_len < 4:
raise ValueError("output_len too small")
# Load the dataset.
with open(dataset_path) as f:
dataset = json.load(f)
# Filter out the conversations with less than 2 turns.
dataset = [data for data in dataset if len(data["conversations"]) >= 2]
# Shuffle the dataset.
random.shuffle(dataset)
# Filter out sequences that are too long or too short
filtered_dataset: List[SampleRequest] = []
for data in dataset:
if len(filtered_dataset) == num_requests:
break
# Only keep the first two turns of each conversation.
prompt = data["conversations"][0]["value"]
completion = data["conversations"][1]["value"]
multi_modal_data: Optional[MultiModalDataDict] = None
if "image" in data:
multi_modal_data = multi_modal_data or {}
image_path = data["image"]
# TODO(vllm-project/vllm/issues/9778): Support multiple images.
assert isinstance(image_path,
str), "Only support single image input"
try:
multi_modal_data["image"] = Image.open(image_path).convert(
"RGB")
except FileNotFoundError:
# Ignore datapoint where asset is missing
continue
prompt = _get_prompt_for_image_model(question=prompt, model=model)
# Tokenize the prompts and completions.
prompt_token_ids = tokenizer(prompt).input_ids
completion_token_ids = tokenizer(completion).input_ids
prompt_len = len(prompt_token_ids)
output_len = len(completion_token_ids
) if fixed_output_len is None else fixed_output_len
if prompt_len < 4 or output_len < 4:
# Prune too short sequences.
continue
if prompt_len > 1024 or prompt_len + output_len > 2048:
# Prune too long sequences.
continue
filtered_dataset.append(
SampleRequest(prompt=prompt,
prompt_len=prompt_len,
expected_output_len=output_len,
multi_modal_data=multi_modal_data))
return filtered_dataset
def run_vllm(
requests: List[SampleRequest],
requests: list[SampleRequest],
n: int,
engine_args: EngineArgs,
) -> float:
disable_detokenize: bool = False,
) -> tuple[float, Optional[list[RequestOutput]]]:
from vllm import LLM, SamplingParams
llm = LLM(**dataclasses.asdict(engine_args))
assert all(
llm.llm_engine.model_config.max_model_len >= (
request.prompt_len + request.expected_output_len)
for request in requests), (
"Please ensure that max_model_len is greater than the sum of"
" prompt_len and expected_output_len for all requests.")
# Add the requests to the engine.
prompts: List[TextPrompt] = []
sampling_params: List[SamplingParams] = []
prompts: list[Union[TextPrompt, TokensPrompt]] = []
sampling_params: list[SamplingParams] = []
for request in requests:
prompts.append(
TokensPrompt(prompt_token_ids=request.prompt["prompt_token_ids"],
multi_modal_data=request.multi_modal_data)
if "prompt_token_ids" in request.prompt else \
TextPrompt(prompt=request.prompt,
multi_modal_data=request.multi_modal_data))
sampling_params.append(
@ -145,15 +61,24 @@ def run_vllm(
top_p=1.0,
ignore_eos=True,
max_tokens=request.expected_output_len,
detokenize=not disable_detokenize,
))
lora_requests: Optional[list[LoRARequest]] = None
if engine_args.enable_lora:
lora_requests = [request.lora_request for request in requests]
use_beam_search = False
outputs = None
if not use_beam_search:
start = time.perf_counter()
llm.generate(prompts, sampling_params, use_tqdm=True)
outputs = llm.generate(prompts,
sampling_params,
lora_request=lora_requests,
use_tqdm=True)
end = time.perf_counter()
else:
assert lora_requests is None, "BeamSearch API does not support LoRA"
prompts = [request.prompt for request in requests]
# output_len should be the same for all requests.
output_len = requests[0][2]
@ -168,25 +93,75 @@ def run_vllm(
ignore_eos=True,
))
end = time.perf_counter()
return end - start
return end - start, outputs
def run_vllm_chat(
requests: list[SampleRequest],
n: int,
engine_args: EngineArgs,
disable_detokenize: bool = False) -> tuple[float, list[RequestOutput]]:
"""
Run vLLM chat benchmark. This function is recommended ONLY for benchmarking
multimodal models as it properly handles multimodal inputs and chat
formatting. For non-multimodal models, use run_vllm() instead.
"""
from vllm import LLM, SamplingParams
llm = LLM(**dataclasses.asdict(engine_args))
assert all(
llm.llm_engine.model_config.max_model_len >= (
request.prompt_len + request.expected_output_len)
for request in requests), (
"Please ensure that max_model_len is greater than the sum of "
"prompt_len and expected_output_len for all requests.")
prompts = []
sampling_params: list[SamplingParams] = []
for request in requests:
prompts.append(request.prompt)
sampling_params.append(
SamplingParams(
n=n,
temperature=1.0,
top_p=1.0,
ignore_eos=True,
max_tokens=request.expected_output_len,
detokenize=not disable_detokenize,
))
start = time.perf_counter()
outputs = llm.chat(prompts, sampling_params, use_tqdm=True)
end = time.perf_counter()
return end - start, outputs
async def run_vllm_async(
requests: List[SampleRequest],
requests: list[SampleRequest],
n: int,
engine_args: AsyncEngineArgs,
disable_frontend_multiprocessing: bool = False,
disable_detokenize: bool = False,
) -> float:
from vllm import SamplingParams
async with build_async_engine_client_from_engine_args(
engine_args, disable_frontend_multiprocessing) as llm:
assert all(
llm.model_config.max_model_len >= (request.prompt_len +
request.expected_output_len)
for request in requests), (
"Please ensure that max_model_len is greater than the sum of"
" prompt_len and expected_output_len for all requests.")
# Add the requests to the engine.
prompts: List[TextPrompt] = []
sampling_params: List[SamplingParams] = []
prompts: list[Union[TextPrompt, TokensPrompt]] = []
sampling_params: list[SamplingParams] = []
lora_requests: list[Optional[LoRARequest]] = []
for request in requests:
prompts.append(
TokensPrompt(prompt_token_ids=request.prompt["prompt_token_ids"],
multi_modal_data=request.multi_modal_data)
if "prompt_token_ids" in request.prompt else \
TextPrompt(prompt=request.prompt,
multi_modal_data=request.multi_modal_data))
sampling_params.append(
@ -196,12 +171,18 @@ async def run_vllm_async(
top_p=1.0,
ignore_eos=True,
max_tokens=request.expected_output_len,
detokenize=not disable_detokenize,
))
lora_requests.append(request.lora_request)
generators = []
start = time.perf_counter()
for i, (prompt, sp) in enumerate(zip(prompts, sampling_params)):
generator = llm.generate(prompt, sp, request_id=f"test{i}")
for i, (prompt, sp,
lr) in enumerate(zip(prompts, sampling_params, lora_requests)):
generator = llm.generate(prompt,
sp,
lora_request=lr,
request_id=f"test{i}")
generators.append(generator)
all_gens = merge_async_iterators(*generators)
async for i, res in all_gens:
@ -211,12 +192,13 @@ async def run_vllm_async(
def run_hf(
requests: List[SampleRequest],
requests: list[SampleRequest],
model: str,
tokenizer: PreTrainedTokenizerBase,
n: int,
max_batch_size: int,
trust_remote_code: bool,
disable_detokenize: bool = False,
) -> float:
llm = AutoModelForCausalLM.from_pretrained(
model, torch_dtype=torch.float16, trust_remote_code=trust_remote_code)
@ -227,7 +209,7 @@ def run_hf(
pbar = tqdm(total=len(requests))
start = time.perf_counter()
batch: List[str] = []
batch: list[str] = []
max_prompt_len = 0
max_output_len = 0
for i in range(len(requests)):
@ -256,8 +238,9 @@ def run_hf(
use_cache=True,
max_new_tokens=max_output_len,
)
# Include the decoding time.
tokenizer.batch_decode(llm_outputs, skip_special_tokens=True)
if not disable_detokenize:
# Include the decoding time.
tokenizer.batch_decode(llm_outputs, skip_special_tokens=True)
pbar.update(len(batch))
# Clear the batch.
@ -269,7 +252,7 @@ def run_hf(
def run_mii(
requests: List[SampleRequest],
requests: list[SampleRequest],
model: str,
tensor_parallel_size: int,
output_len: int,
@ -286,36 +269,89 @@ def run_mii(
return end - start
def save_to_pytorch_benchmark_format(args: argparse.Namespace,
results: dict[str, Any]) -> None:
pt_records = convert_to_pytorch_benchmark_format(
args=args,
metrics={
"requests_per_second": [results["requests_per_second"]],
"tokens_per_second": [results["tokens_per_second"]],
},
extra_info={
k: results[k]
for k in ["elapsed_time", "num_requests", "total_num_tokens"]
})
if pt_records:
# Don't use json suffix here as we don't want CI to pick it up
pt_file = f"{os.path.splitext(args.output_json)[0]}.pytorch.json"
write_to_json(pt_file, pt_records)
def get_requests(args, tokenizer):
# Common parameters for all dataset types.
common_kwargs = {
"dataset_path": args.dataset_path,
"random_seed": args.seed,
}
sample_kwargs = {
"tokenizer": tokenizer,
"lora_path": args.lora_path,
"max_loras": args.max_loras,
"num_requests": args.num_prompts,
"input_len": args.input_len,
"output_len": args.output_len,
}
if args.dataset_path is None or args.dataset_name == "random":
sample_kwargs["range_ratio"] = args.random_range_ratio
sample_kwargs["prefix_len"] = args.prefix_len
dataset_cls = RandomDataset
elif args.dataset_name == "sharegpt":
dataset_cls = ShareGPTDataset
if args.backend == "vllm-chat":
sample_kwargs["enable_multimodal_chat"] = True
elif args.dataset_name == "sonnet":
assert tokenizer.chat_template or tokenizer.default_chat_template, (
"Tokenizer/model must have chat template for sonnet dataset.")
dataset_cls = SonnetDataset
sample_kwargs["prefix_len"] = args.prefix_len
sample_kwargs["return_prompt_formatted"] = True
elif args.dataset_name == "burstgpt":
dataset_cls = BurstGPTDataset
elif args.dataset_name == "hf":
if args.dataset_path in VisionArenaDataset.SUPPORTED_DATASET_PATHS:
dataset_cls = VisionArenaDataset
common_kwargs['dataset_subset'] = None
common_kwargs['dataset_split'] = "train"
sample_kwargs["enable_multimodal_chat"] = True
elif args.dataset_path in InstructCoderDataset.SUPPORTED_DATASET_PATHS:
dataset_cls = InstructCoderDataset
common_kwargs['dataset_split'] = "train"
elif args.dataset_path in ConversationDataset.SUPPORTED_DATASET_PATHS:
dataset_cls = ConversationDataset
common_kwargs['dataset_subset'] = args.hf_subset
common_kwargs['dataset_split'] = args.hf_split
sample_kwargs["enable_multimodal_chat"] = True
else:
raise ValueError(f"Unknown dataset name: {args.dataset_name}")
# Remove None values
sample_kwargs = {k: v for k, v in sample_kwargs.items() if v is not None}
return dataset_cls(**common_kwargs).sample(**sample_kwargs)
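A minimal sketch of driving get_requests directly, assuming only the argparse fields it reads and a stand-in gpt2 tokenizer; this is not the script's real CLI surface.
import argparse
from transformers import AutoTokenizer
args = argparse.Namespace(
    dataset_path=None, dataset_name="random", seed=0, backend="vllm",
    num_prompts=8, input_len=128, output_len=32,
    random_range_ratio=None, prefix_len=None,
    lora_path=None, max_loras=None,
)
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example tokenizer only
requests = get_requests(args, tokenizer)  # no dataset path -> RandomDataset
print(f"sampled {len(requests)} requests")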
def main(args: argparse.Namespace):
if args.seed is None:
args.seed = 0
print(args)
random.seed(args.seed)
# Sample the requests.
tokenizer = AutoTokenizer.from_pretrained(
args.tokenizer, trust_remote_code=args.trust_remote_code)
if args.dataset is None:
# Synthesize a prompt with the given input length.
# As tokenizer may add additional tokens like BOS, we need to try
# different lengths to get the desired input length.
for i in range(-10, 10):
prompt = "hi " * (args.input_len + i)
tokenized_prompt = tokenizer(prompt).input_ids
if len(tokenized_prompt) == args.input_len:
break
else:
raise ValueError(
f"Failed to synthesize a prompt with {args.input_len} tokens.")
requests = [
SampleRequest(prompt=prompt,
prompt_len=args.input_len,
expected_output_len=args.output_len)
for _ in range(args.num_prompts)
]
else:
requests = sample_requests(tokenizer, args)
requests = get_requests(args, tokenizer)
is_multi_modal = any(request.multi_modal_data is not None
for request in requests)
request_outputs: Optional[list[RequestOutput]] = None
if args.backend == "vllm":
if args.async_engine:
elapsed_time = uvloop.run(
@ -324,31 +360,59 @@ def main(args: argparse.Namespace):
args.n,
AsyncEngineArgs.from_cli_args(args),
args.disable_frontend_multiprocessing,
args.disable_detokenize,
))
else:
elapsed_time = run_vllm(requests, args.n,
EngineArgs.from_cli_args(args))
elapsed_time, request_outputs = run_vllm(
requests, args.n, EngineArgs.from_cli_args(args),
args.disable_detokenize)
elif args.backend == "hf":
assert args.tensor_parallel_size == 1
elapsed_time = run_hf(requests, args.model, tokenizer, args.n,
args.hf_max_batch_size, args.trust_remote_code)
args.hf_max_batch_size, args.trust_remote_code,
args.disable_detokenize)
elif args.backend == "mii":
elapsed_time = run_mii(requests, args.model, args.tensor_parallel_size,
args.output_len)
elif args.backend == "vllm-chat":
elapsed_time, request_outputs = run_vllm_chat(
requests, args.n, EngineArgs.from_cli_args(args),
args.disable_detokenize)
else:
raise ValueError(f"Unknown backend: {args.backend}")
total_num_tokens = sum(request.prompt_len + request.expected_output_len
for request in requests)
total_output_tokens = sum(request.expected_output_len
for request in requests)
if is_multi_modal:
print("\033[91mWARNING\033[0m: Multi-modal request detected. The "
if request_outputs:
# Note: with the vllm and vllm-chat backends,
# we have request_outputs, which we use to count tokens.
total_prompt_tokens = 0
total_output_tokens = 0
for ro in request_outputs:
if not isinstance(ro, RequestOutput):
continue
total_prompt_tokens += len(
ro.prompt_token_ids) if ro.prompt_token_ids else 0
total_output_tokens += sum(
len(o.token_ids) for o in ro.outputs if o)
total_num_tokens = total_prompt_tokens + total_output_tokens
else:
total_num_tokens = sum(r.prompt_len + r.expected_output_len
for r in requests)
total_output_tokens = sum(r.expected_output_len for r in requests)
total_prompt_tokens = total_num_tokens - total_output_tokens
if is_multi_modal and args.backend != "vllm-chat":
print("\033[91mWARNING\033[0m: Multi-modal request with "
f"{args.backend} backend detected. The "
"following metrics are not accurate because image tokens are not"
" counted. See vllm-project/vllm/issues/9778 for details.")
# TODO(vllm-project/vllm/issues/9778): Count molti-modal token length.
# TODO(vllm-project/vllm/issues/9778): Count multi-modal token length.
# vllm-chat backend counts the image tokens now
print(f"Throughput: {len(requests) / elapsed_time:.2f} requests/s, "
f"{total_num_tokens / elapsed_time:.2f} total tokens/s, "
f"{total_output_tokens / elapsed_time:.2f} output tokens/s")
print(f"Total num prompt tokens: {total_prompt_tokens}")
print(f"Total num output tokens: {total_output_tokens}")
# Output JSON results if specified
if args.output_json:
@ -361,20 +425,122 @@ def main(args: argparse.Namespace):
}
with open(args.output_json, "w") as f:
json.dump(results, f, indent=4)
save_to_pytorch_benchmark_format(args, results)
def validate_args(args):
"""
Validate command-line arguments.
"""
# === Deprecation and Defaulting ===
if args.dataset is not None:
warnings.warn(
"The '--dataset' argument will be deprecated in the next release. "
"Please use '--dataset-name' and '--dataset-path' instead.",
stacklevel=2)
args.dataset_path = args.dataset
if not getattr(args, "tokenizer", None):
args.tokenizer = args.model
# === Backend Validation ===
valid_backends = {"vllm", "hf", "mii", "vllm-chat"}
if args.backend not in valid_backends:
raise ValueError(f"Unsupported backend: {args.backend}")
# === Dataset Configuration ===
if not args.dataset and not args.dataset_path:
print(
"When dataset path is not set, it will default to random dataset")
args.dataset_name = 'random'
if args.input_len is None:
raise ValueError("input_len must be provided for a random dataset")
# === Dataset Name Specific Checks ===
# --hf-subset and --hf-split: only used
# when dataset_name is 'hf'
if args.dataset_name != "hf" and (
getattr(args, "hf_subset", None) is not None
or getattr(args, "hf_split", None) is not None):
warnings.warn("--hf-subset and --hf-split will be ignored \
since --dataset-name is not 'hf'.",
stacklevel=2)
elif args.dataset_name == "hf":
if args.dataset_path in VisionArenaDataset.SUPPORTED_DATASET_PATHS:
assert args.backend == "vllm-chat", "VisionArenaDataset needs to use vllm-chat as the backend." #noqa: E501
elif args.dataset_path in InstructCoderDataset.SUPPORTED_DATASET_PATHS:
assert args.backend == "vllm", "InstructCoder dataset needs to use vllm as the backend." #noqa: E501
elif args.dataset_path in ConversationDataset.SUPPORTED_DATASET_PATHS:
assert args.backend == "vllm-chat", "ConversationDataset needs to use vllm-chat as the backend." #noqa: E501
else:
raise ValueError(
f"{args.dataset_path} is not supported by hf dataset.")
# --random-range-ratio: only used when dataset_name is 'random'
if args.dataset_name != 'random' and args.random_range_ratio is not None:
warnings.warn("--random-range-ratio will be ignored since \
--dataset-name is not 'random'.",
stacklevel=2)
# --prefix-len: only used when dataset_name is 'random', 'sonnet', or not
# set.
if args.dataset_name not in {"random", "sonnet", None
} and args.prefix_len is not None:
warnings.warn("--prefix-len will be ignored since --dataset-name\
is not 'random', 'sonnet', or not set.",
stacklevel=2)
# === LoRA Settings ===
if getattr(args, "enable_lora", False) and args.backend != "vllm":
raise ValueError(
"LoRA benchmarking is only supported for vLLM backend")
if getattr(args, "enable_lora", False) and args.lora_path is None:
raise ValueError("LoRA path must be provided when enable_lora is True")
# === Backend-specific Validations ===
if args.backend == "hf" and args.hf_max_batch_size is None:
raise ValueError("HF max batch size is required for HF backend")
if args.backend != "hf" and args.hf_max_batch_size is not None:
raise ValueError("HF max batch size is only for HF backend.")
if args.backend in {"hf", "mii"} and getattr(args, "quantization",
None) is not None:
raise ValueError("Quantization is only for vLLM backend.")
if args.backend == "mii" and args.dtype != "auto":
raise ValueError("dtype must be auto for MII backend.")
if args.backend == "mii" and args.n != 1:
raise ValueError("n must be 1 for MII backend.")
if args.backend == "mii" and args.tokenizer != args.model:
raise ValueError(
"Tokenizer must be the same as the model for MII backend.")
if __name__ == "__main__":
parser = FlexibleArgumentParser(description="Benchmark the throughput.")
parser.add_argument("--backend",
type=str,
choices=["vllm", "hf", "mii"],
choices=["vllm", "hf", "mii", "vllm-chat"],
default="vllm")
parser.add_argument("--dataset",
parser.add_argument(
"--dataset-name",
type=str,
choices=["sharegpt", "random", "sonnet", "burstgpt", "hf"],
help="Name of the dataset to benchmark on.",
default="sharegpt")
parser.add_argument(
"--dataset",
type=str,
default=None,
help="Path to the ShareGPT dataset, will be deprecated in\
the next release. The dataset is expected to "
"be a json in form of list[dict[..., conversations: "
"list[dict[..., value: <prompt_or_response>]]]]")
parser.add_argument("--dataset-path",
type=str,
default=None,
help="Path to the dataset. The dataset is expected to "
"be a json in form of List[Dict[..., conversations: "
"List[Dict[..., value: <prompt_or_response>]]]]")
help="Path to the dataset")
parser.add_argument("--input-len",
type=int,
default=None,
@@ -409,34 +575,45 @@ if __name__ == "__main__":
action='store_true',
default=False,
help="Disable decoupled async engine frontend.")
parser.add_argument(
"--disable-detokenize",
action="store_true",
help=("Do not detokenize the response (i.e. do not include "
"detokenization time in the measurement)"))
# LoRA
parser.add_argument(
"--lora-path",
type=str,
default=None,
help="Path to the lora adapters to use. This can be an absolute path, "
"a relative path, or a Hugging Face model identifier.")
parser.add_argument("--prefix-len",
type=int,
default=None,
help="Number of prefix tokens per request."
"This is for the RandomDataset and SonnetDataset")
# random dataset
parser.add_argument(
"--random-range-ratio",
type=float,
default=None,
help="Range of sampled ratio of input/output length, "
"used only for RandomDataSet.",
)
# hf dataset
parser.add_argument("--hf-subset",
type=str,
default=None,
help="Subset of the HF dataset.")
parser.add_argument("--hf-split",
type=str,
default=None,
help="Split of the HF dataset.")
parser = AsyncEngineArgs.add_cli_args(parser)
args = parser.parse_args()
if args.tokenizer is None:
args.tokenizer = args.model
validate_args(args)
main(args)


@@ -0,0 +1,69 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import json
import math
import os
from typing import Any
def convert_to_pytorch_benchmark_format(args: argparse.Namespace,
metrics: dict[str, list],
extra_info: dict[str, Any]) -> list:
"""
Save the benchmark results in the format used by the PyTorch OSS benchmark,
with one metric per record:
https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database
"""
records = []
if not os.environ.get("SAVE_TO_PYTORCH_BENCHMARK_FORMAT", False):
return records
for name, benchmark_values in metrics.items():
record = {
"benchmark": {
"name": "vLLM benchmark",
"extra_info": {
"args": vars(args),
},
},
"model": {
"name": args.model,
},
"metric": {
"name": name,
"benchmark_values": benchmark_values,
"extra_info": extra_info,
},
}
tp = record["benchmark"]["extra_info"]["args"].get(
"tensor_parallel_size")
# Save tensor_parallel_size parameter if it's part of the metadata
if not tp and "tensor_parallel_size" in extra_info:
record["benchmark"]["extra_info"]["args"][
"tensor_parallel_size"] = extra_info["tensor_parallel_size"]
records.append(record)
return records
class InfEncoder(json.JSONEncoder):
def clear_inf(self, o: Any):
if isinstance(o, dict):
return {k: self.clear_inf(v) for k, v in o.items()}
elif isinstance(o, list):
return [self.clear_inf(v) for v in o]
elif isinstance(o, float) and math.isinf(o):
return "inf"
return o
def iterencode(self, o: Any, *args, **kwargs) -> Any:
return super().iterencode(self.clear_inf(o), *args, **kwargs)
def write_to_json(filename: str, records: list) -> None:
with open(filename, "w") as f:
json.dump(records, f, cls=InfEncoder)
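
A minimal usage sketch for the two helpers above (the argument namespace, metric names, and output filename are illustrative, not taken from the benchmark scripts): build PyTorch-OSS-style records from a metrics dict and dump them to JSON. Records are only produced when the SAVE_TO_PYTORCH_BENCHMARK_FORMAT environment variable is set.

# Hypothetical example of wiring a finished benchmark run into these helpers.
import argparse

args = argparse.Namespace(model="meta-llama/Llama-2-7b-hf",
                          tensor_parallel_size=1)
metrics = {"requests_per_second": [12.3], "tokens_per_second": [float("inf")]}
records = convert_to_pytorch_benchmark_format(
    args, metrics, extra_info={"tensor_parallel_size": 1})
if records:  # empty unless SAVE_TO_PYTORCH_BENCHMARK_FORMAT is set
    write_to_json("throughput_records.json", records)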


@@ -0,0 +1,387 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import copy
import itertools
import pickle as pkl
import time
from collections.abc import Iterable
from typing import Callable
import torch
import torch.utils.benchmark as TBenchmark
from torch.utils.benchmark import Measurement as TMeasurement
from utils import make_rand_sparse_tensors
from weight_shapes import WEIGHT_SHAPES
from vllm import _custom_ops as ops
from vllm.utils import FlexibleArgumentParser
DEFAULT_MODELS = list(WEIGHT_SHAPES.keys())
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512]
DEFAULT_TP_SIZES = [1]
# bench
def bench_fn(label: str, sub_label: str, description: str, fn: Callable, *args,
**kwargs) -> TMeasurement:
min_run_time = 1
globals = {
"args": args,
"kwargs": kwargs,
"fn": fn,
}
return TBenchmark.Timer(
stmt="fn(*args, **kwargs)",
globals=globals,
label=label,
sub_label=sub_label,
description=description,
).blocked_autorange(min_run_time=min_run_time)
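# Illustrative usage (not part of the sweep below): bench_fn can time any
# callable via torch.utils.benchmark, e.g. a plain matmul on CPU tensors:
#   t = bench_fn("demo", "MKN=(256x256x256)", "torch_mm",
#                torch.mm, torch.randn(256, 256), torch.randn(256, 256))
#   print(t)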
def bench_int8(dtype: torch.dtype, m: int, k: int, n: int, label: str,
sub_label: str) -> Iterable[TMeasurement]:
assert dtype == torch.int8
b_compressed, e, a, b = make_rand_sparse_tensors(torch.int8, m, n, k)
scale_a = torch.tensor(1.0, device="cuda", dtype=torch.float32)
scale_b = torch.tensor(1.0, device="cuda", dtype=torch.float32)
bias = torch.zeros((n, ), device="cuda", dtype=torch.bfloat16)
out = ops.cutlass_scaled_sparse_mm(a, b_compressed, e, scale_a, scale_b,
torch.bfloat16)
out_ref = ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.bfloat16)
if not torch.allclose(out, out_ref):
print("Incorrect results")
print(out)
print(out_ref)
else:
print("Correct results")
timers = []
# pytorch impl - bfloat16
timers.append(
bench_fn(label, sub_label, "pytorch_bf16_bf16_bf16_matmul-no-scales",
torch.mm, a.to(dtype=torch.bfloat16),
b.to(dtype=torch.bfloat16)))
# pytorch impl - float16
timers.append(
bench_fn(label, sub_label,
"pytorch_fp16_fp16_fp16_matmul-no-scales", torch.mm,
a.to(dtype=torch.float16), b.to(dtype=torch.float16)))
# cutlass impl
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b,
torch.bfloat16))
# cutlass with bias
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm_bias",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b, torch.bfloat16,
bias))
# cutlass sparse impl
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_sparse_mm",
ops.cutlass_scaled_sparse_mm, a, b_compressed, e, scale_a,
scale_b, torch.bfloat16))
# cutlass sparse with bias
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_sparse_mm_bias",
ops.cutlass_scaled_sparse_mm, a, b_compressed, e, scale_a,
scale_b, torch.bfloat16, bias))
return timers
def bench_fp8(dtype: torch.dtype, m: int, k: int, n: int, label: str,
sub_label: str) -> Iterable[TMeasurement]:
assert dtype == torch.float8_e4m3fn
b_compressed, e, a, b = make_rand_sparse_tensors(torch.float8_e4m3fn, m, n,
k)
scale_a = torch.tensor(1.0, device="cuda", dtype=torch.float32)
scale_b = torch.tensor(1.0, device="cuda", dtype=torch.float32)
bias = torch.zeros((n, ), device="cuda", dtype=torch.bfloat16)
out = ops.cutlass_scaled_sparse_mm(a, b_compressed, e, scale_a, scale_b,
torch.bfloat16)
out_ref = ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.bfloat16)
if not torch.allclose(out, out_ref):
print("Incorrect results")
print(out)
print(out_ref)
else:
print("Correct results")
timers = []
# pytorch impl w. bf16
timers.append(
bench_fn(label, sub_label, "pytorch_bf16_bf16_bf16_matmul-no-scales",
torch.mm, a.to(dtype=torch.bfloat16, device="cuda"),
b.to(dtype=torch.bfloat16, device="cuda")))
# pytorch impl: bf16 output, without fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_bf16_scaled_mm",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.bfloat16))
# pytorch impl: bf16 output, with fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_bf16_scaled_mm_fast_accum",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.bfloat16,
use_fast_accum=True))
# pytorch impl: fp16 output, without fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_fp16_scaled_mm",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.float16))
# pytorch impl: fp16 output, with fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_fp16_scaled_mm_fast_accum",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.float16,
use_fast_accum=True))
# cutlass impl: bf16 output
timers.append(
bench_fn(label, sub_label, "cutlass_fp8_fp8_bf16_scaled_mm",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b,
torch.bfloat16))
# cutlass sparse impl: bf16 output
timers.append(
bench_fn(label, sub_label, "cutlass_fp8_fp8_bf16_scaled_sparse_mm",
ops.cutlass_scaled_sparse_mm, a, b_compressed, e, scale_a,
scale_b, torch.bfloat16))
# cutlass sparse impl: fp16 output
timers.append(
bench_fn(label, sub_label, "cutlass_fp8_fp8_fp16_scaled_sparse_mm",
ops.cutlass_scaled_sparse_mm, a, b_compressed, e, scale_a,
scale_b, torch.float16))
# cutlass sparse impl: bf16 output, with bias
timers.append(
bench_fn(label, sub_label,
"cutlass_fp8_fp8_bf16_scaled_sparse_mm_bias",
ops.cutlass_scaled_sparse_mm, a, b_compressed, e, scale_a,
scale_b, torch.bfloat16, bias))
# cutlass sparse impl: fp16 output, with bias
timers.append(
bench_fn(label, sub_label,
"cutlass_fp8_fp8_fp16_scaled_sparse_mm_bias",
ops.cutlass_scaled_sparse_mm, a, b_compressed, e, scale_a,
scale_b, torch.float16, bias.to(dtype=torch.float16)))
return timers
def bench(dtype: torch.dtype, m: int, k: int, n: int, label: str,
sub_label: str) -> Iterable[TMeasurement]:
if dtype == torch.int8:
return bench_int8(dtype, m, k, n, label, sub_label)
if dtype == torch.float8_e4m3fn:
return bench_fp8(dtype, m, k, n, label, sub_label)
raise ValueError("unsupported type")
# runner
def print_timers(timers: Iterable[TMeasurement]):
compare = TBenchmark.Compare(timers)
compare.print()
def run(dtype: torch.dtype,
MKNs: Iterable[tuple[int, int, int]]) -> Iterable[TMeasurement]:
results = []
for m, k, n in MKNs:
timers = bench(dtype, m, k, n, f"scaled-{dtype}-gemm",
f"MKN=({m}x{k}x{n})")
print_timers(timers)
results.extend(timers)
return results
# output makers
def make_output(data: Iterable[TMeasurement],
MKNs: Iterable[tuple[int, int, int]],
base_description: str,
timestamp=None):
print(f"== All Results {base_description} ====")
print_timers(data)
# pickle all the results
timestamp = int(time.time()) if timestamp is None else timestamp
with open(f"{base_description}-{timestamp}.pkl", "wb") as f:
pkl.dump(data, f)
# argparse runners
def run_square_bench(args):
dim_sizes = list(
range(args.dim_start, args.dim_end + 1, args.dim_increment))
MKNs = list(zip(dim_sizes, dim_sizes, dim_sizes))
data = run(args.dtype, MKNs)
make_output(data, MKNs, f"square_bench-{args.dtype}")
def run_range_bench(args):
dim_sizes = list(range(args.dim_start, args.dim_end, args.dim_increment))
n = len(dim_sizes)
Ms = [args.m_constant] * n if args.m_constant is not None else dim_sizes
Ks = [args.k_constant] * n if args.k_constant is not None else dim_sizes
Ns = [args.n_constant] * n if args.n_constant is not None else dim_sizes
MKNs = list(zip(Ms, Ks, Ns))
data = run(args.dtype, MKNs)
make_output(data, MKNs, f"range_bench-{args.dtype}")
def run_model_bench(args):
print("Benchmarking models:")
for i, model in enumerate(args.models):
print(f"[{i}] {model}")
def model_shapes(model_name: str, tp_size: int) -> list[tuple[int, int]]:
KNs = []
for KN, tp_split_dim in copy.deepcopy(WEIGHT_SHAPES[model_name]):
KN[tp_split_dim] = KN[tp_split_dim] // tp_size
KNs.append(KN)
return KNs
model_bench_data = []
models_tps = list(itertools.product(args.models, args.tp_sizes))
for model, tp_size in models_tps:
Ms = args.batch_sizes
KNs = model_shapes(model, tp_size)
MKNs = []
for m in Ms:
for k, n in KNs:
MKNs.append((m, k, n))
data = run(args.dtype, MKNs)
model_bench_data.append(data)
# Print all results
for data, model_tp in zip(model_bench_data, models_tps):
model, tp_size = model_tp
print(f"== Results {args.dtype} {model}-TP{tp_size} ====")
print_timers(data)
timestamp = int(time.time())
all_data = []
for d in model_bench_data:
all_data.extend(d)
# pickle all data
with open(f"model_bench-{args.dtype}-{timestamp}.pkl", "wb") as f:
pkl.dump(all_data, f)
if __name__ == '__main__':
def to_torch_dtype(dt):
if dt == "int8":
return torch.int8
if dt == "fp8":
return torch.float8_e4m3fn
raise ValueError("unsupported dtype")
parser = FlexibleArgumentParser(
description="""
Benchmark Cutlass GEMM.
To run square GEMMs:
python3 ./benchmarks/cutlass_benchmarks/sparse_benchmarks.py --dtype fp8 square_bench --dim-start 128 --dim-end 512 --dim-increment 64
To run constant N and K and sweep M:
python3 ./benchmarks/cutlass_benchmarks/sparse_benchmarks.py --dtype fp8 range_bench --dim-start 128 --dim-end 512 --dim-increment 64 --n-constant 16384 --k-constant 16384
To run dimensions from a model:
python3 ./benchmarks/cutlass_benchmarks/sparse_benchmarks.py --dtype fp8 model_bench --models meta-llama/Llama-2-7b-hf --batch-sizes 16 --tp-sizes 1
Output:
- a .pkl file, that is a list of raw torch.benchmark.utils.Measurements for the pytorch and cutlass implementations for the various GEMMs.
""", # noqa: E501
formatter_class=argparse.RawTextHelpFormatter)
parser.add_argument("--dtype",
type=to_torch_dtype,
required=True,
help="Available options are ['int8', 'fp8']")
subparsers = parser.add_subparsers(dest="cmd")
square_parser = subparsers.add_parser("square_bench")
square_parser.add_argument("--dim-start", type=int, required=True)
square_parser.add_argument("--dim-end", type=int, required=True)
square_parser.add_argument("--dim-increment", type=int, required=True)
square_parser.set_defaults(func=run_square_bench)
range_parser = subparsers.add_parser("range_bench")
range_parser.add_argument("--dim-start", type=int, required=True)
range_parser.add_argument("--dim-end", type=int, required=True)
range_parser.add_argument("--dim-increment", type=int, required=True)
range_parser.add_argument("--m-constant", type=int, default=None)
range_parser.add_argument("--n-constant", type=int, default=None)
range_parser.add_argument("--k-constant", type=int, default=None)
range_parser.set_defaults(func=run_range_bench)
model_parser = subparsers.add_parser("model_bench")
model_parser.add_argument("--models",
nargs="+",
type=str,
default=DEFAULT_MODELS,
choices=WEIGHT_SHAPES.keys())
model_parser.add_argument("--tp-sizes",
nargs="+",
type=int,
default=DEFAULT_TP_SIZES)
model_parser.add_argument("--batch-sizes",
nargs="+",
type=int,
default=DEFAULT_BATCH_SIZES)
model_parser.set_defaults(func=run_model_bench)
args = parser.parse_args()
args.func(args)


@@ -0,0 +1,98 @@
# SPDX-License-Identifier: Apache-2.0
# Cutlass bench utils
from collections.abc import Iterable
import torch
import vllm._custom_ops as ops
def to_fp8(tensor: torch.Tensor) -> torch.Tensor:
finfo = torch.finfo(torch.float8_e4m3fn)
return torch.round(tensor.clamp(
min=finfo.min, max=finfo.max)).to(dtype=torch.float8_e4m3fn)
def to_int8(tensor: torch.Tensor) -> torch.Tensor:
return torch.round(tensor.clamp(min=-128, max=127)).to(dtype=torch.int8)
def to_bf16(tensor: torch.Tensor) -> torch.Tensor:
return tensor.to(dtype=torch.bfloat16)
def to_fp16(tensor: torch.Tensor) -> torch.Tensor:
return tensor.to(dtype=torch.float16)
def make_rand_tensors(dtype: torch.dtype, m: int, n: int,
k: int) -> tuple[torch.Tensor, torch.Tensor]:
a = torch.randn((m, k), device='cuda') * 5
b = torch.randn((n, k), device='cuda').t() * 5
if dtype == torch.int8:
return to_int8(a), to_int8(b)
if dtype == torch.float8_e4m3fn:
return to_fp8(a), to_fp8(b)
raise ValueError("unsupported dtype")
def prune_to_2_4(tensor):
# Reshape tensor to [N, 4] where N is number of groups of 4
original_shape = tensor.shape
reshaped = tensor.reshape(-1, 4)
# Get indices of top 2 absolute values in each group of 4
_, indices = torch.topk(torch.abs(reshaped), k=2, dim=1)
# Create binary mask
mask = torch.zeros_like(reshaped)
mask.scatter_(dim=1,
index=indices,
src=torch.ones_like(indices, dtype=mask.dtype))
# Apply mask and reshape back
pruned = reshaped * mask
# Turn all -0.0 to 0.0
pruned[pruned == -0.0] = 0.0
return pruned.reshape(original_shape)
def make_rand_sparse_tensors(dtype: torch.dtype, m: int, n: int,
k: int) -> tuple[torch.Tensor, torch.Tensor]:
a = torch.randn((m, k), device='cuda') * 5
b = torch.randn((n, k), device='cuda').t() * 5
b = prune_to_2_4(b.t()).t()
if dtype == torch.int8:
a, b = to_int8(a), to_int8(b)
elif dtype == torch.float8_e4m3fn:
a, b = to_fp8(a), to_fp8(b)
elif dtype == torch.float16:
a, b = to_fp16(a), to_fp16(b)
elif dtype == torch.bfloat16:
a, b = to_bf16(a), to_bf16(b)
else:
raise ValueError("unsupported dtype")
b_compressed, e = ops.cutlass_sparse_compress(b.t())
# Compressed B, Metadata, Original A, B
return b_compressed, e, a, b
def make_n_rand_sparse_tensors(num_tensors: int, dtype: torch.dtype,
                               m: int, n: int, k: int) -> \
        tuple[list[torch.Tensor], list[torch.Tensor],
              list[torch.Tensor], list[torch.Tensor]]:
    ABs = []
    for _ in range(num_tensors):
        b_comp, e, a, b = make_rand_sparse_tensors(dtype, m, n, k)
        if b_comp is not None:
            # Reuse the tensors generated above rather than sampling a new set.
            ABs.append((b_comp, e, a, b))
    BComps, Es, As, Bs = zip(*ABs)
    return list(BComps), list(Es), list(As), list(Bs)
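
A small CPU-only sketch (illustrative, not used by the benchmarks) of what prune_to_2_4 produces: each contiguous group of four elements keeps at most its two largest-magnitude values, the 2:4 structured-sparsity pattern that the compressed CUTLASS kernels expect.

# Hypothetical check of the 2:4 pattern on a small dense tensor.
x = torch.randn(2, 8)
pruned = prune_to_2_4(x)
nonzeros_per_group = (pruned.reshape(-1, 4) != 0).sum(dim=1)
assert bool((nonzeros_per_group <= 2).all())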


@@ -1,47 +1,28 @@
# SPDX-License-Identifier: Apache-2.0
import argparse
import copy
import itertools
import pickle as pkl
import time
from typing import Callable, Iterable, List, Tuple
from collections.abc import Iterable
from typing import Callable, Optional
import torch
import torch.utils.benchmark as TBenchmark
from torch.utils.benchmark import Measurement as TMeasurement
from utils import make_rand_tensors
from weight_shapes import WEIGHT_SHAPES
from vllm import _custom_ops as ops
from vllm.model_executor.layers.quantization.utils.fp8_utils import (
w8a8_block_fp8_matmul)
from vllm.utils import FlexibleArgumentParser
DEFAULT_MODELS = list(WEIGHT_SHAPES.keys())
DEFAULT_BATCH_SIZES = [1, 16, 32, 64, 128, 256, 512]
DEFAULT_TP_SIZES = [1]
# helpers
def to_fp8(tensor: torch.Tensor) -> torch.Tensor:
finfo = torch.finfo(torch.float8_e4m3fn)
return torch.round(tensor.clamp(
min=finfo.min, max=finfo.max)).to(dtype=torch.float8_e4m3fn)
def to_int8(tensor: torch.Tensor) -> torch.Tensor:
return torch.round(tensor.clamp(min=-128, max=127)).to(dtype=torch.int8)
def make_rand_tensors(dtype: torch.dtype, m: int, n: int,
k: int) -> Tuple[torch.Tensor, torch.Tensor]:
a = torch.randn((m, k), device='cuda') * 5
b = torch.randn((n, k), device='cuda').t() * 5
if dtype == torch.int8:
return to_int8(a), to_int8(b)
if dtype == torch.float8_e4m3fn:
return to_fp8(a), to_fp8(b)
raise ValueError("unsupported dtype")
# bench
def bench_fn(label: str, sub_label: str, description: str, fn: Callable, *args,
@@ -62,8 +43,15 @@ def bench_fn(label: str, sub_label: str, description: str, fn: Callable, *args,
).blocked_autorange(min_run_time=min_run_time)
def bench_int8(dtype: torch.dtype, m: int, k: int, n: int, label: str,
sub_label: str) -> Iterable[TMeasurement]:
def bench_int8(
dtype: torch.dtype,
m: int,
k: int,
n: int,
label: str,
sub_label: str,
bench_kernels: Optional[list[str]] = None) -> Iterable[TMeasurement]:
"""Benchmark INT8-based kernels."""
assert dtype == torch.int8
a, b = make_rand_tensors(torch.int8, m, n, k)
scale_a = torch.tensor(1.0, device="cuda", dtype=torch.float32)
@@ -72,155 +60,132 @@ def bench_int8(dtype: torch.dtype, m: int, k: int, n: int, label: str,
azp = torch.zeros((m, ), device="cuda", dtype=torch.int32)
azp_adj = torch.zeros((n, ), device="cuda", dtype=torch.int32)
bench_fns = {
"pytorch_bf16_bf16_bf16_matmul-no-scales":
lambda: torch.mm(a.to(dtype=torch.bfloat16), b.to(dtype=torch.bfloat16)
),
"pytorch_fp16_fp16_fp16_matmul-no-scales":
lambda: torch.mm(a.to(dtype=torch.float16), b.to(dtype=torch.float16)),
"cutlass_i8_i8_bf16_scaled_mm":
lambda: ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.bfloat16),
"cutlass_i8_i8_bf16_scaled_mm_bias":
lambda: ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.bfloat16,
bias),
"cutlass_i8_i8_bf16_scaled_mm_azp":
lambda: ops.cutlass_scaled_mm_azp(a, b, scale_a, scale_b, torch.
bfloat16, azp_adj),
"cutlass_i8_i8_bf16_scaled_mm_azp_bias":
lambda: ops.cutlass_scaled_mm_azp(a, b, scale_a, scale_b, torch.
bfloat16, azp_adj, None, bias),
"cutlass_i8_i8_bf16_scaled_mm_azp_pt":
lambda: ops.cutlass_scaled_mm_azp(a, b, scale_a, scale_b, torch.
bfloat16, azp_adj, azp),
"cutlass_i8_i8_bf16_scaled_mm_azp_pt_bias":
lambda: ops.cutlass_scaled_mm_azp(a, b, scale_a, scale_b, torch.
bfloat16, azp_adj, azp, bias),
}
timers = []
# pytorch impl - bfloat16
timers.append(
bench_fn(label, sub_label, "pytorch_bf16_bf16_bf16_matmul-no-scales",
torch.mm, a.to(dtype=torch.bfloat16),
b.to(dtype=torch.bfloat16)))
# pytorch impl - float16
timers.append(
bench_fn(label, sub_label,
"pytorch_fp16_fp16_fp16_matmul-no-scales", torch.mm,
a.to(dtype=torch.float16), b.to(dtype=torch.float16)))
# cutlass impl
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b,
torch.bfloat16))
# cutlass with bias
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm_bias",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b, torch.bfloat16,
bias))
# cutlass with azp per-tensor
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm_azp",
ops.cutlass_scaled_mm_azp, a, b, scale_a, scale_b,
torch.bfloat16, azp_adj))
# cutlass with azp per-tensor + bias
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm_azp_bias",
ops.cutlass_scaled_mm_azp, a, b, scale_a, scale_b,
torch.bfloat16, azp_adj, None, bias))
# cutlass with azp per-token
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm_azp_pt",
ops.cutlass_scaled_mm_azp, a, b, scale_a, scale_b,
torch.bfloat16, azp_adj, azp))
# cutlass with azp per-token + bias
timers.append(
bench_fn(label, sub_label, "cutlass_i8_i8_bf16_scaled_mm_azp_pt_bias",
ops.cutlass_scaled_mm_azp, a, b, scale_a, scale_b,
torch.bfloat16, azp_adj, azp, bias))
for name, fn in bench_fns.items():
# If bench_kernels is None, run all. Otherwise, run only exact matches.
if bench_kernels is None or name in bench_kernels:
print(f"Running {name}")
timers.append(bench_fn(label, sub_label, name, fn))
return timers
def bench_fp8(dtype: torch.dtype, m: int, k: int, n: int, label: str,
sub_label: str) -> Iterable[TMeasurement]:
def bench_fp8(
dtype: torch.dtype,
m: int,
k: int,
n: int,
label: str,
sub_label: str,
bench_kernels: Optional[list[str]] = None) -> Iterable[TMeasurement]:
"""Benchmark FP8-based kernels."""
assert dtype == torch.float8_e4m3fn
a, b = make_rand_tensors(torch.float8_e4m3fn, m, n, k)
a_cont = a.contiguous()
scale_a = torch.tensor(1.0, device="cuda", dtype=torch.float32)
scale_b = torch.tensor(1.0, device="cuda", dtype=torch.float32)
block_scale_a = torch.rand((m, k // 128),
device="cuda",
dtype=torch.float32)
block_scale_b = torch.rand((k // 128, n // 128),
device="cuda",
dtype=torch.float32)
block_scale_a_M_major = block_scale_a.t().contiguous().t()
block_scale_b_K_major = block_scale_b.t().contiguous().t()
bias = torch.zeros((n, ), device="cuda", dtype=torch.bfloat16)
print(m, k, n)
bench_fns = {
"pytorch_bf16_bf16_bf16_matmul-no-scales":
lambda: torch.mm(a.to(dtype=torch.bfloat16), b.to(dtype=torch.bfloat16)
),
"pytorch_fp16_fp16_fp16_matmul-no-scales":
lambda: torch.mm(a.to(dtype=torch.float16), b.to(dtype=torch.float16)),
"pytorch_fp8_fp8_fp16_scaled_mm":
lambda: torch._scaled_mm(
a, b, scale_a, scale_b, out_dtype=torch.float16),
"pytorch_fp8_fp8_fp16_scaled_mm_fast_accum":
lambda: torch._scaled_mm(a,
b,
scale_a,
scale_b,
out_dtype=torch.float16,
use_fast_accum=True),
"pytorch_fp8_fp8_bf16_scaled_mm":
lambda: torch._scaled_mm(
a, b, scale_a, scale_b, out_dtype=torch.bfloat16),
"pytorch_fp8_fp8_bf16_scaled_mm_fast_accum":
lambda: torch._scaled_mm(a,
b,
scale_a,
scale_b,
out_dtype=torch.bfloat16,
use_fast_accum=True),
"cutlass_fp8_fp8_bf16_scaled_mm":
lambda: ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.bfloat16),
"cutlass_fp8_fp8_fp16_scaled_mm":
lambda: ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.float16),
"cutlass_fp8_fp8_bf16_scaled_mm_bias":
lambda: ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.bfloat16,
bias),
"cutlass_fp8_fp8_fp16_scaled_mm_bias":
lambda: ops.cutlass_scaled_mm(a, b, scale_a, scale_b, torch.float16,
bias.to(dtype=torch.float16)),
"triton_fp8_fp8_fp16_scaled_mm_blockwise":
lambda: w8a8_block_fp8_matmul(a_cont, b.t(), block_scale_a,
block_scale_b.t(), (128, 128)),
"cutlass_fp8_fp8_fp16_scaled_mm_blockwise":
lambda: ops.cutlass_scaled_mm(a, b, block_scale_a_M_major,
block_scale_b_K_major, torch.float16),
}
timers = []
# pytorch impl w. bf16
timers.append(
bench_fn(label, sub_label, "pytorch_bf16_bf16_bf16_matmul-no-scales",
torch.mm, a.to(dtype=torch.bfloat16, device="cuda"),
b.to(dtype=torch.bfloat16, device="cuda")))
# pytorch impl: bf16 output, without fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_bf16_scaled_mm",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.bfloat16))
# pytorch impl: bf16 output, with fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_bf16_scaled_mm_fast_accum",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.bfloat16,
use_fast_accum=True))
# pytorch impl: fp16 output, without fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_fp16_scaled_mm",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.float16))
# pytorch impl: fp16 output, with fp8 fast accum
timers.append(
bench_fn(label,
sub_label,
"pytorch_fp8_fp8_fp16_scaled_mm_fast_accum",
torch._scaled_mm,
a,
b,
scale_a=scale_a,
scale_b=scale_b,
out_dtype=torch.float16,
use_fast_accum=True))
# cutlass impl: bf16 output
timers.append(
bench_fn(label, sub_label, "cutlass_fp8_fp8_bf16_scaled_mm",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b,
torch.bfloat16))
# cutlass impl: fp16 output
timers.append(
bench_fn(label, sub_label, "cutlass_fp8_fp8_fp16_scaled_mm",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b, torch.float16))
# cutlass impl: bf16 output, with bias
timers.append(
bench_fn(label, sub_label, "cutlass_fp8_fp8_bf16_scaled_mm_bias",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b, torch.bfloat16,
bias))
# cutlass impl: fp16 output, with bias
timers.append(
bench_fn(label, sub_label, "cutlass_fp8_fp8_fp16_scaled_mm_bias",
ops.cutlass_scaled_mm, a, b, scale_a, scale_b, torch.float16,
bias.to(dtype=torch.float16)))
for name, fn in bench_fns.items():
# If bench_kernels is None, run all. Otherwise, run only exact matches.
if bench_kernels is None or name in bench_kernels:
print(f"Running {name}")
timers.append(bench_fn(label, sub_label, name, fn))
return timers
def bench(dtype: torch.dtype, m: int, k: int, n: int, label: str,
sub_label: str) -> Iterable[TMeasurement]:
def bench(dtype: torch.dtype,
m: int,
k: int,
n: int,
label: str,
sub_label: str,
bench_kernels: Optional[list[str]] = None) -> Iterable[TMeasurement]:
if dtype == torch.int8:
return bench_int8(dtype, m, k, n, label, sub_label)
return bench_int8(dtype, m, k, n, label, sub_label, bench_kernels)
if dtype == torch.float8_e4m3fn:
return bench_fp8(dtype, m, k, n, label, sub_label)
return bench_fp8(dtype, m, k, n, label, sub_label, bench_kernels)
raise ValueError("unsupported type")
@@ -231,20 +196,24 @@ def print_timers(timers: Iterable[TMeasurement]):
def run(dtype: torch.dtype,
MKNs: Iterable[Tuple[int, int, int]]) -> Iterable[TMeasurement]:
MKNs: Iterable[tuple[int, int, int]],
bench_kernels: Optional[list[str]] = None) -> Iterable[TMeasurement]:
results = []
for m, k, n in MKNs:
timers = bench(dtype, m, k, n, f"scaled-{dtype}-gemm",
f"MKN=({m}x{k}x{n})")
timers = bench(dtype,
m,
k,
n,
f"scaled-{dtype}-gemm",
f"MKN=({m}x{k}x{n})",
bench_kernels=bench_kernels)
print_timers(timers)
results.extend(timers)
return results
# output makers
def make_output(data: Iterable[TMeasurement],
MKNs: Iterable[Tuple[int, int, int]],
MKNs: Iterable[tuple[int, int, int]],
base_description: str,
timestamp=None):
print(f"== All Results {base_description} ====")
@@ -256,15 +225,11 @@ def make_output(data: Iterable[TMeasurement],
pkl.dump(data, f)
# argparse runners
def run_square_bench(args):
dim_sizes = list(
range(args.dim_start, args.dim_end + 1, args.dim_increment))
MKNs = list(zip(dim_sizes, dim_sizes, dim_sizes))
data = run(args.dtype, MKNs)
data = run(args.dtype, MKNs, bench_kernels=args.kernels)
make_output(data, MKNs, f"square_bench-{args.dtype}")
@@ -275,8 +240,7 @@ def run_range_bench(args):
Ks = [args.k_constant] * n if args.k_constant is not None else dim_sizes
Ns = [args.n_constant] * n if args.n_constant is not None else dim_sizes
MKNs = list(zip(Ms, Ks, Ns))
data = run(args.dtype, MKNs)
data = run(args.dtype, MKNs, bench_kernels=args.kernels)
make_output(data, MKNs, f"range_bench-{args.dtype}")
@@ -285,7 +249,7 @@ def run_model_bench(args):
for i, model in enumerate(args.models):
print(f"[{i}] {model}")
def model_shapes(model_name: str, tp_size: int) -> List[Tuple[int, int]]:
def model_shapes(model_name: str, tp_size: int) -> list[tuple[int, int]]:
KNs = []
for KN, tp_split_dim in copy.deepcopy(WEIGHT_SHAPES[model_name]):
KN[tp_split_dim] = KN[tp_split_dim] // tp_size
@@ -302,7 +266,7 @@ def run_model_bench(args):
for k, n in KNs:
MKNs.append((m, k, n))
data = run(args.dtype, MKNs)
data = run(args.dtype, MKNs, bench_kernels=args.kernels)
model_bench_data.append(data)
# Print all results
@@ -352,6 +316,15 @@ Benchmark Cutlass GEMM.
type=to_torch_dtype,
required=True,
help="Available options are ['int8', 'fp8']")
parser.add_argument(
"--kernels",
nargs="+",
type=str,
default=None,
help=
"Exact names of the kernels to benchmark. If not set, runs all kernels."
)
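    # Example (illustrative): passing
    #   --kernels cutlass_fp8_fp8_bf16_scaled_mm triton_fp8_fp8_fp16_scaled_mm_blockwise
    # restricts the run to those two entries of the bench_fns dictionaries
    # above; names must match exactly, otherwise they are silently skipped.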
subparsers = parser.add_subparsers(dest="cmd")
square_parser = subparsers.add_parser("square_bench")


@@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
# Weight Shapes are in the format
# ([K, N], TP_SPLIT_DIM)
# Example:
@@ -40,4 +42,4 @@ WEIGHT_SHAPES = {
([8192, 57344], 1),
([28672, 8192], 0),
],
}
}


@@ -0,0 +1,145 @@
#!/bin/bash
# Benchmark the overhead of disaggregated prefill.
# Methodology:
# - Send all requests to the prefill vLLM instance, which buffers the KV cache.
# - Then send all requests to the decode instance.
# - The TTFT of the decode instance is the overhead.
set -ex
kill_gpu_processes() {
# kill all processes on GPU.
pgrep pt_main_thread | xargs -r kill -9
pgrep python3 | xargs -r kill -9
sleep 10
# remove vllm config file
rm -rf ~/.config/vllm
# Print the GPU memory usage
# so that we know if all GPU processes are killed.
gpu_memory_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0)
# The memory usage should be 0 MB.
echo "GPU 0 Memory Usage: $gpu_memory_usage MB"
}
wait_for_server() {
# wait for vllm server to start
# return 1 if vllm server crashes
local port=$1
timeout 1200 bash -c "
until curl -s localhost:${port}/v1/completions > /dev/null; do
sleep 1
done" && return 0 || return 1
}
benchmark() {
export VLLM_LOGGING_LEVEL=DEBUG
export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
# compare chunked prefill with disaggregated prefill
results_folder="./results"
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
dataset_name="sonnet"
dataset_path="../sonnet_4x.txt"
num_prompts=10
qps=$1
prefix_len=50
input_len=2048
output_len=$2
CUDA_VISIBLE_DEVICES=0 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8100 \
--max-model-len 10000 \
--gpu-memory-utilization 0.6 \
--kv-transfer-config \
'{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9}' &
CUDA_VISIBLE_DEVICES=1 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8200 \
--max-model-len 10000 \
--gpu-memory-utilization 0.6 \
--kv-transfer-config \
'{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9}' &
wait_for_server 8100
wait_for_server 8200
# let the prefill instance finish prefill
python3 ../benchmark_serving.py \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8100 \
--save-result \
--result-dir $results_folder \
--result-filename disagg_prefill_tp1.json \
--request-rate "inf"
# send the request to decode.
# The TTFT of this command will be the overhead of disagg prefill impl.
python3 ../benchmark_serving.py \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8200 \
--save-result \
--result-dir $results_folder \
--result-filename disagg_prefill_tp1_overhead.json \
--request-rate "$qps"
kill_gpu_processes
}
main() {
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
(which jq) || (apt-get -y install jq)
(which socat) || (apt-get -y install socat)
pip install quart httpx datasets
cd "$(dirname "$0")"
cd ..
# create sonnet_4x.txt
echo "" > sonnet_4x.txt
for _ in {1..4}
do
cat sonnet.txt >> sonnet_4x.txt
done
cd disagg_benchmarks
rm -rf results
mkdir results
default_qps=1
default_output_len=1
benchmark $default_qps $default_output_len
}
main "$@"


@@ -0,0 +1,163 @@
#!/bin/bash
# Requirement: 2x GPUs.
# Model: meta-llama/Meta-Llama-3.1-8B-Instruct
# Query: 1024 input tokens, 6 output tokens, QPS 2/4/6/8, 100 requests
# Resource: 2x GPU
# Approaches:
# 1. Chunked prefill: 2 vLLM instances with tp=4, equivalent to 1 tp=4 instance at QPS 4
# 2. Disaggregated prefill: 1 prefilling instance and 1 decoding instance
#    Prefilling instance: max_output_token=1
#    Decoding instance: force the input tokens to be the same across requests to bypass prefilling
set -ex
kill_gpu_processes() {
# kill all processes on GPU.
pgrep pt_main_thread | xargs -r kill -9
pgrep python3 | xargs -r kill -9
for port in 8000 8100 8200; do lsof -t -i:$port | xargs -r kill -9; done
sleep 1
}
wait_for_server() {
# wait for vllm server to start
# return 1 if vllm server crashes
local port=$1
timeout 1200 bash -c "
until curl -s localhost:${port}/v1/completions > /dev/null; do
sleep 1
done" && return 0 || return 1
}
launch_chunked_prefill() {
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
# disagg prefill
CUDA_VISIBLE_DEVICES=0 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8100 \
--max-model-len 10000 \
--enable-chunked-prefill \
--gpu-memory-utilization 0.6 &
CUDA_VISIBLE_DEVICES=1 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8200 \
--max-model-len 10000 \
--enable-chunked-prefill \
--gpu-memory-utilization 0.6 &
wait_for_server 8100
wait_for_server 8200
python3 round_robin_proxy.py &
sleep 1
}
launch_disagg_prefill() {
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
# disagg prefill
CUDA_VISIBLE_DEVICES=0 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8100 \
--max-model-len 10000 \
--gpu-memory-utilization 0.6 \
--kv-transfer-config \
'{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":5e9}' &
CUDA_VISIBLE_DEVICES=1 python3 \
-m vllm.entrypoints.openai.api_server \
--model $model \
--port 8200 \
--max-model-len 10000 \
--gpu-memory-utilization 0.6 \
--kv-transfer-config \
'{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":5e9}' &
wait_for_server 8100
wait_for_server 8200
python3 disagg_prefill_proxy_server.py &
sleep 1
}
benchmark() {
results_folder="./results"
model="meta-llama/Meta-Llama-3.1-8B-Instruct"
dataset_name="sonnet"
dataset_path="../sonnet_4x.txt"
num_prompts=100
qps=$1
prefix_len=50
input_len=1024
output_len=$2
tag=$3
python3 ../benchmark_serving.py \
--backend vllm \
--model $model \
--dataset-name $dataset_name \
--dataset-path $dataset_path \
--sonnet-input-len $input_len \
--sonnet-output-len "$output_len" \
--sonnet-prefix-len $prefix_len \
--num-prompts $num_prompts \
--port 8000 \
--save-result \
--result-dir $results_folder \
--result-filename "$tag"-qps-"$qps".json \
--request-rate "$qps"
sleep 2
}
main() {
(which wget && which curl) || (apt-get update && apt-get install -y wget curl)
(which jq) || (apt-get -y install jq)
(which socat) || (apt-get -y install socat)
(which lsof) || (apt-get -y install lsof)
pip install quart httpx matplotlib aiohttp datasets
cd "$(dirname "$0")"
cd ..
# create sonnet_4x.txt so that we can sample 2048 tokens for input
echo "" > sonnet_4x.txt
for _ in {1..4}
do
cat sonnet.txt >> sonnet_4x.txt
done
cd disagg_benchmarks
rm -rf results
mkdir results
default_output_len=6
export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
launch_chunked_prefill
for qps in 2 4 6 8; do
benchmark $qps $default_output_len chunked_prefill
done
kill_gpu_processes
launch_disagg_prefill
for qps in 2 4 6 8; do
benchmark $qps $default_output_len disagg_prefill
done
kill_gpu_processes
python3 visualize_benchmark_results.py
}
main "$@"


@@ -0,0 +1,63 @@
# SPDX-License-Identifier: Apache-2.0
import os
import aiohttp
from quart import Quart, make_response, request
AIOHTTP_TIMEOUT = aiohttp.ClientTimeout(total=6 * 60 * 60)
app = Quart(__name__)
async def forward_request(url, data):
async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
headers = {
"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
}
async with session.post(url=url, json=data,
headers=headers) as response:
if response.status == 200:
# if response.headers.get('Transfer-Encoding') == 'chunked':
if True:
async for chunk_bytes in response.content.iter_chunked(
1024):
yield chunk_bytes
else:
content = await response.read()
yield content
@app.route('/v1/completions', methods=['POST'])
async def handle_request():
try:
original_request_data = await request.get_json()
prefill_request = original_request_data.copy()
# set max_tokens = 1 so this request only does prefill
prefill_request['max_tokens'] = 1
# finish prefill
async for _ in forward_request('http://localhost:8100/v1/completions',
prefill_request):
continue
# return decode
generator = forward_request('http://localhost:8200/v1/completions',
original_request_data)
response = await make_response(generator)
response.timeout = None
return response
except Exception as e:
import sys
import traceback
exc_info = sys.exc_info()
print("Error occurred in disagg prefill proxy server")
print(e)
print("".join(traceback.format_exception(*exc_info)))
if __name__ == '__main__':
app.run(port=8000)
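
A minimal client sketch for this proxy (illustrative; it assumes the proxy is on port 8000 and the two vLLM servers were launched as in the shell scripts above). The proxy first replays the request against the prefill instance with max_tokens=1, then streams the decode instance's output back to the caller.

# Hypothetical client; the prompt and sampling parameters are placeholders.
import requests

payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 64,
    "temperature": 0.0,
}
with requests.post("http://localhost:8000/v1/completions",
                   json=payload,
                   stream=True) as resp:
    for chunk in resp.iter_content(chunk_size=1024):
        print(chunk.decode("utf-8", errors="ignore"), end="")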


@@ -0,0 +1,62 @@
# SPDX-License-Identifier: Apache-2.0
import asyncio
import itertools
import aiohttp
from aiohttp import web
class RoundRobinProxy:
def __init__(self, target_ports):
self.target_ports = target_ports
self.port_cycle = itertools.cycle(self.target_ports)
async def handle_request(self, request):
target_port = next(self.port_cycle)
target_url = f"http://localhost:{target_port}{request.path_qs}"
async with aiohttp.ClientSession() as session:
try:
# Forward the request
async with session.request(
method=request.method,
url=target_url,
headers=request.headers,
data=request.content,
) as response:
# Start sending the response
resp = web.StreamResponse(status=response.status,
headers=response.headers)
await resp.prepare(request)
# Stream the response content
async for chunk in response.content.iter_any():
await resp.write(chunk)
await resp.write_eof()
return resp
except Exception as e:
return web.Response(text=f"Error: {str(e)}", status=500)
async def main():
proxy = RoundRobinProxy([8100, 8200])
app = web.Application()
app.router.add_route('*', '/{path:.*}', proxy.handle_request)
runner = web.AppRunner(app)
await runner.setup()
site = web.TCPSite(runner, 'localhost', 8000)
await site.start()
print("Proxy server started on http://localhost:8000")
# Keep the server running
await asyncio.Event().wait()
if __name__ == '__main__':
asyncio.run(main())


@@ -0,0 +1,48 @@
# SPDX-License-Identifier: Apache-2.0
import json
import matplotlib.pyplot as plt
import pandas as pd
if __name__ == "__main__":
data = []
for name in ['disagg_prefill', 'chunked_prefill']:
for qps in [2, 4, 6, 8]:
with open(f"results/{name}-qps-{qps}.json") as f:
x = json.load(f)
x['name'] = name
x['qps'] = qps
data.append(x)
df = pd.DataFrame.from_dict(data)
dis_df = df[df['name'] == 'disagg_prefill']
chu_df = df[df['name'] == 'chunked_prefill']
plt.style.use('bmh')
plt.rcParams['font.size'] = 20
for key in [
'mean_ttft_ms', 'median_ttft_ms', 'p99_ttft_ms', 'mean_itl_ms',
'median_itl_ms', 'p99_itl_ms'
]:
fig, ax = plt.subplots(figsize=(11, 7))
plt.plot(dis_df['qps'],
dis_df[key],
label='disagg_prefill',
marker='o',
linewidth=4)
plt.plot(chu_df['qps'],
chu_df[key],
label='chunked_prefill',
marker='o',
linewidth=4)
ax.legend()
ax.set_xlabel('QPS')
ax.set_ylabel(key)
ax.set_ylim(bottom=0)
fig.savefig(f'results/{key}.png')
plt.close(fig)


@@ -0,0 +1,176 @@
# SPDX-License-Identifier: Apache-2.0
import pickle as pkl
import time
from collections.abc import Iterable
from dataclasses import dataclass
from itertools import product
from typing import Callable, Optional
import torch
import torch.utils.benchmark as TBenchmark
from torch.utils.benchmark import Measurement as TMeasurement
from tqdm import tqdm
import vllm._custom_ops as ops
from vllm.model_executor.layers.layernorm import RMSNorm
@dataclass
class bench_params_t:
num_tokens: int
hidden_size: int
add_residual: bool
dtype: torch.dtype
def description(self):
return (f'N {self.num_tokens} '
f'x D {self.hidden_size} '
f'x R {self.add_residual} '
f'x DT {self.dtype}')
def get_bench_params() -> list[bench_params_t]:
## Test Fixtures
NUM_TOKENS = [2**x for x in range(11)]
HIDDEN_SIZES = list(range(1024, 8129, 1024))
ADD_RESIDUAL = [True, False]
DTYPES = [torch.bfloat16, torch.float]
combinations = product(NUM_TOKENS, HIDDEN_SIZES, ADD_RESIDUAL, DTYPES)
bench_params = list(map(lambda x: \
bench_params_t(x[0], x[1], x[2], x[3]), combinations))
return bench_params
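# For reference (illustrative), one generated configuration renders as:
#   bench_params_t(16, 1024, True, torch.bfloat16).description()
#   -> 'N 16 x D 1024 x R True x DT torch.bfloat16'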
# Reference impls
def unfused_int8_impl(rms_norm_layer: RMSNorm, x: torch.Tensor,
residual: Optional[torch.Tensor],
quant_dtype: torch.dtype):
# Norm
torch_out = None
if residual is None:
torch_out = rms_norm_layer.forward_cuda(x, residual)
else:
torch_out, _ = rms_norm_layer.forward_cuda(x, residual)
# Quant
torch_out, _, _ = ops.scaled_int8_quant(torch_out)
def unfused_fp8_impl(rms_norm_layer: RMSNorm, x: torch.Tensor,
residual: Optional[torch.Tensor],
quant_dtype: torch.dtype):
# Norm
torch_out = None
if residual is None:
torch_out = rms_norm_layer.forward_cuda(x, residual)
else:
torch_out, _ = rms_norm_layer.forward_cuda(x, residual)
# Quant
torch_out, _ = ops.scaled_fp8_quant(torch_out)
def fused_impl(
rms_norm_layer: RMSNorm, # this stores the weights
x: torch.Tensor,
residual: Optional[torch.Tensor],
quant_dtype: torch.dtype):
out, _ = ops.rms_norm_dynamic_per_token_quant(x,
rms_norm_layer.weight,
1e-6,
quant_dtype,
residual=residual)
# Bench functions
def bench_fn(rms_norm_layer: RMSNorm, x: torch.Tensor, residual: torch.Tensor,
quant_dtype: torch.dtype, label: str, sub_label: str,
fn: Callable, description: str) -> TMeasurement:
min_run_time = 1
globals = {
"rms_norm_layer": rms_norm_layer,
"x": x,
"residual": residual,
"quant_dtype": quant_dtype,
"fn": fn,
}
return TBenchmark.Timer(
stmt="fn(rms_norm_layer, x, residual, quant_dtype)",
globals=globals,
label=label,
sub_label=sub_label,
description=description,
).blocked_autorange(min_run_time=min_run_time)
def bench(params: bench_params_t, label: str, sub_label: str) \
-> Iterable[TMeasurement]:
# Make inputs
layer = RMSNorm(params.hidden_size, 1e-6).to(dtype=params.dtype)
# Make weights
layer.weight.data.normal_(mean=1.0, std=0.1)
# Make inputs
scale = 1 / params.hidden_size
x = torch.randn(params.num_tokens,
params.hidden_size,
dtype=params.dtype,
device='cuda') * scale
residual = (torch.randn_like(x) * scale).to(device='cuda') \
if params.add_residual else None
timers = []
# unfused int8 impl.
timers.append(
bench_fn(layer, x, residual, torch.int8, label, sub_label,
unfused_int8_impl, "unfused_int8_impl"))
# unfused fp8 impl.
timers.append(
bench_fn(layer, x, residual, torch.float8_e4m3fn, label, sub_label,
unfused_fp8_impl, "unfused_fp8_impl"))
# fused int8 impl.
timers.append(
bench_fn(layer, x, residual, torch.int8, label, sub_label, fused_impl,
"fused_int8_impl"))
# fused fp8 impl.
timers.append(
bench_fn(layer, x, residual, torch.float8_e4m3fn, label, sub_label,
fused_impl, "fused_fp8_impl"))
print_timers(timers)
return timers
# launch bench
# runner
def print_timers(timers: Iterable[TMeasurement]):
compare = TBenchmark.Compare(timers)
compare.print()
def main():
torch.set_default_device('cuda')
bench_params = get_bench_params()
timers = []
for bp in tqdm(bench_params):
timers.extend(
bench(bp, "rms-norm-dynamic-per-token-quant", bp.description()))
print_timers(timers)
# pickle all the results
timestamp = int(time.time())
with open(f"rms_norm_dpt_quant-{timestamp}.pkl", "wb") as f:
pkl.dump(timers, f)
if __name__ == '__main__':
main()


@@ -1,3 +1,5 @@
# SPDX-License-Identifier: Apache-2.0
import os
import sys
from typing import Optional

Some files were not shown because too many files have changed in this diff.