f49777ba62
Deepseek v3 ( #11502 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: robertgshaw2-neuralmagic <rshaw@neuralmagic.com >
2024-12-26 16:09:44 -08:00
55fb97f7bd
[2/N] API Server: Avoid ulimit footgun ( #11530 )
2024-12-26 23:43:05 +00:00
2072924d14
[Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization ( #11523 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: HandH1998 <1335248067@qq.com >
2024-12-26 15:33:30 -08:00
720b10fdc6
[1/N] API Server (Remove Proxy) ( #11529 )
2024-12-26 23:03:43 +00:00
b85a977822
[Doc] Add video example to openai client for multimodal ( #11521 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-26 17:31:29 +00:00
eec906d811
[Misc] Add placeholder module ( #11501 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-26 13:12:51 +00:00
f57ee5650d
[Model] Modify MolmoForCausalLM MLP ( #11510 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-26 13:12:05 +00:00
dcb1a944d4
[V1] Adding min tokens/repetition/presence/frequency penalties to V1 sampler ( #10681 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-26 19:02:58 +09:00
7492a36207
[Doc] Add QVQ and QwQ to the list of supported models ( #11509 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-12-26 09:44:32 +00:00
aa25985bd1
[Misc][LoRA] Fix LoRA weight mapper ( #11495 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-26 15:52:48 +08:00
dbeac95dbb
Mypy checking for vllm/compilation ( #11496 )
...
Signed-off-by: lucast2021 <lucast2021@headroyce.org >
Co-authored-by: lucast2021 <lucast2021@headroyce.org >
2024-12-26 05:04:07 +00:00
51a624bf02
[Misc] Move some multimodal utils to modality-specific modules ( #11494 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-26 04:23:20 +00:00
6ad909fdda
[Doc] Improve GitHub links ( #11491 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-25 14:49:26 -08:00
b689ada91e
[Frontend] Enable decord to load video from base64 ( #11492 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-25 16:33:55 +00:00
fc601665eb
[Misc] Update disaggregation benchmark scripts and test logs ( #11456 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
2024-12-25 06:58:48 +00:00
9832e5572a
[V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor ( #11472 )
2024-12-24 19:49:46 -08:00
3f3e92e1f2
[Model] Automatic conversion of classification and reward models ( #11469 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-24 18:22:22 +00:00
409475a827
[Bugfix] Fix issues in CPU build Dockerfile. Fixes #9182 ( #11435 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2024-12-24 16:53:28 +00:00
196c34b0ac
[Misc] Move weights mapper ( #11443 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-24 13:05:25 +00:00
5c7963249d
[attn][tiny fix] fix attn backend in MultiHeadAttention ( #11463 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2024-12-24 12:39:36 +00:00
461cde2080
[OpenVINO] Fixed installation conflicts ( #11458 )
...
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com >
2024-12-24 11:38:21 +00:00
7a5286cc04
[Bugfix][Hardware][CPU] Fix CPU input_positions creation for text-only inputs with mrope ( #11434 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-24 17:59:51 +08:00
b1b1038fbd
[Bugfix] Fix Qwen2-VL LoRA weight loading ( #11430 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-24 09:56:10 +00:00
9edca6bf8f
[Frontend] Online Pooling API ( #11457 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-24 17:54:30 +08:00
4f074fbf53
[Misc] Suppress irrelevant exception stack trace information when CUDA… ( #11438 )
...
Co-authored-by: shiquan <shiquan>
2024-12-24 08:43:39 +00:00
a491d6f535
[V1] TP Ray executor ( #11107 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-12-23 23:00:12 +00:00
32aa2059ad
[Docs] Convert rST to MyST (Markdown) ( #11145 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-12-23 22:35:38 +00:00
94d545a1a1
[Doc] Fix typo in the help message of '--guided-decoding-backend' ( #11440 )
2024-12-23 20:20:44 +00:00
60fb4f3bcf
[Bugfix] Add kv cache scales to gemma2.py ( #11269 )
2024-12-23 19:30:45 +00:00
63afbe9215
[CI] Expand OpenAI test_chat.py guided decoding tests ( #11048 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-23 18:35:38 +00:00
8cef6e02dc
[Misc] add w8a8 asym models ( #11075 )
2024-12-23 13:33:20 -05:00
b866cdbd05
[Misc] Add assertion and helpful message for marlin24 compressed models ( #11388 )
2024-12-24 02:23:38 +08:00
2e726680b3
[Bugfix] torch nightly version in ROCm installation guide ( #11423 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2024-12-23 17:20:22 +00:00
5bfb30a529
[Bugfix] Fix CFGGuide and use outlines for grammars that can't convert to GBNF ( #11389 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-23 23:06:20 +08:00
e51719ae72
mypy type checking for vllm/worker ( #11418 )
...
Signed-off-by: lucast2021 <lucast2021@headroyce.org >
Co-authored-by: lucast2021 <lucast2021@headroyce.org >
2024-12-23 13:55:49 +00:00
f30581c518
[misc][perf] remove old code ( #11425 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-23 08:01:08 +00:00
048fc57a0f
[CI] Unblock H100 Benchmark ( #11419 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-12-22 14:17:43 -08:00
f1d1bf6288
[Bugfix] Fix fully sharded LoRAs with Mixtral ( #11390 )
...
Signed-off-by: Jason Greene <jason.greene@redhat.com >
2024-12-22 23:25:10 +08:00
72d9c316d3
[cd][release] fix race conditions ( #11407 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-22 00:39:11 -08:00
4a9139780a
[cd][release] add pypi index for every commit and nightly build ( #11404 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-12-21 23:53:44 -08:00
29c748930e
[CI] Fix flaky entrypoint tests ( #11403 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-21 21:08:44 -08:00
c2d1b075ba
[Bugfix] Fix issues for Pixtral-Large-Instruct-2411 ( #11393 )
...
Signed-off-by: ywang96 <ywang@example.com >
Co-authored-by: ywang96 <ywang@example.com >
2024-12-21 10:15:03 +00:00
584f0ae40d
[V1] Make AsyncLLMEngine v1-v0 opaque ( #11383 )
...
Signed-off-by: Ricky Xu <xuchen727@hotmail.com >
2024-12-21 15:14:08 +08:00
51ff216d85
[Bugfix] update should_ignore_layer ( #11354 )
...
Signed-off-by: George Ohashi <george@neuralmagic.com >
2024-12-21 06:36:23 +00:00
dd2b5633dd
[V1][Bugfix] Skip hashing empty or None mm_data ( #11386 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-21 14:22:21 +09:00
47a0b615b4
Add ray[default] to wget to run distributed inference out of box ( #11265 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
2024-12-20 13:54:55 -08:00
5d2248d81a
[doc] explain nccl requirements for rlhf ( #11381 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-20 13:00:56 -08:00
d573aeadcc
[Bugfix] Don't log OpenAI field aliases as ignored ( #11378 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-20 19:03:50 +00:00
995f56236b
[Core] Loading model from S3 using RunAI Model Streamer as optional loader ( #10192 )
...
Signed-off-by: OmerD <omer@run.ai >
2024-12-20 16:46:24 +00:00
7c7aa37c69
[CI/Build] fix pre-compiled wheel install for exact tag ( #11373 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
2024-12-21 00:14:40 +08:00
04139ade59
[V1] Fix profiling for models with merged input processor ( #11370 )
...
Signed-off-by: ywang96 <ywang@roblox.com >
2024-12-20 12:04:21 +00:00
1ecc645b8f
[doc] backward compatibility for 0.6.4 ( #11359 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-19 21:33:53 -08:00
c954f21ac0
[misc] add early error message for custom ops ( #11355 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-19 21:18:25 -08:00
86c2d8fd1c
[Bugfix] Fix spec decoding when seed is none in a batch ( #10863 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-12-20 05:15:31 +00:00
b880ffb87e
[Misc] Add tqdm progress bar during graph capture ( #11349 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-20 04:35:18 +00:00
7801f56ed7
[ci][gh200] dockerfile clean up ( #11351 )
...
Signed-off-by: drikster80 <ed.sealing@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: drikster80 <ed.sealing@gmail.com >
Co-authored-by: cenzhiyao <2523403608@qq.com >
2024-12-19 18:13:06 -08:00
48edab8041
[Bugfix][Hardware][POWERPC] Fix auto dtype failure in case of POWER10 ( #11331 )
...
Signed-off-by: Akash Kaothalkar <0052v2@linux.vnet.ibm.com >
2024-12-20 01:32:07 +00:00
a985f7af9f
[CI] Adding CPU docker pipeline ( #11261 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
2024-12-19 11:46:55 -08:00
e461c262f0
[Misc] Remove unused vllm/block.py ( #11336 )
2024-12-19 17:54:24 +00:00
276738ce0f
[Bugfix] Fix broken CPU compressed-tensors test ( #11338 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-19 17:37:31 +00:00
cdf22afdda
[Misc] Clean up and consolidate LRUCache ( #11339 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-20 00:59:32 +08:00
e24113a8fe
[Model] Refactor Qwen2-VL to use merged multimodal processor ( #11258 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 16:28:00 +00:00
7379b3d4b2
[V1] Fix multimodal profiling for Molmo ( #11325 )
...
Signed-off-by: ywang96 <ywang@example.com >
Co-authored-by: ywang96 <ywang@example.com >
2024-12-19 16:27:22 +00:00
6c7f881541
[Model] Add JambaForSequenceClassification model ( #10860 )
...
Signed-off-by: Yehoshua Cohen <yehoshuaco@ai21.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Yehoshua Cohen <yehoshuaco@ai21.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 22:48:06 +08:00
a0f7d53beb
[Bugfix] Cleanup Pixtral HF code ( #11333 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 13:22:00 +00:00
5aef49806d
[Feature] Add load generation config from model ( #11164 )
...
Signed-off-by: liuyanyi <wolfsonliu@163.com >
Signed-off-by: Yanyi Liu <wolfsonliu@163.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-12-19 10:50:38 +00:00
98356735ac
[misc] benchmark_throughput: Add LoRA ( #11267 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-19 15:43:16 +08:00
f26c4aeecb
[Misc] Optimize ray worker initialization time ( #11275 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-18 23:38:02 -08:00
8936316d58
[Kernel] Refactor Cutlass c3x ( #10049 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-19 07:00:18 +00:00
6142ef0ada
[VLM] Merged multimodal processor for Qwen2-Audio ( #11303 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 06:14:17 +00:00
c6b0a7d3ba
[V1] Simplify prefix caching logic by removing num_evictable_computed_blocks ( #11310 )
2024-12-19 04:17:12 +00:00
a30482f054
[CI] Expand test_guided_generate to test all backends ( #11313 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-19 04:00:38 +00:00
17ca964273
[Model] IBM Granite 3.1 ( #11307 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-12-19 11:27:24 +08:00
5a9da2e6e9
[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) ( #11311 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-19 02:43:30 +00:00
fdea8ec167
[V1] VLM - enable processor cache by default ( #11305 )
...
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com >
2024-12-18 18:54:46 -05:00
ca5f54a9b9
[Bugfix] fix minicpmv test ( #11304 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-12-18 10:34:26 -08:00
f954fe0e65
[FIX] update openai version ( #11287 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2024-12-18 10:17:05 -08:00
362cff1eb3
[CI][Misc] Remove Github Action Release Workflow ( #11274 )
2024-12-18 10:16:53 -08:00
996aa70f00
[Bugfix] Fix broken phi3-v mm_processor_kwargs tests ( #11263 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-18 10:16:40 -08:00
60508ffda9
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support ( #10995 )
...
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com >
Co-authored-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-12-18 09:57:16 -05:00
f04e407e6b
[MISC][XPU] update ipex link for CI fix ( #11278 )
2024-12-17 22:34:23 -08:00
8b79f9e107
[Bugfix] Fix guided decoding with tokenizer mode mistral ( #11046 )
2024-12-17 22:34:08 -08:00
866fa4550d
[Bugfix] Restore support for larger block sizes ( #11259 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2024-12-17 16:39:07 -08:00
bf8717ebae
[V1] Prefix caching for vision language models ( #11187 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-17 16:37:59 -08:00
c77eb8a33c
[Bugfix] Set temperature=0.7 in test_guided_choice_chat ( #11264 )
2024-12-17 16:34:06 -08:00
2d1b9baa8f
[Bugfix] Fix request cancellation without polling ( #11190 )
2024-12-17 12:26:32 -08:00
f9ecbb18bf
[Misc] Allow passing logits_soft_cap for xformers backend ( #11252 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-17 00:37:04 -08:00
02222a0256
[Misc] Kernel Benchmark for RMSNorm ( #11241 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Xiaoyu Zhang <BBuf@users.noreply.github.com >
2024-12-17 06:57:02 +00:00
2bfdbf2a36
[V1][Core] Use weakref.finalize instead of atexit ( #11242 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-16 22:11:33 -08:00
e88db68cf5
[Platform] platform agnostic for EngineArgs initialization ( #11225 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-16 22:11:06 -08:00
59c9b6ebeb
[V1][VLM] Proper memory profiling for image language models ( #11210 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: ywang96 <ywang@example.com >
2024-12-16 22:10:57 -08:00
66d4b16724
[Frontend] Add OpenAI API support for input_audio ( #11027 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-16 22:09:58 -08:00
0064f697d3
[CI] Add test case with JSON schema using references + use xgrammar by default with OpenAI parse ( #10935 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-17 11:39:58 +08:00
35bae114a8
fix gh200 tests on main ( #11246 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-16 17:22:38 -08:00
88a412ed3d
[torch.compile] fast inductor ( #11108 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-16 16:15:22 -08:00
c301616ed2
[ci][tests] add gh200 tests ( #11244 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-16 15:53:18 -08:00
35ffa682b1
[Docs] hint to enable use of GPU performance counters in profiling tools for multi-node distributed serving ( #11235 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-12-16 22:20:39 +00:00
551603feff
[core] overhaul memory profiling and fix backward compatibility ( #10511 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-16 13:32:25 -08:00
efbce85f4d
[misc] Layerwise profile updates ( #10242 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-16 18:14:57 +00:00
2ca830dbaa
[Doc] Reorder vision language examples in alphabet order ( #11228 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-16 11:23:33 +00:00
d927dbcd88
[Model] Refactor Ultravox to use merged input processor ( #11198 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-16 10:09:53 +00:00
bddbbcb132
[Model] Support Cohere2ForCausalLM (Cohere R7B) ( #11203 )
2024-12-16 09:56:19 +00:00
b3b1526f03
WIP: [CI/Build] simplify Dockerfile build for ARM64 / GH200 ( #11212 )
...
Signed-off-by: drikster80 <ed.sealing@gmail.com >
Co-authored-by: drikster80 <ed.sealing@gmail.com >
2024-12-16 09:20:49 +00:00
17138af7c4
[Bugfix] Fix the default value for temperature in ChatCompletionRequest ( #11219 )
2024-12-16 00:15:40 -08:00
69ba344de8
[Bugfix] Fix block size validation ( #10938 )
2024-12-15 16:38:40 -08:00
da6f409246
Update deploying_with_k8s.rst ( #10922 )
2024-12-15 16:33:58 -08:00
25ebed2f8c
[V1][Minor] Cache np arange to reduce input preparation overhead ( #11214 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-15 13:33:00 -08:00
d263bd9df7
[Core] Support disaggregated prefill with Mooncake Transfer Engine ( #10884 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2024-12-15 21:28:18 +00:00
38e599d6a8
[Doc] add documentation for disaggregated prefilling ( #11197 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2024-12-15 13:31:16 -06:00
96d673e0f8
[Bugfix] Fix error handling of unsupported sliding window ( #11213 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-15 10:59:42 -07:00
b10609e6a1
[Misc] Clean up multi-modal processor ( #11207 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-15 06:30:28 +00:00
a1c02058ba
[torch.compile] allow tracking forward time ( #11081 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-14 19:45:00 -08:00
15859f2357
[Misc] Upgrade bitsandbytes to the latest version 0.45.0 ( #11201 )
2024-12-15 03:03:06 +00:00
886936837c
[Performance][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion ( #7209 )
2024-12-14 11:38:10 -08:00
6d917d0eeb
Enable mypy checking on V1 code ( #11105 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2024-12-14 09:54:04 -08:00
93abf23a64
[VLM] Fully dynamic prompt replacement in merged input processor ( #11199 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-14 17:52:18 +00:00
9c3dadd1c9
[Frontend] Add logits_processors as an extra completion argument ( #11150 )
...
Signed-off-by: Brad Hilton <brad.hilton.nw@gmail.com >
2024-12-14 16:46:42 +00:00
3cb5769883
[Misc] Minor improvements to the readability of PunicaWrapperBase ( #11200 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-14 16:38:27 +00:00
ea7bd68d10
[V1][Bugfix] Fix V1 TP trust-remote-code ( #11182 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-14 08:21:23 +00:00
48259264a4
[Core] Update outlines and increase its threadpool size ( #11140 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-14 07:46:18 +00:00
24a3d12b82
update compressed-tensors to latest version ( #11183 )
...
Co-authored-by: dhuangnm <dhuang@MacBook-Pro-2.local >
2024-12-14 03:22:44 +00:00
9855aea21b
[Bugfix][V1] Re-compute an entire block when fully cache hit ( #11186 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-13 17:08:23 -08:00
4b5b8a6a3b
[V1][Bugfix] Fix EngineCoreProc profile ( #11185 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-13 17:02:35 -08:00
4863e5fba5
[Core] V1: Use multiprocessing by default ( #11074 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-13 16:27:32 -08:00
0d8451c3a4
[Distributed] Allow the placement group more time to wait for resources to be ready ( #11138 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
2024-12-13 20:17:37 +00:00
0a56bcc03d
[Bugfix][Hardware][CPU] Enable Gemma2 with SDPA on CPU backend ( #11169 )
2024-12-13 18:00:40 +00:00
0920ab9131
[Doc] Reorganize online pooling APIs ( #11172 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-14 00:22:22 +08:00
238c0d93b4
[Misc] Add tokenizer_mode param to benchmark_serving.py ( #11174 )
...
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com >
2024-12-13 16:19:10 +00:00
5b0ed8391d
[Bugfix] using len(tokenizer) instead of tokenizer.vocab_size in AllowedTokenIdsLogitsProcessor ( #11156 )
2024-12-13 15:56:19 +00:00
c31d4a57a6
[Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching ( #8240 )
2024-12-13 07:51:25 -08:00
d1fa714cb1
[Refactor] A simple device-related refactor ( #11163 )
...
Signed-off-by: noemotiovon <noemotiovon@gmail.com >
Co-authored-by: noemotiovon <noemotiovon@gmail.com >
2024-12-13 13:39:00 +00:00
969da7d70b
[V1][VLM] Fix edge case bug for InternVL2 ( #11165 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-13 11:09:30 +00:00
eeec9e3390
[Frontend] Separate pooling APIs in offline inference ( #11129 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-13 10:40:07 +00:00
f93bf2b189
[Bugfix][CI][CPU] add missing datasets package to requirements-cpu.txt ( #11159 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-12-13 08:50:35 +00:00
7cd7409142
PaliGemma 2 support ( #11142 )
2024-12-13 07:40:07 +00:00
be39e3cd18
[core] clean up cudagraph batchsize padding logic ( #10996 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-13 06:57:50 +00:00
34f1a806d5
[Bugfix][V1] Fix 'NoneType' object has no attribute 'hash_value' ( #11157 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-13 06:30:06 +00:00
00c1bde5d8
[ROCm][AMD] Disable auto enabling chunked prefill on ROCm ( #11146 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-13 05:31:26 +00:00
3989a79824
[Bugfix] Update starcoder2 to remap k/v scale names for kv_cache quantization ( #11148 )
2024-12-13 05:07:20 +00:00
1efce68605
[Bugfix] Use runner_type instead of task in GritLM ( #11144 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2024-12-13 04:09:53 +00:00
30870b4f66
[torch.compile] Dynamic fp8 + rms_norm fusion ( #10906 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-13 03:19:23 +00:00
78ed8f57d8
[Misc][V1] Fix type in v1 prefix caching ( #11151 )
2024-12-13 00:57:40 +00:00
db6c264a1e
[Bugfix] Fix value unpack error of simple connector for KVCache transfer. ( #11058 )
...
Signed-off-by: ShangmingCai <csmthu@gmail.com >
2024-12-12 21:19:17 +00:00
9f3974a319
Fix logging of the vLLM Config ( #11143 )
2024-12-12 12:05:57 -08:00
2c97eca1ff
[Misc] Validate grammar and fail early ( #11119 )
2024-12-12 18:34:26 +00:00
5d712571af
[Bugfix] Quick fix to make Pixtral-HF load correctly again after 39e227c7ae. ( #11024 )
2024-12-12 18:09:20 +00:00
d4d5291cc2
fix(docs): typo in helm install instructions ( #11141 )
...
Signed-off-by: Ramon Ziai <ramon.ziai@bettermarks.com >
2024-12-12 17:36:32 +00:00
4816d20aa4
[V1] Fix torch profiling for offline inference ( #11125 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-12 15:51:53 +00:00
85362f028c
[Misc][LoRA] Ensure Lora Adapter requests return adapter name ( #11094 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-12 09:25:16 +00:00
62de37a38e
[core][distributed] initialization from StatelessProcessGroup ( #10986 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-12 09:04:19 +00:00
8195824206
[Hardware][Intel-Gaudi] Enable LoRA support for Intel Gaudi (HPU) ( #10565 )
...
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai >
2024-12-12 08:09:28 +00:00
f092153fbe
[V1] Use more persistent buffers to optimize input preparation overheads ( #11111 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-11 23:14:20 -08:00
1da8f0e1dd
[Model] Add support for embedding model GritLM ( #10816 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2024-12-12 06:39:16 +00:00
ccede2b264
[Core] cleanup zmq ipc sockets on exit ( #11115 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-11 19:12:24 -08:00
24a36d6d5f
Update link to LlamaStack remote vLLM guide in serving_with_llamastack.rst ( #11112 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2024-12-12 02:39:21 +00:00
8fb26dac61
[Docs] Add media kit ( #11121 )
2024-12-11 17:33:11 -08:00
7439a8b5fc
[Bugfix] Multiple fixes to tool streaming with hermes and mistral ( #10979 )
...
Signed-off-by: cedonley <clayton@donley.io >
2024-12-12 01:10:12 +00:00
4e11683368
[V1] VLM preprocessor hashing ( #11020 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-12 00:55:30 +00:00
452a723bf2
[V1][Core] Remove should_shutdown to simplify core process termination ( #11113 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-11 23:34:54 +00:00
d1e21a979b
[CI/Build] Split up VLM tests ( #11083 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-12 06:18:16 +08:00
72ff3a9686
[core] Bump ray to use _overlap_gpu_communication in compiled graph tests ( #10410 )
...
Signed-off-by: Rui Qiao <ubuntu@ip-172-31-15-128.us-west-2.compute.internal >
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Rui Qiao <ubuntu@ip-172-31-15-128.us-west-2.compute.internal >
2024-12-11 11:36:35 -08:00
66aaa7722d
[torch.compile] remove graph logging in ci ( #11110 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-11 10:59:50 -08:00
d643c2aba1
[V1] Use input_ids as input for text-only models ( #11032 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-11 10:49:23 -08:00
91642db952
[torch.compile] use depyf to dump torch.compile internals ( #10972 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-11 10:43:05 -08:00
fd22220687
[Doc] Installed version of llmcompressor for int8/fp8 quantization ( #11103 )
...
Signed-off-by: Guangda Liu <bingps@users.noreply.github.com >
Co-authored-by: Guangda Liu <bingps@users.noreply.github.com >
2024-12-11 15:43:24 +00:00
b2f775456e
[CI/Build] Enable prefix caching test for AMD ( #11098 )
...
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com >
2024-12-11 15:23:37 +00:00
cad5c0a6ed
[Doc] Update docs to refer to pooling models ( #11093 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-11 13:36:27 +00:00
8f10d5e393
[Misc] Split up pooling tasks ( #10820 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-11 01:28:00 -08:00
40766ca1b8
[Bugfix]: Clamp -inf logprob values in prompt_logprobs ( #11073 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-12-11 01:27:39 -08:00
2e32f5d28d
[Bugfix] Fix Idefics3 fails during multi-image inference ( #11080 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-12-11 01:27:07 -08:00
61b1d2f6ae
[Core] v1: Use atexit to handle engine core client shutdown ( #11076 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-11 01:26:36 -08:00
9974fca047
[ci/build] Fix entrypoints test and pin outlines version ( #11088 )
2024-12-11 01:01:53 -08:00
3fb4b4f163
[ci/build] Fix AMD CI dependencies ( #11087 )
2024-12-11 00:39:53 -08:00
2e33fe4191
[CI/Build] Check transformers v4.47 ( #10991 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-11 05:02:02 +00:00
e39400a4b6
Fix streaming for granite tool call when <|tool_call|> is present ( #11069 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-12-11 04:51:40 +00:00
ffa48c9146
[Model] PP support for Mamba-like models ( #10992 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-12-10 21:53:37 -05:00
d5c5154fcf
[Misc] LoRA + Chunked Prefill ( #9057 )
2024-12-11 10:09:20 +08:00
9a93973708
[Bugfix] Fix Mamba multistep ( #11071 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-11 00:16:22 +00:00
134810b3d9
[V1][Bugfix] Always set enable_chunked_prefill = True for V1 ( #11061 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-10 14:41:23 -08:00
75f89dc44c
[torch.compile] add a flag to track batchsize statistics ( #11059 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-10 12:40:52 -08:00
e739194926
[Core] Update to outlines >= 0.1.8 ( #10576 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-10 12:08:16 -08:00
250ee65d72
[BUG] Remove token param #10921 ( #11022 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
2024-12-10 17:38:15 +00:00
9b9cef3145
[Bugfix] Backport request id validation to v0 ( #11036 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-12-10 16:38:23 +00:00
d05f88679b
[Misc][LoRA] Add PEFTHelper for LoRA ( #11003 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-10 11:12:01 +00:00
beb16b2c81
[Bugfix] Handle <|tool_call|> token in granite tool parser ( #11039 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-12-10 10:27:11 +00:00
fe2e10c71b
Add example of helm chart for vllm deployment on k8s ( #9199 )
...
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com >
2024-12-10 09:19:27 +00:00
82c73fd510
[Bugfix] cuda error running llama 3.2 ( #11047 )
2024-12-10 07:41:11 +00:00
bfd610430c
Update README.md ( #11034 )
2024-12-09 23:08:10 -08:00
e35879c276
[Bugfix] Fix xgrammar failing to read a vocab_size from LlavaConfig on PixtralHF. ( #11043 )
2024-12-10 14:54:22 +08:00
ebf778061d
monitor metrics of tokens per step using cudagraph batchsizes ( #11031 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-09 22:35:36 -08:00
28b3a1c7e5
[V1] Multiprocessing Tensor Parallel Support for v1 ( #9856 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-10 06:28:14 +00:00
bc192a2b09
[Pixtral] Improve loading ( #11040 )
2024-12-10 06:09:32 +00:00
980ad394a8
[Frontend] Use request id from header ( #10968 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-12-10 13:46:29 +08:00
391d7b2763
[Bugfix] Fix usage of deprecated decorator ( #11025 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-10 13:45:47 +08:00
d1f6d1c8af
[Model] Add has_weight to RMSNorm and re-enable weights loading tracker for Mamba ( #10739 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-10 10:23:07 +08:00
6d525288c1
[Docs] Add dedicated tool calling page to docs ( #10554 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-09 20:15:34 -05:00
6faec54505
[V1] Do not store None in self.generators ( #11038 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-09 15:08:19 -08:00
5ed5d5f128
Build tpu image in release pipeline ( #10936 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
2024-12-09 23:07:48 +00:00
b63ba84832
[ROCm][bugfix] speculative decoding worker class ( #11035 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-09 14:00:29 -08:00
9c6459e4cb
[Neuron] Upgrade neuron to 2.20.2 ( #11016 )
...
Signed-off-by: Jerzy Zagorski <jzagorsk@amazon.com >
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-12-09 13:53:24 -08:00
1a2f8fb828
[v1] fix use compile sizes ( #11000 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-09 13:47:24 -08:00
cbcbdb1ceb
[Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version ( #11028 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2024-12-09 13:21:06 -08:00
a811dd6608
[Model] merged input processor for Phi-3-Vision models ( #10977 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-09 12:55:10 -08:00
ca871491ed
[Misc][LoRA] Abstract PunicaWrapper ( #10955 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-09 12:54:44 -08:00
3b61cb450d
[V1] Further reduce CPU overheads in flash-attn ( #10989 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-09 12:38:46 -08:00
edc4fa3188
[ci/build] Recompile CI dependencies list with Python 3.12 ( #11013 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-12-09 11:46:58 -08:00
25b79d9fd3
[V1] Input Batch Relocation ( #10962 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-09 09:33:41 -08:00
aea2fc38c3
[Platform] Move async output check to platform ( #10768 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-09 17:24:46 +00:00
e691b26f6f
[Core] Require xgrammar >= 0.1.6 ( #11021 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-09 16:44:27 +00:00
c690357928
[V1] Fix Detokenizer loading in AsyncLLM ( #10997 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-09 16:27:10 +00:00
d1c2e15eb3
[torch.compile] add dynamo time tracking ( #11005 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 23:09:04 -08:00
af7c4a92e6
[Doc][V1] Add V1 support column for multimodal models ( #10998 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-08 22:29:16 -08:00
46004e83a2
[misc] clean up and unify logging ( #10999 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 17:28:27 -08:00
43b05fa314
[torch.compile][misc] fix comments ( #10993 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 11:18:18 -08:00
a11f326528
[V1] Initial support of multimodal models for V1 re-arch ( #10699 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-08 12:50:51 +00:00
fd57d2b534
[torch.compile] allow candidate compile sizes ( #10984 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 11:05:21 +00:00
7be15d9356
[core][misc] remove use_dummy driver for _run_workers ( #10920 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-07 12:06:08 -08:00
1b62745b1d
[core][executor] simplify instance id ( #10976 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-07 09:33:45 -08:00
78029b34ed
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None ( #10928 )
...
Signed-off-by: xffxff <1247714429@qq.com >
2024-12-08 01:21:18 +08:00
c889d5888b
[Doc] Explicitly state that PP isn't compatible with speculative decoding yet ( #10975 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 17:20:49 +00:00
39e227c7ae
[Model] Update multi-modal processor to support Mantis(LLaVA) model ( #10711 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 17:10:05 +00:00
1c768fe537
[Doc] Explicitly state that InternVL 2.5 is supported ( #10978 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 16:58:02 +00:00
bf0e382e16
[Model] Composite weight loading for multimodal Qwen2 ( #10944 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 07:22:52 -07:00
b26b4cd03c
[Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora implementation ( #10958 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-07 18:33:49 +08:00
f13cf9ad50
[Build] Fix for the Wswitch-bool clang warning ( #10060 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-07 09:03:44 +00:00
955fa9533a
[3/N] Support and implement merged input processor for LLaVA model ( #10676 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-07 00:50:58 -08:00
acf092d348
[Bugfix] Fix test-pipeline.yaml ( #10973 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-07 12:08:54 +08:00
69d357ba12
[Core] Cleanup startup logging a bit ( #10961 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-07 02:30:23 +00:00
dcdc3fafe5
[ci] fix broken tests ( #10956 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-06 11:25:47 -08:00
c05cfb67da
[misc] fix typo ( #10960 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-06 11:25:20 -08:00
7406274041
[Doc] add KubeAI to serving integrations ( #10837 )
...
Signed-off-by: Sam Stoelinga <sammiestoel@gmail.com >
2024-12-06 17:03:56 +00:00
8b59631855
[Core] Support Lark grammars for XGrammar ( #10870 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-06 08:34:29 -07:00
a1887f2c96
[torch.compile] fix deprecated code ( #10948 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-06 11:01:23 +00:00
222f5b082a
[CI/Build] Fix broken multimodal test ( #10950 )
2024-12-06 10:41:23 +00:00
b031a455a9
[torch.compile] add logging for compilation time ( #10941 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-06 10:07:15 +00:00
db87eb6c67
[torch.compile] use size tuning for specific sizes ( #10933 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-05 20:30:41 -08:00
9743d64e4e
[ci][build] add tests for python only compilation ( #10915 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-05 08:54:47 -08:00
a43065272f
[Misc][Gaudi] Avoid torch.compile and enable lazy collectives ( #10897 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2024-12-05 08:47:46 -08:00
998eeafe58
[CI/Build] Bump test transformers version ( #10106 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-05 16:05:52 +00:00
571da8fc43
[Misc][LoRA] Clean up the function interface of Punica ( #10917 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-05 13:22:28 +00:00
39c89e71a8
[Misc] Update llama 3.2 template to support system prompt with images ( #10901 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-12-05 05:54:06 +00:00
1f958a7d52
[Bugfix] Fix BNB loader target_modules ( #10720 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-05 13:20:26 +08:00
aa39a8e175
[Doc] Create a new "Usage" section ( #10827 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-05 11:19:35 +08:00
8d370e91cb
[Bugfix] Fallback to outlines for complex json schemas ( #10899 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-05 11:14:06 +08:00
7883c2bbe7
[benchmark] Make H100 benchmark optional ( #10908 )
2024-12-04 17:02:17 -08:00
2a56e1264f
[V1] Fix when max_model_len is not divisible by block_size ( #10903 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-04 16:54:05 -08:00
e4c34c23de
[CI/Build] improve python-only dev setup ( #9621 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-12-04 21:48:13 +00:00
82eb5ea8f3
Benchmark serving structured output ( #10880 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-12-04 16:28:21 -05:00
10398b4706
[Model] Consolidate ViTs attention implementation without mask ( #10893 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-04 18:11:08 +00:00
01d079fd8e
[LoRA] Change lora_tokenizers capacity ( #10796 )
...
Signed-off-by: Xin Yang <xyang19@gmail.com >
2024-12-04 17:40:16 +00:00
c92acb9693
[ci/build] Update vLLM postmerge ECR repo ( #10887 )
2024-12-04 09:01:20 +00:00
8db957ee3a
[bugfix] fixed parameter “n” when setting parameter “best_of” > 1 ( #10854 )
...
Signed-off-by: jianzheng <57654625+o2363286@users.noreply.github.com >
2024-12-04 08:48:22 +00:00
c9ca4fce3f
[ci/build] Job to build and push release image ( #10877 )
2024-12-04 15:02:40 +08:00
fa2dea61df
[ci/build] Change queue name for Release jobs ( #10875 )
2024-12-04 15:02:16 +08:00
b5b647b084
Drop ROCm load format check ( #10767 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-04 04:32:21 +00:00
d2bd88b122
[CI/Build] Replace mean with torch.all in test_pynccl.py ( #10876 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-04 03:23:21 +00:00
381ac93bb5
[Benchmark] Benchmark structured output with datasets ( #10557 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
2024-12-03 17:21:06 -07:00
a061fe601e
[Build][Bugfix] Using the correct type hint ( #10866 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-03 15:47:55 -05:00
7c32b6861e
[Frontend] correctly record prefill and decode time metrics ( #10853 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com >
2024-12-03 19:13:31 +00:00
7090c27bb2
[Bugfix] Only require XGrammar on x86 ( #10865 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-03 10:32:21 -08:00
2f2cdc745a
[MISC][XPU] quick fix for XPU CI ( #10859 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2024-12-03 17:16:31 +00:00
3bc94cab69
[V1] VLM - Run the mm_mapper preprocessor in the frontend process ( #10640 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-03 10:33:10 +00:00
f6084f6324
[Speculative Decoding] Move indices to device before filtering output ( #10850 )
...
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com >
2024-12-03 17:01:39 +08:00
9323a3153b
[Core][Performance] Add XGrammar support for guided decoding and set it as default ( #10785 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-12-03 15:17:00 +08:00
3257d449fa
[Misc] Remove deprecated names ( #10817 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-03 06:52:57 +00:00
ef51831ee8
[Doc] Add github links for source code references ( #10672 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-03 06:46:07 +00:00
dc5ce861bf
[torch.compile] remove compilation_context and simplify code ( #10838 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-03 06:19:02 +00:00
21fe7b481a
[core][distributed] add pynccl broadcast ( #10843 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-03 04:53:23 +00:00
a4cf256159
[Bugfix] Fix QKVParallelLinearWithShardedLora bias bug ( #10844 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-03 12:10:29 +08:00
d746268e92
[Model] support bitsandbytes quantization with minicpm model ( #10842 )
...
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com >
2024-12-03 03:06:41 +00:00
4433195ab7
[Bugfix] Prevent benchmark_throughput.py from using duplicated random prompts ( #10753 )
2024-12-03 02:26:15 +00:00
4c05edb33a
[Model] Add TP and BNB quantization support to LlavaMultiModalProjector ( #10834 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-02 23:06:09 +00:00
9b14d978aa
Fix openvino on GPU ( #10793 )
2024-12-02 18:52:19 +00:00
519cc6ca12
[Misc][XPU] Avoid torch compile for XPU platform ( #10747 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-12-02 17:53:55 +00:00
b45f0d7946
[Misc][LoRA] Move the implementation of lora bias to punica.py ( #10829 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-02 17:53:36 +00:00
a4c4daf364
[misc] use out argument for flash attention ( #10822 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-02 10:50:10 +00:00
e95f275f57
[CI/Build] Update mistral_common version for tests and docs ( #10825 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-02 10:26:10 +00:00
ef31eabc68
[Model]: add some tests for aria model ( #10770 )
...
Signed-off-by: xffxff <1247714429@qq.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-12-02 05:36:36 +00:00
995a148575
[doc]Update config docstring ( #10732 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-02 04:14:45 +00:00
63a164172d
[misc] remove xverse modeling file ( #10814 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-02 03:27:13 +00:00
e25810ae29
Fill TorchSDPAAttentionMetadata seq_lens_field for prefill ( #10799 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-12-02 10:05:32 +08:00
073a4bd1c0
[Kernel] Use out arg in flash_attn_varlen_func ( #10811 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-01 17:55:39 -08:00
b7954776fd
[core] Avoid metrics log noise when idle - include speculative decodi… ( #10809 )
2024-12-02 01:49:48 +00:00
b18c9bbaba
[Model] Add BNB support to Llava and Pixtral-HF ( #10795 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-02 01:31:09 +00:00
0590ec3fd9
[Core] Implement disagg prefill by StatelessProcessGroup ( #10502 )
...
This PR provides initial support for single-node disaggregated prefill in the 1P1D scenario.
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
Co-authored-by: ApostaC <yihua98@uchicago.edu >
Co-authored-by: YaoJiayi <120040070@link.cuhk.edu.cn >
2024-12-01 19:01:00 -06:00
c11f172187
[Misc] Adding MMMU-Pro vision dataset to serving benchmark ( #10804 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-12-01 08:47:05 +00:00
169a0ff911
[doc] add warning about comparing hf and vllm outputs ( #10805 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-01 00:41:38 -08:00
d2f058e76c
[Misc] Rename embedding classes to pooling ( #10801 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-01 14:36:51 +08:00
f877a7d12a
[Misc] Improve type annotations for support_torch_compile ( #10763 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-30 17:48:35 -08:00
133707123e
[Model] Replace embedding models with pooling adapter ( #10769 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-01 08:02:54 +08:00
7e4bbda573
[doc] format fix ( #10789 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-11-30 11:38:40 +00:00
e7cfc4ef4c
[Interleaved ATTN] Support for Mistral-8B ( #10591 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-30 07:45:50 +00:00
16ee07f22a
[Model] Refactor Molmo weights loading to use AutoWeightsLoader ( #10771 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-30 04:19:14 +00:00
40bc242579
[Bugfix] Fix OpenVino/Neuron driver_worker init ( #10779 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-11-30 12:07:13 +08:00
661175bc82
[platform] Add verify_quantization in platform. ( #10757 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-11-29 15:22:21 +00:00
3132aac043
[Bugfix] Fix Idefics3 bug ( #10778 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-29 13:56:46 +00:00
c82b432d4a
[Misc] typo fix in sampling_metadata.py ( #10740 )
2024-11-29 05:17:57 +00:00
fa6ecb9aa7
[Model] Clean up MiniCPMV ( #10751 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-29 04:47:06 +00:00
c83919c7a6
[Model] Add Internlm2 LoRA support ( #5064 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-28 17:29:04 +00:00
98f47f2a40
[V1] Optimize the CPU overheads in FlashAttention custom op ( #10733 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 09:01:02 -08:00
8c1e77fb58
[Kernel] Update vllm-flash-attn version to reduce CPU overheads ( #10742 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 08:31:28 -08:00
5fc5ce0fe4
[Model] Added GLM-4 series HF-format model support for vllm==0.6.4 ( #10561 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-11-28 14:53:31 +00:00
3ed5e73146
[TPU] Update requirements-tpu ( #10726 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
2024-11-28 02:30:48 -08:00
9a8bff0285
[Kernel] Update vllm-flash-attn version ( #10736 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 02:25:59 -08:00
a79b122400
[V1] Do not allocate beyond the max_model_len ( #10730 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 00:13:15 -08:00
d9b4b3f069
[Bug][CLI] Allow users to disable prefix caching explicitly ( #10724 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-27 23:59:28 -08:00
278be671a3
[Doc] Update model in arch_overview.rst to match comment ( #10701 )
...
Signed-off-by: spacewander <spacewanderlzx@gmail.com >
2024-11-27 23:58:39 -08:00
70dc14fbd0
[Model] support bitsandbytes quantization with minicpm3 model ( #10682 )
...
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com >
2024-11-27 23:58:02 -08:00
cb4e1c3f3a
[misc] upgrade filelock version ( #10731 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-27 19:54:58 -08:00
395b1c7454
[Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server ( #10635 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com >
2024-11-27 13:21:10 -08:00
9b4b150395
[Bugfix] Ignore lm_head when loading embedding models ( #10719 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-27 19:05:29 +00:00
197b4484a3
[Bugfix][Mamba] Fix Multistep on Mamba-like models ( #10705 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-11-27 19:02:27 +00:00
b98c62ba49
[Bugfix] Fix GGUF inference with FP16 unquantized checkpoint ( #10675 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-27 10:43:17 -08:00
c411def234
[torch.compile] fix shape specialization ( #10722 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-27 10:16:10 -08:00
308cc5e21e
[ci] fix slow tests ( #10698 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-27 09:26:14 -08:00
9e0a147d50
[V1] Update interface for mistral-format Pixtral ( #10703 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-27 12:26:27 +00:00
418cb3b93f
[Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault ( #10700 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-27 11:55:38 +00:00
1209261e93
[Model] Support telechat2 ( #10311 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: xiangw2 <xiangw2@chinatelecom.cn >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-11-27 11:32:35 +00:00
e2251109c7
[Kernel] Remove if-else with identical branches in marlin 2:4 ( #10687 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-26 22:55:32 -08:00
15cc2a9f1a
[Misc]Further reduce BNB static variable ( #10597 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-26 22:54:12 -08:00
e85250b1d1
[Hardware][Gaudi]add get_name method for HPUAttentionBackend ( #10667 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2024-11-26 22:49:40 -08:00
cfb3bf25fb
[bugfix] fix the default value of llm_int8_threshold in BitsAndBytesConfig ( #10657 )
2024-11-27 13:55:23 +08:00
1bf905ddaa
[Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. ( #10198 )
...
Signed-off-by: jeongin601 <0200angela@gmail.com >
Signed-off-by: jeong_in.bae <jeong_in.bae@navercorp.com >
2024-11-27 05:07:30 +00:00
0a4d968500
[V1] Update interface for idefics3 ( #10680 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-27 10:04:01 +08:00
0a71900bc9
Remove hard-dependencies of Speculative decode to CUDA workers ( #10587 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2024-11-26 17:57:11 -08:00
2f0a0a17a4
[V1] Refactor model executable interface for multimodal models ( #10570 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-26 20:46:11 +00:00
7576cd38df
[Bugfix] Check bnb_4bit_quant_storage for bitsandbytes ( #10642 )
2024-11-26 12:29:00 -08:00
9a99273b48
[Bugfix] Fix using -O[0,3] with LLM entrypoint ( #10677 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-26 10:44:01 -08:00
f5792c7c4a
[Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson ( #9735 )
...
Signed-off-by: Conroy Cheers <conroy@corncheese.org >
2024-11-26 10:26:28 -08:00
db66e018ea
[Bugfix] Fix for Spec model TP + Chunked Prefill ( #10232 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
Signed-off-by: Sourashis Roy <sroy@roblox.com >
Co-authored-by: Sourashis Roy <sroy@roblox.com >
2024-11-26 09:11:16 -08:00
1f6584ee85
[V1] Enable profile for LLMEngine ( #10665 )
2024-11-26 10:36:45 +00:00
334d64d1e8
[ci] add vllm_test_utils ( #10659 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-26 00:20:04 -08:00
940635343a
[Misc] Remove outdated init protocols ( #10655 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-26 14:55:00 +08:00
9a88f89799
custom allreduce + torch.compile ( #10121 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-25 22:00:16 -08:00
519e8e4182
[v1] EngineArgs for better config handling for v1 ( #10382 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-25 21:09:43 -08:00
a6760f6456
[Feature] vLLM ARM Enablement for AARCH64 CPUs ( #9228 )
...
Signed-off-by: Sanket Kale <sanketk.kale@fujitsu.com >
Co-authored-by: Sanket Kale <sanketk.kale@fujitsu.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-11-25 18:32:39 -08:00
45ac4ff270
[bugfix] fix aria model and add torch.compile ( #10645 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 18:32:09 -08:00
6e9ff050c8
[misc] do not read HOST_IP ( #10644 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 17:04:50 -08:00
9db713a1dc
[Model] Add OLMo November 2024 model ( #10503 )
2024-11-25 17:26:40 -05:00
1b583cfefa
[Doc] Fix typos in docs ( #10636 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 10:15:45 -08:00
cf73f0c95e
[Model] Enable optional prefix when loading embedding models ( #10639 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 18:14:33 +00:00
b1d920531f
[Model]: Add support for Aria model ( #10514 )
...
Signed-off-by: xffxff <1247714429@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-11-25 18:10:55 +00:00
452a4e80c3
[Docs] Add Snowflake Slides ( #10641 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-25 09:34:46 -08:00
c27df94e1f
[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices ( #9850 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-11-25 12:23:32 -05:00
d04b13a380
[Bug]: Authorization ignored when root_path is set ( #10606 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-25 16:21:41 +00:00
2b0879bfc2
Super tiny little typo fix ( #10633 )
2024-11-25 13:08:30 +00:00
ed46f14321
[Model] Support is_causal HF config field for Qwen2 model ( #10621 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 09:51:20 +00:00
05d1f8c9c6
[misc] move functions to config.py ( #10624 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 09:27:30 +00:00
25d806e953
[misc] add torch.compile compatibility check ( #10618 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-24 23:40:08 -08:00
65813781a2
[torch.compile] add warning for unsupported models ( #10622 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-24 23:27:51 -08:00
7c2134beda
[torch.compile] force inductor threads ( #10620 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-24 23:04:21 -08:00
a30a605d21
[Doc] Add encoder-based models to Supported Models page ( #10616 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 06:34:07 +00:00
571841b7fc
[torch.compile] support encoder based models ( #10613 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 05:24:33 +00:00
7ea3cd7c3e
[Refactor][MISC] del redundant code in ParallelConfig.postinit ( #10614 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-25 05:14:56 +00:00
214efc2c3c
Support Cross encoder models ( #10400 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Co-authored-by: Flavia Beo <flavia.beo@ibm.com >
2024-11-24 18:56:20 -08:00
49628fe13e
[Doc] Update README.md with Ray Summit talk links ( #10610 )
2024-11-24 16:45:09 -08:00
e4fbb14414
[doc] update the code to add models ( #10603 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-24 11:21:40 -08:00
c055747867
[model][utils] add extract_layer_index utility function ( #10599 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-23 22:22:54 -08:00
eda2b3589c
Revert "Print running script to enhance CI log readability" ( #10601 )
2024-11-23 21:31:47 -08:00
1c445dca51
[CI/Build] Print running script to enhance CI log readability ( #10594 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-24 03:57:13 +00:00
1700c543a5
[Bugfix] Fix LoRA weight sharding ( #10450 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-23 17:23:17 -08:00
17d8fc1806
[bugfix] Fix example/tensorize_vllm_model tests ( #10595 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-23 17:22:33 -08:00
04668ebe7a
[Bugfix] Avoid import AttentionMetadata explicitly in Mllama ( #10593 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-23 18:12:20 +00:00
651f6c31ac
For ppc64le, disabled tests for now and addressed space issues ( #10538 )
2024-11-23 09:33:53 +00:00
86a44fb896
[Platforms] Refactor openvino code ( #10573 )
...
Signed-off-by: statelesshz <hzji210@gmail.com >
2024-11-22 22:23:12 -08:00
4cfe5d2bca
[Bugfix] multi_modal_kwargs broadcast for CPU tensor parallel ( #10541 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-22 21:25:46 -08:00
c8acd80548
[2/N] handling placeholders in merged multi-modal processor ( #10485 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-22 21:25:09 -08:00
4634a89d18
Prefix Cache Aware Scheduling [1/n] ( #10128 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-22 21:15:55 -08:00
7c25fe45a6
[AMD] Add support for GGUF quantization on ROCm ( #10254 )
2024-11-22 21:14:49 -08:00
02a43f82a9
Update default max_num_batch_tokens for chunked prefill to 2048 ( #10544 )
2024-11-22 21:14:19 -08:00
cfea9c04ef
[Model] Fix Baichuan BNB online quantization ( #10572 )
...
Signed-off-by: Chen Wu <cntryroa@gmail.com >
2024-11-22 21:13:59 -08:00
7d8ffb344f
[Bugfix] Internal Server Error when tool_choice is incorrect. ( #10567 )
...
Signed-off-by: Varun Shenoy <varun.vinayak.shenoy@oracle.com >
2024-11-22 21:13:29 -08:00
4aba6e3d1a
[core] gemma2 full context length support ( #10584 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 20:13:54 -08:00
978b39744b
[Misc] Add pynccl wrappers for all_gather and reduce_scatter ( #9432 )
2024-11-22 22:14:03 -05:00
ebda51968b
[Core] Fix broken log configuration ( #10458 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-23 10:23:51 +08:00
9195dbdbca
[Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use ( #10164 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-11-23 10:17:38 +08:00
d559979c54
[bugfix] fix cpu tests ( #10585 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 17:34:03 -08:00
d345f409b7
[V1] EngineCore supports profiling ( #10564 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2024-11-22 17:16:15 -08:00
28598f3939
[Core] remove temporary local variables in LLMEngine.__init__ ( #10577 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-22 16:22:53 -08:00
948c859571
support bitsandbytes quantization with qwen model ( #10549 )
...
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com >
2024-11-22 16:16:14 -08:00
97814fbf0f
[v1] Refactor KVCacheManager for more hash input than token ids ( #10507 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-11-22 23:27:25 +00:00
eebad39f26
[torch.compile] support all attention backends ( #10558 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 14:04:42 -08:00
db100c5cde
[bugfix] fix full graph tests ( #10581 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 10:02:14 -08:00
11fcf0e066
Remove token-adding chat embedding params ( #10551 )
...
Signed-off-by: Noam Gat <noamgat@gmail.com >
2024-11-21 23:59:47 -08:00
b6374e09b0
[Bugfix] Fix Phi-3 BNB quantization with tensor parallel ( #9948 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-22 15:01:56 +08:00
a111d0151f
[platforms] absorb worker cls difference into platforms folder ( #10555 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2024-11-21 21:00:32 -08:00
446c7806b2
[Minor] Fix line-too-long ( #10563 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-21 19:40:40 -08:00
33e0a2540a
[9/N] torch.compile LLM usage ( #10552 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-21 19:13:31 -08:00
aed074860a
[Benchmark] Add new H100 machine ( #10547 )
2024-11-21 18:27:20 -08:00
9afa014552
Add small example to metrics.rst ( #10550 )
2024-11-21 23:43:43 +00:00
46fe9b46d8
[Minor] Revert change in offline inference example ( #10545 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-21 21:28:16 +00:00
cf656f5a02
[misc] improve error message ( #10553 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-21 13:13:17 -08:00
edec3385b6
[CI][Installation] Avoid uploading CUDA 11.8 wheel ( #10535 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-11-21 13:03:58 -08:00
f9310cbd0c
[V1] Fix Compilation config & Enable CUDA graph by default ( #10528 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-21 12:53:39 -08:00
7560ae5caf
[8/N] enable cli flag without a space ( #10529 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-21 12:30:42 -08:00
e7a8341c7c
[Bugfix] Allow token ID-only inputs in Qwen2-Audio ( #10536 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-21 18:09:43 +00:00
c51e397fe8
[Misc] Suppress duplicated logging regarding multimodal input pipeline ( #10530 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-21 09:21:31 -08:00
2385b60d83
[Kernel] Register punica ops directly ( #10522 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-21 09:18:11 -08:00
da7e702c6f
[Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored ( #10180 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-21 16:24:32 +00:00
4d676f0852
[Bugfix] Embedding model pooling_type equals ALL and multi input's bug ( #10494 )
2024-11-21 14:40:02 +00:00
d5ec121f95
[Model] Expose dynamic_image_size as mm_processor_kwargs for InternVL2 models ( #10518 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-21 14:20:08 +00:00
8a93a598d9
fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len ( #10524 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
2024-11-21 11:15:36 +00:00
1cfde82ffd
[Model] Add Support for Multimodal Granite Models ( #10291 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-21 10:46:20 +00:00
f0e0238016
[Doc] fix a small typo in docstring of llama_tool_parser ( #10513 )
2024-11-21 09:05:23 +00:00
aaddce5d26
[platforms] improve error message for unspecified platforms ( #10520 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 23:07:56 -08:00
3430857b64
[Misc] Increase default video fetch timeout ( #10495 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-20 23:06:42 -08:00
8b0fe06c89
[torch.compile] Inductor code caching fix ( #10273 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Signed-off-by: Luka Govedic <luka.govedic@gmail.com >
2024-11-20 21:44:57 -08:00
9d827170a3
[Platforms] Add device_type in Platform ( #10508 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-21 04:44:20 +00:00
6c1208d083
[Core] Add Sliding Window Support with Flashinfer ( #10462 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2024-11-20 19:56:47 -08:00
388ee3de66
[torch.compile] limit inductor threads and lazy import quant ( #10482 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 18:36:33 -08:00
2f77b6cfec
[TPU] Implement prefix caching for TPUs ( #10307 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-20 13:54:15 -08:00
c68f7ede6a
[Bugfix]: allow extra fields in requests to openai compatible server ( #10463 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-20 16:42:21 -05:00
0cd3d9717e
[7/N] torch.compile, reduce compilation time ( #10460 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 11:20:38 -08:00
5f1d6af2b6
[perf bench] H200 development ( #9768 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-20 11:06:56 -08:00
772a66732d
[platforms] restore xpu check for parallel config ( #10479 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 17:13:28 +00:00
63f1fde277
[Hardware][CPU] Support chunked-prefill and prefix-caching on CPU ( #10355 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-20 10:57:39 +00:00
d5b28447e0
[Platforms] Refactor xpu code ( #10468 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-19 22:52:13 -08:00
09dbf9ff16
[Bugfix] Handle conflicts between modern and legacy fields ( #10471 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-20 14:45:08 +08:00
343041c4c4
[model] Reduce medusa weight ( #10454 )
...
Signed-off-by: skylee-01 <497627264@qq.com >
2024-11-20 06:05:55 +00:00
ed701ca963
[ci/build] Combine nightly and optional ( #10465 )
2024-11-19 21:36:03 -08:00
7629a9c6e5
[CI/Build] Support compilation with local cutlass path ( #10423 ) ( #10424 )
2024-11-19 21:35:50 -08:00
709c9f1f25
[CI/Build] Add sphinx/rst linter for docs ( #10366 )
2024-11-19 21:35:31 -08:00
b4be5a8adb
[Bugfix] Enforce no chunked prefill for embedding models ( #10470 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-20 05:12:51 +00:00
ad44437ba3
[Bugfix] Fix Mamba model initialization and MLP Speculator weights loading ( #10456 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-20 05:04:05 +00:00
9e05252b46
[Misc] Add __setitem__ for LazyDict ( #10469 )
...
Signed-off-by: Yanyi Liu <wolfsonliu@163.com >
2024-11-20 04:44:57 +00:00
d200972e7f
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) ( #10464 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-11-19 19:40:33 -08:00
d5b68aba2f
[CI/Build] Update Dockerfile.rocm ( #10434 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2024-11-19 17:19:59 -08:00
a324d3a1a7
Change granite chat template to keep json list formatting for tool calls ( #10452 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
2024-11-19 18:16:54 -07:00
b00b33d77e
[Model][Quantization] HQQ support through Marlin kernel expansion ( #9766 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
2024-11-19 13:31:12 -08:00
efa9084628
[Core] Avoid metrics log noise when idle ( #8868 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-19 21:05:25 +00:00
803f37eaaa
[6/N] torch.compile rollout to users ( #10437 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-19 10:09:03 -08:00
fd9f124971
[Doc] fix link for page that was renamed ( #10455 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-19 09:48:30 -08:00
1ea291a417
Fix: Build error seen on Power Architecture ( #10421 )
...
Signed-off-by: Manjul Mohan <manjul.mohan@ibm.com >
Signed-off-by: B-201 <Joy25810@foxmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: ismael-dm <ismaeldm99@gmail.com >
Signed-off-by: Andrew Nesbitt <andrewnez@gmail.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: yan ma <yan.ma@intel.com >
Signed-off-by: Angus Wang <wangjadehao@gmail.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: rickyx <rickyx@anyscale.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Manjul Mohan <manjul.mohan@ibm.com >
Co-authored-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: ismael-dm <ismaeldm99@gmail.com >
Co-authored-by: Andrew Nesbitt <andrewnez@gmail.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Yan Ma <yan.ma@intel.com >
Co-authored-by: Angus Wang <wangjadehao@gmail.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Ricky Xu <rickyx@anyscale.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Mengqing Cao <cmq0113@163.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2024-11-19 09:34:57 -08:00
11fd7ea639
[Pixtral-Large] Pixtral actually has no bias in vision-lang adapter ( #10449 )
2024-11-19 17:33:06 +00:00
f028dff33d
[BugFix] Fix hermes tool parser output error stream arguments in some cases ( #10395 ) ( #10398 )
...
Signed-off-by: xiyuan lee <lixiyuan@haier.com >
2024-11-19 13:42:50 +00:00
b4614656b8
[CI][CPU] adding numa node number as container name suffix ( #10441 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2024-11-19 13:16:43 +00:00
25f9c78961
[misc][plugin] improve plugin loading ( #10443 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-19 10:43:21 +00:00
5390d6664f
[Doc] Add the start of an arch overview page ( #10368 )
2024-11-19 09:52:11 +00:00
382b6a4852
[Misc] Avoid misleading warning messages ( #10438 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-19 08:54:58 +00:00
272e31c0bd
[Bugfix] Guard for negative counter metrics to prevent crash ( #10430 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-11-19 04:57:10 +00:00
74f8c2cf5f
Add openai.beta.chat.completions.parse example to structured_outputs.rst ( #10433 )
2024-11-19 04:37:46 +00:00
8c1fb50705
[Platform][Refactor] Extract func get_default_attn_backend to Platform ( #10358 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2024-11-19 11:22:26 +08:00
7eb719df13
[Bugfix] Fix Phi-3 BNB online quantization ( #10417 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-19 03:21:42 +00:00
284203f171
[ci/build] Have dependabot ignore all patch update ( #10436 )
...
We have too many dependencies and all patch updates can be a little noisy. This is to have dependabot ignore all patch version updates.
2024-11-19 01:04:25 +00:00
90a6c759ca
[misc] partial prefix & random input generation benchmark ( #9929 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-18 15:39:14 -08:00
2298e69b5f
[ci][bugfix] fix kernel tests ( #10431 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-18 15:29:37 -08:00
a03ea40792
[3/N][torch.compile] consolidate custom op logging ( #10399 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-18 15:14:59 -08:00
96d999fbe8
[Kernel] Initial Machete W4A8 support + Refactors ( #9855 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-11-18 12:59:29 -07:00
c2170a5b39
[Kernel] Explicitly specify other value in tl.load calls ( #9014 )
...
Signed-off-by: Angus Wang <wangjadehao@gmail.com >
2024-11-18 11:39:40 -08:00
6b2d25efc7
[Hardware][XPU] AWQ/GPTQ support for xpu backend ( #10107 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2024-11-18 11:18:05 -07:00
281cc4b3cd
[Model][Bugfix] Support TP for PixtralHF ViT ( #10405 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-18 10:04:14 -08:00
4f686d139f
Fix open_collective value in FUNDING.yml ( #10426 )
...
Signed-off-by: Andrew Nesbitt <andrewnez@gmail.com >
2024-11-18 09:52:42 -08:00
31894a2155
[Doc] Add documentation for Structured Outputs ( #9943 )
...
Signed-off-by: ismael-dm <ismaeldm99@gmail.com >
2024-11-18 09:52:12 -08:00
7851b45196
[5/N][torch.compile] torch.jit.script --> torch.compile ( #10406 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-18 23:20:06 +08:00
4186be8111
[Doc] Update doc for LoRA support in GLM-4V ( #10425 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-11-18 15:08:30 +00:00
e7ebb662d7
[Model] Remove transformers attention porting in VITs ( #10414 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-18 21:45:21 +08:00
5be4e52b65
[Model][LoRA] LoRA support added for glm-4v ( #10418 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-11-18 12:57:10 +00:00
01aae1cc68
[Model] Remove redundant softmax when using PoolingType.STEP ( #10415 )
2024-11-18 10:05:36 +00:00
c7dec926f6
[VLM] Report multi_modal_placeholders in output ( #10407 )
...
Signed-off-by: Linkun Chen <lkchen+anyscale@github.com >
2024-11-18 16:06:16 +08:00
51bb12d17b
[4/N][torch.compile] clean up set_torch_compile_backend ( #10401 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-17 23:57:20 -08:00
47826cacf0
[Bugfix] Ignore ray reinit error when current platform is ROCm or XPU ( #10375 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2024-11-18 11:29:26 +08:00
c4e464333e
[Misc] Add uninitialized params tracking for AutoWeightsLoader ( #10327 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-18 09:07:46 +08:00
d1557e66d3
[Misc] Enhance offline_inference to support user-configurable paramet… ( #10392 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2024-11-17 11:32:40 +00:00
80d85c5d7b
[Bugfix] Fix mrope_position_delta in non-last prefill chunk ( #10403 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2024-11-17 08:50:24 +00:00
76aab90ab6
[Hardware] [HPU]add mark_step for hpu ( #10239 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2024-11-17 00:44:44 -08:00
8d74b5aee9
[platforms] refactor cpu code ( #10402 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-16 23:14:23 -08:00
cf349c4a97
[Bugfix][CPU] Fix CPU embedding runner with tensor parallel ( #10394 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-16 23:12:04 -08:00
905d0f0af4
[CI/Build] Fix IDC hpu [Device not found] issue ( #10384 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2024-11-17 14:58:22 +08:00
643ecf7b11
[V1] Refactor model executable interface for all text-only language models ( #10374 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-17 05:18:46 +00:00
4fd9375028
[2/N][torch.compile] make compilation cfg part of vllm cfg ( #10383 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-16 18:02:14 -08:00
661a34fd4f
[V1] Add code owners for V1 ( #10397 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-16 10:45:26 -08:00
361c29e174
[Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled ( #10388 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2024-11-17 02:10:00 +08:00
b98d89efd4
[Misc] Medusa supports custom bias ( #10361 )
2024-11-16 16:33:01 +00:00
8b6725b0cf
[Misc] Update benchmark to support image_url file or http ( #10287 )
...
Signed-off-by: rbbang <anjaehyun87@gmail.com >
2024-11-16 18:15:40 +08:00
1d75472626
[BugFix] [Kernel] Fix GPU SEGV occurring in fused_moe kernel ( #10385 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2024-11-16 09:55:05 +00:00
2f427c2d16
[misc][plugin] improve log messages ( #10386 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-16 01:23:20 -08:00
755b85359b
[doc] add doc for the plugin system ( #10372 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-15 21:46:27 -08:00
32e46e000f
[Frontend] Automatic detection of chat content format from AST ( #9919 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-16 13:35:40 +08:00
4f168f69a3
[Docs] Misc updates to TPU installation instructions ( #10165 )
2024-11-15 13:26:17 -08:00
3e8d14d8a1
[Doc] Move PR template content to docs ( #10159 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-15 13:20:20 -08:00
a067f85e08
[Frontend] Add --version flag to CLI ( #10369 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-15 13:13:53 -08:00
c76ac49d26
[Docs] Add Nebius as sponsors ( #10371 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-15 12:47:40 -08:00
a6221a144a
[Misc] bump mistral common version ( #10367 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-15 09:48:07 -08:00
79ee45b428
[Misc] Bump up test_fused_moe tolerance ( #10364 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
2024-11-15 16:31:18 +00:00
691a3ec047
[Bugfix] Ensure special tokens are properly filtered out for guided structured output with MistralTokenizer ( #10363 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-15 14:50:40 +00:00
3a763ba0c3
[core][misc] keep compatibility for old-style classes ( #10356 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-15 13:55:51 +00:00
f2056f726d
[Misc] Fix some help info of arg_utils to improve readability ( #10362 )
2024-11-15 12:40:30 +00:00
1d65ec7eeb
[Bugfix] Fix fully sharded LoRA bug ( #10352 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-15 10:34:58 +00:00
26908554b2
[Doc] Remove float32 choice from --lora-dtype ( #10348 )
...
Signed-off-by: Xin Yang <xyang19@gmail.com >
2024-11-15 10:22:57 +00:00
b311efd0bd
[Misc] Fix import error in tensorizer tests and cleanup some code ( #10349 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-15 09:34:17 +00:00
3d158cdc8d
Add default value to avoid Falcon crash ( #5363 ) ( #10347 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2024-11-15 08:52:20 +00:00
02dbf30e9a
[Build] skip renaming files for release wheels pipeline ( #9671 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-14 23:31:52 -08:00
2ac6d0e75b
[Misc] Consolidate pooler config overrides ( #10351 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-15 06:59:00 +00:00
2ec8827288
[Bugfix] Qwen-vl output is inconsistent in speculative decoding ( #10350 )
2024-11-15 05:40:10 +00:00
b40cf6402e
[Model] Support Qwen2 embeddings and use tags to select model tests ( #10184 )
2024-11-14 20:23:09 -08:00
2885ba0e24
[Misc] Change RedundantReshapesPass and FusionPass logging from info to debug ( #10308 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-15 02:44:26 +00:00
bf2ddc6610
[bugfix] Fix static asymmetric quantization case ( #10334 )
...
Signed-off-by: Daniël de Kok <me@danieldk.eu >
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: Daniël de Kok <me@danieldk.eu >
2024-11-15 09:35:11 +08:00
972112d82f
[Bugfix] Fix unable to load some models ( #10312 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-14 16:55:54 -08:00
11cd1ae6ad
[Tool parsing] Improve / correct mistral tool parsing ( #10333 )
2024-11-15 00:42:49 +00:00
554af9228d
[Bugfix] use AF_INET6 for OpenAI Compatible Server with ipv6 ( #9583 )
...
Signed-off-by: xiaozijin <xiaozijin@bytedance.com >
2024-11-14 16:38:53 -08:00
b2e0ad3b59
[Perf] Reduce peak memory usage of llama ( #10339 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
2024-11-15 00:38:20 +00:00
4a18fd14ba
Support Roberta embedding models ( #9387 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Co-authored-by: Flavia Beo <flavia.beo@ibm.com >
2024-11-14 21:23:29 +00:00
1dbae0329c
[Docs] Publish meetup slides ( #10331 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-14 16:19:38 +00:00
675d603400
[CI/Build] Make shellcheck happy ( #10285 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-14 09:47:53 +00:00
03025c023f
[CI/Build] Fix CPU CI online inference timeout ( #10314 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-14 16:45:32 +08:00
29f3ef26a3
[ci][distributed] disable hanging tests ( #10317 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-14 00:23:39 -08:00
294bf467ba
[Model] Add BNB quantization support for Idefics3 ( #10310 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-14 06:31:44 +00:00
52b48c1ead
[BugFix]: properly deserialize tool_calls iterator before processing by mistral-common when MistralTokenizer is used ( #9951 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-14 04:48:16 +00:00
f67ce05d0b
[Frontend] Pythonic tool parser ( #9859 )
...
Signed-off-by: Mike Depinet <mike@fixie.ai >
2024-11-14 04:14:34 +00:00
e0853b6508
[Misc] format.sh: Simplify tool_version_check ( #10305 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-14 11:12:35 +08:00
504ac53d18
[misc] error early for old-style class ( #10304 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-13 18:55:39 -08:00
15bb8330aa
[Bugfix] Fix tensor parallel for qwen2 classification model ( #10297 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-14 10:54:59 +08:00
ac49b59d8b
[Bugfix] bitsandbytes models fail to run pipeline parallel ( #10200 )
...
Signed-off-by: Hoang Cong Duc <hoangcongducltt@gmail.com >
2024-11-13 09:56:39 -07:00
0b8bb86bf1
[1/N] Initial prototype for multi-modal processor ( #10044 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-13 12:39:03 +00:00
bb7991aa29
[V1] Add missing tokenizer options for Detokenizer ( #10288 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-13 11:02:56 +00:00
d909acf9fe
[Model][LoRA]LoRA support added for idefics3 ( #10281 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-11-13 17:25:59 +08:00
b6dde33019
[Core] Flashinfer - Remove advance step size restriction ( #10282 )
2024-11-13 16:29:32 +08:00
1b886aa104
[Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 ( #9944 )
...
Signed-off-by: FurtherAI <austin.veselka@lighton.ai >
Co-authored-by: FurtherAI <austin.veselka@lighton.ai >
2024-11-13 08:28:13 +00:00
3945c82346
[Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions ( #10221 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2024-11-13 07:07:22 +00:00
032fcf16ae
[Doc] Fix typo in arg_utils.py ( #10264 )
...
Signed-off-by: Xin Yang <xyang19@gmail.com >
2024-11-12 21:54:52 -08:00
56a955e774
Bump to compressed-tensors v0.8.0 ( #10279 )
...
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2024-11-12 21:54:10 -08:00
bbd3e86926
[V1] Support VLMs with fine-grained scheduling ( #9871 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-11-13 04:53:13 +00:00
0d4ea3fb5c
[core][distributed] use tcp store directly ( #10275 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 17:36:08 -08:00
112fa0bbe5
[V1] Fix CI tests on V1 engine ( #10272 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 16:17:20 -08:00
377b74fe87
Revert "[ci][build] limit cmake version" ( #10271 )
2024-11-12 15:06:48 -08:00
18081451f9
[doc] improve debugging doc ( #10270 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 14:43:52 -08:00
96ae0eaeb2
[doc] fix location of runllm widget ( #10266 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 14:34:39 -08:00
1f55e05713
[V1] Enable Inductor when using piecewise CUDA graphs ( #10268 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 13:39:56 -08:00
8a06428c70
[LoRA] Adds support for bias in LoRA ( #5733 )
...
Signed-off-by: Umesh Deshpande <udeshpa@us.ibm.com >
Co-authored-by: Umesh Deshpande <udeshpa@us.ibm.com >
2024-11-12 11:08:40 -08:00
b41fb9d3b1
[Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers ( #9982 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
2024-11-12 10:53:57 -08:00
7c65527918
[V1] Use pickle for serializing EngineCoreRequest & Add multimodal inputs to EngineCoreRequest ( #10245 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 08:57:14 -08:00
47db6ec831
[Frontend] Add per-request number of cached token stats ( #10174 )
2024-11-12 16:42:28 +00:00
176fcb1c71
[Bugfix] Fix QwenModel argument ( #10262 )
...
Signed-off-by: Jie Fu <jiefu@tencent.com >
2024-11-12 16:36:51 +00:00
a838ba7254
[Misc] Fix Idefics3Model argument ( #10255 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-12 13:07:11 +00:00
36c513a076
[BugFix] Do not raise a ValueError when tool_choice is set to the supported none option and tools are not defined. ( #10000 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-12 11:13:46 +00:00
d201d41973
[CI][CPU]refactor CPU tests to allow to bind with different cores ( #10222 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2024-11-12 10:07:32 +00:00
3a28f18b0b
[doc] explain the class hierarchy in vLLM ( #10240 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 22:56:44 -08:00
812c981fa0
Splitting attention kernel file ( #10091 )
...
Signed-off-by: maleksan85 <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2024-11-11 22:55:07 -08:00
7f5edb5900
[Misc][LoRA] Replace hardcoded cuda device with configurable argument ( #10223 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-12 11:10:15 +08:00
eea55cca5b
[1/N] torch.compile user interface design ( #10237 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 18:01:06 -08:00
9cdba9669c
[Doc] Update help text for --distributed-executor-backend ( #10231 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-12 09:55:09 +08:00
d1c6799b88
[doc] update debugging guide ( #10236 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 15:21:12 -08:00
6ace6fba2c
[V1] AsyncLLM Implementation ( #9826 )
...
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-11 23:05:38 +00:00
08f93e7439
Make shutil rename in python_only_dev ( #10233 )
...
Signed-off-by: shcheglovnd <shcheglovnd@avride.ai >
2024-11-11 14:29:19 -08:00
9d5b4e4dea
[V1] Enable custom ops with piecewise CUDA graphs ( #10228 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:58:07 -08:00
8a7fe47d32
[misc][distributed] auto port selection and disable tests ( #10226 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 11:54:59 -08:00
4800339c62
Add docs on serving with Llama Stack ( #10183 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2024-11-11 11:28:55 -08:00
fe15729a2b
[V1] Use custom ops for piecewise CUDA graphs ( #10227 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:26:48 -08:00
330e82d34a
[v1][torch.compile] support managing cudagraph buffer ( #10203 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:10:27 -08:00
d7a4f2207b
[V1] Do not use inductor for piecewise CUDA graphs ( #10225 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:05:57 -08:00
f9dadfbee3
[V1] Fix detokenizer ports ( #10224 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 10:42:07 -08:00
25144ceed0
Bump actions/setup-python from 5.2.0 to 5.3.0 ( #10209 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 17:24:10 +00:00
e6de9784d2
[core][distributed] add stateless process group ( #10216 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 09:02:14 -08:00
36fc439de0
[Doc] fix doc string typo in block_manager swap_out function ( #10212 )
2024-11-11 08:53:07 -08:00
874f551b36
[Metrics] add more metrics ( #4464 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-12 00:17:38 +08:00
2cebda42bb
[Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner ( #10218 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-11 12:37:58 +00:00
5fb1f935b0
[V1] Allow tokenizer_mode and trust_remote_code for Detokenizer ( #10211 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-11 18:01:18 +08:00
36e4acd02a
[LoRA][Kernel] Remove the unused libentry module ( #10214 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-11 09:43:23 +00:00
58170d6503
[Hardware][CPU] Add embedding models support for CPU backend ( #10193 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-11 08:54:28 +00:00
9804ac7c7c
Bump the patch-update group with 5 updates ( #10210 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 07:22:40 +00:00
f89d18ff74
[6/N] pass whole config to inner model ( #10205 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 06:41:46 +00:00
f0f2e5638e
[doc] improve debugging code ( #10206 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-10 17:49:40 -08:00
ad9a78bf64
[Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py ( #10196 )
2024-11-11 00:14:22 +00:00
73b9083e99
[misc] improve cloudpickle registration and tests ( #10202 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 00:10:53 +00:00
20cf2f553c
[Misc] small fixes to function tracing file path ( #9543 )
...
Signed-off-by: Shawn Du <shawnd200@outlook.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-10 15:21:06 -08:00
bfb7d61a7c
[doc] Polish the integration with huggingface doc ( #10195 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-10 10:22:04 -08:00
19682023b6
[Doc] Fix typo error in CONTRIBUTING.md ( #10190 )
...
Signed-off-by: FuryMartin <furymartin9910@outlook.com >
2024-11-10 07:47:24 +00:00
9fa4bdde9d
[ci][build] limit cmake version ( #10188 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 16:27:26 -08:00
51c2e1fcef
[CI/Build] Split up models tests ( #10069 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 11:39:14 -08:00
b09895a618
[Frontend][Core] Override HF config.json via CLI ( #5836 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 16:19:27 +00:00
d88bff1b96
[Frontend] add add_request_id middleware ( #9594 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2024-11-09 10:18:29 +00:00
9e37266420
bugfix: fix the bug that stream generation does not work ( #2756 )
2024-11-09 10:09:48 +00:00
8a4358ecb5
[doc] explaining the integration with huggingface ( #10173 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 01:02:54 -08:00
bd46357ad9
[bugfix] fix broken tests of mlp speculator ( #10177 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 00:04:50 -08:00
f192aeba74
[Bugfix] Enable some fp8 and quantized fullgraph tests ( #10171 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-09 08:01:27 +00:00
8e1529dc57
[CI/Build] Add run-hpu-test.sh script ( #10167 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2024-11-09 06:26:52 +00:00
1a95f10ee7
[5/N] pass the whole config to model ( #9983 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 14:17:28 +08:00
49d2a41a86
[Doc] Adjust RunLLM location ( #10176 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 20:07:10 -08:00
47672f38b5
[CI/Build] Fix VLM broadcast tests tensor_parallel_size passing ( #10161 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-09 04:02:59 +00:00
f83feccd7f
[Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module ( #10169 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-09 03:36:46 +00:00
e0191a95d8
[0/N] Rename MultiModalInputs to MultiModalKwargs ( #10040 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 11:31:02 +08:00
d7edca1dee
[CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking ( #6892 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 03:27:11 +00:00
127c07480e
[Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case ( #9857 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2024-11-08 19:59:22 -05:00
10b67d865d
[Bugfix] SymIntArrayRef expected to contain concrete integers ( #10170 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-08 14:44:18 -08:00
4f93dfe952
[torch.compile] Fuse RMSNorm with quant ( #9138 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-11-08 21:20:08 +00:00
e1b5a82179
Rename vllm.logging to vllm.logging_utils ( #10134 )
2024-11-08 20:53:24 +00:00
87713c6053
[CI/Build] Ignore .gitignored files for shellcheck ( #10162 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2024-11-08 19:53:36 +00:00
b5815c8413
[V1] Fix non-cudagraph op name ( #10166 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-08 10:23:04 -08:00
6b30471586
[Misc] Improve Web UI ( #10090 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-11-08 09:51:04 -08:00
f6778620a9
Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 ( #10136 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
2024-11-08 15:56:18 +00:00
0535e5fe6c
Fix edge case Mistral tokenizer ( #10152 )
2024-11-08 15:42:27 +00:00
b489fc3c91
[CI/Build] Update CPU tests to include all "standard" tests ( #5481 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 23:30:04 +08:00
208ce622c7
[V1] Enable APC by default only for text models ( #10148 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-08 14:39:41 +00:00
1ff4aed5bd
[Model] Expose size to Idefics3 as mm_processor_kwargs ( #10146 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-08 09:56:58 +00:00
f10797c0ce
[Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator ( #10144 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2024-11-08 09:41:03 +00:00
f4c2187e29
[Misc] Fix typo in #5895 ( #10145 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 09:07:01 +00:00
aea6ad629f
Add hf_transfer to testing image ( #10096 )
2024-11-08 08:35:25 +00:00
da07a9ead7
Fixes a typo about 'max_decode_seq_len' which causes crashes with cuda graph. ( #9285 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2024-11-08 05:31:28 +00:00
3a7f15a398
[Doc] Move CONTRIBUTING to docs site ( #9924 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-08 05:15:12 +00:00
7371749d54
[Misc] Fix ImportError causing by triton ( #9493 )
2024-11-08 05:08:51 +00:00
ad39bd640c
[Bugfix] Add error handling when server cannot respond any valid tokens ( #5895 )
2024-11-08 04:58:37 +00:00
40d0e7411d
[Doc] Update FAQ links in spec_decode.rst ( #9662 )
...
Signed-off-by: whyiug <whyiug@hotmail.com >
2024-11-08 04:44:58 +00:00
6bb52b0f97
[CI/Build] Give PR cleanup job PR write access ( #10139 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-08 12:10:20 +08:00
201fc07730
[V1] Prefix caching (take 2) ( #9972 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-11-07 17:34:44 -08:00
42b4f46b71
[V1] Add all_token_ids attribute to Request ( #10135 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-07 17:08:24 -08:00
073a472728
[Misc] report relevant env vars in collect_env.py tool ( #9293 )
2024-11-07 16:14:01 -08:00
93bff421bc
Bump actions/checkout from 4.2.1 to 4.2.2 ( #9746 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 21:44:58 +00:00
28b2877d30
Online video support for VLMs ( #10020 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-07 20:25:59 +00:00
97b8475beb
Bump actions/setup-python from 5.2.0 to 5.3.0 ( #9745 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 18:55:35 +00:00
a2f1f3b089
[CI/Build] Automate PR body text cleanup ( #10082 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 18:26:28 +00:00
3be5b26a76
[CI/Build] Add shell script linting using shellcheck ( #7925 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 18:17:29 +00:00
de0e61a323
[CI/Build] Always run mypy ( #10122 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 16:43:16 +00:00
9d43afcc53
[Feature] [Spec decode]: Combine chunked prefill with speculative decoding ( #9291 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2024-11-07 08:15:14 -08:00
ae62fd17c0
[Frontend] Tool calling parser for Granite 3.0 models ( #9027 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-11-07 07:09:02 -08:00
a62bc0109c
[Misc] Add Gamma-Distribution Request Generation Support for Serving Benchmark. ( #10105 )
...
Signed-off-by: Mozhou <spli161006@gmail.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-11-07 11:20:30 +00:00
999df95b4e
[Bugfix] Make image processor respect mm_processor_kwargs for Qwen2-VL ( #10112 )
...
Signed-off-by: Jiahao Li <liplus17@163.com >
2024-11-07 10:50:44 +00:00
a6f332d0d9
[Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target ( #10108 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-07 18:42:50 +08:00
0dfba97b42
[Frontend] Fix multiple values for keyword argument error ( #10075 ) ( #10076 )
...
Signed-off-by: Lei <ylxx@live.com >
2024-11-07 09:07:19 +00:00
aa9078fa03
Adds method to read the pooling types from model's files ( #9506 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2024-11-07 08:42:40 +00:00
e036e527a0
[CI/Build] Improve mypy + python version matrix ( #10041 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 07:54:16 +00:00
6192e9b8fe
[Core][Distributed] Refactor ipc buffer init in CustomAllreduce ( #10030 )
...
Signed-off-by: Hanzhi Zhou <hanzhi713@gmail.com >
2024-11-06 23:50:47 -08:00
d7263a1bb8
Doc: Improve benchmark documentation ( #9927 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-11-06 23:50:35 -08:00
104d729656
[CI/Build] re-add codespell to CI ( #10083 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 22:54:46 -08:00
db7db4aab9
[Misc] Consolidate ModelConfig code related to HF config ( #10104 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-07 06:00:21 +00:00
1fa020c539
[V1][BugFix] Fix Generator construction in greedy + seed case ( #10097 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2024-11-07 05:06:57 +00:00
e7b84c394d
[doc] add back Python 3.8 ABI ( #10100 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-06 21:06:41 -08:00
a4b3e0c1e9
[Hardware][CPU] Update torch 2.5 ( #9911 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-07 04:43:08 +00:00
29862b884b
[Frontend] Adjust try/except blocks in API impl ( #10056 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2024-11-06 20:07:51 -08:00
d3859f1891
[Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend ( #9823 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: yan ma <yan.ma@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2024-11-06 17:29:03 -08:00
4ab3256644
[Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12.4 ( #10095 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-07 00:54:13 +00:00
719c1ca468
[core][distributed] add stateless_init_process_group ( #10072 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-06 16:42:09 -08:00
74f2f8a0f1
[CI/Build] Always run the ruff workflow ( #10092 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 22:25:23 +00:00
d58268c56a
[V1] Make v1 more testable ( #9888 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-11-06 11:57:35 -08:00
87bd7e0515
[CI/Build] change conflict PR comment from mergify ( #10080 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 10:15:42 -08:00
098f94de42
[CI/Build] Drop Python 3.8 support ( #10038 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-06 14:31:01 +00:00
399c798608
Remove ScaledActivation for AWQ ( #10057 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-06 14:27:06 +00:00
406d4cc480
[Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration ( #10022 )
...
Signed-off-by: ericperfect <ericperfectttt@gmail.com >
2024-11-06 14:13:15 +00:00
a5bba7d234
[Model] Add Idefics3 support ( #9767 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: B-201 <Joy25810@foxmail.com >
2024-11-06 11:41:17 +00:00
2003cc3513
[Model][LoRA]LoRA support added for LlamaEmbeddingModel ( #10071 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-06 09:49:19 +00:00
6a585a23d2
[Hotfix] Fix ruff errors ( #10073 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-06 01:24:28 -08:00
a02a50e6e5
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend ( #6143 )
...
Signed-off-by: yuwenzho <yuwen.zhou@intel.com >
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
Signed-off-by: Bob Zhu <bob.zhu@intel.com >
Signed-off-by: zehao-intel <zehao.huang@intel.com >
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Sanju C Sudhakaran <scsudhakaran@habana.ai >
Co-authored-by: Michal Adamczyk <madamczyk@habana.ai >
Co-authored-by: Marceli Fylcek <mfylcek@habana.ai >
Co-authored-by: Himangshu Lahkar <49579433+hlahkar@users.noreply.github.com >
Co-authored-by: Vivek Goel <vgoel@habana.ai >
Co-authored-by: yuwenzho <yuwen.zhou@intel.com >
Co-authored-by: Dominika Olszewska <dolszewska@habana.ai >
Co-authored-by: barak goldberg <149692267+bgoldberg-habana@users.noreply.github.com >
Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com >
Co-authored-by: Jan Kaniecki <jkaniecki@habana.ai >
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyniewicz-habana@users.noreply.github.com >
Co-authored-by: Krzysztof Wisniewski <kwisniewski@habana.ai >
Co-authored-by: Dudi Lester <160421192+dudilester@users.noreply.github.com >
Co-authored-by: Ilia Taraban <tarabanil@gmail.com >
Co-authored-by: Chendi.Xue <chendi.xue@intel.com >
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai >
Co-authored-by: Jakub Maksymczuk <jmaksymczuk@habana.ai >
Co-authored-by: Tomasz Zielinski <85164140+tzielinski-habana@users.noreply.github.com >
Co-authored-by: Sun Choi <schoi@habana.ai >
Co-authored-by: Iryna Boiko <iboiko@habana.ai >
Co-authored-by: Bob Zhu <41610754+czhu15@users.noreply.github.com >
Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com >
Co-authored-by: Zehao Huang <zehao.huang@intel.com >
Co-authored-by: Andrzej Kotłowski <Andrzej.Kotlowski@intel.com >
Co-authored-by: Yan Tomsinsky <73292515+Yantom1@users.noreply.github.com >
Co-authored-by: Nir David <ndavid@habana.ai >
Co-authored-by: Yu-Zhou <yu.zhou@intel.com >
Co-authored-by: Ruheena Suhani Shaik <rsshaik@habana.ai >
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai >
Co-authored-by: Marcin Swiniarski <mswiniarski@habana.ai >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Jacek Czaja <jacek.czaja@intel.com >
Co-authored-by: Jacek Czaja <jczaja@habana.ai >
Co-authored-by: Yuan <yuan.zhou@outlook.com >
2024-11-06 01:09:10 -08:00
a5fda50a10
[CI/Build] Fix large_gpu_mark reason ( #10070 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-06 08:50:37 +00:00
21063c11c7
[CI/Build] drop support for Python 3.8 EOL ( #8464 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2024-11-06 07:11:55 +00:00
4be3a45158
[distributed] add function to create ipc buffers directly ( #10064 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 22:35:03 -08:00
4089985552
[V1] Integrate Piecewise CUDA graphs ( #10058 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-05 22:16:04 -08:00
9d59b75593
[Bugfix] Remove CustomChatCompletionContentPartParam multimodal input type ( #10054 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2024-11-06 05:13:09 +00:00
ea928f608c
[Bugfix] Gpt-j-6B patch kv_scale to k_scale path ( #10063 )
...
Signed-off-by: Alex Rakowski <alex.rakowski@amd.com >
Signed-off-by: Alex Rakowski <182798202+arakowsk-amd@users.noreply.github.com >
2024-11-06 05:10:40 +00:00
2bcbae704c
[Bugfix] Fix edge-case crash when using chat with the Mistral Tekken Tokenizer ( #10051 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-11-06 04:28:29 +00:00
ffc0f2b47a
[Model][OpenVINO] Fix regressions from #8346 ( #10045 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2024-11-06 04:19:15 +00:00
82bfc38d07
[Misc] Sort the list of embedding models ( #10037 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-06 04:05:05 +00:00
c4cacbaa7f
[v1] reduce graph capture time for piecewise cudagraph ( #10059 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 18:19:50 -08:00
0c63c34f72
[Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode ( #9730 )
...
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2024-11-06 01:45:45 +00:00
966e31697b
[Bugfix] Fix pickle of input when async output processing is on ( #9931 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-11-06 00:39:26 +00:00
43300bd98a
[Bugfix] Properly propagate trust_remote_code settings ( #10047 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2024-11-05 16:34:40 -08:00
ca9844b340
[bugfix] fix weak ref in piecewise cudagraph and tractable test ( #10048 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 14:49:20 -08:00
235366fe2e
[CI] Prune back the number of tests in tests/kernels/* ( #9932 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 16:02:32 -05:00
02462465ea
[CI] Prune tests/models/decoder_only/language/* tests ( #9940 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 16:02:23 -05:00
b9c64c0ca7
[Misc] Modify BNB parameter name ( #9997 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-05 14:40:08 -05:00
d2e80332a7
[Feature] Update benchmark_throughput.py to support image input ( #9851 )
...
Signed-off-by: Linkun Chen <github+anyscale@lkchen.net >
Co-authored-by: Linkun Chen <github+anyscale@lkchen.net >
2024-11-05 19:30:02 +00:00
a53046b16f
[Model] Support quantization of PixtralHFTransformer for PixtralHF ( #9921 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 10:42:20 -08:00
731aec5be7
[CI/Build] Limit github CI jobs based on files changed ( #9928 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-05 10:30:42 -08:00
09d3550372
[Misc] Add logging for CUDA memory ( #10027 )
...
Signed-off-by: Chenghao Yang <yangalan1996@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Chenghao Yang <yangalan1996@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-05 09:50:50 -08:00
cd34029e91
Refactor TPU requirements file and pin build dependencies ( #10010 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
2024-11-05 16:48:44 +00:00
5952d81139
[Frontend] Fix tcp port reservation for api server ( #10012 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-05 07:50:57 -08:00
93dee88f6b
[Misc] vllm CLI flags should be ordered for better user readability ( #10017 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-05 18:59:56 +08:00
7a83b1aec0
[BugFix] Lazy import ray ( #10021 )
2024-11-05 10:04:10 +00:00
ad23318928
[Bugfix] Fixup Mamba ( #10004 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-05 03:46:38 +00:00
bbc3619dc8
[Core] Make encoder-decoder inputs a nested structure to be more composable ( #9604 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-05 10:07:31 +08:00
04bbf38e05
[Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep ( #9994 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-05 01:08:21 +00:00
8f0a9ca890
[Bugfix] Respect modules_to_not_convert within awq_marlin ( #9895 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-04 16:57:44 -07:00
2094062b4e
[4.5/N] bugfix for quant config in speculative decode ( #10007 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-04 15:11:59 -08:00
d93478b399
[Bugfix] Upgrade to pytorch 2.5.1 ( #10001 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-04 15:11:28 -08:00
ac04a97a9f
[Frontend] Add max_tokens prometheus metric ( #9881 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com >
2024-11-04 22:53:24 +00:00
9a5664d4a4
[Misc] Refactor benchmark_throughput.py ( #9779 )
...
Signed-off-by: Linkun Chen <github+anyscale@lkchen.net >
Co-authored-by: Linkun Chen <lkchen@github.com >
Co-authored-by: Linkun Chen <github+anyscale@lkchen.net >
2024-11-04 14:32:16 -08:00
04cef2c6ab
[Bugfix] Fix MQLLMEngine hanging ( #9973 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-11-04 16:01:43 -05:00
6e056bcf04
[Doc] Update VLM doc about loading from local files ( #9999 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-04 19:47:11 +00:00
5208dc7a20
[Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs ( #9279 )
...
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com >
2024-11-04 11:37:46 -08:00
1c45f4c385
[CI] Basic Integration Test For TPU ( #9968 )
...
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
2024-11-04 11:34:26 -08:00
603a661ae8
[Model] factoring out MambaMixer out of Jamba ( #8993 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-11-04 18:00:00 +00:00
fb2716d641
[Misc]Reduce BNB static variable ( #9987 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-04 17:04:40 +00:00
8d72bb20fa
[4/N] make quant config first-class citizen ( #9978 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-04 08:51:31 -08:00
ac6b8f19b9
[Frontend] Multi-Modality Support for Loading Local Image Files ( #9915 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-04 15:34:57 +00:00
ccb5376a9a
[Bugfix][OpenVINO] Fix circular reference #9939 ( #9974 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-04 18:14:13 +08:00
ea4adeddc1
[Bugfix] Fix E2EL mean and median stats ( #9984 )
...
Signed-off-by: daitran2k1 <tranquangdai7a@gmail.com >
2024-11-04 09:37:58 +00:00
4dbcbbeb09
[Misc] Compute query_start_loc/seq_start_loc on CPU ( #9447 )
...
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com >
2024-11-04 08:54:37 +00:00
b67feb1274
[Bugfix]Using the correct type hints ( #9885 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-11-04 06:19:51 +00:00
c49f0407ba
[Bugfix] Fix MiniCPMV and Mllama BNB bug ( #9917 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-04 03:36:41 +00:00
91c9ebbb1b
[V1] Fix Configs ( #9971 )
2024-11-04 00:24:40 +00:00
54597724f4
[Model] Add support for H2OVL-Mississippi models ( #9747 )
...
Signed-off-by: Shanshan Wang <shanshan.wang@h2o.ai >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-11-04 00:15:36 +00:00
1f1b6d6eda
[V1] Support per-request seed ( #9945 )
...
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
2024-11-03 09:14:17 -08:00
3bb4befea7
[bugfix] fix tests ( #9959 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 15:54:05 -07:00
ae5279a163
[torch.compile] Adding torch compile to vision-language models ( #9946 )
2024-11-02 12:56:05 -07:00
1b73ab2a1f
[CI/Build] Quoting around > ( #9956 )
2024-11-02 12:50:28 -07:00
cea808f325
[3/N] model runner pass the whole config to model ( #9958 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 12:08:49 -07:00
74b529ceee
[bugfix] fix chatglm dummy_data_for_glmv ( #9955 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 08:03:33 -07:00
d6459b4516
[V1] Fix EngineArgs refactor on V1 ( #9954 )
2024-11-02 07:44:38 -07:00
e893795443
[2/N] executor pass the complete config to worker/modelrunner ( #9938 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2024-11-02 07:35:05 -07:00
1d4cfe2be1
[Doc] Updated tpu-installation.rst with more details ( #9926 )
...
Signed-off-by: Michael Green <mikegre@google.com >
2024-11-02 10:06:45 -04:00
eed92f12fc
[Docs] Update Granite 3.0 models in supported models table ( #9930 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-02 09:02:18 +00:00
af7380d83b
[torch.compile] fix cpu broken code ( #9947 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 23:35:47 -07:00
a78dd3303e
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder models ( #9559 )
2024-11-01 23:22:49 -07:00
d522034c85
[ci/build] Have dependabot ignore pinned dependencies ( #9935 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-11-01 23:56:13 +00:00
6c0b7f548d
[Core][VLM] Add precise multi-modal placeholder tracking ( #8346 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2024-11-01 16:21:10 -07:00
d151fde834
[ci/build] Bump the patch-update group with 10 updates ( #9897 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
2024-11-01 23:04:42 +00:00
27cd36e6e2
[Bugfix] PicklingError on RayTaskError ( #9934 )
...
Signed-off-by: Gene Su <e870252314@gmail.com >
2024-11-01 22:08:23 +00:00
18bd7587b7
[1/N] pass the complete config from engine to executor ( #9933 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 13:51:57 -07:00
598b6d7b07
[Bugfix/Core] Flashinfer k_scale and v_scale ( #9861 )
2024-11-01 12:15:05 -07:00
aff1fd8188
[torch.compile] use interpreter with stable api from pytorch ( #9889 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 11:50:37 -07:00
4581d2cc02
[Core] Refactor: Clean up unused argument in Scheduler._preempt ( #9696 )
...
Signed-off-by: André Jonasson <andre.jonasson@gmail.com >
2024-11-01 11:41:38 -07:00
1dd4cb2935
[Bugfix] Fix edge cases for MistralTokenizer ( #9625 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
2024-11-01 10:33:15 -07:00
ba0d892074
[Frontend] Use a proper chat template for VLM2Vec ( #9912 )
2024-11-01 14:09:07 +00:00
30a2e80742
[CI/Build] Add Model Tests for PixtralHF ( #9813 )
2024-11-01 07:55:29 -06:00
06386a64dd
[Frontend] Chat-based Embeddings API ( #9759 )
2024-11-01 08:13:35 +00:00
d3aa2a8b2f
[Doc] Update multi-input support ( #9906 )
2024-11-01 07:34:49 +00:00
2b5bf20988
[torch.compile] Adding torch compile annotations to some models ( #9876 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-01 00:25:47 -07:00
93a76dd21d
[Model] Support bitsandbytes for MiniCPMV ( #9891 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-01 13:31:56 +08:00
566cd27797
[torch.compile] rework test plans ( #9866 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-31 22:20:17 -07:00
37a4947dcd
[Bugfix] Fix layer skip logic with bitsandbytes ( #9887 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-01 13:12:44 +08:00
96e0c9cbbd
[torch.compile] directly register custom op ( #9896 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-31 21:56:09 -07:00
031a7995f3
[Bugfix][Frontend] Reject guided decoding in multistep mode ( #9892 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-11-01 01:09:46 +00:00
b63c64d95b
[ci/build] Configure dependabot to update pip dependencies ( #9811 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-31 15:55:38 -07:00
9fb12f7848
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 ( #9838 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-10-31 20:06:25 +00:00
55650c83a0
[Bugfix] Fix illegal memory access error with chunked prefill, prefix caching, block manager v2 and xformers enabled together ( #9532 )
...
Signed-off-by: sasha0552 <admin@sasha0552.org >
2024-10-31 11:46:36 -07:00
77f7ef2908
[CI/Build] Adding a forced docker system prune to clean up space ( #9849 )
2024-11-01 01:02:58 +08:00
16b8f7a86f
[CI/Build] Add Model Tests for Qwen2-VL ( #9846 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-31 09:10:52 -07:00
5608e611c2
[Doc] Update Qwen documentation ( #9869 )
2024-10-31 08:54:18 +00:00
3ea2dc2ec4
[Misc] Remove deprecated arg for cuda graph capture ( #9864 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-10-31 07:22:07 +00:00
d087bf863e
[Model] Support quantization of Qwen2VisionTransformer ( #9817 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-30 22:41:20 -07:00
890ca36072
Revert "[Bugfix] Use host argument to bind to interface ( #9798 )" ( #9852 )
2024-10-31 01:44:51 +00:00
abbfb6134d
[Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint ( #9837 )
2024-10-30 18:15:56 -07:00
64384bbcdf
[torch.compile] upgrade tests ( #9858 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-30 16:34:22 -07:00
00d91c8a2c
[CI/Build] Simplify exception trace in api server tests ( #9787 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-30 14:52:05 -07:00
c2cd1a2142
[doc] update pp support ( #9853 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-30 13:36:51 -07:00
c787f2d81d
[Neuron] Update Dockerfile.neuron to fix build failure ( #9822 )
2024-10-30 12:22:02 -07:00
33d257735f
[Doc] link bug for multistep guided decoding ( #9843 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-30 17:28:29 +00:00
3b3f1e7436
[Bugfix][core] replace heartbeat with pid check ( #9818 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-30 09:34:07 -07:00
9ff4511e43
[Misc] Add chunked-prefill support on FlashInfer. ( #9781 )
2024-10-30 09:33:53 -07:00
81f09cfd80
[Model] Support math-shepherd-mistral-7b-prm model ( #9697 )
...
Signed-off-by: Went-Liang <wenteng_liang@163.com >
2024-10-30 09:33:42 -07:00
cc98f1e079
[CI/Build] VLM Test Consolidation ( #9372 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-30 09:32:17 -07:00
211fe91aa8
[TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA ( #9438 )
2024-10-30 09:41:38 +00:00
6aa6020f9b
[Misc] Specify minimum pynvml version ( #9827 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-29 23:05:43 -07:00
ff5ed6e1bc
[torch.compile] rework compile control with piecewise cudagraph ( #9715 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-29 23:03:49 -07:00
7b0365efef
[Doc] Add the DCO to CONTRIBUTING.md ( #9803 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-30 05:22:23 +00:00
04a3ae0aca
[Bugfix] Fix multi nodes TP+PP for XPU ( #8884 )
...
Signed-off-by: YiSheng5 <syhm@mail.ustc.edu.cn >
Signed-off-by: yan ma <yan.ma@intel.com >
Co-authored-by: YiSheng5 <syhm@mail.ustc.edu.cn >
2024-10-29 21:34:45 -07:00
62fac4b9aa
[ci/build] Pin CI dependencies version with pip-compile ( #9810 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-30 03:34:55 +00:00
226688bd61
[Bugfix][VLM] Make apply_fp8_linear work with >2D input ( #9812 )
2024-10-29 19:49:44 -07:00
64cb1cdc3f
Update README.md ( #9819 )
2024-10-29 17:28:43 -07:00
1ab6f6b4ad
[core][distributed] fix custom allreduce in pytorch 2.5 ( #9815 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-29 17:06:24 -07:00
bc73e9821c
[Bugfix] Fix prefix strings for quantized VLMs ( #9772 )
2024-10-29 16:02:59 -07:00
8d7724104a
[Docs] Add notes about Snowflake Meetup ( #9814 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-10-29 15:19:02 -07:00
882a1ad0de
[Model] tool calling support for ibm-granite/granite-20b-functioncalling ( #8339 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com >
2024-10-29 15:07:37 -07:00
67bdf8e523
[Bugfix][Frontend] Guard against bad token ids ( #9634 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-29 14:13:20 -07:00
0ad216f575
[MISC] Set label value to timestamp over 0, to keep track of recent history ( #9777 )
...
Signed-off-by: Kunjan Patel <kunjanp@google.com >
2024-10-29 19:52:19 +00:00
7585ec996f
[CI/Build] mergify: fix rules for ci/build label ( #9804 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-29 19:24:42 +00:00
ab6f981671
[CI][Bugfix] Skip chameleon for transformers 4.46.1 ( #9808 )
2024-10-29 11:12:43 -07:00
ac3d748dba
[Model] Add LlamaEmbeddingModel as an embedding Implementation of LlamaModel ( #9806 )
2024-10-29 10:40:35 -07:00
0ce7798f44
[Misc]: Typo fix: Renaming classes (casualLM -> causalLM) ( #9801 )
...
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com >
2024-10-29 10:39:20 -07:00
0f43387157
[Bugfix] Use host argument to bind to interface ( #9798 )
2024-10-29 10:37:59 -07:00
08600ddc68
Fix the log to correct guide user to install modelscope ( #9793 )
...
Signed-off-by: yuze.zyz <yuze.zyz@alibaba-inc.com >
2024-10-29 10:36:59 -07:00
74fc2d77ae
[Misc] Add metrics for request queue time, forward time, and execute time ( #9659 )
2024-10-29 10:32:56 -07:00
622b7ab955
[Hardware] using current_platform.seed_everything ( #9785 )
...
Signed-off-by: wangshuai09 <391746016@qq.com >
2024-10-29 14:47:44 +00:00
09500f7dde
[Model] Add BNB quantization support for Mllama ( #9720 )
2024-10-29 08:20:02 -04:00
ef7865b4f9
[Frontend] re-enable multi-modality input in the new beam search implementation ( #9427 )
...
Signed-off-by: Qishuai <Ferdinandzhong@gmail.com >
2024-10-29 11:49:47 +00:00
eae3d48181
[Bugfix] Use temporary directory in registry ( #9721 )
2024-10-28 22:08:20 -07:00
e74f2d448c
[Doc] Specify async engine args in docs ( #9726 )
2024-10-28 22:07:57 -07:00
7a4df5f200
[Model][LoRA]LoRA support added for Qwen ( #9622 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-29 04:14:07 +00:00
c5d7fb9ddc
[Doc] fix third-party model example ( #9771 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-28 19:39:21 -07:00
76ed5340f0
[torch.compile] add deepseek v2 compile ( #9775 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-28 14:35:17 -07:00
97b61bfae6
[misc] avoid circular import ( #9765 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-28 20:51:23 +00:00
aa0addb397
Adding "torch compile" annotations to moe models ( #9758 )
2024-10-28 13:49:56 -07:00
5f8d8075f9
[Model][VLM] Add multi-video support for LLaVA-Onevision ( #8905 )
...
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-28 18:04:10 +00:00
8b0e4f2ad7
[CI/Build] Adopt Mergify for auto-labeling PRs ( #9259 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-28 09:38:09 -07:00
2adb4409e0
[Bugfix] Fix ray instance detect issue ( #9439 )
2024-10-28 07:13:03 +00:00
feb92fbe4a
Fix beam search eos ( #9627 )
2024-10-28 06:59:37 +00:00
32176fee73
[torch.compile] support moe models ( #9632 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-27 21:58:04 -07:00
4e2d95e372
[Hardware][ROCM] using current_platform.is_rocm ( #9642 )
...
Signed-off-by: wangshuai09 <391746016@qq.com >
2024-10-28 04:07:00 +00:00
34a9941620
[Bugfix] Fix load config when using bools ( #9533 )
2024-10-27 13:46:41 -04:00
e130c40e4e
Fix cache management in "Close inactive issues and PRs" actions workflow ( #9734 )
2024-10-27 10:30:03 -07:00
3cb07a36a2
[Misc] Upgrade to pytorch 2.5 ( #9588 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-27 09:44:24 +00:00
8549c82660
[core] cudagraph output with tensor weak reference ( #9724 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-27 00:19:28 -07:00
67a6882da4
[Misc] SpecDecodeWorker supports profiling ( #9719 )
...
Signed-off-by: Abatom <abatom@163.com >
2024-10-27 04:18:03 +00:00
6650e6a930
[Model] Add classification Task with Qwen2ForSequenceClassification ( #9704 )
...
Signed-off-by: Kevin-Yang <ykcha9@gmail.com >
Co-authored-by: Kevin-Yang <ykcha9@gmail.com >
2024-10-26 17:53:35 +00:00
07e981fdf4
[Frontend] Bad words sampling parameter ( #9717 )
...
Signed-off-by: Vasily Alexeev <alvasian@yandex.ru >
2024-10-26 16:29:38 +00:00
55137e8ee3
Fix: MI100 Support By Bypassing Custom Paged Attention ( #9560 )
2024-10-26 12:12:57 +00:00
5cbdccd151
[Hardware][openvino] is_openvino --> current_platform.is_openvino ( #9716 )
2024-10-26 10:59:06 +00:00
067e77f9a8
[Bugfix] Streaming continuous_usage_stats default to False ( #9709 )
...
Signed-off-by: Sam Stoelinga <sammiestoel@gmail.com >
2024-10-26 05:05:47 +00:00
6567e13724
[Bugfix] Fix crash with llama 3.2 vision models and guided decoding ( #9631 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: pavlo-ruban <pavlo.ruban@servicenow.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-25 15:42:56 -07:00
228cfbd03f
[Doc] Improve quickstart documentation ( #9256 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-25 14:32:10 -07:00
ca0d92227e
[Bugfix] Fix compressed_tensors_moe bad config.strategy ( #9677 )
2024-10-25 12:40:33 -07:00
9645b9f646
[V1] Support sliding window attention ( #9679 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-10-24 22:20:37 -07:00
a6f3721861
[Model] add a lora module for granite 3.0 MoE models ( #9673 )
2024-10-24 22:00:17 -07:00
9f7b4ba865
[ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #9675 ( #9676 )
2024-10-24 20:59:00 -07:00
c91ed47c43
[Bugfix] Remove xformers requirement for Pixtral ( #9597 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 15:38:05 -07:00
59449095ab
[Performance][Kernel] Fused_moe Performance Improvement ( #9384 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2024-10-24 15:37:52 -07:00
e26d37a185
[Log][Bugfix] Fix default value check for image_url.detail ( #9663 )
2024-10-24 10:44:38 -07:00
722d46edb9
[Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints ( #9650 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-24 10:42:24 -07:00
c866e0079d
[CI/Build] Fix VLM test failures when using transformers v4.46 ( #9666 )
2024-10-25 01:40:40 +08:00
d27cfbf791
[torch.compile] Adding torch compile annotations to some models ( #9641 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 09:31:42 -07:00
de662d32b5
Increase operation per run limit for "Close inactive issues and PRs" workflow ( #9661 )
...
Signed-off-by: Harry Mellor <hej.mellor@gmail.com >
2024-10-24 12:17:45 -04:00
f58454968f
[Bugfix]Disable the post_norm layer of the vision encoder for LLaVA models ( #9653 )
2024-10-24 07:52:07 -07:00
b979143d5b
[Doc] Move additional tips/notes to the top ( #9647 )
2024-10-24 09:43:59 +00:00
ad6f78053e
[torch.compile] expanding support and fix allgather compilation ( #9637 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 01:32:15 -07:00
295a061fb3
[Kernel] add kernel for FATReLU ( #9610 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-24 16:18:27 +08:00
8a02cd045a
[torch.compile] Adding torch compile annotations to some models ( #9639 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 00:54:57 -07:00
4fdc581f9e
[core] simplify seq group code ( #9569 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-10-24 00:16:44 -07:00
3770071eb4
[V1][Bugfix] Clean up requests when aborted ( #9629 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-10-23 23:33:22 -07:00
836e8ef6ee
[Bugfix] Fix PP for ChatGLM and Molmo ( #9422 )
2024-10-24 06:12:05 +00:00
056a68c7db
[XPU] avoid triton import for xpu ( #9440 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-24 05:14:00 +00:00
33bab41060
[Bugfix]: Make chat content text allow type content ( #9358 )
...
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
2024-10-24 05:05:49 +00:00
b7df53cd42
[Bugfix] Use "vision_model" prefix for MllamaVisionModel ( #9628 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 10:07:44 +08:00
bb01f2915e
[Bugfix][Model] Fix Mllama SDPA illegal memory access for batched multi-image ( #9626 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 10:03:44 +08:00
b548d7a5f4
[CI/Build] Add bot to close stale issues and PRs ( #9436 )
2024-10-23 15:45:26 -07:00
fc6c274626
[Model] Add Qwen2-Audio model support ( #9248 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-23 17:54:22 +00:00
150b779081
[Frontend] Enable Online Multi-image Support for MLlama ( #9393 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-23 17:28:57 +00:00
9013e24f7b
[torch.compile] Adding torch compile annotations to some models ( #9614 )
2024-10-23 10:07:48 -07:00
fd0e2cfdb2
[Misc] Separate total and output tokens in benchmark_throughput.py ( #8914 )
2024-10-23 16:47:20 +00:00
e5ac6a4199
[Bugfix] Fix divide by zero when serving Mamba models ( #9617 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-10-23 16:40:43 +00:00
dbdd3b5e5a
[misc] comment to avoid future confusion about baichuan ( #9620 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-23 09:14:44 -07:00
e7116c017c
[Bugfix] Fix _init_vision_model in NVLM_D model ( #9611 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-23 14:09:04 +00:00
31a08f5bd2
[Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs ( #9612 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-23 14:05:18 +00:00
c18e1a3418
[VLM] Enable overriding whether post layernorm is used in vision encoder + fix quant args ( #9217 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-23 11:27:37 +00:00
3ff57ebfca
[Model] Initialize Florence-2 language backbone support ( #9555 )
2024-10-23 10:42:47 +00:00
2394962d70
[Hardware][XPU] using current_platform.is_xpu ( #9605 )
2024-10-23 08:28:21 +00:00
51c24c9736
[Build] Fix FetchContent multiple build issue ( #9596 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2024-10-23 12:43:07 +08:00
831540cf04
[Model] Support E5-V ( #9576 )
2024-10-23 11:35:29 +08:00
29061ed9df
[Misc] Add an env var VLLM_LOGGING_PREFIX; if set, it will be prepended to all logging messages ( #9590 )
2024-10-23 11:17:28 +08:00
65050a40e6
[Bugfix] Generate exactly input_len tokens in benchmark_throughput ( #9592 )
2024-10-22 17:45:35 -07:00
208cb34c81
[Doc]: Update tensorizer docs to include vllm[tensorizer] ( #7889 )
...
Co-authored-by: Kaunil Dhruv <dhruv.kaunil@gmail.com >
2024-10-22 15:43:25 -07:00
b17046e298
[BugFix] Fix metrics error for --num-scheduler-steps > 1 ( #8234 )
2024-10-22 15:43:03 -07:00
d1e8240875
[Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing ( #9487 )
2024-10-22 15:41:13 -07:00
cb6fdaa0a0
[Misc] Make benchmarks use EngineArgs ( #9529 )
2024-10-22 15:40:38 -07:00
23b899a8e6
[Bugfix] fix detokenizer shallow copy ( #5919 )
2024-10-22 15:38:12 -07:00
17c79f3c36
[torch.compile] auto infer dynamic_arg_dims from type annotation ( #9589 )
2024-10-22 13:43:37 -07:00
cd5601ac37
[BugFix] Prevent exporting duplicate OpenTelemetry spans ( #9017 )
2024-10-22 11:11:53 -07:00
434984e665
[Frontend] Support custom request_id from request ( #9550 )
...
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com >
2024-10-22 18:07:30 +00:00
32a1ee74a0
[Hardware][Intel CPU][DOC] Update docs for CPU backend ( #6212 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Gubrud, Aaron D <aaron.d.gubrud@intel.com >
Co-authored-by: adgubrud <96072084+adgubrud@users.noreply.github.com >
2024-10-22 10:38:04 -07:00
08075c3448
[Bugfix] Eagle: change config name for fc bias ( #9580 )
2024-10-22 16:14:22 +00:00
bb392ea2d2
[Model][VLM] Initialize support for Mono-InternVL model ( #9528 )
2024-10-22 16:01:46 +00:00
9dbcce84a7
[Neuron] [Bugfix] Fix neuron startup ( #9374 )
...
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-10-22 12:51:41 +00:00
a48e3ec052
[CI/Build][LoRA] Temporarily fix long context failure issue ( #9579 )
2024-10-22 11:32:51 +00:00
6c5af09b39
[V1] Implement vLLM V1 [1/N] ( #9289 )
2024-10-22 01:24:07 -07:00
3ddbe25502
[Hardware][CPU] using current_platform.is_cpu ( #9536 )
2024-10-22 00:50:43 -07:00
0d02747f2e
support TP in qwen2 bnb ( #9574 )
2024-10-22 07:13:23 +00:00
f7db5f0fa9
[Doc] Use shell code-blocks and fix section headers ( #9508 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-22 06:43:24 +00:00
ca30c3c84b
[Core] Remove evictor_v1 ( #9572 )
2024-10-22 04:55:49 +00:00
c0292211ce
[CI/Build] Replaced some models on tests for smaller ones ( #9570 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-22 04:52:14 +00:00
74692421f7
[Bugfix]: phi.py get rope_theta from config file ( #9503 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-22 02:53:36 +00:00
29acd2c34c
[Bugfix][OpenVINO] fix_dockerfile_openvino ( #9552 )
2024-10-21 19:47:52 -07:00
f085995a7b
[CI/Build] Remove unnecessary fork_new_process ( #9484 )
2024-10-21 19:47:29 -07:00
b729901139
[Bugfix]: serialize config by value for --trust-remote-code ( #6751 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-21 19:46:24 -07:00
76a5e13270
[core] move parallel sampling out from vllm core ( #9302 )
2024-10-22 00:31:44 +00:00
ef7faad1b8
🐛 Fixup more test failures from memory profiling ( #9563 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-21 17:10:56 -07:00
575dcebe9a
[CI] Make format checker error message more user-friendly by using emoji ( #9564 )
...
This PR makes the format checker's error messages more user-friendly by adding emojis.
2024-10-21 23:45:15 +00:00
711f3a7806
[Frontend] Don't log duplicate error stacktrace for every request in the batch ( #9023 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-21 14:49:41 -07:00
15713e3b75
[BugFix] Update draft model TP size check to allow matching target TP size ( #9394 )
...
Co-authored-by: Baoyuan Qi <qibaoyuan@126.com >
2024-10-21 14:14:29 -07:00
d621c43df7
[doc] fix format ( #9562 )
2024-10-21 13:54:57 -07:00
9d9186be97
[Frontend] Reduce frequency of client cancellation checking ( #7959 )
2024-10-21 13:28:10 -07:00
5241aa1494
[Model][Bugfix] Fix batching with multi-image in PixtralHF ( #9518 )
2024-10-21 14:20:07 -04:00
ec6bd6c4c6
[BugFix] Use correct python3 binary in Docker.ppc64le entrypoint ( #9492 )
...
Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com >
2024-10-21 17:43:02 +00:00
8ca8954841
[Bugfix][Misc]: fix graph capture for decoder ( #9549 )
2024-10-21 17:33:30 +00:00
f6b97293aa
[Model] FalconMamba Support ( #9325 )
2024-10-21 12:50:16 -04:00
496e991da8
[Doc] Consistent naming of attention backends ( #9498 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-10-21 22:29:57 +08:00
696b01af8f
[CI/Build] Split up decoder-only LM tests ( #9488 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-20 21:27:50 -07:00
855e0e6f97
[Frontend][Misc] Goodput metric support ( #9338 )
2024-10-20 18:39:32 +00:00
4fa3e33349
[Kernel] Support sliding window in flash attention backend ( #9403 )
2024-10-20 10:57:52 -07:00
962d2c6349
[Model][Pixtral] Use memory_efficient_attention for PixtralHFVision ( #9520 )
2024-10-20 05:29:14 +00:00
5b59fe0f08
[Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger ( #9530 )
2024-10-20 00:05:02 +00:00
8e3e7f2713
[Model][Pixtral] Optimizations for input_processor_for_pixtral_hf ( #9514 )
2024-10-19 10:44:29 -04:00
263d8ee150
[Bugfix] Fix missing task for speculative decoding ( #9524 )
2024-10-19 06:49:40 +00:00
c5eea3c8ba
[Frontend] Support simpler image input format ( #9478 )
2024-10-18 23:17:07 -07:00
85dc92fc98
[CI/Build] Configure matcher for actionlint workflow ( #9511 )
...
Signed-off-by: Russell Bryant <russell.bryant@gmail.com >
2024-10-19 06:04:18 +00:00
dfd951ed9b
[CI/Build] Add error matching for ruff output ( #9513 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-19 05:42:20 +00:00
82c25151ec
[Doc] update gpu-memory-utilization flag docs ( #9507 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-19 11:26:36 +08:00
1325872ec8
[Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily ( #9521 )
2024-10-18 20:21:01 -07:00
380e18639f
🐛 fix torch memory profiling ( #9516 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-18 21:25:19 -04:00
337ed76671
[Bugfix] Fix offline mode when using mistral_common ( #9457 )
2024-10-18 18:12:32 -07:00
0c9a5258f9
[Kernel] Add env variable to force flashinfer backend to enable tensor cores ( #9497 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-10-18 17:55:48 -07:00
d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py ( #9510 )
2024-10-18 14:30:55 -07:00
9bb10a7d27
[MISC] Add lora requests to metrics ( #9477 )
...
Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal>
2024-10-18 20:50:18 +00:00
3921a2f29e
[Model] Support Pixtral models in the HF Transformers format ( #9036 )
2024-10-18 13:29:56 -06:00
67a7e5ef38
[CI/Build] Add error matching config for mypy ( #9512 )
2024-10-18 12:17:53 -07:00
051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding ( #9424 )
2024-10-18 11:31:58 -07:00
7dbe738d65
[Misc] benchmark: Add option to set max concurrency ( #9390 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-18 11:15:28 -07:00
ae8b633ba3
[Bugfix] Fix offline_inference_with_prefix.py ( #9505 )
2024-10-18 16:59:19 +00:00
1bbbcc0b1d
[CI/Build] Fix lint errors in mistral tokenizer ( #9504 )
2024-10-19 00:09:35 +08:00
25aeb7d4c9
[BugFix] Fix and simplify completion API usage streaming ( #9475 )
2024-10-18 14:10:26 +00:00
d2b1bf55ec
[Frontend][Feature] Add jamba tool parser ( #9154 )
2024-10-18 10:27:48 +00:00
1ffc8a7362
[BugFix] Typing fixes to RequestOutput.prompt and beam search ( #9473 )
2024-10-18 07:19:53 +00:00
944dd8edaf
[CI/Build] Use commit hash references for github actions ( #9430 )
2024-10-17 21:54:58 -07:00
154a8ae880
[Qwen2.5] Support bnb quant for Qwen2.5 ( #9467 )
2024-10-18 04:40:14 +00:00
de4008e2ab
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage ( #9352 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-17 22:47:27 -04:00
48138a8415
[BugFix] Stop silent failures on compressed-tensors parsing ( #9381 )
2024-10-17 18:54:00 -07:00
343f8e0905
Support BERTModel (first encoder-only embedding model) ( #9056 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com >
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: laishzh <laishengzhang@gmail.com >
Co-authored-by: Max de Bayser <maxdebayser@gmail.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-10-17 23:21:01 +00:00
bb76538bbd
[Hardware][Neuron] Simplify model load for transformers-neuronx library ( #9380 )
2024-10-17 15:39:39 -07:00
d615b5c9f8
[Bugfix] Print warnings related to mistral_common tokenizer only once ( #9468 )
2024-10-17 21:44:20 +00:00
d65049daab
[Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script ( #9013 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-17 21:11:11 +00:00
eca2c5f7c0
[Bugfix] Fix support for dimension like integers and ScalarType ( #9299 )
2024-10-17 19:08:34 +00:00
0f41fbe5a3
[torch.compile] Fine-grained CustomOp enabling mechanism ( #9300 )
2024-10-17 18:36:37 +00:00
7871659abb
[Misc] Remove commit id file ( #9470 )
2024-10-17 10:34:37 -07:00
a2c71c5405
[CI/Build] remove .github from .dockerignore, add dirty repo check ( #9375 )
2024-10-17 10:25:06 -07:00
81ede99ca4
[Core] Deprecating block manager v1 and making block manager v2 the default ( #8704 )
...
Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
5eda21e773
[Hardware][CPU] compressed-tensor INT8 W8A8 AZP support ( #9344 )
2024-10-17 12:21:04 -04:00
8e1cddcd44
[TPU] Call torch._sync(param) during weight loading ( #9437 )
2024-10-17 09:00:11 -07:00
5e443b594f
[Bugfix] Allow prefill of assistant response when using mistral_common ( #9446 )
2024-10-17 15:06:37 +00:00
9d30a056e7
[misc] CUDA Time Layerwise Profiler ( #8337 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-10-17 10:36:09 -04:00
390be74649
[Misc] Print stack trace using logger.exception ( #9461 )
2024-10-17 13:55:48 +00:00
e312e52b44
[Kernel] Add Exllama as a backend for compressed-tensors ( #9395 )
2024-10-17 09:48:26 -04:00
dbfa8d31d5
Add notes on the use of Slack ( #9442 )
2024-10-17 04:46:46 +00:00
92d86da217
[BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels ( #9391 )
2024-10-17 01:34:06 +00:00
c3fab5f769
[Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel ( #9425 )
2024-10-16 23:46:06 +00:00
776dbd74f1
[CI/Build] mypy: Resolve some errors from checking vllm/engine ( #9267 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-16 22:55:59 +00:00
8345045833
[Performance][Spec Decode] Optimize ngram lookup performance ( #9333 )
2024-10-16 13:37:45 -06:00
5b8a1fde84
[Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft ( #9396 )
2024-10-16 16:40:24 +00:00
fb60ae9b91
[Kernel][Model] Improve continuous batching for Jamba and Mamba ( #9189 )
2024-10-16 12:12:43 -04:00
415f76a9cb
Support mistral interleaved attn ( #9414 )
2024-10-16 13:28:30 +00:00
cf1d62a644
[Model] Support SDPA attention for Molmo vision backbone ( #9410 )
2024-10-16 11:52:01 +00:00
59230ef32b
[Misc] Consolidate example usage of OpenAI client for multimodal models ( #9412 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-16 11:20:51 +00:00
cee711fdbb
[Core] Rename input data types ( #8688 )
2024-10-16 10:49:37 +00:00
1de76a0e55
[CI/Build] Test VLM embeddings ( #9406 )
2024-10-16 09:44:30 +00:00
7abba39ee6
[Model] VLM2Vec, the first multimodal embedding model in vLLM ( #9303 )
2024-10-16 14:31:00 +08:00
7e7eae338d
[Misc] Standardize RoPE handling for Qwen2-VL ( #9250 )
2024-10-16 13:56:17 +08:00
ed920135c8
[Bugfix] Molmo text-only input bug fix ( #9397 )
...
Co-authored-by: sanghol <sanghol@allenai.org >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-16 04:56:09 +00:00
717a5f82cd
[Bugfix][CI/Build] Fix CUDA 11.8 Build ( #9386 )
2024-10-16 00:15:21 +00:00
ba30942240
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids ( #9034 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-15 15:40:43 -07:00
22f8a69549
[Misc] Directly use compressed-tensors for checkpoint definitions ( #8909 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-15 15:40:25 -07:00
5d264f4ab8
pass ignore_eos parameter to all benchmark_serving calls ( #9349 )
2024-10-15 13:30:44 -07:00
e9d517f276
[BugFix] Fix chat API continuous usage stats ( #9357 )
2024-10-14 23:19:48 -07:00
55e081fbad
[Bugfix] Update InternVL input mapper to support image embeds ( #9351 )
2024-10-14 21:29:19 -07:00
8e836d982a
[Doc] Fix code formatting in spec_decode.rst ( #9348 )
2024-10-14 21:29:11 -07:00
44eaa5a5d9
[Frontend] Clarify model_type error messages ( #9345 )
2024-10-14 21:29:01 -07:00
169b530607
[Bugfix] Clean up some cruft in mamba.py ( #9343 )
2024-10-15 00:24:25 +00:00
f0fe4fe86d
[Model] Make llama3.2 support multiple and interleaved images ( #9095 )
2024-10-14 15:24:26 -07:00
4d31cd424b
[Frontend] merge beam search implementations ( #9296 )
2024-10-14 15:05:52 -07:00
473e7b3606
[TPU] Fix TPU SMEM OOM by Pallas paged attention kernel ( #9350 )
2024-10-14 15:02:06 -07:00
fd47e57f4b
[Docs] Remove PDF build from Readthedocs ( #9347 )
2024-10-14 11:57:47 -07:00
203ab8f80f
[CI/Build] setuptools-scm fixes ( #8900 )
2024-10-14 11:34:47 -07:00
4141608c6a
[Hardware][intel GPU] add async output process for xpu ( #8897 )
2024-10-14 12:23:33 -06:00
dfe43a2071
[Model] Molmo vLLM Integration ( #9016 )
...
Co-authored-by: sanghol <sanghol@allenai.org >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-14 07:56:24 -07:00
16b24e7dcd
[Bugfix] Bandaid fix for speculative decoding tests ( #9327 )
2024-10-13 23:02:11 +00:00
f519902c52
[CI] Fix merge conflict ( #9317 )
2024-10-13 06:41:23 +00:00
250e26a63e
[Bugfix]Fix MiniCPM's LoRA bug ( #9286 )
2024-10-12 09:36:47 -07:00
2b184ddd4f
[Misc][Installation] Improve source installation script and doc ( #9309 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-10-12 09:36:40 -07:00
00298e092c
[Bugfix] Fix bug of xformer prefill for encoder-decoder ( #9026 )
2024-10-12 15:00:43 +08:00
89feb4c84d
[SpecDec] Remove Batch Expansion (2/3) ( #9298 )
2024-10-12 05:13:37 +00:00
ec10cb8511
[BugFix] Fix tool call finish reason in streaming case ( #9209 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-10-11 18:24:26 -07:00
d11b46f3a5
[bugfix] fix f-string for error ( #9295 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2024-10-11 17:03:48 -07:00
c6cf9295e1
[Bugfix] Sets is_first_step_output for TPUModelRunner ( #9202 )
2024-10-11 13:28:10 -07:00
de9fb4bef8
[Bugfix][CI/Build] Fix docker build where CUDA archs < 7.0 are being detected ( #9254 )
2024-10-11 15:57:39 -04:00
8baf85e4e9
[Doc] Compatibility matrix for mutual exclusive features ( #8512 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-11 11:18:50 -07:00
1a1823871d
[Doc] Remove outdated comment to avoid misunderstanding ( #9287 )
2024-10-11 18:02:03 +00:00
6cf1167c1a
[Model] Add GLM-4v support and meet vllm==0.6.2 ( #9242 )
2024-10-11 17:36:13 +00:00
f710090d8e
[Kernel] adding fused moe kernel config for L40S TP4 ( #9245 )
2024-10-11 08:54:22 -07:00
7342a7d7f8
[Model] Support Mamba ( #6484 )
2024-10-11 15:40:06 +00:00
df3dcdf49d
[Bugfix] Fix priority in multiprocessing engine ( #9277 )
2024-10-11 15:35:35 +00:00
36ea79079b
[Misc][LoRA] Support loading LoRA weights for target_modules in reg format ( #9275 )
2024-10-11 12:31:21 +00:00
e808156f30
[Misc] Collect model support info in a single process per model ( #9233 )
2024-10-11 11:08:11 +00:00
cbc2ef5529
[misc] hide best_of from engine ( #9261 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com >
2024-10-10 21:30:44 -07:00
94bf9ae4e9
[Misc] Fix sampling from sonnet for long context case ( #9235 )
2024-10-11 00:33:16 +00:00
f990bab2a4
[Doc][Neuron] add note to neuron documentation about resolving triton issue ( #9257 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-10-10 23:36:32 +00:00
e00c094f15
[torch.compile] generic decorators ( #9258 )
2024-10-10 15:54:23 -07:00
a78c6ba7c8
[ci/build] Add placeholder command for custom models test ( #9262 )
2024-10-10 15:45:09 -07:00
fb870fd491
Bump actions/setup-python from 3 to 5 ( #9195 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:46 -07:00
270953bafb
Bump actions/checkout from 3 to 4 ( #9196 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:35 -07:00
9cc811c4ff
Bump actions/github-script from 6 to 7 ( #9197 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:24 -07:00
e4d652ea3e
[torch.compile] integration with compilation control ( #9058 )
2024-10-10 12:39:36 -07:00
78c0b4166c
Suggest codeowners for the core components ( #9210 )
2024-10-10 12:29:24 -07:00
21efb603f5
[CI/Build] Make the Dockerfile.cpu file's PIP_EXTRA_INDEX_URL Configurable as a Build Argument ( #9252 )
2024-10-10 18:18:18 +00:00
055f3270d4
[Doc] Improve debugging documentation ( #9204 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-10 10:48:51 -07:00
18511aeda6
[Bugfix] Fix Machete unittests failing with NotImplementedError ( #9218 )
2024-10-10 17:39:56 +00:00
83ea5c72b9
[OpenVINO] Use torch 2.4.0 and newer optimum version ( #9121 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-10 11:18:58 -06:00
04de9057ab
[Model] support input image embedding for minicpmv ( #9237 )
2024-10-10 15:00:47 +00:00
07c11cf4d4
[Bugfix] Fix lm_head weights tying with lora for llama ( #9227 )
2024-10-10 21:11:56 +08:00
f3a507f1d3
[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 ( #9149 )
2024-10-10 14:17:17 +08:00
a64e7b9407
[Bugfix] Machete garbage results for some models (large K dim) ( #9212 )
2024-10-10 14:16:17 +08:00
ce00231a8b
[Bugfix] Fix Weight Loading Multiple GPU Test - Large Models ( #9213 )
2024-10-10 14:15:40 +08:00
de895f1697
[misc] improve model support check in another process ( #9208 )
2024-10-09 21:58:27 -07:00
cf25b93bdd
[Core] Fix invalid args to _process_request ( #9201 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-10 12:10:09 +08:00
d5fbb8706d
[CI/Build] Update Dockerfile install+deploy image to ubuntu 22.04 ( #9130 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-09 12:51:47 -06:00
cdca8994bd
[CI/Build] mypy: check vllm/entrypoints ( #9194 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-09 17:15:28 +00:00
ca77dd7a44
[Hardware][CPU] Support AWQ for CPU backend ( #7515 )
2024-10-09 10:28:08 -06:00
7dea289066
Add Dependabot configuration for GitHub Actions updates ( #1217 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-09 08:16:26 -07:00
cfaa6008e6
[Bugfix] Access get_vocab instead of vocab in tool parsers ( #9188 )
2024-10-09 08:59:57 -06:00
21906a6f50
[Bugfix] Fix lora loading for Compressed Tensors in #9120 ( #9179 )
2024-10-09 12:10:44 +00:00
dc4aea677a
[Doc] Fix VLM prompt placeholder sample bug ( #9170 )
2024-10-09 08:59:42 +00:00
c8627cd41b
[ci][test] use load dummy for testing ( #9165 )
2024-10-09 00:38:40 -07:00
8bfaa4e31e
[Bugfix] fix composite weight loading and EAGLE weight loading ( #9160 )
2024-10-09 00:36:55 -07:00
0b5b5d767e
[Frontend] Log the maximum supported concurrency ( #8831 )
2024-10-09 00:03:14 -07:00
cdc72e3c80
[Model] Remap FP8 kv_scale in CommandR and DBRX ( #9174 )
2024-10-09 06:43:06 +00:00
7627172bf4
[Bugfix][Doc] Report neuron error in output ( #9159 )
2024-10-08 22:43:34 -07:00
480b7f40cf
[Misc] Improve validation errors around best_of and n ( #9167 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-10-09 04:54:48 +00:00
acce7630c1
Update link to KServe deployment guide ( #9173 )
2024-10-09 03:58:49 +00:00
ffc4b27ea8
Add classifiers in setup.py ( #9171 )
2024-10-08 19:30:48 -07:00
2f4117c38e
support bitsandbytes quantization with more models ( #9148 )
2024-10-08 19:52:19 -06:00
9ba0bd6aa6
Add lm-eval directly to requirements-test.txt ( #9161 )
2024-10-08 18:22:31 -07:00
2a131965a8
mypy: check additional directories ( #9162 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-08 22:08:22 +00:00
bd37b9fbe2
[Bugfix] Try to handle older versions of pytorch ( #9086 )
2024-10-08 14:28:12 -07:00
de24046fcd
[Doc] Improve contributing and installation documentation ( #9132 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-08 20:22:08 +00:00
1874c6a1b0
[Doc] Update vlm.rst to include an example on videos ( #9155 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-08 18:12:29 +00:00
9a94ca4a5d
[Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing ( #8537 )
2024-10-08 09:38:40 -07:00
cfba685bd4
[CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models ( #8758 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2024-10-08 09:37:34 -07:00
069d3bd8d0
[Frontend] Add Early Validation For Chat Template / Tool Call Parser ( #9151 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-08 14:31:26 +00:00
a3691b6b5e
[Core][Frontend] Add Support for Inference Time mm_processor_kwargs ( #9131 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-08 14:12:56 +00:00
8c746226c9
[Frontend] API support for beam search for MQLLMEngine ( #9117 )
2024-10-08 05:51:43 +00:00
e1faa2a598
[misc] improve ux on readme ( #9147 )
2024-10-07 22:26:25 -07:00
80b57f00d5
[Intel GPU] Fix xpu decode input ( #9145 )
2024-10-08 03:51:14 +00:00
04c12f8157
[misc] update utils to support comparing multiple settings ( #9140 )
2024-10-08 02:51:49 +00:00
8eeb857084
Add Slack to README ( #9137 )
2024-10-07 17:06:21 -07:00
fa45513a51
[misc] fix comment and variable name ( #9139 )
2024-10-07 16:07:05 -07:00
c0d9a98d0c
[Doc] Include performance benchmark in README ( #9135 )
2024-10-07 15:04:06 -07:00
e0dbdb013d
[CI/Build] Add linting for github actions workflows ( #7876 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-07 21:18:10 +00:00
93cf74a8a7
[Doc]: Add deploying_with_k8s guide ( #8451 )
2024-10-07 13:31:45 -07:00
151ef4efd2
[Model] Support NVLM-D and fix QK Norm in InternViT ( #9045 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2024-10-07 11:55:12 +00:00
f19da64871
[Core] Refactor GGUF parameters packing and forwarding ( #8859 )
2024-10-07 10:01:46 +00:00
4f95ffee6f
[Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend ( #9089 )
2024-10-07 06:50:35 +00:00
8c6de96ea1
[Model] Explicit interface for vLLM models and support OOT embedding models ( #9108 )
2024-10-07 06:10:35 +00:00
18b296fdb2
[core] remove beam search from the core ( #9105 )
2024-10-07 05:47:04 +00:00
c8f26bb636
[BugFix][Core] Fix BlockManagerV2 when Encoder Input is None ( #9103 )
2024-10-07 03:52:42 +00:00
487678d046
[Bugfix][Hardware][CPU] Fix CPU model input for decode ( #9044 )
2024-10-06 19:14:27 -07:00
cb3b2b9ba4
[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling ( #9038 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-10-06 12:48:11 -07:00
fdf59d30ea
[Bugfix] fix tool_parser error handling when serving a model that does not support it ( #8709 )
2024-10-06 12:51:08 +00:00
b22b798471
[Model] PP support for embedding models and update docs ( #9090 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-10-06 16:35:27 +08:00
f22619fe96
[Misc] Remove user-facing error for removed VLM args ( #9104 )
2024-10-06 01:33:52 -07:00
168cab6bbf
[Frontend] API support for beam search ( #9087 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-10-05 23:39:03 -07:00
23fea8714a
[Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model ( #9101 )
2024-10-06 13:00:04 +08:00
f4dd830e09
[core] use forward context for flash infer ( #9097 )
2024-10-05 19:37:31 -07:00
5df1834895
[Bugfix] Fix order of arguments matters in config.yaml ( #8960 )
2024-10-05 17:35:11 +00:00
cfadb9c687
[Bugfix] Deprecate registration of custom configs to huggingface ( #9083 )
2024-10-05 21:56:40 +08:00
15986f598c
[Model] Support Gemma2 embedding model ( #9004 )
2024-10-05 06:57:05 +00:00
53b3a33027
[Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs ( #8979 )
2024-10-04 22:05:37 -07:00
dac914b0d6
[Bugfix] use blockmanagerv1 for encoder-decoder ( #9084 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-05 04:45:38 +00:00
a95354a36e
[Doc] Update README.md with Ray summit slides ( #9088 )
2024-10-05 02:54:45 +00:00
663874e048
[torch.compile] improve allreduce registration ( #9061 )
2024-10-04 16:43:50 -07:00
cc90419e89
[Hardware][Neuron] Add on-device sampling support for Neuron ( #8746 )
...
Co-authored-by: Ashraf Mahgoub <ashymahg@amazon.com >
2024-10-04 16:42:20 -07:00
27302dd584
[Misc] Fix CI lint ( #9085 )
2024-10-04 16:07:54 -07:00
0cc566ca8f
[Misc] Add random seed for prefix cache benchmark ( #9081 )
2024-10-04 21:58:57 +00:00
05c531be47
[Misc] Improved prefix cache example ( #9077 )
2024-10-04 21:38:42 +00:00
fbb74420e7
[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang ( #7412 )
2024-10-04 14:01:44 -07:00
05d686432f
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE ( #8973 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com >
Co-authored-by: Dipika Sikka <ds3822@columbia.edu >
2024-10-04 12:34:44 -06:00
0dcc8cbe5a
Adds truncate_prompt_tokens param for embeddings creation ( #8999 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
2024-10-04 18:31:40 +00:00
26aa325f4f
[Core][VLM] Test registration for OOT multimodal models ( #8717 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-04 10:38:25 -07:00
e5dc713c23
[Hardware][PowerPC] Make oneDNN dependency optional for Power ( #9039 )
...
Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com >
2024-10-04 17:24:42 +00:00
36eecfbddb
Remove AMD Ray Summit Banner ( #9075 )
2024-10-04 10:17:16 -07:00
9ade8bbc8d
[Model] add a bunch of supported lora modules for mixtral ( #9008 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2024-10-04 16:24:40 +00:00
22482e495e
[Bugfix] Flash attention arches not getting set properly ( #9062 )
2024-10-04 09:43:15 -06:00
3d826d2c52
[Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL ( #9071 )
2024-10-04 14:34:58 +00:00
0e36fd4909
[Misc] Move registry to its own file ( #9064 )
2024-10-04 10:01:37 +00:00
0f6d7a9a34
[Models] Add remaining model PP support ( #7168 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Signed-off-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-04 10:56:58 +08:00
303d44790a
[Misc] Enable multi-step output streaming by default ( #9047 )
2024-10-03 22:55:42 -04:00
aeb37c2a72
[CI/Build] Per file CUDA Archs (improve wheel size and dev build times) ( #8845 )
2024-10-03 22:55:25 -04:00
3dbb215b38
[Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model ( #8405 )
2024-10-04 10:36:39 +08:00
2838d6b38e
[Bugfix] Weight loading fix for OPT model ( #9042 )
...
Co-authored-by: dvres <dvres@fri.uni-lj.si >
2024-10-03 19:53:29 -04:00
91add85ec4
Fix failing spec decode test ( #9054 )
2024-10-03 23:07:29 +00:00
9aaf14c62e
[misc] add forward context for attention ( #9029 )
2024-10-03 12:09:42 -07:00
63e39937f9
[Frontend] [Neuron] Parse literals out of override-neuron-config ( #8959 )
...
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-10-03 18:02:07 +00:00
f5d72b2fc6
[Core] Make BlockSpaceManagerV2 the default BlockManager to use. ( #8678 )
2024-10-03 09:44:21 -07:00
83caf35e08
[BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser ( #9020 )
2024-10-03 16:44:52 +08:00
01843c89b8
[Misc] log when using default MoE config ( #8971 )
2024-10-03 04:31:07 +00:00
19a4dd0990
[Bugfix] example template should not add parallel_tool_prompt if tools is none ( #9007 )
2024-10-03 03:04:17 +00:00
18c2e30c57
[Doc] Update Granite model docs ( #9025 )
2024-10-03 02:42:24 +00:00
19f0d25796
[Model] Adding Granite MoE. ( #8206 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-03 09:33:57 +08:00
f58d4fccc9
[OpenVINO] Enable GPU support for OpenVINO vLLM backend ( #8192 )
2024-10-02 17:50:01 -04:00
afb050b29d
[Core] CUDA Graphs for Multi-Step + Chunked-Prefill ( #8645 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-10-02 19:44:39 +00:00
7f60520deb
[Misc] Update Default Image Mapper Error Log ( #8977 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-10-02 11:44:38 +00:00
563649aafe
[Core] Combined support for multi-step scheduling, chunked prefill & prefix caching ( #8804 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Andrew Feldman <afeld2012@gmail.com >
2024-10-02 07:52:20 +00:00
1570203864
[Spec Decode] (1/2) Remove batch expansion ( #8839 )
2024-10-01 16:04:42 -07:00
22f5851b80
Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows ( #8997 )
2024-10-01 11:07:06 -07:00
4f341bd4bf
[Doc] Update list of supported models ( #8987 )
2024-10-02 00:35:39 +08:00
35bd215168
[Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API ( #8965 )
2024-10-01 09:58:06 +00:00
1fe0a4264a
[Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders ( #8991 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-01 09:52:44 +00:00
bc4eb65b54
[Bugfix] Fix Fuyu tensor parallel inference ( #8986 )
2024-10-01 17:51:41 +08:00
82f3937e59
[Misc] add process_weights_after_loading for DummyLoader ( #8969 )
2024-10-01 03:46:41 +00:00
7da2487591
[torch.compile] fix tensor alias ( #8982 )
2024-10-01 03:40:48 +00:00
aaccca2b4d
[CI/Build] Fix machete generated kernel files ordering ( #8976 )
...
Signed-off-by: kevin <kevin@anyscale.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-10-01 03:33:12 +00:00
062c89e7c9
[Frontend][Core] Move guided decoding params into sampling params ( #8252 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-01 09:34:25 +08:00
bce324487a
[CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. ( #8975 )
2024-10-01 00:51:40 +00:00
1425a1bcf9
[ci] Add CODEOWNERS for test directories ( #8795 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-01 00:47:08 +00:00
1cabfcefb6
[Misc] Adjust max_position_embeddings for LoRA compatibility ( #8957 )
2024-09-30 12:57:39 +00:00
be76e5aabf
[Core] Make scheduling policy settable via EngineArgs ( #8956 )
2024-09-30 12:28:44 +00:00
2ae25f79cf
[Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg ( #8946 )
2024-09-30 13:01:20 +08:00
8e60afa15e
[Model][LoRA] LoRA support added for MiniCPMV2.6 ( #8943 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-30 04:31:55 +00:00
b6d7392579
[Misc][CI/Build] Include cv2 via mistral_common[opencv] ( #8951 )
2024-09-30 04:28:26 +00:00
e01ab595d8
[Model] support input embeddings for qwen2vl ( #8856 )
2024-09-30 03:16:10 +00:00
f13a07b1f8
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model ( #8533 )
2024-09-29 17:35:58 -04:00
6c9ba48fde
[Frontend] Added support for HF's new continue_final_message parameter ( #8942 )
2024-09-29 17:59:47 +00:00
1fb9c1b0bf
[Misc] Fix typo in BlockSpaceManagerV1 ( #8944 )
2024-09-29 15:05:54 +00:00
31f46a0d35
[BugFix] Fix seeded random sampling with encoder-decoder models ( #8870 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-29 09:43:14 +00:00
3d49776bbb
[Model][LoRA] LoRA support added for MiniCPMV2.5 ( #7199 )
2024-09-29 06:59:45 +00:00
bc2ef1f77c
[Model] Support Qwen2.5-Math-RM-72B ( #8896 )
2024-09-28 21:19:39 -07:00
2e7fe7e79f
[Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching ( #8930 )
2024-09-29 03:13:01 +00:00
26a68d5d7e
[CI/Build] Add test decorator for minimum GPU memory ( #8925 )
2024-09-29 02:50:51 +00:00
d081da0064
[Bugfix] Fix Marlin MoE act order when is_k_full == False ( #8741 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-09-28 18:19:40 -07:00
5bf8789b2a
[Bugfix] Block manager v2 with preemption and lookahead slots ( #8824 )
2024-09-29 09:17:45 +08:00
d1537039ce
[Core] Improve choice of Python multiprocessing method ( #8823 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-29 09:17:07 +08:00
cc276443b5
[doc] organize installation doc and expose per-commit docker ( #8931 )
2024-09-28 17:48:41 -07:00
e585b583a9
[Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 ( #8891 )
2024-09-28 18:51:22 +00:00
090e945e36
[Frontend] Make beam search emulator temperature modifiable ( #8928 )
...
Co-authored-by: Eduard Balzin <nfunctor@yahoo.fr >
2024-09-28 11:30:21 -07:00
e1a3f5e831
[CI/Build] Update models tests & examples ( #8874 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-28 09:54:35 -07:00
19d02ff938
[Bugfix] Fix PP for Multi-Step ( #8887 )
2024-09-28 08:52:46 -07:00
39d3f8d94f
[Bugfix] Fix code for downloading models from modelscope ( #8443 )
2024-09-28 08:24:12 -07:00
b0298aa8cc
[Misc] Remove vLLM patch of BaichuanTokenizer ( #8921 )
2024-09-28 08:11:25 +00:00
260024a374
[Bugfix][Intel] Fix XPU Dockerfile Build ( #7824 )
...
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-27 23:45:50 -07:00
d86f6b2afb
[misc] fix wheel name ( #8919 )
2024-09-27 22:10:44 -07:00
bd429f2b75
[Core] Priority-based scheduling in async engine ( #8850 )
2024-09-27 15:07:10 -07:00
18e60d7d13
[misc][distributed] add VLLM_SKIP_P2P_CHECK flag ( #8911 )
2024-09-27 14:27:56 -07:00
c2ec430ab5
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path ( #8378 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-09-27 13:32:07 -07:00
c5d55356f9
[Bugfix] fix for deepseek w4a16 ( #8906 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-09-27 13:12:34 -06:00
172d1cd276
[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method ( #7271 )
2024-09-27 14:25:10 -04:00
a9b15c606f
[torch.compile] use empty tensor instead of None for profiling ( #8875 )
2024-09-27 08:11:32 -07:00
8df2dc3c88
[TPU] Update pallas.py to support trillium ( #8871 )
2024-09-27 01:16:55 -07:00
6d792d2f31
[Bugfix][VLM] Fix Fuyu batching inference with max_num_seqs>1 ( #8892 )
2024-09-27 01:15:58 -07:00
0e088750af
[MISC] Fix invalid escape sequence '\' ( #8830 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2024-09-27 01:13:25 -07:00
dc4e3df5c2
[misc] fix collect env ( #8894 )
2024-09-27 00:26:38 -07:00
3b00b9c26c
[Core] rename PromptInputs and inputs ( #8876 )
2024-09-26 20:35:15 -07:00
344cd2b6f4
[Feature] Add support for Llama 3.1 and 3.2 tool use ( #8343 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-09-26 17:01:42 -07:00
1b49148e47
[Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility ( #8764 )
2024-09-26 16:54:09 -07:00
4b377d6feb
[BugFix] Fix test breakages from transformers 4.45 upgrade ( #8829 )
2024-09-26 16:46:43 -07:00
71d21c73ab
[Bugfix] Fixup advance_step.cu warning ( #8815 )
2024-09-26 16:23:45 -07:00
ee2da3e9ef
fix validation: Only set tool_choice auto if at least one tool is provided ( #8568 )
2024-09-26 16:23:17 -07:00
e2f6f26e86
[Bugfix] Fix print_warning_once's line info ( #8867 )
2024-09-26 16:18:26 -07:00
b28d2104de
[Misc] Change dummy profiling and BOS fallback warns to log once ( #8820 )
2024-09-26 16:18:14 -07:00
93d364da34
[Bugfix] Include encoder prompts len to non-stream api usage response ( #8861 )
2024-09-26 15:47:00 -07:00
d9cfbc891e
[ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM ( #8872 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-26 15:02:16 -07:00
70de39f6b4
[misc][installation] build from source without compilation ( #8818 )
2024-09-26 13:19:04 -07:00
68988d4e0d
[CI/Build] Fix missing ci dependencies ( #8834 )
2024-09-26 11:04:39 -07:00
520db4dbc1
[Docs] Add README to the build docker image ( #8825 )
2024-09-26 11:02:52 -07:00
f70bccac75
[Build/CI] Upgrade to gcc 10 in the base build Docker image ( #8814 )
2024-09-26 10:07:18 -07:00
4bb98f2190
[Misc] Update config loading for Qwen2-VL and remove Granite ( #8837 )
2024-09-26 07:45:30 -07:00
7193774b1f
[Misc] Support quantization of MllamaForCausalLM ( #8822 )
2024-09-25 14:46:22 -07:00
e2c6e0a829
[Doc] Update doc for Transformers 4.45 ( #8817 )
2024-09-25 13:29:48 -07:00
770ec6024f
[Model] Add support for the multi-modal Llama 3.2 model ( #8811 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chang Su <chang.s.su@oracle.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-25 13:29:32 -07:00
4f1ba0844b
Revert "rename PromptInputs and inputs with backward compatibility ( #8760 )" ( #8810 )
2024-09-25 10:36:26 -07:00
873edda6cf
[Misc] Support FP8 MoE for compressed-tensors ( #8588 )
2024-09-25 09:43:36 -07:00
64840dfae4
[Frontend] MQLLMEngine supports profiling. ( #8761 )
2024-09-25 09:37:41 -07:00
28e1299e60
rename PromptInputs and inputs with backward compatibility ( #8760 )
2024-09-25 09:36:47 -07:00
0c4d2ad5e6
[VLM][Bugfix] internvl with num_scheduler_steps > 1 ( #8614 )
2024-09-25 09:35:53 -07:00
c6f2485c82
[Misc] Add extra deps for openai server image ( #8792 )
2024-09-25 09:35:23 -07:00
300da09177
[Kernel] Fullgraph and opcheck tests ( #8479 )
2024-09-25 08:35:52 -06:00
1c046447a6
[CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade ( #8777 )
2024-09-25 22:26:37 +08:00
8fae5ed7f6
[Misc] Fix minor typo in scheduler ( #8765 )
2024-09-25 00:53:03 -07:00
3368c3ab36
[Bugfix] Ray 2.9.x doesn't expose available_resources_per_node ( #8767 )
...
Signed-off-by: darthhexx <darthhexx@gmail.com >
2024-09-25 00:52:26 -07:00
1ac3de09cd
[Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer ( #8672 )
2024-09-25 07:49:26 +00:00
3e073e66f1
[Bugfix] load fc bias from config for eagle ( #8790 )
2024-09-24 23:16:30 -07:00
c23953675f
[Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend ( #8770 )
2024-09-24 23:16:11 -07:00
e3dd0692fa
[BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv ( #8250 )
2024-09-25 05:53:43 +00:00
fc3afc20df
Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 ( #8752 )
2024-09-24 21:26:36 -07:00
b4522474a3
[Bugfix][Kernel] Implement acquire/release polyfill for Pascal ( #8776 )
2024-09-24 21:26:33 -07:00
ee777d9c30
Fix test_schedule_swapped_simple in test_scheduler.py ( #8780 )
2024-09-24 21:26:18 -07:00
6e0c9d6bd0
[Bugfix] Use heartbeats instead of health checks ( #8583 )
2024-09-24 20:37:38 -07:00
6da1ab6b41
[Core] Adding Priority Scheduling ( #5958 )
2024-09-24 19:50:50 -07:00
01b6f9e1f0
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding ( #8047 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-09-24 17:29:56 -07:00
13f9f7a3d0
[Misc] Upgrade bitsandbytes to the latest version 0.44.0 ( #8768 )
2024-09-24 17:08:55 -07:00
1e7d5c01f5
[misc] soft drop beam search ( #8763 )
2024-09-24 15:48:39 -07:00
2467b642dd
[CI/Build] fix setuptools-scm usage ( #8771 )
2024-09-24 12:38:12 -07:00
72fc97a0f1
[Bugfix] Fix torch dynamo fixes caused by replace_parameters ( #8748 )
2024-09-24 14:33:21 -04:00
2529d09b5a
[Frontend] Batch inference for llm.chat() API ( #8648 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-09-24 09:44:11 -07:00
a928ded995
[Kernel] Split Marlin MoE kernels into multiple files ( #8661 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-09-24 09:31:42 -07:00
cc4325b66a
[Bugfix] Fix potentially unsafe custom allreduce synchronization ( #8558 )
2024-09-24 01:08:14 -07:00
8ff7ced996
[Model] Expose Phi3v num_crops as a mm_processor_kwarg ( #8658 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-24 07:36:46 +00:00
3f06bae907
[Core][Model] Support loading weights by ID within models ( #7931 )
2024-09-24 07:14:15 +00:00
b8747e8a7c
[MISC] Skip dumping inputs when unpicklable ( #8744 )
2024-09-24 06:10:03 +00:00
3185fb0cca
Revert "[Core] Rename PromptInputs to PromptType, and inputs to prompt" ( #8750 )
2024-09-24 05:45:20 +00:00
0250dd68c5
re-implement beam search on top of vllm core ( #8726 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com >
2024-09-23 22:08:12 -07:00
88577ac928
Fix tests in test_scheduler.py that fail with BlockManager V2 ( #8728 )
2024-09-24 04:43:13 +00:00
530821d00c
[Hardware][AMD] ROCm6.2 upgrade ( #8674 )
2024-09-23 18:52:39 -07:00
1a2aef3e59
Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse ( #8335 )
2024-09-23 15:38:04 -07:00
5f7bb58427
Fix typical acceptance sampler with correct recovered token ids ( #8562 )
2024-09-23 12:32:27 -07:00
b05f5c9238
[Core] Allow IPv6 in VLLM_HOST_IP with zmq ( #8575 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-09-23 12:15:41 -07:00
9b0e3ec970
[Kernel][LoRA] Add assertion for punica sgmv kernels ( #7585 )
2024-09-23 18:57:42 +00:00
86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin ( #7701 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-09-23 13:46:26 -04:00
ee5f34b1c2
[CI/Build] use setuptools-scm to set __version__ ( #4738 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-23 09:44:26 -07:00
f2bd246c17
[VLM] Fix paligemma, fuyu and persimmon with transformers 4.45: use config.text_config.vocab_size ( #8707 )
2024-09-23 14:43:09 +00:00
a79e522984
[Model] Support pp for qwen2-vl ( #8696 )
2024-09-23 13:46:59 +00:00
3e83c12b5c
[Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner ( #8733 )
2024-09-23 13:15:16 +00:00
e551ca1555
[Hardware][CPU] Refactor CPU model runner ( #8729 )
2024-09-23 20:12:20 +08:00
9b8c8ba119
[Core][Frontend] Support Passing Multimodal Processor Kwargs ( #8657 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-23 07:44:48 +00:00
d23679eb99
[Bugfix] fix docker build for xpu ( #8652 )
2024-09-22 22:54:18 -07:00
57a0702e63
[Bugfix] Fix CPU CMake build ( #8723 )
...
Co-authored-by: Yuan <yuan.zhou@intel.com >
2024-09-22 20:40:46 -07:00
3dda7c2250
[Bugfix] Avoid some bogus messages RE CUTLASS's revision when building ( #8702 )
2024-09-22 22:24:59 -04:00
92ba7e7477
[misc] upgrade mistral-common ( #8715 )
2024-09-22 15:41:59 -07:00
d4a2ac8302
[build] enable existing pytorch (for GH200, aarch64, nightly) ( #8713 )
2024-09-22 12:47:54 -07:00
c6bd70d772
[SpecDec][Misc] Cleanup, remove bonus token logic. ( #8701 )
2024-09-22 12:34:14 -07:00
5b59532760
[Model][VLM] Add LLaVA-Onevision model support ( #8486 )
...
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-22 10:51:44 -07:00
ca2b628b3c
[MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler ( #8703 )
2024-09-22 10:44:09 -07:00
8ca5051b9a
[Misc] Use NamedTuple in Multi-image example ( #8705 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-22 20:56:20 +08:00
06ed2815e2
[Model] Refactor BLIP/BLIP-2 to support composite model loading ( #8407 )
2024-09-22 12:24:21 +00:00
0e40ac9b7b
[ci][build] fix vllm-flash-attn ( #8699 )
2024-09-21 23:24:58 -07:00
13d88d4137
[Bugfix] Refactor composite weight loading logic ( #8656 )
2024-09-22 04:33:27 +00:00
d66ac62854
[Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu ( #8643 )
2024-09-21 23:45:02 +00:00
9dc7c6c7f3
[dbrx] refactor dbrx experts to extend FusedMoe class ( #8518 )
2024-09-21 15:09:39 -06:00
ec4aaad812
[Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 ( #8646 )
2024-09-21 09:20:54 +00:00
4dfdf43196
[Doc] Fix typo in AMD installation guide ( #8689 )
2024-09-21 00:24:12 -07:00
5e85f4f82a
[VLM] Use SequenceData.from_token_counts to create dummy data ( #8687 )
2024-09-20 23:28:56 -07:00
71c60491f2
[Kernel] Build flash-attn from source ( #8245 )
2024-09-20 23:27:10 -07:00
0faab90eb0
[beam search] add output for manually checking the correctness ( #8684 )
2024-09-20 19:55:33 -07:00
0455c46ed4
[Core] Factor out common code in SequenceData and Sequence ( #8675 )
2024-09-21 02:30:39 +00:00
d4bf085ad0
[MISC] add support custom_op check ( #8557 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-20 19:03:55 -07:00
0057894ef7
[Core] Rename PromptInputs and inputs ( #8673 )
2024-09-20 19:00:54 -07:00
0f961b3ce9
[Bugfix] Fix incorrect llava next feature size calculation ( #8496 )
2024-09-20 22:48:32 +00:00
7f9c8902e3
[Hardware][AWS] update neuron to 2.20 ( #8676 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-09-20 15:19:44 -07:00
7c8566aa4f
[Doc] neuron documentation update ( #8671 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-09-20 15:04:37 -07:00
b4e4eda92e
[Bugfix][Core] Fix tekken edge case for mistral tokenizer ( #8640 )
2024-09-20 14:33:03 -07:00
2874bac618
[Bugfix] Config got an unexpected keyword argument 'engine' ( #8556 )
2024-09-20 14:00:45 -07:00
035fa895ec
[Misc] Show AMD GPU topology in collect_env.py ( #8649 )
2024-09-20 13:52:19 -07:00
b28298f2f4
[Bugfix] Validate SamplingParam n is an int ( #8548 )
2024-09-20 12:46:02 -07:00
2940afa04e
[CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build ( #8670 )
2024-09-20 10:27:44 -07:00
3b63de9353
[Model] Add OLMoE ( #7922 )
2024-09-20 09:31:41 -07:00
260d40b5ea
[Core] Support Lora lineage and base model metadata management ( #6315 )
2024-09-20 06:20:56 +00:00
9e5ec35b1f
[bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata ( #8474 )
2024-09-19 20:49:54 -07:00
18ae428a0d
[Bugfix] Fix Phi3.5 mini and MoE LoRA inference ( #8571 )
2024-09-20 08:54:02 +08:00
de6f90a13d
[Misc] guard against change in cuda library name ( #8609 )
2024-09-20 06:36:30 +08:00
6cb748e190
[CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail ( #8551 )
2024-09-19 13:06:32 -07:00
9e99407e3c
Create SECURITY.md ( #8642 )
2024-09-19 12:16:28 -07:00
ea4647b7d7
[Doc] Add documentation for GGUF quantization ( #8618 )
2024-09-19 13:15:55 -06:00
e42c634acb
[Core] simplify logits resort in _apply_top_k_top_p ( #8619 )
2024-09-19 18:28:25 +00:00
9cc373f390
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention ( #8577 )
2024-09-19 17:37:57 +00:00
76515f303b
[Frontend] Use MQLLMEngine for embeddings models too ( #8584 )
2024-09-19 12:51:06 -04:00
855c8ae2c9
[MISC] remove engine_use_ray in benchmark_throughput.py ( #8615 )
2024-09-18 22:33:20 -07:00
c52ec5f034
[Bugfix] fixing sonnet benchmark bug in benchmark_serving.py ( #8616 )
2024-09-19 05:24:24 +00:00
02c9afa2d0
Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" ( #8593 )
2024-09-19 04:14:28 +00:00
3118f63385
[Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. ( #8545 )
2024-09-19 02:24:15 +00:00
4c34ce8916
[Kernel] Remove marlin moe templating on thread_m_blocks ( #8573 )
...
Co-authored-by: lwilkinson@neuralmagic.com
2024-09-19 01:42:49 +00:00
0d47bf3bf4
[Bugfix] add dead_error property to engine client ( #8574 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-18 22:10:01 +00:00
d9cd78eb71
[BugFix] Nonzero exit code if MQLLMEngine startup fails ( #8572 )
2024-09-18 20:17:55 +00:00
db9120cded
[Kernel] Change interface to Mamba selective_state_update for continuous batching ( #8039 )
2024-09-18 20:05:06 +00:00
b3195bc9e4
[AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call ( #8380 )
...
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-18 10:41:08 -07:00
e18749ff09
[Model] Support Solar Model ( #8386 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-18 11:04:00 -06:00
d65798f78c
[Core] zmq: bind only to 127.0.0.1 for local-only usage ( #8543 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-09-18 16:10:27 +00:00
a8c1d161a7
[Core] *Prompt* logprobs support in Multi-step ( #8199 )
2024-09-18 08:38:43 -07:00
7c7714d856
[Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH ( #8157 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-09-18 13:56:58 +00:00
9d104b5beb
[CI/Build] Update Ruff version ( #8469 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-18 11:00:56 +00:00
6ffa3f314c
[CI/Build] Avoid CUDA initialization ( #8534 )
2024-09-18 10:38:11 +00:00
e351572900
[Misc] Add argument to disable FastAPI docs ( #8554 )
2024-09-18 09:51:59 +00:00
95965d31b6
[CI/Build] fix Dockerfile.cpu on podman ( #8540 )
2024-09-18 10:49:53 +08:00
8110e44529
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching ( #8012 )
2024-09-17 23:44:27 +00:00
09deb4721f
[CI/Build] Excluding kernels/test_gguf.py from ROCm ( #8520 )
2024-09-17 16:40:29 -07:00
fa0c114fad
[doc] improve installation doc ( #8550 )
...
Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com >
2024-09-17 16:24:06 -07:00
98f9713399
[Bugfix] Fix TP > 1 for new granite ( #8544 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-17 23:17:08 +00:00
56c3de018c
[Misc] Don't dump contents of kvcache tensors on errors ( #8527 )
2024-09-17 12:24:29 -07:00
a54ed80249
[Model] Add mistral function calling format to all models loaded with "mistral" format ( #8515 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-17 17:50:37 +00:00
9855b99502
[Feature][kernel] tensor parallelism with bitsandbytes quantization ( #8434 )
2024-09-17 08:09:12 -07:00
1009e93c5d
[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models ( #7631 )
2024-09-17 07:35:01 -07:00
1b6de8352b
[Benchmark] Support sample from HF datasets and image input for benchmark_serving ( #8495 )
2024-09-17 07:34:27 +00:00
cbdb252259
[Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change ( #8509 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-09-17 00:06:26 -07:00
99aa4eddaf
[torch.compile] register allreduce operations as custom ops ( #8526 )
2024-09-16 22:57:57 -07:00
ee2bceaaa6
[Misc][Bugfix] Disable guided decoding for mistral tokenizer ( #8521 )
2024-09-16 22:22:45 -07:00
1c1bb388e0
[Frontend] Improve Nullable kv Arg Parsing ( #8525 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-17 04:17:32 +00:00
546034b466
[refactor] remove triton based sampler ( #8524 )
2024-09-16 20:04:48 -07:00
cca61642e0
[Bugfix] Fix 3.12 builds on main ( #8510 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-17 00:01:45 +00:00
5ce45eb54d
[misc] small qol fixes for release process ( #8517 )
2024-09-16 15:11:27 -07:00
5478c4b41f
[perf bench] set timeout to debug hanging ( #8516 )
2024-09-16 14:30:02 -07:00
47f5e03b5b
[Bugfix] Bind api server port before starting engine ( #8491 )
2024-09-16 13:56:28 -07:00
2759a43a26
[doc] update doc on testing and debugging ( #8514 )
2024-09-16 12:10:23 -07:00
5d73ae49d6
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels ( #7270 )
2024-09-16 11:52:40 -07:00
781e3b9a42
[Bugfix][Kernel] Fix build for sm_60 in GGUF kernel ( #8506 )
2024-09-16 12:15:57 -06:00
acd5511b6d
[BugFix] Fix clean shutdown issues ( #8492 )
2024-09-16 09:33:46 -07:00
837c1968f9
[Frontend] Expose revision arg in OpenAI server ( #8501 )
2024-09-16 15:55:26 +00:00
a091e2da3e
[Kernel] Enable 8-bit weights in Fused Marlin MoE ( #8032 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com >
2024-09-16 09:47:19 -06:00
fc990f9795
[Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel ( #8357 )
2024-09-15 16:51:44 -06:00
3724d5f6b5
[Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations ( #8490 )
2024-09-15 04:20:05 +00:00
50e9ec41fc
[TPU] Implement multi-step scheduling ( #8489 )
2024-09-14 16:58:31 -07:00
47790f3e32
[torch.compile] add a flag to disable custom op ( #8488 )
2024-09-14 13:07:16 -07:00
a36e070dad
[torch.compile] fix functionalization ( #8480 )
2024-09-14 09:46:04 -07:00
8a0cf1ddc3
[Model] support minicpm3 ( #8297 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-14 14:50:26 +00:00
1ef0d2efd0
[Kernel][Hardware][Amd] Custom paged attention kernel for rocm ( #8310 )
2024-09-13 17:01:11 -07:00
851725202a
[Hardware][intel GPU] bump up ipex version to 2.3 ( #8365 )
...
Co-authored-by: Yan Ma <yan.ma@intel.com >
2024-09-13 16:54:34 -07:00
9ba0817ff1
bump version to v0.6.1.post2 ( #8473 )
2024-09-13 11:35:00 -07:00
18e9e1f7b3
[HotFix] Fix final output truncation with stop string + streaming ( #8468 )
2024-09-13 11:31:12 -07:00
f57092c00b
[Doc] Add oneDNN installation to CPU backend documentation ( #8467 )
2024-09-13 18:06:30 +00:00
a84e598e21
[CI/Build] Reorganize models tests ( #7820 )
2024-09-13 10:20:06 -07:00
0a4806f0a9
[plugin][torch.compile] allow to add custom compile backend ( #8445 )
2024-09-13 09:32:42 -07:00
ecd7a1d5b6
[Installation] Gate FastAPI version for Python 3.8 ( #8456 )
2024-09-13 09:02:26 -07:00
a2469127db
[misc][ci] fix quant test ( #8449 )
2024-09-13 17:20:14 +08:00
06311e2956
[Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 ( #8442 )
2024-09-13 07:58:28 +00:00
cab69a15e4
[doc] recommend pip instead of conda ( #8446 )
2024-09-12 23:52:41 -07:00
9b4a3b235e
[CI/Build] Enable InternVL2 PP test only on single node ( #8437 )
2024-09-13 06:35:20 +00:00
acda0b35d0
bump version to v0.6.1.post1 ( #8440 )
2024-09-12 21:39:49 -07:00
ba77527955
[bugfix] torch profiler bug for single gpu with GPUExecutor ( #8354 )
2024-09-12 21:30:00 -07:00
6821020109
[Bugfix] Fix async log stats ( #8417 )
2024-09-12 20:48:59 -07:00
8427550488
[CI/Build] Update pixtral tests to use JSON ( #8436 )
2024-09-13 03:47:52 +00:00
3f79bc3d1a
[Bugfix] Bump fastapi and pydantic version ( #8435 )
2024-09-13 03:21:42 +00:00
40c396533d
[Bugfix] Mapping physical device indices for e2e test utils ( #8290 )
2024-09-13 11:06:28 +08:00
5ec9c0fb3c
[Core] Factor out input preprocessing to a separate class ( #7329 )
2024-09-13 02:56:13 +00:00
8f44a92d85
[BugFix] fix group_topk ( #8430 )
2024-09-13 09:23:42 +08:00
360ddbd37e
[Misc] Update Pixtral example ( #8431 )
2024-09-12 17:31:18 -07:00
a480939e8e
[Bugfix] Fix weight loading issue by renaming variable. ( #8293 )
2024-09-12 19:25:00 -04:00
d31174a4e1
[Hotfix][Pixtral] Fix multiple images bugs ( #8415 )
2024-09-12 15:21:51 -07:00
b61bd98f90
[CI/Build] Disable multi-node test for InternVL2 ( #8428 )
2024-09-12 15:05:35 -07:00
c16369455f
[Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models ( #8425 )
2024-09-12 14:06:51 -07:00
019877253b
[Bugfix] multi-step + flashinfer: ensure cuda graph compatible ( #8427 )
2024-09-12 21:01:50 +00:00
551ce01078
[Core] Add engine option to return only deltas or final output ( #7381 )
2024-09-12 12:02:00 -07:00
a6c0f3658d
[multi-step] add flashinfer backend ( #7928 )
2024-09-12 11:16:22 -07:00
f2e263b801
[Bugfix] Offline mode fix ( #8376 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-12 11:11:57 -07:00
1f0c75afa9
[BugFix] Fix Duplicate Assignment in Hermes2ProToolParser ( #8423 )
2024-09-12 11:10:11 -07:00
8a23e93302
[BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance ( #8403 )
2024-09-12 10:47:42 -07:00
c6202daeed
[Model] Support multiple images for qwen-vl ( #8247 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-12 10:10:54 -07:00
e56bf27741
[Bugfix] Fix InternVL2 inference with various num_patches ( #8375 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-12 10:10:35 -07:00
520ca380ae
[Hotfix][VLM] Fixing max position embeddings for Pixtral ( #8399 )
2024-09-12 09:28:37 -07:00
7de49aa86c
[torch.compile] hide slicing under custom op for inductor ( #8384 )
2024-09-12 00:11:55 -07:00
42ffba11ad
[Misc] Use RoPE cache for MRoPE ( #8396 )
2024-09-11 23:13:14 -07:00
295c4730a8
[Misc] Raise error when using encoder/decoder model with cpu backend ( #8355 )
2024-09-12 05:45:24 +00:00
1bf2dd9df0
[Gemma2] add bitsandbytes support for Gemma2 ( #8338 )
2024-09-11 21:53:12 -07:00
5a60699c45
[Bugfix]: Fix the logic for deciding if tool parsing is used ( #8366 )
2024-09-12 03:55:30 +00:00
b6c75e1cf2
Fix the AMD weight loading tests ( #8390 )
2024-09-11 20:35:33 -07:00
b71c956deb
[TPU] Use Ray for default distributed backend ( #8389 )
2024-09-11 20:31:51 -07:00
f842a7aff1
[misc] remove engine_use_ray ( #8126 )
2024-09-11 18:23:36 -07:00
a65cb16067
[MISC] Dump model runner inputs when crashing ( #8305 )
2024-09-12 01:12:25 +00:00
3fd2b0d21c
Bump version to v0.6.1 ( #8379 )
2024-09-11 14:42:11 -07:00
d394787e52
Pixtral ( #8377 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-11 14:41:55 -07:00
775f00f81e
[Speculative Decoding] Test refactor ( #8317 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-11 14:07:34 -07:00
8baa454937
[Misc] Move device options to a single place ( #8322 )
2024-09-11 13:25:58 -07:00
73202dbe77
[Kernel][Misc] register ops to prevent graph breaks ( #6917 )
...
Co-authored-by: Sage Moore <sage@neuralmagic.com >
2024-09-11 12:52:19 -07:00
7015417fd4
[Bugfix] Add missing attributes in mistral tokenizer ( #8364 )
2024-09-11 11:36:54 -07:00
aea02f30de
[CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation ( #8373 )
2024-09-11 18:31:41 +00:00
0b952af458
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend ( #7257 )
2024-09-11 09:46:46 -07:00
3b7fea770f
[Model][VLM] Add Qwen2-VL model support ( #7905 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-11 09:31:19 -07:00
cea95dfb94
[Frontend] Create ErrorResponse instead of raising exceptions in run_batch ( #8347 )
2024-09-11 05:30:11 +00:00
6a512a00df
[model] Support for Llava-Next-Video model ( #7559 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-10 22:21:36 -07:00
efcf946a15
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. ( #6112 )
2024-09-11 00:38:40 -04:00
1230263e16
[Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel ( #8299 )
2024-09-11 10:11:01 +08:00
e497b8aeff
[Misc] Skip loading extra bias for Qwen2-MOE GPTQ models ( #8329 )
2024-09-10 20:59:19 -04:00
94144e726c
[CI/Build][Kernel] Update CUTLASS to 3.5.1 tag ( #8043 )
2024-09-10 23:51:58 +00:00
1d5e397aa4
[Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers ( #8172 )
2024-09-10 23:46:08 +00:00
22f3a4bc6c
[Bugfix] lookahead block table with cuda graph max capture ( #8340 )
...
[Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (#8340 )
2024-09-10 16:00:35 -07:00
b1f3e18958
[MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled ( #8342 )
2024-09-10 22:28:28 +00:00
04e7c4e771
[Misc] remove peft as dependency for prompt models ( #8162 )
2024-09-10 17:21:56 -04:00
5faedf1b62
[Spec Decode] Move ops.advance_step to flash attn advance_step ( #8224 )
2024-09-10 13:18:14 -07:00
02751a7a42
Fix ppc64le buildkite job ( #8309 )
2024-09-10 12:58:34 -07:00
f421f3cefb
[CI/Build] Enabling kernels tests for AMD, ignoring some of them that fail ( #8130 )
2024-09-10 11:51:15 -07:00
8c054b7a62
[Frontend] Clean up type annotations for mistral tokenizer ( #8314 )
2024-09-10 16:49:11 +00:00
6234385f4a
[CI/Build] enable ccache/sccache for HIP builds ( #8327 )
2024-09-10 08:55:08 -07:00
da1a844e61
[Bugfix] Fix missing post_layernorm in CLIP ( #8155 )
2024-09-10 08:22:50 +00:00
a1d874224d
Add NVIDIA Meetup slides, announce AMD meetup, and add contact info ( #8319 )
2024-09-09 23:21:00 -07:00
6cd5e5b07e
[Misc] Fused MoE Marlin support for GPTQ ( #8217 )
2024-09-09 23:02:52 -04:00
c7cb5c3335
[Misc] GPTQ Activation Ordering ( #8135 )
2024-09-09 16:27:26 -04:00
f9b4a2d415
[Bugfix] Correct adapter usage for cohere and jamba ( #8292 )
2024-09-09 11:20:46 -07:00
58fcc8545a
[Frontend] Add progress reporting to run_batch.py ( #8060 )
...
Co-authored-by: Adam Lugowski <adam.lugowski@parasail.io >
2024-09-09 11:16:37 -07:00
08287ef675
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility ( #8272 )
2024-09-09 10:45:11 -04:00
4ef41b8476
[Bugfix] Fix async postprocessor in case of preemption ( #8267 )
2024-09-07 21:01:51 -07:00
cfe712bf1a
[CI/Build] Use python 3.12 in cuda image ( #8133 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-07 13:03:16 -07:00
b962ee1470
ppc64le: Dockerfile fixed, and a script for buildkite ( #8026 )
2024-09-07 11:18:40 -07:00
36bf8150cc
[Model][VLM] Decouple weight loading logic for Paligemma ( #8269 )
2024-09-07 17:45:44 +00:00
e807125936
[Model][VLM] Support multi-images inputs for InternVL2 models ( #8201 )
2024-09-07 16:38:23 +08:00
9f68e00d27
[Bugfix] Fix broken OpenAI tensorizer test ( #8258 )
2024-09-07 08:02:39 +00:00
ce2702a923
[tpu][misc] fix typo ( #8260 )
2024-09-06 22:40:46 -07:00
795b662cff
Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) ( #8241 )
2024-09-06 20:18:16 -07:00
2f707fcb35
[Model] Multi-input support for LLaVA ( #8238 )
2024-09-07 02:57:24 +00:00
41e95c5247
[Bugfix] Fix Hermes tool call chat template bug ( #8256 )
...
Co-authored-by: Kyle Mistele <kyle@constellate.ai >
2024-09-07 10:49:01 +08:00
12dd715807
[misc] [doc] [frontend] LLM torch profiler support ( #7943 )
2024-09-06 17:48:48 -07:00
29f49cd6e3
[Model] Allow loading from original Mistral format ( #8168 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-06 17:02:05 -06:00
23f322297f
[Misc] Remove SqueezeLLM ( #8220 )
2024-09-06 16:29:03 -06:00
9db52eab3d
[Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput ( #8248 )
2024-09-06 16:26:09 -06:00
1447c97e75
[CI/Build] Increasing timeout for multiproc worker tests ( #8203 )
2024-09-06 11:51:03 -07:00
de80783b69
[Misc] Use ray[adag] dependency instead of cuda ( #7938 )
2024-09-06 09:18:35 -07:00
e5cab71531
[Frontend] Add --logprobs argument to benchmark_serving.py ( #8191 )
2024-09-06 09:01:14 -07:00
baa5467547
[BugFix] Fix Granite model configuration ( #8216 )
2024-09-06 11:39:29 +08:00
db3bf7c991
[Core] Support load and unload LoRA in api server ( #6566 )
...
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-09-05 18:10:33 -07:00
2febcf2777
[Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM ( #7962 )
2024-09-05 16:25:29 -04:00
2ee45281a5
Move verify_marlin_supported to GPTQMarlinLinearMethod ( #8165 )
2024-09-05 11:09:46 -04:00
9da25a88aa
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) ( #8029 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-05 12:48:10 +00:00
8685ba1a1e
Inclusion of InternVLChatModel in PP_SUPPORTED_MODELS (Pipeline Parallelism) ( #7860 )
2024-09-05 11:33:37 +00:00
288a938872
[Doc] Indicate more information about supported modalities ( #8181 )
2024-09-05 10:51:53 +00:00
e39ebf5cf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. ( #8173 )
2024-09-05 05:12:26 +00:00
ba262c4e5a
[ci] Mark LoRA test as soft-fail ( #8160 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-04 20:33:12 -07:00
4624d98dbd
[Misc] Clean up RoPE forward_native ( #8076 )
2024-09-04 20:31:48 -07:00
1afc931987
[bugfix] >1.43 constraint for openai ( #8169 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-04 17:35:36 -07:00
e01c2beb7d
[Doc] [Misc] Create CODE_OF_CONDUCT.md ( #8161 )
2024-09-04 16:50:13 -07:00
32e7db2536
Bump version to v0.6.0 ( #8166 )
2024-09-04 16:34:27 -07:00
008cf886c9
[Neuron] Adding support for adding/overriding neuron configuration a… ( #8062 )
...
Co-authored-by: Harsha Bikki <harbikh@amazon.com >
2024-09-04 16:33:43 -07:00
77d9e514a2
[MISC] Replace input token throughput with total token throughput ( #8164 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-04 20:23:22 +00:00
e02ce498be
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models ( #5649 )
...
Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com >
Co-authored-by: Kyle Mistele <kyle@constellate.ai >
2024-09-04 13:18:13 -07:00
561d6f8077
[CI] Change test input in Gemma LoRA test ( #8163 )
2024-09-04 13:05:50 -07:00
d1dec64243
[CI/Build][ROCm] Enabling LoRA tests on ROCm ( #7369 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-09-04 11:57:54 -07:00
2ad2e5608e
[MISC] Consolidate FP8 kv-cache tests ( #8131 )
2024-09-04 18:53:25 +00:00
d3311562fb
[Bugfix] remove post_layernorm in siglip ( #8106 )
2024-09-04 18:55:37 +08:00
ccd7207191
chore: Update check-wheel-size.py to read MAX_SIZE_MB from env ( #8103 )
2024-09-03 23:17:05 -07:00
855c262a6b
[Frontend] Multimodal support in offline chat ( #8098 )
2024-09-04 05:22:17 +00:00
2be8ec6e71
[Model] Add Ultravox support for multiple audio chunks ( #7963 )
2024-09-04 04:38:21 +00:00
e16fa99a6a
[Misc] Update fbgemmfp8 to use vLLMParameters ( #7972 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-03 20:12:41 -06:00
61f4a93d14
[TPU][Bugfix] Use XLA rank for persistent cache path ( #8137 )
2024-09-03 18:35:33 -07:00
d4db9f53c8
[Benchmark] Add --async-engine option to benchmark_throughput.py ( #7964 )
2024-09-03 20:57:41 -04:00
2188a60c7e
[Misc] Update GPTQ to use vLLMParameters ( #7976 )
2024-09-03 17:21:44 -04:00
dc0b6066ab
[CI] Change PR remainder to avoid at-mentions ( #8134 )
2024-09-03 14:11:42 -07:00
0af3abe3d3
[TPU][Bugfix] Fix next_token_ids shape ( #8128 )
2024-09-03 13:29:24 -07:00
f1575dc99f
[ci] Fix GHA workflow ( #8129 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-03 13:25:09 -07:00
c02638efb3
[CI/Build] make pip install vllm work in macos (for import only) ( #8118 )
2024-09-03 12:37:08 -07:00
652c83b697
[Misc] Raise a more informative exception in add/remove_logger ( #7750 )
2024-09-03 12:28:25 -07:00
6d646d08a2
[Core] Optimize Async + Multi-step ( #8050 )
2024-09-03 18:50:29 +00:00
95a178f861
[CI] Only PR reviewers/committers can trigger CI on PR ( #8124 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-03 11:32:27 -07:00
bd852f2a8b
[Performance] Enable chunked prefill and prefix caching together ( #8120 )
...
Co-authored-by: Tao He <sighingnow@gmail.com >
Co-authored-by: Juelianqvq <Juelianqvq@noreply.github.com >
2024-09-03 10:49:18 -07:00
ec266536b7
[Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backend ( #8061 )
2024-09-03 21:37:52 +08:00
0fbc6696c2
[Bugfix] Fix single output condition in output processor ( #7881 )
2024-09-02 20:35:42 -07:00
6e36f4fa6c
improve chunked prefill performance
...
[Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874 )
2024-09-02 14:20:12 -07:00
dd2a6a82e3
[Bugfix] Fix internlm2 tensor parallel inference ( #8055 )
2024-09-02 23:48:56 +08:00
4ca65a9763
[Core][Bugfix] Accept GGUF model without .gguf extension ( #8056 )
2024-09-02 08:43:26 -04:00
e2b2aa5a0f
[TPU] Align worker index with node boundary ( #7932 )
2024-09-01 23:09:46 -07:00
e6a26ed037
[SpecDecode][Kernel] Flashinfer Rejection Sampling ( #7244 )
2024-09-01 21:23:29 -07:00
f8d60145b4
[Model] Add Granite model ( #7436 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-09-01 18:37:18 -07:00
5b86b19954
[Misc] Optional installation of audio related packages ( #8063 )
2024-09-01 14:46:57 -07:00
5231f0898e
[Frontend][VLM] Add support for multiple multi-modal items ( #8049 )
2024-08-31 16:35:53 -07:00
8423aef4c8
[BugFix][Core] Multistep Fix Crash on Request Cancellation ( #8059 )
2024-08-31 19:44:03 +00:00
4f5d8446ed
[Bugfix] Fix ModelScope models in v0.5.5 ( #8037 )
2024-08-31 00:27:58 -07:00
d05f0a9db2
[Bugfix] Fix import error in Phi-3.5-MoE ( #8052 )
2024-08-30 22:26:55 -07:00
622f8abff8
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. ( #8013 )
2024-08-30 22:18:50 -07:00
1248e8506a
[Model] Adding support for MSFT Phi-3.5-MoE ( #7729 )
...
Co-authored-by: Your Name <you@example.com >
Co-authored-by: Zeqi Lin <zelin@microsoft.com >
Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com >
2024-08-30 13:42:57 -06:00
2684efc467
[TPU][Bugfix] Fix tpu type api ( #8035 )
2024-08-30 09:01:26 -07:00
058344f89a
[Frontend]-config-cli-args ( #7737 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com >
2024-08-30 08:21:02 -07:00
98cef6a227
[Core] Increase default max_num_batched_tokens for multimodal models ( #8028 )
2024-08-30 08:20:34 -07:00
f97be32d1d
[VLM][Model] TP support for ViTs ( #7186 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-08-30 08:19:27 -07:00
afd39a4511
[Bugfix] Fix import error in Exaone model ( #8034 )
2024-08-30 08:03:28 -07:00
2148441fd3
[TPU] Support single and multi-host TPUs on GKE ( #7613 )
2024-08-30 00:27:40 -07:00
dc13e99348
[MODEL] add Exaone model support ( #7819 )
2024-08-29 23:34:20 -07:00
34a0e96d46
[Kernel] changing fused moe kernel chunk size default to 32k ( #7995 )
2024-08-30 04:11:39 +00:00
80c7b089b1
[TPU] Async output processing for TPU ( #8011 )
2024-08-29 19:35:29 -07:00
428dd1445e
[Core] Logprobs support in Multi-step ( #7652 )
2024-08-29 19:19:08 -07:00
4abed65c58
[VLM] Disallow overflowing max_model_len for multimodal models ( #7998 )
2024-08-29 17:49:04 -07:00
0c785d344d
Add more percentiles and latencies ( #7759 )
2024-08-29 16:48:11 -07:00
4664ceaad6
support bitsandbytes 8-bit and FP4 quantized models ( #7445 )
2024-08-29 19:09:08 -04:00
257afc37c5
[Neuron] Adding support for context-length, token-gen buckets. ( #7885 )
...
Co-authored-by: Harsha Bikki <harbikh@amazon.com >
2024-08-29 13:58:14 -07:00
86a677de42
[misc] update tpu int8 to use new vLLM Parameters ( #7973 )
2024-08-29 16:46:55 -04:00
d78789ac16
[Bugfix] Fix incorrect vocab embedding shards for GGUF model in tensor parallelism ( #7954 )
2024-08-29 15:54:49 -04:00
c334b1898b
extend cuda graph size for H200 ( #7894 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-08-29 12:15:04 -07:00
6b3421567d
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto ( #7985 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-29 14:53:11 -04:00
3f60f2244e
[Core] Combine async postprocessor and multi-step ( #7921 )
2024-08-29 11:18:26 -07:00
f205c09854
[Bugfix] Unify rank computation across regular decoding and speculative decoding ( #7899 )
2024-08-28 22:18:13 -07:00
ef99a78760
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." ( #7982 )
2024-08-28 21:27:06 -07:00
74d5543ec5
[VLM][Core] Fix exceptions on ragged NestedTensors ( #7974 )
2024-08-29 03:24:31 +00:00
a7f65c2be9
[torch.compile] remove reset ( #7975 )
2024-08-28 17:32:26 -07:00
4289cad37f
[Frontend] Minor optimizations to zmq decoupled front-end ( #7957 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-08-28 17:22:43 -07:00
af59df0a10
Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test ( #7961 )
2024-08-28 19:19:17 -04:00
ce6bf3a2cf
[torch.compile] avoid Dynamo guard evaluation overhead ( #7898 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-08-28 16:10:12 -07:00
3cdfe1f38b
[Bugfix] Make torch registration of punica ops optional ( #7970 )
2024-08-28 16:11:49 -06:00
fdd9daafa3
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM ( #7651 )
2024-08-28 15:06:52 -07:00
8c56e57def
[Doc] fix 404 link ( #7966 )
2024-08-28 13:54:23 -07:00
eeffde1ac0
[TPU] Upgrade PyTorch XLA nightly ( #7967 )
2024-08-28 13:10:21 -07:00
e5697d161c
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ ( #7386 )
2024-08-28 15:37:47 -04:00
b98cc28f91
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. ( #7798 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-08-28 10:01:22 -07:00
ef9baee3c5
[Bugfix][VLM] Fix incompatibility between #7902 and #7230 ( #7948 )
2024-08-28 08:11:18 -07:00
98c12cffe5
[Doc] fix the autoAWQ example ( #7937 )
2024-08-28 12:12:32 +00:00
f52a43a8b9
[ci][test] fix pp test failure ( #7945 )
2024-08-28 01:27:07 -07:00
e3580537a4
[Performance] Enable chunked prefill and prefix caching together ( #7753 )
2024-08-28 00:36:31 -07:00
f508e03e7f
[Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) ( #7911 )
2024-08-28 00:02:30 -07:00
51f86bf487
[mypy][CI/Build] Fix mypy errors ( #7929 )
2024-08-27 23:47:44 -07:00
c166e7e43e
[Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. ( #7886 )
2024-08-27 23:13:45 -04:00
bc6e42a9b1
[hardware][rocm] allow rocm to override default env var ( #7926 )
2024-08-27 19:50:06 -07:00
fab5f53e2d
[Core][VLM] Stack multimodal tensors to represent multiple images within each prompt ( #7902 )
2024-08-28 01:53:56 +00:00
9c71c97ae2
[mypy] Enable mypy type checking for vllm/core ( #7229 )
2024-08-28 07:11:14 +08:00
5340a2dccf
[Model] Add multi-image input support for LLaVA-Next offline inference ( #7230 )
2024-08-28 07:09:02 +08:00
345be0e244
[benchmark] Update TGI version ( #7917 )
2024-08-27 15:07:53 -07:00
fc911880cc
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7766 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
2024-08-27 15:07:09 -07:00
ed6f002d33
[cuda][misc] error on empty CUDA_VISIBLE_DEVICES ( #7924 )
2024-08-27 12:06:11 -07:00
b09c755be8
[Bugfix] Fix phi3v incorrect image_idx when using async engine ( #7916 )
2024-08-27 17:36:09 +00:00
42e932c7d4
[CI/Build][ROCm] Enabling tensorizer tests for ROCm ( #7237 )
2024-08-27 10:09:13 -07:00
076169f603
[Hardware][Intel GPU] Add intel GPU pipeline parallel support. ( #7810 )
2024-08-27 10:07:02 -07:00
9db642138b
[CI/Build][VLM] Cleanup multiple images inputs model test ( #7897 )
2024-08-27 15:28:30 +00:00
6fc4e6e07a
[Model] Add Mistral Tokenization to improve robustness and chat encoding ( #7739 )
2024-08-27 12:40:02 +00:00
9606c7197d
Revert #7509 ( #7887 )
2024-08-27 00:16:31 -07:00
64cc644425
[core][torch.compile] discard the compile for profiling ( #7796 )
2024-08-26 21:33:58 -07:00
39178c7fbc
[Tests] Disable retries and use context manager for openai client ( #7565 )
2024-08-26 21:33:17 -07:00
2eedede875
[Core] Asynchronous Output Processor ( #7049 )
...
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com >
2024-08-26 20:53:20 -07:00
015e6cc252
[Misc] Update compressed tensors lifecycle to remove prefix from create_weights ( #7825 )
2024-08-26 18:09:34 -06:00
760e9f71a8
[Bugfix] neuron: enable tensor parallelism ( #7562 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-08-26 15:13:13 -07:00
05826c887b
[misc] fix custom allreduce p2p cache file generation ( #7853 )
2024-08-26 15:02:25 -07:00
dd9857f5fa
[Misc] Update gptq_marlin_24 to use vLLMParameters ( #7762 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-26 17:44:54 -04:00
665304092d
[Misc] Update qqq to use vLLMParameters ( #7805 )
2024-08-26 13:16:15 -06:00
2deb029d11
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule ( #7822 )
2024-08-26 11:24:53 -07:00
029c71de11
[CI/Build] Avoid downloading all HF files in RemoteOpenAIServer ( #7836 )
2024-08-26 05:31:10 +00:00
0b769992ec
[Bugfix]: Use float32 for base64 embedding ( #7855 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2024-08-26 03:16:38 +00:00
1856aff4d6
[Spec Decoding] Streamline batch expansion tensor manipulation ( #7851 )
2024-08-25 15:45:14 -07:00
70c094ade6
[misc][cuda] improve pynvml warning ( #7852 )
2024-08-25 14:30:09 -07:00
2059b8d9ca
[Misc] Remove snapshot_download usage in InternVL2 test ( #7835 )
2024-08-25 15:53:09 +00:00
8aaf3d5347
[Model][VLM] Support multi-images inputs for Phi-3-vision models ( #7783 )
2024-08-25 11:51:20 +00:00
80162c44b1
[Bugfix] Fix Phi-3v crash when input images are of certain sizes ( #7840 )
2024-08-24 18:16:24 -07:00
aab0fcdb63
[ci][test] fix RemoteOpenAIServer ( #7838 )
2024-08-24 17:31:28 +00:00
ea9fa160e3
[ci][test] exclude model download time in server start time ( #7834 )
2024-08-24 01:03:27 -07:00
7d9ffa2ae1
[misc][core] lazy import outlines ( #7831 )
2024-08-24 00:51:38 -07:00
d81abefd2e
[Frontend] add json_schema support from OpenAI protocol ( #7654 )
2024-08-23 23:07:24 -07:00
8da48e4d95
[Frontend] Publish Prometheus metrics in run_batch API ( #7641 )
2024-08-23 23:04:22 -07:00
6885fde317
[Bugfix] Fix run_batch logger ( #7640 )
2024-08-23 13:58:26 -07:00
9db93de20c
[Core] Add multi-step support to LLMEngine ( #7789 )
2024-08-23 12:45:53 -07:00
09c7792610
Bump version to v0.5.5 ( #7823 )
2024-08-23 11:35:33 -07:00
f1df5dbfd6
[Misc] Update marlin to use vLLMParameters ( #7803 )
2024-08-23 14:30:52 -04:00
35ee2ad6b9
[github][misc] promote asking llm first ( #7809 )
2024-08-23 09:38:50 -07:00
e25fee57c2
[BugFix] Fix server crash on empty prompt ( #7746 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-08-23 13:12:44 +00:00
faeddb565d
[misc] Add Torch profiler support for CPU-only devices ( #7806 )
2024-08-23 05:46:25 +00:00
fc5ebbd1d3
[Hardware][Intel GPU] refactor xpu_model_runner for tp ( #7712 )
2024-08-22 20:06:54 -07:00
c01a6cb231
[Ray backend] Better error when pg topology is bad. ( #7584 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-08-22 17:44:25 -07:00
b903e1ba7f
[Frontend] error suppression cleanup ( #7786 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-08-22 21:50:21 +00:00
a152246428
[Misc] fix typo in triton import warning ( #7794 )
2024-08-22 13:51:23 -07:00
666ad0aa16
[ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args ( #7705 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-22 20:10:55 +00:00
15310b5101
[Bugfix] Use LoadFormat values for vllm serve --load-format ( #7784 )
2024-08-22 11:37:08 -07:00
57792ed469
[Doc] Fix incorrect docs from #7615 ( #7788 )
2024-08-22 10:02:06 -07:00
d3b5b98021
[Misc] Enhance prefix-caching benchmark tool ( #6568 )
2024-08-22 09:32:02 -07:00
cc0eaf12b1
[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output ( #7232 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-08-22 09:33:48 -04:00
955b5191c9
[Misc] update fp8 to use vLLMParameter ( #7437 )
2024-08-22 08:36:18 -04:00
55d63b1211
[Bugfix] Don't build machete on cuda <12.0 ( #7757 )
2024-08-22 08:28:52 -04:00
4f419c00a6
Fix ShardedStateLoader for vllm fp8 quantization ( #7708 )
2024-08-22 08:25:04 -04:00
a3fce56b88
[Speculative Decoding] EAGLE Implementation with Top-1 proposer ( #6830 )
2024-08-22 02:42:24 -07:00
b3856bef7d
[Misc] Use torch.compile for GemmaRMSNorm ( #7642 )
2024-08-22 01:14:13 -07:00
8c6f694a79
[ci] refine dependency for distributed tests ( #7776 )
2024-08-22 00:54:15 -07:00
eeee1c3b1a
[TPU] Avoid initializing TPU runtime in is_tpu ( #7763 )
2024-08-21 21:31:49 -07:00
aae74ef95c
Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7527 )" ( #7764 )
2024-08-22 03:42:14 +00:00
cde9183b40
[Bug][Frontend] Improve ZMQ client robustness ( #7443 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-08-22 02:18:11 +00:00
df1a21131d
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue ( #7710 )
2024-08-22 09:36:24 +08:00
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce ( #7233 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-21 20:18:00 -04:00
9984605412
[AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility ( #7477 )
...
Co-authored-by: Charlie Fu <Charlie.Fu@amd.com >
2024-08-21 16:47:36 -07:00
7eebe8ccaa
[distributed][misc] error on same VLLM_HOST_IP setting ( #7756 )
2024-08-21 16:25:34 -07:00
8678a69ab5
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7527 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
2024-08-21 16:17:10 -07:00
5844017285
[ci] [multi-step] narrow multi-step test dependency paths ( #7760 )
2024-08-21 15:52:40 -07:00
1ca0d4f86b
[Model] Add UltravoxModel and UltravoxConfig ( #7615 )
2024-08-21 22:49:39 +00:00
dd53c4b023
[misc] Add Torch profiler support ( #7451 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-21 15:39:26 -07:00
970dfdc01d
[Frontend] Improve Startup Failure UX ( #7716 )
2024-08-21 19:53:01 +00:00
91f4522cbf
[multi-step] Raise error if not using async engine ( #7703 )
2024-08-21 11:49:19 -07:00
1b32e02648
[Bugfix] Pass PYTHONPATH from setup.py to CMake ( #7730 )
2024-08-21 11:17:48 -07:00
f7e3b0c5aa
[Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend ( #7394 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-21 13:34:14 -04:00
d3c002eadc
[Bugfix] chat method add_generation_prompt param ( #7734 )
2024-08-21 17:33:35 +00:00
9b73a2f498
[Spec Decoding] Use target model max length as default for draft model ( #7706 )
2024-08-22 00:23:22 +08:00
6925cdbeea
[Bugfix][Hardware][CPU] Fix mm_limits initialization for CPU backend ( #7735 )
2024-08-21 16:23:03 +00:00
53328d7536
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] ( #7509 )
2024-08-21 08:54:31 -07:00
c75363fbc0
[BugFix] Avoid premature async generator exit and raise all exception variations ( #7698 )
2024-08-21 11:45:55 -04:00
dd3fa0e430
[Bugfix] Mirror jinja2 in pyproject.toml ( #7723 )
2024-08-21 13:41:17 +00:00
baaedfdb2d
[mypy] Enable following imports for entrypoints ( #7248 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Fei <dfdfcai4@gmail.com >
2024-08-20 23:28:21 -07:00
4506641212
[Doc] Section for Multimodal Language Models ( #7719 )
2024-08-20 23:24:01 -07:00
12e1c65bc9
[Model] Add AWQ quantization support for InternVL2 model ( #7187 )
2024-08-20 23:18:57 -07:00
b74a125800
[ci] try to log process using the port to debug the port usage ( #7711 )
2024-08-20 17:41:12 -07:00
66a9e713a7
[Core] Pipe worker_class_fn argument in Executor ( #7707 )
2024-08-21 00:37:39 +00:00
9e51b6a626
[ci][test] adjust max wait time for cpu offloading test ( #7709 )
2024-08-20 17:12:44 -07:00
6e4658c7aa
[Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) ( #7685 )
2024-08-20 12:01:09 -07:00
3b682179dd
[Core] Add AttentionState abstraction ( #7663 )
2024-08-20 18:50:45 +00:00
c6af027a35
[Misc] Add jinja2 as an explicit build requirement ( #7695 )
2024-08-20 17:17:47 +00:00
2aa00d59ad
[CI/Build] Pin OpenTelemetry versions and make errors clearer ( #7266 )
...
[CI/Build] Pin OpenTelemetry versions and make availability errors clearer (#7266 )
2024-08-20 10:02:21 -07:00
c42590f97a
[Hardware] [Intel GPU] refactor xpu worker/executor ( #7686 )
2024-08-20 09:54:10 -07:00
aae6927be0
[VLM][Model] Add test for InternViT vision encoder ( #7409 )
2024-08-20 23:10:20 +08:00
398521ad19
[OpenVINO] Updated documentation ( #7687 )
2024-08-20 07:33:56 -06:00
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel ( #7174 )
2024-08-20 07:09:33 -06:00
b6f99a6ffe
[Core] Refactor executor classes for easier inheritance ( #7673 )
...
[Core] Refactor executor classes to make it easier to inherit GPUExecutor (#7673 )
2024-08-20 00:56:50 -07:00
ad28a74beb
[misc][cuda] add warning for pynvml user ( #7675 )
2024-08-20 00:35:09 -07:00
e6d811dd13
[XPU] fallback to native implementation for xpu custom op ( #7670 )
2024-08-20 00:26:09 -07:00
c4be16e1a7
[misc] add nvidia related library in collect env ( #7674 )
2024-08-19 23:22:49 -07:00
3d8a5f063d
[CI] Organizing performance benchmark files ( #7616 )
2024-08-19 22:43:54 -07:00
f4fc7337bf
[Bugfix] support tie_word_embeddings for all models ( #5724 )
2024-08-19 20:00:04 -07:00
0df7ec0b2d
[ci] Install Buildkite test suite analysis ( #7667 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-19 19:55:04 -07:00
312f761232
[Speculative Decoding] Fixing hidden states handling in batch expansion ( #7508 )
2024-08-19 17:58:14 -07:00
e54ebc2f8f
[doc] fix doc build error caused by msgspec ( #7659 )
2024-08-19 17:50:59 -07:00
67e02fa8a4
[Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding ( #7665 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-08-20 00:43:09 +00:00
43735bf5e1
[TPU] Remove redundant input tensor cloning ( #7660 )
2024-08-19 15:55:04 -07:00
da115230fd
[Bugfix] Don't disable existing loggers ( #7664 )
2024-08-19 15:11:58 -07:00
7601cb044d
[Core] Support tensor parallelism for GGUF quantization ( #7520 )
2024-08-19 17:30:14 -04:00
47b65a5508
[core] Multi Step Scheduling ( #7000 )
...
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
2024-08-19 13:52:13 -07:00
dad961ef5c
[Bugfix] fix lora_dtype value type in arg_utils.py - part 2 ( #5428 )
2024-08-19 20:47:00 +00:00
3ac50b47d0
[MISC] Add prefix cache hit rate to metrics ( #7606 )
2024-08-19 11:52:07 -07:00
df845b2b46
[Misc] Remove Gemma RoPE ( #7638 )
2024-08-19 09:29:31 -07:00
1a36287b89
[Bugfix] Fix xpu build ( #7644 )
2024-08-18 22:00:09 -07:00
f710fb5265
[Core] Use flashinfer sampling kernel when available ( #7137 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-19 03:24:03 +00:00
ff7ec82c4d
[Core] Optimize SPMD architecture with delta + serialization optimization ( #7109 )
2024-08-18 17:57:20 -07:00
200a2ffa6b
[Misc] Refactor Llama3 RoPE initialization ( #7637 )
2024-08-18 17:18:12 -07:00
40e1360bb6
[CI/Build] Add text-only test for Qwen models ( #7475 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-08-19 07:43:46 +08:00
e3b318216d
[Bugfix] Fix Prometheus Metrics With zeromq Frontend ( #7279 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-18 20:19:48 +00:00
ab7165f2c7
[TPU] Optimize RoPE forward_native2 ( #7636 )
2024-08-18 01:15:10 -07:00
0c2fa50b84
[TPU] Use mark_dynamic only for dummy run ( #7634 )
2024-08-18 00:18:53 -07:00
ce143353c6
[TPU] Skip creating empty tensor ( #7630 )
2024-08-17 14:22:46 -07:00
bbf55c4805
[VLM] Refactor MultiModalConfig initialization and profiling ( #7530 )
2024-08-17 13:30:55 -07:00
1ef13cf92f
[Misc]Fix BitAndBytes exception messages ( #7626 )
2024-08-17 12:02:14 -07:00
832163b875
[ci][test] allow longer wait time for api server ( #7629 )
2024-08-17 11:26:38 -07:00
e73f76eec6
[Model] Pipeline parallel support for JAIS ( #7603 )
2024-08-17 11:11:09 -07:00
d95cc0a55c
[core][misc] update libcudart finding ( #7620 )
...
Co-authored-by: cjackal <44624812+cjackal@users.noreply.github.com >
2024-08-16 23:01:35 -07:00
5bf45db7df
[ci][test] fix engine/logger test ( #7621 )
2024-08-16 23:00:59 -07:00
eed020f673
[misc] use nvml to get consistent device name ( #7582 )
2024-08-16 21:15:13 -07:00
7c0b7ea214
[Bugfix] add >= 1.0 constraint for openai dependency ( #7612 )
2024-08-16 20:56:01 -07:00
4706eb628e
[aDAG] Unflake aDAG + PP tests ( #7600 )
2024-08-16 20:49:30 -07:00
bae888cb8e
[Bugfix] Clear engine reference in AsyncEngineRPCServer ( #7618 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-16 20:44:05 -07:00
6bd19551b0
[Build/CI] Enabling passing AMD tests. ( #7610 )
2024-08-16 20:25:32 -07:00
e680349994
[Bugfix] Fix custom_ar support check ( #7617 )
2024-08-16 19:05:49 -07:00
44f26a9466
[Model] Align nemotron config with final HF state and fix lm-eval-small ( #7611 )
2024-08-16 15:56:34 -07:00
37fd47e780
[Kernel] fix types used in aqlm and ggml kernels to support dynamo ( #7596 )
2024-08-16 14:00:11 -07:00
7759ae958f
[Kernel][Misc] dynamo support for ScalarType ( #7594 )
2024-08-16 13:59:49 -07:00
9f69856356
[Kernel] register punica functions as torch ops ( #7591 )
2024-08-16 13:59:38 -07:00
d4f0f17b02
[Doc] Update quantization supported hardware table ( #7595 )
2024-08-16 13:59:27 -07:00
b3f4e17935
[Doc] Add docs for llmcompressor INT8 and FP8 checkpoints ( #7444 )
2024-08-16 13:59:16 -07:00
93478b63d2
[Core] Fix tracking of model forward time in case of PP>1 ( #7440 )
...
[Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440 )
2024-08-16 13:46:01 -07:00
f366f6339b
[spec decode] [4/N] Move update_flash_attn_metadata to attn backend ( #7571 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-16 11:41:56 -07:00
855866caa9
[Kernel] Add tuned triton configs for ExpertsInt8 ( #7601 )
2024-08-16 11:37:01 -07:00
7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE ( #7415 )
2024-08-16 10:06:51 -07:00
e837b624f2
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm ( #7210 )
2024-08-16 10:06:30 -07:00
ec724a725e
support tqdm in notebooks ( #7510 )
2024-08-16 09:17:50 -07:00
0e39a33c6d
[Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method ( #7513 )
2024-08-16 10:05:18 -06:00
6fc5b0f249
[CI] Fix crashes of performance benchmark ( #7500 )
2024-08-16 08:08:45 -07:00
9587b050fb
[Core] Use uvloop with zmq-decoupled front-end ( #7570 )
2024-08-15 22:48:07 -07:00
54bd9a03c4
register custom op for flash attn and use from torch.ops ( #7536 )
2024-08-15 22:38:56 -07:00
50b8d08dbd
[Misc/Testing] Use torch.testing.assert_close ( #7324 )
2024-08-16 04:24:04 +00:00
e165528778
[CI] Move quantization cpu offload tests out of fastcheck ( #7574 )
2024-08-15 21:16:20 -07:00
3b19e39dc5
Chat method for offline llm ( #5049 )
...
Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-08-15 19:41:34 -07:00
4cd7d47fed
[ci/test] rearrange tests and make adag test soft fail ( #7572 )
2024-08-15 19:39:04 -07:00
f878c8feb0
[Feature]: Add OpenAI server prompt_logprobs support #6508 ( #7453 )
2024-08-16 02:38:08 +00:00
b67ae00cdb
[Misc] Add quantization config support for speculative model. ( #7343 )
2024-08-15 19:34:28 -07:00
9c8e2d1161
[Bugfix][Harmless] Fix float16 dtype for model_is_embedding ( #7566 )
2024-08-15 18:26:19 -07:00
21313e09e3
[Bugfix] Fix default weight loading for scalars ( #7534 )
2024-08-15 13:10:22 -07:00
f4da5f7b6d
[Misc] Update dockerfile for CPU to cover protobuf installation ( #7182 )
2024-08-15 10:03:01 -07:00
9c1f78d5d6
[Bugfix] update neuron for version > 0.5.0 ( #7175 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-15 09:44:14 -07:00
fc93e56143
[Bugfix][TPU] Correct env variable for XLA cache path ( #7544 )
2024-08-15 00:02:29 -07:00
22b39e11f2
llama_index serving integration documentation ( #6973 )
...
Co-authored-by: pavanmantha <pavan.mantha@thevaslabs.io >
2024-08-14 15:38:37 -07:00
f55a9aea45
[Misc] Revert compressed-tensors code reuse ( #7521 )
2024-08-14 15:07:37 -07:00
951fdd66d3
[TPU] Set per-rank XLA cache ( #7533 )
2024-08-14 14:47:51 -07:00
2ecf7b1757
[core] [3/N] multi-step args and sequence.py ( #7452 )
2024-08-14 12:32:45 -07:00
3f674a49b5
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt ( #7126 )
2024-08-14 17:55:42 +00:00
70b746efcf
[Misc] Deprecation Warning when setting --engine-use-ray ( #7424 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-08-14 09:44:27 -07:00
67d115db08
[Bugfix][Frontend] Disable embedding API for chat models ( #7504 )
...
Co-authored-by: jack <jack@alex>
2024-08-14 09:15:19 -07:00
d3d9cb6e4b
[ci] fix model tests ( #7507 )
2024-08-14 01:01:43 -07:00
c134a46402
Fix empty output when temp is too low ( #2937 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-08-14 05:31:44 +00:00
199adbb7cf
[doc] update test script to include cudagraph ( #7501 )
2024-08-13 21:52:58 -07:00
dd164d72f3
[Bugfix][Docs] Update list of mock imports ( #7493 )
2024-08-13 20:37:30 -07:00
ea49e6a3c8
[misc][ci] fix cpu test with plugins ( #7489 )
2024-08-13 19:27:46 -07:00
97992802f3
[CI/Build] Reduce the time consumption for LoRA tests ( #7396 )
2024-08-13 17:27:29 -07:00
59edd0f134
[Bugfix][CI] Import ray under guard ( #7486 )
2024-08-13 17:12:58 -07:00
a08df8322e
[TPU] Support multi-host inference ( #7457 )
2024-08-13 16:31:20 -07:00
16422ea76f
[misc][plugin] add plugin system implementation ( #7426 )
2024-08-13 16:24:17 -07:00
373538f973
[Misc] compressed-tensors code reuse ( #7277 )
2024-08-13 19:05:15 -04:00
33e5d7e6b6
[frontend] spawn engine process from api server process ( #7484 )
2024-08-13 15:40:17 -07:00
c5c7768264
Announce NVIDIA Meetup ( #7483 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-08-13 14:28:36 -07:00
b1e5afc3e7
[Misc] Update awq and awq_marlin to use vLLMParameters ( #7422 )
2024-08-13 17:08:20 -04:00
d3bdfd3ab9
[Misc] Update Fused MoE weight loading ( #7334 )
2024-08-13 14:57:45 -04:00
fb377d7e74
[Misc] Update gptq_marlin to use new vLLMParameters ( #7281 )
2024-08-13 14:30:11 -04:00
181abbc27d
[Misc] Update LM Eval Tolerance ( #7473 )
2024-08-13 14:28:14 -04:00
00c3d68e45
[Frontend][Core] Add plumbing to support audio language models ( #7446 )
2024-08-13 17:39:33 +00:00
e20233d361
Revert "[Doc] Update supported_hardware.rst ( #7276 )" ( #7467 )
2024-08-13 01:37:08 -07:00
d6e634f3d7
[TPU] Suppress import custom_ops warning ( #7458 )
2024-08-13 00:30:30 -07:00
4d2dc5072b
[hardware] unify usage of is_tpu to current_platform.is_tpu() ( #7102 )
2024-08-13 00:16:42 -07:00
7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 ( #7410 )
2024-08-13 05:33:41 +00:00
5469146bcc
[ci] Remove fast check cancel workflow ( #7455 )
2024-08-12 21:19:51 -07:00
97a6be95ba
[Misc] improve logits processors logging message ( #7435 )
2024-08-13 02:29:34 +00:00
9ba85bc152
[mypy] Misc. typing improvements ( #7417 )
2024-08-13 09:20:20 +08:00
198d6a2898
[Core] Shut down aDAG workers with clean async llm engine exit ( #7224 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-12 17:57:16 -07:00
774cd1d3bf
[CI/Build] bump minimum cmake version ( #6999 )
2024-08-12 16:29:20 -07:00
91294d56e1
[Bugfix] Handle PackageNotFoundError when checking for xpu version ( #7398 )
2024-08-12 16:07:20 -07:00
a046f86397
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel ( #7208 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-12 22:47:41 +00:00
4ddc4743d7
[Core] Consolidate GB constant and enable float GB arguments ( #7416 )
2024-08-12 14:14:14 -07:00
6aa33cb2dd
[Misc] Use scalar type to dispatch to different gptq_marlin kernels ( #7323 )
2024-08-12 14:40:13 -04:00
1137f343aa
[ci] Cancel fastcheck when PR is ready ( #7433 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-12 10:59:14 -07:00
9b3e2edd30
[ci] Cancel fastcheck run when PR is marked ready ( #7427 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-12 10:56:52 -07:00
65950e8f58
[ci] Entrypoints run upon changes in vllm/ ( #7423 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-12 10:18:03 -07:00
cfba4def5d
[Bugfix] Fix logit soft cap in flash-attn backend ( #7425 )
2024-08-12 09:58:28 -07:00
d2bc4510a4
[CI/Build] bump Dockerfile.neuron image base, use public ECR ( #6832 )
2024-08-12 09:53:35 -07:00
24154f8618
[Frontend] Disallow passing model as both argument and option ( #7347 )
2024-08-12 12:58:34 +00:00
e6e42e4b17
[Core][VLM] Support image embeddings as input ( #6613 )
2024-08-12 16:16:06 +08:00
ec2affa8ae
[Kernel] Flashinfer correctness fix for v0.1.3 ( #7319 )
2024-08-12 07:59:17 +00:00
86ab567bae
[CI/Build] Minor refactoring for vLLM assets ( #7407 )
2024-08-12 02:41:52 +00:00
f020a6297e
[Docs] Update readme ( #7316 )
2024-08-11 17:13:37 -07:00
6c8e595710
[misc] add commit id in collect env ( #7405 )
2024-08-11 15:40:48 -07:00
02b1988b9f
[Doc] building vLLM with VLLM_TARGET_DEVICE=empty ( #7403 )
2024-08-11 14:38:17 -07:00
386087970a
[CI/Build] build on empty device for better dev experience ( #4773 )
2024-08-11 13:09:44 -07:00
c08e2b3086
[core] [2/N] refactor worker_base input preparation for multi-step ( #7387 )
2024-08-11 08:50:08 -07:00
4fb7b52a2c
Updating LM Format Enforcer version to v0.10.6 ( #7189 )
2024-08-11 08:11:50 -04:00
90bab18f24
[TPU] Use mark_dynamic to reduce compilation time ( #7340 )
2024-08-10 18:12:22 -07:00
4c5d8e8ea9
[Bugfix] Fix phi3v batch inference when images have different aspect ratio ( #7392 )
2024-08-10 16:19:33 +00:00
baa240252e
[Core] Fix edge case in chunked prefill + block manager v2 ( #7380 )
2024-08-09 23:48:49 +00:00
999ef0b917
[Misc] Add numpy implementation of compute_slot_mapping ( #7377 )
2024-08-09 22:52:29 +00:00
5c6c54d67a
[Bugfix] Fix PerTensorScaleParameter weight loading for fused models ( #7376 )
2024-08-09 21:23:46 +00:00
933790c209
[Core] Add span metrics for model_forward, scheduler and sampler time ( #7089 )
2024-08-09 13:55:13 -07:00
70d268a399
[Bugfix] Fix ITL recording in serving benchmark ( #7372 )
2024-08-09 10:00:00 -07:00
249b88228d
[Frontend] Support embeddings in the run_batch API ( #7132 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-08-09 09:48:21 -07:00
74af2bbd90
[Bugfix] Fix reinit procedure in ModelInputForGPUBuilder ( #7360 )
2024-08-09 16:35:49 +00:00
fc7b8d1eef
[Performance] e2e overheads reduction: Small followup diff ( #7364 )
2024-08-09 15:49:36 +00:00
67abdbb42f
[VLM][Doc] Add stop_token_ids to InternVL example ( #7354 )
2024-08-09 14:51:04 +00:00
07ab160741
[Model][Jamba] Mamba cache single buffer ( #6739 )
...
Co-authored-by: Mor Zusman <morz@ai21.com >
2024-08-09 10:07:06 -04:00
b4e9528f95
[Core] Streamline stream termination in AsyncLLMEngine ( #7336 )
2024-08-09 07:06:36 +00:00
57b7be0e1c
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace ( #6971 )
2024-08-09 05:42:45 +00:00
99b4cf5f23
[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary ( #7218 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-08-08 22:08:46 -07:00
e02ac55617
[Performance] Optimize e2e overheads: Reduce python allocations ( #7162 )
2024-08-08 21:34:28 -07:00
73388c07a4
[TPU] Fix dockerfile.tpu ( #7331 )
2024-08-08 20:24:58 -07:00
7eb4a51c5f
[Core] Support serving encoder/decoder models ( #7258 )
2024-08-09 10:39:41 +08:00
0fa14907da
[TPU] Add Load-time W8A16 quantization for TPU Backend ( #7005 )
2024-08-08 18:35:49 -07:00
5923532e15
Add Skywork AI as Sponsor ( #7314 )
2024-08-08 13:59:57 -07:00
a049b107e2
[Misc] Temporarily resolve the error of BitAndBytes ( #7308 )
2024-08-08 13:42:58 -07:00
8334c39f37
[Bugfix] Fix new Llama3.1 GGUF model loading ( #7269 )
2024-08-08 13:42:44 -07:00
e904576743
[CI/Build] Dockerfile.cpu improvements ( #7298 )
2024-08-08 15:24:52 -04:00
e14fb22e59
[Doc] Put collect_env issue output in a <detail> block ( #7310 )
2024-08-08 11:22:49 -07:00
782e53ab59
[Bugfix][fast] Fix the get_num_blocks_touched logic ( #6849 )
2024-08-08 10:43:30 -07:00
21b9c49aa3
[Frontend] Kill the server on engine death ( #6594 )
...
Signed-off-by: Joe Runde <joe@joerun.de >
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-08-08 09:47:48 -07:00
5fb4a3f678
[Bugfix][Kernel] Increased atol to fix failing tests ( #7305 )
2024-08-08 12:16:13 -04:00
757ac70a64
[Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 ( #7273 )
2024-08-08 14:02:41 +00:00
6dffa4b0a6
[Bugfix] Fix LoRA with PP ( #7292 )
2024-08-08 00:02:27 -07:00
48abee9e54
[Frontend] remove max_num_batched_tokens limit for lora ( #7288 )
2024-08-08 06:17:29 +00:00
746709642c
[Misc] Fix typos in scheduler.py ( #7285 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-07 17:06:01 -07:00
e53dfd3eaf
[Kernel] Fix Flashinfer Correctness ( #7284 )
2024-08-07 16:26:52 -07:00
6d94420246
[Doc] Update supported_hardware.rst ( #7276 )
2024-08-07 14:21:50 -07:00
fc1493a01e
[FrontEnd] Make merge_async_iterators is_cancelled arg optional ( #7282 )
2024-08-07 13:35:14 -07:00
311f743831
[Bugfix] Fix gptq failure on T4s ( #7264 )
2024-08-07 20:05:37 +00:00
469b3bc538
[ci] Make building wheels per commit optional ( #7278 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-07 11:34:25 -07:00
5223199e03
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization ( #7219 )
2024-08-07 11:23:12 -07:00
fde47d3bc2
[BugFix] Fix frontend multiprocessing hang ( #7217 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-08-07 18:09:36 +00:00
0e12cd67a8
[Doc] add online speculative decoding example ( #7243 )
2024-08-07 09:58:02 -07:00
80cbe10c59
[OpenVINO] migrate to latest dependencies versions ( #7251 )
2024-08-07 09:49:10 -07:00
b764547616
[Bugfix] Fix input processor for InternVL2 model ( #7164 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-07 09:32:07 -07:00
ab0f5e2823
Fixes typo in function name ( #7275 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-08-07 09:29:27 -07:00
564985729a
[ BugFix ] Move zmq frontend to IPC instead of TCP ( #7222 )
2024-08-07 16:24:56 +00:00
0f7052bc7e
[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 ( #5874 )
2024-08-07 09:17:58 -07:00
639159b2a6
[distributed][misc] add specialized method for cuda platform ( #7249 )
2024-08-07 08:54:52 -07:00
66d617e343
[Frontend] Gracefully handle missing chat template and fix CI failure ( #7238 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-08-07 09:12:05 +00:00
7b261092de
[BUGFIX]: top_k is expected to be an integer. ( #7227 )
2024-08-07 00:32:16 -07:00
2385c8f374
[Doc] Mock new dependencies for documentation ( #7245 )
2024-08-07 06:43:03 +00:00
9a3f49ae07
[BugFix] Overhaul async request cancellation ( #7111 )
2024-08-07 13:21:41 +08:00
f9a5600649
[Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading ( #7225 )
2024-08-06 18:34:26 -07:00
fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) ( #4942 )
...
Co-authored-by: Andrew Feldman <afeld2012@gmail.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-06 16:51:47 -04:00
660470e5a3
[Core] Optimize evictor-v2 performance ( #7193 )
2024-08-06 12:34:25 -07:00
8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues ( #5941 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-08-06 18:17:08 +00:00
5c60c8c423
[SpecDecode] [Minor] Fix spec decode sampler tests ( #7183 )
2024-08-06 10:40:32 -07:00
00afc78590
[Bugfix] add gguf dependency ( #7198 )
...
Co-authored-by: katarzyna.papis <kpapis@kpapis-u20.sclab.intel.com >
2024-08-06 10:08:35 -07:00
541c1852d3
[ BugFix ] Fix ZMQ when VLLM_PORT is set ( #7205 )
2024-08-06 09:26:26 -07:00
a3bbbfa1d8
[BugFix] Fix DeepSeek remote code ( #7178 )
2024-08-06 08:16:53 -07:00
1f26efbb3a
[Model] Support SigLIP encoder and alternative decoders for LLaVA models ( #7153 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-08-06 16:55:31 +08:00
9118217f58
[LoRA] Relax LoRA condition ( #7146 )
2024-08-06 01:57:25 +00:00
e3c664bfcb
[Build] Add initial conditional testing spec ( #6841 )
2024-08-05 17:39:22 -07:00
360bd67cf0
[Core] Support loading GGUF model ( #5191 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-05 17:54:23 -06:00
ef527be06c
[MISC] Use non-blocking transfer in prepare_input ( #7172 )
2024-08-05 23:41:27 +00:00
89b8db6bb2
[Bugfix] Specify device when loading LoRA and embedding tensors ( #7129 )
...
Co-authored-by: Jacob Schein <jacobschein@Jacobs-MacBook-Pro-2.local >
2024-08-05 16:35:47 -07:00
789937af2e
[Doc] [SpecDecode] Update MLPSpeculator documentation ( #7100 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-08-05 23:29:43 +00:00
dfb1a15dcb
[ci][frontend] deduplicate tests ( #7101 )
2024-08-05 15:59:22 -07:00
4db5176d97
bump version to v0.5.4 ( #7139 )
2024-08-05 14:39:48 -07:00
4cf1dc39be
[Bugfix][CI/Build] Fix CUTLASS FetchContent ( #7171 )
2024-08-05 14:22:57 -07:00
6e4852ce28
[CI/Build] Suppress divide-by-zero and missing return statement warnings ( #7001 )
2024-08-05 16:00:01 -04:00
8571ac4672
[Kernel] Update CUTLASS to 3.5.1 ( #7085 )
2024-08-05 15:13:43 -04:00
997cf78308
[Misc] Fix typo in GroupCoordinator.recv() ( #7167 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-05 11:10:16 -07:00
57f560aa23
[BugFix] Use args.trust_remote_code ( #7121 )
2024-08-05 09:26:14 -07:00
003f8ee128
[BugFix] Use IP4 localhost form for zmq bind ( #7163 )
2024-08-05 08:41:03 -07:00
e9630458c7
[SpecDecode] Support FlashInfer in DraftModelRunner ( #6926 )
2024-08-05 08:05:05 -07:00
82a1b1a82b
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification ( #6963 )
2024-08-05 08:46:44 +00:00
c0d8f1636c
[Model] SiglipVisionModel ported from transformers ( #6942 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-08-05 06:22:12 +00:00
cc08fc7225
[Frontend] Reapply "Factor out code for running uvicorn" ( #7095 )
2024-08-04 20:40:51 -07:00
7b86e7c9cd
[Model] Add multi-image support for minicpmv ( #7122 )
...
Co-authored-by: hezhihui <hzh7269@modelbest.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-05 09:23:17 +08:00
f80ab3521c
Clean up remaining Punica C information ( #7027 )
2024-08-04 15:37:08 -07:00
16a1cc9bb2
[misc][distributed] improve libcudart.so finding ( #7127 )
2024-08-04 11:31:51 -07:00
b1c9aa3daa
[Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when using MLPSpeculator ( #7105 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-08-04 07:13:18 -07:00
179a6a36f2
[Model] Refactor MiniCPMV ( #7020 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-04 08:12:41 +00:00
83c644fe7e
[core][misc] simply output processing with shortcut code path ( #7117 )
2024-08-04 00:22:19 -07:00
9fadc7b7a0
[misc] add zmq in collect env ( #7119 )
2024-08-03 22:03:46 -07:00
654bc5ca49
Support for guided decoding for offline LLM ( #6878 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-04 03:12:09 +00:00
825b044863
[Frontend] Warn if user max_model_len is greater than derived max_model_len ( #7080 )
...
Signed-off-by: Jefferson Fialho <jfialho@ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-03 16:01:38 -07:00
44dcb52e39
[ci][test] finalize fork_new_process_for_each_test ( #7114 )
2024-08-03 10:44:53 -07:00
67d745cc68
[CI] Temporarily turn off H100 performance benchmark ( #7104 )
2024-08-02 23:52:44 -07:00
99d7cabd7b
[LoRA] ReplicatedLinear support LoRA ( #7081 )
2024-08-02 22:40:19 -07:00
fb2c1c86c1
[Bugfix] Fix block table for seqs that have prefix cache hits ( #7018 )
2024-08-02 22:38:15 -07:00
0c25435daa
[Model] Refactor and decouple weight loading logic for InternVL2 model ( #7067 )
2024-08-02 22:36:14 -07:00
a0d164567c
[ci][distributed] disable ray dag tests ( #7099 )
2024-08-02 22:32:04 -07:00
04e5583425
[ci][distributed] merge distributed test commands ( #7097 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-02 21:33:53 -07:00
8c025fa703
[Frontend] Factor out chat message parsing ( #7055 )
2024-08-02 21:31:27 -07:00
69ea15e5cc
[ci][distributed] shorten wait time if server hangs ( #7098 )
2024-08-02 21:05:16 -07:00
ed812a73fa
[ Frontend ] Multiprocessing for OpenAI Server with zeromq ( #6883 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Joe Runde <joe@joerun.de >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-08-02 18:27:28 -07:00
708989341e
[misc] add a flag to enable compile ( #7092 )
2024-08-02 16:18:45 -07:00
22e718ff1a
[Misc] Revive to use loopback address for driver IP ( #7091 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-02 15:50:00 -07:00
05308891e2
[Core] Pipeline parallel with Ray ADAG ( #6837 )
...
Support pipeline-parallelism with Ray accelerated DAG.
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-02 13:55:40 -07:00
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType ( #6396 )
2024-08-02 13:51:58 -07:00
b482b9a5b1
[CI/Build] Add support for Python 3.12 ( #7035 )
2024-08-02 13:51:22 -07:00
806949514a
[ci] set timeout for test_oot_registration.py ( #7082 )
2024-08-02 10:03:24 -07:00
c16eaac500
[Hardware][Intel CPU] Update torch 2.4.0 for CPU backend ( #6931 )
2024-08-02 08:55:58 -07:00
db35186391
[Core] Comment out unused code in sampler ( #7023 )
2024-08-02 00:58:26 -07:00
660dea1235
[cuda][misc] remove error_on_invalid_device_count_status ( #7069 )
2024-08-02 00:14:21 -07:00
cf2a1a4d9d
Fix tracing.py ( #7065 )
2024-08-01 23:28:00 -07:00
252357793d
[ci][distributed] try to fix pp test ( #7054 )
2024-08-01 22:03:12 -07:00
3bb4b1e4cd
[mypy] Speed up mypy checking ( #7056 )
2024-08-01 19:49:43 -07:00
954f7305a1
[Kernel] Fix input for flashinfer prefill wrapper. ( #7008 )
2024-08-01 18:44:16 -07:00
6ce01f3066
[Performance] Optimize get_seqs ( #7051 )
2024-08-01 18:29:52 -07:00
6a11fdfbb8
[CI/Build][Bugfix] Fix CUTLASS header-only line ( #7034 )
2024-08-01 13:51:15 -07:00
805a8a75f2
[Misc] Support attention logits soft-capping with flash-attn ( #7022 )
2024-08-01 13:14:37 -07:00
562e580abc
Update run-amd-test.sh ( #7044 )
2024-08-01 13:12:37 -07:00
fc912e0886
[Models] Support Qwen model with PP ( #6974 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-08-01 12:40:43 -07:00
f4fd390f5d
[Bugfix] Lower gemma's unloaded_params exception to warning ( #7002 )
2024-08-01 12:01:07 -07:00
fb3db61688
[CI/Build] Remove sparseml requirement from testing ( #7037 )
2024-08-01 12:00:51 -07:00
2dd34371a6
[Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm ( #6992 )
2024-08-01 12:00:28 -07:00
7e0861bd0b
[CI/Build] Update PyTorch to 2.4.0 ( #6951 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-01 11:11:24 -07:00
a72a424b3e
[Build/CI] Fixing Docker Hub quota issue. ( #7043 )
2024-08-01 11:07:37 -07:00
c8a7e93273
[core][scheduler] simplify and improve scheduler ( #6867 )
2024-07-31 23:51:09 -07:00
3c10591ef2
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user ( #6954 )
2024-07-31 21:13:34 -07:00
0437492ea9
PP comm optimization: replace send with partial send + allgather ( #6695 )
...
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com >
2024-07-31 20:15:42 -07:00
630dd9e0ae
[Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings ( #6758 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-31 19:49:11 -07:00
23993a7997
[Bugfix][TPU] Do not use torch.Generator for TPUs ( #6981 )
2024-07-31 18:50:28 -07:00
1d2e7fb73f
[Model] Pipeline parallel support for Qwen2 ( #6924 )
2024-07-31 18:49:51 -07:00
7ecee34321
[Kernel][RFC] Refactor the punica kernel based on Triton ( #5036 )
2024-07-31 17:12:24 -07:00
7eb0cb4a14
Revert "[Frontend] Factor out code for running uvicorn" ( #7012 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-07-31 16:34:26 -07:00
a0dce9383a
[Misc] Add compressed-tensors to optimized quant list ( #7006 )
2024-07-31 14:40:44 -07:00
35e9c12bfa
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) ( #6996 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-31 14:40:32 -07:00
93548eb37e
[Kernel] Enable FP8 Cutlass for Ada Lovelace ( #6950 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-31 14:40:22 -07:00
460c1884e3
[Bugfix] Support cpu offloading with fp8 quantization ( #6960 )
2024-07-31 12:47:46 -07:00
bd70013407
[MISC] Introduce pipeline parallelism partition strategies ( #6920 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-07-31 12:02:17 -07:00
2ee8d3ba55
[Model] use FusedMoE layer in Jamba ( #6935 )
2024-07-31 12:00:24 -07:00
daed30c4a9
[Bugfix] Fix feature size calculation for LLaVA-NeXT ( #6982 )
2024-07-31 23:46:17 +08:00
2f4e108f75
[Bugfix] Clean up MiniCPM-V ( #6939 )
...
Co-authored-by: hezhihui <hzh7269@modelbest.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-31 14:39:19 +00:00
6512937de1
Support W4A8 quantization for vllm ( #5218 )
2024-07-31 07:55:21 -06:00
c0644cf9ce
[Bugfix] fix logit processor exceed vocab size issue ( #6927 )
2024-07-31 16:16:01 +08:00
533d1932d2
[Bugfix][TPU] Set readonly=True for non-root devices ( #6980 )
2024-07-31 00:19:28 -07:00
9f0e69b653
[CI/Build] Fix mypy errors ( #6968 )
2024-07-30 19:49:48 -07:00
f230cc2ca6
[Bugfix] Fix broadcasting logic for multi_modal_kwargs ( #6836 )
2024-07-31 10:38:45 +08:00
da1f7cc12a
[mypy] Enable following imports for some directories ( #6681 )
2024-07-31 10:38:03 +08:00
c32ab8be1a
[Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding ( #6964 )
2024-07-31 00:53:21 +00:00
fb4f530bf5
[CI] [nightly benchmark] Do not re-download sharegpt dataset if exists ( #6706 )
2024-07-30 16:28:49 -07:00
79319cedfa
[Nightly benchmarking suite] Remove pkill python from run benchmark suite ( #6965 )
2024-07-30 16:28:05 -07:00
40c27a7cbb
[Build] Temporarily Disable Kernels and LoRA tests ( #6961 )
2024-07-30 14:59:48 -07:00
6ca8031e71
[core][misc] improve free_finished_seq_groups ( #6865 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-07-30 14:32:12 -07:00
d7a299edaa
[Kernel] Remove scaled_fp8_quant kernel padding footgun ( #6842 )
2024-07-30 16:37:01 -04:00
052b6f8ca4
[Bugfix] Fix tensorizer memory profiling bug during testing ( #6881 )
2024-07-30 11:48:50 -07:00
5895b24677
[OpenVINO] Updated OpenVINO requirements and build docs ( #6948 )
2024-07-30 11:33:01 -07:00
cbbc904470
[Kernel] Squash a few more warnings ( #6914 )
2024-07-30 13:50:42 -04:00
5cf9254a9c
[BugFix] Fix use of per-request seed with pipeline parallel ( #6698 )
2024-07-30 10:40:08 -07:00
f058403683
[Doc] Super tiny fix doc typo ( #6949 )
2024-07-30 09:14:03 -07:00
c66c7f86ac
[Bugfix] Fix PaliGemma MMP ( #6930 )
2024-07-30 02:20:57 -07:00
6e063ea35b
[TPU] Fix greedy decoding ( #6933 )
2024-07-30 02:06:29 -07:00
af647fb8b3
[Kernel] Tuned int8 kernels for Ada Lovelace ( #6848 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-29 20:24:58 -06:00
61a97c32f6
[Kernel] Fix marlin divide-by-zero warnings ( #6904 )
2024-07-30 01:26:07 +00:00
4fbf4aa128
[ci] GHA workflow to remove ready label upon "/notready" comment ( #6921 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-29 17:03:45 -07:00
aae6d36f7e
[Kernel] Remove unused variables in awq/gemm_kernels.cu ( #6908 )
2024-07-29 18:01:17 -06:00
9f69d8245a
[Frontend] New allowed_token_ids decoding request parameter ( #6753 )
2024-07-29 23:37:27 +00:00
9a7e2d0534
[Bugfix] Allow vllm to still work if triton is not installed. ( #6786 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-29 14:51:27 -07:00
7f8d612d24
[TPU] Support tensor parallelism in async llm engine ( #6891 )
2024-07-29 12:42:21 -07:00
60d1c6e584
[Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel ( #6901 )
2024-07-29 09:59:02 -07:00
db9e5708a9
[Core] Reduce unnecessary compute when logprobs=None ( #6532 )
2024-07-29 16:47:31 +00:00
766435e660
[Kernel] Tuned FP8 Kernels for Ada Lovelace ( #6677 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-29 09:42:35 -06:00
7cbd9ec7a9
[Model] Initialize support for InternVL2 series models ( #6514 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-29 10:16:30 +00:00
3eeb148f46
[Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 ( #6871 )
2024-07-28 11:13:49 -04:00
b1366a9534
Add Nemotron to PP_SUPPORTED_MODELS ( #6863 )
2024-07-27 15:05:17 -07:00
75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel ( #6795 )
2024-07-27 17:52:33 -04:00
fad5576c58
[TPU] Reduce compilation time & Upgrade PyTorch XLA version ( #6856 )
2024-07-27 10:28:33 -07:00
f954d0715c
[Docs] Add RunLLM chat widget ( #6857 )
2024-07-27 09:24:46 -07:00
1ad86acf17
[Model] Initial support for BLIP-2 ( #5920 )
...
Co-authored-by: ywang96 <ywang@roblox.com >
2024-07-27 11:53:07 +00:00
ecb33a28cb
[CI/Build][Doc] Update CI and Doc for VLM example changes ( #6860 )
2024-07-27 09:54:14 +00:00
a57d75821c
[bugfix] make args.stream work ( #6831 )
2024-07-27 09:07:02 +00:00
925de97e05
[Bugfix] Fix VLM example typo ( #6859 )
2024-07-27 14:24:08 +08:00
aa46953a20
[Misc][VLM][Doc] Consolidate offline examples for vision language models ( #6858 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-07-26 22:44:13 -07:00
593e79e733
[Bugfix] torch.set_num_threads() in multiproc_gpu_executor ( #6802 )
...
[Bugfix] Use torch.set_num_threads() to configure parallelism in multiproc_gpu_executor (#6802 )
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-26 22:15:20 -07:00
c53041ae3b
[Doc] Add missing mock import to docs conf.py ( #6834 )
2024-07-27 04:47:33 +00:00
52f07e3dec
[Hardware][TPU] Implement tensor parallelism with Ray ( #5871 )
2024-07-26 20:54:27 -07:00
14dbd5a767
[Model] H2O Danube3-4b ( #6451 )
2024-07-26 20:47:50 -07:00
ed94e4f427
[Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba ( #6784 )
2024-07-26 20:45:31 -07:00
3c3012398e
[Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron ( #6844 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-07-26 20:20:16 -07:00
ced36cd89b
[ROCm] Upgrade PyTorch nightly version ( #6845 )
2024-07-26 20:16:13 -07:00
969d032265
[Bugfix]: Fix Tensorizer test failures ( #6835 )
2024-07-26 20:02:25 -07:00
55712941e5
[Bug Fix] Illegal memory access, FP8 Llama 3.1 405b ( #6852 )
2024-07-27 02:27:44 +00:00
981b0d5673
[Frontend] Factor out code for running uvicorn ( #6828 )
2024-07-27 09:58:25 +08:00
d09b94ca58
[TPU] Support collective communications in XLA devices ( #6813 )
2024-07-27 01:45:57 +00:00
bb5494676f
enforce eager mode with bnb quantization temporarily ( #6846 )
2024-07-27 01:32:20 +00:00
b5f49ee55b
Update README.md ( #6847 )
2024-07-27 00:26:45 +00:00
150a1ffbfd
[Doc] Update SkyPilot doc for wrong indents and instructions for update service ( #4283 )
2024-07-26 14:39:10 -07:00
281977bd6e
[Doc] Add Nemotron to supported model docs ( #6843 )
2024-07-26 17:32:44 -04:00
3bbb4936dc
[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation ( #6125 )
2024-07-26 13:50:10 -07:00
aa4867791e
[Misc][TPU] Support TPU in initialize_ray_cluster ( #6812 )
2024-07-26 19:39:49 +00:00
71734f1bf2
[Build/CI][ROCm] Minor simplification to Dockerfile.rocm ( #6811 )
2024-07-26 12:28:32 -07:00
50704f52c4
[Bugfix][Kernel] Promote another index to int64_t ( #6838 )
2024-07-26 18:41:04 +00:00
07278c37dd
[Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) ( #6611 )
2024-07-26 14:33:42 -04:00
85ad7e2d01
[doc][debugging] add known issues for hangs ( #6816 )
2024-07-25 21:48:05 -07:00
89a84b0bb7
[Core] Use array to speedup padding ( #6779 )
2024-07-25 21:31:31 -07:00
084a01fd35
[Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor. ( #6770 )
2024-07-25 21:25:35 -07:00
062a1d0fab
Fix ReplicatedLinear weight loading ( #6793 )
2024-07-25 19:24:58 -07:00
2eb9f4ff26
[ci] Mark tensorizer as soft fail and separate from grouped test ( #6810 )
...
[ci] Mark tensorizer test as soft fail and separate it from grouped test in fast check (#6810 )
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-25 18:08:33 -07:00
443c7cf4cf
[ci][distributed] fix flaky tests ( #6806 )
2024-07-25 17:44:09 -07:00
1adddb14bf
[Core] Fix ray forward_dag error mssg ( #6792 )
2024-07-25 16:53:25 -07:00
b7215de2c5
[Docs] Publish 5th meetup slides ( #6799 )
2024-07-25 16:47:55 -07:00
f3ff63c3f4
[doc][distributed] improve multinode serving doc ( #6804 )
2024-07-25 15:38:32 -07:00
cd7edc4e87
[Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors ( #6798 )
2024-07-25 15:05:09 -07:00
6a1e25b151
[Doc] Add documentations for nightly benchmarks ( #6412 )
2024-07-25 11:57:16 -07:00
95db75de64
[Bugfix] Add synchronize to prevent possible data race ( #6788 )
...
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-07-25 10:40:01 -07:00
65b1f121c8
[Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints ( #6761 )
2024-07-25 09:46:15 -07:00
889da130e7
[ Misc ] fp8-marlin channelwise via compressed-tensors ( #6524 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-07-25 09:46:04 -07:00
b75e314fff
[Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V ( #6787 )
...
Co-authored-by: hezhihui <hzh7269@modelbest.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-25 09:42:49 -07:00
316a41ac1d
[Bugfix] Fix encoding_format in examples/openai_embedding_client.py ( #6755 )
2024-07-24 22:48:07 -07:00
0310029a2f
[Bugfix] Fix awq_marlin and gptq_marlin flags ( #6745 )
2024-07-24 22:34:11 -07:00
309aaef825
[Bugfix] Fix decode tokens w. CUDA graph ( #6757 )
2024-07-24 22:33:56 -07:00
9e169a4c61
[Model] Adding support for MiniCPM-V ( #4087 )
2024-07-24 20:59:30 -07:00
5689e256ba
[Frontend] Represent tokens with identifiable strings ( #6626 )
2024-07-25 09:51:00 +08:00
740374d456
[core][distributed] fix zmq hang ( #6759 )
2024-07-24 17:37:12 -07:00
d88c458f44
[Doc][AMD][ROCm]Added tips to refer to mi300x tuning guide for mi300x users ( #6754 )
2024-07-24 14:32:57 -07:00
421e218b37
[Bugfix] Bump transformers to 4.43.2 ( #6752 )
2024-07-24 13:22:16 -07:00
5448f67635
[Core] Tweaks to model runner/input builder developer APIs ( #6712 )
2024-07-24 12:17:12 -07:00
0e63494cf3
Add fp8 support to reshape_and_cache_flash ( #6667 )
2024-07-24 18:36:52 +00:00
ee812580f7
[Frontend] split run_server into build_server and run_server ( #6740 )
2024-07-24 10:36:04 -07:00
40468b13fa
[Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. ( #6686 )
2024-07-24 08:58:42 -07:00
2cf0df3381
[Bugfix] Fix speculative decode seeded test ( #6743 )
2024-07-24 08:58:31 -07:00
545146349c
Adding f-string to validation error which is missing ( #6748 )
2024-07-24 08:55:53 -07:00
f4f8a9d892
[Bugfix]fix modelscope compatible issue ( #6730 )
2024-07-24 05:04:46 -07:00
b570811706
[Build/CI] Update run-amd-test.sh. Enable Docker Hub login. ( #6711 )
2024-07-24 05:01:14 -07:00
ccc4a73257
[Docs][ROCm] Detailed instructions to build from source ( #6680 )
2024-07-24 01:07:23 -07:00
0a740a11ba
[Bugfix] Fix token padding for chameleon ( #6724 )
2024-07-24 01:05:09 -07:00
c882a7f5b3
[SpecDecoding] Update MLPSpeculator CI tests to use smaller model ( #6714 )
2024-07-24 07:34:22 +00:00
5e8ca973eb
[Bugfix] fix flashinfer cudagraph capture for PP ( #6708 )
2024-07-24 01:49:44 +00:00
87525fab92
[bitsandbytes]: support read bnb pre-quantized model ( #5753 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-07-23 23:45:09 +00:00
2f808e69ab
[Bugfix] StatLoggers: cache spec decode metrics when they get collected. ( #6645 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-23 23:05:05 +00:00
01c16ede6b
[CI] Add smoke test for non-uniform AutoFP8 quantization ( #6702 )
2024-07-23 22:45:12 +00:00
72fc704803
[build] relax wheel size limit ( #6704 )
2024-07-23 14:03:49 -07:00
1bedf210e3
Bump transformers version for Llama 3.1 hotfix and patch Chameleon ( #6690 )
2024-07-23 13:47:48 -07:00
507ef787d8
[Model] Pipeline Parallel Support for DeepSeek v2 ( #6519 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-23 12:22:09 -07:00
58f53034ad
[Frontend] Add Usage data in each chunk for chat_serving. #6540 ( #6652 )
2024-07-23 11:41:55 -07:00
0eb0757bef
[Misc] Add ignored layers for fp8 quantization ( #6657 )
2024-07-23 14:04:04 -04:00
38c4b7e863
Bump version to 0.5.3.post1 ( #6696 )
2024-07-23 10:08:59 -07:00
a112a84aad
[BugFix] Fix RoPE error in Llama 3.1 ( #6693 )
2024-07-23 09:46:05 -07:00
461089a21a
[Bugfix] Fix a log error in chunked prefill ( #6694 )
2024-07-23 09:27:58 -07:00
71950af726
[doc][distributed] fix doc argument order ( #6691 )
2024-07-23 08:55:33 -07:00
cb1362a889
[Docs] Announce llama3.1 support ( #6688 )
2024-07-23 08:18:15 -07:00
bb2fc08072
Bump version to v0.5.3 ( #6674 )
2024-07-23 00:00:08 -07:00
3eda4ec780
support ignore patterns in model loader ( #6673 )
2024-07-22 23:59:42 -07:00
22fa2e35cb
[VLM][Model] Support image input for Chameleon ( #6633 )
2024-07-22 23:50:48 -07:00
c5201240a4
[misc] only tqdm for first rank ( #6672 )
2024-07-22 21:57:27 -07:00
97234be0ec
[Misc] Manage HTTP connections in one place ( #6600 )
2024-07-22 21:32:02 -07:00
c051bfe4eb
[doc][distributed] doc for setting up multi-node environment ( #6529 )
...
[doc][distributed] add more doc for setting up multi-node environment (#6529 )
2024-07-22 21:22:09 -07:00
9e0b558a09
[Misc] Support FP8 kv cache scales from compressed-tensors ( #6528 )
2024-07-23 04:11:50 +00:00
e519ae097a
add tqdm when loading checkpoint shards ( #6569 )
...
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-07-22 20:48:01 -07:00
7c2749a4fd
[misc] add start loading models for users information ( #6670 )
2024-07-22 20:08:02 -07:00
729171ae58
[Misc] Enable chunked prefill by default for long context models ( #6666 )
2024-07-22 20:03:13 -07:00
c5e8330997
[Bugfix] Fix null modules_to_not_convert in FBGEMM Fp8 quantization ( #6665 )
2024-07-22 19:25:05 -07:00
e0c15758b8
[Core] Modulize prepare input and attention metadata builder ( #6596 )
2024-07-23 00:45:24 +00:00
bdf5fd1386
[Misc] Remove deprecation warning for beam search ( #6659 )
2024-07-23 00:21:58 +00:00
5a96ee52a3
[ci][build] add back vim in docker ( #6661 )
2024-07-22 16:26:29 -07:00
42c7f66a38
[Core] Support dynamically loading Lora adapter from HuggingFace ( #6234 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-07-22 15:42:40 -07:00
69d5ae38dc
[ci] Use different sccache bucket for CUDA 11.8 wheel build ( #6656 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-22 14:20:41 -07:00
fea59c7712
[Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels ( #6649 )
2024-07-22 14:08:30 -06:00
739b61a348
[Frontend] Refactor prompt processing ( #4028 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-22 10:13:53 -07:00
89c1c6a196
[Bugfix] Fix vocab_size field access in llava_next.py ( #6624 )
2024-07-22 05:02:51 +00:00
42de2cefcb
[Misc] Add a wrapper for torch.inference_mode ( #6618 )
2024-07-21 18:43:11 -07:00
c9eef37f32
[Model] Initial Support for Chameleon ( #5770 )
2024-07-21 17:37:51 -07:00
396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel ( #6612 )
2024-07-21 19:41:42 -04:00
25e778aa16
[Model] Refactor and decouple phi3v image embedding ( #6621 )
2024-07-21 16:07:58 -07:00
b6df37f943
[Misc] Remove abused noqa ( #6619 )
2024-07-21 23:47:04 +08:00
14f91fe67c
[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. ( #6485 )
2024-07-20 23:58:58 -07:00
d7f4178dd9
[Frontend] Move chat utils ( #6602 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-21 08:38:17 +08:00
082ecd80d5
[ Bugfix ] Fix AutoFP8 fp8 marlin ( #6609 )
2024-07-20 17:25:56 -06:00
f952bbc8ff
[Misc] Fix input_scale typing in w8a8_utils.py ( #6579 )
2024-07-20 23:11:13 +00:00
9364f74eee
[ Kernel ] Enable fp8-marlin for fbgemm-fp8 models ( #6606 )
2024-07-20 18:50:10 +00:00
06d6c5fe9f
[Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes ( #6543 )
2024-07-20 09:39:07 -07:00
683e3cb9c4
[ Misc ] fbgemm checkpoints ( #6559 )
2024-07-20 09:36:57 -07:00
9042d68362
[Misc] Consolidate and optimize logic for building padded tensors ( #6541 )
2024-07-20 04:17:24 +00:00
3f8d42c81f
Pipeline Parallel: Guard for KeyErrors at request abort ( #6587 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-19 19:18:19 -07:00
7bd82002ae
[Core] Allow specifying custom Executor ( #6557 )
2024-07-20 01:25:06 +00:00
2e26564259
[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub ( #6593 )
...
Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com >
2024-07-19 18:15:26 -07:00
e81522e879
[build] add ib in image for out-of-the-box infiniband support ( #6599 )
...
[build] add ib so that multi-node support with infiniband can be supported out-of-the-box (#6599 )
2024-07-19 17:16:57 -07:00
45ceb85a0c
[Docs] Update PP docs ( #6598 )
2024-07-19 16:38:21 -07:00
4cc24f01b1
[ Kernel ] Enable Dynamic Per Token fp8 ( #6547 )
2024-07-19 23:08:15 +00:00
07eb6f19f3
[bugfix][distributed] fix multi-node bug for shared memory ( #6597 )
2024-07-19 15:34:34 -07:00
f0bbfaf917
[Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection ( #6578 )
2024-07-19 14:01:03 -07:00
30efe41532
[Docs] Update docs for wheel location ( #6580 )
2024-07-19 12:14:11 -07:00
9ed82e7074
[Misc] Small perf improvements ( #6520 )
2024-07-19 12:10:56 -07:00
51f8aa90ad
[Bugfix][Frontend] remove duplicate init logger ( #6581 )
2024-07-19 10:16:27 -07:00
a5314e8698
[Model] RowParallelLinear: pass bias to quant_method.apply ( #6327 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-19 07:15:22 -06:00
a921e86392
[BUGFIX] Raise an error for no draft token case when draft_tp>1 ( #6369 )
2024-07-19 06:01:09 -07:00
6366efc67b
[Bugfix][Frontend] Fix missing /metrics endpoint ( #6463 )
2024-07-19 03:55:13 +00:00
dbe5588554
[ Misc ] non-uniform quantization via compressed-tensors for Llama ( #6515 )
2024-07-18 22:39:18 -04:00
d4201e06d5
[Bugfix] Make spec. decode respect per-request seed. ( #6034 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-07-18 19:22:08 -07:00
b5672a112c
[Core] Multiprocessing Pipeline Parallel support ( #6130 )
...
Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-18 19:15:52 -07:00
c5df56f88b
Add support for a rope extension method ( #6553 )
2024-07-19 01:53:03 +00:00
1689219ebf
[CI/Build] Build on Ubuntu 20.04 instead of 22.04 ( #6517 )
2024-07-18 17:29:25 -07:00
4ffffccb7e
[Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm ( #6552 )
2024-07-18 23:52:22 +00:00
f53b8f0d05
[ci][test] add correctness test for cpu offloading ( #6549 )
2024-07-18 23:41:06 +00:00
2d4733ba2d
Fix PR comment bot ( #6554 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-18 14:48:29 -07:00
15c6a079b1
[Model] Support Mistral-Nemo ( #6548 )
2024-07-18 20:31:50 +00:00
ecdb462c24
[ci] Reword Github bot comment ( #6534 )
2024-07-18 08:01:45 -07:00
58ca663224
[ Misc ] Improve Min Capability Checking in compressed-tensors ( #6522 )
2024-07-18 14:39:12 +00:00
4634c8728b
[TPU] Refactor TPU worker & model runner ( #6506 )
2024-07-18 01:34:16 -07:00
c8a7d51c49
[Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash ( #6501 )
2024-07-18 07:47:13 +00:00
e2fbaee725
[BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs ( #6227 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-18 15:13:30 +08:00
8a74c68bd1
[Misc] Minor patch for draft model runner ( #6523 )
2024-07-18 06:06:21 +00:00
61e592747c
[Core] Introduce SPMD worker execution using Ray accelerated DAG ( #6032 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu >
2024-07-17 22:27:09 -07:00
d25877dd9b
[BugFix] Avoid secondary error in ShmRingBuffer destructor ( #6530 )
2024-07-17 22:24:43 -07:00
1c27d25fb5
[core][model] yet another cpu offload implementation ( #6496 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-07-17 20:54:35 -07:00
18fecc3559
[ Kernel ] Fp8 Channelwise Weight Support ( #6487 )
2024-07-18 03:18:13 +00:00
b5af8c223c
[Model] Pipeline parallel support for Mixtral ( #6516 )
2024-07-17 19:26:04 -07:00
b5241e41d9
[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel ( #6511 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-18 01:38:35 +00:00
e76466dde2
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step ( #6338 )
2024-07-17 14:30:28 -07:00
5f0b9933e6
[Bugfix] Fix Ray Metrics API usage ( #6354 )
2024-07-17 19:40:10 +00:00
a38524f338
[DOC] - Add docker image to Cerebrium Integration ( #6510 )
2024-07-17 10:22:53 -07:00
2fa4623d9e
[Core] Refactor _prepare_model_input_tensors - take 2 ( #6164 )
2024-07-17 09:37:16 -07:00
a9a2e74d21
[Misc] Use torch.Tensor for type annotation ( #6505 )
2024-07-17 13:01:10 +00:00
e09ce759aa
[TPU] Remove multi-modal args in TPU backend ( #6504 )
2024-07-17 04:02:53 -07:00
5fa6e9876e
[Bugfix] Fix for multinode crash on 4 PP ( #6495 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-17 08:25:10 +00:00
5bf35a91e4
[Doc][CI/Build] Update docs and tests to use vllm serve ( #6431 )
2024-07-17 07:43:21 +00:00
a19e8d3726
[Misc][Speculative decoding] Typos and typing fixes ( #6467 )
...
Co-authored-by: caishangming.csm <caishangming.csm@alibaba-inc.com >
2024-07-17 07:17:07 +00:00
10383887e0
[ROCm] Cleanup Dockerfile and remove outdated patch ( #6482 )
2024-07-16 22:47:02 -07:00
1d094fd7c0
[Distributed][PP] only create embedding & lm head when necessary ( #6455 )
...
original title: [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization
2024-07-16 19:20:26 -07:00
ce37be7ba0
[misc][distributed] add seed to dummy weights ( #6491 )
2024-07-16 19:16:34 -07:00
7f62077af5
[misc][distributed] improve tests ( #6488 )
2024-07-16 17:35:52 -07:00
09c2eb85dd
[ci][distributed] add pipeline parallel correctness test ( #6410 )
2024-07-16 15:44:22 -07:00
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale ( #6081 )
2024-07-16 15:31:32 -07:00
160e1d8c99
[Misc] Log spec decode metrics ( #6454 )
2024-07-16 20:37:10 +00:00
94162beb9f
[Doc] Fix the lora adapter path in server startup script ( #6230 )
2024-07-16 10:11:04 -07:00
c467dff24f
[Hardware][TPU] Support MoE with Pallas GMM kernel ( #6457 )
2024-07-16 09:56:28 -07:00
9f4ccec761
[doc][misc] remind to cancel debugging environment variables ( #6481 )
...
[doc][misc] remind users to cancel debugging environment variables after debugging (#6481 )
2024-07-16 09:45:30 -07:00
38ef94888a
[CI/Build] Remove "boardwalk" image asset ( #6460 )
2024-07-16 08:59:36 -07:00
2bb0489cb3
[Core] Use numpy to speed up padded token processing ( #6442 )
2024-07-16 08:13:25 -07:00
7508a3dc34
[Misc] Fix typos in spec. decode metrics logging. ( #6470 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-16 13:55:15 +00:00
7a3d2a5b95
[Frontend] Support for chat completions input in the tokenize endpoint ( #5923 )
2024-07-16 20:18:09 +08:00
d97011512e
[CI/Build] vLLM cache directory for images ( #6444 )
2024-07-15 23:12:25 -07:00
37d776606f
[Docs] Announce 5th meetup ( #6458 )
2024-07-15 21:04:58 -07:00
d92b3c5cde
[Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests ( #6419 )
2024-07-15 18:54:15 -07:00
9ad32dacd9
[BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug ( #6425 )
...
Co-authored-by: Mor Zusman <morz@ai21.com >
2024-07-16 01:32:55 +00:00
d6f3b3d5c4
Pin sphinx-argparse version ( #6453 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-16 01:26:11 +00:00
4552e37b55
[CI/Build][TPU] Add TPU CI test ( #6277 )
...
Co-authored-by: kevin <kevin@anyscale.com >
2024-07-15 14:31:16 -07:00
ec9933f4a5
[Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod ( #6289 )
2024-07-15 19:02:14 +00:00
3dee97b05f
[Docs] Add Google Cloud to sponsor list ( #6450 )
2024-07-15 11:58:10 -07:00
4cf256ae7f
[misc][distributed] fix pp missing layer condition ( #6446 )
2024-07-15 10:32:35 -07:00
64fdc08c72
bump version to v0.5.2 ( #6433 )
2024-07-15 17:27:40 +00:00
4ef95b0f06
[Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF ( #6409 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-15 13:14:49 -04:00
eaec4b9153
[Bugfix] Add custom Triton cache manager to resolve MoE MP issue ( #6140 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Chih-Chieh-Yang <chih.chieh.yang@ibm.com >
2024-07-15 10:12:47 -07:00
a63a4c6341
[Misc] Use 0.0.9 version for flashinfer ( #6447 )
...
Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com >
2024-07-15 10:10:26 -07:00
c8fd97f26d
[Kernel] Use CUTLASS kernels for the FP8 layers with Bias ( #6270 )
2024-07-15 13:05:52 -04:00
94b82e8c18
[doc][distributed] add suggestion for distributed inference ( #6418 )
2024-07-15 09:45:51 -07:00
6ae1597ddf
[VLM] Minor space optimization for ClipVisionModel ( #6436 )
2024-07-15 17:29:51 +08:00
22e79ee8f3
[doc][misc] doc update ( #6439 )
2024-07-14 23:33:25 -07:00
de19916314
[Bugfix] Convert image to RGB by default ( #6430 )
2024-07-15 05:39:15 +00:00
69672f116c
[core][distributed] simplify code to support pipeline parallel ( #6406 )
2024-07-14 21:20:51 -07:00
44874a0bf9
[Doc] add env docs for flashinfer backend ( #6437 )
2024-07-14 21:16:51 -07:00
b47008b4d2
[BugFix] BatchResponseData body should be optional ( #6345 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-15 04:06:09 +00:00
9bfece89fd
Add FUNDING.yml ( #6435 )
2024-07-14 20:36:16 -07:00
32c9d7f765
Report usage for beam search ( #6404 )
2024-07-14 19:37:35 -07:00
ccb20db8bd
[Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' ( #6428 )
2024-07-14 19:27:01 -07:00
a754dc2cb9
[CI/Build] Cross python wheel ( #6394 )
2024-07-14 18:54:46 -07:00
61e85dbad8
[Doc] xpu backend requires running setvars.sh ( #6393 )
2024-07-14 17:10:11 -07:00
dbfe254eda
[Feature] vLLM CLI ( #5090 )
...
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-07-14 15:36:43 -07:00
73030b7dae
[ Misc ] Enable Quantizing All Layers of DeekSeekv2 ( #6423 )
2024-07-14 21:38:42 +00:00
ccd3c04571
[ci][build] fix commit id ( #6420 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-07-14 22:16:21 +08:00
9dad5cc859
[Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace ( #6384 )
2024-07-14 13:37:19 +00:00
6ef3bf912c
Remove unnecessary trailing period in spec_decode.rst ( #6405 )
2024-07-14 07:58:09 +00:00
540c0368b1
[Model] Initialize Fuyu-8B support ( #3924 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-14 05:27:14 +00:00
fb6af8bc08
[ Misc ] Apply MoE Refactor to Deepseekv2 To Support Fp8 ( #6417 )
2024-07-13 20:03:58 -07:00
eeceadaecc
[Misc] Add deprecation warning for beam search ( #6402 )
2024-07-13 11:52:22 -07:00
babf52dade
[ Misc ] More Cleanup of Marlin ( #6359 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-07-13 10:21:37 +00:00
9da4aad44b
Updating LM Format Enforcer version to v10.3 ( #6411 )
2024-07-13 10:09:12 +00:00
41708e5034
[ci] try to add multi-node tests ( #6280 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-12 21:51:48 -07:00
d80aef3776
[Docs] Clean up latest news ( #6401 )
2024-07-12 19:36:53 -07:00
e1684a766a
[Bugfix] Fix hard-coded value of x in context_attention_fwd ( #6373 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-12 18:30:54 -07:00
a27f87da34
[Doc] Fix Typo in Doc ( #6392 )
...
Co-authored-by: Saliya Ekanayake <esaliya@d-matrix.ai >
2024-07-13 00:48:23 +00:00
16ff6bd58c
[ci] Fix wording for GH bot ( #6398 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-12 16:34:37 -07:00
f8f9ff57ee
[Bugfix][TPU] Fix megacore setting for v5e-litepod ( #6397 )
2024-07-12 15:59:47 -07:00
6bc9710f6e
Fix release pipeline's dir permission ( #6391 )
2024-07-12 15:52:43 -07:00
111fc6e7ec
[Misc] Add generated git commit hash as vllm.__commit__ ( #6386 )
2024-07-12 22:52:15 +00:00
75f64d8b94
[Bugfix] Fix illegal memory access in FP8 MoE kernel ( #6382 )
2024-07-12 21:33:33 +00:00
21b2dcedab
Fix release pipeline's -e flag ( #6390 )
2024-07-12 14:08:04 -07:00
07b35af86d
Fix interpolation in release pipeline ( #6389 )
2024-07-12 14:03:39 -07:00
bb1a784b05
Fix release-pipeline.yaml ( #6388 )
2024-07-12 14:00:57 -07:00
d719ba24c5
Build some nightly wheels by default ( #6380 )
2024-07-12 13:56:59 -07:00
aa48e502fb
[MISC] Upgrade dependency to PyTorch 2.3.1 ( #5327 )
2024-07-12 12:04:26 -07:00
4dbebd03cc
[ci] Add GHA workflows to enable full CI run ( #6381 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-12 11:36:26 -07:00
b75bce1008
[ci] Add grouped tests & mark tests to run by default for fastcheck pipeline ( #6365 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-12 09:58:38 -07:00
b039cbbce3
[Misc] add fixture to guided processor tests ( #6341 )
2024-07-12 09:55:39 -07:00
f9d25c2519
[Build/CI] Checking/Waiting for the GPU's clean state ( #6379 )
2024-07-12 09:42:24 -07:00
024ad87cdc
[Bugfix] Fix dtype mismatch in PaliGemma ( #6367 )
2024-07-12 08:22:18 -07:00
aea19f0989
[ Misc ] Support Models With Bias in compressed-tensors integration ( #6356 )
2024-07-12 11:11:29 -04:00
f7160d946a
[Misc][Bugfix] Update transformers for tokenizer issue ( #6364 )
2024-07-12 08:40:07 +00:00
6047187cd8
[ Misc ] Remove separate bias add ( #6353 )
2024-07-12 05:06:09 +00:00
b6c16cf8ff
[ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in cuda/rocm ( #6352 )
2024-07-11 21:30:46 -07:00
d26a8b3f1f
[CI/Build] (2/2) Switching AMD CI to store images in Docker Hub ( #6350 )
2024-07-11 21:26:26 -07:00
d59eb98489
[Model][Phi3-Small] Remove scipy from blocksparse_attention ( #6343 )
2024-07-12 10:47:17 +08:00
adf32e0a0f
[Bugfix] Fix usage stats logging exception warning with OpenVINO ( #6349 )
2024-07-12 10:47:00 +08:00
2b0fb53481
[distributed][misc] be consistent with pytorch for libcudart.so ( #6346 )
...
[distributed][misc] keep consistent with how pytorch finds libcudart.so (#6346 )
2024-07-11 19:35:17 -07:00
d6ab528997
[Misc] Remove flashinfer warning, add flashinfer tests to CI ( #6351 )
2024-07-12 01:32:06 +00:00
7ed6a4f0e1
[ BugFix ] Prompt Logprobs Detokenization ( #6223 )
...
Co-authored-by: Zifei Tong <zifeitong@gmail.com >
2024-07-11 22:02:29 +00:00
a4feba929b
[CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy ( #5362 )
2024-07-11 13:28:38 -07:00
2d23b42d92
[doc] update pipeline parallel in readme ( #6347 )
2024-07-11 11:38:40 -07:00
1df43de9bb
[bug fix] Fix llava next feature size calculation. ( #6339 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
2024-07-11 17:21:10 +00:00
52b7fcb35a
Benchmark: add H100 suite ( #6047 )
2024-07-11 09:17:07 -07:00
b675069d74
[ Misc ] Refactor Marlin Python Utilities ( #6082 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-07-11 15:40:11 +00:00
55f692b46e
[BugFix] get_and_reset only when scheduler outputs are not empty ( #6266 )
2024-07-11 07:40:20 -07:00
8a1415cf77
[Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. ( #6326 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-11 07:05:59 -07:00
546b101fa0
[BugFix]: fix engine timeout due to request abort ( #6255 )
...
Signed-off-by: yatta zhang <ytzhang01@foxmail.com >
Signed-off-by: zhangyuntao.dev <zhangyuntao.dev@bytedance.com >
Co-authored-by: zhangyuntao.dev <zhangyuntao.dev@bytedance.com >
2024-07-11 06:46:31 -07:00
3963a5335b
[Misc] refactor(config): clean up unused code ( #6320 )
2024-07-11 09:39:07 +00:00
c4774eb841
[Bugfix] Fix snapshot download in serving benchmark ( #6318 )
2024-07-11 07:04:05 +00:00
fc17110bbe
[BugFix]: set outlines pkg version ( #6262 )
2024-07-11 04:37:11 +00:00
439c84581a
[Doc] Update description of vLLM support for CPUs ( #6003 )
2024-07-10 21:15:29 -07:00
99ded1e1c4
[Doc] Remove comments incorrectly copied from another project ( #6286 )
2024-07-10 17:05:26 -07:00
997df46a32
[Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor ( #6313 )
2024-07-10 16:39:02 -07:00
ae151d73be
[Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models ( #5765 )
2024-07-10 16:02:47 -07:00
44cc76610d
[Bugfix] Fix OpenVINOExecutor abstractmethod error ( #6296 )
...
Signed-off-by: sangjune.park <sangjune.park@navercorp.com >
2024-07-10 10:03:32 -07:00
b422d4961a
[CI/Build] Enable mypy typing for remaining folders ( #6268 )
2024-07-10 22:15:55 +08:00
c38eba3046
[Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. ( #6303 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-10 09:04:07 -04:00
e72ae80b06
[Bugfix] Support 2D input shape in MoE layer ( #6287 )
2024-07-10 09:03:16 -04:00
8a924d2248
[Doc] Guide for adding multi-modal plugins ( #6205 )
2024-07-10 14:55:34 +08:00
5ed3505d82
[Bugfix][TPU] Add prompt adapter methods to TPUExecutor ( #6279 )
2024-07-09 19:30:56 -07:00
da78caecfa
[core][distributed] zmq fallback for broadcasting large objects ( #6183 )
...
[core][distributed] add zmq fallback for broadcasting large objects (#6183 )
2024-07-09 18:49:11 -07:00
2416b26e11
[Speculative Decoding] Medusa Implementation with Top-1 proposer ( #4978 )
2024-07-09 18:34:02 -07:00
d3a245138a
[Bugfix]fix and needs_scalar_to_array logic check ( #6238 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-07-09 23:43:24 +00:00
673dd4cae9
[Docs] Docs update for Pipeline Parallel ( #6222 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-07-09 16:24:58 -07:00
4d6ada947c
[CORE] Adding support for insertion of soft-tuned prompts ( #4645 )
...
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com >
Co-authored-by: Joe G <joseph.granados@h2o.ai >
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-07-09 13:26:36 -07:00
a0550cbc80
Add support for multi-node on CI ( #5955 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-09 12:56:56 -07:00
08c5bdecae
[Bugfix][TPU] Fix outlines installation in TPU Dockerfile ( #6256 )
2024-07-09 02:56:06 -07:00
5d5b4c5fe5
[Bugfix][TPU] Add missing None to model input ( #6245 )
2024-07-09 00:21:37 -07:00
70c232f85a
[core][distributed] fix ray worker rank assignment ( #6235 )
2024-07-08 21:31:44 -07:00
a3c9435d93
[hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability ( #6216 )
2024-07-08 20:02:15 -07:00
4f0e0ea131
Add FlashInfer to default Dockerfile ( #6172 )
2024-07-08 13:38:03 -07:00
ddc369fba1
[Bugfix] Mamba cache Cuda Graph padding ( #6214 )
2024-07-08 11:25:51 -07:00
185ad31f37
[Bugfix] use diskcache in outlines _get_guide #5436 ( #6203 )
2024-07-08 11:23:24 -07:00
543aa48573
[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) ( #4888 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-07-08 17:12:15 +00:00
f7a8fa39d8
[Kernel] reloading fused_moe config on the last chunk ( #6210 )
2024-07-08 08:00:38 -07:00
717f4bcea0
Feature/add benchmark testing ( #5947 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-08 07:52:06 +00:00
16620f439d
do not exclude object field in CompletionStreamResponse ( #6196 )
2024-07-08 10:32:57 +08:00
3b08fe2b13
[misc][frontend] log all available endpoints ( #6195 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-07-07 15:11:12 -07:00
abfe705a02
[ Misc ] Support Fp8 via llm-compressor ( #6110 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-07-07 20:42:11 +00:00
333306a252
add benchmark for fix length input and output ( #5857 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-07 07:42:13 +00:00
6206dcb29e
[Model] Add PaliGemma ( #5189 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-07-07 09:25:50 +08:00
9389380015
[Doc] Move guide for multimodal model and other improvements ( #6168 )
2024-07-06 17:18:59 +08:00
175c43eca4
[Doc] Reorganize Supported Models by Type ( #6167 )
2024-07-06 05:59:36 +00:00
bc96d5c330
Move release wheel env var to Dockerfile instead ( #6163 )
2024-07-05 17:19:53 -07:00
f0250620dd
Fix release wheel build env var ( #6162 )
2024-07-05 16:24:31 -07:00
2de490d60f
Update wheel builds to strip debug ( #6161 )
2024-07-05 14:51:25 -07:00
79d406e918
[Docs] Fix readthedocs for tag build ( #6158 )
2024-07-05 12:44:40 -07:00
abad5746a7
bump version to v0.5.1 ( #6157 )
2024-07-05 12:04:51 -07:00
e58294ddf2
[Bugfix] Add verbose error if scipy is missing for blocksparse attention ( #5695 )
2024-07-05 10:41:01 -07:00
f1e15da6fe
[Frontend] Continuous usage stats in OpenAI completion API ( #5742 )
2024-07-05 10:37:09 -07:00
0097bb1829
[Bugfix] Use templated datasource in grafana.json to allow automatic imports ( #6136 )
...
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de >
2024-07-05 09:49:47 -07:00
ea4b570483
[VLM] Cleanup validation and update docs ( #6149 )
2024-07-05 05:49:38 +00:00
a41357e941
[VLM] Improve consistency between feature size calculation and dummy data for profiling ( #6146 )
2024-07-05 09:29:47 +08:00
ae96ef8fbd
[VLM] Calculate maximum number of multi-modal tokens by model ( #6121 )
2024-07-04 16:37:23 -07:00
69ec3ca14c
[Kernel][Model] logits_soft_cap for Gemma2 with flashinfer ( #6051 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-07-04 16:35:51 -07:00
81d7a50f24
[Hardware][Intel CPU] Adding intel openmp tunings in Docker file ( #6008 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2024-07-04 15:22:12 -07:00
27902d42be
[misc][doc] try to add warning for latest html ( #5979 )
2024-07-04 09:57:09 -07:00
56b325e977
[ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash attention and naive flash attention ( #6043 )
...
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
2024-07-03 22:19:38 -07:00
3dd507083f
[CI/Build] Cleanup VLM tests ( #6107 )
2024-07-03 18:58:18 -07:00
0ed646b7aa
[Distributed][Core] Support Py39 and Py38 for PP ( #6120 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-03 17:52:29 -07:00
1dab9bc8a9
[Bugfix] set OMP_NUM_THREADS to 1 by default for multiprocessing ( #6109 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-07-03 16:56:59 -07:00
3de6e6a30e
[core][distributed] support n layers % pp size != 0 ( #6115 )
2024-07-03 16:40:31 -07:00
966fe72141
[doc][misc] bump up py version in installation doc ( #6119 )
2024-07-03 15:52:04 -07:00
62963d129e
[ Misc ] Clean Up CompressedTensorsW8A8 ( #6113 )
2024-07-03 22:50:08 +00:00
d9e98f42e4
[vlm] Remove vision language config. ( #6089 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-03 22:14:16 +00:00
3c6325f0fc
[core][distributed] custom allreduce when pp size > 1 ( #6117 )
2024-07-03 14:41:32 -07:00
47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin ( #5975 )
2024-07-03 17:38:00 +00:00
7cd2ebb025
[Bugfix] Fix compute_logits in Jamba ( #6093 )
2024-07-03 00:32:35 -07:00
f1c78138aa
[Doc] Fix Mock Import ( #6094 )
2024-07-03 00:13:56 -07:00
3a86b54fb0
[VLM][Frontend] Proper Image Prompt Formatting from OpenAI API ( #6091 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-02 23:41:23 -07:00
f666207161
[misc][distributed] error on invalid state ( #6092 )
2024-07-02 23:37:29 -07:00
d830656a97
[BugFix] Avoid unnecessary Ray import warnings ( #6079 )
2024-07-03 14:09:40 +08:00
d18bab3587
[CI] Fix base url doesn't strip "/" ( #6087 )
2024-07-02 21:31:25 -07:00
9831aec49f
[Core] Dynamic image size support for VLMs ( #5276 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: ywang96 <ywang@roblox.com >
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-07-02 20:34:00 -07:00
482045ee77
[hardware][misc] introduce platform abstraction ( #6080 )
2024-07-02 20:12:22 -07:00
9d6a8daa87
[Model] Jamba support ( #4115 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: Erez Schwartz <erezs@ai21.com >
Co-authored-by: Mor Zusman <morz@ai21.com >
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com >
Co-authored-by: Tomer Asida <tomera@ai21.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-02 23:11:29 +00:00
ee93f4f92a
[CORE] Quantized lm-head Framework ( #4442 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
Co-authored-by: ZX <zx@lbx.dev >
2024-07-02 22:25:17 +00:00
7c008c51a9
[ Misc ] Refactor MoE to isolate Fp8 From Mixtral ( #5970 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-07-02 21:54:35 +00:00
4d26d806e1
Update conftest.py ( #6076 )
2024-07-02 20:14:22 +00:00
c5832d2ae9
[Core] Pipeline Parallel Support ( #4412 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-02 10:58:08 -07:00
15aba081f3
[Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) ( #6050 )
...
Co-authored-by: Sirej Dua <sirej.dua@databricks.com >
Co-authored-by: Sirej Dua <Sirej Dua>
2024-07-02 07:20:29 -07:00
31354e563f
[Doc] Reinstate doc dependencies ( #6061 )
2024-07-02 10:53:16 +00:00
98d6682cd1
[VLM] Remove image_input_type from VLM config ( #5852 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-02 07:57:09 +00:00
2c37540aa6
[Frontend] Add template related params to request ( #5709 )
2024-07-01 23:01:57 -07:00
3476ed0809
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) ( #5602 )
2024-07-01 20:10:37 -07:00
54600709b6
[Model] Changes to MLPSpeculator to support tie_weights and input_scale ( #5965 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Joshua Rosenkranz <jmrosenk@us.ibm.com >
2024-07-01 16:40:02 -07:00
e373853e12
[Frontend] Relax api url assertion for openai benchmarking ( #6046 )
2024-07-01 23:39:10 +00:00
c87ebc3ef9
[BugFix] Ensure worker model loop is always stopped at the right time ( #5987 )
2024-07-01 16:17:58 -07:00
c4059ea54f
[Bugfix] Add explicit end_forward calls to flashinfer ( #6044 )
2024-07-01 23:08:58 +00:00
8e0817c262
[Bugfix][Doc] Fix Doc Formatting ( #6048 )
2024-07-01 15:09:11 -07:00
83bdcb6ac3
add FAQ doc under 'serving' ( #5946 )
2024-07-01 14:11:36 -07:00
12a59959ed
[Bugfix] adding chunking mechanism to fused_moe to handle large inputs ( #6029 )
2024-07-01 21:08:29 +00:00
dec6fc6f3b
[Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool ( #6039 )
2024-07-01 20:12:40 +00:00
8893130b63
[doc][misc] further lower visibility of simple api server ( #6041 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-07-01 10:50:56 -07:00
bb60326836
[Misc] update benchmark backend for scalellm ( #6018 )
2024-07-01 10:20:33 -07:00
4050d646e5
[doc][misc] remove deprecated api server in doc ( #6037 )
2024-07-01 12:52:43 -04:00
d76084c12f
[ CI ] Re-enable Large Model LM Eval ( #6031 )
2024-07-01 12:40:45 -04:00
80ca1e6a3a
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker ( #5348 )
2024-07-01 00:33:05 -07:00
614aa51203
[misc][cuda] use nvml to avoid accidentally cuda initialization ( #6007 )
2024-06-30 20:07:34 -07:00
af9ad46fca
[ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) ( #5940 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-30 23:06:27 +00:00
7836fdcc11
[Misc] Fix get_min_capability ( #5971 )
2024-06-30 20:15:16 +00:00
deacb7ec44
[ CI ] Temporarily Disable Large LM-Eval Tests ( #6005 )
...
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic>
2024-06-30 11:56:56 -07:00
f5e73c9f1b
[Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. ( #5909 )
...
Co-authored-by: sang <sangcho@anyscale.com >
2024-06-30 17:11:15 +00:00
c6c240aa0a
[Frontend]: Support base64 embedding ( #5935 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-06-30 23:53:00 +08:00
2be6955a3f
[ci][distributed] fix device count call
...
[ci][distributed] fix some cuda init that makes it necessary to use spawn (#5991 )
2024-06-30 08:06:13 +00:00
9d47f64eb6
[CI/Build] [3/3] Reorganize entrypoints tests ( #5966 )
2024-06-30 12:58:49 +08:00
cff6a1fec1
[CI/Build] Reuse code for checking output consistency ( #5988 )
2024-06-30 11:44:25 +08:00
bcc6a09b63
[CI/Build] Temporarily Remove Phi3-Vision from TP Test ( #5989 )
2024-06-30 09:18:31 +08:00
9def10664e
[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests ( #5949 )
2024-06-29 12:47:58 -07:00
75aa1442db
[ CI/Build ] LM Eval Harness Based CI Testing ( #5838 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-29 13:04:30 -04:00
99397da534
[CI/Build] Add TP test for vision models ( #5892 )
2024-06-29 15:45:54 +00:00
8dbfcd35bf
[ CI/Build ] Added E2E Test For Compressed Tensors ( #5839 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-29 21:12:58 +08:00
f7dac83d95
[Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k ( #5939 )
2024-06-29 21:04:20 +08:00
7c01f70641
[Core] Optimize SequenceStatus.is_finished by switching to IntEnum ( #5974 )
2024-06-29 12:47:53 +00:00
51e971d39e
[Bugfix] Support eos_token_id from config.json ( #5954 )
2024-06-29 11:19:02 +00:00
329df38f1a
[Misc] Update Phi-3-Vision Example ( #5981 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-06-29 14:34:29 +08:00
580353da93
[Bugfix] Fix precisions in Gemma 1 ( #5913 )
2024-06-29 03:10:21 +00:00
ba4994443a
[Kernel] Add punica dimensions for Granite 3b and 8b ( #5930 )
...
Signed-off-by: Joe Runde <joe@joerun.de >
2024-06-29 10:48:25 +08:00
906a19cdb0
[Misc] Extend vLLM Metrics logging API ( #5925 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-06-29 10:36:06 +08:00
c4bca740e8
[Bugfix] fix missing last itl in openai completions benchmark ( #5926 )
2024-06-29 10:34:42 +08:00
7f83f40dee
[Bugfix][TPU] Fix pad slot id ( #5977 )
2024-06-28 18:55:17 -07:00
54814fd85b
[Bugfix][TPU] Fix TPU sampler output ( #5978 )
2024-06-28 18:14:16 -07:00
7041de4384
[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode ( #4628 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com >
Co-authored-by: bong-furiosa <bongwon.jang@furiosa.ai >
2024-06-28 15:28:49 -07:00
6a62cb82cc
[Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError ( #5963 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 17:46:30 -04:00
5d2a1a9cf0
Unmark more files as executable ( #5962 )
2024-06-28 17:34:56 -04:00
4bf35ed9ae
[Bugfix] Only add Attention.kv_scale if kv cache quantization is enabled ( #5936 )
2024-06-28 21:12:40 +00:00
be0b3af9e0
Support Deepseek-V2 ( #4650 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com >
2024-06-28 13:24:57 -07:00
2cd402e169
[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 ( #5921 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 18:43:49 +00:00
b185230744
[ Misc ] Remove fp8_shard_indexer from Col/Row Parallel Linear (Simplify Weight Loading) ( #5928 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 13:49:57 -04:00
6a2d659d28
[Bugfix] Fix compute datatype for cutlass 3.x epilogues ( #5931 )
2024-06-28 17:10:34 +00:00
b2c620230a
[Spec Decode] Introduce DraftModelRunner ( #5799 )
2024-06-28 09:17:51 -07:00
b90d8cd832
[Distributed] Make it clear that % should not be in tensor dict keys. ( #5927 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
2024-06-28 15:20:22 +00:00
3b752a6555
[CI/Build] [2/3] Reorganize entrypoints tests ( #5904 )
2024-06-28 07:59:18 -07:00
ec1ad0046c
[Bugfix] Better error message for MLPSpeculator when num_speculative_tokens is set too high ( #5894 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-28 07:42:17 -07:00
57f09a419c
[Hardware][Intel] OpenVINO vLLM backend ( #5379 )
2024-06-28 13:50:16 +00:00
5932634409
Unmark fused_moe config json file as executable ( #5960 )
2024-06-28 06:36:12 -07:00
5cbe8d155c
[Core] Registry for processing model inputs ( #5214 )
...
Co-authored-by: ywang96 <ywang@roblox.com >
2024-06-28 12:09:56 +00:00
0d0e3a42ac
[Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner ( #5956 )
2024-06-28 12:03:41 +00:00
74d55c065b
[VLM][BugFix] Make sure that multi_modal_kwargs can broadcast properly with ring buffer. ( #5905 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-28 07:29:13 +00:00
f136da15e1
[Hardware][TPU] Optimize KV cache swapping ( #5878 )
2024-06-27 21:12:13 -07:00
c3dde367f1
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X ( #5932 )
2024-06-27 13:41:08 -07:00
64e8d2a783
[core][misc] remove logical block ( #5882 )
2024-06-27 13:34:55 -07:00
79c92c7c8a
[Model] Add Gemma 2 ( #5908 )
2024-06-27 13:33:56 -07:00
736ed38849
[CI/Build] Fix Args for _get_logits_warper in Sampler Test ( #5922 )
2024-06-27 11:43:04 -07:00
365791ff81
[BugFix] Fix min_tokens behaviour for multiple eos tokens ( #5849 )
2024-06-27 11:31:11 -07:00
691e29ecf3
[BugFix] Fix MLPSpeculator handling of num_speculative_tokens ( #5876 )
2024-06-27 10:59:33 -07:00
3fd02bda51
[doc][misc] add note for Kubernetes users ( #5916 )
2024-06-27 10:07:07 -07:00
98cf2ed678
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision ( #5896 )
2024-06-27 09:08:10 -07:00
e9d32d077d
[CI/Build] [1/3] Reorganize entrypoints tests ( #5526 )
2024-06-27 12:43:17 +00:00
2061f0b8a7
[Bugfix] Fix img_sizes Parsing in Phi3-Vision ( #5888 )
2024-06-27 08:29:24 +00:00
96354d6a29
[Model] Add base class for LoRA-supported models ( #5018 )
2024-06-27 16:03:04 +08:00
d12af207d2
[VLM][Bugfix] Make sure that multi_modal_kwargs is broadcasted properly ( #5880 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
2024-06-27 15:15:24 +08:00
6eabc6cb0e
[Doc] Add note about context length in Phi-3-Vision example ( #5887 )
2024-06-26 23:20:01 -07:00
2110557dab
[BugFix] Fix cuda graph for MLPSpeculator ( #5875 )
...
Co-authored-by: Abhinav Goyal <abhinav.goyal@flipkart.com >
2024-06-27 04:12:10 +00:00
b9e84259e9
[Misc] Add example for LLaVA-NeXT ( #5879 )
2024-06-26 17:57:16 -07:00
294104c3f9
[doc] update usage of env var to avoid conflict ( #5873 )
2024-06-26 17:57:12 -04:00
38a1674abb
Support CPU inference with VSX PowerPC ISA ( #5652 )
2024-06-26 21:53:04 +00:00
f5c8628fdc
[Bugfix][TPU] Fix CPU cache allocation ( #5869 )
2024-06-26 13:42:40 -07:00
cbc53b6b8d
[Hardware][TPU] Support parallel sampling & Swapping ( #5855 )
2024-06-26 11:07:49 -07:00
c54269d967
[Frontend] Add tokenize/detokenize endpoints ( #5054 )
2024-06-26 16:54:22 +00:00
5bfd1bbc98
[Kernel] Adding bias epilogue support for cutlass_scaled_mm ( #5560 )
...
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-06-26 15:16:00 +00:00
6984c02a27
[CI/Build] Refactor image test assets ( #5821 )
2024-06-26 01:02:34 -07:00
3439c5a8e3
[Bugfix][TPU] Fix KV cache size calculation ( #5860 )
2024-06-26 00:58:23 -07:00
6806998bf9
[Bugfix] Fix embedding to support 2D inputs ( #5829 )
2024-06-26 00:15:22 -07:00
515080ad2f
[bugfix][distributed] fix shm broadcast when the queue size is full ( #5801 )
2024-06-25 21:56:02 -07:00
3aa7b6cf66
[Misc][Doc] Add Example of using OpenAI Server with VLM ( #5832 )
2024-06-25 20:34:25 -07:00
dda4811591
[Core] Refactor Worker and ModelRunner to consolidate control plane communication ( #5408 )
...
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu >
Signed-off-by: Stephanie <swang@anyscale.com >
Co-authored-by: Stephanie <swang@anyscale.com >
2024-06-25 20:30:03 -07:00
82079729cc
[Bugfix] Fix assertion in NeuronExecutor ( #5841 )
2024-06-25 19:52:10 -07:00
c2a8ac75e0
[CI/Build] Add E2E tests for MLPSpeculator ( #5791 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-26 00:04:08 +00:00
f178e56c68
[Hardware][TPU] Raise errors for unsupported sampling params ( #5850 )
2024-06-25 16:58:23 -07:00
dd793d1de5
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes ( #5422 )
2024-06-25 15:56:15 -07:00
bc34937d68
[Hardware][TPU] Refactor TPU backend ( #5831 )
2024-06-25 15:25:52 -07:00
dd248f7675
[Misc] Update w4a16 compressed-tensors support to include w8a16 ( #5794 )
2024-06-25 19:23:35 +00:00
d9b34baedd
[CI/Build] Add unit testing for FlexibleArgumentParser ( #5798 )
2024-06-25 12:18:03 -07:00
c18ebfdd71
[doc][distributed] add both gloo and nccl tests ( #5834 )
2024-06-25 15:10:28 -04:00
67882dbb44
[Core] Add fault tolerance for RayTokenizerGroupPool ( #5748 )
2024-06-25 10:15:10 -07:00
7b99314301
[Misc] Remove useless code in cpu_worker ( #5824 )
2024-06-25 09:41:36 -07:00
2ce5d6688b
[Speculative Decoding] Support draft model on different tensor-parallel size than target model ( #5414 )
2024-06-25 09:56:06 +00:00
f23871e9ee
[Doc] Add notice about breaking changes to VLMs ( #5818 )
2024-06-25 01:25:03 -07:00
e9de9dd551
[ci] Remove aws template ( #5757 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-24 21:09:02 -07:00
ba991d5c84
[Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args ( #5795 )
2024-06-24 17:01:19 -06:00
1744cc99ba
[Doc] Add Phi-3-medium to list of supported models ( #5788 )
2024-06-24 10:48:55 -07:00
e72dc6cb35
[Doc] Add "Suggest edit" button to doc pages ( #5789 )
2024-06-24 10:26:17 -07:00
c246212952
[doc][faq] add warning to download models on every node ( #5783 )
2024-06-24 15:37:42 +08:00
edd5fe5fa2
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement ( #5772 )
2024-06-24 12:11:53 +08:00
5d4d90536f
[Distributed] Add send and recv helpers ( #5719 )
2024-06-23 14:42:28 -07:00
6c916ac8a8
[BugFix] [Kernel] Add Cutlass2x fallback kernels ( #5744 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-06-23 21:07:11 +00:00
832ea88fcb
[core][distributed] improve shared memory broadcast ( #5754 )
2024-06-22 10:00:43 -07:00
8c00f9c15d
[Docs][TPU] Add installation tip for TPU ( #5761 )
2024-06-21 23:09:40 -07:00
0cbc1d2b4f
[Bugfix] Fix pin_lora error in TPU executor ( #5760 )
2024-06-21 22:25:14 -07:00
ff9ddbceee
[Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py ( #5756 )
2024-06-22 03:33:12 +00:00
9c62db07ed
[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs ( #5710 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-22 02:07:08 +00:00
cf90ae0123
[CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline ( #5616 )
2024-06-21 17:09:34 -07:00
f5dda63eb5
[LoRA] Add support for pinning lora adapters in the LRU cache ( #5603 )
2024-06-21 15:42:46 -07:00
7187507301
[ci][test] fix ca test in main ( #5746 )
2024-06-21 14:04:26 -07:00
f1e72cc19a
[BugFix] exclude version 1.15.0 for modelscope ( #5668 )
2024-06-21 13:15:48 -06:00
5b15bde539
[Doc] Documentation on supported hardware for quantization methods ( #5745 )
2024-06-21 12:44:29 -04:00
bd620b01fb
[Kernel][CPU] Add Quick gelu to CPU ( #5717 )
2024-06-21 06:39:40 +00:00
d9a252bc8e
[Core][Distributed] add shm broadcast ( #5399 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-06-21 05:12:35 +00:00
67005a07bc
[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora ( #5665 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-06-21 04:46:28 +00:00
c35e4a3dd7
[BugFix] Fix test_phi3v.py ( #5725 )
2024-06-21 04:45:34 +00:00
1f5674218f
[Kernel] Add punica dimension for Qwen2 LoRA ( #5441 )
2024-06-20 17:55:41 -07:00
b12518d3cf
[Model] MLPSpeculator speculative decoding support ( #4947 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com >
2024-06-20 20:23:12 -04:00
6c5b7af152
[distributed][misc] use fork by default for mp ( #5669 )
2024-06-20 17:06:34 -07:00
8065a7e220
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names ( #5718 )
2024-06-20 17:00:13 -06:00
3f3b6b2150
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels ( #5715 )
2024-06-20 18:36:10 +00:00
a7dcc62086
[Kernel] Update Cutlass int8 kernel configs for SM80 ( #5275 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-06-20 13:33:21 +00:00
ad137cd111
[Model] Port over CLIPVisionModel for VLMs ( #5591 )
2024-06-20 11:52:09 +00:00
111af1fa2c
[Kernel] Update Cutlass int8 kernel configs for SM90 ( #5514 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-06-20 06:37:08 +00:00
1b2eaac316
[Bugfix][Doc] Fix Duplicate Explicit Target Name Errors ( #5703 )
2024-06-19 23:10:47 -07:00
3730a1c832
[Misc] Improve conftest ( #5681 )
2024-06-19 19:09:21 -07:00
949e49a685
[ci] Limit num gpus if specified for A100 ( #5694 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-19 16:30:03 -07:00
4a30d7e3cc
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes ( #5650 )
2024-06-19 18:06:44 -04:00
e83db9e7e3
[Doc] Update docker references ( #5614 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-06-19 15:01:45 -07:00
78687504f7
[Bugfix] AsyncLLMEngine hangs with asyncio.run ( #5654 )
2024-06-19 13:57:12 -07:00
d571ca0108
[ci][distributed] add tests for custom allreduce ( #5689 )
2024-06-19 20:16:04 +00:00
afed90a034
[Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py ( #5688 )
2024-06-19 14:41:42 -04:00
3ee5c4bca5
[ci] Add A100 queue into AWS CI template ( #5648 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-19 08:42:13 -06:00
e9c2732b97
[CI/Build] Add tqdm to dependencies ( #5680 )
2024-06-19 08:37:33 -06:00
d8714530d1
[Misc] Add param max-model-len in benchmark_latency.py ( #5629 )
2024-06-19 18:19:08 +08:00
7d46c8d378
[Bugfix] Fix sampling_params passed incorrectly in Phi3v example ( #5684 )
2024-06-19 17:58:32 +08:00
da971ec7a5
[Model] Add FP8 kv cache for Qwen2 ( #5656 )
2024-06-19 09:38:26 +00:00
3eea74889f
[misc][distributed] use 127.0.0.1 for single-node ( #5619 )
2024-06-19 08:05:00 +00:00
f758aed0e8
[Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices ( #5641 )
2024-06-18 23:21:29 -07:00
e5150f2c28
[Bugfix] Added test for sampling repetition penalty bug. ( #5659 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-19 06:03:55 +00:00
59a1eb59c9
[Bugfix] Fix Phi-3 Long RoPE scaling implementation ( #5628 )
2024-06-19 01:46:38 +00:00
6820724e51
[Bugfix] Fix w8a8 benchmarks for int8 case ( #5643 )
2024-06-19 00:33:25 +00:00
b23ce92032
[Bugfix] Fix CUDA version check for mma warning suppression ( #5642 )
2024-06-18 23:48:49 +00:00
2bd231a7b7
[Doc] Added cerebrium as Integration option ( #5553 )
2024-06-18 15:56:59 -07:00
8a173382c8
[Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties ( #5639 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-18 14:18:37 -07:00
07feecde1a
[Model] LoRA support added for command-r ( #5178 )
2024-06-18 11:01:21 -07:00
19091efc44
[ci] Setup Release pipeline and build release wheels with cache ( #5610 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-18 11:00:36 -07:00
95db455e7f
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization ( #5542 )
2024-06-18 12:45:05 -04:00
7879f24dcc
[Misc] Add OpenTelemetry support ( #4687 )
...
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.
A markdown guide with screenshots has also been added to show users how to use this feature.
2024-06-19 01:17:03 +09:00
13db4369d9
[ci] Deprecate original CI template ( #5624 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-18 14:26:20 +00:00
4ad7b53e59
[CI/Build][Misc] Update Pytest Marker for VLMs ( #5623 )
2024-06-18 13:10:04 +00:00
f0cc0e68e3
[Misc] Remove import from transformers logging ( #5625 )
2024-06-18 12:12:19 +00:00
db5ec52ad7
[bugfix][distributed] improve p2p capability test ( #5612 )
...
[bugfix][distributed] do not error if two processes do not agree on p2p capability (#5612 )
2024-06-18 07:21:05 +00:00
114d7270ff
[CI] Avoid naming different metrics with the same name in performance benchmark ( #5615 )
2024-06-17 21:37:18 -07:00
32c86e494a
[Misc] Fix typo ( #5618 )
2024-06-17 20:58:30 -07:00
8eadcf0b90
[misc][typo] fix typo ( #5620 )
2024-06-17 20:54:57 -07:00
5002175e80
[Kernel] Add punica dimensions for Granite 13b ( #5559 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-06-18 03:54:11 +00:00
daef218b55
[Model] Initialize Phi-3-vision support ( #4986 )
2024-06-17 19:34:33 -07:00
fa9e385229
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier ( #5131 )
2024-06-17 21:29:09 -05:00
26e1188e51
[Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py ( #5606 )
2024-06-17 23:16:10 +00:00
a3e8a05d4c
[Bugfix] Fix KV head calculation for MPT models when using GQA ( #5142 )
2024-06-17 15:26:41 -07:00
e441bad674
[Optimization] use a pool to reuse LogicalTokenBlock.token_ids ( #5584 )
2024-06-17 22:08:05 +00:00
1b44aaf4e3
[bugfix][distributed] fix 16 gpus local rank arrangement ( #5604 )
2024-06-17 21:35:04 +00:00
9e4e6fe207
[CI] the readability of benchmarking and prepare for dashboard ( #5571 )
...
[CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard (#5571 )
2024-06-17 11:41:08 -07:00
ab66536dbf
[CI/BUILD] Support non-AVX512 vLLM building and testing ( #5574 )
2024-06-17 14:36:10 -04:00
728c4c8a06
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend ( #3814 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com >
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com >
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com >
2024-06-17 11:01:25 -07:00
1f12122b17
[Misc] use AutoTokenizer for benchmark serving when vLLM not installed ( #5588 )
2024-06-17 09:40:35 -07:00
890d8d960b
[Kernel] compressed-tensors marlin 24 support ( #5435 )
2024-06-17 12:32:48 -04:00
9e74d9d003
Correct alignment in the seq_len diagram. ( #5592 )
...
Co-authored-by: Liqian Chen <liqian.chen@deeplang.ai >
2024-06-17 12:05:33 -04:00
9333fb8eb9
[Model] Rename Phi3 rope scaling type ( #5595 )
2024-06-17 12:04:14 -04:00
e2b85cf86a
Fix w8a8 benchmark and add Llama-3-8B ( #5562 )
2024-06-17 06:48:06 +00:00
845a3f26f9
[Doc] add debugging tips for crash and multi-node debugging ( #5581 )
2024-06-17 10:08:01 +08:00
f07d513320
[build][misc] limit numpy version ( #5582 )
2024-06-16 16:07:01 -07:00
4a6769053a
[CI][BugFix] Flip is_quant_method_supported condition ( #5577 )
2024-06-16 14:07:34 +00:00
f31c1f90e3
Add basic correctness 2 GPU tests to 4 GPU pipeline ( #5518 )
2024-06-16 07:48:02 +00:00
3ce2c050dd
[Fix] Correct OpenAI batch response format ( #5554 )
2024-06-15 16:57:54 -07:00
1c0afa13c5
[BugFix] Don't start a Ray cluster when not using Ray ( #5570 )
2024-06-15 16:30:51 -07:00
d919ecc771
add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 ( #5145 )
2024-06-15 13:38:16 -04:00
e691918e3b
[misc] Do not allow using LoRA with chunked prefill. ( #5538 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-06-15 14:59:36 +00:00
81fbb3655f
[CI/Build] Test both text and token IDs in batched OpenAI Completions API ( #5568 )
2024-06-15 07:29:42 -04:00
0e9164b40a
[mypy] Enable type checking for test directory ( #5017 )
2024-06-15 04:45:31 +00:00
1b8a0d71cf
[Core][Bugfix]: fix prefix caching for blockv2 ( #5364 )
...
Signed-off-by: Lei Wen <wenlei03@qiyi.com >
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
2024-06-14 17:23:56 -07:00
bd7efe95d0
Add ccache to amd ( #5555 )
2024-06-14 17:18:22 -07:00
f5bb85b435
[Core][Distributed] improve p2p cache generation ( #5528 )
2024-06-14 14:47:45 -07:00
28c145eb57
[Bugfix] Fix typo in Pallas backend ( #5558 )
2024-06-14 14:40:09 -07:00
e2afb03c92
[Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models ( #5460 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-14 20:28:11 +00:00
6e2527a7cb
[Doc] Update documentation on Tensorizer ( #5471 )
2024-06-14 11:27:57 -07:00
cdab68dcdb
[Docs] Add ZhenFund as a Sponsor ( #5548 )
2024-06-14 11:17:21 -07:00
d1c3d7d139
[misc][distributed] fix benign error in is_in_the_same_node ( #5512 )
2024-06-14 10:59:28 -07:00
77490c6f2f
[Core] Remove duplicate processing in async engine ( #5525 )
2024-06-14 10:04:42 -07:00
48f589e18b
[mis] fix flaky test of test_cuda_device_count_stateless ( #5546 )
2024-06-14 10:02:23 -07:00
348616ac4b
[Kernel] Suppress mma.sp warning on CUDA 12.5 and later ( #5401 )
2024-06-14 10:02:00 -07:00
15985680e2
[ Misc ] Rs/compressed tensors cleanup ( #5432 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com >
2024-06-14 10:01:46 -07:00
d74674bbd9
[Misc] Fix arg names ( #5524 )
2024-06-14 09:47:44 -07:00
703475f6c2
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue ( #5516 )
2024-06-14 09:30:15 -07:00
d47af2bc02
[CI/Build] Disable LLaVA-NeXT CPU test ( #5529 )
2024-06-14 09:27:30 -07:00
319ad7f1d3
[CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with perf-benchmarks label ( #5073 )
...
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-06-13 22:36:20 -07:00
0f0d8bc065
bump version to v0.5.0.post1 ( #5522 )
2024-06-13 19:42:06 -07:00
55d6361b13
[Misc] Fix arg names in quantizer script ( #5507 )
2024-06-13 19:02:53 -07:00
cd9c0d65d9
[Hardware][Intel] Support CPU inference with AVX2 ISA ( #5452 )
2024-06-13 17:22:24 -06:00
50eed24d25
Add cuda_device_count_stateless ( #5473 )
2024-06-13 16:06:49 -07:00
e38042d4af
[Kernel] Disable CUTLASS kernels for fp8 ( #5505 )
2024-06-13 13:38:05 -07:00
33e3b37242
[CI/Build] Disable test_fp8.py ( #5508 )
2024-06-13 13:37:48 -07:00
1696efe6c9
[misc] fix format.sh ( #5511 )
2024-06-13 12:09:16 -07:00
6b0511a57b
Revert "[Core] Remove unnecessary copies in flash attn backend" ( #5478 )
2024-06-13 11:22:50 -07:00
a8fda4f661
Separate dev requirements into lint and test ( #5474 )
2024-06-13 11:22:41 -07:00
30299a41fa
[MISC] Remove FP8 warning ( #5472 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com >
2024-06-13 11:22:30 -07:00
85657b5607
[Kernel] Factor out epilogues from cutlass kernels ( #5391 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: zifeitong <zifei.tong@parasail.io >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-06-13 11:22:19 -07:00
0ce7b952f8
[Doc] Update LLaVA docs ( #5437 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-13 11:22:07 -07:00
39873476f8
[CI/Build] Simplify OpenAI server setup in tests ( #5100 )
2024-06-13 11:21:53 -07:00
03dccc886e
[Misc] Add vLLM version getter to utils ( #5098 )
2024-06-13 11:21:39 -07:00
a65634d3ae
[Docs] Add 4th meetup slides ( #5509 )
2024-06-13 10:18:26 -07:00
80aa7e91fc
[Hardware][Intel] Optimize CPU backend and add more performance tips ( #4971 )
...
Co-authored-by: Jianan Gu <jianan.gu@intel.com >
2024-06-13 09:33:14 -07:00
bd43973522
[Kernel] Tune Qwen2MoE kernel configurations with tp2,4 ( #5497 )
...
Tune Qwen2-57B-A14B configs based on #4921
Throughput performance
command: python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2
A100 GPU
benchmark   no config                              w/ PR
tp=2        10.53 requests/s, 11058.17 tokens/s    12.47 requests/s, 13088.57 tokens/s
tp=4        17.77 requests/s, 18662.95 tokens/s    20.20 requests/s, 21212.32 tokens/s
2024-06-13 09:01:10 -07:00
23ec72fa03
[CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations ( #5466 )
2024-06-13 15:18:08 +00:00
c2637a613b
[Kernel] w4a16 support for compressed-tensors ( #5385 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-06-13 10:19:56 -04:00
88407532e7
[Bugfix] if the content is started with ":" (response of ping), client should i… ( #5303 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-12 20:16:41 -07:00
916d219d62
[ci] Use sccache to build images ( #5419 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-12 17:58:12 -07:00
ea3890a5f0
[Core][Distributed] code deduplication in tp&pp with coordinator( #5293 )
...
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293 )
2024-06-12 17:27:08 -07:00
2135cacb45
[Bugfix] Fix wrong multi_modal_input format for CPU runner ( #5451 )
2024-06-12 16:20:18 -07:00
7d19de2e9c
[Frontend] Add "input speed" to tqdm postfix alongside output speed ( #5425 )
2024-06-12 18:42:12 -04:00
94a07bbdd8
[Bugfix] Fix typo in scheduler.py (requeset -> request) ( #5470 )
2024-06-12 21:59:44 +00:00
b8d4dfff9c
[Doc] Update debug docs ( #5438 )
2024-06-12 14:49:31 -07:00
622d45128c
[misc] add hint for AttributeError ( #5462 )
2024-06-12 21:46:35 +00:00
51602eefd3
[Frontend] [Core] Support for sharded tensorized models ( #4990 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Sanger Steel <sangersteel@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-12 14:13:52 -07:00
5cc50a531f
[Bugfix] TYPE_CHECKING for MultiModalData ( #5444 )
2024-06-12 14:08:52 -07:00
5985e3427d
[Kernel] Vectorized FP8 quantize kernel ( #5396 )
...
Inspired by #5146 , this PR improves the FP8 quantize kernel by vectorizing data transfers to better utilize memory bandwidth. Microbenchmarks show that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large).
In detail, we applied 3 optimizations:
- Use the inverted scale so that most divisions are changed to multiplications.
- Unroll the loop by 4 to improve ILP.
- Use vectorized 4-element transfers to move data between HBM and SRAM.
2024-06-12 14:07:26 -07:00
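The "inverted scale" optimization above can be illustrated with a minimal NumPy sketch. This is a hypothetical illustration only, not vLLM's actual CUDA kernel: the function names, the naive/inverted split, and the use of `np.clip` to emulate FP8 E4M3 saturation (max magnitude 448.0) are assumptions made for the example. The point is that one division up front plus per-element multiplications replaces a per-element division.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max finite magnitude representable in float8_e4m3


def quantize_fp8_naive(x: np.ndarray, scale: float) -> np.ndarray:
    # One division per element, then saturate to the FP8 range.
    return np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)


def quantize_fp8_inverted(x: np.ndarray, scale: float) -> np.ndarray:
    # One division total; each element is multiplied instead,
    # which is the cheaper operation on GPU hardware.
    inv_scale = 1.0 / scale
    return np.clip(x * inv_scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)


x = np.linspace(-4.0, 4.0, 8, dtype=np.float32)
assert np.allclose(quantize_fp8_naive(x, 0.01), quantize_fp8_inverted(x, 0.01))
```

The real kernel additionally unrolls this loop and issues vectorized loads/stores, which NumPy hides from the caller.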
8b82a89997
[ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests ( #5464 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-12 14:00:18 -07:00
c3c2903e72
[Bugfix] Add device assertion to TorchSDPA ( #5402 )
2024-06-12 12:58:53 -07:00
1a8bfd92d5
[Hardware] Initial TPU integration ( #5292 )
2024-06-12 11:53:03 -07:00
847cdcca1c
[CI] Upgrade codespell version. ( #5381 )
2024-06-12 10:06:14 -07:00
e3c12bf6d2
Revert "[CI/Build] Add is_quant_method_supported to control quantization test configurations" ( #5463 )
2024-06-12 10:03:24 -07:00
3dd6853bc8
[CI/Build] Add is_quant_method_supported to control quantization test configurations ( #5253 )
2024-06-12 09:58:02 -07:00
8f89d72090
[Doc] add common case for long waiting time ( #5430 )
2024-06-11 11:12:13 -07:00
99dac099ab
[Core][Doc] Default to multiprocessing for single-node distributed case ( #5230 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-06-11 11:10:41 -07:00
c4bd03c7c5
[Core][Distributed] add same-node detection ( #5369 )
2024-06-11 10:53:59 -07:00
dcbf4286af
[Frontend] Customizable RoPE theta ( #5197 )
2024-06-11 10:42:26 -07:00
00e6a2dc53
[Bugfix] fix lora_dtype value type in arg_utils.py ( #5398 )
2024-06-11 10:40:23 -07:00
2e02311a1b
[Bugfix] Fix MultiprocessingGPUExecutor.check_health when world_size == 1 ( #5254 )
2024-06-11 10:38:07 -07:00
89ec06c33b
[Docs] [Spec decode] Fix docs error in code example ( #5427 )
2024-06-11 10:31:56 -07:00
9fde251bf0
[Doc] Add an automatic prefix caching section in vllm documentation ( #5324 )
...
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-06-11 10:24:59 -07:00
4c2ffb28ff
[Speculative decoding] Initial spec decode docs ( #5400 )
2024-06-11 10:15:40 -07:00
246598a6b1
[CI] docfix ( #5410 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: ywang96 <ywang@roblox.com >
2024-06-11 01:28:50 -07:00
8bab4959be
[Misc] Remove VLLM_BUILD_WITH_NEURON env variable ( #5389 )
2024-06-11 00:37:56 -07:00
3c4cebf751
[Doc][Typo] Fixing Missing Comma ( #5403 )
2024-06-11 00:20:28 -07:00
d8f31f2f8b
[Doc] add debugging tips ( #5409 )
2024-06-10 23:21:43 -07:00
640052b069
[Bugfix][Frontend] Cleanup "fix chat logprobs" ( #5026 )
2024-06-10 22:36:46 -07:00
351d5e7b82
[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs ( #5312 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-06-11 10:30:31 +08:00
a008629807
[Misc] Various simplifications and typing fixes ( #5368 )
2024-06-11 10:29:02 +08:00
76477a93b7
[ci] Fix Buildkite agent path ( #5392 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-10 18:58:07 -07:00
77c87beb06
[Doc] Add documentation for FP8 W8A8 ( #5388 )
2024-06-10 18:55:12 -06:00
114332b88e
Bump version to v0.5.0 ( #5384 )
2024-06-10 15:56:06 -07:00
cb77ad836f
[Docs] Alphabetically sort sponsors ( #5386 )
2024-06-10 15:17:19 -05:00
856c990041
[Docs] Add Docs on Limitations of VLM Support ( #5383 )
2024-06-10 09:53:50 -07:00
c5602f0baa
[ci] Mount buildkite agent on Docker container to upload benchmark results ( #5330 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-10 09:22:34 -07:00
f7f9c5f97b
[ci] Use small_cpu_queue for doc build ( #5331 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-10 09:21:11 -07:00
2c0d933594
[Bugfix] Fix LLaVA-NeXT ( #5380 )
2024-06-10 15:38:47 +00:00
774d1035e4
[Feature][Frontend]: Continued stream_options implementation also in CompletionRequest ( #5319 )
2024-06-10 14:22:09 +00:00
6b29d6fe70
[Model] Initial support for LLaVA-NeXT ( #4199 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-10 12:47:15 +00:00
0bfa1c4f13
[Misc] Improve error message when LoRA parsing fails ( #5194 )
2024-06-10 19:38:49 +08:00
c81da5f56d
[misc][typo] fix typo ( #5372 )
2024-06-10 09:51:02 +00:00
68bc81703e
[Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server ( #5374 )
2024-06-10 09:13:39 +00:00
5884c2b454
[Misc] Update to comply with the new compressed-tensors config ( #5350 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-06-10 03:49:46 +00:00
45f92c00cf
[Bugfix] Fix KeyError: 1 When Using LoRA adapters ( #5164 )
2024-06-09 16:23:14 -07:00
5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops ( #5047 )
2024-06-09 16:23:30 -04:00
5d7e3d0176
[mis][ci/test] fix flaky test in test_sharded_state_loader.py ( #5361 )
...
[mis][ci/test] fix flaky test in tests/test_sharded_state_loader.py (#5361 )
2024-06-09 03:50:14 +00:00
0373e1837e
[Core][CUDA Graph] add output buffer for cudagraph ( #5074 )
...
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (#5074 )
2024-06-08 19:14:43 -07:00
c09dade2a2
[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale ( #5353 )
2024-06-08 13:54:05 -04:00
8ea5e44a43
[CI/Test] improve robustness of test (vllm_runner) ( #5357 )
...
[CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) (#5357 )
2024-06-08 08:59:20 +00:00
9fb900f90c
[CI/Test] improve robustness of test (hf_runner) ( #5347 )
...
[CI/Test] improve robustness of test by replacing del with context manager (hf_runner) (#5347 )
2024-06-07 22:31:32 -07:00
c96fc06747
[ROCm][AMD] Use pytorch sdpa math backend to do naive attention ( #4965 )
2024-06-07 19:13:12 -07:00
b3376e5c76
[Misc] Add args for selecting distributed executor to benchmarks ( #5335 )
2024-06-08 09:20:16 +08:00
e69ded7d1c
[Bug Fix] Fix the support check for FP8 CUTLASS ( #5352 )
...
Bug description:
With torch 2.4.0.dev20240603+cu121,
cutlass_fp8_supported outputs False; the (capability, version) tuple before the comparison is (90, 11111111112).
This PR fixes the support check for FP8 CUTLASS (cutlass_fp8_supported), which was introduced in https://github.com/vllm-project/vllm/pull/5183 .
2024-06-08 00:42:05 +00:00
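The failure mode can be pictured with a hedged sketch of such a support check; normalizing a dev/nightly version string before the numeric comparison is the fragile step. Helper names and the (90, 12.0) thresholds are illustrative, not vLLM's exact logic:

```python
# Sketch of a CUTLASS FP8 support check: require Hopper-class hardware
# (compute capability 90) and a recent-enough CUDA version. A dev build like
# "12.1.dev20240603" must be reduced to (12, 1) before comparing.
def cuda_version_tuple(version_str):
    parts = version_str.split(".")[:2]  # "12.1.dev..." -> ("12", "1")
    return tuple(int(p) for p in parts)

def cutlass_fp8_supported(capability, cuda_version_str):
    return capability >= 90 and cuda_version_tuple(cuda_version_str) >= (12, 0)
```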
767c727a81
fix DbrxFusedNormAttention missing cache_config ( #5340 )
...
Co-authored-by: team <calvinn.ng@ahrefs.com >
2024-06-07 14:10:21 -07:00
6840a71610
[Misc] Remove unused cuda_utils.h in CPU backend ( #5345 )
2024-06-07 14:09:13 -07:00
7a9cb294ae
[Frontend] Add OpenAI Vision API Support ( #5237 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-06-07 11:23:32 -07:00
ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization ( #5037 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-06-07 09:36:26 -07:00
dc49fb892c
Addition of lacked ignored_seq_groups in _schedule_chunked_prefill ( #5296 )
2024-06-07 13:35:42 +00:00
18a277b52d
Remove Ray health check ( #4693 )
2024-06-07 10:01:56 +00:00
8d75fe48ca
[Kernel] Switch fp8 layers to use the CUTLASS kernels ( #5183 )
...
Switch from torch._scaled_mm to vLLM's CUTLASS FP8 kernels when supported, as we are seeing a 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8
see https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
2024-06-07 08:42:35 +00:00
388596c914
[Misc][Utils] allow get_open_port to be called for multiple times ( #5333 )
2024-06-06 22:15:11 -07:00
baa15a9ec3
[Feature][Frontend]: Add support for stream_options in ChatCompletionRequest ( #5135 )
2024-06-07 03:29:24 +00:00
15063741e3
[Misc] Missing error message for custom ops import ( #5282 )
2024-06-06 20:17:21 -07:00
ccdc490dda
[Core] Change LoRA embedding sharding to support loading methods ( #5038 )
2024-06-06 19:07:57 -07:00
a31cab7556
[Core] Avoid copying prompt/output tokens if no penalties are used ( #5289 )
2024-06-06 18:12:00 -07:00
828da0d44e
[Frontend] enable passing multiple LoRA adapters at once to generate() ( #5300 )
2024-06-06 15:48:13 -05:00
abe855d637
[Kernel] Retune Mixtral 8x22b configs for FP8 on H100 ( #5294 )
2024-06-06 09:29:29 -07:00
4efff036f0
Bugfix: fix broken download of models from ModelScope ( #5233 )
...
Co-authored-by: mulin.lyh <mulin.lyh@taobao.com >
2024-06-06 09:28:10 -07:00
89c920785f
[CI/Build] Update vision tests ( #5307 )
2024-06-06 05:17:18 -05:00
7b0a0dfb22
[Frontend][Core] Update Outlines Integration from FSM to Guide ( #4109 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Breno Faria <breno.faria@intrafind.com >
2024-06-05 16:49:12 -07:00
3a6ae1d33c
[CI] Disable flash_attn backend for spec decode ( #5286 )
2024-06-05 15:49:27 -07:00
8f1729b829
[Docs] Add Ray Summit CFP ( #5295 )
2024-06-05 15:25:18 -07:00
6a7c7711a2
[Misc] Skip for logits_scale == 1.0 ( #5291 )
2024-06-05 15:19:02 -07:00
0f83ddd4d7
[Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. ( #5290 )
2024-06-05 15:18:12 -07:00
065aff6c16
[Bugfix] Make EngineArgs use named arguments for config construction ( #5285 )
2024-06-05 15:16:56 -07:00
3d33e372a1
[BugFix] Fix log message about default max model length ( #5284 )
2024-06-05 14:53:16 -07:00
faf71bcd4b
[Speculative Decoding] Add ProposerWorkerBase abstract class ( #5252 )
2024-06-05 14:53:05 -07:00
f270a39537
[Docs] Add Sequoia as sponsors ( #5287 )
2024-06-05 18:02:56 +00:00
51a08e7d8f
[Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 ( #5238 )
2024-06-05 10:59:14 -07:00
eb8fcd2666
[BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM ( #5207 )
...
Co-authored-by: qiujiawei9 <qiujiawei9@jd.com >
2024-06-05 10:59:02 -07:00
5563a4dea8
[Model] Correct Mixtral FP8 checkpoint loading ( #5231 )
2024-06-05 10:58:50 -07:00
ccd4f129e8
[Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size ( #5157 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-06-05 10:44:15 -07:00
02cc3b51a7
[misc] benchmark_serving.py -- add ITL results and tweak TPOT results ( #5263 )
2024-06-05 10:17:51 -07:00
d5b1eb081e
[CI] Add nightly benchmarks ( #5260 )
2024-06-05 09:42:08 -07:00
f0a500545f
[Frontend] OpenAI API server: Add add_special_tokens to ChatCompletionRequest (default False) ( #5278 )
2024-06-05 09:32:58 -07:00
c65146e75e
[Misc] Fix docstring of get_attn_backend ( #5271 )
2024-06-05 09:18:59 -07:00
41ca62cf03
[Misc] Add CustomOp interface for device portability ( #5255 )
2024-06-05 09:18:19 -07:00
974fc9b845
[Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True ( #5226 )
2024-06-04 19:37:28 -07:00
fee4dcc33a
[Misc] update collect env ( #5261 )
2024-06-04 17:29:09 -05:00
650a4cc55e
[Misc] Add transformers version to collect_env.py ( #5259 )
2024-06-04 12:52:28 -07:00
9ca62d8668
[CI] mark AMD test as softfail to prevent blockage ( #5256 )
2024-06-04 11:34:53 -07:00
45c35f0d58
[CI/Build] Reducing CPU CI execution time ( #5241 )
2024-06-04 10:26:40 -07:00
9ba093b4f4
[CI/Build] Simplify model loading for HfRunner ( #5251 )
2024-06-04 10:09:19 -07:00
27208be66e
[Kernel] Add back batch size 1536 and 3072 to MoE tuning ( #5242 )
2024-06-04 09:58:47 -07:00
87d5abef75
[Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend ( #5249 )
2024-06-04 09:57:51 -07:00
ec784b2526
[CI/Build] Add inputs tests ( #5215 )
2024-06-03 21:01:46 -07:00
a58f24e590
[Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor ( #5229 )
2024-06-03 20:55:50 -07:00
f42a006b15
[Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend ( #5210 )
2024-06-03 20:32:57 -07:00
3a434b07ed
[Kernel] Enhance MoE benchmarking & tuning script ( #4921 )
2024-06-03 20:06:59 -07:00
bd0e7802e0
[Bugfix] Add warmup for prefix caching example ( #5235 )
2024-06-03 19:36:41 -07:00
06b2550cbb
[Bugfix] Support prompt_logprobs==0 ( #5217 )
2024-06-03 17:59:30 -07:00
f775a07e30
[FRONTEND] OpenAI tools support named functions ( #5032 )
2024-06-03 18:25:29 -05:00
4f0d17c05c
New CI template on AWS stack ( #5110 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-03 16:16:43 -07:00
10c38e3e46
[Misc]: Implement CPU/GPU swapping in BlockManagerV2 ( #3834 )
2024-06-03 13:37:11 -07:00
cafb8e06c5
[CI/BUILD] enable intel queue for longer CPU tests ( #4113 )
2024-06-03 10:39:50 -07:00
cbb2f59cc8
[Kernel] Pass a device pointer into the quantize kernel for the scales ( #5159 )
2024-06-03 09:52:30 -07:00
0ab278ca31
[Core] Remove unnecessary copies in flash attn backend ( #5138 )
2024-06-03 09:39:31 -07:00
7a64d24aad
[Core] Support image processor ( #4197 )
2024-06-02 22:56:41 -07:00
dfbe60dc62
[Misc] Simplify code and fix type annotations in conftest.py ( #5118 )
2024-06-02 16:05:50 -07:00
a66cf40b20
[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer ( #4927 )
...
This PR enables the fused topk_softmax kernel used in the MoE layer for HIP
2024-06-02 14:13:26 -07:00
f790ad3c50
[Frontend][OpenAI] Support for returning max_model_len on /v1/models response ( #4643 )
2024-06-02 08:06:13 +00:00
ed59a7ed23
Update test_ignore_eos ( #4898 )
2024-06-02 02:21:53 +00:00
044793d8df
[BugFix] Prevent LLM.encode for non-generation Models ( #5184 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-06-01 23:35:41 +00:00
c2d6d2f960
[Bugfix]: Fix issues related to prefix caching example ( #5177 ) ( #5180 )
2024-06-01 15:53:52 -07:00
8279078e21
[Bugfix] Remove deprecated @abstractproperty ( #5174 )
2024-06-01 22:40:25 +00:00
b9c0605a8e
[Feature][Kernel] Support bitsandbytes quantization and QLoRA ( #4776 )
2024-06-01 14:51:10 -06:00
37464a0f74
[Bugfix] Fix call to init_logger in openai server ( #4765 )
2024-06-01 17:18:50 +00:00
c354072828
[Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py ( #5151 )
...
Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com >
2024-06-01 17:11:22 +00:00
f081c3ce4b
[Kernel] Update Cutlass fp8 configs ( #5144 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-06-01 08:46:07 +00:00
260d119e86
[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU ( #5137 )
2024-06-01 06:45:32 +00:00
a360ff80bb
[CI/Build] CMakeLists: build all extensions' cmake targets at the same time ( #5034 )
2024-05-31 22:06:45 -06:00
1197e02141
[Build] Guard against older CUDA versions when building CUTLASS 3.x kernels ( #5168 )
2024-05-31 17:21:38 -07:00
657579113f
[Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support ( #5171 )
2024-05-31 17:20:19 -07:00
e9899fb7a4
[Model] Enable FP8 QKV in MoE and refine kernel tuning script ( #5039 )
2024-05-31 14:29:19 -07:00
a377f0bd5e
[Misc]: optimize eager mode host time ( #4196 )
...
Co-authored-by: xuhao <xuhao@cambricon.com >
2024-05-31 13:14:50 +08:00
e9d3aa04f6
Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" ( #5149 )
2024-05-30 22:00:26 -07:00
a22dea54d3
[Model] Support MAP-NEO model ( #5081 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-05-30 19:24:41 -07:00
533c217792
Fix cutlass sm_90a version in CMakeLists
2024-05-31 02:13:01 +00:00
6d21fa1cad
[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) ( #5136 )
2024-05-30 21:02:11 -05:00
b35be5403f
[Bugfix] Avoid Warnings in SparseML Activation Quantization ( #5120 )
2024-05-30 17:04:37 -07:00
45a1a69b98
[Build] Disable sm_90a in cu11 ( #5141 )
2024-05-30 14:37:16 -07:00
87a658c812
Bump version to v0.4.3 ( #5046 )
2024-05-30 11:13:46 -07:00
429d89720e
add doc about serving option on dstack ( #3074 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-05-30 10:11:07 -07:00
a9bcc7afb2
[Doc] Use intersphinx and update entrypoints docs ( #5125 )
2024-05-30 09:59:23 -07:00
d79d9eaaff
[Misc] remove duplicate definition of seq_lens_tensor in model_runner.py ( #5129 )
2024-05-30 06:56:19 -07:00
f758505c73
[CI/Build] increase wheel size limit to 200 MB ( #5130 )
2024-05-30 06:29:48 -07:00
d910816c73
[Bugfix] Automatically Detect SparseML models ( #5119 )
2024-05-30 12:58:37 +00:00
87d41c849d
[BUGFIX] [FRONTEND] Correct chat logprobs ( #5029 )
...
Co-authored-by: Breno Faria <breno.faria@intrafind.com >
2024-05-30 02:52:14 -07:00
e07aff9e52
[CI/Build] Docker cleanup functionality for amd servers ( #5112 )
...
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com >
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com >
Co-authored-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
Co-authored-by: omkarkakarparthi <okakarpa>
2024-05-30 03:27:39 +00:00
5bf185a1c4
[Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter ( #5108 )
2024-05-30 00:30:18 +00:00
4fbcb0f27e
[Doc][Build] update after removing vllm-nccl ( #5103 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-05-29 23:51:18 +00:00
7c3604fb68
[Bugfix] logprobs is not compatible with the OpenAI spec #4795 ( #5031 )
2024-05-29 16:13:22 -07:00
b1c255630d
[Core] Avoid the need to pass None values to Sequence.inputs ( #5099 )
2024-05-29 16:05:01 -07:00
eb6c50cdc2
[Bugfix][CI/Build] Fix codespell failing to skip files in git diff ( #5097 )
2024-05-29 16:02:54 -07:00
eecd864388
[Bugfix][CI/Build] Fix test and improve code for merge_async_iterators ( #5096 )
2024-05-29 16:02:25 -07:00
ae495c74ea
[Doc]Replace deprecated flag in readme ( #4526 )
2024-05-29 22:26:33 +00:00
4238bc82f2
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) ( #4837 )
2024-05-29 16:09:13 +00:00
594392d27a
[Core][Distributed] improve p2p access check ( #4992 )
2024-05-29 11:29:07 +00:00
18c1f16d86
[Bugfix] Fix arguments passed to Sequence in stop checker test ( #5092 )
2024-05-29 07:16:41 +00:00
5bd3c65072
[Core][Optimization] remove vllm-nccl ( #5091 )
2024-05-29 05:13:52 +00:00
616e600e0b
[Misc] add gpu_memory_utilization arg ( #5079 )
...
Signed-off-by: pandyamarut <pandyamarut@gmail.com >
2024-05-28 17:16:18 -07:00
dfba529b40
[Bugfix] Remove the last EOS token unless explicitly specified ( #5077 )
2024-05-28 17:15:35 -07:00
5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines ( #4328 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-05-28 13:29:31 -07:00
290f4ada2b
[Docs] Add Dropbox as sponsors ( #5089 )
2024-05-28 10:29:09 -07:00
dd8de11f0a
[Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X ( #4951 )
...
This PR adds Triton kernel configs for the MoE kernel for MI300X
2024-05-28 16:03:23 +00:00
9ba415588a
[BugFix] Fix Embedding Models with TP>1 ( #5075 )
2024-05-28 08:32:42 -07:00
d4f3985907
[Core] Sliding window for block manager v2 ( #4545 )
...
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local >
2024-05-28 11:07:07 +09:00
890aa93d27
[Model] Add support for falcon-11B ( #5069 )
2024-05-27 16:41:43 -07:00
fbdb7b3ee2
[Core] Allow AQLM on Pascal ( #5058 )
2024-05-27 15:26:14 -07:00
1102bef219
[Bugfix / Core] Prefix Caching Guards (merged with main) ( #4846 )
...
Co-authored-by: rsnm2 <rshaw@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-05-27 15:18:17 -07:00
f17a1a8f96
[Misc] Make Serving Benchmark More User-friendly ( #5044 )
2024-05-25 17:28:16 +00:00
d5a1697772
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding ( #5000 )
2024-05-25 10:00:14 -07:00
325c119961
[Misc] add logging level env var ( #5045 )
2024-05-24 23:49:49 -07:00
8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model ( #4799 )
...
Co-authored-by: beagleski <yunanzhang@microsoft.com >
Co-authored-by: bapatra <bapatra@microsoft.com >
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-05-24 22:00:52 -07:00
e64fde4b01
[Core][Bugfix]: fix prefix caching for blockv2 ( #4764 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
2024-05-24 10:07:09 -07:00
919770957f
[Bugfix] Fix Mistral v0.3 Weight Loading ( #5005 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-05-24 12:28:27 +00:00
6a50f4cafa
[Doc] add ccache guide in doc ( #5012 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-05-23 23:21:54 +00:00
e3470f8753
[Core]: Option To Use Prompt Token Ids Inside Logits Processor ( #4985 )
...
Co-authored-by: Elisei Smirnov <el.smirnov@innopolis.university >
2024-05-23 22:04:24 +00:00
a1242324c9
[Kernel] Initial Activation Quantization Support ( #4525 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-05-23 21:29:18 +00:00
5eda2ea02a
[Core][1/N] Support send/recv in PyNCCL Groups ( #4988 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-05-23 09:54:48 -07:00
2ba80bed27
[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined ( #5009 )
2024-05-23 09:08:58 -07:00
6066253296
Marlin 24 prefill performance improvement (about 25% better on average) ( #4983 )
2024-05-23 02:39:27 -04:00
ee3eea0a1b
[Misc] Take user preference in attention selector ( #4960 )
2024-05-23 07:55:56 +09:00
a36de682d4
[Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig ( #4991 )
2024-05-22 22:26:56 +00:00
eb6d3c264d
[Core] Eliminate parallel worker per-step task scheduling overhead ( #4894 )
2024-05-23 06:17:27 +09:00
97b030005c
[Model] LoRA gptbigcode implementation ( #3949 )
2024-05-22 13:58:59 -07:00
a3a73ab069
[Misc] Load FP8 kv-cache scaling factors from checkpoints ( #4893 )
...
The 2nd PR for #4532 .
This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).
2024-05-22 13:28:20 -07:00
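A hedged sketch of how per-layer .kv_scale parameters might be collected from a checkpoint state dict (the key layout here is illustrative, not the actual checkpoint schema):

```python
# Route parameters named "<layer>.kv_scale" to the matching attention layer,
# leaving all other weights untouched.
def collect_kv_scales(state_dict):
    scales = {}
    for name, value in state_dict.items():
        if name.endswith(".kv_scale"):
            layer_name = name[: -len(".kv_scale")]
            scales[layer_name] = value
    return scales
```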
8674f9880e
[Kernel] Fixup for CUTLASS kernels in CUDA graphs ( #4954 )
...
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
2024-05-22 14:10:43 +00:00
c74c913bfb
[misc] remove comments that were supposed to be removed ( #4977 )
2024-05-22 09:02:58 -04:00
5f6d10c14c
[CI/Build] Enforce style for C++ and CUDA code with clang-format ( #4722 )
2024-05-22 07:18:41 +00:00
9b9a10d6cb
[Frontend] Dynamic RoPE scaling ( #4638 )
2024-05-22 01:32:35 -04:00
99eff67ba9
[Bugfix][Kernel] Add head size check for attention backend selection ( #4944 )
2024-05-21 15:33:25 -04:00
14772eeb8e
[Bugfix] Fix flag name for max_seq_len_to_capture ( #4935 )
...
Signed-off-by: kerthcet <kerthcet@gmail.com >
2024-05-21 09:30:52 -07:00
757b62c495
[CI/Build] Codespell ignore build/ directory ( #4945 )
2024-05-21 09:06:10 -07:00
e941f88584
[Docs] Add acknowledgment for sponsors ( #4925 )
2024-05-21 00:17:25 -07:00
f12c3b5b3d
[Model] Add Phi-2 LoRA support ( #4886 )
2024-05-21 14:24:17 +09:00
d130b573a0
[Model] add rope_scaling support for qwen2 ( #4930 )
2024-05-21 05:22:22 +00:00
65ae8c2c8f
[Core] Fix scheduler considering "no LoRA" as "LoRA" ( #4897 )
2024-05-20 17:48:32 -07:00
c3af44722c
[Doc]Add documentation to benchmarking script when running TGI ( #4920 )
2024-05-20 20:16:57 +00:00
1937e29848
[Core] Sharded State Loader download from HF ( #4889 )
2024-05-20 11:46:12 -07:00
f0eecee610
[Bugfix] Fix dummy weight for fp8 ( #4916 )
...
Allow the dummy load format for FP8;
torch.uniform_ doesn't support FP8 at the moment
Co-authored-by: Mor Zusman <morz@ai21.com >
2024-05-20 18:44:25 +00:00
943e72ca56
[Build/CI] Enabling AMD Entrypoints Test ( #4834 )
...
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com >
2024-05-20 11:29:28 -07:00
546a97ef69
[Misc]: allow user to specify port in distributed setting ( #4914 )
2024-05-20 17:45:06 +00:00
da5a0b539d
Remove marlin warning ( #4918 )
2024-05-20 14:55:34 +00:00
6287537a0c
[Model] LLaVA model refactor ( #4910 )
2024-05-20 08:11:25 +00:00
b57e6c5949
[Kernel] Add flash-attn back ( #4907 )
2024-05-19 18:11:30 -07:00
27ce85476e
[Kernel] Add marlin_24 unit tests ( #4901 )
2024-05-19 11:37:34 -04:00
f68470e803
[Bugfix][Model] Add base class for vision-language models ( #4809 )
2024-05-19 00:13:33 -07:00
2e9a2227ec
[Lora] Support long context lora ( #4787 )
...
Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. This adds a batched rotary embedding kernel and pipes it through.
It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.
Follow up of https://github.com/vllm-project/vllm/pull/3095/files
2024-05-18 16:05:23 +09:00
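The per-scaling-factor cache idea can be sketched as follows (the cache contents are stand-ins for the real cos/sin tables, and the function names are ours):

```python
# Keep one cos-sin cache per RoPE scaling factor, and index into the right
# cache per request, so multiple long-context LoRAs can share one kernel.
def build_caches(scaling_factors):
    return {s: f"cos_sin_cache(scale={s})" for s in scaling_factors}

def lookup(caches, request_scaling_factor):
    return caches[request_scaling_factor]
```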
c0724fc915
[ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if Triton is not used ( #4658 )
2024-05-18 05:09:11 +00:00
86b45ae065
[Bugfix] Relax tiktoken to >= 0.6.0 ( #4890 )
2024-05-17 12:58:52 -06:00
c5711ef985
[Doc] Update Ray Data distributed offline inference example ( #4871 )
2024-05-17 10:52:11 -07:00
48d5985a08
Sync huggingface modifications of qwen Moe model ( #4774 )
2024-05-17 09:43:19 -07:00
33e0823de5
[Bugfix] fix rope error when load models with different dtypes ( #4835 )
2024-05-17 18:43:34 +09:00
26148120b3
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests ( #4797 )
2024-05-16 20:58:25 -07:00
0150a10630
[Frontend] OpenAI API server: Do not add bos token by default when encoding ( #4688 )
2024-05-16 18:47:22 -07:00
8e7fb5d43a
Support to serve vLLM on Kubernetes with LWS ( #4829 )
...
Signed-off-by: kerthcet <kerthcet@gmail.com >
2024-05-16 16:37:29 -07:00
9a31a817a8
[Bugfix] Fix FP8 KV cache support ( #4869 )
2024-05-16 22:42:29 +00:00
2060e93659
[Kernel] Add w8a8 CUTLASS kernels ( #4749 )
2024-05-16 18:32:50 -04:00
8435b207af
[Kernel] Add punica dimension for Qwen1.5-32B LoRA ( #4850 )
...
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net >
2024-05-16 11:16:09 -07:00
10fa9eea21
[Misc] remove old comments ( #4866 )
2024-05-16 11:07:41 -07:00
e08188081b
[Core][Distributed] remove graph mode function ( #4818 )
2024-05-16 10:59:52 -07:00
b5853f9963
[ROCm][AMD][Bugfix] adding a missing triton autotune config ( #4845 )
2024-05-16 10:46:52 -07:00
f09edd8a25
Add JSON output support for benchmark_latency and benchmark_throughput ( #4848 )
2024-05-16 10:02:56 -07:00
6979ade384
Add GPTQ Marlin 2:4 sparse structured support ( #4790 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-05-16 12:56:15 -04:00
9216b9cc38
[Bugfix] Bypass authorization API token for preflight requests ( #4862 )
2024-05-16 09:42:21 -07:00
5e0391c040
[Frontend] Separate OpenAI Batch Runner usage from API Server ( #4851 )
2024-05-17 00:42:41 +09:00
dbc0754ddf
[docs] Fix typo in examples filename openi -> openai ( #4864 )
2024-05-17 00:42:17 +09:00
99caa49106
[Kernel] add bfloat16 support for gptq marlin kernel ( #4788 )
2024-05-16 09:55:29 -04:00
5c342570d7
Add marlin unit tests and marlin benchmark script ( #4815 )
2024-05-16 09:36:49 -04:00
973617ae02
[Speculative decoding][Re-take] Enable TP>1 speculative decoding ( #4840 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
Co-authored-by: Cade Daniel <cade@anyscale.com >
2024-05-16 00:53:51 -07:00
30e754390c
[Core] Implement sharded state loader ( #4690 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-05-15 22:11:54 -07:00
52f8107cf2
[Frontend] Support OpenAI batch file format ( #4794 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-05-15 19:13:36 -04:00
fc0d9dfc3a
[Frontend] Re-enable custom roles in Chat Completions API ( #4758 )
2024-05-15 14:58:46 -07:00
361c461a12
[Doc] Highlight the fourth meetup in the README ( #4842 )
2024-05-15 11:38:49 -07:00
a5675d348b
[Bugfix] Properly set distributed_executor_backend in ParallelConfig ( #4816 )
2024-05-15 07:22:09 -07:00
e9cdd2b1e2
[CI/Build] Further decouple HuggingFace implementation from ours during tests ( #4166 )
2024-05-14 23:38:40 -07:00
65bf2ac165
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API ( #4681 )
...
This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill/decode into a single class and allows it to be sliced when running the attention backend.
It also refactors subquery_start_loc, which was not refactored in the previous PR
2024-05-15 14:00:10 +09:00
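The coalesced-metadata idea can be sketched like this (field and method names are illustrative, not vLLM's actual classes):

```python
# One metadata object covers both prefill and decode tokens; recording where
# the split falls lets the attention backend slice out each half.
from dataclasses import dataclass

@dataclass
class AttnMetadata:
    token_positions: list    # positions for all scheduled tokens
    num_prefill_tokens: int  # prefill tokens come first in the batch

    def prefill(self):
        return self.token_positions[: self.num_prefill_tokens]

    def decode(self):
        return self.token_positions[self.num_prefill_tokens:]
```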
8a7cc254a0
Revert "[Kernel] Use flash-attn for decoding ( #3648 )" ( #4820 )
...
The LoRA 3 & 4 tests seem to hit an illegal-memory-access failure after this commit:
[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241
This reverts commit 1356df5.
2024-05-15 11:52:45 +09:00
29bc01bf3b
Add 4th meetup announcement to readme ( #4817 )
2024-05-14 18:33:06 -04:00
676a99982f
[Core] Add MultiprocessingGPUExecutor ( #4539 )
...
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com >
2024-05-14 10:38:59 -07:00
dc72402b57
[Bugfix][Doc] Fix CI failure in docs ( #4804 )
...
This PR fixes the CI failure introduced by #4798 .
The failure originates from having duplicate target names in reST, and is fixed by changing the ref targets to anonymous ones. For more information, see this discussion.
I have also changed the format of the links to be more distinct from each other.
2024-05-15 01:57:08 +09:00
ccb63a8245
[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies ( #4696 )
2024-05-14 21:34:33 +09:00
c579b750a0
[Doc] Add meetups to the doc ( #4798 )
2024-05-13 18:48:00 -07:00
4bfa7e7f75
[Doc] Add API reference for offline inference ( #4710 )
2024-05-13 17:47:42 -07:00
ac1fbf7fd2
[Doc] Shorten README by removing supported model list ( #4796 )
2024-05-13 16:23:54 -07:00
33d3914b1e
[Bugfix] Fix dynamic FP8 quantization for Mixtral ( #4793 )
2024-05-13 19:00:27 -04:00
1356df53bd
[Kernel] Use flash-attn for decoding ( #3648 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2024-05-13 15:50:33 -07:00
ce532ff45c
[Speculative decoding] Improve n-gram efficiency ( #4724 )
2024-05-13 15:00:13 -07:00
8bc68e198c
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update tensorizer to version 2.9.0 ( #4208 )
2024-05-13 14:57:07 -07:00
0fca3cdcf2
[Misc] Enhance attention selector ( #4751 )
2024-05-13 10:47:25 -07:00
e7c46b9527
[Scheduler] Warning upon preemption and Swapping ( #4647 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-05-13 23:50:44 +09:00
350f9e107f
[CI/Build] Move test_utils.py to tests/utils.py ( #4425 )
...
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time)
Also, I have moved the test utilities file (test_utils.py) under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported relative to the test packages.
2024-05-13 23:50:09 +09:00
702bee461f
[Core][Distributed] refactor custom allreduce to support multiple tp groups ( #4754 )
2024-05-12 17:47:59 -07:00
a7be4d0072
[CORE] Improvement in ranks code ( #4718 )
2024-05-12 17:47:47 -07:00
a709e87a4f
[CI/Build] Tweak Marlin Nondeterminism Issues ( #4713 )
2024-05-12 17:46:31 -07:00
6eaccb7353
[Model] Add support for IBM Granite Code models ( #4636 )
2024-05-11 21:27:24 -07:00
e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API ( #3734 )
2024-05-11 11:30:37 -07:00
4e12131089
[Core][Test] fix function name typo in custom allreduce ( #4750 )
2024-05-10 15:14:40 -07:00
fcc2994be6
[CI] Nits for bad initialization of SeqGroup in testing ( #4748 )
2024-05-10 18:01:01 -04:00
2e7796f2cf
[Speculative decoding] CUDA graph support ( #4295 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-05-10 17:36:25 +00:00
706588a77d
[Bugfix] Fix CLI arguments in OpenAI server docs ( #4729 )
2024-05-11 00:00:56 +09:00
6a0f617210
[Core] Fix circular reference which leaked llm instance in local dev env ( #4737 )
...
Storing an exception frame is extremely prone to circular references because it holds references to objects.
When tensorizer is not installed, the llm instance leaks because the error frame holds references to various modules, causing a circular-reference problem.
I also found that spec decoding had a circular-reference issue, and I solved it using weakref.proxy.
2024-05-10 23:54:32 +09:00
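The weakref.proxy fix described above can be illustrated with a minimal sketch (the `Engine`/`Helper` names are hypothetical stand-ins for the leaking objects, not vLLM's actual classes):

```python
import weakref

class Helper:
    def __init__(self, owner):
        # weakref.proxy keeps a usable handle without owning the object,
        # so the cycle owner -> helper -> owner is broken and the owner
        # can be freed promptly by reference counting.
        self.owner = weakref.proxy(owner)

class Engine:
    """Hypothetical stand-in for the leaking llm/spec-decode object."""
    def __init__(self):
        self.helper = Helper(self)

e = Engine()
finalized = []
weakref.finalize(e, lambda: finalized.append(True))
del e
# With the proxy there is no strong cycle, so deletion frees the
# object immediately instead of waiting for a gc.collect() pass.
```

Without the proxy, `self.owner = owner` would keep the pair alive until the cyclic garbage collector happened to run, which is exactly the kind of leak the PR fixes.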
dac6a3f6ed
[Misc] Apply a couple g++ cleanups ( #4719 )
2024-05-10 13:37:05 +00:00
64b77dfd7e
[Core]fix type annotation for swap_blocks ( #4726 )
2024-05-10 21:52:48 +09:00
51d4094fda
chunked-prefill-doc-syntax ( #4603 )
...
Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html
Co-authored-by: sang <rkooo567@gmail.com >
2024-05-10 14:13:23 +09:00
e965d46184
[Misc] Keep only one implementation of the create_dummy_prompt function. ( #4716 )
2024-05-09 21:42:38 -07:00
208b71bcc1
[Core][Distributed] refactor pynccl ( #4591 )
...
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591 )
2024-05-09 19:48:43 -07:00
c833101740
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support ( #4535 )
2024-05-09 18:04:17 -06:00
379da6dcb5
[Kernel] [FP8] Improve FP8 linear layer performance ( #4691 )
...
This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)).
We noticed that CUBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16. So this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.
Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4); all FP8 measurements are for dynamic quantization:
qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
2024-05-09 16:38:07 -07:00
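The padding trick described above is just a linear-algebra identity: zero rows appended to the activation do not change the valid rows of the product, so a small batch can be enlarged for the backend and the padding sliced off afterwards. A minimal NumPy sketch (the `min_rows=17` threshold follows the ">16" observation in the PR text; the function name is illustrative, not vLLM's API):

```python
import numpy as np

def padded_matmul(x: np.ndarray, w: np.ndarray, min_rows: int = 17) -> np.ndarray:
    """If the first (batch) dimension of x is small, zero-pad it so a
    cuBLASLt-style backend can pick a faster algorithm, then slice the
    padding off the result. The valid rows of x @ w are unchanged."""
    m = x.shape[0]
    if m >= min_rows:
        return x @ w
    pad = np.zeros((min_rows - m, x.shape[1]), dtype=x.dtype)
    return (np.vstack([x, pad]) @ w)[:m]

x = np.random.rand(4, 8).astype(np.float32)
w = np.random.rand(8, 16).astype(np.float32)
out = padded_matmul(x, w)
```

The cost is a little wasted compute on the padded rows, which the benchmarks above show is far outweighed by the better kernel selection.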
ebce310b74
[Model] Snowflake arctic model implementation ( #4652 )
...
Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com >
Co-authored-by: Aurick Qiao <qiao@aurick.net >
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com >
Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-05-09 22:37:14 +00:00
be0c5180ac
[Bugfix] Add logs for all model dtype casting ( #4717 )
2024-05-09 18:36:25 +00:00
cea64430f6
[Bugfix] Update grafana.json ( #4711 )
2024-05-09 10:10:13 -07:00
a3c124570a
[Bugfix] Fix CLI arguments in OpenAI server docs ( #4709 )
2024-05-09 09:53:14 -07:00
ff5abcd746
[ROCm] Add support for Punica kernels on AMD GPUs ( #3140 )
...
Co-authored-by: miloice <jeffaw99@hotmail.com >
2024-05-09 09:19:50 -07:00
0ee535b294
[Misc] Set block size at initialization & Fix test_model_runner ( #4705 )
2024-05-09 09:04:59 -07:00
190bc838e1
[Misc] Remove unnecessary ModelRunner imports ( #4703 )
2024-05-09 00:17:17 -07:00
f12b20decc
[Frontend] Move async logic outside of constructor ( #4674 )
2024-05-08 22:48:33 -07:00
16bc0a098f
[Frontend] add tok/s speed metric to llm class when using tqdm ( #4400 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-05-08 22:02:31 -07:00
e288df0632
[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin ( #4626 )
2024-05-08 17:14:31 -07:00
8b9241be3a
[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs ( #4672 )
2024-05-08 23:24:46 +00:00
f942efb5a3
[Dynamic Spec Decoding] Auto-disable by the running queue size ( #4592 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-05-08 21:44:00 +00:00
89579a201f
[Misc] Use vllm-flash-attn instead of flash-attn ( #4686 )
2024-05-08 13:15:34 -07:00
230c4b38c1
[CI/Test] fix swap test for multi gpu ( #4689 )
2024-05-08 13:14:02 -07:00
20cfcdec99
[Core][Optimization] change python dict to pytorch tensor for blocks to swap ( #4659 )
2024-05-08 12:07:05 -07:00
ad932a221d
[Core] Faster startup for LoRA enabled models ( #4634 )
2024-05-08 10:33:18 -07:00
5510cf0e8a
[Misc] Add get_name method to attention backends ( #4685 )
2024-05-08 09:59:31 -07:00
0f9a6e3d22
[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi ( #4573 )
2024-05-08 09:19:58 -07:00
f6a593093a
[CI] Make mistral tests pass ( #4596 )
2024-05-08 08:44:35 -07:00
d7740ea4dc
[Core] Optimize sampler get_logprobs ( #4594 )
2024-05-08 08:42:28 -07:00
cc466a3290
[Core][Distributed] support cpu&device in broadcast tensor dict ( #4660 )
...
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660 )
2024-05-07 19:34:47 -07:00
8344f7742b
[Bug fix][Core] fixup ngram not setup correctly ( #4551 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
Co-authored-by: Cade Daniel <edacih@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-05-07 11:40:18 -07:00
469f85c782
[Core][Optimization] change copy-on-write from dict[int, list] to list ( #4648 )
2024-05-07 11:06:32 -07:00
10760da800
[Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora ( #4609 )
2024-05-07 10:59:07 -07:00
478aed5827
[Build/CI] Fixing 'docker run' to re-enable AMD CI tests. ( #4642 )
2024-05-07 09:23:17 -07:00
63575bc2e1
[Core][Optimization] change python dict to pytorch tensor ( #4607 )
2024-05-06 21:30:27 -07:00
a98187cf72
[Kernel] Make static FP8 scaling more robust ( #4570 )
...
Previously, FP8 static scaling worked only if the scales overestimated the maxima of all activation tensors during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint
https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale
(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k ), I'm getting the following mostly random performance on MMLU:
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.2295|± |0.0035|
| - humanities |N/A |none | 5|acc |0.2421|± |0.0062|
| - other |N/A |none | 5|acc |0.2398|± |0.0076|
| - social_sciences|N/A |none | 5|acc |0.2171|± |0.0074|
| - stem |N/A |none | 5|acc |0.2125|± |0.0073|
With the fix in this PR, where the scaled activations are clamped to [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to ensure there are no NaNs, the performance is
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.7008|± |0.0036|
| - humanities |N/A |none | 5|acc |0.6453|± |0.0065|
| - other |N/A |none | 5|acc |0.7692|± |0.0072|
| - social_sciences|N/A |none | 5|acc |0.8083|± |0.0070|
| - stem |N/A |none | 5|acc |0.6115|± |0.0083|
This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.
2024-05-06 17:39:28 -07:00
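The clamping step described above can be sketched in a few lines. This is a NumPy illustration of the idea, not the CUDA kernel; 448.0 is the maximum finite value of the e4m3fn format, and `inv_scale` stands for the reciprocal of the calibrated static scale:

```python
import numpy as np

# Maximum finite value of float8 e4m3fn, i.e.
# std::numeric_limits<c10::Float8_e4m3fn>::max().
FP8_E4M3_MAX = 448.0

def scale_and_clamp(x: np.ndarray, inv_scale: float) -> np.ndarray:
    """Clamp scaled activations to the representable e4m3 range so values
    that overshoot the calibrated scale saturate instead of overflowing
    (and producing inf/NaN) when cast to FP8."""
    return np.clip(x * inv_scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

acts = np.array([100.0, 5000.0, -9000.0], dtype=np.float32)
clamped = scale_and_clamp(acts, inv_scale=1.0)
```

Saturation introduces some clipping error on outliers, which matches the PR's note that the result is close to, but not yet identical with, FP16 / dynamic-scale accuracy.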
bd99d22629
Update lm-format-enforcer to 0.10.1 ( #4631 )
2024-05-06 23:51:59 +00:00
19cb4716ee
[CI] Add retry for agent lost ( #4633 )
2024-05-06 23:18:57 +00:00
e186d37cb1
[CI] use ccache actions properly in release workflow ( #4629 )
2024-05-06 22:23:36 +00:00
323f27b904
[Bugfix] Fix asyncio.Task not being subscriptable ( #4623 )
2024-05-06 09:31:05 -07:00
0650e5935b
Disable cuda version check in vllm-openai image ( #4530 )
2024-05-05 16:58:55 -07:00