7097f31955
test
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-15 03:22:32 -08:00
f840b53063
fix
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-15 03:07:17 -08:00
1ca4298b9b
Fix
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-01 18:44:21 -08:00
ba64a0249f
Minor
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-01 18:42:22 -08:00
1260e43230
Minor
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2025-01-01 03:16:56 -08:00
a6e5d7b5b7
Merge branch 'main' into v1-blocktable-opt
2025-01-01 03:10:50 -08:00
6d70198b17
[Doc] Fix typo ( #11666 )
...
Signed-off-by: Kazuhiro Serizawa <nserihiro@gmail.com >
2025-01-01 08:10:10 +00:00
f962f426bc
[Misc] Replace space with - in the file names ( #11667 )
...
Signed-off-by: Lu Fang <lufang@fb.com >
2025-01-01 07:39:30 +00:00
11d8a091c6
[Misc] Optimize Qwen2-VL LoRA test ( #11663 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2025-01-01 14:42:23 +08:00
365801fedd
[VLM] Add max-count checking in data parser for single image models ( #11661 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-31 22:15:21 -08:00
4db72e57f6
[Bugfix][Refactor] Unify model management in frontend ( #11660 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2025-01-01 02:21:51 +00:00
0c6f998554
[Benchmark] Add benchmark script for CPU offloading ( #11533 )
...
Signed-off-by: ApostaC <yihua98@uchicago.edu >
Co-authored-by: KuntaiDu <kuntai@uchicago.edu >
2025-01-01 00:10:55 +00:00
e7c7c5e822
[V1][VLM] V1 support for selected single-image models. ( #11632 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-12-31 21:17:22 +00:00
8c3230d8c1
[V1] Simplify vision block hash for prefix caching by removing offset from hash ( #11646 )
2024-12-31 08:56:01 +00:00
2c5718809b
[Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. ( #11565 )
2024-12-31 06:29:04 +00:00
82c49d3260
[Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) ( #6909 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-30 22:15:58 -08:00
74fa1d123c
[Bugfix] Fix OpenAI parallel sampling when using xgrammar ( #11637 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-31 03:43:54 +00:00
a2a40bcd0d
[Model][LoRA] LoRA support added for MolmoForCausalLM ( #11439 )
...
Signed-off-by: Matthias Vogler <matthias.vogler@joesecurity.org >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Matthias Vogler <matthias.vogler@joesecurity.org >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-30 17:33:06 -08:00
ccb1aabcca
[benchmark] Remove dependency for H100 benchmark step ( #11572 )
2024-12-30 12:27:07 -08:00
36e7670045
[Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel ( #11631 )
2024-12-30 18:51:04 +00:00
5886aa496e
[V1] [6/N] API Server: Better Shutdown ( #11586 )
2024-12-30 15:51:02 +00:00
8d9b6721e7
[VLM] Abstract out multi-modal data parsing in merged processor ( #11620 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-30 15:01:35 +00:00
b12e87f942
[platforms] enable platform plugins ( #11602 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-30 20:24:45 +08:00
5dbf854553
[CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels ( #11618 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-12-30 10:17:04 +00:00
970d6d0776
[Build][Kernel] Update CUTLASS to v3.6.0 ( #11607 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-30 17:22:13 +08:00
628ec6c17b
[Docker] bump up neuron sdk v2.21 ( #11593 )
...
Signed-off-by: Liangfu Chen <liangfc@amazon.com >
2024-12-30 13:46:14 +08:00
3682e33f9f
[v1] fix compilation cache ( #11598 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-30 04:24:12 +00:00
0aa38d16f5
Remove print statement in DeepseekScalingRotaryEmbedding ( #11604 )
2024-12-29 20:16:46 +00:00
faef77c0d6
[Misc] KV cache transfer connector registry ( #11481 )
...
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
2024-12-29 16:08:09 +00:00
dba4d9dec6
[v1][bugfix] fix cudagraph with inplace buffer assignment ( #11596 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-29 09:03:49 +00:00
32b4c63f02
[Doc] Convert list tables to MyST ( #11594 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-29 15:56:22 +08:00
4fb8e329fd
[V1] [5/N] API Server: unify Detokenizer and EngineCore input ( #11545 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-12-28 20:51:57 +00:00
328841d002
[bugfix] interleaving sliding window for cohere2 model ( #11583 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-28 16:55:42 +00:00
d427e5cfda
[Doc] Minor documentation fixes ( #11580 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-28 21:53:59 +08:00
42bb201fd6
[V1][Minor] Set pin_memory=False for token_ids_cpu tensor ( #11581 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-28 13:33:12 +00:00
59d6bb4c86
[Hardware][AMD]: Replace HIPCC version with more precise ROCm version ( #11515 )
...
Signed-off-by: hjwei <hjwei_xd@163.com >
2024-12-28 11:17:35 +00:00
b7dcc003dc
[Model] Remove hardcoded image tokens ids from Pixtral ( #11582 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-28 10:54:23 +00:00
d34be24bb1
[Model] Support InternLM2 Reward models ( #11571 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-28 06:14:10 +00:00
b5cbe8eeb3
[Bugfix] Last token measurement fix ( #11376 )
...
Signed-off-by: rajveerb <46040700+rajveerb@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-12-28 11:34:46 +08:00
df04dffade
[V1] [4/N] API Server: ZMQ/MP Utilities ( #11541 )
2024-12-28 01:45:08 +00:00
a60731247f
[Doc] Update mllama example based on official doc ( #11567 )
...
Signed-off-by: Chen Zhang <zhangch99@outlook.com >
2024-12-28 00:31:10 +00:00
ac79799403
[Bugfix] Fix for ROCM compressed tensor support ( #11561 )
2024-12-27 20:12:11 +00:00
dde1fa18c9
[Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix ( #11566 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-27 19:45:13 +00:00
0240402c46
[Misc] Add BNB quantization for MolmoForCausalLM ( #11551 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-27 18:48:24 +00:00
55509c2114
[MODEL] LoRA support for Jamba model ( #11209 )
...
Signed-off-by: Erez Schwartz <erezs@ai21.com >
2024-12-27 17:58:21 +00:00
101418096f
[VLM] Support caching in merged multi-modal processor ( #11396 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-27 17:22:48 +00:00
5ce4627a7e
[Doc] Add xgrammar in doc ( #11549 )
...
Signed-off-by: ccjincong <chenjincong11@gmail.com >
2024-12-27 13:05:10 +00:00
7af553ea30
[Misc] Abstract the logic for reading and writing media content ( #11527 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-27 19:21:23 +08:00
2c9b8ea2b0
[Bugfix] Fix TeleChat2ForCausalLM weights mapper ( #11546 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-27 10:39:15 +00:00
d003f3ea39
Update deploying_with_k8s.md with AMD ROCm GPU example ( #11465 )
...
Signed-off-by: Alex He <alehe@amd.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-27 10:00:04 +00:00
6c6f7fe8a8
[Platform] Move model arch check to platform ( #11503 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2024-12-27 08:45:25 +00:00
2339d59f92
[BugFix] Fix quantization for all other methods ( #11547 )
2024-12-26 22:23:29 -08:00
1b875a0ef3
[V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly ( #11534 )
2024-12-26 21:19:21 -08:00
ebfbe1244b
ruff
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-26 20:06:53 -08:00
eb881ed006
[misc] fix typing ( #11540 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-27 11:05:08 +08:00
6ba31aa5f6
Minor
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-26 19:03:59 -08:00
34d6cc2aea
Merge branch 'main' into v1-blocktable-opt
2024-12-26 18:52:19 -08:00
46d4359450
[CI] Fix broken CI ( #11543 )
2024-12-26 18:49:16 -08:00
81b979f2a8
[V1] Fix yapf ( #11538 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-27 09:47:10 +09:00
371d04d39b
[V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling ( #11394 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-27 09:32:38 +09:00
0c0c2015c5
Update openai_compatible_server.md ( #11536 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-12-26 16:26:18 -08:00
82d24f7aac
[Docs] Document Deepseek V3 support ( #11535 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-12-26 16:21:56 -08:00
f49777ba62
Deepseek v3 ( #11502 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: robertgshaw2-neuralmagic <rshaw@neuralmagic.com >
2024-12-26 16:09:44 -08:00
55fb97f7bd
[2/N] API Server: Avoid ulimit footgun ( #11530 )
2024-12-26 23:43:05 +00:00
2072924d14
[Model] [Quantization] Support deepseek_v3 w8a8 fp8 block-wise quantization ( #11523 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: simon-mo <simon.mo@hey.com >
Signed-off-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: simon-mo <simon.mo@hey.com >
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: HandH1998 <1335248067@qq.com >
2024-12-26 15:33:30 -08:00
720b10fdc6
[1/N] API Server (Remove Proxy) ( #11529 )
2024-12-26 23:03:43 +00:00
27e8eb2e94
Add kernel test
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-26 11:23:52 -08:00
ca4f9e69a8
minor
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-26 11:13:41 -08:00
52922193cd
Add test for uva
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-26 11:00:19 -08:00
bef68163a0
Minor
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-26 10:48:29 -08:00
ff5b1033dc
Merge branch 'main' into v1-blocktable-opt
2024-12-26 10:12:17 -08:00
b85a977822
[Doc] Add video example to openai client for multimodal ( #11521 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-26 17:31:29 +00:00
eec906d811
[Misc] Add placeholder module ( #11501 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-26 13:12:51 +00:00
f57ee5650d
[Model] Modify MolmoForCausalLM MLP ( #11510 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-26 13:12:05 +00:00
dcb1a944d4
[V1] Adding min tokens/repetition/presence/frequency penalties to V1 sampler ( #10681 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-26 19:02:58 +09:00
7492a36207
[Doc] Add QVQ and QwQ to the list of supported models ( #11509 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-12-26 09:44:32 +00:00
aa25985bd1
[Misc][LoRA] Fix LoRA weight mapper ( #11495 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-26 15:52:48 +08:00
dbeac95dbb
Mypy checking for vllm/compilation ( #11496 )
...
Signed-off-by: lucast2021 <lucast2021@headroyce.org >
Co-authored-by: lucast2021 <lucast2021@headroyce.org >
2024-12-26 05:04:07 +00:00
51a624bf02
[Misc] Move some multimodal utils to modality-specific modules ( #11494 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-26 04:23:20 +00:00
b938606993
Merge branch 'main' into v1-blocktable-opt
2024-12-25 15:49:02 -08:00
6ad909fdda
[Doc] Improve GitHub links ( #11491 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-25 14:49:26 -08:00
b689ada91e
[Frontend] Enable decord to load video from base64 ( #11492 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-25 16:33:55 +00:00
fc601665eb
[Misc] Update disaggregation benchmark scripts and test logs ( #11456 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
2024-12-25 06:58:48 +00:00
9832e5572a
[V1] Unify VLLM_ENABLE_V1_MULTIPROCESSING handling in RayExecutor ( #11472 )
2024-12-24 19:49:46 -08:00
3f3e92e1f2
[Model] Automatic conversion of classification and reward models ( #11469 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-24 18:22:22 +00:00
409475a827
[Bugfix] Fix issues in CPU build Dockerfile. Fixes #9182 ( #11435 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2024-12-24 16:53:28 +00:00
196c34b0ac
[Misc] Move weights mapper ( #11443 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-24 13:05:25 +00:00
5c7963249d
[attn][tiny fix] fix attn backend in MultiHeadAttention ( #11463 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2024-12-24 12:39:36 +00:00
461cde2080
[OpenVINO] Fixed installation conflicts ( #11458 )
...
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@intel.com >
2024-12-24 11:38:21 +00:00
7a5286cc04
[Bugfix][Hardware][CPU] Fix CPU input_positions creation for text-only inputs with mrope ( #11434 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-24 17:59:51 +08:00
b1b1038fbd
[Bugfix] Fix Qwen2-VL LoRA weight loading ( #11430 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-24 09:56:10 +00:00
9edca6bf8f
[Frontend] Online Pooling API ( #11457 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-24 17:54:30 +08:00
4f074fbf53
[Misc] Suppress irrelevant exception stack trace information when CUDA… ( #11438 )
...
Co-authored-by: shiquan <shiquan>
2024-12-24 08:43:39 +00:00
a491d6f535
[V1] TP Ray executor ( #11107 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-12-23 23:00:12 +00:00
32aa2059ad
[Docs] Convert rST to MyST (Markdown) ( #11145 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-12-23 22:35:38 +00:00
94d545a1a1
[Doc] Fix typo in the help message of '--guided-decoding-backend' ( #11440 )
2024-12-23 20:20:44 +00:00
60fb4f3bcf
[Bugfix] Add kv cache scales to gemma2.py ( #11269 )
2024-12-23 19:30:45 +00:00
63afbe9215
[CI] Expand OpenAI test_chat.py guided decoding tests ( #11048 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-23 18:35:38 +00:00
8cef6e02dc
[Misc] add w8a8 asym models ( #11075 )
2024-12-23 13:33:20 -05:00
b866cdbd05
[Misc] Add assertion and helpful message for marlin24 compressed models ( #11388 )
2024-12-24 02:23:38 +08:00
2e726680b3
[Bugfix] torch nightly version in ROCm installation guide ( #11423 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2024-12-23 17:20:22 +00:00
5bfb30a529
[Bugfix] Fix CFGGuide and use outlines for grammars that can't convert to GBNF ( #11389 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-23 23:06:20 +08:00
e51719ae72
mypy type checking for vllm/worker ( #11418 )
...
Signed-off-by: lucast2021 <lucast2021@headroyce.org >
Co-authored-by: lucast2021 <lucast2021@headroyce.org >
2024-12-23 13:55:49 +00:00
f30581c518
[misc][perf] remove old code ( #11425 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-23 08:01:08 +00:00
3fdbd8e2f5
comments
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-22 22:39:03 -08:00
0420fb2c7b
Merge branch 'main' into v1-blocktable-opt
2024-12-22 22:16:22 -08:00
ee965c9c69
Use default
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-22 22:16:12 -08:00
048fc57a0f
[CI] Unblock H100 Benchmark ( #11419 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-12-22 14:17:43 -08:00
f1d1bf6288
[Bugfix] Fix fully sharded LoRAs with Mixtral ( #11390 )
...
Signed-off-by: Jason Greene <jason.greene@redhat.com >
2024-12-22 23:25:10 +08:00
72d9c316d3
[cd][release] fix race conditions ( #11407 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-22 00:39:11 -08:00
4a9139780a
[cd][release] add pypi index for every commit and nightly build ( #11404 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-12-21 23:53:44 -08:00
29c748930e
[CI] Fix flaky entrypoint tests ( #11403 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-21 21:08:44 -08:00
0a669eed7b
Minor
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-21 17:39:13 -08:00
03b1e6fdbd
Minor
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-21 17:28:21 -08:00
8a4180c8b6
yapf
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-21 17:11:00 -08:00
1aaced5830
wip
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-21 17:07:46 -08:00
c2d1b075ba
[Bugfix] Fix issues for Pixtral-Large-Instruct-2411 ( #11393 )
...
Signed-off-by: ywang96 <ywang@example.com >
Co-authored-by: ywang96 <ywang@example.com >
2024-12-21 10:15:03 +00:00
584f0ae40d
[V1] Make AsyncLLMEngine v1-v0 opaque ( #11383 )
...
Signed-off-by: Ricky Xu <xuchen727@hotmail.com >
2024-12-21 15:14:08 +08:00
51ff216d85
[Bugfix] update should_ignore_layer ( #11354 )
...
Signed-off-by: George Ohashi <george@neuralmagic.com >
2024-12-21 06:36:23 +00:00
dd2b5633dd
[V1][Bugfix] Skip hashing empty or None mm_data ( #11386 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-21 14:22:21 +09:00
47a0b615b4
Add ray[default] to wget to run distributed inference out of box ( #11265 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
2024-12-20 13:54:55 -08:00
5d2248d81a
[doc] explain nccl requirements for rlhf ( #11381 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-20 13:00:56 -08:00
d573aeadcc
[Bugfix] Don't log OpenAI field aliases as ignored ( #11378 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-20 19:03:50 +00:00
995f56236b
[Core] Loading model from S3 using RunAI Model Streamer as optional loader ( #10192 )
...
Signed-off-by: OmerD <omer@run.ai >
2024-12-20 16:46:24 +00:00
7c7aa37c69
[CI/Build] fix pre-compiled wheel install for exact tag ( #11373 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
2024-12-21 00:14:40 +08:00
04139ade59
[V1] Fix profiling for models with merged input processor ( #11370 )
...
Signed-off-by: ywang96 <ywang@roblox.com >
2024-12-20 12:04:21 +00:00
1ecc645b8f
[doc] backward compatibility for 0.6.4 ( #11359 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-19 21:33:53 -08:00
c954f21ac0
[misc] add early error message for custom ops ( #11355 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-19 21:18:25 -08:00
86c2d8fd1c
[Bugfix] Fix spec decoding when seed is none in a batch ( #10863 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-12-20 05:15:31 +00:00
b880ffb87e
[Misc] Add tqdm progress bar during graph capture ( #11349 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-20 04:35:18 +00:00
7801f56ed7
[ci][gh200] dockerfile clean up ( #11351 )
...
Signed-off-by: drikster80 <ed.sealing@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: drikster80 <ed.sealing@gmail.com >
Co-authored-by: cenzhiyao <2523403608@qq.com >
2024-12-19 18:13:06 -08:00
48edab8041
[Bugfix][Hardware][POWERPC] Fix auto dtype failure in case of POWER10 ( #11331 )
...
Signed-off-by: Akash Kaothalkar <0052v2@linux.vnet.ibm.com >
2024-12-20 01:32:07 +00:00
a985f7af9f
[CI] Adding CPU docker pipeline ( #11261 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
2024-12-19 11:46:55 -08:00
e461c262f0
[Misc] Remove unused vllm/block.py ( #11336 )
2024-12-19 17:54:24 +00:00
276738ce0f
[Bugfix] Fix broken CPU compressed-tensors test ( #11338 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-19 17:37:31 +00:00
cdf22afdda
[Misc] Clean up and consolidate LRUCache ( #11339 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-20 00:59:32 +08:00
e24113a8fe
[Model] Refactor Qwen2-VL to use merged multimodal processor ( #11258 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 16:28:00 +00:00
7379b3d4b2
[V1] Fix multimodal profiling for Molmo ( #11325 )
...
Signed-off-by: ywang96 <ywang@example.com >
Co-authored-by: ywang96 <ywang@example.com >
2024-12-19 16:27:22 +00:00
6c7f881541
[Model] Add JambaForSequenceClassification model ( #10860 )
...
Signed-off-by: Yehoshua Cohen <yehoshuaco@ai21.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Yehoshua Cohen <yehoshuaco@ai21.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 22:48:06 +08:00
a0f7d53beb
[Bugfix] Cleanup Pixtral HF code ( #11333 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 13:22:00 +00:00
5aef49806d
[Feature] Add load generation config from model ( #11164 )
...
Signed-off-by: liuyanyi <wolfsonliu@163.com >
Signed-off-by: Yanyi Liu <wolfsonliu@163.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-12-19 10:50:38 +00:00
98356735ac
[misc] benchmark_throughput: Add LoRA ( #11267 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-19 15:43:16 +08:00
f26c4aeecb
[Misc] Optimize ray worker initialization time ( #11275 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-18 23:38:02 -08:00
8936316d58
[Kernel] Refactor Cutlass c3x ( #10049 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-19 07:00:18 +00:00
6142ef0ada
[VLM] Merged multimodal processor for Qwen2-Audio ( #11303 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-19 06:14:17 +00:00
c6b0a7d3ba
[V1] Simplify prefix caching logic by removing num_evictable_computed_blocks ( #11310 )
2024-12-19 04:17:12 +00:00
a30482f054
[CI] Expand test_guided_generate to test all backends ( #11313 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-19 04:00:38 +00:00
17ca964273
[Model] IBM Granite 3.1 ( #11307 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-12-19 11:27:24 +08:00
5a9da2e6e9
[Bugfix][Build/CI] Fix sparse CUTLASS compilation on CUDA [12.0, 12.2) ( #11311 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-19 02:43:30 +00:00
fdea8ec167
[V1] VLM - enable processor cache by default ( #11305 )
...
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com >
2024-12-18 18:54:46 -05:00
ca5f54a9b9
[Bugfix] fix minicpmv test ( #11304 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-12-18 10:34:26 -08:00
f954fe0e65
[FIX] update openai version ( #11287 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2024-12-18 10:17:05 -08:00
362cff1eb3
[CI][Misc] Remove Github Action Release Workflow ( #11274 )
2024-12-18 10:16:53 -08:00
996aa70f00
[Bugfix] Fix broken phi3-v mm_processor_kwargs tests ( #11263 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-18 10:16:40 -08:00
60508ffda9
[Kernel]: Cutlass 2:4 Sparsity + FP8/Int8 Quant Support ( #10995 )
...
Co-authored-by: Faraz Shahsavan <faraz.shahsavan@gmail.com >
Co-authored-by: ilmarkov <markovilya197@gmail.com >
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-12-18 09:57:16 -05:00
f04e407e6b
[MISC][XPU] update ipex link for CI fix ( #11278 )
2024-12-17 22:34:23 -08:00
8b79f9e107
[Bugfix] Fix guided decoding with tokenizer mode mistral ( #11046 )
2024-12-17 22:34:08 -08:00
866fa4550d
[Bugfix] Restore support for larger block sizes ( #11259 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2024-12-17 16:39:07 -08:00
bf8717ebae
[V1] Prefix caching for vision language models ( #11187 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-17 16:37:59 -08:00
c77eb8a33c
[Bugfix] Set temperature=0.7 in test_guided_choice_chat ( #11264 )
2024-12-17 16:34:06 -08:00
2d1b9baa8f
[Bugfix] Fix request cancellation without polling ( #11190 )
2024-12-17 12:26:32 -08:00
f9ecbb18bf
[Misc] Allow passing logits_soft_cap for xformers backend ( #11252 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-17 00:37:04 -08:00
02222a0256
[Misc] Kernel Benchmark for RMSNorm ( #11241 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Xiaoyu Zhang <BBuf@users.noreply.github.com >
2024-12-17 06:57:02 +00:00
2bfdbf2a36
[V1][Core] Use weakref.finalize instead of atexit ( #11242 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-16 22:11:33 -08:00
e88db68cf5
[Platform] platform agnostic for EngineArgs initialization ( #11225 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-16 22:11:06 -08:00
59c9b6ebeb
[V1][VLM] Proper memory profiling for image language models ( #11210 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: ywang96 <ywang@example.com >
2024-12-16 22:10:57 -08:00
66d4b16724
[Frontend] Add OpenAI API support for input_audio ( #11027 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-16 22:09:58 -08:00
0064f697d3
[CI] Add test case with JSON schema using references + use xgrammar by default with OpenAI parse ( #10935 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-17 11:39:58 +08:00
35bae114a8
fix gh200 tests on main ( #11246 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-16 17:22:38 -08:00
88a412ed3d
[torch.compile] fast inductor ( #11108 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-16 16:15:22 -08:00
c301616ed2
[ci][tests] add gh200 tests ( #11244 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-16 15:53:18 -08:00
35ffa682b1
[Docs] hint to enable use of GPU performance counters in profiling tools for multi-node distributed serving ( #11235 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-12-16 22:20:39 +00:00
551603feff
[core] overhaul memory profiling and fix backward compatibility ( #10511 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-16 13:32:25 -08:00
efbce85f4d
[misc] Layerwise profile updates ( #10242 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-16 18:14:57 +00:00
2ca830dbaa
[Doc] Reorder vision language examples in alphabet order ( #11228 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-16 11:23:33 +00:00
d927dbcd88
[Model] Refactor Ultravox to use merged input processor ( #11198 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-16 10:09:53 +00:00
bddbbcb132
[Model] Support Cohere2ForCausalLM (Cohere R7B) ( #11203 )
2024-12-16 09:56:19 +00:00
b3b1526f03
WIP: [CI/Build] simplify Dockerfile build for ARM64 / GH200 ( #11212 )
...
Signed-off-by: drikster80 <ed.sealing@gmail.com >
Co-authored-by: drikster80 <ed.sealing@gmail.com >
2024-12-16 09:20:49 +00:00
17138af7c4
[Bugfix] Fix the default value for temperature in ChatCompletionRequest ( #11219 )
2024-12-16 00:15:40 -08:00
69ba344de8
[Bugfix] Fix block size validation ( #10938 )
2024-12-15 16:38:40 -08:00
da6f409246
Update deploying_with_k8s.rst ( #10922 )
2024-12-15 16:33:58 -08:00
25ebed2f8c
[V1][Minor] Cache np arange to reduce input preparation overhead ( #11214 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-15 13:33:00 -08:00
d263bd9df7
[Core] Support disaggregated prefill with Mooncake Transfer Engine ( #10884 )
...
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com >
2024-12-15 21:28:18 +00:00
38e599d6a8
[Doc] add documentation for disaggregated prefilling ( #11197 )
...
Signed-off-by: Kuntai Du <kuntai@uchicago.edu >
2024-12-15 13:31:16 -06:00
96d673e0f8
[Bugfix] Fix error handling of unsupported sliding window ( #11213 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-15 10:59:42 -07:00
b10609e6a1
[Misc] Clean up multi-modal processor ( #11207 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-15 06:30:28 +00:00
a1c02058ba
[torch.compile] allow tracking forward time ( #11081 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-14 19:45:00 -08:00
15859f2357
[Misc] Upgrade bitsandbytes to the latest version 0.45.0 ( #11201 )
2024-12-15 03:03:06 +00:00
886936837c
[Performance][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion ( #7209 )
2024-12-14 11:38:10 -08:00
6d917d0eeb
Enable mypy checking on V1 code ( #11105 )
...
Signed-off-by: Mark McLoughlin <markmc@redhat.com >
2024-12-14 09:54:04 -08:00
93abf23a64
[VLM] Fully dynamic prompt replacement in merged input processor ( #11199 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-14 17:52:18 +00:00
9c3dadd1c9
[Frontend] Add logits_processors as an extra completion argument ( #11150 )
...
Signed-off-by: Brad Hilton <brad.hilton.nw@gmail.com >
2024-12-14 16:46:42 +00:00
3cb5769883
[Misc] Minor improvements to the readability of PunicaWrapperBase ( #11200 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-14 16:38:27 +00:00
ea7bd68d10
[V1][Bugfix] Fix V1 TP trust-remote-code ( #11182 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-14 08:21:23 +00:00
48259264a4
[Core] Update outlines and increase its threadpool size ( #11140 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-14 07:46:18 +00:00
24a3d12b82
update compressed-tensors to latest version ( #11183 )
...
Co-authored-by: dhuangnm <dhuang@MacBook-Pro-2.local >
2024-12-14 03:22:44 +00:00
9855aea21b
[Bugfix][V1] Re-compute an entire block when fully cache hit ( #11186 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-13 17:08:23 -08:00
4b5b8a6a3b
[V1][Bugfix] Fix EngineCoreProc profile ( #11185 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-13 17:02:35 -08:00
4863e5fba5
[Core] V1: Use multiprocessing by default ( #11074 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-13 16:27:32 -08:00
0d8451c3a4
[Distributed] Allow the placement group more time to wait for resources to be ready ( #11138 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
2024-12-13 20:17:37 +00:00
0a56bcc03d
[Bugfix][Hardware][CPU] Enable Gemma2 with SDPA on CPU backend ( #11169 )
2024-12-13 18:00:40 +00:00
0920ab9131
[Doc] Reorganize online pooling APIs ( #11172 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-14 00:22:22 +08:00
238c0d93b4
[Misc] Add tokenizer_mode param to benchmark_serving.py ( #11174 )
...
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com >
2024-12-13 16:19:10 +00:00
5b0ed8391d
[Bugfix] using len(tokenizer) instead of tokenizer.vocab_size in AllowedTokenIdsLogitsProcessor ( #11156 )
2024-12-13 15:56:19 +00:00
c31d4a57a6
[Core] support LoRA and prompt adapter in content-based hashing for Block Manager v2 prefix caching ( #8240 )
2024-12-13 07:51:25 -08:00
d1fa714cb1
[Refactor] A simple device-related refactor ( #11163 )
...
Signed-off-by: noemotiovon <noemotiovon@gmail.com >
Co-authored-by: noemotiovon <noemotiovon@gmail.com >
2024-12-13 13:39:00 +00:00
969da7d70b
[V1][VLM] Fix edge case bug for InternVL2 ( #11165 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-13 11:09:30 +00:00
eeec9e3390
[Frontend] Separate pooling APIs in offline inference ( #11129 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-13 10:40:07 +00:00
f93bf2b189
[Bugfix][CI][CPU] add missing datasets package to requirements-cpu.txt ( #11159 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-12-13 08:50:35 +00:00
7cd7409142
PaliGemma 2 support ( #11142 )
2024-12-13 07:40:07 +00:00
be39e3cd18
[core] clean up cudagraph batchsize padding logic ( #10996 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-13 06:57:50 +00:00
34f1a806d5
[Bugfix][V1] Fix 'NoneType' object has no attribute 'hash_value' ( #11157 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-12-13 06:30:06 +00:00
00c1bde5d8
[ROCm][AMD] Disable auto enabling chunked prefill on ROCm ( #11146 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-13 05:31:26 +00:00
3989a79824
[Bugfix] Update starcoder2 to remap k/v scale names for kv_cache quantization ( #11148 )
2024-12-13 05:07:20 +00:00
1efce68605
[Bugfix] Use runner_type instead of task in GritLM ( #11144 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2024-12-13 04:09:53 +00:00
30870b4f66
[torch.compile] Dynamic fp8 + rms_norm fusion ( #10906 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-13 03:19:23 +00:00
78ed8f57d8
[Misc][V1] Fix type in v1 prefix caching ( #11151 )
2024-12-13 00:57:40 +00:00
db6c264a1e
[Bugfix] Fix value unpack error of simple connector for KVCache transfer. ( #11058 )
...
Signed-off-by: ShangmingCai <csmthu@gmail.com >
2024-12-12 21:19:17 +00:00
9f3974a319
Fix logging of the vLLM Config ( #11143 )
2024-12-12 12:05:57 -08:00
2c97eca1ff
[Misc] Validate grammar and fail early ( #11119 )
2024-12-12 18:34:26 +00:00
5d712571af
[Bugfix] Quick fix to make Pixtral-HF load correctly again after 39e227c7ae. ( #11024 )
2024-12-12 18:09:20 +00:00
d4d5291cc2
fix(docs): typo in helm install instructions ( #11141 )
...
Signed-off-by: Ramon Ziai <ramon.ziai@bettermarks.com >
2024-12-12 17:36:32 +00:00
4816d20aa4
[V1] Fix torch profiling for offline inference ( #11125 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-12 15:51:53 +00:00
85362f028c
[Misc][LoRA] Ensure Lora Adapter requests return adapter name ( #11094 )
...
Signed-off-by: Jiaxin Shan <seedjeffwan@gmail.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-12 09:25:16 +00:00
62de37a38e
[core][distributed] initialization from StatelessProcessGroup ( #10986 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-12 09:04:19 +00:00
8195824206
[Hardware][Intel-Gaudi] Enable LoRA support for Intel Gaudi (HPU) ( #10565 )
...
Signed-off-by: Sanju C Sudhakaran <scsudhakaran@habana.ai >
2024-12-12 08:09:28 +00:00
f092153fbe
[V1] Use more persistent buffers to optimize input preparation overheads ( #11111 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-11 23:14:20 -08:00
1da8f0e1dd
[Model] Add support for embedding model GritLM ( #10816 )
...
Signed-off-by: Pooya Davoodi <pooya.davoodi@parasail.io >
2024-12-12 06:39:16 +00:00
ccede2b264
[Core] cleanup zmq ipc sockets on exit ( #11115 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-11 19:12:24 -08:00
24a36d6d5f
Update link to LlamaStack remote vLLM guide in serving_with_llamastack.rst ( #11112 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2024-12-12 02:39:21 +00:00
8fb26dac61
[Docs] Add media kit ( #11121 )
2024-12-11 17:33:11 -08:00
7439a8b5fc
[Bugfix] Multiple fixes to tool streaming with hermes and mistral ( #10979 )
...
Signed-off-by: cedonley <clayton@donley.io >
2024-12-12 01:10:12 +00:00
4e11683368
[V1] VLM preprocessor hashing ( #11020 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Signed-off-by: Alexander Matveev <alexm@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-12 00:55:30 +00:00
452a723bf2
[V1][Core] Remove should_shutdown to simplify core process termination ( #11113 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-11 23:34:54 +00:00
d1e21a979b
[CI/Build] Split up VLM tests ( #11083 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-12 06:18:16 +08:00
72ff3a9686
[core] Bump ray to use _overlap_gpu_communication in compiled graph tests ( #10410 )
...
Signed-off-by: Rui Qiao <ubuntu@ip-172-31-15-128.us-west-2.compute.internal >
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Rui Qiao <ubuntu@ip-172-31-15-128.us-west-2.compute.internal >
2024-12-11 11:36:35 -08:00
66aaa7722d
[torch.compile] remove graph logging in ci ( #11110 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-11 10:59:50 -08:00
d643c2aba1
[V1] Use input_ids as input for text-only models ( #11032 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-11 10:49:23 -08:00
91642db952
[torch.compile] use depyf to dump torch.compile internals ( #10972 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-11 10:43:05 -08:00
fd22220687
[Doc] Installed version of llmcompressor for int8/fp8 quantization ( #11103 )
...
Signed-off-by: Guangda Liu <bingps@users.noreply.github.com >
Co-authored-by: Guangda Liu <bingps@users.noreply.github.com >
2024-12-11 15:43:24 +00:00
b2f775456e
[CI/Build] Enable prefix caching test for AMD ( #11098 )
...
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com >
2024-12-11 15:23:37 +00:00
cad5c0a6ed
[Doc] Update docs to refer to pooling models ( #11093 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-11 13:36:27 +00:00
8f10d5e393
[Misc] Split up pooling tasks ( #10820 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-11 01:28:00 -08:00
40766ca1b8
[Bugfix]: Clamp -inf logprob values in prompt_logprobs ( #11073 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-12-11 01:27:39 -08:00
2e32f5d28d
[Bugfix] Fix Idefics3 fails during multi-image inference ( #11080 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-12-11 01:27:07 -08:00
61b1d2f6ae
[Core] v1: Use atexit to handle engine core client shutdown ( #11076 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-11 01:26:36 -08:00
9974fca047
[ci/build] Fix entrypoints test and pin outlines version ( #11088 )
2024-12-11 01:01:53 -08:00
3fb4b4f163
[ci/build] Fix AMD CI dependencies ( #11087 )
2024-12-11 00:39:53 -08:00
2e33fe4191
[CI/Build] Check transformers v4.47 ( #10991 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-11 05:02:02 +00:00
e39400a4b6
Fix streaming for granite tool call when <|tool_call|> is present ( #11069 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-12-11 04:51:40 +00:00
ffa48c9146
[Model] PP support for Mamba-like models ( #10992 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-12-10 21:53:37 -05:00
d5c5154fcf
[Misc] LoRA + Chunked Prefill ( #9057 )
2024-12-11 10:09:20 +08:00
9a93973708
[Bugfix] Fix Mamba multistep ( #11071 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-11 00:16:22 +00:00
134810b3d9
[V1][Bugfix] Always set enable_chunked_prefill = True for V1 ( #11061 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-10 14:41:23 -08:00
75f89dc44c
[torch.compile] add a flag to track batchsize statistics ( #11059 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-10 12:40:52 -08:00
e739194926
[Core] Update to outlines >= 0.1.8 ( #10576 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-10 12:08:16 -08:00
250ee65d72
[BUG] Remove token param #10921 ( #11022 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
2024-12-10 17:38:15 +00:00
9b9cef3145
[Bugfix] Backport request id validation to v0 ( #11036 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-12-10 16:38:23 +00:00
d05f88679b
[Misc][LoRA] Add PEFTHelper for LoRA ( #11003 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-10 11:12:01 +00:00
beb16b2c81
[Bugfix] Handle <|tool_call|> token in granite tool parser ( #11039 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-12-10 10:27:11 +00:00
fe2e10c71b
Add example of helm chart for vllm deployment on k8s ( #9199 )
...
Signed-off-by: Maxime Fournioux <55544262+mfournioux@users.noreply.github.com >
2024-12-10 09:19:27 +00:00
82c73fd510
[Bugfix] cuda error running llama 3.2 ( #11047 )
2024-12-10 07:41:11 +00:00
bfd610430c
Update README.md ( #11034 )
2024-12-09 23:08:10 -08:00
e35879c276
[Bugfix] Fix xgrammar failing to read a vocab_size from LlavaConfig on PixtralHF. ( #11043 )
2024-12-10 14:54:22 +08:00
ebf778061d
monitor metrics of tokens per step using cudagraph batchsizes ( #11031 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-09 22:35:36 -08:00
28b3a1c7e5
[V1] Multiprocessing Tensor Parallel Support for v1 ( #9856 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-10 06:28:14 +00:00
bc192a2b09
[Pixtral] Improve loading ( #11040 )
2024-12-10 06:09:32 +00:00
980ad394a8
[Frontend] Use request id from header ( #10968 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-12-10 13:46:29 +08:00
391d7b2763
[Bugfix] Fix usage of deprecated decorator ( #11025 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-10 13:45:47 +08:00
d1f6d1c8af
[Model] Add has_weight to RMSNorm and re-enable weights loading tracker for Mamba ( #10739 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-10 10:23:07 +08:00
6d525288c1
[Docs] Add dedicated tool calling page to docs ( #10554 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-09 20:15:34 -05:00
6faec54505
[V1] Do not store None in self.generators ( #11038 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-09 15:08:19 -08:00
5ed5d5f128
Build tpu image in release pipeline ( #10936 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
2024-12-09 23:07:48 +00:00
b63ba84832
[ROCm][bugfix] speculative decoding worker class ( #11035 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-09 14:00:29 -08:00
9c6459e4cb
[Neuron] Upgrade neuron to 2.20.2 ( #11016 )
...
Signed-off-by: Jerzy Zagorski <jzagorsk@amazon.com >
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-12-09 13:53:24 -08:00
1a2f8fb828
[v1] fix use compile sizes ( #11000 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-09 13:47:24 -08:00
cbcbdb1ceb
[Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version ( #11028 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2024-12-09 13:21:06 -08:00
a811dd6608
[Model] merged input processor for Phi-3-Vision models ( #10977 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-09 12:55:10 -08:00
ca871491ed
[Misc][LoRA] Abstract PunicaWrapper ( #10955 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-09 12:54:44 -08:00
3b61cb450d
[V1] Further reduce CPU overheads in flash-attn ( #10989 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-09 12:38:46 -08:00
edc4fa3188
[ci/build] Recompile CI dependencies list with Python 3.12 ( #11013 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-12-09 11:46:58 -08:00
25b79d9fd3
[V1] Input Batch Relocation ( #10962 )
...
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-12-09 09:33:41 -08:00
aea2fc38c3
[Platform] Move async output check to platform ( #10768 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-09 17:24:46 +00:00
e691b26f6f
[Core] Require xgrammar >= 0.1.6 ( #11021 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-09 16:44:27 +00:00
c690357928
[V1] Fix Detokenizer loading in AsyncLLM ( #10997 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-09 16:27:10 +00:00
d1c2e15eb3
[torch.compile] add dynamo time tracking ( #11005 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 23:09:04 -08:00
af7c4a92e6
[Doc][V1] Add V1 support column for multimodal models ( #10998 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-08 22:29:16 -08:00
46004e83a2
[misc] clean up and unify logging ( #10999 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 17:28:27 -08:00
43b05fa314
[torch.compile][misc] fix comments ( #10993 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 11:18:18 -08:00
a11f326528
[V1] Initial support of multimodal models for V1 re-arch ( #10699 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-12-08 12:50:51 +00:00
fd57d2b534
[torch.compile] allow candidate compile sizes ( #10984 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-08 11:05:21 +00:00
7be15d9356
[core][misc] remove use_dummy driver for _run_workers ( #10920 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-07 12:06:08 -08:00
1b62745b1d
[core][executor] simplify instance id ( #10976 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-07 09:33:45 -08:00
78029b34ed
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when conv_states is None ( #10928 )
...
Signed-off-by: xffxff <1247714429@qq.com >
2024-12-08 01:21:18 +08:00
c889d5888b
[Doc] Explicitly state that PP isn't compatible with speculative decoding yet ( #10975 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 17:20:49 +00:00
39e227c7ae
[Model] Update multi-modal processor to support Mantis(LLaVA) model ( #10711 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 17:10:05 +00:00
1c768fe537
[Doc] Explicitly state that InternVL 2.5 is supported ( #10978 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 16:58:02 +00:00
bf0e382e16
[Model] Composite weight loading for multimodal Qwen2 ( #10944 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-07 07:22:52 -07:00
b26b4cd03c
[Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora implementation ( #10958 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-07 18:33:49 +08:00
f13cf9ad50
[Build] Fix for the Wswitch-bool clang warning ( #10060 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-07 09:03:44 +00:00
955fa9533a
[3/N] Support and implement merged input processor for LLaVA model ( #10676 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-07 00:50:58 -08:00
acf092d348
[Bugfix] Fix test-pipeline.yaml ( #10973 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-07 12:08:54 +08:00
69d357ba12
[Core] Cleanup startup logging a bit ( #10961 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-12-07 02:30:23 +00:00
dcdc3fafe5
[ci] fix broken tests ( #10956 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-06 11:25:47 -08:00
c05cfb67da
[misc] fix typo ( #10960 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-06 11:25:20 -08:00
7406274041
[Doc] add KubeAI to serving integrations ( #10837 )
...
Signed-off-by: Sam Stoelinga <sammiestoel@gmail.com >
2024-12-06 17:03:56 +00:00
8b59631855
[Core] Support Lark grammars for XGrammar ( #10870 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-06 08:34:29 -07:00
a1887f2c96
[torch.compile] fix deprecated code ( #10948 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-06 11:01:23 +00:00
222f5b082a
[CI/Build] Fix broken multimodal test ( #10950 )
2024-12-06 10:41:23 +00:00
b031a455a9
[torch.compile] add logging for compilation time ( #10941 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-06 10:07:15 +00:00
db87eb6c67
[torch.compile] use size tuning for specific sizes ( #10933 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-05 20:30:41 -08:00
9743d64e4e
[ci][build] add tests for python only compilation ( #10915 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-05 08:54:47 -08:00
a43065272f
[Misc][Gaudi] Avoid torch.compile and enable lazy collectives ( #10897 )
...
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
2024-12-05 08:47:46 -08:00
998eeafe58
[CI/Build] Bump test transformers version ( #10106 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-05 16:05:52 +00:00
571da8fc43
[Misc][LoRA] Clean up the function interface of Punica ( #10917 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-05 13:22:28 +00:00
39c89e71a8
[Misc] Update llama 3.2 template to support system prompt with images ( #10901 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-12-05 05:54:06 +00:00
1f958a7d52
[Bugfix] Fix BNB loader target_modules ( #10720 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-05 13:20:26 +08:00
aa39a8e175
[Doc] Create a new "Usage" section ( #10827 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-05 11:19:35 +08:00
8d370e91cb
[Bugfix] Fallback to outlines for complex json schemas ( #10899 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-05 11:14:06 +08:00
7883c2bbe7
[benchmark] Make H100 benchmark optional ( #10908 )
2024-12-04 17:02:17 -08:00
2a56e1264f
[V1] Fix when max_model_len is not divisible by block_size ( #10903 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-04 16:54:05 -08:00
e4c34c23de
[CI/Build] improve python-only dev setup ( #9621 )
...
Signed-off-by: Daniele Trifirò <dtrifiro@redhat.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-12-04 21:48:13 +00:00
82eb5ea8f3
Benchmark serving structured output ( #10880 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-12-04 16:28:21 -05:00
10398b4706
[Model] Consolidate ViTs attention implementation without mask ( #10893 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-04 18:11:08 +00:00
01d079fd8e
[LoRA] Change lora_tokenizers capacity ( #10796 )
...
Signed-off-by: Xin Yang <xyang19@gmail.com >
2024-12-04 17:40:16 +00:00
c92acb9693
[ci/build] Update vLLM postmerge ECR repo ( #10887 )
2024-12-04 09:01:20 +00:00
8db957ee3a
[bugfix] fixed parameter “n” when parameter “best_of” > 1 is set ( #10854 )
...
Signed-off-by: jianzheng <57654625+o2363286@users.noreply.github.com >
2024-12-04 08:48:22 +00:00
c9ca4fce3f
[ci/build] Job to build and push release image ( #10877 )
2024-12-04 15:02:40 +08:00
fa2dea61df
[ci/build] Change queue name for Release jobs ( #10875 )
2024-12-04 15:02:16 +08:00
b5b647b084
Drop ROCm load format check ( #10767 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-04 04:32:21 +00:00
d2bd88b122
[CI/Build] Replace mean with torch.all in test_pynccl.py ( #10876 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-12-04 03:23:21 +00:00
381ac93bb5
[Benchmark] Benchmark structured output with datasets ( #10557 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
Co-authored-by: Aaron Pham <contact@aarnphm.xyz >
2024-12-03 17:21:06 -07:00
a061fe601e
[Build][Bugfix] Using the correct type hint ( #10866 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-12-03 15:47:55 -05:00
7c32b6861e
[Frontend] correctly record prefill and decode time metrics ( #10853 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com >
2024-12-03 19:13:31 +00:00
7090c27bb2
[Bugfix] Only require XGrammar on x86 ( #10865 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-12-03 10:32:21 -08:00
2f2cdc745a
[MISC][XPU] quick fix for XPU CI ( #10859 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2024-12-03 17:16:31 +00:00
3bc94cab69
[V1] VLM - Run the mm_mapper preprocessor in the frontend process ( #10640 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-12-03 10:33:10 +00:00
f6084f6324
[Speculative Decoding] Move indices to device before filtering output ( #10850 )
...
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com >
2024-12-03 17:01:39 +08:00
9323a3153b
[Core][Performance] Add XGrammar support for guided decoding and set it as default ( #10785 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Signed-off-by: mgoin <michael@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-12-03 15:17:00 +08:00
3257d449fa
[Misc] Remove deprecated names ( #10817 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-03 06:52:57 +00:00
ef51831ee8
[Doc] Add github links for source code references ( #10672 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-03 06:46:07 +00:00
dc5ce861bf
[torch.compile] remove compilation_context and simplify code ( #10838 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-03 06:19:02 +00:00
21fe7b481a
[core][distributed] add pynccl broadcast ( #10843 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-03 04:53:23 +00:00
a4cf256159
[Bugfix] Fix QKVParallelLinearWithShardedLora bias bug ( #10844 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-03 12:10:29 +08:00
d746268e92
[Model] support bitsandbytes quantization with minicpm model ( #10842 )
...
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com >
2024-12-03 03:06:41 +00:00
4433195ab7
[Bugfix] Prevent benchmark_throughput.py from using duplicated random prompts ( #10753 )
2024-12-03 02:26:15 +00:00
4c05edb33a
[Model] Add TP and BNB quantization support to LlavaMultiModalProjector ( #10834 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-12-02 23:06:09 +00:00
9b14d978aa
Fix openvino on GPU ( #10793 )
2024-12-02 18:52:19 +00:00
519cc6ca12
[Misc][XPU] Avoid torch compile for XPU platform ( #10747 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-12-02 17:53:55 +00:00
b45f0d7946
[Misc][LoRA] Move the implementation of lora bias to punica.py ( #10829 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-12-02 17:53:36 +00:00
a4c4daf364
[misc] use out argument for flash attention ( #10822 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-02 10:50:10 +00:00
e95f275f57
[CI/Build] Update mistral_common version for tests and docs ( #10825 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-02 10:26:10 +00:00
ef31eabc68
[Model]: add some tests for aria model ( #10770 )
...
Signed-off-by: xffxff <1247714429@qq.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-12-02 05:36:36 +00:00
995a148575
[doc] Update config docstring ( #10732 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-12-02 04:14:45 +00:00
63a164172d
[misc] remove xverse modeling file ( #10814 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-02 03:27:13 +00:00
e25810ae29
Fill TorchSDPAAttentionMetadata seq_lens_field for prefill ( #10799 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-12-02 10:05:32 +08:00
073a4bd1c0
[Kernel] Use out arg in flash_attn_varlen_func ( #10811 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-12-01 17:55:39 -08:00
b7954776fd
[core] Avoid metrics log noise when idle - include speculative decodi… ( #10809 )
2024-12-02 01:49:48 +00:00
b18c9bbaba
[Model] Add BNB support to Llava and Pixtral-HF ( #10795 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-12-02 01:31:09 +00:00
0590ec3fd9
[Core] Implement disagg prefill by StatelessProcessGroup ( #10502 )
...
This PR provides initial support for single-node disaggregated prefill in a 1P1D scenario.
Signed-off-by: KuntaiDu <kuntai@uchicago.edu >
Co-authored-by: ApostaC <yihua98@uchicago.edu >
Co-authored-by: YaoJiayi <120040070@link.cuhk.edu.cn >
2024-12-01 19:01:00 -06:00
c11f172187
[Misc] Adding MMMU-Pro vision dataset to serving benchmark ( #10804 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Chen Zhang <zhangch99@outlook.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-12-01 08:47:05 +00:00
169a0ff911
[doc] add warning about comparing hf and vllm outputs ( #10805 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-12-01 00:41:38 -08:00
d2f058e76c
[Misc] Rename embedding classes to pooling ( #10801 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-01 14:36:51 +08:00
f877a7d12a
[Misc] Improve type annotations for support_torch_compile ( #10763 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-30 17:48:35 -08:00
133707123e
[Model] Replace embedding models with pooling adapter ( #10769 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-12-01 08:02:54 +08:00
7e4bbda573
[doc] format fix ( #10789 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-11-30 11:38:40 +00:00
e7cfc4ef4c
[Interleaved ATTN] Support for Mistral-8B ( #10591 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-30 07:45:50 +00:00
16ee07f22a
[Model] Refactor Molmo weights loading to use AutoWeightsLoader ( #10771 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-30 04:19:14 +00:00
40bc242579
[Bugfix] Fix OpenVino/Neuron driver_worker init ( #10779 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
Signed-off-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-11-30 12:07:13 +08:00
661175bc82
[platform] Add verify_quantization in platform. ( #10757 )
...
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com >
2024-11-29 15:22:21 +00:00
3132aac043
[Bugfix] Fix Idefics3 bug ( #10778 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-29 13:56:46 +00:00
c82b432d4a
[Misc] typo fix in sampling_metadata.py ( #10740 )
2024-11-29 05:17:57 +00:00
fa6ecb9aa7
[Model] Clean up MiniCPMV ( #10751 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-29 04:47:06 +00:00
c83919c7a6
[Model] Add Internlm2 LoRA support ( #5064 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-28 17:29:04 +00:00
98f47f2a40
[V1] Optimize the CPU overheads in FlashAttention custom op ( #10733 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 09:01:02 -08:00
8c1e77fb58
[Kernel] Update vllm-flash-attn version to reduce CPU overheads ( #10742 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 08:31:28 -08:00
5fc5ce0fe4
[Model] Added GLM-4 series hf format model support vllm==0.6.4 ( #10561 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-11-28 14:53:31 +00:00
3ed5e73146
[TPU] Update requirements-tpu ( #10726 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
2024-11-28 02:30:48 -08:00
9a8bff0285
[Kernel] Update vllm-flash-attn version ( #10736 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 02:25:59 -08:00
a79b122400
[V1] Do not allocate beyond the max_model_len ( #10730 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-28 00:13:15 -08:00
d9b4b3f069
[Bug][CLI] Allow users to disable prefix caching explicitly ( #10724 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-27 23:59:28 -08:00
278be671a3
[Doc] Update model in arch_overview.rst to match comment ( #10701 )
...
Signed-off-by: spacewander <spacewanderlzx@gmail.com >
2024-11-27 23:58:39 -08:00
70dc14fbd0
[Model] support bitsandbytes quantization with minicpm3 model ( #10682 )
...
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com >
2024-11-27 23:58:02 -08:00
cb4e1c3f3a
[misc] upgrade filelock version ( #10731 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-27 19:54:58 -08:00
395b1c7454
[Frontend] don't block event loop in tokenization (preprocess) in OpenAI compatible server ( #10635 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com >
2024-11-27 13:21:10 -08:00
9b4b150395
[Bugfix] Ignore lm_head when loading embedding models ( #10719 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-27 19:05:29 +00:00
197b4484a3
[Bugfix][Mamba] Fix Multistep on Mamba-like models ( #10705 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-11-27 19:02:27 +00:00
b98c62ba49
[Bugfix] Fix GGUF inference with FP16 unquantized checkpoint ( #10675 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-27 10:43:17 -08:00
c411def234
[torch.compile] fix shape specialization ( #10722 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-27 10:16:10 -08:00
308cc5e21e
[ci] fix slow tests ( #10698 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-27 09:26:14 -08:00
9e0a147d50
[V1] Update interface for mistral-format Pixtral ( #10703 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-27 12:26:27 +00:00
418cb3b93f
[Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault ( #10700 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-27 11:55:38 +00:00
1209261e93
[Model] Support telechat2 ( #10311 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
Co-authored-by: xiangw2 <xiangw2@chinatelecom.cn >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-11-27 11:32:35 +00:00
e2251109c7
[Kernel] Remove if-else with identical branches in marlin 2:4 ( #10687 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-26 22:55:32 -08:00
15cc2a9f1a
[Misc]Further reduce BNB static variable ( #10597 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-26 22:54:12 -08:00
e85250b1d1
[Hardware][Gaudi]add get_name method for HPUAttentionBackend ( #10667 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2024-11-26 22:49:40 -08:00
cfb3bf25fb
[bugfix] fix the default value of llm_int8_threshold in BitsAndBytesConfig ( #10657 )
2024-11-27 13:55:23 +08:00
1bf905ddaa
[Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling. ( #10198 )
...
Signed-off-by: jeongin601 <0200angela@gmail.com >
Signed-off-by: jeong_in.bae <jeong_in.bae@navercorp.com >
2024-11-27 05:07:30 +00:00
0a4d968500
[V1] Update interface for idefics3 ( #10680 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-27 10:04:01 +08:00
0a71900bc9
Remove hard-dependencies of Speculative decode to CUDA workers ( #10587 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2024-11-26 17:57:11 -08:00
2f0a0a17a4
[V1] Refactor model executable interface for multimodal models ( #10570 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-26 20:46:11 +00:00
7576cd38df
[Bugfix] Check bnb_4bit_quant_storage for bitsandbytes ( #10642 )
2024-11-26 12:29:00 -08:00
9a99273b48
[Bugfix] Fix using -O[0,3] with LLM entrypoint ( #10677 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-26 10:44:01 -08:00
f5792c7c4a
[Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson ( #9735 )
...
Signed-off-by: Conroy Cheers <conroy@corncheese.org >
2024-11-26 10:26:28 -08:00
db66e018ea
[Bugfix] Fix for Spec model TP + Chunked Prefill ( #10232 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
Signed-off-by: Sourashis Roy <sroy@roblox.com >
Co-authored-by: Sourashis Roy <sroy@roblox.com >
2024-11-26 09:11:16 -08:00
1f6584ee85
[V1] Enable profile for LLMEngine ( #10665 )
2024-11-26 10:36:45 +00:00
334d64d1e8
[ci] add vllm_test_utils ( #10659 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-26 00:20:04 -08:00
940635343a
[Misc] Remove outdated init protocols ( #10655 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-26 14:55:00 +08:00
9a88f89799
custom allreduce + torch.compile ( #10121 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-25 22:00:16 -08:00
519e8e4182
[v1] EngineArgs for better config handling for v1 ( #10382 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-25 21:09:43 -08:00
a6760f6456
[Feature] vLLM ARM Enablement for AARCH64 CPUs ( #9228 )
...
Signed-off-by: Sanket Kale <sanketk.kale@fujitsu.com >
Co-authored-by: Sanket Kale <sanketk.kale@fujitsu.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-11-25 18:32:39 -08:00
45ac4ff270
[bugfix] fix aria model and add torch.compile ( #10645 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 18:32:09 -08:00
6e9ff050c8
[misc] do not read HOST_IP ( #10644 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 17:04:50 -08:00
9db713a1dc
[Model] Add OLMo November 2024 model ( #10503 )
2024-11-25 17:26:40 -05:00
1b583cfefa
[Doc] Fix typos in docs ( #10636 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 10:15:45 -08:00
cf73f0c95e
[Model] Enable optional prefix when loading embedding models ( #10639 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 18:14:33 +00:00
b1d920531f
[Model]: Add support for Aria model ( #10514 )
...
Signed-off-by: xffxff <1247714429@qq.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-11-25 18:10:55 +00:00
452a4e80c3
[Docs] Add Snowflake Slides ( #10641 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-25 09:34:46 -08:00
c27df94e1f
[Bugfix] Fix chunked prefill with model dtype float32 on Turing Devices ( #9850 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-11-25 12:23:32 -05:00
d04b13a380
[Bug]: Authorization ignored when root_path is set ( #10606 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-25 16:21:41 +00:00
2b0879bfc2
Super tiny little typo fix ( #10633 )
2024-11-25 13:08:30 +00:00
ed46f14321
[Model] Support is_causal HF config field for Qwen2 model ( #10621 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 09:51:20 +00:00
05d1f8c9c6
[misc] move functions to config.py ( #10624 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 09:27:30 +00:00
25d806e953
[misc] add torch.compile compatibility check ( #10618 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-24 23:40:08 -08:00
65813781a2
[torch.compile] add warning for unsupported models ( #10622 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-24 23:27:51 -08:00
7c2134beda
[torch.compile] force inductor threads ( #10620 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-24 23:04:21 -08:00
a30a605d21
[Doc] Add encoder-based models to Supported Models page ( #10616 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-25 06:34:07 +00:00
571841b7fc
[torch.compile] support encoder based models ( #10613 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-25 05:24:33 +00:00
7ea3cd7c3e
[Refactor][MISC] del redundant code in ParallelConfig.postinit ( #10614 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-25 05:14:56 +00:00
214efc2c3c
Support Cross encoder models ( #10400 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Co-authored-by: Flavia Beo <flavia.beo@ibm.com >
2024-11-24 18:56:20 -08:00
49628fe13e
[Doc] Update README.md with Ray Summit talk links ( #10610 )
2024-11-24 16:45:09 -08:00
e4fbb14414
[doc] update the code to add models ( #10603 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-24 11:21:40 -08:00
c055747867
[model][utils] add extract_layer_index utility function ( #10599 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-23 22:22:54 -08:00
eda2b3589c
Revert "Print running script to enhance CI log readability" ( #10601 )
2024-11-23 21:31:47 -08:00
1c445dca51
[CI/Build] Print running script to enhance CI log readability ( #10594 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-24 03:57:13 +00:00
1700c543a5
[Bugfix] Fix LoRA weight sharding ( #10450 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-23 17:23:17 -08:00
17d8fc1806
[bugfix] Fix example/tensorize_vllm_model tests ( #10595 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-23 17:22:33 -08:00
04668ebe7a
[Bugfix] Avoid import AttentionMetadata explicitly in Mllama ( #10593 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-23 18:12:20 +00:00
651f6c31ac
For ppc64le, disabled tests for now and addressed space issues ( #10538 )
2024-11-23 09:33:53 +00:00
86a44fb896
[Platforms] Refactor openvino code ( #10573 )
...
Signed-off-by: statelesshz <hzji210@gmail.com >
2024-11-22 22:23:12 -08:00
4cfe5d2bca
[Bugfix] multi_modal_kwargs broadcast for CPU tensor parallel ( #10541 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-22 21:25:46 -08:00
c8acd80548
[2/N] handling placeholders in merged multi-modal processor ( #10485 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-22 21:25:09 -08:00
4634a89d18
Prefix Cache Aware Scheduling [1/n] ( #10128 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-22 21:15:55 -08:00
7c25fe45a6
[AMD] Add support for GGUF quantization on ROCm ( #10254 )
2024-11-22 21:14:49 -08:00
02a43f82a9
Update default max_num_batch_tokens for chunked prefill to 2048 ( #10544 )
2024-11-22 21:14:19 -08:00
cfea9c04ef
[Model] Fix Baichuan BNB online quantization ( #10572 )
...
Signed-off-by: Chen Wu <cntryroa@gmail.com >
2024-11-22 21:13:59 -08:00
7d8ffb344f
[Bugfix] Internal Server Error when tool_choice is incorrect. ( #10567 )
...
Signed-off-by: Varun Shenoy <varun.vinayak.shenoy@oracle.com >
2024-11-22 21:13:29 -08:00
4aba6e3d1a
[core] gemma2 full context length support ( #10584 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 20:13:54 -08:00
978b39744b
[Misc] Add pynccl wrappers for all_gather and reduce_scatter ( #9432 )
2024-11-22 22:14:03 -05:00
ebda51968b
[Core] Fix broken log configuration ( #10458 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-23 10:23:51 +08:00
9195dbdbca
[Bugfix][Frontend] Update Llama Chat Templates to also support Non-Tool use ( #10164 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-11-23 10:17:38 +08:00
d559979c54
[bugfix] fix cpu tests ( #10585 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 17:34:03 -08:00
d345f409b7
[V1] EngineCore supports profiling ( #10564 )
...
Signed-off-by: Abatom <abzhonghua@gmail.com >
2024-11-22 17:16:15 -08:00
28598f3939
[Core] remove temporary local variables in LLMEngine.__init__ ( #10577 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-22 16:22:53 -08:00
948c859571
support bitsandbytes quantization with qwen model ( #10549 )
...
Signed-off-by: Ubuntu <zixuanzhang@bytedance.com >
2024-11-22 16:16:14 -08:00
97814fbf0f
[v1] Refactor KVCacheManager for more hash input than token ids ( #10507 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-11-22 23:27:25 +00:00
eebad39f26
[torch.compile] support all attention backends ( #10558 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 14:04:42 -08:00
db100c5cde
[bugfix] fix full graph tests ( #10581 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-22 10:02:14 -08:00
11fcf0e066
Remove token-adding chat embedding params ( #10551 )
...
Signed-off-by: Noam Gat <noamgat@gmail.com >
2024-11-21 23:59:47 -08:00
b6374e09b0
[Bugfix] Fix Phi-3 BNB quantization with tensor parallel ( #9948 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-22 15:01:56 +08:00
a111d0151f
[platforms] absorb worker cls difference into platforms folder ( #10555 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2024-11-21 21:00:32 -08:00
446c7806b2
[Minor] Fix line-too-long ( #10563 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-21 19:40:40 -08:00
33e0a2540a
[9/N] torch.compile LLM usage ( #10552 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-21 19:13:31 -08:00
aed074860a
[Benchmark] Add new H100 machine ( #10547 )
2024-11-21 18:27:20 -08:00
9afa014552
Add small example to metrics.rst ( #10550 )
2024-11-21 23:43:43 +00:00
46fe9b46d8
[Minor] Revert change in offline inference example ( #10545 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-21 21:28:16 +00:00
cf656f5a02
[misc] improve error message ( #10553 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-21 13:13:17 -08:00
edec3385b6
[CI][Installation] Avoid uploading CUDA 11.8 wheel ( #10535 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-11-21 13:03:58 -08:00
f9310cbd0c
[V1] Fix Compilation config & Enable CUDA graph by default ( #10528 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-21 12:53:39 -08:00
7560ae5caf
[8/N] enable cli flag without a space ( #10529 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-21 12:30:42 -08:00
e7a8341c7c
[Bugfix] Allow token ID-only inputs in Qwen2-Audio ( #10536 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-21 18:09:43 +00:00
c51e397fe8
[Misc] Suppress duplicated logging regarding multimodal input pipeline ( #10530 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-21 09:21:31 -08:00
2385b60d83
[Kernel] Register punica ops directly ( #10522 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-21 09:18:11 -08:00
da7e702c6f
[Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored ( #10180 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-21 16:24:32 +00:00
4d676f0852
[Bugfix] Embedding model pooling_type equals ALL and multi input's bug ( #10494 )
2024-11-21 14:40:02 +00:00
d5ec121f95
[Model] Expose dynamic_image_size as mm_processor_kwargs for InternVL2 models ( #10518 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-21 14:20:08 +00:00
8a93a598d9
fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len ( #10524 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
2024-11-21 11:15:36 +00:00
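The fix above targets a case where tokenizing a constructed prompt yields more input_ids than the intended prompt_len. As a hedged illustration of the general pitfall (not vLLM's benchmark code), a tokenizer that inserts special tokens inflates the count; the model name below is only an example and assumes the transformers package is installed.

```python
# Illustration of how special tokens make len(tokenizer(prompt)["input_ids"]) exceed
# the raw token count; example model only, not vLLM's benchmark code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example model
prompt = "hello world"

with_special = tokenizer(prompt)["input_ids"]                        # includes [CLS]/[SEP]
without_special = tokenizer(prompt, add_special_tokens=False)["input_ids"]

print(len(with_special), len(without_special))  # e.g. 4 vs. 2
```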
1cfde82ffd
[Model] Add Support for Multimodal Granite Models ( #10291 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-21 10:46:20 +00:00
f0e0238016
[Doc] fix a small typo in docstring of llama_tool_parser ( #10513 )
2024-11-21 09:05:23 +00:00
aaddce5d26
[platforms] improve error message for unspecified platforms ( #10520 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 23:07:56 -08:00
3430857b64
[Misc] Increase default video fetch timeout ( #10495 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-20 23:06:42 -08:00
8b0fe06c89
[torch.compile] Inductor code caching fix ( #10273 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Signed-off-by: Luka Govedic <luka.govedic@gmail.com >
2024-11-20 21:44:57 -08:00
9d827170a3
[Platforms] Add device_type in Platform ( #10508 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-21 04:44:20 +00:00
6c1208d083
[Core] Add Sliding Window Support with Flashinfer ( #10462 )
...
Signed-off-by: Pavani Majety <pmajety@nvidia.com >
2024-11-20 19:56:47 -08:00
388ee3de66
[torch.compile] limit inductor threads and lazy import quant ( #10482 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 18:36:33 -08:00
2f77b6cfec
[TPU] Implement prefix caching for TPUs ( #10307 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-20 13:54:15 -08:00
c68f7ede6a
[Bugfix]: allow extra fields in requests to openai compatible server ( #10463 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-20 16:42:21 -05:00
0cd3d9717e
[7/N] torch.compile, reduce compilation time ( #10460 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 11:20:38 -08:00
5f1d6af2b6
[perf bench] H200 development ( #9768 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-20 11:06:56 -08:00
772a66732d
[platforms] restore xpu check for parallel config ( #10479 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-20 17:13:28 +00:00
63f1fde277
[Hardware][CPU] Support chunked-prefill and prefix-caching on CPU ( #10355 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-20 10:57:39 +00:00
d5b28447e0
[Platforms] Refactor xpu code ( #10468 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-19 22:52:13 -08:00
09dbf9ff16
[Bugfix] Handle conflicts between modern and legacy fields ( #10471 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-20 14:45:08 +08:00
343041c4c4
[model] Reduce medusa weight ( #10454 )
...
Signed-off-by: skylee-01 <497627264@qq.com >
2024-11-20 06:05:55 +00:00
ed701ca963
[ci/build] Combine nightly and optional ( #10465 )
2024-11-19 21:36:03 -08:00
7629a9c6e5
[CI/Build] Support compilation with local cutlass path ( #10423 ) ( #10424 )
2024-11-19 21:35:50 -08:00
709c9f1f25
[CI/Build] Add sphinx/rst linter for docs ( #10366 )
2024-11-19 21:35:31 -08:00
b4be5a8adb
[Bugfix] Enforce no chunked prefill for embedding models ( #10470 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-20 05:12:51 +00:00
ad44437ba3
[Bugfix] Fix Mamba model initialization and MLP Speculator weights loading ( #10456 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-20 05:04:05 +00:00
9e05252b46
[Misc] Add __setitem__ for LazyDict ( #10469 )
...
Signed-off-by: Yanyi Liu <wolfsonliu@163.com >
2024-11-20 04:44:57 +00:00
d200972e7f
[Bugfix] Marlin 2:4 temp fix for large M dim (>256) ( #10464 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-11-19 19:40:33 -08:00
d5b68aba2f
[CI/Build] Update Dockerfile.rocm ( #10434 )
...
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
2024-11-19 17:19:59 -08:00
a324d3a1a7
Change granite chat template to keep json list formatting for tool calls ( #10452 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
2024-11-19 18:16:54 -07:00
b00b33d77e
[Model][Quantization] HQQ support through Marlin kernel expansion ( #9766 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
2024-11-19 13:31:12 -08:00
efa9084628
[Core] Avoid metrics log noise when idle ( #8868 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-19 21:05:25 +00:00
803f37eaaa
[6/N] torch.compile rollout to users ( #10437 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-19 10:09:03 -08:00
fd9f124971
[Doc] fix link for page that was renamed ( #10455 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-19 09:48:30 -08:00
1ea291a417
Fix: Build error seen on Power Architecture ( #10421 )
...
Signed-off-by: Manjul Mohan <manjul.mohan@ibm.com >
Signed-off-by: B-201 <Joy25810@foxmail.com >
Signed-off-by: Isotr0py <2037008807@qq.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Signed-off-by: ismael-dm <ismaeldm99@gmail.com >
Signed-off-by: Andrew Nesbitt <andrewnez@gmail.com >
Signed-off-by: mgoin <michael@neuralmagic.com >
Signed-off-by: yan ma <yan.ma@intel.com >
Signed-off-by: Angus Wang <wangjadehao@gmail.com >
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
Signed-off-by: rickyx <rickyx@anyscale.com >
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: Mengqing Cao <cmq0113@163.com >
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Manjul Mohan manjul.mohan@ibm.com <manjulmohan@ltcd97-lp2.aus.stglabs.ibm.com >
Co-authored-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: Isotr0py <2037008807@qq.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: ismael-dm <ismaeldm99@gmail.com >
Co-authored-by: Andrew Nesbitt <andrewnez@gmail.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Yan Ma <yan.ma@intel.com >
Co-authored-by: Angus Wang <wangjadehao@gmail.com >
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com >
Co-authored-by: Ricky Xu <rickyx@anyscale.com >
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
Co-authored-by: Mengqing Cao <cmq0113@163.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2024-11-19 09:34:57 -08:00
11fd7ea639
[Pixtral-Large] Pixtral actually has no bias in vision-lang adapter ( #10449 )
2024-11-19 17:33:06 +00:00
f028dff33d
[BugFix] Fix hermes tool parser output error stream arguments in some cases ( #10395 ) ( #10398 )
...
Signed-off-by: xiyuan lee <lixiyuan@haier.com >
2024-11-19 13:42:50 +00:00
b4614656b8
[CI][CPU] adding numa node number as container name suffix ( #10441 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2024-11-19 13:16:43 +00:00
25f9c78961
[misc][plugin] improve plugin loading ( #10443 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-19 10:43:21 +00:00
5390d6664f
[Doc] Add the start of an arch overview page ( #10368 )
2024-11-19 09:52:11 +00:00
382b6a4852
[Misc] Avoid misleading warning messages ( #10438 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-19 08:54:58 +00:00
272e31c0bd
[Bugfix] Guard for negative counter metrics to prevent crash ( #10430 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-11-19 04:57:10 +00:00
74f8c2cf5f
Add openai.beta.chat.completions.parse example to structured_outputs.rst ( #10433 )
2024-11-19 04:37:46 +00:00
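The docs addition above refers to the OpenAI Python client's structured-output helper. A minimal sketch of that kind of example is shown below, assuming the openai and pydantic packages, a locally running OpenAI-compatible server at http://localhost:8000/v1, and a placeholder model name.

```python
# Hedged sketch of the kind of example referenced above: `beta.chat.completions.parse`
# with a Pydantic schema. The base_url, api_key, and model name are assumptions.
from openai import OpenAI
from pydantic import BaseModel


class CarDescription(BaseModel):
    brand: str
    model: str
    year: int


client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.beta.chat.completions.parse(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Describe a classic sports car as JSON."}],
    response_format=CarDescription,
)
print(completion.choices[0].message.parsed)
```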
8c1fb50705
[Platform][Refactor] Extract func get_default_attn_backend to Platform ( #10358 )
...
Signed-off-by: Mengqing Cao <cmq0113@163.com >
2024-11-19 11:22:26 +08:00
7eb719df13
[Bugfix]Fix Phi-3 BNB online quantization ( #10417 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-19 03:21:42 +00:00
284203f171
[ci/build] Have dependabot ignore all patch update ( #10436 )
...
We have too many dependencies and all patch updates can be a little noisy. This is to have dependabot ignore all patch version updates.
2024-11-19 01:04:25 +00:00
90a6c759ca
[misc] partial prefix & random input generation benchmark ( #9929 )
...
Signed-off-by: rickyx <rickyx@anyscale.com >
2024-11-18 15:39:14 -08:00
2298e69b5f
[ci][bugfix] fix kernel tests ( #10431 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-18 15:29:37 -08:00
a03ea40792
[3/N][torch.compile] consolidate custom op logging ( #10399 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-18 15:14:59 -08:00
96d999fbe8
[Kernel] Initial Machete W4A8 support + Refactors ( #9855 )
...
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-11-18 12:59:29 -07:00
c2170a5b39
[Kernel] Explicitly specify other value in tl.load calls ( #9014 )
...
Signed-off-by: Angus Wang <wangjadehao@gmail.com >
2024-11-18 11:39:40 -08:00
6b2d25efc7
[Hardware][XPU] AWQ/GPTQ support for xpu backend ( #10107 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2024-11-18 11:18:05 -07:00
281cc4b3cd
[Model][Bugfix] Support TP for PixtralHF ViT ( #10405 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-18 10:04:14 -08:00
4f686d139f
Fix open_collective value in FUNDING.yml ( #10426 )
...
Signed-off-by: Andrew Nesbitt <andrewnez@gmail.com >
2024-11-18 09:52:42 -08:00
31894a2155
[Doc] Add documentation for Structured Outputs ( #9943 )
...
Signed-off-by: ismael-dm <ismaeldm99@gmail.com >
2024-11-18 09:52:12 -08:00
7851b45196
[5/N][torch.compile] torch.jit.script --> torch.compile ( #10406 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-18 23:20:06 +08:00
4186be8111
[Doc] Update doc for LoRA support in GLM-4V ( #10425 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-11-18 15:08:30 +00:00
e7ebb662d7
[Model] Remove transformers attention porting in VITs ( #10414 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-18 21:45:21 +08:00
5be4e52b65
[Model][LoRA]LoRA support added for glm-4v ( #10418 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-11-18 12:57:10 +00:00
01aae1cc68
[Model] Remove redundant softmax when using PoolingType.STEP ( #10415 )
2024-11-18 10:05:36 +00:00
c7dec926f6
[VLM] Report multi_modal_placeholders in output ( #10407 )
...
Signed-off-by: Linkun Chen <lkchen+anyscale@github.com >
2024-11-18 16:06:16 +08:00
51bb12d17b
[4/N][torch.compile] clean up set_torch_compile_backend ( #10401 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-17 23:57:20 -08:00
47826cacf0
[Bugfix] Ignore ray reinit error when current platform is ROCm or XPU ( #10375 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2024-11-18 11:29:26 +08:00
c4e464333e
[Misc] Add uninitialized params tracking for AutoWeightsLoader ( #10327 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-18 09:07:46 +08:00
d1557e66d3
[Misc] Enhance offline_inference to support user-configurable paramet… ( #10392 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2024-11-17 11:32:40 +00:00
80d85c5d7b
[Bugfix] Fix mrope_position_delta in non-last prefill chunk ( #10403 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2024-11-17 08:50:24 +00:00
76aab90ab6
[Hardware] [HPU] add mark_step for hpu ( #10239 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
2024-11-17 00:44:44 -08:00
8d74b5aee9
[platforms] refactor cpu code ( #10402 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-16 23:14:23 -08:00
cf349c4a97
[Bugfix][CPU] Fix CPU embedding runner with tensor parallel ( #10394 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-16 23:12:04 -08:00
905d0f0af4
[CI/Build] Fix IDC hpu [Device not found] issue ( #10384 )
...
Signed-off-by: Chendi Xue <chendi.xue@intel.com >
2024-11-17 14:58:22 +08:00
643ecf7b11
[V1] Refactor model executable interface for all text-only language models ( #10374 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-17 05:18:46 +00:00
4fd9375028
[2/N][torch.compile] make compilation cfg part of vllm cfg ( #10383 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-16 18:02:14 -08:00
661a34fd4f
[V1] Add code owners for V1 ( #10397 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-16 10:45:26 -08:00
361c29e174
[Bugfix] Fix M-RoPE position calculation when chunked prefill is enabled ( #10388 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2024-11-17 02:10:00 +08:00
b98d89efd4
[Misc] Medusa supports custom bias ( #10361 )
2024-11-16 16:33:01 +00:00
8b6725b0cf
[Misc] Update benchmark to support image_url file or http ( #10287 )
...
Signed-off-by: rbbang <anjaehyun87@gmail.com >
2024-11-16 18:15:40 +08:00
1d75472626
[BugFix] [Kernel] Fix GPU SEGV occurring in fused_moe kernel ( #10385 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2024-11-16 09:55:05 +00:00
2f427c2d16
[misc][plugin] improve log messages ( #10386 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-16 01:23:20 -08:00
755b85359b
[doc] add doc for the plugin system ( #10372 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-15 21:46:27 -08:00
32e46e000f
[Frontend] Automatic detection of chat content format from AST ( #9919 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-16 13:35:40 +08:00
4f168f69a3
[Docs] Misc updates to TPU installation instructions ( #10165 )
2024-11-15 13:26:17 -08:00
3e8d14d8a1
[Doc] Move PR template content to docs ( #10159 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-15 13:20:20 -08:00
a067f85e08
[Frontend] Add --version flag to CLI ( #10369 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-15 13:13:53 -08:00
c76ac49d26
[Docs] Add Nebius as sponsors ( #10371 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-15 12:47:40 -08:00
a6221a144a
[Misc] bump mistral common version ( #10367 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-15 09:48:07 -08:00
79ee45b428
[Misc] Bump up test_fused_moe tolerance ( #10364 )
...
Signed-off-by: ElizaWszola <eliza@neuralmagic.com >
2024-11-15 16:31:18 +00:00
691a3ec047
[Bugfix] Ensure special tokens are properly filtered out for guided structured output with MistralTokenizer ( #10363 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-15 14:50:40 +00:00
3a763ba0c3
[core][misc] keep compatibility for old-style classes ( #10356 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-15 13:55:51 +00:00
f2056f726d
[Misc] Fix some help info of arg_utils to improve readability ( #10362 )
2024-11-15 12:40:30 +00:00
1d65ec7eeb
[Bugfix] Fix fully sharded LoRA bug ( #10352 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-15 10:34:58 +00:00
26908554b2
[Doc] Remove float32 choice from --lora-dtype ( #10348 )
...
Signed-off-by: Xin Yang <xyang19@gmail.com >
2024-11-15 10:22:57 +00:00
b311efd0bd
[Misc] Fix import error in tensorizer tests and cleanup some code ( #10349 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-15 09:34:17 +00:00
3d158cdc8d
Add default value to avoid Falcon crash ( #5363 ) ( #10347 )
...
Signed-off-by: wchen61 <wchen61@foxmail.com >
2024-11-15 08:52:20 +00:00
02dbf30e9a
[Build] skip renaming files for release wheels pipeline ( #9671 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-14 23:31:52 -08:00
2ac6d0e75b
[Misc] Consolidate pooler config overrides ( #10351 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-15 06:59:00 +00:00
2ec8827288
[Bugfix] Qwen-vl output is inconsistent in speculative decoding ( #10350 )
2024-11-15 05:40:10 +00:00
b40cf6402e
[Model] Support Qwen2 embeddings and use tags to select model tests ( #10184 )
2024-11-14 20:23:09 -08:00
2885ba0e24
[Misc] Change RedundantReshapesPass and FusionPass logging from info to debug ( #10308 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-15 02:44:26 +00:00
bf2ddc6610
[bugfix] Fix static asymmetric quantization case ( #10334 )
...
Signed-off-by: Daniël de Kok <me@danieldk.eu >
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: Daniël de Kok <me@danieldk.eu >
2024-11-15 09:35:11 +08:00
972112d82f
[Bugfix] Fix unable to load some models ( #10312 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-14 16:55:54 -08:00
11cd1ae6ad
[Tool parsing] Improve / correct mistral tool parsing ( #10333 )
2024-11-15 00:42:49 +00:00
554af9228d
[Bugfix] use AF_INET6 for OpenAI Compatible Server with ipv6 ( #9583 )
...
Signed-off-by: xiaozijin <xiaozijin@bytedance.com >
2024-11-14 16:38:53 -08:00
b2e0ad3b59
[Perf] Reduce peak memory usage of llama ( #10339 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
2024-11-15 00:38:20 +00:00
4a18fd14ba
Support Roberta embedding models ( #9387 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Co-authored-by: Flavia Beo <flavia.beo@ibm.com >
2024-11-14 21:23:29 +00:00
1dbae0329c
[Docs] Publish meetup slides ( #10331 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-14 16:19:38 +00:00
675d603400
[CI/Build] Make shellcheck happy ( #10285 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-14 09:47:53 +00:00
03025c023f
[CI/Build] Fix CPU CI online inference timeout ( #10314 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-14 16:45:32 +08:00
29f3ef26a3
[ci][distributed] disable hanging tests ( #10317 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-14 00:23:39 -08:00
294bf467ba
[Model] Add BNB quantization support for Idefics3 ( #10310 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-14 06:31:44 +00:00
52b48c1ead
[BugFix]: properly deserialize tool_calls iterator before processing by mistral-common when MistralTokenizer is used ( #9951 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-14 04:48:16 +00:00
f67ce05d0b
[Frontend] Pythonic tool parser ( #9859 )
...
Signed-off-by: Mike Depinet <mike@fixie.ai >
2024-11-14 04:14:34 +00:00
e0853b6508
[Misc] format.sh: Simplify tool_version_check ( #10305 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-14 11:12:35 +08:00
504ac53d18
[misc] error early for old-style class ( #10304 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-13 18:55:39 -08:00
15bb8330aa
[Bugfix] Fix tensor parallel for qwen2 classification model ( #10297 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-14 10:54:59 +08:00
ac49b59d8b
[Bugfix] bitsandbytes models fail to run pipeline parallel ( #10200 )
...
Signed-off-by: Hoang Cong Duc <hoangcongducltt@gmail.com >
2024-11-13 09:56:39 -07:00
0b8bb86bf1
[1/N] Initial prototype for multi-modal processor ( #10044 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-13 12:39:03 +00:00
bb7991aa29
[V1] Add missing tokenizer options for Detokenizer ( #10288 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-13 11:02:56 +00:00
d909acf9fe
[Model][LoRA]LoRA support added for idefics3 ( #10281 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-11-13 17:25:59 +08:00
b6dde33019
[Core] Flashinfer - Remove advance step size restriction ( #10282 )
2024-11-13 16:29:32 +08:00
1b886aa104
[Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 ( #9944 )
...
Signed-off-by: FurtherAI <austin.veselka@lighton.ai >
Co-authored-by: FurtherAI <austin.veselka@lighton.ai >
2024-11-13 08:28:13 +00:00
3945c82346
[Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions ( #10221 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2024-11-13 07:07:22 +00:00
032fcf16ae
[Doc] Fix typo in arg_utils.py ( #10264 )
...
Signed-off-by: Xin Yang <xyang19@gmail.com >
2024-11-12 21:54:52 -08:00
56a955e774
Bump to compressed-tensors v0.8.0 ( #10279 )
...
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2024-11-12 21:54:10 -08:00
bbd3e86926
[V1] Support VLMs with fine-grained scheduling ( #9871 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-11-13 04:53:13 +00:00
0d4ea3fb5c
[core][distributed] use tcp store directly ( #10275 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 17:36:08 -08:00
112fa0bbe5
[V1] Fix CI tests on V1 engine ( #10272 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 16:17:20 -08:00
377b74fe87
Revert "[ci][build] limit cmake version" ( #10271 )
2024-11-12 15:06:48 -08:00
18081451f9
[doc] improve debugging doc ( #10270 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 14:43:52 -08:00
96ae0eaeb2
[doc] fix location of runllm widget ( #10266 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 14:34:39 -08:00
1f55e05713
[V1] Enable Inductor when using piecewise CUDA graphs ( #10268 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 13:39:56 -08:00
8a06428c70
[LoRA] Adds support for bias in LoRA ( #5733 )
...
Signed-off-by: Umesh Deshpande <udeshpa@us.ibm.com >
Co-authored-by: Umesh Deshpande <udeshpa@us.ibm.com >
2024-11-12 11:08:40 -08:00
b41fb9d3b1
[Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers ( #9982 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
2024-11-12 10:53:57 -08:00
7c65527918
[V1] Use pickle for serializing EngineCoreRequest & Add multimodal inputs to EngineCoreRequest ( #10245 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 08:57:14 -08:00
47db6ec831
[Frontend] Add per-request number of cached token stats ( #10174 )
2024-11-12 16:42:28 +00:00
176fcb1c71
[Bugfix] Fix QwenModel argument ( #10262 )
...
Signed-off-by: Jie Fu <jiefu@tencent.com >
2024-11-12 16:36:51 +00:00
a838ba7254
[Misc]Fix Idefics3Model argument ( #10255 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-12 13:07:11 +00:00
36c513a076
[BugFix] Do not raise a ValueError when tool_choice is set to the supported none option and tools are not defined. ( #10000 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-12 11:13:46 +00:00
d201d41973
[CI][CPU]refactor CPU tests to allow to bind with different cores ( #10222 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2024-11-12 10:07:32 +00:00
3a28f18b0b
[doc] explain the class hierarchy in vLLM ( #10240 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 22:56:44 -08:00
812c981fa0
Splitting attention kernel file ( #10091 )
...
Signed-off-by: maleksan85 <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2024-11-11 22:55:07 -08:00
7f5edb5900
[Misc][LoRA] Replace hardcoded cuda device with configurable argument ( #10223 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-12 11:10:15 +08:00
eea55cca5b
[1/N] torch.compile user interface design ( #10237 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 18:01:06 -08:00
9cdba9669c
[Doc] Update help text for --distributed-executor-backend ( #10231 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-12 09:55:09 +08:00
d1c6799b88
[doc] update debugging guide ( #10236 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 15:21:12 -08:00
6ace6fba2c
[V1] AsyncLLM Implementation ( #9826 )
...
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-11 23:05:38 +00:00
08f93e7439
Make shutil rename in python_only_dev ( #10233 )
...
Signed-off-by: shcheglovnd <shcheglovnd@avride.ai >
2024-11-11 14:29:19 -08:00
9d5b4e4dea
[V1] Enable custom ops with piecewise CUDA graphs ( #10228 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:58:07 -08:00
8a7fe47d32
[misc][distributed] auto port selection and disable tests ( #10226 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 11:54:59 -08:00
4800339c62
Add docs on serving with Llama Stack ( #10183 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2024-11-11 11:28:55 -08:00
fe15729a2b
[V1] Use custom ops for piecewise CUDA graphs ( #10227 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:26:48 -08:00
330e82d34a
[v1][torch.compile] support managing cudagraph buffer ( #10203 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:10:27 -08:00
d7a4f2207b
[V1] Do not use inductor for piecewise CUDA graphs ( #10225 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:05:57 -08:00
f9dadfbee3
[V1] Fix detokenizer ports ( #10224 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 10:42:07 -08:00
25144ceed0
Bump actions/setup-python from 5.2.0 to 5.3.0 ( #10209 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 17:24:10 +00:00
e6de9784d2
[core][distributed] add stateless process group ( #10216 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 09:02:14 -08:00
36fc439de0
[Doc] fix doc string typo in block_manager swap_out function ( #10212 )
2024-11-11 08:53:07 -08:00
874f551b36
[Metrics] add more metrics ( #4464 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-12 00:17:38 +08:00
2cebda42bb
[Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner ( #10218 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-11 12:37:58 +00:00
5fb1f935b0
[V1] Allow tokenizer_mode and trust_remote_code for Detokenizer ( #10211 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-11 18:01:18 +08:00
36e4acd02a
[LoRA][Kernel] Remove the unused libentry module ( #10214 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-11 09:43:23 +00:00
58170d6503
[Hardware][CPU] Add embedding models support for CPU backend ( #10193 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-11 08:54:28 +00:00
9804ac7c7c
Bump the patch-update group with 5 updates ( #10210 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 07:22:40 +00:00
f89d18ff74
[6/N] pass whole config to inner model ( #10205 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 06:41:46 +00:00
f0f2e5638e
[doc] improve debugging code ( #10206 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-10 17:49:40 -08:00
ad9a78bf64
[Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py ( #10196 )
2024-11-11 00:14:22 +00:00
73b9083e99
[misc] improve cloudpickle registration and tests ( #10202 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 00:10:53 +00:00
20cf2f553c
[Misc] small fixes to function tracing file path ( #9543 )
...
Signed-off-by: Shawn Du <shawnd200@outlook.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-10 15:21:06 -08:00
bfb7d61a7c
[doc] Polish the integration with huggingface doc ( #10195 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-10 10:22:04 -08:00
19682023b6
[Doc] Fix typo error in CONTRIBUTING.md ( #10190 )
...
Signed-off-by: FuryMartin <furymartin9910@outlook.com >
2024-11-10 07:47:24 +00:00
9fa4bdde9d
[ci][build] limit cmake version ( #10188 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 16:27:26 -08:00
51c2e1fcef
[CI/Build] Split up models tests ( #10069 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 11:39:14 -08:00
b09895a618
[Frontend][Core] Override HF config.json via CLI ( #5836 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 16:19:27 +00:00
d88bff1b96
[Frontend] add add_request_id middleware ( #9594 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2024-11-09 10:18:29 +00:00
9e37266420
bugfix: fix the bug that stream generate does not work ( #2756 )
2024-11-09 10:09:48 +00:00
8a4358ecb5
[doc] explaining the integration with huggingface ( #10173 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 01:02:54 -08:00
bd46357ad9
[bugfix] fix broken tests of mlp speculator ( #10177 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 00:04:50 -08:00
f192aeba74
[Bugfix] Enable some fp8 and quantized fullgraph tests ( #10171 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-09 08:01:27 +00:00
8e1529dc57
[CI/Build] Add run-hpu-test.sh script ( #10167 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2024-11-09 06:26:52 +00:00
1a95f10ee7
[5/N] pass the whole config to model ( #9983 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 14:17:28 +08:00
49d2a41a86
[Doc] Adjust RunLLM location ( #10176 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 20:07:10 -08:00
47672f38b5
[CI/Build] Fix VLM broadcast tests tensor_parallel_size passing ( #10161 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-09 04:02:59 +00:00
f83feccd7f
[Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module ( #10169 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-09 03:36:46 +00:00
e0191a95d8
[0/N] Rename MultiModalInputs to MultiModalKwargs ( #10040 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 11:31:02 +08:00
d7edca1dee
[CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking ( #6892 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 03:27:11 +00:00
127c07480e
[Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case ( #9857 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2024-11-08 19:59:22 -05:00
10b67d865d
[Bugfix] SymIntArrayRef expected to contain concrete integers ( #10170 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-08 14:44:18 -08:00
4f93dfe952
[torch.compile] Fuse RMSNorm with quant ( #9138 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-11-08 21:20:08 +00:00
e1b5a82179
Rename vllm.logging to vllm.logging_utils ( #10134 )
2024-11-08 20:53:24 +00:00
87713c6053
[CI/Build] Ignore .gitignored files for shellcheck ( #10162 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2024-11-08 19:53:36 +00:00
b5815c8413
[V1] Fix non-cudagraph op name ( #10166 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-08 10:23:04 -08:00
6b30471586
[Misc] Improve Web UI ( #10090 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-11-08 09:51:04 -08:00
f6778620a9
Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 ( #10136 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
2024-11-08 15:56:18 +00:00
0535e5fe6c
Fix edge case Mistral tokenizer ( #10152 )
2024-11-08 15:42:27 +00:00
b489fc3c91
[CI/Build] Update CPU tests to include all "standard" tests ( #5481 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 23:30:04 +08:00
208ce622c7
[V1]Enable APC by default only for text models ( #10148 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-08 14:39:41 +00:00
1ff4aed5bd
[Model] Expose size to Idefics3 as mm_processor_kwargs ( #10146 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-08 09:56:58 +00:00
f10797c0ce
[Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator ( #10144 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2024-11-08 09:41:03 +00:00
f4c2187e29
[Misc] Fix typo in #5895 ( #10145 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 09:07:01 +00:00
aea6ad629f
Add hf_transfer to testing image ( #10096 )
2024-11-08 08:35:25 +00:00
da07a9ead7
Fixes a typo about 'max_decode_seq_len' which causes crashes with cuda graph. ( #9285 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2024-11-08 05:31:28 +00:00
3a7f15a398
[Doc] Move CONTRIBUTING to docs site ( #9924 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-08 05:15:12 +00:00
7371749d54
[Misc] Fix ImportError causing by triton ( #9493 )
2024-11-08 05:08:51 +00:00
ad39bd640c
[Bugfix] Add error handling when server cannot respond any valid tokens ( #5895 )
2024-11-08 04:58:37 +00:00
40d0e7411d
[Doc] Update FAQ links in spec_decode.rst ( #9662 )
...
Signed-off-by: whyiug <whyiug@hotmail.com >
2024-11-08 04:44:58 +00:00
6bb52b0f97
[CI/Build] Give PR cleanup job PR write access ( #10139 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-08 12:10:20 +08:00
201fc07730
[V1] Prefix caching (take 2) ( #9972 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-11-07 17:34:44 -08:00
42b4f46b71
[V1] Add all_token_ids attribute to Request ( #10135 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-07 17:08:24 -08:00
073a472728
[Misc] report relevant env vars in collect_env.py tool ( #9293 )
2024-11-07 16:14:01 -08:00
93bff421bc
Bump actions/checkout from 4.2.1 to 4.2.2 ( #9746 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 21:44:58 +00:00
28b2877d30
Online video support for VLMs ( #10020 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-07 20:25:59 +00:00
97b8475beb
Bump actions/setup-python from 5.2.0 to 5.3.0 ( #9745 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 18:55:35 +00:00
a2f1f3b089
[CI/Build] Automate PR body text cleanup ( #10082 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 18:26:28 +00:00
3be5b26a76
[CI/Build] Add shell script linting using shellcheck ( #7925 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 18:17:29 +00:00
de0e61a323
[CI/Build] Always run mypy ( #10122 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 16:43:16 +00:00
9d43afcc53
[Feature] [Spec decode]: Combine chunked prefill with speculative decoding ( #9291 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2024-11-07 08:15:14 -08:00
ae62fd17c0
[Frontend] Tool calling parser for Granite 3.0 models ( #9027 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-11-07 07:09:02 -08:00
a62bc0109c
[Misc] Add Gamma-Distribution Request Generation Support for Serving Benchmark. ( #10105 )
...
Signed-off-by: Mozhou <spli161006@gmail.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-11-07 11:20:30 +00:00
999df95b4e
[Bugfix] Make image processor respect mm_processor_kwargs for Qwen2-VL ( #10112 )
...
Signed-off-by: Jiahao Li <liplus17@163.com >
2024-11-07 10:50:44 +00:00
a6f332d0d9
[Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target ( #10108 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-07 18:42:50 +08:00
0dfba97b42
[Frontend] Fix multiple values for keyword argument error ( #10075 ) ( #10076 )
...
Signed-off-by: Lei <ylxx@live.com >
2024-11-07 09:07:19 +00:00
aa9078fa03
Adds method to read the pooling types from model's files ( #9506 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2024-11-07 08:42:40 +00:00
e036e527a0
[CI/Build] Improve mypy + python version matrix ( #10041 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 07:54:16 +00:00
6192e9b8fe
[Core][Distributed] Refactor ipc buffer init in CustomAllreduce ( #10030 )
...
Signed-off-by: Hanzhi Zhou <hanzhi713@gmail.com >
2024-11-06 23:50:47 -08:00
d7263a1bb8
Doc: Improve benchmark documentation ( #9927 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-11-06 23:50:35 -08:00
104d729656
[CI/Build] re-add codespell to CI ( #10083 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 22:54:46 -08:00
db7db4aab9
[Misc] Consolidate ModelConfig code related to HF config ( #10104 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-07 06:00:21 +00:00
1fa020c539
[V1][BugFix] Fix Generator construction in greedy + seed case ( #10097 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2024-11-07 05:06:57 +00:00
e7b84c394d
[doc] add back Python 3.8 ABI ( #10100 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-06 21:06:41 -08:00
a4b3e0c1e9
[Hardware][CPU] Update torch 2.5 ( #9911 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-07 04:43:08 +00:00
29862b884b
[Frontend] Adjust try/except blocks in API impl ( #10056 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2024-11-06 20:07:51 -08:00
d3859f1891
[Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend ( #9823 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: yan ma <yan.ma@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2024-11-06 17:29:03 -08:00
4ab3256644
[Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12.4 ( #10095 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-07 00:54:13 +00:00
719c1ca468
[core][distributed] add stateless_init_process_group ( #10072 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-06 16:42:09 -08:00
74f2f8a0f1
[CI/Build] Always run the ruff workflow ( #10092 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 22:25:23 +00:00
d58268c56a
[V1] Make v1 more testable ( #9888 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-11-06 11:57:35 -08:00
87bd7e0515
[CI/Build] change conflict PR comment from mergify ( #10080 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 10:15:42 -08:00
098f94de42
[CI/Build] Drop Python 3.8 support ( #10038 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-06 14:31:01 +00:00
399c798608
Remove ScaledActivation for AWQ ( #10057 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-06 14:27:06 +00:00
406d4cc480
[Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration ( #10022 )
...
Signed-off-by: ericperfect <ericperfectttt@gmail.com >
2024-11-06 14:13:15 +00:00
a5bba7d234
[Model] Add Idefics3 support ( #9767 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: B-201 <Joy25810@foxmail.com >
2024-11-06 11:41:17 +00:00
2003cc3513
[Model][LoRA]LoRA support added for LlamaEmbeddingModel ( #10071 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-06 09:49:19 +00:00
6a585a23d2
[Hotfix] Fix ruff errors ( #10073 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-06 01:24:28 -08:00
a02a50e6e5
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend ( #6143 )
...
Signed-off-by: yuwenzho <yuwen.zhou@intel.com >
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
Signed-off-by: Bob Zhu <bob.zhu@intel.com >
Signed-off-by: zehao-intel <zehao.huang@intel.com >
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Sanju C Sudhakaran <scsudhakaran@habana.ai >
Co-authored-by: Michal Adamczyk <madamczyk@habana.ai >
Co-authored-by: Marceli Fylcek <mfylcek@habana.ai >
Co-authored-by: Himangshu Lahkar <49579433+hlahkar@users.noreply.github.com >
Co-authored-by: Vivek Goel <vgoel@habana.ai >
Co-authored-by: yuwenzho <yuwen.zhou@intel.com >
Co-authored-by: Dominika Olszewska <dolszewska@habana.ai >
Co-authored-by: barak goldberg <149692267+bgoldberg-habana@users.noreply.github.com >
Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com >
Co-authored-by: Jan Kaniecki <jkaniecki@habana.ai >
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyniewicz-habana@users.noreply.github.com >
Co-authored-by: Krzysztof Wisniewski <kwisniewski@habana.ai >
Co-authored-by: Dudi Lester <160421192+dudilester@users.noreply.github.com >
Co-authored-by: Ilia Taraban <tarabanil@gmail.com >
Co-authored-by: Chendi.Xue <chendi.xue@intel.com >
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai >
Co-authored-by: Jakub Maksymczuk <jmaksymczuk@habana.ai >
Co-authored-by: Tomasz Zielinski <85164140+tzielinski-habana@users.noreply.github.com >
Co-authored-by: Sun Choi <schoi@habana.ai >
Co-authored-by: Iryna Boiko <iboiko@habana.ai >
Co-authored-by: Bob Zhu <41610754+czhu15@users.noreply.github.com >
Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com >
Co-authored-by: Zehao Huang <zehao.huang@intel.com >
Co-authored-by: Andrzej Kotłowski <Andrzej.Kotlowski@intel.com >
Co-authored-by: Yan Tomsinsky <73292515+Yantom1@users.noreply.github.com >
Co-authored-by: Nir David <ndavid@habana.ai >
Co-authored-by: Yu-Zhou <yu.zhou@intel.com >
Co-authored-by: Ruheena Suhani Shaik <rsshaik@habana.ai >
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai >
Co-authored-by: Marcin Swiniarski <mswiniarski@habana.ai >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Jacek Czaja <jacek.czaja@intel.com >
Co-authored-by: Jacek Czaja <jczaja@habana.ai >
Co-authored-by: Yuan <yuan.zhou@outlook.com >
2024-11-06 01:09:10 -08:00
a5fda50a10
[CI/Build] Fix large_gpu_mark reason ( #10070 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-06 08:50:37 +00:00
21063c11c7
[CI/Build] drop support for Python 3.8 EOL ( #8464 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2024-11-06 07:11:55 +00:00
4be3a45158
[distributed] add function to create ipc buffers directly ( #10064 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 22:35:03 -08:00
4089985552
[V1] Integrate Piecewise CUDA graphs ( #10058 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-05 22:16:04 -08:00
9d59b75593
[Bugfix] Remove CustomChatCompletionContentPartParam multimodal input type ( #10054 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2024-11-06 05:13:09 +00:00
ea928f608c
[Bugfix] Gpt-j-6B patch kv_scale to k_scale path ( #10063 )
...
Signed-off-by: Alex Rakowski <alex.rakowski@amd.com >
Signed-off-by: Alex Rakowski <182798202+arakowsk-amd@users.noreply.github.com >
2024-11-06 05:10:40 +00:00
2bcbae704c
[Bugfix] Fix edge-case crash when using chat with the Mistral Tekken Tokenizer ( #10051 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-11-06 04:28:29 +00:00
ffc0f2b47a
[Model][OpenVINO] Fix regressions from #8346 ( #10045 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2024-11-06 04:19:15 +00:00
82bfc38d07
[Misc] Sort the list of embedding models ( #10037 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-06 04:05:05 +00:00
c4cacbaa7f
[v1] reduce graph capture time for piecewise cudagraph ( #10059 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 18:19:50 -08:00
0c63c34f72
[Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode ( #9730 )
...
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2024-11-06 01:45:45 +00:00
966e31697b
[Bugfix] Fix pickle of input when async output processing is on ( #9931 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-11-06 00:39:26 +00:00
43300bd98a
[Bugfix] Properly propagate trust_remote_code settings ( #10047 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2024-11-05 16:34:40 -08:00
ca9844b340
[bugfix] fix weak ref in piecewise cudagraph and tractable test ( #10048 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 14:49:20 -08:00
235366fe2e
[CI] Prune back the number of tests in tests/kernels/* ( #9932 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 16:02:32 -05:00
02462465ea
[CI] Prune tests/models/decoder_only/language/* tests ( #9940 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 16:02:23 -05:00
b9c64c0ca7
[Misc] Modify BNB parameter name ( #9997 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-05 14:40:08 -05:00
d2e80332a7
[Feature] Update benchmark_throughput.py to support image input ( #9851 )
...
Signed-off-by: Linkun Chen <github+anyscale@lkchen.net >
Co-authored-by: Linkun Chen <github+anyscale@lkchen.net >
2024-11-05 19:30:02 +00:00
a53046b16f
[Model] Support quantization of PixtralHFTransformer for PixtralHF ( #9921 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 10:42:20 -08:00
731aec5be7
[CI/Build] Limit github CI jobs based on files changed ( #9928 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-05 10:30:42 -08:00
09d3550372
[Misc] Add logging for CUDA memory ( #10027 )
...
Signed-off-by: Chenghao Yang <yangalan1996@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Chenghao Yang <yangalan1996@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-05 09:50:50 -08:00
cd34029e91
Refactor TPU requirements file and pin build dependencies ( #10010 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
2024-11-05 16:48:44 +00:00
5952d81139
[Frontend] Fix tcp port reservation for api server ( #10012 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-05 07:50:57 -08:00
93dee88f6b
[Misc] vllm CLI flags should be ordered for better user readability ( #10017 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-05 18:59:56 +08:00
7a83b1aec0
[BugFix] Lazy import ray ( #10021 )
2024-11-05 10:04:10 +00:00
ad23318928
[Bugfix] Fixup Mamba ( #10004 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-05 03:46:38 +00:00
bbc3619dc8
[Core] Make encoder-decoder inputs a nested structure to be more composable ( #9604 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-05 10:07:31 +08:00
04bbf38e05
[Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep ( #9994 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-05 01:08:21 +00:00
8f0a9ca890
[Bugfix] Respect modules_to_not_convert within awq_marlin ( #9895 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-04 16:57:44 -07:00
2094062b4e
[4.5/N] bugfix for quant config in speculative decode ( #10007 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-04 15:11:59 -08:00
d93478b399
[Bugfix] Upgrade to pytorch 2.5.1 ( #10001 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-04 15:11:28 -08:00
ac04a97a9f
[Frontend] Add max_tokens prometheus metric ( #9881 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com >
2024-11-04 22:53:24 +00:00
9a5664d4a4
[Misc] Refactor benchmark_throughput.py ( #9779 )
...
Signed-off-by: Linkun Chen <github+anyscale@lkchen.net >
Co-authored-by: Linkun Chen <lkchen@github.com >
Co-authored-by: Linkun Chen <github+anyscale@lkchen.net >
2024-11-04 14:32:16 -08:00
04cef2c6ab
[Bugfix] Fix MQLLMEngine hanging ( #9973 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-11-04 16:01:43 -05:00
6e056bcf04
[Doc] Update VLM doc about loading from local files ( #9999 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-04 19:47:11 +00:00
5208dc7a20
[Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs ( #9279 )
...
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com >
2024-11-04 11:37:46 -08:00
1c45f4c385
[CI] Basic Integration Test For TPU ( #9968 )
...
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
2024-11-04 11:34:26 -08:00
603a661ae8
[Model] factoring out MambaMixer out of Jamba ( #8993 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-11-04 18:00:00 +00:00
fb2716d641
[Misc] Reduce BNB static variable ( #9987 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-04 17:04:40 +00:00
8d72bb20fa
[4/N] make quant config first-class citizen ( #9978 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-04 08:51:31 -08:00
ac6b8f19b9
[Frontend] Multi-Modality Support for Loading Local Image Files ( #9915 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-04 15:34:57 +00:00
ccb5376a9a
[Bugfix][OpenVINO] Fix circular reference #9939 ( #9974 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-04 18:14:13 +08:00
ea4adeddc1
[Bugfix] Fix E2EL mean and median stats ( #9984 )
...
Signed-off-by: daitran2k1 <tranquangdai7a@gmail.com >
2024-11-04 09:37:58 +00:00
4dbcbbeb09
[Misc] Compute query_start_loc/seq_start_loc on CPU ( #9447 )
...
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com >
2024-11-04 08:54:37 +00:00
b67feb1274
[Bugfix] Using the correct type hints ( #9885 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-11-04 06:19:51 +00:00
c49f0407ba
[Bugfix] Fix MiniCPMV and Mllama BNB bug ( #9917 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-04 03:36:41 +00:00
91c9ebbb1b
[V1] Fix Configs ( #9971 )
2024-11-04 00:24:40 +00:00
54597724f4
[Model] Add support for H2OVL-Mississippi models ( #9747 )
...
Signed-off-by: Shanshan Wang <shanshan.wang@h2o.ai >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-11-04 00:15:36 +00:00
1f1b6d6eda
[V1] Support per-request seed ( #9945 )
...
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
2024-11-03 09:14:17 -08:00
3bb4befea7
[bugfix] fix tests ( #9959 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 15:54:05 -07:00
ae5279a163
[torch.compile] Adding torch compile to vision-language models ( #9946 )
2024-11-02 12:56:05 -07:00
1b73ab2a1f
[CI/Build] Quoting around > ( #9956 )
2024-11-02 12:50:28 -07:00
cea808f325
[3/N] model runner pass the whole config to model ( #9958 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 12:08:49 -07:00
74b529ceee
[bugfix] fix chatglm dummy_data_for_glmv ( #9955 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 08:03:33 -07:00
d6459b4516
[V1] Fix EngineArgs refactor on V1 ( #9954 )
2024-11-02 07:44:38 -07:00
e893795443
[2/N] executor pass the complete config to worker/modelrunner ( #9938 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2024-11-02 07:35:05 -07:00
1d4cfe2be1
[Doc] Updated tpu-installation.rst with more details ( #9926 )
...
Signed-off-by: Michael Green <mikegre@google.com >
2024-11-02 10:06:45 -04:00
eed92f12fc
[Docs] Update Granite 3.0 models in supported models table ( #9930 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-02 09:02:18 +00:00
af7380d83b
[torch.compile] fix cpu broken code ( #9947 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 23:35:47 -07:00
a78dd3303e
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder models ( #9559 )
2024-11-01 23:22:49 -07:00
d522034c85
[ci/build] Have dependabot ignore pinned dependencies ( #9935 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-11-01 23:56:13 +00:00
6c0b7f548d
[Core][VLM] Add precise multi-modal placeholder tracking ( #8346 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2024-11-01 16:21:10 -07:00
d151fde834
[ci/build] Bump the patch-update group with 10 updates ( #9897 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
2024-11-01 23:04:42 +00:00
27cd36e6e2
[Bugfix] PicklingError on RayTaskError ( #9934 )
...
Signed-off-by: Gene Su <e870252314@gmail.com >
2024-11-01 22:08:23 +00:00
18bd7587b7
[1/N] pass the complete config from engine to executor ( #9933 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 13:51:57 -07:00
598b6d7b07
[Bugfix/Core] Flashinfer k_scale and v_scale ( #9861 )
2024-11-01 12:15:05 -07:00
aff1fd8188
[torch.compile] use interpreter with stable api from pytorch ( #9889 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 11:50:37 -07:00
4581d2cc02
[Core] Refactor: Clean up unused argument in Scheduler._preempt ( #9696 )
...
Signed-off-by: André Jonasson <andre.jonasson@gmail.com >
2024-11-01 11:41:38 -07:00
1dd4cb2935
[Bugfix] Fix edge cases for MistralTokenizer ( #9625 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
2024-11-01 10:33:15 -07:00
ba0d892074
[Frontend] Use a proper chat template for VLM2Vec ( #9912 )
2024-11-01 14:09:07 +00:00
30a2e80742
[CI/Build] Add Model Tests for PixtralHF ( #9813 )
2024-11-01 07:55:29 -06:00
06386a64dd
[Frontend] Chat-based Embeddings API ( #9759 )
2024-11-01 08:13:35 +00:00
d3aa2a8b2f
[Doc] Update multi-input support ( #9906 )
2024-11-01 07:34:49 +00:00
2b5bf20988
[torch.compile] Adding torch compile annotations to some models ( #9876 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-01 00:25:47 -07:00
93a76dd21d
[Model] Support bitsandbytes for MiniCPMV ( #9891 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-01 13:31:56 +08:00
566cd27797
[torch.compile] rework test plans ( #9866 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-31 22:20:17 -07:00
37a4947dcd
[Bugfix] Fix layer skip logic with bitsandbytes ( #9887 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-01 13:12:44 +08:00
96e0c9cbbd
[torch.compile] directly register custom op ( #9896 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-31 21:56:09 -07:00
031a7995f3
[Bugfix][Frontend] Reject guided decoding in multistep mode ( #9892 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-11-01 01:09:46 +00:00
b63c64d95b
[ci/build] Configure dependabot to update pip dependencies ( #9811 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-31 15:55:38 -07:00
9fb12f7848
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 ( #9838 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-10-31 20:06:25 +00:00
55650c83a0
[Bugfix] Fix illegal memory access error with chunked prefill, prefix caching, block manager v2 and xformers enabled together ( #9532 )
...
Signed-off-by: sasha0552 <admin@sasha0552.org >
2024-10-31 11:46:36 -07:00
77f7ef2908
[CI/Build] Adding a forced docker system prune to clean up space ( #9849 )
2024-11-01 01:02:58 +08:00
16b8f7a86f
[CI/Build] Add Model Tests for Qwen2-VL ( #9846 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-31 09:10:52 -07:00
5608e611c2
[Doc] Update Qwen documentation ( #9869 )
2024-10-31 08:54:18 +00:00
3ea2dc2ec4
[Misc] Remove deprecated arg for cuda graph capture ( #9864 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-10-31 07:22:07 +00:00
d087bf863e
[Model] Support quantization of Qwen2VisionTransformer ( #9817 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-30 22:41:20 -07:00
890ca36072
Revert "[Bugfix] Use host argument to bind to interface ( #9798 )" ( #9852 )
2024-10-31 01:44:51 +00:00
abbfb6134d
[Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint ( #9837 )
2024-10-30 18:15:56 -07:00
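A minimal usage sketch for the field rename above, assuming a vLLM OpenAI-compatible server and the official openai Python client; the base URL and model name are placeholders.

    from openai import OpenAI

    # Placeholder endpoint and model name; adjust for your deployment.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="my-served-model",  # placeholder
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
        max_completion_tokens=64,  # preferred over the deprecated max_tokens
    )
    print(resp.choices[0].message.content)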
64384bbcdf
[torch.compile] upgrade tests ( #9858 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-30 16:34:22 -07:00
00d91c8a2c
[CI/Build] Simplify exception trace in api server tests ( #9787 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-30 14:52:05 -07:00
c2cd1a2142
[doc] update pp support ( #9853 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-30 13:36:51 -07:00
c787f2d81d
[Neuron] Update Dockerfile.neuron to fix build failure ( #9822 )
2024-10-30 12:22:02 -07:00
33d257735f
[Doc] link bug for multistep guided decoding ( #9843 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-30 17:28:29 +00:00
3b3f1e7436
[Bugfix][core] replace heartbeat with pid check ( #9818 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-30 09:34:07 -07:00
9ff4511e43
[Misc] Add chunked-prefill support on FlashInfer. ( #9781 )
2024-10-30 09:33:53 -07:00
81f09cfd80
[Model] Support math-shepherd-mistral-7b-prm model ( #9697 )
...
Signed-off-by: Went-Liang <wenteng_liang@163.com >
2024-10-30 09:33:42 -07:00
cc98f1e079
[CI/Build] VLM Test Consolidation ( #9372 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-30 09:32:17 -07:00
211fe91aa8
[TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA ( #9438 )
2024-10-30 09:41:38 +00:00
6aa6020f9b
[Misc] Specify minimum pynvml version ( #9827 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-29 23:05:43 -07:00
ff5ed6e1bc
[torch.compile] rework compile control with piecewise cudagraph ( #9715 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-29 23:03:49 -07:00
7b0365efef
[Doc] Add the DCO to CONTRIBUTING.md ( #9803 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-30 05:22:23 +00:00
04a3ae0aca
[Bugfix] Fix multi nodes TP+PP for XPU ( #8884 )
...
Signed-off-by: YiSheng5 <syhm@mail.ustc.edu.cn >
Signed-off-by: yan ma <yan.ma@intel.com >
Co-authored-by: YiSheng5 <syhm@mail.ustc.edu.cn >
2024-10-29 21:34:45 -07:00
62fac4b9aa
[ci/build] Pin CI dependencies version with pip-compile ( #9810 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-30 03:34:55 +00:00
226688bd61
[Bugfix][VLM] Make apply_fp8_linear work with >2D input ( #9812 )
2024-10-29 19:49:44 -07:00
64cb1cdc3f
Update README.md ( #9819 )
2024-10-29 17:28:43 -07:00
1ab6f6b4ad
[core][distributed] fix custom allreduce in pytorch 2.5 ( #9815 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-29 17:06:24 -07:00
bc73e9821c
[Bugfix] Fix prefix strings for quantized VLMs ( #9772 )
2024-10-29 16:02:59 -07:00
8d7724104a
[Docs] Add notes about Snowflake Meetup ( #9814 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-10-29 15:19:02 -07:00
882a1ad0de
[Model] tool calling support for ibm-granite/granite-20b-functioncalling ( #8339 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com >
2024-10-29 15:07:37 -07:00
67bdf8e523
[Bugfix][Frontend] Guard against bad token ids ( #9634 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-29 14:13:20 -07:00
0ad216f575
[MISC] Set label value to timestamp over 0, to keep track of recent history ( #9777 )
...
Signed-off-by: Kunjan Patel <kunjanp@google.com >
2024-10-29 19:52:19 +00:00
7585ec996f
[CI/Build] mergify: fix rules for ci/build label ( #9804 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-29 19:24:42 +00:00
ab6f981671
[CI][Bugfix] Skip chameleon for transformers 4.46.1 ( #9808 )
2024-10-29 11:12:43 -07:00
ac3d748dba
[Model] Add LlamaEmbeddingModel as an embedding Implementation of LlamaModel ( #9806 )
2024-10-29 10:40:35 -07:00
0ce7798f44
[Misc]: Typo fix: Renaming classes (casualLM -> causalLM) ( #9801 )
...
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com >
2024-10-29 10:39:20 -07:00
0f43387157
[Bugfix] Use host argument to bind to interface ( #9798 )
2024-10-29 10:37:59 -07:00
08600ddc68
Fix the log to correctly guide the user to install modelscope ( #9793 )
...
Signed-off-by: yuze.zyz <yuze.zyz@alibaba-inc.com >
2024-10-29 10:36:59 -07:00
74fc2d77ae
[Misc] Add metrics for request queue time, forward time, and execute time ( #9659 )
2024-10-29 10:32:56 -07:00
622b7ab955
[Hardware] using current_platform.seed_everything ( #9785 )
...
Signed-off-by: wangshuai09 <391746016@qq.com >
2024-10-29 14:47:44 +00:00
09500f7dde
[Model] Add BNB quantization support for Mllama ( #9720 )
2024-10-29 08:20:02 -04:00
ef7865b4f9
[Frontend] re-enable multi-modality input in the new beam search implementation ( #9427 )
...
Signed-off-by: Qishuai <Ferdinandzhong@gmail.com >
2024-10-29 11:49:47 +00:00
eae3d48181
[Bugfix] Use temporary directory in registry ( #9721 )
2024-10-28 22:08:20 -07:00
e74f2d448c
[Doc] Specify async engine args in docs ( #9726 )
2024-10-28 22:07:57 -07:00
7a4df5f200
[Model][LoRA]LoRA support added for Qwen ( #9622 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-29 04:14:07 +00:00
c5d7fb9ddc
[Doc] fix third-party model example ( #9771 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-28 19:39:21 -07:00
76ed5340f0
[torch.compile] add deepseek v2 compile ( #9775 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-28 14:35:17 -07:00
97b61bfae6
[misc] avoid circular import ( #9765 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-28 20:51:23 +00:00
aa0addb397
Adding "torch compile" annotations to moe models ( #9758 )
2024-10-28 13:49:56 -07:00
5f8d8075f9
[Model][VLM] Add multi-video support for LLaVA-Onevision ( #8905 )
...
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-28 18:04:10 +00:00
8b0e4f2ad7
[CI/Build] Adopt Mergify for auto-labeling PRs ( #9259 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-28 09:38:09 -07:00
2adb4409e0
[Bugfix] Fix ray instance detect issue ( #9439 )
2024-10-28 07:13:03 +00:00
feb92fbe4a
Fix beam search eos ( #9627 )
2024-10-28 06:59:37 +00:00
32176fee73
[torch.compile] support moe models ( #9632 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-27 21:58:04 -07:00
4e2d95e372
[Hardware][ROCM] using current_platform.is_rocm ( #9642 )
...
Signed-off-by: wangshuai09 <391746016@qq.com >
2024-10-28 04:07:00 +00:00
34a9941620
[Bugfix] Fix load config when using bools ( #9533 )
2024-10-27 13:46:41 -04:00
e130c40e4e
Fix cache management in "Close inactive issues and PRs" actions workflow ( #9734 )
2024-10-27 10:30:03 -07:00
3cb07a36a2
[Misc] Upgrade to pytorch 2.5 ( #9588 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-27 09:44:24 +00:00
8549c82660
[core] cudagraph output with tensor weak reference ( #9724 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-27 00:19:28 -07:00
67a6882da4
[Misc] SpecDecodeWorker supports profiling ( #9719 )
...
Signed-off-by: Abatom <abatom@163.com >
2024-10-27 04:18:03 +00:00
6650e6a930
[Model] Add classification Task with Qwen2ForSequenceClassification ( #9704 )
...
Signed-off-by: Kevin-Yang <ykcha9@gmail.com >
Co-authored-by: Kevin-Yang <ykcha9@gmail.com >
2024-10-26 17:53:35 +00:00
07e981fdf4
[Frontend] Bad words sampling parameter ( #9717 )
...
Signed-off-by: Vasily Alexeev <alvasian@yandex.ru >
2024-10-26 16:29:38 +00:00
55137e8ee3
Fix: MI100 Support By Bypassing Custom Paged Attention ( #9560 )
2024-10-26 12:12:57 +00:00
5cbdccd151
[Hardware][openvino] is_openvino --> current_platform.is_openvino ( #9716 )
2024-10-26 10:59:06 +00:00
067e77f9a8
[Bugfix] Streaming continuous_usage_stats defaults to False ( #9709 )
...
Signed-off-by: Sam Stoelinga <sammiestoel@gmail.com >
2024-10-26 05:05:47 +00:00
6567e13724
[Bugfix] Fix crash with llama 3.2 vision models and guided decoding ( #9631 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: pavlo-ruban <pavlo.ruban@servicenow.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-25 15:42:56 -07:00
228cfbd03f
[Doc] Improve quickstart documentation ( #9256 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-25 14:32:10 -07:00
ca0d92227e
[Bugfix] Fix compressed_tensors_moe bad config.strategy ( #9677 )
2024-10-25 12:40:33 -07:00
9645b9f646
[V1] Support sliding window attention ( #9679 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-10-24 22:20:37 -07:00
a6f3721861
[Model] add a lora module for granite 3.0 MoE models ( #9673 )
2024-10-24 22:00:17 -07:00
9f7b4ba865
[ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #9675 ( #9676 )
2024-10-24 20:59:00 -07:00
c91ed47c43
[Bugfix] Remove xformers requirement for Pixtral ( #9597 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 15:38:05 -07:00
59449095ab
[Performance][Kernel] Fused_moe Performance Improvement ( #9384 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2024-10-24 15:37:52 -07:00
e26d37a185
[Log][Bugfix] Fix default value check for image_url.detail ( #9663 )
2024-10-24 10:44:38 -07:00
722d46edb9
[Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints ( #9650 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-24 10:42:24 -07:00
c866e0079d
[CI/Build] Fix VLM test failures when using transformers v4.46 ( #9666 )
2024-10-25 01:40:40 +08:00
d27cfbf791
[torch.compile] Adding torch compile annotations to some models ( #9641 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 09:31:42 -07:00
de662d32b5
Increase operation per run limit for "Close inactive issues and PRs" workflow ( #9661 )
...
Signed-off-by: Harry Mellor <hej.mellor@gmail.com >
2024-10-24 12:17:45 -04:00
f58454968f
[Bugfix] Disable the post_norm layer of the vision encoder for LLaVA models ( #9653 )
2024-10-24 07:52:07 -07:00
b979143d5b
[Doc] Move additional tips/notes to the top ( #9647 )
2024-10-24 09:43:59 +00:00
ad6f78053e
[torch.compile] expanding support and fix allgather compilation ( #9637 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 01:32:15 -07:00
295a061fb3
[Kernel] add kernel for FATReLU ( #9610 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-24 16:18:27 +08:00
8a02cd045a
[torch.compile] Adding torch compile annotations to some models ( #9639 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 00:54:57 -07:00
4fdc581f9e
[core] simplify seq group code ( #9569 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-10-24 00:16:44 -07:00
3770071eb4
[V1][Bugfix] Clean up requests when aborted ( #9629 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-10-23 23:33:22 -07:00
836e8ef6ee
[Bugfix] Fix PP for ChatGLM and Molmo ( #9422 )
2024-10-24 06:12:05 +00:00
056a68c7db
[XPU] avoid triton import for xpu ( #9440 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-24 05:14:00 +00:00
33bab41060
[Bugfix]: Make chat content text allow type content ( #9358 )
...
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
2024-10-24 05:05:49 +00:00
b7df53cd42
[Bugfix] Use "vision_model" prefix for MllamaVisionModel ( #9628 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 10:07:44 +08:00
bb01f2915e
[Bugfix][Model] Fix Mllama SDPA illegal memory access for batched multi-image ( #9626 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 10:03:44 +08:00
b548d7a5f4
[CI/Build] Add bot to close stale issues and PRs ( #9436 )
2024-10-23 15:45:26 -07:00
fc6c274626
[Model] Add Qwen2-Audio model support ( #9248 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-23 17:54:22 +00:00
150b779081
[Frontend] Enable Online Multi-image Support for MLlama ( #9393 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-23 17:28:57 +00:00
9013e24f7b
[torch.compile] Adding torch compile annotations to some models ( #9614 )
2024-10-23 10:07:48 -07:00
fd0e2cfdb2
[Misc] Separate total and output tokens in benchmark_throughput.py ( #8914 )
2024-10-23 16:47:20 +00:00
e5ac6a4199
[Bugfix] Fix divide by zero when serving Mamba models ( #9617 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-10-23 16:40:43 +00:00
dbdd3b5e5a
[misc] comment to avoid future confusion about baichuan ( #9620 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-23 09:14:44 -07:00
e7116c017c
[Bugfix] Fix _init_vision_model in NVLM_D model ( #9611 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-23 14:09:04 +00:00
31a08f5bd2
[Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs ( #9612 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-23 14:05:18 +00:00
c18e1a3418
[VLM] Enable overriding whether post layernorm is used in vision encoder + fix quant args ( #9217 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-23 11:27:37 +00:00
3ff57ebfca
[Model] Initialize Florence-2 language backbone support ( #9555 )
2024-10-23 10:42:47 +00:00
2394962d70
[Hardware][XPU] using current_platform.is_xpu ( #9605 )
2024-10-23 08:28:21 +00:00
51c24c9736
[Build] Fix FetchContent multiple build issue ( #9596 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2024-10-23 12:43:07 +08:00
831540cf04
[Model] Support E5-V ( #9576 )
2024-10-23 11:35:29 +08:00
29061ed9df
[Misc] Add an env var VLLM_LOGGING_PREFIX; if set, it will be prepended to all logging messages ( #9590 )
2024-10-23 11:17:28 +08:00
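A hedged illustration of the env var above: per the commit message the value is prepended to vLLM's log messages; the prefix string is arbitrary, and it is assumed the variable must be set before vLLM initializes its logger.

    import os

    # Assumption: the prefix is read when vLLM's logger is initialized,
    # so set it before importing vllm.
    os.environ["VLLM_LOGGING_PREFIX"] = "[node-0] "

    from vllm import LLM  # subsequent vLLM log lines should start with "[node-0] "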
65050a40e6
[Bugfix] Generate exactly input_len tokens in benchmark_throughput ( #9592 )
2024-10-22 17:45:35 -07:00
208cb34c81
[Doc]: Update tensorizer docs to include vllm[tensorizer] ( #7889 )
...
Co-authored-by: Kaunil Dhruv <dhruv.kaunil@gmail.com >
2024-10-22 15:43:25 -07:00
b17046e298
[BugFix] Fix metrics error for --num-scheduler-steps > 1 ( #8234 )
2024-10-22 15:43:03 -07:00
d1e8240875
[Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing ( #9487 )
2024-10-22 15:41:13 -07:00
cb6fdaa0a0
[Misc] Make benchmarks use EngineArgs ( #9529 )
2024-10-22 15:40:38 -07:00
23b899a8e6
[Bugfix] fix detokenizer shallow copy ( #5919 )
2024-10-22 15:38:12 -07:00
17c79f3c36
[torch.compile] auto infer dynamic_arg_dims from type annotation ( #9589 )
2024-10-22 13:43:37 -07:00
cd5601ac37
[BugFix] Prevent exporting duplicate OpenTelemetry spans ( #9017 )
2024-10-22 11:11:53 -07:00
434984e665
[Frontend] Support custom request_id from request ( #9550 )
...
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com >
2024-10-22 18:07:30 +00:00
32a1ee74a0
[Hardware][Intel CPU][DOC] Update docs for CPU backend ( #6212 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Gubrud, Aaron D <aaron.d.gubrud@intel.com >
Co-authored-by: adgubrud <96072084+adgubrud@users.noreply.github.com >
2024-10-22 10:38:04 -07:00
08075c3448
[Bugfix] Eagle: change config name for fc bias ( #9580 )
2024-10-22 16:14:22 +00:00
bb392ea2d2
[Model][VLM] Initialize support for Mono-InternVL model ( #9528 )
2024-10-22 16:01:46 +00:00
9dbcce84a7
[Neuron] [Bugfix] Fix neuron startup ( #9374 )
...
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-10-22 12:51:41 +00:00
a48e3ec052
[CI/Build][LoRA] Temporarily fix long context failure issue ( #9579 )
2024-10-22 11:32:51 +00:00
6c5af09b39
[V1] Implement vLLM V1 [1/N] ( #9289 )
2024-10-22 01:24:07 -07:00
3ddbe25502
[Hardware][CPU] using current_platform.is_cpu ( #9536 )
2024-10-22 00:50:43 -07:00
0d02747f2e
support TP in qwen2 bnb ( #9574 )
2024-10-22 07:13:23 +00:00
f7db5f0fa9
[Doc] Use shell code-blocks and fix section headers ( #9508 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-22 06:43:24 +00:00
ca30c3c84b
[Core] Remove evictor_v1 ( #9572 )
2024-10-22 04:55:49 +00:00
c0292211ce
[CI/Build] Replaced some models on tests for smaller ones ( #9570 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-22 04:52:14 +00:00
74692421f7
[Bugfix]: phi.py get rope_theta from config file ( #9503 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-22 02:53:36 +00:00
29acd2c34c
[Bugfix][OpenVINO] fix_dockerfile_openvino ( #9552 )
2024-10-21 19:47:52 -07:00
f085995a7b
[CI/Build] Remove unnecessary fork_new_process ( #9484 )
2024-10-21 19:47:29 -07:00
b729901139
[Bugfix]: serialize config by value for --trust-remote-code ( #6751 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-21 19:46:24 -07:00
76a5e13270
[core] move parallel sampling out from vllm core ( #9302 )
2024-10-22 00:31:44 +00:00
ef7faad1b8
🐛 Fixup more test failures from memory profiling ( #9563 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-21 17:10:56 -07:00
575dcebe9a
[CI] Make format checker error message more user-friendly by using emoji ( #9564 )
...
This PR makes the format checker's error messages more user-friendly by adding emojis.
2024-10-21 23:45:15 +00:00
711f3a7806
[Frontend] Don't log duplicate error stacktrace for every request in the batch ( #9023 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-21 14:49:41 -07:00
15713e3b75
[BugFix] Update draft model TP size check to allow matching target TP size ( #9394 )
...
Co-authored-by: Baoyuan Qi <qibaoyuan@126.com >
2024-10-21 14:14:29 -07:00
d621c43df7
[doc] fix format ( #9562 )
2024-10-21 13:54:57 -07:00
9d9186be97
[Frontend] Reduce frequency of client cancellation checking ( #7959 )
2024-10-21 13:28:10 -07:00
5241aa1494
[Model][Bugfix] Fix batching with multi-image in PixtralHF ( #9518 )
2024-10-21 14:20:07 -04:00
ec6bd6c4c6
[BugFix] Use correct python3 binary in Docker.ppc64le entrypoint ( #9492 )
...
Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com >
2024-10-21 17:43:02 +00:00
8ca8954841
[Bugfix][Misc]: fix graph capture for decoder ( #9549 )
2024-10-21 17:33:30 +00:00
f6b97293aa
[Model] FalconMamba Support ( #9325 )
2024-10-21 12:50:16 -04:00
496e991da8
[Doc] Consistent naming of attention backends ( #9498 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-10-21 22:29:57 +08:00
696b01af8f
[CI/Build] Split up decoder-only LM tests ( #9488 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-20 21:27:50 -07:00
855e0e6f97
[Frontend][Misc] Goodput metric support ( #9338 )
2024-10-20 18:39:32 +00:00
4fa3e33349
[Kernel] Support sliding window in flash attention backend ( #9403 )
2024-10-20 10:57:52 -07:00
962d2c6349
[Model][Pixtral] Use memory_efficient_attention for PixtralHFVision ( #9520 )
2024-10-20 05:29:14 +00:00
5b59fe0f08
[Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger ( #9530 )
2024-10-20 00:05:02 +00:00
8e3e7f2713
[Model][Pixtral] Optimizations for input_processor_for_pixtral_hf ( #9514 )
2024-10-19 10:44:29 -04:00
263d8ee150
[Bugfix] Fix missing task for speculative decoding ( #9524 )
2024-10-19 06:49:40 +00:00
c5eea3c8ba
[Frontend] Support simpler image input format ( #9478 )
2024-10-18 23:17:07 -07:00
85dc92fc98
[CI/Build] Configure matcher for actionlint workflow ( #9511 )
...
Signed-off-by: Russell Bryant <russell.bryant@gmail.com >
2024-10-19 06:04:18 +00:00
dfd951ed9b
[CI/Build] Add error matching for ruff output ( #9513 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-19 05:42:20 +00:00
82c25151ec
[Doc] update gpu-memory-utilization flag docs ( #9507 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-19 11:26:36 +08:00
1325872ec8
[Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily ( #9521 )
2024-10-18 20:21:01 -07:00
380e18639f
🐛 fix torch memory profiling ( #9516 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-18 21:25:19 -04:00
337ed76671
[Bugfix] Fix offline mode when using mistral_common ( #9457 )
2024-10-18 18:12:32 -07:00
0c9a5258f9
[Kernel] Add env variable to force flashinfer backend to enable tensor cores ( #9497 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-10-18 17:55:48 -07:00
d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py ( #9510 )
2024-10-18 14:30:55 -07:00
9bb10a7d27
[MISC] Add lora requests to metrics ( #9477 )
...
Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal>
2024-10-18 20:50:18 +00:00
3921a2f29e
[Model] Support Pixtral models in the HF Transformers format ( #9036 )
2024-10-18 13:29:56 -06:00
67a7e5ef38
[CI/Build] Add error matching config for mypy ( #9512 )
2024-10-18 12:17:53 -07:00
051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding ( #9424 )
2024-10-18 11:31:58 -07:00
7dbe738d65
[Misc] benchmark: Add option to set max concurrency ( #9390 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-18 11:15:28 -07:00
ae8b633ba3
[Bugfix] Fix offline_inference_with_prefix.py ( #9505 )
2024-10-18 16:59:19 +00:00
1bbbcc0b1d
[CI/Build] Fix lint errors in mistral tokenizer ( #9504 )
2024-10-19 00:09:35 +08:00
25aeb7d4c9
[BugFix] Fix and simplify completion API usage streaming ( #9475 )
2024-10-18 14:10:26 +00:00
d2b1bf55ec
[Frontend][Feature] Add jamba tool parser ( #9154 )
2024-10-18 10:27:48 +00:00
1ffc8a7362
[BugFix] Typing fixes to RequestOutput.prompt and beam search ( #9473 )
2024-10-18 07:19:53 +00:00
944dd8edaf
[CI/Build] Use commit hash references for github actions ( #9430 )
2024-10-17 21:54:58 -07:00
154a8ae880
[Qwen2.5] Support bnb quant for Qwen2.5 ( #9467 )
2024-10-18 04:40:14 +00:00
de4008e2ab
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage ( #9352 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-17 22:47:27 -04:00
48138a8415
[BugFix] Stop silent failures on compressed-tensors parsing ( #9381 )
2024-10-17 18:54:00 -07:00
343f8e0905
Support BERTModel (first encoder-only embedding model) ( #9056 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com >
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: laishzh <laishengzhang@gmail.com >
Co-authored-by: Max de Bayser <maxdebayser@gmail.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-10-17 23:21:01 +00:00
bb76538bbd
[Hardware][Neuron] Simplify model load for transformers-neuronx library ( #9380 )
2024-10-17 15:39:39 -07:00
d615b5c9f8
[Bugfix] Print warnings related to mistral_common tokenizer only once ( #9468 )
2024-10-17 21:44:20 +00:00
d65049daab
[Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script ( #9013 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-17 21:11:11 +00:00
eca2c5f7c0
[Bugfix] Fix support for dimension like integers and ScalarType ( #9299 )
2024-10-17 19:08:34 +00:00
0f41fbe5a3
[torch.compile] Fine-grained CustomOp enabling mechanism ( #9300 )
2024-10-17 18:36:37 +00:00
7871659abb
[Misc] Remove commit id file ( #9470 )
2024-10-17 10:34:37 -07:00
a2c71c5405
[CI/Build] remove .github from .dockerignore, add dirty repo check ( #9375 )
2024-10-17 10:25:06 -07:00
81ede99ca4
[Core] Deprecate block manager v1 and make block manager v2 the default ( #8704 )
...
Removing block manager v1. This is the initial piece of the prefix-caching-centric design: to achieve it, we need to simplify the code path so that only the v2 block manager (which has much higher performance on prefix caching) is used.
2024-10-17 11:38:15 -05:00
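A minimal sketch of what the v2-default path looks like from the user side, assuming the offline LLM API of this release; the model name is a placeholder, and enable_prefix_caching is the option served by the v2 block manager.

    from vllm import LLM, SamplingParams

    llm = LLM(
        model="facebook/opt-125m",    # placeholder model
        enable_prefix_caching=True,   # prefix caching now runs on the v2 block manager
    )

    params = SamplingParams(max_tokens=32)
    shared = "You are a concise assistant. "
    outputs = llm.generate(
        [shared + "Explain KV-cache reuse.", shared + "Explain paged attention."],
        params,
    )
    for out in outputs:
        print(out.outputs[0].text)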
5eda21e773
[Hardware][CPU] compressed-tensor INT8 W8A8 AZP support ( #9344 )
2024-10-17 12:21:04 -04:00
8e1cddcd44
[TPU] Call torch._sync(param) during weight loading ( #9437 )
2024-10-17 09:00:11 -07:00
5e443b594f
[Bugfix] Allow prefill of assistant response when using mistral_common ( #9446 )
2024-10-17 15:06:37 +00:00
9d30a056e7
[misc] CUDA Time Layerwise Profiler ( #8337 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-10-17 10:36:09 -04:00
390be74649
[Misc] Print stack trace using logger.exception ( #9461 )
2024-10-17 13:55:48 +00:00
e312e52b44
[Kernel] Add Exllama as a backend for compressed-tensors ( #9395 )
2024-10-17 09:48:26 -04:00
dbfa8d31d5
Add notes on the use of Slack ( #9442 )
2024-10-17 04:46:46 +00:00
92d86da217
[BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels ( #9391 )
2024-10-17 01:34:06 +00:00
c3fab5f769
[Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel ( #9425 )
2024-10-16 23:46:06 +00:00
776dbd74f1
[CI/Build] mypy: Resolve some errors from checking vllm/engine ( #9267 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-16 22:55:59 +00:00
8345045833
[Performance][Spec Decode] Optimize ngram lookup performance ( #9333 )
2024-10-16 13:37:45 -06:00
5b8a1fde84
[Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft ( #9396 )
2024-10-16 16:40:24 +00:00
fb60ae9b91
[Kernel][Model] Improve continuous batching for Jamba and Mamba ( #9189 )
2024-10-16 12:12:43 -04:00
415f76a9cb
Support mistral interleaved attn ( #9414 )
2024-10-16 13:28:30 +00:00
cf1d62a644
[Model] Support SDPA attention for Molmo vision backbone ( #9410 )
2024-10-16 11:52:01 +00:00
59230ef32b
[Misc] Consolidate example usage of OpenAI client for multimodal models ( #9412 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-16 11:20:51 +00:00
cee711fdbb
[Core] Rename input data types ( #8688 )
2024-10-16 10:49:37 +00:00
1de76a0e55
[CI/Build] Test VLM embeddings ( #9406 )
2024-10-16 09:44:30 +00:00
7abba39ee6
[Model] VLM2Vec, the first multimodal embedding model in vLLM ( #9303 )
2024-10-16 14:31:00 +08:00
7e7eae338d
[Misc] Standardize RoPE handling for Qwen2-VL ( #9250 )
2024-10-16 13:56:17 +08:00
ed920135c8
[Bugfix] Molmo text-only input bug fix ( #9397 )
...
Co-authored-by: sanghol <sanghol@allenai.org >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-16 04:56:09 +00:00
717a5f82cd
[Bugfix][CI/Build] Fix CUDA 11.8 Build ( #9386 )
2024-10-16 00:15:21 +00:00
ba30942240
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids ( #9034 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-15 15:40:43 -07:00
22f8a69549
[Misc] Directly use compressed-tensors for checkpoint definitions ( #8909 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-15 15:40:25 -07:00
5d264f4ab8
pass ignore_eos parameter to all benchmark_serving calls ( #9349 )
2024-10-15 13:30:44 -07:00
e9d517f276
[BugFix] Fix chat API continuous usage stats ( #9357 )
2024-10-14 23:19:48 -07:00
55e081fbad
[Bugfix] Update InternVL input mapper to support image embeds ( #9351 )
2024-10-14 21:29:19 -07:00
8e836d982a
[Doc] Fix code formatting in spec_decode.rst ( #9348 )
2024-10-14 21:29:11 -07:00
44eaa5a5d9
[Frontend] Clarify model_type error messages ( #9345 )
2024-10-14 21:29:01 -07:00
169b530607
[Bugfix] Clean up some cruft in mamba.py ( #9343 )
2024-10-15 00:24:25 +00:00
f0fe4fe86d
[Model] Make llama3.2 support multiple and interleaved images ( #9095 )
2024-10-14 15:24:26 -07:00
4d31cd424b
[Frontend] merge beam search implementations ( #9296 )
2024-10-14 15:05:52 -07:00
473e7b3606
[TPU] Fix TPU SMEM OOM by Pallas paged attention kernel ( #9350 )
2024-10-14 15:02:06 -07:00
fd47e57f4b
[Docs] Remove PDF build from Readthedocs ( #9347 )
2024-10-14 11:57:47 -07:00
203ab8f80f
[CI/Build] setuptools-scm fixes ( #8900 )
2024-10-14 11:34:47 -07:00
4141608c6a
[Hardware][intel GPU] add async output process for xpu ( #8897 )
2024-10-14 12:23:33 -06:00
dfe43a2071
[Model] Molmo vLLM Integration ( #9016 )
...
Co-authored-by: sanghol <sanghol@allenai.org >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-14 07:56:24 -07:00
16b24e7dcd
[Bugfix] Bandaid fix for speculative decoding tests ( #9327 )
2024-10-13 23:02:11 +00:00
f519902c52
[CI] Fix merge conflict ( #9317 )
2024-10-13 06:41:23 +00:00
250e26a63e
[Bugfix] Fix MiniCPM's LoRA bug ( #9286 )
2024-10-12 09:36:47 -07:00
2b184ddd4f
[Misc][Installation] Improve source installation script and doc ( #9309 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-10-12 09:36:40 -07:00
00298e092c
[Bugfix] Fix bug of xformer prefill for encoder-decoder ( #9026 )
2024-10-12 15:00:43 +08:00
89feb4c84d
[SpecDec] Remove Batch Expansion (2/3) ( #9298 )
2024-10-12 05:13:37 +00:00
ec10cb8511
[BugFix] Fix tool call finish reason in streaming case ( #9209 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-10-11 18:24:26 -07:00
d11b46f3a5
[bugfix] fix f-string for error ( #9295 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2024-10-11 17:03:48 -07:00
c6cf9295e1
[Bugfix] Sets is_first_step_output for TPUModelRunner ( #9202 )
2024-10-11 13:28:10 -07:00
de9fb4bef8
[Bugfix][CI/Build] Fix docker build where CUDA archs < 7.0 are being detected ( #9254 )
2024-10-11 15:57:39 -04:00
8baf85e4e9
[Doc] Compatibility matrix for mutual exclusive features ( #8512 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-11 11:18:50 -07:00
1a1823871d
[Doc] Remove outdated comment to avoid misunderstanding ( #9287 )
2024-10-11 18:02:03 +00:00
6cf1167c1a
[Model] Add GLM-4v support and meet vllm==0.6.2 ( #9242 )
2024-10-11 17:36:13 +00:00
f710090d8e
[Kernel] adding fused moe kernel config for L40S TP4 ( #9245 )
2024-10-11 08:54:22 -07:00
7342a7d7f8
[Model] Support Mamba ( #6484 )
2024-10-11 15:40:06 +00:00
df3dcdf49d
[Bugfix] Fix priority in multiprocessing engine ( #9277 )
2024-10-11 15:35:35 +00:00
36ea79079b
[Misc][LoRA] Support loading LoRA weights for target_modules in reg format ( #9275 )
2024-10-11 12:31:21 +00:00
e808156f30
[Misc] Collect model support info in a single process per model ( #9233 )
2024-10-11 11:08:11 +00:00
cbc2ef5529
[misc] hide best_of from engine ( #9261 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com >
2024-10-10 21:30:44 -07:00
94bf9ae4e9
[Misc] Fix sampling from sonnet for long context case ( #9235 )
2024-10-11 00:33:16 +00:00
f990bab2a4
[Doc][Neuron] add note to neuron documentation about resolving triton issue ( #9257 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-10-10 23:36:32 +00:00
e00c094f15
[torch.compile] generic decorators ( #9258 )
2024-10-10 15:54:23 -07:00
a78c6ba7c8
[ci/build] Add placeholder command for custom models test ( #9262 )
2024-10-10 15:45:09 -07:00
fb870fd491
Bump actions/setup-python from 3 to 5 ( #9195 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:46 -07:00
270953bafb
Bump actions/checkout from 3 to 4 ( #9196 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:35 -07:00
9cc811c4ff
Bump actions/github-script from 6 to 7 ( #9197 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:24 -07:00
e4d652ea3e
[torch.compile] integration with compilation control ( #9058 )
2024-10-10 12:39:36 -07:00
78c0b4166c
Suggest codeowners for the core components ( #9210 )
2024-10-10 12:29:24 -07:00
21efb603f5
[CI/Build] Make the Dockerfile.cpu file's PIP_EXTRA_INDEX_URL Configurable as a Build Argument ( #9252 )
2024-10-10 18:18:18 +00:00
055f3270d4
[Doc] Improve debugging documentation ( #9204 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-10 10:48:51 -07:00
18511aeda6
[Bugfix] Fix Machete unittests failing with NotImplementedError ( #9218 )
2024-10-10 17:39:56 +00:00
83ea5c72b9
[OpenVINO] Use torch 2.4.0 and newer optimum version ( #9121 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-10 11:18:58 -06:00
04de9057ab
[Model] support input image embedding for minicpmv ( #9237 )
2024-10-10 15:00:47 +00:00
07c11cf4d4
[Bugfix] Fix lm_head weights tying with lora for llama ( #9227 )
2024-10-10 21:11:56 +08:00
f3a507f1d3
[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 ( #9149 )
2024-10-10 14:17:17 +08:00
a64e7b9407
[Bugfix] Machete garbage results for some models (large K dim) ( #9212 )
2024-10-10 14:16:17 +08:00
ce00231a8b
[Bugfix] Fix Weight Loading Multiple GPU Test - Large Models ( #9213 )
2024-10-10 14:15:40 +08:00
de895f1697
[misc] improve model support check in another process ( #9208 )
2024-10-09 21:58:27 -07:00
cf25b93bdd
[Core] Fix invalid args to _process_request ( #9201 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-10 12:10:09 +08:00
d5fbb8706d
[CI/Build] Update Dockerfile install+deploy image to ubuntu 22.04 ( #9130 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-09 12:51:47 -06:00
cdca8994bd
[CI/Build] mypy: check vllm/entrypoints ( #9194 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-09 17:15:28 +00:00
ca77dd7a44
[Hardware][CPU] Support AWQ for CPU backend ( #7515 )
2024-10-09 10:28:08 -06:00
7dea289066
Add Dependabot configuration for GitHub Actions updates ( #1217 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-09 08:16:26 -07:00
cfaa6008e6
[Bugfix] Access get_vocab instead of vocab in tool parsers ( #9188 )
2024-10-09 08:59:57 -06:00
21906a6f50
[Bugfix] Fix lora loading for Compressed Tensors in #9120 ( #9179 )
2024-10-09 12:10:44 +00:00
dc4aea677a
[Doc] Fix VLM prompt placeholder sample bug ( #9170 )
2024-10-09 08:59:42 +00:00
c8627cd41b
[ci][test] use load dummy for testing ( #9165 )
2024-10-09 00:38:40 -07:00
8bfaa4e31e
[Bugfix] fix composite weight loading and EAGLE weight loading ( #9160 )
2024-10-09 00:36:55 -07:00
0b5b5d767e
[Frontend] Log the maximum supported concurrency ( #8831 )
2024-10-09 00:03:14 -07:00
cdc72e3c80
[Model] Remap FP8 kv_scale in CommandR and DBRX ( #9174 )
2024-10-09 06:43:06 +00:00
7627172bf4
[Bugfix][Doc] Report neuron error in output ( #9159 )
2024-10-08 22:43:34 -07:00
480b7f40cf
[Misc] Improve validation errors around best_of and n ( #9167 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-10-09 04:54:48 +00:00
acce7630c1
Update link to KServe deployment guide ( #9173 )
2024-10-09 03:58:49 +00:00
ffc4b27ea8
Add classifiers in setup.py ( #9171 )
2024-10-08 19:30:48 -07:00
2f4117c38e
support bitsandbytes quantization with more models ( #9148 )
2024-10-08 19:52:19 -06:00
9ba0bd6aa6
Add lm-eval directly to requirements-test.txt ( #9161 )
2024-10-08 18:22:31 -07:00
2a131965a8
mypy: check additional directories ( #9162 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-08 22:08:22 +00:00
bd37b9fbe2
[Bugfix] Try to handle older versions of pytorch ( #9086 )
2024-10-08 14:28:12 -07:00
de24046fcd
[Doc] Improve contributing and installation documentation ( #9132 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-08 20:22:08 +00:00
1874c6a1b0
[Doc] Update vlm.rst to include an example on videos ( #9155 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-08 18:12:29 +00:00
9a94ca4a5d
[Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing ( #8537 )
2024-10-08 09:38:40 -07:00
cfba685bd4
[CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models ( #8758 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2024-10-08 09:37:34 -07:00
069d3bd8d0
[Frontend] Add Early Validation For Chat Template / Tool Call Parser ( #9151 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-08 14:31:26 +00:00
a3691b6b5e
[Core][Frontend] Add Support for Inference Time mm_processor_kwargs ( #9131 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-08 14:12:56 +00:00
8c746226c9
[Frontend] API support for beam search for MQLLMEngine ( #9117 )
2024-10-08 05:51:43 +00:00
e1faa2a598
[misc] improve ux on readme ( #9147 )
2024-10-07 22:26:25 -07:00
80b57f00d5
[Intel GPU] Fix xpu decode input ( #9145 )
2024-10-08 03:51:14 +00:00
04c12f8157
[misc] update utils to support comparing multiple settings ( #9140 )
2024-10-08 02:51:49 +00:00
8eeb857084
Add Slack to README ( #9137 )
2024-10-07 17:06:21 -07:00
fa45513a51
[misc] fix comment and variable name ( #9139 )
2024-10-07 16:07:05 -07:00
c0d9a98d0c
[Doc] Include performance benchmark in README ( #9135 )
2024-10-07 15:04:06 -07:00
e0dbdb013d
[CI/Build] Add linting for github actions workflows ( #7876 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-07 21:18:10 +00:00
93cf74a8a7
[Doc]: Add deploying_with_k8s guide ( #8451 )
2024-10-07 13:31:45 -07:00
151ef4efd2
[Model] Support NVLM-D and fix QK Norm in InternViT ( #9045 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2024-10-07 11:55:12 +00:00
f19da64871
[Core] Refactor GGUF parameters packing and forwarding ( #8859 )
2024-10-07 10:01:46 +00:00
4f95ffee6f
[Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend ( #9089 )
2024-10-07 06:50:35 +00:00
8c6de96ea1
[Model] Explicit interface for vLLM models and support OOT embedding models ( #9108 )
2024-10-07 06:10:35 +00:00
18b296fdb2
[core] remove beam search from the core ( #9105 )
2024-10-07 05:47:04 +00:00
c8f26bb636
[BugFix][Core] Fix BlockManagerV2 when Encoder Input is None ( #9103 )
2024-10-07 03:52:42 +00:00
487678d046
[Bugfix][Hardware][CPU] Fix CPU model input for decode ( #9044 )
2024-10-06 19:14:27 -07:00
cb3b2b9ba4
[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling ( #9038 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-10-06 12:48:11 -07:00
fdf59d30ea
[Bugfix] fix tool_parser error handling when serve a model not support it ( #8709 )
2024-10-06 12:51:08 +00:00
b22b798471
[Model] PP support for embedding models and update docs ( #9090 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-10-06 16:35:27 +08:00
f22619fe96
[Misc] Remove user-facing error for removed VLM args ( #9104 )
2024-10-06 01:33:52 -07:00
168cab6bbf
[Frontend] API support for beam search ( #9087 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-10-05 23:39:03 -07:00
23fea8714a
[Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model ( #9101 )
2024-10-06 13:00:04 +08:00
f4dd830e09
[core] use forward context for flash infer ( #9097 )
2024-10-05 19:37:31 -07:00
5df1834895
[Bugfix] Fix order of arguments matters in config.yaml ( #8960 )
2024-10-05 17:35:11 +00:00
cfadb9c687
[Bugfix] Deprecate registration of custom configs to huggingface ( #9083 )
2024-10-05 21:56:40 +08:00
15986f598c
[Model] Support Gemma2 embedding model ( #9004 )
2024-10-05 06:57:05 +00:00
53b3a33027
[Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs ( #8979 )
2024-10-04 22:05:37 -07:00
dac914b0d6
[Bugfix] use blockmanagerv1 for encoder-decoder ( #9084 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-05 04:45:38 +00:00
a95354a36e
[Doc] Update README.md with Ray summit slides ( #9088 )
2024-10-05 02:54:45 +00:00
663874e048
[torch.compile] improve allreduce registration ( #9061 )
2024-10-04 16:43:50 -07:00
cc90419e89
[Hardware][Neuron] Add on-device sampling support for Neuron ( #8746 )
...
Co-authored-by: Ashraf Mahgoub <ashymahg@amazon.com >
2024-10-04 16:42:20 -07:00
27302dd584
[Misc] Fix CI lint ( #9085 )
2024-10-04 16:07:54 -07:00
0cc566ca8f
[Misc] Add random seed for prefix cache benchmark ( #9081 )
2024-10-04 21:58:57 +00:00
05c531be47
[Misc] Improved prefix cache example ( #9077 )
2024-10-04 21:38:42 +00:00
fbb74420e7
[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang ( #7412 )
2024-10-04 14:01:44 -07:00
05d686432f
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE ( #8973 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com >
Co-authored-by: Dipika Sikka <ds3822@columbia.edu >
2024-10-04 12:34:44 -06:00
0dcc8cbe5a
Adds truncate_prompt_tokens param for embeddings creation ( #8999 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
2024-10-04 18:31:40 +00:00
26aa325f4f
[Core][VLM] Test registration for OOT multimodal models ( #8717 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-04 10:38:25 -07:00
e5dc713c23
[Hardware][PowerPC] Make oneDNN dependency optional for Power ( #9039 )
...
Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com >
2024-10-04 17:24:42 +00:00
36eecfbddb
Remove AMD Ray Summit Banner ( #9075 )
2024-10-04 10:17:16 -07:00
9ade8bbc8d
[Model] add a bunch of supported lora modules for mixtral ( #9008 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2024-10-04 16:24:40 +00:00
22482e495e
[Bugfix] Flash attention arches not getting set properly ( #9062 )
2024-10-04 09:43:15 -06:00
3d826d2c52
[Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL ( #9071 )
2024-10-04 14:34:58 +00:00
0e36fd4909
[Misc] Move registry to its own file ( #9064 )
2024-10-04 10:01:37 +00:00
0f6d7a9a34
[Models] Add remaining model PP support ( #7168 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Signed-off-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-04 10:56:58 +08:00
303d44790a
[Misc] Enable multi-step output streaming by default ( #9047 )
2024-10-03 22:55:42 -04:00
aeb37c2a72
[CI/Build] Per file CUDA Archs (improve wheel size and dev build times) ( #8845 )
2024-10-03 22:55:25 -04:00
3dbb215b38
[Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model ( #8405 )
2024-10-04 10:36:39 +08:00
2838d6b38e
[Bugfix] Weight loading fix for OPT model ( #9042 )
...
Co-authored-by: dvres <dvres@fri.uni-lj.si >
2024-10-03 19:53:29 -04:00
91add85ec4
Fix failing spec decode test ( #9054 )
2024-10-03 23:07:29 +00:00
9aaf14c62e
[misc] add forward context for attention ( #9029 )
2024-10-03 12:09:42 -07:00
63e39937f9
[Frontend] [Neuron] Parse literals out of override-neuron-config ( #8959 )
...
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-10-03 18:02:07 +00:00
f5d72b2fc6
[Core] Make BlockSpaceManagerV2 the default BlockManager to use. ( #8678 )
2024-10-03 09:44:21 -07:00
83caf35e08
[BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser ( #9020 )
2024-10-03 16:44:52 +08:00
01843c89b8
[Misc] log when using default MoE config ( #8971 )
2024-10-03 04:31:07 +00:00
19a4dd0990
[Bugfix] example template should not add parallel_tool_prompt if tools is none ( #9007 )
2024-10-03 03:04:17 +00:00
18c2e30c57
[Doc] Update Granite model docs ( #9025 )
2024-10-03 02:42:24 +00:00
19f0d25796
[Model] Adding Granite MoE. ( #8206 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-03 09:33:57 +08:00
f58d4fccc9
[OpenVINO] Enable GPU support for OpenVINO vLLM backend ( #8192 )
2024-10-02 17:50:01 -04:00
afb050b29d
[Core] CUDA Graphs for Multi-Step + Chunked-Prefill ( #8645 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-10-02 19:44:39 +00:00
7f60520deb
[Misc] Update Default Image Mapper Error Log ( #8977 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-10-02 11:44:38 +00:00
563649aafe
[Core] Combined support for multi-step scheduling, chunked prefill & prefix caching ( #8804 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Andrew Feldman <afeld2012@gmail.com >
2024-10-02 07:52:20 +00:00
1570203864
[Spec Decode] (1/2) Remove batch expansion ( #8839 )
2024-10-01 16:04:42 -07:00
22f5851b80
Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows ( #8997 )
2024-10-01 11:07:06 -07:00
4f341bd4bf
[Doc] Update list of supported models ( #8987 )
2024-10-02 00:35:39 +08:00
35bd215168
[Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API ( #8965 )
2024-10-01 09:58:06 +00:00
1fe0a4264a
[Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders ( #8991 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-01 09:52:44 +00:00
bc4eb65b54
[Bugfix] Fix Fuyu tensor parallel inference ( #8986 )
2024-10-01 17:51:41 +08:00
82f3937e59
[Misc] add process_weights_after_loading for DummyLoader ( #8969 )
2024-10-01 03:46:41 +00:00
7da2487591
[torch.compile] fix tensor alias ( #8982 )
2024-10-01 03:40:48 +00:00
aaccca2b4d
[CI/Build] Fix machete generated kernel files ordering ( #8976 )
...
Signed-off-by: kevin <kevin@anyscale.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-10-01 03:33:12 +00:00
062c89e7c9
[Frontend][Core] Move guided decoding params into sampling params ( #8252 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-01 09:34:25 +08:00
bce324487a
[CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. ( #8975 )
2024-10-01 00:51:40 +00:00
1425a1bcf9
[ci] Add CODEOWNERS for test directories ( #8795 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-01 00:47:08 +00:00
1cabfcefb6
[Misc] Adjust max_position_embeddings for LoRA compatibility ( #8957 )
2024-09-30 12:57:39 +00:00
be76e5aabf
[Core] Make scheduling policy settable via EngineArgs ( #8956 )
2024-09-30 12:28:44 +00:00
2ae25f79cf
[Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg ( #8946 )
2024-09-30 13:01:20 +08:00
8e60afa15e
[Model][LoRA]LoRA support added for MiniCPMV2.6 ( #8943 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-30 04:31:55 +00:00
b6d7392579
[Misc][CI/Build] Include cv2 via mistral_common[opencv] ( #8951 )
2024-09-30 04:28:26 +00:00
e01ab595d8
[Model] support input embeddings for qwen2vl ( #8856 )
2024-09-30 03:16:10 +00:00
f13a07b1f8
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model ( #8533 )
2024-09-29 17:35:58 -04:00
6c9ba48fde
[Frontend] Added support for HF's new continue_final_message parameter ( #8942 )
2024-09-29 17:59:47 +00:00
1fb9c1b0bf
[Misc] Fix typo in BlockSpaceManagerV1 ( #8944 )
2024-09-29 15:05:54 +00:00
31f46a0d35
[BugFix] Fix seeded random sampling with encoder-decoder models ( #8870 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-29 09:43:14 +00:00
3d49776bbb
[Model][LoRA]LoRA support added for MiniCPMV2.5 ( #7199 )
2024-09-29 06:59:45 +00:00
bc2ef1f77c
[Model] Support Qwen2.5-Math-RM-72B ( #8896 )
2024-09-28 21:19:39 -07:00
2e7fe7e79f
[Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching ( #8930 )
2024-09-29 03:13:01 +00:00
26a68d5d7e
[CI/Build] Add test decorator for minimum GPU memory ( #8925 )
2024-09-29 02:50:51 +00:00
d081da0064
[Bugfix] Fix Marlin MoE act order when is_k_full == False ( #8741 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-09-28 18:19:40 -07:00
5bf8789b2a
[Bugfix] Block manager v2 with preemption and lookahead slots ( #8824 )
2024-09-29 09:17:45 +08:00
d1537039ce
[Core] Improve choice of Python multiprocessing method ( #8823 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-29 09:17:07 +08:00
cc276443b5
[doc] organize installation doc and expose per-commit docker ( #8931 )
2024-09-28 17:48:41 -07:00
e585b583a9
[Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 ( #8891 )
2024-09-28 18:51:22 +00:00
090e945e36
[Frontend] Make beam search emulator temperature modifiable ( #8928 )
...
Co-authored-by: Eduard Balzin <nfunctor@yahoo.fr >
2024-09-28 11:30:21 -07:00
e1a3f5e831
[CI/Build] Update models tests & examples ( #8874 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-28 09:54:35 -07:00
19d02ff938
[Bugfix] Fix PP for Multi-Step ( #8887 )
2024-09-28 08:52:46 -07:00
39d3f8d94f
[Bugfix] Fix code for downloading models from modelscope ( #8443 )
2024-09-28 08:24:12 -07:00
b0298aa8cc
[Misc] Remove vLLM patch of BaichuanTokenizer ( #8921 )
2024-09-28 08:11:25 +00:00
260024a374
[Bugfix][Intel] Fix XPU Dockerfile Build ( #7824 )
...
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-27 23:45:50 -07:00
d86f6b2afb
[misc] fix wheel name ( #8919 )
2024-09-27 22:10:44 -07:00
bd429f2b75
[Core] Priority-based scheduling in async engine ( #8850 )
2024-09-27 15:07:10 -07:00
18e60d7d13
[misc][distributed] add VLLM_SKIP_P2P_CHECK flag ( #8911 )
2024-09-27 14:27:56 -07:00
c2ec430ab5
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path ( #8378 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-09-27 13:32:07 -07:00
c5d55356f9
[Bugfix] fix for deepseek w4a16 ( #8906 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-09-27 13:12:34 -06:00
172d1cd276
[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method ( #7271 )
2024-09-27 14:25:10 -04:00
a9b15c606f
[torch.compile] use empty tensor instead of None for profiling ( #8875 )
2024-09-27 08:11:32 -07:00
8df2dc3c88
[TPU] Update pallas.py to support trillium ( #8871 )
2024-09-27 01:16:55 -07:00
6d792d2f31
[Bugfix][VLM] Fix Fuyu batching inference with max_num_seqs>1 ( #8892 )
2024-09-27 01:15:58 -07:00
0e088750af
[MISC] Fix invalid escape sequence '\' ( #8830 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2024-09-27 01:13:25 -07:00
dc4e3df5c2
[misc] fix collect env ( #8894 )
2024-09-27 00:26:38 -07:00
3b00b9c26c
[Core] rename PromptInputs and inputs ( #8876 )
2024-09-26 20:35:15 -07:00
344cd2b6f4
[Feature] Add support for Llama 3.1 and 3.2 tool use ( #8343 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-09-26 17:01:42 -07:00
1b49148e47
[Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility ( #8764 )
2024-09-26 16:54:09 -07:00
4b377d6feb
[BugFix] Fix test breakages from transformers 4.45 upgrade ( #8829 )
2024-09-26 16:46:43 -07:00
71d21c73ab
[Bugfix] Fixup advance_step.cu warning ( #8815 )
2024-09-26 16:23:45 -07:00
ee2da3e9ef
fix validation: Only set tool_choice auto if at least one tool is provided ( #8568 )
2024-09-26 16:23:17 -07:00
e2f6f26e86
[Bugfix] Fix print_warning_once's line info ( #8867 )
2024-09-26 16:18:26 -07:00
b28d2104de
[Misc] Change dummy profiling and BOS fallback warns to log once ( #8820 )
2024-09-26 16:18:14 -07:00
93d364da34
[Bugfix] Include encoder prompts len to non-stream api usage response ( #8861 )
2024-09-26 15:47:00 -07:00
d9cfbc891e
[ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM ( #8872 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-26 15:02:16 -07:00
70de39f6b4
[misc][installation] build from source without compilation ( #8818 )
2024-09-26 13:19:04 -07:00
68988d4e0d
[CI/Build] Fix missing ci dependencies ( #8834 )
2024-09-26 11:04:39 -07:00
520db4dbc1
[Docs] Add README to the build docker image ( #8825 )
2024-09-26 11:02:52 -07:00
f70bccac75
[Build/CI] Upgrade to gcc 10 in the base build Docker image ( #8814 )
2024-09-26 10:07:18 -07:00
4bb98f2190
[Misc] Update config loading for Qwen2-VL and remove Granite ( #8837 )
2024-09-26 07:45:30 -07:00
7193774b1f
[Misc] Support quantization of MllamaForCausalLM ( #8822 )
2024-09-25 14:46:22 -07:00
e2c6e0a829
[Doc] Update doc for Transformers 4.45 ( #8817 )
2024-09-25 13:29:48 -07:00
770ec6024f
[Model] Add support for the multi-modal Llama 3.2 model ( #8811 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chang Su <chang.s.su@oracle.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-25 13:29:32 -07:00
4f1ba0844b
Revert "rename PromptInputs and inputs with backward compatibility ( #8760 ) ( #8810 )
2024-09-25 10:36:26 -07:00
873edda6cf
[Misc] Support FP8 MoE for compressed-tensors ( #8588 )
2024-09-25 09:43:36 -07:00
64840dfae4
[Frontend] MQLLMEngine supports profiling. ( #8761 )
2024-09-25 09:37:41 -07:00
28e1299e60
rename PromptInputs and inputs with backward compatibility ( #8760 )
2024-09-25 09:36:47 -07:00
0c4d2ad5e6
[VLM][Bugfix] internvl with num_scheduler_steps > 1 ( #8614 )
2024-09-25 09:35:53 -07:00
c6f2485c82
[Misc] Add extra deps for openai server image ( #8792 )
2024-09-25 09:35:23 -07:00
300da09177
[Kernel] Fullgraph and opcheck tests ( #8479 )
2024-09-25 08:35:52 -06:00
1c046447a6
[CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade ( #8777 )
2024-09-25 22:26:37 +08:00
8fae5ed7f6
[Misc] Fix minor typo in scheduler ( #8765 )
2024-09-25 00:53:03 -07:00
3368c3ab36
[Bugfix] Ray 2.9.x doesn't expose available_resources_per_node ( #8767 )
...
Signed-off-by: darthhexx <darthhexx@gmail.com >
2024-09-25 00:52:26 -07:00
1ac3de09cd
[Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer ( #8672 )
2024-09-25 07:49:26 +00:00
3e073e66f1
[Bugfix] load fc bias from config for eagle ( #8790 )
2024-09-24 23:16:30 -07:00
c23953675f
[Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend ( #8770 )
2024-09-24 23:16:11 -07:00
e3dd0692fa
[BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv ( #8250 )
2024-09-25 05:53:43 +00:00
fc3afc20df
Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 ( #8752 )
2024-09-24 21:26:36 -07:00
b4522474a3
[Bugfix][Kernel] Implement acquire/release polyfill for Pascal ( #8776 )
2024-09-24 21:26:33 -07:00
ee777d9c30
Fix test_schedule_swapped_simple in test_scheduler.py ( #8780 )
2024-09-24 21:26:18 -07:00
6e0c9d6bd0
[Bugfix] Use heartbeats instead of health checks ( #8583 )
2024-09-24 20:37:38 -07:00
6da1ab6b41
[Core] Adding Priority Scheduling ( #5958 )
2024-09-24 19:50:50 -07:00
01b6f9e1f0
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding ( #8047 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-09-24 17:29:56 -07:00
13f9f7a3d0
[Misc] Upgrade bitsandbytes to the latest version 0.44.0 ( #8768 )
2024-09-24 17:08:55 -07:00
1e7d5c01f5
[misc] soft drop beam search ( #8763 )
2024-09-24 15:48:39 -07:00
2467b642dd
[CI/Build] fix setuptools-scm usage ( #8771 )
2024-09-24 12:38:12 -07:00
72fc97a0f1
[Bugfix] Fix torch dynamo fixes caused by replace_parameters ( #8748 )
2024-09-24 14:33:21 -04:00
2529d09b5a
[Frontend] Batch inference for llm.chat() API ( #8648 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-09-24 09:44:11 -07:00
a928ded995
[Kernel] Split Marlin MoE kernels into multiple files ( #8661 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-09-24 09:31:42 -07:00
cc4325b66a
[Bugfix] Fix potentially unsafe custom allreduce synchronization ( #8558 )
2024-09-24 01:08:14 -07:00
8ff7ced996
[Model] Expose Phi3v num_crops as a mm_processor_kwarg ( #8658 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-24 07:36:46 +00:00
3f06bae907
[Core][Model] Support loading weights by ID within models ( #7931 )
2024-09-24 07:14:15 +00:00
b8747e8a7c
[MISC] Skip dumping inputs when unpicklable ( #8744 )
2024-09-24 06:10:03 +00:00
3185fb0cca
Revert "[Core] Rename PromptInputs
to PromptType
, and inputs
to prompt
" ( #8750 )
2024-09-24 05:45:20 +00:00
0250dd68c5
re-implement beam search on top of vllm core ( #8726 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com >
2024-09-23 22:08:12 -07:00
88577ac928
Fix tests in test_scheduler.py that fail with BlockManager V2 ( #8728 )
2024-09-24 04:43:13 +00:00
530821d00c
[Hardware][AMD] ROCm6.2 upgrade ( #8674 )
2024-09-23 18:52:39 -07:00
1a2aef3e59
Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse ( #8335 )
2024-09-23 15:38:04 -07:00
5f7bb58427
Fix typical acceptance sampler with correct recovered token ids ( #8562 )
2024-09-23 12:32:27 -07:00
b05f5c9238
[Core] Allow IPv6 in VLLM_HOST_IP with zmq ( #8575 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-09-23 12:15:41 -07:00
9b0e3ec970
[Kernel][LoRA] Add assertion for punica sgmv kernels ( #7585 )
2024-09-23 18:57:42 +00:00
86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin ( #7701 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-09-23 13:46:26 -04:00
ee5f34b1c2
[CI/Build] use setuptools-scm to set __version__ ( #4738 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-23 09:44:26 -07:00
f2bd246c17
[VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size ( #8707 )
2024-09-23 14:43:09 +00:00
a79e522984
[Model] Support pp for qwen2-vl ( #8696 )
2024-09-23 13:46:59 +00:00
3e83c12b5c
[Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner ( #8733 )
2024-09-23 13:15:16 +00:00
e551ca1555
[Hardware][CPU] Refactor CPU model runner ( #8729 )
2024-09-23 20:12:20 +08:00
9b8c8ba119
[Core][Frontend] Support Passing Multimodal Processor Kwargs ( #8657 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-23 07:44:48 +00:00
d23679eb99
[Bugfix] fix docker build for xpu ( #8652 )
2024-09-22 22:54:18 -07:00
57a0702e63
[Bugfix] Fix CPU CMake build ( #8723 )
...
Co-authored-by: Yuan <yuan.zhou@intel.com >
2024-09-22 20:40:46 -07:00
3dda7c2250
[Bugfix] Avoid some bogus messages RE CUTLASS's revision when building ( #8702 )
2024-09-22 22:24:59 -04:00
92ba7e7477
[misc] upgrade mistral-common ( #8715 )
2024-09-22 15:41:59 -07:00
d4a2ac8302
[build] enable existing pytorch (for GH200, aarch64, nightly) ( #8713 )
2024-09-22 12:47:54 -07:00
c6bd70d772
[SpecDec][Misc] Cleanup, remove bonus token logic. ( #8701 )
2024-09-22 12:34:14 -07:00
5b59532760
[Model][VLM] Add LLaVA-Onevision model support ( #8486 )
...
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-22 10:51:44 -07:00
ca2b628b3c
[MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler ( #8703 )
2024-09-22 10:44:09 -07:00
8ca5051b9a
[Misc] Use NamedTuple in Multi-image example ( #8705 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-22 20:56:20 +08:00
06ed2815e2
[Model] Refactor BLIP/BLIP-2 to support composite model loading ( #8407 )
2024-09-22 12:24:21 +00:00
0e40ac9b7b
[ci][build] fix vllm-flash-attn ( #8699 )
2024-09-21 23:24:58 -07:00
13d88d4137
[Bugfix] Refactor composite weight loading logic ( #8656 )
2024-09-22 04:33:27 +00:00
d66ac62854
[Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu ( #8643 )
2024-09-21 23:45:02 +00:00
9dc7c6c7f3
[dbrx] refactor dbrx experts to extend FusedMoe class ( #8518 )
2024-09-21 15:09:39 -06:00
ec4aaad812
[Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 ( #8646 )
2024-09-21 09:20:54 +00:00
4dfdf43196
[Doc] Fix typo in AMD installation guide ( #8689 )
2024-09-21 00:24:12 -07:00
5e85f4f82a
[VLM] Use SequenceData.from_token_counts to create dummy data ( #8687 )
2024-09-20 23:28:56 -07:00
71c60491f2
[Kernel] Build flash-attn from source ( #8245 )
2024-09-20 23:27:10 -07:00
0faab90eb0
[beam search] add output for manually checking the correctness ( #8684 )
2024-09-20 19:55:33 -07:00
0455c46ed4
[Core] Factor out common code in SequenceData and Sequence ( #8675 )
2024-09-21 02:30:39 +00:00
d4bf085ad0
[MISC] add support custom_op check ( #8557 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-20 19:03:55 -07:00
0057894ef7
[Core] Rename PromptInputs and inputs ( #8673 )
2024-09-20 19:00:54 -07:00
0f961b3ce9
[Bugfix] Fix incorrect llava next feature size calculation ( #8496 )
2024-09-20 22:48:32 +00:00
7f9c8902e3
[Hardware][AWS] update neuron to 2.20 ( #8676 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-09-20 15:19:44 -07:00
7c8566aa4f
[Doc] neuron documentation update ( #8671 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-09-20 15:04:37 -07:00
b4e4eda92e
[Bugfix][Core] Fix tekken edge case for mistral tokenizer ( #8640 )
2024-09-20 14:33:03 -07:00
2874bac618
[Bugfix] Config got an unexpected keyword argument 'engine' ( #8556 )
2024-09-20 14:00:45 -07:00
035fa895ec
[Misc] Show AMD GPU topology in collect_env.py ( #8649 )
2024-09-20 13:52:19 -07:00
b28298f2f4
[Bugfix] Validate SamplingParam n is an int ( #8548 )
2024-09-20 12:46:02 -07:00
2940afa04e
[CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build ( #8670 )
2024-09-20 10:27:44 -07:00
3b63de9353
[Model] Add OLMoE ( #7922 )
2024-09-20 09:31:41 -07:00
260d40b5ea
[Core] Support Lora lineage and base model metadata management ( #6315 )
2024-09-20 06:20:56 +00:00
9e5ec35b1f
[bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata ( #8474 )
2024-09-19 20:49:54 -07:00
18ae428a0d
[Bugfix] Fix Phi3.5 mini and MoE LoRA inference ( #8571 )
2024-09-20 08:54:02 +08:00
de6f90a13d
[Misc] guard against change in cuda library name ( #8609 )
2024-09-20 06:36:30 +08:00
6cb748e190
[CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail ( #8551 )
2024-09-19 13:06:32 -07:00
9e99407e3c
Create SECURITY.md ( #8642 )
2024-09-19 12:16:28 -07:00
ea4647b7d7
[Doc] Add documentation for GGUF quantization ( #8618 )
2024-09-19 13:15:55 -06:00
e42c634acb
[Core] simplify logits resort in _apply_top_k_top_p ( #8619 )
2024-09-19 18:28:25 +00:00
9cc373f390
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention ( #8577 )
2024-09-19 17:37:57 +00:00
76515f303b
[Frontend] Use MQLLMEngine for embeddings models too ( #8584 )
2024-09-19 12:51:06 -04:00
855c8ae2c9
[MISC] remove engine_use_ray in benchmark_throughput.py ( #8615 )
2024-09-18 22:33:20 -07:00
c52ec5f034
[Bugfix] fixing sonnet benchmark bug in benchmark_serving.py ( #8616 )
2024-09-19 05:24:24 +00:00
02c9afa2d0
Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" ( #8593 )
2024-09-19 04:14:28 +00:00
3118f63385
[Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. ( #8545 )
2024-09-19 02:24:15 +00:00
4c34ce8916
[Kernel] Remove marlin moe templating on thread_m_blocks ( #8573 )
...
Co-authored-by: lwilkinson@neuralmagic.com
2024-09-19 01:42:49 +00:00
0d47bf3bf4
[Bugfix] add dead_error property to engine client ( #8574 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-18 22:10:01 +00:00
d9cd78eb71
[BugFix] Nonzero exit code if MQLLMEngine startup fails ( #8572 )
2024-09-18 20:17:55 +00:00
db9120cded
[Kernel] Change interface to Mamba selective_state_update for continuous batching ( #8039 )
2024-09-18 20:05:06 +00:00
b3195bc9e4
[AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call ( #8380 )
...
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-18 10:41:08 -07:00
e18749ff09
[Model] Support Solar Model ( #8386 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-18 11:04:00 -06:00
d65798f78c
[Core] zmq: bind only to 127.0.0.1 for local-only usage ( #8543 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-09-18 16:10:27 +00:00
a8c1d161a7
[Core] *Prompt* logprobs support in Multi-step ( #8199 )
2024-09-18 08:38:43 -07:00
7c7714d856
[Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH ( #8157 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-09-18 13:56:58 +00:00
9d104b5beb
[CI/Build] Update Ruff version ( #8469 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-18 11:00:56 +00:00
6ffa3f314c
[CI/Build] Avoid CUDA initialization ( #8534 )
2024-09-18 10:38:11 +00:00
e351572900
[Misc] Add argument to disable FastAPI docs ( #8554 )
2024-09-18 09:51:59 +00:00
95965d31b6
[CI/Build] fix Dockerfile.cpu on podman ( #8540 )
2024-09-18 10:49:53 +08:00
8110e44529
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching ( #8012 )
2024-09-17 23:44:27 +00:00
09deb4721f
[CI/Build] Excluding kernels/test_gguf.py from ROCm ( #8520 )
2024-09-17 16:40:29 -07:00
fa0c114fad
[doc] improve installation doc ( #8550 )
...
Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com >
2024-09-17 16:24:06 -07:00
98f9713399
[Bugfix] Fix TP > 1 for new granite ( #8544 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-17 23:17:08 +00:00
56c3de018c
[Misc] Don't dump contents of kvcache tensors on errors ( #8527 )
2024-09-17 12:24:29 -07:00
a54ed80249
[Model] Add mistral function calling format to all models loaded with "mistral" format ( #8515 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-17 17:50:37 +00:00
9855b99502
[Feature][kernel] tensor parallelism with bitsandbytes quantization ( #8434 )
2024-09-17 08:09:12 -07:00
1009e93c5d
[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models ( #7631 )
2024-09-17 07:35:01 -07:00
1b6de8352b
[Benchmark] Support sample from HF datasets and image input for benchmark_serving ( #8495 )
2024-09-17 07:34:27 +00:00
cbdb252259
[Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change ( #8509 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-09-17 00:06:26 -07:00
99aa4eddaf
[torch.compile] register allreduce operations as custom ops ( #8526 )
2024-09-16 22:57:57 -07:00
ee2bceaaa6
[Misc][Bugfix] Disable guided decoding for mistral tokenizer ( #8521 )
2024-09-16 22:22:45 -07:00
1c1bb388e0
[Frontend] Improve Nullable kv Arg Parsing ( #8525 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-17 04:17:32 +00:00
546034b466
[refactor] remove triton based sampler ( #8524 )
2024-09-16 20:04:48 -07:00
cca61642e0
[Bugfix] Fix 3.12 builds on main ( #8510 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-17 00:01:45 +00:00
5ce45eb54d
[misc] small qol fixes for release process ( #8517 )
2024-09-16 15:11:27 -07:00
5478c4b41f
[perf bench] set timeout to debug hanging ( #8516 )
2024-09-16 14:30:02 -07:00
47f5e03b5b
[Bugfix] Bind api server port before starting engine ( #8491 )
2024-09-16 13:56:28 -07:00
2759a43a26
[doc] update doc on testing and debugging ( #8514 )
2024-09-16 12:10:23 -07:00
5d73ae49d6
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels ( #7270 )
2024-09-16 11:52:40 -07:00
781e3b9a42
[Bugfix][Kernel] Fix build for sm_60 in GGUF kernel ( #8506 )
2024-09-16 12:15:57 -06:00
acd5511b6d
[BugFix] Fix clean shutdown issues ( #8492 )
2024-09-16 09:33:46 -07:00
837c1968f9
[Frontend] Expose revision arg in OpenAI server ( #8501 )
2024-09-16 15:55:26 +00:00
a091e2da3e
[Kernel] Enable 8-bit weights in Fused Marlin MoE ( #8032 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com >
2024-09-16 09:47:19 -06:00
fc990f9795
[Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel ( #8357 )
2024-09-15 16:51:44 -06:00
3724d5f6b5
[Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations ( #8490 )
2024-09-15 04:20:05 +00:00
50e9ec41fc
[TPU] Implement multi-step scheduling ( #8489 )
2024-09-14 16:58:31 -07:00
47790f3e32
[torch.compile] add a flag to disable custom op ( #8488 )
2024-09-14 13:07:16 -07:00
a36e070dad
[torch.compile] fix functionalization ( #8480 )
2024-09-14 09:46:04 -07:00
8a0cf1ddc3
[Model] support minicpm3 ( #8297 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-14 14:50:26 +00:00
1ef0d2efd0
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm ( #8310 )
2024-09-13 17:01:11 -07:00
851725202a
[Hardware][intel GPU] bump up ipex version to 2.3 ( #8365 )
...
Co-authored-by: Yan Ma <yan.ma@intel.com >
2024-09-13 16:54:34 -07:00
9ba0817ff1
bump version to v0.6.1.post2 ( #8473 )
2024-09-13 11:35:00 -07:00
18e9e1f7b3
[HotFix] Fix final output truncation with stop string + streaming ( #8468 )
2024-09-13 11:31:12 -07:00
f57092c00b
[Doc] Add oneDNN installation to CPU backend documentation ( #8467 )
2024-09-13 18:06:30 +00:00
a84e598e21
[CI/Build] Reorganize models tests ( #7820 )
2024-09-13 10:20:06 -07:00
0a4806f0a9
[plugin][torch.compile] allow to add custom compile backend ( #8445 )
2024-09-13 09:32:42 -07:00
ecd7a1d5b6
[Installation] Gate FastAPI version for Python 3.8 ( #8456 )
2024-09-13 09:02:26 -07:00
a2469127db
[misc][ci] fix quant test ( #8449 )
2024-09-13 17:20:14 +08:00
06311e2956
[Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 ( #8442 )
2024-09-13 07:58:28 +00:00
cab69a15e4
[doc] recommend pip instead of conda ( #8446 )
2024-09-12 23:52:41 -07:00
9b4a3b235e
[CI/Build] Enable InternVL2 PP test only on single node ( #8437 )
2024-09-13 06:35:20 +00:00
acda0b35d0
bump version to v0.6.1.post1 ( #8440 )
2024-09-12 21:39:49 -07:00
ba77527955
[bugfix] torch profiler bug for single gpu with GPUExecutor ( #8354 )
2024-09-12 21:30:00 -07:00
6821020109
[Bugfix] Fix async log stats ( #8417 )
2024-09-12 20:48:59 -07:00
8427550488
[CI/Build] Update pixtral tests to use JSON ( #8436 )
2024-09-13 03:47:52 +00:00
3f79bc3d1a
[Bugfix] Bump fastapi and pydantic version ( #8435 )
2024-09-13 03:21:42 +00:00
40c396533d
[Bugfix] Mapping physical device indices for e2e test utils ( #8290 )
2024-09-13 11:06:28 +08:00
5ec9c0fb3c
[Core] Factor out input preprocessing to a separate class ( #7329 )
2024-09-13 02:56:13 +00:00
8f44a92d85
[BugFix] fix group_topk ( #8430 )
2024-09-13 09:23:42 +08:00
360ddbd37e
[Misc] Update Pixtral example ( #8431 )
2024-09-12 17:31:18 -07:00
a480939e8e
[Bugfix] Fix weight loading issue by rename variable. ( #8293 )
2024-09-12 19:25:00 -04:00
d31174a4e1
[Hotfix][Pixtral] Fix multiple images bugs ( #8415 )
2024-09-12 15:21:51 -07:00
b61bd98f90
[CI/Build] Disable multi-node test for InternVL2 ( #8428 )
2024-09-12 15:05:35 -07:00
c16369455f
[Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models ( #8425 )
2024-09-12 14:06:51 -07:00
019877253b
[Bugfix] multi-step + flashinfer: ensure cuda graph compatible ( #8427 )
2024-09-12 21:01:50 +00:00
551ce01078
[Core] Add engine option to return only deltas or final output ( #7381 )
2024-09-12 12:02:00 -07:00
a6c0f3658d
[multi-step] add flashinfer backend ( #7928 )
2024-09-12 11:16:22 -07:00
f2e263b801
[Bugfix] Offline mode fix ( #8376 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-12 11:11:57 -07:00
1f0c75afa9
[BugFix] Fix Duplicate Assignment in Hermes2ProToolParser ( #8423 )
2024-09-12 11:10:11 -07:00
8a23e93302
[BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance ( #8403 )
2024-09-12 10:47:42 -07:00
c6202daeed
[Model] Support multiple images for qwen-vl ( #8247 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-12 10:10:54 -07:00
e56bf27741
[Bugfix] Fix InternVL2 inference with various num_patches ( #8375 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-12 10:10:35 -07:00
520ca380ae
[Hotfix][VLM] Fixing max position embeddings for Pixtral ( #8399 )
2024-09-12 09:28:37 -07:00
7de49aa86c
[torch.compile] hide slicing under custom op for inductor ( #8384 )
2024-09-12 00:11:55 -07:00
42ffba11ad
[Misc] Use RoPE cache for MRoPE ( #8396 )
2024-09-11 23:13:14 -07:00
295c4730a8
[Misc] Raise error when using encoder/decoder model with cpu backend ( #8355 )
2024-09-12 05:45:24 +00:00
1bf2dd9df0
[Gemma2] add bitsandbytes support for Gemma2 ( #8338 )
2024-09-11 21:53:12 -07:00
5a60699c45
[Bugfix]: Fix the logic for deciding if tool parsing is used ( #8366 )
2024-09-12 03:55:30 +00:00
b6c75e1cf2
Fix the AMD weight loading tests ( #8390 )
2024-09-11 20:35:33 -07:00
b71c956deb
[TPU] Use Ray for default distributed backend ( #8389 )
2024-09-11 20:31:51 -07:00
f842a7aff1
[misc] remove engine_use_ray ( #8126 )
2024-09-11 18:23:36 -07:00
a65cb16067
[MISC] Dump model runner inputs when crashing ( #8305 )
2024-09-12 01:12:25 +00:00
3fd2b0d21c
Bump version to v0.6.1 ( #8379 )
2024-09-11 14:42:11 -07:00
d394787e52
Pixtral ( #8377 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-11 14:41:55 -07:00
775f00f81e
[Speculative Decoding] Test refactor ( #8317 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-11 14:07:34 -07:00
8baa454937
[Misc] Move device options to a single place ( #8322 )
2024-09-11 13:25:58 -07:00
73202dbe77
[Kernel][Misc] register ops to prevent graph breaks ( #6917 )
...
Co-authored-by: Sage Moore <sage@neuralmagic.com >
2024-09-11 12:52:19 -07:00
7015417fd4
[Bugfix] Add missing attributes in mistral tokenizer ( #8364 )
2024-09-11 11:36:54 -07:00
aea02f30de
[CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation ( #8373 )
2024-09-11 18:31:41 +00:00
0b952af458
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend ( #7257 )
2024-09-11 09:46:46 -07:00
3b7fea770f
[Model][VLM] Add Qwen2-VL model support ( #7905 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-11 09:31:19 -07:00
cea95dfb94
[Frontend] Create ErrorResponse instead of raising exceptions in run_batch ( #8347 )
2024-09-11 05:30:11 +00:00
6a512a00df
[model] Support for Llava-Next-Video model ( #7559 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-10 22:21:36 -07:00
efcf946a15
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. ( #6112 )
2024-09-11 00:38:40 -04:00
1230263e16
[Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel ( #8299 )
2024-09-11 10:11:01 +08:00
e497b8aeff
[Misc] Skip loading extra bias for Qwen2-MOE GPTQ models ( #8329 )
2024-09-10 20:59:19 -04:00
94144e726c
[CI/Build][Kernel] Update CUTLASS to 3.5.1 tag ( #8043 )
2024-09-10 23:51:58 +00:00
1d5e397aa4
[Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers ( #8172 )
2024-09-10 23:46:08 +00:00
22f3a4bc6c
[Bugfix] lookahead block table with cuda graph max capture ( #8340 )
...
[Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (#8340 )
2024-09-10 16:00:35 -07:00
b1f3e18958
[MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled ( #8342 )
2024-09-10 22:28:28 +00:00
04e7c4e771
[Misc] remove peft as dependency for prompt models ( #8162 )
2024-09-10 17:21:56 -04:00
5faedf1b62
[Spec Decode] Move ops.advance_step to flash attn advance_step ( #8224 )
2024-09-10 13:18:14 -07:00
02751a7a42
Fix ppc64le buildkite job ( #8309 )
2024-09-10 12:58:34 -07:00
f421f3cefb
[CI/Build] Enabling kernels tests for AMD, ignoring some of them that fail ( #8130 )
2024-09-10 11:51:15 -07:00
8c054b7a62
[Frontend] Clean up type annotations for mistral tokenizer ( #8314 )
2024-09-10 16:49:11 +00:00
6234385f4a
[CI/Build] enable ccache/scccache for HIP builds ( #8327 )
2024-09-10 08:55:08 -07:00
da1a844e61
[Bugfix] Fix missing post_layernorm in CLIP ( #8155 )
2024-09-10 08:22:50 +00:00
a1d874224d
Add NVIDIA Meetup slides, announce AMD meetup, and add contact info ( #8319 )
2024-09-09 23:21:00 -07:00
6cd5e5b07e
[Misc] Fused MoE Marlin support for GPTQ ( #8217 )
2024-09-09 23:02:52 -04:00
c7cb5c3335
[Misc] GPTQ Activation Ordering ( #8135 )
2024-09-09 16:27:26 -04:00
f9b4a2d415
[Bugfix] Correct adapter usage for cohere and jamba ( #8292 )
2024-09-09 11:20:46 -07:00
58fcc8545a
[Frontend] Add progress reporting to run_batch.py ( #8060 )
...
Co-authored-by: Adam Lugowski <adam.lugowski@parasail.io >
2024-09-09 11:16:37 -07:00
08287ef675
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility ( #8272 )
2024-09-09 10:45:11 -04:00
4ef41b8476
[Bugfix] Fix async postprocessor in case of preemption ( #8267 )
2024-09-07 21:01:51 -07:00
cfe712bf1a
[CI/Build] Use python 3.12 in cuda image ( #8133 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-07 13:03:16 -07:00
b962ee1470
ppc64le: Dockerfile fixed, and a script for buildkite ( #8026 )
2024-09-07 11:18:40 -07:00
36bf8150cc
[Model][VLM] Decouple weight loading logic for Paligemma ( #8269 )
2024-09-07 17:45:44 +00:00
e807125936
[Model][VLM] Support multi-images inputs for InternVL2 models ( #8201 )
2024-09-07 16:38:23 +08:00
9f68e00d27
[Bugfix] Fix broken OpenAI tensorizer test ( #8258 )
2024-09-07 08:02:39 +00:00
ce2702a923
[tpu][misc] fix typo ( #8260 )
2024-09-06 22:40:46 -07:00
795b662cff
Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) ( #8241 )
2024-09-06 20:18:16 -07:00
2f707fcb35
[Model] Multi-input support for LLaVA ( #8238 )
2024-09-07 02:57:24 +00:00
41e95c5247
[Bugfix] Fix Hermes tool call chat template bug ( #8256 )
...
Co-authored-by: Kyle Mistele <kyle@constellate.ai >
2024-09-07 10:49:01 +08:00
12dd715807
[misc] [doc] [frontend] LLM torch profiler support ( #7943 )
2024-09-06 17:48:48 -07:00
29f49cd6e3
[Model] Allow loading from original Mistral format ( #8168 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-06 17:02:05 -06:00
23f322297f
[Misc] Remove SqueezeLLM ( #8220 )
2024-09-06 16:29:03 -06:00
9db52eab3d
[Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput ( #8248 )
2024-09-06 16:26:09 -06:00
1447c97e75
[CI/Build] Increasing timeout for multiproc worker tests ( #8203 )
2024-09-06 11:51:03 -07:00
de80783b69
[Misc] Use ray[adag] dependency instead of cuda ( #7938 )
2024-09-06 09:18:35 -07:00
e5cab71531
[Frontend] Add --logprobs argument to benchmark_serving.py ( #8191 )
2024-09-06 09:01:14 -07:00
baa5467547
[BugFix] Fix Granite model configuration ( #8216 )
2024-09-06 11:39:29 +08:00
db3bf7c991
[Core] Support load and unload LoRA in api server ( #6566 )
...
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-09-05 18:10:33 -07:00
2febcf2777
[Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM ( #7962 )
2024-09-05 16:25:29 -04:00
2ee45281a5
Move verify_marlin_supported to GPTQMarlinLinearMethod ( #8165 )
2024-09-05 11:09:46 -04:00
9da25a88aa
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) ( #8029 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-05 12:48:10 +00:00
8685ba1a1e
Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) ( #7860 )
2024-09-05 11:33:37 +00:00
288a938872
[Doc] Indicate more information about supported modalities ( #8181 )
2024-09-05 10:51:53 +00:00
e39ebf5cf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. ( #8173 )
2024-09-05 05:12:26 +00:00
ba262c4e5a
[ci] Mark LoRA test as soft-fail ( #8160 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-04 20:33:12 -07:00
4624d98dbd
[Misc] Clean up RoPE forward_native ( #8076 )
2024-09-04 20:31:48 -07:00
1afc931987
[bugfix] >1.43 constraint for openai ( #8169 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-04 17:35:36 -07:00
e01c2beb7d
[Doc] [Misc] Create CODE_OF_CONDUCT.md ( #8161 )
2024-09-04 16:50:13 -07:00
32e7db2536
Bump version to v0.6.0 ( #8166 )
2024-09-04 16:34:27 -07:00
008cf886c9
[Neuron] Adding support for adding/ overriding neuron configuration a… ( #8062 )
...
Co-authored-by: Harsha Bikki <harbikh@amazon.com >
2024-09-04 16:33:43 -07:00
77d9e514a2
[MISC] Replace input token throughput with total token throughput ( #8164 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-04 20:23:22 +00:00
e02ce498be
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models ( #5649 )
...
Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com >
Co-authored-by: Kyle Mistele <kyle@constellate.ai >
2024-09-04 13:18:13 -07:00
561d6f8077
[CI] Change test input in Gemma LoRA test ( #8163 )
2024-09-04 13:05:50 -07:00
d1dec64243
[CI/Build][ROCm] Enabling LoRA tests on ROCm ( #7369 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-09-04 11:57:54 -07:00
2ad2e5608e
[MISC] Consolidate FP8 kv-cache tests ( #8131 )
2024-09-04 18:53:25 +00:00
d3311562fb
[Bugfix] remove post_layernorm in siglip ( #8106 )
2024-09-04 18:55:37 +08:00
ccd7207191
chore: Update check-wheel-size.py to read MAX_SIZE_MB from env ( #8103 )
2024-09-03 23:17:05 -07:00
855c262a6b
[Frontend] Multimodal support in offline chat ( #8098 )
2024-09-04 05:22:17 +00:00
2be8ec6e71
[Model] Add Ultravox support for multiple audio chunks ( #7963 )
2024-09-04 04:38:21 +00:00
e16fa99a6a
[Misc] Update fbgemmfp8 to use vLLMParameters ( #7972 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-03 20:12:41 -06:00
61f4a93d14
[TPU][Bugfix] Use XLA rank for persistent cache path ( #8137 )
2024-09-03 18:35:33 -07:00
d4db9f53c8
[Benchmark] Add --async-engine option to benchmark_throughput.py ( #7964 )
2024-09-03 20:57:41 -04:00
2188a60c7e
[Misc] Update GPTQ to use vLLMParameters ( #7976 )
2024-09-03 17:21:44 -04:00
dc0b6066ab
[CI] Change PR remainder to avoid at-mentions ( #8134 )
2024-09-03 14:11:42 -07:00
0af3abe3d3
[TPU][Bugfix] Fix next_token_ids shape ( #8128 )
2024-09-03 13:29:24 -07:00
f1575dc99f
[ci] Fix GHA workflow ( #8129 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-03 13:25:09 -07:00
c02638efb3
[CI/Build] make pip install vllm work in macos (for import only) ( #8118 )
2024-09-03 12:37:08 -07:00
652c83b697
[Misc] Raise a more informative exception in add/remove_logger ( #7750 )
2024-09-03 12:28:25 -07:00
6d646d08a2
[Core] Optimize Async + Multi-step ( #8050 )
2024-09-03 18:50:29 +00:00
95a178f861
[CI] Only PR reviewers/committers can trigger CI on PR ( #8124 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-03 11:32:27 -07:00
bd852f2a8b
[Performance] Enable chunked prefill and prefix caching together ( #8120 )
...
Co-authored-by: Tao He <sighingnow@gmail.com >
Co-authored-by: Juelianqvq <Juelianqvq@noreply.github.com >
2024-09-03 10:49:18 -07:00
ec266536b7
[Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backend ( #8061 )
2024-09-03 21:37:52 +08:00
0fbc6696c2
[Bugfix] Fix single output condition in output processor ( #7881 )
2024-09-02 20:35:42 -07:00
6e36f4fa6c
improve chunked prefill performance
...
[Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874 )
2024-09-02 14:20:12 -07:00
dd2a6a82e3
[Bugfix] Fix internlm2 tensor parallel inference ( #8055 )
2024-09-02 23:48:56 +08:00
4ca65a9763
[Core][Bugfix] Accept GGUF model without .gguf extension ( #8056 )
2024-09-02 08:43:26 -04:00
e2b2aa5a0f
[TPU] Align worker index with node boundary ( #7932 )
2024-09-01 23:09:46 -07:00
e6a26ed037
[SpecDecode][Kernel] Flashinfer Rejection Sampling ( #7244 )
2024-09-01 21:23:29 -07:00
f8d60145b4
[Model] Add Granite model ( #7436 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-09-01 18:37:18 -07:00
5b86b19954
[Misc] Optional installation of audio related packages ( #8063 )
2024-09-01 14:46:57 -07:00
5231f0898e
[Frontend][VLM] Add support for multiple multi-modal items ( #8049 )
2024-08-31 16:35:53 -07:00
8423aef4c8
[BugFix][Core] Multistep Fix Crash on Request Cancellation ( #8059 )
2024-08-31 19:44:03 +00:00
4f5d8446ed
[Bugfix] Fix ModelScope models in v0.5.5 ( #8037 )
2024-08-31 00:27:58 -07:00
d05f0a9db2
[Bugfix] Fix import error in Phi-3.5-MoE ( #8052 )
2024-08-30 22:26:55 -07:00
622f8abff8
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. ( #8013 )
2024-08-30 22:18:50 -07:00
1248e8506a
[Model] Adding support for MSFT Phi-3.5-MoE ( #7729 )
...
Co-authored-by: Your Name <you@example.com >
Co-authored-by: Zeqi Lin <zelin@microsoft.com >
Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com >
2024-08-30 13:42:57 -06:00
2684efc467
[TPU][Bugfix] Fix tpu type api ( #8035 )
2024-08-30 09:01:26 -07:00
058344f89a
[Frontend]-config-cli-args ( #7737 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com >
2024-08-30 08:21:02 -07:00
98cef6a227
[Core] Increase default max_num_batched_tokens for multimodal models ( #8028 )
2024-08-30 08:20:34 -07:00
f97be32d1d
[VLM][Model] TP support for ViTs ( #7186 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-08-30 08:19:27 -07:00
afd39a4511
[Bugfix] Fix import error in Exaone model ( #8034 )
2024-08-30 08:03:28 -07:00
2148441fd3
[TPU] Support single and multi-host TPUs on GKE ( #7613 )
2024-08-30 00:27:40 -07:00
dc13e99348
[MODEL] add Exaone model support ( #7819 )
2024-08-29 23:34:20 -07:00
34a0e96d46
[Kernel] changing fused moe kernel chunk size default to 32k ( #7995 )
2024-08-30 04:11:39 +00:00
80c7b089b1
[TPU] Async output processing for TPU ( #8011 )
2024-08-29 19:35:29 -07:00
428dd1445e
[Core] Logprobs support in Multi-step ( #7652 )
2024-08-29 19:19:08 -07:00
4abed65c58
[VLM] Disallow overflowing max_model_len for multimodal models ( #7998 )
2024-08-29 17:49:04 -07:00
0c785d344d
Add more percentiles and latencies ( #7759 )
2024-08-29 16:48:11 -07:00
4664ceaad6
support bitsandbytes 8-bit and FP4 quantized models ( #7445 )
2024-08-29 19:09:08 -04:00
257afc37c5
[Neuron] Adding support for context-length, token-gen buckets. ( #7885 )
...
Co-authored-by: Harsha Bikki <harbikh@amazon.com >
2024-08-29 13:58:14 -07:00
86a677de42
[misc] update tpu int8 to use new vLLM Parameters ( #7973 )
2024-08-29 16:46:55 -04:00
d78789ac16
[Bugfix] Fix incorrect vocal embedding shards for GGUF model in tensor parallelism ( #7954 )
2024-08-29 15:54:49 -04:00
c334b1898b
extend cuda graph size for H200 ( #7894 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-08-29 12:15:04 -07:00
6b3421567d
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto ( #7985 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-29 14:53:11 -04:00
3f60f2244e
[Core] Combine async postprocessor and multi-step ( #7921 )
2024-08-29 11:18:26 -07:00
f205c09854
[Bugfix] Unify rank computation across regular decoding and speculative decoding ( #7899 )
2024-08-28 22:18:13 -07:00
ef99a78760
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." ( #7982 )
2024-08-28 21:27:06 -07:00
74d5543ec5
[VLM][Core] Fix exceptions on ragged NestedTensors ( #7974 )
2024-08-29 03:24:31 +00:00
a7f65c2be9
[torch.compile] remove reset ( #7975 )
2024-08-28 17:32:26 -07:00
4289cad37f
[Frontend] Minor optimizations to zmq decoupled front-end ( #7957 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-08-28 17:22:43 -07:00
af59df0a10
Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test ( #7961 )
2024-08-28 19:19:17 -04:00
ce6bf3a2cf
[torch.compile] avoid Dynamo guard evaluation overhead ( #7898 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-08-28 16:10:12 -07:00
3cdfe1f38b
[Bugfix] Make torch registration of punica ops optional ( #7970 )
2024-08-28 16:11:49 -06:00
fdd9daafa3
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM ( #7651 )
2024-08-28 15:06:52 -07:00
8c56e57def
[Doc] fix 404 link ( #7966 )
2024-08-28 13:54:23 -07:00
eeffde1ac0
[TPU] Upgrade PyTorch XLA nightly ( #7967 )
2024-08-28 13:10:21 -07:00
e5697d161c
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ ( #7386 )
2024-08-28 15:37:47 -04:00
b98cc28f91
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. ( #7798 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-08-28 10:01:22 -07:00
ef9baee3c5
[Bugfix][VLM] Fix incompatibility between #7902 and #7230 ( #7948 )
2024-08-28 08:11:18 -07:00
98c12cffe5
[Doc] fix the autoAWQ example ( #7937 )
2024-08-28 12:12:32 +00:00
f52a43a8b9
[ci][test] fix pp test failure ( #7945 )
2024-08-28 01:27:07 -07:00
e3580537a4
[Performance] Enable chunked prefill and prefix caching together ( #7753 )
2024-08-28 00:36:31 -07:00
f508e03e7f
[Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) ( #7911 )
2024-08-28 00:02:30 -07:00
51f86bf487
[mypy][CI/Build] Fix mypy errors ( #7929 )
2024-08-27 23:47:44 -07:00
c166e7e43e
[Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. ( #7886 )
2024-08-27 23:13:45 -04:00
bc6e42a9b1
[hardware][rocm] allow rocm to override default env var ( #7926 )
2024-08-27 19:50:06 -07:00
fab5f53e2d
[Core][VLM] Stack multimodal tensors to represent multiple images within each prompt ( #7902 )
2024-08-28 01:53:56 +00:00
9c71c97ae2
[mypy] Enable mypy type checking for vllm/core ( #7229 )
2024-08-28 07:11:14 +08:00
5340a2dccf
[Model] Add multi-image input support for LLaVA-Next offline inference ( #7230 )
2024-08-28 07:09:02 +08:00
345be0e244
[benchmark] Update TGI version ( #7917 )
2024-08-27 15:07:53 -07:00
fc911880cc
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7766 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
2024-08-27 15:07:09 -07:00
ed6f002d33
[cuda][misc] error on empty CUDA_VISIBLE_DEVICES ( #7924 )
2024-08-27 12:06:11 -07:00
b09c755be8
[Bugfix] Fix phi3v incorrect image_idx when using async engine ( #7916 )
2024-08-27 17:36:09 +00:00
42e932c7d4
[CI/Build][ROCm] Enabling tensorizer tests for ROCm ( #7237 )
2024-08-27 10:09:13 -07:00
076169f603
[Hardware][Intel GPU] Add intel GPU pipeline parallel support. ( #7810 )
2024-08-27 10:07:02 -07:00
9db642138b
[CI/Build][VLM] Cleanup multiple images inputs model test ( #7897 )
2024-08-27 15:28:30 +00:00
6fc4e6e07a
[Model] Add Mistral Tokenization to improve robustness and chat encoding ( #7739 )
2024-08-27 12:40:02 +00:00
9606c7197d
Revert #7509 ( #7887 )
2024-08-27 00:16:31 -07:00
64cc644425
[core][torch.compile] discard the compile for profiling ( #7796 )
2024-08-26 21:33:58 -07:00
39178c7fbc
[Tests] Disable retries and use context manager for openai client ( #7565 )
2024-08-26 21:33:17 -07:00
2eedede875
[Core] Asynchronous Output Processor ( #7049 )
...
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com >
2024-08-26 20:53:20 -07:00
015e6cc252
[Misc] Update compressed tensors lifecycle to remove prefix from create_weights ( #7825 )
2024-08-26 18:09:34 -06:00
760e9f71a8
[Bugfix] neuron: enable tensor parallelism ( #7562 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-08-26 15:13:13 -07:00
05826c887b
[misc] fix custom allreduce p2p cache file generation ( #7853 )
2024-08-26 15:02:25 -07:00
dd9857f5fa
[Misc] Update gptq_marlin_24 to use vLLMParameters ( #7762 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-26 17:44:54 -04:00
665304092d
[Misc] Update qqq to use vLLMParameters ( #7805 )
2024-08-26 13:16:15 -06:00
2deb029d11
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule ( #7822 )
2024-08-26 11:24:53 -07:00
029c71de11
[CI/Build] Avoid downloading all HF files in RemoteOpenAIServer ( #7836 )
2024-08-26 05:31:10 +00:00
0b769992ec
[Bugfix]: Use float32 for base64 embedding ( #7855 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2024-08-26 03:16:38 +00:00
1856aff4d6
[Spec Decoding] Streamline batch expansion tensor manipulation ( #7851 )
2024-08-25 15:45:14 -07:00
70c094ade6
[misc][cuda] improve pynvml warning ( #7852 )
2024-08-25 14:30:09 -07:00
2059b8d9ca
[Misc] Remove snapshot_download usage in InternVL2 test ( #7835 )
2024-08-25 15:53:09 +00:00
8aaf3d5347
[Model][VLM] Support multi-images inputs for Phi-3-vision models ( #7783 )
2024-08-25 11:51:20 +00:00
80162c44b1
[Bugfix] Fix Phi-3v crash when input images are of certain sizes ( #7840 )
2024-08-24 18:16:24 -07:00
aab0fcdb63
[ci][test] fix RemoteOpenAIServer ( #7838 )
2024-08-24 17:31:28 +00:00
ea9fa160e3
[ci][test] exclude model download time in server start time ( #7834 )
2024-08-24 01:03:27 -07:00
7d9ffa2ae1
[misc][core] lazy import outlines ( #7831 )
2024-08-24 00:51:38 -07:00
d81abefd2e
[Frontend] add json_schema support from OpenAI protocol ( #7654 )
2024-08-23 23:07:24 -07:00
8da48e4d95
[Frontend] Publish Prometheus metrics in run_batch API ( #7641 )
2024-08-23 23:04:22 -07:00
6885fde317
[Bugfix] Fix run_batch logger ( #7640 )
2024-08-23 13:58:26 -07:00
9db93de20c
[Core] Add multi-step support to LLMEngine ( #7789 )
2024-08-23 12:45:53 -07:00