02dbf30e9a
[Build] skip renaming files for release wheels pipeline ( #9671 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-11-14 23:31:52 -08:00
2ac6d0e75b
[Misc] Consolidate pooler config overrides ( #10351 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-15 06:59:00 +00:00
2ec8827288
[Bugfix] Qwen-vl output is inconsistent in speculative decoding ( #10350 )
2024-11-15 05:40:10 +00:00
b40cf6402e
[Model] Support Qwen2 embeddings and use tags to select model tests ( #10184 )
2024-11-14 20:23:09 -08:00
2885ba0e24
[Misc] Change RedundantReshapesPass and FusionPass logging from info to debug ( #10308 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-15 02:44:26 +00:00
bf2ddc6610
[bugfix] Fix static asymmetric quantization case ( #10334 )
...
Signed-off-by: Daniël de Kok <me@danieldk.eu >
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: Daniël de Kok <me@danieldk.eu >
2024-11-15 09:35:11 +08:00
972112d82f
[Bugfix] Fix unable to load some models ( #10312 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-14 16:55:54 -08:00
11cd1ae6ad
[Tool parsing] Improve / correct mistral tool parsing ( #10333 )
2024-11-15 00:42:49 +00:00
554af9228d
[Bugfix] use AF_INET6 for OpenAI Compatible Server with ipv6 ( #9583 )
...
Signed-off-by: xiaozijin <xiaozijin@bytedance.com >
2024-11-14 16:38:53 -08:00
b2e0ad3b59
[Perf] Reduce peak memory usage of llama ( #10339 )
...
Signed-off-by: andoorve <37849411+andoorve@users.noreply.github.com >
2024-11-15 00:38:20 +00:00
4a18fd14ba
Support Roberta embedding models ( #9387 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Co-authored-by: Flavia Beo <flavia.beo@ibm.com >
2024-11-14 21:23:29 +00:00
1dbae0329c
[Docs] Publish meetup slides ( #10331 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-14 16:19:38 +00:00
675d603400
[CI/Build] Make shellcheck happy ( #10285 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-14 09:47:53 +00:00
03025c023f
[CI/Build] Fix CPU CI online inference timeout ( #10314 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-14 16:45:32 +08:00
29f3ef26a3
[ci][distributed] disable hanging tests ( #10317 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-14 00:23:39 -08:00
294bf467ba
[Model] Add BNB quantization support for Idefics3 ( #10310 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-14 06:31:44 +00:00
52b48c1ead
[BugFix]: properly deserialize tool_calls iterator before processing by mistral-common when MistralTokenizer is used ( #9951 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-14 04:48:16 +00:00
f67ce05d0b
[Frontend] Pythonic tool parser ( #9859 )
...
Signed-off-by: Mike Depinet <mike@fixie.ai >
2024-11-14 04:14:34 +00:00
e0853b6508
[Misc] format.sh: Simplify tool_version_check ( #10305 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-14 11:12:35 +08:00
504ac53d18
[misc] error early for old-style class ( #10304 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-13 18:55:39 -08:00
15bb8330aa
[Bugfix] Fix tensor parallel for qwen2 classification model ( #10297 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-14 10:54:59 +08:00
ac49b59d8b
[Bugfix] bitsandbytes models fail to run pipeline parallel ( #10200 )
...
Signed-off-by: Hoang Cong Duc <hoangcongducltt@gmail.com >
2024-11-13 09:56:39 -07:00
0b8bb86bf1
[1/N] Initial prototype for multi-modal processor ( #10044 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-13 12:39:03 +00:00
bb7991aa29
[V1] Add missing tokenizer options for Detokenizer ( #10288 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-13 11:02:56 +00:00
d909acf9fe
[Model][LoRA]LoRA support added for idefics3 ( #10281 )
...
Signed-off-by: B-201 <Joy25810@foxmail.com >
2024-11-13 17:25:59 +08:00
b6dde33019
[Core] Flashinfer - Remove advance step size restriction ( #10282 )
2024-11-13 16:29:32 +08:00
1b886aa104
[Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 ( #9944 )
...
Signed-off-by: FurtherAI <austin.veselka@lighton.ai >
Co-authored-by: FurtherAI <austin.veselka@lighton.ai >
2024-11-13 08:28:13 +00:00
3945c82346
[Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions ( #10221 )
...
Signed-off-by: imkero <kerorek@outlook.com >
2024-11-13 07:07:22 +00:00
032fcf16ae
[Doc] Fix typo in arg_utils.py ( #10264 )
...
Signed-off-by: Xin Yang <xyang19@gmail.com >
2024-11-12 21:54:52 -08:00
56a955e774
Bump to compressed-tensors v0.8.0 ( #10279 )
...
Signed-off-by: Dipika <dipikasikka1@gmail.com >
2024-11-12 21:54:10 -08:00
bbd3e86926
[V1] Support VLMs with fine-grained scheduling ( #9871 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-11-13 04:53:13 +00:00
0d4ea3fb5c
[core][distributed] use tcp store directly ( #10275 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 17:36:08 -08:00
112fa0bbe5
[V1] Fix CI tests on V1 engine ( #10272 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 16:17:20 -08:00
377b74fe87
Revert "[ci][build] limit cmake version" ( #10271 )
2024-11-12 15:06:48 -08:00
18081451f9
[doc] improve debugging doc ( #10270 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 14:43:52 -08:00
96ae0eaeb2
[doc] fix location of runllm widget ( #10266 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-12 14:34:39 -08:00
1f55e05713
[V1] Enable Inductor when using piecewise CUDA graphs ( #10268 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 13:39:56 -08:00
8a06428c70
[LoRA] Adds support for bias in LoRA ( #5733 )
...
Signed-off-by: Umesh Deshpande <udeshpa@us.ibm.com >
Co-authored-by: Umesh Deshpande <udeshpa@us.ibm.com >
2024-11-12 11:08:40 -08:00
b41fb9d3b1
[Encoder Decoder] Update Mllama to run with both FlashAttention and XFormers ( #9982 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
2024-11-12 10:53:57 -08:00
7c65527918
[V1] Use pickle for serializing EngineCoreRequest & Add multimodal inputs to EngineCoreRequest ( #10245 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-12 08:57:14 -08:00
47db6ec831
[Frontend] Add per-request number of cached token stats ( #10174 )
2024-11-12 16:42:28 +00:00
176fcb1c71
[Bugfix] Fix QwenModel argument ( #10262 )
...
Signed-off-by: Jie Fu <jiefu@tencent.com >
2024-11-12 16:36:51 +00:00
a838ba7254
[Misc]Fix Idefics3Model argument ( #10255 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-12 13:07:11 +00:00
36c513a076
[BugFix] Do not raise a ValueError when tool_choice is set to the supported none option and tools are not defined. ( #10000 )
...
Signed-off-by: Guillaume Calmettes <gcalmettes@scaleway.com >
2024-11-12 11:13:46 +00:00
d201d41973
[CI][CPU]refactor CPU tests to allow to bind with different cores ( #10222 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2024-11-12 10:07:32 +00:00
3a28f18b0b
[doc] explain the class hierarchy in vLLM ( #10240 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 22:56:44 -08:00
812c981fa0
Splitting attention kernel file ( #10091 )
...
Signed-off-by: maleksan85 <maleksan@amd.com >
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com >
2024-11-11 22:55:07 -08:00
7f5edb5900
[Misc][LoRA] Replace hardcoded cuda device with configurable argument ( #10223 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-12 11:10:15 +08:00
eea55cca5b
[1/N] torch.compile user interface design ( #10237 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 18:01:06 -08:00
9cdba9669c
[Doc] Update help text for --distributed-executor-backend ( #10231 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-12 09:55:09 +08:00
d1c6799b88
[doc] update debugging guide ( #10236 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 15:21:12 -08:00
6ace6fba2c
[V1] AsyncLLM Implementation ( #9826 )
...
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Signed-off-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-11 23:05:38 +00:00
08f93e7439
Make shutil rename in python_only_dev ( #10233 )
...
Signed-off-by: shcheglovnd <shcheglovnd@avride.ai >
2024-11-11 14:29:19 -08:00
9d5b4e4dea
[V1] Enable custom ops with piecewise CUDA graphs ( #10228 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:58:07 -08:00
8a7fe47d32
[misc][distributed] auto port selection and disable tests ( #10226 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 11:54:59 -08:00
4800339c62
Add docs on serving with Llama Stack ( #10183 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
Co-authored-by: Russell Bryant <rbryant@redhat.com >
2024-11-11 11:28:55 -08:00
fe15729a2b
[V1] Use custom ops for piecewise CUDA graphs ( #10227 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:26:48 -08:00
330e82d34a
[v1][torch.compile] support managing cudagraph buffer ( #10203 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:10:27 -08:00
d7a4f2207b
[V1] Do not use inductor for piecewise CUDA graphs ( #10225 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 11:05:57 -08:00
f9dadfbee3
[V1] Fix detokenizer ports ( #10224 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-11 10:42:07 -08:00
25144ceed0
Bump actions/setup-python from 5.2.0 to 5.3.0 ( #10209 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 17:24:10 +00:00
e6de9784d2
[core][distributed] add stateless process group ( #10216 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 09:02:14 -08:00
36fc439de0
[Doc] fix doc string typo in block_manager swap_out function ( #10212 )
2024-11-11 08:53:07 -08:00
874f551b36
[Metrics] add more metrics ( #4464 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-12 00:17:38 +08:00
2cebda42bb
[Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner ( #10218 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-11 12:37:58 +00:00
5fb1f935b0
[V1] Allow tokenizer_mode and trust_remote_code for Detokenizer ( #10211 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-11 18:01:18 +08:00
36e4acd02a
[LoRA][Kernel] Remove the unused libentry module ( #10214 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-11 09:43:23 +00:00
58170d6503
[Hardware][CPU] Add embedding models support for CPU backend ( #10193 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-11 08:54:28 +00:00
9804ac7c7c
Bump the patch-update group with 5 updates ( #10210 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-11 07:22:40 +00:00
f89d18ff74
[6/N] pass whole config to inner model ( #10205 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 06:41:46 +00:00
f0f2e5638e
[doc] improve debugging code ( #10206 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-10 17:49:40 -08:00
ad9a78bf64
[Doc] Fix typo error in vllm/entrypoints/openai/cli_args.py ( #10196 )
2024-11-11 00:14:22 +00:00
73b9083e99
[misc] improve cloudpickle registration and tests ( #10202 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-11 00:10:53 +00:00
20cf2f553c
[Misc] small fixes to function tracing file path ( #9543 )
...
Signed-off-by: Shawn Du <shawnd200@outlook.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-10 15:21:06 -08:00
bfb7d61a7c
[doc] Polish the integration with huggingface doc ( #10195 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-10 10:22:04 -08:00
19682023b6
[Doc] Fix typo error in CONTRIBUTING.md ( #10190 )
...
Signed-off-by: FuryMartin <furymartin9910@outlook.com >
2024-11-10 07:47:24 +00:00
9fa4bdde9d
[ci][build] limit cmake version ( #10188 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 16:27:26 -08:00
51c2e1fcef
[CI/Build] Split up models tests ( #10069 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 11:39:14 -08:00
b09895a618
[Frontend][Core] Override HF config.json via CLI ( #5836 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 16:19:27 +00:00
d88bff1b96
[Frontend] add add_request_id middleware ( #9594 )
...
Signed-off-by: cjackal <44624812+cjackal@users.noreply.github.com >
2024-11-09 10:18:29 +00:00
9e37266420
bugfix: fix the bug that stream generate not work ( #2756 )
2024-11-09 10:09:48 +00:00
8a4358ecb5
[doc] explaining the integration with huggingface ( #10173 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 01:02:54 -08:00
bd46357ad9
[bugfix] fix broken tests of mlp speculator ( #10177 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 00:04:50 -08:00
f192aeba74
[Bugfix] Enable some fp8 and quantized fullgraph tests ( #10171 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-09 08:01:27 +00:00
8e1529dc57
[CI/Build] Add run-hpu-test.sh script ( #10167 )
...
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
2024-11-09 06:26:52 +00:00
1a95f10ee7
[5/N] pass the whole config to model ( #9983 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-09 14:17:28 +08:00
49d2a41a86
[Doc] Adjust RunLLM location ( #10176 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 20:07:10 -08:00
47672f38b5
[CI/Build] Fix VLM broadcast tests tensor_parallel_size passing ( #10161 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-09 04:02:59 +00:00
f83feccd7f
[Bugfix] Ignore GPTQ quantization of Qwen2-VL visual module ( #10169 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-09 03:36:46 +00:00
e0191a95d8
[0/N] Rename MultiModalInputs to MultiModalKwargs ( #10040 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 11:31:02 +08:00
d7edca1dee
[CI/Build] Adding timeout in CPU CI to avoid CPU test queue blocking ( #6892 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-09 03:27:11 +00:00
127c07480e
[Kernel][Triton] Add Triton implementation for scaled_mm_triton to support fp8 and int8 SmoothQuant, symmetric case ( #9857 )
...
Signed-off-by: Randall Smith <Randall.Smith@amd.com >
2024-11-08 19:59:22 -05:00
10b67d865d
[Bugfix] SymIntArrayRef expected to contain concrete integers ( #10170 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-08 14:44:18 -08:00
4f93dfe952
[torch.compile] Fuse RMSNorm with quant ( #9138 )
...
Signed-off-by: luka <luka@neuralmagic.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-11-08 21:20:08 +00:00
e1b5a82179
Rename vllm.logging to vllm.logging_utils ( #10134 )
2024-11-08 20:53:24 +00:00
87713c6053
[CI/Build] Ignore .gitignored files for shellcheck ( #10162 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2024-11-08 19:53:36 +00:00
b5815c8413
[V1] Fix non-cudagraph op name ( #10166 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-08 10:23:04 -08:00
6b30471586
[Misc] Improve Web UI ( #10090 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-11-08 09:51:04 -08:00
f6778620a9
Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 ( #10136 )
...
Signed-off-by: Sourashis Roy <sroy@roblox.com >
2024-11-08 15:56:18 +00:00
0535e5fe6c
Fix edge case Mistral tokenizer ( #10152 )
2024-11-08 15:42:27 +00:00
b489fc3c91
[CI/Build] Update CPU tests to include all "standard" tests ( #5481 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 23:30:04 +08:00
208ce622c7
[V1]Enable APC by default only for text models ( #10148 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-08 14:39:41 +00:00
1ff4aed5bd
[Model] Expose size to Idefics3 as mm_processor_kwargs ( #10146 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-08 09:56:58 +00:00
f10797c0ce
[Bugfix][XPU] Fix xpu tp by introducing XpuCommunicator ( #10144 )
...
Signed-off-by: yan ma <yan.ma@intel.com >
2024-11-08 09:41:03 +00:00
f4c2187e29
[Misc] Fix typo in #5895 ( #10145 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-08 09:07:01 +00:00
aea6ad629f
Add hf_transfer to testing image ( #10096 )
2024-11-08 08:35:25 +00:00
da07a9ead7
Fixes a typo about 'max_decode_seq_len' which causes crashes with cuda graph. ( #9285 )
...
Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com >
2024-11-08 05:31:28 +00:00
3a7f15a398
[Doc] Move CONTRIBUTING to docs site ( #9924 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-08 05:15:12 +00:00
7371749d54
[Misc] Fix ImportError causing by triton ( #9493 )
2024-11-08 05:08:51 +00:00
ad39bd640c
[Bugfix] Add error handling when server cannot respond any valid tokens ( #5895 )
2024-11-08 04:58:37 +00:00
40d0e7411d
[Doc] Update FAQ links in spec_decode.rst ( #9662 )
...
Signed-off-by: whyiug <whyiug@hotmail.com >
2024-11-08 04:44:58 +00:00
6bb52b0f97
[CI/Build] Give PR cleanup job PR write access ( #10139 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-08 12:10:20 +08:00
201fc07730
[V1] Prefix caching (take 2) ( #9972 )
...
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com >
2024-11-07 17:34:44 -08:00
42b4f46b71
[V1] Add all_token_ids attribute to Request ( #10135 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-07 17:08:24 -08:00
073a472728
[Misc] report relevant env vars in collect_env.py tool ( #9293 )
2024-11-07 16:14:01 -08:00
93bff421bc
Bump actions/checkout from 4.2.1 to 4.2.2 ( #9746 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 21:44:58 +00:00
28b2877d30
Online video support for VLMs ( #10020 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-07 20:25:59 +00:00
97b8475beb
Bump actions/setup-python from 5.2.0 to 5.3.0 ( #9745 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-07 18:55:35 +00:00
a2f1f3b089
[CI/Build] Automate PR body text cleanup ( #10082 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 18:26:28 +00:00
3be5b26a76
[CI/Build] Add shell script linting using shellcheck ( #7925 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 18:17:29 +00:00
de0e61a323
[CI/Build] Always run mypy ( #10122 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 16:43:16 +00:00
9d43afcc53
[Feature] [Spec decode]: Combine chunked prefill with speculative decoding ( #9291 )
...
Signed-off-by: NickLucche <nlucches@redhat.com >
2024-11-07 08:15:14 -08:00
ae62fd17c0
[Frontend] Tool calling parser for Granite 3.0 models ( #9027 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-11-07 07:09:02 -08:00
a62bc0109c
[Misc] Add Gamma-Distribution Request Generation Support for Serving Benchmark. ( #10105 )
...
Signed-off-by: Mozhou <spli161006@gmail.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-11-07 11:20:30 +00:00
999df95b4e
[Bugfix] Make image processor respect mm_processor_kwargs for Qwen2-VL ( #10112 )
...
Signed-off-by: Jiahao Li <liplus17@163.com >
2024-11-07 10:50:44 +00:00
a6f332d0d9
[Hardware][CPU][bugfix] Fix half dtype support on AVX2-only target ( #10108 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-07 18:42:50 +08:00
0dfba97b42
[Frontend] Fix multiple values for keyword argument error ( #10075 ) ( #10076 )
...
Signed-off-by: Lei <ylxx@live.com >
2024-11-07 09:07:19 +00:00
aa9078fa03
Adds method to read the pooling types from model's files ( #9506 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
2024-11-07 08:42:40 +00:00
e036e527a0
[CI/Build] Improve mypy + python version matrix ( #10041 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-07 07:54:16 +00:00
6192e9b8fe
[Core][Distributed] Refactor ipc buffer init in CustomAllreduce ( #10030 )
...
Signed-off-by: Hanzhi Zhou <hanzhi713@gmail.com >
2024-11-06 23:50:47 -08:00
d7263a1bb8
Doc: Improve benchmark documentation ( #9927 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-11-06 23:50:35 -08:00
104d729656
[CI/Build] re-add codespell to CI ( #10083 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 22:54:46 -08:00
db7db4aab9
[Misc] Consolidate ModelConfig code related to HF config ( #10104 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-07 06:00:21 +00:00
1fa020c539
[V1][BugFix] Fix Generator construction in greedy + seed case ( #10097 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2024-11-07 05:06:57 +00:00
e7b84c394d
[doc] add back Python 3.8 ABI ( #10100 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-06 21:06:41 -08:00
a4b3e0c1e9
[Hardware][CPU] Update torch 2.5 ( #9911 )
...
Signed-off-by: jiang1.li <jiang1.li@intel.com >
2024-11-07 04:43:08 +00:00
29862b884b
[Frontend] Adjust try/except blocks in API impl ( #10056 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
2024-11-06 20:07:51 -08:00
d3859f1891
[Misc][XPU] Upgrade to Pytorch 2.5 for xpu backend ( #9823 )
...
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com >
Signed-off-by: yan ma <yan.ma@intel.com >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
2024-11-06 17:29:03 -08:00
4ab3256644
[Bugfix] Fix FP8 torch._scaled_mm fallback for torch>2.5 with CUDA<12.4 ( #10095 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-07 00:54:13 +00:00
719c1ca468
[core][distributed] add stateless_init_process_group ( #10072 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-06 16:42:09 -08:00
74f2f8a0f1
[CI/Build] Always run the ruff workflow ( #10092 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 22:25:23 +00:00
d58268c56a
[V1] Make v1 more testable ( #9888 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-11-06 11:57:35 -08:00
87bd7e0515
[CI/Build] change conflict PR comment from mergify ( #10080 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-06 10:15:42 -08:00
098f94de42
[CI/Build] Drop Python 3.8 support ( #10038 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-06 14:31:01 +00:00
399c798608
Remove ScaledActivation for AWQ ( #10057 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-06 14:27:06 +00:00
406d4cc480
[Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration ( #10022 )
...
Signed-off-by: ericperfect <ericperfectttt@gmail.com >
2024-11-06 14:13:15 +00:00
a5bba7d234
[Model] Add Idefics3 support ( #9767 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
Signed-off-by: B-201 <Joy25810@foxmail.com >
Co-authored-by: B-201 <Joy25810@foxmail.com >
2024-11-06 11:41:17 +00:00
2003cc3513
[Model][LoRA]LoRA support added for LlamaEmbeddingModel ( #10071 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-06 09:49:19 +00:00
6a585a23d2
[Hotfix] Fix ruff errors ( #10073 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-06 01:24:28 -08:00
a02a50e6e5
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend ( #6143 )
...
Signed-off-by: yuwenzho <yuwen.zhou@intel.com >
Signed-off-by: Chendi.Xue <chendi.xue@intel.com >
Signed-off-by: Bob Zhu <bob.zhu@intel.com >
Signed-off-by: zehao-intel <zehao.huang@intel.com >
Signed-off-by: Konrad Zawora <kzawora@habana.ai >
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Sanju C Sudhakaran <scsudhakaran@habana.ai >
Co-authored-by: Michal Adamczyk <madamczyk@habana.ai >
Co-authored-by: Marceli Fylcek <mfylcek@habana.ai >
Co-authored-by: Himangshu Lahkar <49579433+hlahkar@users.noreply.github.com >
Co-authored-by: Vivek Goel <vgoel@habana.ai >
Co-authored-by: yuwenzho <yuwen.zhou@intel.com >
Co-authored-by: Dominika Olszewska <dolszewska@habana.ai >
Co-authored-by: barak goldberg <149692267+bgoldberg-habana@users.noreply.github.com >
Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com >
Co-authored-by: Jan Kaniecki <jkaniecki@habana.ai >
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyniewicz-habana@users.noreply.github.com >
Co-authored-by: Krzysztof Wisniewski <kwisniewski@habana.ai >
Co-authored-by: Dudi Lester <160421192+dudilester@users.noreply.github.com >
Co-authored-by: Ilia Taraban <tarabanil@gmail.com >
Co-authored-by: Chendi.Xue <chendi.xue@intel.com >
Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai >
Co-authored-by: Jakub Maksymczuk <jmaksymczuk@habana.ai >
Co-authored-by: Tomasz Zielinski <85164140+tzielinski-habana@users.noreply.github.com >
Co-authored-by: Sun Choi <schoi@habana.ai >
Co-authored-by: Iryna Boiko <iboiko@habana.ai >
Co-authored-by: Bob Zhu <41610754+czhu15@users.noreply.github.com >
Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com >
Co-authored-by: Zehao Huang <zehao.huang@intel.com >
Co-authored-by: Andrzej Kotłowski <Andrzej.Kotlowski@intel.com >
Co-authored-by: Yan Tomsinsky <73292515+Yantom1@users.noreply.github.com >
Co-authored-by: Nir David <ndavid@habana.ai >
Co-authored-by: Yu-Zhou <yu.zhou@intel.com >
Co-authored-by: Ruheena Suhani Shaik <rsshaik@habana.ai >
Co-authored-by: Karol Damaszke <kdamaszke@habana.ai >
Co-authored-by: Marcin Swiniarski <mswiniarski@habana.ai >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Jacek Czaja <jacek.czaja@intel.com >
Co-authored-by: Jacek Czaja <jczaja@habana.ai >
Co-authored-by: Yuan <yuan.zhou@outlook.com >
2024-11-06 01:09:10 -08:00
a5fda50a10
[CI/Build] Fix large_gpu_mark reason ( #10070 )
...
Signed-off-by: Isotr0py <2037008807@qq.com >
2024-11-06 08:50:37 +00:00
21063c11c7
[CI/Build] drop support for Python 3.8 EOL ( #8464 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
2024-11-06 07:11:55 +00:00
4be3a45158
[distributed] add function to create ipc buffers directly ( #10064 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 22:35:03 -08:00
4089985552
[V1] Integrate Piecewise CUDA graphs ( #10058 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-11-05 22:16:04 -08:00
9d59b75593
[Bugfix] Remove CustomChatCompletionContentPartParam multimodal input type ( #10054 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2024-11-06 05:13:09 +00:00
ea928f608c
[Bugfix] Gpt-j-6B patch kv_scale to k_scale path ( #10063 )
...
Signed-off-by: Alex Rakowski <alex.rakowski@amd.com >
Signed-off-by: Alex Rakowski <182798202+arakowsk-amd@users.noreply.github.com >
2024-11-06 05:10:40 +00:00
2bcbae704c
[Bugfix] Fix edge-case crash when using chat with the Mistral Tekken Tokenizer ( #10051 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-11-06 04:28:29 +00:00
ffc0f2b47a
[Model][OpenVINO] Fix regressions from #8346 ( #10045 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2024-11-06 04:19:15 +00:00
82bfc38d07
[Misc] Sort the list of embedding models ( #10037 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-06 04:05:05 +00:00
c4cacbaa7f
[v1] reduce graph capture time for piecewise cudagraph ( #10059 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 18:19:50 -08:00
0c63c34f72
[Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode ( #9730 )
...
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2024-11-06 01:45:45 +00:00
966e31697b
[Bugfix] Fix pickle of input when async output processing is on ( #9931 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-11-06 00:39:26 +00:00
43300bd98a
[Bugfix] Properly propagate trust_remote_code settings ( #10047 )
...
Signed-off-by: Zifei Tong <zifeitong@gmail.com >
2024-11-05 16:34:40 -08:00
ca9844b340
[bugfix] fix weak ref in piecewise cudagraph and tractable test ( #10048 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-05 14:49:20 -08:00
235366fe2e
[CI] Prune back the number of tests in tests/kernels/* ( #9932 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 16:02:32 -05:00
02462465ea
[CI] Prune tests/models/decoder_only/language/* tests ( #9940 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 16:02:23 -05:00
b9c64c0ca7
[Misc] Modify BNB parameter name ( #9997 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-05 14:40:08 -05:00
d2e80332a7
[Feature] Update benchmark_throughput.py to support image input ( #9851 )
...
Signed-off-by: Linkun Chen <github+anyscale@lkchen.net >
Co-authored-by: Linkun Chen <github+anyscale@lkchen.net >
2024-11-05 19:30:02 +00:00
a53046b16f
[Model] Support quantization of PixtralHFTransformer for PixtralHF ( #9921 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-05 10:42:20 -08:00
731aec5be7
[CI/Build] Limit github CI jobs based on files changed ( #9928 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-05 10:30:42 -08:00
09d3550372
[Misc] Add logging for CUDA memory ( #10027 )
...
Signed-off-by: Chenghao Yang <yangalan1996@gmail.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Chenghao Yang <yangalan1996@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-05 09:50:50 -08:00
cd34029e91
Refactor TPU requirements file and pin build dependencies ( #10010 )
...
Signed-off-by: Richard Liu <ricliu@google.com >
2024-11-05 16:48:44 +00:00
5952d81139
[Frontend] Fix tcp port reservation for api server ( #10012 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-11-05 07:50:57 -08:00
93dee88f6b
[Misc] vllm CLI flags should be ordered for better user readability ( #10017 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-05 18:59:56 +08:00
7a83b1aec0
[BugFix] Lazy import ray ( #10021 )
2024-11-05 10:04:10 +00:00
ad23318928
[Bugfix] Fixup Mamba ( #10004 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-05 03:46:38 +00:00
bbc3619dc8
[Core] Make encoder-decoder inputs a nested structure to be more composable ( #9604 )
...
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-11-05 10:07:31 +08:00
04bbf38e05
[Core] Use os.sched_yield in ShmRingBuffer instead of time.sleep ( #9994 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-11-05 01:08:21 +00:00
8f0a9ca890
[Bugfix] Respect modules_to_not_convert within awq_marlin ( #9895 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-04 16:57:44 -07:00
2094062b4e
[4.5/N] bugfix for quant config in speculative decode ( #10007 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-04 15:11:59 -08:00
d93478b399
[Bugfix] Upgrade to pytorch 2.5.1 ( #10001 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
2024-11-04 15:11:28 -08:00
ac04a97a9f
[Frontend] Add max_tokens prometheus metric ( #9881 )
...
Signed-off-by: Tomer Asida <tomera@ai21.com >
2024-11-04 22:53:24 +00:00
9a5664d4a4
[Misc] Refactor benchmark_throughput.py ( #9779 )
...
Signed-off-by: Linkun Chen <github+anyscale@lkchen.net >
Co-authored-by: Linkun Chen <lkchen@github.com >
Co-authored-by: Linkun Chen <github+anyscale@lkchen.net >
2024-11-04 14:32:16 -08:00
04cef2c6ab
[Bugfix] Fix MQLLMEngine hanging ( #9973 )
...
Signed-off-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
2024-11-04 16:01:43 -05:00
6e056bcf04
[Doc] Update VLM doc about loading from local files ( #9999 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-11-04 19:47:11 +00:00
5208dc7a20
[Bugfix][CI/Build][Hardware][AMD] Shard ID parameters in AMD tests running parallel jobs ( #9279 )
...
Signed-off-by: Hissu Hyvarinen <hissu.hyvarinen@amd.com >
2024-11-04 11:37:46 -08:00
1c45f4c385
[CI] Basic Integration Test For TPU ( #9968 )
...
Signed-off-by: Robert Shaw <rshaw@neuralmagic.com >
2024-11-04 11:34:26 -08:00
603a661ae8
[Model] Factor MambaMixer out of Jamba ( #8993 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-11-04 18:00:00 +00:00
fb2716d641
[Misc] Reduce BNB static variable ( #9987 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-04 17:04:40 +00:00
8d72bb20fa
[4/N] make quant config first-class citizen ( #9978 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-04 08:51:31 -08:00
ac6b8f19b9
[Frontend] Multi-Modality Support for Loading Local Image Files ( #9915 )
...
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com >
2024-11-04 15:34:57 +00:00
ccb5376a9a
[Bugfix][OpenVINO] Fix circular reference #9939 ( #9974 )
...
Signed-off-by: MengqingCao <cmq0113@163.com >
2024-11-04 18:14:13 +08:00
ea4adeddc1
[Bugfix] Fix E2EL mean and median stats ( #9984 )
...
Signed-off-by: daitran2k1 <tranquangdai7a@gmail.com >
2024-11-04 09:37:58 +00:00
4dbcbbeb09
[Misc] Compute query_start_loc/seq_start_loc on CPU ( #9447 )
...
Co-authored-by: Yang Zheng(SW)(Alex) <you@example.com >
2024-11-04 08:54:37 +00:00
b67feb1274
[Bugfix] Using the correct type hints ( #9885 )
...
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
2024-11-04 06:19:51 +00:00
c49f0407ba
[Bugfix] Fix MiniCPMV and Mllama BNB bug ( #9917 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-11-04 03:36:41 +00:00
91c9ebbb1b
[V1] Fix Configs ( #9971 )
2024-11-04 00:24:40 +00:00
54597724f4
[Model] Add support for H2OVL-Mississippi models ( #9747 )
...
Signed-off-by: Shanshan Wang <shanshan.wang@h2o.ai >
Signed-off-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-11-04 00:15:36 +00:00
1f1b6d6eda
[V1] Support per-request seed ( #9945 )
...
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
2024-11-03 09:14:17 -08:00
3bb4befea7
[bugfix] fix tests ( #9959 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 15:54:05 -07:00
ae5279a163
[torch.compile] Adding torch compile to vision-language models ( #9946 )
2024-11-02 12:56:05 -07:00
1b73ab2a1f
[CI/Build] Quoting around > ( #9956 )
2024-11-02 12:50:28 -07:00
cea808f325
[3/N] model runner pass the whole config to model ( #9958 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 12:08:49 -07:00
74b529ceee
[bugfix] fix chatglm dummy_data_for_glmv ( #9955 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-02 08:03:33 -07:00
d6459b4516
[V1] Fix EngineArgs refactor on V1 ( #9954 )
2024-11-02 07:44:38 -07:00
e893795443
[2/N] executor pass the complete config to worker/modelrunner ( #9938 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Nick Hill <nhill@redhat.com >
2024-11-02 07:35:05 -07:00
1d4cfe2be1
[Doc] Updated tpu-installation.rst with more details ( #9926 )
...
Signed-off-by: Michael Green <mikegre@google.com >
2024-11-02 10:06:45 -04:00
eed92f12fc
[Docs] Update Granite 3.0 models in supported models table ( #9930 )
...
Signed-off-by: Nick Hill <nhill@redhat.com >
Signed-off-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-11-02 09:02:18 +00:00
af7380d83b
[torch.compile] fix cpu broken code ( #9947 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 23:35:47 -07:00
a78dd3303e
[Encoder Decoder] Add flash_attn kernel support for encoder-decoder models ( #9559 )
2024-11-01 23:22:49 -07:00
d522034c85
[ci/build] Have dependabot ignore pinned dependencies ( #9935 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-11-01 23:56:13 +00:00
6c0b7f548d
[Core][VLM] Add precise multi-modal placeholder tracking ( #8346 )
...
Signed-off-by: Peter Salas <peter@fixie.ai >
2024-11-01 16:21:10 -07:00
d151fde834
[ci/build] Bump the patch-update group with 10 updates ( #9897 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Kevin H. Luu <kevin@anyscale.com >
2024-11-01 23:04:42 +00:00
27cd36e6e2
[Bugfix] PicklingError on RayTaskError ( #9934 )
...
Signed-off-by: Gene Su <e870252314@gmail.com >
2024-11-01 22:08:23 +00:00
18bd7587b7
[1/N] pass the complete config from engine to executor ( #9933 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 13:51:57 -07:00
598b6d7b07
[Bugfix/Core] Flashinfer k_scale and v_scale ( #9861 )
2024-11-01 12:15:05 -07:00
aff1fd8188
[torch.compile] use interpreter with stable api from pytorch ( #9889 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-11-01 11:50:37 -07:00
4581d2cc02
[Core] Refactor: Clean up unused argument in Scheduler._preempt ( #9696 )
...
Signed-off-by: André Jonasson <andre.jonasson@gmail.com >
2024-11-01 11:41:38 -07:00
1dd4cb2935
[Bugfix] Fix edge cases for MistralTokenizer ( #9625 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com >
2024-11-01 10:33:15 -07:00
ba0d892074
[Frontend] Use a proper chat template for VLM2Vec ( #9912 )
2024-11-01 14:09:07 +00:00
30a2e80742
[CI/Build] Add Model Tests for PixtralHF ( #9813 )
2024-11-01 07:55:29 -06:00
06386a64dd
[Frontend] Chat-based Embeddings API ( #9759 )
2024-11-01 08:13:35 +00:00
d3aa2a8b2f
[Doc] Update multi-input support ( #9906 )
2024-11-01 07:34:49 +00:00
2b5bf20988
[torch.compile] Adding torch compile annotations to some models ( #9876 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-11-01 00:25:47 -07:00
93a76dd21d
[Model] Support bitsandbytes for MiniCPMV ( #9891 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-01 13:31:56 +08:00
566cd27797
[torch.compile] rework test plans ( #9866 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-31 22:20:17 -07:00
37a4947dcd
[Bugfix] Fix layer skip logic with bitsandbytes ( #9887 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-11-01 13:12:44 +08:00
96e0c9cbbd
[torch.compile] directly register custom op ( #9896 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-31 21:56:09 -07:00
031a7995f3
[Bugfix][Frontend] Reject guided decoding in multistep mode ( #9892 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-11-01 01:09:46 +00:00
b63c64d95b
[ci/build] Configure dependabot to update pip dependencies ( #9811 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-31 15:55:38 -07:00
9fb12f7848
[BugFix][Kernel] Fix Illegal memory access in causal_conv1d in H100 ( #9838 )
...
Signed-off-by: mzusman <mor.zusmann@gmail.com >
2024-10-31 20:06:25 +00:00
55650c83a0
[Bugfix] Fix illegal memory access error with chunked prefill, prefix caching, block manager v2 and xformers enabled together ( #9532 )
...
Signed-off-by: sasha0552 <admin@sasha0552.org >
2024-10-31 11:46:36 -07:00
77f7ef2908
[CI/Build] Adding a forced docker system prune to clean up space ( #9849 )
2024-11-01 01:02:58 +08:00
16b8f7a86f
[CI/Build] Add Model Tests for Qwen2-VL ( #9846 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-31 09:10:52 -07:00
5608e611c2
[Doc] Update Qwen documentation ( #9869 )
2024-10-31 08:54:18 +00:00
3ea2dc2ec4
[Misc] Remove deprecated arg for cuda graph capture ( #9864 )
...
Signed-off-by: Roger Wang <ywang@roblox.com >
2024-10-31 07:22:07 +00:00
d087bf863e
[Model] Support quantization of Qwen2VisionTransformer ( #9817 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-30 22:41:20 -07:00
890ca36072
Revert "[Bugfix] Use host argument to bind to interface ( #9798 )" ( #9852 )
2024-10-31 01:44:51 +00:00
abbfb6134d
[Misc][OpenAI] deprecate max_tokens in favor of new max_completion_tokens field for chat completion endpoint ( #9837 )
2024-10-30 18:15:56 -07:00
64384bbcdf
[torch.compile] upgrade tests ( #9858 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-30 16:34:22 -07:00
00d91c8a2c
[CI/Build] Simplify exception trace in api server tests ( #9787 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-30 14:52:05 -07:00
c2cd1a2142
[doc] update pp support ( #9853 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-30 13:36:51 -07:00
c787f2d81d
[Neuron] Update Dockerfile.neuron to fix build failure ( #9822 )
2024-10-30 12:22:02 -07:00
33d257735f
[Doc] link bug for multistep guided decoding ( #9843 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-30 17:28:29 +00:00
3b3f1e7436
[Bugfix][core] replace heartbeat with pid check ( #9818 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-30 09:34:07 -07:00
9ff4511e43
[Misc] Add chunked-prefill support on FlashInfer. ( #9781 )
2024-10-30 09:33:53 -07:00
81f09cfd80
[Model] Support math-shepherd-mistral-7b-prm model ( #9697 )
...
Signed-off-by: Went-Liang <wenteng_liang@163.com >
2024-10-30 09:33:42 -07:00
cc98f1e079
[CI/Build] VLM Test Consolidation ( #9372 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-30 09:32:17 -07:00
211fe91aa8
[TPU] Correctly profile peak memory usage & Upgrade PyTorch XLA ( #9438 )
2024-10-30 09:41:38 +00:00
6aa6020f9b
[Misc] Specify minimum pynvml version ( #9827 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-29 23:05:43 -07:00
ff5ed6e1bc
[torch.compile] rework compile control with piecewise cudagraph ( #9715 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-29 23:03:49 -07:00
7b0365efef
[Doc] Add the DCO to CONTRIBUTING.md ( #9803 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-30 05:22:23 +00:00
04a3ae0aca
[Bugfix] Fix multi nodes TP+PP for XPU ( #8884 )
...
Signed-off-by: YiSheng5 <syhm@mail.ustc.edu.cn >
Signed-off-by: yan ma <yan.ma@intel.com >
Co-authored-by: YiSheng5 <syhm@mail.ustc.edu.cn >
2024-10-29 21:34:45 -07:00
62fac4b9aa
[ci/build] Pin CI dependencies version with pip-compile ( #9810 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-30 03:34:55 +00:00
226688bd61
[Bugfix][VLM] Make apply_fp8_linear work with >2D input ( #9812 )
2024-10-29 19:49:44 -07:00
64cb1cdc3f
Update README.md ( #9819 )
2024-10-29 17:28:43 -07:00
1ab6f6b4ad
[core][distributed] fix custom allreduce in pytorch 2.5 ( #9815 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-29 17:06:24 -07:00
bc73e9821c
[Bugfix] Fix prefix strings for quantized VLMs ( #9772 )
2024-10-29 16:02:59 -07:00
8d7724104a
[Docs] Add notes about Snowflake Meetup ( #9814 )
...
Signed-off-by: simon-mo <simon.mo@hey.com >
2024-10-29 15:19:02 -07:00
882a1ad0de
[Model] tool calling support for ibm-granite/granite-20b-functioncalling ( #8339 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com >
2024-10-29 15:07:37 -07:00
67bdf8e523
[Bugfix][Frontend] Guard against bad token ids ( #9634 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-29 14:13:20 -07:00
0ad216f575
[MISC] Set label value to timestamp over 0, to keep track of recent history ( #9777 )
...
Signed-off-by: Kunjan Patel <kunjanp@google.com >
2024-10-29 19:52:19 +00:00
7585ec996f
[CI/Build] mergify: fix rules for ci/build label ( #9804 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-29 19:24:42 +00:00
ab6f981671
[CI][Bugfix] Skip chameleon for transformers 4.46.1 ( #9808 )
2024-10-29 11:12:43 -07:00
ac3d748dba
[Model] Add LlamaEmbeddingModel as an embedding Implementation of LlamaModel ( #9806 )
2024-10-29 10:40:35 -07:00
0ce7798f44
[Misc]: Typo fix: Renaming classes (casualLM -> causalLM) ( #9801 )
...
Signed-off-by: Yannick Schnider <Yannick.Schnider1@ibm.com >
2024-10-29 10:39:20 -07:00
0f43387157
[Bugfix] Use host argument to bind to interface ( #9798 )
2024-10-29 10:37:59 -07:00
08600ddc68
Fix the log to correct guide user to install modelscope ( #9793 )
...
Signed-off-by: yuze.zyz <yuze.zyz@alibaba-inc.com >
2024-10-29 10:36:59 -07:00
74fc2d77ae
[Misc] Add metrics for request queue time, forward time, and execute time ( #9659 )
2024-10-29 10:32:56 -07:00
622b7ab955
[Hardware] using current_platform.seed_everything ( #9785 )
...
Signed-off-by: wangshuai09 <391746016@qq.com >
2024-10-29 14:47:44 +00:00
09500f7dde
[Model] Add BNB quantization support for Mllama ( #9720 )
2024-10-29 08:20:02 -04:00
ef7865b4f9
[Frontend] re-enable multi-modality input in the new beam search implementation ( #9427 )
...
Signed-off-by: Qishuai <Ferdinandzhong@gmail.com >
2024-10-29 11:49:47 +00:00
eae3d48181
[Bugfix] Use temporary directory in registry ( #9721 )
2024-10-28 22:08:20 -07:00
e74f2d448c
[Doc] Specify async engine args in docs ( #9726 )
2024-10-28 22:07:57 -07:00
7a4df5f200
[Model][LoRA]LoRA support added for Qwen ( #9622 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-29 04:14:07 +00:00
c5d7fb9ddc
[Doc] fix third-party model example ( #9771 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-28 19:39:21 -07:00
76ed5340f0
[torch.compile] add deepseek v2 compile ( #9775 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-28 14:35:17 -07:00
97b61bfae6
[misc] avoid circular import ( #9765 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-28 20:51:23 +00:00
aa0addb397
Adding "torch compile" annotations to moe models ( #9758 )
2024-10-28 13:49:56 -07:00
5f8d8075f9
[Model][VLM] Add multi-video support for LLaVA-Onevision ( #8905 )
...
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-28 18:04:10 +00:00
8b0e4f2ad7
[CI/Build] Adopt Mergify for auto-labeling PRs ( #9259 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-28 09:38:09 -07:00
2adb4409e0
[Bugfix] Fix ray instance detect issue ( #9439 )
2024-10-28 07:13:03 +00:00
feb92fbe4a
Fix beam search eos ( #9627 )
2024-10-28 06:59:37 +00:00
32176fee73
[torch.compile] support moe models ( #9632 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-27 21:58:04 -07:00
4e2d95e372
[Hardware][ROCM] using current_platform.is_rocm ( #9642 )
...
Signed-off-by: wangshuai09 <391746016@qq.com >
2024-10-28 04:07:00 +00:00
34a9941620
[Bugfix] Fix load config when using bools ( #9533 )
2024-10-27 13:46:41 -04:00
e130c40e4e
Fix cache management in "Close inactive issues and PRs" actions workflow ( #9734 )
2024-10-27 10:30:03 -07:00
3cb07a36a2
[Misc] Upgrade to pytorch 2.5 ( #9588 )
...
Signed-off-by: Bill Nell <bill@neuralmagic.com >
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-27 09:44:24 +00:00
8549c82660
[core] cudagraph output with tensor weak reference ( #9724 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-27 00:19:28 -07:00
67a6882da4
[Misc] SpecDecodeWorker supports profiling ( #9719 )
...
Signed-off-by: Abatom <abatom@163.com >
2024-10-27 04:18:03 +00:00
6650e6a930
[Model] Add classification Task with Qwen2ForSequenceClassification ( #9704 )
...
Signed-off-by: Kevin-Yang <ykcha9@gmail.com >
Co-authored-by: Kevin-Yang <ykcha9@gmail.com >
2024-10-26 17:53:35 +00:00
07e981fdf4
[Frontend] Bad words sampling parameter ( #9717 )
...
Signed-off-by: Vasily Alexeev <alvasian@yandex.ru >
2024-10-26 16:29:38 +00:00
55137e8ee3
Fix: MI100 Support By Bypassing Custom Paged Attention ( #9560 )
2024-10-26 12:12:57 +00:00
5cbdccd151
[Hardware][openvino] is_openvino --> current_platform.is_openvino ( #9716 )
2024-10-26 10:59:06 +00:00
067e77f9a8
[Bugfix] Streaming continuous_usage_stats default to False ( #9709 )
...
Signed-off-by: Sam Stoelinga <sammiestoel@gmail.com >
2024-10-26 05:05:47 +00:00
6567e13724
[Bugfix] Fix crash with llama 3.2 vision models and guided decoding ( #9631 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: pavlo-ruban <pavlo.ruban@servicenow.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-25 15:42:56 -07:00
228cfbd03f
[Doc] Improve quickstart documentation ( #9256 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-25 14:32:10 -07:00
ca0d92227e
[Bugfix] Fix compressed_tensors_moe bad config.strategy ( #9677 )
2024-10-25 12:40:33 -07:00
9645b9f646
[V1] Support sliding window attention ( #9679 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-10-24 22:20:37 -07:00
a6f3721861
[Model] add a lora module for granite 3.0 MoE models ( #9673 )
2024-10-24 22:00:17 -07:00
9f7b4ba865
[ci/Build] Skip Chameleon for transformers 4.46.0 on broadcast test #9675 ( #9676 )
2024-10-24 20:59:00 -07:00
c91ed47c43
[Bugfix] Remove xformers requirement for Pixtral ( #9597 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 15:38:05 -07:00
59449095ab
[Performance][Kernel] Fused_moe Performance Improvement ( #9384 )
...
Signed-off-by: charlifu <charlifu@amd.com >
2024-10-24 15:37:52 -07:00
e26d37a185
[Log][Bugfix] Fix default value check for image_url.detail ( #9663 )
2024-10-24 10:44:38 -07:00
722d46edb9
[Model] Compute Llava Next Max Tokens / Dummy Data From Gridpoints ( #9650 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-24 10:42:24 -07:00
c866e0079d
[CI/Build] Fix VLM test failures when using transformers v4.46 ( #9666 )
2024-10-25 01:40:40 +08:00
d27cfbf791
[torch.compile] Adding torch compile annotations to some models ( #9641 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 09:31:42 -07:00
de662d32b5
Increase operation per run limit for "Close inactive issues and PRs" workflow ( #9661 )
...
Signed-off-by: Harry Mellor <hej.mellor@gmail.com >
2024-10-24 12:17:45 -04:00
f58454968f
[Bugfix] Disable the post_norm layer of the vision encoder for LLaVA models ( #9653 )
2024-10-24 07:52:07 -07:00
b979143d5b
[Doc] Move additional tips/notes to the top ( #9647 )
2024-10-24 09:43:59 +00:00
ad6f78053e
[torch.compile] expanding support and fix allgather compilation ( #9637 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 01:32:15 -07:00
295a061fb3
[Kernel] add kernel for FATReLU ( #9610 )
...
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com >
2024-10-24 16:18:27 +08:00
8a02cd045a
[torch.compile] Adding torch compile annotations to some models ( #9639 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-10-24 00:54:57 -07:00
4fdc581f9e
[core] simplify seq group code ( #9569 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-10-24 00:16:44 -07:00
3770071eb4
[V1][Bugfix] Clean up requests when aborted ( #9629 )
...
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-10-23 23:33:22 -07:00
836e8ef6ee
[Bugfix] Fix PP for ChatGLM and Molmo ( #9422 )
2024-10-24 06:12:05 +00:00
056a68c7db
[XPU] avoid triton import for xpu ( #9440 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-24 05:14:00 +00:00
33bab41060
[Bugfix]: Make chat content text allow type content ( #9358 )
...
Signed-off-by: Vinay Damodaran <vrdn@hey.com >
2024-10-24 05:05:49 +00:00
b7df53cd42
[Bugfix] Use "vision_model" prefix for MllamaVisionModel ( #9628 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 10:07:44 +08:00
bb01f2915e
[Bugfix][Model] Fix Mllama SDPA illegal memory access for batched multi-image ( #9626 )
...
Signed-off-by: mgoin <michael@neuralmagic.com >
2024-10-24 10:03:44 +08:00
b548d7a5f4
[CI/Build] Add bot to close stale issues and PRs ( #9436 )
2024-10-23 15:45:26 -07:00
fc6c274626
[Model] Add Qwen2-Audio model support ( #9248 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-23 17:54:22 +00:00
150b779081
[Frontend] Enable Online Multi-image Support for MLlama ( #9393 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-23 17:28:57 +00:00
9013e24f7b
[torch.compile] Adding torch compile annotations to some models ( #9614 )
2024-10-23 10:07:48 -07:00
fd0e2cfdb2
[Misc] Separate total and output tokens in benchmark_throughput.py ( #8914 )
2024-10-23 16:47:20 +00:00
e5ac6a4199
[Bugfix] Fix divide by zero when serving Mamba models ( #9617 )
...
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-10-23 16:40:43 +00:00
dbdd3b5e5a
[misc] comment to avoid future confusion about baichuan ( #9620 )
...
Signed-off-by: youkaichao <youkaichao@gmail.com >
2024-10-23 09:14:44 -07:00
e7116c017c
[Bugfix] Fix _init_vision_model in NVLM_D model ( #9611 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-23 14:09:04 +00:00
31a08f5bd2
[Model] Add min_pixels / max_pixels to Qwen2VL as mm_processor_kwargs ( #9612 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-23 14:05:18 +00:00
c18e1a3418
[VLM] Enable overriding whether post layernorm is used in vision encoder + fix quant args ( #9217 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-23 11:27:37 +00:00
3ff57ebfca
[Model] Initialize Florence-2 language backbone support ( #9555 )
2024-10-23 10:42:47 +00:00
2394962d70
[Hardware][XPU] using current_platform.is_xpu ( #9605 )
2024-10-23 08:28:21 +00:00
51c24c9736
[Build] Fix FetchContent multiple build issue ( #9596 )
...
Signed-off-by: luka <luka@neuralmagic.com >
2024-10-23 12:43:07 +08:00
831540cf04
[Model] Support E5-V ( #9576 )
2024-10-23 11:35:29 +08:00
29061ed9df
[Misc] Add an env var VLLM_LOGGING_PREFIX, if set, it will be prepend to all logging messages ( #9590 )
2024-10-23 11:17:28 +08:00
65050a40e6
[Bugfix] Generate exactly input_len tokens in benchmark_throughput ( #9592 )
2024-10-22 17:45:35 -07:00
208cb34c81
[Doc]: Update tensorizer docs to include vllm[tensorizer] ( #7889 )
...
Co-authored-by: Kaunil Dhruv <dhruv.kaunil@gmail.com >
2024-10-22 15:43:25 -07:00
b17046e298
[BugFix] Fix metrics error for --num-scheduler-steps > 1 ( #8234 )
2024-10-22 15:43:03 -07:00
d1e8240875
[Bugfix] Fix spurious "No compiled cutlass_scaled_mm ..." for W8A8 on Turing ( #9487 )
2024-10-22 15:41:13 -07:00
cb6fdaa0a0
[Misc] Make benchmarks use EngineArgs ( #9529 )
2024-10-22 15:40:38 -07:00
23b899a8e6
[Bugfix] fix detokenizer shallow copy ( #5919 )
2024-10-22 15:38:12 -07:00
17c79f3c36
[torch.compile] auto infer dynamic_arg_dims from type annotation ( #9589 )
2024-10-22 13:43:37 -07:00
cd5601ac37
[BugFix] Prevent exporting duplicate OpenTelemetry spans ( #9017 )
2024-10-22 11:11:53 -07:00
434984e665
[Frontend] Support custom request_id from request ( #9550 )
...
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com >
2024-10-22 18:07:30 +00:00
32a1ee74a0
[Hardware][Intel CPU][DOC] Update docs for CPU backend ( #6212 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com >
Co-authored-by: Gubrud, Aaron D <aaron.d.gubrud@intel.com >
Co-authored-by: adgubrud <96072084+adgubrud@users.noreply.github.com >
2024-10-22 10:38:04 -07:00
08075c3448
[Bugfix] Eagle: change config name for fc bias ( #9580 )
2024-10-22 16:14:22 +00:00
bb392ea2d2
[Model][VLM] Initialize support for Mono-InternVL model ( #9528 )
2024-10-22 16:01:46 +00:00
9dbcce84a7
[Neuron] [Bugfix] Fix neuron startup ( #9374 )
...
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-10-22 12:51:41 +00:00
a48e3ec052
[CI/Build][LoRA] Temporarily fix long context failure issue ( #9579 )
2024-10-22 11:32:51 +00:00
6c5af09b39
[V1] Implement vLLM V1 [1/N] ( #9289 )
2024-10-22 01:24:07 -07:00
3ddbe25502
[Hardware][CPU] using current_platform.is_cpu ( #9536 )
2024-10-22 00:50:43 -07:00
0d02747f2e
support TP in qwen2 bnb ( #9574 )
2024-10-22 07:13:23 +00:00
f7db5f0fa9
[Doc] Use shell code-blocks and fix section headers ( #9508 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-22 06:43:24 +00:00
ca30c3c84b
[Core] Remove evictor_v1 ( #9572 )
2024-10-22 04:55:49 +00:00
c0292211ce
[CI/Build] Replaced some models on tests for smaller ones ( #9570 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-22 04:52:14 +00:00
74692421f7
[Bugfix]: phi.py get rope_theta from config file ( #9503 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-22 02:53:36 +00:00
29acd2c34c
[Bugfix][OpenVINO] fix_dockerfile_openvino ( #9552 )
2024-10-21 19:47:52 -07:00
f085995a7b
[CI/Build] Remove unnecessary fork_new_process ( #9484 )
2024-10-21 19:47:29 -07:00
b729901139
[Bugfix]: serialize config by value for --trust-remote-code ( #6751 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-21 19:46:24 -07:00
76a5e13270
[core] move parallel sampling out from vllm core ( #9302 )
2024-10-22 00:31:44 +00:00
ef7faad1b8
🐛 Fixup more test failures from memory profiling ( #9563 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-21 17:10:56 -07:00
575dcebe9a
[CI] Make format checker error message more user-friendly by using emoji ( #9564 )
...
This PR makes the format checker error message more user-friendly by adding emojis.
2024-10-21 23:45:15 +00:00
711f3a7806
[Frontend] Don't log duplicate error stacktrace for every request in the batch ( #9023 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-21 14:49:41 -07:00
15713e3b75
[BugFix] Update draft model TP size check to allow matching target TP size ( #9394 )
...
Co-authored-by: Baoyuan Qi <qibaoyuan@126.com >
2024-10-21 14:14:29 -07:00
d621c43df7
[doc] fix format ( #9562 )
2024-10-21 13:54:57 -07:00
9d9186be97
[Frontend] Reduce frequency of client cancellation checking ( #7959 )
2024-10-21 13:28:10 -07:00
5241aa1494
[Model][Bugfix] Fix batching with multi-image in PixtralHF ( #9518 )
2024-10-21 14:20:07 -04:00
ec6bd6c4c6
[BugFix] Use correct python3 binary in Docker.ppc64le entrypoint ( #9492 )
...
Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com >
2024-10-21 17:43:02 +00:00
8ca8954841
[Bugfix][Misc]: fix graph capture for decoder ( #9549 )
2024-10-21 17:33:30 +00:00
f6b97293aa
[Model] FalconMamba Support ( #9325 )
2024-10-21 12:50:16 -04:00
496e991da8
[Doc] Consistent naming of attention backends ( #9498 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-10-21 22:29:57 +08:00
696b01af8f
[CI/Build] Split up decoder-only LM tests ( #9488 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-20 21:27:50 -07:00
855e0e6f97
[Frontend][Misc] Goodput metric support ( #9338 )
2024-10-20 18:39:32 +00:00
4fa3e33349
[Kernel] Support sliding window in flash attention backend ( #9403 )
2024-10-20 10:57:52 -07:00
962d2c6349
[Model][Pixtral] Use memory_efficient_attention for PixtralHFVision ( #9520 )
2024-10-20 05:29:14 +00:00
5b59fe0f08
[Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger ( #9530 )
2024-10-20 00:05:02 +00:00
8e3e7f2713
[Model][Pixtral] Optimizations for input_processor_for_pixtral_hf ( #9514 )
2024-10-19 10:44:29 -04:00
263d8ee150
[Bugfix] Fix missing task for speculative decoding ( #9524 )
2024-10-19 06:49:40 +00:00
c5eea3c8ba
[Frontend] Support simpler image input format ( #9478 )
2024-10-18 23:17:07 -07:00
85dc92fc98
[CI/Build] Configure matcher for actionlint workflow ( #9511 )
...
Signed-off-by: Russell Bryant <russell.bryant@gmail.com >
2024-10-19 06:04:18 +00:00
dfd951ed9b
[CI/Build] Add error matching for ruff output ( #9513 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-19 05:42:20 +00:00
82c25151ec
[Doc] update gpu-memory-utilization flag docs ( #9507 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-19 11:26:36 +08:00
1325872ec8
[Frontend] Avoid creating guided decoding LogitsProcessor unnecessarily ( #9521 )
2024-10-18 20:21:01 -07:00
380e18639f
🐛 fix torch memory profiling ( #9516 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-18 21:25:19 -04:00
337ed76671
[Bugfix] Fix offline mode when using mistral_common ( #9457 )
2024-10-18 18:12:32 -07:00
0c9a5258f9
[Kernel] Add env variable to force flashinfer backend to enable tensor cores ( #9497 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Chih-Chieh Yang <chih.chieh.yang@ibm.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-10-18 17:55:48 -07:00
d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py ( #9510 )
2024-10-18 14:30:55 -07:00
9bb10a7d27
[MISC] Add lora requests to metrics ( #9477 )
...
Co-authored-by: Kunjan Patel <kunjanp_google_com@vllm.us-central1-a.c.kunjanp-gke-dev-2.internal>
2024-10-18 20:50:18 +00:00
3921a2f29e
[Model] Support Pixtral models in the HF Transformers format ( #9036 )
2024-10-18 13:29:56 -06:00
67a7e5ef38
[CI/Build] Add error matching config for mypy ( #9512 )
2024-10-18 12:17:53 -07:00
051eaf6db3
[Model] Add user-configurable task for models that support both generation and embedding ( #9424 )
2024-10-18 11:31:58 -07:00
7dbe738d65
[Misc] benchmark: Add option to set max concurrency ( #9390 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-18 11:15:28 -07:00
ae8b633ba3
[Bugfix] Fix offline_inference_with_prefix.py ( #9505 )
2024-10-18 16:59:19 +00:00
1bbbcc0b1d
[CI/Build] Fix lint errors in mistral tokenizer ( #9504 )
2024-10-19 00:09:35 +08:00
25aeb7d4c9
[BugFix] Fix and simplify completion API usage streaming ( #9475 )
2024-10-18 14:10:26 +00:00
d2b1bf55ec
[Frontend][Feature] Add jamba tool parser ( #9154 )
2024-10-18 10:27:48 +00:00
1ffc8a7362
[BugFix] Typing fixes to RequestOutput.prompt and beam search ( #9473 )
2024-10-18 07:19:53 +00:00
944dd8edaf
[CI/Build] Use commit hash references for github actions ( #9430 )
2024-10-17 21:54:58 -07:00
154a8ae880
[Qwen2.5] Support bnb quant for Qwen2.5 ( #9467 )
2024-10-18 04:40:14 +00:00
de4008e2ab
[Bugfix][Core] Use torch.cuda.memory_stats() to profile peak memory usage ( #9352 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-10-17 22:47:27 -04:00
48138a8415
[BugFix] Stop silent failures on compressed-tensors parsing ( #9381 )
2024-10-17 18:54:00 -07:00
343f8e0905
Support BERTModel (first encoder-only embedding model) ( #9056 )
...
Signed-off-by: Max de Bayser <maxdebayser@gmail.com >
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Andrew Feldman <afeldman@neuralmagic.com >
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: laishzh <laishengzhang@gmail.com >
Co-authored-by: Max de Bayser <maxdebayser@gmail.com >
Co-authored-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-10-17 23:21:01 +00:00
bb76538bbd
[Hardware][Neuron] Simplify model load for transformers-neuronx library ( #9380 )
2024-10-17 15:39:39 -07:00
d615b5c9f8
[Bugfix] Print warnings related to mistral_common tokenizer only once ( #9468 )
2024-10-17 21:44:20 +00:00
d65049daab
[Bugfix] Add random_seed to sample_hf_requests in benchmark_serving script ( #9013 )
...
Co-authored-by: Isotr0py <2037008807@qq.com >
2024-10-17 21:11:11 +00:00
eca2c5f7c0
[Bugfix] Fix support for dimension like integers and ScalarType ( #9299 )
2024-10-17 19:08:34 +00:00
0f41fbe5a3
[torch.compile] Fine-grained CustomOp enabling mechanism ( #9300 )
2024-10-17 18:36:37 +00:00
7871659abb
[Misc] Remove commit id file ( #9470 )
2024-10-17 10:34:37 -07:00
a2c71c5405
[CI/Build] remove .github from .dockerignore, add dirty repo check ( #9375 )
2024-10-17 10:25:06 -07:00
81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default ( #8704 )
...
Removing the block manager v1. This is the initial piece of the prefix-caching-centric design. To achieve that design, we need to simplify the code path so that we only use the v2 block manager (which has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
5eda21e773
[Hardware][CPU] compressed-tensor INT8 W8A8 AZP support ( #9344 )
2024-10-17 12:21:04 -04:00
8e1cddcd44
[TPU] Call torch._sync(param) during weight loading ( #9437 )
2024-10-17 09:00:11 -07:00
5e443b594f
[Bugfix] Allow prefill of assistant response when using mistral_common ( #9446 )
2024-10-17 15:06:37 +00:00
9d30a056e7
[misc] CUDA Time Layerwise Profiler ( #8337 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-10-17 10:36:09 -04:00
390be74649
[Misc] Print stack trace using logger.exception ( #9461 )
2024-10-17 13:55:48 +00:00
e312e52b44
[Kernel] Add Exllama as a backend for compressed-tensors ( #9395 )
2024-10-17 09:48:26 -04:00
dbfa8d31d5
Add notes on the use of Slack ( #9442 )
2024-10-17 04:46:46 +00:00
92d86da217
[BugFix] [Kernel] Fix GPU SEGV occurring in int8 kernels ( #9391 )
2024-10-17 01:34:06 +00:00
c3fab5f769
[Bugfix][Kernel] Prevent integer overflow in fp8 dynamic per-token quantize kernel ( #9425 )
2024-10-16 23:46:06 +00:00
776dbd74f1
[CI/Build] mypy: Resolve some errors from checking vllm/engine ( #9267 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-16 22:55:59 +00:00
8345045833
[Performance][Spec Decode] Optimize ngram lookup performance ( #9333 )
2024-10-16 13:37:45 -06:00
5b8a1fde84
[Model][Bugfix] Add FATReLU activation and support for openbmb/MiniCPM-S-1B-sft ( #9396 )
2024-10-16 16:40:24 +00:00
fb60ae9b91
[Kernel][Model] Improve continuous batching for Jamba and Mamba ( #9189 )
2024-10-16 12:12:43 -04:00
415f76a9cb
Support mistral interleaved attn ( #9414 )
2024-10-16 13:28:30 +00:00
cf1d62a644
[Model] Support SDPA attention for Molmo vision backbone ( #9410 )
2024-10-16 11:52:01 +00:00
59230ef32b
[Misc] Consolidate example usage of OpenAI client for multimodal models ( #9412 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-16 11:20:51 +00:00
cee711fdbb
[Core] Rename input data types ( #8688 )
2024-10-16 10:49:37 +00:00
1de76a0e55
[CI/Build] Test VLM embeddings ( #9406 )
2024-10-16 09:44:30 +00:00
7abba39ee6
[Model] VLM2Vec, the first multimodal embedding model in vLLM ( #9303 )
2024-10-16 14:31:00 +08:00
7e7eae338d
[Misc] Standardize RoPE handling for Qwen2-VL ( #9250 )
2024-10-16 13:56:17 +08:00
ed920135c8
[Bugfix] Molmo text-only input bug fix ( #9397 )
...
Co-authored-by: sanghol <sanghol@allenai.org >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-16 04:56:09 +00:00
717a5f82cd
[Bugfix][CI/Build] Fix CUDA 11.8 Build ( #9386 )
2024-10-16 00:15:21 +00:00
ba30942240
[Bugfix] Fix vLLM UsageInfo and logprobs None AssertionError with empty token_ids ( #9034 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-15 15:40:43 -07:00
22f8a69549
[Misc] Directly use compressed-tensors for checkpoint definitions ( #8909 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-15 15:40:25 -07:00
5d264f4ab8
pass ignore_eos parameter to all benchmark_serving calls ( #9349 )
2024-10-15 13:30:44 -07:00
e9d517f276
[BugFix] Fix chat API continuous usage stats ( #9357 )
2024-10-14 23:19:48 -07:00
55e081fbad
[Bugfix] Update InternVL input mapper to support image embeds ( #9351 )
2024-10-14 21:29:19 -07:00
8e836d982a
[Doc] Fix code formatting in spec_decode.rst ( #9348 )
2024-10-14 21:29:11 -07:00
44eaa5a5d9
[Frontend] Clarify model_type error messages ( #9345 )
2024-10-14 21:29:01 -07:00
169b530607
[Bugfix] Clean up some cruft in mamba.py ( #9343 )
2024-10-15 00:24:25 +00:00
f0fe4fe86d
[Model] Make llama3.2 support multiple and interleaved images ( #9095 )
2024-10-14 15:24:26 -07:00
4d31cd424b
[Frontend] merge beam search implementations ( #9296 )
2024-10-14 15:05:52 -07:00
473e7b3606
[TPU] Fix TPU SMEM OOM by Pallas paged attention kernel ( #9350 )
2024-10-14 15:02:06 -07:00
fd47e57f4b
[Docs] Remove PDF build from Readthedocs ( #9347 )
2024-10-14 11:57:47 -07:00
203ab8f80f
[CI/Build] setuptools-scm fixes ( #8900 )
2024-10-14 11:34:47 -07:00
4141608c6a
[Hardware][intel GPU] add async output process for xpu ( #8897 )
2024-10-14 12:23:33 -06:00
dfe43a2071
[Model] Molmo vLLM Integration ( #9016 )
...
Co-authored-by: sanghol <sanghol@allenai.org >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-14 07:56:24 -07:00
16b24e7dcd
[Bugfix] Bandaid fix for speculative decoding tests ( #9327 )
2024-10-13 23:02:11 +00:00
f519902c52
[CI] Fix merge conflict ( #9317 )
2024-10-13 06:41:23 +00:00
250e26a63e
[Bugfix] Fix MiniCPM's LoRA bug ( #9286 )
2024-10-12 09:36:47 -07:00
2b184ddd4f
[Misc][Installation] Improve source installation script and doc ( #9309 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-10-12 09:36:40 -07:00
00298e092c
[Bugfix] Fix bug of xformer prefill for encoder-decoder ( #9026 )
2024-10-12 15:00:43 +08:00
89feb4c84d
[SpecDec] Remove Batch Expansion (2/3) ( #9298 )
2024-10-12 05:13:37 +00:00
ec10cb8511
[BugFix] Fix tool call finish reason in streaming case ( #9209 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-10-11 18:24:26 -07:00
d11b46f3a5
[bugfix] fix f-string for error ( #9295 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2024-10-11 17:03:48 -07:00
c6cf9295e1
[Bugfix] Sets is_first_step_output for TPUModelRunner ( #9202 )
2024-10-11 13:28:10 -07:00
de9fb4bef8
[Bugfix][CI/Build] Fix docker build where CUDA archs < 7.0 are being detected ( #9254 )
2024-10-11 15:57:39 -04:00
8baf85e4e9
[Doc] Compatibility matrix for mutual exclusive features ( #8512 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
2024-10-11 11:18:50 -07:00
1a1823871d
[Doc] Remove outdated comment to avoid misunderstanding ( #9287 )
2024-10-11 18:02:03 +00:00
6cf1167c1a
[Model] Add GLM-4v support and meet vllm==0.6.2 ( #9242 )
2024-10-11 17:36:13 +00:00
f710090d8e
[Kernel] adding fused moe kernel config for L40S TP4 ( #9245 )
2024-10-11 08:54:22 -07:00
7342a7d7f8
[Model] Support Mamba ( #6484 )
2024-10-11 15:40:06 +00:00
df3dcdf49d
[Bugfix] Fix priority in multiprocessing engine ( #9277 )
2024-10-11 15:35:35 +00:00
36ea79079b
[Misc][LoRA] Support loading LoRA weights for target_modules in reg format ( #9275 )
2024-10-11 12:31:21 +00:00
e808156f30
[Misc] Collect model support info in a single process per model ( #9233 )
2024-10-11 11:08:11 +00:00
cbc2ef5529
[misc] hide best_of from engine ( #9261 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com >
2024-10-10 21:30:44 -07:00
94bf9ae4e9
[Misc] Fix sampling from sonnet for long context case ( #9235 )
2024-10-11 00:33:16 +00:00
f990bab2a4
[Doc][Neuron] add note to neuron documentation about resolving triton issue ( #9257 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-10-10 23:36:32 +00:00
e00c094f15
[torch.compile] generic decorators ( #9258 )
2024-10-10 15:54:23 -07:00
a78c6ba7c8
[ci/build] Add placeholder command for custom models test ( #9262 )
2024-10-10 15:45:09 -07:00
fb870fd491
Bump actions/setup-python from 3 to 5 ( #9195 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:46 -07:00
270953bafb
Bump actions/checkout from 3 to 4 ( #9196 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:35 -07:00
9cc811c4ff
Bump actions/github-script from 6 to 7 ( #9197 )
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-10 13:30:24 -07:00
e4d652ea3e
[torch.compile] integration with compilation control ( #9058 )
2024-10-10 12:39:36 -07:00
78c0b4166c
Suggest codeowners for the core components ( #9210 )
2024-10-10 12:29:24 -07:00
21efb603f5
[CI/Build] Make the Dockerfile.cpu file's PIP_EXTRA_INDEX_URL Configurable as a Build Argument ( #9252 )
2024-10-10 18:18:18 +00:00
055f3270d4
[Doc] Improve debugging documentation ( #9204 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-10 10:48:51 -07:00
18511aeda6
[Bugfix] Fix Machete unittests failing with NotImplementedError ( #9218 )
2024-10-10 17:39:56 +00:00
83ea5c72b9
[OpenVINO] Use torch 2.4.0 and newer optimum version ( #9121 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-10 11:18:58 -06:00
04de9057ab
[Model] support input image embedding for minicpmv ( #9237 )
2024-10-10 15:00:47 +00:00
07c11cf4d4
[Bugfix] Fix lm_head weights tying with lora for llama ( #9227 )
2024-10-10 21:11:56 +08:00
f3a507f1d3
[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 ( #9149 )
2024-10-10 14:17:17 +08:00
a64e7b9407
[Bugfix] Machete garbage results for some models (large K dim) ( #9212 )
2024-10-10 14:16:17 +08:00
ce00231a8b
[Bugfix] Fix Weight Loading Multiple GPU Test - Large Models ( #9213 )
2024-10-10 14:15:40 +08:00
de895f1697
[misc] improve model support check in another process ( #9208 )
2024-10-09 21:58:27 -07:00
cf25b93bdd
[Core] Fix invalid args to _process_request ( #9201 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-10 12:10:09 +08:00
d5fbb8706d
[CI/Build] Update Dockerfile install+deploy image to ubuntu 22.04 ( #9130 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-09 12:51:47 -06:00
cdca8994bd
[CI/Build] mypy: check vllm/entrypoints ( #9194 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-09 17:15:28 +00:00
ca77dd7a44
[Hardware][CPU] Support AWQ for CPU backend ( #7515 )
2024-10-09 10:28:08 -06:00
7dea289066
Add Dependabot configuration for GitHub Actions updates ( #1217 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-09 08:16:26 -07:00
cfaa6008e6
[Bugfix] Access get_vocab instead of vocab in tool parsers ( #9188 )
2024-10-09 08:59:57 -06:00
21906a6f50
[Bugfix] Fix lora loading for Compressed Tensors in #9120 ( #9179 )
2024-10-09 12:10:44 +00:00
dc4aea677a
[Doc] Fix VLM prompt placeholder sample bug ( #9170 )
2024-10-09 08:59:42 +00:00
c8627cd41b
[ci][test] use load dummy for testing ( #9165 )
2024-10-09 00:38:40 -07:00
8bfaa4e31e
[Bugfix] fix composite weight loading and EAGLE weight loading ( #9160 )
2024-10-09 00:36:55 -07:00
0b5b5d767e
[Frontend] Log the maximum supported concurrency ( #8831 )
2024-10-09 00:03:14 -07:00
cdc72e3c80
[Model] Remap FP8 kv_scale in CommandR and DBRX ( #9174 )
2024-10-09 06:43:06 +00:00
7627172bf4
[Bugfix][Doc] Report neuron error in output ( #9159 )
2024-10-08 22:43:34 -07:00
480b7f40cf
[Misc] Improve validation errors around best_of and n ( #9167 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-10-09 04:54:48 +00:00
acce7630c1
Update link to KServe deployment guide ( #9173 )
2024-10-09 03:58:49 +00:00
ffc4b27ea8
Add classifiers in setup.py ( #9171 )
2024-10-08 19:30:48 -07:00
2f4117c38e
support bitsandbytes quantization with more models ( #9148 )
2024-10-08 19:52:19 -06:00
9ba0bd6aa6
Add lm-eval directly to requirements-test.txt ( #9161 )
2024-10-08 18:22:31 -07:00
2a131965a8
mypy: check additional directories ( #9162 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-08 22:08:22 +00:00
bd37b9fbe2
[Bugfix] Try to handle older versions of pytorch ( #9086 )
2024-10-08 14:28:12 -07:00
de24046fcd
[Doc] Improve contributing and installation documentation ( #9132 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-10-08 20:22:08 +00:00
1874c6a1b0
[Doc] Update vlm.rst to include an example on videos ( #9155 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-10-08 18:12:29 +00:00
9a94ca4a5d
[Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing ( #8537 )
2024-10-08 09:38:40 -07:00
cfba685bd4
[CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models ( #8758 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2024-10-08 09:37:34 -07:00
069d3bd8d0
[Frontend] Add Early Validation For Chat Template / Tool Call Parser ( #9151 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-08 14:31:26 +00:00
a3691b6b5e
[Core][Frontend] Add Support for Inference Time mm_processor_kwargs ( #9131 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-08 14:12:56 +00:00
8c746226c9
[Frontend] API support for beam search for MQLLMEngine ( #9117 )
2024-10-08 05:51:43 +00:00
e1faa2a598
[misc] improve ux on readme ( #9147 )
2024-10-07 22:26:25 -07:00
80b57f00d5
[Intel GPU] Fix xpu decode input ( #9145 )
2024-10-08 03:51:14 +00:00
04c12f8157
[misc] update utils to support comparing multiple settings ( #9140 )
2024-10-08 02:51:49 +00:00
8eeb857084
Add Slack to README ( #9137 )
2024-10-07 17:06:21 -07:00
fa45513a51
[misc] fix comment and variable name ( #9139 )
2024-10-07 16:07:05 -07:00
c0d9a98d0c
[Doc] Include performance benchmark in README ( #9135 )
2024-10-07 15:04:06 -07:00
e0dbdb013d
[CI/Build] Add linting for github actions workflows ( #7876 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-10-07 21:18:10 +00:00
93cf74a8a7
[Doc]: Add deploying_with_k8s guide ( #8451 )
2024-10-07 13:31:45 -07:00
151ef4efd2
[Model] Support NVLM-D and fix QK Norm in InternViT ( #9045 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn >
2024-10-07 11:55:12 +00:00
f19da64871
[Core] Refactor GGUF parameters packing and forwarding ( #8859 )
2024-10-07 10:01:46 +00:00
4f95ffee6f
[Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend ( #9089 )
2024-10-07 06:50:35 +00:00
8c6de96ea1
[Model] Explicit interface for vLLM models and support OOT embedding models ( #9108 )
2024-10-07 06:10:35 +00:00
18b296fdb2
[core] remove beam search from the core ( #9105 )
2024-10-07 05:47:04 +00:00
c8f26bb636
[BugFix][Core] Fix BlockManagerV2 when Encoder Input is None ( #9103 )
2024-10-07 03:52:42 +00:00
487678d046
[Bugfix][Hardware][CPU] Fix CPU model input for decode ( #9044 )
2024-10-06 19:14:27 -07:00
cb3b2b9ba4
[Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling ( #9038 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-10-06 12:48:11 -07:00
fdf59d30ea
[Bugfix] fix tool_parser error handling when serve a model not support it ( #8709 )
2024-10-06 12:51:08 +00:00
b22b798471
[Model] PP support for embedding models and update docs ( #9090 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-10-06 16:35:27 +08:00
f22619fe96
[Misc] Remove user-facing error for removed VLM args ( #9104 )
2024-10-06 01:33:52 -07:00
168cab6bbf
[Frontend] API support for beam search ( #9087 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-10-05 23:39:03 -07:00
23fea8714a
[Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model ( #9101 )
2024-10-06 13:00:04 +08:00
f4dd830e09
[core] use forward context for flash infer ( #9097 )
2024-10-05 19:37:31 -07:00
5df1834895
[Bugfix] Fix order of arguments matters in config.yaml ( #8960 )
2024-10-05 17:35:11 +00:00
cfadb9c687
[Bugfix] Deprecate registration of custom configs to huggingface ( #9083 )
2024-10-05 21:56:40 +08:00
15986f598c
[Model] Support Gemma2 embedding model ( #9004 )
2024-10-05 06:57:05 +00:00
53b3a33027
[Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs ( #8979 )
2024-10-04 22:05:37 -07:00
dac914b0d6
[Bugfix] use blockmanagerv1 for encoder-decoder ( #9084 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-10-05 04:45:38 +00:00
a95354a36e
[Doc] Update README.md with Ray summit slides ( #9088 )
2024-10-05 02:54:45 +00:00
663874e048
[torch.compile] improve allreduce registration ( #9061 )
2024-10-04 16:43:50 -07:00
cc90419e89
[Hardware][Neuron] Add on-device sampling support for Neuron ( #8746 )
...
Co-authored-by: Ashraf Mahgoub <ashymahg@amazon.com >
2024-10-04 16:42:20 -07:00
27302dd584
[Misc] Fix CI lint ( #9085 )
2024-10-04 16:07:54 -07:00
0cc566ca8f
[Misc] Add random seed for prefix cache benchmark ( #9081 )
2024-10-04 21:58:57 +00:00
05c531be47
[Misc] Improved prefix cache example ( #9077 )
2024-10-04 21:38:42 +00:00
fbb74420e7
[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang ( #7412 )
2024-10-04 14:01:44 -07:00
05d686432f
[Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE ( #8973 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com >
Co-authored-by: Dipika Sikka <ds3822@columbia.edu >
2024-10-04 12:34:44 -06:00
0dcc8cbe5a
Adds truncate_prompt_tokens param for embeddings creation ( #8999 )
...
Signed-off-by: Flavia Beo <flavia.beo@ibm.com >
2024-10-04 18:31:40 +00:00
26aa325f4f
[Core][VLM] Test registration for OOT multimodal models ( #8717 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-04 10:38:25 -07:00
e5dc713c23
[Hardware][PowerPC] Make oneDNN dependency optional for Power ( #9039 )
...
Signed-off-by: Varad Ahirwadkar <varad.ahirwadkar1@ibm.com >
2024-10-04 17:24:42 +00:00
36eecfbddb
Remove AMD Ray Summit Banner ( #9075 )
2024-10-04 10:17:16 -07:00
9ade8bbc8d
[Model] add a bunch of supported lora modules for mixtral ( #9008 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
2024-10-04 16:24:40 +00:00
22482e495e
[Bugfix] Flash attention arches not getting set properly ( #9062 )
2024-10-04 09:43:15 -06:00
3d826d2c52
[Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL ( #9071 )
2024-10-04 14:34:58 +00:00
0e36fd4909
[Misc] Move registry to its own file ( #9064 )
2024-10-04 10:01:37 +00:00
0f6d7a9a34
[Models] Add remaining model PP support ( #7168 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Signed-off-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-10-04 10:56:58 +08:00
303d44790a
[Misc] Enable multi-step output streaming by default ( #9047 )
2024-10-03 22:55:42 -04:00
aeb37c2a72
[CI/Build] Per file CUDA Archs (improve wheel size and dev build times) ( #8845 )
2024-10-03 22:55:25 -04:00
3dbb215b38
[Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model ( #8405 )
2024-10-04 10:36:39 +08:00
2838d6b38e
[Bugfix] Weight loading fix for OPT model ( #9042 )
...
Co-authored-by: dvres <dvres@fri.uni-lj.si >
2024-10-03 19:53:29 -04:00
91add85ec4
Fix failing spec decode test ( #9054 )
2024-10-03 23:07:29 +00:00
9aaf14c62e
[misc] add forward context for attention ( #9029 )
2024-10-03 12:09:42 -07:00
63e39937f9
[Frontend] [Neuron] Parse literals out of override-neuron-config ( #8959 )
...
Co-authored-by: Jerzy Zagorski <jzagorsk@amazon.com >
2024-10-03 18:02:07 +00:00
f5d72b2fc6
[Core] Make BlockSpaceManagerV2 the default BlockManager to use. ( #8678 )
2024-10-03 09:44:21 -07:00
83caf35e08
[BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser ( #9020 )
2024-10-03 16:44:52 +08:00
01843c89b8
[Misc] log when using default MoE config ( #8971 )
2024-10-03 04:31:07 +00:00
19a4dd0990
[Bugfix] example template should not add parallel_tool_prompt if tools is none ( #9007 )
2024-10-03 03:04:17 +00:00
18c2e30c57
[Doc] Update Granite model docs ( #9025 )
2024-10-03 02:42:24 +00:00
19f0d25796
[Model] Adding Granite MoE ( #8206 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-03 09:33:57 +08:00
f58d4fccc9
[OpenVINO] Enable GPU support for OpenVINO vLLM backend ( #8192 )
2024-10-02 17:50:01 -04:00
afb050b29d
[Core] CUDA Graphs for Multi-Step + Chunked-Prefill ( #8645 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-10-02 19:44:39 +00:00
7f60520deb
[Misc] Update Default Image Mapper Error Log ( #8977 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-10-02 11:44:38 +00:00
563649aafe
[Core] Combined support for multi-step scheduling, chunked prefill & prefix caching ( #8804 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Andrew Feldman <afeld2012@gmail.com >
2024-10-02 07:52:20 +00:00
1570203864
[Spec Decode] (1/2) Remove batch expansion ( #8839 )
2024-10-01 16:04:42 -07:00
22f5851b80
Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows ( #8997 )
2024-10-01 11:07:06 -07:00
4f341bd4bf
[Doc] Update list of supported models ( #8987 )
2024-10-02 00:35:39 +08:00
35bd215168
[Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API ( #8965 )
2024-10-01 09:58:06 +00:00
1fe0a4264a
[Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders ( #8991 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-10-01 09:52:44 +00:00
bc4eb65b54
[Bugfix] Fix Fuyu tensor parallel inference ( #8986 )
2024-10-01 17:51:41 +08:00
82f3937e59
[Misc] add process_weights_after_loading for DummyLoader ( #8969 )
2024-10-01 03:46:41 +00:00
7da2487591
[torch.compile] fix tensor alias ( #8982 )
2024-10-01 03:40:48 +00:00
aaccca2b4d
[CI/Build] Fix machete generated kernel files ordering ( #8976 )
...
Signed-off-by: kevin <kevin@anyscale.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-10-01 03:33:12 +00:00
062c89e7c9
[Frontend][Core] Move guided decoding params into sampling params ( #8252 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-10-01 09:34:25 +08:00
bce324487a
[CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. ( #8975 )
2024-10-01 00:51:40 +00:00
1425a1bcf9
[ci] Add CODEOWNERS for test directories ( #8795 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-10-01 00:47:08 +00:00
1cabfcefb6
[Misc] Adjust max_position_embeddings for LoRA compatibility ( #8957 )
2024-09-30 12:57:39 +00:00
be76e5aabf
[Core] Make scheduling policy settable via EngineArgs ( #8956 )
2024-09-30 12:28:44 +00:00
2ae25f79cf
[Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg ( #8946 )
2024-09-30 13:01:20 +08:00
8e60afa15e
[Model][LoRA] LoRA support added for MiniCPMV2.6 ( #8943 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-30 04:31:55 +00:00
b6d7392579
[Misc][CI/Build] Include cv2 via mistral_common[opencv] ( #8951 )
2024-09-30 04:28:26 +00:00
e01ab595d8
[Model] support input embeddings for qwen2vl ( #8856 )
2024-09-30 03:16:10 +00:00
f13a07b1f8
[Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model ( #8533 )
2024-09-29 17:35:58 -04:00
6c9ba48fde
[Frontend] Added support for HF's new continue_final_message parameter ( #8942 )
2024-09-29 17:59:47 +00:00
1fb9c1b0bf
[Misc] Fix typo in BlockSpaceManagerV1 ( #8944 )
2024-09-29 15:05:54 +00:00
31f46a0d35
[BugFix] Fix seeded random sampling with encoder-decoder models ( #8870 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-29 09:43:14 +00:00
3d49776bbb
[Model][LoRA] LoRA support added for MiniCPMV2.5 ( #7199 )
2024-09-29 06:59:45 +00:00
bc2ef1f77c
[Model] Support Qwen2.5-Math-RM-72B ( #8896 )
2024-09-28 21:19:39 -07:00
2e7fe7e79f
[Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching ( #8930 )
2024-09-29 03:13:01 +00:00
26a68d5d7e
[CI/Build] Add test decorator for minimum GPU memory ( #8925 )
2024-09-29 02:50:51 +00:00
d081da0064
[Bugfix] Fix Marlin MoE act order when is_k_full == False ( #8741 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-09-28 18:19:40 -07:00
5bf8789b2a
[Bugfix] Block manager v2 with preemption and lookahead slots ( #8824 )
2024-09-29 09:17:45 +08:00
d1537039ce
[Core] Improve choice of Python multiprocessing method ( #8823 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-29 09:17:07 +08:00
cc276443b5
[doc] organize installation doc and expose per-commit docker ( #8931 )
2024-09-28 17:48:41 -07:00
e585b583a9
[Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 ( #8891 )
2024-09-28 18:51:22 +00:00
090e945e36
[Frontend] Make beam search emulator temperature modifiable ( #8928 )
...
Co-authored-by: Eduard Balzin <nfunctor@yahoo.fr >
2024-09-28 11:30:21 -07:00
e1a3f5e831
[CI/Build] Update models tests & examples ( #8874 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-28 09:54:35 -07:00
19d02ff938
[Bugfix] Fix PP for Multi-Step ( #8887 )
2024-09-28 08:52:46 -07:00
39d3f8d94f
[Bugfix] Fix code for downloading models from modelscope ( #8443 )
2024-09-28 08:24:12 -07:00
b0298aa8cc
[Misc] Remove vLLM patch of BaichuanTokenizer ( #8921 )
2024-09-28 08:11:25 +00:00
260024a374
[Bugfix][Intel] Fix XPU Dockerfile Build ( #7824 )
...
Signed-off-by: tylertitsworth <tyler.titsworth@intel.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-27 23:45:50 -07:00
d86f6b2afb
[misc] fix wheel name ( #8919 )
2024-09-27 22:10:44 -07:00
bd429f2b75
[Core] Priority-based scheduling in async engine ( #8850 )
2024-09-27 15:07:10 -07:00
18e60d7d13
[misc][distributed] add VLLM_SKIP_P2P_CHECK flag ( #8911 )
2024-09-27 14:27:56 -07:00
c2ec430ab5
[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path ( #8378 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-09-27 13:32:07 -07:00
c5d55356f9
[Bugfix] fix for deepseek w4a16 ( #8906 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-09-27 13:12:34 -06:00
172d1cd276
[Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method ( #7271 )
2024-09-27 14:25:10 -04:00
a9b15c606f
[torch.compile] use empty tensor instead of None for profiling ( #8875 )
2024-09-27 08:11:32 -07:00
8df2dc3c88
[TPU] Update pallas.py to support trillium ( #8871 )
2024-09-27 01:16:55 -07:00
6d792d2f31
[Bugfix][VLM] Fix Fuyu batching inference with max_num_seqs>1 ( #8892 )
2024-09-27 01:15:58 -07:00
0e088750af
[MISC] Fix invalid escape sequence '\' ( #8830 )
...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io >
2024-09-27 01:13:25 -07:00
dc4e3df5c2
[misc] fix collect env ( #8894 )
2024-09-27 00:26:38 -07:00
3b00b9c26c
[Core] rename PromptInputs and inputs ( #8876 )
2024-09-26 20:35:15 -07:00
344cd2b6f4
[Feature] Add support for Llama 3.1 and 3.2 tool use ( #8343 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-09-26 17:01:42 -07:00
1b49148e47
[Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility ( #8764 )
2024-09-26 16:54:09 -07:00
4b377d6feb
[BugFix] Fix test breakages from transformers 4.45 upgrade ( #8829 )
2024-09-26 16:46:43 -07:00
71d21c73ab
[Bugfix] Fixup advance_step.cu warning ( #8815 )
2024-09-26 16:23:45 -07:00
ee2da3e9ef
fix validation: Only set tool_choice auto if at least one tool is provided ( #8568 )
2024-09-26 16:23:17 -07:00
e2f6f26e86
[Bugfix] Fix print_warning_once's line info ( #8867 )
2024-09-26 16:18:26 -07:00
b28d2104de
[Misc] Change dummy profiling and BOS fallback warns to log once ( #8820 )
2024-09-26 16:18:14 -07:00
93d364da34
[Bugfix] Include encoder prompts len to non-stream api usage response ( #8861 )
2024-09-26 15:47:00 -07:00
d9cfbc891e
[ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM ( #8872 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-26 15:02:16 -07:00
70de39f6b4
[misc][installation] build from source without compilation ( #8818 )
2024-09-26 13:19:04 -07:00
68988d4e0d
[CI/Build] Fix missing ci dependencies ( #8834 )
2024-09-26 11:04:39 -07:00
520db4dbc1
[Docs] Add README to the build docker image ( #8825 )
2024-09-26 11:02:52 -07:00
f70bccac75
[Build/CI] Upgrade to gcc 10 in the base build Docker image ( #8814 )
2024-09-26 10:07:18 -07:00
4bb98f2190
[Misc] Update config loading for Qwen2-VL and remove Granite ( #8837 )
2024-09-26 07:45:30 -07:00
7193774b1f
[Misc] Support quantization of MllamaForCausalLM ( #8822 )
2024-09-25 14:46:22 -07:00
e2c6e0a829
[Doc] Update doc for Transformers 4.45 ( #8817 )
2024-09-25 13:29:48 -07:00
770ec6024f
[Model] Add support for the multi-modal Llama 3.2 model ( #8811 )
...
Co-authored-by: simon-mo <xmo@berkeley.edu >
Co-authored-by: Chang Su <chang.s.su@oracle.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-25 13:29:32 -07:00
4f1ba0844b
Revert "rename PromptInputs and inputs with backward compatibility ( #8760 )" ( #8810 )
2024-09-25 10:36:26 -07:00
873edda6cf
[Misc] Support FP8 MoE for compressed-tensors ( #8588 )
2024-09-25 09:43:36 -07:00
64840dfae4
[Frontend] MQLLMEngine supports profiling. ( #8761 )
2024-09-25 09:37:41 -07:00
28e1299e60
rename PromptInputs and inputs with backward compatibility ( #8760 )
2024-09-25 09:36:47 -07:00
0c4d2ad5e6
[VLM][Bugfix] internvl with num_scheduler_steps > 1 ( #8614 )
2024-09-25 09:35:53 -07:00
c6f2485c82
[Misc] Add extra deps for openai server image ( #8792 )
2024-09-25 09:35:23 -07:00
300da09177
[Kernel] Fullgraph and opcheck tests ( #8479 )
2024-09-25 08:35:52 -06:00
1c046447a6
[CI/Build][Bugfix][Doc][ROCm] CI fix and doc update after ROCm 6.2 upgrade ( #8777 )
2024-09-25 22:26:37 +08:00
8fae5ed7f6
[Misc] Fix minor typo in scheduler ( #8765 )
2024-09-25 00:53:03 -07:00
3368c3ab36
[Bugfix] Ray 2.9.x doesn't expose available_resources_per_node ( #8767 )
...
Signed-off-by: darthhexx <darthhexx@gmail.com >
2024-09-25 00:52:26 -07:00
1ac3de09cd
[Frontend] OpenAI server: propagate usage accounting to FastAPI middleware layer ( #8672 )
2024-09-25 07:49:26 +00:00
3e073e66f1
[Bugfix] load fc bias from config for eagle ( #8790 )
2024-09-24 23:16:30 -07:00
c23953675f
[Hardware][CPU] Enable mrope and support Qwen2-VL on CPU backend ( #8770 )
2024-09-24 23:16:11 -07:00
e3dd0692fa
[BugFix] Propagate 'trust_remote_code' setting in internvl and minicpmv ( #8250 )
2024-09-25 05:53:43 +00:00
fc3afc20df
Fix tests in test_chunked_prefill_scheduler which fail with BlockManager V2 ( #8752 )
2024-09-24 21:26:36 -07:00
b4522474a3
[Bugfix][Kernel] Implement acquire/release polyfill for Pascal ( #8776 )
2024-09-24 21:26:33 -07:00
ee777d9c30
Fix test_schedule_swapped_simple in test_scheduler.py ( #8780 )
2024-09-24 21:26:18 -07:00
6e0c9d6bd0
[Bugfix] Use heartbeats instead of health checks ( #8583 )
2024-09-24 20:37:38 -07:00
6da1ab6b41
[Core] Adding Priority Scheduling ( #5958 )
2024-09-24 19:50:50 -07:00
01b6f9e1f0
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding ( #8047 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-09-24 17:29:56 -07:00
13f9f7a3d0
[Misc] Upgrade bitsandbytes to the latest version 0.44.0 ( #8768 )
2024-09-24 17:08:55 -07:00
1e7d5c01f5
[misc] soft drop beam search ( #8763 )
2024-09-24 15:48:39 -07:00
2467b642dd
[CI/Build] fix setuptools-scm usage ( #8771 )
2024-09-24 12:38:12 -07:00
72fc97a0f1
[Bugfix] Fix torch dynamo fixes caused by replace_parameters ( #8748 )
2024-09-24 14:33:21 -04:00
2529d09b5a
[Frontend] Batch inference for llm.chat() API ( #8648 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-09-24 09:44:11 -07:00
a928ded995
[Kernel] Split Marlin MoE kernels into multiple files ( #8661 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-09-24 09:31:42 -07:00
cc4325b66a
[Bugfix] Fix potentially unsafe custom allreduce synchronization ( #8558 )
2024-09-24 01:08:14 -07:00
8ff7ced996
[Model] Expose Phi3v num_crops as a mm_processor_kwarg ( #8658 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-24 07:36:46 +00:00
3f06bae907
[Core][Model] Support loading weights by ID within models ( #7931 )
2024-09-24 07:14:15 +00:00
b8747e8a7c
[MISC] Skip dumping inputs when unpicklable ( #8744 )
2024-09-24 06:10:03 +00:00
3185fb0cca
Revert "[Core] Rename PromptInputs to PromptType, and inputs to prompt" ( #8750 )
2024-09-24 05:45:20 +00:00
0250dd68c5
re-implement beam search on top of vllm core ( #8726 )
...
Co-authored-by: Brendan Wong <bjwpokemon@gmail.com >
2024-09-23 22:08:12 -07:00
88577ac928
Fix tests in test_scheduler.py that fail with BlockManager V2 ( #8728 )
2024-09-24 04:43:13 +00:00
530821d00c
[Hardware][AMD] ROCm6.2 upgrade ( #8674 )
2024-09-23 18:52:39 -07:00
1a2aef3e59
Add output streaming support to multi-step + async while ensuring RequestOutput obj reuse ( #8335 )
2024-09-23 15:38:04 -07:00
5f7bb58427
Fix typical acceptance sampler with correct recovered token ids ( #8562 )
2024-09-23 12:32:27 -07:00
b05f5c9238
[Core] Allow IPv6 in VLLM_HOST_IP with zmq ( #8575 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-09-23 12:15:41 -07:00
9b0e3ec970
[Kernel][LoRA] Add assertion for punica sgmv kernels ( #7585 )
2024-09-23 18:57:42 +00:00
86e9c8df29
[Kernel] (2/N) Machete - Integrate into CompressedTensorsWNA16 and GPTQMarlin ( #7701 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-09-23 13:46:26 -04:00
ee5f34b1c2
[CI/Build] use setuptools-scm to set __version__ ( #4738 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-23 09:44:26 -07:00
f2bd246c17
[VLM] Fix paligemma, fuyu and persimmon with transformers 4.45 : use config.text_config.vocab_size ( #8707 )
2024-09-23 14:43:09 +00:00
a79e522984
[Model] Support pp for qwen2-vl ( #8696 )
2024-09-23 13:46:59 +00:00
3e83c12b5c
[Bugfix][CPU] fix missing input intermediate_tensors in the cpu_model_runner ( #8733 )
2024-09-23 13:15:16 +00:00
e551ca1555
[Hardware][CPU] Refactor CPU model runner ( #8729 )
2024-09-23 20:12:20 +08:00
9b8c8ba119
[Core][Frontend] Support Passing Multimodal Processor Kwargs ( #8657 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-23 07:44:48 +00:00
d23679eb99
[Bugfix] fix docker build for xpu ( #8652 )
2024-09-22 22:54:18 -07:00
57a0702e63
[Bugfix] Fix CPU CMake build ( #8723 )
...
Co-authored-by: Yuan <yuan.zhou@intel.com >
2024-09-22 20:40:46 -07:00
3dda7c2250
[Bugfix] Avoid some bogus messages RE CUTLASS's revision when building ( #8702 )
2024-09-22 22:24:59 -04:00
92ba7e7477
[misc] upgrade mistral-common ( #8715 )
2024-09-22 15:41:59 -07:00
d4a2ac8302
[build] enable existing pytorch (for GH200, aarch64, nightly) ( #8713 )
2024-09-22 12:47:54 -07:00
c6bd70d772
[SpecDec][Misc] Cleanup, remove bonus token logic. ( #8701 )
2024-09-22 12:34:14 -07:00
5b59532760
[Model][VLM] Add LLaVA-Onevision model support ( #8486 )
...
Co-authored-by: litianjian <litianjian@bytedance.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-22 10:51:44 -07:00
ca2b628b3c
[MISC] rename CudaMemoryProfiler to DeviceMemoryProfiler ( #8703 )
2024-09-22 10:44:09 -07:00
8ca5051b9a
[Misc] Use NamedTuple in Multi-image example ( #8705 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-22 20:56:20 +08:00
06ed2815e2
[Model] Refactor BLIP/BLIP-2 to support composite model loading ( #8407 )
2024-09-22 12:24:21 +00:00
0e40ac9b7b
[ci][build] fix vllm-flash-attn ( #8699 )
2024-09-21 23:24:58 -07:00
13d88d4137
[Bugfix] Refactor composite weight loading logic ( #8656 )
2024-09-22 04:33:27 +00:00
d66ac62854
[Kernel][Bugfix] Delete some more useless code in marlin_moe_ops.cu ( #8643 )
2024-09-21 23:45:02 +00:00
9dc7c6c7f3
[dbrx] refactor dbrx experts to extend FusedMoe class ( #8518 )
2024-09-21 15:09:39 -06:00
ec4aaad812
[Kernel][Triton][AMD] Remove tl.atomic_add from awq_gemm_kernel, 2-5x speedup MI300, minor improvement for MI250 ( #8646 )
2024-09-21 09:20:54 +00:00
4dfdf43196
[Doc] Fix typo in AMD installation guide ( #8689 )
2024-09-21 00:24:12 -07:00
5e85f4f82a
[VLM] Use SequenceData.from_token_counts to create dummy data ( #8687 )
2024-09-20 23:28:56 -07:00
71c60491f2
[Kernel] Build flash-attn from source ( #8245 )
2024-09-20 23:27:10 -07:00
0faab90eb0
[beam search] add output for manually checking the correctness ( #8684 )
2024-09-20 19:55:33 -07:00
0455c46ed4
[Core] Factor out common code in SequenceData and Sequence ( #8675 )
2024-09-21 02:30:39 +00:00
d4bf085ad0
[MISC] add support custom_op check ( #8557 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-20 19:03:55 -07:00
0057894ef7
[Core] Rename PromptInputs and inputs ( #8673 )
2024-09-20 19:00:54 -07:00
0f961b3ce9
[Bugfix] Fix incorrect llava next feature size calculation ( #8496 )
2024-09-20 22:48:32 +00:00
7f9c8902e3
[Hardware][AWS] update neuron to 2.20 ( #8676 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-09-20 15:19:44 -07:00
7c8566aa4f
[Doc] neuron documentation update ( #8671 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-09-20 15:04:37 -07:00
b4e4eda92e
[Bugfix][Core] Fix tekken edge case for mistral tokenizer ( #8640 )
2024-09-20 14:33:03 -07:00
2874bac618
[Bugfix] Config got an unexpected keyword argument 'engine' ( #8556 )
2024-09-20 14:00:45 -07:00
035fa895ec
[Misc] Show AMD GPU topology in collect_env.py ( #8649 )
2024-09-20 13:52:19 -07:00
b28298f2f4
[Bugfix] Validate SamplingParam n is an int ( #8548 )
2024-09-20 12:46:02 -07:00
2940afa04e
[CI/Build] Removing entrypoints/openai/test_embedding.py test from ROCm build ( #8670 )
2024-09-20 10:27:44 -07:00
3b63de9353
[Model] Add OLMoE ( #7922 )
2024-09-20 09:31:41 -07:00
260d40b5ea
[Core] Support Lora lineage and base model metadata management ( #6315 )
2024-09-20 06:20:56 +00:00
9e5ec35b1f
[bugfix] [AMD] add multi-step advance_step to ROCmFlashAttentionMetadata ( #8474 )
2024-09-19 20:49:54 -07:00
18ae428a0d
[Bugfix] Fix Phi3.5 mini and MoE LoRA inference ( #8571 )
2024-09-20 08:54:02 +08:00
de6f90a13d
[Misc] guard against change in cuda library name ( #8609 )
2024-09-20 06:36:30 +08:00
6cb748e190
[CI/Build] Re-enabling Entrypoints tests on ROCm, excluding ones that fail ( #8551 )
2024-09-19 13:06:32 -07:00
9e99407e3c
Create SECURITY.md ( #8642 )
2024-09-19 12:16:28 -07:00
ea4647b7d7
[Doc] Add documentation for GGUF quantization ( #8618 )
2024-09-19 13:15:55 -06:00
e42c634acb
[Core] simplify logits resort in _apply_top_k_top_p ( #8619 )
2024-09-19 18:28:25 +00:00
9cc373f390
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention ( #8577 )
2024-09-19 17:37:57 +00:00
76515f303b
[Frontend] Use MQLLMEngine for embeddings models too ( #8584 )
2024-09-19 12:51:06 -04:00
855c8ae2c9
[MISC] remove engine_use_ray in benchmark_throughput.py ( #8615 )
2024-09-18 22:33:20 -07:00
c52ec5f034
[Bugfix] fixing sonnet benchmark bug in benchmark_serving.py ( #8616 )
2024-09-19 05:24:24 +00:00
02c9afa2d0
Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" ( #8593 )
2024-09-19 04:14:28 +00:00
3118f63385
[Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. ( #8545 )
2024-09-19 02:24:15 +00:00
4c34ce8916
[Kernel] Remove marlin moe templating on thread_m_blocks ( #8573 )
...
Co-authored-by: lwilkinson@neuralmagic.com
2024-09-19 01:42:49 +00:00
0d47bf3bf4
[Bugfix] add dead_error property to engine client ( #8574 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-18 22:10:01 +00:00
d9cd78eb71
[BugFix] Nonzero exit code if MQLLMEngine startup fails ( #8572 )
2024-09-18 20:17:55 +00:00
db9120cded
[Kernel] Change interface to Mamba selective_state_update for continuous batching ( #8039 )
2024-09-18 20:05:06 +00:00
b3195bc9e4
[AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call ( #8380 )
...
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-18 10:41:08 -07:00
e18749ff09
[Model] Support Solar Model ( #8386 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-18 11:04:00 -06:00
d65798f78c
[Core] zmq: bind only to 127.0.0.1 for local-only usage ( #8543 )
...
Signed-off-by: Russell Bryant <rbryant@redhat.com >
2024-09-18 16:10:27 +00:00
a8c1d161a7
[Core] *Prompt* logprobs support in Multi-step ( #8199 )
2024-09-18 08:38:43 -07:00
7c7714d856
[Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH ( #8157 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-09-18 13:56:58 +00:00
9d104b5beb
[CI/Build] Update Ruff version ( #8469 )
...
Signed-off-by: Aaron Pham <contact@aarnphm.xyz >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-18 11:00:56 +00:00
6ffa3f314c
[CI/Build] Avoid CUDA initialization ( #8534 )
2024-09-18 10:38:11 +00:00
e351572900
[Misc] Add argument to disable FastAPI docs ( #8554 )
2024-09-18 09:51:59 +00:00
95965d31b6
[CI/Build] fix Dockerfile.cpu on podman ( #8540 )
2024-09-18 10:49:53 +08:00
8110e44529
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching ( #8012 )
2024-09-17 23:44:27 +00:00
09deb4721f
[CI/Build] Excluding kernels/test_gguf.py from ROCm ( #8520 )
2024-09-17 16:40:29 -07:00
fa0c114fad
[doc] improve installation doc ( #8550 )
...
Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com >
2024-09-17 16:24:06 -07:00
98f9713399
[Bugfix] Fix TP > 1 for new granite ( #8544 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-17 23:17:08 +00:00
56c3de018c
[Misc] Don't dump contents of kvcache tensors on errors ( #8527 )
2024-09-17 12:24:29 -07:00
a54ed80249
[Model] Add mistral function calling format to all models loaded with "mistral" format ( #8515 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-17 17:50:37 +00:00
9855b99502
[Feature][kernel] tensor parallelism with bitsandbytes quantization ( #8434 )
2024-09-17 08:09:12 -07:00
1009e93c5d
[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models ( #7631 )
2024-09-17 07:35:01 -07:00
1b6de8352b
[Benchmark] Support sample from HF datasets and image input for benchmark_serving ( #8495 )
2024-09-17 07:34:27 +00:00
cbdb252259
[Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change ( #8509 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-09-17 00:06:26 -07:00
99aa4eddaf
[torch.compile] register allreduce operations as custom ops ( #8526 )
2024-09-16 22:57:57 -07:00
ee2bceaaa6
[Misc][Bugfix] Disable guided decoding for mistral tokenizer ( #8521 )
2024-09-16 22:22:45 -07:00
1c1bb388e0
[Frontend] Improve Nullable kv Arg Parsing ( #8525 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-09-17 04:17:32 +00:00
546034b466
[refactor] remove triton based sampler ( #8524 )
2024-09-16 20:04:48 -07:00
cca61642e0
[Bugfix] Fix 3.12 builds on main ( #8510 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-17 00:01:45 +00:00
5ce45eb54d
[misc] small qol fixes for release process ( #8517 )
2024-09-16 15:11:27 -07:00
5478c4b41f
[perf bench] set timeout to debug hanging ( #8516 )
2024-09-16 14:30:02 -07:00
47f5e03b5b
[Bugfix] Bind api server port before starting engine ( #8491 )
2024-09-16 13:56:28 -07:00
2759a43a26
[doc] update doc on testing and debugging ( #8514 )
2024-09-16 12:10:23 -07:00
5d73ae49d6
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels ( #7270 )
2024-09-16 11:52:40 -07:00
781e3b9a42
[Bugfix][Kernel] Fix build for sm_60 in GGUF kernel ( #8506 )
2024-09-16 12:15:57 -06:00
acd5511b6d
[BugFix] Fix clean shutdown issues ( #8492 )
2024-09-16 09:33:46 -07:00
837c1968f9
[Frontend] Expose revision arg in OpenAI server ( #8501 )
2024-09-16 15:55:26 +00:00
a091e2da3e
[Kernel] Enable 8-bit weights in Fused Marlin MoE ( #8032 )
...
Co-authored-by: Dipika <dipikasikka1@gmail.com >
2024-09-16 09:47:19 -06:00
fc990f9795
[Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel ( #8357 )
2024-09-15 16:51:44 -06:00
3724d5f6b5
[Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations ( #8490 )
2024-09-15 04:20:05 +00:00
50e9ec41fc
[TPU] Implement multi-step scheduling ( #8489 )
2024-09-14 16:58:31 -07:00
47790f3e32
[torch.compile] add a flag to disable custom op ( #8488 )
2024-09-14 13:07:16 -07:00
a36e070dad
[torch.compile] fix functionalization ( #8480 )
2024-09-14 09:46:04 -07:00
8a0cf1ddc3
[Model] support minicpm3 ( #8297 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-14 14:50:26 +00:00
1ef0d2efd0
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm ( #8310 )
2024-09-13 17:01:11 -07:00
851725202a
[Hardware][intel GPU] bump up ipex version to 2.3 ( #8365 )
...
Co-authored-by: Yan Ma <yan.ma@intel.com >
2024-09-13 16:54:34 -07:00
9ba0817ff1
bump version to v0.6.1.post2 ( #8473 )
2024-09-13 11:35:00 -07:00
18e9e1f7b3
[HotFix] Fix final output truncation with stop string + streaming ( #8468 )
2024-09-13 11:31:12 -07:00
f57092c00b
[Doc] Add oneDNN installation to CPU backend documentation ( #8467 )
2024-09-13 18:06:30 +00:00
a84e598e21
[CI/Build] Reorganize models tests ( #7820 )
2024-09-13 10:20:06 -07:00
0a4806f0a9
[plugin][torch.compile] allow to add custom compile backend ( #8445 )
2024-09-13 09:32:42 -07:00
ecd7a1d5b6
[Installation] Gate FastAPI version for Python 3.8 ( #8456 )
2024-09-13 09:02:26 -07:00
a2469127db
[misc][ci] fix quant test ( #8449 )
2024-09-13 17:20:14 +08:00
06311e2956
[Misc] Skip loading extra bias for Qwen2-VL GPTQ-Int8 ( #8442 )
2024-09-13 07:58:28 +00:00
cab69a15e4
[doc] recommend pip instead of conda ( #8446 )
2024-09-12 23:52:41 -07:00
9b4a3b235e
[CI/Build] Enable InternVL2 PP test only on single node ( #8437 )
2024-09-13 06:35:20 +00:00
acda0b35d0
bump version to v0.6.1.post1 ( #8440 )
2024-09-12 21:39:49 -07:00
ba77527955
[bugfix] torch profiler bug for single gpu with GPUExecutor ( #8354 )
2024-09-12 21:30:00 -07:00
6821020109
[Bugfix] Fix async log stats ( #8417 )
2024-09-12 20:48:59 -07:00
8427550488
[CI/Build] Update pixtral tests to use JSON ( #8436 )
2024-09-13 03:47:52 +00:00
3f79bc3d1a
[Bugfix] Bump fastapi and pydantic version ( #8435 )
2024-09-13 03:21:42 +00:00
40c396533d
[Bugfix] Mapping physical device indices for e2e test utils ( #8290 )
2024-09-13 11:06:28 +08:00
5ec9c0fb3c
[Core] Factor out input preprocessing to a separate class ( #7329 )
2024-09-13 02:56:13 +00:00
8f44a92d85
[BugFix] fix group_topk ( #8430 )
2024-09-13 09:23:42 +08:00
360ddbd37e
[Misc] Update Pixtral example ( #8431 )
2024-09-12 17:31:18 -07:00
a480939e8e
[Bugfix] Fix weight loading issue by rename variable. ( #8293 )
2024-09-12 19:25:00 -04:00
d31174a4e1
[Hotfix][Pixtral] Fix multiple images bugs ( #8415 )
2024-09-12 15:21:51 -07:00
b61bd98f90
[CI/Build] Disable multi-node test for InternVL2 ( #8428 )
2024-09-12 15:05:35 -07:00
c16369455f
[Hotfix][Core][VLM] Disable chunked prefill by default and prefix caching for multimodal models ( #8425 )
2024-09-12 14:06:51 -07:00
019877253b
[Bugfix] multi-step + flashinfer: ensure cuda graph compatible ( #8427 )
2024-09-12 21:01:50 +00:00
551ce01078
[Core] Add engine option to return only deltas or final output ( #7381 )
2024-09-12 12:02:00 -07:00
a6c0f3658d
[multi-step] add flashinfer backend ( #7928 )
2024-09-12 11:16:22 -07:00
f2e263b801
[Bugfix] Offline mode fix ( #8376 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-12 11:11:57 -07:00
1f0c75afa9
[BugFix] Fix Duplicate Assignment in Hermes2ProToolParser ( #8423 )
2024-09-12 11:10:11 -07:00
8a23e93302
[BugFix] lazy init _copy_stream to avoid torch init wrong gpu instance ( #8403 )
2024-09-12 10:47:42 -07:00
c6202daeed
[Model] Support multiple images for qwen-vl ( #8247 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-12 10:10:54 -07:00
e56bf27741
[Bugfix] Fix InternVL2 inference with various num_patches ( #8375 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-12 10:10:35 -07:00
520ca380ae
[Hotfix][VLM] Fixing max position embeddings for Pixtral ( #8399 )
2024-09-12 09:28:37 -07:00
7de49aa86c
[torch.compile] hide slicing under custom op for inductor ( #8384 )
2024-09-12 00:11:55 -07:00
42ffba11ad
[Misc] Use RoPE cache for MRoPE ( #8396 )
2024-09-11 23:13:14 -07:00
295c4730a8
[Misc] Raise error when using encoder/decoder model with cpu backend ( #8355 )
2024-09-12 05:45:24 +00:00
1bf2dd9df0
[Gemma2] add bitsandbytes support for Gemma2 ( #8338 )
2024-09-11 21:53:12 -07:00
5a60699c45
[Bugfix]: Fix the logic for deciding if tool parsing is used ( #8366 )
2024-09-12 03:55:30 +00:00
b6c75e1cf2
Fix the AMD weight loading tests ( #8390 )
2024-09-11 20:35:33 -07:00
b71c956deb
[TPU] Use Ray for default distributed backend ( #8389 )
2024-09-11 20:31:51 -07:00
f842a7aff1
[misc] remove engine_use_ray ( #8126 )
2024-09-11 18:23:36 -07:00
a65cb16067
[MISC] Dump model runner inputs when crashing ( #8305 )
2024-09-12 01:12:25 +00:00
3fd2b0d21c
Bump version to v0.6.1 ( #8379 )
2024-09-11 14:42:11 -07:00
d394787e52
Pixtral ( #8377 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-09-11 14:41:55 -07:00
775f00f81e
[Speculative Decoding] Test refactor ( #8317 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-09-11 14:07:34 -07:00
8baa454937
[Misc] Move device options to a single place ( #8322 )
2024-09-11 13:25:58 -07:00
73202dbe77
[Kernel][Misc] register ops to prevent graph breaks ( #6917 )
...
Co-authored-by: Sage Moore <sage@neuralmagic.com >
2024-09-11 12:52:19 -07:00
7015417fd4
[Bugfix] Add missing attributes in mistral tokenizer ( #8364 )
2024-09-11 11:36:54 -07:00
aea02f30de
[CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation ( #8373 )
2024-09-11 18:31:41 +00:00
0b952af458
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend ( #7257 )
2024-09-11 09:46:46 -07:00
3b7fea770f
[Model][VLM] Add Qwen2-VL model support ( #7905 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-11 09:31:19 -07:00
cea95dfb94
[Frontend] Create ErrorResponse instead of raising exceptions in run_batch ( #8347 )
2024-09-11 05:30:11 +00:00
6a512a00df
[model] Support for Llava-Next-Video model ( #7559 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-09-10 22:21:36 -07:00
efcf946a15
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. ( #6112 )
2024-09-11 00:38:40 -04:00
1230263e16
[Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel ( #8299 )
2024-09-11 10:11:01 +08:00
e497b8aeff
[Misc] Skip loading extra bias for Qwen2-MOE GPTQ models ( #8329 )
2024-09-10 20:59:19 -04:00
94144e726c
[CI/Build][Kernel] Update CUTLASS to 3.5.1 tag ( #8043 )
2024-09-10 23:51:58 +00:00
1d5e397aa4
[Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers ( #8172 )
2024-09-10 23:46:08 +00:00
22f3a4bc6c
[Bugfix] lookahead block table with cuda graph max capture ( #8340 )
...
[Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (#8340 )
2024-09-10 16:00:35 -07:00
b1f3e18958
[MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled ( #8342 )
2024-09-10 22:28:28 +00:00
04e7c4e771
[Misc] remove peft as dependency for prompt models ( #8162 )
2024-09-10 17:21:56 -04:00
5faedf1b62
[Spec Decode] Move ops.advance_step to flash attn advance_step ( #8224 )
2024-09-10 13:18:14 -07:00
02751a7a42
Fix ppc64le buildkite job ( #8309 )
2024-09-10 12:58:34 -07:00
f421f3cefb
[CI/Build] Enabling kernels tests for AMD, ignoring some of them that fail ( #8130 )
2024-09-10 11:51:15 -07:00
8c054b7a62
[Frontend] Clean up type annotations for mistral tokenizer ( #8314 )
2024-09-10 16:49:11 +00:00
6234385f4a
[CI/Build] enable ccache/sccache for HIP builds ( #8327 )
2024-09-10 08:55:08 -07:00
da1a844e61
[Bugfix] Fix missing post_layernorm in CLIP ( #8155 )
2024-09-10 08:22:50 +00:00
a1d874224d
Add NVIDIA Meetup slides, announce AMD meetup, and add contact info ( #8319 )
2024-09-09 23:21:00 -07:00
6cd5e5b07e
[Misc] Fused MoE Marlin support for GPTQ ( #8217 )
2024-09-09 23:02:52 -04:00
c7cb5c3335
[Misc] GPTQ Activation Ordering ( #8135 )
2024-09-09 16:27:26 -04:00
f9b4a2d415
[Bugfix] Correct adapter usage for cohere and jamba ( #8292 )
2024-09-09 11:20:46 -07:00
58fcc8545a
[Frontend] Add progress reporting to run_batch.py ( #8060 )
...
Co-authored-by: Adam Lugowski <adam.lugowski@parasail.io >
2024-09-09 11:16:37 -07:00
08287ef675
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility ( #8272 )
2024-09-09 10:45:11 -04:00
4ef41b8476
[Bugfix] Fix async postprocessor in case of preemption ( #8267 )
2024-09-07 21:01:51 -07:00
cfe712bf1a
[CI/Build] Use python 3.12 in cuda image ( #8133 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-09-07 13:03:16 -07:00
b962ee1470
ppc64le: Dockerfile fixed, and a script for buildkite ( #8026 )
2024-09-07 11:18:40 -07:00
36bf8150cc
[Model][VLM] Decouple weight loading logic for Paligemma ( #8269 )
2024-09-07 17:45:44 +00:00
e807125936
[Model][VLM] Support multi-images inputs for InternVL2 models ( #8201 )
2024-09-07 16:38:23 +08:00
9f68e00d27
[Bugfix] Fix broken OpenAI tensorizer test ( #8258 )
2024-09-07 08:02:39 +00:00
ce2702a923
[tpu][misc] fix typo ( #8260 )
2024-09-06 22:40:46 -07:00
795b662cff
Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) ( #8241 )
2024-09-06 20:18:16 -07:00
2f707fcb35
[Model] Multi-input support for LLaVA ( #8238 )
2024-09-07 02:57:24 +00:00
41e95c5247
[Bugfix] Fix Hermes tool call chat template bug ( #8256 )
...
Co-authored-by: Kyle Mistele <kyle@constellate.ai >
2024-09-07 10:49:01 +08:00
12dd715807
[misc] [doc] [frontend] LLM torch profiler support ( #7943 )
2024-09-06 17:48:48 -07:00
29f49cd6e3
[Model] Allow loading from original Mistral format ( #8168 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-06 17:02:05 -06:00
23f322297f
[Misc] Remove SqueezeLLM ( #8220 )
2024-09-06 16:29:03 -06:00
9db52eab3d
[Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput ( #8248 )
2024-09-06 16:26:09 -06:00
1447c97e75
[CI/Build] Increasing timeout for multiproc worker tests ( #8203 )
2024-09-06 11:51:03 -07:00
de80783b69
[Misc] Use ray[adag] dependency instead of cuda ( #7938 )
2024-09-06 09:18:35 -07:00
e5cab71531
[Frontend] Add --logprobs argument to benchmark_serving.py ( #8191 )
2024-09-06 09:01:14 -07:00
baa5467547
[BugFix] Fix Granite model configuration ( #8216 )
2024-09-06 11:39:29 +08:00
db3bf7c991
[Core] Support load and unload LoRA in api server ( #6566 )
...
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com >
2024-09-05 18:10:33 -07:00
2febcf2777
[Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM ( #7962 )
2024-09-05 16:25:29 -04:00
2ee45281a5
Move verify_marlin_supported to GPTQMarlinLinearMethod ( #8165 )
2024-09-05 11:09:46 -04:00
9da25a88aa
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) ( #8029 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-09-05 12:48:10 +00:00
8685ba1a1e
Inclusion of InternVLChatModel in PP_SUPPORTED_MODELS (Pipeline Parallelism) ( #7860 )
2024-09-05 11:33:37 +00:00
288a938872
[Doc] Indicate more information about supported modalities ( #8181 )
2024-09-05 10:51:53 +00:00
e39ebf5cf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. ( #8173 )
2024-09-05 05:12:26 +00:00
ba262c4e5a
[ci] Mark LoRA test as soft-fail ( #8160 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-04 20:33:12 -07:00
4624d98dbd
[Misc] Clean up RoPE forward_native ( #8076 )
2024-09-04 20:31:48 -07:00
1afc931987
[bugfix] >1.43 constraint for openai ( #8169 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-04 17:35:36 -07:00
e01c2beb7d
[Doc] [Misc] Create CODE_OF_CONDUCT.md ( #8161 )
2024-09-04 16:50:13 -07:00
32e7db2536
Bump version to v0.6.0 ( #8166 )
2024-09-04 16:34:27 -07:00
008cf886c9
[Neuron] Adding support for adding/ overriding neuron configuration a… ( #8062 )
...
Co-authored-by: Harsha Bikki <harbikh@amazon.com >
2024-09-04 16:33:43 -07:00
77d9e514a2
[MISC] Replace input token throughput with total token throughput ( #8164 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-04 20:23:22 +00:00
e02ce498be
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models ( #5649 )
...
Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com >
Co-authored-by: Kyle Mistele <kyle@constellate.ai >
2024-09-04 13:18:13 -07:00
561d6f8077
[CI] Change test input in Gemma LoRA test ( #8163 )
2024-09-04 13:05:50 -07:00
d1dec64243
[CI/Build][ROCm] Enabling LoRA tests on ROCm ( #7369 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-09-04 11:57:54 -07:00
2ad2e5608e
[MISC] Consolidate FP8 kv-cache tests ( #8131 )
2024-09-04 18:53:25 +00:00
d3311562fb
[Bugfix] remove post_layernorm in siglip ( #8106 )
2024-09-04 18:55:37 +08:00
ccd7207191
chore: Update check-wheel-size.py to read MAX_SIZE_MB from env ( #8103 )
2024-09-03 23:17:05 -07:00
855c262a6b
[Frontend] Multimodal support in offline chat ( #8098 )
2024-09-04 05:22:17 +00:00
2be8ec6e71
[Model] Add Ultravox support for multiple audio chunks ( #7963 )
2024-09-04 04:38:21 +00:00
e16fa99a6a
[Misc] Update fbgemmfp8 to use vLLMParameters ( #7972 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-09-03 20:12:41 -06:00
61f4a93d14
[TPU][Bugfix] Use XLA rank for persistent cache path ( #8137 )
2024-09-03 18:35:33 -07:00
d4db9f53c8
[Benchmark] Add --async-engine option to benchmark_throughput.py ( #7964 )
2024-09-03 20:57:41 -04:00
2188a60c7e
[Misc] Update GPTQ to use vLLMParameters ( #7976 )
2024-09-03 17:21:44 -04:00
dc0b6066ab
[CI] Change PR remainder to avoid at-mentions ( #8134 )
2024-09-03 14:11:42 -07:00
0af3abe3d3
[TPU][Bugfix] Fix next_token_ids shape ( #8128 )
2024-09-03 13:29:24 -07:00
f1575dc99f
[ci] Fix GHA workflow ( #8129 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-03 13:25:09 -07:00
c02638efb3
[CI/Build] make pip install vllm work in macos (for import only) ( #8118 )
2024-09-03 12:37:08 -07:00
652c83b697
[Misc] Raise a more informative exception in add/remove_logger ( #7750 )
2024-09-03 12:28:25 -07:00
6d646d08a2
[Core] Optimize Async + Multi-step ( #8050 )
2024-09-03 18:50:29 +00:00
95a178f861
[CI] Only PR reviewers/committers can trigger CI on PR ( #8124 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-09-03 11:32:27 -07:00
bd852f2a8b
[Performance] Enable chunked prefill and prefix caching together ( #8120 )
...
Co-authored-by: Tao He <sighingnow@gmail.com >
Co-authored-by: Juelianqvq <Juelianqvq@noreply.github.com >
2024-09-03 10:49:18 -07:00
ec266536b7
[Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backend ( #8061 )
2024-09-03 21:37:52 +08:00
0fbc6696c2
[Bugfix] Fix single output condition in output processor ( #7881 )
2024-09-02 20:35:42 -07:00
6e36f4fa6c
improve chunked prefill performance
...
[Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874 )
2024-09-02 14:20:12 -07:00
dd2a6a82e3
[Bugfix] Fix internlm2 tensor parallel inference ( #8055 )
2024-09-02 23:48:56 +08:00
4ca65a9763
[Core][Bugfix] Accept GGUF model without .gguf extension ( #8056 )
2024-09-02 08:43:26 -04:00
e2b2aa5a0f
[TPU] Align worker index with node boundary ( #7932 )
2024-09-01 23:09:46 -07:00
e6a26ed037
[SpecDecode][Kernel] Flashinfer Rejection Sampling ( #7244 )
2024-09-01 21:23:29 -07:00
f8d60145b4
[Model] Add Granite model ( #7436 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-09-01 18:37:18 -07:00
5b86b19954
[Misc] Optional installation of audio related packages ( #8063 )
2024-09-01 14:46:57 -07:00
5231f0898e
[Frontend][VLM] Add support for multiple multi-modal items ( #8049 )
2024-08-31 16:35:53 -07:00
8423aef4c8
[BugFix][Core] Multistep Fix Crash on Request Cancellation ( #8059 )
2024-08-31 19:44:03 +00:00
4f5d8446ed
[Bugfix] Fix ModelScope models in v0.5.5 ( #8037 )
2024-08-31 00:27:58 -07:00
d05f0a9db2
[Bugfix] Fix import error in Phi-3.5-MoE ( #8052 )
2024-08-30 22:26:55 -07:00
622f8abff8
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. ( #8013 )
2024-08-30 22:18:50 -07:00
1248e8506a
[Model] Adding support for MSFT Phi-3.5-MoE ( #7729 )
...
Co-authored-by: Your Name <you@example.com >
Co-authored-by: Zeqi Lin <zelin@microsoft.com >
Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com >
2024-08-30 13:42:57 -06:00
2684efc467
[TPU][Bugfix] Fix tpu type api ( #8035 )
2024-08-30 09:01:26 -07:00
058344f89a
[Frontend]-config-cli-args ( #7737 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com >
2024-08-30 08:21:02 -07:00
98cef6a227
[Core] Increase default max_num_batched_tokens for multimodal models ( #8028 )
2024-08-30 08:20:34 -07:00
f97be32d1d
[VLM][Model] TP support for ViTs ( #7186 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-08-30 08:19:27 -07:00
afd39a4511
[Bugfix] Fix import error in Exaone model ( #8034 )
2024-08-30 08:03:28 -07:00
2148441fd3
[TPU] Support single and multi-host TPUs on GKE ( #7613 )
2024-08-30 00:27:40 -07:00
dc13e99348
[MODEL] add Exaone model support ( #7819 )
2024-08-29 23:34:20 -07:00
34a0e96d46
[Kernel] changing fused moe kernel chunk size default to 32k ( #7995 )
2024-08-30 04:11:39 +00:00
80c7b089b1
[TPU] Async output processing for TPU ( #8011 )
2024-08-29 19:35:29 -07:00
428dd1445e
[Core] Logprobs support in Multi-step ( #7652 )
2024-08-29 19:19:08 -07:00
4abed65c58
[VLM] Disallow overflowing max_model_len for multimodal models ( #7998 )
2024-08-29 17:49:04 -07:00
0c785d344d
Add more percentiles and latencies ( #7759 )
2024-08-29 16:48:11 -07:00
4664ceaad6
support bitsandbytes 8-bit and FP4 quantized models ( #7445 )
2024-08-29 19:09:08 -04:00
257afc37c5
[Neuron] Adding support for context-length, token-gen buckets. ( #7885 )
...
Co-authored-by: Harsha Bikki <harbikh@amazon.com >
2024-08-29 13:58:14 -07:00
86a677de42
[misc] update tpu int8 to use new vLLM Parameters ( #7973 )
2024-08-29 16:46:55 -04:00
d78789ac16
[Bugfix] Fix incorrect vocab embedding shards for GGUF model in tensor parallelism ( #7954 )
2024-08-29 15:54:49 -04:00
c334b1898b
extend cuda graph size for H200 ( #7894 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-08-29 12:15:04 -07:00
6b3421567d
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto ( #7985 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-29 14:53:11 -04:00
3f60f2244e
[Core] Combine async postprocessor and multi-step ( #7921 )
2024-08-29 11:18:26 -07:00
f205c09854
[Bugfix] Unify rank computation across regular decoding and speculative decoding ( #7899 )
2024-08-28 22:18:13 -07:00
ef99a78760
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." ( #7982 )
2024-08-28 21:27:06 -07:00
74d5543ec5
[VLM][Core] Fix exceptions on ragged NestedTensors ( #7974 )
2024-08-29 03:24:31 +00:00
a7f65c2be9
[torch.compile] remove reset ( #7975 )
2024-08-28 17:32:26 -07:00
4289cad37f
[Frontend] Minor optimizations to zmq decoupled front-end ( #7957 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-08-28 17:22:43 -07:00
af59df0a10
Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test ( #7961 )
2024-08-28 19:19:17 -04:00
ce6bf3a2cf
[torch.compile] avoid Dynamo guard evaluation overhead ( #7898 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-08-28 16:10:12 -07:00
3cdfe1f38b
[Bugfix] Make torch registration of punica ops optional ( #7970 )
2024-08-28 16:11:49 -06:00
fdd9daafa3
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM ( #7651 )
2024-08-28 15:06:52 -07:00
8c56e57def
[Doc] fix 404 link ( #7966 )
2024-08-28 13:54:23 -07:00
eeffde1ac0
[TPU] Upgrade PyTorch XLA nightly ( #7967 )
2024-08-28 13:10:21 -07:00
e5697d161c
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ ( #7386 )
2024-08-28 15:37:47 -04:00
b98cc28f91
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. ( #7798 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-08-28 10:01:22 -07:00
ef9baee3c5
[Bugfix][VLM] Fix incompatibility between #7902 and #7230 ( #7948 )
2024-08-28 08:11:18 -07:00
98c12cffe5
[Doc] fix the autoAWQ example ( #7937 )
2024-08-28 12:12:32 +00:00
f52a43a8b9
[ci][test] fix pp test failure ( #7945 )
2024-08-28 01:27:07 -07:00
e3580537a4
[Performance] Enable chunked prefill and prefix caching together ( #7753 )
2024-08-28 00:36:31 -07:00
f508e03e7f
[Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) ( #7911 )
2024-08-28 00:02:30 -07:00
51f86bf487
[mypy][CI/Build] Fix mypy errors ( #7929 )
2024-08-27 23:47:44 -07:00
c166e7e43e
[Bugfix] Allow ScalarType to be compiled with pytorch 2.3 and add checks for registering FakeScalarType and dynamo support. ( #7886 )
2024-08-27 23:13:45 -04:00
bc6e42a9b1
[hardware][rocm] allow rocm to override default env var ( #7926 )
2024-08-27 19:50:06 -07:00
fab5f53e2d
[Core][VLM] Stack multimodal tensors to represent multiple images within each prompt ( #7902 )
2024-08-28 01:53:56 +00:00
9c71c97ae2
[mypy] Enable mypy type checking for vllm/core ( #7229 )
2024-08-28 07:11:14 +08:00
5340a2dccf
[Model] Add multi-image input support for LLaVA-Next offline inference ( #7230 )
2024-08-28 07:09:02 +08:00
345be0e244
[benchmark] Update TGI version ( #7917 )
2024-08-27 15:07:53 -07:00
fc911880cc
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7766 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
2024-08-27 15:07:09 -07:00
ed6f002d33
[cuda][misc] error on empty CUDA_VISIBLE_DEVICES ( #7924 )
2024-08-27 12:06:11 -07:00
b09c755be8
[Bugfix] Fix phi3v incorrect image_idx when using async engine ( #7916 )
2024-08-27 17:36:09 +00:00
42e932c7d4
[CI/Build][ROCm] Enabling tensorizer tests for ROCm ( #7237 )
2024-08-27 10:09:13 -07:00
076169f603
[Hardware][Intel GPU] Add intel GPU pipeline parallel support. ( #7810 )
2024-08-27 10:07:02 -07:00
9db642138b
[CI/Build][VLM] Cleanup multiple images inputs model test ( #7897 )
2024-08-27 15:28:30 +00:00
6fc4e6e07a
[Model] Add Mistral Tokenization to improve robustness and chat encoding ( #7739 )
2024-08-27 12:40:02 +00:00
9606c7197d
Revert #7509 ( #7887 )
2024-08-27 00:16:31 -07:00
64cc644425
[core][torch.compile] discard the compile for profiling ( #7796 )
2024-08-26 21:33:58 -07:00
39178c7fbc
[Tests] Disable retries and use context manager for openai client ( #7565 )
2024-08-26 21:33:17 -07:00
2eedede875
[Core] Asynchronous Output Processor ( #7049 )
...
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com >
2024-08-26 20:53:20 -07:00
015e6cc252
[Misc] Update compressed tensors lifecycle to remove prefix from create_weights ( #7825 )
2024-08-26 18:09:34 -06:00
760e9f71a8
[Bugfix] neuron: enable tensor parallelism ( #7562 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-08-26 15:13:13 -07:00
05826c887b
[misc] fix custom allreduce p2p cache file generation ( #7853 )
2024-08-26 15:02:25 -07:00
dd9857f5fa
[Misc] Update gptq_marlin_24 to use vLLMParameters ( #7762 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-26 17:44:54 -04:00
665304092d
[Misc] Update qqq to use vLLMParameters ( #7805 )
2024-08-26 13:16:15 -06:00
2deb029d11
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule ( #7822 )
2024-08-26 11:24:53 -07:00
029c71de11
[CI/Build] Avoid downloading all HF files in RemoteOpenAIServer ( #7836 )
2024-08-26 05:31:10 +00:00
0b769992ec
[Bugfix]: Use float32 for base64 embedding ( #7855 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org >
2024-08-26 03:16:38 +00:00
1856aff4d6
[Spec Decoding] Streamline batch expansion tensor manipulation ( #7851 )
2024-08-25 15:45:14 -07:00
70c094ade6
[misc][cuda] improve pynvml warning ( #7852 )
2024-08-25 14:30:09 -07:00
2059b8d9ca
[Misc] Remove snapshot_download usage in InternVL2 test ( #7835 )
2024-08-25 15:53:09 +00:00
8aaf3d5347
[Model][VLM] Support multi-images inputs for Phi-3-vision models ( #7783 )
2024-08-25 11:51:20 +00:00
80162c44b1
[Bugfix] Fix Phi-3v crash when input images are of certain sizes ( #7840 )
2024-08-24 18:16:24 -07:00
aab0fcdb63
[ci][test] fix RemoteOpenAIServer ( #7838 )
2024-08-24 17:31:28 +00:00
ea9fa160e3
[ci][test] exclude model download time in server start time ( #7834 )
2024-08-24 01:03:27 -07:00
7d9ffa2ae1
[misc][core] lazy import outlines ( #7831 )
2024-08-24 00:51:38 -07:00
d81abefd2e
[Frontend] add json_schema support from OpenAI protocol ( #7654 )
2024-08-23 23:07:24 -07:00
8da48e4d95
[Frontend] Publish Prometheus metrics in run_batch API ( #7641 )
2024-08-23 23:04:22 -07:00
6885fde317
[Bugfix] Fix run_batch logger ( #7640 )
2024-08-23 13:58:26 -07:00
9db93de20c
[Core] Add multi-step support to LLMEngine ( #7789 )
2024-08-23 12:45:53 -07:00
09c7792610
Bump version to v0.5.5 ( #7823 )
2024-08-23 11:35:33 -07:00
f1df5dbfd6
[Misc] Update marlin to use vLLMParameters ( #7803 )
2024-08-23 14:30:52 -04:00
35ee2ad6b9
[github][misc] promote asking llm first ( #7809 )
2024-08-23 09:38:50 -07:00
e25fee57c2
[BugFix] Fix server crash on empty prompt ( #7746 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
2024-08-23 13:12:44 +00:00
faeddb565d
[misc] Add Torch profiler support for CPU-only devices ( #7806 )
2024-08-23 05:46:25 +00:00
fc5ebbd1d3
[Hardware][Intel GPU] refactor xpu_model_runner for tp ( #7712 )
2024-08-22 20:06:54 -07:00
c01a6cb231
[Ray backend] Better error when pg topology is bad. ( #7584 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-08-22 17:44:25 -07:00
b903e1ba7f
[Frontend] error suppression cleanup ( #7786 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-08-22 21:50:21 +00:00
a152246428
[Misc] fix typo in triton import warning ( #7794 )
2024-08-22 13:51:23 -07:00
666ad0aa16
[ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args ( #7705 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-22 20:10:55 +00:00
15310b5101
[Bugfix] Use LoadFormat values for vllm serve --load-format ( #7784 )
2024-08-22 11:37:08 -07:00
57792ed469
[Doc] Fix incorrect docs from #7615 ( #7788 )
2024-08-22 10:02:06 -07:00
d3b5b98021
[Misc] Enhance prefix-caching benchmark tool ( #6568 )
2024-08-22 09:32:02 -07:00
cc0eaf12b1
[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output ( #7232 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-08-22 09:33:48 -04:00
955b5191c9
[Misc] update fp8 to use vLLMParameter ( #7437 )
2024-08-22 08:36:18 -04:00
55d63b1211
[Bugfix] Don't build machete on cuda <12.0 ( #7757 )
2024-08-22 08:28:52 -04:00
4f419c00a6
Fix ShardedStateLoader for vllm fp8 quantization ( #7708 )
2024-08-22 08:25:04 -04:00
a3fce56b88
[Speculative Decoding] EAGLE Implementation with Top-1 proposer ( #6830 )
2024-08-22 02:42:24 -07:00
b3856bef7d
[Misc] Use torch.compile for GemmaRMSNorm ( #7642 )
2024-08-22 01:14:13 -07:00
8c6f694a79
[ci] refine dependency for distributed tests ( #7776 )
2024-08-22 00:54:15 -07:00
eeee1c3b1a
[TPU] Avoid initializing TPU runtime in is_tpu ( #7763 )
2024-08-21 21:31:49 -07:00
aae74ef95c
Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7527 )" ( #7764 )
2024-08-22 03:42:14 +00:00
cde9183b40
[Bug][Frontend] Improve ZMQ client robustness ( #7443 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-08-22 02:18:11 +00:00
df1a21131d
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue ( #7710 )
2024-08-22 09:36:24 +08:00
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce ( #7233 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-21 20:18:00 -04:00
9984605412
[AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility ( #7477 )
...
Co-authored-by: Charlie Fu <Charlie.Fu@amd.com >
2024-08-21 16:47:36 -07:00
7eebe8ccaa
[distributed][misc] error on same VLLM_HOST_IP setting ( #7756 )
2024-08-21 16:25:34 -07:00
8678a69ab5
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7527 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
2024-08-21 16:17:10 -07:00
5844017285
[ci] [multi-step] narrow multi-step test dependency paths ( #7760 )
2024-08-21 15:52:40 -07:00
1ca0d4f86b
[Model] Add UltravoxModel and UltravoxConfig ( #7615 )
2024-08-21 22:49:39 +00:00
dd53c4b023
[misc] Add Torch profiler support ( #7451 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-21 15:39:26 -07:00
970dfdc01d
[Frontend] Improve Startup Failure UX ( #7716 )
2024-08-21 19:53:01 +00:00
91f4522cbf
[multi-step] Raise error if not using async engine ( #7703 )
2024-08-21 11:49:19 -07:00
1b32e02648
[Bugfix] Pass PYTHONPATH from setup.py to CMake ( #7730 )
2024-08-21 11:17:48 -07:00
f7e3b0c5aa
[Bugfix][Frontend] Fix Issues Under High Load With zeromq
Frontend ( #7394 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-21 13:34:14 -04:00
d3c002eadc
[Bugfix] chat method add_generation_prompt param ( #7734 )
2024-08-21 17:33:35 +00:00
9b73a2f498
[Spec Decoding] Use target model max length as default for draft model ( #7706 )
2024-08-22 00:23:22 +08:00
6925cdbeea
[Bugfix][Hardware][CPU] Fix mm_limits initialization for CPU backend ( #7735 )
2024-08-21 16:23:03 +00:00
53328d7536
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] ( #7509 )
2024-08-21 08:54:31 -07:00
c75363fbc0
[BugFix] Avoid premature async generator exit and raise all exception variations ( #7698 )
2024-08-21 11:45:55 -04:00
dd3fa0e430
[Bugfix] Mirror jinja2 in pyproject.toml ( #7723 )
2024-08-21 13:41:17 +00:00
baaedfdb2d
[mypy] Enable following imports for entrypoints ( #7248 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: Fei <dfdfcai4@gmail.com >
2024-08-20 23:28:21 -07:00
4506641212
[Doc] Section for Multimodal Language Models ( #7719 )
2024-08-20 23:24:01 -07:00
12e1c65bc9
[Model] Add AWQ quantization support for InternVL2 model ( #7187 )
2024-08-20 23:18:57 -07:00
b74a125800
[ci] try to log process using the port to debug the port usage ( #7711 )
2024-08-20 17:41:12 -07:00
66a9e713a7
[Core] Pipe worker_class_fn argument in Executor ( #7707 )
2024-08-21 00:37:39 +00:00
9e51b6a626
[ci][test] adjust max wait time for cpu offloading test ( #7709 )
2024-08-20 17:12:44 -07:00
6e4658c7aa
[Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) ( #7685 )
2024-08-20 12:01:09 -07:00
3b682179dd
[Core] Add AttentionState abstraction ( #7663 )
2024-08-20 18:50:45 +00:00
c6af027a35
[Misc] Add jinja2 as an explicit build requirement ( #7695 )
2024-08-20 17:17:47 +00:00
2aa00d59ad
[CI/Build] Pin OpenTelemetry versions and make errors clearer ( #7266 )
2024-08-20 10:02:21 -07:00
c42590f97a
[Hardware] [Intel GPU] refactor xpu worker/executor ( #7686 )
2024-08-20 09:54:10 -07:00
aae6927be0
[VLM][Model] Add test for InternViT vision encoder ( #7409 )
2024-08-20 23:10:20 +08:00
398521ad19
[OpenVINO] Updated documentation ( #7687 )
2024-08-20 07:33:56 -06:00
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel ( #7174 )
2024-08-20 07:09:33 -06:00
b6f99a6ffe
[Core] Refactor executor classes for easier inheritance ( #7673 )
2024-08-20 00:56:50 -07:00
ad28a74beb
[misc][cuda] add warning for pynvml user ( #7675 )
2024-08-20 00:35:09 -07:00
e6d811dd13
[XPU] fallback to native implementation for xpu custom op ( #7670 )
2024-08-20 00:26:09 -07:00
c4be16e1a7
[misc] add nvidia related library in collect env ( #7674 )
2024-08-19 23:22:49 -07:00
3d8a5f063d
[CI] Organizing performance benchmark files ( #7616 )
2024-08-19 22:43:54 -07:00
f4fc7337bf
[Bugfix] support tie_word_embeddings for all models ( #5724 )
2024-08-19 20:00:04 -07:00
0df7ec0b2d
[ci] Install Buildkite test suite analysis ( #7667 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-19 19:55:04 -07:00
312f761232
[Speculative Decoding] Fixing hidden states handling in batch expansion ( #7508 )
2024-08-19 17:58:14 -07:00
e54ebc2f8f
[doc] fix doc build error caused by msgspec ( #7659 )
2024-08-19 17:50:59 -07:00
67e02fa8a4
[Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding ( #7665 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-08-20 00:43:09 +00:00
43735bf5e1
[TPU] Remove redundant input tensor cloning ( #7660 )
2024-08-19 15:55:04 -07:00
da115230fd
[Bugfix] Don't disable existing loggers ( #7664 )
2024-08-19 15:11:58 -07:00
7601cb044d
[Core] Support tensor parallelism for GGUF quantization ( #7520 )
2024-08-19 17:30:14 -04:00
47b65a5508
[core] Multi Step Scheduling ( #7000 )
...
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com >
2024-08-19 13:52:13 -07:00
dad961ef5c
[Bugfix] fix lora_dtype value type in arg_utils.py - part 2 ( #5428 )
2024-08-19 20:47:00 +00:00
3ac50b47d0
[MISC] Add prefix cache hit rate to metrics ( #7606 )
2024-08-19 11:52:07 -07:00
df845b2b46
[Misc] Remove Gemma RoPE ( #7638 )
2024-08-19 09:29:31 -07:00
1a36287b89
[Bugfix] Fix xpu build ( #7644 )
2024-08-18 22:00:09 -07:00
f710fb5265
[Core] Use flashinfer sampling kernel when available ( #7137 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-19 03:24:03 +00:00
ff7ec82c4d
[Core] Optimize SPMD architecture with delta + serialization optimization ( #7109 )
2024-08-18 17:57:20 -07:00
200a2ffa6b
[Misc] Refactor Llama3 RoPE initialization ( #7637 )
2024-08-18 17:18:12 -07:00
40e1360bb6
[CI/Build] Add text-only test for Qwen models ( #7475 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com >
2024-08-19 07:43:46 +08:00
e3b318216d
[ Bugfix ] Fix Prometheus Metrics With zeromq Frontend ( #7279 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-18 20:19:48 +00:00
ab7165f2c7
[TPU] Optimize RoPE forward_native2 ( #7636 )
2024-08-18 01:15:10 -07:00
0c2fa50b84
[TPU] Use mark_dynamic only for dummy run ( #7634 )
2024-08-18 00:18:53 -07:00
ce143353c6
[TPU] Skip creating empty tensor ( #7630 )
2024-08-17 14:22:46 -07:00
bbf55c4805
[VLM] Refactor MultiModalConfig initialization and profiling ( #7530 )
2024-08-17 13:30:55 -07:00
1ef13cf92f
[Misc]Fix BitAndBytes exception messages ( #7626 )
2024-08-17 12:02:14 -07:00
832163b875
[ci][test] allow longer wait time for api server ( #7629 )
2024-08-17 11:26:38 -07:00
e73f76eec6
[Model] Pipeline parallel support for JAIS ( #7603 )
2024-08-17 11:11:09 -07:00
d95cc0a55c
[core][misc] update libcudart finding ( #7620 )
...
Co-authored-by: cjackal <44624812+cjackal@users.noreply.github.com >
2024-08-16 23:01:35 -07:00
5bf45db7df
[ci][test] fix engine/logger test ( #7621 )
2024-08-16 23:00:59 -07:00
eed020f673
[misc] use nvml to get consistent device name ( #7582 )
2024-08-16 21:15:13 -07:00
7c0b7ea214
[Bugfix] add >= 1.0 constraint for openai dependency ( #7612 )
2024-08-16 20:56:01 -07:00
4706eb628e
[aDAG] Unflake aDAG + PP tests ( #7600 )
2024-08-16 20:49:30 -07:00
bae888cb8e
[Bugfix] Clear engine reference in AsyncEngineRPCServer ( #7618 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-16 20:44:05 -07:00
6bd19551b0
.[Build/CI] Enabling passing AMD tests. ( #7610 )
2024-08-16 20:25:32 -07:00
e680349994
[Bugfix] Fix custom_ar support check ( #7617 )
2024-08-16 19:05:49 -07:00
44f26a9466
[Model] Align nemotron config with final HF state and fix lm-eval-small ( #7611 )
2024-08-16 15:56:34 -07:00
37fd47e780
[Kernel] fix types used in aqlm and ggml kernels to support dynamo ( #7596 )
2024-08-16 14:00:11 -07:00
7759ae958f
[Kernel][Misc] dynamo support for ScalarType ( #7594 )
2024-08-16 13:59:49 -07:00
9f69856356
[Kernel] register punica functions as torch ops ( #7591 )
2024-08-16 13:59:38 -07:00
d4f0f17b02
[Doc] Update quantization supported hardware table ( #7595 )
2024-08-16 13:59:27 -07:00
b3f4e17935
[Doc] Add docs for llmcompressor INT8 and FP8 checkpoints ( #7444 )
2024-08-16 13:59:16 -07:00
93478b63d2
[Core] Fix tracking of model forward time in case of PP>1 ( #7440 )
2024-08-16 13:46:01 -07:00
f366f6339b
[spec decode] [4/N] Move update_flash_attn_metadata to attn backend ( #7571 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-16 11:41:56 -07:00
855866caa9
[Kernel] Add tuned triton configs for ExpertsInt8 ( #7601 )
2024-08-16 11:37:01 -07:00
7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE ( #7415 )
2024-08-16 10:06:51 -07:00
e837b624f2
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm ( #7210 )
2024-08-16 10:06:30 -07:00
ec724a725e
support tqdm in notebooks ( #7510 )
2024-08-16 09:17:50 -07:00
0e39a33c6d
[Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method ( #7513 )
2024-08-16 10:05:18 -06:00
6fc5b0f249
[CI] Fix crashes of performance benchmark ( #7500 )
2024-08-16 08:08:45 -07:00
9587b050fb
[Core] Use uvloop with zmq-decoupled front-end ( #7570 )
2024-08-15 22:48:07 -07:00
54bd9a03c4
register custom op for flash attn and use from torch.ops ( #7536 )
2024-08-15 22:38:56 -07:00
50b8d08dbd
[Misc/Testing] Use torch.testing.assert_close ( #7324 )
2024-08-16 04:24:04 +00:00
e165528778
[CI] Move quantization cpu offload tests out of fastcheck ( #7574 )
2024-08-15 21:16:20 -07:00
3b19e39dc5
Chat method for offline llm ( #5049 )
...
Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-08-15 19:41:34 -07:00
4cd7d47fed
[ci/test] rearrange tests and make adag test soft fail ( #7572 )
2024-08-15 19:39:04 -07:00
f878c8feb0
[Feature]: Add OpenAI server prompt_logprobs support #6508 ( #7453 )
2024-08-16 02:38:08 +00:00
b67ae00cdb
[Misc] Add quantization config support for speculative model. ( #7343 )
2024-08-15 19:34:28 -07:00
9c8e2d1161
[Bugfix][Harmless] Fix float16 dtype for model_is_embedding ( #7566 )
2024-08-15 18:26:19 -07:00
21313e09e3
[Bugfix] Fix default weight loading for scalars ( #7534 )
2024-08-15 13:10:22 -07:00
f4da5f7b6d
[Misc] Update dockerfile for CPU to cover protobuf installation ( #7182 )
2024-08-15 10:03:01 -07:00
9c1f78d5d6
[Bugfix] update neuron for version > 0.5.0 ( #7175 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-15 09:44:14 -07:00
fc93e56143
[Bugfix][TPU] Correct env variable for XLA cache path ( #7544 )
2024-08-15 00:02:29 -07:00
22b39e11f2
llama_index serving integration documentation ( #6973 )
...
Co-authored-by: pavanmantha <pavan.mantha@thevaslabs.io >
2024-08-14 15:38:37 -07:00
f55a9aea45
[Misc] Revert compressed-tensors code reuse ( #7521 )
2024-08-14 15:07:37 -07:00
951fdd66d3
[TPU] Set per-rank XLA cache ( #7533 )
2024-08-14 14:47:51 -07:00
2ecf7b1757
[core] [3/N] multi-step args and sequence.py ( #7452 )
2024-08-14 12:32:45 -07:00
3f674a49b5
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt ( #7126 )
2024-08-14 17:55:42 +00:00
70b746efcf
[Misc] Deprecation Warning when setting --engine-use-ray ( #7424 )
...
Signed-off-by: Wallas Santos <wallashss@ibm.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-08-14 09:44:27 -07:00
67d115db08
[Bugfix][Frontend] Disable embedding API for chat models ( #7504 )
...
Co-authored-by: jack <jack@alex>
2024-08-14 09:15:19 -07:00
d3d9cb6e4b
[ci] fix model tests ( #7507 )
2024-08-14 01:01:43 -07:00
c134a46402
Fix empty output when temp is too low ( #2937 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-08-14 05:31:44 +00:00
199adbb7cf
[doc] update test script to include cudagraph ( #7501 )
2024-08-13 21:52:58 -07:00
dd164d72f3
[Bugfix][Docs] Update list of mock imports ( #7493 )
2024-08-13 20:37:30 -07:00
ea49e6a3c8
[misc][ci] fix cpu test with plugins ( #7489 )
2024-08-13 19:27:46 -07:00
97992802f3
[CI/Build]Reduce the time consumption for LoRA tests ( #7396 )
2024-08-13 17:27:29 -07:00
59edd0f134
[Bugfix][CI] Import ray under guard ( #7486 )
2024-08-13 17:12:58 -07:00
a08df8322e
[TPU] Support multi-host inference ( #7457 )
2024-08-13 16:31:20 -07:00
16422ea76f
[misc][plugin] add plugin system implementation ( #7426 )
2024-08-13 16:24:17 -07:00
373538f973
[Misc] compressed-tensors code reuse ( #7277 )
2024-08-13 19:05:15 -04:00
33e5d7e6b6
[frontend] spawn engine process from api server process ( #7484 )
2024-08-13 15:40:17 -07:00
c5c7768264
Announce NVIDIA Meetup ( #7483 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-08-13 14:28:36 -07:00
b1e5afc3e7
[Misc] Update awq and awq_marlin to use vLLMParameters ( #7422 )
2024-08-13 17:08:20 -04:00
d3bdfd3ab9
[Misc] Update Fused MoE weight loading ( #7334 )
2024-08-13 14:57:45 -04:00
fb377d7e74
[Misc] Update gptq_marlin to use new vLLMParameters ( #7281 )
2024-08-13 14:30:11 -04:00
181abbc27d
[Misc] Update LM Eval Tolerance ( #7473 )
2024-08-13 14:28:14 -04:00
00c3d68e45
[Frontend][Core] Add plumbing to support audio language models ( #7446 )
2024-08-13 17:39:33 +00:00
e20233d361
Revert "[Doc] Update supported_hardware.rst ( #7276 )" ( #7467 )
2024-08-13 01:37:08 -07:00
d6e634f3d7
[TPU] Suppress import custom_ops warning ( #7458 )
2024-08-13 00:30:30 -07:00
4d2dc5072b
[hardware] unify usage of is_tpu to current_platform.is_tpu() ( #7102 )
2024-08-13 00:16:42 -07:00
7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 ( #7410 )
2024-08-13 05:33:41 +00:00
5469146bcc
[ci] Remove fast check cancel workflow ( #7455 )
2024-08-12 21:19:51 -07:00
97a6be95ba
[Misc] improve logits processors logging message ( #7435 )
2024-08-13 02:29:34 +00:00
9ba85bc152
[mypy] Misc. typing improvements ( #7417 )
2024-08-13 09:20:20 +08:00
198d6a2898
[Core] Shut down aDAG workers with clean async llm engine exit ( #7224 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-12 17:57:16 -07:00
774cd1d3bf
[CI/Build] bump minimum cmake version ( #6999 )
2024-08-12 16:29:20 -07:00
91294d56e1
[Bugfix] Handle PackageNotFoundError when checking for xpu version ( #7398 )
2024-08-12 16:07:20 -07:00
a046f86397
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel ( #7208 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-08-12 22:47:41 +00:00
4ddc4743d7
[Core] Consolidate GB constant and enable float GB arguments ( #7416 )
2024-08-12 14:14:14 -07:00
6aa33cb2dd
[Misc] Use scalar type to dispatch to different gptq_marlin kernels ( #7323 )
2024-08-12 14:40:13 -04:00
1137f343aa
[ci] Cancel fastcheck when PR is ready ( #7433 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-12 10:59:14 -07:00
9b3e2edd30
[ci] Cancel fastcheck run when PR is marked ready ( #7427 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-12 10:56:52 -07:00
65950e8f58
[ci] Entrypoints run upon changes in vllm/ ( #7423 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-12 10:18:03 -07:00
cfba4def5d
[Bugfix] Fix logit soft cap in flash-attn backend ( #7425 )
2024-08-12 09:58:28 -07:00
d2bc4510a4
[CI/Build] bump Dockerfile.neuron image base, use public ECR ( #6832 )
2024-08-12 09:53:35 -07:00
24154f8618
[Frontend] Disallow passing model as both argument and option ( #7347 )
2024-08-12 12:58:34 +00:00
e6e42e4b17
[Core][VLM] Support image embeddings as input ( #6613 )
2024-08-12 16:16:06 +08:00
ec2affa8ae
[Kernel] Flashinfer correctness fix for v0.1.3 ( #7319 )
2024-08-12 07:59:17 +00:00
86ab567bae
[CI/Build] Minor refactoring for vLLM assets ( #7407 )
2024-08-12 02:41:52 +00:00
f020a6297e
[Docs] Update readme ( #7316 )
2024-08-11 17:13:37 -07:00
6c8e595710
[misc] add commit id in collect env ( #7405 )
2024-08-11 15:40:48 -07:00
02b1988b9f
[Doc] building vLLM with VLLM_TARGET_DEVICE=empty ( #7403 )
2024-08-11 14:38:17 -07:00
386087970a
[CI/Build] build on empty device for better dev experience ( #4773 )
2024-08-11 13:09:44 -07:00
c08e2b3086
[core] [2/N] refactor worker_base input preparation for multi-step ( #7387 )
2024-08-11 08:50:08 -07:00
4fb7b52a2c
Updating LM Format Enforcer version to v0.10.6 ( #7189 )
2024-08-11 08:11:50 -04:00
90bab18f24
[TPU] Use mark_dynamic to reduce compilation time ( #7340 )
2024-08-10 18:12:22 -07:00
4c5d8e8ea9
[Bugfix] Fix phi3v batch inference when images have different aspect ratio ( #7392 )
2024-08-10 16:19:33 +00:00
baa240252e
[Core] Fix edge case in chunked prefill + block manager v2 ( #7380 )
2024-08-09 23:48:49 +00:00
999ef0b917
[Misc] Add numpy implementation of compute_slot_mapping ( #7377 )
2024-08-09 22:52:29 +00:00
5c6c54d67a
[Bugfix] Fix PerTensorScaleParameter weight loading for fused models ( #7376 )
2024-08-09 21:23:46 +00:00
933790c209
[Core] Add span metrics for model_forward, scheduler and sampler time ( #7089 )
2024-08-09 13:55:13 -07:00
70d268a399
[Bugfix] Fix ITL recording in serving benchmark ( #7372 )
2024-08-09 10:00:00 -07:00
249b88228d
[Frontend] Support embeddings in the run_batch API ( #7132 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-08-09 09:48:21 -07:00
74af2bbd90
[Bugfix] Fix reinit procedure in ModelInputForGPUBuilder ( #7360 )
2024-08-09 16:35:49 +00:00
fc7b8d1eef
[Performance] e2e overheads reduction: Small followup diff ( #7364 )
2024-08-09 15:49:36 +00:00
67abdbb42f
[VLM][Doc] Add stop_token_ids to InternVL example ( #7354 )
2024-08-09 14:51:04 +00:00
07ab160741
[Model][Jamba] Mamba cache single buffer ( #6739 )
...
Co-authored-by: Mor Zusman <morz@ai21.com >
2024-08-09 10:07:06 -04:00
b4e9528f95
[Core] Streamline stream termination in AsyncLLMEngine ( #7336 )
2024-08-09 07:06:36 +00:00
57b7be0e1c
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace ( #6971 )
2024-08-09 05:42:45 +00:00
99b4cf5f23
[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary ( #7218 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-08-08 22:08:46 -07:00
e02ac55617
[Performance] Optimize e2e overheads: Reduce python allocations ( #7162 )
2024-08-08 21:34:28 -07:00
73388c07a4
[TPU] Fix dockerfile.tpu ( #7331 )
2024-08-08 20:24:58 -07:00
7eb4a51c5f
[Core] Support serving encoder/decoder models ( #7258 )
2024-08-09 10:39:41 +08:00
0fa14907da
[TPU] Add Load-time W8A16 quantization for TPU Backend ( #7005 )
2024-08-08 18:35:49 -07:00
5923532e15
Add Skywork AI as Sponsor ( #7314 )
2024-08-08 13:59:57 -07:00
a049b107e2
[Misc] Temporarily resolve the error of BitAndBytes ( #7308 )
2024-08-08 13:42:58 -07:00
8334c39f37
[Bugfix] Fix new Llama3.1 GGUF model loading ( #7269 )
2024-08-08 13:42:44 -07:00
e904576743
[CI/Build] Dockerfile.cpu improvements ( #7298 )
2024-08-08 15:24:52 -04:00
e14fb22e59
[Doc] Put collect_env issue output in a <detail> block ( #7310 )
2024-08-08 11:22:49 -07:00
782e53ab59
[Bugfix][fast] Fix the get_num_blocks_touched logic ( #6849 )
2024-08-08 10:43:30 -07:00
21b9c49aa3
[Frontend] Kill the server on engine death ( #6594 )
...
Signed-off-by: Joe Runde <joe@joerun.de >
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-08-08 09:47:48 -07:00
5fb4a3f678
[Bugfix][Kernel] Increased atol to fix failing tests ( #7305 )
2024-08-08 12:16:13 -04:00
757ac70a64
[Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 ( #7273 )
2024-08-08 14:02:41 +00:00
6dffa4b0a6
[Bugfix] Fix LoRA with PP ( #7292 )
2024-08-08 00:02:27 -07:00
48abee9e54
[Frontend] remove max_num_batched_tokens limit for lora ( #7288 )
2024-08-08 06:17:29 +00:00
746709642c
[Misc] Fix typos in scheduler.py ( #7285 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-07 17:06:01 -07:00
e53dfd3eaf
[Kernel] Fix Flashinfer Correctness ( #7284 )
2024-08-07 16:26:52 -07:00
6d94420246
[Doc] Update supported_hardware.rst ( #7276 )
2024-08-07 14:21:50 -07:00
fc1493a01e
[FrontEnd] Make merge_async_iterators is_cancelled arg optional ( #7282 )
2024-08-07 13:35:14 -07:00
311f743831
[Bugfix] Fix gptq failure on T4s ( #7264 )
2024-08-07 20:05:37 +00:00
469b3bc538
[ci] Make building wheels per commit optional ( #7278 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-08-07 11:34:25 -07:00
5223199e03
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization ( #7219 )
2024-08-07 11:23:12 -07:00
fde47d3bc2
[BugFix] Fix frontend multiprocessing hang ( #7217 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-08-07 18:09:36 +00:00
0e12cd67a8
[Doc] add online speculative decoding example ( #7243 )
2024-08-07 09:58:02 -07:00
80cbe10c59
[OpenVINO] migrate to latest dependencies versions ( #7251 )
2024-08-07 09:49:10 -07:00
b764547616
[Bugfix] Fix input processor for InternVL2 model ( #7164 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-07 09:32:07 -07:00
ab0f5e2823
Fixes typo in function name ( #7275 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-08-07 09:29:27 -07:00
564985729a
[ BugFix ] Move zmq frontend to IPC instead of TCP ( #7222 )
2024-08-07 16:24:56 +00:00
0f7052bc7e
[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 ( #5874 )
2024-08-07 09:17:58 -07:00
639159b2a6
[distributed][misc] add specialized method for cuda platform ( #7249 )
2024-08-07 08:54:52 -07:00
66d617e343
[Frontend] Gracefully handle missing chat template and fix CI failure ( #7238 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-08-07 09:12:05 +00:00
7b261092de
[BUGFIX]: top_k is expected to be an integer. ( #7227 )
2024-08-07 00:32:16 -07:00
2385c8f374
[Doc] Mock new dependencies for documentation ( #7245 )
2024-08-07 06:43:03 +00:00
9a3f49ae07
[BugFix] Overhaul async request cancellation ( #7111 )
2024-08-07 13:21:41 +08:00
f9a5600649
[Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading ( #7225 )
2024-08-06 18:34:26 -07:00
fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) ( #4942 )
...
Co-authored-by: Andrew Feldman <afeld2012@gmail.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-06 16:51:47 -04:00
660470e5a3
[Core] Optimize evictor-v2 performance ( #7193 )
2024-08-06 12:34:25 -07:00
8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues ( #5941 )
...
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
2024-08-06 18:17:08 +00:00
5c60c8c423
[SpecDecode] [Minor] Fix spec decode sampler tests ( #7183 )
2024-08-06 10:40:32 -07:00
00afc78590
[Bugfix] add gguf dependency ( #7198 )
...
Co-authored-by: katarzyna.papis <kpapis@kpapis-u20.sclab.intel.com >
2024-08-06 10:08:35 -07:00
541c1852d3
[ BugFix ] Fix ZMQ when VLLM_PORT is set ( #7205 )
2024-08-06 09:26:26 -07:00
a3bbbfa1d8
[BugFix] Fix DeepSeek remote code ( #7178 )
2024-08-06 08:16:53 -07:00
1f26efbb3a
[Model] Support SigLIP encoder and alternative decoders for LLaVA models ( #7153 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-08-06 16:55:31 +08:00
9118217f58
[LoRA] Relax LoRA condition ( #7146 )
2024-08-06 01:57:25 +00:00
e3c664bfcb
[Build] Add initial conditional testing spec ( #6841 )
2024-08-05 17:39:22 -07:00
360bd67cf0
[Core] Support loading GGUF model ( #5191 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-05 17:54:23 -06:00
ef527be06c
[MISC] Use non-blocking transfer in prepare_input ( #7172 )
2024-08-05 23:41:27 +00:00
89b8db6bb2
[Bugfix] Specify device when loading LoRA and embedding tensors ( #7129 )
...
Co-authored-by: Jacob Schein <jacobschein@Jacobs-MacBook-Pro-2.local >
2024-08-05 16:35:47 -07:00
789937af2e
[Doc] [SpecDecode] Update MLPSpeculator documentation ( #7100 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-08-05 23:29:43 +00:00
dfb1a15dcb
[ci][frontend] deduplicate tests ( #7101 )
2024-08-05 15:59:22 -07:00
4db5176d97
bump version to v0.5.4 ( #7139 )
2024-08-05 14:39:48 -07:00
4cf1dc39be
[Bugfix][CI/Build] Fix CUTLASS FetchContent ( #7171 )
2024-08-05 14:22:57 -07:00
6e4852ce28
[CI/Build] Suppress divide-by-zero and missing return statement warnings ( #7001 )
2024-08-05 16:00:01 -04:00
8571ac4672
[Kernel] Update CUTLASS to 3.5.1 ( #7085 )
2024-08-05 15:13:43 -04:00
997cf78308
[Misc] Fix typo in GroupCoordinator.recv() ( #7167 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-05 11:10:16 -07:00
57f560aa23
[BugFix] Use args.trust_remote_code ( #7121 )
2024-08-05 09:26:14 -07:00
003f8ee128
[BugFix] Use IP4 localhost form for zmq bind ( #7163 )
2024-08-05 08:41:03 -07:00
e9630458c7
[SpecDecode] Support FlashInfer in DraftModelRunner ( #6926 )
2024-08-05 08:05:05 -07:00
82a1b1a82b
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification ( #6963 )
2024-08-05 08:46:44 +00:00
c0d8f1636c
[Model] SiglipVisionModel ported from transformers ( #6942 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-08-05 06:22:12 +00:00
cc08fc7225
[Frontend] Reapply "Factor out code for running uvicorn" ( #7095 )
2024-08-04 20:40:51 -07:00
7b86e7c9cd
[Model] Add multi-image support for minicpmv ( #7122 )
...
Co-authored-by: hezhihui <hzh7269@modelbest.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-05 09:23:17 +08:00
f80ab3521c
Clean up remaining Punica C information ( #7027 )
2024-08-04 15:37:08 -07:00
16a1cc9bb2
[misc][distributed] improve libcudart.so finding ( #7127 )
2024-08-04 11:31:51 -07:00
b1c9aa3daa
[Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when using MLPSpeculator ( #7105 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-08-04 07:13:18 -07:00
179a6a36f2
[Model]Refactor MiniCPMV ( #7020 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-04 08:12:41 +00:00
83c644fe7e
[core][misc] simply output processing with shortcut code path ( #7117 )
2024-08-04 00:22:19 -07:00
9fadc7b7a0
[misc] add zmq in collect env ( #7119 )
2024-08-03 22:03:46 -07:00
654bc5ca49
Support for guided decoding for offline LLM ( #6878 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-04 03:12:09 +00:00
825b044863
[Frontend] Warn if user max_model_len is greater than derived max_model_len ( #7080 )
...
Signed-off-by: Jefferson Fialho <jfialho@ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-03 16:01:38 -07:00
44dcb52e39
[ci][test] finalize fork_new_process_for_each_test ( #7114 )
2024-08-03 10:44:53 -07:00
67d745cc68
[CI] Temporarily turn off H100 performance benchmark ( #7104 )
2024-08-02 23:52:44 -07:00
99d7cabd7b
[LoRA] ReplicatedLinear support LoRA ( #7081 )
2024-08-02 22:40:19 -07:00
fb2c1c86c1
[Bugfix] Fix block table for seqs that have prefix cache hits ( #7018 )
2024-08-02 22:38:15 -07:00
0c25435daa
[Model] Refactor and decouple weight loading logic for InternVL2 model ( #7067 )
2024-08-02 22:36:14 -07:00
a0d164567c
[ci][distributed] disable ray dag tests ( #7099 )
2024-08-02 22:32:04 -07:00
04e5583425
[ci][distributed] merge distributed test commands ( #7097 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-02 21:33:53 -07:00
8c025fa703
[Frontend] Factor out chat message parsing ( #7055 )
2024-08-02 21:31:27 -07:00
69ea15e5cc
[ci][distributed] shorten wait time if server hangs ( #7098 )
2024-08-02 21:05:16 -07:00
ed812a73fa
[ Frontend ] Multiprocessing for OpenAI Server with zeromq ( #6883 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Joe Runde <joe@joerun.de >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-08-02 18:27:28 -07:00
708989341e
[misc] add a flag to enable compile ( #7092 )
2024-08-02 16:18:45 -07:00
22e718ff1a
[Misc] Revive to use loopback address for driver IP ( #7091 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-02 15:50:00 -07:00
05308891e2
[Core] Pipeline parallel with Ray ADAG ( #6837 )
...
Support pipeline-parallelism with Ray accelerated DAG.
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-02 13:55:40 -07:00
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType ( #6396 )
2024-08-02 13:51:58 -07:00
b482b9a5b1
[CI/Build] Add support for Python 3.12 ( #7035 )
2024-08-02 13:51:22 -07:00
806949514a
[ci] set timeout for test_oot_registration.py ( #7082 )
2024-08-02 10:03:24 -07:00
c16eaac500
[Hardware][Intel CPU] Update torch 2.4.0 for CPU backend ( #6931 )
2024-08-02 08:55:58 -07:00
db35186391
[Core] Comment out unused code in sampler ( #7023 )
2024-08-02 00:58:26 -07:00
660dea1235
[cuda][misc] remove error_on_invalid_device_count_status ( #7069 )
2024-08-02 00:14:21 -07:00
cf2a1a4d9d
Fix tracing.py ( #7065 )
2024-08-01 23:28:00 -07:00
252357793d
[ci][distributed] try to fix pp test ( #7054 )
2024-08-01 22:03:12 -07:00
3bb4b1e4cd
[mypy] Speed up mypy checking ( #7056 )
2024-08-01 19:49:43 -07:00
954f7305a1
[Kernel] Fix input for flashinfer prefill wrapper. ( #7008 )
2024-08-01 18:44:16 -07:00
6ce01f3066
[Performance] Optimize get_seqs ( #7051 )
2024-08-01 18:29:52 -07:00
6a11fdfbb8
[CI/Build][Bugfix] Fix CUTLASS header-only line ( #7034 )
2024-08-01 13:51:15 -07:00
805a8a75f2
[Misc] Support attention logits soft-capping with flash-attn ( #7022 )
2024-08-01 13:14:37 -07:00
562e580abc
Update run-amd-test.sh ( #7044 )
2024-08-01 13:12:37 -07:00
fc912e0886
[Models] Support Qwen model with PP ( #6974 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-08-01 12:40:43 -07:00
f4fd390f5d
[Bugfix] Lower gemma's unloaded_params exception to warning ( #7002 )
2024-08-01 12:01:07 -07:00
fb3db61688
[CI/Build] Remove sparseml requirement from testing ( #7037 )
2024-08-01 12:00:51 -07:00
2dd34371a6
[Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm ( #6992 )
2024-08-01 12:00:28 -07:00
7e0861bd0b
[CI/Build] Update PyTorch to 2.4.0 ( #6951 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-01 11:11:24 -07:00
a72a424b3e
[Build/CI] Fixing Docker Hub quota issue. ( #7043 )
2024-08-01 11:07:37 -07:00
c8a7e93273
[core][scheduler] simplify and improve scheduler ( #6867 )
2024-07-31 23:51:09 -07:00
3c10591ef2
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user ( #6954 )
2024-07-31 21:13:34 -07:00
0437492ea9
PP comm optimization: replace send with partial send + allgather ( #6695 )
...
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com >
2024-07-31 20:15:42 -07:00
630dd9e0ae
[Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings ( #6758 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-31 19:49:11 -07:00
23993a7997
[Bugfix][TPU] Do not use torch.Generator for TPUs ( #6981 )
2024-07-31 18:50:28 -07:00
1d2e7fb73f
[Model] Pipeline parallel support for Qwen2 ( #6924 )
2024-07-31 18:49:51 -07:00
7ecee34321
[Kernel][RFC] Refactor the punica kernel based on Triton ( #5036 )
2024-07-31 17:12:24 -07:00
7eb0cb4a14
Revert "[Frontend] Factor out code for running uvicorn" ( #7012 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-07-31 16:34:26 -07:00
a0dce9383a
[Misc] Add compressed-tensors to optimized quant list ( #7006 )
2024-07-31 14:40:44 -07:00
35e9c12bfa
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) ( #6996 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-31 14:40:32 -07:00
93548eb37e
[Kernel] Enable FP8 Cutlass for Ada Lovelace ( #6950 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-31 14:40:22 -07:00
460c1884e3
[Bugfix] Support cpu offloading with fp8 quantization ( #6960 )
2024-07-31 12:47:46 -07:00
bd70013407
[MISC] Introduce pipeline parallelism partition strategies ( #6920 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-07-31 12:02:17 -07:00
2ee8d3ba55
[Model] use FusedMoE layer in Jamba ( #6935 )
2024-07-31 12:00:24 -07:00
daed30c4a9
[Bugfix] Fix feature size calculation for LLaVA-NeXT ( #6982 )
2024-07-31 23:46:17 +08:00
2f4e108f75
[Bugfix] Clean up MiniCPM-V ( #6939 )
...
Co-authored-by: hezhihui <hzh7269@modelbest.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-31 14:39:19 +00:00
6512937de1
Support W4A8 quantization for vllm ( #5218 )
2024-07-31 07:55:21 -06:00
c0644cf9ce
[Bugfix] fix logit processor exceed vocab size issue ( #6927 )
2024-07-31 16:16:01 +08:00
533d1932d2
[Bugfix][TPU] Set readonly=True for non-root devices ( #6980 )
2024-07-31 00:19:28 -07:00
9f0e69b653
[CI/Build] Fix mypy errors ( #6968 )
2024-07-30 19:49:48 -07:00
f230cc2ca6
[Bugfix] Fix broadcasting logic for multi_modal_kwargs ( #6836 )
2024-07-31 10:38:45 +08:00
da1f7cc12a
[mypy] Enable following imports for some directories ( #6681 )
2024-07-31 10:38:03 +08:00
c32ab8be1a
[Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding ( #6964 )
2024-07-31 00:53:21 +00:00
fb4f530bf5
[CI] [nightly benchmark] Do not re-download sharegpt dataset if exists ( #6706 )
2024-07-30 16:28:49 -07:00
79319cedfa
[Nightly benchmarking suite] Remove pkill python from run benchmark suite ( #6965 )
2024-07-30 16:28:05 -07:00
40c27a7cbb
[Build] Temporarily Disable Kernels and LoRA tests ( #6961 )
2024-07-30 14:59:48 -07:00
6ca8031e71
[core][misc] improve free_finished_seq_groups ( #6865 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-07-30 14:32:12 -07:00
d7a299edaa
[Kernel] Remove scaled_fp8_quant kernel padding footgun ( #6842 )
2024-07-30 16:37:01 -04:00
052b6f8ca4
[Bugfix] Fix tensorizer memory profiling bug during testing ( #6881 )
2024-07-30 11:48:50 -07:00
5895b24677
[OpenVINO] Updated OpenVINO requirements and build docs ( #6948 )
2024-07-30 11:33:01 -07:00
cbbc904470
[Kernel] Squash a few more warnings ( #6914 )
2024-07-30 13:50:42 -04:00
5cf9254a9c
[BugFix] Fix use of per-request seed with pipeline parallel ( #6698 )
2024-07-30 10:40:08 -07:00
f058403683
[Doc] Super tiny fix doc typo ( #6949 )
2024-07-30 09:14:03 -07:00
c66c7f86ac
[Bugfix] Fix PaliGemma MMP ( #6930 )
2024-07-30 02:20:57 -07:00
6e063ea35b
[TPU] Fix greedy decoding ( #6933 )
2024-07-30 02:06:29 -07:00
af647fb8b3
[Kernel] Tuned int8 kernels for Ada Lovelace ( #6848 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-29 20:24:58 -06:00
61a97c32f6
[Kernel] Fix marlin divide-by-zero warnings ( #6904 )
2024-07-30 01:26:07 +00:00
4fbf4aa128
[ci] GHA workflow to remove ready label upon "/notready" comment ( #6921 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-29 17:03:45 -07:00
aae6d36f7e
[Kernel] Remove unused variables in awq/gemm_kernels.cu ( #6908 )
2024-07-29 18:01:17 -06:00
9f69d8245a
[Frontend] New allowed_token_ids decoding request parameter ( #6753 )
2024-07-29 23:37:27 +00:00
9a7e2d0534
[Bugfix] Allow vllm to still work if triton is not installed. ( #6786 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-29 14:51:27 -07:00
7f8d612d24
[TPU] Support tensor parallelism in async llm engine ( #6891 )
2024-07-29 12:42:21 -07:00
60d1c6e584
[Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel ( #6901 )
2024-07-29 09:59:02 -07:00
db9e5708a9
[Core] Reduce unnecessary compute when logprobs=None ( #6532 )
2024-07-29 16:47:31 +00:00
766435e660
[Kernel] Tuned FP8 Kernels for Ada Lovelace ( #6677 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-29 09:42:35 -06:00
7cbd9ec7a9
[Model] Initialize support for InternVL2 series models ( #6514 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-29 10:16:30 +00:00
3eeb148f46
[Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 ( #6871 )
2024-07-28 11:13:49 -04:00
b1366a9534
Add Nemotron to PP_SUPPORTED_MODELS ( #6863 )
2024-07-27 15:05:17 -07:00
75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel ( #6795 )
2024-07-27 17:52:33 -04:00
fad5576c58
[TPU] Reduce compilation time & Upgrade PyTorch XLA version ( #6856 )
2024-07-27 10:28:33 -07:00
f954d0715c
[Docs] Add RunLLM chat widget ( #6857 )
2024-07-27 09:24:46 -07:00
1ad86acf17
[Model] Initial support for BLIP-2 ( #5920 )
...
Co-authored-by: ywang96 <ywang@roblox.com >
2024-07-27 11:53:07 +00:00
ecb33a28cb
[CI/Build][Doc] Update CI and Doc for VLM example changes ( #6860 )
2024-07-27 09:54:14 +00:00
a57d75821c
[bugfix] make args.stream work ( #6831 )
2024-07-27 09:07:02 +00:00
925de97e05
[Bugfix] Fix VLM example typo ( #6859 )
2024-07-27 14:24:08 +08:00
aa46953a20
[Misc][VLM][Doc] Consolidate offline examples for vision language models ( #6858 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-07-26 22:44:13 -07:00
593e79e733
[Bugfix] Use torch.set_num_threads() to configure parallelism in multiproc_gpu_executor ( #6802 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-26 22:15:20 -07:00
c53041ae3b
[Doc] Add missing mock import to docs conf.py ( #6834 )
2024-07-27 04:47:33 +00:00
52f07e3dec
[Hardware][TPU] Implement tensor parallelism with Ray ( #5871 )
2024-07-26 20:54:27 -07:00
14dbd5a767
[Model] H2O Danube3-4b ( #6451 )
2024-07-26 20:47:50 -07:00
ed94e4f427
[Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba ( #6784 )
2024-07-26 20:45:31 -07:00
3c3012398e
[Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron ( #6844 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-07-26 20:20:16 -07:00
ced36cd89b
[ROCm] Upgrade PyTorch nightly version ( #6845 )
2024-07-26 20:16:13 -07:00
969d032265
[Bugfix]: Fix Tensorizer test failures ( #6835 )
2024-07-26 20:02:25 -07:00
55712941e5
[Bug Fix] Illegal memory access, FP8 Llama 3.1 405b ( #6852 )
2024-07-27 02:27:44 +00:00
981b0d5673
[Frontend] Factor out code for running uvicorn ( #6828 )
2024-07-27 09:58:25 +08:00
d09b94ca58
[TPU] Support collective communications in XLA devices ( #6813 )
2024-07-27 01:45:57 +00:00
bb5494676f
enforce eager mode with bnb quantization temporarily ( #6846 )
2024-07-27 01:32:20 +00:00
b5f49ee55b
Update README.md ( #6847 )
2024-07-27 00:26:45 +00:00
150a1ffbfd
[Doc] Update SkyPilot doc for wrong indents and instructions for update service ( #4283 )
2024-07-26 14:39:10 -07:00
281977bd6e
[Doc] Add Nemotron to supported model docs ( #6843 )
2024-07-26 17:32:44 -04:00
3bbb4936dc
[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation ( #6125 )
2024-07-26 13:50:10 -07:00
aa4867791e
[Misc][TPU] Support TPU in initialize_ray_cluster ( #6812 )
2024-07-26 19:39:49 +00:00
71734f1bf2
[Build/CI][ROCm] Minor simplification to Dockerfile.rocm ( #6811 )
2024-07-26 12:28:32 -07:00
50704f52c4
[Bugfix][Kernel] Promote another index to int64_t ( #6838 )
2024-07-26 18:41:04 +00:00
07278c37dd
[Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) ( #6611 )
2024-07-26 14:33:42 -04:00
85ad7e2d01
[doc][debugging] add known issues for hangs ( #6816 )
2024-07-25 21:48:05 -07:00
89a84b0bb7
[Core] Use array to speedup padding ( #6779 )
2024-07-25 21:31:31 -07:00
084a01fd35
[Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor. ( #6770 )
2024-07-25 21:25:35 -07:00
062a1d0fab
Fix ReplicatedLinear weight loading ( #6793 )
2024-07-25 19:24:58 -07:00
2eb9f4ff26
[ci] Mark tensorizer test as soft fail and separate it from grouped test in fast check ( #6810 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-25 18:08:33 -07:00
443c7cf4cf
[ci][distributed] fix flaky tests ( #6806 )
2024-07-25 17:44:09 -07:00
1adddb14bf
[Core] Fix ray forward_dag error mssg ( #6792 )
2024-07-25 16:53:25 -07:00
b7215de2c5
[Docs] Publish 5th meetup slides ( #6799 )
2024-07-25 16:47:55 -07:00
f3ff63c3f4
[doc][distributed] improve multinode serving doc ( #6804 )
2024-07-25 15:38:32 -07:00
cd7edc4e87
[Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors ( #6798 )
2024-07-25 15:05:09 -07:00
6a1e25b151
[Doc] Add documentations for nightly benchmarks ( #6412 )
2024-07-25 11:57:16 -07:00
95db75de64
[Bugfix] Add synchronize to prevent possible data race ( #6788 )
...
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-07-25 10:40:01 -07:00
65b1f121c8
[Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints ( #6761 )
2024-07-25 09:46:15 -07:00
889da130e7
[ Misc ] fp8-marlin channelwise via compressed-tensors ( #6524 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-07-25 09:46:04 -07:00
b75e314fff
[Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V ( #6787 )
...
Co-authored-by: hezhihui <hzh7269@modelbest.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-25 09:42:49 -07:00
316a41ac1d
[Bugfix] Fix encoding_format in examples/openai_embedding_client.py ( #6755 )
2024-07-24 22:48:07 -07:00
0310029a2f
[Bugfix] Fix awq_marlin and gptq_marlin flags ( #6745 )
2024-07-24 22:34:11 -07:00
309aaef825
[Bugfix] Fix decode tokens w. CUDA graph ( #6757 )
2024-07-24 22:33:56 -07:00
9e169a4c61
[Model] Adding support for MiniCPM-V ( #4087 )
2024-07-24 20:59:30 -07:00
5689e256ba
[Frontend] Represent tokens with identifiable strings ( #6626 )
2024-07-25 09:51:00 +08:00
740374d456
[core][distributed] fix zmq hang ( #6759 )
2024-07-24 17:37:12 -07:00
d88c458f44
[Doc][AMD][ROCm]Added tips to refer to mi300x tuning guide for mi300x users ( #6754 )
2024-07-24 14:32:57 -07:00
421e218b37
[Bugfix] Bump transformers to 4.43.2 ( #6752 )
2024-07-24 13:22:16 -07:00
5448f67635
[Core] Tweaks to model runner/input builder developer APIs ( #6712 )
2024-07-24 12:17:12 -07:00
0e63494cf3
Add fp8 support to reshape_and_cache_flash ( #6667 )
2024-07-24 18:36:52 +00:00
ee812580f7
[Frontend] split run_server into build_server and run_server ( #6740 )
2024-07-24 10:36:04 -07:00
40468b13fa
[Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. ( #6686 )
2024-07-24 08:58:42 -07:00
2cf0df3381
[Bugfix] Fix speculative decode seeded test ( #6743 )
2024-07-24 08:58:31 -07:00
545146349c
Adding f-string to validation error which is missing ( #6748 )
2024-07-24 08:55:53 -07:00
f4f8a9d892
[Bugfix] fix modelscope compatibility issue ( #6730 )
2024-07-24 05:04:46 -07:00
b570811706
[Build/CI] Update run-amd-test.sh. Enable Docker Hub login. ( #6711 )
2024-07-24 05:01:14 -07:00
ccc4a73257
[Docs][ROCm] Detailed instructions to build from source ( #6680 )
2024-07-24 01:07:23 -07:00
0a740a11ba
[Bugfix] Fix token padding for chameleon ( #6724 )
2024-07-24 01:05:09 -07:00
c882a7f5b3
[SpecDecoding] Update MLPSpeculator CI tests to use smaller model ( #6714 )
2024-07-24 07:34:22 +00:00
5e8ca973eb
[Bugfix] fix flashinfer cudagraph capture for PP ( #6708 )
2024-07-24 01:49:44 +00:00
87525fab92
[bitsandbytes]: support read bnb pre-quantized model ( #5753 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-07-23 23:45:09 +00:00
2f808e69ab
[Bugfix] StatLoggers: cache spec decode metrics when they get collected. ( #6645 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-23 23:05:05 +00:00
01c16ede6b
[CI] Add smoke test for non-uniform AutoFP8 quantization ( #6702 )
2024-07-23 22:45:12 +00:00
72fc704803
[build] relax wheel size limit ( #6704 )
2024-07-23 14:03:49 -07:00
1bedf210e3
Bump transformers version for Llama 3.1 hotfix and patch Chameleon ( #6690 )
2024-07-23 13:47:48 -07:00
507ef787d8
[Model] Pipeline Parallel Support for DeepSeek v2 ( #6519 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-23 12:22:09 -07:00
58f53034ad
[Frontend] Add Usage data in each chunk for chat_serving. #6540 ( #6652 )
2024-07-23 11:41:55 -07:00
0eb0757bef
[Misc] Add ignored layers for fp8 quantization ( #6657 )
2024-07-23 14:04:04 -04:00
38c4b7e863
Bump version to 0.5.3.post1 ( #6696 )
2024-07-23 10:08:59 -07:00
a112a84aad
[BugFix] Fix RoPE error in Llama 3.1 ( #6693 )
2024-07-23 09:46:05 -07:00
461089a21a
[Bugfix] Fix a log error in chunked prefill ( #6694 )
2024-07-23 09:27:58 -07:00
71950af726
[doc][distributed] fix doc argument order ( #6691 )
2024-07-23 08:55:33 -07:00
cb1362a889
[Docs] Announce llama3.1 support ( #6688 )
2024-07-23 08:18:15 -07:00
bb2fc08072
Bump version to v0.5.3 ( #6674 )
2024-07-23 00:00:08 -07:00
3eda4ec780
support ignore patterns in model loader ( #6673 )
2024-07-22 23:59:42 -07:00
22fa2e35cb
[VLM][Model] Support image input for Chameleon ( #6633 )
2024-07-22 23:50:48 -07:00
c5201240a4
[misc] only tqdm for first rank ( #6672 )
2024-07-22 21:57:27 -07:00
97234be0ec
[Misc] Manage HTTP connections in one place ( #6600 )
2024-07-22 21:32:02 -07:00
c051bfe4eb
[doc][distributed] add more doc for setting up multi-node environment ( #6529 )
2024-07-22 21:22:09 -07:00
9e0b558a09
[Misc] Support FP8 kv cache scales from compressed-tensors ( #6528 )
2024-07-23 04:11:50 +00:00
e519ae097a
add tqdm when loading checkpoint shards ( #6569 )
...
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-07-22 20:48:01 -07:00
7c2749a4fd
[misc] add start loading models for users information ( #6670 )
2024-07-22 20:08:02 -07:00
729171ae58
[Misc] Enable chunked prefill by default for long context models ( #6666 )
2024-07-22 20:03:13 -07:00
c5e8330997
[Bugfix] Fix null modules_to_not_convert in FBGEMM Fp8 quantization ( #6665 )
2024-07-22 19:25:05 -07:00
e0c15758b8
[Core] Modulize prepare input and attention metadata builder ( #6596 )
2024-07-23 00:45:24 +00:00
bdf5fd1386
[Misc] Remove deprecation warning for beam search ( #6659 )
2024-07-23 00:21:58 +00:00
5a96ee52a3
[ci][build] add back vim in docker ( #6661 )
2024-07-22 16:26:29 -07:00
42c7f66a38
[Core] Support dynamically loading Lora adapter from HuggingFace ( #6234 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-07-22 15:42:40 -07:00
69d5ae38dc
[ci] Use different sccache bucket for CUDA 11.8 wheel build ( #6656 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-22 14:20:41 -07:00
fea59c7712
[Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels ( #6649 )
2024-07-22 14:08:30 -06:00
739b61a348
[Frontend] Refactor prompt processing ( #4028 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-22 10:13:53 -07:00
89c1c6a196
[Bugfix] Fix vocab_size field access in llava_next.py ( #6624 )
2024-07-22 05:02:51 +00:00
42de2cefcb
[Misc] Add a wrapper for torch.inference_mode ( #6618 )
2024-07-21 18:43:11 -07:00
c9eef37f32
[Model] Initial Support for Chameleon ( #5770 )
2024-07-21 17:37:51 -07:00
396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel ( #6612 )
2024-07-21 19:41:42 -04:00
25e778aa16
[Model] Refactor and decouple phi3v image embedding ( #6621 )
2024-07-21 16:07:58 -07:00
b6df37f943
[Misc] Remove abused noqa ( #6619 )
2024-07-21 23:47:04 +08:00
14f91fe67c
[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. ( #6485 )
2024-07-20 23:58:58 -07:00
d7f4178dd9
[Frontend] Move chat utils ( #6602 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-21 08:38:17 +08:00
082ecd80d5
[ Bugfix ] Fix AutoFP8 fp8 marlin ( #6609 )
2024-07-20 17:25:56 -06:00
f952bbc8ff
[Misc] Fix input_scale typing in w8a8_utils.py ( #6579 )
2024-07-20 23:11:13 +00:00
9364f74eee
[ Kernel ] Enable fp8-marlin for fbgemm-fp8 models ( #6606 )
2024-07-20 18:50:10 +00:00
06d6c5fe9f
[Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes ( #6543 )
2024-07-20 09:39:07 -07:00
683e3cb9c4
[ Misc ] fbgemm checkpoints ( #6559 )
2024-07-20 09:36:57 -07:00
9042d68362
[Misc] Consolidate and optimize logic for building padded tensors ( #6541 )
2024-07-20 04:17:24 +00:00
3f8d42c81f
Pipeline Parallel: Guard for KeyErrors at request abort ( #6587 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-19 19:18:19 -07:00
7bd82002ae
[Core] Allow specifying custom Executor ( #6557 )
2024-07-20 01:25:06 +00:00
2e26564259
[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub ( #6593 )
...
Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com >
2024-07-19 18:15:26 -07:00
e81522e879
[build] add ib so that multi-node support with infiniband can be supported out-of-the-box ( #6599 )
2024-07-19 17:16:57 -07:00
45ceb85a0c
[Docs] Update PP docs ( #6598 )
2024-07-19 16:38:21 -07:00
4cc24f01b1
[ Kernel ] Enable Dynamic Per Token fp8 ( #6547 )
2024-07-19 23:08:15 +00:00
07eb6f19f3
[bugfix][distributed] fix multi-node bug for shared memory ( #6597 )
2024-07-19 15:34:34 -07:00
f0bbfaf917
[Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection ( #6578 )
2024-07-19 14:01:03 -07:00
30efe41532
[Docs] Update docs for wheel location ( #6580 )
2024-07-19 12:14:11 -07:00
9ed82e7074
[Misc] Small perf improvements ( #6520 )
2024-07-19 12:10:56 -07:00
51f8aa90ad
[Bugfix][Frontend] remove duplicate init logger ( #6581 )
2024-07-19 10:16:27 -07:00
a5314e8698
[Model] RowParallelLinear: pass bias to quant_method.apply ( #6327 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-19 07:15:22 -06:00
a921e86392
[BUGFIX] Raise an error for no draft token case when draft_tp>1 ( #6369 )
2024-07-19 06:01:09 -07:00
6366efc67b
[Bugfix][Frontend] Fix missing /metrics endpoint ( #6463 )
2024-07-19 03:55:13 +00:00
dbe5588554
[ Misc ] non-uniform quantization via compressed-tensors for Llama ( #6515 )
2024-07-18 22:39:18 -04:00
d4201e06d5
[Bugfix] Make spec. decode respect per-request seed. ( #6034 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-07-18 19:22:08 -07:00
b5672a112c
[Core] Multiprocessing Pipeline Parallel support ( #6130 )
...
Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-18 19:15:52 -07:00
c5df56f88b
Add support for a rope extension method ( #6553 )
2024-07-19 01:53:03 +00:00
1689219ebf
[CI/Build] Build on Ubuntu 20.04 instead of 22.04 ( #6517 )
2024-07-18 17:29:25 -07:00
4ffffccb7e
[Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm ( #6552 )
2024-07-18 23:52:22 +00:00
f53b8f0d05
[ci][test] add correctness test for cpu offloading ( #6549 )
2024-07-18 23:41:06 +00:00
2d4733ba2d
Fix PR comment bot ( #6554 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-18 14:48:29 -07:00
15c6a079b1
[Model] Support Mistral-Nemo ( #6548 )
2024-07-18 20:31:50 +00:00
ecdb462c24
[ci] Reword Github bot comment ( #6534 )
2024-07-18 08:01:45 -07:00
58ca663224
[ Misc ] Improve Min Capability Checking in compressed-tensors ( #6522 )
2024-07-18 14:39:12 +00:00
4634c8728b
[TPU] Refactor TPU worker & model runner ( #6506 )
2024-07-18 01:34:16 -07:00
c8a7d51c49
[Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash ( #6501 )
2024-07-18 07:47:13 +00:00
e2fbaee725
[BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs ( #6227 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-18 15:13:30 +08:00
8a74c68bd1
[Misc] Minor patch for draft model runner ( #6523 )
2024-07-18 06:06:21 +00:00
61e592747c
[Core] Introduce SPMD worker execution using Ray accelerated DAG ( #6032 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu >
2024-07-17 22:27:09 -07:00
d25877dd9b
[BugFix] Avoid secondary error in ShmRingBuffer destructor ( #6530 )
2024-07-17 22:24:43 -07:00
1c27d25fb5
[core][model] yet another cpu offload implementation ( #6496 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-07-17 20:54:35 -07:00
18fecc3559
[ Kernel ] Fp8 Channelwise Weight Support ( #6487 )
2024-07-18 03:18:13 +00:00
b5af8c223c
[Model] Pipeline parallel support for Mixtral ( #6516 )
2024-07-17 19:26:04 -07:00
b5241e41d9
[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel ( #6511 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-18 01:38:35 +00:00
e76466dde2
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step ( #6338 )
2024-07-17 14:30:28 -07:00
5f0b9933e6
[Bugfix] Fix Ray Metrics API usage ( #6354 )
2024-07-17 19:40:10 +00:00
a38524f338
[DOC] - Add docker image to Cerebrium Integration ( #6510 )
2024-07-17 10:22:53 -07:00
2fa4623d9e
[Core] Refactor _prepare_model_input_tensors - take 2 ( #6164 )
2024-07-17 09:37:16 -07:00
a9a2e74d21
[Misc] Use torch.Tensor for type annotation ( #6505 )
2024-07-17 13:01:10 +00:00
e09ce759aa
[TPU] Remove multi-modal args in TPU backend ( #6504 )
2024-07-17 04:02:53 -07:00
5fa6e9876e
[Bugfix] Fix for multinode crash on 4 PP ( #6495 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-17 08:25:10 +00:00
5bf35a91e4
[Doc][CI/Build] Update docs and tests to use vllm serve ( #6431 )
2024-07-17 07:43:21 +00:00
a19e8d3726
[Misc][Speculative decoding] Typos and typing fixes ( #6467 )
...
Co-authored-by: caishangming.csm <caishangming.csm@alibaba-inc.com >
2024-07-17 07:17:07 +00:00
10383887e0
[ROCm] Cleanup Dockerfile and remove outdated patch ( #6482 )
2024-07-16 22:47:02 -07:00
1d094fd7c0
[Distributed][PP] only create embedding & lm head when necessary ( #6455 )
...
original title: [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization
2024-07-16 19:20:26 -07:00
ce37be7ba0
[misc][distributed] add seed to dummy weights ( #6491 )
2024-07-16 19:16:34 -07:00
7f62077af5
[misc][distributed] improve tests ( #6488 )
2024-07-16 17:35:52 -07:00
09c2eb85dd
[ci][distributed] add pipeline parallel correctness test ( #6410 )
2024-07-16 15:44:22 -07:00
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale ( #6081 )
2024-07-16 15:31:32 -07:00
160e1d8c99
[Misc] Log spec decode metrics ( #6454 )
2024-07-16 20:37:10 +00:00
94162beb9f
[Doc] Fix the lora adapter path in server startup script ( #6230 )
2024-07-16 10:11:04 -07:00
c467dff24f
[Hardware][TPU] Support MoE with Pallas GMM kernel ( #6457 )
2024-07-16 09:56:28 -07:00
9f4ccec761
[doc][misc] remind users to cancel debugging environment variables after debugging ( #6481 )
2024-07-16 09:45:30 -07:00
38ef94888a
[CI/Build] Remove "boardwalk" image asset ( #6460 )
2024-07-16 08:59:36 -07:00
2bb0489cb3
[Core] Use numpy to speed up padded token processing ( #6442 )
2024-07-16 08:13:25 -07:00
7508a3dc34
[Misc] Fix typos in spec. decode metrics logging. ( #6470 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-16 13:55:15 +00:00
7a3d2a5b95
[Frontend] Support for chat completions input in the tokenize endpoint ( #5923 )
2024-07-16 20:18:09 +08:00
d97011512e
[CI/Build] vLLM cache directory for images ( #6444 )
2024-07-15 23:12:25 -07:00
37d776606f
[Docs] Announce 5th meetup ( #6458 )
2024-07-15 21:04:58 -07:00
d92b3c5cde
[Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests ( #6419 )
2024-07-15 18:54:15 -07:00
9ad32dacd9
[BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug ( #6425 )
...
Co-authored-by: Mor Zusman <morz@ai21.com >
2024-07-16 01:32:55 +00:00
d6f3b3d5c4
Pin sphinx-argparse version ( #6453 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-16 01:26:11 +00:00
4552e37b55
[CI/Build][TPU] Add TPU CI test ( #6277 )
...
Co-authored-by: kevin <kevin@anyscale.com >
2024-07-15 14:31:16 -07:00
ec9933f4a5
[Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod ( #6289 )
2024-07-15 19:02:14 +00:00
3dee97b05f
[Docs] Add Google Cloud to sponsor list ( #6450 )
2024-07-15 11:58:10 -07:00