4db5176d97
bump version to v0.5.4 ( #7139 )
2024-08-05 14:39:48 -07:00
4cf1dc39be
[Bugfix][CI/Build] Fix CUTLASS FetchContent ( #7171 )
2024-08-05 14:22:57 -07:00
6e4852ce28
[CI/Build] Suppress divide-by-zero and missing return statement warnings ( #7001 )
2024-08-05 16:00:01 -04:00
8571ac4672
[Kernel] Update CUTLASS to 3.5.1 ( #7085 )
2024-08-05 15:13:43 -04:00
997cf78308
[Misc] Fix typo in GroupCoordinator.recv() ( #7167 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-05 11:10:16 -07:00
57f560aa23
[BugFix] Use args.trust_remote_code ( #7121 )
2024-08-05 09:26:14 -07:00
003f8ee128
[BugFix] Use IP4 localhost form for zmq bind ( #7163 )
2024-08-05 08:41:03 -07:00
e9630458c7
[SpecDecode] Support FlashInfer in DraftModelRunner ( #6926 )
2024-08-05 08:05:05 -07:00
82a1b1a82b
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification ( #6963 )
2024-08-05 08:46:44 +00:00
c0d8f1636c
[Model] SiglipVisionModel ported from transformers ( #6942 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-08-05 06:22:12 +00:00
cc08fc7225
[Frontend] Reapply "Factor out code for running uvicorn" ( #7095 )
2024-08-04 20:40:51 -07:00
7b86e7c9cd
[Model] Add multi-image support for minicpmv ( #7122 )
...
Co-authored-by: hezhihui <hzh7269@modelbest.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-05 09:23:17 +08:00
f80ab3521c
Clean up remaining Punica C information ( #7027 )
2024-08-04 15:37:08 -07:00
16a1cc9bb2
[misc][distributed] improve libcudart.so finding ( #7127 )
2024-08-04 11:31:51 -07:00
b1c9aa3daa
[Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size to 1 when using MLPSpeculator ( #7105 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-08-04 07:13:18 -07:00
179a6a36f2
[Model]Refactor MiniCPMV ( #7020 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-04 08:12:41 +00:00
83c644fe7e
[core][misc] simply output processing with shortcut code path ( #7117 )
2024-08-04 00:22:19 -07:00
9fadc7b7a0
[misc] add zmq in collect env ( #7119 )
2024-08-03 22:03:46 -07:00
654bc5ca49
Support for guided decoding for offline LLM ( #6878 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-04 03:12:09 +00:00
825b044863
[Frontend] Warn if user max_model_len is greater than derived max_model_len ( #7080 )
...
Signed-off-by: Jefferson Fialho <jfialho@ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-08-03 16:01:38 -07:00
44dcb52e39
[ci][test] finalize fork_new_process_for_each_test ( #7114 )
2024-08-03 10:44:53 -07:00
67d745cc68
[CI] Temporarily turn off H100 performance benchmark ( #7104 )
2024-08-02 23:52:44 -07:00
99d7cabd7b
[LoRA] ReplicatedLinear support LoRA ( #7081 )
2024-08-02 22:40:19 -07:00
fb2c1c86c1
[Bugfix] Fix block table for seqs that have prefix cache hits ( #7018 )
2024-08-02 22:38:15 -07:00
0c25435daa
[Model] Refactor and decouple weight loading logic for InternVL2 model ( #7067 )
2024-08-02 22:36:14 -07:00
a0d164567c
[ci][distributed] disable ray dag tests ( #7099 )
2024-08-02 22:32:04 -07:00
04e5583425
[ci][distributed] merge distributed test commands ( #7097 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-08-02 21:33:53 -07:00
8c025fa703
[Frontend] Factor out chat message parsing ( #7055 )
2024-08-02 21:31:27 -07:00
69ea15e5cc
[ci][distributed] shorten wait time if server hangs ( #7098 )
2024-08-02 21:05:16 -07:00
ed812a73fa
[ Frontend ] Multiprocessing for OpenAI Server with zeromq ( #6883 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com >
Co-authored-by: Joe Runde <joe@joerun.de >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-08-02 18:27:28 -07:00
708989341e
[misc] add a flag to enable compile ( #7092 )
2024-08-02 16:18:45 -07:00
22e718ff1a
[Misc] Revive to use loopback address for driver IP ( #7091 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-02 15:50:00 -07:00
05308891e2
[Core] Pipeline parallel with Ray ADAG ( #6837 )
...
Support pipeline-parallelism with Ray accelerated DAG.
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
2024-08-02 13:55:40 -07:00
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType ( #6396 )
2024-08-02 13:51:58 -07:00
b482b9a5b1
[CI/Build] Add support for Python 3.12 ( #7035 )
2024-08-02 13:51:22 -07:00
806949514a
[ci] set timeout for test_oot_registration.py ( #7082 )
2024-08-02 10:03:24 -07:00
c16eaac500
[Hardware][Intel CPU] Update torch 2.4.0 for CPU backend ( #6931 )
2024-08-02 08:55:58 -07:00
db35186391
[Core] Comment out unused code in sampler ( #7023 )
2024-08-02 00:58:26 -07:00
660dea1235
[cuda][misc] remove error_on_invalid_device_count_status ( #7069 )
2024-08-02 00:14:21 -07:00
cf2a1a4d9d
Fix tracing.py ( #7065 )
2024-08-01 23:28:00 -07:00
252357793d
[ci][distributed] try to fix pp test ( #7054 )
2024-08-01 22:03:12 -07:00
3bb4b1e4cd
[mypy] Speed up mypy checking ( #7056 )
2024-08-01 19:49:43 -07:00
954f7305a1
[Kernel] Fix input for flashinfer prefill wrapper. ( #7008 )
2024-08-01 18:44:16 -07:00
6ce01f3066
[Performance] Optimize get_seqs ( #7051 )
2024-08-01 18:29:52 -07:00
6a11fdfbb8
[CI/Build][Bugfix] Fix CUTLASS header-only line ( #7034 )
2024-08-01 13:51:15 -07:00
805a8a75f2
[Misc] Support attention logits soft-capping with flash-attn ( #7022 )
2024-08-01 13:14:37 -07:00
562e580abc
Update run-amd-test.sh ( #7044 )
2024-08-01 13:12:37 -07:00
fc912e0886
[Models] Support Qwen model with PP ( #6974 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-08-01 12:40:43 -07:00
f4fd390f5d
[Bugfix] Lower gemma's unloaded_params exception to warning ( #7002 )
2024-08-01 12:01:07 -07:00
fb3db61688
[CI/Build] Remove sparseml requirement from testing ( #7037 )
2024-08-01 12:00:51 -07:00
2dd34371a6
[Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm ( #6992 )
2024-08-01 12:00:28 -07:00
7e0861bd0b
[CI/Build] Update PyTorch to 2.4.0 ( #6951 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-08-01 11:11:24 -07:00
a72a424b3e
[Build/CI] Fixing Docker Hub quota issue. ( #7043 )
2024-08-01 11:07:37 -07:00
c8a7e93273
[core][scheduler] simplify and improve scheduler ( #6867 )
2024-07-31 23:51:09 -07:00
3c10591ef2
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user ( #6954 )
2024-07-31 21:13:34 -07:00
0437492ea9
PP comm optimization: replace send with partial send + allgather ( #6695 )
...
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com >
2024-07-31 20:15:42 -07:00
630dd9e0ae
[Bugfix][Model] Skip loading lm_head weights if using tie_word_embeddings ( #6758 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-31 19:49:11 -07:00
23993a7997
[Bugfix][TPU] Do not use torch.Generator for TPUs ( #6981 )
2024-07-31 18:50:28 -07:00
1d2e7fb73f
[Model] Pipeline parallel support for Qwen2 ( #6924 )
2024-07-31 18:49:51 -07:00
7ecee34321
[Kernel][RFC] Refactor the punica kernel based on Triton ( #5036 )
2024-07-31 17:12:24 -07:00
7eb0cb4a14
Revert "[Frontend] Factor out code for running uvicorn" ( #7012 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-07-31 16:34:26 -07:00
a0dce9383a
[Misc] Add compressed-tensors to optimized quant list ( #7006 )
2024-07-31 14:40:44 -07:00
35e9c12bfa
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) ( #6996 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-31 14:40:32 -07:00
93548eb37e
[Kernel] Enable FP8 Cutlass for Ada Lovelace ( #6950 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-31 14:40:22 -07:00
460c1884e3
[Bugfix] Support cpu offloading with fp8 quantization ( #6960 )
2024-07-31 12:47:46 -07:00
bd70013407
[MISC] Introduce pipeline parallelism partition strategies ( #6920 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-07-31 12:02:17 -07:00
2ee8d3ba55
[Model] use FusedMoE layer in Jamba ( #6935 )
2024-07-31 12:00:24 -07:00
daed30c4a9
[Bugfix] Fix feature size calculation for LLaVA-NeXT ( #6982 )
2024-07-31 23:46:17 +08:00
2f4e108f75
[Bugfix] Clean up MiniCPM-V ( #6939 )
...
Co-authored-by: hezhihui <hzh7269@modelbest.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-31 14:39:19 +00:00
6512937de1
Support W4A8 quantization for vllm ( #5218 )
2024-07-31 07:55:21 -06:00
c0644cf9ce
[Bugfix] fix logit processor excceed vocab size issue ( #6927 )
2024-07-31 16:16:01 +08:00
533d1932d2
[Bugfix][TPU] Set readonly=True for non-root devices ( #6980 )
2024-07-31 00:19:28 -07:00
9f0e69b653
[CI/Build] Fix mypy errors ( #6968 )
2024-07-30 19:49:48 -07:00
f230cc2ca6
[Bugfix] Fix broadcasting logic for multi_modal_kwargs ( #6836 )
2024-07-31 10:38:45 +08:00
da1f7cc12a
[mypy] Enable following imports for some directories ( #6681 )
2024-07-31 10:38:03 +08:00
c32ab8be1a
[Speculative decoding] Add serving benchmark for llama3 70b + speculative decoding ( #6964 )
2024-07-31 00:53:21 +00:00
fb4f530bf5
[CI] [nightly benchmark] Do not re-download sharegpt dataset if exists ( #6706 )
2024-07-30 16:28:49 -07:00
79319cedfa
[Nightly benchmarking suite] Remove pkill python from run benchmark suite ( #6965 )
2024-07-30 16:28:05 -07:00
40c27a7cbb
[Build] Temporarily Disable Kernels and LoRA tests ( #6961 )
2024-07-30 14:59:48 -07:00
6ca8031e71
[core][misc] improve free_finished_seq_groups ( #6865 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-07-30 14:32:12 -07:00
d7a299edaa
[Kernel] Remove scaled_fp8_quant kernel padding footgun ( #6842 )
2024-07-30 16:37:01 -04:00
052b6f8ca4
[Bugfix] Fix tensorizer memory profiling bug during testing ( #6881 )
2024-07-30 11:48:50 -07:00
5895b24677
[OpenVINO] Updated OpenVINO requirements and build docs ( #6948 )
2024-07-30 11:33:01 -07:00
cbbc904470
[Kernel] Squash a few more warnings ( #6914 )
2024-07-30 13:50:42 -04:00
5cf9254a9c
[BugFix] Fix use of per-request seed with pipeline parallel ( #6698 )
2024-07-30 10:40:08 -07:00
f058403683
[Doc] Super tiny fix doc typo ( #6949 )
2024-07-30 09:14:03 -07:00
c66c7f86ac
[Bugfix] Fix PaliGemma MMP ( #6930 )
2024-07-30 02:20:57 -07:00
6e063ea35b
[TPU] Fix greedy decoding ( #6933 )
2024-07-30 02:06:29 -07:00
af647fb8b3
[Kernel] Tuned int8 kernels for Ada Lovelace ( #6848 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-29 20:24:58 -06:00
61a97c32f6
[Kernel] Fix marlin divide-by-zero warnings ( #6904 )
2024-07-30 01:26:07 +00:00
4fbf4aa128
[ci] GHA workflow to remove ready label upon "/notready" comment ( #6921 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-29 17:03:45 -07:00
aae6d36f7e
[Kernel] Remove unused variables in awq/gemm_kernels.cu ( #6908 )
2024-07-29 18:01:17 -06:00
9f69d8245a
[Frontend] New allowed_token_ids decoding request parameter ( #6753 )
2024-07-29 23:37:27 +00:00
9a7e2d0534
[Bugfix] Allow vllm to still work if triton is not installed. ( #6786 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-29 14:51:27 -07:00
7f8d612d24
[TPU] Support tensor parallelism in async llm engine ( #6891 )
2024-07-29 12:42:21 -07:00
60d1c6e584
[Kernel] Fix deprecation function warnings squeezellm quant_cuda_kernel ( #6901 )
2024-07-29 09:59:02 -07:00
db9e5708a9
[Core] Reduce unnecessary compute when logprobs=None ( #6532 )
2024-07-29 16:47:31 +00:00
766435e660
[Kernel] Tuned FP8 Kernels for Ada Lovelace ( #6677 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-29 09:42:35 -06:00
7cbd9ec7a9
[Model] Initialize support for InternVL2 series models ( #6514 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-29 10:16:30 +00:00
3eeb148f46
[Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 ( #6871 )
2024-07-28 11:13:49 -04:00
b1366a9534
Add Nemotron to PP_SUPPORTED_MODELS ( #6863 )
2024-07-27 15:05:17 -07:00
75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel ( #6795 )
2024-07-27 17:52:33 -04:00
fad5576c58
[TPU] Reduce compilation time & Upgrade PyTorch XLA version ( #6856 )
2024-07-27 10:28:33 -07:00
f954d0715c
[Docs] Add RunLLM chat widget ( #6857 )
2024-07-27 09:24:46 -07:00
1ad86acf17
[Model] Initial support for BLIP-2 ( #5920 )
...
Co-authored-by: ywang96 <ywang@roblox.com >
2024-07-27 11:53:07 +00:00
ecb33a28cb
[CI/Build][Doc] Update CI and Doc for VLM example changes ( #6860 )
2024-07-27 09:54:14 +00:00
a57d75821c
[bugfix] make args.stream work ( #6831 )
2024-07-27 09:07:02 +00:00
925de97e05
[Bugfix] Fix VLM example typo ( #6859 )
2024-07-27 14:24:08 +08:00
aa46953a20
[Misc][VLM][Doc] Consolidate offline examples for vision language models ( #6858 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-07-26 22:44:13 -07:00
593e79e733
[Bugfix] torch.set_num_threads() in multiproc_gpu_executor ( #6802 )
...
[Bugfix] Use torch.set_num_threads() to configure parallelism in multiproc_gpu_executor (#6802 )
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-26 22:15:20 -07:00
c53041ae3b
[Doc] Add missing mock import to docs conf.py ( #6834 )
2024-07-27 04:47:33 +00:00
52f07e3dec
[Hardware][TPU] Implement tensor parallelism with Ray ( #5871 )
2024-07-26 20:54:27 -07:00
14dbd5a767
[Model] H2O Danube3-4b ( #6451 )
2024-07-26 20:47:50 -07:00
ed94e4f427
[Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba ( #6784 )
2024-07-26 20:45:31 -07:00
3c3012398e
[Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron ( #6844 )
...
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com >
2024-07-26 20:20:16 -07:00
ced36cd89b
[ROCm] Upgrade PyTorch nightly version ( #6845 )
2024-07-26 20:16:13 -07:00
969d032265
[Bugfix]: Fix Tensorizer test failures ( #6835 )
2024-07-26 20:02:25 -07:00
55712941e5
[Bug Fix] Illegal memory access, FP8 Llama 3.1 405b ( #6852 )
2024-07-27 02:27:44 +00:00
981b0d5673
[Frontend] Factor out code for running uvicorn ( #6828 )
2024-07-27 09:58:25 +08:00
d09b94ca58
[TPU] Support collective communications in XLA devices ( #6813 )
2024-07-27 01:45:57 +00:00
bb5494676f
enforce eager mode with bnb quantization temporarily ( #6846 )
2024-07-27 01:32:20 +00:00
b5f49ee55b
Update README.md ( #6847 )
2024-07-27 00:26:45 +00:00
150a1ffbfd
[Doc] Update SkyPilot doc for wrong indents and instructions for update service ( #4283 )
2024-07-26 14:39:10 -07:00
281977bd6e
[Doc] Add Nemotron to supported model docs ( #6843 )
2024-07-26 17:32:44 -04:00
3bbb4936dc
[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation ( #6125 )
2024-07-26 13:50:10 -07:00
aa4867791e
[Misc][TPU] Support TPU in initialize_ray_cluster ( #6812 )
2024-07-26 19:39:49 +00:00
71734f1bf2
[Build/CI][ROCm] Minor simplification to Dockerfile.rocm ( #6811 )
2024-07-26 12:28:32 -07:00
50704f52c4
[Bugfix][Kernel] Promote another index to int64_t ( #6838 )
2024-07-26 18:41:04 +00:00
07278c37dd
[Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) ( #6611 )
2024-07-26 14:33:42 -04:00
85ad7e2d01
[doc][debugging] add known issues for hangs ( #6816 )
2024-07-25 21:48:05 -07:00
89a84b0bb7
[Core] Use array to speedup padding ( #6779 )
2024-07-25 21:31:31 -07:00
084a01fd35
[Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor. ( #6770 )
2024-07-25 21:25:35 -07:00
062a1d0fab
Fix ReplicatedLinear weight loading ( #6793 )
2024-07-25 19:24:58 -07:00
2eb9f4ff26
[ci] Mark tensorizer as soft fail and separate from grouped test ( #6810 )
...
[ci] Mark tensorizer test as soft fail and separate it from grouped test in fast check (#6810 )
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-25 18:08:33 -07:00
443c7cf4cf
[ci][distributed] fix flaky tests ( #6806 )
2024-07-25 17:44:09 -07:00
1adddb14bf
[Core] Fix ray forward_dag error mssg ( #6792 )
2024-07-25 16:53:25 -07:00
b7215de2c5
[Docs] Publish 5th meetup slides ( #6799 )
2024-07-25 16:47:55 -07:00
f3ff63c3f4
[doc][distributed] improve multinode serving doc ( #6804 )
2024-07-25 15:38:32 -07:00
cd7edc4e87
[Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 using compressed tensors ( #6798 )
2024-07-25 15:05:09 -07:00
6a1e25b151
[Doc] Add documentations for nightly benchmarks ( #6412 )
2024-07-25 11:57:16 -07:00
95db75de64
[Bugfix] Add synchronize to prevent possible data race ( #6788 )
...
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-07-25 10:40:01 -07:00
65b1f121c8
[Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints ( #6761 )
2024-07-25 09:46:15 -07:00
889da130e7
[ Misc ] fp8-marlin channelwise via compressed-tensors ( #6524 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-07-25 09:46:04 -07:00
b75e314fff
[Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V ( #6787 )
...
Co-authored-by: hezhihui <hzh7269@modelbest.cn >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-25 09:42:49 -07:00
316a41ac1d
[Bugfix] Fix encoding_format in examples/openai_embedding_client.py ( #6755 )
2024-07-24 22:48:07 -07:00
0310029a2f
[Bugfix] Fix awq_marlin and gptq_marlin flags ( #6745 )
2024-07-24 22:34:11 -07:00
309aaef825
[Bugfix] Fix decode tokens w. CUDA graph ( #6757 )
2024-07-24 22:33:56 -07:00
9e169a4c61
[Model] Adding support for MiniCPM-V ( #4087 )
2024-07-24 20:59:30 -07:00
5689e256ba
[Frontend] Represent tokens with identifiable strings ( #6626 )
2024-07-25 09:51:00 +08:00
740374d456
[core][distributed] fix zmq hang ( #6759 )
2024-07-24 17:37:12 -07:00
d88c458f44
[Doc][AMD][ROCm]Added tips to refer to mi300x tuning guide for mi300x users ( #6754 )
2024-07-24 14:32:57 -07:00
421e218b37
[Bugfix] Bump transformers to 4.43.2 ( #6752 )
2024-07-24 13:22:16 -07:00
5448f67635
[Core] Tweaks to model runner/input builder developer APIs ( #6712 )
2024-07-24 12:17:12 -07:00
0e63494cf3
Add fp8 support to reshape_and_cache_flash ( #6667 )
2024-07-24 18:36:52 +00:00
ee812580f7
[Frontend] split run_server into build_server and run_server ( #6740 )
2024-07-24 10:36:04 -07:00
40468b13fa
[Bugfix] Miscalculated latency lead to time_to_first_token_seconds inaccurate. ( #6686 )
2024-07-24 08:58:42 -07:00
2cf0df3381
[Bugfix] Fix speculative decode seeded test ( #6743 )
2024-07-24 08:58:31 -07:00
545146349c
Adding f-string to validation error which is missing ( #6748 )
2024-07-24 08:55:53 -07:00
f4f8a9d892
[Bugfix]fix modelscope compatible issue ( #6730 )
2024-07-24 05:04:46 -07:00
b570811706
[Build/CI] Update run-amd-test.sh. Enable Docker Hub login. ( #6711 )
2024-07-24 05:01:14 -07:00
ccc4a73257
[Docs][ROCm] Detailed instructions to build from source ( #6680 )
2024-07-24 01:07:23 -07:00
0a740a11ba
[Bugfix] Fix token padding for chameleon ( #6724 )
2024-07-24 01:05:09 -07:00
c882a7f5b3
[SpecDecoding] Update MLPSpeculator CI tests to use smaller model ( #6714 )
2024-07-24 07:34:22 +00:00
5e8ca973eb
[Bugfix] fix flashinfer cudagraph capture for PP ( #6708 )
2024-07-24 01:49:44 +00:00
87525fab92
[bitsandbytes]: support read bnb pre-quantized model ( #5753 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-07-23 23:45:09 +00:00
2f808e69ab
[Bugfix] StatLoggers: cache spec decode metrics when they get collected. ( #6645 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-23 23:05:05 +00:00
01c16ede6b
[CI] Add smoke test for non-uniform AutoFP8 quantization ( #6702 )
2024-07-23 22:45:12 +00:00
72fc704803
[build] relax wheel size limit ( #6704 )
2024-07-23 14:03:49 -07:00
1bedf210e3
Bump transformers version for Llama 3.1 hotfix and patch Chameleon ( #6690 )
2024-07-23 13:47:48 -07:00
507ef787d8
[Model] Pipeline Parallel Support for DeepSeek v2 ( #6519 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-23 12:22:09 -07:00
58f53034ad
[Frontend] Add Usage data in each chunk for chat_serving. #6540 ( #6652 )
2024-07-23 11:41:55 -07:00
0eb0757bef
[Misc] Add ignored layers for fp8 quantization ( #6657 )
2024-07-23 14:04:04 -04:00
38c4b7e863
Bump version to 0.5.3.post1 ( #6696 )
2024-07-23 10:08:59 -07:00
a112a84aad
[BugFix] Fix RoPE error in Llama 3.1 ( #6693 )
2024-07-23 09:46:05 -07:00
461089a21a
[Bugfix] Fix a log error in chunked prefill ( #6694 )
2024-07-23 09:27:58 -07:00
71950af726
[doc][distributed] fix doc argument order ( #6691 )
2024-07-23 08:55:33 -07:00
cb1362a889
[Docs] Announce llama3.1 support ( #6688 )
2024-07-23 08:18:15 -07:00
bb2fc08072
Bump version to v0.5.3 ( #6674 )
2024-07-23 00:00:08 -07:00
3eda4ec780
support ignore patterns in model loader ( #6673 )
2024-07-22 23:59:42 -07:00
22fa2e35cb
[VLM][Model] Support image input for Chameleon ( #6633 )
2024-07-22 23:50:48 -07:00
c5201240a4
[misc] only tqdm for first rank ( #6672 )
2024-07-22 21:57:27 -07:00
97234be0ec
[Misc] Manage HTTP connections in one place ( #6600 )
2024-07-22 21:32:02 -07:00
c051bfe4eb
[doc][distributed] doc for setting up multi-node environment ( #6529 )
...
[doc][distributed] add more doc for setting up multi-node environment (#6529 )
2024-07-22 21:22:09 -07:00
9e0b558a09
[Misc] Support FP8 kv cache scales from compressed-tensors ( #6528 )
2024-07-23 04:11:50 +00:00
e519ae097a
add tqdm when loading checkpoint shards ( #6569 )
...
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io >
Co-authored-by: youkaichao <youkaichao@126.com >
2024-07-22 20:48:01 -07:00
7c2749a4fd
[misc] add start loading models for users information ( #6670 )
2024-07-22 20:08:02 -07:00
729171ae58
[Misc] Enable chunked prefill by default for long context models ( #6666 )
2024-07-22 20:03:13 -07:00
c5e8330997
[Bugfix] Fix null modules_to_not_convert in FBGEMM Fp8 quantization ( #6665 )
2024-07-22 19:25:05 -07:00
e0c15758b8
[Core] Modulize prepare input and attention metadata builder ( #6596 )
2024-07-23 00:45:24 +00:00
bdf5fd1386
[Misc] Remove deprecation warning for beam search ( #6659 )
2024-07-23 00:21:58 +00:00
5a96ee52a3
[ci][build] add back vim in docker ( #6661 )
2024-07-22 16:26:29 -07:00
42c7f66a38
[Core] Support dynamically loading Lora adapter from HuggingFace ( #6234 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-07-22 15:42:40 -07:00
69d5ae38dc
[ci] Use different sccache bucket for CUDA 11.8 wheel build ( #6656 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-22 14:20:41 -07:00
fea59c7712
[Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels ( #6649 )
2024-07-22 14:08:30 -06:00
739b61a348
[Frontend] Refactor prompt processing ( #4028 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-22 10:13:53 -07:00
89c1c6a196
[Bugfix] Fix vocab_size field access in llava_next.py ( #6624 )
2024-07-22 05:02:51 +00:00
42de2cefcb
[Misc] Add a wrapper for torch.inference_mode ( #6618 )
2024-07-21 18:43:11 -07:00
c9eef37f32
[Model] Initial Support for Chameleon ( #5770 )
2024-07-21 17:37:51 -07:00
396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel ( #6612 )
2024-07-21 19:41:42 -04:00
25e778aa16
[Model] Refactor and decouple phi3v image embedding ( #6621 )
2024-07-21 16:07:58 -07:00
b6df37f943
[Misc] Remove abused noqa ( #6619 )
2024-07-21 23:47:04 +08:00
14f91fe67c
[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. ( #6485 )
2024-07-20 23:58:58 -07:00
d7f4178dd9
[Frontend] Move chat utils ( #6602 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-21 08:38:17 +08:00
082ecd80d5
[ Bugfix ] Fix AutoFP8 fp8 marlin ( #6609 )
2024-07-20 17:25:56 -06:00
f952bbc8ff
[Misc] Fix input_scale typing in w8a8_utils.py ( #6579 )
2024-07-20 23:11:13 +00:00
9364f74eee
[ Kernel ] Enable fp8-marlin for fbgemm-fp8 models ( #6606 )
2024-07-20 18:50:10 +00:00
06d6c5fe9f
[Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes ( #6543 )
2024-07-20 09:39:07 -07:00
683e3cb9c4
[ Misc ] fbgemm checkpoints ( #6559 )
2024-07-20 09:36:57 -07:00
9042d68362
[Misc] Consolidate and optimize logic for building padded tensors ( #6541 )
2024-07-20 04:17:24 +00:00
3f8d42c81f
Pipeline Parallel: Guard for KeyErrors at request abort ( #6587 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-19 19:18:19 -07:00
7bd82002ae
[Core] Allow specifying custom Executor ( #6557 )
2024-07-20 01:25:06 +00:00
2e26564259
[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub ( #6593 )
...
Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com >
2024-07-19 18:15:26 -07:00
e81522e879
[build] add ib in image for out-of-the-box infiniband support ( #6599 )
...
[build] add ib so that multi-node support with infiniband can be supported out-of-the-box (#6599 )
2024-07-19 17:16:57 -07:00
45ceb85a0c
[Docs] Update PP docs ( #6598 )
2024-07-19 16:38:21 -07:00
4cc24f01b1
[ Kernel ] Enable Dynamic Per Token fp8 ( #6547 )
2024-07-19 23:08:15 +00:00
07eb6f19f3
[bugfix][distributed] fix multi-node bug for shared memory ( #6597 )
2024-07-19 15:34:34 -07:00
f0bbfaf917
[Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection ( #6578 )
2024-07-19 14:01:03 -07:00
30efe41532
[Docs] Update docs for wheel location ( #6580 )
2024-07-19 12:14:11 -07:00
9ed82e7074
[Misc] Small perf improvements ( #6520 )
2024-07-19 12:10:56 -07:00
51f8aa90ad
[Bugfix][Frontend] remove duplicate init logger ( #6581 )
2024-07-19 10:16:27 -07:00
a5314e8698
[Model] RowParallelLinear: pass bias to quant_method.apply ( #6327 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-19 07:15:22 -06:00
a921e86392
[BUGFIX] Raise an error for no draft token case when draft_tp>1 ( #6369 )
2024-07-19 06:01:09 -07:00
6366efc67b
[Bugfix][Frontend] Fix missing /metrics endpoint ( #6463 )
2024-07-19 03:55:13 +00:00
dbe5588554
[ Misc ] non-uniform quantization via compressed-tensors for Llama ( #6515 )
2024-07-18 22:39:18 -04:00
d4201e06d5
[Bugfix] Make spec. decode respect per-request seed. ( #6034 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-07-18 19:22:08 -07:00
b5672a112c
[Core] Multiprocessing Pipeline Parallel support ( #6130 )
...
Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-18 19:15:52 -07:00
c5df56f88b
Add support for a rope extension method ( #6553 )
2024-07-19 01:53:03 +00:00
1689219ebf
[CI/Build] Build on Ubuntu 20.04 instead of 22.04 ( #6517 )
2024-07-18 17:29:25 -07:00
4ffffccb7e
[Kernel] Implement fallback for FP8 channelwise using torch._scaled_mm ( #6552 )
2024-07-18 23:52:22 +00:00
f53b8f0d05
[ci][test] add correctness test for cpu offloading ( #6549 )
2024-07-18 23:41:06 +00:00
2d4733ba2d
Fix PR comment bot ( #6554 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-18 14:48:29 -07:00
15c6a079b1
[Model] Support Mistral-Nemo ( #6548 )
2024-07-18 20:31:50 +00:00
ecdb462c24
[ci] Reword Github bot comment ( #6534 )
2024-07-18 08:01:45 -07:00
58ca663224
[ Misc ] Improve Min Capability Checking in compressed-tensors ( #6522 )
2024-07-18 14:39:12 +00:00
4634c8728b
[TPU] Refactor TPU worker & model runner ( #6506 )
2024-07-18 01:34:16 -07:00
c8a7d51c49
[Bugfix] Update flashinfer.py with PagedAttention forwards - Fixes Gemma2 OpenAI Server Crash ( #6501 )
2024-07-18 07:47:13 +00:00
e2fbaee725
[BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs ( #6227 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-18 15:13:30 +08:00
8a74c68bd1
[Misc] Minor patch for draft model runner ( #6523 )
2024-07-18 06:06:21 +00:00
61e592747c
[Core] Introduce SPMD worker execution using Ray accelerated DAG ( #6032 )
...
Signed-off-by: Rui Qiao <ruisearch42@gmail.com >
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu >
2024-07-17 22:27:09 -07:00
d25877dd9b
[BugFix] Avoid secondary error in ShmRingBuffer destructor ( #6530 )
2024-07-17 22:24:43 -07:00
1c27d25fb5
[core][model] yet another cpu offload implementation ( #6496 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-07-17 20:54:35 -07:00
18fecc3559
[ Kernel ] Fp8 Channelwise Weight Support ( #6487 )
2024-07-18 03:18:13 +00:00
b5af8c223c
[Model] Pipeline parallel support for Mixtral ( #6516 )
2024-07-17 19:26:04 -07:00
b5241e41d9
[ Kernel ] FP8 Dynamic-Per-Token Quant Kernel ( #6511 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-07-18 01:38:35 +00:00
e76466dde2
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step ( #6338 )
2024-07-17 14:30:28 -07:00
5f0b9933e6
[Bugfix] Fix Ray Metrics API usage ( #6354 )
2024-07-17 19:40:10 +00:00
a38524f338
[DOC] - Add docker image to Cerebrium Integration ( #6510 )
2024-07-17 10:22:53 -07:00
2fa4623d9e
[Core] Refactor _prepare_model_input_tensors - take 2 ( #6164 )
2024-07-17 09:37:16 -07:00
a9a2e74d21
[Misc] Use torch.Tensor for type annotation ( #6505 )
2024-07-17 13:01:10 +00:00
e09ce759aa
[TPU] Remove multi-modal args in TPU backend ( #6504 )
2024-07-17 04:02:53 -07:00
5fa6e9876e
[Bugfix] Fix for multinode crash on 4 PP ( #6495 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-17 08:25:10 +00:00
5bf35a91e4
[Doc][CI/Build] Update docs and tests to use vllm serve ( #6431 )
2024-07-17 07:43:21 +00:00
a19e8d3726
[Misc][Speculative decoding] Typos and typing fixes ( #6467 )
...
Co-authored-by: caishangming.csm <caishangming.csm@alibaba-inc.com >
2024-07-17 07:17:07 +00:00
10383887e0
[ROCm] Cleanup Dockerfile and remove outdated patch ( #6482 )
2024-07-16 22:47:02 -07:00
1d094fd7c0
[Distributed][PP] only create embedding & lm head when necessary ( #6455 )
...
original title: [Distributed][Model] Rank-based Component Creation for Pipeline Parallelism Memory Optimization
2024-07-16 19:20:26 -07:00
ce37be7ba0
[misc][distributed] add seed to dummy weights ( #6491 )
2024-07-16 19:16:34 -07:00
7f62077af5
[misc][distributed] improve tests ( #6488 )
2024-07-16 17:35:52 -07:00
09c2eb85dd
[ci][distributed] add pipeline parallel correctness test ( #6410 )
2024-07-16 15:44:22 -07:00
978aed5300
[Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale ( #6081 )
2024-07-16 15:31:32 -07:00
160e1d8c99
[Misc] Log spec decode metrics ( #6454 )
2024-07-16 20:37:10 +00:00
94162beb9f
[Doc] Fix the lora adapter path in server startup script ( #6230 )
2024-07-16 10:11:04 -07:00
c467dff24f
[Hardware][TPU] Support MoE with Pallas GMM kernel ( #6457 )
2024-07-16 09:56:28 -07:00
9f4ccec761
[doc][misc] remind to cancel debugging environment variables ( #6481 )
...
[doc][misc] remind users to cancel debugging environment variables after debugging (#6481 )
2024-07-16 09:45:30 -07:00
38ef94888a
[CI/Build] Remove "boardwalk" image asset ( #6460 )
2024-07-16 08:59:36 -07:00
2bb0489cb3
[Core] Use numpy to speed up padded token processing ( #6442 )
2024-07-16 08:13:25 -07:00
7508a3dc34
[Misc] Fix typos in spec. decode metrics logging. ( #6470 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-16 13:55:15 +00:00
7a3d2a5b95
[Frontend] Support for chat completions input in the tokenize endpoint ( #5923 )
2024-07-16 20:18:09 +08:00
d97011512e
[CI/Build] vLLM cache directory for images ( #6444 )
2024-07-15 23:12:25 -07:00
37d776606f
[Docs] Announce 5th meetup ( #6458 )
2024-07-15 21:04:58 -07:00
d92b3c5cde
[Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests ( #6419 )
2024-07-15 18:54:15 -07:00
9ad32dacd9
[BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug ( #6425 )
...
Co-authored-by: Mor Zusman <morz@ai21.com >
2024-07-16 01:32:55 +00:00
d6f3b3d5c4
Pin sphinx-argparse version ( #6453 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-16 01:26:11 +00:00
4552e37b55
[CI/Build][TPU] Add TPU CI test ( #6277 )
...
Co-authored-by: kevin <kevin@anyscale.com >
2024-07-15 14:31:16 -07:00
ec9933f4a5
[Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod ( #6289 )
2024-07-15 19:02:14 +00:00
3dee97b05f
[Docs] Add Google Cloud to sponsor list ( #6450 )
2024-07-15 11:58:10 -07:00
4cf256ae7f
[misc][distributed] fix pp missing layer condition ( #6446 )
2024-07-15 10:32:35 -07:00
64fdc08c72
bump version to v0.5.2 ( #6433 )
2024-07-15 17:27:40 +00:00
4ef95b0f06
[Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF ( #6409 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-15 13:14:49 -04:00
eaec4b9153
[Bugfix] Add custom Triton cache manager to resolve MoE MP issue ( #6140 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Chih-Chieh-Yang <chih.chieh.yang@ibm.com >
2024-07-15 10:12:47 -07:00
a63a4c6341
[Misc] Use 0.0.9 version for flashinfer ( #6447 )
...
Co-authored-by: Pernekhan Utemuratov <pernekhan@deepinfra.com >
2024-07-15 10:10:26 -07:00
c8fd97f26d
[Kernel] Use CUTLASS kernels for the FP8 layers with Bias ( #6270 )
2024-07-15 13:05:52 -04:00
94b82e8c18
[doc][distributed] add suggestion for distributed inference ( #6418 )
2024-07-15 09:45:51 -07:00
6ae1597ddf
[VLM] Minor space optimization for ClipVisionModel ( #6436 )
2024-07-15 17:29:51 +08:00
22e79ee8f3
[doc][misc] doc update ( #6439 )
2024-07-14 23:33:25 -07:00
de19916314
[Bugfix] Convert image to RGB by default ( #6430 )
2024-07-15 05:39:15 +00:00
69672f116c
[core][distributed] simplify code to support pipeline parallel ( #6406 )
2024-07-14 21:20:51 -07:00
44874a0bf9
[Doc] add env docs for flashinfer backend ( #6437 )
2024-07-14 21:16:51 -07:00
b47008b4d2
[BugFix] BatchResponseData body should be optional ( #6345 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-15 04:06:09 +00:00
9bfece89fd
Add FUNDING.yml ( #6435 )
2024-07-14 20:36:16 -07:00
32c9d7f765
Report usage for beam search ( #6404 )
2024-07-14 19:37:35 -07:00
ccb20db8bd
[Bugfix] Benchmark serving script used global parameter 'args' in function 'sample_random_requests' ( #6428 )
2024-07-14 19:27:01 -07:00
a754dc2cb9
[CI/Build] Cross python wheel ( #6394 )
2024-07-14 18:54:46 -07:00
61e85dbad8
[Doc] xpu backend requires running setvars.sh ( #6393 )
2024-07-14 17:10:11 -07:00
dbfe254eda
[Feature] vLLM CLI ( #5090 )
...
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-07-14 15:36:43 -07:00
73030b7dae
[ Misc ] Enable Quantizing All Layers of DeekSeekv2 ( #6423 )
2024-07-14 21:38:42 +00:00
ccd3c04571
[ci][build] fix commit id ( #6420 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-07-14 22:16:21 +08:00
9dad5cc859
[Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace ( #6384 )
2024-07-14 13:37:19 +00:00
6ef3bf912c
Remove unnecessary trailing period in spec_decode.rst ( #6405 )
2024-07-14 07:58:09 +00:00
540c0368b1
[Model] Initialize Fuyu-8B support ( #3924 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-14 05:27:14 +00:00
fb6af8bc08
[ Misc ] Apply MoE Refactor to Deepseekv2 To Support Fp8 ( #6417 )
2024-07-13 20:03:58 -07:00
eeceadaecc
[Misc] Add deprecation warning for beam search ( #6402 )
2024-07-13 11:52:22 -07:00
babf52dade
[ Misc ] More Cleanup of Marlin ( #6359 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-07-13 10:21:37 +00:00
9da4aad44b
Updating LM Format Enforcer version to v10.3 ( #6411 )
2024-07-13 10:09:12 +00:00
41708e5034
[ci] try to add multi-node tests ( #6280 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-12 21:51:48 -07:00
d80aef3776
[Docs] Clean up latest news ( #6401 )
2024-07-12 19:36:53 -07:00
e1684a766a
[Bugfix] Fix hard-coded value of x in context_attention_fwd ( #6373 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-12 18:30:54 -07:00
a27f87da34
[Doc] Fix Typo in Doc ( #6392 )
...
Co-authored-by: Saliya Ekanayake <esaliya@d-matrix.ai >
2024-07-13 00:48:23 +00:00
16ff6bd58c
[ci] Fix wording for GH bot ( #6398 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-12 16:34:37 -07:00
f8f9ff57ee
[Bugfix][TPU] Fix megacore setting for v5e-litepod ( #6397 )
2024-07-12 15:59:47 -07:00
6bc9710f6e
Fix release pipeline's dir permission ( #6391 )
2024-07-12 15:52:43 -07:00
111fc6e7ec
[Misc] Add generated git commit hash as vllm.__commit__ ( #6386 )
2024-07-12 22:52:15 +00:00
75f64d8b94
[Bugfix] Fix illegal memory access in FP8 MoE kernel ( #6382 )
2024-07-12 21:33:33 +00:00
21b2dcedab
Fix release pipeline's -e flag ( #6390 )
2024-07-12 14:08:04 -07:00
07b35af86d
Fix interpolation in release pipeline ( #6389 )
2024-07-12 14:03:39 -07:00
bb1a784b05
Fix release-pipeline.yaml ( #6388 )
2024-07-12 14:00:57 -07:00
d719ba24c5
Build some nightly wheels by default ( #6380 )
2024-07-12 13:56:59 -07:00
aa48e502fb
[MISC] Upgrade dependency to PyTorch 2.3.1 ( #5327 )
2024-07-12 12:04:26 -07:00
4dbebd03cc
[ci] Add GHA workflows to enable full CI run ( #6381 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-12 11:36:26 -07:00
b75bce1008
[ci] Add grouped tests & mark tests to run by default for fastcheck pipeline ( #6365 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-12 09:58:38 -07:00
b039cbbce3
[Misc] add fixture to guided processor tests ( #6341 )
2024-07-12 09:55:39 -07:00
f9d25c2519
[Build/CI] Checking/Waiting for the GPU's clean state ( #6379 )
2024-07-12 09:42:24 -07:00
024ad87cdc
[Bugfix] Fix dtype mismatch in PaliGemma ( #6367 )
2024-07-12 08:22:18 -07:00
aea19f0989
[ Misc ] Support Models With Bias in compressed-tensors integration ( #6356 )
2024-07-12 11:11:29 -04:00
f7160d946a
[Misc][Bugfix] Update transformers for tokenizer issue ( #6364 )
2024-07-12 08:40:07 +00:00
6047187cd8
[ Misc ] Remove separate bias add ( #6353 )
2024-07-12 05:06:09 +00:00
b6c16cf8ff
[ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in cuda/rocm ( #6352 )
2024-07-11 21:30:46 -07:00
d26a8b3f1f
[CI/Build] (2/2) Switching AMD CI to store images in Docker Hub ( #6350 )
2024-07-11 21:26:26 -07:00
d59eb98489
[Model][Phi3-Small] Remove scipy from blocksparse_attention ( #6343 )
2024-07-12 10:47:17 +08:00
adf32e0a0f
[Bugfix] Fix usage stats logging exception warning with OpenVINO ( #6349 )
2024-07-12 10:47:00 +08:00
2b0fb53481
[distributed][misc] be consistent with pytorch for libcudart.so ( #6346 )
...
[distributed][misc] keep consistent with how pytorch finds libcudart.so (#6346 )
2024-07-11 19:35:17 -07:00
d6ab528997
[Misc] Remove flashinfer warning, add flashinfer tests to CI ( #6351 )
2024-07-12 01:32:06 +00:00
7ed6a4f0e1
[ BugFix ] Prompt Logprobs Detokenization ( #6223 )
...
Co-authored-by: Zifei Tong <zifeitong@gmail.com >
2024-07-11 22:02:29 +00:00
a4feba929b
[CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy ( #5362 )
2024-07-11 13:28:38 -07:00
2d23b42d92
[doc] update pipeline parallel in readme ( #6347 )
2024-07-11 11:38:40 -07:00
1df43de9bb
[bug fix] Fix llava next feature size calculation. ( #6339 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
2024-07-11 17:21:10 +00:00
52b7fcb35a
Benchmark: add H100 suite ( #6047 )
2024-07-11 09:17:07 -07:00
b675069d74
[ Misc ] Refactor Marlin Python Utilities ( #6082 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-07-11 15:40:11 +00:00
55f692b46e
[BugFix] get_and_reset only when scheduler outputs are not empty ( #6266 )
2024-07-11 07:40:20 -07:00
8a1415cf77
[Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. ( #6326 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-07-11 07:05:59 -07:00
546b101fa0
[BugFix]: fix engine timeout due to request abort ( #6255 )
...
Signed-off-by: yatta zhang <ytzhang01@foxmail.com >
Signed-off-by: zhangyuntao.dev <zhangyuntao.dev@bytedance.com >
Co-authored-by: zhangyuntao.dev <zhangyuntao.dev@bytedance.com >
2024-07-11 06:46:31 -07:00
3963a5335b
[Misc] refactor(config): clean up unused code ( #6320 )
2024-07-11 09:39:07 +00:00
c4774eb841
[Bugfix] Fix snapshot download in serving benchmark ( #6318 )
2024-07-11 07:04:05 +00:00
fc17110bbe
[BugFix]: set outlines pkg version ( #6262 )
2024-07-11 04:37:11 +00:00
439c84581a
[Doc] Update description of vLLM support for CPUs ( #6003 )
2024-07-10 21:15:29 -07:00
99ded1e1c4
[Doc] Remove comments incorrectly copied from another project ( #6286 )
2024-07-10 17:05:26 -07:00
997df46a32
[Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor ( #6313 )
2024-07-10 16:39:02 -07:00
ae151d73be
[Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models ( #5765 )
2024-07-10 16:02:47 -07:00
44cc76610d
[Bugfix] Fix OpenVINOExecutor abstractmethod error ( #6296 )
...
Signed-off-by: sangjune.park <sangjune.park@navercorp.com >
2024-07-10 10:03:32 -07:00
b422d4961a
[CI/Build] Enable mypy typing for remaining folders ( #6268 )
2024-07-10 22:15:55 +08:00
c38eba3046
[Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. ( #6303 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-07-10 09:04:07 -04:00
e72ae80b06
[Bugfix] Support 2D input shape in MoE layer ( #6287 )
2024-07-10 09:03:16 -04:00
8a924d2248
[Doc] Guide for adding multi-modal plugins ( #6205 )
2024-07-10 14:55:34 +08:00
5ed3505d82
[Bugfix][TPU] Add prompt adapter methods to TPUExecutor ( #6279 )
2024-07-09 19:30:56 -07:00
da78caecfa
[core][distributed] zmq fallback for broadcasting large objects ( #6183 )
...
[core][distributed] add zmq fallback for broadcasting large objects (#6183 )
2024-07-09 18:49:11 -07:00
2416b26e11
[Speculative Decoding] Medusa Implementation with Top-1 proposer ( #4978 )
2024-07-09 18:34:02 -07:00
d3a245138a
[Bugfix]fix and needs_scalar_to_array logic check ( #6238 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-07-09 23:43:24 +00:00
673dd4cae9
[Docs] Docs update for Pipeline Parallel ( #6222 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-07-09 16:24:58 -07:00
4d6ada947c
[CORE] Adding support for insertion of soft-tuned prompts ( #4645 )
...
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com >
Co-authored-by: Joe G <joseph.granados@h2o.ai >
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-07-09 13:26:36 -07:00
a0550cbc80
Add support for multi-node on CI ( #5955 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-07-09 12:56:56 -07:00
08c5bdecae
[Bugfix][TPU] Fix outlines installation in TPU Dockerfile ( #6256 )
2024-07-09 02:56:06 -07:00
5d5b4c5fe5
[Bugfix][TPU] Add missing None to model input ( #6245 )
2024-07-09 00:21:37 -07:00
70c232f85a
[core][distributed] fix ray worker rank assignment ( #6235 )
2024-07-08 21:31:44 -07:00
a3c9435d93
[hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability ( #6216 )
2024-07-08 20:02:15 -07:00
4f0e0ea131
Add FlashInfer to default Dockerfile ( #6172 )
2024-07-08 13:38:03 -07:00
ddc369fba1
[Bugfix] Mamba cache Cuda Graph padding ( #6214 )
2024-07-08 11:25:51 -07:00
185ad31f37
[Bugfix] use diskcache in outlines _get_guide #5436 ( #6203 )
2024-07-08 11:23:24 -07:00
543aa48573
[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) ( #4888 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-07-08 17:12:15 +00:00
f7a8fa39d8
[Kernel] reloading fused_moe config on the last chunk ( #6210 )
2024-07-08 08:00:38 -07:00
717f4bcea0
Feature/add benchmark testing ( #5947 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-08 07:52:06 +00:00
16620f439d
do not exclude object field in CompletionStreamResponse ( #6196 )
2024-07-08 10:32:57 +08:00
3b08fe2b13
[misc][frontend] log all available endpoints ( #6195 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-07-07 15:11:12 -07:00
abfe705a02
[ Misc ] Support Fp8 via llm-compressor ( #6110 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-07-07 20:42:11 +00:00
333306a252
add benchmark for fix length input and output ( #5857 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-07 07:42:13 +00:00
6206dcb29e
[Model] Add PaliGemma ( #5189 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-07-07 09:25:50 +08:00
9389380015
[Doc] Move guide for multimodal model and other improvements ( #6168 )
2024-07-06 17:18:59 +08:00
175c43eca4
[Doc] Reorganize Supported Models by Type ( #6167 )
2024-07-06 05:59:36 +00:00
bc96d5c330
Move release wheel env var to Dockerfile instead ( #6163 )
2024-07-05 17:19:53 -07:00
f0250620dd
Fix release wheel build env var ( #6162 )
2024-07-05 16:24:31 -07:00
2de490d60f
Update wheel builds to strip debug ( #6161 )
2024-07-05 14:51:25 -07:00
79d406e918
[Docs] Fix readthedocs for tag build ( #6158 )
2024-07-05 12:44:40 -07:00
abad5746a7
bump version to v0.5.1 ( #6157 )
2024-07-05 12:04:51 -07:00
e58294ddf2
[Bugfix] Add verbose error if scipy is missing for blocksparse attention ( #5695 )
2024-07-05 10:41:01 -07:00
f1e15da6fe
[Frontend] Continuous usage stats in OpenAI completion API ( #5742 )
2024-07-05 10:37:09 -07:00
0097bb1829
[Bugfix] Use templated datasource in grafana.json to allow automatic imports ( #6136 )
...
Signed-off-by: Christian Rohmann <christian.rohmann@inovex.de >
2024-07-05 09:49:47 -07:00
ea4b570483
[VLM] Cleanup validation and update docs ( #6149 )
2024-07-05 05:49:38 +00:00
a41357e941
[VLM] Improve consistency between feature size calculation and dummy data for profiling ( #6146 )
2024-07-05 09:29:47 +08:00
ae96ef8fbd
[VLM] Calculate maximum number of multi-modal tokens by model ( #6121 )
2024-07-04 16:37:23 -07:00
69ec3ca14c
[Kernel][Model] logits_soft_cap for Gemma2 with flashinfer ( #6051 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-07-04 16:35:51 -07:00
81d7a50f24
[Hardware][Intel CPU] Adding intel openmp tunings in Docker file ( #6008 )
...
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com >
2024-07-04 15:22:12 -07:00
27902d42be
[misc][doc] try to add warning for latest html ( #5979 )
2024-07-04 09:57:09 -07:00
56b325e977
[ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash attention and naive flash attention ( #6043 )
...
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com >
2024-07-03 22:19:38 -07:00
3dd507083f
[CI/Build] Cleanup VLM tests ( #6107 )
2024-07-03 18:58:18 -07:00
0ed646b7aa
[Distributed][Core] Support Py39 and Py38 for PP ( #6120 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-03 17:52:29 -07:00
1dab9bc8a9
[Bugfix] set OMP_NUM_THREADS to 1 by default for multiprocessing ( #6109 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-07-03 16:56:59 -07:00
3de6e6a30e
[core][distributed] support n layers % pp size != 0 ( #6115 )
2024-07-03 16:40:31 -07:00
966fe72141
[doc][misc] bump up py version in installation doc ( #6119 )
2024-07-03 15:52:04 -07:00
62963d129e
[ Misc ] Clean Up CompressedTensorsW8A8 ( #6113 )
2024-07-03 22:50:08 +00:00
d9e98f42e4
[vlm] Remove vision language config. ( #6089 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-03 22:14:16 +00:00
3c6325f0fc
[core][distributed] custom allreduce when pp size > 1 ( #6117 )
2024-07-03 14:41:32 -07:00
47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin ( #5975 )
2024-07-03 17:38:00 +00:00
7cd2ebb025
[Bugfix] Fix compute_logits in Jamba ( #6093 )
2024-07-03 00:32:35 -07:00
f1c78138aa
[Doc] Fix Mock Import ( #6094 )
2024-07-03 00:13:56 -07:00
3a86b54fb0
[VLM][Frontend] Proper Image Prompt Formatting from OpenAI API ( #6091 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-07-02 23:41:23 -07:00
f666207161
[misc][distributed] error on invalid state ( #6092 )
2024-07-02 23:37:29 -07:00
d830656a97
[BugFix] Avoid unnecessary Ray import warnings ( #6079 )
2024-07-03 14:09:40 +08:00
d18bab3587
[CI] Fix base url doesn't strip "/" ( #6087 )
2024-07-02 21:31:25 -07:00
9831aec49f
[Core] Dynamic image size support for VLMs ( #5276 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: ywang96 <ywang@roblox.com >
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-07-02 20:34:00 -07:00
482045ee77
[hardware][misc] introduce platform abstraction ( #6080 )
2024-07-02 20:12:22 -07:00
9d6a8daa87
[Model] Jamba support ( #4115 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
Co-authored-by: Erez Schwartz <erezs@ai21.com >
Co-authored-by: Mor Zusman <morz@ai21.com >
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com >
Co-authored-by: Tomer Asida <tomera@ai21.com >
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-02 23:11:29 +00:00
ee93f4f92a
[CORE] Quantized lm-head Framework ( #4442 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
Co-authored-by: ZX <zx@lbx.dev >
2024-07-02 22:25:17 +00:00
7c008c51a9
[ Misc ] Refactor MoE to isolate Fp8 From Mixtral ( #5970 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-07-02 21:54:35 +00:00
4d26d806e1
Update conftest.py ( #6076 )
2024-07-02 20:14:22 +00:00
c5832d2ae9
[Core] Pipeline Parallel Support ( #4412 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-07-02 10:58:08 -07:00
15aba081f3
[Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) ( #6050 )
...
Co-authored-by: Sirej Dua <sirej.dua@databricks.com >
Co-authored-by: Sirej Dua <Sirej Dua>
2024-07-02 07:20:29 -07:00
31354e563f
[Doc] Reinstate doc dependencies ( #6061 )
2024-07-02 10:53:16 +00:00
98d6682cd1
[VLM] Remove image_input_type from VLM config ( #5852 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-07-02 07:57:09 +00:00
2c37540aa6
[Frontend] Add template related params to request ( #5709 )
2024-07-01 23:01:57 -07:00
3476ed0809
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) ( #5602 )
2024-07-01 20:10:37 -07:00
54600709b6
[Model] Changes to MLPSpeculator to support tie_weights and input_scale ( #5965 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Joshua Rosenkranz <jmrosenk@us.ibm.com >
2024-07-01 16:40:02 -07:00
e373853e12
[Frontend] Relax api url assertion for openai benchmarking ( #6046 )
2024-07-01 23:39:10 +00:00
c87ebc3ef9
[BugFix] Ensure worker model loop is always stopped at the right time ( #5987 )
2024-07-01 16:17:58 -07:00
c4059ea54f
[Bugfix] Add explicit end_forward calls to flashinfer ( #6044 )
2024-07-01 23:08:58 +00:00
8e0817c262
[Bugfix][Doc] Fix Doc Formatting ( #6048 )
2024-07-01 15:09:11 -07:00
83bdcb6ac3
add FAQ doc under 'serving' ( #5946 )
2024-07-01 14:11:36 -07:00
12a59959ed
[Bugfix] adding chunking mechanism to fused_moe to handle large inputs ( #6029 )
2024-07-01 21:08:29 +00:00
dec6fc6f3b
[Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool ( #6039 )
2024-07-01 20:12:40 +00:00
8893130b63
[doc][misc] further lower visibility of simple api server ( #6041 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-07-01 10:50:56 -07:00
bb60326836
[Misc] update benchmark backend for scalellm ( #6018 )
2024-07-01 10:20:33 -07:00
4050d646e5
[doc][misc] remove deprecated api server in doc ( #6037 )
2024-07-01 12:52:43 -04:00
d76084c12f
[ CI ] Re-enable Large Model LM Eval ( #6031 )
2024-07-01 12:40:45 -04:00
80ca1e6a3a
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker ( #5348 )
2024-07-01 00:33:05 -07:00
614aa51203
[misc][cuda] use nvml to avoid accidentally cuda initialization ( #6007 )
2024-06-30 20:07:34 -07:00
af9ad46fca
[ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) ( #5940 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-30 23:06:27 +00:00
7836fdcc11
[Misc] Fix get_min_capability ( #5971 )
2024-06-30 20:15:16 +00:00
deacb7ec44
[ CI ] Temporarily Disable Large LM-Eval Tests ( #6005 )
...
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic>
2024-06-30 11:56:56 -07:00
f5e73c9f1b
[Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. ( #5909 )
...
Co-authored-by: sang <sangcho@anyscale.com >
2024-06-30 17:11:15 +00:00
c6c240aa0a
[Frontend]: Support base64 embedding ( #5935 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-06-30 23:53:00 +08:00
2be6955a3f
[ci][distributed] fix device count call
...
[ci][distributed] fix some cuda init that makes it necessary to use spawn (#5991 )
2024-06-30 08:06:13 +00:00
9d47f64eb6
[CI/Build] [3/3] Reorganize entrypoints tests ( #5966 )
2024-06-30 12:58:49 +08:00
cff6a1fec1
[CI/Build] Reuse code for checking output consistency ( #5988 )
2024-06-30 11:44:25 +08:00
bcc6a09b63
[CI/Build] Temporarily Remove Phi3-Vision from TP Test ( #5989 )
2024-06-30 09:18:31 +08:00
9def10664e
[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests ( #5949 )
2024-06-29 12:47:58 -07:00
75aa1442db
[ CI/Build ] LM Eval Harness Based CI Testing ( #5838 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-29 13:04:30 -04:00
99397da534
[CI/Build] Add TP test for vision models ( #5892 )
2024-06-29 15:45:54 +00:00
8dbfcd35bf
[ CI/Build ] Added E2E Test For Compressed Tensors ( #5839 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-29 21:12:58 +08:00
f7dac83d95
[Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k ( #5939 )
2024-06-29 21:04:20 +08:00
7c01f70641
[Core] Optimize SequenceStatus.is_finished by switching to IntEnum ( #5974 )
2024-06-29 12:47:53 +00:00
51e971d39e
[Bugfix] Support eos_token_id from config.json ( #5954 )
2024-06-29 11:19:02 +00:00
329df38f1a
[Misc] Update Phi-3-Vision Example ( #5981 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com >
2024-06-29 14:34:29 +08:00
580353da93
[Bugfix] Fix precisions in Gemma 1 ( #5913 )
2024-06-29 03:10:21 +00:00
ba4994443a
[Kernel] Add punica dimensions for Granite 3b and 8b ( #5930 )
...
Signed-off-by: Joe Runde <joe@joerun.de >
2024-06-29 10:48:25 +08:00
906a19cdb0
[Misc] Extend vLLM Metrics logging API ( #5925 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-06-29 10:36:06 +08:00
c4bca740e8
[Bugfix] fix missing last itl in openai completions benchmark ( #5926 )
2024-06-29 10:34:42 +08:00
7f83f40dee
[Bugfix][TPU] Fix pad slot id ( #5977 )
2024-06-28 18:55:17 -07:00
54814fd85b
[Bugfix][TPU] Fix TPU sampler output ( #5978 )
2024-06-28 18:14:16 -07:00
7041de4384
[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode ( #4628 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com >, bong-furiosa <bongwon.jang@furiosa.ai >
2024-06-28 15:28:49 -07:00
6a62cb82cc
[Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError ( #5963 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 17:46:30 -04:00
5d2a1a9cf0
Unmark more files as executable ( #5962 )
2024-06-28 17:34:56 -04:00
4bf35ed9ae
[Bugfix] Only add Attention.kv_scale if kv cache quantization is enabled ( #5936 )
2024-06-28 21:12:40 +00:00
be0b3af9e0
Support Deepseek-V2 ( #4650 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com >
2024-06-28 13:24:57 -07:00
2cd402e169
[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 ( #5921 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 18:43:49 +00:00
b185230744
[ Misc ] Remove fp8_shard_indexer from Col/Row Parallel Linear (Simplify Weight Loading) ( #5928 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 13:49:57 -04:00
6a2d659d28
[Bugfix] Fix compute datatype for cutlass 3.x epilogues ( #5931 )
2024-06-28 17:10:34 +00:00
b2c620230a
[Spec Decode] Introduce DraftModelRunner ( #5799 )
2024-06-28 09:17:51 -07:00
b90d8cd832
[Distributed] Make it clear that % should not be in tensor dict keys. ( #5927 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
2024-06-28 15:20:22 +00:00
3b752a6555
[CI/Build] [2/3] Reorganize entrypoints tests ( #5904 )
2024-06-28 07:59:18 -07:00
ec1ad0046c
[Bugfix] Better error message for MLPSpeculator when num_speculative_tokens is set too high ( #5894 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-28 07:42:17 -07:00
57f09a419c
[Hardware][Intel] OpenVINO vLLM backend ( #5379 )
2024-06-28 13:50:16 +00:00
5932634409
Unmark fused_moe config json file as executable ( #5960 )
2024-06-28 06:36:12 -07:00
5cbe8d155c
[Core] Registry for processing model inputs ( #5214 )
...
Co-authored-by: ywang96 <ywang@roblox.com >
2024-06-28 12:09:56 +00:00
0d0e3a42ac
[Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner ( #5956 )
2024-06-28 12:03:41 +00:00
74d55c065b
[VLM][BugFix] Make sure that multi_modal_kwargs can broadcast properly with ring buffer. ( #5905 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-28 07:29:13 +00:00
f136da15e1
[Hardware][TPU] Optimize KV cache swapping ( #5878 )
2024-06-27 21:12:13 -07:00
c3dde367f1
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X ( #5932 )
2024-06-27 13:41:08 -07:00
64e8d2a783
[core][misc] remove logical block ( #5882 )
2024-06-27 13:34:55 -07:00
79c92c7c8a
[Model] Add Gemma 2 ( #5908 )
2024-06-27 13:33:56 -07:00
736ed38849
[CI/Build] Fix Args for _get_logits_warper in Sampler Test ( #5922 )
2024-06-27 11:43:04 -07:00
365791ff81
[BugFix] Fix min_tokens behaviour for multiple eos tokens ( #5849 )
2024-06-27 11:31:11 -07:00
691e29ecf3
[BugFix] Fix MLPSpeculator handling of num_speculative_tokens ( #5876 )
2024-06-27 10:59:33 -07:00
3fd02bda51
[doc][misc] add note for Kubernetes users ( #5916 )
2024-06-27 10:07:07 -07:00
98cf2ed678
[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision ( #5896 )
2024-06-27 09:08:10 -07:00
e9d32d077d
[CI/Build] [1/3] Reorganize entrypoints tests ( #5526 )
2024-06-27 12:43:17 +00:00
2061f0b8a7
[Bugfix] Fix img_sizes Parsing in Phi3-Vision ( #5888 )
2024-06-27 08:29:24 +00:00
96354d6a29
[Model] Add base class for LoRA-supported models ( #5018 )
2024-06-27 16:03:04 +08:00
d12af207d2
[VLM][Bugfix] Make sure that multi_modal_kwargs is broadcasted properly ( #5880 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com >
2024-06-27 15:15:24 +08:00
6eabc6cb0e
[Doc] Add note about context length in Phi-3-Vision example ( #5887 )
2024-06-26 23:20:01 -07:00
2110557dab
[BugFix] Fix cuda graph for MLPSpeculator ( #5875 )
...
Co-authored-by: Abhinav Goyal <abhinav.goyal@flipkart.com >
2024-06-27 04:12:10 +00:00
b9e84259e9
[Misc] Add example for LLaVA-NeXT ( #5879 )
2024-06-26 17:57:16 -07:00
294104c3f9
[doc] update usage of env var to avoid conflict ( #5873 )
2024-06-26 17:57:12 -04:00
38a1674abb
Support CPU inference with VSX PowerPC ISA ( #5652 )
2024-06-26 21:53:04 +00:00
f5c8628fdc
[Bugfix][TPU] Fix CPU cache allocation ( #5869 )
2024-06-26 13:42:40 -07:00
cbc53b6b8d
[Hardware][TPU] Support parallel sampling & Swapping ( #5855 )
2024-06-26 11:07:49 -07:00
c54269d967
[Frontend] Add tokenize/detokenize endpoints ( #5054 )
2024-06-26 16:54:22 +00:00
5bfd1bbc98
[Kernel] Adding bias epilogue support for cutlass_scaled_mm ( #5560 )
...
Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com >
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com >
2024-06-26 15:16:00 +00:00
6984c02a27
[CI/Build] Refactor image test assets ( #5821 )
2024-06-26 01:02:34 -07:00
3439c5a8e3
[Bugfix][TPU] Fix KV cache size calculation ( #5860 )
2024-06-26 00:58:23 -07:00
6806998bf9
[Bugfix] Fix embedding to support 2D inputs ( #5829 )
2024-06-26 00:15:22 -07:00
515080ad2f
[bugfix][distributed] fix shm broadcast when the queue size is full ( #5801 )
2024-06-25 21:56:02 -07:00
3aa7b6cf66
[Misc][Doc] Add Example of using OpenAI Server with VLM ( #5832 )
2024-06-25 20:34:25 -07:00
dda4811591
[Core] Refactor Worker and ModelRunner to consolidate control plane communication ( #5408 )
...
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu >
Signed-off-by: Stephanie <swang@anyscale.com >
Co-authored-by: Stephanie <swang@anyscale.com >
2024-06-25 20:30:03 -07:00
82079729cc
[Bugfix] Fix assertion in NeuronExecutor ( #5841 )
2024-06-25 19:52:10 -07:00
c2a8ac75e0
[CI/Build] Add E2E tests for MLPSpeculator ( #5791 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-26 00:04:08 +00:00
f178e56c68
[Hardware][TPU] Raise errors for unsupported sampling params ( #5850 )
2024-06-25 16:58:23 -07:00
dd793d1de5
[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes ( #5422 )
2024-06-25 15:56:15 -07:00
bc34937d68
[Hardware][TPU] Refactor TPU backend ( #5831 )
2024-06-25 15:25:52 -07:00
dd248f7675
[Misc] Update w4a16 compressed-tensors support to include w8a16 ( #5794 )
2024-06-25 19:23:35 +00:00
d9b34baedd
[CI/Build] Add unit testing for FlexibleArgumentParser ( #5798 )
2024-06-25 12:18:03 -07:00
c18ebfdd71
[doc][distributed] add both gloo and nccl tests ( #5834 )
2024-06-25 15:10:28 -04:00
67882dbb44
[Core] Add fault tolerance for RayTokenizerGroupPool ( #5748 )
2024-06-25 10:15:10 -07:00
7b99314301
[Misc] Remove useless code in cpu_worker ( #5824 )
2024-06-25 09:41:36 -07:00
2ce5d6688b
[Speculative Decoding] Support draft model on different tensor-parallel size than target model ( #5414 )
2024-06-25 09:56:06 +00:00
f23871e9ee
[Doc] Add notice about breaking changes to VLMs ( #5818 )
2024-06-25 01:25:03 -07:00
e9de9dd551
[ci] Remove aws template ( #5757 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-24 21:09:02 -07:00
ba991d5c84
[Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args ( #5795 )
2024-06-24 17:01:19 -06:00
1744cc99ba
[Doc] Add Phi-3-medium to list of supported models ( #5788 )
2024-06-24 10:48:55 -07:00
e72dc6cb35
[Doc] Add "Suggest edit" button to doc pages ( #5789 )
2024-06-24 10:26:17 -07:00
c246212952
[doc][faq] add warning to download models for every nodes ( #5783 )
2024-06-24 15:37:42 +08:00
edd5fe5fa2
[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement ( #5772 )
2024-06-24 12:11:53 +08:00
5d4d90536f
[Distributed] Add send and recv helpers ( #5719 )
2024-06-23 14:42:28 -07:00
6c916ac8a8
[BugFix] [Kernel] Add Cutlass2x fallback kernels ( #5744 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-06-23 21:07:11 +00:00
832ea88fcb
[core][distributed] improve shared memory broadcast ( #5754 )
2024-06-22 10:00:43 -07:00
8c00f9c15d
[Docs][TPU] Add installation tip for TPU ( #5761 )
2024-06-21 23:09:40 -07:00
0cbc1d2b4f
[Bugfix] Fix pin_lora error in TPU executor ( #5760 )
2024-06-21 22:25:14 -07:00
ff9ddbceee
[Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py ( #5756 )
2024-06-22 03:33:12 +00:00
9c62db07ed
[Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs ( #5710 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-22 02:07:08 +00:00
cf90ae0123
[CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline ( #5616 )
2024-06-21 17:09:34 -07:00
f5dda63eb5
[LoRA] Add support for pinning lora adapters in the LRU cache ( #5603 )
2024-06-21 15:42:46 -07:00
7187507301
[ci][test] fix ca test in main ( #5746 )
2024-06-21 14:04:26 -07:00
f1e72cc19a
[BugFix] exclude version 1.15.0 for modelscope ( #5668 )
2024-06-21 13:15:48 -06:00
5b15bde539
[Doc] Documentation on supported hardware for quantization methods ( #5745 )
2024-06-21 12:44:29 -04:00
bd620b01fb
[Kernel][CPU] Add Quick gelu to CPU ( #5717 )
2024-06-21 06:39:40 +00:00
d9a252bc8e
[Core][Distributed] add shm broadcast ( #5399 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-06-21 05:12:35 +00:00
67005a07bc
[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora ( #5665 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-06-21 04:46:28 +00:00
c35e4a3dd7
[BugFix] Fix test_phi3v.py ( #5725 )
2024-06-21 04:45:34 +00:00
1f5674218f
[Kernel] Add punica dimension for Qwen2 LoRA ( #5441 )
2024-06-20 17:55:41 -07:00
b12518d3cf
[Model] MLPSpeculator speculative decoding support ( #4947 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com >
2024-06-20 20:23:12 -04:00
6c5b7af152
[distributed][misc] use fork by default for mp ( #5669 )
2024-06-20 17:06:34 -07:00
8065a7e220
[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names ( #5718 )
2024-06-20 17:00:13 -06:00
3f3b6b2150
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels ( #5715 )
2024-06-20 18:36:10 +00:00
a7dcc62086
[Kernel] Update Cutlass int8 kernel configs for SM80 ( #5275 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-06-20 13:33:21 +00:00
ad137cd111
[Model] Port over CLIPVisionModel for VLMs ( #5591 )
2024-06-20 11:52:09 +00:00
111af1fa2c
[Kernel] Update Cutlass int8 kernel configs for SM90 ( #5514 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-06-20 06:37:08 +00:00
1b2eaac316
[Bugfix][Doc] Fix Duplicate Explicit Target Name Errors ( #5703 )
2024-06-19 23:10:47 -07:00
3730a1c832
[Misc] Improve conftest ( #5681 )
2024-06-19 19:09:21 -07:00
949e49a685
[ci] Limit num gpus if specified for A100 ( #5694 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-19 16:30:03 -07:00
4a30d7e3cc
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes ( #5650 )
2024-06-19 18:06:44 -04:00
e83db9e7e3
[Doc] Update docker references ( #5614 )
...
Signed-off-by: Rafael Vasquez <rafvasq21@gmail.com >
2024-06-19 15:01:45 -07:00
78687504f7
[Bugfix] AsyncLLMEngine hangs with asyncio.run ( #5654 )
2024-06-19 13:57:12 -07:00
d571ca0108
[ci][distributed] add tests for custom allreduce ( #5689 )
2024-06-19 20:16:04 +00:00
afed90a034
[Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py ( #5688 )
2024-06-19 14:41:42 -04:00
3ee5c4bca5
[ci] Add A100 queue into AWS CI template ( #5648 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-19 08:42:13 -06:00
e9c2732b97
[CI/Build] Add tqdm to dependencies ( #5680 )
2024-06-19 08:37:33 -06:00
d8714530d1
[Misc]Add param max-model-len in benchmark_latency.py ( #5629 )
2024-06-19 18:19:08 +08:00
7d46c8d378
[Bugfix] Fix sampling_params passed incorrectly in Phi3v example ( #5684 )
2024-06-19 17:58:32 +08:00
da971ec7a5
[Model] Add FP8 kv cache for Qwen2 ( #5656 )
2024-06-19 09:38:26 +00:00
3eea74889f
[misc][distributed] use 127.0.0.1 for single-node ( #5619 )
2024-06-19 08:05:00 +00:00
f758aed0e8
[Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices ( #5641 )
2024-06-18 23:21:29 -07:00
e5150f2c28
[Bugfix] Added test for sampling repetition penalty bug. ( #5659 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-19 06:03:55 +00:00
59a1eb59c9
[Bugfix] Fix Phi-3 Long RoPE scaling implementation ( #5628 )
2024-06-19 01:46:38 +00:00
6820724e51
[Bugfix] Fix w8a8 benchmarks for int8 case ( #5643 )
2024-06-19 00:33:25 +00:00
b23ce92032
[Bugfix] Fix CUDA version check for mma warning suppression ( #5642 )
2024-06-18 23:48:49 +00:00
2bd231a7b7
[Doc] Added cerebrium as Integration option ( #5553 )
2024-06-18 15:56:59 -07:00
8a173382c8
[Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties ( #5639 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-18 14:18:37 -07:00
07feecde1a
[Model] LoRA support added for command-r ( #5178 )
2024-06-18 11:01:21 -07:00
19091efc44
[ci] Setup Release pipeline and build release wheels with cache ( #5610 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-18 11:00:36 -07:00
95db455e7f
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization ( #5542 )
2024-06-18 12:45:05 -04:00
7879f24dcc
[Misc] Add OpenTelemetry support ( #4687 )
...
This PR adds basic support for OpenTelemetry distributed tracing.
It includes changes to enable tracing functionality and improve monitoring capabilities.
I've also added a Markdown document with screenshots to guide users on how to use this feature.
2024-06-19 01:17:03 +09:00
13db4369d9
[ci] Deprecate original CI template ( #5624 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-18 14:26:20 +00:00
4ad7b53e59
[CI/Build][Misc] Update Pytest Marker for VLMs ( #5623 )
2024-06-18 13:10:04 +00:00
f0cc0e68e3
[Misc] Remove import from transformers logging ( #5625 )
2024-06-18 12:12:19 +00:00
db5ec52ad7
[bugfix][distributed] improve p2p capability test ( #5612 )
...
[bugfix][distributed] do not error if two processes do not agree on p2p capability (#5612 )
2024-06-18 07:21:05 +00:00
114d7270ff
[CI] Avoid naming different metrics with the same name in performance benchmark ( #5615 )
2024-06-17 21:37:18 -07:00
32c86e494a
[Misc] Fix typo ( #5618 )
2024-06-17 20:58:30 -07:00
8eadcf0b90
[misc][typo] fix typo ( #5620 )
2024-06-17 20:54:57 -07:00
5002175e80
[Kernel] Add punica dimensions for Granite 13b ( #5559 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com >
2024-06-18 03:54:11 +00:00
daef218b55
[Model] Initialize Phi-3-vision support ( #4986 )
2024-06-17 19:34:33 -07:00
fa9e385229
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier ( #5131 )
2024-06-17 21:29:09 -05:00
26e1188e51
[Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py ( #5606 )
2024-06-17 23:16:10 +00:00
a3e8a05d4c
[Bugfix] Fix KV head calculation for MPT models when using GQA ( #5142 )
2024-06-17 15:26:41 -07:00
e441bad674
[Optimization] use a pool to reuse LogicalTokenBlock.token_ids ( #5584 )
2024-06-17 22:08:05 +00:00
1b44aaf4e3
[bugfix][distributed] fix 16 gpus local rank arrangement ( #5604 )
2024-06-17 21:35:04 +00:00
9e4e6fe207
[CI] the readability of benchmarking and prepare for dashboard ( #5571 )
...
[CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard (#5571 )
2024-06-17 11:41:08 -07:00
ab66536dbf
[CI/BUILD] Support non-AVX512 vLLM building and testing ( #5574 )
2024-06-17 14:36:10 -04:00
728c4c8a06
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend ( #3814 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com >
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com >
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com >
2024-06-17 11:01:25 -07:00
1f12122b17
[Misc] use AutoTokenizer for benchmark serving when vLLM not installed ( #5588 )
2024-06-17 09:40:35 -07:00
890d8d960b
[Kernel] compressed-tensors marlin 24 support ( #5435 )
2024-06-17 12:32:48 -04:00
9e74d9d003
Correct alignment in the seq_len diagram. ( #5592 )
...
Co-authored-by: Liqian Chen <liqian.chen@deeplang.ai >
2024-06-17 12:05:33 -04:00
9333fb8eb9
[Model] Rename Phi3 rope scaling type ( #5595 )
2024-06-17 12:04:14 -04:00
e2b85cf86a
Fix w8a8 benchmark and add Llama-3-8B ( #5562 )
2024-06-17 06:48:06 +00:00
845a3f26f9
[Doc] add debugging tips for crash and multi-node debugging ( #5581 )
2024-06-17 10:08:01 +08:00
f07d513320
[build][misc] limit numpy version ( #5582 )
2024-06-16 16:07:01 -07:00
4a6769053a
[CI][BugFix] Flip is_quant_method_supported condition ( #5577 )
2024-06-16 14:07:34 +00:00
f31c1f90e3
Add basic correctness 2 GPU tests to 4 GPU pipeline ( #5518 )
2024-06-16 07:48:02 +00:00
3ce2c050dd
[Fix] Correct OpenAI batch response format ( #5554 )
2024-06-15 16:57:54 -07:00
1c0afa13c5
[BugFix] Don't start a Ray cluster when not using Ray ( #5570 )
2024-06-15 16:30:51 -07:00
d919ecc771
add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 ( #5145 )
2024-06-15 13:38:16 -04:00
e691918e3b
[misc] Do not allow to use lora with chunked prefill. ( #5538 )
...
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-06-15 14:59:36 +00:00
81fbb3655f
[CI/Build] Test both text and token IDs in batched OpenAI Completions API ( #5568 )
2024-06-15 07:29:42 -04:00
0e9164b40a
[mypy] Enable type checking for test directory ( #5017 )
2024-06-15 04:45:31 +00:00
1b8a0d71cf
[Core][Bugfix]: fix prefix caching for blockv2 ( #5364 )
...
Signed-off-by: Lei Wen <wenlei03@qiyi.com >
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
2024-06-14 17:23:56 -07:00
bd7efe95d0
Add ccache to amd ( #5555 )
2024-06-14 17:18:22 -07:00
f5bb85b435
[Core][Distributed] improve p2p cache generation ( #5528 )
2024-06-14 14:47:45 -07:00
28c145eb57
[Bugfix] Fix typo in Pallas backend ( #5558 )
2024-06-14 14:40:09 -07:00
e2afb03c92
[Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models ( #5460 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com >
2024-06-14 20:28:11 +00:00
6e2527a7cb
[Doc] Update documentation on Tensorizer ( #5471 )
2024-06-14 11:27:57 -07:00
cdab68dcdb
[Docs] Add ZhenFund as a Sponsor ( #5548 )
2024-06-14 11:17:21 -07:00
d1c3d7d139
[misc][distributed] fix benign error in is_in_the_same_node ( #5512 )
2024-06-14 10:59:28 -07:00
77490c6f2f
[Core] Remove duplicate processing in async engine ( #5525 )
2024-06-14 10:04:42 -07:00
48f589e18b
[mis] fix flaky test of test_cuda_device_count_stateless ( #5546 )
2024-06-14 10:02:23 -07:00
348616ac4b
[Kernel] Suppress mma.sp warning on CUDA 12.5 and later ( #5401 )
2024-06-14 10:02:00 -07:00
15985680e2
[ Misc ] Rs/compressed tensors cleanup ( #5432 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com >
2024-06-14 10:01:46 -07:00
d74674bbd9
[Misc] Fix arg names ( #5524 )
2024-06-14 09:47:44 -07:00
703475f6c2
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue ( #5516 )
2024-06-14 09:30:15 -07:00
d47af2bc02
[CI/Build] Disable LLaVA-NeXT CPU test ( #5529 )
2024-06-14 09:27:30 -07:00
319ad7f1d3
[CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with perf-benchmarks label ( #5073 )
...
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-06-13 22:36:20 -07:00
0f0d8bc065
bump version to v0.5.0.post1 ( #5522 )
2024-06-13 19:42:06 -07:00
55d6361b13
[Misc] Fix arg names in quantizer script ( #5507 )
2024-06-13 19:02:53 -07:00
cd9c0d65d9
[Hardware][Intel] Support CPU inference with AVX2 ISA ( #5452 )
2024-06-13 17:22:24 -06:00
50eed24d25
Add cuda_device_count_stateless ( #5473 )
2024-06-13 16:06:49 -07:00
e38042d4af
[Kernel] Disable CUTLASS kernels for fp8 ( #5505 )
2024-06-13 13:38:05 -07:00
33e3b37242
[CI/Build] Disable test_fp8.py ( #5508 )
2024-06-13 13:37:48 -07:00
1696efe6c9
[misc] fix format.sh ( #5511 )
2024-06-13 12:09:16 -07:00
6b0511a57b
Revert "[Core] Remove unnecessary copies in flash attn backend" ( #5478 )
2024-06-13 11:22:50 -07:00
a8fda4f661
Separate dev requirements into lint and test ( #5474 )
2024-06-13 11:22:41 -07:00
30299a41fa
[MISC] Remove FP8 warning ( #5472 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com >
2024-06-13 11:22:30 -07:00
85657b5607
[Kernel] Factor out epilogues from cutlass kernels ( #5391 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
Co-authored-by: zifeitong <zifei.tong@parasail.io >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-06-13 11:22:19 -07:00
0ce7b952f8
[Doc] Update LLaVA docs ( #5437 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-13 11:22:07 -07:00
39873476f8
[CI/Build] Simplify OpenAI server setup in tests ( #5100 )
2024-06-13 11:21:53 -07:00
03dccc886e
[Misc] Add vLLM version getter to utils ( #5098 )
2024-06-13 11:21:39 -07:00
a65634d3ae
[Docs] Add 4th meetup slides ( #5509 )
2024-06-13 10:18:26 -07:00
80aa7e91fc
[Hardware][Intel] Optimize CPU backend and add more performance tips ( #4971 )
...
Co-authored-by: Jianan Gu <jianan.gu@intel.com >
2024-06-13 09:33:14 -07:00
bd43973522
[Kernel] Tune Qwen2MoE kernel configurations with tp2,4 ( #5497 )
...
Tune Qwen2-57B-A14B configs based on #4921
Throughput Performance
command: python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2
A100 GPU
benchmark | no config | w/ PR
tp=2 | 10.53 requests/s, 11058.17 tokens/s | 12.47 requests/s, 13088.57 tokens/s
tp=4 | 17.77 requests/s, 18662.95 tokens/s | 20.20 requests/s, 21212.32 tokens/s
2024-06-13 09:01:10 -07:00
23ec72fa03
[CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations ( #5466 )
2024-06-13 15:18:08 +00:00
c2637a613b
[Kernel] w4a16 support for compressed-tensors ( #5385 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-06-13 10:19:56 -04:00
88407532e7
[Bugfix]if the content is started with ":"(response of ping), client should i… ( #5303 )
...
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-12 20:16:41 -07:00
916d219d62
[ci] Use sccache to build images ( #5419 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-12 17:58:12 -07:00
ea3890a5f0
[Core][Distributed] code deduplication in tp&pp with coordinator( #5293 )
...
[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293 )
2024-06-12 17:27:08 -07:00
2135cacb45
[Bugfix] Fix wrong multi_modal_input format for CPU runner ( #5451 )
2024-06-12 16:20:18 -07:00
7d19de2e9c
[Frontend] Add "input speed" to tqdm postfix alongside output speed ( #5425 )
2024-06-12 18:42:12 -04:00
94a07bbdd8
[Bugfix] Fix typo in scheduler.py (requeset -> request) ( #5470 )
2024-06-12 21:59:44 +00:00
b8d4dfff9c
[Doc] Update debug docs ( #5438 )
2024-06-12 14:49:31 -07:00
622d45128c
[misc] add hint for AttributeError ( #5462 )
2024-06-12 21:46:35 +00:00
51602eefd3
[Frontend] [Core] Support for sharded tensorized models ( #4990 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Sanger Steel <sangersteel@gmail.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-12 14:13:52 -07:00
5cc50a531f
[Bugfix] TYPE_CHECKING for MultiModalData ( #5444 )
2024-06-12 14:08:52 -07:00
5985e3427d
[Kernel] Vectorized FP8 quantize kernel ( #5396 )
...
Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. Microbenchmarks show that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large).
In detail, we applied three optimizations:
- Use an inverted scale so that most divisions become multiplications.
- Unroll the loop 4 times to improve ILP.
- Use 4-element vectorized accesses to transfer data between HBM and SRAM.
2024-06-12 14:07:26 -07:00
8b82a89997
[ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests ( #5464 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-12 14:00:18 -07:00
c3c2903e72
[Bugfix] Add device assertion to TorchSDPA ( #5402 )
2024-06-12 12:58:53 -07:00
1a8bfd92d5
[Hardware] Initial TPU integration ( #5292 )
2024-06-12 11:53:03 -07:00
847cdcca1c
[CI] Upgrade codespell version. ( #5381 )
2024-06-12 10:06:14 -07:00
e3c12bf6d2
Revert "[CI/Build] Add is_quant_method_supported to control quantization test configurations" ( #5463 )
2024-06-12 10:03:24 -07:00
3dd6853bc8
[CI/Build] Add is_quant_method_supported to control quantization test configurations ( #5253 )
2024-06-12 09:58:02 -07:00
8f89d72090
[Doc] add common case for long waiting time ( #5430 )
2024-06-11 11:12:13 -07:00
99dac099ab
[Core][Doc] Default to multiprocessing for single-node distributed case ( #5230 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-06-11 11:10:41 -07:00
c4bd03c7c5
[Core][Distributed] add same-node detection ( #5369 )
2024-06-11 10:53:59 -07:00
dcbf4286af
[Frontend] Customizable RoPE theta ( #5197 )
2024-06-11 10:42:26 -07:00
00e6a2dc53
[Bugfix] fix lora_dtype value type in arg_utils.py ( #5398 )
2024-06-11 10:40:23 -07:00
2e02311a1b
[Bugfix] Fix MultiprocessingGPUExecutor.check_health when world_size == 1 ( #5254 )
2024-06-11 10:38:07 -07:00
89ec06c33b
[Docs] [Spec decode] Fix docs error in code example ( #5427 )
2024-06-11 10:31:56 -07:00
9fde251bf0
[Doc] Add an automatic prefix caching section in vllm documentation ( #5324 )
...
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-06-11 10:24:59 -07:00
4c2ffb28ff
[Speculative decoding] Initial spec decode docs ( #5400 )
2024-06-11 10:15:40 -07:00
246598a6b1
[CI] docfix ( #5410 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
Co-authored-by: ywang96 <ywang@roblox.com >
2024-06-11 01:28:50 -07:00
8bab4959be
[Misc] Remove VLLM_BUILD_WITH_NEURON env variable ( #5389 )
2024-06-11 00:37:56 -07:00
3c4cebf751
[Doc][Typo] Fixing Missing Comma ( #5403 )
2024-06-11 00:20:28 -07:00
d8f31f2f8b
[Doc] add debugging tips ( #5409 )
2024-06-10 23:21:43 -07:00
640052b069
[Bugfix][Frontend] Cleanup "fix chat logprobs" ( #5026 )
2024-06-10 22:36:46 -07:00
351d5e7b82
[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs ( #5312 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-06-11 10:30:31 +08:00
a008629807
[Misc] Various simplifications and typing fixes ( #5368 )
2024-06-11 10:29:02 +08:00
76477a93b7
[ci] Fix Buildkite agent path ( #5392 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-10 18:58:07 -07:00
77c87beb06
[Doc] Add documentation for FP8 W8A8 ( #5388 )
2024-06-10 18:55:12 -06:00
114332b88e
Bump version to v0.5.0 ( #5384 )
2024-06-10 15:56:06 -07:00
cb77ad836f
[Docs] Alphabetically sort sponsors ( #5386 )
2024-06-10 15:17:19 -05:00
856c990041
[Docs] Add Docs on Limitations of VLM Support ( #5383 )
2024-06-10 09:53:50 -07:00
c5602f0baa
[ci] Mount buildkite agent on Docker container to upload benchmark results ( #5330 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-10 09:22:34 -07:00
f7f9c5f97b
[ci] Use small_cpu_queue for doc build ( #5331 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-10 09:21:11 -07:00
2c0d933594
[Bugfix] Fix LLaVA-NeXT ( #5380 )
2024-06-10 15:38:47 +00:00
774d1035e4
[Feature][Frontend]: Continued stream_options implementation also in CompletionRequest ( #5319 )
2024-06-10 14:22:09 +00:00
6b29d6fe70
[Model] Initial support for LLaVA-NeXT ( #4199 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-06-10 12:47:15 +00:00
0bfa1c4f13
[Misc] Improve error message when LoRA parsing fails ( #5194 )
2024-06-10 19:38:49 +08:00
c81da5f56d
[misc][typo] fix typo ( #5372 )
2024-06-10 09:51:02 +00:00
68bc81703e
[Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server ( #5374 )
2024-06-10 09:13:39 +00:00
5884c2b454
[Misc] Update to comply with the new compressed-tensors config ( #5350 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-06-10 03:49:46 +00:00
45f92c00cf
[Bugfix] Fix KeyError: 1 When Using LoRA adapters ( #5164 )
2024-06-09 16:23:14 -07:00
5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops ( #5047 )
2024-06-09 16:23:30 -04:00
5d7e3d0176
[mis][ci/test] fix flaky test in test_sharded_state_loader.py ( #5361 )
...
[mis][ci/test] fix flaky test in tests/test_sharded_state_loader.py (#5361 )
2024-06-09 03:50:14 +00:00
0373e1837e
[Core][CUDA Graph] add output buffer for cudagraph ( #5074 )
...
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (#5074 )
2024-06-08 19:14:43 -07:00
c09dade2a2
[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale ( #5353 )
2024-06-08 13:54:05 -04:00
8ea5e44a43
[CI/Test] improve robustness of test (vllm_runner) ( #5357 )
...
[CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) (#5357 )
2024-06-08 08:59:20 +00:00
9fb900f90c
[CI/Test] improve robustness of test (hf_runner) ( #5347 )
...
[CI/Test] improve robustness of test by replacing del with context manager (hf_runner) (#5347 )
2024-06-07 22:31:32 -07:00
c96fc06747
[ROCm][AMD] Use pytorch sdpa math backend to do naive attention ( #4965 )
2024-06-07 19:13:12 -07:00
b3376e5c76
[Misc] Add args for selecting distributed executor to benchmarks ( #5335 )
2024-06-08 09:20:16 +08:00
e69ded7d1c
[Bug Fix] Fix the support check for FP8 CUTLASS ( #5352 )
...
Bug description:
With torch 2.4.0.dev20240603+cu121, cutlass_fp8_supported outputs False, and the (capability, version) before the comparison is (90, 11111111112).
This PR fixes the support check for FP8 CUTLASS (cutlass_fp8_supported), which was introduced in https://github.com/vllm-project/vllm/pull/5183 .
2024-06-08 00:42:05 +00:00
767c727a81
fix DbrxFusedNormAttention missing cache_config ( #5340 )
...
Co-authored-by: team <calvinn.ng@ahrefs.com >
2024-06-07 14:10:21 -07:00
6840a71610
[Misc] Remove unused cuda_utils.h in CPU backend ( #5345 )
2024-06-07 14:09:13 -07:00
7a9cb294ae
[Frontend] Add OpenAI Vision API Support ( #5237 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk >
2024-06-07 11:23:32 -07:00
ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization ( #5037 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-06-07 09:36:26 -07:00
dc49fb892c
Addition of lacked ignored_seq_groups in _schedule_chunked_prefill ( #5296 )
2024-06-07 13:35:42 +00:00
18a277b52d
Remove Ray health check ( #4693 )
2024-06-07 10:01:56 +00:00
8d75fe48ca
[Kernel] Switch fp8 layers to use the CUTLASS kernels ( #5183 )
...
Switch from torch._scaled_mm to vLLM's CUTLASS FP8 kernels when supported, as we are seeing a 5-15% improvement in e2e performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8.
See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
2024-06-07 08:42:35 +00:00
388596c914
[Misc][Utils] allow get_open_port to be called for multiple times ( #5333 )
2024-06-06 22:15:11 -07:00
baa15a9ec3
[Feature][Frontend]: Add support for stream_options in ChatCompletionRequest ( #5135 )
2024-06-07 03:29:24 +00:00
15063741e3
[Misc] Missing error message for custom ops import ( #5282 )
2024-06-06 20:17:21 -07:00
ccdc490dda
[Core] Change LoRA embedding sharding to support loading methods ( #5038 )
2024-06-06 19:07:57 -07:00
a31cab7556
[Core] Avoid copying prompt/output tokens if no penalties are used ( #5289 )
2024-06-06 18:12:00 -07:00
828da0d44e
[Frontend] enable passing multiple LoRA adapters at once to generate() ( #5300 )
2024-06-06 15:48:13 -05:00
abe855d637
[Kernel] Retune Mixtral 8x22b configs for FP8 on H100 ( #5294 )
2024-06-06 09:29:29 -07:00
4efff036f0
Bugfix: fix broken of download models from modelscope ( #5233 )
...
Co-authored-by: mulin.lyh <mulin.lyh@taobao.com >
2024-06-06 09:28:10 -07:00
89c920785f
[CI/Build] Update vision tests ( #5307 )
2024-06-06 05:17:18 -05:00
7b0a0dfb22
[Frontend][Core] Update Outlines Integration from FSM to Guide ( #4109 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Breno Faria <breno.faria@intrafind.com >
2024-06-05 16:49:12 -07:00
3a6ae1d33c
[CI] Disable flash_attn backend for spec decode ( #5286 )
2024-06-05 15:49:27 -07:00
8f1729b829
[Docs] Add Ray Summit CFP ( #5295 )
2024-06-05 15:25:18 -07:00
6a7c7711a2
[Misc] Skip for logits_scale == 1.0 ( #5291 )
2024-06-05 15:19:02 -07:00
0f83ddd4d7
[Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine gracefully shuts down. ( #5290 )
2024-06-05 15:18:12 -07:00
065aff6c16
[Bugfix] Make EngineArgs use named arguments for config construction ( #5285 )
2024-06-05 15:16:56 -07:00
3d33e372a1
[BugFix] Fix log message about default max model length ( #5284 )
2024-06-05 14:53:16 -07:00
faf71bcd4b
[Speculative Decoding] Add ProposerWorkerBase abstract class ( #5252 )
2024-06-05 14:53:05 -07:00
f270a39537
[Docs] Add Sequoia as sponsors ( #5287 )
2024-06-05 18:02:56 +00:00
51a08e7d8f
[Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 ( #5238 )
2024-06-05 10:59:14 -07:00
eb8fcd2666
[BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM ( #5207 )
...
Co-authored-by: qiujiawei9 <qiujiawei9@jd.com >
2024-06-05 10:59:02 -07:00
5563a4dea8
[Model] Correct Mixtral FP8 checkpoint loading ( #5231 )
2024-06-05 10:58:50 -07:00
ccd4f129e8
[Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size ( #5157 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-06-05 10:44:15 -07:00
02cc3b51a7
[misc] benchmark_serving.py -- add ITL results and tweak TPOT results ( #5263 )
2024-06-05 10:17:51 -07:00
d5b1eb081e
[CI] Add nightly benchmarks ( #5260 )
2024-06-05 09:42:08 -07:00
f0a500545f
[Frontend] OpenAI API server: Add add_special_tokens to ChatCompletionRequest (default False) ( #5278 )
2024-06-05 09:32:58 -07:00
c65146e75e
[Misc] Fix docstring of get_attn_backend ( #5271 )
2024-06-05 09:18:59 -07:00
41ca62cf03
[Misc] Add CustomOp interface for device portability ( #5255 )
2024-06-05 09:18:19 -07:00
974fc9b845
[Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True ( #5226 )
2024-06-04 19:37:28 -07:00
fee4dcc33a
[Misc] update collect env ( #5261 )
2024-06-04 17:29:09 -05:00
650a4cc55e
[Misc] Add transformers version to collect_env.py ( #5259 )
2024-06-04 12:52:28 -07:00
9ca62d8668
[CI] mark AMD test as softfail to prevent blockage ( #5256 )
2024-06-04 11:34:53 -07:00
45c35f0d58
[CI/Build] Reducing CPU CI execution time ( #5241 )
2024-06-04 10:26:40 -07:00
9ba093b4f4
[CI/Build] Simplify model loading for HfRunner ( #5251 )
2024-06-04 10:09:19 -07:00
27208be66e
[Kernel] Add back batch size 1536 and 3072 to MoE tuning ( #5242 )
2024-06-04 09:58:47 -07:00
87d5abef75
[Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend ( #5249 )
2024-06-04 09:57:51 -07:00
ec784b2526
[CI/Build] Add inputs tests ( #5215 )
2024-06-03 21:01:46 -07:00
a58f24e590
[Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor ( #5229 )
2024-06-03 20:55:50 -07:00
f42a006b15
[Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend ( #5210 )
2024-06-03 20:32:57 -07:00
3a434b07ed
[Kernel] Enhance MoE benchmarking & tuning script ( #4921 )
2024-06-03 20:06:59 -07:00
bd0e7802e0
[Bugfix] Add warmup for prefix caching example ( #5235 )
2024-06-03 19:36:41 -07:00
06b2550cbb
[Bugfix] Support prompt_logprobs==0 ( #5217 )
2024-06-03 17:59:30 -07:00
f775a07e30
[FRONTEND] OpenAI tools support named functions ( #5032 )
2024-06-03 18:25:29 -05:00
4f0d17c05c
New CI template on AWS stack ( #5110 )
...
Signed-off-by: kevin <kevin@anyscale.com >
2024-06-03 16:16:43 -07:00
10c38e3e46
[Misc]: Implement CPU/GPU swapping in BlockManagerV2 ( #3834 )
2024-06-03 13:37:11 -07:00
cafb8e06c5
[CI/BUILD] enable intel queue for longer CPU tests ( #4113 )
2024-06-03 10:39:50 -07:00
cbb2f59cc8
[Kernel] Pass a device pointer into the quantize kernel for the scales ( #5159 )
2024-06-03 09:52:30 -07:00
0ab278ca31
[Core] Remove unnecessary copies in flash attn backend ( #5138 )
2024-06-03 09:39:31 -07:00
7a64d24aad
[Core] Support image processor ( #4197 )
2024-06-02 22:56:41 -07:00
dfbe60dc62
[Misc] Simplify code and fix type annotations in conftest.py ( #5118 )
2024-06-02 16:05:50 -07:00
a66cf40b20
[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer ( #4927 )
...
This PR enables the fused topk_softmax kernel used in moe layer for HIP
2024-06-02 14:13:26 -07:00
f790ad3c50
[Frontend][OpenAI] Support for returning max_model_len on /v1/models response ( #4643 )
2024-06-02 08:06:13 +00:00
ed59a7ed23
Update test_ignore_eos ( #4898 )
2024-06-02 02:21:53 +00:00
044793d8df
[BugFix] Prevent LLM.encode for non-generation Models ( #5184 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-06-01 23:35:41 +00:00
c2d6d2f960
[Bugfix]: Fix issues related to prefix caching example ( #5177 ) ( #5180 )
2024-06-01 15:53:52 -07:00
8279078e21
[Bugfix] Remove deprecated @abstractproperty ( #5174 )
2024-06-01 22:40:25 +00:00
b9c0605a8e
[Feature][Kernel] Support bitsandbytes quantization and QLoRA ( #4776 )
2024-06-01 14:51:10 -06:00
37464a0f74
[Bugfix] Fix call to init_logger in openai server ( #4765 )
2024-06-01 17:18:50 +00:00
c354072828
[Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py ( #5151 )
...
Signed-off-by: Ye Cao <caoye.cao@alibaba-inc.com >
2024-06-01 17:11:22 +00:00
f081c3ce4b
[Kernel] Update Cutlass fp8 configs ( #5144 )
...
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-06-01 08:46:07 +00:00
260d119e86
[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU ( #5137 )
2024-06-01 06:45:32 +00:00
a360ff80bb
[CI/Build] CMakeLists: build all extensions' cmake targets at the same time ( #5034 )
2024-05-31 22:06:45 -06:00
1197e02141
[Build] Guard against older CUDA versions when building CUTLASS 3.x kernels ( #5168 )
2024-05-31 17:21:38 -07:00
657579113f
[Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support ( #5171 )
2024-05-31 17:20:19 -07:00
e9899fb7a4
[Model] Enable FP8 QKV in MoE and refine kernel tuning script ( #5039 )
2024-05-31 14:29:19 -07:00
a377f0bd5e
[Misc]: optimize eager mode host time ( #4196 )
...
Co-authored-by: xuhao <xuhao@cambricon.com >
2024-05-31 13:14:50 +08:00
e9d3aa04f6
Revert "[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5)" ( #5149 )
2024-05-30 22:00:26 -07:00
a22dea54d3
[Model] Support MAP-NEO model ( #5081 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-05-30 19:24:41 -07:00
533c217792
Fix cutlass sm_90a version in CMakeList
2024-05-31 02:13:01 +00:00
6d21fa1cad
[Kernel] Marlin_24: Ensure the mma.sp instruction is using the ::ordered_metadata modifier (introduced with PTX 8.5) ( #5136 )
2024-05-30 21:02:11 -05:00
b35be5403f
[Bugfix] Avoid Warnings in SparseML Activation Quantization ( #5120 )
2024-05-30 17:04:37 -07:00
45a1a69b98
[Build] Disable sm_90a in cu11 ( #5141 )
2024-05-30 14:37:16 -07:00
87a658c812
Bump version to v0.4.3 ( #5046 )
2024-05-30 11:13:46 -07:00
429d89720e
add doc about serving option on dstack ( #3074 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-05-30 10:11:07 -07:00
a9bcc7afb2
[Doc] Use intersphinx and update entrypoints docs ( #5125 )
2024-05-30 09:59:23 -07:00
d79d9eaaff
[Misc] remove duplicate definition of seq_lens_tensor in model_runner.py ( #5129 )
2024-05-30 06:56:19 -07:00
f758505c73
[CI/Build] increase wheel size limit to 200 MB ( #5130 )
2024-05-30 06:29:48 -07:00
d910816c73
[Bugfix] Automatically Detect SparseML models ( #5119 )
2024-05-30 12:58:37 +00:00
87d41c849d
[BUGFIX] [FRONTEND] Correct chat logprobs ( #5029 )
...
Co-authored-by: Breno Faria <breno.faria@intrafind.com >
2024-05-30 02:52:14 -07:00
e07aff9e52
[CI/Build] Docker cleanup functionality for amd servers ( #5112 )
...
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com >
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com >
Co-authored-by: Alexei V. Ivanov <alexei.ivanov@amd.com >
Co-authored-by: omkarkakarparthi <okakarpa>
2024-05-30 03:27:39 +00:00
5bf185a1c4
[Bugfix] gptq_marlin: Ensure g_idx_sort_indices is not a Parameter ( #5108 )
2024-05-30 00:30:18 +00:00
4fbcb0f27e
[Doc][Build] update after removing vllm-nccl ( #5103 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-05-29 23:51:18 +00:00
7c3604fb68
[Bugfix] logprobs is not compatible with the OpenAI spec #4795 ( #5031 )
2024-05-29 16:13:22 -07:00
b1c255630d
[Core] Avoid the need to pass None values to Sequence.inputs ( #5099 )
2024-05-29 16:05:01 -07:00
eb6c50cdc2
[Bugfix][CI/Build] Fix codespell failing to skip files in git diff ( #5097 )
2024-05-29 16:02:54 -07:00
eecd864388
[Bugfix][CI/Build] Fix test and improve code for merge_async_iterators ( #5096 )
2024-05-29 16:02:25 -07:00
ae495c74ea
[Doc]Replace deprecated flag in readme ( #4526 )
2024-05-29 22:26:33 +00:00
4238bc82f2
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) ( #4837 )
2024-05-29 16:09:13 +00:00
594392d27a
[Core][Distributed] improve p2p access check ( #4992 )
2024-05-29 11:29:07 +00:00
18c1f16d86
[Bugfix] Fix arguments passed to Sequence in stop checker test ( #5092 )
2024-05-29 07:16:41 +00:00
5bd3c65072
[Core][Optimization] remove vllm-nccl ( #5091 )
2024-05-29 05:13:52 +00:00
616e600e0b
[Misc] add gpu_memory_utilization arg ( #5079 )
...
Signed-off-by: pandyamarut <pandyamarut@gmail.com >
2024-05-28 17:16:18 -07:00
dfba529b40
[Bugfix] Remove the last EOS token unless explicitly specified ( #5077 )
2024-05-28 17:15:35 -07:00
5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines ( #4328 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-05-28 13:29:31 -07:00
290f4ada2b
[Docs] Add Dropbox as sponsors ( #5089 )
2024-05-28 10:29:09 -07:00
dd8de11f0a
[Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X ( #4951 )
...
This PR adds Triton kernel configs for the MoE kernel for MI300X
2024-05-28 16:03:23 +00:00
9ba415588a
[BugFix] Fix Embedding Models with TP>1 ( #5075 )
2024-05-28 08:32:42 -07:00
d4f3985907
[Core] Sliding window for block manager v2 ( #4545 )
...
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local >
2024-05-28 11:07:07 +09:00
890aa93d27
[Model] Add support for falcon-11B ( #5069 )
2024-05-27 16:41:43 -07:00
fbdb7b3ee2
[Core] Allow AQLM on Pascal ( #5058 )
2024-05-27 15:26:14 -07:00
1102bef219
[Bugfix / Core] Prefix Caching Guards (merged with main) ( #4846 )
...
Co-authored-by: rsnm2 <rshaw@neuralmagic.com >
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-05-27 15:18:17 -07:00
f17a1a8f96
[Misc] Make Serving Benchmark More User-friendly ( #5044 )
2024-05-25 17:28:16 +00:00
d5a1697772
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding ( #5000 )
2024-05-25 10:00:14 -07:00
325c119961
[Misc] add logging level env var ( #5045 )
2024-05-24 23:49:49 -07:00
8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model ( #4799 )
...
Co-authored-by: beagleski <yunanzhang@microsoft.com >
Co-authored-by: bapatra <bapatra@microsoft.com >
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-05-24 22:00:52 -07:00
e64fde4b01
[Core][Bugfix]: fix prefix caching for blockv2 ( #4764 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
2024-05-24 10:07:09 -07:00
919770957f
[Bugfix] Fix Mistral v0.3 Weight Loading ( #5005 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-05-24 12:28:27 +00:00
6a50f4cafa
[Doc] add ccache guide in doc ( #5012 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-05-23 23:21:54 +00:00
e3470f8753
[Core]: Option To Use Prompt Token Ids Inside Logits Processor ( #4985 )
...
Co-authored-by: Elisei Smirnov <el.smirnov@innopolis.university >
2024-05-23 22:04:24 +00:00
a1242324c9
[Kernel] Initial Activation Quantization Support ( #4525 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com >
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com >
2024-05-23 21:29:18 +00:00
5eda2ea02a
[Core][1/N] Support send/recv in PyNCCL Groups ( #4988 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai >
2024-05-23 09:54:48 -07:00
2ba80bed27
[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined ( #5009 )
2024-05-23 09:08:58 -07:00
6066253296
Marlin 24 prefill performance improvement (about 25% better on average) ( #4983 )
2024-05-23 02:39:27 -04:00
ee3eea0a1b
[Misc] Take user preference in attention selector ( #4960 )
2024-05-23 07:55:56 +09:00
a36de682d4
[Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig ( #4991 )
2024-05-22 22:26:56 +00:00
eb6d3c264d
[Core] Eliminate parallel worker per-step task scheduling overhead ( #4894 )
2024-05-23 06:17:27 +09:00
97b030005c
[Model] LoRA gptbigcode implementation ( #3949 )
2024-05-22 13:58:59 -07:00
a3a73ab069
[Misc] Load FP8 kv-cache scaling factors from checkpoints ( #4893 )
...
The 2nd PR for #4532 .
This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with .kv_scale parameter).
2024-05-22 13:28:20 -07:00
8674f9880e
[Kernel] Fixup for CUTLASS kernels in CUDA graphs ( #4954 )
...
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
2024-05-22 14:10:43 +00:00
c74c913bfb
[misc] remove comments that were supposed to be removed ( #4977 )
2024-05-22 09:02:58 -04:00
5f6d10c14c
[CI/Build] Enforce style for C++ and CUDA code with clang-format ( #4722 )
2024-05-22 07:18:41 +00:00
9b9a10d6cb
[Frontend] Dynamic RoPE scaling ( #4638 )
2024-05-22 01:32:35 -04:00
99eff67ba9
[Bugfix][Kernel] Add head size check for attention backend selection ( #4944 )
2024-05-21 15:33:25 -04:00
14772eeb8e
[Bugfix] Fix flag name for max_seq_len_to_capture ( #4935 )
...
Signed-off-by: kerthcet <kerthcet@gmail.com >
2024-05-21 09:30:52 -07:00
757b62c495
[CI/Build] Codespell ignore build/ directory ( #4945 )
2024-05-21 09:06:10 -07:00
e941f88584
[Docs] Add acknowledgment for sponsors ( #4925 )
2024-05-21 00:17:25 -07:00
f12c3b5b3d
[Model] Add Phi-2 LoRA support ( #4886 )
2024-05-21 14:24:17 +09:00
d130b573a0
[Model] add rope_scaling support for qwen2 ( #4930 )
2024-05-21 05:22:22 +00:00
65ae8c2c8f
[Core] Fix scheduler considering "no LoRA" as "LoRA" ( #4897 )
2024-05-20 17:48:32 -07:00
c3af44722c
[Doc]Add documentation to benchmarking script when running TGI ( #4920 )
2024-05-20 20:16:57 +00:00
1937e29848
[Core] Sharded State Loader download from HF ( #4889 )
2024-05-20 11:46:12 -07:00
f0eecee610
[Bugfix] Fix dummy weight for fp8 ( #4916 )
...
Allow the dummy load format for fp8; torch.uniform_ doesn't support FP8 at the moment.
Co-authored-by: Mor Zusman <morz@ai21.com >
2024-05-20 18:44:25 +00:00
943e72ca56
[Build/CI] Enabling AMD Entrypoints Test ( #4834 )
...
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com >
2024-05-20 11:29:28 -07:00
546a97ef69
[Misc]: allow user to specify port in distributed setting ( #4914 )
2024-05-20 17:45:06 +00:00
da5a0b539d
Remove marlin warning ( #4918 )
2024-05-20 14:55:34 +00:00
6287537a0c
[Model] LLaVA model refactor ( #4910 )
2024-05-20 08:11:25 +00:00
b57e6c5949
[Kernel] Add flash-attn back ( #4907 )
2024-05-19 18:11:30 -07:00
27ce85476e
[Kernel] Add marlin_24 unit tests ( #4901 )
2024-05-19 11:37:34 -04:00
f68470e803
[Bugfix][Model] Add base class for vision-language models ( #4809 )
2024-05-19 00:13:33 -07:00
2e9a2227ec
[Lora] Support long context lora ( #4787 )
...
Currently we need to call the rotary embedding kernel once per LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through.
It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches, one per scaling factor.
Follow up of https://github.com/vllm-project/vllm/pull/3095/files
2024-05-18 16:05:23 +09:00
c0724fc915
[ROCm][Hardware][AMD] Adding Navi21 to fallback to naive attention if Triton is not used ( #4658 )
2024-05-18 05:09:11 +00:00
86b45ae065
[Bugfix] Relax tiktoken to >= 0.6.0 ( #4890 )
2024-05-17 12:58:52 -06:00
c5711ef985
[Doc] Update Ray Data distributed offline inference example ( #4871 )
2024-05-17 10:52:11 -07:00
48d5985a08
Sync huggingface modifications of qwen Moe model ( #4774 )
2024-05-17 09:43:19 -07:00
33e0823de5
[Bugfix] fix rope error when load models with different dtypes ( #4835 )
2024-05-17 18:43:34 +09:00
26148120b3
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests ( #4797 )
2024-05-16 20:58:25 -07:00
0150a10630
[Frontend] OpenAI API server: Do not add bos token by default when encoding ( #4688 )
2024-05-16 18:47:22 -07:00
8e7fb5d43a
Support to serve vLLM on Kubernetes with LWS ( #4829 )
...
Signed-off-by: kerthcet <kerthcet@gmail.com >
2024-05-16 16:37:29 -07:00
9a31a817a8
[Bugfix] Fix FP8 KV cache support ( #4869 )
2024-05-16 22:42:29 +00:00
2060e93659
[Kernel] Add w8a8 CUTLASS kernels ( #4749 )
2024-05-16 18:32:50 -04:00
8435b207af
[Kernel] Add punica dimension for Qwen1.5-32B LoRA ( #4850 )
...
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net >
2024-05-16 11:16:09 -07:00
10fa9eea21
[Misc] remove old comments ( #4866 )
2024-05-16 11:07:41 -07:00
e08188081b
[Core][Distributed] remove graph mode function ( #4818 )
2024-05-16 10:59:52 -07:00
b5853f9963
[ROCm][AMD][Bugfix] adding a missing triton autotune config ( #4845 )
2024-05-16 10:46:52 -07:00
f09edd8a25
Add JSON output support for benchmark_latency and benchmark_throughput ( #4848 )
2024-05-16 10:02:56 -07:00
6979ade384
Add GPTQ Marlin 2:4 sparse structured support ( #4790 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-05-16 12:56:15 -04:00
9216b9cc38
[Bugfix] Bypass authorization API token for preflight requests ( #4862 )
2024-05-16 09:42:21 -07:00
5e0391c040
[Frontend] Separate OpenAI Batch Runner usage from API Server ( #4851 )
2024-05-17 00:42:41 +09:00
dbc0754ddf
[docs] Fix typo in examples filename openi -> openai ( #4864 )
2024-05-17 00:42:17 +09:00
99caa49106
[Kernel] add bfloat16 support for gptq marlin kernel ( #4788 )
2024-05-16 09:55:29 -04:00
5c342570d7
Add marlin unit tests and marlin benchmark script ( #4815 )
2024-05-16 09:36:49 -04:00
973617ae02
[Speculative decoding][Re-take] Enable TP>1 speculative decoding ( #4840 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
Co-authored-by: Cade Daniel <cade@anyscale.com >
2024-05-16 00:53:51 -07:00
30e754390c
[Core] Implement sharded state loader ( #4690 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-05-15 22:11:54 -07:00
52f8107cf2
[Frontend] Support OpenAI batch file format ( #4794 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-05-15 19:13:36 -04:00
fc0d9dfc3a
[Frontend] Re-enable custom roles in Chat Completions API ( #4758 )
2024-05-15 14:58:46 -07:00
361c461a12
[Doc] Highlight the fourth meetup in the README ( #4842 )
2024-05-15 11:38:49 -07:00
a5675d348b
[Bugfix] Properly set distributed_executor_backend in ParallelConfig ( #4816 )
2024-05-15 07:22:09 -07:00
e9cdd2b1e2
[CI/Build] Further decouple HuggingFace implementation from ours during tests ( #4166 )
2024-05-14 23:38:40 -07:00
65bf2ac165
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API ( #4681 )
...
This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attn metadata for prefill/decode into a single class and allows slicing them when running the attn backend.
It also refactors subquery_start_loc, which was not refactored in the previous PR.
2024-05-15 14:00:10 +09:00
8a7cc254a0
Revert "[Kernel] Use flash-attn for decoding ( #3648 )" ( #4820 )
...
The Lora 3 & 4 tests seem to hit an illegal memory access failure after this commit:
[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241
This reverts commit 1356df5.
2024-05-15 11:52:45 +09:00
29bc01bf3b
Add 4th meetup announcement to readme ( #4817 )
2024-05-14 18:33:06 -04:00
676a99982f
[Core] Add MultiprocessingGPUExecutor ( #4539 )
...
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com >
2024-05-14 10:38:59 -07:00
dc72402b57
[Bugfix][Doc] Fix CI failure in docs ( #4804 )
...
This PR fixes the CI failure introduced by #4798 .
The failure originates from having duplicate target names in reST, and is fixed by changing the ref targets to anonymous ones. For more information, see this discussion.
I have also changed the format of the links to be more distinct from each other.
2024-05-15 01:57:08 +09:00
ccb63a8245
[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies ( #4696 )
2024-05-14 21:34:33 +09:00
c579b750a0
[Doc] Add meetups to the doc ( #4798 )
2024-05-13 18:48:00 -07:00
4bfa7e7f75
[Doc] Add API reference for offline inference ( #4710 )
2024-05-13 17:47:42 -07:00
ac1fbf7fd2
[Doc] Shorten README by removing supported model list ( #4796 )
2024-05-13 16:23:54 -07:00
33d3914b1e
[Bugfix] Fix dynamic FP8 quantization for Mixtral ( #4793 )
2024-05-13 19:00:27 -04:00
1356df53bd
[Kernel] Use flash-attn for decoding ( #3648 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com >
2024-05-13 15:50:33 -07:00
ce532ff45c
[Speculative decoding] Improve n-gram efficiency ( #4724 )
2024-05-13 15:00:13 -07:00
8bc68e198c
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update tensorizer to version 2.9.0 ( #4208 )
2024-05-13 14:57:07 -07:00
0fca3cdcf2
[Misc] Enhance attention selector ( #4751 )
2024-05-13 10:47:25 -07:00
e7c46b9527
[Scheduler] Warning upon preemption and Swapping ( #4647 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
2024-05-13 23:50:44 +09:00
350f9e107f
[CI/Build] Move test_utils.py to tests/utils.py ( #4425 )
...
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time)
Also, I have moved the test utilities file (test_utils.py) into the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported relatively.
2024-05-13 23:50:09 +09:00
702bee461f
[Core][Distributed] refactor custom allreduce to support multiple tp groups ( #4754 )
2024-05-12 17:47:59 -07:00
a7be4d0072
[CORE] Improvement in ranks code ( #4718 )
2024-05-12 17:47:47 -07:00
a709e87a4f
[CI/Build] Tweak Marlin Nondeterminism Issues ( #4713 )
2024-05-12 17:46:31 -07:00
6eaccb7353
[Model] Add support for IBM Granite Code models ( #4636 )
2024-05-11 21:27:24 -07:00
e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API ( #3734 )
2024-05-11 11:30:37 -07:00
4e12131089
[Core][Test] fix function name typo in custom allreduce ( #4750 )
2024-05-10 15:14:40 -07:00
fcc2994be6
[CI] Nits for bad initialization of SeqGroup in testing ( #4748 )
2024-05-10 18:01:01 -04:00
2e7796f2cf
[Speculative decoding] CUDA graph support ( #4295 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-05-10 17:36:25 +00:00
706588a77d
[Bugfix] Fix CLI arguments in OpenAI server docs ( #4729 )
2024-05-11 00:00:56 +09:00
6a0f617210
[Core] Fix circular reference which leaked llm instance in local dev env ( #4737 )
...
Storing an exception frame is extremely prone to circular references because the frame holds references to objects.
When tensorizer is not installed, the llm instance leaks because the error frame references various modules, which creates a reference cycle.
I also found that spec decoding has a circular reference issue, and I solved it using weakref.proxy.
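A minimal sketch of the weakref.proxy pattern mentioned above, using only the standard library. The `Engine` and `Worker` classes are hypothetical stand-ins, not vLLM classes, and the immediate-free behaviour shown relies on CPython reference counting.

```python
import gc
import weakref


class Worker:
    """Hypothetical child object that needs to call back into its owner."""

    def __init__(self, owner):
        # weakref.proxy forwards attribute access without owning a reference,
        # so Worker -> Engine does not keep the Engine alive.
        self.owner = weakref.proxy(owner)

    def owner_name(self):
        return self.owner.name


class Engine:
    """Hypothetical owner; holds a strong reference to its worker."""

    def __init__(self, name):
        self.name = name
        self.worker = Worker(self)


def is_freed_without_gc():
    """Return True if dropping the last name frees the Engine immediately."""
    gc.disable()  # rule out the cycle collector; rely on refcounting alone
    try:
        engine = Engine("demo")
        probe = weakref.ref(engine)
        assert engine.worker.owner_name() == "demo"
        del engine
        return probe() is None
    finally:
        gc.enable()
```

Had `Worker` stored `owner` directly, the two objects would form a cycle and the `Engine` would survive until the cycle collector ran.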
2024-05-10 23:54:32 +09:00
dac6a3f6ed
[Misc] Apply a couple g++ cleanups ( #4719 )
2024-05-10 13:37:05 +00:00
64b77dfd7e
[Core]fix type annotation for swap_blocks ( #4726 )
2024-05-10 21:52:48 +09:00
51d4094fda
chunked-prefill-doc-syntax ( #4603 )
...
Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html
Co-authored-by: sang <rkooo567@gmail.com >
2024-05-10 14:13:23 +09:00
e965d46184
[Misc] Keep only one implementation of the create_dummy_prompt function. ( #4716 )
2024-05-09 21:42:38 -07:00
208b71bcc1
[Core][Distributed] refactor pynccl ( #4591 )
...
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591 )
2024-05-09 19:48:43 -07:00
c833101740
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support ( #4535 )
2024-05-09 18:04:17 -06:00
379da6dcb5
[Kernel] [FP8] Improve FP8 linear layer performance ( #4691 )
...
This PR improves the FP8 performance of linear layers, which had been lacking before (see the comments on #4118).
We noticed that cuBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16, so this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.
Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4). All FP8 measurements are for dynamic quantization:
qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
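The padding idea described above can be sketched in plain Python. This is an illustration only, not the PR's kernel code: the helper names `pad_rows` and `matmul_padded` are hypothetical, and the minimum row count of 17 is an assumption chosen simply to satisfy "greater than 16".

```python
MIN_ROWS = 17  # assumption: smallest row count that is "greater than 16"


def pad_rows(matrix, min_rows=MIN_ROWS):
    """Return (padded_matrix, original_row_count); pads with zero rows."""
    n_rows = len(matrix)
    if n_rows >= min_rows:
        return [row[:] for row in matrix], n_rows
    n_cols = len(matrix[0]) if matrix else 0
    padding = [[0.0] * n_cols for _ in range(min_rows - n_rows)]
    return [row[:] for row in matrix] + padding, n_rows


def matmul(a, b):
    """Naive row-major matrix product, standing in for the real GEMM."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matmul_padded(a, b):
    """Pad the first dimension, multiply, then drop the padded output rows."""
    padded, n_rows = pad_rows(a)
    return matmul(padded, b)[:n_rows]
```

Because the extra rows are zeros and are sliced off afterwards, the result matches the unpadded product; only the shape seen by the GEMM changes.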
2024-05-09 16:38:07 -07:00
ebce310b74
[Model] Snowflake arctic model implementation ( #4652 )
...
Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com >
Co-authored-by: Aurick Qiao <qiao@aurick.net >
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com >
Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-05-09 22:37:14 +00:00
be0c5180ac
[Bugfix] Add logs for all model dtype casting ( #4717 )
2024-05-09 18:36:25 +00:00
cea64430f6
[Bugfix] Update grafana.json ( #4711 )
2024-05-09 10:10:13 -07:00
a3c124570a
[Bugfix] Fix CLI arguments in OpenAI server docs ( #4709 )
2024-05-09 09:53:14 -07:00
ff5abcd746
[ROCm] Add support for Punica kernels on AMD GPUs ( #3140 )
...
Co-authored-by: miloice <jeffaw99@hotmail.com >
2024-05-09 09:19:50 -07:00
0ee535b294
[Misc] Set block size at initialization & Fix test_model_runner ( #4705 )
2024-05-09 09:04:59 -07:00
190bc838e1
[Misc] Remove unnecessary ModelRunner imports ( #4703 )
2024-05-09 00:17:17 -07:00
f12b20decc
[Frontend] Move async logic outside of constructor ( #4674 )
2024-05-08 22:48:33 -07:00
16bc0a098f
[Frontend] add tok/s speed metric to llm class when using tqdm ( #4400 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-05-08 22:02:31 -07:00
e288df0632
[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin ( #4626 )
2024-05-08 17:14:31 -07:00
8b9241be3a
[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec logprobs ( #4672 )
2024-05-08 23:24:46 +00:00
f942efb5a3
[Dynamic Spec Decoding] Auto-disable by the running queue size ( #4592 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-05-08 21:44:00 +00:00
89579a201f
[Misc] Use vllm-flash-attn instead of flash-attn ( #4686 )
2024-05-08 13:15:34 -07:00
230c4b38c1
[CI/Test] fix swap test for multi gpu ( #4689 )
2024-05-08 13:14:02 -07:00
20cfcdec99
[Core][Optimization] change python dict to pytorch tensor for blocks to swap ( #4659 )
2024-05-08 12:07:05 -07:00
ad932a221d
[Core] Faster startup for LoRA enabled models ( #4634 )
2024-05-08 10:33:18 -07:00
5510cf0e8a
[Misc] Add get_name method to attention backends ( #4685 )
2024-05-08 09:59:31 -07:00
0f9a6e3d22
[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi ( #4573 )
2024-05-08 09:19:58 -07:00
f6a593093a
[CI] Make mistral tests pass ( #4596 )
2024-05-08 08:44:35 -07:00
d7740ea4dc
[Core] Optimize sampler get_logprobs ( #4594 )
2024-05-08 08:42:28 -07:00
cc466a3290
[Core][Distributed] support cpu&device in broadcast tensor dict ( #4660 )
...
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660 )
2024-05-07 19:34:47 -07:00
8344f7742b
[Bug fix][Core] fixup ngram not setup correctly ( #4551 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
Co-authored-by: Cade Daniel <edacih@gmail.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-05-07 11:40:18 -07:00
469f85c782
[Core][Optimization] change copy-on-write from dict[int, list] to list ( #4648 )
2024-05-07 11:06:32 -07:00
10760da800
[Bugfix] Fixed error in slice_lora_b for MergedQKVParallelLinearWithLora ( #4609 )
2024-05-07 10:59:07 -07:00
478aed5827
[Build/CI] Fixing 'docker run' to re-enable AMD CI tests. ( #4642 )
2024-05-07 09:23:17 -07:00
63575bc2e1
[Core][Optimization] change python dict to pytorch tensor ( #4607 )
2024-05-06 21:30:27 -07:00
a98187cf72
[Kernel] Make static FP8 scaling more robust ( #4570 )
...
Previously, FP8 static scaling worked only if the scales overestimated the maxima of all activation tensors during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activations in my checkpoint
https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale
(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k ), I'm getting the following mostly random performance on MMLU:
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.2295|± |0.0035|
| - humanities |N/A |none | 5|acc |0.2421|± |0.0062|
| - other |N/A |none | 5|acc |0.2398|± |0.0076|
| - social_sciences|N/A |none | 5|acc |0.2171|± |0.0074|
| - stem |N/A |none | 5|acc |0.2125|± |0.0073|
With the fix in this PR where the scaled activations are clamped between [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.7008|± |0.0036|
| - humanities |N/A |none | 5|acc |0.6453|± |0.0065|
| - other |N/A |none | 5|acc |0.7692|± |0.0072|
| - social_sciences|N/A |none | 5|acc |0.8083|± |0.0070|
| - stem |N/A |none | 5|acc |0.6115|± |0.0083|
This is not perfect yet but is getting very close to the FP16 / dynamic activation scale performance.
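As a rough sketch of the clamping step described above, in plain Python rather than the actual CUDA kernel: the function name `scale_and_clamp` is hypothetical, and the sketch only saturates values to the finite range of float8_e4m3fn (maximum 448) without rounding them to the FP8 grid.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the float8_e4m3fn format


def scale_and_clamp(activations, scale):
    """Divide by a static scale, then saturate to the FP8 E4M3 range."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x / scale)) for x in activations]
```

With an underestimated scale, a value such as 1000.0 would fall outside the representable range; clamping keeps it at 448.0 instead.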
2024-05-06 17:39:28 -07:00
bd99d22629
Update lm-format-enforcer to 0.10.1 ( #4631 )
2024-05-06 23:51:59 +00:00
19cb4716ee
[CI] Add retry for agent lost ( #4633 )
2024-05-06 23:18:57 +00:00
e186d37cb1
[CI] use ccache actions properly in release workflow ( #4629 )
2024-05-06 22:23:36 +00:00
323f27b904
[Bugfix] Fix asyncio.Task not being subscriptable ( #4623 )
2024-05-06 09:31:05 -07:00
0650e5935b
Disable cuda version check in vllm-openai image ( #4530 )
2024-05-05 16:58:55 -07:00
c7f2cf2b7f
[CI] Reduce wheel size by not shipping debug symbols ( #4602 )
2024-05-04 21:28:58 -07:00
8d8357c8ed
bump version to v0.4.2 ( #4600 )
2024-05-04 17:09:49 -07:00
4302987069
[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics ( #3937 )
2024-05-04 15:39:34 -07:00
021b1a2ab7
[CI] check size of the wheels ( #4319 )
2024-05-04 20:44:36 +00:00
2a052011ca
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) ( #4527 )
...
Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436 .
This PR enables the following checkpoint loading features for Mixtral:
- Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
- Supports static or dynamic activation quantization with static weight quantization (all per tensor)
- Supports different scales for each expert weight
- Supports Fp8 in QKV layer
Notes:
- The Expert Gate/Router always runs at half / full precision for now.
- If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.
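The re-quantization note above can be illustrated with a small sketch. It assumes the convention real_value = quantized * scale, uses a hypothetical helper name, and omits the rounding to FP8 that a real implementation would perform.

```python
def requantize_to_shared_scale(weights, scales):
    """Re-express per-shard quantized weights under one shared scale.

    weights: list of shards, each a list of quantized values
    scales:  one scale per shard, where real_value = quantized * scale
    Returns (requantized_shards, shared_scale).
    """
    shared_scale = max(scales)
    requantized = [
        [q * scale / shared_scale for q in shard]
        for shard, scale in zip(weights, scales)
    ]
    return requantized, shared_scale
```

Choosing the maximum scale means every re-quantized value has a magnitude no larger than its original quantized value, so nothing overflows the quantized range, and the fused shards can then share one scale in a single GEMM.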
2024-05-04 11:45:16 -07:00
36fb68f947
[Doc] Chunked Prefill Documentation ( #4580 )
2024-05-04 00:18:00 -07:00
bc8ad68455
[Misc][Refactor] Introduce ExecuteModelData ( #4540 )
2024-05-03 17:47:07 -07:00
344bf7cd2d
[Misc] add installation time env vars ( #4574 )
2024-05-03 15:55:56 -07:00
ab50275111
[Speculative decoding] Support target-model logprobs ( #4378 )
2024-05-03 15:52:01 -07:00
43c413ec57
[Kernel] Use flashinfer for decoding ( #4353 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com >
2024-05-03 15:51:27 -07:00
f8e7adda21
Fix/async chat serving ( #2727 )
2024-05-03 11:04:14 -07:00
7e65477e5e
[Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None ( #4586 )
2024-05-03 10:32:21 -07:00
3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term ( #4518 )
2024-05-03 10:20:12 -07:00
2d7bce9cd5
[Doc] add env vars to the doc ( #4572 )
2024-05-03 05:13:49 +00:00
ce3f1eedf8
[Misc] remove chunk detected debug logs ( #4571 )
2024-05-03 04:48:08 +00:00
808632d3b4
[BugFix] Prevent the task of _force_log from being garbage collected ( #4567 )
2024-05-03 01:35:18 +00:00
344a5d0c33
[Core][Distributed] enable allreduce for multiple tp groups ( #4566 )
2024-05-02 17:32:33 -07:00
0f8a91401c
[Core] Ignore infeasible swap requests. ( #4557 )
2024-05-02 14:31:20 -07:00
9b5c9f9484
[CI/Build] AMD CI pipeline with extended set of tests. ( #4267 )
...
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-05-02 12:29:07 -07:00
32881f3f31
[kernel] fix sliding window in prefix prefill Triton kernel ( #4405 )
...
Co-authored-by: SangBin Cho <rkooo567@gmail.com >
2024-05-02 11:23:37 -07:00
5b8a7c1cb0
[Misc] centralize all usage of environment variables ( #4548 )
2024-05-02 11:13:25 -07:00
1ff0c73a79
[BugFix] Include target-device specific requirements.txt in sdist ( #4559 )
2024-05-02 10:52:51 -07:00
5ad60b0cbd
[Misc] Exclude the tests directory from being packaged ( #4552 )
2024-05-02 10:50:25 -07:00
fb087af52e
[mypy][7/N] Cover all directories ( #4555 )
2024-05-02 10:47:41 -07:00
7038e8b803
[Kernel] Support running GPTQ 8-bit models in Marlin ( #4533 )
2024-05-02 12:56:22 -04:00
2a85f93007
[Core][Distributed] enable multiple tp group ( #4512 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-05-02 04:28:21 +00:00
cf8cac8c70
[mypy][6/N] Fix all the core subdirectory typing ( #4450 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-05-02 03:01:00 +00:00
5e401bce17
[CI]Add regression tests to ensure the async engine generates metrics ( #4524 )
2024-05-01 19:57:12 -07:00
0d62fe58db
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption ( #4451 )
2024-05-01 19:24:13 -07:00
b8afa8b95a
[MISC] Rework logger to enable pythonic custom logging configuration to be provided ( #4273 )
2024-05-01 17:34:40 -07:00
826b82a260
[Misc] Fix expert_ids shape in MoE ( #4517 )
2024-05-01 23:47:59 +00:00
c9d852d601
[Misc] Remove Mixtral device="cuda" declarations ( #4543 )
...
Remove the device="cuda" declarations in mixtral as promised in #4343
2024-05-01 16:30:52 -07:00
6ef09b08f8
[Core][Distributed] fix pynccl del error ( #4508 )
2024-05-01 15:23:06 -07:00
3a922c1e7e
[Bugfix][Core] Fix and refactor logging stats ( #4336 )
2024-05-01 20:08:14 +00:00
c47ba4aaa9
[Bugfix] Add validation for seed ( #4529 )
2024-05-01 19:31:22 +00:00
24bb4fe432
[Kernel] Update fused_moe tuning script for FP8 ( #4457 )
...
This PR updates the tuning script for the fused_moe kernel to support FP8 and also adds configurations for TP4. Note that I removed num_warps and num_stages from the configuration for small batch sizes, since that improved performance and brought the benchmarks on par with the earlier numbers in that regime, which makes sure this is a strict improvement over the status quo.
All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, 1000 input and 50 output tokens.
Before this PR (with static activation scaling):
qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.1 ms ITL, 0.52s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 14.0 ms ITL, 0.70s e2e latency
qps = 10: 15.7 ms ITL, 0.79s e2e latency
After this PR (with static activation scaling):
qps = 1: 9.8 ms ITL, 0.49s e2e latency
qps = 2: 9.7 ms ITL, 0.49s e2e latency
qps = 4: 10.2 ms ITL, 0.53s e2e latency
qps = 6: 11.9 ms ITL, 0.59s e2e latency
qps = 8: 11.9 ms ITL, 0.59s e2e latency
qps = 10: 12.1 ms ITL, 0.61s e2e latency
2024-05-01 11:47:38 -07:00
a657bfc48a
[Core] Add multiproc_worker_utils for multiprocessing-based workers ( #4357 )
2024-05-01 18:41:59 +00:00
24750f4cad
[Core] Enable prefix caching with block manager v2 enabled ( #4142 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
Co-authored-by: Sage Moore <sagemoore@utexas.edu >
2024-05-01 11:20:32 -07:00
b38e42fbca
[Speculative decoding] Add ngram prompt lookup decoding ( #4237 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
2024-05-01 11:13:03 -07:00
8b798eec75
[CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation ( #4534 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-05-01 18:01:50 +00:00
69909126a7
[Bugfix] Use random seed if seed is -1 ( #4531 )
2024-05-01 10:41:17 -07:00
e491c7e053
[Doc] update(example model): for OpenAI compatible serving ( #4503 )
2024-05-01 10:14:16 -07:00
4dc8026d86
[Bugfix] Fix 307 Redirect for /metrics ( #4523 )
2024-05-01 09:14:13 -07:00
a88bb9b032
[Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. ( #4173 )
...
Signed-off-by: AnyISalIn <anyisalin@gmail.com >
2024-05-01 09:11:03 -07:00
6f1df80436
[Test] Add ignore_eos test ( #4519 )
2024-05-01 08:45:42 -04:00
d6f4bd7cdd
[Misc]Add customized information for models ( #4132 )
2024-04-30 21:18:14 -07:00
c3845d82dc
Allow user to define whitespace pattern for outlines ( #4305 )
2024-04-30 20:48:39 -07:00
a822eb3413
[Misc] fix typo in block manager ( #4453 )
2024-04-30 20:41:32 -07:00
f458112e8a
[Misc][Typo] type annotation fix ( #4495 )
2024-04-30 20:21:39 -07:00
2e240c69a9
[Core] Centralize GPU Worker construction ( #4419 )
2024-05-01 01:06:34 +00:00
ee37328da0
Unable to find Punica extension issue during source code installation ( #4494 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-05-01 00:42:09 +00:00
6ad58f42c5
fix_tokenizer_snapshot_download_bug ( #4493 )
2024-04-30 16:38:50 -07:00
dd1a50a8bc
[Bugfix][Minor] Make ignore_eos effective ( #4468 )
2024-04-30 16:33:33 -07:00
715c2d854d
[Frontend] [Core] Tensorizer: support dynamic num_readers, update version ( #4467 )
2024-04-30 16:32:13 -07:00
a494140433
[Frontend] Support complex message content for chat completions endpoint ( #3467 )
...
Co-authored-by: Lily Liu <lilyliupku@gmail.com >
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk >
2024-04-30 16:28:46 -07:00
111815d482
[Kernel] Support Fp8 Checkpoints (Dynamic + Static) ( #4332 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
Co-authored-by: mgoin <michael@neuralmagic.com >
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com >
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com >
2024-04-30 21:46:12 +00:00
b31a1fb63c
[Doc] add visualization for multi-stage dockerfile ( #4456 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-04-30 17:41:59 +00:00
4bb53e2dde
[BugFix] fix num_lookahead_slots missing in async executor ( #4165 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
2024-04-30 10:12:59 -07:00
26f2fb5113
[Core]Refactor gptq_marlin ops ( #4466 )
2024-04-30 08:14:47 -04:00
fa32207842
[Bugfix][Kernel] Fix compute_type for MoE kernel ( #4463 )
2024-04-29 22:05:40 -07:00
d627a3d837
[Misc] Upgrade to torch==2.3.0 ( #4454 )
2024-04-29 20:05:47 -04:00
f4f921b7f1
[Core][Distributed] use cpu group to broadcast metadata in cpu ( #4444 )
2024-04-29 13:52:22 -07:00
ac5ccf0156
[CI] hotfix: soft fail neuron test ( #4458 )
2024-04-29 19:50:01 +00:00
73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin ( #3922 )
...
Co-authored-by: alexm <alexm@neuralmagic.com >
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-04-29 09:35:34 -07:00
df29793dc7
[mypy][5/N] Support all typing on model executor ( #4427 )
2024-04-28 19:01:26 -07:00
03dd7d52bf
[CI] clean docker cache for neuron ( #4441 )
2024-04-28 23:32:07 +00:00
bf480c5302
Add more Prometheus metrics ( #2764 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com >
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-04-28 15:59:33 -07:00
9c7306ac11
[Misc] fix typo in llm_engine init logging ( #4428 )
2024-04-28 18:58:30 +08:00
4ea1f9678d
[BugFix] Resolved Issues For LinearMethod --> QuantConfig ( #4418 )
2024-04-27 18:35:33 +00:00
ba4be44c32
[BugFix] Fix return type of executor execute_model methods ( #4402 )
2024-04-27 11:17:45 -07:00
d6e520e170
[Core] Support offline use of local cache for models ( #4374 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com >
Co-authored-by: Travis Johnson <tjohnson31415@gmail.com >
2024-04-27 09:59:55 -07:00
81661da7b2
[BugFix] Fix min_tokens when eos_token_id is None ( #4389 )
...
Co-authored-by: DefTruth <31974251+deftruth@users.noreply.github.com >
2024-04-27 09:52:46 -07:00
dfea173148
[Bugfix] Abort requests when the connection to /v1/completions is interrupted ( #4363 )
2024-04-27 09:48:37 -07:00
7134303cbb
[Bugfix][Core] Fix get decoding config from ray ( #4335 )
2024-04-27 11:30:08 +00:00
3da24c2df7
[Model] Phi-3 4k sliding window temp. fix ( #4380 )
2024-04-27 18:08:15 +08:00
eefeb16464
[Kernel] Full Tensor Parallelism for LoRA Layers ( #3524 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-04-27 00:03:48 -07:00
18d23f642a
[ROCm][Hardware][AMD] Enable group query attention for triton FA ( #4406 )
2024-04-26 23:37:40 -07:00
87f545ba6f
[Misc] Fix logger format typo ( #4396 )
2024-04-27 13:45:02 +08:00
8947bc3c15
[Frontend][Bugfix] Disallow extra fields in OpenAI API ( #4355 )
2024-04-27 05:08:24 +00:00
12628d3c78
[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales ( #4343 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-04-27 04:49:59 +00:00
258a2c58d0
[Core] Introduce DistributedGPUExecutor abstract class ( #4348 )
2024-04-27 04:14:26 +00:00
aba47be3fe
[Misc] add RFC issue template ( #4401 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-04-26 15:47:45 -07:00
a62aaf1df5
[Misc][Refactor] Generalize linear_method to be quant_method ( #4373 )
2024-04-26 16:41:14 -04:00
603ad84815
[Core] Refactoring sampler and support prompt logprob for chunked prefill ( #4309 )
2024-04-26 13:02:02 +00:00
a88081bf76
[CI] Disable non-lazy string operation on logging ( #4326 )
...
Co-authored-by: Danny Guinther <dguinther@neuralmagic.com >
2024-04-26 00:16:58 -07:00
2f30e7c72f
[Frontend] Add --log-level option to api server ( #4377 )
2024-04-26 05:36:01 +00:00
a74dee9b62
[Bugfix] Fix parameter name in get_tokenizer ( #4107 )
2024-04-25 19:10:48 -07:00
cf29b7eda4
[ROCm][Hardware][AMD][Doc] Documentation update for ROCm ( #4376 )
...
Co-authored-by: WoosukKwon <woosuk.kwon@berkeley.edu >
2024-04-25 18:12:25 -07:00
efffb63f58
[Core] Move function tracing setup to util function ( #4352 )
2024-04-25 16:45:12 -07:00
15e7c675b0
[Core] Add shutdown() method to ExecutorBase ( #4349 )
2024-04-25 16:32:48 -07:00
b6dcb4d442
[Misc] Fix flash attention backend log ( #4368 )
2024-04-25 12:43:32 -07:00
b5b4a398a7
[Mypy] Typing lora folder ( #4337 )
2024-04-25 19:13:50 +00:00
f4bc4de1b1
[Core]refactor aqlm quant ops ( #4351 )
2024-04-25 15:03:56 -04:00
bd7a8eef25
[Doc] README Phi-3 name fix. ( #4372 )
...
Co-authored-by: Caio Mendes <caiocesart@microsoft.com >
2024-04-25 10:32:00 -07:00
7ee82bef1e
[CI/Build] Adding functionality to reset the node's GPUs before processing. ( #4213 )
2024-04-25 09:37:20 -07:00
fbf152d976
[Bugfix][Model] Refactor OLMo model to support new HF format in transformers 4.40.0 ( #4324 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-04-25 09:35:56 -07:00
479d69fad0
[Core] Move ray_utils.py from engine to executor package ( #4347 )
2024-04-25 06:52:22 +00:00
96e90fdeb3
[Model] Adds Phi-3 support ( #4298 )
2024-04-25 03:06:57 +00:00
a395a638c2
[Misc] Use public API in benchmark_throughput ( #4300 )
2024-04-24 21:10:24 +00:00
2768884ac4
[Doc] Add note for docker user ( #4340 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-04-24 21:09:44 +00:00
aae08249ac
[Bugfix] Fix marlin kernel crash on H100 ( #4218 )
...
This PR addresses the Marlin kernel H100 crash that was reported here: neuralmagic#187.
The reason for the crash was the inline PTX assembly that introduced the async_copy with streaming behavior. The solution is to use the more standard PTX for async_copy (without the fractional L2 policy for "evict_first"). There is no performance difference between standard async_copy PTX and the previous one.
2024-04-24 10:35:01 -07:00
7923dcad12
[Misc] Update ShareGPT Dataset Sampling in Serving Benchmark ( #4279 )
2024-04-24 09:49:13 -07:00
3cd9b5bb2d
[Core][Distributed] use existing torch.cuda.device ( #4318 )
...
[Core][Distributed] use existing torch.cuda.device context manager (#4318 )
2024-04-24 09:00:20 -07:00
468d761b32
[Misc] Reduce supported Punica dtypes ( #4304 )
2024-04-23 18:54:33 -07:00
e4bf860a54
[CI][Build] change pynvml to nvidia-ml-py ( #4302 )
2024-04-23 18:33:12 -07:00
91f50a6fe2
[Core][Distributed] use cpu/gloo to initialize pynccl ( #4248 )
2024-04-23 18:32:19 -07:00
79a268c4ab
[BUG] fixed fp8 conflict with aqlm ( #4307 )
...
Fixes the fp8 interface, which broke in the AQLM merge.
2024-04-23 18:26:33 -07:00
eace8bf0b9
[Kernel] FP8 support for MoE kernel / Mixtral ( #4244 )
...
This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208
It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118 ), so users do not need to compute activation scales on a calibration dataset and they also don't need to convert their model checkpoints. It is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
**Performance**: This PR focuses on keeping the code clean while still achieving reasonable performance. A number of optimizations that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954 ) will be submitted in a follow-up PR. With this PR, the results are as follows:
<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03 ">
**Accuracy**: The accuracy with this PR on MMLU on `mistralai/Mixtral-8x7B-v0.1` is as follows:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the fp16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```
Happy hacking!
2024-04-24 01:18:23 +00:00
1e8f4252aa
[Bugfix][Frontend] Raise exception when file-like chat template fails to be opened ( #4292 )
2024-04-23 18:19:03 +00:00
2b7949c1c2
AQLM CUDA support ( #3287 )
...
Co-authored-by: mgoin <michael@neuralmagic.com >
2024-04-23 13:59:33 -04:00
62b5166bd4
[CI] Add ccache for wheel builds job ( #4281 )
2024-04-23 09:51:41 -07:00
d86285a4a4
[Core][Logging] Add last frame information for better debugging ( #4278 )
2024-04-23 09:45:52 -07:00
d87f39e9a9
[Bugfix] Add init_cached_hf_modules to RayWorkerWrapper ( #4286 )
2024-04-23 09:28:35 -07:00
d3c8180ac4
[Bugfix] Fixing max token error message for openai compatible server ( #4016 )
2024-04-23 19:06:29 +08:00
62b8aebc6f
[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. ( #3951 )
2024-04-23 08:02:36 +00:00
050f285ff6
[Core] Scheduling optimization 2 ( #4280 )
2024-04-23 08:02:11 +00:00
8f2ea22bde
[Core] Some simplification of WorkerWrapper changes ( #4183 )
2024-04-23 07:49:08 +00:00
0ae11f78ab
[Mypy] Part 3 fix typing for nested directories for most of directory ( #4161 )
2024-04-22 21:32:44 -07:00
34128a697e
Fix autodoc directives ( #4272 )
...
Co-authored-by: Harry Mellor <hmellor@oxts.com >
2024-04-23 01:53:01 +00:00
c1b4e4157c
[Core][Distributed] use absolute path for library file ( #4271 )
2024-04-22 17:21:48 -07:00
ceaf4ed003
[Doc] Update the SkyPilot doc with serving and Llama-3 ( #4276 )
2024-04-22 15:34:31 -07:00
ad8d696a99
[Core] Scheduler perf fix ( #4270 )
2024-04-22 21:11:06 +00:00
3d925165f2
Add example scripts to documentation ( #4225 )
...
Co-authored-by: Harry Mellor <hmellor@oxts.com >
2024-04-22 16:36:54 +00:00
1543680691
[Bugfix] Ensure download_weights_from_hf(..) inside loader is using the revision parameter ( #4217 )
2024-04-22 09:10:48 -07:00
077f0a2e8a
[Frontend] Enable support for CPU backend in AsyncLLMEngine. ( #3993 )
...
Signed-off-by: Tao He <sighingnow@gmail.com >
2024-04-22 09:19:51 +00:00
e73ed0f1c6
[Bugfix] Fix type annotations in CPU model runner ( #4256 )
2024-04-22 00:54:16 -07:00
296cdf8ac7
[Misc] Add vision language model support to CPU backend ( #3968 )
2024-04-22 00:44:16 -07:00
747b1a7147
[Core][Distributed] fix _is_full_nvlink detection ( #4233 )
2024-04-21 23:04:16 -07:00
95e5b087cf
[AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring ( #4129 )
2024-04-21 21:57:24 -07:00
a37d815b83
Make initialization of tokenizer and detokenizer optional ( #3748 )
...
Co-authored-by: Yun Ding <yunding@nvidia.com >
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-04-21 22:06:46 +00:00
7f2593b164
[Doc]: Update the doc of adding new models ( #4236 )
2024-04-21 09:57:08 -07:00
fe7d648fe5
Don't show default value for flags in EngineArgs ( #4223 )
...
Co-authored-by: Harry Mellor <hmellor@oxts.com >
2024-04-21 09:15:28 -07:00
cc74b2b232
Updating lm-format-enforcer version and adding links to decoding libraries in docs ( #4222 )
2024-04-20 08:33:16 +00:00
91528575ec
[Frontend] multiple sampling params support ( #3570 )
2024-04-20 00:11:57 -07:00
a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling ( #4118 )
...
Provide initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726
This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.
Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
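The per-tensor scaling step described above can be sketched roughly as follows. This is a pure-Python illustration, not the code in Fp8LinearMethod: the helper names (`round_to_e4m3`, `per_tensor_scale`, `quantize`, `dequantize`) are hypothetical, the scale is assumed to be the tensor's absolute maximum divided by the largest finite e4m3 value (448), and the rounding helper is a simplification that ignores subnormals.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3fn


def round_to_e4m3(x: float) -> float:
    """Approximate FP8 e4m3 rounding: clamp, then keep 4 significant bits.

    Simplification for illustration only: subnormals are not modeled.
    """
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent
    return math.ldexp(round(mantissa * 16) / 16, exponent)


def per_tensor_scale(values: list[float]) -> float:
    """One scale for the whole tensor (assumed: amax / FP8 max)."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(values: list[float], scale: float) -> list[float]:
    """Divide by the scale and round each element to the FP8 grid."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(quantized: list[float], scale: float) -> list[float]:
    """Multiply by the stored scale to recover approximate values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    # Weights: the scale is computed once after loading and then stored.
    weights = [0.5, -1.0, 0.25, 2.0]
    w_scale = per_tensor_scale(weights)
    print(quantize(weights, w_scale), w_scale)

    # Activations: the scale is recomputed on every forward pass.
    activations = [0.03, -0.7, 1.9]
    a_scale = per_tensor_scale(activations)
    print(dequantize(quantize(activations, a_scale), a_scale))
```

In this sketch the weight scale is a one-time cost, while the activation scale adds a reduction over the tensor on every call, which is the "dynamic" part of the scheme.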
Initial Results:
Currently tested Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:
BF16: 1.47s
FP8: 1.66s
I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.
2024-04-20 04:28:57 +00:00
682789d402
Fix missing docs and out of sync EngineArgs ( #4219 )
...
Co-authored-by: Harry Mellor <hmellor@oxts.com >
2024-04-19 20:51:33 -07:00
138485a82d
[Bugfix] Add fix for JSON whitespace ( #4189 )
...
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal >
2024-04-19 20:49:22 -07:00
bc9df1571b
Pass tokenizer_revision when getting tokenizer in openai serving ( #4214 )
2024-04-19 17:13:56 -07:00
15b86408a8
[Misc] add nccl in collect env ( #4211 )
2024-04-19 19:44:51 +00:00
7be4f5628f
[Bugfix][Core] Restore logging of stats in the async engine ( #4150 )
2024-04-19 08:08:26 -07:00
8f20fc04bf
[Misc] fix docstrings ( #4191 )
...
Co-authored-by: Zhong Wang <wangzhong@infini-ai.com >
2024-04-19 08:18:33 +00:00
221d93ecbf
Bump version of 0.4.1 ( #4177 )
2024-04-19 01:00:22 -07:00
d17c8477f1
[Bugfix] Fix LoRA loading check ( #4138 )
...
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-04-19 00:59:54 -07:00
a134ef6f5e
Support eos_token_id from generation_config.json ( #4182 )
2024-04-19 04:13:36 +00:00
8a7a3e4436
[Core] add an option to log every function call for debugging hang/crash in distributed inference ( #4079 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-04-18 16:15:12 -07:00
8f9c28fd40
[Bugfix] Fix CustomAllreduce nvlink topology detection ( #3974 )
...
[Bugfix] Fix CustomAllreduce pcie nvlink topology detection (#3974 ) (#4159 )
2024-04-18 15:32:47 -07:00
cd2f63fb36
[CI/CD] add neuron docker and ci test scripts ( #3571 )
2024-04-18 15:26:01 -07:00
87fa80c91f
[Misc] Bump transformers to latest version ( #4176 )
2024-04-18 14:36:39 -07:00
e1bb2fd52d
[Bugfix] Support logprobs when using guided_json and other constrained decoding fields ( #4149 )
2024-04-18 21:12:55 +00:00
705578ae14
[Docs] document that Meta Llama 3 is supported ( #4175 )
2024-04-18 10:55:48 -07:00
e8cc7967ff
[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill ( #4128 )
2024-04-18 00:51:28 -07:00
53b018edcb
[Bugfix] Get available quantization methods from quantization registry ( #4098 )
2024-04-18 00:21:55 -07:00
66ded03067
Allow model to be served under multiple names ( #2894 )
...
Co-authored-by: Alexandre Payot <alexandrep@graphcore.ai >
2024-04-18 00:16:26 -07:00
6dc1fc9cfe
[Core] nccl integrity check and test ( #4155 )
...
[Core] Add integrity check during initialization; add test for it (#4155 )
2024-04-17 22:28:52 -07:00
533d2a1f39
[Typing] Mypy typing part 2 ( #4043 )
...
Co-authored-by: SangBin Cho <sangcho@sangcho-LT93GQWG9C.local >
2024-04-17 17:28:43 -07:00
a53222544c
[Kernel] Add punica dimension for Swallow-MS-7B LoRA ( #4134 )
2024-04-17 10:02:45 -07:00
fe3b5bbc23
[Bugfix] fix output parsing error for trtllm backend ( #4137 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-04-17 11:07:23 +00:00
8438e0569e
[Core] RayWorkerVllm --> WorkerWrapper to reduce duplication ( #4024 )
...
[Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication (#4024 )
2024-04-17 08:34:33 +00:00
11d652bd4f
[CI] Move CPU/AMD tests to after wait ( #4123 )
2024-04-16 22:53:26 -07:00
d150e4f89f
[Misc] [CI] Fix CI failure caught after merge ( #4126 )
2024-04-16 17:56:01 -07:00
e95cd87959
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine ( #3894 )
2024-04-16 13:09:21 -07:00
69e1d2fb69
[Core] Refactor model loading code ( #4097 )
2024-04-16 11:34:39 -07:00
05434764cd
LM Format Enforcer Guided Decoding Support ( #3868 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-04-16 05:54:57 +00:00
4e7ee664e2
[Core] Fix engine-use-ray broken ( #4105 )
2024-04-16 05:24:53 +00:00
37e84a403d
[Typing] Fix Sequence type GenericAlias only available after Python 3.9. ( #4092 )
2024-04-15 14:47:31 -07:00
4695397dcf
[Bugfix] Fix ray workers profiling with nsight ( #4095 )
2024-04-15 14:24:45 -07:00
d619ae2d19
[Doc] Add better clarity for tensorizer usage ( #4090 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-04-15 13:28:25 -07:00
eb46fbfda2
[Core] Simplifications to executor classes ( #4071 )
2024-04-15 13:05:09 -07:00
0003e9154b
[Misc][Minor] Fix CPU block num log in CPUExecutor. ( #4088 )
2024-04-15 08:35:55 -07:00
e11e200736
[Bugfix] Fix filelock version requirement ( #4075 )
2024-04-14 21:50:08 -07:00
8db1bf32f8
[Misc] Upgrade triton to 2.2.0 ( #4061 )
2024-04-14 17:43:54 -07:00
aceb17cf2d
[Docs] document that mixtral 8x22b is supported ( #4073 )
2024-04-14 14:35:55 -07:00
563c54f760
[BugFix] Fix tensorizer extra in setup.py ( #4072 )
2024-04-14 14:12:42 -07:00
2cd6b4f362
[Core] avoid too many cuda context by caching p2p test ( #4021 )
2024-04-13 23:40:21 -07:00
711a000255
[Frontend] [Core] feat: Add model loading using tensorizer ( #3476 )
2024-04-13 17:13:01 -07:00
989ae2538d
[Kernel] Add punica dimension for Baichuan-13B ( #4053 )
2024-04-13 07:55:05 -07:00
0a430b4ae2
[Bugfix] fix_small_bug_in_neuron_executor ( #4051 )
2024-04-13 07:54:03 -07:00
ec8e3c695f
[Bugfix] fix_log_time_in_metrics ( #4050 )
2024-04-13 07:52:36 -07:00
98afde19fc
[Core][Distributed] improve logging for init dist ( #4042 )
2024-04-13 07:12:53 -07:00
5c2e66e487
[Bugfix] More type hint fixes for py 3.8 ( #4039 )
2024-04-12 21:07:04 -07:00
546e721168
[CI/Test] expand ruff and yapf for all supported python version ( #4037 )
2024-04-13 01:43:37 +00:00
b8aacac31a
[Bugfix] Fix LoRA bug ( #4032 )
2024-04-12 16:56:37 -07:00
d04973ad54
Fix triton compilation issue ( #3984 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-04-12 16:41:26 -07:00
fbb9d9eef4
[Core] fix custom allreduce default value ( #4040 )
2024-04-12 16:40:39 -07:00
09473ee41c
[mypy] Add mypy type annotation part 1 ( #4006 )
2024-04-12 14:35:50 -07:00
d4ec9ffb95
[Misc] Fix typo in scheduler.py ( #4022 )
2024-04-12 13:56:04 -07:00
96b6a6d790
[Bugfix] fix type hint for py 3.8 ( #4036 )
2024-04-12 19:35:44 +00:00
36729bac13
[Test] Test multiple attn backend for chunked prefill. ( #4023 )
2024-04-12 09:56:57 -07:00
7fd3949a0b
[Frontend][Core] Move merge_async_iterators to utils ( #4026 )
2024-04-12 05:30:54 +00:00
1096717ae9
[Core] Support LoRA on quantized models ( #4012 )
2024-04-11 21:02:44 -07:00
c2b4a1bce9
[Doc] Add typing hints / mypy types cleanup ( #3816 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-04-11 17:17:21 -07:00
e46a60aa4c
[BugFix] Fix handling of stop strings and stop token ids ( #3672 )
2024-04-11 15:34:12 -07:00
1e96c3341a
Add extra punica sizes to support bigger vocabs ( #4015 )
2024-04-11 22:18:57 +00:00
95e7d4a97c
Fix echo/logprob OpenAI completion bug ( #3441 )
...
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com >
2024-04-11 22:15:50 +00:00
559eb852f8
[Core] init_distributed_environment align with init_process_group( #4014 )
...
[Core][Distributed] make init_distributed_environment compatible with init_process_group (#4014 )
2024-04-11 14:00:48 -07:00
a10d3056da
[Core] Set linear_weights directly on the layer ( #3977 )
2024-04-11 16:35:51 -04:00
8afca50889
[Hardware][Intel] Isolate CPUModelRunner and ModelRunner for better maintenance ( #3824 )
2024-04-11 11:56:49 -07:00
08ccee1e83
punica fix-bgmv-kernel-640 ( #4007 )
2024-04-11 08:59:26 -07:00
c1dc547129
[Kernel] Fused MoE Config for Mixtral 8x22 ( #4002 )
2024-04-11 07:50:00 -07:00
f3d0bf7589
[Doc][Installation] delete python setup.py develop ( #3989 )
2024-04-11 03:33:02 +00:00
e9da5a40c6
[Misc] Add indirection layer for custom ops ( #3913 )
2024-04-10 20:26:07 -07:00
e42df7227d
[Test] Add xformer and flash attn tests ( #3961 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-04-11 03:09:50 +00:00
caada5e50a
[Core][Model] torch.compile for layernorm in commandr ( #3985 )
...
[Core][Model] Use torch.compile to accelerate layernorm in commandr (#3985 )
2024-04-11 01:48:26 +00:00
67b4221a61
[Core][5/N] Fully working chunked prefill e2e ( #3884 )
2024-04-10 17:56:48 -07:00
63e7176f26
[Core][Refactor] move parallel_utils into vllm/distributed ( #3950 )
...
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950 )
2024-04-10 15:33:30 -07:00
934d3662f7
[Bugfix] handle hf_config with architectures == None ( #3982 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-04-10 22:28:25 +00:00
92cd2e2f21
[Doc] Fix getting started to use publicly available model ( #3963 )
2024-04-10 18:05:52 +00:00
e4c4072c94
[Bugfix] Remove key sorting for guided_json parameter in OpenAi compatible Server ( #3945 )
2024-04-10 10:15:51 -07:00
e35397468f
[Doc] Add doc to state our model support policy ( #3948 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-04-10 17:03:02 +00:00
8b317c6dd0
[Model][AMD] ROCm support for 256 head dims for Gemma ( #3972 )
2024-04-10 08:12:00 -07:00
bd3c144e0b
[Bugfix][ROCm] Add numba to Dockerfile.rocm ( #3962 )
2024-04-10 07:37:17 -07:00
0258b7a94b
[Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty ( #3876 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
2024-04-10 01:39:56 -07:00
b3104b2a10
[Bugfix] Fix logits processor when prompt_logprobs is not None ( #3899 )
2024-04-10 00:09:36 -07:00
c2e00af523
[Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable ( #3955 )
...
Co-authored-by: tianyi_zhao <tianyi.zhao@transwarp.io >
2024-04-10 04:49:11 +00:00
c013d32c75
[Benchmark] Add cpu options to bench scripts ( #3915 )
2024-04-09 21:30:03 -07:00
11dd6ebb89
[Misc] Avoid loading incorrect LoRA config ( #3777 )
2024-04-09 19:47:15 -07:00
6c0b04515f
[ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm ( #3643 )
...
Co-authored-by: jpvillam <jpvillam@amd.com >
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-04-09 15:10:47 -07:00
e23a43aef8
[Bugfix] Fix KeyError on loading GPT-NeoX ( #3925 )
2024-04-09 12:11:31 -07:00
e7c7067b45
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" ( #3837 )
2024-04-09 11:44:15 -07:00
6d592eb430
[Core] separate distributed_init from worker ( #3904 )
2024-04-09 08:49:02 +00:00
d036198e23
[BugFix][Model] Fix commandr RoPE max_position_embeddings ( #3919 )
2024-04-09 06:17:21 +08:00
59a6abf3c9
[Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations ( #3782 )
2024-04-08 14:31:02 -07:00
bc0c0192d1
[Bugfix] Enable Proper attention_bias Usage in Llama Model Configuration ( #3767 )
...
Co-authored-by: roy <jasonailu87@gmail.com >
2024-04-08 19:42:35 +00:00
f46864d68d
[Bugfix] Added Command-R GPTQ support ( #3849 )
...
Co-authored-by: Egor Tolmachev <t333ga@gmail.com >
2024-04-08 14:59:38 +00:00
b4543c8f6b
[Model] add minicpm ( #3893 )
2024-04-08 18:28:36 +08:00
0ce0539d47
[Bugfix] Fix Llava inference with Tensor Parallelism. ( #3883 )
2024-04-07 22:54:13 +08:00
2f19283549
[Core] latency optimization ( #3890 )
2024-04-06 19:14:06 -07:00
95baec828f
[Core] enable out-of-tree model register ( #3871 )
2024-04-06 17:11:41 -07:00
e4be7d70bb
[CI/Benchmark] add more iteration and use median for robust latency benchmark ( #3889 )
2024-04-06 21:32:30 +00:00
54951ac4bf
[Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism ( #3869 )
2024-04-05 12:02:09 -07:00
18de883489
[Chunked Prefill][4/n] Chunked prefill scheduler. ( #3853 )
2024-04-05 10:17:58 -07:00
1d7c940d74
Add option to completion API to truncate prompt tokens ( #3144 )
2024-04-05 10:15:42 -07:00
cfaf49a167
[Misc] Define common requirements ( #3841 )
2024-04-05 00:39:17 -07:00
9edec652e2
[Bugfix] Fixing requirements.txt ( #3865 )
2024-04-04 23:46:01 -07:00
e0dd4d3589
[Misc] Fix linter issues in examples/fp8/quantizer/quantize.py ( #3864 )
2024-04-04 21:57:33 -07:00
e5043a3e75
[Misc] Add pytest marker to opt-out of global test cleanup ( #3863 )
2024-04-04 21:54:16 -07:00
d03d64fd2e
[CI/Build] refactor dockerfile & fix pip cache
...
[CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels (#3859 )
2024-04-04 21:53:16 -07:00
78107fa091
[Doc]Add asynchronous engine arguments to documentation. ( #3810 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-04-04 21:52:01 -07:00
c391e4b68e
[Core] improve robustness of pynccl ( #3860 )
2024-04-04 16:52:12 -07:00
9117f892f0
[Model] Cohere CommandR+ ( #3829 )
2024-04-04 13:31:49 -07:00
db2a6a41e2
[Hardware][CPU] Update cpu torch to match default of 2.2.1 ( #3854 )
2024-04-04 19:49:49 +00:00
ca81ff5196
[Core] manage nccl via a pypi package & upgrade to pt 2.2.1 ( #3805 )
2024-04-04 10:26:19 -07:00
b7782002e1
[Benchmark] Refactor sample_requests in benchmark_throughput ( #3613 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-04-04 09:56:22 +00:00
819a309c0f
[Bugfix] Fix args in benchmark_serving ( #3836 )
...
Co-authored-by: Roger Wang <ywang@roblox.com >
2024-04-04 07:41:05 +00:00
aabe8f40f2
[Core] [Frontend] Make detokenization optional ( #3749 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-04-03 21:52:18 -07:00
498eb5cfa3
[Bugfix] Add kv_scale input parameter to CPU backend ( #3840 )
2024-04-04 04:33:08 +00:00
537ee25f43
[Core] Enable hf_transfer by default if available ( #3817 )
2024-04-04 04:02:43 +00:00
294f8f6665
[BugFix] Pass tokenizer_config to local_tokenizer_group ( #3754 )
...
Signed-off-by: Tao He <sighingnow@gmail.com >
2024-04-03 20:31:46 -07:00
b95047f2da
[Misc] Publish 3rd meetup slides ( #3835 )
2024-04-03 15:46:10 -07:00
2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) ( #3290 )
...
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com >
Co-authored-by: HaiShaw <hixiao@gmail.com >
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com >
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com >
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu >
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com >
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com >
Co-authored-by: guofangze <guofangze@kuaishou.com >
Co-authored-by: Michael Goin <mgoin64@gmail.com >
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-04-03 14:15:55 -07:00
3dcb3e8b98
[3/N] Refactor scheduler for chunked prefill scheduling ( #3550 )
2024-04-03 14:13:49 -07:00
c64cf38673
[Doc] Update contribution guidelines for better onboarding ( #3819 )
2024-04-03 07:31:43 +00:00
76b889bf1d
[Doc] Update README.md ( #3806 )
2024-04-02 23:11:10 -07:00
c9b506dad4
[BugFix] Use different mechanism to get vllm version in is_cpu() ( #3804 )
2024-04-02 23:06:25 -07:00
5757d90e26
[Speculative decoding] Adding configuration object for speculative decoding ( #3706 )
...
Co-authored-by: Lily Liu <lilyliupku@gmail.com >
2024-04-03 00:40:57 +00:00
a3c226e7eb
[CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary ( #3803 )
2024-04-02 12:57:04 -07:00
b321d4881b
[Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ ( #3798 )
2024-04-02 12:35:31 -07:00
ad6eca408b
Fix early CUDA init via get_architecture_class_name import ( #3770 )
...
Signed-off-by: Lei Wen <wenlei03@qiyi.com >
Co-authored-by: Lei Wen <wenlei03@qiyi.com >
2024-04-02 11:56:26 -07:00
205b94942e
[CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build ( #3801 )
2024-04-02 11:54:33 -07:00
3bec41f41a
[Doc] Fix vLLMEngine Doc Page ( #3791 )
2024-04-02 09:49:37 -07:00
0739b1947f
[Frontend][Bugfix] allow using the default middleware with a root path ( #3788 )
...
Co-authored-by: A-Mahla <>
2024-04-02 01:20:28 -07:00
77a6572aa5
[HotFix] [CI/Build] Minor fix for CPU backend CI ( #3787 )
2024-04-01 22:50:53 -07:00
0e3f06fe9c
[Hardware][Intel] Add CPU inference backend ( #3634 )
...
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com >
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com >
2024-04-01 22:07:30 -07:00
eb69d68804
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup ( #3783 )
2024-04-02 00:49:51 +00:00
7d4e1b85e7
[Misc] Add support for new autogptq checkpoint_format ( #3689 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com >
2024-04-01 19:32:01 -04:00
93deb0b38f
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding ( #3250 )
2024-04-01 22:55:24 +00:00
ccb58b23e6
[Misc] Fix Benchmark TTFT Calculation for Chat Completions ( #3768 )
2024-04-01 15:24:30 -07:00
49782fcb76
[Misc] Some minor simplifications to detokenization logic ( #3670 )
...
Some simplifications made for clarity.
Also moves detokenization-related functions from tokenizer.py to detokenizer.py.
2024-04-01 13:22:06 -07:00
f03cc667a0
[Misc] Minor fixes in requirements.txt ( #3769 )
2024-04-01 10:15:48 +00:00
563c1d7ec5
[CI/Build] Make Marlin Tests Green ( #3753 )
2024-03-30 19:18:34 -07:00
9c82a1bec3
[Doc] Update installation doc ( #3746 )
...
[Doc] Update installation doc for build from source and explain the dependency on torch/cuda version (#3746 )
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-03-30 16:34:38 -07:00
b6d103542c
[Kernel] Layernorm performance optimization ( #3662 )
2024-03-30 14:26:38 -07:00
51c31bc10c
CMake build elf without PTX ( #3739 )
2024-03-30 01:53:08 +00:00
3ad438c66f
Fix build when nvtools is missing ( #3698 )
2024-03-29 18:52:39 -07:00
203d4f82ac
[Core][Bugfix] cache len of tokenizer ( #3741 )
2024-03-29 18:46:39 -07:00
991143cfcd
[BugFix] Use consistent logger everywhere ( #3738 )
2024-03-29 23:26:44 +00:00
8b2d3cbc1b
usage lib get version another way ( #3735 )
2024-03-29 15:57:08 -07:00
9765b5c406
[ROCm][Bugfix] Fixed several bugs related to rccl path and attention selector logic ( #3699 )
2024-03-29 14:52:36 -07:00
430530fc18
bump version to v0.4.0 ( #3712 )
2024-03-29 12:28:33 -07:00
97356f3c7e
[Bugfix] Command-R Max Model Length ( #3727 )
2024-03-29 12:27:51 -07:00
f510395bbf
[BugFix][Frontend] Fix completion logprobs=0 error ( #3731 )
2024-03-29 09:38:21 -07:00
6110c39dc8
[BugFix] Fix tokenizer out of vocab size ( #3685 )
2024-03-29 08:18:59 -07:00
d8658c8cc1
Usage Stats Collection ( #2852 )
2024-03-28 22:16:12 -07:00
7bc94a0fdd
add ccache to docker build image ( #3704 )
2024-03-28 22:14:24 -07:00
756b30a5f3
[Core][Test] move local_rank to the last arg with default value( #3711 )
...
[Core][Test] move local_rank to the last arg with default value to keep api compatible (#3711 )
2024-03-28 21:19:45 -07:00
395aa823ea
[Misc] Minor type annotation fix ( #3716 )
2024-03-28 21:12:24 -07:00
26422e477b
[Test] Make model tests run again and remove --forked from pytest ( #3631 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-03-28 21:06:40 -07:00
f342153b48
Revert "bump version to v0.4.0" ( #3708 )
2024-03-28 18:49:42 -07:00
27a57cad52
bump version to v0.4.0 ( #3705 )
2024-03-28 18:26:51 -07:00
98a42e7078
[Benchmark] Change mii to use persistent deployment and support tensor parallel ( #3628 )
2024-03-28 17:33:52 -07:00
0267fef52a
[Core] fix del of communicator ( #3702 )
2024-03-29 00:24:58 +00:00
4716a32dd4
fix logging msg for block manager ( #3701 )
2024-03-28 23:29:55 +00:00
c0935c96d3
[Bugfix] Set enable_prefix_caching=True in prefix caching example ( #3703 )
2024-03-28 16:26:30 -07:00
cb40b3ab6b
[Kernel] Add MoE Triton kernel configs for A100 40GB ( #3700 )
2024-03-28 15:26:24 -07:00
515386ef3c
[Core] Support multi-node inference(eager and cuda graph) ( #3686 )
2024-03-28 15:01:55 -07:00
a4075cba4d
[CI] Add test case to run examples scripts ( #3638 )
2024-03-28 14:36:10 -07:00
96aa014d1e
fix benchmark format reporting in buildkite ( #3693 )
2024-03-28 14:35:16 -07:00
1715056fef
[Bugfix] Update neuron_executor.py to add optional vision_language_config ( #3695 )
2024-03-28 10:43:34 -07:00
b51c1cc9d2
[2/N] Chunked prefill data update ( #3538 )
2024-03-28 10:06:01 -07:00
ce567a2926
[Kernel] DBRX Triton MoE kernel H100 ( #3692 )
2024-03-28 10:05:34 -07:00
d6ea427f04
[Model] Add support for Qwen2MoeModel ( #3346 )
2024-03-28 15:19:59 +00:00
14ccd94c89
[Core][Bugfix]Refactor block manager for better testability ( #3492 )
2024-03-27 23:59:28 -07:00
8267b06c30
[Kernel] Add Triton MoE kernel configs for DBRX on A100 ( #3679 )
2024-03-27 22:22:25 -07:00
3492859b68
[CI/Build] update default number of jobs and nvcc threads to avoid overloading the system ( #3675 )
2024-03-28 00:18:54 -04:00
098e1776ba
[Model] Add support for xverse ( #3610 )
...
Co-authored-by: willhe <hexin@xverse.cn >
Co-authored-by: root <root@localhost.localdomain >
2024-03-27 18:12:54 -07:00
10e6322283
[Model] Fix and clean commandr ( #3671 )
2024-03-28 00:20:00 +00:00
6d9aa00fc4
[Docs] Add Command-R to supported models ( #3669 )
2024-03-27 15:20:00 -07:00
1182607e18
Add support for Cohere's Command-R model ( #3433 )
...
Co-authored-by: José Maria Pombal <jose.pombal@unbabel.com >
Co-authored-by: youkaichao <youkaichao@gmail.com >
2024-03-27 14:19:32 -07:00
45b6ef6513
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark ( #3277 )
2024-03-27 13:39:26 -07:00
1956931436
[Misc] add the "download-dir" option to the latency/throughput benchmarks ( #3621 )
2024-03-27 13:39:05 -07:00
e24336b5a7
[Model] Add support for DBRX ( #3660 )
2024-03-27 13:01:46 -07:00
d18f4e73f3
[Bugfix] [Hotfix] fix nccl library name ( #3661 )
2024-03-27 17:23:54 +00:00
82c540bebf
[Bugfix] More faithful implementation of Gemma ( #3653 )
2024-03-27 09:37:18 -07:00
8f44facddd
[Core] remove cupy dependency ( #3625 )
2024-03-27 00:33:26 -07:00
e66b629c04
[Misc] Minor fix in KVCache type ( #3652 )
2024-03-26 23:14:06 -07:00
76879342a3
[Doc]add lora support ( #3649 )
2024-03-27 02:06:46 +00:00
566b57c5c4
[Kernel] support non-zero cuda devices in punica kernels ( #3636 )
2024-03-27 00:37:42 +00:00
0dc72273b8
[BugFix] Fix ipv4 address parsing regression ( #3645 )
2024-03-26 14:39:44 -07:00
a979d9771e
[Bugfix] Fix ipv6 address parsing bug ( #3641 )
2024-03-26 11:58:20 -07:00
8af890a865
Enable more models to inference based on LoRA ( #3382 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-03-25 18:09:31 -07:00
dfeb2ecc3a
[Misc] Include matched stop string/token in responses ( #2976 )
...
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com >
2024-03-25 17:31:32 -07:00
3a243095e5
Optimize _get_ranks in Sampler ( #3623 )
2024-03-25 16:03:02 -07:00
64172a976c
[Feature] Add vision language model support. ( #3042 )
2024-03-25 14:16:30 -07:00
f408d05c52
hotfix isort on logprobs ranks pr ( #3622 )
2024-03-25 11:55:46 -07:00
0b4997e05c
[Bugfix] API stream returning two stops ( #3450 )
...
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com >
2024-03-25 10:14:34 -07:00
c13ad1b7bd
feat: implement the min_tokens sampling parameter ( #3124 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com >
Co-authored-by: Nick Hill <nickhill@us.ibm.com >
2024-03-25 10:14:26 -07:00
819924e749
[Core] Adding token ranks along with logprobs ( #3516 )
...
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com >
2024-03-25 10:13:10 -07:00
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00
e67c295b0c
[Bugfix] fix automatic prefix args and add log info ( #3608 )
2024-03-25 05:35:22 -07:00
925f3332ca
[Core] Refactor Attention Take 2 ( #3462 )
2024-03-25 04:39:33 +00:00
b0dfa91dd7
[Model] Add starcoder2 awq support ( #3569 )
2024-03-24 21:07:36 -07:00
56a8652f33
[Bugfix] store lock file in tmp directory ( #3578 )" ( #3599 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-03-24 20:06:50 -07:00
6d93d35308
[BugFix] tensor.get_device() -> tensor.device ( #3604 )
2024-03-24 19:01:13 -07:00
837e185142
[CI/Build] fix flaky test ( #3602 )
2024-03-24 17:43:05 -07:00
42bc386129
[CI/Build] respect the common environment variable MAX_JOBS ( #3600 )
2024-03-24 17:04:00 -07:00
8b268a46a7
[CI] typo fix: is_hip --> is_hip() ( #3595 )
2024-03-24 16:03:06 -07:00
41deac4a3d
[BugFix] 1D query fix for MoE models ( #3597 )
2024-03-24 16:00:16 -07:00
af9e53496f
[BugFix] Fix Falcon tied embeddings ( #3590 )
...
Co-authored-by: 44670 <44670@users.noreply.github.com >
2024-03-24 06:34:01 -07:00
f8a12ecc7f
[Misc] Bump transformers version ( #3592 )
2024-03-24 06:32:45 -07:00
3c5ab9b811
[Misc] Fix BLOOM copyright notice ( #3591 )
2024-03-23 23:30:56 -07:00
743a0b7402
[Bugfix] use SoftLockFile instead of LockFile ( #3578 )
2024-03-23 11:43:11 -07:00
bfdb1ba5c3
[Core] Improve detokenization performance for prefill ( #3469 )
...
Co-authored-by: MeloYang <meloyang05@gmail.com >
2024-03-22 13:44:12 -07:00
cf2f084d56
Dynamic scheduler delay to improve ITL performance ( #3279 )
...
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com >
2024-03-22 12:28:14 -07:00
f721096d48
[BugFix] Some fixes for custom allreduce kernels ( #2760 )
2024-03-21 23:02:58 -07:00
e90fc21f2e
[Hardware][Neuron] Refactor neuron support ( #3471 )
2024-03-22 01:22:17 +00:00
ea5f14e6ff
[Bugfix][Model] Fix Qwen2 ( #3554 )
2024-03-22 00:18:58 +00:00
b7050ca7df
[BugFix] gemma loading after quantization or LoRA. ( #3553 )
2024-03-21 13:16:57 -07:00
c188ecb080
[Misc] Bump up transformers to v4.39.0 & Remove StarCoder2Config ( #3551 )
...
Co-authored-by: Roy <jasonailu87@gmail.com >
Co-authored-by: Roger Meier <r.meier@siemens.com >
2024-03-21 07:58:12 -07:00
865732342b
[Misc][Log] Add log for tokenizer length not equal to vocabulary size ( #3500 )
2024-03-21 18:07:48 +08:00
4c07dd28c0
[ 🚀 Ready to be merged] Added support for Jais models ( #3183 )
2024-03-21 09:45:24 +00:00
3bbff9e5ab
Fix 1D query issue from _prune_hidden_states ( #3539 )
2024-03-21 08:49:06 +00:00
6ebd02bdef
[PREFIX CACHING FOLLOW UP] OrderedDict-based evictor ( #3431 )
...
Co-authored-by: rsnm2 <rshaw@neuralmagic.com >
Co-authored-by: Luka <luka@paperspace>
2024-03-20 23:20:04 -07:00
523e30ea0c
[BugFix] Hot fix in setup.py for neuron build ( #3537 )
2024-03-20 17:59:52 -07:00
f1c0fc3919
Migrate logits computation and gather to model_runner ( #3233 )
2024-03-20 23:25:01 +00:00
6e435de766
[1/n][Chunked Prefill] Refactor input query shapes ( #3236 )
2024-03-20 14:46:05 -07:00
426ec4ec67
[1/n] Triton sampling kernel ( #3186 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-03-20 14:45:08 -07:00
80e254834d
[Bugfix] Fix ROCm support in CMakeLists.txt ( #3534 )
2024-03-20 21:05:03 +00:00
ba8ae1d84f
Check for _is_cuda() in compute_num_jobs ( #3481 )
2024-03-20 10:06:56 -07:00
84eaa68425
Abort when nvcc command is not found in the PATH ( #3527 )
2024-03-20 09:28:29 -07:00
5ee14494e4
[Misc] Remove cache stream and cache events ( #3461 )
2024-03-20 00:38:53 -07:00
4ad521d8b5
[Core] Add generic typing to LRUCache ( #3511 )
2024-03-20 00:36:09 -07:00
9474e89ba4
[PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled ( #3357 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-03-20 00:11:11 -07:00
20478c4d3a
Use lru_cache for some environment detection utils ( #3508 )
2024-03-19 21:34:15 +00:00
63e8b28a99
[Doc] minor fix of spelling in amd-installation.rst ( #3506 )
2024-03-19 20:32:30 +00:00
cc63d03fbb
Revert "[Core] Cache some utils" ( #3507 )
2024-03-19 13:22:58 -07:00
2a60c9bd17
[Doc] minor fix to neuron-installation.rst ( #3505 )
2024-03-19 13:21:35 -07:00
c614cfee58
Update dockerfile with ModelScope support ( #3429 )
2024-03-19 10:54:59 -07:00
7341c77d69
[BugFix] Avoid initializing CUDA too early ( #3487 )
2024-03-18 23:05:20 -07:00
ef65dcfa6f
[Doc] Add docs about OpenAI compatible server ( #3288 )
2024-03-18 22:05:34 -07:00
6a9c583e73
[Core] print error before deadlock ( #3459 )
2024-03-19 04:06:23 +00:00
b37cdce2b1
[Core] Cache some utils ( #3474 )
2024-03-18 17:14:26 -07:00
b30880a762
[Misc] Update README for the Third vLLM Meetup ( #3479 )
2024-03-18 15:58:38 -07:00
49eedea373
[Core] Zero-copy asdict for InputMetadata ( #3475 )
2024-03-18 22:56:40 +00:00
9fdf3de346
Cmake based build system ( #2830 )
2024-03-18 15:38:33 -07:00
c0c17d4896
[Misc] Fix PR Template ( #3478 )
2024-03-18 15:00:31 -07:00
097aa0ea22
[CI/Build] Fix Bad Import In Test ( #3473 )
2024-03-18 20:28:00 +00:00
482b0adf1b
[Testing] Add test_config.py to CI ( #3437 )
2024-03-18 12:48:45 -07:00
8c654c045f
CI: Add ROCm Docker Build ( #2886 )
2024-03-18 19:33:47 +00:00
9101d832e6
[Bugfix] Make moe_align_block_size AMD-compatible ( #3470 )
2024-03-18 11:26:24 -07:00
93348d9458
[CI] Shard tests for LoRA and Kernels to speed up ( #3445 )
2024-03-17 14:56:30 -07:00
abfc4f3387
[Misc] Use dataclass for InputMetadata ( #3452 )
...
Co-authored-by: youkaichao <youkaichao@126.com >
2024-03-17 10:02:46 +00:00
6b78837b29
Fix setup.py neuron-ls issue ( #2671 )
2024-03-16 16:00:25 -07:00
120157fd2a
Support arbitrary json_object in OpenAI and Context Free Grammar ( #3211 )
2024-03-16 13:35:27 -07:00
8e67598aa6
[Misc] fix line length for entire codebase ( #3444 )
2024-03-16 00:36:29 -07:00
ad50bf4b25
fix lint
2024-03-15 22:23:38 -07:00
cf6ff18246
Fix Baichuan chat template ( #3340 )
2024-03-15 21:02:12 -07:00
14e3f9a1b2
Replace lstrip() with removeprefix() to fix Ruff linter warning ( #2958 )
2024-03-15 21:01:30 -07:00
3123f15138
Fixes the incorrect argument in the prefix-prefill test cases ( #3246 )
2024-03-15 20:58:10 -07:00
413366e9a2
[Misc] PR templates ( #3413 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-03-15 18:25:51 -07:00
10585e035e
Removed Extraneous Print Message From OAI Server ( #3440 )
2024-03-16 00:35:36 +00:00
fb96c1e98c
Asynchronous tokenization ( #2879 )
2024-03-15 23:37:01 +00:00
8fa7357f2d
fix document error for value and v_vec illustration ( #3421 )
2024-03-15 16:06:09 -07:00
a7af4538ca
Fix issue templates ( #3436 )
2024-03-15 21:26:00 +00:00
604f235937
[Misc] add error message in non linux platform ( #3438 )
2024-03-15 21:21:37 +00:00
14b8ae02e7
Fixes the misuse/mixuse of time.time()/time.monotonic() ( #3220 )
...
Signed-off-by: Tao He <sighingnow@gmail.com >
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-03-15 18:25:43 +00:00
03d37f2441
[Fix] Add args for mTLS support ( #3430 )
...
Co-authored-by: declark1 <daniel.clark@ibm.com >
2024-03-15 09:56:13 -07:00
a7c871680e
Fix tie_word_embeddings for Qwen2. ( #3344 )
2024-03-15 09:36:53 -07:00
429284dc37
Fix dist.broadcast stall without group argument ( #3408 )
2024-03-14 23:25:05 -07:00
253a98078a
Add chat templates for ChatGLM ( #3418 )
2024-03-14 23:19:22 -07:00
21539e6856
Add chat templates for Falcon ( #3420 )
2024-03-14 23:19:02 -07:00
b522c4476f
[Misc] add HOST_IP env var ( #3419 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-03-14 21:32:52 -07:00
78b6c4845a
Dynamically configure shared memory size for moe_align_block_size_kernel ( #3376 )
2024-03-14 18:18:07 -07:00
b983ba35bd
fix marlin config repr ( #3414 )
2024-03-14 16:26:19 -07:00
54be8a0be2
Fix assertion failure in Qwen 1.5 with prefix caching enabled ( #3373 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-03-14 13:56:57 -07:00
dfc77408bd
[issue templates] add some issue templates ( #3412 )
2024-03-14 13:16:00 -07:00
c17ca8ef18
Add args for mTLS support ( #3410 )
...
Co-authored-by: Daniel Clark <daniel.clark@ibm.com >
2024-03-14 13:11:45 -07:00
06ec486794
Install flash_attn in Docker image ( #3396 )
2024-03-14 10:55:54 -07:00
8fe8386591
[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 ( #3389 )
2024-03-14 08:11:48 +00:00
a37415c31b
allow user to choose which vLLM metrics to display in grafana ( #3393 )
2024-03-14 06:35:13 +00:00
81653d9688
[Hotfix] [Debug] test_openai_server.py::test_guided_regex_completion ( #3383 )
2024-03-13 17:02:21 -07:00
eeab52a4ff
[FIX] Simpler fix for async engine running on ray ( #3371 )
2024-03-13 14:18:40 -07:00
c33afd89f5
Fix lint ( #3388 )
2024-03-13 13:56:49 -07:00
7e9bd08f60
Add batched RoPE kernel ( #3095 )
2024-03-13 13:45:26 -07:00
ae0ccb4017
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. ( #3350 )
2024-03-13 12:18:25 -07:00
739c350c19
[Minor Fix] Use cupy-cuda11x in CUDA 11.8 build ( #3256 )
2024-03-13 09:43:24 -07:00
ba8dc958a3
[Minor] Fix bias in if to remove ambiguity ( #3259 )
2024-03-13 09:16:55 -07:00
e221910e77
add hf_transfer to requirements.txt ( #3031 )
2024-03-12 23:33:43 -07:00
b167109ba1
[Fix] Fix quantization="gptq" when using Marlin ( #3319 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-03-12 22:51:42 -07:00
602358f8a8
Add kernel for GeGLU with approximate GELU ( #3337 )
2024-03-12 22:06:17 -07:00
49a3c8662b
Fixes #1556 double free ( #3347 )
2024-03-13 00:30:08 +00:00
b0925b3878
docs: Add BentoML deployment doc ( #3336 )
...
Signed-off-by: Sherlock113 <sherlockxu07@gmail.com >
2024-03-12 10:34:30 -07:00
654865e21d
Support Mistral Model Inference with transformers-neuronx ( #3153 )
2024-03-11 13:19:51 -07:00
c9415c19d3
[ROCm] Fix warp and lane calculation in blockReduceSum ( #3321 )
2024-03-11 13:14:07 -07:00
4c922709b6
Add distributed model executor abstraction ( #3191 )
2024-03-11 11:03:45 -07:00
657061fdce
[docs] Add LoRA support information for models ( #3299 )
2024-03-11 00:54:51 -07:00
2f8844ba08
Re-enable the 80 char line width limit ( #3305 )
2024-03-10 19:49:14 -07:00
4b59f00e91
[Fix] Fix best_of behavior when n=1 ( #3298 )
2024-03-10 19:17:46 -07:00
9e8744a545
[BugFix] Fix get tokenizer when using ray ( #3301 )
2024-03-10 19:17:16 -07:00
e4a28e5316
[ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA ( #3262 )
2024-03-10 15:27:45 -07:00
0bba88df03
Enhance lora tests with more layer and rank variations ( #3243 )
2024-03-09 17:14:16 -08:00
8437bae6ef
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling ( #3103 )
2024-03-08 23:32:46 -08:00
f48c6791b7
[FIX] Fix prefix test error on main ( #3286 )
2024-03-08 17:16:14 -08:00
c2c5e0909a
Move model filelocks from /tmp/ to ~/.cache/vllm/locks/ dir ( #3241 )
2024-03-08 13:33:10 -08:00
1cb0cc2975
[FIX] Make flash_attn optional ( #3269 )
2024-03-08 10:52:20 -08:00
99c3cfb83c
[Docs] Fix Unmocked Imports ( #3275 )
2024-03-08 09:58:01 -08:00
1ece1ae829
[Minor Fix] Fix comments in benchmark_serving ( #3252 )
2024-03-07 22:22:59 -08:00
c59e120c55
Feature add lora support for Qwen2 ( #3177 )
2024-03-07 21:58:24 -08:00
d2339d6840
Connect engine healthcheck to openai server ( #3260 )
2024-03-07 16:38:12 -08:00
b35cc93420
Fix auto prefix bug ( #3239 )
2024-03-07 16:37:28 -08:00
8cbba4622c
Possible fix for conflict between Automated Prefix Caching ( #2762 ) and multi-LoRA support ( #1804 ) ( #3263 )
2024-03-07 23:03:22 +00:00
385da2dae2
Measure model memory usage ( #3120 )
2024-03-07 11:42:42 -08:00
2daf23ab0c
Separate attention backends ( #3005 )
2024-03-07 01:45:50 -08:00
cbf4c05b15
Update requirements-dev.txt to include package for benchmarking scripts. ( #3181 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-03-07 08:39:28 +00:00
d3c04b6a39
Add GPTQ support for Gemma ( #3200 )
2024-03-07 08:19:14 +08:00
4cb3b924cd
Add tqdm dynamic_ncols=True ( #3242 )
2024-03-06 22:41:42 +00:00
a33ce60c66
[Testing] Fix core tests ( #3224 )
2024-03-06 01:04:23 -08:00
24aecf421a
[Tests] Add block manager and scheduler tests ( #3108 )
2024-03-05 18:23:34 -08:00
2efce05dc3
[Fix] Avoid pickling entire LLMEngine for Ray workers ( #3207 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com >
2024-03-06 00:17:20 +00:00
8999ec3c16
Store eos_token_id in Sequence for easy access ( #3166 )
2024-03-05 15:35:43 -08:00
05af6da8d9
[ROCm] enable cupy in order to enable cudagraph mode for AMD GPUs ( #3123 )
...
Co-authored-by: lcskrishna <lollachaitanya@gmail.com >
2024-03-04 18:14:53 -08:00
9a4548bae7
Fix the openai benchmarking requests to work with latest OpenAI apis ( #2992 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-03-04 15:51:56 -08:00
ff578cae54
Add health check, make async Engine more robust ( #3015 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com >
2024-03-04 22:01:40 +00:00
22de45235c
Push logprob generation to LLMEngine ( #3065 )
...
Co-authored-by: Avnish Narayan <avnish@anyscale.com >
2024-03-04 19:54:06 +00:00
76e8a70476
[Minor fix] The domain dns.google may cause a socket.gaierror exception ( #3176 )
...
Co-authored-by: guofangze <guofangze@kuaishou.com >
2024-03-04 19:17:12 +00:00
9cbc7e5f3b
enable --gpu-memory-utilization in benchmark_throughput.py ( #3175 )
...
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
2024-03-04 10:37:58 -08:00
27a7b070db
Add document for vllm paged attention kernel. ( #2978 )
2024-03-04 09:23:34 -08:00
901cf4c52b
[Minor Fix] Remove unused code in benchmark_prefix_caching.py ( #3171 )
2024-03-03 22:48:27 -08:00
d0fae88114
[DOC] add setup document to support neuron backend ( #2777 )
2024-03-04 01:03:51 +00:00
17c3103c56
Make it easy to profile workers with nsight ( #3162 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com >
2024-03-03 16:19:13 -08:00
996d095c54
[FIX] Fix styles in automatic prefix caching & add an automatic prefix caching benchmark ( #3158 )
2024-03-03 14:37:18 -08:00
d65fac2738
Add vLLM version info to logs and openai API server ( #3161 )
2024-03-02 21:00:29 -08:00
ce4f5a29fb
Add Automatic Prefix Caching ( #2762 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com >
Co-authored-by: Michael Goin <michael@neuralmagic.com >
2024-03-02 00:50:01 -08:00
baee28c46c
Reorder kv dtype check to avoid nvcc not found error on AMD platform ( #3104 )
2024-03-02 14:34:48 +08:00
29e70e3e88
allow user to choose log level via --log-level instead of fixed 'info'. ( #3109 )
...
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com >
Co-authored-by: Simon Mo <simon.mo@hey.com >
2024-03-01 23:28:41 +00:00
82091b864a
Bump up to v0.3.3 ( #3129 )
2024-03-01 12:58:06 -08:00
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference ( #2497 )
...
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com >
Co-authored-by: alexm <alexm@neuralmagic.com >
2024-03-01 12:47:51 -08:00
90fbf12540
fix relative import path of protocol.py ( #3134 )
...
Co-authored-by: huohuarong <huohuarong@zuoshouyisheng.com >
2024-03-01 19:42:06 +00:00
49d849b3ab
docs: Add tutorial on deploying vLLM model with KServe ( #2586 )
...
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com >
2024-03-01 11:04:14 -08:00
27ca23dc00
Remove exclude_unset in streaming response ( #3143 )
2024-03-01 09:59:06 -08:00
54d3544784
Fix: Output text is always truncated in some models ( #3016 )
2024-03-01 07:52:22 +00:00
703e42ee4b
Add guided decoding for OpenAI API server ( #2819 )
...
Co-authored-by: br3no <breno@veltefaria.de >
Co-authored-by: simon-mo <simon.mo@hey.com >
2024-02-29 22:13:08 +00:00
29a8d6a554
[Fix] Don't deep-copy LogitsProcessors when copying SamplingParams ( #3099 )
2024-02-29 19:20:42 +00:00
2c08ff23c0
Fix building from source on WSL ( #3112 )
2024-02-29 11:13:58 -08:00
bfdcfa6a05
Support starcoder2 architecture ( #3089 )
2024-02-29 00:51:48 -08:00
9289e577ec
add cache_config's info to prometheus metrics. ( #3100 )
2024-02-29 06:15:18 +00:00
a6d471c759
Fix: AttributeError in OpenAI-compatible server ( #3018 )
2024-02-28 22:04:07 -08:00
01a5d18a53
Add Support for 2/3/8-bit GPTQ Quantization Models ( #2330 )
2024-02-28 21:52:23 -08:00
929b4f2973
Add LoRA support for Gemma ( #3050 )
2024-02-28 13:03:28 -08:00
3b7178cfa4
[Neuron] Support inference with transformers-neuronx ( #2569 )
2024-02-28 09:34:34 -08:00
e46fa5d52e
Restrict prometheus_client >= 0.18.0 to prevent errors when importing pkgs ( #3070 )
2024-02-28 05:38:26 +00:00
a8683102cc
multi-lora documentation fix ( #3064 )
2024-02-27 21:26:15 -08:00
71bcaf99e2
Enable GQA support in the prefix prefill kernels ( #3007 )
...
Signed-off-by: Tao He <sighingnow@gmail.com >
2024-02-27 01:14:31 -08:00
8b430d7dea
[Minor] Fix StableLMEpochForCausalLM -> StableLmForCausalLM ( #3046 )
2024-02-26 20:23:50 -08:00
e0ade06d63
Support logit bias for OpenAI API ( #3027 )
2024-02-27 11:51:53 +08:00
4bd18ec0c7
[Minor] Fix type annotation in fused moe ( #3045 )
2024-02-26 19:44:29 -08:00
2410e320b3
fix get_ip error in pure ipv6 environment ( #2931 )
2024-02-26 19:22:16 -08:00
48a8f4a7fd
Support Orion model ( #2539 )
...
Co-authored-by: zhangdacheng <zhangdacheng@ainirobot.com >
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu >
2024-02-26 19:17:06 -08:00
4dd6416faf
Fix stablelm ( #3038 )
2024-02-26 18:31:10 -08:00
c1c0d00b88
Don't use cupy when enforce_eager=True ( #3037 )
2024-02-26 17:33:38 -08:00
d9f726c4d0
[Minor] Remove unused config files ( #3039 )
2024-02-26 17:25:22 -08:00
d6e4a130b0
[Minor] Remove gather_cached_kv kernel ( #3043 )
2024-02-26 15:00:54 -08:00
cfc15a1031
Optimize Triton MoE Kernel ( #2979 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com >
2024-02-26 13:48:56 -08:00
70f3e8e3a1
Add LogProbs for Chat Completions in OpenAI ( #2918 )
2024-02-26 10:39:34 +08:00
ef978fe411
Port metrics from aioprometheus to prometheus_client ( #2730 )
2024-02-25 11:54:00 -08:00
f7c1234990
[Fix] Fix assertion on YaRN model len ( #2984 )
2024-02-23 12:57:48 -08:00
57f044945f
Fix nvcc not found in vllm-openai image ( #2781 )
2024-02-22 14:25:07 -08:00
4caf7044e0
Include tokens from prompt phase in counter_generation_tokens ( #2802 )
2024-02-22 14:00:12 -08:00
6f32cddf1c
Remove Flash Attention in test env ( #2982 )
2024-02-22 09:58:29 -08:00
c530e2cfe3
[FIX] Fix a bug in initializing Yarn RoPE ( #2983 )
2024-02-22 01:40:05 -08:00
fd5dcc5c81
Optimize GeGLU layer in Gemma ( #2975 )
2024-02-21 20:17:52 -08:00
93dc5a2870
chore(vllm): codespell for spell checking ( #2820 )
2024-02-21 18:56:01 -08:00
95529e3253
Use Llama RMSNorm custom op for Gemma ( #2974 )
2024-02-21 18:28:23 -08:00
344020c926
Migrate MistralForCausalLM to LlamaForCausalLM ( #2868 )
2024-02-21 18:25:05 -08:00
5574081c49
Added early stopping to completion APIs ( #2939 )
2024-02-21 18:24:01 -08:00
d7f396486e
Update comment ( #2934 )
2024-02-21 18:18:37 -08:00
8fbd84bf78
Bump up version to v0.3.2 ( #2968 )
...
This version is for more model support. Add support for Gemma models (#2964 ) and OLMo models (#2832 ).
2024-02-21 11:47:25 -08:00
7d2dcce175
Support per-request seed ( #2514 )
2024-02-21 11:47:00 -08:00
dc903e70ac
[ROCm] Upgrade transformers to v4.38.0 ( #2967 )
2024-02-21 09:46:57 -08:00
a9c8212895
[FIX] Add Gemma model to the doc ( #2966 )
2024-02-21 09:46:15 -08:00
c20ecb6a51
Upgrade transformers to v4.38.0 ( #2965 )
2024-02-21 09:38:03 -08:00
5253edaacb
Add Gemma model ( #2964 )
2024-02-21 09:34:30 -08:00
017d9f1515
Add metrics to RequestOutput ( #2876 )
2024-02-20 21:55:57 -08:00
181b27d881
Make vLLM logging formatting optional ( #2877 )
2024-02-20 14:38:55 -08:00
63e2a6419d
[FIX] Fix beam search test ( #2930 )
2024-02-20 14:37:39 -08:00
264017a2bf
[ROCm] include gfx908 as supported ( #2792 )
2024-02-19 17:58:59 -08:00
e433c115bc
Fix vllm:prompt_tokens_total metric calculation ( #2869 )
2024-02-18 23:55:41 -08:00
86fd8bb0ac
Add warning to prevent changes to benchmark api server ( #2858 )
2024-02-18 21:36:19 -08:00
ab3a5a8259
Support OLMo models. ( #2832 )
2024-02-18 21:05:15 -08:00
a61f0521b8
[Test] Add basic correctness test ( #2908 )
2024-02-18 16:44:50 -08:00
537c9755a7
[Minor] Small fix to make distributed init logic in worker look cleaner ( #2905 )
2024-02-18 14:39:00 -08:00
786b7f18a5
Add code-revision config argument for Hugging Face Hub ( #2892 )
2024-02-17 22:36:53 -08:00
8f36444c4f
multi-LoRA as extra models in OpenAI server ( #2775 )
...
How to serve the LoRAs (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py )):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
--model meta-llama/Llama-2-7b-hf \
--enable-lora \
--lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
The above server lists 3 separate values when the user queries `/models`: one for the base served model, and one for each of the specified LoRA modules. In this case sql-lora and sql-lora2 point to the same underlying LoRA, but this need not be the case. LoRA config values take the same values they do in EngineArgs.
No work has been done here to scope client permissions to specific models.
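As a rough illustration of the listing behavior described in this commit message, the sketch below turns `name=path` pairs into the model names a client would see. `parse_lora_modules` and `served_model_ids` are hypothetical helpers written for this example, not vLLM functions, and the path is a placeholder.

```python
from typing import Dict, List


def parse_lora_modules(pairs: List[str]) -> Dict[str, str]:
    """Turn ["name=path", ...] into {name: path}. Hypothetical helper."""
    modules: Dict[str, str] = {}
    for pair in pairs:
        # Split on the first "=" only, so a path containing "=" is preserved.
        name, sep, path = pair.partition("=")
        if not sep or not name or not path:
            raise ValueError(f"expected name=path, got {pair!r}")
        modules[name] = path
    return modules


def served_model_ids(base_model: str, modules: Dict[str, str]) -> List[str]:
    """Base model first, then one entry per LoRA module. Hypothetical helper."""
    return [base_model, *modules]


lora_path = "/path/to/sql-lora"  # placeholder path (assumption)
modules = parse_lora_modules([f"sql-lora={lora_path}", f"sql-lora2={lora_path}"])
model_ids = served_model_ids("meta-llama/Llama-2-7b-hf", modules)
print(model_ids)
```

Two names mapping to one path mirrors the sql-lora / sql-lora2 case above: the listing is keyed by name, so both appear even though they share an adapter.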
2024-02-17 12:00:48 -08:00
185b2c29e2
Defensively copy sampling_params
( #2881 )
...
If the SamplingParams object passed to LLMEngine.add_request() is mutated after it returns, it could affect the async sampling process for that request.
Suggested by @Yard1 https://github.com/vllm-project/vllm/pull/2514#discussion_r1490106059
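The hazard described in this commit message can be sketched with a toy example. `Params` and `Engine` are hypothetical stand-ins rather than vLLM classes, and `copy.deepcopy` is used here only as the simplest way to show the idea; it is an assumption, not necessarily how the actual change copies the object.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Params:
    """Stand-in for a sampling-parameters object (hypothetical)."""
    temperature: float = 1.0
    stop: List[str] = field(default_factory=list)


class Engine:
    """Toy engine that stores requests for later processing (hypothetical)."""

    def __init__(self) -> None:
        self.pending: Dict[str, Params] = {}

    def add_request(self, request_id: str, params: Params) -> None:
        # Copy on entry so later caller-side mutation cannot affect the request.
        self.pending[request_id] = copy.deepcopy(params)


engine = Engine()
params = Params(temperature=0.2, stop=["\n"])
engine.add_request("req-0", params)

# The caller reuses and mutates the same object after add_request() returns.
params.temperature = 1.5
params.stop.append("###")

queued = engine.pending["req-0"]
print(queued.temperature, queued.stop)
```

Without the copy inside `add_request`, the queued request would observe both mutations, which is the failure mode the commit guards against.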
2024-02-17 11:18:04 -08:00