Commit Graph

50 Commits

SHA1 Message Date
81ede99ca4 [Core] Deprecating block manager v1 and make block manager v2 default (#8704)
Removes block manager v1. This is the initial piece of the prefix-caching-centric design: to achieve it, we need to simplify the code path so that only the v2 block manager (which has much higher prefix-caching performance) is used. A sketch of the prefix-caching idea follows this entry.
2024-10-17 11:38:15 -05:00
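Below is a minimal, hypothetical sketch of the prefix-caching idea the v2 block manager centers on: full blocks of tokens are keyed by their content so sequences sharing a prefix can reuse the same physical KV-cache blocks instead of recomputing them. The names here (PrefixCachingAllocator, allocate, BLOCK_SIZE) are illustrative assumptions, not vLLM's actual API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)


@dataclass
class CachedBlock:
    block_id: int
    ref_count: int = 0


class PrefixCachingAllocator:
    """Toy allocator: full token blocks are keyed by their content so
    sequences sharing a prefix map to the same physical KV-cache blocks."""

    def __init__(self, num_blocks: int) -> None:
        self.free_ids = list(range(num_blocks))
        self.cache: Dict[Tuple[int, ...], CachedBlock] = {}

    def allocate(self, token_ids: List[int]) -> List[int]:
        table: List[int] = []
        for start in range(0, len(token_ids), BLOCK_SIZE):
            chunk = tuple(token_ids[start:start + BLOCK_SIZE])
            # Only full blocks are cacheable; partial tail blocks are private.
            block = self.cache.get(chunk) if len(chunk) == BLOCK_SIZE else None
            if block is None:
                block = CachedBlock(block_id=self.free_ids.pop())
                if len(chunk) == BLOCK_SIZE:
                    self.cache[chunk] = block
            block.ref_count += 1  # a cache hit means no recomputation of KV
            table.append(block.block_id)
        return table


if __name__ == "__main__":
    alloc = PrefixCachingAllocator(num_blocks=64)
    a = alloc.allocate(list(range(40)))  # 2 full blocks + 1 partial block
    b = alloc.allocate(list(range(40)))  # shares the 2 full-block prefix
    print(a[:2] == b[:2])                # True: prefix blocks are reused
```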
89feb4c84d [SpecDec] Remove Batch Expansion (2/3) (#9298) 2024-10-12 05:13:37 +00:00
9aaf14c62e [misc] add forward context for attention (#9029) 2024-10-03 12:09:42 -07:00
afb050b29d [Core] CUDA Graphs for Multi-Step + Chunked-Prefill (#8645)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-10-02 19:44:39 +00:00
1570203864 [Spec Decode] (1/2) Remove batch expansion (#8839) 2024-10-01 16:04:42 -07:00
c2ec430ab5 [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-09-27 13:32:07 -07:00
a9b15c606f [torch.compile] use empty tensor instead of None for profiling (#8875) 2024-09-27 08:11:32 -07:00
71c60491f2 [Kernel] Build flash-attn from source (#8245) 2024-09-20 23:27:10 -07:00
a6c0f3658d [multi-step] add flashinfer backend (#7928) 2024-09-12 11:16:22 -07:00
7de49aa86c [torch.compile] hide slicing under custom op for inductor (#8384) 2024-09-12 00:11:55 -07:00
22f3a4bc6c [Bugfix] lookahead block table with cuda graph max capture (#8340)
[Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture (#8340)
2024-09-10 16:00:35 -07:00
5faedf1b62 [Spec Decode] Move ops.advance_step to flash attn advance_step (#8224) 2024-09-10 13:18:14 -07:00
3b682179dd [Core] Add AttentionState abstraction (#7663) 2024-08-20 18:50:45 +00:00
f366f6339b [spec decode] [4/N] Move update_flash_attn_metadata to attn backend (#7571)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-16 11:41:56 -07:00
54bd9a03c4 register custom op for flash attn and use from torch.ops (#7536) 2024-08-15 22:38:56 -07:00
cfba4def5d [Bugfix] Fix logit soft cap in flash-attn backend (#7425) 2024-08-12 09:58:28 -07:00
e02ac55617 [Performance] Optimize e2e overheads: Reduce python allocations (#7162) 2024-08-08 21:34:28 -07:00
ef527be06c [MISC] Use non-blocking transfer in prepare_input (#7172) 2024-08-05 23:41:27 +00:00
fb2c1c86c1 [Bugfix] Fix block table for seqs that have prefix cache hits (#7018) 2024-08-02 22:38:15 -07:00
805a8a75f2 [Misc] Support attention logits soft-capping with flash-attn (#7022) 2024-08-01 13:14:37 -07:00
309aaef825 [Bugfix] Fix decode tokens w. CUDA graph (#6757) 2024-07-24 22:33:56 -07:00
0e63494cf3 Add fp8 support to reshape_and_cache_flash (#6667) 2024-07-24 18:36:52 +00:00
e0c15758b8 [Core] Modulize prepare input and attention metadata builder (#6596) 2024-07-23 00:45:24 +00:00
9042d68362 [Misc] Consolidate and optimize logic for building padded tensors (#6541) 2024-07-20 04:17:24 +00:00
2fa4623d9e [Core] Refactor _prepare_model_input_tensors - take 2 (#6164) 2024-07-17 09:37:16 -07:00
978aed5300 [Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081) 2024-07-16 15:31:32 -07:00
543aa48573 [Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) (#4888)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-08 17:12:15 +00:00
dda4811591 [Core] Refactor Worker and ModelRunner to consolidate control plane communication (#5408)
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie <swang@anyscale.com>
Co-authored-by: Stephanie <swang@anyscale.com>
2024-06-25 20:30:03 -07:00
9e74d9d003 Correct alignment in the seq_len diagram. (#5592)
Co-authored-by: Liqian Chen <liqian.chen@deeplang.ai>
2024-06-17 12:05:33 -04:00
6b0511a57b Revert "[Core] Remove unnecessary copies in flash attn backend" (#5478) 2024-06-13 11:22:50 -07:00
5467ac3196 [Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047) 2024-06-09 16:23:30 -04:00
0ab278ca31 [Core] Remove unnecessary copies in flash attn backend (#5138) 2024-06-03 09:39:31 -07:00
8e192ff967 [Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799)
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-24 22:00:52 -07:00
99eff67ba9 [Bugfix][Kernel] Add head size check for attention backend selection (#4944) 2024-05-21 15:33:25 -04:00
b57e6c5949 [Kernel] Add flash-attn back (#4907) 2024-05-19 18:11:30 -07:00
9a31a817a8 [Bugfix] Fix FP8 KV cache support (#4869) 2024-05-16 22:42:29 +00:00
65bf2ac165 [Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681)
This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attention metadata for prefill/decode into a single class and allows it to be sliced when running the attention backend (a sketch follows this entry).

It also refactors subquery_start_loc, which was not refactored in the previous PR.
2024-05-15 14:00:10 +09:00
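A minimal sketch of what a combined prefill/decode metadata object might look like, assuming a single flat token layout with prefill tokens first and decode tokens after. The names (UnifiedAttnMetadata, prefill_slice, decode_slice) are illustrative, not vLLM's actual classes.

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class UnifiedAttnMetadata:
    """Toy metadata covering both prefill and decode tokens of one batch;
    prefill tokens come first, decode tokens follow."""
    slot_mapping: torch.Tensor   # [num_prefill_tokens + num_decode_tokens]
    num_prefill_tokens: int
    num_decode_tokens: int
    seq_lens: Optional[torch.Tensor] = None

    def prefill_slice(self) -> "UnifiedAttnMetadata":
        # View over the leading prefill tokens only.
        return UnifiedAttnMetadata(
            slot_mapping=self.slot_mapping[:self.num_prefill_tokens],
            num_prefill_tokens=self.num_prefill_tokens,
            num_decode_tokens=0,
            seq_lens=self.seq_lens,
        )

    def decode_slice(self) -> "UnifiedAttnMetadata":
        # View over the trailing decode tokens only.
        return UnifiedAttnMetadata(
            slot_mapping=self.slot_mapping[self.num_prefill_tokens:],
            num_prefill_tokens=0,
            num_decode_tokens=self.num_decode_tokens,
            seq_lens=None,
        )


if __name__ == "__main__":
    # 7 prefill tokens and 3 decode tokens prepared through one code path;
    # the attention backend slices out the phase it needs.
    md = UnifiedAttnMetadata(slot_mapping=torch.arange(10),
                             num_prefill_tokens=7, num_decode_tokens=3)
    print(md.prefill_slice().slot_mapping.shape)  # torch.Size([7])
    print(md.decode_slice().slot_mapping.shape)   # torch.Size([3])
```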
8a7cc254a0 Revert "[Kernel] Use flash-attn for decoding (#3648)" (#4820)
The LoRA 3 & 4 tests seem to hit an illegal-memory-access failure after this commit:

[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241

This reverts commit 1356df5.

2024-05-15 11:52:45 +09:00
1356df53bd [Kernel] Use flash-attn for decoding (#3648)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2024-05-13 15:50:33 -07:00
0fca3cdcf2 [Misc] Enhance attention selector (#4751) 2024-05-13 10:47:25 -07:00
89579a201f [Misc] Use vllm-flash-attn instead of flash-attn (#4686) 2024-05-08 13:15:34 -07:00
20cfcdec99 [Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659) 2024-05-08 12:07:05 -07:00
5510cf0e8a [Misc] Add get_name method to attention backends (#4685) 2024-05-08 09:59:31 -07:00
63575bc2e1 [Core][Optimization] change python dict to pytorch tensor (#4607) 2024-05-06 21:30:27 -07:00
3521ba4f25 [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) 2024-05-03 10:20:12 -07:00
32881f3f31 [kernel] fix sliding window in prefix prefill Triton kernel (#4405)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2024-05-02 11:23:37 -07:00
67b4221a61 [Core][5/N] Fully working chunked prefill e2e (#3884) 2024-04-10 17:56:48 -07:00
2ff767b513 Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
01bfb22b41 [CI] Try introducing isort. (#3495) 2024-03-25 07:59:47 -07:00
925f3332ca [Core] Refactor Attention Take 2 (#3462) 2024-03-25 04:39:33 +00:00