Welcome to vLLM

![](./assets/logos/vllm-logo-text-light.png){ align="center" alt="vLLM Light" class="logo-light" width="60%" } ![](./assets/logos/vllm-logo-text-dark.png){ align="center" alt="vLLM Dark" class="logo-dark" width="60%" }

Easy, fast, and cheap LLM serving for everyone


vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantization: GPTQ, AWQ, INT4, INT8, and FP8 (see the sketch after this list)
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill

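All of these optimizations sit behind a small Python API. As a minimal sketch of offline inference (the model name and the AWQ quantization choice below are illustrative assumptions, not recommendations):

```python
from vllm import LLM, SamplingParams

# Load a model for offline batch inference; the model name and the
# quantization setting are illustrative.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# Prompts submitted together are continuously batched by the engine.
prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
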
vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor, pipeline, data, and expert parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server (see the sketch after this list)
  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPU, and AWS Trainium and Inferentia accelerators
  • Prefix caching support
  • Multi-LoRA support

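Because the server speaks the OpenAI API, existing clients work unchanged. A minimal sketch of querying it (the model name is illustrative; 8000 is the server's default port):

```python
from openai import OpenAI

# Assumes a local server was started with, e.g.:
#   vllm serve Qwen/Qwen2.5-1.5B-Instruct
# The api_key is a placeholder; vLLM does not require one by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "What is vLLM?"}],
)
print(response.choices[0].message.content)
```
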
For more information, check out the following: