Files

Harry Mellor 2b61d2e22f [Docs] Remove in-tree Gaudi install instructions (#23628 )

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

2025-08-27 09:22:21 -07:00

api

[Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs (#23053 )

2025-08-18 09:52:00 +00:00

assets

[doc] Hybrid KV Cache Manager design doc (#22688 )

2025-08-26 20:19:05 +00:00

cli

[Docs] Add comprehensive CLI reference for all large vllm subcommands (#22601 )

2025-08-11 00:13:33 -07:00

community

Fix pre-commit on main (#23747 )

2025-08-27 06:39:48 -07:00

configuration

[Model] Interface to enable batch-level DP support (#23733 )

2025-08-27 06:41:22 -07:00

contributing

[Refactor] Define MultiModalKwargsItems separate from MultiModalKwargs (#23053 )

2025-08-18 09:52:00 +00:00

deployment

[Doc: ]fix various typos in multiple files (#23487 )

2025-08-25 00:04:04 +00:00

design

[doc] Hybrid KV Cache Manager design doc (#22688 )

2025-08-26 20:19:05 +00:00

examples

[Docs] Fix broken links to docs/api/summary.md (#23637 )

2025-08-26 13:00:18 +00:00

features

[Docs] Move quant supported hardware table to README (#23663 )

2025-08-26 22:26:46 +00:00

getting_started

[Docs] Remove in-tree Gaudi install instructions (#23628 )

2025-08-27 09:22:21 -07:00

mkdocs

[Docs] Fix math rendering in docs (#23676 )

2025-08-26 18:47:08 -07:00

models

[Model] Enable native HF format InternVL support (#23742 )

2025-08-27 14:45:17 +00:00

serving

[Docs] Rename “Distributed inference and serving” to “Parallelism & Scaling” (#22466 )

2025-08-08 19:26:21 +00:00

training

Add Unsloth to RLHF.md (#21636 )

2025-07-25 17:06:48 -07:00

usage

[V1] [Hybrid] Disable prefix caching by default for hybrid or mamba-based models (#23716 )

2025-08-27 12:51:54 +00:00

.nav.yml

[Docs] Improve docs navigation (#22720 )

2025-08-12 04:25:55 -07:00

README.md

[Docs] Hide the navigation and toc sidebars on home page (#22749 )

2025-08-12 17:12:26 -07:00

README.md

hide

navigation

toc

Welcome to vLLM

![](./assets/logos/vllm-logo-text-light.png){ align="center" alt="vLLM Light" class="logo-light" width="60%" } ![](./assets/logos/vllm-logo-text-dark.png){ align="center" alt="vLLM Dark" class="logo-dark" width="60%" }

Easy, fast, and cheap LLM serving for everyone

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

Where to get started with vLLM depends on the type of user. If you are looking to:

Run open-source models on vLLM, we recommend starting with the Quickstart Guide
Build applications with vLLM, we recommend starting with the User Guide
Build vLLM, we recommend starting with Developer Guide

For information about the development of vLLM, see:

vLLM is fast with:

State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Fast model execution with CUDA/HIP graph
Quantization: GPTQ, AWQ, INT4, INT8, and FP8
Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
Speculative decoding
Chunked prefill

vLLM is flexible and easy to use with:

Seamless integration with popular HuggingFace models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
Tensor, pipeline, data and expert parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server
Support NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, IBM Power CPUs, TPU, and AWS Trainium and Inferentia Accelerators.
Prefix caching support
Multi-LoRA support

For more information, check out the following:

vLLM announcing blog post (intro to PagedAttention)
vLLM paper (SOSP 2023)
How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.
vLLM Meetups