Mirror of <https://github.com/vllm-project/vllm.git>

Compare commits: `v0.10.2rc1...codex/upda` (1 commit, `2e773e55b3`)
@@ -14,8 +14,14 @@ This document provides an overview of the vLLM architecture.
 vLLM provides a number of entrypoints for interacting with the system. The
 following diagram shows the relationship between them.
 
-:::{image} /assets/design/arch_overview/entrypoints.excalidraw.png
-:alt: Entrypoints Diagram
+:::{mermaid}
+flowchart TD
+    CLI["vllm CLI"] --> APIServer["OpenAI API Server"]
+    LLM["LLM Class"] --> LLMEngine
+    APIServer --> AsyncLLMEngine
+    LLMEngine --> EngineCoreClient
+    AsyncLLMEngine --> EngineCoreClient
+    EngineCoreClient --> EngineCore
 :::
 
 ### LLM Class
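Both entrypoints shown in the hunk above can be exercised directly. Below is a minimal sketch using the offline `LLM` class; the model name is just a placeholder and defaults may vary between releases.

```python
# Offline inference through the LLM class, which wraps the engine internally.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any HF model id; placeholder here
params = SamplingParams(temperature=0.8, max_tokens=32)

outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```

The online path is normally started from the shell, for example `vllm serve facebook/opt-125m`, which launches the OpenAI-compatible API server on top of the asynchronous engine.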
@@ -84,8 +90,14 @@ More details on the API server can be found in the [OpenAI-Compatible Server](#o
 The `LLMEngine` and `AsyncLLMEngine` classes are central to the functioning of
 the vLLM system, handling model inference and asynchronous request processing.
 
-:::{image} /assets/design/arch_overview/llm_engine.excalidraw.png
-:alt: LLMEngine Diagram
+:::{mermaid}
+flowchart LR
+    Processor --> EngineCoreClient
+    EngineCoreClient --> EngineCore
+    EngineCore --> Executor
+    Executor --> Worker
+    Worker --> ModelRunner
+    ModelRunner --> Model
 :::
 
 ### LLMEngine
@@ -104,7 +116,7 @@ processing.
 - **Output Processing**: Processes the outputs generated by the model, decoding the
   token IDs from a language model into human-readable text.
 
-The code for `LLMEngine` can be found in <gh-file:vllm/engine/llm_engine.py>.
+The code for `LLMEngine` can be found in <gh-file:vllm/v1/engine/llm_engine.py>.
 
 ### AsyncLLMEngine
 
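The synchronous `LLMEngine` referenced above is driven by an explicit add-request/step loop. The sketch below mirrors the demo pattern from the vLLM examples; constructor arguments and defaults may differ between versions.

```python
# Minimal synchronous loop around LLMEngine (sketch; argument names may differ).
from vllm import EngineArgs, LLMEngine, SamplingParams

engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))
engine.add_request("req-0", "Hello, my name is", SamplingParams(max_tokens=16))

while engine.has_unfinished_requests():
    for request_output in engine.step():  # one scheduling + execution iteration
        if request_output.finished:
            print(request_output.outputs[0].text)
```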
@@ -116,7 +128,7 @@ can handle multiple concurrent requests and stream outputs to clients.
 The OpenAI-compatible API server uses the `AsyncLLMEngine`. There is also a demo
 API server that serves as a simpler example in <gh-file:vllm/entrypoints/api_server.py>.
 
-The code for `AsyncLLMEngine` can be found in <gh-file:vllm/engine/async_llm_engine.py>.
+The code for `AsyncLLMEngine` can be found in <gh-file:vllm/v1/engine/async_llm.py>.
 
 ## Worker
 
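The asynchronous engine referenced in the hunk above streams incremental results per request. A minimal sketch is shown below; the V1 `AsyncLLM` exposes a similar `generate` interface, though details may differ between versions.

```python
# Streaming generation with the async engine (sketch; APIs may vary by version).
import asyncio

from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

async def main() -> None:
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))
    params = SamplingParams(max_tokens=16)
    # generate() yields incremental RequestOutput objects for this request id.
    async for output in engine.generate("Hello, my name is", params, request_id="req-0"):
        print(output.outputs[0].text)

asyncio.run(main())
```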
@@ -140,15 +152,29 @@ Every model runner object has one model object, which is the actual
 `torch.nn.Module` instance. See [huggingface_integration](#huggingface-integration) for how various
 configurations affect the class we ultimately get.
 
-## Class Hierarchy
+## Class Hierarchy and vLLM V1 Architecture
 
-The following figure shows the class hierarchy of vLLM:
+The following diagram shows how the main classes interact:
 
-> :::{figure} /assets/design/hierarchy.png
-> :align: center
-> :alt: query
-> :width: 100%
-> :::
+:::{mermaid}
+classDiagram
+    class LLMEngine
+    class AsyncLLMEngine
+    class EngineCoreClient
+    class EngineCore
+    class Executor
+    class Worker
+    class ModelRunner
+    class Model
+
+    AsyncLLMEngine --> LLMEngine
+    LLMEngine --> EngineCoreClient
+    EngineCoreClient --> EngineCore
+    EngineCore --> Executor
+    Executor --> Worker
+    Worker --> ModelRunner
+    ModelRunner --> Model
+:::
 
 There are several important design choices behind this class hierarchy:
 
@@ -250,3 +276,32 @@ big problem.
 
 In summary, the complete config object `VllmConfig` can be treated as an
 engine-level global state that is shared among all vLLM classes.
+
+vLLM V1 introduces a streamlined engine that splits responsibilities between a thin frontend and a highly optimized backend. The design is centered on three core layers:
+
+1. **Frontend (`LLMEngine` and `AsyncLLM`)** – user-facing classes that handle tokenization, batching of incoming requests, and postprocessing of generated outputs. These classes interact with the engine core through an `EngineCoreClient`.
+2. **Engine Core** – the inner loop that schedules requests and runs the model. The core lives in `vllm/v1/engine/core.py` and exposes a lightweight API for adding requests, aborting them, or stepping the model.
+3. **Executor and Workers** – the executor (for example `MultiprocExecutor` in <gh-file:vllm/v1/executor/multiproc_executor.py>) manages worker processes. Each worker controls a single accelerator device and hosts a `ModelRunner` (such as `GPUModelRunner` in <gh-file:vllm/v1/worker/gpu_model_runner.py>) which executes the forward pass.
+
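The `VllmConfig` mentioned in the context lines above is the engine-level state that all three layers receive. As an illustration, it can be materialized from `EngineArgs`; this sketch assumes the `create_engine_config()` helper behaves as in current releases.

```python
# Build the engine-level VllmConfig from user-facing EngineArgs (sketch).
from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(
    model="facebook/opt-125m",   # placeholder model id
    tensor_parallel_size=1,
)
vllm_config = engine_args.create_engine_config()

# The config aggregates per-concern sub-configs shared by every layer.
print(type(vllm_config).__name__)      # VllmConfig
print(vllm_config.model_config.model)  # the resolved model name
```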
+### EngineCore and Scheduler
+
+The `EngineCore` maintains a `Scheduler` (<gh-file:vllm/v1/core/sched/scheduler.py>) and a `KVCacheManager` (<gh-file:vllm/v1/core/kv_cache_manager.py>). At each iteration the scheduler chooses how many tokens to process for every active `Request`, supporting features like prefix caching, chunked prefill, and speculative decoding. Scheduled tokens are passed to the model runner, and the resulting `EngineCoreOutputs` include generated tokens and per-request events.
+The scheduler keeps separate waiting and running queues and enforces limits from `VllmConfig` such as `max_num_seqs` and `max_num_batched_tokens`. When GPU memory becomes scarce, it can preempt lower-priority requests, freeing their KV cache blocks before resuming them later. After a step finishes, it records statistics and updates each request's progress based on the returned events.
+
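The scheduling limits named above are exposed as ordinary engine arguments, so they can be tuned without touching the scheduler itself. A small sketch with arbitrary values:

```python
# Tighten the scheduler's batching limits via engine arguments (sketch).
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",
    max_num_seqs=64,              # cap on concurrently running requests per step
    max_num_batched_tokens=4096,  # cap on tokens scheduled in a single step
    enable_prefix_caching=True,   # reuse KV cache blocks across shared prefixes
)
```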
+### Communication via EngineCoreClient
+
+To overlap computation with I/O, the engine core often runs in a separate process. `EngineCoreClient` (<gh-file:vllm/v1/engine/core_client.py>) forwards requests and pulls results over ZeroMQ sockets. When using multiple data-parallel ranks, `DPAsyncMPClient` manages a set of engine-core processes and aggregates their outputs.
+
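The client/core split follows a familiar message-passing pattern. The snippet below only illustrates that pattern with `pyzmq`; the socket types, payloads, and names are hypothetical and do not reflect vLLM's actual wire protocol.

```python
# Illustration only: a request/response hop like the frontend-to-core path.
# Socket types and payloads here are hypothetical, not vLLM's wire protocol.
import zmq

ctx = zmq.Context.instance()
core = ctx.socket(zmq.REP)       # stands in for the engine-core process
core.bind("inproc://engine_core")
client = ctx.socket(zmq.REQ)     # stands in for EngineCoreClient
client.connect("inproc://engine_core")

client.send_pyobj({"request_id": "req-0", "prompt_token_ids": [1, 2, 3]})
request = core.recv_pyobj()               # core receives the new request
core.send_pyobj({"request_id": request["request_id"], "new_token_ids": [42]})
print(client.recv_pyobj())                # client pulls the step's outputs
```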
+### Workers and Model Runners
+
+Workers are defined in <gh-dir:vllm/v1/worker>. The default GPU worker initializes CUDA, sets up distributed communication, and hosts a `GPUModelRunner` which loads the model, prepares KV cache memory, and executes inference kernels. The runner also handles LoRA adapters, attention backends, and CUDA graph capture.
+
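Because each worker owns a single accelerator, parallelism is requested by asking the executor for more workers. A sketch that assumes at least two visible GPUs:

```python
# Tensor parallelism: the executor spawns one worker (and one GPUModelRunner)
# per shard of the model. Sketch; requires >= 2 GPUs to actually run.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",  # placeholder; TP is most useful for larger models
    tensor_parallel_size=2,     # two workers, each controlling one GPU
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```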
+### Output Processing
+
+`OutputProcessor` (<gh-file:vllm/v1/engine/output_processor.py>) converts raw `EngineCoreOutputs` into `RequestOutput` objects, assembling logprobs, speculative tokens, and final texts. When using `AsyncLLM`, an asynchronous loop continuously fetches these outputs and streams them back to callers.
+
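The `RequestOutput` objects assembled here are what both entrypoints ultimately return. A small sketch of inspecting them, including per-token logprobs; field names follow the public `RequestOutput`/`CompletionOutput` API and the values are arbitrary.

```python
# Inspect the RequestOutput produced by the output-processing stage (sketch).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=8, logprobs=3)  # keep top-3 logprobs per token

for request_output in llm.generate(["The sky is"], params):
    completion = request_output.outputs[0]
    print(request_output.request_id, request_output.finished)
    print(completion.text)
    print(completion.token_ids)
    print(completion.logprobs)  # one dict of {token_id: Logprob} per generated token
```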
+This new layering keeps the hot path (`EngineCore`) minimal while letting the frontend focus on user interactions and request bookkeeping. It reduces CPU overhead and simplifies the addition of new optimizations.