[doc] format fix (#10789)
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
@ -25,7 +25,7 @@ With this mapping, we can add another indirection in vLLM’s KV cache managemen
 This design achieves automatic prefix caching without needing to maintain a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed independently, which lets us manage the KV cache like an ordinary cache in an operating system.


-# Generalized Caching Policy
+## Generalized Caching Policy

 Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can be used directly for the new request without recomputation. However, the total KV cache space is limited, and we have to decide which KV blocks to keep or evict when the cache is full.

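The hash-table idea in the hunk above can be illustrated with a toy example. This is only a sketch under stated assumptions, not vLLM's implementation: it assumes each block is keyed by a hash of its own tokens chained with the hash of all preceding blocks, and the names `BLOCK_SIZE`, `block_hashes`, and `PrefixCache` are hypothetical. Eviction is left out.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical block size, kept small for readability


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with the hash of everything before it."""
    hashes = []
    parent = b""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest  # chaining makes the key depend on the whole prefix
    return hashes


class PrefixCache:
    """Toy hash table from block hash to a stand-in for a cached KV block."""

    def __init__(self):
        self.table = {}

    def match_and_store(self, token_ids):
        """Count leading blocks that were already cached, then cache the rest."""
        hits = 0
        still_matching = True
        for h in block_hashes(token_ids):
            if still_matching and h in self.table:
                hits += 1
            else:
                still_matching = False
                self.table[h] = object()  # placeholder for real KV data
        return hits


cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
first = cache.match_and_store(system_prompt + [9, 10])
second = cache.match_and_store(system_prompt + [11, 12, 13, 14])
print(first, second)  # prints "0 2"
```

The second request reuses the two blocks that cover the shared system prompt. Because every entry is keyed independently, no tree has to be maintained between the blocks.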
@ -4,7 +4,7 @@ Installation with Intel® Gaudi® AI Accelerators
 This README provides instructions on running vLLM with Intel Gaudi devices.

 Requirements and Installation
-=============================
+-----------------------------

 Please follow the instructions provided in the `Gaudi Installation
 Guide <https://docs.habana.ai/en/latest/Installation_Guide/index.html>`__
@ -13,7 +13,7 @@ please follow the methods outlined in the `Optimizing Training Platform
 Guide <https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html>`__.

 Requirements
-------------
+~~~~~~~~~~~~

 - OS: Ubuntu 22.04 LTS
 - Python: 3.10
@ -22,7 +22,7 @@ Requirements


 Quick start using Dockerfile
-----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 .. code:: console

    $ docker build -f Dockerfile.hpu -t vllm-hpu-env .
@ -34,10 +34,10 @@ Quick start using Dockerfile


 Build from source
------------------
+~~~~~~~~~~~~~~~~~

 Environment verification
-~~~~~~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^^^^^^

 To verify that the Intel Gaudi software was correctly installed, run:

@ -53,7 +53,7 @@ Verification <https://docs.habana.ai/en/latest/Installation_Guide/SW_Verificatio
 for more details.

 Run Docker Image
-~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^

 It is highly recommended to use the latest Docker image from Intel Gaudi
 vault. Refer to the `Intel Gaudi
@ -68,7 +68,7 @@ Use the following commands to run a Docker image:
    $ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest

 Build and Install vLLM
-~~~~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^^^^

 To build and install vLLM from source, run:

@ -90,7 +90,7 @@ Currently, the latest features and performance optimizations are developed in Ga


 Supported Features
-==================
+------------------

 - `Offline batched
   inference <https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference>`__
@ -107,7 +107,7 @@ Supported Features
 - Attention with Linear Biases (ALiBi)

 Unsupported Features
-====================
+--------------------

 - Beam search
 - LoRA adapters
@ -115,7 +115,7 @@ Unsupported Features
 - Prefill chunking (mixed-batch inferencing)

 Supported Configurations
-========================
+------------------------

 The following configurations have been validated to function with
 Gaudi2 devices. Configurations that are not listed may or may not work.
@ -152,10 +152,10 @@ Gaudi2 devices. Configurations that are not listed may or may not work.
 with tensor parallelism on 8x HPU, BF16 datatype with random or greedy sampling

 Performance Tuning
-==================
+------------------

 Execution modes
----------------
+~~~~~~~~~~~~~~~

 Currently, vLLM for HPU supports four execution modes, depending on the selected HPU PyTorch Bridge backend (via the ``PT_HPU_LAZY_MODE`` environment variable) and the ``--enforce-eager`` flag.

@ -184,7 +184,7 @@ Currently in vLLM for HPU we support four execution modes, depending on selected


 Bucketing mechanism
--------------------
+~~~~~~~~~~~~~~~~~~~

 Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. `Intel Gaudi Graph Compiler <https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime>`__ is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
 In a dynamic inference serving scenario, there is a need to minimize the number of graph compilations and reduce the risk of graph compilation occurring during server runtime. Currently this is achieved by "bucketing" the model's forward pass across two dimensions - ``batch_size`` and ``sequence_length``.
@ -233,7 +233,7 @@ As an example, if a request of 3 sequences, with max sequence length of 412 come
 Bucketing is transparent to the client - padding in the sequence length dimension is never returned to the client, and padding in the batch dimension does not create new requests.

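The padding behaviour described in this hunk can be sketched as a lookup of the smallest bucket that fits. This is a minimal illustration, not vLLM's code: the bucket boundaries in `BATCH_BUCKETS` and `SEQ_BUCKETS` and the helper names are assumptions made up for the example.

```python
from bisect import bisect_left

# Hypothetical bucket boundaries, sorted ascending.
BATCH_BUCKETS = [1, 2, 4]
SEQ_BUCKETS = [128, 256, 384, 512, 640, 768, 896, 1024]


def pick_bucket(value, buckets):
    """Return the smallest bucket that is >= value."""
    idx = bisect_left(buckets, value)
    if idx == len(buckets):
        raise ValueError(f"{value} exceeds the largest bucket {buckets[-1]}")
    return buckets[idx]


def padded_shape(batch_size, seq_len):
    """Map a request shape onto the (batch, sequence) bucket it runs in."""
    return pick_bucket(batch_size, BATCH_BUCKETS), pick_bucket(seq_len, SEQ_BUCKETS)


print(padded_shape(3, 412))  # prints "(4, 512)"
```

With these assumed boundaries, a batch of 3 sequences whose longest sequence has 412 tokens is executed with shape (4, 512), so only shapes that were compiled ahead of time ever reach the device.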
 Warmup
-------
+~~~~~~

 Warmup is an optional, but highly recommended step occurring before vLLM server starts listening. It executes a forward pass for each bucket with dummy data. The goal is to pre-compile all graphs and not incur any graph compilation overheads within bucket boundaries during server runtime. Each warmup step is logged during vLLM startup:

@ -257,7 +257,7 @@ This example uses the same buckets as in *Bucketing mechanism* section. Each out
 Compiling all the buckets might take some time and can be turned off with the ``VLLM_SKIP_WARMUP=true`` environment variable. Keep in mind that if you do that, you may face graph compilations when a given bucket is executed for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment.

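As a rough illustration of the trade-off above, the following sketch enumerates the shapes a warmup phase would have to run, and returns nothing when ``VLLM_SKIP_WARMUP=true`` is set. The function name `warmup_shapes`, the bucket values, and the exact way the variable is parsed are assumptions for this example and may differ from what vLLM does.

```python
import os
from itertools import product


def warmup_shapes(batch_buckets, seq_buckets, env=None):
    """List every (batch, sequence) bucket a warmup pass would execute."""
    env = os.environ if env is None else env
    # Assumed parsing: only the literal value "true" (any case) skips warmup.
    if env.get("VLLM_SKIP_WARMUP", "false").lower() == "true":
        return []
    return list(product(batch_buckets, seq_buckets))


shapes = warmup_shapes([1, 2, 4], [128, 256], env={})
print(len(shapes))  # prints "6"
```

The number of shapes grows with the product of both bucket lists, which is why warmup can take a while and why skipping it moves that compilation cost to the first request that hits each bucket.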
 HPU Graph capture
------------------
+~~~~~~~~~~~~~~~~~

 `HPU Graphs <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html>`__ are currently the most performant execution method of vLLM on Intel Gaudi. When HPU Graphs are enabled, execution graphs will be traced (recorded) ahead of time (after performing warmup), to be later replayed during inference, significantly reducing host overheads. Recording can take large amounts of memory, which needs to be taken into account when allocating KV cache. Enabling HPU Graphs will impact the number of available KV cache blocks, but vLLM provides user-configurable variables to control memory management.

@ -321,7 +321,7 @@ Each described step is logged by vLLM server, as follows (negative values corres


 Recommended vLLM Parameters
----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~

 - We recommend running inference on Gaudi 2 with ``block_size`` of 128
   for BF16 data type. Using default values (16, 32) might lead to
@ -333,7 +333,7 @@ Recommended vLLM Parameters
 If you encounter out-of-memory issues, see the troubleshooting section.

 Environment variables
----------------------
+~~~~~~~~~~~~~~~~~~~~~

 **Diagnostic and profiling knobs:**

@ -380,7 +380,7 @@ Additionally, there are HPU PyTorch Bridge environment variables impacting vLLM
 - ``PT_HPU_ENABLE_LAZY_COLLECTIVES``: required to be ``true`` for tensor parallel inference with HPU Graphs

 Troubleshooting: Tweaking HPU Graphs
-====================================
+------------------------------------

 If you experience device out-of-memory issues or want to attempt
 inference at higher batch sizes, try tweaking HPU Graphs by following