[Doc] Add caution for API server scale-out (#23550)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@@ -196,6 +196,13 @@ vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
!!! note
    API server scale-out is only available for online inference.

!!! warning
    By default, 8 CPU threads are used in each API server to load media items (e.g. images)
    from request data.

    If you apply API server scale-out, consider adjusting `VLLM_MEDIA_LOADING_THREAD_COUNT`
    to avoid CPU resource exhaustion.

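    For instance, with 4 API servers the default would spawn 4 x 8 = 32 media loading
    threads in total. A minimal sketch (the thread count of 4 here is only an illustrative
    value, not a recommendation; tune it to your host's CPU budget):

    ```bash
    # Cap each API server at 4 media loading threads, so 4 servers use
    # at most 16 threads for media loading instead of the default 32.
    VLLM_MEDIA_LOADING_THREAD_COUNT=4 \
        vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
    ```
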
!!! note
    [Multi-modal processor cache](#processor-cache) is disabled when API server scale-out is enabled
    because it requires a one-to-one correspondence between API and engine core processes.