[Doc] Add caution for API server scale-out (#23550)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@@ -196,6 +196,13 @@ vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
!!! note
    API server scale-out is only available for online inference.

!!! warning
    By default, 8 CPU threads are used in each API server to load media items (e.g. images)
    from request data.

    If you apply API server scale-out, consider adjusting `VLLM_MEDIA_LOADING_THREAD_COUNT`
    to avoid CPU resource exhaustion.

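    For instance, with 4 API servers the default would spawn 4 x 8 = 32 media loading
    threads in total. A minimal sketch (the thread count of 4 here is only an illustrative
    value, not a recommendation; tune it to your host's CPU budget):

    ```bash
    # Cap each API server at 4 media loading threads, so 4 servers use
    # at most 16 threads for media loading instead of the default 32.
    VLLM_MEDIA_LOADING_THREAD_COUNT=4 \
        vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
    ```
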
!!! note
    [Multi-modal processor cache](#processor-cache) is disabled when API server scale-out is enabled
    because it requires a one-to-one correspondence between API and engine core processes.