mirror of
https://github.com/vllm-project/vllm.git
synced 2025-10-20 14:53:52 +08:00
[Docs] Document the extra memory footprint overhead when using EPLB (#24537)
Signed-off-by: Tyler Michael Smith <tyler@neuralmagic.com>
This commit is contained in:
committed by
GitHub
parent
9dbefd88e9
commit
8b83b93739
@ -156,6 +156,13 @@ vllm serve Qwen/Qwen3-30B-A3B \
|
||||
- **Default**: Each EP rank has `NUM_TOTAL_EXPERTS ÷ NUM_EP_RANKS` experts
|
||||
- **With redundancy**: Each EP rank has `(NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS` experts
|
||||
|
||||
### Memory Footprint Overhead
|
||||
|
||||
EPLB uses redundant experts to that need to fit in GPU memory. This means that EPLB may not be a good fit for memory constrained environments or when KV cache space is at a premium.
|
||||
|
||||
This overhead equals `NUM_MOE_LAYERS * BYTES_PER_EXPERT * (NUM_TOTAL_EXPERTS + NUM_REDUNDANT_EXPERTS) ÷ NUM_EP_RANKS`.
|
||||
For DeepSeekV3, this is approximately `2.4 GB` for one redundant expert per rank.
|
||||
|
||||
### Example Command
|
||||
|
||||
Single node deployment with EPLB enabled:
|
||||
|
Reference in New Issue
Block a user