[bugfix] Fix Qwen3-30B-A3B dp parallel hung issue when running with the dp parallel example (#3287)

### What this PR does / why we need it?
Fix a hang that occurs when running Qwen3-30B-A3B with the data-parallel
(DP) example.
For large models such as Qwen3-30B and above, weight loading alone takes
4 to 5 minutes. The 5-minute timeout in the current example code is
therefore too short: some DP instances are killed prematurely, and the
run eventually hangs in the DP synchronization all-reduce operation.
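
The join-then-kill loop behind this failure can be sketched as follows. This is a minimal illustration, not the example script itself: `join_or_kill` and `JOIN_TIMEOUT_S` are hypothetical names, and the function accepts any objects that expose `join`, `exitcode`, `pid`, and `kill` the way `multiprocessing.Process` does. Deriving the log message from the timeout value is an added assumption that keeps the two from drifting apart.

```python
# Hypothetical constant; 900 seconds matches the new value in this change.
JOIN_TIMEOUT_S = 900


def join_or_kill(procs, timeout_s=JOIN_TIMEOUT_S):
    """Wait for each process; kill any that has not exited after timeout_s.

    Returns 1 if at least one process had to be killed, otherwise 0.
    """
    exit_code = 0
    for proc in procs:
        # join() returns after timeout_s even if the process is still alive.
        proc.join(timeout=timeout_s)
        # exitcode stays None while the process has not terminated.
        if proc.exitcode is None:
            minutes = f"{timeout_s / 60:g}"
            print(
                f"Killing process {proc.pid} that didn't stop "
                f"within {minutes} minutes."
            )
            proc.kill()
            exit_code = 1
    return exit_code
```

With a single timeout parameter, a slow weight load only requires changing one value, and the printed duration follows automatically.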

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

vLLM version: v0.11.0rc3
vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
leo-pony
2025-09-30 15:30:01 +08:00
committed by GitHub
parent a486ff8c11
commit 3a27b15ddc

@@ -244,10 +244,10 @@ if __name__ == "__main__":
         procs.append(proc)
     exit_code = 0
     for proc in procs:
-        proc.join(timeout=300)
+        proc.join(timeout=900)
         if proc.exitcode is None:
             print(
-                f"Killing process {proc.pid} that didn't stop within 5 minutes."
+                f"Killing process {proc.pid} that didn't stop within 15 minutes."
             )
             proc.kill()
             exit_code = 1