Mirror of https://github.com/vllm-project/vllm-ascend.git (synced 2025-10-20 13:43:53 +08:00)
[bugfix] Fix Qwen3-30B-A3B dp parallel hung issue when running with the dp parallel example (#3287)
### What this PR does / why we need it?
Fix the Qwen3-30B-A3B data-parallel (DP) hang that occurs when running the DP example. For large models such as Qwen3-30B and above, weight loading alone takes 4 to 5 minutes. The 5-minute timeout in the current example code is therefore too short: some DP instances are killed prematurely, and the run eventually gets stuck in the DP synchronization all-reduce operation.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
NA

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/releases/v0.11.0

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
@@ -244,10 +244,10 @@ if __name__ == "__main__":
         procs.append(proc)
     exit_code = 0
     for proc in procs:
-        proc.join(timeout=300)
+        proc.join(timeout=900)
         if proc.exitcode is None:
             print(
-                f"Killing process {proc.pid} that didn't stop within 5 minutes."
+                f"Killing process {proc.pid} that didn't stop within 15 minutes."
             )
             proc.kill()
             exit_code = 1