Mirror of https://github.com/pytorch/pytorch.git (synced 2025-10-21 05:34:18 +08:00)
[RFC] Allow elastic agent to fail fast (#99051)
Summary: Today, when a single trainer segfaults, the GPUs on all ranks stay blocked for 5 minutes because of elastic agent barrier timeouts.

Test Plan: Rely on existing tests to validate. Looking for feedback on adding unit tests.

Differential Revision: D44929488
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99051
Approved by: https://github.com/kurman, https://github.com/kiukchung
Committed by: PyTorch MergeBot
Parent: eddb3a060e
Commit: 676a23f452
@@ -903,7 +903,6 @@ class SimpleElasticAgent(ElasticAgent):
                 else:
                     self._stop_workers(self._worker_group)
                     self._worker_group.state = WorkerState.FAILED
-                    self._exit_barrier()
                     return run_result
             elif state == WorkerState.HEALTHY:
                 # membership changes do not count as retries
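The cost described in the summary can be illustrated with a toy model. This is a minimal sketch using only the Python standard library, not the torch elastic implementation: `shutdown_failed_agent` is a hypothetical helper, `threading.Barrier` stands in for the agent's exit barrier, and a 0.5 second timeout stands in for the 5 minutes mentioned above. It assumes the peer never reaches the barrier, so the wait can only end by timing out.

```python
import threading
import time


def shutdown_failed_agent(barrier, timeout_s, wait_on_barrier):
    """Hypothetical helper: return the seconds spent shutting down after a failure."""
    start = time.monotonic()
    if wait_on_barrier:
        try:
            # Blocks until every party arrives or the timeout elapses.
            barrier.wait(timeout=timeout_s)
        except threading.BrokenBarrierError:
            # Raised on timeout: the peer never arrived.
            pass
    return time.monotonic() - start


TIMEOUT_S = 0.5  # stand-in for the 5-minute barrier timeout

# Two-party barriers where the second party never shows up.
slow = shutdown_failed_agent(threading.Barrier(2), TIMEOUT_S, wait_on_barrier=True)
fast = shutdown_failed_agent(threading.Barrier(2), TIMEOUT_S, wait_on_barrier=False)

print(f"with barrier wait: {slow:.2f}s, fail fast: {fast:.2f}s")
```

In this model the barrier path always pays the full timeout, while the fail-fast path returns as soon as the failure is recorded.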