[RFC] Allow elastic agent to fail fast (#99051)

Summary: Today, when a single trainer segfaults, the GPUs on all ranks stay blocked for 5 minutes because of the elastic agents' barrier timeouts.
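
To illustrate the stall described above, here is a minimal sketch of how a barrier with a timeout holds up every participant that reaches it when one peer never arrives. It uses Python's threading.Barrier as a stand-in, which is an assumption for illustration and not the barrier the elastic agent actually uses; simulate, wait_at_exit_barrier, and the shortened timeout are hypothetical names and values.

```python
import threading
import time
from typing import List


def wait_at_exit_barrier(barrier: threading.Barrier, timeout_s: float) -> float:
    """Wait at the barrier and return how long the caller was blocked."""
    start = time.monotonic()
    try:
        barrier.wait(timeout=timeout_s)
    except threading.BrokenBarrierError:
        # A peer never arrived, so the wait ended only when the timeout fired.
        pass
    return time.monotonic() - start


def simulate(num_agents: int, num_arriving: int, timeout_s: float) -> List[float]:
    """Run num_arriving of num_agents participants and report their wait times.

    Toy model only: threads stand in for agents, and the timeout is a
    shortened stand-in for the 5-minute timeout mentioned in the summary.
    """
    barrier = threading.Barrier(num_agents)
    blocked = [0.0] * num_arriving

    def run(index: int) -> None:
        blocked[index] = wait_at_exit_barrier(barrier, timeout_s)

    threads = [threading.Thread(target=run, args=(i,)) for i in range(num_arriving)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return blocked
```

Calling simulate(num_agents=4, num_arriving=3, timeout_s=0.2) leaves each of the three arriving participants blocked for roughly the full timeout, whereas skipping the barrier on the failure path returns immediately.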

Test Plan: Rely on existing tests to validate. Looking for feedback on adding UTs.
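
One possible shape for such a UT, as a minimal sketch: it checks that the terminal-failure path returns without entering the exit barrier. ToyAgent and handle_failure are hypothetical stand-ins written for this example and are not the real SimpleElasticAgent API; a real test would need to drive the agent's actual run loop.

```python
import unittest
from enum import Enum
from unittest import mock


class WorkerState(Enum):
    HEALTHY = "HEALTHY"
    FAILED = "FAILED"


class ToyAgent:
    """Hypothetical stand-in for the agent's failure handling (not the real class)."""

    def __init__(self, remaining_restarts: int = 0) -> None:
        self.remaining_restarts = remaining_restarts
        self.state = WorkerState.HEALTHY

    def _stop_workers(self) -> None:
        pass

    def _restart_workers(self) -> None:
        pass

    def _exit_barrier(self) -> None:
        pass

    def handle_failure(self) -> WorkerState:
        if self.remaining_restarts > 0:
            self.remaining_restarts -= 1
            self._restart_workers()
        else:
            self._stop_workers()
            self.state = WorkerState.FAILED
            # Fail fast: return without waiting at the exit barrier.
        return self.state


class FailFastTest(unittest.TestCase):
    def test_terminal_failure_skips_exit_barrier(self) -> None:
        agent = ToyAgent(remaining_restarts=0)
        with mock.patch.object(ToyAgent, "_exit_barrier") as exit_barrier:
            state = agent.handle_failure()
        self.assertEqual(state, WorkerState.FAILED)
        exit_barrier.assert_not_called()

    def test_failure_with_restarts_left_restarts_workers(self) -> None:
        agent = ToyAgent(remaining_restarts=1)
        with mock.patch.object(ToyAgent, "_restart_workers") as restart:
            state = agent.handle_failure()
        self.assertEqual(state, WorkerState.HEALTHY)
        self.assertEqual(agent.remaining_restarts, 0)
        restart.assert_called_once()
```

Saved as a module, the sketch can be run with python -m unittest.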

Differential Revision: D44929488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99051
Approved by: https://github.com/kurman, https://github.com/kiukchung
Shrikant Nagori
2023-04-25 23:51:15 +00:00
committed by PyTorch MergeBot
parent eddb3a060e
commit 676a23f452
2 changed files with 67 additions and 1 deletions

@@ -903,7 +903,6 @@ class SimpleElasticAgent(ElasticAgent):
                 else:
                     self._stop_workers(self._worker_group)
                     self._worker_group.state = WorkerState.FAILED
-                    self._exit_barrier()
                     return run_result
             elif state == WorkerState.HEALTHY:
                 # membership changes do not count as retries