mirror of
https://github.com/pytorch/pytorch.git
synced 2025-10-20 21:14:14 +08:00
Torchelastic: populate __init__.py with failover documentation
Summary: Torchelastic: populate __init__.py with failover documentation
Test Plan: {F613772684}
Reviewed By: cbalioglu
Differential Revision: D28243715
fbshipit-source-id: aeed8d3ddd2d27ef86d837e7e3ebfa7a0b80a07d
This commit is contained in:
committed by
Facebook GitHub Bot
parent
ad31aa652c
commit
cb7197ce3f
@ -5,3 +5,73 @@
|
||||
#
|
||||
# This source code is licensed under the BSD-style license found in the
|
||||
# LICENSE file in the root directory of this source tree.
|
||||
|
||||
"""
|
||||
|
||||
Torchelastic agent and user worker failover contract:
|
||||
|
||||
**TL;DR;**:
|
||||
|
||||
* TE(torchelastic) expects user workers to finish with the 5 minutes drift
|
||||
* It is better to design DDP app to fail for all workers, rather than a single one.
|
||||
* TE does not synchronize number of restarts between agents
|
||||
* TE re-rendezvous does not trigger restart decrease
|
||||
* When a single agent finishes its job(successfully or not), it will close rendezvous.
|
||||
If other agents still have workers in progress, they will be terminated.
|
||||
* Based on above, scale down does not work if at least single agent finishes the job.
|
||||
* When Scale up is detected by agents, it will not decrease ``max_restarts``
|
||||
|
||||
|
||||
In general TE(torchelastic) can launch arbitrary user code, but there is some
|
||||
clarifications need to be done around what failover mechanism torchelastic
|
||||
provides and what failover mechanism it expects from user workers.
|
||||
|
||||
Torchelastic currently supports DDP style applications. That means that
|
||||
TE expects *ALL* workers finish approximately at the same time. In practice,
|
||||
it is nearly to impossible to guarantee that all workers in arbitrary
|
||||
DDP application finish at the time, so TE provides a finalization barrier
|
||||
that waits for TIMEOUT(5 minutes) for worker finalization.
|
||||
|
||||
**Worker Failure**
|
||||
|
||||
When worker fails, TE will check the number of restarts
|
||||
available, if there is more than 0 restarts, TE will start a new rendezvous
|
||||
round and restart the worker process. New rendezvous round will other
|
||||
TE agents to terminate their workers.
|
||||
|
||||
.. note:: The TE agent does not synchronize restarts between themselves.
|
||||
When a single agent performs restart, it will trigger a local ``max_restarts``
|
||||
decrease, other agent will not decrease their ``max_restarts``.
|
||||
the user to run the distributed application locally on a dev host.
|
||||
|
||||
A single worker failure can cause the whole cluster to fail:
|
||||
If a single worker is constantly failing, it will cause the TE agent
|
||||
``max_restarts`` to go to zero. This will cause an agent to finish its
|
||||
work and close rendezvous. If there are any other workers on different
|
||||
agents, they will be terminated.
|
||||
|
||||
|
||||
**Re-Rendezvous**
|
||||
|
||||
Re-rendezvous occurs when TE agents detect a new node
|
||||
trying to joint a cluster. TE will not decrease ``max_restarts``. TE agents
|
||||
will terminate its workers and start a new rendezvous round.
|
||||
|
||||
Note about DynamicRendezvous(etcd-v2, c10d-experimental): If the rendezvous
|
||||
has already max_nodes, the new node won't be added to the wait list right
|
||||
away since there is no need to tear down a rendezvous that is already fully
|
||||
utilized. The new node will wait until its timeout (600 secs by default)
|
||||
and periodically check the number of participants. If the number becomes
|
||||
less than max_nodes, it will be added to the wait list; otherwise, it will time out after 600 secs.
|
||||
|
||||
*Scale up event*. When scale up event happens, torchelastic rendezvous
|
||||
will detect that there are new nodes trying to join. Torchelastic agent
|
||||
will stop all workers and perform re-rendezvous. Note: when scale up event
|
||||
happens, *``max_restarts``* will *not* decrease.
|
||||
|
||||
*Scale down event*. When scale down event happens, rendezvous will not
|
||||
notify the torchelastic agent about it. If TE agent launched with ``max_restarts=0`` ,
|
||||
it relies on the underlying scheduler to handle job restart. If the ``max_restarts>0`` ,
|
||||
TE agent will terminate workers and start a new rdzv round, which is a *Scale up event*.
|
||||
|
||||
"""
|
||||
|
||||
Reference in New Issue
Block a user