mirror of
https://github.com/pytorch/pytorch.git
synced 2025-10-21 13:44:15 +08:00
Prefer dashes over underscores in command-line options. Add `--command-arg-name` to the argument parser. The old underscore arguments (`--command_arg_name`) are kept for backward compatibility.
Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:
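A minimal sketch of how a parser can accept both spellings, assuming plain `argparse`. The option name `--master-port`, its default value, and the helper `build_parser` are illustrative choices and are not taken from the actual patch:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo")
    # The dashed spelling is listed first and the underscore spelling is kept
    # as an alias. argparse derives the destination from the first long
    # option, so both spellings store their value in `args.master_port`.
    parser.add_argument("--master-port", "--master_port", type=int, default=29500)
    return parser


new_style = build_parser().parse_args(["--master-port", "1234"])
old_style = build_parser().parse_args(["--master_port", "1234"])
```

Because both option strings belong to one action, existing scripts that pass the underscore form keep working while the help text leads with the dashed form.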
`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)
```python
class BooleanOptionalAction(Action):
    def __init__(...):
            if option_string.startswith('--'):
                option_string = '--no-' + option_string[2:]
                _option_strings.append(option_string)
```
It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires holding the Shift key, whereas `-` does not.
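To see this behavior directly, here is a small sketch for Python 3.9 or later, where `argparse.BooleanOptionalAction` was added. The option name `--use-cache` is made up for illustration:

```python
import argparse

parser = argparse.ArgumentParser(prog="demo")
# BooleanOptionalAction registers the negated form with a dash: --no-use-cache.
parser.add_argument("--use-cache", action=argparse.BooleanOptionalAction, default=True)

enabled = parser.parse_args(["--use-cache"]).use_cache
disabled = parser.parse_args(["--no-use-cache"]).use_cache
```

The generated negative option follows the dashed convention, which is one reason a dashed positive option reads more consistently next to it.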
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
51 lines
1.9 KiB
ReStructuredText
.. _elastic_train_script:

Train script
-------------

If your train script works with ``torch.distributed.launch``, it will continue
working with ``torchrun``, with these differences:

1. No need to manually pass ``RANK``, ``WORLD_SIZE``,
   ``MASTER_ADDR``, and ``MASTER_PORT``.

2. ``rdzv_backend`` and ``rdzv_endpoint`` can be provided. For most users
   this will be set to ``c10d`` (see `rendezvous <rendezvous.html>`_). The default
   ``rdzv_backend`` creates a non-elastic rendezvous where ``rdzv_endpoint`` holds
   the master address.

3. Make sure your script has ``load_checkpoint(path)`` and
   ``save_checkpoint(path)`` logic. When any number of
   workers fail, all the workers are restarted with the same program
   arguments, so you will lose any progress made since the most recent checkpoint
   (see `elastic launch <run.html>`_).

4. The ``use_env`` flag has been removed. If you were obtaining the local rank by
   parsing the ``--local-rank`` option, you need to get the local rank from the
   environment variable ``LOCAL_RANK`` instead (e.g. ``int(os.environ["LOCAL_RANK"])``).

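
As a rough illustration of items 1 and 4, the sketch below reads the values
that ``torchrun`` exports from the environment instead of from command-line
options. The helper name ``read_dist_env`` and the single-process fallback
defaults are assumptions made for this example; they are not part of the
``torchrun`` API.

.. code-block:: python

  import os


  def read_dist_env(env=None):
      """Collect the distributed settings exported by the launcher."""
      # `env` is injectable for testing; by default the process environment is used.
      env = os.environ if env is None else env
      return {
          # The fallbacks describe a single local process. They are an
          # assumption so the script can also run without a launcher.
          "rank": int(env.get("RANK", "0")),
          "local_rank": int(env.get("LOCAL_RANK", "0")),
          "world_size": int(env.get("WORLD_SIZE", "1")),
          "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
          "master_port": int(env.get("MASTER_PORT", "29500")),
      }


  settings = read_dist_env({"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "4"})

In a real script, ``settings["local_rank"]`` would typically select the device,
for example with ``torch.cuda.set_device(settings["local_rank"])``.
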
Below is an expository example of a training script that checkpoints after each
epoch, so the worst-case progress lost on failure is one full epoch's worth
of training.

.. code-block:: python

  import sys

  import torch.distributed

  # parse_args, load_checkpoint, initialize, train, and save_checkpoint
  # are user-defined helpers.
  def main():
      args = parse_args(sys.argv[1:])
      state = load_checkpoint(args.checkpoint_path)
      initialize(state)

      # torch.distributed.run ensures that this will work
      # by exporting all the env vars needed to initialize the process group
      torch.distributed.init_process_group(backend=args.backend)

      for i in range(state.epoch, state.total_num_epochs):
          for batch in iter(state.dataset):
              train(batch, state.model)

          state.epoch += 1
          save_checkpoint(state)

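
The ``load_checkpoint`` and ``save_checkpoint`` helpers used above are left to
the user. One possible shape for them is sketched below using only the standard
library, so that it runs without PyTorch; a real script would more likely
serialize with ``torch.save`` and ``torch.load``. The ``State`` fields, the
two-argument ``save_checkpoint(state, path)`` signature, and the JSON format
are all assumptions made for illustration.

.. code-block:: python

  import json
  import os
  import tempfile
  from dataclasses import asdict, dataclass


  @dataclass
  class State:
      # Hypothetical minimal training state; real scripts would also
      # store model and optimizer state.
      epoch: int = 0
      total_num_epochs: int = 10


  def save_checkpoint(state, path):
      # Write to a temporary file in the same directory, then rename it into
      # place, so a worker killed mid-write never leaves a truncated file.
      directory = os.path.dirname(os.path.abspath(path))
      fd, tmp_path = tempfile.mkstemp(dir=directory)
      with os.fdopen(fd, "w") as f:
          json.dump(asdict(state), f)
      os.replace(tmp_path, path)


  def load_checkpoint(path):
      # Start from scratch when no checkpoint exists yet (first launch).
      if not os.path.exists(path):
          return State()
      with open(path) as f:
          return State(**json.load(f))

Because all workers are restarted with the same arguments after a failure,
loading from a fixed path at startup is what lets a restarted run resume from
the last saved epoch. In a multi-worker job, typically only one rank would
write the checkpoint.
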
For concrete examples of torchelastic-compliant train scripts, visit
our `examples <examples.html>`_ page.