pytorch/docs/source/elastic/train_script.rst
Xuehai Pan a229b4526f [BE] Prefer dash over underscore in command-line options (#94505)
Prefer dashes over underscores in command-line options: add `--command-arg-name` variants to the argument parsers. The old underscored arguments (`--command_arg_name`) are kept for backward compatibility.
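
As a minimal sketch of the pattern (not the actual diff; `--master-port` is just an illustrative option name), `argparse` accepts multiple option strings for one argument, so both spellings can map to the same destination:

```python
import argparse

parser = argparse.ArgumentParser()
# Register the dashed spelling first (it becomes the canonical form in
# --help output) and keep the underscored spelling as an alias for
# existing scripts.
parser.add_argument("--master-port", "--master_port", type=int, default=29500)

# Both spellings populate the same attribute: argparse derives the dest
# "master_port" from the first long option, converting dashes to underscores.
print(parser.parse_args(["--master-port", "12345"]).master_port)  # 12345
print(parser.parse_args(["--master_port", "12345"]).master_port)  # 12345
```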

Both dashes and underscores are used across the PyTorch codebase: some argument parsers accept only dashed options, others only underscored ones. For example, the `torchrun` utility for distributed training accepts only underscored arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:

`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)

```python
class BooleanOptionalAction(Action):
    def __init__(...):
        _option_strings = []
        for option_string in option_strings:
            _option_strings.append(option_string)

            if option_string.startswith('--'):
                option_string = '--no-' + option_string[2:]
                _option_strings.append(option_string)
```

It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires holding the Shift key, whereas `-` does not.
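
A quick demonstration of that behavior (requires Python 3.9+, where `BooleanOptionalAction` was added):

```python
import argparse

parser = argparse.ArgumentParser()
# BooleanOptionalAction registers both --verbose and the auto-generated
# negative form --no-verbose (dashed, not underscored).
parser.add_argument("--verbose", action=argparse.BooleanOptionalAction, default=True)

print(parser.parse_args([]).verbose)                # True (default)
print(parser.parse_args(["--no-verbose"]).verbose)  # False
```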

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
2023-02-09 20:16:49 +00:00


.. _elastic_train_script:

Train script
-------------

If your train script works with ``torch.distributed.launch``, it will continue
to work with ``torchrun``, with these differences:

1. No need to manually pass ``RANK``, ``WORLD_SIZE``,
   ``MASTER_ADDR``, and ``MASTER_PORT``; ``torchrun`` exports them as
   environment variables (see the sketch after this list).

2. ``rdzv_backend`` and ``rdzv_endpoint`` can be provided. For most users
   this will be set to ``c10d`` (see `rendezvous <rendezvous.html>`_). The default
   ``rdzv_backend`` creates a non-elastic rendezvous where ``rdzv_endpoint`` holds
   the master address.

3. Make sure you have ``load_checkpoint(path)`` and
   ``save_checkpoint(path)`` logic in your script. When any number of
   workers fail, we restart all the workers with the same program
   arguments, so you will lose progress up to the most recent checkpoint
   (see `elastic launch <run.html>`_).

4. The ``use_env`` flag has been removed. If you were getting the local rank by
   parsing the ``--local-rank`` option, you now need to read it from the
   environment variable ``LOCAL_RANK`` (e.g. ``int(os.environ["LOCAL_RANK"])``),
   as shown below.
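
For illustration, here is a minimal sketch of what a worker launched by
``torchrun`` can rely on; the environment variable names are the ones
``torchrun`` exports, while the launch command and ``gloo`` backend are just
example choices:

.. code-block:: python

    # Launched e.g. with: torchrun --standalone --nproc-per-node=2 train.py
    import os

    import torch.distributed as dist

    # torchrun exports these for every worker, so no --local-rank style
    # argument or manual MASTER_ADDR/MASTER_PORT plumbing is needed.
    local_rank = int(os.environ["LOCAL_RANK"])
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # MASTER_ADDR and MASTER_PORT are exported too, so the default
    # env:// init method works without extra arguments.
    dist.init_process_group(backend="gloo")
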
Below is an expository example of a training script that checkpoints on each
epoch, hence the worst-case progress lost on failure is one full epoch's worth
of training.

.. code-block:: python

    def main():
        args = parse_args(sys.argv[1:])
        state = load_checkpoint(args.checkpoint_path)
        initialize(state)

        # torch.distributed.run ensures that this will work
        # by exporting all the env vars needed to initialize the process group
        torch.distributed.init_process_group(backend=args.backend)

        for i in range(state.epoch, state.total_num_epochs):
            for batch in iter(state.dataset):
                train(batch, state.model)
            state.epoch += 1
            save_checkpoint(state)
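
One possible shape for the checkpoint helpers used above (a sketch only: it
assumes ``state`` is picklable, and ``make_fresh_state`` is a hypothetical
factory for an initial ``State``):

.. code-block:: python

    import os

    import torch

    def save_checkpoint(state, path):
        # Write to a temporary file and rename, so that a worker killed
        # mid-write cannot corrupt the checkpoint restored after a restart.
        tmp_path = path + ".tmp"
        torch.save(state, tmp_path)
        os.replace(tmp_path, path)

    def load_checkpoint(path):
        # No checkpoint yet means a fresh run starting at epoch 0.
        if os.path.exists(path):
            return torch.load(path, map_location="cpu")
        return make_fresh_state()  # hypothetical factory for a new State
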
For concrete examples of torchelastic-compliant train scripts, visit
our `examples <examples.html>`_ page.