The current call passes `['/actual/path']` to `os.walk`, which does not point to an existing path and thus silently results in an empty traversal.
There is an unused function just above that handles this correctly, which is presumably what was meant to be called.
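For reference, `os.walk` swallows errors for a path that does not exist (scandir errors go to the `onerror` callback, which is `None` by default), so the traversal is silently empty rather than raising:
```python
import os

# A path that does not exist yields no entries instead of raising an error.
print(list(os.walk("/no/such/path")))  # -> []
```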
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126103
Approved by: https://github.com/suo
Observed Problem
---------------------
When `torchrun` has finished running the main trainer function (aka entrypoint/user function) successfully, I noticed that it sometimes SIGTERMs the child processes. `torchrun` then exits successfully.
This results in misleading warning log messages towards the end of the job like the one below:
```
W0510 14:52:48.185934 672413 api.py:513] Closing process 675171 via signal SIGTERM
W0510 14:52:48.185984 672413 api.py:513] Closing process 675172 via signal SIGTERM
W0510 14:52:48.186013 672413 api.py:513] Closing process 675174 via signal SIGTERM
# <---- ^^^ ??? everything runs successfully but child still SIGTERM'ed? ^^^ --->
I0510 14:52:48.229119 672413 api.py:877] [main] worker group successfully finished. Waiting 300 seconds for other agents to finish.
I0510 14:52:48.229161 672413 api.py:922] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
I0510 14:52:48.229395 672413 api.py:936] Done waiting for other agents. Elapsed: 0.0001709461212158203 seconds
I0510 14:52:48.257544 672413 dynamic_rendezvous.py:1131] The node 'localhost_672413_0' has closed the rendezvous 'torchrun_qpfd'.
I0510 14:52:48.568198 672413 distributed.py:200] Deleting temp log directory: /tmp/torchrun_udgp8zoq
I0510 14:52:48.568989 672413 distributed.py:202] Finished running `main`
```
Root Cause
------------------
I noticed that this was due to the incorrect usage of `torch.multiprocessing.ProcessContext.join()` in `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext`.
`torch.multiprocessing.ProcessContext.join()` does not actually wait for ALL child procs to exit, but rather waits for **at-least-one** child proc to exit. If only a subset of the child procs have exited, it returns `False` and if all child procs have exited it returns `True`.
`torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` was assuming that `torch.multiprocessing.ProcessContext.join()` blocks indefinitely until all child procs have exited.
Fix
---------
The fix is simple: keep calling `pc.join()` in a loop until it returns `True` (see the sketch after the notes below).
> **NOTE**: the indefinite blocking is NOT an issue since, by the time `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` calls `pc.join()`, it has already validated that the entrypoint functions either returned successfully or that one of them failed. So we are really just waiting for the unix processes to exit after running the entrypoint function.
> **NOTE**: since `pc.join()` already blocks until at-least-one child proc exits, there is no need to add a polling interval in the loop body, and the debug log line appears at most `nproc_per_node` times, so no log spamming is observed.
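A minimal sketch of the loop (not the verbatim patch; `pc` is assumed to be the `torch.multiprocessing.ProcessContext` discussed above):
```python
import logging

logger = logging.getLogger(__name__)

# The entrypoint functions have already been validated as finished (see the
# notes above). join() returns False while only a subset of the child
# processes have exited, and True once all of them have exited.
while not pc.join():
    logger.debug("entrypoints done; waiting for remaining child procs to exit")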
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125969
Approved by: https://github.com/d4l3k
Summary:
Minor logging cleanup in distributed library
1. Don't use f-string formatting in logging calls - addresses linter issues (see the sketch after this list).
2. Nits: Make use of unused `e` (error) in a few logs.
3. Change info->debug as asked in issue #113545
4. Nit: rename log -> logger in a few files for consistency
5. Fix a linter error.
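As a generic illustration of item 1 (not a diff hunk from this PR), logging calls should pass arguments lazily instead of pre-formatting with f-strings:
```python
import logging

logger = logging.getLogger(__name__)

local_rank, exc = 3, RuntimeError("worker failed")

# Preferred: lazy %-style arguments, interpolated only if the record is emitted.
logger.warning("Local rank %s failed: %s", local_rank, exc)

# Flagged by the linter: eager f-string formatting inside a logging call.
# logger.warning(f"Local rank {local_rank} failed: {exc}")
```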
Test Plan:
1. Local build passes.
2. Linter is happy.
Reviewers: wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
Summary:
## No Functional Change
- Refactor Subprocess Handler into a separate folder for easier subclassing
- SubprocessHandler
  - added `local_rank_id` to `SubprocessHandler` to make it available as a field on the class (see the sketch below)
  - pass `local_rank_id` in from subprocess start
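A hedged sketch of what carrying `local_rank_id` on the handler might look like (the constructor shape here is an assumption for illustration, not the exact torch class):
```python
import subprocess
from typing import Dict, List


class SubprocessHandler:
    """Illustrative handler that remembers which local rank it was started for."""

    def __init__(self, entrypoint: str, args: List[str],
                 env: Dict[str, str], local_rank_id: int) -> None:
        # local_rank_id is now kept on the handler instance itself,
        # passed in from the subprocess start path.
        self.local_rank_id = local_rank_id
        self.proc = subprocess.Popen([entrypoint, *args], env=dict(env))
```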
Test Plan: No functional changes.
Differential Revision: D54038627
#suppress-api-compatibility-check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120373
Approved by: https://github.com/kurman
Summary:
Expose an option for users to specify the name of the LogsSpec implementation to use.
- It has to be defined as an entry point under the `torchrun.logs_specs` group (see the sketch below).
- It must implement the LogsSpec interface defined in the prior PR/diff.
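A hedged sketch of how a third-party implementation might be registered under that group (the package and class names here are hypothetical; only the `torchrun.logs_specs` group name comes from the description above):
```python
# setup.py of a hypothetical plugin package
from setuptools import setup

setup(
    name="my-torchrun-logs-spec",
    py_modules=["my_logs_spec"],
    entry_points={
        # torchrun looks the implementation up by name in this group.
        "torchrun.logs_specs": [
            "my_logs_spec = my_logs_spec:MyLogsSpecs",
        ],
    },
)
```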
Test Plan: unit test+local tests
Reviewed By: ezyang
Differential Revision: D54180838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120942
Approved by: https://github.com/ezyang
Summary:
Pulling out logging parameters into a logging spec that can be overridden (follow-up changes cover a possible override mechanism).
Why?
Right now the logging approach is quite rigid:
- Requires the log directory to exist and not be empty
- Creates a tempdir otherwise
- Creates a subdir for the run
- Creates a subdir for each attempt
- Creates files named stdout.log, stderr.log, error.json
In some instances users would like to customize this behavior, including the file names, based on context. And we do already have a mechanism to template the prefix of the multiplexed teed output.
With the current changes, users can create a custom log spec that uses env variables to change the behavior.
Notes:
Made `LaunchConf.logs_specs` an optional field that will be bound to a `DefaultLogsSpecs` instance. A large number of clients (code) use the API directly without going through the torchrun API. For those cases, we have to explicitly pass a LogSpecs implementation if we would like to override the default. For regular torchrun users, we can use the pluggable approach proposed in the follow-up change.
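A hedged sketch of the direct-API case described above (argument values are illustrative, and the exact constructor signatures are assumptions based on this description, not a verbatim API reference):
```python
# Illustrative only: pass an explicit logs spec when using the launcher API
# directly instead of torchrun.
from torch.distributed.elastic.multiprocessing import DefaultLogsSpecs
from torch.distributed.launcher.api import LaunchConfig, elastic_launch


def trainer():
    print("hello from a worker process")


config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=2,
    rdzv_backend="c10d",
    rdzv_endpoint="localhost:29400",
    run_id="logs_specs_sketch",
    # Explicit override; the directory is assumed to already exist.
    logs_specs=DefaultLogsSpecs(log_dir="/tmp/my_run_logs"),
)

if __name__ == "__main__":
    elastic_launch(config, trainer)()
```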
Test Plan: CI + unit tests
Differential Revision: D54176265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691
Approved by: https://github.com/ezyang
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
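A small illustration of the pattern this rule rewrites (a generic example, not code from the PR); note that when the shared value is mutable, all keys reference the same object:
```python
param_names = ["weight", "bias", "running_mean"]

# fromkeys makes it explicit that every key maps to the same static value,
# and is faster than building the dict key by key.
grads_seen = dict.fromkeys(param_names, False)

# Equivalent, slower comprehension form that the rule replaces:
grads_seen_slow = {name: False for name in param_names}
```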
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
When torchelastic notices that one rank has failed, it will send a SIGTERM
signal to the other trainer ranks to tear them down before restarting. However,
if the trainer itself launches subprocesses, or is launched by a non-Python
wrapper script, then the SIGTERM is delivered only to the direct child of
torchelastic and not to all descendants. This change opens subprocesses in a new
Linux session, which starts a new process group with a pgid equal to the
trainer's pid. Then, when we send signals, we deliver them to the
process group rather than just the direct child.
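A minimal, Linux/Unix-only sketch of the mechanism (generic Python, not the torchelastic implementation): start the child in its own session so it leads a new process group, then signal the whole group on teardown.
```python
import os
import signal
import subprocess

# start_new_session=True calls setsid() in the child, so the wrapper script
# leads a new process group whose pgid equals its pid.
proc = subprocess.Popen(
    ["bash", "-c", "sleep 300 & wait"],
    start_new_session=True,
)

# Deliver SIGTERM to the whole process group (wrapper plus its descendants),
# not just the direct child.
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
```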
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113231
Approved by: https://github.com/H-Huang
Summary:
Add the ability to customize log lines with additional template-like behavior to enrich log information.
Motivation:
a) Log stream processing/aggregation gains additional value when it includes information about the global rank. An extension of that is that it becomes easier to map ranks to hosts from log stream information (less relevant at the moment)
b) Users can easily map a failure to the right rank without matching node rank offset + local rank.
Implementation
- BC change: keeps the log line prefix as `[<role name><local rank>]:`
- Optional env variable TORCHELASTIC_LOG_LINE_HEADER that is used as the prefix when specified; it currently exposes the `role_name`, `rank` and `local_rank` variables, which are bound when the agent assigns the ranks (see the sketch below).
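As a hedged illustration of the binding step only: the concrete template syntax accepted by `TORCHELASTIC_LOG_LINE_HEADER` is not shown in this description, so the `string.Template` form below is an assumption for illustration, not the actual format.
```python
import os
from string import Template

# Assumed example value; role_name, rank and local_rank are the variables
# named above, bound when the agent assigns ranks.
os.environ.setdefault(
    "TORCHELASTIC_LOG_LINE_HEADER", "[$role_name rank=$rank local_rank=$local_rank]: "
)

header = Template(os.environ["TORCHELASTIC_LOG_LINE_HEADER"]).safe_substitute(
    role_name="trainer", rank=12, local_rank=4
)
print(header + "loss=0.42")  # -> [trainer rank=12 local_rank=4]: loss=0.42
```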
Test Plan:
CI
https://fburl.com/mlhub/mzx5xspv
Differential Revision: D50584590
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112357
Approved by: https://github.com/kiukchung
Summary:
Fixes https://github.com/pytorch/pytorch/issues/71946
This commit removes some bits of code that were hard coded for Python 3.6 support from the `.circleci` and `torch` folders. It should only be merged if https://github.com/pytorch/pytorch/issues/66462 is complete.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72211
Reviewed By: dagitses, seemethere
Differential Revision: D33982604
Pulled By: musebc
fbshipit-source-id: 8f453bf9909df615addd59538adb369c65484044
(cherry picked from commit 944a9970fe68a40999b5c8af731e632c28fd15c5)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62823
The diff makes sure that the warning message is printed only when the child processes are still stuck after the termination signal has been sent.
Test Plan:
sandcastle
buck build mode/dev-nosan //caffe2:run
buck-out/gen/caffe2/run.par --nnodes 1 --nproc_per_node 1 main.py
P435691445
Differential Revision: D30046695
fbshipit-source-id: c59170b297f4a0e530906fa5069234303deee938
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61602
The diff introduces signal handlers and a SignalException that is raised when the agent process receives SIGTERM or SIGINT.
When any of these signals is received, the termination handler raises the `SignalException`. The exception is then processed by the main agent loop, which invokes `shutdown(signum)` to propagate the received signal to the child processes. A default 30 second timeout is introduced: if the child processes cannot terminate gracefully within this timeout, the agent process kills them via SIGKILL.
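A condensed sketch of the pattern described above (simplified for illustration; the actual agent code differs in its details):
```python
import signal


class SignalException(Exception):
    """Raised from the handler so the main agent loop can run shutdown()."""

    def __init__(self, msg, sigval):
        super().__init__(msg)
        self.sigval = sigval


def _termination_handler(signum, frame):
    # Turning the signal into an exception unwinds the agent's main loop;
    # the loop's cleanup then calls shutdown(signum), forwarding the signal
    # to the child processes (SIGKILL after the grace period expires).
    raise SignalException(f"Agent received signal {signum}", signal.Signals(signum))


signal.signal(signal.SIGTERM, _termination_handler)
signal.signal(signal.SIGINT, _termination_handler)
```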
Test Plan: unittests, sandcastle
Reviewed By: cbalioglu
Differential Revision: D29671783
fbshipit-source-id: 3dbca2125676dc18d417cc3e3bb0301fdd42737a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61294
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925
* Set `torch.distributed.launch` restarts to 0
* Remove unnecessary `-use_env` warning, move `-use_env` warnings
* Move `-use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set default events handler to `null` that does not print events to console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when the parent dies (see the sketch after this list)
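For the last bullet, a Linux-only sketch of the pre-exec idea (illustrative only; torch uses its own helper for this):
```python
import ctypes
import signal
import subprocess

PR_SET_PDEATHSIG = 1  # constant from <sys/prctl.h>
libc = ctypes.CDLL("libc.so.6", use_errno=True)


def _set_pdeathsig():
    # Runs in the child between fork and exec: ask the kernel to deliver
    # SIGTERM to this child if its parent process dies.
    libc.prctl(PR_SET_PDEATHSIG, signal.SIGTERM)


proc = subprocess.Popen(["sleep", "300"], preexec_fn=_set_pdeathsig)
```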
Issues resolved:
https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754
Test Plan:
sandcastle
python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts
python -m torch.distributed.launch --nproc_per_node=4 --use_env --no_python main.py -> produces error
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py -> no warning
python -m torch.distributed.launch --nproc_per_node=4 --no_python main.py -> warning
Output of running torch.distributed.launch without --use_env:
$path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ('LOCAL_RANK')` instead.
New section:
{F628923078}
{F628974089}
Reviewed By: cbalioglu
Differential Revision: D29559553
fbshipit-source-id: 03ed9ba638bf154354e1530ffc964688431edf6b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60925
* Set `torch.distributed.launch` restarts to 0
* Remove unnecessary `-use_env` warning, move `-use_env` warnings
* Move `-use_env` warnings to `torch.distributed.launch`
* Make default log level WARNING
* Add new doc section around transitioning to `torch.distributed.run`
* Make `torch.distributed.launch` not use error-propagation
* Set default events handler to `null` that does not print events to console
* Add reference from `torch.distributed.launch` to `torch.distributed.run`
* Set correct preexec function that sends SIGTERM to child processes when parent dies
Issues resolved:
https://github.com/pytorch/pytorch/issues/60716
https://github.com/pytorch/pytorch/issues/60754
Test Plan:
sandcastle
python -m torch.distributed.launch --nproc_per_node 2 main.py -> uses 0 restarts
python -m torch.distributed.run --nproc_per_node 2 main.py -> uses default for torchelastic, 0 restarts
python -m torch.distributed.launch --nproc_per_node=4 --use_env --no_python main.py -> produces error
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py -> no warning
python -m torch.distributed.launch --nproc_per_node=4 --no_python main.py -> warning
Output of running torch.distributed.launch without --use_env:
$path/torch/distributed/launch.py:173: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ('LOCAL_RANK')` instead.
New section:
{F628923078}
{F628974089}
Reviewed By: kiukchung, cbalioglu
Differential Revision: D29413019
fbshipit-source-id: 323bfbad9d0e4aba3b10ddd7a243ca6e48169630
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56386
The diff resolves a bug around incorrect handler resolution:
_create_static_handler pointed to the etcd handler, and _create_etcd_handler pointed to the static handler.
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed:test_launcher
Added test_launcher to the ci/cd tests
Reviewed By: cbalioglu
Differential Revision: D27858897
fbshipit-source-id: 440155789958c091ce5755e7c9524e4bb704203a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55412
The diff resolves a bug where worker processes could exit before the torchelastic process reads their return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/
When users return a torch.Tensor object from a worker process, torchelastic multiprocessing can fail. Currently the worker process finishes its job right after it writes its output to the IPC queue, without confirmation from the receiver process. When this happens, the underlying channel between the worker and the torchelastic process can be closed (in the case of mp.SimpleQueue it is backed by file descriptors, which is why we see FileNotFoundException: since the worker process finished execution, the file descriptor was deleted and the torchelastic process cannot find it).
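A hedged sketch of the general synchronization idea (keep the worker alive until the parent has drained the queue); this is illustrative only and not the actual torchelastic fix:
```python
import multiprocessing as mp


def worker(q, done):
    q.put({"rank": 0, "result": 42})
    # Keep the worker alive until the parent confirms it has read the result,
    # so the resources backing the IPC message are not torn down early.
    done.wait()


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    q = ctx.SimpleQueue()
    done = ctx.Event()
    p = ctx.Process(target=worker, args=(q, done))
    p.start()
    print(q.get())   # parent reads the return value first
    done.set()       # then releases the worker
    p.join()
```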
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test
User workflow: f263531643
Reviewed By: cbalioglu
Differential Revision: D27602838
fbshipit-source-id: 29871178232e3af4ad3dec406c234aba9c5faba1
Summary:
The diff resolves a bug where worker processes could exit before the torchelastic process reads their return values. This is a rare event, but it can still happen, e.g. https://fb.workplace.com/groups/319878845696681/permalink/512409069776990/
When users return a torch.Tensor object from a worker process, torchelastic multiprocessing can fail. Currently the worker process finishes its job right after it writes its output to the IPC queue, without confirmation from the receiver process. When this happens, the underlying channel between the worker and the torchelastic process can be closed (in the case of mp.SimpleQueue it is backed by file descriptors, which is why we see FileNotFoundException: since the worker process finished execution, the file descriptor was deleted and the torchelastic process cannot find it).
Test Plan:
buck test mode/dev-nosan //caffe2/test/distributed/elastic/agent/server/test:local_agent_test
User workflow: f263531643
Reviewed By: cbalioglu, wilson100hong
Differential Revision: D27572158
fbshipit-source-id: 9a360468acc98d85d587ebf223e7e96d4b43fe4b