# Context
In #161183, we added NUMA-binding support for `Callable` entrypoints to `elastic_launch`.
However, we raised an exception if the subprocesses were spawned in parallel via `ThreadPoolExecutor`, which is configurable via the `TORCH_MP_PARALLEL_START` environment variable (see diff).
The logic here was that `os.sched_setaffinity`, which we used to set CPU affinities, is [per process](https://docs.python.org/3/library/os.html#os.sched_setaffinity), so there could be a race condition during a parallel start:
> Restrict the process with PID pid (or the current process if zero) to a set of CPUs. mask is an iterable of integers representing the set of CPUs to which the process should be restricted.
But on further reading, the Linux man page says [`sched_setaffinity` is per *thread*.](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html) As it turns out, the Python documentation is simply inaccurate on this point.
I [verified that `sched_setaffinity` only affects the calling thread, not the entire calling process.](https://gist.github.com/pdesupinski/7e2de3cbe5bb48d489f257b83ccddf07)
The upshot is that we actually *can* safely use the inheritance trick from #161183 even with parallel start: each spawning thread's affinity setting is inherited by its subprocess, and `os.sched_setaffinity` in one thread does not affect the others.
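A quick demonstration of the per-thread behavior (a hypothetical check in the spirit of the gist above, not torch code; Linux-only, so it bails out conservatively elsewhere):

```python
import os
import threading

def affinity_is_per_thread() -> bool:
    """Check that os.sched_setaffinity affects only the calling thread."""
    if not hasattr(os, "sched_getaffinity"):
        return True  # platform cannot demonstrate this; trust the man page
    original = os.sched_getaffinity(0)
    result = {}

    def worker():
        # Pin this worker thread to a single CPU.
        os.sched_setaffinity(0, {min(original)})
        result["worker"] = os.sched_getaffinity(0)

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    # The worker's restriction did not leak to the main thread.
    return os.sched_getaffinity(0) == original and result["worker"] == {min(original)}
```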
# This PR
Remove restrictions against parallel start for NUMA binding.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161576
Approved by: https://github.com/d4l3k
# Context
In #160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path that does not provide a straightforward way to use the `numactl` CLI. See #160006 for a full description of the challenges.
Although #160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set.
# This PR
On further thought, the most straightforward, foolproof solution here is to use [the trick that @d4l3k suggested.](https://github.com/pytorch/pytorch/issues/160006#issuecomment-3162018836)
Specifically, for each local rank `i`:
1. The parent process sets its own CPU affinity to what local rank `i`'s should be.
2. Then, the parent spawns the subprocess for local rank `i`.
3. Finally, the parent resets its own CPU affinity to what it was originally.
There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest one that works for *both* `str` and `Callable`.
This required a bit of refactoring:
1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`.
2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.).
3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process.
4. Use the context manager for both the `str` and `Callable` paths.
# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`
## Manual
See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TL;DR: I tried every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization with binding enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161183
Approved by: https://github.com/d4l3k
# Context
This is an extension of #149334.
# This PR
Add support for NUMA bindings with Callable entrypoints, such as `do_train` instead of `/usr/local/bin/python`.
Most notably, we utilize a hack in order to force `Process.start()` to use custom NUMA bindings for each subprocess. Please search for `HACK:` in the code to see a description of the implementation we chose, and #160006 for discussion of alternatives and why this is necessary.
Other changes:
* Remove unnecessary `--preferred` option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See [MPOL_LOCAL](https://man7.org/linux/man-pages/man2/set_mempolicy.2.html).)
* Refactor so that the main API is `maybe_wrap_command_with_numa_bindings`, which computes bindings for a single rank at a time, rather than `maybe_wrap_with_numa_bindings` which computed bindings for all ranks at once. This allowed for more code sharing between `Callable` and `str` entrypoints.
# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`
## Manual
Using [this benchmark](https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), I ran
```
$ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 | tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 | tee none_callable.txt
```
and observed
* 6.6% remote memory accesses with `node` bindings
* 11.6% remote memory accesses without bindings
I also ran a similar test with `str` entrypoints to confirm they still work.
NOTE: [--run-path triggers the code to be run inside a `Callable`.](017259f9c6/torch/distributed/run.py (L870))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160163
Approved by: https://github.com/d4l3k
Fixed by leaking the `resource_tracker` destructor (introduced by https://github.com/python/cpython/issues/88887) at exit, since at that point the handle to the child process might no longer be valid.
Also, switch CI from using `setup-miniconda` to `setup-python` as an integration test for the fix as all data loader tests will hang otherwise
- Remove `CONDA_RUN` macro...
- Hack the search path in `macos-test.sh` to put both `python` and `python3` aliases first in the path (not sure what other actions are messing with the PATH environment variable)
Fixes https://github.com/pytorch/pytorch/issues/153050
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155698
Approved by: https://github.com/atalman
Changes:
1. Bump `ruff` from 0.7.4 to 0.8.4
2. Change `%`-formatted strings to f-strings
3. Change arguments with the `__` prefix to positional-only arguments with the `/` separator in function signatures.
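A hypothetical before/after (not taken from the PR) illustrating changes (2.) and (3.):

```python
# Before: double-underscore-prefixed parameters mark positional-only
# arguments by convention only, and %-formatting builds the message.
def clamp_old(__value, __low, __high):
    if __low > __high:
        raise ValueError("empty range: %s > %s" % (__low, __high))
    return max(__low, min(__value, __high))

# After: PEP 570 positional-only parameters via the `/` separator,
# and an f-string instead of %-formatting.
def clamp(value, low, high, /):
    if low > high:
        raise ValueError(f"empty range: {low} > {high}")
    return max(low, min(value, high))
```

With the `/` separator, calling `clamp(value=5, low=0, high=3)` now raises `TypeError`, which is exactly what the `__` prefix was meant to discourage.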
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143753
Approved by: https://github.com/Skylion007
When one process fails, the others are immediately killed. This prevents the other processes from performing necessary cleanups or dumping debug information (in particular, the NCCL flight recorder).
This PR adds a grace period. Default behavior is unchanged.
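The escalation could be sketched as follows (a hypothetical helper, not the torch API; the real grace period is plumbed through the multiprocessing context):

```python
import time
import multiprocessing as mp

def terminate_with_grace(procs, grace_seconds):
    """SIGTERM survivors first so they can clean up, then SIGKILL stragglers."""
    for p in procs:
        if p.is_alive():
            p.terminate()  # polite SIGTERM: lets handlers dump debug info
    deadline = time.monotonic() + grace_seconds
    for p in procs:
        # Wait out whatever remains of the shared grace period.
        p.join(max(0.0, deadline - time.monotonic()))
    for p in procs:
        if p.is_alive():
            p.kill()  # SIGKILL: the grace period has expired
            p.join()
```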
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131278
Approved by: https://github.com/albanD
Summary:
This fixes the PyTorch issue filed at https://github.com/pytorch/pytorch/issues/133010.
One way to fix this problem is to enable starting processes in parallel in `mp.start_processes`.
Also in this diff:
- Refactored the `api_test` test case, which was repeating a lot of tests due to inheritance.
- Added a unit test for forkserver with parallel start enabled.
Test Plan: Added unit tests
Differential Revision: D61878552
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134629
Approved by: https://github.com/d4l3k
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.
Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
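The fallback pattern looks roughly like this (hypothetical function; `old_mean` is not a torch API):

```python
import warnings

def old_mean(xs):
    """Deprecated helper illustrating the explicit-category fallback."""
    # When `typing_extensions.deprecated` cannot be used, pass an explicit
    # category so the warning is user-facing (FutureWarning is shown by
    # default, unlike DeprecationWarning outside of __main__).
    warnings.warn(
        "old_mean is deprecated; use statistics.fmean instead",
        category=FutureWarning,
        stacklevel=2,
    )
    return sum(xs) / len(xs)
```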
Resolves #126888
- #126888
This PR is split from PR #126898.
- #126898
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.
Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.
Resolves #126888
- #126888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
multiprocessing.Queue relies on, among other things, background threads to send messages between processes. This works in the happy path but can cause issues if a process exits bypassing atexit handlers or crashes, because the writer to the Queue can terminate while the reader is blocked reading it. The reader sees the queue as non-empty, yet even with a timeout it will block forever.
An example of a Queue deadlock is here: https://gist.github.com/chipturner/342f72341f087737befe9df84d0e41ce
Since the error reporting case here is a simple one-shot message from the dying child to the parent, we can just use a file-based rendezvous. This eliminates the deadlock when a large traceback is still being flushed to the network when a child exits.
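A file-based one-shot rendezvous can be sketched as follows (hypothetical names, not the functions in this PR): the dying child publishes its error atomically, and the parent reads it after reaping the child, with no background threads to deadlock.

```python
import os
import pickle

def publish_error(path, payload):
    """Child side: write the error payload and publish it atomically."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(payload, f)
    os.replace(tmp, path)  # atomic rename: readers never see a partial file

def collect_error(path):
    """Parent side: fetch the child's error, or None if it exited cleanly."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)
```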
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114688
Approved by: https://github.com/suo, https://github.com/yifuwang
Summary:
[pytorch] Multiprocessing api to use sigkill if sigterm doesn't kill the process
We have seen a handful of training jobs get stuck where one of the trainers goes down
while the others are stuck in C++ land and hence not handling the SIGTERM.
Test Plan: Manually validated by attaching gdb to one of the processes and sending `kill -9` to another. Saw the log ```WARNING] Unable to shutdown process 4422 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL```
Differential Revision: D51862545
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115219
Approved by: https://github.com/wconstab, https://github.com/fduwjj
On some systems it is possible to receive a signal that does not have a name. Rare, but possible. This change prevents our error handler from crashing and instead reports the signal properly.
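The fallback can be sketched like this (hypothetical helper mirroring the fix, not the actual torch code):

```python
import signal

def describe_signal(signum: int) -> str:
    """Return a signal's symbolic name, or a placeholder if it has none."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # Real but nameless signal (possible on some systems): report the
        # raw number instead of crashing the error handler.
        return f"<Unknown signal {signum}>"
```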
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114696
Approved by: https://github.com/xmfan
Fixes #112595
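For context on the D-codes listed below, here is a minimal hypothetical before/after (not from the PR) covering the most frequent ones: D205 (blank line after the summary), D400 (summary ends with a period), and D401 (imperative mood).

```python
def scale_bad(x, factor):
    """Scales x by factor
    and returns the result"""
    # Triggers D205, D400, and D401: non-imperative summary with no
    # trailing period and no blank line before the description.
    return x * factor

def scale(x, factor):
    """Scale ``x`` by ``factor``.

    Return the scaled value without modifying the inputs.
    """
    return x * factor
```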
- `torch/autograd/profiler.py` <br/>
**Before: 37**
```
torch/autograd/profiler.py:1 at module level:
D100: Missing docstring in public module
torch/autograd/profiler.py:91 in public class `profile`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:175 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:261 in public method `config`:
D102: Missing docstring in public method
torch/autograd/profiler.py:272 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:290 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:308 in public method `__repr__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:313 in public method `__str__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:322 in public method `table`:
D102: Missing docstring in public method
torch/autograd/profiler.py:346 in public method `export_chrome_trace`:
D102: Missing docstring in public method
torch/autograd/profiler.py:355 in public method `export_stacks`:
D102: Missing docstring in public method
torch/autograd/profiler.py:361 in public method `key_averages`:
D102: Missing docstring in public method
torch/autograd/profiler.py:368 in public method `total_average`:
D102: Missing docstring in public method
torch/autograd/profiler.py:377 in public method `self_cpu_time_total`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:377 in public method `self_cpu_time_total`:
D400: First line should end with a period (not 'f')
torch/autograd/profiler.py:555 in public class `record_function`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:555 in public class `record_function`:
D400: First line should end with a period (not 'f')
torch/autograd/profiler.py:591 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:602 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:608 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:625 in private method `_call_end_callbacks_on_future`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:625 in private method `_call_end_callbacks_on_future`:
D400: First line should end with a period (not 'c')
torch/autograd/profiler.py:707 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:712 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:733 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:826 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:831 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:853 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:863 in public function `load_nvprof`:
D401: First line should be in imperative mood (perhaps 'Open', not 'Opens')
torch/autograd/profiler.py:874 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:877 in public method `see`:
D102: Missing docstring in public method
torch/autograd/profiler.py:883 in public function `parse_nvprof_trace`:
D103: Missing docstring in public function
torch/autograd/profiler.py:951 in public class `KinetoStepTracker`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:991 in public method `init_step_count`:
D102: Missing docstring in public method
torch/autograd/profiler.py:995 in public method `erase_step_count`:
D102: Missing docstring in public method
torch/autograd/profiler.py:1000 in public method `increment_step`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/profiler.py:1023 in public method `current_step`:
D102: Missing docstring in public method
37
```
**After: 27**
```
torch/autograd/profiler.py:1 at module level:
D100: Missing docstring in public module
torch/autograd/profiler.py:176 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:262 in public method `config`:
D102: Missing docstring in public method
torch/autograd/profiler.py:273 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:291 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:309 in public method `__repr__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:314 in public method `__str__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:323 in public method `table`:
D102: Missing docstring in public method
torch/autograd/profiler.py:347 in public method `export_chrome_trace`:
D102: Missing docstring in public method
torch/autograd/profiler.py:356 in public method `export_stacks`:
D102: Missing docstring in public method
torch/autograd/profiler.py:362 in public method `key_averages`:
D102: Missing docstring in public method
torch/autograd/profiler.py:369 in public method `total_average`:
D102: Missing docstring in public method
torch/autograd/profiler.py:593 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:604 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:610 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:708 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:713 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:734 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:827 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:832 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:854 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/profiler.py:875 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/profiler.py:878 in public method `see`:
D102: Missing docstring in public method
torch/autograd/profiler.py:884 in public function `parse_nvprof_trace`:
D103: Missing docstring in public function
torch/autograd/profiler.py:993 in public method `init_step_count`:
D102: Missing docstring in public method
torch/autograd/profiler.py:997 in public method `erase_step_count`:
D102: Missing docstring in public method
torch/autograd/profiler.py:1025 in public method `current_step`:
D102: Missing docstring in public method
27
```
- `torch/autograd/graph.py` <br/>
**Before: 22**
```
torch/autograd/graph.py:1 at module level:
D100: Missing docstring in public module
torch/autograd/graph.py:24 in public class `Node`:
D101: Missing docstring in public class
torch/autograd/graph.py:27 in public method `name`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/autograd/graph.py:42 in public method `next_functions`:
D102: Missing docstring in public method
torch/autograd/graph.py:47 in public method `metadata`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/autograd/graph.py:56 in public method `register_hook`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/autograd/graph.py:94 in public method `register_prehook`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/autograd/graph.py:129 in public method `__subclasshook__`:
D105: Missing docstring in magic method
torch/autograd/graph.py:147 in public function `get_gradient_edge`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/graph.py:147 in public function `get_gradient_edge`:
D400: First line should end with a period (not 'f')
torch/autograd/graph.py:147 in public function `get_gradient_edge`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/autograd/graph.py:166 in public function `increment_version`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/graph.py:166 in public function `increment_version`:
D400: First line should end with a period (not 'd')
torch/autograd/graph.py:166 in public function `increment_version`:
D401: First line should be in imperative mood; try rephrasing (found 'This')
torch/autograd/graph.py:243 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/graph.py:251 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/graph.py:256 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/graph.py:261 in public class `save_on_cpu`:
D205: 1 blank line required between summary line and description (found 0)
torch/autograd/graph.py:261 in public class `save_on_cpu`:
D400: First line should end with a period (not 'e')
torch/autograd/graph.py:303 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/graph.py:365 in public function `register_multi_grad_hook`:
D401: First line should be in imperative mood (perhaps 'Register', not 'Registers')
torch/autograd/graph.py:588 in public function `allow_mutation_on_saved_tensors`:
D400: First line should end with a period (not 'd')
22
```
**After: 8**
```
torch/autograd/graph.py:1 at module level:
D100: Missing docstring in public module
torch/autograd/graph.py:24 in public class `Node`:
D101: Missing docstring in public class
torch/autograd/graph.py:42 in public method `next_functions`:
D102: Missing docstring in public method
torch/autograd/graph.py:129 in public method `__subclasshook__`:
D105: Missing docstring in magic method
torch/autograd/graph.py:244 in public method `__init__`:
D107: Missing docstring in __init__
torch/autograd/graph.py:252 in public method `__enter__`:
D105: Missing docstring in magic method
torch/autograd/graph.py:257 in public method `__exit__`:
D105: Missing docstring in magic method
torch/autograd/graph.py:303 in public method `__init__`:
D107: Missing docstring in __init__
8
```
- `torch/multiprocessing/pool.py` <br/>
**Before: 6**
```
torch/multiprocessing/pool.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/pool.py:7 in public function `clean_worker`:
D103: Missing docstring in public function
torch/multiprocessing/pool.py:18 in public class `Pool`:
D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/pool.py:18 in public class `Pool`:
D209: Multi-line docstring closing quotes should be on a separate line
torch/multiprocessing/pool.py:29 in private method `_repopulate_pool`:
D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/pool.py:29 in private method `_repopulate_pool`:
D400: First line should end with a period (not ',')
6
```
**After: 2**
```
torch/multiprocessing/pool.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/pool.py:7 in public function `clean_worker`:
D103: Missing docstring in public function
2
```
- `torch/multiprocessing/queue.py` <br/>
**Before: 11**
```
torch/multiprocessing/queue.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/queue.py:8 in public class `ConnectionWrapper`:
D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/queue.py:8 in public class `ConnectionWrapper`:
D209: Multi-line docstring closing quotes should be on a separate line
torch/multiprocessing/queue.py:8 in public class `ConnectionWrapper`:
D400: First line should end with a period (not 'o')
torch/multiprocessing/queue.py:11 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/queue.py:14 in public method `send`:
D102: Missing docstring in public method
torch/multiprocessing/queue.py:19 in public method `recv`:
D102: Missing docstring in public method
torch/multiprocessing/queue.py:23 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/multiprocessing/queue.py:29 in public class `Queue`:
D101: Missing docstring in public class
torch/multiprocessing/queue.py:30 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/queue.py:38 in public class `SimpleQueue`:
D101: Missing docstring in public class
11
```
**After: 8**
```
torch/multiprocessing/queue.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/queue.py:10 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/queue.py:13 in public method `send`:
D102: Missing docstring in public method
torch/multiprocessing/queue.py:18 in public method `recv`:
D102: Missing docstring in public method
torch/multiprocessing/queue.py:22 in public method `__getattr__`:
D105: Missing docstring in magic method
torch/multiprocessing/queue.py:28 in public class `Queue`:
D101: Missing docstring in public class
torch/multiprocessing/queue.py:29 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/queue.py:37 in public class `SimpleQueue`:
D101: Missing docstring in public class
8
```
- `torch/multiprocessing/reductions.py` <br/>
**Before: 31**
```
torch/multiprocessing/reductions.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/reductions.py:24 in public class `StorageWeakRef`:
D209: Multi-line docstring closing quotes should be on a separate line
torch/multiprocessing/reductions.py:31 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/reductions.py:38 in public method `from_weakref`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:44 in public method `expired`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:47 in public method `__del__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:50 in public method `__hash__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:53 in public method `__eq__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:60 in public class `SharedCache`:
D400: First line should end with a period (not 'f')
torch/multiprocessing/reductions.py:62 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/reductions.py:75 in public method `get`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:79 in public method `__setitem__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:85 in public method `free_dead_references`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:99 in public function `rebuild_event`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:103 in public function `reduce_event`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:108 in public function `rebuild_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:121 in public function `rebuild_cuda_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:189 in public function `reduce_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:347 in public function `rebuild_nested_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:364 in public function `reduce_nested_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:389 in public function `fd_id`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:397 in public function `storage_from_cache`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:404 in public function `rebuild_storage_fd`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:417 in public function `rebuild_storage_filename`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:437 in public function `rebuild_storage_empty`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:441 in public function `rebuild_typed_storage`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:446 in public function `reduce_typed_storage`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:450 in public function `rebuild_typed_storage_child`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:455 in public function `reduce_typed_storage_child`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:459 in public function `reduce_storage`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:488 in public function `init_reductions`:
D103: Missing docstring in public function
31
```
**After: 29**
```
torch/multiprocessing/reductions.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/reductions.py:32 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/reductions.py:39 in public method `from_weakref`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:45 in public method `expired`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:48 in public method `__del__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:51 in public method `__hash__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:54 in public method `__eq__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:63 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/reductions.py:76 in public method `get`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:80 in public method `__setitem__`:
D105: Missing docstring in magic method
torch/multiprocessing/reductions.py:86 in public method `free_dead_references`:
D102: Missing docstring in public method
torch/multiprocessing/reductions.py:100 in public function `rebuild_event`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:104 in public function `reduce_event`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:109 in public function `rebuild_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:122 in public function `rebuild_cuda_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:190 in public function `reduce_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:348 in public function `rebuild_nested_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:365 in public function `reduce_nested_tensor`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:390 in public function `fd_id`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:398 in public function `storage_from_cache`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:405 in public function `rebuild_storage_fd`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:418 in public function `rebuild_storage_filename`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:438 in public function `rebuild_storage_empty`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:442 in public function `rebuild_typed_storage`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:447 in public function `reduce_typed_storage`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:451 in public function `rebuild_typed_storage_child`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:456 in public function `reduce_typed_storage_child`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:460 in public function `reduce_storage`:
D103: Missing docstring in public function
torch/multiprocessing/reductions.py:489 in public function `init_reductions`:
D103: Missing docstring in public function
29
```
- `torch/multiprocessing/spawn.py` <br/>
**Before: 19**
```
torch/multiprocessing/spawn.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/spawn.py:11 in public class `ProcessException`:
D101: Missing docstring in public class
torch/multiprocessing/spawn.py:14 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:20 in public method `__reduce__`:
D105: Missing docstring in magic method
torch/multiprocessing/spawn.py:25 in public class `ProcessRaisedException`:
D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/spawn.py:25 in public class `ProcessRaisedException`:
D400: First line should end with a period (not 'n')
torch/multiprocessing/spawn.py:30 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:40 in public class `ProcessExitedException`:
D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/spawn.py:40 in public class `ProcessExitedException`:
D400: First line should end with a period (not 'l')
torch/multiprocessing/spawn.py:47 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:59 in public method `__reduce__`:
D105: Missing docstring in magic method
torch/multiprocessing/spawn.py:85 in public class `ProcessContext`:
D101: Missing docstring in public class
torch/multiprocessing/spawn.py:86 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:93 in public method `pids`:
D102: Missing docstring in public method
torch/multiprocessing/spawn.py:97 in public method `join`:
D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/spawn.py:97 in public method `join`:
D401: First line should be in imperative mood (perhaps 'Try', not 'Tries')
torch/multiprocessing/spawn.py:166 in public class `SpawnContext`:
D101: Missing docstring in public class
torch/multiprocessing/spawn.py:167 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:180 in public function `start_processes`:
D103: Missing docstring in public function
19
```
**After: 13**
```
torch/multiprocessing/spawn.py:1 at module level:
D100: Missing docstring in public module
torch/multiprocessing/spawn.py:11 in public class `ProcessException`:
D101: Missing docstring in public class
torch/multiprocessing/spawn.py:14 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:20 in public method `__reduce__`:
D105: Missing docstring in magic method
torch/multiprocessing/spawn.py:27 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:41 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:53 in public method `__reduce__`:
D105: Missing docstring in magic method
torch/multiprocessing/spawn.py:79 in public class `ProcessContext`:
D101: Missing docstring in public class
torch/multiprocessing/spawn.py:80 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:87 in public method `pids`:
D102: Missing docstring in public method
torch/multiprocessing/spawn.py:161 in public class `SpawnContext`:
D101: Missing docstring in public class
torch/multiprocessing/spawn.py:162 in public method `__init__`:
D107: Missing docstring in __init__
torch/multiprocessing/spawn.py:175 in public function `start_processes`:
D103: Missing docstring in public function
13
```
- `torch/multiprocessing/__init__.py` <br/>
**Before: 5**
```
torch/multiprocessing/__init__.py:1 at module level:
D205: 1 blank line required between summary line and description (found 0)
torch/multiprocessing/__init__.py:1 at module level:
D400: First line should end with a period (not '`')
torch/multiprocessing/__init__.py:57 in public function `set_sharing_strategy`:
D401: First line should be in imperative mood (perhaps 'Set', not 'Sets')
torch/multiprocessing/__init__.py:69 in public function `get_sharing_strategy`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
torch/multiprocessing/__init__.py:74 in public function `get_all_sharing_strategies`:
D401: First line should be in imperative mood (perhaps 'Return', not 'Returns')
5
```
**After: 0**
- `torch/nn/__init__.py` <br/>
**Before: 3**
```
torch/nn/__init__.py:1 at module level:
D104: Missing docstring in public package
torch/nn/__init__.py:14 in public function `factory_kwargs`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/__init__.py:14 in public function `factory_kwargs`:
D400: First line should end with a period (not 'd')
3
```
**After: 1**
```
torch/nn/__init__.py:1 at module level:
D104: Missing docstring in public package
1
```
- `torch/nn/cpp.py` <br/>
**Before: 16**
```
torch/nn/cpp.py:7 in public class `OrderedDictWrapper`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/cpp.py:7 in public class `OrderedDictWrapper`:
D400: First line should end with a period (not 'e')
torch/nn/cpp.py:16 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/cpp.py:21 in public method `cpp_dict`:
D102: Missing docstring in public method
torch/nn/cpp.py:27 in public method `items`:
D102: Missing docstring in public method
torch/nn/cpp.py:30 in public method `keys`:
D102: Missing docstring in public method
torch/nn/cpp.py:33 in public method `values`:
D102: Missing docstring in public method
torch/nn/cpp.py:36 in public method `__iter__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:39 in public method `__len__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:42 in public method `__contains__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:45 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:50 in public class `ModuleWrapper`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/cpp.py:50 in public class `ModuleWrapper`:
D400: First line should end with a period (not 'd')
torch/nn/cpp.py:55 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/cpp.py:83 in public method `training`:
D102: Missing docstring in public method
torch/nn/cpp.py:90 in public method `__repr__`:
D105: Missing docstring in magic method
16
```
**After: 12**
```
torch/nn/cpp.py:16 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/cpp.py:21 in public method `cpp_dict`:
D102: Missing docstring in public method
torch/nn/cpp.py:27 in public method `items`:
D102: Missing docstring in public method
torch/nn/cpp.py:30 in public method `keys`:
D102: Missing docstring in public method
torch/nn/cpp.py:33 in public method `values`:
D102: Missing docstring in public method
torch/nn/cpp.py:36 in public method `__iter__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:39 in public method `__len__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:42 in public method `__contains__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:45 in public method `__getitem__`:
D105: Missing docstring in magic method
torch/nn/cpp.py:52 in public method `__init__`:
D107: Missing docstring in __init__
torch/nn/cpp.py:80 in public method `training`:
D102: Missing docstring in public method
torch/nn/cpp.py:87 in public method `__repr__`:
D105: Missing docstring in magic method
12
```
- `torch/nn/grad.py` <br/>
**Before: 10**
```
torch/nn/grad.py:1 at module level:
D400: First line should end with a period (not 'e')
torch/nn/grad.py:8 in public function `conv1d_input`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/grad.py:8 in public function `conv1d_input`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:40 in public function `conv1d_weight`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:71 in public function `conv2d_input`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/grad.py:71 in public function `conv2d_input`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:103 in public function `conv2d_weight`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:134 in public function `conv3d_input`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/grad.py:134 in public function `conv3d_input`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
torch/nn/grad.py:166 in public function `conv3d_weight`:
D401: First line should be in imperative mood (perhaps 'Compute', not 'Computes')
10
```
**After: 0**
- `torch/nn/parameter.py` <br/>
**Before: 17**
```
torch/nn/parameter.py:1 at module level:
D100: Missing docstring in public module
torch/nn/parameter.py:14 in public class `Parameter`:
D204: 1 blank line required after class docstring (found 0)
torch/nn/parameter.py:33 in public method `__new__`:
D102: Missing docstring in public method
torch/nn/parameter.py:54 in public method `__deepcopy__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:62 in public method `__repr__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:65 in public method `__reduce_ex__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:84 in public class `UninitializedTensorMixin`:
D101: Missing docstring in public class
torch/nn/parameter.py:105 in public method `materialize`:
D205: 1 blank line required between summary line and description (found 0)
torch/nn/parameter.py:125 in public method `shape`:
D102: Missing docstring in public method
torch/nn/parameter.py:132 in public method `share_memory_`:
D102: Missing docstring in public method
torch/nn/parameter.py:138 in public method `__repr__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:141 in public method `__reduce_ex__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:149 in public method `__torch_function__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:164 in public function `is_lazy`:
D103: Missing docstring in public function
torch/nn/parameter.py:186 in public method `__new__`:
D102: Missing docstring in public method
torch/nn/parameter.py:191 in public method `__deepcopy__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:217 in public method `__new__`:
D102: Missing docstring in public method
17
```
**After: 15**
```
torch/nn/parameter.py:1 at module level:
D100: Missing docstring in public module
torch/nn/parameter.py:34 in public method `__new__`:
D102: Missing docstring in public method
torch/nn/parameter.py:55 in public method `__deepcopy__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:63 in public method `__repr__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:66 in public method `__reduce_ex__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:85 in public class `UninitializedTensorMixin`:
D101: Missing docstring in public class
torch/nn/parameter.py:127 in public method `shape`:
D102: Missing docstring in public method
torch/nn/parameter.py:134 in public method `share_memory_`:
D102: Missing docstring in public method
torch/nn/parameter.py:140 in public method `__repr__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:143 in public method `__reduce_ex__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:151 in public method `__torch_function__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:166 in public function `is_lazy`:
D103: Missing docstring in public function
torch/nn/parameter.py:188 in public method `__new__`:
D102: Missing docstring in public method
torch/nn/parameter.py:193 in public method `__deepcopy__`:
D105: Missing docstring in magic method
torch/nn/parameter.py:219 in public method `__new__`:
D102: Missing docstring in public method
15
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113052
Approved by: https://github.com/mikaylagawarecki, https://github.com/soulitzer
Enables the PyLint error codes implemented in ruff. These are un-opinionated static analysis checks on Python code that find common bugs. After running all of the PLE error codes implemented in ruff, I fixed the bugs, added a few ignores for malformed Python code that is part of our JIT test script, and finally added a few ignores for a false positive on PLE0605 and submitted an issue upstream to fix it in ruff: https://github.com/charliermarsh/ruff/issues/4345.
Common bugs found include malformed logging format calls, bad string format calls, invalid escape sequences, and more.
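To illustrate (these are minimal, hypothetical examples, not code from this PR), here are the named bug classes shown in their corrected form, with the buggy variant in comments:

```python
import logging
import re

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("demo")

# Malformed logging format call (PLE1205-style): the buggy form passes more
# arguments than the format string has placeholders, e.g.
#     logger.info("got %s", 1, 2)
# Corrected: the placeholder count matches the argument count.
logger.info("got %s and %s", 1, 2)

# Invalid escape sequence: "\d" in a plain string literal is deprecated
# (a future SyntaxError); a raw string is the fix.
pattern = re.compile(r"\d+")
assert pattern.findall("a1b22") == ["1", "22"]

# Bad string format call: the buggy form "%s %s" % ("only-one",) raises a
# TypeError at runtime; corrected, the argument tuple matches the format.
msg = "%s %s" % ("two", "args")
assert msg == "two args"
```

The logging case is notable because the mismatch only surfaces when the log record is actually formatted, so it can hide in rarely-hit error paths until static analysis flags it.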
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101079
Approved by: https://github.com/malfet
Significantly reduces the overhead of constructing Tensors and Storages and checking Storage liveness. Removes the regression for the HF models that I tested and removes 75% of the overhead of the extremely overhead-bound resnet50 training we have in torchbench (0.91x base commit, 1.02x torchinductor default, 1.16x this PR, 1.25x previous cudagraphs impl).
This PR takes care of all of the lower hanging fruit.
- Computes storage aliasing at record time instead of at runtime. We no longer need a runtime storage cache, and can instead index directly into the existing alias if there is one, or construct a new Storage.
- Batches the heavyweight C++ calls: getting storage weakrefs and constructing tensors.
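The record-time aliasing idea can be sketched with a toy example (hypothetical names; this is not the actual cudagraphs implementation): at record time we note, for each output storage, whether it is fresh or aliases an earlier output, so replay can index directly into a plan instead of consulting a runtime cache keyed by storage pointer.

```python
def record_alias_plan(storage_ids):
    """At record time, map each output to ("new", None) if its storage is
    fresh, or ("alias", i) if it shares storage with the i-th output."""
    first_seen = {}  # storage id -> index of the first output that owns it
    plan = []
    for i, sid in enumerate(storage_ids):
        if sid in first_seen:
            plan.append(("alias", first_seen[sid]))
        else:
            first_seen[sid] = i
            plan.append(("new", None))
    return plan

def replay(plan, make_storage):
    """At replay time, no cache lookups: either construct a new storage or
    index directly into the already-constructed list."""
    storages = []
    for kind, idx in plan:
        storages.append(storages[idx] if kind == "alias" else make_storage())
    return storages

# Outputs 0 and 2 share a storage (id 100); output 1 has its own (id 200).
plan = record_alias_plan([100, 200, 100])
assert plan == [("new", None), ("new", None), ("alias", 0)]

out = replay(plan, make_storage=object)
assert out[2] is out[0] and out[1] is not out[0]
```

The design point is that the aliasing structure is fixed once recording finishes, so all the bookkeeping moves out of the hot replay path.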
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98529
Approved by: https://github.com/jansel, https://github.com/ngimel
### Description
Since the major changes for `_TypedStorage` and `_UntypedStorage` are now complete, they can be renamed to be public.
`TypedStorage._untyped()` is renamed to `TypedStorage.untyped()`.
Documentation for storages is improved as well.
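A common pattern for this kind of private-to-public rename keeps the old private name as a deprecated alias (whether this PR does so is not stated here). A toy sketch with a stand-in class, purely for illustration:

```python
import warnings

class TypedStorageSketch:
    """Toy stand-in for TypedStorage; not the real torch class."""

    def untyped(self):
        # New public name.
        return "untyped-storage"

    def _untyped(self):
        # Old private name kept as a deprecated alias.
        warnings.warn(
            "_untyped() is deprecated; use untyped() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.untyped()

s = TypedStorageSketch()
assert s.untyped() == "untyped-storage"

# The deprecated alias still works, but emits a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert s._untyped() == "untyped-storage"
assert caught and issubclass(caught[0].category, DeprecationWarning)
```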
### Issue
Fixes #82436
### Testing
N/A
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82438
Approved by: https://github.com/ezyang