Currently, the only way ProcessGroupNCCL shuts down its background threads and aborts all communicators is via the destructor.
However, given how Python garbage collection works and the fact that code can hold references to the PG in multiple places, in practice calling `destroy_process_group` doesn't actually end up invoking the destructor.
As a result, in this PR I'm adding an explicit shutdown method that users can call to clean up all resources.
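A rough usage sketch is below. The `_shutdown()` call is a stand-in name for the explicit cleanup hook described here, and the exact spelling of the Python binding is an assumption, as is reaching the ProcessGroupNCCL instance via `_get_backend`:
```python
# Hedged sketch, not the confirmed API: explicitly tear down NCCL resources
# instead of relying on Python GC to run the ProcessGroupNCCL destructor.
import torch
import torch.distributed as dist

dist.init_process_group("nccl")  # assumes rank/world size come from the launcher env
pg = dist.distributed_c10d._get_default_group()

# ... run collectives ...

nccl_backend = pg._get_backend(torch.device("cuda"))  # ProcessGroupNCCL instance
nccl_backend._shutdown()  # hypothetical name for the new explicit shutdown method
dist.destroy_process_group()
```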
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111392
Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fduwjj
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.
This results in messy error-handling code along these lines:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this issue, in this PR I've added these error types:
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and refers to errors raised from the process group backend
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
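As an illustration (a hedged sketch: the exception classes are those listed above, assumed to be exposed under `torch.distributed`, and the handler bodies are placeholders), error handling can then dispatch on type instead of matching strings:
```python
import torch.distributed as dist

try:
    dist.init_process_group("nccl")
except dist.DistStoreError:
    ...  # store failures, e.g. store-based barrier timed out
except dist.DistNetworkError:
    ...  # socket-level failures: timeouts, broken pipe, connection reset
except dist.DistBackendError:
    ...  # backend (e.g. NCCL) failures
except dist.DistError:
    ...  # any other distributed error
```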
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.
This results in messy error-handling code along these lines:
```
if "NCCL" in exception_str:
...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
...
if "The client socket has timed out after" in exception_str:
...
if "Broken pipe" in exception_str:
...
if "Connection reset by peer" in exception_str:
...
```
To address this issue, in this PR I've added these error types:
1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and refers to errors raised from the process group backend
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
https://github.com/pytorch/pytorch/pull/95715 added the functionality to abort `ncclCommInitRankConfig` by specifying `blocking=0` to enable non-blocking behavior.
However, calling `pg._abort()` didn't recover from a stuck `ncclCommInitRankConfig`, since the `_abort` method only looked through the `devNCCLCommMap_` map and aborted those communicators. Because `ncclCommInitRankConfig` was stuck, the communicator itself was never added to the map and the host thread was stuck on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, `_abort` was a no-op.
To resolve this issue, I added the communicators to `inProgressCommMap_` as soon as they were created and then removed them once they were added to `devNCCLCommMap_`.
I also added a unit test that fails without the changes to ProcessGroupNCCL.cpp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103925
Approved by: https://github.com/osalpekar
This reverts commit 03881b0c925f191ec41d6899d589ed420ac285b5.
Reverted https://github.com/pytorch/pytorch/pull/103264 on behalf of https://github.com/osalpekar due to This commit seems to have been causing failures in test_nccl_init_abort. Those failures may have been masked by pre-existing failures in the distributed jobs on trunk when running CI on this PR. Since those breaking changes are now reverted, we should be able to rebase this and get clean signal + uncover the breakages caused by this PR. ([comment](https://github.com/pytorch/pytorch/pull/103264#issuecomment-1599451197))
https://github.com/pytorch/pytorch/pull/95715 added the functionality to abort `ncclCommInitRankConfig` by specifying `blocking=0` to enable non-blocking behavior.
However, calling `pg._abort()` didn't recover from a stuck `ncclCommInitRankConfig`, since the `_abort` method only looked through the `devNCCLCommMap_` map and aborted those communicators. Because `ncclCommInitRankConfig` was stuck, the communicator itself was never added to the map and the host thread was stuck on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, `_abort` was a no-op.
To resolve this issue, I added the communicators to `inProgressCommMap_` as soon as they were created and then removed them once they were added to `devNCCLCommMap_`.
I also added a unit test that fails without the changes to ProcessGroupNCCL.cpp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103264
Approved by: https://github.com/kwen2501
Adds support for multiple forward passes before the backward call when
static_graph=True (see the usage sketch after the change list below).
There are 2 changes:
1) Change the accounting of when to populate the static-graph-related maps
from relying on forward iterations to relying on backward calls.
2) In DDP Python, don't rely on num_forward iterations == 1 to enqueue the
delay allreduce. Instead, use a flag.
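A minimal usage sketch of the pattern this enables (assumes the usual torchrun/launcher environment variables are set; gloo is used here just so the sketch doesn't require GPUs):
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def run() -> None:
    dist.init_process_group("gloo")  # rank/world size come from the launcher env
    model = torch.nn.Linear(10, 10)
    ddp_model = DDP(model, static_graph=True)

    # Two forward passes before a single backward -- previously unsupported
    # with static_graph=True.
    out1 = ddp_model(torch.randn(4, 10))
    out2 = ddp_model(torch.randn(4, 10))
    (out1.sum() + out2.sum()).backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    run()
```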
Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103487
Approved by: https://github.com/awgu
Summary: In cases where the DDP backward pass is not finalized, the error is raised only in the next forward iteration of DDP. However, if there are other collective calls between those two points, training scripts could potentially get stuck.
As a result, there should be a way to check whether DDP finalized after calling `.backward()`. To address this, I've added a `_check_reducer_finalized` method to validate that DDP did indeed successfully finish gradient reduction.
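A hedged usage sketch, assuming `ddp_model` is an already-constructed DistributedDataParallel instance and that the method is exposed on it under the name given in this summary:
```python
loss = ddp_model(inputs).sum()
loss.backward()

# Raises immediately if the reducer did not finish reducing all gradient
# buckets in this backward pass, instead of surfacing the error only on the
# next forward call.
ddp_model._check_reducer_finalized()
```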
Test Plan: Added unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100773
Approved by: https://github.com/rohan-varma
[BE] `require_backend_is_available` offers a more thorough check than `require_backend`, but both are often used together. This removes `require_backend` and centralizes on the `require_backend_is_available` decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101891
Approved by: https://github.com/awgu
## TLDR
Fix decorator to re-enable 26+ distributed tests that were previously being skipped in CI
## Explanation
As part of the UCC upstream, we updated the backend test cases to also include "ucc".
3ed1569e86/torch/testing/_internal/common_distributed.py (L90-L92)
In distributed tests we use a decorator which reads from this config and makes sure all backends are available on the system.
3ed1569e86/torch/testing/_internal/distributed/distributed_test.py (L7131)
**However**, UCC is not enabled by default for a certain subset of CI tests, which causes the entire test to be skipped (even if the test is meant for nccl and the backend being tested is nccl).
As the fix, we should check only that the `BACKEND` being tested is available.
## Changes
- Change logic to only check if the current backend being used is available
- Rename `require_backends_available` -> `require_backend_is_available`
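For illustration, a hedged sketch of how a test might use the renamed decorator (the exact argument shape is an assumption; it mirrors the old `require_backends_available` usage linked above):
```python
from torch.testing._internal.common_distributed import require_backend_is_available

BACKEND = "nccl"

class MyDistributedTest:
    @require_backend_is_available({BACKEND})
    def test_allreduce(self):
        # Runs when the backend under test is built/available; no longer
        # skipped just because some other backend (e.g. ucc) is missing.
        ...
```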
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101704
Approved by: https://github.com/rohan-varma
This is a mirror PR of D45339293
Summary:
These tests cause the following errors internally for unknown reasons:
```
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adam'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adamw'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_sgd'
```
Commenting these tests out to unblock other PRs.
Test Plan: Sandcastle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100215
Approved by: https://github.com/wz337, https://github.com/fduwjj
This reverts commit ae40a6c7356190ef86b14b10a94a58ca41ca496b.
Reverted https://github.com/pytorch/pytorch/pull/100215 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it breaks lint; please run `lintrunner -a torch/testing/_internal/distributed/distributed_test.py` to fix the issue, then reland it
This is a mirror PR of D45339293
Summary:
These tests cause the following errors internally for unknown reasons:
```
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adam'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adamw'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_sgd'
```
Commenting these tests out to unblock other PRs.
Test Plan: Sandcastle
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100215
Approved by: https://github.com/wz337, https://github.com/fduwjj
### Description
This PR aims to reduce the CPU overhead of context-manager-style coalescing.
By "context manager style coalescing", we mean:
Sync style:
```
with _coalescing_manager():
    for i in range(num_coll):
        dist.all_reduce(tensors[i])
```
Async style:
```
with _coalescing_manager(async_ops=True) as cm:
    for i in range(num_coll):
        dist.all_reduce(tensors[i])
cm.wait()
```
In the previous implementation, each collective in the `num_coll` loop actually called into the C++ backend, accumulating pybind overhead.
In the new implementation, we capture the collectives at the Python level and only dispatch to C++ at the exit of the coalescing manager.
### Tests
In the current PR, the "fast path" only applies to all-reduce.
- Flattened 512M: 16.38 ms, including CPU time 131.21 us
- Old _coalescing_manager 64 x 8M: 22.19 ms, including CPU time 2865 us
- New _coalescing_manager 64 x 8M: 16.93 ms, including CPU time 635 us
Hence a 4x reduction in CPU overhead (dependent on `num_coll`).
Cc @mrshenli @kumpera @wanchaol @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98793
Approved by: https://github.com/kumpera
This PR fixes https://github.com/pytorch/pytorch/issues/96203.
**Details**
When using `nn.SyncBatchNorm` with the model converted to FP16, there is a dtype discrepancy in the `SyncBatchNorm.forward()` causing an error like:
```
File "/.../pytorch/torch/nn/modules/_functions.py", line 91, in forward
mean, invstd = torch.batch_norm_gather_stats_with_counts(
RuntimeError: Expected counts to have type Half but got Float
```
[`torch.batch_norm_gather_stats_with_counts()`](fe9da29842/torch/nn/modules/_functions.py (L88-L97)) requires the `running_mean`, `running_var`, and `counts` to have the same dtype. However, when the model has been converted to FP16, only `running_mean` and `running_var` use FP16, while the `counts` are in FP32 due to [`mean` being in FP32](fe9da29842/torch/nn/modules/_functions.py (L25-L30)). This PR resolves this by casting `counts` from FP32 to FP16 instead of the alternative of casting `mean` and `invstd` from FP32 to FP16.
Moreover, for the backward pass, this PR casts `weight` from FP16 to FP32 to match the dtype of `mean` and `invstd`, as required by `torch.batch_norm_backward_elemt()`, instead of the alternative of casting `mean` and `invstd` from FP32 to FP16.
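The forward-pass fix amounts to a dtype alignment along these lines (a simplified sketch rather than the literal patch; variable names follow `_functions.py`, and the surrounding code is omitted):
```python
# Cast the gathered counts to the running-stat dtype (FP16 here) instead of
# upcasting mean/invstd, so all inputs to the gather op share one dtype.
count_all = count_all.to(dtype=running_mean.dtype)
mean, invstd = torch.batch_norm_gather_stats_with_counts(
    input,
    mean_all,
    invstd_all,
    running_mean,
    running_var,
    momentum,
    eps,
    count_all.view(-1),
)
```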
**Test Plan**
I dug up this run command from 2021:
For `world_size` in `{1,2}` and `backend` in `{nccl, gloo}`:
```
WORLD_SIZE=world_size BACKEND=backend python -m pytest test/distributed/test_distributed_spawn.py -k test_DistributedDataParallel_SyncBatchNorm_half -vs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98332
Approved by: https://github.com/rohan-varma
Fixes #97191
This PR aims to propagate collective exceptions (async errors or timeouts) up to the program, so as to avoid silently stuck jobs.
### Previous output in #97191
```
Rank 0 is the problematic rank
Rank 4 completed
Rank 5 completed
Rank 3 completed
Rank 6 completed
Rank 2 completed
Rank 7 completed
Rank 1 completed
[E ProcessGroupNCCL.cpp:464] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10917 milliseconds before timing out.
Rank 0 completed
[E ProcessGroupNCCL.cpp:478] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:483] To avoid data inconsistency, we are taking the entire process down.
```
Although it says that it is taking the process down, it sometimes fails to do so.
### New output after this PR:
```
...
[E ProcessGroupNCCL.cpp:459] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10599 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:473] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:479] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:818] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10599 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 194470) of binary: /data/home/kw2501/repos/pytorch-dev-env/bin/python
Traceback (most recent call last):
File "/pytorch-dev-env/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
File "/pytorch-dev/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/pytorch-dev/torch/distributed/run.py", line 794, in main
run(args)
File "/pytorch-dev/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/pytorch-dev/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/pytorch-dev/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
hang.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-20_22:00:42
host : node0
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 194470)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 194470
============================================================
```
The log suggests that the TorchX monitor is triggered and the job is torn down.
### Major changes in this PR:
1. Merge the ncclWatchDog thread and the workCleanupLoop thread into one so that the watch action and the throw action are streamlined.
Previously, ncclWatchDog was responsible for watching for comm errors and timeouts, and workCleanupLoop was responsible for watching for Work item errors and throwing exceptions. This two-thread design was not streamlined, raising the chance of missing the throw; it also duplicated the watching at multiple levels.
2. Rethrow the exception in the watchdog thread.
3. Clean up a bunch of duplicated functions, e.g. `checkAndThrowException` and `handleNcclException`.
4. Turn on ASYNC_ERROR_HANDLING by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97066
Approved by: https://github.com/rohan-varma
Summary:
When creating a new DDP instance for the same model while an old DDP instance still exists, the autograd hooks from the old DDP instance might not be cleared. Also, relying on Python GC to clear out old autograd hooks is fragile and may not work 100% of the time.
As a result, in this PR I'm adding a way to explicitly remove these hooks from DDP.
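A hedged sketch of the intended usage; `_remove_autograd_hooks()` is a guessed name for the private helper based on this summary, and `model` is assumed to be an existing nn.Module:
```python
from torch.nn.parallel import DistributedDataParallel as DDP

old_ddp = DDP(model)
# ... training with old_ddp ...

# Explicitly detach the old instance's autograd hooks instead of relying on
# Python GC to eventually clean them up.
old_ddp._remove_autograd_hooks()  # hypothetical name for the helper added here

new_ddp = DDP(model)
```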
Test Plan:
Unit test added
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96490
Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma
Implements native mixed precision support for DDP in a similar fashion to how it is enabled for FSDP. The implementation works as follows:
1. In DDP init, we save `_mp_param` and `_fp_param` variables to manage mixed precision parameter usage. In particular, `_mp_param` will represent the parameter in the reduced precision, while `_fp_param` will represent the param in regular precision. During forward/backward, we swap back and forth as needed.
2. The root module gets a root pre-forward hook that kicks off copies to the reduced precision for all submodules. An event is recorded for each submodule to allow for waiting, as we run these asynchronously.
3. Each module gets a pre-forward hook that waits on its corresponding event. Note that modules might be reused during training; in this case, the wait is only done for the first module. After this wait, the module's parameters are in reduced precision.
4. In the pre-forward hook, we register a backward hook on the lower precision parameters in order to run reduced precision allreduce + parameter upcast. We can't rely on the Reducer's constructor setting up these hooks because the gradient is accumulated on the low precision param, so we need to register them ourselves.
5. In the backward pass, when the hook runs, we first run allreduce + divide in the reduced precision. Next, we upcast parameters and gradients back to fp32 asynchronously. We also queue a callback at the end of backward to wait on these upcasts so that the upcast is complete before optim.step() runs.
6. Parameters that don't require grad are also cast since they may be used in computation; they are upcast back in the final autograd callback.
7. DDP Ignored parameters are not touched.
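To make the `_mp_param`/`_fp_param` swap in step 1 concrete, here is a minimal standalone sketch of the idea (this is not the actual DDP internals, just an illustration of pairing an FP32 master copy with a low-precision working copy):
```python
import torch

class ParamWithMixedPrecision:
    """Pairs a full-precision master copy with a reduced-precision working copy."""

    def __init__(self, param: torch.nn.Parameter, low_dtype=torch.float16):
        self.param = param
        self._fp_param = param.data                # full-precision master tensor
        self._mp_param = param.data.to(low_dtype)  # reduced-precision working tensor

    def cast_to_low_precision(self) -> None:
        # Before forward: refresh the low-precision copy and compute with it.
        self._mp_param.copy_(self._fp_param)
        self.param.data = self._mp_param

    def upcast(self) -> None:
        # After the reduced-precision allreduce in backward: restore the FP32
        # copy so the optimizer steps in full precision.
        self._fp_param.copy_(self._mp_param)
        self.param.data = self._fp_param
```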
Follow-ups:
1. Unify comm hooks and make it work with apply optimizer in backward.
2. Implement keep_low_precision_grads.
3. Allow BN, LN, or custom units to run in reduced precision.
4. Support for cast_forward_inputs.
5. Unify certain APIs / helpers with FSDP where possible, such as for _cast_forward_inputs.
6. Integrate this with the replicate() API.
7. The order in which we kick off copies and wait for them is set by the iteration order of module.modules(), but this might not be how the modules are used in the actual training. In the worst case, the last module in module.modules() could be used first, which would result in waiting for all copies unnecessarily. For static graphs, we should record the module execution order and copy / wait in this order.
8. Entirely unused modules probably don't need to be cast.
Differential Revision: [D42515803](https://our.internmc.facebook.com/intern/diff/D42515803/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92882
Approved by: https://github.com/zhaojuanmao
Applies the remaining flake8-comprehensions fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving them into just the set call.
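An illustrative before/after for this kind of rewrite (made-up variable names):
```python
items = [1, 2, 2, 3]

# Before: unnecessary generator expressions passed to set()/dict().
unique_before = set(x for x in items)
squares_before = dict((x, x * x) for x in items)

# After: equivalent comprehensions -- shorter and no intermediate
# generator object is built.
unique_after = {x for x in items}
squares_after = {x: x * x for x in items}
```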
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
This PR deprecates `torch.nn.utils.stateless.functional_call`:
- Updates the docs to say it is deprecated
- Raises a UserWarning
- Changes most of the callsites inside PyTorch to use
torch.func.functional_call, minus the test_stateless testing.
The motivation behind this is that we can now align behind a single
functional_call API in PyTorch.
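For reference, the API that call sites migrate to (a small self-contained example; the module and shapes are made up):
```python
import torch
from torch.func import functional_call

model = torch.nn.Linear(3, 2)
x = torch.randn(4, 3)

# Re-run the module with an explicit parameter/buffer dictionary instead of
# the module's own registered parameters.
params = {name: p.detach().clone() for name, p in model.named_parameters()}
out = functional_call(model, params, (x,))
```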
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92280
Approved by: https://github.com/albanD
Allow `_apply_optim_in_backward` to work with DDP.
Example:
```
dist.init_process_group("nccl", rank=rank, world_size=2)
torch.cuda.set_device(rank)
e = enc().cuda(rank)
_apply_optimizer_in_backward(
optimizer_class=torch.optim.SGD,
params=e.parameters(),
optimizer_kwargs={"lr": 0.03},
)
e = DDP(e, device_ids=[rank])
inp = torch.randn(1, 10, device=rank)
e(inp).sum().backward()
```
Constraints:
1. Custom communication hook not yet supported
2. _apply_optim_in_backward needs to be called _before_ wrapping model in DDP.
3. DDP will remove the gradient hooks _apply_optim_in_backward registers, so these gradient hooks will not be fired and cannot be used.
4. All DDP-managed parameters have their grads set to None by default once the optimizer is applied. There is no support for setting only some parameter grads to None; this must be done manually by the user (and `DDP_OVERLAPPED_OPTIM_SET_GRADS_TO_NONE=0` needs to be set).
Differential Revision: [D41329694](https://our.internmc.facebook.com/intern/diff/D41329694/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41329694/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89194
Approved by: https://github.com/zhaojuanmao