Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71459
1. Add a `static_graph` argument to the DDP constructor.
2. Keep the `_set_static_graph()` API so that existing use cases are unaffected; it can also be called internally by the DDP constructor.
3. Four cases are covered (see the sketch below):
   - static_graph=False, _set_static_graph() is called
   - static_graph=False, _set_static_graph() is not called
   - static_graph=True, _set_static_graph() is not called
   - static_graph=True, _set_static_graph() is called
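A minimal sketch of the new constructor path versus the legacy private API (assuming `torch.distributed.init_process_group` has already been called and one GPU per process):
```
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Linear(10, 10).cuda()

# New constructor argument added by this PR:
ddp_model = DDP(model, device_ids=[0], static_graph=True)

# Legacy private API, kept for existing callers; the constructor path above
# calls it internally:
# ddp_model = DDP(model, device_ids=[0])
# ddp_model._set_static_graph()
```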
ghstack-source-id: 147263797
Test Plan: unit tests
Reviewed By: rohan-varma
Differential Revision: D33646738
fbshipit-source-id: 8c1730591152aab91afce7133d2adf1efd723855
(cherry picked from commit dc246a1129a8ce5f70e551d7d8e00e0dab8ec6af)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68827
Add a note about current checkpoint support with DDP. Note that this
does not include the features enabled with _set_static_graph yet, as it is an
undocumented private API. Once we support static graph as a beta feature in OSS,
we can add to the note here.
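For illustration only, a hedged sketch of the typical pattern the note covers (the module below is hypothetical; the exact supported configurations are spelled out in the note itself):
```
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(16, 16)
        self.l2 = nn.Linear(16, 16)

    def forward(self, x):
        # Recompute l1's activations during backward instead of storing them.
        x = checkpoint(self.l1, x)
        return self.l2(x)

ddp_model = DDP(Net().cuda(), device_ids=[0])
```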
ghstack-source-id: 144285041
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D32624957
fbshipit-source-id: e21d156a1c4744b6e2a807b5b5289ed26701886f
Summary:
`default_collate`, `default_convert`, and `pin_memory` convert sequences into lists. I believe they should keep the original type when possible (e.g., I have a class that inherits from `list`, which comes from a third-party library that I can't change and provides extra functionality).
Note this is easy to do when the type can be constructed from an iterable, but that's not always the case (e.g., `range`).
Even though this can be accomplished with a custom `default_collate`/`default_convert`, 1) this is behavior they should support out of the box IMHO, and 2) `pin_memory` still does it.
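A hedged sketch of the intended behavior (the class and values are illustrative, and the import path for `default_collate` varies across versions):
```
from torch.utils.data._utils.collate import default_collate

class MyList(list):
    """Stand-in for a third-party list subclass with extra functionality."""

batch = [MyList([1, 2]), MyList([3, 4])]
out = default_collate(batch)
# Previously `out` was a plain `list` of collated elements; with this change it
# should be a `MyList`, since the type can be constructed from an iterable.
```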
cc VitalyFedyunin ejguan NivekT
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68779
Reviewed By: wenleix
Differential Revision: D32651129
Pulled By: ejguan
fbshipit-source-id: 17c390934bacc0e4ead060469cf15dde815550b4
Summary:
Patch bfloat16 support in NCCL. PR https://github.com/pytorch/pytorch/issues/63260 adds bfloat16 support but is
still not enough to enable bfloat16 for allreduce in end-to-end training.
This patch does the following (see the sketch below):
* fix the minimum NCCL version from 2.9.7 to 2.10; NCCL added bf16 support in
v2.10.3-1 (commit 7e51592)
* update the bfloat16 datatype flag in `csrc/cuda/nccl.cpp` so that NCCL
operations like all-reduce can use it
* enable unit tests for the bfloat16 datatype where possible
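A minimal end-to-end sketch of what this unblocks (assumes NCCL >= 2.10, a bf16-capable GPU, and one process per GPU):
```
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
t = torch.ones(4, device="cuda", dtype=torch.bfloat16)
dist.all_reduce(t)  # now maps to ncclBfloat16 instead of failing the dtype check
dist.destroy_process_group()
```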
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67843
Reviewed By: H-Huang
Differential Revision: D32248132
Pulled By: mrshenli
fbshipit-source-id: 081e96e725af3b933dd65ec157c5ad11c6873525
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66680
Closes https://github.com/pytorch/pytorch/issues/66215. Tracks models
with sync BN so we can find workflows that use them and target them for perf
optimization.
ghstack-source-id: 140875182
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D31679477
fbshipit-source-id: 0e68cd1a7aabbc5b26227895c53d33b8e98bfb8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66015
Fixes https://github.com/pytorch/pytorch/issues/61982 by cloning
tensors in DDPSink. This only applies once for static_graph and, in general, for
unused params, which already has overhead, so the perf hit should not be an issue.
Will verify with a benchmark.
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D31346633
fbshipit-source-id: 5b9245ade628565cffe01731f6a0dcbb6126029b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64515
For performance reasons, we would like to ensure that we can await
user collectives as part of custom buffer reduction in parallel with other work.
As a result, add support for returning futures from custom buffer hooks and awaiting
those futures at the end of the backward pass.
Also added some docs to clarify how to use these APIs.
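A hedged illustration of the shape of such a hook; the hook signature below is an assumption, and the private registration entry point is omitted. Only the contract of returning a future that DDP awaits after backward comes from this diff:
```
import torch
import torch.distributed as dist

def buffer_allreduce_hook(state, named_buffers):
    # Hypothetical signature: `named_buffers` maps buffer name -> tensor.
    futs = [
        dist.all_reduce(buf, async_op=True).get_future()
        for buf in named_buffers.values()
    ]
    # DDP awaits the returned future at the end of the backward pass, so the
    # buffer reduction can overlap with other work.
    return torch.futures.collect_all(futs)
```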
ghstack-source-id: 138793803
Test Plan: I
Reviewed By: zhaojuanmao
Differential Revision: D30757761
fbshipit-source-id: e1a2ead9ca850cb345fbee079cf0614e91bece44
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65181
This PR changes the sync from `state_dict()` to explicit `named_parameters` and `named_buffers`. The underlying motivation is that `state_dict()` does not necessarily equal "params + buffers" in all cases: state_dict is used mainly for checkpointing, while params/buffers are used for training, and the two may take different forms (e.g., we might want to save state_dict as small pieces of tensors, while in training we want to concat the tensors together for performance reasons).
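A small illustration of the distinction (generic `nn.Module` code, not the PR's internals):
```
import torch.nn as nn

module = nn.BatchNorm1d(4)

# What the sync path now iterates over explicitly:
live = dict(module.named_parameters())
live.update(dict(module.named_buffers()))

# state_dict() is intended for checkpointing and need not mirror the live
# params/buffers layout used during training.
ckpt = module.state_dict()
```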
ghstack-source-id: 138701159
Test Plan: wait for ci
Reviewed By: divchenko, rohan-varma
Differential Revision: D31007085
fbshipit-source-id: 4e1c4fbc07110163fb9b09b043ef7b4b75150f18
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64514
sync_params is a misnomer since we don't actually synchronize
parameters. While removing it, I realized
`self._check_and_sync_module_buffers` does almost everything we need it to, so
just refactored that and made DDP forward call into it.
ghstack-source-id: 138684982
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D30751231
fbshipit-source-id: add7c684f5c6c71dad9e9597c7759849fa74f47a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64113
Since there is only one model replica per process, `replicas`
can be simplified from `std::vector<std::vector<at::Tensor>>` to
`std::vector<at::Tensor>` in the Reducer class.
Test Plan:
All tests are passing
`pytest test/distributed/test_c10d_gloo.py -vs`
Imported from OSS
Reviewed By: mrshenli
Differential Revision: D30615965
fbshipit-source-id: d2ec809d99b788c200b01411333e7dbad1269b51
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64474
No need for a nested list here.
ghstack-source-id: 137526312
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D30745960
fbshipit-source-id: 66a8f9847e9fe1e02c51b79647e93bf7665cf4d9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64472
Sometimes, a user module can reassign a tensor buffer, as in:
```
self.buffer = torch.randn(1, 2) # in init
self.buffer += 1 # in forward
```
In this case, `self.modules_buffers` becomes outdated, and we should
repopulate `self.modules_buffers` if we need to sync module buffers.
See https://github.com/pytorch/pytorch/issues/63916 for full description of the
issue.
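For context, a self-contained sketch of the pattern (an assumed shape of the linked issue, not copied from it):
```
import torch
import torch.nn as nn

class BufferModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("buffer", torch.randn(1, 2))

    def forward(self, x):
        # Reassigning the attribute (here via an out-of-place add) replaces the
        # registered buffer with a new tensor, so the buffer list DDP cached at
        # construction time goes stale.
        self.buffer = self.buffer + 1
        return x + self.buffer
```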
ghstack-source-id: 137526309
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D30745921
fbshipit-source-id: 25eb1edbf445703a481802e07f3058d38ea6fc64
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63260
Add BF16 all-reduce communication hook. Skip if CUDA version < 11 or NCCL version < 2.9.7.
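Sketch of registering the hook (assumes a process group is already initialized and the CUDA/NCCL version checks above pass):
```
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks

ddp_model = DDP(nn.Linear(8, 8).cuda(), device_ids=[0])
ddp_model.register_comm_hook(state=None, hook=default_hooks.bf16_compress_hook)
```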
Reviewed By: SciPioneer
Differential Revision: D30238317
fbshipit-source-id: bad35bf7d43f10f1c40997a282b831b61ef592bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62996
No need to set this because the autograd engine already propagates TLS
states.
ghstack-source-id: 135438220
Test Plan: CI
Reviewed By: albanD
Differential Revision: D30202078
fbshipit-source-id: e5e917269a03afd7a6b8e61f28b45cdb71ac3e64
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61753
Reland of https://github.com/pytorch/pytorch/pull/57081.
The main difference is that the former diff moved the `prepare_for_backward` check into `DDPSink` backward, but that resulted in issues due to potential autograd engine races. The original diff moved `prepare_for_backward` into `DDPSink` as part of a long-term plan to always call it within `DDPSink`.
In particular this doesn't work because `prepare_for_backward` sets `expect_autograd_hooks=true` which enables autograd hooks to fire, but there were several use cases internally where autograd hooks were called before DDPSink called `prepare_for_backward`, resulting in errors/regression.
We instead keep the call to `prepare_for_backward` in the forward pass, but still run outputs through `DDPSink` when find_unused_parameters=True. As a result, outputs that are not used when computing loss have `None` gradients, and we don't touch them if they are globally `None`. Note that the hooks still fire with an undefined gradient, which is how we avoid the Reducer erroring out with the message that some hooks did not fire.
Added the unittests that were part of the reverted diff.
ghstack-source-id: 135388925
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D29726179
fbshipit-source-id: 54c8819e0aa72c61554104723a5b9c936501e719
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62748
Previously, after buckets were rebuilt, the first bucket size always
defaulted to 1 MB; this diff allows the first bucket to be tuned like the rest of
the bucket sizes.
Setting `dist._DEFAULT_FIRST_BUCKET_BYTES = 1` results in the following log as
expected:
I0804 12:31:47.592272 246736 reducer.cpp:1694] 3 buckets rebuilt with size limits: 1, 1048, 1048 bytes.
ghstack-source-id: 135074696
Test Plan: CI
Reviewed By: SciPioneer, wanchaol
Differential Revision: D30110041
fbshipit-source-id: 96f76bec012de129d1645e7f50e266d4b255ec66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662
Replaced the methods `set_tensor(.)` and `get_tensor()` in the Python API exposed from the C++ logic with `buffer()` and `set_buffer(.)` for a cleaner interface.
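Sketch of a custom comm hook using the renamed accessor (registration path unchanged; `set_buffer(.)` would be used analogously to replace the bucket's flat tensor):
```
import torch
import torch.distributed as dist

def allreduce_hook(state, bucket):
    tensor = bucket.buffer()  # formerly bucket.get_tensor()
    tensor.div_(dist.get_world_size())
    fut = dist.all_reduce(tensor, async_op=True).get_future()
    return fut.then(lambda f: f.value()[0])
```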
Reviewed By: SciPioneer
Differential Revision: D30012869
fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482
Summary:
**Overview:**
This removes the preceding `_` from `_Join`, `_Joinable`, and `_JoinHook` in preparation for adding the generic join context manager tutorial (see [here](https://github.com/pytorch/tutorials/pull/1610)). This also adds a docs page, which can be linked from the tutorial. [Here](https://github.com/pytorch/pytorch/files/6919475/render.pdf) is a render of the docs page.
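With the underscore removed, usage looks roughly like the following sketch (assumes `ddp_model`, `zero_optim`, and `dataloader` already exist, as in the examples further down this log):
```
from torch.distributed.algorithms.join import Join

with Join([ddp_model, zero_optim]):
    for inputs in dataloader:
        loss = ddp_model(inputs).sum()
        loss.backward()
        zero_optim.step()
```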
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62605
Test Plan:
`DistributedDataParallel.join()`:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
`ZeroRedundancyOptimizer`:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
NOTE: DDP overlap tests are failing due to a landing race. See https://github.com/pytorch/pytorch/pull/62592. Once the fix is landed, I will rebase, and tests should be passing.
`Join`:
```
gpurun4 python test/distributed/algorithms/test_join.py
```
Reviewed By: mrshenli
Differential Revision: D30055544
Pulled By: andwgu
fbshipit-source-id: a5ce1f1d9f1904de3bdd4edd0b31b0a612d87026
Summary:
**Overview:**
This adds two approaches to overlapping `DistributedDataParallel.backward()` with `ZeroRedundancyOptimizer.step()` by providing two hook constructors: `hook_with_zero_step()` and `hook_with_zero_step_interleaved()`. The former waits for all backward computation to finish before starting optimizer computation, while the latter launches a partial optimizer computation using the contents of a gradient bucket once that bucket's all-reduce completes. The two approaches each suffer from their own weaknesses, and which one to use depends on the specific hardware configuration.
Both approaches can share changes to `ZeroRedundancyOptimizer`. A user should pass `overlap_with_ddp=True` to `ZeroRedundancyOptimizer`, construct a DDP communication hook using either `hook_with_zero_step()` or `hook_with_zero_step_interleaved()`, and register that communication hook. `ZeroRedundancyOptimizer.step()` should still be called in the training loop, though the optimizer computation and communication will be offloaded to originate from the communication hook. Currently, the first two iterations are vacuous, meaning they do not result in parameter updates and the inputs are ignored. This is required to finalize the DDP bucket strategy and to then initialize the `ZeroRedundancyOptimizer`'s local optimizer based on that bucketing.
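A hedged sketch of the wiring described above (`ddp_model` is an existing DDP-wrapped model; module paths and keyword names are assumptions based on this stack):
```
import torch
from torch.distributed.algorithms.ddp_comm_hooks.ddp_zero_hook import hook_with_zero_step
from torch.distributed.algorithms.ddp_comm_hooks.default_hooks import allreduce_hook
from torch.distributed.optim import ZeroRedundancyOptimizer

zero_optim = ZeroRedundancyOptimizer(
    ddp_model.parameters(),
    optimizer_class=torch.optim.SGD,
    lr=0.01,
    overlap_with_ddp=True,
)
ddp_model.register_comm_hook(None, hook_with_zero_step(allreduce_hook, ddp_model, zero_optim))

# The training loop still calls zero_optim.step() each iteration; the first two
# iterations are vacuous while DDP finalizes its bucketing and the local
# optimizer is initialized from it.
```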
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62157
Test Plan:
The existing `ZeroRedundancyOptimizer` tests pass, and new unit tests for both hooks pass:
- ~~`test_ddp_with_zero_step_parity_cpu`~~ (removed for now due to flakiness in CI -- under investigation, could possibly be similar Gloo issue as with `hook_with_zero_step_interleaved()`)
- `test_ddp_with_zero_step_parity_gpu`
- `test_ddp_with_zero_step_interleaved_parity_gpu`
These were tested on the AI AWS cluster.
An analogous `test_ddp_with_zero_step_interleaved_parity_cpu` is missing due to existing bugs with Gloo. See https://github.com/pytorch/pytorch/pull/62302.
Both approaches have been verified using an internal accuracy benchmark.
Reviewed By: mrshenli
Differential Revision: D29971046
Pulled By: andwgu
fbshipit-source-id: a7234c23c7ea253f144a698fd7e3c0fe039de5e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62457
Specify `Future[torch.Tensor]` as the DDP communication hook return type, which should explicitly be a single tensor. The previous API took a list containing a single tensor.
Note that the typing info no longer accepts the internal type `torch._C.Future`, which does not support TorchScript and hence cannot support `Future[torch.Tensor]`.
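An illustrative no-op hook showing the new contract (the hook itself is hypothetical; only the return type is the point):
```
import torch

def noop_hook(state, bucket) -> torch.futures.Future[torch.Tensor]:
    fut: torch.futures.Future[torch.Tensor] = torch.futures.Future()
    # Resolve to a single tensor; the previous API expected [tensor] instead.
    fut.set_result(bucket.buffer())
    return fut
```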
ghstack-source-id: 134771419
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_invalid_comm_hook_return_type
Reviewed By: rohan-varma
Differential Revision: D30007390
fbshipit-source-id: 246667c9b575b4c6e617b0a5b373151f1bd81e7f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62232
Logs the bucket sizes in DDP logging so that we know which workflow ran with what bucket size config. Will be used to verify how changing bucket sizes in DDP affects perf.
Based on the test, we can see an inconsistency in where the "first" bucket size actually is (last before buckets are rebuilt, first after).
ghstack-source-id: 134663867
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29922299
fbshipit-source-id: 538b331c96e77048164ad130b377433be100a761
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62231
`compute_bucket_assignment_by_size` is responsible for setting per-bucket size limits; return this information from the function so that we are aware of the size limit for each bucket.
This is currently not consumed, but it will be in the next diff, when we log bucket size limits to DDP logging. This will help us run experiments under different bucket size configs and analyze the impact.
ghstack-source-id: 134480575
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D29919056
fbshipit-source-id: dd5a096fa23d22e5d9dc1602899270a110db4a19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61992
This test was previously not enabled for static graph, but to ensure
this feature is supported with DDPSink, enable it for static graph, which
currently passes outputs to DDPSink.
ghstack-source-id: 134471406
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D29830887
fbshipit-source-id: 2d3f750d9eb4289558ed21acccd172d83d9b82cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61497
Reverts [DDP] Support not all outputs used in loss calculation
ghstack-source-id: 133589153
Test Plan: CI, ping authors to run their workflow on this diff
Reviewed By: zhaojuanmao
Differential Revision: D29642892
fbshipit-source-id: 81a15b9ab3329602f34d3758bb0799005a053d4f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61401
Reverts https://github.com/pytorch/pytorch/pull/59359, which is causing a few internal issues in DDP training. We will evaluate the internal use cases and reland it after reconsidering the design.
Also moves `prepare_for_backward` back into forward pass instead of DDP Sink for `find_unused_parameters`. This ensures that hooks will always fire in the backwards pass, which is behavior that internal training workloads rely on. Calling `prepare_for_backward` in DDPSink autograd function is not the best solution since other autograd threads may have been executing which can cause races.
ghstack-source-id: 133589152
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D29608948
fbshipit-source-id: f060f41cd103573ddff8da50cdbb6c56768dab46
Summary:
**Overview:**
This refactors the computation on non-joined processes relating to the join context manager. The concept was inspired by a comment from pritamdamania.
**Changes:**
This introduces a `_Joinable` abstract base class, which requires a `_join_hook()` method and `_join_device()` and `_join_process_group()` property methods. Any class that we want to be compatible with the generic join context manager should inherit from `_Joinable` and implement `_join_hook()`, `_join_device()`, and `_join_process_group()`. (The `device` and `process_group` information has been moved from `_JoinHook` to `_Joinable`.)
The generic join context manager now takes in a `List[_Joinable]` instead of `List[_JoinHook]`. The motivation for this is that previously, by passing the `_JoinHook`s into the context manager, the class providing a `_JoinHook` can modify the context manager's behavior, but the context manager cannot modify the class's behavior. This is solved by giving the context manager a reference to the class's instance.
This implementation reserves the field `_join_config` in every `_Joinable` to store a `_JoinConfig` instance, which holds all dynamic fields needed from the `_Joinable` for the join context manager: `enable`, `throw_on_early_termination`, and `is_first_joinable`. ("dynamic" here means that for a given `_Joinable` instance, the values for those fields may change across different join context usages.) In particular, these fields are needed to implement a method `notify_join_context()`, which encapsulates the computation performed on non-joined processes relating to the join context manager --- (1) the all-reduce to indicate that the process has not yet joined and (2) the all-reduce to check whether to throw an exception if `throw_on_uneven_inputs=True`. The idea is that every `_Joinable` class only needs to make a call to `notify_join_context()` before its per-iteration collective communications; it is a simple one-line addition.
Only the first `_Joinable` instance passed into the context manager actually performs the collective communications in `notify_join_context()`. In that case, the method returns an async work handle for the initial all-reduce indicating that the process not yet joined. Otherwise, the method returns `None`. This conditional logic is handled internally without additional input from the user.
**New API:**
Now, the example usage would look like:
```
ddp_model = DistributedDataParallel(...)
zero_optim = ZeroRedundancyOptimizer(ddp_model.parameters(), ...)
with _Join([ddp_model, zero_optim]):
    ...
```
Any arguments meant for a join hook (e.g. `divide_by_initial_world_size`) must be specified as keyword arguments. For example:
```
with _Join([ddp_model, zero_optim], divide_by_initial_world_size=False):
    ...
```
They will be forwarded to every `_join_hook()` function via `**kwargs`. This creates a clear separation between the variables needed by the context manager (`enable` and `throw_on_early_termination`) and those needed by the `_Joinable` class (e.g. `divide_by_initial_world_size`).
**Recap:**
After this change, the relevant information to use the generic join context manager looks like the following (omitting prefix `_` from names):
- Suppose we have a class `C` (e.g. `DistributedDataParallel`) that we want to be able to use the `Join` context.
- We make `C` inherit from `Joinable` and implement `join_hook() -> JoinHook`, `join_device()`, and `join_process_group()`.
- To implement `join_hook()`, we define a `CJoinHook` class inheriting from `JoinHook` and implement `main_hook()` and `post_hook()` as needed.
- We locate a place before `C`'s per-iteration collective communications and add a call to `Join.notify_join_context()`.
- We call `Joinable.__init__(self)` in `C`'s constructor.
- The `C.join_config` field will be used internally by the context manager. This does not affect `C`'s serializability.
- Run-time arguments for `C`'s join hook can be passed in as keyword arguments to the context manager: `with Join([C()], arg1=..., arg2=...):` (see the sketch below).
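A minimal sketch tying the recap together (public, underscore-free names; simplified, so treat the details as assumptions rather than the final API surface):
```
import torch
import torch.distributed as dist
from torch.distributed.algorithms.join import Join, Joinable, JoinHook

class CounterJoinHook(JoinHook):
    def __init__(self, counter):
        self.counter = counter

    def main_hook(self):
        # Shadow the collective that non-joined processes issue each iteration.
        t = torch.zeros(1, device=self.counter.device)
        dist.all_reduce(t)

class Counter(Joinable):
    def __init__(self, device, process_group):
        super().__init__()  # sets up the internally used join config field
        self.device = device
        self.process_group = process_group
        self.total = torch.zeros(1, device=device)

    def __call__(self, batch_size):
        Join.notify_join_context(self)  # the one-line addition per the recap
        t = torch.ones(1, device=self.device) * batch_size
        dist.all_reduce(t)
        self.total += t

    def join_hook(self, **kwargs) -> JoinHook:
        return CounterJoinHook(self)

    @property
    def join_device(self) -> torch.device:
        return self.device

    @property
    def join_process_group(self):
        return self.process_group
```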
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61555
Test Plan:
I ran the existing DDP join tests:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py -- TestDistBackendWithFork.test_ddp_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_inputs_stop_iteration_sync_bn TestDistBackendWithFork.test_ddp_grad_div_uneven_inputs TestDistBackendWithFork.test_ddp_uneven_input_join_disable TestDistBackendWithFork.test_ddp_uneven_input_exception
```
I ran the ZeRO join tests:
```
gpurun4 python test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_zero_join_gpu TestZeroRedundancyOptimizerDistributed.test_zero_join_cpu
```
Reviewed By: zou3519
Differential Revision: D29690359
Pulled By: andwgu
fbshipit-source-id: 2950f78de755eb5fb13b95b803dd7c705879a9c7
Summary:
Targets https://github.com/pytorch/pytorch/issues/54318.
**Overview:**
DDP offers a `join()` context manager to accommodate training on uneven inputs. This creates a new generic `_Join()` API permitting custom hooks, refactors DDP `join()` to call this generic `_Join()`, and implements a hook for ZeRO. (For now, the generic `_Join()` is implemented as private, but this may change after design discussions are cleared.)
There are two classes introduced: `_JoinHook`, the class defining the customizable join hook, and `_Join`, the generic join context manager.
The `_JoinHook` provides two entry points: `main_hook()`, which is called repeatedly while there exists a non-joined process, and `post_hook()`, which is called once all process have joined with the additional `bool` argument `is_last_joiner`. The class also requires `process_group` and `device` information by defining corresponding abstract property methods. Thus, to implement a join hook, (1) inherit from `_JoinHook`, (2) override `main_hook()` and `post_hook()` as appropriate, and (3) override `process_group()` and `device()` to provide process group and device information to be used by the join context manager implementation for collective communications.
The `_Join` constructor requires `join_hooks: List[_JoinHook]` and optionally `enable: bool = True` and `throw_on_early_termination: bool = False`. A training loop only needs to be wrapped with `with _Join(join_hooks):` (using the appropriate `join_hooks`) to be able to train on uneven inputs without hanging/erroring. The context manager requires a `dist.all_reduce(torch.ones(1))` to be called on every non-joined process each time before it performs its collective communications in order to indicate that the process has not yet joined. It also requires that all `process_group` attributes in the `_JoinHook` objects are the same.
**Notes:**
- The argument `is_last_joiner` to `post_hook()` may be useful for finding an authoritative rank when synchronizing.
- `enable` is a flag that can be set to `False` if the user knows the current training loop will not have uneven inputs. This may be used to disable join-related computation in the classes providing join hooks.
- `throw_on_early_termination` is a flag that can be set to `True` to notify processes to terminate upon detecting uneven inputs (i.e. upon the first process joining when there exists a non-joined process). Notably, the notification requires an all-reduce, so to prevent hanging/erroring, non-joined processes must participate in the all-reduce. The first-joining process raises a `RuntimeError`, and the other processes are expected (but not required) to do the same. This may be used to implement training on uneven inputs in cases that do not conform to the generic join context manager (e.g. `SyncBatchNorm`).
- Classes providing a join hook should do so via a `_join_hook()` method that returns a `_JoinHook` instance with the methods appropriately overridden.
- If there are multiple join hooks, the device specified by the first is used by the join context manager implementation to perform its collective communications.
- If there are multiple join hooks, both the main and post-hooks are iterated in the order in which the `_JoinHook` objects are passed into the context manager constructor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60757
Test Plan:
The current implementation preserves backward compatibility by not changing the existing DDP `join()` API at all. To check this, I ran through the uneven input tests (`test_ddp_grad_div_uneven_inputs`, `test_ddp_uneven_inputs_stop_iteration_sync_bn`, `test_ddp_uneven_inputs`, `test_ddp_uneven_input_join_disable`, `test_ddp_uneven_input_exception`) on the AI AWS cluster:
```
touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" gpurun python test/distributed/test_distributed_fork.py --
```
Because the existing DDP join logic does not provide correct gradients to the joined processes if `gradient_as_bucket_view=False` and a joined process requires those gradients to correctly update its shard of the parameters in `ZeroRedundancyOptimizer.step()`, DDP and ZeRO are not fully compatible at the moment. To work around this and to test ZeRO's join hook separately, I added a test `_test_zero_join()` (with `test_zero_join_gpu()` and `test_zero_join_cpu()` flavors), which compares DDP with a local optimizer on uneven inputs against ZeRO on uneven inputs with the gradients set manually.
Reviewed By: iramazanli, mrshenli
Differential Revision: D29624636
Pulled By: andwgu
fbshipit-source-id: ec70a290e02518b0d8b683f9fed2126705b896c7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61376
After SPMD was retired, the API `get_tensors` became `get_tensor`. Fix some comments that refer to the obsolete API.
The `allreduce` hook example does not do the division inside, which is actually incorrect.
ghstack-source-id: 133174272
Test Plan: N/A
Reviewed By: rohan-varma
Differential Revision: D29596857
fbshipit-source-id: 2046b185225cd6d1d104907b5f9b4009b6e87c99
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61020
Makes uneven input support with the `join` context manager work with
custom communication hooks. This ensures that the two features work
well together. Added relevant unittests to test the allreduce and PowerSGD hooks.
Instead of calling `allreduce`, the join manager now calls into `_run_reduction_hook`, which automatically runs whatever hook is installed.
ghstack-source-id: 132950108
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29480028
fbshipit-source-id: c91dc467a62c5f1e0ec702a2944ae3deb10f93f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61019
Changes the uneven-input logic from running allreduce directly to using the `GradBucket` structure. This is to enable support for comm hooks with join in the next diff.
ghstack-source-id: 132950107
Test Plan: ci
Reviewed By: SciPioneer
Differential Revision: D29480027
fbshipit-source-id: 7c42c53653052f71b86a75e14a5fc7ae656433f7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61017
Removes SPMD nested vector logic from this codepath. This is mostly in preparation for the next diffs in this stack, which enable support for join with comm hooks.
ghstack-source-id: 132924223
Test Plan: CI
Reviewed By: SciPioneer
Differential Revision: D29477360
fbshipit-source-id: f8132a94b1abfe28586aa78ac47e13a7ce6bb137
Summary:
We recently landed a change to ensure that when running under ``find_unused_parameters=True``, not all module outputs have to be used in loss computation and DDP will work as expected. Mention this update in the documentation and add some additional clarification.
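A hedged illustration of the documented behavior (the two-headed module is hypothetical; assumes a process group is initialized and one GPU per process):
```
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class TwoHeadedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Linear(8, 8)
        self.head_a = nn.Linear(8, 1)
        self.head_b = nn.Linear(8, 1)

    def forward(self, x):
        h = self.trunk(x)
        return self.head_a(h), self.head_b(h)

model = DDP(TwoHeadedNet().cuda(), device_ids=[0], find_unused_parameters=True)
out_used, out_unused = model(torch.randn(4, 8, device="cuda"))
loss = out_used.sum()  # out_unused never contributes to the loss
loss.backward()        # works as expected with find_unused_parameters=True
```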
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60275
Reviewed By: SciPioneer
Differential Revision: D29502609
Pulled By: rohan-varma
fbshipit-source-id: ddb3129cff9492018e61813413b30711af212309
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60882
Fixes https://github.com/pytorch/pytorch/issues/60733, which
identified an issue with a previous PR that resulted in DDP no longer
supporting cases where newly created tensors are returned that don't have a
grad_fn. The result of this is that the grad_fn is set to that of the `DDPSink`
custom backward, which results in errors during the backward pass.
This PR fixes the issue by ensuring we don't touch the `grad_fn` of the tensors
if it is `None`. Added relevant tests as well.
ghstack-source-id: 132632515
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D29423822
fbshipit-source-id: a9e01046c7be50aa43ffb955f6e0f48fef4bc881
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59359
Move `prepare_for_backward` into the `_DDPSink` backward instead of calling it in the DDP forward pass so that we can run multiple backward passes in DDP with `retain_graph=True`.
ghstack-source-id: 131774159
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D28855226
fbshipit-source-id: 6b7b25d75b7696f5b5629078233433f97663d61c