pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
shaoyf42	443edb9015	[DOCS][DDP]Fix the simple of saving and reloading PowerSGD state and hook. (#102721 ) Fix the simple of saving and reloading PowerSGD state and hook. Pull Request resolved: https://github.com/pytorch/pytorch/pull/102721 Approved by: https://github.com/H-Huang	2023-06-10 00:15:00 +00:00
Xuehai Pan	8d45f555d7	[BE] [1/3] Rewrite `super()` calls in caffe2 and benchmarks (#94587 ) Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied. - #94587 - #94588 - #94592 Also, methods with only a `super()` call are removed: ```diff class MyModule(nn.Module): - def __init__(self): - super().__init__() - def forward(self, ...): ... ``` Some cases that change the semantics should be kept unchanged. E.g.: `f152a79be9/caffe2/python/net_printer.py (L184-L190)` `f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/94587 Approved by: https://github.com/ezyang	2023-02-11 18:19:48 +00:00
Danielle Pintz	ae5c166035	Fix two small typos in ddp_comm_hooks.rst (#82047 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/82047 Approved by: https://github.com/kit1980	2022-07-23 19:10:57 +00:00
Olga Andreeva	8a6d83079c	Functionality/pickling for commhooks (#79334 ) This PR addresses issue address #75666. Stateful communication hook now can be saved and reloaded to resume training. Current PR adds the functionality for PowerSGD communication hook and tests that communication hook can be properly saved and restored. PowerSGD implementation uses ``__slots__``, as a result introduced __getstate__ and __setstate__ methods are implemented to work with `__slots__` and not` __dict__`. `__getstate__ ` Returns: A dictionary that represents a ``PowerSGDState`` which will be pickled and saved. ``process_group`` is non-serializable and excluded from a returned state. `__setstate__` Takes a provided ``state`` and retrieves ``PowerSGDState``. ``process_group`` is set to default with a proper warning issued to a user. Unit test A hook-independent `_test_hook_pickling` is added with this PR, as well as `test_ddp_hook_pickling_powerSGD`, which tests `powerSGD`’s ability to be saved and reloaded. Currently, the test creates a ddp model with a provided hook, trains it for 10 epochs and saves model’s state and hook’s state. During reloading, unit test makes sure that a warning was logged (only one warning and the proper one). It then proceeds to check that reloaded hook and original hook are the same. Finally, it checks that a hook’s state was properly initialized: - it compares slot values (all, but 2: `process_group` and `rng`) for original and reloaded state - it checks that process group was set to a default group - it checks that a random state was restored properly with np.testing.assert_array_equal, because `rng` is an instance of `np.random.RandomState`, represented by a tuple. One of entries is of `ndarray dtype[uint32]` type and `np.testing.assert_array_equal` is used for assertion. Future To-Do: - Implement similar __getstate__ and __setstate__ for other stateful communication hooks - Add appropriate tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/79334 Approved by: https://github.com/rohan-varma, https://github.com/awgu	2022-06-16 23:15:34 +00:00
Yi Wang	778af56504	[DDP Comm Hook] Add debugging communication hooks to ddp_comm_hooks.rst (#64352 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64352 as title ghstack-source-id: 137246253 Test Plan: N/A Reviewed By: rohan-varma Differential Revision: D30694089 fbshipit-source-id: a78110b11d59bb0718f43c99ede23f2fd8ab21d0	2021-09-01 17:37:19 -07:00
Yi Wang	a8f9aab840	[DDP Comm Hook] Add bf16 gradient compression to ddp_comm_hooks.rst (#64346 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64346 as title ghstack-source-id: 137170288 Test Plan: N/A Reviewed By: rohan-varma Differential Revision: D30693513 fbshipit-source-id: 8c64b8404ff3b0322e1bbbd93f6ef051ea91307d	2021-09-01 16:34:00 -07:00
Yi Wang	7a3f1386ae	Add GradBucket::parameters() to ddp_comm_hooks.rst (#62877 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62877 as title ghstack-source-id: 135214612 Test Plan: N/A Reviewed By: rohan-varma Differential Revision: D30153490 fbshipit-source-id: d4cec434a53ef6e65b60c065804884d1a114aa0d	2021-08-06 14:50:47 -07:00
Sean Lawlor	34c9f5a8da	[DDP Communication Hook] Update get_tensor and set_tensor to be cleaner naming conventions (buffer() and set_buffer()) (#62662 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662 Replaced the methods set_tensor(.) and get_tensor() in the python exposed API from the C++ logic with buffer() and set_buffer(.) to be a cleaner interface. Reviewed By: SciPioneer Differential Revision: D30012869 fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482	2021-08-04 09:27:31 -07:00
Yi Wang	db071ef005	[Reland][DDP Communication Hook] Rename 4 Methods of GradBucket Class (#62592 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62592 Reland #62510 `GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity: 1) get_index -> index 2) is_the_last_bucket_to_allreduce -> is_last, 3) get_per_parameter_tensors -> gradients, 4) get_model_params_for_bucket -> parameters. ghstack-source-id: 134848352 Test Plan: unit test Reviewed By: andwgu Differential Revision: D30049431 fbshipit-source-id: 1bcac331aa30e529b7230e3891bc811c531b0ea9	2021-08-02 16:38:09 -07:00
Eli Uriegas	6f95850127	Revert D30024161: [DDP Communication Hook] Rename 4 Methods of GradBucket Class Test Plan: revert-hammer Differential Revision: D30024161 (`29c8b1db57`) Original commit changeset: 07e6072a2f7b fbshipit-source-id: d571c2caadaf7b71fe2aba3c0597bd8074d153de	2021-08-02 10:26:54 -07:00
Qing Hu	29c8b1db57	[DDP Communication Hook] Rename 4 Methods of GradBucket Class (#62510 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62510 `GradBucket` is an important class defined in both C++ and Python, used for PyTorch Distributed Training. We need to rename the following methods for simplicity: 1) get_index -> index 2) is_the_last_bucket_to_allreduce -> is_last, 3) get_per_parameter_tensors -> gradients, 4) get_model_params_for_bucket -> parameters. Test Plan: Local run comprehensive test with following results: https://pxl.cl/1Ml8b For two timeout failure test cases, most likely environment related and fail in my devserver. Reviewed By: SciPioneer Differential Revision: D30024161 fbshipit-source-id: 07e6072a2f7b81f731425d9b71f8c8b60d383b0f	2021-08-02 09:33:32 -07:00
Yi Wang	581bf01074	[Gradient Compression] Remove unnecessary warning on the rst file and the check on C++ version (#58170 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58170 Now comm hook can be supported on MPI and GLOO backends besides NCCL. No longer need these warnings and check. ghstack-source-id: 128799123 Test Plan: N/A Reviewed By: agolynski Differential Revision: D28388861 fbshipit-source-id: f56a7b9f42bfae1e904f58cdeccf7ceefcbb0850	2021-05-12 14:15:10 -07:00
Yi Wang	6a2f046504	[SPMD] Restrict DDP communication hooks to SPSD mode (#55253 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55253 Previously DDP communication hooks takes a tensor list as the input. Now only takes a single tensor, as the preparation of retiring SPMD and only providing a single model replica for DDP communication hooks. The next step is limiting only 1 model replica in Reducer. ghstack-source-id: 125677637 Test Plan: waitforbuildbot Reviewed By: zhaojuanmao Differential Revision: D27533898 fbshipit-source-id: 5db92549c440f33662cf4edf8e0a0fd024101eae	2021-04-05 16:46:47 -07:00
Yi Wang	e593044748	[Gradient Compression] Update a warning in ddp_comm_hooks.rst (#55031 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55031 It turns out that PowerSGD hooks can work on PyTorch native AMP package, but not Apex AMP package, which can somehow mutate gradients during the execution of communication hooks. {F561544045} ghstack-source-id: 125268206 Test Plan: Used native amp backend for the same pytext model and worked: f261564342 f261561664 Reviewed By: rohan-varma Differential Revision: D27436484 fbshipit-source-id: 2b63eb683ce373f9da06d4d224ccc5f0a3016c88	2021-04-02 12:07:50 -07:00
Yi Wang	4b00bce156	[Gradient Compression] Introduce fp16_compress_wrapper in ddp_comm_hooks.rst (#54052 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54052 Introduce `fp16_compress_wrapper`, which can give some speedup on top of some gradient compression algorithms like PowerSGD. ghstack-source-id: 124001805 Test Plan: {F509205173} Reviewed By: iseessel Differential Revision: D27076064 fbshipit-source-id: 4845a14854cafe2112c0caefc1e2532efe9d3ed8	2021-03-16 15:40:10 -07:00
Isaac Seessel	3078233e9a	[Gradient Compression] Make FP16 compression as a wrapper that can be combined with other communication hooks (#53808 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53808 Create a FP16 wrapper that can combine FP16 gradient compression with any gradient compression algorithm. Test Plan: Unit test: ``` buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper ``` Performance Test on DDP QPS Benchmark: Check if AllReduce + FP16 Wrapper = FP16 Compression 1) FP16 Compression: f256897690 2) FP16 Wrapper + AllReduce (after patching D26960986): f256897289 Reviewed By: SciPioneer Differential Revision: D26978832 fbshipit-source-id: 0dcd18b050c02f5e9f3cff56344d1f39a04e20c0	2021-03-12 17:31:07 -08:00
Yi Wang	8d8a4a0624	Remove the extra ":noindex:" in ddp_comm_hooks.rst (#53855 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53855 Remove "noindex" here: {F492926346} ghstack-source-id: 123724419 Test Plan: waitforbuildbot The failure on doctest does not seem to be relevant. Reviewed By: rohan-varma Differential Revision: D26967086 fbshipit-source-id: adf9db1144fa1475573f617402fdbca8177b7c08	2021-03-11 17:26:50 -08:00
Yi Wang	fe0810e2f8	Add a section to introduce GradBucket class in ddp_comm_hooks.rst (#53253 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53253 Since GradBucket class becomes public, mention this class in ddp_comm_hooks.rst. Screenshot: {F478201008} ghstack-source-id: 123596842 Test Plan: viewed generated html file Reviewed By: rohan-varma Differential Revision: D26812210 fbshipit-source-id: 65b70a45096b39f7d41a195e65b365b722645000	2021-03-10 16:14:34 -08:00
Martin Jaggi	b6806308ac	typo in docs ddp_comm_hooks.rst (#51986 ) Summary: Fixes a typo in ddp_comm_hooks.rst Pull Request resolved: https://github.com/pytorch/pytorch/pull/51986 Reviewed By: SciPioneer Differential Revision: D26360314 Pulled By: mrshenli fbshipit-source-id: 50349501c53823cbcbad0f72be7c6ac9d51a4120	2021-02-11 12:02:37 -08:00
Yi Wang	9e4f3b89c4	[Gradient Compression] Add register_comm_hook API to DDP communication hooks documentation page (#51846 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51846 `register_comm_hook` method is defined in DistributedDataParallel module, but it is not covered in `distributed.rst`. Since it's closely related to DDP communication hook, add the docstrings to `ddp_comm_hooks.rst` instead of a reference. Screenshot: {F370425625} ghstack-source-id: 121278173 Test Plan: view locally python_doc_test: https://app.circleci.com/pipelines/github/pytorch/pytorch/271234/workflows/dc0b443d-8a62-4334-9b42-800c33a68553/jobs/10770636 Reviewed By: rohan-varma Differential Revision: D26298191 fbshipit-source-id: 32e0685fd3c935cf9a2d129e6c520a94aa3e3817	2021-02-08 15:12:39 -08:00
Yi Wang	4b3c99ce4a	[Resubmission] Add a documentation page for DDP communication hooks (#51773 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51773 Resubmission of #51715. Minor changes: 1) Removed "Note [Guidance to Tune ``matrix_approximation_rank`` And ``start_powerSGD_iter``]" in powerSGD_hook.py. 2) Removed the duplicate description of `torch.nn.parallel.DistributedDataParallel.register_comm_hook` in ddp_comm_hooks.rst, because it is already covered by distributed.rst. Also updated the doc based on the comments from PowerSGD paper author Thijs Vogels . It seems that `python_doc_test` was flaky. The previous error message was not informative: https://app.circleci.com/pipelines/github/pytorch/pytorch/270682/workflows/8d186a3c-d682-46bf-b617-ad4eef5991e2/jobs/10739143, and all the warnings did also appear on the master branch. Rebasing to a new master branch seems to get this fixed: https://app.circleci.com/pipelines/github/pytorch/pytorch/270696/workflows/1a3adbea-6443-4876-b87b-e17d90d41428/jobs/10740021/steps Screenshot: {F369899792} ghstack-source-id: 121199613 Test Plan: View locally Reviewed By: mingzhe09088 Differential Revision: D26272687 fbshipit-source-id: 6677db496a68171798940a80343f4d9a508e15db	2021-02-06 21:22:04 -08:00
Natalia Gimelshein	d3023d86ba	Revert D26249330: [Gradient Compression] Add a documentation page for DDP communication hooks Test Plan: revert-hammer Differential Revision: D26249330 (`e62aabac43`) Original commit changeset: ab973390ddb7 fbshipit-source-id: d508daed76219e7ca588cf7fb38aeaaffc61acfd	2021-02-04 22:38:06 -08:00
Yi Wang	e62aabac43	[Gradient Compression] Add a documentation page for DDP communication hooks (#51715 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/51715 Add a documentation page for DDP communication hooks. Screenshot: {F369781049} Test Plan: View locally Reviewed By: pritamdamania87 Differential Revision: D26249330 fbshipit-source-id: ab973390ddb785c5191f587a1b2b6de7d229e50e	2021-02-04 18:53:53 -08:00

23 Commits