pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-11-07 01:50:04 +08:00

Author	SHA1	Message	Date
Will Feng	941d094dd1	[Dynamo][DTensor] Fixes SymNodeVariable() is not a constant error in Compiled DDP + TP unit test (#135315 ) Before the fix, the unit test will fail at forward Dynamo tracing: ``` File "/data/users/willfeng/pytorch/test/distributed/_composable/test_replicate_with_compiler.py", line 415, in test_ddp_tp loss = compiled_replicate_model(data).sum() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ... torch._dynamo.exc.InternalTorchDynamoError: SymNodeVariable() is not a constant from user code: File "/data/users/willfeng/pytorch/torch/distributed/tensor/parallel/_data_parallel_utils.py", line 34, in _unflatten_tensor result = DTensor.from_local( ``` After the fix, the compilation fails at a later step (Compiled Autograd tracing), due to needing "pre-dispatch tracing of backward graph" feature (see details at https://github.com/pytorch/pytorch/issues/127797#issuecomment-2291695474). I believe this PR is a net improvement, because it should also fix the 1D Traceable FSDP2 failure case on internal models (https://github.com/pytorch/pytorch/issues/130978#issuecomment-2319476690), which is much harder to build a minimal unit test for. Fixes https://github.com/pytorch/pytorch/issues/130978. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135315 Approved by: https://github.com/bdhirsh	2024-09-07 00:11:25 +00:00
Xuehai Pan	758a0a88a2	[BE][Easy] enable `ruff` rule `PIE790`: unnecessary `pass` statement (#133200 ) This PR removes unnecessary `pass` statements. This is semantically safe because the bytecode for the Python code does not change. Note that if there is a docstring in the function, an empty function does not need a `pass` statement as a placeholder. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200 Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980	2024-08-15 15:50:19 +00:00
Sam Larsen	ded5bdb0de	Use inductor TestCase for test_replicate_with_compiler.py (#131053 ) Summary: `test/distributed/_composable/test_replicate_with_compiler.py` torch.compiles. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir. Test Plan: `python test/distributed/_composable/test_replicate_with_compiler.py` Differential Revision: [D59925519](https://our.internmc.facebook.com/intern/diff/D59925519) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131053 Approved by: https://github.com/eellison	2024-07-23 16:59:55 +00:00
Will Feng	208dffa702	[Compiled DDP] DDP + AC unit test (#130981 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130981 Approved by: https://github.com/fegin	2024-07-19 01:07:41 +00:00
PyTorch MergeBot	fff92d4f18	Revert "Use inductor TestCase for test_replicate_with_compiler.py (#129494 )" This reverts commit 9f392f8294e928aec49599ad649aa899e1356102. Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/atalman due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2237147504))	2024-07-18 17:42:05 +00:00
Sam Larsen	9f392f8294	Use inductor TestCase for test_replicate_with_compiler.py (#129494 ) Summary: `test/distributed/_composable/test_replicate_with_compiler.py` exercises inductor. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494 Approved by: https://github.com/eellison	2024-07-18 03:08:32 +00:00
chilli	a2b1673dfb	[Horace's PR #126446 ] Prevent partitioner from ever saving views (#129039 ) Most work is done by Horace in https://github.com/pytorch/pytorch/issues/126446, this PR just additionally adds the config for it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129039 Approved by: https://github.com/Chillee	2024-06-19 23:21:16 +00:00
Michael Hsu	85172fbe84	Back out "Prevent partitioner from ever saving views (#126446 )" (#127316 ) Summary: Revert "Prevent partitioner from ever saving views (#126446)" due to a torchinductor failure on CU Training Framework tests. Reviewed By: Chillee Differential Revision: D57868343 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127316 Approved by: https://github.com/Chillee	2024-05-29 00:29:44 +00:00
chilli	d4ec18bdad	Prevent partitioner from ever saving views (#126446 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126446 Approved by: https://github.com/anijain2305 ghstack dependencies: #126615	2024-05-22 17:28:46 +00:00
PyTorch MergeBot	0f37fd06d9	Revert "Prevent partitioner from ever saving views (#126446 )" This reverts commit da2292ce6b37028746bf5beeae04442eef1e803d. Reverted https://github.com/pytorch/pytorch/pull/126446 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))	2024-05-22 08:23:40 +00:00
chilli	da2292ce6b	Prevent partitioner from ever saving views (#126446 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126446 Approved by: https://github.com/anijain2305 ghstack dependencies: #126615	2024-05-20 23:40:56 +00:00
Chien-Chin Huang	8b2f8ee5ef	[DDP][PT2D] Fix no_compiled_forward flag in the test (#124829 ) As title Differential Revision: [D56508696](https://our.internmc.facebook.com/intern/diff/D56508696/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124829 Approved by: https://github.com/yf225 ghstack dependencies: #124421, #124422, #123424	2024-04-25 04:55:39 +00:00
Chien-Chin Huang	290bfbe01f	[DDP][PT2D] Lazy Initialization of DDP Module for Replicate API (#123424 ) In order to make replicate work with Meta tensor, we need to do lazy initialization for the replicate API. This PR implements the lazy initialization and ensures that replicate still works with the new DDP compilation. Differential Revision: [D55787340](https://our.internmc.facebook.com/intern/diff/D55787340/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123424 Approved by: https://github.com/yf225 ghstack dependencies: #124421, #124422	2024-04-24 06:30:19 +00:00
Yifu Wang	2a2e1d8e4f	[functional collective] change the Python APIs to only use the native funcol ops (#123777 ) ## Summary After this PR, the functional collective Python APIs will stop honoring `TORCH_DISABLE_NATIVE_FUNCOL` and only use native funcol ops. Specifically, this PR: - Removed `use_native_funcol()`. - Removed the code path in the Python APIs when `use_native_funcol()` is `False`. - Changed the CI tests that run on both native funcol and legacy funcol through the Python API to only run with native funcol. ## Test Changes `test_functional_api.py` - Removed the tests where only one of output_split_sizes or input_split_sizes is specified. This behavior is unreliable and has been removed from the native funcol. - Removed `TestWaitiness` which tests an implementation detail of the legacy funcol. We have equivalent tests for native funcol in `test/distributed/test_c10d_functional_native.py` `b7fac76fc2/test/distributed/test_c10d_functional_native.py (L114-L116)` `test/distributed/_tensor/test_dtensor.py` `test/distributed/_tensor/test_dtensor_compile.py` `test/distributed/test_device_mesh.py` `test/distributed/_tensor/experimental/test_tp_transform.py` `test/distributed/_tensor/test_matrix_ops.py` `test/distributed/test_inductor_collectives.py` - All these tests were double running with both native funcol and legacy funcol. Changed to only run with native funcol. `test/distributed/test_c10d_functional_native.py` - Removed the `run_with_native_funcol` decorators. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123777 Approved by: https://github.com/wanchaol ghstack dependencies: #123776	2024-04-13 03:08:36 +00:00
Chien-Chin Huang	b279034e5a	[DDP][PT2D] Add the trace rules for DDP (#121741 ) Add the trace rules for DDP and refactor the tests to verify both DDP and replicate. Differential Revision: [D54815909](https://our.internmc.facebook.com/intern/diff/D54815909/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121741 Approved by: https://github.com/yf225 ghstack dependencies: #123206, #123207	2024-04-08 19:53:13 +00:00
Chien-Chin Huang	a2327d203b	[PT2D][DDP] Remove some hacks to get the test work (#123206 ) It seems that these bugs are fixed (not sure what PRs) and we don't need to disable the buffer reused. Differential Revision: [D55657388](https://our.internmc.facebook.com/intern/diff/D55657388/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123206 Approved by: https://github.com/kwen2501, https://github.com/yifuwang	2024-04-08 17:40:14 +00:00
Chien-Chin Huang	c7193f4099	[DDP][PT2D][2D] Enable DDP + TP and add test for compiled DDP + TP (#120479 ) This PR enables DDP + TP using a TP internal API. This should not be the final implementation. A more sound implementation is to inline the TP internal API in DDP. In other words, DDP needs to be aware of DTensor so that we can support 2D state_dict. This PR adds a compiled DDP + TP test to ensure the new compiled DDP fusion doesn't break TP all_reduce. TODOs - [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass. - [x] Add unit tests to ensure the fusion doesn't break DDP + TP. - [ ] Group different PG and data type of all_reduces. - [ ] Mixed precision support and tests - [ ] Implement the fusions with Inductor IR. - [ ] Add auto bucketing based on Inductor profiling. Differential Revision: [D54105050](https://our.internmc.facebook.com/intern/diff/D54105050/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120479 Approved by: https://github.com/wz337 ghstack dependencies: #113209	2024-03-13 21:41:22 +00:00
Chien-Chin Huang	8e6d572b4e	[DDP][PT2D] Allreduce fusion fx pass using concat and all_reduce_coalesced (#113209 ) Differential Revision: [D49858057](https://our.internmc.facebook.com/intern/diff/D49858057/) TL;DR This PR implements 2 different DDP all_reduce fusions in Inductor post_grad fx passes. The two fusions are 1) fusion with concat op and 2) fusion with all_reduce_coalesced. When DDP detects that Python reducer is being used, DDP will automatically turn on the fusion. This PR does not invent any algorithm and simply reflects the bucket size users set to DDP. Implementation Details Fusion with concat op The idea of this fusion is to use a concat op to concatenate all the gradients into one tensor and perform one `all_reduce`. After the `wait` op of the `all_reduce`, splitting and reshaping will also be performed to recover the individual gradients. Because DDP needs to perform gradient scaling, the benefit of using this fusion is that we could perform the gradient scaling over the concatenated buffer. Fusion with `all_reduce_coalesced` The idea of this fusion is to use `all_reduce_coalesced` op to directly perform the `all_reduce` over multiple buffers. This avoids the copy overhead but may not achieve the best NCCL performance. In addition, because there are multiple buffers, we could not do one simple gradient scaling but have to rely on `foreach_div` to help the gradient scaling. Limitations Current fusions do not distinguish `all_reduce` generated by different DDP modules. This is okay if all DDP instances use the same PG and data type. The support of multiple DDP instances with different PG and data type will come in the later PRs. TODOs - [x] Implement DDP allreduce fusion algorithm for Inductor post_grad pass. - [ ] Add unit tests to ensure the fusion doesn't break DDP + TP. - [ ] Group different PG and data type of `all_reduce`s. - [ ] Mixed precision support and tests - [ ] Implement the fusions with Inductor IR. - [ ] Add auto bucketing based on Inductor profiling. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113209 Approved by: https://github.com/yf225	2024-03-13 20:37:09 +00:00
Chien-Chin Huang	1d2382f141	[DDP] Use compiled_autograd to trace DDP backward allreduce (#110662 ) Summary The reducer of `DistributedDataParallel` is implemented with C++ and it is not easy to trace the allreduce launched in the reducer. This PR modifies `DistributedDataParallel` to launch one allreduce per gradient when `compiled_autograd` is enabled. The changes allow us to use `compiled_autograd` to trace the allreduce and later be optimized (fused) in the Inductor. Key Logic 1. If `ddp_python_hook` is True, we assume `compiled_autograd` is used. `DistributedDataParallel` registers `compiled_accum_grad_hook` for all parameters. 2. In the first forward() call, if `DistributedDataParallel` is not compiled, all `compiled_accum_grad_hook` are deregistered. If `DistributedDataParallel` is compiled, all `compiled_accum_grad_hook` will be compiled by `compiled_autograd`. 3. `compiled_accum_grad_hook` launches an allreduce to reduce the gradient of the parameter. Bucketing The compiled backward is slow because there is no bucketing for the allreduces. We rely on Inductor to bucket the allreduces. The bucketing is done in a separate PR. Differential Revision: [D49428482](https://our.internmc.facebook.com/intern/diff/D49428482/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110662 Approved by: https://github.com/wconstab	2024-02-08 03:03:15 +00:00
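
The concat-based fusion described in #113209 (concatenate the gradients, run one `all_reduce`, scale once, split back) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: it uses plain Python lists instead of tensors, the `all_reduce_sum` helper is a hypothetical stand-in for a real collective, and none of the function names below come from the PyTorch source.

```python
from typing import List

Grad = List[float]


def all_reduce_sum(per_rank_buffers: List[Grad]) -> List[Grad]:
    """Hypothetical stand-in for all_reduce(SUM): each rank receives the elementwise sum."""
    summed = [sum(values) for values in zip(*per_rank_buffers)]
    return [list(summed) for _ in per_rank_buffers]


def fused_allreduce_concat(per_rank_grads: List[List[Grad]]) -> List[List[Grad]]:
    """Average gradients across simulated ranks using one collective per bucket.

    per_rank_grads[r][i] is the i-th gradient held by rank r.
    """
    world_size = len(per_rank_grads)
    sizes = [len(g) for g in per_rank_grads[0]]

    # 1) Concatenate every gradient on a rank into one flat buffer.
    flat = [[x for g in grads for x in g] for grads in per_rank_grads]

    # 2) One collective over the flat buffer instead of one per gradient.
    reduced = all_reduce_sum(flat)

    # 3) Gradient scaling is applied once, over the concatenated buffer.
    scaled = [[x / world_size for x in buf] for buf in reduced]

    # 4) Split the buffer back into the original per-gradient pieces.
    result = []
    for buf in scaled:
        grads, offset = [], 0
        for n in sizes:
            grads.append(buf[offset:offset + n])
            offset += n
        result.append(grads)
    return result


if __name__ == "__main__":
    rank0 = [[1.0, 2.0], [3.0]]
    rank1 = [[3.0, 4.0], [5.0]]
    print(fused_allreduce_concat([rank0, rank1]))
```

The trade-off noted in the commit message is visible here: the concat path pays for the copy into `flat` but scales in a single pass, whereas a coalesced variant would skip the copy and scale each buffer separately.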

19 Commits