pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 05:34:18 +08:00

Author	SHA1	Message	Date
Xuehai Pan	995df34b19	[BE][PYFMT] migrate PYFMT for `torch.{distributed,distributions}` to `ruff format` (#144547 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144547 Approved by: https://github.com/kwen2501	2025-02-28 07:35:56 +00:00
Ivan Skorokhodov	df776d64f7	chore: fix typos in error messages in FSDP (#146805 ) Fixes two small typos in FSDP error messages Pull Request resolved: https://github.com/pytorch/pytorch/pull/146805 Approved by: https://github.com/awgu, https://github.com/Skylion007	2025-02-13 15:22:13 +00:00
Aaron Orenstein	c64e657632	PEP585 update - torch/distributed/fsdp (#145162 ) See #145101 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145162 Approved by: https://github.com/bobrenjc93	2025-01-19 20:04:05 +00:00
bobrenjc93	08be9ec312	Migrate from Tuple -> tuple in torch/distributed (#144258 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/144258 Approved by: https://github.com/aorenste	2025-01-10 08:34:54 +00:00
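As a small illustration of the `Tuple` -> `tuple` migration, the sketch below annotates a return type with the builtin generic from PEP 585 (Python 3.9+), so no `typing.Tuple` import is needed. `split_rank` is a hypothetical helper made up for this example and is not taken from `torch/distributed`.

```python
def split_rank(global_rank: int, ranks_per_node: int) -> tuple[int, int]:
    """Return (node index, local rank). Hypothetical helper for illustration."""
    return divmod(global_rank, ranks_per_node)


# The builtin `tuple` is subscripted directly in the annotation above.
node, local_rank = split_rank(11, 8)
```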
Chien-Chin Huang	d53dfa4680	[BE] Raise when the target model has scalar parameters (#132934 ) Address the issue, https://github.com/pytorch/pytorch/issues/130810. Both FSDP1 and FSDP2 do not support scalar parameters. For FSDP1, the issue happens during state_dict operations while FSDP2 fails during the initialization. This PR adds exceptions to help users debug the issue and change the scalar parameters to 1D parameters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132934 Approved by: https://github.com/awgu, https://github.com/wz337	2024-08-12 18:28:02 +00:00
PyTorch MergeBot	50595ecef4	Revert "[BE] Raise when the target model has scalar parameters (#132934 )" This reverts commit ea00036841b225330396f8d8f6ecf796f4826786. Reverted https://github.com/pytorch/pytorch/pull/132934 on behalf of https://github.com/clee2000 due to I think this broke distributed/_composable/fsdp/test_fully_shard_init.py::TestFullyShardShardedParameterTensor::test_raise_scalar_parameter [GH job link](https://github.com/pytorch/pytorch/actions/runs/10314920655/job/28563430905) [HUD commit link](`ea00036841`). Dr CI is wrong, it is not flaky ([comment](https://github.com/pytorch/pytorch/pull/132934#issuecomment-2278208789))	2024-08-09 15:30:34 +00:00
Chien-Chin Huang	ea00036841	[BE] Raise when the target model has scalar parameters (#132934 ) Address the issue, https://github.com/pytorch/pytorch/issues/130810. Both FSDP1 and FSDP2 do not support scalar parameters. For FSDP1, the issue happens during state_dict operations while FSDP2 fails during the initialization. This PR adds exceptions to help users debug the issue and change the scalar parameters to 1D parameters. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132934 Approved by: https://github.com/awgu ghstack dependencies: #132908, #132933	2024-08-09 06:45:48 +00:00
daitian1995	aff48f7378	Autoselect default device in FSDP construction. (#127609 ) There are still some differences between CUDA and non-CUDA custom devices when constructing FSDP, because CUDA is selected as the default device. For example, when FSDP is constructed from a CPU model and device_id is not passed, device_handle chooses CUDA as the default device. This PR autoselects the real device as the default device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127609 Approved by: https://github.com/awgu	2024-08-08 05:25:17 +00:00
wz337	87053132ea	[DeviceMesh] Remove parent mesh concept from _MeshEnv and replace by root mesh (#132339 ) Previously, when we sliced out a submesh from a mesh, we assigned the mesh as the parent mesh of the submesh. As a result, with a 3D mesh topology, the parent mesh of a 1D mesh sliced out from the 3D mesh is different from the parent mesh of the same 1D mesh sliced out from a 2D submesh of the 3D mesh. For example: ``` mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2")) mesh_dim0 = mesh_3d["dim0"] mesh_2d = mesh_3d["dim0", "dim1"] mesh_dim0_2 = mesh_2d["dim0"] # This would evaluate to be True print(_mesh_resources.get_parent_mesh(mesh_dim0) != _mesh_resources.get_parent_mesh(mesh_dim0_2)) ``` We can always reconstruct the mesh needed from the mesh dim names, as long as two dims come from the same root. For simplicity, we do not see the need to build a tree structure to represent the child-parent relationship. Therefore, we are replacing the parent mesh concept with a root mesh concept in `_MeshEnv`, so we would have: ``` mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2")) mesh_dim0 = mesh_3d["dim0"] mesh_2d = mesh_3d["dim0", "dim1"] mesh_dim0_2 = mesh_2d["dim0"] # This would evaluate to be True print(_mesh_resources.get_root_mesh(mesh_dim0) == _mesh_resources.get_root_mesh(mesh_dim0_2)) ``` With this change, there are two types of meshes in an environment. 1. `device_mesh != _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh was created by slicing. 2. `device_mesh == _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is a root mesh not created through slicing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132339 Approved by: https://github.com/wanchaol ghstack dependencies: #132310, #132311	2024-08-07 07:01:12 +00:00
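The root-mesh idea can be sketched without any distributed setup. The toy model below is an assumption-laden illustration, not the real `_MeshEnv`: `ToyMesh` and `ToyMeshEnv` are hypothetical classes that only track dimension names, and slicing records the root of the source mesh instead of its immediate parent.

```python
class ToyMesh:
    """Stand-in for a device mesh: just a tuple of dimension names."""

    def __init__(self, dim_names):
        self.dim_names = tuple(dim_names)


class ToyMeshEnv:
    """Tracks, for every sliced mesh, the root mesh it ultimately came from."""

    def __init__(self):
        self._root_of = {}

    def get_root_mesh(self, mesh):
        # A mesh that was never produced by slicing is its own root.
        return self._root_of.get(mesh, mesh)

    def slice(self, mesh, *dim_names):
        missing = [name for name in dim_names if name not in mesh.dim_names]
        if missing:
            raise KeyError(f"unknown mesh dims: {missing}")
        submesh = ToyMesh(dim_names)
        # Record the root rather than the immediate parent.
        self._root_of[submesh] = self.get_root_mesh(mesh)
        return submesh


env = ToyMeshEnv()
mesh_3d = ToyMesh(("dim0", "dim1", "dim2"))
mesh_dim0 = env.slice(mesh_3d, "dim0")
mesh_2d = env.slice(mesh_3d, "dim0", "dim1")
mesh_dim0_2 = env.slice(mesh_2d, "dim0")

# Both 1D slices resolve to the same root, whichever mesh they were cut from.
same_root = env.get_root_mesh(mesh_dim0) is env.get_root_mesh(mesh_dim0_2)
```

In this toy model a mesh is a root exactly when `env.get_root_mesh(mesh) is mesh`, which mirrors the two cases listed in the commit message.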
Vishwa Raj Singh	bcdba9f91d	Added hpu backend support in fsdp utils (#127757 ) In fsdp init_utils, adding support for hpu backend device on _get_device API. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127757 Approved by: https://github.com/wconstab, https://github.com/jgong5, https://github.com/awgu	2024-07-27 03:30:59 +00:00
Aaron Orenstein	634b62f111	typing proxy_tensor.py (#129182 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129182 Approved by: https://github.com/Chillee	2024-07-12 23:17:09 +00:00
Xuehai Pan	973037be6a	[BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): `list()` / `tuple()` / `dict()` (#130199 ) This PR changes the empty collection factory call to Python literals: - `list()` -> `[]` - `tuple()` -> `()` - `dict()` -> `{}` The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary: ```bash $ python3 -m dis - <<EOS import collections d1 = {} d2 = dict() dict = collections.OrderedDict d3 = dict() EOS ``` ```text 0 0 RESUME 0 1 2 LOAD_CONST 0 (0) 4 LOAD_CONST 1 (None) 6 IMPORT_NAME 0 (collections) 8 STORE_NAME 0 (collections) 3 10 BUILD_MAP 0 12 STORE_NAME 1 (d1) 4 14 PUSH_NULL 16 LOAD_NAME 2 (dict) 18 CALL 0 26 STORE_NAME 3 (d2) 6 28 LOAD_NAME 0 (collections) 30 LOAD_ATTR 8 (OrderedDict) 50 STORE_NAME 2 (dict) 7 52 PUSH_NULL 54 LOAD_NAME 2 (dict) 56 CALL 0 64 STORE_NAME 5 (d3) 66 RETURN_CONST 1 (None) ``` The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above). Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199 Approved by: https://github.com/malfet	2024-07-11 17:30:28 +00:00
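The name-shadowing argument can be checked with the standard library alone. This is a minimal sketch: `make_with_literal` and `make_with_call` are illustrative names, and because opcode names vary between CPython versions the code collects them for inspection instead of relying on an exact sequence.

```python
import collections
import dis


def make_with_literal():
    return {}


def make_with_call():
    return dict()


# Collect opcode names for both functions; exact listings depend on the
# CPython version in use.
literal_ops = [ins.opname for ins in dis.get_instructions(make_with_literal)]
call_ops = [ins.opname for ins in dis.get_instructions(make_with_call)]

# Rebinding the global name `dict` changes what the factory call builds,
# but has no effect on the literal.
dict = collections.OrderedDict
shadowed_call_type = type(make_with_call())
literal_type = type(make_with_literal())
del dict  # remove the global so the builtin is visible again
```

While the name is rebound, `shadowed_call_type` is `collections.OrderedDict`, whereas `literal_type` stays the builtin `dict`.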
Xuehai Pan	3b798df853	[BE][Easy] enable UFMT for `torch/distributed/{fsdp,optim,rpc}/` (#128869 ) Part of #123062 - #123062 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128869 Approved by: https://github.com/fegin ghstack dependencies: #128868	2024-06-18 21:49:08 +00:00
Aaron Orenstein	7c12cc7ce4	Flip default value for mypy disallow_untyped_defs [6/11] (#127843 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843 Approved by: https://github.com/oulgen ghstack dependencies: #127842	2024-06-08 18:49:29 +00:00
Iris Z	1d84c7e100	[DeviceMesh] Update get_group and add get_all_groups (#128097 ) Fixes #121984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128097 Approved by: https://github.com/wconstab, https://github.com/wanchaol	2024-06-08 04:28:56 +00:00
Xuehai Pan	67ef2683d9	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#127689 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. Resolves #126888 - #126888 This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689 Approved by: https://github.com/Skylion007	2024-06-02 12:30:43 +00:00
PyTorch MergeBot	033e733021	Revert "[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 )" This reverts commit 749a132fb0a8325cbad4734a563aa459ca611991. Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))	2024-05-31 19:47:24 +00:00
Sergii Dymchenko	a2bff4dc8c	Fix lint (#127584 ) Trivial fix after https://github.com/pytorch/pytorch/pull/124678 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127584 Approved by: https://github.com/huydhn	2024-05-31 00:00:11 +00:00
Rohan Varma	f9a1bc2c65	[FSDP] Remove _sync_module_states (#124678 ) Remove this unused API Differential Revision: [D56445639](https://our.internmc.facebook.com/intern/diff/D56445639/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124678 Approved by: https://github.com/awgu	2024-05-30 23:02:09 +00:00
Xuehai Pan	749a132fb0	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings that their messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. UPDATE: Use `FutureWarning` instead of `DeprecationWarning`. Resolves #126888 - #126888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898 Approved by: https://github.com/albanD	2024-05-29 12:09:27 +00:00
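The `warnings.warn` fallback described above can be illustrated with a short sketch. `old_api` is a hypothetical function made up for this example; the `typing_extensions.deprecated` decorator path is not shown. By default Python displays `FutureWarning` to end users, while `DeprecationWarning` is ignored unless triggered from `__main__`, which is one reason to prefer the former for user-facing deprecations.

```python
import warnings


def old_api():
    # Hypothetical deprecated function, used only for illustration.
    warnings.warn(
        "old_api is deprecated; use new_api instead.",
        category=FutureWarning,
        stacklevel=2,
    )
    return 42


# Record warnings so the emitted category can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = old_api()

categories = [w.category for w in caught]
```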
Andrew Gu	2978f07d0e	[FSDP] Fixed docs for inter/intra node PG helpers (#126288 ) 1. This fixes an issue where we had 9 ranks in one node and 7 in the other. 2. This makes the notation more explicit that `[0, 7]` is `[0, 1, ..., 7]`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126288 Approved by: https://github.com/weifengpy	2024-05-15 19:45:10 +00:00
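To make the rank notation concrete, here is a minimal sketch that assumes 16 ranks laid out as 2 nodes with 8 ranks each. The helper names are hypothetical and this is not the FSDP implementation; it only enumerates which ranks would share an intra-node group and which would share an inter-node group (same local rank on every node).

```python
def intra_node_groups(world_size, ranks_per_node):
    """Ranks on the same node form one group, e.g. [0, 1, ..., 7]."""
    return [
        list(range(start, start + ranks_per_node))
        for start in range(0, world_size, ranks_per_node)
    ]


def inter_node_groups(world_size, ranks_per_node):
    """Ranks with the same local rank across nodes form one group."""
    return [
        list(range(local_rank, world_size, ranks_per_node))
        for local_rank in range(ranks_per_node)
    ]


# Assumed layout: 2 nodes x 8 ranks.
intra = intra_node_groups(16, 8)
inter = inter_node_groups(16, 8)
```

With this layout each intra-node group has exactly 8 ranks, so no node ends up with 9 and another with 7.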
Aaron Gokaslan	1dd42e42c4	[BE]: Try TCH autofixes on torch/ (#125536 ) Tries TCH autofixes and see what breaks Pull Request resolved: https://github.com/pytorch/pytorch/pull/125536 Approved by: https://github.com/ezyang	2024-05-05 23:13:59 +00:00
Andrew Gu	79af814369	[FSDP] Added private `_unshard` API (#124304 ) Some toy example: <img width="998" alt="Screenshot 2024-04-17 at 2 00 05 PM" src="https://github.com/pytorch/pytorch/assets/31054793/b5665a63-beb0-4ca1-92c6-c57a052812fd"> We define `FullyShardedDataParallel._unshard(async_op: bool = False)` that can be used to prefetch all-gathers. The user should make sure: 1. Run lazy init before the first `_unshard` call of training. For example, this can hackily be done via `root_module.check_is_root()` on the root FSDP module `root_module`. 2. Call `root_module._wait_unshard_streams_on_current_stream()` before the first `_unshard` call of the current iteration (just need to call it once after last optimizer step and before first `_unshard` of this iteration). Differential Revision: [D56262876](https://our.internmc.facebook.com/intern/diff/D56262876) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124304 Approved by: https://github.com/wanchaol	2024-05-03 13:14:15 +00:00
Shawn Xu	e203aa9fab	[FSDP] [easy] fix HSDP validation error msg (#123019 ) Summary: This would otherwise yield > ValueError: ('Manual wrapping with ShardingStrategy.HYBRID_SHARD', 'requires explicit specification of process group or device_mesh.') which is odd. Remove the extra tailing commas. Test Plan: CI Differential Revision: D55549851 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123019 Approved by: https://github.com/Skylion007	2024-03-30 18:12:34 +00:00
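The odd-looking message is what Python prints when an exception receives two positional arguments. The sketch below reproduces the effect with plain `ValueError` objects; how the original FSDP code was written is an assumption here, and only the message text is taken from the commit.

```python
strategy = "ShardingStrategy.HYBRID_SHARD"

# A comma between the two string pieces passes two arguments to ValueError,
# so str(exc) renders the whole args tuple.
buggy = ValueError(
    f"Manual wrapping with {strategy}",
    "requires explicit specification of process group or device_mesh.",
)

# Without the comma, adjacent string literals are concatenated into one message.
fixed = ValueError(
    f"Manual wrapping with {strategy} "
    "requires explicit specification of process group or device_mesh."
)

buggy_text = str(buggy)
fixed_text = str(fixed)
```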
Andrew Gu	bf8db86a19	[FSDP] Added deprecation msg for `NO_SHARD` (#119553 ) This only includes the warning for world size >1 since we clamp to `NO_SHARD` for world size 1. We mainly do not want `NO_SHARD` to proliferate anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119553 Approved by: https://github.com/Skylion007	2024-02-09 20:32:03 +00:00
Mihir Patel	33761969a4	Remove parent device mesh check (#118620 ) Removes raising error if a device_mesh has a parent. The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are: - this check - https://github.com/pytorch/pytorch/pull/118618 - a series of PRs related to checkpointing with 3D meshes that I will open We currently monkeypatch for the above which I am slowly upstreaming. I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/118620 Approved by: https://github.com/Skylion007	2024-02-08 00:49:28 +00:00
PyTorch MergeBot	3aeaa21eb0	Revert "Remove parent device mesh check (#118620 )" This reverts commit 3f1f057adfcd4cef67fff9605a894cb075c02881. Reverted https://github.com/pytorch/pytorch/pull/118620 on behalf of https://github.com/atalman due to broke periodic linux-focal-cuda11.8-py3.9-gcc9 ([comment](https://github.com/pytorch/pytorch/pull/118620#issuecomment-1924933878))	2024-02-03 00:22:56 +00:00
Mihir Patel	3f1f057adf	Remove parent device mesh check (#118620 ) Removes raising error if a device_mesh has a parent. The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are: - this check - https://github.com/pytorch/pytorch/pull/118618 - a series of PRs related to checkpointing with 3D meshes that I will open We currently monkeypatch for the above which I am slowly upstreaming. I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/118620 Approved by: https://github.com/wz337, https://github.com/wanchaol	2024-02-02 05:29:49 +00:00
Catherine Lee	4f5785b6b3	Enable possibly-undefined error code (#118533 ) Fixes https://github.com/pytorch/pytorch/issues/118129 Suppressions automatically added with ``` import re with open("error_file.txt", "r") as f: errors = f.readlines() error_lines = {} for error in errors: match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error) if match: file_path, line_number, error_type = match.groups() if file_path not in error_lines: error_lines[file_path] = {} error_lines[file_path][int(line_number)] = error_type for file_path, lines in error_lines.items(): with open(file_path, "r") as f: code = f.readlines() for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True): code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n" with open(file_path, "w") as f: f.writelines(code) ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Co-authored-by: Catherine Lee <csl@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2024-01-30 21:07:01 +00:00
PyTorch MergeBot	40ece2e579	Revert "Enable possibly-undefined error code (#118533 )" This reverts commit 4f13f69a45ef53747e2eefffd65d91ce840b431b. Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))	2024-01-30 19:00:34 +00:00
Edward Z. Yang	4f13f69a45	Enable possibly-undefined error code (#118533 ) Fixes https://github.com/pytorch/pytorch/issues/118129 Suppressions automatically added with ``` import re with open("error_file.txt", "r") as f: errors = f.readlines() error_lines = {} for error in errors: match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error) if match: file_path, line_number, error_type = match.groups() if file_path not in error_lines: error_lines[file_path] = {} error_lines[file_path][int(line_number)] = error_type for file_path, lines in error_lines.items(): with open(file_path, "r") as f: code = f.readlines() for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True): code[line_number - 1] = code[line_number - 1].rstrip() + f" # type: ignore[{error_type}]\n" with open(file_path, "w") as f: f.writelines(code) ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533 Approved by: https://github.com/Skylion007, https://github.com/zou3519	2024-01-30 05:08:10 +00:00
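The suppression script is easier to follow when laid out over several lines. The following is a sketch of the same idea that works on in-memory lists instead of files, so it can be run without an `error_file.txt`. The regex assumes mypy-style lines of the form `path:line:col: error: message  [code]`; the function names and the sample data are made up for illustration.

```python
import re

# Assumed mypy-style line: "path:line:col: error: message  [error-code]".
ERROR_RE = re.compile(r"(.*):(\d+):\d+: error:.*\[(.*)\]")


def collect_errors(error_lines):
    """Map file path -> {line number -> error code}."""
    by_file = {}
    for line in error_lines:
        match = ERROR_RE.match(line)
        if match:
            path, line_number, code = match.groups()
            by_file.setdefault(path, {})[int(line_number)] = code
    return by_file


def add_suppressions(source_lines, errors_for_file):
    """Return a copy of source_lines with `# type: ignore[code]` appended."""
    patched = list(source_lines)
    for line_number, code in errors_for_file.items():
        index = line_number - 1  # mypy line numbers are 1-based
        patched[index] = patched[index].rstrip() + f"  # type: ignore[{code}]\n"
    return patched


# Made-up sample input.
errors = ['pkg/mod.py:2:5: error: Name "x" may be undefined  [possibly-undefined]\n']
source = ["if flag:\n", "    print(x)\n"]
patched = add_suppressions(source, collect_errors(errors)["pkg/mod.py"])
```

Because each suppression is appended to the end of an existing line, line numbers never shift, so the order in which errors are processed does not affect the result in this sketch.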
Wanchao Liang	eebf115686	[fsdp][2d] FSDP sync module states handle tensor subclass (#117336 ) This PR lets FSDP's sync module states kwarg handle tensor subclasses. Because FSDP works on the "dp" mesh dimension, as long as FSDP works on a different device mesh dimension, we can safely let FSDP just broadcast the DTensor local shards. Fixes https://github.com/pytorch/pytorch/issues/117126 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117336 Approved by: https://github.com/awgu	2024-01-13 19:33:47 +00:00
Wanchao Liang	848cfe8d45	[reland] unflatten_tensor on compute stream for DTensorExtension (#117020 ) Reland of https://github.com/pytorch/pytorch/pull/116559, which was reverted internally. The underlying reason for the revert is that torch.dynamo.disable can't be used in the PyTorch codebase because it conflicts with torch.deploy: although the latter only runs some inference, it somehow takes a dependency on FSDP. We have seen the same issue with our functional collectives, where we can't use any dynamo components without torch.deploy complaining. Verified internally that the test passes again after removing torch.dynamo.disable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117020 Approved by: https://github.com/awgu	2024-01-09 21:25:15 +00:00
Qinfan Wu	b847290ddd	Back out "[2d] unflatten_tensor on compute stream for DTensorExtension (#116559 )" (#116939 ) Summary: Original commit changeset: 65298112f3db Original Phabricator Diff: D52530451 Differential Revision: D52583345 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116939 Approved by: https://github.com/842974287	2024-01-07 03:53:40 +00:00
Wanchao Liang	d9c0e37bab	[2d] unflatten_tensor on compute stream for DTensorExtension (#116559 ) Context: the existing FSDPExtension has a bug when unflattening a tensor involves compute or communication on a CUDA stream. The current FSDPExtension unflatten logic runs in the unshard stream, so the runtime loses sync with the compute stream; if there are dependencies between the compute stream and the unflatten logic, the missing sync point could lead to NaNs. This PR makes FSDPExtension record the compute stream and lets DTensorExtension use the compute stream directly for unflatten_tensor. In the long term we might want the FSDP runtime to run only the unshard in the unshard stream and to run the unshard views in the compute stream. For now we fix this in the extension directly, as it is the simplest change that does not affect FSDP runtime logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116559 Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/yifuwang ghstack dependencies: #116426	2024-01-03 07:29:08 +00:00
Iris Zhang (PyTorch)	23fa9621e4	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 ) (#115193 ) Summary: Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation. We created stubs for the public classes and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported whether or not distributed is available. Original diff reverted: D51629761 Original PR reverted: https://github.com/pytorch/pytorch/pull/115099 Prior to landing, all CI signals had passed. Shipit added the "ci/trunk" label to the PR but DID NOT wait for it and went ahead with committing. More context can be found in the reverted PR above. Test Plan: CI. Differential Revision: D51861018 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193 Approved by: https://github.com/fegin	2023-12-08 08:44:32 +00:00
Nikita Shulga	a827ac71f2	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 )" This reverts commit eaa64339d640ed1d36520ada379213f8361be5ff.	2023-12-05 08:59:36 -08:00
Iris Zhang (PyTorch)	eaa64339d6	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 ) Summary: Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation. Original diff reverted: D51629761 Original PR reverted: https://github.com/pytorch/pytorch/pull/114991 It failed a public module binding test on macOS because of the change in import order for torch/distributed/fsdp/_common_utils.py. Since the original import still works, we remove the changes to this file. Test Plan: CI. Differential Revision: D51825114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099 Approved by: https://github.com/wanchaol, https://github.com/fegin	2023-12-05 05:44:52 +00:00
PyTorch MergeBot	3a2e2044cd	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 )" This reverts commit 729ac7317a50a6a195b324cf6cefd748bf4f5498. Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))	2023-12-02 17:55:51 +00:00
Iris Zhang (PyTorch)	729ac7317a	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 ) Summary: Same content of changes as https://github.com/pytorch/pytorch/pull/114710 Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation. ghstack-source-id: 208980207 exported-using-ghexport Test Plan: CI. Reviewed By: wanchaol Differential Revision: D51629761 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991 Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin	2023-12-02 04:39:41 +00:00
wz337	7b3e45be59	[DeviceMesh] Rename get_dim_groups to get_group (#114708 ) Rename get_dim_groups to get_group and update all callsites. Differential Revision: [D51629801](https://our.internmc.facebook.com/intern/diff/D51629801/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114708 Approved by: https://github.com/XilunWu, https://github.com/wanchaol, https://github.com/fegin	2023-11-30 23:40:14 +00:00
Aaron Gokaslan	b7b2178204	[BE]: Remove useless lambdas (#113602 ) Applies PLW0108 which removes useless lambda calls in Python, the rule is in preview so it is not ready to be enabled by default just yet. These are the autofixes from the rule. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602 Approved by: https://github.com/albanD	2023-11-14 20:06:48 +00:00
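For readers unfamiliar with PLW0108 (unnecessary lambda), the rule targets lambdas whose body only forwards their arguments to another callable. A minimal sketch with made-up data:

```python
names = ["fsdp", "ddp", "tp"]

# Flagged pattern: the lambda only forwards its argument to str.upper.
with_lambda = list(map(lambda s: str.upper(s), names))

# Equivalent after the autofix: pass the callable directly.
without_lambda = list(map(str.upper, names))
```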
BJ Hargrave	670abff6ff	docs: Fix docstring lint errors in torch/distributed/fsdp/_flat_param.py & torch/distributed/fsdp/_init_utils.py (#113358 ) Fixes #113189 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113358 Approved by: https://github.com/kit1980	2023-11-11 01:53:02 +00:00
wz337	31ded95cd5	[2D] Bind _fsdp_extension to FSDP instances (#113237 ) Currently, when we have 2D composition, a global variable _extensions controls the 2D deviation we need to take in state_dict calls (See https://github.com/pytorch/pytorch/blob/release/2.1/torch/distributed/fsdp/_fsdp_extensions.py#L66-L68). This is problematic when we have both a 2D model and a plain FSDP model in the same dist environment, as the _extensions will be mistakenly turned on for the plain FSDP model, resulting in state_dict error (RuntimeError: No parent device_mesh is found for FSDP device_mesh.). This PR binds _fsdp_extension to the FSDP instances to make sure that state_dict calls would not get interfered with each other when mixing both 2D and 1D parallelism. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113237 Approved by: https://github.com/fduwjj, https://github.com/fegin	2023-11-09 03:31:03 +00:00
Iris Zhang	12c1465d76	[DeviceMesh] Make mesh_resources private (#112294 ) This is to prepare moving DeviceMesh as a standalone distributed package. `_mesh_resources` should only be used in torch.distributed package. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112294 Approved by: https://github.com/fegin	2023-10-28 17:28:46 +00:00
Matthew Hoffman	68b0db1274	Define the public API for torch.distributed.fsdp (#109922 ) Related: https://github.com/pytorch/pytorch/wiki/Public-API-definition-and-documentation Related: https://github.com/microsoft/pylance-release/issues/2953 This fixes pylance issues for these classes: ``` "FullyShardedDataParallel" is not exported from module "torch.distributed.fsdp" ``` These classes all have public docs: * [`BackwardPrefetch`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.BackwardPrefetch) * [`CPUOffload`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.CPUOffload) * [`FullyShardedDataParallel`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel) * [`MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision) * [`ShardingStrategy`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy) And it seems like all the newly added classes will have docs once they are released. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109922 Approved by: https://github.com/wanchaol	2023-09-28 02:15:58 +00:00
wz337	8140494afd	[3/N][2D] Enable training with new 2D flow (#110034 ) Replacing https://github.com/pytorch/pytorch/pull/109553 as it gets reverted. This PR enables training with new 2D flow and adds associated test. In addition, this PR moves the tensor/parallel/_data_parallel_utils.py that are fsdp specific back to tensor/parallel/fsdp.py to avoid circular dependency for ddp.py and test/distributed/tensor/parallel/test_ddp_2d_parallel.py. state_dict related changes would be in later PRs. cc. @fegin, @fduwjj, @wanchaol, @awgu Pull Request resolved: https://github.com/pytorch/pytorch/pull/110034 Approved by: https://github.com/fduwjj	2023-09-26 09:14:15 +00:00
PyTorch MergeBot	f5886bf352	Revert "[3/N][2D] Enable training with new 2D flow (#109553 )" This reverts commit 217b37c023d58854a7a6117c3726ed44786c9d03. Reverted https://github.com/pytorch/pytorch/pull/109553 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but those distributed failures look legit and they are failing in trunk https://hud.pytorch.org/pr/109553 ([comment](https://github.com/pytorch/pytorch/pull/109553#issuecomment-1734100546))	2023-09-25 16:37:19 +00:00
wz337	217b37c023	[3/N][2D] Enable training with new 2D flow (#109553 ) This PR enables training with new 2D flow and adds associated test. state_dict related changes would be in later PRs. cc. @fegin, @fduwjj, @wanchaol, @awgu Pull Request resolved: https://github.com/pytorch/pytorch/pull/109553 Approved by: https://github.com/fegin, https://github.com/awgu	2023-09-25 05:32:07 +00:00
wz337	0aedacb4f7	[2D][1/N] Add _enable_extension to fsdp state (#109242 ) Add _enable_extension to fsdp state. We will use this to determine whether we should enable the extension or not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109242 Approved by: https://github.com/fegin	2023-09-16 19:03:10 +00:00


125 Commits