pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-24 07:27:32 +08:00

Author	SHA1	Message	Date
Yeounoh Chung	f7ec984b1b	[DTensor][XLA] support XLA backend in distribute_module API (#121355 ) Addresses #92909 cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang Pull Request resolved: https://github.com/pytorch/pytorch/pull/121355 Approved by: https://github.com/wanchaol	2024-03-08 15:47:33 +00:00
Yeounoh Chung	4f9d4e1ab0	[DTensor][XLA] refactor DTensor _xla API (#113214 ) In response to the change pytorch/xla#5776 and #92909 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113214 Approved by: https://github.com/wanchaol	2024-03-07 06:18:05 +00:00
Wanchao Liang	2e50566722	[dtensor] change distribute_module input/output_fn to accept module (#120895 ) This is a BC breaking change to distribute_module. The underlying rationale for this change is that sometimes, in the input_fn/output_fn, the user would want access to the current module for some attributes. This might not be common enough, but in some cases it's worth having access to the module. An outstanding use case we want to support is float8: if we want to make float8 work with the TP API, the input_fn/output_fn of TP parallel styles would need to get access to the module, where the module might encapsulate the `dynamic_linear.emulate` attribute, which is useful for input/output casting. Since this is needed for fp8 and DTensor is still under prototype release, I feel it's worth the change, and it's better that we make the change early. Right now this is a soft BC break, which means we still maintain BC but throw deprecation messages. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120895 Approved by: https://github.com/tianyu-l	2024-03-04 07:22:32 +00:00
Andrew Gu	87fb8b6218	[DTensor] Relaxed `to_local` `requires_grad` warning (#118186 ) The existing warning in `DTensor.__new__()` checks `if requires_grad != local_tensor.requires_grad:` and warns with: > To construct DTensor from `torch.Tensor`, it's recommended to use `local_tensor.detach()` and make `requires_grad` consistent. Calling `local_tensor.detach()` will have the returned `Tensor` have `requires_grad=False`, so the error message refers to the case where `local_tensor.requires_grad is True` but the user passed `requires_grad=False` to `to_local()`. However, there is the converse case, where `local_tensor.requires_grad is False` but the user passed `requires_grad=True`. In this case, the original `if requires_grad != local_tensor.requires_grad:` check succeeds, and the warning is emitted. However, the warning message does not apply in that case. This can happen via `_prepare_output_fn` -> `redistribute` -> `Redistribute.forward()`, where `output.requires_grad is False` but it passes `requires_grad=input.requires_grad` which can be `True`. We should not warn in this case since `Redistribute.forward()` is our own framework code, so I was proposing to relax the warning. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118186 Approved by: https://github.com/XilunWu, https://github.com/wanchaol ghstack dependencies: #117994	2024-01-25 15:49:32 +00:00
Wanchao Liang	c170fbd309	[dtensor] refactor redistribute and fix uneven sharding redistribution (#115525 ) This PR: - refactors the redistribute implementation logic to make it more sound, by figuring out the transform information first and then applying the transformations step by step; we also cache the decisions so that they can be reused - for uneven sharding, refactors the uneven sharding logic and uses a logical shape concept for each transform information to fix the uneven sharding multi-mesh redistribute bug. Fixes https://github.com/pytorch/pytorch/issues/115310 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115525 Approved by: https://github.com/XilunWu	2024-01-22 18:57:44 +00:00
Yue Dong	270ed13e87	[DTensor] Make DTensor `from_local` backward partial() to replicate() pass through (#115967 ) Summary: This change makes the `DTensor.from_local()` placements in the backward pass from `Partial()` to `Replicate()` a pass-through, for the following reasons: 1. When we run the backward pass of DTensor.from_local, if the target placement is partial() (i.e. from user manual overwrite code instead of torch_dispatch) we keep the grad as replicate. This is because converting the gradients back to `Partial()` is meaningless. 2. The current div logic will lead to wrong numerical values in the above case. Test Plan: CI: CI Tests Unit test: `buck2 test mode/dev-nosan //caffe2/test/distributed/_tensor:redistribute` - Passed With model training:
```
# We tested the case where the input tensor is manually overwritten as Partial() and
# the output tensor is manually overwritten to Shard() then to local.

# Before the change: numerical value not correct
Forward pass:
    collective: ReduceScatter
Backward pass:
    collective: AllGather + div by process group size

# After the change: div is removed as expected.
Forward pass:
    collective: ReduceScatter
Backward pass:
    collective: AllGather
```
Differential Revision: D52175709 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115967 Approved by: https://github.com/wanchaol	2023-12-19 00:16:10 +00:00
Iris Zhang (PyTorch)	23fa9621e4	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 ) (#115193 ) Summary: Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation. We created stubs for public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported with or without distributed is available(). Original diff reverted: D51629761 Original PR reverted: https://github.com/pytorch/pytorch/pull/115099 Prior to landing, CI signals are all passed. Shipit added the "ci/trunk" label to the PR and DID NOT wait for it and went ahead committing. More context can be found in the reverted PR above. Test Plan: CI. Differential Revision: D51861018 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193 Approved by: https://github.com/fegin	2023-12-08 08:44:32 +00:00
wz337	dacf5d6e92	[DTensor] Remove assert to allow tensor sharding dimension < Shard(x).ndim (#115114 ) Consolidated by changes made by @yoyoyocmu. https://www.internalfb.com/diff/D51821717 Remove assert to allow tensor dimension < Shard(x).ndim. With the current padding, we do support this already. Follow up: we will still need to fix the size mismatch and `full_tensor()` hang when tensor is uneven-sharded. Created issue here: https://github.com/pytorch/pytorch/issues/115310 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115114 Approved by: https://github.com/yoyoyocmu, https://github.com/wanchaol	2023-12-07 21:57:30 +00:00
Joel Schlosser	22704426c3	Expand dynamic dims support for traceable subclasses (#114311 ) Continuation of #112185, following the design in this [doc](https://docs.google.com/document/d/1ipSxcTzEMMOAPvxP-YJlD5JBZZmIGgh8Q34ixtOUCRo). Summary:
* Introduce `SubclassSymbolicPolicy` containing separate dynamic dim / constraint policies for the outer and inner tensors
* Expand the automatic dynamic algorithm to recurse into inner tensors and produce one of these for a subclass instance
* Maintain legacy behavior for subclasses by recursively calling `mark_dynamic()` on inner tensors of the same dim as outer when `mark_dynamic(outer, ...)` is called
* Addresses this: `6a86cf00ad/torch/_dynamo/variables/builder.py (L1750)`
* Add `outer_size` and `outer_stride` arguments to `__tensor_unflatten__()` so that you can find out what symbols were allocated for the outer size / stride (you are expected to return a tensor that compares equal to the outer symbols)
* Signatures now:
```python
# attrs is a list of inner tensor attributes on x; inner_tensor = getattr(x, attr)
# ctx is anything useful for rebuilding the class we want to guard on
attrs, ctx = x.__tensor_flatten__()
...
# inner_tensors is a dict of {attr -> tensor}
# ctx is taken unmodified from flattening and (eventually) guarded on
# outer_size is the expected size of the output; possibly symbolic
# outer_stride is the expected strides of the output; possibly symbolic
y = MySubclass.__tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride)

# at the __tensor_unflatten__() call-site in PT2, we assert y.shape == outer_size and y.stride() == outer_stride
# the assert simplifies symbols when there are relationships between outer and inner symbols
```
* Size info needed for `NestedTensor` at least, stride info needed for `DTensor` at least
* Punting on `outer_storage_offset` because storage_offset handling is horribly broken in PT2 right now
* ~~Add new `__tensor_mark_dynamic__()` to allow overriding the behavior of mark_dynamic on a per-subclass basis~~ (booted to future work)
* ~~Add guards for tensor subclasses by calling `__tensor_flatten__()` in the guard to test equality on `ctx`~~
* Now handled in #114469
* Next PR: add TENSOR_MATCH guards on inner tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114311 Approved by: https://github.com/ezyang, https://github.com/drisspg, https://github.com/voznesenskym, https://github.com/bdhirsh	2023-12-05 21:09:25 +00:00
Nikita Shulga	a827ac71f2	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 )" This reverts commit eaa64339d640ed1d36520ada379213f8361be5ff.	2023-12-05 08:59:36 -08:00
Iris Zhang (PyTorch)	eaa64339d6	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 ) Summary: Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation. Original diff reverted: D51629761 Original PR reverted: https://github.com/pytorch/pytorch/pull/114991 It was failing a public module binding test on macOS, due to the change in import order in torch/distributed/fsdp/_common_utils.py. Since the original import still works, we remove the changes in this file. Test Plan: CI. Differential Revision: D51825114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099 Approved by: https://github.com/wanchaol, https://github.com/fegin	2023-12-05 05:44:52 +00:00
PyTorch MergeBot	3a2e2044cd	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 )" This reverts commit 729ac7317a50a6a195b324cf6cefd748bf4f5498. Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))	2023-12-02 17:55:51 +00:00
Iris Zhang (PyTorch)	729ac7317a	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 ) Summary: Same content of changes as https://github.com/pytorch/pytorch/pull/114710 Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation. ghstack-source-id: 208980207 exported-using-ghexport Test Plan: CI. Reviewed By: wanchaol Differential Revision: D51629761 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991 Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin	2023-12-02 04:39:41 +00:00
Andrew Gu	c39c69953f	[DTensor] Used new placements for neg dim in `distribute_tensor` (#113930 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113930 Approved by: https://github.com/wanchaol ghstack dependencies: #113919, #113924, #114134, #113925	2023-11-20 22:32:58 +00:00
Andrew Gu	e2095a04ae	[DTensor] Ensured `grad_placements` was tuple (#113925 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113925 Approved by: https://github.com/wanchaol ghstack dependencies: #113919, #113924, #114134	2023-11-20 22:32:58 +00:00
Andrew Gu	f4ffd46c08	[DTensor] Used new placements for neg dim in `from_local` (#114134 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114134 Approved by: https://github.com/wanchaol ghstack dependencies: #113919, #113924	2023-11-20 22:32:51 +00:00
Andrew Gu	b41ad7d695	[DTensor] Used new placements for neg dim in `redistribute` (#113924 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113924 Approved by: https://github.com/wanchaol ghstack dependencies: #113919	2023-11-20 22:30:16 +00:00
Wanchao Liang	b16e3b5373	[funcol] add two APIs: wait() and numpy() (#113323 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113323 Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/wconstab	2023-11-14 09:27:45 +00:00
Wanchao Liang	6ed20af10e	[dtensor] refactor op dispatch and fix is_same_size/equal (#112927 ) torch.equal/is_same_size currently skips sharding prop and directly do local tensor compute, this is wrong. for these two ops: - torch.equal: should not skip sharding prop, need to have two DTensor have the SAME sharding before compare local shard values - torch.is_same_size: need to completely skip both sharding prop and local compute This PR refactors the existing op_dispatch to make it a class instance so that we can do custom op handling, then fixes both torch.equal and torch.is_same_size Pull Request resolved: https://github.com/pytorch/pytorch/pull/112927 Approved by: https://github.com/fduwjj, https://github.com/XilunWu	2023-11-13 22:46:31 +00:00
Wanchao Liang	9834fb7fd0	[dtensor] full_tensor to return synchronously (#113322 ) The full_tensor API should return synchronously instead of returning an AsyncCollectiveTensor; if the result is one, we wait on it directly. This makes the full_tensor API more precise. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113322 Approved by: https://github.com/wz337	2023-11-09 18:02:40 +00:00
Iris Zhang	9af3f98faf	[DTensor] Fix DTensor.from_local() returns DTensor with wrong size for uneven sharded tensor (#110781 ) Fixes #110762 This PR fixes the issue described in #110762 by adding kwargs for shape and stride when creating a DTensor using `DTensor.from_local()`. When `shape` and `stride` are provided, we skip the calculation of `tensor_shape` and `tensor_stride` using `compute_global_tensor_info()`, as `compute_global_tensor_info()` always assumes even sharding. Test plan: ``` python3 test/distributed/_tensor/test_dtensor.py -k test_from_local_uneven_sharding python3 test/distributed/_tensor/test_dtensor.py -k test_from_local_uneven_sharding_raise_error ``` cc. @wanchaol Pull Request resolved: https://github.com/pytorch/pytorch/pull/110781 Approved by: https://github.com/wanchaol	2023-11-04 11:21:10 +00:00
Wanchao Liang	2f09da3a21	[dtensor] Introduce full_tensor API to DTensor (#112224 ) This PR introduces a `full_tensor` API to DTensor, there were so many callsites that exercises the `redistribute(replicate)` path and I feel it deserves a separate API, mostly just a syntactic sugar Pull Request resolved: https://github.com/pytorch/pytorch/pull/112224 Approved by: https://github.com/wz337	2023-10-31 00:44:09 +00:00
Iris Zhang	12c1465d76	[DeviceMesh] Make mesh_resources private (#112294 ) This is to prepare moving DeviceMesh as a standalone distributed package. `_mesh_resources` should only be used in torch.distributed package. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112294 Approved by: https://github.com/fegin	2023-10-28 17:28:46 +00:00
Wanchao Liang	61461f39d1	[dtensor] handle negative dim and fix TP regression (#111750 ) TP style still has some regressions due to negative dim specifications; fix this by allowing the DTensor API to handle negative dims and normalize them. For example, TP uses `Shard(-1)` and then tries to redistribute `Shard(1) -> Shard(-1)`. This should ideally be a no-op, but currently it runs a decompose sharding phase that turns this transformation into `Shard(1) -> Replicate -> Shard(-1)`, which is wrong and triggers unnecessary allgathers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111750 Approved by: https://github.com/rohan-varma	2023-10-22 04:25:45 +00:00
Wanchao Liang	1d291e1f19	[dtensor] hide xla imports to avoid warning (#111751 ) xla imports throw warnings about xla not imported and we should only import xla when needed Pull Request resolved: https://github.com/pytorch/pytorch/pull/111751 Approved by: https://github.com/rohan-varma	2023-10-22 04:09:10 +00:00
Yeounoh Chung	8376079b97	[DTensor][XLA] Support Xla backend in distribute_tensor API (#110275 ) This addresses #92909 , and enable XLA backend support for `distribute_tensor` API. Test plan: added a unit test case & tested with CloudTPU. The CI should skip this unless it's a XLA workflow. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110275 Approved by: https://github.com/wanchaol, https://github.com/alanwaketan, https://github.com/JackCaoG	2023-10-21 01:17:15 +00:00
fduwjj	fdc29f58c6	[TP] Refactor style to make it work with torch.compile (#111625 ) We are refactoring parallel style to solve the following things: 1. To further simplify the code logic to make it more readable for users. 2. To remove the tuple check so that we can work with dynamo for now. Ideally dynamo needs to support this case and we will fix it in parallel. 3. Add tests for the newly added parallel style in UT and the torch compile test so that we can capture regressions due to code changes. 4. Move the placements early return check into DTensor since it is bypassed by dynamo. 5. Remove PairwiseParallelStyle from unit tests to use the new Col/Rowwise parallel style. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111625 Approved by: https://github.com/wanchaol	2023-10-20 19:20:43 +00:00
Brian Hirsh	4d29b40299	torch.compile DTensor E2E (#105236 ) This PR updates DTensor to support torch.compile. Cool stuff: there are some new tests in `test_dtensor.py` that show both the forward and backward graphs that we can send to inductor when running a matmul with DTensors. In particular, for this user code:
```
def fn(x, y):
    dt = DTensor.from_local(x.reshape(2, 4), mesh, [Shard(0)], run_check=False)
    dt2 = DTensor.from_local(y.reshape(4, 2), mesh, [Shard(1)], run_check=False)
    dt_out = torch.matmul(dt, dt2)
    dt_out_redistribute = dt_out.redistribute(mesh, [Replicate()])
    return dt_out.to_local()
```
We generate the following forward and backward graphs. Forward graph:
```
def forward(self, primals_1, primals_2):
    view = torch.ops.aten.view.default(primals_1, [2, 4]); primals_1 = None
    _to_copy = torch.ops.aten._to_copy.default(view, dtype = torch.float32, layout = torch.strided, device = device(type='cuda', index=0)); view = None
    detach = torch.ops.aten.detach.default(_to_copy); _to_copy = None
    detach_1 = torch.ops.aten.detach.default(detach); detach = None
    view_1 = torch.ops.aten.view.default(primals_2, [4, 2]); primals_2 = None
    _to_copy_1 = torch.ops.aten._to_copy.default(view_1, dtype = torch.float32, layout = torch.strided, device = device(type='cuda', index=0)); view_1 = None
    detach_2 = torch.ops.aten.detach.default(_to_copy_1); _to_copy_1 = None
    detach_3 = torch.ops.aten.detach.default(detach_2); detach_2 = None
    detach_4 = torch.ops.aten.detach.default(detach_1)
    all_gather_into_tensor = torch.ops.c10d_functional.all_gather_into_tensor.default(detach_3, 'ptd:0', [0, 1], 2)
    wait_tensor = torch.ops.c10d_functional.wait_tensor.default(all_gather_into_tensor); all_gather_into_tensor = None
    split = torch.ops.aten.split.Tensor(wait_tensor, 4); wait_tensor = None
    getitem = split[0]
    getitem_1 = split[1]; split = None
    cat = torch.ops.aten.cat.default([getitem, getitem_1], 1); getitem = getitem_1 = None
    detach_5 = torch.ops.aten.detach.default(cat); cat = None
    mm = torch.ops.aten.mm.default(detach_4, detach_5); detach_4 = detach_5 = None
    detach_6 = torch.ops.aten.detach.default(mm); mm = None
    detach_9 = torch.ops.aten.detach.default(detach_6); detach_6 = None
    detach_10 = torch.ops.aten.detach.default(detach_9); detach_9 = None
    t = torch.ops.aten.t.default(detach_1); detach_1 = None
    detach_13 = torch.ops.aten.detach.default(t); t = None
    t_1 = torch.ops.aten.t.default(detach_3); detach_3 = None
    detach_15 = torch.ops.aten.detach.default(t_1); t_1 = None
    clone = torch.ops.aten.clone.default(detach_15, memory_format = torch.contiguous_format); detach_15 = None
    return [detach_10, detach_13, clone]
```
Backward graph:
```
def forward(self, detach_13, clone, tangents_1):
    detach_11 = torch.ops.aten.detach.default(tangents_1); tangents_1 = None
    detach_12 = torch.ops.aten.detach.default(detach_11); detach_11 = None
    mm_1 = torch.ops.aten.mm.default(detach_13, detach_12); detach_13 = None
    detach_14 = torch.ops.aten.detach.default(mm_1); mm_1 = None
    detach_16 = torch.ops.aten.detach.default(detach_12); detach_12 = None
    all_gather_into_tensor_2 = torch.ops.c10d_functional.all_gather_into_tensor.default(clone, 'ptd:0', [0, 1], 2); clone = None
    wait_tensor_2 = torch.ops.c10d_functional.wait_tensor.default(all_gather_into_tensor_2);
    detach_17 = torch.ops.aten.detach.default(wait_tensor_2); wait_tensor_2 = None
    mm_2 = torch.ops.aten.mm.default(detach_16, detach_17); detach_16 = detach_17 = None
    detach_18 = torch.ops.aten.detach.default(mm_2); mm_2 = None
    split_1 = torch.ops.aten.split.Tensor(detach_14, 2, 1); detach_14 = None
    getitem_2 = split_1[0]
    getitem_3 = split_1[1]; split_1 = None
    cat_1 = torch.ops.aten.cat.default([getitem_2, getitem_3]); getitem_2 = getitem_3 = None
    reduce_scatter_tensor = torch.ops.c10d_functional.reduce_scatter_tensor.default(cat_1, 'SUM', 'ptd:0', [0, 1], 2); cat_1 = None
    wait_tensor_3 = torch.ops.c10d_functional.wait_tensor.default(reduce_scatter_tensor); reduce_scatter_tensor = None
    detach_19 = torch.ops.aten.detach.default(wait_tensor_3); wait_tensor_3 = None
    detach_20 = torch.ops.aten.detach.default(detach_19); detach_19 = None
    detach_21 = torch.ops.aten.detach.default(detach_20); detach_20 = None
    detach_22 = torch.ops.aten.detach.default(detach_21); detach_21 = None
    _to_copy_2 = torch.ops.aten._to_copy.default(detach_22, dtype = torch.float32, layout = torch.strided, device = device(type='cpu')); detach_22 = None
    view_2 = torch.ops.aten.view.default(_to_copy_2, [8]); _to_copy_2 = None
    detach_23 = torch.ops.aten.detach.default(detach_18); detach_18 = None
    detach_24 = torch.ops.aten.detach.default(detach_23); detach_23 = None
    _to_copy_3 = torch.ops.aten._to_copy.default(detach_24, dtype = torch.float32, layout = torch.strided, device = device(type='cpu')); detach_24 = None
    view_3 = torch.ops.aten.view.default(_to_copy_3, [8]); _to_copy_3 = None
    return [view_3, view_2]
```
Some of the stuff in this graph looks kind of silly though (e.g. an unnecessary split() + cat(), and all the extra detach() calls). Stuff that's broken:
- functionalization is pretty horribly broken. In particular, the original strategy I used in this stack was to have functionalization run above subclass desugaring. But that doesn't play well with the way we want to compile DTensor. DTensor has a few APIs like `.redistribute()`, `.to_local()`, and the `DTensor()` constructor that we want to put directly into the graph so that we can compile them (e.g. redistribute() will desugar into collective ops). Doing this requires functionalization to run underneath the subclass though. I hacked around this for now, by forcing these functions to run functionalization first if they need to.
- the backward test that I have is... wrong. The backward graph that we trace out looks kind of reasonable, but it gives incorrect gradients on one of the two inputs. This needs further debugging (presumably we should be able to stare at the graph and identify which part of it is wrong?).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105236 Approved by: https://github.com/wanchaol	2023-10-11 21:55:27 +00:00
Wanchao Liang	2a76c7f018	[dtensor] skip move to device when device_type match (#110774 ) Skip tensor.to in from_local and distribute_tensor when the device_type of the device mesh matches the tensor.device type. Since from_local is on the critical path of TP, this might also reduce some overhead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110774 Approved by: https://github.com/fduwjj	2023-10-09 19:39:11 +00:00
fduwjj	2dc5e166a5	[TP][Inference] Enable DTensor TP inference (#110751 ) In https://github.com/pytorch/pytorch/pull/109977, we observed that during inference mode, aten.Linear does not get decomposed. So instead of enabling sharding propagation for linear op, we use func.decompose so that it gets decomposed to matmul and mm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110751 Approved by: https://github.com/bdhirsh, https://github.com/wanchaol	2023-10-07 18:57:27 +00:00
Wanchao Liang	c95cf4b4c9	[dtensor] add grad placements kwarg to to_local API (#110629 ) When we convert to a local tensor, dtensor can't track autograd or the gradient layout of the local tensor anymore. If the user does something unexpected, there needs to be a way for the user to hint at the gradient layout of the local tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110629 Approved by: https://github.com/zdevito	2023-10-05 21:34:01 +00:00
Wanchao Liang	9456de937b	[dtensor] Fix and improve the sharding cache behavior (#109306 ) resolves https://github.com/pytorch/pytorch/issues/109101 The problem is essentially because we were hashing all the arguments, including the scalar too (i.e. aten.div(tensor, scalar)), in the optimizer, the scalar might change everytime we call the op, thus cache miss everytime we call the op This PR improves the sharding cache behavior by introducing a RuntimeSchemaInfo, used to record some runtime necessary hashing information during op registration time. This enable us to: * only hash arguments that are tensor or have static_argnum, this is to enable many cases like aten.div.Tensor(tensor, 0.23231) hit the cache. as we currently hashing all args which exclude those cases * with the correct cache behavior, optimizers will hit the cache again and resolve the high cpu overhead issue. simple MLP shows all cache hit and for a single addmm -> 0.319ms (from 0.341ms), shows some hashing improvements: <img width="1172" alt="Screenshot 2023-09-14 at 11 06 07 AM" src="https://github.com/pytorch/pytorch/assets/9443650/3406d673-dd8d-4ad9-9b80-9d4721c430e3"> Adam optimizer shows aten.div hit sharding cache again <img width="1016" alt="Screenshot 2023-09-14 at 11 02 10 AM" src="https://github.com/pytorch/pytorch/assets/9443650/4280e8e3-af44-4fc2-8360-ea80b768f1d9"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/109306 Approved by: https://github.com/fduwjj	2023-09-15 10:32:49 +00:00
Wanchao Liang	09f3e08bcc	[dtensor][3/n] use dedicated TensorMeta instead of the fx one (#108261 ) This PR switches the usage of fx's shape prop TensorMetadata to dtensor's own dedicated defined TensorMeta, this is because DTensor only cares three fields: shape/stride/dtype, all other fields are not necessary and can be inferred from local_tensor directly. This would help significantly simplify how we deal with the tensor metadata by not caring other fields. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108261 Approved by: https://github.com/fduwjj ghstack dependencies: #107306	2023-09-13 04:08:02 +00:00
Brian Hirsh	5efd63b1b8	better support for fakeifying and dynamoing through torch_dispatch subclasses (with dynamic shapes) (#107415 ) There is already some support for plumbing `__torch_dispatch__` tensor subclasses through dynamo, but this PR beefs it up a bit and adds a test. In particular: (1) Fakeifying tensor subclasses didn't properly set autograd metadata (requires_grad, is_leaf) on the newly fakeified wrapper subclass. I don't actually have a test for this in this PR, but it's tested pretty heavily later in my aot autograd tests (2) Fakeifying tensor subclasses didn't properly track source information for dynamic shapes on the inner tensors. I added a new `WrapperSubclassFieldSource` subclass, that represents a source coming from a tensor field on a wrapper subclass, which I use in the fakeifying logic, and again in symbolic_shapes.py to generate proper guards. (3) `_make_wrapper_subclass()` marginally updated this code to work better with dynamic shapes. One thing that's a bit weird about `_make_wrapper_subclass`: it has two overloads, and the first explicitly does not support dynamic shapes (and the second.. does not support kwargs). I think that later we probably want to consolidate / at least make the first overload work with dynamic shapes, but I didn't want to handle that in this PR (so these smaller changes seemed like a strict improvement). Pull Request resolved: https://github.com/pytorch/pytorch/pull/107415 Approved by: https://github.com/ezyang	2023-08-29 02:36:48 +00:00
Wanchao Liang	945fa7e8a8	[dtensor] fix requires_grad in distribute_tensor (#107606 ) This PR fixes the requires_grad set when calling distribute_tensor, we should set the requires_grad of the local tensor after the detach call to make sure we create the leaf correctly, otherwise it would raise warnings Pull Request resolved: https://github.com/pytorch/pytorch/pull/107606 Approved by: https://github.com/fduwjj	2023-08-22 23:08:13 +00:00
Wanchao Liang	d8f2ef10a6	[dtensor][1/n] refactor op dispatch logic to reduce overhead (#107305 ) This PR is the first change of a series of refactors to the op dispatch logic to: 1. remove the redundant logic in the op dispatch, simplify the error checking 2. reduce the number of tree_map/tree_flatten/unflatten needed to reduce the overhead coming from those operations 3. remove the CachedShardingPropagator by using lru_cache from functools directly, this makes it not only helps TP, but general DTensor operations could be faster! 4. change the view ops behavior by inplace changing the op_schema, which is dangerous for sharding prop caching, model the view op as one type of resharding too 5. enrich output sharding to include whether the op needs redistribute so that we don't need explicit op schema comparison to know it. This should help with further reducing the CPU overhead, benchmark results: before (without this change), aten.addmm latency: 0.476ms ![Screenshot 2023-08-16 at 10 46 26 AM](https://github.com/pytorch/pytorch/assets/9443650/7692e6c1-1936-4c7f-bf9c-6c8c9b8f6c76) after (with this change), aten.addmm latency: 0.341ms ![Screenshot 2023-08-16 at 11 05 49 AM](https://github.com/pytorch/pytorch/assets/9443650/15a53f0b-7a95-444e-ab2f-3ee0ad2fa47f) overall one layer of mlp time reduced from 13.535 -> 9.665ms Apart from overhead reduction, this PR simplifies the op dispatching logic and the resharding logic (more refactor needed to make things more clean, which will be done in later PRs) Pull Request resolved: https://github.com/pytorch/pytorch/pull/107305 Approved by: https://github.com/fduwjj	2023-08-18 18:30:46 +00:00
fduwjj	4a6ca4cc05	[TP][DTensor Perf] Some perf improvement to reduce DTensor CPU overhead (#106524 ) By inspecting a small TP benchmark, we found a couple of things we can optimize: 1. We call deep_copy so many times when we initialize DTensor. 2. Some sharding_prop is not cached successfully. 3. We are still calling redistribute when not necessary. ![image](https://github.com/pytorch/pytorch/assets/6937752/b847d110-eea1-45df-9298-066d0ba07dd7) ![image](https://github.com/pytorch/pytorch/assets/6937752/fc08f564-caed-496b-80d7-275c1dba3806) ![image](https://github.com/pytorch/pytorch/assets/6937752/fdc06cc4-a4ba-48e8-a118-c041bbd04f5e) So we want to: 1. Remove the deep_copy, and we now make placements a tuple so we are sure it's immutable. 2. Somehow the op_schema gets changed during sharding propagation, so we store a hashed version of it before passing it to sharding_prop. Ideally we want to figure out why `op_schema` gets changed, but it looks like it gets changed in both the index and detach/view ops, so it might take more time to debug. 3. Also, when we hash the op_schema, we want to hash the entire args_schema, not just the args_spec, which only contains the DTensorSpec from args that are DTensors. 4. It turns out that sometimes a DTensor has mem_format set to None (not contiguous), and this leads to redistribute being triggered, so we only need to compare type/shape and stride in the metadata. Also we need to ensure _Partial and Shard have different hash values in the DTensorSpec. ![image](https://github.com/pytorch/pytorch/assets/6937752/321e6890-1ab6-4975-adc9-524c6ef9a76b) Pull Request resolved: https://github.com/pytorch/pytorch/pull/106524 Approved by: https://github.com/wanchaol	2023-08-14 20:03:19 +00:00
Wanchao Liang	c9cbcb2449	[device_mesh] move remaining collectives to a separate file (#107012 ) Move the remaining collectives to a separate file to prepare device mesh to become a public distributed API. The remaining utils need to be upstreamed to functional collectives with a proper implementation; a TODO was added there for a follow-up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/107012 Approved by: https://github.com/fduwjj	2023-08-11 23:49:27 +00:00
alanhe151220037	1afbc985fe	Make RNGStateTracker support cuda-like device (#106771 ) replace `CudaRNGStateTracker` with `RNGStateTracker` by rewriting some Cuda-binding code with `device_handle` Pull Request resolved: https://github.com/pytorch/pytorch/pull/106771 Approved by: https://github.com/wanchaol	2023-08-10 19:14:33 +00:00
Iris	0cba33e176	[DTensor]Minor Docstring Update (#106250 ) Fix docstring to reflect change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/106250 Approved by: https://github.com/wanchaol	2023-08-02 00:27:29 +00:00
Wanchao Liang	f139aab2f4	[dynamo] add initial dynamo support for DTensor (#103146 ) This PR adds initial dynamo support for DTensor. In particular, it: - allows a DTensor to be passed into a compiled function, and allows fakifying the DTensor during dynamo tracing by turning the inner local tensor into a meta tensor. - uses `allow_in_graph` so that `DTensor` and `DTensor.from_local` are represented as `TorchVariable` - The DTensor created becomes a normal `TensorVariable`, and it inserts any tensor operations into the output graph just like torch.Tensor - Note that DTensor has a new instance method `redistribute` compared to a plain tensor, and we currently special-case it in `TensorVariable`. `from_local` and `redistribute` both accept some non-trivial metadata as arguments (i.e. DeviceMesh, Placement) which fx.Graph does not support. In order to let these two APIs appear in the dynamo captured graph, we encode the metadata into a new_function (like `functools.partial`) that only accepts prim args (i.e. tensor), then we put a `call_function` with this new_function into the graph. This was suggested by @ezyang. The underlying rationale is that the metadata will not change across graph invocations, so it's safe to encode it. 
Captured graph:
```
def forward(self, L_x_ : torch.Tensor):
    l_x_ = L_x_

    # File: /scratch/wanchaol/work/pytorch/test/distributed/_tensor/test_dtensor.py:685, code: dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
    prim_from_local = torch__dynamo_variables_torch_prim_from_local(l_x_, run_check = False); l_x_ = None

    # File: /scratch/wanchaol/work/pytorch/test/distributed/_tensor/test_dtensor.py:686, code: return dt.redistribute(mesh, [Replicate()]).to_local() + 2
    prim_redistribute = torch__dynamo_variables_tensor_prim_redistribute(prim_from_local); prim_from_local = None
    to_local = prim_redistribute.to_local(); prim_redistribute = None
    add = to_local + 2; to_local = None
    return (add,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103146 Approved by: https://github.com/voznesenskym	2023-07-19 16:01:12 +00:00
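The commit above describes encoding metadata into a new function "like `functools.partial`" so that the graph node only receives primitive arguments. A minimal sketch of that idea is below; the `from_local` function here is a hypothetical stand-in operating on plain Python lists, not `DTensor.from_local`, and the mesh and placement values are made-up placeholders.

```python
from functools import partial


def from_local(tensor, mesh, placements, run_check=True):
    """Hypothetical stand-in: tags a list of numbers with mesh/placement metadata."""
    return {
        "data": list(tensor),
        "mesh": mesh,
        "placements": tuple(placements),
        "run_check": run_check,
    }


# Bind the metadata once. The resulting callable takes only the tensor-like
# argument, which is the only kind of value the graph needs to carry.
prim_from_local = partial(
    from_local,
    mesh=("rank0", "rank1"),   # placeholder for a device mesh
    placements=["Shard(0)"],   # placeholder for placement objects
    run_check=False,
)

result = prim_from_local([1.0, 2.0])
```

This is only safe when the bound metadata is constant across invocations, which matches the rationale given in the commit message.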
Wanchao Liang	cb23373264	[dynamo] allow tensor subclass fakification in dynamo (#105308 ) This PR adds the necessary plumbing through torchdynamo to allow tensor subclasses that follow a certain contract (i.e. with `__tensor_flatten__` and `__tensor_unflatten__`) to go through the dynamo fakification pass by fakifying the tensor subclass's internal components. Some of the tensor subclass contract logic is mostly borrowed from https://github.com/pytorch/pytorch/pull/97540 Added some tests to verify that simply passing a tensor subclass (i.e. DTensor) through dynamo eager works as expected. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105308 Approved by: https://github.com/ezyang	2023-07-18 17:28:04 +00:00
Wanchao Liang	bcb9ca4e5a	[dtensor] canonicalize detach callsites and use `view_as` when appropriate (#105239 ) This PR canonicalizes the detach callsites so that detach is only called from `distribute_tensor`. The other callsites are changed to view_as, and the detach call in the tensor constructor is removed. This is so that we don't detach the local tensor for every op run when rewrapping the DTensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/105239 Approved by: https://github.com/albanD	2023-07-18 17:13:37 +00:00
Xilun Wu	a66107a30c	[DTensor][Random] Introduce CudaRNGStateTracker to maintain parallel RNG state for DTensor (#103235 ) # Change This PR adds two classes to DTensor: 1. `CudaRNGStateTracker`: `CudaRNGStateTracker` stores Random Number Generator (RNG) state (a `ByteTensor` object) in a `dict`, mapping from a corresponding tag to each state tensor. It also provides a set of convenient utility methods to help access/modify the state tensors. The most important interface is `_distribute_region`, which will be used when DTensor executes a random op (an operator that calls RNG). 2. `OffsetBasedRNGTracker`: This subclass of `CudaRNGStateTracker` defines the default policy of how RNG states should be shared and synchronized among all ranks to respect the semantics of DTensor random operators. # Warning - With `Multi-threaded ProcessGroup`, the global variable `_rng_tracker` will be shared among threads (ranks) and cause issues. We need to figure out a compatible solution for that. - The RNG state may be asynchronous outside of participating ranks. It is harmless in our current use case of submesh though. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103235 Approved by: https://github.com/wanchaol	2023-06-27 19:00:25 +00:00
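The tracker design described above (named RNG states kept in a `dict`, with a region that temporarily swaps one in) can be sketched with Python's standard `random` module. This is an illustrative analogue under that assumption only: the class below is hypothetical, uses CPU state from `random` rather than CUDA `ByteTensor` states, and does not reproduce the real `CudaRNGStateTracker` interface.

```python
import random
from contextlib import contextmanager


class RNGStateTracker:
    """Hypothetical sketch: keeps named RNG states in a dict, keyed by tag."""

    def __init__(self):
        self.states = {}

    def set_seed(self, tag, seed):
        # Record the state a fresh seed produces, without disturbing the global RNG.
        saved = random.getstate()
        random.seed(seed)
        self.states[tag] = random.getstate()
        random.setstate(saved)

    @contextmanager
    def distribute_region(self, tag):
        # Swap in the tracked state for the duration of the block.
        saved = random.getstate()
        random.setstate(self.states[tag])
        try:
            yield
        finally:
            # Persist the advanced tracked state, then restore the global RNG.
            self.states[tag] = random.getstate()
            random.setstate(saved)


tracker = RNGStateTracker()
tracker.set_seed("parallel", 1234)

random.seed(0)
with tracker.distribute_region("parallel"):
    inside = random.random()   # drawn from the tracked "parallel" stream
outside = random.random()      # drawn from the untouched global stream
```

Because the region saves its advanced state on exit, consecutive regions with the same tag continue one stream instead of replaying it.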
Wanchao Liang	4cc474dec4	[dtensor] support torch.save/load with DTensor (#103106 ) This PR makes DTensor picklable and adds tests verifying that torch.save/load works correctly for DTensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103106 Approved by: https://github.com/kumpera	2023-06-09 04:11:15 +00:00
fduwjj	92923aca61	[TP] Use Stride inferred from local tensor in to_local bwd (#102630 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/102630 Approved by: https://github.com/wanchaol	2023-06-01 04:30:24 +00:00
Wanchao Liang	c5d4ee2d73	[dtensor][simple] fix some comments (#102661 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/102661 Approved by: https://github.com/fduwjj, https://github.com/XilunWu	2023-06-01 03:23:19 +00:00
Wanchao Liang	70eccdbf92	[dtensor] add necessary logging to APIs and components (#101994 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/101994 Approved by: https://github.com/wz337	2023-05-23 18:17:54 +00:00
Wanchao Liang	599ae95d1a	[dtensor] use stack to manage mesh resources (#101202 ) This PR changes the context manager behavior of device mesh: we now use a mesh env to track the current mesh and save the mesh to a stack, so that nested context managers are allowed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/101202 Approved by: https://github.com/wz337	2023-05-11 23:48:36 +00:00
Wanchao Liang	55a1dc7f88	[dtensor] redistributed by default take self mesh instead (#99060 ) This PR switches redistribute to use the DTensor's own mesh by default instead of the global mesh, which is more user-friendly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99060 Approved by: https://github.com/mrshenli	2023-04-14 05:14:28 +00:00


61 Commits