AsyncCollectiveTensor is a tensor subclass that is meant to "delay synchronization" when you call into the functional collectives APIs. It does this (if I understand correctly) by internally holding an "unsynchronized" version of the tensor, which is the result of the communication op, and internally calling `.wait()` to synchronize the data the next time it is used.
Previously, these `wait()` calls would happen immediately, because `AsyncCollectiveTensor` gets wrapped by `DTensor()`, which calls `.detach()` on its inner tensor and thereby forces the sync right away (code: 1518d5eec4/torch/distributed/_tensor/api.py (L207))
AsyncCollectiveTensor shouldn't need to synchronize if you try to `detach()` it, though - in fact, it should be fine to avoid synchronizing when you perform any view ops on it (which only require metadata, not the actual data). This PR updates `AsyncCollectiveTensor` to delay the `wait()` call whenever the subclass encounters a view op.
Added some light testing that just runs some DTensor compute followed by view ops, and confirms that the output is still an `AsyncCollectiveTensor` when we call `.to_local()`.
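For reference, here is a minimal sketch of the lazy-wait idea (not the actual `AsyncCollectiveTensor` implementation): a `__torch_dispatch__` wrapper subclass that keeps the unsynchronized tensor plus a work handle, forwards view ops using metadata alone, and only calls `.wait()` when an op actually needs the data. The `LazyWaitTensor`/`_FakeWork` names and the view-op set are illustrative assumptions.
```python
import torch
from torch.utils._pytree import tree_map_only

# Ops that only need metadata, so no data sync is required (illustrative subset).
_VIEW_OPS = {torch.ops.aten.view.default, torch.ops.aten.detach.default}

class LazyWaitTensor(torch.Tensor):
    """Wraps a possibly in-flight tensor plus a work handle exposing .wait()."""

    @staticmethod
    def __new__(cls, elem, work=None):
        r = torch.Tensor._make_wrapper_subclass(
            cls, elem.size(), strides=elem.stride(),
            dtype=elem.dtype, device=elem.device)
        r._elem = elem   # possibly-unsynchronized result of the collective
        r._work = work   # has .wait(); set to None once synchronized
        return r

    def _sync(self):
        if self._work is not None:
            self._work.wait()
            self._work = None
        return self._elem

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func in _VIEW_OPS:
            # View ops only touch metadata: run them on the inner tensor and
            # re-wrap the result, keeping the pending work handle alive.
            out = func(*tree_map_only(cls, lambda t: t._elem, args),
                       **tree_map_only(cls, lambda t: t._elem, kwargs))
            return cls(out, args[0]._work)
        # Any other op needs the real data: synchronize first.
        return func(*tree_map_only(cls, lambda t: t._sync(), args),
                    **tree_map_only(cls, lambda t: t._sync(), kwargs))

class _FakeWork:
    def wait(self):  # stands in for a real ProcessGroup work handle
        print("synchronized")

t = LazyWaitTensor(torch.ones(4), _FakeWork())
v = t.view(2, 2)  # no sync: still an unsynchronized LazyWaitTensor
s = v + 1         # non-view op: triggers wait() before computing
```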
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105240
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/wconstab
This PR adds initial dynamo support for DTensor, in particular, it:
- allows DTensor to be passed into a compiled function, and allows fakifying DTensor during dynamo tracing by turning the inner local tensor into a meta tensor.
- We use `allow_in_graph` so that `DTensor` and `DTensor.from_local` are represented as `TorchVariable`
- The created DTensor becomes a normal `TensorVariable`, and tensor operations on it are inserted into the output graph just like for a plain torch.Tensor
- Note that DTensor has an extra instance method `redistribute` compared to a plain tensor; we currently special-case it in `TensorVariable`
`from_local` and `redistribute` both accept some non-trivial metadata as arguments (i.e. DeviceMesh, Placement) which fx.Graph does not support. In order to let these two APIs appear in the dynamo-captured graph, we encode the metadata into a new function (like `functools.partial`) that only accepts prim args (i.e. tensors), and then we put a `call_function` node with this new function into the graph. This was suggested by @ezyang. The underlying rationale is that the metadata will not change across graph invocations, so it is safe to encode it. A minimal sketch of this binding trick is shown after the captured graph below.
Captured graph:
```
def forward(self, L_x_ : torch.Tensor):
    l_x_ = L_x_
    # File: /scratch/wanchaol/work/pytorch/test/distributed/_tensor/test_dtensor.py:685, code: dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
    prim_from_local = torch__dynamo_variables_torch_prim_from_local(l_x_, run_check = False); l_x_ = None
    # File: /scratch/wanchaol/work/pytorch/test/distributed/_tensor/test_dtensor.py:686, code: return dt.redistribute(mesh, [Replicate()]).to_local() + 2
    prim_redistribute = torch__dynamo_variables_tensor_prim_redistribute(prim_from_local); prim_from_local = None
    to_local = prim_redistribute.to_local(); prim_redistribute = None
    add = to_local + 2; to_local = None
    return (add,)
```
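As an illustration of the binding trick that produces graph targets like `torch__dynamo_variables_torch_prim_from_local` above, here is a hedged sketch. The names (`prim_redistribute`, `_MESH_METADATA`) and the stand-in body are assumptions for the example, not the real dynamo helpers; the point is only that the non-traceable metadata lives inside the callable, while the fx graph sees just tensor arguments.
```python
import torch
import torch.fx as fx

# Stand-in for the DeviceMesh/Placement metadata that fx.Graph cannot encode
# as node arguments (illustrative values only).
_MESH_METADATA = {"mesh": [0, 1], "placements": "[Replicate()]"}

def prim_redistribute(t: torch.Tensor) -> torch.Tensor:
    # The real helper would call dtensor.redistribute(mesh, placements);
    # here we only show that the metadata is baked into the callable.
    _ = _MESH_METADATA
    return t

g = fx.Graph()
x = g.placeholder("x")
out = g.call_function(prim_redistribute, args=(x,))
g.output(out)
print(g)  # the graph records only the tensor argument
```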
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103146
Approved by: https://github.com/voznesenskym
The test should respect self.device_type as it checks whether the environment
has enough GPUs to serve the requested world size.
The test will hang if we try to run 8 ranks on our 2-4 GPU CI instances.
Fixes #104769
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105357
Approved by: https://github.com/wanchaol
This PR adds the necessary plumbing through torchdynamo to allow tensor subclasses that follow a certain contract (i.e. implement `__tensor_flatten__` and `__tensor_unflatten__`) to go through the dynamo fakification pass by fakifying the tensor subclass's internal components.
Some of the tensor subclass contract logic is borrowed from https://github.com/pytorch/pytorch/pull/97540
Added some tests to verify that simply passing a tensor subclass (i.e. DTensor) through dynamo eager works as expected.
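As a rough illustration of the contract, here is a hedged sketch only: the exact signatures of `__tensor_flatten__`/`__tensor_unflatten__` have changed across PyTorch releases, and `WrapperTensor` is a made-up toy subclass, not DTensor.
```python
import torch

class WrapperTensor(torch.Tensor):
    """Toy wrapper subclass exposing a flatten/unflatten contract."""

    @staticmethod
    def __new__(cls, inner: torch.Tensor, extra_meta=None):
        r = torch.Tensor._make_wrapper_subclass(
            cls, inner.size(), dtype=inner.dtype, device=inner.device)
        r._inner = inner            # real tensor component to be fakified
        r._extra_meta = extra_meta  # non-tensor metadata (e.g. a sharding spec)
        return r

    def __tensor_flatten__(self):
        # Report which attributes hold inner tensors, plus opaque context.
        return ["_inner"], self._extra_meta

    @staticmethod
    def __tensor_unflatten__(inner_tensors, context):
        # Rebuild the subclass around the (possibly fake/meta) inner tensors.
        return WrapperTensor(inner_tensors["_inner"], context)
```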
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105308
Approved by: https://github.com/ezyang
When tensor.size(self.dim) < num_chunks, we now fill the missing chunks with empty tensors (https://github.com/pytorch/pytorch/pull/98722). Therefore, we no longer need this assert.
For example, when sharding a tensor with 1 element on 2 ranks along dim 0, the results are as follows:
```
rank:0, dtensor:DTensor(local_tensor=tensor([0.4963], device='cuda:0'), device_mesh=DeviceMesh:([0, 1]), placements=[Shard(dim=0)])
rank:1, dtensor:DTensor(local_tensor=tensor([], device='cuda:1'), device_mesh=DeviceMesh:([0, 1]), placements=[Shard(dim=0)])
```
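For reference, here is a minimal sketch of the padding behavior this relies on; `shard_with_padding` is a hypothetical helper for illustration, not the actual DTensor sharding code.
```python
import torch

def shard_with_padding(t: torch.Tensor, num_chunks: int, dim: int = 0):
    # torch.chunk simply returns fewer chunks when there aren't enough
    # elements, so pad with empty tensors up to num_chunks.
    chunks = list(torch.chunk(t, num_chunks, dim=dim))
    while len(chunks) < num_chunks:
        chunks.append(t.new_empty(0))  # empty shard for ranks beyond the data
    return chunks

x = torch.rand(1)
print(shard_with_padding(x, 2))  # second shard is empty, matching rank 1 above
```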
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101218
Approved by: https://github.com/wanchaol
This PR changes the context manager behavior of device mesh: we now use a mesh env to track the current mesh and push meshes onto a stack, so that nested context managers are allowed.
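A minimal sketch of the stack-based pattern (illustrative names only, not the actual DeviceMesh implementation):
```python
# Each __enter__ pushes the mesh and __exit__ pops it, so nesting
# automatically restores the previous "current mesh".
class _MeshEnv:
    def __init__(self):
        self.mesh_stack = []

    def get_current_mesh(self):
        return self.mesh_stack[-1] if self.mesh_stack else None

_mesh_env = _MeshEnv()

class Mesh:
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        _mesh_env.mesh_stack.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        _mesh_env.mesh_stack.pop()

with Mesh("outer"):
    with Mesh("inner"):
        assert _mesh_env.get_current_mesh().name == "inner"
    assert _mesh_env.get_current_mesh().name == "outer"
```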
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101202
Approved by: https://github.com/wz337