499 Commits

Author SHA1 Message Date
7ebd816aab Switch DTensor to use funcol::all_reduce. (#95804)
This relands the part of #95009 that caused a regression.

BC: This changes the signature and semantics of DeviceMesh::all_reduce.

DeviceMesh::all_reduce now uses a functional collective under the hood which makes it more easily traceable.
You no longer need to use CommTensor to get a trace.

all_reduce is now async-only and uses AsyncCollectiveTensor to ensure proper stream synchronization.

Signature change: the async_op param was removed and the return type changed from Optional[Work] to torch.Tensor.
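A minimal sketch of the new call pattern; the mesh setup and the mesh_dim keyword are illustrative assumptions, not code from this PR:

```python
import torch
from torch.distributed._tensor import DeviceMesh

# Assumes torch.distributed is initialized with a 4-rank world (e.g. via torchrun).
mesh = DeviceMesh("cuda", list(range(4)))
local = torch.ones(2, 2, device="cuda")

# No async_op flag and no Optional[Work] return anymore: the call returns a
# torch.Tensor backed by AsyncCollectiveTensor, which synchronizes with the
# collective stream when the result is first used.
reduced = mesh.all_reduce(local, mesh_dim=0)
print(reduced)
```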
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95804
Approved by: https://github.com/fegin
2023-03-02 17:55:01 +00:00
b3d8fae042 Fix typos in documents under torch directory (#95709)
This PR fixes typos in `.md` files under the `torch` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95709
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-03-01 23:43:35 +00:00
7a772bfff9 [dtensor] add submesh example to checkpoint_example (#95655)
This PR adds a submesh example to checkpoint_example for checkpointing purposes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95655
Approved by: https://github.com/XilunWu
2023-03-01 08:19:27 +00:00
2a1cb9640c [dtensor] support creating DTensor in submesh (#95458)
This PR supports creating a DTensor on a submesh: if the rank is not
participating in the mesh, we assign an empty local tensor and do nothing
during operator dispatch.
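A minimal sketch of the behavior this enables; the world size and submesh ranks are illustrative assumptions:

```python
import torch
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

# Assumes a 4-rank world; the submesh only covers ranks 0 and 1.
submesh = DeviceMesh("cuda", [0, 1])
global_tensor = torch.randn(8, 8)

# On ranks 0/1 this shards the tensor across the submesh; on ranks 2/3,
# which are not in the mesh, the DTensor holds an empty local tensor and
# operator dispatch on it becomes a no-op.
dt = distribute_tensor(global_tensor, submesh, placements=[Shard(0)])
```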

Differential Revision: [D43643577](https://our.internmc.facebook.com/intern/diff/D43643577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95458
Approved by: https://github.com/XilunWu
2023-02-28 17:54:26 +00:00
261eb46ddd [dtensor] refactor get_coordinate (#95457)
This refactors get_coordinate to return an Optional[list] instead of the
coordinate on a given dim directly, so that we can easily check whether the
rank is inside the mesh.
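A small usage sketch of the new return type; the mesh construction is an illustrative assumption:

```python
from torch.distributed._tensor import DeviceMesh

mesh = DeviceMesh("cpu", [0, 1])  # assumes a 2-rank world

# get_coordinate() now returns Optional[list]: None when this rank is not
# part of the mesh, otherwise the rank's coordinate along each mesh dim.
coord = mesh.get_coordinate()
if coord is None:
    pass  # this rank does not participate in the mesh
else:
    dim0_index = coord[0]
```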

Differential Revision: [D43643579](https://our.internmc.facebook.com/intern/diff/D43643579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95457
Approved by: https://github.com/XilunWu
2023-02-28 17:54:26 +00:00
bb9a05b116 [dtensor] use tracing for metadata prop (#95456)
This PR uses tracing for metadata prop, so that we can get correct
shape/stride metadata without calculating it manually ourselves.

A follow-up PR will adopt tracing for the sharding prop itself.
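A rough illustration of the idea using fake tensors; this is not the PR's code, and the op and shapes are made up:

```python
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

# Run the op on fake tensors so the output shape/stride come from the op
# itself instead of hand-written metadata math.
with FakeTensorMode():
    a = torch.empty(4, 8)
    b = torch.empty(8, 2)
    out = torch.mm(a, b)

print(out.shape, out.stride())  # metadata is available without real compute
```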

Differential Revision: [D43643578](https://our.internmc.facebook.com/intern/diff/D43643578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95456
Approved by: https://github.com/XilunWu
2023-02-28 17:54:22 +00:00
d950f45577 Revert "[Functional Collectives] Migrate DeviceMesh::all_reduce to use functional all_reduce. (#95009)"
This reverts commit 0765dbc25ed9368f41225e7de231ee3dd6b188a3.

Reverted https://github.com/pytorch/pytorch/pull/95009 on behalf of https://github.com/jeanschmidt due to this PR is causing internal breakages. Check https://fburl.com/diff/me41urq8
2023-02-27 19:21:58 +00:00
31ce32b03d Fix typos in documents under torch (#95597)
This PR fixes typos in `.md` documents under the `torch` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95597
Approved by: https://github.com/ezyang
2023-02-27 19:07:47 +00:00
0765dbc25e [Functional Collectives] Migrate DeviceMesh::all_reduce to use functional all_reduce. (#95009)
BC: This changes the signature and semantics of DeviceMesh::all_reduce.

DeviceMesh::all_reduce now uses a functional collective under the hood which makes it more easily traceable.
You no longer need to use CommTensor to get a trace.

all_reduce is now async-only and uses AsyncCollectiveTensor to ensure proper stream synchronization.

Signature change: the `async_op` param was removed and the return type changed from `Optional[Work]` to `torch.Tensor`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95009
Approved by: https://github.com/wanchaol
2023-02-24 02:10:55 +00:00
ee0e7f0529 [dtensor] add checkpointing example (#94743)
This PR adds a DTensor sharding example on a simple MLP model
for checkpointing reference purposes.

Note that checkpointing itself is not implemented yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94743
Approved by: https://github.com/wz337
2023-02-16 22:04:09 +00:00
b209d8fa0d [PT-D][Sequence Parallelism] Enable DTensor based Naive sequence parallelism (#94369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94369
Approved by: https://github.com/wanchaol
2023-02-16 21:21:00 +00:00
67d9790985 [BE] Apply almost all remaining flake8-comprehension checks (#94676)
Applies the remaining flake8-comprehension fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, more performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving them into just the set call.
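An illustrative before/after of this kind of fix; the variable names are made up:

```python
words = ["a", "bb", "a", "ccc"]

# Before: unnecessary generator expressions wrapped in list()/set()
lengths_before = list(len(w) for w in words)
unique_before = set(w for w in words)

# After: equivalent comprehensions, shorter and slightly faster
lengths_after = [len(w) for w in words]
unique_after = {w for w in words}

assert lengths_after == lengths_before and unique_after == unique_before
```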

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676
Approved by: https://github.com/ezyang
2023-02-12 01:01:25 +00:00
680fc84e7b [dtensor] group public APIs together (#94524)
This PR groups distribute_tensor/distribute_module into api.py and renames some APIs to non-public (ToTensor/FromTensor).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94524
Approved by: https://github.com/XilunWu
2023-02-10 23:40:34 +00:00
09598b603f [dtensor] update readme for prototype release (#94517)
This PR updates the README for the prototype release, removes some code
that is not available yet, and uses the ones that work.

It also renames to DTensor in most sentences.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94517
Approved by: https://github.com/fegin
2023-02-09 22:35:26 +00:00
8fce9a09cd [BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308)
Applies parts of pyupgrade to torch (starting with the safest changes).
This PR does only two things: it removes the need to inherit from object and removes unused future imports.
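A small before/after sketch of the two changes; the class name is made up:

```python
# Before: Python-2-era idioms that pyupgrade removes.
from __future__ import print_function  # a no-op (unused) on Python 3

class LinearWrapper(object):  # explicit `object` base is redundant
    pass

# After: same behavior on Python 3, with less noise.
class LinearWrapperNew:
    pass
```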

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-07 21:10:56 +00:00
d05ec0efeb [dtensor] add split_with_sizes op (#93957)
Adds the split_with_sizes op, sharing the implementation with the split op.
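A minimal usage sketch; the mesh and placement setup are illustrative assumptions, and torch.split with a list of sizes dispatches to aten::split_with_sizes:

```python
import torch
from torch.distributed._tensor import DeviceMesh, Replicate, distribute_tensor

mesh = DeviceMesh("cpu", [0, 1])  # assumes a 2-rank world
dt = distribute_tensor(torch.arange(12.0).reshape(6, 2), mesh, [Replicate()])

# With the new sharding rule this dispatches through DTensor and each
# resulting piece is itself a DTensor.
a, b, c = torch.split(dt, [1, 2, 3], dim=0)
```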
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93957
Approved by: https://github.com/XilunWu
2023-02-03 04:16:30 +00:00
6f3018d50b [DTensor] implement dist_split as a sharding prop rule (#93306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93306
Approved by: https://github.com/wanchaol
2023-02-02 07:56:44 +00:00
b82f93d561 [DTensor] fix DTensorSpec dim_map description (#93160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93160
Approved by: https://github.com/wanchaol
2023-02-02 07:56:44 +00:00
60e503d468 [dtensor][6/N] change to a better/safer op registration (#90735)
This PR changes the op registration to a better mechanism: we now require
direct overload registration instead of an op key string (sketched below),
which has several benefits:
1. We ensure that the registration targets the correct op, so it fails if
  the registration is wrong (this PR already fixes several registration
  errors found while switching to direct OpOverload registration).
2. If an overload name gets changed or deleted, we immediately know at the
  source code compilation level, which is safer.
3. It also keeps the mechanism consistent with the op registration used by
  other tensor subclasses within PyTorch.
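A hedged sketch of the difference; the rule tables and the aten overload chosen here are illustrative, not the PR's actual code:

```python
import torch

# Before (illustrative): rules keyed by a free-form op-name string; a typo
# or a renamed overload only shows up at runtime, if at all.
old_rules = {"aten::sum": lambda op_schema: ...}

# After (illustrative): keyed by the OpOverload object itself, so a missing
# or renamed overload fails as soon as this module is imported.
new_rules = {torch.ops.aten.sum.default: lambda op_schema: ...}
```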

Differential Revision: [D42876250](https://our.internmc.facebook.com/intern/diff/D42876250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90735
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-02-01 05:06:33 +00:00
9a56997fe1 [dtensor][5/N] add cached propagator for TP (#90734)
This PR adds a cached propagator for TP use: it caches the sharding
prop decision for the same input sharding on an operator, which can
improve eager-mode performance.
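A rough sketch of the caching idea; this is not the real class, the names are made up, and it assumes the op-schema key is hashable:

```python
from functools import lru_cache

class CachedPropagator:
    """Memoize sharding decisions so repeated dispatches of the same
    operator with the same input sharding skip re-running propagation."""

    def __init__(self, propagate_fn):
        self._propagate = lru_cache(maxsize=None)(propagate_fn)

    def propagate_op_sharding(self, op_schema):
        # op_schema must be hashable for the cache lookup to work
        return self._propagate(op_schema)
```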

Differential Revision: [D42876249](https://our.internmc.facebook.com/intern/diff/D42876249)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90734
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-02-01 05:04:08 +00:00
b072245178 [dtensor][4/N] refactor dispatching logic and add propagator (#90733)
This PR refactors the dispatching logic to make it cleaner and isolates
the sharding propagation logic into a separate class.

This is so that we can implement more complicated propagation features
later.

Differential Revision: [D42876251](https://our.internmc.facebook.com/intern/diff/D42876251)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90733
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-02-01 05:02:11 +00:00
8b3e01cd30 [DTensor] implement dist_cat as a sharding prop rule (#92677)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92677
Approved by: https://github.com/wanchaol
2023-01-27 02:14:17 +00:00
77f336600a [PT-D] Enable Meta Tensor Support for DTensor (#92652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92652
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2023-01-26 04:54:57 +00:00
b985c2ef4a [PT-D] Enable init ops for DTensor (#92651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92651
Approved by: https://github.com/wanchaol
2023-01-23 04:38:11 +00:00
c55f6973e4 [dtensor][3/N] move OpSchema and types to a separate file (#90732)
This PR moves OpSchema and related types to a separate file to better
resolve circular dependencies; this is part of the dispatching-logic
refactor that enables more complicated features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90732
Approved by: https://github.com/XilunWu
2023-01-18 07:16:23 +00:00
dc95ef25e5 [dtensor][2/N] add __repr__ to placements (#91785)
This PR adds __repr__ to all placement types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91785
Approved by: https://github.com/XilunWu
2023-01-18 07:16:23 +00:00
a1186d6af9 [dtensor][1/N] add __hash__ to device_mesh and dtensor_spec (#90731)
This PR adds __hash__ to device_mesh and dtensor_spec to allow
things like dict indexing
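A minimal sketch of what hashability enables; the mesh construction assumes a 2-rank world and is illustrative:

```python
from torch.distributed._tensor import DeviceMesh

mesh = DeviceMesh("cpu", [0, 1])  # assumes a 2-rank world

# With __hash__ defined, a DeviceMesh can serve as a dict/set key,
# e.g. to cache per-mesh state.
per_mesh_state = {mesh: {"sub_pgs": []}}
assert mesh in per_mesh_state
```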
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90731
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-01-18 07:16:21 +00:00
9942ddd5b3 [threaded_pg] enable subpg creation and concurrent collective (#91649)
This PR refactors the threaded PG logic to enable creating multiple sub-PGs
under the world threaded PG, and to allow calling collectives concurrently
on different sub-PGs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91649
Approved by: https://github.com/XilunWu
2023-01-17 03:26:34 +00:00
513c1e71e2 [DTensor] check DeviceMesh ranks contiguity (#91802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91802
Approved by: https://github.com/wanchaol
2023-01-16 01:17:45 +00:00
b7cad020b5 [DTensor] require DeviceMesh size equals world size (#91801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91801
Approved by: https://github.com/wanchaol
2023-01-12 22:37:55 +00:00
3dd9dbd942 [DTensor] create default process group when absent (#91756)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91756
Approved by: https://github.com/wanchaol
2023-01-12 22:37:55 +00:00
712170e929 [threaded pg] adapt test_pointwise_ops.py (#90713)
Differential Revision: [D42153660](https://our.internmc.facebook.com/intern/diff/D42153660)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90713
Approved by: https://github.com/wanchaol
2022-12-20 23:37:40 +00:00
7afba50508 [dtensor] delete unused torch_function (#90449)
torch_function is not actually used yet today; delete it first and we can
revisit once we really need it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90449
Approved by: https://github.com/fduwjj
2022-12-10 01:29:02 +00:00
f51f6aa387 Fix non-existing parameters in docstrings (#90505)
Continuation after https://github.com/pytorch/pytorch/pull/90163.

Here is a script I used to find all the non-existing arguments in the docstrings (the script can give false positives in the presence of *args/**kwargs or decorators):

_Edit:_
I've realized that the indentation is wrong for the last `break` in the script, so the script only gives output for a function if the first docstring argument is wrong. I'll create a separate PR if I find more issues with the corrected script.

``` python
import ast
import os
import docstring_parser

for root, dirs, files in os.walk('.'):
    for name in files:
        if root.startswith("./.git/") or root.startswith("./third_party/"):
            continue
        if name.endswith(".py"):
            full_name = os.path.join(root, name)
            with open(full_name, "r") as source:
                tree = ast.parse(source.read())
                for node in ast.walk(tree):
                    if isinstance(node, ast.FunctionDef):
                        all_node_args = node.args.args
                        if node.args.vararg is not None:
                            all_node_args.append(node.args.vararg)
                        if node.args.kwarg is not None:
                            all_node_args.append(node.args.kwarg)
                        if node.args.posonlyargs is not None:
                            all_node_args.extend(node.args.posonlyargs)
                        if node.args.kwonlyargs is not None:
                            all_node_args.extend(node.args.kwonlyargs)
                        args = [a.arg for a in all_node_args]
                        docstring = docstring_parser.parse(ast.get_docstring(node))
                        doc_args = [a.arg_name for a in docstring.params]
                        clean_doc_args = []
                        for a in doc_args:
                            clean_a = ""
                            for c in a.split()[0]:
                                if c.isalnum() or c == '_':
                                    clean_a += c
                            if clean_a:
                                clean_doc_args.append(clean_a)
                        doc_args = clean_doc_args
                        for a in doc_args:
                            if a not in args:
                                print(full_name, node.lineno, args, doc_args)
                            break  # mis-indented (see the note above): only the first doc arg is checked

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90505
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2022-12-09 21:43:09 +00:00
9e314bd822 [dtensor] handle the case where output of op is Optional[Tensor] (#90241)
Observed by @aazzolini: some ops might have Optional[Tensor] returns
where they return None (e.g. native_layer_norm_backward). This is a mismatch
between the C++ aten op signature and Python's None, and we need to handle
it on the Python side.
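A minimal sketch of the kind of handling this needs on the Python side; wrap_output and to_dtensor are hypothetical names, not the PR's actual helpers:

```python
from typing import Optional

import torch

def to_dtensor(t: torch.Tensor, spec):
    # Hypothetical stand-in for the real DTensor wrapping helper.
    return t

def wrap_output(res: Optional[torch.Tensor], spec) -> Optional[torch.Tensor]:
    # An absent Optional[Tensor] return (e.g. from native_layer_norm_backward)
    # arrives as None; pass it through instead of trying to wrap it.
    if res is None:
        return None
    return to_dtensor(res, spec)

print(wrap_output(None, spec=None))           # -> None
print(wrap_output(torch.ones(2), spec=None))  # -> wrapped tensor
```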
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90241
Approved by: https://github.com/aazzolini
2022-12-06 18:17:20 +00:00
2c2cce73d4 [dtensor] remove torchgen function schema and parse manually (#90106)
This PR gets rid of torchgen FunctionSchema parsing and parses the schema
manually; this should resolve the torchgen package issue and also
provide some perf wins when running DTensor eagerly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90106
Approved by: https://github.com/awgu
2022-12-06 05:45:00 +00:00
29ea1c9c8e [doc] update dtensor readme (#89991)
I fixed some import errors in the DTensor README.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89991
Approved by: https://github.com/wanchaol
2022-12-01 22:16:39 +00:00
bf23e0bdbd [dtensor] ufmt distributed._tensor (#89967)
cmd: `ufmt format torch/distributed/_tensor`

Copied from Andrew:

Notes for VSCode users:

Install ufmt: https://pypi.org/project/ufmt/
Install the VSCode ufmt extension: https://marketplace.visualstudio.com/items?itemName=omnilib.ufmt
Include the following in settings.json:
```
{
    "[python]": {
        "editor.defaultFormatter": "omnilib.ufmt",
        "editor.formatOnSave": true,
    },
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89967
Approved by: https://github.com/fduwjj
2022-12-01 20:58:13 +00:00
4451eb24e6 Move tensor_parallel out to distributed.tensor folder (#89878)
This PR moves tensor parallel from torch.distributed._tensor.parallel
to torch.distributed.tensor.parallel, to prepare for the beta release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89878
Approved by: https://github.com/fduwjj
2022-11-30 22:13:10 +00:00
009dd3c4af [PT-D][Tensor Parallel] Add more test cases when we use use_orig_params for FSDP wrapping (#89779)
Differential Revision: [D41600656](https://our.internmc.facebook.com/intern/diff/D41600656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89779
Approved by: https://github.com/wanchaol
2022-11-30 06:34:58 +00:00
12f98f85bc [dtensor] update README (#89800)
This PR updates the README to include the RFC details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89800
Approved by: https://github.com/mrshenli
2022-11-30 04:35:32 +00:00
de0dee30d0 [PT-D][3/N] Sync TP API change to Pytorch (#89535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89535
Approved by: https://github.com/wanchaol
2022-11-23 16:13:49 +00:00
00b9473ad6 [PT-D][Tensor Parallelism][2/N] Sync TP API change to PT prod (#89467)
This is part of TP Beta Release efforts.
ref: https://github.com/pytorch/tau/issues/576
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89467
Approved by: https://github.com/wanchaol
2022-11-22 03:05:53 +00:00
6afe341276 [PT-D][1/N] Sync TP Beta change to prod (#89242)
This is part of TP Beta Release efforts.

ref: https://github.com/pytorch/tau/issues/576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89242
Approved by: https://github.com/wanchaol
2022-11-19 18:01:25 +00:00
f20b3f2e57 [dtensor] PART 8: move tensor parallel api and tests to core distributed (#88180)
This PR moves the tensor/parallel folder and tests to torch.distributed.

part of https://github.com/pytorch/pytorch/issues/88838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88180
Approved by: https://github.com/aazzolini
2022-11-16 08:07:50 +00:00
1b88476320 [dtensor] PART 4: move remaining DTensor ops to core distributed (#88550)
This PR moves the view-related DTensor ops to core distributed;
tests will be added in follow-up PRs.

part of https://github.com/pytorch/pytorch/issues/88838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88550
Approved by: https://github.com/fduwjj
2022-11-16 08:07:44 +00:00
2dcf0978a2 [dtensor] PART 3: move most DTensor ops to core distributed (#88177)
This PR moves most DTensor ops to torch.distributed._tensor. We will
add all tests in the following PRs.

part of https://github.com/pytorch/pytorch/issues/88838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88177
Approved by: https://github.com/fduwjj
2022-11-16 08:07:42 +00:00
4b945967de [dtensor] PART 2: move DTensor abstraction and APIs to core distributed (#88176)
This PR moves the core DTensor abstraction and high level APIs to
torch.distributed._tensor folder, which includes the following:
1. DTensor class
2. high level APIs (distribute_tensor/module)
3. dispatching logic
4. redistribute logic

part of https://github.com/pytorch/pytorch/issues/88838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88176
Approved by: https://github.com/fduwjj
2022-11-16 08:07:41 +00:00
370fc5cb42 [dtensor] PART 1: move DeviceMesh and placement to core distributed (#88549)
This PR creates the `torch.distributed._tensor` package and moves
DeviceMesh and PlacementTypes into it.

part of https://github.com/pytorch/pytorch/issues/88838
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88549
Approved by: https://github.com/fduwjj
2022-11-16 08:07:39 +00:00