Commit Graph

29 Commits

2f3b0befed [BE]: Apply ruff FURB 118. (#124743)
Replaces various lambdas with operator.itemgetter, which is more efficient (as it is a builtin function). Particularly useful when lambdas are used as 'key' functions.
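
For illustration, a minimal before/after of the kind of rewrite FURB118 performs (the list here is hypothetical):

```python
import operator

pairs = [("b", 2), ("a", 1)]

# Before: a lambda used as a sort key
pairs.sort(key=lambda p: p[0])

# After: operator.itemgetter is a builtin, avoiding Python-level call overhead
pairs.sort(key=operator.itemgetter(0))
```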

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124743
Approved by: https://github.com/albanD, https://github.com/malfet
2024-04-26 14:34:52 +00:00
81740fd1f6 [DCP] minor readability fix: make param name consistent with overridden function (#124770)
Summary:
This diff has no logic changes. It updates the variable names to be in sync with the name used in `prepare_global_plan` in StorageWriter. Pasting the function signature for easy reference:

```python
@abc.abstractmethod
def prepare_global_plan(self, plans: List[SavePlan]) -> List[SavePlan]:
    ...
```

Differential Revision: D56480396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124770
Approved by: https://github.com/fegin
2024-04-24 05:31:26 +00:00
de7edeea25 [DCP] DCP logger (#121352)
Adds additional logging for improved observability in DCP.

Differential Revision: [D54512626](https://our.internmc.facebook.com/intern/diff/D54512626/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121352
Approved by: https://github.com/wz337, https://github.com/fegin
2024-04-05 17:50:50 +00:00
b0a0850a5c [DCP] Replaced storage() with untyped_storage() (#121538)
Let us try to remove this warning 😄 :
```
[rank0]:/data/users/andgu/pytorch/torch/distributed/checkpoint/filesystem.py:150: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
[rank0]:  if tensor.storage().size() != tensor.numel():
```
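
A sketch of the idea behind the replacement; note that `UntypedStorage.size()` is measured in bytes, while the deprecated `TypedStorage.size()` counts elements, so the comparison needs an `element_size()` factor:

```python
import torch

tensor = torch.arange(8)

# Before (emits the TypedStorage deprecation warning):
#   if tensor.storage().size() != tensor.numel(): ...

# After: untyped storage sizes are in bytes, so scale numel accordingly.
if tensor.untyped_storage().size() != tensor.numel() * tensor.element_size():
    print("tensor is a view into a larger storage")
```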

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121538
Approved by: https://github.com/wz337, https://github.com/fegin
2024-03-08 23:46:17 +00:00
54c1cf8d8a add distributed checkpoint support for custom device (#120201)
Fixes #120200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120201
Approved by: https://github.com/fegin, https://github.com/wz337
2024-02-24 19:14:29 +00:00
6d8f192fd0 [DCP] Call os.sync if os.fsync does not work for fsspec (#119287)
Some fsspec storage backends may not support fileno(). In such a case, we fall back to os.sync().

It may not even be necessary to call `os.sync()` in such a case, as the storage may be a remote storage that requires a special sync API call.
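
A minimal sketch of the fallback pattern (the helper name is illustrative, not the exact DCP code):

```python
import os

def _sync_to_disk(stream) -> None:
    # Prefer fsync on the stream's file descriptor; some fsspec file
    # objects do not expose a usable fileno(), so fall back to os.sync().
    try:
        os.fsync(stream.fileno())
    except (AttributeError, OSError):
        os.sync()
```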

Differential Revision: [D53433425](https://our.internmc.facebook.com/intern/diff/D53433425/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119287
Approved by: https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #118888
2024-02-08 17:10:38 +00:00
d947534782 [DCP] Enable filesystem/fsspec auto detection (#118888)
This API automatically detects whether to use the filesystem or fsspec backend based on the checkpoint_id.
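
A hedged sketch of the detection idea (the helper and the fsspec import path are assumptions; the real logic lives inside DCP and may differ):

```python
from torch.distributed.checkpoint import FileSystemWriter

def _pick_writer(checkpoint_id):
    # URL-style IDs (e.g. "s3://bucket/ckpt") go through fsspec;
    # plain local paths use the filesystem writer.
    if "://" in str(checkpoint_id):
        from torch.distributed.checkpoint._fsspec_filesystem import FsspecWriter
        return FsspecWriter(checkpoint_id)
    return FileSystemWriter(checkpoint_id)
```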

Differential Revision: [D53318043](https://our.internmc.facebook.com/intern/diff/D53318043/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118888
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-02-08 16:38:04 +00:00
494c2ec054 [DCP][BE] Let FsspecWriter and FsspecReader inherit from FileSystemWriter and FileSystemReader (#118887)
No logic is changed. However, this PR dramatically reduces the effort to maintain filesystem-like storage backends. As we are going to enable fsspec, this is a must-do BE item.

Differential Revision: [D53318044](https://our.internmc.facebook.com/intern/diff/D53318044/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118887
Approved by: https://github.com/wz337
2024-02-03 01:14:13 +00:00
644bc69530 [DCP] Allow users to save and load without creating storage reader and writer (#117772)
Right now the DCP API requires users to create a StorageWriter and StorageReader for every API call. This PR allows users to only pass the checkpoint_id (a path) and use it to read/write a checkpoint without creating a StorageReader and StorageWriter.
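
A usage sketch of the simplified API, assuming the `checkpoint_id` keyword on `save`/`load` (paths are illustrative):

```python
import torch.distributed.checkpoint as dcp

state_dict = {"weights": some_model.state_dict()}  # `some_model` assumed to exist

# No StorageWriter/StorageReader needed: pass a path as checkpoint_id.
dcp.save(state_dict, checkpoint_id="/tmp/my_checkpoint")
dcp.load(state_dict, checkpoint_id="/tmp/my_checkpoint")
```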

Differential Revision: [D52740556](https://our.internmc.facebook.com/intern/diff/D52740556/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117772
Approved by: https://github.com/wz337
ghstack dependencies: #116248
2024-01-26 09:08:35 +00:00
ea851eb027 Uses Serial Loader for DCP.save when more than one thread is used. (#118114)
The OverlappingCPU Loader is causing a major drop in performance when used with multiple threads. This PR is a temporary fix while we investigate why this is the case.

Benchmarks for save, using a 7.25GB FSDP model, as per the TSS benchmark. Both benchmarks run on 8 ranks with 8 threads.

- Before this PR: 9.475 s
- After this PR: 1.632 s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118114
Approved by: https://github.com/wz337, https://github.com/fegin
2024-01-25 21:11:16 +00:00
3d1869d0ae [DCP][BE] Improve the readability of filesystem and fsspec filesystem (#116246)
1. Better typing
2. Remove 1-liner function

Differential Revision: [D52357731](https://our.internmc.facebook.com/intern/diff/D52357731/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116246
Approved by: https://github.com/wz337
ghstack dependencies: #116245
2024-01-11 16:27:21 +00:00
b342286646 adds async save, makes checkpointer private (#116293)
Adds Async Save and also makes `Checkpointer` classes private.

The original PR was here: https://github.com/pytorch/pytorch/pull/115864
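
A usage sketch, assuming the public `async_save` entry point this work builds toward:

```python
import torch.distributed.checkpoint as dcp

# async_save returns a Future; training can continue while the
# checkpoint is staged and written in the background.
future = dcp.async_save(state_dict, checkpoint_id="/tmp/ckpt")
# ... run more training steps ...
future.result()  # block until the checkpoint is complete
```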

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116293
Approved by: https://github.com/fegin
2023-12-22 05:22:39 +00:00
a548ff40de [DCP][BE] Remove unused function (#116006)
As title

Differential Revision: [D52245433](https://our.internmc.facebook.com/intern/diff/D52245433/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116006
Approved by: https://github.com/wz337
2023-12-21 07:20:08 +00:00
db8d409d08 [DCP][BE] Apply ufmt to DCP and turn on lintrunner for DCP (#115302)
No logic change. Just typing and ufmt.

Differential Revision: [D51914982](https://our.internmc.facebook.com/intern/diff/D51914982/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115302
Approved by: https://github.com/XilunWu, https://github.com/wz337, https://github.com/LucasLLC
ghstack dependencies: #115523
2023-12-13 10:32:36 +00:00
5432088098 Adds Checkpointer Wrapper for DCP [3/N] (#114603)
Adds a useful high-level wrapper for calling `DCP.save/load` with the correct storage readers and writers.

Instead of doing:

```
DCP.save(
    state_dict={...},
    storage_writer=StorageWriter(...)
)

DCP.load(
    state_dict={...},
    storage_reader=StorageReader(...)
)
```

We can now do:

```
checkpointer = Checkpointer(...)

checkpointer.save(state_dict={...})
checkpointer.load(state_dict={...})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114603
Approved by: https://github.com/fegin, https://github.com/wz337
2023-12-08 01:03:21 +00:00
b7b2178204 [BE]: Remove useless lambdas (#113602)
Applies PLW0108, which removes useless lambdas in Python. The rule is in preview, so it is not ready to be enabled by default just yet. These are the autofixes from the rule.
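
An example of the kind of rewrite PLW0108 performs:

```python
values = ["10", "2", "33"]

# Before: the lambda only forwards its argument
as_ints = sorted(values, key=lambda v: int(v))

# After: pass the callable directly
as_ints = sorted(values, key=int)
```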

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113602
Approved by: https://github.com/albanD
2023-11-14 20:06:48 +00:00
44c0521e8c fix: docstring error in torch/distributed module (#113241)
Fixes: #113193

`pydocstyle <all_files_in_issue> --count`

- Before: 345
- After: 130

For deprecated methods, I have added a `noqa` to ignore them. I was not able to find the file `torch/distributed/tensor/parallel/multihead_attention_tp.py`, so I've ignored it for this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113241
Approved by: https://github.com/kit1980
2023-11-09 19:10:20 +00:00
bb45f89cd9 Hackable distributed filesystem reader and writer (#106635)
I propose some changes so that the `FileSystemReader` and `FileSystemWriter` can be used on other file systems. The user only needs to provide `path` as a subclass of `Path` that overrides the necessary interfaces.

For example, one can utilize `tf.io.gfile` to implement an interface to save to or load from HDFS. The following code snippet shows a working implementation.

```python
from pathlib import Path
import tensorflow as tf

class GFileWrapper(tf.io.gfile.GFile):
    def __init__(self, path, mode="r") -> None:
        super().__init__(path, mode)

    def write(self, data):
        return super().write(bytes(data))

    # a not quite efficient readinto, but it works
    def readinto(self, buffer):
        # read up to buffer's length
        data = self.read(len(buffer))
        length = len(data)
        buffer[:length] = data
        return length

class HdfsPath(type(Path())):
    def __new__(cls, *pathsegments):
        return super().__new__(cls, *pathsegments)

    @staticmethod
    def _fix_path(path):
        path = str(path)
        if path.startswith("hdfs:/") and not path.startswith("hdfs://"):
            path = path.replace("hdfs:/", "hdfs://")
        return path

    def open(self, mode="r", *args, **kwargs):
        return GFileWrapper(HdfsPath._fix_path(self), mode=mode)

    def mkdir(self, **kwargs) -> None:
        return tf.io.gfile.makedirs(HdfsPath._fix_path(self))

    def rename(self, target):
        return tf.io.gfile.rename(HdfsPath._fix_path(self), HdfsPath._fix_path(target))
```

```python
writer = FileSystemWriter(HdfsPath("hdfs://..."), sync_files=False)
reader = FileSystemReader(HdfsPath("hdfs://..."))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106635
Approved by: https://github.com/fduwjj
2023-10-31 19:36:18 +00:00
ff37f6018d Enable custom device support in fsdp checkpoint (#107289)
Fixes https://github.com/pytorch/pytorch/issues/104390
Enables custom device (privateuse1 backend) support in checkpointing via a dynamic abstract device module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107289
Approved by: https://github.com/wz337
2023-08-25 11:50:03 +00:00
4833dc10b8 [DCP] Rewrite read slicing to use a wrapper. (#99167)
Moved SlicedBufferedReader to utils and renamed to _ReaderView.

It no longer depends on file handles and is a pure wrapper. This makes it general enough to handle non-file IO stream objects like fsspec's.

Should help with #98386
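
A minimal sketch of the wrapper idea (illustrative, not the exact `_ReaderView` code):

```python
import io

class ReaderView(io.IOBase):
    """Restrict reads to a window of an underlying seekable stream.

    Works whether or not the stream is file-backed (e.g. fsspec streams).
    """

    def __init__(self, base_stream, offset: int, length: int):
        self.base_stream = base_stream
        self.end = offset + length
        base_stream.seek(offset)

    def read(self, size: int = -1) -> bytes:
        remaining = self.end - self.base_stream.tell()
        if size < 0 or size > remaining:
            size = remaining
        return self.base_stream.read(size)
```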
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99167
Approved by: https://github.com/wz337
2023-06-08 13:52:13 +00:00
8769fb854d [BE] Fix flake8 B027 errors - missing abstractmethod decorator (#100715)
Enables B027 and applies fixes by adding abstract method decorators. Autofix generated by ruff master.
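
An example of what B027 flags and how the autofix resolves it:

```python
import abc

class StorageBase(abc.ABC):
    def prepare(self) -> None:  # B027: empty method in an ABC, no decorator
        ...

class StorageBaseFixed(abc.ABC):
    @abc.abstractmethod
    def prepare(self) -> None:
        ...
```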

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100715
Approved by: https://github.com/ezyang
2023-05-09 17:28:48 +00:00
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under torch/distributed directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
bebe58bd71 [DCP] Set single_file_per_rank default to True (#94501)
The default behavior of FileSystemWriter should produce one file per rank instead of one file per tensor/blob.
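
A usage sketch of the knob (the path is illustrative):

```python
from torch.distributed.checkpoint import FileSystemWriter

# New default: each rank writes a single file.
writer = FileSystemWriter("/tmp/ckpt")

# The old behavior (one file per tensor/blob) is still available:
writer = FileSystemWriter("/tmp/ckpt", single_file_per_rank=False)
```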
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94501
Approved by: https://github.com/fegin
2023-02-09 21:45:31 +00:00
748bac8757 [BE]: Apply pyupgrade yield from and unit test alias upgrades (#94309)
Applies some more harmless pyupgrades. This one gets rid of deprecated aliases in unit tests and also upgrades yield-in-for loops into `yield from`, which is more performant and propagates more information/exceptions from the original generator. This is the modern recommended way of forwarding generators.
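
An example of the `yield from` rewrite:

```python
def forward_items_before(items):
    for item in items:  # before: re-yield each item in a loop
        yield item

def forward_items_after(items):
    # after: faster, and forwards send()/throw()/close() to the inner iterable
    yield from items
```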
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94309
Approved by: https://github.com/albanD
2023-02-07 20:08:58 +00:00
dd05f028e2 [PT-D][Checkpoint] Rename DCP storage layer init() (#92869)
Rename DCP storage layer init() and update tests accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92869
Approved by: https://github.com/kumpera
2023-01-25 23:52:45 +00:00
eee2869ea7 [PT-D][checkpoint] Resolve no such file or directory issue when checkpointing on multi hosts (#92553)
Previously, we only created the directory on rank 0. Therefore, when running on multiple hosts with multiple GPUs, we would run into "No such file or directory" errors.

This is the fix for it.
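
A hedged sketch of the general pattern (not necessarily the exact fix in this PR): every host must ensure the directory exists on its own filesystem before any rank writes.

```python
import os
import torch.distributed as dist

def ensure_checkpoint_dir(path: str) -> None:
    # A mkdir on rank 0 only creates the directory on rank 0's host;
    # on multi-host runs, each rank creates it locally (exist_ok avoids
    # races between ranks sharing a host).
    os.makedirs(path, exist_ok=True)
    dist.barrier()  # no rank proceeds until the directory exists everywhere
```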
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92553
Approved by: https://github.com/kumpera
2023-01-20 21:54:04 +00:00
0cc0e5ef65 [PT-D][Checkpoint]Add MultiThreaded FileSystemWriter for distributed checkpointing and Update tests (#87987)
This PR includes:

- Changes from @kumpera (https://github.com/pytorch/pytorch/pull/86327): adds a MultiThreaded FileSystemWriter for distributed checkpointing, introducing two knobs on FileSystemWriter: thread_count and per_thread_copy_ahead (see the sketch after this list). This yields up to a 50% performance improvement on 32-GPU workloads on AWS.
- Adds parametrized tests to /test/distributed/_shard/checkpoint/test_file_system_checkpoint.py and /test/distributed/_shard/checkpoint/test_file_system_checkpoint_cpu.py.
- Modifies @with_comms in ShardedTensorTestBase to take in *args and **kwargs.
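
A usage sketch of the two knobs (values are illustrative):

```python
from torch.distributed.checkpoint import FileSystemWriter

writer = FileSystemWriter(
    "/tmp/ckpt",
    thread_count=4,                    # parallel writer threads per rank
    per_thread_copy_ahead=10_000_000,  # bytes each thread stages ahead of writing
)
```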
Tests:

```
python3 test/distributed/checkpoint/test_file_system_checkpoint_cpu.py
```

test/distributed/checkpoint/test_file_system_checkpoint.py (GPU tests) runs fine locally but would time out on CI. We will use a thread-based PG and update this test in a following PR.

[T134844615]

Docstrings and comment updates will follow in subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87987
Approved by: https://github.com/fduwjj
2022-11-30 08:19:41 +00:00
cefece3726 Fix typo in filesystem.py (#89849)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89849
Approved by: https://github.com/H-Huang
2022-11-30 01:06:58 +00:00
aee96bbf5a [PT-D][Checkpointing] Move distributed checkpointing from torch.distributed._shard.checkpoint to torch.distributed.checkpoint (#88698)
Context in RFC: https://github.com/pytorch/pytorch/issues/86620

.rst file will be finalized in subsequent PRs.
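
For reference, the import change this move implies:

```python
# New import path after the move:
from torch.distributed.checkpoint import FileSystemReader, FileSystemWriter

# Old (pre-move) path, for comparison:
# from torch.distributed._shard.checkpoint import FileSystemReader, FileSystemWriter
```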
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88698
Approved by: https://github.com/wanchaol
2022-11-16 21:06:38 +00:00