39 Commits

Author SHA1 Message Date
39116409a1 [torch/utils][Code Clean] Clean asserts in benchmark/ and data/ in torch/utils/ (#165299)
Including:
- `torch/utils/benchmark/`
- `torch/utils/data/`

Fixes part of #164878
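
A hedged sketch of the kind of cleanup this refers to, assuming the work replaces bare `assert`s with explicit exceptions (the function and message below are illustrative, not taken from the diff):

```
def set_num_workers(n: int) -> int:
    # Before: assert n >= 0, "num_workers should be non-negative"
    # After: an explicit check that still runs under `python -O`.
    if n < 0:
        raise ValueError(f"num_workers should be non-negative, got {n}")
    return n
```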

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165299
Approved by: https://github.com/albanD
2025-10-14 04:50:39 +00:00
086dec3235 Pyrefly suppressions 6/n (#164877)
Adds suppressions so that pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Almost there!

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:

INFO 0 errors (5,064 ignored)

Only four directories left to enable
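
A hedged sketch of what step 3 produces, assuming pyrefly reads inline suppression comments of the form `# pyrefly: ignore` (the exact comment syntax is an assumption, not stated in this commit message):

```
# Hypothetical example of a suppressed error; the suppression comment syntax
# is an assumption, and the function is made up for illustration.
def parse_port(raw: str) -> int:
    return int(raw or None)  # pyrefly: ignore
```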

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164877
Approved by: https://github.com/oulgen
2025-10-08 02:30:57 +00:00
2f9d378f7b PEP585 update - torch/utils (#145201)
See #145101 for details.
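
For context, the flavor of change PEP 585 enables (an illustrative snippet, not from the diff): builtin generics replace their `typing` counterparts.

```
from collections.abc import Iterator

def chunks(xs: list[int], size: int) -> Iterator[list[int]]:  # was List[int] / Iterator[List[int]]
    for i in range(0, len(xs), size):
        yield xs[i:i + size]
```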

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145201
Approved by: https://github.com/bobrenjc93
2025-01-21 21:04:10 +00:00
f1df13f023 [BE][Easy] Fix PYI001: unprefixed-type-param in torch/utils/data/datapipes (#129885)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129885
Approved by: https://github.com/ezyang
2024-07-02 14:56:27 +00:00
7cf0b90e49 [BE] enable UFMT in torch.utils.data (#127705)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127705
Approved by: https://github.com/ezyang
ghstack dependencies: #127706, #127704
2024-06-27 23:16:24 +00:00
f911957573 [BE] sort imports in torch.utils.data (#127704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127704
Approved by: https://github.com/ezyang
ghstack dependencies: #127706
2024-06-27 23:16:24 +00:00
92e7f79609 Doc: Add and Fix docstrings for torch.util.data files (#112817)
Fixes #112635

Fix docstrings for `torch.utils.data` files.

```
> pydocstyle torch/utils/data/graph.py --count
Before: 5
After: 1

> pydocstyle torch/utils/data/graph_settings.py --count
Before: 8
After: 3

> pydocstyle torch/utils/data/dataloader.py --count
Before: 12
After: 6

> pydocstyle torch/utils/data/dataset.py --count
Before: 28
After: 23

> pydocstyle torch/utils/data/sampler.py --count
Before: 24
After: 19

> pydocstyle torch/utils/data/_utils/signal_handling.py --count
Before: 1
After: 0

> pydocstyle torch/utils/data/_utils/__init__.py --count
Before: 2
After: 0

> pydocstyle torch/utils/data/_utils/collate.py --count
Before: 20
After: 6

> pydocstyle torch/utils/data/_utils/fetch.py --count
Before: 3
After: 0

> pydocstyle torch/utils/data/_utils/pin_memory.py --count
Before: 4
After: 1

> pydocstyle torch/utils/data/datapipes/_decorator.py --count
Before: 19
After: 16

> pydocstyle torch/utils/data/datapipes/_hook_iterator.py --count
Before: 13
After: 0

> pydocstyle torch/utils/data/datapipes/_typing.py --count
Before: 17
After: 4

> pydocstyle torch/utils/data/datapipes/gen_pyi.py --count
Before: 19
After: 4
```
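
For context, a hedged sketch of the docstring shape pydocstyle pushes toward (the function and wording are made up; D205 and D400 are two of the checks involved):

```
def fetch(index: int) -> int:
    """Fetch the item at ``index``.

    The summary line ends with a period (D400) and is separated from the
    body by a blank line (D205), which is the shape pydocstyle checks for.
    """
    return index
```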

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112817
Approved by: https://github.com/kit1980
2023-11-07 17:59:56 +00:00
4cc1745b13 [BE] f-stringify torch/ and scripts (#105538)
This PR is a follow-up to the pyupgrade series, converting more strings to f-strings using `flynt`.

- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/

Command used:

```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```

and excluded `collect_env.py`
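
An illustrative example (not from the diff) of the conversion flynt performs:

```
name, workers = "train_loader", 4
old = "{} uses {} workers".format(name, workers)  # before
new = f"{name} uses {workers} workers"            # after flynt
assert old == new
```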

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-21 19:35:24 +00:00
3e18d3958b [DataLoader] Follow-up Fix: TypeVars of Sampler (#100409)
API backward compatibility fixed:
https://github.com/pytorch/pytorch/pull/97338#discussion_r1169164163

Map-style `Dataset` can accept non-integer indices from custom Samplers.

Fixes #97338

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100409
Approved by: https://github.com/ejguan, https://github.com/NivekT
2023-05-03 17:38:31 +00:00
867b07b424 Sampler API described for customization. (#97338)
Added an explanation, with examples, of how to customize samplers (a minimal sketch is included below).

* fixed the `Sampler` TypeVar
* removed unused `__init__` from the `Sampler` class
* added examples for custom sampler and batch sampler
* fixed `DistributedSampler` typing
* fixed `_InfiniteConstantSampler`
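
A minimal sketch, not the PR's added docs, of the kind of custom sampler the new documentation describes: a `Sampler` subclass only needs `__iter__` (and usually `__len__`). The class and its behaviour are hypothetical.

```
from collections.abc import Iterator, Sized
import torch
from torch.utils.data import Sampler

class RandomSubsetSampler(Sampler[int]):
    """Hypothetical custom sampler: yields a random subset of k indices each epoch."""

    def __init__(self, data_source: Sized, k: int) -> None:
        self.data_source = data_source
        self.k = k

    def __iter__(self) -> Iterator[int]:
        yield from torch.randperm(len(self.data_source))[: self.k].tolist()

    def __len__(self) -> int:
        return self.k
```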

Fixes #92268

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97338
Approved by: https://github.com/NivekT
2023-03-28 06:40:38 +00:00
4618371da5 Integrate xdoctest - Rebased (#82797)
This is a new version of #15648 based on the latest master branch.

Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.

In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)
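
For reference, how a failing example gets marked with the directive mentioned above (the function and the call inside the doctest are made-up placeholders):

```
def double(x):
    """Double a number.

    Example:
        >>> # xdoctest: +SKIP
        >>> double(load_current_batch_size())  # depends on external state, so skipped
    """
    return 2 * x
```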

Fixes https://github.com/pytorch/pytorch/issues/71105

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
2022-08-12 02:08:01 +00:00
0289ab2cec Fix data-related public API (#368)
Summary:
X-link: https://github.com/pytorch/data/pull/368

This PR aims to expose the right data-related API.

There are two more changes in this PR that convert public APIs to private ones:
`check_lambda_fn` -> `_check_lambda_fn`
`deprecation_warning` -> `_deprecation_warning`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76143

Reviewed By: albanD, NivekT

Differential Revision: D35798311

Pulled By: ejguan

fbshipit-source-id: b13fded5c88a533c706702fb2070c918c839dca4
(cherry picked from commit 0b534b829a2e90e1e533951c6d334fdeaa9358b9)
2022-04-21 17:27:05 -07:00
d74bb42f7a Add a missing precondition to DistributedSampler docstring (#70104)
Summary:
Distributed sampler sets different indices for different processes. By doing this, it assumes that the data is the same across the board and in the same order. This may seem trivial; however, there are times when users don't control the order of their items, because they rely on something such as the order in which the filesystem lists a directory (which is not guaranteed and may vary between computers), or the order in which a `set` is iterated.

I think it's better to make it clearer.
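
A small sketch of the kind of code the documented precondition is aimed at (the helper is hypothetical):

```
import os

def build_file_list(root: str) -> list[str]:
    # os.listdir order is not guaranteed and can differ across machines, so
    # sort explicitly to make every rank see the dataset in the same order.
    return sorted(os.path.join(root, name) for name in os.listdir(root))
```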

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70104

Reviewed By: bdhirsh

Differential Revision: D33569539

Pulled By: rohan-varma

fbshipit-source-id: 68ff028cb360cadaee8c441256c1b027a57c7089
2022-01-14 13:55:12 -08:00
0721fc6474 Decouple MapDataPipe from Dataset (#70991)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/70991

Test Plan: Imported from OSS

Reviewed By: dagitses

Differential Revision: D33477680

Pulled By: ejguan

fbshipit-source-id: d3e89492e921a96791319f35052a229684ddf7cf
2022-01-07 14:28:41 -08:00
d5988c5eca remove unused type: ignore directives (#60006)
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but that `mypy` doesn't recognize as such. This often stems from the fact that the `mypy` version in use wasn't able to handle the pattern.

With every new release `mypy` gets better at handling complex code. In addition to fixing all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed. Fortunately, we don't need to do this manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out whenever it encounters a `type: ignore` that is no longer needed.
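
A hypothetical illustration of what that setting catches (the function is made up): once mypy can type-check the line on its own, the stale suppression itself is reported as an unused `type: ignore`.

```
from typing import Optional

def first_char(s: Optional[str]) -> str:
    if s is None:
        return ""
    return s[0]  # type: ignore[index]  # stale: narrowing already handles this, so it gets flagged
```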

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006

Reviewed By: jbschlosser, malfet

Differential Revision: D29133237

Pulled By: albanD

fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
2021-06-18 07:23:31 -07:00
a5dcd3c4b7 Revert D28240105: [pytorch][PR] Fix DistributedSampler mem usage on large datasets
Test Plan: revert-hammer

Differential Revision:
D28240105 (a0ce8da26e)

Original commit changeset: 4c6aa493d0f7

fbshipit-source-id: 8a0e17764c2f26c8316f88ad6c8772b08883ceee
2021-06-01 14:44:23 -07:00
a0ce8da26e Fix DistributedSampler mem usage on large datasets (#51841)
Summary:
The current implementation of DistributedSampler generates a Python list to hold all of the indices, and then returns a slice of this list for the given rank (creating a partial copy of the list). When the underlying dataset is large, both of these choices waste a large amount of memory. It is much more efficient to create a tensor to hold the indices, and then index into that tensor instead of creating slices.

In the case of a sampler with `shuffle=False`, it would be possible to avoid creating the `indices` tensor entirely (since the index will always match the value), but I have opted instead here to keep the implementation as similar to the existing version as possible. One possible benefit of this approach is that memory usage will not significantly change based on changing this parameter. Still, it might be better to simply return the indices directly without the underlying array.

Additionally, the logic around calculating the number of samples is unnecessarily complex. When dropping the last batch, this can be a simple floor division.

In a simple test script which creates a sampler for a dataset with 100,000,000 items, memory usage is reduced by 98% compared to the existing implementation.
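
A minimal sketch of the approach described above (not the PR's code; padding and `drop_last` handling are omitted and the helper name is illustrative):

```
import torch

def rank_indices(dataset_len: int, num_replicas: int, rank: int,
                 shuffle: bool = True, seed: int = 0) -> torch.Tensor:
    # Build the indices as a tensor instead of a Python list, and take every
    # num_replicas-th entry for this rank instead of copying out a slice.
    if shuffle:
        g = torch.Generator()
        g.manual_seed(seed)
        indices = torch.randperm(dataset_len, generator=g)
    else:
        indices = torch.arange(dataset_len)
    return indices[rank::num_replicas]
```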

Fixes https://github.com/pytorch/pytorch/issues/45427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51841

Reviewed By: albanD

Differential Revision: D28240105

Pulled By: rohan-varma

fbshipit-source-id: 4c6aa493d0f75c07ec14c98791b3a531300fb1db
2021-06-01 14:15:14 -07:00
75024e228c Add lint for unqualified type: ignore (#56290)
Summary:
The other half of https://github.com/pytorch/pytorch/issues/56272.
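
For reference, an illustrative example of what the lint asks for: ignores must name an error code rather than being a bare `# type: ignore`.

```
# Qualified suppression that passes the lint (the assignment is deliberately wrong):
x: int = "not an int"  # type: ignore[assignment]
```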

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290

Test Plan:
CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed:

- https://github.com/pytorch/pytorch/runs/2384511062
- https://github.com/pytorch/pytorch/actions/runs/765036024

Reviewed By: seemethere

Differential Revision: D27867219

Pulled By: samestep

fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235
2021-04-21 08:07:23 -07:00
e6779d4357 [*.py] Rename "Arguments:" to "Args:" (#49736)
Summary:
I've written custom parsers and emitters for everything from docstrings to classes and functions. However, I recently came across an issue when I was parsing/generating from the TensorFlow codebase: inconsistent use of `Args:` and `Arguments:` in its docstrings.

```sh
(pytorch#c348fae)$ for name in 'Args:' 'Arguments:'; do
    printf '%-10s %04d\n' "$name" "$(rg -IFtpy --count-matches "$name" | paste -s -d+ -- | bc)"; done
Args:      1095
Arguments: 0336
```

It is easy enough to extend my parsers to support both variants; however, it looks like `Arguments:` is wrong anyway, as per:

  - https://google.github.io/styleguide/pyguide.html#doc-function-args @ [`ddccc0f`](https://github.com/google/styleguide/blob/ddccc0f/pyguide.md)

  - https://chromium.googlesource.com/chromiumos/docs/+/master/styleguide/python.md#describing-arguments-in-docstrings @ [`9fc0fc0`](https://chromium.googlesource.com/chromiumos/docs/+/9fc0fc0/styleguide/python.md)

  - https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html @ [`c0ae8e3`](https://github.com/sphinx-contrib/napoleon/blob/c0ae8e3/docs/source/example_google.rst)

Therefore, only `Args:` is valid. This PR replaces them throughout the codebase.
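
A small Google-style docstring using the accepted `Args:` section (the function itself is hypothetical):

```
def resize(image: bytes, width: int, height: int) -> bytes:
    """Resize an image to the given dimensions.

    Args:
        image: Raw image bytes.
        width: Target width in pixels.
        height: Target height in pixels.

    Returns:
        The resized image as raw bytes.
    """
    return image
```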

PS: For related PRs, see tensorflow/tensorflow/pull/45420

PPS: The trackbacks automatically appearing below are sending the same changes to other repositories in the [PyTorch](https://github.com/pytorch) organisation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49736

Reviewed By: albanD

Differential Revision: D25710534

Pulled By: soumith

fbshipit-source-id: 61e8ff01abb433e9f78185c2d1d0cbd7c22c1619
2020-12-28 09:34:47 -08:00
21c38e1799 Additional validation for DistributedSampler. (#48865)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48865

If DistributedSampler was provided an invalid rank (ex:
https://discuss.pytorch.org/t/distributed-datasets-on-multi-machines/105113),
it failed with a cryptic assertion failure.

To fix this issue, I've added an additional check to DistributedSampler to
validate that the provided rank is within range.
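
A sketch of the kind of validation added (the helper is hypothetical, not the PR's exact code):

```
def _validate_rank(rank: int, num_replicas: int) -> None:
    # Reject out-of-range ranks with a clear error instead of a cryptic assertion.
    if rank < 0 or rank >= num_replicas:
        raise ValueError(
            f"Invalid rank {rank}, rank should be in the interval [0, {num_replicas - 1}]"
        )
```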
ghstack-source-id: 117906769

Test Plan:
1) waitforbuildbot
2) Unit test added.

Reviewed By: malfet

Differential Revision: D25344945

fbshipit-source-id: 7685e00c8b2c200efbd2949fb32ee32ea7232a08
2020-12-11 17:22:22 -08:00
0387f2a6fa Fix default value of num_replicas in DistributedSampler docstring (#48135)
Summary:
Change default value of `num_replicas` from `rank` to `world_size` in DistributedSampler docstring.

Addresses https://github.com/pytorch/pytorch/issues/48055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/48135

Reviewed By: gchanan

Differential Revision: D25045328

Pulled By: rohan-varma

fbshipit-source-id: 6f84f7bb69087d8dae931cda51891b3cb1894306
2020-11-18 11:18:40 -08:00
a69910868a Fix possible padding length overflow in DistributedSampler (#45329)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/45324

This fix handles the case `len(dataset) * 2 < num_replicas` in DistributedSampler, which the previous code raised an error for.
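
A minimal sketch, assuming the fix repeats the index list as many times as needed (rather than at most once) when padding; names are illustrative, not the PR's code:

```
import math

def pad_indices(indices: list[int], num_replicas: int) -> list[int]:
    # Pad so len(indices) becomes a multiple of num_replicas, even when the
    # dataset is much smaller than num_replicas (e.g. len(dataset) * 2 < num_replicas).
    total_size = math.ceil(len(indices) / num_replicas) * num_replicas
    padding = total_size - len(indices)
    if padding > 0:
        repeats = math.ceil(padding / len(indices))
        indices = indices + (indices * repeats)[:padding]
    return indices
```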

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45329

Reviewed By: mruberry

Differential Revision: D24205035

Pulled By: rohan-varma

fbshipit-source-id: f94329d9c1e7deaee41e5af319e7c7d0c741910c
2020-10-14 17:19:44 -07:00
eb39542e67 Add typing annotations for torch.utils.data.* modules (#44136)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/44135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44136

Reviewed By: gchanan

Differential Revision: D23963273

Pulled By: ezyang

fbshipit-source-id: 939234dddbe89949bd8e5ff05d06f6c8add6935c
2020-09-29 18:12:05 -07:00
da32bf4cc6 Move type annotations for remaining torch.utils stub files inline (#43406)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43406

Reviewed By: mruberry

Differential Revision: D23319736

Pulled By: malfet

fbshipit-source-id: e25fbb49f27aa4893590b022441303d6d98263a9
2020-08-31 18:44:09 -07:00
5ed7cd0025 Allow drop_last option in DistributedSampler (#41171)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41171

DistributedSampler allows data to be split evenly across workers in
DDP, but it has always added additional samples so that the data can be
evenly split when the number of samples is not divisible by the
number of workers. This can cause issues, for example when computing distributed
validation accuracy, where some samples could be counted twice.

This PR adds a drop_last option where the tail of the data is dropped such that
the effective dataset size is still evenly divisible across the workers. This
ensures that DDP can train fine (there are no uneven inputs) and each replica
gets an equal number of data indices.
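
A small usage sketch (toy dataset, values chosen for illustration): 10 samples over 3 replicas is not an even split, so with `drop_last=True` the trailing sample is dropped and every rank gets 3.

```
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))
sampler = DistributedSampler(dataset, num_replicas=3, rank=0,
                             shuffle=False, drop_last=True)
loader = DataLoader(dataset, sampler=sampler, batch_size=1)
print(len(sampler))  # 3
```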
ghstack-source-id: 108617516

Test Plan: Added unittest

Reviewed By: mrshenli

Differential Revision: D22449974

fbshipit-source-id: e3156b751f5262cc66437b9191818b78aee8ddea
2020-07-28 11:33:08 -07:00
d753f1c2e1 Fixes formatting of vander, count_nonzero, DistributedSampler documentation (#41025)
Summary:
Bundle of small edits to fix formatting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/41025

Differential Revision: D22398364

Pulled By: mruberry

fbshipit-source-id: 8d484cb52a1cf4a8eb1f64914574250c9fd5043d
2020-07-06 14:26:13 -07:00
479b04e26a Improve DistributedSampler docs and add seed option (#39628)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/39628

Differential Revision: D21920373

Pulled By: mrshenli

fbshipit-source-id: d7d1005db6feef4a83a1a094b85fcff964bd0ac6
2020-06-06 14:24:22 -07:00
deb4100928 [DistributedSampler] Only create torch.generator and seed when shuffling (#37604)
Summary:
We don't need to create `torch.Generator()` and seed it if we are not shuffling.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37604

Differential Revision: D21346167

Pulled By: rohan-varma

fbshipit-source-id: 6ed560d236bc5c026a7d321755ddc02a29db1604
2020-05-01 10:56:40 -07:00
6305e4a88f Add warning and example for seeding to DistributedSampler (#32951)
Summary:
Closes gh-31771

Also note that the `epoch` attribute is *only* used as a manual seed in each iteration (so it could easily be changed/renamed). Seeding consecutive iterations with `[0, 1, 2, ...]` is low-entropy; however, in practice it probably doesn't matter when using the sampler in combination with a dataloader (because there won't be enough data or epochs to run into statistical issues due to low-entropy seeding). So leaving that as is.
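
The pattern the new warning and example point at, as a small self-contained sketch (toy dataset; sizes are illustrative):

```
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True)
loader = DataLoader(dataset, sampler=sampler, batch_size=2)

for epoch in range(3):
    sampler.set_epoch(epoch)  # re-seeds the shuffle; without this every epoch repeats the same order
    for (batch,) in loader:
        pass
```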

Rendered docstring:

<img width="534" alt="image" src="https://user-images.githubusercontent.com/98330/73701250-35134100-46e9-11ea-97b8-3baeb60fcb37.png">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32951

Differential Revision: D19729333

Pulled By: ezyang

fbshipit-source-id: 3ddf90a3828b8bbae88aa2195a5d0b7d8ee1b066
2020-02-04 14:36:59 -08:00
7730346853 Make shuffling optional in DistributedSampler (#22479)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22479

In some cases, for example when training on CTR data, we would like to start training from old samples and finish on recent ones.

This diff adds the option to disable shuffling in DistributedSampler to accommodate this use case.

Reviewed By: soumith

Differential Revision: D16100388

fbshipit-source-id: 35566581f5250040b2db5ec408a63037b47a9f5d
2019-07-05 18:56:28 -07:00
Jie
a3fb004b18 (#12474)
Summary:
Modifies the DistributedSampler logic. Now each process samples elements at a given interval, instead of a consecutive section.

This eliminates the possibility that the DataLoader uses padded data while dropping real data. This happens when:
  1. DistributedSampler padded the data; and
  2. the DataLoader's drop_last is effectively true and drops fewer items than the number of padded ones.

In the example below, data (10, 11, 12) are padded by duplicating data samples (1, 2, 3). The old sampler drops legitimate original data (3, 6, 9) and introduces duplicates (10, 11) into the training set, while the new sampler logic samples the correct data points from the dataset. This example has been added to the DataLoader unit tests.

example:
```
  data after shuffle: 1, 2, 3, 4, 5, 6, 7, 8, 9
  padded data : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12

  old sampler:       ->  DataLoader with (batch_size=2 and drop_last=True)
   p 1: 1, 2, 3          1, 2
   p 2: 4, 5, 6          4, 5
   p 3: 7, 8, 9          7, 8
   p 4:10,11,12         10,11

  new sampler:       ->
   p 1: 1, 5, 9          1, 5
   p 2: 2, 6,10          2, 6
   p 3: 3, 7,11          3, 7
   p 4: 4, 8,12          4, 8
```
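
A small illustration (not the PR's code) of the two partitioning schemes from the example above, using the same 12 padded indices and 4 processes:

```
indices = list(range(1, 13))
num_replicas = 4

consecutive = [indices[r * 3:(r + 1) * 3] for r in range(num_replicas)]  # old: contiguous chunks
interleaved = [indices[r::num_replicas] for r in range(num_replicas)]    # new: every num_replicas-th item

print(consecutive)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
print(interleaved)  # [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
```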
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12474

Differential Revision: D10260410

Pulled By: SsnL

fbshipit-source-id: 710856571260f42ce25955b81a5b8008e04938cf
2018-10-09 11:23:50 -07:00
0988bbad2d C10d release to torch.distributed for PT1 (#11405)
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`

Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405

Reviewed By: pietern

Differential Revision: D9733733

Pulled By: teng-li

fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
2018-09-10 23:27:22 -07:00
18d2fcde7a Fix performance of DistributedSampler per #8958
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/10361

Differential Revision: D9240798

Pulled By: ezyang

fbshipit-source-id: dc4cfe79612f711bbcff34a147877df6a5f7b89f
2018-08-09 12:54:37 -07:00
1b0ad8678b import *Sampler to utils.data (Better fix than #6982) (#7007) 2018-04-27 10:18:29 +02:00
3a8feb7fb7 Address integer division to make it compatible with py2 2017-08-15 21:12:21 -04:00
8915e2710c Refactor scatter/gather and add distributed docs 2017-07-12 14:47:36 -04:00
9c53c6dcb9 Fix errors and warnings when building docs (#1806) 2017-06-14 13:50:14 -04:00
d9d50f80c7 Rename arguments to distributed collectives 2017-06-12 22:02:11 -04:00
12813b88f6 Add DistributedDataParallel 2017-06-12 22:00:22 -04:00