This is a new version of #15648 based on the latest master branch.
Unlike the previous PR where I fixed a lot of the doctests in addition to integrating xdoctest, I'm going to reduce the scope here. I'm simply going to integrate xdoctest, and then I'm going to mark all of the failing tests as "SKIP". This will let xdoctest run on the dashboards, provide some value, and still let the dashboards pass. I'll leave fixing the doctests themselves to another PR.
In my initial commit, I do the bare minimum to get something running with failing dashboards. The few tests that I marked as skip are causing segfaults. Running xdoctest results in 293 failed, 201 passed tests. The next commits will be to disable those tests. (unfortunately I don't have a tool that will insert the `#xdoctest: +SKIP` directive over every failing test, so I'm going to do this mostly manually.)
Fixes https://github.com/pytorch/pytorch/issues/71105
@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82797
Approved by: https://github.com/ezyang
Summary:
X-link: https://github.com/pytorch/data/pull/368
This is PR aims to expose the right data-relate API.
There are two more changes made in this PR to convert public api to private api
`check_lambda_fn` -> `_check_lambda_fn`
`deprecation_warning` -> `_deprecation_warning`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76143
Reviewed By: albanD, NivekT
Differential Revision: D35798311
Pulled By: ejguan
fbshipit-source-id: b13fded5c88a533c706702fb2070c918c839dca4
(cherry picked from commit 0b534b829a2e90e1e533951c6d334fdeaa9358b9)
Summary:
Distributed sampler sets different indices for different processes. By doing this, it assumes that the data is the same across the board and in the same order. This may seem trivial, however, there are times that users don't guarantee the order items are gonna have, because they rely on something such as the order the filesystem lists a directory (which is not guaranteed and may vary on different computers), or the order a `set` is iterated.
I think it's better to make it clearer.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70104
Reviewed By: bdhirsh
Differential Revision: D33569539
Pulled By: rohan-varma
fbshipit-source-id: 68ff028cb360cadaee8c441256c1b027a57c7089
Summary:
During development it is common practice to put `type: ignore` comments on lines that are correct, but `mypy` doesn't recognize this. This often stems from the fact, that the used `mypy` version wasn't able to handle the used pattern.
With every new release `mypy` gets better at handling complex code. In addition to fix all the previously accepted but now failing patterns, we should also revisit all `type: ignore` comments to see if they are still needed or not. Fortunately, we don't need to do it manually: by adding `warn_unused_ignores = True` to the configuration, `mypy` will error out in case it encounters an `type: ignore` that is no longer needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60006
Reviewed By: jbschlosser, malfet
Differential Revision: D29133237
Pulled By: albanD
fbshipit-source-id: 41e82edc5cd5affa7ccedad044b59b94dad4425a
Summary:
The current implementation of DistributedSampler generates a python list to hold all of the indices, and then returns a slice of this list for the given rank (creating a partial copy of the list). When the underlying dataset is large, both of these choices waste a large amount of memory. It is much more efficient to create a tensor to hold the indices, and then index into that tensor instead of creating slices.
In the case of a sampler with `shuffle=False`, it would be possible to avoid creating the `indices` tensor entirely (since the index will always match the value), but I have opted instead here to keep the implementation as similar to the existing version as possible. One possible benefit of this approach is that memory usage will not significantly change based on changing this parameter. Still, it might be better to simply return the indices directly without the underlying array.
Additionally, the logic around calculating the number of samples is unnecessarily complex. When dropping the last batch, this can be a simple floor division.
In a simple test script which creates a sampler for a dataset with a 100,000,000 items, memory usage is reduced 98% compared to the existing implementation.
Fixes https://github.com/pytorch/pytorch/issues/45427
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51841
Reviewed By: albanD
Differential Revision: D28240105
Pulled By: rohan-varma
fbshipit-source-id: 4c6aa493d0f75c07ec14c98791b3a531300fb1db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48865
If DistributedSampler was provided an invalid rank (ex:
https://discuss.pytorch.org/t/distributed-datasets-on-multi-machines/105113),
it failed with a cryptic assertion failure.
To fix this issue, I've added an additional check to DistributedSampler to
validate we provide a valid rank.
ghstack-source-id: 117906769
Test Plan:
1) waitforbuildbot
2) Unit test added.
Reviewed By: malfet
Differential Revision: D25344945
fbshipit-source-id: 7685e00c8b2c200efbd2949fb32ee32ea7232a08
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41171
DistributedSampler allows data to be split evenly across workers in
DDP, but it has always added additional samples in order for the data to be
evenly split in the case that the # of samples is not evenly divisible by the
number of workers. This can cause issues such as when doing distributed
validation accuracy, where multiple samples could be considered twice.
This PR adds a drop_last option where the tail of the data is dropped such that
the effective dataset size is still evenly divisible across the workers. This
ensures that DDP can train fine (there is no uneven inputs) and each replica
gets an equal number of data indices.
ghstack-source-id: 108617516
Test Plan: Added unittest
Reviewed By: mrshenli
Differential Revision: D22449974
fbshipit-source-id: e3156b751f5262cc66437b9191818b78aee8ddea
Summary:
We don't need to create `torch.Generator()` and seed it if we are not shuffling.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37604
Differential Revision: D21346167
Pulled By: rohan-varma
fbshipit-source-id: 6ed560d236bc5c026a7d321755ddc02a29db1604
Summary:
Closes gh-31771
Also note that the `epoch` attribute is *only* used as a manual seed in each iteration (so it could easily be changed/renamed). Seeding consecutive iterations with `[0, 1, 2, ...]` is low-entropy, however in practice it probably doesn't matter when using the sampler in combination with a dataloader (because there won't be enough data nor epochs to run into statistical issues
due to low-entropy seeding). So leaving that as is.
Rendered docstring:
<img width="534" alt="image" src="https://user-images.githubusercontent.com/98330/73701250-35134100-46e9-11ea-97b8-3baeb60fcb37.png">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32951
Differential Revision: D19729333
Pulled By: ezyang
fbshipit-source-id: 3ddf90a3828b8bbae88aa2195a5d0b7d8ee1b066
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22479
In some cases, for example, when we training on CTR data, we would like to start training from old samples and finish on new recent samples.
This diff add the option to disable the shuffling in DistributedSampler to accommodate this use case.
Reviewed By: soumith
Differential Revision: D16100388
fbshipit-source-id: 35566581f5250040b2db5ec408a63037b47a9f5d
Summary:
Modifies the DistributedSampler logic. Now each process samples elements with
a given interval, instead of a consecutive section.
This eliminates the possibility where the DataLoader uses padded data while
dropping the real data. It happens when:
1. DistributedSampler padded data; and
2. DataLoader drops_last is effectively true, and drops less then the number
of padded data.
from the example down, we see that data (10, 11, 12) are padded through
duplicating data sample (1, 2, 3)
The old sampler drops legit original data (3, 6, 9) and introduces duplication
(10, 11) into the training set; while the new sampler logic samples correct data
points from the data set.
This example has been added to dataloader unit test
example:
```
data after shuffle: 1, 2, 3, 4, 5, 6, 7, 8, 9
padded data : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
old sampler: -> DataLoader with (batch_size=2 and drop_last=True)
p 1: 1, 2, 3 1, 2
p 2: 4, 5, 6 4, 5
p 3: 7, 8, 9 7, 8
p 4:10,11,12 10,11
new sampler: ->
p 1: 1, 5, 9 1, 5
p 2: 2, 6,10 2, 6
p 3: 3, 7,11 3, 7
p 4: 4, 8,12 4, 8
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12474
Differential Revision: D10260410
Pulled By: SsnL
fbshipit-source-id: 710856571260f42ce25955b81a5b8008e04938cf
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`
Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405
Reviewed By: pietern
Differential Revision: D9733733
Pulled By: teng-li
fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08