Commit Graph

42 Commits

Author SHA1 Message Date
cyy
03e35efbf7 replace torch::make_unique with std::make_unique (#108866)
It should be safe to remove the old torch::make_unique functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108866
Approved by: https://github.com/albanD
2023-09-14 20:52:26 +00:00
cyy
483f748dd5 [BE] Enforce missing override keyword (#104032)
This PR enables `-Winconsistent-missing-destructor-override` and `-Winconsistent-missing-override`
and fixes violations.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 47e904e</samp>

This pull request updates the code of various classes and operators in the `caffe2` and `aten` subdirectories to use the `override` specifier instead of the `virtual` keyword for destructors and other virtual functions that override a base class function. This improves the code readability, quality, and consistency with C++ best practices. It also modifies the `./CMakeLists.txt` file to enable warnings for these specifiers, but disable errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104032
Approved by: https://github.com/malfet
2023-06-24 02:34:24 +00:00
322e4b4c8a set -Wsuggest-override for builds (#89852)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/89852).
* __->__ #89852
* #89851

set -Wsuggest-override for builds

Summary: This was flagged by a Meta internal build.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89852
Approved by: https://github.com/malfet
2022-12-19 22:08:47 +00:00
1b2a17b8f9 Build MacOS binaries with -Werror (#83049)
Should prevent proliferating MPS warnings
Fixes https://github.com/pytorch/pytorch/issues/82966
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83049
Approved by: https://github.com/albanD, https://github.com/ezyang
2022-08-10 17:29:44 +00:00
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
ab2ca95dd1 turn on -Werror=unused-variable in our Bazel CPU build
Summary:
We also fix any existing issues. Note that we only do this for the CPU
build because nvcc is considered a C++ toolchain but it does not have
the same flag support. Adding flags to the GPU build will cause nvcc
errors.

Test Plan: Built locally, rely on CI to confirm.

Reviewers: malfet

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79156

Approved by: https://github.com/seemethere, https://github.com/osalpekar, https://github.com/albanD
2022-06-11 02:46:34 +00:00
81d765ef1f Fix sign-compare violations in cpp tests
Prerequisite change for enabling `-Werror=sign-compare` across PyTorch repo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75080

Approved by: https://github.com/atalman
2022-04-04 23:05:31 +00:00
afb742382a use irange for loops 10 (#69394)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69394

Modified loops in files under fbsource/fbcode/caffe2/ from the format
```
for(TYPE var=x0;var<x_max;x++)
```
to the format
```
for(const auto var: irange(xmax))
```

This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D32837991

fbshipit-source-id: fc7c4f76d2f32a17a0faf329294b3fe7cb81df32
2021-12-09 09:49:34 -08:00
f587267dc7 Revert D31705359: use irange for loops 8
Test Plan: revert-hammer

Differential Revision:
D31705359 (17e5200441)

Original commit changeset: c9ea2fbc0f9c

fbshipit-source-id: 08fff2d12beca953ad30dd0baabf86e39ac84f14
2021-12-02 12:55:08 -08:00
17e5200441 use irange for loops 8 (#66743)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66743

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;x++)`

to the format

`for(const auto var: irange(xmax))`

This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D31705359

fbshipit-source-id: c9ea2fbc0f9cd29e97a52dcb203addc5f2abb09b
2021-12-02 10:21:29 -08:00
2f099c7555 Revert D30652629: use irange for loops
Test Plan: revert-hammer

Differential Revision:
D30652629 (687c2267d4)

Original commit changeset: 0ae6c4bbbb55

fbshipit-source-id: 5c4f067b584a021c8c9656454d1ee60999600fb3
2021-10-15 15:23:10 -07:00
687c2267d4 use irange for loops (#66234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;x++)`

to the format

`for(const auto var: irange(xmax))`

This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.

bypass_size_limit
allow-large-files

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D30652629

fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
2021-10-15 13:50:33 -07:00
ee44d73e59 Modernize override (#61744)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61744

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29717320

fbshipit-source-id: 6eea4295ee2e5572ab337620be412376fcc2f3cc
2021-07-23 23:04:46 -07:00
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
As GoogleTest `TEST` macro is non-compliant with it as well as `DEFINE_DISPATCH`

All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
3a66a1cb99 [clang-tidy] Exclude cppcoreguidelines-avoid-magic-numbers (#57841)
Summary:
Add cppcoreguidelines-avoid-magic-numbers exclusion to clang-tidy
Remove existing nolint warnings using following script:
```
for file in `git ls-files | grep -v \.py`; do gsed '/^ *\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)/d' -i  $file; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57841

Reviewed By: samestep

Differential Revision: D28295045

Pulled By: malfet

fbshipit-source-id: 7c6e8d1213c9593f169ed3df6a916498f1a97163
2021-05-07 20:02:33 -07:00
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
5bcbbf5373 Lint trailing newlines (#54737)
Summary:
*Context:* https://github.com/pytorch/pytorch/issues/53406 added a lint for trailing whitespace at the ends of lines. However, in order to pass FB-internal lints, that PR also had to normalize the trailing newlines in four of the files it touched. This PR adds an OSS lint to normalize trailing newlines.

The changes to the following files (made in 54847d0adb9be71be4979cead3d9d4c02160e4cd) are the only manually-written parts of this PR:

- `.github/workflows/lint.yml`
- `mypy-strict.ini`
- `tools/README.md`
- `tools/test/test_trailing_newlines.py`
- `tools/trailing_newlines.py`

I would have liked to make this just a shell one-liner like the other three similar lints, but nothing I could find quite fit the bill. Specifically, all the answers I tried from the following Stack Overflow questions were far too slow (at least a minute and a half to run on this entire repository):

- [How to detect file ends in newline?](https://stackoverflow.com/q/38746)
- [How do I find files that do not end with a newline/linefeed?](https://stackoverflow.com/q/4631068)
- [How to list all files in the Git index without newline at end of file](https://stackoverflow.com/q/27624800)
- [Linux - check if there is an empty line at the end of a file [duplicate]](https://stackoverflow.com/q/34943632)
- [git ensure newline at end of each file](https://stackoverflow.com/q/57770972)

To avoid giving false positives during the few days after this PR is merged, we should probably only merge it after https://github.com/pytorch/pytorch/issues/54967.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54737

Test Plan:
Running the shell script from the "Ensure correct trailing newlines" step in the `quick-checks` job of `.github/workflows/lint.yml` should print no output and exit in a fraction of a second with a status of 0. That was not the case prior to this PR, as shown by this failing GHA workflow run on an earlier draft of this PR:

- https://github.com/pytorch/pytorch/runs/2197446987?check_suite_focus=true

In contrast, this run (after correcting the trailing newlines in this PR) succeeded:

- https://github.com/pytorch/pytorch/pull/54737/checks?check_run_id=2197553241

To unit-test `tools/trailing_newlines.py` itself (this is run as part of our "Test tools" GitHub Actions workflow):
```
python tools/test/test_trailing_newlines.py
```

Reviewed By: malfet

Differential Revision: D27409736

Pulled By: samestep

fbshipit-source-id: 46f565227046b39f68349bbd5633105b2d2e9b19
2021-03-30 13:09:52 -07:00
8c798e0622 Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```

I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
2021-03-05 17:22:55 -08:00
e7fe64f6a6 Fix typos (#30606)
Summary:
Should be non-semantic.

Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/30606

Differential Revision: D18763028

Pulled By: mrshenli

fbshipit-source-id: 896515a2156d062653408852e6c04b429fc5955c
2019-12-02 20:17:42 -08:00
595209bddc Fix bugs in torch::tensor constructor (#28523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28523

New features:
1. Previously, `torch::tensor({true, false, true})` throws `"tensor_cpu" not implemented for 'Bool'`. After this PR, it produces the correct bool tensor, matching the Python API behavior.
2. Tensors with zero-size dimensions are now supported, e.g. `torch::tensor({{}, {}})` produces a tensor with sizes `{2, 0}`, matching the Python API behavior.

BC-breaking bug fixes:
1. Previously, `torch::tensor({{1}, {2}})` produces a tensor of sizes `{2}`. After this PR, it produces a tensor of sizes `{2, 1}`, matching the Python API behavior.
2. Fixed semantics of `torch::tensor(1.1)`: it now returns a 0-dim tensor instead of a 1-dim tensor, matching the Python API behavior.
3. Previously, when passed a non-dtype `TensorOptions` to the `torch::tensor` constructor, it always produces a tensor of dtype `float`. After this PR, it produces tensor of different dtypes based on the dtype of the braced-init-list, matching the behavior of the no-options case.
```cpp
// Previously:
torch::tensor({1, 2, 3}, torch::TensorOptions(/*non-dtype-options*/)).dtype() -> float
torch::tensor({{1, 2, 3}}, torch::TensorOptions(/*non-dtype-options*/)).dtype() -> float
torch::tensor({1., 2., 3.}, torch::TensorOptions(/*non-dtype-options*/)).dtype() -> float
torch::tensor({{1., 2., 3.}}, torch::TensorOptions(/*non-dtype-options*/)).dtype() -> float

// Now:
torch::tensor({1, 2, 3}, torch::TensorOptions(/*non-dtype-options*/)).dtype() -> int
torch::tensor({{1, 2, 3}}, torch::TensorOptions(/*non-dtype-options*/)).dtype() -> int
torch::tensor({1., 2., 3.}, torch::TensorOptions(/*non-dtype-options*/)).dtype() -> double
torch::tensor({{1., 2., 3.}}, torch::TensorOptions(/*non-dtype-options*/)).dtype() -> double

// As comparison, currently:
torch::tensor({1, 2, 3}).dtype() -> int
torch::tensor({{1, 2, 3}}).dtype() -> int
torch::tensor({1., 2., 3.}).dtype() -> double
torch::tensor({{1., 2., 3.}}).dtype() -> double
```

Notes:
1. From now on, the behavior of `at::tensor(scalar_value)` (which produces a 1-dim tensor) would be different from `torch::tensor(scalar_value)` (which produces a 0-dim tensor). I will fix the behavior of `at::tensor(scalar_value)` in a follow-up PR.
2. From now on, the behavior of `at::tensor({1, 2, 3}, torch::TensorOptions(/*non-dtype-options*/))` (which produces a `float` tensor) would be different from `torch::tensor({1, 2, 3}, torch::TensorOptions(/*non-dtype-options*/))` (which produces a an `int` tensor). I will fix this behavior of `at::tensor` constructor in a follow-up PR.

Context for the changes in this PR:

The motivation comes from fixing the "`torch::tensor({{1}, {2}})` gives tensor of wrong sizes" bug - in order to fix it, I have to move the handling of `at::ArrayRef` and `std::vector` into `InitListTensor` (see below on why we need to do this) and renamed `InitListTensor` to `TensorDataContainer`. After such changes, support for bool values comes out of the box without extra effort, and support for tensors with zero-size dimensions only requires adding a default constructor for `TensorDataContainer`, so I added those two in this PR.

For the semantic change of `torch::tensor(1.1)`, it's actually more effort to preserve the original wrong behavior (i.e. we need to check the sizes of the tensor converted from `TensorDataContainer` and reshape any scalar tensor to a 1-D tensor). I think preserving the original wrong behavior doesn't give us much value, and since the above changes naturally fix the problem, we should just start using the right behavior instead.

For the "constructor with non-dtype options behavior" fix, the code looks simpler and easier to reason about with the fix, so I included it in this PR.

--------

Why we need to move the handling of `at::ArrayRef` and `std::vector` into `TensorDataContainer`:

`torch::tensor({{1}, {2}})` can match this function overload:
`torch::tensor(at::ArrayRef<int> values)`, because `{1}` and `{2}` can be treated as
a list-initialization of an `int` value. However, this will produce a Tensor with sizes `{2}`,
but we actually want a Tensor with sizes `{2, 1}`. In order to avoid matching this function overload,
we removed the function overload and moved the ability to convert `at::ArrayRef<T>`
(and similarly `std::vector<T>`) into `TensorDataContainer`, and since for braced-init-list the
`TensorDataContainer(std::initializer_list<TensorDataContainer>)` constructor is always preferred over all other constructors, it will take the `std::initializer_list` path, and all is good.

Test Plan: Imported from OSS

Differential Revision: D18234625

Pulled By: yf225

fbshipit-source-id: 0f3f6912e82e2117d2103e31b74e7e97baaa8693
2019-10-31 12:53:06 -07:00
e8087a3060 Change C++ API test files to only include torch/torch.h (#27067)
Summary:
One of the purposes of the C++ API tests in `test/cpp/api/` should be to check that including `torch/torch.h` is a sufficient prerequisite for using all C++ frontend features. This PR change ensures that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27067

Differential Revision: D17856815

Pulled By: yf225

fbshipit-source-id: 49c057bd807b003e4a00f6ba73131d763a0f277a
2019-10-10 09:46:29 -07:00
31f1928096 add sorting policy to ChunkDataset (#23053)
Summary:
Add a sorting policy to ChunkDataset.

This is considered an advanced parameter for developers who want to apply a 'sorting policy' to the chunk data before sampling into minibatch.

Different than the collate method, this policy is applied on the chunk level instead of minibatch level. When a chunk of data is loaded (multiple chunks if cross_chunk_shuffle_count_ is greater than 1), this policy is targeting to the full loaded data. It will be useful if developers want to perform some pre-processing (like bucketing) to the chunk data before example sampler samples the data.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23053

Differential Revision: D16537692

Pulled By: colesbury

fbshipit-source-id: cd21ed40ab787a18b8c6dd304e5b806a7a45e6ba
2019-07-29 12:34:02 -07:00
f0f2331a1c Add support for cross-chunk shuffling in ChunkDataset (#22347)
Summary:
This change adds one advanced support for cross-chunk shuffling.

For training with static dataset, the default configuration is at user's disposal. However, in some user cases, over each epoch, new data is added to the current dataset, thus the dataset's size is dynamically changing/increasing. In order to mix the new data and the old data for better random sampling, one approach is to shuffle examples from more than 1 chunks. This feature is supported with this change. By specifying the `cross_chunk_shuffle_count_` on construction, advanced user can specify how many chunks to shuffle example from.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22347

Differential Revision: D16081378

Pulled By: zhangguanheng66

fbshipit-source-id: fd001dfb9e66947839adecfb9893156fbbce80d0
2019-07-01 19:13:34 -07:00
f39b6624ba ChunkDataset checkpoint support (#21889)
Summary:
When dealing with large scale dataset, it is handy if we can save the dataset status and resume later. Especially in cases where some unexpected crash happens, user don't need to start over the whole dataset from begining. Instead, they can reload it from the last checkpoint.

This change adds support for checkpoint save/load logic in ChunkDataset.

On ChunkDataset construction, user can specify a file name from which to load the checkpoint. If it is empty, default to start from fresh; otherwise the ChunkDataset will 'fast forward' the chunk sampler to the corresponding checkpoint.

The user can also call ChunkDataset::save() to serialize current status to a file, which can be used later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21889

Differential Revision: D16024582

Pulled By: ailzhang

fbshipit-source-id: 1862ab5116f94c9d29da174ce04a91041d06cad5
2019-06-26 22:54:14 -07:00
3d4d7b9082 Refactor ChunkDataReader API + fix missing headers (#19485)
Summary:
This PR restricts the BatchType template argument of ChunkDataReader to STL
vectors only. Internally, ChunkDataReader was assuming BatchType was a
vector, but the user could pass any type to the template argument,
leading to compiling issues during CPP extensions.

Additionally to the proposed API change, this PR adds missing include
headers to chunk.h. Currently the current implementation works but if
users try to create C++ extensions that implements new ChunkDataReaders
to be along with the existing ChunkDataset, the build will fail due to
the missing headers.

In terms of functionality, nothing has changed. This PR simply makes the
implementation slightly more robust for future extensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19485

Differential Revision: D15261725

Pulled By: soumith

fbshipit-source-id: 38c9465d665392ae6a2d12c5a520a4f501e1a6ca
2019-05-08 22:20:19 -07:00
bb3a2d99ac Jaliyae/chunk buffer fix (#17409)
Summary:
The chunk buffer had a possibility to hang when no data is read and the buffer size is lower than chunk size. We detected this while running with larger dataset and hence the fix. I added a test to mimic the situation and validated that the fix is working. Thank you Xueyun for finding this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17409

Differential Revision: D14198546

Pulled By: soumith

fbshipit-source-id: b8ca43b0400deaae2ebb6601fdc65b47f32b0554
2019-02-23 08:48:53 -08:00
9477c143c6 C++ Frontend: adding two distributed samples (Random and Sequential) (#16910)
Summary:
Adding two distrbuted samplers, Random and Sequential to the mix. Similar to python counterpart, DistributedSampler introduces a new method `set_epoch(size_t epoch)` which can be use to shuffle data determinstically between distributed processes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16910

Differential Revision: D14130980

Pulled By: soumith

fbshipit-source-id: ec08b7130c01e2fc6dc3693f7ac622a0a6d60f10
2019-02-19 05:40:37 -08:00
bc39cf4d5e Remove chunk count check on the ChunkBuffer (#16868)
Summary:
Previously, the ChunkBuffer depends on the remaining chunk count to signal end of dataloading. This does not work with distributed samplers where each sampler only loads a subset of  chunks. This refactor remove the dependency on the remaining chunk count at the ChunkBuffer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16868

Differential Revision: D14066517

Pulled By: goldsborough

fbshipit-source-id: 293dfe282ceff326dff0876c2f75c2ee4f4463e2
2019-02-13 11:09:42 -08:00
6249442e90 Chunk dataset implementation (#15932)
Summary:
This PR contains the implementation of chunk dataset, with the API proposed in PR https://github.com/pytorch/pytorch/pull/15562

A chunk dataset is derived from StatefulDataset. It utilizes worker threads to prefetches chunk data, splits it into batches and caches them into a queue. When get_batch is called from dataloader, batch data is retrieved from the queue, and data in new chunks will be pushed for later following batches.

Chunk dataset uses two samplers (chunk_sampler and example_sampler) to perform sampling. The chunk_sampler decides which chunk to load, and example_sampler shuffles the examples inside a specific chunk. More detail of this sampling approach can be found here: http://martin.zinkevich.org/publications/nips2010.pdf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15932

Differential Revision: D13868688

Pulled By: soumith

fbshipit-source-id: a43000c478ca2a3c64cc84b3626d6b8b1ad9a07e
2019-01-29 18:06:01 -08:00
a4c1aa4bc5 Add the normalize transform to the core library (#15891)
Summary:
Adds the `Normalize` transform to the core C++ frontend library.

ebetica ezyang soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15891

Differential Revision: D13642167

Pulled By: goldsborough

fbshipit-source-id: 573428e626d6106cf2aadf3dc2e2aecb9a85efc3
2019-01-11 19:50:18 -08:00
ad6799537e Support stateful dataset (#15096)
Summary:
Currently re-implements the dataloader for stateful datasets. Outstanding work:
- Refactor DataLoader and DataLoader2 to have common base classes and only differ in specifi pieces of logic,
- Figure out how to not duplicate the `MapDataset` logic for stateful vs. non-stateful
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15096

Differential Revision: D13522043

Pulled By: goldsborough

fbshipit-source-id: 08e461ca51783047f11facc4d27dfa2e4f1e4c2a
2018-12-24 06:26:40 -08:00
44cb43bcc1 Jaliyae/samplers (#13870)
Summary:
Make Samplers optionally accept new size in their reset() method. This helps dataloader or dataset to reset the sampler for an epoch or a chunk of data with different sizes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13870

Differential Revision: D13240120

Pulled By: soumith

fbshipit-source-id: 19c53f8be13c0fdcf504f0637b0d3e6009a8e599
2018-11-29 07:07:19 -08:00
f639249d51 Fix dataloader iterator test (#14045)
Summary:
I noticed the test `DataLoaderTest.CanDereferenceIteratorMultipleTimes` doesn't test proper progression of the iterator. I also added a test for using `std::copy`.

Fixes https://github.com/pytorch/pytorch/issues/14276

ebetica ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14045

Differential Revision: D13092187

Pulled By: goldsborough

fbshipit-source-id: 57698ec00fa7b914b159677a4ab38b6b25c2860b
2018-11-26 17:06:41 -08:00
fb6535ec70 Add SharedDataset (#13800)
Summary:
This PR adds a `SharedDataset` to the C++ frontend data API, which allows wrapping a shared_ptr to a dataset into a class that conforms to the `Dataset` interface (with `get_batch`). This enables use cases where a custom dataset is (1) thread-safe and (2) expensive to copy. All workers will reference a single instance of this dataset. No additional copies are incurred.

jaliyae apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13800

Differential Revision: D13075610

Pulled By: goldsborough

fbshipit-source-id: 4ffdfd7959d49b042c0e254110085f62a0bfeb6c
2018-11-16 13:07:10 -08:00
8f4dc192b6 Fix DataLoaderTest.EnforcesOrderingAmongThreadsWhenConfigured (#14038)
Summary:
I think this will be it. So for one, the previous test was bullshit because it was returning the thread id instead of the sample index (which is the thing whose ordering is enforced). Just turning up the number of threads to 10 from 4 made this very obvious. I also think there is a race condition, which may or may not have surfaced, in that there was nothing stopping one worker to get multiple batches, which would screw with the whole ordering logic. I've added a barrier struct such that workers wait for all workers to be in the `get_batch` function before actually doing something.

Fixes https://github.com/pytorch/pytorch/issues/14002

ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14038

Differential Revision: D13088132

Pulled By: goldsborough

fbshipit-source-id: 4bded63756c6a49502ee07ef8709a03073e7e05f
2018-11-15 17:30:41 -08:00
0478d32cb8 Move AlignOf, SmallVector and ArrayRef to c10.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13916

Reviewed By: smessmer

Differential Revision: D13046722

fbshipit-source-id: 1583d3170d60e22f0a535cd1fd56bdf928186f5d
2018-11-14 11:13:16 -08:00
5151d33287 Unflake the ordering enforcement test (#13919)
Summary:
Attempts to unflake the dataloader ordering enforcement test. I think the issue was that the `thread_counter` variable was not atomic. I've made it atomic, and also global just to make it a bit clearer.

Fixes https://github.com/pytorch/pytorch/issues/13634

colesbury SsnL ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13919

Differential Revision: D13051718

Pulled By: goldsborough

fbshipit-source-id: b9f7f6317701a8b861a1d5c6a9b2b17b44782561
2018-11-13 21:05:02 -08:00
393ad6582d Use torch:: instead of at:: in all C++ APIs (#13523)
Summary:
In TorchScript and C++ extensions we currently advocate a mix of `torch::` and `at::` namespace usage. In the C++ frontend I had instead exported all symbols from `at::` and some from `c10::` into the `torch::` namespace. This is far, far easier for users to understand, and also avoid bugs around creating tensors vs. variables. The same should from now on be true for the TorchScript C++ API (for running and loading models) and all C++ extensions.

Note that since we're just talking about typedefs, this change does not break any existing code.

Once this lands I will update stuff in `pytorch/tutorials` too.

zdevito ezyang gchanan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13523

Differential Revision: D12942787

Pulled By: goldsborough

fbshipit-source-id: 76058936bd8707b33d9e5bbc2d0705fc3d820763
2018-11-06 14:32:25 -08:00
8fafa7b6ac Remove size() from BatchDataset and templatize IndexType (#12960)
Summary:
This PR brings to changes to the recently landed C++ Frontend dataloader:

1. Removes the `size()` method from `BatchDataset`. This makes it cleaner to implement unsized ("infinite stream") datasets. The method was not used much beyond initial configuration.
2. Makes the index type of a dataset a template parameter of `BatchDataset` and `Sampler`. This essentially allows custom index types instead of only `vector<size_t>`. This greatly improves flexibility.

See the `InfiniteStreamDataset` and `TestIndex` datasets in the tests for what this enables.

Some additional minor updates and code movements too.

apaszke SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12960

Differential Revision: D12893342

Pulled By: goldsborough

fbshipit-source-id: ef03ea0f11a93319e81fba7d52a0ef1a125d3108
2018-11-05 17:13:09 -08:00
c21471c77f Sampler serialization and deserialization (#12999)
Summary:
Implements serialization and deserialization for samplers in the C++ frontend dataloader.

apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12999

Differential Revision: D10859676

Pulled By: goldsborough

fbshipit-source-id: cd132100fd35323e5a3df33e314511750806f48d
2018-10-26 12:20:51 -07:00
49046239f2 Change explicit usages of at::optional to c10::optional (#13082)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13082

Follow up of D10511254. For these cases we can move to preferred `optional` without namespace right away.

Reviewed By: ezyang, Yangqing

Differential Revision: D10844117

fbshipit-source-id: 99a59e692fb4b236b299579f937f1536d443d899
2018-10-25 15:17:53 -07:00
a022fd2d6b Implement DataLoader (#11918)
Summary:
This PR implements a DataLoader API for the C++ frontend.

The components present in this API largely match the Python API. It consists of:
- `Dataset`s: Conceptually a function from a set of indices to a batch of examples;
- `Transform`s: A functional transformation of a dataset. A `Map<D, T>` for Dataset `D` and transform `T` is itself a dataset;
- `Sampler`s: Specify a strategy for generating indices for a new batch;
- A `DataLoader`, with the ability to automatically parallelize fetching of samples across multiple worker threads;

Note that collation functions fall naturally out of the `Map<Dataset, Transform>` abstraction.

Things that are missing right now that maybe should be added:
- Memory pinning for CUDA tensors

The API was designed to be generalizable to almost any kind of dataset, transform or sampling strategy, while providing a convenient API out of the box. To achieve this, it is quite heavily templatized on various possible input types.

There are many parts to this PR! Right now, I would like feedback on:
- Your impression of the general usability of the API;
- Your impression of which parts seem too complex or overthought;
- The implementation of the parallelization aspects of the DataLoader. I've followed the Python implementation in some matters, but also differ in others. I think my implementation is a little cleaner and decouples components slightly better than the Python dataloader.

I haven't added too many comments yet, as this is fresh out of the oven. Let me know if anything is unclear from the code itself.

There also aren't any tests yet. I will write a comprehensive test suite once we agree on the API and implementation.

apaszke ezyang The controller you requested could not be found. pietern
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11918

Reviewed By: ezyang

Differential Revision: D9998881

Pulled By: goldsborough

fbshipit-source-id: 22cf357b63692bea42ddb1cc2abc71dae5030aea
2018-10-22 10:22:41 -07:00