The [fastNLP](https://github.com/fastnlp/fastNLP/blob/v0.6.0/fastNLP/core/batch.py#L51) model uses `DataSetGetter` to fetch data from the dataset. The following code breaks because of https://github.com/pytorch/pytorch/pull/84301:
```
import os
import torch
from fastNLP.core.batch import DataSetGetter
from fastNLP.io.pipe.qa import CMRC2018BertPipe

input_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), ".data", "cmrc2018-sim")
data_bundle = CMRC2018BertPipe().process_from_file(paths=input_dir)
data_bundle.rename_field('chars', 'words')
dataset = data_bundle.get_dataset('dev')
dataset = DataSetGetter(dataset, as_numpy=False)
dataiter = torch.utils.data.DataLoader(dataset=dataset)
for batch in dataiter:
    ...  # data-processing
```
This is because for the `DataSetGetter` class, the following condition holds:
```
# hasattr(dataset_getter, '__getitems__') == True
# dataset_getter.__getitems__ == None
```
This PR adds an additional check to make sure `__getitems__` is only called when it is not None.
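For reference, a minimal sketch of the guarded fetch path (modeled on `torch/utils/data/_utils/fetch.py`; the exact upstream code may differ slightly):
```
# Illustrative map-style fetcher with the extra None check.
class _MapDatasetFetcher:
    def __init__(self, dataset, collate_fn, auto_collation):
        self.dataset = dataset
        self.collate_fn = collate_fn
        self.auto_collation = auto_collation

    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            # Only take the batched path when __getitems__ exists AND is not None.
            # DataSetGetter defines the attribute but sets it to None, so without
            # the second half of this check we end up calling None(...) and crash.
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]
        else:
            data = self.dataset[possibly_batched_index]
        return self.collate_fn(data)
```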
This error was found by the torchbench nightly CI; original error stack trace:
```
ERROR: test_fastNLP_Bert_train_cuda (__main__.TestBenchmark)
----------------------------------------------------------------------
components._impl.workers.subprocess_rpc.ChildTraceException: Traceback (most recent call last):
File "/home/circleci/project/components/_impl/workers/subprocess_rpc.py", line 470, in _run_block
exec( # noqa: P204
File "<subprocess-worker>", line 35, in <module>
File "<subprocess-worker>", line 12, in _run_in_worker_f
File "/home/circleci/project/torchbenchmark/util/model.py", line 16, in __call__
obj = type.__call__(cls, *args, **kwargs)
File "/home/circleci/project/torchbenchmark/models/fastNLP_Bert/__init__.py", line 93, in __init__
self.example_inputs = self._prefetch(example_inputs)
File "/home/circleci/project/torchbenchmark/models/fastNLP_Bert/__init__.py", line 133, in _prefetch
for batch_x, batch_y in example_inputs:
File "/home/circleci/miniconda3/lib/python3.8/site-packages/fastNLP/core/batch.py", line 266, in __iter__
for indices, batch_x, batch_y in self.dataiter:
File "/home/circleci/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/home/circleci/miniconda3/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 719, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/circleci/miniconda3/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 56, in fetch
data = self.dataset.__getitems__(possibly_batched_index)
TypeError: 'NoneType' object is not callable
```
Full error log: https://app.circleci.com/pipelines/github/pytorch/benchmark/5143/workflows/0676f36d-0ab4-42bd-adb4-90e6b0df76d1/jobs/5293
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85099
Approved by: https://github.com/ejguan
Summary: This diff adds a check in the fetcher: if the dataset to be fetched defines a `__getitems__` method, use it to fetch a batch of elements at once, as opposed to one by one. This is beneficial for IO-bound usage.
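As a hedged illustration (the dataset below is made up; only the `__getitems__` hook itself comes from this diff), a map-style dataset can serve a whole batch of indices in one call:
```
import torch
from torch.utils.data import Dataset, DataLoader

class RecordDataset(Dataset):
    # Hypothetical dataset; in practice the batched path pays off when each
    # fetch involves a disk or network round trip.
    def __init__(self, records):
        self.records = records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        # Fallback path: one call (and one IO round trip) per sample.
        return self.records[idx]

    def __getitems__(self, indices):
        # Batched path: the fetcher hands over the whole batch of indices,
        # so an IO-bound backend can serve them in a single request.
        return [self.records[i] for i in indices]

loader = DataLoader(RecordDataset(list(range(100))), batch_size=8)
for batch in loader:
    pass  # each batch was fetched with a single __getitems__ call
```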
Differential Revision: D39145980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84301
Approved by: https://github.com/VitalyFedyunin
Summary:
Back in April, malmaud added type annotations for `dataloader.py`. At about the same time, SsnL in https://github.com/pytorch/pytorch/issues/19228 replaced `_DataLoaderIter` with `_BaseDataLoaderIter` and two subclasses, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Probably because these changes landed in parallel, the type stubs and several other references in the codebase were never updated to match this refactoring.
I've gone ahead and updated them to reflect the refactoring in https://github.com/pytorch/pytorch/issues/19228, which fixes the specific type stub/implementation mismatch pointed out in https://github.com/pytorch/pytorch/issues/26673, although not the broader problem that PyTorch doesn't have a test to make sure that the `.pyi` type stub files match the real API defined in the `.py` files.
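For context, a rough sketch of the iterator hierarchy the stubs now mirror (illustrative only, not the exact `.pyi` contents):
```
from typing import Any
from torch.utils.data import DataLoader

class _BaseDataLoaderIter:
    def __init__(self, loader: DataLoader) -> None: ...
    def __iter__(self) -> "_BaseDataLoaderIter": ...
    def __next__(self) -> Any: ...
    def __len__(self) -> int: ...

class _SingleProcessDataLoaderIter(_BaseDataLoaderIter): ...    # fetches in the main process
class _MultiProcessingDataLoaderIter(_BaseDataLoaderIter): ...  # fetches via worker processes
```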
Pull Request resolved: https://github.com/pytorch/pytorch/pull/27105
Differential Revision: D17813641
Pulled By: ezyang
fbshipit-source-id: ed7ac025c8d6ad3f298dd073347ec83bb4b6600c
Summary:
This is a modified version of https://github.com/pytorch/pytorch/pull/14705, since the commit structure of that PR is quite messy.
1. Add `IterableDataset`.
2. So we now have two data loading modes: `Iterable` and `Map`.
    1. `Iterable` if the `dataset` is an instance of `IterableDataset`.
    2. `Map` otherwise.
3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful in doing things like bulk loading.
4. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
5. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset object copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration (see the sketch after this list).
6. Add `ChainDataset`, which is the analog of `ConcatDataset` for `IterableDataset`.
7. Import `torch.utils.data` in `torch/__init__.py`.
8. Add data loader examples and documentation.
9. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.
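A minimal sketch of how `IterableDataset` and `get_worker_info` fit together (the class and the sharding scheme below are illustrative, not part of this PR):
```
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class RangeStream(IterableDataset):
    """Illustrative iterable dataset that shards a range across workers."""

    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: yield the full range.
            lo, hi = self.start, self.end
        else:
            # In a worker process: split the range evenly across workers.
            per_worker = -(-(self.end - self.start) // info.num_workers)  # ceil division
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))

loader = DataLoader(RangeStream(0, 16), batch_size=4, num_workers=2)
for batch in loader:
    print(batch)  # each element of 0..15 appears exactly once across workers
```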
Closes https://github.com/pytorch/pytorch/issues/17909, https://github.com/pytorch/pytorch/issues/18096, https://github.com/pytorch/pytorch/issues/19946, and some of https://github.com/pytorch/pytorch/issues/13023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19228
Reviewed By: bddppq
Differential Revision: D15058152
fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde