Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73396
Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation.
I also tried moving `datapipe.py` into `torch.utils.data.datapipes`, but that leads to circular imports and requires rewriting many import statements. Should I spend more time going down that path?
Fixes https://github.com/pytorch/data/issues/213
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D34481962
Pulled By: NivekT
fbshipit-source-id: 42fb26fe7fc334636852cfd8719fc807bdaa7912
(cherry picked from commit 81e76a64e297cb5c58caa951c554e49526173936)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73119
Test if a DataPipe is serializable after its contents are partially read and completely read. This is especially important for DataPipes with buffers.
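The check described above can be sketched roughly as follows. This is a minimal illustration, not the test added in the PR: `BufferedPipe` and `is_picklable` are hypothetical names, and the sketch uses plain `pickle` on a stand-in class rather than a real DataPipe.

```python
import pickle


class BufferedPipe:
    """Hypothetical stand-in for a DataPipe that keeps an internal buffer."""

    def __init__(self, source):
        self.source = list(source)
        self.buffer = []

    def __iter__(self):
        self.buffer.clear()
        for item in self.source:
            self.buffer.append(item)  # state that must survive serialization
            yield item


def is_picklable(obj):
    """Return True if `obj` survives a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        return False


pipe = BufferedPipe(range(5))

# Partially read: consume two items, then serialize.
it = iter(pipe)
next(it)
next(it)
partially_read_ok = is_picklable(pipe)

# Completely read: exhaust the iterator, then serialize again.
for _ in it:
    pass
fully_read_ok = is_picklable(pipe)
```

The buffer is non-empty in both checks, which is the case the summary calls out as especially important.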
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D34354496
Pulled By: NivekT
fbshipit-source-id: 36971d68b9ca1de81fb254e9a459b8f54fe0f9ff
(cherry picked from commit e8f39a7aa364bd2b19145788f7e67c06f948f81b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72896
Fixing the issue described here: https://github.com/pytorch/data/issues/214
There will be a follow-up PR in TorchData as well
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D34258669
Pulled By: NivekT
fbshipit-source-id: 6dd88250ed14ebe779915dc46139be7e012e9d1b
(cherry picked from commit 025b8ed98019e576bfef04c33a3f33ed1a426a66)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72123
There is a bug in the DataPipe typing system that would take more than a week to fix. I will follow up on it later this month. Since branch cut is today, this PR disables typing to make sure the release works.
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D33920610
Pulled By: ejguan
fbshipit-source-id: febff849ab2272fd3b1c5127a20f27eb82992d9c
(cherry picked from commit ee103e62e70b69236294f8228ac8061fd95cd4fd)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70103
I added an argument so that the behavior can be disabled. I called it `deterministic_order` because `sort` could be confusing: the results are sorted, but per directory level.
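One way to picture "sorted by directory level" is the sketch below. It is an assumption-laden illustration built on `os.walk`, not the code from the PR: `list_files` is a hypothetical helper, and only the argument name `deterministic_order` comes from the description above.

```python
import os
import tempfile


def list_files(root, deterministic_order=True):
    """Yield file paths under `root`, sorting names within each directory if requested."""
    for dirpath, dirnames, filenames in os.walk(root):
        if deterministic_order:
            dirnames.sort()   # in place, so os.walk visits subdirectories in sorted order
            filenames.sort()
        for name in filenames:
            yield os.path.join(dirpath, name)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "b"))
    os.makedirs(os.path.join(root, "a"))
    for rel in ("z.txt", os.path.join("b", "1.txt"), os.path.join("a", "2.txt")):
        open(os.path.join(root, rel), "w").close()
    listed = [os.path.relpath(p, root) for p in list_files(root)]
```

In this sketch `z.txt` is listed before `a/2.txt`: the order is reproducible, but it is not a flat lexicographic sort of all paths, which is why a name like `sort` could mislead.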
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70435
Reviewed By: albanD
Differential Revision: D33899755
Pulled By: ejguan
fbshipit-source-id: e8a08f03a49120333b2d27f332cd21a3240a02a9
(cherry picked from commit 4616e43ec30ba425585c041f8895196909f94d1b)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70215
Some renaming, formatting, and additional tests to improve the unit tests.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D33344610
Pulled By: NivekT
fbshipit-source-id: bb36f7452bdc44964c9ce0650c7ae308ba2c5aa5
(cherry picked from commit 0aae20cb27038b7b3598520db4304a604f1e6799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71161
Users should import these DataPipes from [TorchData](https://github.com/pytorch/data) if they would like to use them. We will be checking for any downstream library usage before landing this PR.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D33532272
Pulled By: NivekT
fbshipit-source-id: 9dbfb21baf2d1183e0aa379049ad8304753e08a1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70367
This PR renames the `FileLoaderIterDataPipe` to `FileOpenerIterDataPipe`. For the sake of not breaking many CI tests immediately, it still preserves `FileLoader` as an alias. This will allow downstream libraries/users to migrate their use cases before we fully remove all references to `FileLoader` from PyTorch.
Fixes https://github.com/pytorch/data/issues/103. More detailed discussion about this decision is also in the linked issue.
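A rename that keeps the old name alive can be sketched as below. The classes are simplified stand-ins, not the real implementations, and the `DeprecationWarning` is an assumption: the description above only says that `FileLoader` is preserved as an alias, not that it warns.

```python
import warnings


class FileOpenerIterDataPipe:
    """Simplified stand-in for the renamed class (constructor arguments are hypothetical)."""

    def __init__(self, paths, mode="r"):
        self.paths = list(paths)
        self.mode = mode


class FileLoaderIterDataPipe(FileOpenerIterDataPipe):
    """Old name kept so that existing callers continue to work."""

    def __init__(self, *args, **kwargs):
        # Assumption: nudge callers toward the new name.
        warnings.warn(
            "FileLoader is deprecated; use FileOpener instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = FileLoaderIterDataPipe(["a.txt"])
```

A plain assignment (`FileLoaderIterDataPipe = FileOpenerIterDataPipe`) would also serve as an alias; the subclass form is shown only because it leaves room for a migration message.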
cc VitalyFedyunin ejguan NivekT pmeier Nayef211
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D33301648
Pulled By: NivekT
fbshipit-source-id: 59278dcd44e372df0ba2001a4eecbf9792580d0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69391
As part of the efforts to unify the APIs across different data backends (e.g. TorchData, TorchArrow), we are making changes to different DataPipes' APIs. In this PR, we are removing the input argument `nesting_level` from `FilterIterDataPipe`.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D32849462
Pulled By: NivekT
fbshipit-source-id: 91cf1dc03dd3d3cbd7a9c6ccbd791ade91355f30
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69390
As part of the efforts to unify the APIs across different data backends (e.g. TorchData, TorchArrow), we are making changes to different DataPipes' APIs. In this PR, we are removing the input argument `nesting_level` from `MapperIterDataPipe`.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D32849465
Pulled By: NivekT
fbshipit-source-id: 963ce70b84a7658331d126e5ed9fdb12273c8e1f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66277
Previously, it was grouped together with tests related to `MapDataPipe`, but it should be with `IterDataPipe`.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D31485823
Pulled By: NivekT
fbshipit-source-id: d13d8c28cbfc305da0e3033d4109a0f971281a02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66275
Once this is added to Core, TorchData's PR will not need a custom class and can use this wrapper instead.
cc VitalyFedyunin ejguan NivekT
Test Plan: Imported from OSS
Reviewed By: ejguan
Differential Revision: D31485822
Pulled By: NivekT
fbshipit-source-id: 790de27629c89c0ca7163a8ee5a09ee8b8233340
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65220
Fixes #65221
- Remove deepcopy from Mapper to support file handles
- Change `IterableWrapper` to deepcopy the iterable instance within each iterator, so that in-place modification no longer leads to different data per epoch
- Convert `IDP` to `IterableWrapper` in test_datapipe.py
- Refine the variable names (avoid using `dp`, which is a module reference)
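The deepcopy-per-iterator idea from the list above can be sketched as follows. This is a simplified, hypothetical version of the wrapper, assuming a `deepcopy` flag on the constructor; it is not the code from the PR.

```python
import copy


class IterableWrapper:
    """Simplified sketch: wrap an iterable and copy it afresh for every iterator."""

    def __init__(self, iterable, deepcopy=True):
        self.iterable = iterable
        self.deepcopy = deepcopy  # assumed flag; lets callers opt out of copying

    def __iter__(self):
        source = copy.deepcopy(self.iterable) if self.deepcopy else self.iterable
        yield from source


wrapped = IterableWrapper([[1], [2]])

# First epoch: a downstream consumer mutates the items in place.
for item in wrapped:
    item.append(0)

# Second epoch: the wrapped data is unchanged because each iterator got its own copy.
second_epoch = list(wrapped)
```

Without the copy, the second epoch would see the items already modified by the first one.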
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D31021886
Pulled By: ejguan
fbshipit-source-id: 72a9eee66c758e2717d591cd0942892bddedc223
Summary:
ghstack is not working for the second commit so I'm manually creating this PR for now. Please only look at changes related to the second commit in this PR (there is a PR for the first commit).
This PR removes TarArchiveReader's dependency on FileLoader DataPipe, by allowing it to use an IterDataPipe of path names as input rather than a tuple of path name and a stream.
It also adds additional tests to ensure that the DataPipe is functioning properly when it is read multiple times or reset halfway through reading.
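Reading archives from path names alone, with each archive closed once it has been consumed, can be sketched with the standard `tarfile` module. `read_tar_members` is a hypothetical helper, and it yields bytes rather than streams to keep the sketch short; the real DataPipe may behave differently.

```python
import io
import os
import tarfile
import tempfile


def read_tar_members(paths):
    """Yield (member_path, content_bytes) for every regular file in each archive."""
    for path in paths:
        # The archive is opened here from its path and closed when the block exits.
        with tarfile.open(path, "r") as archive:
            for member in archive:
                if member.isfile():
                    extracted = archive.extractfile(member)
                    yield os.path.join(path, member.name), extracted.read()


with tempfile.TemporaryDirectory() as tmp:
    tar_path = os.path.join(tmp, "demo.tar")
    payload = b"hello"
    with tarfile.open(tar_path, "w") as archive:
        info = tarfile.TarInfo(name="a.txt")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))

    # Reading twice mirrors the "read multiple times" case mentioned above.
    first_pass = list(read_tar_members([tar_path]))
    second_pass = list(read_tar_members([tar_path]))
```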
The whole stack fixes https://github.com/pytorch/pytorch/issues/64281 - issues related to unclosed buffer stream.
Stack:
* __->__ https://github.com/pytorch/pytorch/issues/64788
* https://github.com/pytorch/pytorch/issues/64786
cc VitalyFedyunin ejguan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64788
Reviewed By: jbschlosser, ejguan
Differential Revision: D30901176
Pulled By: NivekT
fbshipit-source-id: 59746a8d0144fc6d3ce0feb2d76445b82e6d414e
Summary:
There are two warnings produced by `test_fork_datapipe`. This PR addresses the issues raised by those warnings without impacting the test cases.
cc VitalyFedyunin ejguan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64827
Reviewed By: ejguan
Differential Revision: D30870528
Pulled By: NivekT
fbshipit-source-id: 580a001c6fa3ff6f8b04a7e5183e58861938204b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64404
This PR removes `filter`'s inheritance from `map`. This allows `filter` to not have a `__len__` function, which is the behavior we want.
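The reasoning can be illustrated with two stand-alone classes. `MapperPipe` and `FilterPipe` are hypothetical simplifications: a map keeps the length of its source, while a filter cannot know its length without consuming the data, so it should not inherit `__len__`.

```python
class MapperPipe:
    """Applies `fn` to every item; output length equals source length."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn

    def __iter__(self):
        for item in self.source:
            yield self.fn(item)

    def __len__(self):
        return len(self.source)


class FilterPipe:
    """Keeps items for which `predicate` is true; deliberately defines no __len__."""

    def __init__(self, source, predicate):
        self.source = source
        self.predicate = predicate

    def __iter__(self):
        for item in self.source:
            if self.predicate(item):
                yield item


mapped = MapperPipe([1, 2, 3], lambda x: x * 10)
evens = FilterPipe(range(6), lambda x: x % 2 == 0)

try:
    len(evens)
    filter_has_len = True
except TypeError:
    filter_has_len = False
```

If `FilterPipe` inherited from `MapperPipe`, `len(evens)` would report the source length (6 here) even though fewer items pass the predicate.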
cc VitalyFedyunin ejguan
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D30713120
Pulled By: NivekT
fbshipit-source-id: 4d5d07555297ee2bd4b49842c0d26cdc00638f6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64220
Remove `ByKeyGrouperIterDataPipe` due to duplicated functionality.
Fix a bug in `GrouperIterDataPipe` using the existing test.
Test Plan: Imported from OSS
Reviewed By: VitalyFedyunin
Differential Revision: D30650542
Pulled By: ejguan
fbshipit-source-id: 666b4d28282fb4f49f3ff101b8d08be16a50d836
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63422
Fixes #63095
Make `DataChunk` delegate to list methods so that it supports in-place operations:
- `sort`
- `reverse`
- `append`
- `extend`
- `random.shuffle`
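The operations listed above can be exercised on a minimal sketch. Here `DataChunk` is a hypothetical simplification that gets list behavior by subclassing `list`; the real class may delegate differently.

```python
import random


class DataChunk(list):
    """Simplified sketch: a chunk that inherits list behavior, including in-place methods."""


chunk = DataChunk([3, 1, 2])
chunk.sort()            # in place: [1, 2, 3]
chunk.reverse()         # in place: [3, 2, 1]
chunk.append(0)         # [3, 2, 1, 0]
chunk.extend([5, 4])    # [3, 2, 1, 0, 5, 4]
before_shuffle = list(chunk)

# random.shuffle works on any mutable sequence; a seeded Random keeps runs reproducible.
random.Random(0).shuffle(chunk)
```

Each call mutates the same `DataChunk` object, so the chunk keeps its type instead of being replaced by a plain list.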
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D30379027
Pulled By: ejguan
fbshipit-source-id: d176bd0cc8b89b915c7bb184ff243ab1f605616d