94 Commits

Author SHA1 Message Date
5667c4ea21 Remove default parameter of ShufflerIterDataPipe (#74370)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74370

Closes https://github.com/pytorch/data/issues/298. This PR:

- removes the `default` parameter of `ShufflerIterDataPipe`
- renames `set_shuffle_setting()` to `set_shuffle()`
- makes `set_shuffle()` return `self`.
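
Returning `self` from a setter lets the call be chained directly onto construction. The following is a minimal sketch in plain Python; `ToyShuffler`, its `seed` argument, and its internals are hypothetical stand-ins, not the actual `ShufflerIterDataPipe` implementation.

```python
import random


class ToyShuffler:
    """Hypothetical stand-in for a shuffling pipe; not the PyTorch class."""

    def __init__(self, source, seed=0):
        self.source = list(source)
        self.seed = seed
        self._enabled = True

    def set_shuffle(self, shuffle=True):
        # Returning self lets callers chain the setter onto construction.
        self._enabled = shuffle
        return self

    def __iter__(self):
        items = list(self.source)
        if self._enabled:
            random.Random(self.seed).shuffle(items)
        return iter(items)


# Chained call: construct the pipe and disable shuffling in one expression.
pipe = ToyShuffler(range(5)).set_shuffle(False)
unshuffled = list(pipe)
```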

Test Plan: Imported from OSS

Reviewed By: george-qi

Differential Revision: D35073666

Pulled By: NicolasHug

fbshipit-source-id: 9847b037e70f44f36eaf4471f2c12fa8ec2ed73c
(cherry picked from commit b07ab646f308532886e8daddd57e937a53edb153)
2022-03-28 12:47:24 +00:00
eec994fc16 [DataPipe] Separating DataPipes from Dataset into different files (#73396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73396

Separating DataPipes from Dataset into different files. This makes the code more maintainable and simplifies some of the code generation.

I also tried moving `datapipe.py` into `torch.utils.data.datapipes`, but that leads to circular imports and requires rewriting many import statements. Should I spend more time on that approach?

Fixes https://github.com/pytorch/data/issues/213

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34481962

Pulled By: NivekT

fbshipit-source-id: 42fb26fe7fc334636852cfd8719fc807bdaa7912
(cherry picked from commit 81e76a64e297cb5c58caa951c554e49526173936)
2022-03-15 14:46:34 +00:00
8811d217ed [DataPipe] Slight refactoring IterDataPipe serialization test (#73922)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73922

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34732288

Pulled By: NivekT

fbshipit-source-id: f31229332fe4eac85cc2085484f6e1b1d802987d
(cherry picked from commit ace20054e4f3f9bd9610640755400fbde82650c3)
2022-03-09 15:33:12 +00:00
0821154072 [DataPipe] Adding serialization test for all MapDataPipe (#73921)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/73921

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34732286

Pulled By: NivekT

fbshipit-source-id: 893af2fbb83feb1bae226d3205105de5d3836378
(cherry picked from commit f44fd3c5210d0afdbf826e3b7e7fbe2ec216c3b7)
2022-03-09 15:33:12 +00:00
f85309e478 [DataPipe] Adding serialization test at different stages of reading for IterDataPipes (#73119)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73119

Test if a DataPipe is serializable after its contents are partially read and completely read. This is especially important for DataPipes with buffers.
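
The idea of checking serializability at three read stages can be sketched roughly as follows. `ToyBufferedPipe` and `state_round_trips` are hypothetical names, and the sketch round-trips the pipe's state (`vars(pipe)`) through `pickle` rather than reproducing the actual test in the PR.

```python
import pickle


class ToyBufferedPipe:
    """Hypothetical pipe that keeps an internal buffer; not a PyTorch class."""

    def __init__(self, source):
        self.source = list(source)
        self.buffer = []

    def __iter__(self):
        self.buffer = list(self.source)  # fill the buffer when iteration starts
        while self.buffer:
            yield self.buffer.pop(0)


def state_round_trips(pipe):
    """Return True if the pipe's attribute dict survives a pickle round trip."""
    state = vars(pipe)
    return pickle.loads(pickle.dumps(state)) == state


pipe = ToyBufferedPipe(range(4))
results = {"unread": state_round_trips(pipe)}

it = iter(pipe)
next(it)  # partially read: the buffer still holds pending items
results["partially_read"] = state_round_trips(pipe)
pending_after_partial_read = list(pipe.buffer)

for _ in it:  # drain the remaining items
    pass
results["fully_read"] = state_round_trips(pipe)
```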

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D34354496

Pulled By: NivekT

fbshipit-source-id: 36971d68b9ca1de81fb254e9a459b8f54fe0f9ff
(cherry picked from commit e8f39a7aa364bd2b19145788f7e67c06f948f81b)
2022-02-23 16:31:21 +00:00
cd4ecce1bb [DataPipe] Fix issue with DataPipe serialization with dill (#72896)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72896

Fixing the issue described here: https://github.com/pytorch/data/issues/214

There will be a follow-up PR in TorchData as well

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D34258669

Pulled By: NivekT

fbshipit-source-id: 6dd88250ed14ebe779915dc46139be7e012e9d1b
(cherry picked from commit 025b8ed98019e576bfef04c33a3f33ed1a426a66)
2022-02-23 16:31:20 +00:00
6297aa114f [DataPipe] Extend FileLister to support load multiple directories (#72260)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/72260

Test Plan: Imported from OSS

Reviewed By: dagitses, NivekT

Differential Revision: D33979744

Pulled By: ejguan

fbshipit-source-id: 5733d20382642fc2274afd838b33c98150d81e91
(cherry picked from commit f70537ae76448f5898da8d3f4884d0b3a29d40eb)
2022-02-04 07:55:00 +00:00
7b014cc645 [DataPipe] Disable Typing for DataPipe before branch cut (#72123)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72123

The DataPipe typing system has a bug that would take more than a week to fix. I will follow up on it later this month. Because branch cut is today, this PR disables typing to make sure the release works.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D33920610

Pulled By: ejguan

fbshipit-source-id: febff849ab2272fd3b1c5127a20f27eb82992d9c
(cherry picked from commit ee103e62e70b69236294f8228ac8061fd95cd4fd)
2022-02-02 05:00:41 +00:00
5024c1bc7b Make get_file_pathnames_from_root output order deterministic (#70435)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/70103

I added an argument so the behavior can be disabled. I called it `deterministic_order` because `sort` could be confusing: the output is sorted, but by directory level.
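
What "sorted by directory level" can look like is sketched below with `os.walk`. The helper `list_files` is hypothetical and only borrows the `deterministic_order` argument name from this message; it is not the body of `get_file_pathnames_from_root`.

```python
import os
import tempfile


def list_files(root, deterministic_order=False):
    """Hypothetical helper; yields file paths under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        if deterministic_order:
            dirnames.sort()   # in-place sort controls the order os.walk descends
            filenames.sort()
        for name in filenames:
            yield os.path.join(dirpath, name)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "b"))
    os.makedirs(os.path.join(root, "a"))
    for rel in ("z.txt", "a/2.txt", "a/1.txt", "b/0.txt"):
        open(os.path.join(root, *rel.split("/")), "w").close()
    listed = [
        os.path.relpath(path, root).replace(os.sep, "/")
        for path in list_files(root, deterministic_order=True)
    ]
```

In this sketch the top-level file `z.txt` is listed before `a/1.txt`, so the result is ordered within each directory level rather than as one flat sorted list of paths.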

Pull Request resolved: https://github.com/pytorch/pytorch/pull/70435

Reviewed By: albanD

Differential Revision: D33899755

Pulled By: ejguan

fbshipit-source-id: e8a08f03a49120333b2d27f332cd21a3240a02a9
(cherry picked from commit 4616e43ec30ba425585c041f8895196909f94d1b)
2022-02-01 18:12:23 +00:00
b36b11cbc1 Separating CaptureDataFrame out of DFIterDataPipe (#71776)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71776

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33771602

Pulled By: VitalyFedyunin

fbshipit-source-id: 59d85bc707a9568f1f0960fc184113a4f422d2df
(cherry picked from commit 93522768efc8c525887ad52b415009535fe02cfb)
2022-01-26 03:25:02 +00:00
bb157dd4eb Make methods of internal file_obj visible from StreamWrapper (#71653)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71653

Test Plan: Imported from OSS

Reviewed By: NivekT

Differential Revision: D33718749

Pulled By: ejguan

fbshipit-source-id: f3a8244f22ca37049b8678afa0e329b23c957a9d
(cherry picked from commit a4d12ca48ec153ad5f058152e7df4a9a1421b184)
2022-01-25 15:34:24 +00:00
13ea2cb330 [DataPipe] Make GroupBy serializable with lambda function (#71497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71497

Related to https://github.com/pytorch/data/issues/172

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33668749

Pulled By: NivekT

fbshipit-source-id: 6506614e9d4389dc645d8985c00fdb3402122d9b
(cherry picked from commit 458e76fcb1a60691a225f3f5e4a058a51490732d)
2022-01-21 16:04:45 +00:00
36b4c95e74 [DataPipe] adding serialization test for all core IterDataPipes (#71456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71456

Related to https://github.com/pytorch/data/issues/172

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D33668748

Pulled By: NivekT

fbshipit-source-id: ea2085d5ed47533ca49258cc52471373c6ae1847
(cherry picked from commit d5f6fde1d08bf77789176930cf4dc7faa7a6b5a3)
2022-01-21 16:04:45 +00:00
011fd1d933 [DataPipe] improving DataPipe unit tests (#70215)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70215

Some renaming, formatting, and additional tests to improve the unit tests.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33344610

Pulled By: NivekT

fbshipit-source-id: bb36f7452bdc44964c9ce0650c7ae308ba2c5aa5
(cherry picked from commit 0aae20cb27038b7b3598520db4304a604f1e6799)
2022-01-20 15:49:53 +00:00
fd9e08df5d Make Demux serializable with lambda function (#71311)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/71311

Test Plan: Imported from OSS

Reviewed By: NivekT

Differential Revision: D33584552

Pulled By: ejguan

fbshipit-source-id: 52324faf5547f9f77582ec170ec91ce3114cfc61
2022-01-18 06:47:54 -08:00
1e3893ecbb [DataPipe] Removing deprecated DataPipes (#71161)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71161

Users should import these DataPipes from [TorchData](https://github.com/pytorch/data) if they would like to use them. We will be checking for any downstream library usage before landing this PR.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D33532272

Pulled By: NivekT

fbshipit-source-id: 9dbfb21baf2d1183e0aa379049ad8304753e08a1
2022-01-13 07:37:48 -08:00
8dcfdf39e7 [DataPipe] Renaming FileLoader to FileOpener with deprecation warning for FileLoader (#70367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70367

This PR renames the `FileLoaderIterDataPipe` to `FileOpenerIterDataPipe`. For the sake of not breaking many CI tests immediately, it still preserves `FileLoader` as an alias. This will allow downstream libraries/users to migrate their use cases before we fully remove all references to `FileLoader` from PyTorch.
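
One common way to keep an old name alive during a rename is a thin subclass that warns when it is used. The sketch below assumes that approach; the class bodies, the `mode` argument, and the warning text are illustrative and not taken from the PR.

```python
import warnings


class FileOpener:
    """Hypothetical stand-in for the renamed class."""

    def __init__(self, paths, mode="rb"):
        self.paths = list(paths)
        self.mode = mode


class FileLoader(FileOpener):
    """Old name kept as an alias that warns on use (assumed approach)."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "FileLoader is deprecated; use FileOpener instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)


# Existing callers keep working but see a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = FileLoader(["a.txt"])
```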

Fixes https://github.com/pytorch/data/issues/103. More detailed discussion about this decision is also in the linked issue.

cc VitalyFedyunin ejguan NivekT pmeier Nayef211

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33301648

Pulled By: NivekT

fbshipit-source-id: 59278dcd44e372df0ba2001a4eecbf9792580d0b
2022-01-04 09:14:50 -08:00
ad0cd8a76e [DataPipe] Improve inline doc and testing for CollatorIterDataPipe (#70139)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70139

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D33199107

Pulled By: NivekT

fbshipit-source-id: f96d77490998ac9bc3da8d4ff1a9caa08e9e7f27
2021-12-20 08:05:21 -08:00
3d51c88032 [DataPipe] Unifying API - removing options to have fn_args and fn_kwargs from MapDataPipes (#69561)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69561

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32952099

Pulled By: NivekT

fbshipit-source-id: 95b725774a9d04d655e2542760726908f33043f4
2021-12-16 18:11:00 -08:00
b89c283c80 [DataPipe] Unifying API - removing options to have fn_args and fn_kwargs from IterDataPipes (#69560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69560

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32952100

Pulled By: NivekT

fbshipit-source-id: e0cc31408c7cf3220fe274feed1c7202a1aaae70
2021-12-16 18:09:52 -08:00
d90012689f [DataPipe] Control shuffle settings from DataLoader2 (#65756)
Summary:
Makes `shuffle` DataPipe sensitive to DataLoader(2) `shuffle` kwarg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65756

Reviewed By: albanD

Differential Revision: D31344867

Pulled By: VitalyFedyunin

fbshipit-source-id: e0084e0ac193ac784d6298328ca1222745681347
2021-12-14 07:35:26 -08:00
81a60b9813 [DataPipe] Adding output types to DataPipe interface file (#69647)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69647

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D32989067

Pulled By: NivekT

fbshipit-source-id: 2c2e71e9e514e0d584affaa0b71b7b0d07a2ddbf
2021-12-10 12:04:45 -08:00
357160e68e [DataPipe] Unifying API - removing nesting_level argument from FilterIterDataPipe (#69391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69391

As part of the efforts to unify the APIs across different data backends (e.g. TorchData, TorchArrow), we are making changes to different DataPipes' APIs. In this PR, we are removing the input argument `nesting_level` from `FilterIterDataPipe`.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32849462

Pulled By: NivekT

fbshipit-source-id: 91cf1dc03dd3d3cbd7a9c6ccbd791ade91355f30
2021-12-07 11:40:46 -08:00
4478b14e4c [DataPipe] Unifying API - removing nesting_level argument from MapperIterDataPipe (#69390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69390

As part of the efforts to unify the APIs across different data backends (e.g. TorchData, TorchArrow), we are making changes to different DataPipes' APIs. In this PR, we are removing the input argument `nesting_level` from `MapperIterDataPipe`.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32849465

Pulled By: NivekT

fbshipit-source-id: 963ce70b84a7658331d126e5ed9fdb12273c8e1f
2021-12-07 11:39:08 -08:00
6baaec30cd [DataPipe] Adding ShufflerMapDataPipe (#68606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68606

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32813290

Pulled By: NivekT

fbshipit-source-id: 8d1ebd5bc776563c23250f76a2efc1d395f1af9c
2021-12-03 11:36:33 -08:00
0465f64bb8 [DataPipe] Adding BatcherMapDataPipe (#68197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68197

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32440963

Pulled By: NivekT

fbshipit-source-id: 277cbe8d735afe341a7c189be20e1d334ecf9d4a
2021-12-02 07:27:17 -08:00
61a94495d9 [DataPipe] adding ZipperMapDataPipe (#68032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68032

Part of #57031

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D32263058

Pulled By: NivekT

fbshipit-source-id: 13a30ee9d9779284a9fd9bb7222fc41253c6fe3b
2021-11-11 10:36:05 -08:00
803e88d418 [DataPipe] Fixing pickling issues with fork and demux (#67930)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67930

Fixes #67848

Test Plan: Imported from OSS

Reviewed By: H-Huang

Differential Revision: D32222184

Pulled By: NivekT

fbshipit-source-id: 48871c45a855d92cd599e21f3b53827dd32c91ef
2021-11-09 07:54:02 -08:00
39215ddf84 [skip ci] Set test owners for dataloader tests (#66839)
Summary:
Action following https://github.com/pytorch/pytorch/issues/66232

cc SsnL VitalyFedyunin ejguan NivekT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66839

Reviewed By: ejguan

Differential Revision: D31761722

Pulled By: janeyx99

fbshipit-source-id: 8315ac03352c11b3215d89856b3cfda6cd78fa0c
2021-10-19 08:31:16 -07:00
8ebe1a924d [DataPipe] moving mux IterDataPipe test to the right location (#66277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66277

Previously, it was grouped together with the tests related to `MapDataPipe`, but it belongs with the `IterDataPipe` tests.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31485823

Pulled By: NivekT

fbshipit-source-id: d13d8c28cbfc305da0e3033d4109a0f971281a02
2021-10-08 08:32:29 -07:00
ed17851642 [DataPipe] adding test for IterableWrapperIterDataPipe (#66276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66276

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31485824

Pulled By: NivekT

fbshipit-source-id: c7b21636e4b17e264bfb5dbea69cd3c477472f0b
2021-10-08 08:32:26 -07:00
e808e3d3d6 [DataPipe] adding SequenceWrapperMapDataPipe (#66275)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66275

Once this is added to Core, TorchData's PR will not need a custom class and can use this wrapper instead.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31485822

Pulled By: NivekT

fbshipit-source-id: 790de27629c89c0ca7163a8ee5a09ee8b8233340
2021-10-08 08:32:24 -07:00
a1216061c1 [DataPipe] Fix deepcopy filehandle for Mapper and in-place modification for IterableWrapper (#65220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65220

Fixes #65221

- Remove deepcopy from Mapper to support file handles
- Make `IterableWrapper` deepcopy the iterable instance within each iterator to prevent in-place modification (which would yield different data per epoch)
- Convert `IDP` to `IterableWrapper` in test_datapipe.py
- Refine the variable names (avoid using `dp`, which is a module reference)
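
The effect of copying the iterable for every new iterator can be sketched as follows. `ToyIterableWrapper` is a hypothetical class written for illustration, not the `IterableWrapper` source.

```python
import copy


class ToyIterableWrapper:
    """Hypothetical wrapper; copies the source for every new iterator."""

    def __init__(self, iterable):
        self.iterable = iterable

    def __iter__(self):
        # Each pass works on its own copy, so in-place edits made by a
        # consumer never reach the original data.
        for item in copy.deepcopy(self.iterable):
            yield item


source = [[1], [2]]
wrapped = ToyIterableWrapper(source)
for row in wrapped:
    row.append(0)  # in-place modification by a consumer

second_epoch = list(wrapped)  # same data as the first epoch
```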

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31021886

Pulled By: ejguan

fbshipit-source-id: 72a9eee66c758e2717d591cd0942892bddedc223
2021-09-21 14:29:40 -07:00
cf60d24028 [DataPipe] Unlimited buffer for Forker and Demultiplexer (#64994)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64994

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D30934362

Pulled By: ejguan

fbshipit-source-id: d3b774d7e28c0b9659e999511e5a68c3929857d4
2021-09-20 09:30:39 -07:00
c625f971d3 [DataPipe] Make TarArchiveReader and ZipArchiveReader accept FileStream with attempt to close and additional warning (#64788)
Summary:
ghstack is not working for the second commit so I'm manually creating this PR for now. Please only look at changes related to the second commit in this PR (there is a PR for the first commit).

This PR removes TarArchiveReader's dependency on FileLoader DataPipe, by allowing it to use an IterDataPipe of path names as input rather than a tuple of path name and a stream.

It also adds tests to ensure that the DataPipe functions properly when it is read multiple times or reset halfway through reading.

The whole stack fixes https://github.com/pytorch/pytorch/issues/64281 - issues related to unclosed buffer stream.

Stack:
* __->__ https://github.com/pytorch/pytorch/issues/64788
* https://github.com/pytorch/pytorch/issues/64786

cc VitalyFedyunin ejguan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64788

Reviewed By: jbschlosser, ejguan

Differential Revision: D30901176

Pulled By: NivekT

fbshipit-source-id: 59746a8d0144fc6d3ce0feb2d76445b82e6d414e
2021-09-15 07:34:29 -07:00
c65128679b [DataPipe] Improve Mapper to accept input/output index when apply fn (#64951)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64951

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30910035

Pulled By: ejguan

fbshipit-source-id: d687fe10939920a3617a60552fe743e8526438a0
2021-09-14 15:46:42 -07:00
ab5e1c69a7 [WIP] Example of DataPipes and DataFrames integration (#60840)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60840

Test Plan: Imported from OSS

Reviewed By: wenleix, ejguan

Differential Revision: D29461080

Pulled By: VitalyFedyunin

fbshipit-source-id: 4909394dcd39e97ee49b699fda542b311b7e0d82
2021-09-13 18:50:15 -07:00
f3f410880a [DataPipe] Remove ZipArchiveReader's dependency on FileLoader (#64786)
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* https://github.com/pytorch/pytorch/issues/64788
* __->__ https://github.com/pytorch/pytorch/issues/64786

This PR removes ZipArchiveReader's dependency on FileLoader DataPipe, by allowing it to use an IterDataPipe of path names as input rather than a tuple of path name and a stream.

It also adds tests to ensure that the DataPipe functions properly when it is read multiple times or reset halfway through reading.
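
A reader that accepts path names and manages the archive handle itself might look roughly like this. The function `read_zip_members` is hypothetical and uses only the standard `zipfile` module; it is not the `ZipArchiveReader` code.

```python
import os
import tempfile
import zipfile


def read_zip_members(paths):
    """Hypothetical reader: takes archive path names, yields (member, bytes)."""
    for path in paths:
        # The reader opens and closes the archive itself, so no stream
        # is left open by an upstream loader.
        with zipfile.ZipFile(path) as archive:
            for name in archive.namelist():
                yield name, archive.read(name)


with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("a.txt", "hello")
        archive.writestr("b.txt", "world")
    members = list(read_zip_members([zip_path]))
```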

The whole stack fixes issues related to unclosed buffer stream (see https://github.com/pytorch/pytorch/issues/64281).

cc VitalyFedyunin ejguan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64786

Reviewed By: ngimel

Differential Revision: D30870968

Pulled By: NivekT

fbshipit-source-id: 64b04d1697b99772f2fa20fc141668e6b8e18c41
2021-09-10 16:49:17 -07:00
5060b69d62 [DataPipe] fixing tests related fork() to remove warnings (#64827)
Summary:
There are two warnings produced by `test_fork_datapipe`. This PR addresses the issues raised by those warnings without impacting the test cases.

cc VitalyFedyunin ejguan

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64827

Reviewed By: ejguan

Differential Revision: D30870528

Pulled By: NivekT

fbshipit-source-id: 580a001c6fa3ff6f8b04a7e5183e58861938204b
2021-09-10 11:01:42 -07:00
4ce9c530d6 [DataPipe] removing filter's inheritance from map (#64404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64404

This PR removes `filter`'s inheritance from `map`. This allows `filter` to not have a `__len__` function, which is the behavior we want.
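
The number of items a filter yields depends on the data, so it cannot be known without iterating. A minimal sketch with a hypothetical `ToyFilter` class that deliberately defines no `__len__`:

```python
class ToyFilter:
    """Hypothetical filtering pipe; deliberately defines no __len__."""

    def __init__(self, source, predicate):
        self.source = source
        self.predicate = predicate

    def __iter__(self):
        return (x for x in self.source if self.predicate(x))


evens = ToyFilter(range(10), lambda x: x % 2 == 0)
try:
    len(evens)
    has_len = True
except TypeError:
    # len() raises TypeError because the class defines no __len__.
    has_len = False

kept = list(evens)
```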

cc VitalyFedyunin ejguan

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D30713120

Pulled By: NivekT

fbshipit-source-id: 4d5d07555297ee2bd4b49842c0d26cdc00638f6c
2021-09-02 13:09:47 -07:00
4f43480186 [DataPipe] adding/removing __len__ for different DataPipe (#64398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64398

cc VitalyFedyunin ejguan

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30710437

Pulled By: NivekT

fbshipit-source-id: 524eda43a2faa0db0c1a662bf9bb4283f0ade83c
2021-09-02 13:08:32 -07:00
491bf7cb74 [DataPipe] adding description, __len__, tests for mux() (#64224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64224

cc VitalyFedyunin ejguan

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30651551

Pulled By: NivekT

fbshipit-source-id: f8af98ba71a592900b992a8077432062ec57bb48
2021-08-31 14:34:28 -07:00
0ef8760bf6 [DataPipe] implementing __len__ for fork (no valid length for demux) (#64215)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64215

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30648672

Pulled By: NivekT

fbshipit-source-id: 4780f2f6a79ae15a4009092475e7d92f96dd09a2
2021-08-31 08:32:31 -07:00
0deb7a0bc0 [DataPipe] implementing demux() (#63650)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63650

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30493944

Pulled By: NivekT

fbshipit-source-id: 0aa06dee8c7fb1744975b8f6a0694b90c11ef80d
2021-08-31 08:32:29 -07:00
eee054e6ea [DataPipe] implementing fork() (#63649)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63649

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D30493945

Pulled By: NivekT

fbshipit-source-id: 40db7d4134facd266d86bc0dc2edf2729c4e5842
2021-08-31 08:32:27 -07:00
af85bc5ffd Replace group_by_key by group_by IterDataPipe (#64220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64220

Remove `ByKeyGrouperIterDataPipe` due to duplicated functionality.
Fix a bug in `GrouperIterDataPipe` using the existing test.

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D30650542

Pulled By: ejguan

fbshipit-source-id: 666b4d28282fb4f49f3ff101b8d08be16a50d836
2021-08-30 18:45:44 -07:00
7946f8a9f6 Rename DataPipe to Op-er (#63325)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63325

Rename each DataPipe to an operation name ending with `-er`. The functional API should remain a verb, such as `read_from_tar`, `shuffle`, ... (discussed [here](https://github.com/facebookexternal/torchdata/pull/97#discussion_r688553905))
- Batch -> Batcher
- Collate -> Collator
- Concat -> Concater
- GroupByKey -> ByKeyGrouper ?
- ListDirFiles -> FileLister
- LoadFilesFromDisk -> FileLoader
- Map -> Mapper
- ReadFilesFromTar -> TarArchiveReader
- ReadFilesFromZip -> ZipArchiveReader
- ReadLinesFromFile -> LineReader
- Shuffle -> Shuffler
- ToBytes -> StreamReader
- Transforms -> Transformer
- Zip -> Zipper

Let me know if you have a better name for any of these DataPipes.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D30466950

Pulled By: ejguan

fbshipit-source-id: 72909dca7b3964ab83b965891f96cc1ecf62d049
2021-08-23 14:36:10 -07:00
383a33a0eb Make DataChunk support list in-place ops (#63422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63422

Fixes #63095

Make `DataChunk` delegate to the list methods, so that it supports in-place operations:
- `sort`
- `reverse`
- `append`
- `extend`
- `random.shuffle`
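
One way to get these operations is to build the chunk on top of `list`. The sketch below assumes that design; `ToyChunk` is a hypothetical stand-in, not the `DataChunk` source.

```python
import random


class ToyChunk(list):
    """Hypothetical list subclass standing in for DataChunk."""

    def __init__(self, items):
        super().__init__(items)


chunk = ToyChunk([3, 1, 2])
chunk.sort()
after_sort = list(chunk)

chunk.reverse()
chunk.append(0)
chunk.extend([9, 8])
before_shuffle = list(chunk)

# random.shuffle works in place through the list protocol.
random.Random(0).shuffle(chunk)
```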

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D30379027

Pulled By: ejguan

fbshipit-source-id: d176bd0cc8b89b915c7bb184ff243ab1f605616d
2021-08-18 08:48:47 -07:00
d1cbee7b2b Refactor BucketBatch (#63185)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63185

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D30288893

Pulled By: ejguan

fbshipit-source-id: b88b792d12a83c99d8ea9e516e3b4c54a82100f6
2021-08-16 06:42:56 -07:00
56d609d93e Replace str by repr for DataChunk (#63184)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/63184

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D30288892

Pulled By: ejguan

fbshipit-source-id: 45c88fdd3987e234f2c22ebbbfd8d5044983c34c
2021-08-16 06:41:38 -07:00