Add Data onboarding resource

Erjia Guan
2021-10-11 12:51:23 -04:00
parent 2e5489557d
commit 5bd38c8428
2 changed files with 95 additions and 1 deletion

@@ -8,6 +8,7 @@ New to developing PyTorch? We've got you covered.
- [[Dispatcher and Python bindings|PyTorch-dispatcher-walkthrough]]
- [[nn|nn-Basics]], [[C++|Cpp-API-Quick-Walkthrough]]
- [[CUDA-basics]]
- [[data|Data-Basics]]
Some further things:
- optim, data
- optim

93 Data-Basics.md Normal file

@@ -0,0 +1,93 @@
## Introduction to DataLoader and Dataset
Read through the [`torch.utils.data` documentation](https://pytorch.org/docs/stable/data.html)
### Common Objects in DataLoader
- Sampler: randomly chooses an index per iteration, and yields a batch of indices per iteration when `batch_size` is not `None`.
  - For `IterableDataset`, it keeps yielding `None` per iteration via [`_InfiniteConstantSampler`](https://github.com/pytorch/pytorch/blob/0be36d798ba959bfda6c448fc4832b5691df6e61/torch/utils/data/dataloader.py#L55-L68)
- Fetcher: takes a single index or a batch of indices and returns the corresponding data from the Dataset. It invokes `collate_fn` on each batch of data, and drops the last incomplete batch when `drop_last` is set.
  - For `IterableDataset`, it simply takes the next batch-size elements as a batch.
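To see the Sampler contract concretely, a `BatchSampler` can be driven by hand; this is a minimal sketch using public APIs, not DataLoader internals:
```python
from torch.utils.data import BatchSampler, RandomSampler

# RandomSampler yields one index at a time; BatchSampler groups them into
# lists of batch_size indices. drop_last discards the final incomplete batch.
sampler = BatchSampler(RandomSampler(range(7)), batch_size=3, drop_last=True)
print(list(sampler))  # e.g. [[4, 0, 6], [2, 5, 1]] -- the leftover 7th index is dropped
```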
### Data/Control flow in DataLoader
- Single process:
```
Sampler
|
index/indices
|
V
Fetcher
|
index/indices
|
V
dataset
|
Batch of data
|
V
collate_fn
|
V
output
```
- Multiple processes:
```
Sampler (Main process)
|
index/indices
|
V
Index Multiprocessing Queue (one queue per worker; indices go to the next healthy worker)
|
index/indices
|
V
Fetcher (Worker process)
|
index/indices
|
V
dataset
|
Batch of data
|
V
collate_fn
|
V
Result Multiprocessing Queue
|
Data
|
V
pin_memory_thread (Main process)
|
V
output
```
This is only the general data and control flow in DataLoader; further functionality, such as prefetching and worker-status tracking, is layered on top. The sketches below approximate both flows using public APIs.
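The single-process flow can be mimicked in a few lines. This is an illustrative sketch: the toy dataset and the use of `torch.stack` as a stand-in for the default `collate_fn` are assumptions, not DataLoader internals.
```python
import torch
from torch.utils.data import BatchSampler, Dataset, RandomSampler

class SquaresDataset(Dataset):  # hypothetical toy map-style dataset
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return torch.tensor(idx ** 2)

dataset = SquaresDataset()
sampler = BatchSampler(RandomSampler(dataset), batch_size=4, drop_last=False)

for indices in sampler:                      # Sampler -> batch of indices
    samples = [dataset[i] for i in indices]  # Fetcher reads from the dataset
    batch = torch.stack(samples)             # stand-in for collate_fn -> output
```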
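For the multi-process flow, DataLoader creates the queues, the workers, and the pinning thread automatically; a typical invocation looks like this (the dataset here is a stand-in):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":  # required on platforms whose default start method is spawn
    dataset = TensorDataset(torch.arange(100, dtype=torch.float32))
    loader = DataLoader(dataset, batch_size=8, shuffle=True,
                        num_workers=2,    # fetch/collate run in worker processes
                        pin_memory=True)  # pin_memory_thread runs in the main process
    for (batch,) in loader:
        ...  # batches arrive via the result queue, already collated and pinned
```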
## Common gotchas for DataLoader
Most common questions about DataLoader arise when multiprocessing is enabled via multiple workers.
- The default multiprocessing start method differs across platforms; see [contexts and start methods](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) in the Python docs.
- Control randomness per worker using `worker_init_fn`. Otherwise, DataLoader either becomes non-deterministic when using spawn or shares the same random state across workers when using fork. (A seeding sketch appears in the lab section below.)
- Copy-on-write in fork effectively becomes copy-on-access in Python, because reference counting writes to every object that is merely read. The simplest mitigation when implementing a Dataset is to store data in Tensors or NumPy arrays rather than arbitrary Python objects like lists and dicts.
- Difference between map-style Dataset and iterable-style Dataset:
  - A map-style Dataset gets automatic sharding, since indices are sampled in the main process and distributed across workers.
  - An iterable-style Dataset requires users to implement sharding manually inside the `__iter__` method using `torch.utils.data.get_worker_info()`; see the [docs example](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset), adapted in the sketch after this list.
  - Shuffling is not provided for iterable-style Datasets; if needed, users must implement it inside the `IterableDataset` class. (This is solved by the TorchData project.)
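Here is a version of the sharding pattern referenced above, adapted from the official docs example: each worker inspects `get_worker_info()` and consumes a disjoint slice of the range.
```python
import math
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class RangeIterableDataset(IterableDataset):
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:  # single-process data loading: use the full range
            lo, hi = self.start, self.end
        else:             # in a worker: take this worker's shard of the range
            per_worker = int(math.ceil((self.end - self.start) / info.num_workers))
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))

if __name__ == "__main__":
    ds = RangeIterableDataset(0, 10)
    # Without the sharding above, each worker would yield all 10 elements.
    print(list(DataLoader(ds, num_workers=2)))
```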
### Lab for DataLoader [WIP]
- Reproduce the "memory leak" issue with fork (see the copy-on-access note above)
- Control randomness per worker (a starting sketch is shown below)
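A possible starting point for the randomness task, following the seeding recipe from the PyTorch reproducibility notes; the toy dataset is an assumption:
```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Each worker already receives a distinct torch seed; derive the other RNGs from it.
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)

if __name__ == "__main__":
    dataset = TensorDataset(torch.arange(16, dtype=torch.float32))
    loader = DataLoader(dataset, batch_size=4, num_workers=2,
                        worker_init_fn=seed_worker,
                        generator=torch.Generator().manual_seed(0))  # fixes the base seed
```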
## Introduction to next-generation Data API (TorchData)
Read through [Why composable data loading?](https://github.com/pytorch/data#why-composable-data-loading) and [What are DataPipes?](https://github.com/pytorch/data#what-are-datapipes). A minimal DataPipe sketch follows the feature list below.
Expected features:
- Automatic/dynamic sharding
- Determinism Control
- Snapshotting
- DataFrame integration
- etc.
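A minimal sketch of composable loading with DataPipes, assuming a torchdata install (the exact import path and functional API have shifted across early releases):
```python
from torchdata.datapipes.iter import IterableWrapper

# Each operation returns a new DataPipe, so loading pipelines compose functionally.
dp = IterableWrapper(range(10)).map(lambda x: x * 2).shuffle().batch(3)
for batch in dp:
    print(batch)
```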
### Task for TorchData
Please reach out to the onboarding POC to get a task.