From 5bd38c842892dfcd28de08bf82225bd095250ef4 Mon Sep 17 00:00:00 2001
From: Erjia Guan
Date: Mon, 11 Oct 2021 12:51:23 -0400
Subject: [PATCH] Add Data onboarding resource

---
 Core-Frontend-Onboarding.md |  3 +-
 Data-Basics.md              | 93 +++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+), 1 deletion(-)
 create mode 100644 Data-Basics.md

diff --git a/Core-Frontend-Onboarding.md b/Core-Frontend-Onboarding.md
index 0f98102..91ac0fd 100644
--- a/Core-Frontend-Onboarding.md
+++ b/Core-Frontend-Onboarding.md
@@ -8,6 +8,7 @@ New to developing PyTorch? We've got you covered.
 - [[Dispatcher and Python bindings|PyTorch-dispatcher-walkthrough]]
 - [[nn|nn-Basics]], [[C++|Cpp-API-Quick-Walkthrough]]
 - [[CUDA-basics]]
+- [[data|Data-Basics]]

 Some further things:
-- optim, data
\ No newline at end of file
+- optim

diff --git a/Data-Basics.md b/Data-Basics.md
new file mode 100644
index 0000000..549305f
--- /dev/null
+++ b/Data-Basics.md
@@ -0,0 +1,93 @@
## Introduction to DataLoader and Dataset

Read through the [`torch.utils.data` documentation](https://pytorch.org/docs/stable/data.html).

### Common Objects in DataLoader
- Sampler: Produces the index (or indices) for each iteration, randomly if `shuffle=True`. When `batch_size` is not `None`, indices are yielded one batch at a time.
  - For `IterableDataset`, there are no indices to sample, so DataLoader keeps yielding `None` per iteration using [`_InfiniteConstantSampler`](https://github.com/pytorch/pytorch/blob/0be36d798ba959bfda6c448fc4832b5691df6e61/torch/utils/data/dataloader.py#L55-L68).
- Fetcher: Takes a single index or a batch of indices and returns the corresponding data from the Dataset. It invokes `collate_fn` on each batch of data and drops the final unfilled batch if `drop_last` is set.
  - For `IterableDataset`, it simply takes the next batch-size elements from the iterator as a batch.

### Data/Control flow in DataLoader
- Single process:
```
  Sampler
     |
index/indices
     |
     V
  Fetcher
     |
index/indices
     |
     V
  dataset
     |
Batch of data
     |
     V
 collate_fn
     |
     V
  output
```
- Multiple processes:
```
      Sampler (Main process)
         |
    index/indices
         |
         V
Index Multiprocessing Queue (one per worker)
         |
    index/indices
         |
         V
      Fetcher (Worker process)
         |
    index/indices
         |
         V
      dataset
         |
    Batch of data
         |
         V
     collate_fn
         |
         V
Result Multiprocessing Queue
         |
        Data
         |
         V
pin_memory_thread (Main process, if `pin_memory=True`)
         |
         V
      output
```
This is just the general data and control flow in DataLoader; there is further machinery for prefetching, tracking worker status, etc.

## Common gotchas for DataLoader
Most common questions about DataLoader involve multiple workers, i.e., when multiprocessing is enabled (`num_workers > 0`).
- The default multiprocessing start method differs across platforms (see the [Python documentation](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods)).
  - Control randomness per worker using `worker_init_fn` (see the seeding sketch after this list). Otherwise, DataLoader either becomes non-deterministic when using spawn, or shares the same random state across workers when using fork.
  - Copy-on-write (COW) in fork effectively becomes copy-on-access in Python, because updating an object's reference count writes to its memory page. The simplest fix when implementing a Dataset is to store data in Tensors or NumPy arrays instead of arbitrary Python objects like lists and dicts.
- Differences between Map-style Dataset and Iterable-style Dataset:
  - A Map-style Dataset consumes the indices sampled in the main process, so sharding across workers is automatic.
  - An Iterable-style Dataset requires users to implement sharding manually inside the `__iter__` method using `torch.utils.data.get_worker_info()`. Please check the [example](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) and the sharding sketch after this list.
- Shuffling is not enabled for Iterable-style Datasets. If needed, users must implement shuffle utilities inside the `IterableDataset` class. (This is solved by the TorchData project.)
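The seeding sketch referenced above: a minimal example of per-worker seeding with `worker_init_fn`, following the randomness notes in the linked DataLoader documentation. `ToyDataset` and `seed_worker` are illustrative names, not part of the PyTorch API:

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Tiny map-style Dataset; data lives in a Tensor rather than a Python list."""

    def __init__(self):
        self.data = torch.arange(16, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Per-sample noise makes the effect of worker seeding observable.
        return self.data[idx] + torch.rand(1)


def seed_worker(worker_id):
    # DataLoader seeds torch's RNG per worker; propagate that seed to the
    # other RNGs a Dataset commonly uses (NumPy, stdlib random).
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


if __name__ == "__main__":  # guard is required when the start method is spawn
    g = torch.Generator()
    g.manual_seed(0)  # fixed base seed -> reproducible worker seeds
    loader = DataLoader(ToyDataset(), batch_size=4, num_workers=2,
                        worker_init_fn=seed_worker, generator=g)
    for batch in loader:
        print(batch)
```

Passing a seeded `generator` also makes the sampler side reproducible, so repeated runs yield the same batches in the same order.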
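The sharding sketch referenced above: a minimal `IterableDataset` that splits its range across workers inside `__iter__`, in the spirit of the linked example. `RangeIterableDataset` and its bounds are made up for illustration:

```python
import math

from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class RangeIterableDataset(IterableDataset):
    """Streams integers in [start, end), sharded across DataLoader workers."""

    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: yield the full range.
            lo, hi = self.start, self.end
        else:
            # Worker process: take this worker's contiguous slice so no
            # element is yielded by more than one worker.
            per_worker = math.ceil((self.end - self.start) / info.num_workers)
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))


if __name__ == "__main__":
    loader = DataLoader(RangeIterableDataset(0, 10), batch_size=4, num_workers=2)
    print([batch.tolist() for batch in loader])
    # Without the get_worker_info() branch, every worker would replay the
    # full range and each element would appear num_workers times.
```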
### Lab for DataLoader [WIP]
- Reproduce the "leaking memory" issue with fork
- Control randomness per worker

## Introduction to the next-generation Data API (TorchData)
Read through [why composable data loading](https://github.com/pytorch/data#why-composable-data-loading) and [what DataPipes are](https://github.com/pytorch/data#what-are-datapipes); a minimal DataPipe sketch appears at the end of this page.
Expected features:
- Automatic/Dynamic sharding
- Determinism control
- Snapshotting
- DataFrame integration
- etc.

### Task for TorchData
Please reach out to the onboarding POC to get a task.
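To make the composable style concrete, here is the minimal DataPipe sketch referenced above. It assumes the `torchdata` package from the linked README is installed; operator names and module paths may vary between releases:

```python
# A hedged sketch of composable data loading with DataPipes (torchdata).
from torchdata.datapipes.iter import IterableWrapper

# Wrap a plain iterable, then chain small functional transforms instead of
# burying all the logic inside one monolithic Dataset.
pipe = IterableWrapper(range(10))
pipe = pipe.filter(lambda x: x % 2 == 0)  # keep even numbers
pipe = pipe.map(lambda x: x * x)          # square each element
pipe = pipe.shuffle()                     # buffered shuffle, seed-controllable

for item in pipe:
    print(item)
```

Each stage is itself a DataPipe, so a chain like this can later be handed to a DataLoader or extended (e.g., with sharding) without rewriting the earlier stages.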