Data Basics
Introduction to DataLoader and Dataset
Read through link
Common Objects in DataLoader
- Sampler: randomly chooses an index per iteration. It yields indices when `batch_size` is not `None`.
  - For `IterableDataset`, it keeps yielding `None`(s) per iteration using `_InfiniteConstantSampler`.
- Fetcher: takes a single index or a batch of indices and returns the corresponding data from the Dataset. It invokes `collate_fn` over each batch of data and drops the remaining unfilled batch if `drop_last` is set.
  - For `IterableDataset`, it simply takes the next batch-size elements as a batch.
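The roles of the Sampler and `collate_fn` can be seen directly with the public classes in `torch.utils.data`. Below is a minimal sketch, not the actual DataLoader internals, using a toy list as a map-style dataset:

```python
import torch
from torch.utils.data import BatchSampler, RandomSampler, default_collate

# Toy map-style "dataset": 10 samples, each a tensor of shape (2,).
data = [torch.tensor([i, i + 1]) for i in range(10)]

sampler = RandomSampler(data)                                  # yields single shuffled indices
batch_sampler = BatchSampler(sampler, batch_size=4, drop_last=True)

for indices in batch_sampler:                                  # e.g. [7, 2, 9, 0]
    samples = [data[i] for i in indices]                       # what the Fetcher does
    batch = default_collate(samples)                           # stacks into a (4, 2) tensor
    print(indices, batch.shape)
```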
Data/Control flow in DataLoader
- Single Process:
Sampler
|
index/indices
|
V
Fetcher
|
index/indices
|
V
dataset
|
V
collate_fn
|
V
output
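A minimal single-process example that exercises this flow: the sampler picks indices, the fetcher calls `dataset[idx]`, and the default `collate_fn` stacks the samples into one batch. The `SquaresDataset` here is a hypothetical toy dataset:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    """Hypothetical map-style dataset: index -> (index, index**2)."""
    def __len__(self):
        return 100

    def __getitem__(self, idx):
        return torch.tensor(idx), torch.tensor(idx * idx)

# num_workers=0 keeps everything in the main process (the flow shown above).
loader = DataLoader(SquaresDataset(), batch_size=8, shuffle=True, num_workers=0)

for x, y in loader:              # each iteration yields one collated batch
    print(x.shape, y.shape)      # torch.Size([8]) torch.Size([8])
    break
```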
- Multiple processes:
Sampler (Main process)
|
index/indices
|
V
Index Multiprocessing Queue (one per worker)
|
index/indices
|
V
Fetcher (Worker process)
|
index/indices
|
V
dataset
|
Batch of data
|
V
collate_fn
|
V
Result Multiprocessing Queue
|
Data
|
V
pin_memory_thread (Main process)
|
V
output
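The same hypothetical dataset with `num_workers=2` follows the multi-process flow above: indices are pushed onto the per-worker index queues, the Fetcher, dataset, and `collate_fn` run in the worker processes, and `pin_memory=True` enables the `pin_memory_thread` in the main process. A sketch (the `__main__` guard matters on platforms whose default start method is spawn):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return torch.tensor(idx), torch.tensor(idx * idx)

if __name__ == "__main__":
    loader = DataLoader(
        SquaresDataset(),
        batch_size=32,
        shuffle=True,
        num_workers=2,            # Fetcher/dataset/collate_fn run in worker processes
        pin_memory=True,          # pin_memory_thread copies batches to pinned memory
        prefetch_factor=2,        # batches kept in flight per worker
        persistent_workers=True,  # keep workers alive across epochs
    )
    for x, y in loader:
        pass                      # batches arrive via the result queue
```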
This is just the general data and control flow in DataLoader; there are further details such as prefetching, worker status tracking, etc.
Common gotchas for DataLoader
Most common questions about DataLoader arise when multiprocessing is enabled (multiple workers).
- The default multiprocessing start method differs across platforms in Python (https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods).
  - Control randomness per worker using `worker_init_fn`. Otherwise, DataLoader either becomes non-deterministic when using spawn or shares the same random state across workers when using fork (see the sketch after this list).
  - COW in fork (copy-on-write effectively becomes copy-on-access in Python because of reference counting). The simplest fix in the Dataset implementation is to store data as a Tensor or NumPy array instead of arbitrary Python objects like list and dict.
- Difference between a map-style Dataset and an iterable-style Dataset:
  - A map-style Dataset can use the indices sampled in the main process to get automatic sharding.
  - An iterable-style Dataset requires users to manually implement sharding inside the `__iter__` method using `torch.utils.data.get_worker_info()`. Please check the example (and the sketch after this list).
  - Shuffle is not enabled for an iterable-style Dataset. If needed, users have to implement shuffle utilities inside the `IterableDataset` class. (This is solved by the TorchData project.)
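Sketches for two of the gotchas above, using a hypothetical `RangeIterable` dataset: `seed_worker` re-seeds NumPy and `random` from the per-worker torch seed, and `__iter__` shards the range across workers with `get_worker_info()`:

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

def seed_worker(worker_id):
    # torch seeds each worker automatically; derive seeds for numpy/random from it
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

class RangeIterable(IterableDataset):
    """Hypothetical iterable-style dataset over [start, end)."""
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:                 # single-process loading: yield everything
            return iter(range(self.start, self.end))
        # multi-process loading: each worker yields a strided shard
        return iter(range(self.start + info.id, self.end, info.num_workers))

if __name__ == "__main__":
    loader = DataLoader(RangeIterable(0, 16), batch_size=4, num_workers=2,
                        worker_init_fn=seed_worker)
    for batch in loader:
        print(batch)  # each worker contributes batches from its own shard
```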
Introduction to next-generation Data API (TorchData)
Read through link and link.
Expected features:
- Automatic/dynamic sharding
- Determinism Control
- Snapshotting
- DataFrame integration
- etc.
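A hedged sketch of the DataPipe-style API from the TorchData project; the exact API has changed across releases, so treat the names used here (`IterableWrapper`, `sharding_filter`) as illustrative rather than canonical:

```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

def times_two(x):
    return x * 2

# Build a pipeline: shuffling and sharding are part of the pipe itself,
# unlike a plain IterableDataset where both are manual.
dp = IterableWrapper(range(100))
dp = dp.shuffle()            # determinism is controlled by the pipeline
dp = dp.sharding_filter()    # each DataLoader worker keeps only its shard
dp = dp.map(times_two)

if __name__ == "__main__":
    for batch in DataLoader(dp, batch_size=8, num_workers=2):
        print(batch)
```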
Lab for DataLoader and DataPipe
Go to N1222094 for the Data Lab.
Next
Unit 8: function transforms/Training Loops (Optional) - vmap
I would love to contribute to PyTorch!