Mirror of https://github.com/pytorch/pytorch.git, synced 2025-10-20 21:14:14 +08:00
## Motivation

Many PRs optimizing samplers (e.g. https://github.com/pytorch/pytorch/pull/147706, https://github.com/pytorch/pytorch/pull/137423) rely on an ad hoc script for benchmarking samplers. The script and its outputs are often copied into PRs. We want to begin centralizing benchmarks for torch.utils.data components.

## What?

* This PR adds a new `data` sub-folder in `benchmarks`, aimed at covering benchmarking scripts for torch.utils.data components such as the dataloader and samplers.
* Specifically, this PR includes a simple script to time samplers. This script is often copy-pasted into PRs optimizing samplers; having it in a centralized location should prevent that and allow a common standard.

## Output

```
Benchmark Results:
+--------------+-------------+----------------+-----------+-----------+
|   Batch Size | Drop Last   |   Original (s) |   New (s) | Speedup   |
+==============+=============+================+===========+===========+
|            4 | True        |         0.004  |    0.0088 | -119.62%  |
+--------------+-------------+----------------+-----------+-----------+
|            4 | False       |         0.0083 |    0.009  | -9.23%    |
+--------------+-------------+----------------+-----------+-----------+
|            8 | True        |         0.003  |    0.0074 | -147.64%  |
+--------------+-------------+----------------+-----------+-----------+
|            8 | False       |         0.0054 |    0.0075 | -38.72%   |
+--------------+-------------+----------------+-----------+-----------+
|           64 | True        |         0.0021 |    0.0056 | -161.92%  |
+--------------+-------------+----------------+-----------+-----------+
|           64 | False       |         0.0029 |    0.0055 | -92.50%   |
+--------------+-------------+----------------+-----------+-----------+
|          640 | True        |         0.002  |    0.0055 | -168.75%  |
+--------------+-------------+----------------+-----------+-----------+
|          640 | False       |         0.0024 |    0.0062 | -161.35%  |
+--------------+-------------+----------------+-----------+-----------+
|         6400 | True        |         0.0021 |    0.0055 | -160.13%  |
+--------------+-------------+----------------+-----------+-----------+
|         6400 | False       |         0.0021 |    0.0068 | -215.46%  |
+--------------+-------------+----------------+-----------+-----------+
|        64000 | True        |         0.0042 |    0.0065 | -55.29%   |
+--------------+-------------+----------------+-----------+-----------+
|        64000 | False       |         0.0029 |    0.0077 | -169.56%  |
+--------------+-------------+----------------+-----------+-----------+
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156974
Approved by: https://github.com/ramanishsingh
# PyTorch Data Benchmarks

This directory contains benchmarks for the `torch.utils.data` module components, focusing on the performance of samplers.
## Dependencies

The benchmarks require the following dependencies:

- numpy
- tabulate

You can install them using pip:

```bash
pip install numpy tabulate
```
## Running the benchmarks

To run the `BatchSampler` benchmark:

```bash
python samplers_benchmark.py
```
## Sampler Benchmark

The `samplers_benchmark.py` script benchmarks the performance of PyTorch's `BatchSampler` against an alternative implementation as an example. It tests the following parameters:
- Batch sizes: 4, 8, 64, 640, 6400, 64000
- Drop last options: True, False
- Each configuration is run 10 times and averaged
- Results include speedup percentage calculations
## Output
The benchmark outputs a table with the following columns:
- Batch Size
- Drop Last
- Original (s): Time taken by the original implementation
- New (s): Time taken by the alternative implementation
- Speedup: Percentage improvement of the new implementation over the original
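The Speedup column appears to be the relative time reduction, so negative values (as in the PR's sample output) mean the new implementation was slower. A one-line sketch of that calculation:

```python
def speedup_pct(original_s, new_s):
    # Percentage improvement of the new implementation relative to the
    # original; negative when the new implementation is slower.
    return (original_s - new_s) / original_s * 100

print(f"{speedup_pct(0.1234, 0.1000):.2f}%")
```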
Example output:
```
+------------+-----------+---------------+----------+---------+
| Batch Size | Drop Last | Original (s)  | New (s)  | Speedup |
+============+===========+===============+==========+=========+
| 4          | True      | 0.1234        | 0.1000   | 18.96%  |
+------------+-----------+---------------+----------+---------+
| 4          | False     | 0.1345        | 0.1100   | 18.22%  |
+------------+-----------+---------------+----------+---------+
...
```
## Extending the Benchmark

To benchmark a different implementation, on your local copy:

1. Modify the `NewBatchSampler` class in `samplers_benchmark.py` with your implementation. Similarly, replace `BatchSampler` with the corresponding PyTorch implementation.
   - Be sure to include all inputs, such as `replacement` for `RandomSampler` and its variations.
2. Run the benchmark to compare its performance against the original.
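As an illustration, a drop-in alternative might look like the sketch below. This is a hypothetical example of the kind of class you would substitute, not the implementation from any PR; it keeps `BatchSampler`'s `(sampler, batch_size, drop_last)` interface so the rest of the script needs no changes.

```python
from itertools import islice

class NewBatchSampler:
    """Hypothetical alternative batch sampler, built on itertools.islice.

    Matches torch.utils.data.BatchSampler's interface and documented
    behavior; shown only to illustrate what to drop into the script.
    """

    def __init__(self, sampler, batch_size, drop_last):
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self):
        it = iter(self.sampler)
        while True:
            # Pull up to batch_size indices at a time from the sampler.
            batch = list(islice(it, self.batch_size))
            if not batch:
                return
            if len(batch) < self.batch_size and self.drop_last:
                # Discard the trailing partial batch when drop_last=True.
                return
            yield batch

    def __len__(self):
        n = len(self.sampler)
        if self.drop_last:
            return n // self.batch_size
        return (n + self.batch_size - 1) // self.batch_size
```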