# `torch.compile()` Benchmarking

This directory contains benchmarking code for TorchDynamo and many
backends including TorchInductor. It includes three main benchmark suites:

- [TorchBenchmark](https://github.com/pytorch/benchmark): A diverse set of models, initially seeded from
highly cited research models as ranked by [Papers With Code](https://paperswithcode.com). See [torchbench
installation](https://github.com/pytorch/benchmark#installation) and `torchbench.py` for the low-level runner.
The [Makefile](Makefile) also contains the commands needed to set up TorchBenchmark to match the versions used in
PyTorch CI (a rough setup sketch follows this list).

- Models from [HuggingFace](https://github.com/huggingface/transformers): Primarily transformer models, with
representative models chosen for each available category. The low-level runner (`huggingface.py`) automatically
downloads and installs the needed dependencies on first run.

- Models from [TIMM](https://github.com/huggingface/pytorch-image-models): Primarily vision models, with representative
models chosen for each available category. The low-level runner (`timm_models.py`) automatically downloads and
installs the needed dependencies on first run.
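
As a rough local-setup sketch: the exact, CI-matched steps live in the [Makefile](Makefile) and the torchbench
installation guide, so treat the commands below as illustrative only.

```
# Illustrative only: set up TorchBenchmark next to your PyTorch checkout.
# See https://github.com/pytorch/benchmark#installation for the authoritative steps.
git clone https://github.com/pytorch/benchmark
cd benchmark
python install.py   # downloads and installs model dependencies
cd ..
# The HuggingFace and TIMM runners need no manual setup; they fetch their
# dependencies automatically on first run.
```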

## GPU Performance Dashboard

Daily results from the benchmarks here are available in the [TorchInductor
Performance Dashboard](https://hud.pytorch.org/benchmark/compilers),
currently run on an NVIDIA A100 GPU.

The [inductor-perf-test-nightly.yml](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly.yml)
workflow generates the data in the performance dashboard. If you have the needed permissions, you can benchmark
your own branch on the PyTorch GitHub repo as follows:

1) Select "Run workflow" in the top right of the [workflow](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly.yml)
2) Select the branch you want to benchmark
3) Choose the options (such as training vs inference)
4) Click "Run workflow"
5) Wait for the job to complete (4 to 12 hours depending on backlog)
6) Go to the [dashboard](https://hud.pytorch.org/benchmark/compilers)
7) Select your branch and commit at the top of the dashboard
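
If you prefer the command line, the same workflow can be dispatched with the GitHub CLI. This is only a sketch:
options such as training vs inference are workflow inputs whose exact names are not listed here, so set them in the
web UI or check the workflow file.

```
# Requires the GitHub CLI (gh) and permissions on pytorch/pytorch.
gh workflow run inductor-perf-test-nightly.yml --repo pytorch/pytorch --ref <your-branch>
```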

The dashboard compares two commits: a "Base Commit" and a "New Commit".
An entry such as `2.38x → 2.41x` means that the performance improved
from `2.38x` in the base to `2.41x` in the new commit. All performance
results are normalized to eager-mode PyTorch (`1x`), and higher is better.

## CPU Performance Dashboard

The [TorchInductor CPU Performance
Dashboard](https://github.com/pytorch/pytorch/issues/93531) is tracked
on a GitHub issue and updated periodically.

## Running Locally

Raw commands used to generate the data for
the performance dashboards can be found
[here](https://github.com/pytorch/pytorch/blob/641ec2115f300a3e3b39c75f6a32ee3f64afcf30/.ci/pytorch/test.sh#L343-L418).

To summarize, there are three scripts, one for each benchmark suite:
- `./benchmarks/dynamo/torchbench.py ...`
- `./benchmarks/dynamo/huggingface.py ...`
- `./benchmarks/dynamo/timm_models.py ...`

Each of these scripts takes the same set of arguments. The ones used by the dashboards are:
- `--accuracy` or `--performance`: selects between checking correctness and measuring speedup (both are run for the dashboard).
- `--training` or `--inference`: selects between measuring training or inference (both are run for the dashboard).
- `--device=cuda` or `--device=cpu`: selects the device to measure.
- `--amp`, `--bfloat16`, `--float16`, `--float32`: selects the precision to use; `--amp` is used for training and `--bfloat16` for inference.
- `--cold-start-latency`: disables caching to accurately measure compile times.
- `--backend=inductor`: selects TorchInductor as the compiler backend to measure. Many more are available; see `--help`.
- `--output=<filename>.csv`: where to write the results.
- `--dynamic-shapes --dynamic-batch-only`: used when the `dynamic` config is enabled.
- `--disable-cudagraphs`: used by configurations that do not enable cudagraphs (the default).
- `--freezing`: enables additional inference-only optimizations.
- `--cpp-wrapper`: enables C++ wrapper code to lower overheads.
- `TORCHINDUCTOR_MAX_AUTOTUNE=1` (environment variable): used to measure max-autotune mode, which is run weekly due to its longer compile times.
- `--export-aot-inductor`: benchmarks ahead-of-time compilation mode.
- `--total-partitions` and `--partition-id`: used to parallelize benchmarking across different machines (see the sketch after this list).
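
For example, a sketch of a sharded max-autotune inference run, using only the flags listed above (the shard counts
and output filename are illustrative):

```
# Illustrative: run shard 0 of 4 of the HuggingFace suite with max-autotune enabled.
TORCHINDUCTOR_MAX_AUTOTUNE=1 ./benchmarks/dynamo/huggingface.py \
  --performance --inference --bfloat16 --backend=inductor \
  --total-partitions 4 --partition-id 0 \
  --output=huggingface_inference_maxautotune_shard0.csv
```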

For debugging, you can run just a single benchmark by adding the `--only=<NAME>` flag.
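
For instance, a single-model debug run might look like this (the model name is illustrative; use any model name from
the chosen suite):

```
# Illustrative: benchmark only one TorchBench model in inference mode.
./benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend=inductor --only=resnet50
```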

A complete list of options can be seen by running any of the runners with the `--help` flag.
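
For example:

```
./benchmarks/dynamo/torchbench.py --help
```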

As an example, the commands to run the first line of the dashboard (performance only) would be:
```
./benchmarks/dynamo/torchbench.py --performance --training --amp --backend=inductor --output=torchbench_training.csv
./benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend=inductor --output=torchbench_inference.csv

./benchmarks/dynamo/huggingface.py --performance --training --amp --backend=inductor --output=huggingface_training.csv
./benchmarks/dynamo/huggingface.py --performance --inference --bfloat16 --backend=inductor --output=huggingface_inference.csv

./benchmarks/dynamo/timm_models.py --performance --training --amp --backend=inductor --output=timm_models_inference.csv
./benchmarks/dynamo/timm_models.py --performance --inference --bfloat16 --backend=inductor --output=timm_models_inference.csv
```
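
The corresponding accuracy checks (also run for the dashboard) use the same runners with `--accuracy` in place of
`--performance`; the output filename below is illustrative:

```
./benchmarks/dynamo/torchbench.py --accuracy --training --amp --backend=inductor --output=torchbench_training_accuracy.csv
```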