This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus more on different cuda hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode.
**Benchmark Configuration and Coverage:**
* Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward run. The configs with tag "long" for the above mentioned files are being run in CI.
* The CI benchmarking is running on various hardwares: H100, A100.
* The CI job also uploads the microbenchmarking outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162530
Approved by: https://github.com/huydhn
Co-authored-by: Huy Do <huydhn@gmail.com>
This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to existing Eager and JIT. It also adds peak memory measurement (fwd/bwd pass); improves the output format in JSON to be used by dashboard for reporting; and introduce some more CLI options. The new CLI flags introduced are:
- Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit`
- Added `--benchmark-name` argument for customizing the benchmark name in output
- Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name
Sample command to run a single operator:
`python -m pt.mm_test --use-compile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394
Approved by: https://github.com/jbschlosser
To better support the integration of operator benchmark performance data into the OSS benchmark database for the dashboard, I’ve added a JSON output format that meets the required specifications: https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database#output-format
Since the current operator benchmark already has a flag `--output-json` to support saving the results into a JSON file, I add a new flag `--output-json-for-dashboard` for this feature.
At the same time, I renamed the `--output-dir` to `--output-csv` for a clearer and more intuitive expression.
An example of the JSON output of the operator benchmark.
```
[
{
"benchmark": {
"name": "PyTorch operator benchmark - add_M1_N1_K1_cpu",
"mode": "inference",
"dtype": "float32",
"extra_info": {
"input_config": "M: 1, N: 1, K: 1, device: cpu"
}
},
"model": {
"name": "add_M1_N1_K1_cpu",
"type": "micro-benchmark",
"origins": [
"pytorch"
]
},
"metric": {
"name": "latency",
"unit": "us",
"benchmark_values": [
2.074
],
"target_value": null
}
},
{
"benchmark": {
"name": "PyTorch operator benchmark - add_M64_N64_K64_cpu",
"mode": "inference",
"dtype": "float32",
"extra_info": {
"input_config": "M: 64, N: 64, K: 64, device: cpu"
}
},
"model": {
"name": "add_M64_N64_K64_cpu",
"type": "micro-benchmark",
"origins": [
"pytorch"
]
},
"metric": {
"name": "latency",
"unit": "us",
"benchmark_values": [
9.973
],
"target_value": null
}
},
]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154410
Approved by: https://github.com/huydhn
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements
> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
> len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
> len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
> f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
This is one of a series of PRs to update us to PEP585 (changing Dict -> dict, List -> list, etc). Most of the PRs were completely automated with RUFF as follows:
Since RUFF UP006 is considered an "unsafe" fix first we need to enable unsafe fixes:
```
--- a/tools/linter/adapters/ruff_linter.py
+++ b/tools/linter/adapters/ruff_linter.py
@@ -313,6 +313,7 @@
"ruff",
"check",
"--fix-only",
+ "--unsafe-fixes",
"--exit-zero",
*([f"--config={config}"] if config else []),
"--stdin-filename",
```
Then we need to tell RUFF to allow UP006 (as a final PR once all of these have landed this will be made permanent):
```
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -40,7 +40,7 @@
[tool.ruff]
-target-version = "py38"
+target-version = "py39"
line-length = 88
src = ["caffe2", "torch", "torchgen", "functorch", "test"]
@@ -87,7 +87,6 @@
"SIM116", # Disable Use a dictionary instead of consecutive `if` statements
"SIM117",
"SIM118",
- "UP006", # keep-runtime-typing
"UP007", # keep-runtime-typing
]
select = [
```
Finally running `lintrunner -a --take RUFF` will fix up the deprecated uses.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145101
Approved by: https://github.com/bobrenjc93
The regex-based parser would erroneously split on commas in nested brackets, for example, it would do the following parse which is wrong:
'M: [(32, 16), (64, 32)], ZPB: 2' -> ['M: [(32, 16)', ' (64, 32)]', 'ZPB: 2']
The new manual parser handles this situation the right way:
'M: [(32, 16), (64, 32)], ZPB: 2' -> ['M: [(32, 16), (64, 32)]', 'ZPB: 2']
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144297
Approved by: https://github.com/XuehaiPan, https://github.com/jeffdaily
This pull request adds the functionality of writing the output of operator benchmark to an optional JSON file specified. The output is still printed in the terminal like before, but the user has the option of saving it in a JSON file as well.
Main part of the functionality is implemented using the function _perf_result_to_dict which outputs a dictionary to be put inside a JSON file. Each dictionary corresponds to a single test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142809
Approved by: https://github.com/albanD
This PR only adds the execution of the benchmarks on this PR and print results, following diffs will add checking out head~1 and running it and comparing.
to access results goto test pr_time_benchmarks and inspect logs:
you should see
```
+ echo 'benchmark results on current PR: '
benchmark results on current PR:
+ cat /var/lib/jenkins/workspace/test/test-reports/pr_time_benchmarks_before.txt
update_hint_regression,instruction_count,27971461254
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131475
Approved by: https://github.com/ezyang
Summary:
Remove operator_benchmark caffe2 build due to the removal of caffe2: 2fd75667b4
Plus, we are deleting the TARGETS file from broken benchmarks that we do not intend to maintain.
Test Plan: Sandcastle CI
Differential Revision: D60086216
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131460
Approved by: https://github.com/vmpuri
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127123
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122
Update ruff to 0.4.1 .
This version fixes a lot false negatives/false positives, is 20-40% faster, and has various other bug fixes.
Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0
| Repository | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7 | 251.8 | 351.1 | 274.9 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
### Summary
In #85398, while fixing a bug (which was _not caused by, but was exposed by_ AVX512 implementation) in `_vec_logsoftmax_lastdim`, I had made some revisions to use more threads in some cases, but was asked to roll back [those changes](https://github.com/pytorch/pytorch/pull/85398#discussion_r1087680237) during the PR's review.
At the time, landing that PR asap seemed essential, so I agreed to roll-back that change,
In some cases, more threads can be used than are being used with the current approach.
<strike>In this PR, I'm reintroducing those changes, which are geared towards more efficient multi-threading.</strike>.
On second thought, even for other softmax kernels besides `_vec_log_softmax_lastdim` and `_vec_softmax_lastdim`, we could simply use `grain_size` of 0 or 1, instead of complicating code because `CHUNK_SIZE` for each thread is already being computed as per some heuristic, and if `grain_size` would be `0`, then work among the OpenMP threads (which, BTW, stay constant in number, unless explicitly changed, since we don't use the OpenMP `num_threads` clause in PyTorch) would be distributed equitably, thus yielding the similar speedup as the approach in the first commit of this PR.
I've also added op-level benchmarks pertaining to example input shapes in this PR.
### Benchmarks
Machine - Intel(R) Xeon(R) Platinum 8468H (Xeon 4th gen, formerly codenamed Sapphire Rapids)
One socket of 48 physical cores was used, with & without HyperThreading.
Intel OpenMP & tcmalloc were preloaded.
Softmax benchmarks can be run with the following command, but the relevant benchmarks are the last dim ones -
`KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=1 KMP_SETTINGS=1 OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 numactl --membind=0 --cpunodebind=0 python -m pt.softmax_test --tag-filter all`
#### Already existing benchmarks
|Benchmark name (dim is 1, by default) | Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup Percentage = (old-new)*100/old | Speedup ratio (old/new)|
|-------------|--------|-------|----------------------------|----------|
|Softmax_N1_C3_H256_W256_cpu|31.364|11.594|63.03% |2.705|
|Softmax_N4_C3_H256_W256_cpu|34.475|24.966| 27.58%|1.380|
|Softmax_N8_C3_H512_W256_cpu|94.044|78.372|16.66%|1.199|
|Softmax2d_N8_C3_H512_W256_cpu|100.195|79.529|20.62%|1.259|
#### Some of the following benchmarks are being added in this PR
|Benchmark name| Previous implementation's latency (in ms) | This implementation's latency (in ms)|Speedup percentage = (old-new)*100/old| Speedup ratio (old/new) |
|-------------|--------|-------|----------------------------|--------------------|
|LogSoftmax_M128_N128_dim1_cpu|7.629|6.475|15.12%| 1.178|
|LogSoftmax_M48_N128_dim1_cpu|6.848|5.969|12.83%| 1.147|
|LogSoftmax_M16_N1024_dim1_cpu|7.004|6.322|9.73%| 1.107|
|LogSoftmax_M32_N1024_dim1_cpu|7.037|6.558|6.80%| 1.073|
|LogSoftmax_M48_N1024_dim1_cpu|7.155|6.773|5.33%|1.056|
|LogSoftmax_M16_N512_dim1_cpu|6.797|5.862|13.75%|1.159|
|LogSoftmax_M32_N512_dim1_cpu|7.223|6.202|14.13%|1.164|
|LogSoftmax_M48_N512_dim1_cpu|7.159|6.301|11.98%|1.136|
|LogSoftmax_M16_N256_dim1_cpu|6.842|5.682|16.95%|1.204|
|LogSoftmax_M32_N256_dim1_cpu|6.840|6.086|11.02%|1.123|
|LogSoftmax_M48_N256_dim1_cpu|7.005|6.031|13.94%|1.161|
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116367
Approved by: https://github.com/jgong5, https://github.com/ezyang
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.
I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.
I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang