This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to the existing Eager and JIT modes. It also adds peak memory measurement for the forward/backward pass, improves the JSON output format used by the dashboard for reporting, and introduces some additional CLI options. The new CLI flags are:
- Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit`
- Added `--benchmark-name` argument for customizing the benchmark name in output
- Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name
Sample command to run a single operator:
`python -m pt.mm_test --use-compile`
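For illustration, a sketch of combining the new flags in one invocation (flag values here are placeholders, not defaults):
`python -m pt.mm_test --use-compile --benchmark-name mm_compile_test --output-json-for-dashboard benchmark-results.json`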
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394
Approved by: https://github.com/jbschlosser
To better support the integration of operator benchmark performance data into the OSS benchmark database for the dashboard, I’ve added a JSON output format that meets the required specifications: https://github.com/pytorch/pytorch/wiki/How-to-integrate-with-PyTorch-OSS-benchmark-database#output-format
Since the current operator benchmark already has a flag `--output-json` for saving the results into a JSON file, I added a new flag `--output-json-for-dashboard` for this feature.
At the same time, I renamed `--output-dir` to `--output-csv` for a clearer and more intuitive name.
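For illustration, an invocation using the renamed and new flags might look like the following (file names are placeholders):
`python -m pt.add_test --output-csv results.csv --output-json-for-dashboard dashboard.json`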
An example of the JSON output of the operator benchmark:
```json
[
  {
    "benchmark": {
      "name": "PyTorch operator benchmark - add_M1_N1_K1_cpu",
      "mode": "inference",
      "dtype": "float32",
      "extra_info": {
        "input_config": "M: 1, N: 1, K: 1, device: cpu"
      }
    },
    "model": {
      "name": "add_M1_N1_K1_cpu",
      "type": "micro-benchmark",
      "origins": [
        "pytorch"
      ]
    },
    "metric": {
      "name": "latency",
      "unit": "us",
      "benchmark_values": [
        2.074
      ],
      "target_value": null
    }
  },
  {
    "benchmark": {
      "name": "PyTorch operator benchmark - add_M64_N64_K64_cpu",
      "mode": "inference",
      "dtype": "float32",
      "extra_info": {
        "input_config": "M: 64, N: 64, K: 64, device: cpu"
      }
    },
    "model": {
      "name": "add_M64_N64_K64_cpu",
      "type": "micro-benchmark",
      "origins": [
        "pytorch"
      ]
    },
    "metric": {
      "name": "latency",
      "unit": "us",
      "benchmark_values": [
        9.973
      ],
      "target_value": null
    }
  }
]
```
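For illustration, a minimal sketch of consuming this format downstream (assuming the results were written to `benchmark-results.json`):
```python
import json

# Load the dashboard-format results and print each entry's latency.
with open("benchmark-results.json") as f:
    records = json.load(f)

for rec in records:
    name = rec["benchmark"]["name"]
    metric = rec["metric"]
    # benchmark_values is a list; here we just take the first measurement.
    print(f"{name}: {metric['benchmark_values'][0]} {metric['unit']}")
```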
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154410
Approved by: https://github.com/huydhn
This pull request adds the ability to write the output of the operator benchmark to an optional, user-specified JSON file. The output is still printed to the terminal as before, but the user now also has the option of saving it to a JSON file.
The main part of the functionality is implemented in the function _perf_result_to_dict, which produces a dictionary to be written into the JSON file. Each dictionary corresponds to a single test.
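As a rough illustration only (the field names below are assumptions, not the actual schema produced by `_perf_result_to_dict`), each per-test dictionary could carry the test name, input config, and measured time, with the list of dictionaries then dumped to the requested file:
```python
import json

# Illustrative sketch; field names are assumptions, not the real schema.
results = [
    {
        "test_name": "add_M1_N1_K1_cpu",
        "input_config": "M: 1, N: 1, K: 1, device: cpu",
        "runtime_us": 2.074,
    },
]

with open("op_bench_results.json", "w") as f:  # hypothetical output path
    json.dump(results, f, indent=2)
```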
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142809
Approved by: https://github.com/albanD
Prefer dashes over underscores in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores (`--command_arg_name`) are kept for backward compatibility.
Both dashes and underscores are used in the PyTorch codebase. Some argument parsers accept only dashes or only underscores in their arguments. For example, the `torchrun` utility for distributed training accepts only underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library:
`argparse.BooleanOptionalAction`: 4a9dff0e5a/Lib/argparse.py (L893-L895)
```python
class BooleanOptionalAction(Action):
    def __init__(...):
        ...
        # For every '--foo' option string, argparse also registers a '--no-foo' variant.
        if option_string.startswith('--'):
            option_string = '--no-' + option_string[2:]
            _option_strings.append(option_string)
```
It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the Shift key, whereas `-` does not.
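A minimal sketch of keeping both spellings with plain `argparse` (the option name here is just an example):
```python
import argparse

parser = argparse.ArgumentParser()
# Register the preferred dashed form and keep the underscore form as an alias
# for backward compatibility; both map to the same dest.
parser.add_argument(
    "--warmup-iterations", "--warmup_iterations",
    dest="warmup_iterations", type=int, default=10,
)

args = parser.parse_args(["--warmup_iterations", "5"])  # old spelling still works
print(args.warmup_iterations)  # 5
```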
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505
Approved by: https://github.com/ezyang, https://github.com/seemethere
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47767
This diff implements the functionality of running benchmarks on mobile on top of the operator_benchmark framework. It does so in a few steps:
1. Create a scripted module from an existing benchmark case.
2. Run the mobile-specific optimization pass on the scripted module.
3. Run the scripted module on AiBench by calling its Python API.
A small change in how a benchmark case is written is introduced so that both local and mobile runs can share the same interface. The change is to pass inputs as arguments of the `forward` function, so that the mobile optimization pass can run successfully (otherwise everything would be optimized away by constant propagation).
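For illustration, a minimal sketch of the idea (not the actual benchmark classes; the module name here is made up): the benchmark's inputs arrive as `forward` arguments, so scripting plus the mobile optimization pass leaves real work in the graph.
```python
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

class AddBenchmarkModule(torch.nn.Module):
    """Illustrative stand-in for a benchmark case whose forward takes inputs as arguments."""

    def forward(self, input_one: torch.Tensor, input_two: torch.Tensor) -> torch.Tensor:
        # The operator under test; because the inputs arrive as arguments,
        # constant propagation cannot fold the whole call away.
        return torch.add(input_one, input_two)

inputs = (torch.rand(1, 1, 1), torch.rand(1, 1, 1))
scripted = torch.jit.script(AddBenchmarkModule())  # step 1: scripted module
optimized = optimize_for_mobile(scripted)          # step 2: mobile optimization pass
out = optimized(*inputs)                           # step 3 would run this via AiBench
```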
Test Plan:
## local op_bench run
buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --iterations 1 --warmup_iterations 1
buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --iterations 1 --warmup_iterations 1 --use_jit
Exceptions: the `py_module` op in `FakeQuantizePerTensorBaseOpBenchmark` and `FakeQuantizePerChannelBaseOpBenchmark` under JIT mode. These tests also failed in the base version:
```
RuntimeError:
Module 'FakeQuantizePerChannelOpBenchmark' has no attribute 'op_func' (This function exists as an attribute on the Python module, but we failed to compile it to a TorchScript function.
The error stack is reproduced here:
Python builtin <built-in method apply of FunctionMeta object at 0x619000c652a0> is currently not supported in Torchscript:
File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 260
quant_min: int, quant_max: int
):
return _LearnableFakeQuantizePerChannelOp.apply(input, scale, zero_point, axis, quant_min, quant_max, 1.0)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
:
File "/data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/pt/quantization_test#link-tree/quantization_test.py", line 313
axis: int, quant_min: int, quant_max: int
):
return self.op_func(input, scale, zero_point, axis, quant_min, quant_max)
~~~~~~~~~~~~ <--- HERE
```
`_consume_op` typing mismatch: chunk, split, qobserver, sort in qunary. These will be fixed in D24774105
## OSS test
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1 --use_jit
python3 -m benchmark_all_test --iterations 1 --warmup_iterations 1
## saved module graph
```
module __torch__.mobile_benchmark_utils.OpBenchmarkMobile {
parameters {
}
attributes {
training = True
num_iters = 1
benchmark = <__torch__.pt.add_test.___torch_mangle_4.AddBenchmark object at 0x6070001b8b50>
}
methods {
method forward {
graph(%self : __torch__.mobile_benchmark_utils.OpBenchmarkMobile):
%12 : None = prim::Constant() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:9:4
%4 : bool = prim::Constant[value=1]() # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
%1 : int = prim::GetAttr[name="num_iters"](%self)
= prim::Loop(%1, %4) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/mobile_benchmark_utils.py:10:8
block0(%i : int):
%6 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
%7 : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark = prim::GetAttr[name="benchmark"](%self)
%self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
%9 : Tensor, %10 : Tensor = prim::TupleUnpack(%self.inputs_tuple)
%23 : int = prim::Constant[value=1]()
%24 : Tensor = aten::add(%9, %10, %23) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
-> (%4)
return (%12)
}
}
submodules {
module __torch__.pt.add_test.___torch_mangle_4.AddBenchmark {
parameters {
}
attributes {
mobile_optimized = True
}
methods {
method forward {
graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark,
%input_one.1 : Tensor,
%input_two.1 : Tensor):
%3 : int = prim::Constant[value=1]()
%4 : Tensor = aten::add(%input_one.1, %input_two.1, %3) # /data/users/wangyang19/fbsource/fbcode/buck-out/dev/gen/caffe2/benchmarks/operator_benchmark/fb/pt/mobile/benchmark_all_test_fbcode#link-tree/pt/add_test.py:39:15
return (%4)
}
method get_inputs {
graph(%self : __torch__.pt.add_test.___torch_mangle_4.AddBenchmark):
%self.inputs_tuple : (Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu), Float(1, 1, 1, strides=[1, 1, 1], requires_grad=0, device=cpu)) = prim::Constant[value=({0.48884}, {0.809042})]()
return (%self.inputs_tuple)
}
}
submodules {
}
}
}
}
```
Reviewed By: kimishpatel
Differential Revision: D24322214
fbshipit-source-id: 335317eca4f40c4083883eb41dc47caf25cbdfd1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31334
The wipe cache logic was introduced in the hope of reducing variation in the benchmark results. Based on our experimental results, it didn't actually help with that. In addition, several engineers had encountered the issue of a missing cpuinfo.h, which was used in the wipe cache logic. So this diff removes that feature to ensure smooth installation and running of the op bench.
Test Plan:
```
buck run caffe2/benchmarks/operator_benchmark/pt:add_test -- --iterations 1
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M1_N1_K1_cpu
# Input: M: 1, N: 1, K: 1, device: cpu
Forward Execution Time (us) : 111.192
```
A/B test also passes: Benchmark Run #2476535015
Reviewed By: hl475
Differential Revision: D19126970
fbshipit-source-id: 9b1ab48c121838836ba6e0ae664a48fe2d18efdd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29864
This diff makes `all` a reserved keyword for `tag_filter`. When `all` is passed by the user, all the supported shapes will be run.
Test Plan:
```
buck run //caffe2/benchmarks/operator_benchmark/pt:add_test -- --iterations 1 --tag_filter all
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : all
# Benchmarking PyTorch: add
# Mode: Eager
# Name: add_M8_N32_K256_cpu
# Input: M: 8, N: 32, K: 256, device: cpu
Forward Execution Time (us) : 6798.688
...
```
Reviewed By: hl475
Differential Revision: D18520249
fbshipit-source-id: 4d55af9f46f89b2fe8842e1a00dfa8e5acaf4fa2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/28731
as title
Test Plan:
```
Before:
buck run mode/opt caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --operator sigmoid
Invalidating internal cached state: Buck configuration options changed between invocations. This may cause slower builds.
Changed value project.buck_out='buck-out/opt' (was 'buck-out/dev')
... and 69 more. See logs for all changes
Parsing buck files: finished in 7.2 sec
Creating action graph: finished in 10.0 sec
Building: finished in 06:38.4 min (100%) 29890/29890 jobs, 29890 updated
Total time: 06:55.7 min
# ----------------------------------------
# PyTorch/Caffe2 Operator Micro-benchmarks
# ----------------------------------------
# Tag : short
# Benchmarking PyTorch: sigmoid
With this diff
buck run mode/opt caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --operator sigmoid
Parsing buck files: finished in 6.4 sec
Creating action graph: finished in 9.8 sec
Building: finished in 06:35.9 min (100%) 29892/29892 jobs, 29892 updated
Total time: 06:52.1 min
```
Reviewed By: hl475
Differential Revision: D18152071
fbshipit-source-id: 80c29570581bbd2f0e78e2df32734c17a2b036ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23442
Rename the argument from `operator` to `operators`, which can take a list of operators to test.
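For example (the exact list separator is an assumption here): `buck run caffe2/benchmarks/operator_benchmark:benchmark_all_test -- --operators add,matmul`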
Reviewed By: hl475
Differential Revision: D16520779
fbshipit-source-id: 94284a87c64471793e319f5bd3143f89b9a192bb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22892
Think of `num_runs` as manually running the binary `<num_runs>` times. Each run executes the operator for many iterations.
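For illustration (values are placeholders): `buck run caffe2/benchmarks/operator_benchmark/pt:add_test -- --num_runs 2 --iterations 100` would perform 2 independent runs, each executing the operator for 100 iterations.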
Reviewed By: hl475
Differential Revision: D16271597
fbshipit-source-id: b6f509ee0332c70f85bec0d447b84940c5c0cecd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22309
This diff enables PT operators to run in JIT mode. Users can switch between eager and JIT mode using the `use_jit` flag.
In this diff, we put the operator in a loop and pass it to JIT. An extra step that wraps the operator with the `_consume` op is introduced to avoid JIT's dead code elimination optimization. With that, the reported time includes the real operator execution time plus the `_consume` op (which directly returns its input; nothing else happens inside).
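A minimal sketch of the idea (not the actual benchmark code; `_consume` below is a stand-in for the real no-op consumer): the operator is invoked inside a scripted loop and its result is fed to a consumer so JIT's dead code elimination cannot drop the call.
```python
import torch

@torch.jit.script
def _consume(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real consume op: just returns its input so the
    # operator call feeding it cannot be eliminated as dead code.
    return x

@torch.jit.script
def benchmark_loop(input_one: torch.Tensor, input_two: torch.Tensor, iters: int) -> torch.Tensor:
    result = input_one
    for _ in range(iters):
        # The op under test; its output is consumed so DCE keeps it.
        result = _consume(torch.add(input_one, input_two))
    return result

out = benchmark_loop(torch.rand(64, 64), torch.rand(64, 64), 100)
```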
Reviewed By: zheng-xq
Differential Revision: D16033082
fbshipit-source-id: e03be89fd5a505e44e81015dfc63db9cd76fb8a1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21440
This diff modifies the output format when ai_pep_format is enabled.
Reviewed By: hl475
Differential Revision: D15681042
fbshipit-source-id: df5f2dbb38d1bd866ca7f74ef4e63459d480be6e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21206
This diff changes the default `test_name` to be a globally unique value across tests. With that, users can list all the tests and choose to run a specific one.
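For illustration (the flag spellings here are assumptions): `python -m benchmark_all_test --list_tests` to see all test names, then `python -m benchmark_all_test --test_name add_M8_N32_K256_cpu` to run a single one.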
Reviewed By: zheng-xq
Differential Revision: D15543508
fbshipit-source-id: 0814ef6a60d41637fed5245e30c282497cf21bb8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21147
This diff introduces a new interface for adding PT/C2 operators to the benchmark suite.
The following steps are needed to add a new operator:
1. Specify the input shapes and args for an operator in configs.
2. Create a PT/C2 benchmark class that includes `init` (create tensors), `forward` (specify the operator to be tested), and `backward` (gradient of an op) methods.
3. Call `generate_pt_test`/`generate_c2_test` to create test cases based on the configs (see the sketch after this list).
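For reference, a condensed sketch of what such a PyTorch benchmark file can look like (shapes and values are illustrative, modeled on the add benchmark pattern in the suite):
```python
import operator_benchmark as op_bench
import torch

# 1. Input shapes and args for the operator.
add_configs = op_bench.cross_product_configs(
    M=[1, 64], N=[1, 64], K=[1, 64], device=["cpu"], tags=["short"]
)

# 2. The PyTorch benchmark class: init creates tensors, forward runs the op.
class AddBenchmark(op_bench.TorchBenchmarkBase):
    def init(self, M, N, K, device):
        self.input_one = torch.rand(M, N, K, device=device)
        self.input_two = torch.rand(M, N, K, device=device)
        self.set_module_name("add")

    def forward(self):
        return torch.add(self.input_one, self.input_two)

# 3. Generate test cases from the configs.
op_bench.generate_pt_test(add_configs, AddBenchmark)

if __name__ == "__main__":
    op_bench.benchmark_runner.main()
```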
Reviewed By: zheng-xq
Differential Revision: D15250380
fbshipit-source-id: 1025a7cf60d2427baa0f3f716455946d3d3e6a27
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19518
The previous design required running the op benchmarks from the PyTorch root directory, which could lead to a `module not found` error in the OSS environment. This diff fixes that issue by making the benchmarks launchable from the `benchmarks` folder.
Reviewed By: ilia-cher
Differential Revision: D15020787
fbshipit-source-id: eb09814a33432a66cc857702bc86538cd17bea3b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19450
We want to make each operator benchmark a separate binary. Previously, all operators were collected into a single binary, which is unnecessary when we want to filter for a specific operator. This diff resolves that issue.
Reviewed By: ilia-cher
Differential Revision: D14808159
fbshipit-source-id: 43cd25b219c6e358d0cd2a61463b34596bf3bfac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19433
For the operator benchmark project, we need to cover a lot of operators, so the interface for adding operators needs to be very clean and simple. This diff implements a new interface for adding ops.
Here is the logic for adding a new operator to the benchmark:
```
# Pseudocode sketch of the new interface:
long_config = {}    # input configurations
short_config = {}
map_func            # maps a config entry to operator inputs
add_test(
    [long_config, short_config],
    map_func,
    [caffe2 op],    # the Caffe2 operator under test
    [pt op],        # the PyTorch operator under test
)
```
Reviewed By: zheng-xq
Differential Revision: D14791191
fbshipit-source-id: ac6738507cf1b9d6013dc8e546a2022a9b177f05
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18740
Test utilities for writing Caffe2/PyTorch performance microbenchmarks. Brief description of the file structure:
* benchmark_core.py : core utilities for running microbenchmark tests
* benchmark_caffe2.py : Caffe2-specific benchmark utilities
* benchmark_pytorch.py: PyTorch-specific benchmark utilities
* benchmark_runner.py : main entry point. Currently it can run the microbenchmark tests in a stand-alone mode. The next step is to integrate this with AI-PEP.
The utilities are located at https://github.com/pytorch/pytorch/tree/master/test to have access to both the Caffe2 and PyTorch Python frontends.
Includes two operator microbenchmarks, supporting both Caffe2 and PyTorch:
* MatMul
* Add
Reference: PyTorch benchmarks: https://github.com/pytorch/benchmark/tree/master/timing/python. In this work, we start with two example binary operators, MatMul and Add, but eventually we should also cover unary operators as in the PyTorch benchmark repo.
Reviewed By: zheng-xq
Differential Revision: D13887111
fbshipit-source-id: b7a56b95448c9ec3e674b0de0ffb96af4439bfce