This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus more on different cuda hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode.
**Benchmark Configuration and Coverage:**
* Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward run. The configs with tag "long" for the above mentioned files are being run in CI.
* The CI benchmarking is running on various hardwares: H100, A100.
* The CI job also uploads the microbenchmarking outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162530
Approved by: https://github.com/huydhn
Co-authored-by: Huy Do <huydhn@gmail.com>
This adds a PR time benchmark that checks for runtime overhead on a very small graph. This will help track regressions in runtime overhead.
Example Results:
```
runtime_overhead_inductor,instruction_count,222645
runtime_overhead_inductor_inference_mode,instruction_count,234998
runtime_overhead_inductor_requires_grad,instruction_count,293556
runtime_overhead_inductor_requires_grad_backward,instruction_count,78181
runtime_overhead_inductor_dynamic,instruction_count,234870
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,248711
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,309979
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77599
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163866
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
Potential issues
* gpt-oss-20b is probably too big (I can't run on my devserver)
* Mistral requires HF authentication
* Mistral also takes a while to run the performance checks (need to wait for CI)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163565
Approved by: https://github.com/huydhn
Summary:
This diff does a big refactor of PrecompileContext to make it considerably simpler: instead of being a CacheArtifactManager and managing a bunch of bytes, it simply stores two things: dynamo cache entries and backend cache entries. When asked, it stitches them together into PrecompileCacheEntries, which are stored by DynamoCache.
This structure then allows us to register DynamoCache to the regular Megacache API, instead of having two separate APIs that are confusing. It also lets us remove the autotune cache integration, since MegaCache API will automatically store autotune cache entries.
The intent here is that users who want to use caching precompile will simply be able to use torch.compiler.save_cache_artifacts as before, just with `torch.dynamo.config.caching_precompile` set to True. They can also directly interact with PrecompileContext if they wish to specifically only load Precompile entries, using PrecompileContext.create_cache_entries().
Saving single entries and such with DynamoCache still works normally.
Test Plan:
All existing unit tests pass.
Rollback Plan:
Differential Revision: D82380307
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162886
Approved by: https://github.com/zhxchen17
This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to existing Eager and JIT. It also adds peak memory measurement (fwd/bwd pass); improves the output format in JSON to be used by dashboard for reporting; and introduce some more CLI options. The new CLI flags introduced are:
- Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit`
- Added `--benchmark-name` argument for customizing the benchmark name in output
- Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name
Sample command to run a single operator:
`python -m pt.mm_test --use-compile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394
Approved by: https://github.com/jbschlosser
Update PyTorch to the latest Triton release candidate branch (release/3.5.x in triton-lang/triton)
Notably:
* this does *not* include the version number bump from 3.4 -> 3.5 (we'll do that in a follow-up PR)
* sam_fast is still failing, so we've disabled it temporarily https://github.com/pytorch/pytorch/issues/162282 and we are committed to fixing it, ideally before the branch cut but possibly as a cherry-pick into the release branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162278
Approved by: https://github.com/atalman
ghstack dependencies: #162244, #162309
Reland of https://github.com/pytorch/pytorch/pull/159923
Couple of fixes:
1. When we run into an operation we didn't proxy, we end up emitting fake constants. We detect this and warn using the FQN of the lifted constant. We warn because some internal users complained it was regressing their exportability.
2. Previous attribute mutation detection logic in non-strict didn't account for nested module structure. This fixes silent incorrectness issue of exporting esm and qwen in non-strict
3. We modify yolov3 to fix the previous silent incorrect behaviour
4. We use strict export for levit_128 because it errors in non-strict due to more strict side effect checking
When upgrading torchbench pin, opacus_cifar10 seems to not run on eager anymore. I verified this by pushing a temporary PR on master with new pin. So i added it to expect_fail list.
Differential Revision: [D81133908](https://our.internmc.facebook.com/intern/diff/D81133908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161589
Approved by: https://github.com/avikchaudhuri
For comparing NativeRT and TorchScript. We add `torchscript-jit-trace` as an option in the benchmark. With this option, we can run trace a model and run inference with the traced module using TorchScript interpreter
```
python ./benchmarks/dynamo/huggingface.py --performance --inference --torchscript-jit-trace
python ./benchmarks/dynamo/timm_models.py --performance --inference --torchscript-jit-trace
python ./benchmarks/dynamo/torchbench.py --performance --inference --torchscript-jit-trace
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161223
Approved by: https://github.com/huydhn
- hf_Reformer: this one starts failing due to increased graph breaks due to transformers pin bump (#159291). We can likely just bump the expected graph break count.
- dla102: this one starts timing out on 8/13 Wed between commit 6e8865f and ee1b041. But based on the PT2 dashboard, this model actually doesn't have compile time or runtime regression. Will try to bump up the timeout and see if it can work.
- hf_BigBird: this one has its accuracy status improved since today. Will update hf_BigBird accuracy status.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160932
Approved by: https://github.com/zou3519, https://github.com/huydhn, https://github.com/malfet
GPT2ForSequenceClassification Hugging Face (HF) model fails on ROCm for bfloat16. The failure is numerically small. This PRs adds this model to an exception list for small tensors. The exception list already includes two models. This increases the multiplier factor to 10.0 instead of 3 (default) for this model used in `torch/_dynamo/utils.py`.
In the PR comment below, I include a short analysis of the numerics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160001
Approved by: https://github.com/anijain2305, https://github.com/jataylo, https://github.com/jeffdaily
Summary:
Currently, Linear in FP32 dynamic mode(batch_size has free symbols) does not support weight prepacking since MKL Linear does not support dynamic mode. This PR uses oneDNN Linear to support Linear weight prepacking in FP32 dynamic mode.
I tested the Inductor benchmark in FP32 dynamic mode on CPU using this PR, and saw ~8% improvement in timm_models geomean speedup, ~2% improvement in torchbench geomean speedup, and no change in huggingface. There are about 18 models with different degrees of performance improvement, among which BERT_pytorch, soft_actor_critic, BlenderbotForCausalLM, ElectraForCausalLM, crossvit_9_240, mobilevit_s, twins_pcpvt_base have more than 20% performance improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157542
Approved by: https://github.com/CaoE, https://github.com/jansel
get_free_symbol_uses is used to know what unbacked symbols are used by a given node.
not having correct get_free_symbol_uses defined properly leads to :
- eliminating of some nodes due to not detection of any users. (See the added unit test)
- Incorrect topological sort.
Fix get_free_symbol_uses , NopKernel , ConcarKernel, InputsKerenl, external kernel.
for ComputedBuffer with NonOwningLayout its interesting case.
when layout is NonOwningLayout we need to access the actual view op base layout and use
detect symbols in it. Because when we codegen the ComputedBuffer we uses those symbols.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160314
Approved by: https://github.com/eellison
#158649 turned off automatic GCs during cudagraph recording. This is causing a small uptick in some internal benchmark numbers because of memory the benchmark is leaving around before the benchmark starts - so GC before warming up the model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159670
Approved by: https://github.com/oulgen