This PR suggests adding some models to `cpu_skip_list` which are currently being run in TIMM and Torchbench.
The suggested models takes a long time which leads to the benchmark runs being `timeout`. [benchmark runs for aarch64](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly-aarch64.yml)
• The issue stems from unoptimized groupwise convolution (BF16 /F16 dtype) kernels for aarch64 platforms , which significantly slow down execution leading to the timeout.
**Action:**
• An optimized BF16 groupwise convolution kernel is currently being developed in oneDNN, targeted for release in Q4 2025.
To maintain dashboard consistency and signal clarity, I’ve skipped the affected tests in:
* timm benchmarks
* torchbench benchmarks
As suggested, skip is applied at the CPU - arch level, explicitly branching for aarch64 and adding models which needs to be skipped. This keeps the logic clean, but:
• An alternative considered was increasing shard counts for aarch64 runners, but given the known performance bottleneck, skipping avoids wasted compute cycles. Suggestions around this will be appreciated.
Benchmark does not timeout after the suggested change: https://github.com/pytorch/pytorch/actions/runs/16447200138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158842
Approved by: https://github.com/malfet
This pr uses the coalescing information in generating a tiling. The previous tiling heuristic would have each dependency generate a tiling. Then, we sum up the score for each generated tiling, preferring any 2d tiling over the default. The new tiling heuristics scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated, 3d patterns) as well as information we can use in generating block sizes.
In triton heuristics, for generating 3d tiled reductions, we take the same total block size that the 2d reduction would use, then distribute the block according to whichever block coalesces the most memory.
The motivating kernel is in https://github.com/pytorch/pytorch/issues/149982 which is a 32 element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor.
While the contiguous kernel has coalesced accesses, and is performant on master, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See, this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. Now, with this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153751
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730, #153748
PT2 benchmark scripts has a pattern like:
```
def forward_and_backward_pass(self, mod, inputs, collect_outputs=True):
cloned_inputs = clone_inputs(inputs)
self.optimizer_zero_grad(mod)
with self.autocast(**self.autocast_arg):
pred = mod(**cloned_inputs)
loss = self.compute_loss(pred)
self.grad_scaler.scale(loss).backward()
self.optimizer_step()
if collect_outputs:
return collect_results(mod, pred, loss, cloned_inputs)
return None
```
for training.
The collect_outputs argument is True only for accuracy testing and it's false for performance testing.
For HF benchmark suite, a model usually returns tuple (loss, logits). For performance testing, even though the logits is never used anywhere, dynamo has to keep it due to the control flow.
A few bad things if we keep logits here
1. the peak memory will be higher since the logits is large and we can not release its memory earlier.
2. we can not do optimization like chunking for the logits because the tensor needs to be returned from the pre-grad graph
Actually I think it's fine to not return logits at all.
- For training cases, checking loss and gradients for accuracy is good enough. It's hard to see two runs have mismatch logits but matching loss/gradients.
- Also, discarding logits as soon as possible for perf benchmarking makes it more fair for us.
On the other hand, it may be interesting to let dynamo support something like dynamo.constexpr (similar to tl.constexpr). A variable annotated as dynamo.constexpr will be specialized at compile time and we can do more optimization (DCE e.g.) at compile time. (A small [repro](https://gist.github.com/shunting314/0912a8947028a904c34f361021b8024d))
Benchmark results here [link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Fri%2C%2004%20Apr%202025%2018%3A03%3A26%20GMT&stopTime=Fri%2C%2011%20Apr%202025%2018%3A03%3A26%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/204/head&lCommit=fe25dab3f65e1b0e9db0af03f7664af70fcc9c66&rBranch=main&rCommit=55e62ff74ad5614faf80b060c7bfc551e3b7af5a)
- HF 15% (1.51 -> 1.66 compression ratio) peak memory improvement
- I also see 5% (2.74 -> 2.79x) perf win for HF. It could be true. We may generate more efficient kernels since we don't need keep logits and return it from the pre-grad graph. But I'll double check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151075
Approved by: https://github.com/eellison, https://github.com/jansel
https://github.com/pytorch/pytorch/issues/144893
```
python benchmarks/dynamo/timm_models.py --only poolformer_m36 --accuracy --no-translation-validatio --training --amp --device cuda --backend inductor
```
`--float32`, `--bfloat16` - passes the accuracy
`--disable-cudagraph` does not change the result
accuracy_fail only for `--amp` and gives `0.048` res_error, on 1-element result Tensor.
This fails with `0.01` tolerance.
If to increase tolerance to 0.04 it passes. I have not reproduced "eager_two_runs_differ" on H100.
I think this is a true distribution of results with `--amp`, so increasing tolerance to 0.04 for ano case only makes it passing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145375
Approved by: https://github.com/desertfire
This model's accuracy test recently regressed. I have a quite smooth debugging process to figure out the cause. So I'd like to write it down just in case it can be helpful.
Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model in e.g. the past month. For this model, we can see that it starts to fail on June 08:
<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">
What's nice is the dashboard shows the nightly commits for each run.
Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df
Roughly looking thru the PRs, I feel
```
ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451)
```
can change numerics so I disable it locally by this one line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . And then the accuracy test pass. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 )
Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid. It removes no-op ops in joint-graph. I think maybe the graph get changed and cause the partitioner do different recomputation decisions. That can cause some numerics change.
Since this is not a real issue, I'll raise the tolerance to make it pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941
This PR batch the fix for a few accuracy failures issues during training by raising tolerance. I do that only for models that I think it fails not due to real issue.
## sebotnet33ts_256
The accuracy test for this model start to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).
I can not repro locally, but from the log from the dashboard:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
raising the tolerance should fix it.
## DebertaForQuestionAnswering
This model fails accuracy test on the dashboard only in max-autotune mode. I can not repro locally by command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```
From error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```
0.02 tolerance should suppress this error.
## gluon_inception_v3
This model fail on the dashboard in max-autotune mode. I can not repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```
From error message on the dashboard
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
raising tolerance should suppress this error.
# mobilenetv3_large_100
Fail in MA model. I can not repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
The tensor is so small that the noise can be high. I use larger multiplier for smaller tensor in torch._dynamo.utils.same.
# yolov3
Fail on dashboard with error
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
Fix it by using a larger multiplier for smaller tensors and raising the tolereance.
# timm_efficientdet
Fail on the dashboard with error
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I can not repro locally with command
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet --training
```
Raise the tolerance should fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
This model's accuracy test recently regressed. I have a quite smooth debugging process to figure out the cause. So I'd like to write it down just in case it can be helpful.
Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model in e.g. the past month. For this model, we can see that it starts to fail on June 08:
<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">
What's nice is the dashboard shows the nightly commits for each run.
Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df
Roughly looking thru the PRs, I feel
```
ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451)
```
can change numerics so I disable it locally by this one line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . And then the accuracy test pass. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 )
Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid. It removes no-op ops in joint-graph. I think maybe the graph get changed and cause the partitioner do different recomputation decisions. That can cause some numerics change.
Since this is not a real issue, I'll raise the tolerance to make it pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941
This PR batch the fix for a few accuracy failures issues during training by raising tolerance. I do that only for models that I think it fails not due to real issue.
## sebotnet33ts_256
The accuracy test for this model start to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).
I can not repro locally, but from the log from the dashboard:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
raising the tolerance should fix it.
## DebertaForQuestionAnswering
This model fails accuracy test on the dashboard only in max-autotune mode. I can not repro locally by command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```
From error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```
0.02 tolerance should suppress this error.
## gluon_inception_v3
This model fail on the dashboard in max-autotune mode. I can not repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```
From error message on the dashboard
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
raising tolerance should suppress this error.
# mobilenetv3_large_100
Fail in MA model. I can not repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
The tensor is so small that the noise can be high. I use larger multiplier for smaller tensor in torch._dynamo.utils.same.
# yolov3
Fail on dashboard with error
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
Fix it by using a larger multiplier for smaller tensors and raising the tolereance.
# timm_efficientdet
Fail on the dashboard with error
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I can not repro locally with command
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet --training
```
Raise the tolerance should fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
Biggest movement is 4% HF inference, 9% TIMM inference. Note, this is max-autotune mode so we are more tolerant of compilation increases. We could improve compilation time by limiting:
```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```
There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. When you turn off epilogue fusion, it fixes the accuracy. I bisected the failure to an epilogue, however when you compare the results of that epilogue with the corresponding separate kernels the results of the output are equivalent.
Inference:
<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">
Training:
<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229, #122825
Debugging is happening in https://github.com/pytorch/pytorch/issues/123126 .
Upgrading triton cause accuracy failure for mixer_b16_224 and levit_128 .
mixer_b16_224 is debugged specifically. It due to extra FMA instructions being used in a single kernel. That kernel itself only introduce small numerical difference. We conclude that this is not some 'real' accuracy issue and we should raise the tolerance to unblock the triton pin update.
The tolerance is picked such that the CI accuracy test can pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123484
Approved by: https://github.com/jansel
Fix https://github.com/pytorch/pytorch/issues/120545 . The reason why these models fail accuracy test with freezing is due to the conv-batchnorm fusion. Conv-batchnorm fusion causes relative big numerical churn.
For the failed TIMM models, raising the tolerance to `8 * 1e-2` can make the test pass.
For the failed TB models, the numerical difference is too large. Having a discussion with @eellison , we decided to skip them with freezing for now.
One the other hand, we probably should dig more why the conv-bn fusion cause such large numerical difference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121054
Approved by: https://github.com/eellison
The memory compression for these models is at parity, but because we interleave timings between torch.compile and eager run memory is duplicated between between eager and cudagraphs pool and causes OOM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101837
Approved by: https://github.com/anijain2305
This pr accomplishes
1) Enables retries for downloading torchbenchmark and huggingface models in a similar method to how we do it for timm models right now.
2) creates a `_download_model` function for the hugging face and TIMM runners whose output I plan to use to preload the models somewhere if possible (please double check I'll be saving the right thing). Instead of retries, we plan to just add torchbench to a docker image as it is relatively small.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 3361a4c</samp>
> _We're the brave and bold coders of the `common.py` module_
> _We've made a handy function for downloading models_
> _We've shared it with our mates in the other runners_
> _So pull and push and try again, we'll get them all in time_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101019
Approved by: https://github.com/huydhn, https://github.com/desertfire