b2953f5643
[9/N] Apply ruff UP035 rule ( #165515 )
...
This is a follow-up to #165214, continuing to apply the ruff UP035 rule to the code base.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165515
Approved by: https://github.com/Lucaskabela
2025-10-17 00:09:51 +00:00
a2a75be0f8
Rename inductor cache ( #156128 )
...
Requested by Simon on a different PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
b878ca0c91
[cutlass backend] add fp8 to cutlass benchmark script ( #155507 )
...
Summary:
Add fp8 support to the cutlass benchmark script.
For now, FP8 only supports fast_accum.
Test Plan:
```
Experiment group: _scaled_mm (8192x8192, 8192x8192) torch.float8_e4m3fn
+-----------------------+--------------------+--------------------+----------------------+--------------------+
| name | forward_time (us) | teraflops (TFLOPS) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+--------------------+----------------------+--------------------+
| aten | 967.1226739883423 | 1136.8895149998868 | 1.219131228979677 | NA |
| triton | 1764.6185159683228 | 623.08743664783 | 20.373826419003308 | 82.46067054670186 |
| triton_persistent_tma | 1769.0335512161255 | 621.5323768280928 | 20.48663099599071 | 82.91718297956578 |
| cutlass_lvl_default | 790.5075550079346 | 1390.8932568835019 | 13.788519630907103 | -18.26191482535096 |
| cutlass_lvl_3332 | 803.7384748458862 | 1367.996757884245 | 226.81587297911756 | -16.89384434227684 |
+-----------------------+--------------------+--------------------+----------------------+--------------------+
```
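fast_accum refers to the use_fast_accum flag of torch._scaled_mm, which trades accumulation precision for speed on fp8 inputs. fp8 matmuls also need per-tensor scales; a common convention is to scale by the tensor's amax relative to the E4M3 max representable value (448). A pure-Python sketch of that scale computation (illustrative only, not the benchmark's code):

```python
E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

def fp8_scale(amax: float) -> float:
    """Per-tensor scale so that x / scale fits the e4m3 range."""
    return max(amax, 1e-12) / E4M3_MAX  # guard against a zero scale

# a tensor whose largest magnitude is 896 gets scale 2.0:
print(fp8_scale(896.0))  # 2.0
```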
Differential Revision: D76310809
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155507
Approved by: https://github.com/ColinPeppler
2025-06-13 05:11:15 +00:00
2481c4b2ea
[cutlass backend] add teraflops and increase rep for benchmark script ( #154944 )
...
Differential Revision: [D75840023](https://our.internmc.facebook.com/intern/diff/D75840023/ )
I think I will continue to use do_bench for now.
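For reference, the teraflops column can be derived from the measured forward time: an MxK by KxN matmul performs 2*M*N*K floating-point operations. A minimal sketch of the conversion, which reproduces the aten row of the 8192x8192 _scaled_mm table earlier in this log:

```python
def teraflops(m: int, n: int, k: int, forward_time_us: float) -> float:
    """Convert a matmul forward time to TFLOPS (2*m*n*k flops per mm)."""
    flops = 2 * m * n * k
    return flops / (forward_time_us * 1e-6) / 1e12

# aten row of the 8192x8192 _scaled_mm table: 967.12 us -> ~1136.89 TFLOPS
print(teraflops(8192, 8192, 8192, 967.1226739883423))
```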
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154944
Approved by: https://github.com/mlazos
2025-06-05 17:20:29 +00:00
cb56df55dc
[Inductor]Cleanup autotune_fallback_to_aten post-deprecation ( #154331 )
...
Fixes #153298
This PR is the 3rd and final step of #147479
All references to autotune_fallback_to_aten have been removed, completing the feature's deprecation.
All calls to should_fallback_to_aten() were also removed, as they were deemed unnecessary.
[henrylhtsang](https://github.com/henrylhtsang )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154331
Approved by: https://github.com/henrylhtsang
2025-05-29 20:29:58 +00:00
00ebbbb701
[cutlass backend] add addmm and bmm for cutlass backend benchmark ( #152163 )
...
Copying what @kadeng did.
```
FINAL results...
Experiment group: bmm (BS: 8, 1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 44.454172253608704 | 3.0991086587309837 | NA |
| triton | 44.06978189945221 | 0.07496077567338943 | -0.8646890374284049 |
| triton_persistent_tma | 43.598245829343796 | 0.06154991965740919 | -1.9254130284597197 |
| cutlass_lvl_default | 39.91834074258804 | 0.056073310784995556 | -10.20338762612423 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: bmm (BS: 8, 1024x1024, 1024x1024) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+-------------------+----------------------+---------------------+
| aten | 49.05610531568527 | 0.160279156640172 | NA |
| triton | 43.97720843553543 | 0.0660805031657219 | -10.353241145961718 |
| triton_persistent_tma | 43.94153505563736 | 0.061738294549286366 | -10.425960697724962 |
| cutlass_lvl_default | 40.2066633105278 | 0.034127906896173954 | -18.039430460713596 |
+-----------------------+-------------------+----------------------+---------------------+
Average edge over aten (max(-edge, 0), higher is better):
triton: 5.608965091695062 (from 2 valid values)
triton_persistent_tma: 6.175686863092341 (from 2 valid values)
cutlass_lvl_default: 14.121409043418913 (from 2 valid values)
```
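The "average edge over aten" line is the negated perf_over_aten, clamped at zero, averaged over the experiments that produced finite numbers. A hedged reconstruction (the actual script may differ), checked against triton's two bmm rows above:

```python
import math

def average_edge(perf_over_aten_pct: list[float]) -> tuple[float, int]:
    """Average of max(-perf, 0) over finite values; higher is better."""
    valid = [p for p in perf_over_aten_pct if math.isfinite(p)]
    edges = [max(-p, 0.0) for p in valid]
    return sum(edges) / len(edges), len(edges)

# triton's perf_over_aten from the fp16 and bf16 bmm tables above:
avg, n = average_edge([-0.8646890374284049, -10.353241145961718])
print(avg, n)  # ~5.6090 (from 2 valid values)
```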
Differential Revision: [D73625766](https://our.internmc.facebook.com/intern/diff/D73625766/ )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152163
Approved by: https://github.com/jingsh
2025-04-28 20:16:17 +00:00
5a51de5ab1
[cutlass backend] Add more logs for cutlass backend benchmark ( #150639 )
...
The goal is to have a way to compare whether a change makes things better or worse.
```
Average edge over aten (max(-edge, 0), higher is better):
triton: 8.596507086950552 (from 6 valid values)
triton_persistent_tma: 9.517193693923307 (from 6 valid values)
cutlass_lvl_default: 3.3234737908691785 (from 6 valid values)
cutlass_lvl_1111: 7.088173348313991 (from 6 valid values)
cutlass_lvl_2222: 7.291869722320318 (from 6 valid values)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150639
Approved by: https://github.com/ColinPeppler
2025-04-15 04:19:51 +00:00
f2d43d866c
[cutlass backend] switch layout for cutlass backend benchmark ( #149009 )
...
```
python benchmarks/inductor_backends/cutlass.py
```
logs:
```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 13.059554621577263 | 1.580178506206721 | NA |
| triton | 10.245470330119133 | 0.04118620231747627 | -21.54808776410064 |
| triton_persistent_tma | 10.388538241386414 | 0.04225084185600281 | -20.45258400908819 |
| cutlass_lvl_default | 12.882896699011326 | 231.14990583620965 | -1.3527101626732294 |
| cutlass_lvl_1111 | 11.362981051206589 | 126.41650272067636 | -12.99105229490415 |
| cutlass_lvl_2222 | 11.107578873634338 | 555.8380545829423 | -14.946725248331441 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 14.037585817277431 | 0.21587548777461052 | NA |
| triton | 10.571777820587158 | 78.15654796129093 | -24.68948750735019 |
| triton_persistent_tma | 10.761583223938942 | 1.3195342738181353 | -23.337364672110443 |
| cutlass_lvl_default | 12.872588820755482 | 237.0100042372942 | -8.299126443010406 |
| cutlass_lvl_1111 | 11.08622644096613 | 137.55013868492097 | -21.02469338195443 |
| cutlass_lvl_2222 | 11.044904589653015 | 551.265836935956 | -21.319059178545007 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 30.483894050121307 | 0.27990864124149084 | NA |
| triton | 29.567627236247063 | 99.87172158574685 | -3.005740711366232 |
| triton_persistent_tma | 29.66325916349888 | 1.3695051120594144 | -2.692027748401006 |
| cutlass_lvl_default | 29.82821688055992 | 72.61214569816366 | -2.150897022812533 |
| cutlass_lvl_1111 | 29.476772993803024 | 67.7428645719774 | -3.303780857728953 |
| cutlass_lvl_2222 | 30.113255605101585 | 233.84051702311262 | -1.2158500630212203 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 30.58255836367607 | 0.058386584743857384 | NA |
| triton | 29.799651354551315 | 100.18178300186992 | -2.559978795150901 |
| triton_persistent_tma | 29.362043365836143 | 1.534341821912676 | -3.990885861562106 |
| cutlass_lvl_default | 29.4346883893013 | 73.68858492700383 | -3.7533484305817093 |
| cutlass_lvl_1111 | 29.164200648665428 | 75.44329373072833 | -4.637799421958348 |
| cutlass_lvl_2222 | 29.13798950612545 | 227.33327346481383 | -4.7235056020244 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten | 1656.6237211227417 | 0.0549461180344224 | NA |
| triton | 1892.8285837173462 | 2.3174119112081826 | 14.258208401997386 |
| triton_persistent_tma | 1665.332317352295 | 2.7922237082384527 | 0.525683419747917 |
| cutlass_lvl_default | 1705.5492401123047 | 108.31571159465238 | 2.9533272019312116 |
| cutlass_lvl_1111 | 1714.9059772491455 | 17.64627545280382 | 3.518134829489478 |
| cutlass_lvl_2222 | 1680.4152727127075 | 306.9972395859659 | 1.4361469829637354 |
+-----------------------+--------------------+----------------------+--------------------+
Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten | 1621.416687965393 | 0.06300561130046844 | NA |
| triton | 1782.3902368545532 | 2.318530729971826 | 9.927956834535548 |
| triton_persistent_tma | 1586.0934257507324 | 2.7931175641715527 | -2.178543151605614 |
| cutlass_lvl_default | 1657.4617624282837 | 43.31810224894434 | 2.2230605328307784 |
| cutlass_lvl_1111 | 1641.5367126464844 | 17.648567833006382 | 1.2408916739557292 |
| cutlass_lvl_2222 | 1645.8417177200317 | 249.33647010894492 | 1.5064005407078918 |
+-----------------------+--------------------+----------------------+--------------------+
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149009
Approved by: https://github.com/chenyang78 , https://github.com/jingsh
2025-03-13 01:57:47 +00:00
66300d3d55
[cutlass backend] try make cutlass backend benchmark more robust ( #149015 )
...
Differential Revision: [D71006269](https://our.internmc.facebook.com/intern/diff/D71006269/ )
I want to make sure that even if some experiments fail, the benchmark can still print the rest of the results.
```
Experiment group: mm (3x3, 3x3) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+-------------------+----------------------+---------------------+
| aten | 6.175220478326082 | 0.5982149520423263 | NA |
| triton | 5.326753947883844 | 3.2067150759976357 | -13.739858089605114 |
| triton_persistent_tma | 5.340870004147291 | 3.279932268196717 | -13.51126615004617 |
| cutlass_lvl_default | inf | inf | inf |
| cutlass_lvl_1111 | inf | inf | inf |
| cutlass_lvl_2222 | inf | inf | inf |
| cutlass_lvl_3333 | inf | inf | inf |
+-----------------------+-------------------+----------------------+---------------------+
```
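One way to get this behavior is to catch per-experiment failures and record inf instead of aborting, so the summary table can still be printed. A hypothetical sketch (names are illustrative, not the actual benchmark code):

```python
import math

def run_experiments(experiments):
    """Run each (name, fn) pair; on failure record inf instead of aborting."""
    results = {}
    for name, fn in experiments:
        try:
            results[name] = fn()  # returns forward time in us
        except Exception:
            results[name] = math.inf  # keep going; shows up as inf in the table
    return results

def fails():
    raise RuntimeError("no viable cutlass config for 3x3")

results = run_experiments([("aten", lambda: 6.175), ("cutlass_lvl_default", fails)])
print(results)  # {'aten': 6.175, 'cutlass_lvl_default': inf}
```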
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149015
Approved by: https://github.com/chenyang78 , https://github.com/jingsh
2025-03-12 18:59:49 +00:00
17518007b2
[cutlass backend] Benchmark compared to aten and triton ( #148347 )
...
Benchmark for cutlass backend.
```
python benchmarks/inductor_backends/cutlass.py
```
Test Plan:
```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 12.759539298713207 | 2.7271360370796174 | NA |
| triton | 10.573655366897583 | 1.8661278090439737 | -17.131370346859384 |
| triton_persistent_tma | 10.884030722081661 | 0.5315794269554317 | -14.698873781600327 |
| cutlass_lvl_default | 13.09632882475853 | 0.5520401500398293 | 2.6395116481931873 |
| cutlass_lvl_1111 | 11.05172373354435 | 0.569593315012753 | -13.384617776451302 |
| cutlass_lvl_2222 | 11.371277272701263 | 133.58984916994814 | -10.880189272601317 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 14.472318813204765 | 1.5445372510002926 | NA |
| triton | 10.568295605480671 | 16.583424195996486 | -26.975796056689987 |
| triton_persistent_tma | 10.45411266386509 | 5.830657540936954 | -27.764770809729562 |
| cutlass_lvl_default | 12.742593884468079 | 28.994930602959357 | -11.951954286402668 |
| cutlass_lvl_1111 | 11.522261425852776 | 79.85037935699802 | -20.38413764531163 |
| cutlass_lvl_2222 | 10.993581265211105 | 132.86601971101481 | -24.037181552548486 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+---------------------+
| aten | 30.700622126460075 | 2.225986961973831 | NA |
| triton | 29.17378954589367 | 38.571991189033724 | -4.97329524553989 |
| triton_persistent_tma | 29.642896726727486 | 7.2848734309664 | -3.4452897904663744 |
| cutlass_lvl_default | 29.514770954847336 | 29.819900761009194 | -3.8626291243482167 |
| cutlass_lvl_1111 | 29.411429539322853 | 23.82907024596352 | -4.19923929172139 |
| cutlass_lvl_2222 | 29.57325428724289 | 134.31008586101234 | -3.672133530628152 |
+-----------------------+--------------------+----------------------+---------------------+
Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten | 30.858177691698074 | 1.181898436974734 | NA |
| triton | 28.630023822188377 | 39.24473957403097 | -7.220626868414034 |
| triton_persistent_tma | 28.641965240240097 | 5.275042273919098 | -7.181929126210897 |
| cutlass_lvl_default | 29.16003204882145 | 29.934022572939284 | -5.503065216107967 |
| cutlass_lvl_1111 | 28.79570797085762 | 23.948012012057006 | -6.683705504085324 |
| cutlass_lvl_2222 | 29.02756631374359 | 136.25560767308343 | -5.932337924306467 |
+-----------------------+--------------------+----------------------+--------------------+
Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten | 1456.143856048584 | 1.020197194069624 | NA |
| triton | 1708.2737684249878 | 5.766509635956027 | 17.31490410985819 |
| triton_persistent_tma | 1476.485013961792 | 7.455113030038774 | 1.3969195302177155 |
| cutlass_lvl_default | 1583.3594799041748 | 50.408804678940214 | 8.736473620182366 |
| cutlass_lvl_1111 | 1636.4418268203735 | 82.82403108896688 | 12.381879030898025 |
| cutlass_lvl_2222 | 1507.5665712356567 | 260.03901409788523 | 3.531430975962381 |
+-----------------------+--------------------+----------------------+--------------------+
Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
| name | forward_time (us) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
| aten | 1382.230520248413 | 1.2586536260787398 | NA |
| triton | 1646.9683647155762 | 5.442052865982987 | 19.15294450447995 |
| triton_persistent_tma | 1423.9195585250854 | 6.515797697938979 | 3.016069871556595 |
| cutlass_lvl_default | 1500.9030103683472 | 51.36402789200656 | 8.58557877152115 |
| cutlass_lvl_1111 | 1446.9740390777588 | 30.65435610699933 | 4.683988515729638 |
| cutlass_lvl_2222 | 1419.661521911621 | 205.1948991640238 | 2.7080144096717635 |
+-----------------------+--------------------+----------------------+--------------------+
```
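The perf_over_aten column in these tables is the relative change in forward time versus the aten baseline, so negative means faster than aten. A minimal sketch, checked against the triton row of the fp16 1024x1024 table above:

```python
def perf_over_aten(forward_time_us: float, aten_time_us: float) -> float:
    """Percent change vs the aten baseline; negative = faster than aten."""
    return (forward_time_us - aten_time_us) / aten_time_us * 100.0

# triton row of the fp16 1024x1024 table: 10.57 us vs 12.76 us for aten
print(perf_over_aten(10.573655366897583, 12.759539298713207))  # ~-17.13
```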
Differential Revision: D70147589
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148347
Approved by: https://github.com/drisspg , https://github.com/chenyang78
2025-03-04 01:45:36 +00:00