10 Commits

Author SHA1 Message Date
b2953f5643 [9/N] Apply ruff UP035 rule (#165515)
This is a follow-up to #165214 that continues applying the ruff UP035 rule to the code base.
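
For context, ruff's UP035 (`deprecated-import`) flags imports of names from a deprecated location when a newer one exists. The snippet below is only an illustration of the kind of change the rule asks for, not code from this PR; the helper `apply_all` is hypothetical, and whether a given import is flagged depends on the configured target Python version.

```python
# Before (typically flagged by UP035): typing.Callable is a deprecated alias.
#   from typing import Callable
# After: import the name from collections.abc instead.
from collections.abc import Callable


def apply_all(funcs: list[Callable[[int], int]], x: int) -> list[int]:
    """Apply each function to x (hypothetical helper, for illustration only)."""
    return [f(x) for f in funcs]


print(apply_all([abs, lambda v: v + 1], -3))  # [3, -2]
```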

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165515
Approved by: https://github.com/Lucaskabela
2025-10-17 00:09:51 +00:00
a2a75be0f8 Rename inductor cache (#156128)
Requested by Simon on a different PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128
Approved by: https://github.com/xmfan
2025-06-17 03:57:18 +00:00
b878ca0c91 [cutlass backend] add fp8 to cutlass benchmark script (#155507)
Summary:
Add fp8.

Right now FP8 only allows fast_accum.

Test Plan:
```
Experiment group: _scaled_mm (8192x8192, 8192x8192) torch.float8_e4m3fn
+-----------------------+--------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | teraflops (TFLOPS) | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+--------------------+----------------------+--------------------+
|         aten          | 967.1226739883423  | 1136.8895149998868 |  1.219131228979677   |         NA         |
|        triton         | 1764.6185159683228 |  623.08743664783   |  20.373826419003308  | 82.46067054670186  |
| triton_persistent_tma | 1769.0335512161255 | 621.5323768280928  |  20.48663099599071   | 82.91718297956578  |
|  cutlass_lvl_default  | 790.5075550079346  | 1390.8932568835019 |  13.788519630907103  | -18.26191482535096 |
|   cutlass_lvl_3332    | 803.7384748458862  | 1367.996757884245  |  226.81587297911756  | -16.89384434227684 |
+-----------------------+--------------------+--------------------+----------------------+--------------------+
```
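
The teraflops column above appears consistent with counting 2·M·N·K floating-point operations per matmul and dividing by the forward time. A minimal sketch under that assumption (the function name `matmul_tflops` is hypothetical and not taken from the benchmark script):

```python
def matmul_tflops(m: int, n: int, k: int, forward_time_us: float) -> float:
    """TFLOPS for an (m x k) @ (k x n) matmul, assuming 2*m*n*k operations."""
    flops = 2 * m * n * k
    seconds = forward_time_us * 1e-6  # microseconds -> seconds
    return flops / seconds / 1e12     # FLOP/s -> TFLOPS


# forward_time (us) of the aten row in the table above
aten_tflops = matmul_tflops(8192, 8192, 8192, 967.1226739883423)
print(f"{aten_tflops:.1f}")  # about 1136.9, in line with the aten row
```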

Rollback Plan:

Differential Revision: D76310809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155507
Approved by: https://github.com/ColinPeppler
2025-06-13 05:11:15 +00:00
2481c4b2ea [cutlass backend] add teraflops and increase rep for benchmark script (#154944)
Differential Revision: [D75840023](https://our.internmc.facebook.com/intern/diff/D75840023/)

I think I will continue to use do_bench for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154944
Approved by: https://github.com/mlazos
2025-06-05 17:20:29 +00:00
cb56df55dc [Inductor]Cleanup autotune_fallback_to_aten post-deprecation (#154331)
Fixes #153298

This PR is the 3rd and final step of #147479
All references to autotune_fallback_to_aten have been removed, and the feature is now deprecated.
All calls to should_fallback_to_aten() were also removed, as they were deemed unnecessary.

[henrylhtsang](https://github.com/henrylhtsang)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154331
Approved by: https://github.com/henrylhtsang
2025-05-29 20:29:58 +00:00
00ebbbb701 [cutlass backend] add addmm and bmm for cutlass backend benchmark (#152163)
Copying what @kadeng did.

```
FINAL results...

Experiment group: bmm (BS: 8, 1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 44.454172253608704 |  3.0991086587309837  |         NA          |
|        triton         | 44.06978189945221  | 0.07496077567338943  | -0.8646890374284049 |
| triton_persistent_tma | 43.598245829343796 | 0.06154991965740919  | -1.9254130284597197 |
|  cutlass_lvl_default  | 39.91834074258804  | 0.056073310784995556 | -10.20338762612423  |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: bmm (BS: 8, 1024x1024, 1024x1024) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
|         name          | forward_time (us) | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+-------------------+----------------------+---------------------+
|         aten          | 49.05610531568527 |  0.160279156640172   |         NA          |
|        triton         | 43.97720843553543 |  0.0660805031657219  | -10.353241145961718 |
| triton_persistent_tma | 43.94153505563736 | 0.061738294549286366 | -10.425960697724962 |
|  cutlass_lvl_default  | 40.2066633105278  | 0.034127906896173954 | -18.039430460713596 |
+-----------------------+-------------------+----------------------+---------------------+

Average edge over aten (max(-edge, 0), higher is better):
triton: 5.608965091695062 (from 2 valid values)
triton_persistent_tma: 6.175686863092341 (from 2 valid values)
cutlass_lvl_default: 14.121409043418913 (from 2 valid values)
```
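
The "average edge" lines above can be reproduced from the perf_over_aten column: each value is negated, clamped at zero, and averaged. A minimal sketch, assuming "valid" means finite (non-inf, non-NaN) entries; `average_edge` is a hypothetical name, not the script's own function:

```python
import math


def average_edge(perf_over_aten: list[float]) -> tuple[float, int]:
    """Return (mean of max(-p, 0), number of valid values) over finite entries."""
    valid = [p for p in perf_over_aten if math.isfinite(p)]
    if not valid:
        return math.nan, 0
    edges = [max(-p, 0.0) for p in valid]  # only wins over aten count
    return sum(edges) / len(edges), len(valid)


# triton rows from the float16 and bfloat16 bmm tables above
avg, count = average_edge([-0.8646890374284049, -10.353241145961718])
print(f"triton: {avg} (from {count} valid values)")  # roughly 5.609 from 2 values
```

Clamping at zero means a backend that is slower than aten on some shape contributes 0 rather than a negative edge.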

Differential Revision: [D73625766](https://our.internmc.facebook.com/intern/diff/D73625766/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152163
Approved by: https://github.com/jingsh
2025-04-28 20:16:17 +00:00
5a51de5ab1 [cutlass backend] Add more logs for cutlass backend benchmark (#150639)
The goal is to have a way to tell whether a change makes performance better or worse.

```
Average edge over aten (max(-edge, 0), higher is better):
triton: 8.596507086950552 (from 6 valid values)
triton_persistent_tma: 9.517193693923307 (from 6 valid values)
cutlass_lvl_default: 3.3234737908691785 (from 6 valid values)
cutlass_lvl_1111: 7.088173348313991 (from 6 valid values)
cutlass_lvl_2222: 7.291869722320318 (from 6 valid values)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150639
Approved by: https://github.com/ColinPeppler
2025-04-15 04:19:51 +00:00
f2d43d866c [cutlass backend] switch layout for cutlass backend benchmark (#149009)
```
python benchmarks/inductor_backends/cutlass.py
```

logs:
```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 13.059554621577263 |  1.580178506206721   |         NA          |
|        triton         | 10.245470330119133 | 0.04118620231747627  | -21.54808776410064  |
| triton_persistent_tma | 10.388538241386414 | 0.04225084185600281  | -20.45258400908819  |
|  cutlass_lvl_default  | 12.882896699011326 |  231.14990583620965  | -1.3527101626732294 |
|   cutlass_lvl_1111    | 11.362981051206589 |  126.41650272067636  | -12.99105229490415  |
|   cutlass_lvl_2222    | 11.107578873634338 |  555.8380545829423   | -14.946725248331441 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 14.037585817277431 | 0.21587548777461052  |         NA          |
|        triton         | 10.571777820587158 |  78.15654796129093   | -24.68948750735019  |
| triton_persistent_tma | 10.761583223938942 |  1.3195342738181353  | -23.337364672110443 |
|  cutlass_lvl_default  | 12.872588820755482 |  237.0100042372942   | -8.299126443010406  |
|   cutlass_lvl_1111    | 11.08622644096613  |  137.55013868492097  | -21.02469338195443  |
|   cutlass_lvl_2222    | 11.044904589653015 |   551.265836935956   | -21.319059178545007 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 30.483894050121307 | 0.27990864124149084  |         NA          |
|        triton         | 29.567627236247063 |  99.87172158574685   | -3.005740711366232  |
| triton_persistent_tma | 29.66325916349888  |  1.3695051120594144  | -2.692027748401006  |
|  cutlass_lvl_default  | 29.82821688055992  |  72.61214569816366   | -2.150897022812533  |
|   cutlass_lvl_1111    | 29.476772993803024 |   67.7428645719774   | -3.303780857728953  |
|   cutlass_lvl_2222    | 30.113255605101585 |  233.84051702311262  | -1.2158500630212203 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 30.58255836367607  | 0.058386584743857384 |         NA          |
|        triton         | 29.799651354551315 |  100.18178300186992  | -2.559978795150901  |
| triton_persistent_tma | 29.362043365836143 |  1.534341821912676   | -3.990885861562106  |
|  cutlass_lvl_default  |  29.4346883893013  |  73.68858492700383   | -3.7533484305817093 |
|   cutlass_lvl_1111    | 29.164200648665428 |  75.44329373072833   | -4.637799421958348  |
|   cutlass_lvl_2222    | 29.13798950612545  |  227.33327346481383  |  -4.7235056020244   |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1656.6237211227417 |  0.0549461180344224  |         NA         |
|        triton         | 1892.8285837173462 |  2.3174119112081826  | 14.258208401997386 |
| triton_persistent_tma | 1665.332317352295  |  2.7922237082384527  | 0.525683419747917  |
|  cutlass_lvl_default  | 1705.5492401123047 |  108.31571159465238  | 2.9533272019312116 |
|   cutlass_lvl_1111    | 1714.9059772491455 |  17.64627545280382   | 3.518134829489478  |
|   cutlass_lvl_2222    | 1680.4152727127075 |  306.9972395859659   | 1.4361469829637354 |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1621.416687965393  | 0.06300561130046844  |         NA         |
|        triton         | 1782.3902368545532 |  2.318530729971826   | 9.927956834535548  |
| triton_persistent_tma | 1586.0934257507324 |  2.7931175641715527  | -2.178543151605614 |
|  cutlass_lvl_default  | 1657.4617624282837 |  43.31810224894434   | 2.2230605328307784 |
|   cutlass_lvl_1111    | 1641.5367126464844 |  17.648567833006382  | 1.2408916739557292 |
|   cutlass_lvl_2222    | 1645.8417177200317 |  249.33647010894492  | 1.5064005407078918 |
+-----------------------+--------------------+----------------------+--------------------+
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149009
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-13 01:57:47 +00:00
66300d3d55 [cutlass backend] try make cutlass backend benchmark more robust (#149015)
Differential Revision: [D71006269](https://our.internmc.facebook.com/intern/diff/D71006269/)

I want to make sure that the benchmark can still print most of the results even if some experiments fail.

```
Experiment group: mm (3x3, 3x3) torch.bfloat16
+-----------------------+-------------------+----------------------+---------------------+
|         name          | forward_time (us) | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+-------------------+----------------------+---------------------+
|         aten          | 6.175220478326082 |  0.5982149520423263  |         NA          |
|        triton         | 5.326753947883844 |  3.2067150759976357  | -13.739858089605114 |
| triton_persistent_tma | 5.340870004147291 |  3.279932268196717   | -13.51126615004617  |
|  cutlass_lvl_default  |        inf        |         inf          |         inf         |
|   cutlass_lvl_1111    |        inf        |         inf          |         inf         |
|   cutlass_lvl_2222    |        inf        |         inf          |         inf         |
|   cutlass_lvl_3333    |        inf        |         inf          |         inf         |
+-----------------------+-------------------+----------------------+---------------------+
```
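
One way to get this behaviour is to catch failures per experiment and record `inf` for the failed rows, as the table above shows. The sketch below is an assumption about the general pattern, not the script's actual code; `run_experiment_safely`, `failing_experiment`, and the metric keys are hypothetical, and the aten numbers are rounded placeholders.

```python
from collections.abc import Callable


def run_experiment_safely(
    name: str, experiment: Callable[[], dict[str, float]]
) -> dict[str, object]:
    """Run one experiment; on failure, report inf metrics instead of aborting."""
    try:
        metrics = experiment()
    except Exception as exc:  # broad on purpose: one bad config must not stop the run
        print(f"{name} failed: {exc}")
        metrics = {"forward_time_us": float("inf"), "compilation_time_s": float("inf")}
    return {"name": name, **metrics}


def failing_experiment() -> dict[str, float]:
    """Stand-in for a backend that cannot handle the given shape."""
    raise RuntimeError("no valid kernel for this shape")


results = [
    run_experiment_safely("aten", lambda: {"forward_time_us": 6.18, "compilation_time_s": 0.60}),
    run_experiment_safely("cutlass_lvl_default", failing_experiment),
]
for row in results:
    print(row)
```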
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149015
Approved by: https://github.com/chenyang78, https://github.com/jingsh
2025-03-12 18:59:49 +00:00
17518007b2 [cutlass backend] Benchmark compared to aten and triton (#148347)
Benchmark for cutlass backend.

```
python benchmarks/inductor_backends/cutlass.py
```

Test Plan:
```
Experiment group: mm (1024x1024, 1024x1024) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 12.759539298713207 |  2.7271360370796174  |         NA          |
|        triton         | 10.573655366897583 |  1.8661278090439737  | -17.131370346859384 |
| triton_persistent_tma | 10.884030722081661 |  0.5315794269554317  | -14.698873781600327 |
|  cutlass_lvl_default  | 13.09632882475853  |  0.5520401500398293  | 2.6395116481931873  |
|   cutlass_lvl_1111    | 11.05172373354435  |  0.569593315012753   | -13.384617776451302 |
|   cutlass_lvl_2222    | 11.371277272701263 |  133.58984916994814  | -10.880189272601317 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (1024x1024, 1024x1024) torch.bfloat16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 14.472318813204765 |  1.5445372510002926  |         NA          |
|        triton         | 10.568295605480671 |  16.583424195996486  | -26.975796056689987 |
| triton_persistent_tma | 10.45411266386509  |  5.830657540936954   | -27.764770809729562 |
|  cutlass_lvl_default  | 12.742593884468079 |  28.994930602959357  | -11.951954286402668 |
|   cutlass_lvl_1111    | 11.522261425852776 |  79.85037935699802   | -20.38413764531163  |
|   cutlass_lvl_2222    | 10.993581265211105 |  132.86601971101481  | -24.037181552548486 |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.float16
+-----------------------+--------------------+----------------------+---------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%)  |
+-----------------------+--------------------+----------------------+---------------------+
|         aten          | 30.700622126460075 |  2.225986961973831   |         NA          |
|        triton         | 29.17378954589367  |  38.571991189033724  |  -4.97329524553989  |
| triton_persistent_tma | 29.642896726727486 |   7.2848734309664    | -3.4452897904663744 |
|  cutlass_lvl_default  | 29.514770954847336 |  29.819900761009194  | -3.8626291243482167 |
|   cutlass_lvl_1111    | 29.411429539322853 |  23.82907024596352   |  -4.19923929172139  |
|   cutlass_lvl_2222    | 29.57325428724289  |  134.31008586101234  | -3.672133530628152  |
+-----------------------+--------------------+----------------------+---------------------+

Experiment group: mm (2048x2048, 2048x2048) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 30.858177691698074 |  1.181898436974734   |         NA         |
|        triton         | 28.630023822188377 |  39.24473957403097   | -7.220626868414034 |
| triton_persistent_tma | 28.641965240240097 |  5.275042273919098   | -7.181929126210897 |
|  cutlass_lvl_default  | 29.16003204882145  |  29.934022572939284  | -5.503065216107967 |
|   cutlass_lvl_1111    | 28.79570797085762  |  23.948012012057006  | -6.683705504085324 |
|   cutlass_lvl_2222    | 29.02756631374359  |  136.25560767308343  | -5.932337924306467 |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.float16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1456.143856048584  |  1.020197194069624   |         NA         |
|        triton         | 1708.2737684249878 |  5.766509635956027   | 17.31490410985819  |
| triton_persistent_tma | 1476.485013961792  |  7.455113030038774   | 1.3969195302177155 |
|  cutlass_lvl_default  | 1583.3594799041748 |  50.408804678940214  | 8.736473620182366  |
|   cutlass_lvl_1111    | 1636.4418268203735 |  82.82403108896688   | 12.381879030898025 |
|   cutlass_lvl_2222    | 1507.5665712356567 |  260.03901409788523  | 3.531430975962381  |
+-----------------------+--------------------+----------------------+--------------------+

Experiment group: mm (8192x8192, 8192x8192) torch.bfloat16
+-----------------------+--------------------+----------------------+--------------------+
|         name          | forward_time (us)  | compilation_time (s) | perf_over_aten (%) |
+-----------------------+--------------------+----------------------+--------------------+
|         aten          | 1382.230520248413  |  1.2586536260787398  |         NA         |
|        triton         | 1646.9683647155762 |  5.442052865982987   | 19.15294450447995  |
| triton_persistent_tma | 1423.9195585250854 |  6.515797697938979   | 3.016069871556595  |
|  cutlass_lvl_default  | 1500.9030103683472 |  51.36402789200656   |  8.58557877152115  |
|   cutlass_lvl_1111    | 1446.9740390777588 |  30.65435610699933   | 4.683988515729638  |
|   cutlass_lvl_2222    | 1419.661521911621  |  205.1948991640238   | 2.7080144096717635 |
+-----------------------+--------------------+----------------------+--------------------+
```
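
The perf_over_aten column in these tables is consistent with the relative difference in forward time against the aten row, so negative values mean the backend is faster than aten. A minimal sketch under that assumption (`perf_over_aten` here is a hypothetical helper, not the script's own function):

```python
def perf_over_aten(forward_time_us: float, aten_time_us: float) -> float:
    """Relative forward-time difference vs. aten, in percent (negative = faster)."""
    return (forward_time_us - aten_time_us) / aten_time_us * 100.0


# triton vs. aten, mm (1024x1024, 1024x1024) torch.float16, from the first table above
value = perf_over_aten(10.573655366897583, 12.759539298713207)
print(f"{value:.2f}")  # about -17.13
```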

Differential Revision: D70147589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148347
Approved by: https://github.com/drisspg, https://github.com/chenyang78
2025-03-04 01:45:36 +00:00