mirror of
https://github.com/pytorch/pytorch.git
synced 2025-10-21 05:34:18 +08:00
d2d4a3c539bd181c0be81c0fe7d54dc49e4811b2
419 Commits
Author | SHA1 | Message | Date
---|---|---|---
d2d4a3c539 |
Select Algorithm clear feedback savers (#161654)
Add `clear_feedback_savers` and tests for the feedback functionality. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161654 Approved by: https://github.com/masnesral |
|||
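The feedback-saver hooks this entry extends can be exercised roughly as below. A minimal sketch, assuming the entry points live in `torch._inductor.select_algorithm` (where `add_feedback_saver` is defined today and where #161654 adds `clear_feedback_savers`); the callback signature shown is an assumption, not a stable contract.
```python
from torch._inductor import select_algorithm

# Assumed callback shape: invoked after autotuning with per-choice timings.
def save_feedback(timings, name, input_nodes, choices, profiled_time=None):
    for choice, seconds in timings.items():
        print(f"{name}: {seconds:.6f}s for {choice}")

select_algorithm.add_feedback_saver(save_feedback)
# ... compile models; autotuning results flow through save_feedback ...
select_algorithm.clear_feedback_savers()  # drop all registered savers (#161654)
```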
b36a20d368 |
Ensure large tensor int32 -> int64 indexing is enabled (#157767)
Fixes: #https://github.com/pytorch/pytorch/issues/157446 I think that this delta is worth the switch form block-ptrs especially since they are deprecated ## Perf Summary A is nightly B is this diff, so `negative` means this diff improves perf TOP 5 differences <img width="805" height="754" alt="Screenshot 2025-08-24 at 5 49 49 PM" src="https://github.com/user-attachments/assets/aa359cdf-ee9a-427d-be72-1b9aef6f3115" /> <details> <summary><strong>Full perf table (click to expand)</strong></summary> | attn_type | dtype | shape(B,Hq,M,Hkv,N,D) | TFlops Version A | TFlops Version B | | --- | --- | --- | --- | --- | | noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 258.38834144791923 | 258.6353685004612 | | causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.2192450677751 | 140.12393320464972 | | alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 122.32683823617003 | 118.51603755647925 | | sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.48556906165314 | 137.24259849208627 | | document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 86.59814488695922 | 84.59431398586257 | | noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 288.52679758135764 | 292.9174195871856 | | causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 172.25541683643277 | 172.94326459828508 | | alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 164.40864610599826 | 165.035129576335 | | sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 176.54876886433945 | 175.08057670028145 | | document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 125.22491679812626 | 121.06201152859151 | | noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 339.11952481874283 | 339.0132835601695 | | causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 227.58583240284406 | 228.21824999409597 | | alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 185.98569659868966 | 182.32850843255093 | | sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 188.9495725191772 | 180.31385312481657 | | document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 106.25789530994302 | 106.55084959448476 | | noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 357.6430536888533 | 363.30843452247274 | | causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 262.3241154406613 | 265.73250045488 | | alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 249.30498953911416 | 249.35928192833785 | | sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 224.74126243851808 | 223.71776504077988 | | document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 168.26977014013707 | 165.47991483333809 | | noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 382.8178701785897 | 384.34752965862685 | | causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 308.1449710013853 | 311.0653716044644 | | alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 251.96365252505072 | 243.92283557225903 | | sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 226.69316232745368 | 215.22769268913356 | | document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 153.34142545296405 | 151.9312673939401 | | noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 396.0998000753126 | 398.35036286102473 | | causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 333.5198415274966 | 344.6354466169716 | | alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 310.5955933379696 | 305.66347819546 | | sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 260.4012412689896 | 
259.758666997307 | | document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 234.13034252182635 | 227.61676497283614 | | noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 396.17615538477196 | 401.1419104525502 | | causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 359.98648311998414 | 360.8285563463094 | | alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 291.97720707257736 | 281.41694809965253 | | sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 250.1703628419691 | 238.556760291579 | | document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 199.50782826294306 | 191.52327358439223 | | noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 411.0632004785396 | 413.6362648405517 | | causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 382.9404387613185 | 397.74886235657607 | | alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 357.0998545146633 | 350.5115200772392 | | sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 281.8033924428203 | 281.98601309215843 | | document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 282.56595134222135 | 277.4565795466672 | | noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 408.89838018149516 | 405.14531386840076 | | causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 396.07662058160264 | 393.4598228299578 | | alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 317.8822887267849 | 304.754931401036 | | sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 265.8801304948243 | 254.22961974295112 | | document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 227.87390579965614 | 222.19481980110393 | | noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 427.36821778477025 | 431.3766620314935 | | causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 410.67994346825 | 423.4666944003808 | | alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 381.1968748374038 | 381.77668006420424 | | sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 292.5540046358546 | 296.5439130720502 | | document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 321.04573768858114 | 310.7423616656888 | | noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 427.46148866769903 | 426.162091037068 | | causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 419.75580537687347 | 421.88640120274334 | | alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 337.3208051798903 | 327.4912454675092 | | sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 276.5638854539581 | 262.988360558083 | | document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 250.82791326036886 | 245.07367032501736 | | noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 435.8055824506086 | 441.8803729460534 | | causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 432.02638235921006 | 450.33161016596273 | | alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 402.25525939224883 | 393.8564689669916 | | sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 297.5337286675904 | 297.0131881135074 | | document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 343.8697037899545 | 329.8194073407783 | | noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 267.58912366821056 | 256.91606054118375 | | causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 150.81723692609629 | 146.32172267858743 | | alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 129.51029293209245 | 122.72144394093334 | | sliding_window | torch.bfloat16 | (2, 16, 
1024, 4, 1024, 64) | 147.627656359087 | 141.68956350566188 | | document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 87.55100546003591 | 84.91293287692788 | | noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 299.5931492743986 | 305.884253766691 | | causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 179.39026367843837 | 181.64741311605096 | | alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 173.93547669282367 | 173.23972950980564 | | sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 185.90234171599252 | 182.80844545446686 | | document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 128.08176696266082 | 123.27722685662111 | | noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 340.50674552770664 | 338.9071088484576 | | causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 225.4438318650432 | 230.22899884832975 | | alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 194.15123248528312 | 185.02793973094865 | | sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 200.74289714108176 | 191.76606719670647 | | document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 107.03564946728423 | 106.82432377861258 | | noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 371.31799283918406 | 379.7555394732925 | | causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 275.97762744310455 | 276.71106853992995 | | alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 261.6648679783462 | 259.4127232060398 | | sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 237.03108223577615 | 233.92710216149527 | | document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 172.13926800371152 | 168.74390922407585 | | noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 381.50199487767276 | 383.9043681999597 | | causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 307.9748883093411 | 312.2403515462001 | | alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 251.11319684705438 | 243.17870127827277 | | sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 236.3253127246763 | 223.81250201769552 | | document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 154.55693991756874 | 153.11360584987685 | | noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 407.11400078586615 | 413.53709886086557 | | causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 348.1705797722622 | 360.09771155957367 | | alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 321.8593280850388 | 318.2882327401255 | | sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 270.089032013835 | 268.767323026064 | | document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 238.07324557907788 | 228.09842078362692 | | noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 399.8172853171901 | 401.0954526332136 | | causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 363.4387330438581 | 364.13111024232677 | | alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 294.1752429133857 | 283.7235663368415 | | sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 256.8389394007649 | 246.91771015606483 | | document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 199.3378564292656 | 192.40439590901758 | | noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 425.5150965556111 | 430.8190098707553 | | causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 396.00437184073013 | 411.3873625655787 | | alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 369.92803661607815 | 361.43244467343663 | | sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) 
| 293.4277354412933 | 295.2529537595746 | | document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 288.0208673072841 | 281.51896404878863 | | noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 408.3005367220567 | 408.96116482298913 | | causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 396.90095962766304 | 396.87385456176486 | | alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 319.0534576137999 | 302.50950358107764 | | sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 270.3334977708081 | 258.8506349486557 | | document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 227.46824134365394 | 222.23759438128766 | | noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 438.24247309479694 | 437.7975163205371 | | causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 428.34012029699227 | 433.3215899950434 | | alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 386.52672049728875 | 388.26216893354984 | | sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 302.71976814728083 | 302.3574867306459 | | document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 327.39760662780986 | 308.6348428844912 | | noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 423.31308678262695 | 426.6306972137279 | | causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 412.6983690923106 | 419.4961977664297 | | alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 337.41003544742273 | 324.2155049126126 | | sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 278.7755890910794 | 265.9194286636502 | | document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 251.55678254755364 | 244.8843180141462 | | noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 452.5930781172308 | 457.7117122300742 | | causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 445.05676260348116 | 463.9304535499636 | | alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 415.78302138389415 | 406.29229555271456 | | sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 308.0311067300895 | 304.91354721414314 | | document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 351.43943626809335 | 329.4476923070317 | | noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 295.1801525813241 | 291.36521287398904 | | causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 183.23250549178067 | 182.35421238887605 | | alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 151.56832453117747 | 151.3422139154794 | | sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 171.02111935180432 | 160.72516856727913 | | document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 74.05765122783826 | 74.5885345035243 | | noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 314.3587394591763 | 319.2938677773619 | | causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 224.57002084153177 | 225.48868542008177 | | alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.00964804143052 | 215.39576159953486 | | sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.1174237618258 | 214.28437413525663 | | document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 121.08920423648368 | 119.55813661872644 | | noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 362.2193857281911 | 360.05005804275936 | | causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 279.8840217430121 | 279.5437918286659 | | alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 227.76617121021982 | 222.8655938229316 | | sliding_window | torch.bfloat16 
| (4, 16, 2048, 16, 2048, 64) | 215.43141176970562 | 207.71852284994702 | | document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 121.35588364218539 | 121.20636565046884 | | noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 365.1545280898012 | 373.37585444987326 | | causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 304.360119952975 | 309.1247297936263 | | alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 287.2603904544586 | 289.25547903162595 | | sliding_window | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 257.9852675272418 | 257.59069234098115 | | document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 188.35158496670232 | 184.24683960154857 | | noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 389.9744911369211 | 388.43466897254166 | | causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 345.9228295166513 | 342.63034895210126 | | alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 279.56334658247437 | 271.2724375402088 | | sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 245.66477202810066 | 233.49688207371258 | | document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 170.3270720653187 | 166.23863845657382 | | noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 400.0041140827554 | 402.11182445396497 | | causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 363.64641830327434 | 375.9288663364792 | | alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 341.5776139573363 | 335.1160003213424 | | sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 281.1811770268521 | 280.21438270014005 | | document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 247.78716118997716 | 245.3269825179633 | | noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 403.794126680488 | 405.2353919019577 | | causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 387.079178426863 | 385.1461762057035 | | alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 309.7847188173431 | 298.0443968374749 | | sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 262.4721750159666 | 250.81679725428586 | | document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 205.70866004479979 | 202.9620839129557 | | noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 413.380982988662 | 418.40270594263103 | | causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 398.450064800682 | 409.6794973994029 | | alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 372.26297458194466 | 364.44415106552196 | | sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 293.0818569905912 | 292.85172400643984 | | document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 296.46717085592087 | 285.76362010612763 | | noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 419.3186786037592 | 426.08801580934437 | | causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 408.1648467766632 | 409.4122254207817 | | alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 329.24396020457345 | 313.5200995121138 | | sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 274.61257504571876 | 255.7801815432177 | | document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 232.63806001220684 | 230.03020843492314 | | noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 435.0785891054788 | 440.39101804225345 | | causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 424.86925312752817 | 435.18898057396825 | | alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 393.000417896268 | 395.11543361225256 | | 
sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 297.7755459218185 | 300.7208114715287 | | document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 331.71570861760534 | 318.07127352552885 | | noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 424.58602747137405 | 425.84897078470715 | | causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 422.66607285025725 | 423.5524945535485 | | alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 344.8625760048626 | 331.6793888458635 | | sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 282.0787281511649 | 263.7895634445868 | | document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 252.7301927385177 | 245.41844170037427 | | noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 437.0658069164588 | 442.9101960063628 | | causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 433.13788271434646 | 452.3873572709863 | | alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 404.0959191546953 | 396.7077863894884 | | sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 300.45502211883206 | 301.3439134717943 | | document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 344.11003202413934 | 330.8897663350314 | | noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 298.4364205341705 | 291.6793556507056 | | causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 187.6382133139633 | 191.05409897308772 | | alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 156.55822078636112 | 154.178925976516 | | sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 173.47765221825162 | 169.30862508068464 | | document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 74.5885345035243 | 74.52689061607104 | | noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 323.12233826013045 | 328.53889207933514 | | causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 236.75872140126316 | 235.8378325547398 | | alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 227.17836523816675 | 226.75357076139966 | | sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 224.07209453308036 | 224.07209453308036 | | document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 122.85572156047981 | 121.11642183704716 | | noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 361.3123326658092 | 360.71014086458337 | | causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 281.5287983927017 | 281.94301754758345 | | alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 232.7456696285686 | 226.50976826432776 | | sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 221.5612361744038 | 214.96188822837055 | | document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 121.38311528944315 | 120.85441868178513 | | noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 380.2579019244734 | 389.2520157863988 | | causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 316.95230660496924 | 317.87597790618906 | | alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 301.07968126657323 | 298.02424098422983 | | sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 267.2240756921594 | 267.16353549228154 | | document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 189.82761622494257 | 186.736450261963 | | noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 389.88665375406805 | 387.9125133037077 | | causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 348.70619958684887 | 346.6750499749774 | | alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 280.5472989906087 | 
271.22300822012187 | | sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 250.02397620165968 | 241.22532776331445 | | document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 171.67817496107645 | 166.95679280483972 | | noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 412.626880230807 | 417.60238657950777 | | causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 374.8829313933945 | 389.4448546468815 | | alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 353.20410434172436 | 345.7072490717473 | | sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 292.51045924209586 | 291.66621022138287 | | document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 251.6264062063495 | 248.45110052911542 | | noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 404.0155784550126 | 401.90546837237514 | | causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 384.4389015599863 | 386.9684324594344 | | alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 313.3731284132225 | 298.17074251037894 | | sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 264.19199737284265 | 252.8982463999916 | | document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 207.03696315185684 | 202.86697323136772 | | noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 428.2436763312506 | 433.45005568619536 | | causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 411.8516531869893 | 428.2753623461049 | | alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 384.9095037182509 | 372.90888743000744 | | sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 303.2438915629836 | 302.05095952914337 | | document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 301.8689122735564 | 285.0363190513223 | | noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 423.13592231504805 | 420.3991500185611 | | causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 407.44527331585493 | 408.5064370765247 | | alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 330.50050996167414 | 316.8763979925965 | | sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 274.6833786307413 | 259.86098862141324 | | document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 232.24019584158367 | 226.52040268160232 | | noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 444.4596314237808 | 455.99558915752266 | | causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 437.4245561244369 | 455.98275147271966 | | alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 397.3350686877605 | 397.88875599028063 | | sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 308.53809114394545 | 307.1359822042007 | | document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 331.32379843423774 | 316.85293191675646 | | noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 422.4622274366379 | 425.0407156418684 | | causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 420.9547052783101 | 430.33779243510276 | | alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 345.50265346504085 | 332.094855328957 | | sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 280.81715528243365 | 264.6543640282054 | | document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 252.25635200421783 | 245.46235499490305 | | noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 452.5524207341139 | 461.7512032176736 | | causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 445.2316469907137 | 464.4523799578466 | | alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 
416.87264016717023 | 409.17124592157046 | | sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 309.42579489389846 | 307.9734464665731 | | document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 350.50782004300623 | 330.98959545427294 | </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157767 Approved by: https://github.com/Skylion007 |
|||
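The indexing guarantee behind this change is small enough to state directly: once a kernel can touch more than 2**31 - 1 elements, its offsets must be computed in 64-bit. A hypothetical helper illustrating the rule (not Inductor's actual code path):
```python
# Hypothetical helper: choose the Triton index dtype for a kernel whose
# flattened iteration space covers `numel` elements.
INT32_MAX = 2**31 - 1

def index_dtype(numel: int) -> str:
    return "tl.int64" if numel > INT32_MAX else "tl.int32"

assert index_dtype(1024) == "tl.int32"
assert index_dtype(2**31) == "tl.int64"  # large tensors need 64-bit offsets
```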
818ba434c7 |
Revert "Ensure large tensor int32 -> int64 indexing is enabled (#157767)"
This reverts commit fc69c2bc67672c3b2d0c62c1821895f09288f1c0. Reverted https://github.com/pytorch/pytorch/pull/157767 on behalf of https://github.com/atalman due to internal failure, sorry will revert ([comment](https://github.com/pytorch/pytorch/pull/157767#issuecomment-3224341111)) |
|||
92ab184824 |
Revert "[Inductor] Prune configs that require more shared memory than the hardware limit (#161040)"
This reverts commit b2e06e0194c3fa8f7578a1b48751cc027394fb67. Reverted https://github.com/pytorch/pytorch/pull/161040 on behalf of https://github.com/jeffdaily due to still failing on rocm, see https://hud.pytorch.org/failure?name=rocm%20%2F%20linux-jammy-rocm-py3.10%20%2F%20test%20(default%2C%203%2C%206%2C%20linux.rocm.gpu.2)&jobName=undefined&failureCaptures=inductor%2Ftest_triton_heuristics.py%3A%3ATestTritonHeuristics%3A%3Atest_prune_configs_over_shared_memory_limit_do_pruning_True ([comment](https://github.com/pytorch/pytorch/pull/161040#issuecomment-3222430129)) |
|||
b2e06e0194 |
[Inductor] Prune configs that require more shared memory than the hardware limit (#161040)
Summary: This diff removes configs that require more shared memory than the hardware limit; such configs otherwise fail compilation with the following error: ``` No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 327680 Hardware limit:232448 Reducing block sizes or `num_stages` may help. ``` Test Plan: ``` buck2 test mode/dev-nosan fbcode//caffe2/test/inductor:max_autotune -- test_max_autotune_prune_choices -v 1,stderr ``` Rollback Plan: Differential Revision: D80594562 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161040 Approved by: https://github.com/eellison |
|||
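The pruning itself reduces to comparing a per-config shared-memory estimate against the device limit. A self-contained sketch, assuming the standard mm-pipeline estimate (one A tile plus one B tile staged per pipeline stage); the helper names are illustrative, not Inductor's:
```python
# Sketch of the pruning idea. For a bf16/fp16 matmul, each pipeline stage
# stages one A tile (BLOCK_M x BLOCK_K) and one B tile (BLOCK_K x BLOCK_N)
# in shared memory.
def smem_bytes(cfg, dtype_size=2):
    tiles = cfg["BLOCK_M"] * cfg["BLOCK_K"] + cfg["BLOCK_K"] * cfg["BLOCK_N"]
    return tiles * dtype_size * cfg["num_stages"]

def prune_configs(configs, hw_limit=232448):
    kept = [c for c in configs if smem_bytes(c) <= hw_limit]
    return kept or configs  # never return an empty config list

configs = [
    {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 128, "num_stages": 5},  # 327680 B
    {"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 64, "num_stages": 3},     #  49152 B
]
print(prune_configs(configs))  # the 327680-byte config is dropped
```
The first config reproduces exactly the "Required: 327680" case from the error message above.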
fc69c2bc67 |
Ensure large tensor int32 -> int64 indexing is enabled (#157767)
Fixes: #https://github.com/pytorch/pytorch/issues/157446 I think that this delta is worth the switch form block-ptrs especially since they are deprecated ## Perf Summary A is nightly B is this diff, so `negative` means this diff improves perf TOP 5 differences <img width="805" height="754" alt="Screenshot 2025-08-24 at 5 49 49 PM" src="https://github.com/user-attachments/assets/aa359cdf-ee9a-427d-be72-1b9aef6f3115" /> <details> <summary><strong>Full perf table (click to expand)</strong></summary> | attn_type | dtype | shape(B,Hq,M,Hkv,N,D) | TFlops Version A | TFlops Version B | | --- | --- | --- | --- | --- | | noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 258.38834144791923 | 258.6353685004612 | | causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.2192450677751 | 140.12393320464972 | | alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 122.32683823617003 | 118.51603755647925 | | sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 142.48556906165314 | 137.24259849208627 | | document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64) | 86.59814488695922 | 84.59431398586257 | | noop | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 288.52679758135764 | 292.9174195871856 | | causal | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 172.25541683643277 | 172.94326459828508 | | alibi | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 164.40864610599826 | 165.035129576335 | | sliding_window | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 176.54876886433945 | 175.08057670028145 | | document_mask | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128) | 125.22491679812626 | 121.06201152859151 | | noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 339.11952481874283 | 339.0132835601695 | | causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 227.58583240284406 | 228.21824999409597 | | alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 185.98569659868966 | 182.32850843255093 | | sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 188.9495725191772 | 180.31385312481657 | | document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 64) | 106.25789530994302 | 106.55084959448476 | | noop | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 357.6430536888533 | 363.30843452247274 | | causal | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 262.3241154406613 | 265.73250045488 | | alibi | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 249.30498953911416 | 249.35928192833785 | | sliding_window | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 224.74126243851808 | 223.71776504077988 | | document_mask | torch.bfloat16 | (2, 16, 2048, 16, 2048, 128) | 168.26977014013707 | 165.47991483333809 | | noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 382.8178701785897 | 384.34752965862685 | | causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 308.1449710013853 | 311.0653716044644 | | alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 251.96365252505072 | 243.92283557225903 | | sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 226.69316232745368 | 215.22769268913356 | | document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64) | 153.34142545296405 | 151.9312673939401 | | noop | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 396.0998000753126 | 398.35036286102473 | | causal | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 333.5198415274966 | 344.6354466169716 | | alibi | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 310.5955933379696 | 305.66347819546 | | sliding_window | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 260.4012412689896 | 
259.758666997307 | | document_mask | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128) | 234.13034252182635 | 227.61676497283614 | | noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 396.17615538477196 | 401.1419104525502 | | causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 359.98648311998414 | 360.8285563463094 | | alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 291.97720707257736 | 281.41694809965253 | | sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 250.1703628419691 | 238.556760291579 | | document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 64) | 199.50782826294306 | 191.52327358439223 | | noop | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 411.0632004785396 | 413.6362648405517 | | causal | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 382.9404387613185 | 397.74886235657607 | | alibi | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 357.0998545146633 | 350.5115200772392 | | sliding_window | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 281.8033924428203 | 281.98601309215843 | | document_mask | torch.bfloat16 | (2, 16, 8192, 16, 8192, 128) | 282.56595134222135 | 277.4565795466672 | | noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 408.89838018149516 | 405.14531386840076 | | causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 396.07662058160264 | 393.4598228299578 | | alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 317.8822887267849 | 304.754931401036 | | sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 265.8801304948243 | 254.22961974295112 | | document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 64) | 227.87390579965614 | 222.19481980110393 | | noop | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 427.36821778477025 | 431.3766620314935 | | causal | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 410.67994346825 | 423.4666944003808 | | alibi | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 381.1968748374038 | 381.77668006420424 | | sliding_window | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 292.5540046358546 | 296.5439130720502 | | document_mask | torch.bfloat16 | (2, 16, 16384, 16, 16384, 128) | 321.04573768858114 | 310.7423616656888 | | noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 427.46148866769903 | 426.162091037068 | | causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 419.75580537687347 | 421.88640120274334 | | alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 337.3208051798903 | 327.4912454675092 | | sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 276.5638854539581 | 262.988360558083 | | document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 64) | 250.82791326036886 | 245.07367032501736 | | noop | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 435.8055824506086 | 441.8803729460534 | | causal | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 432.02638235921006 | 450.33161016596273 | | alibi | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 402.25525939224883 | 393.8564689669916 | | sliding_window | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 297.5337286675904 | 297.0131881135074 | | document_mask | torch.bfloat16 | (2, 16, 32768, 16, 32768, 128) | 343.8697037899545 | 329.8194073407783 | | noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 267.58912366821056 | 256.91606054118375 | | causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 150.81723692609629 | 146.32172267858743 | | alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 129.51029293209245 | 122.72144394093334 | | sliding_window | torch.bfloat16 | (2, 16, 
1024, 4, 1024, 64) | 147.627656359087 | 141.68956350566188 | | document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 64) | 87.55100546003591 | 84.91293287692788 | | noop | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 299.5931492743986 | 305.884253766691 | | causal | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 179.39026367843837 | 181.64741311605096 | | alibi | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 173.93547669282367 | 173.23972950980564 | | sliding_window | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 185.90234171599252 | 182.80844545446686 | | document_mask | torch.bfloat16 | (2, 16, 1024, 4, 1024, 128) | 128.08176696266082 | 123.27722685662111 | | noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 340.50674552770664 | 338.9071088484576 | | causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 225.4438318650432 | 230.22899884832975 | | alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 194.15123248528312 | 185.02793973094865 | | sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 200.74289714108176 | 191.76606719670647 | | document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 64) | 107.03564946728423 | 106.82432377861258 | | noop | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 371.31799283918406 | 379.7555394732925 | | causal | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 275.97762744310455 | 276.71106853992995 | | alibi | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 261.6648679783462 | 259.4127232060398 | | sliding_window | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 237.03108223577615 | 233.92710216149527 | | document_mask | torch.bfloat16 | (2, 16, 2048, 4, 2048, 128) | 172.13926800371152 | 168.74390922407585 | | noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 381.50199487767276 | 383.9043681999597 | | causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 307.9748883093411 | 312.2403515462001 | | alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 251.11319684705438 | 243.17870127827277 | | sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 236.3253127246763 | 223.81250201769552 | | document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 64) | 154.55693991756874 | 153.11360584987685 | | noop | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 407.11400078586615 | 413.53709886086557 | | causal | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 348.1705797722622 | 360.09771155957367 | | alibi | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 321.8593280850388 | 318.2882327401255 | | sliding_window | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 270.089032013835 | 268.767323026064 | | document_mask | torch.bfloat16 | (2, 16, 4096, 4, 4096, 128) | 238.07324557907788 | 228.09842078362692 | | noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 399.8172853171901 | 401.0954526332136 | | causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 363.4387330438581 | 364.13111024232677 | | alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 294.1752429133857 | 283.7235663368415 | | sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 256.8389394007649 | 246.91771015606483 | | document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 64) | 199.3378564292656 | 192.40439590901758 | | noop | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 425.5150965556111 | 430.8190098707553 | | causal | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 396.00437184073013 | 411.3873625655787 | | alibi | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 369.92803661607815 | 361.43244467343663 | | sliding_window | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) 
| 293.4277354412933 | 295.2529537595746 | | document_mask | torch.bfloat16 | (2, 16, 8192, 4, 8192, 128) | 288.0208673072841 | 281.51896404878863 | | noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 408.3005367220567 | 408.96116482298913 | | causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 396.90095962766304 | 396.87385456176486 | | alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 319.0534576137999 | 302.50950358107764 | | sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 270.3334977708081 | 258.8506349486557 | | document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 64) | 227.46824134365394 | 222.23759438128766 | | noop | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 438.24247309479694 | 437.7975163205371 | | causal | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 428.34012029699227 | 433.3215899950434 | | alibi | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 386.52672049728875 | 388.26216893354984 | | sliding_window | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 302.71976814728083 | 302.3574867306459 | | document_mask | torch.bfloat16 | (2, 16, 16384, 4, 16384, 128) | 327.39760662780986 | 308.6348428844912 | | noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 423.31308678262695 | 426.6306972137279 | | causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 412.6983690923106 | 419.4961977664297 | | alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 337.41003544742273 | 324.2155049126126 | | sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 278.7755890910794 | 265.9194286636502 | | document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 64) | 251.55678254755364 | 244.8843180141462 | | noop | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 452.5930781172308 | 457.7117122300742 | | causal | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 445.05676260348116 | 463.9304535499636 | | alibi | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 415.78302138389415 | 406.29229555271456 | | sliding_window | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 308.0311067300895 | 304.91354721414314 | | document_mask | torch.bfloat16 | (2, 16, 32768, 4, 32768, 128) | 351.43943626809335 | 329.4476923070317 | | noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 295.1801525813241 | 291.36521287398904 | | causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 183.23250549178067 | 182.35421238887605 | | alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 151.56832453117747 | 151.3422139154794 | | sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 171.02111935180432 | 160.72516856727913 | | document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 64) | 74.05765122783826 | 74.5885345035243 | | noop | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 314.3587394591763 | 319.2938677773619 | | causal | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 224.57002084153177 | 225.48868542008177 | | alibi | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.00964804143052 | 215.39576159953486 | | sliding_window | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 216.1174237618258 | 214.28437413525663 | | document_mask | torch.bfloat16 | (4, 16, 1024, 16, 1024, 128) | 121.08920423648368 | 119.55813661872644 | | noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 362.2193857281911 | 360.05005804275936 | | causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 279.8840217430121 | 279.5437918286659 | | alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 227.76617121021982 | 222.8655938229316 | | sliding_window | torch.bfloat16 
| (4, 16, 2048, 16, 2048, 64) | 215.43141176970562 | 207.71852284994702 | | document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 64) | 121.35588364218539 | 121.20636565046884 | | noop | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 365.1545280898012 | 373.37585444987326 | | causal | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 304.360119952975 | 309.1247297936263 | | alibi | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 287.2603904544586 | 289.25547903162595 | | sliding_window | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 257.9852675272418 | 257.59069234098115 | | document_mask | torch.bfloat16 | (4, 16, 2048, 16, 2048, 128) | 188.35158496670232 | 184.24683960154857 | | noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 389.9744911369211 | 388.43466897254166 | | causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 345.9228295166513 | 342.63034895210126 | | alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 279.56334658247437 | 271.2724375402088 | | sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 245.66477202810066 | 233.49688207371258 | | document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 64) | 170.3270720653187 | 166.23863845657382 | | noop | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 400.0041140827554 | 402.11182445396497 | | causal | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 363.64641830327434 | 375.9288663364792 | | alibi | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 341.5776139573363 | 335.1160003213424 | | sliding_window | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 281.1811770268521 | 280.21438270014005 | | document_mask | torch.bfloat16 | (4, 16, 4096, 16, 4096, 128) | 247.78716118997716 | 245.3269825179633 | | noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 403.794126680488 | 405.2353919019577 | | causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 387.079178426863 | 385.1461762057035 | | alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 309.7847188173431 | 298.0443968374749 | | sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 262.4721750159666 | 250.81679725428586 | | document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 64) | 205.70866004479979 | 202.9620839129557 | | noop | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 413.380982988662 | 418.40270594263103 | | causal | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 398.450064800682 | 409.6794973994029 | | alibi | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 372.26297458194466 | 364.44415106552196 | | sliding_window | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 293.0818569905912 | 292.85172400643984 | | document_mask | torch.bfloat16 | (4, 16, 8192, 16, 8192, 128) | 296.46717085592087 | 285.76362010612763 | | noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 419.3186786037592 | 426.08801580934437 | | causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 408.1648467766632 | 409.4122254207817 | | alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 329.24396020457345 | 313.5200995121138 | | sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 274.61257504571876 | 255.7801815432177 | | document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 64) | 232.63806001220684 | 230.03020843492314 | | noop | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 435.0785891054788 | 440.39101804225345 | | causal | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 424.86925312752817 | 435.18898057396825 | | alibi | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 393.000417896268 | 395.11543361225256 | | 
sliding_window | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 297.7755459218185 | 300.7208114715287 | | document_mask | torch.bfloat16 | (4, 16, 16384, 16, 16384, 128) | 331.71570861760534 | 318.07127352552885 | | noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 424.58602747137405 | 425.84897078470715 | | causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 422.66607285025725 | 423.5524945535485 | | alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 344.8625760048626 | 331.6793888458635 | | sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 282.0787281511649 | 263.7895634445868 | | document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 64) | 252.7301927385177 | 245.41844170037427 | | noop | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 437.0658069164588 | 442.9101960063628 | | causal | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 433.13788271434646 | 452.3873572709863 | | alibi | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 404.0959191546953 | 396.7077863894884 | | sliding_window | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 300.45502211883206 | 301.3439134717943 | | document_mask | torch.bfloat16 | (4, 16, 32768, 16, 32768, 128) | 344.11003202413934 | 330.8897663350314 | | noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 298.4364205341705 | 291.6793556507056 | | causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 187.6382133139633 | 191.05409897308772 | | alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 156.55822078636112 | 154.178925976516 | | sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 173.47765221825162 | 169.30862508068464 | | document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 64) | 74.5885345035243 | 74.52689061607104 | | noop | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 323.12233826013045 | 328.53889207933514 | | causal | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 236.75872140126316 | 235.8378325547398 | | alibi | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 227.17836523816675 | 226.75357076139966 | | sliding_window | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 224.07209453308036 | 224.07209453308036 | | document_mask | torch.bfloat16 | (4, 16, 1024, 4, 1024, 128) | 122.85572156047981 | 121.11642183704716 | | noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 361.3123326658092 | 360.71014086458337 | | causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 281.5287983927017 | 281.94301754758345 | | alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 232.7456696285686 | 226.50976826432776 | | sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 221.5612361744038 | 214.96188822837055 | | document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 64) | 121.38311528944315 | 120.85441868178513 | | noop | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 380.2579019244734 | 389.2520157863988 | | causal | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 316.95230660496924 | 317.87597790618906 | | alibi | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 301.07968126657323 | 298.02424098422983 | | sliding_window | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 267.2240756921594 | 267.16353549228154 | | document_mask | torch.bfloat16 | (4, 16, 2048, 4, 2048, 128) | 189.82761622494257 | 186.736450261963 | | noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 389.88665375406805 | 387.9125133037077 | | causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 348.70619958684887 | 346.6750499749774 | | alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 280.5472989906087 | 
271.22300822012187 | | sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 250.02397620165968 | 241.22532776331445 | | document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 64) | 171.67817496107645 | 166.95679280483972 | | noop | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 412.626880230807 | 417.60238657950777 | | causal | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 374.8829313933945 | 389.4448546468815 | | alibi | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 353.20410434172436 | 345.7072490717473 | | sliding_window | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 292.51045924209586 | 291.66621022138287 | | document_mask | torch.bfloat16 | (4, 16, 4096, 4, 4096, 128) | 251.6264062063495 | 248.45110052911542 | | noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 404.0155784550126 | 401.90546837237514 | | causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 384.4389015599863 | 386.9684324594344 | | alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 313.3731284132225 | 298.17074251037894 | | sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 264.19199737284265 | 252.8982463999916 | | document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 64) | 207.03696315185684 | 202.86697323136772 | | noop | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 428.2436763312506 | 433.45005568619536 | | causal | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 411.8516531869893 | 428.2753623461049 | | alibi | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 384.9095037182509 | 372.90888743000744 | | sliding_window | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 303.2438915629836 | 302.05095952914337 | | document_mask | torch.bfloat16 | (4, 16, 8192, 4, 8192, 128) | 301.8689122735564 | 285.0363190513223 | | noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 423.13592231504805 | 420.3991500185611 | | causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 407.44527331585493 | 408.5064370765247 | | alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 330.50050996167414 | 316.8763979925965 | | sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 274.6833786307413 | 259.86098862141324 | | document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 64) | 232.24019584158367 | 226.52040268160232 | | noop | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 444.4596314237808 | 455.99558915752266 | | causal | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 437.4245561244369 | 455.98275147271966 | | alibi | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 397.3350686877605 | 397.88875599028063 | | sliding_window | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 308.53809114394545 | 307.1359822042007 | | document_mask | torch.bfloat16 | (4, 16, 16384, 4, 16384, 128) | 331.32379843423774 | 316.85293191675646 | | noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 422.4622274366379 | 425.0407156418684 | | causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 420.9547052783101 | 430.33779243510276 | | alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 345.50265346504085 | 332.094855328957 | | sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 280.81715528243365 | 264.6543640282054 | | document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 64) | 252.25635200421783 | 245.46235499490305 | | noop | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 452.5524207341139 | 461.7512032176736 | | causal | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 445.2316469907137 | 464.4523799578466 | | alibi | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 
416.87264016717023 | 409.17124592157046 | | sliding_window | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 309.42579489389846 | 307.9734464665731 | | document_mask | torch.bfloat16 | (4, 16, 32768, 4, 32768, 128) | 350.50782004300623 | 330.98959545427294 | </details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/157767 Approved by: https://github.com/Skylion007 |
|||
2c0650a00a |
Revert "[BE][inductor] tl.dot(..., allow_tf32=...) -> tl.dot(..., input_precision=...) (#160711)"
This reverts commit 8dbe7f99bd707ee28ae12ecb9cab54e1785bf13e. Reverted https://github.com/pytorch/pytorch/pull/160711 on behalf of https://github.com/davidberard98 due to internal failure - T235384144 - I'll revert while I investigate. ([comment](https://github.com/pytorch/pytorch/pull/160711#issuecomment-3215343200)) |
|||
667245dc60 |
TritonKernel.inductor_meta_common() -> self.inductor_meta_common() (#160895)
Summary: use `self.inductor_meta_common()` to call the static method, since custom subclasses may override the method with an instance method. Test Plan: ``` caffe2/test/inductor:select_algorithm -- test_finalized_subclass_hooks ``` Rollback Plan: Differential Revision: D80375351 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160895 Approved by: https://github.com/eellison, https://github.com/blaine-rister |
|||
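The distinction matters because Python resolves `self.method()` through the instance's class, while `Class.method()` pins the base definition. A toy model of the fix (class and method names borrowed from the summary; `codegen` is a hypothetical caller, and the real classes are far richer):
```python
class TritonKernel:
    @staticmethod
    def inductor_meta_common():
        return {"backend_hash": "base"}

    def codegen(self):
        # `self.` dispatches through the subclass, unlike
        # `TritonKernel.inductor_meta_common()`, which pins this definition.
        return self.inductor_meta_common()

class CustomKernel(TritonKernel):
    def inductor_meta_common(self):  # subclass swaps in an instance method
        return {"backend_hash": "custom"}

print(TritonKernel().codegen())   # {'backend_hash': 'base'}
print(CustomKernel().codegen())   # {'backend_hash': 'custom'}
```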
7f201baf41 |
Allow exposing more functions during initial template expansion (#159554)
Also adds a `_register_hook` utility, and documents & type annotates `PartialRender`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/159554 Approved by: https://github.com/laithsakka, https://github.com/kundaMwiza |
|||
f305019377 |
[inductor] propagate shapes in CSEVariable (#152198)
Fixes #149905 Pull Request resolved: https://github.com/pytorch/pytorch/pull/152198 Approved by: https://github.com/eellison |
|||
8dbe7f99bd |
[BE][inductor] tl.dot(..., allow_tf32=...) -> tl.dot(..., input_precision=...) (#160711)
allow_tf32 is deprecated. Also, this will make it easier to support tf32x3 (i.e. #160359). dashboard results on h100 show no change: [inference](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2011%20Aug%202025%2017%3A01%3A22%20GMT&stopTime=Mon%2C%2018%20Aug%202025%2017%3A01%3A22%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/399/orig&lCommit=ce12d0fd751a733f22b5bdda00bd58d323e0a526&rBranch=main&rCommit=e444cd24d48b3a46f067974f2cc157f5ed27709f), [training](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2011%20Aug%202025%2017%3A01%3A22%20GMT&stopTime=Mon%2C%2018%20Aug%202025%2017%3A01%3A22%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/davidberard98/399/orig&lCommit=ce12d0fd751a733f22b5bdda00bd58d323e0a526&rBranch=main&rCommit=e444cd24d48b3a46f067974f2cc157f5ed27709f) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160711 Approved by: https://github.com/PaulZhang12, https://github.com/njriasan |
|||
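For reference, the mechanical shape of the change, sketched against a recent Triton where `tl.dot` accepts `input_precision` (values such as "ieee", "tf32", "tf32x3"); the kernel is a minimal illustration, not Inductor's generated code:
```python
import triton
import triton.language as tl

@triton.jit
def dot_kernel(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    a = tl.load(a_ptr + offs[:, None] * BLOCK + offs[None, :])
    b = tl.load(b_ptr + offs[:, None] * BLOCK + offs[None, :])
    # old spelling: tl.dot(a, b, allow_tf32=True)   # deprecated flag
    c = tl.dot(a, b, input_precision="tf32")        # new spelling
    tl.store(c_ptr + offs[:, None] * BLOCK + offs[None, :], c)
```
Launch with `BLOCK >= 16` (a `tl.dot` operand-size requirement); `input_precision="tf32x3"` is the three-pass emulation the referenced issue (#160359) targets.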
05c417715f |
integrate kernacle into inductor (#160121)
This adds integration into inductor in two parts: 1) it kicks off the best-config lookup at lowering time within mm.py, and 2) it awaits the future at scheduling time in select_algorithm.py. Notably, this does not yet include the following: 1) support for enumerating between mm, addmm, and bmm; 2) support for enumerating between exhaustive/max; 3) enumerating different hardware SKUs, e.g. H100, A100, etc. Those will come in the next diffs. Differential Revision: [D79824921](https://our.internmc.facebook.com/intern/diff/D79824921/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160121 Approved by: https://github.com/izaitsevfb |
|||
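Stripped of the service specifics (Kernacle is internal), the two-part integration is the familiar submit-early/await-late pattern. A sketch with hypothetical names:
```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def lookup_best_config(kernel_key: str) -> dict:
    # Stand-in for the remote lookup; imagine a network call here.
    return {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64, "num_stages": 3}

# 1) Lowering time (mm.py): kick off the lookup without blocking.
future = _pool.submit(lookup_best_config, "mm_1024x1024x1024_bf16")

# ... other lowering work proceeds while the lookup is in flight ...

# 2) Scheduling time (select_algorithm.py): await the future.
best_config = future.result(timeout=5.0)
print(best_config)
```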
d68c323692 |
Log max_autotune exceptions (#159687) (#159688)
Summary: Exceptions during autotune kernel precompilation are now systematically captured and reported via the chromium_event_logger, enabling better debugging and analysis of autotune failures. Currently, exceptions are dumped to the console in the following format:
```
[0/0] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
[0/0] Runtime error during autotuning:
[0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help..
[0/0] Ignoring this choice.
```
The exception tracebacks:
```
# inner exception traceback:
File "/torch/_inductor/runtime/triton_heuristics.py", line 603, in _make_launchers
    launchers.append(result.make_launcher())
    ^^^^^^^^^^^^^^^^^^^^^^
File "/torch/_inductor/runtime/triton_heuristics.py", line 1503, in make_launcher
    self.kernel.load_kernel(device)
File "/torch/_inductor/runtime/static_cuda_launcher.py", line 113, in load_kernel
    (self.function, self.n_regs, self.n_spills) = _StaticCudaLauncher._load_kernel(

# wrapped exception traceback:
File "/usr/local/fbcode/platform010/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 2596, in precompile_with_captured_stdout
    choice.precompile()
File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 1881, in precompile
    self.bmreq.precompile()
File "<trimmed>#link-tree/torch/_inductor/autotune_process.py", line 660, in precompile
    getattr(mod, self.kernel_name).precompile()
File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 440, in precompile
    self._make_launchers()
File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 608, in _make_launchers
    raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
```
With this change, the exception details will also be logged in the metadata of the `{name}_template_precompiling` event. The format:
```
{
  "exceptions": [
    {
      "choice_type": "triton",
      "choice": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0",
      "exception_message": "No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.",
      "exception": "OutOfMemoryError",
      "required_memory": "262144",
      "hardware_limit": "232448"
    }
  ]
}
```
Test Plan: buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt Rollback Plan: Differential Revision: D79420953 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159688 Approved by: https://github.com/stashuk-olek |
|||
b599d91738 |
Log autotune choices and benchmark result to scuba/chrome trace (#159496)
Summary: Report the kernel choices and benchmark data to better understand how kernels are selected and the performance gap between the best kernel (likely a CUDA kernel) and Triton kernels. **Example** Event: mm_template_autotuning Column: autotune_choices ```json { "num_choices": 52, "num_triton_choices": 19, "best_kernel": "cutlass_f6c25cf2", "best_kernel_desc": "cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8", "best_time": 0.6283040046691895, "best_triton_pos": 26, "best_triton_time": 0.6832960247993469, "best_triton_kernel": "triton_mm_17", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0" } ``` Test Plan: ``` TORCHINDUCTOR_MAX_AUTOTUNE_REPORT_CHOICES_STATS=1 buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt ``` Rollback Plan: Differential Revision: D79235037 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159496 Approved by: https://github.com/masnesral |
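The summary JSON above can be derived from the ranked benchmark results; the sketch below shows the shape of that computation using an invented `timings` list of (kernel_name, is_triton, milliseconds) tuples. It is not the actual logging code.

```python
def summarize_choices(timings):
    # timings: [(kernel_name, is_triton, bench_ms), ...] -- illustrative.
    ranked = sorted(timings, key=lambda t: t[2])
    best_name, _, best_time = ranked[0]
    summary = {
        "num_choices": len(ranked),
        "num_triton_choices": sum(1 for t in ranked if t[1]),
        "best_kernel": best_name,
        "best_time": best_time,
    }
    for pos, (name, is_triton, ms) in enumerate(ranked, 1):
        if is_triton:  # first triton kernel in the ranking
            summary.update(best_triton_pos=pos, best_triton_kernel=name,
                           best_triton_time=ms)
            break
    return summary

print(summarize_choices([("cutlass_f6c25cf2", False, 0.6283),
                         ("triton_mm_17", True, 0.6833)]))
```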
|||
255a04baf1 |
[pt2 event logging] send autotuning data for strides and hinted shapes (#158852)
Summary: # Why: capture relevant data for offline lookup table generation. # What: report the hinted sizes, not just the symbolic sizes. Test Plan: ``` buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 | tee /tmp/epx040 ``` This only validates that the change does not break anything, as the schema is not on scuba yet (not actualized). Rollback Plan: Reviewed By: stashuk-olek Differential Revision: D77837548 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158852 Approved by: https://github.com/jingsh |
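Concretely, "hinted sizes" means substituting each symbolic dimension with its example-input hint so the logged shape is plain integers. Below is a small sketch using sympy, which inductor uses for symbolic shapes; `hinted_sizes` is an invented helper, not the actual reporting code.

```python
import sympy

def hinted_sizes(symbolic_sizes, hints):
    # Replace free symbols (e.g. s0) with their example-input hints so the
    # logged shapes are concrete ints instead of symbolic expressions.
    return [int(sympy.sympify(s).subs(hints)) for s in symbolic_sizes]

s0 = sympy.Symbol("s0")
print(hinted_sizes([s0, 2 * s0, 512], {s0: 1024}))  # [1024, 2048, 512]
```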
|||
2e038793ef |
[inductor][templates] Finalize all registered hooks (#157270)
This refactor ensures all registered template hooks have been finalised before accessing the code object of the template. In `simd.SimdScheduling.codegen_template` the template hooks are finalised manually with `template.finalize_hook(hook_name)` calls, so it is the responsibility of the caller to finalise all the template hooks. This PR adds: - `RenderPartial.finalize_remaining`, a function that can be called at the end to finalise the remaining active hooks after a selection of hooks has been finalised manually. - A test with a custom template implementation that registers custom hooks that the scheduler needs to finalise. This test should fail if the scheduler does not finalise the registered custom hook. Pull Request resolved: https://github.com/pytorch/pytorch/pull/157270 Approved by: https://github.com/eellison |
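A toy model of the contract: the scheduler finalises some hooks by name, then a `finalize_remaining`-style sweep handles the rest before the code is read. The `Template` class here is a stand-in, not the real inductor template object.

```python
class Template:  # stand-in: hooks map a placeholder name to a render fn
    def __init__(self, hooks):
        self.active_hooks = dict(hooks)
        self.code = ""

    def finalize_hook(self, name):
        self.code += self.active_hooks.pop(name)()

def finalize_remaining(template, already_finalized=()):
    # Sweep up every hook the caller did not finalise by name, so the code
    # object is never read while unrendered hooks are still registered.
    for name in [h for h in template.active_hooks if h not in already_finalized]:
        template.finalize_hook(name)

t = Template({"<DEF_KERNEL>": lambda: "def kernel(): ...\n"})
finalize_remaining(t)
print(t.code)
```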
|||
5221448574 |
multi-kernel matmuls based on varying hint sizes (#156628)
The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts: https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/ https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/ https://fb.workplace.com/groups/257735836456307/posts/906589324904285/ Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size:  This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case:  This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes:  Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096:  ## How to review this PR At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points: 1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments. 2. The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels. 3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape. 4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR. ## Results The following is a basic test that shows our basic multi kernel working where we no longer show significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec Before ``` Hint\Runtime | 64 | 256 | 4096 --------------------------------------------------- 64 | 0.0948 | 0.3124 | 4.9477 256 | 0.2243 | 0.2256 | 3.3880 4096 | 0.3384 | 0.3404 | 3.3010 ``` After ``` Hint\Runtime | 64 | 256 | 4096 --------------------------------------------------- 64 | 0.0951 | 0.2289 | 3.3013 256 | 0.0952 | 0.2258 | 3.4045 4096 | 0.0957 | 0.2231 | 3.3146 ``` We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938  NB: This is just the beginning and I plan on doing more investigation to see further improve on this initial result. For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0 HUD benchmark runs: base: https://github.com/pytorch/pytorch/actions/runs/15889871988 head: https://github.com/pytorch/pytorch/actions/runs/15889876842 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628 Approved by: https://github.com/jansel |
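One way to picture the runtime side: keep one compiled kernel per hint and dispatch each incoming shape to the kernel whose hint covers it. The real `MultiKernelCall` selects a kernel by benchmarking per shape; the nearest-hint dispatch below is a simplified illustration, not the actual mechanism.

```python
import bisect

HINTS = (64, 256, 4096)  # mirrors a multi_kernel_hints-style config

def pick_kernel(kernels_by_hint, runtime_size):
    # Choose the kernel compiled with the smallest hint >= runtime_size,
    # falling back to the largest hint for out-of-range shapes.
    idx = min(bisect.bisect_left(HINTS, runtime_size), len(HINTS) - 1)
    return kernels_by_hint[HINTS[idx]]

kernels = {h: f"triton_mm_hint_{h}" for h in HINTS}
print(pick_kernel(kernels, 192))   # triton_mm_hint_256
print(pick_kernel(kernels, 8192))  # triton_mm_hint_4096
```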
|||
9c189ed29a |
Revert "multi-kernel matmuls based on varying hint sizes (#156628)"
This reverts commit 6c795306378c47341d58109da03371bba2bec46e. Reverted https://github.com/pytorch/pytorch/pull/156628 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some ROCM jobs went crazy after this lands, so I try to see if reverting helps ([comment](https://github.com/pytorch/pytorch/pull/156628#issuecomment-3064617123)) |
|||
6c79530637 |
multi-kernel matmuls based on varying hint sizes (#156628)
The core idea is to generate multiple matmul kernels using different hints for symbolic variables, then select the most appropriate one at runtime for each unique shape we encounter. You can find some early experimentation details in these posts: https://fb.workplace.com/groups/8940092306109185/posts/9803850776399996/ https://fb.workplace.com/groups/8940092306109185/posts/9695805170537891/ https://fb.workplace.com/groups/257735836456307/posts/906589324904285/ Here’s a graph illustrating the empirically observed worst-case performance if an oracle always selected the least optimal hint for a given runtime size:  This graph illustrates the performance of a hint size of 64 relative to the worst case. Notice that as the runtime sizes increase, the performance gradually approaches the worst case:  This graph shows the performance of a hint size of 4096 — very poor for small sizes, and also suboptimal for some mid-sized shapes:  Finally, here’s the graph that motivated this PR. It illustrates the performance when selecting the best of three kernels generated with three different hints — 64, 256, and 4096:  ## How to review this PR At a high level, this extends @shunting314's multi-kernel abstraction to support varying GEMM choices driven by different hints. A few key points: 1. Unlike reduction kernels, triton template matmuls pass their grid as arguments to the kernel. This PR updates `MultiKernelCall` to support kernels with varying arguments. 2. The `V.graph.sizevars.size_hints` API is extended to accept a `hint_override`, allowing us to substitute the example input’s size hint with a custom value when generating multiple kernels. 3. The choice generation and benchmarking logic is updated to support multiple hint values. One kernel is generated per value in `torch._inductor.config.multi_kernel_hints`, and at runtime, we select the most suitable kernel for the current shape. 4. This PR does not add support for cpp wrapper codegen to keep it scoped. That will be added in the next PR. ## Results The following is a basic test that shows our basic multi kernel working where we no longer show significant variance based on the original hint size: https://gist.github.com/bobrenjc93/ba711d529e65fd65839b34799f6323ec Before ``` Hint\Runtime | 64 | 256 | 4096 --------------------------------------------------- 64 | 0.0948 | 0.3124 | 4.9477 256 | 0.2243 | 0.2256 | 3.3880 4096 | 0.3384 | 0.3404 | 3.3010 ``` After ``` Hint\Runtime | 64 | 256 | 4096 --------------------------------------------------- 64 | 0.0951 | 0.2289 | 3.3013 256 | 0.0952 | 0.2258 | 3.4045 4096 | 0.0957 | 0.2231 | 3.3146 ``` We also see an average speedup of 5.04% for the matrix of all hint/runtime pairs in [64, 4096] for every increment of 64: https://docs.google.com/spreadsheets/d/12TmYUDrAAFASGuP3POXTKPeAvQWIRzKzdrVSIb3vQkA/edit?gid=480268938#gid=480268938  NB: This is just the beginning and I plan on doing more investigation to see further improve on this initial result. For posterity the script used to generate that matrix is here: https://gist.github.com/bobrenjc93/c211fd0bd97fad8f46b91ad9dee76ad0 HUD benchmark runs: base: https://github.com/pytorch/pytorch/actions/runs/15889871988 head: https://github.com/pytorch/pytorch/actions/runs/15889876842 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156628 Approved by: https://github.com/jansel |
|||
7a08755c5f |
[BE][Ez]: Update ruff to 0.12.2 (#157937)
Updates to the latest version of ruff and apply some fixes that it flagged and silence a few new lints Pull Request resolved: https://github.com/pytorch/pytorch/pull/157937 Approved by: https://github.com/ezyang |
|||
ed508cc018 |
[inductor][triton] Add experimental use_tensor_descriptor config option (#157906)
Refactor to allow TMA descriptors to be used in general codegen. TMA descriptors can only be generated if the conditions listed in the triton documentation for [make_tensor_descriptor](https://triton-lang.org/main/python-api/generated/triton.language.make_tensor_descriptor.html) are met. Some implementation details: - The `TMACompatibilityChecker` class holds and checks the conditions required for a load / store operation to be represented by a tma descriptor load / store - The current TMA API requires that the innermost block size loads at least 16 bytes of data, e.g. if the block shape is [YBLOCK, XBLOCK] and the tensor dtype is float32, this requires that XBLOCK >= 4. It is therefore required that the triton heuristics are aware of the minimum block sizes for the IO operations in the kernel. The minimum block sizes are determined in the `TMACompatibilityChecker` class and are passed to the triton heuristics when the block sizes are not static. The heuristic config options are then filtered to ensure that the minimum block size restriction is met. Testing: - Refactored test_torchinductor_strided_blocks.py to also test the `use_tensor_descriptor` option. This requires an upgrade to Triton version 3.4.0: https://github.com/pytorch/pytorch/issues/154206 Pull Request resolved: https://github.com/pytorch/pytorch/pull/157906 Approved by: https://github.com/jansel |
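The 16-byte rule translates directly into a per-dtype minimum for the innermost block size, as in this small sketch (`min_inner_block` is an invented name; the real check lives in `TMACompatibilityChecker`):

```python
import torch

def min_inner_block(dtype: torch.dtype, min_bytes: int = 16) -> int:
    # The innermost block of a TMA load/store must cover at least 16 bytes,
    # so the minimum block size scales inversely with element width.
    return max(1, min_bytes // dtype.itemsize)

print(min_inner_block(torch.float32))   # 4  (matches the XBLOCK >= 4 example)
print(min_inner_block(torch.bfloat16))  # 8
print(min_inner_block(torch.int8))      # 16
```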
|||
7e83d50845 |
Inductor logging + analysis of torch.profile (#149697)
Prereqs: - https://github.com/pytorch/pytorch/pull/152708 Features: 1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses. 1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`. 1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`. 1. Extends Triton `torch.profiler` logging to `DebugAutotuner`. 1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side: ```python Device(NVIDIA H100, 0): Kernel Name | resnet Kernel Count | resnet FLOPS | resnet bw gbps | resnet Dur (ms) | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS | newresnet bw gbps | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth % --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- triton_poi_fused__native_batch_norm_legi | 24 | 0 | 0.11395268248131513 | 2.5919166666666666 | 0 | 0.003401572611382541 | 24 | 0 | 0.11395268248131513 | 2.5919166666666666 | 0 | 0.003401572611382541 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142 | 16932673552.422373 | 0.2585007824198784 | 12.441619718309857 | 0.08683422334575583 | 0.007716441266265022 | 142 | 16932673552.422373 | 0.2585007824198784 | 12.441619718309857 | 0.08683422334575583 | 0.007716441266265022 triton_red_fused__native_batch_norm_legi | 39 | 0 | 0.13990024992108846 | 5.752589743589743 | 0 | 0.004176126863316074 | 39 | 0 | 0.13990024992108846 | 5.752589743589743 | 0 | 0.004176126863316074 triton_poi_fused__native_batch_norm_legi | 25 | 0 | 0.31824055917536503 | 2.5291999999999994 | 0 | 0.009499718184339253 | 25 | 0 | 0.31824055917536503 | 2.5291999999999994 | 0 | 0.009499718184339253 void cutlass::Kernel2<cutlass_80_tensoro | 98 | 16211056473.596165 | 0.42972434051025826 | 7.130408163265306 | 0.08313362294151874 | 0.012827592254037562 | 98 | 16211056473.596165 | 0.42972434051025826 | 7.130408163265306 | 0.08313362294151874 | 0.012827592254037562 triton_red_fused__native_batch_norm_legi | 73 | 0 | 0.3225381327611705 | 9.987068493150682 | 0 | 0.009628003963020014 | 73 | 0 | 0.3225381327611705 | 9.987068493150682 | 0 | 0.009628003963020014 triton_poi_fused__native_batch_norm_legi | 15 | 0 | 1.4491211346487216 | 4.439333333333333 | 0 | 0.043257347302946926 | 15 | 0 | 1.4491211346487216 | 4.439333333333333 | 0 | 0.043257347302946926 void cutlass::Kernel2<cutlass_80_tensoro | 186 | 14501701145.337954 | 0.2667131401910989 | 7.873865591397849 | 0.07436769818122027 | 0.007961586274361157 | 186 | 14501701145.337954 | 0.2667131401910989 | 7.873865591397849 | 0.07436769818122027 | 0.007961586274361157 triton_poi_fused__native_batch_norm_legi | 33 | 0 | 1.4924556538193923 | 4.3101515151515155 | 0 | 0.044550915039384846 | 33 | 0 | 1.4924556538193923 | 4.3101515151515155 | 0 | 0.044550915039384846 triton_red_fused__native_batch_norm_legi | 29 | 0 | 0.25562590522631107 | 6.296275862068965 | 0 | 0.007630624036606301 
| 29 | 0 | 0.25562590522631107 | 6.296275862068965 | 0 | 0.007630624036606301 triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.5870562174192726 | 2.7397692307692307 | 0 | 0.01752406619162008 | 13 | 0 | 0.5870562174192726 | 2.7397692307692307 | 0 | 0.01752406619162008 triton_poi_fused__native_batch_norm_legi | 34 | 0 | 0.41409928846284 | 2.853588235294117 | 0 | 0.012361172789935523 | 34 | 0 | 0.41409928846284 | 2.853588235294117 | 0 | 0.012361172789935523 triton_per_fused__native_batch_norm_legi | 34 | 0 | 0.11705315007018151 | 3.460647058823529 | 0 | 0.0034941238826919864 | 34 | 0 | 0.11705315007018151 | 3.460647058823529 | 0 | 0.0034941238826919864 triton_poi_fused__native_batch_norm_legi | 16 | 0 | 0.17207853197124584 | 2.3459375000000002 | 0 | 0.005136672596156592 | 16 | 0 | 0.17207853197124584 | 2.3459375000000002 | 0 | 0.005136672596156592 triton_per_fused__native_batch_norm_legi | 30 | 0 | 0.2639714322022256 | 6.131199999999999 | 0 | 0.007879744244842555 | 30 | 0 | 0.2639714322022256 | 6.131199999999999 | 0 | 0.007879744244842555 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100 | 11875430356.891787 | 0.19494470869421385 | 16.36534 | 0.06089964285585531 | 0.005819245035648175 | 100 | 11875430356.891787 | 0.19494470869421385 | 16.36534 | 0.06089964285585531 | 0.005819245035648175 triton_poi_fused__native_batch_norm_legi | 8 | 0 | 0.9854096626224687 | 3.2757500000000004 | 0 | 0.029415213809625928 | 8 | 0 | 0.9854096626224687 | 3.2757500000000004 | 0 | 0.029415213809625928 void cublasLt::splitKreduce_kernel<32, 1 | 56 | 34377923395.147064 | 0.8310300045762317 | 3.4199999999999986 | 0.17629704305203628 | 0.024806865808245714 | 56 | 34377923395.147064 | 0.8310300045762317 | 3.4199999999999986 | 0.17629704305203628 | 0.024806865808245714 triton_poi_fused__native_batch_norm_legi | 23 | 0 | 0.9944002965861103 | 3.2431304347826084 | 0 | 0.02968359094286896 | 23 | 0 | 0.9944002965861103 | 3.2431304347826084 | 0 | 0.02968359094286896 triton_per_fused__native_batch_norm_legi | 10 | 0 | 0.1826801058931057 | 4.428800000000001 | 0 | 0.00545313748934644 | 10 | 0 | 0.1826801058931057 | 4.428800000000001 | 0 | 0.00545313748934644 triton_poi_fused__native_batch_norm_legi | 10 | 0 | 0.3168973585366449 | 2.5471999999999997 | 0 | 0.009459622642884923 | 10 | 0 | 0.3168973585366449 | 2.5471999999999997 | 0 | 0.009459622642884923 triton_poi_fused__native_batch_norm_legi | 34 | 0 | 1.1463614897015777 | 4.124323529411764 | 0 | 0.03421974596124114 | 34 | 0 | 1.1463614897015777 | 4.124323529411764 | 0 | 0.03421974596124114 void cask_plugin_cudnn::xmma_cudnn::init | 44 | 44045510816.64277 | 2.0661232850348643 | 3.6887499999999993 | 0.22587441444432194 | 0.06167532194133924 | 44 | 44045510816.64277 | 2.0661232850348643 | 3.6887499999999993 | 0.22587441444432194 | 0.06167532194133924 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95 | 7876855400.165316 | 0.4694941555946739 | 18.224315789473682 | 0.04039413025725802 | 0.014014750913273854 | 95 | 7876855400.165316 | 0.4694941555946739 | 18.224315789473682 | 0.04039413025725802 | 0.014014750913273854 triton_per_fused__native_batch_norm_legi | 41 | 0 | 0.06825669875995298 | 3.0384146341463416 | 0 | 0.002037513395819492 | 41 | 0 | 0.06825669875995298 | 3.0384146341463416 | 0 | 0.002037513395819492 triton_poi_fused__native_batch_norm_legi | 23 | 0 | 0.08808154712430301 | 2.3275652173913044 | 0 | 0.0026292999141582997 | 23 | 0 | 0.08808154712430301 | 2.3275652173913044 | 0 | 0.0026292999141582997 triton_per_fused__native_batch_norm_legi | 40 | 0 | 0.18179321034952417 | 
4.556825 | 0 | 0.005426662995508183 | 40 | 0 | 0.18179321034952417 | 4.556825 | 0 | 0.005426662995508183 triton_poi_fused__native_batch_norm_legi | 15 | 0 | 0.5887415155454232 | 2.783866666666667 | 0 | 0.017574373598370836 | 15 | 0 | 0.5887415155454232 | 2.783866666666667 | 0 | 0.017574373598370836 void cutlass::Kernel2<cutlass_80_tensoro | 38 | 14242013806.264643 | 0.256592404353939 | 7.217631578947369 | 0.0730359682372546 | 0.007659474756834 | 38 | 14242013806.264643 | 0.256592404353939 | 7.217631578947369 | 0.0730359682372546 | 0.007659474756834 triton_poi_fused__native_batch_norm_legi | 21 | 0 | 0.5842860973430516 | 2.7779047619047623 | 0 | 0.017441376040091088 | 21 | 0 | 0.5842860973430516 | 2.7779047619047623 | 0 | 0.017441376040091088 triton_per_fused__native_batch_norm_legi | 16 | 0 | 0.11509365173486417 | 3.5959375000000002 | 0 | 0.0034356313950705724 | 16 | 0 | 0.11509365173486417 | 3.5959375000000002 | 0 | 0.0034356313950705724 triton_poi_fused__native_batch_norm_legi | 14 | 0 | 0.1704672000243914 | 2.4044285714285714 | 0 | 0.00508857313505646 | 14 | 0 | 0.1704672000243914 | 2.4044285714285714 | 0 | 0.00508857313505646 triton_poi_fused__native_batch_norm_legi | 58 | 0 | 2.307520779930795 | 8.190706896551722 | 0 | 0.06888121731136704 | 58 | 0 | 2.307520779930795 | 8.190706896551722 | 0 | 0.06888121731136704 triton_per_fused__native_batch_norm_legi | 29 | 0 | 0.037243248971881276 | 3.0277586206896556 | 0 | 0.001111738775280038 | 29 | 0 | 0.037243248971881276 | 3.0277586206896556 | 0 | 0.001111738775280038 triton_poi_fused__native_batch_norm_legi | 20 | 0 | 0.04741699795428918 | 2.2911500000000005 | 0 | 0.0014154327747549007 | 20 | 0 | 0.04741699795428918 | 2.2911500000000005 | 0 | 0.0014154327747549007 triton_per_fused__native_batch_norm_legi | 25 | 0 | 0.13357016893727824 | 3.37536 | 0 | 0.003987169222008305 | 25 | 0 | 0.13357016893727824 | 3.37536 | 0 | 0.003987169222008305 triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.3089862268300253 | 2.8111538461538457 | 0 | 0.009223469457612694 | 13 | 0 | 0.3089862268300253 | 2.8111538461538457 | 0 | 0.009223469457612694 triton_poi_fused__native_batch_norm_legi | 17 | 0 | 0.3129385387909844 | 2.673 | 0 | 0.009341448919133863 | 17 | 0 | 0.3129385387909844 | 2.673 | 0 | 0.009341448919133863 triton_per_fused__native_batch_norm_legi | 19 | 0 | 0.2215568162533158 | 3.8837368421052636 | 0 | 0.0066136363060691275 | 19 | 0 | 0.2215568162533158 | 3.8837368421052636 | 0 | 0.0066136363060691275 std::enable_if<!(false), void>::type int | 23 | 504916805.19297093 | 1.0118296096314707 | 8.113913043478261 | 0.0025893169497075447 | 0.030203868944223014 | 23 | 504916805.19297093 | 1.0118296096314707 | 8.113913043478261 | 0.0025893169497075447 | 0.030203868944223014 triton_poi_fused_add_copy__38 | 56 | 0 | 0 | 2.132482142857143 | 0 | 0 | 56 | 0 | 0 | 2.132482142857143 | 0 | 0 triton_poi_fused_convolution_0 | 18 | 0 | 0.43458610794936897 | 2.773333333333334 | 0 | 0.012972719640279667 | 18 | 0 | 0.43458610794936897 | 2.773333333333334 | 0 | 0.012972719640279667 triton_poi_fused_convolution_1 | 17 | 0 | 0.028816312469162712 | 2.6145882352941174 | 0 | 0.0008601884319153051 | 17 | 0 | 0.028816312469162712 | 2.6145882352941174 | 0 | 0.0008601884319153051 void convolve_common_engine_float_NHWC<f | 44 | 8641868995.31118 | 0.024730540008465626 | 25.87327272727273 | 0.04431727689903169 | 0.0007382250748795709 | 44 | 8641868995.31118 | 0.024730540008465626 | 25.87327272727273 | 0.04431727689903169 | 0.0007382250748795709 triton_per_fused__native_batch_norm_legi 
| 12 | 0 | 0.6809930918986744 | 4.82675 | 0 | 0.020328151996975356 | 12 | 0 | 0.6809930918986744 | 4.82675 | 0 | 0.020328151996975356 triton_per_fused__native_batch_norm_legi | 14 | 0 | 0.02883030597936608 | 2.6651428571428575 | 0 | 0.0008606061486377935 | 14 | 0 | 0.02883030597936608 | 2.6651428571428575 | 0 | 0.0008606061486377935 triton_per_fused__native_batch_norm_legi | 16 | 0 | 0.0014658988233201874 | 2.098 | 0 | 4.375817383045335e-05 | 16 | 0 | 0.0014658988233201874 | 2.098 | 0 | 4.375817383045335e-05 triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.9926297180284697 | 3.2367692307692306 | 0 | 0.02963073785159611 | 13 | 0 | 0.9926297180284697 | 3.2367692307692306 | 0 | 0.02963073785159611 triton_poi_fused__native_batch_norm_legi | 9 | 0 | 1.3008817095666507 | 3.0863333333333336 | 0 | 0.03883228983781048 | 9 | 0 | 1.3008817095666507 | 3.0863333333333336 | 0 | 0.03883228983781048 void at::native::(anonymous namespace):: | 98 | 0 | 0.09174335613709389 | 4.408520408163265 | 0 | 0.0027386076458833994 | 98 | 0 | 0.09174335613709389 | 4.408520408163265 | 0 | 0.0027386076458833994 void at::native::vectorized_elementwise_ | 7 | 0 | 0 | 1.7278571428571428 | 0 | 0 | 7 | 0 | 0 | 1.7278571428571428 | 0 | 0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697 Approved by: https://github.com/eellison, https://github.com/shunting314 |
|||
6ef70edd9a |
Revert "Inductor logging + analysis of torch.profile (#149697)"
This reverts commit 47f10d0ad0dda281c886ff08ac2f938207027316. Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/malfet due to Looks like it's breaking ROCM tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm%20%2F%20linux-jammy ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-3025673908)) |
|||
47f10d0ad0 |
Inductor logging + analysis of torch.profile (#149697)
Prereqs: - https://github.com/pytorch/pytorch/pull/152708 Features: 1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses. 1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`. 1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`. 1. Extends Triton `torch.profiler` logging to `DebugAutotuner`. 1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side: ```python Device(NVIDIA H100, 0): Kernel Name | resnet Kernel Count | resnet FLOPS | resnet bw gbps | resnet Dur (ms) | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS | newresnet bw gbps | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth % --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- triton_poi_fused__native_batch_norm_legi | 24 | 0 | 0.11395268248131513 | 2.5919166666666666 | 0 | 0.003401572611382541 | 24 | 0 | 0.11395268248131513 | 2.5919166666666666 | 0 | 0.003401572611382541 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142 | 16932673552.422373 | 0.2585007824198784 | 12.441619718309857 | 0.08683422334575583 | 0.007716441266265022 | 142 | 16932673552.422373 | 0.2585007824198784 | 12.441619718309857 | 0.08683422334575583 | 0.007716441266265022 triton_red_fused__native_batch_norm_legi | 39 | 0 | 0.13990024992108846 | 5.752589743589743 | 0 | 0.004176126863316074 | 39 | 0 | 0.13990024992108846 | 5.752589743589743 | 0 | 0.004176126863316074 triton_poi_fused__native_batch_norm_legi | 25 | 0 | 0.31824055917536503 | 2.5291999999999994 | 0 | 0.009499718184339253 | 25 | 0 | 0.31824055917536503 | 2.5291999999999994 | 0 | 0.009499718184339253 void cutlass::Kernel2<cutlass_80_tensoro | 98 | 16211056473.596165 | 0.42972434051025826 | 7.130408163265306 | 0.08313362294151874 | 0.012827592254037562 | 98 | 16211056473.596165 | 0.42972434051025826 | 7.130408163265306 | 0.08313362294151874 | 0.012827592254037562 triton_red_fused__native_batch_norm_legi | 73 | 0 | 0.3225381327611705 | 9.987068493150682 | 0 | 0.009628003963020014 | 73 | 0 | 0.3225381327611705 | 9.987068493150682 | 0 | 0.009628003963020014 triton_poi_fused__native_batch_norm_legi | 15 | 0 | 1.4491211346487216 | 4.439333333333333 | 0 | 0.043257347302946926 | 15 | 0 | 1.4491211346487216 | 4.439333333333333 | 0 | 0.043257347302946926 void cutlass::Kernel2<cutlass_80_tensoro | 186 | 14501701145.337954 | 0.2667131401910989 | 7.873865591397849 | 0.07436769818122027 | 0.007961586274361157 | 186 | 14501701145.337954 | 0.2667131401910989 | 7.873865591397849 | 0.07436769818122027 | 0.007961586274361157 triton_poi_fused__native_batch_norm_legi | 33 | 0 | 1.4924556538193923 | 4.3101515151515155 | 0 | 0.044550915039384846 | 33 | 0 | 1.4924556538193923 | 4.3101515151515155 | 0 | 0.044550915039384846 triton_red_fused__native_batch_norm_legi | 29 | 0 | 0.25562590522631107 | 6.296275862068965 | 0 | 0.007630624036606301 
| 29 | 0 | 0.25562590522631107 | 6.296275862068965 | 0 | 0.007630624036606301 triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.5870562174192726 | 2.7397692307692307 | 0 | 0.01752406619162008 | 13 | 0 | 0.5870562174192726 | 2.7397692307692307 | 0 | 0.01752406619162008 triton_poi_fused__native_batch_norm_legi | 34 | 0 | 0.41409928846284 | 2.853588235294117 | 0 | 0.012361172789935523 | 34 | 0 | 0.41409928846284 | 2.853588235294117 | 0 | 0.012361172789935523 triton_per_fused__native_batch_norm_legi | 34 | 0 | 0.11705315007018151 | 3.460647058823529 | 0 | 0.0034941238826919864 | 34 | 0 | 0.11705315007018151 | 3.460647058823529 | 0 | 0.0034941238826919864 triton_poi_fused__native_batch_norm_legi | 16 | 0 | 0.17207853197124584 | 2.3459375000000002 | 0 | 0.005136672596156592 | 16 | 0 | 0.17207853197124584 | 2.3459375000000002 | 0 | 0.005136672596156592 triton_per_fused__native_batch_norm_legi | 30 | 0 | 0.2639714322022256 | 6.131199999999999 | 0 | 0.007879744244842555 | 30 | 0 | 0.2639714322022256 | 6.131199999999999 | 0 | 0.007879744244842555 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100 | 11875430356.891787 | 0.19494470869421385 | 16.36534 | 0.06089964285585531 | 0.005819245035648175 | 100 | 11875430356.891787 | 0.19494470869421385 | 16.36534 | 0.06089964285585531 | 0.005819245035648175 triton_poi_fused__native_batch_norm_legi | 8 | 0 | 0.9854096626224687 | 3.2757500000000004 | 0 | 0.029415213809625928 | 8 | 0 | 0.9854096626224687 | 3.2757500000000004 | 0 | 0.029415213809625928 void cublasLt::splitKreduce_kernel<32, 1 | 56 | 34377923395.147064 | 0.8310300045762317 | 3.4199999999999986 | 0.17629704305203628 | 0.024806865808245714 | 56 | 34377923395.147064 | 0.8310300045762317 | 3.4199999999999986 | 0.17629704305203628 | 0.024806865808245714 triton_poi_fused__native_batch_norm_legi | 23 | 0 | 0.9944002965861103 | 3.2431304347826084 | 0 | 0.02968359094286896 | 23 | 0 | 0.9944002965861103 | 3.2431304347826084 | 0 | 0.02968359094286896 triton_per_fused__native_batch_norm_legi | 10 | 0 | 0.1826801058931057 | 4.428800000000001 | 0 | 0.00545313748934644 | 10 | 0 | 0.1826801058931057 | 4.428800000000001 | 0 | 0.00545313748934644 triton_poi_fused__native_batch_norm_legi | 10 | 0 | 0.3168973585366449 | 2.5471999999999997 | 0 | 0.009459622642884923 | 10 | 0 | 0.3168973585366449 | 2.5471999999999997 | 0 | 0.009459622642884923 triton_poi_fused__native_batch_norm_legi | 34 | 0 | 1.1463614897015777 | 4.124323529411764 | 0 | 0.03421974596124114 | 34 | 0 | 1.1463614897015777 | 4.124323529411764 | 0 | 0.03421974596124114 void cask_plugin_cudnn::xmma_cudnn::init | 44 | 44045510816.64277 | 2.0661232850348643 | 3.6887499999999993 | 0.22587441444432194 | 0.06167532194133924 | 44 | 44045510816.64277 | 2.0661232850348643 | 3.6887499999999993 | 0.22587441444432194 | 0.06167532194133924 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95 | 7876855400.165316 | 0.4694941555946739 | 18.224315789473682 | 0.04039413025725802 | 0.014014750913273854 | 95 | 7876855400.165316 | 0.4694941555946739 | 18.224315789473682 | 0.04039413025725802 | 0.014014750913273854 triton_per_fused__native_batch_norm_legi | 41 | 0 | 0.06825669875995298 | 3.0384146341463416 | 0 | 0.002037513395819492 | 41 | 0 | 0.06825669875995298 | 3.0384146341463416 | 0 | 0.002037513395819492 triton_poi_fused__native_batch_norm_legi | 23 | 0 | 0.08808154712430301 | 2.3275652173913044 | 0 | 0.0026292999141582997 | 23 | 0 | 0.08808154712430301 | 2.3275652173913044 | 0 | 0.0026292999141582997 triton_per_fused__native_batch_norm_legi | 40 | 0 | 0.18179321034952417 | 
4.556825 | 0 | 0.005426662995508183 | 40 | 0 | 0.18179321034952417 | 4.556825 | 0 | 0.005426662995508183 triton_poi_fused__native_batch_norm_legi | 15 | 0 | 0.5887415155454232 | 2.783866666666667 | 0 | 0.017574373598370836 | 15 | 0 | 0.5887415155454232 | 2.783866666666667 | 0 | 0.017574373598370836 void cutlass::Kernel2<cutlass_80_tensoro | 38 | 14242013806.264643 | 0.256592404353939 | 7.217631578947369 | 0.0730359682372546 | 0.007659474756834 | 38 | 14242013806.264643 | 0.256592404353939 | 7.217631578947369 | 0.0730359682372546 | 0.007659474756834 triton_poi_fused__native_batch_norm_legi | 21 | 0 | 0.5842860973430516 | 2.7779047619047623 | 0 | 0.017441376040091088 | 21 | 0 | 0.5842860973430516 | 2.7779047619047623 | 0 | 0.017441376040091088 triton_per_fused__native_batch_norm_legi | 16 | 0 | 0.11509365173486417 | 3.5959375000000002 | 0 | 0.0034356313950705724 | 16 | 0 | 0.11509365173486417 | 3.5959375000000002 | 0 | 0.0034356313950705724 triton_poi_fused__native_batch_norm_legi | 14 | 0 | 0.1704672000243914 | 2.4044285714285714 | 0 | 0.00508857313505646 | 14 | 0 | 0.1704672000243914 | 2.4044285714285714 | 0 | 0.00508857313505646 triton_poi_fused__native_batch_norm_legi | 58 | 0 | 2.307520779930795 | 8.190706896551722 | 0 | 0.06888121731136704 | 58 | 0 | 2.307520779930795 | 8.190706896551722 | 0 | 0.06888121731136704 triton_per_fused__native_batch_norm_legi | 29 | 0 | 0.037243248971881276 | 3.0277586206896556 | 0 | 0.001111738775280038 | 29 | 0 | 0.037243248971881276 | 3.0277586206896556 | 0 | 0.001111738775280038 triton_poi_fused__native_batch_norm_legi | 20 | 0 | 0.04741699795428918 | 2.2911500000000005 | 0 | 0.0014154327747549007 | 20 | 0 | 0.04741699795428918 | 2.2911500000000005 | 0 | 0.0014154327747549007 triton_per_fused__native_batch_norm_legi | 25 | 0 | 0.13357016893727824 | 3.37536 | 0 | 0.003987169222008305 | 25 | 0 | 0.13357016893727824 | 3.37536 | 0 | 0.003987169222008305 triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.3089862268300253 | 2.8111538461538457 | 0 | 0.009223469457612694 | 13 | 0 | 0.3089862268300253 | 2.8111538461538457 | 0 | 0.009223469457612694 triton_poi_fused__native_batch_norm_legi | 17 | 0 | 0.3129385387909844 | 2.673 | 0 | 0.009341448919133863 | 17 | 0 | 0.3129385387909844 | 2.673 | 0 | 0.009341448919133863 triton_per_fused__native_batch_norm_legi | 19 | 0 | 0.2215568162533158 | 3.8837368421052636 | 0 | 0.0066136363060691275 | 19 | 0 | 0.2215568162533158 | 3.8837368421052636 | 0 | 0.0066136363060691275 std::enable_if<!(false), void>::type int | 23 | 504916805.19297093 | 1.0118296096314707 | 8.113913043478261 | 0.0025893169497075447 | 0.030203868944223014 | 23 | 504916805.19297093 | 1.0118296096314707 | 8.113913043478261 | 0.0025893169497075447 | 0.030203868944223014 triton_poi_fused_add_copy__38 | 56 | 0 | 0 | 2.132482142857143 | 0 | 0 | 56 | 0 | 0 | 2.132482142857143 | 0 | 0 triton_poi_fused_convolution_0 | 18 | 0 | 0.43458610794936897 | 2.773333333333334 | 0 | 0.012972719640279667 | 18 | 0 | 0.43458610794936897 | 2.773333333333334 | 0 | 0.012972719640279667 triton_poi_fused_convolution_1 | 17 | 0 | 0.028816312469162712 | 2.6145882352941174 | 0 | 0.0008601884319153051 | 17 | 0 | 0.028816312469162712 | 2.6145882352941174 | 0 | 0.0008601884319153051 void convolve_common_engine_float_NHWC<f | 44 | 8641868995.31118 | 0.024730540008465626 | 25.87327272727273 | 0.04431727689903169 | 0.0007382250748795709 | 44 | 8641868995.31118 | 0.024730540008465626 | 25.87327272727273 | 0.04431727689903169 | 0.0007382250748795709 triton_per_fused__native_batch_norm_legi 
| 12 | 0 | 0.6809930918986744 | 4.82675 | 0 | 0.020328151996975356 | 12 | 0 | 0.6809930918986744 | 4.82675 | 0 | 0.020328151996975356 triton_per_fused__native_batch_norm_legi | 14 | 0 | 0.02883030597936608 | 2.6651428571428575 | 0 | 0.0008606061486377935 | 14 | 0 | 0.02883030597936608 | 2.6651428571428575 | 0 | 0.0008606061486377935 triton_per_fused__native_batch_norm_legi | 16 | 0 | 0.0014658988233201874 | 2.098 | 0 | 4.375817383045335e-05 | 16 | 0 | 0.0014658988233201874 | 2.098 | 0 | 4.375817383045335e-05 triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.9926297180284697 | 3.2367692307692306 | 0 | 0.02963073785159611 | 13 | 0 | 0.9926297180284697 | 3.2367692307692306 | 0 | 0.02963073785159611 triton_poi_fused__native_batch_norm_legi | 9 | 0 | 1.3008817095666507 | 3.0863333333333336 | 0 | 0.03883228983781048 | 9 | 0 | 1.3008817095666507 | 3.0863333333333336 | 0 | 0.03883228983781048 void at::native::(anonymous namespace):: | 98 | 0 | 0.09174335613709389 | 4.408520408163265 | 0 | 0.0027386076458833994 | 98 | 0 | 0.09174335613709389 | 4.408520408163265 | 0 | 0.0027386076458833994 void at::native::vectorized_elementwise_ | 7 | 0 | 0 | 1.7278571428571428 | 0 | 0 | 7 | 0 | 0 | 1.7278571428571428 | 0 | 0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697 Approved by: https://github.com/eellison, https://github.com/shunting314 |
|||
c038719731 |
Revert "Inductor logging + analysis of torch.profile (#149697)"
This reverts commit 347ace4c7ac2dbb14799089c30bd01a9ac312791. Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail on ROCm ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-3020006655)) |
|||
e3afbb0362 |
[inductor] Add typing to _inductor/ir.py (#149958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149958 Approved by: https://github.com/Skylion007 |
|||
347ace4c7a |
Inductor logging + analysis of torch.profile (#149697)
Prereqs: - https://github.com/pytorch/pytorch/pull/152708 Features: 1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses. 1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`. 1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`. 1. Extends Triton `torch.profiler` logging to `DebugAutotuner`. 1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side: ```python Device(NVIDIA H100, 0): Kernel Name | resnet Kernel Count | resnet FLOPS | resnet bw gbps | resnet Dur (ms) | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS | newresnet bw gbps | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth % --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- triton_poi_fused__native_batch_norm_legi | 24 | 0 | 0.11395268248131513 | 2.5919166666666666 | 0 | 0.003401572611382541 | 24 | 0 | 0.11395268248131513 | 2.5919166666666666 | 0 | 0.003401572611382541 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142 | 16932673552.422373 | 0.2585007824198784 | 12.441619718309857 | 0.08683422334575583 | 0.007716441266265022 | 142 | 16932673552.422373 | 0.2585007824198784 | 12.441619718309857 | 0.08683422334575583 | 0.007716441266265022 triton_red_fused__native_batch_norm_legi | 39 | 0 | 0.13990024992108846 | 5.752589743589743 | 0 | 0.004176126863316074 | 39 | 0 | 0.13990024992108846 | 5.752589743589743 | 0 | 0.004176126863316074 triton_poi_fused__native_batch_norm_legi | 25 | 0 | 0.31824055917536503 | 2.5291999999999994 | 0 | 0.009499718184339253 | 25 | 0 | 0.31824055917536503 | 2.5291999999999994 | 0 | 0.009499718184339253 void cutlass::Kernel2<cutlass_80_tensoro | 98 | 16211056473.596165 | 0.42972434051025826 | 7.130408163265306 | 0.08313362294151874 | 0.012827592254037562 | 98 | 16211056473.596165 | 0.42972434051025826 | 7.130408163265306 | 0.08313362294151874 | 0.012827592254037562 triton_red_fused__native_batch_norm_legi | 73 | 0 | 0.3225381327611705 | 9.987068493150682 | 0 | 0.009628003963020014 | 73 | 0 | 0.3225381327611705 | 9.987068493150682 | 0 | 0.009628003963020014 triton_poi_fused__native_batch_norm_legi | 15 | 0 | 1.4491211346487216 | 4.439333333333333 | 0 | 0.043257347302946926 | 15 | 0 | 1.4491211346487216 | 4.439333333333333 | 0 | 0.043257347302946926 void cutlass::Kernel2<cutlass_80_tensoro | 186 | 14501701145.337954 | 0.2667131401910989 | 7.873865591397849 | 0.07436769818122027 | 0.007961586274361157 | 186 | 14501701145.337954 | 0.2667131401910989 | 7.873865591397849 | 0.07436769818122027 | 0.007961586274361157 triton_poi_fused__native_batch_norm_legi | 33 | 0 | 1.4924556538193923 | 4.3101515151515155 | 0 | 0.044550915039384846 | 33 | 0 | 1.4924556538193923 | 4.3101515151515155 | 0 | 0.044550915039384846 triton_red_fused__native_batch_norm_legi | 29 | 0 | 0.25562590522631107 | 6.296275862068965 | 0 | 0.007630624036606301 
| 29 | 0 | 0.25562590522631107 | 6.296275862068965 | 0 | 0.007630624036606301 triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.5870562174192726 | 2.7397692307692307 | 0 | 0.01752406619162008 | 13 | 0 | 0.5870562174192726 | 2.7397692307692307 | 0 | 0.01752406619162008 triton_poi_fused__native_batch_norm_legi | 34 | 0 | 0.41409928846284 | 2.853588235294117 | 0 | 0.012361172789935523 | 34 | 0 | 0.41409928846284 | 2.853588235294117 | 0 | 0.012361172789935523 triton_per_fused__native_batch_norm_legi | 34 | 0 | 0.11705315007018151 | 3.460647058823529 | 0 | 0.0034941238826919864 | 34 | 0 | 0.11705315007018151 | 3.460647058823529 | 0 | 0.0034941238826919864 triton_poi_fused__native_batch_norm_legi | 16 | 0 | 0.17207853197124584 | 2.3459375000000002 | 0 | 0.005136672596156592 | 16 | 0 | 0.17207853197124584 | 2.3459375000000002 | 0 | 0.005136672596156592 triton_per_fused__native_batch_norm_legi | 30 | 0 | 0.2639714322022256 | 6.131199999999999 | 0 | 0.007879744244842555 | 30 | 0 | 0.2639714322022256 | 6.131199999999999 | 0 | 0.007879744244842555 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100 | 11875430356.891787 | 0.19494470869421385 | 16.36534 | 0.06089964285585531 | 0.005819245035648175 | 100 | 11875430356.891787 | 0.19494470869421385 | 16.36534 | 0.06089964285585531 | 0.005819245035648175 triton_poi_fused__native_batch_norm_legi | 8 | 0 | 0.9854096626224687 | 3.2757500000000004 | 0 | 0.029415213809625928 | 8 | 0 | 0.9854096626224687 | 3.2757500000000004 | 0 | 0.029415213809625928 void cublasLt::splitKreduce_kernel<32, 1 | 56 | 34377923395.147064 | 0.8310300045762317 | 3.4199999999999986 | 0.17629704305203628 | 0.024806865808245714 | 56 | 34377923395.147064 | 0.8310300045762317 | 3.4199999999999986 | 0.17629704305203628 | 0.024806865808245714 triton_poi_fused__native_batch_norm_legi | 23 | 0 | 0.9944002965861103 | 3.2431304347826084 | 0 | 0.02968359094286896 | 23 | 0 | 0.9944002965861103 | 3.2431304347826084 | 0 | 0.02968359094286896 triton_per_fused__native_batch_norm_legi | 10 | 0 | 0.1826801058931057 | 4.428800000000001 | 0 | 0.00545313748934644 | 10 | 0 | 0.1826801058931057 | 4.428800000000001 | 0 | 0.00545313748934644 triton_poi_fused__native_batch_norm_legi | 10 | 0 | 0.3168973585366449 | 2.5471999999999997 | 0 | 0.009459622642884923 | 10 | 0 | 0.3168973585366449 | 2.5471999999999997 | 0 | 0.009459622642884923 triton_poi_fused__native_batch_norm_legi | 34 | 0 | 1.1463614897015777 | 4.124323529411764 | 0 | 0.03421974596124114 | 34 | 0 | 1.1463614897015777 | 4.124323529411764 | 0 | 0.03421974596124114 void cask_plugin_cudnn::xmma_cudnn::init | 44 | 44045510816.64277 | 2.0661232850348643 | 3.6887499999999993 | 0.22587441444432194 | 0.06167532194133924 | 44 | 44045510816.64277 | 2.0661232850348643 | 3.6887499999999993 | 0.22587441444432194 | 0.06167532194133924 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95 | 7876855400.165316 | 0.4694941555946739 | 18.224315789473682 | 0.04039413025725802 | 0.014014750913273854 | 95 | 7876855400.165316 | 0.4694941555946739 | 18.224315789473682 | 0.04039413025725802 | 0.014014750913273854 triton_per_fused__native_batch_norm_legi | 41 | 0 | 0.06825669875995298 | 3.0384146341463416 | 0 | 0.002037513395819492 | 41 | 0 | 0.06825669875995298 | 3.0384146341463416 | 0 | 0.002037513395819492 triton_poi_fused__native_batch_norm_legi | 23 | 0 | 0.08808154712430301 | 2.3275652173913044 | 0 | 0.0026292999141582997 | 23 | 0 | 0.08808154712430301 | 2.3275652173913044 | 0 | 0.0026292999141582997 triton_per_fused__native_batch_norm_legi | 40 | 0 | 0.18179321034952417 | 
4.556825 | 0 | 0.005426662995508183 | 40 | 0 | 0.18179321034952417 | 4.556825 | 0 | 0.005426662995508183 triton_poi_fused__native_batch_norm_legi | 15 | 0 | 0.5887415155454232 | 2.783866666666667 | 0 | 0.017574373598370836 | 15 | 0 | 0.5887415155454232 | 2.783866666666667 | 0 | 0.017574373598370836 void cutlass::Kernel2<cutlass_80_tensoro | 38 | 14242013806.264643 | 0.256592404353939 | 7.217631578947369 | 0.0730359682372546 | 0.007659474756834 | 38 | 14242013806.264643 | 0.256592404353939 | 7.217631578947369 | 0.0730359682372546 | 0.007659474756834 triton_poi_fused__native_batch_norm_legi | 21 | 0 | 0.5842860973430516 | 2.7779047619047623 | 0 | 0.017441376040091088 | 21 | 0 | 0.5842860973430516 | 2.7779047619047623 | 0 | 0.017441376040091088 triton_per_fused__native_batch_norm_legi | 16 | 0 | 0.11509365173486417 | 3.5959375000000002 | 0 | 0.0034356313950705724 | 16 | 0 | 0.11509365173486417 | 3.5959375000000002 | 0 | 0.0034356313950705724 triton_poi_fused__native_batch_norm_legi | 14 | 0 | 0.1704672000243914 | 2.4044285714285714 | 0 | 0.00508857313505646 | 14 | 0 | 0.1704672000243914 | 2.4044285714285714 | 0 | 0.00508857313505646 triton_poi_fused__native_batch_norm_legi | 58 | 0 | 2.307520779930795 | 8.190706896551722 | 0 | 0.06888121731136704 | 58 | 0 | 2.307520779930795 | 8.190706896551722 | 0 | 0.06888121731136704 triton_per_fused__native_batch_norm_legi | 29 | 0 | 0.037243248971881276 | 3.0277586206896556 | 0 | 0.001111738775280038 | 29 | 0 | 0.037243248971881276 | 3.0277586206896556 | 0 | 0.001111738775280038 triton_poi_fused__native_batch_norm_legi | 20 | 0 | 0.04741699795428918 | 2.2911500000000005 | 0 | 0.0014154327747549007 | 20 | 0 | 0.04741699795428918 | 2.2911500000000005 | 0 | 0.0014154327747549007 triton_per_fused__native_batch_norm_legi | 25 | 0 | 0.13357016893727824 | 3.37536 | 0 | 0.003987169222008305 | 25 | 0 | 0.13357016893727824 | 3.37536 | 0 | 0.003987169222008305 triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.3089862268300253 | 2.8111538461538457 | 0 | 0.009223469457612694 | 13 | 0 | 0.3089862268300253 | 2.8111538461538457 | 0 | 0.009223469457612694 triton_poi_fused__native_batch_norm_legi | 17 | 0 | 0.3129385387909844 | 2.673 | 0 | 0.009341448919133863 | 17 | 0 | 0.3129385387909844 | 2.673 | 0 | 0.009341448919133863 triton_per_fused__native_batch_norm_legi | 19 | 0 | 0.2215568162533158 | 3.8837368421052636 | 0 | 0.0066136363060691275 | 19 | 0 | 0.2215568162533158 | 3.8837368421052636 | 0 | 0.0066136363060691275 std::enable_if<!(false), void>::type int | 23 | 504916805.19297093 | 1.0118296096314707 | 8.113913043478261 | 0.0025893169497075447 | 0.030203868944223014 | 23 | 504916805.19297093 | 1.0118296096314707 | 8.113913043478261 | 0.0025893169497075447 | 0.030203868944223014 triton_poi_fused_add_copy__38 | 56 | 0 | 0 | 2.132482142857143 | 0 | 0 | 56 | 0 | 0 | 2.132482142857143 | 0 | 0 triton_poi_fused_convolution_0 | 18 | 0 | 0.43458610794936897 | 2.773333333333334 | 0 | 0.012972719640279667 | 18 | 0 | 0.43458610794936897 | 2.773333333333334 | 0 | 0.012972719640279667 triton_poi_fused_convolution_1 | 17 | 0 | 0.028816312469162712 | 2.6145882352941174 | 0 | 0.0008601884319153051 | 17 | 0 | 0.028816312469162712 | 2.6145882352941174 | 0 | 0.0008601884319153051 void convolve_common_engine_float_NHWC<f | 44 | 8641868995.31118 | 0.024730540008465626 | 25.87327272727273 | 0.04431727689903169 | 0.0007382250748795709 | 44 | 8641868995.31118 | 0.024730540008465626 | 25.87327272727273 | 0.04431727689903169 | 0.0007382250748795709 triton_per_fused__native_batch_norm_legi 
| 12 | 0 | 0.6809930918986744 | 4.82675 | 0 | 0.020328151996975356 | 12 | 0 | 0.6809930918986744 | 4.82675 | 0 | 0.020328151996975356 triton_per_fused__native_batch_norm_legi | 14 | 0 | 0.02883030597936608 | 2.6651428571428575 | 0 | 0.0008606061486377935 | 14 | 0 | 0.02883030597936608 | 2.6651428571428575 | 0 | 0.0008606061486377935 triton_per_fused__native_batch_norm_legi | 16 | 0 | 0.0014658988233201874 | 2.098 | 0 | 4.375817383045335e-05 | 16 | 0 | 0.0014658988233201874 | 2.098 | 0 | 4.375817383045335e-05 triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.9926297180284697 | 3.2367692307692306 | 0 | 0.02963073785159611 | 13 | 0 | 0.9926297180284697 | 3.2367692307692306 | 0 | 0.02963073785159611 triton_poi_fused__native_batch_norm_legi | 9 | 0 | 1.3008817095666507 | 3.0863333333333336 | 0 | 0.03883228983781048 | 9 | 0 | 1.3008817095666507 | 3.0863333333333336 | 0 | 0.03883228983781048 void at::native::(anonymous namespace):: | 98 | 0 | 0.09174335613709389 | 4.408520408163265 | 0 | 0.0027386076458833994 | 98 | 0 | 0.09174335613709389 | 4.408520408163265 | 0 | 0.0027386076458833994 void at::native::vectorized_elementwise_ | 7 | 0 | 0 | 1.7278571428571428 | 0 | 0 | 7 | 0 | 0 | 1.7278571428571428 | 0 | 0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697 Approved by: https://github.com/eellison, https://github.com/shunting314 |
|||
13efb2c858 |
[BE] Deprecate search_autotune_cache (#155302)
We haven't had the offline cache populated in over a year, so this *should* be safe; if this passes, we can finally go through and rip out the offline cache logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155302 Approved by: https://github.com/masnesral |
|||
dfc31b3345 |
[BE] comments + try to get rid of secondary make_autotune_fn (#156358)
Not sure this will work, but let's try it on the unit tests. The only thing I am worried about is the counters drifting off from their true values, so let the unit tests check that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156358 Approved by: https://github.com/masnesral |
|||
3608737347 |
[ez] fix typo in comment (#156402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156402 Approved by: https://github.com/BoyuanFeng ghstack dependencies: #156397 |
|||
d8bb5ac260 |
[ez] fix typo in select_algorithm.py (#156625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156625 Approved by: https://github.com/Skylion007, https://github.com/BoyuanFeng ghstack dependencies: #156445 |
|||
4491326fb0 |
[inductor] select_algorithm: add preprocessing fns (#156464)
Summary: # Why: keep code cleaner; provide a modular way to hook up preprocessing steps; expand testability of flows that change which choices are provided, e.g. to test performance models and lookup tables by running torch.compile. # What: similar to feedback_saver_fns, there are now preprocessing_fns; the existing regex logic is moved into those as a proof of concept. Test Plan: ``` buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 | tee /tmp/epx038 ``` This does not exercise the logic; it just shows that it's safe right now. Rollback Plan: Differential Revision: D76946993 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156464 Approved by: https://github.com/masnesral |
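A sketch of the registration pattern, analogous to feedback_saver_fns: preprocessing functions take the choice list and return a (possibly filtered) one, and the old regex filter becomes just another registered hook. All names here are illustrative, not the real select_algorithm API.

```python
import re

_preprocessing_fns = []

def add_preprocessing_fn(fn):
    # Registered hooks run over the candidate choices before autotuning.
    _preprocessing_fns.append(fn)
    return fn

def preprocess(choices):
    for fn in _preprocessing_fns:
        choices = fn(choices)
    return choices

@add_preprocessing_fn
def drop_by_regex(choices, _pat=re.compile(r"cutlass")):
    # The former inline regex filtering, now a pluggable step.
    return [c for c in choices if not _pat.search(c)]

print(preprocess(["triton_mm_1", "cutlass_f6c25cf2", "aten_mm"]))
```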
|||
6ff6630375 |
[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313 Approved by: https://github.com/jingsh |
|||
f1331f3f1b |
Revert "[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)"
This reverts commit 3627270bdf17b0fb6f528ca1cb87d6f2ec32680a.
Reverted https://github.com/pytorch/pytorch/pull/156313 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](
|
|||
3627270bdf |
[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313 Approved by: https://github.com/jingsh |
|||
9afee0fa96 |
[inductor] Set num_workers to number of available cpu divided by number of available gpu (#156201)
internal: https://fb.workplace.com/groups/1075192433118967/posts/1689562705015267/?comment_id=1690284241609780&notif_id=1749770611538976&notif_t=work_group_comment&ref=notif Right now it doesn't have the divided-by-2 logic yet. Not sure how to tell if we are on a dev machine. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156201 Approved by: https://github.com/masnesral |
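The heuristic amounts to giving each GPU an equal share of the host's cores, minus the dev-machine halving that, as noted, is not implemented yet. A sketch of the computation; the function name is invented and this is not the actual config code.

```python
import os
import torch

def default_autotune_workers() -> int:
    # Split the host's usable cores evenly across visible GPUs.
    if hasattr(os, "sched_getaffinity"):
        cpus = len(os.sched_getaffinity(0))  # respects cgroup/affinity limits
    else:
        cpus = os.cpu_count() or 1
    gpus = max(1, torch.cuda.device_count())
    return max(1, cpus // gpus)

print(default_autotune_workers())
```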
|||
bb462a6237 |
[cutlass backend] Fix prescreening non-deterministic problem (#156144)
Differential Revision: [D76642615](https://our.internmc.facebook.com/intern/diff/D76642615/) What do we expect to see when we run two identical matmuls back to back? We expect to see the second one spending no time in precompilation, autotuning, and prescreening. However, the introduction of prescreening brings some non-determinism. Basically: 1. prescreening of the first matmul chooses a set of kernels to advance to autotuning 2. autotuning re-does the autotuning of the winners, potentially changing their timings a bit 3. the second prescreening results in a slightly different set of kernels 4. since not all timings are present, an autotune is re-done. With this diff: ``` SingleProcess AUTOTUNE benchmarking takes 3.8633 seconds and 134.7364 seconds precompiling for 32 choices and 24.4472 seconds prescreening SingleProcess AUTOTUNE benchmarking takes 0.0003 seconds and 0.0027 seconds precompiling for 32 choices and 0.0006 seconds prescreening ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/156144 Approved by: https://github.com/mlazos |
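One way to read the fix: make the second identical matmul see exactly the prescreening outcome of the first, e.g. by memoizing on the candidate set, so autotuning never re-runs for missing timings. The sketch below illustrates that idea only; it is an assumption, not the actual mechanism in the diff.

```python
import hashlib

_prescreen_cache: dict[str, list[str]] = {}

def prescreen(choices, benchmark):
    # Key the result on the exact candidate set so an identical matmul
    # reuses the winners instead of re-screening nondeterministically.
    key = hashlib.sha256("|".join(sorted(choices)).encode()).hexdigest()
    if key not in _prescreen_cache:
        ranked = sorted(choices, key=benchmark)
        _prescreen_cache[key] = ranked[: max(1, len(ranked) // 4)]
    return _prescreen_cache[key]

# Toy benchmark: pretend shorter names are faster kernels.
print(prescreen(["cutlass_a", "cutlass_bb", "triton_mm_1", "aten_mm"], len))
print(prescreen(["cutlass_a", "cutlass_bb", "triton_mm_1", "aten_mm"], len))
```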
|||
a2a75be0f8 |
Rename inductor cache (#156128)
Requested by Simon on a different PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/156128 Approved by: https://github.com/xmfan |
|||
fdf5d97fa8 |
[cutlass backend][ez] Log timings from prescreening (#155757)
Differential Revision: [D76474669](https://our.internmc.facebook.com/intern/diff/D76474669/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155757 Approved by: https://github.com/ColinPeppler |
|||
6c0b42fd2f |
[inductor][cutlass backend] Log prescreening elpase (#155508)
Differential Revision: [D76311352](https://our.internmc.facebook.com/intern/diff/D76311352/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/155508 Approved by: https://github.com/jingsh |
|||
d1947a8707 |
Migrate from lru_cache to cache (#155613)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155613 Approved by: https://github.com/ezyang ghstack dependencies: #155612 |
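For reference, the two spellings are equivalent: `functools.cache` (Python 3.9+) is a thin alias for `functools.lru_cache(maxsize=None)`, so the migration is purely cosmetic.

```python
import functools

@functools.lru_cache(maxsize=None)   # before
def fib_old(n):
    return n if n < 2 else fib_old(n - 1) + fib_old(n - 2)

@functools.cache                     # after: the same unbounded cache
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

assert fib(20) == fib_old(20) == 6765
```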
|||
d2f06d2b06 |
[logs] Change autotune data into separate items (#155525)
Summary: Split the autotune data into multiple keys and items: this is better for storage of the data and easier querying. Test Plan: ``` TORCHINDUCTOR_MAX_AUTOTUNE=1 tlp buck run (sample) ``` Rollback Plan: Differential Revision: D76303514 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155525 Approved by: https://github.com/jamesjwu, https://github.com/masnesral |
|||
0b677560e6 |
[inductor] use int64 for large index (#154575)
Split reduction may need to add an extra mask to avoid an invalid index. Previously we always used the torch.int32 dtype, which causes problems when the tensor numel exceeds 2^31. Fix https://github.com/pytorch/pytorch/issues/154168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154575 Approved by: https://github.com/ngimel, https://github.com/jansel |
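The dtype choice reduces to a numel threshold at the int32 limit, as in this sketch (`index_dtype_for` is an invented helper; the real logic lives in the split-reduction codegen):

```python
import torch

INT32_MAX = 2**31 - 1

def index_dtype_for(numel: int) -> torch.dtype:
    # Index/mask arithmetic overflows int32 once numel exceeds 2**31 - 1,
    # so large tensors must be indexed with int64.
    return torch.int32 if numel <= INT32_MAX else torch.int64

print(index_dtype_for(10**6))  # torch.int32
print(index_dtype_for(2**32))  # torch.int64
```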
|||
eb152ab1dd |
Revert "Inductor logging + analysis of torch.profile (#149697)"
This reverts commit 060838c2312ad207c7afe2c86f8a484afea5f328.
Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/clee2000 due to broke a bunch of tests internally D76299454, probably also broke rocm inductor/test_analysis.py::TestAnalysisCUDA::test_augment_trace_against_flop_counter_maxat0_cuda_float16 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15545277599/job/43766911025) [HUD commit link](
|
|||
060838c231 |
Inductor logging + analysis of torch.profile (#149697)
Prereqs:
- https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the tflops estimation from triton if we don't have the info from the datasheet, because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` help get the flops of an `fx.Node` (see the sketch after this entry).
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```
Device(NVIDIA H100, 0):
Kernel Name | resnet Kernel Count | resnet FLOPS | resnet bw gbps | resnet Dur (ms) | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS | newresnet bw gbps | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
triton_poi_fused__native_batch_norm_legi | 24 | 0 | 0.11395268248131513 | 2.5919166666666666 | 0 | 0.003401572611382541 | 24 | 0 | 0.11395268248131513 | 2.5919166666666666 | 0 | 0.003401572611382541
sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142 | 16932673552.422373 | 0.2585007824198784 | 12.441619718309857 | 0.08683422334575583 | 0.007716441266265022 | 142 | 16932673552.422373 | 0.2585007824198784 | 12.441619718309857 | 0.08683422334575583 | 0.007716441266265022
triton_red_fused__native_batch_norm_legi | 39 | 0 | 0.13990024992108846 | 5.752589743589743 | 0 | 0.004176126863316074 | 39 | 0 | 0.13990024992108846 | 5.752589743589743 | 0 | 0.004176126863316074
triton_poi_fused__native_batch_norm_legi | 25 | 0 | 0.31824055917536503 | 2.5291999999999994 | 0 | 0.009499718184339253 | 25 | 0 | 0.31824055917536503 | 2.5291999999999994 | 0 | 0.009499718184339253
void cutlass::Kernel2<cutlass_80_tensoro | 98 | 16211056473.596165 | 0.42972434051025826 | 7.130408163265306 | 0.08313362294151874 | 0.012827592254037562 | 98 | 16211056473.596165 | 0.42972434051025826 | 7.130408163265306 | 0.08313362294151874 | 0.012827592254037562
triton_red_fused__native_batch_norm_legi | 73 | 0 | 0.3225381327611705 | 9.987068493150682 | 0 | 0.009628003963020014 | 73 | 0 | 0.3225381327611705 | 9.987068493150682 | 0 | 0.009628003963020014
triton_poi_fused__native_batch_norm_legi | 15 | 0 | 1.4491211346487216 | 4.439333333333333 | 0 | 0.043257347302946926 | 15 | 0 | 1.4491211346487216 | 4.439333333333333 | 0 | 0.043257347302946926
void cutlass::Kernel2<cutlass_80_tensoro | 186 | 14501701145.337954 | 0.2667131401910989 | 7.873865591397849 | 0.07436769818122027 | 0.007961586274361157 | 186 | 14501701145.337954 | 0.2667131401910989 | 7.873865591397849 | 0.07436769818122027 | 0.007961586274361157
triton_poi_fused__native_batch_norm_legi | 33 | 0 | 1.4924556538193923 | 4.3101515151515155 | 0 | 0.044550915039384846 | 33 | 0 | 1.4924556538193923 | 4.3101515151515155 | 0 | 0.044550915039384846
triton_red_fused__native_batch_norm_legi | 29 | 0 | 0.25562590522631107 | 6.296275862068965 | 0 | 0.007630624036606301 | 29 | 0 | 0.25562590522631107 | 6.296275862068965 | 0 | 0.007630624036606301
triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.5870562174192726 | 2.7397692307692307 | 0 | 0.01752406619162008 | 13 | 0 | 0.5870562174192726 | 2.7397692307692307 | 0 | 0.01752406619162008
triton_poi_fused__native_batch_norm_legi | 34 | 0 | 0.41409928846284 | 2.853588235294117 | 0 | 0.012361172789935523 | 34 | 0 | 0.41409928846284 | 2.853588235294117 | 0 | 0.012361172789935523
triton_per_fused__native_batch_norm_legi | 34 | 0 | 0.11705315007018151 | 3.460647058823529 | 0 | 0.0034941238826919864 | 34 | 0 | 0.11705315007018151 | 3.460647058823529 | 0 | 0.0034941238826919864
triton_poi_fused__native_batch_norm_legi | 16 | 0 | 0.17207853197124584 | 2.3459375000000002 | 0 | 0.005136672596156592 | 16 | 0 | 0.17207853197124584 | 2.3459375000000002 | 0 | 0.005136672596156592
triton_per_fused__native_batch_norm_legi | 30 | 0 | 0.2639714322022256 | 6.131199999999999 | 0 | 0.007879744244842555 | 30 | 0 | 0.2639714322022256 | 6.131199999999999 | 0 | 0.007879744244842555
sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100 | 11875430356.891787 | 0.19494470869421385 | 16.36534 | 0.06089964285585531 | 0.005819245035648175 | 100 | 11875430356.891787 | 0.19494470869421385 | 16.36534 | 0.06089964285585531 | 0.005819245035648175
triton_poi_fused__native_batch_norm_legi | 8 | 0 | 0.9854096626224687 | 3.2757500000000004 | 0 | 0.029415213809625928 | 8 | 0 | 0.9854096626224687 | 3.2757500000000004 | 0 | 0.029415213809625928
void cublasLt::splitKreduce_kernel<32, 1 | 56 | 34377923395.147064 | 0.8310300045762317 | 3.4199999999999986 | 0.17629704305203628 | 0.024806865808245714 | 56 | 34377923395.147064 | 0.8310300045762317 | 3.4199999999999986 | 0.17629704305203628 | 0.024806865808245714
triton_poi_fused__native_batch_norm_legi | 23 | 0 | 0.9944002965861103 | 3.2431304347826084 | 0 | 0.02968359094286896 | 23 | 0 | 0.9944002965861103 | 3.2431304347826084 | 0 | 0.02968359094286896
triton_per_fused__native_batch_norm_legi | 10 | 0 | 0.1826801058931057 | 4.428800000000001 | 0 | 0.00545313748934644 | 10 | 0 | 0.1826801058931057 | 4.428800000000001 | 0 | 0.00545313748934644
triton_poi_fused__native_batch_norm_legi | 10 | 0 | 0.3168973585366449 | 2.5471999999999997 | 0 | 0.009459622642884923 | 10 | 0 | 0.3168973585366449 | 2.5471999999999997 | 0 | 0.009459622642884923
triton_poi_fused__native_batch_norm_legi | 34 | 0 | 1.1463614897015777 | 4.124323529411764 | 0 | 0.03421974596124114 | 34 | 0 | 1.1463614897015777 | 4.124323529411764 | 0 | 0.03421974596124114
void cask_plugin_cudnn::xmma_cudnn::init | 44 | 44045510816.64277 | 2.0661232850348643 | 3.6887499999999993 | 0.22587441444432194 | 0.06167532194133924 | 44 | 44045510816.64277 | 2.0661232850348643 | 3.6887499999999993 | 0.22587441444432194 | 0.06167532194133924
sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95 | 7876855400.165316 | 0.4694941555946739 | 18.224315789473682 | 0.04039413025725802 | 0.014014750913273854 | 95 | 7876855400.165316 | 0.4694941555946739 | 18.224315789473682 | 0.04039413025725802 | 0.014014750913273854
triton_per_fused__native_batch_norm_legi | 41 | 0 | 0.06825669875995298 | 3.0384146341463416 | 0 | 0.002037513395819492 | 41 | 0 | 0.06825669875995298 | 3.0384146341463416 | 0 | 0.002037513395819492
triton_poi_fused__native_batch_norm_legi | 23 | 0 | 0.08808154712430301 | 2.3275652173913044 | 0 | 0.0026292999141582997 | 23 | 0 | 0.08808154712430301 | 2.3275652173913044 | 0 | 0.0026292999141582997
triton_per_fused__native_batch_norm_legi | 40 | 0 | 0.18179321034952417 | 4.556825 | 0 | 0.005426662995508183 | 40 | 0 | 0.18179321034952417 | 4.556825 | 0 | 0.005426662995508183
triton_poi_fused__native_batch_norm_legi | 15 | 0 | 0.5887415155454232 | 2.783866666666667 | 0 | 0.017574373598370836 | 15 | 0 | 0.5887415155454232 | 2.783866666666667 | 0 | 0.017574373598370836
void cutlass::Kernel2<cutlass_80_tensoro | 38 | 14242013806.264643 | 0.256592404353939 | 7.217631578947369 | 0.0730359682372546 | 0.007659474756834 | 38 | 14242013806.264643 | 0.256592404353939 | 7.217631578947369 | 0.0730359682372546 | 0.007659474756834
triton_poi_fused__native_batch_norm_legi | 21 | 0 | 0.5842860973430516 | 2.7779047619047623 | 0 | 0.017441376040091088 | 21 | 0 | 0.5842860973430516 | 2.7779047619047623 | 0 | 0.017441376040091088
triton_per_fused__native_batch_norm_legi | 16 | 0 | 0.11509365173486417 | 3.5959375000000002 | 0 | 0.0034356313950705724 | 16 | 0 | 0.11509365173486417 | 3.5959375000000002 | 0 | 0.0034356313950705724
triton_poi_fused__native_batch_norm_legi | 14 | 0 | 0.1704672000243914 | 2.4044285714285714 | 0 | 0.00508857313505646 | 14 | 0 | 0.1704672000243914 | 2.4044285714285714 | 0 | 0.00508857313505646
triton_poi_fused__native_batch_norm_legi | 58 | 0 | 2.307520779930795 | 8.190706896551722 | 0 | 0.06888121731136704 | 58 | 0 | 2.307520779930795 | 8.190706896551722 | 0 | 0.06888121731136704
triton_per_fused__native_batch_norm_legi | 29 | 0 | 0.037243248971881276 | 3.0277586206896556 | 0 | 0.001111738775280038 | 29 | 0 | 0.037243248971881276 | 3.0277586206896556 | 0 | 0.001111738775280038
triton_poi_fused__native_batch_norm_legi | 20 | 0 | 0.04741699795428918 | 2.2911500000000005 | 0 | 0.0014154327747549007 | 20 | 0 | 0.04741699795428918 | 2.2911500000000005 | 0 | 0.0014154327747549007
triton_per_fused__native_batch_norm_legi | 25 | 0 | 0.13357016893727824 | 3.37536 | 0 | 0.003987169222008305 | 25 | 0 | 0.13357016893727824 | 3.37536 | 0 | 0.003987169222008305
triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.3089862268300253 | 2.8111538461538457 | 0 | 0.009223469457612694 | 13 | 0 | 0.3089862268300253 | 2.8111538461538457 | 0 | 0.009223469457612694
triton_poi_fused__native_batch_norm_legi | 17 | 0 | 0.3129385387909844 | 2.673 | 0 | 0.009341448919133863 | 17 | 0 | 0.3129385387909844 | 2.673 | 0 | 0.009341448919133863
triton_per_fused__native_batch_norm_legi | 19 | 0 | 0.2215568162533158 | 3.8837368421052636 | 0 | 0.0066136363060691275 | 19 | 0 | 0.2215568162533158 | 3.8837368421052636 | 0 | 0.0066136363060691275
std::enable_if<!(false), void>::type int | 23 | 504916805.19297093 | 1.0118296096314707 | 8.113913043478261 | 0.0025893169497075447 | 0.030203868944223014 | 23 | 504916805.19297093 | 1.0118296096314707 | 8.113913043478261 | 0.0025893169497075447 | 0.030203868944223014
triton_poi_fused_add_copy__38 | 56 | 0 | 0 | 2.132482142857143 | 0 | 0 | 56 | 0 | 0 | 2.132482142857143 | 0 | 0
triton_poi_fused_convolution_0 | 18 | 0 | 0.43458610794936897 | 2.773333333333334 | 0 | 0.012972719640279667 | 18 | 0 | 0.43458610794936897 | 2.773333333333334 | 0 | 0.012972719640279667
triton_poi_fused_convolution_1 | 17 | 0 | 0.028816312469162712 | 2.6145882352941174 | 0 | 0.0008601884319153051 | 17 | 0 | 0.028816312469162712 | 2.6145882352941174 | 0 | 0.0008601884319153051
void convolve_common_engine_float_NHWC<f | 44 | 8641868995.31118 | 0.024730540008465626 | 25.87327272727273 | 0.04431727689903169 | 0.0007382250748795709 | 44 | 8641868995.31118 | 0.024730540008465626 | 25.87327272727273 | 0.04431727689903169 | 0.0007382250748795709
triton_per_fused__native_batch_norm_legi | 12 | 0 | 0.6809930918986744 | 4.82675 | 0 | 0.020328151996975356 | 12 | 0 | 0.6809930918986744 | 4.82675 | 0 | 0.020328151996975356
triton_per_fused__native_batch_norm_legi | 14 | 0 | 0.02883030597936608 | 2.6651428571428575 | 0 | 0.0008606061486377935 | 14 | 0 | 0.02883030597936608 | 2.6651428571428575 | 0 | 0.0008606061486377935
triton_per_fused__native_batch_norm_legi | 16 | 0 | 0.0014658988233201874 | 2.098 | 0 | 4.375817383045335e-05 | 16 | 0 | 0.0014658988233201874 | 2.098 | 0 | 4.375817383045335e-05
triton_poi_fused__native_batch_norm_legi | 13 | 0 | 0.9926297180284697 | 3.2367692307692306 | 0 | 0.02963073785159611 | 13 | 0 | 0.9926297180284697 | 3.2367692307692306 | 0 | 0.02963073785159611
triton_poi_fused__native_batch_norm_legi | 9 | 0 | 1.3008817095666507 | 3.0863333333333336 | 0 | 0.03883228983781048 | 9 | 0 | 1.3008817095666507 | 3.0863333333333336 | 0 | 0.03883228983781048
void at::native::(anonymous namespace):: | 98 | 0 | 0.09174335613709389 | 4.408520408163265 | 0 | 0.0027386076458833994 | 98 | 0 | 0.09174335613709389 | 4.408520408163265 | 0 | 0.0027386076458833994
void at::native::vectorized_elementwise_ | 7 | 0 | 0 | 1.7278571428571428 | 0 | 0 | 7 | 0 | 0 | 1.7278571428571428 | 0 | 0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697 Approved by: https://github.com/eellison, https://github.com/shunting314 |
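A minimal sketch of how the new fx helpers might be used; the helper names come from the PR description, but the `torch._inductor.utils` module path and the exact signatures are assumptions, and `print_graph_flops` is a hypothetical wrapper written here for illustration:
```python
# Sketch only: countable_fx/count_flops_fx are named in the PR text; the
# import path and call signatures below are assumed, not confirmed.
import torch
from torch._inductor.utils import countable_fx, count_flops_fx

def print_graph_flops(gm: torch.fx.GraphModule) -> None:
    total = 0
    for node in gm.graph.nodes:
        if countable_fx(node):            # node has a known FLOP formula
            flops = count_flops_fx(node)  # may be None if shape metadata is missing
            if flops is not None:
                total += flops
                print(f"{node.name}: {flops} FLOPs")
    print(f"total: {total} FLOPs")

gm = torch.fx.symbolic_trace(lambda a, b: torch.matmul(a, b))
print_graph_flops(gm)
```
The script itself is driven from the command line with the three flags named above (`--augment_trace`, `--analyze`, `--diff`); the exact argument layout is not shown in the PR text, so any invocation beyond the flag names would be an assumption.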
|||
27df0c56b7 |
Revert "[inductor] use int64 for large index (#154575)"
This reverts commit 2596e3d0617852469241be8777cf46db5c83928c.
Reverted https://github.com/pytorch/pytorch/pull/154575 on behalf of https://github.com/clee2000 due to broke inductor/test_op_dtype_prop.py::TestCaseCUDA::test_op_dtype_propagation_add_cuda_int32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15510656657/job/43673763835) [HUD commit link](
|
|||
2596e3d061 |
[inductor] use int64 for large index (#154575)
Split reduction may need to add an extra mask to avoid invalid indices. Previously we always used the torch.int32 dtype, which causes problems when the tensor numel exceeds 2^31. Fixes https://github.com/pytorch/pytorch/issues/154168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154575 Approved by: https://github.com/ngimel, https://github.com/jansel |
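For context, a minimal repro sketch of the failure mode, assuming a CUDA device with enough memory for a tensor of more than 2^31 elements (the exact shapes in issue #154168 may differ):
```python
# Sketch under assumed shapes: once numel exceeds 2**31, int32 split-
# reduction indices overflow and the reduction can produce wrong results;
# with int64 indices the sum is correct.
import torch

x = torch.ones(2**31 + 2, dtype=torch.int8, device="cuda")  # ~2 GB

f = torch.compile(lambda t: t.sum())
print(f(x))  # expected: tensor(2147483650)
```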
|||
400f439670 |
[pt][easy] Rename metadata column (#155365)
Summary: Fixing a typo: our logging requires autotuning_data instead of autotune_data, making the naming consistent.
Test Plan: Run the benchmark and observe the proper column name in the perfetto trace.
Rollback Plan:
Differential Revision: D76159393
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155365 Approved by: https://github.com/masnesral, https://github.com/Skylion007 |
|||
64436c38c9 |
[logs] Add autotuning data (#154771)
Summary: Add autotuning logging data to the scuba/chrome trace.
Test Plan:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 tlp buck run //scripts/sashko:compilation_sample
```
Open https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/viewer?local_cache_key=00000000-0000-0000-92db-f23383ebf5b5, search for template_autotuning, and see the strides in the metadata (see screenshot).
Differential Revision: D75457770
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154771 Approved by: https://github.com/masnesral, https://github.com/PaulZhang12 |
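The `tlp buck run` invocation above is Meta-internal; a public-API sketch that produces an equivalent chrome trace for the Perfetto UI might look like the following (the file name and toy workload are made up for illustration):
```python
# Sketch using standard public APIs: compile with max-autotune so the
# autotuner runs, profile the call, and export a chrome trace that the
# Perfetto UI can open (look for template_autotuning events).
import torch
from torch.profiler import profile, ProfilerActivity

@torch.compile(mode="max-autotune")
def f(a, b):
    return a @ b

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    f(a, b)

prof.export_chrome_trace("trace.json")
```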
|||
7e4c097b07 |
Revert "[inductor] Add typing to _inductor/ir.py (#149958)"
This reverts commit 529e0357c6c4e74f8cd32c29198c5f1c9f6e329d.
Reverted https://github.com/pytorch/pytorch/pull/149958 on behalf of https://github.com/malfet due to Looks like it broke inductor_torchbind tests, due to more graphbreaks, see
|