|
12b02137af
|
[MPS] Add benchmark for scan operations (#156241)
Comparison of cumsum performance before and after the Metal implementation:
Previous performance (using torch==2.7.1):
```
[------------------------------- -------------------------------]
| eager | compile
1 threads: -------------------------------------------------------
cumsum-dim0-32x32 (torch.float16) | 131.0 | 136.9
cumsum-dim0-128x128 (torch.float16) | 116.9 | 121.2
cumsum-dim0-512x512 (torch.float16) | 132.5 | 151.9
cumsum-dim0-1024x1024 (torch.float16) | 150.0 | 163.0
cumsum-dim1-32x32 (torch.float16) | 125.9 | 140.9
cumsum-dim1-128x128 (torch.float16) | 116.4 | 129.4
cumsum-dim1-512x512 (torch.float16) | 135.9 | 150.1
cumsum-dim1-1024x1024 (torch.float16) | 139.5 | 154.2
cumsum-1d-100 (torch.float16) | 119.5 | 127.1
cumsum-1d-10000 (torch.float16) | 128.9 | 142.5
cumsum-1d-1000000 (torch.float16) | 140.6 | 145.6
cumsum-dim0-32x32 (torch.float32) | 115.7 | 132.5
cumsum-dim0-128x128 (torch.float32) | 118.0 | 131.5
cumsum-dim0-512x512 (torch.float32) | 138.8 | 151.6
cumsum-dim0-1024x1024 (torch.float32) | 155.5 | 164.2
cumsum-dim1-32x32 (torch.float32) | 127.2 | 141.7
cumsum-dim1-128x128 (torch.float32) | 117.7 | 130.5
cumsum-dim1-512x512 (torch.float32) | 138.2 | 152.3
cumsum-dim1-1024x1024 (torch.float32) | 144.4 | 158.6
cumsum-1d-100 (torch.float32) | 118.6 | 128.0
cumsum-1d-10000 (torch.float32) | 125.5 | 141.5
cumsum-1d-1000000 (torch.float32) | 143.9 | 158.4
cumsum-dim0-32x32 (torch.bfloat16) | 106.6 | 137.6
cumsum-dim0-128x128 (torch.bfloat16) | 118.1 | 131.0
cumsum-dim0-512x512 (torch.bfloat16) | 140.0 | 154.3
cumsum-dim0-1024x1024 (torch.bfloat16) | 153.2 | 164.4
cumsum-dim1-32x32 (torch.bfloat16) | 127.9 | 132.6
cumsum-dim1-128x128 (torch.bfloat16) | 116.5 | 129.6
cumsum-dim1-512x512 (torch.bfloat16) | 136.5 | 151.2
cumsum-dim1-1024x1024 (torch.bfloat16) | 139.8 | 144.8
cumsum-1d-100 (torch.bfloat16) | 115.7 | 129.4
cumsum-1d-10000 (torch.bfloat16) | 125.0 | 143.3
cumsum-1d-1000000 (torch.bfloat16) | 127.8 | 143.4
Times are in microseconds (us).
```
Current performance:
```
[-------------------------------- --------------------------------]
| eager | compile
1 threads: ---------------------------------------------------------
cumsum-dim0-32x32 (torch.float16) | 107.4 | 123.8
cumsum-dim0-128x128 (torch.float16) | 134.2 | 145.8
cumsum-dim0-512x512 (torch.float16) | 207.3 | 231.6
cumsum-dim0-1024x1024 (torch.float16) | 318.9 | 355.3
cumsum-dim1-32x32 (torch.float16) | 98.0 | 114.3
cumsum-dim1-128x128 (torch.float16) | 110.8 | 121.6
cumsum-dim1-512x512 (torch.float16) | 193.0 | 209.1
cumsum-dim1-1024x1024 (torch.float16) | 844.7 | 870.8
cumsum-1d-100 (torch.float16) | 108.4 | 125.0
cumsum-1d-10000 (torch.float16) | 784.7 | 852.3
cumsum-1d-1000000 (torch.float16) | 65855.2 | 66725.9
cumsum-dim0-32x32 (torch.float32) | 114.7 | 115.7
cumsum-dim0-128x128 (torch.float32) | 139.0 | 151.6
cumsum-dim0-512x512 (torch.float32) | 197.3 | 208.0
cumsum-dim0-1024x1024 (torch.float32) | 312.7 | 332.9
cumsum-dim1-32x32 (torch.float32) | 92.0 | 110.8
cumsum-dim1-128x128 (torch.float32) | 114.2 | 125.0
cumsum-dim1-512x512 (torch.float32) | 186.2 | 196.1
cumsum-dim1-1024x1024 (torch.float32) | 752.0 | 825.0
cumsum-1d-100 (torch.float32) | 112.4 | 122.0
cumsum-1d-10000 (torch.float32) | 793.5 | 863.5
cumsum-1d-1000000 (torch.float32) | 66431.8 | 66040.0
cumsum-dim0-32x32 (torch.bfloat16) | 111.6 | 121.6
cumsum-dim0-128x128 (torch.bfloat16) | 139.0 | 138.4
cumsum-dim0-512x512 (torch.bfloat16) | 217.6 | 230.1
cumsum-dim0-1024x1024 (torch.bfloat16) | 305.2 | 325.6
cumsum-dim1-32x32 (torch.bfloat16) | 100.5 | 110.9
cumsum-dim1-128x128 (torch.bfloat16) | 112.8 | 125.0
cumsum-dim1-512x512 (torch.bfloat16) | 187.8 | 208.9
cumsum-dim1-1024x1024 (torch.bfloat16) | 790.9 | 864.7
cumsum-1d-100 (torch.bfloat16) | 111.6 | 124.6
cumsum-1d-10000 (torch.bfloat16) | 778.1 | 844.9
cumsum-1d-1000000 (torch.bfloat16) | 64654.3 | 64082.5
Times are in microseconds (us).
```
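The log above only shows timings, not the kernel itself. The operation being benchmarked, cumsum, is an inclusive prefix sum, and a classic way to parallelize it is the Hillis-Steele scan. As an illustrative sketch only (plain Python emulating the parallel rounds, not the actual Metal kernel from the PR):

```python
# Illustrative sketch: a Hillis-Steele inclusive scan in plain Python.
# The PR implements the scan on the GPU in Metal; this only shows the
# algorithmic idea behind a parallel cumsum.

def hillis_steele_scan(xs):
    """Inclusive prefix sum; each round doubles the reach of partial sums."""
    out = list(xs)
    offset = 1
    while offset < len(out):
        # In a GPU kernel every element is updated in parallel;
        # here we snapshot the previous round to emulate that.
        prev = list(out)
        for i in range(offset, len(out)):
            out[i] = prev[i] + prev[i - offset]
        offset *= 2
    return out
```

For an input of length n this takes O(log n) rounds, which is why the large 1-D cases (cumsum-1d-1000000) are the most sensitive to how the scan is implemented.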
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156241
Approved by: https://github.com/malfet
|
2025-06-17 22:30:22 +00:00 |
|
|
ee97299961
|
[MPS][Testing] Benchmark reduction ops (#150452)
It compares eager vs. compile.
On my M4 Pro Mac mini I'm currently getting the following:
```
[--------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------]
| eager-512x512 | compile-512x512 | eager-1024x1024 | compile-1024x1024 | eager-2048x2048 | compile-2048x2048 | eager-4096x4096 | compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sum (torch.float32) | 121.0 | 201.5 | 130.3 | 772.3 | 179.4 | 1470.5 | 476.1 | 2980.0
max (torch.float32) | 154.1 | 165.9 | 198.7 | 211.6 | 344.2 | 386.9 | 1326.6 | 1345.6
```
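The reductions being timed here (sum and max over a 2-D tensor) are typically implemented on GPUs as tree reductions. As a hedged, dependency-free sketch of that pattern (plain Python, not the actual MPS kernel):

```python
# Illustrative sketch: pairwise (tree) reduction, the pattern GPU reduction
# kernels typically follow. Not the actual MPS implementation.

def tree_reduce(xs, op):
    """Reduce xs with a binary op by halving the array each round."""
    vals = list(xs)
    assert vals, "tree_reduce needs a non-empty input"
    while len(vals) > 1:
        half = (len(vals) + 1) // 2
        # Pair element i with element i + half; an odd leftover passes through.
        vals = [
            op(vals[i], vals[i + half]) if i + half < len(vals) else vals[i]
            for i in range(half)
        ]
    return vals[0]
```

The same skeleton covers both rows of the table: `op` is addition for `sum` and `max` for `max`, with O(log n) rounds either way.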
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150452
Approved by: https://github.com/dcci, https://github.com/manuelcandales
|
2025-04-02 01:06:27 +00:00 |
|
|
1c6e88eb03
|
[MPS] Test bf16 perf of few unary and binary ops (#150382)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150382
Approved by: https://github.com/Skylion007
|
2025-04-01 13:58:20 +00:00 |
|
|
23183fef7e
|
[Test] Add simple MPS op benchmarks (#149914)
Lots of benchmark results have been posted in PRs, but they might get lost over time.
So let's create a benchmark and populate it with results (preferably from a run on a CI machine).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149914
Approved by: https://github.com/dcci, https://github.com/cyyever
|
2025-03-25 11:31:27 +00:00 |
|