|
12b02137af
|
[MPS] Add benchmark for scan operations (#156241)
Comparison of cumsum performance before and after the Metal implementation:
Previous performance (using torch==2.7.1):
```
[------------------------------- -------------------------------]
| eager | compile
1 threads: -------------------------------------------------------
cumsum-dim0-32x32 (torch.float16) | 131.0 | 136.9
cumsum-dim0-128x128 (torch.float16) | 116.9 | 121.2
cumsum-dim0-512x512 (torch.float16) | 132.5 | 151.9
cumsum-dim0-1024x1024 (torch.float16) | 150.0 | 163.0
cumsum-dim1-32x32 (torch.float16) | 125.9 | 140.9
cumsum-dim1-128x128 (torch.float16) | 116.4 | 129.4
cumsum-dim1-512x512 (torch.float16) | 135.9 | 150.1
cumsum-dim1-1024x1024 (torch.float16) | 139.5 | 154.2
cumsum-1d-100 (torch.float16) | 119.5 | 127.1
cumsum-1d-10000 (torch.float16) | 128.9 | 142.5
cumsum-1d-1000000 (torch.float16) | 140.6 | 145.6
cumsum-dim0-32x32 (torch.float32) | 115.7 | 132.5
cumsum-dim0-128x128 (torch.float32) | 118.0 | 131.5
cumsum-dim0-512x512 (torch.float32) | 138.8 | 151.6
cumsum-dim0-1024x1024 (torch.float32) | 155.5 | 164.2
cumsum-dim1-32x32 (torch.float32) | 127.2 | 141.7
cumsum-dim1-128x128 (torch.float32) | 117.7 | 130.5
cumsum-dim1-512x512 (torch.float32) | 138.2 | 152.3
cumsum-dim1-1024x1024 (torch.float32) | 144.4 | 158.6
cumsum-1d-100 (torch.float32) | 118.6 | 128.0
cumsum-1d-10000 (torch.float32) | 125.5 | 141.5
cumsum-1d-1000000 (torch.float32) | 143.9 | 158.4
cumsum-dim0-32x32 (torch.bfloat16) | 106.6 | 137.6
cumsum-dim0-128x128 (torch.bfloat16) | 118.1 | 131.0
cumsum-dim0-512x512 (torch.bfloat16) | 140.0 | 154.3
cumsum-dim0-1024x1024 (torch.bfloat16) | 153.2 | 164.4
cumsum-dim1-32x32 (torch.bfloat16) | 127.9 | 132.6
cumsum-dim1-128x128 (torch.bfloat16) | 116.5 | 129.6
cumsum-dim1-512x512 (torch.bfloat16) | 136.5 | 151.2
cumsum-dim1-1024x1024 (torch.bfloat16) | 139.8 | 144.8
cumsum-1d-100 (torch.bfloat16) | 115.7 | 129.4
cumsum-1d-10000 (torch.bfloat16) | 125.0 | 143.3
cumsum-1d-1000000 (torch.bfloat16) | 127.8 | 143.4
Times are in microseconds (us).
```
Current performance:
```
[-------------------------------- --------------------------------]
| eager | compile
1 threads: ---------------------------------------------------------
cumsum-dim0-32x32 (torch.float16) | 107.4 | 123.8
cumsum-dim0-128x128 (torch.float16) | 134.2 | 145.8
cumsum-dim0-512x512 (torch.float16) | 207.3 | 231.6
cumsum-dim0-1024x1024 (torch.float16) | 318.9 | 355.3
cumsum-dim1-32x32 (torch.float16) | 98.0 | 114.3
cumsum-dim1-128x128 (torch.float16) | 110.8 | 121.6
cumsum-dim1-512x512 (torch.float16) | 193.0 | 209.1
cumsum-dim1-1024x1024 (torch.float16) | 844.7 | 870.8
cumsum-1d-100 (torch.float16) | 108.4 | 125.0
cumsum-1d-10000 (torch.float16) | 784.7 | 852.3
cumsum-1d-1000000 (torch.float16) | 65855.2 | 66725.9
cumsum-dim0-32x32 (torch.float32) | 114.7 | 115.7
cumsum-dim0-128x128 (torch.float32) | 139.0 | 151.6
cumsum-dim0-512x512 (torch.float32) | 197.3 | 208.0
cumsum-dim0-1024x1024 (torch.float32) | 312.7 | 332.9
cumsum-dim1-32x32 (torch.float32) | 92.0 | 110.8
cumsum-dim1-128x128 (torch.float32) | 114.2 | 125.0
cumsum-dim1-512x512 (torch.float32) | 186.2 | 196.1
cumsum-dim1-1024x1024 (torch.float32) | 752.0 | 825.0
cumsum-1d-100 (torch.float32) | 112.4 | 122.0
cumsum-1d-10000 (torch.float32) | 793.5 | 863.5
cumsum-1d-1000000 (torch.float32) | 66431.8 | 66040.0
cumsum-dim0-32x32 (torch.bfloat16) | 111.6 | 121.6
cumsum-dim0-128x128 (torch.bfloat16) | 139.0 | 138.4
cumsum-dim0-512x512 (torch.bfloat16) | 217.6 | 230.1
cumsum-dim0-1024x1024 (torch.bfloat16) | 305.2 | 325.6
cumsum-dim1-32x32 (torch.bfloat16) | 100.5 | 110.9
cumsum-dim1-128x128 (torch.bfloat16) | 112.8 | 125.0
cumsum-dim1-512x512 (torch.bfloat16) | 187.8 | 208.9
cumsum-dim1-1024x1024 (torch.bfloat16) | 790.9 | 864.7
cumsum-1d-100 (torch.bfloat16) | 111.6 | 124.6
cumsum-1d-10000 (torch.bfloat16) | 778.1 | 844.9
cumsum-1d-1000000 (torch.bfloat16) | 64654.3 | 64082.5
Times are in microseconds (us).
```
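The log above only shows timings, not the kernel itself. The operation being benchmarked, cumsum, is an inclusive prefix sum, and a classic way to parallelize it is the Hillis-Steele scan. As an illustrative sketch only (plain Python emulating the parallel rounds, not the actual Metal kernel from the PR):

```python
# Illustrative sketch: a Hillis-Steele inclusive scan in plain Python.
# The PR implements the scan on the GPU in Metal; this only shows the
# algorithmic idea behind a parallel cumsum.

def hillis_steele_scan(xs):
    """Inclusive prefix sum; each round doubles the reach of partial sums."""
    out = list(xs)
    offset = 1
    while offset < len(out):
        # In a GPU kernel every element is updated in parallel;
        # here we snapshot the previous round to emulate that.
        prev = list(out)
        for i in range(offset, len(out)):
            out[i] = prev[i] + prev[i - offset]
        offset *= 2
    return out
```

For an input of length n this takes O(log n) rounds, which is why the large 1-D cases (cumsum-1d-1000000) are the most sensitive to how the scan is implemented.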
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156241
Approved by: https://github.com/malfet
|
2025-06-17 22:30:22 +00:00 |
|
|
ee97299961
|
[MPS][Testing] Benchmark reduction ops (#150452)
It compares eager vs. compile.
On my M4 Pro Mac mini I'm currently getting the following:
```
[--------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------]
| eager-512x512 | compile-512x512 | eager-1024x1024 | compile-1024x1024 | eager-2048x2048 | compile-2048x2048 | eager-4096x4096 | compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sum (torch.float32) | 121.0 | 201.5 | 130.3 | 772.3 | 179.4 | 1470.5 | 476.1 | 2980.0
max (torch.float32) | 154.1 | 165.9 | 198.7 | 211.6 | 344.2 | 386.9 | 1326.6 | 1345.6
```
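The reductions being timed here (sum and max over a 2-D tensor) are typically implemented on GPUs as tree reductions. As a hedged, dependency-free sketch of that pattern (plain Python, not the actual MPS kernel):

```python
# Illustrative sketch: pairwise (tree) reduction, the pattern GPU reduction
# kernels typically follow. Not the actual MPS implementation.

def tree_reduce(xs, op):
    """Reduce xs with a binary op by halving the array each round."""
    vals = list(xs)
    assert vals, "tree_reduce needs a non-empty input"
    while len(vals) > 1:
        half = (len(vals) + 1) // 2
        # Pair element i with element i + half; an odd leftover passes through.
        vals = [
            op(vals[i], vals[i + half]) if i + half < len(vals) else vals[i]
            for i in range(half)
        ]
    return vals[0]
```

The same skeleton covers both rows of the table: `op` is addition for `sum` and `max` for `max`, with O(log n) rounds either way.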
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150452
Approved by: https://github.com/dcci, https://github.com/manuelcandales
|
2025-04-02 01:06:27 +00:00 |
|
|
1c6e88eb03
|
[MPS] Test bf16 perf of few unary and binary ops (#150382)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150382
Approved by: https://github.com/Skylion007
|
2025-04-01 13:58:20 +00:00 |
|
|
23183fef7e
|
[Test] Add simple MPS op benchmarks (#149914)
Lots of benchmark results have been posted in PRs, but they might get lost over time.
So let's create a benchmark and populate it with results (preferably from a run on a CI machine).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149914
Approved by: https://github.com/dcci, https://github.com/cyyever
|
2025-03-25 11:31:27 +00:00 |
|