5116c49b52
[BE] Remove macos-13 guard from bench_mps_ops (#159732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159732
Approved by: https://github.com/dcci
ghstack dependencies: #159731
2025-08-03 20:53:58 +00:00
fecdebe385
[CI][MPS] Fix compile benchmark correctness (#159731)
By passing the `fullgraph=True` argument and increasing the cache size limit to 2**16.
Otherwise, the compiler might decide to fall back to eager to avoid recompilations, in which case the "compile" numbers silently measure eager code.
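For context, a minimal sketch of that pattern (the function `f` and the input shape are illustrative, not the benchmark's actual code):
```python
import torch

# Raise Dynamo's recompile budget so new input variants get recompiled
# instead of silently falling back to eager (2**16 matches the value above).
torch._dynamo.config.cache_size_limit = 2**16

def f(x):  # illustrative stand-in for a benchmarked op
    return x.cumsum(0)

# fullgraph=True makes torch.compile raise on graph breaks rather than
# partially running in eager, so "compile" timings measure compiled code.
f_compiled = torch.compile(f, fullgraph=True)
print(f_compiled(torch.randn(8)))
```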
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159731
Approved by: https://github.com/dcci
2025-08-03 20:53:50 +00:00
beed033b6e
[MPS] Fix index_kernel for large tensors (#158064)
Moved the `MetalShaderLibrary::bind_tensors` private method to OperatorUtils.h and extracted an `iter_tensor_offset` method that returns the offset, from the start of the storage, associated with a given tensor inside the iterator.
Migrated `index` and `index_put[_accumulate][_serial]` to the new paradigm, which requires neither an additional tensor for indices nor special handling for 32- vs 64-bit offsets. This resulted in an almost 2x perf gain for a 2000x2000 tensor; see the results below. Before:
```
[------------------------------------------------------------ -----------------------------------------------------------]
| 11x50x50 | 11x100x100 | 11x500x500 | 11x1000x1000 | 11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
__getitem__ (torch.int8, torch.int64) | 383.5 | 379.8 | 470.9 | 1232.9 | 4410.3
__getitem__ (torch.float16, torch.int64) | 379.6 | 354.5 | 533.2 | 1290.3 | 4442.2
__getitem__ (torch.float32, torch.int64) | 360.8 | 338.6 | 478.6 | 1348.9 | 4870.4
Times are in microseconds (us).
```
And after:
```
[------------------------------------------------------------ -----------------------------------------------------------]
| 11x50x50 | 11x100x100 | 11x500x500 | 11x1000x1000 | 11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
__getitem__ (torch.int8, torch.int64) | 349.8 | 330.5 | 432.6 | 764.5 | 1961.2
__getitem__ (torch.float16, torch.int64) | 342.5 | 330.7 | 434.7 | 741.0 | 1969.4
__getitem__ (torch.float32, torch.int64) | 332.2 | 326.1 | 445.4 | 751.3 | 1972.6
Times are in microseconds (us).
```
While migrating, also fixed `index_put_accumulate` for boolean types by using a compare_and_exchange trick over uint.
Fixes https://github.com/pytorch/pytorch/issues/153560
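As a quick illustration of the boolean accumulate path (shapes, values, and device placement are illustrative):
```python
import torch

# index_put_ with accumulate=True on bool tensors: accumulating booleans
# behaves like a logical OR, which the compare_and_exchange path preserves.
t = torch.zeros(4, dtype=torch.bool, device="mps")
idx = torch.tensor([1, 1, 3], device="mps")
src = torch.ones(3, dtype=torch.bool, device="mps")
t.index_put_((idx,), src, accumulate=True)
print(t.cpu())  # tensor([False,  True, False,  True])
```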
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158064
Approved by: https://github.com/dcci
2025-07-11 22:35:44 +00:00
43a09189c6
[MPS] Add benchmark for scan with indices (#156860)
Baseline performance on M4 Max 64GB (macOS 15.5):
```
[-------------------------------- --------------------------------]
| eager | compile
1 threads: ---------------------------------------------------------
cummin-dim0-32x32 (torch.float16) | 102.5 | 115.0
cummin-dim0-128x128 (torch.float16) | 133.6 | 147.8
cummin-dim0-512x512 (torch.float16) | 233.1 | 243.1
cummin-dim0-1024x1024 (torch.float16) | 364.2 | 385.2
cummin-dim1-32x32 (torch.float16) | 94.4 | 109.8
cummin-dim1-128x128 (torch.float16) | 109.9 | 122.5
cummin-dim1-512x512 (torch.float16) | 227.0 | 233.8
cummin-dim1-1024x1024 (torch.float16) | 985.1 | 1010.5
cummin-1d-100 (torch.float16) | 100.7 | 114.3
cummin-1d-10000 (torch.float16) | 805.0 | 879.1
cummin-1d-1000000 (torch.float16) | 70545.6 | 71310.3
cummin-dim0-32x32 (torch.float32) | 102.7 | 115.5
cummin-dim0-128x128 (torch.float32) | 137.2 | 143.8
cummin-dim0-512x512 (torch.float32) | 209.7 | 222.0
cummin-dim0-1024x1024 (torch.float32) | 340.1 | 389.9
cummin-dim1-32x32 (torch.float32) | 99.2 | 107.8
cummin-dim1-128x128 (torch.float32) | 111.9 | 119.3
cummin-dim1-512x512 (torch.float32) | 250.7 | 255.1
cummin-dim1-1024x1024 (torch.float32) | 987.9 | 1013.2
cummin-1d-100 (torch.float32) | 100.6 | 114.6
cummin-1d-10000 (torch.float32) | 794.7 | 862.2
cummin-1d-1000000 (torch.float32) | 71995.3 | 71963.5
cummin-dim0-32x32 (torch.bfloat16) | 105.9 | 113.9
cummin-dim0-128x128 (torch.bfloat16) | 135.7 | 147.9
cummin-dim0-512x512 (torch.bfloat16) | 231.9 | 240.7
cummin-dim0-1024x1024 (torch.bfloat16) | 327.7 | 366.9
cummin-dim1-32x32 (torch.bfloat16) | 91.3 | 103.3
cummin-dim1-128x128 (torch.bfloat16) | 108.5 | 117.4
cummin-dim1-512x512 (torch.bfloat16) | 222.0 | 233.6
cummin-dim1-1024x1024 (torch.bfloat16) | 936.9 | 982.5
cummin-1d-100 (torch.bfloat16) | 106.6 | 112.4
cummin-1d-10000 (torch.bfloat16) | 795.8 | 819.6
cummin-1d-1000000 (torch.bfloat16) | 68667.4 | 68557.9
Times are in microseconds (us).
```
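For reference, "scan with indices" refers to ops like `cummin`/`cummax` that return the running extrema together with their positions; a tiny illustrative example:
```python
import torch

# cummin returns the running minima plus the index at which each
# running minimum was first attained.
x = torch.tensor([3.0, 1.0, 2.0, 0.0], device="mps")
values, indices = torch.cummin(x, dim=0)
print(values.cpu())   # tensor([3., 1., 1., 0.])
print(indices.cpu())  # tensor([0, 1, 1, 3])
```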
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156860
Approved by: https://github.com/malfet
2025-06-26 18:44:16 +00:00
12b02137af
[MPS] Add benchmark for scan operations (#156241)
Comparison of cumsum performance before and after the Metal implementation:
Previous performance (using torch==2.7.1):
```
[------------------------------- -------------------------------]
| eager | compile
1 threads: -------------------------------------------------------
cumsum-dim0-32x32 (torch.float16) | 131.0 | 136.9
cumsum-dim0-128x128 (torch.float16) | 116.9 | 121.2
cumsum-dim0-512x512 (torch.float16) | 132.5 | 151.9
cumsum-dim0-1024x1024 (torch.float16) | 150.0 | 163.0
cumsum-dim1-32x32 (torch.float16) | 125.9 | 140.9
cumsum-dim1-128x128 (torch.float16) | 116.4 | 129.4
cumsum-dim1-512x512 (torch.float16) | 135.9 | 150.1
cumsum-dim1-1024x1024 (torch.float16) | 139.5 | 154.2
cumsum-1d-100 (torch.float16) | 119.5 | 127.1
cumsum-1d-10000 (torch.float16) | 128.9 | 142.5
cumsum-1d-1000000 (torch.float16) | 140.6 | 145.6
cumsum-dim0-32x32 (torch.float32) | 115.7 | 132.5
cumsum-dim0-128x128 (torch.float32) | 118.0 | 131.5
cumsum-dim0-512x512 (torch.float32) | 138.8 | 151.6
cumsum-dim0-1024x1024 (torch.float32) | 155.5 | 164.2
cumsum-dim1-32x32 (torch.float32) | 127.2 | 141.7
cumsum-dim1-128x128 (torch.float32) | 117.7 | 130.5
cumsum-dim1-512x512 (torch.float32) | 138.2 | 152.3
cumsum-dim1-1024x1024 (torch.float32) | 144.4 | 158.6
cumsum-1d-100 (torch.float32) | 118.6 | 128.0
cumsum-1d-10000 (torch.float32) | 125.5 | 141.5
cumsum-1d-1000000 (torch.float32) | 143.9 | 158.4
cumsum-dim0-32x32 (torch.bfloat16) | 106.6 | 137.6
cumsum-dim0-128x128 (torch.bfloat16) | 118.1 | 131.0
cumsum-dim0-512x512 (torch.bfloat16) | 140.0 | 154.3
cumsum-dim0-1024x1024 (torch.bfloat16) | 153.2 | 164.4
cumsum-dim1-32x32 (torch.bfloat16) | 127.9 | 132.6
cumsum-dim1-128x128 (torch.bfloat16) | 116.5 | 129.6
cumsum-dim1-512x512 (torch.bfloat16) | 136.5 | 151.2
cumsum-dim1-1024x1024 (torch.bfloat16) | 139.8 | 144.8
cumsum-1d-100 (torch.bfloat16) | 115.7 | 129.4
cumsum-1d-10000 (torch.bfloat16) | 125.0 | 143.3
cumsum-1d-1000000 (torch.bfloat16) | 127.8 | 143.4
Times are in microseconds (us).
```
Current performance:
```
[-------------------------------- --------------------------------]
| eager | compile
1 threads: ---------------------------------------------------------
cumsum-dim0-32x32 (torch.float16) | 107.4 | 123.8
cumsum-dim0-128x128 (torch.float16) | 134.2 | 145.8
cumsum-dim0-512x512 (torch.float16) | 207.3 | 231.6
cumsum-dim0-1024x1024 (torch.float16) | 318.9 | 355.3
cumsum-dim1-32x32 (torch.float16) | 98.0 | 114.3
cumsum-dim1-128x128 (torch.float16) | 110.8 | 121.6
cumsum-dim1-512x512 (torch.float16) | 193.0 | 209.1
cumsum-dim1-1024x1024 (torch.float16) | 844.7 | 870.8
cumsum-1d-100 (torch.float16) | 108.4 | 125.0
cumsum-1d-10000 (torch.float16) | 784.7 | 852.3
cumsum-1d-1000000 (torch.float16) | 65855.2 | 66725.9
cumsum-dim0-32x32 (torch.float32) | 114.7 | 115.7
cumsum-dim0-128x128 (torch.float32) | 139.0 | 151.6
cumsum-dim0-512x512 (torch.float32) | 197.3 | 208.0
cumsum-dim0-1024x1024 (torch.float32) | 312.7 | 332.9
cumsum-dim1-32x32 (torch.float32) | 92.0 | 110.8
cumsum-dim1-128x128 (torch.float32) | 114.2 | 125.0
cumsum-dim1-512x512 (torch.float32) | 186.2 | 196.1
cumsum-dim1-1024x1024 (torch.float32) | 752.0 | 825.0
cumsum-1d-100 (torch.float32) | 112.4 | 122.0
cumsum-1d-10000 (torch.float32) | 793.5 | 863.5
cumsum-1d-1000000 (torch.float32) | 66431.8 | 66040.0
cumsum-dim0-32x32 (torch.bfloat16) | 111.6 | 121.6
cumsum-dim0-128x128 (torch.bfloat16) | 139.0 | 138.4
cumsum-dim0-512x512 (torch.bfloat16) | 217.6 | 230.1
cumsum-dim0-1024x1024 (torch.bfloat16) | 305.2 | 325.6
cumsum-dim1-32x32 (torch.bfloat16) | 100.5 | 110.9
cumsum-dim1-128x128 (torch.bfloat16) | 112.8 | 125.0
cumsum-dim1-512x512 (torch.bfloat16) | 187.8 | 208.9
cumsum-dim1-1024x1024 (torch.bfloat16) | 790.9 | 864.7
cumsum-1d-100 (torch.bfloat16) | 111.6 | 124.6
cumsum-1d-10000 (torch.bfloat16) | 778.1 | 844.9
cumsum-1d-1000000 (torch.bfloat16) | 64654.3 | 64082.5
Times are in microseconds (us).
```
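A minimal sketch of how such an eager-vs-compile grid can be produced with `torch.utils.benchmark` (labels, shapes, and dtypes are illustrative, not the benchmark's actual code):
```python
import torch
from torch.utils.benchmark import Timer, Compare

results = []
for n in (32, 128, 512, 1024):
    x = torch.randn(n, n, device="mps", dtype=torch.float16)
    for mode, fn in (("eager", torch.cumsum),
                     ("compile", torch.compile(torch.cumsum, fullgraph=True))):
        t = Timer(stmt="fn(x, 0)", globals={"fn": fn, "x": x},
                  label="scan", sub_label=f"cumsum-dim0-{n}x{n}",
                  description=mode)
        results.append(t.blocked_autorange())

# Collate measurements into an eager-vs-compile table like the ones above.
Compare(results).print()
```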
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156241
Approved by: https://github.com/malfet
2025-06-17 22:30:22 +00:00
ee97299961
[MPS][Testing] Benchmark reduction ops (#150452)
The benchmark compares eager vs. compile.
On my M4 Pro Mac mini I'm getting the following now:
```
[--------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------]
| eager-512x512 | compile-512x512 | eager-1024x1024 | compile-1024x1024 | eager-2048x2048 | compile-2048x2048 | eager-4096x4096 | compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sum (torch.float32) | 121.0 | 201.5 | 130.3 | 772.3 | 179.4 | 1470.5 | 476.1 | 2980.0
max (torch.float32) | 154.1 | 165.9 | 198.7 | 211.6 | 344.2 | 386.9 | 1326.6 | 1345.6
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150452
Approved by: https://github.com/dcci, https://github.com/manuelcandales
2025-04-02 01:06:27 +00:00
1c6e88eb03
[MPS] Test bf16 perf of a few unary and binary ops (#150382)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150382
Approved by: https://github.com/Skylion007
2025-04-01 13:58:20 +00:00
23183fef7e
[Test] Add simple MPS op benchmarks (#149914)
Lots of benchmark results have been posted in PRs, but they might get lost over time.
So let's create a benchmark and populate it with results (preferably from a run on a CI machine).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149914
Approved by: https://github.com/dcci, https://github.com/cyyever
2025-03-25 11:31:27 +00:00