737 Commits

Author SHA1 Message Date
653c52fe52 [MPS] Fix batch norm incorrect gradient (#156867)
Fixes #156555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156867
Approved by: https://github.com/malfet
2025-06-25 23:05:49 +00:00
20a74c370b Add error message with assert to topK if ndims() - dim > 4 (#155475)
Addressing #154890

Not really a proper fix but at least it's more informative than the current crash.

For a longer-term solution, I'm testing whether we can use the TopK API released in MacOS 14, as it does not have the same MPSScan op issue that Sort and ArgSort are hitting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155475
Approved by: https://github.com/kulinseth
2025-06-13 21:10:06 +00:00
dd41a3907c [MPS] Fix unary/binary ops for 2**32+ elem tensors (#155183)
By using `TensorIterator::with_32bit_indexing()` primitive

Add `bind_tensors` helper function that correctly sets up MPS tensors originating from TensorIterator

TODO: Add comments to bind_tensors as well as a unit test, based on
```
python  -c "import torch;print((torch.rand(1, 1024, 1024, dtype=torch.bfloat16, device='mps') + torch.rand(5000, 1, 1, dtype=torch.bfloat16, device='mps')).sin())"
```

Fixes https://github.com/pytorch/pytorch/issues/154828
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155183
Approved by: https://github.com/cyyever, https://github.com/dcci, https://github.com/Skylion007
ghstack dependencies: #155150, #155178, #155184
2025-06-05 18:57:14 +00:00
9a4c08ddfc [MPS] Parametrize test_scaled_dot_product_attention_autocast (#155005)
Also moving comments inside the function scope for some of my previous regression tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155005
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-06-05 13:24:53 +00:00
9cdce682a1 [MPS][BE] Reimplement log1p as Metal shader (#154936)
That should make it faster than the MPSGraph implementation, but it also
improves accuracy for small inputs by using the algorithm described in [What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html#1202), i.e. $\log(1+x) = \frac{x \cdot \log(1+x)}{(1+x) - 1}$ if $1+x \neq 1$, else just $x$ (here $1+x$ is evaluated in floating point, so the division cancels the rounding error introduced when forming it)

Also tried using the first 3 terms of the Taylor series in Horner form, which also seems to work fine, i.e. $\log(1+x) \approx x \cdot (1 - x(\frac{1}{2} - \frac{x}{3}))$
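A minimal Python sketch of both formulas above, evaluated in float32 to mimic single-precision arithmetic (illustrative only; the actual kernel is written in Metal):
```python
import numpy as np

def log1p_goldberg(x: np.float32) -> np.float32:
    w = np.float32(1.0) + x                 # rounded 1 + x
    if w == np.float32(1.0):                # x is below half an ulp of 1.0
        return x
    # the division cancels the rounding error introduced when forming w
    return x * np.log(w) / (w - np.float32(1.0))

def log1p_taylor(x: np.float32) -> np.float32:
    # first three Taylor terms in Horner form, good for small |x|
    half, third = np.float32(0.5), np.float32(1.0 / 3.0)
    return x * (np.float32(1.0) - x * (half - x * third))

x = np.float32(3e-4)
# compare against a double-precision reference
print(log1p_goldberg(x), log1p_taylor(x), np.log1p(np.float64(x)))
```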

Replaced the less accurate log1p implementation in `c10/metal/special_math.h` with the generic one.

Parametrize and modify regression test to check for accuracy of small values

TODOs:
 - Do a proper implementation for complex values as well, perhaps using 0408ba0a76/mlx/backend/metal/kernels/utils.h (L339)
 - Maybe implement it using the Remez-like algorithm documented here: 207f3b2b25/lib/msun/src/s_log1pf.c (L37)
 - Or use llvm's implementation from f393986b53/libclc/clc/lib/generic/math/clc_log1p.inc (L22)
 - Benchmark which algorithm is faster and delivers better accuracy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154936
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-06-03 14:10:13 +00:00
981bdb39ca Enable ConvTranspose3D for FP32 and Complex64 (#154696)
Fixes #154615

Enables ConvTranspose3D, since support seems to exist on both MacOS 14 and 15.

For the half dtypes, the discrepancy between the CPU and GPU implementations is too large to conclude whether there is a bug in the implementation without a more rigorous study of the bounds on the expected error. So they are left unsupported for now, and an assert is added to notify the user if the op is called with fp16 or bf16 inputs.

Tests for ConvTranspose3D were enabled for the supported data types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154696
Approved by: https://github.com/malfet
2025-06-02 16:24:03 +00:00
41092cb86c [MPS] index copy impl (#154326)
Second most requested op according to #154052
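A quick usage sketch of the newly supported op on MPS (values are illustrative):
```python
import torch

x = torch.zeros(5, 3, device="mps")
t = torch.arange(1.0, 7.0, device="mps").reshape(2, 3)
index = torch.tensor([0, 4], device="mps")
# copy the rows of t into rows 0 and 4 of x
print(x.index_copy(0, index, t))
```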

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154326
Approved by: https://github.com/malfet
2025-05-29 16:57:43 +00:00
7ae204c3b6 [BE][CI][Easy] Run lintrunner on generated .pyi stub files (#150732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150732
Approved by: https://github.com/malfet, https://github.com/cyyever, https://github.com/aorenste
2025-05-27 14:58:02 +00:00
975bbc63db [MPS][BE] Move fmod/remainder to Metal ops (#154280)
This accomplishes the following:
 - Fixes a correctness problem with large integer types (though it probably makes the op slower, this could not be avoided if one wants an accurate answer; see the sketch after this list)
 - Makes the op faster for floating-point types (a Metal kernel invocation is faster than creating an MPSGraph)
 - Eliminates the need for several correctness workarounds
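A small illustration of the large-integer case (a sketch; the operands are chosen above 2**53, where computing the remainder through float64 would lose precision):
```python
import torch

a = torch.tensor([2**62 + 3], device="mps")
b = torch.tensor([10], device="mps")
# both should print tensor([7], device='mps:0') once the op is computed in integer math
print(torch.fmod(a, b), torch.remainder(a, b))
```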

Fixes https://github.com/pytorch/pytorch/issues/154171
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154280
Approved by: https://github.com/dcci
ghstack dependencies: #154275, #154290
2025-05-24 01:45:33 +00:00
633ed01145 [MPS] Add support for two more isin variants (#154010)
`isin_Tensor_Scalar_out` is just a redispatch to eq/neq
`isin_Scalar_Tensor_out` redispatches back to generic `isin` op, but needs a small tweak to handle float scalars
Make sure that `out` is resized to the expected size in `isin_Tensor_Tensor_out_mps`

Add unit tests to validate this, but skip them on MacOS-13, where the MPS op just returns garbage

Before this change both of those failed
```python
>>> import torch
>>> t = torch.tensor([0, 1, 2], device='mps')
>>> torch.isin(t, 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: The operator 'aten::isin.Tensor_Scalar_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
>>> torch.isin(1, t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NotImplementedError: The operator 'aten::isin.Scalar_Tensor_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
```
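After this change, both dispatches are expected to succeed; a quick sketch of the intended behavior (outputs shown assume the eq/neq redispatch described above):
```python
import torch

t = torch.tensor([0, 1, 2], device="mps")
print(torch.isin(t, 1))  # tensor([False,  True, False], device='mps:0')
print(torch.isin(1, t))  # tensor(True, device='mps:0')
```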

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154010
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/manuelcandales
ghstack dependencies: #153970, #153971, #153997
2025-05-22 17:59:35 +00:00
d5ddc5ab20 [MPS] Fix float64 scalar tensor handling (#153582)
The current implementation causes a silent correctness problem with torch.compile when someone compiles a function where one of the arguments is, say, `np.exp(.3)`, which will be represented as a torch.float64 scalar tensor
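An illustrative sketch of the failure mode (hedged; the exact repro may differ, and the function name here is hypothetical):
```python
import numpy as np
import torch

@torch.compile
def scale(x, s):
    return x * s

x = torch.rand(4, device="mps")
# np.exp(.3) is a NumPy float64 scalar, traced as a torch.float64 scalar tensor;
# before this fix the MPS result could silently be wrong
print(scale(x, np.exp(0.3)))
```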

Add regression test for this behavior
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153582
Approved by: https://github.com/dcci
2025-05-15 05:15:14 +00:00
8749fe8439 [CI][MPS] Speedup test_large_bmm (#153562)
By computing matmuls of only one random non-zero batch on CPU

This reduces test runtime from 11 minutes to 14 sec
```
 % python3 test/test_mps.py -v -k test_large_bmm_
test_large_bmm_bfloat16 (__main__.TestMPS.test_large_bmm_bfloat16) ... ok
test_large_bmm_float16 (__main__.TestMPS.test_large_bmm_float16) ... ok

----------------------------------------------------------------------
Ran 2 tests in 27.495s

```

TODO: Compute it over two slices when https://github.com/pytorch/pytorch/issues/153560 is fixed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153562
Approved by: https://github.com/Skylion007, https://github.com/clee2000
2025-05-14 18:49:42 +00:00
56492bfcb9 [MPS] SDPA specialized kernels (#152781)
Partially fixes #139668 and #152550

Still work in progress. Following needs to be addressed:
- [x] Some tests are failing and need to check why and bugfix
- [x] Benchmark the new kernels and add results to this PR for varying sequence lengths and head dimensions (the ones that get dispatched to the kernels)
- [x] Add tests to cover the specialized paths (if applicable)
- [x] Code cleanup

**Tested on Macbook M1 Pro**
### Vector Fast Path (q_len=1, k_len=256)
- Old: 0.378 ms
- New: 0.260 ms
- **31.2% speed improvement**

### Vector 2-pass (q_len=1, k_len=4096)
- Old: 0.627 ms
- New: 0.370 ms
- **41.0% speed improvement**

### Vector Fast Path (q_len=8, k_len=256)
- Old: 0.545 ms
- New: 0.322 ms
- **40.9% speed improvement**

### Vector 2-pass (q_len=8, k_len=4096)
- Old: 1.318 ms
- New: 1.057 ms
- **19.8% speed improvement**

Script to get perf:
```python
import torch
import time

def benchmark_sdpa(config, iterations=100):
    device = config.get("device", "cpu")
    batch = config["batch"]
    heads = config["heads"]
    q_len = config["q_len"]
    k_len = config["k_len"]
    head_dim = config["head_dim"]

    q = torch.randn(batch, heads, q_len, head_dim, device=device, dtype=torch.float32)
    k = torch.randn(batch, heads, k_len, head_dim, device=device, dtype=torch.float32)
    v = torch.randn(batch, heads, k_len, head_dim, device=device, dtype=torch.float32)

    for _ in range(5):
        _ = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        if device == "mps":
            torch.mps.synchronize()

    total_time = 0.0
    for i in range(iterations):
        start = time.perf_counter()
        _ = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        if device == "mps":
            torch.mps.synchronize()
        end = time.perf_counter()
        total_time += end - start

    avg_time = total_time / iterations
    print(f"[{config['name']}] Avg time per run: {avg_time * 1000:.3f} ms over {iterations} iterations")
    return avg_time

def main():
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    print(f"Running benchmarks on device: {device}")

    benchmarks = [
        {
            "name": "Vector Fast - Small q_len & moderate k_len",
            "batch": 1,
            "heads": 8,
            "q_len": 1,      # small query sequence length triggers vector fast path
            "k_len": 256,    # moderate key length
            "head_dim": 64,
            "device": device,
        },
        {
            "name": "Vector 2-pass - Small q_len & long k_len",
            "batch": 1,
            "heads": 8,
            "q_len": 1,      # small query sequence length
            "k_len": 4096,   # long key length triggers the 2-pass variant
            "head_dim": 64,
            "device": device,
        },
        # {
        #     "name": "Full Attention - Moderate q_len/k_len",
        #     "batch": 1,
        #     "heads": 8,
        #     "q_len": 128,    # longer query sequence length
        #     "k_len": 8192,    # matching key length for full attention paths
        #     "head_dim": 64,
        #     "device": device,
        # },
        # {
        #     "name": "Full Attention - Longer q_len/k_len",
        #     "batch": 1,
        #     "heads": 8,
        #     "q_len": 128,    # very long sequence length
        #     "k_len": 8192,
        #     "head_dim": 64,
        #     "device": device,
        # },
    ]

    iterations = 100
    for config in benchmarks:
        benchmark_sdpa(config, iterations=iterations)

if __name__ == "__main__":
    main()

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152781
Approved by: https://github.com/malfet
2025-05-07 00:40:11 +00:00
0ffd31dc8a [MPS] Migrate div rounding modes (#152758)
By implementing `div_floor` and `div_trunc`. Do not mark `div_trunc` as OPMATH, to align the following output with CPU (if the division is performed in fp32, the result will be truncated to 25):
```
import torch
print(torch.tensor([[-7.4688, -3.1289]], dtype=torch.float16,device="cpu").div(torch.tensor([-0.2988, -0.8789], dtype=torch.bfloat16,device="cpu"), rounding_mode="trunc"))
tensor([[24.,  3.]])
```
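For reference, a small sketch of the two rounding modes the new kernels implement (standard floor vs. trunc semantics):
```python
import torch

a = torch.tensor([7.0, -7.0], device="mps")
b = torch.tensor([2.0, 2.0], device="mps")
print(torch.div(a, b, rounding_mode="floor"))  # tensor([ 3., -4.], device='mps:0')
print(torch.div(a, b, rounding_mode="trunc"))  # tensor([ 3., -3.], device='mps:0')
```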

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152758
Approved by: https://github.com/dcci
ghstack dependencies: #152663, #152515, #152737, #152743
2025-05-05 03:02:29 +00:00
99c42722f6 [MPS] fix memory leak in sdpa float32 (#152371)
Fixes #152344

The leak seems to be on the MPSGraph side; even though there is an identity tensor, it seems it is no longer enough to bypass the SDPA sequence, which appears to leak memory.

Even adding 0.0f seems to be optimized away and still takes the SDPA sequence (that's the reason for adding 1e-20)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152371
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-29 04:51:10 +00:00
899eec665c [MPS] col2im kernel implementation (#152282)
Fixes #151820
Also requested in #141287

Mainly based on the CUDA kernel implementation
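Since col2im is what backs `torch.nn.functional.fold`, a quick MPS usage sketch (shapes are illustrative):
```python
import torch
import torch.nn.functional as F

# 4 sliding 2x2 blocks of a 3-channel 3x3 image, folded back together
cols = torch.rand(1, 3 * 2 * 2, 4, device="mps")
out = F.fold(cols, output_size=(3, 3), kernel_size=(2, 2))
print(out.shape)  # torch.Size([1, 3, 3, 3])
```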

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152282
Approved by: https://github.com/malfet
2025-04-28 03:48:41 +00:00
3ef6d6924a [BE] Switch TestConsistency to MPS device (#147893)
Which will eventually allow moving decorators away from `common_mps.py`

Adjust tolerances accordingly. XFAIL a bunch of tests on MacOS-13, which is going to be deprecated anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147893
Approved by: https://github.com/atalman
ghstack dependencies: #152204
2025-04-26 01:19:21 +00:00
5e9bdc9b86 [MPS] layernorm forward kernel (#152010)
Implements layernorm forward pass as a metal kernel instead of MPSGraph ops. Speed ups are indicated on the chart below:
![Figure_1](https://github.com/user-attachments/assets/27a4d2ef-b3e4-4650-9ce3-b939c080321e)

Script for generating timings; one needs to build torch with the old/new codebase and then run this with a different file name, indicated at the end of the script:
```python
import csv
import time

import numpy as np

import torch
import torch.nn.functional as F

matrix_sizes = [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
batch_sizes = [1]
elementwise_affine = [False, True]
num_runs = 50
warmup_runs = 3

def create_input_tensor(n, batch_size):
    torch.manual_seed(42)
    return torch.randn(batch_size, n, dtype=torch.float32)

def run_layer_norm(A, normalized_shape, elementwise_affine):
    # create affine parameters when requested so the elementwise_affine flag actually takes effect
    weight = torch.ones(normalized_shape, device=A.device) if elementwise_affine else None
    bias = torch.zeros(normalized_shape, device=A.device) if elementwise_affine else None
    torch.mps.synchronize()
    start = time.perf_counter()
    out = F.layer_norm(A, normalized_shape, weight, bias)
    torch.mps.synchronize()
    end = time.perf_counter()
    return out, end - start

results = {"N": [], "elementwise_affine": [], "batch_size": [], "mean_time": [], "std_time": []}

for el_aff in elementwise_affine:
    for n in matrix_sizes:
        for batch_size in batch_sizes:
            print(f"\nBenchmarking LayerNorm for input size N={n}, batch_size={batch_size}, elementwise_affine={el_aff}")

            try:
                A_cpu = create_input_tensor(n, batch_size)
                A_mps = A_cpu.to("mps")

                normalized_shape = (n,)

                for _ in range(warmup_runs):
                    _, _ = run_layer_norm(A_mps, normalized_shape, el_aff)

                times = []
                for _ in range(num_runs):
                    _, t = run_layer_norm(A_mps, normalized_shape, el_aff)
                    times.append(t)

                mean_time = np.mean(times)
                std_time = np.std(times)

                results["N"].append(n)
                results["elementwise_affine"].append(el_aff)
                results["batch_size"].append(batch_size)
                results["mean_time"].append(mean_time)
                results["std_time"].append(std_time)

                print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

            except RuntimeError as e:
                print(f"Error for N={n}, batch_size={batch_size}: {e}")
                continue

with open("layernorm_benchmark_times_new.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["N", "elementwise_affine", "batch_size", "mean_time", "std_time"])
    for i in range(len(results["N"])):
        writer.writerow(
            [
                results["N"][i],
                results["elementwise_affine"][i],
                results["batch_size"][i],
                results["mean_time"][i],
                results["std_time"][i],
            ]
        )

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152010
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-24 05:07:46 +00:00
3aecf2dc52 [MPS] Extend index_put to half precision floats (#151869)
By reusing `c10/metal/atomic.h`
This also fixes `GPUTests.test_index_put_fallback[12]_mps`, which is unrolled by inductor, so there is no need for dedicated atomic_add support
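A usage sketch of what the atomic path enables (accumulating duplicate indices in fp16; values are illustrative):
```python
import torch

x = torch.zeros(4, device="mps", dtype=torch.float16)
idx = torch.tensor([0, 1, 1, 3], device="mps")
src = torch.ones(4, device="mps", dtype=torch.float16)
# duplicate index 1 requires an atomic add on half-precision values
x.index_put_((idx,), src, accumulate=True)
print(x)  # tensor([1., 2., 0., 1.], device='mps:0', dtype=torch.float16)
```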

TODOs:
 - Get rid of indexing kernel and compute it directly when kernel is run
 - Simulate atomic_add for int64 types as series of int32 atomic-add-and-fetch
 - Set up tolerances correctly to pass float16/bfloat16 tests (as CPU always takes the sequential strategy)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151869
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-04-22 22:00:08 +00:00
fbd29527d8 [MPS] Move ops modifiers to testing utils so other tests can reuse (#151781)
Test collection check:
```
python -m pytest test/test_mps.py --collect-only
```
Before:
```
6390 tests collected in 8.34s
```

After:
```
6390 tests collected in 7.71s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151781
Approved by: https://github.com/malfet
2025-04-22 19:19:52 +00:00
f37e138bc4 [MPS] Enable log1p and sigmoid for int64 (#151791)
It works on MacOS-15, but likely will need a skip for MacOS-13

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151791
Approved by: https://github.com/Skylion007
ghstack dependencies: #151790
2025-04-21 18:30:04 +00:00
470132c6a1 [MPS] Add support for hermite_polynomial_he (inductor/eager). (#151754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151754
Approved by: https://github.com/malfet, https://github.com/jansel
2025-04-20 17:44:40 +00:00
14293c2377 [MPS] Allow isin for mixed types (#151600)
Follow the pattern set by the CPU and CUDA impls: define a common_dtype and optionally cast `elements` and `test_elements` to the common dtype if needed (see the mixed-dtype sketch after the list below)

- Add a regression test, though skip it on MacOS-13, as `isin` seems to produce garbage there even for the same dtypes:
```
>>> import torch
>>> x=torch.arange(4.0, device='mps')
>>> y=torch.arange(1.0, 3.0, device='mps')
>>> x, y, torch.isin(x, y), torch.isin(y, x)
(tensor([0., 1., 2., 3.], device='mps:0'), tensor([1., 2.], device='mps:0'), tensor([False,  True, False, False], device='mps:0'), tensor([False, False], device='mps:0'))
>>> torch.__version__
'2.6.0'
```
- Clean up the code a bit
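A sketch of the mixed-dtype case this change allows (int64 elements against float32 test elements; both get cast to a common dtype before the comparison):
```python
import torch

x = torch.arange(4, device="mps")            # int64
y = torch.arange(1.0, 3.0, device="mps")     # float32
print(torch.isin(x, y))  # tensor([False,  True,  True, False], device='mps:0')
```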

Fixes https://github.com/pytorch/pytorch/issues/151443
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151600
Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/kulinseth
2025-04-18 12:30:32 +00:00
1ffaa00ad7 [MPS] Migrate bitwise_not to unary operator (#151460)
That kills two birds with one stone:
 - Makes the implementation more standardized (and faster for strided inputs/outputs)
 - Fixes a bug in strided in-place bitwise_not

I.e. before this change
```python
import torch
x=torch.arange(32, device="mps")
x[::2].bitwise_not_()
print(x)
```
produced
```
tensor([ -1,  -2,  -3,  -4,  -5,  -6,  -7,  -8,  -9, -10, -11, -12, -13, -14,
        -15, -16,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
         28,  29,  30,  31], device='mps:0')
```
after, it generates reasonable output
```
tensor([ -1,   1,  -3,   3,  -5,   5,  -7,   7,  -9,   9, -11,  11, -13,  13,
        -15,  15, -17,  17, -19,  19, -21,  21, -23,  23, -25,  25, -27,  27,
        -29,  29, -31,  31], device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151460
Approved by: https://github.com/dcci, https://github.com/qqaatw, https://github.com/Skylion007
2025-04-16 21:34:45 +00:00
b8a2824755 [MPS] Fix logit output for half/bfloat (#151282)
Which also fixes MPSInductor pointwise test
TODO (as follow-up PRs): get rid of the special native_functions.yaml dispatches and use a stub
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151282
Approved by: https://github.com/dcci
ghstack dependencies: #151224, #151246, #151272
2025-04-15 06:25:00 +00:00
ddfc14b3ae [MPS] Fix where (#151176)
Fixes #150967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151176
Approved by: https://github.com/kulinseth, https://github.com/malfet
2025-04-13 20:44:50 +00:00
bc47d539fc [MPS] Support ArgumentBuffer bindings from C++/Python (#150780)
To work around the limitation of 32 arguments per kernel and to eventually be able to compile something like
```python
import torch

def foo(*args):
  rc = torch.empty_like(args[0])
  for arg in args:
      rc += arg
  return rc

tensors = torch.rand(100, 32, device='mps').unbind(0)
print(torch.compile(foo)(*tensors))
```

For now, introduce `at::native::metal::get_tensor_gpu_address` and use it from both the C++ test and compile_shader to convert a list of tensors to a list of pointers valid on the GPU.

Initially this binding was done via `id<MTLArgumentEncoder>`, but according to the [Improving CPU Performance by Using Argument Buffers](https://developer.apple.com/documentation/metal/improving-cpu-performance-by-using-argument-buffers?language=objc#Encode-Resources-into-Argument-Buffers) article, this is not necessary when targeting Tier2-only devices (which is true of all devices on MacOS-13 or newer):
> To directly encode the argument buffer resources on these Tier 2 devices, write the [MTLBuffer](https://developer.apple.com/documentation/metal/mtlbuffer?language=objc).[gpuAddress](https://developer.apple.com/documentation/metal/mtlbuffer/gpuaddress?language=objc) property — and for other resource types (samplers, textures, and acceleration structures), the [gpuResourceID](https://developer.apple.com/documentation/metal/mtlcomputepipelinestate/gpuresourceid?language=objc) property — into the corresponding structure member. To encode offsets, treat these property values as uint64 types and add the offset to them.

Add both C++ and Python unit tests that validate that this works.
Please note that using either the ArgumentEncoder or directly encoding the data does not guarantee the buffer will not be freed before shader execution is complete. On the other hand, this should already be guaranteed by the MPSCachingAllocator, which only frees the memory after all streams have completed execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150780
Approved by: https://github.com/dcci
2025-04-09 04:24:37 +00:00
49f6cce736 [MPS] grad scaler (#150255)
Fixes #142397

Basic implementation is done. What's left:
- [x] Different dtype/device tensors in the TensorList
- [x] fast path for grouping the foreach kernel
- [x] Tests
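A minimal end-to-end sketch of the scaler flow this enables (model and optimizer are illustrative, and it assumes `torch.amp.GradScaler` accepts "mps" as its device argument after this change):
```python
import torch

model = torch.nn.Linear(8, 8).to("mps")
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.amp.GradScaler("mps")

x = torch.randn(4, 8, device="mps")
with torch.autocast(device_type="mps", dtype=torch.float16):
    loss = model(x).square().mean()

scaler.scale(loss).backward()   # scale the loss to avoid fp16 gradient underflow
scaler.step(opt)                # unscale gradients and step only if they are finite
scaler.update()                 # adjust the scale factor for the next iteration
```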

Regarding tests, I found some tests for GradScaler in `test/test_torch.py`, but I couldn't figure out the best way to enable them for the MPS device.

By removing `@onlyNativeDeviceTypes`, one enables the tests for MPS but also enables tests for all other devices which are not included in the native device types. If I put:
`instantiate_device_type_tests(TestTorchDeviceType, globals(), allow_mps=True)`

This enables lots of tests in that class for MPS which were not(?) being tested before? This part needs some clarification

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150255
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-06 17:06:55 +00:00
cfea55dbec [MPS] fix inverse bug for N>1024 (#146754)
Fixes #138200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146754
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-04-05 21:49:21 +00:00
7ac8186851 [MPSInductor] Speedup sum/prod reductions (#150566)
By using cooperative `simd_sum`/`simd_product` instead of a C-style for loop for threadgroup reductions. This also allows a significant reduction in the amount of shared memory needed to perform those reductions

Using such a reduction increases the `torch.compile` performance for gpt-fast using `stories110M` from 29 tokens/sec to 630 tokens/sec on M4 and changes the perf of torch.rand as follows:
|size| before | after |
|------------------------|------------|-------------|
| 512x512         | 202.1       | 131.8       |
| 1024x1024   |   780.6    | 176.9       |
| 2048x2048    |   1423.4       | 339.9      |
| 4096x4097    |    2982.2 | 1047.2      |

Unfortunately, none of the SIMD-group operations are available for 64-bit integers, but one can simulate the behavior using `simd_shuffle_down` of 64-bit values represented as `int2` types, which yields a reduction in $\log_2(\text{threadgroup\_size})$ steps. [`mlx/backend/metal/kernels/reduction/ops.h`](86389bf970/mlx/backend/metal/kernels/reduction/ops.h (L15-L18)) contains an implementation of such an algorithm, but alas it yields wrong results on M1/M2 (and maybe M3) machines if not all threads in the simdgroup are active, which can be observed by running
```python
import torch
lib=torch.mps.compile_shader("""
kernel void do_sum(device int* out, constant int* in, uint idx [[thread_position_in_grid]]) {
  out[idx] = metal::simd_shuffle_down(in[idx], 8);
}
""")
x=torch.arange(22, device='mps', dtype=torch.int32)
y=torch.empty_like(x)
lib.do_sum(y, x)
print(y)
```
that returns following on M4
```
tensor([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,  0,  0,  0,  0, 0,  0,  0,  0], device='mps:0', dtype=torch.int32)
```
but same kernel running on M1 returns
```
tensor([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 14, 15, 16, 17, 18, 19, 20, 21], device='mps:0', dtype=torch.int32)
```
This discrepancy in behavior can be addressed by using `simd_shuffle_and_fill_down`, but any kernel using `simd_shuffle_and_fill_down` causes an internal compiler error on MacOS-13.2. Considering that OS will be EOL soon, skip the offending tests.
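A device-independent sketch of the $\log_2$-step shuffle-down reduction pattern described above (plain Python for illustration; on the GPU each list element corresponds to a SIMD-group lane and each loop iteration to one `simd_shuffle_down` call):
```python
def simd_sum(vals):
    # assumes len(vals) is a power of two, e.g. a 32-wide SIMD group
    vals = list(vals)
    step = len(vals) // 2
    while step:
        for lane in range(step):
            vals[lane] += vals[lane + step]   # lane reads from lane + step
        step //= 2
    return vals[0]

print(simd_sum(range(32)))  # 496 == sum(range(32)), reached in log2(32) = 5 steps
```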

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150566
Approved by: https://github.com/manuelcandales
ghstack dependencies: #150452, #150457
2025-04-05 02:47:27 +00:00
827b730f4e [CI] Skip test_copy_large_tensor on M2-15 runners (#150377)
They have more than 12 GB of memory, but maybe running this test causes OOM in CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150377
Approved by: https://github.com/atalman
2025-04-01 02:33:43 +00:00
b48505a8a1 [MPS] Add support for hermite_polynomial_h. (#150279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150279
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-03-31 23:30:19 +00:00
7c65911b11 [MPS] Fix dot/mm for conj_tensors (#150157)
- Distinguish between conjugated/non_conjugated inputs by appending conjugation to the operator key
- For matmul or dot, add `conjugateWithTensor:name:` calls before running the op
- Enable testing for conjugated ops by passing `include_conjugated_inputs` to opinfo
- Filter  `include_conjugated_inputs` argument from `sample_inputs_window` (probably should have landed as separate PR)
- Preserve the conj property when gathering the views, which fixes the `cov` operator (see the usage sketch after this list)
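A usage sketch of the conjugated path this fixes (a complex dot product whose first operand carries the conj bit; the MPS result should now match CPU):
```python
import torch

x = torch.randn(3, dtype=torch.complex64, device="mps")
y = torch.randn(3, dtype=torch.complex64, device="mps")
# x.conj() is a lazy view with the conjugate bit set
print(torch.dot(x.conj(), y))
print(torch.dot(x.conj().cpu(), y.cpu()))
```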

Fixes https://github.com/pytorch/pytorch/issues/148156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150157
Approved by: https://github.com/dcci
2025-03-28 20:36:44 +00:00
ef1cb6b646 [BE] Suppress user_warnings while running opinfo tests (#150115)
Some of the samples are constructed in a way that is expected to trigger those warnings, but there is no point in displaying them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150115
Approved by: https://github.com/dcci
ghstack dependencies: #150060
2025-03-27 22:36:27 +00:00
6aca002d82 [MPS] Add chebyshev_polynomial_[uvw] (#150060)
For both eager and inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150060
Approved by: https://github.com/dcci, https://github.com/jansel
2025-03-26 23:35:05 +00:00
de68ddc68e [MPS] Fix metal ops with different dtypes (#149974)
By implementing `_cast_` flavors of both dense and strided ops. Add regression tests that test `fmax`/`fmin` for mixed dtypes.

Been dreading writing this PR for a while, as it ended up being pretty bulky:
 - Adds `C10_METAL_ALL_TYPES_FUNCTOR` and `c10::metal::ScalarType` to `c10/metal/common.h` and tests that its values always match `c10::ScalarType`
 - Add `c10::metal::cast_to` to `c10/metal/utils.h`, which can be used to cast any scalar Metal dtype to any other one, including complex values
 - Implement `val_at_offs<T>(constant void *, long offs, ScalarType dtype)` that is used to dynamically cast types
 - Add `binary_strided_cast` and `binary_dense_cast` that are invoked for the output dtype and cast both inputs to that dtype before performing the op (see the example after this list)
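A minimal example of the mixed-dtype binary op the cast kernels handle (the float16 input is cast to the float32 output dtype before the op runs):
```python
import torch

a = torch.rand(4, device="mps", dtype=torch.float32)
b = torch.rand(4, device="mps", dtype=torch.float16)
out = torch.fmax(a, b)
print(out.dtype)  # torch.float32 -- the common (output) dtype
```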

Benchmark collected on M2Pro that runs fmax for 1 mln element tensors (Times are in microseconds.)

|                                           |  dense-dense  |  transp-transp  |  dense-transp  |  transp-dense  |  dense-scalar  |  dense-bcast |
|-------------------------------------------|---------------|-----------------|----------------|----------------|----------------|--------------|
|      fmax (torch.float16, torch.float16)  |     160.9     |      159.9      |     270.5      |     270.9      |     236.6      |     293.0
|      fmax (torch.float32, torch.float32)  |     176.9     |      171.0      |     273.7      |     293.5      |     242.6      |     294.2
|      fmax (torch.float32, torch.float16)  |     171.4     |      170.9      |     283.6      |     303.0      |     253.7      |     302.3
|      add (torch.float16, torch.float16)   |     218.0     |      223.6      |     221.0      |     222.0      |     214.9      |     218.3
|      add (torch.float32, torch.float32)   |     227.4     |      233.9      |     228.8      |     231.9      |     218.9      |     221.4
|      add (torch.float32, torch.float16)   |     226.1     |      227.5      |     227.5      |     226.9      |     177.0      |     190.8

TODOS:
 - Include input and output dtype in non-cast kernel name
 - Make TensorFactory.h use `C10_METAL_ALL_TYPES_FUNCTOR`
- Extend mixed_dtypes testing via OpInfo

Fixes https://github.com/pytorch/pytorch/issues/149951
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149974
Approved by: https://github.com/manuelcandales
2025-03-26 07:03:21 +00:00
ba46643df1 [MPS] tril op not handling infs correctly (#149866)
Fixes #149813

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149866
Approved by: https://github.com/malfet
2025-03-24 23:38:41 +00:00
9179178728 [MPS] Add support for chebyshev_polynomial_t in eager. (#149816)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149816
Approved by: https://github.com/malfet
2025-03-24 19:19:55 +00:00
248487f455 [MPS] nanmedian with dims (#149680)
Third most voted op from #77764

Tests were deleted because they are covered by the regular test_output_match tests, so they were redundant; they had been added in the last PR before the nanmedian dim version was implemented

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149680
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-24 03:49:16 +00:00
b9a5e1d038 [MPS] Add support for scaled_modified_bessel_k1 to eager. (#149783)
Another day another op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149783
Approved by: https://github.com/malfet
2025-03-22 02:13:41 +00:00
bdc132d0e1 [MPS] Add support for scaled_modified_bessel_k0 for eager. (#149705)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149705
Approved by: https://github.com/malfet
2025-03-21 16:14:29 +00:00
0ed34210b2 [MPS] Add support for modified_bessel_k1 to eager and inductor. (#149687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149687
Approved by: https://github.com/malfet
2025-03-21 04:59:06 +00:00
95e71765f2 [MPS] nanmedian implementation (#149407)
Implements nanmedian on MPS. This implementation only covers `torch.nanmedian(tensor)`, without `keepdim` and `dim`.
Will implement nanmedian with dim and keepdim in a follow-up.
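A quick usage sketch of the dim-less form added here (NaNs are ignored when computing the median):
```python
import torch

x = torch.tensor([1.0, float("nan"), 3.0, 2.0], device="mps")
print(torch.nanmedian(x))  # tensor(2., device='mps:0')
```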

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149407
Approved by: https://github.com/malfet
2025-03-20 03:50:26 +00:00
88c2fe533f [MPS] Add modified_bessel_k0 support to eager. (#149563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149563
Approved by: https://github.com/malfet
2025-03-19 23:10:55 +00:00
2e0c98ff05 [MPS] Add bicubic2d_aa (#149378)
Which is currently the most frequently requested op in https://github.com/pytorch/pytorch/issues/141287

Mostly done by refactoring `upsample_bilinear2d_aa` to accept a Functor as one of the template arguments, which closely follows ideas from eec43cfbc0/src/libImaging/Resample.c as well as
bb42e4d137/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu (L472-L478)

Populate unit tests by copying upsample_bilinear_2d_aa and reusing it as upsample_bicubic2d_aa

At that point, the only differences between upsample_bilinear2d_aa and upsample_bicubic2d_aa are the convolution kernel function and its size: for bilinear it's 3x3, for bicubic it's 5x5
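A usage sketch of the newly supported op through the public interpolate API (shapes are illustrative):
```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 64, 64, device="mps")
y = F.interpolate(x, size=(32, 32), mode="bicubic", antialias=True)
print(y.shape)  # torch.Size([1, 3, 32, 32])
```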
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149378
Approved by: https://github.com/dcci
2025-03-18 05:35:41 +00:00
c43e35d6f7 [MPS] Implement support for modified_bessel_i1 in eager. (#149368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149368
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-18 03:29:10 +00:00
186cc7327c [MPS/BE] Remove decorator that skipped test on macOS 12. (#149365)
macOS 12 is not really supported anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149365
Approved by: https://github.com/malfet
2025-03-18 00:58:08 +00:00
9f33c6f0a0 [MPS] Add support for modified_bessel_i0 in eager. (#149264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149264
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-03-16 04:45:49 +00:00
96795e9533 [BE] Parametrize TestMPS.test_binops_dtype_precedence (#149234)
No op change, just splits a longer test into a series of smaller ones
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149234
Approved by: https://github.com/atalman, https://github.com/dcci
ghstack dependencies: #149216, #149233
2025-03-15 00:37:11 +00:00
dd6e9df3d0 [MPS] fix attention enable_gqa crash on mps (#149147)
Fixes #149132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149147
Approved by: https://github.com/malfet
2025-03-14 21:25:54 +00:00