736 Commits

Author SHA1 Message Date
fdab48a7c1 Enable all PIE rules on ruff (#165814)
This PR enables all PIE rules on ruff. Some rules from this family are already enabled; the newly added rules are:
```
PIE796  Enum contains duplicate value: {value}
PIE808  Unnecessary start argument in range
```
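
As a rough illustration of what these two rules flag (the class and variable names below are made up for this example and are not taken from the PyTorch codebase):

```python
from enum import Enum


class Color(Enum):
    RED = 1
    CRIMSON = 1  # PIE796 would flag this: the value duplicates RED
    GREEN = 2


# Python silently turns the duplicate into an alias rather than a new member,
# so iterating over the enum skips it.
members = [member.name for member in Color]

# PIE808 would flag the explicit 0 start; range(5) yields the same sequence.
with_start = list(range(0, 5))
without_start = list(range(5))
```

In both cases the flagged code runs without error, which is why a lint rule is the practical way to catch it.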

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 07:36:18 +00:00
24520b8386 Revert "Enable all PIE rules on ruff (#165814)"
This reverts commit c79dfdc6550e872783aa5cb5fc9e86589bf18872.

Reverted https://github.com/pytorch/pytorch/pull/165814 on behalf of https://github.com/cyyever due to Need to cover more files ([comment](https://github.com/pytorch/pytorch/pull/165814#issuecomment-3417931863))
2025-10-18 07:21:08 +00:00
c79dfdc655 Enable all PIE rules on ruff (#165814)
This PR enables all PIE rules on ruff. Some rules from this family are already enabled; the newly added rules are:
```
PIE796  Enum contains duplicate value: {value}
PIE808  Unnecessary start argument in range
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165814
Approved by: https://github.com/ezyang
2025-10-18 06:40:12 +00:00
935ccdbe75 [MPS] Fix internal assertion in torch.linalg.solve for singular matrices (#165254)
Fixes #163962 by special casing MPS in the negative status code branch in `_linalg_check_errors`.

Checks if info is [`MPSMatrixDecompositionStatus.singular`](https://developer.apple.com/documentation/metalperformanceshaders/mpsmatrixdecompositionstatus/singular) (which has a raw value of -2). I didn't find an official Apple source with this raw value (besides printing the enum value), so I'm not sure if we can (or should) depend on it? Is there a way to directly get the Objective-C enum value in C++?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165254
Approved by: https://github.com/malfet
2025-10-17 15:35:49 +00:00
8e510e1095 [MPS] fix empty dot op crash (#165237)
reproducer
```
import torch

# does not crash
a = torch.rand((0), device="cpu")
b = torch.rand((0), device="cpu")
a.dot(b)

# crashes due to internal assert
a = torch.rand((0), device="mps")
b = torch.rand((0), device="mps")
a.dot(b)

```

Discovered when implementing an op for SparseMPS backend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165237
Approved by: https://github.com/malfet
2025-10-15 04:49:29 +00:00
83cbba8759 [MPS] Support large tensors in torch.cat (#164416)
Fixes #164415
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164416
Approved by: https://github.com/malfet
2025-10-13 16:56:56 +00:00
ef50c6e3e3 [MPS] Add backward pass for embedding_bag (#163931)
Fixes #162270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163931
Approved by: https://github.com/malfet
2025-10-03 00:48:38 +00:00
1c12d7416b [SDPA] [MPS] Fixes regression in 2.8.0 for scaled_dot_product_attention using mps (#163598)
Fixes #163597

- Updates fast SDPA implementations to take in query tensor stride info similar to key and value instead of assuming stride.
- Updated tests with additional transpose/permutation layouts. New tests catch the regression.

### Benchmarking with script found in [implementation PR](https://github.com/pytorch/pytorch/pull/152781#:~:text=19.8%25%20speed%20improvement-,Script%20to%20get%20perf%3A,-import%20torch%0Aimport)

Times are averaged over 100000 iterations. This change should not have any significant performance difference. Tested on an M3 Pro

### Vector Fast Path (q_len=1, k_len=256)

- Before: 0.160 ms
- After: 0.157 ms

### Vector 2-pass (q_len=1, k_len=4096)

- Before: 0.342 ms
- After: 0.339 ms

### Vector Fast Path (q_len=8, k_len=256)

- Before: 0.228 ms
- After: 0.231 ms

### Vector 2-pass (q_len=8, k_len=4096)

- Before: 0.432 ms
- After:  0.436 ms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163598
Approved by: https://github.com/malfet
2025-09-29 16:09:46 +00:00
a0136f149c [MPS] Fix nan behavior in grid_sampler_3d (#163881)
Fixes #163851
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163881
Approved by: https://github.com/malfet
2025-09-26 23:08:00 +00:00
20149080f2 [MPS] Compute offset2bag/bag_size/max_indices in _embedding_bag (#163281)
Part of #162270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163281
Approved by: https://github.com/malfet
2025-09-23 22:30:48 +00:00
559e8d1c20 [doc]: Small typos (#162982)
Small typo fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162982
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-09-16 17:42:19 +00:00
d25c35d2b2 [MPS] Fix [nan]median output for empty tensors (#162846)
It should be `NaN` rather than 0

Added respective checks to `test_empty_tensor`

Fixes https://github.com/pytorch/pytorch/issues/162798
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162846
Approved by: https://github.com/dcci
2025-09-12 22:26:29 +00:00
468c1f9e9d Revert "[nn] Assert parsed iterable arguments are an appropriate length (#162340)"
This reverts commit b5e6e58050bd2a15f4173cfffa00c7e32e382b49.

Reverted https://github.com/pytorch/pytorch/pull/162340 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break an MPS tests on ExecuTorch ([comment](https://github.com/pytorch/pytorch/pull/162340#issuecomment-3282676242))
2025-09-11 21:22:57 +00:00
b5e6e58050 [nn] Assert parsed iterable arguments are an appropriate length (#162340)
Fixes #162327
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162340
Approved by: https://github.com/Skylion007
2025-09-10 15:15:49 +00:00
583bbf7761 [MPS] Add native_dropout and native_dropout_backward (#162108)
Fixes #162002
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162108
Approved by: https://github.com/malfet
2025-09-09 01:44:06 +00:00
dcf385395d [MPS] Move sparsemps testing from test_mps to test_sparse (#161852)
Moves Sparse MPS testing from test_mps to test_sparse. Lots of skips now but I expect to remove them iteratively once ops are implemented

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161852
Approved by: https://github.com/malfet
2025-09-02 19:04:11 +00:00
f3697b033e [MPS] add bunch of unary funcs for sparse tensors (#161846)
Adds a bunch of unary functions for sparse tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161846
Approved by: https://github.com/malfet
2025-08-30 21:13:05 +00:00
8627a19adf [MPS] sparse add unary funcs + add for sparse tensors (#160839)
Adds several unary functions and add. Enables tests for unary functions in test_sparse but not enabling other tests yet, needs more ops before we fully migrate to testing SparseMPS with `test_sparse.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160839
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-30 01:09:00 +00:00
7c30a9d7fc [MPS] Add slow version of kthvalue (#161817)
Which heavily borrows implementation logic from `topk`
As this method is non-deterministic, modified the logic for cpu-ops indices comparison with just an equality statement, as by default random numbers picked for input tensor allow for quite a lot of overlaps
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161817
Approved by: https://github.com/dcci
2025-08-30 00:44:29 +00:00
3daf20f8e1 [MPS] fix empty input in posneg functions (#161824)
fix empty posneg function for mps:
```python
import torch

input_tensor = torch.empty(0, device="mps")
out_pos = torch.isposinf(input_tensor)
```

Gives:
```
RuntimeError: [srcBuf length] > 0 INTERNAL ASSERT FAILED at "/Users/Irakli_Salia/Desktop/pytorch/aten/src/ATen/native/mps/OperationUtils.mm":551, please report a bug to PyTorch. Placeholder tensor is empty!
```

on main branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161824
Approved by: https://github.com/malfet
2025-08-29 23:12:04 +00:00
f6368e934e Revert "[MPS] sparse add unary funcs + add for sparse tensors (#160839)"
This reverts commit 93c5112f46a978a029644ae599979416ead5c917.

Reverted https://github.com/pytorch/pytorch/pull/160839 on behalf of https://github.com/atalman due to test_sparse_csr.py::TestSparseCompressedCPU::test_consistency_SparseCSR_asinh_cpu_complex64 [GH job link](https://github.com/pytorch/pytorch/actions/runs/17329155095/job/49201551217) [HUD commit link](93c5112f46) ([comment](https://github.com/pytorch/pytorch/pull/160839#issuecomment-3238093296))
2025-08-29 19:55:39 +00:00
93c5112f46 [MPS] sparse add unary funcs + add for sparse tensors (#160839)
Adds several unary functions and add. Enables tests for unary functions in test_sparse but not enabling other tests yet, needs more ops before we fully migrate to testing SparseMPS with `test_sparse.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160839
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-29 16:28:58 +00:00
ee0ec21191 Ensure that tensors are contiguous before using no-graph MPS impl (#161641)
Fixes #161640

Check if tensors are contiguous before using the no-graph implementation. Using the script in the issue above with this change I get expected results.

```
MPS contiguous result sample: tensor([ 1.3600, -2.9516,  1.3207, -3.5132,  1.7061], device='mps:0')
MPS non-contig result sample: tensor([ 1.3600, -2.9516,  1.3207, -3.5132,  1.7061], device='mps:0')
CPU non-contig result sample: tensor([ 1.3600, -2.9516,  1.3207, -3.5132,  1.7061])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161641
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-27 22:31:57 +00:00
121afd6a8f [MPS] Update avg_pool2d to use Metal kernel when ceil_mode=True (#161011)
Fixes #160743

The MPS impl of `avg_pool2d` seems to only give incorrect results when `ceil_mode=True`. I wrote a performance measurement script (0ee6e58643/avg_pool_mps/perf_2d.py) which tests a bunch of different cases and also marks the cases where MPS and CPU results do not match.

I found that if I update `avg_pool2d` to use the new Metal kernel in all cases, that fixes all the mismatches, but it also decreases performance for some of the `ceil_mode=False` cases. So I opted to only run the new Metal kernel when  `ceil_mode=True`, which does not significantly decrease performance in any of the cases tested.
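
For context on what `ceil_mode` changes, here is a minimal sketch of the output-size rule for pooling along one dimension. It follows the rule described in the PyTorch pooling documentation (windows may run past the input but must not start in the right padding); the function name is hypothetical, and this is not the Metal kernel from this PR.

```python
import math


def pool_output_size(size: int, kernel: int, stride: int, padding: int = 0,
                     ceil_mode: bool = False) -> int:
    """Output length along one dimension of a pooling window sweep."""
    span = size + 2 * padding - kernel
    rounded = math.ceil(span / stride) if ceil_mode else span // stride
    out = rounded + 1
    # With ceil_mode, drop a trailing window that would start in the right padding.
    if ceil_mode and (out - 1) * stride >= size + padding:
        out -= 1
    return out


floor_out = pool_output_size(6, kernel=3, stride=2, padding=1)
ceil_out = pool_output_size(6, kernel=3, stride=2, padding=1, ceil_mode=True)
```

With `ceil_mode=True` the last window can cover fewer valid input elements than the kernel size, which is the kind of edge case an averaging kernel has to get right.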
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161011
Approved by: https://github.com/malfet
2025-08-23 02:36:22 +00:00
cee72119b2 [Test] Adding a testcase for constant_pad_nd (#161259)
Fixes #161066

This PR adds a simple testcase for constant_pad_nd on MPS as mentioned in https://github.com/pytorch/pytorch/pull/161149#issuecomment-3211701274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161259
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-23 01:00:50 +00:00
b0071c65e2 [MPS] Fix error check for torch.var on scalar (#160889)
Fixes https://github.com/pytorch/pytorch/issues/160738
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160889
Approved by: https://github.com/Skylion007
ghstack dependencies: #160850
2025-08-18 17:36:42 +00:00
6382302990 [MPS] Add grid_sampler_3d for MPS (#160541)
This PR adds support for `grid_sampler_3d` for MPS with "bilinear" interpolation.

NOTE: "nearest" interpolation is not yet supported

Fixes #159882
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160541
Approved by: https://github.com/malfet
2025-08-15 16:19:25 +00:00
7d87e358ac Fix MPS conv3d autocast bias dtype mismatch (#160423)
## Summary
- register conv3d with MPS autocast to ensure bias dtypes match under AMP
- add regression test chaining two Conv3d layers on MPS autocast

Written by Codex, see https://chatgpt.com/codex/tasks/task_e_689b64192df883278648935963d2776d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160423
Approved by: https://github.com/dcci
2025-08-13 16:23:21 +00:00
7d2ec704e4 Fix MPS autocast for ConvTranspose3d (#160345)
## Summary
- ensure ConvTranspose3d uses fp32 under MPS autocast
- add MPS autocast test for ConvTranspose3d

Generated by Codex, see https://chatgpt.com/codex/tasks/task_e_689a360388288327a2cac6f55bbfc42c

Fixes https://github.com/pytorch/pytorch/issues/160332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160345
Approved by: https://github.com/dcci
2025-08-11 21:01:52 +00:00
d25c4f954d [MPS] Type-promote tensor-iterator common dtype (#160334)
Otherwise, `torch.add(FloatTensor, IntTensor)` and `torch.add(FloatTensor, IntTensor, alpha=2)` were dispatched to different kernels

Fixes https://github.com/pytorch/pytorch/issues/160208
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160334
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-08-11 17:53:56 +00:00
a84b60c0c4 [MPS] Sparse coalesce more dtypes to match cpu (#160254)
More dtypes to match the cpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160254
Approved by: https://github.com/malfet
2025-08-10 12:25:18 +00:00
7f4cb4a3e0 [MPS] coalesce for sparse tensors (#159729)
MPS coalesce function for sparse tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159729
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-08 13:49:55 +00:00
74a754aae9 Add meta kernel for sdpa_math_for_mps (#159695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159695
Approved by: https://github.com/malfet
ghstack dependencies: #159456
2025-08-05 22:27:06 +00:00
f946b25865 [MPS] Speedup argmax/argmin (#159524)
By using efficient `threadgroup_arg[max|min]` primitives.
- Fixed bug in `simd_argmax` when result of the `simd_ballot` were prematurely cast to `ushort` and adjusted unit test
- Fixed nan handling in compiled argmax, but can't reliably test it as MPS(eager) implementation of argmax is buggy

Now according to `bench_mps_ops.py` `max(x, dim=0)` is reliably faster than eager implementation:
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      max (torch.float16)  |      285.8      |       272.2       |       422.3       |        354.5        |       721.6       |        683.5        |       2224.0      |        1979.1
      max (torch.float32)  |      300.2      |       267.0       |       389.6       |        342.5        |       769.4       |        682.6        |       2995.7      |        2609.8
      max (torch.int32)    |      299.6      |       275.4       |       390.0       |        361.7        |       758.7       |        686.1        |       3103.4      |        2646.5
      max (torch.int64)    |      297.5      |       275.5       |       417.0       |        382.1        |       856.1       |        722.6        |       5467.7      |        3156.8

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159524
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #158990
2025-07-31 16:18:32 +00:00
1293405c8d [MPS] Add simd_[arg][max|min] (#158990)
And add eager tests for those.
Re-implement `threadgroup_[max|min]` using those functions, as they are significantly faster than before (though much slower than eager, due to the arg part), which can be verified by running the following script
```python
import itertools
import timeit
import torch
from torch.utils.benchmark import Compare, Measurement, Timer

def bench_unary_op(func, x, label) -> Measurement:
    sync_cmd = "torch.mps.synchronize()" if "mps" in str(x.device) else ""
    t = Timer(
        stmt=f"f(x);{sync_cmd}",
        globals={"f": func, "x": x},
        language="python",
        timer=timeit.default_timer,
        sub_label=f"{func.__name__} ({str(x.dtype)})",
        description=label,
        env=torch.__version__,
    )
    return t.blocked_autorange()

def bench_reduction(
    reduction_func, device: str = "mps", dtype: torch.dtype = torch.float32
) -> list[Measurement]:
    rc = []

    # Bench 2D with reduction over dim=0
    def f(t):
        return reduction_func(t, dim=0)[0]

    f.__name__ = reduction_func.__name__
    f_c = torch.compile(f, dynamic=False, fullgraph=True)

    for size in (512, 1024, 2048, 4096):
        x = torch.testing.make_tensor(size, size, device=device, dtype=dtype)
        rc_c, rc_e = f(x), f_c(x)
        rc_c, rc_e = (rc_c[0], rc_e[0]) if isinstance(rc_c, tuple) else (rc_c, rc_e)
        rc.append(bench_unary_op(f, x, f"eager-{size}x{size}"))
        rc.append(bench_unary_op(f_c, x, f"compile-{size}x{size}"))
    return rc

def main() -> None:
    #dtypes = [torch.float16, torch.float32, torch.bfloat16, torch.int32, torch.int64]
    dtypes = [torch.float32, torch.int32, torch.int64]

    # Profile reduction ops
    rc = []
    for op, dtype in itertools.product([torch.max], dtypes):
        rc.extend(bench_reduction(op, dtype=dtype))
    Compare(rc).print()

if __name__ == "__main__":
    torch._dynamo.config.cache_size_limit = 2**16
    main()
```

Produces the following table before
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      max (torch.float32)  |      297.3      |       531.6       |       394.1       |        2550.5       |       773.0       |        4904.7       |       3647.2      |        9682.0
      max (torch.int32)    |      297.8      |       359.2       |       387.7       |        1179.4       |       768.2       |        2175.0       |       3677.1      |        4495.9
      max (torch.int64)    |      278.7      |       541.4       |       410.2       |        2873.3       |       858.9       |        5620.4       |       6107.2      |       11176.1

Times are in microseconds (us).
```
And after
```
[---------------------------------------------------------------------------------------------  --------------------------------------------------------------------------------------------]
                           |  eager-512x512  |  compile-512x512  |  eager-1024x1024  |  compile-1024x1024  |  eager-2048x2048  |  compile-2048x2048  |  eager-4096x4096  |  compile-4096x4096
1 threads: ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      max (torch.float32)  |      307.9      |       265.3       |       401.0       |        340.8        |       766.5       |        661.9        |       3463.5      |        2829.5
      max (torch.int32)    |      293.5      |       263.1       |       405.0       |        338.8        |       761.4       |        672.5        |       3050.0      |        2688.6
      max (torch.int64)    |      308.2      |       255.7       |       417.4       |        341.4        |       877.0       |        695.0        |       5812.2      |        5762.2

```

`argmax`/`argmin` are much trickier due to the nan-handling logic that needs to be added there.

Also fixes `torch.max/min` compilation for half-precision types, added regression tests for it.

This PR also introduces a bunch of helper functions, such as `simd_broadcast` that works for int64 and `c10::metal::pair` template, which are used by `simd_argmax` to return both value and index
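
The idea of reducing over (value, index) pairs can be sketched in plain Python. This is only an illustration, not the Metal code: `pick_max` and `argmax_pairs` are hypothetical names, and the tie and NaN rules (NaN beats every number, lower index wins on ties) are assumptions made for the sketch.

```python
import math
from functools import reduce


def pick_max(a, b):
    """Combine two (value, index) pairs, keeping the one with the larger value."""
    (va, ia), (vb, ib) = a, b
    a_nan, b_nan = math.isnan(va), math.isnan(vb)
    if a_nan or b_nan:
        # Assumption: NaN beats every number; among NaNs the lower index wins.
        if a_nan and b_nan:
            return a if ia <= ib else b
        return a if a_nan else b
    if va == vb:
        # Assumption: ties resolve to the lower index.
        return a if ia <= ib else b
    return a if va > vb else b


def argmax_pairs(values):
    """Reduce (value, index) pairs; returns (max value, its index)."""
    return reduce(pick_max, ((v, i) for i, v in enumerate(values)))


plain = argmax_pairs([1.0, 5.0, 3.0])
with_nan = argmax_pairs([1.0, float("nan"), 9.0])
```

Because the combine step only ever looks at two pairs, the same function can be applied in any grouping, which is what makes it usable as a parallel reduction.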

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158990
Approved by: https://github.com/dcci, https://github.com/Skylion007
2025-07-30 21:57:25 +00:00
70d2e9ba45 [MPS] Avoid outputting zeros from exponential_ for MPS (#159386)
Fixes #159103
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159386
Approved by: https://github.com/malfet
2025-07-30 00:20:31 +00:00
15bb81ea4f [2/N][CI] Remove MacOS-13 workarounds from tests (#159304)
Part of https://github.com/pytorch/pytorch/issues/159275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159304
Approved by: https://github.com/dcci, https://github.com/cyyever
ghstack dependencies: #159277, #159278
2025-07-29 23:12:13 +00:00
d0c00d9a69 [MPS] Do not crash if tensor dim > INT_MAX (#158824)
Looks like all MPS operations will crash if one of the tensor dimensions is
greater than `2**31-1`

Change it into a structured exception, by checking tensor size before
attempting to create MPS Tensor
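
A minimal sketch of that kind of up-front check, in plain Python rather than the actual Objective-C++ code (the function name and error text are hypothetical):

```python
INT_MAX = 2**31 - 1


def check_dims_fit_int32(shape):
    """Raise a catchable error instead of letting a backend assertion abort."""
    for dim, size in enumerate(shape):
        if size > INT_MAX:
            raise RuntimeError(
                f"dimension {dim} has size {size}, which exceeds INT_MAX ({INT_MAX})"
            )
    return tuple(shape)


ok_shape = check_dims_fit_int32((2**31 - 1, 4))

try:
    check_dims_fit_int32((2**31,))
    error_message = None
except RuntimeError as exc:
    error_message = str(exc)
```

The point of validating first is that the caller gets an exception it can handle, rather than a process abort from inside the framework.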

Add regression test for it. Before this change, running the following will abort with an exception
```
% python3 -c "import torch; torch.randint(0, 10, (2**31,), dtype=torch.uint8, device='mps')"
/AppleInternal/Library/BuildRoots/1c8f7852-1ca9-11f0-b28b-226177e5bb69/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:829: failed assertion `[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: NDArray dimension length > INT_MAX'
zsh: abort      python3 -c·
```

Skip the test on MacOS-13, as it crashes somewhere deep in MPSGraph framework with
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSTemporaryNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158824
Approved by: https://github.com/dcci
ghstack dependencies: #158690, #158823
2025-07-22 15:12:26 +00:00
194539e9c3 Address NaNs if SDPA is called with all values masked from query (#157727)
Fixes #156707

Detect if all values along the softmax axis are infs and overwrite the outputs for those computations with zeros before the final matmul. The behavior should be aligned with the CPU implementation.

According to the original issue, such cases, where every value along that dimension of the attention mask is false and the softmax output is therefore undefined, occur with left-padded batches during generation in HF transformers.
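
The failure mode and the workaround can be sketched on a single row in plain Python. This is an illustration of the idea only, not the MPS implementation, and the function names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def naive_softmax(row):
    """Standard max-subtracted softmax; breaks when every entry is -inf."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # -inf - (-inf) is NaN
    total = sum(exps)
    return [e / total for e in exps]


def masked_softmax(row):
    """Return zeros for a fully masked row, otherwise the usual softmax."""
    if all(x == NEG_INF for x in row):
        return [0.0] * len(row)
    return naive_softmax(row)


fully_masked = [NEG_INF, NEG_INF, NEG_INF]
naive_result = naive_softmax(fully_masked)
safe_result = masked_softmax(fully_masked)
```

Zeroing the row before the final matmul means the fully masked query contributes a zero output vector instead of propagating NaN.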
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157727
Approved by: https://github.com/malfet
2025-07-14 22:09:35 +00:00
beed033b6e [MPS] Fix index_kernel for large tensors (#158064)
Move `MetalShaderLibrary::bind_tensors` private method to OperationUtils.h and extract `iter_tensor_offset` method, that returns an offset from the start of the storage associated with given tensor inside the iterator

Migrated `index`, `index_put[_accumulate][_serial]` to the new paradigm that requires neither an additional tensor for indices nor special handling for 32 vs 64-bit offsets, which resulted in almost 2x perf gain for 2000x2000 tensor. Results before:
```
[------------------------------------------------------------  -----------------------------------------------------------]
                                                |  11x50x50  |  11x100x100  |  11x500x500  |  11x1000x1000  |  11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
      __getitem__ (torch.int8, torch.int64)     |   383.5    |    379.8     |    470.9     |     1232.9     |     4410.3
      __getitem__ (torch.float16, torch.int64)  |   379.6    |    354.5     |    533.2     |     1290.3     |     4442.2
      __getitem__ (torch.float32, torch.int64)  |   360.8    |    338.6     |    478.6     |     1348.9     |     4870.4

Times are in microseconds (us).
```
and after
```
[------------------------------------------------------------  -----------------------------------------------------------]
                                                |  11x50x50  |  11x100x100  |  11x500x500  |  11x1000x1000  |  11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
      __getitem__ (torch.int8, torch.int64)     |   349.8    |    330.5     |    432.6     |     764.5      |     1961.2
      __getitem__ (torch.float16, torch.int64)  |   342.5    |    330.7     |    434.7     |     741.0      |     1969.4
      __getitem__ (torch.float32, torch.int64)  |   332.2    |    326.1     |    445.4     |     751.3      |     1972.6

Times are in microseconds (us).
```

While migrating also fixed index_put_accumulate for boolean types, by using compare_and_exchange trick over uint

Fixes https://github.com/pytorch/pytorch/issues/153560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158064
Approved by: https://github.com/dcci
2025-07-11 22:35:44 +00:00
8088958793 port 4 dynamo test files to Intel GPU (#157779)
For https://github.com/pytorch/pytorch/issues/114850, we will port test cases to Intel GPU. Six dynamo test files were ported in PR [#156056](https://github.com/pytorch/pytorch/pull/156056) and [#156575](https://github.com/pytorch/pytorch/pull/156575). In this PR we will port 4 more dynamo test files.
We enable Intel GPU with the following methods, trying our best to keep the original code style:

- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- added XPU support in decorators like @requires_gpu
- enabled XPU for some test path
- added xfailIfXPU to skip xpu test when there is a bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157779
Approved by: https://github.com/guangyey, https://github.com/jansel
2025-07-11 10:11:49 +00:00
fc0376e8b1 [BE][2/6] fix typos in test/ (test/test_*.py) (#157636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157636
Approved by: https://github.com/yewentao256, https://github.com/mlazos
ghstack dependencies: #156311, #156609
2025-07-09 11:02:23 +00:00
a5c61eb78d [MPS][BE] Delete as_strided_tensorimpl_mps (#157772)
Because it's just copy-n-paste of `as_strided_tensorimpl` with call to `updateTensorBaseShape`, which is not called/used anywhere else.

Fixes https://github.com/pytorch/pytorch/issues/152701
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157772
Approved by: https://github.com/Skylion007
2025-07-08 17:02:36 +00:00
510c398a4f Add max_pool3d backward pass for MPS (#157498)
Note on backward precision over fp16:

A float16 number has 10 bits of mantissa, 5 bits of exponent, and 1 bit for the sign. If the sign bit is positive, then with a mantissa $m$ and exponent $e$ represented in base 10, the number that the float16 format represents is $(1 + m / 1024) \cdot 2^{e}$. ([source](https://en.wikipedia.org/wiki/Half-precision_floating-point_format))

Consider adding two numbers $a$ and $b$ which have arbitrary mantissas, and say their exponents are $e_a = 1$ (so $2 \le a \lt 4$) and $e_b=-3$ (so $0.125 \le b \lt 0.25$). Assume that the result has the same exponent as $a$. Since the exponents differ by 4, we'll effectively need to truncate the 4 rightmost bits of $b$'s mantissa, which would introduce a maximum error on the order of $(2^4 / 1024) \cdot 2^{-3} \approx 0.002$.

The error is nearly the same if $e_b = -2$ (so $0.25 \le b \lt 0.5$), where the 3 rightmost bits are truncated, giving a maximum error on the order of $(2^3 / 1024) \cdot 2^{-2} \approx 0.002$. Same for $e_b=-1$.

So if we're adding up nine different numbers that all have exponents -3, -2, or -1, and they sum to a number with exponent 1, then we would expect a maximum error of several times greater than 0.002. In my comments above, summing those particular nine numbers in different ways gave results that ranged between 3.1816 and 3.1758, a difference of $0.0058 \approx 2.9 \times 0.002$.

That's within the acceptable bounds, and we can safely just increase the error tolerance used in test_output_grad_match for the case of max_pool3d_backward with float16.
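
The argument can be checked with a small standard-library sketch that emulates float16 accumulation through `struct`'s half-precision format. The nine inputs are illustrative values chosen for this example, not the numbers from the PR discussion, and the helper names are hypothetical.

```python
import struct


def to_f16(x: float) -> float:
    """Round a Python float to the nearest float16 value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def f16_sum(values):
    """Accumulate left to right, rounding to float16 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_f16(acc + v)
    return acc


# Spacing between adjacent float16 values in [2, 4): 2 / 1024, about 0.002.
ULP = 2.0 / 1024

# Nine inputs with exponents -3, -2, or -1, pre-rounded to float16.
values = [to_f16(v) for v in (0.3137, 0.2412, 0.4871, 0.1593, 0.4426,
                              0.3708, 0.4969, 0.2154, 0.5550)]

exact = sum(values)                     # double precision is exact for these inputs
forward = f16_sum(values)               # one summation order
backward = f16_sum(reversed(values))    # another summation order
```

Each of the eight rounding steps can be off by at most half of `ULP`, so either order stays within `4 * ULP` of the exact sum. That is the same "several times 0.002" scale as the spread reported above.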

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157498
Approved by: https://github.com/malfet
2025-07-07 19:46:44 +00:00
d56f11a1f2 [MPS] Implement logcumsumexp metal kernel (#156858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156858
Approved by: https://github.com/malfet
ghstack dependencies: #157512
2025-07-03 18:16:25 +00:00
5e636d664a [BE] @serialTest decorator must be called (#157388)
Otherwise it turns the test into a trivial one (that always succeeds), as the following example demonstrates
```python
import torch
from torch.testing._internal.common_utils import serialTest, run_tests, TestCase

class MegaTest(TestCase):
    @serialTest
    def test_foo(self):
        if hasattr(self.test_foo, "pytestmark"):
            print("foo has attr and it is", self.test_foo.pytestmark)
        print("foo")

    @serialTest()
    def test_bar(self):
        if hasattr(self.test_bar, "pytestmark"):
            print("bar has attr and it is", self.test_bar.pytestmark)
        print("bar")

if __name__ == "__main__":
    run_tests()
```

That will print
```
test_bar (__main__.MegaTest.test_bar) ... bar has attr and it is [Mark(name='serial', args=(), kwargs={})]
bar
ok
test_foo (__main__.MegaTest.test_foo) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.013s

```

Added assert that arg is boolean in the decorator to prevent such silent skips in the future

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157388
Approved by: https://github.com/clee2000
2025-07-02 19:15:19 +00:00
019e30e3b8 [BE] Decorate LargeTensorTest with serialTests (#157382)
Maybe it'll help make M2-15 jobs more stable, as that was the last test run before OOM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157382
Approved by: https://github.com/clee2000
2025-07-01 20:35:42 +00:00
a1282b1823 [MPS] Add boilerplate sparse code support (#157238)
This PR makes minimal changes to support sparse tensors on MPS. In the followup PRs I'll start adding different operations slowly so we can fix the issue of
https://github.com/pytorch/pytorch/issues/129842
which is highly requested (I assume because of whisper using sparse tensors)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157238
Approved by: https://github.com/malfet
2025-06-30 01:53:45 +00:00
a1e4f1f98a [MPS] Reimplement tri[ul] as Metal shaders (#157179)
And add in-place flavor, as it is currently broken for non-contig tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157179
Approved by: https://github.com/dcci
2025-06-28 01:33:18 +00:00
653c52fe52 [MPS] Fix batch norm incorrect gradient (#156867)
Fixes #156555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156867
Approved by: https://github.com/malfet
2025-06-25 23:05:49 +00:00