pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 13:44:15 +08:00

Author	SHA1	Message	Date
Isalia20	653c52fe52	[MPS] Fix batch norm incorrect gradient (#156867 ) Fixes #156555 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156867 Approved by: https://github.com/malfet	2025-06-25 23:05:49 +00:00
Joona Havukainen	20a74c370b	Add error message with assert to topK if ndims() - dim > 4 (#155475 ) Addressing #154890 Not really a proper fix but at least it's more informative than the current crash. For a more long term solution I'm testing if we can use the TopK API released in MacOS14 as it does not have the same MPSScan op issue that the Sort and ArgSort are hitting. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155475 Approved by: https://github.com/kulinseth	2025-06-13 21:10:06 +00:00
Nikita Shulga	dd41a3907c	[MPS] Fix unary/binary ops for 2**32+ elem tensors (#155183 ) By using `TensorIterator::with_32bit_indexing()` primitive Add `bind_tensors` helper function that correctly sets up MPS tensors originating from TensorIterator TODO: Add comments to bind_tensors as well as a unit test, based on ``` python -c "import torch;print((torch.rand(1, 1024, 1024, dtype=torch.bfloat16, device='mps') + torch.rand(5000, 1, 1, dtype=torch.bfloat16, device='mps')).sin())" ``` Fixes https://github.com/pytorch/pytorch/issues/154828 Pull Request resolved: https://github.com/pytorch/pytorch/pull/155183 Approved by: https://github.com/cyyever, https://github.com/dcci, https://github.com/Skylion007 ghstack dependencies: #155150, #155178, #155184	2025-06-05 18:57:14 +00:00
Roy Hvaara	9a4c08ddfc	[MPS] Parametrize `test_scaled_dot_product_attention_autocast` (#155005 ) Also moving comments inside the function scope for some of my previous regression tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/155005 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-06-05 13:24:53 +00:00
Nikita Shulga	9cdce682a1	[MPS][BE] Reimplement log1p as Metal shader (#154936 ) That should make it faster than MPSGraph implementation, but also improves accuracy for small inputs, by using the algorithm described in [What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html#1202), i.e. $\log(1+x) = \frac{x * \log(1+x)}{(1 + x) - 1}$ if $1 + x \neq 1$ else just $x$ Also tried using first 3 elements of Taylor series in Horner's form which also seems to work fine, i.e. $\log(1+x) \approx x * (1 - x (\frac{1}{2} - \frac{x}{3}))$ Replaced less accurate log1p implementation in `c10/metal/special_math.h` with generic one. Parametrize and modify regression test to check for accuracy of small values TODOs: - Do proper implementation for complex values as well, perhaps using `0408ba0a76/mlx/backend/metal/kernels/utils.h (L339)` - Maybe implement it using Remez-like algorithm documented here `207f3b2b25/lib/msun/src/s_log1pf.c (L37)` - Or use llvm's implementation from `f393986b53/libclc/clc/lib/generic/math/clc_log1p.inc (L22)` - Benchmark which algorithm is faster and delivers better accuracy Pull Request resolved: https://github.com/pytorch/pytorch/pull/154936 Approved by: https://github.com/dcci, https://github.com/Skylion007	2025-06-03 14:10:13 +00:00
Joona Havukainen	981bdb39ca	Enable ConvTranspose3D for FP32 and Complex64 (#154696 ) Fixes #154615 Enables using ConvTranspose3D since it seems support exists both on MacOS 14 and 15. For the half dtypes the discrepancy of CPU and GPU implementations is too large to conclude whether there is a bug in the implementation or not without a more rigorous study on what bounds are there to the expected error. So they are left unsupported for now and an assert is added to notify the user if the op is called with fp16 or bf16 inputs. Tests for ConvTranspose3D were enabled for the supported data types. Pull Request resolved: https://github.com/pytorch/pytorch/pull/154696 Approved by: https://github.com/malfet	2025-06-02 16:24:03 +00:00
Isalia20	41092cb86c	[MPS] index copy impl (#154326 ) Second most requested op according to #154052 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154326 Approved by: https://github.com/malfet	2025-05-29 16:57:43 +00:00
Xuehai Pan	7ae204c3b6	[BE][CI][Easy] Run `lintrunner` on generated `.pyi` stub files (#150732 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150732 Approved by: https://github.com/malfet, https://github.com/cyyever, https://github.com/aorenste	2025-05-27 14:58:02 +00:00
Nikita Shulga	975bbc63db	[MPS][BE] Move fmod/remainder to Metal ops (#154280 ) This accomplishes following: - Fixes correctness problem with large integer types (though probably makes it slower, but this could not be avoided if one wants to compute accurate answer) - Makes op faster for floating point types (as Metal kernel invocation is faster than creating MPSGraph) - Eliminates need for several correctness workarounds Fixes https://github.com/pytorch/pytorch/issues/154171 Pull Request resolved: https://github.com/pytorch/pytorch/pull/154280 Approved by: https://github.com/dcci ghstack dependencies: #154275, #154290	2025-05-24 01:45:33 +00:00
Nikita Shulga	633ed01145	[MPS] Add support for two more isin variants (#154010 ) `isin_Tensor_Scalar_out` is just a redispatch to eq/neq `isin_Scalar_Tensor_out` redispatches back to generic `isin` op, but needs a small tweak to handle float scalars Make sure that `out` is resized to an expected value in `isin_Tensor_Tensor_out_mps` Add unittests to validate that, but skip them on MacOS-13, where MPS op just returns garbage Before this change both of those failed ```python >>> import torch >>> t = torch.tensor([0, 1, 2], device='mps') >>> torch.isin(t, 1) Traceback (most recent call last): File "<stdin>", line 1, in <module> NotImplementedError: The operator 'aten::isin.Tensor_Scalar_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS. >>> torch.isin(1, t) Traceback (most recent call last): File "<stdin>", line 1, in <module> NotImplementedError: The operator 'aten::isin.Scalar_Tensor_out' is not currently implemented for the MPS device. If you want this op to be considered for addition please comment on https://github.com/pytorch/pytorch/issues/141287 and mention use-case, that resulted in missing op as well as commit hash 3b875c25ea6d8802a0c53af9eb961ddf2f058188. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS. 
``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/154010 Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/manuelcandales ghstack dependencies: #153970, #153971, #153997	2025-05-22 17:59:35 +00:00
Nikita Shulga	d5ddc5ab20	[MPS] Fix float64 scalar tensor handling (#153582 ) Current implementation causes a silent correctness problem with torch.compile when someone tries to `torch.compile` a function where one of the arguments is, say, `np.exp(.3)`, which will be represented as a torch.float64 scalar tensor. Add regression test for this behavior Pull Request resolved: https://github.com/pytorch/pytorch/pull/153582 Approved by: https://github.com/dcci	2025-05-15 05:15:14 +00:00
Nikita Shulga	8749fe8439	[CI][MPS] Speedup test_large_bmm (#153562 ) By computing matmuls of only one random non-zero batch on CPU This reduces test runtime from 11 minutes to 14 sec ``` % python3 test/test_mps.py -v -k test_large_bmm_ test_large_bmm_bfloat16 (__main__.TestMPS.test_large_bmm_bfloat16) ... ok test_large_bmm_float16 (__main__.TestMPS.test_large_bmm_float16) ... ok ---------------------------------------------------------------------- Ran 2 tests in 27.495s ``` TODO: Compute it over two slices when https://github.com/pytorch/pytorch/issues/153560 is fixed Pull Request resolved: https://github.com/pytorch/pytorch/pull/153562 Approved by: https://github.com/Skylion007, https://github.com/clee2000	2025-05-14 18:49:42 +00:00
Isalia20	56492bfcb9	[MPS] SDPA specialized kernels (#152781 ) Partially fixes #139668 and #152550 Still work in progress. Following needs to be addressed:
- [x] Some tests are failing and need to check why and bugfix
- [x] Benchmark the new kernels and add to this PR for varying sequence lengths head dimensions (the ones that get dispatched to kernels)
- [x] Add tests to cover the specialized paths (if applicable)
- [x] Code cleanup
Tested on Macbook M1 Pro
### Vector Fast Path (q_len=1, k_len=256)
- Old: 0.378 ms
- New: 0.260 ms
- 31.2% speed improvement
### Vector 2-pass (q_len=1, k_len=4096)
- Old: 0.627 ms
- New: 0.370 ms
- 41.0% speed improvement
### Vector Fast Path (q_len=8, k_len=256)
- Old: 0.545 ms
- New: 0.322 ms
- 40.9% speed improvement
### Vector 2-pass (q_len=8, k_len=4096)
- Old: 1.318 ms
- New: 1.057 ms
- 19.8% speed improvement
Script to get perf:
```
import torch
import time


def benchmark_sdpa(config, iterations=100):
    device = config.get("device", "cpu")
    batch = config["batch"]
    heads = config["heads"]
    q_len = config["q_len"]
    k_len = config["k_len"]
    head_dim = config["head_dim"]

    q = torch.randn(batch, heads, q_len, head_dim, device=device, dtype=torch.float32)
    k = torch.randn(batch, heads, k_len, head_dim, device=device, dtype=torch.float32)
    v = torch.randn(batch, heads, k_len, head_dim, device=device, dtype=torch.float32)

    # Warm-up runs, excluded from the timing below
    for _ in range(5):
        _ = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        if device == "mps":
            torch.mps.synchronize()

    total_time = 0.0
    for i in range(iterations):
        start = time.perf_counter()
        _ = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        if device == "mps":
            torch.mps.synchronize()
        end = time.perf_counter()
        total_time += end - start

    avg_time = total_time / iterations
    print(f"[{config['name']}] Avg time per run: {avg_time * 1000:.3f} ms over {iterations} iterations")
    return avg_time


def main():
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    print(f"Running benchmarks on device: {device}")

    benchmarks = [
        {
            "name": "Vector Fast - Small q_len & moderate k_len",
            "batch": 1,
            "heads": 8,
            "q_len": 1,  # small query sequence length triggers vector fast path
            "k_len": 256,  # moderate key length
            "head_dim": 64,
            "device": device,
        },
        {
            "name": "Vector 2-pass - Small q_len & long k_len",
            "batch": 1,
            "heads": 8,
            "q_len": 1,  # small query sequence length
            "k_len": 4096,  # long key length triggers the 2-pass variant
            "head_dim": 64,
            "device": device,
        },
        # {
        #     "name": "Full Attention - Moderate q_len/k_len",
        #     "batch": 1,
        #     "heads": 8,
        #     "q_len": 128,  # longer query sequence length
        #     "k_len": 8192,  # matching key length for full attention paths
        #     "head_dim": 64,
        #     "device": device,
        # },
        # {
        #     "name": "Full Attention - Longer q_len/k_len",
        #     "batch": 1,
        #     "heads": 8,
        #     "q_len": 128,  # very long sequence length
        #     "k_len": 8192,
        #     "head_dim": 64,
        #     "device": device,
        # },
    ]

    iterations = 100
    for config in benchmarks:
        benchmark_sdpa(config, iterations=iterations)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152781 Approved by: https://github.com/malfet	2025-05-07 00:40:11 +00:00
Nikita Shulga	0ffd31dc8a	[MPS] Migrate div rounding modes (#152758 ) By implementing `div_floor` and `div_trunc`. Do not mark `div_trunc` as OPMATH, to align following output with CPU (if division is performed in fp32, then the result will be truncated to 25)
```
import torch
print(torch.tensor([[-7.4688, -3.1289]], dtype=torch.float16,device="cpu").div(torch.tensor([-0.2988, -0.8789], dtype=torch.bfloat16,device="cpu"), rounding_mode="trunc"))
tensor([[24., 3.]])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152758 Approved by: https://github.com/dcci ghstack dependencies: #152663, #152515, #152737, #152743	2025-05-05 03:02:29 +00:00
Isalia20	99c42722f6	[MPS] fix memory leak in sdpa float32 (#152371 ) Fixes #152344 Leak seems to be on the MPS Graph side, even though there is an identity tensor it seems like it's no longer enough to bypass the SDPA sequence which seems to leak memory. Even adding 0.0f seems to be optimized to be ignored and still take the sdpa sequence(that's the reason for adding 1e-20) Pull Request resolved: https://github.com/pytorch/pytorch/pull/152371 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-04-29 04:51:10 +00:00
Isalia20	899eec665c	[MPS] col2im kernel implementation (#152282 ) Fixes #151820 Also requested in #141287 Mainly based on the cuda kernel implementations Pull Request resolved: https://github.com/pytorch/pytorch/pull/152282 Approved by: https://github.com/malfet	2025-04-28 03:48:41 +00:00
Nikita Shulga	3ef6d6924a	[BE] Switch `TestConsistency` to MPS device (#147893 ) Which will eventually allow move decorators away more `common_mps.py` Adjust tolerances accordingly. XFAIL a bunch of tests on MacOS-13, which is going to be deprecated anyway Pull Request resolved: https://github.com/pytorch/pytorch/pull/147893 Approved by: https://github.com/atalman ghstack dependencies: #152204	2025-04-26 01:19:21 +00:00
Isalia20	5e9bdc9b86	[MPS] layernorm forward kernel (#152010 ) Implements layernorm forward pass as a metal kernel instead of MPSGraph ops. Speed ups are indicated on the chart below: ![Figure_1](https://github.com/user-attachments/assets/27a4d2ef-b3e4-4650-9ce3-b939c080321e) Script for generating times, need to build torch with old/new codebase and then run this with different file name indicated at the end of the script
```python
import csv
import time

import numpy as np

import torch
import torch.nn.functional as F

matrix_sizes = [32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]
batch_sizes = [1]
elementwise_affine = [False, True]
num_runs = 50
warmup_runs = 3


def create_input_tensor(n, batch_size):
    torch.manual_seed(42)
    return torch.randn(batch_size, n, dtype=torch.float32)


def run_layer_norm(A, normalized_shape, elementwise_affine):
    # Note: `elementwise_affine` is accepted but not used by this call;
    # F.layer_norm is invoked without weight or bias.
    torch.mps.synchronize()
    start = time.perf_counter()
    out = F.layer_norm(A, normalized_shape)
    torch.mps.synchronize()
    end = time.perf_counter()
    return out, end - start


results = {"N": [], "elementwise_affine": [], "batch_size": [], "mean_time": [], "std_time": []}

for el_aff in elementwise_affine:
    for n in matrix_sizes:
        for batch_size in batch_sizes:
            print(f"\nBenchmarking LayerNorm for input size N={n}, batch_size={batch_size}, elementwise_affine={el_aff}")

            try:
                A_cpu = create_input_tensor(n, batch_size)
                A_mps = A_cpu.to("mps")

                normalized_shape = (n,)

                # Warm-up runs, excluded from the statistics
                for _ in range(warmup_runs):
                    _, _ = run_layer_norm(A_mps, normalized_shape, el_aff)

                times = []
                for _ in range(num_runs):
                    _, t = run_layer_norm(A_mps, normalized_shape, el_aff)
                    times.append(t)

                mean_time = np.mean(times)
                std_time = np.std(times)

                results["N"].append(n)
                results["elementwise_affine"].append(el_aff)
                results["batch_size"].append(batch_size)
                results["mean_time"].append(mean_time)
                results["std_time"].append(std_time)

                print(f"Mean time: {mean_time:.4f}s ± {std_time:.4f}s")

            except RuntimeError as e:
                print(f"Error for N={n}, batch_size={batch_size}: {e}")
                continue

# Change the file name here when switching between the old and new builds
with open("layernorm_benchmark_times_new.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["N", "elementwise_affine", "batch_size", "mean_time", "std_time"])
    for i in range(len(results["N"])):
        writer.writerow(
            [
                results["N"][i],
                results["elementwise_affine"][i],
                results["batch_size"][i],
                results["mean_time"][i],
                results["std_time"][i],
            ]
        )
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152010 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-04-24 05:07:46 +00:00
Nikita Shulga	3aecf2dc52	[MPS] Extend index_put to half precision floats (#151869 ) By reusing `c10/metal/atomic.h` This also fixes `GPUTests.test_index_put_fallback[12]_mps` that is unrolled by inductor, so no need for dedicated atomic_add support TODOs: - Get rid of indexing kernel and compute it directly when kernel is run - Simulate atomic_add for int64 types as series of int32 atomic-add-and-fetch - Setup tolerances correctly to pass float16/bfloat16 tests (as CPU always takes sequential strategy) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151869 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-04-22 22:00:08 +00:00
Li-Huai (Allan) Lin	fbd29527d8	[MPS] Move ops modifiers to testing utils so other tests can reuse (#151781 ) Test collection check: ``` python -m pytest test/test_mps.py --collect-only ``` Before: ``` 6390 tests collected in 8.34s ``` After: ``` 6390 tests collected in 7.71s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/151781 Approved by: https://github.com/malfet	2025-04-22 19:19:52 +00:00
Nikita Shulga	f37e138bc4	[MPS] Enable log1p and sigmoid for int64 (#151791 ) It works on MacOS-15, but likely will need a skip for MacOS-13 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151791 Approved by: https://github.com/Skylion007 ghstack dependencies: #151790	2025-04-21 18:30:04 +00:00
Davide Italiano	470132c6a1	[MPS] Add support for hermite_polynomial_he (inductor/eager). (#151754 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151754 Approved by: https://github.com/malfet, https://github.com/jansel	2025-04-20 17:44:40 +00:00
Nikita Shulga	14293c2377	[MPS] Allow isin for mixed types (#151600 ) To follow pattern set by CPU and CUDA impls: define common_dtype and optionally casts `elements` and `test_elements` to common dtype if needed - Add regression test, though skip it on MacOS-13, as `isin` seems to produce garbage there even for same dtypes ``` >>> import torch >>> x=torch.arange(4.0, device='mps') >>> y=torch.arange(1.0, 3.0, device='mps') >>> x, y, torch.isin(x, y), torch.isin(y, x) (tensor([0., 1., 2., 3.], device='mps:0'), tensor([1., 2.], device='mps:0'), tensor([False, True, False, False], device='mps:0'), tensor([False, False], device='mps:0')) >>> torch.__version__ '2.6.0' ``` - Cleanup code a bit Fixes https://github.com/pytorch/pytorch/issues/151443 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151600 Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/kulinseth	2025-04-18 12:30:32 +00:00
Nikita Shulga	1ffaa00ad7	[MPS] Migrate `bitwise_not` to unary operator (#151460 ) That kills two birds with one stone: - Makes implementations more standardized (and faster for strided inputs/outputs) - Fixes a bug in strided in-place bitwise_not I.e. before this change
```python
import torch
x=torch.arange(32, device="mps")
x[::2].bitwise_not_()
print(x)
```
produced
```
tensor([ -1, -2, -3, -4, -5, -6, -7, -8, -9, -10, -11, -12, -13, -14, -15, -16, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], device='mps:0')
```
after, it generates reasonable output
```
tensor([ -1, 1, -3, 3, -5, 5, -7, 7, -9, 9, -11, 11, -13, 13, -15, 15, -17, 17, -19, 19, -21, 21, -23, 23, -25, 25, -27, 27, -29, 29, -31, 31], device='mps:0')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151460 Approved by: https://github.com/dcci, https://github.com/qqaatw, https://github.com/Skylion007	2025-04-16 21:34:45 +00:00
Nikita Shulga	b8a2824755	[MPS] Fix logit output for half/bfloat (#151282 ) Which also fixes MPSInductor pointwise test TODO: (as followup PRs): get rid of special native_function.yaml dispatches and use stub Pull Request resolved: https://github.com/pytorch/pytorch/pull/151282 Approved by: https://github.com/dcci ghstack dependencies: #151224, #151246, #151272	2025-04-15 06:25:00 +00:00
Li-Huai (Allan) Lin	ddfc14b3ae	[MPS] Fix where (#151176 ) Fixes #150967 Pull Request resolved: https://github.com/pytorch/pytorch/pull/151176 Approved by: https://github.com/kulinseth, https://github.com/malfet	2025-04-13 20:44:50 +00:00
Nikita Shulga	bc47d539fc	[MPS] Support ArgumentBuffer bindings from C++/Python (#150780 ) To work around the limitation of 32 arguments per kernel and to eventually be able to compile something like
```python
import torch

def foo(args):
    rc = torch.empty_like(args[0])
    for arg in args:
        rc += arg
    return rc

tensors = torch.rand(100, 32, device='mps').unbind(0)
print(torch.compile(foo)(tensors))
```
For now, introduce `at::native::metal::get_tensor_gpu_address` and use it from both C++ test and compile_shader to convert list of tensors to list of pointers valid on GPU. Initially these bindings were done via `id< MTLArgumentEncoder>`, but according to [Improving CPU Performance by Using Argument Buffers](https://developer.apple.com/documentation/metal/improving-cpu-performance-by-using-argument-buffers?language=objc#Encode-Resources-into-Argument-Buffers) article, this is not necessary when targeting Tier2-only devices (which is true of all devices on MacOS-13 or newer): > To directly encode the argument buffer resources on these Tier 2 devices, write the [MTLBuffer](https://developer.apple.com/documentation/metal/mtlbuffer?language=objc).[gpuAddress](https://developer.apple.com/documentation/metal/mtlbuffer/gpuaddress?language=objc) property — and for other resource types (samplers, textures, and acceleration structures), the [gpuResourceID](https://developer.apple.com/documentation/metal/mtlcomputepipelinestate/gpuresourceid?language=objc) property — into the corresponding structure member. To encode offsets, treat these property values as uint64 types and add the offset to them. Add both C++ and Python unittests that validate that this works. Please note that using either ArgumentEncoder or directly encoding the data does not guarantee the buffer will not be freed until shader execution is complete. On the other hand, this should already be guaranteed by MPSCachingAllocator, which only frees the memory after all streams have completed execution. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150780 Approved by: https://github.com/dcci	2025-04-09 04:24:37 +00:00
Isalia20	49f6cce736	[MPS] grad scaler (#150255 ) Fixes #142397 Basic implementation is done. What's left: - [x] Different dtype/device tensors in the TensorList - [x] fast path for grouping the foreach kernel - [x] Tests Regarding tests, I found some tests in `test/test_torch.py` for GradScaler but I couldn't figure out what is the best way to enable the test for MPS device. By removing `@onlyNativeDeviceTypes`, one enables the tests for MPS but also enables tests for all other devices which are not included in the native device types. If I put: `instantiate_device_type_tests(TestTorchDeviceType, globals(), allow_mps=True)` This enables lots of tests in that class for MPS which were not(?) being tested before? This part needs some clarification Pull Request resolved: https://github.com/pytorch/pytorch/pull/150255 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-04-06 17:06:55 +00:00
Isalia20	cfea55dbec	[MPS] fix inverse bug for N>1024 (#146754 ) Fixes #138200 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146754 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-04-05 21:49:21 +00:00
Nikita Shulga	7ac8186851	[MPSInductor] Speedup `sum`/`prod` reductions (#150566 ) By using cooperative `simd_sum`/`simd_product` instead of a C-style for loop for threadgroup reductions. This also significantly reduces the amount of shared memory needed to perform those reductions. Using such reduction increases the `torch.compile` performance for gpt-fast using `stories110M` from 29 tokens/sec to 630 tokens/sec on M4 and changes perf of torch.rand as follows:
| size | before | after |
|------|--------|-------|
| 512x512 | 202.1 | 131.8 |
| 1024x1024 | 780.6 | 176.9 |
| 2048x2048 | 1423.4 | 339.9 |
| 4096x4097 | 2982.2 | 1047.2 |
Unfortunately, none of the SIMDgroup operations are available for 64-bit integers, but one can simulate the behavior using `simd_shuffle_down` of 64-bit values represented as `int2` types, which yields a reduction in $\log_2(threadgroup\_size)$ steps. [`mlx/kernels/reduction/ops.h](`86389bf970/mlx/backend/metal/kernels/reduction/ops.h (L15-L18)`) contains an implementation of such an algorithm, but alas it yields wrong results on M1/M2 (and maybe M3) machines if not all threads in the simdgroup are active, which can be observed by running
```python
import torch
lib=torch.mps.compile_shader("""
kernel void do_sum(device int* out, constant int* in, uint idx [[thread_position_in_grid]]) {
  out[idx] = metal::simd_shuffle_down(in[idx], 8);
}
""")
x=torch.arange(22, device='mps', dtype=torch.int32)
y=torch.empty_like(x)
lib.do_sum(y, x)
print(y)
```
that returns following on M4
```
tensor([ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 0, 0, 0, 0, 0, 0, 0, 0], device='mps:0', dtype=torch.int32)
```
but same kernel running on M1 returns
```
tensor([ 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 14, 15, 16, 17, 18, 19, 20, 21], device='mps:0', dtype=torch.int32)
```
This discrepancy in behavior can be addressed by using `simd_shuffle_and_fill_down`, but any kernels using simd_shuffle_and_fill_down cause an internal compiler error on MacOS-13.2. Considering that OS is to be EOL soon, skip the offending tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/150566 Approved by: https://github.com/manuelcandales ghstack dependencies: #150452, #150457	2025-04-05 02:47:27 +00:00
Nikita Shulga	827b730f4e	[CI] Skip test_copy_large_tensor on M2-15 runners (#150377 ) They have more than 12Gb memory, but maybe running this test causes OOM in CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/150377 Approved by: https://github.com/atalman	2025-04-01 02:33:43 +00:00
Davide Italiano	b48505a8a1	[MPS] Add support for hermite_polynomial_h. (#150279 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/150279 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-03-31 23:30:19 +00:00
Nikita Shulga	7c65911b11	[MPS] Fix dot/mm for conj_tensors (#150157 ) - Distinguish between conjugated/non_conjugated inputs by appending conjugation to the operator key - For matmul or dot, add `conjugateWithTensor:name:` calls before running the op - Enable testing for conjugated ops by passing `include_conjugated_inputs` to opinfo - Filter `include_conjugated_inputs` argument from `sample_inputs_window` (probably should have landed as separate PR) - Preserve conj property when gathering the views, that fixes `cov` operator Fixes https://github.com/pytorch/pytorch/issues/148156 Pull Request resolved: https://github.com/pytorch/pytorch/pull/150157 Approved by: https://github.com/dcci	2025-03-28 20:36:44 +00:00
Nikita Shulga	ef1cb6b646	[BE] Suppress user_warnings while running opinfo tests (#150115 ) Some of the samples are constructed in a way that are expected to trigger those, but what's the point displaying them Pull Request resolved: https://github.com/pytorch/pytorch/pull/150115 Approved by: https://github.com/dcci ghstack dependencies: #150060	2025-03-27 22:36:27 +00:00
Nikita Shulga	6aca002d82	[MPS] Add `chebyshev_polynomial_[uvw]` (#150060 ) For both eager and inductor Pull Request resolved: https://github.com/pytorch/pytorch/pull/150060 Approved by: https://github.com/dcci, https://github.com/jansel	2025-03-26 23:35:05 +00:00
Nikita Shulga	de68ddc68e	[MPS] Fix metal ops with different dtypes (#149974 ) By implementing `_cast_` flavors of both dense and strided ops. Add regression tests that test `fmax`/`fmin` for mixed dtypes. I have been dreading writing this PR for a while, as it ended up being pretty bulky:
- Adds `C10_METAL_ALL_TYPES_FUNCTOR` and `c10::metal::ScalarType` to `c10/metal/common.h` and tests that its values always match `c10::ScalarType`
- Add `c10::metal::cast_to` to `c10/metal/utils.h` which could be used to cast any scalar metal dtype to any other one, including complex values
- Implement `val_at_offs<T>(constant void *, long offs, ScalarType dtype)` that is used to dynamically cast types
- Add `binary_strided_cast` and `binary_dense_cast` that are invoked for output dtype and cast both inputs to that output before performing the op
Benchmark collected on M2Pro that runs fmax for 1 mln element tensors (Times are in microseconds.)
| | dense-dense | transp-transp | dense-transp | transp-dense | dense-scalar | dense-bcast |
|---|---|---|---|---|---|---|
| fmax (torch.float16, torch.float16) | 160.9 | 159.9 | 270.5 | 270.9 | 236.6 | 293.0 |
| fmax (torch.float32, torch.float32) | 176.9 | 171.0 | 273.7 | 293.5 | 242.6 | 294.2 |
| fmax (torch.float32, torch.float16) | 171.4 | 170.9 | 283.6 | 303.0 | 253.7 | 302.3 |
| add (torch.float16, torch.float16) | 218.0 | 223.6 | 221.0 | 222.0 | 214.9 | 218.3 |
| add (torch.float32, torch.float32) | 227.4 | 233.9 | 228.8 | 231.9 | 218.9 | 221.4 |
| add (torch.float32, torch.float16) | 226.1 | 227.5 | 227.5 | 226.9 | 177.0 | 190.8 |
TODOS:
- Include input and output dtype in non-cast kernel name
- Make TensorFactory.h use `C10_METAL_ALL_TYPES_FUNCTOR`
- Extend mixed_dtypes testing via OpInfo
Fixes https://github.com/pytorch/pytorch/issues/149951 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149974 Approved by: https://github.com/manuelcandales	2025-03-26 07:03:21 +00:00
Isalia20	ba46643df1	[MPS] tril op not handling infs correctly (#149866 ) Fixes #149813 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149866 Approved by: https://github.com/malfet	2025-03-24 23:38:41 +00:00
Davide Italiano	9179178728	[MPS] Add support for `chebyshev_polynomial_t` in eager. (#149816 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149816 Approved by: https://github.com/malfet	2025-03-24 19:19:55 +00:00
Isalia20	248487f455	[MPS] nanmedian with dims (#149680 ) Third most voted op from #77764 Tests were deleted because they are covered by the regular test_output_match tests so those were redundant and were added in the last PR before the nanmedian dim version would be implemented Pull Request resolved: https://github.com/pytorch/pytorch/pull/149680 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-03-24 03:49:16 +00:00
Davide Italiano	b9a5e1d038	[MPS] Add support for scaled_modified_bessel_k1 to eager. (#149783 ) Another day another op Pull Request resolved: https://github.com/pytorch/pytorch/pull/149783 Approved by: https://github.com/malfet	2025-03-22 02:13:41 +00:00
Davide Italiano	bdc132d0e1	[MPS] Add support for scaled_modified_bessel_k0 for eager. (#149705 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149705 Approved by: https://github.com/malfet	2025-03-21 16:14:29 +00:00
Davide Italiano	0ed34210b2	[MPS] Add support for `modified_bessel_k1` to eager and inductor. (#149687 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149687 Approved by: https://github.com/malfet	2025-03-21 04:59:06 +00:00
Isalia20	95e71765f2	[MPS] nanmedian implementation (#149407 ) Implements nanmedian on MPS. This implementation only implements `torch.nanmedian(tensor)` without `keepdim` and `dim` Will implement nanmedian with dim and keepdim in a followup Pull Request resolved: https://github.com/pytorch/pytorch/pull/149407 Approved by: https://github.com/malfet	2025-03-20 03:50:26 +00:00
Davide Italiano	88c2fe533f	[MPS] Add `modified_bessel_k0` support to eager. (#149563 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149563 Approved by: https://github.com/malfet	2025-03-19 23:10:55 +00:00
Nikita Shulga	2e0c98ff05	[MPS] Add `bicubic2d_aa` (#149378 ) Which is currently the most frequently requested op in https://github.com/pytorch/pytorch/issues/141287 Mostly done by refactoring `upsample_bilinear2d_aa` to accept Functor as one of the template arguments, which closely follows ideas from `eec43cfbc0/src/libImaging/Resample.c` as well as `bb42e4d137/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu (L472-L478)` Populate unit tests by copying upsample_bilinear_2d_aa and reusing it as upsample_bicubic2d_aa At that point, only difference between upsample_bilinear2d_aa and upsample_bicubic2d_aa are convolution kernel function and size: for bilinear it's 3x3, for bicubic it's 5x5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149378 Approved by: https://github.com/dcci	2025-03-18 05:35:41 +00:00
Davide Italiano	c43e35d6f7	[MPS] Implement support for `modified_bessel_i1` in eager. (#149368 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149368 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-03-18 03:29:10 +00:00
Davide Italiano	186cc7327c	[MPS/BE] Remove decorator that skipped test on macOS 12. (#149365 ) macOS 12 is not really supported anymore. Pull Request resolved: https://github.com/pytorch/pytorch/pull/149365 Approved by: https://github.com/malfet	2025-03-18 00:58:08 +00:00
Davide Italiano	9f33c6f0a0	[MPS] Add support for modified_bessel_i0 in eager. (#149264 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149264 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-03-16 04:45:49 +00:00
Nikita Shulga	96795e9533	[BE] Parametrize `TestMPS.test_binops_dtype_precedence` (#149234 ) No op change, just splits a longer test into a series of smaller ones Pull Request resolved: https://github.com/pytorch/pytorch/pull/149234 Approved by: https://github.com/atalman, https://github.com/dcci ghstack dependencies: #149216, #149233	2025-03-15 00:37:11 +00:00
Isalia20	dd6e9df3d0	[MPS] fix attention enable_gqa crash on mps (#149147 ) Fixes #149132 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149147 Approved by: https://github.com/malfet	2025-03-14 21:25:54 +00:00
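
Note on commit 9cdce682a1 above: its message cites two ways of computing log(1+x) accurately for small x, the correction from Goldberg's floating-point article and a three-term Taylor series in Horner's form. The following is only a minimal CPU-side sketch of those two formulas in plain Python, not the Metal shader from the PR; the function names `log1p_goldberg` and `log1p_taylor3` are hypothetical and chosen for illustration.

```python
import math


def log1p_goldberg(x: float) -> float:
    """log(1 + x) for x > -1 using the correction quoted in the commit message.

    Illustrative helper only (hypothetical name), not PyTorch code.
    """
    u = 1.0 + x
    if u == 1.0:
        # x is too small to change 1.0, so log(1 + x) rounds to x itself.
        return x
    # The rounding error made when forming u largely cancels between
    # log(u) and (u - 1), which is what keeps the result accurate.
    return x * math.log(u) / (u - 1.0)


def log1p_taylor3(x: float) -> float:
    """First three Taylor terms, x - x^2/2 + x^3/3, in Horner's form.

    Only a reasonable approximation for |x| much smaller than 1.
    """
    return x * (1.0 - x * (0.5 - x / 3.0))


if __name__ == "__main__":
    # Compare both sketches against the standard library reference.
    for x in (1e-20, 1e-10, 1e-5, 0.5):
        print(x, log1p_goldberg(x), log1p_taylor3(x), math.log1p(x))
```

The Taylor variant drifts away from `math.log1p` as |x| grows (compare the 0.5 row), which is presumably why it is only of interest for small inputs.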

737 Commits