pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Nikita Shulga	28ccc9e724	[MPS] Extend `index_put` to complex types (#160159) And delete the confusing supported-types check. Move all pseudo-atomic (but eventually consistent) ops to the `c10/metal/atomic.h` header. Fixes https://github.com/pytorch/pytorch/issues/160034 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160159 Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/Skylion007	2025-08-08 21:54:30 +00:00
Nikita Shulga	e2a5c42e7e	[BE][MPS] Build metal kernels of MacOS-14+ (#159733) Which makes the `#if __METAL_VERSION__ >= 310` guards for `bfloat` support unnecessary. Rename `kernels_bfloat.metallib` to `kernels_basic` and remove the custom build/selection logic. Part of https://github.com/pytorch/pytorch/issues/159275 Pull Request resolved: https://github.com/pytorch/pytorch/pull/159733 Approved by: https://github.com/dcci ghstack dependencies: #159731, #159732	2025-08-03 20:53:58 +00:00
Nikita Shulga	9ca080db87	[MPS] Extend atomic operations to all int types (#158179) That fixes `index_put(..., accumulate=True)` for all dtypes. The int64 operation is not really atomic, but eventually consistent from the `index_put_accumulate` kernel point of view: i.e. by the end of the operation, the results in global memory are indeed the accumulation of the operands at the given indices. Pull Request resolved: https://github.com/pytorch/pytorch/pull/158179 Approved by: https://github.com/dcci, https://github.com/Skylion007 ghstack dependencies: #158064, #158178	2025-07-14 04:25:05 +00:00
Nikita Shulga	beed033b6e	[MPS] Fix `index_kernel` for large tensors (#158064) Move `MetalShaderLibrary::bind_tensors` private method to OperatorUtils.h and extract `iter_tensor_offset` method, that returns an offset from the start of the storage associated with given tensor inside the iterator. Migrated `index`, `index_put[_accumulate][_serial]` to the new paradigm that does not require additional tensor for indices nor special handling for 32 vs 64-bit offset, which resulted in almost 2x perf gain for 2000x2000 tensor, see results below.
Before:
```
[------------------------------------------------------------  -----------------------------------------------------------]
                                           | 11x50x50 | 11x100x100 | 11x500x500 | 11x1000x1000 | 11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
  __getitem__ (torch.int8, torch.int64)    |  383.5   |   379.8    |   470.9    |    1232.9    |    4410.3
  __getitem__ (torch.float16, torch.int64) |  379.6   |   354.5    |   533.2    |    1290.3    |    4442.2
  __getitem__ (torch.float32, torch.int64) |  360.8   |   338.6    |   478.6    |    1348.9    |    4870.4

Times are in microseconds (us).
```
After:
```
[------------------------------------------------------------  -----------------------------------------------------------]
                                           | 11x50x50 | 11x100x100 | 11x500x500 | 11x1000x1000 | 11x2000x2000
1 threads: ----------------------------------------------------------------------------------------------------------------
  __getitem__ (torch.int8, torch.int64)    |  349.8   |   330.5    |   432.6    |    764.5     |    1961.2
  __getitem__ (torch.float16, torch.int64) |  342.5   |   330.7    |   434.7    |    741.0     |    1969.4
  __getitem__ (torch.float32, torch.int64) |  332.2   |   326.1    |   445.4    |    751.3     |    1972.6

Times are in microseconds (us).
```
While migrating also fixed index_put_accumulate for boolean types, by using compare_and_exchange trick over uint. Fixes https://github.com/pytorch/pytorch/issues/153560 Pull Request resolved: https://github.com/pytorch/pytorch/pull/158064 Approved by: https://github.com/dcci	2025-07-11 22:35:44 +00:00
Nikita Shulga	3aecf2dc52	[MPS] Extend index_put to half precision floats (#151869) By reusing `c10/metal/atomic.h`. This also fixes `GPUTests.test_index_put_fallback[12]_mps`, which is unrolled by inductor, so there is no need for dedicated atomic_add support. TODOs: - Get rid of the indexing kernel and compute it directly when the kernel is run - Simulate atomic_add for int64 types as a series of int32 atomic-add-and-fetch - Set up tolerances correctly to pass float16/bfloat16 tests (as CPU always takes the sequential strategy) Pull Request resolved: https://github.com/pytorch/pytorch/pull/151869 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-04-22 22:00:08 +00:00
Nikita Shulga	d778c92e16	[Metal][BE] Move atomic ops to c10/metal/atomic.h (#151868) To be reused from indexing and the MPSInductor implementation of atomic_add stores. Added a wrapper for `metal::atomic<int>` (to be used by a followup PR). Pull Request resolved: https://github.com/pytorch/pytorch/pull/151868 Approved by: https://github.com/Skylion007	2025-04-22 14:11:29 +00:00

6 Commits
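
Several of the messages above describe int64 accumulation as "not really atomic, but eventually consistent", and one TODO suggests simulating it as a series of int32 adds. The sketch below is a sequential, pure-Python model of that idea, not the Metal kernel from these commits: the function names (`add_int64_as_two_words`, `index_put_accumulate`) are hypothetical, and each masked 32-bit add merely stands in for what would be an atomic operation on the GPU. It shows that adding the low and high words separately, with a carry, yields the same totals regardless of the order in which updates arrive.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def add_int64_as_two_words(lo, hi, value):
    """Add a (possibly negative) 64-bit value to a number held as two 32-bit words."""
    value &= MASK64                          # two's-complement view of the operand
    new_lo = (lo + (value & MASK32)) & MASK32
    carry = 1 if new_lo < lo else 0          # low word wrapped, so carry into the high word
    new_hi = (hi + (value >> 32) + carry) & MASK32
    return new_lo, new_hi


def to_signed64(lo, hi):
    """Reassemble the two words into a signed 64-bit Python int."""
    raw = (hi << 32) | lo
    return raw - (1 << 64) if raw >= (1 << 63) else raw


def index_put_accumulate(size, indices, values):
    """Model of accumulate=True: duplicate indices add up instead of overwriting."""
    words = [(0, 0)] * size
    for i, v in zip(indices, values):
        lo, hi = words[i]
        words[i] = add_int64_as_two_words(lo, hi, v)
    return [to_signed64(lo, hi) for lo, hi in words]


# The same updates applied in two different orders give the same final totals.
print(index_put_accumulate(3, [0, 2, 0, 2, 1], [2**32 - 1, -5, 1, 7, -(2**40)]))
print(index_put_accumulate(3, [2, 1, 0, 2, 0], [7, -(2**40), 1, -5, 2**32 - 1]))
```

Because addition modulo 2^64 is commutative, only the final state is guaranteed to match the sequential result; an observer reading the two words between the low-word add and the carry could see an intermediate value, which is the sense in which the operation is eventually consistent rather than atomic.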