Commit Graph

16882 Commits

Author SHA1 Message Date
0d71ca2c46 [EZ] Replace pytorch-labs with meta-pytorch (#160459)
This PR replaces all instances of 'pytorch-labs' with 'meta-pytorch' in this repository, now that the 'pytorch-labs' org has been renamed to 'meta-pytorch'.

## Changes Made
- Replaced all occurrences of 'pytorch-labs' with 'meta-pytorch'
- Only modified files with extensions: .py, .md, .sh, .rst, .cpp, .h, .txt, .yml
- Skipped binary files and files larger than 1 MB, due to GitHub API payload limits in the script (which covers all repos in this org). A more manual second pass will follow later to cover any larger files

## Files Modified
This PR updates files that contained the target text.

Generated by an automated script on 2025-08-12T20:41:29.888681+00:00
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160459
Approved by: https://github.com/huydhn, https://github.com/clee2000, https://github.com/atalman, https://github.com/malfet
2025-08-12 22:44:25 +00:00
2e4e5ab4be [MPS] Add mps keys to indices and values ops (#160223)
Enable `indices` and `values` ops on sparse MPS tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160223
Approved by: https://github.com/malfet
2025-08-12 22:08:44 +00:00
f27232a213 [ROCm] Limit number of values per thread for reductions on three dimensions (#159652)
In the current implementation of reductions in three dimensions for AMD GPUs, the number of values per thread is unbounded and can end up being in the hundreds of thousands for certain tensors, which is of course bad for performance. This patch fixes the issue by increasing the parallelism, thus lowering the number of values per thread to a reasonable limit, i.e. fewer than 2048 values per thread. The performance gains can be between 10x and 17x for certain examples where the number of values per thread was originally very high.
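
A timing sketch along the lines of the reproducers used elsewhere in this log (the shape is hypothetical, not from the PR; it just makes the reduced dimension very large):
```python
import time
import torch

# hypothetical 3D tensor with a very large reduced dimension (not from the PR)
x = torch.randn(8, 4, 1_048_576, device='cuda')
for _ in range(10):  # warm-up
    x.sum(dim=2)
torch.cuda.synchronize()
start = time.time()
for _ in range(100):
    x.sum(dim=2)
torch.cuda.synchronize()
print(f"Avg time for sum(dim=2): {(time.time() - start) / 100 * 1e6:.2f} us")
```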

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159652
Approved by: https://github.com/jeffdaily
2025-08-12 21:15:56 +00:00
eqy
9903ca4f70 [cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)
The native kernel doesn't support batch splitting, so the previous check wasn't aggressive enough in dispatching to cuDNN.

https://github.com/pytorch/pytorch/issues/155225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156140
Approved by: https://github.com/ngimel, https://github.com/atalman
2025-08-12 18:07:41 +00:00
f341077ce4 Revert "[ROCm] Support large inputs for coalesceValuesKernel (#158281)"
This reverts commit a7abf57aabec0ce686092e2d66e53ba185dbc56b.

Reverted https://github.com/pytorch/pytorch/pull/158281 on behalf of https://github.com/clee2000 due to broke windows cuda build? [GH job link](https://github.com/pytorch/pytorch/actions/runs/16915172288/job/47927141460) [HUD commit link](a7abf57aab).  Not caught b/c PR didn't have ciflow/trunk ([comment](https://github.com/pytorch/pytorch/pull/158281#issuecomment-3180408766))
2025-08-12 17:57:57 +00:00
ee9f8ba11d [ROCm] Use opportunistic fastatomics based on heuristics (#159430)
* Opportunistic fast atomics work better with small sizes, since there is a higher chance of lanes doing atomics on the same address

Co-author: @amd-hhashemi

Reproducer:
```
import time
import torch

x = torch.randn((1_632_960, 128), device='cuda', dtype=torch.float)
ind = torch.randint(0, x.size(0), size=(5_079_670,), device='cuda')
src = torch.randn((5_079_670, 128), device='cuda', dtype=torch.float)

for _ in range(20):  # warm-up
    x.index_add_(0, ind, src)
torch.cuda.synchronize()  # ensure warm-up kernels finish before timing

start_time = time.time()
for i in range(100):
    x.index_add_(0, ind, src)
torch.cuda.synchronize()
end_time = time.time()
mean_time = (end_time - start_time)/100
print(f"Avg time for index_add_: {mean_time * 1e6:.2f} us")
```

Perf numbers:
```
Before:
Avg time for index_add_: 25652.16 us

After:
Avg time for index_add_: 2675.15 us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159430
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-08-12 17:13:54 +00:00
a7abf57aab [ROCm] Support large inputs for coalesceValuesKernel (#158281)
# Description

`.coalesce` cannot handle large inputs on ROCm due to the maximum grid size limit.

This PR splits axis `X` into axes `X` and `Y`, and repurposes `Z` for the original `Y` on ROCm, to avoid this limitation.

Confirmed the new approach can handle large inputs. Correctness needs validation.

# Testing Command

`python torch_spmv.py 22500000 272500000`

## Script `torch_spmv.py`

``` python
import torch
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description="Sparse COO Matrix by Dense Vector Multiplication using PyTorch"
    )
    parser.add_argument("n", type=int, help="Size of the NxN matrix")
    parser.add_argument("nnz", type=int, help="Number of non-zero entries")
    return parser.parse_args()

def main():
    args = parse_args()
    n = args.n
    nnz = args.nnz
    dtype = torch.float32
    device = torch.device('cuda')

    # Generate random indices for the sparse matrix in COO format.
    torch.manual_seed(42)
    rows = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    cols = torch.randint(0, n, (nnz,), dtype=torch.int64, device=device)
    indices = torch.stack([rows, cols], dim=0)

    # Generate random values.
    values = torch.randn(nnz, dtype=dtype, device=device)

    # Create the sparse COO matrix directly on the target device.
    sparse_matrix = torch.sparse_coo_tensor(indices, values, size=(n, n), dtype=dtype, device=device)
    sparse_matrix = sparse_matrix.coalesce()

    # Generate a random dense vector.
    dense_vector = torch.randn(n, dtype=dtype, device=device)

    # Perform sparse matrix - dense vector multiplication.
    # Using torch.sparse.mm which expects a 2D tensor for the vector.
    result = torch.sparse.mm(sparse_matrix, dense_vector.unsqueeze(1)).squeeze()
    # result = torch.mv(sparse_matrix, dense_vector)

    # Print the result.
    print("Result of the multiplication:")
    print(torch.sum(result))

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158281
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-08-12 16:42:55 +00:00
5a40c57844 [MTIA] Implement isAvailable() for MTIA hooks (#160304)
Summary: MTIA is missing the `isAvailable()` override, which is necessary for some of the device-agnostic methods.

Test Plan:
`torch._C._get_accelerator()`

Rollback Plan:

Differential Revision: D79981115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160304
Approved by: https://github.com/nautsimon
2025-08-11 21:45:11 +00:00
7d2ec704e4 Fix MPS autocast for ConvTranspose3d (#160345)
## Summary
- ensure ConvTranspose3d uses fp32 under MPS autocast
- add MPS autocast test for ConvTranspose3d
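
A minimal check of the new behavior (a sketch; assumes an MPS device is available):
```python
import torch

conv = torch.nn.ConvTranspose3d(2, 4, kernel_size=2).to("mps")
x = torch.randn(1, 2, 8, 8, 8, device="mps")
with torch.autocast(device_type="mps", dtype=torch.float16):
    out = conv(x)
# ConvTranspose3d is kept in fp32 under MPS autocast
assert out.dtype == torch.float32
```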

Generated by Codex, see https://chatgpt.com/codex/tasks/task_e_689a360388288327a2cac6f55bbfc42c

Fixes https://github.com/pytorch/pytorch/issues/160332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160345
Approved by: https://github.com/dcci
2025-08-11 21:01:52 +00:00
d25c4f954d [MPS] Type-promote tensor-iterator common dtype (#160334)
Otherwise, `torch.add(FloatTensor, IntTensor, alpha=2)` and `torch.add(FloatTensor, IntTensor, alpha=2.0)` were dispatched to different kernels
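
A sketch of the promotion at work (assumes an MPS device):
```python
import torch

x = torch.rand(4, device="mps")                      # float32
y = torch.ones(4, dtype=torch.int32, device="mps")   # int32
# the common dtype is promoted to float32, so both calls now hit the same kernel
print(torch.add(x, y, alpha=2))
print(torch.add(x, y, alpha=2.0))
```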

Fixes https://github.com/pytorch/pytorch/issues/160208
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160334
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-08-11 17:53:56 +00:00
d8cb3db533 Add unsigned support to IValue (#160102)
- Moved the repeated logic for saving int64/uint64 values into a polymorphic container into `THPUtils_unpackInteger`
- Added `TestPythonDispatch.test_dispatch_uint64` regression test
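
A quick illustration of the values this enables (a sketch; `uint64` values above the int64 range should now round-trip):
```python
import torch

v = 2**63 + 1  # too large for int64
t = torch.tensor([v], dtype=torch.uint64)
assert t.item() == v
```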

Fixes https://github.com/pytorch/pytorch/issues/159168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160102
Approved by: https://github.com/ezyang
2025-08-11 03:57:18 +00:00
842cc77ab9 [MPS] Extend addmm to integral types (#160270)
By adding an `addmm` kernel, which is a logical continuation of the `mm` one. The only tricky part is how the alpha and beta constants are handled: they are passed as `optmath_t`, i.e. they could be int64, int32, or float.

Unified all MM flavor instantiations through `INSTANTIATE_MM_OPS`, and verified that the `addmm` Metal kernel works as expected for floating types as well by running
```
 PYTORCH_MPS_PREFER_METAL=1 python test/test_mps.py -v -k test_output_match_addmm_mps_
```
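
With the new kernel, integral `addmm` works directly on MPS (a sketch):
```python
import torch

a = torch.randint(-10, 10, (4, 3), dtype=torch.int32, device="mps")
b = torch.randint(-10, 10, (3, 5), dtype=torch.int32, device="mps")
bias = torch.randint(-10, 10, (4, 5), dtype=torch.int32, device="mps")
out = torch.addmm(bias, a, b, alpha=2, beta=3)  # alpha/beta passed as optmath_t
print(out.dtype)  # torch.int32
```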

Fixes https://github.com/pytorch/pytorch/issues/154901
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160270
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #160228, #160234
2025-08-11 00:54:17 +00:00
a84b60c0c4 [MPS] Sparse coalesce more dtypes to match cpu (#160254)
Support more dtypes in sparse coalesce, to match the CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160254
Approved by: https://github.com/malfet
2025-08-10 12:25:18 +00:00
cyy
10e3514c96 Remove tensorexpr tests (#158928)
The tests are not maintained.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158928
Approved by: https://github.com/albanD, https://github.com/malfet
2025-08-09 02:21:22 +00:00
1128f4c2a8 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
Clean up tuple/tensor boilerplate in cuDNN SDPA, in preparation for nested/ragged tensor backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-08-08 22:22:48 +00:00
28ccc9e724 [MPS] Extend index_put to complex types (#160159)
Also delete the confusing supported-types check.
Move all pseudo-atomic (but eventually consistent) ops to the `c10/metal/atomic.h` header
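
For instance, a complex `index_put_` on MPS (a sketch):
```python
import torch

x = torch.zeros(4, dtype=torch.complex64, device="mps")
idx = torch.tensor([0, 2], device="mps")
vals = torch.tensor([1 + 2j, 3 - 4j], dtype=torch.complex64, device="mps")
x.index_put_((idx,), vals)
print(x)
```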

Fixes https://github.com/pytorch/pytorch/issues/160034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160159
Approved by: https://github.com/manuelcandales, https://github.com/dcci, https://github.com/Skylion007
2025-08-08 21:54:30 +00:00
5f5f508aa8 [ROCm] Ck backend UX refactor (#152951)
Refactors how the enablement/disablement of CK GEMMs and SDPA works; a usage sketch follows the list below.

- Adds the USE_ROCM_CK_GEMM compile flag for enabling CK GEMMs.
- USE_ROCM_CK_GEMM is set to True by default on Linux.
- Renames USE_CK_FLASH_ATTENTION to USE_ROCM_CK_SDPA.
- USE_ROCM_CK_SDPA is set to False by default.
- (USE_CK_FLASH_ATTENTION still works for now, but will be deprecated in a future release.)
- Prevents these CK libraries from being used unless PyTorch has been built specifically with the functionality AND is running on a system architecture that supports it.
- The getters for these library backends also do some validity checking, in case the user used an environment variable to change the backend. If the selection is invalid (i.e., one of the cases mentioned above is false), the backend is set back to the current non-CK default.
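
Assuming a ROCm build with the CK GEMM backend compiled in, selection then looks like the usual backend switch (a sketch; if the build or architecture does not support CK, the getter falls back to the non-CK default as described above):
```python
import torch

# request the CK BLAS backend; invalid selections fall back as described above
torch.backends.cuda.preferred_blas_library("ck")
print(torch.backends.cuda.preferred_blas_library())
```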

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152951
Approved by: https://github.com/eqy, https://github.com/jeffdaily, https://github.com/m-gallus

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-08-08 18:40:17 +00:00
84f7e88aef Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following APIs will be put under torch.accelerator (a usage sketch follows the list):
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats
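
Once these land, device-agnostic code can drop per-backend branches (a sketch, assuming an accelerator is present):
```python
import torch

if torch.accelerator.is_available():
    torch.accelerator.empty_cache()
    print(torch.accelerator.memory_allocated())
    print(torch.accelerator.max_memory_allocated())
```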

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-08-08 17:41:22 +00:00
d7114f05b1 Add DeviceAllocator as the base device allocator (#138222)
# Motivation
In line with the RFC [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories such as HuggingFace accelerate, which contains [many if-else conditionals](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code) around them. We would like to introduce a generic API set under the torch.accelerator namespace to generalize these use cases.

| Device-specific memory API (`torch.xxx.foo`) | Device-agnostic memory API (`torch.accelerator.foo`) |
| --- | --- |
| `torch.xxx.empty_cache` | `torch.accelerator.empty_cache` |
| `torch.xxx.reset_peak_memory_stats` | `torch.accelerator.reset_peak_memory_stats` |
| `torch.xxx.reset_accumulated_memory_stats` | `torch.accelerator.reset_accumulated_memory_stats` |
| `torch.xxx.memory_stats` | `torch.accelerator.memory_stats` |
| `torch.xxx.memory_allocated` | `torch.accelerator.memory_allocated` |
| `torch.xxx.max_memory_allocated` | `torch.accelerator.max_memory_allocated` |
| `torch.xxx.memory_reserved` | `torch.accelerator.memory_reserved` |
| `torch.xxx.max_memory_reserved` | `torch.accelerator.max_memory_reserved` |

# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD, https://github.com/Camyll
2025-08-08 17:41:10 +00:00
86eb65f7f0 [MPS] Move max_pool2d to Metal for stride != 1 (#157876)
This PR updates `max_pool2d` to use a Metal kernel instead of the old MPS graph impl. However, when the `stride` argument is 1 in all dimensions, the old implementation gives significantly better performance, so we fall back to it in that case. Below is a performance comparison of `max_pool2d` before and after this PR, obtained from this script: 2f02f2bf7a/max_pool_mps/perf.py
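
Both dispatch paths are reachable from the same Python call (a sketch, assuming an MPS device):
```python
import torch
import torch.nn.functional as F

x = torch.randn(3, 1000, 1000, device="mps")
y_metal = F.max_pool2d(x, kernel_size=5, stride=2)  # new Metal kernel
y_graph = F.max_pool2d(x, kernel_size=5, stride=1)  # stride 1 everywhere: falls back to the MPS graph impl
```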

<details><summary>Click to expand</summary>

case | before PR | after PR | speedup |   | case info
-- | -- | -- | -- | -- | --
0 | 0.014264 | 0.004473 | 3.188911245 |   | (3, 2, 2), {'kernel_size': 2, 'return_indices': True}
1 | 0.010752 | 0.00421 | 2.55391924 |   | (3, 2, 2), {'kernel_size': 2, 'return_indices': False}
2 | 0.020777 | 0.006123 | 3.393271272 |   | (3, 10, 10), {'kernel_size': 5, 'return_indices': True}
3 | 0.011065 | 0.005759 | 1.921340511 |   | (3, 10, 10), {'kernel_size': 5, 'return_indices': False}
4 | 0.01452 | 0.007829 | 1.854642994 |   | (3, 100, 100), {'kernel_size': 5, 'return_indices': True}
5 | 0.009258 | 0.007075 | 1.308551237 |   | (3, 100, 100), {'kernel_size': 5, 'return_indices': False}
6 | 0.188137 | 0.168688 | 1.115295694 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': True}
7 | 0.161362 | 0.154746 | 1.042753932 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': False}
8 | 0.182883 | 0.16945 | 1.079274122 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': True}
9 | 0.156875 | 0.163346 | 0.9603847049 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': False}
10 | 0.193433 | 0.167396 | 1.155541351 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': True}
11 | 0.158967 | 0.151246 | 1.051049284 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': False}
12 | 0.931071 | 0.932883 | 0.9980576342 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': True}
13 | 0.324496 | 0.3252 | 0.9978351784 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': False}
14 | 0.944071 | 0.936246 | 1.008357846 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': True}
15 | 0.322171 | 0.314854 | 1.023239343 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': False}
16 | 0.894158 | 0.886408 | 1.008743152 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': True}
17 | 0.309338 | 0.304146 | 1.017070749 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': False}
18 | 0.606 | 0.260546 | 2.325884873 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': True}
19 | 0.30445 | 0.231054 | 1.317657344 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': False}
20 | 0.474708 | 0.261925 | 1.812381407 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': True}
21 | 0.23175 | 0.231883 | 0.9994264349 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': False}
22 | 0.434475 | 0.266246 | 1.631855502 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': True}
23 | 0.236942 | 0.231792 | 1.022218196 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': False}
24 | 0.202396 | 0.174888 | 1.157289237 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': True}
25 | 0.160679 | 0.158246 | 1.015374796 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': False}
26 | 0.200354 | 0.184133 | 1.088093932 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': True}
27 | 0.160779 | 0.160679 | 1.000622359 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': False}
28 | 0.199175 | 0.178625 | 1.115045486 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': True}
29 | 0.159458 | 0.160883 | 0.9911426316 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': False}
30 | 0.199021 | 0.165329 | 1.203787599 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': True}
31 | 0.156337 | 0.158213 | 0.9881425673 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': False}
32 | 0.180146 | 0.174483 | 1.032455884 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': True}
33 | 0.156988 | 0.158167 | 0.9925458534 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': False}
34 | 0.182133 | 0.176521 | 1.031792251 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': True}
35 | 0.169042 | 0.156483 | 1.080257919 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': False}
36 | 1.767821 | 1.766254 | 1.000887188 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': True}
37 | 1.059346 | 1.058775 | 1.000539302 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': False}
38 | 1.85755 | 1.859429 | 0.9989894747 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': True}
39 | 1.100417 | 1.097683 | 1.002490701 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': False}
40 | 1.843167 | 1.847558 | 0.9976233493 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': True}
41 | 1.090142 | 1.093163 | 0.9972364597 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': False}
42 | 0.480867 | 0.251733 | 1.910226311 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': True}
43 | 0.319246 | 0.236479 | 1.349997251 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': False}
44 | 0.49315 | 0.256408 | 1.923301925 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': True}
45 | 0.316746 | 0.227854 | 1.390127011 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': False}
46 | 0.4912 | 0.257762 | 1.905633879 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': True}
47 | 0.324771 | 0.229371 | 1.41592006 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': False}
48 | 0.152904 | 0.095079 | 1.608178462 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': True}
49 | 0.102963 | 0.089217 | 1.154073775 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': False}
50 | 0.155158 | 0.095429 | 1.625899884 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': True}
51 | 0.104338 | 0.089979 | 1.15958168 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': False}
52 | 0.153121 | 0.096429 | 1.587914424 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': True}
53 | 0.103642 | 0.090254 | 1.148336916 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': False}
54 | 0.191071 | 0.165125 | 1.157129447 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': True}
55 | 0.153971 | 0.149021 | 1.033216795 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': False}
56 | 0.193192 | 0.166892 | 1.157586942 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': True}
57 | 0.156617 | 0.15215 | 1.029359185 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': False}
58 | 0.178033 | 0.167308 | 1.06410333 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': True}
59 | 0.157425 | 0.164404 | 0.9575496947 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': False}
60 | 1.757638 | 1.750896 | 1.0038506 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': True}
61 | 1.048471 | 1.047967 | 1.000480931 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': False}
62 | 1.790708 | 1.789767 | 1.000525767 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': True}
63 | 1.054575 | 1.054796 | 0.9997904808 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': False}
64 | 1.785837 | 1.784192 | 1.000921986 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': True}
65 | 1.054713 | 1.054492 | 1.00020958 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': False}
66 | 0.478267 | 0.261017 | 1.832321266 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': True}
67 | 0.32005 | 0.226654 | 1.412064204 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': False}
68 | 0.484008 | 0.254721 | 1.900149575 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': True}
69 | 0.321 | 0.218842 | 1.466811672 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': False}
70 | 0.482087 | 0.248771 | 1.937874591 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': True}
71 | 0.316558 | 0.230533 | 1.373156988 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': False}
72 | 0.137842 | 0.085088 | 1.619993419 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': True}
73 | 0.100671 | 0.0769 | 1.309115735 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': False}
74 | 0.148321 | 0.086967 | 1.705485989 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': True}
75 | 0.101392 | 0.075454 | 1.343759112 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': False}
76 | 0.150208 | 0.083742 | 1.793699697 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': True}
77 | 0.099587 | 0.075825 | 1.313379492 |   | (3, 1000, 1000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': False}
78 | 0.622546 | 0.602729 | 1.03287879 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': True}
79 | 0.531696 | 0.5067 | 1.049330965 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 0, 'return_indices': False}
80 | 0.626646 | 0.617038 | 1.015571164 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': True}
81 | 0.530354 | 0.525367 | 1.009492412 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 1, 'return_indices': False}
82 | 0.633933 | 0.577775 | 1.097197006 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': True}
83 | 0.533067 | 0.526954 | 1.011600633 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': None, 'padding': 2, 'return_indices': False}
84 | 3.372867 | 3.386412 | 0.9960001914 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': True}
85 | 1.155975 | 1.156604 | 0.9994561665 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 0, 'return_indices': False}
86 | 3.401921 | 3.39755 | 1.001286515 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': True}
87 | 1.202829 | 1.192538 | 1.008629494 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 1, 'return_indices': False}
88 | 3.23675 | 3.220238 | 1.005127571 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': True}
89 | 1.077067 | 1.085613 | 0.9921279498 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 1, 'padding': 2, 'return_indices': False}
90 | 1.572925 | 0.925625 | 1.699311276 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': True}
91 | 0.791204 | 0.793454 | 0.9971642969 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 0, 'return_indices': False}
92 | 1.572742 | 0.922729 | 1.704446268 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': True}
93 | 0.784292 | 0.788871 | 0.9941955022 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 1, 'return_indices': False}
94 | 1.526546 | 0.925708 | 1.649057802 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': True}
95 | 0.769321 | 0.787675 | 0.9766985114 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 2, 'padding': 2, 'return_indices': False}
96 | 0.736033 | 0.612808 | 1.201082558 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': True}
97 | 0.574625 | 0.530925 | 1.082309177 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 0, 'return_indices': False}
98 | 0.722021 | 0.614488 | 1.174996094 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': True}
99 | 0.563171 | 0.533721 | 1.055178642 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 1, 'return_indices': False}
100 | 0.735725 | 0.613992 | 1.198264798 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': True}
101 | 0.583487 | 0.532513 | 1.095723485 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 1, 'stride': 4, 'padding': 2, 'return_indices': False}
102 | 0.656383 | 0.575313 | 1.140914598 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': True}
103 | 0.559796 | 0.509079 | 1.099625009 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 0, 'return_indices': False}
104 | 0.662046 | 0.572362 | 1.156691045 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': True}
105 | 0.552633 | 0.508671 | 1.086425214 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 1, 'return_indices': False}
106 | 0.634108 | 0.574629 | 1.103508525 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': True}
107 | 0.534013 | 0.510996 | 1.045043405 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': None, 'padding': 2, 'return_indices': False}
108 | 7.056642 | 7.066717 | 0.9985743026 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': True}
109 | 4.144275 | 4.142658 | 1.000390329 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 0, 'return_indices': False}
110 | 7.172683 | 7.189867 | 0.9976099697 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': True}
111 | 4.162538 | 4.158875 | 1.000880767 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 1, 'return_indices': False}
112 | 7.194233 | 7.181837 | 1.001726021 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': True}
113 | 4.294083 | 4.196062 | 1.023360236 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 1, 'padding': 2, 'return_indices': False}
114 | 1.875692 | 0.891071 | 2.104986022 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': True}
115 | 1.097479 | 0.781175 | 1.404907991 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 0, 'return_indices': False}
116 | 1.8883 | 0.89015 | 2.121327866 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': True}
117 | 1.101329 | 0.778542 | 1.414604479 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 1, 'return_indices': False}
118 | 1.872833 | 0.893654 | 2.095702587 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': True}
119 | 1.096712 | 0.784579 | 1.397835017 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 2, 'padding': 2, 'return_indices': False}
120 | 0.513029 | 0.374417 | 1.370207549 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': True}
121 | 0.349546 | 0.305763 | 1.143192603 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 0, 'return_indices': False}
122 | 0.518929 | 0.377487 | 1.374693698 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': True}
123 | 0.364662 | 0.3145 | 1.159497615 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 1, 'return_indices': False}
124 | 0.521275 | 0.375242 | 1.389170189 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': True}
125 | 0.367488 | 0.308354 | 1.191773092 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 2, 'stride': 4, 'padding': 2, 'return_indices': False}
126 | 0.652342 | 0.569308 | 1.145850752 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': True}
127 | 0.555696 | 0.506892 | 1.096280865 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 0, 'return_indices': False}
128 | 0.654333 | 0.570367 | 1.147213987 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': True}
129 | 0.548925 | 0.505825 | 1.085207335 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 1, 'return_indices': False}
130 | 0.655908 | 0.571904 | 1.146884792 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': True}
131 | 0.560808 | 0.508238 | 1.103435792 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': None, 'padding': 2, 'return_indices': False}
132 | 6.949462 | 6.949112 | 1.000050366 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': True}
133 | 4.072913 | 4.065013 | 1.001943413 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 0, 'return_indices': False}
134 | 7.200896 | 7.197792 | 1.000431243 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': True}
135 | 4.291367 | 4.218538 | 1.017264038 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 1, 'return_indices': False}
136 | 7.1823 | 7.306933 | 0.9829431856 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': True}
137 | 4.151175 | 4.149592 | 1.000381483 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 1, 'padding': 2, 'return_indices': False}
138 | 1.781279 | 0.884288 | 2.014365229 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': True}
139 | 1.050804 | 0.774362 | 1.356993241 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 0, 'return_indices': False}
140 | 1.860758 | 0.884637 | 2.103414169 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': True}
141 | 1.099908 | 0.775887 | 1.417613647 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 1, 'return_indices': False}
142 | 1.857387 | 0.885738 | 2.096993693 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': True}
143 | 1.105279 | 0.77365 | 1.428655077 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 2, 'padding': 2, 'return_indices': False}
144 | 0.489408 | 0.269583 | 1.815426047 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': True}
145 | 0.322525 | 0.236979 | 1.360985573 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 0, 'return_indices': False}
146 | 0.515475 | 0.265813 | 1.93923924 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': True}
147 | 0.315525 | 0.228146 | 1.382995976 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 1, 'return_indices': False}
148 | 0.503438 | 0.277204 | 1.816128194 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': True}
149 | 0.335421 | 0.228275 | 1.469372467 |   | (3, 2000, 2000), {'kernel_size': 5, 'dilation': 4, 'stride': 4, 'padding': 2, 'return_indices': False}
150 | 5.72495 | 4.909554 | 1.166083518 |   | (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': None, 'return_indices': True}
151 | 4.45215 | 4.251333 | 1.047236243 |   | (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': None, 'return_indices': False}
152 | 29.953021 | 29.879879 | 1.002447868 |   | (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': True}
153 | 9.854683 | 9.839517 | 1.001541336 |   | (10, 10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': False}
154 | 6.178033 | 5.697375 | 1.084364817 |   | (10, 10, 1000, 1000), {'kernel_size': 100, 'padding': 50, 'return_indices': True}
155 | 6.280317 | 5.712525 | 1.099394226 |   | (10, 10, 1000, 1000), {'kernel_size': 100, 'padding': 50, 'return_indices': False}
156 | 10.256062 | 11.336527 | 0.9046917103 |   | (10, 10, 1000, 1000), {'kernel_size': 250, 'padding': 50, 'return_indices': True}
157 | 9.469546 | 11.33705 | 0.8352742556 |   | (10, 10, 1000, 1000), {'kernel_size': 250, 'padding': 50, 'return_indices': False}
158 | 0.119087 | 0.0797 | 1.494190715 |   | (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': True}
159 | 0.098713 | 0.047173 | 2.092574142 |   | (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': False}
160 | 0.960812 | 0.675762 | 1.421820108 |   | (10, 10, 300, 300), {'kernel_size': 2, 'return_indices': True}
161 | 0.536546 | 0.485958 | 1.104099531 |   | (10, 10, 300, 300), {'kernel_size': 2, 'return_indices': False}
162 | 2.555225 | 1.791567 | 1.426251432 |   | (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': True}
163 | 1.419087 | 1.305137 | 1.087308842 |   | (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': False}
164 | 5.182008 | 3.48085 | 1.488719135 |   | (10, 10, 700, 700), {'kernel_size': 2, 'return_indices': True}
165 | 2.831779 | 2.498537 | 1.133374851 |   | (10, 10, 700, 700), {'kernel_size': 2, 'return_indices': False}
166 | 8.546038 | 5.7783 | 1.478988284 |   | (10, 10, 900, 900), {'kernel_size': 2, 'return_indices': True}
167 | 4.731004 | 4.161975 | 1.136720908 |   | (10, 10, 900, 900), {'kernel_size': 2, 'return_indices': False}
168 | 0.084754 | 0.07435 | 1.139932751 |   | (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': True}
169 | 0.057933 | 0.043096 | 1.344277891 |   | (10, 10, 100, 100), {'kernel_size': 2, 'return_indices': False}
170 | 2.568592 | 1.802117 | 1.425319222 |   | (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': True}
171 | 1.433054 | 1.307342 | 1.096158465 |   | (10, 10, 500, 500), {'kernel_size': 2, 'return_indices': False}
172 | 10.3213 | 7.111604 | 1.451332217 |   | (10, 10, 1000, 1000), {'kernel_size': 2, 'return_indices': True}
173 | 5.680525 | 5.168129 | 1.099145358 |   | (10, 10, 1000, 1000), {'kernel_size': 2, 'return_indices': False}
174 | 1.02255 | 1.01375 | 1.008680641 |   | (10, 1000, 1000), {'kernel_size': 2, 'padding': 1, 'stride': 1, 'return_indices': False}
175 | 3.074233 | 3.094383 | 0.993488201 |   | (10, 1000, 1000), {'kernel_size': 2, 'padding': 1, 'stride': 1, 'return_indices': True}
176 | 1.016812 | 1.030575 | 0.9866453194 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': False}
177 | 3.053658 | 3.089504 | 0.9883974903 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': 1, 'return_indices': True}
178 | 1.025863 | 1.032088 | 0.9939685376 |   | (10, 1000, 1000), {'kernel_size': 8, 'padding': 1, 'stride': 1, 'return_indices': False}
179 | 3.798942 | 3.799213 | 0.9999286694 |   | (10, 1000, 1000), {'kernel_size': 8, 'padding': 1, 'stride': 1, 'return_indices': True}
180 | 4.492979 | 4.493421 | 0.999901634 |   | (10, 1000, 1000), {'kernel_size': 16, 'padding': 1, 'stride': 1, 'return_indices': False}
181 | 51.543363 | 51.266204 | 1.005406271 |   | (10, 1000, 1000), {'kernel_size': 16, 'padding': 1, 'stride': 1, 'return_indices': True}
182 | 1.018008 | 1.001587 | 1.016394981 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 1), 'return_indices': False}
183 | 3.035404 | 3.003113 | 1.010752509 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 1), 'return_indices': True}
184 | 0.610421 | 0.56 | 1.0900375 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 4), 'return_indices': False}
185 | 1.138983 | 0.757296 | 1.504012962 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (1, 4), 'return_indices': True}
186 | 0.641558 | 0.557808 | 1.150141267 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (4, 1), 'return_indices': False}
187 | 1.181475 | 0.754725 | 1.565437742 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 0, 'stride': (4, 1), 'return_indices': True}
188 | 1.03045 | 1.026904 | 1.003453098 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 1), 'return_indices': False}
189 | 3.041421 | 3.0263 | 1.00499653 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 1), 'return_indices': True}
190 | 0.609929 | 0.572304 | 1.065743032 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 4), 'return_indices': False}
191 | 1.146875 | 0.756446 | 1.516135983 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (1, 4), 'return_indices': True}
192 | 0.645187 | 0.561708 | 1.148616363 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (4, 1), 'return_indices': False}
193 | 1.181721 | 0.758054 | 1.558887625 |   | (10, 1000, 1000), {'kernel_size': 4, 'padding': 1, 'stride': (4, 1), 'return_indices': True}
194 | 0.927654 | 0.925946 | 1.0018446 |   | (10, 1000, 1000), {'kernel_size': 1, 'return_indices': False}
195 | 2.749983 | 2.740354 | 1.00351378 |   | (10, 1000, 1000), {'kernel_size': 1, 'return_indices': True}

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157876
Approved by: https://github.com/malfet
2025-08-08 16:40:10 +00:00
b5fd7223b1 Improve pin_memory error message on CPU-only systems (#159994)
## Summary
- clarify pin_memory error message when no accelerator backend is available

## Testing
- `python repro_pin_memory.py` (fails: Need to provide pin_memory allocator to use pin memory)
- `lintrunner -a`
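
`repro_pin_memory.py` is not included in this log; a hypothetical equivalent on a CPU-only build would be:
```python
import torch

# hypothetical repro: on a CPU-only build, pinning raises the clarified error
try:
    torch.randn(4).pin_memory()
except RuntimeError as e:
    print(e)
```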

------
https://chatgpt.com/codex/tasks/task_e_6893ba92c93483238a9bdfdd6c52812b
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159994
Approved by: https://github.com/albanD
2025-08-08 14:36:45 +00:00
7f4cb4a3e0 [MPS] coalesce for sparse tensors (#159729)
MPS coalesce function for sparse tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159729
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-08-08 13:49:55 +00:00
8147370733 Fix qembeddingbag_byte_prepack_meta to use sym_sizes (#159985)
Summary: In `qembeddingbag_byte_prepack_meta`, `weight.sizes()` would return concrete ints. We should use `.sym_sizes()` to return SymInts instead.

Test Plan:
CI

Rollback Plan:

Reviewed By: kqfu, henryoier

Differential Revision: D79744512

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159985
Approved by: https://github.com/jerryzh168, https://github.com/henryoier
2025-08-07 21:22:29 +00:00
f3a4d742ec Revert "Add DeviceAllocator as the base device allocator (#138222)"
This reverts commit f7a66da5f9f6b8b75119b1ee8ce9ddc23e15570e.

Reverted https://github.com/pytorch/pytorch/pull/138222 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))
2025-08-07 16:34:36 +00:00
74da2604c9 Revert "Add unified memory APIs for torch.accelerator (#152932)"
This reverts commit 15f1173e5d72d6d45faba4cecd135e0160f06c6f.

Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/jithunnair-amd due to Broke ROCm periodic runs on MI300 e.g. https://github.com/pytorch/pytorch/actions/runs/16764977800/job/47470050573 ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3164941815))
2025-08-07 16:34:36 +00:00
d2368aa6f3 [CPUBLAS] add macros for brgemm APIs for versioning (#158629)
**Summary**
Add macros for brgemm so that callers (e.g., Torchao's cpp kernels) know which APIs are available. This is useful when callers need to work with older versions of PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158629
Approved by: https://github.com/CaoE, https://github.com/Valentine233, https://github.com/ezyang
2025-08-06 20:54:05 +00:00
d10e9e4781 [MPS] Remove all pre-MacOS14 logic (#159912)
Delete older enums, checks for macOS 13.3+ for int64 support, etc.

Fixes https://github.com/pytorch/pytorch/issues/159275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159912
Approved by: https://github.com/manuelcandales
2025-08-06 19:48:12 +00:00
98316e5896 [WOQ] Add CUDA kernel for _weight_int8pack_mm (#159325)
**Summary**
This issue proposes implementing a CUDA kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU. On CUDA, the fallback path uses an unfused .mul().sum() pattern in quantization.py, which is less efficient for inference. https://github.com/pytorch/pytorch/issues/158849

**Motivation**
A fused GPU kernel for aten._weight_int8pack_mm would:
- Eliminate reliance on the .mul().sum() fallback in quantization.py
- Improve performance for quantized inference on CUDA
- Extend Inductor’s GPU quantization support across more workloads

**Implementation**
- Implement a Triton kernel for:
```
out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]

where:
x: [B, K] float32
w: [N, K] int8
scale: [N] float32
out: [B, N] float32
```
- Integrate the kernel with register_woq_mm_ops() in torch/_inductor/quantized_lowerings.py
- Route it conditionally in quantization.py where GPU currently falls back to .mul().sum()
- Add unit tests comparing results to the reference fallback path (see the sketch below)
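
For reference, the fallback semantics the tests compare against can be written directly (a sketch):
```python
import torch

def weight_int8pack_mm_ref(x: torch.Tensor, w: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # x: [B, K] float32, w: [N, K] int8, scale: [N] float32 -> out: [B, N] float32
    return (x @ w.to(x.dtype).t()) * scale

x = torch.randn(256, 1024)
w = torch.randint(-128, 127, (512, 1024), dtype=torch.int8)
scale = torch.rand(512)
print(weight_int8pack_mm_ref(x, w, scale).shape)  # torch.Size([256, 512])
```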

Test Plan:
```
buck2 run 'fbcode//mode/opt' :linalg test_linalg.TestLinalgCUDA.test__int8_mm_m_64_k_64_n_64_compile_True_slice_True_cuda
```
Log: P1882799769

```
buck2 test 'fbcode//mode/opt' caffe2/test:linalg
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399722424741/

Benchmark Results:
```
**[Shape B=256, K=1024, N=512]**
CPU and CUDA outputs match
Max abs diff: 2.59e-04, max rel diff: 0.75
CPU: 144.14 ms, CUDA: 303.67 µs
Speedup: ×474.6

**[Shape B=512, K=2048, N=1024]**
CPU and CUDA outputs match
Max abs diff: 5.49e-04, max rel diff: 0.15
CPU: 1173.27 ms, CUDA: 2.40 ms
Speedup: ×488.5
```
Rollback Plan:

Differential Revision: D79042656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159325
Approved by: https://github.com/danielvegamyhre, https://github.com/jerryzh168
2025-08-06 10:28:08 +00:00
15f1173e5d Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following APIs will be put under torch.accelerator:
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-08-06 02:22:18 +00:00
f7a66da5f9 Add DeviceAllocator as the base device allocator (#138222)
# Motivation
In line with the RFC [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories such as HuggingFace accelerate, which contains [many if-else conditionals](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code) around them. We would like to introduce a generic API set under the torch.accelerator namespace to generalize these use cases.

| Device-specific memory API (`torch.xxx.foo`) | Device-agnostic memory API (`torch.accelerator.foo`) |
| --- | --- |
| `torch.xxx.empty_cache` | `torch.accelerator.empty_cache` |
| `torch.xxx.reset_peak_memory_stats` | `torch.accelerator.reset_peak_memory_stats` |
| `torch.xxx.reset_accumulated_memory_stats` | `torch.accelerator.reset_accumulated_memory_stats` |
| `torch.xxx.memory_stats` | `torch.accelerator.memory_stats` |
| `torch.xxx.memory_allocated` | `torch.accelerator.memory_allocated` |
| `torch.xxx.max_memory_allocated` | `torch.accelerator.max_memory_allocated` |
| `torch.xxx.memory_reserved` | `torch.accelerator.memory_reserved` |
| `torch.xxx.max_memory_reserved` | `torch.accelerator.max_memory_reserved` |

# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD, https://github.com/Camyll
2025-08-06 00:40:29 +00:00
eqy
9884d0351e [CUDA] Decrease launch bounds of CTCLoss backward for blackwell (#159522)
Otherwise we see `CUDA error: too many resources requested for launch`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159522
Approved by: https://github.com/janeyx99
2025-08-05 19:26:25 +00:00
c1145852a5 Deprecate overlapping functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156165
Approved by: https://github.com/albanD
ghstack dependencies: #159629, #150312
2025-08-05 04:08:42 +00:00
1ca8388442 [BE][MPS] Remove unused size12 variable (#159832)
Fixes the following compilation warning:
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:433:8: warning: unused variable 'size12' [-Wunused-variable]
  auto size12 = input_sizes[1] * input_sizes[2];
       ^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159832
Approved by: https://github.com/dcci
2025-08-05 02:32:06 +00:00
b59b61a099 Add avg_pool3d backward pass for MPS (#159089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159089
Approved by: https://github.com/malfet
2025-08-05 01:55:38 +00:00
fd6655a0f5 Feature: Implement support for cudnn_batch_norm_out kernel to replace the autogen approach. (#123020)
Fixes #115611

The autogenerated kernel may cause a redundant copy, so we implement this kernel by hand to improve efficiency.

Test Case:

```c++
#include <torch/torch.h>
#include <iostream>
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAContext.h>

int main() {
    auto input = torch::rand({2, 3, 4, 4}, torch::device(torch::kCUDA));
    auto weight = torch::randn({3}, torch::device(torch::kCUDA));
    auto bias = torch::randn({3}, torch::device(torch::kCUDA));
    auto running_mean = torch::zeros({3}, torch::device(torch::kCUDA));
    auto running_var = torch::ones({3}, torch::device(torch::kCUDA));

    bool training = true;
    double exponential_average_factor = 0.1;
    double epsilon = 1e-5;

    auto output = torch::empty_like(input);
    auto save_mean = torch::empty({3}, torch::device(torch::kCUDA));
    auto save_var = torch::empty({3}, torch::device(torch::kCUDA));
    auto reserve = torch::empty({0}, torch::device(torch::kCUDA)); // empty place-holder

    at::native::cudnn_batch_norm_out(input, weight, bias, running_mean, running_var, training, exponential_average_factor, epsilon, output, save_mean, save_var, reserve);
    auto outputs = at::native::cudnn_batch_norm(input, weight, bias, running_mean, running_var, training, exponential_average_factor, epsilon);

    bool is_close_output = torch::allclose(output, std::get<0>(outputs));
    bool is_close_save_mean = torch::allclose(save_mean, std::get<1>(outputs));
    bool is_close_save_var = torch::allclose(save_var, std::get<2>(outputs));
    bool is_close_reserve = torch::allclose(reserve, std::get<3>(outputs));

    std::cout << "Is output close: " << is_close_output << std::endl;
    std::cout << "Is save_mean close: " << is_close_save_mean << std::endl;
    std::cout << "Is save_var close: " << is_close_save_var << std::endl;
    std::cout << "Is reserve close: " << is_close_reserve << std::endl;

    return 0;
}
```

Please CC @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123020
Approved by: https://github.com/andrewor14, https://github.com/eqy, https://github.com/albanD
2025-08-04 22:40:33 +00:00
d4109a0f99 [MPS] Add max_unpool1d/2d/3d (#159789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159789
Approved by: https://github.com/malfet
2025-08-04 20:00:59 +00:00
fc340d0ca3 [export] Allow comparing device w/o index with device w/ index (#159665)
In the case where we have expected device "cuda" and given device "cuda:0", I think we should succeed?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159665
Approved by: https://github.com/yushangdi
2025-08-04 17:00:07 +00:00
a9dc1566d4 [MTIA Aten Backend] Migrate arange.start_out (#159540)
Differential Revision: [D79317519](https://our.internmc.facebook.com/intern/diff/D79317519/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159540
Approved by: https://github.com/malfet, https://github.com/nautsimon
2025-08-04 07:38:05 +00:00
33a1996714 Fix perf downgrad by reverting template use in use_mkldnn_matmul (#159024)
This PR fixes the performance regression by reverting the template use in `use_mkldnn_matmul` introduced in #157520. Fixes https://github.com/pytorch/pytorch/issues/159031 and https://github.com/pytorch/pytorch/issues/159551.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159024
Approved by: https://github.com/mingfeima
2025-08-04 05:49:46 +00:00
e2a5c42e7e [BE][MPS] Build metal kernels of MacOS-14+ (#159733)
This makes the `#if __METAL_VERSION__ >= 310` guards for `bfloat` support unnecessary.
Rename `kernels_bfloat.metallib` to `kernels_basic` and remove the custom build/selection logic.

Part of https://github.com/pytorch/pytorch/issues/159275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159733
Approved by: https://github.com/dcci
ghstack dependencies: #159731, #159732
2025-08-03 20:53:58 +00:00
9c18901bfd [MTIA Aten Backend] Migrate all.out (#159539)
Differential Revision: [D79317033](https://our.internmc.facebook.com/intern/diff/D79317033/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159539
Approved by: https://github.com/malfet
ghstack dependencies: #159098
2025-08-03 02:08:35 +00:00
fd6a6658c3 Enable _int_mm on Intel GPU (#157769)
# Motivation

This PR enables `_int_mm` on Intel GPU. `_int_mm` is used by int8 quantization in torchao.

# Model Test Result:
We run meta-llama/Llama-3.1-8B-Instruct on Intel GPU and A100 using torchao int8 dynamic quantization, with the following model config:
- Precision: torch.bfloat16
- Quantization configuration: Int8DynamicActivationInt8WeightConfig
- Dataset: wikitext

Result:
The perplexity values on Intel GPU and A100 are 9.582953453063965 and 9.57755184173584, respectively.
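
Usage mirrors the existing CPU/CUDA paths (a sketch; assumes an XPU build and device):
```python
import torch

a = torch.randint(-128, 127, (32, 32), dtype=torch.int8, device="xpu")
b = torch.randint(-128, 127, (32, 32), dtype=torch.int8, device="xpu")
c = torch._int_mm(a, b)  # accumulates in int32
print(c.dtype)  # torch.int32
```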

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157769
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2025-08-02 05:16:01 +00:00
a81ffbc5f5 improve shape checks for grouped_mm (#159666)
Check that the contraction dimension matches between tensors when it is known, and do device-side checks for correct offsets.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159666
Approved by: https://github.com/danielvegamyhre, https://github.com/eqy
2025-08-02 00:12:25 +00:00
06d28de17a Update CK Kernel generation and update ck submodule (#157964)
Changes required to reduce the number of CK kernels generated. This change depends on https://github.com/ROCm/composable_kernel/pull/2480 being merged first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157964
Approved by: https://github.com/842974287
2025-08-01 22:24:27 +00:00
df9720b8b5 [MTIA Aten Backend] Migrate all foreach ops (#159098)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate all foreach operators to in-tree, including:
  - _foreach_abs
  - _foreach_abs_
  - _foreach_add.List
  - _foreach_add_.List
  - _foreach_add_.Scalar
  - _foreach_add_.Tensor
  - _foreach_addcmul.Scalar
  - _foreach_addcmul_.Scalar
  - _foreach_copy
  - _foreach_copy_
  - _foreach_mul.List
  - _foreach_mul_.List
  - _foreach_mul_.Scalar
  - _foreach_mul.Tensor
  - _foreach_mul_.Tensor
  - _foreach_norm.Scalar
  - _foreach_sqrt_

Differential Revision: [D78913847](https://our.internmc.facebook.com/intern/diff/D78913847/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159098
Approved by: https://github.com/malfet
2025-08-01 22:10:12 +00:00
90f13f3b2a Revert "Deprecate overlapping functions in CUDAAllocatorConfig, use AcceleratorAllocatorConfig instead (#156165)"
This reverts commit 1fc010a9d8ea95bb74e54b31d17eba56ef16c27c.

Reverted https://github.com/pytorch/pytorch/pull/156165 on behalf of https://github.com/guangyey due to Static initialization order issue impact the downstream repo ([comment](https://github.com/pytorch/pytorch/pull/150312#issuecomment-3142035444))
2025-08-01 03:24:54 +00:00
83e2ea8135 [CPU] fix _weight_int8pack_mm with large output shape (#158341)
**Summary**
`_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). This is because the kernel computes the output buffer address as
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int`, so when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product cannot overflow.

**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
2025-08-01 01:55:48 +00:00
d994027a41 [Doc fix] fix spelling of enough (#159587)
Fixes the typo in the word `enought`, correcting it to `enough`, in 3 places in these files:
```
aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu
aten/src/ATen/native/cuda/CuFFTPlanCache.h
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159587
Approved by: https://github.com/ezyang
2025-08-01 01:50:57 +00:00
669009bcd1 [inductor] respect layout tags for ops with registered lowerings (#159134)
scaled_grouped_mm's kernel only supports column-major on the second operand. I -think- this is just for efficiency reasons. But inductor treats that buffer as flexible and may tweak the strides to be row-major instead, as seen in the issue.

~Tagging the op as "needs_fixed_stride_order"/"needs_exact_strides" does not work. Inductor only considers those tags for ops that don't have registered lowering (not sure if this is intended). scaled_grouped_mm does have a lowering, so we never check its tags.~ From discussion below, the op tags are expected to work.
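
For context, stride-order tags are attached when an op is defined; a minimal sketch with a hypothetical custom op (not the `scaled_grouped_mm` fix itself):
```python
import torch

lib = torch.library.Library("mylib", "DEF")  # hypothetical library namespace
lib.define(
    "col_major_mm(Tensor a, Tensor b) -> Tensor",
    tags=(torch.Tag.needs_fixed_stride_order,),
)
```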

FIXES https://github.com/pytorch/pytorch/issues/159097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159134
Approved by: https://github.com/eellison
2025-07-31 21:29:40 +00:00
668d414ae7 [CPU] Fix bias dtype issue for FP8 qlinear (#159125)
Fixes
`RuntimeError: self and mat2 must have the same dtype, but got BFloat16 and Float`

With bf16 autocast, the bias is converted to BFloat16, but `fp8_qlinear_onednn_ref` did not support a bf16 bias.
This PR handles the bf16 bias conversion inside `fp8_qlinear_onednn_ref`.

This case is added to the unit tests and can be reproduced with:
`python test/test_quantization.py -k test_qlinear_fp8`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159125
Approved by: https://github.com/Xia-Weiwen, https://github.com/cyyever, https://github.com/CaoE
2025-07-31 01:26:45 +00:00