Commit Graph

519 Commits

Author SHA1 Message Date
955f21dc2c [ROCm][CI] Add support for gfx1100 in rocm workflow + test skips (#148355)
This PR adds infrastructure support for gfx1100 in the rocm workflow. Nodes have been allocated for this effort.
@dnikolaev-amd contributed all the test skips.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148355
Approved by: https://github.com/jeffdaily

Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-07 22:36:25 +00:00
b63bbe1661 Remove old ROCm version check in tests (#164245)
This PR removes ROCm<6 version checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164245
Approved by: https://github.com/jeffdaily
2025-10-06 22:42:01 +00:00
15d726005d Enable several unit tests on ROCm (#163087)
Code change enables:
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float16
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float32
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float64
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_gelu_cuda_float16
test_linalg::TestLinalgCUDA::test_eigh_svd_illcondition_matrix_input_should_not_crash_cuda_float32
test_linalg::TestLinalgCUDA::test_eigh_svd_illcondition_matrix_input_should_not_crash_cuda_float64
test_ops::TestCommonCUDA::test_complex_half_reference_testing_as_strided_scatter_cuda_complex32

Fixes https://github.com/pytorch/pytorch/issues/134687
Fixes https://github.com/pytorch/pytorch/issues/78205

Closing github issues:
inductor/test_gpu_cpp_wrapper unit tests:
Fixes https://github.com/pytorch/pytorch/issues/157084

test_nn unit tests:
Fixes https://github.com/pytorch/pytorch/issues/157167
Fixes https://github.com/pytorch/pytorch/issues/157119
Fixes https://github.com/pytorch/pytorch/issues/157118
Fixes https://github.com/pytorch/pytorch/issues/157115
Fixes https://github.com/pytorch/pytorch/issues/157081
Fixes https://github.com/pytorch/pytorch/issues/155216
Fixes https://github.com/pytorch/pytorch/issues/157259
Fixes https://github.com/pytorch/pytorch/issues/157166
Fixes https://github.com/pytorch/pytorch/issues/157165
Fixes https://github.com/pytorch/pytorch/issues/157164
Fixes https://github.com/pytorch/pytorch/issues/157117
Fixes https://github.com/pytorch/pytorch/issues/157116
Fixes https://github.com/pytorch/pytorch/issues/157114
Fixes https://github.com/pytorch/pytorch/issues/157113
Fixes https://github.com/pytorch/pytorch/issues/157082
Fixes https://github.com/pytorch/pytorch/issues/157080
Fixes https://github.com/pytorch/pytorch/issues/157079
Fixes https://github.com/pytorch/pytorch/issues/157078

test_linalg unit tests:
Fixes https://github.com/pytorch/pytorch/issues/157427
Fixes https://github.com/pytorch/pytorch/issues/157414
Fixes https://github.com/pytorch/pytorch/issues/157369
Fixes https://github.com/pytorch/pytorch/issues/157349
Fixes https://github.com/pytorch/pytorch/issues/157348
Fixes https://github.com/pytorch/pytorch/issues/157337
Fixes https://github.com/pytorch/pytorch/issues/157336
Fixes https://github.com/pytorch/pytorch/issues/157297
Fixes https://github.com/pytorch/pytorch/issues/157281
Fixes https://github.com/pytorch/pytorch/issues/157260
Fixes https://github.com/pytorch/pytorch/issues/157171
Fixes https://github.com/pytorch/pytorch/issues/157169
Fixes https://github.com/pytorch/pytorch/issues/157168
Fixes https://github.com/pytorch/pytorch/issues/157125
Fixes https://github.com/pytorch/pytorch/issues/157124
Fixes https://github.com/pytorch/pytorch/issues/157123
Fixes https://github.com/pytorch/pytorch/issues/157089
Fixes https://github.com/pytorch/pytorch/issues/157088
Fixes https://github.com/pytorch/pytorch/issues/157087
Fixes https://github.com/pytorch/pytorch/issues/157068
Fixes https://github.com/pytorch/pytorch/issues/157067
Fixes https://github.com/pytorch/pytorch/issues/157066
Fixes https://github.com/pytorch/pytorch/issues/157047
Fixes https://github.com/pytorch/pytorch/issues/157046
Fixes https://github.com/pytorch/pytorch/issues/157045
Fixes https://github.com/pytorch/pytorch/issues/157044
Fixes https://github.com/pytorch/pytorch/issues/156997
Fixes https://github.com/pytorch/pytorch/issues/156996
Fixes https://github.com/pytorch/pytorch/issues/156995
Fixes https://github.com/pytorch/pytorch/issues/156994
Fixes https://github.com/pytorch/pytorch/issues/156993
Fixes https://github.com/pytorch/pytorch/issues/156991
Fixes https://github.com/pytorch/pytorch/issues/156990
Fixes https://github.com/pytorch/pytorch/issues/156989
Fixes https://github.com/pytorch/pytorch/issues/105118
Fixes https://github.com/pytorch/pytorch/issues/157415
Fixes https://github.com/pytorch/pytorch/issues/157282
Fixes https://github.com/pytorch/pytorch/issues/157261
Fixes https://github.com/pytorch/pytorch/issues/157170
Fixes https://github.com/pytorch/pytorch/issues/157126

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163087
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-10-03 19:30:59 +00:00
531f3bf5e1 Adding check for square matrix for input tensor in matrix_exp backwar… (#163357)
…d op.

Fixes #146796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163357
Approved by: https://github.com/lezcano
2025-10-01 03:12:30 +00:00
5d749ceb92 Remove test conditions for CUDA<12 (#163495)
Because it required that CUDA >=12.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163495
Approved by: https://github.com/janeyx99
2025-09-23 07:52:00 +00:00
bec967eaa4 Remove C++ and test branches for CUDA<12 (#163443)
Remove conditional branches for CUDA<12.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163443
Approved by: https://github.com/eqy
2025-09-22 18:20:08 +00:00
e769026bcb [ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`

Fixes #157094
Fixes #157093
Fixes #157092
Fixes #157091
Fixes #157064
Fixes #157063
Fixes #157062
Fixes #157061
Fixes #157042
Fixes #157041
Fixes #157039
Fixes #157004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-18 13:53:48 +00:00
66308fb470 Revert "[ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)"
This reverts commit cef815dc2ce37f98e01a6469a15b69f15995c1f9.

Reverted https://github.com/pytorch/pytorch/pull/162998 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it seems to break a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/162998#issuecomment-3300280242))
2025-09-16 20:39:41 +00:00
cef815dc2c [ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`

Fixes #157094, #157093, #157092, #157091, #157064, #157063, #157062, #157061, #157042, #157041, #157039, #157004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-16 12:48:45 +00:00
8e05749d5c Fix integer overflow bug in triu/tril for large diagonal values (#153240)
This PR fixes a bug in the implementation of `apply_triu_tril_single` where using extremely large values for the diagonal argument (e.g. `diagonal=9223372036854775807`) could result in integer overflow and incorrect results. The masking logic is re-written to avoid this issue by always iterating over all columns, ensuring correctness even for large or extreme diagonal values.

Example of the original incorrect behavior:
```python
a = torch.ones(5,5)
torch.triu(a, 9223372036854775807)
# Before:
# tensor([[0., 0., 0., 0., 0.],
#         [1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1.]])
```

The new implementation guards against overflow and produces correct results for all valid input values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153240
Approved by: https://github.com/albanD
2025-09-15 18:07:19 +00:00
3a20a20e70 Fix largeTensorTest malfunction on XPU (#161988)
# Motivation
https://github.com/pytorch/pytorch/pull/143553/files#diff-6492991193449e118ff0c8d42ca544cc38a73604e505ff246a3c711aeab91748R1345 makes `largeTensorTest` malfunction on XPU. This PR aims to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161988
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-09-04 16:10:03 +00:00
e9bbd28f22 make einsum produce contiguous inputs in more cases (#161755)
Fixes #161729
Written by codex
This won't produce contiguous inputs for all einsum applications, because we flatten all right-only and left-only dimensions, so if right and left operand dimensions are interleaved in output, we cannot (with current algo) produce contiguous output, however, for common cases like in the linked issue it works. Let's see what CI says

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161755
Approved by: https://github.com/malfet, https://github.com/albanD
2025-08-29 18:50:46 +00:00
a53d14d5f8 Revert "unskipped mobilenet_v3 quantization and mobilenet_v2 quantization plus tests from https://github.com/pytorch/pytorch/issues/125438 (#157786)"
This reverts commit 3a2c3c8ed365eb4e4cf4620c25d70b2f70483762.

Reverted https://github.com/pytorch/pytorch/pull/157786 on behalf of https://github.com/albanD due to Breaks lint ([comment](https://github.com/pytorch/pytorch/pull/157786#issuecomment-3164126250))
2025-08-07 13:09:33 +00:00
3a2c3c8ed3 unskipped mobilenet_v3 quantization and mobilenet_v2 quantization plus tests from https://github.com/pytorch/pytorch/issues/125438 (#157786)
These tests now pass on AArch64 in our downstream CI.

`test_quantization.py::TestNumericSuiteEager::test_mobilenet_v2 <- test/quantization/eager/test_numeric_suite_eager.py PASSED [2.4434s] [ 35%]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157786
Approved by: https://github.com/jerryzh168, https://github.com/malfet
2025-08-06 22:41:07 +00:00
98316e5896 [WOQ] Add CUDA kernel for _weight_int8pack_mm (#159325)
**Summary**
This issue proposes implementing a CUDA kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU. On CUDA, the fallback path uses an unfused .mul().sum() pattern in quantization.py, which is less efficient for inference. https://github.com/pytorch/pytorch/issues/158849

**Motivation**
A fused GPU kernel for aten._weight_int8pack_mm would:
- Eliminate reliance on the .mul().sum() fallback in quantization.py
- Improve performance for quantized inference on CUDA
- Extend Inductor’s GPU quantization support across more workloads

**Implementation**
- Implement a Triton kernel for:
```
out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]

where:
x: [B, K] float32
w: [N, K] int8
scale: [N] float32
out: [B, N] float32
```
- Integrate the kernel with register_woq_mm_ops() in torch/_inductor/quantized_lowerings.py
- Route it conditionally in quantization.py where GPU currently falls back to .mul().sum()
- Add unit tests comparing results to the reference fallback path

Test Plan:
```
buck2 run 'fbcode//mode/opt' :linalg test_linalg.TestLinalgCUDA.test__int8_mm_m_64_k_64_n_64_compile_True_slice_True_cuda
```
Log: P1882799769

```
buck2 test 'fbcode//mode/opt' caffe2/test:linalg
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399722424741/

Benchmark Results:
```
**[Shape B=256, K=1024, N=512]**
CPU and CUDA outputs match
Max abs diff: 2.59e-04, max rel diff: 0.75
CPU: 144.14 ms, CUDA: 303.67 µs
Speedup: ×474.6

**[Shape B=512, K=2048, N=1024]**
CPU and CUDA outputs match
Max abs diff: 5.49e-04, max rel diff: 0.15
CPU: 1173.27 ms, CUDA: 2.40 ms
Speedup: ×488.5
```
Rollback Plan:

Differential Revision: D79042656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159325
Approved by: https://github.com/danielvegamyhre, https://github.com/jerryzh168
2025-08-06 10:28:08 +00:00
83e2ea8135 [CPU] fix _weight_int8pack_mm with large output shape (#158341)
**Summary**
`_weight_int8pack_mm` on CPU may cause segmentation fault if output shape is large (i.e., M * N is large). It's because the kernel compute output buffer address by
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int` and when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.

**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
2025-08-01 01:55:48 +00:00
61aa2ae20f Revert "[CPU] fix _weight_int8pack_mm with large output shape (#158341)"
This reverts commit e469414b59ceeaae2860e36708de8852b9892776.

Reverted https://github.com/pytorch/pytorch/pull/158341 on behalf of https://github.com/albanD due to Breaks slowtest ([comment](https://github.com/pytorch/pytorch/pull/158341#issuecomment-3132641530))
2025-07-29 13:56:20 +00:00
e469414b59 [CPU] fix _weight_int8pack_mm with large output shape (#158341)
**Summary**
`_weight_int8pack_mm` on CPU may cause segmentation fault if output shape is large (i.e., M * N is large). It's because the kernel compute output buffer address by
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int` and when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.

**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
2025-07-29 01:14:50 +00:00
16c0ccd669 [ROCm][CI] upgrade to 6.4.2 patch release (#158887)
Upgrade to ROCm 6.4.2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158887
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-25 03:45:44 +00:00
f5e2de928b [BE] fix remaining flake8 v7 warnings (#159044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159044
Approved by: https://github.com/Skylion007
ghstack dependencies: #159043
2025-07-25 02:56:34 +00:00
c917c63282 [ROCm][tunableop] UT tolerance increase for matmul_small_brute_force_tunableop at FP16 (#158788)
TunableOp will sometimes find a less precise solution due to the small input vectors used in this UT. Bumping op tolerance to eliminate flakiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158788
Approved by: https://github.com/jeffdaily
2025-07-22 19:45:35 +00:00
f4d8bc46c7 Enable TF32 as fp32 internal precision for matmul/linear/conv (#157520)
### Description

This PR is to enable TF32 as fp32 internal precision for matmul/linear/conv in `mkldnn backend`. Since we have refined fp32 precision API in https://github.com/pytorch/pytorch/pull/125888, we can easily extend the API to support TF32 for `mkldnn backend`.

```
torch.backends.mkldnn.matmul.fp32_precision = 'tf32'
torch.backends.mkldnn.conv.fp32_precision = "tf32"
```

Related kernel update and UTs update are done. And the wrapper `bf32_on_and _off` is updated to `reduced_f32_on_and_off`, and it can run tests 3 times, one is reduced_f32 OFF, the other two are reduced_f32 ON (including `bf32 ON` and `tf32 ON`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157520
Approved by: https://github.com/mingfeima, https://github.com/jansel
2025-07-17 08:57:34 +00:00
7444debaca Revert "Fix logdet returning finite values for singular matrices on CUDA (#157910)"
This reverts commit 7d4228dbfd13d1ac8fac2c78c042dbb8314f042d.

Reverted https://github.com/pytorch/pytorch/pull/157910 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this seems to fail some internal tests accuracy ([comment](https://github.com/pytorch/pytorch/pull/157910#issuecomment-3064368647))
2025-07-12 00:22:51 +00:00
7d4228dbfd Fix logdet returning finite values for singular matrices on CUDA (#157910)
Fixes https://github.com/pytorch/pytorch/issues/154312

Fix logdet returning finite values for singular matrices on CUDA (https://github.com/pytorch/pytorch/issues/154312
https://github.com/pytorch/pytorch/issues/154312)

PyTorch's logdet function returns mathematically incorrect finite values for
singular matrices on CUDA devices instead of the expected -inf. This occurs
because cuSOLVER and LAPACK produce tiny non-zero diagonal elements (~1e-16)
instead of exact zeros for singular matrices.

**Problem:**
Issue https://github.com/pytorch/pytorch/issues/154312 matrix returns finite values instead of -inf for singular matrices.

**Solution:**
Implemented NumPy-style two-tier singularity detection with GPU sync point removal:

1. **Primary detection**: Use LAPACK's built-in singularity detection via info parameter
2. **Backup detection**: Apply threshold-based detection for numerical edge cases
3. **Zero GPU sync points**: Eliminated all .item(), std::get<0>(), and scalar extractions
4. **Pure tensor operations**: All computations use tensor operations throughout

**Performance Impact:**
Based on comprehensive benchmarking across matrix sizes and data types:

- **Overall Impact**: 0.85× average speedup (+18.0% overhead)
- **CPU Performance**: 0.84× average speedup (+18.8% overhead)
- **CUDA Performance**: 0.85× average speedup (+17.3% overhead)

**Performance Trade-offs:**
- **Small matrices (16×16, 64×64)**: Higher overhead due to tensor operation setup costs
- **Large matrices (512×512, 2048×2048)**: Near-zero overhead, with some cases showing slight improvements
- **GPU sync elimination**: Removes expensive GPU→CPU synchronization bottlenecks

**Results:**
-  All singular matrices now correctly return -inf on both CPU and CUDA
-  Original issue https://github.com/pytorch/pytorch/issues/154312 matrix now works correctly
-  Results match NumPy's slogdet behavior exactly
-  Zero GPU synchronization points for improved performance
-  Comprehensive edge case testing added

**Verification:**
Before: torch.linalg.slogdet(singular_matrix) → finite values (incorrect)
After:  torch.linalg.slogdet(singular_matrix) → (sign=0, logabsdet=-inf) 

The implementation uses pure tensor operations to eliminate GPU sync points while
maintaining robust singularity detection through a two-tier approach.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157910
Approved by: https://github.com/lezcano, https://github.com/IvanYashchuk, https://github.com/albanD

Co-authored-by: Claude <noreply@anthropic.com>
2025-07-11 02:23:46 +00:00
fc0376e8b1 [BE][2/6] fix typos in test/ (test/test_*.py) (#157636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157636
Approved by: https://github.com/yewentao256, https://github.com/mlazos
ghstack dependencies: #156311, #156609
2025-07-09 11:02:23 +00:00
ab3393e923 [ROCm][CI] fix mi300 test failure after 6.4.1 update (#156368)
Fixes failures such as https://github.com/pytorch/pytorch/actions/runs/15739699156/job/44365395854: `test/test_linalg.py::TestLinalgCUDA::test_broadcast_batched_matmul_cuda`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156368
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-19 15:02:40 +00:00
29391c7cf9 [ez] Mark linalg svd memory allocation test as serial b/c OOMing on cu128 (#155811)
9df2e8020f/1

8e8d4b13b0 (43980565863-box)

started OOMing after switching to cuda 12.8

Maybe b/c I made some changes fix the per process memory fraction so each proc has fewer memory
```
2025-06-12T15:29:50.4998758Z FAILED [0.0124s] test_linalg.py::TestLinalgCUDA::test_svd_memory_allocation_cuda_complex128 - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.10 GiB. GPU 0 has a total capacity of 7.43 GiB of which 6.85 GiB is free. Process 80272 has 68.75 MiB memory in use. Process 83346 has 68.75 MiB memory in use. Process 83365 has 374.75 MiB memory in use. Process 83384 has 70.75 MiB memory in use. 2.90 GiB allowed; Of the allocated memory 240.00 MiB is allocated by PyTorch, and 2.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155811
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman, https://github.com/eqy
2025-06-12 21:05:32 +00:00
7a03b0d2ca [BE] Remove CUDA 11 artifacts. Fix Check Binary workflow (#155555)
Please see: https://github.com/pytorch/pytorch/issues/147383

1. Remove CUDA 11 build and test artifacts. One place CUDA 12.4
2. Fix Check Binary Workflow to use Stable Cuda version variable rather then hardcoded one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155555
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-06-10 21:32:08 +00:00
e8d29c45e0 [ROCm][TunableOp] Unit test to verify that there is only one kernel launch per PyTorch API invocation. (#155077)
TunableOp UT covers breakage that was fixed in this PR: https://github.com/pytorch/pytorch/pull/153764

After tuning is complete, verify that there is only one kernel launch. for each PyTorch API invocation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155077
Approved by: https://github.com/jeffdaily
2025-06-10 16:11:43 +00:00
fa3c38c7ae Add tensor overlap check for cross (#154999)
Fixes #132031

## Test Result

```python
In [1]: import torch
   ...: torch.manual_seed(0)
   ...: torch.cuda.manual_seed(0)
   ...: a = torch.randn(3, 4)
   ...: b = torch.randn(3, 4)
   ...: torch.cross(a, b, out=a)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[1], line 6
      4 a = torch.randn(3, 4)
      5 b = torch.randn(3, 4)
----> 6 torch.cross(a, b, out=a)

RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154999
Approved by: https://github.com/lezcano
2025-06-05 10:00:01 +00:00
edc2d539d1 torch.tensordot: performance improvements when contracting to a scalar. (#145936)
As per title.
Fixes https://github.com/pytorch/pytorch/issues/145731

Touches only compute. The CPU overhead can potentially be further reduced.

Before:
```python
In [3]: n = 512

In [4]: A = torch.rand(n, n)

In [5]: B = torch.rand(n, n)

In [6]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]])
2.04 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]])
2.85 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]])
2.9 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]])
4.07 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After
```python
In [2]: n = 512

In [3]: A = torch.rand(n, n)

In [4]: B = torch.rand(n, n)

In [5]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]])
30.7 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [6]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]])
141 µs ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]])
142 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [8]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]])
62.8 µs ± 4.31 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145936
Approved by: https://github.com/albanD, https://github.com/ngimel
2025-05-13 10:57:30 +00:00
22c31046d4 Fixed rerr computation in lobpcg (#152789)
Fixes #101075

This PR fixes an issue with the computation of residuals in the LOBPCG algorithm.

**Bug**: [Line 788](8f54e56e62/torch/_lobpcg.py (L788)) is supposed to compute the denominator in Equation 9 of [Duersch et al., 2018](https://arxiv.org/abs/1704.07458), as also suggested in [line 776](8f54e56e62/torch/_lobpcg.py (L776)), but it uses the raw eigenvalue-estimates instead of their absolute values.

**Consequence**: This made the algorithm's success sensitive to initialization of eigenvectors.

**Tests**:
- I have tested @jtorde's [script](https://github.com/pytorch/pytorch/issues/101075#issuecomment-1545349559), and I did NOT run into any assertion errors for a few minutes (as opposed to the original implementation, which fails after a few seconds).
- I have also tried @pearu's specific [test case](https://github.com/pytorch/pytorch/issues/101075#issuecomment-1548483685), which also executes successfully - the residuals remain positive, and the final output is the same as one returned by SciPy (with and without enforcing the use of LOBPCG).
- I extracted the relevant test cases from [test/test_autograd.py](https://github.com/pytorch/pytorch/blob/main/test/test_autograd.py) and [test/test_linalg.py](https://github.com/pytorch/pytorch/blob/main/test/test_linalg.py), and they ran successfully.

Let me know if further test cases or benchmarks are needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152789
Approved by: https://github.com/pearu, https://github.com/lezcano
2025-05-08 12:22:31 +00:00
08f5371571 [float16]: Fix the accumulation type for dot and gemv (#152676)
Fixes #147860

Also, partially address: https://github.com/pytorch/pytorch/issues/125438

Use float32 for accumulation with float16 and and bfloat16 types

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152676
Approved by: https://github.com/malfet
2025-05-06 18:10:08 +00:00
ece1658418 [ROCm][TunableOp] Fix ScaledGEMM rowwise (#152403)
Fixes TunableOp ScaledGEMM regression for rowwise scaling caused by this https://github.com/pytorch/pytorch/pull/147548

Credit goes to @mawong-amd for fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152403
Approved by: https://github.com/jeffdaily
2025-04-30 08:33:03 +00:00
0d99b4e9e2 ROCm: Enable tf32 testing on test_nn (#148945)
Add tf32 support for ROCm tests.
test command: python test/test_nn.py -v

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148945
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-04-28 23:01:04 +00:00
f6c1cf04b5 [ROCm][TunableOp] Support submatrices in offline tuning (#151138)
This PR adds support for submatrices in offline tuning for:
- GEMM
- GEMM and bias
- ScaledGEMM
- Batch Strided GEMM

New UTs to cover submatrices. Submatrices for strided batch API is not part of this PR and will be done seperately.

There is also a bug fix for offline tuning for full matrix for GEMM and bias in the `NT` case. Offline and online UTs were updated to cover this corner case.

To improve code readability, swapped definition of transA and transB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151138
Approved by: https://github.com/jeffdaily
2025-04-19 04:14:27 +00:00
ae53510b9e Fix setUpClass() / tearDownClass() for device-specific tests (#151129)
Finishes up the work started in #121686 + adds test

Update: this was not as straightforward as I originally imagined. Context below.

**TL;DR:** `TestFoo{CPU, CUDA}` now actually derive from `TestFoo`! Also, `{CPU, CUDA}TestBase` setup / teardown logic is now always called (it is required to set the primary device), regardless of whether `super().setUpClass()` / `super().tearDownClass()` are called or not.

**Background:** The typical way to get device-specific tests is to write a generic `TestFoo` and call `instantiate_device_type_tests(TestFoo, locals())` to get `TestFooCPU`, `TestFooCUDA`, etc. After this, generic tests (e.g. `TestFoo.test_bar()`) become `TestFooCPU.test_bar_cpu()` / `TestFooCUDA.test_bar_cuda()`.

Behind the scenes, this was historically accomplished by creating a `TestFooCUDA` that derives from both a `CUDATestBase` and an *empty class* called `TestFoo_base`. This `TestFoo_base` has the same bases as `TestFoo`, but none of the test functions (e.g. `test_bar()`). The documented reason for this is to avoid things like a derived `TestFooCUDA.test_bar()` being discovered in addition to the real device-specific test `TestFooCUDA.test_bar_cuda()`.

(1) A reason this matters is because it should be possible to call e.g. `super().setUpClass()` from a custom setup / teardown classmethod. If the generated TestFooCUDA does not derive from TestFoo, but instead derives from the empty class described above, this syntax does not work; in fact there is no way to form a proper `super()` call that works across the device-specific test variants. Here's an example that breaks in the OpInfo tests:

070f389745/test/test_ops.py (L218-L221)

(2) Further, there is some precedent within a custom `setUpClass()` impl for storing things on the `cls` object to be accessed at test time. This must be the device-specific test class (`TestFooCUDA`) and not `TestFoo` for this to work. As an example, the open device registration tests load a module during setup and use it in the test logic:

070f389745/test/test_cpp_extensions_open_device_registration.py (L63-L77)

070f389745/test/test_cpp_extensions_open_device_registration.py (L79-L80)

To accomplish both (1) and (2) at the same time, I decided to revisit the idea of utilizing a proper inheritance hierarchy for `TestFoo` -> `{TestFooCPU, TestFooCUDA}`. That is: have TestFooCPU / TestFooCUDA **actually** derive from `TestFoo`. This achieves both (1) and (2). The only thing left is to make sure the generic tests (e.g. `TestFoo.test_bar()`) are not discoverable, as was the stated reason for diverging from this in the first place. It turns out we can simply `delattr()` these generic tests from `TestFoo` once `TestFooCPU` / `TestFooCUDA` have been setup with the device-specific variants, and all works well. The `instantiate_device_type_tests(...)` logic already deletes `TestFoo` from scope, so I don't see a problem with deleting generic tests from this base class as well (CI will prove me right or wrong ofc).

**Side note:** I was encountering a weird race condition where sometimes the custom `setUpClass()` / `tearDownClass()` defined & swapped in [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L940-L955)) would be used, and sometimes it wouldn't. This non-deterministic behavior was called out previously by @ngimel here:
4a47dd9b3f/test/inductor/test_torchinductor_dynamic_shapes.py (L128-L130)

To address this, I moved this block of logic to before the first call to `instantiate_test()`, as that method queries for the primary device, and the primary device identification logic may manually invoke `setUpClass()` (see [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L381-L384))). Goal: define the `setUpClass()` / `tearDownClass()` we want for correctness before they're ever called. This seems to work and the behavior is deterministic now AFAICT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151129
Approved by: https://github.com/janeyx99, https://github.com/masnesral, https://github.com/malfet
2025-04-16 02:18:42 +00:00
98b1e82ba8 Revert "Fix setUpClass() / tearDownClass() for device-specific tests (#151129)"
This reverts commit bd4cf30e31a2a0b0a57f54c7eedd3a39d5778cbe.

Reverted https://github.com/pytorch/pytorch/pull/151129 on behalf of https://github.com/jbschlosser due to flex attention tests failing ([comment](https://github.com/pytorch/pytorch/pull/151129#issuecomment-2807632119))
2025-04-15 22:07:25 +00:00
bd4cf30e31 Fix setUpClass() / tearDownClass() for device-specific tests (#151129)
Finishes up the work started in #121686 + adds test

Update: this was not as straightforward as I originally imagined. Context below.

**TL;DR:** `TestFoo{CPU, CUDA}` now actually derive from `TestFoo`! Also, `{CPU, CUDA}TestBase` setup / teardown logic is now always called (it is required to set the primary device), regardless of whether `super().setUpClass()` / `super().tearDownClass()` are called or not.

**Background:** The typical way to get device-specific tests is to write a generic `TestFoo` and call `instantiate_device_type_tests(TestFoo, locals())` to get `TestFooCPU`, `TestFooCUDA`, etc. After this, generic tests (e.g. `TestFoo.test_bar()`) become `TestFooCPU.test_bar_cpu()` / `TestFooCUDA.test_bar_cuda()`.

Behind the scenes, this was historically accomplished by creating a `TestFooCUDA` that derives from both a `CUDATestBase` and an *empty class* called `TestFoo_base`. This `TestFoo_base` has the same bases as `TestFoo`, but none of the test functions (e.g. `test_bar()`). The documented reason for this is to avoid things like a derived `TestFooCUDA.test_bar()` being discovered in addition to the real device-specific test `TestFooCUDA.test_bar_cuda()`.

(1) A reason this matters is because it should be possible to call e.g. `super().setUpClass()` from a custom setup / teardown classmethod. If the generated TestFooCUDA does not derive from TestFoo, but instead derives from the empty class described above, this syntax does not work; in fact there is no way to form a proper `super()` call that works across the device-specific test variants. Here's an example that breaks in the OpInfo tests:

070f389745/test/test_ops.py (L218-L221)

(2) Further, there is some precedent within a custom `setUpClass()` impl for storing things on the `cls` object to be accessed at test time. This must be the device-specific test class (`TestFooCUDA`) and not `TestFoo` for this to work. As an example, the open device registration tests load a module during setup and use it in the test logic:

070f389745/test/test_cpp_extensions_open_device_registration.py (L63-L77)

070f389745/test/test_cpp_extensions_open_device_registration.py (L79-L80)

To accomplish both (1) and (2) at the same time, I decided to revisit the idea of utilizing a proper inheritance hierarchy for `TestFoo` -> `{TestFooCPU, TestFooCUDA}`. That is: have TestFooCPU / TestFooCUDA **actually** derive from `TestFoo`. This achieves both (1) and (2). The only thing left is to make sure the generic tests (e.g. `TestFoo.test_bar()`) are not discoverable, as was the stated reason for diverging from this in the first place. It turns out we can simply `delattr()` these generic tests from `TestFoo` once `TestFooCPU` / `TestFooCUDA` have been setup with the device-specific variants, and all works well. The `instantiate_device_type_tests(...)` logic already deletes `TestFoo` from scope, so I don't see a problem with deleting generic tests from this base class as well (CI will prove me right or wrong ofc).

**Side note:** I was encountering a weird race condition where sometimes the custom `setUpClass()` / `tearDownClass()` defined & swapped in [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L940-L955)) would be used, and sometimes it wouldn't. This non-deterministic behavior was called out previously by @ngimel here:
4a47dd9b3f/test/inductor/test_torchinductor_dynamic_shapes.py (L128-L130)

To address this, I moved this block of logic to before the first call to `instantiate_test()`, as that method queries for the primary device, and the primary device identification logic may manually invoke `setUpClass()` (see [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L381-L384))). Goal: define the `setUpClass()` / `tearDownClass()` we want for correctness before they're ever called. This seems to work and the behavior is deterministic now AFAICT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151129
Approved by: https://github.com/janeyx99, https://github.com/masnesral, https://github.com/malfet
2025-04-15 20:13:26 +00:00
f76b7ef33c Add error check for out variant of tensordot function with requries_grad tensor (#150270)
Fixes #147846. Previously there is no error out under out variant of`tensordot` while `requires_grad=True`. This can cause potential issue when out tensor is part of a computation graph.

Enforces the out variant of tensordot to run without setting `requries_grad=True`. Change same to #117067

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150270
Approved by: https://github.com/soulitzer
2025-04-14 18:43:14 +00:00
8910e4f2bb Fix 32-bit indexing overflows in ReducedPrecisionGemV (#150949)
By chaining `lda` type from `int` to  ~~`long`~~ `int64_t`

Add regression test (but probably restrict it to CPUs (or may be skip float32 testing on GPUs)

Fixes https://github.com/pytorch/pytorch/issues/150637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150949
Approved by: https://github.com/Skylion007
2025-04-11 20:55:20 +00:00
5471e80fb4 Remove guard_size_oblivious from vector_norm decomposition. (#148809)
This PR remove the usage of guard_size_oblivious in vector_norm by inlining it in the runtime check,
this prevent any data dependent error from ever appearing here at the locations where guard_size_oblivious
used to exist. Before this PR it used to break potentially. This is NOT BC breaking or changing of semantics from eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148809
Approved by: https://github.com/bobrenjc93
2025-04-10 16:19:00 +00:00
881d99495d Add more check for torch.ormqr (#150759)
As the title statd.

Please refer to https://github.com/pytorch/pytorch/issues/150674 for more info.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150759
Approved by: https://github.com/lezcano
2025-04-08 08:26:05 +00:00
d0026fa138 [ROCm][TunableOp] Fix UT race condition and reduce UT duration. (#150463)
This PR fixes two race conditions that occur when UT tests are run:
- In a particular order within a single shard.
- Concurrently in multiple shards. Each test now gets a unique filename that depends on the test name.

There were two other minor improvements to the UTs:
- matmul_offline_mgpu could occasionally fail if run on 8 GPUs. Criteria was relaxed.
- bmm_tunableop_rocm checks that the rotating buffer is not zero. Otherwise, the test is not useful.

Additionally, several UTs took over 1 minute to run. Their duration was reduced by a combination of setting max tuning iterations to one, setting the rotating buffer size to zero, and/or reducing the matrix dimensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150463
Approved by: https://github.com/jeffdaily
2025-04-04 01:12:03 +00:00
ca2ffc23ab [ROCm][TunableOp] Stricter unit tests for online and offline tuning (#150142)
Improvements to unit tests and warnings for unsupported cases in offline tuning. Here are more details:
- Previously we only compared the OpSig for the untuned vs. tuned entries. This was not strict enough so we now compare OpSig+ParamSig.
- The main offline and online UTs are now stricter to make sure we exercise the code paths for the four combinations of transA and transB.
- Offline tuning does not support some tensor shapes. Emit warning and skip tuning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150142
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-31 04:12:08 +00:00
45b11730f1 [ROCm][TunableOp] TunableOp Context Manager for unit tests (#149930)
This PR is cleanup only. There are no feature changes or bug fixes.

We create a TunableOp context manager for setting up and cleanup. We re-write TunableOp unit tests in terms of this context manager. Ultimately reduces the amount of copy-paste code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149930
Approved by: https://github.com/jeffdaily
2025-03-26 02:59:58 +00:00
01b1d1f91b [ROCm][TunableOp] Fix offline tuning for ScaledGEMM. (#149677)
The main purpose of this PR is to fix offline tuning for ScaledGEMM. The previous UT passed because it was not strict enough. Additionally:
- All the offline tuning tests now do a comparison with the online results to ensure that ParamSignature match.
- We raise an error if submatrices are encountered as this is only supported in online tuning mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149677
Approved by: https://github.com/jeffdaily
2025-03-22 02:22:13 +00:00
37bb7f79c6 [ROCm][TunableOp] Unit test for TunableOp BLAS logging. (#148982)
Add unit test for new TunableOp BLAS logging feature.

Requires this PR to be merged in first: https://github.com/pytorch/pytorch/pull/148979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148982
Approved by: https://github.com/jeffdaily
2025-03-19 20:57:19 +00:00
11d4438a5f [ROCm][TunableOp] More TF32 support. (#149088)
This PR includes additional enhancements to TF32 support in TunableOp.
- OpSignature now differentiates between float32 and tf32 data types.
- Offline tuning now supports TF32.
- Unit tests for online and offline tuning of TF32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149088
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-03-19 00:26:20 +00:00
a0ac63cbd9 [BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257
Approved by: https://github.com/jansel
2025-03-18 00:46:07 +00:00