This PR fixes a bug in the implementation of `apply_triu_tril_single` where using extremely large values for the diagonal argument (e.g. `diagonal=9223372036854775807`) could result in integer overflow and incorrect results. The masking logic is re-written to avoid this issue by always iterating over all columns, ensuring correctness even for large or extreme diagonal values.
Example of the original incorrect behavior:
```python
a = torch.ones(5,5)
torch.triu(a, 9223372036854775807)
# Before:
# tensor([[0., 0., 0., 0., 0.],
# [1., 1., 1., 1., 1.],
# [1., 1., 1., 1., 1.],
# [1., 1., 1., 1., 1.],
# [1., 1., 1., 1., 1.]])
```
The new implementation guards against overflow and produces correct results for all valid input values.
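For illustration, a minimal pure-PyTorch sketch of the overflow-safe comparison (the real fix is in the C++ kernel; `triu_mask_safe` is a made-up helper name):
```python
import torch

def triu_mask_safe(rows, cols, diagonal):
    row_idx = torch.arange(rows).unsqueeze(1)   # shape [rows, 1]
    col_idx = torch.arange(cols).unsqueeze(0)   # shape [1, cols]
    # Compare (col - row) >= diagonal instead of col >= row + diagonal,
    # so no addition with an extreme `diagonal` value can overflow int64.
    return (col_idx - row_idx) >= diagonal

mask = triu_mask_safe(5, 5, 9223372036854775807)
print(mask.any())  # tensor(False): nothing is kept, matching the expected all-zero triu result
```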
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153240
Approved by: https://github.com/albanD
Fixes #161729
Written by codex
This won't produce contiguous inputs for all einsum applications: because we flatten all right-only and left-only dimensions, if right and left operand dimensions are interleaved in the output, we cannot (with the current algorithm) produce a contiguous output. However, it works for common cases like the one in the linked issue. Let's see what CI says.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161755
Approved by: https://github.com/malfet, https://github.com/albanD
**Summary**
This issue proposes implementing a CUDA kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU. On CUDA, the fallback path uses an unfused .mul().sum() pattern in quantization.py, which is less efficient for inference. https://github.com/pytorch/pytorch/issues/158849
**Motivation**
A fused GPU kernel for aten._weight_int8pack_mm would:
- Eliminate reliance on the .mul().sum() fallback in quantization.py
- Improve performance for quantized inference on CUDA
- Extend Inductor’s GPU quantization support across more workloads
**Implementation**
- Implement a Triton kernel (a dense reference sketch follows this bullet list) for:
```
out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]
where:
x: [B, K] float32
w: [N, K] int8
scale: [N] float32
out: [B, N] float32
```
- Integrate the kernel with register_woq_mm_ops() in torch/_inductor/quantized_lowerings.py
- Route it conditionally in quantization.py where GPU currently falls back to .mul().sum()
- Add unit tests comparing results to the reference fallback path
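For reference, a dense emulation of the formula above in the spirit of the unfused fallback (a sketch only; `weight_int8pack_mm_ref` is a made-up name, not the Triton kernel or the exact fallback code):
```python
import torch

def weight_int8pack_mm_ref(x, w_int8, scale):
    # x: [B, K] float32, w_int8: [N, K] int8, scale: [N] float32 -> out: [B, N] float32
    # out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]
    return (x.unsqueeze(1) * w_int8.to(torch.float32).unsqueeze(0)).sum(-1) * scale

x = torch.randn(8, 64)
w = torch.randint(-128, 128, (16, 64), dtype=torch.int8)
scale = torch.rand(16)
out = weight_int8pack_mm_ref(x, w, scale)
print(out.shape)  # torch.Size([8, 16])
```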
Test Plan:
```
buck2 run 'fbcode//mode/opt' :linalg test_linalg.TestLinalgCUDA.test__int8_mm_m_64_k_64_n_64_compile_True_slice_True_cuda
```
Log: P1882799769
```
buck2 test 'fbcode//mode/opt' caffe2/test:linalg
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399722424741/
Benchmark Results:
```
**[Shape B=256, K=1024, N=512]**
CPU and CUDA outputs match
Max abs diff: 2.59e-04, max rel diff: 0.75
CPU: 144.14 ms, CUDA: 303.67 µs
Speedup: ×474.6
**[Shape B=512, K=2048, N=1024]**
CPU and CUDA outputs match
Max abs diff: 5.49e-04, max rel diff: 0.15
CPU: 1173.27 ms, CUDA: 2.40 ms
Speedup: ×488.5
```
Rollback Plan:
Differential Revision: D79042656
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159325
Approved by: https://github.com/danielvegamyhre, https://github.com/jerryzh168
**Summary**
`_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). It's because the kernel computes the output buffer address as
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int`, so when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.
**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
### Description
This PR enables TF32 as the fp32 internal precision for matmul/linear/conv in the `mkldnn` backend. Since we refined the fp32 precision API in https://github.com/pytorch/pytorch/pull/125888, we can easily extend it to support TF32 for the `mkldnn` backend.
```
torch.backends.mkldnn.matmul.fp32_precision = 'tf32'
torch.backends.mkldnn.conv.fp32_precision = "tf32"
```
The related kernel updates and UT updates are done. The wrapper `bf32_on_and_off` is updated to `reduced_f32_on_and_off`, which can run tests three times: once with reduced_f32 OFF and twice with reduced_f32 ON (covering both `bf32` ON and `tf32` ON).
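A minimal usage sketch of the extended API (illustrative; whether TF32 is actually used internally depends on CPU support):
```python
import torch

# Opt matmul and conv in the mkldnn backend into TF32 internal precision.
torch.backends.mkldnn.matmul.fp32_precision = "tf32"
torch.backends.mkldnn.conv.fp32_precision = "tf32"

a = torch.randn(128, 128)
b = torch.randn(128, 128)
out = a @ b  # may use TF32 internally on supported CPUs; inputs and outputs stay fp32
```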
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157520
Approved by: https://github.com/mingfeima, https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/issues/154312
Fix logdet returning finite values for singular matrices on CUDA (https://github.com/pytorch/pytorch/issues/154312)
PyTorch's logdet function returns mathematically incorrect finite values for
singular matrices on CUDA devices instead of the expected -inf. This occurs
because cuSOLVER and LAPACK produce tiny non-zero diagonal elements (~1e-16)
instead of exact zeros for singular matrices.
**Problem:**
The singular matrix from issue https://github.com/pytorch/pytorch/issues/154312 returns finite values instead of -inf.
**Solution:**
Implemented NumPy-style two-tier singularity detection with GPU sync point removal (a rough sketch follows the list):
1. **Primary detection**: Use LAPACK's built-in singularity detection via info parameter
2. **Backup detection**: Apply threshold-based detection for numerical edge cases
3. **Zero GPU sync points**: Eliminated all .item(), std::get<0>(), and scalar extractions
4. **Pure tensor operations**: All computations use tensor operations throughout
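A rough sketch of the two-tier idea in pure tensor operations (assumed form with made-up names, not the actual ATen kernel; permutation parity for the sign is omitted):
```python
import torch

def two_tier_singularity_mask(lu_diag, info, rtol=None):
    # Tier 1: trust the factorization's `info` flag (exact zero pivot found).
    # Tier 2: threshold tiny-but-nonzero diagonal entries relative to the largest one.
    if rtol is None:
        rtol = torch.finfo(lu_diag.dtype).eps
    abs_diag = lu_diag.abs()
    tiny = abs_diag <= rtol * abs_diag.max(dim=-1, keepdim=True).values
    return (info != 0) | tiny.any(dim=-1)

def masked_slogdet(lu_diag, info):
    logabsdet = lu_diag.abs().log().sum(dim=-1)
    sign = lu_diag.sign().prod(dim=-1)  # permutation parity ignored in this sketch
    singular = two_tier_singularity_mask(lu_diag, info)
    # No .item() calls: masking is done with tensor ops so there is no GPU sync.
    logabsdet = torch.where(singular, torch.full_like(logabsdet, float("-inf")), logabsdet)
    sign = torch.where(singular, torch.zeros_like(sign), sign)
    return sign, logabsdet
```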
**Performance Impact:**
Based on comprehensive benchmarking across matrix sizes and data types:
- **Overall Impact**: 0.85× average speedup (+18.0% overhead)
- **CPU Performance**: 0.84× average speedup (+18.8% overhead)
- **CUDA Performance**: 0.85× average speedup (+17.3% overhead)
**Performance Trade-offs:**
- **Small matrices (16×16, 64×64)**: Higher overhead due to tensor operation setup costs
- **Large matrices (512×512, 2048×2048)**: Near-zero overhead, with some cases showing slight improvements
- **GPU sync elimination**: Removes expensive GPU→CPU synchronization bottlenecks
**Results:**
- ✅ All singular matrices now correctly return -inf on both CPU and CUDA
- ✅ Original issue https://github.com/pytorch/pytorch/issues/154312 matrix now works correctly
- ✅ Results match NumPy's slogdet behavior exactly
- ✅ Zero GPU synchronization points for improved performance
- ✅ Comprehensive edge case testing added
**Verification:**
- Before: `torch.linalg.slogdet(singular_matrix)` → finite values (incorrect)
- After: `torch.linalg.slogdet(singular_matrix)` → `(sign=0, logabsdet=-inf)` ✅
The implementation uses pure tensor operations to eliminate GPU sync points while
maintaining robust singularity detection through a two-tier approach.
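As a quick sanity check of the intended post-fix behavior (a sketch; any exactly singular matrix will do, the matrix from the issue is not reproduced here):
```python
import torch

# Any exactly singular matrix works for a sanity check; the issue's matrix
# (which yields tiny non-zero LU diagonals on CUDA) is not reproduced here.
singular = torch.tensor([[1., 2.], [2., 4.]])     # rank-1, det == 0
print(torch.linalg.slogdet(singular))             # expected: sign=0., logabsdet=-inf
if torch.cuda.is_available():
    print(torch.linalg.slogdet(singular.cuda()))  # expected: sign=0., logabsdet=-inf
```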
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157910
Approved by: https://github.com/lezcano, https://github.com/IvanYashchuk, https://github.com/albanD
Co-authored-by: Claude <noreply@anthropic.com>
9df2e8020f/18e8d4b13b0 (43980565863-box)
started OOMing after switching to CUDA 12.8.
Maybe it's because I made some changes to fix the per-process memory fraction, so each process has less memory.
```
2025-06-12T15:29:50.4998758Z FAILED [0.0124s] test_linalg.py::TestLinalgCUDA::test_svd_memory_allocation_cuda_complex128 - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.10 GiB. GPU 0 has a total capacity of 7.43 GiB of which 6.85 GiB is free. Process 80272 has 68.75 MiB memory in use. Process 83346 has 68.75 MiB memory in use. Process 83365 has 374.75 MiB memory in use. Process 83384 has 70.75 MiB memory in use. 2.90 GiB allowed; Of the allocated memory 240.00 MiB is allocated by PyTorch, and 2.00 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155811
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman, https://github.com/eqy
Fixes #132031
## Test Result
```python
In [1]: import torch
...: torch.manual_seed(0)
...: torch.cuda.manual_seed(0)
...: a = torch.randn(3, 4)
...: b = torch.randn(3, 4)
...: torch.cross(a, b, out=a)
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[1], line 6
4 a = torch.randn(3, 4)
5 b = torch.randn(3, 4)
----> 6 torch.cross(a, b, out=a)
RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
```
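A possible workaround hinted at by the error message (a sketch; cloning the input removes the overlap so writing into `a` is allowed):
```python
import torch

a = torch.randn(3, 4)
b = torch.randn(3, 4)
# Cloning the input removes the memory overlap, so the result may be written into `a`.
torch.cross(a.clone(), b, dim=0, out=a)
```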
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154999
Approved by: https://github.com/lezcano
This PR adds support for submatrices in offline tuning for:
- GEMM
- GEMM and bias
- ScaledGEMM
- Batch Strided GEMM
New UTs cover submatrices. Submatrices for the strided batch API are not part of this PR and will be handled separately.
There is also a bug fix for offline tuning for full matrix for GEMM and bias in the `NT` case. Offline and online UTs were updated to cover this corner case.
To improve code readability, the definitions of transA and transB were swapped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151138
Approved by: https://github.com/jeffdaily
Finishes up the work started in #121686 + adds test
Update: this was not as straightforward as I originally imagined. Context below.
**TL;DR:** `TestFoo{CPU, CUDA}` now actually derive from `TestFoo`! Also, `{CPU, CUDA}TestBase` setup / teardown logic is now always called (it is required to set the primary device), regardless of whether `super().setUpClass()` / `super().tearDownClass()` are called or not.
**Background:** The typical way to get device-specific tests is to write a generic `TestFoo` and call `instantiate_device_type_tests(TestFoo, locals())` to get `TestFooCPU`, `TestFooCUDA`, etc. After this, generic tests (e.g. `TestFoo.test_bar()`) become `TestFooCPU.test_bar_cpu()` / `TestFooCUDA.test_bar_cuda()`.
Behind the scenes, this was historically accomplished by creating a `TestFooCUDA` that derives from both a `CUDATestBase` and an *empty class* called `TestFoo_base`. This `TestFoo_base` has the same bases as `TestFoo`, but none of the test functions (e.g. `test_bar()`). The documented reason for this is to avoid things like a derived `TestFooCUDA.test_bar()` being discovered in addition to the real device-specific test `TestFooCUDA.test_bar_cuda()`.
(1) One reason this matters is that it should be possible to call e.g. `super().setUpClass()` from a custom setup / teardown classmethod. If the generated TestFooCUDA does not derive from TestFoo, but instead derives from the empty class described above, this syntax does not work; in fact there is no way to form a proper `super()` call that works across the device-specific test variants. Here's an example that breaks in the OpInfo tests:
070f389745/test/test_ops.py (L218-L221)
(2) Further, there is some precedent within a custom `setUpClass()` impl for storing things on the `cls` object to be accessed at test time. This must be the device-specific test class (`TestFooCUDA`) and not `TestFoo` for this to work. As an example, the open device registration tests load a module during setup and use it in the test logic:
070f389745/test/test_cpp_extensions_open_device_registration.py (L63-L77) and 070f389745/test/test_cpp_extensions_open_device_registration.py (L79-L80)
To accomplish both (1) and (2) at the same time, I decided to revisit the idea of utilizing a proper inheritance hierarchy for `TestFoo` -> `{TestFooCPU, TestFooCUDA}`. That is: have TestFooCPU / TestFooCUDA **actually** derive from `TestFoo`. This achieves both (1) and (2). The only thing left is to make sure the generic tests (e.g. `TestFoo.test_bar()`) are not discoverable, as was the stated reason for diverging from this in the first place. It turns out we can simply `delattr()` these generic tests from `TestFoo` once `TestFooCPU` / `TestFooCUDA` have been setup with the device-specific variants, and all works well. The `instantiate_device_type_tests(...)` logic already deletes `TestFoo` from scope, so I don't see a problem with deleting generic tests from this base class as well (CI will prove me right or wrong ofc).
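A stripped-down sketch of the resulting structure (illustrative names only; the real logic lives in `instantiate_device_type_tests` and the device-type test bases):
```python
import unittest

class TestFoo(unittest.TestCase):
    def test_bar(self):
        self.assertTrue(True)

class CUDATestBase(unittest.TestCase):
    device_type = "cuda"

# The per-device class now really derives from TestFoo, so super().setUpClass()
# in a custom classmethod resolves through the expected hierarchy.
TestFooCUDA = type(
    "TestFooCUDA",
    (TestFoo, CUDATestBase),
    {"test_bar_cuda": TestFoo.__dict__["test_bar"]},
)

# Remove the generic test from the base so only test_bar_cuda is discovered.
delattr(TestFoo, "test_bar")
```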
**Side note:** I was encountering a weird race condition where sometimes the custom `setUpClass()` / `tearDownClass()` defined & swapped in [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L940-L955)) would be used, and sometimes it wouldn't. This non-deterministic behavior was called out previously by @ngimel here:
4a47dd9b3f/test/inductor/test_torchinductor_dynamic_shapes.py (L128-L130)
To address this, I moved this block of logic to before the first call to `instantiate_test()`, as that method queries for the primary device, and the primary device identification logic may manually invoke `setUpClass()` (see [here](4a47dd9b3f/torch/testing/_internal/common_device_type.py (L381-L384))). Goal: define the `setUpClass()` / `tearDownClass()` we want for correctness before they're ever called. This seems to work and the behavior is deterministic now AFAICT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151129
Approved by: https://github.com/janeyx99, https://github.com/masnesral, https://github.com/malfet
Fixes #147846. Previously there was no error for the out variant of `tensordot` when `requires_grad=True`. This can cause potential issues when the out tensor is part of a computation graph.
Enforces that the out variant of tensordot runs without `requires_grad=True`. The change is the same as in #117067.
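A minimal illustration of the kind of call that should now raise (a sketch; the exact error message may differ):
```python
import torch

a = torch.randn(3, 4, requires_grad=True)
b = torch.randn(4, 5)
out = torch.empty(3, 5)
try:
    torch.tensordot(a, b, dims=([1], [0]), out=out)
except RuntimeError as e:
    print(e)  # the out= variant should now refuse to run when grad would be required
```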
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150270
Approved by: https://github.com/soulitzer
This PR removes the usage of guard_size_oblivious in vector_norm by inlining it in the runtime check; this prevents any data-dependent error from ever appearing at the locations where guard_size_oblivious used to exist. Before this PR, it could potentially break. This is NOT BC-breaking and does not change semantics from eager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148809
Approved by: https://github.com/bobrenjc93
This PR fixes two race conditions that occur when the unit tests are run:
- In a particular order within a single shard.
- Concurrently in multiple shards. Each test now gets a unique filename that depends on the test name.
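A sketch of the unique-filename idea from the bullet above (hypothetical helper; the actual change lives in the TunableOp UTs):
```python
import os
import torch

def unique_tunableop_filename(test_name: str) -> str:
    # Include the test name (and the pid for extra safety) so concurrently
    # running shards never write tuning results to the same file.
    return f"tunableop_results_{test_name}_{os.getpid()}.csv"

if torch.cuda.is_available():
    torch.cuda.tunable.set_filename(unique_tunableop_filename("matmul_offline_mgpu"))
```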
There were two other minor improvements to the UTs:
- matmul_offline_mgpu could occasionally fail if run on 8 GPUs. The criteria were relaxed.
- bmm_tunableop_rocm checks that the rotating buffer is not zero. Otherwise, the test is not useful.
Additionally, several UTs took over 1 minute to run. Their duration was reduced by a combination of setting max tuning iterations to one, setting the rotating buffer size to zero, and/or reducing the matrix dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150463
Approved by: https://github.com/jeffdaily
Improvements to unit tests and warnings for unsupported cases in offline tuning. Here are more details:
- Previously we only compared the OpSig for the untuned vs. tuned entries. This was not strict enough, so we now compare OpSig+ParamSig.
- The main offline and online UTs are now stricter to make sure we exercise the code paths for the four combinations of transA and transB.
- Offline tuning does not support some tensor shapes. A warning is emitted and tuning is skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150142
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
This PR is cleanup only. There are no feature changes or bug fixes.
We create a TunableOp context manager for setup and cleanup, and rewrite the TunableOp unit tests in terms of it. This ultimately reduces the amount of copy-paste code.
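A minimal sketch of what such a context manager could look like (assumed shape; the actual helper lives in the test suite), built on the public `torch.cuda.tunable` controls:
```python
import contextlib
import torch

@contextlib.contextmanager
def tunableop(max_tuning_iterations=1, filename=None):
    # Enable TunableOp for the duration of a test and restore the previous state afterwards.
    prev_enabled = torch.cuda.tunable.is_enabled()
    torch.cuda.tunable.enable(True)
    torch.cuda.tunable.set_max_tuning_iterations(max_tuning_iterations)
    if filename is not None:
        torch.cuda.tunable.set_filename(filename)
    try:
        yield
    finally:
        torch.cuda.tunable.enable(prev_enabled)

# Usage inside a test body:
# with tunableop(max_tuning_iterations=1, filename="results_test_bmm.csv"):
#     torch.bmm(a, b)
```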
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149930
Approved by: https://github.com/jeffdaily
The main purpose of this PR is to fix offline tuning for ScaledGEMM. The previous UT passed because it was not strict enough. Additionally:
- All the offline tuning tests now do a comparison with the online results to ensure that the ParamSignatures match.
- We raise an error if submatrices are encountered, as these are only supported in online tuning mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149677
Approved by: https://github.com/jeffdaily