17271 Commits

e63476b236 [MTIA Runtime] Add foreach_div ops to native_functions.yaml (#162732)
Summary: Quick fix for runtime support on foreach_div, see D81274963. Fixed an issue that I created in that diff so that the CIs pass.

Test Plan:
CIs created in D81274963 and D81286593 pass.

Added some logs to all the foreach_div ops in [aten_mtia_ops.cpp](https://www.internalfb.com/code/fbsource/[c56272ba042c43c65517dcac254364cf732fcfa9]/fbcode/mtia/host_runtime/torch_mtia/aten_mtia_ops.cpp?lines=3676). We can see that the correct MTIA kernels are being invoked in the tests.
https://www.internalfb.com/intern/testinfra/testrun/15481123829281588

Rollback Plan:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162732
Approved by: https://github.com/danielhou0515
2025-09-17 17:44:03 +00:00
928ac57c2a Upgrades dlpack to v1.1 to include fp8/fp4 (#162195)
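A minimal sketch of what the dtype coverage enables (an illustration, not the PR's test; assumes the upgraded DLPack path accepts fp8 dtypes such as torch.float8_e4m3fn):
```python
import torch

# Round-trip an fp8 tensor through the DLPack protocol; before the v1.1
# upgrade, fp8 dtypes could not be described in a DLPack capsule.
x = torch.randn(4).to(torch.float8_e4m3fn)
y = torch.from_dlpack(x)
assert y.dtype == torch.float8_e4m3fn
```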
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162195
Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/Skylion007, https://github.com/rgommers
2025-09-17 16:39:11 +00:00
9b7a8c4d05 [cuDNN][SDPA][submodule] Roll-back cuDNN frontend upgrade, update Meta registration (#163104)
For https://github.com/pytorch/torchtitan/issues/1713

Also note that we will need to roll back the cuDNN frontend upgrade in 2.9, as it currently introduces a segmentation fault by assuming tensors have their strides and sizes populated at graph creation time: 1a7b4b78db/include/cudnn_frontend/node/sdpa_support_surface.h (L447)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163104
Approved by: https://github.com/drisspg
2025-09-17 15:48:54 +00:00
c52c4052d8 [WOQ] Integrate CUDA support for int8pack_mm woq optimization pattern (#161680)
Summary:
What: Enables CUDA support for the int8pack_mm WOQ optimization pattern by:

- Fixing dtype conversion in weight_int8pack_mm_kernel to match CPU
- Updating pattern validation to accept CUDA devices
- Adding test coverage for CUDA

Why: Extend WOQ to more device types

Test Plan:
```
buck2 run 'fbcode//mode/opt' //caffe2/test/inductor:cuda_select_algorithm
```

Rollback Plan:

Reviewed By: jerryzh168

Differential Revision: D80882442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161680
Approved by: https://github.com/jerryzh168
2025-09-17 10:24:13 +00:00
7a3791c5d0 Make torch.cuda.set_rng_state() and torch.cuda.get_rng_state() work during stream capture. (#162505)
Note that this works only in a limited case, where you *don't* change the seed, but change only the offset of the philox generator. This captures the main use case we're interested in: Rewinding the RNG to a previous state. This is done by torch.utils.checkpoint.checkpoint in particular.

Calls to increase() change only the offset, not the seed. Thus, we allow for "no-op" calls to set_seed where the new seed is the same as the old seed. If a user does happen to try to change the seed during stream capture, they will receive an error.
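A hedged sketch of the supported pattern (assumes a CUDA device; illustration only, not the PR's test):
```python
import torch
import torch.nn.functional as F

x = torch.randn(16, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    state = torch.cuda.get_rng_state()       # snapshot: seed unchanged, offset recorded
    y1 = F.dropout(x, p=0.5, training=True)
    torch.cuda.set_rng_state(state)          # rewind: same seed, earlier offset -> allowed
    y2 = F.dropout(x, p=0.5, training=True)  # replays the same random mask as y1
g.replay()
```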

Fixes #162504

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162505
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/eellison, https://github.com/eee4017, https://github.com/cyyever
2025-09-17 03:57:34 +00:00
e28983be76 Add decomp rule to assert_tensor_metadata for BatchedTensors (#163008)
Whenever there is a device move, export introduces the assert_tensor_metadata aten operator to guard for device specialization. This aten op didn't work with vmap because we hadn't registered an explicit decomp rule that skips the BatchedTensor wrapper and calls the op on the underlying tensor.

Differential Revision: [D82483979](https://our.internmc.facebook.com/intern/diff/D82483979)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163008
Approved by: https://github.com/huydhn
2025-09-17 03:49:41 +00:00
f6ea41ead2 [CPU] Adding missing brackets in native MaxUnpool log (#163039)
As stated in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163039
Approved by: https://github.com/Skylion007
2025-09-16 21:28:15 +00:00
9494b09549 bf16 support for fused_moving_avg_obs_fake_quant() op (#162620)
Enables bf16 support for the `torch.fused_moving_avg_obs_fake_quant()` op on CUDA

**testing**
`python test/quantization/pt2e/test_quantize_pt2e.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162620
Approved by: https://github.com/andrewor14, https://github.com/jerryzh168
2025-09-16 21:22:44 +00:00
66308fb470 Revert "[ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)"
This reverts commit cef815dc2ce37f98e01a6469a15b69f15995c1f9.

Reverted https://github.com/pytorch/pytorch/pull/162998 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it seems to break a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/162998#issuecomment-3300280242))
2025-09-16 20:39:41 +00:00
0819de412d Add a new API torch.xpu.can_device_access_peer for Intel GPU (#162705)
# Motivation
Aligned with other backends, this PR introduces a new API `torch.xpu.can_device_access_peer`, which is used in vllm distributed [scenarios](2048c4e379/vllm/distributed/device_communicators/custom_all_reduce.py (L37))
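A hedged usage sketch, assuming the API mirrors `torch.cuda.can_device_access_peer(device, peer_device)`:
```python
import torch

if torch.xpu.is_available() and torch.xpu.device_count() >= 2:
    if torch.xpu.can_device_access_peer(0, 1):
        print("xpu:0 can directly access peer memory on xpu:1")
```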

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162705
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2025-09-16 18:00:22 +00:00
6db37d7206 [MPS] zeros like, narrow and enable tests (#163011)
Enables zeros_like and narrow and enables tests for SparseMPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163011
Approved by: https://github.com/malfet
2025-09-16 17:48:04 +00:00
559e8d1c20 [doc]: Small typos (#162982)
Small typo fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162982
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-09-16 17:42:19 +00:00
cef815dc2c [ROCm] Remove HIPBLASLT_ALLOW_TF32 from codebase (#162998)
A few UT failures are caused by `HIPBLASLT_ALLOW_TF32`

Fixes #157094, #157093, #157092, #157091, #157064, #157063, #157062, #157061, #157042, #157041, #157039, #157004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162998
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-16 12:48:45 +00:00
fb1e0321da [CUDA] fix shared memory race in reduce_kernel (#162995)
Reported by compute-sanitizer. It looks like `block_y_reduce` and `block_x_reduce` both use `shared_memory` for temporaries without synchronization between them.

reproduces in e.g.,

`compute-sanitizer --tool=racecheck python test/test_matmul_cuda.py -k test_scaled_mm_vs_emulated_block_wise_float32_lhs_block_128_rhs_block_1_cuda` (note that this test requires an H100 to run unless the non-emulated cuBLAS implementation is commented out)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162995
Approved by: https://github.com/msaroufim
2025-09-16 07:53:21 +00:00
76fa381eef [mps] Take into account offset (#163021)
Fixes an issue when running AOTI + MPS on the voxtral model
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163021
Approved by: https://github.com/malfet
2025-09-16 07:14:33 +00:00
29ea6254a0 [Bug] Add more boundary check for FractionalMaxPool3d (#161876)
This PR aims to fix the bug mentioned at [#161853](https://github.com/pytorch/pytorch/issues/161853#issuecomment-3240695121)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161876
Approved by: https://github.com/malfet
2025-09-16 06:59:02 +00:00
9009c4da39 [functional] Avoid duplicate custom get_device call in constructor (#162889)
Trying to reduce the number of `__torch_dispatch__` calls of FakeTensorMode in the AOT metadata collection pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162889
Approved by: https://github.com/Lucaskabela, https://github.com/zou3519
2025-09-16 05:00:19 +00:00
e900a274e5 Add CUDA_KERNEL_ASSERT_PRINTF, a more flexible CUDA_KERNEL_ASSERT_MSG (#160129)
This new assertion helper bundles a printf call with the assertion. The goal is to make changes to instrument asserts with device-side information more intuitive and less error-prone. (See the printf call in ATen/native/cuda/Repeat.cu.) Parametrized error messages are a substantial improvement in debuggability because they show the mismatched device-side values. This lets us avoid a whole cycle of rebuilding + re-running failing training workflows.

We include file, line number, function, and failing condition in the printf (along with the message provided by the user). The format matches the format of the message output by `__assert_fail`. There's also an easy-to-grep-for keyword `CUDA_KERNEL_ASSERT` in the message.

I'm following the existing patterns of arch-specific macros - e.g., on ROCm, this is just a call to abort(), just like the other `CUDA_KERNEL_ASSERT*` variations. I'd appreciate any thoughts on architecture-specific testing (most likely on the OSS side).

# Alternatives
* We could just update `CUDA_KERNEL_ASSERT_MSG`. That would mean introducing `printf` calls from the kernel where there weren't any before, though. This seems like a bad idea because of the performance sensitivity.
* If we want to move more slowly here, I could instrument more `CUDA_KERNEL_ASSERT` callsites without a macro, similar to https://github.com/pytorch/pytorch/pull/157996. But the main downside here is the performance hit, so let's have an organized way of doing it first.

# Risks/Problems
* We're shoving a lot of stuff into this printf. If a filename (at compile-time) contains `%s`, we will end up dereferencing whatever value was pushed in. On a CPU this can cause a segfault. I don't know how it behaves on a GPU.
* Adding printf calls can have a performance impact because of increased register and stack usage. I did not see this play out in practice (see "benchmarks" below). However, there are changes to the generated PTX that could result in performance problems later (see "changes in generated PTX" below).

# Benchmarks

* I ran the following benchmarks several times on a host with an A100: https://gist.github.com/mjkatmeta/e5494d949204a2afe2d43c452b99424f
* Results are here; I couldn't find a significant difference before or after: https://gist.github.com/mjkatmeta/0f99ec27bb91214fb2cc7f612938d431

# Change in generated PTX

This is the easiest way I found to run nvcc over just Repeat.cu (this is a buck2 target that includes just a copy of Repeat.cu):
```
buck2 build --show-output scripts/mjk/ai_training/cuda_benchmarks:repeat_cuda
# then use the printed .so file like this:
~/fbsource/third-party/cuda/cuda_12.8.0/x64-linux/bin/cuobjdump -ptx ../buck-out/v2/gen/fbcode/028bde1acfaba823/scripts/mjk/ai_training/cuda_benchmarks/__repeat_cuda__/libscripts_mjk_ai_training_cuda_benchmarks_repeat_cuda.so
```

## with printf
This is the version of the code that appears in this diff:

https://gist.github.com/mjkatmeta/5d18d48282d46b2240d946b335052b9a

## without printf
I recompiled, replacing `CUDA_KERNEL_ASSERT_PRINTF(...)` in Repeat.cu with:
```
CUDA_KERNEL_ASSERT(result_size == cumsum_ptr[size - 1]);
```
https://gist.github.com/mjkatmeta/480df4b3a122e7b326554dd15ebb7c9d

(Both of these are annotated with `// CHAR ARRAY:` comments to make the string constants easier to read.)

Test Plan:
Running this minimal test case:

```
import torch
def main():
    x = torch.ones(10, dtype=torch.int64, device="cuda:0")
    torch.repeat_interleave(x, x, output_size=0)

if __name__ == "__main__":
    main()
```

Now we see the new message (from printf) alongside the assert failure:

```
$ buck2 run fbcode//scripts/darshanr/repeat_interleave_errors:repeat_interleave_errors
[...]
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [31,0,0]: Assertion failed: `result_size == cumsum_ptr[size - 1]`: Invalid input! In `repeat_interleave`, the `output_size` argument (0) must be the same as the sum of the elements in the `repeats` tensor (10).

fbcode/caffe2/aten/src/ATen/native/cuda/Repeat.cu:25: compute_cuda_kernel: block: [0,0,0], thread: [384,0,0] Assertion `result_size == cumsum_ptr[size - 1]` failed.
[...]
```

Rollback Plan:

Reviewed By: mradmila

Differential Revision: D79310684

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160129
Approved by: https://github.com/ngimel
2025-09-16 00:23:48 +00:00
0def79fdd9 [ROCm] fix conv relu fusion (#162856)
Fixes #162816.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162856
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-15 22:49:32 +00:00
8e05749d5c Fix integer overflow bug in triu/tril for large diagonal values (#153240)
This PR fixes a bug in the implementation of `apply_triu_tril_single` where using extremely large values for the diagonal argument (e.g. `diagonal=9223372036854775807`) could result in integer overflow and incorrect results. The masking logic is re-written to avoid this issue by always iterating over all columns, ensuring correctness even for large or extreme diagonal values.

Example of the original incorrect behavior:
```python
a = torch.ones(5,5)
torch.triu(a, 9223372036854775807)
# Before:
# tensor([[0., 0., 0., 0., 0.],
#         [1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1.]])
```

The new implementation guards against overflow and produces correct results for all valid input values.
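For reference, the expected result after the fix: no element satisfies `col - row >= diagonal` for such a large diagonal, so the output is all zeros.
```python
import torch

a = torch.ones(5, 5)
torch.triu(a, 9223372036854775807)
# After:
# tensor([[0., 0., 0., 0., 0.],
#         [0., 0., 0., 0., 0.],
#         [0., 0., 0., 0., 0.],
#         [0., 0., 0., 0., 0.],
#         [0., 0., 0., 0., 0.]])
```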
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153240
Approved by: https://github.com/albanD
2025-09-15 18:07:19 +00:00
0826aafa04 [ROCm/Windows] Support aotriton for scaled_dot_product_attention on Windows. (#162330)
Enables flash attention and/or memory-efficient attention on Windows with scaled_dot_product_attention via aotriton.
Already tested to be working on Windows with TheRock.

Steps to enable: simply set `USE_FLASH_ATTENTION=1` and `USE_MEM_EFF_ATTENTION=1` as usual. See https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py#L578-L604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162330
Approved by: https://github.com/jeffdaily

Co-authored-by: Scott Todd <scott.todd0@gmail.com>
2025-09-15 16:13:03 +00:00
bf6b40da3e fix deterministic scatter_add path for multi-d tensors (#162866)
Previously, `select` didn't work correctly for tensors with more than 2 dimensions.
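A hedged repro sketch (assumes a CUDA device and that the deterministic scatter_add path is selected under torch.use_deterministic_algorithms):
```python
import torch

torch.use_deterministic_algorithms(True)
x = torch.zeros(2, 3, 4, device="cuda")
index = torch.zeros(2, 3, 4, dtype=torch.long, device="cuda")
src = torch.ones(2, 3, 4, device="cuda")
out = x.scatter_add(1, index, src)  # a >2-D case that previously hit the broken `select` handling
print(out.sum())                    # expected: 24.0 (every source element accumulated)
```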
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162866
Approved by: https://github.com/valentinandrei
2025-09-15 06:50:00 +00:00
b3ad8f4a9c [BUG] Fix nonzero_static crash on CUDA when the input is an empty tensor (#162578)
Fixes #162473
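A hedged repro sketch (assumes a CUDA device): `nonzero_static` on an empty input should simply return `size` rows of `fill_value` instead of crashing.
```python
import torch

x = torch.empty(0, device="cuda")
out = torch.nonzero_static(x, size=4, fill_value=-1)
print(out.shape)  # expected: torch.Size([4, 1]), filled with -1
```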

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162578
Approved by: https://github.com/ngimel
2025-09-15 05:44:15 +00:00
76e5df3866 [BE] Use fmt::format to define Conv key (#162925)
Also use `getArrayRefString` instead of having separate cases for 2D and 3D Conv
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162925
Approved by: https://github.com/Skylion007
ghstack dependencies: #162921
2025-09-15 02:44:12 +00:00
7fe1f5ea49 [BE] Delete [Ventura|Sonoma]Ops header (#162921)
This was a temporary solution to make PyTorch+MPS buildable on macOS 12, but it's no longer needed, as in 2.9+ MPS is only supported on macOS Sonoma and newer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162921
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-09-15 02:44:12 +00:00
ba5ca31676 [MPS] sparse mps any (#162885)
Add the SparseMPS dispatch key for the `any` op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162885
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-09-14 18:57:53 +00:00
8e1db46493 [MPS] enable empty like and unsqueeze for SparseMPS (#162910)
Enable empty_like and unsqueeze for SparseMPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162910
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-09-14 17:47:06 +00:00
5b9114bf19 Revert "[ROCm/Windows] Support aotriton for scaled_dot_product_attention on Windows. (#162330)"
This reverts commit 62843c14bbf694f5722fd6e1075da4792507fe42.

Reverted https://github.com/pytorch/pytorch/pull/162330 on behalf of https://github.com/atalman due to Sorry, reverting; it looks like this broke Windows nightlies, see https://github.com/pytorch/pytorch/issues/162881 ([comment](https://github.com/pytorch/pytorch/pull/162330#issuecomment-3288544921))
2025-09-13 15:43:50 +00:00
cdfa298a3b Revert "[MTIA Runtime] Add foreach_div ops to native_functions.yaml (#162732)"
This reverts commit a3f01f6418667f791f36d928f7e912eb89be2e67.

Reverted https://github.com/pytorch/pytorch/pull/162732 on behalf of https://github.com/huydhn due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162732#issuecomment-3287163750))
2025-09-12 23:52:43 +00:00
d25c35d2b2 [MPS] Fix [nan]median output for empty tensors (#162846)
It should be `NaN` rather than 0

Added respective checks to `test_empty_tensor`
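A minimal sketch of the fixed behavior (assumes an MPS device):
```python
import torch

x = torch.empty(0, device="mps")
print(torch.median(x))     # expected after the fix: tensor(nan, device='mps:0')
print(torch.nanmedian(x))  # expected after the fix: tensor(nan, device='mps:0')
```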

Fixes https://github.com/pytorch/pytorch/issues/162798
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162846
Approved by: https://github.com/dcci
2025-09-12 22:26:29 +00:00
53b8bdb977 [MPS] enable cat op for sparse (#162007)
Enable cat op for sparse on MPS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162007
Approved by: https://github.com/malfet
2025-09-12 19:07:39 +00:00
1e9ddf510f [ROCm] fix hardsigmoid op (#162758)
Currently, std::min -> ::min does not work as expected on ROCm when input values are >= 2147483648.
It can be fixed by explicitly typing it as std::min<opmath_t>.
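A hedged repro sketch: hardsigmoid clamps to 1.0 for any input >= 3, so very large values such as 2147483648 must still saturate correctly.
```python
import torch
import torch.nn.functional as F

x = torch.tensor([2147483648.0, 3.0e9], device="cuda")  # "cuda" maps to ROCm on HIP builds
print(F.hardsigmoid(x))  # expected: tensor([1., 1.], device='cuda:0')
```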

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162758
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-12 15:07:13 +00:00
00e9ba75cd Revert "[indexing] Prevent integer overflow from large step values in C++ (#161707)"
This reverts commit c140bf217f5ca5071ab9dbc1bcf9d4006242f44a.

Reverted https://github.com/pytorch/pytorch/pull/161707 on behalf of https://github.com/huydhn due to Looks like there is a land race, as lots of jobs are failing after this landed ([comment](https://github.com/pytorch/pytorch/pull/161707#issuecomment-3283980465))
2025-09-12 06:49:36 +00:00
c140bf217f [indexing] Prevent integer overflow from large step values in C++ (#161707)
Fixes https://github.com/pytorch/pytorch/issues/160868
I found an existing fix PR after finishing this one. For reference, the old PR is
https://github.com/pytorch/pytorch/pull/147433/files.
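A hedged repro sketch based on the linked issue: an extremely large slice step used to overflow in the C++ indexing code.
```python
import torch

x = torch.arange(10)
print(x[::9223372036854775807])  # expected: tensor([0])
```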

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161707
Approved by: https://github.com/leslie-fang-intel, https://github.com/CaoE, https://github.com/mlazos
2025-09-12 03:16:23 +00:00
563921619b Fix the regression caused by non-aarch64 platforms not hitting the MKLDNN path. (#162168)
This issue was introduced by the commit referenced in #161065. Added an extra check to provide a proper path for other platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162168
Approved by: https://github.com/mingfeima, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-12 00:17:08 +00:00
a3f01f6418 [MTIA Runtime] Add foreach_div ops to native_functions.yaml (#162732)
Summary: Quick fix for runtime support on foreach_div, see D81274963. Fixed an issue that I created in that diff so that the CIs pass.

Test Plan:
CIs created in D81274963 and D81286593 pass.

Added some logs to all the foreach_div ops in [aten_mtia_ops.cpp](https://www.internalfb.com/code/fbsource/[c56272ba042c43c65517dcac254364cf732fcfa9]/fbcode/mtia/host_runtime/torch_mtia/aten_mtia_ops.cpp?lines=3676). We can see that the correct MTIA kernels are being invoked in the tests.
https://www.internalfb.com/intern/testinfra/testrun/15481123829281588

Rollback Plan:

Differential Revision: D82161434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162732
Approved by: https://github.com/danielhou0515
2025-09-11 22:47:03 +00:00
62843c14bb [ROCm/Windows] Support aotriton for scaled_dot_product_attention on Windows. (#162330)
Enables flash attention and/or memory-efficient attention on Windows with scaled_dot_product_attention via aotriton.
Already tested to be working on Windows with TheRock.

Steps to enable: simply set `USE_FLASH_ATTENTION=1` and `USE_MEM_EFF_ATTENTION=1` as usual. See https://github.com/ROCm/TheRock/blob/main/external-builds/pytorch/build_prod_wheels.py#L578-L604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162330
Approved by: https://github.com/xinyazhang, https://github.com/ScottTodd, https://github.com/jeffdaily

Co-authored-by: Scott Todd <scott.todd0@gmail.com>
2025-09-11 22:35:09 +00:00
d65ffdef3d [ROCm] fix miopen batchnorm changing output format (#162112)
It was found that the integration of miopen batchnorm was causing the output to always be in the default contiguous memory format even when the input was channels last. This also unskips a number of related unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162112
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2025-09-11 19:37:48 +00:00
9bc648235d [MPS] mps sparse mul op implementation (#162349)
Implements the MPS sparse mul operation and enables other operations such as the following (see the sketch after the list):
1. copy_
2. div
3. sum
4. floor
5. power
6. sub
7. floor_divide
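A hedged sketch of the newly enabled path (assumes an MPS device; illustration only):
```python
import torch

a = torch.eye(3, device="mps").to_sparse()
b = (2 * torch.eye(3, device="mps")).to_sparse()
print((a * b).to_dense())  # expected: 2 * identity
```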

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162349
Approved by: https://github.com/pearu, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-11 18:36:24 +00:00
24492cbab2 [BE] Cleanup stale comments/copy from gemm (#162001)
Followup after https://github.com/pytorch/pytorch/pull/154012

Since the introduction of `gemm_no_downcast_stub`, it's no longer necessary to allocate a temporary array and then manually implement the `beta` logic in the codebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162001
Approved by: https://github.com/drisspg
ghstack dependencies: #161999
2025-09-11 15:48:43 +00:00
6b9b7ce6fe fix torch.sparse.log_softmax on CPU (#161959)
Fix https://github.com/pytorch/pytorch/issues/152293.

**Example:**
```
import torch
from torch.sparse import log_softmax as sparse_log_softmax

def test_bug():
    a = torch.rand(4, 3)
    b = a - 10000000.0
    b_sparse = b.to_sparse()

    cpu_out_sparse = sparse_log_softmax(b_sparse, dim=1).to_dense()
    print('cpu_out_sparse =', cpu_out_sparse)

    b_sparse_double = b.double().to_sparse()
    cpu_out_sparse_double = sparse_log_softmax(b_sparse_double, dim=1).to_dense()
    print('cpu_out_sparse_double =', cpu_out_sparse_double)

if __name__ == '__main__':
    test_bug()
```

**Output:**

- before
```
cpu_out_sparse = tensor([[-2., -1., -2.],
        [-1., -1., -1.],
        [-1., -2., -2.],
        [-1., -1., -2.]])
cpu_out_sparse_double = tensor([[-1.5514, -0.5514, -1.5514],
        [-1.0986, -1.0986, -1.0986],
        [-0.5514, -1.5514, -1.5514],
        [-0.8620, -0.8620, -1.8620]], dtype=torch.float64)
```

- after
```
cpu_out_sparse = tensor([[-0.8620, -1.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986],
        [-1.8620, -0.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986]])
cpu_out_sparse_double = tensor([[-0.8620, -1.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986],
        [-1.8620, -0.8620, -0.8620],
        [-1.0986, -1.0986, -1.0986]], dtype=torch.float64)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161959
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/mingfeima
2025-09-11 07:52:05 +00:00
6944d4b639 [ROCm] rocblas Aten GEMM overload for FP32 output from FP16/BF16 inputs (#162600)
Fix the ROCm GEMM helper to set the output type (C/D) based on the C_Dtype template parameter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162600
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-09-11 03:34:07 +00:00
e0c910149c Build fbgemm_gpu for TORCH_CUDA_ARCH_LIST=10.0 and CUDA 12.8 and 12.9 (#162544)
## Summary
- PyTorch is not built for the *a variants of SM architectures, due to non-portability. However, we need fbgemm_gpu kernels built for sm100a (see #162209)

## Changes
- **Setting USE_FBGEMM_GENAI for CUDA builds**: fbgemm_gpu builds for sm100a if using CUDA 12.8 or 12.9 ([source](2033a0a08f/.github/scripts/nova_dir.bash (L29-L32))), so I follow the same rule here.
- **Extra nvcc flags**: if USE_FBGEMM_GENAI and USE_CUDA are set, we add extra nvcc flags for sm100a

## Test plan

Test build:
```
echo $CUDA_HOME
/usr/local/cuda-12.9

export TORCH_CUDA_ARCH_LIST=10.0
python -m pip install --no-build-isolation -v -e .
```

Check build logs:
```
  CMake Warning at CMakeLists.txt:901 (message):
    Setting USE_FBGEMM_GENAI to ON, doing CUDA build for SM100a
```

Run unit tests:
- `pytest test/test_matmul_cuda.py  -k test_mxfp8_scaled_grouped_mm`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162544
Approved by: https://github.com/drisspg
2025-09-10 22:59:41 +00:00
bf7f481144 Update misleading torch.sparse_coo_tensor error check (#161900)
Fixes #160622

### Summary
Updated the misleading torch.sparse_coo_tensor error check to provide clearer context.
Earlier:
`RuntimeError: number of dimensions must be sparse_dim (3) + dense_dim (0), but got 1`

Updated:
`RuntimeError: 'len(size) == sparse_dim + dense_dim' is not satisfied: len(size) = 1, sparse_dim = 3, dense_dim = 0`
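A hedged sketch of how the message is triggered, using a size whose length does not match sparse_dim + dense_dim:
```python
import torch

indices = torch.zeros(3, 0, dtype=torch.long)        # sparse_dim = 3
values = torch.zeros(0)                              # dense_dim = 0
torch.sparse_coo_tensor(indices, values, size=(5,))  # len(size) = 1 -> RuntimeError above
```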

**Impacts:**

- Comprehensive error message that will improve developer experience.
- module: sparse

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161900
Approved by: https://github.com/nikitaved, https://github.com/pearu
2025-09-10 19:57:11 +00:00
96ef26f71a Revert "[ROCm] Integrate AITER Fav3 fwd kernels (#160105)"
This reverts commit d2393c2d7da03a1523a12e6f80edb6bd7b464ec5.

Reverted https://github.com/pytorch/pytorch/pull/160105 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing internal ROCm build ([comment](https://github.com/pytorch/pytorch/pull/160105#issuecomment-3273297183))
2025-09-10 04:42:28 +00:00
2281d009e5 Revert "[ROCm] Add specific compile options for CK SDPA (#161759)"
This reverts commit d22d916719eb7daff8455a01d216d65f81899a9e.

Reverted https://github.com/pytorch/pytorch/pull/161759 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this seems to break internal ROCm jobs ([comment](https://github.com/pytorch/pytorch/pull/161759#issuecomment-3272807726))
2025-09-10 00:44:30 +00:00
d2393c2d7d [ROCm] Integrate AITER Fav3 fwd kernels (#160105)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160105
Approved by: https://github.com/jeffdaily
2025-09-09 22:30:12 +00:00
b477fb106f [ROCm] enable grouped gemm fallback (#162419)
Enables the bf16 grouped GEMM alternative path as described in #161366.
The fast path will be enabled in the future through CK integration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162419
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-09 20:04:56 +00:00
d22d916719 [ROCm] Add specific compile options for CK SDPA (#161759)
Updates the CK version and adds CK-specific compilation options

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161759
Approved by: https://github.com/jeffdaily
2025-09-09 20:04:19 +00:00
82f1eb9b03 Revert "[MPS] mps sparse mul op implementation (#162349)"
This reverts commit 3ea686804925f1291de57ffdb3394da0b46deb54.

Reverted https://github.com/pytorch/pytorch/pull/162349 on behalf of https://github.com/malfet due to Fails trunk tests, with uint8 sum ([comment](https://github.com/pytorch/pytorch/pull/162349#issuecomment-3271783442))
2025-09-09 18:14:16 +00:00