Commit Graph

2342 Commits

Author SHA1 Message Date
4fd70d4e7b [1/N]Enable some tests in test_ops.TestCommon on Intel GPU (#159944)
For https://github.com/pytorch/pytorch/issues/114850, we will port aten unit tests to Intel GPU. This PR will work on some test case of test/test_ops.py. We could enable Intel GPU with following methods and try the best to keep the original code styles:

1. Extended XPUTestBase.get_all_devices to support multiple devices
2. Added skipXPU decorator
3. Extended onlyOn to support device list
4. Enabled 'xpu' for some test pathes
5. Added allow_xpu=True for supported test class.
6. Replaced onlyCUDA with onlyOn(['cuda', 'xpu']) for supported tests
7. Use skipIfXpu and skipXPU to disable unsupported test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159944
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
2025-09-29 09:08:04 +00:00
df9a4824e6 Bugfix for doing negative padding (#161639)
Fixes #161014

This bug fix introduces a fix that is consistent with the exception handling. Outlined in issue #161014, there is an edge case where the negative padding does not make the tensor size negative but still triggers the exception that the size is negative. The fix is simply adding `new_dim >=0` to include the zero dim and letting the operator return an empty tensor.

In the PR I have added the edge case where the test will now check the negative padding where the dimension gets reduced to zero.  But the sample is only for the `constant` type of padding. I would like some feedback if it is necessary to put the same sample on the `reduce` type as well.

This is my first PR to contribute to PyTorch and any help/feedback will be welcome! Thank you!

@malfet @manuelcandales @janeyx99 @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161639
Approved by: https://github.com/manuelcandales
2025-09-19 20:57:05 +00:00
0dcd9304aa fix high=0 bug in nll_loss test (#162763)
Minor bug fix for the `nll_loss` test.
Before this PR, it runs `torch.randint(high=0)`, which will fail because it would try to generate a number that >= low and < high, i.e. x>=0 and x<0.

The test did not fail because that line is not run when testing on CPU because it failed earlier because of a unsupported dtype.
However, as we support TPUs at Google, this line is reached first before the dtype check, which triggers the bug.

To my understanding, these OpInfo should be general enough to support different hardware.
Fixing this obvious bug would make it more general cross different hardware.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162763
Approved by: https://github.com/soulitzer
2025-09-12 21:48:18 +00:00
1e9ddf510f [ROCm] fix hardsigmoid op (#162758)
Currently std::min -> ::min did not work as expected on ROCm when input values >= 2147483648
It can be fixed by explicit typing std::min<opmath_t>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162758
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-12 15:07:13 +00:00
e532c9d4f1 Relax tolerance for test_quick_baddbmm_cpu_complex64 (#152424)
On Zen 2 (AMD EPYC) and Intel Sapphire Rapids this fails with small differences when compiled with native targeted optimizations. I.e. it fails with `-march=znver2` but succeeds with `-march=znver1`.

I assume some operator fusing is being used by GCC. Small differences like using `vmovdqa` can be seen in the minimized code of the baddbmm kernel: https://godbolt.org/z/jsxMa91Wb

The greatest differences are consistent and the same on both CPU architectures:
```
Greatest absolute difference: 3.43852152582258e-05 at index (1, 2, 1) (up to 1e-05 allowed)
Greatest relative difference: 3.6034286949870875e-06 at index (1, 2, 1) (up to 1.3e-06 allowed)
```

Hence I assume this is in the expected tolerances  especially as `complex128` and all other types pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152424
Approved by: https://github.com/malfet
2025-09-04 13:26:42 +00:00
81b7b16618 Reland "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142)" (#161949)
This PR reland #161142 which is reverted to be able to revert other PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161949
Approved by: https://github.com/jansel
2025-09-02 23:43:27 +00:00
54e275e0d8 Revert "[Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142)"
This reverts commit c83cbd2f2a2de2e3258f07de77d8740743df6d2d.

Reverted https://github.com/pytorch/pytorch/pull/161142 on behalf of https://github.com/jeanschmidt due to This PR needs to be reverted to be able to revert another PR, this is due to merge conflicts, I am sorry for this. Please feel free to rebase and merge at your earliest convenience ([comment](https://github.com/pytorch/pytorch/pull/161142#issuecomment-3242937640))
2025-09-01 17:03:50 +00:00
c83cbd2f2a [Fix XPU CI][Inductor UT] Fix test cases broken by community. (#161142)
Fixes #161384, Fixes #161162, Fixes #160946, Fixes #160947, Fixes #160948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161142
Approved by: https://github.com/jansel
2025-08-30 11:09:07 +00:00
8c506e6310 [easy][test] Add repeat_interleave opinfo that exercises binary search fusion (#161445)
This adds a configuration that would have caught the need for https://github.com/pytorch/pytorch/pull/159961 when https://github.com/pytorch/pytorch/pull/158462 was landed.

Notably:
* the test has output_size kwarg specified
* the input is 1D plus a size-1 dimension (otherwise, if there are non-size-1 dimensions, then the fusion won't occur)

Differential Revision: [D80981715](https://our.internmc.facebook.com/intern/diff/D80981715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161445
Approved by: https://github.com/eellison, https://github.com/v0i0
2025-08-26 12:32:24 +00:00
6382302990 [MPS] Add grid_sampler_3d for MPS (#160541)
This PR adds support for `grid_sampler_3d` for MPS with "bilinear" interpolation.

NOTE: "nearest" interpolation is not yet supported

Fixes #159882
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160541
Approved by: https://github.com/malfet
2025-08-15 16:19:25 +00:00
db0b7f1cc9 [BE][CI] Adjust error_inputs for cat and complex (#160378)
MPS backend does not support double, so errors should be different
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160378
Approved by: https://github.com/dcci
2025-08-13 18:35:06 +00:00
df55ec7d4b [OpInfo][BE] Better inputs for addmm (#160234)
Right now alpha and betha are both less than zero, which makes them useless for all addmm samples for interal types
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160234
Approved by: https://github.com/Skylion007
ghstack dependencies: #160228
2025-08-10 01:26:48 +00:00
f5e2de928b [BE] fix remaining flake8 v7 warnings (#159044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159044
Approved by: https://github.com/Skylion007
ghstack dependencies: #159043
2025-07-25 02:56:34 +00:00
7f649ed4f8 Add basic torch.hash_tensor op (#154149)
Added `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor.

- The hash is always uint64.
- Integers will be casted to uint64 before performing the xor_sum reduction
- Floats will be upcasted to double and then bitcasted to uint64 before performing the xor_sum reduction

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149
Approved by: https://github.com/albanD
2025-07-23 22:28:03 +00:00
2cdafab0bd [BE] Raise ValueError from torch.cat meta func (#158249)
Followup after https://github.com/pytorch/pytorch/pull/155460

From [Python documentation](https://docs.python.org/3/library/exceptions.html#ValueError):
> Raised when an operation or function receives an argument that has the right type but an inappropriate value, and the situation is not described by a more precise exception such as IndexError.

Raise [`TypeError`](https://docs.python.org/3/library/exceptions.html#TypeError) when input-output types are incompatible with each other
> Raised when an operation or function is applied to an object of inappropriate type. The associated value is a string giving details about the type mismatch.

> This exception may be raised by user code to indicate that an attempted operation on an object is not supported, and is not meant to be. If an object is meant to support a given operation but has not yet provided an implementation, [NotImplementedError](https://docs.python.org/3/library/exceptions.html#NotImplementedError) is the proper exception to raise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158249
Approved by: https://github.com/jbschlosser, https://github.com/Skylion007, https://github.com/albanD
2025-07-20 23:49:18 +00:00
794b95d54b Enable Half dtype for logcumsumexp_backward (#157512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157512
Approved by: https://github.com/malfet
2025-07-03 18:13:38 +00:00
c553c55be7 Revert "Fix full_like decomposition to preserve strides (#144765)"
This reverts commit 01b0f09931d47bd2716398a0c335b2807dc3074d.

Reverted https://github.com/pytorch/pytorch/pull/144765 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal tests see [D77652778](https://www.internalfb.com/diff/D77652778), @jansel may you help get this PR merged? ([comment](https://github.com/pytorch/pytorch/pull/144765#issuecomment-3027975098))
2025-07-02 13:56:03 +00:00
01b0f09931 Fix full_like decomposition to preserve strides (#144765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144765
Approved by: https://github.com/amjames, https://github.com/jansel
2025-07-01 19:13:22 +00:00
a4b59498c5 Fix fake kernel for the out=... variant of unbind_copy (#156643)
`unbind_copy(..., out=...)` returns None rather than the `out` argument
(see https://github.com/pytorch/pytorch/issues/130829#issuecomment-2283936222),
but the old fake kernel didn't account for that and caused an assertion
failure in `pushPyOutToStack`. This patch fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156643
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/bdhirsh
ghstack dependencies: #156642
2025-06-27 01:34:07 +00:00
cec2977ed2 [BE][6/16] fix typos in torch/ (#156316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315
2025-06-23 02:57:34 +00:00
3f44fdc03d Revert "[BE][6/16] fix typos in torch/ (#156316)"
This reverts commit b210cf1ea56bcd9f937a2805d9e70d8684d25ee4.

Reverted https://github.com/pytorch/pytorch/pull/156316 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
b210cf1ea5 [BE][6/16] fix typos in torch/ (#156316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315
2025-06-22 08:43:33 +00:00
c2f4cc59a7 [MPS] Fix bug in 3d coords calculation (#156375)
Which was not caught by CI beforehand, as all 3D examples right now are symmetric, so add an uneven shape to `sample_inputs_interpolate`

Though it's indirectly tested by `test_upsample_nearest3d` inductor test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156375
Approved by: https://github.com/atalman
2025-06-19 19:56:15 +00:00
c28e74e457 [MPS] Add nearest_3d forward and backward (#156090)
Introduce generalizable `UpsampleParams` structure in `UpSample.h`, which could be shared between CPU and MPS
Delete `upsample_nearest3d` MPS fallback and replace it with proper shader
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156090
Approved by: https://github.com/kulinseth, https://github.com/dcci
ghstack dependencies: #156256
2025-06-18 04:48:15 +00:00
b1713c6655 [MPS][Testing][BE] Fix samples for full_like (#156026)
Now that device is known, one can avoid creating tensors of `torch.double` type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156026
Approved by: https://github.com/dcci
ghstack dependencies: #156121
2025-06-17 04:46:26 +00:00
03488d820c Revert "[MPS][Testing][BE] Fix samples for full_like (#156026)"
This reverts commit 2d832c9587fd99db295b62d0c9b459d509c19d06.

Reverted https://github.com/pytorch/pytorch/pull/156026 on behalf of https://github.com/atalman due to Sorry breaks MPS tests: test_ops.py::TestMathBitsCPU::test_neg_view_full_like_cpu_float64 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15683608879/job/44182730620) [HUD commit link](2d832c9587) ([comment](https://github.com/pytorch/pytorch/pull/156026#issuecomment-2977903074))
2025-06-16 19:50:26 +00:00
2d832c9587 [MPS][Testing][BE] Fix samples for full_like (#156026)
Now that device is known, one can avoid creating tensors of `torch.double` type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156026
Approved by: https://github.com/dcci
2025-06-16 14:27:42 +00:00
8a22551300 Fixes OpInfo gradient checks for ctc_loss (#154590)
Fixes #67462

Re-enables `OpInfo` gradient checks for the restricted scenarios where the current `ctc_loss` implementation is accurate and consistent.

The desired `ctc_loss` gradient behavior appears to be an ongoing discussion, see
https://github.com/pytorch/pytorch/issues/52241. The `OpInfo` gradient checks can be updated if/as the underlying implementation advances.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154590
Approved by: https://github.com/soulitzer
2025-06-10 19:56:39 +00:00
abbdf9f363 [BE][Testing] Unskip ones_like/zeros_like testing on MPS (#155476)
But skip `double` dtype form OpInfo variants for this test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155476
Approved by: https://github.com/Skylion007, https://github.com/dcci
2025-06-09 20:37:44 +00:00
7999735d23 [CUDA][MPS] Fix torch.arange bound validation for large float inputs (#154320)
Fixes #153133

Fixes an inconsistency in torch.arange on CUDA and MPS backends when using float32 and large input values. Previously, invalid ranges (e.g., start > end with a positive step) could silently return empty tensors due to precision loss in validation logic.

The fix introduces double precision validation for checking whether the step sign is consistent with the range direction.

This ensures torch.arange behaves consistently with CPU for large float32 inputs, and raises an appropriate error when the range is invalid.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154320
Approved by: https://github.com/malfet
2025-06-05 14:51:25 +00:00
34e3930401 fix numpy compatibility for 2d small list indices (#154806)
Will fix #119548 and linked issues once we switch from warning to the new behavior,
but for now, given how much this syntax was used in our test suite, we suspect a silent change will be disruptive.
We will change the behavior after 2.8 branch is cut.
Numpy behavior was changed at least in numpy 1.24 (more than 2 years ago)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154806
Approved by: https://github.com/cyyever, https://github.com/Skylion007, https://github.com/albanD
2025-06-04 01:58:52 +00:00
e9266f807a [BE] Use vendored packaging for testing (#154946)
As the rest of the torch uses it, test should rely on it as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154946
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-06-03 14:22:53 +00:00
edc2d539d1 torch.tensordot: performance improvements when contracting to a scalar. (#145936)
As per title.
Fixes https://github.com/pytorch/pytorch/issues/145731

Touches only compute. The CPU overhead can potentially be further reduced.

Before:
```python
In [3]: n = 512

In [4]: A = torch.rand(n, n)

In [5]: B = torch.rand(n, n)

In [6]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]])
2.04 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [7]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]])
2.85 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]])
2.9 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]])
4.07 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

After
```python
In [2]: n = 512

In [3]: A = torch.rand(n, n)

In [4]: B = torch.rand(n, n)

In [5]: %timeit torch.tensordot(A, B, [[0, 1], [0, 1]])
30.7 µs ± 2.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [6]: %timeit torch.tensordot(A, B, [[0, 1], [1, 0]])
141 µs ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [7]: %timeit torch.tensordot(A, B, [[1, 0], [0, 1]])
142 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [8]: %timeit torch.tensordot(A, B, [[1, 0], [1, 0]])
62.8 µs ± 4.31 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145936
Approved by: https://github.com/albanD, https://github.com/ngimel
2025-05-13 10:57:30 +00:00
fe8ebacee4 [ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151368
Approved by: https://github.com/jeffdaily, https://github.com/malfet

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-08 16:12:16 +00:00
9919d6b872 [Testing] Add copysign from scalar regression test (#152997)
But instead of adding it just for MPS backend, add it to OpInfo

Fixes https://github.com/pytorch/pytorch/issues/152582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152997
Approved by: https://github.com/wdvr
2025-05-07 00:19:42 +00:00
cc28b43950 Revert "[ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)"
This reverts commit 844842dfbf937c43b41c528e461d3f3931bca6e9.

Reverted https://github.com/pytorch/pytorch/pull/151368 on behalf of https://github.com/malfet due to This broke inductor cpp wrapper ([comment](https://github.com/pytorch/pytorch/pull/151368#issuecomment-2848519706))
2025-05-03 08:31:31 +00:00
eqy
216d81da81 [CUDA][complex] skip test_reference_numerics_large_jiterator_unary_cuda_complex64 on CUDA (#148024)
already skipped on ROCM for a similar reason, recent numpy versions changed convention from `nan+infj` to `-inf+infj`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148024
Approved by: https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet
2025-05-02 19:11:11 +00:00
844842dfbf [ROCm] Upgrade ROCm CI to ROCm6.4 (#151368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151368
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-05-02 17:21:18 +00:00
f0c9b3385d Support more dtypes for input, indices in gather (#151822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151822
Approved by: https://github.com/ngimel
2025-05-01 16:35:23 +00:00
bb90f66e70 [CUDA][conv3d] bump tolerances for test_variant_consistency_eager conv3d complex64 (#152203)
~1/1000 1.5e-5 mismatch on A100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152203
Approved by: https://github.com/Skylion007, https://github.com/soulitzer
2025-04-28 17:59:37 +00:00
e2f9759bd0 Fix broken URLs (#152237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152237
Approved by: https://github.com/huydhn, https://github.com/malfet
2025-04-27 09:56:42 +00:00
3ef6d6924a [BE] Switch TestConsistency to MPS device (#147893)
Which will eventually allow move decorators away more `common_mps.py`

Adjust tolerances accordingly. XFAIL a bunch of tests on MacOS-13, which is going to be deprecated anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147893
Approved by: https://github.com/atalman
ghstack dependencies: #152204
2025-04-26 01:19:21 +00:00
82200e33b5 Make torch._chunk_cat support non-contiguous inputs (#151263)
Currently, `torch._chunk_cat` only supports contiguous inputs (due to `.view()` usage in `_pad_chunk()` supporting only contiguous tensor). This doesn't work for internal models where there can be non-contiguous input tensors:

- size=[8192, 16416], stride=[16448, 1]  # stride[0] is larger than size[1]
- size=[1152, 384], stride=[1, 1152]  # column-major tensor

In this PR, we relax the assumption on contiguous input tensor, by switching from `.view()` to `.reshape()`. Note that since `.reshape()` will try to use `.view()` under the hood whenever possible, this should not cause regression to existing use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151263
Approved by: https://github.com/BoyuanFeng
2025-04-16 04:18:46 +00:00
ddfc14b3ae [MPS] Fix where (#151176)
Fixes #150967
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151176
Approved by: https://github.com/kulinseth, https://github.com/malfet
2025-04-13 20:44:50 +00:00
1e92579126 Add torch._scaled_mm for CPU (#150410)
This PR is the duplicated one for https://github.com/pytorch/pytorch/pull/139975.

This PR is to add torch._scaled_mm for CPU backend.

_scaled_mm_out_cpu and _scaled_mm_cpu are new added and included in torch._scaled_mm CPU dispatch. We also add _scaled_mm_out_cpu_emulated as a fallback function if the current platform cannot run FP8 matmul using oneDNN. And this PR also updates the various UTs related to FP8 to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150410
Approved by: https://github.com/atalman
2025-04-11 02:23:03 +00:00
d751698a36 Support negative values for fill with uint tensors (#144458)
Fixes https://github.com/pytorch/pytorch/issues/144188
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144458
Approved by: https://github.com/amjames, https://github.com/eellison
2025-04-09 21:08:06 +00:00
e6bd133866 add batching rule for torch.Tensor.scatter_add_ (#150543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150543
Approved by: https://github.com/zou3519
2025-04-08 18:00:10 +00:00
881d99495d Add more check for torch.ormqr (#150759)
As the title statd.

Please refer to https://github.com/pytorch/pytorch/issues/150674 for more info.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150759
Approved by: https://github.com/lezcano
2025-04-08 08:26:05 +00:00
4854926aeb Revert "Add torch._scaled_mm for CPU (#150410)"
This reverts commit 3b02f795c5ad2339794b15b370c0e4a235d36adf.

Reverted https://github.com/pytorch/pytorch/pull/150410 on behalf of https://github.com/malfet due to It breaks ROCM tests ([comment](https://github.com/pytorch/pytorch/pull/150410#issuecomment-2777704212))
2025-04-04 06:52:54 +00:00
3b02f795c5 Add torch._scaled_mm for CPU (#150410)
This PR is the duplicated one for https://github.com/pytorch/pytorch/pull/139975.

This PR is to add torch._scaled_mm for CPU backend.

_scaled_mm_out_cpu and _scaled_mm_cpu are new added and included in torch._scaled_mm CPU dispatch. We also add _scaled_mm_out_cpu_emulated as a fallback function if the current platform cannot run FP8 matmul using oneDNN. And this PR also updates the various UTs related to FP8 to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150410
Approved by: https://github.com/atalman
2025-04-03 19:43:45 +00:00