Commit Graph

2277 Commits

Author SHA1 Message Date
aac5d1a289 Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit f0bdc27f74f8b1d4ab6789156691ee0fd5cbb30f.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it looks like internal ideep version is too old to support this ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2660008996))
2025-02-14 18:31:54 +00:00
f0bdc27f74 Add torch._scaled_mm for CPU (#139975)
This PR adds `torch._scaled_mm` for the CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback for platforms that cannot run FP8 matmul using oneDNN. This PR also updates the FP8-related unit tests to support CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2025-02-14 02:03:53 +00:00
eqy
6f7fda3f49 Bump nn.functional.conv3d tolerances for test_comprehensive (#135719)
`float16` tolerance was previously set to `1e-5` which seemed very low
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135719
Approved by: https://github.com/Chillee, https://github.com/albanD
2025-02-05 18:34:12 +00:00
2b00d211f0 Build RowwiseScaledMM.cu for SM89 (#145676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145676
Approved by: https://github.com/drisspg, https://github.com/malfet, https://github.com/eqy
2025-02-01 11:44:58 +00:00
ccc2878c97 Fix fractional_max_pool lowering in inductor (#144395)
Fixes https://github.com/pytorch/pytorch/issues/141538
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144395
Approved by: https://github.com/amjames, https://github.com/eellison
2025-01-28 21:00:18 +00:00
dea7ad3371 PEP585 update - torch/testing (#145200)
See #145101 for details.
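
As a rough illustration of the kind of change a PEP 585 update makes (the names `OldPair`, `NewPair`, and `first` are hypothetical and do not come from `torch/testing`), the builtin `tuple` can be subscripted directly on Python 3.9+ instead of importing `Tuple` from `typing`:

```python
from typing import Tuple, get_args, get_origin

# Pre-PEP 585 spelling: the generic alias imported from `typing`.
OldPair = Tuple[int, str]

# PEP 585 spelling: subscript the builtin directly (Python 3.9+).
NewPair = tuple[int, str]


def first(pair: tuple[int, str]) -> int:
    """Return the integer half of an (int, str) pair."""
    return pair[0]


# Both spellings resolve to the same runtime origin and type parameters.
same_origin = get_origin(OldPair) is get_origin(NewPair) is tuple
same_args = get_args(OldPair) == get_args(NewPair) == (int, str)
```

Because both spellings share an origin and parameters, the migration is expected to be annotation-only and not to change runtime behaviour.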

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145200
Approved by: https://github.com/bobrenjc93
2025-01-20 22:42:42 +00:00
92b9da1fc2 fix torch.atan for torch.complex datatypes on CPU (#144749)
Fix https://github.com/pytorch/pytorch/issues/141487.
This issue is caused by missing special handling, in the vectorized implementation of `atan`, for inputs whose real or imaginary part is 0, Inf, or NaN. For correctness, this PR temporarily falls back to the scalar implementation of `atan`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144749
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2025-01-20 08:45:03 +00:00
c922ccb7c4 fix sigmoid for torch.complex datatypes on CPU (#140391)
Fix https://github.com/pytorch/pytorch/issues/135777.
This issue is caused by missing special handling, in the vectorized implementation of `reciprocal`, for inputs whose real or imaginary part is 0, Inf, or NaN. For correctness, this PR temporarily falls back to the scalar implementation of `reciprocal`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140391
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
ghstack dependencies: #140358
2025-01-20 08:23:58 +00:00
507bf65c6a fix torch.exp for torch.complex datatypes on CPU (#140358)
Fix https://github.com/pytorch/pytorch/issues/48010, https://github.com/pytorch/pytorch/issues/136063.
These two issues are caused by missing special handling, in the vectorized implementation of `exp`, for inputs whose real or imaginary part is 0, Inf, or NaN. For correctness, this PR temporarily falls back to the scalar implementation of `exp`.
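
To see why such inputs need dedicated handling, here is a pure-Python sketch. It is not the PyTorch kernel: `naive_exp` is a hypothetical helper that applies the textbook formula directly, and `cmath.exp` stands in for a reference that does handle special values.

```python
import cmath
import math


def naive_exp(z: complex) -> complex:
    """exp(a + bi) = e^a * (cos b + i sin b), with no special-value handling."""
    scale = math.exp(z.real)
    return complex(scale * math.cos(z.imag), scale * math.sin(z.imag))


z = complex(math.inf, 0.0)

# The imaginary part becomes inf * sin(0) = inf * 0.0, which is NaN.
naive = naive_exp(z)

# The standard library special-cases this input and returns inf + 0j.
reference = cmath.exp(z)
```

The formula is mathematically correct for finite inputs; the NaN appears only because `inf * 0.0` is undefined in floating point, which is the kind of case a scalar implementation can branch on.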

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140358
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2025-01-20 08:03:17 +00:00
46fbd63405 Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2025-01-17 18:21:22 +00:00
d7f45fc575 dynamic shape support for interpolate(antialias=True) backward (#141198)
Fixes https://github.com/pytorch/pytorch/issues/141187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141198
Approved by: https://github.com/ezyang, https://github.com/Chillee
ghstack dependencies: #141161
2025-01-16 00:08:25 +00:00
3b6b306b71 Migrate from Tuple -> tuple in torch/testing (#144256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144256
Approved by: https://github.com/aorenste
2025-01-10 06:37:55 +00:00
a8e97d9d4d fix torch.acos and torch.asin for torch.complex datatypes on CPU (#134838)
Fix https://github.com/pytorch/pytorch/issues/134487, https://github.com/pytorch/pytorch/issues/138327.

These two issues are caused by missing special handling, in the vectorized implementation of `asin`, for inputs whose real or imaginary part is 0, Inf, or NaN. For correctness, this PR temporarily falls back to the scalar implementation of `asin`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134838
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2025-01-06 06:17:39 +00:00
45a709d9ec Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit cbc4cf3043a7316c1f6e86b1e22d96042a59489c.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/malfet due to It broke the same test, but on ROCM this time, though it was classified as flaky for some reason, see d8c3900d80/1 ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2564378146))
2024-12-28 16:49:38 +00:00
cbc4cf3043 Add torch._scaled_mm for CPU (#139975)
This PR adds `torch._scaled_mm` for the CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback for platforms that cannot run FP8 matmul using oneDNN. This PR also updates the FP8-related unit tests to support CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2024-12-28 05:49:06 +00:00
fca457b5db Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit 3f80632c802f1d9fafd0c303d45ba2376b9c1e11.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2563331259))
2024-12-27 05:25:17 +00:00
3f80632c80 Add torch._scaled_mm for CPU (#139975)
This PR adds `torch._scaled_mm` for the CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are newly added and included in the `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback for platforms that cannot run FP8 matmul using oneDNN. This PR also updates the FP8-related unit tests to support CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #139974
2024-12-26 22:22:42 +00:00
783065637e Add FP8 support for eye (#139974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139974
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-12-24 10:00:23 +00:00
1519a9e30b Revert "Add FP8 support for eye (#139974)"
This reverts commit 01890526b9068ae20b38b2a33e8f11a6331d7d4b.

Reverted https://github.com/pytorch/pytorch/pull/139974 on behalf of https://github.com/huydhn due to Sorry for reverting your change but this seems to fail some slow tests ([comment](https://github.com/pytorch/pytorch/pull/139974#issuecomment-2560046399))
2024-12-23 17:12:39 +00:00
01890526b9 Add FP8 support for eye (#139974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139974
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-12-23 06:47:49 +00:00
f1cbf4b1b5 Enable ruff's unused variable checking everywhere in pytorch (#136965)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136965
Approved by: https://github.com/cyyever, https://github.com/albanD
2024-12-22 02:33:11 +00:00
e0c8abda76 Fix potentially undefined behaviour in index_put sample input (#143116)
From the [docs](https://pytorch.org/docs/stable/generated/torch.Tensor.index_put_.html) for index_put_:

> If accumulate is True, the elements in values are added to self. If accumulate is False, the behavior is undefined if indices contain duplicate elements.

Currently the sample inputs for `index_put` generate 2 indices. Because they are generated randomly, the two could be identical, leading to undefined behaviour if `accumulate=False`.

This PR changes the input generation to only generate a single index if `accumulate=False` preventing duplicate indices and undefined behaviour.
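
The hazard and the fix can be sketched in plain Python. This is a stand-in and not the actual sample-input code: `put` and `make_indices` are hypothetical helpers, and a Python list plays the role of the tensor.

```python
import random


def put(size, indices, values, accumulate):
    """Pure-Python stand-in for an index_put-style write into a zero-filled list."""
    out = [0.0] * size
    for i, v in zip(indices, values):
        out[i] = out[i] + v if accumulate else v
    return out


# Duplicate indices with accumulate=False: the result depends on write order.
overwrite_a = put(4, [1, 1], [10.0, 20.0], accumulate=False)
overwrite_b = put(4, [1, 1], [20.0, 10.0], accumulate=False)

# Duplicate indices with accumulate=True: additions commute, so order is irrelevant.
accum_a = put(4, [1, 1], [10.0, 20.0], accumulate=True)
accum_b = put(4, [1, 1], [20.0, 10.0], accumulate=True)


def make_indices(size, accumulate, rng):
    """Hypothetical generator: a single index unless accumulating."""
    count = 2 if accumulate else 1
    return [rng.randrange(size) for _ in range(count)]


single = make_indices(4, accumulate=False, rng=random.Random(0))
```

With only one index there is nothing to collide with, so the `accumulate=False` case no longer depends on an unspecified write order.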

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143116
Approved by: https://github.com/albanD
2024-12-13 17:59:01 +00:00
424156c26c [ROCm] Update to AOTriton 0.8b (#140172)
Notable new features for SDPA operators on AMD systems from AOTriton 0.8b:

1. Nestedtensor support;
2. MQA/GQA support;
3. Restore Efficient attention support for causal=True and seqlen_q != seqlen_k cases;
    + The kernel should use top-left alignment; bottom-right alignment will be added later
4. Move gfx1100 (RX7900/W7800/W7900) out of experimental support status.
   However, users are strongly recommended to update to ROCM 6.2.4, notably for
   its firmware updates.

Related unit tests are enabled as well.

Notable related changes from AOTriton 0.8b:

1. AOTriton 0.8b moves the GPU kernel out of libaotriton.so to a separate directory `aotriton.images`;
2. LZMA replaces ZSTD as the GPU kernel compression algorithm for a better compression ratio: AOTriton 0.8b (.so + aotriton.images) takes 350MB, compared to 800MB for the AOTriton 0.7b .so;
3. Compression can no longer be disabled, and `liblzma` is a hard run-time dependency.
    + Should not be a problem, since `lzma` is part of Python Standard Library
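
As a small check of the last point, the standard-library `lzma` module can round-trip data on a typical Python build. This only shows that LZMA is available from Python; the payload is made up and says nothing about AOTriton's actual image format.

```python
import lzma

# Made-up, highly repetitive payload standing in for binary kernel data.
payload = b"gpu kernel image " * 1024

compressed = lzma.compress(payload)
restored = lzma.decompress(compressed)
```

If `import lzma` fails, the interpreter was built without liblzma, which is the situation the note above assumes is rare.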

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140172
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-12-06 21:45:18 +00:00
44707b0667 Pass rounding_mode for div reference inputs through kwargs (#136308)
Previously, the reference inputs for div with rounding mode did not supply the rounding_mode keyword argument. This didn't match the sample inputs for this op.
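
The effect of forwarding the keyword can be sketched with a scalar stand-in. The `div` helper and the `sample` tuple below are hypothetical; only the mode names `"trunc"` and `"floor"` mirror the real `rounding_mode` values.

```python
import math


def div(a, b, *, rounding_mode=None):
    """Scalar stand-in for division with an optional rounding mode."""
    q = a / b
    if rounding_mode is None:
        return q
    if rounding_mode == "trunc":
        return float(math.trunc(q))  # round toward zero
    if rounding_mode == "floor":
        return float(math.floor(q))  # round toward negative infinity
    raise ValueError(f"unknown rounding_mode: {rounding_mode!r}")


# A sample input that carries the mode through kwargs.
sample = ((-7.0, 2.0), {"rounding_mode": "floor"})
args, kwargs = sample

with_kwargs = div(*args, **kwargs)  # mode is applied
without_kwargs = div(*args)         # mode silently dropped: true division
truncated = div(*args, rounding_mode="trunc")
```

Dropping the kwargs changes the result, which is why reference inputs that omit `rounding_mode` end up testing a different operation than the sample inputs.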

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136308
Approved by: https://github.com/albanD

Co-authored-by: Xia, Weiwen <weiwen.xia@intel.com>
Co-authored-by: Bob Ren <bobren@meta.com>
Co-authored-by: Xilun Wu <12968408+XilunWu@users.noreply.github.com>
Co-authored-by: siahuat0727 <tansiahuat@gmail.com>
2024-11-29 21:28:24 +00:00
12e95aa4ee [BE]: Apply PERF401 autofixes from ruff (#140980)
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables.
* list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt
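
For reference, the shape of rewrite that this rule targets looks roughly like the following (an illustrative snippet, not code taken from the PR):

```python
values = [1, 2, 3, 4, 5, 6]

# Before: build the list with an explicit loop and append.
squares_loop = []
for v in values:
    if v % 2 == 0:
        squares_loop.append(v * v)

# After: the equivalent list comprehension.
squares_comp = [v * v for v in values if v % 2 == 0]
```

Both forms produce the same list; the comprehension also keeps its loop variable local to the expression.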

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-11-20 17:52:07 +00:00
80d63e7dd9 Fix softmax_backward_data cpu implementation error when argument output is noncontiguous (#139740)
The CPU implementation of the `softmax_backward_data` operator produces incorrect results when the `output` argument is non-contiguous.

Here is a test case that demonstrates this issue:

```python
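import torch

torch.manual_seed(0)
op = torch.ops.aten._softmax_backward_data
grad_output = torch.ones(3, 3, 3)
temp = torch.randn(3, 10, 3)
out = temp[:, :3, :]  # non-contiguous slice of `temp`
out = out.contiguous()  # comment this line out to exercise the non-contiguous path
print(out.is_contiguous())
# Arguments: grad_output, output, dim, input_dtype
grad_input = op(grad_output, out, 1, torch.float32)
print(grad_input)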
```

In this test case, `grad_input` is incorrect if the line `out = out.contiguous()` is commented out. With this fix, `grad_input` is the same whether or not `output` is contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139740
Approved by: https://github.com/zou3519
2024-11-15 19:53:20 +00:00
cb71bcc542 Replace clone.detach with detach.clone (#140264)
Fixes #64532

As stated in the issue, replace `clone.detach` with `detach.clone`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264
Approved by: https://github.com/soulitzer
2024-11-13 07:01:02 +00:00
6a368b3fc5 Add ScalarList overload to _foreach_lerp (#134482)
Related:
- https://github.com/pytorch/pytorch/issues/133367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134482
Approved by: https://github.com/janeyx99
2024-11-12 19:03:41 +00:00
213b8ef163 [BE] add empty tensor testing for _foreach_addcmul/div (#140276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140276
Approved by: https://github.com/jbschlosser
ghstack dependencies: #140191
2024-11-12 15:35:06 +00:00
92fb1f79b8 [BE] Test interspersed empty tensors for _foreach_norm test parity (#140191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140191
Approved by: https://github.com/jbschlosser
2024-11-12 15:35:06 +00:00
71d8bb7ede implement torch._foreach_rsqrt (#134574)
Related:
- #133367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134574
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-11-12 15:34:35 +00:00
565a7942ee Recover non-standard bool test for msort (#139870)
Summary:
I was looking into why the non-standard bool value will fail for msort - it makes sense for argsort and sort to fail, because we're randomly generating uint8 so the order will be different (and thus the indices will be different). But msort should work.

After some digging, it's interesting that even though scalar_t is bool, when the actual value is a uint8_t, the comparison will treat them as signed. I tried lhs=255 and rhs=0: lhs < rhs is equivalent to -1 < 0 which is true (but it's supposed to be False)

Therefore we add an explicit type cast.
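
The comparison described above can be reproduced in plain Python with `ctypes`. This is only an illustration: the commit does not say which type the explicit cast targets, so the `bool(...)` conversion below is an assumption.

```python
import ctypes

raw = 255  # a "non-standard" bool: nonzero, but not the canonical 1

# The same byte read as a signed versus an unsigned 8-bit integer.
as_signed = ctypes.c_int8(raw).value     # -1
as_unsigned = ctypes.c_uint8(raw).value  # 255

# Comparing the signed reading against 0 wrongly reports "less than".
signed_less = as_signed < 0

# Normalising both operands first (assumed cast target: bool) avoids that.
normalised_less = bool(raw) < bool(0)
```

After normalisation, 255 is treated like any other `True` value, so it can no longer sort before `False`.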

Test Plan: Remove the test skip

Differential Revision: D65472170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139870
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2024-11-11 02:00:34 +00:00
ae01f2b61b Extend CPU implementation of MSELoss to BF16 (#139959)
It's strange that it has not been implemented for the type yet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139959
Approved by: https://github.com/jgong5, https://github.com/janeyx99
ghstack dependencies: #139961
2024-11-07 23:50:15 +00:00
a59132b9c8 fix torch.linalg.norm and torch.norm for torch.complex32 datatype (#133661)
Fix https://github.com/pytorch/pytorch/issues/132634.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133661
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2024-11-07 03:21:36 +00:00
44df6522ee add Half/BFloat16 support for grid_sample on CPU (#134812)
Fix https://github.com/pytorch/pytorch/issues/127224.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134812
Approved by: https://github.com/Skylion007, https://github.com/mingfeima
2024-11-06 14:02:08 +00:00
e7cf7d00be Support torch.bool in torch.sort + CUDA (#139409)
Summary: This might be outdated, so I'm adding it back to see if we pass all the tests. I'm pretty sure CUDA 12 is OK.

Test Plan: CI

Differential Revision: D65282650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139409
Approved by: https://github.com/zou3519, https://github.com/ngimel, https://github.com/eqy
2024-11-06 00:02:54 +00:00
ffb7a08921 Fix torch.histc not checking min > max on cuda for int8 tensors (#139372)
Fixes #139360

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L323-L324)

Assigning `min` and `max` to the low-precision `input_t` variables `minvalue` and `maxvalue` causes a wrong comparison result in the following check:

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L353)

![image](https://github.com/user-attachments/assets/0d5c87f4-3dc6-48bb-bcc8-b1803e7cd487)

Change the type of `minvalue` and `maxvalue` to fix it, similar to these lines:

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L280-L282)
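
A pure-Python sketch of the failure mode follows. The values are hypothetical and two's-complement wraparound is assumed for the narrowing step; this is not the CUDA code.

```python
import ctypes


def narrow_to_int8(x: int) -> int:
    """Reinterpret an integer as a signed 8-bit value (wraps on overflow)."""
    return ctypes.c_int8(x).value


# Hypothetical arguments where min > max, so the range check should fire.
min_arg, max_arg = 128, 0

narrowed_min = narrow_to_int8(min_arg)  # wraps to -128
narrowed_max = narrow_to_int8(max_arg)  # stays 0

# Checking the narrowed values misses the invalid range...
check_on_narrowed = narrowed_min > narrowed_max

# ...while checking the original, wider values catches it.
check_on_original = min_arg > max_arg
```

Keeping the bounds in a wider type until after validation is what lets the `min > max` check see the caller's actual arguments.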

**Test Result**
```bash
$ pytest test/test_reductions.py -vv
```
![image](https://github.com/user-attachments/assets/6b5d0d48-ebc2-4a8c-85f4-dbad147c086c)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/f97c2d6d-78ea-4439-a1ba-907bc9defad7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139372
Approved by: https://github.com/eqy
2024-11-05 08:42:38 +00:00
38645e8a3e Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 8aedc649bdd0789b0ea9b9348d552fb1b0e437ff.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but this is still failing the same test on ExecuTorch ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2443209139))
2024-10-29 04:54:37 +00:00
c2ded9ec0d Fix dot reference checks (#138596)
The `dot` reference implementation should be consistent with the CPU/CUDA implementations, since it may be used for meta dispatch.

For example:
```python
import torch
x = torch.tensor([1,2,3], dtype=torch.float32)
y = torch.tensor([4,5,6], dtype=torch.float16)
x.dot(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dot : expected both vectors to have same dtype, but found Float and Half
```

However, the following does not raise an exception:
```python
x.to("meta").dot(y.to("meta"))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138596
Approved by: https://github.com/bdhirsh
2024-10-28 19:11:40 +00:00
8aedc649bd Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 19:13:44 +00:00
1bc73f3157 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 17:42:11 +00:00
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
7b39fb5712 Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 9f81270d7589fd7fa98dc247ae4b1b7ab239ca3c.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/clee2000 due to breaking some executorch tests D64568664 ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2423157700))
2024-10-18 20:09:40 +00:00
770fcaf2ab Fix the Rank of logsumexp Tensor and mGPU support. (#137717)
The logsumexp tensor was considered for internal use only, but it is apparently exposed to unit tests and Inductor.

The stream should be selected after picking the current device. Otherwise the code is checking the default device's architecture.

Fixes #131316 #137414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137717
Approved by: https://github.com/drisspg

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
2024-10-17 21:58:14 +00:00
9f81270d75 Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-17 21:27:35 +00:00
94537e70b5 Skip test_parity__foreach_mul_fastpath_inplace_cuda_complex128 internally (#138100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138100
Approved by: https://github.com/Skylion007
2024-10-17 00:34:56 +00:00
4b3035f2fe Revert "Add decomposition for permute_copy (#130944)"
This reverts commit e7a4ad3b409c226a1da0f597c66ece7c06de0e9e.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/clee2000 due to breaking internal builds D64418214 cc @digantdesai @GregoryComer to help get this fixed and remerged ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2418125356))
2024-10-16 23:18:53 +00:00
e4d7676c1b [CPU] Expand torch.special.i1 to Half and BF16 (#137899)
To match behavior of `torch.special.i0`

Noticed while looking at the failures in https://github.com/pytorch/pytorch/pull/137849

Also, add explicit high-precision template specializations of `calc_i0` and `calc_i1` for `BFloat16` and `Half`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137899
Approved by: https://github.com/Skylion007
2024-10-15 17:00:58 +00:00
e7a4ad3b40 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-15 13:51:20 +00:00
f984b88718 Ensure noncontiguous tensor creation tests offsetting (#136396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136396
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136055
2024-10-02 00:40:43 +00:00