Commit Graph

2247 Commits

Author SHA1 Message Date
71d8bb7ede implement torch._foreach_rsqrt (#134574)
Related:
- #133367
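A minimal usage sketch (not taken from the PR; it assumes the op takes a list of tensors like the other `_foreach_*` ops):

```python
import torch

# Hypothetical usage of the new private foreach op; this just mirrors the
# other _foreach_* APIs and computes element-wise 1/sqrt for each tensor.
ts = [torch.rand(3) + 0.1, torch.rand(4) + 0.1]
out = torch._foreach_rsqrt(ts)
```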

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134574
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-11-12 15:34:35 +00:00
565a7942ee Recover non-standard bool test for msort (#139870)
Summary:
I was looking into why non-standard bool values fail for msort. It makes sense for argsort and sort to fail, because we randomly generate uint8 values, so the order (and thus the indices) will differ. But msort should work.

After some digging, it turns out that even though `scalar_t` is bool, when the actual value is a uint8_t the comparison treats the operands as signed. With lhs=255 and rhs=0, `lhs < rhs` is evaluated as `-1 < 0`, which is true (but it should be false).

Therefore we add an explicit type cast.
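A hypothetical repro sketch (not from the PR) using a bool tensor whose storage holds a non-standard byte value:

```python
import torch

# Reinterpret uint8 storage as bool: byte value 255 is a "non-standard" True.
raw = torch.tensor([255, 0, 1], dtype=torch.uint8)
b = raw.view(torch.bool)
# With the explicit cast, 255 compares as True rather than -1, so msort
# orders False before True as expected.
print(torch.msort(b))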

Test Plan: Remove the test skip

Differential Revision: D65472170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139870
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2024-11-11 02:00:34 +00:00
ae01f2b61b Extend CPU implementation of MSELoss to BF16 (#139959)
It's strange that it has not been implemented for this dtype yet.
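A quick sanity check, assuming a build that includes the new BF16 CPU kernel:

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, dtype=torch.bfloat16)
y = torch.randn(8, dtype=torch.bfloat16)
print(F.mse_loss(x, y))  # BF16 on CPU was not implemented before this change
```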

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139959
Approved by: https://github.com/jgong5, https://github.com/janeyx99
ghstack dependencies: #139961
2024-11-07 23:50:15 +00:00
a59132b9c8 fix torch.linalg.norm and torch.norm for torch.complex32 datatype (#133661)
Fix https://github.com/pytorch/pytorch/issues/132634.
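A minimal check sketch, assuming a build with the fix (complex32/chalf support is still limited overall):

```python
import torch

x = torch.randn(4, dtype=torch.cfloat).to(torch.chalf)  # complex32 input
print(torch.linalg.norm(x))
print(torch.norm(x))
```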

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133661
Approved by: https://github.com/mingfeima, https://github.com/Skylion007
2024-11-07 03:21:36 +00:00
44df6522ee add Half/BFloat16 support for grid_sample on CPU (#134812)
Fix https://github.com/pytorch/pytorch/issues/127224.
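A minimal CPU sketch, assuming a build with the new Half/BFloat16 kernels:

```python
import torch
import torch.nn.functional as F

inp = torch.randn(1, 1, 4, 4, dtype=torch.bfloat16)
grid = torch.rand(1, 2, 2, 2, dtype=torch.bfloat16) * 2 - 1  # values in [-1, 1]
out = F.grid_sample(inp, grid, align_corners=False)
print(out.dtype)  # torch.bfloat16
```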

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134812
Approved by: https://github.com/Skylion007, https://github.com/mingfeima
2024-11-06 14:02:08 +00:00
e7cf7d00be Support torch.bool in torch.sort + CUDA (#139409)
Summary: This restriction might be outdated, so I'm adding bool support back to see if we pass all the tests. I'm pretty sure CUDA 12 is OK.
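A minimal sketch of the re-enabled path (assumes a CUDA device):

```python
import torch

b = torch.tensor([True, False, True, False], device="cuda")
values, indices = torch.sort(b)  # bool sort on CUDA, re-enabled by this change
print(values)
```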

Test Plan: CI

Differential Revision: D65282650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139409
Approved by: https://github.com/zou3519, https://github.com/ngimel, https://github.com/eqy
2024-11-06 00:02:54 +00:00
ffb7a08921 Fix torch.histc not checking min > max on cuda for int8 tensors (#139372)
Fixes #139360

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L323-L324)

Assigning `min` and `max` to the low-precision `input_t` variables `minvalue` and `maxvalue` causes a wrong comparison result in the following check:

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L353)

![image](https://github.com/user-attachments/assets/0d5c87f4-3dc6-48bb-bcc8-b1803e7cd487)

Change the type of `minvalue` and `maxvalue` to fix it, similar to these lines:

86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L280-L282)
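A hedged repro sketch (assumes a CUDA device; the exact pre-fix behavior depends on how the scalar wraps when cast to int8):

```python
import torch

t = torch.zeros(8, dtype=torch.int8, device="cuda")
try:
    # 200 wraps to -56 when cast to int8, so the old check missed that min > max.
    torch.histc(t, bins=4, min=200, max=100)
except RuntimeError as e:
    print(e)  # with the fix, the invalid range is rejected
```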

**Test Result**
```bash
$ pytest test/test_reductions.py -vv
```
![image](https://github.com/user-attachments/assets/6b5d0d48-ebc2-4a8c-85f4-dbad147c086c)

```bash
$ lintrunner
```
![image](https://github.com/user-attachments/assets/f97c2d6d-78ea-4439-a1ba-907bc9defad7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139372
Approved by: https://github.com/eqy
2024-11-05 08:42:38 +00:00
38645e8a3e Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 8aedc649bdd0789b0ea9b9348d552fb1b0e437ff.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but this is still failing the same test on ExecuTorch ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2443209139))
2024-10-29 04:54:37 +00:00
c2ded9ec0d Fix dot reference checks (#138596)
The dot reference implementation should be consistent with the CPU/CUDA implementations, since it may be used for meta dispatch.

For example, the eager implementation raises:
```python
import torch
x = torch.tensor([1,2,3], dtype=torch.float32)
y = torch.tensor([4,5,6], dtype=torch.float16)
x.dot(y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: dot : expected both vectors to have same dtype, but found Float and Half
```

However, the following does not raise an exception:
```python
x.to("meta").dot(y.to("meta"))
```
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138596
Approved by: https://github.com/bdhirsh
2024-10-28 19:11:40 +00:00
8aedc649bd Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 19:13:44 +00:00
1bc73f3157 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-23 17:42:11 +00:00
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
7b39fb5712 Revert "Fix unbind_copy and add its decomposition (#134319)"
This reverts commit 9f81270d7589fd7fa98dc247ae4b1b7ab239ca3c.

Reverted https://github.com/pytorch/pytorch/pull/134319 on behalf of https://github.com/clee2000 due to breaking some executorch tests D64568664 ([comment](https://github.com/pytorch/pytorch/pull/134319#issuecomment-2423157700))
2024-10-18 20:09:40 +00:00
770fcaf2ab Fix the Rank of logsumexp Tensor and mGPU support. (#137717)
The logsumexp tensor was intended for internal use only, but it is apparently exposed to unit tests and Inductor.

The stream should be selected after picking the current device. Otherwise the code is checking the default device's architecture.
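A rough Python analogy of the ordering point (the actual fix is in the C++/HIP kernel; this sketch assumes at least two GPUs):

```python
import torch

# Select the device first, then ask for the stream / properties;
# querying before selecting inspects device 0 instead of the intended one.
with torch.cuda.device(1):
    stream = torch.cuda.current_stream()          # stream on device 1
    props = torch.cuda.get_device_properties(1)   # architecture of device 1
```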

Fixes #131316 #137414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137717
Approved by: https://github.com/drisspg

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
2024-10-17 21:58:14 +00:00
9f81270d75 Fix unbind_copy and add its decomposition (#134319)
* Fixes https://github.com/pytorch/pytorch/issues/130829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134319
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-17 21:27:35 +00:00
94537e70b5 Skip test_parity__foreach_mul_fastpath_inplace_cuda_complex128 internally (#138100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138100
Approved by: https://github.com/Skylion007
2024-10-17 00:34:56 +00:00
4b3035f2fe Revert "Add decomposition for permute_copy (#130944)"
This reverts commit e7a4ad3b409c226a1da0f597c66ece7c06de0e9e.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/clee2000 due to breaking internal builds D64418214 cc @digantdesai @GregoryComer to help get this fixed and remerged ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2418125356))
2024-10-16 23:18:53 +00:00
e4d7676c1b [CPU] Expand torch.special.i1 to Half and BF16 (#137899)
To match behavior of `torch.special.i0`

Noticed while looking at the failures in https://github.com/pytorch/pytorch/pull/137849

Also, add explicit high-precision template specializations of `calc_i0` and `calc_i1` for `BFloat16` and `Half`.
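A quick sketch of the expanded dtype coverage, assuming a build with this change:

```python
import torch

x = torch.linspace(0, 5, steps=8).to(torch.bfloat16)
print(torch.special.i1(x))  # now dispatches for BF16/Half on CPU
print(torch.special.i0(x))  # already supported; i1 now matches
```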

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137899
Approved by: https://github.com/Skylion007
2024-10-15 17:00:58 +00:00
e7a4ad3b40 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-15 13:51:20 +00:00
f984b88718 Ensure noncontiguous tensor creation tests offsetting (#136396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136396
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136055
2024-10-02 00:40:43 +00:00
b85f21fc1d Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136653
2024-10-01 10:23:22 +00:00
c610aa80dc Testing: Unblock new_* testing on MPS (#137003)
By changing `other_dtype` to `torch.half` rather than `double` in
`sample_inputs_new_fns` if MPS is available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137003
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986
2024-09-30 19:06:12 +00:00
370c1c4297 [aotd] Fix rrelu compilation (#136008)
Issues:
https://github.com/pytorch/pytorch/issues/135083
https://github.com/pytorch/pytorch/issues/120292

The rrelu decomposition contains a mutation (`copy_`). Decompositions are executed below Functionalization; as a result, AOT produces a non-functional graph.

Also, that decomposition is registered as a python_dispatch kernel for AutogradCUDA.
Autograd dispatch happens above Functionalization, so registering it for Autograd (handling all backends) makes functionalization run after it.
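A rough repro sketch (assumes a CUDA device; based on the linked issues, not code from this PR):

```python
import torch
import torch.nn.functional as F

def f(x):
    # The rrelu decomposition used to introduce copy_ below Functionalization.
    return F.rrelu(x, training=True)

x = torch.randn(4, device="cuda", requires_grad=True)
out = torch.compile(f)(x)  # previously produced a non-functional graph under AOTAutograd
```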

Testing:
```bash
python test/functorch/test_aotdispatch.py -k test_rrelu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136008
Approved by: https://github.com/bdhirsh
2024-09-25 11:26:19 +00:00
efed357ef5 Add dtypes support in opinfo for Intel Gaudi (#132840)
## Motivation
This follows up on the changes introduced in https://github.com/pytorch/pytorch/pull/128584:
we add the dtype information to be picked up while executing the UTs for Intel Gaudi/HPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132840
Approved by: https://github.com/albanD
2024-09-24 17:17:15 +00:00
f276da7f98 Remove prims.slice_in_dim and prims.slice (#136150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136150
Approved by: https://github.com/ezyang
2024-09-23 01:27:22 +00:00
b6ffa381e1 [BE]: Add half CUDA support nextafter (#136373)
Making CUDA support match CPU support for nextafter
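A minimal sketch (assumes a CUDA device and a build with this change):

```python
import torch

a = torch.tensor([1.0], dtype=torch.half, device="cuda")
b = torch.tensor([2.0], dtype=torch.half, device="cuda")
print(torch.nextafter(a, b))  # smallest half value greater than 1.0
```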
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136373
Approved by: https://github.com/ezyang
2024-09-21 17:13:45 +00:00
0c936c3ecb Add decomps for max_unpool (#133146)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133146
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-20 21:35:25 +00:00
15dba021bb [ROCm][CI] upgrade CI to ROCm 6.2 (#132555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132555
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-09-20 17:39:31 +00:00
bce52d0b60 [CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288)
Summary:
To facilitate the PSS-2 upgrade, this uses `npt.NDArray` instead of `np.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `np.ndarray` -- a noop.
In Numpy-1.24, `npt.NDArray` is a proper generic type, and without this change uses of `np.ndarray` generate this Pyre type error:
```counterexample
 Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```
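A small annotation sketch showing the version-friendly spelling (hypothetical function, not from the diff):

```python
import numpy as np
import numpy.typing as npt

def normalize(x: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
    # npt.NDArray is parametrized by dtype, which avoids the Pyre error
    # about np.ndarray expecting two type parameters.
    return x / np.linalg.norm(x)
```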

Test Plan: Sandcastle plus visual inspection

Differential Revision: D62977370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
2024-09-19 12:40:36 +00:00
462b727d1e Revert "Add decomposition for permute_copy (#130944)"
This reverts commit ab9a7eadd34aee59fc67e29237610b7562cc4ff0.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/jeanschmidt due to Broke internal signal executorch.backends.xnnpack.test.ops.permute.TestPermute, more details on D62737086. @eellison could you please help get this PR merged to main? ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2355846394))
2024-09-17 13:42:55 +00:00
2c4ae81494 Revert "Add decomposition for squeeze_copy (#130941)"
This reverts commit c33b0580e6a702be0cd5be691b3b465da012aa34.

Reverted https://github.com/pytorch/pytorch/pull/130941 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/130944, after fixing any merge conflicts, feel free to merge it back ([comment](https://github.com/pytorch/pytorch/pull/130941#issuecomment-2355831480))
2024-09-17 13:39:07 +00:00
23c0d2689e [BE][Ez]: Fix missing float16 coverage for adaptive_pool3d_cpu (#136091)
Testing if op info coverage has issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136091
Approved by: https://github.com/ezyang
2024-09-16 18:22:16 +00:00
b491e2974c [BE][Ez]: Add full half/bfloat16 dtype for unique and isin (#136114)
Fixes #136090

* Add isin support for half dtypes on CPU (just a few extra dispatches).
* It seems the CUDA implementation for bfloat16 was mostly compiled and available all along (it just calls sort and unique internally). To enable it, we only need to remove an assert (sort's functionality has been updated since the assert was added) and add the missing dtype support to unique.
* This unlocks more GPU functionality with minimal code bloat. I also added CPU kernels for the dtypes for parity.
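A quick sketch of the newly covered dtypes (assumes a build with this change):

```python
import torch

x = torch.tensor([1.0, 2.0, 2.0, 3.0], dtype=torch.half)
print(torch.unique(x))
print(torch.isin(x, torch.tensor([2.0], dtype=torch.half)))
# On CUDA, bfloat16 goes through the same (now unblocked) sort-based path.
```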

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114
Approved by: https://github.com/malfet
2024-09-16 17:49:12 +00:00
c33b0580e6 Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-16 15:46:57 +00:00
ab9a7eadd3 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-15 19:35:14 +00:00
db393fb95e Add Half support for reflection and replication padding on CPU (#135931)
Fixes #135680
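A minimal CPU sketch, assuming a build with the new Half kernels:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, dtype=torch.half)
print(F.pad(x, (1, 1), mode="reflect"))
print(F.pad(x, (1, 1), mode="replicate"))
```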

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135931
Approved by: https://github.com/Skylion007
2024-09-14 14:18:55 +00:00
1786a17fed Revert "Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)"
This reverts commit 51c52061339069a2162e921e5b464fad5a411522.

Reverted https://github.com/pytorch/pytorch/pull/135232 on behalf of https://github.com/CaoE due to wrong commit ([comment](https://github.com/pytorch/pytorch/pull/135232#issuecomment-2350792806))
2024-09-14 02:31:06 +00:00
51c5206133 Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)
Use `_amp_foreach_non_finite_check_and_unscale_` instead of the fallback version for CPU grads of `ShardedGradScaler`, as `_amp_foreach_non_finite_check_and_unscale_` is supported on CPU since https://github.com/pytorch/pytorch/pull/109281.
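A low-level sketch of the op this switches to (assumes the CPU kernel from #109281 is available):

```python
import torch

grads = [torch.randn(3), torch.tensor([float("inf")])]
found_inf = torch.zeros(1)
inv_scale = torch.full((1,), 0.5)
# In place: multiplies each grad by inv_scale and sets found_inf to 1.0
# if any non-finite value is encountered.
torch._amp_foreach_non_finite_check_and_unscale_(grads, found_inf, inv_scale)
print(found_inf)  # tensor([1.])
```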

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232
Approved by: https://github.com/ezyang
2024-09-14 02:20:58 +00:00
9a04cfbeff fix for fp16 (#134106)
This PR is a replacement for https://github.com/pytorch/pytorch/pull/133085 for pushing a quick fix for RMSNorm.
The original author is @kkontny

Previous PR summary:
Since FP16 has quite a small dynamic range, it is very easy to overflow while computing `at::pow(input, 2)`, and it happens in real-world computation.

I tried to use the fused `nn.RMSNorm` implementation instead of `LlamaRMSNorm` inside the `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in FP16 while still giving good results in FP32. I figured out this happens due to overflow while computing the square of the input tensor.

The original `LlamaRMSNorm` implementation upcasts the input to FP32 to prevent this and give better numerical stability.

```python
class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```

The proposed commit fixes the issue. FP16 in RMSNorm has to be treated in a special way to be usable in real-world implementations.
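For reference, a minimal sketch of the overflow described above (FP16 max is about 65504):

```python
import torch

x = torch.full((3,), 300.0, dtype=torch.float16)
print(x.pow(2))          # overflows to inf: 300**2 = 90000 > 65504
print(x.float().pow(2))  # upcasting first keeps the result finite
```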

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134106
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy
2024-09-11 22:02:07 +00:00
74fd1bf965 [ROCm] Update to AOTriton 0.7b (#134498)
Notable changes:
1. Enable CudaGraph related tests
2. Fix UT problems
3. EXPERIMENTAL Navi31 support. Users should enable Navi31 support with the env var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Known Problem:
1. `test/test_transformers.py` will show massive failures and/or NaN outputs with `--use-pytest`
    + Update: Confirmed skip `class TestSDPAPrivateUse1Only` can fix the problem with `--use-pytest`

Note:
AOTriton 0.7b adds support for nested tensors + SDPA but needs more work (and consequently a separate PR) to enable it.

Fixes #133540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-09-11 20:34:01 +00:00
e05ea2b179 Add decomposition for transpose_copy (#130943)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130943
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-11 19:45:22 +00:00
a15aabc975 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor`, but I didn't find the tests for the dispatch of such an operation. Where are they?
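A small usage sketch of the new passthrough (MaskedTensor is a prototype API):

```python
import torch
from torch.masked import masked_tensor

data = torch.arange(6.0).reshape(2, 3)
mask = torch.tensor([[True, False, True], [True, True, False]])
mt = masked_tensor(data, mask)
out = mt.unfold(1, 2, 1)  # sliding windows of size 2 along dim 1, mask passed through
print(out)
```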
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-09-06 19:06:23 +00:00
43f4947d44 fix fake tensor tolist implementation (#135131)
Summary:
When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies.

Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we desugar to `item()` calls and let them take care of unbacked symints.
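Conceptually, the desugaring looks roughly like this (an illustrative sketch, not the actual implementation):

```python
# tolist() on a fake 1-D tensor is treated as a series of item() calls,
# each of which produces an unbacked symint handled by the usual machinery.
def tolist_via_item(t):
    return [t[i].item() for i in range(t.shape[0])]
```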

Test Plan:
Some expected failures are gone now.
Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but not really doing much; repurposed it now to exercise more modes.

Differential Revision: D62197742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131
Approved by: https://github.com/ezyang
2024-09-05 23:20:31 +00:00
4f70b3cfae [CUDA][complex][TF32] Update test_noncontiguous_samples tolerances for complex64 (#134526)
A recent cuDNN heuristics change surfaces the same TF32 issue as `float32`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134526
Approved by: https://github.com/ezyang
2024-09-04 23:37:16 +00:00
758d787901 Added complex support for torch.logsumexp (#133187)
Added complex support for `torch.logsumexp`. Implemented complex backward pass for `torch.logsumexp`.
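A minimal forward/backward sketch, assuming a build with this change:

```python
import torch

z = torch.randn(5, dtype=torch.complex64, requires_grad=True)
out = torch.logsumexp(z, dim=0)   # complex result
out.abs().backward()              # exercises the new complex backward
print(z.grad)
```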

Fixes #133047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133187
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-09-03 17:28:36 +00:00
6fce1faa10 change multinomial to use async asserts instead of a synchronization (#134818)
Fixes https://github.com/pytorch/pytorch/issues/134442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134818
Approved by: https://github.com/ezyang
ghstack dependencies: #134813
2024-09-03 06:33:24 +00:00
e21d7b77ce Update ForeachfuncInfo.sample_inputs_func to yield scalars & scalarlists that are more friendly to test_meta (#134552)
This lets `test_meta.py` see more "PASSED" instead of "XFAIL".

`pytest test_meta.py -k "_foreach_"` ran 6400 test cases and:
- This PR: 4702 passed, 260 skipped, 73732 deselected, 1698 xfailed
- main (92c4771853892193d73d87bd60eca4dc7efc51d8): 3906 passed, 260 skipped, 73732 deselected, 2494 xfailed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134552
Approved by: https://github.com/janeyx99
2024-08-30 17:30:50 +00:00
f997b2b8e6 Revert "Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)"
This reverts commit f685018ea9d08f98cbd7106028db134f967f74d3.

Reverted https://github.com/pytorch/pytorch/pull/125262 on behalf of https://github.com/ZainRizvi due to Hi, this PR appears to be causing maskedtensor tests to fail on main. Please rebase your changes onto the latest trunk build to repro the failure. test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604716811/job/29393256312) [HUD commit link](f685018ea9) ([comment](https://github.com/pytorch/pytorch/pull/125262#issuecomment-2316387447))
2024-08-28 23:10:07 +00:00
f685018ea9 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor`, but I didn't find the tests for the dispatch of such an operation. Where are they?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-08-28 21:30:39 +00:00
289486d007 Move attention kernels back from fake_impls to meta_registrations (#134288)
See #121528 for additional context.

In #120682, we moved the attention kernels from meta_registrations to fake_impls with the intent of fixing the device handling for seed/offset: these are typically on CPU. We needed to put the registrations in fake_impls to do this because meta_registrations doesn't have a way to specify device, whereas fake_impls does. But when we tried to actually fix the device types (#120839), we had to revert the PR because it broke cudagraph handling (during which seed/offset _are_ on CUDA).

Now, we want to put the registrations back in meta_registrations so that we can call these kernels with meta tensors. The use case is later in this stack - we want to be able to use the flop counter with these kernels.

Also - I specifically skip the `compare_tensor_meta()` check in test_fake / test_fake_autocast tests for the `_efficient_attention_forward` and `_flash_attention_forward` kernels, which fails because of the device mismatch from the seed/offset tensors. Then we can un-skip these opinfos. I verified that the efficient_attention_forward bug (#120842) is now caught by these opinfos if I revert the fix from this PR.
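A sketch of the intended use case (assumes SDPA dispatches cleanly on meta tensors once these registrations are back in meta_registrations; exact backend selection on meta may vary):

```python
import torch
from torch.utils.flop_counter import FlopCounterMode

q = k = v = torch.randn(1, 8, 128, 64, device="meta", dtype=torch.float16)
with FlopCounterMode(display=False) as fc:
    torch.nn.functional.scaled_dot_product_attention(q, k, v)
print(fc.get_total_flops())  # FLOPs counted without allocating real memory
```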

Differential Revision: [D61687369](https://our.internmc.facebook.com/intern/diff/D61687369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134288
Approved by: https://github.com/drisspg
2024-08-27 21:10:36 +00:00