The original issue is an accuracy problem in a Meta internal model [meta internal link](https://fb.workplace.com/groups/1075192433118967/posts/1567334737238065/). The debugging was hard, but the root cause is relatively simple: the model has mixed-device inputs for index.Tensor, which causes Inductor to fall back, and the meta kernel for index.Tensor returns a tensor whose strides are inconsistent with the eager kernel.
The following code snippet
```
import torch
from torch._subclasses import FakeTensorMode

device = "cuda"
# Channels-last input reshaped so that indexing hits the non-dense layout.
x = torch.randn((24, 16, 32, 32), device=device).to(memory_format=torch.channels_last)
x = x.view(2, 12, 16, 32, 32)
i1 = torch.arange(2).unsqueeze(-1)
i2 = torch.argsort(torch.rand(2, 12), dim=-1)[:, :3]
print(f"Eager stride: {x[i1, i2].stride()}")

mode = FakeTensorMode()
with mode:
    f_x = mode.from_tensor(x)
    f_i1 = mode.from_tensor(i1)
    f_i2 = mode.from_tensor(i2)
    f_out = f_x[f_i1, f_i2]
print(f"Meta stride: {f_out.stride()}")
```
would output:
```
Eager stride: (49152, 16384, 1, 512, 16)
Meta stride: (49152, 16384, 1024, 32, 1)
```
In this PR, I fix the problem by running the eager kernel to get the output layout of the index.Tensor fallback. A better solution would be to change the meta/eager kernel implementations so that their output layouts match, but I'm not sure how to do that properly.
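A minimal sketch of the idea, using illustrative names (this is not the actual Inductor code): run the eager kernel once on real inputs and adopt whatever layout it produced for the meta output.
```
import torch

def layout_from_eager(op, real_args):
    # Run the eager kernel once to see which strides it actually produces,
    # then build a meta tensor with exactly that layout.
    real_out = op(*real_args)
    return torch.empty_strided(real_out.size(), real_out.stride(),
                               dtype=real_out.dtype, device="meta")

# Same construction as the snippet above, here on CPU for illustration.
x = torch.randn(24, 16, 32, 32).to(memory_format=torch.channels_last).view(2, 12, 16, 32, 32)
i1 = torch.arange(2).unsqueeze(-1)
i2 = torch.argsort(torch.rand(2, 12), dim=-1)[:, :3]
meta_out = layout_from_eager(torch.ops.aten.index.Tensor, (x, [i1, i2]))
print(meta_out.stride())  # layout taken from the eager kernel
```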
In the index.Tensor meta kernel, we always produce a dense output: 6d56277682/torch/_meta_registrations.py (L3184). The eager kernel, on the other hand, appears to leverage TensorIteratorBase to decide on a dimension permutation: 6d56277682/aten/src/ATen/TensorIterator.cpp (L232-L308). We could duplicate this logic in the meta kernel implementation if we really want meta to match eager. I can follow up on this if people feel strongly that we should.
There is also an issue, https://github.com/pytorch/pytorch/issues/144717, about asserting sizes/strides for fallback kernels. With that in place, the problem debugged here would have been much easier to root cause.
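A minimal sketch of what such an assertion could look like, assuming a hypothetical helper that compares a fallback op's eager and fake outputs (not the implementation proposed in that issue):
```
import torch
from torch._subclasses import FakeTensorMode

def assert_fallback_layout_matches(op, *real_args):
    eager_out = op(*real_args)
    mode = FakeTensorMode()

    def to_fake(a):
        if isinstance(a, torch.Tensor):
            return mode.from_tensor(a)
        if isinstance(a, (list, tuple)):
            return type(a)(to_fake(x) for x in a)
        return a

    with mode:
        fake_out = op(*[to_fake(a) for a in real_args])
    assert tuple(eager_out.shape) == tuple(fake_out.shape)
    assert eager_out.stride() == fake_out.stride(), (
        f"meta stride {fake_out.stride()} != eager stride {eager_out.stride()}"
    )
```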
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144736
Approved by: https://github.com/jansel
Summary:
Fix the `nonzero is not registered to meta` issue:
```
"NotImplementedError: aten::nonzero: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered".
```
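For context, an illustrative sketch of the general mechanism (not the actual fix for `aten::nonzero`): this error appears whenever an operator is hit under fake/meta tensors without a registered fake impl. For a custom op, one would register one with `torch.library`, for example:
```
import torch

@torch.library.custom_op("demo::twice", mutates_args=())
def twice(x: torch.Tensor) -> torch.Tensor:
    return x * 2

# Without this registration, running `twice` under FakeTensorMode raises the
# same kind of "no fake impl or Meta kernel registered" error as above.
@torch.library.register_fake("demo::twice")
def _(x):
    return torch.empty_like(x)
```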
Reviewed By: ezyang
Differential Revision: D66525640
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144727
Approved by: https://github.com/ezyang
Tracking issue: #138399
This PR changes the `pow` C++ implementation, making its C++ meta kernel consistent with
its Python ref implementation. The following example shows the inconsistency between the
two:
```python
import torch

def run(device):
    S = (5,)
    a = torch.rand(S, device=device, dtype=torch.float32)
    b = 2
    out = torch.empty(S, device=device, dtype=torch.float64)
    return torch.pow(a, b, out=out)

>>> run("cpu")
Traceback (most recent call last):
  File "test.py", line 34, in run
    return torch.pow(a, b, out=out)
RuntimeError: Found dtype Double but expected Float

>>> run("meta")
tensor(..., device='meta', size=(5,), dtype=torch.float64)
```
**~Update:~**
~Note that this happens only for `pow.Tensor_Scalar` overloads. Therefore, this PR needed
2 further modifications:~
- ~Split the `pow` ref implementation, making `pow.Tensor_Scalar` error on mismatching
output dtypes~
- ~Create a dispatch for `pow` when `_refs.pow()` is called~
**Update:**
Changing the `TensorIteratorConfig` for `pow.Tensor_Scalar` was easier and,
after the discussion below, more correct. The solution was to change the
`TensorIteratorBase::build_output_borrowing_argument_owning_unary_op` function,
setting:
- `cast_common_dtype_to_outputs`; and
- `enforce_safe_casting_to_output`.
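A hedged consistency check (not from the PR) of the intended effect: after the config change, the CPU and meta kernels for `pow.Tensor_Scalar` should agree on whether a float64 `out` is accepted for float32 inputs.
```python
import torch

def try_pow(device):
    a = torch.rand((5,), device=device, dtype=torch.float32)
    out = torch.empty((5,), device=device, dtype=torch.float64)
    try:
        torch.pow(a, 2, out=out)
        return "ok"
    except RuntimeError as e:
        return f"error: {e}"

print("cpu: ", try_pow("cpu"))
print("meta:", try_pow("meta"))  # should match the cpu result after the fix
```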
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140287
Approved by: https://github.com/ezyang
This PR replaces the parameter names specified in the `triangular_solve_meta`
function (specifically in its `@out_wrapper(...)` decorator) with those written in the
_native_functions.yaml_ file.
This name mismatch caused the operation to fail when using the meta device (see error
below):
```python
Traceback (most recent call last):
File "examples/test.py", line 23, in <module>
torch.triangular_solve(b.to("meta"), A.to("meta"), out=meta_out)
File "torch/_decomp/__init__.py", line 100, in _fn
return f(*args, **kwargs, out=None if is_none else out_kwargs)
File "torch/_prims_common/wrappers.py", line 289, in _fn
result = fn(*args, **kwargs)
TypeError: triangular_solve_meta() got an unexpected keyword argument 'X'
```
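For reference, a self-contained, hedged version of the failing call from the traceback (shapes are illustrative), which should run cleanly on the meta device once the names match:
```python
import torch

A = torch.randn(3, 3)
b = torch.randn(3, 1)
meta_out = (torch.empty(0, device="meta"), torch.empty(0, device="meta"))
# Before the fix this raised the TypeError inside out_wrapper; afterwards the
# meta-device call should succeed and fill meta_out with meta tensors.
torch.triangular_solve(b.to("meta"), A.to("meta"), out=meta_out)
```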
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140186
Approved by: https://github.com/ezyang
This PR resolves several sets of `_scaled_mm` test failures:
- `scale_a` and `scale_b` are now required arguments, so the function `sample_inputs_scaled_mm` must supply them (a hedged call sketch follows this list)
- `_scaled_mm` does not support the `"meta"` device, so it should be skipped in `test_meta.py`
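A hedged sketch of supplying the now-required scales (hypothetical shapes and scale values, assuming a GPU with fp8 support; not taken from the PR):
```python
import torch

a = torch.randn(16, 16, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(16, 16, device="cuda").to(torch.float8_e4m3fn).t()  # column-major
scale_a = torch.tensor(1.0, device="cuda")  # per-tensor scales are now required
scale_b = torch.tensor(1.0, device="cuda")
result = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b)
```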
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130897
Approved by: https://github.com/drisspg
Fixes https://github.com/pytorch/pytorch/issues/121085
This PR is pretty involved, so pay attention to this description. At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.
However, this transformation is not always entirely mechanical. Here
is what you need to pay attention to:
- The memo table from real Tensor -> meta/fake Tensor is now broken
into two memo tables: real Tensor -> stable int id -> meta/fake
Tensor. The stable int id is needed so that when we do serialization,
we know when tensors/storages alias each other and can ensure we preserve
this aliasing upon deserialization. (A rough sketch of this two-level arrangement appears after this list.)
The way I have implemented this changes the weak reference behavior.
Previously, when either the real Tensor OR the meta/fake Tensor went
dead, we would remove the entry from the memo table. Now, this only
removes entries from one of the two memo tables. This semantically
makes sense, because the user may have held on to the stable int id
out of band, and may expect a real Tensor to continue to be numbered
consistently / expect to be able to lookup a meta/fake tensor from
this id. If this is unacceptable, it may be possible to rejigger
the memo tables so that we have real Tensor -> stable int id
and real Tensor -> meta/fake Tensor, but TBH I find the new
implementation a lot simpler, and arranging the memo tables in this
way means that I have to muck around with the real tensor to save
to the memo table; in the current implementation, I never pass the
Tensor to meta_tensor function AT ALL, which means it is impossible
to accidentally depend on it.
- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
to be careful not to poke fields when they are not valid. Previously,
preconditions were implicitly checked via the conditional structure
("is this sparse? is this nested?") that is tested before we start
reading attributes. This structure has to be replicated in
describe_tensor, and I have almost assuredly gotten it wrong on my
first try (I'll be grinding through it on CI; a careful audit will
help too, checking that I've tested all the same conditionals that
the original access was guarded by.)
- I originally submitted https://github.com/pytorch/pytorch/pull/121821
for the symbolic shapes change, but it turned out the way I did it
there didn't actually work so well for this PR. I ended up just
inlining the symbolic shapes allocation logic into MetaConverter
(look for calls to maybe_specialize_sym_int_with_hint), maybe there
is a better way to structure it, but what I really want is to
just read sizes/strides/offset directly off of MetaTensorDesc; I
don't want another intermediate data structure.
- Some fields aren't serializable. These are documented as "NOT
serializable". ctx/type should morally be serializable and I just
need to setup a contract with subclasses to let them be serialized.
The fake_mode is used solely to test if we are refakefying with
a pre-existing ShapeEnv and we want to reuse the SymInt
directly--serializing this case is hopeless but I am kind of hoping
after this refactor we do not need this at all. view_func is not
serializable because it's a bound C implemented method. Joel has
promised me that this is not too difficult to actually expose as a
true data structure, but this is the edgiest of edge cases and there
is no reason to deal with it right now.
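A rough sketch of the two-level memo arrangement described in the first bullet above (illustrative names only, not the actual MetaConverter code):
```python
import torch
from torch.utils.weak import WeakIdKeyDictionary

class TwoLevelMemo:
    def __init__(self):
        self._next_id = 0
        self._id_of = WeakIdKeyDictionary()  # real Tensor -> stable int id
        self._fake_of = {}                   # stable int id -> meta/fake Tensor

    def describe(self, t: torch.Tensor) -> int:
        # Hand out a stable id per real tensor; the id stays meaningful even
        # if the fake tensor for it is later dropped from the second table.
        if t not in self._id_of:
            self._id_of[t] = self._next_id
            self._next_id += 1
        return self._id_of[t]

    def remember(self, tid: int, fake: torch.Tensor) -> None:
        self._fake_of[tid] = fake

    def lookup(self, tid: int):
        return self._fake_of.get(tid)
```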
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
ghstack dependencies: #122018
**Summary:**
This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:
```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
aten.native_batch_norm ->
aten._native_batch_norm_legit (export only) ->
_batch_norm_legit_cpu/cuda (kernels, export only) ->
_batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```
Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first and then move them to CUDA later may silently
see worse accuracy even when cuDNN is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.
Instead, the new hierarchy, obtained by consolidating the
existing batch norm ops, will look like:
```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```
The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:
```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2-week FC
window, and the ops used in the old stack are planned to
be removed after the 6-month BC window.
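A hedged usage sketch of the new op (argument order per my reading of the description above; treat it as illustrative rather than the exact schema):
```
import torch

x = torch.randn(2, 3, 8, 8)
weight, bias = torch.ones(3), torch.zeros(3)
running_mean, running_var = torch.zeros(3), torch.ones(3)
# One backend-agnostic call that also updates running_mean/running_var in place;
# the backend kernel (CPU/CUDA/cuDNN/MIOpen) is chosen automatically.
out, save_mean, save_invstd, reserve = torch.ops.aten.batch_norm_with_update(
    x, weight, bias, running_mean, running_var, 0.1, 1e-5
)
```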
Test Plan: `OpInfo` tests for `batch_norm_with_update`.
Reviewers: albanD, bdhirsh
Subscribers: albanD, bdhirsh, supriyar
Tasks: https://github.com/pytorch/pytorch/issues/111384
Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
`linalg_eigvals_out` calls into a dispatch stub, so it only supports CPU and CUDA
strided tensors, yet it was incorrectly claimed to be a composite op. `linalg_eigvals`
also shouldn't defer to the out variant inside a `CompositeImplicitAutograd` op,
as not all types support out variants. Instead, I add a new helper
`_linalg_eigvals` which does the same thing as a non-composite operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121142
Approved by: https://github.com/lezcano
The first try reused TensorListMetadata, which caused illegal memory access issues when there were too many tensors in the list. Now we launch multiple kernels with a simpler version of the struct (while still minimizing the number of kernels launched).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119927
Approved by: https://github.com/albanD
This should fix the remaining errors with the Resize op in torchvision: https://github.com/pytorch/vision/actions/runs/7298953575?pr=8127
```
/opt/conda/envs/ci/lib/python3.8/site-packages/torch/nn/functional.py:4072: in interpolate
return torch._C._nn._upsample_bicubic2d_aa(input, output_size, align_corners, scale_factors)
E torch._dynamo.exc.TorchRuntimeError: Failed running call_function <function interpolate at 0x7f4443fe00d0>(*(FakeTensor(..., size=(1, s0, s1, s2)),), **{'size': [s4, floor(s3*s4/floor(s1*s3/s2))], 'mode': 'bicubic', 'align_corners': False, 'antialias': True}):
E aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:5567: SymIntArrayRef expected to contain only concrete integers
E
E from user code:
E File "/pytorch/vision/torchvision/transforms/v2/functional/_geometry.py", line 260, in resize_image
E image = interpolate(
E
E Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
E
E
E You can suppress this exception and fall back to eager by setting:
E import torch._dynamo
E torch._dynamo.config.suppress_errors = True
```
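For reference, a hedged, self-contained approximation of the failing call (illustrative shapes; the original failure came from torchvision's Resize under dynamic shapes):
```python
import torch
import torch.nn.functional as F

def resize_half(x):
    # Output size derived from the (symbolic) input size, so it is not a
    # concrete integer under dynamic-shape compilation.
    h, w = x.shape[-2:]
    return F.interpolate(x, size=[h // 2, w // 2], mode="bicubic",
                         align_corners=False, antialias=True)

compiled = torch.compile(resize_half, dynamic=True)
out = compiled(torch.randn(1, 3, 64, 48))
```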
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117347
Approved by: https://github.com/peterbell10
Context: the pt2 oncall is revamping its labeling system. One of the guidelines is to remove duplicate labels in our system. Both the primTorch and decomposition labels refer to the same thing. primTorch was the legacy name (and we no longer have a primTorch project), so using decomposition as the label name makes more sense.
Right now, the only open issues that use "module: primTorch" are the ones generated by the DISABLED bots. Once we replace the label in the bot, we can safely remove the primTorch label.
Here is an example of an issue that has the primTorch label:
https://github.com/pytorch/pytorch/issues/112719
Torchbot uses the following logic to auto-extract module owners:
https://github.com/pytorch/test-infra/blob/main/torchci/pages/api/flaky-tests/disable.ts#L391
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114754
Approved by: https://github.com/huydhn