275 Commits

Author SHA1 Message Date
63276edb7c [Inductor] support mixed dtype in the native_layer_norm_backward meta function (#159830)
Fixes #159829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159830
Approved by: https://github.com/albanD
2025-09-17 20:29:12 +00:00
8c506e6310 [easy][test] Add repeat_interleave opinfo that exercises binary search fusion (#161445)
This adds a configuration that would have caught the need for https://github.com/pytorch/pytorch/pull/159961 when https://github.com/pytorch/pytorch/pull/158462 was landed.

Notably:
* the test has output_size kwarg specified
* the input is 1D plus a size-1 dimension (otherwise, if there are non-size-1 dimensions, then the fusion won't occur)

Differential Revision: [D80981715](https://our.internmc.facebook.com/intern/diff/D80981715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161445
Approved by: https://github.com/eellison, https://github.com/v0i0
2025-08-26 12:32:24 +00:00
fc0376e8b1 [BE][2/6] fix typos in test/ (test/test_*.py) (#157636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157636
Approved by: https://github.com/yewentao256, https://github.com/mlazos
ghstack dependencies: #156311, #156609
2025-07-09 11:02:23 +00:00
13966d0bf5 [BE] Migrate dtype_abbrs into one location (#152229)
Namely `torch.utils._dtype_abbrs.dtype_abbrs`

Before that it was defined in various forms of completeness in
c02edba863/torch/fx/graph.py (L215),
c02edba863/torch/testing/_internal/common_utils.py (L5226)
 and c02edba863/torch/testing/_internal/logging_tensor.py (L17)

TODO:
 - Add linter that `torch.testing._internal` module is not referenced from any of the public facing APIs, as it can have extra dependencies such as `expect_test`

Fixes https://github.com/pytorch/pytorch/issues/152225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152229
Approved by: https://github.com/clee2000, https://github.com/Skylion007
2025-04-28 03:52:47 +00:00
0c0583254e [inductor] fix index.Tensor fallback (#144736)
The original issue is we see accuracy problem in a meta internal model [meta internal link](https://fb.workplace.com/groups/1075192433118967/posts/1567334737238065/).  The debugging is hard but the root cause is relatively simple. The root cause is that the model has mix-device inputs for index.Tensor which causes Inductor to fallback. And the meta kernel for index.Tensor returns a tensor with inconsistent strides to the eager kernel.

The following code snippet
```
import torch
from torch._subclasses import FakeTensorMode

device = "cuda"

x = torch.randn((24, 16, 32, 32), device=device).to(memory_format=torch.channels_last)
x = x.view(2, 12, 16, 32, 32)

i1 = torch.arange(2).unsqueeze(-1)
i2 = torch.argsort(torch.rand(2, 12), dim=-1)[:, :3]

print(f"Eager stride: {x[i1, i2].stride()}")

mode = FakeTensorMode()
with mode:
    f_x = mode.from_tensor(x)
    f_i1 = mode.from_tensor(i1)
    f_i2 = mode.from_tensor(i2)
    f_out = f_x[f_i1, f_i2]
    print(f"Meta stride: {f_out.stride()}")
```

would output:
```
Eager stride: (49152, 16384, 1, 512, 16)
Meta stride: (49152, 16384, 1024, 32, 1)
```

In this PR, I fix the problem to run eager kernel to get the index.Tensor fallback's output layout. A better solution would be to change meta/eager kernel implementation so that their output layout matches. But I'm not sure how to properly do that.
In the index.Tensor meta kernel, we always produce dense output:  6d56277682/torch/_meta_registrations.py (L3184) . While the eager kernel seems to leverage TensorIteratorBase to decide some dimension permutation: 6d56277682/aten/src/ATen/TensorIterator.cpp (L232-L308) .  We can duplicate this logic to the meta kernel implementation if we really want meta matches eager. I can follow up on this if people have strong opinion to do this.

And here is an issue https://github.com/pytorch/pytorch/issues/144717 for asserting size/strides for fallback kernels. With that, the issue debugged here would be much easier to root cause.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144736
Approved by: https://github.com/jansel
2025-01-16 09:38:29 +00:00
b410378d93 Register nonzero for meta device for FBLSim (#144727)
Summary:
Fix `nonzero is not registered to meta` issue:
```
"NotImplementedError: aten::nonzero: attempted to run this operator with Meta tensors, but there was no fake impl or Meta kernel registered".
```

Reviewed By: ezyang

Differential Revision: D66525640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144727
Approved by: https://github.com/ezyang
2025-01-15 19:40:42 +00:00
d8c8ba2440 Fix unused Python variables in test/[e-z]* (#136964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-12-18 23:02:30 +00:00
446ea2aea5 pow: fix meta function output argument dtype check. (#140287)
Tracking issue: #138399

This PR changes the `pow` C++ implementation, making its C++ meta kernel consistent with
its Python ref implementation. The following example shows the inconsistency between the
two:

```python
def run(device):
    S = (5,)
    a = torch.rand(S, device=device, dtype=torch.float32)
    b = 2
    out = torch.empty(S, device=device, dtype=torch.float64)
    return torch.pow(a, b, out=out)

>>> run("cpu")
Traceback (most recent call last):
  File "test.py", line 34, in run
    return torch.pow(a, b, out=out)
RuntimeError: Found dtype Double but expected Float

>>> run("meta")
tensor(..., device='meta', size=(5,), dtype=torch.float64)
```

**~Update:~**

~Note that this happens only for `pow.Tensor_Scalar` overloads. Therefore, this PR needed
further 2 modifications:~

- ~Split the `pow` ref implementation, making `pow.Tensor_Scalar` error on mismatching
output dtypes~
- ~Create a dispatch for `pow` when `_refs.pow()` is called~

**Update:**

Changing the `TensorIteratorConfig` for `pow.Tensor_Scalar` was easier and,
after the discussion below, more correct. The solution was to change the
`TensorIteratorBase::build_output_borrowing_argument_owning_unary_op` function,
setting:

- `cast_common_dtype_to_outputs`; and
- `enforce_safe_casting_to_output`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140287
Approved by: https://github.com/ezyang
2024-11-20 13:28:47 +00:00
c182c7ccfc Fix triangular_solve meta function out parameter names. (#140186)
This PR replaces the parameter names specified in the `triangular_solve_meta`
function (specifically in its `@out_wrapper(...)` decorator) by those written in the
_native_functions.yaml_ file.

This name mismatch caused the operation to fail when using the meta device (see error
below):

```python
Traceback (most recent call last):
  File "examples/test.py", line 23, in <module>
    torch.triangular_solve(b.to("meta"), A.to("meta"), out=meta_out)
  File "torch/_decomp/__init__.py", line 100, in _fn
    return f(*args, **kwargs, out=None if is_none else out_kwargs)
  File "torch/_prims_common/wrappers.py", line 289, in _fn
    result = fn(*args, **kwargs)
TypeError: triangular_solve_meta() got an unexpected keyword argument 'X'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140186
Approved by: https://github.com/ezyang
2024-11-12 19:04:34 +00:00
cyy
7deec3942f [6/N] Don't skip ASAN on some tests (#139565)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139565
Approved by: https://github.com/ezyang
2024-11-04 21:32:44 +00:00
92fdea8a39 remove skips due to https://github.com/pytorch/torchdynamo/issues/1991 (#138133)
Closes https://github.com/pytorch/pytorch/issues/93479. A bunch of other dynamo-wrapped tests also exhibit "torch.* returned non-Tensor output unimplemented" making the issue seem less relevant to me. Some tests are marked as xfail as they fail for other reasons.

If these tests are indeed important, we should create a new issue to track them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138133
Approved by: https://github.com/ezyang
2024-10-17 17:42:46 +00:00
53af729a66 add meta for _segment_reduce_backward (#137442)
reland of https://github.com/pytorch/pytorch/pull/124988

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137442
Approved by: https://github.com/albanD
2024-10-08 18:40:06 +00:00
bd56bcf0ab [TEST] Fix _scaled_mm tests (#130897)
This PR resolves several sets of `_scaled_mm` test failures:
- `scale_a` and `scale_b` are now required arguments, so the function `sample_inputs_scaled_mm` must supply them
- `_scaled_mm` does not support `"meta"` device, so it should be skipped in `test_meta.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130897
Approved by: https://github.com/drisspg
2024-07-18 02:15:00 +00:00
a7f54c7f8a [dynamo] add meta fn for aten.kthvalue.default (#130562)
I saw
```
torch._dynamo.exc.Unsupported: unsupported operator: aten.kthvalue.default
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130562
Approved by: https://github.com/jingsh, https://github.com/zou3519
2024-07-12 23:48:31 +00:00
49ad90349d Correct error message for aten::_local_scalar_dense on meta tensor (#124554)
registering a meta for aten::_local_scalar_dense with a different error message.

Fixes pytorch#119588

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124554
Approved by: https://github.com/ezyang
2024-05-30 00:50:29 +00:00
aaa2f93a4f Add meta for _embedding_bag_dense_backward and _embedding_bag_per_sample_weights_backward (#125785)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125785
Approved by: https://github.com/albanD
2024-05-09 04:28:16 +00:00
c511aed27f [Meta Tensor] fix meta inplace set storage (#123880)
Fixes #123879

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123880
Approved by: https://github.com/ezyang
2024-05-01 06:53:49 +00:00
ae13c7e593 Revert "[Meta Tensor] fix meta inplace set storage (#123880)"
This reverts commit cccae9355191a807040fb40a65178c4d7fe3f084.

Reverted https://github.com/pytorch/pytorch/pull/123880 on behalf of https://github.com/izaitsevfb due to breaks cpu_inductor_torchbench (detectron2_fasterrcnn) ([comment](https://github.com/pytorch/pytorch/pull/123880#issuecomment-2083366385))
2024-04-29 18:19:42 +00:00
cccae93551 [Meta Tensor] fix meta inplace set storage (#123880)
Fixes #123879

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123880
Approved by: https://github.com/ezyang
2024-04-28 17:01:12 +00:00
0c21161488 Add meta function for torch.histc (#124548)
Registers a meta function for the `aten.histc.default` and `aten.histc.out` ops to support `torch.compile(dynamic=True)`. Fixes #124512.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124548
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-04-23 00:24:59 +00:00
8aad72b0d3 Support all unsigned int sizes on unique (#123643)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123643
Approved by: https://github.com/albanD, https://github.com/kit1980
2024-04-11 06:50:12 +00:00
5891c5b3a6 Factor meta conversion through serializable MetaTensorDesc (#122044)
Fixes https://github.com/pytorch/pytorch/issues/121085

This PR pretty involved so pay attention to this description.  At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.

However, this transformation is not always entirely mechanical.  Here
is what you need to pay attention to:

- The memo table from real Tensor -> meta/fake Tensor is now broken
  into two memo tables: real Tensor -> stable int id -> meta/fake
  Tensor.  The stable int id is needed so that when we do serialization,
  we know when tensors/storages alias each other and can ensure we preserve
  this aliasing upon deserialization.

  The way I have implemented changes the weak reference behavior.
  Previously, when either the real Tensor OR the meta/fake Tensor went
  dead, we would remove the entry from the memo table.  Now, this only
  removes entries from one of the two memo tables.  This semantically
  makes sense, because the user may have held on to the stable int id
  out of band, and may expect a real Tensor to continue to be numbered
  consistently / expect to be able to lookup a meta/fake tensor from
  this id.  If this is unacceptable, it may be possible to rejigger
  the memo tables so that we have real Tensor -> stable int id
  and real Tensor -> meta/fake Tensor, but TBH I find the new
  implementation a lot simpler, and arranging the memo tables in this
  way means that I have to muck around with the real tensor to save
  to the memo table; in the current implementation, I never pass the
  Tensor to meta_tensor function AT ALL, which means it is impossible
  to accidentally depend on it.

- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
  to be careful not to poke fields when they are not valid.  Previously,
  preconditions were implicitly checked via the conditional structure
  ("is this sparse? is this nested?") that is tested before we start
  reading attributes.  This structure has to be replicated in
  describe_tensor, and I have almost assuredly gotten it wrong on my
  first try (I'll be grinding through it on CI; a careful audit will
  help too, by auditing that I've tested all the same conditionals that
  the original access was guarded by.)

- I originally submitted https://github.com/pytorch/pytorch/pull/121821
  for the symbolic shapes change, but it turned out the way I did it
  there didn't actually work so well for this PR.  I ended up just
  inlining the symbolic shapes allocation logic into MetaConverter
  (look for calls to maybe_specialize_sym_int_with_hint), maybe there
  is a better way to structure it, but what I really want is to
  just read sizes/strides/offset directly off of MetaTensorDesc; I
  don't want another intermediate data structure.

- Some fields aren't serializable. These are documented as "NOT
  serializable".  ctx/type should morally be serializable and I just
  need to setup a contract with subclasses to let them be serialized.
  The fake_mode is used solely to test if we are refakefying with
  a pre-existing ShapeEnv and we want to reuse the SymInt
  directly--serializing this case is hopeless but I am kind of hoping
  after this refactor we do not need this at all.  view_func is not
  serializable because it's a bound C implemented method.  Joel has
  promised me that this is not too difficult to actually expose as a
  true data structure, but this is the edgiest of edge cases and there
  is no reason to deal with it right now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
2024-03-25 06:21:17 +00:00
f65373e278 Revert "Factor meta conversion through serializable MetaTensorDesc (#122044)"
This reverts commit e2d89e970480d7e5b10a77928442d8caf94e0e84.

Reverted https://github.com/pytorch/pytorch/pull/122044 on behalf of https://github.com/jeanschmidt due to Seems that some landrace caused this PR to break lint ([comment](https://github.com/pytorch/pytorch/pull/122044#issuecomment-2015025490))
2024-03-22 12:46:21 +00:00
e2d89e9704 Factor meta conversion through serializable MetaTensorDesc (#122044)
Fixes https://github.com/pytorch/pytorch/issues/121085

This PR pretty involved so pay attention to this description.  At a high
level, the refactor is intended to be mechanical: anywhere in
MetaConverter where previously we took a Tensor as argument, we now take
a MetaTensorDesc, which contains all of the information that we would
have queried off of the Tensor, but placed into a separate data
structure which we can serialize or use to recreate a fake tensor in
a separate fake tensor mode in exact fidelity to the original.

However, this transformation is not always entirely mechanical.  Here
is what you need to pay attention to:

- The memo table from real Tensor -> meta/fake Tensor is now broken
  into two memo tables: real Tensor -> stable int id -> meta/fake
  Tensor.  The stable int id is needed so that when we do serialization,
  we know when tensors/storages alias each other and can ensure we preserve
  this aliasing upon deserialization.

  The way I have implemented changes the weak reference behavior.
  Previously, when either the real Tensor OR the meta/fake Tensor went
  dead, we would remove the entry from the memo table.  Now, this only
  removes entries from one of the two memo tables.  This semantically
  makes sense, because the user may have held on to the stable int id
  out of band, and may expect a real Tensor to continue to be numbered
  consistently / expect to be able to lookup a meta/fake tensor from
  this id.  If this is unacceptable, it may be possible to rejigger
  the memo tables so that we have real Tensor -> stable int id
  and real Tensor -> meta/fake Tensor, but TBH I find the new
  implementation a lot simpler, and arranging the memo tables in this
  way means that I have to muck around with the real tensor to save
  to the memo table; in the current implementation, I never pass the
  Tensor to meta_tensor function AT ALL, which means it is impossible
  to accidentally depend on it.

- When I fill in the fields of MetaTensorDesc in describe_tensor, I need
  to be careful not to poke fields when they are not valid.  Previously,
  preconditions were implicitly checked via the conditional structure
  ("is this sparse? is this nested?") that is tested before we start
  reading attributes.  This structure has to be replicated in
  describe_tensor, and I have almost assuredly gotten it wrong on my
  first try (I'll be grinding through it on CI; a careful audit will
  help too, by auditing that I've tested all the same conditionals that
  the original access was guarded by.)

- I originally submitted https://github.com/pytorch/pytorch/pull/121821
  for the symbolic shapes change, but it turned out the way I did it
  there didn't actually work so well for this PR.  I ended up just
  inlining the symbolic shapes allocation logic into MetaConverter
  (look for calls to maybe_specialize_sym_int_with_hint), maybe there
  is a better way to structure it, but what I really want is to
  just read sizes/strides/offset directly off of MetaTensorDesc; I
  don't want another intermediate data structure.

- Some fields aren't serializable. These are documented as "NOT
  serializable".  ctx/type should morally be serializable and I just
  need to setup a contract with subclasses to let them be serialized.
  The fake_mode is used solely to test if we are refakefying with
  a pre-existing ShapeEnv and we want to reuse the SymInt
  directly--serializing this case is hopeless but I am kind of hoping
  after this refactor we do not need this at all.  view_func is not
  serializable because it's a bound C implemented method.  Joel has
  promised me that this is not too difficult to actually expose as a
  true data structure, but this is the edgiest of edge cases and there
  is no reason to deal with it right now.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044
Approved by: https://github.com/eellison
ghstack dependencies: #122018
2024-03-22 03:56:34 +00:00
773ae817f7 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279)
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-18 21:01:30 +00:00
74c09a757b Simplify Storage meta conversion with PyObject preservation (#122018)
Thanks to https://github.com/pytorch/pytorch/pull/109039 we can rely on
finalizers on Storage PyObject to handle removal from dict.

Irritatingly, we still have to attach finalizer, because we don't have
a weak key AND value dict (only one or the other).

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122018
Approved by: https://github.com/eellison, https://github.com/kurtamohler
2024-03-18 18:55:58 +00:00
fd0dbcd891 Revert "Batch Norm Consolidation (#116092)"
This reverts commit 7b4f70eda519ccd7f28de17689edd43c52743bc9.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))
2024-03-11 22:22:41 +00:00
7b4f70eda5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-08 15:07:15 +00:00
b529c19bdf Revert "Batch Norm Consolidation (#116092)"
This reverts commit 5680f565d5b7d4aa412a3988d3d91ca4c5679303.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))
2024-03-06 17:10:01 +00:00
5680f565d5 Batch Norm Consolidation (#116092)
**Summary:**

This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:

```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
  aten.native_batch_norm ->
  aten._native_batch_norm_legit (export only) ->
  _batch_norm_legit_cpu/cuda (kernels, export only) ->
  _batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```

Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.

Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:

```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```

The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:

```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```

Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.

Test Plan: `OpInfo` tests for `batch_norm_with_update`.

Reviewers: albanD, bdhirsh

Subscribers: albanD, bdhirsh, supriyar

Tasks: https://github.com/pytorch/pytorch/issues/111384

Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-03-06 04:50:46 +00:00
eae9751e82 Fix linalg_eigvals invalid use of composite dispatch key (#121142)
`linalg_eigvals_out` calls into a dispatch stub, so only supports CPU and CUDA
strided tensors but incorrectly claimed to be a composite op. `linalg_eigvals`
also shouldn't defer to the out variant inside a `CompositeImplicitAutograd` op
as not all types support out variants. Instead, I add a new helper
`_linalg_eigvals` which does the same thing in a non-composite operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121142
Approved by: https://github.com/lezcano
2024-03-05 21:13:27 +00:00
ce2903080c Add sparse compressed fake tensor support (#120920)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120920
Approved by: https://github.com/ezyang
2024-03-04 14:38:45 +00:00
b7df3bba62 add decomposition for frexp (#119217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119217
Approved by: https://github.com/peterbell10
ghstack dependencies: #119284, #120027
2024-02-23 21:52:42 +00:00
4319735ace Add meta registration for _foreach_norm (2nd try) (#119927)
The first try reused TensorListMetadata, which caused illegal memory access issues when there were too many tensors in the list. We just launch multiple kernels with a simpler version of the struct (to minimize kernels launched).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119927
Approved by: https://github.com/albanD
2024-02-16 00:23:23 +00:00
bd9db6a9c7 Update to TorchFix 0.4.0 (#119424)
`torch.library.Library` updated to `torch.library._scoped_library` in files with many tests where it seems obvious to do, otherwise `noqa: TOR901` added - see https://github.com/pytorch/pytorch/pull/118318 for more context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119424
Approved by: https://github.com/zou3519
2024-02-12 23:30:12 +00:00
dea15c9fdc Revert "Add meta registration for _foreach_norm (#118604)"
This reverts commit b8bb12cd454b716da6a98db826fcc45fd7c0db05.

Reverted https://github.com/pytorch/pytorch/pull/118604 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/118604#issuecomment-1930849491))
2024-02-06 22:20:44 +00:00
b8bb12cd45 Add meta registration for _foreach_norm (#118604)
This PR also fixes the discrepancy between _foreach_norm fast path and slow path, where storage_offsets will be different between the lists of tensors. Here are some profile results showing that we aren't significantly slower. Do note that we're replacing N `as_strided`/`select` calls to N `empty` calls.

For script:
```
import torch

ts = [torch.rand(32, 16, device="cuda") for _ in range(128)]

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    res = torch._foreach_norm(ts)
print(p.key_averages().table(sort_by="cpu_time_total"))
```

OG baseline:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7cf98987)]$ python playground2.py
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                    aten::_foreach_norm        25.36%       4.209ms        99.94%      16.586ms      16.586ms       8.000us        88.89%       9.000us       9.000us             1
                                       cudaLaunchKernel        61.21%      10.159ms        61.21%      10.159ms       2.540ms       0.000us         0.00%       0.000us       0.000us             4
                                            aten::zeros         0.43%      71.000us        58.35%       9.683ms       9.683ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::zero_         0.33%      55.000us        57.35%       9.517ms       9.517ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::fill_         0.42%      69.000us        57.01%       9.462ms       9.462ms       1.000us        11.11%       1.000us       1.000us             1
                                           aten::select         8.04%       1.335ms        11.29%       1.873ms      14.633us       0.000us         0.00%       0.000us       0.000us           128
                                       aten::as_strided         3.24%     538.000us         3.24%     538.000us       4.203us       0.000us         0.00%       0.000us       0.000us           128
                                            aten::empty         0.90%     150.000us         0.90%     150.000us      75.000us       0.000us         0.00%       0.000us       0.000us             2
                                  cudaDeviceSynchronize         0.06%      10.000us         0.06%      10.000us      10.000us       0.000us         0.00%       0.000us       0.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        11.11%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us        66.67%       6.000us       3.000us             2
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        22.22%       2.000us       2.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 16.596ms
Self CUDA time total: 9.000us
```

And here's after this PR:
```
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                    aten::_foreach_norm        30.95%       4.653ms        99.95%      15.026ms      15.026ms       9.000us        90.00%      10.000us      10.000us             1
                                       cudaLaunchKernel        52.41%       7.879ms        52.41%       7.879ms       1.970ms       0.000us         0.00%       0.000us       0.000us             4
                                            aten::zeros         0.39%      58.000us        48.29%       7.260ms       7.260ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::zero_         0.35%      53.000us        47.25%       7.103ms       7.103ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::fill_         0.43%      65.000us        46.90%       7.050ms       7.050ms       1.000us        10.00%       1.000us       1.000us             1
                                            aten::empty        15.42%       2.318ms        15.42%       2.318ms      17.969us       0.000us         0.00%       0.000us       0.000us           129
                                  cudaDeviceSynchronize         0.05%       7.000us         0.05%       7.000us       7.000us       0.000us         0.00%       0.000us       0.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        10.00%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us        60.00%       6.000us       3.000us             2
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us        30.00%       3.000us       3.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 15.033ms
Self CUDA time total: 10.000us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118604
Approved by: https://github.com/albanD
2024-02-05 22:01:01 +00:00
f6767244cf Added meta function for _upsample_bicubic2d_aa (#117347)
This should fix remaining errors with Resize op in torchvision: https://github.com/pytorch/vision/actions/runs/7298953575?pr=8127
```
/opt/conda/envs/ci/lib/python3.8/site-packages/torch/nn/functional.py:4072: in interpolate
    return torch._C._nn._upsample_bicubic2d_aa(input, output_size, align_corners, scale_factors)
E   torch._dynamo.exc.TorchRuntimeError: Failed running call_function <function interpolate at 0x7f4443fe00d0>(*(FakeTensor(..., size=(1, s0, s1, s2)),), **{'size': [s4, floor(s3*s4/floor(s1*s3/s2))], 'mode': 'bicubic', 'align_corners': False, 'antialias': True}):
E   aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:5567: SymIntArrayRef expected to contain only concrete integers
E
E   from user code:
E      File "/pytorch/vision/torchvision/transforms/v2/functional/_geometry.py", line 260, in resize_image
E       image = interpolate(
E
E   Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
E
E
E   You can suppress this exception and fall back to eager by setting:
E       import torch._dynamo
E       torch._dynamo.config.suppress_errors = True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117347
Approved by: https://github.com/peterbell10
2024-01-16 23:33:55 +00:00
f1cdb39da3 [dynamo] Fix handling of one_hot (#116338)
Fixes #115817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116338
Approved by: https://github.com/yanboliang
2023-12-24 04:55:35 +00:00
c173a9d9b3 add Half support for layer_norm on CPU (#99590)
### Testing
Single socket (icx, 32cores):
| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.051 | 0.051 | 0.050 |
| (8 ,8, 16) | 0.013 | 0.013 | 0.013 | 0.054 | 0.053 | 0.051 |
| (32, 8, 16) | 0.015 | 0.014 | 0.014 | 0.059 | 0.054 | 0.052 |
| (64, 128, 56, 56) | 1.875 | 0.790 | 1.016 | 12.845 | 7.151 | 6.985 |
| (64, 128, 256, 256) | 50.226 | 25.462 | 35.736 | 328.957 | 179.615 | 175.618 |

Single core (icx):

| shape | fp32 forward (ms) | fp16 forward (ms) | mixed fp32 fp16 forward (ms) | fp32 backward (ms) | fp16 backward (ms) | mixed fp32 fp16 backward (ms) |
| -- | -- | -- | -- | -- | -- | -- |
| (1, 8, 16) | 0.012 | 0.011 | 0.011 | 0.040 | 0.041 | 0.041 |
| (8 ,8, 16) | 0.012 | 0.012 | 0.012 | 0.042 | 0.042 | 0.042 |
| (32, 8, 16) | 0.027 | 0.014 | 0.014 | 0.048 | 0.048 | 0.046 |
| (64, 128, 56, 56) | 58.054 | 11.034 | 17.928 | 108.603 | 48.816 | 50.244 |
| (64, 128, 256, 256) | 1327.758 | 352.394 | 496.994 | 2846.182 | 1224.247 | 1218.422 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99590
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-20 01:11:15 +00:00
3477a2ee03 unMarkDynamoStrictTest on OpInfo-based tests (#115856)
These take too long to run under strict mode. We'll worry about them
later. Note that these decorators don't do anything yet (unless we flip
the default from non-strict to strict).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115856
Approved by: https://github.com/voznesenskym
ghstack dependencies: #115845, #115855
2023-12-15 01:22:31 +00:00
ba4285bd9e Deprecate primTorch module, replace it with decompositions in module Owners (#114754)
Context: pt2 oncall is revamping its labeling system. One of the guidelines is to remove duplicate labeling in our system. Both primTorch and decomposition labels are referring to the same thing. primTorch was the legacy name (and we no longer have a primTorch project), so using decomposition as the label name makes more sense.

Right now, the only open issues that use "module: primTorch" are the ones generated by the DISABLED bots. Once we replace the label in the bot, we can safely remove the primTorch label.

Here an example of the issue that has primTorch label :
https://github.com/pytorch/pytorch/issues/112719

Torchbot uses following logic to auto extract module owners:
https://github.com/pytorch/test-infra/blob/main/torchci/pages/api/flaky-tests/disable.ts#L391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114754
Approved by: https://github.com/huydhn
2023-11-29 18:27:20 +00:00
ef982418df Add OpInfo test that tests meta functions binary ufuncs with different dtypes (#113674)
As per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113674
Approved by: https://github.com/peterbell10
ghstack dependencies: #113634
2023-11-16 19:09:12 +00:00
3b915f9de0 [pt2] enable meta tests for foreach ops (#113484)
Try https://github.com/pytorch/pytorch/pull/113059 again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113484
Approved by: https://github.com/lezcano
2023-11-11 02:43:41 +00:00
a310cc8968 Add Half support for kthvalue, cross, hist, and logit on CPU (#112135)
Add Half support for kthvalue, cross, hist, and logit on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112135
Approved by: https://github.com/cpuhrsch
2023-10-31 09:12:47 +00:00
bbd5b935e4 Use pytree.tree_leaves everywhere (#112324)
This changes all the instances I could find of `tree_flatten(...)[0]` or
`x, _ = tree_flatten` to use `tree_leaves`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112324
Approved by: https://github.com/lezcano
ghstack dependencies: #112327, #112323
2023-10-30 03:39:04 +00:00
a380bf3297 [dynamo, test] skip flaky dynamo-wrapped tests (#112310)
ghstack-source-id: 7a87e33e7513e7924e4513b6473284562989ed4c
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112309

Skip flaky tests reported by
- https://github.com/pytorch/pytorch/issues/111825
- https://github.com/pytorch/pytorch/issues/111826
- https://github.com/pytorch/pytorch/issues/111909
- https://github.com/pytorch/pytorch/issues/112142
- https://github.com/pytorch/pytorch/issues/112220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112310
Approved by: https://github.com/xmfan
2023-10-28 04:14:57 +00:00
973c87b320 raise instead of skip in test/test_meta.py (#110939)
Supersedes #109004.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110939
Approved by: https://github.com/lezcano, https://github.com/kurtamohler
2023-10-17 10:17:43 +00:00
0d368f586a fix wrong meta for index_select.out (#111364)
fixes https://github.com/pytorch/pytorch/issues/110699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111364
Approved by: https://github.com/ezyang
ghstack dependencies: #111040
2023-10-16 15:18:20 +00:00
306b2284f2 Add meta kernel for ctc_loss.intList (#107949)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107949
Approved by: https://github.com/zou3519
2023-10-09 16:35:14 +00:00