613 Commits

Author SHA1 Message Date
e3c5d1b7d7 Revert "[optim] Fix: wrong ASGD implementation (#125440)"
This reverts commit 2c5ad9a3d7ea79ca897aec153a401f4b9175a717.

Reverted https://github.com/pytorch/pytorch/pull/125440 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it looks like there is a linter failure coming from this change ([comment](https://github.com/pytorch/pytorch/pull/125440#issuecomment-2113833108))
2024-05-16 02:12:29 +00:00
2c5ad9a3d7 [optim] Fix: wrong ASGD implementation (#125440)
> previous: Originally, the variables `new_eta` and `new_mu` would be constructed `len(grouped_mus)` times, but each of their values is the same and won't be changed. Therefore, it can be simplified using Python list multiplication, which only constructs one tensor.

- [X] Incorrect assumption that every param will have the same step.
- [x] Different implementations for `foreach=True` and `foreach=False`.
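
The aliasing behind the first point can be shown without PyTorch. This is a minimal sketch, not the optimizer code: `Box` is a hypothetical stand-in for a tensor. List multiplication repeats references to one object, so an in-place update through any entry is visible through all of them, while a comprehension builds one object per parameter.

```python
class Box:
    """Hypothetical stand-in for a tensor holding one mutable value."""

    def __init__(self, value):
        self.value = value


n_params = 3

# List multiplication builds ONE object and repeats the reference to it.
shared = [Box(1.0)] * n_params
# A comprehension builds a separate object per parameter.
separate = [Box(1.0) for _ in range(n_params)]

# In-place update through the first entry of each list.
shared[0].value = 5.0
separate[0].value = 5.0

shared_values = [b.value for b in shared]      # every entry sees the update
separate_values = [b.value for b in separate]  # only the first entry changes
print(shared_values, separate_values)
```

Sharing one object is only safe while every parameter really has the same value, which is the assumption the first checkbox calls out.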
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125440
Approved by: https://github.com/janeyx99
2024-05-15 22:52:15 +00:00
aaa2f93a4f Add meta for _embedding_bag_dense_backward and _embedding_bag_per_sample_weights_backward (#125785)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125785
Approved by: https://github.com/albanD
2024-05-09 04:28:16 +00:00
939b701d3a SymInt-ify mem-efficient attention forward op signature (#125418)
Need this for dynamic shapes! Before this PR, guards on constant min / max seq len values are introduced when SDPA calls mem-efficient attention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125418
Approved by: https://github.com/soulitzer
2024-05-07 23:59:28 +00:00
ca98c2a932 inductor: Add Conv3d support (#124361)
This PR adds Conv3d support to Inductor, basically by reusing and extending the Conv2d logic and unit tests for Conv3d.

Conv3d inductor support will improve the performance of C2D_R50, I3D_R50, I3D_R101, Slow and SlowFast-R50 from OOB models.

  | C2D_R50 | I3D_R50 | I3D_R101 | Slow | SlowFast-R50
-- | -- | -- | -- | -- | --
eager | 15.805 | 13.909 | 11.639 | 12.101 | 6.606
Compile w/o conv3d | 17.244 | 14.893 | 12.109 | 13.015 | 6.603
Compile w/ conv3d | 21.212 | 17.707 | 14.974 | 16.130 | 8.537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124361
Approved by: https://github.com/leslie-fang-intel, https://github.com/CaoE, https://github.com/jgong5, https://github.com/jansel
2024-05-03 10:24:14 +00:00
a8574a9719 Fix global flake8 issues (#124771)
Prior to this `lintrunner --all-files --take FLAKE8` failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771
Approved by: https://github.com/Skylion007
ghstack dependencies: #124428
2024-04-26 15:35:53 +00:00
1ac60484c1 Revert "Fix global flake8 issues (#124771)"
This reverts commit f01275934bfa1ff358b1c01d3754f2807cd04ee2.

Reverted https://github.com/pytorch/pytorch/pull/124771 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))
2024-04-26 06:15:17 +00:00
f01275934b Fix global flake8 issues (#124771)
Prior to this `lintrunner --all-files --take FLAKE8` failed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771
Approved by: https://github.com/Skylion007
ghstack dependencies: #124428
2024-04-25 14:25:00 +00:00
0c21161488 Add meta function for torch.histc (#124548)
Registers a meta function for the `aten.histc.default` and `aten.histc.out` ops to support `torch.compile(dynamic=True)`. Fixes #124512.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124548
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-04-23 00:24:59 +00:00
00372b1211 Extend int[48]mm ops to float32 input (#124287)
Just for completeness

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124287
Approved by: https://github.com/mikekgfb
2024-04-17 23:10:49 +00:00
298eb69c91 [EZ] Make weight_int4pack_mm compilable for half input dtype (#124136)
To enable efficient int4 quantization on ARM

Followup after https://github.com/pytorch/pytorch/pull/124022
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124136
Approved by: https://github.com/mikekgfb
2024-04-16 08:10:59 +00:00
a096e99a5d Enable int8mm kernel for float16 (#124022)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124022
Approved by: https://github.com/mikekgfb
2024-04-14 19:48:43 +00:00
f5331aade5 Simplify ATen sparse semi-structured operators based on CUTLASS (#123473)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123473
Approved by: https://github.com/cpuhrsch
2024-04-14 06:57:41 +00:00
97261be0a8 Revert "Simplify ATen sparse semi-structured operators based on CUTLASS (#123473)"
This reverts commit b2a0b8c446234f0b35a66aff87501c4596ea5d51.

Reverted https://github.com/pytorch/pytorch/pull/123473 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/123473#issuecomment-2053561077))
2024-04-13 07:47:32 +00:00
b2a0b8c446 Simplify ATen sparse semi-structured operators based on CUTLASS (#123473)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123473
Approved by: https://github.com/cpuhrsch
2024-04-11 11:56:27 +00:00
02b29e7d07 Add meta function for channel_shuffle operation (#123033)
This commit introduces a meta function for the `channel_shuffle` operation, enabling PyTorch to perform shape inference and optimizations related to this operation without actual computation. The meta function assumes input shape (*, C, H, W) and validates that the number of channels (C) is divisible by the specified number of groups.
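
As a rough illustration of what such a shape-only function does, here is a sketch in plain Python. The function name and error messages are hypothetical and this is not the registered meta function; it only mirrors the two properties stated above: the input is laid out as (*, C, H, W), and C must be divisible by the number of groups. The assumption that the output shape equals the input shape follows from channel shuffling only reordering channels.

```python
def channel_shuffle_meta_shape(input_shape, groups):
    """Hypothetical shape-only check: returns the output shape without touching data."""
    if len(input_shape) < 3:
        raise ValueError(f"expected shape (*, C, H, W), got {tuple(input_shape)}")
    if groups <= 0:
        raise ValueError(f"groups must be positive, got {groups}")
    channels = input_shape[-3]  # C in (*, C, H, W)
    if channels % groups != 0:
        raise ValueError(
            f"number of channels ({channels}) must be divisible by groups ({groups})"
        )
    # Shuffling only reorders channels, so the shape is unchanged.
    return tuple(input_shape)


out_shape = channel_shuffle_meta_shape((2, 6, 4, 4), groups=3)
print(out_shape)
```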

Fixes #122771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123033
Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki
2024-04-11 10:07:18 +00:00
adcfc2b582 Add meta reg for addcdiv/addcmul ScalarList (#123486)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123486
Approved by: https://github.com/awgu
2024-04-09 22:05:58 +00:00
493478db4a [effects] Add inductor support for tokens (#122347)
Given the following code/dynamo graph:
```
class GraphModule(torch.nn.Module):
    def forward(self, L_x_ : torch.Tensor):
        l_x_ = L_x_
        _print = torch.ops.aten._print('moo')
        res = l_x_ + l_x_;  l_x_ = None
        _print_1 = torch.ops.aten._print('moo')
        return (res,)
```

AOTAutograd will trace the following program, threading tokens from the inputs, through the effectful operator calls (torch.ops.aten._print), and as an output:
```
class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2, 3]"):
        with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops.aten._print.default, 'moo');  arg0_1 = None
        getitem: "f32[0]" = with_effects[0];  with_effects = None
        add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1);  arg1_1 = None
        with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo');  getitem = None
        getitem_2: "f32[0]" = with_effects_1[0];  with_effects_1 = None
        return (getitem_2, add)
```
However, the Inductor-generated code should not have any token inputs or outputs, for better readability. So when we get to Inductor, we modify the ATen graph by removing the tokens from the inputs, creating them through `torch.ops.aten._make_dep_token`, and sinking them through the `torch.ops.aten._sink_tokens` operators.
This has to be done *after* the partitioner, otherwise the partitioner will add the make_token/sink_token operators to the backwards graph.
```
class <lambda>(torch.nn.Module):
   def forward(self, arg1_1: "f32[2, 3]"):
       _make_dep_token_default: "f32[0]" = torch.ops.aten._make_dep_token.default()
       with_effects = torch._higher_order_ops.effects.with_effects(_make_dep_token_default, torch.ops.aten._print.default, 'moo');  _make_dep_token_default = None
       getitem: "f32[0]" = with_effects[0];  with_effects = None
       add: "f32[2, 3]" = torch.ops.aten.add.Tensor(arg1_1, arg1_1);  arg1_1 = None
       with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops.aten._print.default, 'moo');  getitem = None
       getitem_2: "f32[0]" = with_effects_1[0];  with_effects_1 = None
       _sink_tokens_default = torch.ops.aten._sink_tokens.default((getitem_2,));  getitem_2 = None
       return (add,)
```
When doing Inductor lowering, we convert `with_effects` calls to an `EffectfulKernel`, which is just a `FallbackKernel` with a pointer to the previous effectful operator's call. During scheduling, we create a `StarDep` between each EffectfulKernel and the previous EffectfulKernel so that they don't get reordered. The Inductor-generated Python code looks like:
```
def call(args):
    arg1_1, = args
    args.clear()
    assert_size_stride(arg1_1, (2, 3), (3, 1))
    # Source Nodes: [_print], Original ATen: []
    buf2 = aten._print.default('moo')
    # Source Nodes: [_print_1], Original ATen: []
    buf3 = aten._print.default('moo')
    buf4 = empty_strided_cpu((2, 3), (3, 1), torch.float32)
    cpp_fused_add_0(arg1_1, buf4)
    del arg1_1
    return (buf4, )
```
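
The ordering idea can be sketched outside Inductor. The toy scheduler below is an assumption-laden illustration, not Inductor's scheduler: the node tuples and the `schedule` helper are hypothetical, and the ordering-only edge merely plays the role the message assigns to `StarDep`. Each effectful node gets an extra dependency on the previous effectful node, so any topological order keeps them in program order even though they share no data.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical node list: (name, is_effectful, data dependencies).
nodes = [
    ("print_0", True, []),
    ("add", False, []),
    ("print_1", True, []),
]


def schedule(nodes):
    """Return an execution order that preserves the order of effectful nodes."""
    graph = {}
    prev_effectful = None
    for name, is_effectful, deps in nodes:
        preds = set(deps)
        if is_effectful:
            if prev_effectful is not None:
                # Ordering-only edge to the previous effectful node.
                preds.add(prev_effectful)
            prev_effectful = name
        graph[name] = preds
    # TopologicalSorter expects a mapping from node to its predecessors.
    return list(TopologicalSorter(graph).static_order())


order = schedule(nodes)
print(order)
```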

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122347
Approved by: https://github.com/bdhirsh
2024-04-09 03:22:32 +00:00
deeeaded1f Add metas for randint/rand factory functions out overload (#122375)
Fixes https://github.com/pytorch/pytorch/issues/121897

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122375
Approved by: https://github.com/lezcano
2024-03-25 04:01:38 +00:00
bde22835c6 [PT2] - Guard oblivious on meta registrations (#122216)
Summary:
```
[trainer0|0]:Potential framework code culprit (scroll up for full backtrace):
[trainer0|0]:  File "/mnt/xarfuse/uid-539346/56d4bb3d-seed-nspid4026531836_cgpid183208940-ns-4026531840/torch/_meta_registrations.py", line 5043, in scatter_gather_dtype_check
[trainer0|0]:    if index.numel() != 0:
```

Test Plan: General CI.

Reviewed By: ezyang

Differential Revision: D54689183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122216
Approved by: https://github.com/ezyang
2024-03-22 01:36:03 +00:00
b3065f6899 add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
2024-03-07 08:41:43 +00:00
eae9751e82 Fix linalg_eigvals invalid use of composite dispatch key (#121142)
`linalg_eigvals_out` calls into a dispatch stub, so only supports CPU and CUDA
strided tensors but incorrectly claimed to be a composite op. `linalg_eigvals`
also shouldn't defer to the out variant inside a `CompositeImplicitAutograd` op
as not all types support out variants. Instead, I add a new helper
`_linalg_eigvals` which does the same thing in a non-composite operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121142
Approved by: https://github.com/lezcano
2024-03-05 21:13:27 +00:00
0c07c0c15f Revert "add int4 packed gemm support on CPU device (#117475)"
This reverts commit 30befa592e0675cc694f87a4f6fb80894709e719.

Reverted https://github.com/pytorch/pytorch/pull/117475 on behalf of https://github.com/izaitsevfb due to fails meta-internal tests ([comment](https://github.com/pytorch/pytorch/pull/117475#issuecomment-1977474686))
2024-03-04 21:20:57 +00:00
a98c17edc7 Revert "add int8 packed gemm support on CPU device (#118056)"
This reverts commit f84375ca5db623a6a53cbce2864d27dfad626228.

Reverted https://github.com/pytorch/pytorch/pull/118056 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/118056#issuecomment-1977368720))
2024-03-04 20:09:40 +00:00
83d848e1c7 [Quant][Inductor] Enable lowering of dynamic qlinear for X86Inductor (#120605)
**description**
Enable lowering of dynamic qlinear for X86Inductor. The pattern is `choose_qparams -> getitem -> q -> dq -> linear`. We only fuse `dq -> linear` and get `choose_qparams -> getitem -> q -> onednn.qlinear_pointwise`. So, we treat it as dynamic quantization of activation + static quantized linear.
The previous implementation of `onednn.qlinear_pointwise` is for the case where `x_scale` and `x_zp` are scalars. Since `choose_qparams` returns tensors, we added a variation `onednn.qlinear_pointwise.tensor` to support the case.
This feature is targeting PyTorch 2.3 release.

**Test plan**
```
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_cpu
python inductor/test_mkldnn_pattern_matcher.py -k test_dynamic_qlinear_qat_cpu
python inductor/test_cpu_cpp_wrapper.py -k test_dynamic_qlinear
```

**Performance before and after lowering `choose_qparam` to Inductor**
Before
- latency for shape (32, 32) = 0.151 ms
- latency for shape (128, 128) = 0.153 ms
- latency for shape (1024, 1024) = 0.247 ms

After
- latency for shape (32, 32) = 0.049 ms
- latency for shape (128, 128) = 0.052 ms
- latency for shape (1024, 1024) = 0.133 ms

Test method: A module with a single Linear layer, dynamic-quantize, lower to X86Inductor
Test env & config: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz, single instance, single core, using Intel OpenMP and Tcmalloc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120605
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
2024-03-02 05:11:17 +00:00
f84375ca5d add int8 packed gemm support on CPU device (#118056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118056
Approved by: https://github.com/mikekgfb
ghstack dependencies: #117475
2024-03-02 04:35:49 +00:00
30befa592e add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU; both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The default perf measured on Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores) is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with method: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-02 00:17:34 +00:00
19fcf6de1a Add lowering for fraction_max_pool2d (#120460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120460
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-03-01 20:13:20 +00:00
9b2c35b4fe [dynamo] Fix convolution meta kernel when input channel is 0 (#120944)
Addresses https://github.com/pytorch/pytorch/issues/118797

Adding in special channel handling logic from eager (set output channels to 0 when input channels are 0):
67d3e4f2a2/aten/src/ATen/native/Convolution.cpp (L1400-L1403)
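
A minimal sketch of the special case, in plain Python rather than the actual meta kernel. The function name is hypothetical and a channels-first layout `(N, C_in, *spatial)` is assumed; the spatial sizes use the standard convolution output formula.

```python
def conv_output_shape(input_shape, weight_shape, stride, padding, dilation):
    """Hypothetical channels-first convolution shape calculation."""
    batch, in_channels = input_shape[0], input_shape[1]
    out_channels = weight_shape[0]
    spatial = []
    for size, k, s, p, d in zip(
        input_shape[2:], weight_shape[2:], stride, padding, dilation
    ):
        # Standard formula: floor((in + 2*pad - dilation*(k - 1) - 1) / stride) + 1
        spatial.append((size + 2 * p - d * (k - 1) - 1) // s + 1)
    if in_channels == 0:
        # Special case described above: zero input channels give zero output channels.
        out_channels = 0
    return (batch, out_channels, *spatial)


regular = conv_output_shape((1, 3, 8, 8), (4, 3, 3, 3), (1, 1), (0, 0), (1, 1))
empty = conv_output_shape((1, 0, 8, 8), (4, 0, 3, 3), (1, 1), (0, 0), (1, 1))
print(regular, empty)
```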

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120944
Approved by: https://github.com/zou3519
2024-03-01 01:18:21 +00:00
da559c98e3 Fix isin decomp and add python meta registration (#120821)
Fixes #119792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120821
Approved by: https://github.com/malfet, https://github.com/peterbell10
2024-02-29 22:08:50 +00:00
d6c202975c Move attention kernels from meta_registrations to fake_impls (#120682)
This PR is mostly just code movement to make the code review easier - AFAIK it should not change any functionality. The final goal is to remove the xfails for some of the test_fake opinfos for these ops. The opinfos are failing because the outputs can have mixed devices - we need to move them to fake_impls first before we can support mixed device returns.

This PR:
* Move the `_meta_registrations.py` implementations to `fake_impls.py`
* Change the function signature from taking explicit named variables to taking `{args, kwargs}` and normalizing them
* Wrap all the returned tensors in FakeTensors
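
The second bullet, normalizing `{args, kwargs}` into named values, can be sketched with the standard library. This is only an illustration of the idea: `normalize_call` and `attention_like` are hypothetical names, and the helper actually used in `fake_impls.py` may work differently.

```python
import inspect


def normalize_call(func, args, kwargs):
    """Map positional and keyword arguments onto func's parameter names."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()  # fill in parameters the caller left out
    return dict(bound.arguments)


def attention_like(query, key, value, dropout_p=0.0, is_causal=False):
    """Hypothetical signature, used only to demonstrate normalization."""
    return None


normalized = normalize_call(
    attention_like, ("q", "k"), {"value": "v", "is_causal": True}
)
print(normalized)
```

After normalization the implementation can look values up by name regardless of how the caller passed them.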

Tests: relying on opinfos. I also checked `test_fake_*` for these tests (by removing x-fails and patching things until they passed) to verify general correctness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120682
Approved by: https://github.com/drisspg
2024-02-28 21:49:13 +00:00
f064dec7e0 Add torch.ops.aten.print (#120295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120295
Approved by: https://github.com/zou3519
2024-02-27 01:34:59 +00:00
b01bd1f7a1 Revert "Add torch.ops.aten.print (#120295)"
This reverts commit 3b944113c837e1111510487f4525aa07039462fe.

Reverted https://github.com/pytorch/pytorch/pull/120295 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54123688 ([comment](https://github.com/pytorch/pytorch/pull/120295#issuecomment-1965618191))
2024-02-27 01:18:48 +00:00
fdae9363b3 [meta registration] efficient_attention_forward fix for NT inputs (#120594)
When cu_seqlens_q is provided, we should use the user-specified max_seqlen_q instead of inferring it as query.size(1):

1c7b0e7cd1/aten/src/ATen/native/transformers/cuda/attention.cu (L989)

This wasn't caught in the opinfos because the value is taken as ceil(max_seqlen / 32) * 32, and the opinfo inputs were small enough that this value was 32 in either case.
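
A short worked example of why the rounding hid the bug. The concrete lengths below are hypothetical, chosen only to show that two different small lengths round up to the same multiple of 32 while larger ones do not.

```python
import math


def round_up_to_multiple(value, multiple=32):
    """ceil(value / multiple) * multiple, as in the expression above."""
    return math.ceil(value / multiple) * multiple


# Hypothetical small lengths: an inferred and a user-specified value that differ.
small_inferred = round_up_to_multiple(5)
small_specified = round_up_to_multiple(17)

# Hypothetical larger lengths: the difference survives the rounding.
large_inferred = round_up_to_multiple(40)
large_specified = round_up_to_multiple(100)

print(small_inferred, small_specified)  # both 32
print(large_inferred, large_specified)  # 64 and 128
```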

Differential Revision: [D54179733](https://our.internmc.facebook.com/intern/diff/D54179733)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120594
Approved by: https://github.com/drisspg
2024-02-27 00:10:37 +00:00
3b944113c8 Add torch.ops.aten.print (#120295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120295
Approved by: https://github.com/zou3519
2024-02-23 17:01:22 +00:00
4319735ace Add meta registration for _foreach_norm (2nd try) (#119927)
The first try reused TensorListMetadata, which caused illegal memory access issues when there were too many tensors in the list. We just launch multiple kernels with a simpler version of the struct (to minimize kernels launched).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119927
Approved by: https://github.com/albanD
2024-02-16 00:23:23 +00:00
31e59766e7 Fix meta registration for _flash_attention_forward() (#119812)
Meta registration wrongly assumes 4D inputs, while the underlying op allows 3D inputs for the `mha_varlen_fwd()` case.
Testing: I added `detach()`es so the NJT test `test_sdpa_compile()` won't fail for a view-related reason. It should pass now with this fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119812
Approved by: https://github.com/drisspg
2024-02-14 02:38:53 +00:00
1c1dc0e4e0 [sparse] Add in out_dtype support (i8i8->bf16, i32) for cusparselt (#119296)
Summary:

Adds in out_dtype support for (i8i8->bf16) and (i8i8->i32) matmul with
cuSPARSELt.

Test Plan:

```
python test/test_sparse_semi_structured.py -k mixed
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119296
Approved by: https://github.com/cpuhrsch, https://github.com/alexsamardzic
2024-02-12 16:02:36 +00:00
2c91e13afc Add lowerings to special functions (#119187)
As in the title.

In addition, the PR introduces infrastructure for lowerings of pointwise functions that have both cpp and triton implementations available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119187
Approved by: https://github.com/peterbell10
2024-02-11 16:35:40 +00:00
dea15c9fdc Revert "Add meta registration for _foreach_norm (#118604)"
This reverts commit b8bb12cd454b716da6a98db826fcc45fd7c0db05.

Reverted https://github.com/pytorch/pytorch/pull/118604 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/118604#issuecomment-1930849491))
2024-02-06 22:20:44 +00:00
73f0fdea5b [fix] accounting for dilation in pool padding assertion (#118897)
Fixes https://github.com/pytorch/pytorch/issues/7541

It is a copy of https://github.com/pytorch/pytorch/pull/111427, which was closed before I could fix all of its issues.
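
To make "accounting for dilation" concrete, here is a sketch of a padding check based on the effective kernel size, `(kernel_size - 1) * dilation + 1`. The exact form of the assertion is an assumption on my part (padding at most half of the effective kernel size, with integer division), and the function names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation=1):
    """Extent covered by a dilated kernel along one dimension."""
    return (kernel_size - 1) * dilation + 1


def padding_is_valid(pad, kernel_size, dilation=1):
    # Assumed form of the check: pad is at most half of the effective kernel size.
    return pad <= effective_kernel_size(kernel_size, dilation) // 2


# kernel_size=3, pad=2 fails without dilation but passes with dilation=2,
# because the effective kernel grows from 3 to 5.
without_dilation = padding_is_valid(pad=2, kernel_size=3, dilation=1)
with_dilation = padding_is_valid(pad=2, kernel_size=3, dilation=2)
print(without_dilation, with_dilation)
```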

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118897
Approved by: https://github.com/mikaylagawarecki
2024-02-06 20:32:58 +00:00
b8bb12cd45 Add meta registration for _foreach_norm (#118604)
This PR also fixes the discrepancy between the _foreach_norm fast path and slow path, where storage_offsets will be different between the lists of tensors. Here are some profile results showing that we aren't significantly slower. Do note that we're replacing N `as_strided`/`select` calls with N `empty` calls.

For script:
```
import torch

ts = [torch.rand(32, 16, device="cuda") for _ in range(128)]

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    res = torch._foreach_norm(ts)
print(p.key_averages().table(sort_by="cpu_time_total"))
```

OG baseline:
```
(pytorch-3.10) [janeyx@devgpu023.odn1 ~/local/pytorch (7cf98987)]$ python playground2.py
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-01-30 13:16:48 2740431:2740431 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                    aten::_foreach_norm        25.36%       4.209ms        99.94%      16.586ms      16.586ms       8.000us        88.89%       9.000us       9.000us             1
                                       cudaLaunchKernel        61.21%      10.159ms        61.21%      10.159ms       2.540ms       0.000us         0.00%       0.000us       0.000us             4
                                            aten::zeros         0.43%      71.000us        58.35%       9.683ms       9.683ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::zero_         0.33%      55.000us        57.35%       9.517ms       9.517ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::fill_         0.42%      69.000us        57.01%       9.462ms       9.462ms       1.000us        11.11%       1.000us       1.000us             1
                                           aten::select         8.04%       1.335ms        11.29%       1.873ms      14.633us       0.000us         0.00%       0.000us       0.000us           128
                                       aten::as_strided         3.24%     538.000us         3.24%     538.000us       4.203us       0.000us         0.00%       0.000us       0.000us           128
                                            aten::empty         0.90%     150.000us         0.90%     150.000us      75.000us       0.000us         0.00%       0.000us       0.000us             2
                                  cudaDeviceSynchronize         0.06%      10.000us         0.06%      10.000us      10.000us       0.000us         0.00%       0.000us       0.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        11.11%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us        66.67%       6.000us       3.000us             2
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       2.000us        22.22%       2.000us       2.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 16.596ms
Self CUDA time total: 9.000us
```

And here's after this PR:
```
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-05 08:27:02 1127843:1127843 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                    aten::_foreach_norm        30.95%       4.653ms        99.95%      15.026ms      15.026ms       9.000us        90.00%      10.000us      10.000us             1
                                       cudaLaunchKernel        52.41%       7.879ms        52.41%       7.879ms       1.970ms       0.000us         0.00%       0.000us       0.000us             4
                                            aten::zeros         0.39%      58.000us        48.29%       7.260ms       7.260ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::zero_         0.35%      53.000us        47.25%       7.103ms       7.103ms       0.000us         0.00%       1.000us       1.000us             1
                                            aten::fill_         0.43%      65.000us        46.90%       7.050ms       7.050ms       1.000us        10.00%       1.000us       1.000us             1
                                            aten::empty        15.42%       2.318ms        15.42%       2.318ms      17.969us       0.000us         0.00%       0.000us       0.000us           129
                                  cudaDeviceSynchronize         0.05%       7.000us         0.05%       7.000us       7.000us       0.000us         0.00%       0.000us       0.000us             1
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.000us        10.00%       1.000us       1.000us             1
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us       6.000us        60.00%       6.000us       3.000us             2
void at::native::lpnorm_cleanup<float, (at::native::...         0.00%       0.000us         0.00%       0.000us       0.000us       3.000us        30.00%       3.000us       3.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 15.033ms
Self CUDA time total: 10.000us
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118604
Approved by: https://github.com/albanD
2024-02-05 22:01:01 +00:00
1b03423526 [meta registration] fix _efficient_attention_forward for jagged inputs (#118657)
Fixes the meta registration for the logsumexp output, whose shape should
be defined by the size of the offsets tensor when it exists.

644f64f2d1/aten/src/ATen/native/transformers/cuda/attention.cu (L1045)

Differential Revision: [D53234217](https://our.internmc.facebook.com/intern/diff/D53234217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118657
Approved by: https://github.com/YuqingJ
2024-01-31 00:11:39 +00:00
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```
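
To see what the pattern in the script captures, here is a small check on a single line. The sample error line and the source line are made up for illustration; they follow the `path:line:column: error: message  [code]` layout that the regular expression expects.

```python
import re

# Hypothetical error line in the layout the pattern expects.
sample = 'torch/example.py:12:5: error: Name "x" may be undefined  [possibly-undefined]'

match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", sample)
file_path, line_number, error_type = match.groups()

# Hypothetical source line that the script would annotate.
source_line = "    return x\n"
suppressed = source_line.rstrip() + f"  # type: ignore[{error_type}]\n"

print(file_path, line_number, error_type)
print(suppressed, end="")
```

The script only appends a comment to existing lines, so line numbers within each file stay stable while it runs.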

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45ef53747e2eefffd65d91ce840b431b.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
01abb5af21 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10, https://github.com/malfet
2024-01-22 18:33:41 +00:00
b637fdc8b3 Revert "additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)"
This reverts commit 74e13624998f2a4de29bce73a949d7f0339ec04e.

Reverted https://github.com/pytorch/pytorch/pull/115214 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/115214#issuecomment-1900815152))
2024-01-19 17:35:04 +00:00
74e1362499 additional support for float8_e4m3fnuz and _e5m2fnuz (#115214)
Follow up to #107586.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214
Approved by: https://github.com/peterbell10
2024-01-19 00:50:18 +00:00
f6767244cf Added meta function for _upsample_bicubic2d_aa (#117347)
This should fix remaining errors with Resize op in torchvision: https://github.com/pytorch/vision/actions/runs/7298953575?pr=8127
```
/opt/conda/envs/ci/lib/python3.8/site-packages/torch/nn/functional.py:4072: in interpolate
    return torch._C._nn._upsample_bicubic2d_aa(input, output_size, align_corners, scale_factors)
E   torch._dynamo.exc.TorchRuntimeError: Failed running call_function <function interpolate at 0x7f4443fe00d0>(*(FakeTensor(..., size=(1, s0, s1, s2)),), **{'size': [s4, floor(s3*s4/floor(s1*s3/s2))], 'mode': 'bicubic', 'align_corners': False, 'antialias': True}):
E   aten/src/ATen/RegisterCompositeImplicitAutograd.cpp:5567: SymIntArrayRef expected to contain only concrete integers
E
E   from user code:
E      File "/pytorch/vision/torchvision/transforms/v2/functional/_geometry.py", line 260, in resize_image
E       image = interpolate(
E
E   Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
E
E
E   You can suppress this exception and fall back to eager by setting:
E       import torch._dynamo
E       torch._dynamo.config.suppress_errors = True
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117347
Approved by: https://github.com/peterbell10
2024-01-16 23:33:55 +00:00