Compare commits

...

248 Commits

Author SHA1 Message Date
e6e102cf85 Dynamo testing: add some skips (#128734)
The following tests are failing consistently for me locally, so we're
going to skip them. They're disabled in CI but it looks like they're
just always failing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128734
Approved by: https://github.com/williamwen42
ghstack dependencies: #128731
2024-06-14 20:53:30 +00:00
11de50f17c [Dynamo] skip some TorchScript tests (#128731)
We don't care about the Dynamo x TorchScript composition, so I'm
disabling these tests (so they don't get reported as flaky). Not
disabling all of the TorchScript tests yet because they have been useful
to catch random bugs.

Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128731
Approved by: https://github.com/williamwen42
2024-06-14 20:53:30 +00:00
4b96575a09 [dynamo][aot autograd] Silently disable default saved tensor hooks during tracing (#123196)
Fixes #113263. Same idea as in https://github.com/pytorch/pytorch/pull/113417, but we need a more intrusive C API to silently no-op the default saved tensor hooks during tracing, in order to support user code that uses torch.autograd.disable_saved_tensors_hooks (see test_unpack_hooks_can_be_disabled). We mock the output of get_hooks while leaving push/pop untouched.

For compiled autograd, we're currently firing pack hooks once and unpack hooks twice; I'll look into this separately from this issue.
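
For context, here is a minimal sketch of the saved-tensor-hooks machinery this refers to, using the standard `torch.autograd.graph` API (the new C-level no-op added by this PR is internal and not shown):

```python
import torch

def pack(t):
    # Called when autograd saves a tensor for the backward pass.
    return t

def unpack(t):
    # Called when the saved tensor is needed during backward.
    return t

# Default saved tensor hooks are installed via a context manager.
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    x = torch.randn(3, requires_grad=True)
    (x ** 2).sum().backward()  # pack/unpack fire around the saved `x`

# User code may also forbid hooks entirely; tracing has to respect this too.
with torch.autograd.graph.disable_saved_tensors_hooks("hooks are disabled here"):
    pass
```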

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123196
Approved by: https://github.com/soulitzer
2024-06-14 20:28:08 +00:00
1aafb9eb90 [dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269)
Fixes https://github.com/pytorch/pytorch/issues/101168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128269
Approved by: https://github.com/jansel
ghstack dependencies: #128715
2024-06-14 20:17:03 +00:00
9c77332116 [torch.compile][ci] Flaky models in CI (similar to DISABLED_TEST) (#128715)
These models are really flaky. I went into the CI machine and ran them many times; sometimes they fail, sometimes they pass. Even PyTorch eager results change from run to run, so the accuracy comparison is fundamentally broken/non-deterministic. I am hitting these issues more frequently in the inlining work. There is nothing wrong with inlining itself; I think these models sit on the edge of an already-broken accuracy measurement, and inlining just pushes them further past it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128715
Approved by: https://github.com/eellison
2024-06-14 20:17:03 +00:00
2e5366fbc0 Extended Module Tracker (#128508)
This is an extension of [ModuleTracker](https://github.com/pytorch/pytorch/blob/main/torch/utils/module_tracker.py) with added features and bug fixes.

1. Allows installing user-defined hooks to be called in pre-fw, post-fw, pre-bw and post-bw hooks of the ``ModTracker``.
2. Adds a function ``get_known_fqn`` that retrieves the fqn of the module as tracked by the ``ModTracker``.
3. Only registers the multi-grad hooks if we are in the forward pass. This is important because a module's pre-fw and post-fw hooks get called in the backward pass during AC, and we do not want to register multi-grad hooks in that case.
4. Sets the kwarg ``always_call=True`` for post-fw hooks, so that they are called post AC.
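
For reference, a minimal sketch of how the base `ModuleTracker` is used today; the extension described above builds on this, and the exact hook-registration API of the new tracker is not shown here:

```python
import torch
from torch.utils.module_tracker import ModuleTracker

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU())
tracker = ModuleTracker()

# Print which modules are on the call stack when the ReLU runs.
model[1].register_forward_pre_hook(lambda mod, inp: print(tracker.parents))

with tracker:
    model(torch.randn(2, 8)).sum().backward()
```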

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128508
Approved by: https://github.com/wanchaol
2024-06-14 19:48:46 +00:00
d50712e5e3 [PT2] add inductor log for unbind_stack_pass (#128684)
Summary: Currently, we do not log this pass. To better enable pattern-hit inspection, we now log it.

Test Plan: see signal

Differential Revision: D58571992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128684
Approved by: https://github.com/dshi7
2024-06-14 19:45:55 +00:00
9035fff2de [BE] Do not test deprecated torch.nn.utils.weight_norm (#128727)
Test `torch.nn.utils.parametrizations.weight_norm` instead
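
For reference, a small sketch of the parametrization-based API that is tested instead (this replaces the deprecated `torch.nn.utils.weight_norm`):

```python
import torch
from torch.nn.utils.parametrizations import weight_norm

# The parametrization splits the weight into a magnitude and a direction,
# stored under layer.parametrizations.weight, and recomputes weight on access.
layer = weight_norm(torch.nn.Linear(4, 4), name="weight")
out = layer(torch.randn(2, 4))
print(dict(layer.parametrizations.weight.named_parameters()).keys())
```
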
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128727
Approved by: https://github.com/kit1980
ghstack dependencies: #128726
2024-06-14 19:14:44 +00:00
27458cc097 [BE] Refactor repeated code in test_weight_norm (#128726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128726
Approved by: https://github.com/kit1980
2024-06-14 19:14:44 +00:00
a6bd154a42 [inductor] Support mm decomps for matrices with unbacked sizes (#128655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128655
Approved by: https://github.com/jansel
2024-06-14 18:35:42 +00:00
b94c52dd29 [GHF] Refuse merge to non-default branch (#128710)
Unless PR is ghstack one

Test plan:
```
% GITHUB_TOKEN=$(gh auth token)  python3 -c "from trymerge import GitHubPR; pr=GitHubPR('pytorch', 'pytorch', 128591); print(pr.base_ref(), pr.default_branch())"
release/2.4 main
```
Fixes: https://github.com/pytorch/test-infra/issues/5339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128710
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-06-14 18:23:25 +00:00
be0eec9031 [export] Improve static typing in tracer. (#128552)
Summary: as title.

Test Plan: CI

Differential Revision: D58485487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128552
Approved by: https://github.com/angelayi
2024-06-14 17:57:37 +00:00
2367161e4b Revert "[ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966)"
This reverts commit c339efaf023b4af056dad4cb2f11c07930ed8af6.

Reverted https://github.com/pytorch/pytorch/pull/127966 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/127966#issuecomment-2168505985))
2024-06-14 17:57:23 +00:00
d7fc871175 [inductor] Improve superfluous mask handling in triton codegen (#128518)
This takes the logic from `filter_masks` and factors it out into
`_has_constant_mask`. I also improve support for `persistent_reduction` kernels
by making use of the static RBLOCK value and potentially XBLOCK too in the
`no_x_dim` case.

I then use this helper when generating the `xmask` and `rmask`, so we can
generate them as constants meaning triton can optimize them even if they are
included.

e.g. `compiled_sum(torch.randn(1024, 512, device="cuda"), dim=-1)`
before:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel):
    xnumel = 1024
    XBLOCK: tl.constexpr = 1
    rnumel = 512
    RBLOCK: tl.constexpr = 512
    xoffset = tl.program_id(0) * XBLOCK
    xindex = tl.full([1], xoffset, tl.int32)
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[:]
    roffset = 0
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), rmask & xmask, other=0.0)
    tmp1 = tl.broadcast_to(tmp0, [RBLOCK])
    tmp3 = tl.where(rmask & xmask, tmp1, 0)
    tmp4 = triton_helpers.promote_to_tensor(tl.sum(tmp3, 0))
    tl.store(out_ptr0 + (x0), tmp4, xmask)
```

after:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel):
    xnumel = 1024
    XBLOCK: tl.constexpr = 1
    rnumel = 512
    RBLOCK: tl.constexpr = 512
    xoffset = tl.program_id(0) * XBLOCK
    xindex = tl.full([1], xoffset, tl.int32)
    xmask = tl.full([RBLOCK], True, tl.int1)
    rindex = tl.arange(0, RBLOCK)[:]
    roffset = 0
    rmask = tl.full([RBLOCK], True, tl.int1)
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), None)
    tmp1 = tl.broadcast_to(tmp0, [RBLOCK])
    tmp3 = triton_helpers.promote_to_tensor(tl.sum(tmp1, 0))
    tl.store(out_ptr0 + (x0), tmp3, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128518
Approved by: https://github.com/lezcano
2024-06-14 17:52:55 +00:00
2357490524 [PT2] Enable shape_padding multiplier adjustment (#128346)
Summary:
Our experiments demonstrate that the current default value of 1.1 may not be the best multiplier, so we make the value adjustable to further improve QPS.

context: https://docs.google.com/document/d/10VjpOJkTv5A4sNX7dD6qT7PyhBxn6LSeLAuaqYtoOto/edit

Test Plan:
# IG_CTR

{F1682138315}

Differential Revision: D58373261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128346
Approved by: https://github.com/jackiexu1992
2024-06-14 17:49:24 +00:00
cyy
d4807da802 Various fixes of torch/csrc files (#127252)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127252
Approved by: https://github.com/r-barnes
2024-06-14 17:31:24 +00:00
089e76cca3 [traced-graph][sparse] remove redundant assert in sparse prop test (#128523)
The assertEqualMeta() method already tests that the first argument is a FakeTensor

https://github.com/pytorch/pytorch/issues/117188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128523
Approved by: https://github.com/huydhn
2024-06-14 17:05:17 +00:00
1fb4effe7a [GPT-fast benchmark] Add MLP, gather + gemv, gemv micro benchmark (#128002)
Output example:
```
| name                         | metric                    | target  | actual  |
|------------------------------|---------------------------|---------|---------|
| layer_norm_bfloat16          | memory_bandwidth(GB/s)    | 1017    | 1000.01 |
| mlp_layer_norm_gelu_bfloat16 | flops_utilization         | 0.71    | 0.71    |
| gemv_int8                    | memory_bandwidth(GB/s)    | 990     | 984.06 |
| gemv_bfloat16                | memory_bandwidth(GB/s)    | 1137    | 1137.92 |
| gather_gemv_int8             | memory_bandwidth(GB/s)    | 1113    | 1111.09 |
| gather_gemv_bfloat16         | memory_bandwidth(GB/s)    | 1249    | 1248.15 |

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128002
Approved by: https://github.com/Chillee
2024-06-14 17:03:22 +00:00
4c84af0f5d Fix indexing and slicing of ranges in dynamo (#128567)
Fix https://github.com/pytorch/pytorch/issues/128520
Dynamo does not handle range()[binary subscript] or range()[ternary subscript] correctly. Right now it calls
the get_item function, which simply applies the subscript operation to the list [start, end, step]! That is completely unrelated to the expected result.

In Python, range()[complex subscript] is another range, e.g.:
range(1, 10, 2)[1:4:1] is range(3, 9, 2)
and range(1, 10, 2)[::-1] is range(9, -1, -2)

This diff fixes index and slice applications on range.
It mimics the implementations from https://github.com/python/cpython/blob/main/Objects/rangeobject.c
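
To illustrate the CPython semantics being mimicked, a quick check in plain Python (independent of Dynamo):

```python
# Plain CPython behavior that the fix reproduces: slicing a range yields
# another range, while integer indexing yields an element.
r = range(1, 10, 2)                # 1, 3, 5, 7, 9
assert r[1:4] == range(3, 9, 2)    # slice -> range (ranges compare elementwise)
assert r[::-1] == range(9, -1, -2)
assert r[2] == 5
```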

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128567
Approved by: https://github.com/anijain2305
2024-06-14 16:49:49 +00:00
f75f5987aa Revert "Extended Module Tracker (#128508)"
This reverts commit 1f46284f9ed5b60981174e689d750b358b19e4c4.

Reverted https://github.com/pytorch/pytorch/pull/128508 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/9515753429/job/26230639980 ([comment](https://github.com/pytorch/pytorch/pull/128508#issuecomment-2168405784))
2024-06-14 16:46:03 +00:00
732b4e9074 Fix generated vararg types (#128648)
In the generated files torchgen is incorrectly generating types on the varargs.

The changes all look like this (changing `*size: _int` to `*size: Union[_int, SymInt]`):
```
--- ./torch/_VF.pyi.sav	2024-06-13 20:36:49.189664629 -0700
+++ ./torch/_VF.pyi	2024-06-13 20:36:57.208894614 -0700
@@ -168,17 +168,17 @@
 @overload
 def _efficientzerotensor(size: Sequence[Union[_int, SymInt]], *, dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ...
 @overload
-def _efficientzerotensor(*size: _int, dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ...
+def _efficientzerotensor(*size: Union[_int, SymInt], dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ...
 def _embedding_bag(weight: Tensor, indices: Tensor, offsets: Tensor, scale_grad_by_freq: _bool = False, mode: _int = 0, sparse: _bool = False, per_sample_weights: Optional[Tensor] = None, include_last_offset: _bool = False, padding_idx: _int = -1) -> Tuple[Tensor, Tensor, Tensor, Tensor]: ...
 def _embedding_bag_forward_only(weight: Tensor, indices: Tensor, offsets: Tensor, scale_grad_by_freq: _bool = False, mode: _int = 0, sparse: _bool = False, per_sample_weights: Optional[Tensor] = None, include_last_offset: _bool = False, padding_idx: _int = -1) -> Tuple[Tensor, Tensor, Tensor, Tensor]: ...
 @overload
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128648
Approved by: https://github.com/jamesjwu
2024-06-14 16:04:37 +00:00
8629939a51 [torch/c10] Add C10_UBSAN_ENABLED macro and use it to disable SymInt_… (#127967)
Adds the `C10_UBSAN_ENABLED` macro and uses it to disable `SymIntTest::Overflows` (which fails under the `signed-integer-overflow` UBSAN check).

Also cleans up UBSAN guard in `jit/test_misc.cpp` to use `C10_UBSAN_ENABLED`  and the existing `C10_ASAN_ENABLED` instead of locally defining `HAS_ASANUBSAN`.

> NOTE: This should fix `SymIntTest::Overflows` failing under ubsan in fbcode too...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127967
Approved by: https://github.com/atalman, https://github.com/d4l3k, https://github.com/malfet
2024-06-14 16:01:12 +00:00
ee140a198f Revert "[Port][Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#128591)"
This reverts commit 03e8a4cf45ee45611de77b55b515a8936f60ce31.

Reverted https://github.com/pytorch/pytorch/pull/128591 on behalf of https://github.com/atalman due to Contains release only changes should not be landed ([comment](https://github.com/pytorch/pytorch/pull/128591#issuecomment-2168308233))
2024-06-14 15:51:00 +00:00
c187593418 Prevent expansion of cat indexing to avoid int64 intermediate (#127815)
Fix for https://github.com/pytorch/pytorch/issues/127652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127815
Approved by: https://github.com/shunting314, https://github.com/peterbell10
2024-06-14 15:42:08 +00:00
c339efaf02 [ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966)
The needle has moved quite a bit on the ROCm backend front. This PR examines the tests referenced in the following issue: https://github.com/pytorch/pytorch/issues/96560

This is a follow-up PR to https://github.com/pytorch/pytorch/pull/125069

It unskips the next batch of tests referenced by the aforementioned issue. No source changes were needed; the tests passed immediately after being unskipped.

The tests previously marked with xfail have been modified to no longer expect a failure when running on ROCm, as they now pass. Behavior on other architectures is unchanged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127966
Approved by: https://github.com/pruthvistony, https://github.com/zou3519
2024-06-14 15:24:28 +00:00
c76a9d13cb Revert D56709309 (#128481)
Summary: potential fw compatibility issue raised from D58397323

Test Plan: Sandcastle

Reviewed By: houseroad

Differential Revision: D58443190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128481
Approved by: https://github.com/desertfire
2024-06-14 14:57:17 +00:00
9972e5f447 Rename impl_abstract to register_fake, part 2/2 (#123938)
This PR renames the implementation details of register_fake to align
more with the new name. It is in its own PR because this is risky
(torch.package sometimes depends on private library functions and
implementation details).
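
As a quick orientation for readers, here is a hedged sketch of the public API whose internals are being renamed; the decorator names follow the `torch.library` docs, but treat the exact signatures here as assumptions:

```python
import torch

# Define a custom op and give it a "fake" (meta) implementation via the new
# register_fake name (formerly impl_abstract).
@torch.library.custom_op("mylib::scale", mutates_args=())
def scale(x: torch.Tensor, factor: float) -> torch.Tensor:
    return x * factor

@scale.register_fake
def _(x, factor):
    # Only describes output metadata; used by FakeTensor / compile tracing.
    return torch.empty_like(x)
```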

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123938
Approved by: https://github.com/williamwen42
2024-06-14 14:37:24 +00:00
a2d9c430b4 Adding a note for Getting Started with PyTorch on Intel GPUs (#127872)
Adding a note for Getting Started with PyTorch on Intel GPUs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127872
Approved by: https://github.com/svekars
2024-06-14 14:24:28 +00:00
dfc4b608e1 Remove leftover warning causing log spew (#128688)
This warning was left in by mistake; it is uninformative (the user is doing nothing wrong) and was causing log spew in training runs. See https://github.com/pytorch/pytorch/pull/120750#discussion_r1638430500
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128688
Approved by: https://github.com/drisspg
2024-06-14 14:08:11 +00:00
e1dfc61250 Document CI/CD security philosophy (#128316)
Namely:
- When use of non-ephemeral runners is OK, vs. when it is not
- Why binary build pipelines should not use distributed caching
- Why temporary CI artifacts should not be considered safe
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128316
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-06-14 13:47:25 +00:00
cyy
bfd5ea93e0 Enable clang-tidy on c10/util/Float8*.h (#120573)
This PR clears warnings and enables clang-tidy on c10/util/Float8*.h.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120573
Approved by: https://github.com/drisspg
2024-06-14 13:47:07 +00:00
1f46284f9e Extended Module Tracker (#128508)
This is an extension of [ModuleTracker](https://github.com/pytorch/pytorch/blob/main/torch/utils/module_tracker.py) with added features and bug fixes.

1. Allows installing user-defined hooks to be called in pre-fw, post-fw, pre-bw and post-bw hooks of the ``ModTracker``.
2. Adds a function ``get_known_fqn`` that retrieves the fqn of the module as tracked by the ``ModTracker``.
3. Only registers the multi-grad hooks if we are in the forward pass. This is important because a module's pre-fw and post-fw hooks get called in the backward pass during AC, and we do not want to register multi-grad hooks in that case.
4. Sets the kwarg ``always_call=True`` for post-fw hooks, so that they are called post AC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128508
Approved by: https://github.com/wanchaol
2024-06-14 12:01:53 +00:00
e397ad6883 Improve codegen for ops.masked in triton (#128054)
Fixes https://github.com/pytorch/pytorch/issues/127930
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128054
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-06-14 11:52:56 +00:00
7e734e2d08 [inductor] Fix nested indirect indexing case for index_propagation (#128378)
Tries to fix #127677.

# Context

Just as @peterbell10 pointed out, we have the following scenario:
```
a = ops.indirect_indexing(...)
b = ops.index_expr(a, ...)
c = ops.indirect_indexing(b, ...)
```

We can repro this as:
```
def forward(self, arg0_1, arg1_1, arg2_1):
    iota = torch.ops.prims.iota.default(arg0_1, start = 0, step = 1, index=0)
    repeat_interleave = torch.ops.aten.repeat_interleave.Tensor(arg1_1);
    index = torch.ops.aten.index.Tensor(iota, [repeat_interleave]);
    index_1 = torch.ops.aten.index.Tensor(arg2_1, [index]);
    return (index_1,)
```

which should generate a JIT py file like this:
```
def triton_poi_fused_index_select_0(in_ptr0, in_ptr1, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
    ...
    tmp0 = tl.load(in_ptr0 + (x1), xmask, eviction_policy='evict_last')
    tmp1 = ks0
    tmp2 = tmp0 + tmp1
    tmp3 = tmp0 < 0
    tmp4 = tl.where(tmp3, tmp2, tmp0)
    # check_bounds()
    tl.device_assert(((0 <= tmp4) & (tmp4 < ks0)) | ~(xmask), "index out of bounds: 0 <= tmp4 < ks0")

def call():
  arg0_1, arg1_1, arg2_1 = args
  buf1 = aten.repeat_interleave.Tensor(arg1_1)
  buf4 = empty_strided_cuda((u0, 64), (64, 1))
  triton_poi_fused_index_select_0.run(
    buf1, arg2_1, buf4, s0,
    triton_poi_fused_index_select_0_xnumel,
    grid=grid(triton_poi_fused_index_select_0_xnumel),
    stream=stream0)
```

# Issue
In our `IndexPropagation.indirect_indexing()` call we have `expr=indirect0` which is spawned in `LoopBodyBlock.indirect_indexing()`.
3b555ba477/torch/_inductor/ir.py (L8154-L8160)

When we try to see if we can prove its bounds, we fail because `indirect0` isn't in `var_ranges`.

# Approach
When creating `indirect` symbols from the fallback, specify their range as `[-size, size - 1]` to avoid a lookup error with `indirectX`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128378
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-14 10:07:06 +00:00
99988be423 [halide-backend] Add test shard (#127308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127308
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #128266
2024-06-14 10:02:57 +00:00
03e8a4cf45 [Port][Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#128591)
Port #127592 from main to release/2.4

------
Fixes #127402

- Revert some changes to `ir.MutationOutput` and inductor/test_flex_attention.py
- Add checks of mutation for QLinearPointwiseBinaryPT2E

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127592
Approved by: https://github.com/leslie-fang-intel, https://github.com/Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128591
Approved by: https://github.com/jgong5, https://github.com/Chillee
2024-06-14 09:31:38 +00:00
43ae3073f9 Revert "[traced-graph][sparse] remove redundant assert in sparse prop test (#128523)"
This reverts commit ba3726d02b25dff92762c59d4dffe96a7babfa75.

Reverted https://github.com/pytorch/pytorch/pull/128523 on behalf of https://github.com/DanilBaibak due to Sorry for the revert. Looks like your changes broke the inductor tests: linux-jammy-cpu-py3.8-gcc11-inductor. [Here you can find more details](ba3726d02b). ([comment](https://github.com/pytorch/pytorch/pull/128523#issuecomment-2167518145))
2024-06-14 08:27:05 +00:00
0344f95c2e Add missing #include <array> to thread_name.cpp (#128664)
I got local compile errors (using clang 14.0.6) due to this missing include after pulling the
latest pytorch main.  It's totally puzzling why CI appears to pass
without this fix.  Hopefully someone else will have an idea if we are
missing some CI coverage or if I am using a strange build setup locally.

The PR introducing the compile errors was https://github.com/pytorch/pytorch/pull/128448.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128664
Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/d4l3k
2024-06-14 07:49:09 +00:00
03725a0512 [dtensor][example] added MLPStacked example for printing sharding (#128461)
**Summary**
Currently, comm_mode_feature_examples does not have an example of printing sharding information for a model with nested modules. While adding the new example to the suite, I recognized a way to refactor the existing examples in order to make them more readable for users. The expected output can be found below:
<img width="354" alt="Screenshot 2024-06-11 at 5 41 14 PM" src="https://github.com/pytorch/pytorch/assets/50644008/68cef7c7-cb1b-4e51-8b60-85123d96ca92">

**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128461
Approved by: https://github.com/XilunWu
ghstack dependencies: #128369, #128451
2024-06-14 07:30:31 +00:00
dd3b79a08f [dtensor][be] improving readability of comm_mode.py and comm_mode_features_example.py (#128451)
**Summary**
I have added comments to address previous readability concerns in comm_mode.py and comm_mode_features_example.py. I also renamed files and test cases to better reflect what they are about, and removed the non-distributed test case and other lines of code that do not contribute to the example of how comm_mode can be used. Finally, I've added the expected output for each example function so users are not forced to run the code.

**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128451
Approved by: https://github.com/XilunWu
ghstack dependencies: #128369
2024-06-14 07:30:31 +00:00
e886122e98 [dtensor][debug] add module level tracing and readable display (#128369)
**Summary**
Currently, CommDebugMode only allows displaying collective tracing at the model level, whereas a user may require a more detailed breakdown. To make this possible, I have changed the ModuleParamaterShardingTracker by adding a string variable that tracks the current sub-module, as well as a dictionary that keeps track of the depths of the submodules in the model tree. The CommDebugMode class was changed by adding a new dictionary that tracks per-module collective counts, as well as a function that displays the counts in a way that is easy for the user to read. Two examples using MLPModule and Transformer have been added to showcase the new changes. The expected output of the simpler MLPModule example is:

<img width="255" alt="Screenshot 2024-06-10 at 4 58 50 PM" src="https://github.com/pytorch/pytorch/assets/50644008/cf2161ef-2663-49c1-a8d5-9f97e96a1791">
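
A rough usage sketch of the debug mode being extended; the import path and method names here reflect `torch.distributed._tensor.debug` at this point in time and should be treated as assumptions:

```python
from torch.distributed._tensor.debug import CommDebugMode

comm_mode = CommDebugMode()
with comm_mode:
    # Run a DTensor-sharded model's forward/backward here.
    ...
# Total number of collectives observed, plus per-collective counts.
print(comm_mode.get_total_counts())
print(comm_mode.get_comm_counts())
```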

**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/display_sharding_example.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128369
Approved by: https://github.com/XilunWu
2024-06-14 07:30:31 +00:00
4669c6d3ae [quant][pt2e][quantizer] Support set_module_name_qconfig in X86InductorQuantizer (#126044)
Summary:
Added `set_module_name_qconfig` support to allow users to set configurations based on module name in `X86InductorQuantizer`.

For example, only quantize the `sub`:

```python
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)
        self.sub = Sub()

    def forward(self, x):
        x = self.linear(x)
        x = self.sub(x)
        return x

m = M().eval()
example_inputs = (torch.randn(3, 5),)
# Set config for a specific submodule.
quantizer = X86InductorQuantizer()
quantizer.set_module_name_qconfig("sub", xiq.get_default_x86_inductor_quantization_config())
```

- Added `set_module_name_qconfig` to allow users to set the configuration at the `module_name` level.
- Unified the annotation process to follow this order:  `module_name_qconfig`, `operator_type_qconfig`, and `global_config`.
- Added `config_checker` to validate all user configurations and prevent mixing of static/dynamic or QAT/non-QAT configs.
- Moved `_get_module_name_filter` from `xnnpack_quantizer.py` into `utils.py` as it is common to all quantizers.

Test Plan

```bash
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_set_module_name
```

@Xia-Weiwen @leslie-fang-intel  @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126044
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2024-06-14 07:13:10 +00:00
674be9d3be Update cu124 dynamo benchmark expected values (#128589)
I believe this corresponds to changes in https://github.com/pytorch/pytorch/pull/127780

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128589
Approved by: https://github.com/nWEIdia, https://github.com/DanilBaibak
2024-06-14 07:04:34 +00:00
18f35d9e12 Revert "Run all samples for torchinductor tests (#128343)"
This reverts commit 41df20c07caecddb6d21d69a125f2998ae9313e8.

Reverted https://github.com/pytorch/pytorch/pull/128343 on behalf of https://github.com/clee2000 due to broke inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_avg_pool3d_cuda_float16 and other tests 41df20c07c https://github.com/pytorch/pytorch/actions/runs/9509191526/job/26213490266. I think this might be a landrace ([comment](https://github.com/pytorch/pytorch/pull/128343#issuecomment-2167275337))
2024-06-14 06:08:17 +00:00
f48f7615dc [easy][subclasses] dynamo.reset() in test_subclass_views (#128659)
When we don't call dynamo.reset(), we don't recompile on different dynamic shapes.
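
For context, a small sketch of what the reset does between test cases (assumed typical usage):

```python
import torch

@torch.compile(dynamic=True)
def f(x):
    return x * 2

f(torch.randn(4))
# Without this, cached compilations (and their dynamic-shape guards) persist
# across test cases, so a later case may silently reuse an earlier graph.
torch._dynamo.reset()
f(torch.randn(8))  # recompiles from scratch after the reset
```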

Also, some of the returned views were tuples, so when we apply `* 2` we actually just copy all the inputs twice within the tuple. I changed it to return only one of the values from the returned tuple.

Additionally, this exposes a bug that fails with the slice operation, so I skipped it when we're testing with dynamic shapes:

```
  File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3996, in produce_guards
    sexpr = ShapeGuardPrinter(symbol_to_source, source_ref, self.var_to_sources).doprint(expr)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 292, in doprint
    return self._str(self._print(expr))
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
    return printmethod(expr, **kwargs)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 56, in _print_Add
    t = self._print(term)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
    return printmethod(expr, **kwargs)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in _print_Mul
    a_str = [self.parenthesize(x, prec, strict=False) for x in a]
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in <listcomp>
    a_str = [self.parenthesize(x, prec, strict=False) for x in a]
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 37, in parenthesize
    return self._print(item)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
    return printmethod(expr, **kwargs)
  File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1494, in _print_Symbol
    assert self.symbol_to_source.get(expr), (
AssertionError: s3 (could be from ['<ephemeral: symint_visitor_fn>', '<ephemeral: symint_visitor_fn>']) not in {s0: ["L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]"], s1: ["L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]"], s2: ["L['x'].a.storage_offset()", "L['x'].b.storage_offset()", "L['x'].a.storage_offset()", "L['x'].b.storage_offset()"]}.  If this assert is failing, it could be due to the issue described in https://github.com/pytorch/pytorch/pull/90665
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128659
Approved by: https://github.com/YuqingJ
2024-06-14 05:18:07 +00:00
9ac08dab1f Updates diskspace-cleanup for ROCm CI (#127947)
Gets the location of the docker directory and outputs how much disk space is being used by docker.

This is required since the new Cirrascale CI nodes for ROCm have docker root directory in a different partition.

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127947
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-06-14 04:32:38 +00:00
eff01bce21 Only run inductor A100 perf benchmark smoke test periodically (#128677)
Attempt to mitigate the long queue on A100 as reported in https://github.com/pytorch/pytorch/issues/128627.

From what I see, this change 03467b3fed/1 doubles the job duration from 20+ to 40+ minutes. This, together with https://github.com/pytorch/pytorch/blob/main/.github/workflows/inductor-cu124.yml and perhaps an increased number of PRs with `ciflow/inductor`, is contributing to the long queue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128677
Approved by: https://github.com/atalman, https://github.com/desertfire
2024-06-14 02:39:33 +00:00
ba3726d02b [traced-graph][sparse] remove redundant assert in sparse prop test (#128523)
The assertEqualMeta() method already tests that the first argument is a FakeTensor

https://github.com/pytorch/pytorch/issues/117188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128523
Approved by: https://github.com/soulitzer
2024-06-14 02:34:51 +00:00
685fcfb40d Fix docstring in autograd (#128657)
Fix docstrings in autograd files.

The fix can be verified by running pydocstyle path-to-file --count

Related #112593

**BEFORE the PR:**

pydocstyle torch/autograd/anomaly_mode.py --count
8
pydocstyle torch/autograd/__init__.py --count
9

**AFTER the PR:**

pydocstyle torch/autograd/anomaly_mode.py --count
0
pydocstyle torch/autograd/__init__.py --count
0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128657
Approved by: https://github.com/soulitzer
2024-06-14 02:18:59 +00:00
0186b386cd Revert "[ONNX] Add upsample trilinear to skip decomp (#128259)"
This reverts commit b72989a2b5ac4637612e31e325d7c8233fcbd7a1.

Reverted https://github.com/pytorch/pytorch/pull/128259 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its ONNX job is failing in trunk b72989a2b5 ([comment](https://github.com/pytorch/pytorch/pull/128259#issuecomment-2167058937))
2024-06-14 01:44:26 +00:00
f48ca2561d Document torch.cuda.profiler.start (#128098)
Documents the `start` function of `cuda/profiler.py`, as requested in https://github.com/pytorch/pytorch/issues/127917.

Fixes #127917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128098
Approved by: https://github.com/aaronenyeshi
2024-06-14 01:44:18 +00:00
41df20c07c Run all samples for torchinductor tests (#128343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128343
Approved by: https://github.com/lezcano
2024-06-14 01:28:32 +00:00
6895a5804c Revert "[checkpoint] Clean up selective activation checkpoint and make public (#125795)"
This reverts commit c472cec5656b9ffb668af97a02d711bdbdf5ebec.

Reverted https://github.com/pytorch/pytorch/pull/125795 on behalf of https://github.com/soulitzer due to breaking torchtitan CI ([comment](https://github.com/pytorch/pytorch/pull/125795#issuecomment-2167036157))
2024-06-14 01:14:59 +00:00
6564d63e69 Use mv kernel for small M (#128632)
Previously we were using:
* the mv kernel for M == 1
* the mm kernel for 1 < M < 4
* a llama.cpp-inspired mm kernel for M >= 4

This PR consolidates these into only 2 kernels, using the same mv kernel for M < 12.

Benchmarked on https://github.com/malfet/llm_experiments/blob/main/metal-perf/int8mm.mm

Mac M1 Max, input size M x 4128 x 4096

![llama cpp shader and ATen shader (2)](https://github.com/pytorch/pytorch/assets/8188269/9e2e3024-c5ea-4303-88bf-ff3646296396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128632
Approved by: https://github.com/malfet
2024-06-14 01:06:53 +00:00
ae2359638b Save DOT file of graph instead of SVG for GraphTranformObserver (#128634)
Summary:
GraphTransformObserver saves an SVG file of the input/output graph in each inductor pass. In my test with the CMF model, if the graph is large, Graphviz takes forever to convert DOT to SVG, which is not acceptable.

This diff saves the DOT file instead of the SVG file to speed things up. The DOT file is also an order of magnitude smaller than the SVG.

To view these graphs, users can run `dot -Txxx input.dot` to convert the DOT file to any other format they want, and can control how many iterations are used to lay out the graph properly. Refer to https://web.archive.org/web/20170507095019/http://graphviz.org/content/attrs#dnslimit for details.

Test Plan: buck2 test mode/dev-sand caffe2/test:fx --  fx.test_fx_xform_observer.TestGraphTransformObserver

Differential Revision: D58539182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128634
Approved by: https://github.com/mengluy0125
2024-06-14 00:54:22 +00:00
6f181756dc Use by-column algorithm for fp16/bf16 CPUBlas gemm_transb kernels (#127318)
Summary: #96074 (D44340826) changed the algorithm for 16-bit types for gemm_notrans_ and gemm_transb_ for the sake of precision. In this diff, we go back to the old algorithm for gemm_transb_, maintaining precision by allocating a temporary buffer the size of `c` (counted in elements, so actually double the bytes, since we accumulate the 16-bit types into fp32) to accumulate into.
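
Conceptually, the idea looks like the following Python sketch (not the actual C++ CPUBlas kernel): accumulate the 16-bit products into an fp32 temporary the size of `c`, column by column, and only cast back at the end.

```python
import torch

def gemm_transb_sketch(a, b):
    # a: (m, k) bf16/fp16, b: (n, k) bf16/fp16; computes a @ b.T
    m, _ = a.shape
    n = b.shape[0]
    acc = torch.zeros(m, n, dtype=torch.float32)  # temporary the size of c
    for j in range(n):                            # go column-by-column over c
        acc[:, j] = a.float() @ b[j].float()      # accumulate in fp32
    return acc.to(a.dtype)

out = gemm_transb_sketch(torch.randn(8, 16, dtype=torch.bfloat16),
                         torch.randn(4, 16, dtype=torch.bfloat16))
```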

Test Plan: Used https://github.com/malfet/llm_experiments (benchmarks/benchmark_torch_mm.py) to benchmark before and after:

before:
```
mv_nt    torch.float32    5.47 usec
mv_nt    torch.float16    8.45 usec
mv_nt   torch.bfloat16  183.43 usec
mv_ta    torch.float32    5.70 usec
mv_ta    torch.float16   24.17 usec
mv_ta   torch.bfloat16   97.27 usec
notrans  torch.float32    5.58 usec
notrans  torch.float16   25.18 usec
notrans torch.bfloat16   63.11 usec
trans_a  torch.float32    5.59 usec
trans_a  torch.float16   68.94 usec
trans_a torch.bfloat16  311.60 usec
trans_b  torch.float32    5.63 usec
trans_b  torch.float16    8.76 usec
trans_b torch.bfloat16   29.17 usec
```

after:
```
mv_nt    torch.float32    5.53 usec
mv_nt    torch.float16    8.57 usec
mv_nt   torch.bfloat16  188.17 usec
mv_ta    torch.float32    5.78 usec
mv_ta    torch.float16   28.59 usec
mv_ta   torch.bfloat16   98.45 usec
notrans  torch.float32    5.71 usec
notrans  torch.float16   26.08 usec
notrans torch.bfloat16   64.06 usec
trans_a  torch.float32    5.72 usec
trans_a  torch.float16   32.21 usec
trans_a torch.bfloat16   32.10 usec
trans_b  torch.float32    5.83 usec
trans_b  torch.float16    9.05 usec
trans_b torch.bfloat16   29.66 usec
```

Also expanded coverage to a range of larger matrix-vector and matrix-matrix sizes.

before:
```
Matrix-vector:
m=1024, n=1024, k=1
====================
notrans  torch.float32   24.75 usec
notrans  torch.float16  258.04 usec
notrans torch.bfloat16  245.64 usec
trans_a  torch.float32   26.94 usec
trans_a  torch.float16  692.09 usec
trans_a torch.bfloat16 1709.53 usec
m=4100, n=4100, k=1
====================
notrans  torch.float32 2811.48 usec
notrans  torch.float16 4192.06 usec
notrans torch.bfloat16 4041.01 usec
trans_a  torch.float32 2778.38 usec
trans_a  torch.float16 17218.41 usec
trans_a torch.bfloat16 27561.21 usec
m=16384, n=16384, k=1
====================
notrans  torch.float32 60157.66 usec
notrans  torch.float16 64121.38 usec
notrans torch.bfloat16 65714.65 usec
trans_a  torch.float32 84975.39 usec
trans_a  torch.float16 1024223.33 usec
trans_a torch.bfloat16 1078683.21 usec

Matrix-matrix:
m=1024, n=1024, k=256
====================
notrans  torch.float32  302.55 usec
notrans  torch.float16 172869.06 usec
notrans torch.bfloat16 172837.81 usec
trans_a  torch.float32  250.03 usec
trans_a  torch.float16 333373.38 usec
trans_a torch.bfloat16 432760.00 usec
m=4100, n=4100, k=128
====================
notrans  torch.float32 5278.56 usec
notrans  torch.float16 1426335.29 usec
notrans torch.bfloat16 1404249.37 usec
trans_a  torch.float32 4818.63 usec
trans_a  torch.float16 2969936.17 usec
trans_a torch.bfloat16 3432565.96 usec
m=16384, n=16384, k=16
====================
notrans  torch.float32 72225.71 usec
notrans  torch.float16 1439875.54 usec
notrans torch.bfloat16 1443716.33 usec
trans_a  torch.float32 221130.21 usec
trans_a  torch.float16 16910654.17 usec
trans_a torch.bfloat16 21447377.63 usec
```

after:
```
Matrix-vector:
m=1024, n=1024, k=1
====================
notrans  torch.float32   25.11 usec
notrans  torch.float16  252.76 usec
notrans torch.bfloat16  238.58 usec
trans_a  torch.float32   26.62 usec
trans_a  torch.float16  167.40 usec
trans_a torch.bfloat16  174.08 usec
m=4100, n=4100, k=1
====================
notrans  torch.float32 2774.28 usec
notrans  torch.float16 3991.70 usec
notrans torch.bfloat16 3945.44 usec
trans_a  torch.float32 3011.25 usec
trans_a  torch.float16 2666.85 usec
trans_a torch.bfloat16 2686.95 usec
m=16384, n=16384, k=1
====================
notrans  torch.float32 58682.15 usec
notrans  torch.float16 63077.52 usec
notrans torch.bfloat16 63319.33 usec
trans_a  torch.float32 70549.57 usec
trans_a  torch.float16 42145.45 usec
trans_a torch.bfloat16 42270.13 usec

Matrix-matrix:
m=1024, n=1024, k=256
====================
notrans  torch.float32  289.37 usec
notrans  torch.float16 179704.87 usec
notrans torch.bfloat16 173490.33 usec
trans_a  torch.float32  330.89 usec
trans_a  torch.float16 42466.26 usec
trans_a torch.bfloat16 42811.19 usec
m=4100, n=4100, k=128
====================
notrans  torch.float32 4793.33 usec
notrans  torch.float16 1407557.04 usec
notrans torch.bfloat16 1388212.17 usec
trans_a  torch.float32 4714.20 usec
trans_a  torch.float16 359406.58 usec
trans_a torch.bfloat16 350419.42 usec
m=16384, n=16384, k=16
====================
notrans  torch.float32 65757.08 usec
notrans  torch.float16 1427715.71 usec
notrans torch.bfloat16 1440883.00 usec
trans_a  torch.float32 202263.44 usec
trans_a  torch.float16 1387522.33 usec
trans_a torch.bfloat16 1762253.92 usec
```

We are improving, but we still have a lot of room for improvement compared to float32 BLAS. Full disclosure: applying this same method to gemm_notrans (which does correspond to notrans in the benchmark's nomenclature) does not improve performance across the board; the 16384 x 16384 x 16 matmul regresses and I haven't figured out why yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127318
Approved by: https://github.com/peterbell10, https://github.com/malfet
2024-06-14 00:39:18 +00:00
18f5357f4f Introduce heuristic for mixed_mm on A100 (#128232)
This PR introduces a heuristic for tuned_mixed_mm. The heuristic is only enabled on an A100, because it has only been tested on an A100, and it is only enabled if force_mixed_mm="heuristic".
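
A hypothetical enablement sketch based on the description above; the config knob name and value are taken from this PR's text and should be treated as assumptions rather than a stable API:

```python
import torch
import torch._inductor.config as inductor_config

# Per this PR: the heuristic only kicks in on A100 and only with this setting
# (assumed knob, taken from the PR description).
inductor_config.force_mixed_mm = "heuristic"

@torch.compile
def mixed_mm(x_fp16, w_int8):
    # The "mixed mm" pattern: low-precision weights upcast inside the matmul.
    return x_fp16 @ w_int8.to(x_fp16.dtype)
```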

I compared the heuristic to the aten fallback implementation and triton+autotune:
 Geometric mean speedup: 2.51
 ```
 m     n     k  triton + autotune (GB/s)  aten (GB/s)  heuristic (GB/s)  used_heuristic  speedup (heuristic/aten)
  1  4096  4096                    456.95       134.59            459.37            True                      3.41
  1  4096  8192                    523.93       138.29            553.50            True                      4.00
  1  4096 16394                    233.70       161.62            234.14            True                      1.45
  1  8192  4096                    633.25       140.64            574.86            True                      4.09
  1  8192  8192                    737.54       147.41            690.26            True                      4.68
  1  8192 16394                    413.67       175.88            408.68            True                      2.32
  1 16394  4096                    717.22       167.22            665.36            True                      3.98
  1 16394  8192                    812.69       177.17            815.90            True                      4.61
  1 16394 16394                    473.17       178.58            435.11            True                      2.44
  4  4096  4096                    479.46       134.80            486.74            True                      3.61
  4  4096  6333                    174.27       106.74            171.64            True                      1.61
  4  4096  8192                    567.14       138.32            571.09            True                      4.13
  4  4096 12313                    179.65       105.91            180.03            True                      1.70
  4  4096 16394                    222.96       145.54            222.81            True                      1.53
  4  6333  4096                    491.78       126.37            473.20            True                      3.74
  4  6333  6333                    268.79       143.40            269.75            True                      1.88
  4  6333  8192                    783.80       135.12            796.23            True                      5.89
  4  6333 12313                    286.35       142.37            287.30            True                      2.02
  4  6333 16394                    362.47       139.66            361.47            True                      2.59
  4  8192  4096                    642.73       140.53            641.88            True                      4.57
  4  8192  6333                    287.65       137.63            287.38            True                      2.09
  4  8192  8192                    738.42       150.16            721.59            True                      4.81
  4  8192 12313                    301.27       146.18            302.31            True                      2.07
  4  8192 16394                    415.37       167.66            393.41            True                      2.35
  4 12313  4096                    823.66       141.81            745.40            True                      5.26
  4 12313  6333                    433.92       148.17            429.83            True                      2.90
  4 12313  8192                    984.60       149.30            988.95            True                      6.62
  4 12313 12313                    452.00       150.87            452.50            True                      3.00
  4 12313 16394                    609.88       159.20            609.71            True                      3.83
  4 16394  4096                    779.44       157.46            777.10            True                      4.94
  4 16394  6333                    402.93       139.50            309.47            True                      2.22
  4 16394  8192                    950.38       175.49            949.67            True                      5.41
  4 16394 12313                    414.62       153.99            315.95            True                      2.05
  4 16394 16394                    497.56       174.97            461.77            True                      2.64
16  4096  4096                    475.92       134.45            478.57            True                      3.56
16  4096  6333                    146.36       112.50            145.35            True                      1.29
16  4096  8192                    560.00       138.22            557.19            True                      4.03
16  4096 12313                    152.02       105.06            151.27            True                      1.44
16  4096 16394                    222.48       156.72            222.88            True                      1.42
16  6333  4096                    692.41       122.14            696.88            True                      5.71
16  6333  6333                    220.74       140.90            225.41            True                      1.60
16  6333  8192                    813.56       140.21            820.28            True                      5.85
16  6333 12313                    232.48       131.19            232.55            True                      1.77
16  6333 16394                    367.39       134.93            361.87            True                      2.68
16  8192  4096                    665.54       140.29            266.24            True                      1.90
16  8192  6333                    254.77       136.65            240.12            True                      1.76
16  8192  8192                    750.63       146.26            736.93            True                      5.04
16  8192 12313                    266.61       127.13            251.81            True                      1.98
16  8192 16394                    397.25       160.42            390.76            True                      2.44
16 12313  4096                    857.48       141.36            851.36            True                      6.02
16 12313  6333                    423.21       132.40            357.55            True                      2.70
16 12313  8192                   1021.24       145.68           1024.60            True                      7.03
16 12313 12313                    370.12       143.94            383.52            True                      2.66
16 12313 16394                    608.52       141.03            608.48            True                      4.31
16 16394  4096                    826.48       155.94            826.74            True                      5.30
16 16394  6333                    420.38       144.09            265.23            True                      1.84
16 16394  8192                    988.07       156.21            984.63            True                      6.30
16 16394 12313                    431.40       146.92            265.49            True                      1.81
16 16394 16394                    497.39       167.86            461.79            True                      2.75
23  4096  4096                    344.43       132.84            338.64            True                      2.55
23  4096  6333                    195.34       118.48            195.31            True                      1.65
23  4096  8192                    389.83       140.02            376.62            True                      2.69
23  4096 12313                    204.49       137.96            204.80            True                      1.48
23  4096 16394                    242.48       148.99            242.74            True                      1.63
23  6333  4096                    429.25       126.52            517.75            True                      4.09
23  6333  6333                    295.56       133.51            296.14            True                      2.22
23  6333  8192                    594.88       137.05            581.78            True                      4.25
23  6333 12313                    315.18       131.67            314.64            True                      2.39
23  6333 16394                    386.46       141.45            386.54            True                      2.73
23  8192  4096                    553.52       142.05            568.35            True                      4.00
23  8192  6333                    215.58       139.01            210.86            True                      1.52
23  8192  8192                    609.21       154.85            528.76            True                      3.41
23  8192 12313                    220.38       142.93            233.54            True                      1.63
23  8192 16394                    402.63       158.39            403.21            True                      2.55
23 12313  4096                    723.54       131.58            581.94            True                      4.42
23 12313  6333                    307.90       131.58            307.90            True                      2.34
23 12313  8192                    893.36       129.97            623.72            True                      4.80
23 12313 12313                    322.40       134.84            317.80            True                      2.36
23 12313 16394                    512.97       142.31            409.45            True                      2.88
23 16394  4096                    703.66       154.54            643.53            True                      4.16
23 16394  6333                    305.55       127.55            293.17            True                      2.30
23 16394  8192                    768.12       154.60            681.53            True                      4.41
23 16394 12313                    311.61       140.92            307.01            True                      2.18
23 16394 16394                    467.24       171.07            467.29            True                      2.73
32  4096  4096                    344.71       132.30            338.62            True                      2.56
32  4096  6333                    206.48       107.59            205.55            True                      1.91
32  4096  8192                    387.24       137.82            353.12            True                      2.56
32  4096 12313                    216.35       120.61            214.50            True                      1.78
32  4096 16394                    242.05       149.92            241.94            True                      1.61
32  6333  4096                    525.50       127.12            518.02            True                      4.08
32  6333  6333                    300.50       118.41            296.55            True                      2.50
32  6333  8192                    600.92       136.99            601.94            True                      4.39
32  6333 12313                    316.13       136.45            316.03            True                      2.32
32  6333 16394                    386.11       141.34            386.10            True                      2.73
32  8192  4096                    546.18       140.18            341.14            True                      2.43
32  8192  6333                    218.40       130.65            263.42            True                      2.02
32  8192  8192                    608.29       147.16            542.12            True                      3.68
32  8192 12313                    225.60       135.04            225.23            True                      1.67
32  8192 16394                    434.75       160.42            401.28            True                      2.50
32 12313  4096                    787.80       136.28            583.60            True                      4.28
32 12313  6333                    316.66       125.76            323.35            True                      2.57
32 12313  8192                    891.38       128.88            639.50            True                      4.96
32 12313 12313                    326.11       132.37            325.88            True                      2.46
32 12313 16394                    521.64       139.47            395.69            True                      2.84
32 16394  4096                    625.55       158.46            651.16            True                      4.11
32 16394  6333                    304.14       131.13            284.55            True                      2.17
32 16394  8192                    767.79       162.95            704.34            True                      4.32
32 16394 12313                    310.74       137.68            303.39            True                      2.20
32 16394 16394                    465.92       171.43            465.37            True                      2.71
43  4096  4096                    345.05       133.87            196.47            True                      1.47
43  4096  6333                    148.64        99.92            148.97            True                      1.49
43  4096  8192                    386.50       135.39            214.00            True                      1.58
43  4096 12313                    190.39       109.36            156.27            True                      1.43
43  4096 16394                    203.63       150.24            204.05            True                      1.36
43  6333  4096                    421.35       106.04            132.25            True                      1.25
43  6333  6333                    224.75       113.01            224.97            True                      1.99
43  6333  8192                    471.11       117.61            327.39            True                      2.78
43  6333 12313                    234.55       115.61            234.74            True                      2.03
43  6333 16394                    311.56       132.24            312.01            True                      2.36
43  8192  4096                    400.73       140.12            269.11            True                      1.92
43  8192  6333                    167.32       119.13            168.84            True                      1.42
43  8192  8192                    435.45       146.98            286.21            True                      1.95
43  8192 12313                    161.05       127.82            162.78            True                      1.27
43  8192 16394                    207.16       156.40            208.90            True                      1.34
43 12313  4096                    484.01       120.10            313.35            True                      2.61
43 12313  6333                    234.54       106.63            232.85            True                      2.18
43 12313  8192                    515.34       130.23            411.70            True                      3.16
43 12313 12313                    239.39       130.04            239.03            True                      1.84
43 12313 16394                    316.02       137.39            316.29            True                      2.30
43 16394  4096                    475.60       152.57            340.97            True                      2.23
43 16394  6333                    241.21       132.49            208.59            True                      1.57
43 16394  8192                    499.34       157.43            361.61            True                      2.30
43 16394 12313                    246.25       132.31            211.68            True                      1.60
43 16394 16394                    302.90       158.56            277.05            True                      1.75
64  4096  4096                    280.48       126.82            195.97            True                      1.55
64  4096  6333                    150.94       101.63            150.48            True                      1.48
64  4096  8192                    305.47       135.06            211.03            True                      1.56
64  4096 12313                    158.12       110.06            158.15            True                      1.44
64  4096 16394                    206.68       136.21            201.28            True                      1.48
64  6333  4096                    409.11       105.10            296.07            True                      2.82
64  6333  6333                    229.98       108.46            230.59            True                      2.13
64  6333  8192                    469.32       112.24            330.58            True                      2.95
64  6333 12313                    245.02       117.16            244.84            True                      2.09
64  6333 16394                    317.78       125.80            318.37            True                      2.53
64  8192  4096                    323.42       139.92            267.31            True                      1.91
64  8192  6333                    167.51       118.45            167.56            True                      1.41
64  8192  8192                    341.13       146.71            284.88            True                      1.94
64  8192 12313                    172.21       123.42            171.97            True                      1.39
64  8192 16394                    217.22       153.18            216.99            True                      1.42
64 12313  4096                    482.19       123.32            311.82            True                      2.53
64 12313  6333                    238.73       123.88            238.66            True                      1.93
64 12313  8192                    516.32       122.11            330.50            True                      2.71
64 12313 12313                    248.73       125.32            296.82            True                      2.37
64 12313 16394                    314.98       134.06            320.31            True                      2.39
64 16394  4096                    476.59       154.58            340.84            True                      2.20
64 16394  6333                    240.54       119.60            214.82            True                      1.80
64 16394  8192                    501.36       149.02            359.45            True                      2.41
64 16394 12313                    244.65       126.01            222.47            True                      1.77
64 16394 16394                    302.48       160.36            283.66            True                      1.77
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128232
Approved by: https://github.com/Chillee
2024-06-14 00:31:22 +00:00
cyy
9ebec1f345 Enable Wunused-function in torch_cpu (#128576)
Follows #128499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128576
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-06-14 00:12:58 +00:00
6767e38267 Fix manual licensing (#128630)
It has come to my attention that some of our licenses are incorrect, so I attempted to rectify a few of them based on given recommendations for:
clog - BSD-3
eigen - MPL-2.0
ffnvcodec - LGPL-2.1
-> **hungarian - Permissive (free to use)**
irrlicht - The Irrlicht Engine License (zlib/libpng)
-> **pdcurses - Public Domain for core**
-> **sigslot - Public Domain**
test - BSD-3
Vulkan - Apache-2.0 or MIT
fb-only: more context is here https://fb.workplace.com/groups/osssupport/posts/26333256012962998/?comment_id=26333622989592967

This PR addresses the manual licensing mismatches mentioned above (the two bolded entries; one is being addressed in #128085). As everything else is generated by pulling through other files, I did not address those, and it is unclear what would need to be updated for the remaining entries to be accurate, or whether they are inaccurate today.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128630
Approved by: https://github.com/malfet
2024-06-14 00:12:09 +00:00
afdaa7fc95 [while_loop] expose it as torch.while_loop (#128562)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128562
Approved by: https://github.com/zou3519
2024-06-13 23:44:10 +00:00
c486e2ab64 Add coloring to fx graph print out (#128476)
Note: Won't land immediately, at least I'll need to add a color option to the field. But curious if any tests fail.

Old:
<img width="1294" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/c3a750ed-5e54-4621-b2e4-be5481be15b6">

New:
<img width="1303" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/3a1f1adc-6f3a-413e-8b87-ee53da9bf4ed">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128476
Approved by: https://github.com/ezyang
2024-06-13 23:39:04 +00:00
61421c42c0 [custom_op] don't invoke autograd.Function when unnecessary (#127976)
This matches our autograd logic for pytorch native operators. There's no
need to invoke an autograd.Function if we're under a torch.no_grad() or
if none of the inputs have requires_grad=True (invoking an
autograd.Function results in (noticeable) overhead).
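
A hedged sketch of the dispatch choice described above (names are illustrative; this is not the actual custom_op implementation): only route through an autograd.Function when gradients can actually flow, otherwise call the kernel directly to avoid the overhead.
```python
import torch

class _ScaleFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2.0

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * 2.0

def scale(x: torch.Tensor) -> torch.Tensor:
    # Skip the autograd.Function when no gradient can flow.
    if torch.is_grad_enabled() and x.requires_grad:
        return _ScaleFn.apply(x)
    return x * 2.0  # plain call, no autograd.Function overhead

print(scale(torch.randn(2)).requires_grad)                      # False
print(scale(torch.randn(2, requires_grad=True)).requires_grad)  # True
```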

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127976
Approved by: https://github.com/williamwen42
2024-06-13 23:38:23 +00:00
b72989a2b5 [ONNX] Add upsample trilinear to skip decomp (#128259)
(1) Add upsample trilinear vec to skip decomposition
(2) Add tests to make sure that torch.export.export still decomposes them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128259
Approved by: https://github.com/justinchuby
2024-06-13 23:31:34 +00:00
8c20f53a5e Try seeding individual foreach tests (#128220)
A first easy attempt to deflake foreach

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128220
Approved by: https://github.com/ZainRizvi, https://github.com/crcrpar, https://github.com/huydhn
2024-06-13 22:42:16 +00:00
865d7b3424 [Reland][dynamo] Enable some inlining inbuilt nn module tests (#128440)
Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128440
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-06-13 22:39:22 +00:00
3a0006ef22 Remove global variable SIZE, and fix linter warning (#128559)
- Resolve a TODO by removing global variable `SIZE`.
- Fix a linter warning in `test/test_nestedtensor.py`.

`pytest pytorch/test/test_sort_and_select.py` and ` pytest test/test_nestedtensor.py` pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128559
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-06-13 22:09:51 +00:00
6211e67e49 Document torch.jit.frontend.get_default_args (#128408)
Fixes #127896

### Description
Add docstring to `torch/jit/frontend.py:get_default_args` function

### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128408
Approved by: https://github.com/malfet
2024-06-13 21:49:16 +00:00
bf8a05f483 [FSDP2] Included module FQN in FSDPParamGroup record_functions (#128624)
This PR adds the module FQN into the `FSDPParamGroup` `record_function`s for improved clarity in profiler traces.

Differential Revision: [D58544809](https://our.internmc.facebook.com/intern/diff/D58544809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128624
Approved by: https://github.com/ckluk2
2024-06-13 21:35:33 +00:00
c8e9656a12 Revert "Add test to xfail_list only for abi_compatible (#128506)"
This reverts commit 49366b2640df1cba5a3b40bedd31b57b08529612.

Reverted https://github.com/pytorch/pytorch/pull/128506 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes an inductor test to fail in trunk 49366b2640 ([comment](https://github.com/pytorch/pytorch/pull/128506#issuecomment-2166824714))
2024-06-13 21:30:07 +00:00
8763d44bf1 add xpu to torch.compile (#127279)
As support for Intel GPU has been upstreamed, this PR adds the XPU-related content to the torch.compile doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127279
Approved by: https://github.com/dvrogozh, https://github.com/svekars
2024-06-13 21:15:09 +00:00
790138fdc7 Add profiler annotation for fused_all_gather_matmul and fused_matmul_reduce_scatter (#127556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127556
Approved by: https://github.com/awgu
ghstack dependencies: #127454, #127455
2024-06-13 20:52:46 +00:00
3b28dc6c9d Improve the scheduling for fused_matmul_reduce_scatter (#127455)
In fused_all_gather_matmul, each rank copies their shard into their
local p2p buffer, performs a barrier, then performs (copy -> matmul) for
each remote shard. The (copy -> matmul)s for remote shards run on two
streams without synchronization. This not only allows for
computation/communication overlapping, but also computation/computation
overlapping which alleviates the wave quantization effect caused by
computation decomposition.

However, the synchronization-free approach doesn't work well with
fused_matmul_reduce_scatter, in which there's a barrier in every step.
Without synchronization between the two streams, a matmul in one stream
can delay a barrier in the other stream, further delaying the copy
waiting for the barrier.

This PR addresses the issue by adding synchronization between the two
streams such that the matmul of step i can only start after the barrier
of step i-1 completes. With this approach, we lose the
computation/computation overlapping, but avoid slowdown due to delayed
barrier.
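
A generic sketch of the ordering constraint described above, assuming plain CUDA streams and events are enough to illustrate it (the real logic lives in the fused kernels, not here):
```python
import torch

if torch.cuda.is_available():
    comm_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()
    barrier_done = torch.cuda.Event()

    with torch.cuda.stream(comm_stream):
        # ... barrier / copy for step i-1 would run here ...
        barrier_done.record(comm_stream)

    with torch.cuda.stream(compute_stream):
        # The matmul of step i may only start after the barrier of step i-1.
        compute_stream.wait_event(barrier_done)
        # ... matmul for step i would run here ...
```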

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127455
Approved by: https://github.com/Chillee
ghstack dependencies: #127454
2024-06-13 20:52:46 +00:00
c0b40ab42e doc string for torch.jit.frontend.get_jit_class_def method (#128391)
Fixes #127904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128391
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-06-13 19:51:02 +00:00
a3af32c2fb Add functionality to make ViewAndMutationData (slightly more) cache safe (#127618)
This PR changes the traced_tangents field of ViewAndMutationMeta to be cache safe. Specifically, at runtime, the only time we need the fw_metadata's traced_tangents field is for Tensor subclass metadata from __tensor_flatten__. So instead of storing an entire FakeTensor, which has many fields that can be unserializable, we only store the result of __tensor_flatten__() on any FakeTensors representing subclasses.
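
For context, a minimal sketch of the `__tensor_flatten__` protocol this relies on; the subclass below is purely illustrative, not the FakeTensor machinery itself:
```python
import torch

class TwoTensor(torch.Tensor):
    # Minimal wrapper-subclass skeleton; only the flatten protocol matters here.
    @staticmethod
    def __new__(cls, a, b):
        return torch.Tensor._make_wrapper_subclass(cls, a.shape, dtype=a.dtype)

    def __init__(self, a, b):
        self.a, self.b = a, b

    def __tensor_flatten__(self):
        # (names of inner tensor attributes, opaque metadata) -- small and picklable
        return ["a", "b"], None

    @staticmethod
    def __tensor_unflatten__(inner_tensors, meta, outer_size, outer_stride):
        return TwoTensor(inner_tensors["a"], inner_tensors["b"])

t = TwoTensor(torch.randn(2), torch.randn(2))
print(t.__tensor_flatten__())  # (['a', 'b'], None) -- the kind of data worth caching
```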

That said, there's no guarantee that `__tensor_flatten__` is actually serializable: if we fail to pickle the result of __tensor_flatten__ we won't save to the cache.

To do this, we also make a small change to `__coerce_same_metadata_as_tangent__`, so that it takes in the return value of tensor_flatten() instead of an entire FakeTensor. Let me know if we should change the name of the function.

By doing this, we can now run the dynamic shapes cache test with autograd turned on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127618
Approved by: https://github.com/bdhirsh
2024-06-13 19:45:33 +00:00
39193b10e8 [inductor] fx graph cache: memoize devices to make cache key calculation more predictable (#128366)
Summary: I've seen this issue once in the wild and oulgen was able to repro in a unit test. The problem is this:
- We're using pickle to turn everything related to the FX graph cache key into a byte stream, then hashing the bytes to compute the cache key.
- Pickle is optimized to avoid serializing the same ID more than once; it instead drops a reference to a previously-pickled object if it encounters the same ID.
- That pickle behavior means that we can see different cache keys if an object id appears more than once in the hashed objects vs. being functionally equivalent but distinct objects.

The cases I've investigated only involve the torch.device objects in the tensor graph args. That is, we may compile a graph with two tensor args, each referencing `torch.device('cpu')`. In one run, those devices may reference the same object; in another, they may reference distinct (but equivalent) objects. In practice, my observation is that the compiler is largely deterministic and this situation is rare. I've seen cache misses on a real benchmark only when enabling/disabling FakeTensor caching in order to introduce different code paths that otherwise produce the same fx graph. But the failing unit test seems to be enough motivation for a remediation?

I don't really love this solution, but I've failed to find another way to make the pickling phase robust to these kinds of changes, e.g., by changing the protocol version or by overriding internal methods (which would also be gross). But I'm definitely open to other creative ideas.
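
A toy demonstration of the pickle memoization behavior described above (a stand-in class is used here instead of the real graph args):
```python
import pickle

class Dev:
    """Stand-in for an object like torch.device appearing in the hashed inputs."""
    def __init__(self, kind):
        self.kind = kind

d = Dev("cpu")
shared = pickle.dumps([d, d])                       # 2nd element becomes a memo reference
distinct = pickle.dumps([Dev("cpu"), Dev("cpu")])   # both elements pickled in full
print(shared == distinct)  # False -> different bytes, hence different cache keys
```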

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128366
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-06-13 19:25:14 +00:00
c54e358bdb enable comprehensive padding internally (#128555)
Summary: The feature was previously disabled in fbcode due to breaking the deterministic NE unit tests. Now that it has been on in OSS for quite a while and we have verified that it has no NE impact on CMF, we want to update the unit test and enable the feature.

Test Plan:
```
time buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests -- --exact 'aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests - aps_models.ads.icvr.tests.ne.e2e_deterministic_tests.icvr_fm_test.ICVR_FM_DeterministicTest: test_icvr_fm_pt2_fsdp_multi_gpus'

```

Differential Revision: D58425432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128555
Approved by: https://github.com/eellison
2024-06-13 19:20:00 +00:00
cdc37e4bff Add a shape property to IR nodes (#127818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127818
Approved by: https://github.com/peterbell10
2024-06-13 19:11:52 +00:00
5a80d2df84 [BE] enable UFMT for torch/nn/utils (#128595)
Part of #123062

- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128595
Approved by: https://github.com/Skylion007
2024-06-13 18:34:57 +00:00
9f55c80a9f [AOTI] Fix a minimal_arrayref_interface test failure (#128613)
Summary: When calling a fallback op in the minimal_arrayref_interface mode with an optional tensor, a temporary RAIIAtenTensorHandle needs to be explicitly created in order to pass a pointer to the tensor as the optional tensor parameter.

Test Plan: CI

Differential Revision: D58528575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128613
Approved by: https://github.com/hl475
2024-06-13 18:25:04 +00:00
a265556362 inductor fusion logs: make it easier to attribute to aten graph (#127159)
Summary:

I want to be able to look at inductor fusion logs and reason about which parts of the aot_autograd aten graph were fused / not fused.

This PR adds a short description of each buffer to the fusion logs. Example for forward of `Float8Linear`:

```
torch._inductor.scheduler.__fusion: ===== attempting fusion (1/10): 13 nodes =====
torch._inductor.scheduler.__fusion: fuse_nodes_once, candidates:
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf0'), Reduction(['[254201]', 'max', 'origins={abs_1, max_1}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf3'), Reduction(['[114688]', 'max', 'origins={abs_2, max_2}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf6'), Pointwise(['[]', 'origins={reciprocal_1, convert_element_type_6, clamp_min_2, mul_2, copy_1, reciprocal_3, convert_element_type_5}'])
torch._inductor.scheduler.__fusion:   ExternKernelSchedulerNode(name='buf10')
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf2'), Pointwise(['[]', 'origins={full_default}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf8'), Pointwise(['[8192, 7168]', 'origins={convert_element_type, clamp_min, convert_element_type_1, _scaled_mm, convert_element_type_4, clamp_max, convert_element_type_3, clamp_min_1, copy, convert_element_type_2, mul_1, mul, reciprocal}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf4'), Reduction(['[512]', 'max', 'origins={abs_2, max_2}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf13'), Pointwise(['[8192, 7168]', 'origins={clone_2}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf7'), Pointwise(['[16384, 8192]', 'origins={convert_element_type, clamp_min, convert_element_type_1, _scaled_mm, convert_element_type_4, clamp_max, convert_element_type_3, clamp_min_1, copy, convert_element_type_2, mul_1, mul, reciprocal}'])
torch._inductor.scheduler.__fusion:   ExternKernelSchedulerNode(name='buf9')
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf1'), Reduction(['[528]', 'max', 'origins={abs_1, max_1}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf5'), Pointwise(['[]', 'origins={convert_element_type, clamp_min, convert_element_type_1, copy, reciprocal_2, mul, reciprocal}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf12'), Pointwise(['[8192, 16384]', 'origins={clone_1}'])
torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf7: no shared data
torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf12: no shared data
torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf1: numel/rnumel mismatch (reduce) (528, 1), (254201, 528)
torch._inductor.scheduler.__fusion: cannot fuse buf7 with buf1: nodes numel incompatibility
torch._inductor.scheduler.__fusion: cannot fuse buf12 with buf1: nodes numel incompatibility
torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf7: numel/rnumel mismatch (non-reduce) (1, 134217728), (1, 1)
torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf12: numel/rnumel mismatch (non-reduce) (1, 134217728), (1, 1)
torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf8: intermediate nodes between node1 & node2
torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf13: no shared data
torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf4: numel/rnumel mismatch (reduce) (512, 1), (114688, 512)
torch._inductor.scheduler.__fusion: cannot fuse buf8 with buf4: nodes numel incompatibility
torch._inductor.scheduler.__fusion: cannot fuse buf13 with buf4: nodes numel incompatibility
torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf8: numel/rnumel mismatch (non-reduce) (1, 58720256), (1, 1)
torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf13: numel/rnumel mismatch (non-reduce) (1, 58720256), (1, 1)
torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf9: node2 is extern or nop
torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf9: node2 is extern or nop
torch._inductor.scheduler.__fusion: cannot fuse buf7 with buf9: node2 is extern or nop
torch._inductor.scheduler.__fusion: cannot fuse buf8 with buf9: node2 is extern or nop
torch._inductor.scheduler.__fusion: cannot fuse buf9 with buf10: node1 is extern or nop
torch._inductor.scheduler.__fusion: found 4 possible fusions
torch._inductor.scheduler.__fusion: fusing buf7 with buf12
torch._inductor.scheduler.__fusion: fusing buf8 with buf13
torch._inductor.scheduler.__fusion: fusing buf4 with buf6
torch._inductor.scheduler.__fusion: fusing buf1 with buf5
torch._inductor.scheduler.__fusion: completed fusion round (1/10): fused 13 nodes into 9 nodes
```

Test Plan: will add tests after we align some version of this can land

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127159
Approved by: https://github.com/mlazos
2024-06-13 18:22:02 +00:00
de9a072ac4 Updating the sigslot license to Public Domain (#128085)
It seems that Sigslot's license is Public Domain, not Apache 2. https://sigslot.sourceforge.net

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128085
Approved by: https://github.com/janeyx99
2024-06-13 18:13:54 +00:00
8733c4f4be docs: Add link to test-infra issue (#128608)
It's not immediately obvious from this file that the issue being referred to is in another repo. Add that detail and link to make it easier for folks reading this code to jump to the correct issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128608
Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt, https://github.com/ZainRizvi
2024-06-13 18:00:53 +00:00
dd19c9150c Revert "[aota] compiled forward outputs requires_grad alignment with eager (#128016)"
This reverts commit b459713ca75f6ab7c8a59acec0258e0f77904ada.

Reverted https://github.com/pytorch/pytorch/pull/128016 on behalf of https://github.com/bdhirsh due to fix torchbench regression ([comment](https://github.com/pytorch/pytorch/pull/128016#issuecomment-2166446841))
2024-06-13 17:56:42 +00:00
52f529105d force_stride_order on fused_all_gather_matmul/fused_matmul_reduce_scatter's operands to avoid a copy due to layout transformation (#127454)
When performing fused_all_gather_matmul/fused_matmul_reduce_scatter and gather_dim/scatter_dim != 0, a copy of the lhs operand (A_shard/A) is needed for layout transformation.
This copy can be avoided if the lhs operand already has the following stride order:

    lhs.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim).stride()

In `micro_pipeline_tp` passes, we enforce the lhs operand to have such stride order via `inductor_prims.force_stride_order`. This way if the lhs operand has a flexible layout, the copy is avoided.
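
A small standalone illustration of that stride order (the real pass enforces it via `inductor_prims.force_stride_order`; the shapes below are arbitrary):
```python
import torch

A = torch.randn(4, 6, 8)  # arbitrary lhs operand
gather_dim = 1
# The stride order the lhs should already have for the copy to be skipped.
target_stride = A.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim).stride()
print(A.stride(), target_stride)  # (48, 8, 1) vs (8, 32, 1): a layout change would be needed
```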

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127454
Approved by: https://github.com/Chillee
2024-06-13 17:52:37 +00:00
d5780396c7 Skip debug asserts for mixed dense, subclass views in autograd_not_implemented_fallback (#128057)
Fixes #125503
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128057
Approved by: https://github.com/albanD, https://github.com/soulitzer
ghstack dependencies: #127007
2024-06-13 17:13:02 +00:00
9a8917fdbd Naive CPU kernels for jagged <-> padded dense conversions (#127007)
This PR introduces naive CPU impls for:
* `_jagged_to_padded_dense_forward()`
* `_padded_dense_to_jagged_forward()`

On the CUDA side, these are backed by lifted FBGEMM kernels. We may want to revisit the CPU versions with higher-performance implementations at a later time.
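
As a rough illustration of the conversion these kernels back, the public nested-tensor API pads ragged rows into a dense rectangle (this goes through the public entry point rather than calling the internal ops directly; treat the exact routing as an assumption):
```python
import torch

# Two rows of different lengths, stored with the jagged layout.
nt = torch.nested.nested_tensor(
    [torch.arange(3.0), torch.arange(5.0)], layout=torch.jagged
)
padded = torch.nested.to_padded_tensor(nt, padding=0.0)
print(padded.shape)  # torch.Size([2, 5]) -- ragged rows padded out to a dense rectangle
```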
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127007
Approved by: https://github.com/davidberard98
2024-06-13 17:13:02 +00:00
a0604193a2 handle call_function with Parameter args in DDPOptimizer splitting (#128034)
When nn module inlining is enabled, modules are replaced with the underlying function calls in the output fx graph.
example:
```
class GraphModule(torch.nn.Module):
  def forward(self, L_x_: "f32[1024, 1024]"):
      l_x_ = L_x_

      # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_structured_trace.py:284 in forward, code: return self.layers(x)
      l__self___layers_0: "f32[1024, 1024]" = self.L__self___layers_0(l_x_);  l_x_ = None
      l__self___layers_1: "f32[1024, 1024]" = self.L__self___layers_1(l__self___layers_0);  l__self___layers_0 = None
      return (l__self___layers_1,)
```

will be
```
class GraphModule(torch.nn.Module):
    def forward(self, L_self_layers_0_weight: "f32[1024, 1024]", L_self_layers_0_bias: "f32[1024]", L_x_: "f32[1024, 1024]", L_self_layers_1_weight: "f32[1024, 1024]", L_self_layers_1_bias: "f32[1024]"):
        l_self_layers_0_weight = L_self_layers_0_weight
        l_self_layers_0_bias = L_self_layers_0_bias
        l_x_ = L_x_
        l_self_layers_1_weight = L_self_layers_1_weight
        l_self_layers_1_bias = L_self_layers_1_bias

        # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/linear.py:116 in forward, code: return F.linear(input, self.weight, self.bias)
        input_1: "f32[1024, 1024]" = torch._C._nn.linear(l_x_, l_self_layers_0_weight, l_self_layers_0_bias);  l_x_ = l_self_layers_0_weight = l_self_layers_0_bias = None
        input_2: "f32[1024, 1024]" = torch._C._nn.linear(input_1, l_self_layers_1_weight, l_self_layers_1_bias);  input_1 = l_self_layers_1_weight = l_self_layers_1_bias = None
        return (input_2,)
```
When performing splitting, the DDP optimizer does not handle the inlined graph, because it does not handle function calls; previously we did not have function calls with params as inputs (only calls to modules).

This diff addresses that: it uses the example_value in the arguments to determine the Parameter arguments of a function call
and their properties.
This addresses https://github.com/pytorch/pytorch/issues/127552

running the optimizer on the code above with inlining yields the following splitting:
```
---submod_0 graph---
graph():
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=l_x_]
    %l_self_layers_0_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_0_weight]
    %l_self_layers_0_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_0_bias]
    %linear : [num_users=1] = call_function[target=torch._C._nn.linear](args = (%l_x_, %l_self_layers_0_weight, %l_self_layers_0_bias), kwargs = {})
    return linear

---submod_1 graph---
graph():
    %input_1 : [num_users=1] = placeholder[target=input_1]
    %l_self_layers_1_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_1_weight]
    %l_self_layers_1_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_1_bias]
    %linear : [num_users=1] = call_function[target=torch._C._nn.linear](args = (%input_1, %l_self_layers_1_weight, %l_self_layers_1_bias), kwargs = {})
    return linear

---final graph---
graph():
    %l_self_layers_0_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_0_weight]
    %l_self_layers_0_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_0_bias]
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_]
    %l_self_layers_1_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_1_weight]
    %l_self_layers_1_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_1_bias]
    %submod_0 : [num_users=1] = call_module[target=compiled_submod_0](args = (%l_x_, %l_self_layers_0_weight, %l_self_layers_0_bias), kwargs = {})
    %submod_1 : [num_users=1] = call_module[target=compiled_submod_1](args = (%submod_0, %l_self_layers_1_weight, %l_self_layers_1_bias), kwargs = {})
    return (submod_1,)
---------------

```
whereas without inlining it used to be
```
---submod_0 graph---
graph():
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=l_x_]
    %l__self___layers_0 : [num_users=1] = call_module[target=L__self___layers_0](args = (%l_x_,), kwargs = {})
    return l__self___layers_0
/data/users/lsakka/pytorch/pytorch/torch/_inductor/compile_fx.py:133: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(

---submod_1 graph---
graph():
    %l__self___layers_0 : [num_users=1] = placeholder[target=l__self___layers_0]
    %l__self___layers_1 : [num_users=1] = call_module[target=L__self___layers_1](args = (%l__self___layers_0,), kwargs = {})
    return l__self___layers_1

---final graph---
graph():
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_]
    %submod_0 : [num_users=1] = call_module[target=compiled_submod_0](args = (%l_x_,), kwargs = {})
    %submod_1 : [num_users=1] = call_module[target=compiled_submod_1](args = (%submod_0,), kwargs = {})
    return (submod_1,)
---------------
```

TESTING:

(1) running
``` TORCHDYNAMO_INLINE_INBUILT_NN_MODULES=1   pytest test/distributed/test_dynamo_distributed.py -k ```
results in a reduction in failures from 6 to 2 with this PR.

The two remaining failures are FSDP-related, which do not sound trivial and involve many details; I will leave them for future work.

Co-authored-by: Animesh Jain <anijain@umich.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128034
Approved by: https://github.com/anijain2305, https://github.com/wconstab
2024-06-13 17:07:27 +00:00
3e3435678c Remove some implications from the static_eval pattern matcher (#128500)
We should be able to remove this as, with the new canonicalisation, we
have that `a < b` and `-a > -b` should be canonicalised to the same
expression (if SymPy does not interfere too much).

nb. I thought this would cut the compilation time further, but I was running
the benchmarks wrong (not removing triton's cache, oops). It turns out that
after the first PR in this stack, https://github.com/pytorch/pytorch/issues/128398 is fully fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128500
Approved by: https://github.com/ezyang
ghstack dependencies: #128410, #128411
2024-06-13 16:50:00 +00:00
0fdd8d84fa Do not generate -1* in SymPy expressions when canonicalising (#128411)
Partially addresses https://github.com/pytorch/pytorch/issues/128150
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128411
Approved by: https://github.com/ezyang
ghstack dependencies: #128410
2024-06-13 16:49:59 +00:00
bdeb9225b0 Do not call get_implications unnecessarily (#128410)
This should improve compilation times. With this PR and the patch in
the original issue, I get a compilation time of `Compilation time: 307.30 second`.

Fixes https://github.com/pytorch/pytorch/issues/128398
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128410
Approved by: https://github.com/Chillee
2024-06-13 16:49:55 +00:00
cyy
e2a72313e8 Concat namespaces of torch/csrc/profiler code and other fixes (#128606)
Improve namespaces and modernize codebase of torch/csrc/profiler code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128606
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2024-06-13 16:46:34 +00:00
7c370d2fb0 expose set_thread_name to Python and set thread names (#128448)
This adds a new multiprocessing method `_set_thread_name` and calls it from the torchelastic and dataloader main functions (see the short usage sketch after the list below). This will allow better monitoring of processes, as we can separate elastic and dataloading processes from the main training process.

Threads named:

* torchrun/elastic
* PyTorch dataloader worker processes + pin memory thread
* TCPStore
* ProcessGroupNCCL background threads
* WorkerServer httpserver thread
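
A minimal usage sketch, assuming the helper is exposed as `torch.multiprocessing._set_thread_name` as described (it is a private API, so the exact location may differ):
```python
import torch.multiprocessing as mp

mp._set_thread_name("pt_example")  # the name then shows up in `ps -eL` output
```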

Test plan:

```
$ torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c 'ps -eL | grep pt_'
3264281 3264281 pts/45   00:00:02 pt_elastic
3264281 3267950 pts/45   00:00:00 pt_elastic
```

dataloading

```py
import torch
import time

from torch.utils.data import (
    DataLoader,
    Dataset,
)

class NoopDataset(Dataset):
    def __getitem__(self, index):
        return index

    def __len__(self):
        return 10

dataloader = DataLoader(NoopDataset(), num_workers=2)

for i, x in enumerate(dataloader):
    print(i, x)
    time.sleep(10000)
```

```
$ python3 ~/scripts/dataloader_test.py
$ ps -eL | grep pt_
1228312 1228312 pts/45   00:00:02 pt_main_thread
1228312 1230058 pts/45   00:00:00 pt_main_thread
1228312 1230059 pts/45   00:00:00 pt_main_thread
1230052 1230052 pts/45   00:00:00 pt_data_worker
1230052 1230198 pts/45   00:00:00 pt_data_worker
1230052 1230740 pts/45   00:00:00 pt_data_worker
1230055 1230055 pts/45   00:00:00 pt_data_worker
1230055 1230296 pts/45   00:00:00 pt_data_worker
1230055 1230759 pts/45   00:00:00 pt_data_worker
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128448
Approved by: https://github.com/c-p-i-o, https://github.com/andrewkho, https://github.com/rsdcastro
2024-06-13 16:38:23 +00:00
b05b8d3989 [EZ][ALI Migration] Add logging for workflow type determination (#128619)
To help figure out what went wrong when the wrong label appears to have been set
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128619
Approved by: https://github.com/zxiiro, https://github.com/clee2000
2024-06-13 16:37:07 +00:00
e9b81e4edf Fakify torch bind input by default (#128454)
Summary: Try a reland of https://github.com/pytorch/pytorch/pull/127116 after some fixes landed

Differential Revision: D58418251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128454
Approved by: https://github.com/angelayi
2024-06-13 16:25:11 +00:00
c63ccead5e Revert "[dynamo] Enable some inlining inbuilt nn module tests (#128440)"
This reverts commit 1602c7d0c861a4382746ccb18c76d8703a636f4e.

Reverted https://github.com/pytorch/pytorch/pull/128440 on behalf of https://github.com/clee2000 due to new test broke internally D58501220 ([comment](https://github.com/pytorch/pytorch/pull/128440#issuecomment-2166127531))
2024-06-13 16:14:37 +00:00
17b45e905a Fix get output code when caching is enabled (#128445)
Summary: Improve output code retrieval mechanism so that it works in the presence of cache hits.

Test Plan: ci

Differential Revision: D58429602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128445
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/masnesral
2024-06-13 16:00:30 +00:00
93a14aba6e [BE]: Update mypy to 1.10.0 (#127717)
Updates mypy to the latest and greatest.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127717
Approved by: https://github.com/ezyang
2024-06-13 15:57:13 +00:00
49366b2640 Add test to xfail_list only for abi_compatible (#128506)
https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode.
It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode.

We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode.

- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-13 15:32:15 +00:00
cf7adc2fa1 [Inductor] Update Intel GPU Triton commit pin. (#124842)
Update Intel triton for Pytorch 2.4 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124842
Approved by: https://github.com/EikanWang
2024-06-13 14:34:37 +00:00
edb45dce85 Add OpInfo entry for as_strided_copy (#127231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127231
Approved by: https://github.com/lezcano
2024-06-13 13:58:47 +00:00
7cc07a3eb1 [custom_op] stop using nonlocals to store information (#128547)
Fixes https://github.com/pytorch/pytorch/issues/128544
Fixes https://github.com/pytorch/pytorch/issues/128535

We had a problem with multithreading where the nonlocals were being
clobbered. In the first place, we stored these nonlocals because we
wanted to ferry information from an autograd.Function.apply to
autograd.Function.forward.

Our new approach is:
- pass the information directly as an input to the
  autograd.Function.apply. This means that the autograd.Function.forward
  will receive the information too (see the minimal sketch after this list).
- this messes up ctx.needs_input_grad, which has an element per input to
  forward. The user should not see the additional information we passed.
  We fix this by temporarily overriding ctx.needs_input_grad to the
  right thing.
- this exposed a bug in that ctx.needs_input_grad wasn't correct for
  TensorList inputs. This PR fixes that too.
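
A minimal sketch of the first point above (names are illustrative, not the actual custom_op internals): the extra bookkeeping rides along as an argument to apply instead of living in a nonlocal that another thread could clobber.
```python
import torch

class _OpFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, extra_info, x):
        # extra_info is plain data that used to be smuggled through a nonlocal.
        ctx.scale = extra_info["scale"]
        return x * ctx.scale

    @staticmethod
    def backward(ctx, grad_out):
        # No gradient for the extra_info slot.
        return None, grad_out * ctx.scale

x = torch.randn(3, requires_grad=True)
y = _OpFn.apply({"scale": 2.0}, x)
y.sum().backward()
print(torch.allclose(x.grad, torch.full_like(x, 2.0)))  # True
```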

Test Plan:
- existing and new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128547
Approved by: https://github.com/williamwen42, https://github.com/soulitzer
2024-06-13 13:36:39 +00:00
2b9465d62a [aota] Allow some mutations in backward (#128409)
https://github.com/pytorch/pytorch/issues/127572

Allow mutations in backward on forward inputs, if:
1/ the mutation does not change metadata
(enforced at compilation time), and

2/ when create_graph=True, the mutated input does not require grad
(enforced at runtime, where create_graph mode can be detected by checking torch.is_grad_enabled()).

Adding input_joint_info to track mutations of inputs during joint.
Created a separate field in ViewAndMutationMeta as it is filled only after joint fn tracing.
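
For reference, a hedged eager-mode illustration of the kind of pattern this enables under compilation (the snippet only shows the eager pattern; nothing here exercises the new tracing logic):
```python
import torch

class ScaleAndCount(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, counter):
        ctx.save_for_backward(counter)
        return x * 2.0

    @staticmethod
    def backward(ctx, grad_out):
        (counter,) = ctx.saved_tensors
        counter.add_(1)            # data mutation of a forward input in backward
        return grad_out * 2.0, None

x = torch.randn(4, requires_grad=True)
counter = torch.zeros(1)           # forward input that does not require grad
ScaleAndCount.apply(x, counter).sum().backward()
print(counter)  # tensor([1.])
```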

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128409
Approved by: https://github.com/bdhirsh
2024-06-13 12:09:08 +00:00
d0c08926d1 allow inlining functions in _python_dispatch and _is_make_fx_tracing (#128485)
This fixes graph breaks in the torch_multimodal_clip benchmark.

Co-authored-by: Animesh Jain <anijain@umich.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128485
Approved by: https://github.com/anijain2305
ghstack dependencies: #128428
2024-06-13 09:56:39 +00:00
1fd2cd26a0 [inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545)
As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen.
4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545
Approved by: https://github.com/jansel
2024-06-13 09:46:22 +00:00
c897651392 [inductor] Add BackendFeature gating (#128266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128266
Approved by: https://github.com/shunting314
2024-06-13 07:31:51 +00:00
88974fedd0 Clean up xpu ut to make CI happy (#128383)
# Motivation
Before #127611 was merged, the xpu-specific UT `test/test_xpu.py` was skipped temporarily. This PR aims to fix the UT bug introduced by #127741.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128383
Approved by: https://github.com/EikanWang
2024-06-13 07:06:41 +00:00
ce79b09415 [CUDA][Sparse] Change comparison function of test_sparse_semi_structured.py and bump tolerances for sp24_matmuls (#128553)
Minor tweak of comparison as using `assert` on `torch.allclose` prevents the mismatches from being logged. Also bump a few tolerances that seem to be causing failures on sm86/sm90

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128553
Approved by: https://github.com/jcaip
2024-06-13 06:58:07 +00:00
0678742924 [MPS] Add Metal implementation of exp op (#128421)
To improve accuracy, use `precise::exp()` (and `precise::sin()`/`precise::cos()` for complex flavor)
Reuse `test_exp1` to check that accuracy of `exp` ops is sometimes closer to CPU

Fix bug in non-contiguous tensors handling

Fixes https://github.com/pytorch/pytorch/issues/84936
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128421
Approved by: https://github.com/kulinseth
ghstack dependencies: #128373, #128375
2024-06-13 06:53:17 +00:00
14c9eb5ed2 Add XPU code owners (#128486)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128486
Approved by: https://github.com/atalman, https://github.com/malfet
2024-06-13 06:33:45 +00:00
518c9e6455 Forward fix lint (#128587)
merge at will
After https://github.com/pytorch/pytorch/pull/125968
and https://github.com/pytorch/pytorch/pull/127693
landrace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128587
Approved by: https://github.com/huydhn
2024-06-13 06:19:03 +00:00
c52eda896e [dynamo][trace_rules] Remove incorrectly classified Ingraph functions (#128428)
Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128428
Approved by: https://github.com/yanboliang, https://github.com/mlazos
ghstack dependencies: #126578, #128440, #128470, #128453, #128484
2024-06-13 06:08:56 +00:00
1f6e84fa68 [inductor][mkldnn] Use floats instead of ints for pattern matcher test (#128484)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128484
Approved by: https://github.com/mlazos
ghstack dependencies: #126578, #128440, #128470, #128453
2024-06-13 06:08:56 +00:00
ea541dd965 SymIntify cross_entropy_loss_prob_target numel call (#128141)
This PR replaces call to ```numel``` with ```sym_numel``` in cross_entropy_loss_prob_target.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128141
Approved by: https://github.com/ezyang
2024-06-13 05:37:17 +00:00
ade3d07483 GGML inspired int8 MM Metal shader (#127646)
## Context

This PR ported GGML int8 per-channel matrix multiplication and matrix-vector multiplication Metal shaders into the ATen library.
llama.cpp LICENSE: https://github.com/ggerganov/llama.cpp/blob/master/LICENSE

## Key Changes

Made the following changes to the original code:

* Memory layout of weight and scales is different than llama.cpp.
* Weight dequantization (scales multiplication) is done after MM is finished.
* Following PyTorch naming convention (M, K, N and assuming row major).

## Benchmark

When M = 1, the mv shader improves on the existing ATen int8mm by 40%.
When M > 4, the mm shader outperforms the existing ATen int8mm by up to 10x for large M, as shown below.
![image](https://github.com/pytorch/pytorch/assets/8188269/fd9eff71-c538-4263-a7b5-f96fe479ae9d)

Hence the kernel chooses different shaders based on M.

## Test Plan

Tests are passing:
```
❯ python test/test_mps.py -v -k _int8_
/Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'dlopen(/Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so, 0x0006): Symbol not found: __ZN3c1017RegisterOperatorsD1Ev
  Referenced from: <A770339A-37C9-36B2-84FE-4125FBE26FD6> /Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so
  Expected in:     <5749F98A-0A0C-3F89-9CBF-277B3C8EA00A> /Users/larryliu/CLionProjects/pytorch/torch/lib/libtorch_cpu.dylib'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
test__int8_mm_m_1_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok

----------------------------------------------------------------------
Ran 12 tests in 1.180s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127646
Approved by: https://github.com/malfet
2024-06-13 05:23:56 +00:00
b86b4ace88 Invalidate eager params when inlining and freezing nn modules (#128543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128543
Approved by: https://github.com/anijain2305
2024-06-13 04:50:17 +00:00
83bb9b7c53 [BE] explicitly export subpackage torch.utils (#128342)
Resolves #126401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128342
Approved by: https://github.com/Skylion007
ghstack dependencies: #127707
2024-06-13 04:39:16 +00:00
2229884102 Introduce int_oo (#127693)
In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range.

After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better.

But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. **test/test_sympy_utils.py** describes some basic properties of the number, and **torch/utils/_sympy/numbers.py** has the actual implementation.
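
A short usage sketch based on the description above (module path taken from the text; the exact printed forms are assumptions):
```python
from torch.utils._sympy.numbers import int_oo

print(int_oo.is_integer)   # True, unlike sympy.oo
print(int_oo > 2**200)     # still usable as an upper bound in comparisons
print(-int_oo < 0)         # the negated form behaves like negative infinity
```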

The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments.

Fixes https://github.com/pytorch/pytorch/issues/127396

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693
Approved by: https://github.com/lezcano
ghstack dependencies: #126905
2024-06-13 04:08:20 +00:00
d3b8230639 Fix profiler_kineto Clang errors (#128464)
Summary: There are clang errors in profiler_kineto. It would probably be a good idea to fix them as the file is already quite dense.

Test Plan: Make sure all on Phabricator all tests under static_tests/lint_root pass

Differential Revision: D58431005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128464
Approved by: https://github.com/aaronenyeshi
2024-06-13 03:10:50 +00:00
d630e1e838 Revert "[dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269)"
This reverts commit f2d7f235a684c593f5a1ff2ca0b47b47274bfe85.

Reverted https://github.com/pytorch/pytorch/pull/128269 on behalf of https://github.com/anijain2305 due to incorrect ([comment](https://github.com/pytorch/pytorch/pull/128269#issuecomment-2164267320))
2024-06-13 03:04:26 +00:00
7fe9ab9ccc update amp example to device-agnostic (#127278)
As support for Intel GPU has been upstreamed, this PR makes the AMP example doc device-agnostic.

Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127278
Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/svekars
2024-06-13 02:01:16 +00:00
cyy
3f9b8446cf [8/N] Remove unused functions (#128499)
Follows #128407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128499
Approved by: https://github.com/malfet
2024-06-13 01:15:11 +00:00
ede74940a1 optimize vec isa check dispatch logical. (#128320)
Optimize the CPU vec ISA check dispatch by architecture; it makes the code easier to read and maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128320
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-13 01:06:34 +00:00
c1cd946818 [cond] add a set_ and data mutation expected failure test (#128457)
A follow up of the discussion in https://github.com/pytorch/pytorch/pull/126936.

Cond errors out early because of a graph break triggered by DelayGraphBreakVariable, which is created due to `aten.set_` [here](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/tensor.py#L366-L376).

We might need to see what happens to this test if we allow graph breaks in higher order ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128457
Approved by: https://github.com/zou3519
2024-06-13 00:16:59 +00:00
c472cec565 [checkpoint] Clean up selective activation checkpoint and make public (#125795)
Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit

Memory considerations:
- As with the existing SAC, cached values are cleared upon first use.
- We error if the user wishes to backward a second time on a region forwarded with SAC enabled.

In-place:
- We use version counting to detect whether any cached tensor has been mutated, and error in that case. In-place operations not mutating cached tensors are allowed.
- `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user is cleverly also saves the output of the in-place)

Randomness, views
- Currently in this PR, we don't do anything special for randomness or views, the author of the policy function is expected to handle them properly. (Would it would be beneficial to error? - we either want to save all or recompute all random tensors)

Tensor object preservation
- We guarantee that if a tensor does not require grad, and it is saved, then what you get out is the same tensor object. If the tensor does require grad, we must detach to avoid creating a reference cycle. This is a nice guarantee for nested tensors which care about the object identity of the offsets tensor.

Policy function
- Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error.
- The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3).
- The number of times we call the policy_fn is documented as part of the public API. We call the policy function for all ops except detach, because detach is itself called a different number of times by AC between forward and recompute.
- The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute see below).
Tensors guaranteed to be the same tensor as-is
- Policy function signature takes ctx object as its first argument. The ctx function is an object encapsulating info that may be useful to the user, it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary.

"bc-breaking" for existing users of the private API:
- Existing policy functions must now change their return value to use the Enum.
- Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `gen_selective_checkpoint_context_fn`. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795
Approved by: https://github.com/Chillee, https://github.com/fmassa
2024-06-12 23:57:33 +00:00
25b7537a27 doc comment typo fixes and improvements (#128512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128512
Approved by: https://github.com/LucasLLC
2024-06-12 23:55:09 +00:00
eb1db6702f [2nd try][AOTI] Switch to use shim v2 (#128521)
Test Plan: Sandcastle

Differential Revision: D58470269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128521
Approved by: https://github.com/desertfire
2024-06-12 23:44:24 +00:00
4423e1bbdc [release] Increase version 2.4.0->2.5.0 (#128514)
Same as https://github.com/pytorch/pytorch/pull/121974
Branch cut for 2.4.0 is complete, hence advancing the main version to 2.5.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128514
Approved by: https://github.com/malfet
2024-06-12 23:40:01 +00:00
3bc2004f91 [ts_converter] Fix prim::dtype (#128517)
Summary: prim::dtype has the signature `(Tensor a) -> int`: it gets the dtype of the tensor and returns the integer corresponding to that dtype based on the enum in ScalarType.h. Previously we were converting prim::dtype by returning the actual dtype of the tensor (ex. torch.float32). This caused some incorrect control flow behavior, specifically where it checks if `prim::dtype(tensor) in [3, 5, 7]`, where [3, 5, 7] correspond to torch.int32, torch.float16, torch.float64. This control flow would always return False because we would be comparing torch.float32 against the integers [3, 5, 7], which is a type mismatch.
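
A small illustration of the mismatch (assuming scripted `Tensor.dtype`, i.e. prim::dtype, yields the ScalarType integer as described):
```python
import torch

@torch.jit.script
def dtype_code(x: torch.Tensor) -> int:
    return x.dtype  # prim::dtype -> int (3 = int32, 5 = float16, 7 = float64)

print(dtype_code(torch.zeros(1, dtype=torch.float64)))  # 7, not torch.float64
```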

Test Plan: 7/22 internal models now are convertable and runnable in eager and sigmoid! P1410243909

Reviewed By: jiashenC

Differential Revision: D58469232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128517
Approved by: https://github.com/jiashenC
2024-06-12 23:02:50 +00:00
2fa6f80b13 Perform reciprocal optimization with foreach_div (#128433)
Fixes https://github.com/pytorch/pytorch/issues/114165

Internal xref
https://fb.workplace.com/groups/1144215345733672/posts/2801223606699496/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128433
Approved by: https://github.com/awgu
2024-06-12 22:57:03 +00:00
8db4a41973 Use computeStorageNbytesContiguous if possible (#128515)
```at::detail::computeStorageNbytesContiguous``` does fewer data-dependent tests compared to ```at::detail::computeStorageNbytes```. Therefore, use of the former is more likely to succeed with dynamic shapes. This PR detects is_contiguous and dispatches to the appropriate function. This should be helpful in unblocking aot_eager for torchrec. As an aside, this is an alternative solution to the unsound solution I had first proposed in another [PR](#128141).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128515
Approved by: https://github.com/ezyang
2024-06-12 22:53:06 +00:00
e2610240f9 [ROCm] Enable several inductor UTs (#127761)
Fixes #ISSUE_NUMBER

Needs https://github.com/pytorch/pytorch/pull/125396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127761
Approved by: https://github.com/peterbell10, https://github.com/pruthvistony
2024-06-12 22:47:45 +00:00
bb3cf8a339 Lift inductor lowerings for jagged <-> padded dense kernels (#125968)
This PR lifts internal lowerings written for FBGEMM kernels that do jagged <-> padded dense conversions. In particular, this PR provides lowerings and meta registrations for the following ATen ops:
* `_jagged_to_padded_dense_forward()`
* `_padded_dense_to_jagged_forward()`
    * NB: if `total_L` is not provided, the output shape is data-dependent. An unbacked SymInt is used for this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125968
Approved by: https://github.com/davidberard98
2024-06-12 22:46:09 +00:00
b4a7b543e5 Add targeted unit tests for guards-related functions used in the codecache (#128482)
Summary: Add a few unit tests that exercise `produce_guards_expression` and `evaluate_guards_expression` (and specifically "ToFloat" "FloatTrueDiv" added in https://github.com/pytorch/pytorch/pull/128418)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128482
Approved by: https://github.com/ezyang
ghstack dependencies: #128418
2024-06-12 22:41:50 +00:00
1f302d6885 Support aten operations with out tensor (#124926)
This PR intends to support the aten operations with the `out` tensor.

Currently, the AOT compile always does **NOT** keep input tensor mutations. According to the comments, this is because it has not encountered such a use case.
> For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to.

However, for aten operations it is common for the `out` tensor to be an input parameter that needs to be mutated. This PR intends to support this by adding a `keep_inference_input_mutations` flag (`aot_inductor.keep_inference_input_mutations`). The flag gives the callee the flexibility to decide whether AOT compile needs to keep input tensor mutations in the graph.

Take `clamp` as an example as follows.
```python
out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0)
inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0)
min_tensor = inp_tensor - 0.05
max_tensor = inp_tensor + 0.05
torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor)
```

W/O this PR
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";

    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1);  clamp_min = arg2_1 = None
    return (clamp_max, clamp_max)
```

W/ this PR
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";

    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1);  clamp_min = arg2_1 = None
    copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max);  arg3_1 = clamp_max = None
    return (copy_,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi
2024-06-12 22:31:59 +00:00
f4edd67fe7 [c10d] fix OSS commSplit bug (#128459)
Summary:
D56907877 modified OSS commSplit. However, commSplit requires every rank to participate in the call, even ranks with no color. ncclCommSplit will not create a communicator for no-color ranks, hence this line of code can potentially throw an error like `NCCL WARN CommUserRank : comm argument is NULL`

Revert this change from D56907877

Test Plan: CI

Differential Revision: D58436088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128459
Approved by: https://github.com/shuqiangzhang
2024-06-12 22:29:01 +00:00
f39ab8a0fe Fix side effect pruning (#128028)
Summary:
The previous side effect pruning algorithm would keep many dead cell
variables alive. For example, in
https://github.com/pytorch/pytorch/issues/125078, the compiled function
has one return but there were three in the Dynamo graph due to two
dead cell variables not being pruned away.

This PR adds a corrected algorithm. "new cell variables" are alive if
they can be reached from one of the following:
1. any of the tx.symbolic_locals or tx.stack (that is, if they are
   involved in a return from the function or intermediate variable
   during a graph break). Example: an alive NestedUserFunctionVariable
2. "mutations to pre-existing objects". Example: appending a
   NestedUserFunctionVariable to a global list

The new algorithm reflects this, but please let me know if there are
more cases to handle.
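
A minimal sketch of that reachability rule (the function and its helpers are hypothetical, not Dynamo's actual SideEffects code):

```python
def prune_dead_cells(new_cell_vars, roots, mutated_preexisting, children_of):
    """Keep only the new cell variables reachable from the roots.

    roots: tx.symbolic_locals + tx.stack; mutated_preexisting: pre-existing
    objects with recorded mutations; children_of(var): sub-variables
    referenced by var (a stand-in for the real traversal).
    """
    alive, stack = set(), list(roots) + list(mutated_preexisting)
    while stack:
        var = stack.pop()
        if id(var) in alive:
            continue
        alive.add(id(var))
        stack.extend(children_of(var))
    return [cell for cell in new_cell_vars if id(cell) in alive]
```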

Test Plan:
- existing tests (afaict, test/dynamo/test_python_autograd is the best
  SideEffects test case we have)
- see in test/dynamo/test_higher_order_ops that the expecttests changed
  -- the functorch dynamo graphs no longer return dead cellvars.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028
Approved by: https://github.com/jansel
2024-06-12 22:25:37 +00:00
cyy
3008644297 [Caffe2] Remove remaining unused perfkernels (#128477)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128477
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-06-12 22:19:36 +00:00
55a6b38f52 [inductor] enable fx graph cache on torchbench (#128239)
Summary: We've already enabled the FX graph cache for timm and huggingface, but we had failures saving cache entries for moco. It looks like https://github.com/pytorch/pytorch/pull/128052 has fixed that issue, so we can enable it for torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128239
Approved by: https://github.com/oulgen
2024-06-12 22:15:02 +00:00
6206da55ef Fix lint after #119459 (#128558)
TSIA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128558
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet
2024-06-12 22:11:37 +00:00
2b28b107db [dynamo][fsdp] Dont take unspecializedNNModuleVariable path for FSDP modules (#128453)
Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128453
Approved by: https://github.com/yf225
ghstack dependencies: #126578, #128440, #128470
2024-06-12 22:03:45 +00:00
6aef2052ea Save backward graphs lazily to cache (#126999)
This PR makes it so we lazily save to the cache when backward is called, instead of always saving ahead of time. We have to pass a closure to post_compile to prevent cyclic dependencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126999
Approved by: https://github.com/bdhirsh
ghstack dependencies: #126791
2024-06-12 21:58:34 +00:00
87072dcfdb Change Dynamo's custom ops warning message to be less spammy (#128456)
This is a short-term fix (for 2.4). In the longer term we should
fix https://github.com/pytorch/pytorch/issues/128430

The problem is that warnings.warn calls inside Dynamo print
all the time. Python warnings are supposed to print once, unless their
cache is reset: Dynamo ends up resetting that cache every time it runs.

As a workaround we provide our own warn_once cache that is keyed on the
warning msg. I am not worried about this increasing memory usage because
that's effectively what python's warnings.warn cache does.
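
The workaround boils down to a message-keyed cache along these lines (a sketch of the idea, not the exact helper added in this PR):

```python
import warnings

_warned_msgs = set()

def warn_once(msg, category=UserWarning):
    # unlike warnings.warn's own once-filter, this cache is never reset by
    # Dynamo, so each distinct message prints at most once per process
    if msg in _warned_msgs:
        return
    _warned_msgs.add(msg)
    warnings.warn(msg, category)
```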

Test Plan:
- fix tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128456
Approved by: https://github.com/anijain2305
2024-06-12 21:57:12 +00:00
c53d65b3d3 [inductor] fix linear add bias pattern (#128473)
Fix https://github.com/pytorch/pytorch/issues/128287.
Previously, the assertions in `linear_add_bias` were pretty fragile
```
assert packed_weight_node.name == "_reorder_linear_weight"
assert transpose_weight_node.name == "permute_default"
```
because the `name` can change to `_reorder_linear_weight_<id>`, `permute_default_<id>` if we have more than one reorder/permute.

Checking `target` instead of `name` solves this issue.

The UT is also updated to match more than one `linear_add_bias` pattern to cover this case.

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128473
Approved by: https://github.com/jgong5
2024-06-12 21:55:35 +00:00
bb13fad7aa Share TCPStore by default when using c10d rdzv handler (#128096)
Summary:
A number of features rely on the TCPStore as a control plane. By default the TCPStore server is started on the rank0 trainer, which can create a race condition: rank0 may exit (on error or on graceful exit) while other ranks are still reading/writing, and those reads/writes will fail.

Solution: the TCPStore server should outlive all the trainer processes. Moving ownership of the TCPStore to the torchelastic agent naturally fixes the lifecycle of the server.

Static rendezvous in torchelastic already supports sharing of the TCPStore server. We are extending this to the more commonly used c10d rendezvous handler.

Any handler that would like to manage the TCP store has to (a minimal sketch follows the note below):
- Return true from the `use_agent_store` property
- Make the `RendezvousInfo.RendezvousStoreInfo` `master_addr`/`master_port` values refer to the managed TCPStore (these are returned from the `next_rendezvous` call)

Note: in some instances users may want to use non-TCPStore-based stores for the torchelastic rendezvous process, so the handler will need to create and hold a reference to the TCPStore (as done in this change).
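
A minimal sketch of what such a handler exposes; the property and the store-info fields follow the description above, while everything else (including the class itself) is illustrative:

```python
from torch.distributed import TCPStore

class SharedStoreRendezvousHandler:  # illustrative, not the real c10d handler
    def __init__(self, host: str, port: int, world_size: int):
        # the handler owns the TCPStore so it outlives all trainer processes
        self._store = TCPStore(host_name=host, port=port,
                               world_size=world_size, is_master=True)
        self._host, self._port = host, port

    @property
    def use_agent_store(self) -> bool:
        return True  # tell torchelastic that the agent-managed store is shared

    def next_rendezvous(self):
        # a real handler returns a RendezvousInfo whose RendezvousStoreInfo
        # carries master_addr/master_port pointing at self._store
        raise NotImplementedError
```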

Test Plan:
`cat ~/workspace/dist-demo/stores.py`
~~~
import logging
import os
import sys
import time

import torch
import torch.distributed as dist

logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stderr))
logger.setLevel(logging.INFO)

def _run_test(store):

    if dist.get_rank() == 1:
        logger.info("Rank %s is sleeping", dist.get_rank())
        time.sleep(5)
        key = "lookup_key"
        logger.info("Checking key %s in store on rank %s", key, dist.get_rank())
        store.check([key])
    else:
        logger.info("rank %s done", dist.get_rank())

def main() -> None:
    use_gpu = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_gpu else "gloo")
    dist.barrier()

    logger.info(f"Hello World from rank {dist.get_rank()}")

    host = os.environ['MASTER_ADDR']
    port = os.environ['MASTER_PORT']
    world_size = os.environ['WORLD_SIZE']

    logger.info("testing TCPStore")
    store = dist.TCPStore(
        host_name=host, port=int(port), world_size=int(world_size),
    )
    _run_test(store)

if __name__ == "__main__":
    main()
~~~

With the fix (TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 or just drop the option)
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 1
Hello World from rank 2
Hello World from rank 0
testing TCPStore
testing TCPStore
testing TCPStore
rank 2 done
Rank 1 is sleeping
rank 0 done
Checking key lookup_key in store on rank 1
~~~

TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 0
Hello World from rank 2
Hello World from rank 1
testing TCPStore
testing TCPStore
testing TCPStore
rank 0 done
rank 2 done
Rank 1 is sleeping
Checking key lookup_key in store on rank 1
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/kurman/workspace/dist-demo/stores.py", line 46, in <module>
[rank1]:     main()
[rank1]:   File "/home/kurman/workspace/dist-demo/stores.py", line 42, in main
[rank1]:     _run_test(store)
[rank1]:   File "/home/kurman/workspace/dist-demo/stores.py", line 22, in _run_test
[rank1]:     store.check([key])
[rank1]: torch.distributed.DistNetworkError: Connection reset by peer
E0605 17:40:22.853277 140249136719680 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 2279237) of binary: /home/kurman/.conda/envs/pytorch_38/bin/python
Traceback (most recent call last):
  File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/users/kurman/pytorch/torch/distributed/run.py", line 904, in <module>
    main()
  File "/data/users/kurman/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/users/kurman/pytorch/torch/distributed/run.py", line 900, in main
    run(args)
  File "/data/users/kurman/pytorch/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/kurman/workspace/dist-demo/stores.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-05_17:40:22
  host      : devgpu011.cln5.facebook.com
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2279237)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
~~~

Differential Revision: D58180193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128096
Approved by: https://github.com/shuqiangzhang
2024-06-12 21:49:42 +00:00
c0ea8fc3a3 Disable inlining nn modules on static inputs tests (#128529)
With inlining of NN modules these tests no longer raise runtime errors, because changing static ptrs induces a re-recording instead of a runtime error. The solution is to run the test with inlining disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128529
Approved by: https://github.com/anijain2305
ghstack dependencies: #128528
2024-06-12 21:40:29 +00:00
ff3ba99320 Disable inline nn modules on unstable ptr test (#128528)
With inlining of NN modules these tests no longer raise runtime errors, because changing static ptrs induces a re-recording instead of a runtime error. The solution is to run the test with inlining disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128528
Approved by: https://github.com/anijain2305
2024-06-12 21:40:29 +00:00
1026b7cfbe Add docstring for the torch.typename function (#128129)
Fixes: #127885

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128129
Approved by: https://github.com/malfet
2024-06-12 21:34:20 +00:00
cba840fde9 Fix accidental variable shadow (#128460)
Fixes #128322

We should probably crank up clang's warning levels...

Test:
```
import torch

def addmv_slice(input, mat, vec, slice_op):
    vec = vec[slice_op]
    res = torch.addmv(input, mat, vec)  # traced line: 25
    return res

torch._dynamo.reset()
model_opt = torch.compile(addmv_slice)

input = torch.empty(size=[11]).uniform_(-1, 1)
mat = torch.empty([11, 128]).uniform_(-10.0, 20.0)

vec = torch.empty([256]).uniform_(-10.0, 20.0)
slice_op = slice(None, None, 2)
out = model_opt(input, mat, vec, slice_op)

vec = torch.empty([384]).uniform_(-10.0, 20.0)
slice_op = slice(None, None, 3)
out = model_opt(input, mat, vec, slice_op)
```
before this change the test fails with:
```
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function getitem>(*(FakeTensor(..., size=(s0,)), slice(None, None, s1)), **{}):
slice step cannot be zero
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128460
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-06-12 21:14:04 +00:00
0444e89931 [export] Remove replace_sym_size_ops_pass (#128443)
Summary: Not needed anymore.

Test Plan: CI

Differential Revision: D58429458

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128443
Approved by: https://github.com/angelayi
2024-06-12 21:03:06 +00:00
67e6c76a18 Support apply_(callable) sugar for CPU NJTs (#125416)
Example:
```python
nt.apply_(lambda x: x * 2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125416
Approved by: https://github.com/soulitzer
2024-06-12 20:30:57 +00:00
dd143d44cc [BE] enable UFMT for top-level files torch/*.py (#127707)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127707
Approved by: https://github.com/ezyang
2024-06-12 20:15:05 +00:00
cc231a8e2b First version of AOTAutogradCache (#126791)
This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry.
Each AOTAutogradCacheEntry has:
- A CompiledForward and optionally a CompiledBackward
- A bunch of metadata.

CompiledForward and CompiledBackward each save the *key* to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit.
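
Schematically, the entry looks something like this (field and class layouts here are illustrative, not the actual definitions):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class CompiledForward:
    fx_graph_cache_key: str  # key into FXGraphCache, not the compiled artifact itself

@dataclass
class CompiledBackward:
    fx_graph_cache_key: str

@dataclass
class AOTAutogradCacheEntry:
    compiled_forward: CompiledForward
    compiled_backward: Optional[CompiledBackward]
    metadata: Any  # everything needed to rewrap the callable on a cache hit
```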

On cache miss:
- Run AOTAutograd, up to AOTAutogradDispatch.post_compile.
- Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this we *always* compile the backwards ahead of time. The PR above this one implements backward lazy caching, so that we only save to the cache after compiling the backward in a lazy backward scenario.
- Return the resulting object

On cache hit:
- Run AOTAutogradCacheEntry.post_compile() on the cache key.
- This attempts to load the forward and backward graphs from FXGraphCache
- As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata.

For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off.

V0 Guards behavior:
FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does *not* mean that the tensor values passed in are the same: only that their symints are). That is, AOTAutograd and Inductor never create new guards based on symints with *different sources* than those passed to it by inductor.

We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff.

Testing:
We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791
Approved by: https://github.com/bdhirsh
2024-06-12 20:04:44 +00:00
7775fee10f [tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431)
As titled, this PR refactors PrepareModuleInput to have a common method prepare_input_arg, allowing both args and kwargs to reuse this logic.

This also fixes https://github.com/pytorch/pytorch/issues/128365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128431
Approved by: https://github.com/awgu
2024-06-12 19:16:33 +00:00
ec1fdda196 Fix jagged NT softmax semantics (#119459)
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
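
Roughly, the adjustment maps the NT-level `dim` onto the `_values` tensor. A loose sketch, assuming a `(B, j, D)` jagged layout and a non-batch, non-ragged `dim` (not the actual implementation):

```python
import torch

def njt_softmax_sketch(nt, dim):
    # nt.values() drops the batch dim (shape (total_len, D) for a (B, j, D) NT),
    # so dims past the ragged dim shift down by one
    assert dim >= 2, "sketch only covers dims after the ragged dim"
    out_values = torch.softmax(nt.values(), dim=dim - 1)
    return torch.nested.nested_tensor_from_jagged(out_values, nt.offsets())
```
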
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
2024-06-12 19:12:03 +00:00
817ce6835b Revert "[cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)"
This reverts commit 4c971932e839fc5da2b91906ad028d4654932bca.

Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2163690162))
2024-06-12 18:47:52 +00:00
6d1b1ddd3e Select Runner Label Dynamically (#127287)
Updated `get_workflow_type.py` logic to dynamically select a prefix for the runner label.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127287
Approved by: https://github.com/ZainRizvi
2024-06-12 18:47:47 +00:00
7db501ba2b Revert "[cuDNN][SDPA] Support different key, value dimension in cuDNN SDPA (#128350)"
This reverts commit 45dccfddcd8fce804f50075484421ade27f1f021.

Reverted https://github.com/pytorch/pytorch/pull/128350 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/128350#issuecomment-2163669538))
2024-06-12 18:35:18 +00:00
d71f92213c [DSD] keep 'exp_avg' as DTensor after torch.distributed.checkpoint.state_dict.set_optimizer_state_dict (#128004)
Fixes #126950
`ptd_state_dict` with `broadcast_from_rank0=False` might miss two condition checks in `set_optimizer_state_dict`.
Here we add another condition, `full_state_dict=True`, with the corresponding tensor distribution done without broadcasting when `broadcast_from_rank0=False`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128004
Approved by: https://github.com/fegin
2024-06-12 18:14:56 +00:00
624e8ae491 Documentation for is_dependent function (#128197)
Docstring for torch.distributions.constraints.is_dependent

Fixes #127900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128197
Approved by: https://github.com/fritzo, https://github.com/malfet
2024-06-12 17:50:41 +00:00
a70a7337d2 Update torch.nanmean() docstring to mention input dtype requirement (#128155)
Fixes #120570

## Description
Update torch.nanmean() docstring to mention input dtype requirement as either floating point type or complex.
Previously, the torch.mean() docstring had been updated in #120208 in a similar manner, but the torch.nanmean() docstring was not updated.

## Checklist

- [X] The issue that is being fixed is referred in the description.
- [X] Only one issue is addressed in this pull request.
- [x] Labels from the issue that this PR is fixing are added to this pull request.
- [X] No unnecessary issues are included into this pull request.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128155
Approved by: https://github.com/malfet
2024-06-12 17:46:36 +00:00
0f52dc7e51 Document torch.cuda.profiler.stop (#128196)
Fixes #127918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128196
Approved by: https://github.com/malfet, https://github.com/eqy
2024-06-12 17:39:43 +00:00
5001f41b90 Revert "Make TraceUtils.h to be device-agnostic (#126969)"
This reverts commit 648625b230e8e6e7478fb219ff4f0aa6a45070f5.

Reverted https://github.com/pytorch/pytorch/pull/126969 on behalf of https://github.com/clee2000 due to failing internal builds D58443769 ([comment](https://github.com/pytorch/pytorch/pull/126969#issuecomment-2163462600))
2024-06-12 16:32:57 +00:00
f89574fa23 Revert "Pass params to dump_nccl_trace_pickle (#128307)"
This reverts commit eb567b1f40233667b982f81e3a75deec0fdfd9ca.

Reverted https://github.com/pytorch/pytorch/pull/128307 on behalf of https://github.com/clee2000 due to sorry need to revert this in order to revert 126969 ([comment](https://github.com/pytorch/pytorch/pull/128307#issuecomment-2163459399))
2024-06-12 16:29:51 +00:00
81e4e12f02 Revert "Support aten operations with out tensor (#124926)"
This reverts commit cba195c8edd6c7149036ef0767772d11fff5390e.

Reverted https://github.com/pytorch/pytorch/pull/124926 on behalf of https://github.com/clee2000 due to newly added test broke in internal D58444103.  Test passed in OSS CI though ([comment](https://github.com/pytorch/pytorch/pull/124926#issuecomment-2163441547))
2024-06-12 16:20:04 +00:00
c5172b8de8 Revert "[AOTI] Switch to use shim v2 (#127674)"
This reverts commit 9a38cae299e5ffd8143182bec878c28f96cfd72a.

Reverted https://github.com/pytorch/pytorch/pull/127674 on behalf of https://github.com/clee2000 due to tests failed internally D56709309 ([comment](https://github.com/pytorch/pytorch/pull/127674#issuecomment-2163436728))
2024-06-12 16:17:07 +00:00
9e39c62908 correct avx512_vnni isa name. (#128318)
`x86` currently has two VNNI ISAs: `avx2_vnni` and `avx512_vnni`.
This PR corrects the function name to `avx512_vnni`.

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128318
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-06-12 16:12:49 +00:00
f2dcbe89d6 Revert "Prevent expansion of cat indexing to avoid int64 intermediate (#127815)"
This reverts commit 793df7b7cb1473004837f5867f4c1c4b2b0f751d.

Reverted https://github.com/pytorch/pytorch/pull/127815 on behalf of https://github.com/clee2000 due to the newly added test is failing internally D58444153.  Test exists in opensource and passed in OSS CI, maybe env difference? ([comment](https://github.com/pytorch/pytorch/pull/127815#issuecomment-2163421968))
2024-06-12 16:09:22 +00:00
8df56afc20 Add support in Python API for the recommended max working set size. (#128289)
Adds a way for users to request the recommended max working set size for Metal on Mac. It plumbs through
https://developer.apple.com/documentation/metal/mtldevice/2369280-recommendedmaxworkingsetsize?language=objc

Can be used like
```
        max_memory = torch.mps.recommended_max_memory()
        print ("Recommended Max Memory : ", (max_memory/(1024*1024*1024)), "GB")
```

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128289
Approved by: https://github.com/malfet
2024-06-12 16:03:57 +00:00
b19c2319e4 [ROCm] TunableOp for gemm_and_bias (#128143)
Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm.  gemm_and_bias was notably missing.  This PR closes that gap.

This PR also fixes a regression after #124362 disabled the numerical check by default. The env var to enable it no longer worked.

CC @xw285cornell

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128143
Approved by: https://github.com/Skylion007
2024-06-12 15:53:39 +00:00
3c971d2ef3 Flip default value for mypy disallow_untyped_defs [final] (#127836)
Not requiring all functions to have types allows a lot of 'Any' types to slip in - which poison types and make mypy unable to properly typecheck the code.  I want to flip the default so that new files are required to have fully typed defs and we can have a burndown list of files that fail to require full types.

The preceding stack of PRs (cut up simply to keep the number of file changes per PR reasonable) adds `# mypy: allow-untyped-defs` to any file which didn't immediately pass mypy with the flag flipped. Due to changing files and merge conflicts it will probably be necessary to make several passes before landing this final PR, which turns the option on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-06-12 15:28:42 +00:00
15ab636007 Revert "Fix side effect pruning (#128028)"
This reverts commit a55d0d9718c11eb2897423c78eff18b168dd0a06.

Reverted https://github.com/pytorch/pytorch/pull/128028 on behalf of https://github.com/clee2000 due to broke test in internal D58443816.  Test exists in external too though ([comment](https://github.com/pytorch/pytorch/pull/128028#issuecomment-2163249251))
2024-06-12 14:55:57 +00:00
5ef70faaa7 Revert "Make torch_geometric models compatible with export (#123403)" (#128377)
This reverts commit d78991a7381adb3df5e9b63c365db4506643edce.

This PR reverts https://github.com/pytorch/pytorch/pull/123403 to fix the performance regression as discussed in https://github.com/pytorch/pytorch/issues/127513#issuecomment-2158835653.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128377
Approved by: https://github.com/jgong5, https://github.com/angelayi, https://github.com/desertfire
2024-06-12 14:53:01 +00:00
71f491554c Revert "First version of AOTAutogradCache (#126791)"
This reverts commit abc3eec22d38079bee855fbcb75da62a9558284c.

Reverted https://github.com/pytorch/pytorch/pull/126791 on behalf of https://github.com/DanilBaibak due to The changes broke a number of linux jobs ([comment](https://github.com/pytorch/pytorch/pull/126791#issuecomment-2163081643))
2024-06-12 13:59:29 +00:00
abc3eec22d First version of AOTAutogradCache (#126791)
This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry.
Each AOTAutogradCacheEntry has:
- A CompiledForward and optionally a CompiledBackward
- A bunch of metadata.

CompiledForward and CompiledBackward each save the *key* to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit.

On cache miss:
- Run AOTAutograd, up to AOTAutogradDispatch.post_compile.
- Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this we *always* compile the backwards ahead of time. The PR above this one implements backward lazy caching, so that we only save to the cache after compiling the backward in a lazy backward scenario.
- Return the resulting object

On cache hit:
- Run AOTAutogradCacheEntry.post_compile() on the cache key.
- This attempts to load the forward and backward graphs from FXGraphCache
- As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata.

For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off.

V0 Guards behavior:
FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does *not* mean that the tensor values passed in are the same: only that their symints are). That is, AOTAutograd and Inductor never create new guards based on symints with *different sources* than those passed to it by inductor.

We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff.

Testing:
We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791
Approved by: https://github.com/bdhirsh
2024-06-12 13:44:30 +00:00
2e065f2486 [Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#127592)
Fixes #127402

- Revert some changes to `ir.MutationOutput` and inductor/test_flex_attention.py
- Add checks of mutation for QLinearPointwiseBinaryPT2E

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127592
Approved by: https://github.com/leslie-fang-intel, https://github.com/Chillee
2024-06-12 10:49:16 +00:00
46a35a1ed4 [BE] enable UFMT for torch/__init__.py (#127710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127710
Approved by: https://github.com/ezyang
ghstack dependencies: #127703, #127708, #127709
2024-06-12 10:40:23 +00:00
26433b86de [BE][Easy] sort __all__ in torch/__init__.py (#127709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127709
Approved by: https://github.com/ezyang
ghstack dependencies: #127703, #127708
2024-06-12 10:21:36 +00:00
2386045e4f Add OpInfo entry for alias_copy (#127232) (#128142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142
Approved by: https://github.com/lezcano
2024-06-12 09:39:58 +00:00
1edcb31d34 [RELAND][inductor][cpp] bf16/fp16 gemm template computed with fp32 (#128472)
reland for https://github.com/pytorch/pytorch/pull/126068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128472
Approved by: https://github.com/desertfire
2024-06-12 08:37:16 +00:00
ebb00a92bd [dynamo] Skip freezing expect failure for inlining inbuilt nn modules (#128470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128470
Approved by: https://github.com/mlazos
ghstack dependencies: #126578, #128440
2024-06-12 08:21:50 +00:00
1602c7d0c8 [dynamo] Enable some inlining inbuilt nn module tests (#128440)
Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128440
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #126578
2024-06-12 08:21:50 +00:00
04037f3d22 [BE] sort imports in torch/__init__.py (#127708)
----

- Sort import via `usort`
- Change relative import `from . import xxx` to absolute import `from torch import xxx`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127708
Approved by: https://github.com/ezyang
ghstack dependencies: #127703
2024-06-12 08:03:54 +00:00
0b331fd5d7 [CUDA] Abate SoftMax.cu compiler warning spam (#128468)
Avoids excessively spammy warnings such as
```
pytorch/aten/src/ATen/native/cuda/SoftMax.cu(844): warning #191-D: type qualifier is meaningless on cast type
        [&] { const auto& the_type = input.scalar_type(); constexpr const char* at_dispatch_name = "host_softmax"; at::ScalarType _st = ::detail::scalar_type(the_type); ; switch (_st) { case at::ScalarType::Double: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Double)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Double), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Double>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = 
(at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::Float: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Float)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Float), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Float>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto 
output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::Half: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Half)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  
If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Half), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Half>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, 
smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::BFloat16: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::BFloat16)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::BFloat16), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::BFloat16>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); 
size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } default: do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  If so, " "please report an enhancement request to PyTorch.)", ::c10::str('"', at_dispatch_name, "\" not implemented for '", toString(_st), "'")))); }; } while (false); } }()

```
and
```
SoftMax.cu:844: warning: comparison of integer expressions of different signedness: ‘int64_t’ {aka ‘long int’} and ‘long unsigned int’ [-Wsign-compare]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128468
Approved by: https://github.com/valentinandrei
2024-06-12 07:47:14 +00:00
8b3daf1768 Add FloatTrueDiv and ToFloat to SYMPY_INTERP (#128418)
Summary: I admit I'm not 100% sure what I'm doing here. I'm hitting a bug in the FX graph cache when we try to evaluate a guards expression. We're creating guards that look like this:
```
Ne(CeilToInt(FloatTrueDiv(ToFloat(8*L['t0']) - 4.0, 8.0))*CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0)), CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0))) and ...
```
It looks like we have a facility to define these operators in the SYMPY_INTERP map and we're just missing FloatTrueDiv and ToFloat. What's surprising to me is that we're only hitting this problem with the FX graph cache enabled. We can create such guards, but we've never actually evaluated any?
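
Conceptually the interp map just needs Python callables for those symbols, along these lines (a sketch of the idea, not the actual table):

```python
import math

# guard expressions like the one above are evaluated by substituting each
# sympy function name with a plain Python callable
INTERP_SKETCH = {
    "ToFloat": float,
    "FloatTrueDiv": lambda a, b: a / b,
    "CeilToInt": lambda x: math.ceil(x),
}
```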

Test Plan:
`TORCHINDUCTOR_FX_GRAPH_CACHE=1 python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --only detectron2_fcos_r_50_fpn`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128418
Approved by: https://github.com/ezyang
2024-06-12 06:26:43 +00:00
a421699998 Revert "[tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431)"
This reverts commit 089f9a116ac8b2c14d6351b52614b529caba126b.

Reverted https://github.com/pytorch/pytorch/pull/128431 on behalf of https://github.com/DanilBaibak due to Sorry for the revert. Your changes broke the linter. Here you can find more details - 089f9a116a ([comment](https://github.com/pytorch/pytorch/pull/128431#issuecomment-2162197858))
2024-06-12 06:25:53 +00:00
dcc0093dba [BE][Easy] export explicitly imported public submodules (#127703)
Add top-level submodules `torch.{storage,serialization,functional,amp,overrides,types}`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127703
Approved by: https://github.com/ezyang
2024-06-12 05:52:18 +00:00
62311257ad Add 1 test case for Convtranspose1D in op microbenchmark (#127216)
Operator ConvTranspose1d suffers a performance regression with a specific shape, #120982, so we'd like to have this shape included in the op-level benchmark in this PR.

I reproduced the regression for ConvTranspose1d with shape [2016, 1026, 1024, 256, 1, 224]. Here is the summary:

Hardware info: Intel SPR 8480, 56 cores per socket, frequency 2.1 GHz.
Performance comparison between torch 1.13 and torch 2.2
Benchmarking **PyTorch1.13**: ConvTranspose1d Mode: Eager
Name: ConvTranspose1d_IC2016_OC1026_kernel1024_stride256_N1_L224_cpu
Input: IC: 2016, OC: 1026, kernel: 1024, stride: 256, N: 1, L: 224, device: cpu
Forward Execution Time (s) : **0.96s**

Benchmarking **PyTorch2.2:** ConvTranspose1d
Mode: Eager
Name: ConvTranspose1d_IC2016_OC1026_kernel1024_stride256_N1_L224_cpu
Input: IC: 2016, OC: 1026, kernel: 1024, stride: 256, N: 1, L: 224, device: cpu
Forward Execution Time (s) : **7.988s**

Also benchmarking for 7 rounds to check the variance.

  | Round1 | Round2 | Round3 | Round4 | Round5 | Round6 | Round7 | Normalized   Variance
-- | -- | -- | -- | -- | -- | -- | -- | --
Pytorch1.13 | 0.971 | 0.972 | 0.969 | 0.970 | 0.972 | 0.970 | 0.971 | 0.0002%
Pytorch 2.2 | 8.064 | 8.053 | 8.027 | 7.927 | 7.971 | 7.929 | 7.902 | 0.0059%
Ratio v2.2 vs.   v1.13(Lower is better) | 8.31 | 8.28 | 8.29 | 8.18 | 8.20 | 8.18 | 8.14 |  

Reproduce script:
numactl -N 0 python -m pt.conv_test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127216
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman
2024-06-12 05:33:54 +00:00
089f9a116a [tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431)
As titled, this PR refactors the PrepareModuleInput style to have a common
method, prepare_input_arg, allowing both args and kwargs to reuse this logic.

This also fixes https://github.com/pytorch/pytorch/issues/128365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128431
Approved by: https://github.com/awgu
2024-06-12 05:22:24 +00:00
77a0ca66e4 Add threadfence to 2-stage reduction for correct writes visibility (#128455)
The final block that accumulates the 2-stage reduction result has to complete the acquire pattern to make sure the writes of all other blocks are visible to it; see https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=atom#release-and-acquire-patterns
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128455
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-06-12 04:13:36 +00:00
c0b87afcad [RELAND2][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)
Tracing through `__init__` is important because it initializes members (i.e., calls STORE_ATTR on them). By doing that, we kick in mutation tracking for these objects, so things like mutating `_modules` are tracked automatically.

Fixes https://github.com/pytorch/pytorch/issues/111837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578
Approved by: https://github.com/jansel
2024-06-12 04:09:23 +00:00
02e7519ac3 DOC: strip inaccurate either float32 or float64 statement from set_default_type (#128192)
Fixes #126647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128192
Approved by: https://github.com/malfet
2024-06-12 03:57:48 +00:00
cyy
8cf302dce4 [5/N] Change static functions in headers to inline (#128406)
Follows #128286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128406
Approved by: https://github.com/ezyang
2024-06-12 03:25:54 +00:00
86b5df3e71 Documenting the torch.fx.annotate.annotate function (#128337)
Fixes #127903

This PR adds docstring to the `torch.fx.annotate.annotate` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128337
Approved by: https://github.com/malfet
2024-06-12 03:06:32 +00:00
7c2058338a Improve convert fp32 to fp16 fx pass (#127829)
Summary: Improve the fp32-to-fp16 conversion FX pass to use a to_dtype node and constant folding instead of in-place conversion.

Test Plan:
```
buck2 test @//mode/{opt,inplace} //glow/fb/fx/fba/tests:test_fba_pass_manager_builder
```

Differential Revision: D57803843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127829
Approved by: https://github.com/Skylion007
2024-06-12 02:50:37 +00:00
3ddec713b8 Revert "[cuDNN][Quantization] Don't print when plan finalization fails in cuDNN quantization backend (#128177)"
This reverts commit cac7a22b92478d897488688010e562b7bd36b97f.

Reverted https://github.com/pytorch/pytorch/pull/128177 on behalf of https://github.com/clee2000 due to broke test/test_quantization.py::TestQuantizedLinear::test_qlinear_cudnn on sm86 tests cac7a22b92 https://github.com/pytorch/pytorch/actions/runs/9470648757/job/26100448913.  Probably a landrace, test ran on the PR and succeed ([comment](https://github.com/pytorch/pytorch/pull/128177#issuecomment-2161977110))
2024-06-12 02:20:15 +00:00
85eeb90d2c [dynamo] Fix graph breaks related to HF ModelOutput (#127780)
Fixes https://github.com/pytorch/pytorch/issues/126028 and https://github.com/pytorch/pytorch/issues/126027.

Changes:
- Support building `CustomizedDictVariable` in` VariableBuilder` (but only for HF `ModelOutput` subclasses)
- Remove `DataClassVariable` since it's not really being used anywhere (`CustomizedDictVariable` can be used instead)
- Support side effects for `CustomizedDictVariable`
- Allow `NO_HASATTR` leaf guard on `DictSubclassGuardManager`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127780
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-06-12 02:16:24 +00:00
7f6daf289b [inductor] parallel compile: set LD_LIBRARY_PATH for sub-processes in internal (#128376)
Test Plan: `TORCHINDUCTOR_WORKER_START=subprocess TORCHINDUCTOR_COMPILE_THREADS=16 buck run mode/opt scripts/slarsen/torch_compile:run`

Differential Revision: D58371264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128376
Approved by: https://github.com/eellison
2024-06-12 01:55:53 +00:00
3d55d84ec2 [Fix] Check tensor dtype before using torch.allclose in _trace log (#128438)
#### Issue
`torch.allclose` errors out during logging due to different dtypes.
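
A minimal sketch of the kind of guard described (names are illustrative, not the code in `_trace`):
```python
import torch


def tensors_close(a: torch.Tensor, b: torch.Tensor) -> bool:
    # torch.allclose raises on mismatched dtypes, so check (or cast) first.
    if a.dtype != b.dtype:
        return False
    return torch.allclose(a, b)
```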

#### Test
* `pytest test/test_jit.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128438
Approved by: https://github.com/angelayi
2024-06-12 01:52:09 +00:00
bb2a995529 Back out "[Dynamo] Treat integers stored on nn.Modules as dynamic (#126466)" (#128432)
Summary:
Original commit changeset: c7d2e6b13922

Original Phabricator Diff: D57618942

Differential Revision: D58383241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128432
Approved by: https://github.com/ezyang, https://github.com/Yuzhen11
2024-06-12 01:34:32 +00:00
cyy
9538bf4e7c [2/N] Remove inclusion of c10/util/string_utils.h (#128372)
Follows  #128300.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128372
Approved by: https://github.com/aaronenyeshi
2024-06-12 01:18:20 +00:00
cyy
219da29dfd [7/N] Remove unused functions (#128407)
Follows  #128309
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128407
Approved by: https://github.com/ezyang
2024-06-12 01:10:33 +00:00
cyy
fb013ecb24 Remove unused private List::ptr_to_first_element (#128405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128405
Approved by: https://github.com/ezyang
2024-06-12 01:07:14 +00:00
6af4c6acad Migrate test to internal base class, fixes (#128367)
Summary:
## Remove etc deps
converted tests to non-etcd based rdzv handler so that tests don't have dependency on etcd server

## Adopt PyTorch test conventions
- test file starts with `test_TESTS.py`
- test base class is torch.testing._internal.common_utils.TestCase
- include a `__main__` handler (see the sketch below)
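
A minimal sketch of a test module following these conventions (class and test names here are placeholders):
```python
from torch.testing._internal.common_utils import TestCase, run_tests


class ElasticLaunchTest(TestCase):
    def test_smoke(self):
        self.assertTrue(True)


if __name__ == "__main__":
    run_tests()
```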

## reduce test timing (used to take > 300 seconds):

3.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_init_method_env_with_torchelastic
2.59s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_init_method_tcp_with_torchelastic
2.33s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_worker_raise_exception
2.33s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_run_path
2.30s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_launch_auto_configurations
2.24s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_is_torchelastic_launched_with_logs_spec_defined
2.24s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_is_torchelastic_launched
2.17s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_multiple_agents
2.12s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic
2.08s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_gpu_launch_configurations
1.32s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_standalone
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_launch_number_configurations
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_with_env_vars
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_python
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_python_caffe2_bc
1.04s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_bash
1.03s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_default_nproc
0.04s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_logs_logs_spec_entrypoint_must_be_defined
0.01s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_agent_raise_exception
0.01s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_shutdown

Test Plan: pytest --durations=0  test/distributed/launcher/run_test.py

Differential Revision: D58388182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128367
Approved by: https://github.com/d4l3k
2024-06-12 01:03:40 +00:00
786c24a4cd [inductor] Always realize sigmoid for CPU (#128339)
Summary: Currently the CPU backend prefers to always realize exp because it's a heavy op on CPU. For the same reason, we need to realize sigmoid as well. This solves a problem in llama2 inference where exp was recomputed many times in an inner loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128339
Approved by: https://github.com/eellison, https://github.com/helloguo, https://github.com/jansel, https://github.com/jgong5, https://github.com/peterbell10
2024-06-12 00:46:33 +00:00
5d8c7f39d4 Revert "Introduce int_oo (#127693)"
This reverts commit 9cab5987bdeb66df8efbc581b3469bfe300e168c.

Reverted https://github.com/pytorch/pytorch/pull/127693 on behalf of https://github.com/clee2000 due to sorry executorch CI is a bit weird regarding pins, I'll make a chat with mergen with the choices of what to do and how it'll affect executorch CI, reverting for now to prevent more divergences in the meantime ([comment](https://github.com/pytorch/pytorch/pull/127693#issuecomment-2161775400))
2024-06-11 23:36:08 +00:00
c9c1fed065 Revert "Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374)"
This reverts commit c13e03c87428b986972a48d8fc78dbffc2579f63.

Reverted https://github.com/pytorch/pytorch/pull/128374 on behalf of https://github.com/clee2000 due to sorry I need to revert this in order to revert something else, to remerge, just rebase and fix the merge conflict ([comment](https://github.com/pytorch/pytorch/pull/128374#issuecomment-2161772864))
2024-06-11 23:34:03 +00:00
94fea82d66 init sub comment (#128082)
Fixes #127905

### Description

Add docstring to the torch/onnx/symbolic_opset9.py:sub function

### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128082
Approved by: https://github.com/titaiwangms
2024-06-11 22:42:35 +00:00
447173198b Add docstring for the torch.fx.operator_schemas.create_type_hint func… (#128139)
Fixes: #127916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128139
Approved by: https://github.com/SherlockNoMad
2024-06-11 22:42:11 +00:00
b79d056e76 [export] FIx unflattener for preserving modules containing unused inputs (#128260)
Currently the unflattener fails if the module it is preserving the module signature for contains unused inputs/outputs.

This also fixes unflattener issues in D57829276.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128260
Approved by: https://github.com/pianpwk
2024-06-11 22:32:08 +00:00
eb567b1f40 Pass params to dump_nccl_trace_pickle (#128307)
Summary:
Pass parameters from the request to the dump_nccl_trace_pickle handler.
The supported parameter names and values are all lowercase.
includecollectives={true, false}
includestacktraces={true, false}
onlyactive={true, false}

Example post is:
/handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true
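
A hedged usage sketch of the handler call described above (host and port are placeholders for wherever the control-plane server is listening):
```python
import requests

resp = requests.post(
    "http://localhost:1234/handler/dump_nccl_trace_pickle",
    params={
        "includecollectives": "true",
        "includestacktraces": "false",
        "onlyactive": "true",
    },
)
trace_bytes = resp.content  # pickled trace payload
```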

Test Plan:
unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128307
Approved by: https://github.com/d4l3k
ghstack dependencies: #128191
2024-06-11 22:28:53 +00:00
1dd2431f86 [Test] Add test for only_active flag (#128191)
Summary:
Add a unit test for the only_active flag to _dump_nccl_trace API call.
With this flag, we only expect active records to be returned.

Test Plan:
Unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128191
Approved by: https://github.com/d4l3k
2024-06-11 22:26:01 +00:00
5fcb5f0c8b init reshape_from_tensor_shape comment (#128171)
Fixes #127897

### Description
Add docstring to the torch/onnx/symbolic_opset9.py:reshape_from_tensor_shape function

### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128171
Approved by: https://github.com/titaiwangms
2024-06-11 21:56:33 +00:00
a55d0d9718 Fix side effect pruning (#128028)
Summary:
The previous side effect pruning algorithm would keep many dead cell
variables alive. For example, in
https://github.com/pytorch/pytorch/issues/125078, the compiled function
has one return but there were three in the Dynamo graph due to two
dead cell variables not being pruned away.

This PR adds a corrected algorithm. "new cell variables" are alive if
they can be reached from one of the following:
1. any of the tx.symbolic_locals or tx.stack (that is, if they are
   involved in a return from the function or intermediate variable
   during a graph break). Example: an alive NestedUserFunctionVariable
2. "mutations to pre-existing objects". Example: appending a
   NestedUserFunctionVariable to a global list

The new algorithm reflects this, but please let me know if there are
more cases to handle.
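
A simplified sketch of the reachability rule described above (an illustration of the liveness idea only, not Dynamo's SideEffects code; `children` is a hypothetical helper that yields the objects a variable references):
```python
def prune_dead_cells(new_cells, symbolic_locals, stack, mutated_preexisting, children):
    # A new cell variable is alive iff it is reachable from the roots.
    roots = list(symbolic_locals) + list(stack) + list(mutated_preexisting)
    alive, worklist = set(), roots
    while worklist:
        obj = worklist.pop()
        if id(obj) in alive:
            continue
        alive.add(id(obj))
        worklist.extend(children(obj))
    return [cell for cell in new_cells if id(cell) in alive]
```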

Test Plan:
- existing tests (afaict, test/dynamo/test_python_autograd is the best
  SideEffects test case we have)
- see in test/dynamo/test_higher_order_ops that the expecttests changed
  -- the functorch dynamo graphs no longer return dead cellvars.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028
Approved by: https://github.com/jansel
2024-06-11 21:40:48 +00:00
8c1247cffb [BE] Fixed CPU autocast warning (#127774)
This PR fixes
```
/data/users/andgu/pytorch/torch/utils/checkpoint.py:1398: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
```
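
The replacement the warning asks for is just the device-generic API:
```python
import torch

# Deprecated: with torch.cpu.amp.autocast(): ...
with torch.amp.autocast("cpu"):
    pass  # autocast-enabled region
```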

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127774
Approved by: https://github.com/soulitzer, https://github.com/Skylion007, https://github.com/tianyu-l
2024-06-11 21:33:35 +00:00
70a1e85718 [Traceable FSDP2] Use custom ops for AllGather copy-in / copy-out and ReduceScatter copy-in (#127856)
Making these operations into custom ops helps Inductor identify these ops and enforce the FSDP communication op ordering.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127856
Approved by: https://github.com/awgu
2024-06-11 20:15:03 +00:00
adb699189b Revert "[RELAND][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)"
This reverts commit b2d602306a9eb19e30328cbaee941c874f8148a9.

Reverted https://github.com/pytorch/pytorch/pull/126578 on behalf of https://github.com/clee2000 due to failed internal test D58394084.  Author has forward fix but includes external changes so reverting is a bit easier to coordinate ([comment](https://github.com/pytorch/pytorch/pull/126578#issuecomment-2161481839))
2024-06-11 19:41:41 +00:00
eqy
45dccfddcd [cuDNN][SDPA] Support different key, value dimension in cuDNN SDPA (#128350)
CC @vedaanta-nvidia @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128350
Approved by: https://github.com/Skylion007
2024-06-11 19:22:21 +00:00
3e09123797 Enable UFMT on test_nestedtensor.py (#128359)
split it into two PRs since it is more than 2k lines of change

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128359
Approved by: https://github.com/davidberard98
2024-06-11 19:14:04 +00:00
61f922c2ca Fix 'get_real_value' on placeholder nodes (#127698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127698
Approved by: https://github.com/jansel
ghstack dependencies: #127695, #127696
2024-06-11 18:57:25 +00:00
984b1a8c35 Fix 'get_attr' call in dynamo 'run_node' (#127696)
Fixes #124858

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127696
Approved by: https://github.com/jansel
ghstack dependencies: #127695
2024-06-11 18:57:25 +00:00
205410cb44 add xpu to torch.tensors (#127280)
As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to torch.tensors doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127280
Approved by: https://github.com/svekars
2024-06-11 18:13:01 +00:00
cac7a22b92 [cuDNN][Quantization] Don't print when plan finalization fails in cuDNN quantization backend (#128177)
Similar in spirit to #125790, hopefully addresses failures seen for cuDNN 9.1 upgrade: #https://github.com/pytorch/pytorch/pull/128166

CC @nWEIdia @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128177
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2024-06-11 18:09:25 +00:00
8a09940a54 [inductor] fix compile time regression by caching get_gpu_type (#128363)
We recently observed a significant compile time regression in torchtitan when
turning on 2D parallel + torch.compile, so I decided to get a deeper
understanding of why.

It turns out this is affecting **all the trainings** that have functional collectives
captured in the graph, not only 2D parallel (2D parallel was just the
job that happened to have collectives captured in the TP region).

The root cause: when doing Inductor lowering, we call the comm analysis
pass to get an estimated collective time for each collective node in the
graph, and for each such node we call `get_gpu_type()`, which under the
hood calls `torch.utils.collect_env.run` to get the GPU info. However,
this call is super expensive! It effectively spawns a new process and
calls `nvidia-smi`, so the cost is **linear** in the number of collective
nodes in the graph.

see https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py#L75

The fix is to add an LRU cache to the function, so that we only call it
once and reuse the cached result afterwards.
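
A minimal sketch of the described fix, assuming a `get_gpu_type()`-style helper that shells out via `torch.utils.collect_env.run` (the helper body and command line here are illustrative, not the PR's code):
```python
import functools

from torch.utils.collect_env import run


@functools.lru_cache(maxsize=1)
def get_gpu_type() -> str:
    # The uncached version spawned a subprocess (nvidia-smi) per call, which is
    # why the cost scaled with the number of collective nodes in the graph.
    rc, out, _ = run("nvidia-smi --query-gpu=name --format=csv,noheader")
    return out.strip() if rc == 0 else "unknown"
```

With the cache, every subsequent call during lowering returns the memoized string instead of spawning a new process.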

torchtitan benchmark shows:
* before this fix: 2D parallel + fp8 compile time: 6min +
* after this fix: 2D parallel + fp8 compile time: 2min 48s (more than 100% improvement)

There's more room to improve the compile time, but this PR fixes the biggest regression I have found so far.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128363
Approved by: https://github.com/yf225
2024-06-11 18:02:13 +00:00
1d233b8f50 Revert "Make nn.Module state_dict load_state_dict pre-hook and state_dict post hook public (#126704)"
This reverts commit c38b3381a12a0ec033dd417827c530c4474b8165.

Reverted https://github.com/pytorch/pytorch/pull/126704 on behalf of https://github.com/clee2000 due to broke internal typecheck D58394110 (which probably means the code wouldn't work either but I guess it didn't run on the diff). Probably an easy fix? ([comment](https://github.com/pytorch/pytorch/pull/126704#issuecomment-2161299193))
2024-06-11 17:45:20 +00:00
491c4a5dcb Revert "Make sure #126704 is BC for torch.save-ed nn.Module (#128344)"
This reverts commit 841d87177a900c2bbd59b6589165189141c4e8bb.

Reverted https://github.com/pytorch/pytorch/pull/128344 on behalf of https://github.com/clee2000 due to broke internal typecheck D58394110 (which probably means the code wouldn't work either but I guess it didn't run on the diff). Probably an easy fix? ([comment](https://github.com/pytorch/pytorch/pull/126704#issuecomment-2161299193))
2024-06-11 17:45:20 +00:00
4345d98663 [dynamo] Fix for #127696 (#128358)
Test Plan:
`buck2 test @//mode/dev-nosan //executorch/exir/backend/...`
https://www.internalfb.com/intern/testinfra/testrun/12666373989243932

Differential Revision: D58384518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128358
Approved by: https://github.com/ydwu4
2024-06-11 16:43:15 +00:00
a838e90964 Add Intel Gaudi device/HPU to auto load in instantiate_device_type_tests (#126970)
### Motivation
The Intel Gaudi accelerator (device name hpu) is seen to have a good pass rate with the PyTorch framework UTs; however, being an out-of-tree device, we face challenges in adapting the device to natively run the existing PyTorch UTs under pytorch/test. The UTs are nevertheless a good indicator of the device stack health, so we run them regularly with adaptations.
Although we can add the Gaudi/HPU device to generate the device-specific tests using the TORCH_TEST_DEVICES environment variable, we miss out on a lot of features such as executing for specific dtypes and skipping or overriding OpInfo entries. With significant changes introduced every PyTorch release, maintaining these adaptations becomes difficult and time consuming.
Hence with this PR we introduce the Gaudi device in the common_device_type framework, so that the tests are instantiated for Gaudi when the library is loaded.
The eventual goal is to make Gaudi out-of-tree support equivalent to that of in-tree devices.

### Changes
Add HPUTestBase, a DeviceTypeTestBase subclass specifying the appropriate attributes for Gaudi/HPU.
Include code to check whether the Intel Gaudi software library is loaded and, if so, add the device to the list of devices considered for instantiation of device type tests.
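
A rough sketch of the shape of this change (class and registration names are illustrative; the actual hook points in common_device_type may differ):
```python
import importlib.util

from torch.testing._internal.common_device_type import DeviceTypeTestBase


class HPUTestBase(DeviceTypeTestBase):
    device_type = "hpu"


def maybe_add_hpu(test_bases: list) -> None:
    # Only instantiate HPU test variants when the Gaudi software stack is importable.
    if importlib.util.find_spec("habana_frameworks") is not None:
        test_bases.append(HPUTestBase)
```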

### Additional Context
please refer the following RFC : https://github.com/pytorch/rfcs/pull/63/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126970
Approved by: https://github.com/albanD
2024-06-11 16:35:17 +00:00
29081059b6 [Static Runtime] Fix & run gen_static_runtime_ops (#128299)
gen_static_runtime_ops hasn't been updated in a while. In preparation for https://github.com/pytorch/pytorch/pull/127675 in which I need to re-run the codegen step for cumprod, I want to land these changes beforehand in case there are any other issues that arise.

I added a number of ops to the blocklist:
```
+        "_nested_tensor_storage_offsets",
+        "_nested_get_values",  # no CPU backend
+        "_nested_get_values_copy",  # no CPU backend
+        "_nested_view_from_jagged",  # testing needs to be patched
+        "_nested_view_from_jagged_copy",  # testing needs to be patched
+        "_nested_view_from_buffer",  # testing needs to be patched
+        "_nested_view_from_buffer_copy",  # testing needs to be patched
+        "_int_mm",  # testing needs to be patched
+        "_to_sparse_csc",  # testing needs to be patched
+        "_to_sparse_csr",  # testing needs to be patched
+        "segment_reduce",  # testing needs to be patched
```

Most of these are added just because testing doesn't work right now.

Additionally, a few `fft` ops seem to have been removed from native_functions.yaml; I'm guessing it's unlikely FFT would have been used in many real models though.

Differential Revision: [D58329403](https://our.internmc.facebook.com/intern/diff/D58329403/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128299
Approved by: https://github.com/YuqingJ
2024-06-11 16:27:39 +00:00
f8c45996d5 [MPS] Make erfinv compilable for bfloat16 (#128375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128375
Approved by: https://github.com/Skylion007
ghstack dependencies: #128373
2024-06-11 16:04:11 +00:00
c13e03c874 Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128374
Approved by: https://github.com/Skylion007
2024-06-11 15:58:28 +00:00
053930e194 [MPS][BE] Remove code duplication (#128373)
Use `scalarToMetalTypeString` instead of `getMetalType`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128373
Approved by: https://github.com/Skylion007
2024-06-11 15:58:04 +00:00
9a38cae299 [AOTI] Switch to use shim v2 (#127674)
Differential Revision: D56709309

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127674
Approved by: https://github.com/desertfire
2024-06-11 15:01:25 +00:00
55901fb3da [fx] Preserve Fx graph node order in partitioner across runs (#115621)
Fixes #ISSUE_NUMBER
The partitioner generates a different graph on each recompilation run.
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621
Approved by: https://github.com/ezyang
2024-06-11 14:04:52 +00:00
fc77fdca6f [guard_size_oblivious] Add gso ExpandUtils:_sym_to (#128224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128224
Approved by: https://github.com/ezyang
2024-06-11 14:01:34 +00:00
648625b230 Make TraceUtils.h to be device-agnostic (#126969)
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files.

In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
2024-06-11 08:38:07 +00:00
207c2248a8 [inductor] Fix lowering full with SymBool value (#128213)
Fixes #128161, fixes #128095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128213
Approved by: https://github.com/lezcano
2024-06-11 08:33:35 +00:00
a206dcc79e fb_memcache: Move to fbcode from thirdparty (#128174)
Summary: The fb_memcache injections location and path is changing.

Test Plan: Existing tests should pass.

Reviewed By: bertmaher, oulgen

Differential Revision: D57973772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128174
Approved by: https://github.com/oulgen
2024-06-11 07:46:12 +00:00
f2d7f235a6 [dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269)
Fixes https://github.com/pytorch/pytorch/issues/101168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128269
Approved by: https://github.com/jansel
ghstack dependencies: #128295, #126578, #128268, #128254
2024-06-11 07:09:04 +00:00
402b289f3b Properly register parameter for binary folding test (#128356)
This PR properly registers the tensor used in the module compute as a parameter. This bug was hidden previously because all tensors on the nn modules would be considered constant by Dynamo; with inlining of NN modules, this is no longer the case.
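
A minimal sketch of the kind of fix involved (module and names are illustrative, not the test's actual code):
```python
import torch
import torch.nn as nn


class Scale(nn.Module):
    def __init__(self):
        super().__init__()
        # Before: self.weight = torch.randn(4)      # plain attribute, treated as constant
        self.weight = nn.Parameter(torch.randn(4))  # properly registered parameter

    def forward(self, x):
        return x * self.weight
```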

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128356
Approved by: https://github.com/anijain2305
ghstack dependencies: #128355
2024-06-11 06:48:26 +00:00
a32157c67c Mark params static if inlining modules and freezing (#128355)
Today, inlining builtin nn modules is not compatible with parameter freezing. Freezing parameters and then constant-folding them through the graph relies on the assumption that they will not be inputs and will be static across calls to the same graph. When inlining builtin nn modules this assumption is broken, and we reuse the same graph for different instances of the same nn module. There are three options: 1) abandon constant folding, 2) create a dispatcher layer (like cudagraphs) which dispatches to the correct constant-folded graph for each distinct set of parameters, or 3) recompile.

This PR implements option 3 by introducing guards on the parameter pointers, since freezing is relatively rare and performance sensitive. Option 2 had many more unknowns, and option 1 is not viable due to the drop in performance.
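
A simplified sketch of option 3's behavior (not the actual guard machinery): key the constant-folded artifact on the identity of the frozen parameters so a different module instance recompiles rather than reusing stale constants. Names here are hypothetical.
```python
compiled_cache = {}


def run_frozen(module, inputs, compile_fn):
    # "Pointer" guard: a distinct set of parameter storages gets its own
    # constant-folded graph.
    key = tuple(p.data_ptr() for p in module.parameters())
    if key not in compiled_cache:
        compiled_cache[key] = compile_fn(module)
    return compiled_cache[key](*inputs)
```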

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128355
Approved by: https://github.com/anijain2305
2024-06-11 06:48:26 +00:00
24e7f29099 Lowering for avg_pool_3d_backward (Fixes:#127101) (#127722)
We implemented a lowering for the avg_pool3d_backward operation and created tests for it.
We ran some benchmarks and achieved the following results:

```
[-------------- avgpool_3d_backwards --------------]
                             |  Decomposed  |  Eager
16 threads: ----------------------------------------
      (3, 5, 400, 200, 200)  |     6061     |  11160
      (3, 5, 300, 200, 200)  |     4547     |   8372
      (3, 5, 200, 200, 200)  |     3032     |   5585
      (3, 5, 300, 300, 300)  |    10100     |  18840
      (3, 5, 100, 100, 100)  |      381     |    703
      (3, 5, 100, 300, 200)  |     2270     |   4190
      (8, 8, 128, 128, 128)  |     3397     |   6253
      (2, 3, 150, 150, 150)  |      520     |    947
      (1, 3, 128, 128, 128)  |      161     |    299
      (8, 16, 64, 64, 64)    |      851     |   1569
      (1, 1, 50, 50, 50)     |       17     |     11
      (3, 5, 20, 40, 40)     |       17     |     30
      (3, 5, 10, 20, 20)     |       17     |     11
      (1, 1, 10, 10, 10)     |       16     |     11
      (3, 5, 5, 10, 10)      |       17     |     11
      (3, 5, 2, 5, 5)        |       17     |     11
```
These were run on an RTX 3050, so we were not able to allocate larger tensors due to memory limitations.
We believe it would be beneficial to benchmark this on more recent hardware, just to check if the performance holds up with larger sizes.

Furthermore, we also refactored code from adaptive_avg_pool2d and adaptive_max_pool2d, to reduce code duplication.
We diffed the kernels and they are identical.
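
A small usage sketch that exercises this path under torch.compile (shapes chosen arbitrarily small):
```python
import torch


def f(x):
    return torch.nn.functional.avg_pool3d(x, kernel_size=3).sum()


x = torch.randn(1, 3, 16, 16, 16, requires_grad=True)
torch.compile(f)(x).backward()  # the backward pass hits avg_pool3d_backward
print(x.grad.shape)
```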

Fixes #127101

Co-authored-by: Martim Mendes <martimccmendes@tecnico.ulisboa.pt>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127722
Approved by: https://github.com/jansel
2024-06-11 06:39:04 +00:00
5b5d269d34 Speed up fx graph iteration by implementing it in C++ (#128288)
Before this change
```
python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py
iterating over 100000000 FX nodes took 19.5s (5132266 nodes/s)
```

After this change
```
python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py
iterating over 100000000 FX nodes took 3.4s (29114001 nodes/s)
```

5.7x improvement

Differential Revision: [D58343997](https://our.internmc.facebook.com/intern/diff/D58343997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128288
Approved by: https://github.com/jansel, https://github.com/albanD
2024-06-11 05:48:31 +00:00
fa88f390a0 Revert "[inductor] enable fx graph cache on torchbench (#128239)"
This reverts commit 734e8f6ad7e7f0fa0341fb658f1f986225173f5f.

Reverted https://github.com/pytorch/pytorch/pull/128239 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to surface a bunch of inductor failures in trunk 734e8f6ad7 ([comment](https://github.com/pytorch/pytorch/pull/128239#issuecomment-2159789242))
2024-06-11 04:53:38 +00:00
fe39c07826 [pipelining][doc] Remove duplicated words (#128368)
"for execution" is used in both step titles

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128368
Approved by: https://github.com/wconstab
ghstack dependencies: #128361
2024-06-11 04:52:57 +00:00
cba195c8ed Support aten operations with out tensor (#124926)
This PR intends to support the aten operations with the `out` tensor.

Currently, the AOT compile always does **NOT** keep input tensor mutations. According to the comments, this is because it has not encountered such a use case.
> For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to.

However, for aten operations it is common for the `out` tensor to be an input parameter that needs to be mutated. This PR supports this by adding a `keep_inference_input_mutations` flag (`aot_inductor.keep_inference_input_mutations`). This flag gives the callee the flexibility to decide whether the AOT compile needs to keep input tensor mutations in the graph.

Take `clamp` as an example as follows.
```python
out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0)
inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0)
min_tensor = inp_tensor - 0.05
max_tensor = inp_tensor + 0.05
torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor)
```

W/O this PR
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";

    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1);  clamp_min = arg2_1 = None
    return (clamp_max, clamp_max)
```

W/ this PR
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";

    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1);  clamp_min = arg2_1 = None
    copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max);  arg3_1 = clamp_max = None
    return (copy_,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi
2024-06-11 04:35:27 +00:00
16e67be7f1 Also preserve unbacked SymInts when partitioning as backward inputs (#128338)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128338
Approved by: https://github.com/IvanKobzarev
2024-06-11 04:27:09 +00:00
7afffdf48b [CI] Comment hf_T5_generate, hf_GPT2 and timm_efficientnet in inductor cpu smoketest for performance unstable issue (#127588)
Fixes #126993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127588
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/desertfire
2024-06-11 03:12:11 +00:00
ca45649eb5 [easy][dynamo][inline work] Fix test with inlining inbuilt nn modules (#128254)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128254
Approved by: https://github.com/williamwen42
ghstack dependencies: #128295, #126578, #128268
2024-06-11 03:02:51 +00:00
490 changed files with 16089 additions and 16456 deletions

View File

@ -373,6 +373,13 @@ case "$image" in
CONDA_CMAKE=yes
EXECUTORCH=yes
;;
pytorch-linux-jammy-py3.12-halide)
CUDA_VERSION=12.4
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
HALIDE=yes
;;
pytorch-linux-focal-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
@ -490,6 +497,7 @@ docker build \
--build-arg "DOCS=${DOCS}" \
--build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \
--build-arg "EXECUTORCH=${EXECUTORCH}" \
--build-arg "HALIDE=${HALIDE}" \
--build-arg "XPU_VERSION=${XPU_VERSION}" \
--build-arg "ACL=${ACL:-}" \
--build-arg "SKIP_SCCACHE_INSTALL=${SKIP_SCCACHE_INSTALL:-}" \

View File

@ -0,0 +1 @@
340136fec6d3ebc73e7a19eba1663e9b0ba8ab2d

View File

@ -1 +1 @@
b8c64f64c18d8cac598b3adb355c21e7439c21de
aac14a3b93f11d781d1d5ebc5400b15ae8df5185

View File

@ -0,0 +1,46 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
COMMIT=$(get_pinned_commit halide)
test -n "$COMMIT"
# activate conda to populate CONDA_PREFIX
test -n "$ANACONDA_PYTHON_VERSION"
eval "$(conda shell.bash hook)"
conda activate py_$ANACONDA_PYTHON_VERSION
if [ -n "${UBUNTU_VERSION}" ];then
apt update
apt-get install -y lld liblld-15-dev libpng-dev libjpeg-dev libgl-dev \
libopenblas-dev libeigen3-dev libatlas-base-dev libzstd-dev
fi
conda_install numpy scipy imageio cmake ninja
git clone --depth 1 --branch release/16.x --recursive https://github.com/llvm/llvm-project.git
cmake -DCMAKE_BUILD_TYPE=Release \
-DLLVM_ENABLE_PROJECTS="clang" \
-DLLVM_TARGETS_TO_BUILD="X86;NVPTX" \
-DLLVM_ENABLE_TERMINFO=OFF -DLLVM_ENABLE_ASSERTIONS=ON \
-DLLVM_ENABLE_EH=ON -DLLVM_ENABLE_RTTI=ON -DLLVM_BUILD_32_BITS=OFF \
-S llvm-project/llvm -B llvm-build -G Ninja
cmake --build llvm-build
cmake --install llvm-build --prefix llvm-install
export LLVM_ROOT=`pwd`/llvm-install
export LLVM_CONFIG=$LLVM_ROOT/bin/llvm-config
git clone https://github.com/halide/Halide.git
pushd Halide
git checkout ${COMMIT} && git submodule update --init --recursive
pip_install -r requirements.txt
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -S . -B build
cmake --build build
test -e ${CONDA_PREFIX}/lib/python3 || ln -s python${ANACONDA_PYTHON_VERSION} ${CONDA_PREFIX}/lib/python3
cmake --install build --prefix ${CONDA_PREFIX}
chown -R jenkins ${CONDA_PREFIX}
popd
rm -rf Halide llvm-build llvm-project llvm-install
python -c "import halide" # check for errors

View File

@ -85,10 +85,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.9.0
mypy==1.10.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.9.0
#Pinned versions: 1.10.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8

View File

@ -103,6 +103,14 @@ COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
ARG HALIDE
# Build and install halide
COPY ./common/install_halide.sh install_halide.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/halide.txt halide.txt
RUN if [ -n "${HALIDE}" ]; then bash ./install_halide.sh; fi
RUN rm install_halide.sh common_utils.sh halide.txt
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH

View File

@ -155,6 +155,14 @@ COPY ci_commit_pins/executorch.txt executorch.txt
RUN if [ -n "${EXECUTORCH}" ]; then bash ./install_executorch.sh; fi
RUN rm install_executorch.sh common_utils.sh executorch.txt
ARG HALIDE
# Build and install halide
COPY ./common/install_halide.sh install_halide.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/halide.txt halide.txt
RUN if [ -n "${HALIDE}" ]; then bash ./install_halide.sh; fi
RUN rm install_halide.sh common_utils.sh halide.txt
ARG ONNX
# Install ONNX dependencies
COPY ./common/install_onnx.sh ./common/common_utils.sh ./

View File

@ -550,6 +550,11 @@ test_inductor_micro_benchmark() {
python benchmarks/gpt_fast/benchmark.py --output "${TEST_REPORTS_DIR}/gpt_fast_benchmark.csv"
}
test_inductor_halide() {
python test/run_test.py --include inductor/test_halide.py --verbose
assert_git_not_dirty
}
test_dynamo_benchmark() {
# Usage: test_dynamo_benchmark huggingface 0
TEST_REPORTS_DIR=$(pwd)/test/test-reports
@ -1242,6 +1247,8 @@ elif [[ "$TEST_CONFIG" == deploy ]]; then
test_torch_deploy
elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then
test_inductor_halide
elif [[ "${TEST_CONFIG}" == *inductor-micro-benchmark* ]]; then
test_inductor_micro_benchmark
elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then

View File

@ -14,12 +14,14 @@ runs:
- name: Cleans up diskspace
shell: bash
run: |
set -ex
diskspace_cutoff=${{ inputs.diskspace-cutoff }}
diskspace=$(df -H / --output=pcent | sed -n 2p | sed 's/%//' | sed 's/ //')
docker_root_dir=$(docker info -f '{{.DockerRootDir}}')
diskspace=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //')
msg="Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified"
if [[ "$diskspace" -ge "$diskspace_cutoff" ]] ; then
docker system prune -af
diskspace_new=$(df -H / --output=pcent | sed -n 2p | sed 's/%//' | sed 's/ //')
diskspace_new=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //')
if [[ "$diskspace_new" -gt "$diskspace_cutoff" ]] ; then
echo "Error: Available diskspace is less than $diskspace_cutoff percent. Not enough diskspace."
echo "$msg"

View File

@ -244,6 +244,7 @@
- torch/csrc/xpu/**
- torch/xpu/**
- test/xpu/**
- test/test_xpu.py
- third_party/xpu.txt
- .ci/docker/ci_commit_pins/triton-xpu.txt
approved_by:

View File

@ -6,9 +6,9 @@ from github import Auth, Github
from github.Issue import Issue
WORKFLOW_TYPE_LABEL = "label"
WORKFLOW_TYPE_RG = "rg"
WORKFLOW_TYPE_BOTH = "both"
WORKFLOW_LABEL_META = "" # use meta runners
WORKFLOW_LABEL_LF = "lf." # use runners from the linux foundation
LABEL_TYPE_KEY = "label_type"
def parse_args() -> Any:
@ -49,47 +49,44 @@ def is_exception_branch(branch: str) -> bool:
def get_workflow_type(issue: Issue, username: str) -> str:
user_list = issue.get_comments()[0].body.split("\r\n")
try:
run_option = issue.get_comments()[1].body.split("\r\n")[0]
except Exception as e:
run_option = "single"
user_list = issue.get_comments()[0].body.split()
if user_list[0] == "!":
# Use old runners for everyone
return WORKFLOW_TYPE_LABEL
elif user_list[1] == "*":
if run_option == WORKFLOW_TYPE_BOTH:
# Use ARC runners and old runners for everyone
return WORKFLOW_TYPE_BOTH
if user_list[0] == "!":
print("LF Workflows are disabled for everyone. Using meta runners.")
return WORKFLOW_LABEL_META
elif user_list[0] == "*":
print("LF Workflows are enabled for everyone. Using LF runners.")
return WORKFLOW_LABEL_LF
elif username in user_list:
print(f"LF Workflows are enabled for {username}. Using LF runners.")
return WORKFLOW_LABEL_LF
else:
# Use only ARC runners for everyone
return WORKFLOW_TYPE_RG
elif username in user_list:
if run_option == WORKFLOW_TYPE_BOTH:
# Use ARC runners and old runners for a specific user
return WORKFLOW_TYPE_BOTH
else:
# Use only ARC runners for a specific user
return WORKFLOW_TYPE_RG
else:
# Use old runners by default
return WORKFLOW_TYPE_LABEL
print(f"LF Workflows are disabled for {username}. Using meta runners.")
return WORKFLOW_LABEL_META
except Exception as e:
print(
f"Failed to get determine workflow type. Falling back to meta runners. Exception: {e}"
)
return WORKFLOW_LABEL_META
def main() -> None:
args = parse_args()
if is_exception_branch(args.github_branch):
output = {"workflow_type": WORKFLOW_TYPE_LABEL}
print(f"Exception branch: '{args.github_branch}', using meta runners")
output = {LABEL_TYPE_KEY: WORKFLOW_LABEL_META}
else:
try:
gh = get_gh_client(args.github_token)
# The default issue we use - https://github.com/pytorch/test-infra/issues/5132
issue = get_issue(gh, args.github_repo, args.github_issue)
output = {"workflow_type": get_workflow_type(issue, args.github_user)}
output = {LABEL_TYPE_KEY: get_workflow_type(issue, args.github_user)}
except Exception as e:
output = {"workflow_type": WORKFLOW_TYPE_LABEL}
print(f"Failed to get issue. Falling back to meta runners. Exception: {e}")
output = {LABEL_TYPE_KEY: WORKFLOW_LABEL_META}
json_output = json.dumps(output)
print(json_output)

View File

@ -29,6 +29,7 @@ python3 -m tools.pyi.gen_pyi \
--native-functions-path aten/src/ATen/native/native_functions.yaml \
--tags-path aten/src/ATen/native/tags.yaml \
--deprecated-functions-path "tools/autograd/deprecated.yaml"
python3 torch/utils/data/datapipes/gen_pyi.py
RC=0
# Run lintrunner on all files

View File

@ -180,6 +180,9 @@ def mock_gh_get_info() -> Any:
return {
"closed": False,
"isCrossRepository": False,
"headRefName": "foo",
"baseRefName": "bar",
"baseRepository": {"defaultBranchRef": {"name": "bar"}},
"files": {"nodes": [], "pageInfo": {"hasNextPage": False}},
"changedFiles": 0,
}

View File

@ -2330,6 +2330,15 @@ def main() -> None:
dry_run=args.dry_run,
)
return
if not pr.is_ghstack_pr() and pr.base_ref() != pr.default_branch():
gh_post_pr_comment(
org,
project,
args.pr_num,
f"PR targets {pr.base_ref()} rather than {pr.default_branch()}, refusing merge request",
dry_run=args.dry_run,
)
return
if args.check_mergeability:
if pr.is_ghstack_pr():

View File

@ -15,17 +15,20 @@ on:
required: false
type: string
default: "5132"
description: |
Fetch's GitHub Issue from pytorch/test-infra
Example: https://github.com/pytorch/test-infra/issues/5132
outputs:
workflow-type:
label-type:
description: Type of runners to use
value: ${{ jobs.runner-determinator.outputs.workflow-type }}
value: ${{ jobs.runner-determinator.outputs.label-type }}
jobs:
runner-determinator:
runs-on: linux.4xlarge
outputs:
workflow-type: ${{ steps.set-condition.outputs.workflow-type }}
label-type: ${{ steps.set-condition.outputs.label-type }}
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
ISSUE_NUMBER: ${{ inputs.issue_number }}
@ -54,5 +57,5 @@ jobs:
echo "Output: '${output}'"
WORKFLOW_TYPE=$(echo "${output}" | jq -r '.workflow_type')
echo "workflow-type=$WORKFLOW_TYPE" >> "$GITHUB_OUTPUT"
LABEL_TYPE=$(echo "${output}" | jq -r '.label_type')
echo "label-type=$LABEL_TYPE" >> "$GITHUB_OUTPUT"

View File

@ -54,6 +54,7 @@ jobs:
pytorch-linux-focal-py3-clang9-android-ndk-r21e,
pytorch-linux-jammy-py3.8-gcc11,
pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks,
pytorch-linux-jammy-py3.12-halide,
pytorch-linux-jammy-xpu-2024.0-py3,
pytorch-linux-jammy-py3-clang15-asan,
pytorch-linux-focal-py3-clang10-onnx,

View File

@ -56,3 +56,29 @@ jobs:
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-periodic-dynamo-benchmarks-build.outputs.test-matrix }}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
{ config: "inductor_torchbench_smoketest_perf", shard: 1, num_shards: 1, runner: "linux.gcp.a100" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-test-gcp:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}

View File

@ -81,32 +81,6 @@ jobs:
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks
cuda-arch-list: '8.0'
test-matrix: |
{ include: [
{ config: "inductor_torchbench_smoketest_perf", shard: 1, num_shards: 1, runner: "linux.gcp.a100" },
]}
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_1-py3_10-gcc9-inductor-test-gcp:
name: cuda12.1-py3.10-gcc9-sm80
uses: ./.github/workflows/_linux-test.yml
needs: linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp
with:
build-environment: linux-focal-cuda12.1-py3.10-gcc9-sm80
docker-image: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_10-gcc9-inductor-build-gcp.outputs.test-matrix }}
use-gha: anything-non-empty-to-use-gha
secrets:
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
linux-focal-cuda12_1-py3_12-gcc9-inductor-build:
name: cuda12.1-py3.12-gcc9-sm86
uses: ./.github/workflows/_linux-build.yml
@ -128,6 +102,26 @@ jobs:
docker-image: ${{ needs.linux-focal-cuda12_1-py3_12-gcc9-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda12_1-py3_12-gcc9-inductor-build.outputs.test-matrix }}
linux-jammy-cpu-py3_12-inductor-halide-build:
name: linux-jammy-cpu-py3.12-gcc11-inductor-halide
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-jammy-py3.12-gcc11
docker-image-name: pytorch-linux-jammy-py3.12-halide
test-matrix: |
{ include: [
{ config: "inductor-halide", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
]}
linux-jammy-cpu-py3_12-inductor-halide-test:
name: linux-jammy-cpu-py3.12-gcc11-inductor-halide
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-cpu-py3_12-inductor-halide-build
with:
build-environment: linux-jammy-py3.12-gcc11
docker-image: ${{ needs.linux-jammy-cpu-py3_12-inductor-halide-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cpu-py3_12-inductor-halide-build.outputs.test-matrix }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-build:
# Should be synced with the one in inductor-periodic.yml but this only runs inductor_timm
name: cuda12.4-py3.10-gcc9-sm86

View File

@ -136,7 +136,7 @@ init_command = [
'numpy==1.24.3 ; python_version == "3.8"',
'numpy==1.26.0 ; python_version >= "3.9"',
'expecttest==0.1.6',
'mypy==1.9.0',
'mypy==1.10.0',
'sympy==1.11.1',
'types-requests==2.27.25',
'types-PyYAML==6.0.7',
@ -216,7 +216,6 @@ exclude_patterns = [
'c10/util/complex_math.h',
'c10/util/complex_utils.h',
'c10/util/flat_hash_map.h',
'c10/util/Float8*.h',
'c10/util/logging*.h',
'c10/util/hash.h',
'c10/util/strong_type.h',
@ -999,7 +998,6 @@ command = [
]
exclude_patterns = [
'tools/gen_vulkan_spv.py',
'torch/__init__.py', # Skip this file to format because it's part of the public API
# We don't care too much about files in this directory, don't enforce
# formatting on them
'caffe2/**/*.py',
@ -1099,7 +1097,6 @@ exclude_patterns = [
'test/test_namedtuple_return_api.py',
'test/test_native_functions.py',
'test/test_native_mha.py',
'test/test_nestedtensor.py',
'test/test_nn.py',
'test/test_out_dtype_op.py',
'test/test_overrides.py',
@ -1558,7 +1555,6 @@ exclude_patterns = [
'torch/distributed/tensor/parallel/style.py',
'torch/fft/__init__.py',
'torch/func/__init__.py',
'torch/functional.py',
'torch/futures/__init__.py',
'torch/fx/__init__.py',
'torch/fx/_compatibility.py',
@ -1644,8 +1640,6 @@ exclude_patterns = [
'torch/fx/subgraph_rewriter.py',
'torch/fx/tensor_type.py',
'torch/fx/traceback.py',
'torch/hub.py',
'torch/library.py',
'torch/linalg/__init__.py',
'torch/monitor/__init__.py',
'torch/nested/__init__.py',
@ -1745,35 +1739,6 @@ exclude_patterns = [
'torch/nn/quantized/modules/normalization.py',
'torch/nn/quantized/modules/rnn.py',
'torch/nn/quantized/modules/utils.py',
'torch/nn/utils/__init__.py',
'torch/nn/utils/_deprecation_utils.py',
'torch/nn/utils/_expanded_weights/__init__.py',
'torch/nn/utils/_expanded_weights/conv_expanded_weights.py',
'torch/nn/utils/_expanded_weights/conv_utils.py',
'torch/nn/utils/_expanded_weights/embedding_expanded_weights.py',
'torch/nn/utils/_expanded_weights/expanded_weights_impl.py',
'torch/nn/utils/_expanded_weights/expanded_weights_utils.py',
'torch/nn/utils/_expanded_weights/group_norm_expanded_weights.py',
'torch/nn/utils/_expanded_weights/instance_norm_expanded_weights.py',
'torch/nn/utils/_expanded_weights/layer_norm_expanded_weights.py',
'torch/nn/utils/_expanded_weights/linear_expanded_weights.py',
'torch/nn/utils/_per_sample_grad.py',
'torch/nn/utils/clip_grad.py',
'torch/nn/utils/convert_parameters.py',
'torch/nn/utils/fusion.py',
'torch/nn/utils/init.py',
'torch/nn/utils/memory_format.py',
'torch/nn/utils/parametrizations.py',
'torch/nn/utils/parametrize.py',
'torch/nn/utils/prune.py',
'torch/nn/utils/rnn.py',
'torch/nn/utils/spectral_norm.py',
'torch/nn/utils/weight_norm.py',
'torch/overrides.py',
'torch/quasirandom.py',
'torch/random.py',
'torch/return_types.py',
'torch/serialization.py',
'torch/signal/__init__.py',
'torch/signal/windows/__init__.py',
'torch/signal/windows/windows.py',

View File

@ -461,7 +461,6 @@ filegroup(
filegroup(
name = "caffe2_perfkernels_srcs",
srcs = [
"caffe2/perfkernels/embedding_lookup.cc",
"caffe2/perfkernels/embedding_lookup_idx.cc",
],
)
@ -499,7 +498,6 @@ cc_library(
hdrs = [
"caffe2/core/common.h",
"caffe2/perfkernels/common.h",
"caffe2/perfkernels/embedding_lookup.h",
"caffe2/perfkernels/embedding_lookup_idx.h",
"caffe2/utils/fixed_divisor.h",
] + glob([

View File

@ -57,6 +57,7 @@ nn/qat/ @jerryzh168
/.ci/docker/ @jeffdaily
/.ci/docker/ci_commit_pins/triton.txt @desertfire @Chillee @eellison @shunting314 @bertmaher @jeffdaily @jataylo @jithunnair-amd @pruthvistony
/.ci/docker/ci_commit_pins/triton-rocm.txt @jeffdaily @jataylo @jithunnair-amd @pruthvistony
/.ci/docker/ci_commit_pins/triton-xpu.txt @EikanWang @gujinghui
# Github Actions
# This list is for people wanting to be notified every time there's a change
@ -132,6 +133,15 @@ caffe2/operators/hip @jeffdaily @jithunnair-amd
caffe2/operators/rnn/hip @jeffdaily @jithunnair-amd
caffe2/utils/hip @jeffdaily @jithunnair-amd
# XPU-specific files
/aten/src/ATen/xpu/ @EikanWang @gujinghui
/c10/xpu/ @EikanWang @gujinghui
/torch/csrc/xpu/ @EikanWang @gujinghui
/torch/xpu/ @EikanWang @gujinghui
/test/xpu/ @EikanWang @gujinghui
/test/test_xpu.py @EikanWang @gujinghui
/third_party/xpu.txt @EikanWang @gujinghui
# torch.export
/torch/export/ @avikchaudhuri @gmagogsfm @tugsbayasgalan @zhxchen17
/torch/_export/ @avikchaudhuri @gmagogsfm @tugsbayasgalan @zhxchen17

View File

@ -6,7 +6,7 @@
- [Untrusted inputs](#untrusted-inputs)
- [Data privacy](#data-privacy)
- [Using distributed features](#using-distributed-features)
- [**CI/CD security principles**](#cicd-security-principles)
## Reporting Security Issues
Beware that none of the topics under [Using Pytorch Securely](#using-pytorch-securely) are considered vulnerabilities of Pytorch.
@ -61,3 +61,27 @@ If applicable, prepare your model against bad inputs and prompt injections. Some
PyTorch can be used for distributed computing, and as such there is a `torch.distributed` package. PyTorch Distributed features are intended for internal communication only. They are not built for use in untrusted environments or networks.
For performance reasons, none of the PyTorch Distributed primitives (including c10d, RPC, and TCPStore) include any authorization protocol and will send messages unencrypted. They accept connections from anywhere, and execute the workload sent without performing any checks. Therefore, if you run a PyTorch Distributed program on your network, anybody with access to the network can execute arbitrary code with the privileges of the user running PyTorch.
## CI/CD security principles
_Audience_: Contributors and reviewers, especially if modifying the workflow files/build system.
PyTorch CI/CD security philosophy is based on finding a balance between open and transparent CI pipelines while keeping the environment efficient and safe.
PyTorch testing requirements are complex, and a large part of the code base can only be tested on specialized powerful hardware, such as GPU, making it a lucrative target for resource misuse. To prevent this, we require workflow run approval for PRs from non-member contributors. To keep the volume of those approvals relatively low, we easily extend write permissions to the repository to regular contributors.
More widespread write access to the repo presents challenges when it comes to reviewing changes, merging code into trunk, and creating releases. [Protected branches](https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/about-protected-branches) are used to restrict the ability to merge to the trunk/release branches only to the repository administrators and merge bot. The merge bot is responsible for mechanistically merging the change and validating reviews against the path-based rules defined in [merge_rules.yml](https://github.com/pytorch/pytorch/blob/main/.github/merge_rules.yaml). Once a PR has been reviewed by person(s) mentioned in these rules, leaving a `@pytorchbot merge` comment on the PR will initiate the merge process. To protect merge bot credentials from leaking, merge actions must be executed only on ephemeral runners (see definition below) using a specialized deployment environment.
To speed up the CI system, build steps of the workflow rely on a distributed caching mechanism backed by [sccache](https://github.com/mozilla/sccache), making them susceptible to cache-corruption compromises. For that reason, binary artifacts generated during CI should not be executed in an environment that has access to any sensitive/non-public information, and should not be published for use by a general audience. One should not have any expectation about the lifetime of those artifacts, although in practice they likely remain accessible for about two weeks after the PR has been closed.
To speed up CI system setup, PyTorch relies heavily on Docker to pre-build and pre-install the dependencies. To prevent a potentially malicious PR from altering images that were published in the past, ECR has been configured to use immutable tags.
To improve runner availability and resource utilization, some of the CI runners are non-ephemeral, i.e., workflow steps from completely unrelated PRs could be scheduled sequentially on the same runner, making them susceptible to reverse-shell attacks. For that reason, PyTorch does not rely on the repository secrets mechanism, as these could easily be compromised in such attacks.
### Release pipelines security
To ensure safe binary releases, PyTorch release pipelines are built on the following principles:
- All binary builds/upload jobs must be run on ephemeral runners, i.e., on a machine that is allocated from the cloud to do the build and released back to the cloud after the build is finished. This protects those builds from interference by external actors, who could potentially gain reverse-shell access to a non-ephemeral runner and wait there for a binary build.
- All binary builds are cold-start builds, i.e., distributed caching/incremental builds are not permitted. This renders builds much slower than incremental CI builds but isolates them from potential compromises of the intermediate artifacts caching systems.
- All upload jobs are executed in [deployment environments](https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment) that are restricted to protected branches.
- Security credentials needed to upload binaries to PyPI/conda or the stable indexes at `download.pytorch.org/whl` are never uploaded to repo secrets storage/environments. This requires an extra manual step to publish the release but ensures that access to those credentials cannot be compromised by deliberate/accidental leaks of secrets stored in the cloud.
- No binary artifacts should be published to GitHub releases pages, as these are overwritable by anyone with write permission to the repo.

View File

@ -364,7 +364,7 @@ class TORCH_API Context {
bool enabled_flashSDP = true;
bool enabled_mem_efficientSDP = true;
bool enabled_mathSDP = true;
bool enabled_cudnnSDP = true;
bool enabled_cudnnSDP = false;
#ifdef USE_ROCM
bool benchmark_cudnn = true;
#else

View File

@ -462,7 +462,7 @@ inline Tensor _sum_to(
reduce_dims.push_back(i);
}
for (int64_t i = leading_dims; i < static_cast<int64_t>(sizes.size()); ++i) {
if (shape[i - leading_dims] == 1 &&
if (TORCH_GUARD_SIZE_OBLIVIOUS(sym_eq(shape[i - leading_dims], 1)) &&
TORCH_GUARD_SIZE_OBLIVIOUS(sym_ne(sizes[i], 1))) {
reduce_dims.push_back(i);
}

View File

@ -35,6 +35,12 @@ void SavedTensorDefaultHooks::enable() {
tls.disabled_error_message = c10::nullopt;
}
/* static */ bool SavedTensorDefaultHooks::set_tracing(bool is_tracing) {
bool prior = tls.is_tracing;
tls.is_tracing = is_tracing;
return prior;
}
const std::optional<std::string>& SavedTensorDefaultHooks::get_disabled_error_message() {
return tls.disabled_error_message;
}
@ -59,25 +65,20 @@ void SavedTensorDefaultHooks::push_hooks(PyObject* pack_hook, PyObject* unpack_h
tls.stack.emplace(pack_hook, unpack_hook);
}
void SavedTensorDefaultHooks::pop_hooks() {
std::pair<PyObject*, PyObject*> SavedTensorDefaultHooks::pop_hooks() {
// Reference counting is handled by the caller of `pop_hooks`
TORCH_INTERNAL_ASSERT(is_initialized && !tls.stack.empty());
std::pair<PyObject*, PyObject*> hooks = tls.stack.top();
tls.stack.pop();
return hooks;
}
std::pair<PyObject*, PyObject*> SavedTensorDefaultHooks::get_hooks() {
if (!is_initialized || tls.stack.empty()) {
// For tls.is_tracing, see NOTE: [Deferring tensor pack/unpack hooks until runtime]
if (!is_initialized || tls.stack.empty() || tls.is_tracing) {
return std::make_pair(nullptr, nullptr);
}
return tls.stack.top();
}
std::stack<std::pair<PyObject*, PyObject*>> SavedTensorDefaultHooks::get_stack() {
return tls.stack;
}
void SavedTensorDefaultHooks::set_stack(std::stack<std::pair<PyObject*, PyObject*>> stack_) {
tls.stack = std::move(stack_);
}
}

View File

@ -22,17 +22,18 @@ struct TORCH_API SavedTensorDefaultHooksTLS {
// We did this for efficiency (so we didn't have to keep a separate bool
// around)
std::optional<std::string> disabled_error_message;
// See NOTE: [Deferring tensor pack/unpack hooks until runtime]
bool is_tracing = false;
};
} // namespace impl
struct TORCH_API SavedTensorDefaultHooks {
static void push_hooks(PyObject* pack_hook, PyObject* unpack_hook);
static void pop_hooks();
static std::pair<PyObject*, PyObject*> pop_hooks();
static std::pair<PyObject*, PyObject*> get_hooks();
static void lazy_initialize();
static std::stack<std::pair<PyObject*, PyObject*>> get_stack();
static void set_stack(std::stack<std::pair<PyObject*, PyObject*>>);
static const impl::SavedTensorDefaultHooksTLS& get_tls_state();
static void set_tls_state(const impl::SavedTensorDefaultHooksTLS& tls);
@ -42,11 +43,20 @@ struct TORCH_API SavedTensorDefaultHooks {
// hooks, especially if their feature does not work with it. If they are
// disabled, then the following will raise an error:
// - Attempting to push_hooks
// - calling disable(message) with a non-zero stack (from get_stack) size
// - calling disable(message) with a non-zero stack (hooks) size
static void disable(const std::string& error_message);
static void enable();
static bool is_enabled();
static const std::optional<std::string>& get_disabled_error_message();
// NOTE: [Deferring tensor pack/unpack hooks until runtime]
// To preserve eager semantics of pack/unpack hooks firing only once per saved
// variable, Dynamo/AOTAutograd need to defer hook firing until runtime. Using
// disable() would raise a loud error at trace time, and pushing a no-op hook would
// fail when the traced code is wrapped in a disable_saved_tensors_hooks ctx.
// To do so, we disable these hooks during tracing. See
// https://github.com/pytorch/pytorch/issues/113263.
static bool set_tracing(bool is_tracing);
};
} // namespace at
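
A minimal sketch of how a tracer could consume the set_tracing API declared above, written as an RAII guard. This is an illustration only, not part of the diff; the guard name is hypothetical and the include path is assumed from the header being modified here.

// Hypothetical RAII guard (illustration only): defer pack/unpack hooks for the
// guard's lifetime and restore the prior tracing state afterwards, per
// NOTE: [Deferring tensor pack/unpack hooks until runtime].
#include <ATen/SavedTensorHooks.h>

struct DeferSavedTensorHooksGuard {
  DeferSavedTensorHooksGuard()
      : prior_(at::SavedTensorDefaultHooks::set_tracing(/*is_tracing=*/true)) {}
  ~DeferSavedTensorHooksGuard() {
    at::SavedTensorDefaultHooks::set_tracing(prior_);
  }
  bool prior_;
};

// While such a guard is alive, get_hooks() returns {nullptr, nullptr}, so saved
// variables are packed without firing user hooks during tracing; note that in
// this diff only get_hooks() consults tls.is_tracing, while push_hooks/pop_hooks
// still operate on the stack as before.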

View File

@ -478,8 +478,6 @@ namespace impl {
// (maybe except for some internal prim ops).
using GenericList = List<IValue>;
const IValue* ptr_to_first_element(const GenericList& list);
}
}

View File

@ -350,11 +350,4 @@ void List<T>::unsafeSetElementType(TypePtr t) {
impl_->elementType = std::move(t);
}
namespace impl {
inline const IValue* ptr_to_first_element(const GenericList& list) {
return &list.impl_->list[0];
}
}
}

View File

@ -20,7 +20,7 @@ bool is_cpu_support_avx512() {
#endif
}
bool is_cpu_support_vnni() {
bool is_cpu_support_avx512_vnni() {
#if !defined(__s390x__) && !defined(__powerpc__)
return cpuinfo_initialize() && cpuinfo_has_x86_avx512vnni();
#else

View File

@ -8,6 +8,6 @@ TORCH_API bool is_cpu_support_avx2();
TORCH_API bool is_cpu_support_avx512();
// Detect if CPU support Vector Neural Network Instruction.
TORCH_API bool is_cpu_support_vnni();
TORCH_API bool is_cpu_support_avx512_vnni();
} // namespace at::cpu

View File

@ -81,7 +81,8 @@ struct GemmParams : OpParams {
}
std::string Signature() const override {
return c10::str(transa, transb, "_", m, "_", n, "_", k);
static std::string val = c10::str(transa, transb, "_", m, "_", n, "_", k);
return val;
}
size_t GetSize(bool duplicate_inputs) const {
@ -143,6 +144,73 @@ private:
bool duplicate_inputs_;
};
template <typename T>
struct GemmAndBiasParams : OpParams {
std::string Signature() const override {
static std::string val = c10::str(transa, transb, "_", m, "_", n, "_", k);
return val;
}
size_t GetSize(bool duplicate_inputs) const {
size_t size = sizeof(T) * ldc * n;
if (duplicate_inputs) {
size += sizeof(T) * lda * ((transa == 'n' || transa == 'N') ? k : m);
size += sizeof(T) * ldb * ((transb == 'n' || transb == 'N') ? n : k);
}
return size;
}
GemmAndBiasParams* DeepCopy(bool duplicate_inputs) const {
GemmAndBiasParams* copy = new GemmAndBiasParams;
*copy = *this;
c10::DeviceIndex device = 0;
AT_CUDA_CHECK(c10::cuda::GetDevice(&device));
size_t c_size = ldc * n * sizeof(T);
copy->c = static_cast<T*>(c10::cuda::CUDACachingAllocator::raw_alloc(c_size));
AT_CUDA_CHECK(c10::cuda::CUDACachingAllocator::memcpyAsync(
copy->c, device, c, device, c_size, getCurrentCUDAStream(device), true));
if (duplicate_inputs) {
size_t a_size = sizeof(T) * lda * ((transa == 'n' || transa == 'N') ? k : m);
size_t b_size = sizeof(T) * ldb * ((transb == 'n' || transb == 'N') ? n : k);
copy->a = static_cast<const T*>(c10::cuda::CUDACachingAllocator::raw_alloc(a_size));
copy->b = static_cast<const T*>(c10::cuda::CUDACachingAllocator::raw_alloc(b_size));
copy->duplicate_inputs_ = true;
}
return copy;
}
// only call on object returned by DeepCopy
void Delete() {
c10::cuda::CUDACachingAllocator::raw_delete(c);
if (duplicate_inputs_) {
c10::cuda::CUDACachingAllocator::raw_delete(const_cast<T*>(a));
c10::cuda::CUDACachingAllocator::raw_delete(const_cast<T*>(b));
}
}
TuningStatus NumericalCheck(GemmAndBiasParams<T> *other) {
auto c_dtype = c10::CppTypeToScalarType<T>::value;
return detail::NumericalCheck(c_dtype, c, other->c, ldc*n) ? OK : FAIL;
}
char transa;
char transb;
int64_t m;
int64_t n;
int64_t k;
at::opmath_type<T> alpha;
const T* a;
int64_t lda;
const T* b;
int64_t ldb;
T* c;
int64_t ldc;
const T* bias;
at::cuda::blas::GEMMAndBiasActivationEpilogue activation;
private:
bool duplicate_inputs_;
};
template <typename T>
struct GemmStridedBatchedParams : OpParams {
GemmStridedBatchedParams() {
@ -150,7 +218,8 @@ struct GemmStridedBatchedParams : OpParams {
}
std::string Signature() const override {
return c10::str(transa, transb, "_", m, "_", n, "_", k, "_B_", batch);
static std::string val = c10::str(transa, transb, "_", m, "_", n, "_", k, "_B_", batch);
return val;
}
size_t GetSize(bool duplicate_inputs) const {
@ -223,7 +292,8 @@ struct ScaledGemmParams : OpParams {
}
std::string Signature() const override {
return c10::str(transa, transb, "_", m, "_", n, "_", k);
static std::string val = c10::str(transa, transb, "_", m, "_", n, "_", k);
return val;
}
size_t GetSize(bool duplicate_inputs) const {

View File

@ -25,35 +25,35 @@
namespace at::cuda::tunable {
template <typename T>
constexpr hipblasDatatype_t HipBlasDataTypeFor();
constexpr hipblasDatatype_t HipDataTypeFor();
template <>
constexpr hipblasDatatype_t HipBlasDataTypeFor<float>() {
return HIPBLAS_R_32F;
constexpr hipblasDatatype_t HipDataTypeFor<float>() {
return HIP_R_32F;
}
template <>
constexpr hipblasDatatype_t HipBlasDataTypeFor<Half>() {
return HIPBLAS_R_16F;
constexpr hipblasDatatype_t HipDataTypeFor<Half>() {
return HIP_R_16F;
}
template <>
constexpr hipblasDatatype_t HipBlasDataTypeFor<BFloat16>() {
return HIPBLAS_R_16B;
constexpr hipblasDatatype_t HipDataTypeFor<BFloat16>() {
return HIP_R_16BF;
}
template <>
constexpr hipblasDatatype_t HipBlasDataTypeFor<double>() {
return HIPBLAS_R_64F;
constexpr hipblasDatatype_t HipDataTypeFor<double>() {
return HIP_R_64F;
}
template <>
constexpr hipblasDatatype_t HipBlasDataTypeFor<c10::Float8_e4m3fnuz>() {
constexpr hipblasDatatype_t HipDataTypeFor<c10::Float8_e4m3fnuz>() {
return HIP_R_8F_E4M3_FNUZ;
}
template <>
constexpr hipblasDatatype_t HipBlasDataTypeFor<c10::Float8_e5m2fnuz>() {
constexpr hipblasDatatype_t HipDataTypeFor<c10::Float8_e5m2fnuz>() {
return HIP_R_8F_E5M2_FNUZ;
}
@ -62,6 +62,11 @@ int GetBatchFromParams(const GemmParams<T>* params) {
return 1;
}
template <typename T>
int GetBatchFromParams(const GemmAndBiasParams<T>* params) {
return 1;
}
template <typename T>
int GetBatchFromParams(const GemmStridedBatchedParams<T>* params) {
return params->batch;
@ -77,6 +82,11 @@ int GetStrideAFromParams(const GemmParams<T>* params) {
return 1;
}
template <typename T>
int GetStrideAFromParams(const GemmAndBiasParams<T>* params) {
return 1;
}
template <typename T>
int GetStrideAFromParams(const GemmStridedBatchedParams<T>* params) {
return params->stride_a;
@ -92,6 +102,11 @@ int GetStrideBFromParams(const GemmParams<T>* params) {
return 1;
}
template <typename T>
int GetStrideBFromParams(const GemmAndBiasParams<T>* params) {
return 1;
}
template <typename T>
int GetStrideBFromParams(const GemmStridedBatchedParams<T>* params) {
return params->stride_b;
@ -107,6 +122,11 @@ int GetStrideCFromParams(const GemmParams<T>* params) {
return 1;
}
template <typename T>
int GetStrideCFromParams(const GemmAndBiasParams<T>* params) {
return 1;
}
template <typename T>
int GetStrideCFromParams(const GemmStridedBatchedParams<T>* params) {
return params->stride_c;
@ -122,6 +142,11 @@ float GetAlphaFromParams(const GemmParams<T>* params) {
return params->alpha;
}
template <typename T>
float GetAlphaFromParams(const GemmAndBiasParams<T>* params) {
return params->alpha;
}
template <typename T>
float GetAlphaFromParams(const GemmStridedBatchedParams<T>* params) {
return params->alpha;
@ -137,6 +162,11 @@ float GetBetaFromParams(const GemmParams<T>* params) {
return params->beta;
}
template <typename T>
float GetBetaFromParams(const GemmAndBiasParams<T>* params) {
return 0.0;
}
template <typename T>
float GetBetaFromParams(const GemmStridedBatchedParams<T>* params) {
return params->beta;
@ -152,6 +182,11 @@ const void* GetAScalePointerFromParams(const GemmParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetAScalePointerFromParams(const GemmAndBiasParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetAScalePointerFromParams(const GemmStridedBatchedParams<T>* params) {
return nullptr;
@ -167,6 +202,11 @@ const void* GetBScalePointerFromParams(const GemmParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetBScalePointerFromParams(const GemmAndBiasParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetBScalePointerFromParams(const GemmStridedBatchedParams<T>* params) {
return nullptr;
@ -182,6 +222,11 @@ const void* GetDScalePointerFromParams(const GemmParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetDScalePointerFromParams(const GemmAndBiasParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetDScalePointerFromParams(const GemmStridedBatchedParams<T>* params) {
return nullptr;
@ -197,6 +242,11 @@ const void* GetBiasPointerFromParams(const GemmParams<T>* params) {
return nullptr;
}
template <typename T>
const void* GetBiasPointerFromParams(const GemmAndBiasParams<T>* params) {
return params->bias;
}
template <typename T>
const void* GetBiasPointerFromParams(const GemmStridedBatchedParams<T>* params) {
return nullptr;
@ -212,6 +262,11 @@ hipDataType GetBiasTypeFromParams(const GemmParams<T>* params) {
return HIP_R_32F;
}
template <typename T>
hipDataType GetBiasTypeFromParams(const GemmAndBiasParams<T>* params) {
return HipDataTypeFor<T>();
}
template <typename T>
hipDataType GetBiasTypeFromParams(const GemmStridedBatchedParams<T>* params) {
return HIP_R_32F;
@ -222,6 +277,26 @@ hipDataType GetBiasTypeFromParams(const ScaledGemmParams<T>* params) {
return at::cuda::ScalarTypeToCudaDataType(params->bias_dtype);
}
template <typename T>
at::cuda::blas::GEMMAndBiasActivationEpilogue GetActivationFromParams(const GemmParams<T>* params) {
return at::cuda::blas::GEMMAndBiasActivationEpilogue::None;
}
template <typename T>
at::cuda::blas::GEMMAndBiasActivationEpilogue GetActivationFromParams(const GemmAndBiasParams<T>* params) {
return params->activation;
}
template <typename T>
at::cuda::blas::GEMMAndBiasActivationEpilogue GetActivationFromParams(const GemmStridedBatchedParams<T>* params) {
return at::cuda::blas::GEMMAndBiasActivationEpilogue::None;
}
template <typename T>
at::cuda::blas::GEMMAndBiasActivationEpilogue GetActivationFromParams(const ScaledGemmParams<T>* params) {
return at::cuda::blas::GEMMAndBiasActivationEpilogue::None;
}
static hipblasOperation_t _hipblasOpFromChar(char op) {
switch (op) {
case 'n':
@ -327,9 +402,9 @@ class HipblasltGemmOp : public Callable<ParamsT> {
TuningStatus Call(const ParamsT* params) override {
hipblasOperation_t transa_outer = MapLayoutToHipBlasLt(ALayout);
hipblasOperation_t transb_outer = MapLayoutToHipBlasLt(BLayout);
auto a_datatype = HipBlasDataTypeFor<AT>();
auto b_datatype = HipBlasDataTypeFor<BT>();
auto in_out_datatype = HipBlasDataTypeFor<CT>();
auto a_datatype = HipDataTypeFor<AT>();
auto b_datatype = HipDataTypeFor<BT>();
auto in_out_datatype = HipDataTypeFor<CT>();
auto opa = _hipblasOpFromChar(params->transa);
auto opb = _hipblasOpFromChar(params->transb);
@ -385,13 +460,22 @@ class HipblasltGemmOp : public Callable<ParamsT> {
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_A_SCALE_POINTER, mat1_scale_ptr);
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_B_SCALE_POINTER, mat2_scale_ptr);
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_D_SCALE_POINTER, result_scale_ptr);
}
const void* bias_ptr = GetBiasPointerFromParams<CT>(params);
auto bias_datatype = GetBiasTypeFromParams<CT>(params);
if (bias_ptr) {
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_BIAS_POINTER, bias_ptr);
const void* bias_ptr = GetBiasPointerFromParams<CT>(params);
auto bias_datatype = GetBiasTypeFromParams<CT>(params);
if (bias_ptr) {
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_BIAS_POINTER, bias_ptr);
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_BIAS_DATA_TYPE, bias_datatype);
auto activation = GetActivationFromParams<CT>(params);
if (activation == at::cuda::blas::GEMMAndBiasActivationEpilogue::RELU) {
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_EPILOGUE, HIPBLASLT_EPILOGUE_RELU_BIAS);
}
else if (activation == at::cuda::blas::GEMMAndBiasActivationEpilogue::GELU) {
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_EPILOGUE, HIPBLASLT_EPILOGUE_GELU_BIAS);
}
else {
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_EPILOGUE, HIPBLASLT_EPILOGUE_BIAS);
matmul.setAttribute(HIPBLASLT_MATMUL_DESC_BIAS_DATA_TYPE, bias_datatype);
}
}
@ -460,9 +544,9 @@ template <typename AT, typename BT, typename CT, BlasOp ALayout, BlasOp BLayout,
auto GetHipBlasLtTypeStringAndOps() {
hipblasOperation_t transa_outer = MapLayoutToHipBlasLt(ALayout);
hipblasOperation_t transb_outer = MapLayoutToHipBlasLt(BLayout);
auto a_datatype = HipBlasDataTypeFor<AT>();
auto b_datatype = HipBlasDataTypeFor<BT>();
auto in_out_datatype = HipBlasDataTypeFor<CT>();
auto a_datatype = HipDataTypeFor<AT>();
auto b_datatype = HipDataTypeFor<BT>();
auto in_out_datatype = HipDataTypeFor<CT>();
std::vector<hipblasLtMatmulHeuristicResult_t> heuristic_result;
hipblasLtHandle_t handle;
@ -505,6 +589,11 @@ auto GetHipBlasLtGemmTypeStringAndOps() {
return GetHipBlasLtTypeStringAndOps<T, T, T, ALayout, BLayout, GemmParams<T>>();
}
template <typename T, BlasOp ALayout, BlasOp BLayout>
auto GetHipBlasLtGemmAndBiasTypeStringAndOps() {
return GetHipBlasLtTypeStringAndOps<T, T, T, ALayout, BLayout, GemmAndBiasParams<T>>();
}
template <typename T, BlasOp ALayout, BlasOp BLayout>
auto GetHipBlasLtGemmStridedBatchedTypeStringAndOps() {
return GetHipBlasLtTypeStringAndOps<T, T, T, ALayout, BLayout, GemmStridedBatchedParams<T>>();

View File

@ -376,8 +376,8 @@ void TuningContext::EnableNumericsCheck(bool value) {
bool TuningContext::IsNumericsCheckEnabled() const {
static const char *env = getenv("PYTORCH_TUNABLEOP_NUMERICAL_CHECK");
if (env != nullptr && strcmp(env, "0") == 0) {
return false;
if (env != nullptr && strcmp(env, "1") == 0) {
return true;
}
return numerics_check_enable_;
}

View File

@ -48,6 +48,28 @@ class DefaultGemmOp : public Callable<GemmParams<T>> {
}
};
static bool _transposeBoolFromChar(char op) {
return op == 't' || op == 'T';
}
template <typename T>
class DefaultGemmAndBiasOp : public Callable<GemmAndBiasParams<T>> {
public:
TuningStatus Call(const GemmAndBiasParams<T>* params) override {
at::cuda::blas::gemm_and_bias<T>(
_transposeBoolFromChar(params->transa),
_transposeBoolFromChar(params->transb),
params->m, params->n, params->k,
params->alpha,
params->a, params->lda,
params->b, params->ldb,
params->bias,
params->c, params->ldc,
params->activation);
return OK;
}
};
template <typename T>
class DefaultGemmStridedBatchedOp : public Callable<GemmStridedBatchedParams<T>> {
public:
@ -265,7 +287,45 @@ class GemmTunableOp : public TunableOp<GemmParams<T>, StreamTimer> {
}
std::string Signature() override {
return c10::str("GemmTunableOp_", TypeName<T>(T{}), "_", BlasOpToString(ALayout), BlasOpToString(BLayout));
static std::string val = c10::str("GemmTunableOp_", TypeName<T>(T{}), "_", BlasOpToString(ALayout), BlasOpToString(BLayout));
return val;
}
};
template <typename T, BlasOp ALayout, BlasOp BLayout>
class GemmAndBiasTunableOp : public TunableOp<GemmAndBiasParams<T>, StreamTimer> {
public:
GemmAndBiasTunableOp() {
this->RegisterOp(std::string("Default"), std::make_unique<DefaultGemmAndBiasOp<T>>());
auto validators = getTuningContext()->GetTuningResultsValidator().GetAllValidators();
#if defined(USE_ROCM)
bool rocm_validators = false;
static const char *env_hipblaslt = std::getenv("PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED");
if (env_hipblaslt == nullptr || strcmp(env_hipblaslt, "1") == 0) {
rocm_validators = true;
// disallow tuning of hipblaslt with c10::complex
if constexpr (
!std::is_same_v<T, c10::complex<float>> &&
!std::is_same_v<T, c10::complex<double>>) {
for (auto&& [name, op] : GetHipBlasLtGemmAndBiasTypeStringAndOps<T, ALayout, BLayout>()) {
this->RegisterOp(std::move(name), std::move(op));
}
}
AddHipblasltValidator();
}
if (rocm_validators) {
AddRocmValidator();
}
#endif
}
std::string Signature() override {
static std::string val = c10::str("GemmAndBiasTunableOp_", TypeName<T>(T{}), "_", BlasOpToString(ALayout), BlasOpToString(BLayout));
return val;
}
};
@ -308,7 +368,8 @@ class GemmStridedBatchedTunableOp : public TunableOp<GemmStridedBatchedParams<T>
}
std::string Signature() override {
return c10::str("GemmStridedBatchedTunableOp_", TypeName<T>(T{}), "_", BlasOpToString(ALayout), BlasOpToString(BLayout));
static std::string val = c10::str("GemmStridedBatchedTunableOp_", TypeName<T>(T{}), "_", BlasOpToString(ALayout), BlasOpToString(BLayout));
return val;
}
};
@ -330,11 +391,12 @@ class ScaledGemmTunableOp : public TunableOp<ScaledGemmParams<CT>, StreamTimer>
}
std::string Signature() override {
return c10::str("ScaledGemmTunableOp",
static std::string val = c10::str("ScaledGemmTunableOp",
"_", TypeName<AT>(AT{}),
"_", TypeName<BT>(BT{}),
"_", TypeName<CT>(CT{}),
"_", BlasOpToString(ALayout), BlasOpToString(BLayout));
return val;
}
};

View File

@ -57,6 +57,9 @@ struct TORCH_API MPSHooksInterface : AcceleratorHooksInterface {
virtual size_t getDriverAllocatedMemory() const {
FAIL_MPSHOOKS_FUNC(__func__);
}
virtual size_t getRecommendedMaxMemory() const {
FAIL_MPSHOOKS_FUNC(__func__);
}
virtual void setMemoryFraction(double /*ratio*/) const {
FAIL_MPSHOOKS_FUNC(__func__);
}

View File

@ -324,6 +324,8 @@ TORCH_LIBRARY_IMPL(aten, FuncTorchBatchedDecomposition, m) {
OP_DECOMPOSE(type_as);
OP_DECOMPOSE(linalg_diagonal);
OP_DECOMPOSE(diagonal_copy);
OP_DECOMPOSE(alias_copy);
m.impl("as_strided_copy", native::as_strided_copy_symint);
m.impl("pad", native::pad_symint);
m.impl("_pad_circular", native::_pad_circular_symint);
OP_DECOMPOSE(swapdims_);

View File

@ -308,6 +308,8 @@ public:
// total GPU memory allocated in the process by Metal driver; including
// implicit allocations from MPS/MPSGraph frameworks and MPSHeapAllocatorImpl.
size_t getDriverAllocatedMemory() const { return current_allocated_size(); }
// recommended Max memory for Metal
size_t getRecommendedMaxMemory() const { return max_device_size(); }
// (see enum DebugVerbosity for description)
uint32_t getDebugVerbosity() const { return m_debug_verbosity; }
// returns the device that we allocate from

View File

@ -794,6 +794,9 @@ struct TORCH_API MPSAllocator final : public IMPSAllocator {
size_t getDriverAllocatedMemory() const override {
return _getAllocImpl().getDriverAllocatedMemory();
}
size_t getRecommendedMaxMemory() const override {
return _getAllocImpl().getRecommendedMaxMemory();
}
ssize_t getLowWatermarkValue() const override {
return _getAllocImpl().getLowWatermarkValue();
}

View File

@ -33,6 +33,7 @@ public:
virtual size_t getTotalAllocatedMemory() const = 0;
virtual size_t getCurrentAllocatedMemory() const = 0;
virtual size_t getDriverAllocatedMemory() const = 0;
virtual size_t getRecommendedMaxMemory() const = 0;
virtual std::pair<const void*, uint32_t> getSharedBufferPtr(const void* ptr) const = 0;
virtual bool recordEvents(c10::ArrayRef<const void*> buffers) const = 0;
virtual bool waitForEvents(c10::ArrayRef<const void*> buffers) const = 0;

View File

@ -32,6 +32,7 @@ struct MPSHooks : public at::MPSHooksInterface {
void emptyCache() const override;
size_t getCurrentAllocatedMemory() const override;
size_t getDriverAllocatedMemory() const override;
size_t getRecommendedMaxMemory() const override;
void setMemoryFraction(double ratio) const override;
// MPSProfiler interface

View File

@ -80,6 +80,10 @@ size_t MPSHooks::getDriverAllocatedMemory() const {
return at::mps::getIMPSAllocator()->getDriverAllocatedMemory();
}
size_t MPSHooks::getRecommendedMaxMemory() const {
return at::mps::getIMPSAllocator()->getRecommendedMaxMemory();
}
void MPSHooks::setMemoryFraction(double ratio) const {
at::mps::getIMPSAllocator()->setHighWatermarkRatio(ratio);
}

View File

@ -4,6 +4,7 @@
#include <ATen/OpMathType.h>
#include <ATen/Parallel.h>
#include <c10/core/ScalarType.h>
#include <c10/macros/Macros.h>
#include <c10/util/Exception.h>
#include <c10/util/Unroll.h>
#include <c10/util/complex.h>
@ -16,6 +17,7 @@
#include <arm_neon.h>
#endif
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-function")
namespace {
/// Wrapper for const_cast<T*> with type-inference.
@ -967,3 +969,4 @@ INSTANTIATE_VDOT_IMPL(c10::complex<double>);
#undef INSTANTIATE_DOT_IMPL
} // namespace at::native
C10_DIAGNOSTIC_POP()

View File

@ -18,10 +18,8 @@ enum class GridSamplerPadding {Zeros, Border, Reflection};
using detail::GridSamplerInterpolation;
using detail::GridSamplerPadding;
namespace {
// See NOTE [ grid_sampler Native Functions ].
void check_grid_sampler_common(
inline void check_grid_sampler_common(
const TensorBase& input,
const TensorBase& grid
) {
@ -60,7 +58,7 @@ void check_grid_sampler_common(
}
// See NOTE [ grid_sampler Native Functions ].
void check_grid_sampler_2d(
inline void check_grid_sampler_2d(
const TensorBase& input,
const TensorBase& grid
) {
@ -72,7 +70,7 @@ void check_grid_sampler_2d(
}
// See NOTE [ grid_sampler Native Functions ].
void check_grid_sampler_3d(
inline void check_grid_sampler_3d(
const TensorBase& input,
const TensorBase& grid,
int64_t interpolation_mode
@ -91,7 +89,7 @@ void check_grid_sampler_3d(
// See NOTE [ grid_sampler Native Functions ].
// cudnn does not support inputs larger than 1024.
bool cond_cudnn_grid_sampler(
inline bool cond_cudnn_grid_sampler(
const TensorBase& input,
const TensorBase& grid
) {
@ -104,6 +102,4 @@ bool cond_cudnn_grid_sampler(
input.sym_size(1) <= 1024);
}
} // anonymous namespace
} // namespace at::native

View File

@ -5,8 +5,7 @@
#include <ATen/TensorUtils.h>
namespace at::native {
namespace {
static C10_UNUSED void multilabel_margin_loss_shape_check(
inline void multilabel_margin_loss_shape_check(
int64_t& nframe,
int64_t& dim,
const int64_t& ndims,
@ -35,7 +34,7 @@ namespace {
}
}
static C10_UNUSED void multi_margin_loss_shape_check(
inline void multi_margin_loss_shape_check(
int64_t& nframe,
int64_t& dim,
const int64_t& ndims,
@ -67,6 +66,4 @@ namespace {
}
}
} // anonymous namespace
} // namespace at::native

View File

@ -525,10 +525,10 @@ static Tensor cross_entropy_loss_prob_target(
switch (reduction) {
case Reduction::Mean:
if (input.numel()==0){
if (input.sym_numel()==0){
return -(input * target * weight_).sum().fill_(std::numeric_limits<double>::quiet_NaN());
} else {
return -(input * target * weight_).sum() / (input.numel() / n_classes);
return -(input * target * weight_).sum() / (input.sym_numel() / n_classes);
}
case Reduction::Sum:
return -(input * target * weight_).sum();
@ -540,10 +540,10 @@ static Tensor cross_entropy_loss_prob_target(
} else {
switch (reduction) {
case Reduction::Mean:
if (input.numel()==0){
if (input.sym_numel()==0){
return -(input * target).sum().fill_(std::numeric_limits<double>::quiet_NaN());
} else {
return -(input * target).sum() / (input.numel() / n_classes);
return -(input * target).sum() / (input.sym_numel() / n_classes);
}
case Reduction::Sum:
return -(input * target).sum();

View File

@ -7,7 +7,7 @@
namespace at::native {
static void check_max_pool1d(
inline void check_max_pool1d(
const Tensor& self,
IntArrayRef kernel_size,
IntArrayRef stride,

View File

@ -1195,15 +1195,6 @@ Tensor istft(const Tensor& self, const int64_t n_fft, const optional<int64_t> ho
#undef REPR
}
static Tensor istft(const Tensor& self, const int64_t n_fft, const optional<int64_t> hop_lengthOpt,
const optional<int64_t> win_lengthOpt, const Tensor& window,
const bool center, const bool normalized, const optional<bool> onesidedOpt,
const optional<int64_t> lengthOpt) {
return at::native::istft(
self, n_fft, hop_lengthOpt, win_lengthOpt, window, center, normalized,
onesidedOpt, lengthOpt, /*return_complex=*/false);
}
void _fft_fill_with_conjugate_symmetry_(const Tensor& input, IntArrayRef dim_) {
const auto input_sizes = input.sizes();
const auto input_strides = input.strides();

View File

@ -210,7 +210,6 @@
#include <ATen/ops/zeros_native.h>
#endif
#include <c10/util/StringUtil.h>
#include <algorithm>
#include <cstdint>
#include <utility>
@ -421,8 +420,9 @@ Tensor& set_storage_meta__symint(Tensor& result, Storage storage, c10::SymInt st
// it. TODO: Actually this might not quite be correct if we use special
// pointers to track whether or not fake cuda tensors are pinned or not
const auto itemsize = result.dtype().itemsize();
c10::SymInt new_size_bytes = at::detail::computeStorageNbytes(
size, stride, itemsize, std::move(storage_offset));
c10::SymInt new_size_bytes = result.is_contiguous()
? at::detail::computeStorageNbytesContiguous(size, itemsize, std::move(storage_offset))
: at::detail::computeStorageNbytes(size, stride, itemsize, std::move(storage_offset));
// TODO: When there are unbacked SymInts, we unconditionally skip the
// setter. This is technically wrong, but we cannot conveniently test
// the real condition in many cases, because a lot of people are using

View File

@ -103,7 +103,7 @@ DECLARE_DISPATCH(upsampling_bicubic2d, upsample_bicubic2d_kernel);
DECLARE_DISPATCH(_upsampling_bicubic2d_aa, _upsample_bicubic2d_aa_kernel);
DECLARE_DISPATCH(_upsampling_bicubic2d_aa, _upsample_bicubic2d_aa_backward_kernel);
static C10_UNUSED std::array<int64_t, 3> upsample_1d_common_check(IntArrayRef input_size, IntArrayRef output_size) {
inline C10_UNUSED std::array<int64_t, 3> upsample_1d_common_check(IntArrayRef input_size, IntArrayRef output_size) {
TORCH_CHECK(
output_size.size() == 1,
"It is expected output_size equals to 1, but got size ",
@ -131,7 +131,7 @@ static C10_UNUSED std::array<int64_t, 3> upsample_1d_common_check(IntArrayRef in
return {nbatch, channels, output_width};
}
static C10_UNUSED std::array<int64_t, 4> upsample_2d_common_check(IntArrayRef input_size, IntArrayRef output_size) {
inline C10_UNUSED std::array<int64_t, 4> upsample_2d_common_check(IntArrayRef input_size, IntArrayRef output_size) {
TORCH_CHECK(
output_size.size() == 2,
"It is expected output_size equals to 2, but got size ",
@ -167,7 +167,7 @@ static C10_UNUSED std::array<int64_t, 4> upsample_2d_common_check(IntArrayRef in
return {nbatch, channels, output_height, output_width};
}
static C10_UNUSED
inline C10_UNUSED
std::array<int64_t, 5> upsample_3d_common_check(IntArrayRef input_size, IntArrayRef output_size) {
TORCH_CHECK(
output_size.size() == 3,
@ -365,7 +365,7 @@ inline int64_t nearest_exact_idx(
typedef int64_t (*nearest_idx_fn_t)(int64_t, int64_t, int64_t, std::optional<double>);
template <typename scalar_t>
static scalar_t upsample_get_value_bounded(
scalar_t upsample_get_value_bounded(
scalar_t* data,
int64_t width,
int64_t height,
@ -377,7 +377,7 @@ static scalar_t upsample_get_value_bounded(
}
template <typename scalar_t>
static void upsample_increment_value_bounded(
void upsample_increment_value_bounded(
scalar_t* data,
int64_t width,
int64_t height,
@ -392,17 +392,17 @@ static void upsample_increment_value_bounded(
// Based on
// https://en.wikipedia.org/wiki/Bicubic_interpolation#Bicubic_convolution_algorithm
template <typename scalar_t>
inline scalar_t cubic_convolution1(scalar_t x, scalar_t A) {
scalar_t cubic_convolution1(scalar_t x, scalar_t A) {
return ((A + 2) * x - (A + 3)) * x * x + 1;
}
template <typename scalar_t>
inline scalar_t cubic_convolution2(scalar_t x, scalar_t A) {
scalar_t cubic_convolution2(scalar_t x, scalar_t A) {
return ((A * x - 5 * A) * x + 8 * A) * x - 4 * A;
}
template <typename scalar_t>
inline void get_cubic_upsample_coefficients(
void get_cubic_upsample_coefficients(
scalar_t coeffs[4],
scalar_t t) {
scalar_t A = -0.75;

View File

@ -190,8 +190,7 @@ void gemm_transa_(
}
template <typename scalar_t, typename opmath_t>
typename std::enable_if<std::is_same<scalar_t, opmath_t>::value, void>::type
gemm_transb_(
void gemm_transb_impl(
TransposeType transb,
int64_t m,
int64_t n,
@ -201,12 +200,9 @@ gemm_transb_(
int64_t lda,
const scalar_t* b,
int64_t ldb,
opmath_t beta,
scalar_t* c,
/* we expect pre-applied beta */
opmath_t* c,
int64_t ldc) {
// c *= beta
scale_(m, n, beta, c, ldc);
// c += alpha * (a @ b.T)
for (const auto l : c10::irange(k)) {
for (const auto j : c10::irange(n)) {
@ -225,6 +221,27 @@ gemm_transb_(
}
}
template <typename scalar_t, typename opmath_t>
typename std::enable_if<std::is_same<scalar_t, opmath_t>::value, void>::type
gemm_transb_(
TransposeType transb,
int64_t m,
int64_t n,
int64_t k,
opmath_t alpha,
const scalar_t* a,
int64_t lda,
const scalar_t* b,
int64_t ldb,
opmath_t beta,
scalar_t* c,
int64_t ldc) {
// c *= beta
scale_(m, n, beta, c, ldc);
gemm_transb_impl(transb, m, n, k, alpha, a, lda, b, ldb, c, ldc);
}
// std::is_same<scalar_t, at::BFloat16> || std::is_same<scalar_t, at::Half>
template <typename scalar_t, typename opmath_t>
typename std::enable_if<!std::is_same<scalar_t, opmath_t>::value, void>::type
@ -241,19 +258,45 @@ gemm_transb_(
opmath_t beta,
scalar_t* c,
int64_t ldc) {
// c += alpha * (a @ b.T)
for (const auto i : c10::irange(m)) {
// We need to calculate full-precision dot products for correctness;
// users notice error accumulation with reduced-width types (e.g.,
// https://github.com/pytorch/pytorch/issues/95125 and
// https://github.com/pytorch/pytorch/issues/83863, which were filed
// when we used gemm_transb_impl naively, accumulating into
// float16/bfloat16). The straightforward way to do this is to use
// the vector dot column algorithm anyway, but this gives terrible
// performance because of the non-contiguous matrix
// access. Therefore, we instead elect to allocate temporary space
// to hold the output at higher-precision so that we can accumulate
// into it using the above cache-friendly "load one vector element,
// FMA it with an entire matrix row into the entire result vector"
// algorithm instead.
const auto c_size = m * n;
auto c_accum = std::make_unique<opmath_t[]>(c_size);
if (beta == 1) {
for (const auto j : c10::irange(n)) {
const auto dot = sum(k, [&](int64_t l) -> opmath_t {
return static_cast<opmath_t>(a[l * lda + i]) *
static_cast<opmath_t>(transb == TransposeType::ConjTranspose ? conj_impl(b[l * ldb + j]) : b[l * ldb + j]);
});
if (beta == opmath_t(0)) {
c[j * ldc + i] = alpha * dot;
} else {
c[j * ldc + i] = beta * c[j * ldc + i] + alpha * dot;
for (const auto i : c10::irange(m)) {
c_accum[j * m + i] = c[j * ldc + i];
}
}
} else if (beta == 0) {
for (const auto j : c10::irange(n)) {
for (const auto i : c10::irange(m)) {
c_accum[j * m + i] = 0;
}
}
} else {
for (const auto j : c10::irange(n)) {
for (const auto i : c10::irange(m)) {
c_accum[j * m + i] = beta * c[j * ldc + i];
}
}
}
gemm_transb_impl(transb, m, n, k, alpha, a, lda, b, ldb, c_accum.get(), m);
for (const auto j : c10::irange(n)) {
for (const auto i : c10::irange(m)) {
c[j * ldc + i] = c_accum[j * m + i];
}
}
}
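
A standalone sketch illustrating the error-accumulation problem described in the comment above: repeatedly adding small partial products into a reduced-width (bf16) accumulator stalls once the accumulator's rounding step exceeds the addend, while an opmath_t (float) accumulator keeps the contributions. Illustration only, not part of the diff; it relies only on c10::BFloat16, and the constants are arbitrary.

// Standalone illustration (not part of this diff) of why the accumulator must
// be opmath_t rather than scalar_t for bf16/fp16 GEMM.
#include <c10/util/BFloat16.h>
#include <iostream>

int main() {
  const int k = 4096;                      // length of one dot product
  float acc_f32 = 0.0f;                    // opmath-style accumulator
  c10::BFloat16 acc_bf16(0.0f);            // naive reduced-width accumulator
  for (int i = 0; i < k; ++i) {
    const float prod = 0.01f * 0.01f;      // one partial product, ~1e-4
    acc_f32 += prod;
    acc_bf16 = c10::BFloat16(static_cast<float>(acc_bf16) + prod);
  }
  // The float accumulator lands near 0.41; the bf16 accumulator stalls well
  // below that once ~1e-4 rounds away against the larger running sum.
  std::cout << acc_f32 << " vs " << static_cast<float>(acc_bf16) << std::endl;
  return 0;
}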

View File

@ -175,12 +175,6 @@ cuda::blas::GEMMAndBiasActivationEpilogue activation_to_gemm_and_blas_arg(Activa
static bool getDisableAddmmCudaLt() {
static const char* env_value = std::getenv("DISABLE_ADDMM_CUDA_LT");
#ifdef USE_ROCM
// if we enable tunable op, it'll take priority over just hipblaslt (heuristics)
// note the current tunable op is not the hipblaslt path (gemm_and_bias)
auto tuning_ctx = at::cuda::tunable::getTuningContext();
if (tuning_ctx->IsTunableOpEnabled()) {
return true;
}
// allow both CUDA and HIP env var names for ROCm builds
// also, current default for ROCm builds is disable by default
if (env_value == nullptr) {
@ -214,6 +208,49 @@ static bool isSupportedHipLtROCmArch(int index) {
}
#endif
template <typename scalar_t>
static void launchTunableGemmAndBias(cublasCommonArgs &args, Tensor& result, const Tensor& self, bool is_rocm) {
bool transa_ = ((args.transa != 'n') && (args.transa != 'N'));
bool transb_ = ((args.transb != 'n') && (args.transb != 'N'));
at::cuda::tunable::GemmAndBiasParams<scalar_t> params;
params.transa = args.transa;
params.transb = args.transb;
params.m = args.m;
params.n = args.n;
params.k = args.k;
params.a = args.mata->const_data_ptr<scalar_t>();
params.lda = args.lda;
params.b = args.matb->const_data_ptr<scalar_t>();
params.ldb = args.ldb;
if (is_rocm) {
params.bias = (&result != &self) ? self.const_data_ptr<scalar_t>() : nullptr;
}
else {
params.bias = self.const_data_ptr<scalar_t>();
}
params.c = args.result->data_ptr<scalar_t>();
params.ldc = args.result_ld;
if (transa_ && transb_) {
static at::cuda::tunable::GemmAndBiasTunableOp<scalar_t, at::cuda::tunable::BlasOp::T, at::cuda::tunable::BlasOp::T> gemm{};
gemm(&params);
}
else if (transa_ && !transb_) {
static at::cuda::tunable::GemmAndBiasTunableOp<scalar_t, at::cuda::tunable::BlasOp::T, at::cuda::tunable::BlasOp::N> gemm{};
gemm(&params);
}
else if (!transa_ && transb_) {
static at::cuda::tunable::GemmAndBiasTunableOp<scalar_t, at::cuda::tunable::BlasOp::N, at::cuda::tunable::BlasOp::T> gemm{};
gemm(&params);
}
else if (!transa_ && !transb_) {
static at::cuda::tunable::GemmAndBiasTunableOp<scalar_t, at::cuda::tunable::BlasOp::N, at::cuda::tunable::BlasOp::N> gemm{};
gemm(&params);
}
else {
TORCH_CHECK(false, "unreachable");
}
}
Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& mat1, const Tensor& mat2, const Scalar& beta, const Scalar& alpha, Activation activation=Activation::None) {
// Make sure to keep addmm_cuda below in sync with this code; it
// preflights a check to try to avoid actually needing to call
@ -341,6 +378,11 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
scalar_type,
"addmm_cuda_lt",
[&] {
auto tuning_ctx = at::cuda::tunable::getTuningContext();
if (tuning_ctx->IsTunableOpEnabled()) {
launchTunableGemmAndBias<scalar_t>(args, result, self, true);
}
else {
at::cuda::blas::gemm_and_bias<scalar_t>(
args.transa == 't',
args.transb == 't',
@ -359,7 +401,7 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
args.result_ld,
activation_to_gemm_and_blas_arg(activation)
);
});
}});
#else
auto activation_epilogue = activation_to_gemm_and_blas_arg(activation);
#if (defined(CUDA_VERSION) && (CUDA_VERSION < 11080))
@ -377,6 +419,11 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
scalar_type,
"addmm_cuda_lt",
[&] {
auto tuning_ctx = at::cuda::tunable::getTuningContext();
if (tuning_ctx->IsTunableOpEnabled()) {
launchTunableGemmAndBias<scalar_t>(args, result, self, false);
}
else {
at::cuda::blas::gemm_and_bias<scalar_t>(
args.transa == 't',
args.transb == 't',
@ -393,7 +440,7 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
args.result_ld,
activation_epilogue
);
});
}});
#endif
} else
{

View File

@ -191,11 +191,47 @@ std::vector<Tensor> foreach_scalar_pow_list_kernel_cuda(
// In the case of division, integer inputs will result in float.
// Currently multi tensor apply can only return result of the same type as
// input.
FOREACH_BINARY_OP_SCALAR(
all_types_complex_bool_half_bfloat16,
div,
std::divides,
/*div_op*/ true);
//
// Implement via multiply with reciprocal as it's faster and makes it match
// the behavior of regular Tensor div by scalar. Loses one bit of
// precision.
Scalar scalar_reciprocal(const Scalar& scalar) {
if (scalar.isFloatingPoint()) {
return Scalar(1. / scalar.toDouble());
} else if (scalar.isIntegral(/*includeBool*/ true)) {
return Scalar(1. / static_cast<double>(scalar.toLong()));
} else if (scalar.isComplex()) {
return Scalar(1. / scalar.toComplexDouble());
}
TORCH_INTERNAL_ASSERT(
false, "divison with ", scalar.type(), " not supported");
}
void foreach_tensor_div_scalar_kernel_cuda_(
TensorList tensors,
const Scalar& scalar) {
check_foreach_api_restrictions(tensors);
if (!can_use_fast_route(tensors, scalar, true)) {
return at::native::foreach_tensor_mul_scalar_kernel_slow_(
tensors, scalar_reciprocal(scalar));
}
all_types_complex_bool_half_bfloat16_<std::multiplies>(
tensors, scalar_reciprocal(scalar));
}
std::vector<Tensor> foreach_tensor_div_scalar_kernel_cuda(
TensorList tensors,
const Scalar& scalar) {
check_foreach_api_restrictions(tensors);
if (!can_use_fast_route(tensors, scalar, true)) {
return at::native::foreach_tensor_mul_scalar_kernel_slow(
tensors, scalar_reciprocal(scalar));
}
return all_types_complex_bool_half_bfloat16<std::multiplies>(
tensors, scalar_reciprocal(scalar));
}
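
A small standalone check illustrating the "loses one bit of precision" remark in the comment above: dividing by a scalar and multiplying by its precomputed reciprocal usually agree but can differ in the last bit. Illustration only, not part of the diff; it uses plain C++ floats with no Torch dependency, and the actual kernel computes the reciprocal in double via scalar_reciprocal, so this is only an analogy.

// Standalone illustration (not part of this diff): x / s vs x * (1 / s) are not
// always bit-identical; count how often they disagree for a fixed scalar.
#include <cstdio>

int main() {
  const float s = 3.0f;
  const float inv_s = 1.0f / s;            // precomputed reciprocal
  int mismatches = 0;
  for (int i = 1; i <= 10000; ++i) {
    const float x = static_cast<float>(i);
    if (x / s != x * inv_s) {
      ++mismatches;
    }
  }
  std::printf("%d of 10000 integer inputs differ in the last bit\n", mismatches);
  return 0;
}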
// In the case of subtraction, we don't allow scalar to be boolean following the
// torch.sub logic

View File

@ -807,6 +807,7 @@ struct ReduceOp {
bool is_last_block_done = mark_block_finished();
if (is_last_block_done) {
__threadfence(); // complete the acquire pattern after atomic
value = ident;
if (config.should_block_x_reduce()) {
index_t input_offset = threadIdx.x + threadIdx.y * blockDim.x;

View File

@ -863,8 +863,8 @@ Tensor host_softmax(const Tensor & input_, const int64_t dim_, const bool half_t
auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock -
smem_reduction_sz) / sizeof(scalar_t);
bool can_use_smem = dim_size < max_elements_per_smem;
can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES);
bool can_use_smem = (size_t) dim_size < max_elements_per_smem;
can_use_smem &= !(reinterpret_cast<uintptr_t>(input_ptr) % ALIGN_BYTES);
can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES));
can_use_smem &= !(dim_size % ILP);
@ -899,8 +899,8 @@ Tensor host_softmax(const Tensor & input_, const int64_t dim_, const bool half_t
auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock -
smem_reduction_sz) / sizeof(scalar_t);
bool can_use_smem = dim_size < max_elements_per_smem;
can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES);
bool can_use_smem = (size_t) dim_size < max_elements_per_smem;
can_use_smem &= !(reinterpret_cast<uintptr_t>(input_ptr) % ALIGN_BYTES);
can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES));
can_use_smem &= !(dim_size % ILP);

View File

@ -595,6 +595,7 @@ struct ReduceJitOp {
bool is_last_block_done = mark_block_finished();
if (is_last_block_done) {
__threadfence(); //complete acquire pattern
value = ident;
if (config.should_block_x_reduce()) {
uint32_t input_offset = threadIdx.x + threadIdx.y * blockDim.x;

View File

@ -614,13 +614,6 @@ void run_cudnn_SDP_bprop(
Tensor& dV,
const Tensor& dropoutseed,
const Tensor& dropoutoffset) {
Tensor dO_ = dO;
if (!dO.strides()[dO.strides().size() - 1]) {
TORCH_WARN(
"cuDNN SDPA backward got an innermost stride of 0 in grad_out, which is unsupported. Materializing a contiguous\
tensor which will increase memory usage...");
dO_ = dO.contiguous();
}
cudnnHandle_t handle = getCudnnHandle();
auto key = MHACacheKeyWrapper(
b, h, s_q, s_kv, d, q, k, v, dropout_probability, is_causal, true);
@ -642,7 +635,7 @@ void run_cudnn_SDP_bprop(
k,
v,
o,
dO_,
dO,
softmaxstats,
dQ,
dK,

View File

@ -168,10 +168,6 @@ struct RNNParams {
}
};
static std::vector<int64_t> _hidden_size(const RNNParams& rnn) {
return {rnn.num_layers * rnn.num_directions, rnn.mini_batch, rnn.hidden_size};
}
template<bool is_single_direction>
std::vector<int64_t> _output_size(const RNNParams& rnn) {
auto output_channels = is_single_direction ? rnn.hidden_size

View File

@ -99,15 +99,3 @@ Tensor& mkldnn_transpose_(Tensor& self, int64_t dim0, int64_t dim1) {
} // namespace at
#endif // AT_MKLDNN_ENABLED
namespace at {
namespace native {
static Tensor mkldnn_view_symint(const Tensor& self, c10::SymIntArrayRef size) {
return mkldnn_view(self, C10_AS_INTARRAYREF_SLOW(size));
}
} // namespace native
} // namespace at

View File

@ -659,6 +659,7 @@ id<MTLLibrary> MetalShaderLibrary::compileLibrary(const std::string& src) {
MTLCompileOptions* options = [[MTLCompileOptions new] autorelease];
[options setLanguageVersion:is_macos_13_or_newer(MacOSVersion::MACOS_VER_14_0_PLUS) ? MTLLanguageVersion3_1
: MTLLanguageVersion2_3];
// [options setFastMathEnabled: NO];
auto str = [NSString stringWithCString:src.c_str() encoding:NSASCIIStringEncoding];
auto device = MPSDevice::getInstance()->device();
library = [device newLibraryWithSource:str options:options error:&error];

View File

@ -9,35 +9,72 @@ constant float b[4] = {{-2.118377725, 1.442710462, -0.329097515, 0.012229801}};
constant float c[4] = {{-1.970840454, -1.624906493, 3.429567803, 1.641345311}};
constant float d[2] = {{3.543889200, 1.637067800}};
kernel void erfinv_mps_kernel( device {0} *output [[buffer(0)]],
device {1} *input [[buffer(1)]],
uint index [[thread_position_in_grid]]) {{
kernel void erfinv_kernel( device {0} *output [[buffer(0)]],
device {1} *input [[buffer(1)]],
uint index [[thread_position_in_grid]]) {{
float y = input[index];
float x, z, num, dem; /*working variables */
/* coefficients in rational expansion */
float y_abs = abs(y);
if(y_abs > 1.0f){{
output[index] = NAN;
if (y_abs >= 1.0f) {{
output[index] = {0}( y_abs > 1.0f ? NAN : copysign(INFINITY, y));
return;
}}
if(y_abs == 1.0f){{
output[index] = copysign(INFINITY, y);
return;
}}
if(y_abs <= 0.7f) {{
if (y_abs <= 0.7f) {{
z = y * y;
num = (((a[3]*z + a[2])*z + a[1])*z + a[0]);
dem = ((((b[3]*z + b[2])*z + b[1])*z +b[0]) * z + 1.0f);
num = ((a[3] * z + a[2]) * z + a[1])*z + a[0];
dem = (((b[3] * z + b[2]) * z + b[1]) * z +b[0]) * z + 1.0f;
x = y * num / dem;
}}
else{{
}} else {{
z = sqrt(-1.0f*log((1.0-y_abs)/2.0));
num = ((c[3]*z + c[2])*z + c[1]) * z + c[0];
dem = (d[1]*z + d[0])*z + 1.0f;
num = ((c[3] * z + c[2]) * z + c[1]) * z + c[0];
dem = (d[1] * z + d[0]) * z + 1.0f;
x = copysign(num, y) / dem;
}}
output[index] = x;
}})METAL";
output[index] = {0}(x);
}}
kernel void exp_kernel( device {0} *output [[buffer(0)]],
device {1} *input [[ buffer(1)]],
uint index [[thread_position_in_grid]]) {{
output[index] = {0}(precise::exp(input[index]));
}}
kernel void exp_complex_kernel( device {0}2 *output [[buffer(0)]],
device {0}2 *input [[ buffer(1)]],
uint index [[thread_position_in_grid]]) {{
output[index].x = {0}(precise::exp(input[index].x)*precise::cos(input[index].y));
output[index].y = {0}(precise::exp(input[index].x)*precise::sin(input[index].y));
}}
kernel void tanh_kernel( device {0} *output [[buffer(0)]],
device {1} *input [[ buffer(1)]],
uint index [[thread_position_in_grid]]) {{
output[index] = {0}(precise::tanh(input[index]));
}}
#if __METAL_VERSION__ >= 310
bfloat dot(bfloat2 a, bfloat2 b) {{
return a.x * b.x + a.y * b.y;
}}
#endif
template<typename T>
T complex_div(T a, T b) {{
auto denom = dot(b, b);
return T(dot(a, b), a.y * b.x - a.x * b.y)/denom;
}}
kernel void tanh_complex_kernel( device {0}2 *output [[buffer(0)]],
device {0}2 *input [[ buffer(1)]],
uint index [[thread_position_in_grid]]) {{
// tanh(x+iy) = (tanh(x) + i*tan(y)) / (1 + i*tanh(x)*tan(y))
auto tanh_x = {0}(precise::tanh(input[index].x));
auto tan_y = {0}(precise::tan(input[index].y));
output[index] = complex_div({0}2(tanh_x, tan_y), {0}2({0}(1), tanh_x * tan_y));
}}
)METAL";

View File

@ -143,7 +143,7 @@ TORCH_IMPL_FUNC(leaky_relu_out_mps)(const Tensor& self, const Scalar& negative_s
Tensor output_ = at::empty_like(self, executeGatherOp ? MemoryFormat::Contiguous : MemoryFormat::Preserve);
@autoreleasepool {
string key = "leaky_relu" + getTensorsStringKey({self}) + ":" + to_string(negative_slope.to<double>());
string key = "leaky_relu" + getTensorsStringKey({self}) + ":" + std::to_string(negative_slope.to<double>());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self);
@ -193,8 +193,8 @@ TORCH_IMPL_FUNC(leaky_relu_backward_out_mps)
Tensor output_ = at::empty_like(self, self.suggest_memory_format());
@autoreleasepool {
string key =
"leaky_relu_backward" + getTensorsStringKey({self, grad_output}) + ":" + to_string(negative_slope.to<double>());
string key = "leaky_relu_backward" + getTensorsStringKey({self, grad_output}) + ":" +
std::to_string(negative_slope.to<double>());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self);
MPSGraphTensor* gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output);
@ -242,7 +242,7 @@ TORCH_IMPL_FUNC(log_softmax_mps_out)
MPSStream* stream = at::mps::getCurrentMPSStream();
@autoreleasepool {
string key = "log_softmax_mps_out" + getTensorsStringKey({self}) + ":" + to_string(dim);
string key = "log_softmax_mps_out" + getTensorsStringKey({self}) + ":" + std::to_string(dim);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self);
@ -285,7 +285,7 @@ TORCH_IMPL_FUNC(log_softmax_backward_mps_out)
MPSStream* stream = at::mps::getCurrentMPSStream();
@autoreleasepool {
string key = "log_softmax_backward_mps_out:" + getMPSTypeString(grad_output) + ":" + to_string(dim);
string key = "log_softmax_backward_mps_out:" + getMPSTypeString(grad_output) + ":" + std::to_string(dim);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* gradOutputTensor = mpsGraphUnrankedPlaceHolder(mpsGraph, getMPSDataType(grad_output));
MPSGraphTensor* outputTensor = mpsGraphUnrankedPlaceHolder(mpsGraph, getMPSDataType(output));
@ -539,8 +539,8 @@ TORCH_IMPL_FUNC(threshold_out_mps)
MPSStream* stream = getCurrentMPSStream();
@autoreleasepool {
string key = "threshold_out_mps" + getTensorsStringKey({self}) + ":" + to_string(threshold.to<double>()) + ":" +
to_string(value.to<double>());
string key = "threshold_out_mps" + getTensorsStringKey({self}) + ":" + std::to_string(threshold.to<double>()) +
":" + std::to_string(value.to<double>());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self);
@ -587,7 +587,7 @@ TORCH_IMPL_FUNC(threshold_backward_out_mps)
@autoreleasepool {
string key =
"threshold_backward_out_mps" + getTensorsStringKey({self, grad}) + ":" + to_string(threshold.to<double>());
"threshold_backward_out_mps" + getTensorsStringKey({self, grad}) + ":" + std::to_string(threshold.to<double>());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self);
@ -826,8 +826,8 @@ static void elu_variants_out_mps(const Tensor& self,
MPSStream* stream = getCurrentMPSStream();
@autoreleasepool {
string key = func_name + ":" + getTensorsStringKey({self}) + ":" + to_string(alpha.to<double>()) + ":" +
to_string(scale.to<double>()) + ":" + to_string(input_scale.to<double>());
string key = func_name + ":" + getTensorsStringKey({self}) + ":" + std::to_string(alpha.to<double>()) + ":" +
std::to_string(scale.to<double>()) + ":" + std::to_string(input_scale.to<double>());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, self);
@ -916,8 +916,8 @@ TORCH_IMPL_FUNC(elu_backward_out_mps)
@autoreleasepool {
string key = "elu_backward_out_mps:" + getTensorsStringKey({grad_output, self_or_result}) + ":" +
to_string(alpha.to<double>()) + ":" + to_string(scale.to<double>()) + ":" +
to_string(input_scale.to<double>()) + ":" + to_string(is_result);
std::to_string(alpha.to<double>()) + ":" + std::to_string(scale.to<double>()) + ":" +
std::to_string(input_scale.to<double>()) + ":" + std::to_string(is_result);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output);
@ -1010,7 +1010,7 @@ TORCH_IMPL_FUNC(glu_out_mps)(const Tensor& self, const int64_t dim, const Tensor
MPSStream* stream = getCurrentMPSStream();
@autoreleasepool {
string key = "glu_out_mps" + getTensorsStringKey({self}) + ":" + to_string(dim);
string key = "glu_out_mps" + getTensorsStringKey({self}) + ":" + std::to_string(dim);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(self), getMPSShape(self));
NSArray<MPSGraphTensor*>* outputTensorsArray = [mpsGraph splitTensor:inputTensor
@ -1052,7 +1052,7 @@ Tensor& glu_backward_mps_out(const Tensor& grad_output, const Tensor& self, cons
MPSStream* stream = getCurrentMPSStream();
@autoreleasepool {
string key = "glu_backward_mps_out" + getTensorsStringKey({grad_output, self}) + ":" + to_string(dim);
string key = "glu_backward_mps_out" + getTensorsStringKey({grad_output, self}) + ":" + std::to_string(dim);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(self), getMPSShape(self));
MPSGraphTensor* gradOutputTensor =
@ -1855,8 +1855,8 @@ Tensor& hardtanh_backward_out_mps(const Tensor& grad_output,
MPSStream* stream = getCurrentMPSStream();
@autoreleasepool {
string key = "hardtanh_backward_out_mps:" + getTensorsStringKey({grad_output}) + ":" + to_string(min.to<double>()) +
":" + to_string(max.to<double>());
string key = "hardtanh_backward_out_mps:" + getTensorsStringKey({grad_output}) + ":" +
std::to_string(min.to<double>()) + ":" + std::to_string(max.to<double>());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output);

View File

@ -136,8 +136,8 @@ static Tensor& addmv_out_mps_impl(const Tensor& self,
Tensor matMulVec = at::mm(mat, vec.unsqueeze(1)).squeeze(1);
@autoreleasepool {
string key = "addmv_out_mps_impl" + getTensorsStringKey({self, matMulVec}) + ":" + to_string(beta_.toDouble()) +
":" + to_string(alpha_.toDouble());
string key = "addmv_out_mps_impl" + getTensorsStringKey({self, matMulVec}) + ":" +
std::to_string(beta_.toDouble()) + ":" + std::to_string(alpha_.toDouble());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* matMulVecTensor = mpsGraphRankedPlaceHolder(mpsGraph, matMulVec);
MPSGraphTensor* selfTensor = mpsGraphRankedPlaceHolder(mpsGraph, self);

View File

@ -33,7 +33,7 @@ static Tensor& fill_scalar_mps_impl(Tensor& self, const Scalar& value) {
};
@autoreleasepool {
string key = "fill_scalar_mps_impl" + getTensorsStringKey(self) + ":" + to_string(value.toDouble());
string key = "fill_scalar_mps_impl" + getTensorsStringKey(self) + ":" + std::to_string(value.toDouble());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphScalarPlaceHolder(mpsGraph, getMPSDataType(self.scalar_type()));

View File

@ -193,24 +193,24 @@ static Tensor _mps_convolution_impl(const Tensor& input_t,
string bias_shape_key;
if (bias_defined) {
bias_shape_key = to_string(bias_shape[0]);
bias_shape_key = std::to_string(bias_shape[0]);
} else {
bias_shape_key = "nobias";
}
string key;
if (is3DConv) {
key = "mps_3d_convolution:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" + to_string(stride[2]) +
":" + to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" + to_string(dilation[2]) + ":" +
to_string(padding[0]) + ":" + to_string(padding[1]) + ":" + to_string(padding[2]) + ":" + to_string(groups) +
":" + mem_format_key + mps::getTensorsStringKey({input_t, weight_t}) + ":" + to_string(bias_defined) + ":" +
bias_shape_key;
key = "mps_3d_convolution:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(stride[2]) + ":" + std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" +
std::to_string(dilation[2]) + ":" + std::to_string(padding[0]) + ":" + std::to_string(padding[1]) + ":" +
std::to_string(padding[2]) + ":" + std::to_string(groups) + ":" + mem_format_key +
mps::getTensorsStringKey({input_t, weight_t}) + ":" + std::to_string(bias_defined) + ":" + bias_shape_key;
} else {
key = "mps_convolution:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" + to_string(dilation[0]) +
":" + to_string(dilation[1]) + ":" + to_string(padding[0]) + ":" + to_string(padding[1]) + ":" +
to_string(groups) + ":" + mem_format_key + mps::getTensorsStringKey({input_t, weight_t}) + ":" +
to_string(bias_defined) + ":" + bias_shape_key;
key = "mps_convolution:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" + std::to_string(padding[0]) + ":" +
std::to_string(padding[1]) + ":" + std::to_string(groups) + ":" + mem_format_key +
mps::getTensorsStringKey({input_t, weight_t}) + ":" + std::to_string(bias_defined) + ":" + bias_shape_key;
}
MPSShape* inputShape = mps::getMPSShape(input_t, memory_format);
@ -388,16 +388,16 @@ static Tensor mps_convolution_backward_input(IntArrayRef input_size,
NSString* ns_shape_key = [[gradOutputShape valueForKey:@"description"] componentsJoinedByString:@","];
string key;
if (is3DConv) {
key = "mps_3d_convolution_backward_input:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" + ":" +
to_string(stride[2]) + to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" + to_string(dilation[2]) +
":" + to_string(padding[0]) + ":" + to_string(padding[1]) + ":" + to_string(padding[2]) + ":" +
to_string(groups) + ":" + mem_format_key + getTensorsStringKey({grad_output_t, weight_t}) + ":" +
string([ns_shape_key UTF8String]);
key = "mps_3d_convolution_backward_input:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
":" + std::to_string(stride[2]) + std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" +
std::to_string(dilation[2]) + ":" + std::to_string(padding[0]) + ":" + std::to_string(padding[1]) + ":" +
std::to_string(padding[2]) + ":" + std::to_string(groups) + ":" + mem_format_key +
getTensorsStringKey({grad_output_t, weight_t}) + ":" + string([ns_shape_key UTF8String]);
} else {
key = "mps_convolution_backward_input:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" +
to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" + to_string(padding[0]) + ":" +
to_string(padding[1]) + ":" + to_string(groups) + ":" + mem_format_key +
key = "mps_convolution_backward_input:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" + std::to_string(padding[0]) + ":" +
std::to_string(padding[1]) + ":" + std::to_string(groups) + ":" + mem_format_key +
getTensorsStringKey({grad_output_t, weight_t}) + ":" + string([ns_shape_key UTF8String]);
}
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
@ -547,15 +547,15 @@ static Tensor mps_convolution_backward_weights(IntArrayRef weight_size,
NSString* ns_shape_key = [[gradOutputShape valueForKey:@"description"] componentsJoinedByString:@","];
string key;
if (is3DConv) {
key = "mps_3d_convolution_backward_weights:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" +
to_string(stride[2]) + ":" + to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" +
to_string(dilation[2]) + ":" + to_string(padding[0]) + ":" + to_string(padding[1]) + ":" +
to_string(padding[2]) + ":" + to_string(groups) + ":" + mem_format_key +
key = "mps_3d_convolution_backward_weights:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(stride[2]) + ":" + std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" +
std::to_string(dilation[2]) + ":" + std::to_string(padding[0]) + ":" + std::to_string(padding[1]) + ":" +
std::to_string(padding[2]) + ":" + std::to_string(groups) + ":" + mem_format_key +
getTensorsStringKey({grad_output_t, input_t, grad_weight_t}) + ":" + string([ns_shape_key UTF8String]);
} else {
key = "mps_convolution_backward_weights:" + to_string(stride[0]) + ":" + to_string(stride[1]) + ":" +
to_string(dilation[0]) + ":" + to_string(dilation[1]) + ":" + to_string(padding[0]) + ":" +
to_string(padding[1]) + ":" + to_string(groups) + ":" + mem_format_key +
key = "mps_convolution_backward_weights:" + std::to_string(stride[0]) + ":" + std::to_string(stride[1]) + ":" +
std::to_string(dilation[0]) + ":" + std::to_string(dilation[1]) + ":" + std::to_string(padding[0]) + ":" +
std::to_string(padding[1]) + ":" + std::to_string(groups) + ":" + mem_format_key +
getTensorsStringKey({grad_output_t, input_t, grad_weight_t}) + ":" + string([ns_shape_key UTF8String]);
}
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {


@ -63,7 +63,7 @@ Tensor& random_mps_impl(Tensor& self,
@autoreleasepool {
string key = op_name + getTensorsStringKey({self, mean_opt.value_or(Tensor()), std_opt.value_or(Tensor())}) + ":" +
to_string(val1) + ":" + to_string(val2);
std::to_string(val1) + ":" + std::to_string(val2);
auto cachedGraph = LookUpOrCreateCachedGraph<RandomCachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
newCachedGraph->stateTensor =
mpsGraphRankedPlaceHolder(mpsGraph, MPSDataTypeInt32, @[ @(at::mps::detail::PHILOX_STATE_N) ]);
@ -469,7 +469,7 @@ static Tensor& multinomial_with_replacement_mps_kernel(const Tensor& self,
MPSStream* stream = getCurrentMPSStream();
@autoreleasepool {
string key = "multinomial_with_replacement:" + getTensorsStringKey({self}) + ":" + to_string(n_sample);
string key = "multinomial_with_replacement:" + getTensorsStringKey({self}) + ":" + std::to_string(n_sample);
auto cachedGraph = LookUpOrCreateCachedGraph<RandomCachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSShape* prob_shape = getMPSShape(self_v);
newCachedGraph->stateTensor = mpsGraphRankedPlaceHolder(mpsGraph, MPSDataTypeInt32, @[ @7 ]);


@ -236,7 +236,7 @@ static std::tuple<Tensor, Tensor> _mps_linear_backward_weights(const Tensor& gra
MPSStream* stream = getCurrentMPSStream();
@autoreleasepool {
string key = "mps_linear_backward_weights:" + to_string(bias_defined) + ":" +
string key = "mps_linear_backward_weights:" + std::to_string(bias_defined) + ":" +
getTensorsStringKey({input_reshaped, weight, grad_output_reshaped});
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input_reshaped);


@ -229,8 +229,8 @@ static Tensor& addbmm_or_baddbmm_out_mps_impl(const Tensor& input,
@autoreleasepool {
string key = (opType == ADDBMM_OP_TYPE) ? ("addbmm_out_mps_impl") : ("baddbmm_out_mps_impl");
key += getTensorsStringKey({batch1, batch2, input}) + ":" + to_string(beta.toDouble()) + ":" +
to_string(alpha.toDouble());
key += getTensorsStringKey({batch1, batch2, input}) + ":" + std::to_string(beta.toDouble()) + ":" +
std::to_string(alpha.toDouble());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mps::mpsGraphRankedPlaceHolder(mpsGraph, input);
@ -331,8 +331,8 @@ static Tensor& addmm_out_mps_impl(const Tensor& bias,
};
@autoreleasepool {
string key = "addmm_out_mps_impl" + getTensorsStringKey({self, other, *bias_}) + ":" + to_string(beta.toDouble()) +
":" + to_string(alpha.toDouble());
string key = "addmm_out_mps_impl" + getTensorsStringKey({self, other, *bias_}) + ":" +
std::to_string(beta.toDouble()) + ":" + std::to_string(alpha.toDouble());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* selfTensor = nil;
MPSGraphTensor* otherTensor = nil;
@ -615,8 +615,8 @@ Tensor& addr_out_mps(const Tensor& self,
};
@autoreleasepool {
string key = "addr_out_mps_impl" + getTensorsStringKey({vec1, vec2, *self_}) + ":" + to_string(beta.toDouble()) +
":" + to_string(alpha.toDouble());
string key = "addr_out_mps_impl" + getTensorsStringKey({vec1, vec2, *self_}) + ":" +
std::to_string(beta.toDouble()) + ":" + std::to_string(alpha.toDouble());
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* t1 = mps::mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(vec1), inputShape);
MPSGraphTensor* t2 = mps::mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(vec2), otherShape);


@ -69,7 +69,7 @@ static Tensor& mse_loss_backward_out_impl(const Tensor& grad_output,
};
@autoreleasepool {
string key = op_name + reductionToString(reduction) + ":" + to_string(grad_input.sizes()[1]) +
string key = op_name + reductionToString(reduction) + ":" + std::to_string(grad_input.sizes()[1]) +
getTensorsStringKey({input, target, grad_output});
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
newCachedGraph->inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input);
@ -327,8 +327,8 @@ static void nllnd_loss_backward_impl(Tensor& grad_input_arg,
}
@autoreleasepool {
string key = "nllnd_loss_backward" + getTensorsStringKey({input, grad_output, target, weight, total_weight}) +
to_string(numClasses) + ":" + to_string(ignore_index) + ":" + to_string(isWeightsArrayValid) + ":" +
to_string(isTargetCasted) + ":" + reductionToString(reduction);
std::to_string(numClasses) + ":" + std::to_string(ignore_index) + ":" + std::to_string(isWeightsArrayValid) +
":" + std::to_string(isTargetCasted) + ":" + reductionToString(reduction);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input);
@ -463,9 +463,9 @@ static void nllnd_loss_forward_impl(Tensor& output,
NSString* ns_shape_key = [[input_shape valueForKey:@"description"] componentsJoinedByString:@","];
// TODO: Make the key
string key = "nllnd_loss_forward_impl:" + to_string(ignore_index) + ":" + to_string(isWeightsArrayValid) + ":" +
reductionToString(reduction) + ":" + [ns_shape_key UTF8String] + ":" + getMPSTypeString(input) + ":" +
getMPSTypeString(target) + ":" + to_string(isTargetCasted) + ":" + getMPSTypeString(weight);
string key = "nllnd_loss_forward_impl:" + std::to_string(ignore_index) + ":" + std::to_string(isWeightsArrayValid) +
":" + reductionToString(reduction) + ":" + [ns_shape_key UTF8String] + ":" + getMPSTypeString(input) + ":" +
getMPSTypeString(target) + ":" + std::to_string(isTargetCasted) + ":" + getMPSTypeString(weight);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(input), input_shape);
MPSGraphTensor* targetTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(target), target_shape);
@ -598,7 +598,7 @@ static void smooth_l1_loss_impl(const Tensor& input,
NSString* ns_shape_key = [[input_shape valueForKey:@"description"] componentsJoinedByString:@","];
string key = "smooth_l1_loss_impl:" + reductionToString(reduction) + ":" + [ns_shape_key UTF8String] + ":" +
to_string(beta) + ":" + getMPSTypeString(input) + ":" + getMPSTypeString(target);
std::to_string(beta) + ":" + getMPSTypeString(input) + ":" + getMPSTypeString(target);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
// smooth_l1_loss_mps:
// ln = 0.5 * ( xn - yn ) ^ 2 / beta, if |xn - yn| < beta
@ -734,7 +734,7 @@ static void smooth_l1_loss_backward_impl(const Tensor& grad_output,
@autoreleasepool {
string key = "smooth_l1_loss_backward" + getTensorsStringKey({input, grad_output, grad_input, target}) + ":" +
reductionToString(reduction) + ":" + to_string(beta);
reductionToString(reduction) + ":" + std::to_string(beta);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input);


@ -146,64 +146,459 @@ INSTANTIATE_INT4MM(bfloat, 128);
INSTANTIATE_INT4MM(bfloat, 256);
#endif
template <typename T, unsigned blockSize=8>
kernel void
int8pack_mm(constant T *A [[buffer(0)]], constant char *B [[buffer(1)]],
constant T *scales [[buffer(2)]],
device T *outputData [[buffer(3)]],
constant int3 &sizes [[buffer(4)]],
uint2 group_index [[threadgroup_position_in_grid]],
uint2 threadgroup_index [[thread_position_in_threadgroup]]) {
using vecT = typename Vec4Type<T>::type;
const uint lda = sizes.y;
const uint ldc = sizes.z;
int out_idx = (group_index.x * blockSize + threadgroup_index.x) * 4;
int n = out_idx % sizes.z;
int m = out_idx / sizes.z;
// Offset pointers
A += m * lda;
B += n * lda;
outputData += m *ldc;
// ------------------------------ int8 MM For M >= 12 ------------------------------------
/**
* The following code is heavily inspired by llama.cpp (https://github.com/ggerganov/llama.cpp).
* The original code is under MIT License: https://github.com/ggerganov/llama.cpp/blob/master/LICENSE
*
* Matrix Multiplication Algorithm:
* 1. Load A and B blocks (32x32 and 64x32 respectively) into shared memory.
* 2. In 4 simdgroups, calculate the outer product of the loaded blocks. Each simdgroup produces a 2x4 8x8 result.
* 2.1 For how to use outer product to perform matrix multiplication, refer to
* http://mlwiki.org/index.php/Matrix-Matrix_Multiplication#Sum_of_Outer_Products
* 3. Repeat 1 & 2 along K axis, with K block size 32, accumulate the result in the 2x4 8x8 block.
* 4. Dequantize the final result and store it in the output matrix.
*
* Variable names are changed to adapt to PyTorch convention such as M, N, K, etc.
* Assuming row major order.
* For more details please see inline comments.
*/
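As a plain CPU model of the tiled computation described in the comment above (an illustrative sketch, not the Metal kernel): blocks of A and B are accumulated along the K axis and the per-output-column scales are applied once after accumulation.

#include <algorithm>
#include <cstdint>
#include <vector>

// C is M x N and assumed zero-initialized; A is M x K row-major, B is N x K
// row-major int8, scales has one entry per output column (N).
void int8_mm_reference(const std::vector<float>& A,
                       const std::vector<int8_t>& B,
                       const std::vector<float>& scales,
                       std::vector<float>& C,
                       int M, int N, int K) {
  const int BM = 32, BN = 64, BK = 32;  // BLOCK_SIZE_M / _N / _K
  for (int m0 = 0; m0 < M; m0 += BM)
    for (int n0 = 0; n0 < N; n0 += BN)
      for (int k0 = 0; k0 < K; k0 += BK)          // accumulate along K in blocks
        for (int m = m0; m < std::min(m0 + BM, M); ++m)
          for (int n = n0; n < std::min(n0 + BN, N); ++n) {
            float acc = 0.f;
            for (int k = k0; k < std::min(k0 + BK, K); ++k)
              acc += A[m * K + k] * static_cast<float>(B[n * K + k]);
            C[m * N + n] += acc;
          }
  for (int m = 0; m < M; ++m)                     // dequantize once at the end
    for (int n = 0; n < N; ++n)
      C[m * N + n] *= scales[n];
}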
#include <metal_stdlib>
using namespace metal;
template <typename T> struct BlockType {};
float4 rc = 0;
for (unsigned k = threadgroup_index.y * 4; k < sizes.y; k += 4 * blockSize) {
threadgroup_barrier(mem_flags::mem_none);
auto a_val = float4(*reinterpret_cast<constant vecT *>(A + k));
float4x4 b_val;
for (int i = 0; i < 4; ++i) {
b_val[i] = float4(*reinterpret_cast<constant char4 *>(B + i * lda + k));
}
rc += transpose(b_val) * a_val;
}
template <> struct BlockType<float> {
using simdgroup_type8x8 = simdgroup_float8x8;
using type4 = float4;
};
// Accumulate results across SIMD group? (8 threads using vec4)
threadgroup float4 tgp_memory[blockSize][blockSize];
tgp_memory[threadgroup_index.x][threadgroup_index.y] = rc;
threadgroup_barrier(mem_flags::mem_threadgroup);
if (threadgroup_index.y == 0) {
for (int i = 1; i < blockSize; i++) {
rc += tgp_memory[threadgroup_index.x][i];
}
*reinterpret_cast<device vecT *>(outputData + n) =
vecT(rc * float4(*reinterpret_cast<constant vecT *>(scales + n)));
}
template <> struct BlockType<half> {
using simdgroup_type8x8 = simdgroup_half8x8;
using type4 = half4;
};
#if __METAL_VERSION__ >= 310
template <> struct BlockType<bfloat> {
using simdgroup_type8x8 = simdgroup_bfloat8x8;
using type4 = bfloat4;
};
#endif
template<typename T>
float2 get_scale_zero(constant T * scalesAndZeros, uint2 index) {
return float2(1.0, 0.0);
}
#define INSTANTIATE_INT8MM(DTYPE) \
template [[host_name("int8pack_mm_" #DTYPE)]] kernel void \
int8pack_mm<DTYPE>( \
constant DTYPE * A [[buffer(0)]], constant char *B [[buffer(1)]], \
constant DTYPE *scales [[buffer(2)]], \
device DTYPE *outputData [[buffer(3)]], \
constant int3 &sizes [[buffer(4)]], \
uint2 group_index [[threadgroup_position_in_grid]], \
uint2 threadgroup_index [[thread_position_in_threadgroup]]);
template<typename T>
float2 get_scale_zero_q8(constant T * scalesAndZeros, uint2 index) {
T scale = scalesAndZeros[index[0]];
return float2(scale, 0.0);
}
INSTANTIATE_INT8MM(half);
INSTANTIATE_INT8MM(float);
#define BLOCK_SIZE_M 32 // each block takes 32 rows in matrix A
#define BLOCK_SIZE_N 64 // each block takes 64 rows in matrix B
#define BLOCK_SIZE_K 32
#define THREAD_MAT_M 2 // in data loading stage, each thread loads 2 simdgroup matrices from matrix A
#define THREAD_MAT_N 4 // in data loading stage, each thread loads 4 simdgroup matrices from matrix B
#define THREAD_PER_ROW_A 4 // 4 threads for each row in matrix A to load numbers
#define THREAD_PER_ROW_B 2 // 2 threads for each row in matrix B to load numbers
#define SG_MAT_SIZE 64 // simdgroup matrix is of shape 8x8
#define SG_MAT_ROW 8
// T: input type, W: weight type
template<typename T, typename W, float2 (*get_scale_zero_func)(constant T *, uint2)>
kernel void kernel_mul_mm(
constant T * A [[buffer(0)]],
constant char * B [[buffer(1)]],
constant T * scalesAndZeros [[buffer(2)]],
device T * outputData [[buffer(3)]],
constant uint3 & sizes [[buffer(4)]],
threadgroup char * shared_memory [[threadgroup(0)]], // threadgroup buffer at index 0
uint3 tgpig [[threadgroup_position_in_grid]], // 3d coordinates
uint tiitg [[thread_index_in_threadgroup]], // 128 per threadgroup
uint sgitg [[simdgroup_index_in_threadgroup]]) {
using T4 = typename BlockType<T>::type4;
using Tsimd8x8 = typename BlockType<T>::simdgroup_type8x8;
// sizes: x = M, y = K, z = N
// pytorch: M x K @ N x K -> M x N
// ggml: K x N @ K x M -> N x M
uint32_t M = sizes.x; // M
uint32_t K = sizes.y; // K
uint32_t N = sizes.z; // N
uint32_t nbytes_B = sizeof(W); // number of bytes for one element in B
uint32_t nbytes_B_row = nbytes_B * K; // number of bytes for one row in B
uint32_t nbytes_A = sizeof(T); // number of bytes for one element in A
uint32_t nbytes_A_row = nbytes_A * K; // number of bytes for one row in A
// shared memory for A and B
threadgroup T * shared_memory_A = (threadgroup T *)(shared_memory);
// using half here to store int8 gives us about an 8% perf gain compared to bfloat, but not sure why
threadgroup half * shared_memory_B = (threadgroup half *)(shared_memory + 8192);
const uint threadgroup_M = tgpig.x; // total number (M + 31)/32, the index of this threadgroup along M axis
const uint threadgroup_N = tgpig.y; // total number (N + 63)/64, the index of this threadgroup along N axis
// if this block is of 64x32 shape or smaller, bound the number of rows for A and B in this block.
short n_rows_A = min(uint32_t(M - threadgroup_M * BLOCK_SIZE_M), uint32_t(BLOCK_SIZE_M));
short n_rows_B = min(uint32_t(N - threadgroup_N * BLOCK_SIZE_N), uint32_t(BLOCK_SIZE_N));
// a thread shouldn't load data outside of the matrix
short thread_row_A = min(((short)tiitg/THREAD_PER_ROW_A), n_rows_A - 1);
short thread_row_B = min(((short)tiitg/THREAD_PER_ROW_B), n_rows_B - 1);
Tsimd8x8 simdgroup_A[2]; // input, each simdgroup load 128 values of input
simdgroup_half8x8 simdgroup_B[4]; // weight, each simdgroup load 256 values of weight
simdgroup_float8x8 simdgroup_C[8]; // outer product result, 2x4 8x8 blocks.
for (short i = 0; i < 8; i++){
simdgroup_C[i] = make_filled_simdgroup_matrix<float, 8>(0.f);
}
constant T * a_ptr = (constant T *)((constant char *)A
+ nbytes_A_row * (threadgroup_M * BLOCK_SIZE_M + thread_row_A)
+ nbytes_A * (BLOCK_SIZE_K / THREAD_PER_ROW_A * (tiitg % THREAD_PER_ROW_A)));
constant W * b_ptr = (constant W *)(B
+ nbytes_B_row * (threadgroup_N * BLOCK_SIZE_N + thread_row_B)
+ nbytes_B * (BLOCK_SIZE_K / THREAD_PER_ROW_B * (tiitg % THREAD_PER_ROW_B)));
/**
Load weight and input into shared memory:
8192: BLOCK_SIZE_M x BLOCK_SIZE_K x 4 (max bytes per value) <----- numbers don't check out, should be 4096. Changing it to 4096 gives wrong values.
4096: BLOCK_SIZE_N x BLOCK_SIZE_K x 2(storing int8 in half)
K
┌────────────────────────┐ 8192(A) 4096(B)
│ │ ┌────────────────────────┬────────────┐
│ │ │++++++++++++++++++++++++│++++++++++++│
│ │ └────────────────────────┴────────────┘
│ │
│32(BLOCK_SIZE_K) │
├──┬──┬──────────────────┤ K
│++│ │ │ ┌────────────────────────┐
64│++│ │... │ │ │
(BLOCK_SIZE_N)│++│ │ │ │ │
├──┴──┴──────────────────┤ │ │
│ │ │ │
│ ───────────► │ │32(BLOCK_SIZE_K) │
│ for loop │ ├──┬──┬──────────────────┤
│ │ 32│++│ │ ... │
│ │ (BLOCK_SIZE_M)├──┴──┴──────────────────┤
│ │ │ ────────────► │
│ │ │ for loop │
└────────────────────────┘ └────────────────────────┘
B A
*/
for (uint32_t loop_k = 0; loop_k < K; loop_k += BLOCK_SIZE_K) {
// load data and store to threadgroup memory
threadgroup_barrier(mem_flags::mem_threadgroup);
#pragma unroll(16)
for (short i = 0; i < 16; i++) {
half weight = *(b_ptr + i);
// for example, tiitg 32, i 12 -> 0 + 1 = 1, it needs to work on sg mat grid row 1
short sg_mat_grid_row_index = (tiitg % THREAD_PER_ROW_B) * THREAD_PER_ROW_B + i / 8;
// same example, sg mat grid col index: 32 / 2 / 8 = 2, so currently need to work with sg mat at (1, 2)
short sg_mat_grid_col_index = tiitg / THREAD_PER_ROW_B / 8;
// now inside sg mat, which index to write to? starting point is SG_MAT_SIZE * sg_mat_offset
short row_offset = i % 8;
short col_offset = (tiitg / THREAD_PER_ROW_B) % 8;
// now calculates the overall offset for shared_memory_B
short sb_offset = (sg_mat_grid_row_index * 8 + sg_mat_grid_col_index) * 64 + (row_offset * 8 + col_offset);
*(shared_memory_B + sb_offset) = weight;
}
// read 8 values for input matrix
#pragma unroll(2)
for (short i = 0; i < 2; i++) {
*((threadgroup T4 *)(shared_memory_A + (tiitg % THREAD_PER_ROW_A) * 8 * 32 + 8 * (tiitg / THREAD_PER_ROW_A)) + i) = *((constant T4 *)a_ptr + i);
}
a_ptr += BLOCK_SIZE_K;
b_ptr += BLOCK_SIZE_K;
threadgroup_barrier(mem_flags::mem_threadgroup);
// load matrices from threadgroup memory and conduct outer products
// pointing to the shared memory starting address for A, for current simdgroup.
threadgroup T * simdgroup_A_ptr = (shared_memory_A + THREAD_MAT_M * SG_MAT_SIZE * (sgitg / 2));
// pointing to the shared memory starting address for B, for current simdgroup.
threadgroup half * simdgroup_B_ptr = (shared_memory_B + THREAD_MAT_N * SG_MAT_SIZE * (sgitg % 2));
/**
Outer product:
K
────────────►
8 for loop 8 8
┌───┬───┬───┬───┐ ┌───┬───┬───┬───┬───┬───┬───┬───┐
8 │+++│ │ │ │ │ 8│+++│+++│+++│+++│###│###│###│###│
├───┼───┼───┼───┤ │ ├───┼───┼───┼───┼───┼───┼───┼───┤
│+++│ │ │ │ │ │ │ │ │ │ │ │ │ │
├───┼───┼───┼───┤ │ K ├───┼───┼───┼───┼───┼───┼───┼───┤
│###│ │ │ │ │ │ │ │ │ │ │ │ │ │
├───┼───┼───┼───┤ │ ├───┼───┼───┼───┼───┼───┼───┼───┤
│###│ │ │ │ │ │ │ │ │ │ │ │ │ │
└───┴───┴───┴───┘ ▼ └───┴───┴───┴───┴───┴───┴───┴───┘
for loop
+ simdgroup 0,1 + simdgroup 0,2
# simdgroup 2,3 # simdgroup 1,3
*/
#pragma unroll(4)
for (short ik = 0; ik < BLOCK_SIZE_K / 8; ik++) {
#pragma unroll(4)
for (short i = 0; i < 4; i++) {
simdgroup_load(simdgroup_B[i], simdgroup_B_ptr + SG_MAT_SIZE * i);
}
simdgroup_barrier(mem_flags::mem_none);
#pragma unroll(2)
for (short i = 0; i < 2; i++) {
simdgroup_load(simdgroup_A[i], simdgroup_A_ptr + SG_MAT_SIZE * i);
}
simdgroup_A_ptr += BLOCK_SIZE_M / SG_MAT_ROW * SG_MAT_SIZE;
simdgroup_B_ptr += BLOCK_SIZE_N / SG_MAT_ROW * SG_MAT_SIZE;
#pragma unroll(8)
for (short i = 0; i < 8; i++){
simdgroup_multiply_accumulate(simdgroup_C[i], simdgroup_A[i/4], simdgroup_B[i%4], simdgroup_C[i]);
}
}
}
/**
* Each sgitg 0,1,2,3 handles 2x4 8x8.
8 8
┌───┬───┬───┬───┬───┬───┬───┬───┐
8│ 0 │ 0 │ 0 │ 0 │ 1 │ 1 │ 1 │ 1 │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ 0 │ 0 │ 0 │ 0 │ 1 │ 1 │ 1 │ 1 │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ 2 │ 2 │ 2 │ 2 │ 3 │ 3 │ 3 │ 3 │
├───┼───┼───┼───┼───┼───┼───┼───┤
│ 2 │ 2 │ 2 │ 2 │ 3 │ 3 │ 3 │ 3 │
└───┴───┴───┴───┴───┴───┴───┴───┘
scale: 8 x BLOCK_SIZE_N, starting from shared_memory_A. Each sgitg handles 4 8x8 diagonal matrix.
8 8
┌───┬───┬───┬───┬───┬───┬───┬───┐
8│ │ │ │ │ │ │ │ │
└───┴───┴───┴───┴───┴───┴───┴───┘
*/
threadgroup float * temp_str = ((threadgroup float *)shared_memory) \
+ 32 * (sgitg&1) + (16 * (sgitg>>1)) * BLOCK_SIZE_N;
for (int i = 0; i < 8; i++) {
int block_start = 4 * 8 * (sgitg & 1) + (i % 4) * 8;
threadgroup float * temp_scale = (threadgroup float *)shared_memory_B + block_start;
threadgroup float * scale_iter = temp_scale;
// dequantize
for (int j = 0; j < 8; j++) {
// clear next 8 values of scale_iter
*((threadgroup float2x4 *)scale_iter) = float2x4(0.f);
// find scale
int scale_index = threadgroup_N * BLOCK_SIZE_N + block_start + j;
float2 scale_zero = get_scale_zero_func(scalesAndZeros, uint2(scale_index, 0));
// create diagonal matrix of scales
*(scale_iter + j) = scale_zero[0];
// go to next row
scale_iter += BLOCK_SIZE_N;
}
threadgroup_barrier(mem_flags::mem_threadgroup);
simdgroup_float8x8 simd_scale;
simdgroup_load(simd_scale, temp_scale, BLOCK_SIZE_N);
simdgroup_multiply(simdgroup_C[i], simdgroup_C[i], simd_scale);
simdgroup_store(simdgroup_C[i], temp_str + 8 * (i%4) + 8 * BLOCK_SIZE_N * (i/4), BLOCK_SIZE_N);
}
device T * C = outputData + (BLOCK_SIZE_N * threadgroup_N) + (BLOCK_SIZE_M * threadgroup_M) * N;
if (sgitg == 0) {
for (int i = 0; i < n_rows_B; i++) {
for (int j = tiitg; j < n_rows_A; j += BLOCK_SIZE_M) {
float temp = *(temp_str + i + j * BLOCK_SIZE_N);
*(C + i + j * N) = (device T)(temp);
}
}
}
}
#define INSTANTIATE_MM(DTYPE, WDTYPE, DEQUANT_FUNC) \
template \
[[host_name("large_m_int8pack_mm_" #DTYPE)]] \
kernel void kernel_mul_mm<DTYPE, WDTYPE, DEQUANT_FUNC>( \
constant DTYPE * A [[buffer(0)]], \
constant char * B [[buffer(1)]], \
constant DTYPE * scalesAndZeros [[buffer(2)]], \
device DTYPE * outputData [[buffer(3)]], \
constant uint3 & sizes [[buffer(4)]], \
threadgroup char * shared_memory [[threadgroup(0)]], \
uint3 tgpig [[threadgroup_position_in_grid]], \
uint tiitg [[thread_index_in_threadgroup]], \
uint sgitg [[simdgroup_index_in_threadgroup]])
INSTANTIATE_MM(float, char, get_scale_zero_q8);
INSTANTIATE_MM(half, char, get_scale_zero_q8);
#if __METAL_VERSION__ >= 310
INSTANTIATE_INT8MM(bfloat);
INSTANTIATE_MM(bfloat, char, get_scale_zero_q8);
#endif
// ------------------------------ int8 MM For M < 12 ------------------------------------
/* Matrix vector multiplication, used for small M size for matrix multiplication as well.
for loop ->
1 1 1 1 1
┌──────────────────┬──┬──┬──┬──┬───────────┬─────┐ ┌──┐
│ thread 0-> 8│ │ │ │ │ │ │ 8│ │
│ ├──┼──┼──┼──┤ │ │ ├──┤
│ thread 1-> 8│ │ │ │ │ │ │ 8│ │
│ ├──┼──┼──┼──┤ │ │ ├──┤
│ thread 2-> 8│ │ │ │ │ │ │ 8│ │
│ ├──┼──┼──┼──┤ │ │ ├──┤
│ thread 3-> 8│ │ │ │ │ │ │ 8│ │
│ ├──┼──┼──┼──┤ │ │ ├──┤
│ │ │ │ │ │ │ │ │ │
│ thread 4-7 32│ │ │ │ │ │ │ 32│ │
│ │ │ │ │ │ SIMD │ │ │ │
K │ ├──┼──┼──┼──┤ Group 1 │ │ ├──┤
│ │ │ │ │ │ │ │ │ │
│ thread 8-15 64│ │ │ │ │ │ │ 64│ │
│ │ │ │ │ │ │ │ │ │
│ ├──┼──┼──┼──┤ │ │ ├──┤
│ │ │ │ │ │ │ │ │ │
│ thread 16-31 128│ │ │ │ │ │ │ 128│ │
│ │ │ │ │ │ │ │ │ │
│ ├──┼──┼──┼──┼───────────┤ │ ├──┤
│ │ │ │ │ │ │ │ │ │
└──────────────────┴──┴──┴──┴──┴───────────┴─────┘ └──┘
SIMD Group 0 input
N
┌──────────────────┬──┬──┬──┬──┬───────────┬─────┐
│ │ │ │ │ │ │ │
└──────────────────┴──┴──┴──┴──┴───────────┴─────┘
scale
*/
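A matching CPU model of this small-M path (again a sketch, not the Metal kernel): each output row is the dot product of one int8 weight row with the input vector, walked in 8-wide chunks and scaled per row at the end.

#include <cstdint>
#include <vector>

// out[row] = scales[row] * dot(w[row, :], y); w is N x K row-major int8.
void int8_mv_reference(const std::vector<float>& y,
                       const std::vector<int8_t>& w,
                       const std::vector<float>& scales,
                       std::vector<float>& out,
                       int N, int K) {
  const int CHUNK = 8;  // NB_Q8_0: quants handled per thread per step
  for (int row = 0; row < N; ++row) {
    float acc = 0.f;
    for (int k0 = 0; k0 < K; k0 += CHUNK)         // chunked walk along K
      for (int k = k0; k < k0 + CHUNK && k < K; ++k)
        acc += y[k] * static_cast<float>(w[row * K + k]);
    out[row] = acc * scales[row];                 // apply the per-row scale once
  }
}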
// putting them in the kernel causes a significant performance penalty, could use function constant to optimize?
#define NB_Q8_0 8
#define N_DST 4 // each SIMD group works on 4 rows
#define N_SIMDGROUP 2 // number of SIMD groups in a thread group
#define N_SIMDWIDTH 32 // assuming SIMD group size is 32
template<typename T>
kernel void kernel_mul_mv(
constant T * A [[buffer(0)]],
constant char * B [[buffer(1)]],
constant T * scalesAndZeros [[buffer(2)]],
device T * outputData [[buffer(3)]],
constant uint3 & sizes [[buffer(4)]],
threadgroup char * shared_memory [[threadgroup(0)]],
uint3 tgpig [[threadgroup_position_in_grid]],
uint tiisg [[thread_index_in_simdgroup]],
uint sgitg [[simdgroup_index_in_threadgroup]]) {
using T4 = typename BlockType<T>::type4;
const int nr = N_DST;
const int nsg = N_SIMDGROUP;
const int nw = N_SIMDWIDTH;
// sizes: x = M, y = K, z = N, given mv, x = M = 1
// pytorch: M x K @ N x K -> M x N
// ggml: K x N @ K x M -> N x M
uint32_t K = sizes.y; // K
uint32_t N = sizes.z; // N
const int nb = K/N_SIMDWIDTH; // number of blocks of 32 elements along K axis
const int threadgroup_N = tgpig.x; // threadgroup index along N axis.
const int threadgroup_M = tgpig.y; // threadgroup index along M axis. For matvec multiplication this will always be 0 but keep it for future usage.
/*
* Each SIMD group in a threadgroup handles N_DST = nr = 4 rows.
* - threadgroup_N is the x index of the threadgroup. threadgroup_N * nsg -> the overall offset of SIMD groups, for this threadgroup.
* - threadgroup_N * nsg + sgitg -> the overall index of SIMD group, in all SIMD groups.
* - (threadgroup_N * nsg + sgitg) * nr -> the starting index of the row that this SIMD group needs to handle.
*/
const int first_row = (threadgroup_N * nsg + sgitg) * nr;
const uint offset0 = first_row * K;
// x: weight, y: input
constant char * x = (constant char *) B + offset0;
constant T * y = (constant T *) A + threadgroup_M*K;
// Load data to shared memory
threadgroup T * shared_scale = (threadgroup T *)(shared_memory); // length 8 * sizeof(float)
// Load scale:
if (tiisg < 4) {
*(shared_scale + (sgitg % 2) * 4 + tiisg) = *(scalesAndZeros + (threadgroup_N * NB_Q8_0) + (sgitg % 2) * 4 + tiisg);
}
// Accumulate on float4
float2x4 yl;
float4x4 xl[2];
float4 sumf = 0;
// Group threads in SIMD group into 8x4 block, each thread handles 8 input values.
const int ix = tiisg/4;
const int il = tiisg%4;
// N_SIMDWIDTH = 32 means we have 32 weights in 1 simdgroup.
// Find the starting point of the input that this thread needs to work on, load yb into yl.
constant T * yb = y + ix * N_SIMDWIDTH + NB_Q8_0*il;
// each thread in a SIMD group deals with NB_Q8_0 quants at a time
for (short ib = ix; ib < nb; ib += nw/4) {
// Load y data
for (short i = 0; i < 2; i++) {
short offset = i * 4;
yl[i] = {*(yb + offset), *(yb + offset + 1), *(yb + offset + 2), *(yb + offset + 3)};
}
for (short row = 0; row < nr; row++) {
// Locate where x should be.
// row offset: row * K
// col offset: ib * N_SIMDWIDTH + il * NB_Q8_0
// x index: row * K + ib * N_SIMDWIDTH + il * NB_Q8_0
constant int8_t * qs = (constant int8_t *)(x + row * K + ib * N_SIMDWIDTH + il * NB_Q8_0);
for (short batch = 0; batch < 2; batch++) {
short offset = batch * 4;
xl[batch][row] = {(float)qs[offset], (float)qs[offset+1], (float)qs[offset+2], (float)qs[offset+3]};
}
}
sumf += yl[0] * xl[0];
sumf += yl[1] * xl[1];
yb += NB_Q8_0 * nw;
}
for (int row = 0; row < nr; ++row) {
const float tot = simd_sum(sumf[row]);
float scale = *(shared_scale + (sgitg % 2) * 4 + row);
if (tiisg == 0 && first_row + row < N) {
outputData[threadgroup_M*N + first_row + row] = (device T)(tot * scale);
}
}
}
#define INSTANTIATE_MV(DTYPE) \
template \
[[host_name("int8pack_mv_" #DTYPE)]] \
kernel void kernel_mul_mv<DTYPE>( \
constant DTYPE * A [[buffer(0)]], \
constant char * B [[buffer(1)]], \
constant DTYPE * scalesAndZeros [[buffer(2)]], \
device DTYPE * outputData [[buffer(3)]], \
constant uint3 & sizes [[buffer(4)]], \
threadgroup char * shared_memory [[threadgroup(0)]], \
uint3 tgpig [[threadgroup_position_in_grid]], \
uint tiisg [[thread_index_in_simdgroup]], \
uint sgitg [[simdgroup_index_in_threadgroup]])
INSTANTIATE_MV(float);
INSTANTIATE_MV(half);
#if __METAL_VERSION__ >= 310
INSTANTIATE_MV(bfloat);
#endif
)METAL_QUANTIZED");
Tensor _weight_int4pack_mm_mps(const Tensor& A, const Tensor& B, int64_t qGroupSize, const Tensor& qScaleAndZeros) {
@ -295,7 +690,13 @@ Tensor _weight_int8pack_mm_mps(const Tensor& A, const Tensor& B, const Tensor& s
}
#endif
id<MTLComputeCommandEncoder> computeEncoder = mpsStream->commandEncoder();
const std::string kernel = fmt::format("int8pack_mm_{}", scalarToMetalTypeString(A));
std::string kernel;
// heuristic, to use mv kernel for mm with small M. M = 10 is the performance tipping point.
if (M < 12) {
kernel = fmt::format("int8pack_mv_{}", scalarToMetalTypeString(A));
} else {
kernel = fmt::format("large_m_int8pack_mm_{}", scalarToMetalTypeString(A));
}
id<MTLComputePipelineState> quantizedPSO = lib.getPipelineStateForFunc(kernel);
[computeEncoder setComputePipelineState:quantizedPSO];
mtl_setBuffer(computeEncoder, A, 0);
@ -303,7 +704,15 @@ Tensor _weight_int8pack_mm_mps(const Tensor& A, const Tensor& B, const Tensor& s
mtl_setBuffer(computeEncoder, scales, 2);
mtl_setBuffer(computeEncoder, C, 3);
[computeEncoder setBytes:sizes.data() length:sizeof(uint32_t) * sizes.size() atIndex:4];
[computeEncoder dispatchThreads:MTLSizeMake(M * N / 4, 8, 1) threadsPerThreadgroup:MTLSizeMake(8, 8, 1)];
if (M < 12) {
[computeEncoder setThreadgroupMemoryLength:32 atIndex:0];
[computeEncoder dispatchThreadgroups:MTLSizeMake((N + 7) / 8, M, 1)
threadsPerThreadgroup:MTLSizeMake(64, 1, 1)];
} else {
[computeEncoder setThreadgroupMemoryLength:12288 atIndex:0];
[computeEncoder dispatchThreadgroups:MTLSizeMake((M + 31) / 32, (N + 63) / 64, 1)
threadsPerThreadgroup:MTLSizeMake(128, 1, 1)];
}
#if _CAPTURE_KERNEL
if (getMPSProfiler().isCapturing()) {
getMPSProfiler().stopCapture(mpsStream);

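The dispatch above boils down to: use the mat-vec kernel when M < 12, otherwise the 32x64-tile kernel, with grid and threadgroup-memory sizes matching each kernel's block shape. A sketch of that decision (Dispatch and pick_int8mm_kernel are illustrative names, not the ATen code):

#include <cstdint>
#include <string>

struct Dispatch {
  std::string kernel;
  uint32_t groups_x, groups_y, threads_per_group, threadgroup_bytes;
};

Dispatch pick_int8mm_kernel(uint32_t M, uint32_t N, const std::string& dtype) {
  if (M < 12) {
    // mat-vec path: threadgroups of 64 threads, each covering 8 output columns
    return {"int8pack_mv_" + dtype, (N + 7) / 8, M, 64, 32};
  }
  // tiled path: one 32x64 output tile per threadgroup of 128 threads, with
  // 8192 + 4096 bytes of threadgroup memory for the A and B tiles
  return {"large_m_int8pack_mm_" + dtype, (M + 31) / 32, (N + 63) / 64, 128, 12288};
}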

@ -106,7 +106,7 @@ Tensor& arange_mps_out(const Scalar& start, const Scalar& end, const Scalar& ste
auto stream = getCurrentMPSStream();
auto mpsDataType = getMPSDataType(result);
@autoreleasepool {
string key = "arange_mps_out" + getTensorsStringKey({result}) + ":" + to_string(size);
string key = "arange_mps_out" + getTensorsStringKey({result}) + ":" + std::to_string(size);
auto cachedGraph = cache_->LookUpAs<RangeCachedGraph>(key);
if (!cachedGraph) {
cachedGraph = cache_->CreateCachedGraphAs<RangeCachedGraph>(key, ^MPSCachedGraph*() {
@ -173,7 +173,7 @@ Tensor& range_mps_out(const Scalar& start, const Scalar& end, const Scalar& step
auto stream = getCurrentMPSStream();
auto mpsDataType = getMPSDataType(result);
@autoreleasepool {
string key = "arange_mps_out" + getTensorsStringKey({result}) + ":" + to_string(size);
string key = "arange_mps_out" + getTensorsStringKey({result}) + ":" + std::to_string(size);
auto cachedGraph = cache_->LookUpAs<RangeCachedGraph>(key);
if (!cachedGraph) {
cachedGraph = cache_->CreateCachedGraphAs<RangeCachedGraph>(key, ^MPSCachedGraph*() {
@ -221,8 +221,8 @@ Tensor& linspace_out_mps(const Scalar& start, const Scalar& end, int64_t steps,
bool start_less_end = (start.to<double>() <= end.to<double>());
@autoreleasepool {
string key =
"linspace_out_mps:" + getTensorsStringKey({result}) + ":" + to_string(steps) + to_string(start_less_end);
string key = "linspace_out_mps:" + getTensorsStringKey({result}) + ":" + std::to_string(steps) +
std::to_string(start_less_end);
auto cachedGraph = cache_->LookUpAs<RangeCachedGraph>(key);
if (!cachedGraph) {


@ -359,8 +359,8 @@ static void impl_func_norm_mps(const Tensor& input_tensor,
NSString* ns_key = [[wrappedAxes valueForKey:@"description"] componentsJoinedByString:@","];
string keepdim_info = (keepdim) ? "keepdim=1" : "keepdim=0";
string tensor_key = cdist ? getTensorsStringKey({input_tensor, other_tensor}) : getTensorsStringKey({input_t});
string key = string("norm_out_mps:") + [ns_key UTF8String] + ":" + tensor_key + ":p" + to_string(p) + ":" +
keepdim_info + ":" + toString(in_dtype) + ":" + to_string(castInputData);
string key = string("norm_out_mps:") + [ns_key UTF8String] + ":" + tensor_key + ":p" + std::to_string(p) + ":" +
keepdim_info + ":" + toString(in_dtype) + ":" + std::to_string(castInputData);
auto cachedGraph = LookUpOrCreateCachedGraph<MPSBinaryCachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
newCachedGraph->inputTensor_ = mpsGraphRankedPlaceHolder(mpsGraph, input_tensor);
@ -572,7 +572,7 @@ static Tensor std_var_common_impl_mps(const Tensor& input_t,
string op_key = (stdVarType == STANDARD_DEVIATION) ? "std_mps" : "var_mps";
NSString* ns_key = [[wrappedAxes valueForKey:@"description"] componentsJoinedByString:@","];
string bessel_corrected = (use_correction && correction_value) ? "unbiased " : "biased ";
string use_dim_info = (use_dim) ? "use_dim=1:" + to_string(dim_value.size()) : "use_dim=0";
string use_dim_info = (use_dim) ? "use_dim=1:" + std::to_string(dim_value.size()) : "use_dim=0";
string keepdim_info = (keepdim) ? "keepdim=1" : "keepdim=0";
string key = op_key + ":" + getTensorsStringKey(input_t) + ":" + use_dim_info + ":" + keepdim_info + ":" +
string([ns_key UTF8String]) + ":" + bessel_corrected + ":" + std::to_string(correction_value);
@ -700,7 +700,7 @@ static void min_max_out_mps(const Tensor& input_t,
auto stream = at::mps::getCurrentMPSStream();
@autoreleasepool {
string key = func_name + getTensorsStringKey({input_t, indices_t}) + ":" + to_string(dim_);
string key = func_name + getTensorsStringKey({input_t, indices_t}) + ":" + std::to_string(dim_);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input_t);
MPSGraphTensor* outputTensor = nil;
@ -860,7 +860,7 @@ static void argmax_argmin_out_mps(const Tensor& input_t,
@autoreleasepool {
NSString* ns_key = [[apparent_in_shape valueForKey:@"description"] componentsJoinedByString:@","];
string key =
func_name + ":" + to_string(dim_) + ":" + getTensorsStringKey(input_t) + ":" + string([ns_key UTF8String]);
func_name + ":" + std::to_string(dim_) + ":" + getTensorsStringKey(input_t) + ":" + string([ns_key UTF8String]);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
auto inputScalarType = input_t.scalar_type();
MPSGraphTensor* inputTensor =
@ -1217,7 +1217,7 @@ TORCH_IMPL_FUNC(any_out_mps)
@autoreleasepool {
MPSShape* input_t_shape = getMPSShape(input_t);
string key = string("any_out_mps:") + getMPSShapeString(input_t_shape) + ":" + to_string(dim_) + ":" +
string key = string("any_out_mps:") + getMPSShapeString(input_t_shape) + ":" + std::to_string(dim_) + ":" +
getMPSTypeString(input_t);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSDataType input_type = getMPSDataType(input_t);
@ -1313,7 +1313,7 @@ TORCH_IMPL_FUNC(all_out_mps)
@autoreleasepool {
MPSShape* input_t_shape = getMPSShape(input_t);
string key = string("all_out_mps:") + getMPSShapeString(input_t_shape) + ":" + to_string(dim_) + ":" +
string key = string("all_out_mps:") + getMPSShapeString(input_t_shape) + ":" + std::to_string(dim_) + ":" +
getMPSTypeString(input_t);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSDataType input_type = getMPSDataType(input_t);
@ -1531,8 +1531,8 @@ static void median_out_mps(const Tensor& input_t,
auto stream = at::mps::getCurrentMPSStream();
@autoreleasepool {
string key =
func_name + ":" + to_string(dim_) + ":" + getTensorsStringKey(input_t) + ":" + getTensorsStringKey(indices_t);
string key = func_name + ":" + std::to_string(dim_) + ":" + getTensorsStringKey(input_t) + ":" +
getTensorsStringKey(indices_t);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSGraphTensor* inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, input_t);
MPSGraphTensor* castInputTensor =


@ -108,8 +108,8 @@ TORCH_IMPL_FUNC(topk_out_mps)
// Input as placeholders
MPSShape* input_shape = getMPSShape(self);
NSString* ns_shape_key = [[input_shape valueForKey:@"description"] componentsJoinedByString:@","];
string key = string("topk:") + [ns_shape_key UTF8String] + ":" + getMPSTypeString(self) + ":k" + to_string(k) +
":dim" + to_string(dim_) + ":largest" + to_string(largest);
string key = string("topk:") + [ns_shape_key UTF8String] + ":" + getMPSTypeString(self) + ":k" + std::to_string(k) +
":dim" + std::to_string(dim_) + ":largest" + std::to_string(largest);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
newCachedGraph->selfTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(self), input_shape);
@ -320,12 +320,12 @@ TORCH_IMPL_FUNC(cat_out_mps)
};
@autoreleasepool {
string key =
"cat_out_mps:" + to_string(dimension) + ":" + (memory_format == MemoryFormat::ChannelsLast ? "NHWC" : "NCHW");
string key = "cat_out_mps:" + std::to_string(dimension) + ":" +
(memory_format == MemoryFormat::ChannelsLast ? "NHWC" : "NCHW");
if (!all_same_dtype) {
key += getTensorsStringKey(input_tensors, true, all_same_sizes_and_stride);
} else {
key += ":" + getMPSTypeString(input_tensors[0].scalar_type(), true) + ":" + to_string(inputs.size());
key += ":" + getMPSTypeString(input_tensors[0].scalar_type(), true) + ":" + std::to_string(inputs.size());
}
for (auto idx : skipped_tensor_indices) {
key += "," + std::to_string(idx);


@ -60,8 +60,8 @@ TORCH_IMPL_FUNC(sort_stable_out_mps)
// Input as placeholders
MPSShape* input_shape = getMPSShape(self);
NSString* ns_shape_key = [[input_shape valueForKey:@"description"] componentsJoinedByString:@","];
string key = string("sort:") + [ns_shape_key UTF8String] + ":" + getMPSTypeString(self) + ":dim" + to_string(dim) +
":descending" + to_string(descending);
string key = string("sort:") + [ns_shape_key UTF8String] + ":" + getMPSTypeString(self) + ":dim" +
std::to_string(dim) + ":descending" + std::to_string(descending);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
newCachedGraph->selfTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSDataType(self), input_shape);


@ -240,8 +240,8 @@ static void clamp_scalar_out_mps(const Tensor& input_t,
@autoreleasepool {
// the optional min/max refs could affect how we build the cached graph
string key = op_name + (has_min ? ("_min:" + to_string(min_scalar)) : "") +
(has_max ? ("_max:" + to_string(max_scalar)) : "") + "_scalar:" + getTensorsStringKey({input_t});
string key = op_name + (has_min ? ("_min:" + std::to_string(min_scalar)) : "") +
(has_max ? ("_max:" + std::to_string(max_scalar)) : "") + "_scalar:" + getTensorsStringKey({input_t});
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
if (has_min)
newCachedGraph->minTensor = [mpsGraph


@ -8,47 +8,17 @@
#include <ATen/NativeFunctions.h>
#else
#include <ATen/ops/erfinv_native.h>
#include <ATen/ops/exp_native.h>
#include <ATen/ops/tanh_native.h>
#endif
#include <fmt/format.h>
namespace at::native {
static const std::string& getMetalType(const c10::ScalarType& t) {
// Mapping from c10::ScalarType to integral type that can be used for unary ops
static std::unordered_map<c10::ScalarType, std::string> scalar_to_metal_type = {
{c10::ScalarType::Half, "half"},
{c10::ScalarType::Float, "float"},
{c10::ScalarType::Long, "long"},
{c10::ScalarType::Int, "int"},
{c10::ScalarType::Short, "short"},
{c10::ScalarType::Bool, "bool"},
{c10::ScalarType::Char, "int8_t"},
{c10::ScalarType::Byte, "uint8_t"},
};
auto it = scalar_to_metal_type.find(t);
TORCH_CHECK(it != scalar_to_metal_type.end(), "Unsupported type ", t);
return it->second;
}
static const std::string& getMetalType(const c10::Scalar& s) {
return getMetalType(s.type());
}
static const std::string& getMetalType(const Tensor& t) {
return getMetalType(t.scalar_type());
}
static mps::MetalShaderLibrary lib(UNARY_KERNEL_TEMPLATE, 2);
TORCH_IMPL_FUNC(erfinv_out_mps)(const Tensor& self, const Tensor& output_) {
// handle erfinv ops using metal kernel
// erfinv algorithm ported from aten/src/ATen/native/Math.h
// https://github.com/pytorch/pytorch/blob/4154c8ea159fdaecc71ee9af820ac956193c875b/aten/src/ATen/native/Math.h#L152
TORCH_CHECK(self.scalar_type() != ScalarType::Double, "MPS does not support erfinv op with scalar type: Double");
Tensor inputTensor = self;
static void exec_unary_kernel(const Tensor& self, const Tensor& output_, const std::string& name) {
Tensor inputTensor = self.contiguous();
Tensor outputTensor = output_;
bool needs_output_copy = false;
uint32_t length = output_.numel();
@ -57,10 +27,16 @@ TORCH_IMPL_FUNC(erfinv_out_mps)(const Tensor& self, const Tensor& output_) {
}
using namespace mps;
@autoreleasepool {
auto cplState = lib.getPipelineStateForFunc("erfinv_mps_kernel", {getMetalType(outputTensor), getMetalType(self)});
id<MTLComputePipelineState> cplState = nil;
if (c10::isComplexType(self.scalar_type())) {
auto scalarStr = self.scalar_type() == kComplexFloat ? "float" : "half";
cplState = lib.getPipelineStateForFunc(name + "_complex_kernel", {scalarStr, scalarStr});
} else {
cplState = lib.getPipelineStateForFunc(name + "_kernel",
{scalarToMetalTypeString(outputTensor), scalarToMetalTypeString(self)});
}
if (!self.is_contiguous()) {
inputTensor = inputTensor.contiguous();
if (!outputTensor.is_contiguous()) {
outputTensor = outputTensor.contiguous();
needs_output_copy = true;
}
@ -69,7 +45,7 @@ TORCH_IMPL_FUNC(erfinv_out_mps)(const Tensor& self, const Tensor& output_) {
dispatch_sync(mpsStream->queue(), ^() {
id<MTLComputeCommandEncoder> computeEncoder = mpsStream->commandEncoder();
getMPSProfiler().beginProfileKernel(cplState, "erf_inv", {inputTensor});
getMPSProfiler().beginProfileKernel(cplState, name, {self});
[computeEncoder setComputePipelineState:cplState];
mtl_setBuffer(computeEncoder, outputTensor, 0);
@ -83,4 +59,19 @@ TORCH_IMPL_FUNC(erfinv_out_mps)(const Tensor& self, const Tensor& output_) {
output_.copy_(outputTensor);
}
}
TORCH_IMPL_FUNC(erfinv_out_mps)(const Tensor& self, const Tensor& output_) {
// handle erfinv ops using metal kernel
// erfinv algorithm ported from aten/src/ATen/native/Math.h
// https://github.com/pytorch/pytorch/blob/4154c8ea159fdaecc71ee9af820ac956193c875b/aten/src/ATen/native/Math.h#L152
TORCH_CHECK(self.scalar_type() != ScalarType::Double, "MPS does not support erfinv op with scalar type: Double");
exec_unary_kernel(self, output_, "erfinv");
}
TORCH_IMPL_FUNC(exp_out_mps)(const Tensor& self, const Tensor& output_) {
exec_unary_kernel(self, output_, "exp");
}
TORCH_IMPL_FUNC(tanh_out_mps)(const Tensor& self, const Tensor& output_) {
exec_unary_kernel(self, output_, "tanh");
}
} // namespace at::native

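The refactor above folds erfinv, exp and tanh into one exec_unary_kernel helper that picks a _complex_kernel or _kernel variant by dtype and copies back only when the caller's output was not contiguous. A rough standalone model of that control flow (FakeTensor and the identity "op" are stand-ins, not the ATen code):

#include <string>
#include <vector>

struct FakeTensor {
  std::vector<float> data;
  bool contiguous = true;
};

std::string pick_kernel(const std::string& op, bool is_complex) {
  return is_complex ? op + "_complex_kernel" : op + "_kernel";
}

void exec_unary(const FakeTensor& self, FakeTensor& out, const std::string& op) {
  FakeTensor input = self;                       // stands in for self.contiguous()
  const bool copy_back = !out.contiguous;
  FakeTensor scratch;                            // used only for non-contiguous outputs
  FakeTensor& work = copy_back ? scratch : out;
  work.data.resize(input.data.size());
  (void)pick_kernel(op, /*is_complex=*/false);   // name of the kernel that would run
  for (size_t i = 0; i < input.data.size(); ++i)
    work.data[i] = input.data[i];                // identity stands in for the real op
  if (copy_back)
    out.data = work.data;                        // mirrors output_.copy_(outputTensor)
}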

@ -26,7 +26,6 @@
#include <ATen/ops/cumsum_native.h>
#include <ATen/ops/erf_native.h>
#include <ATen/ops/exp2_native.h>
#include <ATen/ops/exp_native.h>
#include <ATen/ops/expm1_native.h>
#include <ATen/ops/floor_native.h>
#include <ATen/ops/frac_native.h>
@ -54,7 +53,6 @@
#include <ATen/ops/sinh_native.h>
#include <ATen/ops/sqrt_native.h>
#include <ATen/ops/tan_native.h>
#include <ATen/ops/tanh_native.h>
#include <ATen/ops/trunc_native.h>
#include <ATen/ops/view_as_real.h>
#endif
@ -236,7 +234,6 @@ CREATE_MPS_STRUCTURED_UNARY_ROUNDING_TORCH_IMPL_FUNC(round_out_mps, round)
}); \
}
CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(exp_out_mps, exponent)
CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(exp2_out_mps, exponentBase2)
CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(reciprocal_out_mps, reciprocal)
CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(sqrt_out_mps, squareRoot)
@ -254,7 +251,6 @@ CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(acos_out_mps, acos)
CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(atan_out_mps, atan)
CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(sinh_out_mps, sinh)
CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(cosh_out_mps, cosh)
CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(tanh_out_mps, tanh)
CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(asinh_out_mps, asinh)
CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(acosh_out_mps, acosh)
CREATE_MPS_STRUCTURED_UNARY_TORCH_IMPL_FUNC(atanh_out_mps, atanh)


@ -36,8 +36,8 @@ static std::string getUniqueKey(const ScalarType& dtype,
const bool consecutive,
c10::optional<int64_t> dimOpt) {
return "_unique2_mps:" + getMPSTypeString(dtype) + "[" + getArrayRefString(base_shape) + "]:[" +
(dimOpt.has_value() ? to_string(dimOpt.value()) : "None") + "]:[" + to_string(return_inverse) + "]:[" +
to_string(return_counts) + "]:[" + to_string(consecutive) + "]";
(dimOpt.has_value() ? std::to_string(dimOpt.value()) : "None") + "]:[" + std::to_string(return_inverse) + "]:[" +
std::to_string(return_counts) + "]:[" + std::to_string(consecutive) + "]";
}
// dim arg not supported when non consecutive, ie sorted


@ -99,7 +99,7 @@ static void upsample_out_template(const Tensor& input,
@autoreleasepool {
string key = "upsample_" + std::string(resize_mode_str) + (align_corners ? "_aligned_corners" : "") +
getTensorsStringKey({input}) + ":[" + to_string(scale_h) + "," + to_string(scale_w) + "]:[" +
getTensorsStringKey({input}) + ":[" + std::to_string(scale_h) + "," + std::to_string(scale_w) + "]:[" +
(is_backward_pass ? getArrayRefString(input_size) : "Undefined") + "]";
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {


@ -42,7 +42,7 @@ static std::string getStridedKey(const ScalarType& self_dtype,
}
return (is_scatter ? "scatter:" : "gather:") + dtype_key + "[" + getArrayRefString(base_shape) + "]:[" +
getArrayRefString(new_shape) + "]:[" + getArrayRefString(stride) + "]:[" + to_string(storage_offset) + "]";
getArrayRefString(new_shape) + "]:[" + getArrayRefString(stride) + "]:[" + std::to_string(storage_offset) + "]";
}
// initializes the MTLBuffers for tensor data and runs the MPSGraph for the view op


@ -14648,11 +14648,13 @@
variants: function
dispatch:
CUDA: _fbgemm_jagged_to_padded_dense_forward
CPU: _jagged_to_padded_dense_forward_cpu
- func: _padded_dense_to_jagged_forward(Tensor dense, Tensor[] offsets, SymInt? total_L=None) -> Tensor
variants: function
dispatch:
CUDA: _fbgemm_dense_to_jagged_forward_symint
CPU: _padded_dense_to_jagged_forward_cpu
- func: _nested_tensor_softmax_with_shape(Tensor self, Tensor query) -> Tensor
dispatch:
@ -14728,12 +14730,12 @@
CUDA: _scaled_dot_product_efficient_attention_backward_cuda
tags: nondeterministic_seeded
- func: _scaled_dot_product_cudnn_attention(Tensor query, Tensor key, Tensor value, bool compute_log_sumexp, float dropout_p=0.0, bool is_causal=False, *, float? scale=None) -> (Tensor output, Tensor logsumexp, Tensor philox_seed, Tensor philox_offset)
- func: _scaled_dot_product_cudnn_attention(Tensor query, Tensor key, Tensor value, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor output, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, SymInt max_q, SymInt max_k, Tensor philox_seed, Tensor philox_offset, Tensor debug_attn_mask)
dispatch:
CUDA: _scaled_dot_product_cudnn_attention_cuda
tags: nondeterministic_seeded
- func: _scaled_dot_product_cudnn_attention_backward(Tensor grad_out, Tensor query, Tensor key, Tensor value, Tensor out, Tensor logsumexp, Tensor philox_seed, Tensor philox_offset, float dropout_p, bool is_causal, *, float? scale=None) -> (Tensor, Tensor, Tensor)
- func: _scaled_dot_product_cudnn_attention_backward(Tensor grad_out, Tensor query, Tensor key, Tensor value, Tensor out, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, SymInt max_q, SymInt max_k, float dropout_p, bool is_causal, Tensor philox_seed, Tensor philox_offset, *, float? scale=None) -> (Tensor, Tensor, Tensor)
dispatch:
CUDA: _scaled_dot_product_cudnn_attention_backward_cuda
tags: nondeterministic_seeded


@ -246,5 +246,104 @@ Tensor NestedTensor_to_mask(const Tensor& nt, std::optional<int64_t> mask_dim, s
return result;
}
Tensor _jagged_to_padded_dense_forward_cpu(
const Tensor& values,
TensorList offsets_list,
c10::IntArrayRef max_lengths,
const double padding_value) {
// TODO: Make this kernel more efficient using TensorIterator or something.
TORCH_INTERNAL_ASSERT(
offsets_list.size() == 1 && max_lengths.size() == 1,
"_jagged_to_padded_dense_forward(): only a single jagged dim is supported for now");
// allocate appropriately-sized padded tensor
auto offsets = offsets_list[0];
TORCH_CHECK(
offsets.dim() == 1,
"_jagged_to_padded_dense_forward(): expected 1D offsets, but got offsets.dim() == ",
offsets.dim());
auto batch_size = offsets.size(0) - 1;
auto max_length = max_lengths[0];
auto values_shape = values.sizes().vec();
std::vector<int64_t> padded_shape;
padded_shape.reserve(values.dim() + 1);
padded_shape.push_back(batch_size);
padded_shape.push_back(max_length);
padded_shape.insert(padded_shape.end(), values_shape.begin() + 1, values_shape.end());
Tensor padded = values.new_full(padded_shape, padding_value);
// copy data to padded tensor
for (auto i : c10::irange(batch_size)) {
auto start_offset = offsets[i].item<int64_t>();
auto end_offset = offsets[i + 1].item<int64_t>();
auto length = end_offset - start_offset;
// NB: truncate to max length to match CUDA kernel behavior.
length = std::min(length, max_length);
auto source = values.slice(0, start_offset, start_offset + length);
auto dst = padded.select(0, i).slice(0, 0, length);
dst.copy_(source);
}
return padded;
}
Tensor _padded_dense_to_jagged_forward_cpu(
const Tensor& padded,
TensorList offsets_list,
c10::optional<int64_t> total_L) {
// TODO: Make this kernel more efficient using TensorIterator or something.
TORCH_INTERNAL_ASSERT(
offsets_list.size() == 1,
"_padded_dense_to_jagged_forward(): only a single jagged dim is supported for now");
// allocate appropriately-sized values tensor
auto offsets = offsets_list[0];
TORCH_CHECK(
offsets.dim() == 1,
"_padded_dense_to_jagged_forward(): expected 1D offsets, but got offsets.dim() == ",
offsets.dim());
auto final_offset = offsets[-1].item<int64_t>();
int64_t total_L_val = total_L.has_value() ? (*total_L) : final_offset;
if (total_L.has_value()) {
// error if the offsets try to index past the end of the packed dimension
TORCH_CHECK(
final_offset == total_L_val,
"_padded_dense_to_jagged_forward(): final offset should match total_L value");
}
TORCH_CHECK(
padded.dim() >= 2,
"_padded_dense_to_jagged_forward(): expected padded dim >= 2, but padded.dim() == ",
padded.dim());
std::vector<int64_t> values_shape;
values_shape.reserve(padded.dim() - 1);
values_shape.push_back(total_L_val);
auto padded_shape = padded.sizes();
values_shape.insert(values_shape.end(), padded_shape.begin() + 2, padded_shape.end());
Tensor values = padded.new_empty(values_shape);
// copy data to values tensor
auto batch_size = offsets.size(0) - 1;
for (auto i : c10::irange(batch_size)) {
auto start_offset = offsets[i].item<int64_t>();
auto end_offset = offsets[i + 1].item<int64_t>();
auto length = end_offset - start_offset;
TORCH_CHECK(
length <= padded_shape[1],
"_padded_dense_to_jagged_forward(): found batch item of length ", length,
" when max length specified by padded input is ", padded_shape[1]);
auto dst = values.slice(0, start_offset, end_offset);
auto source = padded.select(0, i).slice(0, 0, length);
dst.copy_(source);
}
return values;
}
} // namespace native
} // namespace at

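The two CPU kernels above are straightforward per-batch-item copy loops. The same padding/unpadding round trip on flat buffers, as a small self-contained sketch (plain vectors with one scalar per jagged element; names chosen for illustration):

#include <algorithm>
#include <cstdint>
#include <vector>

// values holds the packed rows of all batch items; offsets has batch_size + 1
// boundaries into it, so row b spans [offsets[b], offsets[b + 1]).
std::vector<float> jagged_to_padded(const std::vector<float>& values,
                                    const std::vector<int64_t>& offsets,
                                    int64_t max_length, float padding_value) {
  const int64_t batch = static_cast<int64_t>(offsets.size()) - 1;
  std::vector<float> padded(batch * max_length, padding_value);
  for (int64_t b = 0; b < batch; ++b) {
    // truncate to max_length, matching the kernel above (and the CUDA one)
    const int64_t len = std::min(offsets[b + 1] - offsets[b], max_length);
    std::copy_n(values.begin() + offsets[b], len, padded.begin() + b * max_length);
  }
  return padded;
}

std::vector<float> padded_to_jagged(const std::vector<float>& padded,
                                    const std::vector<int64_t>& offsets,
                                    int64_t max_length) {
  const int64_t batch = static_cast<int64_t>(offsets.size()) - 1;
  std::vector<float> values(offsets.back());     // total_L == final offset
  for (int64_t b = 0; b < batch; ++b) {
    const int64_t len = offsets[b + 1] - offsets[b];  // expected <= max_length
    std::copy_n(padded.begin() + b * max_length, len, values.begin() + offsets[b]);
  }
  return values;
}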

@ -309,7 +309,7 @@ struct PackedConvWeightsOnednn : public ConvPackedParamsBase<kSpatialDim> {
namespace onednn_utils {
static ideep::attr_t create_attr_by_post_op(
inline ideep::attr_t create_attr_by_post_op(
const c10::string_view& binary_post_op,
double binary_alpha,
double input1_scale,
@ -389,27 +389,9 @@ static ideep::attr_t create_attr_by_post_op(
return ideep::attr_t();
}
// Try to reorder tensor to expected desc at runtime
// Do it in a `try...catch...` manner to avoid oneDNN's errors
// TODO: Move it to third_party/ideep
static void try_reorder(
ideep::tensor& t,
const ideep::tensor::desc&& desc,
ideep::scale_t scales) {
if (t.get_desc() != desc) {
try {
t = t.reorder_if_differ_in(desc);
} catch (...) {
ideep::tensor&& plain = t.to_public(nullptr, t.get_data_type());
t = plain.reorder_if_differ_in(desc);
}
t.set_scale(scales);
}
}
// ONEDNN requires symmetric quantization of weight
// Use this util function to check.
static bool is_weight_symmetric_quant(
inline bool is_weight_symmetric_quant(
const at::Tensor& weight,
bool is_transposed_conv) {
bool is_symmetric = true;
@ -438,7 +420,7 @@ static bool is_weight_symmetric_quant(
// When qengine is x86, use this util func to check if onednn kernel
// is preferred than fbgemm's to get better performance.
static bool should_use_onednn_quant(
inline bool should_use_onednn_quant(
const at::Tensor& weight,
bool is_transposed_conv,
int groups,
@ -472,29 +454,4 @@ at::Tensor _qconv_prepack_onednn(
int64_t groups,
std::optional<torch::List<int64_t>> input_shape=c10::nullopt);
static at::Tensor _quantized_convolution_onednn(
at::Tensor act, // contains quantized values but not QTensor
double act_scale,
int64_t act_zero_point,
at::Tensor weight, // MKLDNN tensor with quantized values
at::Tensor weight_scales,
at::Tensor weight_zero_points,
std::optional<at::Tensor> bias, // Bias is packed if not None
torch::List<int64_t> stride,
torch::List<int64_t> padding,
torch::List<int64_t> dilation,
bool transposed,
int64_t groups,
double output_scale,
int64_t output_zero_point,
std::optional<at::Tensor> accum=c10::nullopt, // accum to fused with conv add
double accum_scale=1.0,
int64_t accum_zero_point=0,
bool fp32_output=false,
std::optional<c10::string_view> binary_attr=c10::nullopt,
std::optional<at::Scalar> binary_alpha=c10::nullopt,
std::optional<c10::string_view> unary_attr=c10::nullopt,
torch::List<std::optional<at::Scalar>> unary_scalars=torch::List<std::optional<at::Scalar>>(),
std::optional<c10::string_view> unary_algorithm=c10::nullopt);
#endif // #if AT_MKLDNN_ENABLED()

View File

@ -14,6 +14,7 @@
#include <ATen/native/mkl/SparseBlasImpl.h>
#include <ATen/native/sparse/SparseBlasImpl.h>
#include <ATen/native/sparse/SparseCsrTensorMath.h>
#include <c10/macros/Macros.h>
#include <c10/util/irange.h>
#include <ATen/AccumulateType.h>
@ -314,14 +315,6 @@ inline Tensor get_result_tensor_for_unary_op(F op, const Tensor& input) {
}
} // namespace
// Only accept square sparse matrices or dense input as a vector
// TODO: Check what happens with MKL, the output error reported with non-square
// matrices tends to be high. See:
// https://github.com/pytorch/pytorch/issues/58770
static bool is_square_or_vec(int64_t dim_i, int64_t dim_j, int64_t dim_k) {
return (dim_i == dim_k && dim_k == dim_j) || (dim_i == dim_j && dim_k == 1);
}
Tensor& normal_sparse_csr_(
Tensor& self,
double mean,
@ -473,7 +466,10 @@ CREATE_UNARY_UFUNC(tan);
CREATE_UNARY_UFUNC(tanh);
CREATE_UNARY_UFUNC(trunc);
CREATE_UNARY_UFUNC(conj_physical);
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-function")
static CREATE_UNARY_UFUNC(relu);
C10_DIAGNOSTIC_POP()
// With addition of `round.decimals` overload, using CREATE_UNARY_UFUNC leads
// to unresolved overload.

View File

@ -598,17 +598,6 @@ at::Tensor post_process_flash_output(
return out;
}
int64_t handle_private_use(const Tensor& query_, const Tensor& key, const Tensor& value,
const std::optional<Tensor>& attn_mask_, double dropout_p, bool is_causal, std::optional<double> scale){
int64_t choice_int = static_cast<int64_t>(sdp::SDPBackend::math);
try {
choice_int = _fused_sdp_choice_stub(query_.device().type(),
query_, key, value, attn_mask_, dropout_p, is_causal, scale);
} catch(const ::c10::Error& e){
}
return choice_int;
}
bool should_compute_logsumexp(const Tensor& query, const Tensor& key, const Tensor& value) {
const bool any_inputs_require_grad = query.requires_grad() || key.requires_grad() || value.requires_grad();
const bool gradmode_enabled = at::GradMode::is_enabled();
@ -666,7 +655,7 @@ Tensor scaled_dot_product_attention(
case sdp::SDPBackend::cudnn_attention: {
bool compute_logsumexp = should_compute_logsumexp(query_, key, value);
auto out_lse_softmax = at::_scaled_dot_product_cudnn_attention(
query_, key, value, compute_logsumexp, dropout_p, is_causal, scale);
query_, key, value, dropout_p, is_causal, compute_logsumexp, scale);
return std::get<0>(out_lse_softmax);
}
case sdp::SDPBackend::flash_attention: {

View File

@ -735,27 +735,14 @@ std::tuple<Tensor, Tensor, Tensor, Tensor, c10::SymInt, c10::SymInt, Tensor, Ten
return std::make_tuple(attention, logsumexp, Tensor(), Tensor(), max_seqlen_batch_q, max_seqlen_batch_k, philox_seed, philox_offset, debug_attn_mask);
}
// Adapted from TE
// extract seed and offset from PhiloxCudaState
__global__ void unpack_cudnn(at::PhiloxCudaState arg, int64_t* seed_ptr, int64_t* offset_ptr) {
if (arg.captured_) {
*seed_ptr = static_cast<int64_t>(*arg.seed_.ptr);
*offset_ptr = static_cast<int64_t>(
*(arg.offset_.ptr) + static_cast<int64_t>(arg.offset_intragraph_));
} else {
*seed_ptr = static_cast<int64_t>(arg.seed_.val);
*offset_ptr = static_cast<int64_t>(arg.offset_.val);
}
}
std::tuple<Tensor, Tensor, Tensor, Tensor> _scaled_dot_product_cudnn_attention_cuda(
std::tuple<Tensor, Tensor, Tensor, Tensor, c10::SymInt, c10::SymInt, Tensor, Tensor, Tensor> _scaled_dot_product_cudnn_attention_cuda(
const Tensor& query,
const Tensor& key,
const Tensor& value,
bool compute_logsumexp,
double dropout_p,
bool is_causal,
c10::optional<double> scale) {
bool training,
std::optional<double> scale) {
// Used for tracking usage statistics
C10_LOG_API_USAGE_ONCE("torch.sdpa.flash_attention_cudnn");
// Query (Batch x Num_heads x Q_seq_len x Dim_per_head)
@ -774,33 +761,9 @@ std::tuple<Tensor, Tensor, Tensor, Tensor> _scaled_dot_product_cudnn_attention_c
Tensor attention, log_sumexp;
at::Tensor cudnn_seed, cudnn_offset;
cudnn_seed = at::empty({}, at::dtype(at::kLong).device(at::kCUDA));
cudnn_offset = at::empty({}, at::dtype(at::kLong).device(at::kCUDA));
const bool use_dropout = std::fpclassify(dropout_p) != FP_ZERO;
// See Note [Seed and Offset Device] in _efficient_attention_forward
at::PhiloxCudaState philox_state;
const bool in_capture_stream =
at::cuda::currentStreamCaptureStatus() != at::cuda::CaptureStatus::None;
if (use_dropout) {
// Device
auto gen = at::get_generator_or_default<at::CUDAGeneratorImpl>(
c10::nullopt, at::cuda::detail::getDefaultCUDAGenerator());
// See Note [Acquire lock when using random generators]
std::lock_guard<std::mutex> lock(gen->mutex_);
// if using dropout, we produce 1 random number for each element of the
// attention tensor
// TODO(eqy): should state be advanced per thread (local) amount or per call/launch (global) amount
philox_state = gen->philox_cuda_state(batch_size * num_heads * max_seqlen_batch_q * max_seqlen_batch_k);
unpack_cudnn<<<1, 1, 0, at::cuda::getCurrentCUDAStream()>>>(
philox_state, static_cast<int64_t*>(cudnn_seed.data_ptr()), static_cast<int64_t*>(cudnn_offset.data_ptr()));
}
auto cudnn_seed = at::zeros({1}, query.options().dtype(kLong));
auto cudnn_offset = at::zeros({1}, query.options().dtype(kLong));
const auto softmax_scale = sdp::calculate_scale(query, scale).as_float_unchecked();
Tensor debugmask;
run_cudnn_SDP_fprop(batch_size/*int64_t b*/,
num_heads/*int64_t h*/,
@ -808,7 +771,7 @@ std::tuple<Tensor, Tensor, Tensor, Tensor> _scaled_dot_product_cudnn_attention_c
max_seqlen_batch_k/*int64_t s_kv*/,
head_dim/*int64_t d*/,
softmax_scale/*float scaling_factor*/,
compute_logsumexp/* bool */,
training/* bool */,
is_causal/* bool */,
dropout_p/*double dropout_probability*/,
query/* Tensor q*/,
@ -819,7 +782,7 @@ std::tuple<Tensor, Tensor, Tensor, Tensor> _scaled_dot_product_cudnn_attention_c
cudnn_seed/*Tensor dropoutseed*/,
cudnn_offset/*Tensor dropoutoffset*/);
return std::make_tuple(attention, log_sumexp, cudnn_seed, cudnn_offset);
return std::make_tuple(attention, log_sumexp, Tensor(), Tensor(), max_seqlen_batch_q, max_seqlen_batch_k, cudnn_seed, cudnn_offset, Tensor());
}
std::tuple<Tensor, Tensor, Tensor, Tensor> _scaled_dot_product_efficient_attention_cuda(

View File

@ -171,32 +171,18 @@ std::tuple<Tensor, Tensor, Tensor> _scaled_dot_product_cudnn_attention_backward_
const Tensor& value,
const Tensor& out,
const Tensor& logsumexp,
const Tensor& philox_seed,
const Tensor& philox_offset,
// const Tensor& cumulative_sequence_length_q,
// const Tensor& cumulative_sequence_length_k,
// const int64_t max_seqlen_batch_q,
// const int64_t max_seqlen_batch_k,
const Tensor& cumulative_sequence_length_q,
const Tensor& cumulative_sequence_length_k,
const int64_t max_seqlen_batch_q,
const int64_t max_seqlen_batch_k,
double dropout_p,
bool is_causal,
c10::optional<double> scale) {
auto& ctx = at::globalContext();
if (ctx.deterministicAlgorithms()) {
if (ctx.deterministicAlgorithmsWarnOnly()) {
TORCH_WARN_ONCE(
"cuDNN Attention defaults to a non-deterministic algorithm. ",
"To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False).");
}
}
const Tensor& philox_seed,
const Tensor& philox_offset,
std::optional<double> scale) {
const int64_t batch_size = query.size(0);
const int64_t num_heads = query.size(1);
const int64_t head_dim = query.size(3);
const int64_t max_seqlen_batch_q = query.size(1);
const int64_t max_seqlen_batch_k = key.size(1);
const auto softmax_scale = sdp::calculate_scale(query, scale).as_float_unchecked();

View File

@ -6,7 +6,6 @@
#include <ATen/core/Tensor.h>
#include <ATen/core/grad_mode.h>
#include <ATen/cuda/CUDAContext.h>
#include <ATen/cuda/CUDAConfig.h>
#include <ATen/detail/CUDAHooksInterface.h>
#include <ATen/native/DispatchStub.h>
#include <ATen/native/transformers/cuda/sdp_utils.h>
@ -45,28 +44,14 @@
namespace sdp {
namespace {
// TODO(eqy): more benchmarking to determine whether this should include sm86/89
// Needs to be kept in sync with test_fused_choice in test_transformers.py
bool check_prefer_cudnn_attention() {
auto dprops = at::cuda::getCurrentDeviceProperties();
return dprops->major >= 9;
}
// flash_attention V2 is universally faster than efficient_attention and Math
std::array<SDPBackend, num_backends> priority_order(sdp_params const& params) {
constexpr std::array<SDPBackend, num_backends> default_order{
SDPBackend::flash_attention,
SDPBackend::cudnn_attention,
SDPBackend::efficient_attention,
SDPBackend::math};
constexpr std::array<SDPBackend, num_backends> cudnn_order{
SDPBackend::cudnn_attention,
SDPBackend::flash_attention,
SDPBackend::efficient_attention,
SDPBackend::math};
static const bool prefer_cudnn = check_prefer_cudnn_attention();
return prefer_cudnn ? cudnn_order : default_order;
return default_order;
}
bool use_tensor_cores(sdp_params const& params, cudaDeviceProp* dprops, bool is_half) {
@ -466,6 +451,17 @@ bool check_cudnn_hardware_support(sdp_params const& params, bool debug) {
return true;
}
bool check_is_causal(sdp_params const& params, bool debug) {
// Check that the input is causal
if (!params.is_causal) {
if (debug) {
TORCH_WARN("CuDNN requires is_causal=True.");
}
return false;
}
return true;
}
bool check_for_nested_inputs(sdp_params const& params, bool debug) {
// Check that the input is nested
if (has_for_nested_inputs(params)) {
@ -489,6 +485,22 @@ bool check_dtypes_low_precision(sdp_params const& params, bool debug) {
}
}
bool check_runtime_enabled_cudnn(sdp_params const& params, bool debug) {
static c10::once_flag supported_flag;
static bool supported = false;
c10::call_once(supported_flag, []() {
supported = (c10::utils::check_env("TORCH_CUDNN_SDPA_ENABLED") == true);
});
if (!supported) {
if (debug) {
TORCH_WARN(
"The CuDNN backend needs to be enabled by setting the enviornment variable`TORCH_CUDNN_SDPA_ENABLED=1`");
}
return false;
}
return true;
}
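Where this gate is compiled in, the cuDNN backend is opt-in: because the check above is cached with c10::call_once, the variable has to be visible before the first SDPA dispatch in the process. A rough usage sketch (assumes a CUDA build with cuDNN available; the backend is still subject to every other gate in this file):

import os
os.environ.setdefault("TORCH_CUDNN_SDPA_ENABLED", "1")  # or export it in the shell before launching

import torch
import torch.nn.functional as F

q = k = v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
# Backend selection happens inside the call; cuDNN is only considered if every
# constraint check in this file passes for these inputs.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)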
bool check_runtime_disabled_cudnn(sdp_params const& params, bool debug) {
// We check the global context to see if the user has explicitly turned off cuDNN
// SDP kernels
@ -501,15 +513,13 @@ bool check_runtime_disabled_cudnn(sdp_params const& params, bool debug) {
return true;
}
bool check_cudnn_deterministic(const sdp_params& params, bool debug) {
auto& ctx = at::globalContext();
if (ctx.deterministicAlgorithms()) {
if (!ctx.deterministicAlgorithmsWarnOnly()) {
if (debug) {
TORCH_WARN("cuDNN SDPA is not deterministic.");
}
return false;
bool check_cudnn_requires_grad(sdp_params const& params, bool debug) {
// Check that the inputs do not require grad
if (input_requires_grad(params)) {
if (debug) {
TORCH_WARN("CuDNN does not currently support inputs with requires_grad=True.");
}
return false;
}
return true;
}
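The deterministic gate shown here interacts with the global determinism switch; a short sketch of the two modes (assuming a build where check_cudnn_deterministic is active):

import torch

# warn_only=True: cuDNN SDPA may still be selected, and a one-time warning notes
# that the algorithm is non-deterministic.
torch.use_deterministic_algorithms(True, warn_only=True)

# warn_only=False: check_cudnn_deterministic returns false, so selection falls
# through to the next backend in the priority order (flash / efficient / math).
torch.use_deterministic_algorithms(True, warn_only=False)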
@ -517,29 +527,21 @@ bool check_cudnn_deterministic(const sdp_params& params, bool debug) {
} // namespace
bool can_use_cudnn_attention(const sdp_params& params, bool debug) {
#if defined(USE_ROCM) || !AT_CUDNN_ENABLED() || \
(defined(CUDNN_VERSION) && CUDNN_VERSION < 8900)
TORCH_WARN_ONCE(!debug, "Torch was not compiled with cuDNN attention.");
return false;
#endif
// Define gate functions that determine if the cuDNN attention kernel can be run
// Replace with std::to_array when we migrate to C++20
constexpr auto general_constraints =
array_of<bool (*)(sdp_params const&, bool)>(
check_for_nested_inputs,
check_nonzero_sequence_lengths_dense,
check_last_dim_stride_equals_1_dense<true /*ignore_singleton_dim>*/>,
check_all_tensors_on_device,
check_tensor_shapes,
check_cudnn_tensor_shapes,
check_runtime_enabled_cudnn,
check_runtime_disabled_cudnn,
check_cudnn_deterministic,
// check_cudnn_layout,
check_cudnn_hardware_support,
check_all_tensors_on_device,
check_cudnn_tensor_shapes,
check_cudnn_layout,
// check_is_causal,
check_dtypes_low_precision,
check_for_attn_mask_cudnn,
check_cudnn_hardware_support
);
check_for_nested_inputs,
check_cudnn_requires_grad,
check_dtypes_low_precision);
for (auto& constraint : general_constraints) {
if (!constraint(params, debug)) {
return false;
@ -683,7 +685,6 @@ SDPBackend select_sdp_backend(sdp_params const& kernel_params) {
switch (backend) {
case SDPBackend::cudnn_attention:
if (sdp::can_use_cudnn_attention(kernel_params, print_debug)) {
TORCH_WARN("USING CUDNN SDPA");
return SDPBackend::cudnn_attention;
}
break;
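When debugging which backend select_sdp_backend ends up choosing, the candidate set can also be constrained from Python. A sketch, with the caveat that torch.nn.attention.sdpa_kernel and a CUDNN_ATTENTION enum member are only present in sufficiently new builds:

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
backends = [SDPBackend.FLASH_ATTENTION, SDPBackend.MATH]
if hasattr(SDPBackend, "CUDNN_ATTENTION"):
    backends.insert(0, SDPBackend.CUDNN_ATTENTION)  # try cuDNN first if this build exposes it
with sdpa_kernel(backends):
    # If every allowed backend is rejected by its gate functions, the call raises
    # instead of silently falling back to a disallowed backend.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)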

View File

@ -266,18 +266,7 @@ inline bool check_requires_grad_and_nested(sdp_params const& params, bool debug)
inline bool check_for_attn_mask(sdp_params const& params, bool debug) {
if (params.attn_mask.has_value()) {
if (debug) {
TORCH_WARN("Flash Attention do not support non-null attn_mask.");
}
return false;
}
return true;
}
// TODO(eqy): remove this once support is added
inline bool check_for_attn_mask_cudnn(sdp_params const& params, bool debug) {
if (params.attn_mask.has_value()) {
if (debug) {
TORCH_WARN("cuDNN Attention does not support non-null attn_mask.");
TORCH_WARN("Flash Attention does not support non-null attn_mask.");
}
return false;
}
@ -324,7 +313,7 @@ inline bool check_tensor_shapes(sdp_params const& params, bool debug) {
(query_dim == 4))) {
if (debug) {
TORCH_WARN(
"All fused kernels requires query, key and value to be 4 dimensional, but got Query dim: ",
"Both fused kernels requires query, key and value to be 4 dimensional, but got Query dim: ",
query_dim,
", Key dim: ",
params.key.dim(),
@ -436,7 +425,7 @@ inline bool check_nonzero_sequence_lengths_dense(sdp_params const& params, bool
if (zero_seq_len_q || zero_seq_len_k) {
if (debug) {
TORCH_WARN(
"All fused kernels do not support zero seq_len_q or seq_len_kv.");
"Both fused kernels do not support zero seq_len_q or seq_len_kv.");
}
return false;
}
@ -471,7 +460,7 @@ inline bool check_last_dim_stride_equals_1_dense(sdp_params const& params, bool
}
epilogue_message << " instead.";
TORCH_WARN(
"All fused kernels require the last dimension of the input to have stride 1. ",
"Both fused kernels require the last dimension of the input to have stride 1. ",
"Got Query.stride(-1): ",
params.query.sym_stride(-1),
", Key.stride(-1): ",

View File

@ -5,7 +5,7 @@
namespace at::native {
template <typename T>
static void vol2col(
void vol2col(
const T* data_vol,
const int64_t channels,
const int64_t depth,
@ -56,7 +56,7 @@ static void vol2col(
}
template <typename T>
static void col2vol(
void col2vol(
const T* data_col,
const int64_t channels,
const int64_t depth,

View File

@ -6,6 +6,14 @@ import textwrap
import pandas as pd
# Hack to have something similar to DISABLED_TEST. These models are flaky.
flaky_models = {
"yolov3",
"gluon_inception_v3",
}
def get_field(csv, model_name: str, field: str):
try:
return csv.loc[csv["name"] == model_name][field].item()
@ -25,6 +33,13 @@ def check_accuracy(actual_csv, expected_csv, expected_filename):
status = "PASS" if expected_accuracy == "pass" else "XFAIL"
print(f"{model:34} {status}")
continue
elif model in flaky_models:
if accuracy == "pass":
# model passed but marked xfailed
status = "PASS_BUT_FLAKY:"
else:
# model failed but marked pass
status = "FAIL_BUT_FLAKY:"
elif accuracy != "pass":
status = "FAIL:"
failed.append(model)
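The per-model status logic from this hunk, pulled out into a standalone function for clarity (the function name is ours, the real harness works on pandas rows, and the surrounding branches are only partly visible in the diff):

FLAKY_MODELS = {"yolov3", "gluon_inception_v3"}

def classify_status(model: str, accuracy: str, expected_accuracy: str) -> str:
    if accuracy == expected_accuracy:
        # Matches the expected CSV entry: a pass stays PASS, an expected failure is XFAIL.
        return "PASS" if expected_accuracy == "pass" else "XFAIL"
    if model in FLAKY_MODELS:
        # Flaky models never fail the job; only the direction of the mismatch is reported.
        return "PASS_BUT_FLAKY:" if accuracy == "pass" else "FAIL_BUT_FLAKY:"
    return "FAIL:" if accuracy != "pass" else "PASS"

print(classify_status("yolov3", "fail_accuracy", "pass"))  # FAIL_BUT_FLAKY: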

View File

@ -14,11 +14,11 @@ AllenaiLongformerBase,pass,9
BartForCausalLM,pass,12
BartForCausalLM,pass,6
BartForConditionalGeneration,pass,24
BartForConditionalGeneration,pass,8
@ -34,11 +34,11 @@ BlenderbotForCausalLM,eager_fail_to_run,0
BlenderbotSmallForCausalLM,pass,12
BlenderbotSmallForCausalLM,pass,6
BlenderbotSmallForConditionalGeneration,pass,24
BlenderbotSmallForConditionalGeneration,pass,8
@ -102,11 +102,11 @@ M2M100ForConditionalGeneration,pass,4
MBartForCausalLM,pass,12
MBartForCausalLM,pass,6
MBartForConditionalGeneration,pass,24
MBartForConditionalGeneration,pass,8
@ -130,23 +130,23 @@ MobileBertForQuestionAnswering,pass,3
OPTForCausalLM,pass,12
OPTForCausalLM,pass,6
PLBartForCausalLM,pass,12
PLBartForCausalLM,pass,6
PLBartForConditionalGeneration,pass,29
PLBartForConditionalGeneration,pass,8
PegasusForCausalLM,pass,12
PegasusForCausalLM,pass,6
PegasusForConditionalGeneration,pass,23
PegasusForConditionalGeneration,pass,7
@ -158,7 +158,7 @@ RobertaForQuestionAnswering,pass,5
Speech2Text2ForCausalLM,pass,12
Speech2Text2ForCausalLM,pass,6
@ -170,11 +170,11 @@ T5Small,pass,5
TrOCRForCausalLM,pass,12
TrOCRForCausalLM,pass,6
XGLMForCausalLM,pass,12
XGLMForCausalLM,pass,6

View File

@ -150,7 +150,7 @@ hf_Bert_large,pass,0
hf_BigBird,pass,46
hf_BigBird,pass,43
@ -378,4 +378,4 @@ vision_maskrcnn,pass,17
yolov3,pass,2
yolov3,pass,0

View File

@ -98,7 +98,7 @@ hf_Bert_large,pass,6
hf_BigBird,pass, 52
hf_BigBird,pass,49
@ -286,4 +286,4 @@ vision_maskrcnn,pass,34
yolov3,pass,9
yolov3,pass,8

View File

@ -242,7 +242,7 @@ pyhpc_equation_of_state,pass,0
pyhpc_isoneutral_mixing,fail_to_run,0
pyhpc_isoneutral_mixing,pass,0
@ -350,4 +350,4 @@ vision_maskrcnn,fail_to_run,0
yolov3,fail_to_run,0
yolov3,pass,0

View File

@ -338,4 +338,4 @@ vision_maskrcnn,pass,28
yolov3,pass,2
yolov3,pass,0

View File

@ -338,4 +338,4 @@ vision_maskrcnn,pass,28
yolov3,pass,2
yolov3,pass,0

View File

@ -242,7 +242,7 @@ pyhpc_equation_of_state,pass,0
pyhpc_isoneutral_mixing,fail_to_run,0
pyhpc_isoneutral_mixing,pass,0
@ -350,4 +350,4 @@ vision_maskrcnn,fail_to_run,0
yolov3,fail_to_run,0
yolov3,pass,0

View File

@ -14,11 +14,11 @@ AllenaiLongformerBase,pass,9
BartForCausalLM,pass,12
BartForCausalLM,pass,6
BartForConditionalGeneration,pass,24
BartForConditionalGeneration,pass,8
@ -34,11 +34,11 @@ BlenderbotForCausalLM,eager_fail_to_run,0
BlenderbotSmallForCausalLM,pass,12
BlenderbotSmallForCausalLM,pass,6
BlenderbotSmallForConditionalGeneration,pass,24
BlenderbotSmallForConditionalGeneration,pass,8
@ -102,11 +102,11 @@ M2M100ForConditionalGeneration,pass,4
MBartForCausalLM,pass,12
MBartForCausalLM,pass,6
MBartForConditionalGeneration,pass,24
MBartForConditionalGeneration,pass,8
@ -130,23 +130,23 @@ MobileBertForQuestionAnswering,pass,3
OPTForCausalLM,pass,12
OPTForCausalLM,pass,6
PLBartForCausalLM,pass,12
PLBartForCausalLM,pass,6
PLBartForConditionalGeneration,pass,29
PLBartForConditionalGeneration,pass,8
PegasusForCausalLM,pass,12
PegasusForCausalLM,pass,6
PegasusForConditionalGeneration,pass,23
PegasusForConditionalGeneration,pass,7
@ -158,7 +158,7 @@ RobertaForQuestionAnswering,pass,5
Speech2Text2ForCausalLM,pass,12
Speech2Text2ForCausalLM,pass,6
@ -170,11 +170,11 @@ T5Small,pass,5
TrOCRForCausalLM,pass,12
TrOCRForCausalLM,pass,6
XGLMForCausalLM,pass,12
XGLMForCausalLM,pass,6

View File

@ -150,7 +150,7 @@ hf_Bert_large,pass,0
hf_BigBird,fail_accuracy,46
hf_BigBird,fail_accuracy,43

View File

@ -98,7 +98,7 @@ hf_Bert_large,pass,6
hf_BigBird,pass,52
hf_BigBird,pass,49

Some files were not shown because too many files have changed in this diff.