Commit Graph

568 Commits

Author SHA1 Message Date
ed5d6d2a20 python definitely_contiguous-> is_contiguous_or_false (#156515)
We probably can avoid having those in python as well and  just depend on c++ impl after we land https://github.com/pytorch/pytorch/pull/155590 but that is for a different PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156515
Approved by: https://github.com/bobrenjc93
2025-06-30 17:31:51 +00:00
75a7d9e868 Revert "python definitely_contiguous-> is_contiguous_or_false (#156515)"
This reverts commit 4c0091fda65b714fa73671a15e379f814af153e0.

Reverted https://github.com/pytorch/pytorch/pull/156515 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause some torch.export failures internally ([comment](https://github.com/pytorch/pytorch/pull/156515#issuecomment-3014104570))
2025-06-27 19:07:06 +00:00
cbcffce48a address remaining straight forward gso in meta_registrations (#156902)
Those are all straight forward generalization of existing checks,
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156902
Approved by: https://github.com/ColinPeppler
2025-06-27 06:19:54 +00:00
4c0091fda6 python definitely_contiguous-> is_contiguous_or_false (#156515)
We probably can avoid having those in python as well and  just depend on c++ impl after we land https://github.com/pytorch/pytorch/pull/155590 but that is for a different PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156515
Approved by: https://github.com/bobrenjc93
2025-06-26 00:47:14 +00:00
04178d347c [Reland] [Intel GPU] Make SDPA output has the same stride as Query. (#154340)
Fixes [#153903](https://github.com/pytorch/pytorch/issues/153903).

Currently the output tensor of SDPA XPU is always defined as contiguous stride, while CPU/CUDA flash_attention and cudnn_attention allocate output tensor with stride the same as Query.

This PR aligns XPU's behavior with CUDA/CPU to make XPU compatible to CPU/CUDA's modeling code.

The function `alloc_with_matching_layout` is copied from cudnn 8c16d0e404/aten/src/ATen/native/cudnn/MHA.cpp (L874)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154340
Approved by: https://github.com/guangyey, https://github.com/drisspg
2025-06-24 06:09:59 +00:00
6ed85bfe6a Refine alignment check along dynamic dimension for grouped MMs (#155466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155466
Approved by: https://github.com/ngimel
2025-06-20 19:42:57 +00:00
72c8751b61 Align meta deducing for fft_r2c with fft_r2c_mkl on XPU (#156048)
There is a memory layout mismatching between `fft_r2c` XPU and Inductor meta deducing.
Original `fft_r2c` Inductor meta deducing for XPU backend is aligned with CPU (fallback). This PR is to correct the Inductor meta deducing and update the torch-xpu-ops commit to [intel/torch-xpu-ops@`3a9419c`](3a9419c8bb).
The XPU implementation first performs the R2C transform on the last dimension, followed by iterative C2C transforms on the remaining dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156048
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel
2025-06-20 01:41:03 +00:00
0b62465b99 Revert "Refine alignment check along dynamic dimension for grouped MMs (#155466)"
This reverts commit 830a335a7da5fec00395d440ba568749cb4e2e9e.

Reverted https://github.com/pytorch/pytorch/pull/155466 on behalf of https://github.com/atalman due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/155466#issuecomment-2988285117))
2025-06-19 14:25:38 +00:00
3f69e3b3a0 Add view_simple as meta function for view, and avoid calling reshape_view_helper for unbacked (#154757)
address https://github.com/pytorch/pytorch/issues/153303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154757
Approved by: https://github.com/bobrenjc93, https://github.com/leslie-fang-intel
2025-06-19 04:50:18 +00:00
830a335a7d Refine alignment check along dynamic dimension for grouped MMs (#155466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155466
Approved by: https://github.com/ngimel
2025-06-18 15:15:05 +00:00
06408dae49 Revert "Add view_simple as meta function for view, and avoid calling reshape_view_helper. (#154757)"
This reverts commit 0029259bdfeee627181df2b9f5ff6979f65090ec.

Reverted https://github.com/pytorch/pytorch/pull/154757 on behalf of https://github.com/laithsakka due to post land issue ([comment](https://github.com/pytorch/pytorch/pull/154757#issuecomment-2971385787))
2025-06-13 19:11:43 +00:00
0029259bdf Add view_simple as meta function for view, and avoid calling reshape_view_helper. (#154757)
address https://github.com/pytorch/pytorch/issues/153303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154757
Approved by: https://github.com/bobrenjc93, https://github.com/leslie-fang-intel
2025-06-12 09:58:15 +00:00
9bd0830ed8 [dynamic shapes] guard_or_false for cat, repeat (#155290)
Summary:
assumes:
- specified repeats are non-negative
- 1d cat arguments like [u0] aren't non-zero sized (replaces existing size-oblivious)

Test Plan:
test_export

Rollback Plan:

Differential Revision: D76092011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155290
Approved by: https://github.com/laithsakka
2025-06-11 21:03:32 +00:00
f8baec8984 Update auto-tuning support for _scaled_grouped_mm (#150944)
1. Enable strided inputs
2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs
3. Fix non-TMA load variant
4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor
5. Fix cases when group size along K dimension is not multiple of block size along K
6. Updated meta registration
7. Update synthetic offsets creation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150944
Approved by: https://github.com/ngimel, https://github.com/davidberard98
2025-06-11 19:12:52 +00:00
e12597090c Revert "Update auto-tuning support for _scaled_grouped_mm (#150944)"
This reverts commit 09328eb02f5412d2211b5fd638ce82d0e03b9c1f.

Reverted https://github.com/pytorch/pytorch/pull/150944 on behalf of https://github.com/davidberard98 due to breaks internal usage & complicates triton pin update - more details in https://github.com/pytorch/pytorch/pull/150944#issuecomment-2957246463 ([comment](https://github.com/pytorch/pytorch/pull/150944#issuecomment-2957248841))
2025-06-09 23:12:56 +00:00
09328eb02f Update auto-tuning support for _scaled_grouped_mm (#150944)
1. Enable strided inputs
2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs
3. Fix non-TMA load variant
4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor
5. Fix cases when group size along K dimension is not multiple of block size along K
6. Updated meta registration
7. Update synthetic offsets creation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150944
Approved by: https://github.com/ngimel
2025-06-08 10:18:13 +00:00
fc77269262 Add randint_like tensor overload for high (#154899)
Fixes #135664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154899
Approved by: https://github.com/StrongerXi
2025-06-06 15:48:00 +00:00
5130ac64f4 Revert "Add randint_like tensor overload for high (#154899)"
This reverts commit 72fe1d5f42aa9bffa876932a3b4fcae052b99168.

Reverted https://github.com/pytorch/pytorch/pull/154899 on behalf of https://github.com/seemethere due to Failing internal tests see https://fburl.com/diff/bai044ob ([comment](https://github.com/pytorch/pytorch/pull/154899#issuecomment-2942740661))
2025-06-05 04:54:05 +00:00
72fe1d5f42 Add randint_like tensor overload for high (#154899)
Fixes #135664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154899
Approved by: https://github.com/StrongerXi
ghstack dependencies: #154863
2025-06-04 03:37:09 +00:00
77d85a4629 Symintify baddbmm (#154656)
Previously we would specialize on the shape in this if-statement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154656
Approved by: https://github.com/pianpwk
2025-06-02 15:23:14 +00:00
1bc5762495 [Intel GPU][Inductor] Fallback embedding_dense_backward on XPU (#151637)
Reopen #146888, now the modification only affects xpu device. We do not  want to decompose embedding_dense_backward for torch.compile. Current XPU devices have hardware limitations on atomic ops. Fallback to eager and we can use sort to implement this op. hf_T5 amp bf16 training in torchbench can get 2x improvement on Max 1550. ~~I also align with cuda on gelu decomposition in _addmm_activation~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151637
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang
2025-05-19 02:19:37 +00:00
c2bc7e2827 API change for new enum in cusparseltsplitkmode-t for cusparseLT 0.7.0+ (#150536)
Changing the bool to int to express split_k_mode. Before 0.7.0 we only have 2 cusparseLtSplitKMode_t enum values ONE_KERNEL and TWO_KERNELS so a boolean is enough but since 0.7.0 there are more.

For Blackwell, there has to be minor change to parameter split_k_one_kernel (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp#L103), since there are new values introduced to enum [cusparseLtSplitKMode_t](https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltsplitkmode-t) and a bool type is not enough for it (would have to be replaced with integer) https://docs.nvidia.com/cuda/cusparselt/types.html#cusparseltsplitkmode-t

Error we see without the change
```
RuntimeError: CUDA error: invalid value when calling `cusparseLtMatmulAlgSetAttribute( &handle, &alg_sel, CUSPARSELT_MATMUL_SPLIT_K_MODE, &splitKMode, sizeof(splitKMode))`

To execute this test, run the following from the base repo dir:
    python test/test_sparse_semi_structured.py TestSparseSemiStructuredCUSPARSELTCUDA.test_csrc_cslt_sparse_mm_search_cuda_int8
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150536
Approved by: https://github.com/jcaip, https://github.com/atalman
2025-05-14 23:36:53 +00:00
0139ce9303 Add skip_dtype_check_in_meta_registrations config to torch/fx/experimental/_config (#153513)
Helion relies on torch/fx/experimental 's fake_tensor tracing but does its own dtype checking, which conflicts with some meta kernel's existing dtype checking. This PR adds a config so that we skip those dtype checking in meta kernels and rely on the calling system to do the dtype checking.

Currently it only applies to `baddbmm`, but I expect that similar changes will need to be done to other meta kernels in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153513
Approved by: https://github.com/jansel
2025-05-14 09:14:11 +00:00
9c99ea2991 error out on negative offs or on K=0 in group gemm (#153226)
Error out if K=0 in one of the grouped gemms to avoid hangs in #152668
Also, adds meta function for _scaled_grouped_mm (TODO: do the same for _grouped_mm, unless it's done already)

One weird thing I'm seeing, when running all grouped_gemm tests, I'm erroring out with
```
  File "/data/users/ngimel/pytorch/torch/_inductor/graph.py", line 1246, in call_function
    out = lowerings[target](*args, **kwargs)  # type: ignore[index]
  File "/data/users/ngimel/pytorch/torch/_inductor/lowering.py", line 445, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/data/users/ngimel/pytorch/torch/_inductor/kernel/mm_scaled_grouped.py", line 444, in tuned_scaled_grouped_mm
    if is_nonzero and can_use_triton_kernel(mat_a, mat_b, offs, bias):
  File "/data/users/ngimel/pytorch/torch/_inductor/kernel/mm_scaled_grouped.py", line 375, in can_use_triton_kernel
    offs is not None
  File "/home/ngimel/.conda/envs/pytorch_monarch/lib/python3.10/site-packages/sympy/core/relational.py", line 516, in __bool__
    raise TypeError("cannot determine truth value of Relational")
torch._inductor.exc.InductorError: LoweringException: TypeError: cannot determine truth value of Relational
```
which is weird, there's no relational that sympy has to evaluate in `offs is not None`, and when running this test separately (`test_scaled_grouped_gemm_2d_3d_fast_accum_True_strided_False_use_torch_compile_True_cuda`) it passes. I suspect some autotuning cache has to be reset between runs, but don't know what to look for.
Edit: that error is "fixed" by setting `dynamic=False`, now with correct meat function something's wrong with dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153226
Approved by: https://github.com/kwen2501
2025-05-10 01:13:18 +00:00
d9dc6b56ec Support using SymInt shapes for torch.baddbmm no-broadcast case (#153112)
A typical `bmm` kernel in Helion needs to pass in symint shapes to `torch.baddbmm`. Currently `self.expand((dim1, dim2, dim3))` in baddbmm runs unconditionally and it doesn't work with symint shapes (it raises the following error):
```
Traceback (most recent call last):
  File "/home/willfeng/local/helion_yf225/helion/_compiler/type_propagation.py", line 699, in propagate_call
    CheckForIndexCalls.retry_call(self.value, proxy_args, proxy_kwargs),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/helion_yf225/helion/_compiler/tile_index_proxy.py", line 104, in retry_call
    return fn(*proxy_args, **proxy_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/utils/_stats.py", line 27, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1338, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1986, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 1450, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_subclasses/fake_tensor.py", line 2645, in _dispatch_impl
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_ops.py", line 806, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_prims_common/wrappers.py", line 309, in _fn
    result = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/willfeng/local/pytorch/torch/_meta_registrations.py", line 2172, in meta_baddbmm
    self = self.expand((dim1, dim2, dim3))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: /home/willfeng/local/pytorch/build/aten/src/ATen/RegisterCompositeExplicitAutograd_0.cpp:5025: SymIntArrayRef expected to contain only concrete integers
```
This PR changes it so that we don't run `expand()` when not necessary, which makes the Helion use case (i.e. no broadcasting) work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153112
Approved by: https://github.com/jansel
2025-05-08 21:34:24 +00:00
84aa0985fb [Inductor] Add decomposeK as an autotuning choice for mm (#150654)
As a result of adding subgraph as a choice to inductor https://github.com/pytorch/pytorch/pull/149761 and enabling FP32 output from PyTorch GEMMs from FP16/BF16 inputs: https://github.com/pytorch/pytorch/pull/150812, this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`.

Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable
* Enable autotuning the bmm with Triton Templates as well without requiring tons of more compile time, async compilation. Anecdotal evidence shows that Triton BMM performs better usually than aten BMM
* Add for addmm
* Enable for Inference and AOTI

Below are the results of running TritonBench for Split-K shapes, comparing the aten performance versus pt2_triton, which now autotunes on decompose_k, seeing >10% speedup compared to aten on average, and for some shapes over 3x the performance of the best Triton mm previously:

<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />

TorchInductor Benchmark Dashboard:
<img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" />

We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over.

Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654
Approved by: https://github.com/eellison
2025-05-03 02:23:54 +00:00
7c3e679ddd Revert "[Inductor] Add decomposeK as an autotuning choice for mm (#150654)"
This reverts commit fdcfc6a61a2146c7c961073e029ead633113eb9a.

Reverted https://github.com/pytorch/pytorch/pull/150654 on behalf of https://github.com/wdvr due to Failing ROCM tests: inductor/test_subgraph_choice.py::TestSubgraphChoice::test_subgraph_decompose_k [GH job link](https://github.com/pytorch/pytorch/actions/runs/14786111108/job/41515742446) [HUD commit link](3c54e0c216) ([comment](https://github.com/pytorch/pytorch/pull/150654#issuecomment-2846470409))
2025-05-02 06:31:38 +00:00
fdcfc6a61a [Inductor] Add decomposeK as an autotuning choice for mm (#150654)
As a result of adding subgraph as a choice to inductor https://github.com/pytorch/pytorch/pull/149761 and enabling FP32 output from PyTorch GEMMs from FP16/BF16 inputs: https://github.com/pytorch/pytorch/pull/150812, this PR enables decompose_k as an autotuning choice for Inductor in generating the fastest matmuls with Triton. DecomposeK is currently only enabled for `torch.compile`.

Followups:
* decompose_k does not currently support epilogue fusion, which will take some work to enable
* Enable autotuning the bmm with Triton Templates as well without requiring tons of more compile time, async compilation. Anecdotal evidence shows that Triton BMM performs better usually than aten BMM
* Add for addmm
* Enable for Inference and AOTI

Below are the results of running TritonBench for Split-K shapes, comparing the aten performance versus pt2_triton, which now autotunes on decompose_k, seeing >10% speedup compared to aten on average, and for some shapes over 3x the performance of the best Triton mm previously:

<img width="929" alt="Screenshot 2025-04-28 at 9 15 39 PM" src="https://github.com/user-attachments/assets/27d85bbc-4f3a-43a6-a8fa-d4a5bbb8c999" />

TorchInductor Benchmark Dashboard:
<img width="1727" alt="Screenshot 2025-04-30 at 2 02 53 PM" src="https://github.com/user-attachments/assets/4acd7ffc-407f-4cfd-98bb-2e3d8b1f00b3" />

We see speedups across all runs for training. Compile time increased as expected, with more `mm` options to tune over.

Differential Revision: [D73820115](https://our.internmc.facebook.com/intern/diff/D73820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150654
Approved by: https://github.com/eellison
2025-05-01 23:01:30 +00:00
f0c9b3385d Support more dtypes for input, indices in gather (#151822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151822
Approved by: https://github.com/ngimel
2025-05-01 16:35:23 +00:00
701c0848b8 [dynamic shapes] aten.constant_pad_nd meta impl (#152129)
We know the output shape, and we know this always produces a clone. Avoids data-dependent errors from the decomposition.

along with https://github.com/pytorch/pytorch/pull/150483, should fix https://github.com/pytorch/pytorch/issues/123855
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152129
Approved by: https://github.com/laithsakka
2025-05-01 08:32:10 +00:00
632b89af43 [dynamic shapes] support SymInt inputs for kthvalue (#152151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152151
Approved by: https://github.com/tugsbayasgalan, https://github.com/malfet
2025-05-01 03:47:23 +00:00
2667cb69d9 [inductor] align replicationpad on processing bool dtype with eager (#147666)
Fixes #143779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147666
Approved by: https://github.com/jansel
2025-04-28 21:54:31 +00:00
0413358a77 Non-deterministic alert in histc_cuda for floating types only (#151701)
The note about atomic add only applies for floating point. The
implementation is deterministic for integer data types.

fixes: #151610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151701
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-04-24 21:16:46 +00:00
7dd2ed1197 [dtensor] add op support for torch._grouped_mm (#151072)
This PR would make TP work with Grouped MM in MoE implementations like https://github.com/pytorch/torchtitan/pull/1084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151072
Approved by: https://github.com/wanchaol, https://github.com/wwwjn
2025-04-12 07:07:44 +00:00
246f3b6530 [Quant][PT2E][X86] enable qconv1d-relu fusion (#150751)
**Summary**
As the title.
- The `conv1d - relu` pattern will be annotated by the `X86InductorQuantizer`.
- The pattern will be fused as `qconv_pointwise` during lowering.

**Test plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qconv1d_relu_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150751
Approved by: https://github.com/jerryzh168, https://github.com/leslie-fang-intel
2025-04-09 14:42:02 +00:00
b01877aa13 Fix addbmm & addmv & baddbmm out dtype check (#148176)
----

- torch.addbmm
- torch.addmv
- torch.baddbmm

ISSUE related:
https://github.com/pytorch/pytorch/issues/138399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148176
Approved by: https://github.com/jansel
ghstack dependencies: #148174
2025-04-09 07:02:56 +00:00
3e0038ae85 Fix torch.matmul related out dtype check (#148174)
----

- torch.matmul -> CompositeImplicitAutograd -> dot_out (when left_dim == 1 & right_dim == 1)
                                            -> mv_out (when left_dim == 2 & right_dim == 1)
                                            -> mm_out (when left_dim == 1 & right_dim == 2)
                                            -> ...
- torch.dot
- torch.vdot
- torch.mm
- torch.mv

ISSUE related:
https://github.com/pytorch/pytorch/issues/138399
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148174
Approved by: https://github.com/jansel
2025-04-08 17:00:28 +00:00
52d172eafd Facilitate at::_weight_int4pack_mm_with_scale_and_zeros related registration (#147962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147962
Approved by: https://github.com/jerryzh168, https://github.com/guangyey, https://github.com/EikanWang
ghstack dependencies: #137566

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
2025-04-08 15:36:07 +00:00
c974b5322a enable torch.compile for torch._scaled_mm nvfp4 recipe (#150462)
Summary:

Updates the meta registration for `torch._scaled_mm` to work for the
nvfp4 recipe.

Test Plan:

```bash
pytest test/test_matmul_cuda.py -s -k test_blockwise_nvfp4
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150462
Approved by: https://github.com/eellison
2025-04-02 01:08:40 +00:00
dad0854d48 meta registration for torch._scaled_mm with mxfp8 (#148461)
Summary:

Adds the meta registration logic for torch.compile to work with
`torch._scaled_mm` with mxfp8.  Thanks to @eellison  for the pointer to make inductor work with this.

Test Plan:

```
pytest test/test_matmul_cuda.py -k test_blockwise_mxfp8_compile -s
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148461
Approved by: https://github.com/drisspg, https://github.com/eellison
2025-03-27 02:32:40 +00:00
05f2cbfe19 Add meta function for out variants of ones,zeros,empty (#149098)
Open another PR to fix merge conflicts. Fixes https://github.com/pytorch/pytorch/issues/135832

For aten.ones, aten.zeros, followed this [link](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit?tab=t.0#heading=h.64r4npvq0w0) to register meta functions.

For aten.empty.out, followed this [part](https://docs.google.com/document/d/1GgvOe7C8_NVOMLOCwDaYV1mXXyHMXY7ExoewHqooxrs/edit?tab=t.0#heading=h.iy9lxhxhtl5v) to register a decomp for empty that handles the FakeTensor input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149098
Approved by: https://github.com/williamwen42
2025-03-14 22:17:30 +00:00
eqy
ec93aa7f84 fix cuDNN SDPA meta registration (#148921)
Update `cuDNN SDPA` meta registration to matching memory layout behavior in: https://github.com/pytorch/pytorch/pull/138354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148921
Approved by: https://github.com/drisspg, https://github.com/jbschlosser
2025-03-13 07:33:16 +00:00
9db9593bba Add some more meta kernels (#147862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147862
Approved by: https://github.com/zou3519
2025-03-05 18:33:00 +00:00
c21dc11a17 [Intel GPU] Enable SDPA on XPU (#147614)
Motivation
===

This PR is part of the plan of OneDNN Upstreaming, as #114848 [(comment)](https://github.com/pytorch/pytorch/issues/114848#issuecomment-2451553203) stated. The support of SDPA is via the overridable variance on XPU backend. Beside the added `Attention.cpp` file, `Graph.h` is added to hold utils for OneDNN graph including those for kernel/compile graph caching. In addition, a selection of testcases in `test/test_transformers.py` are copied into the new `test/xpu/test_transformers.py` and modified accordingly to provide additional tests beyond `./third_party/torch-xpu-ops/test/xpu/test_ops_xpu.py`.

Depends on OneDNN version v3.7 upgrade in #147498
Depends on BUILD_GRAPH switch in #147608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147614
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-03-04 01:40:45 +00:00
f4235310e8 [BE][Ez]: Remove redundant empty tensor copies in meta-reg (#147978)
Empty_likes includes a memory_format arg. Let's use it to avoid unnecessary copy operations. Noticed while reviewing: https://github.com/pytorch/pytorch/pull/147862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147978
Approved by: https://github.com/jansel
2025-02-27 23:16:44 +00:00
c73a92fbf5 [BE][CI] bump ruff to 0.9.2: multiline assert statements (#144546)
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements

> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
>     f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
2025-02-27 20:46:16 +00:00
3ecfe6be25 [Submodule] Turning flash-attention integration into 3rd party submod (#144120) (#146372)
Summary:

# Summary

### Sticky points

Cuda-graph rng handling has changed / deviated from original implementation. We will be left with a dangling 'offset' val and confusing naming due to BC

## Dependencies
- Flash PR: https://github.com/Dao-AILab/flash-attention/pull/1419

### Other Points
- The BC linter is complaining about losing generate.py and its functions which is not real BC surface
cc albanD

imported-using-ghimport

Test Plan:
Imported from OSS

Building in dev
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a  //caffe2:ATen-cu --show-full-output    `

I and Nming the .so I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```

Reviewed By: vkuzo

Differential Revision: D68502879

Pulled By: drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146372
Approved by: https://github.com/jbschlosser
2025-02-26 00:10:59 +00:00
dacdc9782b [Inductor] Add input value checking to randint meta function (#147191)
Fixes #147070

Adding value checking for the range to the meta function, similar to which in the CUDA/CPU aten op.

Test with
```
PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_tensor_creation_ops.py -k test_randint_inference
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147191
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2025-02-25 02:18:16 +00:00
8a5265cb37 [Intel GPU] qlinear_pointwise.binary[_tensor] XPU support (#135337)
# Motivation
This PR intends to enable quantized fusion `qlinear+add` at Intel GPU backend.

At backend level, we register the op via schema  `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary")` and `TORCH_SELECTIVE_NAME("onednn::qlinear_pointwise.binary_tensor")` which is the one already defined in `x86InductorQuantzer`

At Inductor level, we have small modification at `torch/_inductor/fx_passes/quantization.py` to allow signed int8 data type(s8) during op lowering. As for the pattern matching, we greatly reuse the code existing at x86InductorQuantizer.

# UT verification
```bash
python test/inductor/test_mkldnn_pattern_matcher.py -v \
    -k test_qlinear_add_xpu
```

# Runtime Verification
```bash
onednn_verbose,primitive,exec,gpu:0,matmul,jit:gemm:any,undef,src_s8::blocked:ab::f0 wei_s8::blocked:ab::f0 bia_f32::blocked:ab::f0_mask2 dst_f32::blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+dst:0:f32+wei:2:f32 attr-zero-points:src0:0:s32 attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu,,4x4:4x4,0.0319824
```
The verbose is collected from UT. We can see the attribute ` attr-post-ops:eltwise_linear:1:0.654408+sum:0.00511256+eltwise_relu`, the post add and ReLU is successfully fused on GEMM computation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135337
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/liangan1, https://github.com/jerryzh168
ghstack dependencies: #133307, #135189

Co-authored-by: guangyey <guangye.yu@intel.com>
2025-02-21 02:09:28 +00:00
83bb921a5a [ROCm] Update meta_registration for efficient attention (#146979)
Fixes a series of failing and skipped unit tests.

For nvidia hw, the longsumexp last dimension is required to be a multiple of 32.  This is not the case for rocm.

A related issue: https://github.com/pytorch/pytorch/issues/146848

The unit tests in question:
```bash
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_13_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_prev_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_11_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_17_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_1_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_1_freezing
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_2_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_3_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_4_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaDynamicTests	test_sdpa_rewriter_6_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_13_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_prev_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_11_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_14_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_15_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_17_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_1_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_1_freezing
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_2_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_3_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_4_cuda
inductor.test_fused_attention	SDPAPatternRewriterCudaTests	test_sdpa_rewriter_6_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146979
Approved by: https://github.com/shunting314
2025-02-20 15:05:13 +00:00