Commit Graph

269 Commits

Author SHA1 Message Date
fc0376e8b1 [BE][2/6] fix typos in test/ (test/test_*.py) (#157636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157636
Approved by: https://github.com/yewentao256, https://github.com/mlazos
ghstack dependencies: #156311, #156609
2025-07-09 11:02:23 +00:00
f56bfb3030 [CPU] Fix memory access for sbgemm bf16 (#156585)
Fixes #156022.

1. The original dtype conversion overwrites the whole `n_*ldc_` instead of `n_*m_` with stride `ldc_`, causing the potential memory issue.
2. Fix the None value issue in attention backward UT, as the sbgemm bf16 could be used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156585
Approved by: https://github.com/mingfeima, https://github.com/aditew01, https://github.com/ezyang
2025-07-08 02:36:28 +00:00
b221be9140 Fix typo: 'intial_query_grad' → 'initial_query_grad' in test_transformers.py (#157306)
This is a minor typo fix in `test/test_transformers.py`:

- Renamed `intial_query_grad` to `initial_query_grad` for improved clarity and correctness in test variable naming.

There are **no functional or logic changes** — this PR is aimed purely at improving readability and maintaining code quality.

Thanks to the PyTorch team for their work and review time
Please feel free to suggest if this needs any adjustment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157306
Approved by: https://github.com/Skylion007
2025-07-03 14:08:12 +00:00
d9577df312 [ROCm] Bump AOTriton to 0.10b (#156499)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156499
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2025-06-25 07:09:03 +00:00
04178d347c [Reland] [Intel GPU] Make SDPA output has the same stride as Query. (#154340)
Fixes [#153903](https://github.com/pytorch/pytorch/issues/153903).

Currently the output tensor of SDPA XPU is always defined as contiguous stride, while CPU/CUDA flash_attention and cudnn_attention allocate output tensor with stride the same as Query.

This PR aligns XPU's behavior with CUDA/CPU to make XPU compatible to CPU/CUDA's modeling code.

The function `alloc_with_matching_layout` is copied from cudnn 8c16d0e404/aten/src/ATen/native/cudnn/MHA.cpp (L874)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154340
Approved by: https://github.com/guangyey, https://github.com/drisspg
2025-06-24 06:09:59 +00:00
9e132b770e [CUDA] Skip test on low vram machines (#156548)
I noticed some jobs error out after merging #155397 due to the test requiring >15GB GPU memory to execute and some of the machines it's running on has 8GB GPUs. This PR adds the skip option on those machines.

CC: @eqy @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156548
Approved by: https://github.com/eqy, https://github.com/malfet
2025-06-21 22:32:57 +00:00
1cfdcb975a [CUDA] fix illegal memory access in attention (#155397)
Fixes https://github.com/pytorch/pytorch/issues/150054

CI seemed to be messed up in the old one, old PR:
https://github.com/pytorch/pytorch/pull/155145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155397
Approved by: https://github.com/ngimel
2025-06-21 12:32:00 +00:00
1036f6d114 Revert "[ROCm] Bump AOTriton to 0.10b (#156290)"
This reverts commit 34d8e64ef64d88324092a2028884c54c13e086b3.

Reverted https://github.com/pytorch/pytorch/pull/156290 on behalf of https://github.com/atalman due to failing multiple internal tests ([comment](https://github.com/pytorch/pytorch/pull/156290#issuecomment-2992072727))
2025-06-20 15:35:25 +00:00
34d8e64ef6 [ROCm] Bump AOTriton to 0.10b (#156290)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-06-19 21:13:58 +00:00
092aed1b18 [Intel GPU] Enable GQA and different head_dim of value for SDPA (#150992)
In OneDNN v3.7, SDPA doesn't support num_head_q != num_head_kv (aka GQA) and head_dim_qk != head_dim_v.
In OneDNN v3.8, SDPA supports these two scenarios. Enable them in this PR.   SDPA UTs pass in local test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150992
Approved by: https://github.com/guangyey, https://github.com/drisspg, https://github.com/EikanWang

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-17 11:09:51 +00:00
65b9c13cce [Intel GPU] Enable safe softmax for XPU SDPA (#151999)
Fix https://github.com/intel/torch-xpu-ops/issues/1432#event-16899653975

When one row of Q*K attention score is masked with `-inf`, `softmax(score)` would output `NaN` for whole row which would cause model corruption.

With this new flag, it would output `0` for whole row which is aligned with Pytorch CPU/CUDA's behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151999
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-13 08:53:47 +00:00
d3c8f36ba0 Revert "[Intel GPU] Make SDPA output has the same stride as Query. (#154340)"
This reverts commit 0f10df71a66cb1b0c3659381b7db8e06d95f0d67.

Reverted https://github.com/pytorch/pytorch/pull/154340 on behalf of https://github.com/etaf due to This PR breaks hugging face E2E run on XPU. ([comment](https://github.com/pytorch/pytorch/pull/154340#issuecomment-2942954192))
2025-06-05 06:46:24 +00:00
0f10df71a6 [Intel GPU] Make SDPA output has the same stride as Query. (#154340)
Fixes [#153903](https://github.com/pytorch/pytorch/issues/153903).

Currently the output tensor of SDPA XPU is always defined as contiguous stride, while CPU/CUDA flash_attention and cudnn_attention allocate output tensor with stride the same as Query.

This PR aligns XPU's behavior with CUDA/CPU to make XPU compatible to CPU/CUDA's modeling code.

The function `alloc_with_matching_layout` is copied from cudnn 8c16d0e404/aten/src/ATen/native/cudnn/MHA.cpp (L874)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154340
Approved by: https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/guangyey
2025-06-04 07:16:56 +00:00
7b074346e0 [Intel GPU] Support f32 intermediate dtype, headdim size <=576 and f32 causal mask for SDPA (#152091)
In OneDNN v3.7, SDPA has below defects:

1. The dtype of intermediate value is the same as QKV, while Pytorch uses FP32 dtype for intermediate value to make sure better accuracy.
2. Only support headdim size <= 256.
3. Don't support implict causal mask when QKV is FP32. We need to build an attention mask explicitly with aten ops.

In OneDNN v3.8, they have update for these defects. Since these are tiny changes, I decided to put them in single PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152091
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/drisspg
2025-06-04 05:18:36 +00:00
d6edefefbf [CUDA] Fixes for backwards in memefficient attn for large tensors (#154663)
followup to #154029.

@ngimel Backwards had the same problem as well so this PR fixes it and adds support for logsumexp computation in the forward pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154663
Approved by: https://github.com/ngimel
2025-05-30 19:30:07 +00:00
e313152a33 SDPA fix memory efficient attention for large batch dim (#154029)
Fixes #146704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154029
Approved by: https://github.com/ngimel
2025-05-28 16:53:53 +00:00
f363a3f51a Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)"
This reverts commit 9386701b51aadce951bf38daf497b0257a3f2211.

Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see [D74729259](https://www.internalfb.com/diff/D74729259). @drisspg may you help out the author have their PR merged? ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-2881546951))
2025-05-14 20:53:49 +00:00
de92296bbb [Intel GPU] undo broadcast on zero stride tensor for SDPA (#151976)
Fix https://github.com/pytorch/pytorch/issues/152290.

The model **hubert** uses aten::expand to build attention mask by broadcasting. Pytorch uses strides[d]=0 to represent broadcast, which is not supported by oneDNN.  This PR handles this scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151976
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg
2025-05-14 16:09:03 +00:00
eqy
9386701b51 [cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 (#149282)
cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg
2025-05-14 01:39:24 +00:00
0488883d6e [cuDNN][SDPA] Fix head-dim 256 condition for SM 10.0 (#152076)
turns out the backward is not supported yet, whoops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152076
Approved by: https://github.com/drisspg
2025-05-02 18:43:33 +00:00
eqy
1bb13a16bb [CUDA][SDPA] Bump python fused_attention_vs_math_ref_grads fudge_factor for sm120 (#152491)
🍦

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152491
Approved by: https://github.com/Skylion007
2025-04-30 19:22:21 +00:00
05933e08ca [ATen][CUDA][SDPA] Enable SDPA on sm_121 (#152314)
This PR adds support for `sm_121` of the DGX Spark. The `sm_121` is binary compatible with `sm_120` (just like `sm_89` and `sm_86`), therefore a compilation targeting `sm_121` is not required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152314
Approved by: https://github.com/eqy
2025-04-30 18:04:50 +00:00
8adfcd35c3 [cuDNN][SDPA] Loosen constraints for GQA for cuDNN Attention (#150337)
cuDNN attention doesn't require key and value tensors to have the same number of heads

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150337
Approved by: https://github.com/drisspg
2025-04-06 20:31:11 +00:00
a8dd9b6c27 [cuDNN][SDPA] abide by enable_gqa convention in cuDNN (#149976)
long overdue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149976
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-03-29 04:24:51 +00:00
eqy
de73790fe6 [cuDNN][SDPA] cuDNN SDPA supports head_dim <= 256 on sm90 and sm100 as of 9.5.1+ (#149904)
gqa check PR will go next...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149904
Approved by: https://github.com/drisspg
2025-03-25 23:10:16 +00:00
0dcd482e54 [SDPA] Respect sdpa_kernel's priority_order setting in torch.compile (#147768)
[https://github.com/pytorch/pytorch/pull/140467](https://github.com/pytorch/pytorch/pull/140467) added the option to specify a priority order for SDPA but the `torch.compile` path silently ignored this setting as I wasn't aware of the separate context manager handling on `torch.compile`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147768
Approved by: https://github.com/drisspg
2025-03-13 18:52:34 +00:00
67742128b7 [ROCm] Bump AOTriton to 0.9.2b (#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore.
* `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead.
* Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
2025-03-07 22:10:07 +00:00
96176e32a9 Revert "[ROCm] Bump AOTriton to 0.9.1b (#148433)"
This reverts commit 8af79b7ec816f5c73536a806aa4c7ea1f7bd3867.

Reverted https://github.com/pytorch/pytorch/pull/148433 on behalf of https://github.com/jovianjaison due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/148433#issuecomment-2704638858))
2025-03-06 18:32:48 +00:00
8af79b7ec8 [ROCm] Bump AOTriton to 0.9.1b (#148433)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.9b:

* Optimize these Non-power-of-two head dimensions: 48, 80, 96, 160, 192, 224. Inputs with these head dimensions do not need padding to power-of-two anymore.
* `is_causal=True` cases are now supported with persistent dynamic algorithm, which requires an atomic tensor but does load balance between different CTAs
* `dropout_p > 0.0` cases now support full 64-bit offsets and use all i64x4 PRNG outputs
* The precise AOTriton shared library version can now be identified with `readelf -p .comment libaotriton_v2.so`
  + However, this does not guarantee the GPU images stored under `aotriton.images` have the same version, since they can be overwritten.
* The newly added fused backward kernel will be used for smaller workloads, due to less kernel invocation overhead.
* Support gfx1201 (RX 9070XT). Need to be enabled at runtime with `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148433
Approved by: https://github.com/jeffdaily
2025-03-05 19:11:57 +00:00
93e9daed54 [cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)
Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1`

Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178
Approved by: https://github.com/jbschlosser
2025-03-04 23:09:09 +00:00
c21dc11a17 [Intel GPU] Enable SDPA on XPU (#147614)
Motivation
===

This PR is part of the plan of OneDNN Upstreaming, as #114848 [(comment)](https://github.com/pytorch/pytorch/issues/114848#issuecomment-2451553203) stated. The support of SDPA is via the overridable variance on XPU backend. Beside the added `Attention.cpp` file, `Graph.h` is added to hold utils for OneDNN graph including those for kernel/compile graph caching. In addition, a selection of testcases in `test/test_transformers.py` are copied into the new `test/xpu/test_transformers.py` and modified accordingly to provide additional tests beyond `./third_party/torch-xpu-ops/test/xpu/test_ops_xpu.py`.

Depends on OneDNN version v3.7 upgrade in #147498
Depends on BUILD_GRAPH switch in #147608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147614
Approved by: https://github.com/jansel, https://github.com/EikanWang
2025-03-04 01:40:45 +00:00
7ffae2c028 Split test_transformers.py (#147441)
Split test_transformers.py into test_transformers.py and test_transformers_privateuser1.py. Currently the privateuse1 test cases in test_transformers.py are skipped since they conflict with cuda test cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147441
Approved by: https://github.com/drisspg
2025-02-26 11:54:24 +00:00
fa8e3a28a7 Revert "[cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)"
This reverts commit 533b884870acd951e684e0bf551eb76904dec047.

Reverted https://github.com/pytorch/pytorch/pull/141178 on behalf of https://github.com/jeanschmidt due to Broke internal arvr signals, see D69971019. @jbschlosser please help the author get this PR merged ([comment](https://github.com/pytorch/pytorch/pull/141178#issuecomment-2676317470))
2025-02-22 17:28:12 +00:00
533b884870 [cuDNN][SDPA][Nested Tensor] Experimental cuDNN Nested Tensor SDPA Support (forward only) (#141178)
Disabled by default for now behind `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1`

Just wanted to get this out before starting a series of SDPA cleanup PRs---the biggest thing is we don't need the boilerplate around all of the `build_graph_and_tensors*` functions anymore as we can now use the `UID`-style referencing of tensor nodes as was done for the Conv-V8 API backend.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141178
Approved by: https://github.com/jbschlosser
2025-02-21 05:22:19 +00:00
46e83bb637 Fix linter F821 error (#146665)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146665
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-02-08 07:19:37 +00:00
70577d335e [ATen][CUDA][Transformers] Add Blackwell support to SDPA (#145602)
This PR adds sm_100 and sm_120 archs to support SDPA (Flash Attention and Memory Efficient Attention) on Blackwell machines.

Special thanks to @Fuzzkatt for co-authoring these changes!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145602
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/eqy, https://github.com/malfet

Co-authored-by: Patrick Wang <22803332+Fuzzkatt@users.noreply.github.com>
2025-01-24 22:27:39 +00:00
cbb1ed2966 [1/N] OpenReg: Replace open_registration_extension.cpp with openreg (#141815)
As described in OpenReg [next-steps](https://github.com/pytorch/pytorch/blob/main/test/cpp_extensions/open_registration_extension/README.md#next-steps), here we replace the current `open_registration_extension.cpp` test in PyTorch CI with openreg.

The current `open_registration_extension.cpp` contains two parts:
1. Implentations to support `PrivateUse1` backend.
2. Helper functions used for UTs in `test_cpp_extensions_open_device_registration.py` and `test_transformers.py`.

For the first part, we'll replace it with openreg. For the second part, we'll migrate them to ut files step by step.

@albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141815
Approved by: https://github.com/albanD
2025-01-14 15:59:00 +00:00
cyy
df458be4e5 [4/N] Apply py39 ruff and pyupgrade fixes (#143257)
```torch/fx/passes/annotate_getitem_nodes.py``` was changed to support the new type hinting annotations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143257
Approved by: https://github.com/justinchuby, https://github.com/albanD
2025-01-04 10:47:51 +00:00
0a94bb432e [ROCm] CK Flash Attention Backend (#143695)
Replace https://github.com/pytorch/pytorch/pull/138947 for re-import.

Replaces https://github.com/ROCm/pytorch/pull/1592

This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling torch.backends.cuda.preferred_rocm_fa_library("ck"). Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton to be used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and is selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143695
Approved by: https://github.com/malfet

Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-01-03 22:01:36 +00:00
f1cbf4b1b5 Enable ruff's unused variable checking everywhere in pytorch (#136965)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136965
Approved by: https://github.com/cyyever, https://github.com/albanD
2024-12-22 02:33:11 +00:00
993b2f0ee0 Fix unused variables in test/test_transformers.py (#143407)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143407
Approved by: https://github.com/drisspg
2024-12-18 09:59:24 +00:00
969b07b96f Revert "[ROCm] CK Flash Attention Backend (#138947)"
This reverts commit 500d02921bcf1619e268196866ddf099a4b94080.

Reverted https://github.com/pytorch/pytorch/pull/138947 on behalf of https://github.com/atalman due to Breaks default windows checkout ([comment](https://github.com/pytorch/pytorch/pull/138947#issuecomment-2548998359))
2024-12-17 16:46:57 +00:00
500d02921b [ROCm] CK Flash Attention Backend (#138947)
Replaces https://github.com/ROCm/pytorch/pull/1592

This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling `torch.backends.cuda.preferred_rocm_fa_library("ck")`. Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton to be used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via `USE_FLASH_ATTENTION`) and is selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138947
Approved by: https://github.com/pruthvistony, https://github.com/xw285cornell, https://github.com/leitian

Co-authored-by: Xiaodong Wang <xw285@cornell.edu>
2024-12-17 02:18:07 +00:00
424156c26c [ROCm] Update to AOTriton 0.8b (#140172)
Notable new features for SDPA operators on AMD systems from AOTriton 0.8b:

1. Nestedtensor support;
2. MQA/GQA support;
3. Restore Efficient attention support for causal=True and seqlen_q != seqlen_k cases;
    + The kernel should use top-left alignment, bottom right alignment will be added later
4. Move gfx1100 (RX7900/W7800/W7900) out of experimental support status.
   However, users are strongly recommended to update to ROCM 6.2.4, notably for
   its firmware updates.

Related unit tests are enabled as well.

Notable related changes from AOTriton 0.8b:

1. AOTriton 0.8b moves the GPU kernel out of libaotriton.so to a separate directory `aotriton.images`;
2. LZMA replaces ZSTD as GPU kernel compression algorithm for better compression ratio: aotriton0.8b (.so + aotriton.images take 350MB) compared to aotriton0.7b .so: 800MB
3. The compression cannot be disabled now, and `liblzma` is hard run-time dependency.
    + Should not be a problem, since `lzma` is part of Python Standard Library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140172
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-12-06 21:45:18 +00:00
eqy
8fc6d3a5d8 [SDPA] Allow user-specified priority order with context manager (#140467)
TODO: docs changes?
For better debuggability of issues like https://github.com/pytorch/pytorch/issues/139298

Better testing, current sketch:

``` Python
import torch
from torch.nn.functional import scaled_dot_product_attention
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(64, 1024, 8, 64, dtype=torch.half, device='cuda')
print(torch._C._get_sdp_priority_order())

orders = [[SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH, SDPBackend.EFFICIENT_ATTENTION],
          [SDPBackend.MATH, SDPBackend.CUDNN_ATTENTION, SDPBackend.EFFICIENT_ATTENTION],
          [SDPBackend.EFFICIENT_ATTENTION, SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH]]
import time
times = list()
for order in orders:
    print(order)
    with sdpa_kernel(order, set_priority=True):
        scaled_dot_product_attention(q, q, q)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    with sdpa_kernel(order, set_priority=True):
        scaled_dot_product_attention(q, q, q)
    torch.cuda.synchronize()
    t1 = time.perf_counter()
    times.append(t1 - t0)
print(times)
assert times[0] < times[1]
assert times[0] > times[2]
assert times[1] > times[2]
print(torch._C._get_sdp_priority_order())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140467
Approved by: https://github.com/drisspg
2024-12-06 07:56:35 +00:00
02b52572db Lint: switch oncall owner for test_transformers (#141722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141722
Approved by: https://github.com/malfet
2024-11-27 21:45:43 +00:00
7671dd436e [SDPA-CPU] Fix Edge case w/ fused flash cpu kernel (#141519)
Fixes https://github.com/pytorch/pytorch/issues/141128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141519
Approved by: https://github.com/jgong5, https://github.com/jbschlosser
2024-11-26 22:07:56 +00:00
ad37afd590 Revert "Always unspecialize float in OSS (#138922)"
This reverts commit ba5253da9b30ed4d998cee1d865f92b2c27d3086.

Reverted https://github.com/pytorch/pytorch/pull/138922 on behalf of https://github.com/yf225 due to perf regression on torchbench ([comment](https://github.com/pytorch/pytorch/pull/138922#issuecomment-2499277511))
2024-11-26 00:03:03 +00:00
ba5253da9b Always unspecialize float in OSS (#138922)
Fixes https://github.com/pytorch/pytorch/issues/107277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138922
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-11-24 01:58:13 +00:00
a8c90e5140 Revert "Always unspecialize float in OSS (#138922)"
This reverts commit 6d779d05492813da1c19ac0c562d0d5f8473f27e.

Reverted https://github.com/pytorch/pytorch/pull/138922 on behalf of https://github.com/huydhn due to Sorry for reverting your change but there is some slow tests failing after this land ([comment](https://github.com/pytorch/pytorch/pull/138922#issuecomment-2495076878))
2024-11-22 23:18:36 +00:00