**Summary**
Today, the only way to have variable sequence length support in PyTorch attention is through nested tensors [here](https://docs.pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html#nestedtensor-and-dense-tensor-support). We also want to add an explicit lower-level API that provides variable sequence length support without padding/masking in SDPA.
This PR builds out `varlen_attn`, the public API that users can call for the forward method, and `_varlen_attn`, the private API that calls into the Flash Attention/cuDNN backend.
**Benchmarking**
To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.
Settings:
- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs
| | Variable Length API | SDPA |
|--------|--------------------|----------|
| Runtime | 0.21750560760498047 ms | 0.43171775817871094 ms |
| TFLOPs | 231.812 | 320.840 |
The sparsity is 0.453 which we can see matches the speedup we get from Varlen (approx 50%). TFLOPs remains around the same, with SDPA slightly larger due to potential higher overhead and total flops scaling with sequence length.
**Testing**
Run `python test/test_varlen_attention.py` for unit tests where we verify basic functionality and confirm numerical match between varlen outputs vs SDPA.
**Next steps**
Next steps from this PR (higher in the stack) include registering the private API `_varlen_attn` as a custom op, implementing backward support, and enabling cuDNN with correct numerics.
(This stack builds on top of #162326)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164502
Approved by: https://github.com/v0i0, https://github.com/drisspg
**Summary**
Today, the only way to have variable sequence length support in PyTorch attention is through nested tensors [here](https://docs.pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html#nestedtensor-and-dense-tensor-support). We also want to add an explicit lower-level API that provides variable sequence length support without padding/masking in SDPA.
This PR builds out `varlen_attn`, the public API that users can call for the forward method, and `_varlen_attn`, the private API that calls into the Flash Attention/cuDNN backend.
**Benchmarking**
To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.
Settings:
- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs
| | Variable Length API | SDPA |
|--------|--------------------|----------|
| Runtime | 0.21750560760498047 ms | 0.43171775817871094 ms |
| TFLOPs | 231.812 | 320.840 |
The sparsity is 0.453 which we can see matches the speedup we get from Varlen (approx 50%). TFLOPs remains around the same, with SDPA slightly larger due to potential higher overhead and total flops scaling with sequence length.
**Testing**
Run `python test/test_varlen_attention.py` for unit tests where we verify basic functionality and confirm numerical match between varlen outputs vs SDPA.
**Next steps**
Next steps from this PR (higher in the stack) include registering the private API `_varlen_attn` as a custom op, implementing backward support, and enabling cuDNN with correct numerics.
(This stack builds on top of #162326)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164502
Approved by: https://github.com/v0i0, https://github.com/drisspg
Currentlly SPDA XPU use own `priority_order` instead of the one from global context. Hence it does not support `with sdpa_kernel(order, set_priority=True)` with set_priority=True.
This PR enables this feature. To make default `priority_order` from global context works for XPU, I also move MATH backend to lowest priority, otherwise `cudnn attention` and `overrideable attention` will never be selected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159464
Approved by: https://github.com/guangyey, https://github.com/drisspg
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: mayuyuace <qiming1.zhang@intel.com>
Introduces support for a new `OVERRIDEABLE` backend in the SDPA module, improves backend selection logic, and adds corresponding tests. In addition, a fallback mechanism was added when a specific backend is unavailable, enhancing user configurability.
### Backend Support and Selection Enhancements:
* Added `at::SDPBackend::overrideable` to the list of available SDPA backends in the `Context` class (`aten/src/ATen/Context.h`).
* Updated the backend selection logic in `select_sdp_backend_xpu` to include the `OVERRIDEABLE` backend and added a fallback mechanism for unsupported `FLASH_ATTENTION` on XPU.
* Adjusted error messaging in `_fused_sdp_choice_xpu` to reflect the inclusion of the `OVERRIDEABLE` backend. (`aten/src/ATen/native/mkldnn/xpu/Attention.cpp`)
### Test Additions for Backend Fallback and Selection:
* Added new unit tests to validate fallback behavior for `FLASH_ATTENTION` to `OVERRIDEABLE` and to verify correct backend selection when `MATH` is enabled. (`test/test_transformers.py`,)
### Codebase Updates for Backend Integration:
* Introduced `OVERRIDEABLE` as a new member of the `_SDPBackend` enum. (`torch/_C/__init__.pyi.in`)
* Extended `_backend_names` and updated related methods to handle the `OVERRIDEABLE` backend. (`torch/nn/attention/__init__.py`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156669
Approved by: https://github.com/guangyey, https://github.com/drisspg
# Summary
Updates the SDPA docs to fix some small inaccuracies and points to the new sdpa_kernel context manger. The Enum like type binded from cpp SDPBackend does not render its fields for some reason. Manually list them instead for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121180
Approved by: https://github.com/mikaylagawarecki
# Summary
Simplification of Backend Selection
This PR deprecates the `torch.backends/cuda/sdp_kernel` context manager and replaces it with a new context manager `torch.nn.attention.sdpa_kernel`. This context manager also changes the api for this context manager.
For `sdp_kernel` one would specify the backend choice by taking the negation of what kernel they would like to run. The purpose of this backend manager was to only to be a debugging tool, "turn off the math backend" and see if you can run one of the fused implementations.
Problems:
- This pattern makes sense if majority of users don't care to know anything about the backends that can be run. However, if users are seeking to use this context manager then they are explicitly trying to run a specific backend.
- This is not scalable. We are working on adding the cudnn backend and this API makes it so so that more implementations will need to be turned off if user wants to explicitly run a given backend.
- Discoverability of the current context manager. It is somewhat un-intutive that this backend manager is in backends/cuda/init when this now also controls the CPU fused kernel behavior. I think centralizing to attention namespace will be helpful.
Other concerns:
- Typically backends (kernels) for operators are entirely hidden from users and implementation details of the framework. We have exposed this to users already, albeit not by default and with beta warnings. Does making backends choices even more explicit lead to problems when we potentially want to remove existing backends, (perhaps inputs shapes will get covered by newer backends).
A nice side effect is now that we aren't using the `BACKEND_MAP` in test_transformers many, many dynamo failures are passing for CPU tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114689
Approved by: https://github.com/cpuhrsch
# Summary
This PR introduces a new Tensor subclass that is designed to be used with torch.nn.functional.scaled_dot_product_attention. Currently we have a boolean `is_causal` flag that allows users to do do causal masking without the need to actually create the "realized" attention bias and pass into sdpa. We originally added this flag since there is native support in both fused kernels we support. This provides a big performance gain ( the kernels only need to iterate over ~0.5x the sequence, and for very large sequence lengths this can provide vary large memory improvements.
The flag was introduced when the early on in the kernel development and at the time it was implicitly meant to "upper_left" causal attention. This distinction only matters when the attention_bias is not square. For a more detailed break down see: https://github.com/pytorch/pytorch/issues/108108. The kernels default behavior has since changed, largely due to the rise of autogressive text generation. And unfortunately this would lead to a BC break. In the long term it may actually be beneficial to change the default meaning of `is_causal` to represent lower_right causal masking.
The larger theme though is laid here: https://github.com/pytorch/pytorch/issues/110681. The thesis being that there is alot of innovation in SDPA revolving around the attention_bias being used. This is the first in hopefully a few more attention_biases that we would like to add. The next interesting one would be `sliding_window` which is used by the popular mistral model family.
Results from benchmarking, I improved the meff_attention perf hence the slightly decreased max perf.
```Shell
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
| Type | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | embed_dim | dtype | head_dim |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
| Average | 1.2388050062214226 | | | | | | | |
| Max | 1.831672915579016 | 128 | 32 | 1024 | 2048 | 2048 | torch.bfloat16 | 64 |
| Min | 0.9430534166730135 | 1 | 16 | 256 | 416 | 2048 | torch.bfloat16 | 128 |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114823
Approved by: https://github.com/cpuhrsch