This PR only adds running the benchmarks on the current PR and printing the results; follow-up diffs will add checking out head~1, running the benchmarks there, and comparing the two.
To access the results, go to the test pr_time_benchmarks and inspect the logs.
You should see:
```
+ echo 'benchmark results on current PR: '
benchmark results on current PR:
+ cat /var/lib/jenkins/workspace/test/test-reports/pr_time_benchmarks_before.txt
update_hint_regression,instruction_count,27971461254
```
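For illustration only, the follow-up comparison could parse two such result files and report the relative delta per benchmark. The helper below is a hypothetical sketch based on the `name,metric,value` format shown above; it is not code from this PR.
```python
def read_results(path: str) -> dict:
    # Parse lines of the form "benchmark_name,metric,value" into a dict.
    results = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            name, metric, value = line.split(",")
            results[(name, metric)] = float(value)
    return results

def compare(pr_path: str, base_path: str, threshold: float = 0.02) -> None:
    # Report the relative change of each benchmark shared by both files.
    pr, base = read_results(pr_path), read_results(base_path)
    for key in sorted(pr.keys() & base.keys()):
        delta = (pr[key] - base[key]) / base[key]
        flag = "REGRESSION" if delta > threshold else "ok"
        print(f"{key}: {delta:+.2%} ({flag})")
```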
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131475
Approved by: https://github.com/ezyang
## tl;dr
This PR adds GQA support to higher order op `flex_attention`.
## Details
When `enable_gqa` is set to True, the HOP `flex_attention(score_mod, query, key, value, block_mask, enable_gqa)` runs Grouped Query Attention (GQA), where the number of query heads (Hq) is a multiple of the number of key/value heads (Hkv). Each group of query heads (`Hq//Hkv` heads) attends to a shared kv head.
Otherwise, `flex_attention` assumes Multi Head Attention (MHA), where the number of query heads is equal to the number of kv heads.
The `score_mod` and `mask_mod` APIs are adapted accordingly to take `q_head` as the head index.
```
def score_mod(score: torch.Tensor, batch: torch.Tensor, q_head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor
def mask_mod(batch: torch.Tensor, q_head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor
```
## Example
```python
import torch
from torch.nn.attention.flex_attention import flex_attention
from torch.nn.attention.flex_attention import create_block_mask
torch.manual_seed(0)
def query_key_value_clones(
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
dtype: torch.dtype = None,
):
"""Clones the query, key, and value tensors and moves them to the specified dtype."""
if dtype is None:
dtype = query.dtype
query_ref = query.clone().detach().to(dtype).requires_grad_(query.requires_grad)
key_ref = key.clone().detach().to(dtype).requires_grad_(key.requires_grad)
value_ref = value.clone().detach().to(dtype).requires_grad_(value.requires_grad)
return query_ref, key_ref, value_ref
# Let's create some input tensors.
# The input tensor has shape (batch_size, num_heads, seq_len, head_dim).
# query and key/value can have different num_heads and seq_len
# Here 8 query heads share one KV head.
query = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
key = torch.randn(2, 2, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
value = torch.randn(2, 2, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
query1, key1, value1 = query_key_value_clones(query, key, value)
# Let's create a score modification. We take an alibi bias as an example.
# score_mod takes batch index, query head index, query index, and key/value index.
def _generate_alibi_bias(num_kv_heads: int, num_q_heads: int):
def _alibi_bias(
score: torch.Tensor,
b: torch.Tensor,
hq: torch.Tensor,
token_q: torch.Tensor,
token_kv: torch.Tensor,
) -> torch.Tensor:
# Let's calculate kv head from query head index
group = num_q_heads // num_kv_heads
hkv = hq // group
scale = torch.exp2(-((hkv + 1) * 8.0 / num_kv_heads))
return score + (token_kv - token_q) * scale
return _alibi_bias
# Let's apply a causal mask on top of it.
def causal_mask(b, h, q, kv):
return q >= kv
# Generate a block mask for our new mask_mod function.
# The mask is broadcast along the head & batch dimensions.
block_mask = create_block_mask(causal_mask, B=1, H=1, Q_LEN=2048, KV_LEN=2048)
# Let's call flex_attention with our new score modification and block mask in eager mode.
output = flex_attention(query, key, value, score_mod=_generate_alibi_bias(2, 8), block_mask=block_mask, enable_gqa=True)
# Now let's compile flex_attention and run the compiled kernel.
compiled_flex_attention = torch.compile(flex_attention)
out_compiled = compiled_flex_attention(query1, key1, value1, score_mod=_generate_alibi_bias(2, 8), block_mask=block_mask, enable_gqa=True)
torch.testing.assert_close(output, out_compiled, atol=5e-2, rtol=2e-2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131559
Approved by: https://github.com/drisspg
Move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking utilities.
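As a rough sketch of the intended call pattern (the object and method names below are assumptions; check the module for the exact API), callers go through the shared benchmarking entry point rather than calling Triton's `do_bench` directly:
```python
# Old pattern: reach into Triton's benchmarking utilities directly.
# from triton.testing import do_bench
# ms = do_bench(lambda: kernel(*args))

# New pattern (sketch; exact names may differ): route through the shared
# benchmarking module so all of Inductor uses a single implementation.
from torch._inductor.runtime.benchmarking import benchmarker

def time_gpu_kernel(kernel, *args):
    # Assumed to return a latency in milliseconds, like Triton's do_bench.
    return benchmarker.benchmark_gpu(lambda: kernel(*args))
```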
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827
Approved by: https://github.com/eellison
Summary:
By default, performance tests (speedup experiments) run the baseline and the test backend alternately.
However, this does not work for the torchao backend, which changes the model in place: the baseline run would then also use the torchao backend because the model has already been quantized.
Add a new experiment, "latency_experiment", that runs performance tests non-alternately (first run the baseline for a few iterations, then run the test backend), as sketched below.
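A minimal sketch of the difference (hypothetical helper names, not the actual benchmark runner code): the latency experiment times all baseline iterations before the in-place backend transformation is applied.
```python
import time

def median_latency(fn, iters=10):
    # Hypothetical timing helper: median wall-clock latency over `iters` runs.
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

def latency_experiment(model, example_inputs, apply_backend, iters=10):
    # Time the baseline first, while the model is still unmodified...
    baseline = median_latency(lambda: model(*example_inputs), iters)
    # ...then apply the in-place torchao transformation and time the test backend.
    apply_backend(model)
    test = median_latency(lambda: model(*example_inputs), iters)
    return baseline / test  # speedup of the test backend over the baseline
```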
Other changes:
* Add `torch.compiler.cudagraph_mark_step_begin()` to avoid the slowdown from "Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards" (see the usage sketch below).
* Update the torchao APIs to the current versions.
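For reference, the call goes at the top of each measured iteration; a minimal usage sketch (illustrative loop, not the runner's actual code):
```python
import torch

def run_measured_iterations(compiled_model, example_inputs, iters=10):
    for _ in range(iters):
        # Mark the beginning of a new iteration so CUDA graphs does not fall
        # off the fast path due to pending, uninvoked backwards.
        torch.compiler.cudagraph_mark_step_begin()
        out = compiled_model(*example_inputs)
    return out
```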
X-link: https://github.com/pytorch/benchmark/pull/2394
Test Plan:
python run_benchmark.py torchao --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only BartForCausalLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only timm_efficientnet --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
(should all be ~1.0)
0.997x
1.006x
0.994x
Reviewed By: xuzhao9
Differential Revision: D60252821
Pulled By: HDCharles
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131935
Approved by: https://github.com/xuzhao9
Summary: CPU CI nodes failed to find a valid VecISA because importing torch under the default pytorch directory fails with the following message, so we switch the cwd to a temporary directory.
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/var/lib/jenkins/workspace/torch/__init__.py", line 66, in <module>
from torch.torch_version import __version__ as __version__
File "/var/lib/jenkins/workspace/torch/torch_version.py", line 4, in <module>
from torch.version import __version__ as internal_version
ModuleNotFoundError: No module named 'torch.version'
```
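A simplified sketch of the workaround (names are illustrative, not the exact Inductor code): run the `import torch` probe from a temporary working directory so the source tree's `torch/` package does not shadow the installed one.
```python
import subprocess
import sys
import tempfile

def probe_isa(test_code: str) -> bool:
    # Run the probe from a temporary cwd so `import torch` does not pick up the
    # local ./torch source directory (which lacks the generated torch.version).
    with tempfile.TemporaryDirectory() as tmpdir:
        result = subprocess.run(
            [sys.executable, "-c", test_code],
            cwd=tmpdir,
            capture_output=True,
        )
    return result.returncode == 0
```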
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131812
Approved by: https://github.com/eellison, https://github.com/malfet
Summary:
Remove operator_benchmark caffe2 build due to the removal of caffe2: 2fd75667b4
Additionally, we delete the TARGETS files of broken benchmarks that we do not intend to maintain.
Test Plan: Sandcastle CI
Differential Revision: D60086216
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131460
Approved by: https://github.com/vmpuri
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2388
We can enable accuracy checks at Diff time since accuracy is not a performance metric.
* Refactor the existing diff time test to use the new PT2 Benchmark Runner.
* Deprecate the speedup tests and enable the accuracy tests only. We rely on ServiceLab to perform performance testing and regression detection.
Test Plan:
Sandcastle CI
Or buck test command:
```
buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- test_training_resnet50_accuracy
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1688850102375429
Reviewed By: oulgen
Differential Revision: D59825601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131266
Approved by: https://github.com/oulgen
Summary:
By default, performance tests (speedup experiments) run the baseline and the test backend alternately.
However, this does not work for the torchao backend, which changes the model in place: the baseline run would then also use the torchao backend because the model has already been quantized.
Add a new experiment, "latency_experiment", to run performance tests non-alternately (first run the baseline for a few iterations, then run the test backend).
Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16
```
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
```
Differential Revision: D59332736
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130136
Approved by: https://github.com/jerryzh168
Extend constant folding to dynamic-shape nodes; only pointwise ops and some other restricted ops are supported.
We support dynamic shapes by limiting constant folding to ops that are guaranteed to produce uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the memory overhead of constant folding.
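To illustrate the idea (a conceptual sketch, not the Inductor pass itself): because every op in the folded chain produces a uniform value, it can be evaluated on a size-1 tensor and only the resulting scalar is kept, regardless of the dynamic shape.
```python
import torch

# A chain like full -> mul -> add yields a tensor whose elements are all equal,
# regardless of its (possibly dynamic) shape. It can therefore be folded by
# evaluating the same ops on a shape-[1] tensor and keeping only the scalar.
def fold_uniform_chain() -> float:
    x = torch.full((1,), 2.0)  # stand-in for full(dynamic_shape, 2.0)
    x = x * 3.0                # pointwise ops preserve uniformity
    x = x + 1.0
    return x.item()            # 7.0; no full-size constant is ever materialized

value = fold_uniform_chain()   # later re-expanded to the runtime shape if needed
```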
Taken over from https://github.com/pytorch/pytorch/pull/128937
joint work with @imzhuhl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686
Approved by: https://github.com/Chillee
ghstack dependencies: #130367
Compiling the `create_block_mask` function allows us to "materialize" extremely large masks. This would have been a 1 *trillion* element tensor if fully materialized.
```
print(do_bench(lambda: create_block_mask(causal_mask, 1, 1, 2**20, 2**20, _compiled=True)))
```
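For context, a more self-contained version of the snippet above might look like the following (assuming Triton's `do_bench` for timing and the causal mask from the earlier example):
```python
from triton.testing import do_bench
from torch.nn.attention.flex_attention import create_block_mask

def causal_mask(b, h, q, kv):
    return q >= kv

# With _compiled=True the mask construction itself goes through torch.compile,
# so a 2**20 x 2**20 logical mask can be built as a compact block mask.
ms = do_bench(lambda: create_block_mask(causal_mask, 1, 1, 2**20, 2**20, _compiled=True))
print(ms)
```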
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130106
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160
This model's accuracy test recently regressed. I had a quite smooth debugging process figuring out the cause, so I'd like to write it down in case it is helpful.
Clicking the model name beit_base_patch16_224 on the dashboard, we can see the pass status of the model over, e.g., the past month. For this model, it started failing on June 08:
<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">
What's nice is the dashboard shows the nightly commits for each run.
Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df
Looking roughly through the PRs, I felt that
```
ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451)
```
could change numerics, so I disabled it locally with this one-line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . The accuracy test then passes. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 )
Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid: it removes no-op ops in the joint graph. I think the graph may get changed, causing the partitioner to make different recomputation decisions, which can change numerics slightly.
Since this is not a real issue, I'll raise the tolerance to make it pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941