283 Commits

Author SHA1 Message Date
f02e3947f6 Expand type checking to mypy strict files (#165697)
Expands Pyrefly type checking to check the files outlined in the mypy-strict.ini configuration file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165697
Approved by: https://github.com/ezyang
2025-10-18 04:34:45 +00:00
20eeb54814 Add api info for torch._C._nn.pyi (#162936)
Fix part of #148404

APIs involved are as follows:

- silu
- silu_
- smooth_l1_loss
- soft_margin_loss
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162936
Approved by: https://github.com/FFFrog, https://github.com/ezyang
2025-09-24 04:55:57 +00:00
5599f487ef Fully native DTensor.__new__ (#162508)
Move the entirety of `__new__` into C++, saving a layer of disable_dynamo and making progress toward all-C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162508
Approved by: https://github.com/ezyang
ghstack dependencies: #161695
2025-09-21 18:36:05 +00:00
9ba918082a Add api info for torch._C._nn.pyi (#162707)
Fix part of #148404

APIs involved are as follows:

- multilabel_margin_loss
- multi_margin_loss
- nll_loss_nd
- relu6
- relu6_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162707
Approved by: https://github.com/ezyang
2025-09-21 06:17:15 +00:00
28e8531032 Add api info for torch._C._nn.pyi (#162361)
Fix part of #148404

APIs involved are as follows:

- im2col
- l1_loss
- mish
- mish_
- mse_loss
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162361
Approved by: https://github.com/ezyang
2025-09-12 05:56:22 +00:00
aac1a50a19 Add api info for torch._C._nn.pyi (#162148)
Fix part of #148404

APIs involved are as follows:

- cross_entropy_loss
- hardsigmoid_
- hardswish
- hardswish_
- huber_loss
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162148
Approved by: https://github.com/FFFrog, https://github.com/ezyang
2025-09-06 05:21:40 +00:00
88d94d17e8 Add torch.Tensor._make_dtensor to accelerate DTensor.__new__ further (#161590)
This seems to be a (very very roughly) ~8% improvement on DTensor benchmark very similar to the benchmark from #160580 (120ish usec -> 110ish usec)

Differential Revision: [D81530105](https://our.internmc.facebook.com/intern/diff/D81530105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161590
Approved by: https://github.com/albanD
ghstack dependencies: #161466, #161586
2025-09-05 18:43:41 +00:00
29280864d9 Add new parameter for gen_pyi.py to make it more configurable. (#161772)
This is a reposting of PR #128519.
This change is important to how we maintain PyTorch at Google.

From the previous PR:
"
This will make the script more flexible for the directory where it is executed.
...
We plan to use the deprecated_yaml from a blaze genrule that invokes pyi.py. A genrule requires the input files to pyi.py to be listed explicitly. When we fed the value of tools/autograd/deprecated.yaml to the genrule, it failed to resolve, since tools/autograd is a package from blaze's perspective. Any file under a blaze package needs a proper blaze target to be accessed.
"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161772
Approved by: https://github.com/albanD

Co-authored-by: Haifeng Jin <haifeng-jin@users.noreply.github.com>
2025-09-05 00:48:15 +00:00
a75e8cd270 Add api info for torch._C._nn.pyi (#161958)
Fix part of #148404

APIs involved are as follows:

- max_pool2d_with_indices
- max_pool3d_with_indices
- elu
- glu
- max_unpool2d
- max_unpool3d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161958
Approved by: https://github.com/ezyang
2025-09-02 20:39:20 +00:00
b146e1a264 [BE] remove duplicates in generated torch._VF.__all__ (#157365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157365
Approved by: https://github.com/Skylion007
2025-07-01 15:43:20 +00:00
2f94f69b7c [aotd] Support mutations of the same input in fw and bw (#155354)
Original issue: https://github.com/pytorch/pytorch/issues/154820

The issue occurs when the same input is mutated in both forward AND backward.

AOTD emitted copy_ after joint_function tracing, so this single fx node corresponded to the side effects of both mutations (in forward and in backward).
The partitioner could then place it in either forward or backward.

The fix:

1/ Introduce joint_function.handle, which allows setting a "post_forward" callback so that the state of the inputs can be checked after forward.

We do not want to apply the mutation after joint if we already applied it in forward. For that we need "mutation_counter", and we memorize the version of the mutation that we applied in forward.

2/ Expose mutation_counter to Python.

We want to keep the invariant that copy_ exists only at the end of the joint graph.

3/ Memorize mutation_counter and the state of the inputs after forward, using the post_forward handle.
Emit post_forward mutations after the joint graph is fully traced.

Add a "must_be_in_forward" tag to post_forward mutations (similar to the existing "must_be_in_backward") to keep them in forward.

4/ Ban recompute of the source of the mutation. Recompute can apply the same op (e.g. add) in forward and backward.
To do this, set MUST_SAVE for the source of the mutation in forward.
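
A toy sketch of the counter idea, in plain Python rather than the actual AOTD code. `ToyInput`, `trace_joint`, and the tuple records are hypothetical names used only for illustration; the only things taken from the message above are the notions of a mutation counter, a post-forward snapshot, and the two tags.

```python
class ToyInput:
    """Hypothetical stand-in for a graph input; not a PyTorch class."""

    def __init__(self, value):
        self.value = value
        self.mutation_counter = 0

    def mutate(self, new_value):
        # Every in-place update bumps the counter.
        self.value = new_value
        self.mutation_counter += 1


def trace_joint(inp, forward, backward):
    """Run forward then backward and return the copy_-style records to emit."""
    forward(inp)
    # "post_forward" step: memorize the counter and the input state after forward.
    counter_after_fw = inp.mutation_counter
    state_after_fw = inp.value

    backward(inp)

    # Records are emitted only once the whole joint function has been traced.
    emitted = []
    if counter_after_fw > 0:
        emitted.append(("copy_", "must_be_in_forward", state_after_fw))
    if inp.mutation_counter > counter_after_fw:
        emitted.append(("copy_", "must_be_in_backward", inp.value))
    return emitted


records = trace_joint(
    ToyInput(0),
    forward=lambda t: t.mutate(t.value + 1),
    backward=lambda t: t.mutate(t.value * 10),
)
print(records)
```

In this toy, the forward and backward mutations end up as two separate tagged records instead of one record that mixes both side effects.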

proxy_tensor changes:

By default, proxy tensor updates tensor_tracker, in which case the applied mutations are chained.
But we want this copy_ to be independent and applied only to primals.
To do this, introduce a context manager that disables tensor_tracker updates while adding forward mutations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354
Approved by: https://github.com/bdhirsh
2025-06-26 14:05:54 +00:00
e600e044a7 Revert "[aotd] Support mutations of the same input in fw and bw (#155354)"
This reverts commit 3f920f3d8f5bd15d2222758f21f9a5d36e4dad1f.

Reverted https://github.com/pytorch/pytorch/pull/155354 on behalf of https://github.com/malfet due to Not sure why CI was green, but it breaks tons of tests, see 930b575389/1 ([comment](https://github.com/pytorch/pytorch/pull/155354#issuecomment-2998780884))
2025-06-24 04:42:14 +00:00
3f920f3d8f [aotd] Support mutations of the same input in fw and bw (#155354)
Original issue: https://github.com/pytorch/pytorch/issues/154820

The issue occurs when the same input is mutated in both forward AND backward.

AOTD emitted copy_ after joint_function tracing, so this single fx node corresponded to the side effects of both mutations (in forward and in backward).
The partitioner could then place it in either forward or backward.

The fix:

1/ Introduce joint_function.handle, which allows setting a "post_forward" callback so that the state of the inputs can be checked after forward.

We do not want to apply the mutation after joint if we already applied it in forward. For that we need "mutation_counter", and we memorize the version of the mutation that we applied in forward.

2/ Expose mutation_counter to Python.

We want to keep the invariant that copy_ exists only at the end of the joint graph.

3/ Memorize mutation_counter and the state of the inputs after forward, using the post_forward handle.
Emit post_forward mutations after the joint graph is fully traced.

Add a "must_be_in_forward" tag to post_forward mutations (similar to the existing "must_be_in_backward") to keep them in forward.

4/ Ban recompute of the source of the mutation. Recompute can apply the same op (e.g. add) in forward and backward.
To do this, set MUST_SAVE for the source of the mutation in forward.

proxy_tensor changes:

By default, proxy tensor updates tensor_tracker, in which case the applied mutations are chained.
But we want this copy_ to be independent and applied only to primals.
To do this, introduce a context manager that disables tensor_tracker updates while adding forward mutations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354
Approved by: https://github.com/bdhirsh
2025-06-23 22:25:45 +00:00
1982ec2d22 Add api info for torch._C._nn.pyi (#148405)
APIs involved are as follows:

- adaptive_avg_pool2d
- adaptive_avg_pool3d
- binary_cross_entropy
- col2im

ISSUE Related:
https://github.com/pytorch/pytorch/issues/148404
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148405
Approved by: https://github.com/ezyang
2025-06-14 07:57:07 +00:00
f04fd4dc4e typing: allow integer in bitwise operations (#155704)
Fixes #155701 (false positives)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155704
Approved by: https://github.com/Skylion007, https://github.com/aorenste
2025-06-12 22:40:17 +00:00
0a7eef140b Add torch.Tensor._make_wrapper_subclass to torch/_C/__init__.pyi (#154022)
Fixes #153790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154022
Approved by: https://github.com/Skylion007
2025-05-27 14:10:00 +00:00
27f7b65a69 [BE] Ensure generated stub files by gen_pyi are properly formatted (#150730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150730
Approved by: https://github.com/aorenste
2025-05-17 12:30:40 +00:00
a4c828199e [BE] Add __all__ to torch/nn/functional.pyi and torch/return_types.pyi (#150729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150729
Approved by: https://github.com/aorenste
2025-05-15 19:01:57 +00:00
f7a5aa1d8d [torchgen] Refactor and simplify gen_pyi.py to use Generic TypeAlias (PEP 585) and Union Type (PEP 604) (#150727)
https://github.com/pytorch/pytorch/pull/129001#discussion_r1645126801 is the motivation for the whole stack of PRs. In `torch/__init__.py`, `torch._C.Type` shadows `from typing import Type`, and there is no type stub for `torch._C.Type` in `torch/_C/__init__.pyi`. So we need to use `from typing import Type as _Type`. After enabling [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585) in the `.pyi` type stub files, we can use `type` instead of `typing.Type` or `from typing import Type as _Type`.

------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150727
Approved by: https://github.com/aorenste
ghstack dependencies: #150726
2025-05-15 09:36:42 +00:00
e33bc41958 add torch.float4_e2m1fn_x2 to PyTorch (#148791)
Summary:

Redo of https://github.com/pytorch/pytorch/pull/146578 to get around
rebase conflicts.

Test Plan:

```
pytest test/quantization/core/experimental/test_floatx.py -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148791
Approved by: https://github.com/drisspg, https://github.com/eqy, https://github.com/jeffdaily
2025-03-27 17:32:20 +00:00
f30776c37a [BE] Upgrade to mypy 1.14 (#145966)
Upgrade mypy version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145966
Approved by: https://github.com/Skylion007
2025-03-04 20:58:26 +00:00
382fbcc1e4 add the torch.float8_e8m0fnu dtype to PyTorch (#147466)
Summary:

Continuing the work from https://github.com/pytorch/pytorch/pull/146427

Adds the `torch.float8_e8m0fnu` dtype to PyTorch, as detailed in
https://github.com/pytorch/pytorch/issues/146414 . Please see the issue for a detailed definition of the format.  Example of basic functionality:

```python
import torch

# round trip
x0 = torch.randn(4, 4, dtype=torch.float32)
x1 = x0.to(torch.float8_e8m0fnu)  # RNE rounding
x2 = x1.to(torch.float32)  # 2 ** exponent

# creation with empty
x0 = torch.empty(4, 4, dtype=torch.float8_e8m0fnu)

# printing
print(x0)
```

Done in this PR:
* numerical correctness
* op coverage (except for `torch._scaled_mm`): create tensor, cast to/from float32
* printing a tensor works

For future PRs:
* performance optimizations for casting
* torch._scaled_mm
* PT2
* various cleanups (detailed in comments with issue numbers)

Test Plan:

```
pytest test/quantization/core/experimental/test_float8.py -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147466
Approved by: https://github.com/drisspg
2025-02-20 13:55:42 +00:00
8bea08e5bc [BE] Fix tensor stub (#147384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147384
Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/atalman
2025-02-19 19:47:03 +00:00
cyy
82aaf64422 [3/N] Apply py39 ruff fixes (#142115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142115
Approved by: https://github.com/ezyang
2024-12-11 17:50:10 +00:00
cyy
aa95618268 [2/N] Apply py39 ruff fixes (#141938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141938
Approved by: https://github.com/ezyang
2024-12-05 06:26:06 +00:00
1f3d8896bc Fix mismatched tensor metadata between FakeTensor and Intel XPU concrete tensor when running F.logsigmoid (#141333)
Fixes https://github.com/pytorch/pytorch/issues/141332
`F.logsigmoid` will return two outputs: `output` and `buffer`.
On the `F.logsigmoid` CPU path, `buffer` stores intermediate values that are reused when computing gradients, so a `buffer` tensor with nonzero size is returned. On the CUDA and XPU paths, `buffer` is unused, so the `buffer` tensor returned by XPU `F.logsigmoid` has size zero, just like CUDA. The root cause of the issue is that the code in `decompositions.py` (ref: https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L2803) only handles the CUDA case: when a fake tensor whose device is XPU reaches this code, it takes the CPU path and returns a `buffer` with nonzero size, which conflicts with the implementation for the Intel XPU concrete tensor. This PR therefore adds conditions to handle the XPU case, so that the two returned buffer sizes match.
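
The device condition described above can be sketched on shapes alone. This is a hypothetical helper, not the code in `decompositions.py`; the function name and the use of plain strings for device types are assumptions made for illustration.

```python
def logsigmoid_buffer_shape(device_type: str, input_shape: tuple) -> tuple:
    """Shape of the second output (`buffer`) for a given device type.

    Hypothetical helper: CUDA and XPU do not use the buffer, so it is empty;
    other devices (the CPU path) keep a buffer shaped like the input.
    """
    if device_type in ("cuda", "xpu"):
        return (0,)
    return tuple(input_shape)


for device in ("cpu", "cuda", "xpu"):
    print(device, logsigmoid_buffer_shape(device, (4, 8)))
```

With the extra `"xpu"` case, a fake tensor on XPU reports the same empty buffer as the concrete XPU tensor instead of falling through to the CPU branch.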

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141333
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/ezyang
2024-12-02 22:09:20 +00:00
c223e0642c Tighten type hints for tensor arithmetic (#135392)
Fixes #124015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135392
Approved by: https://github.com/ezyang
2024-11-11 23:55:27 +00:00
beae7725be Revert "Tighten type hints for tensor arithmetic (#135392)"
This reverts commit d3788190685685cb828bdf6bed90270c0b60affc.

Reverted https://github.com/pytorch/pytorch/pull/135392 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. See D65641103 for more details ([comment](https://github.com/pytorch/pytorch/pull/135392#issuecomment-2465906839))
2024-11-08 23:44:41 +00:00
d378819068 Tighten type hints for tensor arithmetic (#135392)
Fixes #124015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135392
Approved by: https://github.com/ezyang
2024-11-07 20:54:39 +00:00
6add86a29f Revert "Tighten type hints for tensor arithmetic (#135392)"
This reverts commit bf5cd8d0116d90d24b8acb38d578b8952dab22ef.

Reverted https://github.com/pytorch/pytorch/pull/135392 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking lint on trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/11673543178/job/32504499599) [HUD commit link](bf5cd8d011) ([comment](https://github.com/pytorch/pytorch/pull/135392#issuecomment-2455908056))
2024-11-04 23:30:15 +00:00
bf5cd8d011 Tighten type hints for tensor arithmetic (#135392)
Fixes #124015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135392
Approved by: https://github.com/ezyang
2024-11-04 22:10:04 +00:00
e9d2765ec8 Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit d1bb8e828f280d1c66fff193c043d5bc36154577.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Break internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2379214226))
2024-09-27 12:54:47 +00:00
d1bb8e828f Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.

Fixes #89492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby
2024-09-26 04:52:05 +00:00
e3b89ca124 Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit b1a02bf70824a4802411ddd5be1d3610e7a2e269.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/ezyang due to Failing internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2374201626))
2024-09-25 14:11:01 +00:00
b1a02bf708 Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.

Fixes #89492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby
2024-09-24 21:34:43 +00:00
fd182b90a7 Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit d45b0151e5d9a9358368b9fbd7fa454edd5d9709.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Failing internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2369244135))
2024-09-23 19:57:13 +00:00
d45b0151e5 Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.

Fixes #89492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby
2024-09-20 02:41:56 +00:00
66dd4577b1 Track base of FunctionalTensor in inference mode. (#135141)
The idea behind the tracking is the following: whenever we see a tensor, if it is a root tensor (it does not have any view metas), we consider it the base of all the tensors that share its storage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135141
Approved by: https://github.com/zou3519
2024-09-06 00:10:25 +00:00
8bc5ef563e Grouped Query Attention (#132689)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
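- It gives meaning to the third-to-last dimension (the number of heads).

Sample use cases this would enable:
Llama 3

```python
import torch
from torch.nn.functional import scaled_dot_product_attention

batch, seq_len_q, seq_len_kv, D = 2, 128, 128, 64  # example sizes

# Llama 3 8B call to SDPA: 32 query heads, 8 key/value heads
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output shape
print(output.shape)  # (batch, 32, seq_len_q, D)
```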

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.
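
The head check and the `repeat_interleave` expansion described above can be sketched without tensors. The two helpers below are hypothetical and operate on plain Python lists of head labels; they only illustrate which key/value head each query head ends up paired with.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Index of the key/value head that a given query head attends with."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("Q_Heads % KV_Heads must be 0")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


def expand_kv_heads(kv_heads: list, num_q_heads: int) -> list:
    """Repeat each key/value head consecutively, like repeat_interleave on the head dim."""
    if num_q_heads % len(kv_heads) != 0:
        raise ValueError("Q_Heads % KV_Heads must be 0")
    group_size = num_q_heads // len(kv_heads)
    return [head for head in kv_heads for _ in range(group_size)]


# 32 query heads and 8 key/value heads: groups of 4 query heads share one KV head.
print([kv_head_for_query_head(h, 32, 8) for h in range(8)])
print(expand_kv_heads(["kv0", "kv1"], 4))
```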

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes, enable_gqa=True shows a substantial improvement in the run_time of sdpa

| batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True | forward_time when enable_gqa=False |
| ---------- | ----------- | ------------ | --------- | ---------- | --------- | --------------------------------- | ---------------------------------- |
| 1          | 32          | 8            | 2048      | 2048       | 2048      | 100.71                            | 119.70                             |
| 8          | 32          | 8            | 2048      | 2048       | 2048      | 539.78                            | 628.83                             |
| 16         | 32          | 8            | 2048      | 2048       | 2048      | 1056.81                           | 1225.48                            |
| 32         | 32          | 8            | 2048      | 2048       | 2048      | 2099.54                           | 2440.45                            |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Differential Revision: D60772086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132689
Approved by: https://github.com/drisspg
2024-08-07 05:35:36 +00:00
26c6786109 return_and_correct_aliasing: skip dispatcher when swapping storage (#132524)
`return_and_correct_aliasing` is used by FunctionalTensor today to ensure that when we call view/inplace ops, the input and output `FunctionalTensors` share the same storage.

This was previously done with a dispatcher call to `aten.set_`. In this PR I swap it out with a util that just manually does the storage swap. Benefits:

(1) we know this is safe in the specific way it is used by FunctionalTensor: avoiding the extra assertions in `aten.set_` is necessary to avoid some unbacked symint errors

(2) this should improve compile times a bit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132524
Approved by: https://github.com/ezyang
ghstack dependencies: #132243, #132337, #132322
2024-08-06 00:44:35 +00:00
af8b8a47cb fsdp.set_: convey to functionalization that it mutates storage (#132322)
Fixes https://github.com/pytorch/pytorch/issues/132197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132322
Approved by: https://github.com/albanD, https://github.com/yf225
ghstack dependencies: #132243, #132337
2024-08-05 21:28:59 +00:00
bcb4f7c172 Revert "Grouped Query Attention (#128898)"
This reverts commit 6b28af1b79eaa63e2f423d925bbd42330582983f.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/ZainRizvi due to Sorry, this broke a bunch of tests internally. See D60638265 ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2265961038))
2024-08-02 18:58:46 +00:00
6b28af1b79 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
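- It gives meaning to the third-to-last dimension (the number of heads).

Sample use cases this would enable:
Llama 3

```python
import torch
from torch.nn.functional import scaled_dot_product_attention

batch, seq_len_q, seq_len_kv, D = 2, 128, 128, 64  # example sizes

# Llama 3 8B call to SDPA: 32 query heads, 8 key/value heads
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output shape
print(output.shape)  # (batch, 32, seq_len_q, D)
```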

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes, enable_gqa=True shows a substantial improvement in the run_time of sdpa

| batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True | forward_time when enable_gqa=False |
| ---------- | ----------- | ------------ | --------- | ---------- | --------- | --------------------------------- | ---------------------------------- |
| 1          | 32          | 8            | 2048      | 2048       | 2048      | 100.71                            | 119.70                             |
| 8          | 32          | 8            | 2048      | 2048       | 2048      | 539.78                            | 628.83                             |
| 16         | 32          | 8            | 2048      | 2048       | 2048      | 1056.81                           | 1225.48                            |
| 32         | 32          | 8            | 2048      | 2048       | 2048      | 2099.54                           | 2440.45                            |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-31 22:58:51 +00:00
499ead96ff Revert "Grouped Query Attention (#128898)"
This reverts commit d039b14207fe659d664c590efc06cc0a2abc96c0.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/albanD due to Broken test on main ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2258314481))
2024-07-30 13:11:24 +00:00
d039b14207 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
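- It gives meaning to the third-to-last dimension (the number of heads).

Sample use cases this would enable:
Llama 3

```python
import torch
from torch.nn.functional import scaled_dot_product_attention

batch, seq_len_q, seq_len_kv, D = 2, 128, 128, 64  # example sizes

# Llama 3 8B call to SDPA: 32 query heads, 8 key/value heads
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output shape
print(output.shape)  # (batch, 32, seq_len_q, D)
```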

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes, enable_gqa=True shows a substantial improvement in the run_time of sdpa

| batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True | forward_time when enable_gqa=False |
| ---------- | ----------- | ------------ | --------- | ---------- | --------- | --------------------------------- | ---------------------------------- |
| 1          | 32          | 8            | 2048      | 2048       | 2048      | 100.71                            | 119.70                             |
| 8          | 32          | 8            | 2048      | 2048       | 2048      | 539.78                            | 628.83                             |
| 16         | 32          | 8            | 2048      | 2048       | 2048      | 1056.81                           | 1225.48                            |
| 32         | 32          | 8            | 2048      | 2048       | 2048      | 2099.54                           | 2440.45                            |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-29 21:49:06 +00:00
b193894b94 FakeTensor cache SymInt support (#127596)
Adds support for SymInts in the FakeTensor cache.

A couple notes:
1. When a SymInt is present in the input key for a FakeTensor operation, we cache on the ShapeEnv instead of using the FakeTensorMode cache. This is necessary so we don't have to remember and check the guards. It reduces the cache hits, but there are diminishing returns on how much work we can do before the cache becomes more of a burden than a gain.
2. When we cache an output SymInt that is a direct copy from the input, we need to be careful to copy the SymNode from the input to the output on a cache hit. This is important because the fx-graph building code actually uses SymNode ids in the process of building the graph, so constructing a same-content-but-different-id SymNode will fail.
3. In the cache key we store SymInts as a _PySymInputStub. These represent SymInt (and friends) but support `__hash__` and `__eq__` (which SymInt does not).
4. In the cache entry we store SymInts as a _SymIntOutputStub.
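
The stub idea in note 3 can be illustrated with a generic Python pattern: wrap an unhashable value in a small object that defines `__hash__` and `__eq__` so it can take part in a dictionary key. `ToySymInt` and `ToyKeyStub` are hypothetical classes written for this sketch; they are not the real `SymInt` or `_PySymInputStub`.

```python
class ToySymInt:
    """Hypothetical stand-in for a symbolic integer that cannot be hashed."""

    __hash__ = None  # instances are unhashable, so they cannot be dict keys

    def __init__(self, expr: str) -> None:
        self.expr = expr


class ToyKeyStub:
    """Hashable, comparable wrapper used only inside cache keys."""

    def __init__(self, sym: ToySymInt) -> None:
        self.expr = sym.expr

    def __hash__(self) -> int:
        return hash(("ToyKeyStub", self.expr))

    def __eq__(self, other: object) -> bool:
        return isinstance(other, ToyKeyStub) and self.expr == other.expr


cache = {}
key = ("add", ToyKeyStub(ToySymInt("s0")), 4)
cache[key] = "cached result"

# A structurally equal key built from a different ToySymInt object still hits.
lookup = ("add", ToyKeyStub(ToySymInt("s0")), 4)
hit = cache.get(lookup)
miss = cache.get(("add", ToyKeyStub(ToySymInt("s1")), 4))
print(hit, miss)
```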

Perf example:
```
python benchmarks/dynamo/timm_models.py --ci --accuracy --timing
--explain --inductor --dynamic-shapes --dynamic-batch-only --device cuda
--training --amp --total-partitions 2 --partition-id 0 --output
/tmp/training_timm_models.csv --filter crossvit_9_240
```
fake tensor cache before:
```
INFO: FakeTensor cache stats:
INFO:   cache_hits: 68137
INFO:   cache_misses: 837
INFO:   cache_bypasses:
INFO:     symbolic shape:            48224
INFO:     CompositeImplicitAutograd: 917
INFO:     non-fake tensor:           70
INFO:     non-FakeTensor output:     62
INFO:     non-builtin:               8
INFO:     dynamic output shape:      1
```
and after:
```
INFO: FakeTensor cache stats:
INFO:   cache_hits: 88187
INFO:   cache_misses: 14233
INFO:   cache_bypasses:
INFO:     CompositeImplicitAutograd: 1037
INFO:     non-FakeTensor output:     602
INFO:     non-fake tensor:           70
INFO:     unsafe view:               36
INFO:     non-builtin:               8
INFO:     dynamic output shape:      1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127596
Approved by: https://github.com/eellison
ghstack dependencies: #131014, #129780
2024-07-21 19:26:38 +00:00
567482973d typing fake_tensor.py (#128041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128041
Approved by: https://github.com/eellison
ghstack dependencies: #129182
2024-07-13 06:07:40 +00:00
d1d0a7080f [torchgen] reference generated comment to actual location of the generator and template (#130020)
As per title.

```diff
# torch/_VF.pyi

- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```

```diff
# torch/return_types.pyi

- # @generated from torch/_C/return_types.pyi
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/return_types.pyi.in
```

```diff
# torch/_C/__init__.pyi

- # @generated from torch/_C/__init__.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/__init__.pyi.in
```

```diff
# torch/_C/_nn.pyi

+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_nn.pyi.in
```

```diff
# torch/_C/_VariableFunctions.pyi

- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```

```diff
# torch/nn/functional.pyi

+ # @generated by tools/pyi/gen_pyi.py from torch/nn/functional.pyi.in
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130020
Approved by: https://github.com/ezyang
2024-07-05 21:47:14 +00:00
8a5fda0377 added type hints for __contains__ (#129653)
- Fixes #129646
- Added test in test/typing/reveal/tensor_constructors.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129653
Approved by: https://github.com/ezyang
2024-06-30 11:49:11 +00:00
8a67daf283 [BE][Easy] enable postponed annotations in tools (#129375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375
Approved by: https://github.com/malfet
2024-06-29 09:23:35 +00:00