Commit Graph

113 Commits

3cda34ebde [2/N] Apply ruff UP035 check in torch files (#164054)
This is the result of applying the ruff `UP035` check.
`Callable` is now imported from `collections.abc` instead of `typing`; `TypeAlias` is still imported from `typing`.
This PR is a follow-up to #163947.
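An illustrative sketch of the kind of rewrite UP035 enforces (the alias itself is made up):

```python
# Before (flagged by ruff UP035):
#   from typing import Callable

# After:
from collections.abc import Callable
from typing import TypeAlias  # TypeAlias remains available in typing (3.10+)

# Made-up alias, purely for illustration:
Hook: TypeAlias = Callable[[int], None]
```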

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164054
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-09-29 03:35:32 +00:00
0ebfa3d7d2 Avoid fast path mask left-align check in compiled TransformerEncoder (#163773)
Fixes #163640

This PR avoids the mask left-align check when we're operating under torch.compile / torch.export. Originally, I planned a more invasive change to auto-disable the fast path entirely under torch.compile / torch.export, but I realized during testing that the fast path wasn't actually causing compile issues outside of the narrow issue identified here.
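A minimal sketch of the guard pattern (function name and logic are illustrative, not the actual patch):

```python
import torch

def _skip_mask_left_align_check() -> bool:
    # True while tracing under torch.compile / torch.export, where the
    # data-dependent left-align check on mask contents would be unsafe.
    return torch.compiler.is_compiling()
```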
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163773
Approved by: https://github.com/mikaylagawarecki
2025-09-26 22:29:37 +00:00
163f0d8f2a [BE][Ez]: Auto add return type annotations for methods in torch/nn/module (#157925)
Automatically type a bunch of methods in nn.Module using ruff's type inference rules

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157925
Approved by: https://github.com/albanD
2025-07-09 21:12:25 +00:00
596b418391 [BE][PYFMT] migrate PYFMT for {torch,test}/{nn,optim}/** to ruff format (#144548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144548
Approved by: https://github.com/ezyang
2025-06-14 11:27:04 +00:00
671553bd23 Update documentation wording for transformer-related layers (#155123)
<img width="947" alt="Screenshot 2025-06-04 at 1 33 53 PM" src="https://github.com/user-attachments/assets/4dbb66b3-43f4-4d04-afb5-dc80cec0f2cd" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155123
Approved by: https://github.com/albanD, https://github.com/jbschlosser
2025-06-04 22:20:32 +00:00
4b0cf9fc00 Optimize transformer encoder/decoder init suggestion (#146882)
Fixes #72253

Add a hint message telling users to manually initialize the parameters after the model is created.
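For context, the manual initialization the hint points users to looks like this (a hedged sketch of the common pattern, not the exact message):

```python
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8)
# Initialize parameters manually after construction, as the hint suggests:
for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)
```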

## Test Result

**Before**

![image](https://github.com/user-attachments/assets/1914223f-008e-4ff7-aea1-c54c55679f65)

![image](https://github.com/user-attachments/assets/fd4110c1-26f7-48fe-9582-80581ab72328)

**After**

![image](https://github.com/user-attachments/assets/12270ba2-b384-4fe6-b351-4287b272d102)

![image](https://github.com/user-attachments/assets/0194e3a0-700a-40da-a9de-e9854c2d5d2e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146882
Approved by: https://github.com/jbschlosser
2025-04-11 02:31:56 +00:00
cb83850a24 Fix docs format error in torch.nn (#150156)
Fixes #150152

Fix format errors in [torch.nn.CosineSimilarity](https://pytorch.org/docs/stable/generated/torch.nn.CosineSimilarity.html#torch.nn.CosineSimilarity), [torch.nn.KLDivLoss](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html#torch.nn.KLDivLoss), and other pages.

## Test Result

### Before

#### torch.nn.CosineSimilarity

![Image](https://github.com/user-attachments/assets/1ad633d9-dfaf-43f0-a536-9035a24bf858)

#### torch.nn.KLDivLoss

![Image](https://github.com/user-attachments/assets/20a001b0-1f66-414e-b554-11934d65a4bf)

### After
#### torch.nn.CosineSimilarity
![image](https://github.com/user-attachments/assets/a2d9ea8d-5637-4604-a0e4-9231a4deee44)

#### torch.nn.KLDivLoss
![image](https://github.com/user-attachments/assets/d0e319f9-a3b3-47a7-b2f8-060d46d53bc7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150156
Approved by: https://github.com/cyyever, https://github.com/malfet
2025-03-28 20:54:09 +00:00
09ae69a364 Revert "Fix type annotation of Linear.bias (#142326)"
This reverts commit 81e370fc6b90f9cb98c88f3173e738aba0dc650a.

Reverted https://github.com/pytorch/pytorch/pull/142326 on behalf of https://github.com/malfet due to This introduced a graph break and regressed inductor tests, see 73622fc5fa/1 ([comment](https://github.com/pytorch/pytorch/pull/142326#issuecomment-2614196349))
2025-01-26 03:41:00 +00:00
5b988ac4fa [Easy] Replace paper description with link to make a concise description. (#145031)
The descriptions on the [Transformer](https://pytorch.org/docs/main/generated/torch.nn.Transformer.html), [TransformerEncoderLayer](https://pytorch.org/docs/main/generated/torch.nn.TransformerEncoderLayer.html), and [TransformerDecoderLayer](https://pytorch.org/docs/main/generated/torch.nn.TransformerDecoderLayer.html) pages contain author and paper details that seem redundant for users who just want to know how to use the layers. Replace them with a link to the paper; users can follow it if they want to learn more.

**Test Result**

**Before**
![image](https://github.com/user-attachments/assets/678402b1-e759-402c-b56b-e24f63dc8490)
![image](https://github.com/user-attachments/assets/ca191734-f2ce-493f-bf34-2d7046a9868f)
![image](https://github.com/user-attachments/assets/10f55083-6eb6-4b1c-9a77-579f0c4c56ed)

**After**
![image](https://github.com/user-attachments/assets/020f81ca-d89b-47d1-a7a9-cae1893df968)
![image](https://github.com/user-attachments/assets/5b9b34df-b892-4d71-8cdb-df18380b2744)
![image](https://github.com/user-attachments/assets/b3348da2-842a-4037-bad3-f23687503cf8)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145031
Approved by: https://github.com/mikaylagawarecki
2025-01-24 23:01:02 +00:00
81e370fc6b Fix type annotation of Linear.bias (#142326)
Currently the `bias` attribute of `torch.nn.Linear` (and `Bilinear`) is typed incorrectly, because it relies on the implicit `Module.__getattr__` which types it as `Tensor | Module`. This has two issues:

- It hides the fact that `bias` is optional and can be `None`, which in turn can hide actual bugs on the user side.
- It blurs the type by having `Module` in the union, which can require an unnecessary `isinstance(linear.bias, Tensor)` on the user side.

This PR types the `bias` attribute explicitly to fix these issues.
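A sketch of what the explicit annotation looks like (simplified; `MyLinear` is illustrative):

```python
from typing import Optional

from torch import Tensor, nn

class MyLinear(nn.Module):
    # Explicit class-level annotation: narrows the implicit
    # `Tensor | Module` from Module.__getattr__ and records that
    # bias may be None.
    bias: Optional[Tensor]
```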

CC @ezyang @Skylion007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142326
Approved by: https://github.com/ezyang
2025-01-24 22:43:52 +00:00
b8f383107e Link to transformer tutorial in transformer docs (#144425)
<img width="1045" alt="Screenshot 2025-01-08 at 4 50 20 PM" src="https://github.com/user-attachments/assets/05adfecb-8a23-4c48-9a2c-50c5b3f886b0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144425
Approved by: https://github.com/albanD
2025-01-09 17:42:09 +00:00
5aa9f2b660 Fixed issue with nn.Transformer().generate_square_subsequent_mask() (#137654)
Fixed an issue where nn.Transformer().generate_square_subsequent_mask() didn't respect set_default_device() and set_default_dtype().

Fixes #137186
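After the fix, the mask honors the active defaults, e.g. (hedged sketch):

```python
import torch
import torch.nn as nn

torch.set_default_dtype(torch.float64)
mask = nn.Transformer.generate_square_subsequent_mask(5)
# With the fix this follows the current default; previously the defaults
# captured at import time were used instead.
assert mask.dtype == torch.float64
```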

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137654
Approved by: https://github.com/mikaylagawarecki
2024-10-10 03:10:01 +00:00
62ccf6d7cd [BE] enable UFMT for torch/nn/modules (#128594)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128594
Approved by: https://github.com/mikaylagawarecki
2024-06-23 05:37:57 +00:00
2db33054b3 Disable fast path in TransformerEncoderLayer when there are forward (pre-)hooks attached to modules (#128415)
Fixes #128413

Disable fast-path if there are forward hooks or pre-hooks.

An example failure case is given in the issue.
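Illustrative scenario (shapes and the hook itself are made up):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=8, nhead=2, batch_first=True)
layer.eval()

def log_hook(module, args, output):
    print(type(module).__name__, "ran")

# With a hook attached, the layer must take the slow path so that
# self_attn is actually called as a submodule and the hook fires.
layer.self_attn.register_forward_hook(log_hook)
with torch.no_grad():
    layer(torch.randn(2, 4, 8))
```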
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128415
Approved by: https://github.com/mikaylagawarecki
2024-06-21 17:38:08 +00:00
d4022b4658 Revert "[BE] enable UFMT for torch/nn/modules (#128594)"
This reverts commit 95ac2d648279ebc73feccf6d8eccafa4b2759de8.

Reverted https://github.com/pytorch/pytorch/pull/128594 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128594#issuecomment-2181788935))
2024-06-21 00:50:08 +00:00
95ac2d6482 [BE] enable UFMT for torch/nn/modules (#128594)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128594
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #128596
2024-06-17 16:29:25 +00:00
27f9d3b0a1 Flip default value for mypy disallow_untyped_defs [8/11] (#127845)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127845
Approved by: https://github.com/oulgen
ghstack dependencies: #127842, #127843, #127844
2024-06-08 18:49:56 +00:00
c8e117fb76 Tiny comments improvement (#123426)
Fixed a typo in `functional.py` and moved a comment line to the correct place in `transformer.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123426
Approved by: https://github.com/mikaylagawarecki
2024-04-05 17:25:42 +00:00
50073248ed add a note wrt torch.nn.functional.scaled_dot_product_attention (#120668)
followup change of https://github.com/pytorch/pytorch/pull/120565

- Added a note in the transformer class pointing out that the mask definition is opposite to that of :attr:`attn_mask` in `torch.nn.functional.scaled_dot_product_attention`.
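Concretely, the two boolean conventions are opposites (a hedged sketch):

```python
import torch

# nn.Transformer*-style boolean mask: True means "masked out" (no attention).
transformer_mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)

# F.scaled_dot_product_attention attn_mask: True means "allowed to attend",
# i.e. the logical negation of the convention above.
sdpa_mask = ~transformer_mask
```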
@mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120668
Approved by: https://github.com/mikaylagawarecki
2024-02-28 21:16:34 +00:00
9c55aa6ff6 TransformerEncoder/Decoder: add type hints (#120550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120550
Approved by: https://github.com/mikaylagawarecki
2024-02-28 19:36:08 +00:00
64660b51f6 Add the hyperlink of the transfomer doc (#120565)
Fixes #120488

- The shapes for the forward pass are clearly stated in the main [transformer class](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html)

- The boolean `*_key_padding_mask` masks are also explained in the main transformer class.

Therefore, add the hyperlink to the transformer class explicitly so the user can refer back to the main class. Also, correct several symbols in the transformer doc from normal text style to math style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120565
Approved by: https://github.com/mikaylagawarecki
2024-02-26 23:11:58 +00:00
d0cf2182ea Fix TransformerEncoderLayer for bias=False (#116760)
Fixes https://github.com/pytorch/pytorch/issues/116385

Don't call `torch._transformer_encoder_layer_fwd` when `bias=False`

`bias=False` was not something that `torch._transformer_encoder_layer_fwd` was meant to work with; it was my bad that this wasn't tested when I approved https://github.com/pytorch/pytorch/pull/101687.

`bias=False` was causing the `tensor_args` in [`TransformerEncoder`](a17de2d645/torch/nn/modules/transformer.py (L663-L677)) to contain `None`s and error on checks for the fastpath like `t.requires_grad for t in tensor_args`.

Alternative fixes would be to:
1) Pass `torch.zeros_like({*}.weight)` to the kernel when `bias=False` and filter `tensor_args` as appropriate
2) Fix `torch._transformer_encoder_layer_fwd` to take `Optional<Tensor>` for biases and fix the kernels as appropriate

Let me know if these approaches are preferable
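A minimal repro of the failure mode (hedged; shapes are illustrative):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=8, nhead=2, bias=False,
                                   batch_first=True)
layer.eval()
with torch.no_grad():
    # Previously this could reach torch._transformer_encoder_layer_fwd with
    # None biases and error; with the fix, bias=False takes the reference path.
    out = layer(torch.randn(2, 4, 8))
```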

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116760
Approved by: https://github.com/jbschlosser
2024-01-05 00:13:10 +00:00
0f6f582c0d Add config to disable TransformerEncoder/MHA fastpath (#112212)
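Usage looks like this (assuming the `torch.backends.mha` knob this PR adds):

```python
import torch

# Globally disable the TransformerEncoder / MultiheadAttention fastpath:
torch.backends.mha.set_fastpath_enabled(False)
assert torch.backends.mha.get_fastpath_enabled() is False
```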
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112212
Approved by: https://github.com/jbschlosser
2024-01-02 23:59:30 +00:00
6d5fe07659 Fix numpy warning when importing torch without numpy installed (#115867)
Fixes #115638

I verified locally that with no numpy installed, the warning no longer occurs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115867
Approved by: https://github.com/soulitzer
2023-12-15 02:22:12 +00:00
5540d276ce Fix docstring errors in container.py, _functions.py, transformer.py, comm.py, parallel_apply.py, data_parallel.py, scatter_gather.py (#113250)
Fix docstring errors in container.py, _functions.py, transformer.py, comm.py, parallel_apply.py, data_parallel.py, scatter_gather.py

Fixes #112603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113250
Approved by: https://github.com/mikaylagawarecki
2023-11-10 21:07:25 +00:00
63d45275f4 is causal hints for transformer (#106143)
Summary:
make is_causal hint flags available for the top level transformer module.

It's debatable whether this is useful -- at present we autodetect causal masks for src and tgt masks in the transformer encoder and decoder, respectively. Making is_causal flags available would enable users to short-cut this check by asserting whether their mask is causal or not.

I am putting this diff up for discussion, not as a solution.  Not doing anything may be the right solution, unless there is strong (data-driven) user demand. -- it appears the consensus is to move ahead with this, as per discussions below.

@cpuhrsch @mikaylagawarecki @jbschlosser @janEbert

Test Plan: sandcastle

Differential Revision: D47373260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106143
Approved by: https://github.com/mikaylagawarecki
2023-08-04 14:16:48 +00:00
3db255020b Clarify the clarification (#106358)
Summary: Clarify the clarification

Differential Revision: D47941982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106358
Approved by: https://github.com/mikaylagawarecki
2023-08-03 16:58:36 +00:00
723bc136a1 Add context for warning about batch_first (#106139)
Summary: Add context for warning about batch_first

Test Plan: sandcastle github

Differential Revision: D47809651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106139
Approved by: https://github.com/mikaylagawarecki
2023-07-27 23:02:05 +00:00
28a4fc8d8a Fix some typos (#105869)
### Description:
- Fixes for typos in comments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105869
Approved by: https://github.com/mikaylagawarecki, https://github.com/Skylion007
2023-07-26 16:23:57 +00:00
66b73b08df Allow disabling bias for Transformer (#101687)
As used by T5 and PaLM, citing "increased training stability for large models" (https://arxiv.org/abs/2204.02311).

Depends on #101683, which allows disabling bias for `LayerNorm`s. Marked as draft due to this.
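With this PR, usage is simply (sketch):

```python
import torch.nn as nn

# bias=False disables the bias of the linear layers and LayerNorms throughout:
model = nn.Transformer(d_model=512, nhead=8, bias=False)
```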
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101687
Approved by: https://github.com/mikaylagawarecki
2023-07-26 13:50:41 +00:00
79c5e33349 [BE] Enable ruff's UP rules and autoformat nn/ mps/ and torch/ (#105436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105436
Approved by: https://github.com/malfet, https://github.com/albanD
2023-07-21 07:38:46 +00:00
11b753af01 Refactor causal mask generation and detection for nn.transformer (#105265)
Summary:
* Create a private global-scope function _generate_subsequent because static class attribute member functions are not supported by TorchScript, resulting in torchscripting errors.
* Make TransformerEncoder and TransformerDecoder consistent w.r.t. is_causal handling by calling _detect_causal_mask
* Clarify documentation that is_causal is a hint
* Move causal mask detection into a method _detect_causal_mask
* Only accept an input-size-compatible causal mask as a causal mask
* Update _generate_subsequent_causal_mask to include factory kwargs for dtype and device: avoid extra copies & conversions by passing them directly to torch.full (see the sketch below)
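A sketch of the factory-kwarg pattern from the last bullet (signature simplified):

```python
import torch

def _generate_subsequent_causal_mask(sz: int, device=None, dtype=None):
    # Factory kwargs are forwarded straight to torch.full, so the mask is
    # created on the right device/dtype with no extra copy or conversion.
    return torch.triu(
        torch.full((sz, sz), float("-inf"), dtype=dtype, device=device),
        diagonal=1,
    )
```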

Test Plan: sandcastle & github CICD
Continuation of #101487 (due to a tooling issue) which is a continuation-in-part of https://github.com/pytorch/pytorch/pull/98327 by @janEbert

Differential Revision: D47427117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105265
Approved by: https://github.com/mikaylagawarecki
2023-07-19 01:26:50 +00:00
07a1c3f7ff Exercise subclass of TransformerEncoderLayer (#105297)
Summary: Exercise subclass of TransformerEncoderLayer
Additional unit tests for the change in #102045 to show correct e2e operation (cf. issue #100188)

Also: remove batch_first from the list of TS module constants where it is not used, resolving a torchscripting warning

Test Plan: sandcastle, github

Differential Revision: D47503004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105297
Approved by: https://github.com/davidberard98
2023-07-17 16:03:10 +00:00
c7a76d9be5 Replace use of first_layer in init with encoder_layer argument to init (#104058)
Summary:
Replace use of `first_layer` in init with `encoder_layer` argument to init
(better engineering)

Test Plan: sandcastle, github CI

Differential Revision: D46940537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104058
Approved by: https://github.com/mikaylagawarecki
2023-07-12 05:31:15 +00:00
58feefa4ed add custom device support for special nn.modules (#103419)
Fixes #103818
1. For some special nn.Modules, there are checks which only support cuda, so I add a `privateuse1` check.
2. When getting the device type for `privateuse1` via `torch._C._get_privateuse1_backend_name()`, it errors under `torch.jit.script`, so I add a global variable to avoid this (see the sketch below).
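A hedged sketch of the global-variable workaround described in item 2:

```python
import torch

# Cache the backend name at import time so torch.jit.script never has to
# script the torch._C call itself.
_PRIVATEUSE1_BACKEND_NAME = torch._C._get_privateuse1_backend_name()

def _is_privateuse1(device_type: str) -> bool:
    return device_type == _PRIVATEUSE1_BACKEND_NAME
```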
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103419
Approved by: https://github.com/albanD
2023-06-26 00:58:29 +00:00
4d89489df5 Move static checks of layers[0] (e.g., isinstance check) to model build time (#102045)
Summary: Move static checks of layers[0] (e.g., the isinstance check) to model build time, because isinstance() does not work in torchscripted code. Since the validation is now performed while constructing the object, the isinstance() call runs in eager mode at build time, and we avoid calling isinstance() at runtime to determine whether a model's layers are instances of the TransformerEncoderLayer class or its derived classes.
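A sketch of the pattern (simplified; not the exact code):

```python
import copy
import torch.nn as nn

class MyTransformerEncoder(nn.Module):
    def __init__(self, encoder_layer, num_layers):
        super().__init__()
        # Validate once, eagerly, at build time; the torchscripted
        # forward() never needs to call isinstance().
        self.layer_is_supported = isinstance(
            encoder_layer, nn.TransformerEncoderLayer)
        self.layers = nn.ModuleList(
            copy.deepcopy(encoder_layer) for _ in range(num_layers))
```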

Test Plan: sandcastle, github

Differential Revision: D46096222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102045
Approved by: https://github.com/mikaylagawarecki
2023-05-30 19:42:01 +00:00
2361f7f0ce Update doc strings to make description of is_causal consistent for nn.Transformer and nn.MHA (#101089)
Summary: Update doc strings to make description of is_causal consistent for nn.Transformer and nn.MHA

Test Plan: sandcastle & github CI/CD

Differential Revision: D45737197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101089
Approved by: https://github.com/mikaylagawarecki
2023-05-13 18:14:38 +00:00
e5b065525b Add unit test for nested_tensor input to nn.TransformerEncoder (#100650)
Summary: Add unit test for nested_tensor input & fix

Test Plan: sandcastle

Differential Revision: D45580393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100650
Approved by: https://github.com/jbschlosser
2023-05-05 23:34:14 +00:00
8430430e94 Handle trailing masked column behavior for nested tensor (#100113)
Summary:
Handle trailing masked column behavior for nested tensors by padding during to_padded to the original tensor size

https://github.com/pytorch/pytorch/issues/97111

Test Plan: sandcastle & github

Differential Revision: D45167874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100113
Approved by: https://github.com/bertmaher, https://github.com/cpuhrsch, https://github.com/drisspg
2023-05-03 00:30:17 +00:00
c757647dd8 [Better Transformer] make is_causal a hint and force attn_mask to be set on is_causal=True in F.MHA (#97214)
Summary:
This fixes an issue raised in [is_causal parameter in torch.nn.TransformerEncoderLayer.forward does not work #96941](https://github.com/pytorch/pytorch/issues/96941) where results computed with is_causal do not properly reflect causal masking.

In PyTorch 2.0, Accelerated PT Transformers added the is_causal parameter to legacy nn.Transformer* and nn.MHA APIs aligned with and intended to engage the is_causal parameter of the new scaled_dot_product_attention (SDPA) operator.

At present, is_causal works differently for the Transformer* modules, nn.MHA, and F.MHA:
* The nn.Transformer* modules treat is_causal as an optional indicator about the format of attn_mask. This is because some layers (such as the CLIP layer) use the attention mask inside the layer, and thus attn_mask was a required feature.
* Initially, nn.MHA and F.MHA were defined to align with F.SDPA in behavior: a user may specify either the attention mask or is_causal, but not both. It seemed to make sense at the time to align SDPA and MHA, especially since there was a larger overlap of parameters, which has since changed, e.g., with the removal of need_weights from SDPA. (See below for why this makes sense.)

Unfortunately, this does not work because of how MHA was changed to handle the need_weights parameter. When need_weights is present, we do not (any more) call SDPA, because support for need_weights was removed from SDPA before the release. The rationale is that need_weights defeats all optimization at the foundation of SDPA performance. Having the flag might thus mislead users into thinking they get good performance, only to be disappointed when they enable a legacy feature of MHA that massively degrades performance. (They might not think anything of enabling it, because it is on by default in MHA today, which leads to more issues.)

Since SDPA no longer supports need_weights, we need to pick a separate path which implements attention using a set of discrete operations that allocate a tensor for weights. Alas, this code path does not support is_causal, because attention is implemented as a matmul using the attention mask. Thus, is_causal has no impact. (A substantially similar situation arises with how key_padding_mask is implemented today, because Nested Tensors are not supported by torch.compile() in 2.0.)

This problem was masked because all uses of legacy nn.MHA (and F.MHA) come through nn.Transformer*, which called self-attention (i.e., nn.MHA) only ever with the attention mask attn_mask and never with is_causal, a missed optimization opportunity that would have been addressed in a future performance update.

Regrettably, always calling nn.MHA with attn_mask prevented diagnosing the issue of not having a suitable attention mask when need_weights support was dropped from SDPA and a discrete implementation of attention was added for that scenario, as well as for the execution path with key_padding_mask.

We have two options to address this issue:

Solution 1: Whenever nn.MHA and F.MHA are executed with is_causal set, we internally create a causal mask, at the significant expense of allocating a tensor and filling it with a triangular causal matrix. This increases memory usage and runtime. To add insult to injury, in all current (and likely future) execution scenarios, MHA is called by a model using the nn.Transformer API, which already has that matrix and passes it from nn.module to nn.module. The passing in of attn_mask then has to be suppressed by nn.TransformerEncoderLayer, only for nn.MHA to immediately allocate the very same tensor again to satisfy the requirement to have an attention mask for the computation. (We expect new use cases to use SDPA directly.)

Solution 2: We align the behavior of nn.MHA and F.MHA with the rest of the existing nn.Transformer API, and require the attention mask to be passed into nn.MHA in addition to is_causal as an optional indicator about the nature of the attention mask rather than as an alternative to attn_mask.  Then, when we choose the code path for processing MHA with need_weights or a key_padding_mask, we have the attn_mask passed down through the nn.Transformer* hierarchy, without the added overhead of allocating an attention mask as in scenario 1.

This PR implements solution 2, which offers better performance and, in retrospect, aligns MHA better with the rest of the Transformer modules as the definition of SDPA evolved into a more streamlined high-performance operator. It ostensibly changes how is_causal works by requiring the attention mask to be specified. However, as described here and as shown in the submitted issue, is_causal is not working as intended today, so it requires a change regardless.

In that sense, a change in API does not occur per se, as the current implementation is not working, and a change has to occur either way to resolve the submitted issue, breaking any use cases that depend on the current implementation. Checks exist (and more can be added) that flag any scenario where is_causal is passed as True but no attention mask is provided, ensuring that there is no quiet change from even the faulty behavior present in 2.0.

As an upside, the present implementation will improve performance by addressing the passing of the is_causal flag from Transformer modules to MHA, speeding up training for examples such as finetuning BERT, RoBERTa, and XLM-R models.
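Under solution 2, the call shape is (hedged sketch):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(2, 4, 8)
mask = nn.Transformer.generate_square_subsequent_mask(4)
# attn_mask is still required; is_causal=True is only a hint about it.
out, _ = mha(x, x, x, attn_mask=mask, is_causal=True)
```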

Differential Revision: D44245725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97214
Approved by: https://github.com/albanD
2023-03-25 01:36:30 +00:00
95c166cd3d Add is_causal API for TransformerDecoder (#97166)
The same API is implemented for `TransformerEncoder`, where this argument is passed through to the sublayers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97166
Approved by: https://github.com/mikekgfb
2023-03-24 20:00:53 +00:00
03b6e6979c Transformers: fix src and key padding mask bool regression (#96009)
Summary: fix src and pad mask bool regression

This fixes a regression introduced with #92733. That PR unified mask handling to remove byte tensors as a permissible mask type, introduced a mask compatibility check, and added conversion to floating-point masks. The problem addressed in this PR is that after the first mask had been converted, the check for mask compatibility would fail.

Test Plan: sandcastle & github

Differential Revision: D43782858

Fixes  https://github.com/pytorch/pytorch/issues/95702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96009
Approved by: https://github.com/malfet
2023-03-05 01:50:46 +00:00
5b1cedacde [BE] [2/3] Rewrite super() calls in functorch and torch (#94588)
Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied.

- #94587
- #94588
- #94592

Also, methods with only a `super()` call are removed:

```diff
class MyModule(nn.Module):
-   def __init__(self):
-       super().__init__()
-
    def forward(self, ...):
        ...
```

Cases where the rewrite would change the semantics are kept unchanged. E.g.:

f152a79be9/caffe2/python/net_printer.py (L184-L190)

f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94588
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-10 21:16:33 +00:00
ffb3561caa [Docs] Add pointer to FlashAttention paper (#94253)
As discussed with @drisspg, we're adding pointers to the docs for MHA and Transformers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94253
Approved by: https://github.com/drisspg, https://github.com/malfet
2023-02-07 08:05:10 +00:00
7265f60ad0 Regularize mask handling for attn_mask and key_padding_mask (#92733)
Summary:
Regularize mask handling for attn_mask and key_padding_mask
* Update documentation to remove reference to byte masks (which were deprecated long ago)
* Introduce a check and warn about deprecation if the attn_mask and key_padding_mask types mismatch
* Convert all masks to float before combining
* Combine by adding (see the sketch below)
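A hedged sketch of the convert-then-add rule from the last two bullets:

```python
import torch

bool_attn_mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
float_key_padding = torch.zeros(1, 4)  # already a float mask

# Convert the boolean mask to float (True -> -inf), then combine by adding:
float_attn_mask = torch.zeros_like(bool_attn_mask, dtype=torch.float)
float_attn_mask.masked_fill_(bool_attn_mask, float("-inf"))
combined = float_attn_mask + float_key_padding  # broadcasts over rows
```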

Test Plan: sandcastle & github CI

Differential Revision: D42653215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92733
Approved by: https://github.com/ngimel, https://github.com/drisspg
2023-01-24 14:12:05 +00:00
af589b3d1f switch causal mask for is_causal flag (#91171)
Summary: switch causal mask for is_causal flag

Test Plan: sandcastle & github

Differential Revision: D42089340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91171
Approved by: https://github.com/wushirong, https://github.com/drisspg
2022-12-30 17:24:58 +00:00
93cb580677 lint transformer.py (#91048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91048
Approved by: https://github.com/ZainRizvi, https://github.com/kit1980, https://github.com/ezyang
2022-12-16 23:51:42 +00:00
512ec181ec Introduce causal mask (#90508)
Summary: Introduce causal mask

This PR introduces a causal mask option _causal_mask (as well as causal mask detection if attn_mask is provided), since current custom kernels do not support arbitrary masks.

Test Plan: sandcastle & github ci/cd

Differential Revision: D41723137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90508
Approved by: https://github.com/albanD
2022-12-16 21:39:42 +00:00
18c1f2f82e [torch] [analytics] add pytorch event logger callsites to transformers and encoder/decoders (#88896)
Differential Revision: D41227275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88896
Approved by: https://github.com/mikekgfb
2022-11-15 20:35:36 +00:00
7ad87f63e2 Support src_mask and src_key_padding_mask for Better Transformer (#88488)
Fixes T135842750 (follow-up for #87377)

## Description

At present, having both `src_key_padding_mask` and `src_mask` at the same time is not supported on the fastpath in Transformer and Multi-Head Attention.

This PR enables using both masks on the fastpath on CPU and GPU: if both masks are passed, we merge them into a 4D mask in Python and change the mask type to 2 before passing it downstream (a shape sketch follows the device notes below).

Downstream processing in native code is not changed, as it already supports 4D masks. It is handled depending on the device:
- on CUDA, by `SoftMax.cu::masked_softmax_cuda`. When the mask type is 2, it calls either `dispatch_softmax_forward` -> `softmax_warp_forward` or `at::softmax` (depending on the input size). In both cases a 4D mask is supported.
- on CPU, by `SoftMax.cpp::masked_softmax_cpp`. It calls `host_softmax`, which supports a 4D mask.
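A hedged sketch of the Python-side merge (only the shape logic; names are illustrative):

```python
import torch

N, num_heads, L, S = 2, 4, 5, 5
src_mask = torch.zeros(L, S)          # float mask, 0.0 = attend
key_padding_mask = torch.zeros(N, S)  # float mask, 0.0 = attend

# Broadcast both into a single (N, num_heads, L, S) mask -> mask type 2:
merged = (src_mask.view(1, 1, L, S) +
          key_padding_mask.view(N, 1, 1, S)).expand(N, num_heads, L, S)
```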

## Tests
- Extended `test_mask_check_fastpath` to check that fast path is indeed taken in Transformer when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path_mock` to check that fast path is taken in MHA when two masks are passed
- Added `test_multihead_self_attn_two_masks_fast_path` to check that fast and slow paths give the same result when two masks are passed in MHA
- `test_masked_softmax_mask_types` now covers mask type 2
- `test_transformerencoderlayer_fast_path` (CPU smoke test) is expanded to the case of both masks provided simultaneously
- `test_masked_softmax_devices_parity` checks that mask type 2 is accepted by CPU and CUDA paths

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88488
Approved by: https://github.com/mikekgfb
2022-11-10 08:12:56 +00:00