Commit Graph

2227 Commits

Author SHA1 Message Date
b85f21fc1d Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136653
2024-10-01 10:23:22 +00:00
c610aa80dc Testing: Unblock new_* testing on MPS (#137003)
By changing `other_dtype` to `torch.half` rather than `double` in
`sample_inputs_new_fns` if MPS is available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137003
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986
2024-09-30 19:06:12 +00:00
370c1c4297 [aotd] Fix rrelu compilation (#136008)
Issues:
https://github.com/pytorch/pytorch/issues/135083
https://github.com/pytorch/pytorch/issues/120292

rrelu decomposition contains mutation, copy_. Decompositions are executed below Functionalization, as a result AOT produces non-functional graph.

Also that decomposition is registered as python_dispatch kernel for AutogradCUDA.
Autograd dispatch happens above Functionalization, so registering it for Autograd to handle all backends makes functionalization running after this.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_rrelu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136008
Approved by: https://github.com/bdhirsh
2024-09-25 11:26:19 +00:00
efed357ef5 Add dtypes support in opinfo for Intel Gaudi (#132840)
## Motivation
This is following up on changes introduced in https://github.com/pytorch/pytorch/pull/128584
we are adding the dtype information to be picked up while executing the UTs for Intel Gaudi/HPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132840
Approved by: https://github.com/albanD
2024-09-24 17:17:15 +00:00
f276da7f98 Remove prims.slice_in_dim and prims.slice (#136150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136150
Approved by: https://github.com/ezyang
2024-09-23 01:27:22 +00:00
b6ffa381e1 [BE]: Add half CUDA support nextafter (#136373)
Making CUDA support match CPU support for nextafter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136373
Approved by: https://github.com/ezyang
2024-09-21 17:13:45 +00:00
0c936c3ecb Add decomps for max_unpool (#133146)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133146
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-20 21:35:25 +00:00
15dba021bb [ROCm][CI] upgrade CI to ROCm 6.2 (#132555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132555
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-09-20 17:39:31 +00:00
bce52d0b60 [CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288)
Summary:
To facilitate PSS-2 upgrade, this uses `ndt.NDArray` instead of `nd.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `nd.ndarray` -- a noop.
In Numpy-1.24, `ndt.NDArray` a proper generic type, and without this change uses of `nd.ndarray` generate this Pyre type error:
```counterexample
 Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```

Test Plan: Sandcastle plus visual inspection

Differential Revision: D62977370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
2024-09-19 12:40:36 +00:00
462b727d1e Revert "Add decomposition for permute_copy (#130944)"
This reverts commit ab9a7eadd34aee59fc67e29237610b7562cc4ff0.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/jeanschmidt due to Broke internal signal executorch.backends.xnnpack.test.ops.permute.TestPermute, more details on D62737086. @eellison could you please help get this PR merged to main? ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2355846394))
2024-09-17 13:42:55 +00:00
2c4ae81494 Revert "Add decomposition for squeeze_copy (#130941)"
This reverts commit c33b0580e6a702be0cd5be691b3b465da012aa34.

Reverted https://github.com/pytorch/pytorch/pull/130941 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/130944, after fixing any merge conflicts, feel free to merge it back ([comment](https://github.com/pytorch/pytorch/pull/130941#issuecomment-2355831480))
2024-09-17 13:39:07 +00:00
23c0d2689e [BE][Ez]: Fix missing float16 coverage for adaptive_pool3d_cpu (#136091)
Testing if op info coverage has issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136091
Approved by: https://github.com/ezyang
2024-09-16 18:22:16 +00:00
b491e2974c [BE][Ez]: Add full half/bfloat16 dtype for unique and isin (#136114)
Fixes #136090

* Add support for isin to tensor half dtypes for CPU (just add a few extra dispatches).
* Seems like the CUDA implementation for bfloat16 was mostly compiled and available all along (it just calls sort internally AND unique). To enable it, we just need to remove an assert to access it (since sort's functionality was updated since the assert was added) and add missing dtype support to unique.
* This unlocks more GPU functionality with minimal code bloat. I also added CPU kernels for the dtypes for parity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114
Approved by: https://github.com/malfet
2024-09-16 17:49:12 +00:00
c33b0580e6 Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-16 15:46:57 +00:00
ab9a7eadd3 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-15 19:35:14 +00:00
db393fb95e Add Half support for reflection and replication padding on CPU (#135931)
Fixes #135680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135931
Approved by: https://github.com/Skylion007
2024-09-14 14:18:55 +00:00
1786a17fed Revert "Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)"
This reverts commit 51c52061339069a2162e921e5b464fad5a411522.

Reverted https://github.com/pytorch/pytorch/pull/135232 on behalf of https://github.com/CaoE due to wrong commit ([comment](https://github.com/pytorch/pytorch/pull/135232#issuecomment-2350792806))
2024-09-14 02:31:06 +00:00
51c5206133 Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)
Use `_amp_foreach_non_finite_check_and_unscale_` instead of fallback version for CPU grads of `ShardedGradScaler ` as `_amp_foreach_non_finite_check_and_unscale_ ` is supported on CPU https://github.com/pytorch/pytorch/pull/109281.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232
Approved by: https://github.com/ezyang
2024-09-14 02:20:58 +00:00
9a04cfbeff fix for fp16 (#134106)
This PR is a replacement for https://github.com/pytorch/pytorch/pull/133085 for pushing a quick fix for RMSNorm.
The original author is @kkontny

Previous PR summary:
Since FP16 has quite small dynamic range it is very easy to overflow while computing `at::pow(input, 2)` , and it happens in real world computation.

I've tried to use `nn.RMSNorm` fused implementation instead of `LlamaRMSNorm` inside `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in Fp16 while still giving good in FP32. I figured out happens due to overflow while computing square of the input tensor.

Original `LLamaRMSNorm` implementation upcasts input to fp32 to prevent this and give better numerical stability.

```
class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```

Proposed commit fixed the issue. FP16 in RMSNorm has to be treated in special way, to be usable in real world implementations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134106
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy
2024-09-11 22:02:07 +00:00
74fd1bf965 [ROCm] Update to AOTriton 0.7b (#134498)
Notable changes:
1. Enable CudaGraph related tests
2. Fix UT problems
3. EXPERIMENTAL Navi31 support. User should enable Navi31 support with Env Var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Know Problem:
1. `test/test_transformers.py` will massive failures and/or NaN outputs with `--use-pytest`
    + Update: Confirmed skip `class TestSDPAPrivateUse1Only` can fix the problem with `--use-pytest`

Note:
AOTriton 0.7b adds support to nestedtenosrs+SDPA but need more work (and consequently a separate PR) to enable it.

Fixes #133540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-09-11 20:34:01 +00:00
e05ea2b179 Add decomposition for transpose_copy (#130943)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130943
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-11 19:45:22 +00:00
a15aabc975 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-09-06 19:06:23 +00:00
43f4947d44 fix fake tensor tolist implementation (#135131)
Summary:
When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies.

Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we essentially desugar to `item()` calls and let it take care of unbacked symints.

Test Plan:
Some expected failures are gone now.
Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but not really doing much; repurposed it now to exercise more modes.

Differential Revision: D62197742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131
Approved by: https://github.com/ezyang
2024-09-05 23:20:31 +00:00
eqy
4f70b3cfae [CUDA][complex][TF32] Update test_noncontiguous_samples tolerances for complex64 (#134526)
Recent cuDNN heuristics change surfaces same TF32 issue as `float32`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134526
Approved by: https://github.com/ezyang
2024-09-04 23:37:16 +00:00
758d787901 Added complex support for torch.logsumexp (#133187)
Added complex support for `torch.logsumexp`. Implemented complex backward pass for `torch.logsumexp`.

Fixes #133047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133187
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-09-03 17:28:36 +00:00
6fce1faa10 change multinomial to use async asserts instead of a synchronization (#134818)
Fixes https://github.com/pytorch/pytorch/issues/134442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134818
Approved by: https://github.com/ezyang
ghstack dependencies: #134813
2024-09-03 06:33:24 +00:00
e21d7b77ce Update ForeachfuncInfo.sample_inputs_func to yield scalars & scalarlists that are more friendly to test_meta (#134552)
for `test_meta.py` to see more "PASSED" instead of "XFAIL".

`pytest test_meta.py -k "_foreach_"` ran 6400 test cases and:
- This PR: 4702 passed, 260 skipped, 73732 deselected, 1698 xfailed
- main (92c4771853892193d73d87bd60eca4dc7efc51d8): 3906 passed, 260 skipped, 73732 deselected, 2494 xfailed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134552
Approved by: https://github.com/janeyx99
2024-08-30 17:30:50 +00:00
f997b2b8e6 Revert "Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)"
This reverts commit f685018ea9d08f98cbd7106028db134f967f74d3.

Reverted https://github.com/pytorch/pytorch/pull/125262 on behalf of https://github.com/ZainRizvi due to Hi, this PR appears to be calling maskedtensor tests to fail on main. Please rebase your changes onto the latest trunk build to repro the failure. test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604716811/job/29393256312) [HUD commit link](f685018ea9) ([comment](https://github.com/pytorch/pytorch/pull/125262#issuecomment-2316387447))
2024-08-28 23:10:07 +00:00
f685018ea9 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-08-28 21:30:39 +00:00
289486d007 Move attention kernels back from fake_impls to meta_registrations (#134288)
See #121528 for additional context.

In #120682, we moved the attention kernels from meta_registrations to fake_impls with the intent of fixing the device handling for seed/offset: these are typically on CPU. We needed to put the registrations in fake_impls to do this because meta_registrations doesn't have a way to specify device, whereas fake_impls does. But when we tried to actually fix the device types (#120839), we had to revert the PR because it broke cudagraph handling (during which seed/offset _are_ on CUDA).

Now, we want to put the registrations back in meta_registrations so that we can call these kernels with meta tensors. The use case is later in this stack - we want to be able to use the flop counter with these kernels.

Also - I specifically skip the `compare_tensor_meta()` check in test_fake / test_fake_autocast tests for the `_efficient_attention_forward` and `_flash_attention_forward` kernels, which fails because of the device mismatch from the seed/offset tensors. Then we can un-skip these opinfos. I verified that the efficient_attention_forward bug (#120842) is now caught by these opinfos if I revert the fix from this PR.

Differential Revision: [D61687369](https://our.internmc.facebook.com/intern/diff/D61687369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134288
Approved by: https://github.com/drisspg
2024-08-27 21:10:36 +00:00
55236d0cb7 TestForeach::test_parity: Remove check for error message text (#134251)
Previously, error messages were expected to be string equivalent to
error messages thrown by the ref function.  This check fails for dozens
of torch functions, and doesn't appear to add much value for the end
user.  This commit removes this check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134251
Approved by: https://github.com/amjames, https://github.com/janeyx99
ghstack dependencies: #134253, #134344
2024-08-26 22:40:54 +00:00
ef8c474fcf Add the fast path for bfloat16 lgamma (#134344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134344
Approved by: https://github.com/amjames, https://github.com/janeyx99
ghstack dependencies: #134253
2024-08-26 22:40:54 +00:00
3c5883e550 Fix test_parity xfail for sigmoid (#134253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134253
Approved by: https://github.com/amjames, https://github.com/janeyx99
2024-08-26 22:40:54 +00:00
a0e062c6f1 Add mean.dtype_out (#133506)
Give it a try and see if CI is happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133506
Approved by: https://github.com/bdhirsh
2024-08-26 19:26:11 +00:00
27d97b9649 Remove unnecessary test skip (#134250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134250
Approved by: https://github.com/amjames, https://github.com/janeyx99
2024-08-26 14:34:53 +00:00
83b5d449a3 Add full float16/bfloat16 support to MaxUnPool (#133774)
It already supported half so might as well add bfloat16 support for parity

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133774
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-22 13:34:43 +00:00
938f37b745 Added batching rule for sdpa_math, sdpa_efficient_attention forward, cudnn, and flash attention (#133964)
Fixes https://github.com/pytorch/pytorch/issues/117016, https://github.com/pytorch/pytorch/issues/102457, https://github.com/pytorch/pytorch/issues/110525, https://github.com/pytorch/pytorch/issues/108065,

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133964
Approved by: https://github.com/Skylion007
2024-08-22 05:29:49 +00:00
fb26b84390 Update fused kernels and call _safe_softmax from SDPA (#133882)
# UPDATE:
This is  take 3 of https://github.com/pytorch/pytorch/pull/131863 which was landed via co dev but not applying correclty

# Summary
Changes the stance of SDPA on what to do for fully masked out rows

## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963

These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617

Can be paraphrased as follows:

When passing in fully masked out rows, attention becomes ambiguous. We have two main options:

1. Uniformly attend to all values:
   ```python
   scores[masked_out_rows] = 1 / len(row)
   out[masked_out_rows] = 1 / len(row) * value
   ```

2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
   ```python
   output[fully_masked_rows] = NaN
   ```

We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([(fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happends when you call backwards..
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
        [       nan,        nan,        nan,        nan]])
```
Those pesky NaNs are back!

## Why do we see NaNs today?

The core of the problem revolves around using softmax function in sdpa:

```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```

## Quick Aside: Masking in Attention

Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.

We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.

## Alternative Approaches

If we use a very large negative number instead of -inf:

```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564,  0.1613, -0.0486],
        [ 0.0000,  0.0000,  0.0000,  0.0000]])
```
This would bring us back into a better state.

## A Third Option

We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.

This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```

**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.

## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel

_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact instead of decomposing softmax and checking for -inf rows we instead "cheat" and use nan_to_num.

Why I think this is okay? (please find a counter point if avail)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?

The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`

Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`

If we dont want to even allow for the possibility of "inf" or "NaN" attention scores to be converted to 0 then we can implemented it something like this

```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.

## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.

Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882
Approved by: https://github.com/soulitzer
2024-08-19 18:53:11 +00:00
215b14530a Add Half for sparse.mm reduce (#133672)
This PR is to add Half support for sparse.mm reduce in CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133672
Approved by: https://github.com/Skylion007
2024-08-17 15:20:39 +00:00
cfec69e2a1 Revert "Update fused kernels and call _safe_softmax from SDPA (#131863)"
This reverts commit caba37e99b03d2199848197de4e452b78c8c2a23.

Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/izaitsevfb due to breaks executorch test executorch/backends/apple/coreml:test - test_vit_skip_conv (executorch.backends.apple.coreml.test.test_coreml_partitioner.TestCoreMLPartitioner) ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2291855634))
2024-08-15 17:55:07 +00:00
ec49ce5f8e [CUDA]: Add frexp CUDA bfloat16 support (#133313)
Fixes #133263 Add CUDA bfloat16 support to cuda_frexp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133313
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-08-15 15:20:00 +00:00
caba37e99b Update fused kernels and call _safe_softmax from SDPA (#131863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863
Approved by: https://github.com/jbschlosser, https://github.com/Chillee
2024-08-13 23:37:50 +00:00
7be77658e9 [Inductor] support masked vectorization for the tail_loop for INT8 datatype (#131155)
This PR supports masked vectorization for the tail_loop for torch.uint8 and torch.int8 datatype to improve performance.
BTW, I fixed the UT of `byte` by setting the range of the sample inputs  to [0, 255] since the range of `torch.uint8` is [0, 255].

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131155
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130724
2024-08-13 01:12:05 +00:00
1434e0b121 Add a private _safe_softmax (#131060)
# Summary
Changes the stance of SDPA on what to do for fully masked out rows

## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963

These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617

Can be paraphrased as follows:

When passing in fully masked out rows, attention becomes ambiguous. We have two main options:

1. Uniformly attend to all values:
   ```python
   scores[masked_out_rows] = 1 / len(row)
   out[masked_out_rows] = 1 / len(row) * value
   ```

2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
   ```python
   output[fully_masked_rows] = NaN
   ```

We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([(fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happends when you call backwards..
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
        [       nan,        nan,        nan,        nan]])
```
Those pesky NaNs are back!

## Why do we see NaNs today?

The core of the problem revolves around using softmax function in sdpa:

```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```

## Quick Aside: Masking in Attention

Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.

We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.

## Alternative Approaches

If we use a very large negative number instead of -inf:

```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564,  0.1613, -0.0486],
        [ 0.0000,  0.0000,  0.0000,  0.0000]])
```
This would bring us back into a better state.

## A Third Option

We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.

This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```

**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.

## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel

_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact instead of decomposing softmax and checking for -inf rows we instead "cheat" and use nan_to_num.

Why I think this is okay? (please find a counter point if avail)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?

The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`

Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`

If we dont want to even allow for the possibility of "inf" or "NaN" attention scores to be converted to 0 then we can implemented it something like this

```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.

## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131060
Approved by: https://github.com/jbschlosser
2024-08-08 23:09:38 +00:00
8bc5ef563e Grouped Query Attention (#132689)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the last third dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes enable_gqa=True shows a substansial improvement in the run_time of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Differential Revision: D60772086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132689
Approved by: https://github.com/drisspg
2024-08-07 05:35:36 +00:00
bcb4f7c172 Revert "Grouped Query Attention (#128898)"
This reverts commit 6b28af1b79eaa63e2f423d925bbd42330582983f.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/ZainRizvi due to Sorry, this broke a bunch of tests internally. See D60638265 ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2265961038))
2024-08-02 18:58:46 +00:00
3a355c1891 Correct sample creation of torch.histogram in UT op_db to align PyTorch defined operator semantics (#131630)
Fixes #130916
As the semantics defined in [torch.histogram](https://pytorch.org/docs/stable/generated/torch.histogram.html#torch-histogram), we need an increasing sequence as bins tensor. Random input doesn't make sense for torch.histogram.
The case is a comparison between CPU backend and another backend. When the input is random, kernel implementation in other backends have to totally align with the CPU kernel, or the case fails.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131630
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-08-02 01:51:09 +00:00
72d2dba992 Add None return type to init (#132335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335
Approved by: https://github.com/albanD
2024-08-01 15:26:45 +00:00
6b28af1b79 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the last third dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes enable_gqa=True shows a substansial improvement in the run_time of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-31 22:58:51 +00:00
499ead96ff Revert "Grouped Query Attention (#128898)"
This reverts commit d039b14207fe659d664c590efc06cc0a2abc96c0.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/albanD due to Broken test on main ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2258314481))
2024-07-30 13:11:24 +00:00