Commit Graph

14889 Commits

Author SHA1 Message Date
b5dd60fa75 Fix namespace issues with qnnpack (#134336)
After this I think all `using namespace` will have been eliminated from PyTorch header files. Internally, `-Wheader-hygiene` will prevent more from being added.

Test Plan: Sandcastle

Differential Revision: D61679037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134336
Approved by: https://github.com/Skylion007
2024-08-25 19:50:01 +00:00
eqy
e93ca12c88 [CUDNN][SDPA] Fix unsupported trivial stride-1 transpose case (#134031)
Fixes #134001
This fixes the incorrect assumption that two same-shape tensors being contiguous implies that they also have the same strides.
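
For context, a minimal sketch (not from the PR) of how two same-shape tensors can both report contiguity yet carry different strides, via a trivial size-1 transpose:
```python
import torch

# Both tensors have shape (2, 1) and report is_contiguous() == True,
# but their strides differ, since strides of size-1 dims are ignored
# by the contiguity check.
a = torch.randn(2, 1)                     # stride (1, 1)
b = torch.randn(1, 2).transpose(0, 1)     # stride (1, 2)
print(a.is_contiguous(), b.is_contiguous())  # True True
print(a.stride(), b.stride())                # (1, 1) (1, 2)
```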

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134031
Approved by: https://github.com/drisspg, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-08-25 14:31:30 +00:00
050aa67e41 [traced-graph][sparse] fix restrictive assert for sparse add (#134037)
Exported sparse addition can be on CPU or Meta; this fixes the overly restrictive assert and adds an export test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134037
Approved by: https://github.com/ezyang
2024-08-24 20:26:47 +00:00
50d5aa8c10 Enable optimized dynamic quantization on aarch64 (#126687)
oneDNN+ACL has optimized kernels for s8s8 matmul, so the input must be signed. This change leaves behaviour on all other platforms the same. It requires https://github.com/intel/ideep/pull/313 to land, and oneDNN 3.5 for the optimized kernels. This change speeds up dynamic quantized linear by ~10x.
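
As a rough illustration (not from the PR; module sizes are arbitrary), the affected path is the dynamically quantized linear produced by `quantize_dynamic`:
```python
import torch
import torch.nn as nn

# Dynamic quantization of Linear layers; on aarch64 with oneDNN+ACL the
# underlying s8s8 matmul kernels are the ones this change speeds up.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(32, 1024)
out = qmodel(x)
```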

Also, do you have a policy on copyright headers? Arm's usual policy when contributing to open source projects is to include a copyright header on any file which is modified. Would this be acceptable? If not, is there somewhere else suitable to note copyright?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126687
Approved by: https://github.com/jgong5, https://github.com/malfet, https://github.com/snadampal

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-24 18:40:12 +00:00
195abdb85c ppc64le: VSX Support for Inductor (#132746)
### Description

This PR extends the `VecISA` class to include support for VSX on the `ppc64le` architecture within the Inductor backend. This enhancement enables vectorization support, resulting in performance improvements when using `torch.compile()` on `ppc64le`.
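
A minimal sketch (not from the PR) of the kind of workload that benefits, now that the CPP backend can emit VSX-vectorized code on `ppc64le`:
```python
import torch

@torch.compile
def fused_op(x, y):
    # pointwise chain that Inductor's CPP backend can vectorize with VSX
    return torch.acosh(x.abs() + 1.0) * y

out = fused_op(torch.randn(1024), torch.randn(1024))
```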

### Fixes

- Resolved the `test_acosh_with_negative_large_input` test case in `test_cpu_repro.py` by implementing `acosh` for VSX.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132746
Approved by: https://github.com/jansel
2024-08-24 03:36:09 +00:00
aa9f4cc733 [Inductor][CPP] Support vectorization of remainder (#129849)
**Summary**
While checking the vectorization status across the 3 test suites, we found that some operators disabled vectorization with the message `Disabled vectorization: op: remainder`. This PR adds vectorization support for this op.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```
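
A minimal sketch (not one of the tests above) of a compiled kernel containing `remainder`, which the CPP backend can now vectorize instead of falling back to the scalar path:
```python
import torch

@torch.compile
def mod_kernel(x, y):
    return torch.remainder(x, y) + 1.0

out = mod_kernel(torch.randn(4096), torch.full((4096,), 3.0))
```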

Differential Revision: [D61147014](https://our.internmc.facebook.com/intern/diff/D61147014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-08-23 23:26:51 +00:00
eaa2c0e009 Improves error message when passing wrong tensor type to torch.nn.functional.one_hot (#134209)
The function expects a tensor of type LongTensor. It currently throws the following error: "one_hot is only applicable to index tensor.", which, imo, does not give the user enough information about what the problem is.

This PR simply adds extra information to the error message for this specific scenario.
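
A small repro (not from the PR) of the scenario in question:
```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 2, 1])        # LongTensor: works
print(F.one_hot(labels, num_classes=3))

bad = torch.tensor([0.0, 2.0, 1.0])     # float tensor: not an index tensor
try:
    F.one_hot(bad, num_classes=3)
except RuntimeError as e:
    print(e)  # the message this PR extends with dtype information
```
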
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134209
Approved by: https://github.com/mikaylagawarecki
2024-08-23 22:40:05 +00:00
adf0f50cc7 [Compile] Add NEON implementation for bf16->fp32 cast (#134297)
This changes the assembly generated for the following routine
```cpp
void bfloat16tofloat(c10::BFloat16* in, float* out) {
        auto tmp0 = at::vec::Vectorized<c10::BFloat16>::loadu(in, 8);
        auto tmp1 = at::vec::convert<float>(tmp0);
        tmp1.store(out);
}
```
from
```asm
bfloat16tofloat(c10::BFloat16*, float*):
0000000000000034        stp     x29, x30, [sp, #-0x10]!
0000000000000038        mov     x29, sp
000000000000003c        sub     x9, sp, #0x90
0000000000000040        and     sp, x9, #0xffffffffffffffe0
0000000000000044        mov     x8, #0x0
0000000000000048        adrp    x9, 0 ; 0x0
000000000000004c        ldr     x9, [x9]
0000000000000050        ldr     x9, [x9]
0000000000000054        str     x9, [sp, #0x88]
0000000000000058        stp     xzr, xzr, [sp, #0x10]
000000000000005c        ldr     q0, [x0]
0000000000000060        str     q0, [sp]
0000000000000064        ldr     q1, [sp, #0x10]
0000000000000068        stp     q0, q1, [sp, #0x20]
000000000000006c        add     x9, sp, #0x40
0000000000000070        add     x10, sp, #0x20
0000000000000074        add     x11, x10, x8
0000000000000078        ldp     d0, d1, [x11]
000000000000007c        shll.4s v0, v0, #16
0000000000000080        shll.4s v1, v1, #16
0000000000000084        stp     q0, q1, [x9], #0x20
0000000000000088        add     x8, x8, #0x10
000000000000008c        cmp     x8, #0x20
0000000000000090        b.ne    0x74
0000000000000094        add     x8, sp, #0x40
0000000000000098        ld1.4s  { v0, v1 }, [x8]
000000000000009c        st1.4s  { v0, v1 }, [x1]
00000000000000a0        ldr     x8, [sp, #0x88]
00000000000000a4        adrp    x9, 0 ; 0x0
00000000000000a8        ldr     x9, [x9]
00000000000000ac        ldr     x9, [x9]
00000000000000b0        cmp     x9, x8
00000000000000b4        b.ne    0xc4
00000000000000b8        mov     sp, x29
00000000000000bc        ldp     x29, x30, [sp], #0x10
00000000000000c0        ret
00000000000000c4        bl      0xc4
```
to
```asm
bfloat16tofloat(c10::BFloat16*, float*):
0000000000000034        ldr     q0, [x0]
0000000000000038        shll.4s v1, v0, #16
000000000000003c        shll2.4s        v2, v0, #16
0000000000000040        st1.4s  { v1, v2 }, [x1]
0000000000000044        ret
```

As a result, it speeds up `python3 torchchat.py generate stories110M --num-samples 3 --compile --device cpu --dtype bfloat16` from 33 to 90 tokens/sec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134297
Approved by: https://github.com/kimishpatel
2024-08-23 22:22:59 +00:00
55cdcef0f7 [fp8 rowwise] Work around CUDA Invalid Memory Access bug (#134227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134227
Approved by: https://github.com/drisspg, https://github.com/eqy
ghstack dependencies: #134223, #134224, #134225, #134226
2024-08-23 07:27:55 +00:00
9d81767d43 [fp8 rowwise] Rework dispatch logic (#134226)
It's likely a matter of opinion, but I find this new version to have less duplication, even if it might have more boilerplate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134226
Approved by: https://github.com/drisspg
ghstack dependencies: #134223, #134224, #134225
2024-08-23 07:27:55 +00:00
0afb4872aa [fp8 rowwise] Support non-contiguous inputs and clarify checks (#134225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134225
Approved by: https://github.com/drisspg
ghstack dependencies: #134223, #134224
2024-08-23 07:27:52 +00:00
9f8d3f511f [fp8 rowwise] Some clean-up (#134224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134224
Approved by: https://github.com/drisspg
ghstack dependencies: #134223
2024-08-23 07:27:48 +00:00
2f198605ac [fp8 rowwise] Simplify epilogue visitor tree via common blocks (#134223)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134223
Approved by: https://github.com/drisspg
2024-08-23 07:27:41 +00:00
311af3b988 Add new ops wrapped_linear_prepack and wrapped_quantized_linear_prepacked (#134232)
Summary:
This diff adds two new operators, torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked. They are a decomposition of the op torch.ops._quantized.wrapped_quantized_linear added in the previous diff.

We decomposed it this way because the packed weight can be computed early, so we don't need to repack it in every forward pass in AOTI.

Reviewed By: jerryzh168

Differential Revision: D61395887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134232
Approved by: https://github.com/houseroad
2024-08-23 04:54:26 +00:00
0eb9c870fd [reland][ROCm] TunableOp for gemm_and_bias (#128919)
Reland of #128143, with `alpha` and `bias` initialization added to `launchTunableGemmAndBias`.

Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap.
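
A hedged usage sketch (the environment-variable names are the standard TunableOp toggles and are assumptions here, not part of this PR): enabling TunableOp before a biased linear, whose addmm typically takes the gemm_and_bias path:
```python
import os
# assumption: standard TunableOp toggles; set before any CUDA/ROCm ops run
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_TUNING"] = "1"

import torch

linear = torch.nn.Linear(1024, 1024, bias=True, device="cuda", dtype=torch.half)
x = torch.randn(32, 1024, device="cuda", dtype=torch.half)
out = linear(x)  # biased addmm -> gemm_and_bias, now covered by TunableOp
```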

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128919
Approved by: https://github.com/malfet
2024-08-22 18:27:50 +00:00
83b5d449a3 Add full float16/bfloat16 support to MaxUnPool (#133774)
It already supported half, so we might as well add bfloat16 support for parity.
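
A minimal sketch (not from the PR's tests) of the round trip in bfloat16:
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 1, 4, 4, dtype=torch.bfloat16)
pooled, indices = F.max_pool2d(x, kernel_size=2, return_indices=True)
unpooled = F.max_unpool2d(pooled, indices, kernel_size=2)
print(unpooled.dtype)  # torch.bfloat16
```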

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133774
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-22 13:34:43 +00:00
aea1148d56 [fp8 rowwise] Clarify dtypes (#134114)
Disambiguate some of the dtypes (e.g., for the scales), move the "constant" ones out of the function, and use safe casting functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134114
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111, #134112, #134113
2024-08-22 11:07:39 +00:00
72586ccd14 [fp8 rowwise] Don't build separate kernel for no bias (#134113)
CUTLASS automatically skips a stage in the epilogue if we provide a nullptr. Thus, instead of building a special kernel for bias=None, we can reuse one of the other ones.

This also considerably simplifies the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134113
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111, #134112
2024-08-22 11:07:39 +00:00
d64fa11095 [fp8 rowwise] Fix bias calculation being done in low precision (#134112)
The compute dtype for the bias addition was set to ElementBias. Thus, for a bf16 bias, we would cast the fp32 accum to bf16 and _then_ add the bias. It is however (slightly?) more accurate to first add the bias in fp32 and only cast at the end.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134112
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111
2024-08-22 11:07:34 +00:00
15faed60ca [fp8 rowwise] Make schedule selection more readable (#134111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134111
Approved by: https://github.com/drisspg
ghstack dependencies: #134110
2024-08-22 11:07:30 +00:00
b8ea5b01c9 [fp8 rowwise] Allocate workspace as a PyTorch Tensor (#134110)
This makes us go through the CUDA caching allocator, which is safer, e.g., in the case of CUDA graphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134110
Approved by: https://github.com/drisspg
2024-08-22 11:07:26 +00:00
cyy
4c8193b8f0 [14/N] Fix clang-tidy warnings in aten/src/ATen (#132733)
Follows #133807

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132733
Approved by: https://github.com/ezyang
2024-08-22 10:09:15 +00:00
90c821814e SparseCsrCUDA: cuDSS backend for linalg.solve (#129856)
This PR switches to the cuDSS library and has the same purpose as #127692, which is to add sparse CSR tensor support to linalg.solve.
Fixes #69538

Minimal usage example:
```
import torch

if __name__ == '__main__':
    # build a small symmetric positive definite system
    spd = torch.rand(4, 3)
    A = spd.T @ spd
    b = torch.rand(3).to(torch.float64).cuda()
    A = A.to_sparse_csr().to(torch.float64).cuda()

    # solve A x = b through the sparse CSR (cuDSS) path
    x = torch.linalg.solve(A, b)
    print((A @ x - b).norm())
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856
Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn

Co-authored-by: Zihang Fang <zhfang1108@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2024-08-22 07:57:30 +00:00
938f37b745 Added batching rule for sdpa_math, sdpa_efficient_attention forward, cudnn, and flash attention (#133964)
Fixes https://github.com/pytorch/pytorch/issues/117016, https://github.com/pytorch/pytorch/issues/102457, https://github.com/pytorch/pytorch/issues/110525, https://github.com/pytorch/pytorch/issues/108065,
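
A minimal sketch (not from the PR's tests) of what the batching rules enable: vmapping SDPA over a leading batch dimension:
```python
import torch
import torch.nn.functional as F

B, H, L, E = 4, 8, 128, 64
q, k, v = (torch.randn(B, H, L, E) for _ in range(3))

# each vmapped call sees (H, L, E) inputs and dispatches to SDPA
out = torch.vmap(F.scaled_dot_product_attention)(q, k, v)
print(out.shape)  # torch.Size([4, 8, 128, 64])
```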

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133964
Approved by: https://github.com/Skylion007
2024-08-22 05:29:49 +00:00
3d8db41337 Add new op wrapped_quantized_linear (#134024)
Summary:
This diff adds a new operator, wrapped_quantized_linear (torch.ops._quantized.wrapped_quantized_linear), which takes the following input arguments: input (in fp32), input_scale, input_zero_point, weight (in fp32), weight_scale, weight_zero_point, bias (in fp32), output_scale, output_zero_point, and out_channel. It does the following:

1. Use quantize_per_tensor(input, input_scale, input_zero_point) to quantize the input tensor to int8
2. Use quantized::linear_prepack(weight, weight_scale, weight_zero_point, bias) to pack the weight and bias
3. Use quantized::linear to perform int8 quantized linear
4. dequantize

This new op is essentially a wrapper of multiple ops. We do this because torch.export cannot handle models that use the old quantization APIs.
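
For reference, a hedged sketch of the same computation expressed with the existing public quantized ops (this is not the new wrapper itself; the `out_channel` argument is omitted, and the quint8/qint8 dtype choices are assumptions):
```python
import torch

def wrapped_quantized_linear_reference(
    x, x_scale, x_zp, weight, w_scale, w_zp, bias, out_scale, out_zp
):
    # 1. quantize the fp32 input to int8
    xq = torch.quantize_per_tensor(x, x_scale, x_zp, torch.quint8)
    # 2. quantize and prepack the fp32 weight together with the bias
    wq = torch.quantize_per_tensor(weight, w_scale, w_zp, torch.qint8)
    packed = torch.ops.quantized.linear_prepack(wq, bias)
    # 3. int8 quantized linear
    yq = torch.ops.quantized.linear(xq, packed, out_scale, out_zp)
    # 4. dequantize back to fp32
    return yq.dequantize()
```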

Reviewed By: jerryzh168

Differential Revision: D61377266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134024
Approved by: https://github.com/houseroad
2024-08-21 09:26:58 +00:00
43f78bf37a [MPS] Gather sliced inputs to batch norm (#133610)
This PR removes the `executeGatherOp` flag from batch norm in favor of relying on the logic in 4aa66f68a8/aten/src/ATen/native/mps/OperationUtils.mm (L372) to decide if gathering is necessary.

It's not the most efficient way to solve this issue, but it ensures correctness for sliced inputs.

### Performance impact

#### With fix

```
python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
100 loops, best of 5: 282 usec per loop

python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
100 loops, best of 5: 448 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 705 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 1.11 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 7.16 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 11.7 msec per loop
```

#### Without fix

```
python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
100 loops, best of 5: 284 usec per loop

python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
100 loops, best of 5: 265 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 715 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 675 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 7.19 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 7.13 msec per loop
```

Please feel free to push back or request changes.

Fixes #133520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133610
Approved by: https://github.com/malfet
2024-08-20 18:24:48 +00:00
cyy
8d93fe510e Remove NestedTensorFactories.h (#133809)
Since it has no code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133809
Approved by: https://github.com/ezyang
2024-08-20 16:16:30 +00:00
48ee0984ac Add C API to return all torch function disablement status (#133136)
This PR adds a C function to check whether the torch function state is All disabled.
Recall that there are three torch function enablement states:
* All disabled
* Torch Function Subclass disabled
* All enabled

The API before this change provides two functions:
* `_is_torch_function_enabled` - returns True iff the current TF state is All enabled
* `_is_torch_function_mode_enabled` - returns True iff the state is not All disabled and the torch function mode stack is non-empty.

The crux of why a new API is needed is the following: if Dynamo enters a frame with the torch function mode stack empty and `_is_torch_function_enabled` == False, it is impossible to determine whether, after a new mode is pushed, we should enter that mode or not. This is because we don't know if the enablement state is All disabled or only Subclass disabled. Adding this API to check whether All disabled is the current state allows us to disambiguate this case.
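
A hedged illustration of that ambiguity (using the existing private APIs; the printed values reflect the semantics described above):
```python
import torch

with torch._C.DisableTorchFunctionSubclass():
    # only subclass handling is disabled, but the flag still reads False
    print(torch._C._is_torch_function_enabled())  # False

with torch._C.DisableTorchFunction():
    # everything is disabled, and the flag reads False here too
    print(torch._C._is_torch_function_enabled())  # False
```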

In the next PR, Dynamo InstructionTranslator will have clearer flags than the underlying C API:
* A flag to indicate if subclasses are disabled (i.e., the current state is All disabled or Subclass disabled)
* A flag to indicate if modes are disabled (i.e., the current state is All disabled)
* A symbolic stack that can be checked to see whether any modes are present

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133136
Approved by: https://github.com/bdhirsh
ghstack dependencies: #133130, #133729, #133131, #133132, #133133, #133134
2024-08-20 07:15:04 +00:00
32f57ac627 [BE] Fix lint issues in qlinear_prepack.cpp (#133797)
Summary: This diff fixes many lint issues in qlinear_prepack.cpp. I'm fixing them because I want to add more ops/functions to this file later.

Test Plan: Sandcastle

Differential Revision: D61425436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133797
Approved by: https://github.com/Skylion007
2024-08-20 04:23:25 +00:00
cyy
c51fc7e98e Enable clang-tidy in aten/src/ATen/native/nested/ (#133829)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133829
Approved by: https://github.com/Skylion007
2024-08-20 01:52:15 +00:00
c7af2728d3 Remove aten dispatch to empty in foreach_norm cuda kernel (#133897)
Saves significant time on aten dispatch: for 2k tensors, the call goes from 38 ms to 58 us.
This should shave off some of the overhead mentioned in https://github.com/pytorch/pytorch/issues/133586

Before PR:
![image](https://github.com/user-attachments/assets/7813f059-0f7f-4d44-a9f0-1aaf94ae849f)

After:
![image](https://github.com/user-attachments/assets/ad0855b1-2743-432a-ad31-b574c620e2fd)

script:
```
import torch

# warm up caching allocator
a = torch.rand(200, 10, device="cuda")
b = torch.rand(200, 10, device="cuda")
c = a + b
del a, b, c

ts = [torch.rand(2, 3, device="cuda") for _ in range(2000)]

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    torch._foreach_norm(ts)

print(p.key_averages().table(sort_by="cpu_time_total"))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133897
Approved by: https://github.com/albanD, https://github.com/drisspg
2024-08-20 01:27:09 +00:00
fb26b84390 Update fused kernels and call _safe_softmax from SDPA (#133882)
# UPDATE:
This is take 3 of https://github.com/pytorch/pytorch/pull/131863, which was landed via co-dev but did not apply correctly.

# Summary
Changes the stance of SDPA on what to do for fully masked out rows

## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963

These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617

can be paraphrased as follows:

When passing in fully masked out rows, attention becomes ambiguous. We have two main options:

1. Uniformly attend to all values:
   ```python
   scores[masked_out_rows] = 1 / len(row)
   out[masked_out_rows] = 1 / len(row) * value
   ```

2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
   ```python
   output[fully_masked_rows] = NaN
   ```

We went with option 2, partly because it was easier to implement, but also because people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happens when you call backward?
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
        [       nan,        nan,        nan,        nan]])
```
Those pesky NaNs are back!

## Why do we see NaNs today?

The core of the problem revolves around using the softmax function in sdpa:

```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```

## Quick Aside: Masking in Attention

Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details ([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.

We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.

## Alternative Approaches

If we use a very large negative number instead of -inf:

```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However, if users always remembered to "slice" out their outputs, i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564,  0.1613, -0.0486],
        [ 0.0000,  0.0000,  0.0000,  0.0000]])
```
This would bring us back into a better state.

## A Third Option

We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.

This PR implements the new semantics for masking with attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```

**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.

## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel

_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Because of this, instead of decomposing softmax and checking for -inf rows, we "cheat" and use nan_to_num.

Why do I think this is okay? (Please find a counterpoint if available.)
There are multiple ways NaNs can emerge. For the fully-masked-out-rows case, nan_to_num works. But if there were other NaNs, wouldn't this silently remove them?

The only case in which this can happen is if the input itself had a NaN or an Inf.
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`

Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`

If we don't want to even allow for the possibility of "inf" or "NaN" attention scores being converted to 0, then we can implement it with something like this:

```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
However, we would be paying for this in math performance.

## Why Now
I think one point that has substantially changed where PyTorch should stand on this argument is the fact that we now have fused implementations for SDPA, and these fused implementations allow us to easily and performantly support this new semantic.

Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882
Approved by: https://github.com/soulitzer
2024-08-19 18:53:11 +00:00
0a976b8899 Enable bf16 float32 mkldnn matmul when float32 precision is 'medium' (#130919)
This fixes an issue on AArch64 CPUs supporting BF16, where torch.set_float32_matmul_precision("highest") did not disable the bf16 down-conversion in mkldnn_matmul.

This was discovered from a unit test failure where the decorator `torch.testing._internal.common_mkldnn.bf32_on_and_off`, which internally switches the float32_matmul_precision between "medium" and "highest", was not having the desired effect.
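
A minimal sketch (not the failing unit test itself) of the behavior the precision knob is supposed to control:
```python
import torch

a, b = torch.randn(256, 256), torch.randn(256, 256)

torch.set_float32_matmul_precision("highest")
ref = a @ b    # must stay in full fp32, no bf16 down-conversion

torch.set_float32_matmul_precision("medium")
fast = a @ b   # may use the oneDNN bf16 fast path on BF16-capable AArch64

print((ref - fast).abs().max())
```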

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130919
Approved by: https://github.com/jgong5
2024-08-19 09:18:12 +00:00
cyy
0d4cedaa47 [13/N] Fix clang-tidy warnings in aten/src/ATen (#133807)
Follows #133425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133807
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-08-18 17:54:12 +00:00
cyy
47ed5f57b0 [12/N] Fix clang-tidy warnings in aten/src/ATen (#133425)
Follows  #133758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133425
Approved by: https://github.com/ezyang
2024-08-18 11:03:55 +00:00
cyy
f8cf1829b5 [Reland] [11/N] Fix clang-tidy warnings in aten/src/ATen (#133758)
Reland of #133298. Remove possible changes that may increase the build time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133758
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-08-17 23:09:44 +00:00
215b14530a Add Half for sparse.mm reduce (#133672)
This PR adds Half support for sparse.mm reduce in the CPU backend.
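
A minimal sketch (not from the PR's tests; assumes the `reduce=` keyword form of `torch.sparse.mm`):
```python
import torch

a = torch.randn(4, 4, dtype=torch.half).relu().to_sparse_csr()
b = torch.randn(4, 3, dtype=torch.half)
out = torch.sparse.mm(a, b, reduce="sum")
print(out.dtype)  # torch.float16
```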

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133672
Approved by: https://github.com/Skylion007
2024-08-17 15:20:39 +00:00
d5f6d68d68 [PT2] Resolve PT2 compatibility issue in slice and diff (#133740)
Summary:
# context
* When running an IG FM training with PT2, we found there are a few graph breaks due to a torch.diff call in [jagged_tensor.py](https://fburl.com/code/cwssxabc):
```
_length: List[int] = (
    _length_per_key_from_stride_per_key(torch.diff(offsets), stride_per_key)
    if variable_stride_per_key
    else torch.sum(torch.diff(offsets).view(-1, stride), dim=1).tolist()
)
```
* Looking into the failure, we found that the TORCH_CHECK in diff should be a TORCH_SYM_CHECK.
* slice_forward error: df3d7729e, [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpxXZ2em/index.html)
```
RestartAnalysis
Tried to use data-dependent value in the subsequent computation. This can happen when we encounter unbounded dynamic value that is unknown during tracing time.  You will need to explicitly give hint to the compiler. Please take a look at torch._check OR torch._check_is_size APIs.  Could not guard on data-dependent expression ((5*u37 + u38)//(u37 + u38)) < 0 (unhinted: ((5*u37 + u38)//(u37 + u38)) < 0).  (Size-like symbols: u38, u37)

ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.

Potential framework code culprit (scroll up for full backtrace):
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/e99934938a0abe90/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 771, in slice_forward
    if end_val < 0:
```
* after this diff: [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpAhv2Sh/failures_and_restarts.html)

Test Plan:
# command
* run model
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2
```
* generate tlparse
```
tlparse `ls -t /var/tmp/tt/* | head -1`
```

Reviewed By: ezyang

Differential Revision: D56339251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133740
Approved by: https://github.com/ezyang
2024-08-17 06:07:21 +00:00
19ff9059eb Revert "[Inductor][CPP] Support vectorization of remainder (#129849)"
This reverts commit 8624a571b4eecd11547867591d70992843265e97.

Reverted https://github.com/pytorch/pytorch/pull/129849 on behalf of https://github.com/izaitsevfb due to ptedge_executorch_benchmark build failed again with LLVM crash ([comment](https://github.com/pytorch/pytorch/pull/129849#issuecomment-2294408526))
2024-08-16 22:41:05 +00:00
861bdf96f4 [MPS] Add native strided API for MPSNDArray starting with macOS 15 (#128393)
Add support for native strides in MPS starting with macOS Sequoia. This gets rid of the additional gather and scatter operations needed to resolve the strides or storage offsets of the tensors.

Summary of changes (starting with macOS 15):
- Add support for **MPS strided API** (strides/storage offsets etc):
   - [initWithBuffer:offset:descriptor:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4391636-initwithbuffer?language=objc)
   - [arrayViewWithCommandBuffer:descriptor:aliasing:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/3114040-arrayviewwithcommandbuffer?language=objc)
   - [arrayViewWithShape:strides:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4408694-arrayviewwithshape?language=objc)
   - [reshapeWithCommandBuffer:sourceArray:shape:destinationArray:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarrayidentity/4438557-reshapewithcommandbuffer?language=objc)
- Add native support for NHWC convolutions (without incurring any extra copy from NCHW -> NHWC -> NCHW).
- Add support for strided output buffers (previously we would create a contiguous buffer)

OSes older than macOS 15 will run the old gather/scatter code path to resolve strides/storage offsets.

---

A couple of performance stats collected from torchbench comparing macOS 15 vs macOS 14:
```
- test_train[functorch_maml_omniglot-mps]: 27% faster
- test_train[timm_vision_transformer-mps]: 12% faster
- test_train[hf_T5-mps]: 9.46% faster
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128393
Approved by: https://github.com/albanD

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
2024-08-16 21:07:50 +00:00
41e6619509 [codemod] Del un at::native::metal @ MPSCNNFullyConnectedOp.h:6 (export D59157302) (#133515)
Manual export of D59157302

Original description:
Removes a using namespace from the global namespace in pursuit of enabling -Wheader-hygiene. Qualifies instances that relied on the using namespace.

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133515
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-08-16 19:59:07 +00:00
611c104370 [MPS] Add workaround for nonzero with large/complex inputs (#126188)
Fixes Issue #122916

Resolves a correctness issue seen with large inputs to the MPS nonzero op by using a different scatter mode. The native nonzero op is still used for smaller inputs for better performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126188
Approved by: https://github.com/kulinseth, https://github.com/malfet
2024-08-16 19:04:04 +00:00
b833990a8f Revert "[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)"
This reverts commit 4aa66f68a803927ddd127ceaaa1521b8d6e90e5f.

Reverted https://github.com/pytorch/pytorch/pull/131493 on behalf of https://github.com/izaitsevfb due to breaks internal builds with identifier "std::numeric_limits< ::cutlass::half_t> ::infinity" is undefined in device code ([comment](https://github.com/pytorch/pytorch/pull/131493#issuecomment-2293939390))
2024-08-16 18:09:33 +00:00
a1a869f2f5 [ts_converter][reland] Add support for LinearOpContext and Conv2dOpContext in quantization pass (#133622)
Summary: Reland of D60871242

Test Plan: CI

Differential Revision: D61352600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133622
Approved by: https://github.com/SherlockNoMad
2024-08-16 01:55:45 +00:00
1653f7786d Fix type promotion for ldexp (#133519)
According to the documentation, ldexp of a half tensor and an int tensor should return a half tensor, and ldexp of a double tensor should not overflow for a 64-bit exponent.

Introduce a `_pow2` helper function that does not follow the scalar-to-float32 promotion pattern if `self` is a reduced-precision float or double.

Add regression tests to `test_ldexp` and enable it to run on both CPU and GPU
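
A minimal sketch (not the regression tests themselves) of the expected behavior:
```python
import torch

x = torch.tensor([1.0], dtype=torch.half)
e = torch.tensor([3], dtype=torch.int32)
print(torch.ldexp(x, e).dtype)  # torch.float16, not promoted to float32

d = torch.tensor([1.0], dtype=torch.float64)
big = torch.tensor([1000], dtype=torch.int64)
print(torch.ldexp(d, big))      # finite in float64 (would overflow in float32)
```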

Fixes https://github.com/pytorch/pytorch/issues/133267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133519
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2024-08-16 01:26:26 +00:00
cyy
8f7cf796ea [14/N] Use std::optional (#133417)
Follows #132527
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133417
Approved by: https://github.com/ezyang
2024-08-16 00:48:34 +00:00
4aa66f68a8 [CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)
Unblocks/unbreaks against newer CUTLASS (3.5+)

CC @nWEIdia @xwang233 @ptrblck @thakkarV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131493
Approved by: https://github.com/Skylion007
2024-08-15 18:33:22 +00:00
cfec69e2a1 Revert "Update fused kernels and call _safe_softmax from SDPA (#131863)"
This reverts commit caba37e99b03d2199848197de4e452b78c8c2a23.

Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/izaitsevfb due to breaks executorch test executorch/backends/apple/coreml:test - test_vit_skip_conv (executorch.backends.apple.coreml.test.test_coreml_partitioner.TestCoreMLPartitioner) ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2291855634))
2024-08-15 17:55:07 +00:00
ec49ce5f8e [CUDA]: Add frexp CUDA bfloat16 support (#133313)
Fixes #133263. Adds CUDA bfloat16 support to cuda_frexp.
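
A minimal sketch (not from the PR's tests):
```python
import torch

x = torch.tensor([0.5, 6.0, -20.0], dtype=torch.bfloat16, device="cuda")
mantissa, exponent = torch.frexp(x)
print(mantissa.dtype, exponent.dtype)  # torch.bfloat16 torch.int32
```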

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133313
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-08-15 15:20:00 +00:00
8624a571b4 [Inductor][CPP] Support vectorization of remainder (#129849)
**Summary**
While checking the vectorization status across the 3 test suites, we found that some operators disabled vectorization with the message `Disabled vectorization: op: remainder`. This PR adds vectorization support for this op.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```

Differential Revision: [D61147014](https://our.internmc.facebook.com/intern/diff/D61147014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-08-15 02:06:30 +00:00