1465 Commits

7231118db3 Turn some const variables into constexpr in C++ code (#165401)
This PR checks the C++ code and turns some const variables into constexpr.
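The kind of change involved, sketched (illustrative, not from the PR):

```c++
// Before: a runtime-initialized constant.
const int kMaxDims = 8;

// After: a compile-time constant, usable in constant expressions
// (array bounds, static_assert, template arguments).
constexpr int kMaxDims2 = 8;
static_assert(kMaxDims2 == 8);
```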

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165401
Approved by: https://github.com/Skylion007
2025-10-17 13:24:46 +00:00
9e94ec76b8 Revert "Turn some const variables into constexpr in C++ code (#165401)"
This reverts commit 5b2afe4c5dc87786ca65bf22ca9a78f7c21a33a4.

Reverted https://github.com/pytorch/pytorch/pull/165401 on behalf of https://github.com/seemethere due to This is breaking test/distributions/test_distributions.py::TestDistributions::test_binomial_sample on HUD, see 5b2afe4c5d ([comment](https://github.com/pytorch/pytorch/pull/165401#issuecomment-3414023134))
2025-10-17 06:14:09 +00:00
5b2afe4c5d Turn some const variables into constexpr in C++ code (#165401)
This PR checks the C++ code and turns some const variables into constexpr.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165401
Approved by: https://github.com/Skylion007
2025-10-17 00:40:11 +00:00
26f3803433 Remove workaround to old CUDA bug (#164354)
As in the title.

A check for https://github.com/pytorch/pytorch/issues/164348 to see if the workaround can be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164354
Approved by: https://github.com/janeyx99, https://github.com/ngimel, https://github.com/malfet, https://github.com/jeffdaily
ghstack dependencies: #164350
2025-10-16 00:55:43 +00:00
ef50c9b557 Remove unnecessary "static" for definitions in anonymous namespace (#165035)
This PR removes unnecessary "static" for C++ functions and variables in anonymous namespaces, as detected by clang-tidy. This enhances code readability. The related rules are planned to be enabled in follow-up PRs.
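For illustration, a minimal sketch of the redundancy (entities in an anonymous namespace already have internal linkage, so `static` adds nothing):

```c++
namespace {

// Before: `static` is redundant; the anonymous namespace already gives
// this function internal linkage.
static int helper_before(int x) { return x + 1; }

// After: same internal linkage, less noise.
int helper_after(int x) { return x + 1; }

}  // namespace
```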

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165035
Approved by: https://github.com/Skylion007
2025-10-11 00:04:23 +00:00
7614338b69 Revert "Add SVE128 ISA (#158932)"
This reverts commit 92284fb2ff44f09a9c7df0d8cf6cac9903e376a4.

Reverted https://github.com/pytorch/pytorch/pull/158932 on behalf of https://github.com/malfet due to Hmm, but from OSS point of view, this is a no-op ([comment](https://github.com/pytorch/pytorch/pull/158932#issuecomment-3387961238))
2025-10-10 01:17:02 +00:00
f231be25c6 Mark unused parameters in C++ code (#164912)
This PR adds unused parameter name comments in C++ declarations to improve code readability.
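The convention, sketched (illustrative names):

```c++
#include <cstdio>

// Before: `ctx` is unused, so -Wunused-parameter flags it:
//   void log_value(int value, void* ctx) { std::printf("%d\n", value); }

// After: the name survives as a comment and the warning goes away.
void log_value(int value, void* /*ctx*/) { std::printf("%d\n", value); }
```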

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164912
Approved by: https://github.com/Skylion007
2025-10-09 06:23:25 +00:00
43fc859625 Don't return values in void functions (#164809)
This PR fixes returning values in void C++ functions.
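The pattern cleaned up, sketched (illustrative, not from the PR):

```c++
void store(int* dst, int v) { *dst = v; }

// Before: `return store(dst, v);` is legal for a void expression but
// misleading, and flagged by linters.
void wrapper(int* dst, int v) {
  store(dst, v);  // after: a plain call, with no value-carrying return
}
```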

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164809
Approved by: https://github.com/janeyx99
2025-10-08 01:04:14 +00:00
9fff8155c3 [2/N] Fix clang-tidy readability checks (#164652)
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it would incur too many changes; it detects redundant `inline` specifiers on function and variable declarations, and there are many in-class method definitions that are marked `inline`.
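A minimal sketch of what the suppressed check would flag (an in-class definition is implicitly `inline`):

```c++
struct Counter {
  // Redundant: functions defined inside the class body are implicitly inline.
  inline int get() const { return value_; }

  // Equivalent, without the specifier.
  int peek() const { return value_; }

  int value_ = 0;
};
```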

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
2025-10-06 01:06:01 +00:00
2c5ed6e7c0 Revert "[2/N] Fix clang-tidy readability checks (#164652)"
This reverts commit 3c5ca685d6f5b6f3971c0cd20a054aa355610419.

Reverted https://github.com/pytorch/pytorch/pull/164652 on behalf of https://github.com/izaitsevfb due to need to revert due to a conflict with revert of https://github.com/pytorch/pytorch/pull/162659 ([comment](https://github.com/pytorch/pytorch/pull/164652#issuecomment-3369346707))
2025-10-05 21:36:57 +00:00
3c5ca685d6 [2/N] Fix clang-tidy readability checks (#164652)
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it would incur too many changes; it detects redundant `inline` specifiers on function and variable declarations, and there are many in-class method definitions that are marked `inline`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
2025-10-05 07:05:11 +00:00
92284fb2ff Add SVE128 ISA (#158932)
Summary: Partly importing and adapting https://github.com/pytorch/pytorch/pull/138388, adding SVE128 as an ISA.

The intention is to add SVE128 translation layers for Vectorized data types.
The idea is to have one PR per file, aside from the current one, plus a last one modifying CMake files to enable the new ISA selectively.

Tested the current changes on a nightly run to verify that no regressions occur on systems leveraging SVE256.

No regressions were spotted when running test_ops.py, a set of 34k unit tests. A machine leveraging SVE128 was used for this testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158932
Approved by: https://github.com/malfet
2025-09-29 14:49:19 +00:00
f6ea41ead2 [CPU] Adding missing brackets in native MaxUnpool log (#163039)
As stated in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163039
Approved by: https://github.com/Skylion007
2025-09-16 21:28:15 +00:00
0d421ace32 fix spelling of word - when (#160185)
Just found a typo while reading through the codebase while working on another PR.

This fixes the typo in the word `when` in these files:

```
native/cpu/PaddingKernel.cpp
native/cpu/batch_norm_kernel.cpp
```

@eqy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160185
Approved by: https://github.com/yewentao256, https://github.com/ezyang
2025-08-31 13:38:23 +00:00
8939d151d0 Use std::apply for CPU code (#152526)
The supported compilers are recent enough to enable std::apply in C++17.
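For reference, a minimal sketch of the C++17 feature in question:

```c++
#include <cstdio>
#include <tuple>

int add3(int a, int b, int c) { return a + b + c; }

int main() {
  // std::apply (C++17) unpacks a tuple into a function call, replacing
  // hand-rolled std::index_sequence machinery.
  std::tuple<int, int, int> args{1, 2, 3};
  std::printf("%d\n", std::apply(add3, args));  // prints 6
}
```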

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152526
Approved by: https://github.com/ezyang
2025-08-28 02:47:54 +00:00
4651aaac47 Fix typo: 'complext' (#160335)
minor fix for a typo: `complext` to `complex`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160335
Approved by: https://github.com/Skylion007
2025-08-25 10:37:59 +00:00
0f801a510f Using std::vector or c10::SmallVector instead of CArray (#160959)
As the title states.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160959
Approved by: https://github.com/Skylion007
2025-08-20 05:32:29 +00:00
daeb3a6094 Using std::make_unique<T>() instead of unique_ptr<T>(new T()) (#160723)
As the title states.
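The pattern, sketched:

```c++
#include <memory>

struct Widget { int id = 0; };

// Before: spells the type twice and leaves a raw `new` at the call site.
std::unique_ptr<Widget> a(new Widget());

// After: no raw new, and the type is named once.
auto b = std::make_unique<Widget>();
```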

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160723
Approved by: https://github.com/Skylion007
2025-08-19 10:25:47 +00:00
83e2ea8135 [CPU] fix _weight_int8pack_mm with large output shape (#158341)
**Summary**
`_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). This is because the kernel computes the output buffer address as
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int`; when they are large, their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.
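A self-contained sketch of the failure mode and the fix (values are illustrative):

```c++
#include <cstdint>
#include <cstdio>

int main() {
  // Illustrative sizes where M * N exceeds INT_MAX.
  int mb_start = 70000, N = 70000, nb_start = 0;

  // Bug shape: `mb_start * N` is computed in 32-bit int and overflows (UB):
  // int offset32 = mb_start * N + nb_start;

  // Fix: declare the indices as int64_t so the product is 64-bit.
  int64_t mb_start64 = mb_start, N64 = N;
  int64_t offset = mb_start64 * N64 + nb_start;
  std::printf("%lld\n", static_cast<long long>(offset));  // 4900000000
}
```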

**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
2025-08-01 01:55:48 +00:00
61aa2ae20f Revert "[CPU] fix _weight_int8pack_mm with large output shape (#158341)"
This reverts commit e469414b59ceeaae2860e36708de8852b9892776.

Reverted https://github.com/pytorch/pytorch/pull/158341 on behalf of https://github.com/albanD due to Breaks slowtest ([comment](https://github.com/pytorch/pytorch/pull/158341#issuecomment-3132641530))
2025-07-29 13:56:20 +00:00
e469414b59 [CPU] fix _weight_int8pack_mm with large output shape (#158341)
**Summary**
`_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). This is because the kernel computes the output buffer address as
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int`; when they are large, their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.

**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
2025-07-29 01:14:50 +00:00
7f649ed4f8 Add basic torch.hash_tensor op (#154149)
Added a `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor.

- The hash is always uint64.
- Integers will be cast to uint64 before performing the xor_sum reduction.
- Floats will be upcast to double and then bit-cast to uint64 before performing the xor_sum reduction (see the sketch below).
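A scalar sketch of that float path (illustrative, not the actual kernel):

```c++
#include <cstdint>
#include <cstring>

// Upcast each float to double, bit-cast to uint64, then xor-reduce.
uint64_t xor_hash_floats(const float* data, int64_t n) {
  uint64_t acc = 0;
  for (int64_t i = 0; i < n; ++i) {
    const double d = static_cast<double>(data[i]);  // upcast to double
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof(bits));           // bit-cast to uint64
    acc ^= bits;                                    // xor_sum reduction
  }
  return acc;
}
```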

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149
Approved by: https://github.com/albanD
2025-07-23 22:28:03 +00:00
abe0c9538a [BE] Fix extra-semi warnings (#158730)
And prevent new ones from appearing by removing `-Wno-error=extra-semi` (not sure what the reason was behind adding the warning but not erroring on it when building with -Werror, introduced by https://github.com/pytorch/pytorch/pull/140236).

300+ violations of that rule were fixed by running `sed -i -e "s/});/})/" /` against `torch/nativert`
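A minimal case of the warning in question (illustrative, not from the PR):

```c++
struct Foo {
  // -Wextra-semi: "extra ';' after member function definition".
  void bar() {};  // the trailing semicolon is the violation

  void baz() {}   // fixed
};
```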
Other 3p deps that need updates:
 - TensorPipe
 - LLVM
 - FBGEMM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158730
Approved by: https://github.com/Skylion007
2025-07-22 01:05:03 +00:00
8c3f206457 Fix AArch64 segfaults by disabling strict-aliasing in GridSamplerKernel for GCC 12 and above (#158117)
This PR disables the `strict-aliasing` GCC C++ optimization flag on all AArch64 CPUs for GCC versions 12 and above.

Pull Request #152825 upgraded the GCC version from 11 to 13 in manywheel, which caused several segmentation faults in unit tests (not visible in CI workflows because the jammy GCC version has not been updated yet).

We identified that the problem also exists in GCC 12, hence the `__GNUC__ >= 12` check.
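A hedged sketch of how such a version-gated opt-out can look; the PR's actual mechanism may differ (e.g., it may be applied per-file in the build system):

```c++
// Illustrative only: disable strict-aliasing-based optimization for this
// translation unit on GCC >= 12 when targeting AArch64.
#if defined(__aarch64__) && defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 12
#pragma GCC optimize("no-strict-aliasing")
#endif
```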

Fixes #157626

Fixes these test failures when PyTorch is built with GCC 12 and above:
```
test_ops.py::TestCommonCPU::test_noncontiguous_samples_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_grid_sampler_2d_cpu Fatal Python error: Segmentation fault
test_ops.py::TestMathBitsCPU::test_neg_view_nn_functional_grid_sample_cpu_float64 free(): invalid next size (fast)
test_ops.py::TestCompositeComplianceCPU::test_backward_grid_sampler_2d_cpu_float32 Fatal Python error: Segmentation fault
test_ops.py::TestCommonCPU::test_dtypes_nn_functional_grid_sample_cpu Fatal Python error: Segmentation fault

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158117
Approved by: https://github.com/malfet
2025-07-15 18:26:38 +00:00
1f57e0e04d [CPU] Support GQA for flash attention (#157893)
As many models require GQA, we support it in flash attention for the CPU path.
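The core of GQA, sketched with assumed names (each group of query heads shares one key/value head):

```c++
#include <cstdint>

// Map a query head to the key/value head it attends with.
int64_t kv_head_for_query(int64_t q_head, int64_t num_q_heads,
                          int64_t num_kv_heads) {
  const int64_t group_size = num_q_heads / num_kv_heads;  // assumes divisibility
  return q_head / group_size;  // e.g., 16 query heads over 4 KV heads: groups of 4
}
```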

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157893
Approved by: https://github.com/mingfeima, https://github.com/jansel
2025-07-13 09:49:02 +00:00
b7860c7863 Implement fast exp for AVX2 and AVX512 for the flash attention (#151441)
**Implement fexp for AVX2 and AVX512**

Cristiano et al. propose a clever exp using the IEEE representation with fine control of the precision, especially useful
for mixed-precision computation in flash attention.

- Implements "Fast Exponential Computation on SIMD Architectures" by
  A. Cristiano I. Malossi, Yves Ineichen, Costas Bekas, and Alessandro Curioni.
- AVX2 and AVX512, float only; up to 20% faster for mixed-precision flash attention
  than the current implementation.
- Other types fall back to the legacy implementation.
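A scalar sketch of the underlying IEEE-754 trick (coarse approximation only; the PR's AVX2/AVX512 kernels vectorize a refined, higher-precision variant):

```c++
#include <cstdint>
#include <cstring>

float fast_exp_sketch(float x) {
  // exp(x) = 2^(x * log2(e)); write that exponent directly into the float's
  // bit pattern: bits = x * log2(e) * 2^23 + 127 * 2^23.
  const float t = x * 1.442695040f * (1 << 23) + 127.0f * (1 << 23);
  const auto bits = static_cast<int32_t>(t);
  float result;
  std::memcpy(&result, &bits, sizeof(result));  // bit-cast int -> float
  return result;  // within a few percent of expf(x); real kernels add correction
}
```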

**Precision**

1 ULP, only valid in hybrid mode fp32 -> fp16, due to the cast during the store operation in the flash attention.

**Benchmark**

Machine Xeon 6972P, results in TOPs, Python forward pass flash attention

numhead 16, Head dimension 64

|Seq. L.| PT   | fexp |
|-------|------|------|
| 512   | 0.8  | 1.3  |
| 1024  | 1.7  | 1.7  |
| 2048  | 6    | 6.1  |
| 4096  | 16   | 16.8 |
| 8192  | 30.6 | 32.3 |
| 16384 | 40   | 40.8 |
| 32768 | 44.9 | 51.4 |
| 65536 | 45.8 | 54.4 |

numhead 16, Head dimension 128

|Seq. L.| PT   | fexp |
|-------|------|------|
| 512   | 2.5  | 4.1  |
| 1024  | 3.3  | 4    |
| 2048  | 11.4 | 10.5 |
| 4096  | 27.4 | 28.4 |
| 8192  | 44.4 | 46   |
| 16384 | 64.2 | 68.1 |
| 32768 | 77.8 | 83   |
| 65536 | 82.1 | 88.1 |

numhead 16, Head dimension 256

|Seq. L.| PT   | fexp |
|-------|------|------|
| 512   | 1.7  | 3.4  |
| 1024  | 4.2  | 6.5  |
| 2048  | 14.6 | 16.1 |
| 4096  | 30.1 | 31.1 |
| 8192  | 60   | 62   |
| 16384 | 83.3 | 87.3 |
| 32768 | 98.7 | 106  |
| 65536 | 102.2| 107.1|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151441
Approved by: https://github.com/mingfeima
2025-07-10 05:51:31 +00:00
d6237721c0 [Build] Make PyTorch compilable with gcc-14 on ARM (#157867)
Fixes numerous ICEs in vreg allocations for SVE+BF16
```
/pytorch/aten/src/ATen/ParallelOpenMP.h:25:9: error: unrecognizable insn:
   25 | #pragma omp parallel
      |         ^~~
(insn 257 256 258 30 (set (reg:VNx8BF 449 [ bf16_vec1_217 ])
        (unspec:VNx8BF [
                (reg:VNx8BF 455)
                (reg:VNx8BF 456)
            ] UNSPEC_IORF)) "/pytorch/aten/src/ATen/cpu/vec/sve/vec_bfloat16.h":228:31 discrim 1 -1
     (nil))
during RTL pass: vregs
/pytorch/aten/src/ATen/ParallelOpenMP.h:25:9: internal compiler error: in extract_insn, at recog.cc:2812
0xd73c33 internal_error(char const*, ...)
	???:0
0xd73d1f fancy_abort(char const*, int, char const*)
	???:0
0x890053 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
	???:0
0x890087 _fatal_insn_not_found(rtx_def const*, char const*, int, char const*)
	???:0
0x1379093 extract_insn(rtx_insn*)
	???:0

```
And one in the RTL-expand pass while compiling Activation.cpp:
```
during RTL pass: expand
In file included from /pytorch/aten/src/ATen/native/cpu/Activation.cpp:12,
                 from /pytorch/build/aten/src/ATen/native/cpu/Activation.cpp.DEFAULT.cpp:1:
/pytorch/aten/src/ATen/native/cpu/Activation.cpp: In lambda function:
/pytorch/aten/src/ATen/native/cpu/Activation.cpp:94:7: internal compiler error: Segmentation fault
   94 |       });
      |       ^
/pytorch/aten/src/ATen/Dispatch.h:201:7: note: in definition of macro 'AT_DISPATCH_SWITCH'
  201 |       __VA_ARGS__                                                           \
      |       ^~~~~~~~~~~
/pytorch/aten/src/ATen/Dispatch.h:72:3: note: in expansion of macro 'AT_PRIVATE_CASE_TYPE_USING_HINT'
   72 |   AT_PRIVATE_CASE_TYPE_USING_HINT(enum_type, scalar_t, __VA_ARGS__)
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/pytorch/aten/src/ATen/Dispatch.h:214:3: note: in expansion of macro 'AT_DISPATCH_CASE'
  214 |   AT_DISPATCH_CASE(at::ScalarType::Double, __VA_ARGS__) \
      |   ^~~~~~~~~~~~~~~~
/pytorch/aten/src/ATen/Dispatch.h:218:34: note: in expansion of macro 'AT_DISPATCH_CASE_FLOATING_TYPES'
  218 |   AT_DISPATCH_SWITCH(TYPE, NAME, AT_DISPATCH_CASE_FLOATING_TYPES(__VA_ARGS__))
      |                                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/pytorch/aten/src/ATen/native/cpu/Activation.cpp:70:5: note: in expansion of macro 'AT_DISPATCH_FLOATING_TYPES'
   70 |     AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "log_sigmoid_cpu", [&] {
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~
0xd73c33 internal_error(char const*, ...)
	???:0
0x134f987 rebuild_jump_labels(rtx_insn*)
	???:0
```

Interestingly enough, an attempt to compile `Unfold2d.cpp` with `-march=armv8-a+sve` (i.e., without SVE+BF16 support) also causes an ICE:
```
/pytorch/aten/src/ATen/native/cpu/Unfold2d.cpp:221:1: error: unrecognizable insn:
  221 | }
      | ^
(insn 2918 2917 2919 296 (set (reg:VNx8BI 5917)
        (unspec:VNx16BI [
                (reg:VNx8BI 5920)
                (reg:VNx8BI 5922)
                (const_vector:VNx4BI [
                        (const_int 0 [0]) repeated x8
                    ])
            ] UNSPEC_TRN1_CONV)) "/usr/include/aarch64-linux-gnu/bits/string_fortified.h":29:33 discrim 1 -1
     (expr_list:REG_EQUAL (const_vector:VNx8BI [
                (const_int 1 [0x1]) repeated x9
                (const_int 0 [0])
                (const_int 1 [0x1]) repeated x2
                (const_int 0 [0]) repeated x4
            ])
        (nil)))
during RTL pass: vregs
```

This could be worked around by adding:
```patch
diff --git a/aten/src/ATen/native/cpu/Unfold2d.cpp b/aten/src/ATen/native/cpu/Unfold2d.cpp
index 8ef0741e77af0a..59c76505dd6246 100644
--- a/aten/src/ATen/native/cpu/Unfold2d.cpp
+++ b/aten/src/ATen/native/cpu/Unfold2d.cpp
@@ -169,6 +169,10 @@ static void unfolded2d_acc_channels_last(

 /* note: due to write issues, this one cannot be parallelized as well as
  * unfolded2d_copy */
+#if defined(__GNUC__) && __GNUC__ == 14 && defined(__ARM_FEATURE_SVE)
+// Workaround for gcc-14.2.0 ICE during RTL pass: vregs when compiling for SVE
+__attribute__((optimize("no-tree-vectorize")))
+#endif
 void unfolded2d_acc_kernel(
     ScalarType dtype,
     void *finput_data,
```

Fixes https://github.com/pytorch/pytorch/issues/157842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157867
Approved by: https://github.com/atalman, https://github.com/Skylion007
2025-07-09 02:59:08 +00:00
d26ca5de05 Support transpose and pack for bit8 (#156065)
To be used by CPU INT8 SDPA in torchao. https://github.com/pytorch/ao/pull/2380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156065
Approved by: https://github.com/mingfeima, https://github.com/ezyang
2025-07-07 01:40:47 +00:00
e96f530af5 Remove unnecessary use of c10::SmallVector from moments_utils (#156714)
It's just making arrays of a particular size. (If it were resizing the vectors, we'd see compile errors.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156714
Approved by: https://github.com/Skylion007
2025-06-24 22:30:10 +00:00
c82a174cea Extract CPU log_softmax kernels to header (#156243)
This allows sharing them with ExecuTorch.

Differential Revision: [D76830114](https://our.internmc.facebook.com/intern/diff/D76830114/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156243
Approved by: https://github.com/janeyx99
2025-06-23 21:31:16 +00:00
bf50d71553 Add missing inline namespace CPU_CAPABILITY to Gelu/Elu.h (#156512)
As I recently learned the hard way (#156243), it is necessary to put kernel code that uses Vectorized in headers in this namespace.
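The convention, sketched (CPU_CAPABILITY expands to the ISA-specific namespace name at build time; details vary by file):

```c++
#include <ATen/cpu/vec/vec.h>

namespace at::native {
inline namespace CPU_CAPABILITY {

// Kernel helpers using vec::Vectorized must live here: the file is compiled
// once per ISA (DEFAULT/AVX2/AVX512/...), and the inline namespace keeps each
// variant's symbols distinct instead of silently violating the ODR.
inline vec::Vectorized<float> double_it(vec::Vectorized<float> x) {
  return x + x;
}

}  // namespace CPU_CAPABILITY
}  // namespace at::native
```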

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156512
Approved by: https://github.com/malfet
2025-06-21 06:26:23 +00:00
e2351f2dcf fix apparent copy-paste bug in log_softmax reduced-precision fp kernel (#156379)
This looks like a bug. Check whether trying to fix it breaks existing tests; if not, we will look into why no test coverage caught it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156379
Approved by: https://github.com/janeyx99
2025-06-20 20:54:53 +00:00
10d41c7d20 Add SDPA patterns for T5 models (#155455)
* Add SDPA patterns for T5 models.
* Remove the stride check on the mask, and make the mask contiguous in flash attention when the stride of its last dim is neither 1 nor 0 (see the sketch below). This allows more SDPAs with complex masks to be accelerated using flash attention, such as in the T5 model, where the generated masks may be non-contiguous.
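A hedged sketch of that mask handling (ATen-style, not the PR's exact code):

```c++
#include <ATen/ATen.h>

// Make the mask contiguous when its last-dim stride is neither 1 nor 0
// (stride 0 means a broadcast dim and is still fine for the kernel).
at::Tensor prepare_attn_mask(const at::Tensor& mask) {
  const auto last_stride = mask.stride(-1);
  return (last_stride != 1 && last_stride != 0) ? mask.contiguous() : mask;
}
```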

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155455
Approved by: https://github.com/Valentine233, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-06-18 02:09:55 +00:00
e323d46b61 ELU: compute ELU(0) with the cheaper definition (#155765)
Both halves of the ELU definition yield 0 when evaluated at 0. Let's choose the half that doesn't require expm1. (I have no particular evidence that the input is often 0 in any case, but this seems like a free win.)
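The choice, sketched:

```c++
#include <cmath>

// ELU(x) = x for x > 0, alpha * (exp(x) - 1) for x < 0; both branches yield
// 0 at x == 0, so `>=` routes 0 to the branch with no expm1 call.
float elu(float x, float alpha) {
  return x >= 0.0f ? x : alpha * std::expm1(x);
}
```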

Differential Revision: [D76481038](https://our.internmc.facebook.com/intern/diff/D76481038/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155765
Approved by: https://github.com/ezyang
2025-06-17 18:20:22 +00:00
c2beeadeb4 [Reland] Use 3.27 as the minimum CMake version (#154783)
Reland of #153153, which was accidentally closed.
Update the minimum CMake version to 3.27 because it provides more CUDA targets, such as CUDA::nvperf_host, so that it is possible to remove some of our forked CUDA modules. See https://github.com/pytorch/pytorch/pull/153783.
This also facilitates future third-party updates such as FBGEMM (whose currently shipped version requires 3.21).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154783
Approved by: https://github.com/ezyang
2025-06-14 16:37:51 +00:00
debd095149 Avoid index integer overflow in gemm_notrans_ (#154809)
Use uint64_t index types to avoid
```
 torch_np/numpy_tests/core/test_einsum.py::TestEinsum::test_einsum_broadcast /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24: runtime error: signed integer overflow: 9223365439786057728 + 13194139533312 cannot be represented in type 'long'
    #0 0x7f30d26166ba in std::enable_if<std::is_same_v<long, long>, void>::type at::native::cpublas::(anonymous namespace)::gemm_notrans_<long, long, long>(long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24
    #1 0x7f30d26166ba in void at::native::cpublas::(anonymous namespace)::gemm_core_<long, long, long>(at::native::TransposeType, at::native::TransposeType, long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:451:12
    #2 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const::'lambda2'()::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
    #3 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154809
Approved by: https://github.com/soulitzer
2025-06-03 19:28:34 +00:00
cfbd99fdfd [Pytorch] Add option to CPU Blas GEMM to avoid output downcast (#154012)
Summary:
Dot product for a single output element consists of 3 steps (both input vectors have elements of type scalar_t):
1. elementwise vector multiply (scalar_t x scalar_t -> opmath_t)
2. vector reduction to a scalar value (opmath_t -> opmath_t)
3. optional downcast if opmath_t != out_t

The current BLAS kernel performs steps 1 and 2 correctly, but for step 3 it will always downcast to scalar_t even when opmath_t == out_t (and then upcast back to out_t), which results in precision loss. This diff fixes the precision loss in the BlasKernel.
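A hedged sketch of the option described (assumed names, not the actual kernel):

```c++
#include <type_traits>

// Skip the lossy round-trip through scalar_t when the output type already
// matches the accumulation (opmath) type.
template <typename scalar_t, typename opmath_t, typename out_t>
void store_dot_result(out_t* out, opmath_t acc) {
  if constexpr (std::is_same_v<opmath_t, out_t>) {
    *out = acc;  // full opmath_t precision preserved
  } else {
    *out = static_cast<out_t>(static_cast<scalar_t>(acc));  // legacy downcast path
  }
}
```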

Test Plan: Attention CI passes

Differential Revision: D75023858

topic: not user facing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154012
Approved by: https://github.com/Valentine233, https://github.com/aditew01, https://github.com/CaoE, https://github.com/drisspg
2025-05-27 17:43:21 +00:00
ed5f4a4fa8 Replace size() checks with empty() (#153805)
Fixes #ISSUE_NUMBER
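The pattern, sketched:

```c++
#include <vector>

bool has_work_before(const std::vector<int>& v) { return v.size() > 0; }

// After: states intent directly and is O(1) for every standard container.
bool has_work_after(const std::vector<int>& v) { return !v.empty(); }
```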

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153805
Approved by: https://github.com/nareshrajkumar866, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-05-19 16:20:57 +00:00
cb57b19c3a [ATen-CPU] Use math.h for GeLU as well as cmath (#153742)
Summary:
## Context

See https://github.com/pytorch/pytorch/pull/149164 for more context.

Originally, this fix worked, but more recently including `cmath` by itself no longer provides access to math constants on Windows platforms. I found that including `math.h` resolves this.

I'm not sure exactly what changed, but this PR updates the header to just use both includes to fix the symbols not being found. It might be a bug introduced by a recent Windows update.
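A hedged sketch of the resulting include pattern (the PR may differ in details):

```c++
// On MSVC, constants such as M_PI and M_SQRT1_2 are only exposed when
// _USE_MATH_DEFINES is defined before the math headers are included.
#define _USE_MATH_DEFINES
#include <math.h>  // math constants on Windows
#include <cmath>   // C++ overloads: std::erf, std::exp, ...
```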

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153742
Approved by: https://github.com/swolchok, https://github.com/Skylion007
2025-05-18 19:06:45 +00:00
741539a790 Split out second pass of LayerNorm for profiler attribution reasons (#153578)
Summary:
Split out second pass of LayerNorm so it's more likely to show up in
profiler output. In my testing with perf, the samples from the lambda in the
current implementation are attributed somewhat haphazardly.
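A hedged sketch of the technique (illustrative names, not the PR's symbols):

```c++
#include <cstdint>

// Hoisting the second pass out of a lambda into a named, non-inlined
// function lets perf attribute its samples correctly (GCC/Clang attribute).
__attribute__((noinline))
void layer_norm_second_pass(float* out, const float* in, int64_t n,
                            float mean, float rstd) {
  for (int64_t i = 0; i < n; ++i) {
    out[i] = (in[i] - mean) * rstd;  // normalize with precomputed stats
  }
}
```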

Differential Revision: D74181627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153578
Approved by: https://github.com/hl475
2025-05-16 08:07:13 +00:00
ea17cd067d Add vec_reduce_all specialization for std::plus on AArch64 (#152388)
AArch64 has an instruction for this.
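For reference, the instruction alluded to, via its NEON intrinsic:

```c++
#include <arm_neon.h>

// AArch64 has horizontal-add instructions (ADDV/FADDP family); NEON exposes
// the float variant as vaddvq_f32.
float sum4(float32x4_t v) {
  return vaddvq_f32(v);  // single-instruction horizontal sum of all 4 lanes
}
```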

Differential Revision: [D73817183](https://our.internmc.facebook.com/intern/diff/D73817183/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152388
Approved by: https://github.com/Skylion007
ghstack dependencies: #152365, #152366
2025-05-15 21:26:18 +00:00
f47bf38e30 [float16]: Fast path for torch.dot with float16/bfloat16 (#152799)
Fixes #152798

Add the fast path for dot with contiguous tensors for float16/bfloat16 types.

Performance with patch (see issue for benchmark and current performance):

![Improved dot performance](https://github.com/user-attachments/assets/57f64e90-8191-4710-adb0-f430644827de)

**We see up to 10x+ improvement in performance.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152799
Approved by: https://github.com/malfet
2025-05-06 14:59:27 +00:00
fdadda21b6 Revert "[float16]: Fast path for torch.dot with float16/bfloat16 (#152799)"
This reverts commit d57bf53225004a684952222722a4f7322a21a596.

Reverted https://github.com/pytorch/pytorch/pull/152799 on behalf of https://github.com/malfet due to This broke C10_MOBILE builds, not sure why it was not surfaced on pull, see a766c1d117/1 ([comment](https://github.com/pytorch/pytorch/pull/152799#issuecomment-2852084433))
2025-05-05 19:17:59 +00:00
d57bf53225 [float16]: Fast path for torch.dot with float16/bfloat16 (#152799)
Fixes #152798

Add the fast path for dot with contiguous tensors for float16/bfloat16 types.

Performance with patch (see issue for benchmark and current performance):

![Improved dot performance](https://github.com/user-attachments/assets/57f64e90-8191-4710-adb0-f430644827de)

**We see up to 10x+ improvement in performance.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152799
Approved by: https://github.com/malfet
2025-05-05 18:29:39 +00:00
220870ce9e [caffe2] Support building for armv8.1 (#152766)
Summary:
- Remove explicit `-march=` compiler flags, as they're already implied by
   the toolchain:
https://www.internalfb.com/code/fbsource/[7f85b0565073]/fbcode/tools/build/buck/wrappers/defs.bzl?lines=819
- Gate non-8.1 compliant opcodes with `__ARM_FEATURE_*`.
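The gating pattern, sketched with an illustrative feature macro:

```c++
// Only use a post-baseline instruction when the target reports support.
#if defined(__ARM_FEATURE_DOTPROD)
  // fast path: vdotq_s32 and friends are available
#else
  // portable fallback for plain armv8.1 targets
#endif
```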

Test Plan: CI

Reviewed By: rahulg

Differential Revision: D74023601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152766
Approved by: https://github.com/Skylion007
2025-05-04 19:09:21 +00:00
f0c9b3385d Support more dtypes for input, indices in gather (#151822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151822
Approved by: https://github.com/ngimel
2025-05-01 16:35:23 +00:00
fcbbb03d48 Extend vec backend with BF16 SVE intrinsics (#143666)
- Following the work in https://github.com/pytorch/pytorch/pull/119571, BF16 SVE intrinsics are added to the Vectorized class, providing ~1.7x speedup on `silu` and `softmax`.
- Added bf16 detection in CMake
- Added a guard for native NEON code to prevent compilation errors

@aditew01 @maajidkhann please have a look

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143666
Approved by: https://github.com/malfet, https://github.com/aditew01, https://github.com/nikhil-arm

Co-authored-by: Aditya Tewari <aditya.tewari@arm.com>
2025-04-28 18:25:44 +00:00
7e8b9b3f51 ReducedPrecisionFloatGemvFastPathKernel: Correctly type parallel_for lambda arguments as int64_t (#152233)
This plus the previous irangeification PR seems like a better fix for #150637 than #150949 to me -- it should make sure we are using 64-bit math for indexing everywhere.
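The pattern in question, sketched:

```c++
#include <ATen/Parallel.h>
#include <cstdint>

// Typing the lambda's bounds as int64_t (matching at::parallel_for's
// signature) keeps all index arithmetic inside the loop 64-bit.
void scale(float* data, int64_t n) {
  at::parallel_for(0, n, /*grain_size=*/2048, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] *= 2.0f;
    }
  });
}
```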

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152233
Approved by: https://github.com/Skylion007, https://github.com/cyyever
ghstack dependencies: #152232
2025-04-28 07:19:26 +00:00
3b7d6bbe8b irangeify ReducedPrecisionFloatGemvKernel.cpp (#152232)
We should be using irange, especially because we had 32-bit overflow issues in this file recently.
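For reference, a minimal sketch of the idiom:

```c++
#include <c10/util/irange.h>
#include <cstdint>

// c10::irange yields indices typed like its bound, so an int64_t bound gives
// 64-bit induction variables and avoids 32-bit overflow in index math.
void copy_row(float* dst, const float* src, int64_t n) {
  for (const auto i : c10::irange(n)) {  // i is int64_t because n is
    dst[i] = src[i];
  }
}
```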

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152232
Approved by: https://github.com/Skylion007
2025-04-28 07:19:26 +00:00
9480ed4cd3 Fix typos in multiple files (#152254)
Fix typos in multiple files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152254
Approved by: https://github.com/Skylion007
2025-04-26 17:18:39 +00:00