32 Commits

259e79e3ff Move Half to headeronly (#159172)
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465), so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly (a sketch of this pattern follows the list).
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).
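
A minimal sketch of the re-exposure pattern described in the bullets above, assuming the type stays defined in `c10` inside the header-only file and is aliased into `torch::headeronly` (the real headers contain far more):

```cpp
// torch/headeronly/util/Half.h (sketch)
#include <cstdint>

namespace c10 {
struct Half {
  uint16_t x;
  // ... constructors, conversions, operators ...
};
} // namespace c10

namespace torch::headeronly {
using c10::Half;  // re-expose without breaking existing c10::Half users
} // namespace torch::headeronly

// c10/util/Half.h (sketch) then simply includes the header-only file, so
// existing `#include <c10/util/Half.h>` call sites keep compiling.
```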

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-30 16:11:58 +00:00
eaadd1282c Revert "Move Half to headeronly (#159172)"
This reverts commit 6d0f4566e2b6e05369d8bb6c0d0e83a0eee982aa.

Reverted https://github.com/pytorch/pytorch/pull/159172 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/16613893793/job/47002486679) [HUD commit link](6d0f4566e2).  Note to self: why isn't Dr. CI updating ([comment](https://github.com/pytorch/pytorch/pull/159172#issuecomment-3136769493))
2025-07-30 15:10:26 +00:00
6d0f4566e2 Move Half to headeronly (#159172)
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465), so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-30 05:02:13 +00:00
af4da0799c [PyTorch] Half: don't disable direct conversion to/from float on mobile (#130465)
As far as I can tell, `FCVT` (https://developer.arm.com/documentation/ddi0602/2024-06/SIMD-FP-Instructions/FCVT--Floating-point-convert-precision--scalar--?lang=en)
is part of the base aarch64 instruction set, so it should work fine on mobile.

Differential Revision: [D59589733](https://our.internmc.facebook.com/intern/diff/D59589733/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130465
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-07-12 19:46:30 +00:00
2240018c03 Construct c10::Half from float16_t on ARMv8 (#120425)
By hiding float32 constructors and exposing float16 ones. This allows the compiler to do implicit conversions as needed and, in safe cases, optimize out unneeded upcasts to fp32; see the example [below](https://godbolt.org/z/5TKnY4cos):
```cpp
#include <arm_neon.h>

#ifndef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
#error Ieeee
#endif

float16_t sum1(float16_t x, float16_t y) {
    return x + y;
}

float16_t sum2(float16_t x, float16_t y) {
    return static_cast<float>(x) + static_cast<float>(y);
}
```
Both sum variants are compiled to a single scalar fp16 add if built for a platform that supports fp16 arithmetic:
```
sum1(half, half):                            // @sum1(half, half)
        fadd    h0, h0, h1
        ret
sum2(half, half):                            // @sum2(half, half)
        fadd    h0, h0, h1
        ret
```

Fixes build errors after #119483 in some aarch64 configurations that claim FP16 support but do not define `_Float16`.
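
A hedged sketch of the constructor approach (hypothetical `MyHalf`, not the actual c10::Half definition), assuming an AArch64 target:

```cpp
#include <arm_neon.h>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for c10::Half: expose a float16_t constructor and
// conversion so halves can flow to/from fp16 code without bouncing through fp32.
struct MyHalf {
  uint16_t bits;
#ifdef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
  MyHalf(float16_t v) { std::memcpy(&bits, &v, sizeof(bits)); }
  operator float16_t() const {
    float16_t v;
    std::memcpy(&v, &bits, sizeof(v));
    return v;
  }
#endif
};

#ifdef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
float16_t twice(MyHalf a) {
  float16_t v = a;  // implicit MyHalf -> float16_t conversion
  return v + v;     // scalar fp16 add, no round trip through fp32
}
#endif
```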

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120425
Approved by: https://github.com/mikekgfb, https://github.com/atalman, https://github.com/snadampal
2024-02-23 04:22:45 +00:00
f9eb66e16d [BE][EZ] Flatten preprocessor hierarchy (#120422)
Instead of
```cpp
#if defined(foo)
#else
#if defined(bar)
#else
#endif
#endif
```
use
```cpp
#if defined(foo)
#elif defined(bar)
#else
#endif
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120422
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/Skylion007
2024-02-22 22:38:08 +00:00
f2452e98a6 Revert "Native Half on ARM (#119483)"
This reverts commit 8f3fd79b23d483e846537b62f49111696d117870.

Reverted https://github.com/pytorch/pytorch/pull/119483 on behalf of https://github.com/malfet due to Broke nightly arm builds (and will be breaking runtime), as F16 arithmetic is ARMv8.2 only, see https://github.com/pytorch/pytorch/actions/runs/8000968963/job/21851281141 ([comment](https://github.com/pytorch/pytorch/pull/119483#issuecomment-1959944948))
2024-02-22 17:41:55 +00:00
8f3fd79b23 Native Half on ARM (#119483)
Summary: Native Half on ARM

Test Plan: sandcastle

Differential Revision: D53585776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119483
Approved by: https://github.com/ezyang, https://github.com/jgong5
2024-02-21 17:46:16 +00:00
4eefe7285a Use ARMV8 fconv insns to speed up scalar fp16<->fp32 (#120012)
Thanks to a discussion with @mikekgfb, I've realized that FP16_ARITH is available by default on Apple Silicon, so let's use it to speed up the portable but slow bit-mashing algorithm implemented as `c10::detail::fp16_ieee_from_fp32_value` with the following implicit conversion routine:
```cpp
float sve_fp16_to_fp32_value(uint16_t h) {
  union {
     uint16_t h;
     float16_t f16;
  } x = {h};
  return x.f16;
}
```
that according to the https://godbolt.org/z/8s14GvEjo is turned into [`fcvt s0,h0`](https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/FCVT--Floating-point-Convert-precision--scalar--?lang=en)

As a result, the very slow and naive [`torch.mm`](edd9ddf73f/aten/src/ATen/native/cpu/BlasKernel.cpp (L108)) runs 3x faster: from 85 ms to 27 ms (measured by running e41341df2d/benchmarks/benchmark_torch_mm.py )

This is a reland of https://github.com/pytorch/pytorch/pull/119895 that got reverted because it was not buildable using Jetson toolkit

"Fixed" the problem by guarding the fast conversions with `!defined(__CUDACC__)`  (for internal folks, tested it by running `buck build @arvr/mode/embedded/jetson/linux/opt-stripped //xplat/caffe2:caffe2_ops_cuda_ovrsource` )
But also, extended the conversion to all AARHC64 platforms, not just the ones that support FP16 arithmetic extensions (i.e. ARMv8.2)
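
A hedged sketch of the guarded fast path described above (macro spellings are the usual compiler-defined ones; the portable fallback body is elided):

```cpp
#include <cstdint>
#include <cstring>

float fp16_bits_to_fp32(uint16_t h) {
#if defined(__aarch64__) && !defined(__CUDACC__)
  // Hardware path: reinterpret the bits as a half and let the compiler emit
  // a single `fcvt s0, h0`.
  __fp16 f16;
  std::memcpy(&f16, &h, sizeof(f16));
  return static_cast<float>(f16);
#else
  // Portable path: the existing bit-mashing conversion (elided in this sketch).
  return 0.0f;  // placeholder only
#endif
}
```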

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120012
Approved by: https://github.com/huydhn
2024-02-16 03:04:06 +00:00
26b6de43e5 Revert "Use ARMV8.2 scalar fp16<->fp32 conversion (#119895)" (#120001)
This reverts commit d833e2f2364a01c6fdab689a8bb5bbf55a5b60f7.

This is failing some RL builds internally using clang 13 D53791577

https://github.com/pytorch/pytorch/pull/119895#issuecomment-1946859332.  The bot doesn't like a commit being merged into the stack base and fails to revert the PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120001
Approved by: https://github.com/malfet
2024-02-15 19:41:51 +00:00
d833e2f236 Use ARMV8.2 scalar fp16<->fp32 conversion (#119895)
Thanks to a discussion with @mikekgfb, I've realized that SVE is available by default on Apple Silicon, so let's use it to speed up the portable but slow bit-mashing algorithm implemented as `c10::detail::fp16_ieee_from_fp32_value` with the following implicit conversion routine:
```cpp
float sve_fp16_to_fp32_value(uint16_t h) {
  union {
     uint16_t h;
     float16_t f16;
  } x = {h};
  return x.f16;
}
```
that according to the https://godbolt.org/z/8s14GvEjo is turned into [`fcvt s0,h0`](https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/FCVT--Floating-point-convert-precision--predicated--)

As a result, the very slow and naive [`torch.mm`](edd9ddf73f/aten/src/ATen/native/cpu/BlasKernel.cpp (L108)) runs 3x faster: from 85 ms to 27 ms (measured by running e41341df2d/benchmarks/benchmark_torch_mm.py )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119895
Approved by: https://github.com/mikekgfb
ghstack dependencies: #119892
2024-02-14 23:42:53 +00:00
813246c554 Add scalar conversion using avx instructions for half (#102140)
### Motivation

Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float. There is no direct data-type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.
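
A hedged sketch of single-value conversion through the F16C/AVX intrinsics (illustrative only, assuming a build with `-mf16c`; not the exact code added by this PR):

```cpp
#include <immintrin.h>
#include <cstdint>

// Convert one IEEE fp16 value (raw bits) to float using F16C.
static inline float fp16_to_fp32_scalar(uint16_t h) {
  return _mm_cvtss_f32(_mm_cvtph_ps(_mm_cvtsi32_si128(h)));
}

// Convert one float to IEEE fp16 bits (round to nearest even).
static inline uint16_t fp32_to_fp16_scalar(float f) {
  __m128i v = _mm_cvtps_ph(_mm_set_ss(f), _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
  return static_cast<uint16_t>(_mm_cvtsi128_si32(v));
}
```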

### Testing
Tested maxpool and compared with the results of #98819.
Single socket (28 cores):

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 5.07165 | 5.418 | 0.5798 | 0.5123 | 1.373694951 | 3.430786
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 1.37455 | 1.2505 | 8.8336 | 9.7684 | 1.373635008 | 4.132924
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 28.72 | 30.7069 | 3.813 | 3.75 | 1.31977124 | 2.783006
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 4.5783 | 4.703 | 4.703 | 5.1 | 1.028980189 | 3.1293
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 13.896 | 14.8138 | 1.6635 | 1.6274 | 1.298704663 | 2.982699
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 2.11291 | 2.1158 | 2.26778 | 2.272 | 0.951105348 | 3.179012
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 0.4204 | 0.3843 | 0.0649 | 0.0633 | 2.102711703 | 1.779492
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 0.1134 | 0.11 | 0.1476 | 0.143 | 2.23042328 | 3.612398

Single core:

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 124.413 | 114.44 | 10.553 | 11.2486 | 1.31395433 | 3.923844
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 28.99 | 28.0781 | 9.5092 | 10.9258 | 1.324296999 | 3.888377
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 640.8276 | 591.964 | 59.18776 | 60.854 | 1.334956391 | 3.704458
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 88.57 | 90.214 | 54.358 | 59.205 | 1.031258214 | 3.75285
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 318.6197 | 285.155 | 28.4999 | 29.4387 | 1.315298144 | 3.759747
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 31.3981 | 34.0544 | 25.6557 | 28.7811 | 1.068505738 | 3.841587
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 8.87882 | 8.207 | 0.386056 | 0.3939 | 1.567866 | 3.50387
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 2.4167 | 2.38295 | 0.3769 | 0.4066 | 1.39402491 | 3.30061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102140
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/cpuhrsch
2023-08-30 13:26:53 +00:00
e0f1fe102a Revert "Add scalar conversion using avx instructions for half (#102140)"
This reverts commit 1d6a44656755c89f4f9a878865dcb0ac39af9a74.

Reverted https://github.com/pytorch/pytorch/pull/102140 on behalf of https://github.com/ZainRizvi due to Sorry, this is still breaking internal builds. Specifically, the dynamo test test_repros.py::DynamicShapesReproTests::test_odict_get_item_index_name ([comment](https://github.com/pytorch/pytorch/pull/102140#issuecomment-1686684117))
2023-08-21 16:51:50 +00:00
1d6a446567 Add scalar conversion using avx instructions for half (#102140)
### Motivation

Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float. There is no direct data-type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.

### Testing
Tested maxpool and compared with the results of #98819.
Single socket (28 cores):

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 5.07165 | 5.418 | 0.5798 | 0.5123 | 1.373694951 | 3.430786
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 1.37455 | 1.2505 | 8.8336 | 9.7684 | 1.373635008 | 4.132924
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 28.72 | 30.7069 | 3.813 | 3.75 | 1.31977124 | 2.783006
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 4.5783 | 4.703 | 4.703 | 5.1 | 1.028980189 | 3.1293
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 13.896 | 14.8138 | 1.6635 | 1.6274 | 1.298704663 | 2.982699
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 2.11291 | 2.1158 | 2.26778 | 2.272 | 0.951105348 | 3.179012
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 0.4204 | 0.3843 | 0.0649 | 0.0633 | 2.102711703 | 1.779492
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 0.1134 | 0.11 | 0.1476 | 0.143 | 2.23042328 | 3.612398

Single core:

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 124.413 | 114.44 | 10.553 | 11.2486 | 1.31395433 | 3.923844
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 28.99 | 28.0781 | 9.5092 | 10.9258 | 1.324296999 | 3.888377
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 640.8276 | 591.964 | 59.18776 | 60.854 | 1.334956391 | 3.704458
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 88.57 | 90.214 | 54.358 | 59.205 | 1.031258214 | 3.75285
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 318.6197 | 285.155 | 28.4999 | 29.4387 | 1.315298144 | 3.759747
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 31.3981 | 34.0544 | 25.6557 | 28.7811 | 1.068505738 | 3.841587
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 8.87882 | 8.207 | 0.386056 | 0.3939 | 1.567866 | 3.50387
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 2.4167 | 2.38295 | 0.3769 | 0.4066 | 1.39402491 | 3.30061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102140
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/cpuhrsch
2023-08-18 04:07:59 +00:00
6d86a255e6 Revert "Add scalar conversion using avx instructions for half (#102140)"
This reverts commit 888bdddb1ed0f3bfbbfc964f3b6080b0ea431dfd.

Reverted https://github.com/pytorch/pytorch/pull/102140 on behalf of https://github.com/jeanschmidt due to This is breaking internal tests @cpuhrsch can share more context and help with a follow up ([comment](https://github.com/pytorch/pytorch/pull/102140#issuecomment-1660686075))
2023-08-01 16:35:23 +00:00
888bdddb1e Add scalar conversion using avx instructions for half (#102140)
### Motivation

Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float. There is no direct data-type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.

### Testing
Tested maxpool and compared with the results of #98819.
Single socket (28 cores):

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 5.07165 | 5.418 | 0.5798 | 0.5123 | 1.373694951 | 3.430786
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 1.37455 | 1.2505 | 8.8336 | 9.7684 | 1.373635008 | 4.132924
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 28.72 | 30.7069 | 3.813 | 3.75 | 1.31977124 | 2.783006
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 4.5783 | 4.703 | 4.703 | 5.1 | 1.028980189 | 3.1293
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 13.896 | 14.8138 | 1.6635 | 1.6274 | 1.298704663 | 2.982699
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 2.11291 | 2.1158 | 2.26778 | 2.272 | 0.951105348 | 3.179012
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 0.4204 | 0.3843 | 0.0649 | 0.0633 | 2.102711703 | 1.779492
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 0.1134 | 0.11 | 0.1476 | 0.143 | 2.23042328 | 3.612398

Single core:

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 124.413 | 114.44 | 10.553 | 11.2486 | 1.31395433 | 3.923844
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 28.99 | 28.0781 | 9.5092 | 10.9258 | 1.324296999 | 3.888377
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 640.8276 | 591.964 | 59.18776 | 60.854 | 1.334956391 | 3.704458
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 88.57 | 90.214 | 54.358 | 59.205 | 1.031258214 | 3.75285
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 318.6197 | 285.155 | 28.4999 | 29.4387 | 1.315298144 | 3.759747
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 31.3981 | 34.0544 | 25.6557 | 28.7811 | 1.068505738 | 3.841587
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 8.87882 | 8.207 | 0.386056 | 0.3939 | 1.567866 | 3.50387
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 2.4167 | 2.38295 | 0.3769 | 0.4066 | 1.39402491 | 3.30061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102140
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/cpuhrsch
2023-07-30 11:25:28 +00:00
fe99d39fbd migrate PyTorch to c10::bit_cast (#98418)
Use the standardized version.
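
A hedged sketch of the idea behind `c10::bit_cast` (a memcpy-based stand-in with the same shape as C++20 `std::bit_cast`; the actual c10 implementation may differ):

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret an object's bytes as another type of the same size without
// undefined behavior.
template <class To, class From>
To bit_cast_sketch(const From& src) {
  static_assert(sizeof(To) == sizeof(From), "sizes must match");
  To dst;
  std::memcpy(&dst, &src, sizeof(To));
  return dst;
}

// Typical use in fp16 conversion code: inspect the raw bits of a float.
uint32_t float_to_bits(float f) {
  return bit_cast_sketch<uint32_t>(f);
}
```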

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98418
Approved by: https://github.com/ezyang
2023-04-06 19:38:06 +00:00
e03ac0ee8c Add bf16 and change header file include path (#91838)
# Motivation
We would like to add the bfloat16 header file to PyTorch so that both PyTorch and Intel Extension for PyTorch support the bfloat16 data type.

# Solution
- Note that bfloat16 is an Intel extension implemented in the DPC++ compiler rather than standard SYCL, so we need to guarantee that the bfloat16 header is included only when using the DPC++ compiler. Please refer to [SYCL 2020 feature test macros](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_feature_test_macros). The Intel DPC++ compiler uses [SYCL_EXT_ONEAPI_BFLOAT16_MATH_FUNCTIONS](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc) to check for the bfloat16 feature (see the sketch after this list).
- Refer to [intel/llvm](59dd38795c/clang/lib/Basic/Version.cpp (L129)): SYCL_LANGUAGE_VERSION is defined for both SYCL 1.2.1 and SYCL 2020, but CL_SYCL_LANGUAGE_VERSION is only defined for SYCL 1.2.1. So we check CL_SYCL_LANGUAGE_VERSION first (SYCL 1.2.1), and if it is not defined, check SYCL_LANGUAGE_VERSION (SYCL 2020). This guarantees compatibility with both SYCL 1.2.1 and SYCL 2020.
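
A hedged sketch of the bfloat16 guard from the first bullet (the code inside is omitted; the helper macro name is hypothetical, for illustration only):

```cpp
// Guard DPC++-only bfloat16 support behind the SYCL 2020 feature-test macro,
// so non-DPC++ SYCL compilers never see the extension.
#if defined(SYCL_EXT_ONEAPI_BFLOAT16_MATH_FUNCTIONS)
#define SKETCH_HAS_SYCL_BF16 1
// ... conversions through the DPC++ bfloat16 extension type go here ...
#else
#define SKETCH_HAS_SYCL_BF16 0
// ... portable bit-manipulation fallback ...
#endif
```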

# Additional
No UT is needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91838
Approved by: https://github.com/ezyang
2023-01-11 15:18:56 +00:00
700941f683 Fixup c10 headers with clang-tidy (#91407)
Clang-tidy was not applied properly to headers in c10, as documented in #91406. These are the easy automated fixes that came out of applying clang-tidy to the c10 part of the code base. cc @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91407
Approved by: https://github.com/ezyang
2022-12-28 11:12:22 +00:00
92a6b970ba Be compatible with SYCL 2020 and SYCL1.2.1 for sycl.hpp (#83259)
- In SYCL2020, SYCL provides one standard header file: <sycl/sycl.hpp>, which needs to be included in every translation unit that uses the SYCL programming API.

- For compatibility with SYCL 1.2.1, SYCL provides another standard header file: <CL/sycl.hpp>, which can be included in place of <sycl/sycl.hpp>.

- SYCL documents this change in [doc](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:headers-and-namespaces)(4.3).

- In SYCL 2020, SYCL_LANGUAGE_VERSION expands to an integer reflecting the version number and revision of the SYCL language supported by the implementation. In SYCL 1.2.1 the corresponding macro is CL_SYCL_LANGUAGE_VERSION, so these two macros can be used to distinguish SYCL 1.2.1 from SYCL 2020 (see the sketch below).

- SYCL 2020 doc: https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf
- SYCL 1.2.1 doc: https://registry.khronos.org/SYCL/specs/sycl-1.2.1.pdf
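
A hedged sketch of the header selection this implies (the real header may arrange the guards differently):

```cpp
// Prefer the SYCL 1.2.1 macro first, since only SYCL 1.2.1 defines it; fall
// back to the SYCL 2020 macro and header otherwise. Neither defined means a
// non-SYCL compiler, so nothing is included.
#if defined(CL_SYCL_LANGUAGE_VERSION)
#include <CL/sycl.hpp>    // SYCL 1.2.1 spelling
#elif defined(SYCL_LANGUAGE_VERSION)
#include <sycl/sycl.hpp>  // SYCL 2020 spelling
#endif
```
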
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83259
Approved by: https://github.com/malfet
2022-09-06 18:38:30 +00:00
84564f2fab Enhance negative operator for SYCL half conversion (#79850)
Enhance the negation operator for SYCL half conversion with a `sycl::bit_cast`-based implementation.
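
A hedged sketch of the idea (hypothetical `HalfBits` stand-in, not the actual c10::Half code path): reinterpret the stored bits as `sycl::half` with `sycl::bit_cast`, negate natively, and cast back, avoiding a round trip through float.

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

struct HalfBits {
  uint16_t x;  // raw IEEE fp16 bits
};

inline HalfBits negate(HalfBits a) {
  sycl::half h = sycl::bit_cast<sycl::half>(a.x);
  return HalfBits{sycl::bit_cast<uint16_t>(-h)};
}
```
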
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79850
Approved by: https://github.com/ngimel
2022-06-21 18:16:02 +00:00
68e012b023 Optimize half conversion for SYCL kernel
## Motivation:
Add support for SYCL half implicit/explicit conversion in SYCL kernels.

## Additional Context:
The SYCL compiler suggests using the macro `SYCL_LANGUAGE_VERSION` instead of `__SYCL_DEVICE_ONLY__` in the current version, unless device- and host-specific implementations of the same function are necessary.
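
A hedged sketch of a conversion helper guarded by `SYCL_LANGUAGE_VERSION` (illustrative only; the real header's code differs):

```cpp
#include <cstdint>

#if defined(SYCL_LANGUAGE_VERSION)
#include <sycl/sycl.hpp>

// With a SYCL compiler, convert through the native sycl::half type instead of
// the portable bit-mashing fallback.
inline float half_bits_to_float(uint16_t bits) {
  return static_cast<float>(sycl::bit_cast<sycl::half>(bits));
}

inline uint16_t float_to_half_bits(float f) {
  return sycl::bit_cast<uint16_t>(static_cast<sycl::half>(f));
}
#endif
```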

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76515
Approved by: https://github.com/ezyang
2022-05-04 00:57:03 +00:00
8f4cec2231 [warnings][Caffe2] Suppress warnings in caffe2 headers (#71196)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71196

`caffe2` headers contain code that can elicit warnings when built with strict compiler flags. Rather than forcing downstream/consuming code to weaken its compiler flags, suppress those warnings in the header using `#pragma clang diagnostic` suppressions.
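
A hedged sketch of the suppression pattern (the specific warning flag is illustrative; the real headers suppress whichever warnings their code triggers):

```cpp
#if defined(__clang__)
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wimplicit-int-float-conversion"
#endif

// ... header contents that would otherwise trigger warnings under strict flags ...

#if defined(__clang__)
#pragma clang diagnostic pop
#endif
```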

Test Plan: CI Pass

Reviewed By: malfet

Differential Revision: D33536233

fbshipit-source-id: 74404e7a5edaf244f79f7a0addd991a84442a31f
2022-01-12 10:16:35 -08:00
ba8c1fc648 Add Half conversion of bit cast for SYCL kernel (#64340)
Summary:
## Motivation
Enhance the performance of Half/float conversion in SYCL kernels.

## Solution
Add the native SYCL half type to help convert the half from/to float in the kernel code.

## Additional Context
`__SYCL_DEVICE_ONLY__` is a macro that is only defined when compiling kernel code for the SYCL backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64340

Reviewed By: gchanan

Differential Revision: D30720823

Pulled By: ezyang

fbshipit-source-id: e7e770d02df5b2d45da61d2fed3ba59383b3dc3a
2021-09-08 08:25:47 -07:00
44cc873fba [PyTorch] Autoformat c10 (#56830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56830

Opt into formatting on GitHub and format everything. This is a trial run before turning on formatting for more and eventually all of the codebase.

Test Plan: CI

Reviewed By: zertosh

Differential Revision: D27979080

fbshipit-source-id: a80f0c48691c08ae8ca0af06377b87e6a2351151
2021-04-30 21:23:28 -07:00
b150f150ba Add division overload with rounding_mode selection (#51706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50280

As mentioned in gh-43874, this adds a `rounding_mode={'true', 'trunc', 'floor'}`
argument so `torch.div` can be used as a replacement for `floor_divide` during
the transitional period.

I've included dedicated kernels for truncated and floor division which
aren't strictly necessary for float, but do perform significantly better (~2x) than
doing true division followed by a separate rounding kernel.

Note: I introduce new overloads for `aten::div` instead of just adding a default
`rounding_mode` because various JIT passes rely on the exact operator schema.
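
For intuition, a hedged scalar illustration of how the three rounding modes differ on negative operands (not the actual kernels):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  double a = -7.0, b = 2.0;
  double true_div  = a / b;               // -3.5 (rounding_mode='true')
  double trunc_div = std::trunc(a / b);   // -3.0, round toward zero ('trunc')
  double floor_div = std::floor(a / b);   // -4.0, round toward -inf ('floor')
  std::printf("%g %g %g\n", true_div, trunc_div, floor_div);
  return 0;
}
```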

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26123271

Pulled By: mruberry

fbshipit-source-id: 51a83717602114597ec9c4d946e35a392eb01d46
2021-02-04 13:08:36 -08:00
5ab9593098 torch.reciprocal: promote integer inputs to float (#49102)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49091

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49102

Reviewed By: VitalyFedyunin

Differential Revision: D25639541

Pulled By: soulitzer

fbshipit-source-id: 1dd360bd7b77f106d606143d8d3961610bac8cb7
2020-12-18 16:17:30 -08:00
d035d05080 [pytorch] expose __ldg(const Half* ptr) to Clang in host mode (#38151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38151

We need to expose this method to Clang unconditionally when building CUDA, otherwise it would error on device code calling `__ldg` with `Half*`.

Test Plan:
```
buck build -c fbcode.caffe2_use_mpi=1 -c fbcode.cuda_use_clang=true mode/opt //experimental/training_supercomputer/trainer/hpc_pt:trainer
```

Reviewed By: ngimel

Differential Revision: D21481297

fbshipit-source-id: aacfe7de2cdc8542908249081ddb58170b1e35ff
2020-05-21 22:18:32 -07:00
a41ff31702 Correctly gate __CUDA_ARCH__ with defined() (#25729)
Summary:
Evaluating undefined preprocessor macros causes errors on some compilers/configs. There is an ungated use of `__CUDA_ARCH__` in caffe2 that is inconsistent with the rest of the file and should be fixed anyway, because it is causing issues in ovrsource.
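
A hedged sketch of the defined()-gating pattern (the architecture check is illustrative, not the exact caffe2 code):

```cpp
// Always guard __CUDA_ARCH__ with defined() so host-only compilers with
// strict flags never evaluate an undefined macro.
inline bool has_native_half_math() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
  return true;   // device code on sm_53+ has native fp16 arithmetic
#else
  return false;  // host code (or older GPUs) falls back to float math
#endif
}
```
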
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25729

Test Plan: contbuilds

Differential Revision: D17211552

Pulled By: akrieger

fbshipit-source-id: 499b123894b255f37ff68079c4ba3650b1599a5c
2019-09-06 09:42:15 -07:00
a15845555c Negate halves on GPU using __hneg() when possible, instead of using float conversion.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23626

Test Plan: Imported from OSS

Differential Revision: D16656730

Pulled By: ezyang

fbshipit-source-id: 7e1f4e334f484a3ed4392949ff7679cefd67a74e
2019-08-05 16:21:38 -07:00
8420a2025b Turn the Half::from_bits into a constexpr function to avoid unresolve… (#17661)
Summary: Turn `Half::from_bits` into a constexpr function to avoid unresolved symbol errors when building in DEBUG mode.
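
A hedged sketch of the pattern with a hypothetical `HalfLike` stand-in: once the tag factory is constexpr, unoptimized builds can fold the call at compile time instead of referencing an out-of-line symbol.

```cpp
#include <cstdint>

struct HalfLike {
  uint16_t x;
  struct from_bits_t {};
  static constexpr from_bits_t from_bits() { return from_bits_t{}; }
  constexpr HalfLike(uint16_t bits, from_bits_t) : x(bits) {}
};

// Usable in constant expressions even in DEBUG builds.
constexpr HalfLike kOne(0x3C00, HalfLike::from_bits());  // fp16 bit pattern for 1.0f
```
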
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17661

Differential Revision: D14319610

Pulled By: soumith

fbshipit-source-id: 6c508a37155e29260f403d7174f343aa1ff32385
2019-03-05 07:31:38 -08:00
d408324350 Move files to/from c10/core and c10/util (#15316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15316

This starts cleaning up the files in c10 according to the module structure we decided on.

Move to c10/util:
- Half.h, Half-inl.h, Half.cpp, bitcasts.h

Move to c10/core:
- Device.h, Device.cpp
- DeviceType.h, DeviceType.cpp

i-am-not-moving-c2-to-c10

Reviewed By: dzhulgakov

Differential Revision: D13498493

fbshipit-source-id: dfcf1c490474a12ab950c72ca686b8ad86428f63
2019-01-10 16:22:22 -08:00