32 Commits

259e79e3ff Move Half to headeronly (#159172)
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465), so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly (a sketch of this pattern follows the list).
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).
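
A minimal sketch of the re-exposure pattern described in the bullets above, assuming the type stays defined in `c10` inside the header-only file and is aliased into `torch::headeronly` (the real headers contain far more):

```cpp
// torch/headeronly/util/Half.h (sketch)
#include <cstdint>

namespace c10 {
struct Half {
  uint16_t x;
  // ... constructors, conversions, operators ...
};
} // namespace c10

namespace torch::headeronly {
using c10::Half;  // re-expose without breaking existing c10::Half users
} // namespace torch::headeronly

// c10/util/Half.h (sketch) then simply includes the header-only file, so
// existing `#include <c10/util/Half.h>` call sites keep compiling.
```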

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-30 16:11:58 +00:00
eaadd1282c Revert "Move Half to headeronly (#159172)"
This reverts commit 6d0f4566e2b6e05369d8bb6c0d0e83a0eee982aa.

Reverted https://github.com/pytorch/pytorch/pull/159172 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/16613893793/job/47002486679) [HUD commit link](6d0f4566e2).  Note to self: why isn't Dr. CI updating ([comment](https://github.com/pytorch/pytorch/pull/159172#issuecomment-3136769493))
2025-07-30 15:10:26 +00:00
6d0f4566e2 Move Half to headeronly (#159172)
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465), so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
2025-07-30 05:02:13 +00:00
af4da0799c [PyTorch] Half: don't disable direct conversion to/from float on mobile (#130465)
As far as I can tell, `FCVT` (https://developer.arm.com/documentation/ddi0602/2024-06/SIMD-FP-Instructions/FCVT--Floating-point-convert-precision--scalar--?lang=en)
is part of the base aarch64 instruction set, so it should work fine on mobile.

Differential Revision: [D59589733](https://our.internmc.facebook.com/intern/diff/D59589733/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130465
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-07-12 19:46:30 +00:00
2240018c03 Construct c10::Half from float16_t on ARMv8 (#120425)
By hiding float32 constructors and exposing float16 ones. This allows the compiler to do implicit conversions as needed and, in safe cases, optimize out unneeded upcasts to fp32; see the example [below](https://godbolt.org/z/5TKnY4cos):
```cpp
#include <arm_neon.h>

#ifndef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
#error Ieeee
#endif

float16_t sum1(float16_t x, float16_t y) {
    return x + y;
}

float16_t sum2(float16_t x, float16_t y) {
    return static_cast<float>(x) + static_cast<float>(y);
}
```
Both sum variants are compiled to a single scalar fp16 add if built for a platform that supports fp16 arithmetic:
```
sum1(half, half):                            // @sum1(half, half)
        fadd    h0, h0, h1
        ret
sum2(half, half):                            // @sum2(half, half)
        fadd    h0, h0, h1
        ret
```

Fixes build errors after #119483 in some aarch64 configurations that claim FP16 support but do not define `_Float16`.
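
A hedged sketch of the constructor approach (hypothetical `MyHalf`, not the actual c10::Half definition), assuming an AArch64 target:

```cpp
#include <arm_neon.h>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for c10::Half: expose a float16_t constructor and
// conversion so halves can flow to/from fp16 code without bouncing through fp32.
struct MyHalf {
  uint16_t bits;
#ifdef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
  MyHalf(float16_t v) { std::memcpy(&bits, &v, sizeof(bits)); }
  operator float16_t() const {
    float16_t v;
    std::memcpy(&v, &bits, sizeof(v));
    return v;
  }
#endif
};

#ifdef __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
float16_t twice(MyHalf a) {
  float16_t v = a;  // implicit MyHalf -> float16_t conversion
  return v + v;     // scalar fp16 add, no round trip through fp32
}
#endif
```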

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120425
Approved by: https://github.com/mikekgfb, https://github.com/atalman, https://github.com/snadampal
2024-02-23 04:22:45 +00:00
f9eb66e16d [BE][EZ] Flatten preprocessor hierarchy (#120422)
Instead of
```cpp
#if defined(foo)
#else
#if defined(bar)
#else
#endif
#endif
```
use
```cpp
#if defined(foo)
#elif defined(bar)
#else
#endif
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120422
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/Skylion007
2024-02-22 22:38:08 +00:00
f2452e98a6 Revert "Native Half on ARM (#119483)"
This reverts commit 8f3fd79b23d483e846537b62f49111696d117870.

Reverted https://github.com/pytorch/pytorch/pull/119483 on behalf of https://github.com/malfet due to Broke nightly arm builds (and will be breaking runtime), as F16 arithmetic is ARMv8.2 only, see https://github.com/pytorch/pytorch/actions/runs/8000968963/job/21851281141 ([comment](https://github.com/pytorch/pytorch/pull/119483#issuecomment-1959944948))
2024-02-22 17:41:55 +00:00
8f3fd79b23 Native Half on ARM (#119483)
Summary: Native Half on ARM

Test Plan: sandcastle

Differential Revision: D53585776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119483
Approved by: https://github.com/ezyang, https://github.com/jgong5
2024-02-21 17:46:16 +00:00
4eefe7285a Use ARMV8 fconv insns to speed up scalar fp16<->fp32 (#120012)
Thanks to a discussion with @mikekgfb, I've realized that FP16_ARITH is available by default on Apple Silicon, so let's use it to speed up the portable but slow bit-mashing algorithm implemented as `c10::detail::fp16_ieee_from_fp32_value` with the following implicit conversion routine:
```cpp
float sve_fp16_to_fp32_value(uint16_t h) {
  union {
     uint16_t h;
     float16_t f16;
  } x = {h};
  return x.f16;
}
```
that according to the https://godbolt.org/z/8s14GvEjo is turned into [`fcvt s0,h0`](https://developer.arm.com/documentation/ddi0602/2023-12/SIMD-FP-Instructions/FCVT--Floating-point-Convert-precision--scalar--?lang=en)

As a result, the very slow and naive [`torch.mm`](edd9ddf73f/aten/src/ATen/native/cpu/BlasKernel.cpp (L108)) runs 3x faster: from 85 ms to 27 ms (measured by running e41341df2d/benchmarks/benchmark_torch_mm.py )

This is a reland of https://github.com/pytorch/pytorch/pull/119895 that got reverted because it was not buildable using Jetson toolkit

"Fixed" the problem by guarding the fast conversions with `!defined(__CUDACC__)`  (for internal folks, tested it by running `buck build @arvr/mode/embedded/jetson/linux/opt-stripped //xplat/caffe2:caffe2_ops_cuda_ovrsource` )
But also, extended the conversion to all AARHC64 platforms, not just the ones that support FP16 arithmetic extensions (i.e. ARMv8.2)
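
A hedged sketch of the guarded fast path described above (macro spellings are the usual compiler-defined ones; the portable fallback body is elided):

```cpp
#include <cstdint>
#include <cstring>

float fp16_bits_to_fp32(uint16_t h) {
#if defined(__aarch64__) && !defined(__CUDACC__)
  // Hardware path: reinterpret the bits as a half and let the compiler emit
  // a single `fcvt s0, h0`.
  __fp16 f16;
  std::memcpy(&f16, &h, sizeof(f16));
  return static_cast<float>(f16);
#else
  // Portable path: the existing bit-mashing conversion (elided in this sketch).
  return 0.0f;  // placeholder only
#endif
}
```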

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120012
Approved by: https://github.com/huydhn
2024-02-16 03:04:06 +00:00
26b6de43e5 Revert "Use ARMV8.2 scalar fp16<->fp32 conversion (#119895)" (#120001)
This reverts commit d833e2f2364a01c6fdab689a8bb5bbf55a5b60f7.

This is failing some RL builds internally using clang 13 D53791577

https://github.com/pytorch/pytorch/pull/119895#issuecomment-1946859332.  The bot doesn't like a commit being merged into the stack base and fails to revert the PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120001
Approved by: https://github.com/malfet
2024-02-15 19:41:51 +00:00
d833e2f236 Use ARMV8.2 scalar fp16<->fp32 conversion (#119895)
Thanks to a discussion with @mikekgfb, I've realized that SVE is available by default on Apple Silicon, so let's use it to speed up the portable but slow bit-mashing algorithm implemented as `c10::detail::fp16_ieee_from_fp32_value` with the following implicit conversion routine:
```cpp
float sve_fp16_to_fp32_value(uint16_t h) {
  union {
     uint16_t h;
     float16_t f16;
  } x = {h};
  return x.f16;
}
```
that according to the https://godbolt.org/z/8s14GvEjo is turned into [`fcvt s0,h0`](https://developer.arm.com/documentation/ddi0596/2021-12/SVE-Instructions/FCVT--Floating-point-convert-precision--predicated--)

As a result, the very slow and naive [`torch.mm`](edd9ddf73f/aten/src/ATen/native/cpu/BlasKernel.cpp (L108)) runs 3x faster: from 85 ms to 27 ms (measured by running e41341df2d/benchmarks/benchmark_torch_mm.py )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119895
Approved by: https://github.com/mikekgfb
ghstack dependencies: #119892
2024-02-14 23:42:53 +00:00
813246c554 Add scalar conversion using avx instructions for half (#102140)
### Motivation

Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float. There is no direct data-type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.
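
A hedged sketch of single-value conversion through the F16C/AVX intrinsics (illustrative only, assuming a build with `-mf16c`; not the exact code added by this PR):

```cpp
#include <immintrin.h>
#include <cstdint>

// Convert one IEEE fp16 value (raw bits) to float using F16C.
static inline float fp16_to_fp32_scalar(uint16_t h) {
  return _mm_cvtss_f32(_mm_cvtph_ps(_mm_cvtsi32_si128(h)));
}

// Convert one float to IEEE fp16 bits (round to nearest even).
static inline uint16_t fp32_to_fp16_scalar(float f) {
  __m128i v = _mm_cvtps_ph(_mm_set_ss(f), _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
  return static_cast<uint16_t>(_mm_cvtsi128_si32(v));
}
```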

### Testing
Tested maxpool and compared with the results of #98819.
Single socket (28 cores):

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 5.07165 | 5.418 | 0.5798 | 0.5123 | 1.373694951 | 3.430786
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 1.37455 | 1.2505 | 8.8336 | 9.7684 | 1.373635008 | 4.132924
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 28.72 | 30.7069 | 3.813 | 3.75 | 1.31977124 | 2.783006
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 4.5783 | 4.703 | 4.703 | 5.1 | 1.028980189 | 3.1293
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 13.896 | 14.8138 | 1.6635 | 1.6274 | 1.298704663 | 2.982699
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 2.11291 | 2.1158 | 2.26778 | 2.272 | 0.951105348 | 3.179012
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 0.4204 | 0.3843 | 0.0649 | 0.0633 | 2.102711703 | 1.779492
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 0.1134 | 0.11 | 0.1476 | 0.143 | 2.23042328 | 3.612398

Single core:

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 124.413 | 114.44 | 10.553 | 11.2486 | 1.31395433 | 3.923844
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 28.99 | 28.0781 | 9.5092 | 10.9258 | 1.324296999 | 3.888377
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 640.8276 | 591.964 | 59.18776 | 60.854 | 1.334956391 | 3.704458
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 88.57 | 90.214 | 54.358 | 59.205 | 1.031258214 | 3.75285
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 318.6197 | 285.155 | 28.4999 | 29.4387 | 1.315298144 | 3.759747
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 31.3981 | 34.0544 | 25.6557 | 28.7811 | 1.068505738 | 3.841587
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 8.87882 | 8.207 | 0.386056 | 0.3939 | 1.567866 | 3.50387
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 2.4167 | 2.38295 | 0.3769 | 0.4066 | 1.39402491 | 3.30061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102140
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/cpuhrsch
2023-08-30 13:26:53 +00:00
e0f1fe102a Revert "Add scalar conversion using avx instructions for half (#102140)"
This reverts commit 1d6a44656755c89f4f9a878865dcb0ac39af9a74.

Reverted https://github.com/pytorch/pytorch/pull/102140 on behalf of https://github.com/ZainRizvi due to Sorry, this is still breaking internal builds. Specifically, the dynamo test test_repros.py::DynamicShapesReproTests::test_odict_get_item_index_name ([comment](https://github.com/pytorch/pytorch/pull/102140#issuecomment-1686684117))
2023-08-21 16:51:50 +00:00
1d6a446567 Add scalar conversion using avx instructions for half (#102140)
### Motivation

Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float. There is no direct data-type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.

### Testing
Tested maxpool and compared with the results of #98819.
Single socket (28 cores):

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 5.07165 | 5.418 | 0.5798 | 0.5123 | 1.373694951 | 3.430786
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 1.37455 | 1.2505 | 8.8336 | 9.7684 | 1.373635008 | 4.132924
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 28.72 | 30.7069 | 3.813 | 3.75 | 1.31977124 | 2.783006
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 4.5783 | 4.703 | 4.703 | 5.1 | 1.028980189 | 3.1293
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 13.896 | 14.8138 | 1.6635 | 1.6274 | 1.298704663 | 2.982699
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 2.11291 | 2.1158 | 2.26778 | 2.272 | 0.951105348 | 3.179012
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 0.4204 | 0.3843 | 0.0649 | 0.0633 | 2.102711703 | 1.779492
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 0.1134 | 0.11 | 0.1476 | 0.143 | 2.23042328 | 3.612398

Single core:

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 124.413 | 114.44 | 10.553 | 11.2486 | 1.31395433 | 3.923844
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 28.99 | 28.0781 | 9.5092 | 10.9258 | 1.324296999 | 3.888377
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 640.8276 | 591.964 | 59.18776 | 60.854 | 1.334956391 | 3.704458
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 88.57 | 90.214 | 54.358 | 59.205 | 1.031258214 | 3.75285
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 318.6197 | 285.155 | 28.4999 | 29.4387 | 1.315298144 | 3.759747
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 31.3981 | 34.0544 | 25.6557 | 28.7811 | 1.068505738 | 3.841587
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 8.87882 | 8.207 | 0.386056 | 0.3939 | 1.567866 | 3.50387
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 2.4167 | 2.38295 | 0.3769 | 0.4066 | 1.39402491 | 3.30061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102140
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/cpuhrsch
2023-08-18 04:07:59 +00:00
6d86a255e6 Revert "Add scalar conversion using avx instructions for half (#102140)"
This reverts commit 888bdddb1ed0f3bfbbfc964f3b6080b0ea431dfd.

Reverted https://github.com/pytorch/pytorch/pull/102140 on behalf of https://github.com/jeanschmidt due to This is breaking internal tests @cpuhrsch can share more context and help with a follow up ([comment](https://github.com/pytorch/pytorch/pull/102140#issuecomment-1660686075))
2023-08-01 16:35:23 +00:00
888bdddb1e Add scalar conversion using avx instructions for half (#102140)
### Motivation

Scalar conversion between Half and Float on CPU is more time-consuming than BFloat16 <-> Float. There is no direct data-type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.

### Testing
Tested maxpool and compared with the results of #98819.
Single socket (28 cores):

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 5.07165 | 5.418 | 0.5798 | 0.5123 | 1.373694951 | 3.430786
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 1.37455 | 1.2505 | 8.8336 | 9.7684 | 1.373635008 | 4.132924
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 28.72 | 30.7069 | 3.813 | 3.75 | 1.31977124 | 2.783006
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 4.5783 | 4.703 | 4.703 | 5.1 | 1.028980189 | 3.1293
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 13.896 | 14.8138 | 1.6635 | 1.6274 | 1.298704663 | 2.982699
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 2.11291 | 2.1158 | 2.26778 | 2.272 | 0.951105348 | 3.179012
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 0.4204 | 0.3843 | 0.0649 | 0.0633 | 2.102711703 | 1.779492
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 0.1134 | 0.11 | 0.1476 | 0.143 | 2.23042328 | 3.612398

Single core:

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 124.413 | 114.44 | 10.553 | 11.2486 | 1.31395433 | 3.923844
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 28.99 | 28.0781 | 9.5092 | 10.9258 | 1.324296999 | 3.888377
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 640.8276 | 591.964 | 59.18776 | 60.854 | 1.334956391 | 3.704458
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 88.57 | 90.214 | 54.358 | 59.205 | 1.031258214 | 3.75285
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 318.6197 | 285.155 | 28.4999 | 29.4387 | 1.315298144 | 3.759747
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 31.3981 | 34.0544 | 25.6557 | 28.7811 | 1.068505738 | 3.841587
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 8.87882 | 8.207 | 0.386056 | 0.3939 | 1.567866 | 3.50387
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 2.4167 | 2.38295 | 0.3769 | 0.4066 | 1.39402491 | 3.30061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102140
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/cpuhrsch
2023-07-30 11:25:28 +00:00
fe99d39fbd migrate PyTorch to c10::bit_cast (#98418)
Use the standardized version.
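
A hedged sketch of the idea behind `c10::bit_cast` (a memcpy-based stand-in with the same shape as C++20 `std::bit_cast`; the actual c10 implementation may differ):

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret an object's bytes as another type of the same size without
// undefined behavior.
template <class To, class From>
To bit_cast_sketch(const From& src) {
  static_assert(sizeof(To) == sizeof(From), "sizes must match");
  To dst;
  std::memcpy(&dst, &src, sizeof(To));
  return dst;
}

// Typical use in fp16 conversion code: inspect the raw bits of a float.
uint32_t float_to_bits(float f) {
  return bit_cast_sketch<uint32_t>(f);
}
```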

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98418
Approved by: https://github.com/ezyang
2023-04-06 19:38:06 +00:00
e03ac0ee8c Add bf16 and change header file include path (#91838)
# Motivation
We would like to add the bfloat16 header file to PyTorch so that both PyTorch and Intel Extension for PyTorch support the bfloat16 data type.

# Solution
- Note that bfloat16 is an Intel extension implemented in the DPC++ compiler rather than standard SYCL, so we need to guarantee that the bfloat16 header is included only when using the DPC++ compiler. Please refer to [SYCL 2020 feature test macros](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_feature_test_macros). The Intel DPC++ compiler uses [SYCL_EXT_ONEAPI_BFLOAT16_MATH_FUNCTIONS](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc) to check for the bfloat16 feature (see the sketch after this list).
- Refer to [intel/llvm](59dd38795c/clang/lib/Basic/Version.cpp (L129)): SYCL_LANGUAGE_VERSION is defined for both SYCL 1.2.1 and SYCL 2020, but CL_SYCL_LANGUAGE_VERSION is only defined for SYCL 1.2.1. So we check CL_SYCL_LANGUAGE_VERSION first (SYCL 1.2.1), and if it is not defined, check SYCL_LANGUAGE_VERSION (SYCL 2020). This guarantees compatibility with both SYCL 1.2.1 and SYCL 2020.
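
A hedged sketch of the bfloat16 guard from the first bullet (the code inside is omitted; the helper macro name is hypothetical, for illustration only):

```cpp
// Guard DPC++-only bfloat16 support behind the SYCL 2020 feature-test macro,
// so non-DPC++ SYCL compilers never see the extension.
#if defined(SYCL_EXT_ONEAPI_BFLOAT16_MATH_FUNCTIONS)
#define SKETCH_HAS_SYCL_BF16 1
// ... conversions through the DPC++ bfloat16 extension type go here ...
#else
#define SKETCH_HAS_SYCL_BF16 0
// ... portable bit-manipulation fallback ...
#endif
```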

# Additional
No UT is needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91838
Approved by: https://github.com/ezyang
2023-01-11 15:18:56 +00:00
700941f683 Fixup c10 headers with clang-tidy (#91407)
Clang-tidy was not applied properly to headers in c10, as documented in #91406. These are the easy automated fixes that came out of applying clang-tidy to the c10 part of the code base. cc @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91407
Approved by: https://github.com/ezyang
2022-12-28 11:12:22 +00:00
92a6b970ba Be compatible with SYCL 2020 and SYCL1.2.1 for sycl.hpp (#83259)
- In SYCL2020, SYCL provides one standard header file: <sycl/sycl.hpp>, which needs to be included in every translation unit that uses the SYCL programming API.

- For compatibility with SYCL 1.2.1, SYCL provides another standard header file: <CL/sycl.hpp>, which can be included in place of <sycl/sycl.hpp>.

- SYCL documents this change in [doc](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:headers-and-namespaces)(4.3).

- In SYCL 2020, SYCL_LANGUAGE_VERSION expands to an integer reflecting the version number and revision of the SYCL language supported by the implementation. In SYCL 1.2.1 the corresponding macro is CL_SYCL_LANGUAGE_VERSION, so these two macros can be used to distinguish SYCL 1.2.1 from SYCL 2020 (see the sketch below).

- SYCL 2020 doc: https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf
- SYCL 1.2.1 doc: https://registry.khronos.org/SYCL/specs/sycl-1.2.1.pdf
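
A hedged sketch of the header selection this implies (the real header may arrange the guards differently):

```cpp
// Prefer the SYCL 1.2.1 macro first, since only SYCL 1.2.1 defines it; fall
// back to the SYCL 2020 macro and header otherwise. Neither defined means a
// non-SYCL compiler, so nothing is included.
#if defined(CL_SYCL_LANGUAGE_VERSION)
#include <CL/sycl.hpp>    // SYCL 1.2.1 spelling
#elif defined(SYCL_LANGUAGE_VERSION)
#include <sycl/sycl.hpp>  // SYCL 2020 spelling
#endif
```
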
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83259
Approved by: https://github.com/malfet
2022-09-06 18:38:30 +00:00
84564f2fab Enhance negative operator for SYCL half conversion (#79850)
Enhance the negation operator for SYCL half conversion with a `sycl::bit_cast`-based implementation.
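
A hedged sketch of the idea (hypothetical `HalfBits` stand-in, not the actual c10::Half code path): reinterpret the stored bits as `sycl::half` with `sycl::bit_cast`, negate natively, and cast back, avoiding a round trip through float.

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

struct HalfBits {
  uint16_t x;  // raw IEEE fp16 bits
};

inline HalfBits negate(HalfBits a) {
  sycl::half h = sycl::bit_cast<sycl::half>(a.x);
  return HalfBits{sycl::bit_cast<uint16_t>(-h)};
}
```
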
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79850
Approved by: https://github.com/ngimel
2022-06-21 18:16:02 +00:00
68e012b023 Optimize half conversion for SYCL kernel
## Motivation:
Add support for SYCL half implicit/explicit conversion in SYCL kernels.

## Additional Context:
The SYCL compiler suggests using the macro `SYCL_LANGUAGE_VERSION` instead of `__SYCL_DEVICE_ONLY__` in the current version, unless device- and host-specific implementations of the same function are necessary.
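
A hedged sketch of a conversion helper guarded by `SYCL_LANGUAGE_VERSION` (illustrative only; the real header's code differs):

```cpp
#include <cstdint>

#if defined(SYCL_LANGUAGE_VERSION)
#include <sycl/sycl.hpp>

// With a SYCL compiler, convert through the native sycl::half type instead of
// the portable bit-mashing fallback.
inline float half_bits_to_float(uint16_t bits) {
  return static_cast<float>(sycl::bit_cast<sycl::half>(bits));
}

inline uint16_t float_to_half_bits(float f) {
  return sycl::bit_cast<uint16_t>(static_cast<sycl::half>(f));
}
#endif
```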

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76515
Approved by: https://github.com/ezyang
2022-05-04 00:57:03 +00:00
8f4cec2231 [warnings][Caffe2] Suppress warnings in caffe2 headers (#71196)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71196

`caffe2` headers contain code that can elicit warnings when built with strict compiler flags. Rather than forcing downstream/consuming code to weaken its compiler flags, suppress those warnings in the header using `#pragma clang diagnostic` suppressions.
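
A hedged sketch of the suppression pattern (the specific warning flag is illustrative; the real headers suppress whichever warnings their code triggers):

```cpp
#if defined(__clang__)
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wimplicit-int-float-conversion"
#endif

// ... header contents that would otherwise trigger warnings under strict flags ...

#if defined(__clang__)
#pragma clang diagnostic pop
#endif
```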

Test Plan: CI Pass

Reviewed By: malfet

Differential Revision: D33536233

fbshipit-source-id: 74404e7a5edaf244f79f7a0addd991a84442a31f
2022-01-12 10:16:35 -08:00
ba8c1fc648 Add Half conversion of bit cast for SYCL kernel (#64340)
Summary:
## Motivation
Enhance the performance of Half/float conversion in SYCL kernels.

## Solution
Add the native SYCL half type to help convert the half from/to float in the kernel code.

## Additional Context
`__SYCL_DEVICE_ONLY__` is a macro that is only defined when compiling kernel code for the SYCL backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/64340

Reviewed By: gchanan

Differential Revision: D30720823

Pulled By: ezyang

fbshipit-source-id: e7e770d02df5b2d45da61d2fed3ba59383b3dc3a
2021-09-08 08:25:47 -07:00
44cc873fba [PyTorch] Autoformat c10 (#56830)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56830

Opt into formatting on GitHub and format everything. This is a trial run before turning on formatting for more and eventually all of the codebase.

Test Plan: CI

Reviewed By: zertosh

Differential Revision: D27979080

fbshipit-source-id: a80f0c48691c08ae8ca0af06377b87e6a2351151
2021-04-30 21:23:28 -07:00
b150f150ba Add division overload with rounding_mode selection (#51706)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50280

As mentioned in gh-43874, this adds a `rounding_mode={'true', 'trunc', 'floor'}`
argument so `torch.div` can be used as a replacement for `floor_divide` during
the transitional period.

I've included dedicated kernels for truncated and floor division which
aren't strictly necessary for float, but do perform significantly better (~2x) than
doing true division followed by a separate rounding kernel.

Note: I introduce new overloads for `aten::div` instead of just adding a default
`rounding_mode` because various JIT passes rely on the exact operator schema.
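
For intuition, a hedged scalar illustration of how the three rounding modes differ on negative operands (not the actual kernels):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  double a = -7.0, b = 2.0;
  double true_div  = a / b;               // -3.5 (rounding_mode='true')
  double trunc_div = std::trunc(a / b);   // -3.0, round toward zero ('trunc')
  double floor_div = std::floor(a / b);   // -4.0, round toward -inf ('floor')
  std::printf("%g %g %g\n", true_div, trunc_div, floor_div);
  return 0;
}
```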

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26123271

Pulled By: mruberry

fbshipit-source-id: 51a83717602114597ec9c4d946e35a392eb01d46
2021-02-04 13:08:36 -08:00
5ab9593098 torch.reciprocal: promote integer inputs to float (#49102)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49091

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49102

Reviewed By: VitalyFedyunin

Differential Revision: D25639541

Pulled By: soulitzer

fbshipit-source-id: 1dd360bd7b77f106d606143d8d3961610bac8cb7
2020-12-18 16:17:30 -08:00
d035d05080 [pytorch] expose __ldg(const Half* ptr) to Clang in host mode (#38151)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38151

We need to expose this method to Clang unconditionally when building CUDA, otherwise it would error on device code calling `__ldg` with `Half*`.

Test Plan:
```
buck build -c fbcode.caffe2_use_mpi=1 -c fbcode.cuda_use_clang=true mode/opt //experimental/training_supercomputer/trainer/hpc_pt:trainer
```

Reviewed By: ngimel

Differential Revision: D21481297

fbshipit-source-id: aacfe7de2cdc8542908249081ddb58170b1e35ff
2020-05-21 22:18:32 -07:00
a41ff31702 Correctly gate __CUDA_ARCH__ with defined() (#25729)
Summary:
Evaluating undefined preprocessor macros causes errors on some compilers/configs. There is an ungated use of `__CUDA_ARCH__` in caffe2 that is inconsistent with the rest of the file and should be fixed anyway, because it is causing issues in ovrsource.
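
A hedged sketch of the defined()-gating pattern (the architecture check is illustrative, not the exact caffe2 code):

```cpp
// Always guard __CUDA_ARCH__ with defined() so host-only compilers with
// strict flags never evaluate an undefined macro.
inline bool has_native_half_math() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
  return true;   // device code on sm_53+ has native fp16 arithmetic
#else
  return false;  // host code (or older GPUs) falls back to float math
#endif
}
```
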
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25729

Test Plan: contbuilds

Differential Revision: D17211552

Pulled By: akrieger

fbshipit-source-id: 499b123894b255f37ff68079c4ba3650b1599a5c
2019-09-06 09:42:15 -07:00
a15845555c Negate halves on GPU using __hneg() when possible, instead of using float conversion.
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23626

Test Plan: Imported from OSS

Differential Revision: D16656730

Pulled By: ezyang

fbshipit-source-id: 7e1f4e334f484a3ed4392949ff7679cefd67a74e
2019-08-05 16:21:38 -07:00
8420a2025b Turn the Half::from_bits into a constexpr function to avoid unresolve… (#17661)
Summary: Turn `Half::from_bits` into a constexpr function to avoid unresolved symbol errors when building in DEBUG mode.
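
A hedged sketch of the pattern with a hypothetical `HalfLike` stand-in: once the tag factory is constexpr, unoptimized builds can fold the call at compile time instead of referencing an out-of-line symbol.

```cpp
#include <cstdint>

struct HalfLike {
  uint16_t x;
  struct from_bits_t {};
  static constexpr from_bits_t from_bits() { return from_bits_t{}; }
  constexpr HalfLike(uint16_t bits, from_bits_t) : x(bits) {}
};

// Usable in constant expressions even in DEBUG builds.
constexpr HalfLike kOne(0x3C00, HalfLike::from_bits());  // fp16 bit pattern for 1.0f
```
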
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17661

Differential Revision: D14319610

Pulled By: soumith

fbshipit-source-id: 6c508a37155e29260f403d7174f343aa1ff32385
2019-03-05 07:31:38 -08:00
d408324350 Move files to/from c10/core and c10/util (#15316)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15316

This starts cleaning up the files in c10 according to the module structure we decided on.

Move to c10/util:
- Half.h, Half-inl.h, Half.cpp, bitcasts.h

Move to c10/core:
- Device.h, Device.cpp
- DeviceType.h, DeviceType.cpp

i-am-not-moving-c2-to-c10

Reviewed By: dzhulgakov

Differential Revision: D13498493

fbshipit-source-id: dfcf1c490474a12ab950c72ca686b8ad86428f63
2019-01-10 16:22:22 -08:00