This PR removes unnecessary "static" for C++ functions and variables in anonymous namespace as detected by clang-tidy. This enhances code readability. The related rules are planed to be enabled in follow-up PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165035
Approved by: https://github.com/Skylion007
Summary:
Currently, we assume that refcount_ and weakcount_ are always stored in an 8-byte aligned address right next to each other. Based on this assumption, we load 8 bytes in intrusive_ptr::reset_ to check the values of both counts. However, that assumption is not part of C++ language standard so it's essentially undefined behavior.
This change eliminates that assumption by combining refcount_ and weakcount_ in a single 64-bit count and we use the lower 32 bits for refcount_ and upper 32 bits for the weakcount_.
In addition to eliminating the undefined behavior, the change also eliminates the read of weakcount_ after decrementing refcount_ in intrusive_ptr::reset_. This claws back lost performance introduced in https://github.com/pytorch/pytorch/pull/162784 for non-final refcount_ decrementing.
Reviewed By: yfeldblum
Differential Revision: D82869192
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163394
Approved by: https://github.com/Skylion007
Summary:
Please see D21021645 for details about the optimization and why it's beneficial.
A similar change has been added to libstdc++ as well, see dbf8bd3c2f
Rollback Plan:
Reviewed By: yfeldblum
Differential Revision: D81960754
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162784
Approved by: https://github.com/swolchok
Summary: Relax fences for intrusive ptr's refcnt dec op for performance testing.
lock needs acquire when the op succeeds and relaxed if the op is not. In addition, the expire call and the following refcnt reads were merged to remove one extra read.
incref does not need any fences because the caller should already have a valid reference. use_count follows the same reasoning.
decref only needs a release fence to make sure every write op prior to it has finished. When the refcnt goes to zero, there should be a acquire fence to make sure no read op reads stale data before the object is destructed. However, microbenchmark showed that the optimal fence for decref is not performing noticeably better than the current decref with acq-rel, so we keep decref as-is.
This change should have no material impact on x86, but for Arm64 (and other CPUs with weak memory models), it should boost performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162072
Approved by: https://github.com/swolchok, https://github.com/yfeldblum
This PR is a big copy pasta from `c10/util/Float8*` -> `torch/headeronly/util/` which is why we are breaking PR sanity :C (sorry @albanD!).
Why is it not a clean copy paste?
- For BC reasons, we have to keep the old c10 file around so that OSS devs relying on those files can still get the same APIs
- Because we reexpose APIs that are headeronly through torch::headeronly, so there is an extra chunk of code in the new torch::headeronly files to do that.
Outside of the copy paste, I:
- changed the tests to call torch::headeronly instead of c10
- updated header_only_apis.txt
- added `// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)` to pass lint (which was previously skipped for -inl.h files)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159415
Approved by: https://github.com/albanD
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
Straightup copy pasta. Keeps APIs in c10 and reexposes them to torch::headeronly.
It is arguable that we should just get rid of some of these unused dtypes but that is outside the scope of this PR, which is meant to build up to ScalarType moving to headeronly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159302
Approved by: https://github.com/malfet, https://github.com/albanD
Summary:
D76832520 switched Executorch to use the caffe c10 headers. This copy contains a shadow, which is treated as an error for certain embedded compile flows.
Simple rename to avoid.
Test Plan:
CI
Rollback Plan:
Differential Revision: D77446104
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157107
Approved by: https://github.com/Skylion007
Summary: Undo highlevel BUCKification in favor of something more organized by moving it to the dir itself
Test Plan:
CI
Rollback Plan:
Reviewed By: swolchok
Differential Revision: D76920013
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156503
Approved by: https://github.com/swolchok
Summary:
Revive https://github.com/pytorch/pytorch/pull/138406. Only limit the scope to files in c10.
Summary from the original PR,
```
Looking in the code I see
// NB: __cplusplus doesn't work for MSVC, so for now MSVC always uses
// the "__declspec(deprecated)" implementation and not the C++14
// "[[deprecated]]" attribute. We tried enabling "[[deprecated]]" for C++14 on
// MSVC, but ran into issues with some older MSVC versions.
But looking at the MSVC C++ support table I see that the [[deprecated]] attribute is supported as of MSVC 2015 and that the vast majority of C++17 features became supported in MSVC 2015 or later.
Since PyTorch is C++17 now, I infer that PyTorch must not support versions of MSVC earlier than MSVC 2015, so the versions of MSVC supported by PyTorch must support [[deprecated]].
Therefore, since we are finished deprecating old MSVCs we can deprecate C10_DEPRECATED.
```
Test Plan: CI
Differential Revision: D72762767
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151058
Approved by: https://github.com/r-barnes
Which inherits from `RuntimeError` and contains `error_code`, which in case of CUDA should contain error returned by `cudaGetLastError`
`torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's [`PyErr_SetString`](cb8a72b301/Python/errors.c (L282)), namely
- Convert cstr into Python string with `PyUnicode_FromString`
- Create new exception object using `PyObject_CallOneArg` just like it's done in [`_PyErr_CreateException`](cb8a72b301/Python/errors.c (L32))
- Set `error_code` property using `PyObject_SetAttrString`
- decref all temporary references
Test that it works and captures CPP backtrace (in addition to CI) by running
```python
import os
os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1'
import torch
x = torch.rand(10, device="cuda")
y = torch.arange(20, device="cuda")
try:
x[y] = 2
print(x)
except torch.AcceleratorError as e:
print("Exception was raised", e.args[0])
print("Captured error code is ", e.error_code)
```
which produces following output
```
Exception was raised CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0
#7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0
#8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0
#9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0
#10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0
#11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0
#12 at::native::abs(at::Tensor const&) from ??:0
#13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0
#14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0
#15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#17 at::_ops::abs::call(at::Tensor const&) from ??:0
#18 at::native::isfinite(at::Tensor const&) from ??:0
#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0
#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0
#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0
#22 PyObject_CallFunctionObjArgs from ??:0
#23 _PyObject_MakeTpCall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyObject_FastCallDictTstate from ??:0
#26 _PyStack_AsDict from ??:0
#27 _PyObject_MakeTpCall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 _PyFunction_Vectorcall from ??:0
#30 _PyEval_EvalFrameDefault from ??:0
#31 _PyFunction_Vectorcall from ??:0
#32 _PyEval_EvalFrameDefault from ??:0
#33 _PyFunction_Vectorcall from ??:0
#34 _PyEval_EvalFrameDefault from ??:0
#35 PyFrame_GetCode from ??:0
#36 PyNumber_Xor from ??:0
#37 PyObject_Str from ??:0
#38 PyFile_WriteObject from ??:0
#39 _PyWideStringList_AsList from ??:0
#40 _PyDict_NewPresized from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyEval_EvalCode from ??:0
#43 PyEval_EvalCode from ??:0
#44 PyUnicode_Tailmatch from ??:0
#45 PyInit__collections from ??:0
#46 PyUnicode_Tailmatch from ??:0
#47 _PyRun_SimpleFileObject from ??:0
#48 _PyRun_AnyFileObject from ??:0
#49 Py_RunMain from ??:0
#50 Py_BytesMain from ??:0
#51 __libc_init_first from ??:0
#52 __libc_start_main from ??:0
#53 _start from ??:0
Captured error code is 710
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152023
Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel
ghstack dependencies: #154436
Fixes: https://github.com/intel/torch-xpu-ops/issues/1503
`sycl/ext/oneapi/bfloat16.hpp` header file is a DPC++ compiler internal header. It's not documented for usage (see extension specification linked below) and is not guaranteed to exist. Instead, documented usage of extension suggests to rely on including `sycl/sycl.hpp` which in its turn includes `bfloat16.hpp` header (which is implementation detail).
We stepped into issues by explicitly including `bloat16.hpp` sycl header whithin user facing production environment when `intel-sycl-rt` wheel is installed (which is the dependency of `torch` wheel package built and publicly available for xpu). Compiler includes this file from `intel-sycl-rt` and due to `#pragma once` usage its content is included as well giving redefinitions of symbols in this file (previous inclusion is coming from `sycl/sycl.hpp`):
```
In file included from /workspace/lib/python3.12/site-packages/torch/include/c10/util/BFloat16.h:23:
/opt/intel/oneapi/compiler/2025.0/bin/compiler/../../include/sycl/ext/oneapi/bfloat16.hpp:60:23: error: redefinition of 'BF16VecToFloatVec'
60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
| ^
/workspace/include/sycl/ext/oneapi/bfloat16.hpp:60:23: note: previous definition is here
60 | template <int N> void BF16VecToFloatVec(const bfloat16 src[N], float dst[N]) {
|
```
While SYCL header files themselves can be improved (`#pragma once` dropped), we still must correct usage of sycl `bfloat16.hpp` header in pytorch, i.e. drop it. This fortunately helps to address the reported issue of redefinitions though follow up on compiler side is still required.
Also, `SYCL_EXT_ONEAPI_BFLOAT16_MATH_FUNCTIONS` used to cover inclusion of `sycl/sycl.hpp` does not make sense since it's defined in this very header. Thus, we should use `SYCL_LANGUAGE_VERSION` instead which is defined on compiler level.
See: f958dce280/sycl/doc/extensions/experimental/sycl_ext_oneapi_bfloat16_math_functions.asciidoc
CC: @EikanWang, @guangyey, @gujinghui
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152562
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD