Files
pytorch/build.bzl
hongxyan 66a76516bf [ROCm] Disabling Kernel Asserts for ROCm by default - fix and clean up and refactoring (#114660)
Related to #103973, #110532, #108404, #94891

**Context:**
As noted in 6ae0554d11/cmake/Dependencies.cmake (L1198),
kernel asserts are supposed to be enabled by default for CUDA and disabled for ROCm.
However, that logic was somewhat broken, and kernel asserts remained enabled for ROCm.

Disabling kernel asserts is also needed for users whose hardware does not support PCIe atomics. These community users have verified that disabling kernel asserts in PyTorch on the ROCm platform fixed their PyTorch workflows, such as torch.sum scripts and Stable Diffusion (see the related issues).

**Changes:**

This pull request:
* Refactors and cleans up the logic, making it simpler to enable and disable kernel asserts for ROCm.
* Fixes the bug where kernel asserts for ROCm were not disabled by default.

Specifically,
- Renamed `TORCH_DISABLE_GPU_ASSERTS` to `C10_USE_ROCM_KERNEL_ASSERT` for the following reasons:
  (1) This variable only applies to ROCm.
  (2) The new name aligns better with the `#define CUDA_KERNEL_ASSERT` macro.
  (3) With the `USE_` prefix, the feature can easily be toggled via an environment variable at build time (e.g. `USE_ROCM_KERNEL_ASSERT=1 python setup.py develop` enables kernel asserts for a ROCm build).
- Removed `ROCM_FORCE_ENABLE_GPU_ASSERTS` to simplify the logic and make it easier to understand and maintain.
- Added a `#cmakedefine` to carry the CMake variable over to C++ (a sketch of this mechanism follows the list).
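
For reference, here is a minimal sketch of how a `#cmakedefine` typically carries such a CMake option into C++. The file name, macro spelling, and guard condition below are illustrative assumptions for exposition, not the literal contents of this PR:

```
// Hypothetical configured header template (e.g. some_macros.h.in).
// CMake's configure_file() turns the next line into
// "#define C10_USE_ROCM_KERNEL_ASSERT" when the option is ON, and into a
// commented-out "/* #undef C10_USE_ROCM_KERNEL_ASSERT */" when it is OFF:
//
//   #cmakedefine C10_USE_ROCM_KERNEL_ASSERT

// Hypothetical consumer code: gate the device-side assert on the macro.
#include <cassert>

#if defined(USE_ROCM) && !defined(C10_USE_ROCM_KERNEL_ASSERT)
// ROCm build with kernel asserts disabled: the assert compiles to nothing.
#define CUDA_KERNEL_ASSERT(cond)
#else
// CUDA builds, and ROCm builds configured with USE_ROCM_KERNEL_ASSERT=ON,
// keep the device-side assertion.
#define CUDA_KERNEL_ASSERT(cond) assert(cond)
#endif
```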

**Tests:**
(1) Build in the default mode and verify that `USE_ROCM_KERNEL_ASSERT` is OFF (0) and kernel asserts are disabled:

```
python setup.py develop
```
Verify that CMakeCache.txt has the correct value:
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=0
```
The following code was tested in both a ROCm build and a CUDA build, with different return codes expected in each case.

```
subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
```
This snippet is adapted from the unit test below, to work around the fact that the unit test is currently skipped for ROCm. (We will look into enabling this unit test in the future.)

```
python test/test_cuda_expandable_segments.py -k test_fixed_cuda_assert_async
```

Ran the following script, expecting `r == 0` since `CUDA_KERNEL_ASSERT` is defined as a no-op:
```
>>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>> r
0
```

(2) Enable kernel asserts by building with `USE_ROCM_KERNEL_ASSERT=1` or `USE_ROCM_KERNEL_ASSERT=ON`:
```
USE_ROCM_KERNEL_ASSERT=1 python setup.py develop
```

Verify that `USE_ROCM_KERNEL_ASSERT` is `1`:
```
/xxxx/pytorch/build$ grep USE_ROCM_KERNEL_ASSERT CMakeCache.txt
USE_ROCM_KERNEL_ASSERT:BOOL=1
```

Run the assert test again, this time expecting a non-zero return code:

```
>>> import sys
>>> import subprocess
>>> r=subprocess.call([sys.executable, '-c', "import torch;torch._assert_async(torch.tensor(0,device='cuda'));torch.cuda.synchronize()"])
>>>/xxxx/pytorch/aten/src/ATen/native/hip/TensorCompare.hip:108: _assert_async_cuda_kernel: Device-side assertion `input[0] != 0' failed.
:0:rocdevice.cpp            :2690: 2435301199202 us: [pid:206019 tid:0x7f6cf0a77700] Callback: Queue 0x7f64e8400000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016

>>> r
-6
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114660
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/jithunnair-amd
2023-12-13 15:44:53 +00:00


load(
    ":ufunc_defs.bzl",
    "aten_ufunc_generated_cpu_kernel_sources",
    "aten_ufunc_generated_cpu_sources",
    "aten_ufunc_generated_cuda_sources",
)

def define_targets(rules):
    rules.cc_library(
        name = "caffe2_core_macros",
        hdrs = [":caffe2_core_macros_h"],
    )

    rules.cmake_configure_file(
        name = "caffe2_core_macros_h",
        src = "caffe2/core/macros.h.in",
        out = "caffe2/core/macros.h",
        definitions = [
            "CAFFE2_BUILD_SHARED_LIBS",
            "CAFFE2_PERF_WITH_AVX",
            "CAFFE2_PERF_WITH_AVX2",
            "CAFFE2_PERF_WITH_AVX512",
            "CAFFE2_USE_EXCEPTION_PTR",
            "CAFFE2_USE_CUDNN",
            "USE_MKLDNN",
            "CAFFE2_USE_ITT",
            "USE_ROCM_KERNEL_ASSERT",
            "EIGEN_MPL2_ONLY",
        ],
    )

    rules.cc_library(
        name = "caffe2_serialize",
        srcs = [
            "caffe2/serialize/file_adapter.cc",
            "caffe2/serialize/inline_container.cc",
            "caffe2/serialize/istream_adapter.cc",
            "caffe2/serialize/read_adapter_interface.cc",
        ],
        copts = ["-fexceptions"],
        tags = [
            "-fbcode",
            "supermodule:android/default/pytorch",
            "supermodule:ios/default/public.pytorch",
            "xplat",
        ],
        visibility = ["//visibility:public"],
        deps = [
            ":caffe2_headers",
            "//c10",
            "//third_party/miniz-2.1.0:miniz",
            "@com_github_glog//:glog",
        ],
    )

    #
    # ATen generated code
    # You need to keep this in sync with the files written out
    # by gen.py (in the cmake build system, we track generated files
    # via generated_cpp.txt and generated_cpp.txt-cuda)
    #
    # Sure would be nice to use gen.py to create this list dynamically
    # instead of hardcoding, no? Well, we can't, as discussed in this
    # thread:
    # https://fb.facebook.com/groups/askbuck/permalink/1924258337622772/
    gen_aten_srcs = [
        "aten/src/ATen/native/native_functions.yaml",
        "aten/src/ATen/native/tags.yaml",
    ] + rules.glob(["aten/src/ATen/templates/*"])

    gen_aten_cmd = " ".join([
        "$(execpath //torchgen:gen)",
        "--install_dir=$(RULEDIR)",
        "--source-path aten/src/ATen",
    ] + (["--static_dispatch_backend CPU"] if rules.is_cpu_static_dispatch_build() else []))

    gen_aten_outs_cuda = (
        GENERATED_H_CUDA + GENERATED_CPP_CUDA +
        aten_ufunc_generated_cuda_sources()
    )

    gen_aten_outs = (
        GENERATED_H + GENERATED_H_CORE +
        GENERATED_CPP + GENERATED_CPP_CORE +
        aten_ufunc_generated_cpu_sources() +
        aten_ufunc_generated_cpu_kernel_sources() + [
            "Declarations.yaml",
        ] + gen_aten_outs_cuda
    )

    rules.genrule(
        name = "gen_aten",
        srcs = gen_aten_srcs,
        outs = gen_aten_outs,
        cmd = gen_aten_cmd,
        tools = ["//torchgen:gen"],
    )

    rules.genrule(
        name = "gen_aten_hip",
        srcs = gen_aten_srcs,
        outs = gen_aten_outs_cuda,
        cmd = gen_aten_cmd + " --rocm",
        features = ["-create_bazel_outputs"],
        tags = ["-bazel"],
        tools = ["//torchgen:gen"],
    )

    rules.genrule(
        name = "generate-code",
        srcs = [
            ":DispatchKeyNativeFunctions.cpp",
            ":DispatchKeyNativeFunctions.h",
            ":LazyIr.h",
            ":LazyNonNativeIr.h",
            ":RegisterDispatchDefinitions.ini",
            ":RegisterDispatchKey.cpp",
            ":native_functions.yaml",
            ":shape_inference.h",
            ":tags.yaml",
            ":ts_native_functions.cpp",
            ":ts_native_functions.yaml",
        ],
        outs = GENERATED_AUTOGRAD_CPP + GENERATED_AUTOGRAD_PYTHON + GENERATED_TESTING_PY,
        cmd = "$(execpath //tools/setup_helpers:generate_code) " +
              "--gen-dir=$(RULEDIR) " +
              "--native-functions-path $(location :native_functions.yaml) " +
              "--tags-path=$(location :tags.yaml) " +
              "--gen_lazy_ts_backend",
        tools = ["//tools/setup_helpers:generate_code"],
    )

    rules.cc_library(
        name = "generated-autograd-headers",
        hdrs = [":{}".format(h) for h in _GENERATED_AUTOGRAD_CPP_HEADERS + _GENERATED_AUTOGRAD_PYTHON_HEADERS],
        visibility = ["//visibility:public"],
    )

    rules.genrule(
        name = "version_h",
        srcs = [
            ":torch/csrc/api/include/torch/version.h.in",
            ":version.txt",
        ],
        outs = ["torch/csrc/api/include/torch/version.h"],
        cmd = "$(execpath //tools/setup_helpers:gen_version_header) " +
              "--template-path $(location :torch/csrc/api/include/torch/version.h.in) " +
              "--version-path $(location :version.txt) --output-path $@ ",
        tools = ["//tools/setup_helpers:gen_version_header"],
    )

#
# ATen generated code
# You need to keep this in sync with the files written out
# by gen.py (in the cmake build system, we track generated files
# via generated_cpp.txt and generated_cpp.txt-cuda)
#
# Sure would be nice to use gen.py to create this list dynamically
# instead of hardcoding, no? Well, we can't, as discussed in this
# thread:
# https://fb.facebook.com/groups/askbuck/permalink/1924258337622772/
GENERATED_H = [
    "Functions.h",
    "NativeFunctions.h",
    "NativeMetaFunctions.h",
    "FunctionalInverses.h",
    "RedispatchFunctions.h",
    "RegistrationDeclarations.h",
    "VmapGeneratedPlumbing.h",
]

GENERATED_H_CORE = [
    "Operators.h",
    # CPUFunctions.h (and likely similar headers) need to be part of core because
    # of the static dispatch build: TensorBody.h directly includes CPUFunctions.h.
    # The distinction looks pretty arbitrary though; maybe we can kill core
    # and merge the two?
    "CPUFunctions.h",
    "CPUFunctions_inl.h",
    "CompositeExplicitAutogradFunctions.h",
    "CompositeExplicitAutogradFunctions_inl.h",
    "CompositeExplicitAutogradNonFunctionalFunctions.h",
    "CompositeExplicitAutogradNonFunctionalFunctions_inl.h",
    "CompositeImplicitAutogradFunctions.h",
    "CompositeImplicitAutogradFunctions_inl.h",
    "CompositeImplicitAutogradNestedTensorFunctions.h",
    "CompositeImplicitAutogradNestedTensorFunctions_inl.h",
    "MetaFunctions.h",
    "MetaFunctions_inl.h",
    "core/TensorBody.h",
    "MethodOperators.h",
    "core/aten_interned_strings.h",
    "core/enum_tag.h",
]

GENERATED_H_CUDA = [
    "CUDAFunctions.h",
    "CUDAFunctions_inl.h",
]

GENERATED_CPP_CUDA = [
    "RegisterCUDA.cpp",
    "RegisterNestedTensorCUDA.cpp",
    "RegisterSparseCUDA.cpp",
    "RegisterSparseCsrCUDA.cpp",
    "RegisterQuantizedCUDA.cpp",
]

GENERATED_CPP = [
    "Functions.cpp",
    "RegisterBackendSelect.cpp",
    "RegisterCPU.cpp",
    "RegisterQuantizedCPU.cpp",
    "RegisterNestedTensorCPU.cpp",
    "RegisterSparseCPU.cpp",
    "RegisterSparseCsrCPU.cpp",
    "RegisterMkldnnCPU.cpp",
    "RegisterCompositeImplicitAutograd.cpp",
    "RegisterCompositeImplicitAutogradNestedTensor.cpp",
    "RegisterZeroTensor.cpp",
    "RegisterMeta.cpp",
    "RegisterQuantizedMeta.cpp",
    "RegisterNestedTensorMeta.cpp",
    "RegisterSparseMeta.cpp",
    "RegisterCompositeExplicitAutograd.cpp",
    "RegisterCompositeExplicitAutogradNonFunctional.cpp",
    "CompositeViewCopyKernels.cpp",
    "RegisterSchema.cpp",
    "RegisterFunctionalization_0.cpp",
    "RegisterFunctionalization_1.cpp",
    "RegisterFunctionalization_2.cpp",
    "RegisterFunctionalization_3.cpp",
]

GENERATED_CPP_CORE = [
    "Operators_0.cpp",
    "Operators_1.cpp",
    "Operators_2.cpp",
    "Operators_3.cpp",
    "Operators_4.cpp",
    "core/ATenOpList.cpp",
    "core/TensorMethods.cpp",
]

# These lists are temporarily living in and exported from the shared
# structure so that an internal build that lives under a different
# root can access them. These could technically live in a separate
# file in the same directory but that would require extra work to
# ensure that file is synced to both Meta internal repositories and
# GitHub. This problem will go away when the targets downstream of
# generate-code that use these lists are moved into the shared
# structure as well.
_GENERATED_AUTOGRAD_PYTHON_HEADERS = [
    "torch/csrc/autograd/generated/python_functions.h",
    "torch/csrc/autograd/generated/python_return_types.h",
]

_GENERATED_AUTOGRAD_CPP_HEADERS = [
    "torch/csrc/autograd/generated/Functions.h",
    "torch/csrc/autograd/generated/VariableType.h",
    "torch/csrc/autograd/generated/variable_factories.h",
]

GENERATED_TESTING_PY = [
    "torch/testing/_internal/generated/annotated_fn_args.py",
]

GENERATED_LAZY_H = [
    "torch/csrc/lazy/generated/LazyIr.h",
    "torch/csrc/lazy/generated/LazyNonNativeIr.h",
    "torch/csrc/lazy/generated/LazyNativeFunctions.h",
]

_GENERATED_AUTOGRAD_PYTHON_CPP = [
    "torch/csrc/autograd/generated/python_functions_0.cpp",
    "torch/csrc/autograd/generated/python_functions_1.cpp",
    "torch/csrc/autograd/generated/python_functions_2.cpp",
    "torch/csrc/autograd/generated/python_functions_3.cpp",
    "torch/csrc/autograd/generated/python_functions_4.cpp",
    "torch/csrc/autograd/generated/python_nn_functions.cpp",
    "torch/csrc/autograd/generated/python_nested_functions.cpp",
    "torch/csrc/autograd/generated/python_fft_functions.cpp",
    "torch/csrc/autograd/generated/python_linalg_functions.cpp",
    "torch/csrc/autograd/generated/python_return_types.cpp",
    "torch/csrc/autograd/generated/python_enum_tag.cpp",
    "torch/csrc/autograd/generated/python_sparse_functions.cpp",
    "torch/csrc/autograd/generated/python_special_functions.cpp",
    "torch/csrc/autograd/generated/python_torch_functions_0.cpp",
    "torch/csrc/autograd/generated/python_torch_functions_1.cpp",
    "torch/csrc/autograd/generated/python_torch_functions_2.cpp",
    "torch/csrc/autograd/generated/python_variable_methods.cpp",
]

GENERATED_AUTOGRAD_PYTHON = _GENERATED_AUTOGRAD_PYTHON_HEADERS + _GENERATED_AUTOGRAD_PYTHON_CPP

GENERATED_AUTOGRAD_CPP = [
    "torch/csrc/autograd/generated/Functions.cpp",
    "torch/csrc/autograd/generated/VariableType_0.cpp",
    "torch/csrc/autograd/generated/VariableType_1.cpp",
    "torch/csrc/autograd/generated/VariableType_2.cpp",
    "torch/csrc/autograd/generated/VariableType_3.cpp",
    "torch/csrc/autograd/generated/VariableType_4.cpp",
    "torch/csrc/autograd/generated/TraceType_0.cpp",
    "torch/csrc/autograd/generated/TraceType_1.cpp",
    "torch/csrc/autograd/generated/TraceType_2.cpp",
    "torch/csrc/autograd/generated/TraceType_3.cpp",
    "torch/csrc/autograd/generated/TraceType_4.cpp",
    "torch/csrc/autograd/generated/ADInplaceOrViewType_0.cpp",
    "torch/csrc/autograd/generated/ADInplaceOrViewType_1.cpp",
    "torch/csrc/lazy/generated/LazyNativeFunctions.cpp",
    "torch/csrc/lazy/generated/RegisterAutogradLazy.cpp",
    "torch/csrc/lazy/generated/RegisterLazy.cpp",
] + _GENERATED_AUTOGRAD_CPP_HEADERS + GENERATED_LAZY_H