Mirror of https://github.com/pytorch/pytorch.git (synced 2025-10-20 12:54:11 +08:00)
Forbid trailing whitespace (#53406)
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857

These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
  - `GLOSSARY.md`
  - `aten/src/ATen/core/op_registration/README.md`
  - `scripts/README.md`
  - `torch/csrc/jit/codegen/fuser/README.md`

The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406

Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377

This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348

Reviewed By: walterddr, seemethere

Differential Revision: D26856620

Pulled By: samestep

fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
Committed by: Facebook GitHub Bot
Parent: cab2689eb1
Commit: 8c798e0622
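For local cleanup outside of CI, the command from the summary can be wrapped so that it also works where GNU sed is not the default `sed`. This is a minimal sketch, not part of the PR: only the `git grep`/`gsed` pipeline comes from the commit message above, and the GNU/BSD sed detection around it is an assumption.

```bash
#!/usr/bin/env bash
# Minimal local sketch (not part of the PR): strip trailing spaces from
# tracked text files, mirroring the command given in the summary above.

if sed --version >/dev/null 2>&1; then
  SED=(sed -i)        # GNU sed (typical on Linux)
elif command -v gsed >/dev/null 2>&1; then
  SED=(gsed -i)       # GNU sed from Homebrew on macOS, as used in the summary
else
  SED=(sed -i '')     # BSD sed needs an explicit (empty) backup suffix
fi

# -I skips binary files; the pathspecs exclude vendored and third-party code.
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' \
  | xargs "${SED[@]}" 's/ *$//'
```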
.github/workflows/lint.yml (vendored, 3 added lines)

@@ -40,6 +40,9 @@ jobs:
         rm -r "shellcheck-${scversion}"
         shellcheck --version
         .jenkins/run-shellcheck.sh
+    - name: Ensure no trailing spaces
+      run: |
+        (! git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' || (echo "The above files have trailing spaces; please remove them"; false))
     - name: Ensure no tabs
       run: |
         (! git grep -I -l $'\t' -- . ':(exclude)*.svg' ':(exclude)**Makefile' ':(exclude)**/contrib/**' ':(exclude)third_party' ':(exclude).gitattributes' ':(exclude).gitmodules' || (echo "The above files have tabs; please convert them to spaces"; false))
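The two checks above can also be run locally from the repository root before pushing. In this sketch the grep invocations are copied verbatim from the workflow; the wrapper script around them is illustrative and not part of the PR.

```bash
#!/usr/bin/env bash
# Sketch of running the new lint checks locally; the grep invocations are
# copied from the workflow above, the wrapper script itself is illustrative.

# Fail if any tracked text file still has trailing spaces.
(! git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' \
  || (echo "The above files have trailing spaces; please remove them"; false))

# Fail if any tracked text file contains tabs outside the allowed paths.
(! git grep -I -l $'\t' -- . ':(exclude)*.svg' ':(exclude)**Makefile' \
  ':(exclude)**/contrib/**' ':(exclude)third_party' \
  ':(exclude).gitattributes' ':(exclude).gitmodules' \
  || (echo "The above files have tabs; please convert them to spaces"; false))
```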
Every other hunk in this commit is a trailing-whitespace-only removal, generated by the command shown in the summary, across build and CI shell scripts, documentation (including GLOSSARY.md, CONTRIBUTING troubleshooting notes, and the op_registration README), ATen/c10 C++ headers and sources, CPU and CUDA native kernels, and QNNPACK assembly; the changed lines differ from their originals only by the deleted trailing spaces.
|
||||
#
|
||||
|
||||
# After loading w pointer in ip reg.
|
||||
# And after pushing r4-r9 and d8-d15 on stack
|
||||
# |----------------|
|
||||
# |r4 - r11 | 0
|
||||
# |r4 - r11 | 0
|
||||
# |packed_a | 32
|
||||
# |----------------|
|
||||
#
|
||||
#
|
||||
|
||||
# Packed A format.
|
||||
# 4kx4m blocks for alls blocks given 4 rows (4m) are placed in contiguous memory.
|
||||
@ -42,7 +42,7 @@
|
||||
# | | Thus Packed A has (K + 4 - 1)/4 * (M + 4 -1)/4 blocks
|
||||
# | |
|
||||
# |---------------------|
|
||||
#
|
||||
#
|
||||
# Each 4 x 4 blocks is transposed and stored.
|
||||
# Each of the (K + 4 - 1)/4 blocks for a given group of 4 m blocks
|
||||
# are stored adjacent in memory
|
||||
|
@ -20,7 +20,7 @@
|
||||
## Stack
|
||||
# 4 a_stride
|
||||
# 4 packed_w
|
||||
# 4 w_row_ptr
|
||||
# 4 w_row_ptr
|
||||
# 4 w_block_ids_ptr
|
||||
# 4 b
|
||||
# 4 c
|
||||
@ -43,7 +43,7 @@
|
||||
# |out ch indx | 24
|
||||
# |params | 28
|
||||
# |----------------|
|
||||
#
|
||||
#
|
||||
|
||||
# After loading w pointer in ip reg.
|
||||
# And after pushing r4-r9 and d8-d15 on stack
|
||||
@ -58,7 +58,7 @@
|
||||
# |out ch indx | 120
|
||||
# |params | 124
|
||||
# |----------------|
|
||||
#
|
||||
#
|
||||
|
||||
# void pytorch_q8gemm_dq_sparse_1x4_ukernel_4x8_packedA__aarch32_neon(
|
||||
# size_t mr,
|
||||
@ -223,7 +223,7 @@ k_loop:
|
||||
# Each iteration produce 4 values each of 4 bytes
|
||||
# Thus 4 x 4 = 16 bytes 2^4
|
||||
# In this implementation, first value will be stored at
|
||||
# 1st value: sp - 12 - r1 * 16
|
||||
# 1st value: sp - 12 - r1 * 16
|
||||
# 2nd value: sp - 12 - (r1 - 1) * 16
|
||||
# and so on.
|
||||
SUB r9, r9, r1, LSL #4
|
||||
|
@ -20,7 +20,7 @@
|
||||
## Stack
|
||||
# 4 a_stride
|
||||
# 4 packed_w
|
||||
# 4 w_row_ptr
|
||||
# 4 w_row_ptr
|
||||
# 4 w_block_ids_ptr
|
||||
# 4 b
|
||||
# 4 c
|
||||
@ -43,7 +43,7 @@
|
||||
# |out ch indx | 24
|
||||
# |params | 28
|
||||
# |----------------|
|
||||
#
|
||||
#
|
||||
|
||||
# After loading w pointer in ip reg.
|
||||
# And after pushing r4-r9 and d8-d15 on stack
|
||||
@ -58,7 +58,7 @@
|
||||
# |out ch indx | 120
|
||||
# |params | 124
|
||||
# |----------------|
|
||||
#
|
||||
#
|
||||
|
||||
# void pytorch_q8gemm_dq_sparse_8x1_ukernel_4x8_packedA__aarch32_neon(
|
||||
# size_t mr,
|
||||
|
@ -21,15 +21,15 @@
|
||||
# |----------------|
|
||||
# |packed_a | 0
|
||||
# |----------------|
|
||||
#
|
||||
#
|
||||
|
||||
# After loading w pointer in ip reg.
|
||||
# And after pushing r4-r9 and d8-d15 on stack
|
||||
# |----------------|
|
||||
# |r4 - r11 | 0
|
||||
# |r4 - r11 | 0
|
||||
# |packed_a | 32
|
||||
# |----------------|
|
||||
#
|
||||
#
|
||||
|
||||
# Packed A format.
|
||||
# 8kx4m blocks for alls blocks given 4 rows (4m) are placed in contiguous memory.
|
||||
@ -42,7 +42,7 @@
|
||||
# | | Thus Packed A has (K + 4 - 1)/4 * (M + 8 -1)/8 blocks
|
||||
# | |
|
||||
# |---------------------|
|
||||
#
|
||||
#
|
||||
# Each 8 x 4 blocks is transposed and stored.
|
||||
# Each of the (K + 4 - 1)/4 blocks for a given group of 8 m blocks
|
||||
# are stored adjacent in memory
|
||||
|
@ -19,7 +19,7 @@
|
||||
# | | Thus Packed A has (K + 4 - 1)/4 * (M + 8 -1)/8 blocks
|
||||
# | |
|
||||
# |---------------------|
|
||||
#
|
||||
#
|
||||
# Each 8 x 4 blocks is transposed and stored.
|
||||
# Each of the (K + 4 - 1)/4 blocks for a given group of 8 m blocks
|
||||
# are stored adjacent in memory
|
||||
|
@ -278,7 +278,7 @@ k_loop:
|
||||
# v10 : x10, x11, x12, x13
|
||||
# v12 : x20, x21, x22, x23
|
||||
# v14 : x30, x31, x32, x33
|
||||
# Then using
|
||||
# Then using
|
||||
# TRANSPOSE_4X4_S32 v16, v18, v20, v22, v4, v5, v6, v7
|
||||
# We get
|
||||
# v16 : x04, x05, x06, x07
|
||||
|
@ -635,7 +635,7 @@ Tensor _sparse_log_softmax(const Tensor& input_, const int64_t dim_, c10::option
|
||||
namedinference::propagate_names(result, input_);
|
||||
return result;
|
||||
}
|
||||
|
||||
|
||||
Tensor _sparse_log_softmax(const Tensor& self, Dimname dim, optional<ScalarType> dtype) {
|
||||
return at::_sparse_log_softmax(self, dimname_to_position(self, dim), dtype);
|
||||
}
|
||||
|
@ -541,8 +541,8 @@ Tensor sparse_mask_helper_cpu(
|
||||
`t` - coalesced sparse tensor input
|
||||
`mask_indices` - mask indices tensor
|
||||
|
||||
Note: The nnz in the output tensor will be same as the `mask_indices`. So it will
|
||||
works independently if the mask is coalesced or not.
|
||||
Note: The nnz in the output tensor will be same as the `mask_indices`. So it will
|
||||
works independently if the mask is coalesced or not.
|
||||
*/
|
||||
TORCH_CHECK(t.is_sparse(), "t: input is not a sparse tensor");
|
||||
TORCH_CHECK(t.is_coalesced(), "t: input is uncoalesced");
|
||||
@ -554,7 +554,7 @@ Tensor sparse_mask_helper_cpu(
|
||||
auto t_v = t._values();
|
||||
auto vsize = t_v.sizes().vec();
|
||||
vsize[0] = r_nnz;
|
||||
|
||||
|
||||
Tensor r_values = at::zeros(vsize, t_v.options());
|
||||
auto t_i = t._indices();
|
||||
auto t_nnz = t._nnz();
|
||||
@ -583,7 +583,7 @@ Tensor sparse_mask_helper_cpu(
|
||||
}
|
||||
}
|
||||
});
|
||||
return r_values;
|
||||
return r_values;
|
||||
}
|
||||
|
||||
}} // namespace at::native
|
||||
|
@ -1116,7 +1116,7 @@ SparseTensor& _sspaddmm_out_cpu(
|
||||
"sspaddmm: Argument #1: Expected dim 1 size ", dim_k, ", got ", t.size(1));
|
||||
|
||||
int64_t nnz = sparse._nnz();
|
||||
// We have to make indices contiguous as we use indices.data_ptr in _to_csr which assumes row-contiguous storage
|
||||
// We have to make indices contiguous as we use indices.data_ptr in _to_csr which assumes row-contiguous storage
|
||||
Tensor indices = sparse._indices().contiguous();
|
||||
Tensor values = sparse._values();
|
||||
|
||||
|
@ -144,8 +144,8 @@ void csrmm2(
|
||||
TORCH_CUDASPARSE_CHECK(cusparseCreateDnMat(
|
||||
&descC, /* output */
|
||||
m, n, ldc, /* rows, cols, leading dimension */
|
||||
c, /* values */
|
||||
cusparse_value_type, /* data type of values */
|
||||
c, /* values */
|
||||
cusparse_value_type, /* data type of values */
|
||||
CUSPARSE_ORDER_COL /* memory layout, ONLY column-major is supported now */
|
||||
));
|
||||
|
||||
|
@ -105,7 +105,7 @@ def main(argv):
|
||||
out_dir = pathlib.Path(__file__).parent
|
||||
|
||||
(out_dir / "nnapi_wrapper.h").write_text(
|
||||
PREFIX +
|
||||
PREFIX +
|
||||
textwrap.dedent("""\
|
||||
#ifndef NNAPI_WRAPPER_H_
|
||||
#define NNAPI_WRAPPER_H_
|
||||
@ -124,7 +124,7 @@ def main(argv):
|
||||
)
|
||||
|
||||
(out_dir / "nnapi_wrapper.cpp").write_text(
|
||||
PREFIX +
|
||||
PREFIX +
|
||||
textwrap.dedent("""\
|
||||
#ifndef _WIN32
|
||||
#include <dlfcn.h>
|
||||
|
@ -140,7 +140,7 @@ struct NnapiCompilation : torch::jit::CustomClassHolder {
|
||||
}
|
||||
|
||||
check_nnapi->Execution_compute(execution);
|
||||
|
||||
|
||||
// TODO: Maybe skip this for fixed-size outputs?
|
||||
for (size_t i = 0; i < outputs.size(); i++) {
|
||||
auto& t = outputs[i];
|
||||
|
@ -18,7 +18,7 @@ TEST(CPUGeneratorImpl, TestGeneratorDynamicCast) {
|
||||
}
|
||||
|
||||
TEST(CPUGeneratorImpl, TestDefaultGenerator) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Check if default generator is created only once
|
||||
// address of generator should be same in all calls
|
||||
auto foo = at::detail::getDefaultCPUGenerator();
|
||||
@ -27,7 +27,7 @@ TEST(CPUGeneratorImpl, TestDefaultGenerator) {
|
||||
}
|
||||
|
||||
TEST(CPUGeneratorImpl, TestCloning) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Check cloning of new generators.
|
||||
// Note that we don't allow cloning of other
|
||||
// generator states into default generators.
|
||||
@ -47,9 +47,9 @@ void thread_func_get_engine_op(CPUGeneratorImpl* generator) {
|
||||
}
|
||||
|
||||
TEST(CPUGeneratorImpl, TestMultithreadingGetEngineOperator) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Check CPUGeneratorImpl is reentrant and the engine state
|
||||
// is not corrupted when multiple threads request for
|
||||
// is not corrupted when multiple threads request for
|
||||
// random samples.
|
||||
// See Note [Acquire lock when using random generators]
|
||||
auto gen1 = at::detail::createCPUGenerator();
|
||||
@ -74,7 +74,7 @@ TEST(CPUGeneratorImpl, TestMultithreadingGetEngineOperator) {
|
||||
}
|
||||
|
||||
TEST(CPUGeneratorImpl, TestGetSetCurrentSeed) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Test current seed getter and setter
|
||||
// See Note [Acquire lock when using random generators]
|
||||
auto foo = at::detail::getDefaultCPUGenerator();
|
||||
@ -92,7 +92,7 @@ void thread_func_get_set_current_seed(Generator generator) {
|
||||
}
|
||||
|
||||
TEST(CPUGeneratorImpl, TestMultithreadingGetSetCurrentSeed) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Test current seed getter and setter are thread safe
|
||||
// See Note [Acquire lock when using random generators]
|
||||
auto gen1 = at::detail::getDefaultCPUGenerator();
|
||||
@ -107,7 +107,7 @@ TEST(CPUGeneratorImpl, TestMultithreadingGetSetCurrentSeed) {
|
||||
}
|
||||
|
||||
TEST(CPUGeneratorImpl, TestRNGForking) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Test that state of a generator can be frozen and
|
||||
// restored
|
||||
// See Note [Acquire lock when using random generators]
|
||||
@ -124,7 +124,7 @@ TEST(CPUGeneratorImpl, TestRNGForking) {
|
||||
ASSERT_EQ(target_value.sum().item<double>(), forked_value.sum().item<double>());
|
||||
}
|
||||
|
||||
/**
|
||||
/**
|
||||
* Philox CPU Engine Tests
|
||||
*/
|
||||
|
||||
@ -208,7 +208,7 @@ TEST(CPUGeneratorImpl, TestMT19937EngineReproducibility) {
|
||||
// Test Description:
|
||||
// Tests if same inputs give same results when compared
|
||||
// to std.
|
||||
|
||||
|
||||
// test with zero seed
|
||||
at::mt19937 engine1(0);
|
||||
std::mt19937 engine2(0);
|
||||
@ -231,5 +231,5 @@ TEST(CPUGeneratorImpl, TestMT19937EngineReproducibility) {
|
||||
for(int i = 0; i < 10000; i++) {
|
||||
ASSERT_EQ(engine1(), engine2());
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
|
@ -80,7 +80,7 @@ __global__ void testEngineOffset2(){
|
||||
unsigned long long increment_val = ::ldexp(1.0, 64);
|
||||
at::Philox4_32_10 engine1(123, 0, increment_val);
|
||||
at::Philox4_32_10 engine2(123, increment_val, increment_val);
|
||||
|
||||
|
||||
engine2.incr_n(increment_val);
|
||||
engine2.incr();
|
||||
assert(engine1() == engine2());
|
||||
@ -166,7 +166,7 @@ TEST(CUDAGeneratorImpl, TestGeneratorDynamicCast) {
|
||||
}
|
||||
|
||||
TEST(CUDAGeneratorImpl, TestDefaultGenerator) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Check if default generator state is created only once
|
||||
// address of generator should be same in all calls
|
||||
if (!at::cuda::is_available()) return;
|
||||
@ -186,7 +186,7 @@ TEST(CUDAGeneratorImpl, TestDefaultGenerator) {
|
||||
}
|
||||
|
||||
TEST(CUDAGeneratorImpl, TestCloning) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Check cloning of new generators.
|
||||
// Note that we don't allow cloning of other
|
||||
// generator states into default generators.
|
||||
@ -211,9 +211,9 @@ void thread_func_get_set_current_seed(Generator generator) {
|
||||
current_seed++;
|
||||
generator.set_current_seed(current_seed);
|
||||
}
|
||||
|
||||
|
||||
TEST(CUDAGeneratorImpl, TestMultithreadingGetSetCurrentSeed) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Test current seed getter and setter are thread safe
|
||||
// See Note [Acquire lock when using random generators]
|
||||
if (!at::cuda::is_available()) return;
|
||||
@ -229,7 +229,7 @@ TEST(CUDAGeneratorImpl, TestMultithreadingGetSetCurrentSeed) {
|
||||
}
|
||||
|
||||
TEST(CUDAGeneratorImpl, TestRNGForking) {
|
||||
// Test Description:
|
||||
// Test Description:
|
||||
// Test that state of a generator can be frozen and
|
||||
// restored
|
||||
// See Note [Acquire lock when using random generators]
|
||||
|
@ -765,7 +765,7 @@ int main()
|
||||
|
||||
test_THDoubleVector_fill_VSX();
|
||||
test_THFloatVector_fill_VSX();
|
||||
|
||||
|
||||
test_THDoubleVector_muls_VSX();
|
||||
test_THFloatVector_muls_VSX();
|
||||
|
||||
|
@ -93,7 +93,7 @@ static inline uint32_t detectHostSIMDExtensions()
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
|
||||
#elif defined(__EMSCRIPTEN__)
|
||||
|
||||
static inline uint32_t detectHostSIMDExtensions()
|
||||
|
@ -294,9 +294,9 @@ inline __device__ at::BFloat16 gpuAtomicMul(at::BFloat16 * address, at::BFloat16
|
||||
return AtomicFPOp<at::BFloat16>()(address, val,
|
||||
[](at::BFloat16 bsum, at::BFloat16 val) {
|
||||
return THCNumerics<at::BFloat16>::mul(bsum, val);
|
||||
});
|
||||
});
|
||||
}
|
||||
|
||||
|
||||
inline __device__ double gpuAtomicMul(double * address, double val) {
|
||||
return AtomicFPOp<double>()(address, val,
|
||||
[](double val, unsigned long long int assumed) {
|
||||
|
@ -270,7 +270,7 @@ BENCHMARK_DEFINE_F(Reduce1D, TeSplitTail)(benchmark::State& state) {
|
||||
te::For* tail;
|
||||
loop.splitWithTail(m, kChunkSize, &mo, &mi, &tail);
|
||||
}
|
||||
|
||||
|
||||
loop.prepareForCodegen();
|
||||
te::Stmt* s = loop.root_stmt();
|
||||
s = te::IRSimplifier::simplify(s);
|
||||
@ -313,7 +313,7 @@ BENCHMARK_DEFINE_F(Reduce1D, TeSplitMask)(benchmark::State& state) {
|
||||
te::For* mi;
|
||||
loop.splitWithMask(m, kChunkSize, &mo, &mi);
|
||||
}
|
||||
|
||||
|
||||
loop.prepareForCodegen();
|
||||
te::Stmt* s = loop.root_stmt();
|
||||
s = te::IRSimplifier::simplify(s);
|
||||
@ -369,7 +369,7 @@ BENCHMARK_DEFINE_F(Reduce1D, TeRfactorV1)(benchmark::State& state) {
|
||||
auto bt_body = te::NodeFinder<te::ReduceOp>::find(loop.root_stmt())[0];
|
||||
loop.rfactor(bt_body, mi->var());
|
||||
}
|
||||
|
||||
|
||||
loop.prepareForCodegen();
|
||||
te::Stmt* s = loop.root_stmt();
|
||||
s = te::IRSimplifier::simplify(s);
|
||||
@ -419,7 +419,7 @@ BENCHMARK_DEFINE_F(Reduce1D, TeRfactorV2)(benchmark::State& state) {
|
||||
|
||||
{
|
||||
// Look for the new For and vectorize, but rfactor didn't return the newly added "For *".
|
||||
// Resort to a hack to find the lost "For *".
|
||||
// Resort to a hack to find the lost "For *".
|
||||
// TODO: make it easier to find the transformed loop after rfactor.
|
||||
auto loops = te::NodeFinder<te::For>::find(loop.root_stmt());
|
||||
TORCH_CHECK(loops.size() == 4);
|
||||
|
@ -163,7 +163,7 @@ class AgentBase:
|
||||
r"""
|
||||
Finishes the episode
|
||||
Args:
|
||||
rets (list): List containing rewards generated by selct action calls during
|
||||
rets (list): List containing rewards generated by selct action calls during
|
||||
episode run
|
||||
"""
|
||||
return self.agent_latency, self.agent_throughput
|
||||
|
@ -48,13 +48,13 @@ class CoordinatorBase:
|
||||
|
||||
def run_coordinator(self, episodes, episode_steps, queue):
|
||||
r"""
|
||||
Runs n benchmark episodes. Each episode is started by coordinator telling each
|
||||
observer to contact the agent. Each episode is concluded by coordinator telling agent
|
||||
Runs n benchmark episodes. Each episode is started by coordinator telling each
|
||||
observer to contact the agent. Each episode is concluded by coordinator telling agent
|
||||
to finish the episode, and then the coordinator records benchmark data
|
||||
Args:
|
||||
episodes (int): Number of episodes to run
|
||||
episode_steps (int): Number steps to be run in each episdoe by each observer
|
||||
queue (SimpleQueue): SimpleQueue from torch.multiprocessing.get_context() for
|
||||
queue (SimpleQueue): SimpleQueue from torch.multiprocessing.get_context() for
|
||||
saving benchmark run results to
|
||||
"""
|
||||
|
||||
@ -96,9 +96,9 @@ class CoordinatorBase:
|
||||
observer_throughput_final = [
|
||||
t for s in observer_throughput_final for t in s]
|
||||
|
||||
benchmark_metrics = {'agent latency (seconds)': {},
|
||||
'agent throughput': {},
|
||||
'observer latency (seconds)': {},
|
||||
benchmark_metrics = {'agent latency (seconds)': {},
|
||||
'agent throughput': {},
|
||||
'observer latency (seconds)': {},
|
||||
'observer throughput': {}}
|
||||
|
||||
|
||||
|
@ -44,19 +44,19 @@ args = vars(args)
|
||||
|
||||
def run_worker(rank, world_size, master_addr, master_port, batch, state_size, nlayers, out_features, queue):
|
||||
r"""
|
||||
inits an rpc worker
|
||||
inits an rpc worker
|
||||
Args:
|
||||
rank (int): Rpc rank of worker machine
|
||||
world_size (int): Number of workers in rpc network (number of observers +
|
||||
1 agent + 1 coordinator)
|
||||
master_addr (str): Master address of cooridator
|
||||
master_port (str): Master port of coordinator
|
||||
batch (bool): Whether agent will use batching or process one observer
|
||||
batch (bool): Whether agent will use batching or process one observer
|
||||
request a at a time
|
||||
state_size (str): Numerical str representing state dimensions (ie: 5-15-10)
|
||||
nlayers (int): Number of layers in model
|
||||
out_features (int): Number of out features in model
|
||||
queue (SimpleQueue): SimpleQueue from torch.multiprocessing.get_context() for
|
||||
queue (SimpleQueue): SimpleQueue from torch.multiprocessing.get_context() for
|
||||
saving benchmark run results to
|
||||
"""
|
||||
state_size = list(map(int, state_size.split('-')))
|
||||
@ -82,9 +82,9 @@ def find_graph_variable(args):
|
||||
r"""
|
||||
Determines if user specified multiple entries for a single argument, in which case
|
||||
benchmark is run for each of these entries. Comma separated values in a given argument indicate multiple entries.
|
||||
Output is presented so that user can use plot repo to plot the results with each of the
|
||||
variable argument's entries on the x-axis. Args is modified in accordance with this.
|
||||
More than 1 argument with multiple entries is not permitted.
|
||||
Output is presented so that user can use plot repo to plot the results with each of the
|
||||
variable argument's entries on the x-axis. Args is modified in accordance with this.
|
||||
More than 1 argument with multiple entries is not permitted.
|
||||
Args:
|
||||
args (dict): Dictionary containing arguments passed by the user (and default arguments)
|
||||
"""
|
||||
@ -138,12 +138,12 @@ def print_benchmark_results(report):
|
||||
if x_axis_name:
|
||||
x_axis_output_label = f'{x_axis_name} |'
|
||||
heading += append_spaces(x_axis_output_label, col_width)
|
||||
metric_headers = ['agent latency (seconds)', 'agent throughput',
|
||||
metric_headers = ['agent latency (seconds)', 'agent throughput',
|
||||
'observer latency (seconds)', 'observer throughput']
|
||||
percentile_subheaders = ['p50', 'p75', 'p90', 'p95']
|
||||
subheading = ""
|
||||
if x_axis_name:
|
||||
subheading += append_spaces(' ' * (len(x_axis_output_label) - 1), col_width)
|
||||
subheading += append_spaces(' ' * (len(x_axis_output_label) - 1), col_width)
|
||||
for header in metric_headers:
|
||||
heading += append_spaces(header, col_width * len(percentile_subheaders))
|
||||
for percentile in percentile_subheaders:
|
||||
@ -163,7 +163,7 @@ def print_benchmark_results(report):
|
||||
|
||||
def main():
|
||||
r"""
|
||||
Runs rpc benchmark once if no argument has multiple entries, and otherwise once for each of the multiple entries.
|
||||
Runs rpc benchmark once if no argument has multiple entries, and otherwise once for each of the multiple entries.
|
||||
Multiple entries is indicated by comma separated values, and may only be done for a single argument.
|
||||
Results are printed as well as saved to output file. In case of multiple entries for a single argument,
|
||||
the plot repo can be used to benchmark results on the y axis with each entry on the x axis.
|
||||
@ -171,7 +171,7 @@ def main():
|
||||
find_graph_variable(args)
|
||||
|
||||
# run once if no x axis variables
|
||||
x_axis_variables = args[args['x_axis_name']] if args.get('x_axis_name') else [None]
|
||||
x_axis_variables = args[args['x_axis_name']] if args.get('x_axis_name') else [None]
|
||||
ctx = mp.get_context('spawn')
|
||||
queue = ctx.SimpleQueue()
|
||||
benchmark_runs = []
|
||||
@ -197,7 +197,7 @@ def main():
|
||||
print(f"Time taken benchmark run {i} -, {time.time() - start_time}")
|
||||
if args.get('x_axis_name'):
|
||||
# save x axis value was for this iteration in the results
|
||||
benchmark_run_results[args['x_axis_name']] = x_axis_variable
|
||||
benchmark_run_results[args['x_axis_name']] = x_axis_variable
|
||||
benchmark_runs.append(benchmark_run_results)
|
||||
|
||||
report = args
|
||||
|
@ -1,6 +1,6 @@
|
||||
# Fast RNN benchmarks
|
||||
|
||||
Benchmarks for TorchScript models
|
||||
Benchmarks for TorchScript models
|
||||
|
||||
For most stable results, do the following:
|
||||
- Set CPU Governor to performance mode (as opposed to energy save)
|
||||
@ -24,7 +24,7 @@ or run the test independently:
|
||||
|
||||
should give a good comparison, or you can specify the type of model to run
|
||||
|
||||
`python -m fastrnns.bench --rnns cudnn aten jit --group rnns`
|
||||
`python -m fastrnns.bench --rnns cudnn aten jit --group rnns`
|
||||
|
||||
## Run model profiling, calls nvprof
|
||||
|
||||
@ -33,7 +33,7 @@ should give a good comparison, or you can specify the type of model to run
|
||||
should generate nvprof file for all models somewhere.
|
||||
you can also specify the models to generate nvprof files separately:
|
||||
|
||||
`python -m fastrnns.profile --rnns aten jit`
|
||||
`python -m fastrnns.profile --rnns aten jit`
|
||||
|
||||
### Caveats
|
||||
|
||||
|
@ -1,16 +1,16 @@
|
||||
import operator_benchmark as op_bench
|
||||
import benchmark_caffe2 as op_bench_c2
|
||||
from benchmark_caffe2 import Caffe2BenchmarkBase # noqa
|
||||
from caffe2.python import core
|
||||
from caffe2.python import core
|
||||
|
||||
|
||||
"""Microbenchmarks for element-wise Add operator. Supports both Caffe2/PyTorch."""
|
||||
|
||||
# Configs for C2 add operator
|
||||
# Configs for C2 add operator
|
||||
add_long_configs = op_bench.cross_product_configs(
|
||||
M=[8, 64, 128],
|
||||
N=range(2, 10, 3),
|
||||
K=[2 ** x for x in range(0, 3)],
|
||||
K=[2 ** x for x in range(0, 3)],
|
||||
dtype=["int", "float"],
|
||||
tags=["long"]
|
||||
)
|
||||
@ -22,20 +22,20 @@ add_short_configs = op_bench.config_list(
|
||||
[16, 16, 64, "float"],
|
||||
[64, 64, 128, "int"],
|
||||
],
|
||||
attr_names=["M", "N", "K", "dtype"],
|
||||
tags=["short"],
|
||||
attr_names=["M", "N", "K", "dtype"],
|
||||
tags=["short"],
|
||||
)
|
||||
|
||||
class AddBenchmark(op_bench_c2.Caffe2BenchmarkBase):
|
||||
def init(self, M, N, K, dtype):
|
||||
self.input_one = self.tensor([M, N, K], dtype)
|
||||
self.input_two = self.tensor([M, N, K], dtype)
|
||||
def init(self, M, N, K, dtype):
|
||||
self.input_one = self.tensor([M, N, K], dtype)
|
||||
self.input_two = self.tensor([M, N, K], dtype)
|
||||
self.output = self.tensor([M, N, K], dtype)
|
||||
self.set_module_name("add")
|
||||
|
||||
def forward(self):
|
||||
op = core.CreateOperator(
|
||||
"Add", [self.input_one, self.input_two], self.output, **self.args
|
||||
"Add", [self.input_one, self.input_two], self.output, **self.args
|
||||
)
|
||||
return op
|
||||
|
||||
|
@ -2,7 +2,7 @@
|
||||
import operator_benchmark as op_bench
|
||||
import benchmark_caffe2 as op_bench_c2
|
||||
from benchmark_caffe2 import Caffe2BenchmarkBase # noqa
|
||||
from caffe2.python import core
|
||||
from caffe2.python import core
|
||||
|
||||
"""Microbenchmarks for MatMul operator"""
|
||||
|
||||
@ -10,7 +10,7 @@ from caffe2.python import core
|
||||
mm_long_configs = op_bench.cross_product_configs(
|
||||
M=[8, 64, 128],
|
||||
N=range(2, 10, 3),
|
||||
K=[2 ** x for x in range(0, 3)],
|
||||
K=[2 ** x for x in range(0, 3)],
|
||||
trans_a=[True, False],
|
||||
trans_b=[True, False],
|
||||
tags=["long"]
|
||||
@ -23,13 +23,13 @@ mm_short_configs = op_bench.config_list(
|
||||
[1024, 1024, 256, True, False],
|
||||
[8192, 8192, 1024, True, False],
|
||||
],
|
||||
attr_names=["M", "N", "K", "trans_a", "trans_b"],
|
||||
tags=["short"],
|
||||
attr_names=["M", "N", "K", "trans_a", "trans_b"],
|
||||
tags=["short"],
|
||||
)
|
||||
|
||||
|
||||
class MatMulBenchmark(op_bench_c2.Caffe2BenchmarkBase):
|
||||
def init(self, M, N, K, trans_a, trans_b):
|
||||
def init(self, M, N, K, trans_a, trans_b):
|
||||
self.input_one = self.tensor([N, M]) if trans_a else self.tensor([M, N])
|
||||
self.input_two = self.tensor([K, N]) if trans_b else self.tensor([N, K])
|
||||
self.args = {'trans_a': trans_a, 'trans_b': trans_b}
|
||||
@ -38,7 +38,7 @@ class MatMulBenchmark(op_bench_c2.Caffe2BenchmarkBase):
|
||||
|
||||
def forward(self):
|
||||
op = core.CreateOperator(
|
||||
"MatMul", [self.input_one, self.input_two], self.output, **self.args
|
||||
"MatMul", [self.input_one, self.input_two], self.output, **self.args
|
||||
)
|
||||
return op
|
||||
|
||||
|
@ -22,7 +22,7 @@ unary_ops_list = op_bench.op_list(
|
||||
|
||||
|
||||
class UnaryOpBenchmark(op_bench.TorchBenchmarkBase):
|
||||
def init(self, M, N, op_func):
|
||||
def init(self, M, N, op_func):
|
||||
self.input_one = torch.rand(M, N)
|
||||
self.op_func = op_func
|
||||
|
||||
|
@ -1,5 +1,5 @@
|
||||
import operator_benchmark as op_bench
|
||||
from caffe2.python import core
|
||||
from caffe2.python import core
|
||||
|
||||
|
||||
add_configs = op_bench.cross_product_configs(
|
||||
@ -11,24 +11,24 @@ add_configs = op_bench.cross_product_configs(
|
||||
)
|
||||
|
||||
class AddBenchmark(op_bench.Caffe2BenchmarkBase):
|
||||
def init(self, M, N, K, device):
|
||||
def init(self, M, N, K, device):
|
||||
self.set_module_name("add")
|
||||
self.input_one = self.tensor([M, N, K], device=device)
|
||||
self.input_two = self.tensor([M, N, K], device=device)
|
||||
self.input_one_grad = self.tensor([M, N, K], device=device)
|
||||
self.input_two_grad = self.tensor([M, N, K], device=device)
|
||||
self.input_one = self.tensor([M, N, K], device=device)
|
||||
self.input_two = self.tensor([M, N, K], device=device)
|
||||
self.input_one_grad = self.tensor([M, N, K], device=device)
|
||||
self.input_two_grad = self.tensor([M, N, K], device=device)
|
||||
self.output = self.tensor([M, N, K], device=device)
|
||||
|
||||
def forward(self):
|
||||
op = core.CreateOperator(
|
||||
"Add", [self.input_one, self.input_two], self.output, **self.args
|
||||
"Add", [self.input_one, self.input_two], self.output, **self.args
|
||||
)
|
||||
return op
|
||||
|
||||
def backward(self):
|
||||
grad_op = core.CreateOperator(
|
||||
"AddGradient", [self.output, self.input_one, self.input_two],
|
||||
[self.input_one_grad, self.input_two_grad], **self.args
|
||||
"AddGradient", [self.output, self.input_one, self.input_two],
|
||||
[self.input_one_grad, self.input_two_grad], **self.args
|
||||
)
|
||||
return grad_op
|
||||
|
||||
|
@ -5,8 +5,8 @@ intraop_bench_configs = op_bench.config_list(
|
||||
attrs=[
|
||||
[8, 16],
|
||||
],
|
||||
attr_names=["M", "N"],
|
||||
tags=["short"],
|
||||
attr_names=["M", "N"],
|
||||
tags=["short"],
|
||||
)
|
||||
|
||||
@torch.jit.script
|
||||
@ -24,9 +24,9 @@ class TorchSumBenchmark(op_bench.TorchBenchmarkBase):
|
||||
self.input_one = torch.rand(M, N)
|
||||
self.set_module_name("sum")
|
||||
|
||||
# This is a very temporary method and will be removed soon, so
|
||||
# This is a very temporary method and will be removed soon, so
|
||||
# don't use this method in your benchmark
|
||||
# TODO(mingzhe): use one forward method for both JIT and Eager
|
||||
# TODO(mingzhe): use one forward method for both JIT and Eager
|
||||
def jit_forward(self, iters):
|
||||
return torch_sumall(self.input_one, iters)
|
||||
|
||||
|
@ -10,9 +10,9 @@ add_configs = op_bench.cross_product_configs(
|
||||
)
|
||||
|
||||
# This benchmark uses the auto_set to automatically set requires_grad
|
||||
# for both inputs. The test name can also be used for filtering.
|
||||
# for both inputs. The test name can also be used for filtering.
|
||||
class AddBenchmark(op_bench.TorchBenchmarkBase):
|
||||
def init(self, M, N, K):
|
||||
def init(self, M, N, K):
|
||||
self.input_one = torch.rand(M, N, K, requires_grad=self.auto_set())
|
||||
self.input_two = torch.rand(M, N, K, requires_grad=self.auto_set())
|
||||
self.set_module_name("add")
|
||||
|
@ -4,7 +4,7 @@ import torch
|
||||
"""Microbenchmarks for element-wise Add operator. Supports both Caffe2/PyTorch."""
|
||||
|
||||
add_short_configs = op_bench.config_list(
|
||||
attr_names=['M', 'N', 'K'],
|
||||
attr_names=['M', 'N', 'K'],
|
||||
attrs=[
|
||||
[8, 16, 32],
|
||||
[16, 16, 64],
|
||||
@ -14,12 +14,12 @@ add_short_configs = op_bench.config_list(
|
||||
'device': ['cpu', 'cuda'],
|
||||
'dtype': [torch.float, torch.float64],
|
||||
},
|
||||
tags=['short'],
|
||||
tags=['short'],
|
||||
)
|
||||
|
||||
|
||||
class AddBenchmark(op_bench.TorchBenchmarkBase):
|
||||
def init(self, M, N, K, device, dtype):
|
||||
def init(self, M, N, K, device, dtype):
|
||||
self.input_one = torch.rand(M, N, K, device=device, dtype=dtype, requires_grad=True)
|
||||
self.input_two = torch.rand(M, N, K, device=device, dtype=dtype)
|
||||
self.set_module_name('add')
|
||||
|
@ -12,7 +12,7 @@ add_configs = op_bench.cross_product_configs(
|
||||
|
||||
|
||||
class AddBenchmark(op_bench.TorchBenchmarkBase):
|
||||
def init(self, M, N, K, device):
|
||||
def init(self, M, N, K, device):
|
||||
self.input_one = torch.rand(M, N, K, device=device, requires_grad=True)
|
||||
self.input_two = torch.rand(M, N, K, device=device, requires_grad=True)
|
||||
self.set_module_name("add")
|
||||
|
@ -7,19 +7,19 @@ configs = op_bench.random_sample_configs(
|
||||
N=[7, 8, 9, 10, 11, 12],
|
||||
K=[13, 14, 15, 16, 17, 18],
|
||||
# probs saves the weights of each value
|
||||
probs=op_bench.attr_probs(
|
||||
probs=op_bench.attr_probs(
|
||||
M=[0.5, 0.2, 0.1, 0.05, 0.03, 0.1],
|
||||
N=[0.1, 0.3, 0.4, 0.02, 0.03, 0.04],
|
||||
K=[0.03, 0.6, 0.04, 0.02, 0.03, 0.01],
|
||||
),
|
||||
# this is the number of returned inputs
|
||||
total_samples=10,
|
||||
# this is the number of returned inputs
|
||||
total_samples=10,
|
||||
tags=["short"],
|
||||
)
|
||||
|
||||
|
||||
class AddBenchmark(op_bench.TorchBenchmarkBase):
|
||||
def init(self, M, N, K):
|
||||
def init(self, M, N, K):
|
||||
self.input_one = torch.rand(M, N, K)
|
||||
self.input_two = torch.rand(M, N, K)
|
||||
self.set_module_name("add")
|
||||
|
@ -1,9 +1,9 @@
|
||||
# Sparse benchmarks
|
||||
|
||||
# These benchmarks are for the sparse matrix functionality.
|
||||
# These benchmarks are for the sparse matrix functionality.
|
||||
# They exist for comparing the performance of sparse matrix routines
|
||||
# torch.sparse.mm(sparse, sparse)` with different backends (CPU/CUDA)
|
||||
# and with other frameworks such as scipy.
|
||||
# and with other frameworks such as scipy.
|
||||
|
||||
import sys
|
||||
from scipy import sparse
|
||||
|
@ -3,9 +3,9 @@
|
||||
DATASET_ROOT_DIR=$HOME/datasets/
|
||||
|
||||
# wget https://storage.googleapis.com/sgk-sc2020/dlmc.tar.gz -P $DATASET_ROOT_DIR
|
||||
# tar -xvf $DATASET_ROOT_DIR/dlmc.tar.gz
|
||||
# tar -xvf $DATASET_ROOT_DIR/dlmc.tar.gz
|
||||
|
||||
echo "!! SPARSE SPMS TIME BENCHMARK!! "
|
||||
echo "!! SPARSE SPMS TIME BENCHMARK!! "
|
||||
|
||||
python matmul_dlmc_bench.py --path $DATASET_ROOT_DIR/dlmc/rn50 --dataset random_pruning --operation matmul --output /tmp/matmul_bench.pkl
|
||||
python matmul_dlmc_bench.py --path $DATASET_ROOT_DIR/dlmc/rn50 --dataset random_pruning --operation backward --output /tmp/backward_bench.pkl
|
||||
|
Some files were not shown because too many files have changed in this diff.
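All of the hunks shown here, and those in the files elided by the viewer, are instances of the same mechanical cleanup. As a rough illustration only (a minimal sketch, not the command that actually produced this commit), the per-file transformation amounts to the following:

```
#!/usr/bin/env python3
# Minimal sketch of the cleanup this commit applies: rewrite each file given on
# the command line with trailing spaces/tabs removed. This is an illustration,
# not the tool that generated the diff.
import sys


def strip_trailing_whitespace(path: str) -> bool:
    """Return True if `path` contained trailing whitespace and was rewritten."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    # line.rstrip() also removes the trailing newline, so re-append it afterwards.
    cleaned = [
        line.rstrip() + "\n" if line.endswith("\n") else line.rstrip()
        for line in lines
    ]
    if cleaned == lines:
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(cleaned)
    return True


if __name__ == "__main__":
    for path in sys.argv[1:]:
        if strip_trailing_whitespace(path):
            print(f"stripped trailing whitespace from {path}")
```

A CI check, by contrast, only needs to detect offending lines and fail; it does not have to rewrite them in place.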
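One aside on the generator-test hunks listed earlier: the property that TestRNGForking checks, namely that a generator's state can be frozen and later restored so the same samples are reproduced, looks roughly like this at the Python level (an illustrative sketch, not the C++ test itself):

```
import torch

# Sketch of the "freeze and restore" behavior the CPU/CUDA generator tests verify.
torch.manual_seed(0)

state = torch.get_rng_state()  # snapshot the default CPU generator's state
expected = torch.rand(3)       # draw some samples

torch.set_rng_state(state)     # restore the snapshot
forked = torch.rand(3)         # the identical samples are drawn again

assert torch.equal(expected, forked)
```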